Experimental Evaluation of Residential Demand Response in California
Abstract
We evaluate the causal effect of hour-ahead price interventions on the reduction of residential electricity consumption, using a large-scale experiment on 7,000 households in California. In addition to this experimental approach, we also develop a non-experimental framework that allows for an estimation of the desired treatment effect on an individual level by estimating user-level counterfactuals using time-series prediction. This approach crucially eliminates the need for a randomized experiment. Both approaches estimate a reduction of 0.10 kWh (11%) per Demand Response event and household. Using different incentive levels, we find a weak price elasticity of reduction. We also evaluate the effect of an adaptive targeting scheme, which discriminates users based on their estimated responses in order to increase the per-dollar reduction ratio by 30%. Lastly, we find that households with smart home automation devices reduce significantly more than households without, namely 0.28 kWh (37%).
I Introduction
This paper studies the causal effect of incentivizing residential households to participate in Demand Response (DR) to temporarily reduce electricity consumption. DR has been promoted by the introduction of demand-side management programs (DSM) after the 1970s energy crisis [1], enabled by the integration of information and communications technology in the electric grid. The rationale behind DSM is the inelasticity of energy supply due to the slowness of power plants’ output adjustment, which causes small increases and decreases in demand to result in a price boom or bust, respectively. Since utilities are obligated to provide end-users with electricity at a quasi-fixed tariff at all times [2], e.g. Time-of-Use pricing, they have to bear price risks. Therefore, DSM attempts to protect utilities against such price risks by partially relaying them to end-users, which increases market efficiency according to the economic consensus [3].
In 2015, the California Public Utilities Commission (CPUC) launched a Demand Response Auction Mechanism (DRAM) [4], requiring utilities to procure a certain amount of reduction capacity from DR providers. These aggregators incentivize their customers (also called “Proxy Demand Resource” (PDR) [5]) under contract to temporarily reduce their consumption relative to their projected usage without intervention, referred to in this context as counterfactual or baseline, based on which compensations for (non)fulfilled reductions are settled: If the consumer uses less (more) energy than the baseline, she receives a reward (incurs a penalty). Figure 1 illustrates the interactions between agents.
The estimation of the materialized reduction arguably is the most critical component of the DR bidding process. If the reductions are estimated with a biased counterfactual, either the DR provider or the utility clearing the bids is systematically discriminated against. If the baseline is unbiased but plagued by high variance, the profit settlement is highly volatile. Existing baselines employed by major power grid operators in the United States (e.g. California Independent System Operator (CAISO), New York ISO) are calculated with simple arithmetic averages of previous observations [5] and therefore are inaccurate. The estimation of more accurate baselines is a significant contribution of this paper.
I-A Contributions
We estimate the average treatment effect (ATE) of hour-ahead notifications on the reduction of electricity consumption by evaluating a Randomized Controlled Trial (RCT) on residential households in California serviced by the three main electric utilities (PG&E, SDG&E, SCE). This experiment, funded by the California Energy Commission, is to the best of our knowledge the first to test hour-ahead notifications at the residential household level. We estimate an ATE corresponding to a reduction of about 0.10 kWh per DR event and user, and further discover notable geographic and temporal heterogeneity among users, as the largest estimated reductions occur in summer months as well as in regions with warmer climate.
In addition to this experimental approach, we also develop a non-experimental method for estimating this causal effect on an individual user level, which is easily aggregated into an ATE. Importantly, we find that the identified ATEs in both cases are close to each other. Interestingly, the non-experimental approach even achieves tighter confidence intervals of the estimated causal effect. This suggests that our methodology is capable of identifying the causal impact of any intervention in settings with high-frequency data, thereby circumventing financial and ethical constraints which frequently arise in clinical trials, transportation, or education.
Lastly, we design an adaptive targeting method that exploits the heterogeneity in users’ responses to incentive signals by assigning differing price levels to different subsets of the treatment population. Specifically, we separate users based on their previous responses into two distinct groups, one of which receives only low incentives and the other only high incentives. This method increases the per-dollar yield by 30%.
This paper is structured as follows: Section II explains the experimental setup and provides summary statistics on the RCT data. We then develop the non-experimental estimation framework in Section III, where we pay particular attention to estimation bias and empirical debiasing methods (Section III-C). Non-experimental estimation results are provided in Section IV. Next, in Section V we estimate the ATE using a classical Fixed-Effects Estimator [6]. Section V-F compares the estimates obtained by both approaches. Lastly, the effect of adaptive targeting is discussed in Section VI. Section VII concludes. Additional figures and numeric data are relegated to the appendix.
I-B Related Work
Causal inference seeks to extrapolate the effect of interventions to general settings. However, many experiments are infeasible due to budget constraints or ethical concerns. With the rapid growth of collected user data, non-experimental estimates become more and more valuable, as the hope is to use such estimates in place of experiments. These facts have spurred research at the intersection of machine learning and economics, whose general idea is to partition observations into treatment and control sets in order to fit a nominal model on the latter, which, when applied to the treatment set, yields the treatment effect of interest.
Examples of such models are [7], which evaluates welfare effects of home automation by calculating the Kolmogorov-Smirnov statistic between users, and [8], which constructs a convex combination of US states as the counterfactual estimate for tobacco consumption to estimate the effect of a tobacco control program in California. In [9], the estimators are random forests trained by recursive partitioning of the feature space and novel cross-validation criteria. [10] develops Bayesian structural time series models combined with a Monte Carlo sampling method for treatment effect inference of market interventions.
Fitting an estimator on smart meter time series is essentially a short-term load forecasting (STLF) problem, whose goal is to fit estimators on observed data to predict future consumption with the highest possible accuracy. Within STLF, tools employed are ARIMA models with a seasonal component [11] and classic regression models where support vector regression (SVR) and neural networks yield the highest accuracy [12, 13]. A comprehensive comparison between ML techniques for forecasting and differing levels of load aggregation is provided in [14].
In the context of smart meter data mining, much of the existing work focuses on disaggregation of energy consumption to identify contributions of discrete appliances from the total observed consumption [15] and to learn consumption patterns [16, 17]. Studies in applied economics typically emphasize the estimation of ATEs of experimental interventions. To increase precision of the estimates, the regression models often employ unit-level fixed effects [18, 19], which is an implicit way of training models for the consumption of individual consumers. In this work, we make these user-level models explicit, allowing for more general ML techniques. Importantly, our approach is original as it permits causal inference on the level of individual treatment effects in a straightforward fashion by employing estimators from STLF. To the best of our knowledge, this paper is the first of its kind to analyze the potential of Demand Response interventions on a residential level, combining ideas at the intersection of causal inference from econometrics and Machine Learning for estimation.
II Experimental Setup and Data Characteristics
II-A Setup of the Experiment
The experiment is carried out by OhmConnect, Inc., using funds provided by the California Energy Commission. Figure 2 draws a flowchart of the experimental setup.
Over the course of the experimental time period (Nov. 2016 to Dec. 2017), each consumer that signs up for the study is randomly assigned to one of the following groups:

Treatment-Encouraged: The user receives an average number of 25 DR events in the 90 days following the signup, with incentive levels being randomly chosen from the set $\mathcal{R} = \{0.05, 0.25, 0.50, 1.00, 3.00\}$. Additionally, the user is given a rebate for purchasing a smart home automation device.

Treatment-Non-Encouraged: Same as in Treatment-Encouraged, but without smart home automation rebate.

Control: Users do not receive any notifications for DR events for a period of 90 days after signup.
These three groups form Phase 1 of the experiment. Users in the control group that have reached 90 days of age are removed from the study. Users in either the Treatment-Encouraged or Treatment-Non-Encouraged groups that have reached 90 days of age are pooled and systematically assigned to one of the following groups for Phase 2 interventions:

Targeted-High: The user receives an average number of 25 DR events for a period of 90 days after being rolled over into Phase 2. Each reward level is randomly drawn from a set of high incentive levels.

Targeted-Low: Same as in Targeted-High, but rewards are randomly drawn from a set of low incentive levels.

Non-Targeted: Same as in the targeted groups, but reward levels are not conditioned on the user's estimated Phase 1 response.
Users with completed Phase 2 are removed from the study. In Sections II to V, we evaluate Phase 1 of the experiment whereas Section VI is dedicated to adaptive targeting (Phase 2). In the remainder of this paper, we use the term “treatment users” to refer to users in the “Treatment-Encouraged” and “Treatment-Non-Encouraged” groups. Users are notified of a DR event and its incentive level up to 20 minutes into an hour; the event then lasts until the end of that hour.
II-B Summary Statistics
Table I reports the number of users by experiment group and the proportion of users for which we were able to scrape historical smart meter reading data. The table shows that the randomized assignment of users to groups roughly follows a 1:2:2 ratio (Control vs. Treatment-Encouraged vs. Treatment-Non-Encouraged).
Table I: Historical Smart Meter Data Availability by Group
Group  # Enrolled  # With Data  # With DR Events
Control  
Treatment-Encouraged  
Treatment-Non-Encouraged  
Users without DR events or for whom we were unable to scrape historical data are omitted from the study. Since the assignment of users into the different experimental groups was randomized (see Section II-D), dropping such users does not affect the evaluation of the experiment. Figure 3 shows the geographic distribution of the remaining users.
II-C Weather Data
Hourly measurements of ambient air temperature are scraped from the publicly accessible California Irrigation Management Information System [20]. As there are fewer weather stations than distinct user ZIP codes, we linearly interpolate user-specific temperatures at their ZIP codes from the two closest weather stations in latitude and longitude by calculating geodesic distances with Vincenty’s formulae [21].
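The interpolation step can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: distances use the haversine formula as a simpler spherical stand-in for Vincenty's formulae, the two nearest stations are combined with inverse-distance weights, and the function names and the `(lat, lon, temperature)` tuple layout are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (spherical stand-in for Vincenty's formulae)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def interpolate_temperature(zip_lat, zip_lon, stations):
    """Interpolate a temperature at a ZIP code centroid.

    stations: list of (lat, lon, temperature) tuples (hypothetical layout).
    Only the two stations closest to the ZIP code are used.
    """
    if len(stations) < 2:
        raise ValueError("need at least two weather stations")
    ranked = sorted(stations, key=lambda s: haversine_km(zip_lat, zip_lon, s[0], s[1]))
    (lat1, lon1, temp1), (lat2, lon2, temp2) = ranked[:2]
    d1 = haversine_km(zip_lat, zip_lon, lat1, lon1)
    d2 = haversine_km(zip_lat, zip_lon, lat2, lon2)
    if d1 + d2 == 0.0:
        # Both stations coincide with the ZIP code: fall back to the plain mean.
        return (temp1 + temp2) / 2
    # Linear interpolation: the closer station receives the larger weight.
    return (d2 * temp1 + d1 * temp2) / (d1 + d2)
```

A ZIP code located exactly at a station inherits that station's reading, and one halfway between two stations receives their mean.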
II-D Balance Checks
To verify that users were randomly assigned to control and treatment groups, we perform a balance check on the distribution of observed air temperatures and electricity consumptions across both groups. Notice that the relatively large sample size renders a classical difference-in-means test inappropriate, since with enough samples even practically negligible differences become statistically significant. Therefore, we utilize Cohen’s $d$ to estimate the effect size based on the differences between means, which is insensitive to the large sample size. Given two discrete distributions $P$ and $Q$ with sample sizes $n_P$ / $n_Q$ and means $\bar{x}_P$ / $\bar{x}_Q$, Cohen’s $d$ is defined as
$$d = \frac{\bar{x}_P - \bar{x}_Q}{s}, \qquad s = \sqrt{\frac{(n_P - 1)\, s_P^2 + (n_Q - 1)\, s_Q^2}{n_P + n_Q - 2}}, \tag{1}$$
where $s_P$ and $s_Q$ are the sample standard deviations for distributions $P$ and $Q$, respectively. In addition, we use the Hellinger Distance $H$ as a non-parametric comparison to quantify the similarity between the distributions [22]:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left(\sqrt{p_k} - \sqrt{q_k}\right)^2}, \tag{2}$$
where $P = \{p_1, \ldots, p_K\}$ and $Q = \{q_1, \ldots, q_K\}$. To compute (1) and (2), we discretize the temperature and consumption distributions appropriately. Table VI in the Appendix provides these metrics together with the differences in means for a selected subset of hours of the day, which was chosen to coincide with those hours of the day for which DR events were observed (see Figure 10). We omit the metrics for the remaining hours of the day as they are very similar to the listed ones. The Hellinger Distance satisfies $H \in [0, 1]$, with 0 corresponding to a perfect similarity and 1 to total dissimilarity. Based on the metrics in Table VI, we can assume that the assignment of users into treatment and control group is as good as random.
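Both balance metrics are straightforward to compute. The sketch below uses the standard pooled-standard-deviation form of Cohen's $d$ and the standard Hellinger distance on a shared set of bins; the equal-width binning helper `discretize` and all function names are illustrative assumptions, since the text does not specify how the distributions were discretized.

```python
import math
from statistics import mean, variance


def cohens_d(x, y):
    """Effect size: difference in means divided by the pooled standard deviation."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def discretize(samples, lo, hi, n_bins):
    """Normalized equal-width histogram; assumes all samples lie in [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for s in samples:
        k = min(int((s - lo) / width), n_bins - 1)  # put s == hi into the last bin
        counts[k] += 1
    return [c / len(samples) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two probability vectors defined on the same bins."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)
```

Identical histograms give a distance of 0 and histograms with disjoint support give 1, matching the interpretation of the range $[0, 1]$.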
III Non-experimental Treatment Effect Estimation
III-A Potential Outcomes Framework
To estimate the effect of the DR intervention program, we adopt the potential outcomes framework introduced by Rubin (1974) [23]. Let $\mathcal{I} = \{1, \ldots, n\}$ denote the set of users. The indicator $D_{it} \in \{0, 1\}$ encodes whether or not user $i$ received DR treatment at time $t$. Each user is endowed with a consumption time series $\mathbf{y}_i = \{y_{i1}, \ldots, y_{i\tau}\}$ and associated covariates $X_i = \{x_{i1}, \ldots, x_{i\tau}\}$, $x_{it} \in \mathcal{X} \subset \mathbb{R}^{n_x}$, where time is indexed by $t \in \{1, \ldots, \tau\}$ and $n_x$ is the dimension of the covariate space $\mathcal{X}$. Let $y^0_{it}$ and $y^1_{it}$ denote user $i$’s electricity consumption at time $t$ for $D_{it} = 0$ and $D_{it} = 1$, respectively. Let $\mathcal{C}_i$ and $\mathcal{T}_i$ denote the set of control and treatment times for user $i$. That is,
$$\mathcal{C}_i = \{t \mid D_{it} = 0\}, \qquad \mathcal{T}_i = \{t \mid D_{it} = 1\}. \tag{3}$$
The number of treatment hours is much smaller than the number of non-treatment hours. Thus $|\mathcal{T}_i| \ll |\mathcal{C}_i|$.
Further, let $\mathcal{D}_{i,t}$ and $\mathcal{D}_{i,c}$ denote user $i$’s covariate-outcome pairs of treatment and control times, respectively. That is,
$$\mathcal{D}_{i,t} = \{(x_{it}, y^1_{it}) \mid t \in \mathcal{T}_i\}, \qquad \mathcal{D}_{i,c} = \{(x_{it}, y^0_{it}) \mid t \in \mathcal{C}_i\}. \tag{4}$$
The one-sample estimate of the treatment effect on user $i$ at time $t$, given the covariates $x_{it}$, is
$$\beta_{it}(x_{it}) = y^1_{it}(x_{it}) - y^0_{it}(x_{it}), \tag{5}$$
which varies across time, the covariate space, and the user population. Marginalizing this one-sample estimate over the set of treatment times and the covariate space yields the user-specific Individual Treatment Effect (ITE)
$$\beta_i = \mathbb{E}_{x_{it} \in \mathcal{X}}\, \mathbb{E}_{t \in \mathcal{T}_i} \left[ y^1_{it}(x_{it}) - y^0_{it}(x_{it}) \right]. \tag{6}$$
The average treatment effect on the treated (ATT) follows from (6):
$$\mathrm{ATT} = \mathbb{E}_{i \in \mathcal{I}} \left[ \beta_i \right]. \tag{7}$$
Since users were put into different experimental groups in a randomized fashion, the ATT and the average treatment effect (ATE) are identical [24]. Lastly, the conditional average treatment effect (CATE) on $\tilde{x}$ is obtained by marginalizing the conditional distribution of one-sample estimates (5) on $\tilde{x}$ over all users and treatment times, where $\tilde{x}$ is a subvector of $x$:
$$\mathrm{CATE}(\tilde{x}) = \mathbb{E}_{i \in \mathcal{I}}\, \mathbb{E}_{t \in \mathcal{T}_i} \left[ y^1_{it} - y^0_{it} \mid \tilde{x} \right]. \tag{8}$$
The CATE captures heterogeneity among users, e.g. with respect to specific hours of the day, the geographic distribution of users, the extent to which a user possesses “smart home” appliances, group or peer effects, etc. To rule out the existence of unobserved factors that could influence the assignment mechanism generating the complete observed data set $\{\mathcal{D}_{i,t}, \mathcal{D}_{i,c}\}_{i \in \mathcal{I}}$, we make the following standard assumptions:
Assumption 1 (Unconfoundedness of Treatment Assignment).
Given the covariates $x_{it}$, the potential outcomes are independent of treatment assignment:
$$\{y^0_{it}, y^1_{it}\} \perp D_{it} \mid x_{it} \quad \forall\, i, t. \tag{9}$$
Assumption 2 (Stationarity of Potential Outcomes).
Given the covariates $x_{it}$, the potential outcomes are independent of time, that is,
$$\{y^0_{it}, y^1_{it}\} \perp t \mid x_{it} \quad \forall\, i, t. \tag{10}$$
Assumption 1 is the “ignorable treatment assignment” assumption introduced by Rosenbaum and Rubin [25]. Under this assumption, the assignment of DR treatment to users is implemented in a randomized fashion, which allows the calculation of unbiased ATEs (7) and CATEs (8). Assumption 2, motivated by the time-series nature of the observational data, ensures that the set of observable covariates can capture seasonality effects in the estimation of the potential outcomes. That is, the conditional distribution of the potential outcomes, given covariates, remains constant.
The fundamental problem of causal inference [26] refers to the fact that either the treatment or the control outcome can be observed, but never both (granted there are no missing observations). That is,
$$y_{it} = y^0_{it} + D_{it} \left( y^1_{it} - y^0_{it} \right). \tag{11}$$
Thus, the ITE (6) is not identified, because one and only one of both potential outcomes is observed, namely $y^1_{it}$ for the treatment times $t \in \mathcal{T}_i$ and $y^0_{it}$ for the control times $t \in \mathcal{C}_i$. It therefore becomes necessary to estimate counterfactuals.
III-B Non-Experimental Estimation of Counterfactuals
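Consider the following model for the estimation of such counterfactuals:
$$y_{it} = f_i(x_{it}) + D_{it}\, \beta_{it}(x_{it}) + \varepsilon_{it}, \tag{12}$$
where $y_{it}$ is user $i$’s observed consumption at time $t$, $x_{it}$ the associated covariate vector, $D_{it}$ the treatment indicator, $\beta_{it}$ the one-sample treatment effect (5), and $\varepsilon_{it}$ denotes noise uncorrelated with covariates and treatment assignment. $f_i(\cdot)$ is the conditional mean function and pertains to $D_{it} = 0$. To obtain an estimate for $f_i$, denoted with $\hat{f}_i$, the control outcomes (those observed at times with $D_{it} = 0$) are first regressed on their observable covariates. In a second step, the counterfactual $\hat{y}^0_{it}$ for any treatment time $t$ can be estimated by evaluating $\hat{f}_i$ on its associated covariate vector $x_{it}$. Finally, subtracting $\hat{y}^0_{it}$ from the observed treatment outcome $y^1_{it}$ isolates the one-sample estimate $\hat{\beta}_{it}$, from which the user-specific ITE (6) can be estimated. Figure 4 illustrates this process of estimating the reduction during a DR event by subtracting the actual consumption $y^1_{it}$ from the predicted counterfactual $\hat{y}^0_{it}$. Despite the fact that consumption can be predicted for horizons longer than a single hour, we restrict our estimators to a single hour prediction horizon as DR events are at most one hour long.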
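The three steps (fit on control hours, predict the counterfactual for DR hours, subtract) can be sketched for a single user as below. For brevity the sketch assumes a single scalar covariate and a hand-written least-squares line in place of the regression estimators considered in the paper; `fit_line` and `estimate_ite` are hypothetical names.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x on control observations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def estimate_ite(control, treatment):
    """Estimate one user's treatment effect from observational data.

    control, treatment: lists of (covariate, observed_kwh) pairs for non-DR
    and DR hours. Returns (ITE, list of one-sample estimates).
    """
    slope, intercept = fit_line([x for x, _ in control], [y for _, y in control])
    # One-sample estimate: observed treated outcome minus predicted counterfactual.
    effects = [y1 - (intercept + slope * x) for x, y1 in treatment]
    return mean(effects), effects
```

With this sign convention a negative ITE means the user consumed less during DR hours than the model predicts without intervention.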
To estimate , we use the following classical regression methods [27], referred to as estimators:

Ordinary Least Squares Regression (OLS)

L1 Regularized (LASSO) Linear Regression (L1)

L2 Regularized (Ridge) Linear Regression (L2)

$k$-Nearest Neighbors Regression (KNN)

Decision Tree Regression (DT)

Random Forest Regression (RF)
DT and RF follow the procedure of Classification and Regression Trees [28]. We compare these six estimators to the CAISO 10-in-10 Baseline (BL) [5], which, for any given hour on a weekday, is calculated as the mean of the hourly consumptions on the 10 most recent business days during the selected hour. For weekend days and holidays, the mean of the 4 most recent observations is calculated. This BL is further adjusted with a Load Point Adjustment, which corrects the BL by a factor proportional to the consumption three hours prior to a DR event [5].
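The unadjusted part of this baseline is simple enough to sketch. The code below follows the description above (mean over the 10 most recent business days, or the 4 most recent weekend days or holidays, at the same hour) and omits the Load Point Adjustment; the `history` dictionary layout and the function name are illustrative assumptions rather than CAISO's reference implementation.

```python
from datetime import date, timedelta
from statistics import mean


def caiso_baseline(history, event_day, hour, holidays=frozenset()):
    """Unadjusted 10-in-10 style baseline for one user.

    history: dict mapping (date, hour) -> consumption in kWh (hypothetical layout).
    Returns the mean consumption at `hour` over the most recent similar days.
    """
    def is_business(day):
        return day.weekday() < 5 and day not in holidays

    want_business = is_business(event_day)
    n_needed = 10 if want_business else 4
    oldest = min(day for day, _ in history)
    values = []
    day = event_day - timedelta(days=1)
    # Walk backwards in time until enough similar days have been collected.
    while len(values) < n_needed and day >= oldest:
        if is_business(day) == want_business and (day, hour) in history:
            values.append(history[(day, hour)])
        day -= timedelta(days=1)
    if not values:
        raise ValueError("no comparable historical observations")
    return mean(values)
```

Because the baseline is a plain average, it cannot react to temperature or recent consumption except through the adjustment factor that is left out here.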
Since users tend to exhibit a temporary increase in consumption in the hours following the DR intervention [1], we remove hourly observations following each DR event in order to prevent the regression estimators from learning from such spillover effects. This process is illustrated in Figure 5.
Hence the training data used to estimate the conditional mean function (12) consists of all observations leading up to a DR event, excluding those that are within 8 hours of any DR event. To estimate user $i$’s counterfactual outcome $\hat{y}^0_{it}$ during a DR event at time $t$, we use the following covariates:

5 hourly consumption values preceding time $t$

Air temperature at time $t$ and 4 preceding measurements

Hour of the day, an indicator variable for (non)business days, and month of the year as categorical variables
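Thus, the covariate vector reads
$$x_{it} = \left[\, \mathrm{C}(\mathrm{HoD}_{it}) : \mathrm{C}(\mathrm{is\_Bday}_{it}),\;\; \mathrm{C}(\mathrm{MoY}_{it}),\;\; y_{i,t-1}, \ldots, y_{i,t-5},\;\; T_{it}, \ldots, T_{i,t-4} \,\right]. \tag{13}$$
In (13), $T_{it}$ denotes temperature, $\mathrm{HoD}_{it}$ hour of day, $\mathrm{is\_Bday}_{it}$ an indicator variable for business days, and $\mathrm{MoY}_{it}$ the month of year (all for user $i$ at time $t$). “C” denotes dummy variables and “:” their interaction.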
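The covariate construction and the exclusion window can be sketched as follows. The dictionary layout, the function names, and the symmetric reading of "within 8 hours of any DR event" are assumptions made for illustration; holidays are ignored, and the categorical entries are returned as raw values rather than dummy-encoded.

```python
from datetime import datetime, timedelta


def build_covariates(consumption, temperature, timestamps, t, n_lags=5):
    """Covariates for hour index t from hourly, equally long series."""
    if t < n_lags:
        raise ValueError("not enough history for the lagged features")
    ts = timestamps[t]
    return {
        # Consumption in the 5 hours preceding t (most recent first).
        "lagged_kwh": [consumption[t - k] for k in range(1, n_lags + 1)],
        # Temperature at t and the 4 preceding hours.
        "temps": [temperature[t - k] for k in range(n_lags)],
        "hour_of_day": ts.hour,
        "is_business_day": ts.weekday() < 5,
        "month": ts.month,
    }


def training_indices(n_hours, event_indices, exclusion_hours=8, n_lags=5):
    """Hour indices usable for training: more than `exclusion_hours` from any DR event."""
    return [
        t for t in range(n_lags, n_hours)
        if all(abs(t - e) > exclusion_hours for e in event_indices)
    ]


# Usage example with synthetic hourly data starting on a Monday.
stamps = [datetime(2017, 7, 10) + timedelta(hours=h) for h in range(48)]
kwh = [float(h) for h in range(48)]
temp = [20.0 + h for h in range(48)]
example = build_covariates(kwh, temp, stamps, t=10)
```

In the usage example, `example["lagged_kwh"]` holds the readings for hours 9 down to 5 and `example["temps"]` those for hours 10 down to 6.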
III-C Placebo Treatments and Debiasing of Estimators
As previously mentioned, a crucial element of an estimator is unbiasedness. If an estimator systematically predicts counterfactuals that are too large (small), users receive an excess reward (are paid less) proportional to the amount of prediction bias. For a fair economic settlement, it is thus desirable to minimize the amount of bias. In our application, such prediction bias is caused by the following two factors:

Seasonal and temporal bias: Due to the experimental design, DR events for a particular user are concentrated within a period of 180 days after signing up. Further, DR events are called only in the afternoon and early evening (see Figure 10). Thus, fitting an estimator on all available historical data is likely to introduce bias during these time periods of interest.
To deal with these challenges, we use the debiasing procedure presented in Algorithm 1, which was first introduced in [29].
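Algorithm 1
Input: Treatment data $\mathcal{D}_{i,t}$, control data $\mathcal{D}_{i,c}$, estimator, weighting constant
1. Draw a placebo set from the control data according to the empirical distribution of the user's DR events by hour of day and month of year; the remaining control data form the training set.
2. Assign weights to the training samples according to (15a)-(15c).
3. Fit the estimator on the weighted training set.
4. Predict the counterfactuals of the placebo events and average the prediction errors to obtain a proxy of the estimation bias:
$$\hat{b}_i = \frac{1}{|\mathcal{P}_i|} \sum_{t \in \mathcal{P}_i} \left( \hat{y}^0_{it} - y^0_{it} \right), \tag{14}$$
where $\mathcal{P}_i$ denotes the placebo treatment times, $\hat{y}^0_{it}$ the predicted and $y^0_{it}$ the observed consumption.
5. Predict the counterfactuals of the actual DR events and subtract $\hat{b}_i$ from them.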
We first separate a subset of non-DR events from user $i$’s control data, which we call the placebo set, with associated placebo treatment times (we chose the placebo set to be of size 25). This placebo set is drawn according to user $i$’s empirical distribution of Phase 1 DR events by hour of day and month of year. Next, the non-experimental estimator of choice is fitted (using cross-validation to find hyperparameters to minimize the mean squared prediction error) on the training set, i.e. the control data that remain after removing the placebo set. Importantly, to account for temporal bias, we assign weights to the training samples, ensuring that samples in “similar” hours or seasons as actual DR events are assigned larger weights. Specifically, the weights are determined as follows:
(15a)  
(15b)  
(15c) 
where the weighting constant is to be chosen a priori.
Then, the fitted model is used to predict counterfactuals associated with placebo events. This yields a set of paired samples from which we can obtain a proxy of the estimation bias that remains even after assigning sample weights according to the previous step. Finally, to obtain an empirically debiased estimate of actual Phase 1 DR events, we simply subtract this proxy of the estimation bias from predicted Phase 1 DR event outcomes.
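The final two steps of the procedure reduce to a few lines. The sketch below assumes that the bias proxy is the mean prediction error on the placebo set, which is one natural reading of the description above; `predict` stands for any already fitted model, and all names are hypothetical.

```python
from statistics import mean


def debias_counterfactuals(predict, placebo, treatment_covariates):
    """Subtract the placebo prediction bias from predicted DR-hour counterfactuals.

    predict: fitted model, a callable mapping a covariate to predicted kWh.
    placebo: list of (covariate, observed_kwh) pairs from held-out non-DR hours.
    treatment_covariates: covariates of the actual DR hours.
    Returns (debiased counterfactuals, estimated bias).
    """
    # Positive bias means the model over-predicts consumption on non-DR hours.
    bias = mean([predict(x) - y for x, y in placebo])
    return [predict(x) - bias for x in treatment_covariates], bias
```

Because no intervention took place during placebo hours, any systematic gap between prediction and observation there can be attributed to the estimator rather than to a treatment effect.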
III-D Estimation of Individual Treatment Effects
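To obtain point estimates for user $i$’s ITE $\hat{\beta}_i$, we simply average all one-sample estimates (5) according to (6). To obtain an estimate of whether or not a given user has actually reduced consumption, we utilize a non-parametric permutation test with the null hypothesis of a zero ITE:
$$H_0: \beta_i = 0. \tag{16}$$
Given user $i$’s paired samples $\{(\hat{y}^0_{it}, y^1_{it})\}_{t \in \mathcal{T}_i}$ of predicted counterfactual and observed consumption during DR periods, the $p$-value associated with (16) is
$$p = \frac{1}{|\mathcal{S}_i|} \sum_{S \in \mathcal{S}_i} \mathbb{1}\left( \bar{S} \le \hat{\beta}_i \right). \tag{17}$$
In (17), $\bar{S}$ denotes the mean of $S$. $\mathcal{S}_i$ denotes the set of all possible assignments of signs to the pairwise differences in the set $\{y^1_{it} - \hat{y}^0_{it}\}_{t \in \mathcal{T}_i}$. That is,
$$\mathcal{S}_i = \left\{ \{ s_t\, (y^1_{it} - \hat{y}^0_{it}) \}_{t \in \mathcal{T}_i} \;\middle|\; s_t \in \{-1, +1\} \right\}, \tag{18}$$
which is of size $2^{|\mathcal{T}_i|}$. Finally, the $p$-value from (16) is calculated as the fraction of all possible assignments whose means are less than or equal to the estimated ITE $\hat{\beta}_i$. In practice, as the number of DR events per user in Phase 1 is about 25 (see Figure 2), enumerating all possible assignments becomes computationally infeasible. Thus, we randomly generate a subset of assignments from $\mathcal{S}_i$ to compute the $p$-value in (17). Moreover, we use the percentile bootstrap method [30] to compute a confidence interval of the estimated ITE for user $i$ around the point estimate $\hat{\beta}_i$.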
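A Monte Carlo version of the sign-flip test and a percentile bootstrap can be sketched as below. The number of random draws, the fixed seeds, and the function names are illustrative choices and not values taken from the paper; `differences` holds one user's observed-minus-counterfactual values for the DR hours.

```python
import random


def sign_flip_p_value(differences, n_draws=10000, seed=0):
    """Fraction of random sign assignments whose mean is <= the observed mean."""
    rng = random.Random(seed)
    n = len(differences)
    observed = sum(differences) / n
    hits = 0
    for _ in range(n_draws):
        # Under the null of zero effect, each difference is equally likely +/-.
        flipped = [d if rng.random() < 0.5 else -d for d in differences]
        if sum(flipped) / n <= observed:
            hits += 1
    return hits / n_draws


def percentile_bootstrap_ci(differences, level=0.95, n_draws=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    n = len(differences)
    means = sorted(sum(rng.choices(differences, k=n)) / n for _ in range(n_draws))
    lower = means[int((1 - level) / 2 * n_draws)]
    upper = means[min(int((1 + level) / 2 * n_draws), n_draws - 1)]
    return lower, upper
```

A small $p$-value indicates that the observed mean difference is more negative than almost all sign-randomized versions, i.e. the user is a significant reducer.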
IV Non-experimental Estimation Results
IV-A Average Treatment Effect
Figure 6 shows ATE point estimates and their 99% bootstrapped confidence intervals conditional on differing reward levels for all estimators as well as the CAISO BL. Due to the empirical debiasing procedure (see Section III-C), the point estimates of the six regression estimators are close to each other. BL appears to be biased in favor of the DR provider (DRP), as it systematically predicts smaller reductions than the regression estimators.
The ATE averaged over the predictions of the six regression estimators is kWh / . The intercept and the slope of the demand curve are kWh / kWh/USD, meaning that users reduce an additional 0.013 kWh per dollar offered, which is only a small change. Due to the idiosyncratic nature of the CATE for , the slope and intercept have to be interpreted with caution. However, the results indicate a notable correlation between incentive levels and reductions.
To compare the prediction accuracy of the estimators, Table II reports the width of the confidence intervals for each method and incentive level. The inferiority of the CAISO baseline compared to the non-experimental estimators, among which RF achieves the tightest confidence intervals, becomes apparent. Therefore, in the remainder of this paper, we restrict all results achieved with non-experimental estimators to those obtained with RF.
Table II: Width of CATE Confidence Intervals (kWh) by Incentive Level
Estimator  0.05  0.25  0.50  1.00  3.00
BL  
KNN  
OLS  
L1  
L2  
DT  
RF  0.0211  0.0210  0.0212  0.0211  0.0205
IV-B Individual Treatment Effects
Figure 7 plots ITEs for a randomly selected subset of 800 users who received at least 10 DR events in Phase 1, estimated with RF. Users are sorted by their point estimates (blue), whose 95% bootstrapped confidence intervals are drawn in black. Yellow lines represent users with at least one active smart home automation device. By marginalizing the point estimates over all users with at least 10 events, we obtain an ATE of kWh (11.4%), which is close to kWh as reported earlier. The difference ensues from only considering users with at least 10 DR events. The 99% ATE confidence interval is kWh.
Table III reports estimated ATEs for users with or without active smart home automation devices, which are obtained by aggregating the relevant estimated ITEs from Figure 7. We notice larger responses as well as a larger percentage of estimated reducers among automated users.
Table III: ATEs Conditional on Automation Status for Users with at Least 10 DR Events
Group  # Users  % Reducers  ATE (kWh)  ATE (%)
Automated  
Non-Automated  
All  
Table IV reports the percentage of significant reducers for different confidence levels, obtained with the permutation test under the null (16). From Tables III and IV, it becomes clear that automated users show larger reductions than non-automated ones, which agrees with expectations.
Fraction of Significant Reducers (among sample of size )  
# Automated  
% of Total  
# NonAutomated  
% of Total  
# All  
% of Total 
V ATE Estimation with Fixed Effects Models
To estimate the ATE of DR interventions on electricity consumption, we consider the following fixed-effects model with raw consumption (kWh) as the dependent variable:
$$y_{it} = X_{it}\, \theta + \alpha_{it} + u_{it}. \tag{19}$$
In (19), subscripts $i$ and $t$ refer to user $i$ at time $t$, respectively. $X_{it}$ is a row vector of observable covariates with coefficient vector $\theta$, $\alpha_{it}$ are unobserved fixed effects, and $u_{it}$ is the noise term which is assumed to be uncorrelated with the regressors and Gaussian distributed with zero mean and finite variance. The fixed effects term $\alpha_{it}$ removes persistent differences across users in their hourly and monthly consumption interacted with a business day indicator variable:
(20) 
V-A Estimation by Incentive Level
To estimate the CATE by incentive level, the covariate matrix in (19) is specified as follows:
(21a)  
(21b) 
In (21a), the treatment indicator is set to one for all treatment users (and zero for all control users). The CAISO baseline for user $i$ at time $t$ is included to control for the non-random assignment of reward levels to users. The remaining terms are the ambient air temperature and the reward level.
V-B Estimation by Hour of the Day
To estimate the CATE by hour of the day, we pool all reward levels into a single indicator variable, which is one if user $i$ received treatment at time $t$ and zero otherwise:
(22) 
V-C Estimation by Month of the Year
The CATE by month of the year is found in a similar fashion to the CATE by hour of the day:
(23) 
V-D Role of Smart Home Automation
The CATE by automation status is determined by introducing an automation indicator:
(24) 
V-E Effect of Automation Uptake Encouragement
Lastly, the effect of incentivizing users to purchase a smart home automation device on energy consumption during DR events is determined as follows:
(25)  
In (25), the two group indicators are one for all users in the “Treatment-Encouraged” and in the “Treatment-Non-Encouraged” group, respectively, and zero otherwise.
V-F Comparison of Estimation Methods
We now benchmark the results obtained from the best non-experimental estimator (RF) against those from the fixed effects model with specification (20).
Figure 8 compares the point CATEs by reward levels and their 95% confidence intervals. It can be seen that the point estimates are close to each other ( kWh aggregated for fixed effects vs. for non-experimental approach with RF, a less than difference), a finding that suggests that our non-experimental estimation technique produces reliable estimates comparable to the experimental gold standard. The fact that the confidence intervals are notably tighter for RF corroborates this notion.
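VI Effect of Adaptive Targeting
The goal of adaptive targeting is to maximize the reduction per dollar paid to the users, which is achieved by minimizing the payout and/or maximizing users’ reductions. We evaluate the reduction-per-reward ratios for the targeted and non-targeted groups by averaging the per-event reductions (5) normalized by the reward $r_{it}$:
$$\rho_{\mathrm{T}} = \frac{1}{N_{\mathrm{T}}} \sum_{i \in \mathcal{I}_{\mathrm{T}}} \sum_{t \in \mathcal{T}^{(2)}_i} \frac{\hat{y}^0_{it} - y^1_{it}}{r_{it}}, \tag{26a}$$
$$\rho_{\mathrm{N}} = \frac{1}{N_{\mathrm{N}}} \sum_{i \in \mathcal{I}_{\mathrm{N}}} \sum_{t \in \mathcal{T}^{(2)}_i} \frac{\hat{y}^0_{it} - y^1_{it}}{r_{it}}, \tag{26b}$$
where $\mathcal{I}_{\mathrm{T}}$ and $\mathcal{I}_{\mathrm{N}}$ are the sets of targeted and non-targeted users, $N_{\mathrm{T}}$ and $N_{\mathrm{N}}$ the total numbers of Phase 2 events in each group, $\hat{y}^0_{it} - y^1_{it}$ is the estimated reduction (predicted counterfactual minus observed consumption), and $\mathcal{T}^{(2)}_i$ denotes user $i$’s set of Phase 2 DR events.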
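Computing such a ratio is a one-liner once the per-event quantities are available. The sketch assumes that the average is pooled over all events of a group and that a reduction is the predicted counterfactual minus the observed consumption; the tuple layout and function names are hypothetical.

```python
def reduction_per_reward(events):
    """Average of per-event reductions divided by the reward offered.

    events: list of (predicted_counterfactual_kwh, observed_kwh, reward) tuples
    pooled over all Phase 2 events of one group.
    """
    ratios = [(y0_hat - y1) / reward for y0_hat, y1, reward in events]
    return sum(ratios) / len(ratios)


def relative_gain(targeted_events, non_targeted_events):
    """Relative increase of the targeted ratio over the non-targeted one."""
    return reduction_per_reward(targeted_events) / reduction_per_reward(non_targeted_events) - 1.0
```

For instance, a `relative_gain` of 0.30 would correspond to a 30% higher reduction per unit of reward in the targeted group.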
VI-A Targeting Assignment Algorithm
Algorithm 2 describes the targeting assignment for a given set of users.
Users are transitioned into Phase 2 on a weekly basis. That is, for a particular week, all users who have reached 90 days of age in Phase 1 form the current weekly cohort, which is randomly split into a non-targeted group and a targeted group of equal size (ties are broken randomly). For each user in the targeted group, we calculate the ITE based on Phase 1 events. These ITEs are then sorted in ascending order. The half with the largest reductions (the most negative ITEs) forms the low-targeted group, whereas the other half is assigned to the high-targeted group. This targeting scheme appears to be a double-edged sword: On the one hand, the DRP pays less money to large reducers and also achieves larger reductions for previously small reducers, increasing the desired ratio. On the other hand, previously large reducers now reduce less (in response to smaller rewards) and previously small reducers are paid more money for increased reductions, thereby counteracting the desired goal. However, the latter factors are dominated by the gains from the former ones, as we show in Section VI-C.
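The weekly assignment can be sketched as follows. Group labels, the seeded shuffle, and the handling of odd cohort sizes (the extra user ends up in the targeted, high-incentive group) are illustrative assumptions; the text only states that ties are broken randomly.

```python
import random


def assign_cohort(ites, seed=0):
    """Assign one weekly cohort to Phase 2 groups.

    ites: dict mapping user id -> estimated Phase 1 ITE (negative = reduction).
    Returns a dict mapping user id -> group label.
    """
    rng = random.Random(seed)
    users = list(ites)
    rng.shuffle(users)  # random split into non-targeted and targeted halves
    half = len(users) // 2
    non_targeted, targeted = users[:half], users[half:]
    # Ascending ITE: the largest reducers (most negative ITEs) come first.
    targeted.sort(key=lambda u: ites[u])
    n_low = len(targeted) // 2
    groups = {u: "non_targeted" for u in non_targeted}
    groups.update({u: "targeted_low" for u in targeted[:n_low]})
    groups.update({u: "targeted_high" for u in targeted[n_low:]})
    return groups
```

Large reducers thus receive low incentives and small reducers high incentives, which is the mechanism whose net effect is evaluated in the results.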
VI-B Validation of Adaptive Targeting
Algorithm 2 exploits the fact that users are relatively price inelastic (indeed Figure 8 only shows a weak negative slope of the demand curve) to assign large incentives to low reducers (and small incentives to high reducers) to minimize the total payout from DRP to users. The attentive reader might wonder why the estimated ITE was chosen as the targeting criterion rather than any other criterion. Indeed, we performed a targeting exercise to determine which criterion is most suitable for maximizing (26a). The idea is to assign users into one of two targeted groups, based on one of the following criteria estimated from Phase 1 responses:

ITE

ITE normalized by average reward level received

Intercept of estimated individual demand curve

Slope of estimated individual demand curve
Each of these four criteria is computed in kWh and % values, and after a sufficient number of iterations it was determined that the ITE indeed is the criterion that maximizes (26a).
VI-C Results of Adaptive Targeting
Table V reports the targeting metrics together with the CATE by treatment group as well as the number of observations for targeted and non-targeted users. We restrict our analysis to samples obtained after June 27, 2017, as we observe larger effects of targeting in summer months.
Table V: Targeting Metrics for Phase 2
Estimator  
BL  
RF  
RF predicts a difference of , or an increase of about 30% compared to the non-targeted scheme. For BL, this increase is smaller (15%). However, due to the biasedness of the BL (see Figure 6), the RF estimate is more reliable. We can observe the trade-off between smaller reductions (indeed RF predicts a CATE for targeted users that is 34% smaller compared to non-targeted users) and a reduced average payout, which decreases by 85% (not reported in Table V). The latter effect dominates the decrease in net reductions, resulting in the 30% increase of the reduction per reward ratio (26a).
VII Conclusion
We analyzed Residential Demand Response as a human-in-the-loop cyber-physical system that incentivizes users to curtail electricity consumption during designated hours. Utilizing data collected from a Randomized Controlled Trial funded by the CEC and conducted by a Demand Response provider in the San Francisco Bay Area, we estimated the causal effect of hour-ahead price interventions on electricity reduction. To the best of our knowledge, this is the first major study to investigate DR on such short time scales.
We developed a non-experimental estimation framework and benchmarked its estimates against those obtained from an experimental Fixed-Effects Linear Regression Model. Importantly, the former does not depend on the existence of an experimental control group to construct the counterfactuals that are necessary to estimate the treatment effect. Instead, we employ off-the-shelf regression models to learn a consumption model on non-DR periods, which can then be used to predict counterfactuals during DR hours of interest. We find that the estimated treatment effects from both approaches are close to each other. The estimated ATE corresponds to a reduction of 0.10 kWh (11%) per Demand Response event and household. Further, we observe a weak positive correlation between the incentive level and the estimated reductions, suggesting that users are only weakly elastic in response to incentives.
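The counterfactual idea summarized above can be illustrated with a small sketch. It is a simplified stand-in rather than the framework used in the paper: a single-covariate least-squares line replaces the off-the-shelf regression models, and the function names as well as the choice of ambient temperature as the only covariate are assumptions made for the example.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single covariate)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def estimate_ite(non_dr, dr):
    """Estimate one user's treatment effect from (temperature, kWh) pairs.

    non_dr: observations outside DR events, used to learn the consumption model.
    dr:     observations during DR events, for which counterfactuals are predicted.
    """
    a, b = fit_line([t for t, _ in non_dr], [k for _, k in non_dr])
    # Counterfactual: predicted consumption had no DR event been called.
    effects = [kwh - (a + b * temp) for temp, kwh in dr]
    return mean(effects)  # a negative value indicates a reduction
```

With synthetic non-DR observations lying on the line kWh = 0.05 x temperature and DR-hour consumption 0.1 kWh below that line, `estimate_ite` returns approximately -0.1. No control group enters the computation, which is the point of the non-experimental approach.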
The fact that the estimates obtained from both approaches are close to each other is encouraging, as our non-experimental framework goes a step further than the experimental method in that it allows for the estimation of individual treatment effects. From an economic perspective, being able to differentiate low from high responders allows for an adaptive targeting scheme, whose goal is to minimize the total payout to users while maximizing total reductions. We utilize this fact to achieve a 30% increase in the reduction-per-reward ratio.
Lastly, we emphasize that the non-experimental estimation framework presented in this paper has the potential to generalize to similar human-in-the-loop cyber-physical systems that require the incentivization of users to achieve a desired objective. This is because the framework, whose techniques are general rather than specific to Demand Response, yields results at the individual user level, which could be of particular interest for incentivizing users in transportation or financial systems.
Future work includes the analysis of adversarial user behavior (baseline gaming) and of more advanced effects, such as peer and network effects, that influence Demand Response. We also intend to investigate the effect of moral suasion and other non-monetary incentives on the reduction of electricity consumption of residential households.
References
 [1] P. Palensky and D. Dietrich, “Demand Side Management: Demand Response, Intelligent Energy Systems, and Smart Loads,” IEEE Transactions on Industrial Informatics, vol. 7, no. 3, pp. 381–388, 2011.
 [2] Federal Energy Regulatory Commission, “Assessment of Demand Response and Advanced Metering,” Tech. Rep., 2016.
 [3] S. Borenstein, “The LongRun Efficiency of RealTime Electricity Pricing,” The Energy Journal, 2005.
 [4] “Public Utilities Commission of the State of California: Resolution E4728. Approval with Modifications to the Joint Utility Proposal for a Demand Response Auction Mechanism Pilot,” July 2015.
 [5] “California Independent System Operator Corporation (CAISO): Fifth Replacement FERC Electric Tariff,” 2014.
 [6] P. J. Diggle, P. Heagarty, K.Y. Liang, and S. L. Zeger, Analysis of Longitudinal Data. Oxford University Press, 2013, vol. 2.
 [7] B. Bollinger and W. R. Hartmann, “Welfare Effects of Home Automation Technology with Dynamic Pricing,” Stanford University, Graduate School of Business Research Papers, 2015.
 [8] A. Abadie, A. Diamond, and J. Hainmueller, “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program,” Journal of the American Statistical Association, vol. 105, no. 490, pp. 493–505, 2012.
 [9] S. Athey and G. W. Imbens, “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 27, pp. 7353–7360, 2016.
 [10] K. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S. Scott, “Inferring Causal Impact Using Bayesian Structural TimeSeries Models,” The Annals of Applied Statistics, vol. 9, no. 1, pp. 247–274, 2015.
 [11] J. W. Taylor and P. E. Sharry, “ShortTerm Load Forecasting Methods: An Evaluation Based on European Data,” IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 2213–2219, 2007.
 [12] T. Senjyu, H. Takara, K. Uezato, and T. Funabashi, “OneHourAhead Load Forecasting Using Neural Network,” IEEE Transactions on Power Systems, vol. 17, no. 1, pp. 113–118, 2002.
 [13] E. E. Elattar, J. Goulermas, and Q. H. Wu, “Electric Load Forecasting Based on Locally Weighted Support Vector Regression,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 40, no. 4, 2010.
 [14] P. Mirowski, S. Chen, T. K. Ho, and C.N. Yu, “Demand Forecasting in Smart Grids,” Bell Labs Technical Journal, 2014.
 [15] F. Chen, J. Dai, B. Wang, S. Sahu, M. Naphade, and C.T. Lu, “Activity Analysis Based on Low Sample Rate Smart Meters,” Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 240–248, 2011.
 [16] A. MolinaMarkham, P. Shenoy, K. Fu, E. Cecchet, and D. Irwin, “Private Memoirs of a Smart Meter,” Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for EnergyEfficiency in Building, pp. 61–66, 2010.
 [17] D. Zhou, M. Balandat, and C. Tomlin, “A Bayesian Perspective on Residential Demand Response Using Smart Meter Data,” 54th Allerton Conference on Communication, Control, and Computing, 2016.
 [18] H. Allcott, “Rethinking RealTime Electricity Pricing,” Resource and Energy Economics, vol. 33, no. 4, pp. 820–842, 2011.
 [19] K. K. Jessoe, D. L. Miller, and D. S. Rapson, “Can HighFrequency Data and NonExperimental Research Designs Recover Causal Effects?” Working Paper, 2015.
 [20] “California Irrigation Management Information System,” 2017.
 [21] T. Vincenty, “Geodetic Inverse Solution Between Antipodal Points,” Tech. Rep., 1975.
 [22] M. S. Nikulin, “Hellinger distance,” "http://www.encyclopediaofmath.org/index.php?title=Hellinger_distance&oldid=16453".
 [23] D. B. Rubin, “Estimating Causal Effects of Treatments in Randomized and NonRandomized Studies,” Journal of Educational Psychology, vol. 66, no. 5, pp. 688–701, 1974.
 [24] J.S. Pischke and J. D. Angrist, Mostly Harmless Econometrics, 1st ed. Princeton University Press, 2009.
 [25] P. R. Rosenbaum and D. B. Rubin, “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, vol. 70, no. 1, pp. 41–55, 1983.
 [26] P. W. Holland, “Statistics and Causal Inference,” Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
 [27] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York, 2009.
 [28] L. Breiman, J. Friedman, C. Stone, and R. A. Olshen, “Classification and Regression Trees,” CRC Press, 1984.
 [29] M. Balandat, “New Tools for Econometric Analysis of HighFrequency Time Series Data  Application to DemandSide Management in Electricity Markets,” University of California, Berkeley, PhD Dissertation, 2016.
 [30] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
Appendix
A Summary Statistics
B Balance Checks
Balance Metrics for Control and Treatment Group (columns: Cohen's D, Hellinger Dist., Diff. Mean; rows: eight kWh covariates, eight air_temp covariates, and historical obs. (hours))
C Fixed Effects Regression Tables
Tables VII-IX provide the results of the Fixed Effects Regressions presented in Section V. The point estimates of interest are printed in boldface and are accompanied by the standard errors as well as their 95% confidence intervals. The t-value of each regression coefficient gives rise to the p-value, and markers denote statistical significance at the respective confidence levels.
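As a reminder of how the reported quantities relate to each other, the sketch below derives a t-value, a two-sided p-value, and a confidence interval from a point estimate and its standard error. It assumes a standard normal reference distribution for simplicity; the regression tables may rely on a different reference distribution, so the sketch is not meant to reproduce their entries. The function name `summarize` and the example numbers are hypothetical.

```python
import math

def summarize(estimate, std_err, z_crit=1.96):
    """Return (t_value, two-sided p-value, confidence interval).

    Uses a standard normal reference distribution as a simplifying assumption;
    z_crit = 1.96 corresponds to a 95% interval under that assumption.
    """
    t_value = estimate / std_err
    p_value = math.erfc(abs(t_value) / math.sqrt(2.0))  # P(|Z| >= |t|)
    ci = (estimate - z_crit * std_err, estimate + z_crit * std_err)
    return t_value, p_value, ci
```

For a hypothetical estimate of -0.196 with standard error 0.1, the t-value is -1.96, the p-value is close to 0.05, and the upper end of the interval sits at approximately zero, which mirrors the borderline rows in the tables.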
Effect of DR by Incentive Level on Electricity Consumption  
Parameter  t-Value  95% Conf. Int.  p-value  
  2.100  [0.013, 0]  0.047  
  88.89  [0.859, 0.900]  0.001  
  10.79  [0.017, 0.024]  0.001  
  8.532  [0.148, 0.091]  0.001  
  6.910  [0.157, 0.085]  0.001  
  7.369  [0.147, 0.083]  0.001  
  6.219  [0.166, 0.083]  0.001  
  12.95  [0.157, 0.114]  0.001  
Effect of DR by Month of Year on Electricity Consumption  
Parameter  t-Value  95% Conf. Int.  p-value  
  1.962  [0.014, 0.001]  0.078  
  55.52  [0.844, 0.915]  0.001  
  3.326  [0.007, 0.034]  0.008  
  4.298  [0.063, 0.020]  0.002  
  2.571  [0.041, 0.003]  0.028  
  18.62  [0.085, 0.067]  0.002  
  15.14  [0.071, 0.053]  0.001  
  20.07  [0.104, 0.083]  0.001  
  20.25  [0.172, 0.138]  0.001  
  32.68  [0.242, 0.211]  0.001  
  19.39  [0.177, 0.141]  0.001  
  9.071  [0.218, 0.142]  0.001  
  3.055  [0.050, 0.008]  0.012  
  2.172  [0.045, 0.001]  0.055  
Effect of Home Automation on Electricity Consumption  
Parameter  t-Value  95% Conf. Int.  p-value  
  2.101  [0.013, 0]  0.047  
  88.94  [0.859, 0.900]  0.001  
  10.79  [0.017, 0.024]  0.001  
  7.800  [0.418, 0.243]  0.001  
  7.310  [0.132, 0.074]  0.001  
Effect of Automation Uptake Incentive on Electricity Consumption  
Parameter  t-Value  95% Conf. Int.  p-value  
  1.422  [0.012, 0.002]  0.168  
  2.485  [0.015, 0.001]  0.021  
  38.38  [0.886, 0.987]  0.001  
  10.794  [0.017, 0.024]  0.001  
  7.703  [0.153, 0.088]  0.001  
  8.304  [0.156, 0.094]  0.001  
D Comparison of Estimation Methods
Figure 8 visually compares the ATEs broken out by incentive level, and it can be seen that both methods produce similar estimates. Figure 13 does the same for month of the year. Agreeing with intuition, the reductions are notably larger in summer months than in winter periods. Conditional on the automation status, the reductions that Table IX reports for automated and non-automated users are close to those calculated in the non-experimental case. Lastly, no significant difference in the magnitude of reductions can be found between encouraged and non-encouraged users.
E Correlation of Temperature and ITE
As mentioned in the previous subsection, larger reductions are estimated in warm summer months. To test whether such a correlation exists, Figure 14 plots the estimated ITEs against the average ambient air temperature observed during the relevant DR events. We observe a notable positive correlation between ambient air temperature and the magnitude of reductions. Indeed, a subsequent hypothesis test rejects the null hypothesis of a zero slope.
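A test of this kind can be sketched as a simple regression of reduction magnitudes on temperature. The code below is illustrative only: it assumes a plain least-squares fit with homoskedastic errors, which need not match the test used for Figure 14, and the function name and input lists are hypothetical.

```python
import math

def slope_t_statistic(temps, reductions):
    """OLS slope of reductions on temperature and its t-statistic (H0: slope = 0)."""
    n = len(temps)
    x_bar = sum(temps) / n
    y_bar = sum(reductions) / n
    sxx = sum((x - x_bar) ** 2 for x in temps)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, reductions))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(temps, reductions))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err
```

A large positive t-statistic indicates that the data are hard to reconcile with a zero slope; converting it to a p-value would additionally require the t distribution with n - 2 degrees of freedom.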
To support this notion, we marginalize ITEs for each ZIP code to obtain the geographic distribution of CATEs by location (see Figure 15). It is visually striking that users in coastal areas of California show smaller reductions than users in the Central Valley, where the climate is hotter.
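Marginalizing ITEs by location amounts to averaging them within each ZIP code. A minimal sketch follows, assuming the estimates are available as (ZIP code, ITE) pairs; the function name and the labels used in the example are hypothetical and do not correspond to actual ZIP codes in the study.

```python
from collections import defaultdict
from statistics import mean

def cate_by_zip(records):
    """Average ITEs within each ZIP code.

    records: iterable of (zip_code, ite_kwh) pairs, one per user.
    """
    groups = defaultdict(list)
    for zip_code, ite in records:
        groups[zip_code].append(ite)
    return {zip_code: mean(ites) for zip_code, ites in groups.items()}
```

For example, two users labeled `coastal_zip` with ITEs of -0.05 and -0.07 kWh and one user labeled `valley_zip` with an ITE of -0.20 kWh yield averages of about -0.06 and -0.20 kWh, respectively.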