# Optimal Provision of Regulation Service Reserves Under Dynamic Energy Service Preferences

Bowen Zhang, Michael C. Caramanis, and John Baillieul. B. Zhang is a visiting scholar at the Department of Mechanical Engineering, Boston University, Boston, MA, 02215 USA, e-mail: bowenz@bu.edu. M. C. Caramanis is with the Department of Mechanical Engineering and the Division of Systems Engineering, Boston University, Boston, MA, 02215 USA, e-mail: mcaraman@bu.edu. J. Baillieul is with the Department of Electrical and Computer Engineering, the Department of Mechanical Engineering, and the Division of Systems Engineering, Boston University, Boston, MA, 02215 USA, e-mail: johnb@bu.edu. Research reported here was supported by NSF grant 1038230. Manuscript received ********; revised ********.
###### Abstract

We propose and solve a stochastic dynamic programming (DP) problem addressing the optimal provision of bidirectional regulation service reserves (RSR) by pools of duty cycle appliances. Using our previously introduced concept of packetized energy, the approach (i) models the dynamics of the utility that each consumer associates with a block/packet of energy when the prevailing temperature is in the appliance specific comfort zone, and (ii) derives a dynamic pricing policy that closes the loop between time varying appliance specific utilities and a regulation signal broadcast by the Independent System Operator (ISO) every 4 seconds. Since the cooling zone temperature evolves dynamically as a function of past energy block purchases and heat losses, appliance utilities and their ensemble distribution are time varying. Consumer utility levels are modeled by the probability distribution of utility levels across the ensemble of all idle appliances at each point in time. This probability distribution provides a statistic of the detailed appliance specific utility information which is sufficient for the determination of an optimal pricing policy. Moreover, it can capture appliance specific diversity within comfort zone limits and temperature-utility function relationships. Dynamic pricing policies investigated in the literature assume that this ensemble probability distribution is time invariant and in fact uniform. A contribution of the work that follows is the replacement of this unrealistic simplifying assumption with the introduction of a new state variable modeling the aforementioned dynamic probability density function. The new state variable evolves as a function of the dynamic pricing policy, which aims to track the centrally broadcast regulation signal, trading off optimally between tracking error and appliance energy consumption utility. We specifically model active and idle appliances by means of a closed queuing system with price-controlled transitions from the idle to the active queue.
Empirical evidence indicates that the dynamically changing frequency distribution of appliance specific utilities is closely approximated by a trapezoid-shaped distribution function characterized by a single dynamic parameter, which is in fact the aforementioned sufficient statistic. The optimal provision of RSR is addressed within a DP framework where we (i) derive an analytic characterization of the optimal pricing policy and the differential cost function, and (ii) prove optimal policy monotonicity and value function convexity. These properties enable us to implement a smart assisted value iteration (AVI) algorithm and an approximate DP (ADP) algorithm that exploit related functional approximations. Numerical results demonstrate (1) the validity of the solution techniques and the computational advantage of the proposed ADP on realistic, large-state-space problems, and (2) that RSR provision based on the optimal control that respects dynamic consumer preferences has higher tracking accuracy relative to approaches assuming static, uniformly distributed consumer preferences.

Approximate dynamic programming, smart grid, electricity demand response, regulation service reserves, demand response under dynamic preferences.

## I Introduction

The urgently needed reduction of emissions will rely on the adoption of significant renewable electricity generation, whose volatility and intermittency will result in a commensurate increase in the provision of secondary or Regulation Service Reserve (RSR) required to ensure power system stability through adequate Frequency Control (FC) and Area Control Error (ACE) management [1], [2], [3], [4]. The provision of these additional RSRs from centralized generation units, which until now has been the primary approach, is expensive and poses a major challenge to massive renewable integration. The option of employing much lower cost demand-control-based RSR is rightly a major Federal Energy Regulatory Commission target [5]. Not surprisingly, there is extensive literature on demand management addressing, for example, direct load control (DLC) of thermostatic appliances to provide frequency reserve or load shedding [6], [7], [8], [9], [10], [11], [12], [13], the optimal coordination of distributed resources and flexible loads in a micro-grid [14], [4], [15], and real time pricing approaches to reduce peak demand [16], [17], [18]. In contrast to DLC, where consumers transfer the control of powering their appliances to a centralized decision maker, dynamic price directed demand response by individual consumers can achieve similar system benefits with lower consumer utility loss. The PNNL Olympic Peninsula project [19] is a successful pilot study where consumer specified price elasticities have been used to control thermostatic appliances by dynamic prices that proved superior to fixed or peak/off-peak prices.

Customer acceptance of price directed demand response, however, requires that customer preferences are modeled accurately and that dynamic prices are adapted to these preferences. In this paper we focus on the deployment of bidirectional regulation service reserves (RSR) by a pool of duty cycle cooling appliances in a smart building or neighborhood microgrid. Based on up-to-date consumer preferences and the current price signal, the microgrid operator will purchase and allocate a block or packet of energy at the beginning of each cycle. Dynamic prices encourage (or discourage) idle appliances to purchase a packet of energy and start a cycle. Two distinct time scales, one hour and 4 seconds, are involved in promising and deploying RSRs. A smart building promises to the Independent System Operator (ISO) a certain quantity of up and down RSR in the hour ahead market and is compensated at the hour ahead market's RSR clearing price. During this hour, however, the smart building must deploy these reserves to track the RSR signal that is rebroadcast by the ISO at 4 second intervals, taking values in the interval from -100% to +100%. Optimal deployment of the RSRs at the 4 second time scale is defined as the deployment driven by dynamic prices decided at the same 4 second time scale by the smart building operator, so as to maximize the average utility of cooling appliances over the hour. The smart building operator cannot anticipate the ISO RSR signal, but it has access to a statistical model describing its stochastic dynamics during the relevant hour [7], [20].

To obtain the optimal dynamic pricing policy, we propose and solve an infinite horizon discounted cost stochastic dynamic programming (DP) problem. The hour long horizon is considered for all intents and purposes to be infinite relative to the 4 second time scale characterizing the regulation signal and price updates. State variables consist of the RSR signal level y(t), the direction D(t) of the RSR signal defined as the sign of its most recent change, the number of connected appliances i(t), and the sufficient statistic characterizing the probability distribution of utility levels. More specifically, we model the dynamics of the utility that each space conditioning appliance enjoys from the consumption of a block/packet of energy purchased when the prevailing temperature is at a specific level within the appliance specific temperature comfort zone [Tmin, Tmax]. The utility at time t is modeled as an affine function of the temperature prevailing at time t. The utility of purchasing a packet of energy by a cooling appliance will be low when its temperature is close to Tmin, and high when it is close to Tmax. To model diversity among appliances, the utility function coefficients are appliance specific independent random variables, and the price is normalized by the sensitivity coefficient b. The temperature at a specific appliance evolves dynamically as a function of that appliance's past trajectory of energy block purchases and heat losses, and hence the appliance utility and its ensemble distribution are time varying. The optimal pricing policy, however, does not require individual appliance utility levels at time t. The sufficient information is captured by the probability distribution of utility levels across the ensemble of all idle appliances at time t. This probability distribution is therefore the sufficient statistic that replaces appliance specific utilities in the stochastic DP state vector.

The principal contribution in what follows is a modeling extension in which the system state is a time varying probability density function that we argue provides important information about the dynamics of aggregated effects of individual consumer utility functions. This represents a significant improvement over the more traditional models (e.g. [7], [21], [20]) that assume a virtually infinite pool of homogeneous users and appliances whose preferences are static and independent of the recent history of price control signals. Our approach provides a more appropriate level of granularity that will enhance the value of RSRs that come from finite pools of duty-cycle appliances. The concepts to be described may also have useful applications in related areas such as dynamic pricing policies in the context of Internet service and mobile telephony bandwidth management [22], [23], [24], [25]. The bottom line message of the paper is that in communication between consumers and service providers, it is important to stop relying on the assumption of uniform and time invariant preferences and thereby avoid ratepayer revolt [26].

Focusing on a smart building with a finite number of duty cycle cooling appliances, we use a dynamic probability distribution to represent cooling zone specific occupant preferences to transition their appliance from an idle to an active state. This is achieved by employing a dynamically evolving sufficient statistic that captures the ensemble of idle cooling zone appliance preferences, namely the recent-price-trajectory-dependent probability distribution of a sampled cooling zone occupant’s reservation price. Moreover, we improve the tractability of the DP formulation by (i) deriving an analytic characterization of the optimal policy and the differential cost function, and (ii) proving useful monotonicity and convexity properties. These properties motivate the use of appropriate basis functions to construct a parametrized analytic approximation of the value function used to design an approximate DP (ADP) algorithm [27] that estimates optimal value function approximation parameters and near optimal policies. To support the relative advantage of our approach, we compare the real time RSR tracking performance of our dynamic utility based optimal dynamic pricing policy to that of a uniform static utility based dynamic pricing policy. The tracking of our policy will be shown to be clearly superior.

The rest of the paper proceeds as follows. In Sec. II we establish a DP problem by describing the state dynamics, the sufficient statistic of cooling zone preferences, the period costs, and the Bellman Equation. Time is quantized into 4-second intervals, and the electricity used by the appliances is also quantized using our previously announced concept of packetized energy. Sec. III compares the static uniform distribution representation of cooling zone preferences adopted in past work, with the proposed dynamic and non-uniform preference distribution formulation. More specifically, it uses the Bellman equation to prove analytical expressions relating the optimal policy to partial differences in the value function. Sec. IV proves monotonicity properties of the differential cost function and the optimal policy. It also discusses asymptotic behaviour of optimal policy sensitivities as the number of appliances becomes large. Sec. V utilizes the proven properties of the value function and the optimal control policy to propose and implement (i) an assisted value iteration algorithm, and (ii) a parametrized approximation of the value function. It also develops an ADP approach for estimating the value function approximation parameters. Numerical results demonstrate the superior performance of dynamic utility versus static utility modeling pricing policies. Sec. VI concludes the paper and proposes future directions.

## II Problem Formulation

We consider an advanced energy management building with N cooling appliances. The smart building operator (SBO) contracts to regulate in real time its electricity consumption within an upper and a lower limit agreed upon in the hour-ahead market, ¯n ± R, where ¯n is the constant energy consumption rate that the SBO purchased in the hour-ahead market, and R is the maximum up or down RSR that the SBO agreed to provide in the hour-ahead market. ¯n equals the average number of active appliances, which is related to the average number of duty cycles per hour required to maintain the cooling zone temperature within its comfort zone for the prevailing outside temperature and building heat transfer properties. R is chosen as a fraction of ¯n that can be determined based on appliance response properties, such as the cooling rate. In real time, the ISO broadcasts the RSR signal y(t) to the SBO, which is the output of a pre-specified proportional integral filter of observed Area Control Error (ACE) and system frequency excursions. The RSR signal has an average value of 0 during each hour and its dynamics can be modeled as a Markov process [28]. Given these three values, the SBO assumes the responsibility to modulate its energy consumption to track ¯n + y(t)R as specified by the ISO in almost real time (usually every 2 or 4 seconds). To this end the SBO broadcasts a real time price signal to all cooling appliances to modulate the number of connected appliances and hence control the aggregated consumption. Idle appliances detect the price at rate λ and decide whether to resume cooling by comparing their utility to the price.¹

¹In this paper we assume consumers to have the same sensitivity preference across zones, namely the parameter b in the utility function is fixed such that the mapping between the temperature threshold and the price signal is bijective. However, Tmin and Tmax are not the same for all consumers. The price is normalized by the sensitivity parameter b.
Alternative implementations in the literature rely on similar approaches but are, in our opinion, less likely to be accepted by consumers. For example, the PNNL Olympic Peninsula project [19] assumed the presence of aggregators who typically provide participants with multiple comfort level choices; participants choose a comfort level based on their own preferences. Aggregators group participants into multiple subgroups based on their choices. Different groups of participants are characterized by different price elasticities and provide different types of reserves. For example, the participant group with the highest elasticity provides regulation service reserves (RSR) with a response time of 4 seconds. Participant groups with medium price elasticity provide spinning reserves with response times of 15 to 30 minutes. Participant groups with low price elasticity provide capacity reserves that have even slower response times. Appliance utility values depend on their corresponding cooling zone temperature T. In this paper the utility U(T) is a monotonically increasing function of T over [Tmin, Tmax]. Once the price does not exceed U(T), the appliance will connect and consume a packet of energy, having departure rate μ. (1/μ equals the average time it takes a cooling appliance to complete the consumption of an energy packet.) See [8], [29], where the associated electric power consumption is defined as a packet of electric energy needed to provide the work required by a single cooling cycle.

The RSR signal tracking error and occupant utility realizations constitute period costs. The objective is to find a state feedback optimal policy that minimizes the associated infinite horizon discounted cost. Individual cooling zone preferences are modeled by a dynamically evolving probability distribution of idle appliance zone temperatures, pt(T). Assuming a known relationship connecting cooling zone preferences to temperatures, we derive the probability distribution of preferences (or derived utility levels) for cooling appliance activation, as well as the sufficient statistic that characterizes pt(T). We next describe the system dynamics, and formulate the period cost function of the relevant stochastic dynamic problem.

### II-A State Dynamics

The state variables comprise i(t), y(t), D(t), and the sufficient statistic of the temperature distribution. The active and idle appliance queues constitute a closed queuing network where the service rate of one queue determines the arrival rate into the other. The idle queue behaves like an infinite server queue with each server exhibiting a stochastic Markov modulated service rate that depends on the control and the probability distribution pt(T). The active queue also behaves as an infinite server queue with each server exhibiting a constant service rate μ. The dynamics of y(t) and the dependent state variable D(t) are characterized by transitions taking place in short but constant time intervals τy,² resulting in y(t) staying constant, increasing, or decreasing by a typical amount Δy [28]. These transitions are outputs of a proportional integral filter operated by the ISO whose inputs are system frequency deviations from 60 Hz and Area Control Error (ACE). Since the frequency deviation and ACE signal can be approximated by a white noise process resulting from the imbalance between stochastic demand and supply, y(t) is an unanticipated random variable described by memoryless transitions that depend only on the current value. Therefore we can approximate y(t) by a continuous time jump Markovian process that allows us to uniformize the DP problem formulation. To uniformize the DP problem we introduce a control update period Δt which assures that during the period Δt the probability that more than one event takes place is negligible. We further set the time unit so that Δt = 1, scale transition rate parameters accordingly, and derive the following state dynamics.

²This varies across ISOs. In PJM it is either 2 or 4 seconds depending on the type of regulation service offered.

#### II-A1 Dynamics of y(t)

The transition probabilities of the discrete time Markov process y(t) depend on y(t) as well as D(t) [7], [20], which is y(t)'s changing direction, either positive (D(t) = 1) or negative (D(t) = -1), over each 4 second interval. Statistical analysis of historical PJM y(t) trajectories indicates a weak dependence on y(t), yielding the reasonable approximation:

$$
\begin{cases}
\text{Prob}(y(t+\tau_y)=y(t)+\Delta y \mid D(t)=1)=0.8\\
\text{Prob}(y(t+\tau_y)=y(t)-\Delta y \mid D(t)=1)=0.2\\
\text{Prob}(y(t+\tau_y)=y(t)-\Delta y \mid D(t)=-1)=0.8\\
\text{Prob}(y(t+\tau_y)=y(t)+\Delta y \mid D(t)=-1)=0.2.
\end{cases}
$$

Denoting by γu1 (γd1) the rate at which y(t) will jump up by Δy during a control update period Δt when D(t) = 1 (D(t) = -1), and by γu2 (γd2) the corresponding rate at which y(t) will jump down when D(t) = 1 (D(t) = -1), and selecting these rates to correspond to the exponential rates that are consistent with the geometric probability distribution described above, we have:

$$
\begin{cases}
\gamma_{u1} = 0.8\,\Delta t/\tau_y, & \gamma_{u2} = 0.2\,\Delta t/\tau_y,\\
\gamma_{d2} = 0.8\,\Delta t/\tau_y, & \gamma_{d1} = 0.2\,\Delta t/\tau_y.
\end{cases}
$$

Setting the control update period as the operative time unit, i.e., Δt = 1, we have the following uniformized dynamics of y(t):

$$
\begin{cases}
\text{Prob}(y(t+1)=y(t)+\Delta y \mid D(t)=1)=\gamma_{u1}\\
\text{Prob}(y(t+1)=y(t)-\Delta y \mid D(t)=1)=\gamma_{u2}\\
\text{Prob}(y(t+1)=y(t)-\Delta y \mid D(t)=-1)=\gamma_{d2}\\
\text{Prob}(y(t+1)=y(t)+\Delta y \mid D(t)=-1)=\gamma_{d1}.
\end{cases}
$$
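The persistence/reversal structure above is easy to simulate. The following sketch (not the authors' code; the step size `dy` and the clipping range are illustrative assumptions) generates a sample path of the uniformized RSR signal:

```python
import random

# Minimal sketch of the uniformized RSR signal dynamics: with probability
# 0.8 the signal keeps moving in its current direction D, and with
# probability 0.2 it reverses.  dy and y_max are made-up constants.
def step_rsr(y, D, dy=0.05, y_max=1.0, rng=random):
    keep = rng.random() < 0.8           # gamma_u1 (D=1) or gamma_d2 (D=-1)
    move = D if keep else -D            # direction actually taken
    y_new = max(-y_max, min(y_max, y + move * dy))
    return y_new, move                  # the move taken defines the new D

random.seed(0)
y, D = 0.0, 1
path = []
for _ in range(10_000):
    y, D = step_rsr(y, D)
    path.append(y)

print(len(path), max(abs(v) for v in path) <= 1.0)   # prints: 10000 True
```

Because the next move depends only on the current direction, the path is a persistent random walk, consistent with the memoryless approximation of y(t) adopted above.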

#### II-A2 Dynamics of i(t)

The dynamics of i(t) are governed by the following arrival and departure rates.

The arrival rate a(t) depends on the policy u(t). Denote by pu(t) the proportion of idle appliances with cooling zone temperature at least u(t). As described in the notation subsection, idle appliances observe the price broadcast by the SBO at a rate λ, and decide to connect and resume cooling when the price is smaller than their utility for cooling at time t. Therefore the arrival rate into the active queue is

$$a(t) = [N-i(t)]\,\lambda\, p_u(t) = (N-i(t))\,\lambda \int_{u(t)}^{T_{max}} p_t(T)\,dT, \qquad (1)$$

namely, a(t) equals the product of the number of idle appliances that observe the broadcast price times the probability that an idle appliance's temperature is at least u(t).

The departure rate is independent of u(t). It equals the number of active appliances times the departure rate of a cooling cycle. Modeling the consumption of an energy packet as an exponential random variable with rate μ, such that 1/μ equals the average cooling cycle duration, the departure rate is

 d(t)=i(t)μ. (2)

The stochastic dynamics of i(t) in the uniformized model are thus given by i(t+1) = i(t) + ĩ, where the random variable ĩ satisfies the following probability relations

$$
\begin{cases}
p(\tilde i = 1) = a(t)\\
p(\tilde i = -1) = d(t)\\
p(\tilde i = 0) = 1 - a(t) - d(t) - \gamma,
\end{cases}
$$

where γ = γu1 + γu2 = γd1 + γd2 is the total probability that the ISO RSR signal will change during a control update period.

Fig. 1 represents the stochastic dynamics of the number of active and idle appliances as a two queue closed queuing network with queue levels summing to .
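The closed queuing dynamics can be illustrated with a one-step simulator; the rates `lam`, `mu` and the placeholder connection probability `p_u` below are made-up constants, not calibrated values:

```python
import random

# Sketch of one uniformized transition of the closed queuing network:
# with probability a(t) an idle appliance connects, with probability d(t)
# an active one completes its energy packet, and otherwise i stays put
# (this last slot also absorbs the RSR-jump events).
def step_i(i, N, lam, mu, p_u, rng):
    a = (N - i) * lam * p_u       # arrival rate into the active queue, eq. (1)
    d = i * mu                    # departure rate, eq. (2)
    r = rng.random()
    if r < a:
        return i + 1
    if r < a + d:
        return i - 1
    return i

rng = random.Random(1)
N, i = 100, 50
trace = []
for _ in range(5000):
    i = step_i(i, N, lam=0.004, mu=0.004, p_u=0.5, rng=rng)
    trace.append(i)

print(0 <= min(trace) and max(trace) <= N)   # prints: True
```

Note that the arrival rate vanishes at i = N and the departure rate at i = 0, so the two queue levels always sum to N, as in the closed network of Fig. 1.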

#### II-A3 Dynamics of D(t)

Recalling that D(t) is the sign of the most recent change of y(t), it is clear that the dynamics of D(t) are fully determined by the dynamics of y(t). We next show that the dynamics of ^T(t) are also determined by the dynamics of y(t).

#### II-A4 Dynamics of ^T(t)

Based on related work in HVAC system modelling [9] and [30], we use standard energy transfer relations to simulate the dynamics of the frequency histogram of idle appliance cooling zone temperatures, which appears to conform to a three parameter functional representation. We simulate two RSR signals from the ISO to observe the time varying probability distribution of consumers' temperatures. The first is a standard ISO RSR signal trajectory that aspiring RSR market participants must respond to in order to demonstrate that they have the ability to track; this is referred to in the PJM manual as the standard T-50 qualifying test [28]. The second is a real time signal downloaded from PJM [31]. We record the temperature levels prevailing across the cooling zones when a control trajectory is applied that results in near-perfect tracking of the ISO RSR requests implied by the aforementioned two signals. Assuming standard heat transfer relationships, extensive simulation reported below indicates that the time evolution of the probability distribution of cooling zone temperatures conforms to a dynamically changing trapezoid characterized fully by ^T(t), see Fig. 2. pt(T) is the probability density function (pdf) of the idle appliance cooling zone temperature T at time t, T ∈ [Tmin, Tmax]. Tmin and Tmax are the threshold values of room temperature that determine the appliance cooling zone occupant's comfort zone. pt(Tmax) is of small value and approximated to be zero. The trapezoid has base Tmin to Tmax and top side Tmin to ^T(t), over which the density is constant. Note that ^T(t) and hence pt(T) are time varying quantities. We will use p(T) to denote the pdf, dropping the time index in derivations and proofs when time dependence is not considered.

Fig. 3 shows the accuracy of using a trapezoid probability distribution to represent idle appliance cooling zone temperatures obtained from a Monte Carlo simulation of the price controlled system. We discretize the temperature into 20 states and simulate a total number of 16000 appliances to observe a smooth probability distribution. The duty cycle on and off times are both 10 minutes. The price detection interval of idle appliances is 1 minute. PJM's RSR signal is broadcast every 4 seconds. In Fig. 3, the numerous black curves are the actual probability distributions recorded at different time stamps for trapezoids characterized by a given ^T. The red curve is the mean value of the set of black curves taken at each temperature state. The red curve is then approximated by the trapezoid (green curve) proposed in Fig. 2. ^T and the trapezoid approximation are then the mean statistics of the actual frequency distribution of temperatures. Note that the trapezoids are completely specified by two static quantities, Tmin and Tmax, forming the base of the trapezoid, and the time varying quantity ^T(t) that determines the height of the top horizontal side. It should be noted that the trapezoid approximation is based on the assumptions of (1) a small value of pt(Tmax) and (2) equal probability for temperatures below ^T(t). In this paper we model the trapezoid distribution based on the single sufficient statistic ^T(t); future work will consider modeling the preference distribution with additional statistics to fully capture the distribution shape.
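The trapezoid density of Fig. 2 can be written down and sanity-checked directly. The sketch below follows our reading of the piecewise expression given later in Sec. III-B (constant on [Tmin, ^T], decaying linearly to zero at Tmax); the comfort band [18, 27] is an illustrative assumption:

```python
# Sketch of the trapezoid density: constant up to That, then linear to
# zero at Tmax.  The band [18, 27] is a made-up comfort zone.
def trapezoid_pdf(T, That, Tmin=18.0, Tmax=27.0):
    norm = Tmax + That - 2 * Tmin            # normalizing constant
    if T < Tmin or T > Tmax:
        return 0.0
    if T <= That:
        return 2.0 / norm
    return 2.0 * (T - Tmax) / ((That - Tmax) * norm)

def integral(That, n=100_000, Tmin=18.0, Tmax=27.0):
    # midpoint-rule check that the density integrates to one
    h = (Tmax - Tmin) / n
    return sum(trapezoid_pdf(Tmin + (k + 0.5) * h, That) for k in range(n)) * h

print(round(integral(20.0), 4), round(integral(25.0), 4))   # prints: 1.0 1.0
```

The normalization 2/(Tmax + ^T - 2Tmin) makes the trapezoid area equal one for any ^T in (Tmin, Tmax), which is why the single statistic ^T(t) suffices to pin down the whole density.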

The upper left plot in Fig. 4 shows the time evolution of the trapezoids representing the idle appliance cooling zone temperature histograms as we simulate RSR tracking for the real signal. The upper right plot is the contour plot of the number of appliances, where we can clearly see a time varying trapezoid distribution shaping the preferences. We filter the contour plot to get ^T(t) in the lower right figure. Based on the time series recorded for y(t) (lower left figure) and ^T(t), we find a strong anti-correlation between the two vectors in the simulations for both the T-50 standard signal and the real RSR signal, given by -0.9833 and -0.8106, respectively. We propose the regression of ^T(t) on y(t) with the following linear function

 ^T(t)=α0+α1y(t)+ω, (3)

where α0 corresponds to the value of ^T(t) when the building's energy consumption level is ¯n (i.e., y(t) = 0), α1 < 0, and ω is a zero mean symmetrically distributed error. These results not only explain most of the variability but are also sensible and conform with our expectations. Indeed, large values of y(t) indicate a history of repeated broadcasts of low prices to achieve high consumption levels requested by the ISO. Small values of ^T(t) approaching Tmin are observed for high y(t) levels, while for low levels of y(t), ^T(t) is large. These findings support our a priori expectation that y(t) is a reasonable sufficient statistic of past state and control trajectories in the information vector available at time t. This a priori expectation is based on the fact that cooling zone temperature levels are in fact integrators of recent price control trajectories. The verification of expectations by actual observations justifies the adoption of a tractable dynamic utility model conforming to the dynamics ^T(t) = α0 + α1 y(t) + ω, where ω is a zero mean symmetric random variable. Since the SBO is able to observe the actual cooling zone temperatures through its access to Building Automation Control (BAC), the dynamics above are adequate for optimal control estimation. We finally note that the mapping of temperature to consumption utility allows the dynamic and past-control-dependent distribution of cooling zone temperatures to provide a dynamic and past-control-dependent distribution of consumption utility levels. Finally, Fig. 5 shows the actual ^T(t) (black curve), the predicted ^T(t) (red curve) based on the linear regression, and the error between the two values. It can be seen that the error is zero mean and symmetrically distributed, which is consistent with our assumption in proposing (3).
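Fitting (3) only requires ordinary least squares on the recorded (y(t), ^T(t)) pairs. Since we do not have the paper's simulated PJM traces here, the sketch below recovers known coefficients from synthetic anti-correlated data; all constants are illustrative:

```python
import random

# Plain OLS for That = a0 + a1*y + w, on synthetic data mimicking the
# reported strong anti-correlation between y(t) and That(t).
def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - a1 * mx, a1          # (intercept a0, slope a1)

rng = random.Random(3)
y_signal = [rng.uniform(-1, 1) for _ in range(2000)]
That = [24.0 - 3.0 * y + rng.gauss(0, 0.2) for y in y_signal]  # a1 < 0
a0, a1 = ols(y_signal, That)
print(round(a0, 1), round(a1, 1))
```

With 2000 samples and small noise, the fitted pair lands on the generating values (24.0, -3.0), illustrating why a single linear regression suffices once the anti-correlation is as strong as the reported -0.98.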

### II-B Period Cost

The period cost consists of two parts: a penalty for deficient ISO RSR signal tracking and the utility realized by appliance users. The penalty rate for deficient tracking at time t is defined as:

$$g(i(t),y(t)) = K\,[i(t)-\bar n - y(t)R]^2, \qquad (4)$$

where K is the penalty per unit of deficient tracking. Absorbing constant scale factors into κ, we can write the penalty for deficient tracking as,

 g(i(t),y(t))=κ[i(t)−¯n−y(t)R]2. (5)

The expected utility realized from the group of idle appliances can be calculated from three components: (1) the number of idle appliances, N - i(t), that can potentially connect; (2) the connection probability of each appliance, λ pu(t), which depends on the dynamic frequency distribution pt(T); and (3) the utility rate per connection, which is the expected monetary value per connection calculated from the consumers' temperature based utility function U(T) and the probability distribution of room temperature pt(T). It should be noted that while the utility for a given consumer is determined by their room temperature T, for a group of consumers the aggregated utility is based on the frequency distribution of room temperatures of the group, and therefore depends on pt(T) instead of T.

For a given price signal, there is a threshold temperature value u(t) obtained by equating utility to price. Idle appliances having temperature at least u(t) will connect. The consequent expected utility value realized by each connecting cooling appliance is

$$U_u(t) = \frac{\displaystyle\int_{u(t)}^{T_{max}} U(T)\,p_t(T)\,dT}{\displaystyle\int_{u(t)}^{T_{max}} p_t(T)\,dT}. \qquad (6)$$

Denoting by a(t) the expected number of connections from the idle appliance group, the period expected utility value is

$$
\begin{aligned}
a(t)U_u(t) &= [N-i(t)]\,\lambda\, p_u(t)\, U_u(t)\\
&= [N-i(t)]\,\lambda \int_{u(t)}^{T_{max}} p_t(T)\,dT \cdot
\frac{\displaystyle\int_{u(t)}^{T_{max}} U(T)\,p_t(T)\,dT}{\displaystyle\int_{u(t)}^{T_{max}} p_t(T)\,dT}\\
&= [N-i(t)]\,\lambda \int_{u(t)}^{T_{max}} U(T)\,p_t(T)\,dT. \qquad (7)
\end{aligned}
$$

Equations (5) and (7) imply that the total cost rate is

$$c(i(t),y(t),u(t)) = \kappa\,[i(t)-\bar n - y(t)R]^2 - [N-i(t)]\,\lambda \int_{u(t)}^{T_{max}} U(T)\,p_t(T)\,dT. \qquad (8)$$
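As a concrete sketch of (8), the snippet below evaluates the cost rate under the linear utility of (10) and a static uniform p(T), for which the utility integral has the closed form derived later in (11); κ, b, λ and the temperature band are all made-up numbers:

```python
# Period cost (8) under linear utility U(T) = b(T - Tmin) and a uniform
# density on [Tmin, Tmax]; every constant here is illustrative.
def cost_rate(i, y, u, N=100, nbar=50, R=25, lam=0.01, kappa=1.0,
              b=0.5, Tmin=18.0, Tmax=27.0):
    track = kappa * (i - nbar - y * R) ** 2          # tracking penalty, eq. (5)
    # expected utility term: closed form of the integral in (8), cf. (11)
    util = ((N - i) * lam * b * (Tmax - u) * (Tmax + u - 2 * Tmin)
            / (2 * (Tmax - Tmin)))
    return track - util

# Lowering the threshold u admits more connections and captures more
# utility, hence a lower cost at the same tracking error.
print(cost_rate(50, 0.0, 20.0) < cost_rate(50, 0.0, 26.0))  # prints: True
print(cost_rate(75, 1.0, 27.0))                             # prints: 0.0
```

At i = ¯n + yR with u = Tmax both terms vanish, which matches the reading of (8) as a trade-off between tracking error and forgone consumption utility.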

### II-C Bellman Equation

The state variables can be grouped according to their dependence on D(t): the dynamics of y(t) depend explicitly on D(t). ^T(t) is also dependent on the past trajectory of controls but, to the extent that this trajectory is consistent with reasonable tracking of the ISO RSR signal, it can be considered a function of y(t), which, as discussed earlier, is the sufficient statistic of this trajectory. We can thus consider all state variables, other than y(t), to have dynamics that do not depend on D(t). For notational simplicity we let ¯iu (¯id) be the state variables that make up the complement of i when the RSR signal is going up (down), so that (i, ¯iu) or (i, ¯id) is the representation of the full state vector. Given the cost function and dynamics described above, we can formulate an infinite horizon discounted cost problem with the following Bellman equation, written for states with a downwards RSR signal.

$$
\begin{aligned}
J(i,\bar i^d) = \min_{u\in[T_{min},T_{max}]} \Big\{\, & g(i,\bar i^d) - a(t)U_u \\
& + \alpha\big[\, a(t)J(i+1,\bar i^d) + d(t)J(i-1,\bar i^d)\\
& \quad + \gamma_{d1}J(i,\bar i^u{+}\Delta y) + \gamma_{d2}J(i,\bar i^d{-}\Delta y)\\
& \quad + (1-a(t)-d(t)-\gamma_{d1}-\gamma_{d2})J(i,\bar i^d)\,\big]\Big\}. \qquad (9)
\end{aligned}
$$

J is the value function satisfying the Bellman equation, with α denoting the discount factor. For notational simplicity we denote by ¯iu+Δy the new state realized when the regulation signal increases from y to y+Δy, rendering D = 1, while the rest of the state variables remain unchanged. Similarly we denote by ¯id-Δy the new state when the regulation signal decreases from y to y-Δy, rendering D = -1, while the rest of the state variables remain unchanged. The superscripts u (d) stand for upwards (downwards) RSR signals. J(i, ¯iu) can be written similarly with minor notational changes.
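A toy value iteration makes the fixed-point structure of (9) concrete. To stay tiny, the sketch below drops the direction state and the trapezoid statistic, uses a placeholder connection probability p_u = 1 - u, and treats the y-grid jumps with fixed probabilities; every constant is made up, so this illustrates only the iteration, not the paper's full model:

```python
# Toy value iteration for a stripped-down version of recursion (9).
N, nbar, R = 10, 5, 3
ys = [-1.0, 0.0, 1.0]                 # coarse y-grid
prices = [0.2, 0.5, 0.8]              # small control set
alpha, lam, mu, kappa, gj = 0.95, 0.04, 0.04, 1.0, 0.1

J = {(i, k): 0.0 for i in range(N + 1) for k in range(len(ys))}
for _ in range(400):
    Jn = {}
    for (i, k) in J:
        best = float("inf")
        for u in prices:
            a = (N - i) * lam * (1.0 - u)     # arrival probability (placeholder p_u)
            d = i * mu                        # departure probability
            g = kappa * (i - nbar - ys[k] * R) ** 2 - a * u   # crude period cost
            up, dn = min(k + 1, len(ys) - 1), max(k - 1, 0)
            nxt = (a * J[(min(i + 1, N), k)]
                   + d * J[(max(i - 1, 0), k)]
                   + gj * J[(i, up)] + gj * J[(i, dn)]
                   + (1 - a - d - 2 * gj) * J[(i, k)])
            best = min(best, g + alpha * nxt)
        Jn[(i, k)] = best
    J = Jn

# Matching the target i = nbar at y = 0 is the cheapest starting state.
print(J[(5, 1)] < J[(0, 1)] and J[(5, 1)] < J[(10, 1)])   # prints: True
```

The transition probabilities are scaled so each row sums to one, mirroring the uniformization step of Sec. II-A; the contraction with factor α then guarantees convergence of the iteration.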

## III Utility Realization and Optimal Policy

### III-A Uniform Utility Probability Distribution Model

For illustration purposes, we start with a simple utility function that represents a linear relationship between the cooling zone temperature and the utility enjoyed by activating an idle appliance and allowing it to embark on a cooling cycle:

 U(T)=b(T−Tmin). (10)

The utility increases proportionately to the cooling zone temperature T. If p(T) were selected to be a static and uniform probability distribution, as is the case with work published so far, the expected period utility rate would be a conveniently concave function of u. Indeed, using (7) we obtain that the expected period utility rate

$$
\begin{aligned}
a(t)U_u &= (N-i(t))\,\lambda \int_u^{T_{max}} U(T)\,p(T)\,dT\\
&= (N-i(t))\,\lambda \int_u^{T_{max}} b(T-T_{min})\,\frac{1}{T_{max}-T_{min}}\,dT\\
&= (N-i(t))\,\lambda\, b\,\frac{(T_{max}-u)(T_{max}+u-2T_{min})}{2(T_{max}-T_{min})}, \qquad (11)
\end{aligned}
$$

is a concave function of the policy u.
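The concavity of (11) is immediate from its quadratic form, and can also be checked numerically; the constants below are illustrative stand-ins:

```python
# Numerical check that the closed-form rate (11) for the uniform density
# is concave in the threshold u: every central second difference on
# (Tmin, Tmax) is negative.  All constants are made up.
def rate(u, N=100, i=50, lam=0.01, b=0.5, Tmin=18.0, Tmax=27.0):
    return ((N - i) * lam * b * (Tmax - u) * (Tmax + u - 2 * Tmin)
            / (2 * (Tmax - Tmin)))

h = 0.01
us = [18.0 + 0.05 * k for k in range(1, 179)]
second_diffs = [rate(u + h) - 2 * rate(u) + rate(u - h) for u in us]
print(all(s < 0 for s in second_diffs))   # prints: True
```

Since (11) is quadratic in u with a negative leading coefficient, the central second difference equals the (constant, negative) second derivative times h², which is why the check passes at every grid point.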

In general, this concavity property of the expected period utility rate holds for a broader class of utility functions U(T, b, ε), where T is the temperature, b is the utility sensitivity, and ε is a consumer specific value characterizing individual utility preference. We require two properties to hold for the utility function: ε must be a random parameter independent of T, and the expectation Eε[U(T, b, ε)] must be a monotonically increasing function of T. We prove this property by the following reasoning. Taking the expectation over ε in (6), we have

$$U_u = E_\epsilon\!\left[\frac{\displaystyle\int_{u(t)}^{T_{max}} U(T,b,\epsilon)\,p(T)\,dT}{\displaystyle\int_{u(t)}^{T_{max}} p(T)\,dT}\right] = \frac{\displaystyle\int_{u(t)}^{T_{max}} E_\epsilon[U(T,b,\epsilon)]\,p(T)\,dT}{\displaystyle\int_{u(t)}^{T_{max}} p(T)\,dT}. \qquad (12)$$

Taking the derivative of the expected utility rate with respect to u,

$$\frac{d}{du}\,a(t)U_u = \frac{d}{du}\,(N-i(t))\,\lambda \int_u^{T_{max}} E_\epsilon[U(T,b,\epsilon)]\,p(T)\,dT = -\frac{(N-i(t))\,\lambda}{T_{max}-T_{min}}\,E_\epsilon[U(u,b,\epsilon)]. \qquad (13)$$

Since $E_\epsilon[U(u,b,\epsilon)]$ increases with $u$, we have $\frac{d}{du}E_\epsilon[U(u,b,\epsilon)] > 0$, which leads to

$$\frac{d^2}{du^2}\,a(t)\bar{U}_u = -\frac{(N-i(t))\,\lambda}{T_{\max}-T_{\min}}\,\frac{d}{du}E_\epsilon[U(u,b,\epsilon)] < 0. \tag{14}$$

Hence the expected period utility rate is a concave function of the policy $u$.
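This concavity is easy to check numerically. The following sketch (not from the paper's code; all parameter values are illustrative assumptions) evaluates the expected period utility rate under a uniform $p(T)$ and the linear utility $U(T)=b(T-T_{\min})$, compares it with the closed form (11), and verifies that its second differences are non-positive:

```python
import numpy as np

# Illustrative check of (11): concavity of the expected period utility rate
# under a static uniform p(T).  Parameter values are hypothetical.
Tmin, Tmax, b = 18.0, 25.0, 1.0
lam, N, i_t = 0.5, 100, 40          # assumed arrival rate and appliance counts

def trapz(y, x):
    """Plain trapezoidal rule (avoids the np.trapz / np.trapezoid rename)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def expected_rate(u, n=2000):
    """(N - i) * lambda * integral of U(T) p(T) dT from u to Tmax, uniform p."""
    T = np.linspace(u, Tmax, n)
    return (N - i_t) * lam * trapz(b * (T - Tmin) / (Tmax - Tmin), T)

u_grid = np.linspace(Tmin, Tmax, 201)
rate = np.array([expected_rate(u) for u in u_grid])

# Closed form from (11) for comparison
closed = (N - i_t) * lam * b * (Tmax - u_grid) * (Tmax + u_grid - 2 * Tmin) \
         / (2 * (Tmax - Tmin))

print(bool(np.allclose(rate, closed)), bool(np.all(np.diff(rate, 2) <= 1e-9)))
# -> True True
```

Since the integrand is linear in $T$, the trapezoidal rule is exact here, so the numerical rate matches the closed form to machine precision.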

### III-B Generalized Utility Probability Distribution Model

The concavity property no longer holds true under the realistic modelling of $p(T)$ by a dynamic trapezoid characterized additionally by the time varying quantity $\hat{T}$. Indeed, the realistic representation implies the following consumers' preference distribution

$$p(T) = \begin{cases} \dfrac{2}{T_{\max}+\hat{T}-2T_{\min}}, & T \le \hat{T},\\[2mm] \dfrac{2\,(T-T_{\max})}{(\hat{T}-T_{\max})(T_{\max}+\hat{T}-2T_{\min})}, & T \ge \hat{T}. \end{cases}$$

For example, using the linear utility function (10) yields the following expected period utility rate

$$a(t)\bar{U}_u = \begin{cases} \dfrac{[N-i(t)]\,\lambda\, 2b\left(C_1-\tfrac{1}{2}u^2+T_{\min}u\right)}{T_{\max}+\hat{T}-2T_{\min}}, & u \le \hat{T},\\[3mm] \dfrac{[N-i(t)]\,\lambda\, 2b\left(C_2-\tfrac{1}{3}u^3+\tfrac{T_{\min}+T_{\max}}{2}u^2-T_{\min}T_{\max}u\right)}{(\hat{T}-T_{\max})(T_{\max}+\hat{T}-2T_{\min})}, & u \ge \hat{T}, \end{cases} \tag{15}$$

where $C_1$ and $C_2$ are constants. The introduction of a dynamic $\hat{T}$-dependent $p(T)$ removes the concavity of the expected utility rate, since the second derivative of the expected utility satisfies

$$\frac{d^2}{du^2}\,a(t)\bar{U}_u \propto T_{\min}+T_{\max}-2u \quad (u \ge \hat{T}). \tag{16}$$

Since the constant of proportionality in (16) is negative (recall $\hat{T} < T_{\max}$), the expected period utility rate is concave for $u \le (T_{\min}+T_{\max})/2$ and convex for $u \ge (T_{\min}+T_{\max})/2$.
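A quick numerical experiment illustrates this loss of concavity. In the sketch below (all parameter values, including the break point $\hat{T}$, are illustrative assumptions), the second differences of the expected period utility rate change sign on the branch $u \ge \hat{T}$, so the rate is neither concave nor convex there:

```python
import numpy as np

# Illustrative check: under the trapezoid density p(T) with break point T_hat,
# the expected period utility rate (15) loses concavity.  Values are hypothetical.
Tmin, Tmax, T_hat, b = 18.0, 26.0, 20.0, 1.0

def p(T):
    c = Tmax + T_hat - 2 * Tmin
    return np.where(T <= T_hat, 2.0 / c,
                    2.0 * (T - Tmax) / ((T_hat - Tmax) * c))

def rate(u, n=4000):
    """Integral of b (T - Tmin) p(T) dT from u to Tmax (per-connection factor)."""
    T = np.linspace(u, Tmax, n)
    y = b * (T - Tmin) * p(T)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(T) / 2.0))

u_grid = np.linspace(T_hat, Tmax, 81)      # the branch u >= T_hat of (15)
r = np.array([rate(u) for u in u_grid])
d2 = np.diff(r, 2)                          # discrete curvature

# Curvature changes sign at (Tmin + Tmax) / 2, consistent with (16)
print(bool(np.any(d2 < 0) and np.any(d2 > 0)))
# -> True
```

The curvature is negative for $u$ below the midpoint $(T_{\min}+T_{\max})/2 = 22$ and positive above it, matching the sign pattern derived from (16).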

Under the static uniform probability distribution $p(T)$, the optimal policy can easily be obtained, since setting the derivative to zero yields a local maximum which is also global. We proceed to show that a unique optimal policy exists under dynamic circumstances as well. We do this by first showing that a local maximum exists, and then proving that only one local maximum exists, hence it is the global maximum as well.

### III-C Optimal Price Policy

We define the differential of the value function w.r.t. the active appliance state variable $i$ as

$$\Delta(i+1,\bar{i}^d) = J(i+1,\bar{i}^d) - J(i,\bar{i}^d).$$

Using the Bellman equation, we can express the optimal policy in terms of $\Delta(i+1,\bar{i}^d)$:

$$\begin{aligned}
u(i,\bar{i}^d) = \arg\min_u\;& g(i,\bar{i}^d) - \lambda(N-i)\,p_u\bar{U}_u + \alpha\big\{ i\mu\, J(i-1,\bar{i}^d) + \lambda(N-i)\,p_u\, J(i+1,\bar{i}^d)\\
&\; + \gamma^{d1} J(i,\bar{i}^u+\Delta y) + \gamma^{d2} J(i,\bar{i}^d-\Delta y) + \big[1-\big(i\mu+\lambda(N-i)p_u+\gamma^{d1}+\gamma^{d2}\big)\big] J(i,\bar{i}^d)\big\}\\
= \arg\max_u\;& p_u\bar{U}_u - \alpha\, p_u\,\Delta(i+1,\bar{i}^d), \end{aligned} \tag{17}$$

where the second equation is obtained by neglecting terms that are independent of $u$. Letting

$$f(u,\Delta(i+1,\bar{i}^d)) = p_u\bar{U}_u - \alpha\, p_u\,\Delta(i+1,\bar{i}^d), \tag{18}$$

we can write that the optimal policy must satisfy

$$\max_{u\in[T_{\min},T_{\max}]} f(u,\Delta(i+1,\bar{i}^d)). \tag{19}$$

Proposition 1 In the conventional assumption where consumers' utility preference is statically uniformly distributed ($\hat{T} = T_{\max}$), $f(u,\Delta(i+1,\bar{i}^d))$ is a concave function of the policy $u$ in the allowable control set. The optimal policy is obtained either at the boundary of the allowable control set or at the interior point satisfying $\frac{d}{du}f(u,\Delta(i+1,\bar{i}^d)) = 0$.

Proposition 1 is straightforward because the first term in $f$ is quadratic and the second term is a linear function of $u$ for the linear utility function (10). When $\hat{T} < T_{\max}$, with $p(T)$ no longer uniform but trapezoid, $f$ no longer possesses the concavity property which, under Proposition 1, guaranteed that a local maximum is the global maximum. We therefore proceed to prove existence and uniqueness as follows.

Proposition 2 For a trapezoid consumers' preference distribution $p(T)$ with $\hat{T} < T_{\max}$ and a non-homogeneous preference function $U(T,b,\epsilon)$ having monotonically increasing expected value $E_\epsilon[U(T,b,\epsilon)]$, the optimal policy that solves (19) is described by the following relations

$$u(i,\bar{i}^d) = \begin{cases} T_{\max}, & \alpha\Delta(i+1,\bar{i}^d) \ge E_\epsilon[U(T_{\max},b,\epsilon)],\\ T_{\min}, & \alpha\Delta(i+1,\bar{i}^d) \le E_\epsilon[U(T_{\min},b,\epsilon)],\\ I^{-1}\!\left(\alpha\Delta(i+1,\bar{i}^d)\right), & \text{otherwise}, \end{cases} \tag{20}$$

where $I^{-1}(\cdot)$ is the inverse of the expected utility function $I(u) = E_\epsilon[U(u,b,\epsilon)]$, satisfying $I(I^{-1}(x)) = x$.

Proof. For the three conditions in (20), we claim the following:

1) When $\alpha\Delta(i+1,\bar{i}^d) \ge E_\epsilon[U(T_{\max},b,\epsilon)]$, namely $\alpha\Delta(i+1,\bar{i}^d)$ is no less than the maximum possible expected utility per connection, we always have $E_\epsilon[U(u,b,\epsilon)] - \alpha\Delta(i+1,\bar{i}^d) \le 0$. Since $E_\epsilon[U(u,b,\epsilon)]$ is a monotonically increasing function and $p_u$ is a monotonically decreasing function of $u$, $f(u,\Delta(i+1,\bar{i}^d))$ reaches its maximum value at $u = T_{\max}$. On the other hand, if $u = T_{\max}$ is the optimal policy, we must have $\alpha\Delta(i+1,\bar{i}^d) \ge E_\epsilon[U(T_{\max},b,\epsilon)]$. To see this necessity, assume that $\alpha\Delta(i+1,\bar{i}^d) < E_\epsilon[U(T_{\max},b,\epsilon)]$; then there exists a policy $u' < T_{\max}$ such that $E_\epsilon[U(u',b,\epsilon)] > \alpha\Delta(i+1,\bar{i}^d)$ and $p_{u'} > 0$. Hence $f(u',\Delta(i+1,\bar{i}^d)) > 0 = f(T_{\max},\Delta(i+1,\bar{i}^d))$, which is a contradiction to the assumption that $u = T_{\max}$ is optimal.

2) When $\alpha\Delta(i+1,\bar{i}^d) \le E_\epsilon[U(T_{\min},b,\epsilon)]$, the integrand $E_\epsilon[U(T,b,\epsilon)] - \alpha\Delta(i+1,\bar{i}^d)$ is non-negative on the entire control set, so $f(u,\Delta(i+1,\bar{i}^d))$ is a monotonically decreasing function of $u$. Therefore $f$ reaches its maximum at $u = T_{\min}$.

3) When $E_\epsilon[U(T_{\min},b,\epsilon)] < \alpha\Delta(i+1,\bar{i}^d) < E_\epsilon[U(T_{\max},b,\epsilon)]$, we can take the derivative of $f$, as $f$ is continuously differentiable on $[T_{\min},T_{\max}]$:

$$\begin{aligned}
\frac{d}{du}f(u,\Delta(i+1,\bar{i}^d)) &= \frac{d}{du}\left[p_u\bar{U}_u - \alpha\,p_u\,\Delta(i+1,\bar{i}^d)\right]\\
&= \frac{d}{du}\left[\int_u^{T_{\max}} E_\epsilon[U(T,b,\epsilon)]\,p(T)\,dT - \alpha\Delta(i+1,\bar{i}^d)\int_u^{T_{\max}} p(T)\,dT\right]\\
&= -p(u)\left[E_\epsilon[U(u,b,\epsilon)] - \alpha\Delta(i+1,\bar{i}^d)\right]. \end{aligned} \tag{21}$$

Denoting by $u(i,\bar{i}^d)$ the optimal control that maximizes $f$ (equivalently, minimizes the Bellman cost in (17)), a necessary condition is that $u(i,\bar{i}^d)$ be a local maximum of $f$. Therefore it satisfies the first order condition

$$p(u)\left[E_\epsilon[U(u,b,\epsilon)] - \alpha\Delta(i+1,\bar{i}^d)\right]\Big|_{u=u(i,\bar{i}^d)} = 0. \tag{22}$$

According to the proof in 1), $u = T_{\max}$ is optimal (equivalently $p(u) = 0$) if and only if $\alpha\Delta(i+1,\bar{i}^d) \ge E_\epsilon[U(T_{\max},b,\epsilon)]$. Therefore in this case $u(i,\bar{i}^d) < T_{\max}$ and $p(u(i,\bar{i}^d)) > 0$. The only solution satisfying (22) is

$$E_\epsilon[U(u(i,\bar{i}^d),b,\epsilon)] - \alpha\Delta(i+1,\bar{i}^d) = 0. \tag{23}$$

Based on the definition of the inverse function, we have

$$u(i,\bar{i}^d) = I^{-1}\!\left(\alpha\Delta(i+1,\bar{i}^d)\right). \tag{24}$$

For the second order condition, it can be verified that $\frac{d^2}{du^2}f\big|_{u=u(i,\bar{i}^d)} = -\,p(u(i,\bar{i}^d))\,\frac{d}{du}E_\epsilon[U(u,b,\epsilon)]\big|_{u=u(i,\bar{i}^d)} < 0$. Therefore $u(i,\bar{i}^d)$ is a local maximum. Moreover, given that $E_\epsilon[U(u,b,\epsilon)]$ is first order differentiable, $\frac{d}{du}f$ is continuous and $f$ has only one critical point inside the allowable control set; hence the local maximum is the global maximum on $[T_{\min},T_{\max}]$, namely $u(i,\bar{i}^d) = I^{-1}(\alpha\Delta(i+1,\bar{i}^d))$.

Remark 1 The optimal policy characterization for $\alpha\Delta(i+1,\bar{i}^d)$ between $E_\epsilon[U(T_{\min},b,\epsilon)]$ and $E_\epsilon[U(T_{\max},b,\epsilon)]$ does not rely on the trapezoid form of $p(T)$; namely, it holds for broader possible realizations of consumers' real time preferences. This is because (23) has only one solution, which is the local and the global optimum by the same argument as in the proof. In addition, it holds for a broader class of utility functions (linear, quadratic, etc.) and for non-homogeneous utilities incorporating the consumer specific preference $\epsilon$.

Remark 2 The optimal policy is determined by balancing (1) the utility rewards from connected consumers, and (2) the differential of the optimal cost $\Delta(i+1,\bar{i}^d)$, viewed as an estimate of the value function difference across two adjacent states. The consumers' utility sensitivity $b$ plays the following role: when $b$ increases, the optimal policy $u(i,\bar{i}^d)$ will decrease for the same value of $\Delta(i+1,\bar{i}^d)$. In the extreme case when $b \to \infty$, we have $u(i,\bar{i}^d) \to T_{\min}$, namely the lowest price is broadcast to guarantee the largest possible utility reward; when $b \to 0$, the optimal controller is bang-bang depending on the sign of $\Delta(i+1,\bar{i}^d)$, indicating that consumers become extremely elastic.

Remark 3 The three cases in Proposition 2 correspond to different geometries of $f(u,\Delta(i+1,\bar{i}^d))$; see Fig. 6. With different choices of the utility function and $\Delta(i+1,\bar{i}^d)$, $f$ can be a monotonically increasing function of $u$ that leads to the optimal control $u(i,\bar{i}^d) = T_{\max}$, or it can be a monotonically decreasing function rendering $u(i,\bar{i}^d) = T_{\min}$, or it can be a non-concave and non-monotonic function whose local maximum is the global maximum on $(T_{\min},T_{\max})$.
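For the linear utility (10) with homogeneous consumers, $I(u) = b\,(u - T_{\min})$ and $I^{-1}(x) = T_{\min} + x/b$, so the three cases of (20) reduce to a simple clipped inversion. The sketch below is a minimal illustration under these assumptions (the numerical values of $\alpha$, $b$, and $\Delta$ are hypothetical, not taken from the paper):

```python
# Minimal sketch of the policy (20) for linear utility U(T) = b (T - Tmin),
# where I(u) = b (u - Tmin) and I^{-1}(x) = Tmin + x / b.  Values hypothetical.
Tmin, Tmax, b, alpha = 18.0, 26.0, 1.0, 0.99

def optimal_policy(Delta):
    """Three cases of (20): saturate at Tmax, saturate at Tmin, or invert I."""
    x = alpha * Delta
    if x >= b * (Tmax - Tmin):   # case 1: value loss exceeds any utility gain
        return Tmax               # price out all new connections
    if x <= 0.0:                  # case 2: every connection is worthwhile
        return Tmin               # lowest threshold/price is broadcast
    return Tmin + x / b           # case 3: interior root of (23)

print(optimal_policy(-1.0), round(optimal_policy(3.0), 2), optimal_policy(50.0))
# -> 18.0 20.97 26.0
```

The interior case satisfies the first order condition (23) exactly: $b\,(u^* - T_{\min}) = \alpha\Delta$. As $b \to 0$ the two saturation thresholds collapse and the policy becomes bang-bang in the sign of $\Delta$, consistent with Remark 2.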

## IV Properties of the Optimal Policy

Proposition 2 expresses the optimal policy as a function of $\Delta(i+1,\bar{i}^d)$. To study the properties of $u(i,\bar{i}^d)$, we focus on the structure of $\Delta(i+1,\bar{i}^d)$. In this section we derive key properties of $\Delta(i+1,\bar{i}^d)$ in terms of the changes in state space variables that lead to desirable structures for $u(i,\bar{i}^d)$. There are three state variables that affect $\Delta(i+1,\bar{i}^d)$: the aggregate consumption over all active appliances, the ISO RSR signal, and the tracking error. When two of the three variables are given, the third variable can be expressed in terms of them, since the total number of appliances $N$ is fixed. To study the structure of $\Delta(i+1,\bar{i}^d)$ as a function of the aggregate consumption, the RSR signal, and the tracking error, we fix one variable each time and allow the other two to vary. Before proceeding, we prove the following:

Lemma 1 Denote

$$\phi(\Delta(i+1,\bar{i}^d)) = \max_{u\in[T_{\min},T_{\max}]}\; p_u\bar{U}_u - \alpha\, p_u\,\Delta(i+1,\bar{i}^d), \tag{25}$$

then $\phi(\Delta(i+1,\bar{i}^d))$ is a monotonically non-increasing function of $\Delta(i+1,\bar{i}^d)$.

Proof. For saturated optimal control $u = T_{\min}$ or $u = T_{\max}$, $p_u$ and $\bar{U}_u$ are constant, so $\phi$ is affine in $\Delta(i+1,\bar{i}^d)$ with non-positive slope $-\alpha p_u$, and the statement stands. When the optimal control is not saturated, namely as in the last scenario in Proposition 2, we have by the envelope theorem

$$\frac{d\,\phi(\Delta(i+1,\bar{i}^d))}{d\,\Delta(i+1,\bar{i}^d)} = -\alpha\, p_u\Big|_{u=I^{-1}(\alpha\Delta(i+1,\bar{i}^d))} \le 0.$$

Therefore $\phi$ is a monotonically non-increasing function of $\Delta(i+1,\bar{i}^d)$.
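Lemma 1 can also be checked numerically: for every fixed $u$, $f(u,\Delta)$ is affine and non-increasing in $\Delta$, so the pointwise maximum $\phi$ inherits the monotonicity. The sketch below (trapezoid $p(T)$, linear utility, and illustrative parameter values that are our own assumptions) evaluates $\phi$ on a grid of $\Delta$ values:

```python
import numpy as np

# Illustrative check of Lemma 1: phi(Delta) = max_u [p_u Ubar_u - alpha p_u Delta]
# is non-increasing in Delta.  Trapezoid p(T), linear utility; values hypothetical.
Tmin, Tmax, T_hat, b, alpha = 18.0, 26.0, 20.0, 1.0, 0.99

def p(T):
    c = Tmax + T_hat - 2 * Tmin
    return np.where(T <= T_hat, 2.0 / c,
                    2.0 * (T - Tmax) / ((T_hat - Tmax) * c))

def f(u, Delta, n=2000):
    """p_u Ubar_u - alpha p_u Delta = integral of (U(T) - alpha Delta) p(T) dT."""
    T = np.linspace(u, Tmax, n)
    y = (b * (T - Tmin) - alpha * Delta) * p(T)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(T) / 2.0))

u_grid = np.linspace(Tmin, Tmax, 161)

def phi(Delta):
    return max(f(u, Delta) for u in u_grid)

deltas = np.linspace(-2.0, 10.0, 25)
phis = np.array([phi(d) for d in deltas])
print(bool(np.all(np.diff(phis) <= 1e-9)))   # phi is non-increasing
# -> True
```

The monotonicity is exact up to floating point: each candidate $f(u,\cdot)$ has slope $-\alpha p_u \le 0$ in $\Delta$, and a pointwise maximum of non-increasing functions is non-increasing.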

In addition to the monotonicity property of $\phi$, we derive upper and lower bounds on the change in $\phi$ with respect to a change in $\Delta(i+1,\bar{i}^d)$, shown in the following Lemma:

Lemma 2 $\phi(\Delta(i+1,\bar{i}^d))$ has the following upper and lower bounds:

(1)