Realtime Voltage Control Using
Deep Reinforcement Learning
Abstract
Modern distribution grids are currently being challenged by frequent and sizable voltage fluctuations, due mainly to the increasing deployment of electric vehicles and renewable generators. Existing approaches to maintaining bus voltage magnitudes within the desired region can cope with either traditional utility-owned devices (e.g., shunt capacitors), or contemporary smart inverters that come with distributed generation units (e.g., photovoltaic plants). The discrete on-off commitment of capacitor units is often configured on an hourly or daily basis, yet smart inverters can be controlled within milliseconds, thus challenging joint control of these two types of assets. In this context, a novel two-timescale voltage regulation scheme is developed for radial distribution grids by judiciously coupling data-driven with physics-based optimization. On a fast timescale, say every second, the optimal setpoints of smart inverters are obtained by minimizing instantaneous bus voltage deviations from their nominal values, based on either the exact alternating current power flow model or a linear approximant of it; whereas, at the slower timescale (e.g., every hour), shunt capacitors are configured to minimize the long-term discounted voltage deviations using a deep reinforcement learning algorithm. Numerical tests on a real-world 47-bus distribution feeder using real data corroborate the effectiveness of the novel scheme.
Index terms— Two timescales, voltage regulation, inverters, capacitors, deep reinforcement learning.
I Introduction
Frequent and sizable voltage fluctuations, caused by the growing deployment of electric vehicles, demand response programs, and renewable energy sources, challenge modern distribution grids. Electric utilities are currently experiencing major issues related to unprecedented levels of load peaks as well as renewable penetration. For instance, a solar farm connected at the end of a long distribution feeder in a rural area can cause voltage excursions along the feeder, while the apparent power capability of a substation transformer is strained by frequent reverse power flows. Moreover, overvoltage happens during midday when photovoltaic (PV) generation peaks and load demand is relatively low; whereas voltage sags occur mostly overnight due to low PV generation even when load demand is high [1]. This is why voltage regulation, the task of maintaining bus voltage magnitudes within desirable ranges, is critical in modern distribution grids.
Early approaches to regulating voltages at the residential level have relied on utility-owned devices, including load-tap-changing transformers, voltage regulators, and capacitor banks, to name a few [2]. They offer a convenient means of controlling reactive power, through which the voltage profile at their terminal buses as well as at other buses can be regulated [3, p. 678]. Finding the optimal configuration for these devices entails dealing with mixed-integer programs, which are NP-hard. To optimize the tap positions, a semidefinite relaxation approach was proposed in [4, 5]. Control rules based on heuristics were developed in [6, 1]. Nonetheless, these approaches can be computationally demanding, and do not guarantee optimal performance. A batch reinforcement learning (RL) scheme based on linear function approximation was lately advocated in [7].
Another characteristic inherent to utility-owned equipment is its limited life cycle, which prompts control on a daily or even monthly basis. Such configurations have been effective in traditional distribution grids without (or with low) renewable generation, and with slowly varying load. Yet, as distributed generation grows in residential networks nowadays [8], [9], rapid voltage fluctuations occur frequently [10]. According to a recent landmark bill, California mandated 50% of its electricity to be powered by renewable resources by 2025 and 60% by 2030. The power generated by a solar panel can vary by 15% of its nameplate rating within one-minute intervals [11]. Voltage control with utility-owned equipment alone would thus entail more frequent switching actions, and further installation of control devices.
Smart power inverters, on the other hand, come with contemporary distributed generation units, such as PV panels and wind turbines. Embedded with computing and communication units, these can be commanded to adjust reactive power output within seconds, and in a continuously valued fashion [10]. Indeed, engaging smart inverters in reactive power control has recently emerged as a promising solution [10, 12]. Computing the optimal setpoints for inverters' reactive power output is an instance of the optimal power flow task, which is nonconvex [13]; see [14] for a recent survey of convex relaxation solutions. To deal with the renewable uncertainty as well as other communication issues (e.g., delay and packet loss), stochastic, online, decentralized, and localized reactive control schemes have been advocated [12, 15, 16, 11, 17, 18, 19, 20, 21].
Despite the considerable success of the aforementioned approaches, joint control of utility-owned devices and emerging power inverters has not yet been fully explored. In this context, voltage control is dealt with in the present paper using both shunt capacitors and smart inverters.
A novel two-timescale solution combining first principles based on physical models and data-driven advances is put forth. On the slow timescale (e.g., hourly or daily basis), the optimal configuration (corresponding to the discrete on-off commitment) of capacitors is formulated as a Markov decision process (MDP), by carefully defining state, action, and cost according to the available control variables in the grid. The solution of this MDP is approached by means of a deep reinforcement learning (DRL) algorithm. This framework leverages the merits of the so-termed target network and experience replay, which can remove the correlation among the sequence of observations, to make the DRL stable and tractable. On the other hand, the setpoints of the inverters' reactive power output are computed by minimizing the instantaneous voltage deviation using the exact or approximate grid models on the fast timescale (e.g., every few seconds). Evidently, capacitor decisions have a long-standing impact on the bus voltage profile, and are thus intertwined with the inverter optimization.
Besides this long-term dependency involving discrete actions, the unknown dynamic evolution of load demand and solar generation motivates RL solutions that can learn the optimal control policy from data. Indeed, RL has shown great potential in several challenging power engineering tasks [22, 7, 23]. In settings involving high-dimensional or continuous state spaces, however, conventional RL approaches suffer from the so-called 'curse of dimensionality,' which discourages their employment [24].
Compared with past works, our contributions can be summarized as follows.

Joint control of two types of assets. A hybrid data- and physics-driven approach to managing both utility-owned equipment as well as smart inverters;

Slow-timescale learning. Modeling demand and generation as Markovian processes, optimal capacitor settings are learned from data using DRL; and,

Fast-timescale optimization. Using exact or approximate grid models, the optimal setpoints for inverters are found relying on the most recent slow-timescale solution.
Paper outline. Regarding the remainder of the paper, Section II introduces the two-timescale voltage regulation task. Section III deals with the fast-timescale optimization of inverters, followed by the DRL-based approach to configuring capacitors on a slow timescale, in Section IV. Numerical tests using a real feeder are presented in Section V, with concluding remarks drawn in Section VI.
Notation. Lower (upper) case boldface letters denote column vectors (matrices), with the exception of the line power flow vectors $(\mathbf{P}, \mathbf{Q})$, and normal letters represent scalars. Calligraphic letters are reserved for sets, and $\mathbf{1}$ denotes all-one vectors. Symbol $^\top$ stands for (vector/matrix) transposition, and $\|\mathbf{x}\|$ is the $\ell_2$-norm of $\mathbf{x}$.
II Real-time Voltage Control in Two Timescales
In this section, we describe the system model, and formulate the real-time voltage regulation problem.
II-A System model
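Consider a distribution grid having $N+1$ buses modeled as a graph $\mathcal{G} := (\mathcal{N}_0, \mathcal{L})$, where $\mathcal{N}_0 := \{0, 1, \ldots, N\}$ collects all buses, and $\mathcal{L}$ all lines. The grid is typically operated radially as a tree and served by the substation (a.k.a. the root) indexed by $i = 0$, whose squared voltage magnitude is regulated to some constant $v_0$ (e.g., 1). All buses excluding the root comprise $\mathcal{N} := \{1, \ldots, N\}$. For all buses $i \in \mathcal{N}$, let $v_i$ denote their squared voltage magnitude, and $p_i + jq_i$ be their complex power injected, where $p_i := p_i^g - p_i^c$ and $q_i := q_i^g - q_i^c$ with superscript $g$ ($c$) denoting generation (consumption). For notational convenience, collect all nodal quantities into column vectors $\mathbf{v}$, $\mathbf{p}$, $\mathbf{q}$, $\mathbf{p}^g$, $\mathbf{p}^c$, $\mathbf{q}^g$, and $\mathbf{q}^c$.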
As mentioned earlier, there are two types of assets in contemporary residential networks that can be used for reactive power control, namely utility-owned equipment featuring discrete actions and limited lifespan, as well as smart inverters controllable within seconds and in a continuously valued fashion. As the aggregate load changes in a relatively slow and predictable way, the traditional utility-owned devices have been sufficient for providing voltage support; while fast-responding solutions using smart inverters become indispensable with the increase of uncertain renewable penetration.
In this work, we focus on the task of voltage regulation by capitalizing on the reactive control capabilities of both capacitors and inverters, while our framework can also account for other reactive power control devices. To this end, we divide every day into $N_T$ intervals indexed by $\tau = 1, \ldots, N_T$. Each of these intervals is further partitioned into $T$ time slots which are indexed by $t = 1, \ldots, T$, as illustrated in Fig. 1. To match the slow load variations, the on-off decisions of capacitors are made (at the end of) every interval $\tau$, which can be chosen to be e.g., an hour; yet, to accommodate the rapidly changing renewable generation, the inverter output is adjusted (at the beginning of) every slot $t$, taken to be e.g., a minute. We assume that quantities $\mathbf{p}^g_\tau(t)$, $\mathbf{p}^c_\tau(t)$, and $\mathbf{q}^c_\tau(t)$ remain the same within each slot, but may change from slot $t$ to $t+1$.
Suppose there are $N_a$ shunt capacitors installed in the grid, whose bus indices are collected in $\mathcal{N}_a \subseteq \mathcal{N}$, and are in one-to-one correspondence with entries of $\mathcal{K} := \{1, \ldots, N_a\}$ (a simple renumbering). Assume that every bus is equipped with either a shunt capacitor or a smart inverter, but not both. The remaining buses, after removing entries in $\mathcal{N}_a$ from $\mathcal{N}$, collected in $\mathcal{N}_r := \mathcal{N} \setminus \mathcal{N}_a$, are assumed equipped with inverters. This assumption is made without loss of generality as one can simply set the upper and lower bounds on the reactive output to zero at buses having no inverters installed.
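Since the shunt capacitor configuration is updated on a slow timescale (every interval $\tau$), the reactive compensation provided by capacitor $k_i \in \mathcal{K}$ (or, the capacitor installed at bus $i \in \mathcal{N}_a$) is represented by

$$q^g_{i,\tau}(t) = y_\tau(k_i)\, q_{a,k_i}, \quad \forall i \in \mathcal{N}_a,\ t,\ \tau \tag{1}$$

where $y_\tau(k_i) \in \{0, 1\}$ is the on-off commitment of capacitor $k_i$ for the entire interval $\tau$. Clearly, if $y_\tau(k_i) = 1$, a constant amount (nameplate value) of reactive power $q_{a,k_i}$ is injected in the grid during this interval, and $y_\tau(k_i) = 0$ otherwise. For convenience, the on-off decisions of capacitor units at interval $\tau$ are collected in a column vector $\mathbf{y}_\tau \in \{0,1\}^{N_a}$.
On the other hand, the reactive power generated by inverter $i \in \mathcal{N}_r$ is adjusted on the fast timescale (every slot $t$), which is constrained as

$$\left| q^g_{i,\tau}(t) \right| \le \bar{q}^g_i := \sqrt{(\bar{s}_i)^2 - (\bar{p}^g_i)^2}, \quad \forall i \in \mathcal{N}_r,\ t,\ \tau \tag{2}$$

where $\bar{s}_i$ and $\bar{p}^g_i$ are the nameplate values of apparent power and active power of inverter $i$, respectively; see e.g., [12].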
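As a small numerical illustration of the two device models in (1) and (2), the following Python sketch computes a capacitor's injection from its on-off status and an inverter's reactive power limit from its nameplate ratings. The function names and the numeric ratings are hypothetical and chosen only for illustration; units are arbitrary but must be consistent.

```python
import math


def capacitor_injection(on_off, nameplate_q):
    """Reactive power injected by one capacitor over a whole interval, as in (1)."""
    if on_off not in (0, 1):
        raise ValueError("on_off must be 0 or 1")
    return on_off * nameplate_q


def inverter_q_limit(s_rated, p_rated):
    """Largest |q| available from an inverter under the bound in (2)."""
    return math.sqrt(s_rated ** 2 - p_rated ** 2)


def clip_setpoint(q_request, q_limit):
    """Project a requested setpoint onto the interval [-q_limit, q_limit]."""
    return max(-q_limit, min(q_limit, q_request))


q_cap = capacitor_injection(1, 150.0)   # capacitor switched on (illustrative rating)
q_max = inverter_q_limit(5.0, 4.0)      # illustrative apparent/active ratings
q_set = clip_setpoint(-4.2, q_max)      # a request beyond the limit gets clipped
```

The clipping helper reflects that (2) is a simple box constraint per inverter, which is what keeps the fast-timescale problems of Section III convex.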
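II-B Two-timescale voltage regulation formulation
Given real-time load consumption $(\mathbf{p}^c_\tau(t), \mathbf{q}^c_\tau(t))$ and generation $\mathbf{p}^g_\tau(t)$ that we model as Markovian processes [25], the task of voltage regulation is to find the optimal reactive power support per slot by configuring capacitors in every interval and adjusting inverter outputs in every slot, such that the long-term average voltage deviation is minimized. As voltage magnitudes $\mathbf{v}_\tau(t)$ depend solely on the control variables $\mathbf{q}^g_\tau(t)$, they are expressed as implicit functions of $\mathbf{q}^g_\tau(t)$, yielding $\mathbf{v}_\tau(\mathbf{q}^g_\tau(t))$, whose actual function forms for postulated grid models will be given in Section III. The novel two-timescale voltage control scheme entails solving the following stochastic optimization problem

$$\underset{\{\mathbf{q}^g_\tau(t)\},\, \{\mathbf{y}_\tau\}}{\text{minimize}} \quad \mathbb{E}\left[ \sum_{\tau=1}^{\infty} \gamma^{\tau} \sum_{t=1}^{T} \left\| \mathbf{v}_\tau(\mathbf{q}^g_\tau(t)) - v_0 \mathbf{1} \right\|^2 \right] \tag{3a}$$
$$\text{subject to} \quad q^g_{i,\tau}(t) = y_\tau(k_i)\, q_{a,k_i}, \quad \forall i \in \mathcal{N}_a,\ t,\ \tau \tag{3b}$$
$$\left| q^g_{i,\tau}(t) \right| \le \bar{q}^g_i, \quad \forall i \in \mathcal{N}_r,\ t,\ \tau \tag{3c}$$
$$\mathbf{y}_\tau \in \{0, 1\}^{N_a}, \quad \forall \tau \tag{3d}$$

for some discount factor $\gamma \in (0, 1)$, where the expectation is taken over the joint distribution of $(\mathbf{p}^c_\tau(t), \mathbf{q}^c_\tau(t), \mathbf{p}^g_\tau(t))$ across all intervals and slots. Clearly, the optimization problem (3) involves infinitely many variables $\{\mathbf{q}^g_\tau(t)\}$ and $\{\mathbf{y}_\tau\}$, which are coupled across time via the cost function and the constraint (3b). Moreover, the discrete variables $\mathbf{y}_\tau$ render problem (3) nonconvex and generally NP-hard. Last but not least, it is a multi-stage optimization, whose decisions are not all made at the same stage, and must also account for the power variability during real-time operation. In short, tackling (3) exactly is challenging.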
Instead, our objective is to design real-time algorithms that sequentially observe predictions $\{\mathbf{p}^c_\tau(t), \mathbf{q}^c_\tau(t), \mathbf{p}^g_\tau(t)\}$, and solve near optimally the stochastic optimization problem (3). The working assumption is that, even though no distributional knowledge of the stochastic processes involved is given, their realizations can be made available in real time, by means of e.g., accurate forecasting methods [26]. In this sense, the physics governing power systems will be utilized together with data to solve (3) in real time. Specifically, on the slow timescale, say at the end of each interval $\tau - 1$, the optimal on-off capacitor decisions $\mathbf{y}_\tau$ will be set through a DRL algorithm that can learn from the predictions collected within the current interval $\tau - 1$; while, on the fast timescale, say at the beginning of each slot $t$ within interval $\tau$, our two-stage control scheme will compute the optimal reactive power setpoints for inverters, by minimizing the instantaneous bus voltage deviations while respecting physical constraints, given the current on-off commitment of capacitor units $\mathbf{y}_\tau$ found at the very end of interval $\tau - 1$. The fast and slow timescales are elaborated in Sections III and IV, respectively.
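The nesting of the two timescales described above can be sketched as a plain loop: one capacitor decision per interval, followed by one inverter optimization per slot, with the accumulated interval cost handed back to the learner. This is only a structural sketch; the callables `choose_capacitors`, `solve_inverters`, and `learn`, as well as the placeholder state, are hypothetical names and not part of the original scheme.

```python
def run_two_timescale(num_intervals, slots_per_interval,
                      choose_capacitors, solve_inverters, learn):
    """Nested loop: capacitors are set once per interval, inverters once per slot."""
    state, history = None, []
    for tau in range(num_intervals):
        y = choose_capacitors(state)              # slow timescale decision
        interval_cost = 0.0
        for t in range(slots_per_interval):
            # Fast timescale: returns the voltage deviation incurred in slot t.
            interval_cost += solve_inverters(tau, t, y)
        next_state = (tau, tuple(y))              # placeholder state (assumption)
        learn(state, y, interval_cost, next_state)
        state = next_state
        history.append((tuple(y), interval_cost))
    return history


# Usage with stub callables standing in for the DRL agent and the solver.
history = run_two_timescale(
    num_intervals=2, slots_per_interval=3,
    choose_capacitors=lambda state: [1, 0],
    solve_inverters=lambda tau, t, y: 0.5,
    learn=lambda s, a, c, s_next: None,
)
```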
III Fast-timescale Optimization of Inverters
As alluded to earlier, the actual forms of $\mathbf{v}_\tau(\mathbf{q}^g_\tau(t))$ will be specified in this section, relying on the exact AC model or a linearized approximant of it. Leveraging convex relaxation to deal with the nonconvexity, the considered AC model yields a second-order cone program (SOCP), whereas the linearized one leads to a linearly constrained quadratic program. The latter offers an approximate yet computationally more affordable alternative to the former. Selecting between these two models depends on the computational resources available.
III-A Branch flow model
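Due to the radial structure of distribution grids, every non-root bus $i \in \mathcal{N}$ has a unique parent bus termed $\pi_i$. The two are joined through the $i$-th distribution line represented by $(\pi_i, i) \in \mathcal{L}$ having impedance $r_i + jx_i$. Let $P_{i,\tau}(t) + jQ_{i,\tau}(t)$ stand for the complex power flowing from buses $\pi_i$ to $i$ seen at the 'front' end at time slot $t$ of interval $\tau$, as depicted in Fig. 2. Throughout this section, the interval index $\tau$ will be dropped when it is clear from the context.
With $\ell_i$ further denoting the squared current magnitude on line $i$, the celebrated branch flow model is described by the following equations for all buses $i \in \mathcal{N}$, and for all $t$ within every interval $\tau$ [27, 28]

$$p_i = \sum_{j \in \chi_i} P_j - P_i + r_i \ell_i \tag{4a}$$
$$q_i = \sum_{j \in \chi_i} Q_j - Q_i + x_i \ell_i \tag{4b}$$
$$v_i = v_{\pi_i} - 2\left(r_i P_i + x_i Q_i\right) + \left(r_i^2 + x_i^2\right) \ell_i \tag{4c}$$
$$P_i^2 + Q_i^2 = v_{\pi_i} \ell_i \tag{4d}$$

where we have ignored the dependence on $t$ for notational brevity; and $\chi_i$ denotes the set of all children buses for bus $i$. Likewise, we collect $\{P_i\}$, $\{Q_i\}$, and $\{\ell_i\}$ into vectors $\mathbf{P}$, $\mathbf{Q}$, and $\boldsymbol{\ell}$, accordingly.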
Equations (4a)-(4c) are linear in variables $\mathbf{v}$, $\boldsymbol{\ell}$, $\mathbf{P}$, and $\mathbf{Q}$. Nonetheless, the set of equations in (4d) is quadratic in $\mathbf{P}$ and $\mathbf{Q}$, giving rise to a nonconvex feasible set. To address this challenge, consider relaxing the equalities in (4d) into inequalities (a.k.a. hyperbolic relaxation, see e.g., [13])

$$P_i^2 + Q_i^2 \le v_{\pi_i} \ell_i, \quad \forall i \in \mathcal{N} \tag{5}$$

which can be equivalently rewritten as the following second-order cone constraints

$$\left\| \begin{bmatrix} 2P_i \\ 2Q_i \\ \ell_i - v_{\pi_i} \end{bmatrix} \right\| \le v_{\pi_i} + \ell_i, \quad \forall i \in \mathcal{N} \tag{6}$$

Equations (4a)-(4c), together with (6), now define a convex feasible set. Recent efforts have leveraged this relaxed set (instead of the original nonconvex one) to study several key grid management tasks; see e.g., [28, 14] for recent surveys. This procedure is also known as SOCP relaxation. Interestingly, it has been shown that under certain conditions, SOCP relaxation is exact in the sense that the set of inequalities (6) holds with equalities at the optimum; see [29] and references therein.
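The equivalence between the hyperbolic inequality and its second-order cone form can be checked numerically. The sketch below assumes the standard forms $P^2 + Q^2 \le v\ell$ and $\|[2P,\ 2Q,\ \ell - v]\| \le v + \ell$ for a single line, with $v, \ell \ge 0$; the function names and sampling ranges are illustrative only. Squaring the cone form gives $4(P^2 + Q^2) + (\ell - v)^2 \le (v + \ell)^2$, which simplifies to the hyperbolic form.

```python
import math
import random


def hyperbolic_ok(P, Q, v, ell):
    """Relaxed version of (4d): P^2 + Q^2 <= v * ell."""
    return P * P + Q * Q <= v * ell


def soc_ok(P, Q, v, ell):
    """Second-order cone form: ||[2P, 2Q, ell - v]|| <= v + ell."""
    return math.sqrt((2 * P) ** 2 + (2 * Q) ** 2 + (ell - v) ** 2) <= v + ell


random.seed(0)
mismatches = 0
for _ in range(1000):
    P, Q = random.uniform(-1, 1), random.uniform(-1, 1)
    v, ell = random.uniform(0.8, 1.2), random.uniform(0.0, 2.0)
    if abs(P * P + Q * Q - v * ell) < 1e-6:
        continue  # skip points numerically on the boundary
    mismatches += hyperbolic_ok(P, Q, v, ell) != soc_ok(P, Q, v, ell)
```

Both tests agree on every sampled point away from the boundary, as the algebra predicts.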
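Given the capacitor configuration $\mathbf{y}_\tau$ found at the end of the last interval $\tau - 1$, under the aforementioned relaxed grid model, the voltage regulation on the fast timescale based on the exact AC model can be described as follows

$$\underset{\mathbf{v},\, \boldsymbol{\ell},\, \mathbf{P},\, \mathbf{Q},\, \mathbf{q}^g}{\text{minimize}} \quad \left\| \mathbf{v}_\tau(t) - v_0 \mathbf{1} \right\|^2 \tag{7a}$$
$$\text{subject to} \quad q^g_{i,\tau}(t) = y_\tau(k_i)\, q_{a,k_i}, \quad \forall i \in \mathcal{N}_a \tag{7b}$$
$$\left| q^g_{i,\tau}(t) \right| \le \bar{q}^g_i, \quad \forall i \in \mathcal{N}_r \tag{7c}$$
$$\text{(4a)-(4c), and (6)} \tag{7d}$$

which is readily a convex SOCP and can be efficiently solved by off-the-shelf convex programming toolboxes. The optimal setpoints of smart inverters for the exact AC model are found as the minimizer of (7).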
Nevertheless, solving SOCPs could be computationally demanding when dealing with relatively large-scale distribution grids, say of several hundred buses. Trading off modeling accuracy for computational efficiency, our next instantiation of the fast-timescale voltage control relies on an approximate grid model.
III-B Linearized power flow model
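The linearized distribution flow model can be obtained as follows. Since the line squared current magnitudes $\{\ell_i\}$ are relatively small compared to the line flows, the last term in (4a)-(4c) can be ignored, yielding the next linear equations for all buses $i \in \mathcal{N}$, and for all $t$ in every interval $\tau$ [30]

$$p_i = \sum_{j \in \chi_i} P_j - P_i \tag{8a}$$
$$q_i = \sum_{j \in \chi_i} Q_j - Q_i \tag{8b}$$
$$v_i = v_{\pi_i} - 2\left(r_i P_i + x_i Q_i\right) \tag{8c}$$

In this fashion, all squared voltage magnitudes $\mathbf{v}$ can be expressed as linear functions of $\mathbf{q}^g$.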
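To make the linear dependence concrete, the following sketch evaluates the linearized model on a small radial feeder with a backward sweep for the line flows and a forward sweep for the voltages. It assumes the usual form of the model, namely $P_i = \sum_{j \in \chi_i} P_j - p_i$ (likewise for $Q_i$) and $v_i = v_{\pi_i} - 2(r_i P_i + x_i Q_i)$, and that buses are numbered so that every parent index is smaller than its child's. The three-bus feeder data are made up for illustration.

```python
def lindistflow_voltages(parent, r, x, p, q, v0=1.0):
    """Squared voltage magnitudes under the linearized flow model.

    parent[i] is the parent of bus i (bus 0 is the substation); r[i], x[i]
    belong to the line feeding bus i; p[i], q[i] are net injections at bus i.
    Entry 0 of r, x, p, q is unused. Requires parent[i] < i for all i >= 1.
    """
    n = len(parent)
    # Backward sweep: flow into bus i equals downstream flows minus injection.
    P = [0.0] + [-p[i] for i in range(1, n)]
    Q = [0.0] + [-q[i] for i in range(1, n)]
    for i in range(n - 1, 0, -1):
        if parent[i] != 0:
            P[parent[i]] += P[i]
            Q[parent[i]] += Q[i]
    # Forward sweep: voltage drop along each line.
    v = [v0] * n
    for i in range(1, n):
        v[i] = v[parent[i]] - 2.0 * (r[i] * P[i] + x[i] * Q[i])
    return v


# Three-bus line 0 -> 1 -> 2: a load at bus 1, net PV generation at bus 2.
v = lindistflow_voltages(parent=[0, 0, 1],
                         r=[0.0, 0.01, 0.02], x=[0.0, 0.02, 0.04],
                         p=[0.0, -0.5, 0.3], q=[0.0, -0.2, 0.0])
```

In this toy case the load pulls the voltage at bus 1 below nominal, while the generation at bus 2 pushes its voltage back up, which is the kind of profile the inverter setpoints are meant to flatten.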
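Adopting the approximate model in (8), the optimal setpoints of smart inverters can be found by solving the following optimization problem per slot $t$ in interval $\tau$, provided $\mathbf{y}_\tau$ is available from the last interval on the slow timescale

$$\underset{\mathbf{v},\, \mathbf{P},\, \mathbf{Q},\, \mathbf{q}^g}{\text{minimize}} \quad \left\| \mathbf{v}_\tau(t) - v_0 \mathbf{1} \right\|^2 \tag{9a}$$
$$\text{subject to} \quad q^g_{i,\tau}(t) = y_\tau(k_i)\, q_{a,k_i}, \quad \forall i \in \mathcal{N}_a \tag{9b}$$
$$\left| q^g_{i,\tau}(t) \right| \le \bar{q}^g_i, \quad \forall i \in \mathcal{N}_r \tag{9c}$$
$$\text{(8a)-(8c)} \tag{9d}$$

As all constraints are linear and the cost is quadratic, (9) constitutes a standard convex quadratic program. As such, it can be solved efficiently by e.g., primal-dual algorithms, or off-the-shelf convex programming solvers, whose implementation details are skipped due to space limitations.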
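For intuition on a problem of this type, the sketch below minimizes the squared voltage deviation over inverter setpoints with box limits, using projected gradient descent. This is a swapped-in, simple solver chosen for readability, not the primal-dual method or toolbox mentioned above. It assumes the linearized model in its closed form $v_i = v_0 + 2\sum_k (R_{ik} p_k + X_{ik} q_k)$, where $R_{ik}$ ($X_{ik}$) sums the resistances (reactances) of the lines shared by the root paths of buses $i$ and $k$; all names and feeder data are hypothetical.

```python
def shared_path(parent, w, i, k):
    """Sum of line weights w over the lines shared by the root paths of i and k."""
    on_path_i = set()
    while i != 0:
        on_path_i.add(i)
        i = parent[i]
    total = 0.0
    while k != 0:
        if k in on_path_i:
            total += w[k]
        k = parent[k]
    return total


def fast_timescale_qp(parent, r, x, p, q, inverters, q_max, v0=1.0, iters=500):
    """Projected gradient descent on sum_i (v_i - v0)^2 over inverter setpoints."""
    buses = range(1, len(parent))
    R = {(i, k): shared_path(parent, r, i, k) for i in buses for k in buses}
    X = {(i, k): shared_path(parent, x, i, k) for i in buses for k in buses}

    def voltages(q_now):
        return {i: v0 + 2.0 * sum(R[i, k] * p[k] + X[i, k] * q_now[k] for k in buses)
                for i in buses}

    q = list(q)  # entries at non-inverter buses stay fixed (loads, capacitors)
    # Step 1/L, with the Lipschitz constant L bounded via the Frobenius norm of X.
    step = 1.0 / (8.0 * sum(X[i, k] ** 2 for i in buses for k in inverters))
    for _ in range(iters):
        v = voltages(q)
        for k in inverters:
            grad = 4.0 * sum((v[i] - v0) * X[i, k] for i in buses)
            q[k] = max(-q_max[k], min(q_max[k], q[k] - step * grad))
    return q, voltages(q)


feeder = dict(parent=[0, 0, 1], r=[0.0, 0.01, 0.02], x=[0.0, 0.02, 0.04],
              p=[0.0, -0.5, 0.3], q=[0.0, -0.2, 0.0], inverters=[2])
q_free, v_free = fast_timescale_qp(q_max={2: 0.5}, **feeder)
q_tight, v_tight = fast_timescale_qp(q_max={2: 0.01}, **feeder)
```

With a generous limit the inverter at bus 2 settles at the unconstrained minimizer; with a tight limit the setpoint sits on the bound, which is the behavior the box constraint (9c) is meant to enforce.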
IV Slow-timescale Capacitor Reconfiguration
Here we deal with reconfiguration of shunt capacitors on the slow timescale. This amounts to determining their on-off status in the ensuing interval. Past approaches to solving the resultant integer-valued optimization were heuristic, or relied on semidefinite programming relaxation. They do not guarantee optimality, and they also incur high computational and storage complexities. We take a different route by drawing from advances in artificial intelligence, to develop data-driven solutions that could near optimally learn, track, as well as adapt to unknown generation and consumption dynamics.
IV-A A data-driven solution
It is evident from (7b) and (9b) that the capacitor decisions $\mathbf{y}_\tau$ made at the end of interval $\tau - 1$ (slow-timescale learning) influence the inverters' setpoints during the entire interval $\tau$ (fast-timescale optimization). Conversely, the voltage regulation performed by smart inverters influences the capacitor commitment for the next interval. This two-way interaction between the capacitor configuration and the optimal setpoints of smart inverters motivates our RL formulation. RL deals with learning action-taking policy functions in an environment with action-dependent dynamically evolving states and costs. By interacting with the environment (through successive actions as well as observed states and costs), RL seeks a policy function (of states) to draw actions from, in order to minimize the average cumulative cost [31].
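Modeling load demand and solar generation dynamics as Markovian processes, the optimal configuration of shunt capacitors can be formulated as an MDP, which can be efficiently solved through RL algorithms. An MDP is defined as a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, c, \gamma)$, where $\mathcal{S}$ is a set of states; $\mathcal{A}$ is a set of actions; $\mathcal{P}$ is a set of transition matrices; $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a cost function such that, for $\mathbf{s} \in \mathcal{S}$ and $\mathbf{a} \in \mathcal{A}$, $c(\mathbf{s}, \mathbf{a})$ are the real-valued immediate costs after the system operator takes an action $\mathbf{a}$ at state $\mathbf{s}$; and $\gamma \in (0, 1)$ is the discount factor. These components are defined next before introducing our voltage regulation scheme.
Action space $\mathcal{A}$. Each action corresponds to one possible on-off commitment of capacitors $1$ to $N_a$, giving rise to an action vector $\mathbf{a}_\tau = \mathbf{y}_\tau$ per interval $\tau$. The set of binary action vectors constitutes the action space $\mathcal{A}$, whose cardinality is exponential in the number of capacitors, meaning $|\mathcal{A}| = 2^{N_a}$.
State space $\mathcal{S}$. This includes per interval $\tau$ the average active power $\bar{\mathbf{p}}_\tau := \frac{1}{T} \sum_{t=1}^{T} \mathbf{p}_\tau(t)$ at all buses except for the substation, along with the current capacitor configurations; that is, $\mathbf{s}_\tau := [\bar{\mathbf{p}}_\tau^\top, \mathbf{y}_\tau^\top]^\top$, which contains both continuous and discrete variables. Clearly, it holds that $\mathcal{S} \subseteq \mathbb{R}^{N} \times \{0, 1\}^{N_a}$.
The action is determined according to the configuration policy $\pi$ which is a function of the most recent state $\mathbf{s}_{\tau-1}$, given as

$$\mathbf{a}_\tau = \pi(\mathbf{s}_{\tau-1}) \tag{10}$$

Cost function $c$. The cost on the slow timescale is

$$c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau) = \sum_{t=1}^{T} \left\| \mathbf{v}_\tau(t) - v_0 \mathbf{1} \right\|^2 \tag{11}$$

Set of transition probability matrices $\mathcal{P}$. While being at a state $\mathbf{s} \in \mathcal{S}$ upon taking an action $\mathbf{a}$, the system moves to a new state $\mathbf{s}' \in \mathcal{S}$ probabilistically. Let $P^{\mathbf{a}}_{\mathbf{s}\mathbf{s}'}$ denote the transition probability from state $\mathbf{s}$ to the next state $\mathbf{s}'$ under a given action $\mathbf{a}$. Evidently, it holds that $\mathcal{P} := \{ P^{\mathbf{a}}_{\mathbf{s}\mathbf{s}'} \mid \mathbf{a} \in \mathcal{A} \}$.
Discount factor $\gamma$. The discount factor $\gamma \in (0, 1)$ trades off the current versus future costs. The smaller $\gamma$ is, the more weight the current cost has in the overall cost.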
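The MDP ingredients above translate into a few lines of code. The sketch below enumerates the binary action space for a given number of capacitors, stacks average active powers and capacitor statuses into a state vector, and evaluates a discounted sum of interval costs. The helper names and the sample numbers are illustrative assumptions.

```python
from itertools import product


def action_space(num_capacitors):
    """All on-off commitments of the capacitors: 2 ** num_capacitors vectors."""
    return [list(bits) for bits in product((0, 1), repeat=num_capacitors)]


def build_state(avg_active_power, capacitor_status):
    """State vector: average active power per bus, then the capacitor statuses."""
    return list(avg_active_power) + list(capacitor_status)


def discounted_cost(costs, gamma):
    """Sum of gamma**k * cost_k over a finite horizon of interval costs."""
    return sum(gamma ** k * c for k, c in enumerate(costs))


actions = action_space(3)                       # 8 actions for 3 capacitors
state = build_state([0.2, -0.1, 0.4], actions[5])
total = discounted_cost([1.0, 1.0, 1.0], gamma=0.5)
```

Because the action count doubles with every added capacitor, an output layer with one unit per action (as used later) is practical only for a modest number of capacitors.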
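Given the current state and action, the so-termed action-value function under the control policy $\pi$ is defined as

$$Q_\pi(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau) := \mathbb{E}\left[ \sum_{\tau'=\tau}^{\infty} \gamma^{\tau' - \tau}\, c(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}) \,\Big|\, \pi, \mathbf{s}_{\tau-1}, \mathbf{a}_\tau \right] \tag{12}$$

where the expectation is taken with respect to all sources of randomness.
To find the optimal capacitor configuration policy $\pi^*$ that minimizes the average voltage deviation in the long run, we resort to the Bellman optimality equations; see e.g., [31]. Solving those yields the action-value function under the optimal policy $\pi^*$, given by

$$Q_{\pi^*}(\mathbf{s}, \mathbf{a}) = \mathbb{E}\left[ c(\mathbf{s}, \mathbf{a}) \right] + \gamma \sum_{\mathbf{s}' \in \mathcal{S}} P^{\mathbf{a}}_{\mathbf{s}\mathbf{s}'} \min_{\mathbf{a}' \in \mathcal{A}} Q_{\pi^*}(\mathbf{s}', \mathbf{a}') \tag{13}$$

With $Q_{\pi^*}$ obtained, the optimal capacitor configuration policy can be found as

$$\pi^*(\mathbf{s}) = \arg\min_{\mathbf{a} \in \mathcal{A}} Q_{\pi^*}(\mathbf{s}, \mathbf{a}) \tag{14}$$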
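When the transition probabilities are known, the Bellman optimality recursion can be iterated directly. The sketch below does so for a toy cost-minimizing MDP with two states and two actions, then reads off the greedy policy. The toy transition matrices and costs are made up for illustration and have no connection to the feeder studied here.

```python
def q_value_iteration(P, c, gamma, sweeps=500):
    """Iterate Q(s,a) = c(s,a) + gamma * sum_s2 P[a][s][s2] * min_a2 Q(s2,a2).

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    and c[s][a] is the expected immediate cost.
    """
    n_states, n_actions = len(c), len(c[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        best = [min(row) for row in Q]
        Q = [[c[s][a] + gamma * sum(P[a][s][s2] * best[s2] for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
    return Q


def greedy_policy(Q):
    """Pick, for every state, the action with the smallest Q-value."""
    return [row.index(min(row)) for row in Q]


# Toy MDP: action 0 tends to keep the state, action 1 tends to move to state 0.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.8, 0.2], [0.8, 0.2]]]
c = [[1.0, 2.0], [5.0, 3.0]]
Q_star = q_value_iteration(P, c, gamma=0.9)
policy = greedy_policy(Q_star)
```

This route needs the full transition model, which is exactly what is unavailable in practice and what motivates the model-free approach discussed next.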
It is clear from (13) that if all transition probabilities $\{P^{\mathbf{a}}_{\mathbf{s}\mathbf{s}'}\}$ were available, we could derive $Q_{\pi^*}$, and subsequently the optimal policy $\pi^*$ from (14). Nonetheless, obtaining those transition probabilities is impractical in real distribution systems. This calls for approaches that aim directly at $\pi^*$, without assuming any knowledge of $\{P^{\mathbf{a}}_{\mathbf{s}\mathbf{s}'}\}$.
One celebrated approach of this kind is Q-learning, which can learn $\pi^*$ by approximating $Q_{\pi^*}$ 'on-the-fly' [31, p. 107]. Due to the high-dimensional continuous state space $\mathcal{S}$, however, tabular Q-learning is not applicable to the problem at hand. This motivates function approximation based Q-learning schemes that can deal with continuous state domains.
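For reference, a minimal tabular Q-learning sketch in the cost-minimizing convention is given below. It learns from sampled transitions only, without access to the transition probabilities. The two-state environment `toy_step`, the step size, and the exploration rate are illustrative assumptions; a table like this is feasible only because the toy state space is finite, which is precisely what fails for the continuous states considered here.

```python
import random


def q_learning_update(Q, s, a, cost, s_next, gamma, lr):
    """One Q-learning step: move Q[s][a] toward cost + gamma * min_a2 Q[s_next][a2]."""
    target = cost + gamma * min(Q[s_next])
    Q[s][a] += lr * (target - Q[s][a])
    return target


def toy_step(s, a, rng):
    """Hypothetical two-state environment returning (cost, next_state)."""
    cost = [[1.0, 2.0], [5.0, 3.0]][s][a]
    if a == 0:
        s_next = s if rng.random() < 0.9 else 1 - s
    else:
        s_next = 0 if rng.random() < 0.8 else 1
    return cost, s_next


def run_q_learning(step, n_states, n_actions, gamma=0.9, lr=0.05,
                   eps=0.2, steps=20000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        if rng.random() < eps:                 # explore
            a = rng.randrange(n_actions)
        else:                                  # exploit: smallest predicted cost
            a = Q[s].index(min(Q[s]))
        cost, s_next = step(s, a, rng)
        q_learning_update(Q, s, a, cost, s_next, gamma, lr)
        s = s_next
    return Q


Q_learned = run_q_learning(toy_step, n_states=2, n_actions=2)
```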
IV-B A deep reinforcement learning approach
Parameterizing the Q-function with a deep neural network (DNN) has lately been demonstrated to be effective in dealing with high-dimensional and/or continuous state spaces [32]. Deep RL based on the so-called deep Q-networks (DQN) was introduced in that context, and has been praised as yielding the first artificial agents to achieve human-level performance across diverse challenging domains.
DQN offers a NN function approximator of the Q-function, chosen to be e.g., a fully connected feed-forward NN, or a convolutional NN, depending on the application [32]. It takes as input the state vector, to generate at its output Q-values for all possible actions (one for each). As corroborated in [32], such a NN indeed enables learning the Q-values of all state-action pairs, from just a few observations obtained by interacting with the environment. Hence, it effectively addresses the challenge brought by the 'curse of dimensionality' [32]. Inspired by this, we employ a feed-forward NN to approximate the Q-function in our setting. Specifically, our DNN consists of fully connected hidden layers with ReLU activation functions, depicted in Fig. 3. At the input layer, each neuron is fed with one entry of the state vector $\mathbf{s}_{\tau-1}$, which, after passing through the ReLU layers, outputs a predicted Q-value for each of all possible actions (i.e., capacitor configurations). Since every output unit corresponds to a particular configuration of all $N_a$ capacitors, there is a total of $2^{N_a}$ neurons at the output layer. For ease of exposition, let us collect all weight parameters of this DQN into a vector $\boldsymbol{\theta}$; and denote the action-value function of a particular state-action pair with $Q(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau; \boldsymbol{\theta})$, which is an estimate for $Q_{\pi^*}(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau)$ (cf. Section IV-A). At the end of a given interval, upon passing the state vector through this DQN, the corresponding predicted Q-values for all possible actions become available at the output. Based on these predicted values, the system operator selects the action having the smallest predicted Q-value to be in effect over the next interval.
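The forward pass just described can be sketched in plain Python: a state vector goes through fully connected ReLU layers and produces one predicted Q-value per capacitor configuration, after which the action with the smallest value is selected. The layer width, the random initialization, the linear output layer, and the sample state below are all assumptions made for illustration; they are not the settings used in the tests of Section V.

```python
import random


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases for one fully connected layer."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def dense(layer, x):
    W, b = layer
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(z):
    return [max(0.0, zi) for zi in z]


def q_network(layers, state):
    """ReLU hidden layers followed by a linear output, one value per action."""
    h = list(state)
    for layer in layers[:-1]:
        h = relu(dense(layer, h))
    return dense(layers[-1], h)


def greedy_action(q_vals):
    """Index of the action with the smallest predicted cost-to-go."""
    return q_vals.index(min(q_vals))


rng = random.Random(0)
n_buses, n_caps, width = 4, 3, 16                 # hypothetical sizes
sizes = [n_buses + n_caps, width, width, 2 ** n_caps]
layers = [init_layer(sizes[j + 1], sizes[j], rng) for j in range(len(sizes) - 1)]
state = [0.2, -0.1, 0.4, 0.0, 1, 0, 1]            # average powers, then statuses
q_vals = q_network(layers, state)
action_index = greedy_action(q_vals)
```

The returned index can be mapped back to a binary on-off vector by enumerating the action space in a fixed order.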
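Intuitively, the weights $\boldsymbol{\theta}$ should be chosen such that the DQN outputs match well the actual Q-values for any input state vector. Toward this objective, the popular stochastic gradient descent (SGD) method is employed to update $\boldsymbol{\theta}$ 'on the fly' [32]. At the end of a given interval $\tau$, precisely when i) the system operator has made decision $\mathbf{a}_\tau$, ii) the grid has completed the transition from the state $\mathbf{s}_{\tau-1}$ to a new state $\mathbf{s}_\tau$, and iii) the network has incurred and revealed cost $c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau)$, we perform an SGD update based on the current estimate $\boldsymbol{\theta}_\tau$ to yield $\boldsymbol{\theta}_{\tau+1}$. The so-termed temporal-difference learning [31] confirms that a sample approximation of the optimal cost-to-go from interval $\tau$ is given by $c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau) + \gamma \min_{\mathbf{a}'} Q(\mathbf{s}_\tau, \mathbf{a}'; \boldsymbol{\theta}_\tau)$, where $c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau)$ is the immediate cost observed, and $\min_{\mathbf{a}'} Q(\mathbf{s}_\tau, \mathbf{a}'; \boldsymbol{\theta}_\tau)$ represents the smallest possible predicted cost-to-go from state $\mathbf{s}_\tau$, which can be computed through our DQN with weights $\boldsymbol{\theta}_\tau$, and is discounted by factor $\gamma$. In other words, the target value is readily available at the end of interval $\tau$. Adopting the $\ell_2$-norm error criterion, a meaningful approach to tuning the weights $\boldsymbol{\theta}$ entails minimizing the following loss function

$$L(\boldsymbol{\theta}) := \frac{1}{2} \Big[ c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau) + \gamma \min_{\mathbf{a}'} Q(\mathbf{s}_\tau, \mathbf{a}'; \boldsymbol{\theta}_\tau) - Q(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau; \boldsymbol{\theta}) \Big]^2 \tag{15}$$

for which the SGD update is given by

$$\boldsymbol{\theta}_{\tau+1} = \boldsymbol{\theta}_\tau + \beta_\tau \Big[ c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau) + \gamma \min_{\mathbf{a}'} Q(\mathbf{s}_\tau, \mathbf{a}'; \boldsymbol{\theta}_\tau) - Q(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau; \boldsymbol{\theta}_\tau) \Big] \nabla Q(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau; \boldsymbol{\theta}_\tau) \tag{16}$$

where $\beta_\tau > 0$ is a preselected learning rate, and $\nabla$ denotes the (sub)gradient.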
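The temporal-difference target and the resulting SGD step can be illustrated with a Q-function that is linear in the state, used here purely as a stand-in for the DNN so that the gradient is available in closed form (the gradient of the value of action $a$ with respect to its weight row is the state itself). The function names and numbers are hypothetical.

```python
def q_values(theta, s):
    """Linear stand-in for the DQN: one weight row per action."""
    return [sum(w * x for w, x in zip(row, s)) for row in theta]


def sgd_td_step(theta, s_prev, a, cost, s_next, gamma, beta):
    """One SGD step on the squared temporal-difference error.

    The target uses the current weights and is treated as a constant
    when differentiating, as in the update rule above.
    """
    target = cost + gamma * min(q_values(theta, s_next))
    td_error = target - q_values(theta, s_prev)[a]
    # For a linear model, dQ(s_prev, a)/dtheta[a] equals s_prev.
    theta[a] = [w + beta * td_error * x for w, x in zip(theta[a], s_prev)]
    return td_error


theta = [[0.0, 0.0], [0.0, 0.0]]
err = sgd_td_step(theta, s_prev=[1.0, 2.0], a=0, cost=3.0,
                  s_next=[0.0, 1.0], gamma=0.9, beta=0.1)
```

Only the weights tied to the action actually taken are modified, mirroring the fact that a single transition carries information about one state-action pair.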
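However, due to the compositional structure of DNNs, the update in (16) does not work well in practice. In fact, the resultant DQN oftentimes does not provide a stable result; see e.g., [33, 34]. To bypass these hurdles, several modifications have been introduced [33]. In this work, we adopt the target network and experience replay [32]. To this aim, let us define an experience $e_\tau := (\mathbf{s}_{\tau-1}, \mathbf{a}_\tau, c(\mathbf{s}_{\tau-1}, \mathbf{a}_\tau), \mathbf{s}_\tau)$ to be a tuple of state, action, cost, and the next state. Consider also maintaining a replay buffer on-the-fly, which stores the most recent $R$ experiences visited by the agent. For instance, the replay buffer at any interval $\tau$ is $\mathcal{R}_\tau := \{e_{\tau-R+1}, \ldots, e_\tau\}$. Furthermore, as another effective remedy for stabilizing the DQN updates, we replicate the DQN to create a second DNN, commonly referred to as the target network, whose weight parameters are concatenated in the vector $\bar{\boldsymbol{\theta}}$. It is worth highlighting that this target network is not trained, but its parameters $\bar{\boldsymbol{\theta}}$ are only periodically reset to real-time estimates of $\boldsymbol{\theta}$, say every $B$ training iterations of the DQN. Consider now the temporal-difference loss for some randomly drawn experience $e_{\tau'}$ from $\mathcal{R}_\tau$ at interval $\tau$

$$L(\boldsymbol{\theta}; e_{\tau'}) := \frac{1}{2} \Big[ c(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}) + \gamma \min_{\mathbf{a}'} Q(\mathbf{s}_{\tau'}, \mathbf{a}'; \bar{\boldsymbol{\theta}}) - Q(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}; \boldsymbol{\theta}) \Big]^2 \tag{17}$$

Upon taking expectation with respect to all sources of randomness generating this experience, we arrive at

$$L(\boldsymbol{\theta}) := \mathbb{E}_{e_{\tau'}}\left[ L(\boldsymbol{\theta}; e_{\tau'}) \right] \tag{18}$$

In practice, however, the underlying Markov distribution is unknown, which challenges evaluating and hence minimizing $L(\boldsymbol{\theta})$ exactly. A commonly adopted alternative is to approximate the expected loss with an empirical loss over a few samples (that is, experiences here). To this end, we draw a mini-batch of $M_\tau$ experiences uniformly at random from the replay buffer $\mathcal{R}_\tau$, whose indices are collected in the set $\mathcal{M}_\tau$, i.e., $\{e_{\tau'}\}_{\tau' \in \mathcal{M}_\tau}$. Upon computing for each of those sampled experiences an output using the target network with parameters $\bar{\boldsymbol{\theta}}$, the empirical loss is

$$\hat{L}(\boldsymbol{\theta}) := \frac{1}{2 M_\tau} \sum_{\tau' \in \mathcal{M}_\tau} \Big[ c(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}) + \gamma \min_{\mathbf{a}'} Q(\mathbf{s}_{\tau'}, \mathbf{a}'; \bar{\boldsymbol{\theta}}) - Q(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}; \boldsymbol{\theta}) \Big]^2 \tag{19}$$

In a nutshell, the weight parameter vector $\boldsymbol{\theta}$ of the DQN is efficiently updated 'on-the-fly' using SGD over the empirical loss $\hat{L}(\boldsymbol{\theta})$, with iterates given by

$$\boldsymbol{\theta}_{\tau+1} = \boldsymbol{\theta}_\tau + \frac{\beta_\tau}{M_\tau} \sum_{\tau' \in \mathcal{M}_\tau} \Big[ c(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}) + \gamma \min_{\mathbf{a}'} Q(\mathbf{s}_{\tau'}, \mathbf{a}'; \bar{\boldsymbol{\theta}}) - Q(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}; \boldsymbol{\theta}_\tau) \Big] \nabla Q(\mathbf{s}_{\tau'-1}, \mathbf{a}_{\tau'}; \boldsymbol{\theta}_\tau) \tag{20}$$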
Incorporating target network and experience replay remedies for stable DRL, our proposed two-timescale voltage regulation scheme is summarized in Alg. 1.
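Since Alg. 1 is not reproduced here, the following sketch only illustrates how the two stabilizing ingredients fit together: a bounded replay buffer, a mini-batch update whose targets come from a frozen copy of the weights, and a periodic reset of that copy. A linear Q-function again stands in for the DNN, and the class name, hyperparameters, and toy experiences are hypothetical.

```python
import copy
import random
from collections import deque


class LinearDQN:
    """DQN-style learner with a linear Q-function standing in for the DNN."""

    def __init__(self, n_actions, state_dim, buffer_size, batch_size,
                 gamma, beta, sync_every, seed=0):
        self.theta = [[0.0] * state_dim for _ in range(n_actions)]
        self.theta_target = copy.deepcopy(self.theta)   # target network weights
        self.buffer = deque(maxlen=buffer_size)         # most recent experiences
        self.batch_size, self.gamma, self.beta = batch_size, gamma, beta
        self.sync_every, self.updates = sync_every, 0
        self.rng = random.Random(seed)

    @staticmethod
    def q_values(theta, s):
        return [sum(w * x for w, x in zip(row, s)) for row in theta]

    def act(self, s, eps=0.0):
        """Epsilon-greedy choice of the action with the smallest predicted cost."""
        if self.rng.random() < eps:
            return self.rng.randrange(len(self.theta))
        q = self.q_values(self.theta, s)
        return q.index(min(q))

    def observe(self, s_prev, a, cost, s_next):
        """Store one experience, then take one mini-batch SGD step."""
        self.buffer.append((s_prev, a, cost, s_next))
        batch = self.rng.sample(list(self.buffer),
                                min(self.batch_size, len(self.buffer)))
        step = [[0.0] * len(row) for row in self.theta]
        for sp, ap, c, sn in batch:
            target = c + self.gamma * min(self.q_values(self.theta_target, sn))
            err = target - self.q_values(self.theta, sp)[ap]
            step[ap] = [g + err * x for g, x in zip(step[ap], sp)]
        scale = self.beta / len(batch)
        self.theta = [[w + scale * g for w, g in zip(row, grow)]
                      for row, grow in zip(self.theta, step)]
        self.updates += 1
        if self.updates % self.sync_every == 0:          # periodic target reset
            self.theta_target = copy.deepcopy(self.theta)


agent = LinearDQN(n_actions=2, state_dim=2, buffer_size=2, batch_size=1,
                  gamma=0.9, beta=0.5, sync_every=2)
agent.observe([1.0, 0.0], 1, 2.0, [0.0, 1.0])
```

Sampling from the buffer rather than using only the latest transition is what breaks the correlation between consecutive observations, while the frozen target keeps the regression target from moving at every step.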
V Numerical Tests
The two-timescale voltage regulation scheme presented in Alg. 1 is numerically examined using the Southern California Edison 47-bus distribution feeder [13], depicted in Fig. 4. This feeder is integrated with four shunt capacitors installed on buses 1, 3, 37, and 47, and five large PV plants on buses 2, 16, 18, 21, and 22. As the voltage magnitude of the substation bus is regulated to be a constant $v_0$ in all our tests through a voltage transformer, the capacitor at the substation was excluded from our control. Thus, a total of three shunt capacitors along with five smart inverters embedded with the PV plants were engaged in real-time voltage regulation. To test our scheme in a realistic setting, real consumption as well as solar generation data were obtained from the Smart project collected on August 24, 2011 [35], which were first preprocessed by following the procedure described in our precursor work [12].
In our tests, to match the availability of real data, every slot was set to a minute, while every interval was five minutes. A power factor of 0.8 was assumed for all loads. The DQN used a fully connected feed-forward neural network with two hidden layers, which was found sufficient for the task at hand. ReLU activation functions (namely, $\sigma(x) = \max(x, 0)$) were employed in the hidden layers, and logistic sigmoid functions were used at the output layer. A replay buffer of size $R$, a discount factor $\gamma$, and a mini-batch size $M_\tau$ were used; during training, the target network was updated every $B$ iterations. To benchmark the performance of our proposed scheme, we simulated a fixed capacitor configuration policy as well as a randomly switching policy as baselines. As in our proposed approach, both schemes compute the optimal setpoints for inverters by solving (7) or (9) on a fast timescale, while the former employs a fixed capacitor configuration throughout this experiment, and the latter switches its capacitor configuration randomly at every slow-timescale interval.
We first examined our DRL-based voltage control approach using the linearized power flow model. The immediate costs incurred by the three simulated schemes over the first intervals are plotted in Fig. 5. Evidently, the proposed scheme attains a lower cost than the other two after a short period of learning and interacting with the environment. Fig. 6 depicts the successive actions (that is, the on-off commitment of capacitors) taken by the three approaches in real time. Since there are $N_a = 3$ capacitors under configuration in this 47-bus feeder, the number of valid actions is $2^3 = 8$. The jumps reveal the learning ability of our DRL scheme. In addition, voltage magnitude profiles at all buses regulated by the three schemes are presented in Fig. 7. Again, after a short period of training by interacting with the environment, our DRL-based voltage control scheme quickly learns a stable and (near) optimal policy. The curves showcase the effectiveness of the DRL scheme in smoothing voltage fluctuations incurred due to large solar generation as well as heavy load demand.
To further assess the performance of our novel scheme, tests were replicated using the exact AC grid model. Fig. 8 depicts the immediate costs incurred by the three simulated schemes, over the first intervals. The curves again show that the proposed scheme results in smaller voltage deviations than its competing alternatives. The corresponding actions taken are shown in Fig. 9. Real-time voltage magnitude profiles of all buses under the three approaches are plotted in Fig. 10, which corroborate the merits of our two-timescale DRL-based voltage regulation scheme in real-world settings.
VI Conclusions
In this work, joint control of traditional utility-owned equipment and contemporary smart inverters for voltage regulation through reactive power provision was investigated. To address the different response times of those assets, a real-time two-timescale approach to minimizing bus voltage deviations from their nominal values was put forth, by combining physics- and data-driven stochastic optimization. Load consumption and active power generation dynamics were modeled as Markovian processes, and the capacitor configuration task was cast as a Markov decision process. On a fast timescale, the setpoints of smart inverters were found by minimizing instantaneous bus voltage deviations, while on a slower timescale, capacitor banks were configured to minimize long-term expected voltage deviations using a deep reinforcement learning algorithm. The developed voltage regulation scheme was shown to be efficient and easy to implement, through numerical tests on a real-world distribution feeder using real solar and consumption data.
References
 [1] P. M. Carvalho, P. F. Correia, and L. A. Ferreira, “Distributed reactive power generation control for voltage rise mitigation in distribution networks,” IEEE Trans. Power Syst., vol. 23, no. 2, pp. 766–772, May 2008.
 [2] W. H. Kersting, Distribution System Modeling and Analysis. New York, NY, USA: CRC Press, 2006.
 [3] P. Kundur, N. J. Balu, and M. G. Lauby, Power System Stability and Control. New York, NY, USA: McGraw-Hill, 1994.
 [4] B. A. Robbins, H. Zhu, and A. D. Domínguez-García, “Optimal tap setting of voltage regulation transformers in unbalanced distribution systems,” IEEE Trans. Power Syst., vol. 31, no. 1, pp. 256–267, Feb. 2016.
 [5] M. Bazrafshan, N. Gatsis, and H. Zhu, “Optimal tap selection of step-voltage regulators in multiphase distribution networks,” in Power Syst. Comput. Conf., Dublin, Ireland, Jun. 2018.
 [6] D. A. Tziouvaras, P. McLaren, G. Alexander, D. Dawson, J. Esztergalyos, C. Fromen, M. Glinkowski, I. Hasenwinkle, M. Kezunovic, L. Kojovic et al., “Mathematical models for current, voltage, and coupling capacitor voltage transformers,” IEEE Trans. Power Del., vol. 15, no. 1, pp. 62–72, Jan. 2000.
 [7] H. Xu, A. D. Domínguez-García, and P. W. Sauer, “Optimal tap setting of voltage regulation transformers using batch reinforcement learning,” arXiv:1807.10997, 2018.
 [8] W. Su, J. Wang, and J. Roh, “Stochastic energy scheduling in microgrids with intermittent renewable energy resources,” IEEE Trans. Smart Grid, vol. 5, no. 4, pp. 1876–1883, July 2014.
 [9] A. Ipakchi and F. Albuyeh, “Grid of the future,” IEEE Power Energy Mag., vol. 7, no. 2, pp. 52–62, Feb. 2009.
 [10] K. Turitsyn, P. Sulc, S. Backhaus, and M. Chertkov, “Options for control of reactive power by distributed photovoltaic generators,” Proc. IEEE, vol. 99, no. 6, pp. 1063–1073, Jun. 2011.
 [11] G. Wang, V. Kekatos, A. J. Conejo, and G. B. Giannakis, “Ergodic energy management leveraging resource variability in distribution grids,” IEEE Trans. Power Syst., vol. 31, no. 6, pp. 4765–4775, Nov. 2016.
 [12] V. Kekatos, G. Wang, A. J. Conejo, and G. B. Giannakis, “Stochastic reactive power management in microgrids with renewables,” IEEE Trans. Power Syst., vol. 30, no. 6, pp. 3386–3395, Dec. 2015.
 [13] M. Farivar, C. R. Clarke, S. H. Low, and K. M. Chandy, “Inverter VAR control for distribution systems with renewables,” in Proc. IEEE SmartGridComm., Brussels, Belgium, Oct. 2011, pp. 457–462.
 [14] D. K. Molzahn and I. A. Hiskens, “A survey of relaxations and approximations of the power flow equations,” Foundations and Trends® in Electric Energy Syst., vol. 4, no. 1-2, pp. 1–221, Feb. 2019.
 [15] H. Zhu and H. J. Liu, “Fast local voltage control under limited reactive power: Optimality and stability analysis,” IEEE Trans. Power Syst., vol. 31, no. 5, pp. 3794–3803, Dec. 2016.
 [16] V. Kekatos, L. Zhang, G. B. Giannakis, and R. Baldick, “Voltage regulation algorithms for multiphase power distribution grids,” IEEE Trans. Power Syst., vol. 31, no. 5, pp. 3913–3923, Sep. 2016.
 [17] S. Magnússon, C. Fischione, and N. Li, “Voltage control using limited communication,” IEEE Trans. Control Netw. Syst., to appear 2019.
 [18] W. Lin, R. Thomas, and E. Bitar, “Real-time voltage regulation in distribution systems via decentralized PV inverter control,” in Proc. Annual Hawaii Intl. Conf. System Sciences, Waikoloa Village, Hawaii, Jan. 2018.
 [19] Y. Zhang, M. Hong, E. Dall’Anese, S. V. Dhople, and Z. Xu, “Distributed controllers seeking AC optimal power flow solutions using ADMM,” IEEE Trans. Smart Grid, vol. 9, no. 5, pp. 4525–4537, Sept. 2018.
 [20] X. Zhou, E. Dall’Anese, L. Chen, and A. Simonetto, “An incentive-based online optimization framework for distribution grids,” IEEE Trans. Autom. Control, vol. 63, no. 7, pp. 2019–2031, July 2018.
 [21] L. Zhang, V. Kekatos, and G. B. Giannakis, “Scalable electric vehicle charging protocols,” IEEE Trans. Power Syst., vol. 32, no. 2, pp. 1451–1462, Mar. 2017.
 [22] D. Ernst, M. Glavic, and L. Wehenkel, “Power systems stability control: reinforcement learning framework,” IEEE Trans. Power Syst., vol. 19, no. 1, pp. 427–435, Feb. 2004.
 [23] A. Sadeghi, G. Wang, and G. B. Giannakis, “Adaptive caching via deep reinforcement learning,” arXiv:1902.10301, 2019.
 [24] A. S. Zamzam, B. Yang, and N. D. Sidiropoulos, “Energy storage management via deep Q-networks,” arXiv:1903.11107, 2019.
 [25] J. A. Carta, P. Ramirez, and S. Velazquez, “A review of wind speed probability distributions used in wind energy analysis: Case studies in the Canary Islands,” Renew. Sust. Energ. Rev., vol. 13, no. 5, pp. 933–955, Jun. 2009.
 [26] L. Zhang, G. Wang, and G. B. Giannakis, “Real-time power system state estimation and forecasting via deep neural networks,” arXiv:1811.06146, Nov. 2018.
 [27] M. Baran and F. F. Wu, “Optimal sizing of capacitors placed on a radial distribution system,” IEEE Trans. Power Del., vol. 4, no. 1, pp. 735–743, Jan. 1989.
 [28] S. H. Low, “Convex relaxation of optimal power flow—Part II: Exactness,” IEEE Trans. Control Netw. Syst., vol. 1, no. 2, pp. 177–189, May 2014.
 [29] L. Gan, N. Li, U. Topcu, and S. H. Low, “Exact convex relaxation of optimal power flow in radial networks,” IEEE Trans. Autom. Control, vol. 60, no. 1, pp. 72–87, Jan. 2015.
 [30] M. E. Baran and F. F. Wu, “Network reconfiguration in distribution systems for loss reduction and load balancing,” IEEE Trans. Power Del., vol. 4, no. 2, pp. 1401–1407, Apr. 1989.
 [31] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
 [32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
 [33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, Jan. 2016.
 [34] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv:1609.08144, 2016.
 [35] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, and J. Albrecht, “Smart*: An open data set and tools for enabling research in sustainable homes,” SustKDD, vol. 111, no. 112, p. 108, Aug. 2012.