Hierarchical Decision Making In Electricity Grid Management
Abstract
The power grid is a complex and vital system that necessitates careful reliability management. Managing the grid is a difficult problem, spanning multiple time scales of decision making and subject to stochastic behavior due to renewable energy generation, variable demand, and unplanned outages. Solving this problem in the face of uncertainty requires a new methodology with tractable algorithms. In this work, we introduce a new model for hierarchical decision making in complex systems. We apply reinforcement learning (RL) methods to learn a proxy, i.e., a level of abstraction, for real-time power grid reliability. We devise an algorithm that alternates between slow-timescale policy improvement and fast-timescale value function approximation. We compare our results to prevailing heuristics and show the strength of our method.
1 Introduction
The power grid is a complex and vital system that requires a high level of reliability. Reliability is of utmost importance, as the consequences of outages can be catastrophic. System operators (SOs) achieve reliability by means of sophisticated control operations and planning, which often require solving sequential stochastic decision problems. Sequential decision making under uncertainty in energy systems is studied in different communities, such as control theory, dynamic programming, stochastic programming, and robust optimization Powell & Meisel (2015); Bertsimas et al. (2013); Bienstock (2011); Koutsopoulos & Tassiulas (2012); Bienstock et al. (2014).
Reliability assessment and control are highly complicated tasks in complex real-world systems such as the power grid. Complications arise because of strict physical restrictions: generation must meet consumption continuously, and transmission lines cannot exceed their limited thermal capacity. Further complications stem from the structure of decision making over different time horizons. For example, long-term system expansion and development, such as building a new wind farm or a high-voltage line, take years; mid-term asset management decisions, such as performing maintenance, are decided upon months in advance; short-term generation schedules are planned daily; and real-time operational control decisions are made on the scale of minutes. In these interdependent hierarchical decision making processes, decisions are taken by multiple stakeholders. Furthermore, over the last decade, wind and solar energy sources have become increasingly prominent, with further significant expansion envisaged Talbot (2009). These generators introduce high uncertainty into the system, making the control task significantly more difficult. The complex dependence between multiple time horizons with growing uncertainty, the curse of dimensionality when dealing with large systems, and the nonlinear dependence of reliability measures on decisions across the multiple time horizons make this problem extremely hard to tackle.
To stress the dimensionality complexity, consider the IEEE RTS-96 power network used in our experiments Wong et al. (1999). This network is an example of a power grid of a medium-sized European country or a state in the USA. Its state-space is , and its action space is ; see Sec. 5. Assessment of each control choice carries a computational burden, as it requires solving a set of nonlinear trigonometric equations known as the alternating current power flow (ACPF); see Sec. 2.1.
Nowadays, the common practice in industry is to solve large mixed-integer programs (MIP), often with a linear relaxation, in an attempt to reach a valid solution Grainger & Stevenson (1994); Allan et al. (2013). Although this model is extensive, its computational burden makes it impractical even for deterministic predictions (taking on the order of a day in real-world systems), and inappropriate in the stochastic case. This limits SOs to sampling snapshots of future grid states or analyzing a few sequential trajectories. This narrow view of possible outcomes is likely to miss important benefits and to increase the costs of decisions, thereby offering little in terms of dealing with uncertainty.
To handle uncertainty, work has been done in stochastic optimization and control theory. These works often rely on restrictive simplifications, such as assuming independence between the decision processes in the different timescales, or consider only myopic decisions AbiriJahromi et al. (2009); Wu et al. (2010); AbiriJahromi et al. (2013).
Another approach is to use approximate dynamic programming Powell (2007); Si (2004). However, the natural hierarchical structure of the problem, in which several stakeholders operating on different timescales and exposed to different information make decisions with mutual influence, does not naturally fit the standard Markov Decision Process (MDP) structure. Furthermore, the problem is heavily constrained, since physical electrical restrictions must be met at all times.
Making this problem tractable requires a level of abstraction in the form of fast proxy methods that approximate the impact of real-time decisions on longer-term reliability and costs. To our knowledge, few attempts have been made to construct such proxies using tools from machine learning. One example is the work conducted in a recent European project, iTesla iTe (). This work focuses on analyzing snapshots of system states at different time points using data-mining methods. Classification and clustering algorithms are then used to construct security rules for predicting the reliability level, given a failure and an electrical network state Anil (2013). Such approaches can aid SOs in real-time control, but lack the dynamic perspective of state-action evolution needed to evaluate the consequences of policies in a sequential decision making scenario.
In this work we suggest a novel approach to mitigate the intractability of the hierarchical decision making problem of day-ahead (DA) and real-time (RT) reliability of the power grid. The contributions of our work are:

We introduce an interleaved hierarchical structure of MDPs, each with a separate state space, action space, and reward metric.

We devise an algorithm that alternates between high-level policy improvement and lower-level value approximation, i.e., the policy improvement in the first MDP is based on the second MDP's value function.

We show the efficacy of our method on a medium-sized power grid problem.

We introduce a new realworld application to the RL community and provide a simulation environment.
The rest of the paper is organized as follows. In Sec. 2 we present background on power system engineering. In Sec. 3 we formulate the two-layer MDPs. In Sec. 4 we introduce our interleaved approximate policy improvement (IAPI) algorithm, and in Sec. 5 we present results on the IEEE RTS-96 network. We conclude in Sec. 6.
2 Background
In this section we present a brief introduction to the field of power systems engineering. This is a vast field with extensive background and theory; for more information, please refer to Grainger & Stevenson (1994); Allan et al. (2013).
2.1 Decision Processes and Power Flow in Power Grids
To better explain the multiple time-horizon decision processes, we use a toy 6-bus power grid example Wood & Wollenberg (1996), shown in Fig. 1. The 6-bus system is composed of 6 electrical nodes referred to as "buses". Each bus can have loads and generators attached to it. Loads (shown in blue) are consumers (e.g., large neighborhoods, cities, and factories), and generators (shown in red) are power producers such as nuclear plants, coal plants, wind turbines, and solar panels. Load values change continuously throughout the day and closely follow daily, weekly, and yearly profiles. Controllable generators are operated such that the overall power generation meets the overall load at all times (up to transmission losses). The edges connecting the buses represent transmission lines which, due to thermal restrictions, can only transfer a limited amount of power before risking tripping.
Given a snapshot of load and generation values, and the power grid topology (buses and transmission lines), it is possible to solve the complete alternating current power flow (ACPF) equations. The ACPF is a set of non-convex trigonometric equations that model the physical electrical characteristics of the power grid, i.e., the voltage magnitude and angle at each node Cain et al. (2012). The ACPF solution includes the amount of power passing through each transmission line (shown in green in Fig. 1).
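Solving the full ACPF is expensive; to make the flow computation concrete, the following is a minimal sketch of the linearized DC power-flow approximation, a common fast stand-in for ACPF. The 3-bus network data below are invented for illustration and are not taken from this paper's test system.

```python
import numpy as np

def dc_power_flow(lines, susceptance, injection, slack=0):
    """Solve DC power flow: B' theta = P, then flow_ij = b_ij * (theta_i - theta_j)."""
    n = len(injection)
    B = np.zeros((n, n))
    for (i, j), b in zip(lines, susceptance):   # build the bus susceptance matrix
        B[i, i] += b
        B[j, j] += b
        B[i, j] -= b
        B[j, i] -= b
    keep = [k for k in range(n) if k != slack]  # remove the slack bus row/column
    theta = np.zeros(n)                         # voltage angles, slack fixed at 0
    theta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], injection[keep])
    flows = np.array([b * (theta[i] - theta[j])
                      for (i, j), b in zip(lines, susceptance)])
    return theta, flows

# 3-bus toy example: generator at bus 0, loads at buses 1 and 2.
lines = [(0, 1), (0, 2), (1, 2)]
susceptance = [10.0, 10.0, 10.0]
injection = np.array([1.0, -0.4, -0.6])  # net power (generation minus load), sums to 0
theta, flows = dc_power_flow(lines, susceptance, injection)
```

In a reliability check, a line flow exceeding its thermal limit would mark the tested topology as unsafe.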
In general, reliability of a power system is measured by the avoidance of full or partial blackouts (both planned and unplanned) and their negative effect on social welfare. A blackout is an event in which demand cannot be met. This occurs predominantly because of contingencies (i.e., asset malfunctions) which lead to unsafe operation and may require the SOs to disconnect loads in order to avoid catastrophes. Contingencies can stem from multiple causes, such as a tree falling, a lightning strike, poor maintenance, or exceeding the thermal limits of a transmission line. To maintain a high reliability level at all times, the current practice of SOs is to immunize the system against a predetermined contingency list. A common choice for this list is all single-asset contingencies, resulting in the so-called N-1 reliability criterion.
However, contingency probabilities are difficult to obtain and their impact is hard to assess. Furthermore, the high penetration of stochastic and often uncontrollable renewable generators makes the planning tasks significantly harder, for several reasons. First, generation must equal demand at all times. Second, multiple decision making processes take place simultaneously on multiple timescales. Third, each decision process involves high-dimensional decision variables and complex, nonlinear, often intractable mathematical formulations Powell & Meisel (2015).
For example, in the 6-bus system in Fig. 1, a system developer might plan to expand the system by building a new transmission line between buses 3 and 4. Expanding the grid is a long-term process, and the decision must be taken years in advance. However, this decision affects future maintenance decisions, which in turn affect future daily planning, which in turn affects future real-time control room operations. Ideally, the system developer should consider all possible future realizations of the environment, the grid, and the decision processes in all other time horizons.
2.2 Related work
Several works in the literature of power systems, operations research, and more recently machine learning offer approaches for solving sequential stochastic problems using dynamic programming. The majority of these works focus on energy storage Lai et al. (2010); Xi et al. (2014); Jiang et al. (2014); Scott & Powell (2012), unit commitment Padhy (2004); Dalal & Mannor (2015); Ernst et al. (2007), and energy market bidding strategies Song & Wang (2003); Urieli & Stone (2014); Jiang & Powell (2015). To our knowledge, no work has used MDPs for assessing reliability in power grids.
For our proxy abstraction we devise a hierarchical model. Hierarchical models offer several benefits over flat models when appropriate: they can improve exploration, enable learning from fewer trials, and allow faster learning on new problems by reusing subtasks learned on previous problems Dietterich (1998). Standard approaches for hierarchical models include planning with options (often referred to as skills) Sutton et al. (1999), task hierarchies Barto & Mahadevan (2003), and hierarchies of abstract machines Parr & Russell (1998). These models include levels of decision making that share the same state-space, with a termination condition to switch between controllers. This structure does not fit our problem well, where two separate decision makers run on different state-spaces and temporal resolutions.
3 Problem Formulation
Here we present a formulation of the two sequential decision processes occurring in the day-ahead (DA) and real-time (RT) stages, in terms of a hierarchical two-MDP model. DA decisions are taken in order to maximize the system's next-day reliability. However, the next day's reliability can only be assessed in RT, and depends on the system operator's decisions taken in RT. This results in a complex dependence between DA and RT actions and system reliability. We therefore formulate the problem using two layers of interleaved MDPs: an RT-MDP, describing the state of the system, reliability, and decisions on an hourly basis; and a DA-MDP, describing the DA action of choosing a daily subset of active generators based on the upcoming day's predictions. In our terminology, the former serves as a proxy for assessing decisions taken in the latter; see Fig. 2.
3.1 Day-Ahead MDP
The DA-MDP is a tuple . The time index is , denoting days. The day-ahead state consists of a day-ahead prediction of the hourly demand at each bus, and of the wind generation of each wind generator.
3.2 Real-Time MDP
The RT-MDP is a tuple . It represents the real-time reliability control process. The time index is , denoting intra-day time steps (e.g., hours). In RT power network operation, an operator may choose preventive actions at each time step, trying to immunize the system against potential malfunctions by avoiding unreliable states. We model this decision making process using post-states Powell (2007): at the beginning of each time interval, the agent observes the current state , i.e., the realized demand and wind values for this interval, and chooses an action . Following the agent's action, the system is in a post-decision state , which is the new state after performing action from state . Next, exogenous random information is obtained, indicating whether an equipment malfunction (contingency) occurred during time interval . Given and , the real-time reward , which represents the system's reliability, can be calculated, and a transition to occurs, governed by . The history of this RT process can be written as .
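The post-state decision loop described above can be sketched as follows. The helper functions and the toy dynamics are hypothetical stand-ins; only the structure (state, action, post-decision state, exogenous contingency information, reward) follows the text.

```python
import random

def rt_step(state, policy, transition, reward, failure_prob, rng):
    """One real-time step in the post-state formulation."""
    action = policy(state)                  # preventive redispatch chosen by the agent
    post_state = transition(state, action)  # deterministic post-decision state
    w = rng.random() < failure_prob         # exogenous info: did a contingency occur?
    r = reward(post_state, w)               # reliability reward given post-state and w
    return r, (post_state, w)

# Toy instantiation: state = vector of generation values, action = redispatch deltas.
rng = random.Random(0)
policy = lambda g: [0.1 - x for x in g]                  # move each generator toward 0.1
transition = lambda g, a: [x + d for x, d in zip(g, a)]  # apply the redispatch
reward = lambda g, w: 0.0 if w else 1.0                  # contingency-free step scores 1
r, (post, w) = rt_step([0.2, 0.05], policy, transition, reward, 0.01, rng)
```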
Real-Time State Space
We define an RT state to be the tuple , where:

is a vector of stochastic nodal demand.

is a vector of stochastic nodal wind generation.

is a vector of controllable generation values. The DA action determines which generators will have positive values, and which will be set to zero throughout the day. Each generator has minimal and maximal generation limits while in operation.

is the topology of the grid. It includes information on the current state of each edge (transmission line): , where means the line is operational, and the remaining values form a countdown process until the line is fixed.
Real-Time Action Space
An RT action is a preventive action that attempts to achieve better system reliability by immunizing against potential contingencies. The action involves redispatch , i.e., changing the generation values of the working controllable generators (chosen in the DA):
Any action is allowed as long as it is within the minimal and maximal generator limits. Notice that redispatch applies to working generators only ().
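As an illustration of these constraints, the sketch below projects a requested redispatch onto each active generator's limits, with inactive (DA-excluded) generators held at zero. All names and numbers are ours; the actual action space is defined by the grid model.

```python
import numpy as np

def clip_redispatch(g, delta, g_min, g_max, active):
    """Apply redispatch delta to generators activated by the DA action,
    respecting [g_min, g_max]; inactive generators stay at zero."""
    return np.where(active, np.clip(g + delta, g_min, g_max), 0.0)

g      = np.array([100.0,  50.0,  0.0])   # current generation values
delta  = np.array([ 30.0, -60.0, 10.0])   # requested redispatch
g_min  = np.array([ 20.0,  20.0, 20.0])   # lower limits while in operation
g_max  = np.array([120.0,  80.0, 60.0])   # upper limits while in operation
active = np.array([True, True, False])    # third generator not chosen in the DA
g_new = clip_redispatch(g, delta, g_min, g_max, active)
```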
Real-Time Transition Kernel
The RT transition kernel can be factorized into exogenous transitions of demand, wind generation, and contingencies. It is conditioned on the last RT state and action (encoded in the RT post-state), and on the corresponding last DA decision, which determines the participating generators:
The dependence between RT and DA states is expressed using two sets of equations. The first is the RT demand process, based on the DA demand prediction:
(1)  
(2) 
where is the RT demand vector at time , and is the DA prediction vector for time of the day. The dynamics in Eqs. (1)-(2) also hold for the wind generation process. For this work we chose this autoregressive random-bias process for simplicity; however, more sophisticated methods, such as those in Box et al. (2015); Papavasiliou & Oren (2013); Taylor & Buizza (2002), can be considered. The second equation coupling DA and RT determines the generators participating in the current day's generation process:
(3) 
where is the index set of generators chosen by DA action .
Lastly, random exogenous information specifies whether a contingency happened in the system, causing transmission line to fail and changing the network topology to . The probability of line failing at each timestep is if at the last timestep was , and otherwise.
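The exogenous transitions can be illustrated as below, assuming an AR(1) forecast-bias process in the spirit of Eqs. (1)-(2) and a per-line failure process with a fixed repair countdown, as described above. Every parameter value here is invented for illustration.

```python
import random

def demand_step(forecast, bias, rho=0.9, sigma=0.02, rng=random):
    """One step of an autoregressive random-bias demand process around the
    DA forecast (an assumed concrete form, not the paper's exact equations)."""
    bias = rho * bias + rng.gauss(0.0, sigma)   # AR(1) forecast bias
    return forecast * (1.0 + bias), bias

def topology_step(countdowns, p_fail=0.001, time_to_fix=10, rng=random):
    """countdowns[l] == 0 means line l is in service; a positive value counts
    down the remaining repair time after a failure."""
    out = []
    for c in countdowns:
        if c == 0:
            out.append(time_to_fix if rng.random() < p_fail else 0)
        else:
            out.append(c - 1)                   # line under repair, keep counting down
    return out

# With sigma=0 the demand step is deterministic: bias decays toward zero.
realized, new_bias = demand_step(100.0, 0.1, rho=0.5, sigma=0.0)
```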
Real-Time Reward
We choose the RT reward to be the reliability level of the power system at the current time. To assess the level of reliability, we employ the common criterion used in the industry, termed N-1, which assesses the system's ability to withstand any single-asset contingency.
To calculate the reliability of the system, it is examined using a sequence of tests (a contingency list), where each test is an attempt to take out a single line (contingency) and check whether the system retains safe operation. Hence, the reward is a number in [0, 1], expressing the portion of tests passed out of the predetermined contingency list, which includes all single contingencies. The reliability is calculated for a given state of the grid, and depends on the current topology () and on the changes to the topology due to possible new contingencies (). In practice, preserving the system in safe operation means being able to obtain a feasible solution to the power flow equations (see Sec. 2) of the network circuit. We define to be 1 if a power flow solution exists, and 0 otherwise. As a result, the RT reward is:
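A minimal sketch of this reward computation: given the set of in-service lines and a feasibility oracle standing in for the ACPF solver, count the fraction of single-line contingency tests that pass. The `feasible` stub below is purely illustrative.

```python
def n_minus_1_reward(lines_in_service, feasible):
    """Fraction of single-line-outage tests under which a power-flow
    solution still exists (the N-1 criterion described above)."""
    tests = [set(lines_in_service) - {l} for l in lines_in_service]
    passed = sum(1 for topo in tests if feasible(topo))
    return passed / len(tests)

# Toy feasibility rule: the grid survives unless the (hypothetical)
# critical corridor 'L1' is the line removed.
feasible = lambda topo: 'L1' in topo
r = n_minus_1_reward(['L1', 'L2', 'L3'], feasible)   # only the L1 test fails
```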
4 Interleaved Approximate Policy Improvement
In this section we present our algorithm, Interleaved Approximate Policy Improvement (IAPI; Alg. 1), for jointly learning the RT reliability value function while searching for an optimal DA policy. We use the term interleaved since the policy improvement in one MDP is based on the second MDP's value function. We use simulation-based value learning to assess the RT reliability of the system, and the cross-entropy method De Boer et al. (2005); Szita & Lörincz (2006) for improving the DA policy. Our method scales to large systems since it uses simple models with carefully engineered features, and is designed to run on distributed computing. Since the algorithm is massively parallelizable, the more cores available, the faster the convergence.
Our goal is to find an optimal DA policy , under the assumption that the RT policy is known. Henceforth, we will use to symbolize . As explained in Sec. 3, reliability is not explicitly defined on the DA level; we instead use the RT value function as a surrogate for comparing different DA policies. Departing from the common notation, denotes the RT value function under the fixed RT policy and a DA policy .
Our method includes the following components:
Day-Ahead Policy Approximation We define a parametric DA policy as , where is the day-ahead action dictating which generators will be active during the day, and are features of DA state and action .
A plausible choice for mapping a DA state to an action is a multi-class classifier. However, for a large number of classes ( in our experiments), these methods require a significant number of simulations for training Bishop (2006). Furthermore, approaches for classification-based policy learning often require obtaining multiple rollouts for all the actions from a state during the training procedure Gabillon et al. (2011), which in our case would result in a full value evaluation per action and might prove overly encumbering. To mitigate these complexities, our policy chooses the action that maximizes the inner product with .
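A sketch of such an inner-product policy follows, with a made-up feature map standing in for the paper's feature vector: among a fixed finite action set, pick the action whose state-action features have maximal inner product with the parameter vector.

```python
import numpy as np

def da_policy(theta, state, actions, phi):
    """Linear DA policy: argmax over actions of <theta, phi(state, action)>."""
    scores = [float(np.dot(theta, phi(state, a))) for a in actions]
    return actions[int(np.argmax(scores))]

# Toy instantiation: 3 candidate actions, 2 illustrative features.
phi = lambda s, a: np.array([s * a, 1.0 - a])   # hypothetical feature map
theta = np.array([1.0, 0.5])                    # policy parameters
best = da_policy(theta, state=2.0, actions=[0, 1, 2], phi=phi)
```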
Day-Ahead Policy Comparison A comparison between different DA policies is done by calculating the empirical mean of the RT value function , using a set of representative RT initial states . This set is composed of the full history of all RT states visited during the current IAPI iteration, enabling expected-value estimation over many probable states with only linear computational complexity in .
Day-Ahead Policy Improvement using Cross Entropy DA policy improvement is achieved using the cross-entropy method De Boer et al. (2005); Szita & Lörincz (2006). In this method, initial policies are sampled from a distribution . In each subsequent iteration, policy parameters are drawn from , and their top percentile, according to the RT value, is used to update De Boer et al. (2005); Szita & Lörincz (2006). In our experiments we set such that it includes values that equally separate , making this inner product equal for all the different actions . The distribution is a Gaussian mixture with means set to the values of belonging to the top percentile. The convergence criterion used in our experiments is based on the difference between the top-percentile value averages of two consecutive iterations . By using the cross-entropy method, we avoid gradient-based optimization, which may be difficult in our case due to the discrete, nonlinear nature of ACPF solutions and their dependence on generation Cain et al. (2012), which dictate the level of reliability.
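The cross-entropy loop can be sketched as follows on a toy surrogate objective standing in for the estimated RT value; for brevity a single Gaussian sampling distribution is used here rather than the paper's Gaussian mixture, and all hyperparameters are illustrative.

```python
import numpy as np

def cross_entropy(value, dim, n=100, elite_frac=0.2, iters=30, seed=0):
    """Maximize value(theta) by iteratively sampling parameters and
    refitting the sampler to the top percentile (the elite set)."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        thetas = rng.normal(mu, sigma, size=(n, dim))    # sample candidate policies
        scores = np.array([value(t) for t in thetas])    # estimated values
        elite = thetas[np.argsort(scores)[-int(n * elite_frac):]]
        mu = elite.mean(axis=0)                          # refit to the elite set
        sigma = elite.std(axis=0) + 1e-6                 # small floor avoids collapse
    return mu

# Toy surrogate value with a known maximizer at theta* = (1, -2).
value = lambda t: -np.sum((t - np.array([1.0, -2.0])) ** 2)
theta_hat = cross_entropy(value, dim=2)
```

Being derivative-free, this loop sidesteps the gradient computations that the discrete, nonlinear ACPF structure makes difficult.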
The criterion for comparing policies is a parametric RT value function, , as opposed to using rollouts for policy evaluation Gabillon et al. (2011). The reason for this choice is threefold. First, since a rollout only explores a small part of the space, assuming structure allows us to better generalize to unvisited states; this assumption is supported by our experiments (see Fig. 5). Second, this functional representation allows us to fairly compare different DA policies using a common set of representative RT initial states . Third, our end-goal is to use the value function learned by this algorithm as a proxy for system reliability in RT.
5 Experiments
In this section we show results of the IAPI algorithm on the IEEE RTS-96 test system, which is considered a standard test-case in the power systems literature Wong et al. (1999); see Fig. 4. This test-case is an example of a power grid of a medium-sized country, containing buses, generators, and transmission lines. We updated the test-case to include additional wind generators, to better represent current power grids. We use daily demand and wind profiles based on real historical records, as published in Pandzic et al. (2015). As stated in Sec. 1, this is a complicated, high-dimensional system, which cannot be solved using brute-force methods. The state space of this system can have line configurations, with demand values () and wind generation values () at each time, which are of a stochastic nature. This is without accounting for the day-ahead prediction, which raises this number to the power of (for each hour of the day). Controlling which generators are on/off adds integer decisions, and generation levels add possible values per generator.
To compose the DA action set , we define subsets of active generators chosen at random, and fix them for the rest of the simulation. These subsets contain varying numbers of generators with different capacities, to enable meeting demand under the different possible daily profiles. For the DA we use a feature vector
where

is the number of actions ( in our experiments).

indicates if generation can meet maximal predicted daily demand.

indicates if generation can meet minimal predicted daily demand.

is a barrier penalty function that penalizes cases in which the average demand is close to the upper or lower generation bounds attainable under .

is an indicator function over the selected DA action.
For the RT policy we employ a simple heuristic: shifting the hourly generation values to meet the realized effective demand. We define effective demand as demand minus wind generation. This is a natural approach, as wind generation is not under the decision maker's control and therefore is not considered part of the regular controllable generation. The RT feature vector contains polynomial features of , where

is the total RT effective demand,

is the demand entropy across the different buses, and

is the generation entropy across the different buses,
resulting in a dimensional vector. We use the entropy features since they compactly capture the spread of generation and demand across the network. This spread is important, as concentrations of generation and demand are directly linked to reliability issues; see Fig. 5.
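The entropy features can be computed as below: normalize the nodal values into a distribution and take its Shannon entropy, so a single scalar summarizes how spread out demand or generation is across the buses. The numbers are illustrative.

```python
import numpy as np

def spread_entropy(values):
    """Shannon entropy of nodal values (demand or generation) normalized
    into a distribution; higher means more evenly spread across buses."""
    p = np.asarray(values, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                            # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

uniform      = spread_entropy([25.0, 25.0, 25.0, 25.0])  # fully spread out
concentrated = spread_entropy([97.0, 1.0, 1.0, 1.0])     # one bus dominates
```

Consistent with the discussion above, the evenly-spread case attains the maximal entropy while the concentrated case scores much lower.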
For the parameters of the dynamics described in Eqs. (1)-(2) we follow Lu et al. (2013), choosing for the wind forecast error and for the demand forecast error. The real-time variation is chosen to be . The line failure probability is set to for each line, and its time-to-fix to .
In our simulation we use episodes, each with a day horizon. Each episode starts from a random DA state , drawn from several representative demand and wind profiles, to which we add normally distributed noise. The next-day transition corresponds to adding a normally distributed bias to the previous day's profile. In each cross-entropy iteration we evaluate DA policies () and choose the top th percentile for updating . The DA policies are evaluated in parallel on a core cluster. For the TD(0) algorithm we use discounting with .
In Fig. 5 we show the learned RT value as a function of the deviation of the overall effective demand (demand minus wind) from the DA prediction, and of the generation entropy across the different buses. The RT value shown is marginalized over the rest of the features, time, and daily profiles. As the figure shows, when the real-time demand deviates from the predicted demand, reliability suffers with a quadratic dependence. This is because the generators chosen in the DA reach their upper or lower thresholds, causing generation to fail to meet demand. The monotonic dependence on generation entropy implies that the higher the entropy, the more reliable the system. This is intuitive, since high entropy corresponds to generation being more distributed throughout the network, mitigating the consequences of line outages: when less emphasis is put on specific areas of the network, the system has more flexibility to find alternative routes from generation to demand. This, however, incurs a price in real life, since generation cannot be concentrated only on cheap generators.
Next, in Fig. 6 we show the top th percentile convergence of the IAPI algorithm. As can be seen, the average value increases and converges after iterations, while the variance of the top-percentile solutions decreases. In Fig. 7, we visualize the convergence of the IAPI algorithm by projecting onto the top two principal components (PCs) of the DA policy parameters ; we use the same PCs for all the plots. The figure shows the scattering of the parameter vectors drawn in each iteration. As described in Alg. 1, each defines a policy for which we calculate the estimated expected value . The dark '+' marks denote the parameter vectors corresponding to the top percentile of . As can be seen, the IAPI algorithm explores the policy space until converging to a local optimum.
In Fig. 8 we present different daily effective demand profiles, colored according to the DA action chosen by the DA policy , that was learned by the IAPI algorithm. A clear clustering can be observed between different daily demand profiles and the resulting action taken by the DA policy. The policy distinguishes between different consumption patterns and maps them to a corresponding set of active generators for reliable operation of the day to come.
To test our algorithm, we compare the learned DA policy to three common heuristics. Taking the daily state as input, these heuristics choose an eligible generator subset that can satisfy the maximal effective demand according to that day's DA prediction. They differ in the way they choose among the eligible subsets of each day: 'Random' chooses one at random, 'Cost' chooses the cheapest combination of generators, and 'Elastic' chooses the subset with the most flexible generators, i.e., the largest ratio between upper and lower generation limits. We evaluate the performance of the different policies using rollouts of episodes per policy. Fig. 9 presents box plots of the results. As can be seen, the value varies greatly between the different methods. The 'Random' policy shows an almost flat spread, demonstrating a lack of preference for a single subset when encountering a new day. The 'Cost' and 'Elastic' policies produce a more concentrated spread, corresponding to their preference among subset choices. The policy learned using IAPI obtains higher reward than the heuristics, demonstrating the IAPI algorithm's ability to learn a diverse DA policy.
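The three baselines can be sketched as follows, with invented subset data; `eligible` filters the subsets that can cover the day's maximal predicted effective demand, and each heuristic then applies its own selection rule.

```python
import random

def eligible(subsets, max_demand):
    """Subsets whose total capacity covers the maximal predicted demand."""
    return [s for s in subsets if s['capacity'] >= max_demand]

def random_policy(subsets, max_demand, rng):
    return rng.choice(eligible(subsets, max_demand))          # 'Random'

def cost_policy(subsets, max_demand):
    return min(eligible(subsets, max_demand),
               key=lambda s: s['cost'])                       # 'Cost': cheapest

def elastic_policy(subsets, max_demand):
    return max(eligible(subsets, max_demand),
               key=lambda s: s['g_max'] / s['g_min'])         # 'Elastic': most flexible

# Hypothetical generator subsets (capacity/cost/limit values are made up).
subsets = [
    {'name': 'A', 'capacity': 120, 'cost': 10, 'g_max': 120, 'g_min': 60},
    {'name': 'B', 'capacity': 150, 'cost': 14, 'g_max': 150, 'g_min': 30},
    {'name': 'C', 'capacity': 90,  'cost': 6,  'g_max': 90,  'g_min': 45},
]
chosen_rand = random_policy(subsets, 100, random.Random(0))
chosen_cost = cost_policy(subsets, max_demand=100)
chosen_flex = elastic_policy(subsets, max_demand=100)
```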
6 Discussion
In this work we presented an interleaved two-MDP model, inspired by the hierarchical decision making problem of managing power grid reliability. The IAPI algorithm presented here alternates between improving the DA policy and learning the RT reliability value. The IEEE RTS-96 network in our experiments is large enough to capture the computational complexities that arise in real-world networks.
In this work we focus on the power grid; however, our model can be adapted to other important applications with a hierarchical decision making structure across different timescales, where a high level of reliability and sustainability is required. Examples of such applications are sewer systems, smart cities, and traffic control.
The coarse model presented in this work was crafted jointly with several SOs as an initial step. This work is the tip of the iceberg, and many enhancements can be considered. For example, an important aspect not covered here is budget considerations. Following the practice in the power system industry, reliability and money are often treated as different "currencies". Considering a budget would impose limitations on action selection and complicate the problem even further. Another possible addition is to extend the IAPI algorithm to interleave in reverse, i.e., alternating the DA improvement with improving the RT policy. Suspected drawbacks in this case are convergence problems and the need for even more intensive simulation.
Managing high reliability in stochastic complex systems, with interleaved decision making over different time horizons, is inherently difficult and results in intractable formulations. To mitigate this, there is growing interest in the power system community in proxies that enable quick assessment of reliability for different states of the grid. In this work we introduce new models and formulations, along with a simulation environment. Our hope is that this will provide a platform for other researchers in the community to develop and explore their own innovative methods, and will help bring these two fields closer. The code for the simulation environment is available at [hidden to preserve anonymity].
Footnotes
 In this work we consider only wind generation as a renewable source for simplicity.
References
 Innovative tools for electrical system security within large areas. http://www.iteslaproject.eu/. Accessed: 20160203.
 AbiriJahromi, A, FotuhiFiruzabad, M, and Abbasi, E. An efficient mixedinteger linear formulation for longterm overhead lines maintenance scheduling in power distribution systems. Power Delivery, IEEE Transactions on, 24(4):2043–2053, 2009.
 AbiriJahromi, Amir, Parvania, Masood, Bouffard, Francois, and FotuhiFiruzabad, Mahmud. A twostage framework for power transformer asset maintenance management – Part I: Models and formulations. Power Systems, IEEE Transactions on, 28(2):1395–1403, 2013.
 Allan, RN et al. Reliability evaluation of power systems. Springer Science & Business Media, 2013.
 Anil, Can. Benchmarking of data mining techniques as applied to power system analysis. 2013.
 Barto, Andrew G and Mahadevan, Sridhar. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
 Bertsimas, Dimitris, Litvinov, Eugene, Sun, Xu Andy, Zhao, Jinye, and Zheng, Tongxin. Adaptive robust optimization for the security constrained unit commitment problem. Power Systems, IEEE Transactions on, 28(1):52–63, 2013.
 Bienstock, Daniel. Optimal control of cascading power grid failures. In Decision and control and European control conference (CDCECC), 2011 50th IEEE conference on, pp. 2166–2173. IEEE, 2011.
 Bienstock, Daniel, Chertkov, Michael, and Harnett, Sean. Chanceconstrained optimal power flow: Riskaware network control under uncertainty. SIAM Review, 56(3):461–495, 2014.
 Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
 Box, George EP, Jenkins, Gwilym M, Reinsel, Gregory C, and Ljung, Greta M. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 Cain, Mary B, O'Neill, Richard P, and Castillo, Anya. History of optimal power flow and formulations. Federal Energy Regulatory Commission, 2012.
 Dalal, Gal and Mannor, Shie. Reinforcement learning for the unit commitment problem. In PowerTech, 2015 IEEE Eindhoven, pp. 1–6. IEEE, 2015.
 De Boer, Pieter-Tjerk, Kroese, Dirk P, Mannor, Shie, and Rubinstein, Reuven Y. A tutorial on the cross-entropy method. Annals of operations research, 134(1):19–67, 2005.
 Dietterich, Thomas G. The MAXQ method for hierarchical reinforcement learning. In ICML, pp. 118–126. Citeseer, 1998.
 Ernst, Damien, Glavic, Mevludin, Stan, Guy-Bart, Mannor, Shie, and Wehenkel, Louis. The cross-entropy method for power system combinatorial optimization problems. In 2007 Power Tech, 2007.
 Gabillon, Victor, Lazaric, Alessandro, Ghavamzadeh, Mohammad, and Scherrer, Bruno. Classification-based policy iteration with a critic. 2011.
 Grainger, John J and Stevenson, William D. Power system analysis. McGrawHill, 1994.
 Jiang, Daniel R and Powell, Warren B. Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming. INFORMS Journal on Computing, 27(3):525–543, 2015.
 Jiang, Daniel R, Pham, Thuy V, Powell, Warren B, Salas, Daniel F, and Scott, Waymond R. A comparison of approximate dynamic programming techniques on benchmark energy storage problems: Does anything work? In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on, pp. 1–8. IEEE, 2014.
 Koutsopoulos, Iordanis and Tassiulas, Leandros. Optimal control policies for power demand scheduling in the smart grid. Selected Areas in Communications, IEEE Journal on, 30(6):1049–1060, 2012.
 Lai, Guoming, Margot, François, and Secomandi, Nicola. An approximate dynamic programming approach to benchmark practicebased heuristics for natural gas storage valuation. Operations research, 58(3):564–582, 2010.
 Lu, Ning, Diao, Ruisheng, Hafen, Ryan P, Samaan, Nancy, and Makarov, Yuri V. A comparison of forecast error generators for modeling wind and load uncertainty. In Power and Energy Society General Meeting (PES), 2013 IEEE, pp. 1–5. IEEE, 2013.
 Padhy, Narayana Prasad. Unit commitment: a bibliographical survey. Power Systems, IEEE Transactions on, 19(2):1196–1205, 2004.
 Pandzic, Hrvoje, Wang, Yannan, Qiu, Ting, Dvorkin, Yury, and Kirschen, Daniel S. Near-optimal method for siting and sizing of distributed storage in a transmission network. 2015.
 Papavasiliou, Anthony and Oren, Shmuel S. Multi-area stochastic unit commitment for high wind penetration in a transmission constrained network. Operations Research, 61(3):578–592, 2013.
 Parr, Ronald and Russell, Stuart. Reinforcement learning with hierarchies of machines. Advances in neural information processing systems, pp. 1043–1049, 1998.
 Powell, Warren B. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons, 2007.
 Powell, Warren B and Meisel, Stephan. Tutorial on stochastic optimization in energy – Part I: Modeling and Policies. 2015.
 Scott, W and Powell, Warren B. Approximate dynamic programming for energy storage with new results on instrumental variables and projected Bellman errors. Submitted to Operations Research (Under Review), 2012.
 Si, Jennie. Handbook of learning and approximate dynamic programming, volume 2. John Wiley & Sons, 2004.
 Song, Yong-Hua and Wang, Xi-Fan. Operation of market-oriented power systems. Springer Science & Business Media, 2003.
 Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction. MIT press, 1998.
 Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1):181–211, 1999.
 Szita, István and Lörincz, András. Learning Tetris using the noisy cross-entropy method. Neural computation, 18(12):2936–2941, 2006.
 Talbot, David. Lifeline for renewable power. Technol Rev, 112:40–47, 2009.
 Taylor, James W and Buizza, Roberto. Neural network load forecasting with weather ensemble predictions. Power Systems, IEEE Transactions on, 17(3):626–632, 2002.
 Urieli, Daniel and Stone, Peter. TacTex'13: a champion adaptive power trading agent. In Proceedings of the 2014 international conference on Autonomous agents and multiagent systems, pp. 1447–1448. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
 Wong, Paul, Albrecht, P, Allan, R, Billinton, Roy, Chen, Qian, Fong, C, Haddad, Sandro, Li, Wenyuan, Mukerji, R, Patton, Diane, et al. The IEEE reliability test system-1996. A report prepared by the reliability test system task force of the application of probability methods subcommittee. Power Systems, IEEE Transactions on, 14(3):1010–1020, 1999.
 Wood, Allen J and Wollenberg, Bruce F. Power generation, operation, and control, 2nd edition. John Wiley & Sons, 1996.
 Wu, Lei, Shahidehpour, Mohammad, and Fu, Yong. Security-constrained generation and transmission outage scheduling with uncertainties. Power Systems, IEEE Transactions on, 25(3):1674–1685, 2010.
 Xi, Xiaomin, Sioshansi, Ramteen, and Marano, Vincenzo. A stochastic dynamic programming model for co-optimization of distributed energy storage. Energy Systems, 5(3):475–505, 2014.