Sparse Wide-Area Control of Power Systems using Data-driven Reinforcement Learning
In this paper we present an online wide-area oscillation damping control (WAC) design for uncertain models of power systems using ideas from reinforcement learning. We consider that the exact small-signal model of the power system at the onset of a contingency is not known to the operator and use online measurements of the generator states and control inputs to recursively learn a state-feedback controller that minimizes a given quadratic energy cost. However, unlike conventional linear quadratic regulators (LQR), we intend our controller to be sparse, so its implementation reduces the communication costs. We, therefore, employ the gradient support pursuit (GraSP) optimization algorithm to impose sparsity constraints on the control gain matrix during learning. The sparse controller is thereafter implemented using distributed communication. We highlight various implementation, convergence, and numerical benefits versus challenges associated with the proposed approach using the IEEE 39-bus power system model with 1149 unknown parameters.
Sparse Wide-Area Control of Power Systems using Data-driven Reinforcement Learning
Over the past few years, the occurrence of a series of blackouts in different parts of the world has led power system utility owners to look beyond the traditional approach of controlling the grid via local feedback, and instead transition to system-wide control, often referred to as wide-area control (WAC). Several papers on WAC design for damping of electromechanical oscillations have been reported in the recent literature [1, 2, 3, 4, 5]. But a common limitation among all these designs is that they are based on perfect knowledge of the grid model. The basic approach is to linearize the system model around a given operating point, and design linear state-feedback or output-feedback type controllers for taking damping control action via the generator excitation control system. In reality, however, the operating point of a grid may move over wide ranges, and therefore using just one fixed controller might not be optimal. The problem is becoming more notable with increasing penetration of renewables and electric vehicles in the grid, each of which come with its own share of intermittency and uncertainty. One way to counteract these uncertainties would be to design a robust WAC. The challenge, however, is that with millions of electric vehicles and inverter-based generation points being envisioned to be integrated to the US grid in very near future, primarily in a completely plug-and-play fashion, it is extremely difficult to quantify a reliable upper bound for these uncertainties that can be used for robust control designs. Recent papers such as [6, 7, 8] have proposed robust sparse control, but those designs usually work for fairly limited amount of uncertainty. Operators are, therefore, more interested in learning the power system model using online measurements available from sophisticated sensors, such as Phasor Measurement Units (PMU), after contingencies or developing real-time control actions based on learning.
Motivated by this problem, in this paper we present a LQR-based WAC design using online reinforcement learning (RL). RL has been shown to be an excellent candidate for online optimal control design under model uncertainty in several recent papers such as [9, 10, 11, 12]. Other variants of online learning such as approximate dynamic programming (ADP) [13, 14], Q-learning , and integral concurrent learning , for both continuous-time and discrete-time dynamic systems have also been proposed. In this paper, we adopt the RL design proposed in , whereby online measurements of generator states and control inputs are used to learn an optimal LQR controller, given a choice of the objective function. However, the algorithm in  has very long convergence time due to the assumption of completely unknown system model. In this paper, we exploit the knowledge of the approximate, or nominal, model to speed up convergence significantly.
Moreover, designing a traditional LQR controller is not suitable for WAC since it demands a dense all-to-all communication graph between every pair of generators. To save on communication costs, we integrate the RL design with Gradient Support Pursuit (GraSP) that imposes sparsity constraints on the control gain matrix . Sparse control designs for WAC have been reported in several recent papers such as , , , but without any model uncertainty. The proposed algorithm incorporates the advantages of RL control and offline sparse controllers. This algorithm learns a sparse controller, thus simultaneously satisfying the communications cost constraint and overcoming the model uncertainty. The proposed design are carried out in two sequential stages: (1) following a contingency, state estimates generated by decentralized Kalman filters at each generator, as well as the generator control inputs stream in to a central coordinator that serves as a ’critic’ and iteratively learns the sparse optimal controller ; (2) Once the -learning loop converges, the controller is implemented by a distributed sparse communication topology connecting the selected sets of generators. We validate these two stages using simulations of the IEEE 39-bus 10-generator power system model with 675 unknown feedback gains. We highlight the numerical trade-offs of the two stages for learning versus implementation for different levels of uncertainty.
The main contributions of this paper are:
Reduce the convergence time of online RL control algorithm by exploiting the knowledge of the nominal model for WAC of power systems.
Develop a a sparsity-constrained online learning control algorithm that reduces the communication cost.
The rest of the paper is organized as follows. Section II formulates the proposed sparse WAC problem. Section III briefly reviews the use of RL for LQR designs and presents the main sparse learning algorithm by integrating RL with GraSP. Section IV presents simulation results and numerical analysis. Section V concludes the paper.
Ii Problem Statement
Ii-a Power System Model
Consider a -machine power system. Each synchronous generator is modeled by a standard set of swing equations and excitation equations that are required for wide-area control. For example, one simple model can be just the swing and excitation circuit dynamics with power balance equations
where for the generator, is the phase angle, is the generator bus voltage, is the bus phase angle, is the rotor velocity, is the quadrature-axis internal emf, denotes the synchronous frequency, and are the active and reactive power, is the generator inertia, is the generator damping, is the mechanical power input, is the excitation time constant, , , and are the direct axis salient reactances, direct-axis transient reactance, and quadrature-axis salient reactance, respectively. All variables, except for the phase angles (radians), are expressed in per unit. The control input for this model is considered as the field voltage , which can be split as
where, the first term is a constant that fixes the equilibrium value, and the second term is a designable control input. As such, our control design does not necessarily need the generators to follow this simple model. Detailed models of generators are allowed, provided all the generator states can be measured or estimated (a short description of decentralized state estimation will be given shortly). In general, we assume the generator to consist of states, say denoted as , and one scalar control input , which is the field excitation voltage. The differential-algebraic model of the generators and the power flow are converted to a state-space model using Kron reduction , and linearized about a desired equilibrium. The small-signal model of the system is written as
The power system model (6) can be written in compact form by stacking state and input vectors as
where indicates the initial condition. However, model parameters and operating conditions in the grid change frequently between contingencies, and therefore having the exact knowledge of and is almost impractical. In the current state of art, utilities use offline models, which may have been constructed years ago, and simply depend on the inherent robustness of the grid to save the closed-loop response even if the controller is not properly matched with the actual matrices that apply to that situation. With gradual increase in renewable penetration and their power electronic interfaces, and also stochastic loads such as electric vehicles, this robustness can no longer be counted on. Operators are more interested in designing by online learning. One choice is to design a LQR controller for . For this, we will assume the state to be available for feedback. This can be done by placing PMUs at geometrically observable set of buses such that the voltage and current phasors at every generator bus is computable. A decentralized unscented Kalman filter is assumed to be installed at every generator. The computed (or measured if a PMU is already at the generator bus) values of the bus phasors are used by the Kalman filter to estimate every generator state vector . Assuming that the KF is run continuously and is sufficiently faster than the generator dynamics, for the rest of the paper we will simply assume . For more details on this KF please see .
Ii-B Optimal Wide-Area Control
The objective of optimal LQR wide-area control for the continuous time, LTI system defined in (7) is to find a controller that minimizes the following cost function:
where , are semi-positive and positive definite design matrices, respectively, with appropriate dimensions, which are selected by the user. The LQR state feedback controller is written as
where is the feedback gain matrix that minimizes in (8), which can be constructed by solving the algebraic Riccati equation where is a positive definite matrix . In the vector form, (9) can be expressed as
where the sub-matrix indicates feedback gain from the states of generator to the controller of generator , while is the self-feedback gain for generator . When is optimal, every and , in general, are non-zero matrices, implying that the communication network needed for exchanging the states is a dense graph. Such dense graphs can result in high communication costs. In order to reduce this cost, we therefore impose an extra constraint of reducing the cardinality of , defined as
where nnz(.) operator returns the number of nonzero elements of a matrix. The self-feedback gains are not counted in the definition (11) as self-links do not cost much.
Ii-C Model Uncertainties
We assume that the exact values of the matrices , in (7) are not known to the power system operator. Only a nominal pair () is known. Typical uncertainties in , in an actual power system may result from various unknown parameters such as inertias of the generators, especially when power electronic converters are added resulting in low-inertia equivalents , or exact values of the line reactances, especially when series compensation may be used in long transmission lines for certain unforeseen contingencies, or even the internal time constants of the generator circuits, and so on.
We refer to the LQR controller based on the nominal model as the mismatched LQR . When the actual deviate significantly from the nominal model matrices , the performance of the mismatched LQR controller suffers and can even become unstable as illustrated in section 4. Thus, we investigate RL for this uncertain model under constrained communication cost.
Iii RL control for WAC
Reinforcement learning has been proposed as a tool to implement optimal control (9) for unknown or uncertain systems . A combination of Q-learning and adaptive dynamic programming is proposed in , which provides an actor-critic structure capable of learning the optimal control policy for completely unknown, continuous-time dynamic systems using value iteration. This algorithm is implemented online using state and control input measurements of the system. Unlike a general RL problem, where is learnt online starting from any arbitrary initial guess, uncertainties in power systems are typically not that unstructred and drastic. This means that if () is the model during one contingency, and () is the model during a contingency that occurs within a few hours, then most probably only a few entries of () will be different from those of () as only a few line and generator parameters may have changed between the two events. The initial guess for can therefore be picked as the controller from the previous contingency. If the difference between the models is indeed not significant, then this choice would expedite the convergence of this loop considerably. As the discrepancy between the two models grows, choosing an initial model () still increases the convergence speed of RL significantly relative to a random guess as will be illustrated in section 4. The notation used in this section is summarized in Table 1.
|Critic convergence speed coefficient.|
|Actor convergence speed coefficient.|
|Concatenated vector of states and control inputs.|
|Quadratic basis vector of states and control inputs.|
|Change in the basis vector after time-step T.|
|Optimized control input|
|Half vectorization operator, stacks elements of the upper triangular part of a matrix into a vector, multiplying diagonal elements by 2.|
|Frobenius norm of the matrix , defined by trace().|
|supp()||The support set of the matrix , i.e., the set of indices of the nonzero entries of matrix .|
|The matrix obtained by preserving only the largest-magnitude entries of the matrix , and setting all other entries to zero.|
|The gradient of the scalar matrix w.r.t . Assuming , is given by matrix with the elements .|
|The restricted Newton step of function w.r.t. the matrix under structural constraint supp() . First, all elements of is stacked in vector , and the function is defined as . Then the restricted Newton step vector of at is computed using the conjugate gradient method. The vector is then converted back to matrix by stacking consecutive segments of .|
Iii-a Sparsity-constrained WAC using RL
First, we briefly review the RL algorithm in . In a two-step learning iterative process, it starts from a randomly generated actor and critic vector, and iterates until these vectors converge to their optimal values. The critic approximator is responsible for estimation of the Q-function while the actor approximator aims to find the optimal control policy. In each step, the actor selects a control policy () and applies it to the system (6) using (9). The critic evaluates the performance of the control policy applied by the actor using state and control input measurements. This evaluation is then used by the actor to update its control policy. This process continues until the actor update results in unchanged within a desired amount of error.
The critic and actor update rules in  are implemented by gradient descent and do not require the knowledge of the system model. Thus the algorithm in  converges to the optimal LQR controller for a completely unknown system. However, this method has long convergence time. Since in WAC partial system knowledge is available to the designer, we employ the known matrices of the nominal model to initialize the RL algorithm, thus reducing the convergence time.
The constrained optimization (III-A) results in a sparse matrix denoted by , which has at most communication links. To impose this constraint in RL, we employ GraSP algorithm  instead of the gradient descend in the actor update of. GraSP was shown to find sparse solutions for a wide class of optimization problems. The details of the sparse RL for WAC are provided in Algorithm 1.
We exploit the knowledge of the nominal model to initialize Algorithm 1. The initial critic vector is formed as follows. First, we solve the Riccati equation for the nominal model to find positive definite
and then we form the kernel
Next, using we find the Q-function and
The initial actor matrix is given by
where and are desired errors for the actor and the critic, respectively, and indicates the duration of the exploration noise , chosen to provide sufficient excitation for the convergence of the critic . Convergence of the critic approximator requires the exploration noise to provide persistence of excitation condition for the system under learning. The noise signal is created by of sum the sinusoids with a sufficient number of frequencies . Since the critic must converge faster than the actor, we choose . Steps 14-18 of Alg. 1 limit the directions in which the actor update is performed. In each iteration, is extended along its steepest gradient-descent directions (step 15). Then the set of descent directions is created by merging the indices of non-zero elements of found in the previous iteration and largest elements of the gradient. As a result, the direction of descent will have at most elements (step 16). After descent based on these directions, an -norm is applied to the resulting to remove the extra entries and ensure (step 18).
Iii-B Timeline and CPS Implementation
The Timeline (Fig. 1) and the CPS diagram (Fig. 2) show the stages of design and implementation of the proposed algorithm. As shown in Fig. 1, in stage 1, the learning Algorithm 1 finds a sparse controller () suitable for damping oscillations of the actual system. In stage 2, we apply right after learning it to damp oscillations. We also apply it at , when new disturbances occur, assuming the system model remains unchanged. This assumption is based on typical power system conditions for disturbances within a short time interval of each other. In stage 1, centralized computation is carried out to learn the controller as shown in Fig. 2(a). Stage 1 consists of two phases. In phase one (steps 4 to 20 of Alg. 1), both the critic and the actor estimators are updating. While the actor is sparse with the sparsity constraint , its sparsity pattern can change, i.e. the indices of non-zero elements in might change. When the critic converges to its final value (sooner than the actor), the phase one finishes, and the structure of the sparsity-constrained actor is fixed.
In phase two, only the actor parameters are updated while the sparsity pattern, which determines the communication graph between the nodes of the plant, is fixed. Phase 2 extends over steps 4-20, but excludes steps 10-13, i.e. the critic vector update. Finally, Fig. 1(b) illustrates stage 2. It shows a sparse controller produced by the learning algorithm for a typical 4-generator grid. In this case, the sparse feedback gain matrix is fixed and known to each generator. Thus there is no need for the central controller, and each generator can compute its own control input using state measurements from other generators and , thus implementing eq. (9) in a distributed fashion. Here, the generator pairs (2,3) and (3,4) do not communicate. Assuming each generator has one state, the sparsity constraint for this is given by .
Iv Numerical Results
The IEEE 39-bus model of New England power system is used in this section to study the effectiveness and performance of the proposed algorithm. This model consists of 10 synchronous generators. The state-space model in the form of (9) has the state vector of size 75, i.e. and 9 control inputs, resulting in and . The details of the model can be found in . The feedback gain matrix has 675 entries, so the LQR controller requires a dense communication network to provide the state measurements to the generators. Following the discussion in section 2.C, the uncertainty of the power system model is limited to the parameters corresponding to inertias of the machines and line reactances, which results in 1132 uncertain parameters of the matrix . The level of uncertainty in both and is measured by the parameter . To generate uncertain parameters, we randomly vary the corresponding entries of using the uniform distribution within of their original values in and matrices, respectively. Finally, the duration of the exploration noise in Algorithm 1 is 2 seconds.
Several WAC scenarios are simulated using two different values of 70% and 100%. We compare four controllers: the ideal LQR controller (), designed assuming the knowledge of the actual system model ; the mismatched LQR controller (), designed for the nominal system and applied to the actual system ; the dense learning controller () found by Algorithm 1 for the maximum value of ; the sparse learning controller () designed using Algorithm 1 for a range of values (). Moreover, performance of the open-loop system controlled by local PSS (power system stabilizers), designed for , is shown for comparison.
First, we assume that only the nominal system is known, the disturbance happens at , and the the learning algorithm converges at (see Fig. 1). In figure 3, we illustrate the rotor speed of the generators for different controllers and two uncertainty levels. Moreover, the convergence time of the learning algorithm is illustrated as in Fig. 3(e-h). Note that when , the mismatched LQR becomes unstable while the learning controller suppresses oscillations in both sparse and dense cases. The convergence time of Algorithm 1 increases with the level of uncertainty for both sparse and dense controllers.
Next, we apply learned in stage 1 (fig 2(a)) to damp the oscillations caused by new disturbances (see fig 1) for both uncertainty levels. Figure 4 indicates the rotor speed of the generators in this scenario. Note that the mismatched LQR is still unstable when while successfully damps the oscillations for both -values. We found that does not depend on the initial condition at . Therefore, designed by Alg. 1 performs well for any incoming disturbances.
The performance of the controllers in terms of the objective function values (8), and their convergence times are reported in Table 2. The -value is calculated from (when the initial disturbance happens) to seconds when the oscillations are practically damped. Note that in this case the overall energy is not significantly affected by learning, due to relatively short duration of Stage 1 when the nominal knowledge of the system is used, and the -values are comparable in Fig. 3 and 4 for fixed -values. In fact, is greater in Fig. 4 (without learning) due to a different disturbance.
As expected, the value of the objective function (8) increases as the sparsity level grows (decreasing ). This is due not only to sub-optimality of the sparse controller, but also to increased duration of learning, and thus more persistent oscillations for lower values. Figure 4 illustrates the trade-off between the sparsity constraint and for the uncertainty level of 70%. Note that these results were generated for a different disturbance than in Fig. 3-4 and Table 2. We observe that as approaches its maximum value of 675, converges to the optimal value (-LQR) of 5.9466. Moreover, the performance loss is small for . As decreases below 300, the energy degrades significantly. Thus, there is a trade-off between the optimality of the controller and the communication cost, and choosing provides an acceptable controller performance at significantly decreased communication cost.
Finally, the convergence time of Alg. 1 (also shown in Table 2) depends on two factors. First, it increases with the uncertainty level As the deviation between the nominal model and the actual model increases, the convergence time also increases sharply since the initial critic and actor do not approximate their optimal values closely for large -values. Moreover, decreasing the sparsity constraint results in slower convergence. For sparsity levels between -675 the convergence time of Stage 1 ranges from 7.34 to 2.29 seconds, assuming . If Alg. 1 was initialized with randomly generated actor and critic as in , the convergence time would be on the order of hours and thus exceed significantly the acceptable range for WAC applications. Moreover, the exploration noise duration would increase greatly as well, adding extra oscillations to the system and degrading the quality of service during that period. Hence, using the knowledge of the nominal model to initialize the RL algorithm enables its application in WAC.
Finally, we briefly discuss the convergence and stability properties of the proposed algorithm. While the problem is non-convex, and convergence and stability guarantees cannot be provided even in the known models cases , , extensive numerical studies show that the proposed algorithm converges in many practical scenarios for all uncertainly values and sparsity ranges , and the closed-loop system is observed to be stable. For , the convergence time increases dramatically, thus precluding utilization of extremely sparse controllers in WAC when employing RL.
Wide-area oscillation damping control was developed for unknown power systems using online, data-driven reinforcement learning. First, the convergence time was reduced significantly by using the knowledge of the nominal model. Second, the communication cost of WAC was constrained by the GraSP method. The effectiveness of the proposed controller was illustrated for the IEEE New England power system model with uncertain parameters. It was demonstrated that the proposed sparse controller successfully damps the wide-area oscillations even for highly uncertain systems, where the LQR controller matched to the nominal model becomes unstable. Future work will further investigate the convergence properties of the proposed RL algorithm and extend this centralized learning method to distributed and multi-agent implementations.
- N. R. Chaudhuri, D. Chakraborty, and B. Chaudhuri, “Damping control in power systems under constrained communication bandwidth: A predictor corrector strategy,” IEEE Transactions on Control Systems Technology, vol. 20, no. 1, pp. 223–231, 2012.
- S. Zhang and V. Vittal, “Design of wide-area power system damping controllers resilient to communication failures,” IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4292–4300, 2013.
- R. A. Jabr, B. C. Pal, and N. Martins, “A sequential conic programming approach for the coordinated and robust design of power system stabilizers,” IEEE Transactions on Power Systems, vol. 25, no. 3, pp. 1627–1637, 2010.
- J. H. Chow and S. G. Ghiocel, “An adaptive wide-area power system damping controller using synchrophasor data,” in Control and Optimization Methods for Electric Smart Grids. Springer, 2012, pp. 327–342.
- F. Dörfler, M. R. Jovanović, M. Chertkov, and F. Bullo, “Sparsity-promoting optimal wide-area control of power networks,” IEEE Transactions on Power Systems, vol. 29, no. 5, pp. 2281–2291, 2014.
- C. Lidström and A. Rantzer, “Optimal state feedback for systems with symmetric and hurwitz state matrix,” in American Control Conference (ACC), 2016. IEEE, 2016, pp. 3366–3371.
- R. Arastoo, M. Bahavarnia, M. V. Kothare, and N. Motee, “Closed-loop feedback sparsification under parametric uncertainties,” in Decision and Control (CDC), 2016 IEEE 55th Conference on. IEEE, 2016, pp. 123–128.
- F. Lian, A. Chakrabortty, F. Wu, and A. Duel-Hallen, “Sparsity-constrained mixed H/H control,” in 2018 American Control Conference (ACC), 2018.
- F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic programming for feedback control. John Wiley & Sons, 2013, vol. 17.
- B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2017.
- T. Bian, Y. Jiang, and Z.-P. Jiang, “Decentralized adaptive optimal control of large-scale systems with application to power systems,” IEEE Transactions on Industrial Electronics, vol. 62, no. 4, pp. 2439–2447, 2015.
- F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control. John Wiley & Sons, 2012.
- Y. Jiang and Z.-P. Jiang, “Robust adaptive dynamic programming with an application to power systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 7, pp. 1150–1156, 2013.
- X. Zhong, H. He, D. Wang, and Z. Ni, “Model-free adaptive control for unknown nonlinear zero-sum differential game,” IEEE transactions on Cybernetics, 2017.
- J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, no. 11, pp. 2850–2859, 2012.
- A. Parikh, R. Kamalapurkar, and W. E. Dixon, “Integral concurrent learning: Adaptive control with parameter convergence without pe or state derivatives,” arXiv preprint arXiv:1512.03464, 2015.
- K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
- S. Bahmani, B. Raj, and P. T. Boufounos, “Greedy sparsity-constrained optimization,” Journal of Machine Learning Research, vol. 14, no. Mar, pp. 807–841, 2013.
- F. Lian, A. Chakrabortty, and A. Duel-Hallen, “Game-theoretic multi-agent control and network cost allocation under communication constraints,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 2, pp. 330–340, 2017.
- A. Jain, A. Chakrabortty, and E. Biyik, “An online structurally constrained LQR design for damping oscillations in power system networks,” in American Control Conference (ACC), 2017. IEEE, 2017, pp. 2093–2098.
- P. Kundur, N. J. Balu, and M. G. Lauby, Power system stability and control. McGraw-hill New York, 1994, vol. 7.
- A. K. Singh and B. C. Pal, “Decentralized dynamic state estimation in power systems using unscented transformation,” IEEE Transactions on Power Systems, vol. 29, no. 2, pp. 794–804, 2014.
- I. A. Hiskens and J. Alseddiqui, “Sensitivity, approximation, and uncertainty in power system dynamic simulation,” IEEE Transactions on Power Systems, vol. 21, no. 4, pp. 1808–1820, 2006.