Learning robust controllers for LQR systems

# Learning Robust Controllers for Linear Quadratic Systems with Multiplicative Noise via Policy Gradient

Benjamin Gravell, Peyman Mohajerin Esfahani, and Tyler Summers
July 1, 2019
###### Abstract.

The linear quadratic regulator (LQR) problem has reemerged as an important theoretical benchmark for reinforcement learning-based control of complex dynamical systems with continuous state and action spaces. In contrast with nearly all recent work in this area, we consider multiplicative noise models, which are increasingly relevant because they explicitly incorporate inherent uncertainty and variation in the system dynamics and thereby improve robustness properties of the controller. Robustness is a critical and poorly understood issue in reinforcement learning; existing methods which do not account for uncertainty can converge to fragile policies or fail to converge at all. Additionally, intentional injection of multiplicative noise into learning algorithms can enhance robustness of policies, as observed in ad hoc work on domain randomization. Although policy gradient algorithms require optimization of a non-convex cost function, we show that the multiplicative noise LQR cost has a special property called gradient domination, which is exploited to prove global convergence of policy gradient algorithms to the globally optimum control policy with polynomial dependence on problem parameters. Results are provided both in the model-known and model-unknown settings where samples of system trajectories are used to estimate policy gradients.

The authors are with the Control, Optimization, and Networks lab, UT Dallas, The United States ({benjamin.gravell, tyler.summers}@utdallas.edu), and the Delft Center for Systems and Control, TU Delft, The Netherlands (P.MohajerinEsfahani@tudelft.nl). This material is based on work supported by the Army Research Office under grant W911NF-17-1-0058.

## 1. Introduction

Reinforcement learning-based control has recently achieved impressive successes in games [31, 32] and simulators [28]. But these successes are significantly more challenging to translate to complex physical systems with continuous state and action spaces, safety constraints, and non-negligible operation and failure costs that demand data efficiency. An intense and growing research effort is creating a large array of models, algorithms, and heuristics for approaching the myriad of challenges arising from these systems. To complement a dominant trend of more computationally focused work, the canonical linear quadratic regulator (LQR) problem in control theory has reemerged as an important theoretical benchmark for learning-based control [30, 12]. Despite its long history, there remain fundamental open questions for LQR with unknown models, and a foundational understanding of learning in LQR problems can give insight into more challenging problems.

All recent work on learning in LQR problems has utilized either deterministic or additive noise models [30, 12, 14, 8, 15, 1, 23, 35, 2, 37, 26], but here we consider multiplicative noise models. In control theory, multiplicative noise models have been studied almost as long as their deterministic and additive noise counterparts [39, 11], although this area is somewhat less developed and far less widely known. We believe the study of learning in LQR problems with multiplicative noise is important for three reasons. First, this class of models is much richer than deterministic or additive noise while still allowing exact solutions when models are known, which makes it a compelling additional benchmark. Second, they explicitly incorporate model uncertainty and inherent stochasticity, thereby improving robustness properties of the controller. Robustness is a critical and poorly understood issue in reinforcement learning; existing methods which do not account for uncertainty can converge to fragile policies or fail to converge at all. Additionally, intentional injection of multiplicative noise into learning algorithms is known to enhance robustness of policies from ad hoc work on domain randomization [33]. Moreover, stochastic representations of model uncertainty (via multiplicative noise) are perhaps most natural when models are estimated from noisy and incomplete data; these representations can be obtained directly from non-asymptotic statistical concentration bounds and bootstrap methods. Third, in emerging difficult-to-model complex systems where learning-based control approaches are perhaps most promising, multiplicative noise models are increasingly relevant; examples include networked control systems with noisy communication channels [3, 17], modern power networks with large penetration of intermittent renewables [10, 27], turbulent fluid flow [25], and neuronal brain networks [9].

### 1.1. Related literature

Multiplicative noise LQR problems have been studied in control theory since the 1960s [39]. Since then a line of research parallel to deterministic and additive noise has developed, including basic stability and stabilizability results [38], semidefinite programming formulations [13, 7, 24], robustness properties [11, 6, 19, 4], and numerical algorithms [5]. This line of research is less widely known perhaps because much of it studies continuous time systems, where the heavy machinery required to formalize stochastic differential equations is a barrier to entry for a broad audience. Multiplicative noise models are well-poised to offer data-driven model uncertainty representations and enhanced robustness in learning-based control algorithms and complex dynamical systems and processes.

Recent work on learning in LQR problems has focused entirely on deterministic or additive noise models. In contrast to classical work on system identification and adaptive control, which has a strong focus on asymptotic results, more recent work has focused on non-asymptotic analysis using recent tools from statistics and machine learning. There remain fundamental open problems for learning in LQR problems, with several addressed only recently, including non-asymptotic sample complexity [12, 35], regret bounds [1, 2, 26], and algorithmic convergence [14].

### 1.2. Our contributions

We give several fundamental results for policy gradient algorithms on linear quadratic problems with multiplicative noise. Our main contributions, which can be viewed as a generalization of the recent results of Fazel et al. [14] for deterministic LQR to multiplicative noise LQR, are as follows:

• In §3.1 we show that although the multiplicative noise LQR cost is generally non-convex, it has a special property called gradient domination, which facilitates its optimization (Lemmas 3.1 and 3.2).

• In particular, in §3.2 the gradient domination property is exploited to prove global convergence of three policy gradient algorithm variants (namely, exact gradient descent, “natural” gradient descent, and Gauss-Newton/policy iteration) to the globally optimum control policy with a rate that depends polynomially on problem parameters (Theorems 3.4, 3.5, and 3.6).

• Furthermore, in §4 we show that a model-free policy gradient algorithm, where the cost gradient is estimated from trajectory data rather than computed from model parameters, also converges globally (with high probability) with an appropriate exploration scheme and sufficiently many samples (also polynomial in problem data) (Theorem 4.1).

• When the multiplicative noise variances are all zero, we recover the step sizes and convergence rates of [14].

Thus, policy gradient algorithms for the multiplicative noise LQR problem enjoy the same global convergence properties as deterministic LQR, while significantly enhancing the resulting controller's robustness to variations and inherent stochasticity in the system dynamics, as demonstrated by our numerical experiments in §5.

To the best of our knowledge, the present paper is the first work to consider and obtain global convergence results using reinforcement learning algorithms for the multiplicative noise LQR problem. Our approach allows the explicit incorporation of a model uncertainty representation that significantly improves the robustness of the controller compared to deterministic and additive noise approaches.

## 2. Linear Quadratic Optimal Control with Multiplicative Noise

We consider the linear quadratic regulator problem with multiplicative noise

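$$
\begin{aligned}
\underset{\pi \in \Pi}{\text{minimize}} \quad & \mathbb{E}_{x_0, \{\delta_{ti}\}, \{\gamma_{tj}\}} \sum_{t=0}^{\infty} \left( x_t^\top Q x_t + u_t^\top R u_t \right), \\
\text{subject to} \quad & x_{t+1} = \Big( A + \sum_{i=1}^{p} \delta_{ti} A_i \Big) x_t + \Big( B + \sum_{j=1}^{q} \gamma_{tj} B_j \Big) u_t,
\end{aligned}
\tag{1}
$$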

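where $x_t \in \mathbb{R}^n$ is the system state, $u_t \in \mathbb{R}^m$ is the control input, the initial state $x_0$ is distributed according to a given distribution, and $Q$ and $R$ are the state and input cost matrices. The dynamics are described by a dynamics matrix $A \in \mathbb{R}^{n \times n}$ and input matrix $B \in \mathbb{R}^{n \times m}$ and incorporate multiplicative noise terms modeled by the i.i.d. (across time), zero-mean, mutually independent scalar random variables $\delta_{ti}$ and $\gamma_{tj}$, which have variances $\alpha_i$ and $\beta_j$, respectively. The matrices $A_i \in \mathbb{R}^{n \times n}$ and $B_j \in \mathbb{R}^{n \times m}$ specify how each scalar noise term affects the system dynamics and input matrices. Equivalently, the terms $\sum_{i=1}^{p} \delta_{ti} A_i$ and $\sum_{j=1}^{q} \gamma_{tj} B_j$ are zero-mean random matrices with a joint covariance structure over their entries. We define the covariance matrices of the (vectorized) entries of these two random matrices; the variances $\alpha_i$ and $\beta_j$ and matrices $A_i$ and $B_j$ are simply the eigenvalues and (reshaped) eigenvectors of these covariance matrices, respectively. (We assume that the two random matrices are independent for simplicity, but it is also straightforward to include correlations between their entries into the model.) The goal is to determine an optimal closed-loop state feedback policy $\pi$ with $u_t = \pi(x_t)$ from a set $\Pi$ of admissible policies.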

We assume that the problem data $A$, $B$, $\alpha_i$, $A_i$, $\beta_j$, and $B_j$ permit existence and finiteness of the optimal value of the problem, in which case the system is called mean-square stabilizable; this requires mean-square stability of the closed-loop system [22, 38]. The system in (1) is called mean-square stable if $\lim_{t \to \infty} \mathbb{E}[x_t x_t^\top] = 0$ for any given initial covariance $\mathbb{E}[x_0 x_0^\top]$. Mean-square stability is a form of robust stability, requiring stricter and more complicated conditions than stabilizability of the nominal system $(A, B)$. This essentially limits the size of the multiplicative noise covariance, which can be viewed as a representation of uncertainty in the nominal system model or as inherent variation in the system dynamics.
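As a rough illustration of the definition above, mean-square stability under a fixed gain can be checked numerically. The second moment $\mathbb{E}[x_t x_t^\top]$ evolves linearly, so it decays to zero from every initial covariance exactly when the spectral radius of the matrix representing that linear map is below one. The sketch below assumes NumPy; the function names and the toy data are hypothetical and not taken from the paper's code.

```python
import numpy as np

def second_moment_operator(A, B, K, alphas, As, betas, Bs):
    """Matrix T with vec(Sigma_{t+1}) = T @ vec(Sigma_t) under u_t = K x_t."""
    AK = A + B @ K
    T = np.kron(AK, AK)
    for alpha, Ai in zip(alphas, As):
        T = T + alpha * np.kron(Ai, Ai)
    for beta, Bj in zip(betas, Bs):
        BjK = Bj @ K
        T = T + beta * np.kron(BjK, BjK)
    return T

def ms_spectral_radius(A, B, K, alphas, As, betas, Bs):
    """Spectral radius of the second-moment map; < 1 means mean-square stable."""
    T = second_moment_operator(A, B, K, alphas, As, betas, Bs)
    return float(np.max(np.abs(np.linalg.eigvals(T))))

# Hypothetical toy data: the nominal system is stable (spectral radius 0.9).
A = 0.9 * np.eye(2)
B = np.array([[0.0], [1.0]])
K = np.zeros((1, 2))                 # open loop
As, Bs = [np.eye(2)], [B.copy()]     # noise directions (assumed)

rho_small = ms_spectral_radius(A, B, K, [0.1], As, [0.0], Bs)
rho_large = ms_spectral_radius(A, B, K, [0.5], As, [0.0], Bs)
print(rho_small, rho_large)  # the larger variance breaks mean-square stability
```

With these numbers the nominal system is stable in both cases, but only the smaller state-noise variance keeps the second moment bounded.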

### 2.1. Control design with known models: Value Iteration

Dynamic programming can be used to show that the optimal policy is linear state feedback $u_t = K^* x_t$, where $K^* \in \mathbb{R}^{m \times n}$ denotes the optimal gain matrix, and the resulting optimal cost $V(x_0)$ for a fixed initial state $x_0$ is quadratic, i.e., $V(x_0) = x_0^\top P x_0$, where $P \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix. When the model parameters are known, there are several ways to compute the optimal feedback gains and corresponding optimal cost. The optimal cost is given by the solution of the generalized Riccati equation

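$$
P = Q + A^\top P A + \sum_{i=1}^{p} \alpha_i A_i^\top P A_i - A^\top P B \Big( R + B^\top P B + \sum_{j=1}^{q} \beta_j B_j^\top P B_j \Big)^{-1} B^\top P A.
$$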

This can be solved via the value iteration recursion

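$$
P_{t+1} = Q + A^\top P_t A + \sum_{i=1}^{p} \alpha_i A_i^\top P_t A_i - A^\top P_t B \Big( R + B^\top P_t B + \sum_{j=1}^{q} \beta_j B_j^\top P_t B_j \Big)^{-1} B^\top P_t A,
$$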

with $P_0 = Q$, or via semidefinite programming formulations (see, e.g., [7, 13, 24]). The corresponding optimal gain matrix is then

$$
K^* = -\Big( R + B^\top P B + \sum_{j=1}^{q} \beta_j B_j^\top P B_j \Big)^{-1} B^\top P A.
$$
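The value iteration recursion and the gain formula above can be sketched in a few lines. This is a minimal illustration assuming NumPy, a fixed iteration count instead of a convergence test, and the initialization $P_0 = Q$; the function name and the scalar test system are hypothetical.

```python
import numpy as np

def riccati_value_iteration(A, B, Q, R, alphas, As, betas, Bs, iters=500):
    """Iterate the generalized Riccati recursion and return (P, K_star)."""
    P = Q.copy()
    for _ in range(iters):
        # Input-side term: R + B'PB + sum_j beta_j B_j' P B_j
        G = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
        P = (Q + A.T @ P @ A
             + sum(a * Ai.T @ P @ Ai for a, Ai in zip(alphas, As))
             - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A))
    G = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
    K_star = -np.linalg.solve(G, B.T @ P @ A)
    return P, K_star

# Hypothetical scalar system: a = b = q = r = 1, alpha = 0.1, beta = 0.2.
one = np.array([[1.0]])
P_noisy, K_noisy = riccati_value_iteration(one, one, one, one, [0.1], [one], [0.2], [one])
P_det, K_det = riccati_value_iteration(one, one, one, one, [], [], [], [])
print(P_det[0, 0], P_noisy[0, 0])  # the noisy problem has the larger optimal cost
```

For this scalar example the noise-free fixed point is the golden ratio, and the noisy fixed point solves $0.88 P^2 - 1.3 P - 1 = 0$, which gives a quick way to sanity check the iteration.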

### 2.2. Control design with known models: Policy Gradient and Policy Iteration

Here we consider an alternative approach that facilitates data-driven approaches for learning optimal and robust policies. For a fixed linear state feedback policy $u_t = K x_t$, the closed-loop dynamics become

$$
x_{t+1} = \Big( \Big( A + \sum_{i=1}^{p} \delta_{ti} A_i \Big) + \Big( B + \sum_{j=1}^{q} \gamma_{tj} B_j \Big) K \Big) x_t,
$$

and we define the corresponding value function for $x_0 = x$

$$
V_K(x) = \mathbb{E}_{\{\delta_{ti}\}, \{\gamma_{tj}\}} \sum_{t=0}^{\infty} x_t^\top (Q + K^\top R K) x_t.
$$

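If $K$ gives closed-loop mean-square stability then the value function can be written as $V_K(x) = x^\top P_K x$, where $P_K$ is the unique positive semidefinite solution to the generalized Lyapunov equation

$$
P_K = Q + K^\top R K + (A + BK)^\top P_K (A + BK) + \sum_{i=1}^{p} \alpha_i A_i^\top P_K A_i + \sum_{j=1}^{q} \beta_j K^\top B_j^\top P_K B_j K. \tag{2}
$$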

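Further, we define the state covariance matrices $\Sigma_t = \mathbb{E}[x_t x_t^\top]$, which satisfy the recursion

$$
\Sigma_{t+1} = (A + BK) \Sigma_t (A + BK)^\top + \sum_{i=1}^{p} \alpha_i A_i \Sigma_t A_i^\top + \sum_{j=1}^{q} \beta_j B_j K \Sigma_t K^\top B_j^\top.
$$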

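Defining the infinite-horizon aggregate state covariance matrix $\Sigma_K = \sum_{t=0}^{\infty} \Sigma_t$, then provided that $K$ gives closed-loop mean-square stability, $\Sigma_K$ also satisfies a generalized Lyapunov equation

$$
\Sigma_K = \Sigma_0 + (A + BK) \Sigma_K (A + BK)^\top + \sum_{i=1}^{p} \alpha_i A_i \Sigma_K A_i^\top + \sum_{j=1}^{q} \beta_j B_j K \Sigma_K K^\top B_j^\top. \tag{3}
$$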

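Defining the cost achieved by a gain matrix $K$ by $C(K) = \mathbb{E}_{x_0} V_K(x_0)$, we have

$$
C(K) = \begin{cases} \mathrm{trace}\big((Q + K^\top R K) \Sigma_K\big) = \mathrm{trace}(P_K \Sigma_0) & \text{if } K \text{ is mean-square stabilizing,} \\ \infty & \text{otherwise.} \end{cases}
$$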
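The generalized Lyapunov equations (2) and (3) are linear in $P_K$ and $\Sigma_K$, so for small systems one simple way to solve them is by vectorization with Kronecker products. The sketch below takes that route and checks numerically that the two expressions for the cost agree. It assumes NumPy and row-major vectorization; the helper names and the example system are hypothetical, and this is not the specialized solver used by the authors.

```python
import numpy as np

def second_moment_operator(A, B, K, alphas, As, betas, Bs):
    """Matrix T with vec(Sigma_{t+1}) = T @ vec(Sigma_t) (row-major vec)."""
    AK = A + B @ K
    T = np.kron(AK, AK)
    for alpha, Ai in zip(alphas, As):
        T = T + alpha * np.kron(Ai, Ai)
    for beta, Bj in zip(betas, Bs):
        T = T + beta * np.kron(Bj @ K, Bj @ K)
    return T

def lyapunov_pair(A, B, K, Q, R, Sigma0, alphas, As, betas, Bs):
    """Return (P_K, Sigma_K) from (2) and (3), or None if K is not MS-stabilizing."""
    n = A.shape[0]
    T = second_moment_operator(A, B, K, alphas, As, betas, Bs)
    if np.max(np.abs(np.linalg.eigvals(T))) >= 1.0:
        return None
    I = np.eye(n * n)
    QK = Q + K.T @ R @ K
    P = np.linalg.solve(I - T.T, QK.reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - T, Sigma0.reshape(-1)).reshape(n, n)
    return P, Sigma

def cost(A, B, K, Q, R, Sigma0, alphas, As, betas, Bs):
    """C(K) = trace(P_K Sigma_0), or infinity if K is not MS-stabilizing."""
    pair = lyapunov_pair(A, B, K, Q, R, Sigma0, alphas, As, betas, Bs)
    return float("inf") if pair is None else float(np.trace(pair[0] @ Sigma0))

# Hypothetical example system (not from the paper).
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)
alphas, As = [0.05], [np.array([[0.0, 1.0], [0.0, 0.0]])]
betas, Bs = [0.1], [B.copy()]
K = np.array([[-0.1, -0.2]])

P, Sigma = lyapunov_pair(A, B, K, Q, R, Sigma0, alphas, As, betas, Bs)
c_via_sigma = float(np.trace((Q + K.T @ R @ K) @ Sigma))
c_via_p = float(np.trace(P @ Sigma0))
print(c_via_sigma, c_via_p)  # the two cost expressions should coincide
```

The Kronecker approach scales as $n^6$ in the state dimension, so it is only a convenience for small examples.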

This leads to the idea of performing gradient descent on $C(K)$ (i.e., policy gradient) via the update $K \leftarrow K - \eta \nabla_K C(K)$ to find the optimal gain matrix. However, two properties of the LQR cost function complicate a convergence analysis of gradient descent. First, $C(K)$ is extended valued since not all gain matrices provide closed-loop mean-square stability, so it does not have (global) Lipschitz gradients. Second, and even more concerning, $C(K)$ is generally non-convex in $K$ (even for deterministic LQR problems, as observed by Fazel et al. [14]), so it is unclear if and when gradient descent converges to the global optimum, or if it even converges at all. Fortunately, as in the deterministic case, we show that the multiplicative noise LQR cost possesses further key properties that enable proof of global convergence despite the lack of Lipschitz gradients and non-convexity.

## 3. Gradient Domination and Global Convergence of Policy Gradient

In this section, we demonstrate that the multiplicative noise LQR cost function is gradient dominated, which facilitates optimization by gradient descent. Gradient dominated functions have been studied for many years in the optimization literature [29] and have recently been discovered in deterministic LQR problems by [14]. We then show that the policy gradient algorithm and two important variants for multiplicative noise LQR converge globally to the optimal policy. In contrast with [14], the policies we obtain are robust to uncertainties and inherent stochastic variations in the system dynamics. The proofs of all technical results can be found in the Appendices.

### 3.1. Multiplicative Noise LQR Cost is Gradient Dominated

First, we give the expression for the policy gradient for the multiplicative noise LQR cost.

###### Lemma 3.1 (Policy Gradient Expression).

The policy gradient is given by

$$
\nabla_K C(K) = 2 \Big[ \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A \Big] \Sigma_K.
$$
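One way to build confidence in the gradient expression of Lemma 3.1 is to compare it against central finite differences of the cost. The sketch below does this on a small example. It assumes NumPy and a mean-square stabilizing gain; the names `plant`, `lyapunov_pair`, and the example data are hypothetical.

```python
import numpy as np

def lyapunov_pair(plant, K):
    """Solve (2) and (3) by vectorization; assumes K is mean-square stabilizing."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    n = A.shape[0]
    AK = A + B @ K
    T = np.kron(AK, AK)
    T = T + sum(a * np.kron(Ai, Ai) for a, Ai in zip(alphas, As))
    T = T + sum(b * np.kron(Bj @ K, Bj @ K) for b, Bj in zip(betas, Bs))
    I = np.eye(n * n)
    P = np.linalg.solve(I - T.T, (Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - T, S0.reshape(-1)).reshape(n, n)
    return P, Sigma

def cost(plant, K):
    """C(K) = trace(P_K Sigma_0)."""
    P, _ = lyapunov_pair(plant, K)
    return float(np.trace(P @ plant[4]))

def policy_gradient(plant, K):
    """Gradient expression from Lemma 3.1."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    P, Sigma = lyapunov_pair(plant, K)
    RK = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
    return 2.0 * (RK @ K + B.T @ P @ A) @ Sigma

def finite_difference_gradient(plant, K, h=1e-6):
    """Entrywise central differences of the cost."""
    G = np.zeros_like(K)
    for idx in np.ndindex(*K.shape):
        E = np.zeros_like(K)
        E[idx] = h
        G[idx] = (cost(plant, K + E) - cost(plant, K - E)) / (2.0 * h)
    return G

# Hypothetical example system (not from the paper).
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.array([[0.0], [1.0]])
plant = (A, B, np.eye(2), np.eye(1), np.eye(2),
         [0.05], [np.array([[0.0, 1.0], [0.0, 0.0]])], [0.1], [B.copy()])
K = np.array([[-0.1, -0.2]])

g_exact = policy_gradient(plant, K)
g_fd = finite_difference_gradient(plant, K)
print(g_exact, g_fd)
```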

Next, we see that the multiplicative noise LQR cost is gradient dominated.

###### Lemma 3.2 (Gradient domination).

The multiplicative noise LQR cost satisfies the gradient domination condition

$$
\|\nabla_K C(K)\|_F^2 \ge \frac{4 \sigma_{\min}(R) \sigma_{\min}(\Sigma_0)^2}{\|\Sigma_{K^*}\|} \big( C(K) - C(K^*) \big).
$$
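The gradient domination inequality can be spot-checked numerically at a single suboptimal gain. The sketch below computes $K^*$ by iterating the generalized Riccati recursion, then evaluates both sides of the inequality. It assumes NumPy; all names and the example system are hypothetical, and a single check like this is only a sanity test, not evidence for the lemma in general.

```python
import numpy as np

def lyapunov_pair(plant, K):
    """Solve (2) and (3) by vectorization; assumes K is mean-square stabilizing."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    n = A.shape[0]
    AK = A + B @ K
    T = np.kron(AK, AK)
    T = T + sum(a * np.kron(Ai, Ai) for a, Ai in zip(alphas, As))
    T = T + sum(b * np.kron(Bj @ K, Bj @ K) for b, Bj in zip(betas, Bs))
    I = np.eye(n * n)
    P = np.linalg.solve(I - T.T, (Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - T, S0.reshape(-1)).reshape(n, n)
    return P, Sigma

def optimal_gain(plant, iters=500):
    """K* from the generalized Riccati recursion started at P_0 = Q."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    P = Q.copy()
    for _ in range(iters):
        G = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
        P = (Q + A.T @ P @ A + sum(a * Ai.T @ P @ Ai for a, Ai in zip(alphas, As))
             - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A))
    G = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
    return -np.linalg.solve(G, B.T @ P @ A)

def sigma_min(M):
    return float(np.linalg.svd(M, compute_uv=False).min())

# Hypothetical example system (not from the paper).
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.array([[0.0], [1.0]])
Q, R, S0 = np.eye(2), np.eye(1), np.eye(2)
betas, Bs = [0.1], [B.copy()]
plant = (A, B, Q, R, S0, [0.05], [np.array([[0.0, 1.0], [0.0, 0.0]])], betas, Bs)
K = np.array([[-0.1, -0.2]])

K_star = optimal_gain(plant)
P, Sigma = lyapunov_pair(plant, K)
P_star, Sigma_star = lyapunov_pair(plant, K_star)
RK = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
grad = 2.0 * (RK @ K + B.T @ P @ A) @ Sigma

gap = float(np.trace(P @ S0) - np.trace(P_star @ S0))        # C(K) - C(K*)
lhs = float(np.linalg.norm(grad, "fro") ** 2)
rhs = 4.0 * sigma_min(R) * sigma_min(S0) ** 2 / np.linalg.norm(Sigma_star, 2) * gap
print(lhs, rhs, gap)
```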

The gradient domination property gives the following stationary point characterization.

###### Corollary 3.3.

If $\nabla_K C(K) = 0$ then either $K = K^*$ or $\operatorname{rank}(\Sigma_K) < n$.

In other words, so long as $\Sigma_K$ is full rank, stationarity is both necessary and sufficient for global optimality, as for convex functions. Note that to ensure that $\Sigma_K$ is full rank, it is not sufficient to simply have multiplicative noise in the dynamics with a deterministic initial state $x_0$. To see this, simply observe that if $x_0 = 0$ and $\Sigma_0 = 0$ then $\Sigma_K = 0$, which is clearly rank deficient. By contrast, additive noise is sufficient to ensure that $\Sigma_K$ is full rank with a deterministic initial state $x_0$. Taking $\Sigma_0 \succ 0$ ensures $\operatorname{rank}(\Sigma_K) = n$ and thus $\nabla_K C(K) = 0$ implies $K = K^*$.

Although the gradient of the multiplicative noise LQR cost is not globally Lipschitz continuous, it is locally Lipschitz continuous over any subset of its domain (i.e., over any set of mean-square stabilizing gain matrices). The gradient domination is then sufficient to show that policy gradient descent will converge to the optimal gains at a linear rate (a short proof of this fact for globally Lipschitz functions is given in [21]). We prove this convergence of policy gradient to the optimum feedback gain by bounding the local Lipschitz constant in terms of the problem data, which bounds the maximum step size and the convergence rate.

### 3.2. Global Convergence of Policy Gradient for Multiplicative Noise LQR

We analyze three policy gradient algorithm variants:

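• Exact gradient descent: $K_{s+1} = K_s - \eta \nabla_K C(K_s)$

• Natural gradient descent: $K_{s+1} = K_s - \eta \nabla_K C(K_s) \Sigma_{K_s}^{-1}$

• Gauss-Newton/policy iteration: $K_{s+1} = K_s - \eta R_{K_s}^{-1} \nabla_K C(K_s) \Sigma_{K_s}^{-1}$

Here $R_K = R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j$, as in (35).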
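The three update rules can be written out directly once $P_K$, $\Sigma_K$, and the gradient are available. The sketch below performs one step of each from a common starting gain and compares the resulting costs. It assumes NumPy; the step sizes are illustrative choices (the natural gradient step uses half of the bound stated in Theorem 3.5 to stay conservative), and all names and data are hypothetical.

```python
import numpy as np

def lyapunov_pair(plant, K):
    """Solve (2) and (3) by vectorization; assumes K is mean-square stabilizing."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    n = A.shape[0]
    AK = A + B @ K
    T = np.kron(AK, AK)
    T = T + sum(a * np.kron(Ai, Ai) for a, Ai in zip(alphas, As))
    T = T + sum(b * np.kron(Bj @ K, Bj @ K) for b, Bj in zip(betas, Bs))
    I = np.eye(n * n)
    P = np.linalg.solve(I - T.T, (Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - T, S0.reshape(-1)).reshape(n, n)
    return P, Sigma

def step_quantities(plant, K):
    """Return P_K, Sigma_K, R_K and the policy gradient at K."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    P, Sigma = lyapunov_pair(plant, K)
    RK = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
    grad = 2.0 * (RK @ K + B.T @ P @ A) @ Sigma
    return P, Sigma, RK, grad

def cost(plant, K):
    return float(np.trace(lyapunov_pair(plant, K)[0] @ plant[4]))

# Hypothetical example system (not from the paper).
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)
betas, Bs = [0.1], [B.copy()]
plant = (A, B, Q, R, Sigma0, [0.05], [np.array([[0.0, 1.0], [0.0, 0.0]])], betas, Bs)
K0 = np.array([[-0.1, -0.2]])

P, Sigma, RK, grad = step_quantities(plant, K0)
c0 = cost(plant, K0)

eta_gd = 1e-3  # small illustrative step for exact gradient descent
h_B = np.linalg.norm(B, 2) ** 2 + sum(b * np.linalg.norm(Bj, 2) ** 2 for b, Bj in zip(betas, Bs))
smin_S0 = np.linalg.svd(Sigma0, compute_uv=False).min()
eta_ng = 0.5 / (np.linalg.norm(R, 2) + h_B * c0 / smin_S0)

K_gd = K0 - eta_gd * grad                                             # exact gradient
K_ng = K0 - eta_ng * grad @ np.linalg.inv(Sigma)                      # natural gradient
K_gn = K0 - 0.5 * np.linalg.solve(RK, grad) @ np.linalg.inv(Sigma)    # Gauss-Newton, eta = 1/2
K_pi = -np.linalg.solve(RK, B.T @ P @ A)                              # policy improvement step

print(c0, cost(plant, K_gd), cost(plant, K_ng), cost(plant, K_gn))
```

In this example the Gauss-Newton step with $\eta = 1/2$ coincides with the policy improvement gain `K_pi`, and each of the three steps lowers the cost.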

The more elaborate natural gradient and Gauss-Newton variants provide superior convergence rates and simpler proofs. A development of the natural policy gradient is given in [14] building on ideas from [20]. The Gauss-Newton step with step size $\eta = \frac{1}{2}$ is in fact identical to the policy improvement step in policy iteration (a short derivation is given in Appendix C.1) and was first studied for deterministic LQR by Hewer in 1971 [18]. This was extended to a model-free setting using policy iteration and Q-learning in [8], proving asymptotic convergence of the gain matrix to the optimal gain matrix. For multiplicative noise LQR, we have the following results. (We include a factor of 2 on the gradient expression that was erroneously dropped in [14]. This affects the step size restrictions by a corresponding factor of 2.)

###### Theorem 3.4 (Gauss-Newton/policy iteration convergence).

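Using the Gauss-Newton step

$$
K_{s+1} = K_s - \eta R_{K_s}^{-1} \nabla_K C(K_s) \Sigma_{K_s}^{-1}
$$

with step size $0 < \eta \le \frac{1}{2}$ gives global convergence to the optimal gain matrix $K^*$ at a linear rate described by

$$
C(K_{s+1}) - C(K^*) \le \left( 1 - 2 \eta \frac{\sigma_{\min}(\Sigma_0)}{\|\Sigma_{K^*}\|} \right) \big( C(K_s) - C(K^*) \big).
$$
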
###### Theorem 3.5 (Natural policy gradient convergence).

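Using the natural policy gradient step

$$
K_{s+1} = K_s - \eta \nabla_K C(K_s) \Sigma_{K_s}^{-1}
$$

with step size

$$
0 < \eta \le \left( \|R\| + \Big( \|B\|^2 + \sum_{j=1}^{q} \beta_j \|B_j\|^2 \Big) \frac{C(K_0)}{\sigma_{\min}(\Sigma_0)} \right)^{-1}
$$

gives global convergence to the optimal gain matrix $K^*$ at a linear rate described by

$$
C(K_{s+1}) - C(K^*) \le \left( 1 - 2 \eta \frac{\sigma_{\min}(R) \sigma_{\min}(\Sigma_0)}{\|\Sigma_{K^*}\|} \right) \big( C(K_s) - C(K^*) \big).
$$
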
###### Theorem 3.6 (Policy gradient convergence).

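Using the policy gradient step

$$
K_{s+1} = K_s - \eta \nabla_K C(K_s)
$$

with step size $0 < \eta \le c_{pg}$ gives global convergence to the optimal gain matrix $K^*$ at a linear rate described by

$$
C(K_{s+1}) - C(K^*) \le \left( 1 - \eta \frac{\sigma_{\min}(R) \sigma_{\min}(\Sigma_0)^2}{\|\Sigma_{K^*}\|} \right) \big( C(K_s) - C(K^*) \big)
$$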

where

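$$
c_{pg} = \frac{1}{16} \min \left\{ \frac{\|B\| \left( \dfrac{\sigma_{\min}(Q) \sigma_{\min}(\Sigma_0)}{C(K_0)} \right)^2}{\Big( \|B\|^2 + \sum_{j=1}^{q} \beta_j \|B_j\|^2 \Big) h_1(K_0) \big( \|A\| + \|B\| (h_2(K_0) + 1) \big)}, \ \frac{\sigma_{\min}(Q)}{C(K_0) \|R_{K_0}\|} \right\}
$$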

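and $h_1(K_0)$ and $h_2(K_0)$ are

$$
h_1(K_0) = \frac{2 C(K_0)}{\sigma_{\min}(Q)} \sqrt{\frac{\|R_{K_0}\| \big( C(K_0) - C(K^*) \big)}{\sigma_{\min}(\Sigma_0)}}
$$

and

$$
h_2(K_0) = \frac{1}{\sigma_{\min}(R)} \left( \sqrt{\frac{\|R_{K_0}\| \big( C(K_0) - C(K^*) \big)}{\sigma_{\min}(\Sigma_0)}} + \|B^\top P_{K_0} A\| \right).
$$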

The proofs for these results are provided in the Appendices and explicitly incorporate the effects of the multiplicative noise terms $\delta_{ti}$ and $\gamma_{tj}$ in the dynamics. For the exact and natural policy gradient algorithms, we show explicitly how the maximum allowable step size depends on problem data and in particular on the multiplicative noise terms. Compared to deterministic LQR, the multiplicative noise terms decrease the allowable step size and thereby decrease the convergence rate; specifically, the state-multiplicative noise increases the initial cost $C(K_0)$ and the norms of the covariance $\Sigma_{K^*}$ and cost matrix $P_K$, and the input-multiplicative noise also increases the denominator term $\|B\|^2 + \sum_{j=1}^{q} \beta_j \|B_j\|^2$. This means that the algorithm parameters for deterministic LQR in [14] may cause failure to converge on problems with multiplicative noise. Moreover, even the optimal policies for deterministic LQR may actually destabilize systems in the presence of small amounts of multiplicative noise uncertainty, indicating the possibility for a catastrophic lack of robustness. The results and proofs also differ from those of [14] because a more complicated form of stochastic stability (namely, mean-square stability) must be accounted for, and because generalized Lyapunov equations must be solved to compute the gradient steps, which requires specialized solvers.
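The destabilization effect can be seen in a hand-sized scalar example. The sketch below uses hypothetical numbers ($a = 2$, $b = q = r = 1$, and a deliberately large state-noise variance $\alpha = 0.9$ chosen only to make the effect obvious): the deterministic LQR gain stabilizes the nominal system but not the second moment, while another gain does achieve mean-square stability.

```python
import math

# Hypothetical scalar system x_{t+1} = (a + delta_t * a1) x_t + b u_t.
a, b, q, r = 2.0, 1.0, 1.0, 1.0
alpha, a1 = 0.9, 1.0   # state-multiplicative noise variance and direction (assumed)

# Deterministic LQR via the scalar Riccati recursion (noise ignored).
p = q
for _ in range(200):
    p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
k_det = -a * b * p / (r + b * b * p)

def ms_factor(k):
    """Per-step growth factor of E[x_t^2] under u_t = k x_t."""
    return (a + b * k) ** 2 + alpha * a1 ** 2

print(k_det, abs(a + b * k_det), ms_factor(k_det), ms_factor(-2.0))
```

Here the noise-ignorant gain gives a nominal closed-loop pole inside the unit circle but a second-moment growth factor above one, whereas the gain $k = -2$ keeps the growth factor below one.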

## 4. Global Convergence of Model-Free Policy Gradient

The results in the previous section are model-based; the policy gradient steps are computed exactly based on knowledge of the model parameters. In a model-free setting, the policy gradient can be estimated to arbitrary accuracy from a sufficient number of sample trajectories of sufficiently long rollout length. We show for multiplicative noise LQR that with a finite number of samples polynomial in the problem data, the model-free policy gradient algorithm still converges to the globally optimal policy in the presence of small perturbations on the gradient.

In the model-free setting, the policy gradient method proceeds as before except that at each iteration Algorithm 1 is called to generate an estimate of the gradient via the zeroth-order optimization procedure described by Fazel et al. [14].
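Algorithm 1 itself is not reproduced in this section, so the following is only a sketch of a one-point zeroth-order (smoothing-based) gradient estimate in the spirit of Fazel et al. [14]: perturb the gain on a Frobenius sphere of radius $r$, roll out the perturbed policy, and average the rollout costs weighted by the perturbations and scaled by $d / r^2$, where $d$ is the number of entries of $K$. The Gaussian noise and initial-state distributions, the function names, and all numerical values are assumptions made for illustration.

```python
import numpy as np

def rollout_cost(plant, K, x0, length, rng):
    """Finite-horizon cost of one simulated trajectory under u_t = K x_t."""
    A, B, Q, R, S0, alphas, As, betas, Bs = plant
    x, total = x0.copy(), 0.0
    for _ in range(length):
        u = K @ x
        total += float(x @ Q @ x + u @ R @ u)
        # Draw the multiplicative noises (Gaussian here is an assumption).
        A_t = A + sum(rng.normal(0.0, np.sqrt(a)) * Ai for a, Ai in zip(alphas, As))
        B_t = B + sum(rng.normal(0.0, np.sqrt(b)) * Bj for b, Bj in zip(betas, Bs))
        x = A_t @ x + B_t @ u
    return total

def sample_on_sphere(shape, radius, rng):
    """Uniform sample from the Frobenius-norm sphere of the given radius."""
    U = rng.standard_normal(shape)
    return radius * U / np.linalg.norm(U)

def estimate_gradient(plant, K, n_samples, length, radius, rng):
    """One-point smoothing estimate of the policy gradient from rollouts."""
    S0 = plant[4]
    n, d = plant[0].shape[0], K.size
    grad = np.zeros_like(K)
    for _ in range(n_samples):
        U = sample_on_sphere(K.shape, radius, rng)
        x0 = rng.multivariate_normal(np.zeros(n), S0)
        c = rollout_cost(plant, K + U, x0, length, rng)
        grad += (d / radius ** 2) * c * U
    return grad / n_samples

# Hypothetical example system (not from the paper).
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.array([[0.0], [1.0]])
plant = (A, B, np.eye(2), np.eye(1), np.eye(2),
         [0.05], [np.array([[0.0, 1.0], [0.0, 0.0]])], [0.1], [B.copy()])
K = np.array([[-0.1, -0.2]])

g_hat = estimate_gradient(plant, K, n_samples=200, length=30, radius=0.05,
                          rng=np.random.default_rng(0))
print(g_hat)
```

A one-point estimate of this kind has high variance, which is consistent with the sample inefficiency of model-free policy gradient noted in the conclusions; far more samples than used here would be needed for an accurate estimate.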

###### Theorem 4.1 (Model-Free Policy Gradient convergence).

Suppose the step size $\eta$ is chosen according to the restriction in Theorem 3.6 and at every iteration the gradient is estimated using Algorithm 1, where the number of samples, rollout length, and exploration radius are chosen according to fixed quantities which are polynomial in the problem data. Then, with high probability, performing gradient descent results in convergence to the global optimum at the linear rate

$$
C(K_{s+1}) - C(K^*) \le \left( 1 - \eta \frac{\sigma_{\min}(R) \sigma_{\min}(\Sigma_0)^2}{2 \|\Sigma_{K^*}\|} \right) \big( C(K_s) - C(K^*) \big).
$$

###### Remark 4.2 (From deterministic to multiplicative noise LQR).

In comparison with the deterministic dynamics studied by [14], the following remarks are in order:

• When the multiplicative variances $\alpha_i$, $\beta_j$ are all zero, the assertions of Theorems 3.4, 3.5, 3.6, 4.1 recover the same step sizes and rates of the deterministic setting reported by [14].

• One of the critical effects of multiplicative noise is that the computational burden of performing policy gradient is increased. This is evident from the mathematical expressions which bound the relevant quantities whose exact relationship is developed in the Appendices. In particular, $C(K_0)$, $\|\Sigma_{K^*}\|$, and $\|P_K\|$ are necessarily higher with either state- or input-dependent multiplicative noise, and $\|B\|^2 + \sum_{j=1}^{q} \beta_j \|B_j\|^2$ is greater than $\|B\|^2$. These increases all act to reduce the step size (and thus convergence rate), and in the model-free setting increase the number of samples and rollout length required.

## 5. Numerical Experiments

In this section we demonstrate the efficacy of the policy gradient algorithms. We first considered a system with 4 states and 1 input representing an active two-mass suspension converted from continuous to discrete time using a standard bilinear transformation. We considered the system dynamics with and without multiplicative noise. The system was open-loop mean stable, and in the presence of multiplicative noise it was open-loop mean-square unstable. We refer to the cost with multiplicative noise as the LQRm cost and the cost without any noise as the LQR cost. We refer to the gains which optimize the LQRm and LQR cost as the noise-aware and noise-ignorant gains, respectively.

We performed exact policy gradient descent in the model-known setting; at each iteration gradients were calculated by solving generalized Lyapunov equations (2) and (3) using the problem data. We performed the optimization for both settings of noise starting from the same random feasible initial gain. The step size was set to a small constant in accordance with Theorem 3.6. The optimization stopped once the Frobenius norm of the gradient fell below a small threshold. The plots in Fig. 1 show the cost of the gains at each iteration; Figs. 1(a) and 1(b) show gains during minimization of the LQRm cost and LQR cost, respectively.

When there was high multiplicative noise, the noise-aware controller minimized the LQRm cost as desired. However, the noise-ignorant controller actually destabilized the system in the mean-square sense; this can be seen in Fig. 1(b) as the LQRm cost exploded upwards to infinity. Looking at the converse scenario, the noise-ignorant gain indeed minimized the LQR cost as expected. However, while the noise-aware gain did lead to a slightly suboptimal LQR cost, it nevertheless ensured that at least the LQR cost was finite (gains were mean stabilizing) throughout the optimization. In this sense, the multiplicative noise-aware optimization is generally safer and more robust than noise-ignorant optimization, and in examples like this is actually necessary for mean-square stabilization.

We also considered 10-state, 10-input systems with randomly generated problem data. The systems were all open-loop mean-square stable with initial gains set to zero. We ran policy gradient using the exact gradient, natural gradient, and Gauss-Newton step directions on 20 unique problem instances using the largest feasible constant step sizes for a fixed number of iterations, chosen so that the final cost was within a fixed tolerance of optimal. The plots in Fig. 2 show the cost over the iterations; the bold centerline is the mean of all trials and the shaded region is between the maximum and minimum of all trials. In terms of convergence, the Gauss-Newton step was extremely fast, the natural gradient was somewhat slow, and the exact gradient was the slowest. Nevertheless, all algorithms exhibited convergence to the optimum, empirically confirming the asserted theoretical claims.

Python code which implements the algorithms and generates the figures reported in this work can be found in the GitHub repository at https://github.com/TSummersLab/polgrad-multinoise/.

The code was run on a desktop PC with a quad-core Intel i7 6700K 4.0GHz CPU, 16GB RAM. No GPU computing was utilized.

## 6. Conclusions

We have shown that policy gradient methods in both model-known and model-unknown settings give global convergence to the globally optimal policy for LQR systems with multiplicative noise. These techniques are directly applicable for the design of robust controllers of uncertain systems and serve as a benchmark for data-driven control design. Our ongoing work is exploring ways of mitigating the relative sample inefficiency of model-free policy gradient methods by leveraging the special structure of LQR models and Nesterov-type acceleration, and exploring alternative system identification and adaptive control approaches. We are also investigating other methods of building robustness through $\mathcal{H}_\infty$ and dynamic game approaches.

## Technical Proofs

Before proceeding with the proof of the main results of this study, we first review several basic matrix expressions that will be used later throughout the section.

## Appendix A Standard matrix expressions

In this section we let $A$, $B$, $C$, and $M_i$ be generic matrices, lowercase letters such as $a$ and $b$ be generic vectors, and $s$ be a generic scalar.

Spectral norm:

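We denote the matrix spectral norm as $\|A\|$, which clearly satisfies

$$
\|A\| = \sigma_{\max}(A) \ge \sigma_{\min}(A). \tag{4}
$$

Frobenius norm:

We denote the matrix Frobenius norm as $\|A\|_F$, whose square satisfies

$$
\|A\|_F^2 = \mathrm{Tr}(A^\top A). \tag{5}
$$

Frobenius norm $\ge$ spectral norm:

For any matrix $A$ the Frobenius norm is greater than or equal to the spectral norm:

$$
\|A\|_F \ge \|A\|. \tag{6}
$$

Inverse of spectral norm inequality (for invertible $A$):

$$
\|A^{-1}\| \ge \|A\|^{-1}. \tag{7}
$$

Invariance of trace under cyclic permutation:

$$
\mathrm{Tr}\Big( \prod_{i=1}^{n} M_i \Big) = \mathrm{Tr}\Big( M_n \prod_{i=1}^{n-1} M_i \Big). \tag{8}
$$

Invariance of trace under arbitrary permutation for a product of three symmetric matrices:

$$
\mathrm{Tr}(ABC) = \mathrm{Tr}(BCA) = \mathrm{Tr}(CAB) = \mathrm{Tr}(ACB) = \mathrm{Tr}(BAC) = \mathrm{Tr}(CBA). \tag{9}
$$

The first three expressions are equal for any conformable matrices by cyclic invariance; equality with the remaining three uses $\mathrm{Tr}(M) = \mathrm{Tr}(M^\top)$ together with symmetry of $A$, $B$, and $C$.

Scalar trace equivalence:

$$
s = \mathrm{Tr}(s). \tag{10}
$$

Trace-spectral norm inequalities (for $B \succeq 0$):

$$
|\mathrm{Tr}(A^\top B)| \le \|A^\top\| \, |\mathrm{Tr}(B)| = \|A\| \, |\mathrm{Tr}(B)|. \tag{11}
$$

If $A \in \mathbb{R}^{n \times n}$ then

$$
|\mathrm{Tr}(A)| \le n \|A\| \tag{12}
$$

and if $A \succeq 0$ then

$$
\mathrm{Tr}(A) \ge \|A\|. \tag{13}
$$

Sub-multiplicativity of spectral norm:

$$
\|AB\| \le \|A\| \|B\|. \tag{14}
$$
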
Positive semidefinite matrix inequality:

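Suppose $A \succeq 0$ and $B \succeq 0$. Then

$$
A + B \succeq A \quad \text{and} \quad A + B \succeq B. \tag{15}
$$

Vector self outer product positive semidefiniteness:

$$
a a^\top \succeq 0 \tag{16}
$$

since $b^\top a a^\top b = (a^\top b)^2 \ge 0$ for any vector $b$.

Singular value inequality for positive semidefinite matrices:

Suppose $A \succeq 0$ and $B \succeq 0$ and $A \succeq B$. Then

$$
\sigma_{\min}(A) \ge \sigma_{\min}(B). \tag{17}
$$
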
Weyl’s Inequality for singular values:

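Suppose $B = A + C$. Let the singular values of $A$, $B$, and $C$ be

$$
\begin{aligned}
\sigma_1(A) \ge \sigma_2(A) \ge \ldots \ge \sigma_r(A) \ge 0, \\
\sigma_1(B) \ge \sigma_2(B) \ge \ldots \ge \sigma_r(B) \ge 0, \\
\sigma_1(C) \ge \sigma_2(C) \ge \ldots \ge \sigma_r(C) \ge 0,
\end{aligned}
$$

where $r$ is the smaller of the two matrix dimensions. Then we have

$$
\sigma_{i+j-1}(B) \le \sigma_i(A) + \sigma_j(C) \quad \forall \ i \in \{1, 2, \ldots, r\}, \ j \in \{1, 2, \ldots, r\}, \ i + j - 1 \in \{1, 2, \ldots, r\}. \tag{18}
$$

Consequently, we have

$$
\|B\| \le \|A\| + \|C\| \tag{19}
$$

and

$$
\sigma_{\min}(B) \ge \sigma_{\min}(A) - \|C\|. \tag{20}
$$
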
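The singular value inequalities (18) to (20) are easy to spot-check numerically. The sketch below draws random matrices with a fixed seed and tests every admissible index pair. It assumes NumPy; the variable names and matrix sizes are arbitrary choices made for illustration.

```python
import numpy as np

def singular_values(M):
    """Singular values in non-increasing order."""
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
C = 0.3 * rng.standard_normal((4, 3))
Bsum = A + C                     # plays the role of B = A + C

sA, sB, sC = singular_values(A), singular_values(Bsum), singular_values(C)
r = min(A.shape)
tol = 1e-12

# (18): sigma_{i+j-1}(B) <= sigma_i(A) + sigma_j(C), using 1-based indices.
weyl_ok = all(
    sB[i + j - 2] <= sA[i - 1] + sC[j - 1] + tol
    for i in range(1, r + 1)
    for j in range(1, r + 1)
    if i + j - 1 <= r
)
norm_ok = sB[0] <= sA[0] + sC[0] + tol        # (19)
smin_ok = sB[-1] >= sA[-1] - sC[0] - tol      # (20)
print(weyl_ok, norm_ok, smin_ok)
```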
Vector Bernstein inequality:

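Suppose $\hat{a} = \sum_i \hat{a}_i$ where $\hat{a}_i$ are independent random vectors of dimension $n$. Let $a = \mathbb{E}[\hat{a}]$, and the variance $\sigma^2 = \mathbb{E}\big[ \sum_i \|\hat{a}_i\|^2 \big]$. If every $\hat{a}_i$ has norm $\|\hat{a}_i\| \le s$ then with high probability we have

$$
\|\hat{a} - a\| \le \mathcal{O}\big( s \log n + \sqrt{\sigma^2 \log n} \big). \tag{21}
$$

This is the same inequality given in [14]. See [34] for the exact scale constants and a proof.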

## Appendix B Policy Gradient Expression and Gradient Domination

### B.1. Policy gradient expression

We give the expression for the policy gradient for linear state feedback policies applied to the LQR-with-multiplicative-noise problem.

###### Lemma B.1 (Policy Gradient Expression).

The policy gradient is given by

$$
\nabla_K C(K) = 2 \Big[ \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A \Big] \Sigma_K. \tag{22}
$$

###### Proof.

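Substituting the RHS of the generalized Lyapunov equation (2) into the cost $C(K) = \mathrm{Tr}(P_K \Sigma_0)$ yields

$$
\begin{aligned}
C(K) ={}& \mathrm{Tr}\big( (Q + K^\top R K) \Sigma_0 \big) + \mathrm{Tr}\big( (A + BK)^\top P_K (A + BK) \Sigma_0 \big) + \mathrm{Tr}\Big( \sum_{i=1}^{p} \alpha_i A_i^\top P_K A_i \Sigma_0 \Big) \\
&+ \mathrm{Tr}\Big( \sum_{j=1}^{q} \beta_j K^\top B_j^\top P_K B_j K \Sigma_0 \Big).
\end{aligned}
\tag{23--24}
$$

Taking the gradient with respect to $K$ and using the product rule we obtain

$$
\begin{aligned}
\nabla_K C(K) ={}& \nabla_K \Big[ \mathrm{Tr}\big( (Q + K^\top R K) \Sigma_0 \big) + \mathrm{Tr}\big( (A + BK)^\top P_K (A + BK) \Sigma_0 \big) + \mathrm{Tr}\Big( \sum_{i=1}^{p} \alpha_i A_i^\top P_K A_i \Sigma_0 \Big) \\
&\qquad + \mathrm{Tr}\Big( \sum_{j=1}^{q} \beta_j K^\top B_j^\top P_K B_j K \Sigma_0 \Big) \Big] \\
={}& 2 \Big[ \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A \Big] \Sigma_0 \\
&+ \nabla_{\bar{K}} \mathrm{Tr}\Big[ \Big( (A + BK)^\top P_{\bar{K}} (A + BK) + \sum_{i=1}^{p} \alpha_i A_i^\top P_{\bar{K}} A_i + \sum_{j=1}^{q} \beta_j K^\top B_j^\top P_{\bar{K}} B_j K \Big) \Sigma_0 \Big] \\
={}& 2 \Big[ \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A \Big] \Sigma_0 \\
&+ \nabla_{\bar{K}} \mathrm{Tr}\Big( P_{\bar{K}} \Big[ (A + BK) \Sigma_0 (A + BK)^\top + \sum_{i=1}^{p} \alpha_i A_i \Sigma_0 A_i^\top + \sum_{j=1}^{q} \beta_j B_j K \Sigma_0 K^\top B_j^\top \Big] \Big) \\
={}& 2 \Big[ \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A \Big] \Sigma_0 + \nabla_{\bar{K}} \mathrm{Tr}( P_{\bar{K}} X_1 )
\end{aligned}
\tag{25--32}
$$

where $X_1$ denotes the bracketed matrix in the preceding line and the overbar on $\bar{K}$ is used to denote the term being differentiated. The last term has the same form as the original cost with $\Sigma_0$ replaced by $X_1$, which is the state covariance one step later. Applying this gradient formula recursively to the last term in the last line (namely $\nabla_{\bar{K}} \mathrm{Tr}(P_{\bar{K}} X_1)$), each pass contributes the same bracketed factor multiplied by the next state covariance, and summing these covariances gives $\Sigma_K$. We obtain

$$
\nabla_K C(K) = 2 \Big[ \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A \Big] \Sigma_K \tag{33}
$$

which completes the proof. ∎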

### B.2. Additional quantities

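We define the stochastic system state transition matrices

$$
\tilde{A} = A + \sum_{i=1}^{p} \delta_{ti} A_i, \qquad \tilde{B} = B + \sum_{j=1}^{q} \gamma_{tj} B_j. \tag{34}
$$

We define

$$
R_K = R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \tag{35}
$$

and

$$
E_K = \Big( R + B^\top P_K B + \sum_{j=1}^{q} \beta_j B_j^\top P_K B_j \Big) K + B^\top P_K A = R_K K + B^\top P_K A \tag{36}
$$

so that

$$
\nabla_K C(K) = 2 E_K \Sigma_K. \tag{37}
$$

We define the (deterministic) nominal closed-loop state transition matrix

$$
A_K = A + BK. \tag{38}
$$

Similarly we define the stochastic closed-loop state transition matrix

$$
\tilde{A}_K = \tilde{A} + \tilde{B} K. \tag{39}
$$

We define the closed-loop LQR cost matrix

$$
Q_K = Q + K^\top R K. \tag{40}
$$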

### B.3. State value function, state-action value function, and advantage

We have already defined the state value function (or simply the “value function” or “$V$-function” in reinforcement learning jargon) in the main document. We now define an equivalent notation by moving the functional dependency on $K$ to the subscript, giving

$$
V(K, x) = V_K(x) = \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \big( x_t^\top Q x_t + u_t^\top R u_t \big) \tag{41}
$$

given that

$$
x_0 = x, \qquad u_t = K x_t, \qquad x_{t+1} = \tilde{A}_K x_t \tag{42}
$$

where we take expectation with respect to the $\delta_{ti}$ and $\gamma_{tj}$ determining $\tilde{A}_K$. Equivalently,

$$
V_K(x) = x^\top P_K x. \tag{43}
$$

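The state-action value function (or simply the “$\mathcal{Q}$-function” in reinforcement learning jargon) is

$$
\mathcal{Q}(K, x, u) = \mathcal{Q}_K(x, u) = x^\top Q x + u^\top R u + \mathbb{E}_{\delta_{ti}, \gamma_{tj}} V_K(\tilde{A} x + \tilde{B} u) \tag{44}
$$

where we take expectation with respect to the $\delta_{ti}$ and $\gamma_{tj}$ determining $\tilde{A}$ and $\tilde{B}$ respectively. We write the value functions in calligraphic letters to distinguish them from the cost matrix $Q_K$ in (40) and the closed-loop matrix $A_K$ in (38). Notice that the state and action which are the functional inputs do not have to be generated by the gain matrix in the subscript. Indeed we have $\mathcal{Q}_K(x, u) = V_K(x)$ if $u = Kx$, but not in general. Also note that only the rightmost expression (the state value function) is dependent on the gain matrix. These facts will be crucial to proving the value difference lemma. Expanding, we can also write the state-action value function as

$$
\mathcal{Q}_K(x, u) = x^\top Q x + u^\top R u + \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \big[ (\tilde{A} x + \tilde{B} u)^\top P_K (\tilde{A} x + \tilde{B} u) \big]. \tag{45}
$$

The advantage function is defined as

$$
\mathcal{A}(K, x, u) = \mathcal{A}_K(x, u) = \mathcal{Q}_K(x, u) - V_K(x). \tag{47}
$$

The advantage function can be thought of as the difference in cost (“advantage”) when starting in state $x$ of taking an action $u$ for one step instead of the action generated by policy $K$.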

We also define the state sequence

$$
\{x_t\}_{K,x} = \{ x, A_K x, A_K^2 x, \ldots, A_K^t x, \ldots \} \tag{48}
$$

and the action sequence

$$
\{u_t\}_{K,x} = K \{x_t\}_{K,x} \tag{49}
$$

and the cost sequence

$$
\{c_t\}_{K,x} = \{x_t\}_{K,x}^\top Q_K \{x_t\}_{K,x}. \tag{50}
$$

Note that the state, action, and cost sequences are all random variables whose distributions are determined by the multiplicative noise data.

We can now derive the value-difference lemma, which Fazel et al. [14] refer to as the “cost-difference” lemma.

###### Lemma B.2 (Value difference).

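Suppose $K$ and $K'$ generate the (stochastic) state, action, and cost sequences

$$
\{x_t\}_{K,x}, \quad \{u_t\}_{K,x}, \quad \{c_t\}_{K,x} \tag{51}
$$

and

$$
\{x_t\}_{K',x}, \quad \{u_t\}_{K',x}, \quad \{c_t\}_{K',x} \tag{52}
$$

respectively. Then the value difference is

$$
V_{K'}(x) - V_K(x) = \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \mathcal{A}_K\big( \{x_t\}_{K',x}, \{u_t\}_{K',x} \big). \tag{53}
$$

Also, the advantage satisfies

$$
\mathcal{A}_K(x, K'x) = 2 x^\top \Delta^\top E_K x + x^\top \Delta^\top R_K \Delta x \tag{54}
$$

where

$$
\Delta = K' - K. \tag{55}
$$
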
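The advantage identity (54) can be checked numerically by evaluating the state-action value function in closed form. Because the noise terms are zero-mean and mutually independent, the expectation in (45) splits into a nominal term plus one variance-weighted term per noise direction. The sketch below assumes NumPy and a mean-square stabilizing gain $K$; the helper names and data are hypothetical.

```python
import numpy as np

def lyapunov_p(plant, K):
    """Solve (2) for P_K by vectorization; assumes K is mean-square stabilizing."""
    A, B, Q, R, alphas, As, betas, Bs = plant
    n = A.shape[0]
    AK = A + B @ K
    T = np.kron(AK, AK)
    T = T + sum(a * np.kron(Ai, Ai) for a, Ai in zip(alphas, As))
    T = T + sum(b * np.kron(Bj @ K, Bj @ K) for b, Bj in zip(betas, Bs))
    rhs = (Q + K.T @ R @ K).reshape(-1)
    return np.linalg.solve(np.eye(n * n) - T.T, rhs).reshape(n, n)

def q_function(plant, P, x, u):
    """State-action value (45) with the expectation evaluated in closed form."""
    A, B, Q, R, alphas, As, betas, Bs = plant
    nxt = A @ x + B @ u
    val = x @ Q @ x + u @ R @ u + nxt @ P @ nxt
    val += sum(a * (Ai @ x) @ P @ (Ai @ x) for a, Ai in zip(alphas, As))
    val += sum(b * (Bj @ u) @ P @ (Bj @ u) for b, Bj in zip(betas, Bs))
    return float(val)

# Hypothetical example system (not from the paper).
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
betas, Bs = [0.1], [B.copy()]
plant = (A, B, Q, R, [0.05], [np.array([[0.0, 1.0], [0.0, 0.0]])], betas, Bs)
K = np.array([[-0.1, -0.2]])
K_new = np.array([[-0.3, 0.1]])
x = np.array([1.0, -2.0])

P = lyapunov_p(plant, K)
RK = R + B.T @ P @ B + sum(b * Bj.T @ P @ Bj for b, Bj in zip(betas, Bs))
EK = RK @ K + B.T @ P @ A
Delta = K_new - K

advantage = q_function(plant, P, x, K_new @ x) - float(x @ P @ x)
d = Delta @ x
formula = float(2.0 * d @ (EK @ x) + d @ RK @ d)
on_policy = q_function(plant, P, x, K @ x) - float(x @ P @ x)
print(advantage, formula, on_policy)
```

The last printed value illustrates that the advantage vanishes when the action is generated by $K$ itself.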
###### Proof.

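By definition we have

$$
V_K(x) = \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \{c_t\}_{K,x} \tag{56}
$$

so we can write the value difference as

$$
\begin{aligned}
V_{K'}(x) - V_K(x) &= \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \big[ \{c_t\}_{K',x} \big] - V_K(x) \\
&= \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \big[ \{c_t\}_{K',x} + V_K(\{x_t\}_{K',x}) - V_K(\{x_t\}_{K',x}) \big] - V_K(x).
\end{aligned}
\tag{57--58}
$$

We can expand out the following value function difference as

$$
\begin{aligned}
\mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \big[ V_K(\{x_t\}_{K',x}) \big] - V_K(x) &= \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} \big[ V_K(\{x_{t+1}\}_{K',x}) \big] + \mathbb{E}_{\delta_{ti}, \gamma_{tj}} V_K(\{x_0\}_{K',x}) - V_K(x) \\
&= \mathbb{E}_{\delta_{ti}, \gamma_{tj}} \sum_{t=0}^{\infty} V_K(\{x_{t+1}\}_{K',x})
\end{aligned}
\tag{60--61}
$$

where the last equality is valid by noting that the first term in the sequence $\{x_t\}_{K',x}$ is $x$.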

Continuing the value difference expression we have

 VK′(x)−VK(x) (62) =\mathdsEδti,γtj∞∑t=0[{ct}K′,x+VK({xt+1}K′,x)−V