# Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator

###### Abstract

We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. Our analysis quantifies the tension between policy improvement and policy evaluation, and suggests that policy evaluation is the dominant factor in terms of sample complexity. Specifically, we show that to obtain a controller that is within $\varepsilon$ of the optimal LQR controller, each step of policy evaluation requires at most $(n+d)^3/\varepsilon^2$ samples, where $n$ is the dimension of the state vector and $d$ is the dimension of the input vector. On the other hand, only $\log(1/\varepsilon)$ policy improvement steps suffice, resulting in an overall sample complexity of $(n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)$. We furthermore build on our analysis and construct a simple adaptive procedure based on $\varepsilon$-greedy exploration which relies on approximate PI as a sub-routine and obtains $T^{2/3}$ regret, improving upon a recent result of Abbasi-Yadkori et al. [3].

## 1 Introduction

With the recent successes of reinforcement learning (RL) on continuous control tasks, there has been a renewed interest in understanding the sample complexity of RL methods. A recent line of work has focused on the Linear Quadratic Regulator (LQR) as a testbed to understand the behavior and trade-offs of various RL algorithms in the continuous state and action space setting. These results can be broadly grouped into two categories: (1) the study of model-based methods which use data to build an estimate of the transition dynamics, and (2) model-free methods which directly estimate the optimal feedback controller from data without building a dynamics model as an intermediate step. Much of the recent progress in LQR has focused on the model-based side, with an analysis of robust control from Dean et al. [14] and certainty equivalence control by Fiechter [19] and Mania et al. [28]. These techniques have also been extended to the online, adaptive setting [1, 15, 13, 4, 31]. On the other hand, for classic model-free RL algorithms such as Q-learning, SARSA, and approximate policy iteration (PI), our understanding is much less complete despite the fact that these algorithms are well understood in the tabular (finite state and action space) setting. Indeed, most of the model-free analysis for LQR [18, 26, 37] has focused exclusively on derivative-free random search methods.

In this paper, we extend our understanding of model-free algorithms for LQR by studying the performance of approximate PI on LQR, which is a classic approximate dynamic programming algorithm. Approximate PI is a model-free algorithm which iteratively uses trajectory data to estimate the state-value function associated to the current policy (via e.g. temporal difference learning), and then uses this estimate to greedily improve the policy. A key issue in analyzing approximate PI is to understand the trade-off between the number of policy improvement iterations and the amount of data to collect for each policy evaluation phase. Our analysis quantifies this trade-off, showing that if least-squares temporal difference learning (LSTD-Q) [11, 22] is used for policy evaluation, then a trajectory of length $\widetilde{O}((n+d)^3/\varepsilon^2)$ for each inner step of policy evaluation combined with $O(\log(1/\varepsilon))$ outer steps of policy improvement suffices to learn a controller that has $\varepsilon$-error from the optimal controller. This yields an overall sample complexity of $\widetilde{O}((n+d)^3 \varepsilon^{-2} \log(1/\varepsilon))$. Prior to our work, the only known guarantee for approximate PI on LQR was the asymptotic consistency result of Bradtke [12] in the setting of no process noise.

We also extend our analysis of approximate PI to the online, adaptive LQR setting popularized by Abbasi-Yadkori and Szepesvári [1]. By using a greedy exploration scheme similar to Dean et al. [15] and Mania et al. [28], we prove a $\widetilde{O}(T^{2/3})$ regret bound for a simple adaptive policy improvement algorithm. While the $T^{2/3}$ rate is sub-optimal compared to the $T^{1/2}$ regret from model-based methods [1, 13, 28], our analysis improves the $\widetilde{O}(T^{2/3+\varepsilon})$ regret (for $T \geq C^{1/\varepsilon}$) from the model-free Follow the Leader (FTL) algorithm of Abbasi-Yadkori et al. [3]. To the best of our knowledge, we give the best regret guarantee known for a model-free algorithm. We leave open the question of whether or not a model-free algorithm can achieve optimal $T^{1/2}$ regret.

## 2 Main Results

In this paper, we consider the following linear dynamical system:

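$$
x_{t+1} = A x_t + B u_t + w_t, \quad w_t \sim \mathcal{N}(0, \sigma_w^2 I), \quad x_0 \sim \mathcal{N}(0, \Sigma_0). \tag{2.1}
$$

We let $n$ denote the dimension of the state $x_t$ and $d$ denote the dimension of the input $u_t$. For simplicity we assume that $d \leq n$, i.e. the system is under-actuated. We fix two positive definite cost matrices $(S, R)$, and consider the infinite horizon average-cost Linear Quadratic Regulator (LQR):

$$
J_\star := \min_{\{u_t(\cdot)\}} \lim_{T \to \infty} \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \left( x_t^\top S x_t + u_t^\top R u_t \right) \right] \quad \text{subject to (2.1)}. \tag{2.2}
$$

We assume the dynamics matrices $(A, B)$ are unknown to us, and our method of interaction with (2.1) is to choose an input sequence $\{u_t\}$ and observe the resulting states $\{x_t\}$.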
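The interaction model described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the linear dynamics $x_{t+1} = A x_t + B u_t + w_t$ with i.i.d. Gaussian process noise of standard deviation `sigma_w`; the function name `simulate` and the numerical values of `A` and `B` are hypothetical and are not taken from the paper.

```python
import numpy as np

def simulate(A, B, inputs, sigma_w, rng, x0=None):
    """Roll out x_{t+1} = A x_t + B u_t + w_t for a given input sequence.

    Assumes i.i.d. process noise w_t ~ N(0, sigma_w^2 I).
    Returns the (T+1, n) array of visited states, starting from x0 (zero if None).
    """
    T, n = inputs.shape[0], A.shape[0]
    states = np.zeros((T + 1, n))
    if x0 is not None:
        states[0] = x0
    for t in range(T):
        w = sigma_w * rng.standard_normal(n)
        states[t + 1] = A @ states[t] + B @ inputs[t] + w
    return states

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])  # illustrative dynamics (hypothetical)
B = np.array([[0.0], [1.0]])
inputs = rng.standard_normal((50, 1))    # open-loop Gaussian excitation
states = simulate(A, B, inputs, sigma_w=0.1, rng=rng)
```

The learner only ever sees `inputs` and `states`; the matrices `A` and `B` stay hidden inside the simulator.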

We study the solution to (2.2) using least-squares policy iteration (LSPI), a well-known approximate dynamic programming method in RL introduced by Lagoudakis and Parr [22]. The study of approximate PI on LQR dates back to the Ph.D. thesis of Bradtke [12], where he showed that for noiseless LQR (when $w_t = 0$ for all $t$), the approximate PI algorithm is asymptotically consistent. In this paper we expand on this result and quantify non-asymptotic rates for approximate PI on LQR.

#### Notation.

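For a positive scalar $x > 0$, we let $x_+ = \max\{1, x\}$. A square matrix $L$ is called stable if $\rho(L) < 1$, where $\rho(\cdot)$ denotes the spectral radius of $L$. For a symmetric matrix $M \in \mathbb{R}^{n \times n}$, we let $\mathsf{dlyap}(L, M)$ denote the unique solution to the discrete Lyapunov equation $P = L^\top P L + M$. We also let $\mathsf{svec}(M) \in \mathbb{R}^{n(n+1)/2}$ denote the vectorized version of the upper triangular part of $M$ so that $\|M\|_F^2 = \langle \mathsf{svec}(M), \mathsf{svec}(M) \rangle$. Finally, $\mathsf{smat}(\cdot)$ denotes the inverse of $\mathsf{svec}(\cdot)$, so that $\mathsf{smat}(\mathsf{svec}(M)) = M$.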
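The three operators in this notation can be written down concretely. The sketch below assumes the usual convention in which off-diagonal entries are scaled by $\sqrt{2}$ so that the Euclidean norm of the vectorization equals the Frobenius norm of the matrix, and it solves the discrete Lyapunov equation $P = L^\top P L + M$ by plain vectorization, which is only sensible for small matrices. The function names mirror the notation but the implementation is an illustration, not the authors' code.

```python
import numpy as np

def svec(M):
    """Stack the upper triangle of symmetric M, scaling off-diagonals by sqrt(2)."""
    rows, cols = np.triu_indices(M.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * M[rows, cols]

def smat(v):
    """Inverse of svec: rebuild the symmetric matrix from its scaled upper triangle."""
    n = int(round((np.sqrt(8 * v.size + 1) - 1) / 2))
    rows, cols = np.triu_indices(n)
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    M = np.zeros((n, n))
    M[rows, cols] = v / scale
    return M + np.triu(M, 1).T  # mirror the strict upper triangle

def dlyap(L, M):
    """Solve P = L^T P L + M via vec(L^T P L) = (L^T kron L^T) vec(P)."""
    n = L.shape[0]
    lhs = np.eye(n * n) - np.kron(L.T, L.T)
    return np.linalg.solve(lhs, M.reshape(-1)).reshape(n, n)

M = np.array([[2.0, 1.0], [1.0, 3.0]])
L = np.array([[0.5, 0.1], [0.0, 0.3]])   # a stable matrix (hypothetical values)
P = dlyap(L, np.eye(2))
```

For larger problems a dedicated Lyapunov solver would be preferable to the Kronecker system, whose size grows as $n^2 \times n^2$.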

### 2.1 Least-Squares Temporal Difference Learning (LSTD-Q)

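The first component towards an understanding of approximate PI is to understand least-squares temporal difference learning (LSTD-Q) for $Q$-functions, which is the fundamental building block of LSPI. Given a policy $K_{\mathrm{eval}}$ which stabilizes $(A, B)$, the goal of LSTD-Q is to estimate the parameters of the $Q$-function associated to $K_{\mathrm{eval}}$. Bellman's equation for infinite-horizon average cost MDPs (c.f. Bertsekas [7]) states that the (relative) $Q$-function associated to a policy $\pi$ satisfies the following fixed-point equation:

$$
\lambda + Q(x, u) = c(x, u) + \mathbb{E}_{x' \sim p(\cdot \mid x, u)}\left[ Q(x', \pi(x')) \right]. \tag{2.3}
$$

Here, $\lambda \in \mathbb{R}$ is a free parameter chosen so that the fixed-point equation holds. LSTD-Q operates under the linear architecture assumption, which states that the $Q$-function can be described as $Q(x, u) = q^\top \phi(x, u)$, for a known (possibly non-linear) feature map $\phi(x, u)$. It is well known that LQR satisfies the linear architecture assumption, since we have:

$$
\begin{aligned}
Q(x, u) &= \mathsf{svec}(Q)^\top \mathsf{svec}\left( \begin{bmatrix} x \\ u \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix}^\top \right), \\
Q &= \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} + \begin{bmatrix} A^\top \\ B^\top \end{bmatrix} V \begin{bmatrix} A & B \end{bmatrix}, \\
V &= \mathsf{dlyap}(A + BK_{\mathrm{eval}}, S + K_{\mathrm{eval}}^\top R K_{\mathrm{eval}}), \\
\lambda &= \left\langle Q, \sigma_w^2 \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix} \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix}^\top \right\rangle.
\end{aligned}
$$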

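Here, we slightly abuse notation and let $Q$ denote the $Q$-function and also the matrix parameterizing the $Q$-function. Now suppose that a trajectory $\{(x_t, u_t, x_{t+1})\}_{t=1}^{T}$ is collected. Note that LSTD-Q is an off-policy method (unlike the closely related LSTD estimator for value functions), and therefore the inputs $u_t$ can come from any sequence that provides sufficient excitation for learning. In particular, it does not have to come from the policy $K_{\mathrm{eval}}$. In this paper, we will consider inputs of the form:

$$
u_t = K_{\mathrm{play}} x_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I), \tag{2.4}
$$

where $K_{\mathrm{play}}$ is a stabilizing controller for $(A, B)$. Once again we emphasize that $K_{\mathrm{play}} \neq K_{\mathrm{eval}}$ in general. The injected noise $\eta_t$ is needed in order to provide sufficient excitation for learning. In order to describe the LSTD-Q estimator, we define the following quantities which play a key role throughout the paper:

$$
\begin{aligned}
\phi_t &:= \phi(x_t, u_t), \qquad \psi_t := \phi(x_t, K_{\mathrm{eval}} x_t), \\
f &:= \mathsf{svec}\left( \sigma_w^2 \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix} \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix}^\top \right), \qquad c_t := x_t^\top S x_t + u_t^\top R u_t.
\end{aligned}
$$

The LSTD-Q estimator estimates $q$ via:

$$
\widehat{q} := \left( \sum_{t=1}^{T} \phi_t (\phi_t - \psi_{t+1} + f)^\top \right)^{\dagger} \sum_{t=1}^{T} \phi_t c_t. \tag{2.5}
$$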

Here, $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse. Our first result establishes a non-asymptotic bound on the quality of the estimator $\widehat{q}$, measured in terms of $\|\widehat{q} - q\|$. Before we state our result, we introduce a key definition that we will use extensively.

###### Definition 1.

Let $L$ be a square matrix. Let $\tau \geq 1$ and $\rho \in (0, 1)$. We say that $L$ is $(\tau, \rho)$-stable if

$$
\|L^k\| \leq \tau \rho^k, \quad k = 0, 1, 2, \dots
$$

While stability of a matrix is an asymptotic notion, Definition 1 quantifies the degree of stability by characterizing the transient response of the powers of a matrix by the parameter $\tau$. It is closely related to the notion of strong stability from Cohen et al. [13].
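To make the role of the transient parameter concrete, the sketch below computes, for a chosen decay rate, the smallest constant that satisfies the bound over a finite horizon. It assumes Definition 1 requires $\|L^k\| \leq \tau \rho^k$ in spectral norm for all $k \geq 0$; the helper name `smallest_tau`, the finite horizon, and the two example matrices are hypothetical choices made for illustration.

```python
import numpy as np

def smallest_tau(L, rho, horizon=200):
    """Smallest tau >= 1 with ||L^k||_2 <= tau * rho^k for k = 0, ..., horizon."""
    tau = 1.0                      # k = 0 gives ||I|| = 1
    power = np.eye(L.shape[0])
    for k in range(1, horizon + 1):
        power = power @ L
        tau = max(tau, np.linalg.norm(power, 2) / rho**k)
    return tau

normal = np.diag([0.5, 0.3])                  # normal matrix: no transient growth
jordan = np.array([[0.5, 1.0], [0.0, 0.5]])   # non-normal: powers grow before decaying

tau_normal = smallest_tau(normal, rho=0.5)
tau_jordan = smallest_tau(jordan, rho=0.6)
```

Both matrices have spectral radius $0.5$, yet the non-normal one needs a noticeably larger constant, which is exactly the transient behavior that the definition is meant to capture.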

With Definition 1 in place, we state our first result for LSTD-Q.

###### Theorem 2.1.

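Fix a $\delta \in (0, 1)$. Let policies $K_{\mathrm{play}}$ and $K_{\mathrm{eval}}$ stabilize $(A, B)$, and assume that both $A + BK_{\mathrm{play}}$ and $A + BK_{\mathrm{eval}}$ are $(\tau, \rho)$-stable. Let the initial state $x_0 \sim \mathcal{N}(0, \Sigma_0)$ and consider the inputs $u_t = K_{\mathrm{play}} x_t + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I)$. For simplicity, assume that $\sigma_\eta \leq \sigma_w$. Let $P_\infty$ denote the steady-state covariance of the trajectory $\{x_t\}$:

$$
P_\infty = \mathsf{dlyap}\left( (A + BK_{\mathrm{play}})^\top, \sigma_w^2 I + \sigma_\eta^2 BB^\top \right). \tag{2.6}
$$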

Define the proxy variance by:

(2.7) |

Suppose that $T$ satisfies:

(2.8) |

Then we have with probability at least $1 - \delta$,

(2.9) |

Here the $\widetilde{O}(1)$ hides polylogarithmic factors.

Theorem 2.1 states that:

timesteps are sufficient to achieve error w.h.p. Several remarks are in order. First, while the burn-in is likely sub-optimal, the dependence is sharp as shown by the asymptotic results of Tu and Recht [37]. Second, the dependence on the injected excitation noise will be important when we study the online, adaptive setting in Section 2.3. We leave improving the polynomial dependence of the burn-in period to future work.

The proof of Theorem 2.1 appears in Section A and rests on top of several recent advances. First, we build off the work of Abbasi-Yadkori et al. [3] to derive a new basic inequality for LSTD-Q which serves as a starting point for the analysis. Next, we combine the small-ball techniques of Simchowitz et al. [35] with the self-normalized martingale inequalities of Abbasi-Yadkori et al. [2]. While an analysis of LSTD-Q is presented in Abbasi-Yadkori et al. [3] (which builds on the analysis for LSTD from Tu and Recht [36]), a direct application of their result yields a $1/\sigma_\eta^8$ dependence; the use of self-normalized inequalities is necessary in order to reduce this dependence to $1/\sigma_\eta^4$.
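The estimator analyzed in this section can be sketched as follows. The code assumes the quadratic feature map $\phi(x, u) = \mathsf{svec}(zz^\top)$ with $z = (x, u)$, the offset $f = \mathsf{svec}(\sigma_w^2 [I; K_{\mathrm{eval}}][I; K_{\mathrm{eval}}]^\top)$, and the pseudo-inverse form of the estimator; all function names and numerical values are hypothetical. The demonstration sets the process noise to zero, in which case the Bellman identity holds exactly along the trajectory and the estimate can be compared against the closed-form $Q$ matrix. With process noise the estimate is only approximately correct, with error of the kind bounded in Theorem 2.1.

```python
import numpy as np

def svec(M):
    """Scaled upper triangle of a symmetric matrix (off-diagonals times sqrt(2))."""
    rows, cols = np.triu_indices(M.shape[0])
    return np.where(rows == cols, 1.0, np.sqrt(2.0)) * M[rows, cols]

def phi(x, u):
    """Quadratic feature map: svec of z z^T with z = (x, u)."""
    z = np.concatenate([x, u])
    return svec(np.outer(z, z))

def lstdq(xs, us, K_eval, S, R, sigma_w):
    """LSTD-Q estimate of q = svec(Q) for the policy u = K_eval x.

    xs: (T+1, n) states, us: (T, d) inputs that were actually played.
    """
    n = xs.shape[1]
    IK = np.vstack([np.eye(n), K_eval])
    f = svec(sigma_w**2 * IK @ IK.T)                       # process-noise offset
    Phi = np.array([phi(x, u) for x, u in zip(xs[:-1], us)])
    Psi_next = np.array([phi(x, K_eval @ x) for x in xs[1:]])
    X = xs[:-1]
    costs = np.sum((X @ S) * X, axis=1) + np.sum((us @ R) * us, axis=1)
    return np.linalg.pinv(Phi.T @ (Phi - Psi_next + f)) @ (Phi.T @ costs)

# Illustrative instance (hypothetical values, not from the paper).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
K_play = np.zeros((1, 2))              # policy generating the data
K_eval = np.array([[0.0, -0.3]])       # policy being evaluated (off-policy)
T, sigma_w, sigma_eta = 2000, 0.0, 1.0 # sigma_w = 0 makes the comparison exact

xs, us = np.zeros((T + 1, 2)), np.zeros((T, 1))
for t in range(T):
    us[t] = K_play @ xs[t] + sigma_eta * rng.standard_normal(1)
    xs[t + 1] = A @ xs[t] + B @ us[t] + sigma_w * rng.standard_normal(2)

q_hat = lstdq(xs, us, K_eval, S, R, sigma_w)

# Closed-form Q for K_eval: iterate V = M + L^T V L, then Q = diag(S, R) + [A B]^T V [A B].
L_cl = A + B @ K_eval
M = S + K_eval.T @ R @ K_eval
V = M.copy()
for _ in range(2000):
    V = M + L_cl.T @ V @ L_cl
AB = np.hstack([A, B])
Q_true = np.block([[S, np.zeros((2, 1))], [np.zeros((1, 2)), R]]) + AB.T @ V @ AB
error = np.linalg.norm(q_hat - svec(Q_true))
```

Note that the data are generated by `K_play` while the $Q$-function of `K_eval` is recovered, which illustrates the off-policy nature of the estimator.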

### 2.2 Least-Squares Policy Iteration (LSPI)

With Theorem 2.1 in place, we are ready to present the main results for LSPI. We describe two versions of LSPI in Algorithm 1 and Algorithm 2.

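In Algorithms 1 and 2, $\mathsf{Proj}_\mu(\cdot)$ is the Euclidean projection onto the set of symmetric matrices lower bounded by $\mu \cdot I$. Furthermore, the map $G(\cdot)$ takes an $(n+d) \times (n+d)$ positive definite matrix and returns a $d \times n$ matrix:

$$
G\left( \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{12}^\top & Q_{22} \end{bmatrix} \right) = -Q_{22}^{-1} Q_{12}^\top. \tag{2.10}
$$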

Algorithm 1 corresponds to the version presented in Lagoudakis and Parr [22], where all the data is collected up front and is re-used in every iteration of LSTD-Q. Algorithm 2 is the one we will analyze in this paper, where new data is collected for every iteration of LSTD-Q. The modification made in Algorithm 2 simplifies the analysis by allowing the controller to be independent of the data in LSTD-Q. We remark that this does not require the system to be reset after every iteration of LSTD-Q. We leave analyzing Algorithm 1 to future work.
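The projection and greedy-improvement steps can be illustrated in isolation. The sketch below assumes that the projection clips eigenvalues of the symmetrized matrix at $\mu$ and that the improvement map is $G(Q) = -Q_{22}^{-1} Q_{12}^\top$. To keep the example deterministic it uses the exact $Q$ matrix of each policy (computed from the true dynamics) in place of an LSTD-Q estimate, so it is exact policy iteration rather than Algorithm 1 or 2; all names and numerical values are hypothetical.

```python
import numpy as np

def project_floor(Q, mu):
    """Euclidean projection onto {Q symmetric, Q >= mu * I}: symmetrize, clip eigenvalues."""
    evals, evecs = np.linalg.eigh((Q + Q.T) / 2)
    return (evecs * np.maximum(evals, mu)) @ evecs.T

def greedy_gain(Q, n):
    """G(Q) = -Q22^{-1} Q12^T, the gain minimizing [x; u]^T Q [x; u] over u."""
    Q12, Q22 = Q[:n, n:], Q[n:, n:]
    return -np.linalg.solve(Q22, Q12.T)

def q_matrix(A, B, S, R, K, iters=2000):
    """Exact Q matrix of u = K x via fixed-point iteration on the Lyapunov equation."""
    L, M = A + B @ K, S + K.T @ R @ K
    V = M.copy()
    for _ in range(iters):
        V = M + L.T @ V @ L
    n, d = B.shape
    AB = np.hstack([A, B])
    cost = np.block([[S, np.zeros((n, d))], [np.zeros((d, n)), R]])
    return cost + AB.T @ V @ AB

A = np.array([[0.9, 0.2], [0.0, 0.8]])   # stable, so K = 0 is a valid starting policy
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
mu = min(np.linalg.eigvalsh(S).min(), np.linalg.eigvalsh(R).min())

K = np.zeros((1, 2))
for _ in range(20):
    Q = project_floor(q_matrix(A, B, S, R, K), mu)  # no-op for an exact Q
    K = greedy_gain(Q, n=2)
```

With exact evaluation the projection never changes anything, since the exact $Q$ matrix already dominates the block-diagonal cost; its purpose in the sample-based algorithms is to keep a noisy estimate well conditioned before it is inverted.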

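Before we state our main finite-sample guarantee for Algorithm 2, we review the notion of a (relative) value-function. Similarly to (relative) $Q$-functions, the infinite horizon average-cost Bellman equation states that the (relative) value function $V$ associated to a policy $\pi$ satisfies the fixed-point equation:

$$
\lambda + V(x) = c(x, \pi(x)) + \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[ V(x') \right]. \tag{2.11}
$$

For a stabilizing policy $K$, it is well known that for LQR the value function is $V(x) = x^\top V x$ with

$$
V = \mathsf{dlyap}(A + BK, S + K^\top R K), \qquad \lambda = \langle \sigma_w^2 I, V \rangle.
$$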

Once again as we did for -functions, we slightly abuse notation and let denote the value function and the matrix that parameterizes the value function. Our main result for Algorithm 2 appears in the following theorem. For simplicity, we will assume that and .

###### Theorem 2.2.

Fix a . Let the initial policy input to Algorithm 2 stabilize . Suppose the initial state and that the excitation noise satisfies . Recall that the steady-state covariance of the trajectory is

Let denote the value function associated to the initial policy , and denote the value function associated to the optimal policy for the LQR problem (2.2). Define the variables as:

Fix an that satisfies:

(2.12) |

Suppose we run Algorithm 2 for policy improvement iterations where

(2.13) |

and we set the rollout length to satisfy:

(2.14) |

Then with probability , we have that each policy for stabilizes and furthermore:

Here the hides factors.

Theorem 2.2 states roughly that $T \cdot N \leq \widetilde{O}\left( \frac{(n+d)^3}{\varepsilon^2} \log(1/\varepsilon) \right)$ samples are sufficient for LSPI to recover a controller $K$ that is within $\varepsilon$ of the optimal $K_\star$. That is, only $\log(1/\varepsilon)$ iterations of policy improvement are necessary, and furthermore more iterations of policy improvement do not necessarily help due to the inherent statistical noise of estimating the $Q$-function for every intermediate policy. We note that the polynomial factor in $L/\mu$ is by no means optimal and was deliberately made quite conservative in order to simplify the presentation of the bound. A sharper bound can be recovered from our analysis techniques at the expense of a less concise expression.

It is worth taking a moment to compare Theorem 2.2 to classical results in the RL literature regarding approximate policy iteration. For example, a well known result (c.f. Theorem 7.1 of Lagoudakis and Parr [22]) states that if LSTD-Q is able to return $Q$-function estimates with $L_\infty$ error bounded by $\varepsilon$ at every iteration, then letting $\widehat{Q}_t$ denote the approximate $Q$-function at the $t$-th iteration of LSPI:

$$
\limsup_{t \to \infty} \|\widehat{Q}_t - Q_\star\|_\infty \leq \frac{2\gamma\varepsilon}{(1-\gamma)^2}.
$$

Here, $\gamma$ is the discount factor of the MDP. Theorem 2.2 is qualitatively similar to this result in that we show roughly that $\varepsilon$ error in the $Q$-function estimate translates to $\varepsilon$ error in the estimated policy. However, there are several fundamental differences. First, our analysis does not rely on discounting to show contraction of the Bellman operator. Instead, we use the $(\tau, \rho)$-stability of the closed-loop system to achieve this effect. Second, our analysis does not rely on $L_\infty$ bounds on the estimated $Q$-function, which are generally not possible to achieve with LQR since the $Q$-function is a quadratic function and the states and inputs are not uniformly bounded. And finally, our analysis is non-asymptotic.

The proof of Theorem 2.2 is given in Section B, and combines the estimation guarantee of Theorem 2.1 with a new analysis of policy iteration for LQR, which we believe is of independent interest. Our new policy iteration analysis combines the work of Bertsekas [8] on policy iteration in infinite horizon average cost MDPs with the contraction theory of Lee and Lim [24] for non-linear matrix equations.

### 2.3 LSPI for Adaptive LQR

We now turn our attention to the online, adaptive LQR problem as studied in Abbasi-Yadkori and Szepesvári [1]. In the adaptive LQR problem, the quantity of interest is the regret, defined as:

$$
\mathsf{Regret}(T) := \sum_{t=1}^{T} \left( x_t^\top S x_t + u_t^\top R u_t \right) - T \cdot J_\star. \tag{2.15}
$$

Here, the algorithm is penalized for the cost incurred from learning the optimal policy $K_\star$, and must balance exploration (to better learn the optimal policy) versus exploitation (to reduce cost). As mentioned previously, there are several known algorithms which achieve $\widetilde{O}(\sqrt{T})$ regret [1, 4, 31, 13, 28]. However, these algorithms operate in a model-based manner, using the collected data to build a confidence interval around the true dynamics $(A, B)$. On the other hand, the performance of adaptive algorithms which are model-free is less well understood. We use the results of the previous section to give an adaptive model-free algorithm for LQR which achieves $\widetilde{O}(T^{2/3})$ regret, which improves upon the $\widetilde{O}(T^{2/3+\varepsilon})$ regret (for $T \geq C^{1/\varepsilon}$) achieved by the adaptive model-free algorithm of Abbasi-Yadkori et al. [3]. Our adaptive algorithm based on LSPI is shown in Algorithm 3.
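The regret of a logged trajectory can be evaluated directly once the optimal average cost is known. The sketch below assumes that regret is the cumulative quadratic cost minus $T \cdot J_\star$ and that, for isotropic process noise, $J_\star = \sigma_w^2 \operatorname{tr}(P_\star)$ with $P_\star$ the solution of the discrete algebraic Riccati equation, approximated here by iterating the Riccati recurrence. The function names and the scalar example are hypothetical.

```python
import numpy as np

def riccati_value(A, B, S, R, iters=1000):
    """Approximate the optimal cost-to-go matrix by iterating the Riccati recurrence."""
    P = S.copy()
    for _ in range(iters):
        gain = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = S + A.T @ P @ A - A.T @ P @ B @ gain
    return P

def regret(xs, us, S, R, J_star):
    """Cumulative cost of the logged pairs (x_t, u_t) minus T times the optimal average cost."""
    total = np.sum((xs @ S) * xs) + np.sum((us @ R) * us)
    return total - len(us) * J_star

# Scalar example: A = B = S = R = 1, where the Riccati solution is the golden ratio.
one = np.eye(1)
P_star = riccati_value(one, one, one, one)
sigma_w = 1.0
J_star = sigma_w**2 * np.trace(P_star)

xs = np.array([[1.0], [2.0]])   # two logged states (hypothetical)
us = np.array([[0.0], [1.0]])   # two logged inputs (hypothetical)
value = regret(xs, us, one, one, J_star=1.5)  # J_star fixed by hand for a simple check
```

Because the true dynamics enter only through `J_star`, this computation is available to an experimenter running a simulation but not to the learning algorithm itself.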

Using an analysis technique similar to that in Dean et al. [15], we prove the following regret bound for Algorithm 3.

###### Theorem 2.3.

Fix a . Let the initial feedback stabilize and let denote its associated value function. Also let denote the optimal LQR controller and let denote the optimal value function. Let . Suppose that is set to:

and suppose is set to . With probability at least , we have that the regret of Algorithm 3 satisfies:

The proof of Theorem 2.3 appears in Section C. We note that the regret scaling as $T^{2/3}$ in Theorem 2.3 is due to the $1/\sigma_\eta^4$ dependence from LSTD-Q (c.f. (2.9)). As mentioned previously, the existing LSTD-Q analysis from Abbasi-Yadkori et al. [3] yields a $1/\sigma_\eta^8$ dependence in LSTD-Q; using this $1/\sigma_\eta^8$ dependence in the analysis of Algorithm 3 would translate into $T^{4/5}$ regret.

## 3 Related Work

For model-based methods, in the offline setting Fiechter [19] provided the first PAC-learning bound for infinite horizon discounted LQR using certainty equivalence (nominal) control. Later, Dean et al. [14] use tools from robust control to analyze a robust synthesis method for infinite horizon average cost LQR, which is applicable in regimes of moderate uncertainty when nominal control fails. Mania et al. [28] show that certainty equivalence control actually provides a fast $O(\varepsilon^2)$ rate of sub-optimality where $\varepsilon$ is the size of the parameter error, unlike the $O(\varepsilon)$ sub-optimality guarantee of [19, 14]. For the online adaptive setting, [1, 20, 4, 28, 13] give $\widetilde{O}(\sqrt{T})$ regret algorithms. A key component of model-based algorithms is being able to quantify a confidence interval for the parameter estimate, for which several recent works [35, 16, 33] provide non-asymptotic results.

Turning to model-free methods, Tu and Recht [36] study the behavior of least-squares temporal difference (LSTD) for learning the discounted value function associated to a stabilizing policy. They evaluate the LSPI algorithm studied in this paper empirically, but do not provide any analysis. In terms of policy optimization, most of the work has focused on derivative-free random search methods [18, 26]. Tu and Recht [37] study a special family of LQR instances and characterize the asymptotic behavior of both model-based certainty equivalent control and policy gradients (REINFORCE), showing that policy gradients has polynomially worse sample complexity. Most related to our work is Abbasi-Yadkori et al. [3], who analyze a model-free algorithm for adaptive LQR based on ideas from online convex optimization. LSTD-Q is a sub-routine of their algorithm, and their analysis incurs a sub-optimal $1/\sigma_\eta^8$ dependence on the injected exploration noise, which we improve to $1/\sigma_\eta^4$ using self-normalized martingale inequalities [2]. This improvement allows us to use a simple greedy exploration strategy to obtain $T^{2/3}$ regret. Finally, as mentioned earlier, the Ph.D. thesis of Bradtke [12] presents an asymptotic consistency argument for approximate PI for discounted LQR in the noiseless setting (i.e. $w_t = 0$ for all $t$).

For the general function approximation setting in RL, Antos et al. [6] and Lazaric et al. [23] analyze variants of LSPI for discounted MDPs where the state space is compact and the action space finite. In Lazaric et al. [23], the policy is greedily updated via an update operator that requires access to the underlying dynamics (and is therefore not implementable). Farahmand et al. [17] extend the results of Lazaric et al. [23] to when the function spaces considered are reproducing kernel Hilbert spaces. Zou et al. [39] give a finite-time analysis of both Q-learning and SARSA, combining the asymptotic analysis of Melo et al. [29] with the finite-time analysis of TD-learning from Bhandari et al. [9]. We note that checking the required assumptions to apply the results of Zou et al. [39] is non-trivial (c.f. Section 3.1, [29]). We are unaware of any non-asymptotic analysis of LSPI in the average cost setting, which is more difficult as the Bellman operator is no longer a contraction.

Finally, we remark that our LSPI analysis relies on understanding exact policy iteration for LQR, which is closely related to the fixed-point Riccati recurrence (value iteration). An elegant analysis for value iteration is given by Lincoln and Rantzer [25]. Recently, Fazel et al. [18] show that exact policy iteration is a special case of Gauss-Newton and prove linear convergence results. Our analysis, on the other hand, is based on combining the fixed-point theory from Lee and Lim [24] with recent work on policy iteration for average cost problems from Bertsekas [8].

## 4 Experiments

In this section, we evaluate LSPI in both the non-adaptive offline setting (Section 2.2) as well as the adaptive online setting (Section 2.3). Section G contains more details about both the algorithms we compare to as well as our experimental methodology.

We first look at the performance of LSPI in the non-adaptive, offline setting. Here, we compare LSPI to other popular model-free methods, and the model-based certainty equivalence (nominal) controller (c.f. [28]). For model-free, we look at policy gradients (REINFORCE) (c.f. [38]) and derivative-free optimization (c.f. [27, 26, 30]). We consider the LQR instance with

We choose an LQR problem where the matrix is stable, since the model-free methods we consider need to be seeded with an initial stabilizing controller; using a stable allows us to start at . We fix the process noise . The model-based nominal method learns using least-squares, exciting the system with Gaussian inputs with variance .

For policy gradients and derivative-free optimization, we use the projected stochastic gradient descent (SGD) method with a constant step size as the optimization procedure. For policy iteration, we evaluate both (Algorithm 1) and (Algorithm 2). For every iteration of LSTD-Q, we project the resulting -function parameter matrix onto the set with . For , we choose by picking the which results in the best performance after timesteps. For , we set which yields the lowest cost over the grid and such that .

Figure 1 contains the results of our non-adaptive evaluation. In Figure 1, we plot the relative error versus the number of timesteps. We see that the model-based certainty equivalence (nominal) method is more sample efficient than the other model-free methods considered. We also see that the value function baseline is able to dramatically reduce the variance of the policy gradient estimator compared to the simple baseline. The DFO method performs the best out of all the model-free methods considered on this example after timesteps, although the performance of policy iteration is comparable.

Next, we compare the performance of LSPI in the adaptive setting. We compare LSPI against the model-free linear quadratic control (MFLQ) algorithm of Abbasi-Yadkori et al. [3], the certainty equivalence (nominal) controller (c.f. [15]), and the optimal controller. We set the process noise , and consider the example of Dean et al. [14]:

Figure 2 shows the results of these experiments. In Figure 2(a), we plot the regret (c.f. Equation 2.15) versus the number of timesteps. We see that LSPI and MFLQ perform similarly, with MFLQ slightly outperforming LSPI. We also note that the model-based nominal controller performs significantly better than both LSPI and MFLQ, which is consistent with the experiments of Abbasi-Yadkori et al. [3]. In Figure 2(b), we plot the relative cost versus the number of timesteps. This quantity represents the sub-optimality incurred if further exploration ceases and the current controller is played indefinitely. Here, we see again that LSPI and MFLQ are comparable, but both are outperformed by nominal control.

## 5 Conclusion

We studied the sample complexity of approximate PI on LQR, showing that order $(n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)$ samples are sufficient to estimate a controller that is within $\varepsilon$ of the optimal. We also show how to turn this offline method into an adaptive LQR method with $T^{2/3}$ regret. Several questions remain open with our work. The first is whether policy iteration is able to achieve $T^{1/2}$ regret, which is possible with model-based methods. The second is whether or not model-free methods provide advantages in situations of partial observability for LQ control. Finally, an asymptotic analysis of LSPI, in the spirit of Tu and Recht [37], is of interest in order to clarify which parts of our analysis are sub-optimal due to the techniques we use and which are inherent in the algorithm.

## Acknowledgments

We thank the authors of Abbasi-Yadkori et al. [3] for providing us with an implementation of their model-free LQ algorithm. ST is supported by a Google PhD fellowship. This work is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, and an Amazon AWS AI Research Award.

## References

- Abbasi-Yadkori and Szepesvári [2011] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret Bounds for the Adaptive Control of Linear Quadratic Systems. In Conference on Learning Theory, 2011.
- Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems. In Conference on Learning Theory, 2011.
- Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Nevena Lazić, and Csaba Szepesvári. Model-Free Linear Quadratic Control via Reduction to Expert Prediction. In AISTATS, 2019.
- Abeille and Lazaric [2018] Marc Abeille and Alessandro Lazaric. Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems. In International Conference on Machine Learning, 2018.
- Alzahrani and Salem [2018] Faris Alzahrani and Ahmed Salem. Sharp bounds for the Lambert $W$ function. Integral Transforms and Special Functions, 29(12):971–978, 2018.
- Antos et al. [2008] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
- Bertsekas [2007] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. 2007.
- Bertsekas [2017] Dimitri P. Bertsekas. Value and Policy Iterations in Optimal Control and Adaptive Dynamic Programming. IEEE Transactions on Neural Networks and Learning Systems, 28(3):500–509, 2017.
- Bhandari et al. [2018] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation. In Conference on Learning Theory, 2018.
- Bogachev [2015] Vladimir I. Bogachev. Gaussian Measures. 2015.
- Boyan [1999] Justin Boyan. Least-Squares Temporal Difference Learning. In International Conference on Machine Learning, 1999.
- Bradtke [1994] Steven J. Bradtke. Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts Amherst, 1994.
- Cohen et al. [2019] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret. arXiv:1902.06223, 2019.
- Dean et al. [2017] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the Sample Complexity of the Linear Quadratic Regulator. arXiv:1710.01688, 2017.
- Dean et al. [2018] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator. In Neural Information Processing Systems, 2018.
- Faradonbeh et al. [2018] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite Time Identification in Unstable Linear Systems. Automatica, 96:342–353, 2018.
- Farahmand et al. [2016] Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized Policy Iteration with Nonparametric Function Spaces. Journal of Machine Learning Research, 17(139):1–66, 2016.
- Fazel et al. [2018] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.
- Fiechter [1997] Claude-Nicolas Fiechter. PAC Adaptive Control of Linear Systems. In Conference on Learning Theory, 1997.
- Ibrahimi et al. [2012] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems. In Neural Information Processing Systems, 2012.
- Kailath et al. [2000] Thomas Kailath, Ali H. Sayed, and Babak Hassibi. Linear Estimation. 2000.
- Lagoudakis and Parr [2003] Michail G. Lagoudakis and Ronald Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
- Lazaric et al. [2012] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-Sample Analysis of Least-Squares Policy Iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.
- Lee and Lim [2008] Hosoo Lee and Yongdo Lim. Invariant metrics, contractions and nonlinear matrix equations. Nonlinearity, 21(4):857–878, 2008.
- Lincoln and Rantzer [2006] Bo Lincoln and Anders Rantzer. Relaxed Dynamic Programming. IEEE Transactions on Automatic Control, 51(8):1249–1260, 2006.
- Malik et al. [2019] Dhruv Malik, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, and Martin J. Wainwright. Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems. In AISTATS, 2019.
- Mania et al. [2018] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. In Neural Information Processing Systems, 2018.
- Mania et al. [2019] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty Equivalent Control of LQR is Efficient. arXiv:1902.07826, 2019.
- Melo et al. [2008] Francisco S. Melo, Sean P. Meyn, and M. Isabel Ribeiro. An Analysis of Reinforcement Learning with Function Approximation. In International Conference on Machine Learning, 2008.
- Nesterov and Spokoiny [2017] Yurii Nesterov and Vladimir Spokoiny. Random Gradient-Free Minimization of Convex Functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- Ouyang et al. [2017] Yi Ouyang, Mukul Gagrani, and Rahul Jain. Control of unknown linear systems with Thompson sampling. In 55th Annual Allerton Conference on Communication, Control, and Computing, 2017.
- Rudelson and Vershynin [2013] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18(82):1–9, 2013.
- Sarkar and Rakhlin [2019] Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, 2019.
- Schäcke [2004] Kathrin Schäcke. On the Kronecker Product. Master’s thesis, University of Waterloo, 2004.
- Simchowitz et al. [2018] Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification. In Conference on Learning Theory, 2018.
- Tu and Recht [2018] Stephen Tu and Benjamin Recht. Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.
- Tu and Recht [2019] Stephen Tu and Benjamin Recht. The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint. In Conference on Learning Theory, 2019.
- Williams [1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–246, 1992.
- Zou et al. [2019] Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation. arXiv:1902.02234, 2019.

## Appendix A Analysis for LSTD-Q

We fix a trajectory . Recall that we are interested in finding the function for a given policy , and we have defined the vectors:

Also recall that the input sequence being played is given by , with . Both policies and are assumed to stabilize . Because of stability, we have that converges to a limit , where is:

The covariance of for is:

We define the following data matrices:

With this notation, the LSTD-Q estimator is:

Next, let be the matrix:

For what follows, we let the notation $\otimes_s$ denote the symmetric Kronecker product. See Schäcke [34] for more details. The following lemma gives us a starting point for analysis. It is based on Lemma 4.1 of Abbasi-Yadkori et al. [3]. Recall that $q = \mathsf{svec}(Q)$ and $Q$ is the matrix which parameterizes the $Q$-function for $K_{\mathrm{eval}}$.

###### Lemma A.1 (Lemma 4.1, [3]).

Let . Suppose that has full column rank, and that

Then we have:

(A.1) |

###### Proof.

By the Bellman equation (2.3), we have the identity:

By the definition of , we have the identity:

where is the orthogonal projector onto the columns of . Combining these two identities gives us:

Next, the -th row of is:

Therefore, . Combining with the above identity:

Because has full column rank, this identity implies that:

Using the inequalities:

we obtain:

Next, let . By triangle inequality:

The claim now follows. ∎

In order to apply Lemma A.1, we first bound the minimum singular value . We do this using the small-ball argument of Simchowitz et al. [35].

###### Definition 2 (Definition 2.1, [35]).

Let be a real-valued stochastic process that is adapted to . The process satisfies the block martingale small-ball (BMSB) condition if for any we have that:

With the block martingale small-ball definition in place, we now show that the process satisfies this condition for any fixed unit vector .

###### Proposition A.2.

Given an arbitrary vector , define the process , the filtration , and matrix . Then satisfies the block martingale small-ball (BMSB) condition from Definition 2. That is, almost surely, we have:

###### Proof.

Let and . We have that:

Therefore: