
# From self-tuning regulators to reinforcement learning and back again

Nikolai Matni, Alexandre Proutiere, Anders Rantzer, Stephen Tu
###### Abstract

Machine and reinforcement learning (RL) are being applied to plan and control the behavior of autonomous systems interacting with the physical world – examples include self-driving vehicles, distributed sensor networks, and agile robots. However, if machine learning is to be applied in these new settings, the resulting algorithms must come with the reliability, robustness, and safety guarantees that are hallmarks of the control theory literature, as failures could be catastrophic. Thus, as RL algorithms are increasingly and more aggressively deployed in safety critical settings, it is imperative that control theorists be part of the conversation. The goal of this tutorial paper is to provide a jumping off point for control theorists wishing to work on RL related problems by covering recent advances in bridging learning and control theory, and by placing these results within the appropriate historical context of the system identification and adaptive control literatures.


## I Introduction & Motivation

With their recent successes in image classification, video game playing [1], sophisticated robotic simulations [2, 3], and complex strategy games such as Go [4, 5], machine and reinforcement learning (RL) are now being applied to plan and control the behavior of autonomous systems that interact with physical environments. Such systems, which include self-driving vehicles and agile robots, must interact with complex environments that are ever changing and difficult to model, strongly motivating the use of data-driven techniques. However, if machine learning is to be applied in these new settings, the resulting algorithms must come with the reliability, robustness, and safety guarantees that typically accompany results in the control theory literature, as failures could be catastrophic. Thus, as RL algorithms are increasingly and more aggressively deployed in safety critical settings, control theorists must be part of the conversation.

To that end, it is important to recognize that while the application areas and technical tools are new, the challenges faced – uncertain and time varying systems and environments, unreliable sensing modalities, the need for robust stability and performance, etc. – are not, and that many classical results from the system identification and adaptive control literature can be brought to bear on these problems. In the case of discrete time linear systems, adaptive control algorithms further come with strong guarantees of asymptotic consistency, stability, and optimality, and similarly elucidate some of the fundamental challenges that are still being wrestled with today, such as rapidly identifying a system model (exploration) while robustly/optimally controlling it (exploitation).

Indeed, at a cursory glance, classical self-tuning regulators have the same objective as contemporary RL: an initial control policy and/or model is posited, data is collected, and a refined model/policy is produced, often in an online fashion. However, until recently, there has been relatively little contact between the two research communities. As a result, the tools and analysis objectives are different. One such feature, which will be the focus of this tutorial, is that RL and online learning algorithms are often analyzed in terms of finite-data guarantees, and as such, are able to provide anytime guarantees on the quality of the current behavior. Such finite-data guarantees are obtained by integrating tools from optimal control, stochastic optimization, and high-dimensional statistics – whereas the first two tools are familiar to the controls community, the latter is less so. A major theme will be that of uncertainty quantification: indeed the importance of relating uncertainty quantification to control objectives was emphasized already in the 1960-70s [6]. Moreover, the fragility of certainty equivalent control was one of the motivating factors for the development of robust adaptive control methodologies.

In this tutorial paper and our companion paper [7], we highlight recent advances that provide non-asymptotic analysis of adaptive algorithms. Our aim is for these papers to serve as a jumping off point for control theorists wanting to work on RL problems. In [7], we present an overview of tools and results on finite-data guarantees for system identification. This paper focuses on finite-data guarantees for self-tuning and adaptive control strategies, and is structured as follows:

• Section II: provides an extensive literature review of work spanning classical and modern results in system identification, adaptive control, and RL.

• Section III: introduces the fundamental problem and performance metrics considered in RL, and relates them to examples familiar to the controls community.

• Section IV: provides a survey of contemporary results for problems with finite state and action spaces.

• Section V: shows how system estimates and error bounds can be incorporated into model-based self-tuning regulators with finite-time performance guarantees.

• Section VI: presents guarantees for model-free methods, and shows that a complexity gap exists between model-based and model-free methods.

## II Literature Review

The results we present in this paper draw heavily from three broad areas of control and learning theory: system identification, adaptive control, and approximate dynamic programming (ADP) or, as it has come to be known, reinforcement learning. Each of these areas has a long and rich history and a general literature review is outside the scope of this tutorial. Below we will instead emphasize pointers to good textbooks and survey papers, before giving a more careful account of recent work.

#### II-1 System Identification

The estimation of system behavior from input/output experiments has a well-developed theory dating back to the 1960s, particularly in the case of linear-time-invariant systems. Standard reference texts on the topic include [6, 8, 9, 10]. The success of discrete time series analysis by Box and Jenkins [11] provided an early impetus for the extension of these methods to the controlled system setting. Important connections to information theory were established by Akaike [12]. The rise of robust control in the 1980s further inspired system identification procedures, wherein model errors were optimized under the assumption of adversarial noise processes [13]. Another important step was the development of subspace methods [14], which became a powerful tool for identification of multi-input multi-output systems.

#### II-2 Adaptive Control

System identification on real systems can be tedious, time consuming, and require skilled personnel. Adaptive control offers a simpler path forward. Moreover, adaptation offers an effective way to compensate for time-variations in the system dynamics. An early driving application was aircraft autopilot development in the 1950s. Aerospace applications needed control strategies that automatically compensated for changes in dynamics due to altitude, speed, and flight configuration [15]. Another important early application was ship steering, where adaptation is used to compensate for wave effects [16]. Standard textbooks include [17, 18, 19, 20, 21].

A direct approach to adaptive control expounded by Bellman [22] was to tackle the problem using dynamic programming by augmenting the state to contain the conditional distribution of the unknown parameters. From this it was seen that in such problems, control served the dual purpose of exciting the system to aid in its identification – hence the term dual control. This approach suffered from the curse of dimensionality, which led to the development of approximation techniques [23, 24] that ultimately evolved into modern day RL.

A more successful approach was self-tuning adaptive control, pioneered by [25, 26] and followed by a long sequence of contributions to adaptive control theory, deriving conditions for convergence, stability, robustness and performance under various assumptions. For example, [27] analysed adaptive algorithms using averaging, [28] derived an algorithm that gives mean square stability with probability one. On the other hand, conditions that may cause instability were studied in [29], [30] and [31]. Finally, [32] gave conditions for optimal asymptotic rates of convergence. More recent adaptive approaches include the L1 adaptive controller [33], and model free adaptive control [34, 35, 36].

#### II-3 Automatic Tuning and Repeated Experiments

There is also an extensive literature on automatic procedures for initialization (tuning) of controllers, without further adaptation after the tuning phase. A successful example is auto-tuning of PID controllers [37], where a relay provides non-linear feedback during the tuning phase. Another important tuning approach, well established from an engineering perspective [38], is based on repeated experiments with linear time-invariant controllers. Theoretical bounds on such an approach were obtained already by Lai and Robbins [39], who showed that a pseudo-regret of the state variance is lower bounded by a logarithmic function of the time horizon. Subsequent work by Lai [40] showed that this bound was tight. Recently, Raginsky [41] revisited this problem formulation, and showed that for any persistently exciting controller, the time taken to bring the state variance within a prescribed tolerance of its optimal value is lower bounded by a quantity that grows with the system dimension.

#### II-4 Dynamic Programming and Reinforcement Learning

A major part of the literature on dynamic programming is devoted to “tabular MDPs,” i.e. systems for which the state and action spaces are discrete and small enough to be stored in memory. The classic texts [42, 23, 24] highlight computationally efficient approximation techniques for solving these problems. They include Monte Carlo methods, temporal-difference (TD) learning [43, 44] (which encompass SARSA [45] and Q-learning [46, 47, 48]), value and Q-function approximation via Neural Networks, kernel methods, least-squares TD (LSTD) [49, 48, 50], and policy gradient methods such as REINFORCE [47, 51] and Actor-Critic Methods [52, 53].

Recent advances in both algorithms and computational power have allowed RL methods to solve incredibly complex tasks in very large discrete spaces that far exceed the tabular setting, including video games [1], Go [4], chess, and shogi [5]. This success has renewed an interest in applying traditional model-free RL methods, such as Q-learning [54] and policy optimization [55], to continuous problems in robotics [2, 3]. Thus far, however, the deployment of systems trained in this way has been limited to simulation environments [56] or highly controlled laboratory settings, as the training process for these systems is both data hungry and highly variable [57].

#### II-5 System Identification Revisited

With few exceptions (e.g., [58]), prior to the 2000s, the literature on system identification and adaptive control focused on asymptotic error characterization and consistency guarantees. In contrast, contemporary results in statistical learning seek to characterize finite time and finite data rates, leaning heavily on tools from stochastic optimization and concentration of measure. Such finite-time guarantees provide estimates of both system parameters and their uncertainty, allowing for a natural bridge to robust/optimal control. Early such results, characterizing rates for parameter identification [59, 60], featured conservative bounds which are exponential in the system degree and other relevant quantities. More recent results, focused on state-space parameter identification for LTI systems, have significantly improved upon these bounds. In [61], the first polynomial time guarantees for identifying a stable linear system were provided – however, these guarantees are in terms of predictive output performance of the model, and require rather stringent assumptions on the true system. In [62], it was shown, assuming that the state is directly measurable and the system is driven by white in time Gaussian noise, that solving a least-squares problem using independent data points taken from different trials achieves order optimal rates that are linear in the system dimension. This result was generalized to the single trajectory setting for (i) marginally stable systems in [63], (ii) unstable systems in [64], and (iii) partially observed stable systems in [65, 66, 67]. We note that analogous results also exist in the fully observed setting for the identification of sparse state-space parameters [68, 69], where rates are shown to be logarithmic in the ambient dimension, and polynomial in the number of nonzero elements to be estimated.

#### II-6 Automatic Tuning Revisited

There has been renewed interest, motivated in part by the expansion of reinforcement learning to continuous control problems, in the study of automatic tuning as applied to the Linear Quadratic Regulator. More closely akin to iterative learning control [70], Fiechter [71] showed that the discounted LQR problem is Probably Approximately Correct (PAC) learnable in an episodic setting. In [62], Dean et al. dramatically improved the generality and sharpness of this result by extending it to the traditional infinite horizon setting, and by leveraging contemporary tools from concentration of measure and robust control.

#### II-7 Adaptive Control Revisited

Contemporary results tend to draw on ideas from the bandits literature. A non-asymptotic study of the adaptive LQR problem was initiated by Abbasi-Yadkori and Szepesvari [72]. They use an Optimism in the Face of Uncertainty (OFU) based approach, where they maintain confidence ellipsoids of system parameters and select those parameters that lead to the best closed loop performance. While the OFU method achieves the optimal $\widetilde{O}(\sqrt{T})$ regret, solving the OFU sub-problem is computationally challenging. To address this issue, other exploration methods were studied. Thompson sampling [73] is used to achieve $\widetilde{O}(\sqrt{T})$ regret for scalar systems [74], and [75] studies a Bayesian setting with a particular Gaussian prior. Both [76] and [77] give tractable algorithms which achieve sub-linear frequentist regret of $\widetilde{O}(T^{2/3})$ without the Bayesian setting of [75]. Follow up work [78] showed that this rate could be improved to $\widetilde{O}(\sqrt{T})$ by leveraging a novel semi-definite relaxation. More recently, [79] show that as long as the initial system parameter estimates are sufficiently accurate, certainty equivalent (CE) control achieves $\widetilde{O}(\sqrt{T})$ regret with high probability. Finally, Rantzer [80] shows that for a scalar system with a minimum variance cost criterion, a simple self-tuning regulator scheme achieves $O(\log T)$ expected regret after an initial burn-in period, thus matching the lower bound established by [39].

Much of the recent work addressing the sample complexity of the LQR problem was motivated by the desire to understand RL algorithms on a simple baseline [81]. In addition to the model-based approaches described above, model-free methods have also been studied. Model-free methods for the LQR problem were put on solid theoretical footing in [82], where it was shown that controllability and persistence of excitation were sufficient to guarantee convergence to an optimal policy. In [83], the first finite time analysis of LSTD as applied to the LQR problem is given, showing that polynomially many samples (in the state dimension and $1/\varepsilon$) are sufficient to estimate the value function up to $\varepsilon$-accuracy. Subsequently, Fazel et al. [84] showed that randomized search algorithms similar to policy gradient can learn the optimal controller with a polynomial number of samples in the noiseless case; however, an explicit characterization of the dependence of the sample complexity on the parameters of the true system was not given, and the algorithm depends on knowledge of an initially stabilizing controller. Similarly, Malik et al. [85] study the behavior of random finite differencing for LQR. Finally, [86] shows that there exists a family of systems for which there is a sample complexity gap of at least a factor of the state dimension between LSTD/policy gradient methods and simple CE model-based approaches.

## III Fundamentals

We study the behavior of Markov Decision Processes (MDPs). In the finite horizon setting of length $T$, we consider

$$\begin{aligned} \min_{\pi}\;& \mathbb{E}\left[\sum_{t=0}^{T-1} c_t(x_t,u_t) + c_T(x_T)\right] \\ \text{s.t.}\;& x_{t+1} = f_t(x_t,u_t,w_t), \quad u_t = \pi_t(x_{0:t}, u_{0:t-1}), \end{aligned} \tag{1}$$

for $x_t$ the system state, $u_t$ the control input, $w_t$ the state transition randomness, and $\pi = \{\pi_t\}$ the control policy, with each $\pi_t$ a possibly random mapping. With slight abuse of notation, we will use $n_x$, $n_u$, and $n_w$ to denote (i) the dimensions of $x_t$, $u_t$, and $w_t$, respectively, when considering continuous state and action spaces, and (ii) the cardinalities of the state, action, and disturbance spaces, respectively, when considering discrete state and action spaces.

We consider settings where both the cost functions $c_t$ and the dynamics functions $f_t$ may not be known. Finally, we assume that the primitive random variables $\{w_t\}$ are defined over a common probability space with known and independent distributions – the expectation in the cost is taken with respect to these and the policy $\pi$.

### III-A Dynamic Programming Solutions

When the transition functions and costs are known, problem (1) can be solved using dynamic programming. As the dynamics are Markovian, we restrict our search to policies of the form $u_t = \pi_t(x_t)$ without loss of optimality. Define the value function of problem (1) at time $t$ to be

$$V_T(x_T) = \mathbb{E}\left[c_T(x_T)\right], \qquad V_t(x_t) = \min_{u_t}\, \mathbb{E}\left[c_t(x_t,u_t) + V_{t+1}\big(f_t(x_t,u_t,w_t)\big)\right]. \tag{2}$$

Iterating through this backward recursion yields both an optimal policy $\pi^\star$ and the optimal cost-to-go that it achieves.

Moving to the infinite horizon setting, we assume that the cost function and dynamics are static, i.e., that $c_t = c$ and $f_t = f$ for all $t$.

We begin by introducing the discounted cost setting, wherein the cost-functional in optimization (1) is replaced with

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, c(x_t,u_t)\right], \tag{3}$$

for some discount factor $\gamma \in (0,1)$. Note that if the stage cost $c$ is bounded almost surely and $\gamma < 1$, then the infinite sum (3) is guaranteed to remain bounded, greatly simplifying analysis. In this setting, evaluating the performance achieved by a fixed policy $\pi$ consists of finding a solution $V^\pi$ to the following equation:

$$V^\pi(x) = \mathbb{E}\left[c(x,\pi(x)) + \gamma\, V^\pi\big(f(x,\pi(x),w)\big)\right], \tag{4}$$

from which it follows immediately that the optimal value function $V^\star$ will satisfy

$$V^\star(x) = \min_u\, \mathbb{E}\left[c(x,u) + \gamma\, V^\star\big(f(x,u,w)\big)\right], \tag{5}$$

and that an optimal policy satisfies $\pi^\star(x) \in \arg\min_u \mathbb{E}\left[c(x,u) + \gamma V^\star(f(x,u,w))\right]$. One can show that under mild technical assumptions, iterative procedures such as policy iteration and value iteration will converge to the optimal policy.
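
To make the value iteration procedure concrete, the following sketch repeatedly applies the Bellman operator from (5) on a small finite MDP. This is our own illustration: the tensor layout `P[u, x, y]` and cost table `c[x, u]` are assumed conventions, not notation from the paper.

```python
import numpy as np

def value_iteration(P, c, gamma, tol=1e-8):
    """Discounted value iteration on a finite MDP.

    P[u, x, y] = Prob(next state = y | state x, action u)
    c[x, u]    = expected stage cost
    Returns the (approximate) optimal value function and a greedy policy.
    """
    n_x, n_u = c.shape
    V = np.zeros(n_x)
    while True:
        # Bellman update: Q[x, u] = c[x, u] + gamma * E[V(next state)]
        Q = c + gamma * np.einsum("uxy,y->xu", P, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)
        V = V_new
```

Since the Bellman operator is a $\gamma$-contraction in the sup-norm, the loop is guaranteed to terminate for any $\gamma < 1$.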

Next, we consider the asymptotic average cost setting, in which case the cost-functional in problem (1) is set to

$$\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\left(\sum_{t=0}^{T-1} c(x_t,u_t) + c_T(x_T)\right)\right]. \tag{6}$$

Care must be taken to ensure that the limit converges, thus somewhat complicating the analysis – however, this cost functional is often the most appropriate for guaranteeing stability in stochastic optimal control problems.

**Example 1** (Linear Quadratic Regulator). Consider the Linear Quadratic Regulator (LQR), a classical instantiation of MDP (1) from the optimal control literature, for $Q, Q_T \succeq 0$ and $R \succ 0$:

$$\begin{aligned} \min_{\pi}\;& \frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T-1} \left(x_t^\top Q x_t + u_t^\top R u_t\right) + x_T^\top Q_T x_T\right] \\ \text{s.t.}\;& x_{t+1} = A x_t + B u_t + w_t, \quad u_t = \pi_t(x_{0:t}, u_{0:t-1}), \end{aligned} \tag{7}$$

where $x_t \in \mathbb{R}^{n_x}$ is the system state, $u_t \in \mathbb{R}^{n_u}$ the control input, and the disturbances $w_t$ are independently and identically distributed zero mean Gaussian random variables with a known covariance matrix.

For a finite horizon $T$ and known matrices $(A, B, Q, R, Q_T)$, this problem can be solved directly via dynamic programming, leading to the optimal control policy

$$u_t^\star = -\left(B^\top P_{t+1} B + R\right)^{-1} B^\top P_{t+1} A\, x_t, \tag{8}$$

where $P_t$ satisfies the Discrete Algebraic Riccati (DAR) recursion initialized at $P_T = Q_T$. Further, when the triple $(A, B, Q^{1/2})$ is stabilizable and detectable, the closed loop system is stable and hence converges to a stationary distribution, allowing us to consider the asymptotic average cost setting (6), at which point the optimal control action is a static policy, defined as in (8), but with $P_{t+1} = P_\infty$, for $P_\infty$ a solution of the corresponding DAR equation.
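
The backward Riccati recursion underlying (8) can be implemented in a few lines. The following is a minimal sketch, assuming the matrices $(A, B, Q, R, Q_T)$ are known:

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Q_T, T):
    """Finite-horizon LQR gains via the backward Riccati recursion.

    Returns gains K_0, ..., K_{T-1} (so that u_t = -K_t x_t) and P_0.
    """
    P = Q_T                     # recursion is initialized at the terminal cost
    gains = []
    for _ in range(T):
        # K_t = (B' P_{t+1} B + R)^{-1} B' P_{t+1} A
        K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        # Riccati step: P_t = Q + A' P_{t+1} A - A' P_{t+1} B K_t
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        gains.append(K)
    gains.reverse()             # gains[t] is now the gain for time step t
    return gains, P
```

Under the stabilizability and detectability conditions above, running the recursion long enough makes both $P$ and the gain converge, recovering the static average-cost policy.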

**Example 2** (Tabular MDP). Consider the setting where the state space $\mathcal{X}$, the control space $\mathcal{U}$, and the disturbance process space $\mathcal{W}$ have finite cardinalities of $n_x$, $n_u$, and $n_w$, respectively, and further suppose that the underlying dynamics are governed by transition probabilities $p(y\,|\,x,u)$. We assume that the cardinalities $n_x$, $n_u$, and $n_w$ are such that the transition probabilities, costs, and value functions can be stored in tabular form in memory and worked with directly. These then induce the dynamics functions:

$$x_{t+1} = w_t(x_t, u_t), \tag{9}$$

where $w_t(x,u) = y$ with probability $p(y\,|\,x,u)$.

In the case of average cost, for simplicity, we restrict our attention to communicating and ergodic MDPs. The former correspond to scenarios where for any two states, there exists a stationary policy leading from one to the other with positive probability. For the latter, any stationary policy induces an ergodic Markov chain. In the average cost setting, one wishes to minimize the asymptotic average cost (6). More precisely, the objective is to identify as fast as possible a stationary policy $\pi^\star$ with optimal gain function $g^\star$, i.e., $g^{\pi^\star}(x) \le g^\pi(x)$ for any policy $\pi$ and state $x$, where $g^\pi(x) := \lim_{T\to\infty} \frac{1}{T} V_T^\pi(x)$ denotes the average cost under $\pi$ starting in state $x$ over a time horizon $T$. When the MDP is ergodic, this gain does not depend on the initial state. To compute the gain of a policy $\pi$, we need to introduce the bias function $h^\pi(x) := \text{C-}\lim_{T\to\infty} \mathbb{E}\left[\sum_{t=0}^{T-1}\left(c(x_t,\pi(x_t)) - g^\pi\right) \,\middle|\, x_0 = x\right]$ (where $\text{C-}\lim$ is the Cesaro limit) that quantifies the advantage of starting in state $x$. $g^\pi$ and $h^\pi$ satisfy, for any $x$:

$$g^\pi(x) + h^\pi(x) = c(x,\pi(x)) + \sum_{y} p(y\,|\,x,\pi(x))\, h^\pi(y).$$

The gain and bias functions $g^\star$ and $h^\star$ of an optimal policy verify Bellman's equation: for all $x$,

$$g^\star(x) + h^\star(x) = \min_{u\in\mathcal{U}} \left(c(x,u) + \sum_{y} p(y\,|\,x,u)\, h^\star(y)\right).$$

The bias function $h^\star$ is defined up to an additive constant.

When the transition probabilities and cost function are known, this problem can then be solved via value iteration, policy iteration, or linear programming.
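
For a fixed stationary policy on an ergodic MDP, the gain/bias pair satisfying the first equation above is the solution of a linear system once the additive constant is pinned down (here by fixing $h(0) = 0$). A minimal sketch, with our own matrix conventions:

```python
import numpy as np

def evaluate_average_cost(P_pi, c_pi):
    """Solve g + h(x) = c(x) + sum_y P_pi[x, y] h(y) for an ergodic chain.

    P_pi: transition matrix induced by the policy; c_pi: per-state cost.
    Returns the (constant) gain g and a bias vector h normalized to h[0] = 0.
    """
    n = len(c_pi)
    # Unknowns are [g, h(1), ..., h(n-1)]; h(0) is fixed to 0.
    M = np.zeros((n, n))
    M[:, 0] = 1.0                          # coefficient of the gain g
    M[:, 1:] = (np.eye(n) - P_pi)[:, 1:]   # coefficients of h(1), ..., h(n-1)
    sol = np.linalg.solve(M, c_pi)
    return sol[0], np.concatenate(([0.0], sol[1:]))
```

Because the bias is only defined up to an additive constant, any state can play the role of the anchor; ergodicity guarantees the resulting linear system is nonsingular.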

### III-B Learning to Control MDPs with Unknown Dynamics

Thus far we have considered settings where the dynamics and costs are known. Our main interest is understanding what should be done when these models are not known. Our study will focus on the previous two examples, namely LQR and the tabular MDP setting. While much of modern reinforcement learning focuses on model-free methods, we adopt a more control theoretic perspective on the problem and study model-based methods, wherein we attempt to approximately learn the system model $f$, and then subsequently use this approximate model for control design.

Before continuing, we distinguish between episodic and single-trajectory settings. An episodic task is akin to traditional iterative learning control, wherein a task is repeated over a finite horizon, after which point the episode ends, and the system is reset to begin the next episode. In contrast, a single-trajectory task is akin to traditional adaptive control, in that no such resets are allowed, and a single evolution of the system under an adaptive policy is studied.

An underlying tension exists between identifying an unknown system and controlling it. Indeed, it is well known that without sufficient exploration or excitation, an incorrect model will be learned, possibly leading to suboptimal and even unstable system behavior; however, this exploration inevitably degrades system performance. Informally, this tension leads to a fundamental tradeoff between how quickly a model can be learned, and how well it can be controlled during this process. Modern efforts seek to explicitly address and quantify these tradeoffs through the use of performance metrics such as the Probably Approximately Correct (PAC) and Regret frameworks, which we define next. For episodic tasks, we assume the horizon of each episode to be of length $H_r$, and consider guarantees on performance as a function of the number of episodes that have been evaluated. For single trajectory tasks, we consider infinite horizon problems, and the definitions provided are equally applicable to the discounted and asymptotic average cost settings. The definitions that follow are adapted from [87, 88, 89], among others.

### III-C PAC-Bounds

##### Episodic PAC-Bounds

We consider episodic tasks over a horizon $H$, where $H$ may be infinite but the user is allowed to reset the system at a prescribed time $H_r$. Let $V^\star$ be the optimal cost achievable, and $N_\epsilon$ be the number of episodes for which the policy is not $\epsilon$-optimal, i.e., the number of episodes for which the achieved cost exceeds $V^\star + \epsilon$. Then, a policy is said to be episodic-$(\epsilon,\delta)$-PAC if, after $N$ episodes, it satisfies[^1]

$$\mathbb{P}\left[N_\epsilon > \mathrm{poly}(n_x, n_u, H_r, 1/\epsilon, 1/\delta)\right] \le \delta. \tag{10}$$

[^1]: We note that in the discounted setting, we also ask that the bound on $N_\epsilon$ depend polynomially on $1/(1-\gamma)$. We also note that modern definitions of PAC-learning [87] require that the bound on $N_\epsilon$ depend polynomially on $\log(1/\delta)$ – this is a reflection of results from contemporary high-dimensional statistics that allow for more refined concentration of measure guarantees.

These guarantees state that the chosen policy is $\epsilon$-optimal on all but a number of episodes polynomial in the problem parameters, with probability at least $1-\delta$. Many PAC algorithms operate in two phases: the first is solely one of exploration so as to identify an approximate system model, and the second is solely one of exploitation, wherein the approximate system model is used to synthesize a control policy. Therefore, informally one can view PAC guarantees as characterizing the number of episodes needed to identify a model that can be used to synthesize an $\epsilon$-optimal policy.

**Example 3** (LQR is episodic PAC-learnable). The results in [62] imply that the LQR problem with an asymptotic average cost is episodic PAC-learnable. In particular, it was shown that a simple open-loop exploration process of injecting white in time Gaussian noise over a number of episodes polynomial in the problem parameters, followed by a least-squares system identification and uncertainty quantification step, can be used with a robust synthesis method to generate a policy $\hat{\pi}$ which guarantees that

$$\mathbb{P}\left[V^{\hat{\pi}} - V^\star \ge \epsilon\right] \le \delta, \tag{11}$$

when the LQR problem is initialized at $x_0 = 0$. Hence the resulting algorithm meets the modern definition of being $(\epsilon,\delta)$-PAC-learnable. We revisit this example in Section V.
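
The exploration and identification steps of this pipeline are easy to sketch; the robust synthesis step, which consumes the uncertainty bounds, is omitted, and the episode structure and estimator below are our own simplified stand-ins:

```python
import numpy as np

def estimate_dynamics(num_episodes, horizon, A, B, sigma_w=0.1, rng=None):
    """Least-squares identification of (A, B) from independent episodes.

    Each episode injects i.i.d. Gaussian inputs starting from x_0 = 0; only
    the final transition of each episode is kept, so the regression samples
    are independent across episodes. Solves min ||X - Z [A B]^T||_F^2.
    """
    rng = rng or np.random.default_rng(0)
    n, m = B.shape
    X, Z = [], []
    for _ in range(num_episodes):
        x = np.zeros(n)
        for t in range(horizon):
            u = rng.standard_normal(m)
            x_next = A @ x + B @ u + sigma_w * rng.standard_normal(n)
            if t == horizon - 1:               # keep the last transition only
                X.append(x_next)
                Z.append(np.concatenate([x, u]))
            x = x_next
    Theta, *_ = np.linalg.lstsq(np.array(Z), np.array(X), rcond=None)
    AB = Theta.T                               # AB = [A_hat  B_hat]
    return AB[:, :n], AB[:, n:]
```

On top of these point estimates, [62] additionally constructs high-probability error bounds on the estimates of $A$ and $B$, which are what the robust synthesis procedure consumes; the least-squares step above is only the first half of that pipeline.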

##### Single-trajectory PAC-Bounds

We consider single-trajectory tasks over an infinite horizon, and let $V^\pi(x)$ denote the cost-to-go from state $x$ achieved by a policy $\pi$, and $V^\star(x)$ be the optimal cost-to-go achievable. We further let $N_\epsilon$ be the number of time-steps for which the policy is not $\epsilon$-optimal, i.e., the number of time-steps for which $V^\pi(x_t) > V^\star(x_t) + \epsilon$. Then, a policy is said to be $(\epsilon,\delta)$-PAC if it satisfies[^2]

$$\mathbb{P}\left[N_\epsilon > \mathrm{poly}(n_x, n_u, 1/\epsilon, 1/\delta)\right] \le \delta. \tag{12}$$

[^2]: We make the same modifications to this definition for the discounted case and the dependence on $\delta$ as in the episodic setting.

These guarantees should be interpreted as saying that the chosen policy is at worst $\epsilon$-suboptimal on all but a polynomial number of time-steps, with probability at least $1-\delta$. As in the episodic setting, one can view these PAC guarantees as characterizing the number of time-steps needed to identify a model that can be used to synthesize an $\epsilon$-optimal policy.

##### Limitations of PAC-Bounds

As an algorithm that is $(\epsilon,\delta)$-PAC is only penalized for suboptimal behavior exceeding the $\epsilon$ threshold, there is no guarantee of convergence to an optimal policy. In fact, as pointed out in [87] and illustrated in the LQR example above, many PAC algorithms cease learning once they are able to produce an $\epsilon$-suboptimal strategy.

### III-D Regret Bounds

We focus on regret bounds for the single-trajectory setting, as this is the most common type of guarantee found in the literature, but note that analogous episodic definitions exist (cf., [87]). The regret framework evaluates the quality of an adaptive policy by comparing its running cost to a suitable baseline. Let $b_T$ represent the baseline cost at time $T$, and define the regret incurred by a policy $\pi$ to be

$$R(T) := \sum_{t=0}^{T} c_t\big(x_t, \pi_t(x_{0:t}, u_{0:t-1}), v_t\big) - b_T, \tag{13}$$

where $v_t$ captures any randomness in the realized cost.

Note that $b_T$ is user specified, and is often chosen to be the expected optimal cost achievable by a policy with full knowledge of the system dynamics. The two most common regret guarantees found in the literature are expected regret bounds, and high probability regret bounds. In the expected regret setting, the goal is to show that

$$\mathbb{E}\left[R(T)\right] \le \mathrm{poly}(n_x, n_u, T), \tag{14}$$

whereas in the high-probability regret setting, the goal is to show that[^3]

$$\mathbb{P}\left[R(T) \le \mathrm{poly}(n_x, n_u, T, 1/\delta)\right] \ge 1 - \delta. \tag{15}$$

[^3]: As in the PAC setting, modern definitions often require the dependence to be polynomial in $\log(1/\delta)$.

These bounds therefore quantify the rate of convergence of the cost achieved by the adaptive policy to the baseline cost, providing anytime guarantees on performance relative to a desirable baseline. From the definition of $R(T)$, it is clear that one should strive for a sub-linear dependence on $T$, as this implies that the time-averaged cost achieved by the adaptive policy converges to the baseline cost $b_T$. Further, in contrast to the PAC framework, all sub-optimal behavior is tallied by the running regret sum, and hence exploration and exploitation must be suitably balanced to achieve favorable bounds.

**Example 4** (Regret bounds for LQR). The study of regret bounds for LQR was initiated in [72]. Here we summarize a recent treatment of the problem, as provided in [79]. There, the authors study the performance of CE control for LQR, and study a regret measure of the form

$$R(T) := \sum_{t=1}^{T} \left(x_t^\top Q x_t + u_t^\top R u_t\right) - T\, V^\star, \tag{16}$$

with $V^\star$ the optimal asymptotic average cost achieved by the true optimal LQR controller. They show that a control policy formed by adding an exploration term to the CE controller achieves the regret bound

$$R(T) \le \mathrm{poly}\big(n_x, n_u, \log(1/\delta)\big)\, O\big(T^{1/2}\big), \tag{17}$$

with probability at least $1-\delta$, so long as the horizon $T$ is sufficiently large and the initial estimates of the system dynamics are sufficiently accurate. We revisit this example in Section V.

##### Limitations of Regret Bounds

As regret only tracks the integral of suboptimal behavior, it does not distinguish between a few severe mistakes and many small ones. In fact, [87] shows that for Tabular MDP problems, an algorithm achieving optimal regret may still make infinitely many mistakes that are maximally suboptimal. Thus regret bounds cannot provide guarantees about transient worst-case deviations from the baseline cost $b_T$, which may have implications for guaranteeing the robustness or safety of an algorithm. We comment further on regret for discrete MDPs in the next section.

## IV Optimal Control of Unknown Discrete Systems

This section addresses reinforcement learning for stationary MDPs with finite state and control spaces of respective cardinalities $n_x$ and $n_u$. When the system is in state $x$ and the control input is $u$, the system evolves to state $y$ with probability $p(y\,|\,x,u)$, and the cost induced is independently drawn from a distribution with expectation $c(x,u)$. Costs are bounded, and for any $(x,u)$, the cost distribution is absolutely continuous w.r.t. a common measure (for example, the Lebesgue measure).

We consider both average-cost and discounted MDPs (refer to Example 2), and describe methods (i) to derive fundamental performance limits (in terms of regret and sample complexity), and (ii) to devise efficient learning algorithms. The way the learner samples the MDP may differ significantly in the literature, depending on the objective (average or discounted cost), and on whether one wishes to derive fundamental performance limits or performance guarantees for a given algorithm. For example, in the case of average cost, the learner typically gathers information about the system in an online manner following the system trajectory, a sampling model referred to previously as the single trajectory model. Most sample complexity analyses are on the contrary derived under the so-called generative model, where any state-control pair can be sampled at any time. Generative models are easier to analyze but hide the difficult issue of navigating the state space to explore the various state-control pairs.

### IV-A Average-cost MDPs

For average-cost MDPs, we are primarily interested in devising algorithms with minimum regret, defined for a given learning algorithm $\pi$ as

$$R^\pi(T) = \sum_{t=1}^{T} c_t\big(x_t^\pi, u_t^\pi\big) - \sum_{t=1}^{T} c_t\big(x_t^\star, u_t^\star\big),$$

where $x_t^\pi$ and $u_t^\pi$ are the state and control input under the policy $\pi$ at time-step $t$, and similarly the superscript $\star$ corresponds to an optimal stationary control policy. Next, we discuss the type of regret guarantees we could aim at.

Expected vs. high-probability regret. We would ideally wish to characterize the complete regret distribution. This is however hard, and most existing results provide guarantees either in expectation or with high probability. In the case of finite state and control spaces, high-probability guarantees can be easy to derive and not very insightful. Consider, for example, a stochastic bandit problem (a stateless MDP) where the goal is to identify the control input with the lowest expected cost. An algorithm exploring each control input sufficiently many times would, by a direct application of Hoeffding's inequality, identify the best input and hence suffer bounded regret with high probability. This simple observation only holds for a fixed MDP (the gap between the costs of the various inputs can depend neither on $T$ nor on the time at which regret is evaluated), and relies on the assumption of bounded costs. Even in the simplistic stochastic bandit problem, analyzing the distribution of the regret remains an open and important challenge; refer to [90, 91] for initial discussions and results.
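The explore-then-commit scheme alluded to above is easy to simulate. A minimal sketch on a Bernoulli-cost bandit (parameter choices are illustrative, not from the references):

```python
import numpy as np

def explore_then_commit(means, n_explore, T, rng):
    """Explore-then-commit on a Bernoulli-cost bandit (a stateless MDP).

    Pull each arm n_explore times, commit to the empirical best for the
    remaining rounds, and return the realized regret against the best
    arm. Hoeffding's inequality guarantees that the commit step picks
    the best arm with high probability once n_explore is large enough
    relative to the inverse squared gaps.
    """
    K = len(means)
    cost = 0.0
    sums = np.zeros(K)
    for a in range(K):
        for _ in range(n_explore):
            c = float(rng.random() < means[a])
            sums[a] += c
            cost += c
    best = int(np.argmin(sums / n_explore))
    for _ in range(T - K * n_explore):
        cost += float(rng.random() < means[best])
    return cost - T * min(means)
```

Note that the realized regret is a random variable: even conditioned on committing to the correct arm, it fluctuates with the cost noise, which is exactly why characterizing its full distribution is delicate.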

Problem-specific vs. minimax regret guarantees. A regret upper bound is problem-specific if it explicitly depends on the parameters defining the MDP. Such performance guarantees capture and quantify the hardness of learning to control the system. Minimax regret guarantees are far less precise and informative, since they concern the worst system among possibly all systems. An algorithm with good minimax regret upper bound behaves well in the worst case, but does not necessarily learn and adapt to the system it aims at controlling.

Guided by the above observations, we focus on the expected regret, and always aim, whenever possible, at deriving problem-specific performance guarantees.

#### Iv-A1 Regret lower bounds

We present here a unified simple method to derive both problem-specific and minimax regret lower bounds. This method has been developed mainly in the bandit optimization literature [92, 93, 94, 95] as a simplified alternative to Lai and Robbins techniques [96].

Let $\phi$ denote the true MDP. Consider a second MDP $\psi$. For a given learning algorithm $\pi$, denote by $L(T)$ the log-likelihood ratio of the corresponding observations under $\phi$ and $\psi$. By a simple extension of Wald's lemma, we get:

 Eπϕ[L(T)]=∑x,uEπϕ[Nxu(T)]KLϕ|ψ(x,u), (18)

where $N_{xu}(T)$ is the number of times the state-control pair $(x,u)$ is observed under $\pi$, and where $KL_{\phi|\psi}(x,u)$ is the KL divergence between the distributions of the observations made in $(x,u)$ under $\phi$ and $\psi$. These observations concern the next state and the realized cost, and hence:

 KLϕ|ψ(x,u)=∑y p(y|x,u)logp(y|x,u)p′(y|x,u)+∫q(r|x,u)logq(r|x,u)q′(r|x,u)λ(dr).

Now the data processing inequality states that for any event $E$ or any $[0,1]$-valued random variable $Z$ depending on all observations up to time $T$, i.e., $\mathcal{F}_T$-measurable ($\mathcal{F}_T$ is the $\sigma$-algebra generated by the observations under $\pi$ up to time $T$):

 E[L(T)]≥max{kl(Pπϕ[E],Pπψ[E]),kl(Eπϕ[Z],Eπψ[Z])},

where $kl(a,b)$ is the KL divergence between two Bernoulli distributions of respective means $a$ and $b$. Now combining the above inequality with (18) yields a lower bound on a weighted sum of the expected numbers of times each state-control pair is selected. These numbers are directly related to the regret, as one can show that:

 Rπ(T)=∑x∈X∑u∉O(x,ϕ)Eπϕ[Nxu(T)]δ⋆(x,u;ϕ)+O(1),

where $O(x,\phi)$ denotes the set of optimal control inputs in state $x$ under $\phi$, and $\delta^\star(x,u;\phi)$ is the sub-optimality gap quantifying the regret incurred by selecting the control $u$ in state $x$. It remains to select the event $E$ or the random variable $Z$ to get a regret lower bound.

To derive problem-specific regret lower bounds, we introduce the notion of uniformly good algorithms. $\pi$ is uniformly good if for any ergodic MDP $\phi$, any initial state, and any constant $\alpha > 0$, $R^\pi(T) = o(T^\alpha)$. As will become clear later, uniformly good algorithms exist. Now select the event $E$ in terms of the state visit counts $N_x(T)$ (the number of times $x$ is visited up to time $T$); $E$ is chosen such that it is very likely under $\phi$, invoking the ergodicity of the MDP. In the change-of-measure argument, $\psi$ is chosen such that $\Pi^\star(\phi) \cap \Pi^\star(\psi) = \emptyset$, where $\Pi^\star(\cdot)$ denotes the set of optimal policies under the corresponding MDP. Now if $\pi$ is uniformly good, $E$ is very likely under $\phi$, and very unlikely under $\psi$. Formally, we can establish that:

 kl(Pπϕ[E],Pπψ[E])≥log(T)+O(1).

Putting the above ingredients together, we obtain:

{theorem}

(Theorem 1, [97]) Let $\phi$ be an ergodic MDP. For any uniformly good algorithm $\pi$ and for any initial state,

 $\liminf_{T\to\infty} \frac{R^\pi(T)}{\log T} \ge K(\phi)$, (19)

where $K(\phi)$ is the value of the following optimization problem:

 $\inf_{\eta \in F_\Phi(\phi)} \; \sum_{x,u} \eta(x,u)\,\delta^\star(x,u;\phi)$, (20)

where $F_\Phi(\phi)$ is the set of $\eta \ge 0$ satisfying

 ∑(x,u)∈X×Uη(x,u)KLϕ|ψ(x,u)≥1,   ∀ψ∈Δ(ϕ) (21)

and $\Delta(\phi)$ denotes the set of confusing MDPs, i.e., MDPs $\psi \ll \phi$ that coincide with $\phi$ at every optimal state-control pair but satisfy $\Pi^\star(\phi) \cap \Pi^\star(\psi) = \emptyset$.

In the above theorem, $\psi \ll \phi$ means that the observations under $\psi$ have distributions absolutely continuous w.r.t. those of the observations under $\phi$. By imposing the constraint (21) for all $\psi \in \Delta(\phi)$, we consider all possible confusing MDPs $\psi$. It can be shown that the set of constraints defining $F_\Phi(\phi)$ can be reduced and decoupled: it is sufficient to consider MDPs $\psi$ that differ from $\phi$ in only one suboptimal state-control pair. In that case, the constraints decouple across suboptimal pairs $(x,u)$, each bounding $\eta(x,u)$ from below by a problem-dependent constant. As a consequence, the regret lower bound scales at most as $O(n_x n_u \log T)$. The theorem also indicates the optimal exploration rates of the various suboptimal state-control pairs: $\eta(x,u)\log T$ represents the expected number of times $(x,u)$ should be observed. Finally, we note that the method used to derive the regret lower bound can also be applied to obtain finite-time regret lower bounds, as in [95].
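In the stateless (Bernoulli bandit) special case, the program (20)-(21) decouples per suboptimal arm and its value collapses to the classical Lai-Robbins constant. A small sketch (function names are illustrative):

```python
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def regret_constant(means):
    """Value of the lower-bound program for a Bernoulli-cost bandit.

    In the stateless case each constraint (21) involves a single
    suboptimal arm, the optimal exploration rate is
    eta(a) = 1 / kl(mu_a, mu_star), and the value of (20) is the sum
    of gap / kl terms over suboptimal arms.
    """
    mu_star = min(means)
    return sum((m - mu_star) / kl_bern(m, mu_star)
               for m in means if m > mu_star)
```

Multiplying this constant by $\log T$ gives the asymptotic expected-regret floor that any uniformly good algorithm must pay on this instance.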

For minimax lower bounds, we do not need to restrict our attention to uniformly good algorithms, since for any given algorithm, we are free to pick the MDP for which the algorithm performs the worst. Instead, to identify an $\mathcal{F}_T$-measurable $[0,1]$-valued random variable $Z$ such that the resulting bound is large, we leverage symmetry arguments. Specifically, the MDP is constructed so as to contain numerous equivalent states and control inputs, and $Z$ is chosen as the proportion of time a particular state-control pair is selected. Refer to [98] for the construction of this MDP, and to [95] for more detailed explanations on how to apply change-of-measure arguments to derive minimax bounds.

{theorem}

(Theorem 5, [98]) For any algorithm $\pi$ and for all sufficiently large integers $n_x$, $n_u$, $D$, and $T$, there is an MDP with $n_x$ states, $n_u$ control inputs, and diameter $D$ such that for any initial state:

 $\mathbb{E}^\pi_\phi[R^\pi(T)] \ge 0.015\sqrt{D\,n_x n_u T}$. (22)

#### Iv-A2 Efficient Algorithms

A plethora of learning algorithms have been developed for average-cost MDPs. We can categorize these algorithms based on their design principles. A first class of algorithms aims at matching the asymptotic problem-specific regret lower bound derived above [99, 100, 97]. These algorithms rely on estimating the MDP parameters, and in each round (or periodically) they solve the optimization problem (20) with the true MDP parameters replaced by their estimators. The solution is then used to guide and minimize the exploration process. This first class of algorithms is discussed further in §IV-A3.

The second class of algorithms includes UCRL, UCRL2, and KL-UCRL [101, 98, 102]. These algorithms apply the "optimism in the face of uncertainty" principle and exhibit finite-time regret guarantees. They consist in building upper confidence bounds on the parameters of the MDP, and selecting control inputs based on these bounds. The regret guarantees are anytime, but use worst-case regret as a performance benchmark. For example, UCRL2 with confidence parameter $\delta$ as an input satisfies:

{theorem}

(Theorem 2, [98]) With probability at least $1-\delta$, for any initial state and all $T$, the regret under UCRL2 satisfies:

 $R^\pi(T) \le 34\,D\,n_x\sqrt{n_u T \log(T/\delta)}$.

Anytime regret upper bounds with a logarithmic dependence on the time horizon have also been investigated for UCRL2 and KL-UCRL. For instance, UCRL2 is known to yield a regret logarithmic in $T$ (with problem-dependent constants) with high probability.
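On a stateless problem, the optimism principle behind this class of algorithms reduces to a lower-confidence index on costs (the cost-minimization mirror of UCB1). The sketch below is only that bandit caricature, not UCRL2 itself:

```python
import numpy as np

def ucb_costs(means, T, rng):
    """Optimism in the face of uncertainty on a Bernoulli-cost bandit.

    A lower-confidence index on each arm's empirical cost plays the
    role that UCRL2's optimistic MDP plays in the full problem: we act
    as if each arm were as good as plausibly consistent with the data.
    Returns the realized regret.
    """
    K = len(means)
    counts = np.ones(K)
    est = np.array([float(rng.random() < m) for m in means])  # one pull each
    cost = float(est.sum())
    for t in range(K, T):
        index = est - np.sqrt(2.0 * np.log(t + 1) / counts)
        a = int(np.argmin(index))
        c = float(rng.random() < means[a])
        est[a] = (est[a] * counts[a] + c) / (counts[a] + 1)
        counts[a] += 1
        cost += c
    return cost - T * min(means)
```

Suboptimal arms are pulled only often enough to shrink their confidence intervals below their gaps, which is the mechanism behind logarithmic-in-$T$ regret.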

The last class of algorithms applies a Bayesian approach similar to that used by the celebrated Thompson sampling algorithm for bandit problems. In [103], AJ, a posterior sampling algorithm, is proposed and enjoys the following regret guarantees:

{theorem}

(Theorem 1, [103]) With probability at least $1-\delta$, for any initial state, the regret under AJ with confidence parameter $\delta$ satisfies: for $T$ large enough,

 $R^\pi(T) = \tilde{O}(D\sqrt{n_x n_u T})$.
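The posterior-sampling idea is again most easily seen in the stateless case: maintain a Beta posterior per arm, sample one mean from each posterior, and act greedily on the samples. AJ performs the analogous sampling over entire MDPs; the sketch below is only the bandit caricature.

```python
import numpy as np

def thompson_costs(means, T, rng):
    """Posterior-sampling sketch on a Bernoulli-cost bandit.

    Maintain a Beta posterior over each arm's cost mean; each round,
    draw one sample per posterior and play the arm whose sampled cost
    is smallest. Returns the realized regret.
    """
    K = len(means)
    alpha = np.ones(K)  # pseudo-counts of observed unit costs
    beta = np.ones(K)   # pseudo-counts of observed zero costs
    cost = 0.0
    for _ in range(T):
        a = int(np.argmin(rng.beta(alpha, beta)))
        c = float(rng.random() < means[a])
        alpha[a] += c
        beta[a] += 1.0 - c
        cost += c
    return cost - T * min(means)
```

Exploration here is driven by posterior spread rather than explicit confidence bounds, yet it concentrates on near-optimal arms just as optimistic methods do.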

#### Iv-A3 Structured MDPs

The regret lower bounds derived for tabular MDPs have the major drawback of scaling with the product $n_x n_u$ of the numbers of states and controls. Hence, with large state and control spaces, it is essential to identify and exploit any structure present in the system dynamics and cost function so as to minimize exploration phases and in turn reduce regret to reasonable values. Modern RL algorithms actually implicitly impose some structural properties either on the model parameters (transition probabilities and cost function, see e.g. [104]) or directly on the $Q$-function (for discounted RL problems, see e.g. [105]). Despite their successes, we do not have any regret guarantees for these recent algorithms. Recent efforts to develop algorithms with guarantees for structured MDPs include [104, 106, 107, 97]. [97] is the first paper extending the analysis of [99] to the case of structured MDPs. The authors derive a problem-specific regret lower bound, and show that the latter can be obtained by just modifying, in Theorem IV-A1, the definition of the set of confusing MDPs. Specifically, if $\Phi$ denotes the set of structured MDPs ($\Phi$ encodes the structure), then the set of confusing MDPs becomes $\Delta_\Phi(\phi) = \Delta(\phi) \cap \Phi$. The minimal expected regret scales as $K_\Phi(\phi)\log T$, where $K_\Phi(\phi)$ is the value of the correspondingly modified optimization problem (20). In [97], DEL, an algorithm extending that proposed in [99] to the case of structured MDPs, is shown to optimally exploit the structure:

{theorem}

(Theorem 4, [97]) For any MDP $\phi \in \Phi$, the regret under DEL satisfies:

 limsupT→∞E[Rπ(T)]log(T)≤KΦ(ϕ).

The semi-infinite LP characterizing the regret lower bound can be simplified for some particular structures. For example, in the case where the transition probabilities and cost function vary smoothly over states and controls (Lipschitz continuity), it can be shown that the regret lower bound does not scale with $n_x$ and $n_u$. The simplified LP can then be used, as in [99], to devise an asymptotically optimal algorithm.

### Iv-B Discounted MDPs

Most research efforts towards the design of efficient algorithms for discounted MDPs, from early work [47] to more recent Deep RL [105], have focused on model-free approaches, where one directly learns the value or the Q-value function of the MDP. Such an approach leads to simple algorithms that are potentially more robust than model-based algorithms (since they do not rely on modelling assumptions). The performance analysis of these algorithms has initially centered mainly on the question of their convergence; for example, the analysis of $Q$-learning algorithms with function approximation [108] often calls for new convergence results for stochastic approximation schemes. Researchers have then strived to investigate and optimize their convergence rates. There is no consensus on the metric one should use to characterize the speed of convergence; e.g., the recent Zap Q-learning algorithm [109] minimizes the asymptotic error covariance matrix, while most other analyses focus on the minimax sample complexity. Note that the notion of regret in discounted settings is hard to define and has hence not been studied. Also observe that problem-specific metrics have not been investigated yet, and it hence seems perilous to draw definitive conclusions from existing theoretical results for learning discounted MDPs.

#### Iv-B1 Sample complexity lower bound

For discounted MDPs, the sample complexity is defined as the number of samples one needs to gather so as to learn an $\epsilon$-optimal policy with probability at least $1-\delta$. Minimax sample complexity lower bounds are known for both the generative and online sampling models:

{theorem}

(Theorem 1, [110] and Theorem 11, [111]) In both the generative and online sampling models, for $\epsilon$ and $\delta$ small enough, there exists an MDP for which learning an $\epsilon$-optimal policy requires a sample complexity of $\Omega\big(\frac{n_x n_u}{(1-\gamma)^3 \epsilon^2}\log\frac{1}{\delta}\big)$ (where $\gamma$ denotes the discount factor).

#### Iv-B2 The price of model-free approaches

Some model-based algorithms are known to match the minimax sample complexity lower bound. In the online sampling setting, the authors of [111] present UCRL($\gamma$), an extension of UCRL to discounted costs, and establish a minimax sample complexity upper bound matching the above lower bound. UCRL($\gamma$) consists in deriving upper confidence bounds for the MDP parameters, and in selecting actions optimistically (which can raise important computational issues). In the generative sampling model, algorithms mixing model-based and model-free approaches have been shown to be minimax-sample optimal. This is the case of QVI (Q-value Iteration), initially proposed in [112] and analyzed in [110]. QVI estimates the MDP and, from this estimator, applies a classical value iteration method to approximate the $Q$-function and hence the optimal policy. QVI can also be made computationally efficient [113].
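A minimal sketch of the model-based QVI idea under a generative model (`sample` is an assumed oracle returning next states for a queried pair; this is the idea only, not the algorithm of [112, 110]):

```python
import numpy as np

def qvi(sample, n_states, n_controls, n_samples, gamma, cost, n_iter=500):
    """Model-based Q-value iteration under a generative model.

    Build empirical transition probabilities from n_samples oracle
    calls per state-control pair, then run value iteration on the
    estimated model to approximate the optimal Q-function.
    """
    p_hat = np.zeros((n_states, n_controls, n_states))
    for x in range(n_states):
        for u in range(n_controls):
            for y in sample(x, u, n_samples):
                p_hat[x, u, y] += 1.0 / n_samples
    q = np.zeros((n_states, n_controls))
    for _ in range(n_iter):
        v = q.min(axis=1)            # greedy (cost-minimizing) values
        q = cost + gamma * p_hat @ v
    return q
```

The two phases are cleanly separated: all samples are spent on the model, and the planning step afterwards is exact value iteration on that model.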

As of now, there is no purely model-free algorithm achieving the minimax sample complexity limit. Speedy Q-learning [114] has a minimax sample complexity in $\tilde{O}\big(\frac{n_x n_u}{(1-\gamma)^4 \epsilon^2}\big)$ (for now the best one can provably do using model-free approaches). However, there is hope of finding model-free minimax-sample-optimal algorithms. In fact, recently, $Q$-learning with exploration driven by simple upper confidence bounds on the $Q$-values (rather than on the MDP parameters as in UCRL) has been shown to be minimax regret optimal for episodic reinforcement learning tasks [115]. It is likely that model-free algorithms can be made minimax optimal. If this is verified, it would further advocate the use of model-specific rather than minimax performance metrics.
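For contrast with the model-based sketch above, here is plain tabular Q-learning: fully model-free, learning along a single trajectory with epsilon-greedy exploration (`step` is an assumed environment oracle; 1/n step-sizes are one common, if slow, choice).

```python
import numpy as np

def q_learning(step, x0, n_states, n_controls, gamma, T, rng, eps=0.2):
    """Model-free tabular Q-learning along a single trajectory.

    step(x, u) returns (cost, next_state). No transition model is
    ever estimated: the Q-function is updated directly from observed
    one-step transitions.
    """
    q = np.zeros((n_states, n_controls))
    n = np.zeros((n_states, n_controls))
    x = x0
    for _ in range(T):
        if rng.random() < eps:                 # epsilon-greedy exploration
            u = int(rng.integers(n_controls))
        else:
            u = int(np.argmin(q[x]))
        c, y = step(x, u)
        n[x, u] += 1.0
        lr = 1.0 / n[x, u]                     # decaying step-size
        q[x, u] += lr * (c + gamma * q[y].min() - q[x, u])
        x = y
    return q
```

Nothing about the MDP is stored beyond the Q-table itself, which is the source of both the simplicity and the sample-efficiency gap discussed above.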

## V Model-Based Methods for LQR

We combine the techniques described in [7] with robust and optimal control to derive finite-time guarantees for the optimal LQR control of an unknown system. We partition our study according to three initial uncertainty regimes: (i) a completely unknown $(A, B)$, (ii) moderate error bounds under which CE control may fail, and (iii) small error bounds under which CE control is stabilizing.

### V-a PAC Bounds for Unknown (A, B)

Here we assume that the system is completely unknown, and consider the problem of identifying system estimates $(\hat{A}, \hat{B})$, bounding the corresponding parameter uncertainties $\epsilon_A$ and $\epsilon_B$, and using these system estimates and uncertainty bounds to compute a controller with provable performance bounds. In what follows, unless otherwise specified, all results are taken from [62].

The system identification and uncertainty quantification steps are covered in Theorem IV.3 of [7], which we summarize here for the convenience of the reader.

Consider a linear dynamical system described by

 xt+1=Axt+But+wt, (23)

where $x_t \in \mathbb{R}^{n_x}$, $u_t \in \mathbb{R}^{n_u}$, and $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$. To identify the matrices $(A, B)$, we inject excitatory Gaussian noise via $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$. We run $N$ experiments over a horizon of $T+1$ time-steps, and then solve for our estimates via the least-squares problem:

 $[\hat{A}\;\;\hat{B}]^\top = \arg\min_{(A,B)} \sum_{i=1}^N \big\|x^{(i)}_{T+1} - A x^{(i)}_T - B u^{(i)}_T\big\|_2^2$. (24)

Notice that we only use the last time-step of each trajectory: we do so for analytic simplicity, and return to single-trajectory estimators that use all of the data later in the section. We then have the following guarantees.
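The estimator (24) is a few lines of NumPy. The sketch below (dimensions and noise levels are illustrative) simulates $N$ independent rollouts and regresses the last recorded transition of each:

```python
import numpy as np

def estimate_last_step(A, B, sigma_u, sigma_w, N, T, rng):
    """Least-squares identification in the style of estimator (24).

    Simulates N independent rollouts of x_{t+1} = A x_t + B u_t + w_t
    with Gaussian excitation and process noise, and regresses the last
    recorded transition of each rollout.
    """
    nx, nu = A.shape[0], B.shape[1]
    Z, Y = [], []
    for _ in range(N):
        x = np.zeros(nx)
        for _ in range(T):
            u = sigma_u * rng.standard_normal(nu)
            x, x_prev, u_prev = (A @ x + B @ u
                                 + sigma_w * rng.standard_normal(nx)), x, u
        Z.append(np.concatenate([x_prev, u_prev]))
        Y.append(x)
    theta, *_ = np.linalg.lstsq(np.array(Z), np.array(Y), rcond=None)
    AB = theta.T                      # rows solve y ~ [A B] [x; u]
    return AB[:, :nx], AB[:, nx:]
```

Because rollouts are independent, the regression rows are i.i.d., which is exactly what makes the analysis of this estimator simpler than the single-trajectory case.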

{theorem}

Consider the least-squares estimator defined by (24). Fix a failure probability $\delta \in (0,1)$, and assume that $N$ is sufficiently large relative to $(2n_x + n_u)\log(1/\delta)$. Then, it holds with probability at least $1-\delta$ that

 $\|\hat{A} - A\|_2 \le \frac{8\sigma_w}{\sqrt{\lambda_{\min}(\Sigma_x)}}\sqrt{\frac{(2n_x + n_u)\log(54/\delta)}{N}}$, (25)

and

 $\|\hat{B} - B\|_2 \le \frac{8\sigma_w}{\sigma_u}\sqrt{\frac{(2n_x + n_u)\log(54/\delta)}{N}}$, (26)

where $\lambda_{\min}(\Sigma_x)$ is the minimum eigenvalue of the finite-time controllability Gramian $\Sigma_x$.

We now condition on the high-probability guarantee of Theorem V-A, and assume that we have system estimates $(\hat{A}, \hat{B})$ and corresponding uncertainty bounds $\epsilon_A \ge \|\hat{A} - A\|_2$ and $\epsilon_B \ge \|\hat{B} - B\|_2$, allowing us to focus on the controller synthesis step. Our goal is to compute a controller that is robustly stabilizing for any admissible realization of the system parameters, and for which we can bound performance degradation as a function of the uncertainty sizes $(\epsilon_A, \epsilon_B)$.

In order to meet these goals, we use the System Level Synthesis [116, 117] (SLS) nominal and robust [118] parameterizations of stabilizing controllers. The SLS framework focuses on the system responses of a closed-loop system. Consider an LTI causal controller $\mathbf{K}$, and let $\mathbf{u} = \mathbf{K}\mathbf{x}$. Then the closed-loop transfer matrices from the process noise $\mathbf{w}$ to the state $\mathbf{x}$ and control action $\mathbf{u}$ satisfy

 [\tfx\tfu]=[(zI−A−B\tfK)−1\tfK(zI−A−B\tfK)−1]\tfw. (27)

We then have the following theorem parameterizing the set of stable closed-loop transfer matrices, as described in equation (27), that are achievable by a stabilizing controller $\mathbf{K}$. {theorem}[State-Feedback Parameterization [116]] The following are true:

• The affine subspace defined by

 [zI−A−B][\tfΦx\tfΦu]=I, \tfΦx,\tfΦu∈1zRH∞ (28)

parameterizes all system responses (27) from $\mathbf{w}$ to $(\mathbf{x}, \mathbf{u})$, achievable by an internally stabilizing state-feedback controller $\mathbf{K}$.

• For any transfer matrices $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ satisfying (28), the controller $\mathbf{K} = \mathbf{\Phi}_u\mathbf{\Phi}_x^{-1}$ is internally stabilizing and achieves the desired system response (27).

We will also make use of the following robust variant of Theorem V-A. {theorem}[Robust Stability [118]] Let $\mathbf{\Phi}_x$ and $\mathbf{\Phi}_u$ be two transfer matrices in $\frac{1}{z}\mathcal{RH}_\infty$ such that

 [zI−A−B][\tfΦx\tfΦu]=I+\tfΔ. (29)

Then the controller $\mathbf{K} = \mathbf{\Phi}_u\mathbf{\Phi}_x^{-1}$ stabilizes the system described by $(A, B)$ if and only if $(I + \mathbf{\Delta})^{-1} \in \mathcal{RH}_\infty$. Furthermore, the resulting system response is given by

 [\tfx\tfu]=[\tfΦx\tfΦu](I+\tfΔ)−1\tfw. (30)
{coro}

Under the assumptions of Theorem V-A, if $\|\mathbf{\Delta}\| < 1$ for any induced norm $\|\cdot\|$, then the controller $\mathbf{K} = \mathbf{\Phi}_u\mathbf{\Phi}_x^{-1}$ stabilizes the system described by $(A, B)$.

We now return to the problem setting where the estimates $(\hat{A}, \hat{B})$ of a true system $(A, B)$ satisfy $\|\Delta_A\|_2 \le \epsilon_A$ and $\|\Delta_B\|_2 \le \epsilon_B$, for $\Delta_A := \hat{A} - A$ and $\Delta_B := \hat{B} - B$. We first formulate the LQR problem in terms of the system responses $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$. It follows from Theorem V-A and the standard equivalence between infinite-horizon LQR and $\mathcal{H}_2$ optimal control that, for a disturbance process distributed as $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$, the standard LQR problem (7) can be equivalently written as

 $\min_{\mathbf{\Phi}_x, \mathbf{\Phi}_u} \; \sigma_w^2 \left\| \begin{bmatrix} Q^{\frac12} & 0 \\ 0 & R^{\frac12} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2}^2 \;\; \text{s.t. equation (28)}.$ (31)

Going forward, we drop the $\sigma_w^2$ multiplier in the objective function, as it does not affect the guarantees that we compute.

We begin with a simple sufficient condition under which any controller that stabilizes $(\hat{A}, \hat{B})$ also stabilizes the true system $(A, B)$. For a matrix $M$, we let $\mathfrak{R}_M$ denote the resolvent, i.e., $\mathfrak{R}_M := (zI - M)^{-1}$.

{lemma}

Let the controller $\mathbf{K}$ stabilize $(\hat{A}, \hat{B})$, and let $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ be its corresponding system response (27) on system $(\hat{A}, \hat{B})$. Then, if $\mathbf{K}$ stabilizes $(A, B)$, it achieves the following LQR cost

 $J(A, B, \mathbf{K}) = \left\| \begin{bmatrix} Q^{\frac12} & 0 \\ 0 & R^{\frac12} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} (I + \hat{\mathbf{\Delta}})^{-1} \right\|_{\mathcal{H}_2}.$ (32)

Furthermore, letting

 $\hat{\mathbf{\Delta}} := -\begin{bmatrix} \Delta_A & \Delta_B \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix},$ (33)

the controller $\mathbf{K}$ stabilizes $(A, B)$ if $\|\hat{\mathbf{\Delta}}\|_{\mathcal{H}_\infty} < 1$. We can therefore pose a robust LQR problem as

 $\min_{\mathbf{\Phi}_x, \mathbf{\Phi}_u} \; \sup_{\|\Delta_A\|_2 \le \epsilon_A,\, \|\Delta_B\|_2 \le \epsilon_B} J(A, B, \mathbf{K}) \;\; \text{s.t.} \;\; \begin{bmatrix} zI - \hat{A} & -\hat{B} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} = I.$ (34)

The resulting robust control problem is one subject to real-parametric uncertainty, a class of problems known to be computationally intractable [119]. To circumvent this issue, we instead find an upper bound on the cost that is independent of the uncertainties $\Delta_A$ and $\Delta_B$. First, note that if $\|\hat{\mathbf{\Delta}}\|_{\mathcal{H}_\infty} < 1$, we can write

 $J(A, B, \mathbf{K}) \le \frac{1}{1 - \|\hat{\mathbf{\Delta}}\|_{\mathcal{H}_\infty}} J(\hat{A}, \hat{B}, \mathbf{K}).$ (35)

This upper bound separates nominal performance, as captured by $J(\hat{A}, \hat{B}, \mathbf{K})$, from the effects of the model uncertainty, as captured by $\|\hat{\mathbf{\Delta}}\|_{\mathcal{H}_\infty}$. It therefore remains to compute a tractable bound for $\|\hat{\mathbf{\Delta}}\|_{\mathcal{H}_\infty}$. {proposition}[Proposition 3.5, [62]] For any $\alpha \in (0,1)$ and $\hat{\mathbf{\Delta}}$ as defined in (33), we have

 $\|\hat{\mathbf{\Delta}}\|_{\mathcal{H}_\infty} \le \left\| \begin{bmatrix} \frac{\epsilon_A}{\sqrt{\alpha}}\mathbf{\Phi}_x \\ \frac{\epsilon_B}{\sqrt{1-\alpha}}\mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_\infty} =: H_\alpha(\mathbf{\Phi}_x, \mathbf{\Phi}_u).$ (36)

Applying Proposition V-A in conjunction with the bound (35), we arrive at the following upper bound on the cost function of the robust LQR problem (34), which is independent of the perturbations $(\Delta_A, \Delta_B)$:

 $\sup_{\|\Delta_A\|_2 \le \epsilon_A,\, \|\Delta_B\|_2 \le \epsilon_B} J(A, B, \mathbf{K}) \le \frac{J(\hat{A}, \hat{B}, \mathbf{K})}{1 - H_\alpha(\mathbf{\Phi}_x, \mathbf{\Phi}_u)}.$ (37)

The upper bound is only valid when $H_\alpha(\mathbf{\Phi}_x, \mathbf{\Phi}_u) < 1$, which guarantees the stability of the closed-loop system. Note that (37) can be used to upper bound the performance achieved by any robustly stabilizing controller.
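The quantity $H_\alpha$ in (36) is an $\mathcal{H}_\infty$ norm, which for a finite impulse response (FIR) transfer matrix can be estimated by gridding the unit circle. A rough numerical sketch (the list-of-coefficients representation is an assumed convention; gridding only lower-bounds the true norm):

```python
import numpy as np

def hinf_norm_fir(Phi, n_freq=512):
    """Approximate the H-infinity norm of a strictly proper FIR
    transfer matrix G(z) = sum_k Phi[k] z^{-(k+1)}.

    Grids the unit circle and takes the largest maximum singular
    value; this lower-bounds the true norm, with the gap vanishing
    as the grid is refined.
    """
    best = 0.0
    for theta in np.linspace(0.0, 2 * np.pi, n_freq, endpoint=False):
        z = np.exp(1j * theta)
        G = sum(Pk * z ** (-(k + 1)) for k, Pk in enumerate(Phi))
        best = max(best, np.linalg.norm(G, 2))
    return best
```

Stacking $\epsilon_A \mathbf{\Phi}_x / \sqrt{\alpha}$ on top of $\epsilon_B \mathbf{\Phi}_u / \sqrt{1-\alpha}$ and applying this routine evaluates an approximation of $H_\alpha(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ for a candidate FIR response.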

We can then pose the robust LQR synthesis problem as the following quasi-convex optimization problem, which can be solved by gridding over $\gamma \in [0, 1)$:

 $\min_{\gamma \in [0,1)} \frac{1}{1-\gamma} \min_{\mathbf{\Phi}_x, \mathbf{\Phi}_u} \left\| \begin{bmatrix} Q^{\frac12} & 0 \\ 0 & R^{\frac12} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2} \;\; \text{s.t.} \;\; \text{equation (28) with } (\hat{A}, \hat{B}), \; H_\alpha(\mathbf{\Phi}_x, \mathbf{\Phi}_u) \le \gamma.$ (38)

As we constrain $H_\alpha(\mathbf{\Phi}_x, \mathbf{\Phi}_u) \le \gamma < 1$, any feasible solution to optimization problem (38) generates a controller that stabilizes the true system $(A, B)$.

{remark}

Optimization problem (38) is infinite-dimensional. However, one can solve a finite-dimensional approximation of the problem over a horizon $L$ (see Theorem 5.1, [62]) such that the sub-optimality bounds we prove below still hold up to universal constants.

We then have the following theorem bounding the sub-optimality of the proposed robust LQR controller.

{theorem}

Let $J^\star$ denote the minimal LQR cost achievable by any controller for the dynamical system with transition matrices $(A, B)$, and let $K^\star$ denote the optimal controller. Let $(\hat{A}, \hat{B})$ be estimates of the transition matrices such that $\|\Delta_A\|_2 \le \epsilon_A$ and $\|\Delta_B\|_2 \le \epsilon_B$. Then, if $\mathbf{K}$ is synthesized via (38) with $\alpha = 1/2$, the relative error in the LQR cost is

 $\frac{J(A, B, \mathbf{K}) - J^\star}{J^\star} \le 5(\epsilon_A + \epsilon_B\|K^\star\|_2)\,\|\mathfrak{R}_{A + BK^\star}\|_{\mathcal{H}_\infty},$ (39)

as long as $(\epsilon_A + \epsilon_B\|K^\star\|_2)\,\|\mathfrak{R}_{A + BK^\star}\|_{\mathcal{H}_\infty}$ is sufficiently small.

The crux of the proof of Theorem V-A is to show that for sufficiently small $(\epsilon_A, \epsilon_B)$, the optimal LQR controller $K^\star$ is robustly stabilizing. We then exploit optimality to upper bound the cost achieved by our controller $\mathbf{K}$ with that achieved by the optimal LQR controller $K^\star$, from which the result follows almost immediately by repeating the argument with the estimated and true systems reversed. Combining Theorems V-A and V-A, we see that the relative cost error decays as $O(\sqrt{(2n_x + n_u)\log(1/\delta)/N})$. This in turn shows that LQR optimal control of an unknown system is episodic PAC learnable, where here we interpret each system identification experiment as an episode.

{example}

Consider an LQR problem specified by

 $A = \begin{bmatrix} 1.01 & 0.01 & 0 \\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}, \quad B = I, \quad Q = 10^{-3} I, \quad R = I.$ (40)

In the left plot of Figure 1, we show the percentage of stabilizing controllers synthesized using certainty equivalence and the proposed robust LQR method over 100 independent trials. Notice that even after collecting data from 100 trajectories, the CE controller yields unstable behavior in approximately 10% of cases. Given that the state of the underlying dynamical system is only 3-dimensional, one might consider 100 data-points to be a reasonable approximation of an "asymptotic" amount of data, highlighting the need for a more refined analysis of the effects of finite data on stability and performance. Contrast this with the behavior achieved by the robust LQR synthesis method, which explicitly accounts for system uncertainty: after a small number of trials, there is a sharp transition to 100% stability across trials. Further, feasibility of the synthesis problem (38) provides a certificate of stability and performance, conditioned on the uncertainty bounds being correct. However, robustness does come at a price: as shown in the right plot of Figure 1, the CE controller outperforms the robust LQR controller when it is stabilizing.
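The fragility of certainty equivalence in this example can be traced to the nominal closed loop being only barely stable. A quick check using nothing beyond the problem data in (40) (Riccati value iteration here is a sketch; a production implementation would call a dedicated DARE solver):

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=10000):
    """Infinite-horizon discrete-time LQR gain via Riccati value iteration."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        K = -np.linalg.solve(R + BtP @ B, BtP @ A)
        P = Q + A.T @ P @ (A + B @ K)
    return K

A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B, Q, R = np.eye(3), 1e-3 * np.eye(3), np.eye(3)
K = lqr_gain(A, B, Q, R)
rho_open = max(abs(np.linalg.eigvals(A)))            # unstable: > 1
rho_closed = max(abs(np.linalg.eigvals(A + B @ K)))  # barely stable: just below 1
```

With $Q = 10^{-3}I$ and $R = I$, state deviations are penalized far less than control effort, so the optimal closed-loop spectral radius sits just below 1; small errors in $(\hat{A}, \hat{B})$ can therefore tip a certainty-equivalent design into instability, consistent with the roughly 10% failure rate observed in Figure 1.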

### V-B Regret Bounds under Moderate Uncertainty

We have just described an offline procedure for learning a coarse estimate of the system dynamics and computing a robustly stabilizing controller. We now consider the task of adaptively refining this model and controller. For this problem, we seek high-probability bounds on the regret $R(T)$, defined as

 R(T):=T∑t=1(x⊤tQxt+u⊤tRut)−TJ⋆, (41)

for $J^\star$ defined as in the previous section.

Consider the single trajectory least-squares estimator:

 (\Ah,\Bh)∈argminA,B12T−1∑t=0\normxt+1−Axt−But22, (42)

solved with data generated from system (23) driven by input $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$. The following result from Simchowitz et al. [63] gives us a high-probability bound on the error of the estimator (42). {theorem}[[63]] Suppose that $A$ is stable and that the trajectory length $T$ satisfies:

 T≥Ω(n+d+nlog(1δ(1+σ2u\normB2/σ2w))).

With probability at least $1-\delta$, the quantity $\max\{\|\hat{A} - A\|_2, \|\hat{B} - B\|_2\}$ is bounded above by

 $O\left(\sigma_w \sqrt{\dfrac{n + d + n\log\left(\frac{1}{\delta}\left(1 + \frac{\sigma_u^2}{\sigma_w^2}\|B\|^2\right)\right)}{T\,\min\{\sigma_w^2, \sigma_u^2\}}}\right).$

Here, the $O(\cdot)$ hides specific properties of the controllability Gramian.

Suppose we are provided with an initial stabilizing controller $K_0$: how should we balance controlling the system (exploitation) with exciting it for system identification purposes (exploration)? We propose studying the simple exploration scheme where $u_t = K_0 x_t + \eta_t$, with $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I)$. Theorem V-B tells us that if we collect data for $T$ time-steps and compute estimates $(\hat{A}(T), \hat{B}(T))$ using ordinary least squares, then with high probability we have that

 $\epsilon := \max\{\|\hat{A}(T) - A\|_2, \|\hat{B}(T) - B\|_2\} \le \tilde{O}\left(\frac{1}{\sigma_\eta\sqrt{T}}\right)$ (43)

when $T$ is sufficiently large. Furthermore, we saw in Theorem V-A that for a model error of size bounded by $\epsilon$, the sub-optimality incurred by a robust LQR controller synthesized using problem (38) satisfies $J - J^\star \le O(\epsilon)J^\star$, where here we use $J$ to denote the cost achieved by the controller. Letting denote the horizon cost