
# Provably Safe and Robust Learning-Based Model Predictive Control

Anil Aswani (aaswani@eecs.berkeley.edu), Humberto Gonzalez (hgonzale@eecs.berkeley.edu), S. Shankar Sastry (sastry@eecs.berkeley.edu), and Claire Tomlin (tomlin@eecs.berkeley.edu)
Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
Corresponding author: A. Aswani.
###### Abstract

Controller design faces a trade-off between robustness and performance, and the reliability of linear controllers has caused many practitioners to focus on the former. However, there is renewed interest in improving system performance to deal with growing energy constraints. This paper describes a learning-based model predictive control (LBMPC) scheme that provides deterministic guarantees on robustness, while statistical identification tools are used to identify richer models of the system in order to improve performance; the benefits of this framework are that it handles state and input constraints, optimizes system performance with respect to a cost function, and can be designed to use a wide variety of parametric or nonparametric statistical tools. The main insight of LBMPC is that safety and performance can be decoupled under reasonable conditions in an optimization framework by maintaining two models of the system. The first is an approximate model with bounds on its uncertainty, and the second model is updated by statistical methods. LBMPC improves performance by choosing inputs that minimize a cost subject to the learned dynamics, and it ensures safety and robustness by checking whether these same inputs keep the approximate model stable when it is subject to uncertainty. Furthermore, we show that if the system is sufficiently excited, then the LBMPC control action probabilistically converges to that of an MPC computed using the true dynamics.

**Keywords:** Predictive control; statistics; robustness; safety analysis; learning control.

## 1 Introduction

Tools from control theory face an inherent trade-off between robustness and performance. Stability can be derived using approximate models, but optimality requires accurate models. This has driven research in adaptive [64, 65, 55, 6, 60] and learning-based [74, 3, 70, 1, 47] control. Adaptive control reduces conservatism by modifying controller parameters based on system measurements, and learning-based control improves performance by using measurements to refine models of the system. However, learning by itself cannot ensure the properties that are important to controller safety and stability [15, 7, 8].

The motivation of this paper is to design a control scheme that can (a) handle state and input constraints, (b) optimize system performance with respect to a cost function, (c) use statistical identification tools to learn model uncertainties, and (d) provably converge. The main challenge is combining (a) and (c): statistical methods converge in a probabilistic sense, which is not strong enough for providing deterministic guarantees of safety. Showing (d) is also difficult because of the differences between statistical and dynamical convergence.

We introduce a form of robust, adaptive model predictive control (MPC) that we refer to as learning-based model predictive control (LBMPC). The main insight of LBMPC is that performance and safety can be decoupled in an MPC framework by using reachability tools [4, 14, 56, 23, 5, 69, 52]. In particular, LBMPC improves performance by choosing inputs that minimize a cost subject to the dynamics of a learned model that is updated using statistics, while ensuring safety and stability by using theory from robust MPC [19, 21, 42, 44] to check whether these same inputs keep a nominal model stable when it is subject to uncertainty.

LBMPC is similar to other variants of MPC. For instance, linear parameter-varying MPC (LPV-MPC) has a model that changes using successive online linearizations of a nonlinear model [38, 26]; the difference is that LBMPC updates the models using statistical methods, provides robustness to poor model updates, and can involve nonlinear models. Other forms of robust, adaptive MPC [28, 2] use an adaptive model with an uncertainty measure to ensure robustness, while LBMPC uses a learned model to improve performance and a nominal model with an uncertainty measure to provide robustness.

Here, we focus on the case in which the nominal model is linear and has a known level of uncertainty. After reviewing notation and definitions, we formally define the LBMPC optimization problem. Deterministic theorems about safety, stability, and robustness are proved. Next, we discuss how learning is incorporated into the LBMPC framework using parametric or nonparametric statistical tools. Provided sufficient excitation of the system, we show convergence of the control law of LBMPC to that of an MPC that knows the true dynamics. The paper concludes by discussing applications of LBMPC to three experimental testbeds [12, 9, 20, 13] and to a simulated jet engine compression system [53, 25, 39].

## 2 Preliminaries

In this section, we define the notation, describe the model, and summarize results on estimation and filtering. Note that polytopes are assumed to be convex and compact.

### 2.1 Mathematical Notation

We use $x'$ to denote the transpose of $x$, and subscripts denote time indices. Marks above a variable distinguish the state, output, and input of different models of the same system. For instance, the true system has state $x$, the nominal linear model with disturbance has state $\bar{x}$, and the model with oracle has state $\tilde{x}$.

A function $\alpha : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is type-$\mathcal{K}$ if it is continuous, strictly increasing, and satisfies $\alpha(0) = 0$ [63]. A function $\beta$ is type-$\mathcal{KL}$ if, for each fixed $t$, the function $\beta(\cdot, t)$ is type-$\mathcal{K}$, and, for each fixed $s$, the function $\beta(s, \cdot)$ is decreasing and $\beta(s, t) \to 0$ as $t \to \infty$ [35]. Also, $V$ is a Lyapunov function for a discrete-time system with equilibrium $x^*$ if (a) $V(x^*) = 0$ and $V(x) > 0$ for $x \neq x^*$; (b) $\alpha_1(\|x - x^*\|) \leq V(x) \leq \alpha_2(\|x - x^*\|)$, where $\alpha_1, \alpha_2$ are type-$\mathcal{K}$ functions; (c) $x^*$ lies in the interior of the domain of $V$; and (d) $V(x_{n+1}) - V(x_n) \leq -\alpha_3(\|x_n - x^*\|)$, with $\alpha_3$ type-$\mathcal{K}$, for states $x_n$ of the dynamical system.

Let $\mathcal{U}, \mathcal{V}, \mathcal{W} \subseteq \mathbb{R}^n$ be sets. Their Minkowski sum [66] is $\mathcal{U} \oplus \mathcal{V} = \{u + v : u \in \mathcal{U},\ v \in \mathcal{V}\}$, and their Pontryagin set difference [66] is $\mathcal{U} \ominus \mathcal{V} = \{u : u \oplus \mathcal{V} \subseteq \mathcal{U}\}$. This set difference is not symmetric, and so the order of operations is important; also, the set difference can result in an empty set. The linear transformation of $\mathcal{U}$ by a matrix $T$ is given by $T\mathcal{U} = \{Tu : u \in \mathcal{U}\}$. Some useful properties [66, 37] include: $T(\mathcal{U} \oplus \mathcal{V}) = T\mathcal{U} \oplus T\mathcal{V}$; $(\mathcal{U} \ominus \mathcal{V}) \oplus \mathcal{V} \subseteq \mathcal{U}$; $\mathcal{U} \ominus (\mathcal{V} \oplus \mathcal{W}) = (\mathcal{U} \ominus \mathcal{V}) \ominus \mathcal{W}$; and $\mathcal{U} \oplus (\mathcal{V} \ominus \mathcal{W}) \subseteq (\mathcal{U} \oplus \mathcal{V}) \ominus \mathcal{W}$.
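As a minimal numerical illustration of these operations (a sketch under our own simplifying assumption of symmetric axis-aligned boxes, for which both operations reduce to arithmetic on half-widths):

```python
import numpy as np

# A = [-a, a]^n and B = [-b, b]^n elementwise, encoded by their half-widths.
a = np.array([2.0, 3.0])          # half-widths of A
b = np.array([0.5, 1.0])          # half-widths of B

mink_sum = a + b                  # half-widths of A (+) B (Minkowski sum)
pont_diff = a - b                 # half-widths of A (-) B (Pontryagin diff.)
assert np.all(pont_diff >= 0), "A (-) B is empty"
print(mink_sum, pont_diff)        # [2.5 4. ] [1.5 2. ]
```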

For a sequence $x_n$ and rate $r_n$, the notation $x_n = O(r_n)$ means that there exists $M > 0$ such that $\|x_n\| \leq M r_n$, for all $n$. For a random variable $X_n$, constant $M$, and rate $r_n$, the notation $X_n = O_p(r_n)$ means that given $\epsilon > 0$, there exists $M > 0$ such that $\mathbb{P}(\|X_n\| > M r_n) \leq \epsilon$, for all $n$. The notation $X_n \xrightarrow{p} X$ denotes convergence in probability, meaning that for every $\epsilon > 0$ it holds that $\mathbb{P}(\|X_n - X\| > \epsilon) \to 0$.

### 2.2 Model

Let $x \in \mathbb{R}^n$ be the state vector, $u \in \mathbb{R}^m$ be the control input, and $y$ be the output. We assume that the states and control inputs are constrained by the polytopes $\mathcal{X}$ and $\mathcal{U}$, i.e., $x \in \mathcal{X}$ and $u \in \mathcal{U}$. The true system dynamics are

$$x_{n+1} = Ax_n + Bu_n + g(x_n, u_n) \qquad (1)$$

and $y_n = Cx_n$, where $A$, $B$, and $C$ are matrices of appropriate size and $g(x_n, u_n)$ describes the unmodeled (possibly nonlinear) dynamics. The intuition is that we have a nominal linear model with modeling error. The term uncertainty is used interchangeably with modeling error.

We assume that the modeling error of (1) is bounded and lies within a polytope $\mathcal{W}$, meaning that $g(x, u) \in \mathcal{W}$ for all $(x, u) \in \mathcal{X} \times \mathcal{U}$. This assumption is not restrictive in practice because it holds whenever $g$ is continuous, since $\mathcal{X}$ and $\mathcal{U}$ are bounded. Moreover, the set $\mathcal{W}$ can be determined using techniques from uncertainty quantification [18]; for example, the residual error from model fitting can be used to compute this uncertainty.
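As an illustration of the residual-based approach, the following minimal sketch bounds the residuals by an axis-aligned box (the box shape and the `margin` inflation are our simplifying assumptions; `X` and `U` are hypothetical logged state and input trajectories):

```python
import numpy as np

def error_box(X, U, A, B, margin=1.1):
    """Bound residuals g(x_n, u_n) = x_{n+1} - A x_n - B u_n by a box.

    X has shape (T, n_states), U has shape (T, n_inputs); returns per-
    coordinate half-widths w, so W = {d : |d_i| <= w_i for all i}.
    """
    resid = X[1:].T - A @ X[:-1].T - B @ U[:-1].T   # shape (n_states, T-1)
    return margin * np.abs(resid).max(axis=1)       # inflate by safety margin
```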

### 2.3 Estimation and Filtering

Simultaneously performing state estimation and learning unmodeled dynamics requires measuring all states [10], except in special cases [9]. We focus on the case in which all states are measured (i.e., $C = I$). It is possible to relax these assumptions by using set-theoretic estimation methods (e.g., [51]), but we do not consider those extensions here. For simplicity of presentation, we assume that there is no measurement noise; however, our results extend to the case with measurement noise by simply replacing the modeling error set $\mathcal{W}$ in our results with $\mathcal{W} \oplus \mathcal{V}$, where $\mathcal{V}$ is a polytope encapsulating the effect of bounded measurement noise.

## 3 Learning-Based MPC

This section presents the LBMPC technique. The first step is to use reachability tools to construct a terminal set with robustness properties for the LBMPC, and this terminal set is important for proving the stability, safety, and robustness properties of LBMPC. The terminal constraint set is typically used to guarantee both feasibility and convergence [50]. We decouple performance from robustness by identifying feasibility with robustness and convergence with performance.

One novelty of LBMPC is that different models of the system are maintained by the controller. In order to delineate the variables of the various models, we add marks above $x$ and $u$. The true system (1) has state $x$ and input $u$. The nominal linear model with uncertainty has state $\bar{x}$ and input $\bar{u}$; its dynamics are given by

$$\bar{x}_{n+1} = A\bar{x}_n + B\bar{u}_n + d_n, \qquad (2)$$

where $d_n \in \mathcal{W}$ is a disturbance. Because $g(x_n, u_n) \in \mathcal{W}$, the disturbance $d_n$ reflects the uncertain nature of the modeling error.

For the learned model, we denote the state $\tilde{x}$ and input $\tilde{u}$. Its dynamics are $\tilde{x}_{n+1} = A\tilde{x}_n + B\tilde{u}_n + \mathcal{O}_n(\tilde{x}_n, \tilde{u}_n)$, where $\mathcal{O}_n$ is a time-varying function that is called the oracle. The reason we call this function the oracle is in reference to computer science, in which an oracle is a black box that takes in inputs and gives an answer: LBMPC only needs to know the value (and gradient, when doing numerical computations) of this function at a finite set of points; the mathematical structure and the details of how the oracle is computed are not relevant to the stability and robustness properties of LBMPC.

### 3.1 Construction of an Invariant Set

We begin by recalling two facts [44]. First, if $(A, B)$ is stabilizable, then the set of steady-state points of (2) (with no disturbance) is parametrized as $\bar{x}_s = \Lambda\theta$ and $\bar{u}_s = \Psi\theta$, where $\Lambda$ and $\Psi$ are full column-rank matrices with suitable dimensions. These matrices can be computed with a null space computation, by noting that $(A - I)\Lambda\theta + B\Psi\theta = 0$ for all $\theta$. Second, if $A + BK$ is Schur stable (i.e., all eigenvalues of $A + BK$ have magnitude strictly less than one), then the control input $\bar{u} = K\bar{x} + (\Psi - K\Lambda)\theta$ steers (2) (with no disturbance) to the steady-state $\bar{x}_s = \Lambda\theta$ and $\bar{u}_s = \Psi\theta$, whenever $\Lambda\theta \in \mathcal{X}$ and $\Psi\theta \in \mathcal{U}$.

These facts are useful because they can be used to construct a robust reachable set that serves as the terminal constraint set for LBMPC. The particular type of reach set we use is known as a maximal output admissible disturbance invariant set $\Omega$. It is a set of points $(\bar{x}, \theta)$ such that any trajectory of the system with initial condition chosen from this set and with control $\bar{u} = K\bar{x} + (\Psi - K\Lambda)\theta$ remains within the set for any sequence of bounded disturbances, while satisfying the constraints on the state and input [37].

These properties of $\Omega$ are formalized as (a) constraint satisfaction:

$$\Omega \subseteq \{(\bar{x}, \theta) : \bar{x} \in \mathcal{X};\ \Lambda\theta \in \mathcal{X};\ K\bar{x} + (\Psi - K\Lambda)\theta \in \mathcal{U};\ \Psi\theta \in \mathcal{U}\}, \qquad (3)$$

and (b) disturbance invariance:

$$\begin{bmatrix} A+BK & B(\Psi - K\Lambda) \\ 0 & I \end{bmatrix}\Omega \oplus (\mathcal{W} \times \{0\}) \subseteq \Omega. \qquad (4)$$

Recall that the $\theta$ component of the set $\Omega$ is a parametrization of which steady-state points can be tracked using the control $\bar{u} = K\bar{x} + (\Psi - K\Lambda)\theta$.

The set $\Omega$ has an infinite number of constraints in general, though arbitrarily good approximations can be computed in a finite number of steps [37, 44, 57]. These approximations maintain both disturbance invariance and constraint satisfaction, and these are the properties which are used in the proofs for our MPC scheme. So even though our results are stated for $\Omega$, they equally hold true for appropriately computed approximations.
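For intuition, one standard fixed-point scheme of this type (a sketch consistent with the computational approach of [37]; individual algorithms differ in their details) starts from the constraint set in (3) and iterates

$$\Omega_0 = \{(\bar{x}, \theta) \text{ satisfying (3)}\}, \qquad \Omega_{k+1} = \{\omega \in \Omega_k : \mathcal{A}\omega + w \in \Omega_k \ \text{for all } w \in \mathcal{W} \times \{0\}\},$$

where $\mathcal{A}$ denotes the block matrix appearing on the left-hand side of (4). The iterates are nested ($\Omega_{k+1} \subseteq \Omega_k$), each one is a polytope when $\mathcal{X}$, $\mathcal{U}$, and $\mathcal{W}$ are polytopes, and the iteration stops once $\Omega_{k+1} = \Omega_k$.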

### 3.2 Stability and Safety of LBMPC

LBMPC uses techniques from a type of robust MPC known as tube MPC [21, 42, 44], and it enlarges the feasible domain of the control by using tracking ideas from [22, 45]. The idea of tube MPC is that given a nominal trajectory of the linear system (2) without disturbance, then the trajectory of the true system (1) is guaranteed to lie within a tube that surrounds the nominal trajectory. A linear feedback is used to control how wide this tube can grow. Moreover, LBMPC fixes the initial condition of the nominal trajectory as in [21, 42], as opposed to letting the initial condition be an optimization variable as in [44].

Let $N$ be the number of time steps for the horizon of the MPC. The width of the tube at the $i$-th step, for $i \in \{0, \dots, N\}$, is given by the set $R_i = \bigoplus_{j=0}^{i-1}(A+BK)^j\mathcal{W}$ (with $R_0 = \{0\}$), and the constraints are shrunk by the width of this tube. The result is that if the $i$-th step of the nominal trajectory lies in $\mathcal{X} \ominus R_i$, then the true trajectory lies in $\mathcal{X}$. Similarly, suppose that the $N$-th step of the nominal trajectory lies in $\Omega \ominus (R_N \times \{0\})$, where $\Omega$ is the terminal set constructed above; then the true trajectory lies in $\Omega$, and the invariance properties of $\Omega$ imply that there exists a control that keeps the system stable even under disturbances.

The following optimization problem defines LBMPC

$$V_n(x_n) = \min_{c,\theta}\ \psi_n(\theta, \tilde{x}_n, \dots, \tilde{x}_{n+N},\ \check{u}_n, \dots, \check{u}_{n+N-1}) \qquad (5)$$

subject to:

$$\tilde{x}_n = x_n, \qquad \bar{x}_n = x_n \qquad (6)$$

$$\tilde{x}_{n+i+1} = A\tilde{x}_{n+i} + B\check{u}_{n+i} + \mathcal{O}_n(\tilde{x}_{n+i}, \check{u}_{n+i}) \qquad (7)$$

$$\left.\begin{aligned} \bar{x}_{n+i+1} &= A\bar{x}_{n+i} + B\check{u}_{n+i}\\ \check{u}_{n+i} &= K\bar{x}_{n+i} + c_{n+i}\\ \bar{x}_{n+i+1} &\in \mathcal{X} \ominus R_{i+1}, \quad \check{u}_{n+i} \in \mathcal{U} \ominus KR_i\\ (\bar{x}_{n+N}, \theta) &\in \Omega \ominus (R_N \times \{0\}) \end{aligned}\right\} \qquad (8)$$

for all $i \in \{0, \dots, N-1\}$ in the constraints; $K$ is the feedback gain used to compute $\Omega$; $c = (c_n, \dots, c_{n+N-1})$ and $\theta$ are the decision variables; $\mathcal{O}_n$ is the oracle; and $\psi_n$ are non-negative functions that are Lipschitz continuous in their arguments. Note that the Lipschitz assumption is not restrictive because it is satisfied by costs with bounded derivatives; for example, linear and quadratic costs satisfy this due to the boundedness of states and inputs. Also note that the same control $\check{u}_{n+i}$ is applied to both the nominal and learned models.
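To make the structure concrete, the following is a minimal sketch of one LBMPC instance in CVXPY, under simplifying assumptions that are ours and not part of the formulation above: a linear oracle $\mathcal{O}_n(x, u) = F_a x + F_b u$ (so the problem is a quadratic program), box state and input constraints with precomputed scalar tightening margins standing in for the Pontryagin differences, a box stand-in for the terminal set, and regulation to the origin (the $\theta$ parametrization is omitted). All matrices and numbers are hypothetical placeholders.

```python
import cvxpy as cp
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # nominal model matrices
B = np.array([[0.005], [0.1]])
K = np.array([[-1.0, -1.5]])               # stabilizing feedback (A+BK Schur)
Fa = 0.02 * np.eye(2)                      # oracle: learned corrections to A
Fb = np.zeros((2, 1))                      # oracle: learned corrections to B
N, x_max, u_max = 10, 5.0, 1.0
r = 0.05 * np.arange(N + 1)                # tube margins, stand-ins for R_i

def lbmpc_step(x0):
    """Solve one instance of (5)-(8) and return u_n = K x_n + c_n."""
    c = cp.Variable((1, N))
    xb, xt = x0, x0                        # nominal (safety) / learned (cost)
    cost, cons = 0.0, []
    for i in range(N):
        u = K @ xb + c[:, i]               # same input drives both models
        xb = A @ xb + B @ u                # nominal dynamics (2), d = 0
        xt = (A + Fa) @ xt + (B + Fb) @ u  # learned dynamics (7)
        cost += cp.sum_squares(xt) + 0.1 * cp.sum_squares(u)
        cons += [cp.norm(xb, "inf") <= x_max - r[i + 1],  # X minus R_{i+1}
                 cp.norm(u, "inf") <= u_max - r[i]]       # U minus K R_i
    cons += [cp.norm(xb, "inf") <= 1.0]    # box stand-in for the terminal set
    cp.Problem(cp.Minimize(cost), cons).solve()
    return K @ x0 + c.value[:, 0]

print(lbmpc_step(np.array([0.5, 0.0])))
```

With a nonlinear oracle, (7) would make this a nonlinear program, but the constraints on the nominal model remain linear (cf. Remark 7 below).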

###### Remark 1.

The cost is a function of the states of the learned model, which uses the oracle to update the nominal model. The cost function may contain a terminal cost, an offset cost, a stage cost, etc. An interesting feature of LBMPC is that its stability and robustness properties do not depend on the actual terms within the cost function; this is one of the reasons that we state that LBMPC decouples safety (i.e., stability and robustness) from performance (i.e., having the cost be a function of the learned model).

###### Remark 2.

The constraints in (8) are taken from [21] and are robustly imposed on the nominal linear model (2), taking into account the prior bound $\mathcal{W}$ on the unmodeled dynamics of the nominal model. The reason that the constraints are not relaxed to exploit the refined estimates of the oracle (as in [28, 2]) is that this provides robustness to the situation in which the learned model is not a good representation of the true dynamics. It is known that the performance of a learning-based controller can be arbitrarily bad if the learned model does not exactly match the true model [15]; imposing the constraints on the nominal model, instead of the learned model, protects against this situation.

###### Remark 3.

There is another, more subtle reason for maintaining two models. Suppose that the oracle is bounded by a polytope $\mathcal{P}$, meaning $\mathcal{O}_n(x, u) \in \mathcal{P}$; then the worst-case error between the true model (1) and the learned model (7) lies within the polytope $\mathcal{W} \oplus (-\mathcal{P})$, which is strictly larger than $\mathcal{W}$ whenever $\mathcal{P} \neq \{0\}$. Intuitively, this means that if we were to use the worst-case bounds of the learned model in the constraints, then the constraints would be tightened by the larger amount $\mathcal{W} \oplus (-\mathcal{P})$; this is in contrast to using the nominal model, in which case the constraints are tightened by only $\mathcal{W}$.

Note that the value function $V_n$ (i.e., the value of the objective (5) at its minimum), the cost function $\psi_n$, and the oracle $\mathcal{O}_n$ can be time-varying because they are functions of $n$. It is important that the oracle be allowed to be time-varying, because it is updated using statistical methods as time advances and more data is gathered. This is discussed in more detail in the next section.

Let $M = (c, \theta)$ be a feasible point for the LBMPC scheme (5) with initial state $x_n$, and denote a minimizing point of (5) as $M_n^* = (c^*, \theta^*)$. The states and inputs predicted by the linear model (2) for a point $M$ are denoted $\bar{x}_{n+i}[M]$ and $\check{u}_{n+i}[M]$, for $i \in \{0, \dots, N\}$. In this notation, the control law is explicitly given by

$$u_n[M_n^*] = Kx_n + c_n[M_n^*]. \qquad (9)$$

This MPC scheme is endowed with robust feasibility and constraint satisfaction properties, which in turn imply stability of the closed-loop control provided by LBMPC. The equivalence between these properties and stability holds because of the compactness of the constraint sets $\mathcal{X}$ and $\mathcal{U}$.

###### Theorem 1.

If $\Omega$ has the properties defined in Sect. 3.1 and the LBMPC scheme (5) is feasible at state $x_n$, then applying the control (9) gives

1. Robust feasibility: there exists a feasible point for the LBMPC scheme (5) at the next state $x_{n+1}$;

2. Robust constraint satisfaction: $x_{n+1} \in \mathcal{X}$ and $u_n[M_n^*] \in \mathcal{U}$.

###### Proof.

The proof follows a similar line of reasoning as Lemma 7 of [21]. We begin by showing that the candidate point $\hat{M} = ((c_{n+1}[M_n^*], \dots, c_{n+N-1}[M_n^*], (\Psi - K\Lambda)\theta^*), \theta^*)$, which shifts the minimizer by one step and appends the terminal tracking controller, is feasible for (5) at the next state $x_{n+1}$; the results follow as consequences of this.

Let $d_n = g(x_n, u_n)$, and note that $d_n \in \mathcal{W}$ and $x_{n+1} = \bar{x}_{n+1}[M_n^*] + d_n$. Some algebra gives the predicted states for $\hat{M}$ as $\bar{x}_{n+j}[\hat{M}] = \bar{x}_{n+j}[M_n^*] + (A+BK)^{j-1}d_n$ and predicted inputs for $\hat{M}$ as $\check{u}_{n+j}[\hat{M}] = \check{u}_{n+j}[M_n^*] + K(A+BK)^{j-1}d_n$, for $j \in \{1, \dots, N\}$.

Because $M_n^*$ is feasible, this means by definition that $\bar{x}_{n+j}[M_n^*] \in \mathcal{X} \ominus R_j$ for $j \in \{1, \dots, N\}$. Combining terms gives $\bar{x}_{n+j}[\hat{M}] \in (\mathcal{X} \ominus R_j) \oplus (A+BK)^{j-1}\mathcal{W}$. Since $R_j = R_{j-1} \oplus (A+BK)^{j-1}\mathcal{W}$, it follows that $\bar{x}_{n+j}[\hat{M}] \in \mathcal{X} \ominus R_{j-1}$ for $j \in \{2, \dots, N\}$, which is what feasibility at state $x_{n+1}$ requires of these states. Similar reasoning gives that $\check{u}_{n+j}[\hat{M}] \in \mathcal{U} \ominus KR_{j-1}$ for $j \in \{1, \dots, N-1\}$.

The same argument gives $(\bar{x}_{n+N}[\hat{M}], \theta^*) \in (\Omega \ominus (R_N \times \{0\})) \oplus ((A+BK)^{N-1}\mathcal{W} \times \{0\}) \subseteq \Omega \ominus (R_{N-1} \times \{0\})$, because $R_N = R_{N-1} \oplus (A+BK)^{N-1}\mathcal{W}$. Now by construction of $\hat{M}$, it holds that $\check{u}_{n+N}[\hat{M}] = K\bar{x}_{n+N}[\hat{M}] + (\Psi - K\Lambda)\theta^*$. However, the constraint satisfaction property of (3) implies that $K\bar{x} + (\Psi - K\Lambda)\theta \in \mathcal{U}$ for all $(\bar{x}, \theta) \in \Omega$. Consequently, we have that $\check{u}_{n+N}[\hat{M}] \in \mathcal{U} \ominus KR_{N-1}$.

Next, observe that the control $\check{u}_{n+N}[\hat{M}]$ leads to $(\bar{x}_{n+N+1}[\hat{M}], \theta^*)$ being given by the linear map in (4) applied to $(\bar{x}_{n+N}[\hat{M}], \theta^*)$. As a result of the disturbance invariance property of (4), it must be that $(\bar{x}_{n+N+1}[\hat{M}], \theta^*) \in \Omega \ominus (R_N \times \{0\})$, which is the terminal constraint of the problem at state $x_{n+1}$. This completes the proof for part (a).

Similar arithmetic shows that the true next state is $x_{n+1} = \bar{x}_{n+1}[M_n^*] + d_n$, where $d_n \in \mathcal{W} = R_1$. Since $M_n^*$ is a feasible point, it holds that $\bar{x}_{n+1}[M_n^*] \in \mathcal{X} \ominus R_1$. This implies that $x_{n+1} \in \mathcal{X}$; this proves part (b). ∎

###### Corollary 1.

If $\Omega$ has the properties defined in Sect. 3.1 and the LBMPC scheme (5) is feasible at the initial state $x_0$, then the closed-loop system provided by LBMPC is (a) stable, (b) satisfies all state and input constraints, and (c) feasible, for all points of time $n \geq 0$.

###### Remark 4.

Robust feasibility and constraint satisfaction, as in Theorem 1, trivially imply this result.

###### Remark 5.

These results apply to the case where $\psi_n$ and $\mathcal{O}_n$ are time-varying; this allows, for example, changing the set point of the LBMPC using the approach in [45]. Moreover, the safety and stability that we have proved for the closed-loop system under LBMPC are actually robust results, because they imply that the states remain within bounded constraints even under disturbances, provided the modeling error in (2) follows the prescribed bound $\mathcal{W}$ and the invariant set $\Omega$ can be computed.

Next, we discuss additional types of robustness provided by LBMPC. First, we show that the value function of LBMPC (5) is continuous, and this property can be used for establishing certain other types of robustness of an MPC controller [30, 48, 58, 43].

###### Lemma 1.

Let $\phi(x_n)$ be the feasible region of the LBMPC (5) at state $x_n$. If $\psi_n$ and $\mathcal{O}_n$ are continuous, then the value function $V_n$ is continuous on the set of states for which $\phi(x_n)$ is nonempty.

###### Proof.

We define a cost function $\tilde{\psi}_n$ and a constraint function $\phi$ such that the LBMPC (5) can be rewritten as

$$\min_{c,\theta}\ \tilde{\psi}_n(\theta, x_n, c_n, \dots, c_{n+N-1}) \quad \text{s.t.}\ (c, \theta) \in \phi(x_n). \qquad (10)$$

The proof proceeds by showing that both the objective and constraint are continuous. Under such continuity, we get continuity of the value function by the Berge maximum theorem [16] (or equivalently by Theorem C.34 of [58]).

Because the constraints (6) and (8) in LBMPC are linear, the constraint function $\phi$ is continuous [30]. Continuity of $\tilde{\psi}_n$ follows by noting that it is the composition of continuous functions — specifically (5), (6), and (7) — and hence is itself continuous [61]. ∎

###### Remark 6.

This result is surprising because a non-convex (and hence nonlinear) MPC problem generally has a discontinuous value function (cf. [30]). LBMPC is non-convex when $\mathcal{O}_n$ is nonlinear (or when $\psi_n$ is non-convex), and the reason that we nonetheless have a continuous value function is that our active constraints are linear equality constraints or polytopes. In practice, this result requires being able to numerically compute a global minimum, and this can only be efficiently done for convex optimization problems.

###### Remark 7.

The proof of this result suggests another benefit of LBMPC: The fact that the constraints are linear means that suboptimal solutions can be computed by solving a linear (and hence convex) feasibility problem, even when the LBMPC problem is nonlinear. This enables more precise tradeoffs between computation and solution accuracy, as compared to conventional forms of nonlinear MPC.

Next, we prove that LBMPC is robust in the sense that its worst-case behavior is an increasing function of the modeling error. This type of robustness is formalized by the following definition.

###### Definition 1 (Grimm, et al. [30]).

A system is robustly asymptotically stable (RAS) about $x^*$ if there exists a type-$\mathcal{KL}$ function $\beta$ and, for each $\epsilon > 0$, there exists $\delta > 0$, such that for all disturbance sequences satisfying $\|d_n\| \leq \delta$ it holds that $x_n \in \mathcal{X}$ and $\|x_n - x^*\| \leq \beta(\|x_0 - x^*\|, n) + \epsilon$, for all $n \geq 0$.

###### Remark 8.

The intuition is that if a controller for the approximate system (2) with no disturbance converges to $x^*$, then the same controller applied to the approximate system (2) with bounded disturbance (note that this also includes the true system (1)) asymptotically remains within a bounded distance from $x^*$.

We can now prove when LBMPC is RAS. The key intuitive points are that linear MPC (i.e., LBMPC with an identically zero oracle: $\mathcal{O}_n \equiv 0$) needs to be provably convergent for the approximate model with no disturbance, and the oracle of LBMPC needs to be bounded.

###### Theorem 2.

Assume (a) $\Omega$ has the properties defined in Sect. 3.1; (b) the initial state $x_0$ is feasible for LBMPC (5); (c) the cost function $\psi$ is time-invariant, continuous, and strictly convex; and (d) there exists a continuous Lyapunov function for the approximate system (2) with no disturbance, when using the control law of linear MPC (i.e., LBMPC with $\mathcal{O}_n \equiv 0$). Under these conditions, the control law of LBMPC is RAS with respect to the disturbance $d_n$ in (2), whenever the oracle is a continuous function satisfying $\|\mathcal{O}_n\| \leq \delta$. Note that this $\delta$ is the same one as from the definition of RAS.

###### Proof.

Let $\overline{M}_n^*$ be the minimizer for linear MPC, and note that it is unique because $\psi$ is assumed to be strictly convex. Similarly, let $M_n^*$ be a minimizer for LBMPC. Now consider the state-dependent disturbance

$$e_n = B(\check{u}_n[M_n^*] - \check{u}_n[\overline{M}_n^*]) + d_n, \qquad (11)$$

for the approximate system (2). By construction, it holds that $x_{n+1} = Ax_n + B\check{u}_n[\overline{M}_n^*] + e_n$.

Proposition 8 of [30] and Theorem 1 imply that given $\epsilon > 0$, there exists $\delta_e > 0$ such that for all $e_n$ satisfying $\|e_n\| \leq \delta_e$ it holds that $x_n \in \mathcal{X}$ and $\|x_n - x^*\| \leq \beta(\|x_0 - x^*\|, n) + \epsilon$, for all $n \geq 0$. What remains to be checked is whether there exists $\delta > 0$ such that $\|e_n\| \leq \delta_e$ for the $e_n$ defined in (11).

The same argument as used in Lemma 1, coupled with the strict convexity of the linear MPC cost, gives that the minimizer $M_n^*$ is continuous with respect to the oracle at $\mathcal{O}_n \equiv 0$. (Recall that the minimizer at this point is $\overline{M}_n^*$.) Because of this continuity, there exists $\delta > 0$ such that $\|e_n\| \leq \delta_e$, whenever the oracle lies in the set $\{\mathcal{O} : \|\mathcal{O}\| \leq \delta\}$. Taking this $\delta$ gives the result. ∎

###### Remark 9.

Condition (a) is satisfied if the set $\Omega$ can be computed; it cannot be computed in some situations, because it is possible for $\Omega$ to be empty (e.g., when $\mathcal{W}$ is too large). Conditions (b) and (c) are easy to check. As we will show in Sect. 3.2.1, certain systems have easy sufficient conditions for checking the Lyapunov condition in (d).

#### 3.2.1 Example: Tracking in Linearized Systems

Here, we show that the Lyapunov condition in Theorem 2 can be easily checked when the cost function is quadratic and the approximate model is linear with bounds on its uncertainty. Suppose we use the quadratic cost defined in [45]

$$\psi_n = \|\tilde{x}_{n+N} - \Lambda\theta\|_P^2 + \|\bar{x}_s - \Lambda\theta\|_T^2 + \sum_{i=0}^{N-1}\left(\|\tilde{x}_{n+i} - \Lambda\theta\|_Q^2 + \|\check{u}_{n+i} - \Psi\theta\|_R^2\right), \qquad (12)$$

where $P$, $T$, $Q$, and $R$ are positive definite matrices, to track the set point $\bar{x}_s$. Then, the Lyapunov condition required for Theorem 2 holds.

###### Proposition 1.

For linear MPC with cost (12) where $\bar{x}_s$ is kept fixed, if $A + BK$ is Schur stable and $P$ solves the discrete-time Lyapunov equation $(A+BK)'P(A+BK) - P = -(Q + K'RK)$, then there exists a continuous Lyapunov function for the equilibrium point $\bar{x}_s$ of the approximate model (2) with no disturbances.

###### Proof.

First note that, because we consider the linear MPC case, we have by definition that $\tilde{x}_{n+i} = \bar{x}_{n+i}$ for all $i$, since $\mathcal{O}_n \equiv 0$.

Results from converse Lyapunov theory [36] indicate that the result is true if the following two conditions hold. The first is local uniform stability, meaning that for every $\epsilon > 0$, there exists some $\delta > 0$ such that $\|\bar{x}_0 - \bar{x}_s\| \leq \delta$ implies $\|\bar{x}_n - \bar{x}_s\| \leq \epsilon$ for all $n \geq 0$. The second is that $\bar{x}_n \to \bar{x}_s$ for all feasible initial points $\bar{x}_0$.

The second condition was shown in Theorem 1 of [45], and so we only need to check the first condition. We begin by noting that, since $Q$ and $T$ are positive definite matrices, there exists a positive definite matrix $S$ such that $\psi_n \geq \|\bar{x}_n - \bar{x}_s\|_S^2$ for every feasible point. Minimizing both sides of the inequality subject to the linear MPC constraints yields $\overline{V}(\bar{x}_n) \geq \|\bar{x}_n - \bar{x}_s\|_S^2$, where $\overline{V}$ is the value function of the linear MPC optimization.

Because linear MPC is the special case of LBMPC in which $\mathcal{O}_n \equiv 0$, the result in Lemma 1 applies: the value function $\overline{V}$ is continuous. Furthermore, the proof of Theorem 1 of [45] shows that the value function is non-increasing (i.e., $\overline{V}(\bar{x}_{n+1}) \leq \overline{V}(\bar{x}_n)$), non-negative, and zero-valued only at the equilibrium point (i.e., $\overline{V}(\bar{x}_s) = 0$). Because of the continuity of the value function, given $\epsilon > 0$ there exists $\delta > 0$ such that $\overline{V}(\bar{x}_0) \leq \lambda_{\min}(S)\,\epsilon^2$ whenever $\|\bar{x}_0 - \bar{x}_s\| \leq \delta$. The local uniform stability condition holds by noting that $\lambda_{\min}(S)\|\bar{x}_n - \bar{x}_s\|^2 \leq \overline{V}(\bar{x}_n) \leq \overline{V}(\bar{x}_0) \leq \lambda_{\min}(S)\,\epsilon^2$, and this proves the result. ∎
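As a numerical illustration of the hypothesis of Proposition 1, the following sketch computes $P$ from the stated discrete-time Lyapunov equation and verifies it; the matrices here are hypothetical stand-ins.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-5.0, -6.0]])
Q, R = np.eye(2), np.eye(1)

Ak = A + B @ K
assert np.all(np.abs(np.linalg.eigvals(Ak)) < 1), "A + BK must be Schur stable"

# solve_discrete_lyapunov(a, q) solves a @ X @ a.conj().T - X + q = 0,
# so passing a = Ak.T yields Ak' P Ak - P = -(Q + K'RK).
P = solve_discrete_lyapunov(Ak.T, Q + K.T @ R @ K)
residual = Ak.T @ P @ Ak - P + (Q + K.T @ R @ K)
print(np.max(np.abs(residual)))   # ~1e-15, i.e., zero up to roundoff
```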

###### Remark 10.

The result does not immediately follow from [45], because the value function of the linear MPC is not a Lyapunov function in this situation. In particular, the value function is non-increasing, but it is not strictly decreasing.

## 4 The Oracle

In theoretical computer science, oracles are black boxes that take in inputs and give answers. An important class of arguments known as relativizing proofs utilize oracles in order to prove results in complexity theory and computability theory. These proofs proceed by endowing the oracle with certain generic properties and then studying the resulting consequences.

We have named the functions $\mathcal{O}_n$ oracles in reference to those in computer science. Our reason is that we proved robustness and stability properties of LBMPC by assuming only generic properties, such as continuity or boundedness, of the function $\mathcal{O}_n$. For the purpose of the theorems in the previous section, these functions are otherwise arbitrary and can even exhibit worst-case behavior.

Whereas the previous section considered the oracles as abstract objects, here we discuss and study specific forms that the oracle can take. In particular, we can design $\mathcal{O}_n$ to be a statistical tool that identifies better system models. This leads to two natural questions: First, what are examples of statistical methods that can be used to construct an oracle for LBMPC? Second, when does the control law of LBMPC converge to the control law of MPC that knows the true model?

This section begins by defining two general classes of statistical tools that can be used to design the oracle $\mathcal{O}_n$. For concreteness, we provide a few examples of methods that belong to these two classes. The section concludes by addressing the second question above. Because our control law is the minimizer of an optimization problem, the key technical issue that we discuss is sufficient conditions that ensure convergence of the minimizers of a sequence of optimization problems to the minimizer of a limiting optimization problem.

### 4.1 Parametric Oracles

A parametric oracle is a continuous function $\mathcal{O}_n(x, u) = \chi(x, u; \hat{\lambda}_n)$ that is parameterized by a vector of coefficients $\hat{\lambda}_n \in \mathcal{T}$, where $\mathcal{T}$ is a set. This class of learning is often used in adaptive control [64, 6]. In the most general case, the function $\chi$ is nonlinear in all its arguments, and it is customary to use a least-squares cost function with input and trajectory data to estimate the parameters

$$\hat{\lambda}_n = \arg\min_{\lambda \in \mathcal{T}} \sum_{j=0}^{n} \|Y_j - \chi(x_j, u_j; \lambda)\|^2, \qquad (13)$$

where $Y_j = x_{j+1} - Ax_j - Bu_j$. This can be difficult to compute in real time because it is generally a nonlinear optimization problem.
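A minimal sketch of solving (13) numerically with an off-the-shelf solver; the model `chi`, the data, and the initial guess are hypothetical stand-ins rather than any particular system:

```python
import numpy as np
from scipy.optimize import least_squares

def chi(x, u, lam):
    """A hypothetical parametric model, nonlinear in the coefficients lam."""
    return lam[0] * np.tanh(lam[1] * x) + lam[2] * u

def fit_oracle(X, U, Y, lam0):
    """Nonlinear least squares (13): minimize the sum of squared residuals."""
    resid = lambda lam: np.ravel(Y - chi(X, U, lam))
    return least_squares(resid, lam0).x

rng = np.random.default_rng(0)
X, U = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
Y = 0.5 * np.tanh(2.0 * X) - 0.3 * U + 0.01 * rng.standard_normal(200)
lam_hat = fit_oracle(X, U, Y, lam0=np.ones(3))   # roughly [0.5, 2.0, -0.3]
print(lam_hat)
```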

###### Example 1.

It is common in biochemical networks to have nonlinear terms in the dynamics such as

$$\mathcal{O}_n(x, u) = \lambda_{n,1}\left(\frac{x_1^{\lambda_{n,2}}}{x_1^{\lambda_{n,2}} + \lambda_{n,3}}\right)\left(\frac{\lambda_{n,4}\, u_1^{\lambda_{n,5}}}{1 + \lambda_{n,4}\, u_1^{\lambda_{n,5}}}\right), \qquad (14)$$

where $\lambda_{n,1}, \dots, \lambda_{n,5}$ are the unknown coefficients in this example. Such terms are often called Hill equation type reactions [11].

An important subclass of parametric oracles consists of those that are linear in the coefficients: $\chi(x, u; \lambda) = \sum_{i=1}^{p} \lambda_i \chi_i(x, u)$, where the $\chi_i$ for $i \in \{1, \dots, p\}$ are a set of (possibly nonlinear) functions. The reason for the importance of this subclass is that the least-squares procedure (13) is convex in this situation, even when the functions $\chi_i$ are nonlinear. This greatly simplifies the computation required to solve the least-squares problem (13) that gives the unknown coefficients $\hat{\lambda}_n$.

###### Example 2.

One special case of linear parametric oracles occurs when the $\chi_i$ are linear functions. Here, the oracle can be written as $\mathcal{O}_n(x, u) = F_a x + F_b u$, where $F_a$ and $F_b$ are matrices whose entries are the unknown parameters. The intuition is that this oracle allows for corrections to the values in the matrices of the nominal model; it was used in conjunction with LBMPC on a quadrotor helicopter testbed [9, 20], in which LBMPC enabled high-performance flight.
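A minimal sketch of this convex case, fitting $F_a$ and $F_b$ by ordinary least squares on one-step residuals; `X` and `U` are hypothetical logged trajectories, and `A`, `B` the nominal model:

```python
import numpy as np

def fit_linear_oracle(X, U, A, B):
    """Fit O(x, u) = Fa x + Fb u to residuals x_{j+1} - A x_j - B u_j.

    X has shape (T, n), U has shape (T, m); returns (Fa, Fb).
    """
    Y = X[1:] - X[:-1] @ A.T - U[:-1] @ B.T    # targets, shape (T-1, n)
    Z = np.hstack([X[:-1], U[:-1]])            # regressors [x' u'], (T-1, n+m)
    F, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # Y ~ Z @ F, so F is (n+m, n)
    n = X.shape[1]
    return F[:n].T, F[n:].T                    # Fa is (n, n), Fb is (n, m)
```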

### 4.2 Nonparametric Oracles

Nonparametric regression refers to techniques that estimate a function $g(x, u)$ of input variables $(x, u)$, without making a priori assumptions about the mathematical form or structure of $g$. This class of techniques is interesting because it allows us to integrate non-traditional forms of adaptation and “learning” into LBMPC. And because LBMPC robustly maintains feasibility and constraint satisfaction as long as $\Omega$ can be computed, we can design or choose the nonparametric regression method without having to worry about stability properties. This is a specific instantiation of the separation between robustness and performance in LBMPC.

###### Example 3.

Neural networks are a classic example of a nonparametric method that has been used in adaptive control [55, 60, 3], and they can also be used with LBMPC. There are many particular forms of neural networks; one specific type is a feedforward neural network with a hidden layer of $k_n$ neurons, given by

$$\mathcal{O}_n(x, u) = c_0 + \sum_{i=1}^{k_n} c_i\, \sigma(a_i'[x'\ u']' + b_i), \qquad (15)$$

where $c_0$, $c_i$, $a_i$, and $b_i$ for all $i$ are coefficients, and $\sigma(\cdot)$ is a sigmoid function [31]. Note that this is considered a nonparametric method because it does not generally converge unless the number of neurons $k_n \to \infty$ as $n \to \infty$.
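A minimal sketch of evaluating (15) for a scalar-valued output; the weights here are random hypothetical values rather than trained ones:

```python
import numpy as np

def nn_oracle(x, u, a, b, c0, c):
    """One-hidden-layer feedforward network (15) with a logistic sigmoid."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid nonlinearity
    z = a @ np.concatenate([x, u]) + b           # hidden pre-activations
    return c0 + c @ sigma(z)

rng = np.random.default_rng(1)
k, n, m = 8, 2, 1                                # neurons, states, inputs
a = rng.standard_normal((k, n + m))              # input weights a_i
b, c = rng.standard_normal(k), rng.standard_normal(k)
print(nn_oracle(np.zeros(n), np.zeros(m), a, b, 0.0, c))
```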

Designing a nonparametric oracle for LBMPC is challenging because the tool should ideally be an estimator that is bounded, to ensure robustness of LBMPC, and differentiable, to allow for its use with numerical optimization algorithms. Local linear estimators [62, 8] are not guaranteed to be bounded, and their extensions that remain bounded are generally non-differentiable [27]. On the other hand, neural networks can be designed to remain bounded and differentiable, but they can have technical difficulties related to the estimation of their coefficients [72].

#### 4.2.1 Example: L2-Regularized Nadaraya-Watson Estimator

The Nadaraya-Watson (NW) estimator [54, 62], which can be intuitively thought of as the interpolation of non-uniformly sampled data points by a suitably normalized convolution kernel, is promising because it ensures boundedness. Our approach to designing a nonparametric estimator for LBMPC is to modify the NW estimator by adding regularization that deterministically ensures boundedness. Thus, it serves the same purpose as trimming [17]; but the benefit of our approach is that it also deterministically ensures differentiability of the estimator. To our knowledge, this modification of NW has not been previously considered in the literature.

Define $h_n$ and $\lambda_n$ to be two non-negative parameters; except when we wish to emphasize their temporal dependence, we will drop the subscript to match the convention of the statistics literature. Let $Z_i = (X_i'\ U_i')'$, $z = (x'\ u')'$, and $\Xi_i = \|Z_i - z\|/h$, where $X_i$ and $U_i$ are data and $x$ and $u$ are free variables. We define any function $\kappa : \mathbb{R} \to \mathbb{R}_{\geq 0}$ to be a kernel function if it has (a) finite support (i.e., $\kappa(t) = 0$ for $|t| \geq 1$), (b) even symmetry ($\kappa(t) = \kappa(-t)$), (c) positive values ($\kappa(t) > 0$ for $|t| < 1$), (d) differentiability (denoted $\kappa \in C^1$), and (e) nonincreasing values of $\kappa(t)$ over $t \geq 0$. The $L_2$-regularized NW (L2NW) estimator is defined as

$$\mathcal{O}_n(x, u) = \frac{\sum_i Y_i\, \kappa(\Xi_i)}{\lambda + \sum_i \kappa(\Xi_i)}, \qquad (16)$$

where $Y_i = x_{i+1} - AX_i - BU_i$, as in (13). If $\lambda = 0$, then (16) is simply the NW estimator. The $\lambda$ term acts to regularize the problem and ensures differentiability.
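A minimal sketch of evaluating (16), using the quartic (biweight) kernel, which satisfies conditions (a)–(e) above including differentiability; the stored data `Z` (rows $Z_i$) and `Y` (rows $Y_i$) are hypothetical:

```python
import numpy as np

def kappa(t):
    """Biweight kernel (1 - t^2)^2 on |t| < 1, else 0; even, C^1, decreasing."""
    return np.where(np.abs(t) < 1.0, (1.0 - t ** 2) ** 2, 0.0)

def l2nw(z, Z, Y, h=0.3, lam=1e-3):
    """Evaluate the L2NW estimator (16) at the query point z = (x', u')'."""
    w = kappa(np.linalg.norm(Z - z, axis=1) / h)   # weights kappa(Xi_i)
    return (w @ Y) / (lam + w.sum())               # regularized weighted mean

rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, (500, 3))       # 500 hypothetical samples of (x1, x2, u)
Y = 0.1 * np.sin(Z[:, :1])             # hypothetical model-error observations
print(l2nw(np.zeros(3), Z, Y))         # estimate near 0.1*sin(0) = 0
```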

There are two alternative characterizations of (16). The first is as the unique minimizer of the parametrized, strictly convex optimization problem $\min_\gamma L$, for

$$L(x, u, X_i, Y_i, \gamma) = \sum_i \kappa(\Xi_i)(Y_i - \gamma)^2 + \lambda\gamma^2. \qquad (17)$$

Viewed in this way, the term $\lambda\gamma^2$ represents a Tikhonov (or $L_2$) regularization [71, 32]. The second characterization is as a weighted mean of the points $Y_i$ and $0$, with weights $\kappa(\Xi_i)$ and $\lambda$, respectively; this characterization is useful for showing the second part of the following theorem about the deterministic properties of the L2NW estimator.

###### Theorem 3.

If $\lambda > 0$, $\kappa$ is a kernel function, and $Y_i \in \mathcal{W}$ for all $i$, where $\mathcal{W}$ contains zero; then (a) the L2NW estimator as defined in (16) is differentiable, and (b) $\mathcal{O}_n(x, u) \in \mathcal{W}$.

###### Proof.

To prove (a), note that the estimate is the value of $\gamma$ that solves $\partial L/\partial\gamma = 0$, where $L$ is from (17). Because $\partial^2 L/\partial\gamma^2 = 2(\lambda + \sum_i \kappa(\Xi_i)) > 0$, the hypothesis of the implicit function theorem is satisfied, and the result directly follows from the implicit function theorem.

Part (b) is shown by noting that the assumptions imply that the weights $\kappa(\Xi_i)$ and $\lambda$ are non-negative with a nonzero sum. If the weights of a weighted mean are non-negative and have a nonzero sum, then the weighted mean can be written as a convex combination of its points. This is our situation, and so the result follows from the weighted-mean characterization of (16). ∎

###### Remark 11.

This shows that L2NW is deterministically bounded and differentiable, which is needed for robustness and numerical optimization, respectively. We can compute the gradient of L2NW using standard calculus, and its $j$-th component is given by (18) below, for fixed data.
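Applying the quotient rule to (16) yields the following expression, which is a reconstruction consistent with the definitions above (writing $z = (x'\ u')'$ and $Z_i = (X_i'\ U_i')'$):

$$\frac{\partial \mathcal{O}_n}{\partial z_j} = \frac{\sum_i Y_i\, \kappa'(\Xi_i)\, \frac{\partial \Xi_i}{\partial z_j}}{\lambda + \sum_i \kappa(\Xi_i)} - \frac{\left(\sum_i Y_i\, \kappa(\Xi_i)\right)\left(\sum_i \kappa'(\Xi_i)\, \frac{\partial \Xi_i}{\partial z_j}\right)}{\left(\lambda + \sum_i \kappa(\Xi_i)\right)^2}, \qquad (18)$$

where $\partial \Xi_i/\partial z_j = (z_j - Z_{i,j})/(h^2\, \Xi_i)$ whenever $\Xi_i \neq 0$.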

There are a few notes regarding numerical computation of L2NW. First, picking the parameters $h$ and $\lambda$ in a data-driven manner [24, 67] is too slow for real-time implementation, and so we suggest rules of thumb: deterministic regularity is provided by Theorem 3 for any positive $\lambda$ (e.g., $\lambda = 10^{-3}$), and we conjecture choosing $h$ to decay at the rate at which random samples cover $\mathcal{X} \times \mathcal{U}$. Second, computational savings are possible through careful software coding, because if $h$ is small, then most terms in the summations of (17) and (18) will be zero because of the finite support of $\kappa$.

### 4.3 Stochastic Epi-convergence

It remains to be shown that if $\mathcal{O}_n$ stochastically converges to the true model $g$, then the control law of the LBMPC scheme will stochastically converge to that of an MPC that knows the true model. The main technical problem occurs because $\mathcal{O}_n$ is time-varying, and so the control law is given by the minimizer of an LBMPC optimization problem that is different at each point in time $n$. This presents a problem because pointwise convergence of $\mathcal{O}_n$ to $g$ is generally insufficient to prove convergence of the minimizers of a sequence of optimization problems to the minimizer of a limiting optimization problem [59, 73].

A related notion called epi-convergence is sufficient for showing convergence of the control law. Define the epigraph of a function $f$ to be the set of all points lying on or above the function, and denote it as $\operatorname{epi}(f)$. To prove convergence of the sequence of minimizers, we must show that the epigraph of the cost function (and constraints) of the sequence of optimizations converges in probability to the epigraph of the cost function (and constraints) of the limiting optimization problem. This notion is called epi-convergence, and we denote it as $f_n \xrightarrow{ep} f$.

For simplicity, we will assume in this section that the cost function is time-invariant (i.e., $\psi_n \equiv \psi$). It is enough to cite the relevant results for our purposes, but the interested reader can refer to [59, 73] for details.

###### Theorem 4 (Theorem 4.3 [73]).

Let $\tilde{\psi}_n$ and $\phi$ be as defined in Lemma 1, and define $\tilde{\psi}_0$ to be the composition of (5) with both (6) and the true dynamics (1). If $\tilde{\psi}_n \xrightarrow{ep} \tilde{\psi}_0$ in probability for all $x_n$, then the set of minimizers converges:

$$\arg\min\{\tilde{\psi}_n \mid (c, \theta) \in \phi(x_n)\} \xrightarrow{p} \arg\min\{\tilde{\psi}_0 \mid (c, \theta) \in \phi(x_n)\}. \qquad (19)$$
###### Remark 12.

The intuition is that if the cost function composed with the oracle $\mathcal{O}_n$ converges in the appropriate manner to the cost function composed with the true dynamics $g$, then we get convergence of the minimizers of LBMPC to those of the MPC with the true model, and hence the control law (9) converges. This theorem can be used to prove convergence of the LBMPC control law.

### 4.4 Epi-convergence for Parametric Oracles

Sufficient excitation (SE) is an important concept in system identification, and it intuitively means that the control inputs and state trajectory of the system are such that all modes of the system are activated. In general, it is hard to design a control scheme that ensures this a priori, which is a key aim of reinforcement learning [15]. However, LBMPC provides a framework in which SE may be designed. Because we have a nominal model, we can in principle design a reference trajectory that sufficiently explores the state-input space $\mathcal{X} \times \mathcal{U}$.

Though designing a controller that ensures SE can be difficult, checking a posteriori whether a system has SE is straightforward [46, 7, 8]. In this section, we assume SE and leave open the problem of how to design reference trajectories for LBMPC that guarantee SE. This is not problematic from the standpoint of stability and robustness, because LBMPC provides these properties, even without SE, whenever the conditions in Sect. 3 hold. We have convergence of the control law assuming SE, statistical regularity, and that the oracle can correctly model $g$. The proof of the following theorem can be found in [10].

###### Theorem 5.

Suppose there exists $\lambda_0 \in \mathcal{T}$ such that $g(x, u) = \chi(x, u; \lambda_0)$. If the system has SE [41, 34, 49], then the control law of the LBMPC with oracle (13) converges in probability to the control law of an MPC that knows the true model (i.e., an MPC with $\mathcal{O}_n \equiv g$).

### 4.5 Epi-convergence for Nonparametric Oracles

For a nonlinear system, SE is usually defined using ergodicity or mixing, but this is hard to verify in general. Instead, we define SE through a finite sample cover (FSC) of $\mathcal{X} \times \mathcal{U}$. Let $\mathcal{B}_\delta(z)$ be a ball centered at $z$ with radius $\delta$; then a FSC of $\mathcal{X} \times \mathcal{U}$ with radius $\delta$ is a set of data points $\{Z_i\}$ that satisfies $\mathcal{X} \times \mathcal{U} \subseteq \bigcup_i \mathcal{B}_\delta(Z_i)$. The intuition is that the samples cover $\mathcal{X} \times \mathcal{U}$ with average inter-sample distance less than $\delta$.

Our first result considers a generic nonparametric oracle with uniform pointwise convergence. Such uniform convergence implicitly implies SE in the form of a FSC with asymptotically decreasing radius [75], though we make this explicit in our statement of the result. A proof can be found in [10].

###### Theorem 6.

Let $\delta_n$ be some sequence such that $\delta_n \to 0$. If the data $\{Z_i\}_{i \leq n}$ form a FSC of $\mathcal{X} \times \mathcal{U}$ with radius $\delta_n$ and

$$\sup_{\mathcal{X} \times \mathcal{U}} \|\mathcal{O}_n(x, u) - g(x, u)\| = O_p(r_n), \qquad (20)$$

with $r_n \to 0$; then the control law of LBMPC with oracle $\mathcal{O}_n$ converges in probability to the control law of an MPC that knows the true model (i.e., an MPC with $\mathcal{O}_n \equiv g$).

###### Remark 13.

Our reason for presenting this result is that this theorem may be useful for proving convergence of the control law when using types of nonparametric regression that are more complex than L2NW. However, we stress that this is a sufficient condition, and so it may be possible for nonparametric tools that do not meet this condition to generate such stochastic convergence of the controller.

Assuming SE in the form of a FSC with asymptotically decreasing radius $\delta_n$, we can show that the control law of LBMPC that uses L2NW converges to that of an MPC that knows the true dynamics. Because the proofs [10] rely upon theory from probability and statistics, we simply summarize the main result.

###### Theorem 7.

Let $\delta_n$ be some sequence such that $\delta_n \to 0$. If the data form a FSC of $\mathcal{X} \times \mathcal{U}$ with radius $\delta_n$, the parameters satisfy $h_n, \lambda_n \to 0$ at suitable rates, and $g$ is Lipschitz continuous; then the control law of LBMPC with L2NW converges in probability to the control law of an MPC that knows the true model (i.e., an MPC with $\mathcal{O}_n \equiv g$).

## 5 Experimental and Numerical Results

In this section, we briefly discuss applications in which LBMPC has been experimentally applied to different testbeds. The section concludes with numerical simulations that display some of the features of LBMPC.

### 5.1 Energy-efficient Building Automation

We have implemented LBMPC on two testbeds that were built on the Berkeley campus for the purpose of studying energy-efficient control of heating, ventilation, and air-conditioning (HVAC) equipment. The first testbed [12], which is named the Berkeley Retrofitted and Inexpensive HVAC Testbed for Energy Efficiency (BRITE), is a single room that uses HVAC equipment commonly found in homes. LBMPC was able to generate up to 30% energy savings on warm days and up to 70% energy savings on cooler days, as compared to the existing control of the thermostat within the room. It achieved this by using semiparametric regression to estimate, using only temperature measurements from the thermostat, the heating load from exogenous sources like occupants, equipment, and solar heating. The LBMPC used this estimated heating load as its form of learning, and it adjusted the control action of the HVAC accordingly in order to achieve large energy savings.

The second testbed [13], which is named BRITE in Sutardja Dai Hall (BRITE-S), is a seven-floor office building that is used in multiple ways. The building has offices, classrooms, an auditorium, laboratory space, a kitchen, and a coffee shop with dining area. Using a variant of LBMPC for hybrid systems with controlled switches, we were able to achieve an average of 1.5 MWh of energy savings per day; for reference, eight days of such savings is enough to power an average American home for one year. Again, we used semiparametric regression to estimate, using only temperature measurements from the building, the heating load from exogenous sources like occupants, equipment, and solar heating. The LBMPC used this estimated heating load, along with additional estimates of unmodeled actuator dynamics, as its form of learning, in order to adjust its supervisory control action.

### 5.2 High Performance Quadrotor Helicopter Flight

We have also used LBMPC to achieve high-performance flight for semi-autonomous systems such as a quadrotor helicopter, which is a non-traditional helicopter with four propellers that enable improved steady-state stability properties [33]. In our experiments with LBMPC on this quadrotor testbed [9, 20], the learning was implemented using an extended Kalman filter (EKF) that provided corrections to the coefficients of the nominal model matrices. This makes it similar to LPV-MPC, which performs linear MPC using a successive series of linearizations of a nonlinear model; in our case, we used the learning provided by the EKF to, in effect, perform such linearizations.

Various experiments that we conducted showed that LBMPC improved performance and provided robustness. Amongst the experiments we performed were those that (a) showed improved step responses with lower amounts of overshoot and settling time as compared to linear MPC, and (b) displayed the ability of the LBMPC controller to overcome a phenomenon known as the ground effect that typically makes flight paths close to the ground difficult to perform. Furthermore, the LBMPC displayed robustness by preventing crashes into the ground during experiments in which the EKF was purposely made unstable in order to mis-learn. The improved performance and learning generalization possible with the type of adaptation and learning within LBMPC were demonstrated with an integrated experiment in which the quadrotor helicopter caught ping-pong balls that were thrown to it by a human.

### 5.3 Example: Moore-Greitzer Compressor Model

Here, we present a simulation of LBMPC on a nonlinear system for illustrative purposes. The compression system of a jet engine can exhibit two types of instability: rotating stall and surge [53, 25, 39]. Rotating stall is a rotating region of reduced air flow, and it degrades the performance of the engine. Surge is an oscillation of air flow that can damage the engine. Historically, these instabilities were prevented by operating the engine conservatively. But better performance is possible through active control schemes [25, 39].

The Moore-Greitzer model is an ODE model that describes the compressor and predicts surge instability

$$\dot{\Phi} = -\Psi + \Psi_c + 1 + \frac{3\Phi}{2} - \frac{\Phi^3}{2} \qquad (21)$$
$$\dot{\Psi} = \frac{1}{\beta^2}\left(\Phi + 1 - r\sqrt{\Psi}\right),$$

where $\Phi$ is mass flow, $\Psi$ is pressure rise, $\beta > 0$ is a constant, and $r$ is the throttle opening. We assume that the throttle opening is controlled by a second-order actuator with transfer function $\frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2}$, where $\zeta$ is the damping coefficient, $\omega_n$ is the resonant frequency, and the input $u$ is the commanded throttle opening.

We conducted simulations of this system with fixed values of the parameters $\beta$, $\Psi_c$, $\zeta$, and $\omega_n$. We chose polytopic constraints on the states $\Phi$ and $\Psi$, on the actuator states $r$ and $\dot{r}$, and on the input $u$. For the controller design, we took the approximate model, with state consisting of $(\Phi, \Psi)$ and the actuator states, to be the exact discretization (with fixed sampling time) of the linearization of (21), augmented with the actuator dynamics, about the chosen equilibrium; the control is the deviation of $u$ from its equilibrium value. The linearization and approximate model are unstable, and so we picked a nominal feedback matrix $K$ that stabilizes the system by placing the poles of the closed-loop system at locations chosen to be close to the poles of the open-loop system, while still being stable.
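A minimal sketch of these design steps (linearize (21), discretize exactly under zero-order hold, stabilize by pole placement); the equilibrium, sampling time, and pole locations are hypothetical placeholders, and the actuator states are omitted for brevity:

```python
import numpy as np
from scipy.signal import cont2discrete, place_poles

beta = 1.0                                   # hypothetical model constant
Phi_e, Psi_e = 0.5, 1.6                      # hypothetical equilibrium
r_e = (Phi_e + 1.0) / np.sqrt(Psi_e)         # throttle making Psi_dot = 0

# Jacobians of (21) with respect to (Phi, Psi) and to the throttle r
Ac = np.array([[1.5 - 1.5 * Phi_e ** 2, -1.0],
               [1.0 / beta ** 2, -r_e / (2 * beta ** 2 * np.sqrt(Psi_e))]])
Bc = np.array([[0.0], [-np.sqrt(Psi_e) / beta ** 2]])

# Exact (zero-order-hold) discretization, then stabilization by pole placement
Ad, Bd, *_ = cont2discrete((Ac, Bc, np.eye(2), np.zeros((2, 1))), dt=0.05)
K = -place_poles(Ad, Bd, [0.70, 0.75]).gain_matrix   # u = K x, A + B K Schur
print(np.abs(np.linalg.eigvals(Ad + Bd @ K)))        # both magnitudes < 1
```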

For the purpose of computing the invariant set $\Omega$, we used the algorithm in [37]. This algorithm uses the modeling error set $\mathcal{W}$ as one of its inputs. This set was chosen to be a hypercube that encompasses both a bound on the linearization error, derived using the Taylor remainder theorem applied to the true nonlinear model, and a small amount of subjectively chosen “safety margin” to provide protection against the effect of numerical errors.

We compared the performance of linear MPC, nonlinear MPC, and LBMPC with L2NW for regulating the system about the operating point, by conducting a simulation starting from an initial condition.