# An Online Learning Approach to Model Predictive Control

Nolan Wagener,¹ Ching-An Cheng,¹ Jacob Sacks,² and Byron Boots¹
Georgia Institute of Technology
{nolan.wagener, cacheng, jsacks}@gatech.edu, bboots@cc.gatech.edu

###### Abstract

Model predictive control (MPC) is a powerful technique for solving dynamic control tasks. In this paper, we show that there exists a close connection between MPC and online learning, an abstract theoretical framework for analyzing online decision making in the optimization literature. This new perspective provides a foundation for leveraging powerful online learning algorithms to design MPC algorithms. Specifically, we propose a new algorithm based on dynamic mirror descent (DMD), an online learning algorithm that is designed for non-stationary setups. Our algorithm, Dynamic Mirror Descent Model Predictive Control (DMD-MPC), represents a general family of MPC algorithms that includes many existing techniques as special instances. DMD-MPC also provides a fresh perspective on previous heuristics used in MPC and suggests a principled way to design new MPC algorithms. In the experimental section of this paper, we demonstrate the flexibility of DMD-MPC, presenting a set of new MPC algorithms on a simple simulated cartpole and a simulated and real-world aggressive driving task. A video of the real-world experiment can be found at https://youtu.be/vZST3v0_S9w.

## I Introduction

¹ Institute for Robotics and Intelligent Machines
² School of Electrical and Computer Engineering
Equal contribution

Model predictive control (MPC) [20] is an effective tool for control tasks involving dynamic environments, such as helicopter aerobatics [1] and aggressive driving [30]. One reason for its success is the pragmatic principle it adopts in choosing controls: rather than wasting computational power to optimize a complicated controller for the full-scale problem (which may be difficult to accurately model), MPC instead optimizes a simple controller (e.g., an open-loop control sequence) over a shorter planning horizon that is just sufficient to make a sensible decision at the current moment. By alternating between optimizing the simple controller and applying its corresponding control on the real system, MPC results in a closed-loop policy that can handle modeling errors and dynamic changes in the environment.

Various MPC algorithms have been proposed, using tools ranging from constrained optimization techniques [7, 20, 27] to sampling-based techniques [30]. In this paper, we show that, although these algorithms were originally designed differently, many of them follow the same general update rule when viewed through the lens of online learning [16]. Online learning is an abstract theoretical framework for analyzing online decision making. Formally, it concerns iterative interactions between a learner and an environment over $T$ rounds. At round $t$, the learner makes a decision $\tilde{\theta}_t$ from some decision set $\Theta$. The environment then chooses a loss function $\ell_t$ based on the learner’s decision, and the learner suffers a cost $\ell_t(\tilde{\theta}_t)$. In addition to seeing the decision’s cost, the learner may be given additional information about the loss function (e.g., its gradient evaluated at $\tilde{\theta}_t$) to aid in choosing the next decision $\tilde{\theta}_{t+1}$. The learner’s goal is to minimize the accumulated costs $\sum_{t=1}^{T} \ell_t(\tilde{\theta}_t)$, e.g., by minimizing regret [16].

We find that the MPC process bears a strong similarity to online learning. At time $t$ (i.e., round $t$), an MPC algorithm optimizes a controller (i.e., the decision) over some cost function (i.e., the per-round loss). To do so, it observes the cost of the initial controller (i.e., $\ell_t(\tilde{\theta}_t)$), improves the controller, and executes a control based on the improved controller in the environment to get to the next state (which in turn defines the next per-round loss) with a new controller $\tilde{\theta}_{t+1}$.

In view of this connection, we propose a generic framework, DMD-MPC (Dynamic Mirror Descent Model Predictive Control), for synthesizing MPC algorithms. DMD-MPC is based on a first-order online learning algorithm called dynamic mirror descent (DMD) [14], a generalization of mirror descent [4] for dynamic comparators. We show that several existing MPC algorithms [31, 32] are special cases of DMD-MPC, given specific choices of step sizes, loss functions, and regularization. Furthermore, we demonstrate how new MPC algorithms can be derived systematically from DMD-MPC with only mild assumptions on the regularity of the cost function. This allows us to even work with discontinuous cost functions (like indicators) and discrete controls. Thus, DMD-MPC offers a spectrum from which practitioners can easily customize new algorithms for their applications.

In the experiments, we apply DMD-MPC to design a range of MPC algorithms and study their empirical performance. Our results indicate the extra design flexibility offered by DMD-MPC does make a difference in practice; by properly selecting hyperparameters which are obscured in the previous approaches, we are able to improve the performance of existing algorithms. Finally, we apply DMD-MPC on a real-world AutoRally car platform [13] for autonomous driving tasks and show it can achieve competent performance.

Notation: As our discussions will involve planning horizons, for clarity, we use lightface to denote variables that are meant for a single time step, and boldface to denote the variables congregated across the MPC planning horizon. For example, we use $\hat{u}_t$ to denote the planned control at time $t$ and $\hat{\mathbf{u}}_t \triangleq (\hat{u}_t, \dots, \hat{u}_{t+H-1})$ to denote an $H$-step planned control sequence starting from time $t$. We use a subscript to extract elements from a congregated variable; e.g., we use $\hat{\mathbf{u}}_{t,h}$ to denote the $h$-th element in $\hat{\mathbf{u}}_t$ (the subscript index starts from zero). All the variables in this paper are finite-dimensional.

## II An Online Learning Perspective on MPC

### II-A The MPC Problem Setup

Let $n, m \in \mathbb{N}_+$ be finite. We consider the problem of controlling a discrete-time stochastic dynamical system

$$x_{t+1} \sim f(x_t, u_t) \tag{1}$$

for some stochastic transition map $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$. At time $t$, the system is in state $x_t \in \mathbb{R}^n$. Upon the execution of control $u_t \in \mathbb{R}^m$, the system randomly transitions to the next state $x_{t+1}$, and an instantaneous cost $c(x_t, u_t)$ is incurred. Our goal is to design a state-feedback control law (i.e., a rule of choosing $u_t$ based on $x_t$) such that the system exhibits good performance (e.g., accumulating low costs over $T$ time steps).

In this paper, we adopt the MPC approach to choosing $u_t$: at state $x_t$, we imagine controlling a stochastic dynamics model $\hat{f}$ (which approximates our system $f$) for $H$ time steps into the future. Our planned controls come from a control distribution $\pi_\theta$ that is parameterized by some vector $\theta \in \Theta$, where $\Theta$ is the feasible parameter set. In each simulation (i.e., rollout), we sample a control sequence $\hat{\mathbf{u}}_t$ from the control distribution $\pi_\theta$ (in either an open-loop or closed-loop fashion) and recursively apply it to $\hat{f}$ to generate a predicted state trajectory $\hat{\mathbf{x}}_t \triangleq (\hat{x}_t, \hat{x}_{t+1}, \dots, \hat{x}_{t+H})$: let $\hat{x}_t = x_t$; for $\tau = t, \dots, t+H-1$, we set $\hat{x}_{\tau+1} \sim \hat{f}(\hat{x}_\tau, \hat{u}_\tau)$. More compactly, we can write the simulation process as

$$\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t) \tag{2}$$

in terms of some $\hat{\mathbf{f}}$ that is defined naturally according to the above recursion. Through these simulations, we desire to select a parameter $\theta \in \Theta$ that minimizes an MPC objective $\hat{J}(\pi_\theta; x_t)$, which aims to predict the performance of the system if we were to apply the control distribution $\pi_\theta$ starting from $x_t$. ($\hat{J}$ can be seen as a surrogate for the long-term performance of our controller. Typically, we set the planning horizon $H$ to be much smaller than $T$ to reduce the optimization difficulty and to mitigate modeling errors.) In other words, we wish to find the $\theta_t$ that solves

$$\min_{\theta \in \Theta} \hat{J}(\pi_\theta; x_t). \tag{3}$$

Once $\theta_t$ is decided, we then sample $\hat{\mathbf{u}}_t$ from $\pi_{\theta_t}$, extract the first control $\hat{u}_t$, and apply it on the real dynamical system $f$ in (1) (i.e., set $u_t = \hat{u}_t$) to go to the next state $x_{t+1}$. (This setup can also optimize deterministic policies, e.g., by defining $\pi_\theta$ to be a Gaussian policy with the mean being the deterministic policy.) Because $\theta_t$ is determined based on $x_t$, MPC is effectively state-feedback.

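The motivation behind MPC is to use the MPC objective $\hat{J}$ to reason about the controls required to achieve desirable long-term behaviors. Consider the statistic

$$C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \triangleq \sum_{h=0}^{H-1} c(\hat{x}_{t+h}, \hat{u}_{t+h}) + c_{\mathrm{end}}(\hat{x}_{t+H}), \tag{4}$$

where $c_{\mathrm{end}}$ is a terminal cost function. A popular MPC objective is $\hat{J}(\pi_\theta; x_t) = \mathbb{E}[C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \mid x_t, \pi_\theta, \hat{f}]$, which estimates the expected $H$-step future costs. Later in Section III-A, we will discuss several MPC objectives and their properties.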
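As a rough illustration of the rollout in (2) and the statistic in (4), the sketch below simulates a one-step model for $H$ steps and sums the resulting costs. The function names `rollout` and `trajectory_cost`, the scalar state and control types, and the toy model in the usage note are assumptions made for this example only.

```python
from typing import Callable, List, Sequence

State = float    # scalar state, chosen only to keep the example small
Control = float  # scalar control, same reason


def rollout(x_t: State,
            controls: Sequence[Control],
            model: Callable[[State, Control], State]) -> List[State]:
    """Apply an H-step control sequence to a one-step model, as in (2).

    Returns the H + 1 predicted states, starting from x_t itself.
    """
    states = [x_t]
    for u in controls:
        # One (possibly random) model step per planned control.
        states.append(model(states[-1], u))
    return states


def trajectory_cost(states: Sequence[State],
                    controls: Sequence[Control],
                    step_cost: Callable[[State, Control], float],
                    end_cost: Callable[[State], float]) -> float:
    """Accumulate the step costs plus the terminal cost, as in (4)."""
    if len(states) != len(controls) + 1:
        raise ValueError("expected H + 1 states for H controls")
    running = sum(step_cost(x, u) for x, u in zip(states, controls))
    return running + end_cost(states[-1])
```

For instance, with the hypothetical deterministic model `lambda x, u: x + u`, calling `rollout(0.0, [1.0, 2.0], ...)` yields the three states of a two-step plan, which can then be passed to `trajectory_cost` together with any step and terminal cost. Averaging this quantity over sampled control sequences gives a Monte Carlo estimate of the expected-cost objective.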

Although the idea of MPC sounds intuitively promising, the optimization can only be approximated in practice (e.g., using an iterative algorithm like gradient descent), because (3) is often a stochastic program (like the example above) and the control command needs to be computed at a high frequency. In consideration of this imperfection, it is common to heuristically bootstrap the previous approximate solution as the initialization to the current problem. Specifically, let $\theta_{t-1}$ be the approximate solution to the previous problem and $\tilde{\theta}_t$ denote the initial condition of $\theta$ in solving (3). The bootstrapping step can then be written as

$$\tilde{\theta}_t = \Phi(\theta_{t-1}) \tag{5}$$

by effectively defining a shift operator $\Phi$ (see Appendix A for details). Because the subproblems in (3) of two consecutive time steps share all control variables except for the first and the last ones, shifting the previous solution provides a warm start to (3) to amortize the computational complexity.

### II-B The Online Learning Perspective

As discussed, the iterative update process of MPC resembles the setup of online learning [16]. Here we provide the details to convert an MPC setup into an online learning problem. Recall from the introduction that online learning mainly consists of three components: the decision set, the learner’s strategy for updating decisions, and the environment’s strategy for updating per-round losses. We show the counterparts in MPC that correspond to each component below. Note that in this section we will overload the notation $\hat{J}(\theta; x_t)$ to mean $\hat{J}(\pi_\theta; x_t)$.

We use the concept of per-round loss in online learning as a mechanism to measure the decision uncertainty in MPC, and propose the following identification (shown in Fig. 1) for the MPC setup described in the previous section: we set the rounds in online learning to synchronize with the time steps of our control system, set the decision set $\Theta$ as the space of feasible parameters of the control distribution $\pi_\theta$, set the learner as the MPC algorithm which in round $t$ outputs the decision $\tilde{\theta}_t \in \Theta$ and side information $u_{t-1}$, and set the per-round loss as

$$\ell_t(\cdot) = \hat{J}(\cdot\,; x_t). \tag{6}$$

In other words, in round $t$ of this online learning setup, the learner plays a decision $\tilde{\theta}_t$ along with side information $u_{t-1}$ (based on the optimized solution $\theta_{t-1}$ and the shift operator in (5)), the environment selects the per-round loss $\ell_t(\cdot) = \hat{J}(\cdot\,; x_t)$ (by applying $u_{t-1}$ to the real dynamical system in (1) to transit the state to $x_t$), and finally the learner receives $\ell_t$ and incurs cost $\ell_t(\tilde{\theta}_t)$ (which measures the sub-optimality of the future plan made by the MPC algorithm).

This online learning setup differs slightly from the standard setup in its separation of the decision $\tilde{\theta}_t$ and the side information $u_{t-1}$; while our setup can be converted into a standard one that treats $\theta_{t-1}$ as the sole decision played in round $t$, we adopt this explicit separation in order to emphasize that the variable part of the incurred cost $\ell_t(\tilde{\theta}_t)$ pertains only to $\tilde{\theta}_t$. That is, the learner cannot go back and revert the previous control $u_{t-1}$ already applied on the system, but only uses $\ell_t$ to update the current and future controls $u_t, \dots, u_{t+H-1}$.

The performance of the learner in online learning (which by our identification is the MPC algorithm) is measured in terms of the accumulated costs $\sum_{t=1}^{T} \ell_t(\tilde{\theta}_t)$. For problems in non-stationary setups, a normalized way to describe the accumulated costs in the online learning literature is through the concept of dynamic regret [14, 34], which is defined as

$$\text{D-Regret} = \sum_{t=1}^{T} \ell_t(\tilde{\theta}_t) - \sum_{t=1}^{T} \ell_t(\theta_t^\star), \tag{7}$$

where $\theta_t^\star \in \arg\min_{\theta \in \Theta} \ell_t(\theta)$. Dynamic regret quantifies how suboptimal the played decisions $\tilde{\theta}_1, \dots, \tilde{\theta}_T$ are on the corresponding loss functions. In our proposed problem setup, the optimality concept associated with dynamic regret conveys a consistency criterion desirable for MPC: we would like to make a decision $\theta_{t-1}$ at state $x_{t-1}$ such that, after applying control $u_{t-1}$ and entering the new state $x_t$, its shifted plan $\tilde{\theta}_t$ remains close to optimal with respect to the new loss function $\ell_t$. If the dynamics model $\hat{f}$ is accurate and the MPC algorithm is ideally solving (3), we can expect that bootstrapping the previous solution $\theta_{t-1}$ through (5) into $\tilde{\theta}_t$ would result in a small instantaneous gap $\ell_t(\tilde{\theta}_t) - \ell_t(\theta_t^\star)$ which is solely due to unpredictable future information (such as the stochasticity in the dynamical system). In other words, an online learning algorithm with small dynamic regret, if applied to our online learning setup, would produce a consistently optimal MPC algorithm with regard to the solution concept discussed above. However, we note that having small dynamic regret here does not directly imply good absolute performance on the control system, because the overall performance of the MPC algorithm is largely dependent on the form of the MPC objective $\hat{J}$ (e.g., through the choice of $H$ and the accuracy of $\hat{f}$). More precisely, small dynamic regret indicates that the plan produced by an MPC algorithm is consistent with the given MPC objective.
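To make the definition in (7) concrete, here is a small sketch that evaluates dynamic regret for scalar decisions. It assumes each per-round minimum can be approximated by searching a finite grid of candidates, which is a simplification introduced for this example; the function name `dynamic_regret` is likewise hypothetical.

```python
from typing import Callable, Sequence

Loss = Callable[[float], float]


def dynamic_regret(losses: Sequence[Loss],
                   played: Sequence[float],
                   candidates: Sequence[float]) -> float:
    """Dynamic regret (7) for scalar decisions.

    Assumption: the per-round optimum is approximated by the best point
    in the finite set `candidates` rather than by an exact minimizer.
    """
    if len(losses) != len(played):
        raise ValueError("need exactly one played decision per round")
    total = 0.0
    for loss, theta in zip(losses, played):
        best = min(loss(c) for c in candidates)  # approximates l_t(theta_t*)
        total += loss(theta) - best              # instantaneous gap in round t
    return total
```

Each term of the sum is the instantaneous gap discussed above, so a plan that stays near the per-round optimum after shifting contributes almost nothing to the total.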

## III A Family of MPC Algorithms Based on Dynamic Mirror Descent

The online learning perspective on MPC suggests that good MPC algorithms can be designed from online learning algorithms that achieve small dynamic regret. This is indeed the case. We will show that a range of existing MPC algorithms are in essence applications of a classical online learning algorithm called dynamic mirror descent (DMD) [14]. DMD is a generalization of mirror descent [4] to problems involving dynamic comparators (in this case, the $\theta_t^\star$ in the dynamic regret in (7)). In round $t$, DMD applies the following update rule:

$$\theta_t = \arg\min_{\theta \in \Theta} \, \langle \gamma_t g_t, \theta \rangle + D_\psi(\theta \,\|\, \tilde{\theta}_t), \qquad \tilde{\theta}_{t+1} = \Phi(\theta_t), \tag{8}$$

where $g_t = \nabla \ell_t(\tilde{\theta}_t)$ (which can be replaced by unbiased sampling if $\nabla \ell_t(\tilde{\theta}_t)$ is an expectation), $\Phi$ is called the shift model (in [14], $\Phi$ is called a dynamical model, but it is not the same as the dynamics of our control system; we therefore rename it to avoid confusion), $\gamma_t > 0$ is the step size, and, for $\theta, \theta' \in \Theta$, $D_\psi(\theta \,\|\, \theta') \triangleq \psi(\theta) - \psi(\theta') - \langle \nabla \psi(\theta'), \theta - \theta' \rangle$ is the Bregman divergence generated by a strictly convex function $\psi$ on $\Theta$.

The first step of DMD in (8) is reminiscent of the proximal update in the usual mirror descent algorithm. It can be thought of as an optimization step where the Bregman divergence acts as a regularization to keep $\theta$ close to $\tilde{\theta}_t$. Although $D_\psi(\theta \,\|\, \theta')$ is not necessarily a metric (since it may not be symmetric), it is still useful to view it as a distance between $\theta$ and $\theta'$. Indeed, familiar examples of the Bregman divergence include the squared Euclidean distance and the KL divergence [3]. (For probability distributions $p$ and $q$ over a random variable $x$, the KL divergence is defined as $\mathrm{KL}(p \,\|\, q) \triangleq \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$.)

The second step of DMD in (8) uses the shift model $\Phi$ to anticipate the optimal decision for the next round. In the context of MPC, a natural choice for the shift model is the shift operator in (5) defined previously in Section II-A (hence the same notation), because the per-round losses in two consecutive rounds here concern problems with shifted time indices. Hall and Willett [14] show that the dynamic regret of DMD scales with how much the optimal decision sequence $\{\theta_t^\star\}$ deviates from $\Phi$ (i.e., $\sum_t \|\theta_{t+1}^\star - \Phi(\theta_t^\star)\|$), which is proportional to the unpredictable elements of the problem.

Applying DMD in (8) to the online learning problem described in Section II-B leads to an MPC algorithm shown in Algorithm 1, which we call DMD-MPC. More precisely, DMD-MPC represents a family of MPC algorithms in which a specific instance is defined by a choice of:

1. the MPC objective $\hat{J}$ in (6),

2. the form of the control distribution $\pi_\theta$, and

3. the Bregman divergence $D_\psi$ in (8).

Thus, we can use DMD-MPC as a generic strategy for synthesizing MPC algorithms. In the following, we use this recipe to recreate several existing MPC algorithms and demonstrate new MPC algorithms that naturally arise from this framework.
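The overall loop can be sketched on a toy problem as follows. This is only an illustration under strong assumptions that are not the paper's general setting: the state and controls are scalars, the control distribution is deterministic (so $\theta$ is the control sequence itself), the Bregman divergence is $\tfrac{1}{2}\|\theta - \theta'\|^2$ with $\Theta$ unconstrained, and the loss `two_step_grad` differentiates is a hypothetical two-step quadratic.

```python
from typing import Callable, List, Sequence

Params = List[float]


def dmd_mpc(x0: float,
            theta0: Sequence[float],
            num_rounds: int,
            grad: Callable[[Params, float], Params],
            step_size: float,
            env_step: Callable[[float, float], float]) -> List[float]:
    """Run a toy DMD-MPC loop and return the visited states.

    Assumptions for this sketch: theta is the control sequence itself,
    the Bregman divergence is 0.5 * ||theta - theta'||^2, and Theta is
    unconstrained, so the first step of (8) is a plain gradient step.
    """
    x = x0
    theta_tilde = list(theta0)
    states = [x]
    for _ in range(num_rounds):
        g = grad(theta_tilde, x)                # g_t: gradient of l_t at theta_tilde_t
        theta = [th - step_size * gi            # first step of (8)
                 for th, gi in zip(theta_tilde, g)]
        x = env_step(x, theta[0])               # apply the first control to the real system
        theta_tilde = theta[1:] + [theta[-1]]   # shift model Phi: second step of (8)
        states.append(x)
    return states


def two_step_grad(theta: Params, x: float) -> Params:
    """Gradient of the hypothetical loss (x + th0)^2 + (x + th0 + th1)^2."""
    x1 = x + theta[0]
    x2 = x1 + theta[1]
    return [2.0 * x1 + 2.0 * x2, 2.0 * x2]
```

Running `dmd_mpc(1.0, [0.0, 0.0], 2, two_step_grad, 0.25, lambda x, u: x + u)` performs two rounds on the hypothetical system $x_{t+1} = x_t + u_t$. Swapping the gradient step for another proximal update, or the gradient for a sampled estimate, changes the instance of the family without changing the loop.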

### III-A Loss Functions

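We discuss several definitions of the per-round loss $\ell_t$, which all result from the formulation in (6) but with different $\hat{J}$. These loss functions are based on the statistic $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$ defined in (4), which measures the $H$-step accumulated cost of a given trajectory. For transparency of exposition, we will suppose henceforth that the control distribution $\pi_\theta$ is open-loop (note again that, even while using open-loop control distributions, the overall control law of MPC is state-feedback); similar derivations follow naturally for closed-loop control distributions. For the convenience of practitioners, we also provide expressions of their gradients in terms of the likelihood-ratio derivative [12]. (We assume the control distribution is sufficiently regular with respect to its parameter so that the likelihood-ratio derivative rule holds.) For some function $L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$, all these gradients shall have the form

$$\nabla \ell_t(\theta) = \mathbb{E}_{\hat{\mathbf{u}}_t \sim \pi_\theta} \mathbb{E}_{\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t)} \left[ L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \, \nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t) \right]. \tag{9}$$

In short, we will denote $\mathbb{E}_{\hat{\mathbf{u}}_t \sim \pi_\theta} \mathbb{E}_{\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t)}$ as $\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}}$. In practice, these gradients are approximated with finitely many samples.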

#### III-A1 Expected Cost

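The most commonly used MPC objective is the $H$-step expected accumulated cost function under model dynamics, because it directly estimates the expected long-term behavior when the dynamics model $\hat{f}$ is accurate and $H$ is large enough. Its per-round loss function and gradient are

$$\ell_t(\theta) = \mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \right] \tag{10}$$

$$\nabla \ell_t(\theta) = \mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \, \nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t) \right] \tag{11}$$

(In experiments, we subtract the empirical average of the sampled costs from $C$ in (11) to reduce the variance, at the cost of a small amount of bias.)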
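A sample-based estimate of the likelihood-ratio gradient of the expected cost can be sketched as below. To stay short, the example makes assumptions that go beyond the text: the horizon is a single step with no dynamics, the control distribution is a scalar Gaussian with fixed standard deviation, and only the mean is differentiated. The function name is hypothetical.

```python
import random
from typing import Callable


def lr_gradient_gaussian_mean(mean: float,
                              std: float,
                              cost: Callable[[float], float],
                              num_samples: int,
                              rng: random.Random) -> float:
    """Monte Carlo estimate of d/d(mean) E[cost(u)] for u ~ N(mean, std^2).

    Uses the score function of the Gaussian with respect to its mean:
    d/d(mean) log N(u; mean, std^2) = (u - mean) / std^2.
    """
    total = 0.0
    for _ in range(num_samples):
        u = rng.gauss(mean, std)                     # sample a control
        total += cost(u) * (u - mean) / (std * std)  # cost times score
    return total / num_samples
```

For the hypothetical cost $u^2$ the exact answer is $2 \cdot \text{mean}$, since $\mathbb{E}[u^2] = \text{mean}^2 + \text{std}^2$, which gives a quick sanity check. Subtracting a baseline such as the empirical mean cost from `cost(u)` is a common way to lower the variance of this estimator.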

#### III-A2 Expected Utility

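Instead of optimizing for average cost, we may care to optimize for some preference related to the trajectory cost $C$, such as having the cost be below some threshold. This idea can be formulated as a utility that returns a normalized score related to the preference for a given trajectory cost $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$. Specifically, suppose that $C$ is lower bounded by zero (if this is not the case, let $c_{\min} \triangleq \inf C(\cdot, \cdot)$, which we assume is finite; we can then replace $C$ with $C - c_{\min}$) and at some round $t$ define the utility $U_t: \mathbb{R}_+ \to [0, 1]$ (i.e., $U_t: C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \mapsto U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t))$) to be a function with the following properties: $U_t(0) = 1$, $U_t$ is monotonically decreasing, and $\lim_{z \to +\infty} U_t(z) = 0$. These are sensible properties since we attain maximum utility when we have zero cost, the utility never increases with the cost, and the utility approaches zero as the cost increases without bound. We then define the per-round loss and its gradient as

$$\ell_t(\theta) = -\log \mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right] \tag{12}$$

$$\nabla \ell_t(\theta) = -\frac{\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \, \nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right]} \tag{13}$$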

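The gradient in (13) is particularly appealing when estimated with samples. Suppose we sample $N$ control sequences $\hat{\mathbf{u}}_t^1, \dots, \hat{\mathbf{u}}_t^N$ from $\pi_\theta$ and (for the sake of compactness) sample one state trajectory from $\hat{\mathbf{f}}$ for each corresponding control sequence, resulting in $\hat{\mathbf{x}}_t^1, \dots, \hat{\mathbf{x}}_t^N$. Then the estimate of (13) is a convex combination of gradients:

$$\nabla \ell_t(\theta) \approx -\sum_{i=1}^{N} w_i \nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t^i),$$

where $w_i = \frac{U_t(C_i)}{\sum_{j=1}^{N} U_t(C_j)}$ and $C_i = C(\hat{\mathbf{x}}_t^i, \hat{\mathbf{u}}_t^i)$, for $i = 1, \dots, N$. We see that each weight is computed by considering the relative utility of its corresponding trajectory. A cost $C_i$ with high relative utility will push its corresponding weight $w_i$ closer to one, whereas a low relative utility will cause $w_i$ to be close to zero, effectively rejecting the corresponding sample.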
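The weights in this convex combination are simple to compute once the trajectory costs are available. The sketch below is one way to do it; the function name and the choice to raise an error when every sample has zero utility are assumptions of this example.

```python
from typing import Callable, List, Sequence


def utility_weights(costs: Sequence[float],
                    utility: Callable[[float], float]) -> List[float]:
    """Normalized utilities w_i = U(C_i) / sum_j U(C_j) for sampled costs."""
    scores = [utility(c) for c in costs]
    total = sum(scores)
    if total <= 0.0:
        # Every sample was rejected, so the convex combination is undefined.
        raise ValueError("all samples have zero utility")
    return [s / total for s in scores]
```

Any utility with the properties listed above can be passed in as `utility`; the weights are nonnegative and sum to one by construction.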

We give two examples of utilities and their related losses.

##### Probability of Low Cost

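For example, we may care about the system being below some cost threshold as often as possible. To encode this preference, we can use the threshold utility $U_t(C) \triangleq \mathbb{1}\{C \le C_{t,\max}\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function and $C_{t,\max}$ is a threshold parameter. Under this choice, the loss and its gradient become

$$\ell_t(\theta) = -\log \mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \right] \tag{14}$$

$$\nabla \ell_t(\theta) = -\frac{\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \, \nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \right]} \tag{15}$$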

As we can see, this loss function is the negative log-probability of achieving cost below some threshold. As a result (Fig. 1(a)), costs below $C_{t,\max}$ are treated the same in terms of the utility. This can potentially make optimization easier since we are trying to make good trajectories as likely as possible instead of finding the best trajectories as in (10).

However, if the threshold is set too low and the gradient is estimated with samples, the gradient estimate may have high variance due to the large number of rejected samples. Because of this, in practice, the threshold is set adaptively, e.g., as the largest cost of the top elite fraction of the sampled trajectories with smallest costs [6]. This allows the controller to make the best sampled trajectories more likely and therefore improve the controller.
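An adaptive threshold of this kind might be computed as in the following sketch. The function name, the rounding rule (ceiling of the elite count, with at least one elite sample), and the argument checks are assumptions made for illustration.

```python
import math
from typing import Sequence


def elite_threshold(costs: Sequence[float], elite_frac: float) -> float:
    """Largest cost among the `elite_frac` fraction of lowest-cost samples."""
    if not costs:
        raise ValueError("need at least one sampled cost")
    if not 0.0 < elite_frac <= 1.0:
        raise ValueError("elite_frac must lie in (0, 1]")
    # Keep at least one elite sample so that some trajectory is accepted.
    n_elite = max(1, math.ceil(elite_frac * len(costs)))
    return sorted(costs)[n_elite - 1]
```

Using the returned value as the threshold guarantees that the elite samples all receive utility one, so the normalizing sum in the weights never vanishes.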

##### Exponential Utility

We can also opt for a continuous surrogate of the indicator function, in this case the exponential utility $U_t(C) \triangleq \exp(-\frac{1}{\lambda} C)$, where $\lambda > 0$ is a scaling parameter. Unlike the indicator function, the exponential utility provides nonzero feedback for any given cost and allows us to discriminate between costs (i.e., if $C_1 > C_2$, then $U_t(C_1) < U_t(C_2)$), as shown in Fig. 1(b). Furthermore, $\lambda$ acts as a continuous alternative to $C_{t,\max}$ and dictates how quickly or slowly $U_t$ decays to zero, which in a soft way determines the cutoff point for rejecting given costs.

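Under this choice, the loss and its gradient become

$$\ell_t(\theta) = -\log \mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \right] \tag{16}$$

$$\nabla \ell_t(\theta) = -\frac{\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \right]} \tag{17}$$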

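The loss function in (16) is also known as the risk-seeking objective in optimal control [28]; this classical interpretation is based on a Taylor expansion of (16) showing

$$\lambda \ell_t(\theta) \approx \mathbb{E}_{\pi_\theta, \hat{\mathbf{f}}} \left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \right] - \frac{1}{2\lambda} \mathbb{V}_{\pi_\theta, \hat{\mathbf{f}}} \left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \right]$$

when $\lambda$ is large, where $\mathbb{V}_{\pi_\theta, \hat{\mathbf{f}}}[C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)]$ is the variance of $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$. Here we derive (16) from a different perspective that treats it as a continuous approximation of (14). The use of exponential transformations to approximate indicators is a common machine-learning trick (as in the Chernoff bound [8]).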

### III-B Algorithms

We instantiate DMD-MPC with different choices of loss function, control distribution, and Bregman divergence as concrete examples to showcase the flexibility of our framework. In particular, we are able to recover well-known MPC algorithms as special cases of Algorithm 1.

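Our discussions below are organized based on the class of Bregman divergences used in (8), and the following algorithms are derived assuming that the control distribution is a sequence of independent distributions. That is, we suppose $\pi_\theta$ is a probability density/mass function that factorizes as

$$\pi_\theta(\hat{\mathbf{u}}_t) = \prod_{h=0}^{H-1} \pi_{\theta_h}(\hat{\mathbf{u}}_{t,h}), \tag{18}$$

and $\theta = (\theta_0, \theta_1, \dots, \theta_{H-1})$ for some basic control distribution $\pi_{\theta_h}$ parameterized by $\theta_h \in \bar{\Theta}$, where $\bar{\Theta}$ denotes the feasible set for the basic control distribution. For control distributions in the form of (18), the shift operator $\Phi$ in (5) would set $\tilde{\theta}_t$ by identifying $\tilde{\theta}_{t,h} = \theta_{t-1,h+1}$ for $h = 0, \dots, H-2$, and initializing the final parameter as either $\tilde{\theta}_{t,H-1} = \tilde{\theta}_{t,H-2}$ or $\tilde{\theta}_{t,H-1} = \bar{\theta}$ for some default parameter $\bar{\theta}$.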
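The shift operator described above can be sketched for a plan with one scalar parameter per time step. The function name `shift` and the optional `default` argument are hypothetical; they simply expose the two ways of initializing the final parameter mentioned in the text.

```python
from typing import List, Optional, Sequence


def shift(theta_prev: Sequence[float],
          default: Optional[float] = None) -> List[float]:
    """Shift an H-step plan forward by one time step.

    Entries 0..H-2 of the result are entries 1..H-1 of `theta_prev`.
    The last entry repeats the previous last entry, or takes `default`
    when one is supplied.
    """
    if len(theta_prev) == 0:
        raise ValueError("plan must contain at least one element")
    shifted = list(theta_prev[1:])  # drop the step that was just executed
    tail = theta_prev[-1] if default is None else default
    shifted.append(tail)            # fill the newly exposed final step
    return shifted
```

Repeating the last entry tends to suit plans that vary slowly over the horizon, while a fixed default can be preferable when a neutral control (such as zero) is a sensible prior for the far end of the plan.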

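#### III-B1 Quadratic Divergence

We start with perhaps the most common Bregman divergence: the quadratic divergence. That is, we suppose the Bregman divergence in (8) has the quadratic form $D_\psi(\theta \,\|\, \theta') \triangleq \frac{1}{2} (\theta - \theta')^\top A (\theta - \theta')$ for some positive-definite matrix $A$ (this is generated by defining $\psi(\theta) \triangleq \frac{1}{2} \theta^\top A \theta$). Below we discuss different choices of $A$ and their corresponding update rules.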

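Projected gradient descent is the basic special case in which $A$ is the identity matrix. Equivalently, the update can be written as $\theta_t = \arg\min_{\theta \in \Theta} \|\theta - (\tilde{\theta}_t - \gamma_t g_t)\|^2$.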
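When the feasible set is a box, the projection in this update reduces to clipping each coordinate. The sketch below assumes exactly that (a common bound `lower`/`upper` on every coordinate); the box constraint and the function name are assumptions of this example, since the text leaves the feasible set general.

```python
from typing import List, Sequence


def projected_gradient_step(theta_tilde: Sequence[float],
                            grad: Sequence[float],
                            step_size: float,
                            lower: float,
                            upper: float) -> List[float]:
    """Gradient step followed by Euclidean projection onto [lower, upper]^d."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return [min(max(th - step_size * g, lower), upper)  # clip = box projection
            for th, g in zip(theta_tilde, grad)]
```

For feasible sets other than a box, the clipping line would be replaced by the corresponding Euclidean projection.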

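We can recover the natural gradient descent algorithm [2] by defining $A = F(\tilde{\theta}_t)$ where

$$F(\tilde{\theta}_t) = \mathbb{E}_{\pi_{\tilde{\theta}_t}} \left[ \nabla_{\tilde{\theta}_t} \log \pi_{\tilde{\theta}_t}(\hat{\mathbf{u}}_t) \, \nabla_{\tilde{\theta}_t} \log \pi_{\tilde{\theta}_t}(\hat{\mathbf{u}}_t)^\top \right]$$

is the Fisher information matrix. This rule uses the natural Riemannian metric of distributions to normalize the effects of different parameterizations of the same distribution [25].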

While the above two update rules are quite general, we can further specialize the Bregman divergence to achieve faster learning when the per-round loss function can be shown to be quadratic. This happens, for instance, when the MPC problem in (3) is an LQR or LEQR problem [11] (the dynamics model is linear, the step cost is quadratic, the per-round loss is (10), and the basic control distribution is a Dirac-delta distribution). That is, if

$$\ell_t(\theta) = \frac{1}{2} \theta^\top R_t \theta + r_t^\top \theta + \text{const.}$$

for some constant vector $r_t$ and positive-definite matrix $R_t$, we can set $A = R_t$ and $\gamma_t = 1$, making $\theta_t$ given by the first step of (8) correspond to the optimal solution to $\ell_t$ (i.e., the solution of LQR/LEQR). The particular values of $R_t$ and $r_t$ for each of LQR and LEQR are derived in Appendix D.
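The claim that one step suffices can be checked directly. The short derivation below assumes $\Theta$ is unconstrained (an assumption made here for simplicity) and uses the quadratic divergence with $A = R_t$ and step size $\gamma_t = 1$:

```latex
\begin{aligned}
g_t &= \nabla \ell_t(\tilde{\theta}_t) = R_t \tilde{\theta}_t + r_t, \\
\theta_t &= \arg\min_{\theta} \; \langle g_t, \theta \rangle
  + \tfrac{1}{2} (\theta - \tilde{\theta}_t)^\top R_t (\theta - \tilde{\theta}_t)
  \quad\Longrightarrow\quad g_t + R_t (\theta_t - \tilde{\theta}_t) = 0, \\
\theta_t &= \tilde{\theta}_t - R_t^{-1} (R_t \tilde{\theta}_t + r_t) = -R_t^{-1} r_t
  = \arg\min_{\theta} \; \ell_t(\theta).
\end{aligned}
```

The result does not depend on $\tilde{\theta}_t$, which is why a single update recovers the minimizer of the quadratic loss regardless of the warm start.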

#### III-B2 KL Divergence and the Exponential Family

We show that for control distributions in the exponential family [23], the Bregman divergence in (8) can be set to the KL divergence, which is a natural way to measure distances between distributions. Toward this end, we review the basics of the exponential family. We say a distribution $p_\eta$ with natural parameter $\eta$ of random variable $u$ belongs to the exponential family if its probability density/mass function satisfies $p_\eta(u) = \rho(u) \exp(\langle \eta, \phi(u) \rangle - A(\eta))$, where $\phi(u)$ is the sufficient statistics, $\rho(u)$ is the carrier measure, and $A(\eta) = \log \int \rho(u) \exp(\langle \eta, \phi(u) \rangle) \, \mathrm{d}u$ is the log-partition function. The distribution $p_\eta$ can also be described by its expectation parameter $\mu \triangleq \mathbb{E}_{p_\eta}[\phi(u)]$, and there is a duality between the two parameterizations: $\mu = \nabla A(\eta)$ and $\eta = \nabla A^*(\mu)$, where $A^*$ is the Legendre transformation of $A$. That is, $\nabla A^* = (\nabla A)^{-1}$. The duality results in the property below.

###### Fact 1.

[23] $\mathrm{KL}(p_\eta \,\|\, p_{\eta'}) = D_A(\eta' \,\|\, \eta) = D_{A^*}(\mu \,\|\, \mu')$.

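We can use Fact 1 to define the Bregman divergence in (8) to optimize a control distribution $\pi_\theta$ in the exponential family:

• if $\theta$ is an expectation parameter, we can set $D_\psi(\theta \,\|\, \tilde{\theta}_t) \triangleq \mathrm{KL}(\pi_\theta \,\|\, \pi_{\tilde{\theta}_t})$, or

• if $\theta$ is a natural parameter, we can set $D_\psi(\theta \,\|\, \tilde{\theta}_t) \triangleq \mathrm{KL}(\pi_{\tilde{\theta}_t} \,\|\, \pi_\theta)$.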

We demonstrate some examples using this idea below.

##### Expectation Parameters and Categorical Distributions

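We first discuss the case where $\theta$ is an expectation parameter and the first step in (8) is

$$\theta_t = \arg\min_{\theta \in \Theta} \, \langle \gamma_t g_t, \theta \rangle + \mathrm{KL}(\pi_\theta \,\|\, \pi_{\tilde{\theta}_t}). \tag{19}$$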

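To illustrate, we consider an MPC problem with a discrete control space and use the categorical distribution as the basic control distribution in (18), i.e., we set $\pi_{\theta_h} = \mathrm{Cat}(\theta_h)$, where $\theta_h \in \Delta^m$ is the probability of choosing each control among the $m$ discrete controls at the $h$-th predicted time step and $\Delta^m$ denotes the probability simplex in $\mathbb{R}^m$. This parameterization choice makes $\theta_h$ an expectation parameter of $\pi_{\theta_h}$ that corresponds to sufficient statistics given by indicator functions. With the structure of (9), where $L_t$ denotes the scalar factor multiplying the score function, the update direction is

$$g_{t,h} = \mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \, e_{\hat{\mathbf{u}}_{t,h}} \oslash \tilde{\theta}_{t,h} \right],$$

where $\tilde{\theta}_{t,h}$ and $g_{t,h}$ are the $h$-th elements of $\tilde{\theta}_t$ and $g_t$, respectively, $e_{\hat{\mathbf{u}}_{t,h}} \in \mathbb{R}^m$ has $0$ for each element except at index $\hat{\mathbf{u}}_{t,h}$ where it is $1$, and $\oslash$ denotes elementwise division. Update (19) then becomes the exponentiated gradient algorithm [16]: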

$$\theta_{t,h} = \frac{1}{Z_{t,h}} \tilde{\theta}_{t,h} \odot \exp(-\gamma_t g_{t,h}) \qquad (h = 0, 1, \dots, H-1) \tag{20}$$

where $\theta_{t,h}$ is the $h$-th element of $\theta_t$, $Z_{t,h}$ is the normalizer for $\theta_{t,h}$, and $\odot$ denotes elementwise multiplication. That is, instead of applying an additive gradient step to the parameters, the update in (19) exponentiates the gradient and performs elementwise multiplication. This does a better job of accounting for the geometry of the problem, and makes projection a simple operation of normalizing a distribution.
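For a single predicted time step, the exponentiated gradient update in (20) can be sketched as follows. The function name is hypothetical, and the gradient vector is assumed to be given (for instance, from a sampled estimate of the update direction).

```python
import math
from typing import List, Sequence


def exponentiated_gradient_step(probs: Sequence[float],
                                grad: Sequence[float],
                                step_size: float) -> List[float]:
    """One update of (20) for the categorical parameters of one time step."""
    if len(probs) != len(grad):
        raise ValueError("probs and grad must have the same length")
    # Multiplicative update: scale each probability by exp(-step * gradient).
    unnormalized = [p * math.exp(-step_size * g) for p, g in zip(probs, grad)]
    z = sum(unnormalized)  # normalizer Z_{t,h}
    return [v / z for v in unnormalized]
```

Because every factor is positive, the result stays in the interior of the simplex whenever the input does, so no separate projection step is needed.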

##### Natural Parameters and Gaussian Distributions

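Alternatively, we can set $\theta$ as a natural parameter and use

$$\theta_t = \arg\min_{\theta \in \Theta} \, \langle \gamma_t g_t, \theta \rangle + \mathrm{KL}(\pi_{\tilde{\theta}_t} \,\|\, \pi_\theta) \tag{21}$$

as the first step in (8). In particular, we show that, with (21), the structure of the likelihood-ratio derivative in (9) can be leveraged to design an efficient update. The main idea follows from the observation that when the gradient is computed through (9) and $\tilde{\theta}_t$ is the natural parameter, the score function is $\nabla_\theta \log \pi_\theta(\hat{\mathbf{u}}_t) = \phi(\hat{\mathbf{u}}_t) - \nabla A(\theta)$, so we can write

$$g_t = \mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \left( \phi(\hat{\mathbf{u}}_t) - \tilde{\mu}_t \right) \right], \tag{22}$$

where $\tilde{\mu}_t$ is the expectation parameter of $\tilde{\theta}_t$ and $\phi$ is the sufficient statistics of the control distribution. We combine the factorization in (22) with a property of the proximal update below (proven in Appendix C) to derive our algorithm.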

###### Proposition 1.

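Let $g_t$ be an update direction. Let $\mathcal{M}$ be the image of $\Theta$ under $\nabla A$. If $\tilde{\mu}_t - \gamma_t g_t \in \mathcal{M}$ and $\theta_t$ is given by (21), then the expectation parameter $\mu_t$ of $\theta_t$ satisfies $\mu_t = \tilde{\mu}_t - \gamma_t g_t$. (A similar proposition can be found for (19).)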

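We find that, under the assumption in Proposition 1 (if $\tilde{\mu}_t - \gamma_t g_t$ is not in $\mathcal{M}$, the update in (21) needs to perform a projection, the form of which is algorithm dependent), the update rule in (21) becomes $\mu_t = \tilde{\mu}_t - \gamma_t g_t$. For the utility-based losses in (12), substituting the gradient (13) into (22) gives

$$\mu_t = (1 - \gamma_t) \tilde{\mu}_t + \gamma_t \frac{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \, \phi(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right]}. \tag{23}$$

In other words, when $\gamma_t \in [0, 1]$, the update to the expectation parameter in (8) is simply a convex combination of the sufficient statistics and the previous expectation parameter $\tilde{\mu}_t$.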

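We provide a concrete example of an MPC algorithm that follows from (23). Let us consider a continuous control space and use the Gaussian distribution as the basic control distribution in (18), i.e., we set $\pi_{\theta_h} = \mathcal{N}(m_h, \Sigma_h)$ for some mean vector $m_h$ and covariance matrix $\Sigma_h$. For $\pi_{\theta_h}$, we can choose sufficient statistics $\phi(u) = (u, uu^\top)$, which results in the expectation parameter $\mu_h = (m_h, S_h)$ and the natural parameter $\eta_h = (\Sigma_h^{-1} m_h, -\frac{1}{2} \Sigma_h^{-1})$, where $S_h \triangleq \Sigma_h + m_h m_h^\top$. Let us set $\theta_h$ as the natural parameter. Then, for the utility-based losses in (12), (21) is equivalent to the update rule for $h = 0, \dots, H-1$:

$$m_{t,h} = (1 - \gamma_t) \tilde{m}_{t,h} + \gamma_t \frac{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \, \hat{\mathbf{u}}_{t,h} \right]}{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right]}, \qquad S_{t,h} = (1 - \gamma_t) \tilde{S}_{t,h} + \gamma_t \frac{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \, \hat{\mathbf{u}}_{t,h} \hat{\mathbf{u}}_{t,h}^\top \right]}{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right]}. \tag{24}$$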

Several existing algorithms are special cases of (24).

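• Cross-entropy method (CEM) [6]:
If $\ell_t$ is set to (14) and $\gamma_t = 1$, then (24) becomes

$$m_{t,h} = \frac{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \, \hat{\mathbf{u}}_{t,h} \right]}{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \right]}, \qquad S_{t,h} = \frac{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \, \hat{\mathbf{u}}_{t,h} \hat{\mathbf{u}}_{t,h}^\top \right]}{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \right]},$$

which resembles the update rule of the cross-entropy method for Gaussian distributions [6]. The only difference is that the second-order moment matrix $S_{t,h}$ is updated instead of the covariance matrix $\Sigma_{t,h}$.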

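• Model-predictive path integral (MPPI) [31]:
If we choose $\ell_t$ as the exponential utility, as in (16), and do not update the covariance, the update rule becomes

$$m_{t,h} = (1 - \gamma_t) \tilde{m}_{t,h} + \gamma_t \frac{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \hat{\mathbf{u}}_{t,h} \right]}{\mathbb{E}_{\pi_{\tilde{\theta}_t}, \hat{\mathbf{f}}} \left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \right]}, \tag{25}$$

which reduces to the MPPI update rule [31] for $\gamma_t = 1$. This connection is also noted in [24].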
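A sample-based version of the mean update with exponential utility might look like the sketch below. It assumes scalar controls at each time step and replaces the expectations by averages over sampled control sequences and their costs. The function name is hypothetical, and subtracting the minimum cost before exponentiating is a numerical-stability choice of this example; it leaves the normalized weights unchanged.

```python
import math
from typing import List, Sequence


def weighted_mean_update(mean_tilde: Sequence[float],
                         samples: Sequence[Sequence[float]],
                         costs: Sequence[float],
                         lam: float,
                         step_size: float) -> List[float]:
    """Sample-based mean update with exponential utility weights.

    `samples[i][h]` is the scalar control of sample i at step h, and
    `costs[i]` is the trajectory cost of sample i.
    """
    if len(samples) != len(costs) or len(costs) == 0:
        raise ValueError("need one cost per sampled control sequence")
    c_min = min(costs)  # shift costs so the largest weight is exp(0) = 1
    scores = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(scores)
    weights = [s / total for s in scores]
    new_mean = []
    for h, m_old in enumerate(mean_tilde):
        # Utility-weighted average of the sampled controls at step h.
        target = sum(w * u[h] for w, u in zip(weights, samples))
        # Convex combination of the shifted mean and the weighted average.
        new_mean.append((1.0 - step_size) * m_old + step_size * target)
    return new_mean
```

With `step_size = 1.0` the previous mean is discarded entirely, while smaller step sizes blend the new weighted average with the shifted plan.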

### III-C Extensions

In the previous sections, we discussed multiple instantiations of DMD-MPC, showing the flexibility of our framework. But they are by no means exhaustive. In Appendix B, we discuss variations of DMD-MPC, e.g., imposing constraints and different ways to approximate the expectation in (9).

## IV Related Work

Recent work on MPC has studied sampling-based approaches, which are flexible in that they do not require differentiability of a cost function. One such algorithm which can be used with general cost functions and dynamics is MPPI, which was proposed by Williams et al. [31] as a generalization of the control affine case [30]. The algorithm is derived by considering an optimal control distribution defined by the control problem. This optimal distribution is intractable to sample from, so the algorithm instead tries to bring a tractable distribution (in this case, Gaussian with fixed covariance) as close as possible in the sense of KL divergence. This ends up being the same as finding the mean of the optimal control distribution. The mean is then approximated as a weighted sum of sampled control trajectories, where the weight is determined by the exponentiated costs. Although this algorithm works well in practice (including a robust variant [33] achieving state-of-the-art performance in aggressive driving [10]), it is not clear that matching the mean of the distribution should guarantee good performance, such as in the case of a multimodal optimal distribution. By contrast, our update rule in (25) results from optimizing an exponential utility.

A closely related approach is the cross-entropy method (CEM) [6], which also assumes a Gaussian sampling distribution but minimizes the KL divergence between the Gaussian distribution and a uniform distribution over low cost samples. CEM has found applicability in reinforcement learning [19, 21, 26], motion planning [17, 18], and MPC [9, 32].

These sampling-based control algorithms can be considered special cases of general derivative-free optimization algorithms, such as covariance matrix adaptation evolutionary strategies (CMA-ES) [15] and natural evolutionary strategies (NES) [29]. CMA-ES samples points from a multivariate Gaussian, evaluates their fitness, and adapts the mean and covariance of the sampling distribution accordingly. On the other hand, NES optimizes the parameters of the sampling distribution to maximize some expected fitness through steepest ascent, where the direction is provided by the natural gradient. Akimoto et al. [2] showed that CMA-ES can also be interpreted as taking a natural gradient step on the parameters of the sampling distribution. As we showed in Section III-B, natural gradient descent is a special case of the DMD-MPC framework. A similar observation connecting MPPI and mirror descent was made by Okada and Taniguchi [24], but their derivation is limited to the KL divergence and the Gaussian case.

## V Experiments

We use experiments to validate the flexibility of DMD-MPC. We show that this framework can handle both continuous (Gaussian distribution) and discrete (categorical distribution) variations of control problems, and that MPC algorithms like MPPI and CEM can be generalized using different step sizes and control distributions to improve performance. Extra details and results are included in Appendices E and F.

### V-A Cartpole

We first consider the classic cartpole problem where we seek to swing a pole upright and keep it balanced only using actuation on the attached cart. We consider both the continuous and discrete control variants. For the continuous case, we choose the Gaussian distribution as the control distribution and keep the covariance fixed. For the discrete case, we choose the categorical distribution and use update (20). In either case, we have access to a biased stochastic model (uses a different pole length compared to the real cart).

We consider the interaction between the choice of loss, step size, and number of samples used to estimate (9), shown in Figs. 3 and 4. (Footnote 14: For our experiments, we vary the number of samples from and fix the number of samples from to ten. Furthermore, we use common random numbers when sampling from to reduce estimation variance.) For this environment, we can achieve low cost when optimizing the expected cost in (10) with a proper step size ( for both continuous and discrete problems) while being fairly robust to the number of samples. When using either of the utilities, the number of samples is more crucial in the continuous domain, with more samples allowing for larger step sizes. In the discrete domain (Fig. 2(b)), performance is largely unaffected by the number of samples when the step size is below , excluding the threshold utility with 1000 samples. In Fig. 3(a), for a large range of utility parameters, we see that using step sizes above (the step size set in MPPI and CEM) gives significant performance gains. In Fig. 3(b), there’s a more complicated interaction between the utility parameter and step size, with huge changes in cost when altering the utility parameter and keeping the step size fixed.

### V-B AutoRally

#### V-B1 Platform Description

We use the autonomous AutoRally platform [13] to run a high-speed driving task on a dirt track, where the goal is to achieve as low a lap time as possible. The robot (Fig. 5) is a 1:5 scale RC chassis capable of driving over () and has a desktop-class Intel Core i7 CPU and Nvidia GTX 1050 Ti GPU. For real-world experiments, we estimate the car’s pose using the particle filter from [10], which relies on a monocular camera, IMU, and GPS. In both simulated and real-world experiments, the dynamics model is a neural network that has been fitted to data collected from human demonstrations. We note that the dynamics model is deterministic, so we don’t need to estimate any expectations with respect to the dynamics.

#### V-B2 Simulated Experiments

We first use the Gazebo simulator (Fig. 9 in Section E-B) to perform a sweep of algorithm parameters, particularly the step size and number of samples, to evaluate how changing these parameters can affect the performance of DMD-MPC. For all of the experiments, the control distribution is a Gaussian with fixed covariance, and we use update (25) (i.e., the loss is the exponential utility (16)) with . The resulting lap times are shown in Fig. 6. (Footnote 15: The large error bar for 64 samples and step size of 0.8 is due to one particular lap where the car stalled at a turn for about 60 seconds.) We see that although using more samples does result in lower lap times, there are diminishing returns past 1920 samples per gradient. Indeed, with a proper step size, even as few as 192 samples can yield lap times within a couple of seconds of 3840 samples and a step size of 1. We also observe that the curves converge as the step size decreases further, implying that only a certain number of samples are needed for a given step size. This is a particularly important advantage of DMD-MPC over methods like MPPI: by changing the step size, DMD-MPC can perform much more effectively with fewer samples, making it a good choice for embedded systems which can’t produce many samples due to computational constraints.
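The role of the step size can be illustrated with a small sketch. It assumes that the mean update for a fixed-covariance Gaussian under the exponential utility is a convex combination of the previous mean and the exponentially weighted average of the sampled control sequences, so that a step size of 1 gives an MPPI-style update. The function name, the `temperature` parameter, and the toy cost are hypothetical and only serve the illustration.

```python
import numpy as np

def dmd_mpc_mean_update(mean, samples, costs, step_size, temperature):
    """Blend the old mean with the exponentially weighted sample average.

    mean:    (H, m) current mean control sequence
    samples: (N, H, m) control sequences sampled around the mean
    costs:   (N,) rollout cost of each sample
    """
    # Subtracting the minimum cost improves numerical stability and does
    # not change the normalized weights.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    weighted_avg = np.tensordot(weights, samples, axes=1)  # shape (H, m)
    return (1.0 - step_size) * mean + step_size * weighted_avg

rng = np.random.default_rng(0)
H, m, N = 5, 1, 64
mean = np.zeros((H, m))
samples = mean + 0.5 * rng.standard_normal((N, H, m))
# Toy cost (assumption): squared distance of each control from 1.
costs = np.sum((samples - 1.0) ** 2, axis=(1, 2))

full_step = dmd_mpc_mean_update(mean, samples, costs, step_size=1.0, temperature=1.0)
small_step = dmd_mpc_mean_update(mean, samples, costs, step_size=0.6, temperature=1.0)
```

With few samples the weighted average is noisy, and a step size below 1 damps that noise by keeping part of the previous mean, which is one way to read the trend in the lap-time sweep.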

#### V-B3 Real-World Experiments

In the real-world setting (Fig. 7), the control distribution is a Gaussian with fixed covariance, and we use update (25) with . We used the following experimental configurations: each of 1920 and 64 samples, and each of step sizes 1 (corresponding to MPPI), 0.8, and 0.6. Overall (Table I), there’s a mild degradation in performance when decreasing the step size at 1920 samples, due to the car taking a longer path on the track (Fig. 11(a) vs. Fig. 11(c) in Section F-B). Using just 64 samples surprisingly only increases the lap times by 2 seconds and seems unaffected by the step size. This could be because, despite the noisiness of the DMD-MPC update, the setpoint controller in the car’s steering servo acts as a filter, smoothing out the control signal and allowing the car to drive on a consistent path (Fig. 13 in Section F-B).

## VI Conclusion

We presented a connection between model predictive control and online learning. From this connection, we proposed an algorithm based on dynamic mirror descent that can work for a wide variety of settings and cost functions. We also discussed the choice of loss function within this online learning framework and the sort of preference each loss function imposes. From this general algorithm and assortment of loss functions, we showed that several well-known algorithms are special cases and presented a general update for members of the exponential family.

We empirically validated our algorithm on continuous and discrete simulated problems and on a real-world aggressive driving task. In the process, we also studied the parameter choices within the framework, finding, for example, that in our framework a smaller number of rollout samples can be compensated for by varying other parameters like the step size.

We hope that the online learning and stochastic optimization viewpoints of MPC presented in this paper open up new possibilities for using tools from these domains, such as alternative efficient sampling techniques [5] and accelerated optimization methods [22, 24], to derive new MPC algorithms that perform well in practice.

## Acknowledgements

This material is based upon work supported by NSF NRI award 1637758, NSF CAREER award 1750483, an NSF Graduate Research Fellowship under award No. 2015207631, and a National Defense Science & Engineering Graduate Fellowship. We thank Aravind Battaje and Hemanth Sarabu for assisting in AutoRally experiments.

## References

• Abbeel et al. [2010] Pieter Abbeel, Adam Coates, and Andrew Y. Ng. The International Journal of Robotics Research, 29(13):1608–1639, 2010.
• Akimoto et al. [2012] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Algorithmica, 64(4):698–716, 2012.
• Banerjee et al. [2005] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Journal of machine learning research, 6(Oct):1705–1749, 2005.
• Beck and Teboulle [2003] Amir Beck and Marc Teboulle. Operations Research Letters, 31(3):167–175, 2003.
• Bellman and Casti [1971] Richard Bellman and John Casti. Journal of Mathematical Analysis and Applications, 34(2):235–238, 1971.
• Botev et al. [2013] Zdravko I. Botev, Dirk P. Kroese, Reuven Y. Rubinstein, and Pierre L’Ecuyer. In Handbook of statistics, volume 31, pages 35–59. Elsevier, 2013.
• Camacho and Alba [2013] Eduardo F. Camacho and Carlos Bordons Alba. Springer Science & Business Media, 2013.
• Chernoff [1952] Herman Chernoff. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
• Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
• Drews et al. [2019] Paul Drews, Grady Williams, Brian Goldfain, Evangelos A. Theodorou, and James M. Rehg. IEEE Robotics and Automation Letters, 4(2):1564–1571, 2019.
• Duncan [2013] Tyrone E. Duncan. IEEE Transactions on Automatic Control, 58(11):2910–2911, 2013.
• Glynn [1990] Peter W. Glynn. Communications of the ACM, 33(10):75–84, 1990.
• Goldfain et al. [2019] Brian Goldfain, Paul Drews, Changxi You, Matthew Barulic, Orlin Velev, Panagiotis Tsiotras, and James M. Rehg. IEEE Control Systems Magazine, 39(1):26–55, 2019.
• Hall and Willett [2013] Eric Hall and Rebecca Willett. In International Conference on Machine Learning, pages 579–587, 2013.
• Hansen et al. [2003] Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Evolutionary Computation, 11(1):1–18, 2003.
• Hazan [2016] Elad Hazan. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
• Helvik and Wittner [2002] Bjarne E. Helvik and Otto Wittner. International Workshop on Mobile Agents for Telecommunication Applications, 2164, 2002.
• Kobilarov [2012] Marin Kobilarov. In Robotics: Science and Systems, volume 7, pages 153–160, 2012.
• Mannor et al. [2003] Shie Mannor, Reuven Y. Rubinstein, and Yohai Gat. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 512–519, 2003.
• Mayne [2014] David Q. Mayne. Automatica, 50(12):2967–2986, 2014.
• Menache et al. [2005] Ishai Menache, Shie Mannor, and Nahum Shimkin. Annals of Operations Research, 134:215–238, 2005.
• Miyashita et al. [2018] Megumi Miyashita, Shiro Yano, and Toshiyuki Kondo. Robotics and Autonomous Systems, 106:107–116, 2018.
• Nielsen and Garcia [2009] Frank Nielsen and Vincent Garcia. arXiv preprint arXiv:0911.4863, 2009.
• Okada and Taniguchi [2018] Masashi Okada and Tadahiro Taniguchi. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3013–3020. IEEE, 2018.
• Rattray et al. [1998] Magnus Rattray, David Saad, and Shun-ichi Amari. Physical review letters, 81(24):5461, 1998.
• Szita and Lörincz [2006] István Szita and András Lörincz. Neural computation, 18(12):2936–2941, 2006.
• Tassa et al. [2014] Yuval Tassa, Nicolas Mansard, and Emo Todorov. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1168–1175. IEEE, 2014.
• van den Broek et al. [2010] LJ van den Broek, WAJJ Wiegerinck, and HJ Kappen. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 1–8. AUAI Press, 2010.
• Wierstra et al. [2014] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. The Journal of Machine Learning Research, 15(1):949–980, 2014.
• Williams et al. [2016] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440. IEEE, 2016.
• Williams et al. [2017] Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M. Rehg, Byron Boots, and Evangelos A. Theodorou. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE, 2017.
• Williams et al. [2018a] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. IEEE Transactions on Robotics, 34(6):1603–1622, 2018a.
• Williams et al. [2018b] Grady Williams, Brian Goldfain, Paul Drews, Kamil Saigol, James M. Rehg, and Evangelos A. Theodorou. Robust sampling based model predictive control with sparse objective information. In Robotics Science and Systems, 2018b.
• Zhang et al. [2018] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. In Advances in Neural Information Processing Systems, pages 1323–1333, 2018.

## Appendix A Shift Operator

We discuss some details in defining the shift operator. Let be the approximate solution to the previous problem and denote the initial condition of in solving (3), and consider sampling and . We set

$$\tilde{\theta}_t = \Phi(\theta_{t-1})$$

by defining a shift operator that outputs a new parameter in . This can be chosen to satisfy desired properties, one example being that when conditioned on and , the marginal distributions of are the same for both of and of . A simple example of this property is shown in Fig. 8. Note that also involves a new control that is not in , so the choice of is not unique but algorithm dependent; for example, we can set of to follow the same distribution as (cf. Section III-B). Because the subproblems in (3) of two consecutive time steps share all control variables except for the first and the last ones, the “shifted” parameter to the current problem should be almost as good as the optimized parameter is to the previous problem. In other words, setting provides a warm start to (3) and amortizes the computational complexity of solving for .
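As an illustration, a shift operator for a distribution parameterized by a mean control sequence can be sketched as follows. The function name and the two rules for filling the new final control are assumptions made for this example; as noted above, the choice is algorithm dependent.

```python
import numpy as np

def shift_mean(mean_seq, fill="repeat_last"):
    """Shift an (H, m) mean control sequence forward by one time step.

    The first control (already applied to the system) is dropped and a new
    final control is appended. How the new entry is filled is a design
    choice; both rules below are illustrative assumptions.
    """
    shifted = np.empty_like(mean_seq)
    shifted[:-1] = mean_seq[1:]
    if fill == "repeat_last":
        # New final control follows the previous final control.
        shifted[-1] = mean_seq[-1]
    elif fill == "zero":
        shifted[-1] = 0.0
    else:
        raise ValueError(f"unknown fill rule: {fill}")
    return shifted

prev_mean = np.array([[0.1], [0.2], [0.3], [0.4]])
warm_start = shift_mean(prev_mean)
warm_start_zero = shift_mean(prev_mean, fill="zero")
```

The shifted sequence serves as the warm start for the next subproblem, since all but the first and last controls are shared between consecutive time steps.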

## Appendix B Variations of DMD-MPC

The control distributions in DMD-MPC can be fairly general (in addition to the categorical and Gaussian distributions that we discussed) and control constraints on the problem (e.g., control limits) can be directly incorporated through proper choices of control distributions, such as the beta distribution, or through mapping the unconstrained control through some squashing function (e.g., or clamp). Though our framework cannot directly handle state constraints as in constrained optimization approaches, a constraint can be relaxed to an indicator function which activates if the constraint is violated. The indicator function can then be added to the cost function in (4) with some weight that encodes how strictly the constraint should be enforced.
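A minimal sketch of these two ideas follows: squashing or clamping sampled controls into their limits, and relaxing a state constraint into a weighted indicator penalty. The use of `tanh` as the squashing function, the box-shaped state constraint, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def squash_controls(raw_controls, low, high):
    """Map unconstrained controls into (low, high) with a tanh squashing."""
    return low + 0.5 * (high - low) * (np.tanh(raw_controls) + 1.0)

def clamp_controls(raw_controls, low, high):
    """Clip unconstrained controls to the interval [low, high]."""
    return np.clip(raw_controls, low, high)

def penalized_cost(base_cost, states, state_limit, weight):
    """Add weight * (number of states violating |x| <= state_limit)."""
    violated = np.abs(states) > state_limit
    return base_cost + weight * np.sum(violated)

raw = np.array([-10.0, 0.0, 10.0])
squashed = squash_controls(raw, low=-1.0, high=2.0)
clamped = clamp_controls(raw, low=-1.0, high=2.0)
cost = penalized_cost(1.0, np.array([0.5, 2.0, -3.0]), state_limit=1.0, weight=10.0)
```

A larger `weight` enforces the constraint more strictly, at the price of a less smooth cost landscape for the sampler.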

Moreover, different integration techniques, such as Gaussian quadrature [5], can be adopted to replace the likelihood-ratio derivative in (9) for computing the required gradient direction. We also note that the independence assumption on the control distribution in (18) is not necessary in our framework; time-correlated control distributions and feedback policies are straightforward to consider in DMD-MPC.

## Appendix C Proofs

###### Proof of creftype 1.

We prove the first statement; the second one follows directly from the duality relationship. The statement follows from the derivations below; we can write

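$$
\begin{aligned}
\eta_{t+1} &= \operatorname*{arg\,min}_{\eta \in H} \; \langle \gamma_t g_t, \eta \rangle + D_A(\eta \,\|\, \eta_t) \\
&= \operatorname*{arg\,min}_{\eta \in H} \; \langle \gamma_t g_t, \eta \rangle + A(\eta) - \langle \nabla A(\eta_t), \eta \rangle \\
&= \operatorname*{arg\,min}_{\eta \in H} \; \langle \gamma_t g_t - \mu_t, \eta \rangle + A(\eta) \\
&= \operatorname*{arg\,max}_{\eta \in H} \; \langle \mu_t - \gamma_t g_t, \eta \rangle - A(\eta) \\
&= \nabla A^*(\mu_t - \gamma_t g_t)
\end{aligned}
$$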

where the last equality is due to the assumption that . Then applying on both sides and using the relationship that , we have . ∎
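The derivation can be checked numerically on a one-dimensional example. The sketch below assumes a Bernoulli distribution, whose log-partition function is A(η) = log(1 + e^η), and an unconstrained natural parameter; the specific numbers are arbitrary and chosen so that the updated expectation parameter stays inside (0, 1).

```python
import numpy as np

def log_partition(eta):       # A(eta) for a Bernoulli distribution
    return np.logaddexp(0.0, eta)

def grad_log_partition(eta):  # mu = grad A(eta) = sigmoid(eta)
    return 1.0 / (1.0 + np.exp(-eta))

def grad_conjugate(mu):       # grad A*(mu) = logit(mu)
    return np.log(mu) - np.log1p(-mu)

def bregman(eta, eta_t):      # D_A(eta || eta_t)
    return (log_partition(eta) - log_partition(eta_t)
            - grad_log_partition(eta_t) * (eta - eta_t))

eta_t, g_t, gamma_t = 0.3, 0.8, 0.25
mu_t = grad_log_partition(eta_t)

# Closed form from the derivation: eta_{t+1} = grad A*(mu_t - gamma_t * g_t).
eta_closed_form = grad_conjugate(mu_t - gamma_t * g_t)

# Brute-force minimization of the mirror descent objective on a fine grid.
grid = np.linspace(-5.0, 5.0, 200001)
objective = gamma_t * g_t * grid + bregman(grid, eta_t)
eta_grid = grid[np.argmin(objective)]

# Mapping back to the expectation parameter gives a plain gradient step.
mu_next = grad_log_partition(eta_closed_form)
```

Here `eta_grid` and `eta_closed_form` agree up to the grid resolution, and `mu_next` equals `mu_t - gamma_t * g_t`.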

## Appendix D Derivation of LQR and LEQR Losses

The dynamics in Equation (1) are given by

$$x_{t+1} = A x_t + B u_t + w_t$$

for some matrices and and , where . For a control sequence , noise sequence , and initial state , the resulting state sequence is found through convolution:

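$$
\begin{bmatrix} \hat{x}_t \\ \hat{x}_{t+1} \\ \hat{x}_{t+2} \\ \vdots \\ \hat{x}_{t+H} \end{bmatrix}
=
\begin{bmatrix} I \\ A \\ A^2 \\ \vdots \\ A^H \end{bmatrix} x_t
+
\begin{bmatrix} 0 & 0 & \cdots & 0 \\ B & 0 & \cdots & 0 \\ AB & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ A^{H-1}B & A^{H-2}B & \cdots & B \end{bmatrix}
\begin{bmatrix} \hat{u}_t \\ \hat{u}_{t+1} \\ \vdots \\ \hat{u}_{t+H-1} \end{bmatrix}
+
\begin{bmatrix} 0 & 0 & \cdots & 0 \\ I & 0 & \cdots & 0 \\ A & I & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ A^{H-1} & A^{H-2} & \cdots & I \end{bmatrix}
\begin{bmatrix} \hat{w}_t \\ \hat{w}_{t+1} \\ \vdots \\ \hat{w}_{t+H-1} \end{bmatrix},
$$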

or, in matrix form:

$$\hat{\mathbf{x}}_t = F x_t + G \hat{\mathbf{u}}_t + L \hat{\mathbf{w}}_t,$$

where , , and are defined naturally from the convolution equation above. Note that , where . Thus, we also have that

$$\hat{\mathbf{x}}_t \sim \mathcal{N}\!\left(F x_t + G \hat{\mathbf{u}}_t,\; L W L^\top\right).$$
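The stacked form can be verified against a step-by-step rollout. The following sketch builds the three block matrices for a small system and compares the two computations; the function name `stacked_dynamics` and the example system (a discretized double integrator) are assumptions made for illustration.

```python
import numpy as np

def stacked_dynamics(A, B, H):
    """Build F, G, L so that the stacked states equal F x_t + G u + L w."""
    n, m = B.shape
    powers = [np.linalg.matrix_power(A, k) for k in range(H + 1)]
    F = np.vstack(powers)                    # shape ((H+1)n, n)
    G = np.zeros(((H + 1) * n, H * m))
    L = np.zeros(((H + 1) * n, H * n))
    for i in range(1, H + 1):                # block row for state x_{t+i}
        for j in range(i):                   # block column for time t+j
            G[i * n:(i + 1) * n, j * m:(j + 1) * m] = powers[i - 1 - j] @ B
            L[i * n:(i + 1) * n, j * n:(j + 1) * n] = powers[i - 1 - j]
    return F, G, L

rng = np.random.default_rng(0)
n, m, H = 2, 1, 4
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
x0 = rng.standard_normal(n)
u = rng.standard_normal((H, m))
w = 0.01 * rng.standard_normal((H, n))

F, G, L = stacked_dynamics(A, B, H)
x_stacked = F @ x0 + G @ u.ravel() + L @ w.ravel()

# Step-by-step rollout of x_{k+1} = A x_k + B u_k + w_k for comparison.
x_rollout = [x0]
for k in range(H):
    x_rollout.append(A @ x_rollout[-1] + B @ u[k] + w[k])
x_rollout = np.concatenate(x_rollout)
```

Both `x_stacked` and `x_rollout` contain the same stacked state sequence, which is what makes the Gaussian form of the predicted trajectory immediate.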

We define the instantaneous and terminal costs as

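$$
\begin{aligned}
c(x, u) &= \tfrac{1}{2} x^\top Q x + \tfrac{1}{2} u^\top R u \\
c_{\mathrm{end}}(x) &= \tfrac{1}{2} x^\top Q_{\mathrm{end}} x,
\end{aligned}
$$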

where and . Thus, the statistic is

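$$C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) = \tfrac{1}{2} \hat{\mathbf{x}}_t^\top \mathbf{Q} \hat{\mathbf{x}}_t + \tfrac{1}{2} \hat{\mathbf{u}}_t^\top \mathbf{R} \hat{\mathbf{u}}_t,$$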

where and .

Our control distribution is a Dirac delta distribution located at the given parameter: .

### D-A LQR

The loss is defined as . Expanding this out gives:

 ℓt(θ)

We see this is a quadratic problem in by defining

$$
\begin{aligned}
R_t &= G^\top \mathbf{Q} G + \mathbf{R} \\
r_t &= G^\top \mathbf{Q} F x_t.
\end{aligned}
$$
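For reference, the expansion can be sketched as follows. This sketch assumes that the LQR loss is the expected statistic, that the Dirac control distribution sets the stacked control sequence equal to $\theta$, and that the stacked noise $\hat{\mathbf{w}}_t$ is zero-mean with covariance $W$; $\mathbf{Q}$ and $\mathbf{R}$ denote the stacked cost matrices appearing in the statistic $C$.

$$
\begin{aligned}
\ell_t(\theta) &= \mathbb{E}_{\hat{\mathbf{w}}_t}\!\left[ \tfrac{1}{2} (F x_t + G\theta + L\hat{\mathbf{w}}_t)^\top \mathbf{Q} \, (F x_t + G\theta + L\hat{\mathbf{w}}_t) \right] + \tfrac{1}{2} \theta^\top \mathbf{R} \, \theta \\
&= \tfrac{1}{2} (F x_t + G\theta)^\top \mathbf{Q} \, (F x_t + G\theta) + \tfrac{1}{2} \operatorname{tr}\!\left(\mathbf{Q} L W L^\top\right) + \tfrac{1}{2} \theta^\top \mathbf{R} \, \theta \\
&= \tfrac{1}{2} \theta^\top \left(G^\top \mathbf{Q} G + \mathbf{R}\right) \theta + \theta^\top G^\top \mathbf{Q} F x_t + \text{const}.
\end{aligned}
$$

The cross term in $\hat{\mathbf{w}}_t$ vanishes because the noise is zero-mean, and the trace term does not depend on $\theta$, so the noise only shifts the loss by a constant under these assumptions.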

### D-B LEQR

The loss is defined as

for some parameter . For compactness, we define and so that the exponent contains . In expanding the loss, we use the following fact:

###### Fact 2.

For , where , and constants and :

###### Proof.

We expand the expectation and complete the square:

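$$
\begin{aligned}
&= \frac{\sqrt{(2\pi)^n |\tilde{\Sigma}|}}{\sqrt{(2\pi)^n |\Sigma|}} \exp(c) \\
&= \frac{1}{\sqrt{|A + \Sigma^{-1}| \, |\Sigma|}} \exp(c)
\end{aligned}
$$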

where , , and . ∎

We now expand the loss:

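$$
\begin{aligned}
\ell_t(\theta) = -\log\Big\{ &\frac{1}{\sqrt{|Q' L W L^\top + I|}} \exp\Big( -\frac{1}{2} \Big[ (F x_t + G\theta)^\top (L W L^\top)^{-1} (F x_t + G\theta) \\
&- (F x_t + G\theta)^\top (L W L^\top Q' L W L^\top + L W L^\top)^{-1} (F x_t + G\theta)
\end{aligned}
$$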

We see this is a quadratic problem in