An Online Learning Approach to
Model Predictive Control
Abstract
Model predictive control (MPC) is a powerful technique for solving dynamic control tasks. In this paper, we show that there exists a close connection between MPC and online learning, an abstract theoretical framework for analyzing online decision making in the optimization literature. This new perspective provides a foundation for leveraging powerful online learning algorithms to design MPC algorithms. Specifically, we propose a new algorithm based on dynamic mirror descent (DMD), an online learning algorithm that is designed for non-stationary setups. Our algorithm, Dynamic Mirror Descent Model Predictive Control (DMD-MPC), represents a general family of MPC algorithms that includes many existing techniques as special instances. DMD-MPC also provides a fresh perspective on previous heuristics used in MPC and suggests a principled way to design new MPC algorithms. In the experimental section of this paper, we demonstrate the flexibility of DMD-MPC, presenting a set of new MPC algorithms on a simple simulated cartpole and a simulated and real-world aggressive driving task. A video of the real-world experiment can be found at https://youtu.be/vZST3v0_S9w.
I Introduction
Affiliations: (1) Institute for Robotics and Intelligent Machines; (2) School of Electrical and Computer Engineering.
(#) Equal contribution.
Model predictive control (MPC) [20] is an effective tool for control tasks involving dynamic environments, such as helicopter aerobatics [1] and aggressive driving [30]. One reason for its success is the pragmatic principle it adopts in choosing controls: rather than wasting computational power to optimize a complicated controller for the full-scale problem (which may be difficult to accurately model), MPC instead optimizes a simple controller (e.g., an open-loop control sequence) over a shorter planning horizon that is just sufficient to make a sensible decision at the current moment. By alternating between optimizing the simple controller and applying its corresponding control on the real system, MPC results in a closed-loop policy that can handle modeling errors and dynamic changes in the environment.
Various MPC algorithms have been proposed, using tools ranging from constrained optimization techniques [7, 20, 27] to sampling-based techniques [30]. In this paper, we show that, while these algorithms were originally designed differently, many of them actually follow the same general update rule when viewed through the lens of online learning [16]. Online learning is an abstract theoretical framework for analyzing online decision making. Formally, it concerns iterative interactions between a learner and an environment over $T$ rounds. At round $t$, the learner makes a decision $\tilde{\boldsymbol{\theta}}_t$ from some decision set $\boldsymbol{\Theta}$. The environment then chooses a loss function $\ell_t$ based on the learner's decision, and the learner suffers a cost $\ell_t(\tilde{\boldsymbol{\theta}}_t)$. In addition to seeing the decision's cost, the learner may be given additional information about the loss function (e.g., its gradient evaluated at $\tilde{\boldsymbol{\theta}}_t$) to aid in choosing the next decision $\tilde{\boldsymbol{\theta}}_{t+1}$. The learner's goal is to minimize the accumulated costs $\sum_{t=1}^{T} \ell_t(\tilde{\boldsymbol{\theta}}_t)$, e.g., by minimizing regret [16].
We find that the MPC process bears a strong similarity with online learning. At time $t$ (i.e., round $t$), an MPC algorithm optimizes a controller $\tilde{\boldsymbol{\theta}}_t$ (i.e., the decision) over some cost function $\ell_t$ (i.e., the per-round loss). To do so, it observes the cost of the initial controller (i.e., $\ell_t(\tilde{\boldsymbol{\theta}}_t)$), improves the controller, and executes a control based on the improved controller in the environment to get to the next state (which in turn defines the next per-round loss) with a new controller $\tilde{\boldsymbol{\theta}}_{t+1}$.
In view of this connection, we propose a generic framework, DMD-MPC (Dynamic Mirror Descent Model Predictive Control), for synthesizing MPC algorithms. DMD-MPC is based on a first-order online learning algorithm called dynamic mirror descent (DMD) [14], a generalization of mirror descent [4] for dynamic comparators. We show that several existing MPC algorithms [31, 32] are special cases of DMD-MPC, given specific choices of step sizes, loss functions, and regularization. Furthermore, we demonstrate how new MPC algorithms can be derived systematically from DMD-MPC with only mild assumptions on the regularity of the cost function. This allows us to even work with discontinuous cost functions (like indicators) and discrete controls. Thus, DMD-MPC offers a spectrum from which practitioners can easily customize new algorithms for their applications.
In the experiments, we apply DMD-MPC to design a range of MPC algorithms and study their empirical performance. Our results indicate the extra design flexibility offered by DMD-MPC does make a difference in practice; by properly selecting hyperparameters which are obscured in the previous approaches, we are able to improve the performance of existing algorithms. Finally, we apply DMD-MPC on a real-world AutoRally car platform [13] for autonomous driving tasks and show it can achieve competent performance.
Notation: As our discussions will involve planning horizons, for clarity, we use lightface to denote variables that are meant for a single time step, and boldface to denote the variables congregated across the MPC planning horizon. For example, we use $\hat{u}_t$ to denote the planned control at time $t$ and $\hat{\mathbf{u}}_t \triangleq (\hat{u}_t, \dots, \hat{u}_{t+H-1})$ to denote an $H$-step planned control sequence starting from time $t$. We use a subscript to extract elements from a congregated variable; e.g., we use $\hat{\mathbf{u}}_{t,h}$ to denote the $h$th element in $\hat{\mathbf{u}}_t$ (the subscript index starts from zero). All the variables in this paper are finite-dimensional.
II An Online Learning Perspective on MPC
II-A The MPC Problem Setup
Let $n, m \in \mathbb{N}_+$ be finite. We consider the problem of controlling a discrete-time stochastic dynamical system
$$ x_{t+1} \sim f(x_t, u_t) \tag{1} $$
for some stochastic transition map $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$. At time $t$, the system is in state $x_t \in \mathbb{R}^n$. Upon the execution of control $u_t \in \mathbb{R}^m$, the system randomly transitions to the next state $x_{t+1}$, and an instantaneous cost $c(x_t, u_t)$ is incurred. Our goal is to design a state-feedback control law (i.e., a rule of choosing $u_t$ based on $x_t$) such that the system exhibits good performance (e.g., accumulating low costs over $T$ time steps).
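In this paper, we adopt the MPC approach to choosing $u_t$: at state $x_t$, we imagine controlling a stochastic dynamics model $\hat{f}$ (which approximates our system $f$) for $H$ time steps into the future. Our planned controls come from a control distribution $\pi_{\boldsymbol{\theta}}$ that is parameterized by some vector $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, where $\boldsymbol{\Theta}$ is the feasible parameter set. In each simulation (i.e., rollout), we sample (Footnote 1: This can be sampled in either an open-loop or closed-loop fashion.) a control sequence $\hat{\mathbf{u}}_t$ from the control distribution $\pi_{\boldsymbol{\theta}}$ and recursively apply it to $\hat{f}$ to generate a predicted state trajectory $\hat{\mathbf{x}}_t \triangleq (\hat{x}_t, \hat{x}_{t+1}, \dots, \hat{x}_{t+H})$: let $\hat{x}_t = x_t$; for $\tau = t, \dots, t+H-1$, we set $\hat{x}_{\tau+1} \sim \hat{f}(\hat{x}_\tau, \hat{u}_\tau)$. More compactly, we can write the simulation process as
$$ \hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t) \tag{2} $$
in terms of some $\hat{\mathbf{f}}$ that is defined naturally according to the above recursion. Through these simulations, we desire to select a parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta}$ that minimizes an MPC objective $\hat{J}(\pi_{\boldsymbol{\theta}}; x_t)$, which aims to predict the performance of the system if we were to apply the control distribution $\pi_{\boldsymbol{\theta}}$ starting from $x_t$. (Footnote 2: $\hat{J}$ can be seen as a surrogate for the long-term performance of our controller. Typically, we set the planning horizon $H$ to be much smaller than $T$ to reduce the optimization difficulty and to mitigate modeling errors.) In other words, we wish to find the $\boldsymbol{\theta}_t$ that solves
$$ \boldsymbol{\theta}_t \in \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \hat{J}(\pi_{\boldsymbol{\theta}}; x_t) \tag{3} $$
Once $\boldsymbol{\theta}_t$ is decided, we then sample (Footnote 3: This setup can also optimize deterministic policies, e.g., by defining $\pi_{\boldsymbol{\theta}}$ to be a Gaussian policy with the mean being the deterministic policy.) $\hat{\mathbf{u}}_t$ from $\pi_{\boldsymbol{\theta}_t}$, extract the first control $\hat{u}_t$, and apply it on the real dynamical system $f$ in (1) (i.e., set $u_t = \hat{u}_t$) to go to the next state $x_{t+1}$. Because $\boldsymbol{\theta}_t$ is determined based on $x_t$, MPC is effectively state-feedback.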
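The motivation behind MPC is to use the MPC objective $\hat{J}$ to reason about the controls required to achieve desirable long-term behaviors. Consider the statistic
$$ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \triangleq \sum_{h=0}^{H-1} c(\hat{x}_{t+h}, \hat{u}_{t+h}) + c_{\mathrm{end}}(\hat{x}_{t+H}) \tag{4} $$
where $c_{\mathrm{end}}$ is a terminal cost function. A popular MPC objective is $\hat{J}(\pi_{\boldsymbol{\theta}}; x_t) = \mathbb{E}[C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \mid x_t, \pi_{\boldsymbol{\theta}}, \hat{f}]$, which estimates the expected $H$-step future costs. Later in Section III-A, we will discuss several MPC objectives and their properties.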
Although the idea of MPC sounds intuitively promising, the optimization can only be approximated in practice (e.g., using an iterative algorithm like gradient descent), because (3) is often a stochastic program (like the example above) and the control command needs to be computed at a high frequency. In consideration of this imperfection, it is common to heuristically bootstrap the previous approximate solution as the initialization to the current problem. Specifically, let $\boldsymbol{\theta}_{t-1}$ be the approximate solution to the previous problem and $\tilde{\boldsymbol{\theta}}_t$ denote the initial condition of $\boldsymbol{\theta}$ in solving (3). The bootstrapping step can then be written as
$$ \tilde{\boldsymbol{\theta}}_t = \Phi(\boldsymbol{\theta}_{t-1}) \tag{5} $$
by effectively defining a shift operator $\Phi$ (see Appendix A for details). Because the subproblems in (3) of two consecutive time steps share all control variables except for the first and the last ones, shifting the previous solution provides a warm start to (3) to amortize the computational complexity.
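The receding-horizon procedure of this section (approximately optimize the plan, apply only its first control, then shift the plan as a warm start) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the scalar system, the cost, the random-search optimizer `improve_plan`, and all function names are hypothetical, and the control distribution is taken to be deterministic (the plan itself).

```python
import random

def rollout_cost(x0, plan, model, cost, terminal_cost):
    """H-step accumulated cost of a control sequence under the model, as in (4)."""
    x, total = x0, 0.0
    for u in plan:
        total += cost(x, u)
        x = model(x, u)
    return total + terminal_cost(x)

def shift(plan, default=0.0):
    """Shift in the spirit of (5): drop the applied control, pad the end."""
    return plan[1:] + [default]

def improve_plan(x, plan, model, cost, terminal_cost, rng, n_samples=32, noise=0.5):
    """Approximately solve (3) by random search around the warm start.
    (Assumed optimizer; it never returns a plan worse than the input.)"""
    best, best_cost = list(plan), rollout_cost(x, plan, model, cost, terminal_cost)
    for _ in range(n_samples):
        cand = [u + rng.gauss(0.0, noise) for u in plan]
        c = rollout_cost(x, cand, model, cost, terminal_cost)
        if c < best_cost:
            best, best_cost = cand, c
    return best

def run_mpc(x0, steps, horizon, system, model, cost, terminal_cost, seed=0):
    rng = random.Random(seed)
    x, plan, states = x0, [0.0] * horizon, [x0]
    for _ in range(steps):
        plan = improve_plan(x, plan, model, cost, terminal_cost, rng)
        x = system(x, plan[0])   # apply only the first control on the real system
        plan = shift(plan)       # warm start for the next subproblem
        states.append(x)
    return states

if __name__ == "__main__":
    model = lambda x, u: x + u            # planning model (hypothetical)
    system = lambda x, u: x + 1.1 * u     # "real" system with a model mismatch
    cost = lambda x, u: x * x + 0.1 * u * u
    terminal = lambda x: x * x
    print(run_mpc(5.0, 20, 5, system, model, cost, terminal)[-1])
```

Because the plan is re-optimized from the measured state at every step, the resulting control law is state-feedback even though each plan is an open-loop sequence.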
II-B The Online Learning Perspective
As discussed, the iterative update process of MPC resembles the setup of online learning [16]. Here we provide the details to convert an MPC setup into an online learning problem. Recall from the introduction that online learning mainly consists of three components: the decision set, the learner's strategy for updating decisions, and the environment's strategy for updating per-round losses. We show the counterparts in MPC that correspond to each component below. Note that in this section we will overload the notation $\hat{J}(\boldsymbol{\theta}; x_t)$ to mean $\hat{J}(\pi_{\boldsymbol{\theta}}; x_t)$.
We use the concept of per-round loss in online learning as a mechanism to measure the decision uncertainty in MPC, and propose the following identification (shown in Fig. 1) for the MPC setup described in the previous section: we set the rounds in online learning to synchronize with the time steps of our control system, set the decision set $\boldsymbol{\Theta}$ as the space of feasible parameters of the control distribution $\pi_{\boldsymbol{\theta}}$, set the learner as the MPC algorithm which in round $t$ outputs the decision $\tilde{\boldsymbol{\theta}}_t$ and side information $u_{t-1}$, and set the per-round loss as
$$ \ell_t(\cdot) = \hat{J}(\cdot\,; x_t) \tag{6} $$
In other words, in round $t$ of this online learning setup, the learner plays a decision $\tilde{\boldsymbol{\theta}}_t$ along with a side information $u_{t-1}$ (based on the optimized solution $\boldsymbol{\theta}_{t-1}$ and the shift operator in (5)), the environment selects the per-round loss $\ell_t(\cdot) = \hat{J}(\cdot\,; x_t)$ (by applying $u_{t-1}$ to the real dynamical system in (1) to transit the state to $x_t$), and finally the learner receives $\ell_t$ and incurs cost $\ell_t(\tilde{\boldsymbol{\theta}}_t)$ (which measures the sub-optimality of the future plan made by the MPC algorithm).
This online learning setup differs slightly from the standard setup in its separation of the decision $\tilde{\boldsymbol{\theta}}_t$ and the side information $u_{t-1}$; while our setup can be converted into a standard one that treats $\boldsymbol{\theta}_{t-1}$ as the sole decision played in round $t$, we adopt this explicit separation in order to emphasize that the variable part of the incurred cost $\ell_t(\tilde{\boldsymbol{\theta}}_t)$ pertains to only $\tilde{\boldsymbol{\theta}}_t$. That is, the learner cannot go back and revert the previous control $u_{t-1}$ already applied on the system, but only uses $\ell_t$ to update the current and future controls $\hat{u}_t, \dots, \hat{u}_{t+H-1}$.
The performance of the learner in online learning (which by our identification is the MPC algorithm) is measured in terms of the accumulated costs $\sum_{t=1}^{T} \ell_t(\tilde{\boldsymbol{\theta}}_t)$. For problems in non-stationary setups, a normalized way to describe the accumulated costs in the online learning literature is through the concept of dynamic regret [14, 34], which is defined as
$$ \text{D-Regret} \triangleq \sum_{t=1}^{T} \ell_t(\tilde{\boldsymbol{\theta}}_t) - \sum_{t=1}^{T} \ell_t(\boldsymbol{\theta}_t^\star) \tag{7} $$
where $\boldsymbol{\theta}_t^\star \in \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \ell_t(\boldsymbol{\theta})$. Dynamic regret quantifies how suboptimal the played decisions $\tilde{\boldsymbol{\theta}}_1, \dots, \tilde{\boldsymbol{\theta}}_T$ are on the corresponding loss functions. In our proposed problem setup, the optimality concept associated with dynamic regret conveys a consistency criterion desirable for MPC: we would like to make a decision $\boldsymbol{\theta}_{t-1}$ at state $x_{t-1}$ such that, after applying control $u_{t-1}$ and entering the new state $x_t$, its shifted plan $\tilde{\boldsymbol{\theta}}_t$ remains close to optimal with respect to the new loss function $\ell_t$. If the dynamics model $\hat{f}$ is accurate and the MPC algorithm is ideally solving (3), we can expect that bootstrapping the previous solution $\boldsymbol{\theta}_{t-1}$ through (5) into $\tilde{\boldsymbol{\theta}}_t$ would result in a small instantaneous gap $\ell_t(\tilde{\boldsymbol{\theta}}_t) - \ell_t(\boldsymbol{\theta}_t^\star)$ which is solely due to unpredictable future information (such as the stochasticity in the dynamical system). In other words, an online learning algorithm with small dynamic regret, if applied to our online learning setup, would produce a consistently optimal MPC algorithm with regard to the solution concept discussed above. However, we note that having small dynamic regret here does not directly imply good absolute performance on the control system, because the overall performance of the MPC algorithm is largely dependent on the form of the MPC objective $\hat{J}$ (e.g., through choice of $H$ and accuracy of $\hat{f}$). More precisely, small dynamic regret means that the plan produced by an MPC algorithm is consistent with the given MPC objective.
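As a small worked example of the definition in (7), the sketch below evaluates dynamic regret for scalar decisions. The quadratic per-round losses and the specific numbers are hypothetical and chosen only so that the per-round minimizers are known in closed form.

```python
def dynamic_regret(losses, played, comparators):
    """Sum of per-round gaps between played decisions and per-round comparators."""
    assert len(losses) == len(played) == len(comparators)
    return sum(l(p) - l(c) for l, p, c in zip(losses, played, comparators))

def quadratic_loss(center):
    # Hypothetical per-round loss whose minimizer is `center`.
    return lambda theta: (theta - center) ** 2

centers = [0.0, 1.0, 2.0]          # per-round optima drift over time
losses = [quadratic_loss(a) for a in centers]
played = [0.0, 0.5, 2.0]           # decisions actually played in each round

print(dynamic_regret(losses, played, centers))  # 0.25
```

Only the second round contributes here: the played decision 0.5 misses that round's optimum 1.0, giving a gap of 0.25, while the other two rounds are played optimally.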
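III A Family of MPC Algorithms Based on Dynamic Mirror Descent
The online learning perspective on MPC suggests that good MPC algorithms can be designed from online learning algorithms that achieve small dynamic regret. This is indeed the case. We will show that a range of existing MPC algorithms are in essence applications of a classical online learning algorithm called dynamic mirror descent (DMD) [14]. DMD is a generalization of mirror descent [4] to problems involving dynamic comparators (in this case, the $\{\boldsymbol{\theta}_t^\star\}$ in dynamic regret in (7)). In round $t$, DMD applies the following update rule:
$$ \boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \langle \gamma_t \mathbf{g}_t, \boldsymbol{\theta} \rangle + D_\psi(\boldsymbol{\theta} \,\|\, \tilde{\boldsymbol{\theta}}_t), \qquad \tilde{\boldsymbol{\theta}}_{t+1} = \Phi(\boldsymbol{\theta}_t) \tag{8} $$
where $\mathbf{g}_t = \nabla \ell_t(\tilde{\boldsymbol{\theta}}_t)$ (which can be replaced by unbiased sampling if $\nabla \ell_t(\tilde{\boldsymbol{\theta}}_t)$ is an expectation), $\Phi$ is called the shift model (Footnote 4: In [14], $\Phi$ is called a dynamical model, but it is not the same as the dynamics of our control system. We therefore rename it to avoid confusion.), $\gamma_t > 0$ is the step size, and for some $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \boldsymbol{\Theta}$, $D_\psi(\boldsymbol{\theta} \,\|\, \boldsymbol{\theta}') \triangleq \psi(\boldsymbol{\theta}) - \psi(\boldsymbol{\theta}') - \langle \nabla\psi(\boldsymbol{\theta}'), \boldsymbol{\theta} - \boldsymbol{\theta}' \rangle$ is the Bregman divergence generated by a strictly convex function $\psi$ on $\boldsymbol{\Theta}$.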
The first step of DMD in (8) is reminiscent of the proximal update in the usual mirror descent algorithm. It can be thought of as an optimization step where the Bregman divergence acts as a regularization to keep $\boldsymbol{\theta}_t$ close to $\tilde{\boldsymbol{\theta}}_t$. Although $D_\psi(\boldsymbol{\theta} \,\|\, \boldsymbol{\theta}')$ is not necessarily a metric (since it may not be symmetric), it is still useful to view it as a distance between $\boldsymbol{\theta}$ and $\boldsymbol{\theta}'$. Indeed, familiar examples of the Bregman divergence include the squared Euclidean distance and KL divergence (Footnote 5: For probability distributions $p$ and $q$ over a random variable $x$, the KL divergence is defined as $\mathrm{KL}(p \,\|\, q) \triangleq \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$.) [3].
The second step of DMD in (8) uses the shift model $\Phi$ to anticipate the optimal decision for the next round. In the context of MPC, a natural choice for the shift model is the shift operator in (5) defined previously in Section II-A (hence the same notation), because the per-round losses in two consecutive rounds here concern problems with shifted time indices. Hall and Willett [14] show that the dynamic regret of DMD scales with how much the optimal decision sequence $\{\boldsymbol{\theta}_t^\star\}$ deviates from $\Phi$ (i.e., $\sum_t \|\boldsymbol{\theta}_{t+1}^\star - \Phi(\boldsymbol{\theta}_t^\star)\|$), which is proportional to the unpredictable elements of the problem.
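To make the two steps of (8) concrete, the sketch below implements one DMD round for the simplest case. It assumes the squared Euclidean divergence (so the first step reduces to a gradient step followed by Euclidean projection), a box-shaped feasible set, and a shift model that drops the first entry and pads with a default value; these choices and all names are illustrative assumptions.

```python
def project_box(theta, low, high):
    """Euclidean projection onto the box [low, high]^H (elementwise clipping)."""
    return [min(max(v, low), high) for v in theta]

def shift(theta, default=0.0):
    """Shift model: drop the first entry and pad the last with a default value."""
    return theta[1:] + [default]

def dmd_step(theta_tilde, grad, step_size, low=-1.0, high=1.0):
    """One DMD round with psi = 0.5 * ||.||^2 and a box constraint (assumptions)."""
    # First step: proximal update. With the squared Euclidean divergence this is
    # a gradient step followed by Euclidean projection onto the feasible set.
    stepped = [v - step_size * g for v, g in zip(theta_tilde, grad)]
    theta = project_box(stepped, low, high)
    # Second step: the shift model anticipates the decision for the next round.
    theta_tilde_next = shift(theta)
    return theta, theta_tilde_next

theta, theta_next = dmd_step([0.5, 0.0, -0.5], [1.0, -4.0, 0.0], step_size=0.5)
print(theta)       # [0.0, 1.0, -0.5]
print(theta_next)  # [1.0, -0.5, 0.0]
```

In an MPC setting, `theta` would be used to pick the control applied at the current step, and `theta_next` would be the warm start for the next round.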
Applying DMD in (8) to the online learning problem described in Section II-B leads to an MPC algorithm shown in Algorithm 1, which we call DMD-MPC. More precisely, DMD-MPC represents a family of MPC algorithms in which a specific instance is defined by a choice of: (1) the MPC objective $\hat{J}$ in (6), (2) the form of the control distribution $\pi_{\boldsymbol{\theta}}$, and (3) the Bregman divergence $D_\psi$ in (8).
Thus, we can use DMD-MPC as a generic strategy for synthesizing MPC algorithms. In the following, we use this recipe to recreate several existing MPC algorithms and demonstrate new MPC algorithms that naturally arise from this framework.
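III-A Loss Functions
We discuss several definitions of the per-round loss $\ell_t$, which all result from the formulation in (6) but with different $\hat{J}$. These loss functions are based on the statistic $C(\cdot, \cdot)$ defined in (4) which measures the $H$-step accumulated cost of a given trajectory. For transparency of exposition, we will suppose henceforth that the control distribution $\pi_{\boldsymbol{\theta}}$ is open-loop (Footnote 6: Note again that even while using open-loop control distributions, the overall control law of MPC is state-feedback.); similar derivations follow naturally for closed-loop control distributions. For convenience of practitioners, we also provide expressions of their gradients in terms of the likelihood-ratio derivative (Footnote 7: We assume the control distribution is sufficiently regular with respect to its parameter so that the likelihood-ratio derivative rule holds.) [12]. For some function $L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$, all these gradients shall have the form
$$ \nabla \ell_t(\boldsymbol{\theta}) = \mathbb{E}_{\hat{\mathbf{u}}_t \sim \pi_{\boldsymbol{\theta}}} \mathbb{E}_{\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t)} \left[ L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \right] \tag{9} $$
In short, we will denote $\mathbb{E}_{\hat{\mathbf{u}}_t \sim \pi_{\boldsymbol{\theta}}} \mathbb{E}_{\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t)}$ as $\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}$. These gradients in practice are approximated by finite samples.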
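A finite-sample approximation of a likelihood-ratio gradient of the form (9) can be sketched as below. The example assumes scalar controls, an open-loop control distribution made of independent Gaussians with a shared, fixed standard deviation, and differentiation with respect to the means only; the function name and the argument `signals` (the per-sample values of the weighting function) are hypothetical.

```python
def lr_gradient_gaussian_mean(means, sigma, samples, signals):
    """Monte Carlo likelihood-ratio gradient w.r.t. the means of independent
    Gaussians N(means[h], sigma^2), one per planning step h.

    samples: list of sampled control sequences, each of length len(means)
    signals: the weighting value for each sampled sequence (e.g., its cost)
    """
    n, horizon = len(samples), len(means)
    grad = [0.0] * horizon
    for u, weight in zip(samples, signals):
        for h in range(horizon):
            # d/dm log N(u; m, sigma^2) = (u - m) / sigma^2
            grad[h] += weight * (u[h] - means[h]) / sigma ** 2
    return [g / n for g in grad]

samples = [[1.0, 0.0], [-1.0, 2.0]]
signals = [2.0, 4.0]
print(lr_gradient_gaussian_mean([0.0, 0.0], 1.0, samples, signals))  # [-1.0, 4.0]
```

In practice the control sequences would be drawn from the current distribution and the signals computed from model rollouts; fixed numbers are used here so the arithmetic can be checked by hand.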
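III-A1 Expected Cost
The most commonly used MPC objective is the $H$-step expected accumulated cost function under model dynamics, because it directly estimates the expected long-term behavior when the dynamics model $\hat{f}$ is accurate and $H$ is large enough. Its per-round loss function is (Footnote 8: In experiments, we subtract the empirical average of the sampled costs from $C$ in (11) to reduce the variance, at the cost of a small amount of bias.)
$$ \ell_t(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \right] \tag{10} $$
$$ \nabla \ell_t(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \right] \tag{11} $$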
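III-A2 Expected Utility
Instead of optimizing for average cost, we may care to optimize for some preference related to the trajectory cost $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$, such as having the cost be below some threshold. This idea can be formulated as a utility that returns a normalized score related to the preference for a given trajectory cost $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$. Specifically, suppose that $C$ is lower bounded by zero (Footnote 9: If this is not the case, let $c_{\min} \triangleq \inf C(\cdot, \cdot)$, which we assume is finite. We can then replace $C$ with $C - c_{\min}$.) and at some round $t$ define the utility $U_t: \mathbb{R}_+ \to [0, 1]$ (i.e., $U_t: C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \mapsto U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t))$) to be a function with the following properties: $U_t(0) = 1$, $U_t$ is monotonically decreasing, and $\lim_{z \to +\infty} U_t(z) = 0$. These are sensible properties since we attain maximum utility when we have zero cost, the utility never increases with the cost, and the utility approaches zero as the cost increases without bound. We then define the per-round loss as
$$ \ell_t(\boldsymbol{\theta}) = -\log \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right] \tag{12} $$
$$ \nabla \ell_t(\boldsymbol{\theta}) = -\frac{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \right]} \tag{13} $$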
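The gradient in (13) is particularly appealing when estimated with samples. Suppose we sample $N$ control sequences $\hat{\mathbf{u}}_t^1, \dots, \hat{\mathbf{u}}_t^N$ from $\pi_{\boldsymbol{\theta}}$ and (for the sake of compactness) sample one state trajectory from $\hat{\mathbf{f}}$ for each corresponding control sequence, resulting in $\hat{\mathbf{x}}_t^1, \dots, \hat{\mathbf{x}}_t^N$. Then the estimate of (13) is a convex combination of gradients:
$$ \nabla \ell_t(\boldsymbol{\theta}) \approx -\sum_{i=1}^{N} w_i \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t^i) $$
where $w_i = \frac{U_t(C_i)}{\sum_{j=1}^{N} U_t(C_j)}$ and $C_i = C(\hat{\mathbf{x}}_t^i, \hat{\mathbf{u}}_t^i)$, for $i = 1, \dots, N$. We see that each weight is computed by considering the relative utility of its corresponding trajectory. A cost $C_i$ with high relative utility will push its corresponding weight $w_i$ closer to one, whereas a low relative utility will cause $w_i$ to be close to zero, effectively rejecting the corresponding sample.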
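The normalized weights described above can be computed as in the following sketch, which also forms the weighted combination of per-sample score gradients. The function names are hypothetical, the utility is passed in as a plain callable, and the per-sample gradients of the log-density are assumed to be supplied by the caller.

```python
def utility_weights(costs, utility):
    """Weights proportional to the utility of each sampled cost; they are
    nonnegative and sum to one."""
    scores = [utility(c) for c in costs]
    total = sum(scores)
    if total == 0.0:
        raise ValueError("all sampled trajectories have zero utility")
    return [s / total for s in scores]

def weighted_score_gradient(weights, score_grads):
    """Minus the weighted combination of per-sample gradients of log pi."""
    dim = len(score_grads[0])
    return [-sum(w * g[k] for w, g in zip(weights, score_grads)) for k in range(dim)]

costs = [1.0, 2.0, 3.0]
threshold_utility = lambda c: 1.0 if c <= 2.0 else 0.0   # illustrative utility
weights = utility_weights(costs, threshold_utility)
print(weights)                                                      # [0.5, 0.5, 0.0]
print(weighted_score_gradient(weights, [[1.0], [3.0], [10.0]]))     # [-2.0]
```

The third sample has zero utility, so it is rejected outright and contributes nothing to the gradient estimate.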
We give two examples of utilities and their related losses.
Probability of Low Cost
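For example, we may care about the system being below some cost threshold as often as possible. To encode this preference, we can use the threshold utility $U_t(C) \triangleq \mathbb{1}\{C \le C_{t,\max}\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function and $C_{t,\max}$ is a threshold parameter. Under this choice, the loss and its gradient become
$$ \ell_t(\boldsymbol{\theta}) = -\log \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \right] \tag{14} $$
$$ \nabla \ell_t(\boldsymbol{\theta}) = -\frac{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \right]} \tag{15} $$
As we can see, this loss function is the negative log-probability of achieving a cost at or below the threshold. As a result (Fig. 1(a)), costs below $C_{t,\max}$ are treated the same in terms of the utility. This can potentially make optimization easier since we are trying to make good trajectories as likely as possible instead of finding the best trajectories as in (10).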
However, if the threshold is set too low and the gradient is estimated with samples, the gradient estimate may have high variance due to the large number of rejected samples. Because of this, in practice, the threshold is set adaptively, e.g., as the largest cost of the top elite fraction of the sampled trajectories with smallest costs [6]. This allows the controller to make the best sampled trajectories more likely and therefore improve the controller.
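The adaptive threshold described above can be sketched as follows. Rounding the elite count up with `ceil` and keeping at least one sample are assumptions of this sketch, as are the function names.

```python
import math

def elite_threshold(costs, elite_fraction):
    """Largest cost among the lowest-cost elite fraction of the samples."""
    if not 0.0 < elite_fraction <= 1.0:
        raise ValueError("elite_fraction must be in (0, 1]")
    # Assumption: round the elite count up and keep at least one sample.
    k = max(1, math.ceil(elite_fraction * len(costs)))
    return sorted(costs)[k - 1]

def threshold_weights(costs, elite_fraction):
    """Normalized threshold-utility weights with an adaptively set threshold."""
    c_max = elite_threshold(costs, elite_fraction)
    keep = [1.0 if c <= c_max else 0.0 for c in costs]
    total = sum(keep)
    return [v / total for v in keep]

costs = [5.0, 1.0, 3.0, 2.0]
print(elite_threshold(costs, 0.5))    # 2.0
print(threshold_weights(costs, 0.5))  # [0.0, 0.5, 0.0, 0.5]
```

Because the threshold is taken from the samples themselves, at least one sample is always accepted, which avoids the degenerate case where every sample is rejected.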
Exponential Utility
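We can also opt for a continuous surrogate of the indicator function, in this case the exponential utility $U_t(C) \triangleq \exp\left(-\frac{1}{\lambda} C\right)$, where $\lambda > 0$ is a scaling parameter. Unlike the indicator function, the exponential utility provides nonzero feedback for any given cost and allows us to discriminate between costs (i.e., if $C_1 > C_2$, then $U_t(C_1) < U_t(C_2)$), as shown in Fig. 1(b). Furthermore, $\lambda$ acts as a continuous alternative to $C_{t,\max}$ and dictates how quickly or slowly $U_t$ decays to zero, which in a soft way determines the cutoff point for rejecting given costs.
Under this choice, the loss and its gradient become
$$ \ell_t(\boldsymbol{\theta}) = -\log \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \right] \tag{16} $$
$$ \nabla \ell_t(\boldsymbol{\theta}) = -\frac{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \right]}{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ \exp\left(-\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\right) \right]} \tag{17} $$
The loss function in (16) is also known as the risk-seeking objective in optimal control [28]; this classical interpretation is based on a Taylor expansion of (16) showing
$$ \lambda\, \ell_t(\boldsymbol{\theta}) \approx \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \right] - \frac{1}{2\lambda} \mathbb{V}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}\left[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \right] $$
when $\lambda$ is large, where $\mathbb{V}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}[C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)]$ is the variance of $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$. Here we derive (16) from a different perspective that treats it as a continuous approximation of (14). The use of exponential transformations to approximate indicators is a common machine-learning trick (like the Chernoff bound [8]).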
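When the exponential utility is evaluated on sampled costs, the raw exponentials can underflow to zero for small scaling parameters. The sketch below computes the normalized sample weights after subtracting the minimum cost, which leaves the weights unchanged but keeps the computation well conditioned. The function name and the argument name `lam` (standing for the scaling parameter) are hypothetical.

```python
import math

def exponential_weights(costs, lam):
    """Normalized weights proportional to exp(-cost / lam), computed stably."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    c_min = min(costs)
    # Subtracting the minimum cost rescales numerator and denominator by the
    # same factor, so the weights are unchanged, but every exponent is <= 0
    # and the best sample always gets a score of exactly 1.
    scores = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(scores)
    return [s / total for s in scores]

print(exponential_weights([1000.0, 1000.0], 0.01))   # [0.5, 0.5]
print(exponential_weights([1.0, 2.0, 3.0], 100.0))   # close to uniform
print(exponential_weights([1.0, 2.0, 3.0], 0.01))    # close to one-hot on the best sample
```

The last two calls illustrate the soft cutoff role of the scaling parameter: a large value treats all samples almost equally, while a small value effectively rejects everything but the lowest-cost sample.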
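III-B Algorithms
We instantiate DMD-MPC with different choices of loss function, control distribution, and Bregman divergence as concrete examples to showcase the flexibility of our framework. In particular, we are able to recover well-known MPC algorithms as special cases of Algorithm 1.
Our discussions below are organized based on the class of Bregman divergences used in (8), and the following algorithms are derived assuming that the control distribution is a sequence of independent distributions. That is, we suppose $\pi_{\boldsymbol{\theta}}$ is a probability density/mass function that factorizes as
$$ \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) = \prod_{h=0}^{H-1} \pi_{\theta_h}(\hat{u}_{t+h}) \tag{18} $$
and $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_{H-1})$ for some basic control distribution $\pi_{\theta_h}$ parameterized by $\theta_h \in \Theta_h$, where $\Theta_h$ denotes the feasible set for the basic control distribution. For control distributions in the form of (18), the shift operator $\Phi$ in (5) would set $\tilde{\boldsymbol{\theta}}_t$ by identifying $\tilde{\theta}_{t,h} = \theta_{t-1,h+1}$ for $h = 0, \dots, H-2$, and initializing the final parameter as either $\tilde{\theta}_{t,H-1} = \tilde{\theta}_{t,H-2}$ or $\tilde{\theta}_{t,H-1} = \bar{\theta}$ for some default parameter $\bar{\theta}$.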
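III-B1 Quadratic Divergence
We start with perhaps the most common Bregman divergence: the quadratic divergence. That is, we suppose the Bregman divergence in (8) has a quadratic form $D_\psi(\boldsymbol{\theta} \,\|\, \boldsymbol{\theta}') \triangleq \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}')^\top A (\boldsymbol{\theta} - \boldsymbol{\theta}')$ (Footnote 10: This is generated by defining $\psi(\boldsymbol{\theta}) \triangleq \frac{1}{2} \boldsymbol{\theta}^\top A \boldsymbol{\theta}$.) for some positive-definite matrix $A$. Below we discuss different choices of $A$ and their corresponding update rules.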
Projected Gradient Descent
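This basic update rule is a special case when $A$ is the identity matrix. Equivalently, the update can be written as $\boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \|\boldsymbol{\theta} - (\tilde{\boldsymbol{\theta}}_t - \gamma_t \mathbf{g}_t)\|^2$.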
Natural Gradient Descent
Quadratic Problems
While the above two update rules are quite general, we can further specialize the Bregman divergence to achieve faster learning when the per-round loss function can be shown to be quadratic. This happens, for instance, when the MPC problem in (3) is an LQR or LEQR problem (Footnote 11: The dynamics model $\hat{f}$ is linear, the step cost $c$ is quadratic, the per-round loss $\ell_t$ is (10), and the basic control distribution is a Dirac-delta distribution.) [11]. That is, if
$$ \ell_t(\boldsymbol{\theta}) = \frac{1}{2} \boldsymbol{\theta}^\top R_t \boldsymbol{\theta} + \mathbf{r}_t^\top \boldsymbol{\theta} $$
for some constant vector $\mathbf{r}_t$ and positive definite matrix $R_t$, we can set $A = R_t$ and $\gamma_t = 1$, making $\boldsymbol{\theta}_t$ given by the first step of (8) correspond to the optimal solution to $\ell_t$ (i.e., the solution of LQR/LEQR). The particular values of $R_t$ and $\mathbf{r}_t$ for each of LQR and LEQR are derived in Appendix D.
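The claim that a single step solves a quadratic problem can be checked numerically. The sketch below assumes an unconstrained two-dimensional decision, a per-round loss of the form 0.5 * theta^T R theta + r^T theta with R positive definite, a quadratic divergence built from the same R, and a step size of 1; under these assumptions the first step of (8) is theta_tilde - R^{-1} (R theta_tilde + r). The symbols `R` and `r` and the helper names are illustrative.

```python
def solve2(R, b):
    """Solve the 2x2 linear system R z = b by Cramer's rule."""
    (a11, a12), (a21, a22) = R
    det = a11 * a22 - a12 * a21
    return [(a22 * b[0] - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

def quadratic_dmd_step(R, r, theta_tilde):
    """First step of DMD with divergence 0.5 (theta - theta')^T R (theta - theta'),
    step size 1, and no constraints (assumptions of this sketch)."""
    # Gradient of 0.5 * theta^T R theta + r^T theta at theta_tilde.
    grad = [R[i][0] * theta_tilde[0] + R[i][1] * theta_tilde[1] + r[i] for i in range(2)]
    delta = solve2(R, grad)
    return [theta_tilde[i] - delta[i] for i in range(2)]

R = [[2.0, 1.0], [1.0, 2.0]]
r = [-3.0, -3.0]
# The minimizer solves R theta = -r, i.e., theta = [1, 1], regardless of the start.
print(quadratic_dmd_step(R, r, [0.0, 0.0]))
print(quadratic_dmd_step(R, r, [5.0, -3.0]))
```

Both calls land on the same point because matching the divergence to the curvature of the loss turns the proximal step into an exact Newton step.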
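III-B2 KL Divergence and the Exponential Family
We show that for control distributions in the exponential family [23], the Bregman divergence in (8) can be set to the KL divergence, which is a natural way to measure distances between distributions. Toward this end, we review the basics of the exponential family. We say a distribution $p_\eta$ with natural parameter $\eta$ of random variable $u$ belongs to the exponential family if its probability density/mass function satisfies $p_\eta(u) = \rho(u) \exp\left(\langle \eta, \phi(u) \rangle - A(\eta)\right)$, where $\phi(u)$ is the sufficient statistics, $\rho(u)$ is the carrier measure, and $A(\eta) = \log \int \rho(u) \exp(\langle \eta, \phi(u) \rangle)\, du$ is the log-partition function. The distribution $p_\eta$ can also be described by its expectation parameter $\mu \triangleq \mathbb{E}_{p_\eta}[\phi(u)]$, and there is a duality between the two parameterizations: $\mu = \nabla A(\eta)$ and $\eta = \nabla A^*(\mu)$, where $A^*(\mu) = \sup_{\eta \in \mathcal{H}} \langle \eta, \mu \rangle - A(\eta)$ is the Legendre transformation of $A$ and $\mathcal{H} = \{\eta : A(\eta) < +\infty\}$. That is, $\nabla A = (\nabla A^*)^{-1}$. The duality results in the property below.
Fact 1 ([23]). $\mathrm{KL}(p_\eta \,\|\, p_{\eta'}) = D_A(\eta' \,\|\, \eta) = D_{A^*}(\mu \,\|\, \mu')$.
We can use Fact 1 to define the Bregman divergence in (8) to optimize a control distribution $\pi_{\boldsymbol{\theta}}$ in the exponential family:
(i) if $\boldsymbol{\theta}$ is an expectation parameter, we can set $D_\psi(\boldsymbol{\theta} \,\|\, \tilde{\boldsymbol{\theta}}_t) \triangleq \mathrm{KL}(\pi_{\boldsymbol{\theta}} \,\|\, \pi_{\tilde{\boldsymbol{\theta}}_t})$, or
(ii) if $\boldsymbol{\theta}$ is a natural parameter, we can set $D_\psi(\boldsymbol{\theta} \,\|\, \tilde{\boldsymbol{\theta}}_t) \triangleq \mathrm{KL}(\pi_{\tilde{\boldsymbol{\theta}}_t} \,\|\, \pi_{\boldsymbol{\theta}})$.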
We demonstrate some examples using this idea below.
Expectation Parameters and Categorical Distributions
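We first discuss the case where $\boldsymbol{\theta}$ is an expectation parameter and the first step in (8) is
$$ \boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \langle \gamma_t \mathbf{g}_t, \boldsymbol{\theta} \rangle + \mathrm{KL}(\pi_{\boldsymbol{\theta}} \,\|\, \pi_{\tilde{\boldsymbol{\theta}}_t}) \tag{19} $$
To illustrate, we consider an MPC problem with a discrete control space $\{1, 2, \dots, m\}$ and use the categorical distribution as the basic control distribution in (18), i.e., we set $\pi_{\theta_h} = \mathrm{Cat}(\theta_h)$, where $\theta_h \in \Delta^m$ is the probability of choosing each control among $\{1, 2, \dots, m\}$ at the predicted time $t+h$ and $\Delta^m$ denotes the probability simplex in $\mathbb{R}^m$. This parameterization choice makes $\boldsymbol{\theta}$ an expectation parameter of $\pi_{\boldsymbol{\theta}}$ that corresponds to sufficient statistics given by indicator functions. With the structure of (9), the update direction is
$$ g_{t,h} = \mathbb{E}_{\pi_{\tilde{\boldsymbol{\theta}}_t}, \hat{\mathbf{f}}}\left[ L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\, e_{\hat{u}_{t+h}} \oslash \tilde{\theta}_{t,h} \right] $$
where $\tilde{\theta}_{t,h}$ and $g_{t,h}$ are the $h$th elements of $\tilde{\boldsymbol{\theta}}_t$ and $\mathbf{g}_t$, respectively, $e_{\hat{u}_{t+h}} \in \mathbb{R}^m$ has $0$ for each element except at index $\hat{u}_{t+h}$ where it is $1$, and $\oslash$ denotes elementwise division. Update (19) then becomes the exponentiated gradient algorithm [16]:
$$ \theta_{t,h} = \frac{1}{Z_{t,h}}\, \tilde{\theta}_{t,h} \odot \exp(-\gamma_t g_{t,h}) \tag{20} $$
where $\theta_{t,h}$ is the $h$th element of $\boldsymbol{\theta}_t$, $Z_{t,h}$ is the normalizer for $\theta_{t,h}$, and $\odot$ denotes elementwise multiplication. That is, instead of applying an additive gradient step to the parameters, the update in (19) exponentiates the gradient and performs elementwise multiplication. This does a better job of accounting for the geometry of the problem, and makes projection a simple operation of normalizing a distribution.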
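The multiplicative update described above can be sketched for a single planning step as follows. The function and argument names are hypothetical; `probs` plays the role of the current categorical probabilities and `grad` the corresponding block of the update direction.

```python
import math

def exponentiated_gradient_step(probs, grad, step_size):
    """Multiply each probability by exp(-step_size * gradient), then renormalize."""
    unnormalized = [p * math.exp(-step_size * g) for p, g in zip(probs, grad)]
    z = sum(unnormalized)  # normalizer; plays the role of the projection
    return [v / z for v in unnormalized]

# Two discrete controls, initially equally likely. The second control has a
# positive gradient entry (it looks worse), so its probability shrinks.
new_probs = exponentiated_gradient_step([0.5, 0.5], [0.0, math.log(2.0)], 1.0)
print(new_probs)  # approximately [2/3, 1/3]
```

Because the update is multiplicative, probabilities stay strictly positive and the only "projection" needed to remain on the simplex is the division by the normalizer.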
Natural Parameters and Gaussian Distributions
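Alternatively, we can set $\boldsymbol{\theta}$ as a natural parameter and use
$$ \boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \langle \gamma_t \mathbf{g}_t, \boldsymbol{\theta} \rangle + \mathrm{KL}(\pi_{\tilde{\boldsymbol{\theta}}_t} \,\|\, \pi_{\boldsymbol{\theta}}) \tag{21} $$
as the first step in (8). In particular, we show that, with (21), the structure of the likelihood-ratio derivative in (9) can be leveraged to design an efficient update. The main idea follows from the observation that when the gradient is computed through (9) and $\tilde{\boldsymbol{\theta}}_t$ is the natural parameter, we can write
$$ \mathbf{g}_t = \mathbb{E}_{\pi_{\tilde{\boldsymbol{\theta}}_t}, \hat{\mathbf{f}}}\left[ L_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \left( \phi(\hat{\mathbf{u}}_t) - \tilde{\boldsymbol{\mu}}_t \right) \right] \tag{22} $$
where $\tilde{\boldsymbol{\mu}}_t$ is the expectation parameter of $\tilde{\boldsymbol{\theta}}_t$ and $\phi$ is the sufficient statistics of the control distribution. We combine the factorization in (22) with a property of the proximal update below (proven in Appendix C) to derive our algorithm.
Proposition 1.
Let $\mathbf{g}_t$ be an update direction. Let $\mathcal{M}$ be the image of $\mathcal{H}$ under $\nabla A$. If $\tilde{\boldsymbol{\mu}}_t - \gamma_t \mathbf{g}_t \in \mathcal{M}$ and $\boldsymbol{\theta}_t$ is given by (21), then the expectation parameter of $\boldsymbol{\theta}_t$ is $\boldsymbol{\mu}_t = \tilde{\boldsymbol{\mu}}_t - \gamma_t \mathbf{g}_t$. (Footnote 12: A similar proposition can be found for (19).)
We find that, under the assumption (Footnote 13: If $\tilde{\boldsymbol{\mu}}_t - \gamma_t \mathbf{g}_t$ is not in $\mathcal{M}$, the update in (21) needs to perform a projection, the form of which is algorithm dependent.) in Proposition 1, the update rule in (21) becomes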
(23) 
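In other words, when $\gamma_t \in [0, 1]$, the update to the expectation parameter $\boldsymbol{\mu}_t$ in (8) is simply a convex combination of the sufficient statistics and the previous expectation parameter $\tilde{\boldsymbol{\mu}}_t$.
We provide a concrete example of an MPC algorithm that follows from (23). Let us consider a continuous control space and use the Gaussian distribution as the basic control distribution in (18), i.e., we set $\pi_{\theta_h} = \mathcal{N}(m_h, \Sigma_h)$ for some mean vector $m_h$ and covariance matrix $\Sigma_h$. For $\pi_{\theta_h}$, we can choose sufficient statistics $\phi(\hat{u}_{t+h}) = (\hat{u}_{t+h}, \hat{u}_{t+h} \hat{u}_{t+h}^\top)$, which results in the expectation parameter $\mu_h = (m_h, S_h)$ and the natural parameter $\eta_h = (\Sigma_h^{-1} m_h, -\frac{1}{2} \Sigma_h^{-1})$, where $S_h = \Sigma_h + m_h m_h^\top$. Let us set $\theta_h$ as the natural parameter. Then (21) is equivalent to the update rule for $h = 0, \dots, H-1$: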
(24) 
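For intuition, the sketch below shows what a convex-combination update of the Gaussian mean can look like when the loss is the exponential utility and the covariance is held fixed. This is an illustration under assumptions rather than a transcription of the update rules in this section: the new mean mixes the previous (shifted) mean with an exponentially weighted average of sampled control sequences, controls are scalar, and all names are hypothetical. With a step size of 1 the previous mean is discarded and the update reduces to a weighted average of the samples, matching the description of MPPI given in the related work.

```python
import math

def gaussian_mean_update(prev_means, samples, costs, lam, step_size):
    """Mix the previous means with an exponentially weighted sample average.

    prev_means: shifted means from the previous round, one per planning step
    samples:    sampled control sequences (scalar controls), one list per sample
    costs:      trajectory cost of each sampled sequence
    lam:        scaling parameter of the exponential utility
    step_size:  value in [0, 1]; 1 keeps only the weighted sample average
    """
    c_min = min(costs)
    scores = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(scores)
    weights = [s / total for s in scores]
    new_means = []
    for h, m in enumerate(prev_means):
        weighted_avg = sum(w * u[h] for w, u in zip(weights, samples))
        new_means.append((1.0 - step_size) * m + step_size * weighted_avg)
    return new_means

samples = [[1.0, 2.0], [3.0, 4.0]]
costs = [0.0, 0.0]   # equal costs give equal weights
print(gaussian_mean_update([0.0, 0.0], samples, costs, lam=1.0, step_size=0.5))  # [1.0, 1.5]
print(gaussian_mean_update([0.0, 0.0], samples, costs, lam=1.0, step_size=1.0))  # [2.0, 3.0]
```

A step size below 1 acts as a low-pass filter on the plan, which is one way to trade responsiveness for robustness to noisy estimates from few samples.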
III-C Extensions
In the previous sections, we discussed multiple instantiations of DMD-MPC, showing the flexibility of our framework. But they are by no means exhaustive. In Appendix B, we discuss variations of DMD-MPC, e.g., imposing constraints and different ways to approximate the expectation in (9).
IV Related Work
Recent work on MPC has studied sampling-based approaches, which are flexible in that they do not require differentiability of a cost function. One such algorithm which can be used with general cost functions and dynamics is MPPI, which was proposed by Williams et al. [31] as a generalization of the control affine case [30]. The algorithm is derived by considering an optimal control distribution defined by the control problem. This optimal distribution is intractable to sample from, so the algorithm instead tries to bring a tractable distribution (in this case, Gaussian with fixed covariance) as close as possible in the sense of KL divergence. This ends up being the same as finding the mean of the optimal control distribution. The mean is then approximated as a weighted sum of sampled control trajectories, where the weight is determined by the exponentiated costs. Although this algorithm works well in practice (including a robust variant [33] achieving state-of-the-art performance in aggressive driving [10]), it is not clear that matching the mean of the distribution should guarantee good performance, such as in the case of a multimodal optimal distribution. By contrast, our update rule in (25) results from optimizing an exponential utility.
A closely related approach is the cross-entropy method (CEM) [6], which also assumes a Gaussian sampling distribution but minimizes the KL divergence between the Gaussian distribution and a uniform distribution over low-cost samples. CEM has found applicability in reinforcement learning [19, 21, 26], motion planning [17, 18], and MPC [9, 32].
These sampling-based control algorithms can be considered special cases of general derivative-free optimization algorithms, such as covariance matrix adaptation evolutionary strategies (CMA-ES) [15] and natural evolutionary strategies (NES) [29]. CMA-ES samples points from a multivariate Gaussian, evaluates their fitness, and adapts the mean and covariance of the sampling distribution accordingly. On the other hand, NES optimizes the parameters of the sampling distribution to maximize some expected fitness through steepest ascent, where the direction is provided by the natural gradient. Akimoto et al. [2] showed that CMA-ES can also be interpreted as taking a natural gradient step on the parameters of the sampling distribution. As we showed in Section III-B, natural gradient descent is a special case of the DMD-MPC framework. A similar observation that connects MPPI and mirror descent was made by Okada and Taniguchi [24], but their derivation is limited to the KL divergence and Gaussian case.
V Experiments
We use experiments to validate the flexibility of DMD-MPC. We show that this framework can handle both continuous (Gaussian distribution) and discrete (categorical distribution) variations of control problems, and that MPC algorithms like MPPI and CEM can be generalized using different step sizes and control distributions to improve performance. Extra details and results are included in Appendices E and F.
V-A Cartpole
We first consider the classic cartpole problem where we seek to swing a pole upright and keep it balanced only using actuation on the attached cart. We consider both the continuous and discrete control variants. For the continuous case, we choose the Gaussian distribution as the control distribution and keep the covariance fixed. For the discrete case, we choose the categorical distribution and use update (20). In either case, we have access to a biased stochastic model (uses a different pole length compared to the real cart).
We consider the interaction between the choice of loss, step size, and number of samples used to estimate (9) (Footnote 14: For our experiments, we vary the number of samples from $\pi_{\boldsymbol{\theta}}$ and fix the number of samples from $\hat{\mathbf{f}}$ to ten. Furthermore, we use common random numbers when sampling from $\hat{\mathbf{f}}$ to reduce estimation variance.), shown in Figs. 3 and 4. For this environment, we can achieve low cost when optimizing the expected cost in (10) with a proper step size ( for both continuous and discrete problems) while being fairly robust to the number of samples. When using either of the utilities, the number of samples is more crucial in the continuous domain, with more samples allowing for larger step sizes. In the discrete domain (Fig. 2(b)), performance is largely unaffected by the number of samples when the step size is below , excluding the threshold utility with 1000 samples. In Fig. 3(a), for a large range of utility parameters, we see that using step sizes above 1 (the step size set in MPPI and CEM) gives significant performance gains. In Fig. 3(b), there's a more complicated interaction between the utility parameter and step size, with huge changes in cost when altering the utility parameter and keeping the step size fixed.
V-B AutoRally
V-B1 Platform Description
We use the autonomous AutoRally platform [13] to run a high-speed driving task on a dirt track, with the goal of the task to achieve as low a lap time as possible. The robot (Fig. 5) is a 1:5 scale RC chassis capable of driving over () and has a desktop-class Intel Core i7 CPU and Nvidia GTX 1050 Ti GPU. For real-world experiments, we estimate the car's pose using a particle filter from [10] which relies on a monocular camera, IMU, and GPS. In both simulated and real-world experiments, the dynamics model is a neural network which has been fitted to data collected from human demonstrations. We note that the dynamics model is deterministic, so we don't need to estimate any expectations with respect to the dynamics.
VB2 Simulated Experiments
We first use the Gazebo simulator (Fig. 9 in Section EB) to perform a sweep of algorithm parameters, particularly the step size and number of samples, to evaluate how changing these parameters affects the performance of DMDMPC. For all of the experiments, the control distribution is a Gaussian with fixed covariance, and we use update (25) (i.e., the loss is the exponential utility (16)) with . The resulting lap times are shown in Fig. 6. (Footnote 15: The large error bar for 64 samples and step size of 0.8 is due to one particular lap where the car stalled at a turn for about 60 seconds.) We see that although using more samples does result in lower lap times, there are diminishing returns past 1920 samples per gradient. Indeed, with a proper step size, even as few as 192 samples can yield lap times within a couple of seconds of those obtained with 3840 samples and a step size of 1. We also observe that the curves converge as the step size decreases further, implying that only a certain number of samples are needed for a given step size. This is a particularly important advantage of DMDMPC over methods like MPPI: by changing the step size, DMDMPC can perform much more effectively with fewer samples, making it a good choice for embedded systems that cannot produce many samples due to computational constraints.
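To make the roles of the step size and the number of samples concrete, the following is a minimal sketch of a sampling-based mean update for a Gaussian control distribution with fixed covariance. It assumes the update is a convex combination of the current mean and an exponentially weighted average of sampled control sequences, so that a step size of 1 recovers a pure weighted average in the style of MPPI. The toy cost `rollout_cost`, the function `exp_utility_update`, and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math
import random


def rollout_cost(controls):
    """Hypothetical trajectory cost: drive a 1-D integrator from 0 toward 1."""
    x, cost = 0.0, 0.0
    for u in controls:
        x += 0.1 * u
        cost += (x - 1.0) ** 2 + 0.01 * u ** 2
    return cost


def exp_utility_update(mean, sigma, step_size, temperature, n_samples, rng):
    """One update of the Gaussian mean with fixed standard deviation sigma.

    Assumed form: new_mean = (1 - step_size) * mean + step_size * weighted_avg,
    where sampled sequences are weighted by exp(-cost / temperature).
    """
    samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(n_samples)]
    costs = [rollout_cost(s) for s in samples]
    c_min = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    weighted_avg = [
        sum(w * s[t] for w, s in zip(weights, samples)) / total
        for t in range(len(mean))
    ]
    return [(1.0 - step_size) * m + step_size * a
            for m, a in zip(mean, weighted_avg)]


rng = random.Random(0)
mean = [0.0] * 10
initial_cost = rollout_cost(mean)
for _ in range(20):
    mean = exp_utility_update(mean, sigma=1.0, step_size=0.8,
                              temperature=1.0, n_samples=64, rng=rng)
final_cost = rollout_cost(mean)
print(initial_cost, final_cost)
```

A step size below 1 keeps part of the previous mean, which averages out sampling noise across successive updates; this is one plausible reading of why fewer samples can suffice at smaller step sizes.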
VB3 RealWorld Experiments
In the real-world setting (Fig. 7), the control distribution is a Gaussian with fixed covariance, and we use update (25) with . We used the following experimental configurations: 1920 and 64 samples, each with step sizes of 1 (corresponding to MPPI), 0.8, and 0.6. Overall (Table I), there is a mild degradation in performance when decreasing the step size at 1920 samples, due to the car taking a longer path on the track (Fig. 11(a) vs. Fig. 11(c) in Section FB). Surprisingly, using just 64 samples increases the lap times by only 2 seconds, and the lap times seem unaffected by the step size. This could be because, despite the noisiness of the DMDMPC update, the setpoint controller in the car’s steering servo acts as a filter, smoothing out the control signal and allowing the car to drive on a consistent path (Fig. 13 in Section FB).
[Table I: results by step size for 1920 samples and 64 samples.]
VI Conclusion
We presented a connection between model predictive control and online learning. From this connection, we proposed an algorithm based on dynamic mirror descent that can work for a wide variety of settings and cost functions. We also discussed the choice of loss function within this online learning framework and the sort of preference each loss function imposes. From this general algorithm and assortment of loss functions, we showed that several well-known algorithms are special cases and presented a general update for members of the exponential family.
We empirically validated our algorithm on continuous and discrete simulated problems and on a real-world aggressive driving task. In the process, we also studied the parameter choices within the framework, finding, for example, that a smaller number of rollout samples can be compensated for by varying other parameters such as the step size.
We hope that the online learning and stochastic optimization viewpoints of MPC presented in this paper open up new possibilities for using tools from these domains, such as alternative efficient sampling techniques [5] and accelerated optimization methods [22, 24], to derive new MPC algorithms that perform well in practice.
Acknowledgements
This material is based upon work supported by NSF NRI award 1637758, NSF CAREER award 1750483, an NSF Graduate Research Fellowship under award No. 2015207631, and a National Defense Science & Engineering Graduate Fellowship. We thank Aravind Battaje and Hemanth Sarabu for assisting in AutoRally experiments.
References
[1] Pieter Abbeel, Adam Coates, and Andrew Y. Ng. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13):1608–1639, 2010.
[2] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica, 64(4):698–716, 2012.
[3] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
[4] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[5] Richard Bellman and John Casti. Differential quadrature and long-term integration. Journal of Mathematical Analysis and Applications, 34(2):235–238, 1971.
[6] Zdravko I. Botev, Dirk P. Kroese, Reuven Y. Rubinstein, and Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pages 35–59. Elsevier, 2013.
[7] Eduardo F. Camacho and Carlos Bordons Alba. Model predictive control. Springer Science & Business Media, 2013.
[8] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
[9] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
[10] Paul Drews, Grady Williams, Brian Goldfain, Evangelos A. Theodorou, and James M. Rehg. Vision-Based High-Speed Driving With a Deep Dynamic Observer. IEEE Robotics and Automation Letters, 4(2):1564–1571, 2019.
[11] Tyrone E. Duncan. Linear-exponential-quadratic Gaussian control. IEEE Transactions on Automatic Control, 58(11):2910–2911, 2013.
[12] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
[13] Brian Goldfain, Paul Drews, Changxi You, Matthew Barulic, Orlin Velev, Panagiotis Tsiotras, and James M. Rehg. AutoRally: An Open Platform for Aggressive Autonomous Driving. IEEE Control Systems Magazine, 39(1):26–55, 2019.
[14] Eric Hall and Rebecca Willett. Dynamical models and tracking regret in online convex programming. In International Conference on Machine Learning, pages 579–587, 2013.
[15] Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
[16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
[17] Bjarne E. Helvik and Otto Wittner. Using the Cross-Entropy Method to Guide/Govern Mobile Agent’s Path Finding in Networks. International Workshop on Mobile Agents for Telecommunication Applications, 2164, 2002.
[18] Marin Kobilarov. Cross-entropy randomized motion planning. In Robotics: Science and Systems, volume 7, pages 153–160, 2012.
[19] Shie Mannor, Reuven Y. Rubinstein, and Yohai Gat. The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 512–519, 2003.
[20] David Q. Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967–2986, 2014.
[21] Ishai Menache, Shie Mannor, and Nahum Shimkin. Basis Function Adaptation in Temporal Difference Reinforcement Learning. Annals of Operations Research, 134:215–238, 2005.
[22] Megumi Miyashita, Shiro Yano, and Toshiyuki Kondo. Mirror descent search and its acceleration. Robotics and Autonomous Systems, 106:107–116, 2018.
[23] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[24] Masashi Okada and Tadahiro Taniguchi. Acceleration of Gradient-based Path Integral Method for Efficient Optimal and Inverse Optimal Control. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3013–3020. IEEE, 2018.
[25] Magnus Rattray, David Saad, and Shun-ichi Amari. Natural gradient descent for online learning. Physical Review Letters, 81(24):5461, 1998.
[26] István Szita and András Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.
[27] Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1168–1175. IEEE, 2014.
[28] LJ van den Broek, WAJJ Wiegerinck, and HJ Kappen. Risk sensitive path integral control. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 1–8. AUAI Press, 2010.
[29] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
[30] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440. IEEE, 2016.
[31] Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M. Rehg, Byron Boots, and Evangelos A. Theodorou. Information Theoretic MPC for Model-Based Reinforcement Learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721. IEEE, 2017.
[32] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous Driving. IEEE Transactions on Robotics, 34(6):1603–1622, 2018.
[33] Grady Williams, Brian Goldfain, Paul Drews, Kamil Saigol, James M. Rehg, and Evangelos A. Theodorou. Robust sampling based model predictive control with sparse objective information. In Robotics: Science and Systems, 2018.
[34] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive Online Learning in Dynamic Environments. In Advances in Neural Information Processing Systems, pages 1323–1333, 2018.
Appendix A Shift Operator
We discuss some details in defining the shift operator. Let be the approximate solution to the previous problem and denote the initial condition of in solving (3), and consider sampling and . We set
by defining a shift operator that outputs a new parameter in . This can be chosen to satisfy desired properties, one example being that when conditioned on and , the marginal distributions of are the same for both of and of . A simple example of this property is shown in Fig. 8. Note that also involves a new control that is not in , so the choice of is not unique but algorithm dependent; for example, we can set of to follow the same distribution as (cf. Section IIIB). Because the subproblems in (3) of two consecutive time steps share all control variables except for the first and the last ones, the “shifted” parameter to the current problem should be almost as good as the optimized parameter is to the previous problem. In other words, setting provides a warm start to (3) and amortizes the computational complexity of solving for .
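As a concrete illustration of the warm start described above, the sketch below shifts a planned sequence of control means forward by one time step. It assumes the parameter is simply a list of per-step means; the function name `shift_mean_sequence` and the default of copying the previous final element into the new final slot are illustrative choices, since the text notes that this choice is algorithm dependent.

```python
def shift_mean_sequence(mean_sequence, fill=None):
    """Shift a planned control-mean sequence forward by one time step.

    The first element (already applied to the system) is dropped. The new
    final element is not determined by the old plan; here it defaults to a
    copy of the previous final element, one illustrative choice among many.
    """
    if not mean_sequence:
        raise ValueError("mean_sequence must contain at least one element")
    new_last = mean_sequence[-1] if fill is None else fill
    return list(mean_sequence[1:]) + [new_last]


previous_plan = [0.5, 0.2, -0.1, 0.0]
warm_start = shift_mean_sequence(previous_plan)
print(warm_start)  # [0.2, -0.1, 0.0, 0.0]
```

Because all but one of the entries carry over unchanged, the optimizer for the next subproblem starts from a point that was already close to optimal for the overlapping part of the horizon.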
Appendix B Variations of DMDMPC
The control distributions in DMDMPC can be fairly general (in addition to the categorical and Gaussian distributions that we discussed) and control constraints on the problem (e.g., control limits) can be directly incorporated through proper choices of control distributions, such as the beta distribution, or through mapping the unconstrained control through some squashing function (e.g., or clamp). Though our framework cannot directly handle state constraints as in constrained optimization approaches, a constraint can be relaxed to an indicator function which activates if the constraint is violated. The indicator function can then be added to the cost function in (4) with some weight that encodes how strictly the constraint should be enforced.
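The two mechanisms in this paragraph can be sketched in a few lines. The example below is an assumption-laden illustration, not code from the paper: `squash_tanh` uses the hyperbolic tangent as one possible smooth squashing function, `squash_clamp` clips to the limits, and `penalized_cost` relaxes a hypothetical state constraint |x| <= x_limit into a weighted indicator added to a base cost.

```python
import math


def squash_tanh(u, u_min, u_max):
    """Map an unconstrained control smoothly into [u_min, u_max] using tanh."""
    mid = 0.5 * (u_max + u_min)
    half_range = 0.5 * (u_max - u_min)
    return mid + half_range * math.tanh(u)


def squash_clamp(u, u_min, u_max):
    """Clip an unconstrained control to [u_min, u_max]."""
    return max(u_min, min(u_max, u))


def penalized_cost(base_cost, state, x_limit, weight):
    """Add weight * indicator(|state| > x_limit) to a base cost."""
    violated = abs(state) > x_limit
    return base_cost + (weight if violated else 0.0)


print(squash_clamp(3.0, -1.0, 1.0))                                # 1.0
print(penalized_cost(2.0, state=1.5, x_limit=1.0, weight=100.0))   # 102.0
```

A larger `weight` makes violating rollouts less attractive relative to feasible ones, which is the sense in which the weight encodes how strictly the constraint is enforced.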
Moreover, different integration techniques, such as Gaussian quadrature [5], can be adopted to replace the likelihood-ratio derivative in (9) for computing the required gradient direction. We also note that the independence assumption on the control distribution in (18) is not necessary in our framework; time-correlated control distributions and feedback policies are straightforward to consider in DMDMPC.
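One way to read the remark about integration techniques is that the expectation defining the gradient can be evaluated either by sampling or by a deterministic quadrature rule. The sketch below compares the two for a scalar Gaussian control distribution. The quadratic `cost`, the function names, and the choice of a 3-point Gauss-Hermite rule are assumptions made for illustration only.

```python
import math
import random


def cost(u):
    """Illustrative scalar cost."""
    return u ** 2


def lr_gradient_mc(mean, sigma, n_samples, rng):
    """Monte Carlo likelihood-ratio estimate of d/dmean E[cost(u)],
    with u ~ N(mean, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(mean, sigma)
        score = (u - mean) / sigma ** 2  # d/dmean of log N(u; mean, sigma^2)
        total += cost(u) * score
    return total / n_samples


def lr_gradient_quadrature(mean, sigma):
    """Same expectation evaluated with a 3-point Gauss-Hermite rule.

    Nodes 0, +sqrt(3), -sqrt(3) with weights 2/3, 1/6, 1/6 integrate
    polynomials of degree up to 5 exactly against a standard normal density.
    """
    nodes = (0.0, math.sqrt(3.0), -math.sqrt(3.0))
    weights = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)
    total = 0.0
    for z, w in zip(nodes, weights):
        u = mean + sigma * z
        total += w * cost(u) * (u - mean) / sigma ** 2
    return total


rng = random.Random(0)
grad_mc = lr_gradient_mc(1.0, 0.5, n_samples=100_000, rng=rng)
grad_quad = lr_gradient_quadrature(1.0, 0.5)
print(grad_mc, grad_quad)  # both approximate the exact gradient 2 * mean
```

For this cost, E[u^2] = mean^2 + sigma^2, so the exact gradient with respect to the mean is 2 * mean. The quadrature rule recovers it from three cost evaluations because the integrand is a cubic polynomial, whereas the sampled estimate carries Monte Carlo noise; in higher dimensions or with non-smooth costs the trade-off can go the other way.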
Appendix C Proofs
Proof of creftype 1.
We prove the first statement; the second one follows directly from the duality relationship. The statement follows from the derivations below; we can write
where the last equality is due to the assumption that . Then applying on both sides and using the relationship that , we have . ∎
Appendix D Derivation of LQR and LEQR Losses
The dynamics in Equation (1) are given by
for some matrices and and , where . For a control sequence , noise sequence , and initial state , the resulting state sequence is found through convolution:
or, in matrix form:
where , , and are defined naturally from the convolution equation above. Note that , where . Thus, we also have that
We define the instantaneous and terminal costs as
where and . Thus, the statistic is
where and .
Our control distribution is a Dirac delta distribution located at the given parameter: .
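As a sketch of the convolution referred to above, assume linear dynamics of the standard form with additive noise. The symbols below ($A$, $B$, $u_t$, $w_t$, $x_0$) are illustrative and need not match the notation used elsewhere in this paper. Unrolling the recursion gives:

```latex
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, \\
x_t &= A^{t} x_0 + \sum_{\tau=0}^{t-1} A^{t-1-\tau} \left( B u_\tau + w_\tau \right).
\end{align*}
```

Stacking the states over the horizon turns this into a single linear map of the stacked control and noise sequences plus a term that depends only on $x_0$, which is why a cost that is quadratic in the states remains quadratic in the controls.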
DA LQR
The loss is defined as . Expanding this out gives:
We see this is a quadratic problem in by defining
DB LEQR
The loss is defined as
for some parameter . For compactness, we define and so that the exponent contains . In expanding the loss, we use the following fact:
Fact 2.
For , where , and constants and :
Proof.
We expand the expectation and complete the square:
where , , and . ∎
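For reference, the completing-the-square argument can be written out in a standard form. The statement below is a sketch under assumed notation that need not match the symbols of Fact 2: $w \sim \mathcal{N}(0, \Sigma)$ in $\mathbb{R}^n$, $H$ is a symmetric matrix, $g$ a vector, $c$ a constant, and $\kappa$ a scalar such that $M = \Sigma^{-1} - \kappa H$ is positive definite.

```latex
\begin{align*}
\mathbb{E}\left[ \exp\left( \kappa \left( \tfrac{1}{2} w^\top H w + g^\top w + c \right) \right) \right]
&= \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \int \exp\left( -\tfrac{1}{2} w^\top M w + \kappa\, g^\top w + \kappa c \right) dw \\
% complete the square: -1/2 w^T M w + kappa g^T w
%   = -1/2 (w - kappa M^{-1} g)^T M (w - kappa M^{-1} g) + (kappa^2 / 2) g^T M^{-1} g
&= \frac{\sqrt{(2\pi)^n \det M^{-1}}}{\sqrt{(2\pi)^n \det \Sigma}} \exp\left( \kappa c + \tfrac{\kappa^2}{2}\, g^\top M^{-1} g \right) \\
&= \det\left( I - \kappa \Sigma H \right)^{-1/2} \exp\left( \kappa c + \tfrac{\kappa^2}{2}\, g^\top M^{-1} g \right).
\end{align*}
```

The last step uses $\det \Sigma \cdot \det M = \det(\Sigma M) = \det(I - \kappa \Sigma H)$. When $g$ depends linearly on the controls, the logarithm of this expression is quadratic in the controls, which is the structure the expansion of the loss relies on.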
We now expand the loss:
We see this is a quadratic problem in