# DISCO: Double Likelihood-free Inference Stochastic Control

## Abstract

Accurate simulation of complex physical systems enables the development, testing, and certification of control strategies before they are deployed into the real systems. As simulators become more advanced, the analytical tractability of the differential equations and associated numerical solvers incorporated in the simulations diminishes, making them difficult to analyse. A potential solution is the use of probabilistic inference to assess the uncertainty of the simulation parameters given real observations of the system. Unfortunately the likelihood function required for inference is generally expensive to compute or totally intractable. In this paper we propose to leverage the power of modern simulators and recent techniques in Bayesian statistics for likelihood-free inference to design a control framework that is efficient and robust with respect to the uncertainty over simulation parameters. The posterior distribution over simulation parameters is propagated through a potentially non-analytical model of the system with the unscented transform, and a variant of the information theoretical model predictive control. This approach provides a more efficient way to evaluate trajectory roll outs than Monte Carlo sampling, reducing the online computation burden. Experiments show that the controller proposed attained superior performance and robustness on classical control and robotics tasks when compared to models not accounting for the uncertainty over model parameters.

## I Introduction

Robustness to model miss-specification and noisy sensor measurements is a critical property for control systems operating in complex robotics applications. The development of powerful and more realistic simulators allows practitioners to analyse and verify the performance of the controller against these variables before the controller is deployed to the real robot. In Model Predictive Control (MPC) one seeks to iteratively find the solution of an optimisation problem for a receding finite time-horizon using an approximate model of the system. When the dynamic model is given by complex simulators that incorporate differential equations and numerical solvers there is little hope the equations can be reversed to reason about the parameters of the simulation to best match the real behaviour of the system. Furthermore, the simulator might abstract away the equations and solver from the user. Effectively, it can be interpreted as a generative model that can be sampled from given a set of parameter values, but not inverted. In this paper we pose the question, can we leverage the power of simulators, treated as generative models, to design control strategies that are robust to parameter uncertainty?

On the other hand, the application of MPC to linear systems has been an active research area for many decades with extensive deployments to many practical problems [deoliveiraMultiagentModelPredictive2010, mayneModelPredictiveControl2014, mesbahStochasticModelPredictive2016]. Notably, the most common setting for linear MPC application are tasks that involve trajectory tracking or stabilisation. However, control tasks in reinforcement learning are usually more complex and therefore less suitable to linearisation, motivating the use of non-linear models [rechtTourReinforcementLearning2019]. Another motivation for more complex models is the ability to use of more expressive constraints, even if not directly involved in the physical process, such as economic criteria [bradfordStochasticNonlinearModel2017]. Despite its vast application in the linear case, the use of MPC in non-linear systems continues to be an increasingly active area of research in control theory [mayneModelPredictiveControl2014, mesbahStochasticModelPredictive2016].

Recent work in the field has led to controllers that are able to incorporate non-linear dynamics without relying on linear or quadratic approximations [williamsInformationTheoreticMPC2017, williamsInformationTheoreticModelPredictive2018]. However, most MPC controllers still do not consider uncertainty in the parameters of their internal simulator for future trajectories. In addition, estimating parameters for the system’s model usually requires large amounts of data from the real system, which can be infeasible for some applications. Yet, whenever the stochastic system uncertainties can be adequately modelled, it is more natural to explicitly take them into account in the control design method itself. In Stochastic MPC, the uncertainty on the internal system dynamics is intrinsic to the optimal control problem solved at every time step. This allows the controller to trade-off performance and satisfaction of the constraints by regulating the joint probability distribution of the system states and outputs [mesbahStochasticModelPredictive2016].

In this paper we make the following contributions: we develop a Stochastic Non-linear MPC variant which leverages recent advancements in likelihood-free inference to estimate both the uncertainty on the simulator parameters as well as to propagate it throughout the estimated trajectories. We call our method double likelihood-free inference stochastic control, DISCO. The posterior distribution for the parameters of the simulator is estimated by combining simulated data from generative models and observations from the physical system. Using this posterior distribution allows us to take into account the uncertainty about the system’s dynamics in the decision making process during the control task. We proceed to show that the Unscented Transform (UT) [julierUnscentedFilteringNonlinear2004] provides a computationally efficient alternative when compared to traditional Monte Carlo approaches to propagate the uncertainty from the parameter space to the forward modelling of the trajectory roll outs. In short, DISCO can be seen as a variant of the Information Theoretical MPC (IT-MPC) control algorithm [williamsInformationTheoreticModelPredictive2018] that considers the uncertainty in the system’s parameters in its internal trajectory simulations.

## Ii Related work

The use of MPC in the control of linear systems is very mature and has been widely studied and applied to real systems. However, as seen in [mayneModelPredictiveControl2014], Non-linear MPC (NMPC) is still an open-research question, especially for systems were uncertainty over parameters and constraints on controls and state-space are considered. The most common methods for controlling general nonlinear systems are based on Non-linear Programming [houskaACADOToolkitOpensource2011] and Differential Dynamic Programming (DDP) [erezIntegratedSystemRealtime2013]. Both rely on approximations of dynamics and cost functions so that the online optimisation problem becomes tractable. However, these mainstream gradient-based MPC approaches have some shortcomings. In the DDP method, the cost function must be smooth and it is notoriously difficult to include state constraints. Whereas with nonlinear programming constraints may be easily accounted for, but a common issue is what to do when no feasible solution is found.

In [mesbahStochasticModelPredictive2016], a family of Stochastic NMPC (SNMPC) methods are discussed. In Tube-based NMPC the objective of the control policy is to ensure that the forward trajectories will remain inside a desirable tube centred around a given trajectory, however the boundary tube has to be computed offline [rakovicParameterizedTubeModel2012]. A multi-stage NMPC approach has been suggested in which the uncertainty is modelled by a scenario tree approach from stochastic programming. However, the procedure quickly becomes intractable, since the size of the optimisation problem scales exponentially with the time horizon, number of uncertainties and uncertainty levels [thangavelRobustNonlinearModel2017].

Although many of the methods above focus on robustness, they do not incorporate uncertainty over the parameters of the transition function. In [bradfordStochasticNonlinearModel2017], this is accounted for by using a SNMPC with an Unscented Kalman Filter to propagate the uncertainty over the state-space. However, this method requires an optimisation with chance constraints to be solved online and, to keep the problem feasible, the variance of the trajectories has to be artificially constrained. The most similar approach is perhaps presented in [arrudaUncertaintyAversePushing2017], where the IT-MPC formulation is used in conjunction with a Ensemble of Mixture Density Networks (E-MDN) to approximate the joint probability distribution of states and actions. This is similar to our approach, however as the E-MDN tries to approximate the joint distribution of states and actions, it needs to be retrained entirely on new environments.

In contrast, the variant of IT-MPC proposed in this paper uses the UT to propagate the uncertainty over model parameters. This reduces the dimensionality of the inference problem and results in a controller more adept to generalise to unseen situations. Moreover, unlike the stochastic optimisation strategies, our framework is very amenable to the inclusion of constraints, as the control update law is based on sampled trajectories. As shown in [williamsInformationTheoreticModelPredictive2018] constraints may be applied directly to the control actions. On the other hand, we can apply soft constraints to the state space through the cost function. This is easily achieved as there is no need for the cost function to be differentiable and assures that a feasible solution will exist.

Additionally, DISCO takes advantage of the BayesSim Likelihood-free Inference (LFI) framework presented in [ramosBayesSimAdaptiveDomain2019] to update the model uncertainty periodically. Hence, given a set of true observations after a specified episode length, we can update our knowledge of the posterior probability density of parameters . This way our model can adapt to variations in the environment, e.g. adjust friction coefficients in case of rain, or intrinsic to the transition function, e.g. change of weight distribution. In contrast to other inference methods, such as Variational Inference or Markov Chain Monte Carlo, where a likelihood function is needed, in LFI we compute an approximated parametric distribution of the true posterior. Furthermore, BayesSim was shown to be more data efficient than other LFI methods, such as Approximate Bayesian Computation [ramosBayesSimAdaptiveDomain2019].

## Iii Preliminaries

We consider the problem of controlling a discrete-time stochastic system described by a non-linear set of differential equations of the form:

(1) |

where is the transition function, denotes the system states, and is the control input at a given time . We assume a finite time-horizon , and that the control frequency is given. Note that there is no direct control over the variable , but we are able to control its mean . This assumption considers not only a multiplicative noise model which is common in robotics, where lower-level actuator controllers are usually present, but also an amount of exploration in our control actions. As such, in practice, is a hyper-parameter of our control system that may need to be artificially increased.

More generally, we are interested in the problem where the real transition function is approximated by a parameterised non-linear forward model , represented as for compactness. Equation (1) may then be rewritten as:

(2) |

### Iii-a Information Theoretical MPC (IT-MPC)

Following the steps in [williamsInformationTheoreticModelPredictive2018], we can define a fixed length input sequence over a fixed control horizon , onto which we apply a Receding Horizon Control strategy. This yields , which is itself a random variable. Furthermore, let’s denote as the joint probability distribution and the corresponding probability density function (pdf) of the uncontrolled system (i.e. . Likewise, is the joint distribution and the corresponding pdf for an open-loop control sequence. The optimal control problem may then be be defined as:

(3) |

where is the set of admissible controls, is a terminal cost function, and is a running cost function of the form:

(4) |

where is known as the inverse temperature and the affine term allows the location of the minimum control (rest position) to be different than zero. Noting that the state cost may be considered independent from the control terms, we can define . Moreover, we define a mapping operator, , from input sequences to their resulting trajectory by recursively applying given , . This leads to the following state cost function:

(5) |

Finally, IT-MPC relies on the free-energy principle to compute a lower bound for the optimal control problem and defines the form of the optimal distribution function for which this bound is tight and achieves the optimal control . It can be shown that such distribution is of the form

(6) | ||||

(7) |

where the base distribution has been augmented with the cost of the state trajectory. This results in . Therefore, the optimal open-loop control sequence is the expected value of control trajectories sampled from the optimal distribution. As we cannot sample directly from , we can resort to importance sampling [andrieuIntroductionMCMCMachine2003] to construct an unbiased estimator of the optimal distribution, given the current control distribution, namely

(8) |

where is the importance sampling weight. Therefore, we can switch the expectation to , resulting in . We can then use the definition of the optimal distribution w.r.t. the base measure distribution given in [williamsInformationTheoreticModelPredictive2018] to derive the optimal information-theoretic control law:

(9) |

where

and is the difference between the current control action and the minimum control (adjusted by and usually zero). Note that in practice, for numerical stability, we multiple the numerator and denominator of by a factor , where is defined as the minimum cost.

### Iii-B Likelihood-free parameter estimation

Recent advances in LFI allowed the use of probabilistic inference to learn distributions over simulation parameters [ramosBayesSimAdaptiveDomain2019]. The main idea is that of approximating an intractable posterior using data generated from a generative forward model (or simulator) where trajectories are collected for different simulation configurations. Therefore, one can directly learn a conditional density where parameters are learned through an optimisation procedure. The learned model usually takes the form of a mixture of Gaussians where inputs are summary statistics obtained from trajectories and outputs are the parameters of the mixture.

The goal is to maximise the likelihood . It has been shown in previous work [ramosBayesSimAdaptiveDomain2019] that will be proportional to if the log-likelihood is optimised as follows:

(10) |

Consequently, a posterior estimate can be obtained by:

(11) |

The conditional density is a mixture of Gaussians,

(12) |

where are mixing functions, are mean functions and are covariance functions.

## Iv Disco

At its core, model-based control relies on an approximated transition function to optimise the control actions over the control horizon. In practice, this transition function is usually defined a priori using fundamental physical principles and domain knowledge, or empirically by applying system identification techniques [simchowitzLearningMixingSharp2018] or learning methods from data [schaalLearningDemonstration1997, abbeelAutonomousHelicopterAerobatics2010, simchowitzLearningMixingSharp2018]. Typically, these methods provide deterministic transition functions that do not incorporate model uncertainty and are invariant over time. As discussed in [williamsInformationTheoreticMPC2017], the closed-loop RHC offers a degree of robustness to model uncertainties, but the compounding error of poor predictions along the control horizon will reduce the stability margins of the system. Using the methods outlined in section III, in this paper we propose a framework to apply the IT-MPC stochastic control formulation to problems where the parameters of the transition function are unknown, but belong to a problem dependent prior, . Furthermore, we make use of BayesSim [ramosBayesSimAdaptiveDomain2019] to refine our knowledge of the parameters as we interact with the environment and gather new observations. The intuition behind this approach is that, by refining our knowledge of the parameters of an otherwise well-defined transition function, we will capitalise not only on the application domain knowledge, but also on the adaptability of inference-free learning methods. Since represents a plausible range of unknown physical parameters, e.g. mass or friction coefficient, it is straightforward to incorporate domain knowledge to this formulation. Alternatively, an improper uninformative prior may be used when no assumptions are given. On the other hand, by updating our knowledge of given observed data, we are more likely to cope with problems such as covariate shift [ganegedaraOnlineAdaptationDeep2016] and reality gap [ramosBayesSimAdaptiveDomain2019, chebotarClosingSimtoRealLoop2018]. The complete method is presented in algorithm 1.

### Iv-a Problem setup

Given a forward model with parameters and distributed according to , trajectories can be obtained from it by first sampling and generating roll outs by propagating the state-action pairs through the transition function. Although the parameters are stochastic, we assume they are invariant throughout the control horizon for a given trajectory. This is a reasonable assumption as the latent parameters are usually stable physical quantities and the update frequency of the control loop is significantly faster than their time constants. In this situation, the optimal distribution given in (6) becomes:

(13) |

where we overload the notation to emphasise the dependence of on the now stochastic . However, as and are independent, we can drop the conditioning in . As a result, our control law can be expressed as:

(14) |

where shows the dependence on and we applied the law of total expectation to get the resulting equivalence. This means our update rule now has to sample jointly from the distributions of and .

### Iv-B Propagation of uncertainty over the state-space dynamics

If we sample sufficiently from , we are able to reconstruct the joint distribution and compute our control updates. However, we note that the increased dimensionality of the sample space requires the number of samples to grow combinatorially. As such, we resort to the unscented transform [julierUnscentedFilteringNonlinear2004] as more efficient approach to propagate the uncertainty of throughout the state-space.

In [julierUnscentedFilteringNonlinear2004], the authors demonstrate how UT is able to reconstruct an approximate of the random variable resulting when an original random variable is applied to a non-linear function . The premise behind this approach is that it should be easier to approximate a probability distribution than a arbitrary non-linear transformation. The idea is to select a set of sigma points able to capture the most important statistical properties of the prior random variable . The necessary statistical information captured by the UT is the first and second order moments of . The number of sigma-points needed to do this , where is the dimension of . In [vandermerweSigmapointKalmanFilters2004], it is shown that matching the moments of up to the th order implies matching the moments of to the same order. By using a larger number of sigma-points, skew and kurtosis can also be captured [julierScaledUnscentedTransformation2002].

In DISCO, we refer to the formulation presented in [vandermerweSigmapointKalmanFilters2004] to compute sigma-points over the distribution of parameters. The expressions to compute the sigma-points and weights for the mean and covariance are presented below:

(15) |

where is the primary scaling factor, a secondary scaling (usually 0), determines the spread of the sigma points around , and is a scalar to provide an extra degree freedom. The reader is encouraged to refer to [vandermerweSigmapointKalmanFilters2004] for details on hyperparameter selection. The sigma-points are then applied recursively to the transition function to compute the cost for . In practice, the sigma points on the state space are given by . To ensure the trajectory cost of each sigma-point can be summarised using the UT, it is necessary to apply the same action sequence sampled i.i.d. to all points. Effectively, this means we need to replicate times each action sequence during our update step. Finally, the mean trajectory cost is given by

(16) |

and used in (9) for the control law update.

### Iv-C Updating the parameter prior distribution

At each time step we are computing a new control action , applying it to our environment and collecting new observations . The pairs of represent a trajectory , up to a specified time-length. This serves as input for the estimate of the posterior probability of . Once sufficient data has been aggregated in , we can use the method presented in [ramosBayesSimAdaptiveDomain2019] to refine the posterior estimate.

Finally, the unscented transform requires as an input a mean vector and covariance matrix for the parameters. Therefore, these have to be retrieved from , or, alternatively, the highest weighted Gaussian may be selected if it is above a specified threshold.

## V Experimental results

### V-a Inverted pendulum swing-up task

In this task, the controller has to swing and hold a pendulum upright using a torque command applied directly to the joint of a rigid-arm. We used the simulator in [brockmanOpenAIGym2016], and always set the pendulum initial state to the downright position and at rest. The state cost function used was , and the terminal cost function was set to zero. The inverse control temperature was set at 10 and the control authority at 1. We have also defined the number of sampled trajectories and the control horizon . For more details on the experiment parameters, please refer to the appendix.

The results presented in Fig. 1 are the accumulated cost over time for 50 iterations, for a baseline case, DISCO, and Monte Carlo sampling. Note that the oscillatory behaviour of the cost function is expected, as the controller has insufficient authority to swing the pendulum without the swinging action to increase the momentum.

All models used the same hyperparameters described above, with the exception of the for the case of MC sampling. Given that we want to compare the performance of the controller when using UT against MC, for the case where MC is used, we increment the amount of trajectories sampled by the number of sigma points used by the UT. Effectively, for the MC controller we have trajectories.

The unknown parameters in this example were the length of the arm and the mass of the pendulum. As a prior, we assumed an uniform distribution between and for both parameters. The posterior distribution was given by a mixture of Gaussians with components, trained using a reference control policy. Note that in our simulations, both models shared the same posterior distribution estimate. Once trained and conditioned on the observed data, the resulting mixture had a mean estimate for the length of meter and for mass kilo. The covariance matrix was diagonal, and the variance was for the length estimate and for mass. One of the components of the mixture was dominant with a weight of and was used as reference for the UT.

DISCO with UT outperforms MC sampling both with an uninformative prior and inferred posterior. Noticeable also, the performance of UT with the posterior distribution is better than the baseline model. This is explainable by the fact that the parameter randomisation introduced by the sigma-points provides more information in the trajectory evaluation. This way, trajectories that are borderline to a higher cost state captured by one of the sigma points get penalised. Effectively, UT works like an automatic calibration of the control temperature, when the prior is broad, many trajectories are considered in the control update average. Conversely, when the posterior gets refined, the controller is more confident to select fewer trajectories.

### V-B Skid-steer robot

This section presents experimental results with a physical robot equipped with a skid-steering drive mechanism (Fig. 2). We modelled the kinematics of the robot based on a modified unicycle model, which accounts for skidding via an additional parameter [kozlowskiModelingControl4wheel2004]. The parameters to be estimated via BayesSim are the robot’s wheel radius , axial distance , i.e. the distance between the wheels, and the displacement of the robot’s instant centre of rotation (ICR) from the robot’s centre . A non-zero value on the latter affects turning by sliding the robot sideways. To estimate the parameters, the robot was driven manually around a circle and had its trajectory data recorded. From the trajectory data we computed cross-correlation summary statistics as , which capture the centre of trajectory and the average linear velocity. In simulation, the wheel speed commands sent to the robot were repeated for different parameter settings sampled from a uniform prior, , , .

Fig. b presents the resulting marginal estimates from BayesSim for each parameter of the robot’s kinematic model. For comparisons, physical measurements indicate a of around 0.06 m and of around 0.31 m. Measuring , however, involves a laborious process, which would require different weight measurements or many trajectories from the physical hardware [yiKinematicModelingAnalysis2009]. As we are only applying a relatively simple kinematic model of the robot to explain the real trajectories, the effects of the dynamics and ground-wheel interactions are not accounted for. As a result we see that BayesSim automatically estimates some parameters, such as the axial distance to try to compensate for the miss-specifications, which helps in the control task.

The control task was defined as following a circular path at a constant tangential speed. Costs were set to make the robot follow a circle of 0.75 m radius with , where represents the robot’s distance to the edge of the circle and m/s is a reference linear speed. We performed experiments sampling from the uniform prior over the parameters , sampling from the posterior , and using only a point estimate set to , and , which was adjusted offline to reduce simulation error. For clarity, instead of the noisy raw costs, we present the mean instant cost, i.e. and the executed trajectories in Fig. 3. Again, for more details on the experiment parameters, please refer to the appendix.

We see that considering parameter uncertainty via DISCO provides significant performance improvements over the baseline MPPI algorithm running with a point estimate. Although, in term of costs, both the prior and posterior estimates offer similar performance, we see the advantages of using the parameter posterior estimates in the trajectories plot, where we see overshooting happening on some portions of the path. The latter can be explained by the prior allowing kinematic parameters candidates that are too far from the true values. Lastly, a noisier speed control explains the gap between the baseline MPPI and the DISCO methods, despite the similar performances in terms of path tracking.

## Vi Conclusion

This paper is a first step towards incorporating model uncertainty and sophisticated Bayesian inference methods to stochastic model based control. We showed how uncertainty over parameters may be formally incorporated into an stochastic non-linear MPC controller and evaluated methods of propagating the uncertainty into trajectory roll outs. This extension to information theoretical MPC provides the building blocks of an adaptive controller framework, more resilient to issues arising from reality gap and covariate shift. As shown in the robotic experiments, incorporating uncertainty may lead to a more accurate assessment of the environment and increase the performance.

The unscented transform proved an efficient way to propagate uncertainty, reducing the burden of sampled trajectories. When combined with the ability to impose hard-penalties on the state cost, the result is similar to a chance constraint, where the resulting trajectories from sigma points that violate the soft constraints are heavily penalised. It is worth noticing, that this deterministic method of estimating the moments of the parameter distribution allow the task of sampling actions to be parallelised asynchronously and aggregated when computing the final cost. In future work we intend to explore further possibilities of uncertainty propagation in a principled way.

More importantly, we showed how likelihood-free inference is a powerful tool to refine the estimation of the posterior distribution. As the inference is based on the same transition function of the controller, it may compensate overly simplified models of the environment. Therefore, we want to explore pathways to efficiently retrain this estimate online so practical experiments with time-variant parameters may be conducted. This is a crucial step towards generalisation of control policies for autonomous robots operating under varying environments and configurations. Crucially, by combining parameter estimation and gradient-free control methods, this method may also be used with black-box simulators, such as data-driven function approximators, as long as we are able to sample efficiently from them. This is a promising direction for addressing complex robotic tasks.

[Parameters used in the experimental results] A comprehensive list of parameters used in the experimental section are listed on Table I and Table II. For both experiments, the unscented transform secondary scaling () and minimum control () were set to zero. Note that, as the random seeds were not controlled, slight variations are expected when reproducing the results. Similarly, the update of the posterior distribution approximation, , will depend on and therefore will vary in every execution.

Parameter | Inverted Pendulum |
---|---|

Sampled actions () | |

Control horizon () | |

Inverse temperature () | |

Control authority () | |

Instant state cost () | |

Terminal state cost () | |

Sigma points () | |

UT Spread () | |

UT scalar () | |

Prior distribution () | |

- over pole length | |

- over pole mass | |

Posterior distribution () | |

- over pole length | |

- over pole mass |

Parameter | Skid-steer Robot |
---|---|

Sampled actions () | |

Control horizon () | |

Inverse temperature () | |

Control authority () | |

Instant state cost () | |

Terminal state cost () | |

Sigma points () | |

UT Spread () | |

UT scalar () | |

Prior distribution () | |

- over | |

- over | |

- over | |

Posterior distribution () | |

- | |

- | |

- |