Energy-Based Continuous Inverse Optimal Control

Yifei Xu, Jianwen Xie, Tianyang Zhao, Chris Baker, Yibiao Zhao, Ying Nian Wu
University of California, Los Angeles, USA   Hikvision Research Institute, Santa Clara, USA   Isee Inc., Cambridge, USA
Abstract

The problem of continuous optimal control (over a finite time horizon) is to minimize a given cost function over the sequence of continuous control variables. The problem of continuous inverse optimal control is to learn the unknown cost function from expert demonstrations. In this article, we study this fundamental problem in the framework of the energy-based model, where the observed expert trajectories are assumed to be random samples from a probability density function defined as the exponential of the negative cost function up to a normalizing constant. The parameters of the cost function are learned by maximum likelihood via an “analysis by synthesis” scheme, which iterates the following two steps: (1) Synthesis step: sample the synthesized trajectories from the current probability density using the Langevin dynamics. (2) Analysis step: update the model parameters based on the difference between the synthesized trajectories and the observed trajectories. Given the fact that an efficient optimization algorithm is usually available for an optimal control problem, we also consider a variation of the above learning method, where we modify the synthesis step (1) into an optimization step while keeping the analysis step (2) unchanged. Specifically, instead of sampling each synthesized trajectory from the current probability density, we minimize the current cost function over the sequence of control variables using the existing optimization algorithm. We give justifications for this optimization-based method. We demonstrate the proposed energy-based continuous inverse optimal control methods on autonomous driving tasks, and show that the proposed methods can learn suitable cost functions for optimal control.

1 Introduction

The problem of continuous optimal control has been extensively studied. In this paper, we study the control problem with a finite time horizon, where the trajectory is over a finite period of time. In particular, we focus on the problem of autonomous driving as a concrete example. In continuous optimal control, the control variables or actions are continuous. The dynamics is known. The cost function is defined on the trajectory and is usually in the form of the sum of stepwise costs and the cost of the final state. We call such a cost function Markovian. The cost function is assumed to be given. Continuous optimal control seeks to minimize the cost function over the sequence of continuous control variables or actions, and many efficient algorithms have been developed for various optimal control problems [Todorov2006]. For instance, in autonomous driving, the iLQR (iterative linear quadratic regulator) algorithm is a commonly used optimization algorithm [Li and Todorov2004] [Bemporad et al.2002]. We call such an algorithm the built-in optimization algorithm for the corresponding control problem.

In applications such as autonomous driving, the dynamics is well defined by the underlying physics and mechanics. However, it is a much harder problem to design or specify the cost function. One solution to this problem is to learn the cost function from expert demonstrations by observing their sequences of actions. Learning the cost function in this way is called the continuous inverse optimal control problem.

In this article, we study the fundamental problem of continuous inverse optimal control in the framework of the energy-based model. Originating from statistical physics, an energy-based model is a probability distribution whose probability density function is the exponential of the negative energy function up to a normalizing constant. Instances with low energies are assumed to be more likely according to the model. For continuous inverse optimal control, the cost function plays the role of the energy function, and the observed expert sequences are assumed to be random samples from the energy-based model, so that sequences with low costs are more likely to be observed.

To learn the cost function, we can assume that the cost function is a linear combination of a set of hand-designed features, so that we only need to learn the coefficients of the linear combination. We can also assume that the cost function has a more general parametrization, such as a neural network defined on the trajectory, where the parameters are the weight and bias terms of the neural network. The goal is to learn the parameters of the cost function from the expert sequences.

The parameters can be learned by the maximum likelihood method in the context of the energy-based model. The maximum likelihood learning algorithm follows an “analysis by synthesis” scheme, which iterates the following two steps: (1) Synthesis step: sample the synthesized trajectories from the current probability distribution using the Langevin dynamics [Neal and others2011]. The gradient computation in the Langevin dynamics can be conveniently and efficiently carried out by back-propagation through time. (2) Analysis step: update the model parameters based on the difference between the synthesized trajectories and the observed trajectories.

Our experiments show that we can disable the noise term in the Langevin dynamics to make it a gradient descent algorithm, and the above learning algorithm can still learn a cost function that enables optimal control. This amounts to changing the synthesis step into a gradient descent minimization step. Moreover, for an optimal control problem where the cost function is of the Markovian form, a built-in optimization algorithm is usually already available, such as the iLQR algorithm for autonomous driving. In this case, we consider a variation of the above learning method, where we change the synthesis step (1) into an optimization step while keeping the analysis step (2) unchanged. Specifically, in step (1), instead of sampling each synthesized trajectory from the current model, we minimize the current cost function over the sequence of control variables using the built-in optimization algorithm. This method is convenient because it simply uses the built-in optimal control algorithm as the inner loop of the learning algorithm, and after learning, we continue to use the same optimization algorithm for testing. We give two justifications for this optimization-based method. (1) The learning method leads to a moment matching estimator, where the analysis step matches the optimized trajectories to the observed trajectories in terms of statistical properties defined by the cost function. (2) The learning method solves a minimax adversarial game, in which the optimization step and the analysis step play against each other: the optimization step finds the modes of the cost function, while the analysis step shifts the modes toward the observed sequences.

We demonstrate the proposed energy-based continuous inverse optimal control methods on autonomous driving and show that the proposed methods can learn suitable cost functions for optimal control.

2 Contributions and related work

The contributions of our work are as follows. (1) We propose an energy-based method for continuous inverse optimal control based on Langevin sampling. (2) We also propose an optimization-based method where the inner loop of the learning algorithm is based on gradient descent or an existing built-in optimal control algorithm. (3) We evaluate the proposed methods on autonomous driving tasks and demonstrate the usefulness of the proposed methods.

The following are research themes related to our work.

(1) Maximum entropy framework. Our work follows the maximum entropy framework of [Ziebart et al.2008] for learning the cost function. Such a framework has also been used previously for generative modeling of images [Zhu, Wu, and Mumford1998] and Markov logic networks [Richardson and Domingos2006]. In this framework, the energy function is a linear combination of hand-designed features. Recently, [Wulfmeier, Ondruska, and Posner2015] generalized this framework to a deep version. In these methods, the state spaces are discrete, where dynamic programming schemes can be employed to calculate the normalizing constant of the energy-based model. In our work, the state space is continuous, and we use Langevin dynamics to sample trajectories from the learned model. We also propose an optimization-based method where we use gradient descent or a built-in optimal control algorithm as the inner loop for learning.

(2) ConvNet energy-based models. Recently, [Xie et al.2015] and [Xie, Zhu, and Wu2017] applied energy-based models to various generative modeling tasks, where the energy functions are parameterized by ConvNets [LeCun, Bengio, and others1995] [Krizhevsky, Sutskever, and Hinton2012], and the sampling is based on Langevin dynamics. In our paper, we apply such models to inverse optimal control, and we also propose an optimization-based learning method and give justifications for it.

(3) Inverse reinforcement learning. Most of the inverse reinforcement learning methods [Finn, Levine, and Abbeel2016, Finn et al.2016], including adversarial learning methods [Goodfellow et al.2014], [Ho and Ermon2016], [Li, Song, and Ermon2017], [Finn et al.2016], involve learning a policy in addition to the cost function. In our work, we do not learn any policy. We only learn a cost function, where the trajectories are either sampled by the Langevin dynamics or obtained by gradient descent or a built-in optimal control algorithm.

(4) Continuous inverse optimal control (CIOC). The CIOC problem has been studied by [Monfort, Liu, and Ziebart2015] and [Levine and Koltun2012]. In [Monfort, Liu, and Ziebart2015], the dynamics is linear and the cost function is quadratic, so that the normalizing constant can be computed by a dynamic programming scheme. In [Levine and Koltun2012], the Laplace approximation is used. However, the accuracy of the Laplace approximation is questionable for complex cost functions.

(5) Trajectory prediction. A recent body of research has been devoted to supervised learning for trajectory prediction [Alahi et al.2016], [Gupta et al.2018], [Vemula, Muelling, and Oh2018], [Lee et al.2017], [Deo, Rangesh, and Trivedi2018], [Zhao et al.2019]. These methods directly predict the coordinates and they do not consider control and dynamics models. As a result, they cannot be used for inverse optimal control.

3 Energy-based inverse control

3.1 Optimal control

We study the finite horizon control problem in discrete time $t = 1, \dots, T$. Let $x_t$ be the state at time $t$, and let $x = (x_t, t = 1, \dots, T)$. Let $u_t$ be the continuous control variable or action at time $t$, and let $u = (u_t, t = 1, \dots, T)$. The dynamics is assumed to be deterministic, $x_t = f(x_{t-1}, u_t)$, where $f$ is given, so that $u$ determines $x$. The trajectory is $(x, u)$. Let $e$ be the environment condition. We assume that the recent history $h$ is known. The cost function is $C_\theta(x, u, e, h)$, where $\theta$ consists of the parameters that define the cost function. The problem of optimal control is to find $u$ to minimize $C_\theta(x, u, e, h)$ with given $e$ and $h$ under the known dynamics $f$. The problem of inverse optimal control is to learn $\theta$ from expert demonstrations $D = \{(x_i, u_i, e_i, h_i), i = 1, \dots, n\}$.

A special case of the cost function is of the linear form

$C_\theta(x, u, e, h) = \langle \theta, \phi(x, u, e, h) \rangle$,    (1)

where $\phi(x, u, e, h)$ is a vector of hand-designed features, and $\theta$ is a vector of weights for these features. We can also parametrize $C_\theta$ by a neural network, where $\theta$ collects all the weight and bias parameters of the network.

For autonomous driving, the state consists of the coordinates, heading angle and velocity of the car, the control variables consist of the steering angle and acceleration, and the environment consists of the road condition, speed limit, curvature of the lane, as well as the coordinates of other vehicles. For the linear cost function (1), the vector of features is hand-crafted, including the distance to the goal (which is set to be a future point along the lane), the penalty for collision with other vehicles, the distance to the center of the lane, the heading angle difference from the lane, the difference from the speed limit, the L2-norm of the change in acceleration/steering between two consecutive frames, and the L2-norm of the acceleration/steering. The cost function can also be parametrized by a neural network.
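To make the linear cost (1) concrete, the following sketch computes a small subset of the listed features and the resulting cost. The state/control layout, the specific features shown, and all function names are illustrative assumptions rather than the exact implementation used in the paper.

import numpy as np

def example_features(states, controls, goal):
    # Hypothetical illustration of a few of the hand-designed features.
    # states:   (T, 4) array, assumed layout [x, y, heading, velocity]
    # controls: (T, 2) array, assumed layout [steering, acceleration]
    # goal:     (2,) future point along the lane
    dist_to_goal = np.linalg.norm(states[-1, :2] - goal)     # distance to goal at the final state
    ctrl_change = np.sum(np.diff(controls, axis=0) ** 2)     # change in steering/acceleration between frames
    ctrl_norm = np.sum(controls ** 2)                        # magnitude of the controls themselves
    return np.array([dist_to_goal, ctrl_change, ctrl_norm])  # the real feature vector is larger

def linear_cost(theta, phi):
    # Linear cost of Eq. (1): C_theta = <theta, phi>.
    return float(np.dot(theta, phi))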

3.2 Energy-based probabilistic model

The energy-based model assumes the following conditional probability density function

$p_\theta(u \mid e, h) = \frac{1}{Z_\theta(e, h)} \exp\left[-C_\theta(x, u, e, h)\right]$,    (2)

where $Z_\theta(e, h) = \int \exp[-C_\theta(x, u, e, h)] \, du$ is the normalizing constant. Recall that $x$ is determined by $u$ according to the deterministic dynamics, so that we only need to define the probability density on $u$. The cost function $C_\theta$ serves as the energy function. For expert demonstrations $D$, the $u_i$ are assumed to be random samples from $p_\theta(u \mid e_i, h_i)$, so that $u_i$ tends to have low cost $C_\theta(x_i, u_i, e_i, h_i)$.

3.3 Sampling-based inverse control

We can learn the parameters $\theta$ by maximum likelihood. The log-likelihood is

$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(u_i \mid e_i, h_i)$.    (3)

We can maximize $L(\theta)$ by gradient ascent, and the learning gradient is

$L'(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial}{\partial \theta} C_\theta(x_i, u_i, e_i, h_i) - \mathrm{E}_{p_\theta(u \mid e_i, h_i)}\left( \frac{\partial}{\partial \theta} C_\theta(x, u, e_i, h_i) \right) \right]$,    (4), (5)

which follows from the fact that

$\frac{\partial}{\partial \theta} \log Z_\theta(e, h) = -\mathrm{E}_{p_\theta(u \mid e, h)}\left( \frac{\partial}{\partial \theta} C_\theta(x, u, e, h) \right)$.    (6)

In order to approximate the above expectation, we can generate a random sample $\tilde{u}_i \sim p_\theta(u \mid e_i, h_i)$, which generates the sampled trajectory $\tilde{x}_i$ by unfolding the dynamics. We then estimate $L'(\theta)$ by

$\hat{L}'(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial}{\partial \theta} C_\theta(x_i, u_i, e_i, h_i) - \frac{\partial}{\partial \theta} C_\theta(\tilde{x}_i, \tilde{u}_i, e_i, h_i) \right]$,    (7), (8)

which is a stochastic unbiased estimator of $L'(\theta)$. Then we can run the stochastic gradient ascent algorithm $\theta_{t+1} = \theta_t + \gamma_t \hat{L}'(\theta_t)$ to obtain the maximum likelihood estimate of $\theta$, where $t$ indexes the time step of the learning algorithm and $\gamma_t$ is the step size. According to the Robbins-Monro theory of stochastic approximation [Robbins and Monro1951], if $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$, the algorithm will converge to a solution of $L'(\theta) = 0$. For each $i$, we can also generate multiple copies of $\tilde{u}_i$ from $p_\theta(u \mid e_i, h_i)$ and average them to approximate the expectation in (4), but one copy is sufficient because the averaging effect takes place over time.

In the linear case we have $\frac{\partial}{\partial \theta} C_\theta(x, u, e, h) = \phi(x, u, e, h)$, thus

$\hat{L}'(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \phi(x_i, u_i, e_i, h_i) - \phi(\tilde{x}_i, \tilde{u}_i, e_i, h_i) \right]$.    (9)
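In the linear case, the analysis step therefore reduces to a feature (moment) difference. Below is a minimal sketch of this update, assuming one synthesized (or optimized) trajectory per observed one; array shapes and names are illustrative.

import numpy as np

def linear_learning_gradient(phi_observed, phi_synthesized):
    # Monte Carlo estimate of L'(theta) in the linear case, Eq. (9):
    # the average feature difference between synthesized and observed trajectories.
    # phi_observed, phi_synthesized: (n, K) feature arrays.
    return np.mean(phi_synthesized - phi_observed, axis=0)

# One step of stochastic gradient ascent on the log-likelihood:
# theta = theta + step_size * linear_learning_gradient(phi_obs, phi_syn)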

The synthesis step that samples $\tilde{u}_i \sim p_\theta(u \mid e_i, h_i)$ can be accomplished by Langevin dynamics, which iterates the following step:

$u_{s+1} = u_s - \frac{\delta^2}{2} \frac{\partial}{\partial u} C_\theta(x, u_s, e, h) + \delta z_s$,    (10)

where $s$ indexes the time step of the Langevin dynamics, $\frac{\partial}{\partial u} C_\theta$ is the derivative of the cost with respect to $u$, and it can be computed conveniently and efficiently by back-propagation through time. $\delta$ is the step size, and $z_s \sim \mathrm{N}(0, I)$ independently over $s$, where $I$ is the identity matrix of the same dimension as $u$. The Langevin dynamics is an inner loop of the learning algorithm.
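A minimal sketch of the Langevin synthesis step (10) is given below, using automatic differentiation to obtain the gradient of the cost with respect to the control sequence (back-propagation through time when the cost unrolls the dynamics internally). The function cost_fn and all shapes are assumptions; setting noise=False recovers the gradient-descent variant discussed in Section 3.4.

import torch

def langevin_sample(cost_fn, u_init, n_steps=64, step_size=0.01, noise=True):
    # Sample a control sequence u from p_theta(u|e,h) ~ exp(-C_theta) via Langevin dynamics (Eq. 10).
    # cost_fn: maps a control sequence u of shape (T, d_u) to a scalar cost; it is assumed to
    #          unroll the dynamics internally so autograd gives back-propagation through time.
    # u_init:  (T, d_u) initial control sequence.
    u = u_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        cost = cost_fn(u)
        grad, = torch.autograd.grad(cost, u)
        with torch.no_grad():
            u -= 0.5 * step_size ** 2 * grad           # gradient term of Eq. (10)
            if noise:
                u += step_size * torch.randn_like(u)   # Gaussian noise term of Eq. (10)
    return u.detach()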

3.4 Optimization-based inverse control

Our experiments show that if we remove the noise term in the Langevin dynamics (10) to make it a gradient descent process, $u_{s+1} = u_s - \frac{\delta^2}{2} \frac{\partial}{\partial u} C_\theta(x, u_s, e, h)$, we can still learn a cost function that enables optimal control. This amounts to modifying the synthesis step into an optimization step. The learning based on gradient descent can be even more stable than the learning based on the Langevin dynamics.

Moreover, a built-in optimization algorithm is usually already available for minimizing the cost function $C_\theta(x, u, e, h)$ over $u$. For instance, in autonomous driving, a commonly used algorithm is iLQR (see the supplementary materials for details of iLQR). In this case, we can replace the synthesis step by an optimization step, where, instead of sampling $\tilde{u}_i \sim p_\theta(u \mid e_i, h_i)$, we optimize

$\hat{u}_i = \arg\min_u C_\theta(x, u, e_i, h_i)$.    (11)

The analysis step remains unchanged. In our experiments, we will evaluate this learning algorithm.

A justification for this learning algorithm in the context of the energy-based model (2) is to consider its tempered version, $p_\theta(u \mid e, h) \propto \exp[-C_\theta(x, u, e, h)/T]$, where $T$ is the temperature. Then the optimized $\hat{u}_i$ that minimizes $C_\theta$ can be considered the zero-temperature sample, which is used to approximate the expectation in (4).

However, in this paper, we choose to pursue justifications of the optimization-based algorithm outside the context of the probabilistic model. To this end, we need to rethink inverse optimal control, whose goal is not to find a probabilistic model for the expert trajectories. Instead, the goal is to find a suitable cost function for optimal control, where we care about the optimal behavior, not the variabilities of the observed behaviors.

Thus we give the following non-probabilistic justifications.

Moment matching

For simplicity, consider the linear cost function $C_\theta(x, u, e, h) = \langle \theta, \phi(x, u, e, h) \rangle$. At the convergence of the optimization-based learning algorithm, which has the same analysis step (2) as the sampling-based algorithm, we have $\hat{L}'(\theta) = 0$, so that

$\frac{1}{n} \sum_{i=1}^{n} \phi(\hat{x}_i, \hat{u}_i, e_i, h_i) = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i, u_i, e_i, h_i)$,    (12)

where the left-hand side is the average of the optimal behaviors obtained by (11), and the right-hand side is the average of the observed behaviors. We want the optimal behaviors to match the observed behaviors on average. We can see the above point most clearly in the extreme case where all $e_i = e$ and all $h_i = h$, so that $\phi(\hat{x}, \hat{u}, e, h) = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i, u_i, e, h)$, i.e., we want the optimal behavior under the learned cost function to match the average observed behavior as far as the features of the cost function are concerned. Note that the matching is not in terms of raw trajectories but in terms of the features of the cost function. In this matching, we do not care to model the variabilities in the observed behaviors. In the case of different $(e_i, h_i)$ for different $i$, the matching may not be exact for each combination of $(e_i, h_i)$. However, such mismatches may be detected by new features, which can then be included in the features of the cost function, so that with a strong collection of features we may have close matching for each $(e_i, h_i)$. This matching justification also applies to the general cost function.

Adversarial learning

Define the value function

$V(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ C_\theta(\hat{x}_i, \hat{u}_i, e_i, h_i) - C_\theta(x_i, u_i, e_i, h_i) \right]$.    (13)

Then $\hat{L}'(\theta) = \partial V(\theta) / \partial \theta$ (treating the optimized trajectories as fixed), so that the analysis step (2) increases $V(\theta)$. Thus the optimization step and the analysis step play a minimax adversarial game, where the optimization step seeks to decrease $V(\theta)$ by reducing the costs of the optimized trajectories, while the analysis step seeks to increase $V(\theta)$ by modifying the cost function. More specifically, the optimization step finds the minima of the cost function to decrease $V(\theta)$, whereas the analysis step shifts the minima toward the observed trajectories in order to increase $V(\theta)$.

Algorithm 1 presents the learning algorithm.

1: input: expert demonstrations $D = \{(x_i, u_i, e_i, h_i), i = 1, \dots, n\}$.
2: output: cost function parameters $\theta$, and synthesized or optimized trajectories $\{(\tilde{x}_i, \tilde{u}_i)\}$ or $\{(\hat{x}_i, \hat{u}_i)\}$.
3: Let $t \leftarrow 0$, initialize $\theta_0$.
4: repeat
5:     Synthesis step or optimization step: synthesize $\tilde{u}_i \sim p_{\theta_t}(u \mid e_i, h_i)$ by Langevin sampling, or optimize $\hat{u}_i = \arg\min_u C_{\theta_t}(x, u, e_i, h_i)$ by gradient descent (GD) or iLQR, and then obtain the corresponding $\tilde{x}_i$ or $\hat{x}_i$ by unfolding the dynamics, for each $i$.
6:     Analysis step: update $\theta_{t+1} = \theta_t + \gamma_t \hat{L}'(\theta_t)$, where $\hat{L}'(\theta_t)$ is computed according to (8).
7:     $t \leftarrow t + 1$.
8: until $t$ reaches the specified number of learning iterations.
Algorithm 1 Energy-based inverse control
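For concreteness, the following sketch implements Algorithm 1 with a gradient-descent optimization step and a linear cost. The helper functions phi (feature extractor) and unroll (dynamics roll-out), as well as the hyperparameter values, are illustrative assumptions; the experiments below use Adam and mini-batches rather than this plain update.

import torch

def learn_cost(theta, demos, phi, unroll, lr=0.1, n_epochs=100, inner_steps=64, inner_lr=0.01):
    # Sketch of Algorithm 1 with a gradient-descent optimization step and linear cost C = <theta, phi>.
    # demos: list of (u_expert, e, h); u_expert has shape (T, d_u).
    # phi(x, u, e, h) -> (K,) differentiable feature tensor; unroll(u, e, h) -> states x.
    theta = theta.clone().detach()
    for _ in range(n_epochs):
        grad_theta = torch.zeros_like(theta)
        for u_expert, e, h in demos:
            # Optimization step: minimize the current cost over u by gradient descent.
            u = torch.zeros_like(u_expert, requires_grad=True)  # zero init: keep straight, constant speed
            for _ in range(inner_steps):
                cost = torch.dot(theta, phi(unroll(u, e, h), u, e, h))
                g, = torch.autograd.grad(cost, u)
                with torch.no_grad():
                    u -= inner_lr * g
            # Analysis step: accumulate the feature-difference gradient of Eqs. (8)/(9).
            with torch.no_grad():
                phi_opt = phi(unroll(u, e, h), u, e, h)
                phi_exp = phi(unroll(u_expert, e, h), u_expert, e, h)
                grad_theta += (phi_opt - phi_exp) / len(demos)
        theta += lr * grad_theta  # gradient ascent on the log-likelihood
    return theta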

In our experiments, we shall mainly use the optimization-based method with gradient descent (GD) or iLQR as the optimizer. However, the energy-based probabilistic model and the sampling-based method are crucial because they serve as the conceptual foundation of our work.

4 Experiments

4.1 Single-agent control

We evaluate the proposed energy-based inverse control methods on autonomous driving tasks.

Dataset

We use two datasets with different focuses.

Massachusetts driving dataset. This is a private dataset collected from an autonomous car during repeated trips on a stretch of highway. The dataset includes both vehicle states and environment information, and covers realistic driving scenarios with curved lanes and complex static scenes. The control of the autonomous vehicle is recorded by the hardware on the car. The car state consists of coordinates, orientation and velocity. The control consists of two scalars: the steering and the acceleration. The environment information consists of all lanes, represented by cubic polynomials. To deal with the noisy GPS signal, Kalman filtering is used to denoise the data. The number of expert trajectories is 44,000, and each spans 3 seconds at 0.1 s intervals.

The data were collected over 10 days. We choose the data from the last two days for testing, and the data from the first 8 days for training. As it turns out, the testing data are slightly easier than the training data. In particular, the paths in the testing data are smoother than the paths in the training data. As a result, the testing errors are smaller than the training errors. During training and testing, we use mini-batches of size 1000.

NGSIM US-101. NGSIM [Colyar and Halkias2007] consists of real highway traffic captured at 10Hz over a time span of 45 minutes. Compared to the Massachusetts driving dataset, NGSIM contains rich vehicle interactions, so the control needs to take other nearby vehicles into account.

We preprocess the data by dividing it into 5 second / 50 frame blocks. There are 831 scenes in total with 96,512 5-second vehicle trajectories. We use 1 second (10 frames) as the history trajectory and 4 seconds (40 frames) as the future trajectory. No control variables are provided, so we need to infer the control of each vehicle from the vehicle state, which includes the position, speed and heading angle. Assuming bicycle model dynamics [Polack et al.2017], we perform an inverse-dynamics optimization using gradient descent to minimize the distance between the trajectory reconstructed from the inferred controls and the real one. The overall root mean square error (RMSE) compared to the ground-truth GPS positions is 0.97 meters. With the inferred controls, we iterate the dynamics from the starting point to reconstruct the inferred trajectory; we call this the “rollout”. Compared to the ground-truth trajectory, our inferred trajectory tends to have smoother curves for coordinates, acceleration and steering. Since we use inferred trajectories, our preprocessed data are assumed to have perfect dynamics with noiseless and smooth control sequences and GPS coordinates. We randomly split the data into training and testing sets.
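The following is a rough sketch of this preprocessing step under a simplified kinematic bicycle model; the model variant, the wheelbase value, and the optimizer settings are assumptions, not the exact ones used for NGSIM.

import torch

def bicycle_step(state, control, dt=0.1, wheelbase=2.7):
    # One step of a simplified kinematic bicycle model (exact variant and wheelbase are assumptions).
    # state = tensor [x, y, heading, speed]; control = tensor [steering, acceleration].
    x, y, psi, v = state
    steer, accel = control
    x = x + v * torch.cos(psi) * dt
    y = y + v * torch.sin(psi) * dt
    psi = psi + v / wheelbase * torch.tan(steer) * dt
    v = v + accel * dt
    return torch.stack([x, y, psi, v])

def infer_controls(init_state, gps_xy, n_iters=500, lr=0.05):
    # Fit a control sequence so that rolling out the dynamics (the "rollout")
    # reproduces the observed GPS positions, by gradient descent.
    T = gps_xy.shape[0]
    controls = torch.zeros(T, 2, requires_grad=True)
    opt = torch.optim.Adam([controls], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        state, loss = init_state, 0.0
        for t in range(T):
            state = bicycle_step(state, controls[t])
            loss = loss + torch.sum((state[:2] - gps_xy[t]) ** 2)
        loss.backward()
        opt.step()
    return controls.detach()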

Evaluation metrics

We use two evaluation metrics. The first one is the root mean square error (RMSE). Assume there are $N$ expert demonstrations. Let $d_{i,t}$ be the distance between the predicted position and the ground-truth position of the $i$-th demonstration at frame $t$. The RMSE at the $t$-th position is defined as $\mathrm{RMSE}_t = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_{i,t}^2}$. A small RMSE is desired.
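A minimal sketch of this metric (array shapes are assumptions):

import numpy as np

def rmse_at_frame(pred_xy, true_xy):
    # RMSE of position at each frame, averaged over N trajectories.
    # pred_xy, true_xy: (N, T, 2) arrays of predicted / ground-truth coordinates.
    dist = np.linalg.norm(pred_xy - true_xy, axis=-1)   # (N, T) per-frame distances
    return np.sqrt(np.mean(dist ** 2, axis=0))          # (T,) RMSE per frame, e.g. at 1s, 2s, 3s, 4s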

The second one is the likelihood. In computing the likelihood, the longitudinal and lateral directions are treated differently, because an error in the lateral (y) direction is much more important than an error in the longitudinal (x) direction: a small lateral shift may trigger a collision with an adjacent vehicle.

Assume a Gaussian error model with separate standard deviations for the longitudinal and lateral errors.

The likelihood is then defined as the average, over demonstrations and frames, of the product of the Gaussian scores of the longitudinal and lateral errors, where each score is the probability density under a Gaussian distribution with mean zero and the corresponding standard deviation, normalized so that a perfect prediction has a likelihood of 1. A larger likelihood is desired.
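A sketch of one way to compute such a likelihood score is given below; the standard deviation values and the exact normalization are placeholders consistent with the description above (a perfect prediction scores 1), not the paper's exact constants.

import numpy as np

def likelihood_metric(pred_xy, true_xy, sigma_x=1.0, sigma_y=0.3):
    # Gaussian score of the longitudinal (x) and lateral (y) errors, normalized so that
    # zero error gives a score of 1. sigma_x and sigma_y are placeholder values; the
    # lateral sigma is smaller because lateral errors matter more.
    # pred_xy, true_xy: (N, T, 2) arrays.
    dx = pred_xy[..., 0] - true_xy[..., 0]
    dy = pred_xy[..., 1] - true_xy[..., 1]
    score = np.exp(-dx ** 2 / (2 * sigma_x ** 2)) * np.exp(-dy ** 2 / (2 * sigma_y ** 2))
    return float(np.mean(score))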

Details

The trajectories of other vehicles are set as known environment states and are assumed to remain unchanged while the ego vehicle is moving. In practice, the trajectories of other vehicles would be predicted; in this paper, we sidestep this prediction step to focus on the inverse control problem.

In this experiment, we use a linear combination of hand-designed features as the cost function. Feature normalization is used to make sure that each feature has the same scale: for each expert trajectory and for each feature, we divide the values of the feature over time by the standard deviation of these values.

In learning, each weight parameter is initialized to 1. In the optimization step, we set the number of gradient descent iterations to 64 and the step size to 0.01. The control is initialized to zero, which corresponds to keeping straight at constant speed. To avoid irregularly large gradients in some steps, we use gradient clipping and restrict the control variables to the interval [-8, +8].

In the training step, we use the Adam optimizer [Kingma and Ba2014] with learning rate 0.1 and beta 0.5.

In the NGSIM setting, we found that the scales of steering and acceleration are different, so we normalize both controls.

Baseline methods

We compare our method with two control methods and one baseline.

  • Constant velocity: simplest baseline, generates trajectories with constant velocity and zero steering.

  • GAIL [Ho and Ermon2016]: We use the same cost function as the reward function in generative adversarial imitation learning. For the policy network, we use the multi-layer perceptron (MLP) policy described in [Kuefler et al.2017].

  • CIOC [Levine and Koltun2012]: We use the iLQR method with the same settings as in our method for optimal control at test time.

Results

Figure 1: Predicted trajectories on the Massachusetts driving dataset. (Green: predicted trajectory; Red: ground truth; Orange: other vehicles; Gray: lane)
Figure 2: Massachusetts driving dataset: likelihood over training epochs (Left: CIOC; Middle: iLQR (ours); Right: gradient descent (GD, ours)).
Method               Likelihood (Train / Test)   RMSE (Train / Test)
Constant Velocity    0.676 / 0.665               0.956 / 0.870
CIOC                 0.795 / 0.653               0.789 / 0.987
ours (via iLQR)      0.745 / 0.745               0.871 / 0.786
ours (via GD)        0.799 / 0.810               0.767 / 0.660
Table 1: Massachusetts driving dataset results.
Method (Testing RMSE)    1s      2s      3s      4s
Constant Velocity        0.569   1.623   3.075   4.919
CIOC                     0.503   1.468   2.801   4.530
GAIL                     0.702   1.519   2.688   3.917
ours (via GD)            0.318   0.644   1.149   2.138
ours (via iLQR)          0.351   0.603   0.969   1.874
Table 2: NGSIM dataset results.

Figure 1 shows sample outputs. Each point sequence represents a 3-second trajectory over time, from left to right. The orange sequences represent the ground truth for vehicles other than the ego vehicle, the red trajectory represents the ground truth for the ego vehicle, and the green one represents the trajectory predicted by gradient descent.

Results for the Massachusetts driving dataset are shown in Table 1. Our method achieves a substantial improvement over the baseline CIOC [Levine and Koltun2012]. Moreover, our method is far more stable than CIOC during the training stage. Figure 2 shows the testing likelihood during training, where the x-axis is the number of training epochs.

The NGSIM results are shown in Table 2. Here we report only the testing error, since the training errors are similar. Our method also performs the best compared to the other control methods.

Discussion

We observe that CIOC performs poorly on both datasets, because its Laplace approximation is not accurate enough in our setting, where the hand-designed features and the bicycle dynamics model are both non-quadratic.

The problem with GAIL is its complexity. The original GAIL [Ho and Ermon2016] solved discrete reinforcement learning problems, and continuous control is harder. In order to learn a good policy network as well as a good value network, a very large expert training set and a very long training stage are needed. Our method does not require such big networks.

Comparing the two optimization methods, iLQR and gradient descent, iLQR performs better on NGSIM while gradient descent performs better on the Massachusetts driving dataset. This is because the speed varies a lot in the NGSIM dataset, while the driving dataset has light traffic flow with less speed variation. In the supplementary materials, we discuss the statistics of the two datasets. Both methods perform well on steering while following the lane. Compared to steering, acceleration changes more rapidly (sudden turns are rare on a highway, while sudden stops are more common). Gradient descent tends to produce smoother controls, which makes it harder to predict sudden changes in acceleration.

4.2 Testing corner cases with toy examples

Corner cases are important for model evaluation. Therefore, we construct 6 typical corner cases to test our model. Figure 3 shows the predicted trajectories for several synthetic examples. Please see the supplementary HTML for animated GIFs of these examples.

Figure 3(a)(b) show sudden brakes, where the orange vehicles are braking at the same speed. There is no car alongside in (a), so the predicted control is to overtake the braking vehicle by changing lanes. There are cars alongside in (b), which is why the ego vehicle is predicted to brake.

(c)(d) show another car trying to cut in from the left and from the right into the current lane. The ego vehicle is predicted to brake.

(e)(f) show a large lane curvature, yet our model still performs well on lane following.

Figure 3: Predicted trajectories for synthetic examples. (Green: predicted; Orange: other vehicle; Gray: lane)
Figure 4: Predicted control through time.

Figure 4 shows how the control changes during each step of gradient descent, corresponding to Figure 3. The blue line stands for acceleration and the orange line stands for steering. The dashed line is the initialization of the control, which is the control of the last frame in the history trajectory. The darkest solid line is the final predicted control. The shading from light to dark shows how the control changes over the course of gradient descent.

In short, our learning algorithm via gradient descent is capable of learning a reasonable cost function to avoid collision and handle cut-in situations.

4.3 Comparison of different cost functions

Neural networks are powerful function approximators. They are capable of approximating complex nonlinear functions given sufficient training data, and are flexible in incorporating prior information, which in our case is the manually designed features. We design three different multi-layer perceptron (MLP) structures as add-ons to the feature-based cost function, as well as a CNN structure.

Neural network designs

(a) The linear cost augmented with a two-layer MLP (hidden dimension 64, output dimension 1, leaky ReLU activation) applied to the hand-designed features.

(b) A second variant that combines the hand-designed features with a two-layer MLP of the same architecture (hidden dimension 64, output dimension 1, leaky ReLU activation).

(c) A third variant in which a two-layer MLP (hidden dimension 64, leaky ReLU activation) outputs one value per hand-designed feature.

The CNN setting connects the temporal information between frames. For a trajectory of $T$ frames, each frame yields the hand-designed features, giving a $T \times K$ feature matrix, where $K$ is the number of features. We treat this matrix as sequence data of length $T$ with $K$ channels and pass it through three 1D convolution layers (kernel size 4, 4, 4; channels 32, 64, 128; leaky ReLU activation) and one fully connected layer at the end.
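A possible PyTorch realization of this CNN cost is sketched below; the absence of padding and the size of the final fully connected layer are assumptions beyond what the text specifies.

import torch.nn as nn

class CNNCost(nn.Module):
    # Non-Markovian CNN cost: per-frame hand-designed features (K channels over T frames)
    # pass through three 1D convolutions and a final fully connected layer giving a scalar cost.
    def __init__(self, n_features, n_frames):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=4), nn.LeakyReLU(),
            nn.Conv1d(32, 64, kernel_size=4), nn.LeakyReLU(),
            nn.Conv1d(64, 128, kernel_size=4), nn.LeakyReLU(),
        )
        out_len = n_frames - 3 * (4 - 1)           # sequence length after three unpadded convolutions
        self.fc = nn.Linear(128 * out_len, 1)

    def forward(self, feats):                      # feats: (batch, K, T)
        h = self.conv(feats)
        return self.fc(h.flatten(start_dim=1))     # (batch, 1) scalar cost per trajectory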

Method            Likelihood (Train / Test)   RMSE (Train / Test)
Linear            0.800 / 0.810               0.767 / 0.660
NN setting (a)    0.799 / 0.794               0.668 / 0.576
NN setting (b)    0.800 / 0.804               0.823 / 0.700
NN setting (c)    0.773 / 0.772               0.802 / 0.713
CNN               0.814 / 0.733               0.626 / 0.778
Table 3: Massachusetts driving dataset results.

Results

Among the neural network structures, a marginal improvement is achieved by (a). This structure provides a nonlinear layer that transforms the original input features, which suggests that there are internal interactions between the features; e.g., if there is a car in front, the collision cost increases, and the lane-keeping cost is de-emphasized or may even have a negative effect. The other settings, however, decrease the accuracy. During testing, we found that the error mainly arises in the optimization step: gradient descent tends to get stuck in a local minimum when the cost function is complex.

The CNN, as the only non-Markovian structure, achieves the best training performance. It overfits somewhat because there are too many similar trajectories (i.e., lane following) in the training set. We believe that with a dataset containing more diverse scenes, the CNN can achieve better results.

In this experiment, we show that our method can learn cost functions parameterized by neural networks, as well as a non-Markovian cost function parameterized by a convolutional neural network.

4.4 Multi-agent control

In the single-agent control setting, the future trajectories of the other vehicles are assumed to be known (e.g., predicted by a prediction method) and remain unchanged no matter how the ego vehicle moves. We extend our framework to the multi-agent setting, where we simultaneously control all vehicles in the scene. The controls of the other vehicles are used to predict their trajectories.

In this setting, suppose there are $K$ agents. The whole scene can be regarded as one general agent whose state and control spaces are the Cartesian products of the individual state and control spaces, $x_t = (x_t^{(k)}, k = 1, \dots, K)$ and $u_t = (u_t^{(k)}, k = 1, \dots, K)$. All the agents share the same dynamics function $f$. The overall cost function is set to be the sum over agents, $C_\theta(x, u, e, h) = \sum_{k=1}^{K} C_\theta(x^{(k)}, u^{(k)}, e, h)$. Thus, the conditional probability density function becomes $p_\theta(u \mid e, h) = \frac{1}{Z_\theta(e, h)} \exp\left[-\sum_{k=1}^{K} C_\theta(x^{(k)}, u^{(k)}, e, h)\right]$.
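In code, this amounts to summing a shared single-agent cost over agents; a small sketch (shapes and names are illustrative):

def multi_agent_cost(cost_fn, theta, xs, us, e, h):
    # xs: sequence of per-agent state trajectories; us: matching per-agent control trajectories.
    # cost_fn(theta, x_k, u_k, e, h) -> scalar single-agent cost, shared by all agents.
    return sum(cost_fn(theta, x_k, u_k, e, h) for x_k, u_k in zip(xs, us))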

Baselines

  • Constant velocity: simplest baseline, generates trajectories with constant velocity and zero steering.

  • PS-GAIL: Parameter-sharing GAIL, as described in [Bhattacharyya et al.2018] [Bhattacharyya et al.2019]. It is similar to single-agent GAIL, except that all agents share the same critic and reward function.

Figure 5: Predicted trajectories for the multi-agent setting. (Green: predicted trajectory; Red: ground truth; Gray: lane)
Method (RMSE)         1s      2s      3s      4s
Constant Velocity     0.569   1.623   3.075   4.919
PS-GAIL               0.602   1.874   3.144   4.962
ours (multi-agent)    0.390   0.747   1.383   2.459
Table 4: NGSIM dataset results.

With our method, we achieve results similar to the single-agent setting. Figure 5 shows the predicted results: the two rows show two scenes, where the left panel is the initial frame, the middle panel is the 2-second prediction, and the right panel is the full 4-second prediction. Please see the supplementary HTML for animated GIFs of these examples.

The problem with PS-GAIL is the same as with GAIL; the multi-agent setting is more complex, so it performs worse.

5 Conclusion

This paper studies the fundamental problem of learning the cost function from expert demonstrations for continuous optimal control over finite time horizon. We study this problem in the framework of energy-based model, and we propose an optimization-based method to learn the cost function. The optimization-based method is convenient if a built-in optimization algorithm is available, and it can be justified in terms of moment matching and adversarial learning. The proposed methods are generally applicable to continuous inverse optimal control problems.

References

  • [Alahi et al.2016] Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; and Savarese, S. 2016. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition.
  • [Bemporad et al.2002] Bemporad, A.; Morari, M.; Dua, V.; and Pistikopoulos, E. N. 2002. The explicit linear quadratic regulator for constrained systems. Automatica 38(1):3–20.
  • [Bhattacharyya et al.2018] Bhattacharyya, R. P.; Phillips, D. J.; Wulfe, B.; Morton, J.; Kuefler, A.; and Kochenderfer, M. J. 2018. Multi-agent imitation learning for driving simulation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1534–1539. IEEE.
  • [Bhattacharyya et al.2019] Bhattacharyya, R. P.; Phillips, D. J.; Liu, C.; Gupta, J. K.; Driggs-Campbell, K.; and Kochenderfer, M. J. 2019. Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning. In Proceedings of the International Conference on Robotics and Automation (ICRA).
  • [Colyar and Halkias2007] Colyar, J., and Halkias, J. 2007. Us highway dataset. Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030.
  • [Deo, Rangesh, and Trivedi2018] Deo, N.; Rangesh, A.; and Trivedi, M. M. 2018. How would surround vehicles move? a unified framework for maneuver classification and motion prediction. IEEE Transactions on Intelligent Vehicles 3(2):129–140.
  • [Finn et al.2016] Finn, C.; Christiano, P.; Abbeel, P.; and Levine, S. 2016. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
  • [Finn, Levine, and Abbeel2016] Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 49–58.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
  • [Gupta et al.2018] Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; and Alahi, A. 2018. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition.
  • [Ho and Ermon2016] Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 4565–4573.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
  • [Kuefler et al.2017] Kuefler, A.; Morton, J.; Wheeler, T.; and Kochenderfer, M. 2017. Imitating driver behavior with generative adversarial networks. In Intelligent Vehicles Symposium (IV), 2017 IEEE, 204–211. IEEE.
  • [LeCun, Bengio, and others1995] LeCun, Y.; Bengio, Y.; et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10):1995.
  • [Lee et al.2017] Lee, N.; Choi, W.; Vernaza, P.; Choy, C. B.; Torr, P. H.; and Chandraker, M. 2017. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 336–345.
  • [Levine and Koltun2012] Levine, S., and Koltun, V. 2012. Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
  • [Li and Todorov2004] Li, W., and Todorov, E. 2004. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), 222–229.
  • [Li, Song, and Ermon2017] Li, Y.; Song, J.; and Ermon, S. 2017. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, 3812–3822.
  • [Monfort, Liu, and Ziebart2015] Monfort, M.; Liu, A.; and Ziebart, B. D. 2015. Intent prediction and trajectory forecasting via predictive inverse linear-quadratic regulation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, 3672–3678. AAAI Press.
  • [Neal and others2011] Neal, R. M., et al. 2011. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo 2(11):2.
  • [Polack et al.2017] Polack, P.; Altché, F.; d’Andréa Novel, B.; and de La Fortelle, A. 2017. The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles? In Intelligent Vehicles Symposium (IV), 2017 IEEE, 812–818. IEEE.
  • [Richardson and Domingos2006] Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine learning 62(1-2):107–136.
  • [Robbins and Monro1951] Robbins, H., and Monro, S. 1951. A stochastic approximation method. The annals of mathematical statistics 400–407.
  • [Todorov2006] Todorov, E. 2006. Optimal control theory. Bayesian brain: probabilistic approaches to neural coding 269–298.
  • [Vemula, Muelling, and Oh2018] Vemula, A.; Muelling, K.; and Oh, J. 2018. Social attention: Modeling attention in human crowds. In Proceedings of the International Conference on Robotics and Automation (ICRA) 2018.
  • [Wulfmeier, Ondruska, and Posner2015] Wulfmeier, M.; Ondruska, P.; and Posner, I. 2015. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888.
  • [Xie et al.2015] Xie, J.; Hu, W.; Zhu, S.-C.; and Wu, Y. N. 2015. Learning sparse frame models for natural image patterns. International Journal of Computer Vision 114(2-3):91–112.
  • [Xie, Zhu, and Wu2017] Xie, J.; Zhu, S.-C.; and Wu, Y. N. 2017. Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7093–7101.
  • [Zhao et al.2019] Zhao, T.; Xu, Y.; Monfort, M.; Choi, W.; Baker, C.; Zhao, Y.; Wang, Y.; and Wu, Y. N. 2019. Multi-agent tensor fusion for contextual trajectory prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Zhu, Wu, and Mumford1998] Zhu, S. C.; Wu, Y.; and Mumford, D. 1998. Filters, random fields and maximum entropy (frame): Towards a unified theory for texture modeling. International Journal of Computer Vision 27(2):107–126.
  • [Ziebart et al.2008] Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, 1433–1438. Chicago, IL, USA.

Appendix

Iterative Linear Quadratic Regulation

Iterative Linear Quadratic Regulation (iLQR) is a variant of differential dynamic programming (DDP) [Li and Todorov2004]. Given an initial trajectory, it updates the trajectory by repeatedly solving for the optimal policy under linear quadratic assumptions.

Let $(\hat{x}^{(i)}, \hat{u}^{(i)})$ be the trajectory at the $i$-th iteration. The dynamics is known, $x_t = f(x_{t-1}, u_t)$. Define the perturbations $\delta x_t = x_t - \hat{x}^{(i)}_t$ and $\delta u_t = u_t - \hat{u}^{(i)}_t$. Then, around the current trajectory,

$\delta x_t \approx f_x \, \delta x_{t-1} + f_u \, \delta u_t$,

$c(x_t, u_t) \approx c(\hat{x}_t, \hat{u}_t) + c_x^\top \delta x_t + c_u^\top \delta u_t + \frac{1}{2} \delta x_t^\top c_{xx} \delta x_t + \frac{1}{2} \delta u_t^\top c_{uu} \delta u_t + \delta u_t^\top c_{ux} \delta x_t$,

where the subscripts denote the Jacobians and Hessians of the dynamics $f$ and the stepwise cost $c$.

iLQR recursively calculates the Q-function from the tail of the trajectory to the head,

$Q_x = c_x + f_x^\top V'_x$,  $Q_u = c_u + f_u^\top V'_x$,
$Q_{xx} = c_{xx} + f_x^\top V'_{xx} f_x$,  $Q_{uu} = c_{uu} + f_u^\top V'_{xx} f_u$,  $Q_{ux} = c_{ux} + f_u^\top V'_{xx} f_x$,

where $V'$ denotes the value function at the next step. Then we calculate the feedforward term $k$ and the feedback gain $K$ by

$k = -Q_{uu}^{-1} Q_u$,  $K = -Q_{uu}^{-1} Q_{ux}$,

and update the value function by

$V_x = Q_x + K^\top Q_{uu} k + K^\top Q_u + Q_{ux}^\top k$,  $V_{xx} = Q_{xx} + K^\top Q_{uu} K + K^\top Q_{ux} + Q_{ux}^\top K$.

Finally, $k$ and $K$ are used to update the $(i+1)$-th trajectory given the $i$-th trajectory, $u^{(i+1)}_t = \hat{u}^{(i)}_t + k_t + K_t \left(x^{(i+1)}_{t-1} - \hat{x}^{(i)}_{t-1}\right)$, with the states obtained by rolling out the known dynamics.

After several iterations, the trajectory converges to a local optimum.
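For reference, a compact sketch of one iLQR iteration following the equations above is given below, using the common $x_{t+1} = f(x_t, u_t)$ indexing. The derivative callbacks, the omission of regularization and line search, and all shapes are simplifying assumptions.

import numpy as np

def ilqr_iteration(f, f_derivs, c_derivs, cT_derivs, xs, us):
    # One iLQR iteration around the nominal trajectory (xs, us).
    #   f(x, u) -> next state                             (known dynamics)
    #   f_derivs(x, u) -> (f_x, f_u)                      dynamics Jacobians
    #   c_derivs(x, u) -> (c_x, c_u, c_xx, c_uu, c_ux)    stage-cost derivatives
    #   cT_derivs(x)   -> (c_x, c_xx)                     terminal-cost derivatives
    #   xs: (T+1, d_x) nominal states, us: (T, d_u) nominal controls.
    T, d_u = us.shape
    d_x = xs.shape[1]
    ks = np.zeros((T, d_u))
    Ks = np.zeros((T, d_u, d_x))

    # Backward pass: propagate the value function and compute feedforward/feedback terms.
    V_x, V_xx = cT_derivs(xs[-1])
    for t in reversed(range(T)):
        f_x, f_u = f_derivs(xs[t], us[t])
        c_x, c_u, c_xx, c_uu, c_ux = c_derivs(xs[t], us[t])
        Q_x = c_x + f_x.T @ V_x
        Q_u = c_u + f_u.T @ V_x
        Q_xx = c_xx + f_x.T @ V_xx @ f_x
        Q_uu = c_uu + f_u.T @ V_xx @ f_u
        Q_ux = c_ux + f_u.T @ V_xx @ f_x
        ks[t] = -np.linalg.solve(Q_uu, Q_u)
        Ks[t] = -np.linalg.solve(Q_uu, Q_ux)
        V_x = Q_x + Ks[t].T @ Q_uu @ ks[t] + Ks[t].T @ Q_u + Q_ux.T @ ks[t]
        V_xx = Q_xx + Ks[t].T @ Q_uu @ Ks[t] + Ks[t].T @ Q_ux + Q_ux.T @ Ks[t]

    # Forward pass: roll out the updated trajectory with the feedback policy.
    xs_new = np.zeros_like(xs)
    us_new = np.zeros_like(us)
    xs_new[0] = xs[0]
    for t in range(T):
        us_new[t] = us[t] + ks[t] + Ks[t] @ (xs_new[t] - xs[t])
        xs_new[t + 1] = f(xs_new[t], us_new[t])
    return xs_new, us_new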

Details about the datasets

Figure 6: Statistics comparison between the Massachusetts driving dataset and the NGSIM dataset.
Figure 7: Inferred trajectories compared to ground truth. (Green: inferred; Red: ground truth)

We provide further statistical comparisons of the two datasets. Figure 6 shows histograms of the road curvature, the steering control and the acceleration control. We can see that the Massachusetts dataset has larger curvature, while NGSIM has much larger acceleration. We think this is the main reason for the different results produced by gradient descent and iLQR.

Figure 7 shows a sample scene from the inferred NGSIM dataset. There are over 100 vehicles in this scene, all moving from left to right. Each dot sequence stands for one vehicle from time frame 0 to frame 40. The green dot sequences represent the inferred trajectories and the red ones represent the ground truth.
