
# Agile Off-Road Autonomous Driving Using End-to-End Deep Imitation Learning

Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan,
Evangelos Theodorou, Byron Boots
Y. Pan and E. Theodorou are affiliated with School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA. {ypan37, evangelos.theodorou}@gatech.edu C.-A. Cheng, K. Saigol, X. Yan, B. Boots are affiliated with the Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA 30332, USA. {cacheng, kamilsaigol, xyan43}@gatech.edu, bboots@cc.gatech.edu. K. Lee is affiliated with School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA. keuntaek.lee@gatech.edu
###### Abstract

We present an end-to-end imitation learning system for agile, off-road autonomous driving using only low-cost on-board sensors. By imitating an optimal controller, we train a deep neural network control policy to map raw, high-dimensional observations to continuous steering and throttle commands, the latter of which is essential to successfully drive on varied terrain at high speed. Compared with recent approaches to similar tasks, our method requires neither state estimation nor online planning to navigate the vehicle. Real-world experimental results demonstrate successful autonomous off-road driving, matching the state-of-the-art performance.

## I Introduction

High-speed autonomous off-road driving is a challenging robotics problem [1, 2, 3] (Fig. 1). To succeed in this task, a robot is required to perform both precise steering and throttle maneuvers in a physically-complex, uncertain environment by executing a series of high-frequency decisions. Compared with most previously studied autonomous driving tasks, the robot must reason about unstructured, stochastic natural environments and operate at high speed. Consequently, designing a control policy by following the traditional model-plan-then-act approach [1, 4] becomes challenging, as it is difficult to adequately characterize the robot’s interaction with the environment a priori.

This task has been considered previously, for example, by Williams et al. [2, 3] using model-predictive control (MPC). While the authors demonstrate impressive results, their control scheme relies on an expensive, accurate Global Positioning System (GPS) and Inertial Measurement Unit (IMU) for state estimation and demands high-frequency online replanning to generate control commands. Due to these costly hardware requirements, their robot can only operate in a rather controlled environment.

We aim to relax these requirements by designing a reflexive driving policy that uses only low-cost, on-board sensors (e.g. camera, wheel speed sensors). Building on the success of deep reinforcement learning (RL) [5, 6], we adopt deep neural networks (DNNs) to parametrize the control policy and learn the desired parameters from the robot’s interaction with its environment. While the use of DNNs as policy representations for RL is not uncommon, in contrast to most previous work that showcases RL in simulated environments [6], our agent is a high-speed physical system that incurs real-world cost: collecting data is a cumbersome process, and a single poor decision can physically impair the robot and result in weeks of time lost while replacing parts and repairing the platform. Therefore, direct application of model-free RL techniques is not only sample inefficient, but costly and dangerous in our experiments.

These real-world factors motivate us to adopt imitation learning [8] to optimize the control policy instead. A major benefit of using imitation learning is that we can leverage domain knowledge through expert demonstrations. This is particularly convenient, for example, when there already exists an autonomous driving platform built through classic system engineering principles. While such a system (e.g. the MPC controller using a pre-trained dynamics model and a state estimator based on high-end sensors in [2]) usually requires expensive sensors and dedicated computational resources, with imitation learning we can train a lower-cost robot to behave similarly, without carrying the expert’s hardware burdens over to the learner. Note that here we assume the expert is given as a black-box oracle that can provide the desired actions when queried, as opposed to the case considered in [12] where the expert can be modified to accommodate the learning progress.

In this work, we present an imitation learning system for real-world high-speed off-road driving tasks. By leveraging demonstrations from an algorithmic expert, our system can learn a control policy that achieves similar performance as the expert. The system was implemented on a 1/5-scale autonomous AutoRally car. In real-world experiments, we show the AutoRally car—without any state estimator or online planning, but with a DNN policy that directly inputs sensor measurements from a low-cost monocular camera and wheel speed sensors—could learn to perform high-speed navigation at an average speed of 6 m/s and a top speed of 8 m/s, matching the state of the art [3].

## II Related Work

End-to-end learning for self-driving cars has been explored since the late 1980s. The Autonomous Land Vehicle in a Neural Network (ALVINN) [8] was developed to learn steering angles directly from camera and laser range measurements using a neural network with a single hidden layer. Based on similar ideas, modern self-driving cars [10, 7, 9] have recently started to employ a batch imitation learning approach: parameterizing control policies with DNNs, these systems require only expert demonstrations during the training phase and on-board measurements during the testing phase. For example, Nvidia’s PilotNet [7], a convolutional neural network that outputs steering angle given an image, was trained to mimic human drivers’ reactions to visual input with demonstrations collected in real-world road tests. An online imitation learning algorithm related to Dataset Aggregation (DAgger) [13] was recently demonstrated for autonomous driving in [11], but only in simulated environments.

Our problem differs substantially from these previous on-road driving tasks. We study autonomous driving on a fixed set of dirt tracks, whereas on-road driving must perform well in a larger domain and contend with moving objects such as cars and pedestrians. While on-road driving in urban environments may seem more difficult, our agent must overcome challenges of a different nature. It is required to drive at high speed, and prominent visual features such as lane markers are absent. Compared with paved roads, the surface of our dirt tracks is constantly evolving and highly stochastic. As a result, successful high-speed driving in our task requires high-frequency application of both steering and throttle commands, whereas previous work focuses only on steering [10, 7, 9]. A comparison of different imitation learning approaches to autonomous driving is presented in Table I.

Our task is most similar to the task considered by Williams et al. [2, 3] and Drews et al. [14]. Compared with a DNN policy, their MPC approach has several drawbacks: computationally expensive optimization for planning must be performed online at high frequency, and the learning component is not end-to-end. In [2, 3], accurate GPS and IMU feedback is also required for state estimation, which may not contain sufficient information to contend with the changing environment in off-road driving tasks. While the requirement for GPS and IMU is relaxed by the vision-based cost map in [14], a large dataset (300,000 images) was used to train the model, expensive on-the-fly planning is still required, and speed performance is compromised. In contrast to previous work, our approach off-loads the hardware requirements to the expert. While the expert may use high-quality sensors and more computational power, our agent only needs access to on-board sensors, and its control policy can run reactively at high frequency, without on-the-fly planning and optimization. Additionally, our experimental results match those in [2, 3] and are faster and more data-efficient than those in [14].

## III Imitation Learning for Autonomous Driving

To design a policy for off-road autonomous driving, we introduce a policy optimization problem and then show how a policy can be learned by imitation learning. We discuss the strengths and weaknesses of deploying batch or online imitation learning algorithms in our task.

### III-A Problem Definition

To mathematically formulate the autonomous driving task, it is natural to consider a discrete-time continuous-valued RL problem. Let $\mathbb{S}$, $\mathbb{A}$, and $\mathbb{O}$ be the state, action, and observation spaces. In our setting, the state space is unknown to the agent; observations consist of on-board measurements, including a monocular RGB image from the front-view camera and wheel speeds from Hall effect sensors; actions consist of continuous-valued steering and throttle commands.

The goal is to find a stationary, deterministic policy $\pi$ (e.g. a DNN policy) such that $\pi$ achieves low accumulated cost over a finite horizon of length $T$,

$$\min_\pi J(\pi) \coloneqq \min_\pi \mathbb{E}_{\rho_\pi}\left[\sum_{t=0}^{T-1} c(s_t, a_t)\right] \qquad (1)$$

in which $\rho_\pi$ is the distribution of $s_t$ and $a_t$ under policy $\pi$, for $t = 0, \dots, T-1$. Here $c$ is the instantaneous cost, which, for example, encourages maximal-speed driving while staying on the track. For notation, we denote $Q^t_\pi$ as the Q-function at time $t$ under policy $\pi$ and $V^t_\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^t_\pi(s, a)\right]$ as its associated value function.
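As a toy illustration of objective (1), the following sketch accumulates cost along one rollout of a fixed policy. The one-dimensional dynamics, cost, and feedback law are hypothetical stand-ins, not the vehicle model used in the paper:

```python
def rollout_cost(policy, dynamics, cost, s0, T):
    """Accumulated cost of a single rollout of horizon T, as in (1)."""
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)
        total += cost(s, a)
        s = dynamics(s, a)
    return total

# Hypothetical 1-D toy: the state is the lateral offset from the track
# center; the cost penalizes offset and control effort.
dynamics = lambda s, a: 0.9 * s + a
cost = lambda s, a: s**2 + 0.1 * a**2
policy = lambda s: -0.5 * s          # a simple stabilizing feedback law

J = rollout_cost(policy, dynamics, cost, s0=1.0, T=50)
```

A stabilizing policy accumulates less cost than applying no control at all, which is exactly the sense in which (1) ranks policies.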

### Iii-B Imitation Learning

Directly optimizing (1) is challenging for high-speed off-road autonomous driving. Since our task involves a physical robot, model-free RL techniques are intolerably sample inefficient and have the risk of permanently damaging the car when applying a partially-optimized policy in exploration. Although model-based RL may require fewer samples, it can lead to suboptimal, potentially unstable, results because it is difficult for a model that uses only on-board measurements to fully capture the complex dynamics of off-road driving.

Considering these limitations, we propose to solve for the policy by imitation learning. We assume access to an oracle policy, or expert, $\pi^*$ that generates demonstrations during the training phase; the expert may rely on resources that are unavailable in the testing phase, e.g. additional sensors and computation. For example, the expert can be a computationally intensive optimal controller that relies on exteroceptive sensors not available at test time (e.g. GPS for state estimation), or an experienced human driver.

The goal of imitation learning is to perform as well as the expert, with an error that has at most linear dependency on $T$. Formally, we introduce a lemma due to Kakade and Langford [15] and define what we mean by an expert.

###### Lemma 1.

Define $d_\pi(s) \coloneqq \sum_{t=0}^{T-1} d^t_\pi(s)$ as an unnormalized stationary state distribution, where $d^t_\pi$ is the distribution of the state at time $t$ when running policy $\pi$. Let $\pi$ and $\pi'$ be two policies. Then

$$J(\pi) = J(\pi') + \mathbb{E}_{s \sim d_\pi} \mathbb{E}_{a \sim \pi}\left[A^t_{\pi'}(s, a)\right] \qquad (2)$$

where $A^t_{\pi'}(s, a) = Q^t_{\pi'}(s, a) - V^t_{\pi'}(s)$ is the advantage function at time $t$ with respect to running $\pi'$.

###### Definition 1.

A policy $\pi^*$ is called an expert to problem (1) if $C_{\pi^*} \coloneqq \sup_{s,\, t \in [0, T-1]} \mathrm{Lip}\left(Q^t_{\pi^*}(s, \cdot)\right)$ is bounded independently of $T$, where $\mathrm{Lip}(f(\cdot))$ denotes the Lipschitz constant of the function $f$ and $Q^t_{\pi^*}$ is the Q-function at time $t$ of running policy $\pi^*$.

The idea behind Definition 1 is that an expert policy should perform stably under arbitrary action perturbation with respect to the cost function, regardless of where it starts. (We define the expert here using a uniform Lipschitz constant because the action space in our task is continuous; for discrete action spaces, the Lipschitz constant can be replaced by a bound on the range of $Q^t_{\pi^*}(s, \cdot)$ and the rest of the derivation applies.) As we will see in Section III-C, this requirement provides guidance for choosing batch vs. online learning when training a policy by imitation.

#### III-B1 Online Imitation Learning

We now present the objective function for the online learning [16] approach to imitation learning, which simplifies the derivation in [13] and extends it to the continuous action spaces required by the autonomous driving task. Our goal here is not to introduce a new algorithm, but rather to give a concise introduction to online imitation learning; however, to our knowledge, the connection between online imitation learning and DAgger-like algorithms [13] in continuous domains has not been formally introduced, and DAgger has only been used heuristically in these domains [17, 11].

Assume $\pi^*$ is an expert to (1) and suppose $\mathbb{A}$ is a normed space with norm $\|\cdot\|$. Let $D_W$ denote the Wasserstein metric [18]: for two probability distributions $p$ and $q$ defined on a metric space $\mathbb{M}$ with metric $d$,

$$\begin{aligned} D_W(p, q) \coloneqq {} & \sup_{f:\, \mathrm{Lip}(f(\cdot)) \le 1} \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \qquad (3) \\ = {} & \inf_{\gamma \in \Gamma(p, q)} \int_{\mathbb{M} \times \mathbb{M}} d(x, y)\, \mathrm{d}\gamma(x, y), \qquad (4) \end{aligned}$$

where $\Gamma(p, q)$ denotes the family of joint distributions whose marginals are $p$ and $q$. It can be shown by the Kantorovich-Rubinstein theorem that the above two definitions are equivalent [18]. These assumptions allow us to construct a surrogate problem that is relatively easier to solve than (1). We achieve this by upper-bounding the difference between the performance of $\pi$ and $\pi^*$ given in Lemma 1:
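For intuition, the two equivalent forms can be checked numerically on one-dimensional empirical distributions: the primal form (4) has a closed form (couple samples in sorted order), and any 1-Lipschitz test function gives a lower bound via the dual form (3). This is an illustrative sketch, not part of the paper's method:

```python
def w1_empirical(xs, ys):
    """Primal form (4) for two equal-size 1-D samples: the optimal
    coupling matches points in sorted order, so W1 is the mean
    absolute gap between the sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

p = [0.0, 1.0, 2.0]
q = [0.5, 1.5, 2.5]
w = w1_empirical(p, q)        # every point shifts by 0.5, so W1 = 0.5

# Dual form (3): any 1-Lipschitz f gives a lower bound; f(x) = x
# yields the difference of the means.
dual_lb = abs(sum(p) / len(p) - sum(q) / len(q))
```

Here the dual lower bound is tight because the second sample is a pure translation of the first.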

$$\begin{aligned} J(\pi) - J(\pi^*) = {} & \mathbb{E}_{s_t \sim d_\pi}\left[\mathbb{E}_{a_t \sim \pi}\left[Q^t_{\pi^*}(s_t, a_t)\right] - \mathbb{E}_{a^*_t \sim \pi^*}\left[Q^t_{\pi^*}(s_t, a^*_t)\right]\right] \\ \le {} & C_{\pi^*}\, \mathbb{E}_{s_t \sim d_\pi}\left[D_W(\pi, \pi^*)\right] \\ \le {} & C_{\pi^*}\, \mathbb{E}_{s_t \sim d_\pi} \mathbb{E}_{a_t \sim \pi} \mathbb{E}_{a^*_t \sim \pi^*}\left[\|a_t - a^*_t\|\right], \qquad (5) \end{aligned}$$

where we invoke the definition of the advantage function $A^t_{\pi^*}(s, a) = Q^t_{\pi^*}(s, a) - V^t_{\pi^*}(s)$; the first and the second inequalities are due to (3) and (4), respectively.

Define $\hat{c}(s_t, a_t) \coloneqq \mathbb{E}_{a^*_t \sim \pi^*}\left[\|a_t - a^*_t\|\right]$. Thus, to make $\pi$ perform as well as $\pi^*$, we can minimize the upper bound, which is equivalent to solving the surrogate RL problem

$$\min_\pi \mathbb{E}_{\rho_\pi}\left[\sum_{t=1}^{T} \hat{c}(s_t, a_t)\right]. \qquad (6)$$

The optimization problem (6) is called the online imitation learning problem. This surrogate problem is considerably more structured than the original RL problem (1), so we can adopt algorithms with provable performance guarantees. In this paper, we use the meta-algorithm DAgger [13], which reduces (6) to a sequence of supervised learning problems: Let $\mathcal{D}$ denote the training data. DAgger initializes $\mathcal{D}$ with samples gathered by running $\pi^*$. Then, in the $i$th iteration, it trains $\pi_i$ by supervised learning,

$$\pi_i = \arg\min_\pi \mathbb{E}_{\mathcal{D}}\left[\hat{c}(s_t, a_t)\right], \qquad (7)$$

where the subscript $\mathcal{D}$ denotes the empirical data distribution. Next, DAgger runs $\pi_i$ to collect more data, which is then added into $\mathcal{D}$ to train $\pi_{i+1}$. The procedure is repeated for $O(T)$ iterations and the best policy, in terms of (6), is returned. Suppose the policy is linearly parametrized. Since our instantaneous cost $\hat{c}$ is strongly convex, the theoretical analysis of DAgger applies. Therefore, together with the assumption that $\pi^*$ is an expert, running DAgger to solve (6) finds a policy $\pi$ with performance $J(\pi) \le J(\pi^*) + O(T)$, achieving our initial goal.
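A minimal sketch of this DAgger loop, using a hypothetical one-dimensional linear policy and expert for which the supervised step (7) reduces to least squares (the toy dynamics and expert are assumptions, not the paper's models):

```python
def fit(D):
    """Least-squares slope for a linear policy a = w * s."""
    den = sum(s * s for s, _ in D)
    return sum(s * a for s, a in D) / den if den else 0.0

def dagger(expert, dynamics, s0, iters=3, T=20):
    """Minimal DAgger sketch for (6)-(7): aggregate expert-labeled
    data from the learner's own rollouts, then retrain."""
    D, s = [], s0
    for _ in range(T):                 # initialize D by running the expert
        a = expert(s); D.append((s, a)); s = dynamics(s, a)
    w = fit(D)
    for _ in range(iters):
        s = s0
        for _ in range(T):             # run the current learner...
            D.append((s, expert(s)))   # ...but label states with the expert
            s = dynamics(s, w * s)
        w = fit(D)                     # retrain on the aggregated dataset
    return w

expert = lambda s: -0.5 * s            # hypothetical stabilizing expert
dynamics = lambda s, a: 0.9 * s + a
w = dagger(expert, dynamics, s0=1.0)
```

Because the learner itself chooses which states appear in later rollouts, the dataset covers the states the learned policy actually visits.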

We note here that the instantaneous cost $\hat{c}$ can be selected to be any suitable norm according to the problem’s properties. In our off-road autonomous driving task, we find the 1-norm preferable (e.g. over the 2-norm) for its ability to filter outliers in a highly stochastic environment.

#### III-B2 Batch Imitation Learning

By swapping the roles of $\pi$ and $\pi^*$ in the derivation of (5), we can derive another upper bound and use it to construct another surrogate problem: define $\tilde{c}_\pi(s^*_t, a^*_t) \coloneqq \mathbb{E}_{a_t \sim \pi}\left[\|a_t - a^*_t\|\right]$ and $C^t_\pi(s) \coloneqq \mathrm{Lip}\left(Q^t_\pi(s, \cdot)\right)$; then

$$\begin{aligned} J(\pi) - J(\pi^*) = {} & \mathbb{E}_{s^*_t \sim d_{\pi^*}}\left[\mathbb{E}_{a_t \sim \pi}\left[Q^t_\pi(s^*_t, a_t)\right] - \mathbb{E}_{a^*_t \sim \pi^*}\left[Q^t_\pi(s^*_t, a^*_t)\right]\right] \\ \le {} & \mathbb{E}_{s^*_t \sim d_{\pi^*}} \mathbb{E}_{a^*_t \sim \pi^*}\left[C^t_\pi(s^*_t)\, \tilde{c}_\pi(s^*_t, a^*_t)\right], \qquad (8) \end{aligned}$$

where we again use Lemma 1 for the equality and the property of the Wasserstein distance for the inequality. The minimization of the upper bound (8) is called the batch imitation learning problem [7, 9]:

$$\min_\pi \mathbb{E}_{\rho_{\pi^*}}\left[\sum_{t=1}^{T} \tilde{c}_\pi(s^*_t, a^*_t)\right]. \qquad (9)$$

In contrast to the surrogate problem in online imitation learning (6), batch imitation learning reduces to a supervised learning problem, because the expectation is defined by the fixed policy $\pi^*$.
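The batch problem (9) can be sketched in the same one-dimensional toy setting as above: the dataset below is generated only by the expert, and the learner never influences which states get labeled (the expert and dynamics are hypothetical, not the paper's models):

```python
def behavior_clone(expert, dynamics, s0, T=60):
    """Batch imitation learning (9) as plain supervised learning: the
    data comes only from the expert's own state distribution, and the
    learner is never executed during training."""
    D, s = [], s0
    for _ in range(T):
        a = expert(s); D.append((s, a)); s = dynamics(s, a)
    # least-squares fit of a linear policy a = w * s to the fixed dataset
    den = sum(s * s for s, _ in D)
    return sum(s * a for s, a in D) / den if den else 0.0

expert = lambda s: -0.5 * s            # hypothetical expert
dynamics = lambda s, a: 0.9 * s + a
w = behavior_clone(expert, dynamics, s0=1.0)
```

The training loop never queries the learned policy, which is exactly why, at test time, the learner can drift into states absent from the expert's distribution.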

### Iii-C Comparison of Imitation Learning Algorithms

Comparing (5) and (8), we observe that in batch imitation learning the Lipschitz constant $C^t_\pi(s)$, without $\pi$ being an expert, can be on the order of $T - t$ in the worst case. Therefore, if we take a uniform bound and define $C_\pi \coloneqq \sup_{s,\, t \in [0, T-1]} C^t_\pi(s)$, we see $C_\pi \in O(T)$. In other words, under the same assumption as in online imitation learning, i.e. that (8) can be minimized to an error that depends linearly on $T$, the difference between $J(\pi)$ and $J(\pi^*)$ in batch imitation learning can actually grow quadratically in $T$ due to error compounding. Therefore, in order to achieve the same level of performance as online imitation learning, batch imitation learning requires a more expressive policy class or more demonstration samples. As shown in [13], the quadratic bound is tight.
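The quadratic growth can be illustrated with the classic compounding-error toy model (a hypothetical two-mode chain, not derived from the paper's analysis): a batch-trained policy errs with small probability on each in-distribution step, and once off the training distribution it keeps paying cost for the rest of the horizon:

```python
def expected_cost(eps, T):
    """Expected cost in the compounding-error chain: at each on-track
    step the cloned policy errs with probability eps; once off-track
    it has left the training distribution and pays cost 1 at every
    remaining step."""
    p_on, total = 1.0, 0.0
    for _ in range(T):
        total += 1.0 - p_on        # cost accrues while off-track
        p_on *= 1.0 - eps          # probability of surviving this step
    return total

c10 = expected_cost(0.01, 10)      # scale ~ eps * T^2 / 2
c100 = expected_cost(0.01, 100)    # grows far faster than 10x
```

Scaling the horizon by 10 multiplies the expected cost by far more than 10, i.e. the dependence on $T$ is super-linear, as the bound above predicts.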

Therefore, if we can choose an expert policy that is stable in the sense of Definition 1, then online imitation learning is preferred theoretically. This is satisfied, for example, when the expert policy is an algorithm with certain performance characteristics. On the contrary, if the expert is a human, the assumptions required by online imitation learning become hard to realize in real-world off-road driving tasks. Because humans rely on real-time sensory feedback (i.e. sampling from $\rho_{\pi^*}$ rather than from $\rho_{\pi_i}$) to generate ideal expert actions, the action samples collected in the online learning approach using $\rho_{\pi_i}$ are often biased and inconsistent [19]. This is especially true in off-road driving tasks, where the human driver depends heavily on instant feedback from the car to overcome stochastic disturbances. The frame-by-frame labeling approach [17], for example, can therefore lead to a very counter-intuitive, inefficient data collection process, because the required dynamics information is lost in a single image frame. Overall, when using human demonstrations, online imitation learning can be as bad as batch imitation learning [19], purely due to the inconsistencies introduced by human nature.

## IV The Autonomous Driving System

Building on the analyses in the previous section, we design a system that can learn to perform fast off-road autonomous driving with only on-board measurements. The overall system architecture for learning end-to-end DNN driving policies is illustrated in Fig. 2. It consists of three high-level controllers (an expert, a learner, and a safety control module) and a low-level controller, which receives steering and throttle commands from the high-level controllers and translates them to pulse-width modulation (PWM) signals to drive the steering and throttle actuators of a vehicle.

On the basis of the analysis in Section III-C, we assume the expert is algorithmic and has access to expensive sensors (GPS and IMU) for accurate global state estimates (global position, heading and roll angles, linear velocities, and heading angle rate) and to abundant computational power. The expert is built on multiple hand-engineered components, including a state estimator, a dynamics model of the vehicle, a cost function of the task, and a trajectory optimization algorithm for planning (see Section IV-A). By contrast, the learner is a DNN policy that has access only to a monocular camera and wheel speed sensors and is required to output steering and throttle commands directly (see Section IV-B). In this setting, the sensors used by the learner can be significantly cheaper than those of the expert; specifically, on our experimental platform, the AutoRally car (see Section IV-C), the IMU and GPS sensors required by the expert in Section IV-A together cost more than $6,000, while the sensors used by the learner's DNN policy cost less than $500. The safety control module has the highest priority among the three controllers and is used to prevent the vehicle from crashing at high speed.

The software system was developed based on the Robot Operating System (ROS) in Ubuntu. In addition, a Gazebo-based simulation environment [20] was built using the same ROS interface but without the safety control module; the simulator was used to evaluate the performance of the system before real track tests.

### IV-A Algorithmic Expert with Model-Predictive Control

We use an MPC expert [21] based on an incremental Sparse Spectrum Gaussian Process (SSGP) dynamics model (learned from 30 minutes of driving data) and an iSAM2 state estimator [22]. To generate actions, the MPC expert solves a finite-horizon optimal control problem at every sampling time: at time $t$, the expert policy $\pi^*$ is a locally optimal policy such that

$$\pi^*(a_t | s_t) \approx \arg\min_\pi \mathbb{E}_{\rho_\pi}\left[\left.\sum_{\tau=t}^{t+T_h} c(s_\tau, a_\tau) \,\right|\, s_t\right] \qquad (10)$$

where $T_h$ is the length of the horizon it previews.

The computation is realized by a trajectory optimization algorithm, Differential Dynamic Programming (DDP) [23]: in each iteration of DDP, the system dynamics and the cost function are approximated quadratically along a nominal trajectory; the Bellman equation of the approximate problem is then solved in a backward pass to compute the control law; finally, a new nominal trajectory is generated by applying the updated control law through the dynamics model in a forward pass. Upon convergence, DDP returns a locally optimal control sequence $\{\hat{a}_t, \dots, \hat{a}_{t+T_h}\}$, and the MPC expert executes the first action in the sequence as the expert's action at time $t$ (i.e. $a^*_t = \hat{a}_t$). This process is repeated at every sampling time.
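To see the backward/forward structure concretely, the sketch below specializes the idea to a scalar linear-quadratic toy problem, where the quadratic approximation DDP makes is exact and a single backward sweep of the Bellman recursion yields the optimal feedback gains. It is an illustrative stand-in, not the expert's actual vehicle model or DDP implementation:

```python
def lqr_gains(A, B, Q, R, T):
    """Backward pass for a scalar linear-quadratic problem: solve the
    Bellman recursion for time-varying feedback gains a_t = k_t * s_t."""
    P, gains = Q, []
    for _ in range(T):
        k = -(B * P * A) / (R + B * P * B)   # minimizer of the local Q-function
        P = Q + A * P * A + A * P * B * k    # cost-to-go recursion
        gains.append(k)
    return gains[::-1]                       # reorder to t = 0, ..., T-1

def rollout(A, B, Q, R, s0, gains):
    """Forward pass: apply the control law through the dynamics."""
    s, total = s0, 0.0
    for k in gains:
        a = k * s
        total += Q * s * s + R * a * a
        s = A * s + B * a
    return total

A, B, Q, R, T = 1.1, 1.0, 1.0, 0.1, 20       # hypothetical toy problem
gains = lqr_gains(A, B, Q, R, T)
J_lqr = rollout(A, B, Q, R, 1.0, gains)
J_zero = rollout(A, B, Q, R, 1.0, [0.0] * T)
```

On a nonlinear model, DDP repeats this backward/forward pair around successive nominal trajectories until convergence.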

In view of the analysis in Section III-B, we can assume that the MPC expert satisfies Definition 1, because it updates the approximate solution to the original RL problem (1) at high frequency using global state information. However, because MPC requires replanning at every step, running the expert policy (10) online consumes significantly more computational power than is required by the learner.

### Iv-B Learning a DNN Control Policy

The learner’s control policy is parametrized by a DNN containing 10 million parameters. As illustrated in Fig. 3, the DNN policy consists of two sub-networks: a convolutional neural network (CNN) with 6 convolutional layers, 3 max-pooling layers, and 2 fully-connected layers, which takes re-scaled RGB monocular images as input, and a feedforward network with one fully-connected hidden layer, which takes wheel speeds as input. The convolutional and max-pooling layers extract lower-dimensional features from the images. All layers except the last use rectified linear unit (ReLU) activation. The max-pooling layers reduce the spatial size of the representation (and therefore the number of parameters and the computational load). The two sub-networks are concatenated and followed by another fully-connected hidden layer. The structure of this DNN was selected empirically, based on experimental studies of several different architectures.
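The effect of convolution and max-pooling on the spatial size can be traced with the standard output-size formula; the kernel sizes and strides below are illustrative assumptions, not the paper's exact architecture:

```python
def out_size(size, k, stride=1, pad=0):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - k) // stride + 1

# Hypothetical 6-conv / 3-max-pool stack over a 160-pixel-wide input,
# given as (kernel, stride) per layer.
layers = [(3, 1), (3, 1), (2, 2),    # conv, conv, pool
          (3, 1), (3, 1), (2, 2),    # conv, conv, pool
          (3, 1), (3, 1), (2, 2)]    # conv, conv, pool

sizes = [160]
for k, stride in layers:
    sizes.append(out_size(sizes[-1], k, stride))
```

Each pooling stage roughly halves the width, which is how the stack shrinks the image to a representation small enough for the fully-connected layers.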

In the construction of the surrogate problem for imitation learning, the action space is equipped with the 1-norm $\|\cdot\|_1$ to filter outliers, and the optimization problem, (7) or (9), is solved using ADAM [24], a stochastic gradient descent algorithm with an adaptive learning rate. Note that while the state $s_t$ or $s^*_t$ appears in (7) or (9), the neural network policy does not use the state, but rather the synchronized raw observation, as its input. We did not apply any data selection or augmentation techniques in any of the experiments; the only pre-processing was scaling and cropping the raw images.
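A minimal sketch of ADAM minimizing an l1 imitation loss, shown here on a hypothetical scalar linear policy to illustrate why the 1-norm tolerates outlier labels (the dataset and hyperparameters are illustrative, not the paper's):

```python
import math

def adam_l1(data, steps=5000, lr=0.01):
    """ADAM minimizing the l1 imitation loss E|w*s - a*| for a scalar
    linear policy -- a toy stand-in for solving (7)/(9)."""
    w, m, v = 0.0, 0.0, 0.0
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        # full-batch subgradient of the l1 loss
        g = sum(s if w * s - a > 0 else -s for s, a in data) / len(data)
        m = b1 * m + (1 - b1) * g            # first-moment estimate
        v = b2 * v + (1 - b2) * g * g        # second-moment estimate
        m_hat = m / (1 - b1 ** t)            # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Expert labels a* = 2*s, plus one gross outlier the l1 loss shrugs off.
data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 50.0)]
w = adam_l1(data)
```

Despite the outlier label, the l1 objective is still minimized near the slope that fits the clean majority, which a squared loss would not achieve.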

### Iv-C The Autonomous Driving Platform

To validate our imitation learning approach to off-road autonomous driving, the system was implemented on a custom-built, 1/5-scale autonomous AutoRally car (weight 22 kg; length, width, and height 1 m, 0.6 m, and 0.4 m), shown in the top figure of Fig. 4. The car was equipped with an ASUS mini-ITX motherboard, an Intel quad-core i7 CPU, 16 GB RAM, an Nvidia GTX 750 Ti GPU, and an 11,000 mAh battery. For sensors, the car was instrumented with two forward-facing machine vision cameras (only one of which was used in this work), a Hemisphere Eclipse P307 GPS module, a Lord Microstrain 3DM-GX4-25 IMU, and Hall effect wheel speed sensors. In addition, an RC transmitter allowed a human to remotely control the vehicle, and a physical run-stop button was installed to disable all motion in case of emergency.

In the experiments, all computation was executed on board the vehicle in real time. In addition, an external laptop was used to communicate with the on-board computer remotely via Wi-Fi to monitor the vehicle’s status. Observations were sampled and actions were executed at 50 Hz to account for the high speed of the vehicle and the stochasticity of the environment. Note that this control frequency is significantly higher than in [7] (10 Hz), [9] (12 Hz), and [10] (15 Hz).

## V Experimental Setup

We tested the proposed imitation learning system of Section IV on a high-speed navigation task with a desired speed of 7.5 m/s. The performance index of the task was formulated as the cost function of the finite-horizon RL problem (1) with

$$c(s_t, a_t) = \alpha_1 \mathrm{cost}_{\mathrm{pos}}(s_t) + \alpha_2 \mathrm{cost}_{\mathrm{spd}}(s_t) + \alpha_3 \mathrm{cost}_{\mathrm{slip}}(s_t) + \alpha_4 \mathrm{cost}_{\mathrm{act}}(a_t), \qquad (11)$$

in which $\mathrm{cost}_{\mathrm{pos}}$ favors the vehicle staying in the middle of the track, $\mathrm{cost}_{\mathrm{spd}}$ drives the vehicle to reach the desired speed, $\mathrm{cost}_{\mathrm{slip}}$ stabilizes the car against slipping, and $\mathrm{cost}_{\mathrm{act}}$ inhibits large control commands (see the Appendix for details).

The goal of the high-speed navigation task is to minimize the accumulated cost over one minute of continuous driving; that is, under the 50-Hz sampling rate, the task horizon was set to $T = 3{,}000$. The cost (11) was given to the MPC expert in Fig. 2 to perform online trajectory optimization with a two-second preview horizon. In the experiments, the weights $\alpha_1, \dots, \alpha_4$ in (11) were set so that the MPC expert in Section IV-A could perform reasonably well, and the learner's policy was trained by online/batch imitation learning in an attempt to match the expert's performance.

### V-B Test Track

All the experiments were performed on an elliptical dirt track, shown in the bottom figure of Fig. 4, with the AutoRally car described in Section IV-C. The test track was 3 m wide and 30 m long and built with fill dirt. Its boundaries were surrounded by soft HDPE tubes, detached from the ground, for safety during experimentation. Due to the changing dirt surface, debris from the track’s natural surroundings, and the shifting track boundaries after car crashes, the track condition and vehicle dynamics can change from one experiment to the next, adding to the complexity of learning a robust policy.

### V-C Data Collection

Training data was collected in two ways. In batch imitation learning, the MPC expert was executed, and the camera images, wheel speed readings, and corresponding steering and throttle commands were recorded. In online imitation learning, a mixture of the expert's and the learner's policies was used to collect training data (camera images, wheel speeds, and expert actions): in the $i$th iteration of DAgger, a mixed policy $\pi^{\mathrm{mix}}_i = \beta_i \pi^* + (1 - \beta_i)\pi_i$ was executed at each time step, where $\pi_i$ is the learner's DNN policy after $i$ DAgger iterations and $\beta_i$ is the probability of executing the expert policy. The use of a mixture policy was suggested in [13] for better stability. A mixing rate of $\beta_i = 0.6^i$ was used in our experiments, so the probability of using the expert decayed exponentially as the number of DAgger iterations increased. Experimental data was collected on an outdoor track under changing lighting conditions and environmental dynamics. Rollouts about to crash were terminated remotely by overriding the autonomous control commands with the run-stop button or the RC transmitter in the safety control module; these rollouts were excluded from the data collection.
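The mixed rollout policy can be sketched as follows (the helper and its signature are illustrative, not from the paper's codebase):

```python
import random

def mixed_action(i, s, expert, learner, rate=0.6, rng=random):
    """DAgger's mixed rollout policy: at iteration i, take the expert's
    action with probability beta_i = rate**i, else the learner's."""
    beta_i = rate ** i
    return expert(s) if rng.random() < beta_i else learner(s)

# Expert-execution probabilities over three DAgger iterations.
betas = [0.6 ** i for i in range(1, 4)]    # 0.6, 0.36, 0.216
```

Regardless of which policy acts, the expert's action for the visited state is what gets recorded as the training label.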

### V-D Policy Learning

In online imitation learning, three iterations of DAgger were performed. At each iteration, the robot executed one rollout using the mixed policy described above (the probabilities of executing the expert policy were 60%, 36%, and 21%, respectively). For a fair comparison, the amount of training data used in batch imitation learning was the same as the total collected over the three iterations of online imitation learning.

At each training phase, the optimization problem (7) or (9) was solved by ADAM for 20 epochs, with mini-batch size 64 and a learning rate of 0.001. Dropout was applied at all fully-connected layers to avoid over-fitting (with drop probability 0.5 for the first fully-connected layer and 0.25 for the rest); see Section IV-B for details. Finally, after each policy's entire learning session, three rollouts were performed with the learned policy for performance evaluation.

## VI Experimental Results

### VI-A Online vs Batch Learning

We first study the performance of training a control policy with online and batch imitation learning algorithms. Fig. 5 illustrates the vehicle trajectories of different policies. Due to accumulating errors, the policy trained with batch imitation learning crashed into the lower-left boundary, an area of the state-action space rarely explored in the expert’s demonstrations. By contrast, online imitation learning let the policy learn to cope with such corner cases, as the learned policy occasionally ventured into new areas of the state-action space.

Fig. 6 shows the performance in terms of distance traveled without crashing (we used the safety control module shown in Fig. 2 to manually terminate a rollout when the car crashed into the soft boundary), and Table II shows the statistics of the experimental results. Overall, DNN policies trained with both online and batch imitation learning algorithms were able to achieve a similar speed to the MPC expert. However, with the same amount of training data, the policies trained with online imitation learning generally outperformed those trained with batch imitation learning. In particular, the policies trained using online imitation learning achieved better performance in terms of both completion ratio and imitation loss. It is worth noting that the traveled distance of the policy learned with a batch of 3,000 samples was longer than that of the other batch learning policies. As shown in Table II, this is mainly because this policy achieved better steering performance than throttle performance. As a result, although the vehicle was able to navigate without crashing, it actually traveled at a much slower speed. By contrast, the batch learning policies that used more data had better throttle performance but worse steering performance, resulting in faster speeds but a higher chance of crashing.

### Vi-B Deep Neural Network Policy

One main feature of a DNN policy is that it can learn to extract both low-level and high-level features of an image and automatically detect the parts that have greater influence on steering and throttle. We validate this idea by showing in Fig. 7 the averaged feature map at each max-pooling layer (see Fig. 3), where each pixel represents the averaged unit activation across different filter outputs. We can observe that at a deeper level, the detected salient objects are boundaries of the track and parts of a building. Grass and dirt contribute little to the DNN’s output.

We also analyze the importance of incorporating wheel speeds in our task. We compare the performance of our full DNN policy with that of a policy based on only the CNN sub-network (without wheel-speed inputs), both trained with batch imitation learning. The data was collected in accordance with Section V-C. Fig. 8 shows the batch imitation learning loss in (9) for the different network architectures. The full DNN policy in Fig. 3 consistently achieved better performance. While images contain position information, they are insufficient to infer velocities; we therefore conjecture that state-of-the-art CNNs (e.g. [7]) cannot be directly used in this task. By contrast, even without a recurrent architecture, our DNN policy learned to combine wheel speeds with the CNN features to infer the hidden state and achieve better performance.

## VII Conclusion

We introduce an end-to-end learning system that learns a deep neural network driving policy mapping raw on-board observations to steering and throttle commands by mimicking an expert’s behavior. We investigate both online and batch learning frameworks theoretically and empirically and propose an imitation learning system. In real-world experiments, our system was able to perform fast off-road navigation autonomously at an average speed of 6 m/s and a top speed of 8 m/s, while using only a low-cost monocular camera and wheel speed sensors. Our current and future work includes developing more complex policy representations, such as recurrent neural networks, and improving robustness to visual distractions.

## Appendix

The position cost for the high-speed navigation task is a 16-term cubic function of the vehicle’s global position $(x, y)$:

$$\begin{aligned}
\mathrm{cost}_{\mathrm{pos}}(s) ={}& c_0 + c_1 y + c_2 y^2 + c_3 y^3 + c_4 x + c_5 x y + c_6 x y^2 + c_7 x y^3 \\
&+ c_8 x^2 + c_9 x^2 y + c_{10} x^2 y^2 + c_{11} x^2 y^3 \\
&+ c_{12} x^3 + c_{13} x^3 y + c_{14} x^3 y^2 + c_{15} x^3 y^3.
\end{aligned}$$

The coefficients in this cost function were identified by performing a regression to fit the track’s boundary: first, a thorough GPS survey of the track was taken. Points along the inner and the outer boundaries were each assigned a fixed cost value, resulting in a zero-cost path along the center of the track. The coefficient values were then determined by a least-squares regression of the polynomial terms to fit the boundary data.
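The least-squares fit described above can be reproduced schematically as follows. The function names and the synthetic boundary data in the test are illustrative placeholders; only the 16-monomial feature construction mirrors the cost function:

```python
import numpy as np

def cubic_features(x, y):
    """The 16 monomials x^i * y^j, for i, j in {0, 1, 2, 3},
    matching the terms of the position cost."""
    x, y = np.asarray(x), np.asarray(y)
    return np.stack([x**i * y**j for i in range(4) for j in range(4)],
                    axis=-1)

def fit_cost_coefficients(xs, ys, targets):
    """Least-squares regression for c_0..c_15, given surveyed
    boundary points (xs, ys) with their assigned cost values."""
    A = cubic_features(xs, ys)
    c, *_ = np.linalg.lstsq(A, np.asarray(targets), rcond=None)
    return c
```

Evaluating `cubic_features(x, y) @ c` at a query position then yields the position cost used by the controller.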

The speed cost is a quadratic function that penalizes the difference between the desired speed and the longitudinal velocity in the body frame. The side slip angle cost penalizes the slip angle computed from the lateral and longitudinal velocities in the body frame. The action cost is a quadratic function of the steering and the throttle commands. The desired speed and the cost weights were selected empirically for the experiments.

## References

• [1] J. Michels, A. Saxena, and A. Y. Ng, “High speed obstacle avoidance using monocular vision and reinforcement learning,” in International Conference on Machine learning, 2005, pp. 593–600.
• [2] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, “Aggressive driving with model predictive path integral control,” in IEEE International Conference on Robotics and Automation, 2016, pp. 1433–1440.
• [3] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. Rehg, B. Boots, and E. Theodorou, “Information theoretic MPC for model-based reinforcement learning,” in IEEE International Conference on Robotics and Automation, 2017.
• [4] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
• [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, Jan. 2016.
• [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
• [7] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, and U. Muller, “Explaining how a deep neural network trained with end-to-end learning steers a car,” arXiv preprint arXiv:1704.07911, 2017.
• [8] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems, 1989, pp. 305–313.
• [9] V. Rausch, A. Hansen, E. Solowjow, C. Liu, E. Kreuzer, and J. K. Hedrick, “Learning a deep neural net policy for end-to-end control of autonomous vehicles,” in IEEE American Control Conference, 2017.
• [10] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road obstacle avoidance through end-to-end learning,” in Advances in Neural Information Processing Systems, 2006, pp. 739–746.
• [11] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end autonomous driving,” arXiv preprint arXiv:1605.06450, 2016.
• [12] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “Plato: Policy learning using adaptive trajectory optimization,” in IEEE International Conference on Robotics and Automation, 2017, pp. 3342–3349.
• [13] S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in International Conference on Artificial Intelligence and Statistics, vol. 1, no. 2, 2011, p. 6.
• [14] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg, “Aggressive deep driving: Model predictive control with a CNN cost model,” arXiv preprint arXiv:1707.05303, 2017.
• [15] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in International Conference on Machine Learning, vol. 2, 2002, pp. 267–274.
• [16] S. Shalev-Shwartz et al., “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
• [17] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive UAV control in cluttered natural environments,” in IEEE International Conference on Robotics and Automation, 2013, pp. 1765–1772.
• [18] A. L. Gibbs and F. E. Su, “On choosing and bounding probability metrics,” International Statistical Review, vol. 70, no. 3, pp. 419–435, 2002.
• [19] M. Laskey, C. Chuck, J. Lee, J. Mahler, S. Krishnan, K. Jamieson, A. Dragan, and K. Goldberg, “Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations,” arXiv preprint arXiv:1610.00850, 2016.
• [20] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.
• [21] Y. Pan, X. Yan, E. A. Theodorou, and B. Boots, “Prediction under uncertainty in sparse spectrum Gaussian processes with applications to filtering and control,” in International Conference on Machine Learning, 2017, pp. 2760–2768.
• [22] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “iSAM2: Incremental smoothing and mapping using the Bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
• [23] Y. Tassa, T. Erez, and W. D. Smart, “Receding horizon differential dynamic programming,” in Advances in Neural Information Processing Systems, 2008, pp. 1465–1472.
• [24] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.