Agile Off-Road Autonomous Driving Using
End-to-End Deep Imitation Learning
Abstract
We present an end-to-end imitation learning system for agile, off-road autonomous driving using only low-cost on-board sensors. By imitating an optimal controller, we train a deep neural network control policy to map raw, high-dimensional observations to continuous steering and throttle commands, the latter of which is essential to successfully drive on varied terrain at high speed. Compared with recent approaches to similar tasks, our method requires neither state estimation nor online planning to navigate the vehicle. Real-world experimental results demonstrate successful autonomous off-road driving, matching state-of-the-art performance.
I. Introduction
High-speed autonomous off-road driving is a challenging robotics problem [1, 2, 3] (Fig. 1). To succeed in this task, a robot is required to perform both precise steering and throttle maneuvers in a physically-complex, uncertain environment by executing a series of high-frequency decisions. Compared with most previously studied autonomous driving tasks, the robot must reason about unstructured, stochastic natural environments and operate at high speed. Consequently, designing a control policy by following the traditional model-plan-then-act approach [1, 4] becomes challenging, as it is difficult to adequately characterize the robot's interaction with the environment a priori.
This task has been considered previously, for example, by Williams et al. [2, 3] using model-predictive control (MPC). While the authors demonstrate impressive results, their control scheme relies on expensive, accurate Global Positioning System (GPS) and Inertial Measurement Unit (IMU) sensors for state estimation and demands high-frequency online replanning for generating control commands. Due to these costly hardware requirements, their robot can only operate in a rather controlled environment.
We aim to relax these requirements by designing a reflexive driving policy that uses only low-cost, on-board sensors (e.g. camera, wheel speed sensors). Building on the success of deep reinforcement learning (RL) [5, 6], we adopt deep neural networks (DNNs) to parametrize the control policy and learn the desired parameters from the robot's interaction with its environment. While the use of DNNs as policy representations for RL is not uncommon, in contrast to most previous work that showcases RL in simulated environments [6], our agent is a high-speed physical system that incurs real-world cost: collecting data is a cumbersome process, and a single poor decision can physically impair the robot and result in weeks of time lost while replacing parts and repairing the platform. Therefore, direct application of model-free RL techniques is not only sample inefficient, but costly and dangerous in our experiments.
Table I: Comparison of imitation learning approaches to autonomous driving.

Methods | Tasks | Observations | Action | Algorithm | Experiment
[7] | On-road low-speed | Single image | Steering | Batch | Real & simulated
[8] | On-road low-speed | Single image + laser | Steering | Batch | Real & simulated
[9] | On-road low-speed | Single image | Steering | Batch | Simulated
[10] | Off-road low-speed | Left & right images | Steering | Batch | Real
[11] | On-road unknown speed | Single image | Steering + brake | Online | Simulated
Our Method | Off-road high-speed | Single image + wheel speeds | Steering + throttle | Batch & online | Real
These real-world factors motivate us to adopt imitation learning [8] to optimize the control policy instead. A major benefit of using imitation learning is that we can leverage domain knowledge through expert demonstrations. This is particularly convenient, for example, when there already exists an autonomous driving platform built through classic system engineering principles. While such a system (e.g. the MPC controller using a pretrained dynamics model and a state estimator based on high-end sensors in [2]) usually requires expensive sensors and dedicated computational resources, with imitation learning we can train a lower-cost robot to behave similarly, without carrying the expert's hardware burdens over to the learner. Note that here we assume the expert is given as a black-box oracle that can provide the desired actions when queried, as opposed to the case considered in [12], where the expert can be modified to accommodate the learning progress.
In this work, we present an imitation learning system for real-world high-speed off-road driving tasks. By leveraging demonstrations from an algorithmic expert, our system can learn a control policy that achieves performance similar to the expert's. The system was implemented on a 1/5-scale autonomous AutoRally car. In real-world experiments, we show that the AutoRally car—without any state estimator or online planning, but with a DNN policy that directly takes as input sensor measurements from a low-cost monocular camera and wheel speed sensors—could learn to perform high-speed navigation at an average speed of 6 m/s and a top speed of 8 m/s, matching the state of the art [3].
II. Related Work
End-to-end learning for self-driving cars has been explored since the late 1980s. Autonomous Land Vehicle in a Neural Network (ALVINN) [8] was developed to learn steering angles directly from camera and laser range measurements using a neural network with a single hidden layer. Based on similar ideas, modern self-driving cars [10, 7, 9] have recently started to employ a batch imitation learning approach: parameterizing control policies with DNNs, these systems require only expert demonstrations during the training phase and on-board measurements during the testing phase. For example, Nvidia's PilotNet [7], a convolutional neural network that outputs steering angle given an image, was trained to mimic human drivers' reactions to visual input with demonstrations collected in real-world road tests. An online imitation learning algorithm for autonomous driving related to Dataset Aggregation (DAgger) [13] was recently demonstrated in [11], but only in simulated environments.
Our problem differs substantially from these previous on-road driving tasks. We study autonomous driving on a fixed set of dirt tracks, whereas on-road driving must perform well in a larger domain and contend with moving objects such as cars and pedestrians. While on-road driving in urban environments may seem more difficult, our agent must overcome challenges of a different nature. It is required to drive at high speed, and prominent visual features such as lane markers are absent. Compared with paved roads, the surfaces of our dirt tracks are constantly evolving and highly stochastic. As a result, successful high-speed driving in our task requires high-frequency application of both steering and throttle commands, whereas previous work focuses only on steering commands [10, 7, 9]. A comparison of different imitation learning approaches to autonomous driving is presented in Table I.
Our task is most similar to the task considered by Williams et al. [2, 3] and Drews et al. [14]. Compared with a DNN policy, their MPC approach has several drawbacks: computationally expensive optimization for planning must be performed online at high frequency, and the learning component is not end-to-end. In [2, 3], accurate GPS and IMU feedback is also required for state estimation, which may not contain sufficient information to contend with the changing environment in off-road driving tasks. While the requirement on GPS and IMU is relaxed by using a vision-based cost map in [14], a large dataset (300,000 images) was used to train the model, expensive on-the-fly planning is still required, and speed performance is compromised. In contrast to previous work, our approach offloads the hardware requirements to an expert. While the expert may use high-quality sensors and more computational power, our agent only needs access to on-board sensors, and its control policy can run reactively at high frequency, without on-the-fly planning or optimization. Additionally, our experimental results match those in [2, 3] and are faster and more data efficient than those in [14].
III. Imitation Learning for Autonomous Driving
To design a policy for off-road autonomous driving, we introduce a policy optimization problem and then show how a policy can be learned by imitation learning. We discuss the strengths and weaknesses of deploying a batch or online imitation learning algorithm for our task.
III-A Problem Definition
To mathematically formulate the autonomous driving task, it is natural to consider a discrete-time, continuous-valued RL problem. Let $\mathbb{S}$, $\mathbb{A}$, and $\mathbb{O}$ be the state, action, and observation spaces. In our setting, the state space is unknown to the agent; observations consist of on-board measurements, including a monocular RGB image from the front-view camera and wheel speeds from Hall effect sensors; actions consist of continuous-valued steering and throttle commands.
The goal is to find a stationary, deterministic policy $\pi$ (e.g. a DNN policy) such that $\pi$ achieves low accumulated cost over a finite horizon of length $T$,

$\min_{\pi} J(\pi), \quad J(\pi) := \mathbb{E}_{\rho_{\pi}}\big[ \textstyle\sum_{t=0}^{T-1} c(s_t, a_t) \big]$ (1)

in which $\rho_{\pi}$ is the distribution of $s_t$, $a_t$, and $o_t$ under policy $\pi$, for $t = 0, \dots, T-1$. Here $c$ is the instantaneous cost, which, for example, encourages maximal-speed driving while staying on the track. For notation, we denote $Q^{\pi}_t$ as the Q-function at time $t$ under policy $\pi$ and $V^{\pi}_t(s) = \mathbb{E}_{a \sim \pi | s}\big[ Q^{\pi}_t(s, a) \big]$ as its associated value function.
III-B Imitation Learning
Directly optimizing (1) is challenging for high-speed off-road autonomous driving. Since our task involves a physical robot, model-free RL techniques are intolerably sample inefficient and risk permanently damaging the car when a partially-optimized policy is applied during exploration. Although model-based RL may require fewer samples, it can lead to suboptimal, potentially unstable results, because it is difficult for a model that uses only on-board measurements to fully capture the complex dynamics of off-road driving.
Considering these limitations, we propose to solve for the policy by imitation learning. We assume access to an oracle policy, or expert, $\pi^*$ that generates demonstrations during the training phase and relies on resources that are unavailable in the testing phase, e.g. additional sensors and computation. For example, the expert can be a computationally intensive optimal controller that relies on exteroceptive sensors not available at test time (e.g. GPS for state estimation), or an experienced human driver.
The goal of imitation learning is to perform as well as the expert, with an error that has at most linear dependency on $T$. Formally, we introduce a lemma due to Kakade and Langford [15] and define what we mean by an expert.
Lemma 1.
Define $d_{\pi}(s) := \sum_{t=0}^{T-1} d_{\pi, t}(s)$ as an unnormalized stationary state distribution, where $d_{\pi, t}$ is the distribution of the state at time $t$ when running policy $\pi$. Let $\pi$ and $\pi'$ be two policies. Then

$J(\pi) = J(\pi') + \mathbb{E}_{s \sim d_{\pi}} \mathbb{E}_{a \sim \pi | s}\big[ A^{\pi'}_t(s, a) \big]$ (2)

where $A^{\pi'}_t(s, a) = Q^{\pi'}_t(s, a) - V^{\pi'}_t(s)$ is the advantage function at time $t$ with respect to running $\pi'$.
Definition 1.
A policy $\pi^*$ is called an expert to problem (1) if $C_{\pi^*} := \sup_{t \in \{0, \dots, T-1\},\, s \in \mathbb{S}} \mathrm{Lip}\big( Q^{\pi^*}_t(s, \cdot) \big) \in O(1)$, independent of $T$, where $\mathrm{Lip}(f)$ denotes the Lipschitz constant of a function $f$ and $Q^{\pi^*}_t$ is the Q-function at time $t$ of running policy $\pi^*$.
The idea behind Definition 1 is that an expert policy should perform stably under arbitrary action perturbation with respect to the cost function $c$, regardless of where it starts. (We define the expert here using a uniform Lipschitz constant because the action space in our task is continuous; for discrete action spaces, $\mathrm{Lip}(Q^{\pi^*}_t(s, \cdot))$ can be replaced by $\sup_{a} Q^{\pi^*}_t(s, a)$ and the rest applies.) As we will see in Section III-C, this requirement provides guidance for whether to choose batch or online learning to train a policy by imitation.
III-B1 Online Imitation Learning
We now present the objective function for the online learning [16] approach to imitation learning, which simplifies the derivation in [13] and extends it to the continuous action spaces required by the autonomous driving task. Although our goal here is not to introduce a new algorithm, but rather to give a concise introduction to online imitation learning, we found that a connection between online imitation learning and DAgger-like algorithms [13] in continuous domains had not been formally introduced; DAgger has only been used heuristically in these domains, as in [17, 11].
Assume $\pi^*$ is an expert to (1) and suppose $\mathbb{A}$ is a normed space with norm $\| \cdot \|$. Let $D_W$ denote the Wasserstein metric [18]: for two probability distributions $p$ and $q$ defined on a metric space $\mathcal{M}$ with metric $d$,

$D_W(p, q) := \sup_{f:\, \mathrm{Lip}(f) \le 1} \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)]$ (3)

$D_W(p, q) \;= \inf_{\gamma \in \Gamma(p, q)} \int_{\mathcal{M} \times \mathcal{M}} d(x, y)\, d\gamma(x, y)$ (4)

where $\Gamma(p, q)$ denotes the family of joint distributions whose marginals are $p$ and $q$. It can be shown by the Kantorovich-Rubinstein theorem that the above two definitions are equivalent [18]. These assumptions allow us to construct a surrogate problem that is comparatively easier to solve than (1). We achieve this by upper-bounding the difference between the performance of $\pi$ and $\pi^*$ given in Lemma 1:
$J(\pi) - J(\pi^*) = \mathbb{E}_{s \sim d_{\pi}}\big[ \mathbb{E}_{a \sim \pi | s}[Q^{\pi^*}_t(s, a)] - \mathbb{E}_{a^* \sim \pi^* | s}[Q^{\pi^*}_t(s, a^*)] \big] \le C_{\pi^*} \mathbb{E}_{s \sim d_{\pi}}\big[ D_W(\pi, \pi^*) \big] \le C_{\pi^*} \mathbb{E}_{s \sim d_{\pi}} \mathbb{E}_{a \sim \pi | s} \mathbb{E}_{a^* \sim \pi^* | s}\big[ \| a - a^* \| \big]$ (5)

where we invoke the definition of the advantage function $A^{\pi^*}_t(s, a) = Q^{\pi^*}_t(s, a) - V^{\pi^*}_t(s)$ for the equality, and the first and the second inequalities are due to (3) and (4), respectively.

Define $\tilde{c}(s, a) := \mathbb{E}_{a^* \sim \pi^* | s}\big[ \| a - a^* \| \big]$. Thus, to make $\pi$ perform as well as $\pi^*$, we can minimize the upper bound in (5), which is equivalent to solving a surrogate RL problem

$\min_{\pi} \mathbb{E}_{\rho_{\pi}}\big[ \textstyle\sum_{t=0}^{T-1} \tilde{c}(s_t, a_t) \big]$ (6)
The optimization problem (6) is called the online imitation learning problem. This surrogate problem is comparatively more structured than the original RL problem (1), so we can adopt algorithms with provable performance guarantees. In this paper, we use the meta-algorithm DAgger [13], which reduces (6) to a sequence of supervised learning problems. Let $\mathcal{D}$ denote the training data. DAgger initializes $\mathcal{D}$ with samples gathered by running $\pi^*$. Then, in the $i$th iteration, it trains $\pi_i$ by supervised learning,

$\pi_i = \arg\min_{\pi} \mathbb{E}_{\hat{d}_{\mathcal{D}}}\big[ \tilde{c}(s, \pi(s)) \big]$ (7)

where the subscript $\hat{d}_{\mathcal{D}}$ denotes the empirical data distribution. Next, DAgger runs $\pi_i$ to collect more data, which is added into $\mathcal{D}$ to train $\pi_{i+1}$. The procedure is repeated for $N$ iterations and the best policy, in terms of (6), is returned. Suppose the policy is linearly parametrized. Since our instantaneous cost $\tilde{c}(s, \cdot)$ is strongly convex, the theoretical analysis of DAgger applies. Therefore, together with the assumption that $\pi^*$ is an expert, running DAgger to solve (6) finds a policy $\pi$ with performance $J(\pi) \le J(\pi^*) + O(T)$, achieving our initial goal.
We note here that the instantaneous cost $\tilde{c}(s, \cdot)$ can be selected to be any suitable norm according to the problem's properties. In our off-road autonomous driving task, we find the $l_1$ norm preferable (e.g. over the $l_2$ norm) for its ability to filter outliers in a highly stochastic environment.
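To make the reduction concrete, below is a minimal sketch of the DAgger loop described above, written in Python. The `rollout`, `expert`, and `learner` interfaces are hypothetical placeholders rather than the paper's implementation; the data-aggregation structure and the supervised step follow (6)–(7).

```python
import numpy as np

def dagger(rollout, expert, learner, n_iter, horizon):
    """Minimal DAgger loop (sketch).

    rollout(policy, horizon) -> list of observations visited by `policy`
    expert(obs)              -> expert action label (black-box oracle)
    learner                  -> object with fit(X, Y), act(obs), clone(),
                                and validation_loss() (hypothetical API)
    """
    # Initialize the aggregated dataset D with expert demonstrations.
    obs = rollout(expert, horizon)
    data_obs = list(obs)
    data_act = [expert(o) for o in obs]

    policies = []
    for i in range(n_iter):
        # Supervised step (7): fit the learner on all data gathered so far,
        # minimizing the l1 imitation loss against the expert's actions.
        learner.fit(np.asarray(data_obs), np.asarray(data_act))
        policies.append(learner.clone())

        # Run the current learner so states come from its own distribution,
        # then query the expert for labels on those states and aggregate.
        obs = rollout(learner.act, horizon)
        data_obs.extend(obs)
        data_act.extend([expert(o) for o in obs])

    # Return the policy with the lowest held-out imitation loss, per (6).
    return min(policies, key=lambda p: p.validation_loss())
```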
III-B2 Batch Imitation Learning
By swapping the roles of $\pi$ and $\pi^*$ in the derivation of (5), we can derive another upper bound and use it to construct another surrogate problem: define $C_{\pi} := \sup_{t, s} \mathrm{Lip}\big( Q^{\pi}_t(s, \cdot) \big)$ and $\hat{c}(s, a^*) := \mathbb{E}_{a \sim \pi | s}\big[ \| a^* - a \| \big]$; then

$J(\pi) - J(\pi^*) = \mathbb{E}_{s \sim d_{\pi^*}}\big[ \mathbb{E}_{a \sim \pi | s}[Q^{\pi}_t(s, a)] - \mathbb{E}_{a^* \sim \pi^* | s}[Q^{\pi}_t(s, a^*)] \big] \le C_{\pi} \mathbb{E}_{s \sim d_{\pi^*}} \mathbb{E}_{a^* \sim \pi^* | s}\big[ \hat{c}(s, a^*) \big]$ (8)

where we use Lemma 1 again for the equality and the property of the Wasserstein distance for the inequality. The minimization of the upper bound in (8) is called the batch imitation learning problem [7, 9]:

$\min_{\pi} \mathbb{E}_{\rho_{\pi^*}}\big[ \textstyle\sum_{t=0}^{T-1} \hat{c}(s_t, a^*_t) \big]$ (9)
In contrast to the surrogate problem in online imitation learning (6), batch imitation learning reduces to a supervised learning problem, because the expectation in (9) is defined by a fixed policy $\pi^*$.
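For contrast, here is a sketch of the batch counterpart under the same hypothetical interfaces as the online sketch above: all states come from the expert's own distribution $\rho_{\pi^*}$, and the learner is fit once.

```python
def batch_imitation(rollout, expert, learner, n_rollouts, horizon):
    """Batch imitation learning (behavior cloning) sketch: states are
    sampled only from the expert's distribution, so the surrogate (9)
    is a single supervised learning problem."""
    data_obs, data_act = [], []
    for _ in range(n_rollouts):
        obs = rollout(expert, horizon)          # demonstrations under pi*
        data_obs.extend(obs)
        data_act.extend([expert(o) for o in obs])
    learner.fit(data_obs, data_act)             # one-shot supervised fit
    return learner
```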
III-C Comparison of Imitation Learning Algorithms
Comparing (5) and (8), we observe that in batch imitation learning the Lipschitz constant $\mathrm{Lip}\big( Q^{\pi}_t(s, \cdot) \big)$, without $\pi$ being an expert, can be on the order of $T - t$ in the worst case. Therefore, if we take a uniform bound and define $C_{\pi} := \sup_{t, s} \mathrm{Lip}\big( Q^{\pi}_t(s, \cdot) \big)$, we see that $C_{\pi} \in O(T)$. In other words, under the same assumption as in online imitation learning, i.e. that (8) can be minimized to an error that depends linearly on $T$, the difference between $J(\pi)$ and $J(\pi^*)$ in batch imitation learning can actually grow quadratically in $T$ due to error compounding. Therefore, in order to achieve the same level of performance as online imitation learning, batch imitation learning requires a more expressive policy class or more demonstration samples. As shown in [13], the quadratic bound is tight.
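Schematically, the two regimes can be summarized as follows, where $\epsilon$ denotes the achievable per-step imitation error on the respective training distribution; this restates the standard bounds of [13] rather than a result specific to our setting:

$\text{online (6):}\quad J(\pi) \le J(\pi^*) + O(T\,\epsilon), \qquad \text{batch (9):}\quad J(\pi) \le J(\pi^*) + O(T^2\,\epsilon).$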
Therefore, if we can choose an expert policy that is stable in the sense of Definition 1, then online imitation learning is theoretically preferred. This is satisfied, for example, when the expert policy is an algorithmic controller with known performance characteristics. On the contrary, if the expert is a human, the assumptions required by online imitation learning become hard to realize in real-world off-road driving tasks. Because humans rely on real-time sensory feedback (as obtained when sampling from $\rho_{\pi^*}$, but not from $\rho_{\pi}$) to generate ideal expert actions, the action samples collected with the online learning approach, which queries the expert on states from $\rho_{\pi}$, are often biased and inconsistent [19]. This is especially true in off-road driving tasks, where the human driver depends heavily on instant feedback from the car to overcome stochastic disturbances. The frame-by-frame labeling approach [17], for example, can therefore lead to a very counter-intuitive, inefficient data collection process, because the required dynamics information is lost in a single image frame. Overall, when using human demonstrations, online imitation learning can be as bad as batch imitation learning [19], simply due to the inconsistency introduced by human nature.
IV. The Autonomous Driving System
Building on the analysis in the previous section, we design a system that can learn to perform fast off-road autonomous driving with only on-board measurements. The overall system architecture for learning end-to-end DNN driving policies is illustrated in Fig. 2. It consists of three high-level controllers (an expert, a learner, and a safety control module) and a low-level controller, which receives steering and throttle commands from the high-level controllers and translates them to pulse-width modulation (PWM) signals to drive the steering and throttle actuators of the vehicle.
On the basis of the analysis in Section III-C, we assume the expert is algorithmic and has access to expensive sensors (GPS and IMU) for accurate global state estimates (global position, heading and roll angles, linear velocities, and heading angle rate) and to ample computational resources. The expert is built from multiple hand-engineered components, including a state estimator, a dynamics model of the vehicle, a cost function for the task, and a trajectory optimization algorithm for planning (see Section IV-A). By contrast, the learner is a DNN policy that has access to only a monocular camera and wheel speed sensors and is required to output steering and throttle commands directly (see Section IV-B). In this setting, the sensors the learner uses can be significantly cheaper than those of the expert; specifically, on our experimental platform, the AutoRally car (see Section IV-C), the IMU and GPS sensors required by the expert in Section IV-A together cost more than $6,000, while the sensors used by the learner's DNN policy cost less than $500. The safety control module has the highest priority among the three high-level controllers and is used to prevent the vehicle from crashing at high speed.
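The command flow among these modules can be summarized by the following sketch; the priority rule (safety module first) follows the description above, while the specific interface is illustrative only.

```python
from typing import Optional

class Command:
    """A high-level command later converted to PWM by the low-level controller."""
    def __init__(self, steering: float, throttle: float):
        self.steering = steering
        self.throttle = throttle

def select_command(safety_cmd: Optional[Command],
                   expert_cmd: Command,
                   learner_cmd: Command,
                   use_expert: bool) -> Command:
    """Arbitration sketch: the safety control module has the highest priority;
    otherwise either the MPC expert (e.g. during data collection) or the DNN
    learner (at test time) provides the steering/throttle command."""
    if safety_cmd is not None:          # e.g. run-stop pressed
        return safety_cmd
    return expert_cmd if use_expert else learner_cmd
```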
The software system was developed based on the Robot Operating System (ROS) in Ubuntu. In addition, a Gazebo-based simulation environment [20] was built using the same ROS interface but without the safety control module; the simulator was used to evaluate the performance of the system before real track tests.
IV-A Algorithmic Expert with Model-Predictive Control
We use an MPC expert [21] based on an incremental Sparse Spectrum Gaussian Process (SSGP) dynamics model (learned from 30 minutes of driving data) and an iSAM2 state estimator [22]. To generate actions, the MPC expert solves a finite-horizon optimal control problem at every sampling time: at time $t$, the expert policy is a locally optimal policy $\pi^*$ such that
$\pi^* \approx \arg\min_{\pi} \mathbb{E}_{\rho_{\pi}}\big[ \textstyle\sum_{\tau=t}^{t+T_h} c(s_\tau, a_\tau) \big]$ (10)

where $T_h$ is the length of the horizon it previews.
The computation is realized by the trajectory optimization algorithm Differential Dynamic Programming (DDP) [23]: in each iteration of DDP, the system dynamics and the cost function are approximated quadratically along a nominal trajectory; the Bellman equation of the approximate problem is then solved in a backward pass to compute the control law; finally, a new nominal trajectory is generated by applying the updated control law through the dynamics model in a forward pass. Upon convergence, DDP returns a locally optimal control sequence $\{ \hat{a}_t, \dots, \hat{a}_{t+T_h} \}$, and the MPC expert executes the first action in the sequence as the expert's action at time $t$ (i.e. $a^*_t = \hat{a}_t$). This process is repeated at every sampling time.
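The receding-horizon loop can be sketched as follows; `ddp_optimize` stands in for the DDP solver and is a hypothetical interface rather than the authors' implementation.

```python
import numpy as np

def mpc_expert_step(state, dynamics, cost, horizon, ddp_optimize, prev_plan=None):
    """One MPC expert step (sketch): solve a finite-horizon optimal control
    problem with DDP around a nominal trajectory and execute only the first
    action of the returned control sequence."""
    # Warm-start from the previous plan shifted by one step, if available.
    nominal = np.roll(prev_plan, -1, axis=0) if prev_plan is not None \
        else np.zeros((horizon, 2))                 # columns: [steering, throttle]
    # DDP: quadratize dynamics/cost along the nominal trajectory, run a
    # backward pass for a local control law, and a forward pass for a new
    # nominal trajectory; `ddp_optimize` encapsulates these iterations.
    plan = ddp_optimize(state, nominal, dynamics, cost, horizon)
    action = plan[0]                                # a*_t = first action in plan
    return action, plan
```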
In view of the analysis in Section III-B, we can assume that the MPC expert satisfies Definition 1, because it updates its approximate solution to the original RL problem (1) at high frequency using global state information. However, because MPC requires replanning at every step, running the expert policy (10) online consumes significantly more computational power than is required by the learner.
IV-B Learning a DNN Control Policy
The learner's control policy is parametrized by a DNN containing 10 million parameters. As illustrated in Fig. 3, the DNN policy consists of two subnetworks: a convolutional neural network (CNN) with 6 convolutional layers, 3 max-pooling layers, and 2 fully-connected layers that takes RGB monocular images as inputs (the raw camera images were rescaled before being fed to the network), and a feedforward network with a fully-connected hidden layer that takes wheel speeds as inputs. The convolutional and max-pooling layers are used to extract lower-dimensional features from the images. The DNN policy uses rectified linear unit (ReLU) activation for all layers except the last one; the max-pooling layers reduce the spatial size of the representation (and therefore the number of parameters and the computational load). The outputs of the two subnetworks are concatenated and followed by another fully-connected hidden layer. The structure of this DNN was selected empirically based on experimental studies of several different architectures.
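A sketch of the two-branch policy in PyTorch is shown below. The layer counts and the ReLU/max-pooling placement follow the description above, but the channel widths, filter sizes, hidden sizes, the adaptive pooling before flattening, and the number of wheel-speed inputs are illustrative assumptions; the exact values, which bring the parameter count to roughly 10 million, are not restated here.

```python
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    """Two-branch DNN policy (sketch): a CNN on the monocular image and a
    small feed-forward network on wheel speeds, concatenated and mapped to
    continuous steering and throttle commands."""

    def __init__(self, n_wheel_speeds=4):
        super().__init__()
        conv = lambda c_in, c_out, pool: nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
            *([nn.MaxPool2d(2)] if pool else []))
        # 6 convolutional layers, 3 of them followed by max-pooling.
        self.cnn = nn.Sequential(
            conv(3, 16, True), conv(16, 32, False), conv(32, 32, True),
            conv(32, 64, False), conv(64, 64, True), conv(64, 64, False),
            nn.AdaptiveAvgPool2d((4, 8)), nn.Flatten())
        self.cnn_fc = nn.Sequential(                # 2 fully-connected layers
            nn.Linear(64 * 4 * 8, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU())
        self.speed_fc = nn.Sequential(              # wheel-speed branch
            nn.Linear(n_wheel_speeds, 32), nn.ReLU())
        self.head = nn.Sequential(                  # joint hidden layer + output
            nn.Linear(64 + 32, 64), nn.ReLU(),
            nn.Linear(64, 2))                       # [steering, throttle]

    def forward(self, image, wheel_speeds):
        z = torch.cat([self.cnn_fc(self.cnn(image)),
                       self.speed_fc(wheel_speeds)], dim=1)
        return self.head(z)
```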
In the construction of the surrogate problem for imitation learning, the action space is equipped with the $l_1$ norm to filter outliers, and the optimization problem, (7) or (9), is solved using ADAM [24], a stochastic gradient descent algorithm with an adaptive learning rate. Note that while (7) and (9) are written in terms of the state $s$, the neural network policy does not use the state, but rather the synchronized raw observation $o_t$, as input. We did not perform any data selection or augmentation in any of the experiments; the only preprocessing was scaling and cropping the raw images.
IV-C The Autonomous Driving Platform
To validate our imitation learning approach to off-road autonomous driving, the system was implemented on a custom-built, 1/5-scale autonomous AutoRally car (weight 22 kg; L×W×H 1 m × 0.6 m × 0.4 m), shown in the top figure of Fig. 4. The car was equipped with an ASUS mini-ITX motherboard, an Intel quad-core i7 CPU, 16 GB RAM, an Nvidia GTX 750 Ti GPU, and an 11,000 mAh battery. For sensors, two forward-facing machine vision cameras (only one of which was used in this work), a Hemisphere Eclipse P307 GPS module, a Lord Microstrain 3DM-GX4-25 IMU, and Hall effect wheel speed sensors were installed. In addition, an RC transmitter could be used by a human to remotely control the vehicle, and a physical run-stop button was installed to disable all motion in case of emergency.
In the experiments, all computation was executed on board the vehicle in real time. In addition, an external laptop was used to communicate with the on-board computer remotely via Wi-Fi to monitor the vehicle's status. Observations were sampled and actions were executed at 50 Hz to account for the high speed of the vehicle and the stochasticity of the environment. Note that this control frequency is significantly higher than in [7] (10 Hz), [9] (12 Hz), and [10] (15 Hz).
V. Experimental Setup
V-A High-Speed Navigation Task
We tested the performance of the imitation learning system proposed in Section IV on a high-speed navigation task with a desired speed of 7.5 m/s. The performance index of the task was formulated as the cost function in the finite-horizon RL problem (1) with
$c(s_t, a_t) = \alpha_1 c_{\mathrm{pos}} + \alpha_2 c_{\mathrm{spd}} + \alpha_3 c_{\mathrm{slip}} + \alpha_4 c_{\mathrm{act}}$ (11)

in which $c_{\mathrm{pos}}$ favors the vehicle staying in the middle of the track, $c_{\mathrm{spd}}$ drives the vehicle to reach the desired speed, $c_{\mathrm{slip}}$ stabilizes the car against slipping, and $c_{\mathrm{act}}$ inhibits large control commands (see the Appendix for details).
The goal of the high-speed navigation task is to minimize the accumulated cost function over one minute of continuous driving. That is, under the 50 Hz sampling rate, the task horizon was set to $T = 3{,}000$. The cost information in (11) was given to the MPC expert in Fig. 2 to perform online trajectory optimization with a two-second preview horizon. In the experiments, the weights $\alpha_1, \dots, \alpha_4$ in (11) were chosen so that the MPC expert in Section IV-A could perform reasonably well. The learner's policy was then tuned by online/batch imitation learning in an attempt to match the expert's performance.
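A sketch of how the weighted cost (11) can be accumulated over a one-minute rollout is given below; the state layout, the slip-angle proxy, and the weight values are illustrative assumptions rather than the exact cost used by the MPC expert.

```python
import numpy as np

def instantaneous_cost(state, action, weights, c_pos, v_des=7.5):
    """Weighted task cost (sketch of Eq. (11)); the state layout and the
    slip term below are illustrative, not the authors' exact implementation."""
    x, y, vx, vy = state                    # global position, body-frame velocities
    a1, a2, a3, a4 = weights
    return (a1 * c_pos(x, y)                              # stay near the track center
            + a2 * (vx - v_des) ** 2                      # reach the desired speed
            + a3 * np.arctan2(abs(vy), abs(vx)) ** 2      # penalize side slip (proxy)
            + a4 * float(np.dot(action, action)))         # penalize large commands

def rollout_cost(states, actions, weights, c_pos):
    """Accumulated cost over a one-minute rollout at 50 Hz (T = 3,000 steps)."""
    return sum(instantaneous_cost(s, a, weights, c_pos)
               for s, a in zip(states, actions))
```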
V-B Test Track
All the experiments were performed on an elliptical dirt track, shown in the bottom figure of Fig. 4, with the AutoRally car described in Section IV-C. The test track was 3 m wide and 30 m long and built with fill dirt. Its boundaries were surrounded by soft HDPE tubes, detached from the ground, for safety during experimentation. Due to the changing dirt surface, debris from the track's natural surroundings, and the shifting track boundaries after car crashes, the track condition and vehicle dynamics can change from one experiment to the next, adding to the complexity of learning a robust policy.
V-C Data Collection
Training data was collected in two ways. In batch imitation learning, the MPC expert was executed, and the camera images, wheel speed readings, and the corresponding steering and throttle commands were recorded. In online imitation learning, a mixture of the expert's and the learner's policies was used to collect training data (camera images, wheel speeds, and expert actions): in the $i$th iteration of DAgger, a mixed policy $\hat{\pi}_i = \beta_i \pi^* + (1 - \beta_i)\pi_i$ was executed at each time step, where $\pi_i$ is the learner's DNN policy after $i$ DAgger iterations and $\beta_i$ is the probability of executing the expert policy. The use of a mixture policy was suggested in [13] for better stability. A mixing rate of $\beta_i = 0.6^i$ was used in our experiments; note that the probability of using the expert decayed exponentially as the number of DAgger iterations increased. Experimental data was collected on an outdoor track and included changing lighting conditions and environmental dynamics. In the experiments, rollouts about to crash were terminated remotely by overriding the autonomous control commands with the run-stop button or the RC transmitter in the safety control module; these rollouts were excluded from the data collection.
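A sketch of the stochastic policy mixing used during data collection is shown below; the exponential schedule matches the 60%/36%/21% probabilities reported in Section V-D, and the expert/learner callables are hypothetical.

```python
import random

def mixed_policy_action(obs, expert, learner, iteration, beta=0.6):
    """At DAgger iteration i, execute the expert with probability beta**i and
    the current learner otherwise; the expert's action is always recorded as
    the training label, regardless of which command is executed."""
    p_expert = beta ** iteration
    expert_action = expert(obs)                 # label stored in the dataset
    executed = expert_action if random.random() < p_expert else learner(obs)
    return executed, expert_action
```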
V-D Policy Learning
In online imitation learning, three iterations of DAgger were performed. At each iteration, the robot executed one rollout using the mixed policy described above (the probabilities of executing the expert policy were 60%, 36%, and 21%, respectively). For a fair comparison, the amount of training data collected for batch imitation learning was the same as the total collected over the three iterations of online imitation learning.
At each training phase, the optimization problem (7) or (9) was solved with ADAM for 20 epochs, with mini-batch size 64 and a learning rate of 0.001. Dropout was applied at all fully-connected layers to avoid overfitting (with drop probability 0.5 for the first fully-connected layer and 0.25 for the rest). See Section IV-B for details. Finally, after the entire learning session of a policy, three rollouts were performed using the learned policy for performance evaluation.
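A sketch of the supervised training step with the hyperparameters listed above (ADAM, 20 epochs, mini-batch size 64, learning rate 0.001, $l_1$ imitation loss) follows; it assumes the PyTorch policy sketch from Section IV-B (dropout layers omitted for brevity) and is not the authors' training code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_policy(policy, images, wheel_speeds, expert_actions,
                 epochs=20, batch_size=64, lr=1e-3):
    """Solve the supervised problem (7)/(9) by minimizing the l1 imitation
    loss between the policy's outputs and the expert's actions with ADAM."""
    dataset = TensorDataset(images, wheel_speeds, expert_actions)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()                 # l1 norm to filter outliers
    policy.train()
    for _ in range(epochs):
        for img, spd, act in loader:
            optimizer.zero_grad()
            loss = loss_fn(policy(img, spd), act)
            loss.backward()
            optimizer.step()
    return policy
```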
VI. Experimental Results
VI-A Online vs. Batch Learning
Table II: Performance statistics of the MPC expert and the learned DNN policies.

Policy | Avg. speed | Top speed | Training data (samples) | Completion ratio | Total loss | Steering / Throttle loss
Expert | 6.05 m/s | 8.14 m/s | N/A | 100% | 0 | 0 / 0
Batch | 4.97 m/s | 5.51 m/s | 3,000 | 100% | 0.108 | 0.092 / 0.124
Batch | 6.02 m/s | 8.18 m/s | 6,000 | 51% | 0.108 | 0.162 / 0.055
Batch | 5.79 m/s | 7.78 m/s | 9,000 | 53% | 0.123 | 0.193 / 0.071
Batch | 5.95 m/s | 8.01 m/s | 12,000 | 69% | 0.105 | 0.125 / 0.083
Online (1 iter) | 6.02 m/s | 7.88 m/s | 6,000 | 100% | 0.090 | 0.112 / 0.067
Online (2 iter) | 5.89 m/s | 8.02 m/s | 9,000 | 100% | 0.075 | 0.095 / 0.055
Online (3 iter) | 6.07 m/s | 8.06 m/s | 12,000 | 100% | 0.064 | 0.073 / 0.055
We first study the performance of policies trained with the online and batch imitation learning algorithms. Fig. 5 illustrates the vehicle trajectories of the different policies. Due to accumulating errors, the policy trained with batch imitation learning crashed into the lower-left boundary, an area of the state–action space rarely explored in the expert's demonstrations. By contrast, online imitation learning let the policy learn to successfully cope with such corner cases, as the learned policy occasionally ventured into new areas of the state–action space during training.
Fig. 6 shows the performance in terms of distance traveled without crashing (we used the safety control module shown in Fig. 2 to manually terminate a rollout when the car crashed into the soft boundary), and Table II shows the statistics of the experimental results. Overall, DNN policies trained with both online and batch imitation learning algorithms were able to achieve a speed similar to the MPC expert's. However, with the same amount of training data, the policies trained with online imitation learning generally outperformed those trained with batch imitation learning. In particular, the policies trained using online imitation learning achieved better performance in terms of both completion ratio and imitation loss. It is worth noting that the traveled distance of the policy learned with a batch of 3,000 samples was longer than that of the other batch learning policies. As shown in Table II, this is mainly because this policy achieved better steering performance than throttle performance: although the vehicle was able to navigate without crashing, it traveled at a much slower speed. By contrast, the batch learning policies trained with more data had better throttle performance but worse steering performance, resulting in faster speeds but a higher chance of crashing.
VI-B Deep Neural Network Policy
One main feature of a DNN policy is that it can learn to extract both low-level and high-level features of an image and automatically detect the parts that have a greater influence on steering and throttle. We validate this idea in Fig. 7 by showing the averaged feature map at each max-pooling layer (see Fig. 3), where each pixel represents the average unit activation across the filter outputs. We observe that at deeper levels the detected salient objects are the boundaries of the track and parts of a building, while grass and dirt contribute little to the DNN's output.
We also analyze the importance of incorporating wheel speeds in our task. We compare the performance of our full DNN policy with that of a policy based only on the CNN subnetwork (without wheel-speed inputs), both trained with batch imitation learning. The data was collected in accordance with Section V-C. Fig. 8 shows the batch imitation learning loss in (9) for the different network architectures. The full DNN policy in Fig. 3 achieved consistently better performance. While images contain position information, they are insufficient for inferring velocity. Therefore, we conjecture that state-of-the-art CNNs (e.g. [7]) cannot be directly used in this task. By contrast, even without a recurrent architecture, our DNN policy learned to combine the wheel speeds with the CNN features to infer the hidden state and achieve better performance.
VII. Conclusion
We introduce an end-to-end learning system that learns a deep neural network driving policy mapping raw on-board observations to steering and throttle commands by mimicking an expert's behavior. We investigate both online and batch learning frameworks theoretically and empirically and propose an imitation learning system accordingly. In real-world experiments, our system was able to perform fast off-road navigation autonomously at an average speed of 6 m/s and a top speed of 8 m/s, while using only a low-cost monocular camera and wheel speed sensors. Our current and future work includes developing more complex policy representations, such as recurrent neural networks, and improving robustness to visual distractions.
Appendix
Task cost function
The position cost $c_{\mathrm{pos}}$ for the high-speed navigation task is a 16-term cubic polynomial of the vehicle's global position $(x, y)$.
The coefficients in this cost function were identified by performing a regression to fit the track's boundaries: first, a thorough GPS survey of the track was taken. Points along the inner and the outer boundaries were assigned values of $-1$ and $+1$, respectively, resulting in a zero-cost path along the center of the track. The coefficient values were then determined by a least-squares regression of the polynomial in $(x, y)$ to fit the boundary data.
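A sketch of the boundary-fitting regression described above, using NumPy least squares; the $\pm 1$ boundary labels and the bicubic monomial basis (16 terms) follow the description, while the exact basis and survey data used by the authors are not restated here.

```python
import numpy as np

def fit_position_cost(inner_xy, outer_xy, degree=3):
    """Fit a polynomial surface c_pos(x, y) by least squares (sketch):
    inner-boundary points are labeled -1 and outer-boundary points +1, so the
    zero level set of the fitted polynomial runs along the track center."""
    points = np.vstack([inner_xy, outer_xy])
    labels = np.concatenate([-np.ones(len(inner_xy)), np.ones(len(outer_xy))])
    x, y = points[:, 0], points[:, 1]
    # Bicubic-style monomial basis x**i * y**j (16 terms for degree 3).
    basis = np.column_stack([x**i * y**j
                             for i in range(degree + 1)
                             for j in range(degree + 1)])
    coeffs, *_ = np.linalg.lstsq(basis, labels, rcond=None)

    def c_pos(px, py):
        feats = np.array([px**i * py**j
                          for i in range(degree + 1)
                          for j in range(degree + 1)])
        return float(feats @ coeffs)
    return c_pos
```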
The speed cost $c_{\mathrm{spd}}$ is a quadratic function that penalizes the difference between the desired speed and the longitudinal velocity $v_x$ in the body frame. The side-slip angle cost $c_{\mathrm{slip}}$ penalizes the side-slip angle computed from $v_x$ and the lateral velocity $v_y$ in the body frame. The action cost $c_{\mathrm{act}}$ is a quadratic function of the steering and throttle commands. In the experiments, the desired speed was set to 7.5 m/s (Section V-A), and the remaining cost parameters were selected empirically.
References
 [1] J. Michels, A. Saxena, and A. Y. Ng, “High speed obstacle avoidance using monocular vision and reinforcement learning,” in International Conference on Machine learning, 2005, pp. 593–600.
 [2] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, “Aggressive driving with model predictive path integral control,” in IEEE International Conference on Robotics and Automation, 2016, pp. 1433–1440.
 [3] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. Rehg, B. Boots, and E. Theodorou, “Information theoretic MPC for model-based reinforcement learning,” in IEEE International Conference on Robotics and Automation, 2017.
 [4] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for selfdriving urban vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
 [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “Endtoend training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, Jan. 2016.
 [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [7] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, and U. Muller, “Explaining how a deep neural network trained with end-to-end learning steers a car,” arXiv preprint arXiv:1704.07911, 2017.
 [8] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in Neural Information Processing Systems, 1989, pp. 305–313.
 [9] V. Rausch, A. Hansen, E. Solowjow, C. Liu, E. Kreuzer, and J. K. Hedrick, “Learning a deep neural net policy for end-to-end control of autonomous vehicles,” in IEEE American Control Conference, 2017.
 [10] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road obstacle avoidance through end-to-end learning,” in Advances in Neural Information Processing Systems, 2006, pp. 739–746.
 [11] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end autonomous driving,” arXiv preprint arXiv:1605.06450, 2016.
 [12] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “PLATO: Policy learning using adaptive trajectory optimization,” in IEEE International Conference on Robotics and Automation, 2017, pp. 3342–3349.
 [13] S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in International Conference on Artificial Intelligence and Statistics, vol. 1, no. 2, 2011, p. 6.
 [14] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg, “Aggressive deep driving: Model predictive control with a CNN cost model,” arXiv preprint arXiv:1707.05303, 2017.
 [15] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in International Conference on Machine Learning, vol. 2, 2002, pp. 267–274.
 [16] S. Shalev-Shwartz et al., “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
 [17] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive UAV control in cluttered natural environments,” in IEEE International Conference on Robotics and Automation, 2013, pp. 1765–1772.
 [18] A. L. Gibbs and F. E. Su, “On choosing and bounding probability metrics,” International Statistical Review, vol. 70, no. 3, pp. 419–435, 2002.
 [19] M. Laskey, C. Chuck, J. Lee, J. Mahler, S. Krishnan, K. Jamieson, A. Dragan, and K. Goldberg, “Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations,” arXiv preprint arXiv:1610.00850, 2016.
 [20] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.
 [21] Y. Pan, X. Yan, E. A. Theodorou, and B. Boots, “Prediction under uncertainty in sparse spectrum Gaussian processes with applications to filtering and control,” in International Conference on Machine Learning, 2017, pp. 2760–2768.
 [22] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “iSAM2: Incremental smoothing and mapping using the Bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
 [23] Y. Tassa, T. Erez, and W. D. Smart, “Receding horizon differential dynamic programming,” in Advances in Neural Information Processing Systems, 2008, pp. 1465–1472.
 [24] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.