Gradient-free policy architecture search and adaptation
Abstract
We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks. By learning from both demonstration and environmental reward we develop a model that can learn with relatively few early catastrophic failures. We first learn an architecture of appropriate complexity to perceive aspects of world state relevant to the expert demonstration, and then mitigate the effect of domain shift during deployment by adapting a policy demonstrated in a source domain to rewards obtained in a target environment. We show that our approach allows safer learning than baseline methods, offering a reduced cumulative crash metric over the agent’s lifetime as it learns to drive in a realistic simulated environment.
1 Introduction
Deep architectures have become popular as function approximators to represent action-selection policies. Common approaches to learn the parameters of such models include reinforcement learning [20] and/or learning from demonstration [1]: both learn model parameters to maximize expected reward, mimic human behavior, and/or achieve implicit goals. However, the design of policy architectures, especially in a deep learning paradigm, remains relatively unexplored. Architectures are typically selected through a combination of intuition and trial and error.
Learning to learn, including the learning of learning architectures, is a long-articulated goal of AI, and many “meta-learning” and “lifelong learning” schemes have been proposed (e.g., [21] offered seminal views; see [18] for a survey). Recently, renewed interest in this topic has focused on models which explicitly search over the structure of deep architectures, including models which fuse non-parametric Bayesian inference with deep learning to select the number of channels for visual recognition tasks [6], models which use reinforcement learning to directly optimize over deep architectures for recognition [24], and models which use a gradient-free optimization method (“evolutionary search”) to infer optimal network structure [12].
We investigate policy architecture search using gradient-free optimization and learn optimal policy structure for autonomous driving tasks. We propose a model which learns jointly from demonstration and optimization, with the goal of “safe training”: minimizing the amount of damage a vehicle incurs to learn a threshold level of performance. We base our approach on exploration-based schemes due to their ability to optimize model weights and architecture hyperparameters, leverage expert demonstrations, and adapt to reward obtained in new domains. We believe that a model which can initialize from demonstration, and learn an optimal policy from that foundation, is likely to achieve higher performance while maintaining the constraint of safe training, compared to models which must randomly search through action space during initial learning, or which learn from a reasonably safe demonstration but cannot further optimize performance based on environmental reward.
Prior approaches to combining demonstration with reward-based learning have had mixed success [15], mainly due to the poor generalization of the policy learned on demonstrations. We posit that effective behavior cloning requires learning a visual agent architecture that has sufficient structure to perceive the state of the world deemed relevant to the expert providing the demonstration. This may or may not be the case with existing, off-the-shelf visual models. We therefore optimize over both architectures and parameters when performing expert behavioral cloning.
Often, deep models which learn to perform in one domain fail to perform well when deployed in another setting, such as differing weather or lighting conditions. Models learned from demonstration are also well known to fail when the learned policy takes the agent away from the region of the state space where the demonstration was provided [13]. We show that our method can effectively and safely adapt a model demonstrated in one environment but deployed in a visually different environment based on the reward signal in the latter domain, even when the agent is initialized far from initial demonstrations. Our approach leverages only target domain reward, and makes no assumptions about domain alignment, explicit or implicit, nor assumes any demonstration supervision in the target domain.
To achieve these goals, we present a gradient-free optimization algorithm inspired by [16] with a modification in noise generation that results in estimating the gradients more efficiently and accurately (Section 3.1). We then apply this algorithm to search over variable-length architectures (Section 3.2). Next, we combine our gradient-free policy search with demonstrations to learn a better policy that adapts to the new environment by receiving rewards as feedback (Section 3.3). We experimentally show that our architecture search model finds a policy on the GTA game environment that outperforms previously published methods (e.g., [3]) in end-to-end steering prediction from demonstrations, and that it can be efficiently adapted to learn to drive in previously unseen scenarios (Section 4). Our model reduces the number of crashes incurred while learning to drive, compared to baselines based only on reward or demonstration but not both, or compared to previously proposed fixed architectures that were not optimized for the domain.
2 Related work
Architecture search has been investigated through different frameworks including reinforcement learning [24] and evolutionary techniques [12]. In [24], a recurrent neural network (RNN) was used to generate fixed-length architecture descriptions from a predefined search space and was trained with policy gradient methods. The authors came close to and surpassed the state-of-the-art results on the CIFAR-10 and Penn Treebank datasets, respectively. A meta-modeling algorithm was proposed in [2] which used Q-learning to sequentially search for convolutional layers for image classification tasks. They showed that their approach outperforms other existing meta-models and manually designed architectures with similar types of layers. Recently, [22] introduced Budgeted Super Networks, which are inspired by the REINFORCE algorithm, with an objective function that jointly accounts for prediction quality and computation cost. Various versions of biologically inspired methods, or neuroevolution strategies, have been proposed for architecture search ever since they were introduced by [5]. Most of them are based on genetic algorithms in which a fitness function is re-evaluated at each “generation” to determine whether “genotypes” are perturbed in the correct direction to evolve appropriately [12]; that is, they initialize a model and evolve it based on its performance. This paradigm was recently revisited as an alternative to reinforcement learning algorithms where optimization is performed in a gradient-free fashion, and the algorithm was shown to be highly parallelizable, resulting in significant speedups in playing MuJoCo and Atari games [16].
Policy search in autonomous driving applications has largely focused on demonstration-based optimization approaches with [4] or without [3] affordance measurements. It dates back to the classic model [10], a shallow architecture that could map from pixels to simple driving actions. Several years later, researchers demonstrated end-to-end deep learning models for steering control of small-scale cars [8], and recently NVIDIA followed the same path and showed success in predicting steering angle on a full-size vehicle from raw pixels using a convolutional network [3]. A novel FCN-LSTM architecture was proposed on large-scale crowd-sourced data to perform egomotion predictions conditioned on the previous temporal states [23]. They used dashcam videos to derive a generic driving model that predicted trajectory angle (not steering angles).
3 Approach
We propose a learning-to-learn model which includes architecture optimization, parameter learning, and representation adaptation over different time scales. Our approach can be summarized by the following two steps. (1) Given expert demonstration, search over architectures and parameters to find a policy that best mimics the expert’s performance by monitoring the obtained accuracy and number of parameters. (2) Having learned from demonstration, adapt the model to the reward provided by the target environment. In both steps, it is essential to derive a function approximator that optimizes an objective function. We use a gradient-free optimization algorithm [16] that maximizes a parametrized reward function using gradient estimation to perform architecture search (Section 3.2) and policy learning (Section 3.3).
3.1 Gradient-free optimization algorithm
Let $F(\theta)$ be our objective function, parametrized by $\theta$, which is an $n$-dimensional vector. $F(\theta)$ can be the reward that an environment provides for an agent when it executes a policy with parameters $\theta$; our goal is to maximize the expected reward by perturbing the policy parameters, denoted as $\theta$, by moving in particular directions. The parameter estimate update can be performed using a general stochastic form:

$$\hat{\theta}_{k+1} = \hat{\theta}_k + a_k \, \hat{g}_k(\hat{\theta}_k) \qquad (1)$$

where $a_k$ is a step size, $y$ is an approximation of the objective function (i.e. $y(\theta) = F(\theta) + \text{noise}$), and $\hat{g}_k$ is the gradient of the objective estimate, which can be approximated by any gradient estimator in the family of finite difference methods. The gradient is estimated in a randomly chosen direction by perturbing all the elements of $\hat{\theta}_k$ to obtain two measurements of $y$ as follows:

$$y_k^{+} = F(\hat{\theta}_k + c \, \Delta_k) + \epsilon_k^{+}, \qquad y_k^{-} = F(\hat{\theta}_k - c \, \Delta_k) + \epsilon_k^{-}$$

where $\Delta_k$ is a vector of mutually independent random perturbations drawn from a zero-mean distribution. While there is no restriction for it to have a specific type of distribution, we use the Laplace distribution, as it tends to choose orthogonal directions in the long run. Other recent efforts [16] utilized Gaussian noise to sample mirrored projections. Figure ? shows a comparison between the two distributions. $c$ is a small positive number, and $\epsilon_k^{+}$ and $\epsilon_k^{-}$ are the noise associated with evaluating $y$ such that $\mathbb{E}[\epsilon_k^{+} - \epsilon_k^{-}] = 0$. The gradient estimate can then be computed element-wise as:

$$\hat{g}_k(\hat{\theta}_k) = \frac{y_k^{+} - y_k^{-}}{2 c \, \Delta_k} \qquad (2)$$

Parameter estimates can be updated by replacing the gradients in Equation 1 with those found in Equation 2.
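As an illustration of the two-measurement update described in this section, the following is a minimal sketch and not the authors' implementation. It assumes a noise-free toy objective, and it scales the two-sided difference by the perturbation (the mirrored-sampling form used in evolution strategies [16]) instead of dividing by it, to keep the toy example numerically stable under Laplace noise. The names `estimate_gradient`, `maximize`, `toy_reward`, and all numeric settings are hypothetical.

```python
import numpy as np


def estimate_gradient(objective, theta, c=0.05, num_pairs=16, rng=None):
    """Two-sided random-direction gradient estimate with Laplace perturbations."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_pairs):
        # Zero-mean Laplace perturbation of every parameter at once.
        delta = rng.laplace(loc=0.0, scale=1.0, size=theta.shape)
        y_plus = objective(theta + c * delta)
        y_minus = objective(theta - c * delta)
        # Laplace(0, 1) has variance 2, so dividing by 2 makes the
        # projected estimate unbiased for smooth objectives (assumption:
        # ES-style projection, not the element-wise division form).
        grad += (y_plus - y_minus) / (2.0 * c) * delta / 2.0
    return grad / num_pairs


def maximize(objective, theta0, step_size=0.05, iterations=300, seed=0):
    """Gradient-free ascent: theta <- theta + step_size * estimated gradient."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iterations):
        theta = theta + step_size * estimate_gradient(objective, theta, rng=rng)
    return theta


# Hypothetical toy reward with a known maximizer, used only for illustration.
target = np.array([1.0, -2.0, 0.5])


def toy_reward(theta):
    return -float(np.sum((theta - target) ** 2))


theta_star = maximize(toy_reward, np.zeros(3))
```

Only evaluations of the objective are needed, so the same loop applies whether the objective is a validation loss or an environment reward.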

Table 1: Search space for the hyperparameters describing each layer.

Hyperparameter            Values
Convolutional layer
Filter height (FH)        [1, 3, 5, 7]
Filter width (FW)         [1, 3, 5, 7]
Stride height (SH)        [1, 2, 3]
Stride width (SW)         [1, 2, 3]
Number of filters (NF)    [16, 24, 32, 64, 128, 256]
Maxpool size (MP)         [1, 2, 3]
Dropout (DO1)             [0.3, 0.5, 0.7, 1.0]
Fully connected layer
Number of units (NU)      [8, 16, 32, 64, 128, 256, 512]
Dropout (DO2)             [0.3, 0.5, 0.7, 1.0]
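The search space listed above can be written down as a small configuration, which makes its size easy to check. This is a sketch under assumptions: the dictionary keys are hypothetical names, the split into convolutional and fully connected groups is inferred from the hyperparameter names, and uniform random sampling merely stands in for the RNN controller.

```python
import math
import random

# Values copied from the search-space table; key names are hypothetical.
CONV_SPACE = {
    "filter_height": [1, 3, 5, 7],
    "filter_width": [1, 3, 5, 7],
    "stride_height": [1, 2, 3],
    "stride_width": [1, 2, 3],
    "num_filters": [16, 24, 32, 64, 128, 256],
    "maxpool_size": [1, 2, 3],
    "dropout": [0.3, 0.5, 0.7, 1.0],
}
FC_SPACE = {
    "num_units": [8, 16, 32, 64, 128, 256, 512],
    "dropout": [0.3, 0.5, 0.7, 1.0],
}


def count_configs(space):
    """Number of distinct layer descriptions in one search space."""
    return math.prod(len(values) for values in space.values())


def sample_layer(space, rng):
    """Draw one layer description uniformly (stand-in for the controller)."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)
conv_layer = sample_layer(CONV_SPACE, rng)
fc_layer = sample_layer(FC_SPACE, rng)
num_conv_configs = count_configs(CONV_SPACE)  # 4*4*3*3*6*3*4 = 10368
num_fc_configs = count_configs(FC_SPACE)      # 7*4 = 28
```

Even a single convolutional layer admits over ten thousand descriptions, which is why the layers are generated sequentially rather than enumerated.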
3.2 Learning an optimal initial policy from demonstrations
Inspired by [24], we use a recurrent neural network to sequentially generate the description of layers of an architecture from a given design space defined by the user. The RNN acts as a controller which generates the architecture description defined by its hyperparameters chosen from a predefined search space. In [24], the authors used policy gradients to train the RNN, which was able to produce fixed-length convolutional and recurrent architectures. Given demonstrations and a child network defined by the RNN, they trained the child network using supervised learning, obtained an accuracy metric for the given task on a held-out validation set, and used that accuracy as a reward signal to train the RNN.
Our model uses the demonstrations to provide the reward function (i.e., the objective function of Section 3.1) to train the RNN. Unlike backpropagation, which suffers from vanishing gradients when training RNNs, gradient-free algorithms do not have such an issue [16]. Our RNN controller specifies three types of layers: convolutional, fully connected, and max-pooling, which can have inter-layer dropout. For the reward signal, we use the negative value of the total loss function. At the last layer of the network, we regress to three real-valued numbers, each having a mean-squared loss. The total loss is the sum of all three losses. We use a novel reward function (Equation 3) that not only results in the minimum total loss but also grows the architecture as long as the loss keeps decreasing. Note that in a classification problem, an accuracy metric can replace the loss value, so the goal becomes maximizing the accuracy while controlling the number of parameters. Here we have a regression problem, and our goal is to search for an architecture that achieves a low loss on the given task and grows further by adding more layers (i.e., producing more parameters) to decrease this loss, while being penalized for adding more parameters in return for no gain in loss reduction. We propose to use a ReLU-based Lagrange-multiplier reward function as below:
where is the negative of the minimum loss (or maximum accuracy in a classification problem) on the validation set for the last epochs and is the total number of parameters in the child network. is the Lagrange parameter defined as a function of the first subreward in a ReLUbased fashion:
This reward function acts as follows. The RNN keeps generating new layers by being rewarded only based on the obtained total loss until it produces a child network that achieves the desired value of loss (or accuracy ) on the validation set. Once it reaches this threshold, it will be penalized for further growing the architecture if the loss does not decrease consequently. The parameter in Equation 3 defines the respective threshold. E.g. choosing allows architecture growth by the number of parameters in the new layer if it causes the overall loss decrease (or accuracy increase in classification) by . The thresholds are adjustable based on the problem at hand and the desired tradeoff between computational cost and loss minimization.
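The behaviour described above (reward driven only by loss until a threshold is reached, then a penalty on parameter count) can be illustrated with a small sketch. The functional form below is an assumption made for illustration and is not the paper's Equation 3; the names `total_loss`, `relu_gated_reward`, `alpha`, and `threshold`, and all numeric values, are hypothetical.

```python
import numpy as np


def total_loss(pred, target):
    """Sum of the per-output mean-squared errors (steering, brake, throttle)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_output_mse = np.mean((pred - target) ** 2, axis=0)  # one MSE per output
    return float(np.sum(per_output_mse))


def relu_gated_reward(loss, num_params, threshold, alpha=1e-7):
    """Loss-based reward with a ReLU-gated penalty on model size.

    Assumed form: the multiplier is zero while the loss sub-reward is below
    `threshold`, so growth is unpenalized; past the threshold, extra
    parameters reduce the reward.
    """
    r_loss = -loss                               # first sub-reward
    lam = alpha * max(0.0, r_loss - threshold)   # ReLU-style gate
    return r_loss - lam * num_params


# Hypothetical numbers: desired loss of 0.1, i.e. threshold = -0.1.
before_threshold_small = relu_gated_reward(0.2, 1_000, threshold=-0.1)
before_threshold_large = relu_gated_reward(0.2, 1_000_000, threshold=-0.1)
after_threshold_small = relu_gated_reward(0.09, 1_000, threshold=-0.1)
after_threshold_large = relu_gated_reward(0.09, 1_000_000, threshold=-0.1)
```

With these placeholder values, model size has no effect while the loss is above 0.1, and the larger network scores lower once the loss falls below it.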
Our RNN controller is a three-layer LSTM network followed by a softmax layer. The inputs to the RNN are the hyperparameters that describe a layer (see Table 1 for our search space). Training the RNN starts with randomly initializing the hyperparameters of the child network, which initially has only one layer. The RNN uses the reward function to update its weights such that those which contributed more to the obtained reward receive a higher weighting factor during the update; hence, we move in a hill-climbing direction which eventually maximizes the reward. We use the algorithm described in Section 3.1 to generate an architecture that yields the minimum loss. The process of generating a new layer terminates when the received reward converges.
3.3 Adapting a demonstrated policy to a new driving domain
It is well known, and confirmed experimentally below, that a policy learned from behavioral cloning can perform poorly when evaluated on inputs with a domain shift relative to the demonstration supervision. To overcome this, we further use the gradient-free search algorithm described in Section 3.1 to adapt a driving policy learned from demonstration in a source domain based on rewards in a target domain. We experiment with the setting where the initial agent state (e.g., location), and/or weather and lighting conditions, are substantially different from those provided in the demonstrations. We compare against baselines, performing reward-based optimization from initial demonstrations rather than from a randomly initialized policy, which makes the reward converge faster and more safely.
In the driving scenario, we wish to learn to drive with optimal or near-optimal performance, defined by the reward in the target domain. Specifically, the reward function used in our experiment is composed of two factors (we receive if obeyed and if violated): (1) no crashes with other objects, and (2) staying within the lane lines if they are available in the driving scene. We use the lane reward function and accident detection function defined in the source code of [14]. The necessary information is provided by a file in [14].
In our model, an episode is the time interval during which the agent has driven without a crash. Note that not all deviations from the middle of the road necessarily result in an accident. In case of a minor deviation, while the car receives as its reward, it continues driving until it makes a mistake that causes it to crash and the game restarts. There are distinct thresholds for middle-lane deviation defined in [14] for different roads (highway, urban, etc.) and different vehicle types.
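The two-factor reward and the crash-terminated episode described in this section can be sketched as follows. This is an illustration under assumptions, not the DeepGTAV implementation: summing the two factors, giving no lane term when lane lines are unavailable, and the names `StepInfo`, `step_reward`, and `run_episode` are all hypothetical, and the reward values are passed in by the caller because they are not fixed here.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    crashed: bool       # collision with another object at this step
    lane_visible: bool  # lane lines available in the scene
    in_lane: bool       # within the lane-deviation threshold


def step_reward(info, obeyed, violated):
    """Per-step reward from the crash rule and the lane rule."""
    crash_term = violated if info.crashed else obeyed
    if info.lane_visible:
        lane_term = obeyed if info.in_lane else violated
    else:
        lane_term = 0.0  # assumption: no lane term without lane lines
    return crash_term + lane_term


def run_episode(steps, obeyed, violated):
    """Accumulate reward until the first crash, which ends the episode."""
    total, num_steps = 0.0, 0
    for info in steps:
        total += step_reward(info, obeyed, violated)
        num_steps += 1
        if info.crashed:
            break  # the game restarts after a crash
    return total, num_steps


# Placeholder reward values (+1 / -1), chosen only for this example.
trace = [
    StepInfo(crashed=False, lane_visible=True, in_lane=True),
    StepInfo(crashed=False, lane_visible=True, in_lane=False),  # minor deviation
    StepInfo(crashed=True, lane_visible=True, in_lane=False),   # crash ends episode
    StepInfo(crashed=False, lane_visible=True, in_lane=True),   # never reached
]
episode_total, episode_length = run_episode(trace, obeyed=1, violated=-1)
```

Note that the minor deviation at the second step lowers the reward without ending the episode, matching the distinction drawn above between deviations and crashes.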
4 Experimental evaluation
We implemented the method described above and ran comprehensive experiments to show the efficiency and applicability of our approach in searching for an optimal driving policy that has the minimum number of catastrophic failures. Details of the experiments along with the results are provided in the following subsections. All the experiments are executed in the GTA game environment using a publicly available plugin [14] that allowed us to have control over driving conditions such as lighting, weather, car model, and reward function.
4.1 Dataset
For our evaluation we have collected a dataset of an expert policy by playing GTA collecting images of size similar to [3]. Labels include steering angle, brake, and throttle values. In order to learn from diverse driving scenarios, we have used data from different locations (highways, rural roads, urban streets) where weather and lighting conditions were adjusted using [14]. Our goal was to expose the learning algorithm to a comprehensive demonstration set yet to set aside some specific scenes for further testing the performance of behavioral cloning task. Sample images from demonstration are shown in Fig. ?. They include rainy (daytime), overcast (day and night), foggy (daytime), sunny scenes and thunderstorms (daytime). Some particular scenes such as rain and thunderstorm during nighttime as well as snow at daytime have been kept for our test set (see Fig. ?). Our test set is composed of images.
4.2 Learning a policy architecture from GTA demonstrations
The search space for the hyperparameters that describe a fully-convolutional architecture is presented in Table 1. The activation function is fixed to be a rectified linear unit. The RNN controller has three LSTM layers, with hidden units in each, and a softmax layer at the end to choose from the given search space. The RNN weights are initialized with a random Laplace distribution . Once the RNN predicts a new layer’s description, the child network is built and trained with batch size of and Adam optimizer [7] with learning rate . We train the child network for different numbers of epochs starting from epochs (depending on which layer we are at) and compute the reward function as described in Section 3.2. In order to finish optimizing one layer, we track the loss reduction of both validation and training sets between the first and the last epoch to avoid overfitting.
Our model is capable of generating architectures at low or high costs of architecture growth. In order to compare our designed architecture with [3],
Model                # of parameters   Training loss   Validation loss   Test loss
Bojarski et al.      252,241           0.098           0.11              0.212
Our small network    228,227           0.093           0.096             0.197
Our large network    2,198,723         0.085           0.088             0.185
We further empirically compare two different noise distributions for perturbing the parameters of the network: random Gaussian and random Laplace. We perform the architecture search over our small model using both distributions with mean zero; the variance is chosen using grid search. Fig. ? shows the results for reward convergence versus the number of iterations. Both distributions converge to high reward values (i.e., low loss); however, the Laplace distribution tends to be less noisy and reaches slightly higher reward values.
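For a like-for-like comparison of the two noise families, the perturbations can be drawn with the same mean and variance. The sketch below shows one way to do this, assuming NumPy; the helper names `sample_noise` and `excess_kurtosis`, the standard deviation of 0.1, and the sample size are hypothetical choices, not values from the experiments.

```python
import numpy as np


def sample_noise(kind, std, size, rng):
    """Zero-mean perturbations with a common standard deviation."""
    if kind == "gaussian":
        return rng.normal(loc=0.0, scale=std, size=size)
    if kind == "laplace":
        # Laplace(0, b) has variance 2 * b**2, so b = std / sqrt(2).
        return rng.laplace(loc=0.0, scale=std / np.sqrt(2.0), size=size)
    raise ValueError(f"unknown noise kind: {kind}")


def excess_kurtosis(x):
    """Sample excess kurtosis (about 0 for Gaussian, about 3 for Laplace)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(0)
gaussian_noise = sample_noise("gaussian", 0.1, 200_000, rng)
laplace_noise = sample_noise("laplace", 0.1, 200_000, rng)
```

With the variance matched, the remaining difference is the shape: Laplace samples concentrate more mass near zero and in the tails than Gaussian samples.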
4.3 Safe policy adaptation
Next, we want to learn the driving policy in a target game domain. We start with an initial model, using either a behaviorally cloned or a randomly initialized policy, and gradually improve it by receiving rewards from the environment. As stated in Section 3.3, an episode of the game starts at a random location and weather condition in the game. To initialize the policy, we use the larger architecture learned in the first step (with the model of [3] as the baseline). We evaluate both models with and without being adapted to demonstration, forming four cases: (1) the baseline network of [3] without demonstration (i.e., with randomly initialized weights), (2) the baseline network with behaviorally cloned initial weights, (3) our larger architecture without demonstration, and (4) our larger architecture with behaviorally cloned initial weights. We run all models in the GTA environment to receive the reward described in Section 3.3. Once the reward is received, the weights are perturbed by random Laplace noise and the same procedure is repeated until the average reward in each episode of the game converges to its maximum value. Results averaged across several runs are presented in Table 3, where our model optimized with demonstration outperforms all other cases. In particular, our model has the lowest number of cumulative crashes prior to converging to of averaged reward (details below).
Model 





154 hours  15,565  18,662  
74 hours  1,387  3,243  
114 hours  6,877  8,781  
53 hours  832  982  
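A cumulative crash count of the kind reported above can be computed from a log of per-episode rewards. The sketch below is one possible bookkeeping, assuming that every episode before convergence ends in exactly one crash (as the episode definition in Section 3.3 implies); the function name, the reward log, and the 0.75 fraction are hypothetical placeholders.

```python
def crashes_before_threshold(episode_rewards, max_reward, fraction):
    """Count crash-terminated episodes before the averaged episode reward
    first reaches `fraction` of `max_reward`."""
    crashes = 0
    for reward in episode_rewards:
        if reward >= fraction * max_reward:
            return crashes
        crashes += 1  # assumption: each earlier episode ended in one crash
    return crashes


# Hypothetical log of averaged reward per episode, normalized to [0, 1].
reward_log = [0.1, 0.3, 0.5, 0.8, 0.9]
crashes = crashes_before_threshold(reward_log, max_reward=1.0, fraction=0.75)
```

Counting crashes up to a fixed reward level makes runs with different convergence times comparable.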
In Table 4 we list the test accuracies on a dataset taken from a target domain that has not been seen in the demonstrations. On the left, we see that behavioral cloning alone has poor performance when significant domain shifts occur. Our adapted model has performance over loss minimization of . On the right we see that adapted performance is strong even without reward in the target domain, indicating that visual domain shift is a lesser issue than being off-demonstration; our model can adapt in the source domain and still be accurate on the target. Best performance is obtained with adaptation to reward in both source and target. This also shows that there is an improvement in loss minimization when we learn from rewards. It is worth noting that we cannot judge the driving behavior only by looking at the total MSE loss, as it is not a comprehensive representation of the driving task. Each of the angle, brake, and throttle outputs converges to a separate MSE loss, among which steering angle has the smallest and brake the largest loss value. This shows that learning steering angle from demonstrations is easier than learning brake and throttle, which at each time step depend on multiple previous frames. Table 5 shows all model predictions for a relatively complex image chosen from the target domain (Fig. ? at top corner) where a pedestrian is crossing the street while the signal light is green. The behaviorally cloned models tend to predict that the agent should keep going, whereas the models adapted to the target domain with rewards are able to predict the correct decision despite the presence of the green light in the image. Our adapted large network rewarded on both source and target is able to make the best prediction for throttle and brake (steering angle is perfect across all models).



0.212  
0.185  
Fig. ? illustrates the percentage of averaged reward per episode for the aforementioned four models until convergence. Our designed architecture, which is adapted to the demonstrations on a source domain, starts with less than reward in its first episode, which lasts for seconds. This is reasonable considering that each episode of the game is intentionally set up to start in a completely random new environment, which is highly likely to be a significantly different domain than what the policy has seen up to that point. This again highlights the fact that a behaviorally cloned model is at high risk of failure when it is tested in a different domain. The models keep learning from the rewards until convergence. It can be seen in Fig. ? that our designed architecture adapted to the demonstration reaches of averaged reward after hours in its last episode (episode number ), which lasts for minutes and is then terminated by the user (no crash happens). It is also shown that a suboptimal policy adapted to the demonstrations [3] also converges, but only to of the maximum reward, and then plateaus for more than hours. Unadapted policies are also shown in Fig. ?, converging to an averaged reward of on a drastically different timescale, confirming the positive effect of using demonstrations in policy learning. A supplementary video of results can be found at https://saynaebrahimi.github.io/corl.html






Steering angle  0.006  0.003  0.005  0.002  
Brake  0.191  0.889  0.931  0.956  
Throttle  0.665  0.083  0.010  0.052  
Steering angle  0.005  0.002  0.002  0.001  
Brake  0.183  0.567  0.677  0.778  
Throttle  0.775  0.223  0.121  0.156  
5 Conclusion
The goal of this work is to learn a policy for an autonomous driving task while minimizing crashes and other safety violations during training. To this end we propose an algorithm which learns to generate an optimal network architecture from demonstration using a new reward function that optimizes accuracy and model size simultaneously. We confirm that behavioral cloning alone can perform poorly when the target domain differs from source demonstrations. We show that our method can adapt the model learned by demonstration to a new domain relying on target environmental rewards. Experimental evaluation shows that our model achieves higher accuracy, fewer cumulative crashes, and higher target domain reward. We believe these results are encouraging and important steps towards the ultimate goal of learning complex driving policies with zero cumulative crashes or serious accidents either in simulation or the real world.
Footnotes
 As no reference implementation of [3] is openly available, we had to use our own, which may be suboptimal w.r.t. the authors’ as we did not have full access to their model parameters. Also, their model was only used to predict steering angle, and overall it is not clear whether their goal was to maximize performance, find a model with relatively few parameters, or both, so they may not have explored the full design space with their model. Nonetheless, [3] was the closest model in the literature for end-to-end steering angle prediction and thus the best available baseline.
References
[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems.
[2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
[3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
[5] M. Eigen. Ingo Rechenberg Evolutionsstrategie Optimierung technischer Systeme nach Prinzipien der biologischen Evolution.
[6] J. Feng and T. Darrell. Learning the structure of deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2749–2757, 2015.
[7] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[8] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp. Off-road obstacle avoidance through end-to-end learning.
[9] M. N. Nicolescu and M. J. Mataric. Learning and interacting in human-robot domains. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans.
[10] D. A. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, Computer Science Department, 1989.
[11] J. K. Pugh and K. O. Stanley. Evolving multimodal controllers with HyperNEAT. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 735–742. ACM, 2013.
[12] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041.
[13] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
[14] A. Ruano. DeepGTAV. https://github.com/aitor/DeepGTAV
[15] P. E. Rybski and R. M. Voyles. Interactive task training of a mobile robot through human gesture recognition. In Proceedings of the 1999 IEEE International Conference on Robotics and Automation, volume 1, pages 664–669. IEEE, 1999.
[16] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
[17] J. D. Schaffer, D. Whitley, and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN-92), pages 1–37. IEEE, 1992.
[18] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks.
[19] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation.
[20] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1.
[21] S. Thrun and L. Pratt. Learning to learn.
[22] T. Veniat and L. Denoyer. Learning time-efficient deep architectures with budgeted super networks. arXiv preprint arXiv:1706.00046.
[23] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. arXiv preprint arXiv:1612.01079.
[24] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.