Learning Transferable Architectures for Scalable Image Recognition
Abstract
Developing image classification models often requires significant architecture engineering. In this paper, we attempt to automate this engineering process by learning the model architectures directly on the dataset of interest. As this approach is expensive when the dataset is large, we propose to learn an architectural building block on a small dataset that can be transferred to a large dataset. In our experiments, we search for the best convolutional layer (or “cell”) on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves, among the published work, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5 on ImageNet. Our model is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS – a reduction of 28% from the previous state-of-the-art model. When evaluated at different levels of computational cost, accuracies of networks constructed from the cells exceed those of the state-of-the-art human-designed models. For instance, a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. Finally, the image features learned from image classification are generically useful and can be transferred to other computer vision problems. On the task of object detection, the learned features used with the Faster-RCNN framework surpass the state of the art by 4.0%, achieving 43.1% mAP on the COCO dataset.
1 Introduction
ImageNet classification [7] is an important benchmark in computer vision. The seminal work of [22] on using convolutional architectures [9, 23] for ImageNet classification represents one of the most important breakthroughs in deep learning. Successive advancements on this benchmark based on convolutional neural networks (CNNs) have achieved impressive results through significant architecture engineering [32, 35, 11, 36, 34, 41].
In this paper, we consider learning the convolutional architectures directly from data with application to ImageNet classification. We focus on ImageNet classification because the features derived from this network are of great importance in computer vision. For example, features from networks that perform well on ImageNet classification provide stateoftheart performance when transferred to other computer vision tasks where labeled data is limited [8]. Furthermore, advances in convolutional architecture design may be applied to tasks outside of the image domain [38, 20].
Our approach makes use of the recently proposed Neural Architecture Search (NAS) framework [44], which uses a policy gradient algorithm to optimize architecture configurations. Running NAS directly on the ImageNet dataset is computationally expensive given the size of the dataset. We therefore use NAS to search for a good architecture on the far smaller CIFAR-10 dataset, and automatically transfer the learned architecture to ImageNet. We achieve this transferability by designing a search space so that the complexity of the architecture is independent of the depth of the network and the size of input images. More concretely, all convolutional networks in our search space are composed of convolutional layers (or “cells”) with identical structures but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structures. Searching for the best cell structures is much faster than searching for entire architectures, and the cell itself is more likely to generalize to other problems. In our experiments, this approach significantly accelerates the search for the best architectures using CIFAR-10 by a factor of 7× and learns architectures that successfully transfer to ImageNet.
Our main result is that the best architecture found by searching for architectures on CIFAR-10 achieves state-of-the-art accuracy when transferred to ImageNet classification without much modification. On ImageNet, an architecture constructed from the best cell achieves, among the published work, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5. This result amounts to a 1.2% improvement in top-1 accuracy over the best human-invented architectures, while having 9 billion fewer FLOPS. On CIFAR-10 itself, the architecture achieves 96.59% accuracy, while having fewer parameters than models with comparable performance. A small version of the state-of-the-art Shake-Shake model [10] with 2.9M parameters achieves 96.45% accuracy, while our 3.3M-parameter model achieves 96.59% accuracy. Not only does our model have better accuracy, it also needs only 600 epochs to train, which is one third of the number of epochs for the Shake-Shake model.
Additionally, by simply varying the number of convolutional cells and the number of filters in the convolutional cells, we can create convolutional architectures with different computational demands. Thanks to this property of the cells, we can generate a family of models that achieve accuracies superior to all human-invented models at equivalent or smaller computational budgets [36, 19]. Notably, the smallest version of the learned model achieves 74.0% top-1 accuracy on ImageNet, which is 3.1% better than previously engineered architectures targeted towards mobile and embedded vision tasks [14, 43].
Finally, we show that the image features learned from image classification are generically useful and transfer to other computer vision problems. In our experiments, the features learned from ImageNet classification can be combined with the Faster-RCNN framework [28] to achieve state-of-the-art results on the COCO object detection task for both the largest as well as mobile-optimized models. Our largest model achieves 43.1% mAP, which is 4% better than the previous state of the art.¹

¹ Open-source code for performing inference on models pretrained on ImageNet is available at the Slim repository within the TensorFlow repository at http://github.com/tensorflow/models/
2 Method
Our work makes use of the Neural Architecture Search (NAS) framework proposed by [44]. To briefly summarize the training procedure of NAS: a controller recurrent neural network (RNN) samples child networks with different architectures. The child networks are trained to convergence to obtain an accuracy on a held-out validation set. The resulting accuracies are used to update the controller so that the controller generates better architectures over time. The controller weights are updated with policy gradient (see Figure 1).
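To make this sample-evaluate-update loop concrete, below is a minimal, hypothetical sketch (not the paper's controller, which is an RNN trained with PPO over a large architecture space): a toy controller holds logits over a handful of candidate choices, samples one, receives a reward, and applies a REINFORCE-style update with a moving-average baseline.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nas_loop(reward_fn, num_choices, steps=2000, lr=0.1, seed=0):
    """Toy policy-gradient search over a discrete set of choices.
    In the real system, sampling a 'choice' corresponds to sampling a child
    architecture, and the reward is its held-out validation accuracy."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(num_choices)
    baseline = 0.0  # moving average of rewards, for variance reduction
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choice(num_choices, p=probs)
        reward = reward_fn(choice)
        baseline = 0.9 * baseline + 0.1 * reward
        # REINFORCE: grad of log p(choice) w.r.t. logits is one_hot - probs
        grad = -probs
        grad[choice] += 1.0
        logits += lr * (reward - baseline) * grad
    return int(np.argmax(logits))

# Toy reward in which choice 3 is "best"; the controller should discover it.
best = nas_loop(lambda c: 1.0 if c == 3 else 0.0, num_choices=8)
```

In the actual method each evaluation is a full child-network training run, which is why the cost of the search dominates and motivates the small-proxy-dataset strategy described below.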
A key element of NAS is to design the search space to generalize across problems of varying complexity and spatial scales. We observed that applying NAS directly to the ImageNet dataset would be very expensive and require months to complete an experiment. However, if the search space is properly constructed, architectural elements can transfer across datasets [44].
The focus of this work is to design a search space such that the best architecture found on the CIFAR-10 dataset would scale to larger, higher-resolution image datasets across a range of computational settings. One inspiration for this search space is the recognition that architecture engineering with CNNs often identifies repeated motifs consisting of combinations of convolutional filter banks, nonlinearities and a prudent selection of connections to achieve state-of-the-art results (such as the repeated modules present in the Inception and ResNet models [35, 11, 36, 34]). These observations suggest that it may be possible for the controller RNN to predict a generic convolutional cell expressed in terms of these motifs. This cell can then be stacked in series to handle inputs of arbitrary spatial dimensions and filter depth.
To construct a complete model for image classification, we take an architecture for a convolutional cell and simply repeat it many times. Each convolutional cell has the same architecture, but different weights. To easily build scalable architectures for images of any size, we need two types of convolutional cells to serve two main functions when taking in a feature map as input: (1) convolutional cells that return a feature map of the same dimension, and (2) convolutional cells that return a feature map whose height and width are reduced by a factor of two. We name the first and second types of convolutional cells the Normal Cell and the Reduction Cell, respectively. For the Reduction Cell, we make the initial operation applied to the cell’s inputs have a stride of two to reduce the height and width. All of the operations that we consider for building our convolutional cells have the option of striding.
Figure 2 shows our placement of Normal and Reduction Cells for CIFAR-10 and ImageNet. Note that on ImageNet we have more Reduction Cells, since the incoming image size is 299x299 compared to 32x32 for CIFAR. The Reduction and Normal Cell could have the same architecture, but we empirically found it beneficial to learn two separate architectures. We use a common heuristic to double the number of filters in the output whenever the spatial activation size is reduced, in order to maintain a roughly constant hidden state dimension [22, 32]. Importantly, much like the Inception and ResNet models [35, 11, 36, 34], we consider the number of motif repetitions and the number of initial convolutional filters as free parameters that we tailor to the scale of an image classification problem.
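As an illustration of this stacking scheme, here is a small, hypothetical sketch of the kind of layout shown in Figure 2: runs of Normal Cells separated by stride-2 Reduction Cells. The function name and its arguments are illustrative, not from the paper.

```python
def build_layout(num_normal, num_reductions):
    """Return the cell sequence: num_normal Normal Cells before each
    Reduction Cell, plus a final Normal-Cell stage after the last reduction."""
    layers = []
    for _ in range(num_reductions):
        layers += ["normal"] * num_normal + ["reduction"]
    layers += ["normal"] * num_normal
    return layers

layout = build_layout(2, 2)
# two Normal Cells per stage, two stride-2 reductions in between
```

In a real model each "normal"/"reduction" entry would be a cell with its own weights, and the filter count would double after each reduction as described above.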
Once the overall architecture of the model is defined as shown in Figure 2, the job of the controller thus becomes predicting the structures of the two convolutional cells: Normal Cell and Reduction Cell. In our search space, each cell receives as input two initial hidden states h_i and h_{i-1}, which are the outputs of the cells in the previous two lower layers or of the input image. The controller RNN recursively predicts the rest of the structure of the convolutional cell, given these two initial hidden states (Figure 3). The predictions of the controller for each cell are grouped into blocks, where each block has 5 prediction steps made by 5 distinct softmax classifiers corresponding to discrete choices of the elements of a block:

1. Select a hidden state from h_i, h_{i-1}, or from the set of hidden states created in previous blocks.

2. Select a second hidden state from the same options as in Step 1.

3. Select an operation to apply to the hidden state selected in Step 1.

4. Select an operation to apply to the hidden state selected in Step 2.

5. Select a method to combine the outputs of Step 3 and Step 4 to create a new hidden state.
The algorithm appends the newly-created hidden state to the set of existing hidden states as a potential input in subsequent blocks. The controller RNN repeats the above 5 prediction steps B times, corresponding to the B blocks in a convolutional cell. In our experiments, selecting B = 5 provides good results, although we have not exhaustively searched this space due to computational limitations.
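The five prediction steps above can be sketched as follows. This is a hypothetical illustration in which each choice is sampled uniformly at random; in the actual method each choice comes from a learned softmax of the controller RNN, and the operation list here is only a small subset of the full set given below.

```python
import random

OPS = ["identity", "3x3 avg pool", "3x3 max pool", "1x1 conv",
       "3x3 conv", "3x3 sep conv", "5x5 sep conv", "7x7 sep conv"]
COMBINE = ["add", "concat"]

def sample_cell(num_blocks=5, seed=0):
    """Sample one cell structure as a list of blocks, following the
    five-step recipe; states 0 and 1 are the cell's two external inputs."""
    rng = random.Random(seed)
    hidden_states = [0, 1]
    blocks = []
    for _ in range(num_blocks):
        h1 = rng.choice(hidden_states)    # step 1: first input state
        h2 = rng.choice(hidden_states)    # step 2: second input state
        op1 = rng.choice(OPS)             # step 3: op for the first input
        op2 = rng.choice(OPS)             # step 4: op for the second input
        comb = rng.choice(COMBINE)        # step 5: how to combine the outputs
        blocks.append((h1, op1, h2, op2, comb))
        # the block's output becomes a new hidden state usable by later blocks
        hidden_states.append(len(hidden_states))
    return blocks

cell = sample_cell()
```

Replacing each `rng.choice` with a learned softmax classifier over the same options recovers the structure of the controller's prediction problem.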
In steps 3 and 4, the controller RNN selects an operation to apply to the hidden states. We collected the following set of operations based on their prevalence in the CNN literature:
- identity
- 1x3 then 3x1 convolution
- 1x7 then 7x1 convolution
- 3x3 dilated convolution
- 3x3 average pooling
- 3x3 max pooling
- 5x5 max pooling
- 7x7 max pooling
- 1x1 convolution
- 3x3 convolution
- 3x3 depthwise-separable conv
- 5x5 depthwise-separable conv
- 7x7 depthwise-separable conv
In step 5 the controller RNN selects a method to combine the two hidden states: either (1) element-wise addition between the two hidden states, or (2) concatenation of the two hidden states along the filter dimension. Finally, all of the unused hidden states generated in the convolutional cell are concatenated together in depth to provide the final cell output.
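The final-output rule can be made concrete with a short sketch. The block encoding below (tuples of input states, operations, and a combine method) is hypothetical; states 0 and 1 denote the cell's two external inputs, and each block i creates state i + 2.

```python
def unused_states(blocks, num_inputs=2):
    """Return the hidden states created inside the cell that no later block
    consumes; these are concatenated in depth to form the cell output."""
    used = {h for (h1, _op1, h2, _op2, _c) in blocks for h in (h1, h2)}
    created = range(num_inputs, num_inputs + len(blocks))
    return [s for s in created if s not in used]

# Block 0 creates state 2; block 1 consumes state 2 and creates state 3,
# so only state 3 remains for the final concatenation.
blocks = [(0, "3x3 sep conv", 1, "identity", "add"),
          (2, "3x3 avg pool", 1, "1x1 conv", "add")]
leftovers = unused_states(blocks)
```

This rule means the cell's output depth depends on how many internal states go unconsumed, which is one reason the concatenation happens along the filter dimension.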
To allow the controller RNN to predict both the Normal Cell and the Reduction Cell, we simply make the controller have 2 × 5B predictions in total, where the first 5B predictions are for the Normal Cell and the second 5B predictions are for the Reduction Cell.
3 Experiments and Results
In this section, we describe our experiments with the method described above to learn convolutional cells. In summary, all architecture searches are performed using the CIFAR-10 classification task [21]. The controller RNN was trained using Proximal Policy Optimization (PPO) [30] by employing a global work-queue system for generating a pool of child networks controlled by the RNN. In our experiments, the pool of workers in the work-queue consisted of 500 GPUs. Please see Appendix A for complete details of the architecture learning and controller system.
The result of this search process over 4 days yields several candidate convolutional cells. We note that this search procedure is almost 7× faster than previous approaches [44] that took 28 days.² Additionally, we demonstrate below that the resulting architecture is superior in accuracy.

² In particular, we note that the previous architecture search [44] employed 800 GPUs for 28 days, resulting in 22,400 GPU-hours. The current method employs 500 GPUs across 4 days, resulting in 2,000 GPU-hours. The former effort used Nvidia K40 GPUs, whereas the current effort used faster Nvidia P100s. Discounting the fact that we employ faster hardware, we estimate that the current procedure is roughly 7× more efficient.
Figure 4 shows a diagram of the top-performing Normal Cell and Reduction Cell. Note the prevalence of separable convolutions and the number of branches compared with competing architectures [32, 35, 11, 36, 34]. Subsequent experiments focus on this convolutional cell architecture, although we examine the efficacy of other, top-ranked convolutional cells in ImageNet experiments (described in Appendix B) and report their results as well. We call the three networks constructed from the best three searches NASNet-A, NASNet-B and NASNet-C.
We demonstrate the utility of the convolutional cells by employing this learned architecture on CIFAR-10 and a family of ImageNet classification tasks. The latter family of tasks is explored across a few orders of magnitude in computational budget. After having learned the convolutional cells, several hyperparameters may be explored to build a final network for a given task: (1) the number of cell repeats and (2) the number of filters in the initial convolutional cell. After selecting the number of initial filters, we employ a common heuristic to double the number of filters whenever the stride is 2. Finally, we define a simple notation, e.g., 4 @ 64, to indicate these two parameters in all networks, where 4 and 64 indicate the number of cell repeats and the number of filters in the penultimate layer of the network, respectively.
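A tiny sketch of the filter-doubling heuristic: starting from the initial filter count, the width doubles at every stride-2 reduction so the hidden-state volume stays roughly constant. The function and the numbers here are illustrative, not values from the paper.

```python
def stage_filters(initial_filters, num_reductions):
    """Filter count for each stage of Normal Cells, doubling after each
    stride-2 Reduction Cell."""
    return [initial_filters * 2 ** i for i in range(num_reductions + 1)]

widths = stage_filters(32, 2)
# three stages, each twice as wide as the last
```

Because only the initial filter count and the repeat count vary, a single learned cell yields a whole family of networks at different computational budgets.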
3.1 Results on CIFAR10 Image Classification
For the task of image classification with CIFAR-10, we set N = 4 or 6 (Figure 2). The test accuracies of the best architectures are reported in Table 1 along with other state-of-the-art models. First, note that the resulting network architectures are comparable to or surpass architectures identified through earlier NAS methods [44]. Additionally, the best architectures are better than the previous state-of-the-art Shake-Shake model when compared at the same number of parameters. Moreover, we only train for 600 epochs, whereas Shake-Shake trains for 1800 epochs. See Appendix A for more details on CIFAR training.
| model | depth | # params | error rate (%) |
|---|---|---|---|
| DenseNet [16] | 40 | 1.0M | 5.24 |
| DenseNet [16] | 100 | 7.0M | 4.10 |
| DenseNet [16] | 100 | 27.2M | 3.74 |
| DenseNet-BC [16] | 190 | 25.6M | 3.46 |
| Shake-Shake 26 2x32d [10] | 26 | 2.9M | 3.55 |
| Shake-Shake 26 2x96d [10] | 26 | 26.2M | 2.86 |
| NAS v3 [44] | 39 | 7.1M | 4.47 |
| NAS v3 [44] | 39 | 37.4M | 3.65 |
| NASNet-A (6 @ 768) | – | 3.3M | 3.41 |
| NASNet-B (4 @ 1152) | – | 2.6M | 3.73 |
| NASNet-C (4 @ 640) | – | 3.1M | 3.59 |
3.2 Results on ImageNet Image Classification
We performed several sets of experiments on ImageNet with the best convolutional cells learned from CIFAR-10. We emphasize that we merely transfer the architectures from CIFAR-10, but train all ImageNet model weights from scratch.
Results are summarized in Tables 2 and 3 and Figure 5. In the first set of experiments, we train several image classification systems operating on 299x299 or 331x331 resolution images, with different experiments scaled in computational demand to create models that are roughly on par in computational cost with Inception-v2 [19], Inception-v3 [36] and PolyNet [42]. We show that this family of models achieves state-of-the-art performance with fewer floating point operations and parameters than comparable architectures. Second, we demonstrate that by adjusting the scale of the model we can achieve state-of-the-art performance at smaller computational budgets, exceeding streamlined CNNs hand-designed for this operating regime [14, 43].
Note that we do not add residual connections between convolutional cells, as the models learn skip connections on their own. We empirically found that manually inserting residual connections between cells does not help performance. Our training setup on ImageNet is similar to [36]; please see Appendix A for details.
| Model | image size | # parameters | Mult-Adds | Top-1 Acc. (%) | Top-5 Acc. (%) |
|---|---|---|---|---|---|
| Inception V2 [19] | 224x224 | 11.2 M | 1.94 B | 74.8 | 92.2 |
| NASNet-A (5 @ 1538) | 299x299 | 10.9 M | 2.35 B | 78.6 | 94.2 |
| Inception V3 [36] | 299x299 | 23.8 M | 5.72 B | 78.0 | 93.9 |
| Xception [5] | 299x299 | 22.8 M | 8.38 B | 79.0 | 94.5 |
| Inception ResNet V2 [34] | 299x299 | 55.8 M | 13.2 B | 80.4 | 95.3 |
| NASNet-A (7 @ 1920) | 299x299 | 22.6 M | 4.93 B | 80.8 | 95.3 |
| ResNeXt-101 (64 x 4d) [41] | 320x320 | 83.6 M | 31.5 B | 80.9 | 95.6 |
| PolyNet [42] | 331x331 | 92 M | 34.7 B | 81.3 | 95.8 |
| DPN-131 [4] | 320x320 | 79.5 M | 32.0 B | 81.5 | 95.8 |
| SENet [15] | 320x320 | 145.8 M | 42.3 B | 82.7 | 96.2 |
| NASNet-A (6 @ 4032) | 331x331 | 88.9 M | 23.8 B | 82.7 | 96.2 |
| Model | # parameters | Mult-Adds | Top-1 Acc. (%) | Top-5 Acc. (%) |
|---|---|---|---|---|
| Inception V1 [35] | 6.6 M | 1,448 M | 69.8 | 89.9 |
| MobileNet-224 [14] | 4.2 M | 569 M | 70.6 | 89.5 |
| ShuffleNet (2x) [43] | 5 M | 524 M | 70.9 | 89.8 |
| NASNet-A (4 @ 1056) | 5.3 M | 564 M | 74.0 | 91.6 |
| NASNet-B (4 @ 1536) | 5.3 M | 488 M | 72.8 | 91.3 |
| NASNet-C (3 @ 960) | 4.9 M | 558 M | 72.5 | 91.0 |
Table 2 shows that the convolutional cells discovered with CIFAR-10 generalize well to ImageNet problems. In particular, each model based on the convolutional cells exceeds the predictive performance of the corresponding hand-designed model. Importantly, the largest model achieves a new state-of-the-art performance for ImageNet (82.7%) based on single, non-ensembled predictions, surpassing the previous best published result by 1.2% [4]. Among the unpublished work, our model is on par with the best reported result of 82.7% [15], while having significantly fewer floating point operations. Figure 5 shows a complete summary of our results in comparison with other published results. Note that the family of models based on convolutional cells provides an envelope over a broad class of human-invented architectures.
Finally, we test how well the best convolutional cells perform in a resource-constrained setting, e.g., mobile devices (Table 3). In these settings, the number of floating point operations is severely constrained and predictive performance must be weighed against latency requirements on a device with limited computational resources. MobileNet [14] and ShuffleNet [43] provide state-of-the-art results, obtaining 70.6% and 70.9% accuracy, respectively, on 224x224 images using around 550M multiply-add operations. An architecture constructed from the best convolutional cells achieves superior predictive performance (74.0% accuracy), surpassing previous models with comparable computational demand. In summary, we find that the learned convolutional cells are flexible across model scales, achieving state-of-the-art performance across almost 2 orders of magnitude in computational budget.
3.3 Improved features for object detection
| Model | resolution | mAP (mini-val) | mAP (test-dev) |
|---|---|---|---|
| MobileNet-224 [14] | | 19.8% | – |
| ShuffleNet (2x) [43] | | 24.5% | – |
| NASNet-A (4 @ 1056) | | 29.6% | – |
| ResNet-101-FPN [24] | 800 (short side) | – | 36.2% |
| Inception-ResNet-v2 (G-RMI) [18] | | 35.7% | 35.6% |
| Inception-ResNet-v2 (TDM) [31] | | 37.3% | 36.8% |
| NASNet-A (6 @ 4032) | | 41.3% | 40.7% |
| NASNet-A (6 @ 4032) | | 43.2% | 43.1% |
| ResNet-101-FPN (RetinaNet) [25] | 800 (short side) | – | 39.1% |
Image classification networks provide generic image features that may be transferred to other computer vision problems [8]. One of the most important problems is the spatial localization of objects within an image. To further validate the performance of the family of NASNet-A networks, we test whether object detection systems derived from NASNet-A lead to improvements in object detection [18].
To address this question, we plug the family of NASNet-A networks pretrained on ImageNet into the Faster-RCNN object detection pipeline [28] using an open-source software platform [18]. We retrain the resulting object detection pipeline on the combined COCO training plus validation dataset, excluding 8,000 mini-validation images. We perform single-model evaluation using 300–500 RPN proposals per image. In other words, we only pass a single image through a single network. We evaluate the model on the COCO mini-val [18] and test-dev datasets and report the mean average precision (mAP) as computed with the standard COCO metric library [26]. We perform a simple search over learning rate schedules to identify the best possible model. Finally, we examine the behavior of two object detection systems employing the best-performing NASNet-A image featurization (NASNet-A, 6 @ 4032) as well as the image featurization geared towards mobile platforms (NASNet-A, 4 @ 1056).
For the mobile-optimized network, our resulting system achieves an mAP of 29.6% – exceeding previous mobile-optimized networks that employ Faster-RCNN by over 5.0% (Table 4). For the best NASNet network, our resulting network operating on images of the same spatial resolution (800x800) achieves mAP = 40.7%, exceeding equivalent object detection systems based on lesser-performing image featurizations (i.e. Inception-ResNet-v2) by 4.0% [18, 31] (see Figure 6 and Appendix). Finally, increasing the spatial resolution of the input image results in the best reported single-model result for object detection of 43.1%, surpassing the previous best by over 4.0% [25].³ These results provide further evidence that NASNet provides superior, generic image features that may be transferred across other computer vision tasks.

³ A primary advance in the best reported object detection system is the introduction of a novel loss [25]. Pairing this loss with NASNet-A image featurization may lead to even further performance gains. Additionally, performance gains are achievable through ensembling multiple inferences across multiple model instances and image crops (e.g., [18]).
3.4 Efficiency of architecture search
An open question in this proposed method is the training efficiency of the architecture search algorithm. In this section, we demonstrate the effectiveness of reinforcement learning for architecture search on the CIFAR-10 image classification problem, as compared to brute-force random search (considered to be a very strong baseline for black-box optimization [2]), given an equivalent amount of computational resources.
We define the effectiveness of an architecture search algorithm as the increase in performance over the initial architecture identified by the search method. Importantly, we emphasize that the absolute model performance is less informative here, as it artificially reflects irrelevant factors such as the number of training epochs and the specific model construction used during the architecture search process. Thus, we view the increase in model performance as a proxy for judging the convergence of the architecture search algorithm.
Figure 7 shows the performance of reinforcement learning (RL) and random search (RS) as more model architectures are sampled. Note that the best model identified with RL is significantly better than the best model found by RS – by over 1% as measured on the CIFAR-10 classification task. Additionally, RL finds an entire range of models that are of superior quality to those found by random search. We observe this in the mean performance of the top-5 and top-25 models identified by RL versus RS. We take these results to indicate that RL significantly improves our ability to learn neural architectures.
4 Conclusion
In this work, we demonstrate how to learn scalable convolutional cells from data that transfer to multiple image classification tasks. The learned architecture is quite flexible, as it may be scaled in terms of computational cost and parameters to easily address a variety of problems. In all cases, the accuracy of the resulting model exceeds all human-designed models – ranging from models designed for mobile applications to computationally-heavy models designed to achieve the most accurate results.
The key insight of this approach is to design a search space that decouples the complexity of an architecture from the depth of a network. The resulting search space permits identifying good architectures on a small dataset (i.e., CIFAR-10) and transferring the learned architecture to image classification across a range of data and computational scales.
The resulting architectures approach or exceed state-of-the-art performance on both the CIFAR-10 and ImageNet datasets with less computational demand than human-designed architectures [36, 19, 42]. The ImageNet results are particularly important because many state-of-the-art computer vision systems (e.g., object detection [18], face detection [29], image localization [39]) derive image features or architectures from ImageNet classification models. For instance, we find that image features obtained from ImageNet, used in combination with the Faster-RCNN framework, achieve state-of-the-art object detection results. Finally, we demonstrate that we can employ the resulting learned architecture to perform ImageNet classification with reduced computational budgets that outperform streamlined architectures targeted to mobile and embedded platforms [14, 43].
Acknowledgements
We thank Jeff Dean, Yifeng Lu, Jonathan Huang, Chen Sun, Jonathan Shen, Vishy Tirumalashetty, Xiaoqiang Zheng, and the Google Brain team for the help with the project. We additionally thank Christian Sigg for performance improvements to depthwise separable convolutions.
References
 [1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [2] J. Bergstra and Y. Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 2012.
 [3] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track, 2016.
 [4] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. arXiv preprint arXiv:1707.01629, 2017.
 [5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [6] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.
 [7] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
 [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, volume 32, pages 647–655, 2014.
 [9] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, pages 193–202, 1980.
 [10] X. Gastaldi. Shake-shake regularization of 3-branch residual networks. In International Conference on Learning Representations Workshop Track, 2017.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
 [13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
 [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [15] J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. arXiv preprint arXiv:1709.01507, 2017.
 [16] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [17] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, 2016.
 [18] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy tradeoffs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Learning Representations, 2015.
 [20] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, abs/1610.10099, 2016.
 [21] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing System, 2012.
 [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [24] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [25] T.Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
 [26] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
 [27] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
 [28] S. Ren, K. He, R. Girshick, and J. Sun. Faster RCNN: Towards realtime object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
 [29] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
 [30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [31] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
 [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In International Conference on Learning Representations, 2015.
 [33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In International Conference on Learning Representations Workshop Track, 2016.
 [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
 [38] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 [39] T. Weyand, I. Kostrikov, and J. Philbin. PlaNet: Photo geolocation with convolutional neural networks. In European Conference on Computer Vision, 2016.
 [40] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, 1992.
 [41] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [42] X. Zhang, Z. Li, C. C. Loy, and D. Lin. PolyNet: A pursuit of structural diversity in very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [43] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
 [44] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
Appendix A Experimental Details
A.1 Dataset for Architecture Search
The CIFAR-10 dataset [21] consists of 60,000 32x32 RGB images across 10 classes (50,000 training and 10,000 test images). We partition a random subset of 5,000 images from the training set to use as a validation set for the controller RNN. All images are whitened and then undergo several data augmentation steps: we randomly crop 32x32 patches from upsampled images of size 40x40 and apply random horizontal flips. This data augmentation procedure is common among related work.
A.2 Controller Architecture
The controller RNN is a one-layer LSTM [13] with 100 hidden units and 2 × 5B softmax predictions for the two convolutional cells (where B, the number of blocks, is typically 5) associated with each architecture decision. Each prediction of the controller RNN is associated with a probability. The joint probability of a child network is the product of all probabilities at these softmaxes. This joint probability is used to compute the gradient for the controller RNN. The gradient is scaled by the validation accuracy of the child network to update the controller RNN such that the controller assigns low probabilities to bad child networks and high probabilities to good child networks.
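The scaling described above can be sketched as follows. This is an illustrative, minimal Python sketch of the surrogate objective, not our implementation; the helper names are invented for exposition.

```python
import math

def joint_log_prob(step_probs):
    """Log of the joint probability of a sampled child architecture:
    the product of the controller's softmax probabilities, one per
    architecture decision."""
    return sum(math.log(p) for p in step_probs)

def scaled_objective(step_probs, reward, baseline):
    """REINFORCE-style surrogate objective: the joint log-probability
    of the sampled architecture weighted by (reward - baseline), so
    that child networks scoring above the baseline have their
    probabilities pushed up and those below pushed down."""
    advantage = reward - baseline
    return advantage * joint_log_prob(step_probs)
```

Differentiating this scalar with respect to the controller weights yields the scaled gradient used for the update.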
Unlike [44], who used the REINFORCE rule [40] to update the controller, we employ Proximal Policy Optimization (PPO) [30] with a learning rate of 0.00035, because training with PPO is faster and more stable. To encourage exploration we also use an entropy penalty with a weight of 0.00001. In our implementation, the baseline function is an exponential moving average of previous rewards with a weight of 0.95. The weights of the controller are initialized uniformly between -0.1 and 0.1.
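A minimal sketch of the pieces named above: the clipped PPO objective with an entropy bonus, and the exponential-moving-average reward baseline. The clipping range is an assumption (the standard 0.2 from [30]); only the entropy weight and baseline decay come from the text.

```python
import math

PPO_CLIP = 0.2            # assumed clipping range; not stated in the text
ENTROPY_WEIGHT = 0.00001  # entropy-penalty weight from the text
BASELINE_DECAY = 0.95     # EMA weight for the reward baseline

def ppo_surrogate(logp_new, logp_old, advantage, entropy):
    """Clipped PPO objective for one sampled architecture, plus the
    entropy bonus that encourages exploration."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + PPO_CLIP), 1.0 - PPO_CLIP)
    return min(ratio * advantage, clipped * advantage) + ENTROPY_WEIGHT * entropy

def update_baseline(baseline, reward):
    """Exponential moving average of previous rewards (weight 0.95)."""
    return BASELINE_DECAY * baseline + (1.0 - BASELINE_DECAY) * reward
```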
A.3 Training of the Controller
For distributed training, we use a work queue system where all the samples generated from the controller RNN are added to a global work queue. A free “child” worker in a distributed worker pool asks the controller for new work from the global work queue. Once the training of the child network is complete, the accuracy on a held-out validation set is computed and reported to the controller RNN. In our experiments we use a child worker pool size of 450, which means there are 450 networks being trained on 450 GPUs concurrently at any time. Upon receiving enough child model training results, the controller RNN performs a gradient update on its weights using PPO and then samples another batch of architectures that go into the global work queue. This process continues until a predetermined number of architectures have been sampled. In our experiments, this predetermined number of architectures is 20,000, which means the search process is terminated after 20,000 child models have been trained. Additionally, we update the controller RNN with minibatches of 20 architectures. Once the search is over, the top 250 architectures are then chosen to train until convergence on CIFAR-10 to determine the very best architecture.
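The sample-train-update loop above can be sketched in a single process as follows. All names and the dummy accuracy are illustrative stand-ins; the real system runs the workers on 450 GPUs in parallel.

```python
import queue

work_queue = queue.Queue()  # global work queue of sampled architectures
results = []                # (architecture, validation accuracy) pairs

MINIBATCH = 20  # the controller updates on minibatches of 20 results

def controller_sample_batch(n):
    """Stand-in for the controller RNN sampling n architectures."""
    return ["arch-%d" % i for i in range(n)]

def child_worker_train(arch):
    """Stand-in for a child worker training a network and measuring
    its held-out validation accuracy (dummy value here)."""
    return (arch, 0.9)

# Controller fills the queue; free workers drain it and report back.
for arch in controller_sample_batch(MINIBATCH):
    work_queue.put(arch)

while not work_queue.empty():
    results.append(child_worker_train(work_queue.get()))

# With MINIBATCH results collected, the controller would take a PPO
# step and sample the next batch, until 20,000 children are trained.
```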
a.4 Details of architecture search space
We performed preliminary experiments to identify a flexible, expressive search space for neural architectures that learn effectively. Generally, our strategy for preliminary experiments involved smallscale explorations to identify how to run largescale architecture search.
- All convolutions employ a ReLU nonlinearity. Experiments with the ELU nonlinearity [6] showed minimal benefit.
- To ensure that shapes always match in the convolutional cells, 1x1 convolutions are inserted as necessary.
- Unlike [14], depthwise separable convolutions do not employ Batch Normalization or a ReLU between the depthwise and pointwise operations.
- All convolutions follow an ordering of ReLU, convolution operation, and Batch Normalization, following [12].
- Whenever a separable convolution is selected as an operation by the model architecture, the separable convolution is applied twice to the hidden state. We found empirically that this improves overall performance.
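The operation ordering and the double application of separable convolutions can be sketched symbolically (the operation names below are illustrative, not an exhaustive list of the search space):

```python
def conv_block(op_name):
    """Ordering used for every convolution in the search space:
    ReLU -> convolution -> Batch Normalization (after [12])."""
    return ["relu", op_name, "batchnorm"]

def apply_operation(op_name):
    """Separable convolutions are applied twice in sequence, as noted
    above; every other operation is applied once."""
    reps = 2 if op_name.startswith("sep_conv") else 1
    layers = []
    for _ in range(reps):
        layers.extend(conv_block(op_name))
    return layers
```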
A.5 Training with Stochastic Regularization
We performed several experiments with various stochastic regularization methods. Naively applying dropout [33] across convolutional filters degraded performance. However, when training NASNet models, we found that stochastically dropping out each path (i.e., each edge with a yellow box in Figure 4) in the cell with some fixed probability is an effective regularizer. This is similar to [17] and [42], where full parts of the model are dropped out during training and, at test time, each path is scaled by the probability of keeping that path during training. Interestingly, we found that linearly increasing the probability of dropping out a path over the course of training significantly improves the final performance for both CIFAR and ImageNet experiments.
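A minimal sketch of this scheduled path dropout on a single path, assuming the linear schedule runs from zero to a final drop probability; the function names and signatures are illustrative.

```python
import random

def drop_path(x, drop_prob, training):
    """Zero an entire path with probability drop_prob during training;
    at test time, scale the path by its keep probability instead."""
    keep_prob = 1.0 - drop_prob
    if not training:
        return [v * keep_prob for v in x]
    if random.random() < drop_prob:
        return [0.0 for _ in x]  # the whole path is dropped at once
    return x

def scheduled_drop_prob(final_drop_prob, step, total_steps):
    """Linearly ramp the drop probability from 0 to its final value
    over the course of training."""
    return final_drop_prob * step / total_steps
```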
A.6 Training of CIFAR Models
All of our CIFAR models use a single-period cosine learning rate decay as in [27, 10]. All models use the momentum optimizer with the momentum rate set to 0.9. All models also use L2 weight decay. Each architecture is trained for a fixed 20 epochs on CIFAR-10 during the architecture search process. Additionally, we found it beneficial to use the cosine learning rate decay during the 20 epochs the CIFAR models were trained for, as this helped to further differentiate good architectures. We also found that having the CIFAR models use a small N (the number of cell repeats) during the architecture search process allowed models to train quite quickly, while still finding cells that work well once more cells were stacked.
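The single-period cosine decay referenced above follows half a cosine wave from the initial rate down to a floor; a sketch, with the rate values as placeholders:

```python
import math

def cosine_decay(step, total_steps, lr_max, lr_min=0.0):
    """Single-period cosine learning rate decay as in [27]: the rate
    falls from lr_max to lr_min over total_steps, with no restarts."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term
```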
A.7 Training of ImageNet Models
We use the ImageNet 2012 ILSVRC challenge data for large-scale image classification. The dataset consists of 1.2M images labeled across 1,000 classes [7]. Overall our training and testing procedures are almost identical to [36]. ImageNet models are trained and evaluated on 299x299 or 331x331 images using the same data augmentation procedures as described previously [36]. We use distributed synchronous SGD to train the ImageNet models with 50 workers (and 3 backup workers), each with a Tesla K40 GPU [3]. We use RMSProp with a decay of 0.9 and epsilon of 1.0. Evaluations are calculated using a running average of parameters over time with a decay rate of 0.9999. We use label smoothing with a value of 0.1 for all ImageNet models, as done in [36]. Additionally, all models use an auxiliary classifier located at 2/3 of the way up the network; the loss of the auxiliary classifier is weighted by 0.4, as done in [36]. We empirically found our networks to be insensitive to the number of parameters associated with this auxiliary classifier and to the weight associated with its loss. All models also use L2 regularization. The learning rate decay scheme is the exponential decay scheme used in [36]. Dropout is applied to the final softmax matrix with probability 0.5.
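Two of the evaluation-time details above, label smoothing and the running parameter average, can be sketched as follows (a minimal illustration on plain lists; function names are not from our implementation):

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Label smoothing with value 0.1 as in [36]: move `smoothing`
    probability mass from the true class to a uniform distribution
    over all classes."""
    n = len(one_hot)
    return [(1.0 - smoothing) * p + smoothing / n for p in one_hot]

def ema_update(shadow, params, decay=0.9999):
    """One step of the running average of parameters used at
    evaluation time (decay rate 0.9999)."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```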
Appendix B Additional Experiments
We now present two additional cells that performed well on CIFAR and ImageNet. The search spaces used for these cells are slightly different than the one used for NASNet-A. For the NASNet-B model in Figure 9, we do not concatenate all of the unused hidden states generated in the convolutional cell. Instead, all of the hidden states created within the convolutional cell, even if they are currently used, are fed into the next layer. Note that B = 4 and there are 4 hidden states as input to the cell, as these numbers must match for this cell to be valid. We also allow addition followed by layer normalization [1] or instance normalization [37] to be predicted as two of the combination operations within the cell, along with addition or concatenation.
For NASNet-C (Figure 10), we concatenate all of the unused hidden states generated in the convolutional cell as in NASNet-A, but now we allow the prediction of addition followed by layer normalization or instance normalization, as in NASNet-B.