Auto-DeepLab:
Hierarchical Neural Architecture Search for Semantic Image Segmentation
Abstract
Recently, Neural Architecture Search (NAS) has successfully identified neural network architectures that exceed human-designed ones on large-scale image classification problems. In this paper, we study NAS for semantic image segmentation, an important computer vision task that assigns a semantic label to every pixel in an image. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction, which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pretraining, our architecture searched specifically for semantic image segmentation attains state-of-the-art performance.
1 Introduction
Deep neural networks have proved successful across a large variety of artificial intelligence tasks, including image recognition [38, 25], speech recognition [27], and machine translation [72, 80]. While better optimizers [36] and better normalization techniques [32, 79] certainly played an important role, a lot of the progress comes from the design of neural network architectures. In computer vision, this holds true for both image classification [38, 71, 74, 75, 73, 25, 84, 31, 30] and dense image prediction [16, 51, 7, 63, 56, 55].
Model | Auto Cell | Auto Network | Search Dataset | Search Days | Task
ResNet [25] | ✗ | ✗ | – | – | Cls
DenseNet [31] | ✗ | ✗ | – | – | Cls
DeepLabv3+ [11] | ✗ | ✗ | – | – | Seg
NASNet [92] | ✓ | ✗ | CIFAR-10 | | Cls
AmoebaNet [61] | ✓ | ✗ | CIFAR-10 | | Cls
PNASNet [47] | ✓ | ✗ | CIFAR-10 | | Cls
DARTS [49] | ✓ | ✗ | CIFAR-10 | | Cls
DPC [6] | ✓ | ✗ | Cityscapes | | Seg
Auto-DeepLab | ✓ | ✓ | Cityscapes | | Seg
More recently, in the spirit of AutoML and democratizing AI, there has been significant interest in designing neural network architectures automatically, instead of relying heavily on expert experience and knowledge. Importantly, in the past year, Neural Architecture Search (NAS) has successfully identified architectures that exceed human-designed architectures on large-scale image classification problems [92, 47, 61].
Image classification is a good starting point for NAS, because it is the most fundamental and well-studied high-level recognition task. In addition, there exist benchmark datasets (e.g., CIFAR-10) with relatively small images, resulting in less computation and faster training. However, image classification should not be the end point for NAS, and the current success shows promise to extend into more demanding domains. In this paper, we study Neural Architecture Search for semantic image segmentation, an important computer vision task that assigns a label like “person” or “bicycle” to each pixel in the input image.
Naively porting ideas from image classification would not suffice for semantic segmentation. In image classification, NAS typically applies transfer learning from low resolution images to high resolution images [92], whereas optimal architectures for semantic segmentation must inherently operate on high resolution imagery. This suggests the need for: (1) a more relaxed and general search space to capture the architectural variations brought by the higher resolution, and (2) a more efficient architecture search technique as higher resolution requires heavier computation.
We notice that modern CNN designs [25, 84, 31] usually follow a two-level hierarchy, where the outer network level controls the spatial resolution changes and the inner cell level governs the specific layer-wise computations. The vast majority of current works on NAS [92, 47, 61, 59, 49] follow this two-level hierarchical design, but only automatically search the inner cell level while hand-designing the outer network level. This limited search space becomes problematic for dense image prediction, which is sensitive to spatial resolution changes. Therefore, in our work, we propose a trellis-like network level search space that augments the commonly-used cell level search space first proposed in [92] to form a hierarchical architecture search space. Our goal is to jointly learn a good combination of repeatable cell structure and network structure specifically for semantic image segmentation.
In terms of the architecture search method, reinforcement learning [91, 92] and evolutionary algorithms [62, 61] tend to be computationally intensive even on the low-resolution CIFAR-10 dataset, and are therefore probably not suitable for semantic image segmentation. We draw inspiration from the differentiable formulation of NAS [68, 49], and develop a continuous relaxation of the discrete architectures that exactly matches the hierarchical architecture search space. The hierarchical architecture search is conducted via stochastic gradient descent. When the search terminates, the best cell architecture is decoded greedily, and the best network architecture is decoded efficiently using the Viterbi algorithm. We search architectures directly on image crops from Cityscapes [13]. The search is very efficient and only takes about 3 days on one P100 GPU.
We report experimental results on multiple semantic segmentation benchmarks, including Cityscapes [13], PASCAL VOC 2012 [15], and ADE20K [89]. Without ImageNet [64] pretraining, our best model significantly outperforms FRRN-B [60] and GridNet [17] on the Cityscapes test set, and performs comparably with other ImageNet-pretrained state-of-the-art models [81, 87, 4, 11, 6] when also exploiting the coarse annotations on Cityscapes. Notably, our best model (without pretraining) attains the same performance as DeepLabv3+ [11] (with pretraining) while requiring fewer Multi-Adds. Additionally, our light-weight model attains performance only slightly lower than DeepLabv3+ [11], while requiring fewer parameters and fewer Multi-Adds. Finally, on PASCAL VOC 2012 and ADE20K, our best model outperforms several state-of-the-art models [89, 44, 81, 87, 82] while using strictly less data for pretraining.
To summarize, the contribution of our paper is fourfold:

Ours is one of the first attempts to extend NAS beyond image classification to dense image prediction.

We propose a network level architecture search space that augments and complements the much-studied cell level one, and consider the more challenging joint search of network level and cell level architectures.

We develop a differentiable, continuous formulation that conducts the two-level hierarchical architecture search efficiently in 3 GPU days.

Without ImageNet pretraining, our model significantly outperforms FRRN-B and GridNet, and attains comparable performance with other ImageNet-pretrained state-of-the-art models on Cityscapes. On PASCAL VOC 2012 and ADE20K, our best model also outperforms several state-of-the-art models.
2 Related Work
Semantic Image Segmentation
Convolutional neural networks [42] deployed in a fully convolutional manner (FCNs [67, 51]) have achieved remarkable performance on several semantic segmentation benchmarks. Within the state-of-the-art systems, there are two essential components: the multi-scale context module and the neural network design. It has been known that context information is crucial for pixel labeling tasks [26, 69, 37, 39, 16, 54, 14, 10]. Therefore, PSPNet [87] performs spatial pyramid pooling [21, 41, 24] at several grid scales (including image-level pooling [50]), while DeepLab [8, 9] applies several parallel atrous convolutions [28, 20, 67, 57, 7] with different rates. On the other hand, the improvement of neural network design has significantly driven the performance from AlexNet [38], VGG [71], and Inception [32, 75, 73] to ResNet [25] and more recent architectures, such as Wide ResNet [85], ResNeXt [84], DenseNet [31], and Xception [12]. In addition to adopting those networks as backbones for semantic segmentation, one could employ encoder-decoder structures [63, 2, 55, 44, 60, 58, 33, 78, 18, 11, 86, 82], which efficiently capture long-range context information while keeping detailed object boundaries. Nevertheless, for the task of semantic segmentation, most of the models require initialization from ImageNet [64] pretrained checkpoints, with FRRN [60] and GridNet [17] as exceptions. Specifically, FRRN [60] employs a two-stream system, where full-resolution information is carried in one stream and context information in the other, pooling stream. GridNet, building on a similar idea, contains multiple streams with different resolutions. In this work, we apply neural architecture search to network backbones specific to semantic segmentation. We further show state-of-the-art performance without ImageNet pretraining, significantly outperforming FRRN [60] and GridNet [17] on Cityscapes [13].
Neural Architecture Search Method
Neural Architecture Search aims at automatically designing neural network architectures, hence minimizing human hours and efforts. While some works [22, 34, 91, 49] search RNN cells for language tasks, more works search good CNN architectures for image classification.
Several papers used reinforcement learning (either policy gradients [91, 92, 5, 76] or Q-learning [3, 88]) to train a recurrent neural network that represents a policy for generating a sequence of symbols specifying the CNN architecture. An alternative to RL is to use evolutionary algorithms (EA) that “evolve” architectures by mutating the best architectures found so far [62, 83, 53, 48, 61]. However, these RL and EA methods tend to require massive computation during the search, usually thousands of GPU days. PNAS [47] proposed a progressive search strategy that markedly reduced the search cost while maintaining the quality of the searched architecture. NAO [52] embedded architectures into a latent space and performed optimization before decoding. Additionally, several works [59, 49, 1] utilized architectural sharing among sampled models instead of training each of them individually, thereby further reducing the search cost. Our work follows the differentiable NAS formulation [68, 49] and extends it into the more general hierarchical setting.
Neural Architecture Search Space
Earlier papers, e.g., [91, 62], tried to directly construct the entire network. However, more recent papers [92, 47, 61, 59, 49] have shifted to searching the repeatable cell structure, while keeping the outer network level structure fixed by hand. First proposed in [92], this strategy is likely inspired by the two-level hierarchy commonly used in modern CNNs.
Our work still uses this cell level search space to keep consistent with previous works. Yet one of our contributions is to propose a new, general-purpose network level search space, since we wish to jointly search across this two-level hierarchy. Our network level search space shares a similar outlook as [66], but with an important difference: [66] kept the entire “fabrics” with no intention to alter the architecture, whereas we associate an explicit weight with each connection and focus on decoding a single discrete structure. In addition, [66] was evaluated on segmenting face images into a small number of classes [35], whereas our models are evaluated on large-scale segmentation datasets such as Cityscapes [13], PASCAL VOC 2012 [15], and ADE20K [89].
The most similar work to ours is [6], which also studied NAS for semantic image segmentation. However, [6] focused on searching the much smaller Atrous Spatial Pyramid Pooling (ASPP) module using random search, whereas we focus on searching the much more fundamental network backbone architecture using more advanced and more efficient search methods.
3 Architecture Search Space
This section describes our two-level hierarchical architecture search space. For the inner cell level (Sec. 3.1), we reuse the one adopted in [92, 47, 61, 49] to keep consistent with previous works. For the outer network level (Sec. 3.2), we propose a novel search space based on observation and summarization of many popular designs.
3.1 Cell Level Search Space
We define a cell to be a small fully convolutional module, typically repeated multiple times to form the entire neural network. More specifically, a cell is a directed acyclic graph consisting of B blocks.
Each block is a two-branch structure, mapping from two input tensors to one output tensor. Block i in cell l may be specified using a tuple $(I_1, I_2, O_1, O_2, C)$, where $I_1, I_2 \in \mathcal{I}_i^l$ are selections of input tensors, $O_1, O_2 \in \mathcal{O}$ are selections of layer types applied to the corresponding input tensor, and $C \in \mathcal{C}$ is the method used to combine the individual outputs of the two branches to form this block’s output tensor, $H_i^l$. The cell’s output tensor $H^l$ is simply the concatenation of the blocks’ output tensors $H_1^l, \ldots, H_B^l$.
The set of possible input tensors, $\mathcal{I}_i^l$, consists of the output of the previous cell $H^{l-1}$, the output of the previous-previous cell $H^{l-2}$, and the previous blocks’ outputs in the current cell $\{H_1^l, \ldots, H_{i-1}^l\}$. Therefore, as we add more blocks to the cell, the next block has more choices as potential sources of input.
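As a concrete sketch (ours, not the paper’s code), the number of candidate inputs available to each block can be counted as follows, assuming blocks are 0-indexed:

```python
def num_input_choices(block_index: int) -> int:
    """Candidate input tensors for block i in a cell: the previous cell's
    output, the previous-previous cell's output, plus the i earlier blocks."""
    return block_index + 2

# The first block chooses among 2 tensors; the fourth block among 5.
```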
The set of possible layer types, , consists of the following operators, all prevalent in modern CNNs:

depthwise-separable conv

depthwise-separable conv

atrous conv with rate

atrous conv with rate

average pooling

max pooling

skip connection

no connection (zero)
For the set of possible combination operators $\mathcal{C}$, we simply let element-wise addition be the only choice.
3.2 Network Level Search Space
In the image classification NAS framework pioneered by [92], once a cell structure is found, the entire network is constructed using a predefined pattern. Therefore the network level was not part of the architecture search, hence its search space has never been proposed nor designed.
This predefined pattern is simple and straightforward: a number of “normal cells” (cells that keep the spatial resolution of the feature tensor) are separated equally by inserting “reduction cells” (cells that divide the spatial resolution by 2 and multiply the number of filters by 2). This keep-downsampling strategy is reasonable in the image classification case, but in dense image prediction it is also important to keep high spatial resolution, and as a result there are more network level variations [9, 56, 55].
Among the various network architectures for dense image prediction, we notice two principles that are consistent:

The spatial resolution of the next layer is either twice as large, or twice as small, or remains the same.

The smallest spatial resolution is downsampled by 32.
Following these common practices, we propose the following network level search space. The beginning of the network is a two-layer “stem” structure, each layer of which reduces the spatial resolution by a factor of 2. After that, there are a total of L layers with unknown spatial resolutions, with the maximum being downsampled by 4 and the minimum being downsampled by 32. Since consecutive layers may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8. We illustrate our network level search space in Fig. 1. Our goal is then to find a good path in this L-layer trellis.
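To make the trellis constraint concrete, here is a small sketch (ours; the set of downsampling rates {4, 8, 16, 32} is an illustrative assumption consistent with the factor-of-2 rule above):

```python
RATES = [4, 8, 16, 32]  # assumed downsampling rates reachable after the stem

def successors(rate: int) -> list:
    """Resolution states reachable at the next layer: halve, keep, or
    double the downsampling rate, clipped to the trellis bounds."""
    return [r for r in (rate // 2, rate, rate * 2) if r in RATES]
```

For example, the first layer after the stem (rate 4) can only stay at 4 or move to 8, matching the constraint stated in the text.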
In Fig. 2 we show that our search space is general enough to cover many popular designs. In the future, we plan to relax this search space even further to include U-net style architectures [63, 45, 70], where layer l may receive input from one additional preceding layer besides layer l-1.
We reiterate that our work searches the network level architecture in addition to the cell level architecture. Therefore, our search space is strictly more challenging and general-purpose than those of previous works.
4 Methods
We begin by introducing a continuous relaxation of the (exponentially many) discrete architectures that exactly matches the hierarchical architecture search described above. We then discuss how to perform architecture search via optimization, and how to decode back a discrete architecture after the search terminates.
4.1 Continuous Relaxation of Architectures
4.1.1 Cell Architecture
We reuse the continuous relaxation described in [49]. Every block’s output tensor $H_i^l$ is connected to all hidden states in $\mathcal{I}_i^l$:

$$H_i^l = \sum_{H_j^l \in \mathcal{I}_i^l} O_{j \to i}(H_j^l) \qquad (1)$$

In addition, we approximate each $O_{j \to i}$ with its continuous relaxation $\bar{O}_{j \to i}$, defined as:

$$\bar{O}_{j \to i}(H_j^l) = \sum_{O^k \in \mathcal{O}} \alpha_{j \to i}^k \, O^k(H_j^l) \qquad (2)$$

where

$$\sum_{k=1}^{|\mathcal{O}|} \alpha_{j \to i}^k = 1, \quad \forall i, j \qquad (3)$$

$$\alpha_{j \to i}^k \ge 0, \quad \forall i, j, k \qquad (4)$$

In other words, the $\alpha_{j \to i}^k$ are normalized scalars associated with each operator $O^k \in \mathcal{O}$, easily implemented as a softmax.
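A minimal sketch of the softmax-weighted mixed operator in the spirit of Eq. (2)-(4), using scalar inputs and plain Python for illustration (not the paper’s implementation):

```python
import math

def softmax(scores):
    """Normalize raw architecture parameters into weights satisfying
    Eq. (3)-(4): non-negative and summing to 1."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, scores):
    """Continuous relaxation of a single edge: a softmax-weighted sum of
    every candidate operator applied to the same (scalar) input."""
    return sum(w * op(x) for w, op in zip(softmax(scores), ops))
```

With equal scores, every operator contributes equally; as one score dominates, the mixed operator approaches the corresponding discrete choice.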
4.1.2 Network Architecture
Within a cell, all tensors are of the same spatial size, which enables the (weighted) sum in Eq. (1) and Eq. (2). However, as clearly illustrated in Fig. 1, tensors may take different sizes at the network level. Therefore, in order to set up the continuous relaxation, each layer l will have at most 4 hidden states $\{{}^{4}H^{l}, {}^{8}H^{l}, {}^{16}H^{l}, {}^{32}H^{l}\}$, with the upper-left superscript indicating the spatial resolution.
We design the network level continuous relaxation to exactly match the search space described in Sec. 3.2. We associate a scalar with each gray arrow in Fig. 1, and the network level update is:

$$^{s}H^{l} = \beta_{\frac{s}{2} \to s}^{l} \, \mathrm{Cell}\left({}^{\frac{s}{2}}H^{l-1},\, {}^{s}H^{l-2};\, \alpha\right) + \beta_{s \to s}^{l} \, \mathrm{Cell}\left({}^{s}H^{l-1},\, {}^{s}H^{l-2};\, \alpha\right) + \beta_{2s \to s}^{l} \, \mathrm{Cell}\left({}^{2s}H^{l-1},\, {}^{s}H^{l-2};\, \alpha\right) \qquad (6)$$

where $s$ ranges over the allowed downsampling rates and $l = 1, 2, \ldots, L$. The scalars $\beta$ are normalized such that

$$\beta_{s \to \frac{s}{2}}^{l} + \beta_{s \to s}^{l} + \beta_{s \to 2s}^{l} = 1, \quad \forall s, l \qquad (7)$$

$$\beta_{s \to \frac{s}{2}}^{l} \ge 0, \quad \beta_{s \to s}^{l} \ge 0, \quad \beta_{s \to 2s}^{l} \ge 0, \quad \forall s, l \qquad (8)$$

also implemented as a softmax.
Eq. (6) shows how the continuous relaxations of the two-level hierarchy are weaved together. In particular, $\beta$ controls the outer network level, and hence depends on the spatial size and layer index. Each scalar in $\beta$ governs an entire set of $\alpha$, yet $\alpha$ specifies the same cell architecture, which depends on neither spatial size nor layer index.
As illustrated in Fig. 1, Atrous Spatial Pyramid Pooling (ASPP) modules are attached to each spatial resolution at the L-th layer (atrous rates are adjusted accordingly). Their outputs are bilinearly upsampled to the original resolution before being summed to produce the prediction.
4.2 Optimization
The advantage of introducing this continuous relaxation is that the scalars controlling the connection strength between different hidden states are now part of the differentiable computation graph. Therefore, they can be optimized efficiently using gradient descent. We adopt the first-order approximation in [49], and partition the training data into two disjoint sets, trainA and trainB. The optimization alternates between:

Update network weights $w$ by descending the gradient $\nabla_{w}\,\mathcal{L}_{trainA}(w, \alpha, \beta)$

Update architecture encoding $\alpha, \beta$ by descending the gradient $\nabla_{\alpha, \beta}\,\mathcal{L}_{trainB}(w, \alpha, \beta)$
where the loss function $\mathcal{L}$ is the cross entropy calculated on the semantic segmentation mini-batch.
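The alternating scheme can be sketched as follows (ours, not the paper’s code); the gradient callables stand in for the trainA/trainB cross-entropy gradients, and the learning rates are placeholders:

```python
def alternate_search(w, alpha, grad_w, grad_alpha, lr_w=0.1, lr_a=0.01, steps=100):
    """Alternate first-order updates: a weight step on the trainA objective,
    then an architecture step on the trainB objective."""
    for _ in range(steps):
        w = w - lr_w * grad_w(w, alpha)              # network weights, trainA
        alpha = alpha - lr_a * grad_alpha(w, alpha)  # architecture, trainB
    return w, alpha
```

With simple convex stand-in objectives, both variables converge to their respective minimizers, which is the intended behavior of the alternation.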
4.3 Decoding Discrete Architectures
Cell Architecture
Following [49], we decode the discrete cell architecture by first retaining the 2 strongest predecessors for each block (with the strength of the edge from hidden state $j$ to hidden state $i$ being $\max_{k,\, O^k \ne \mathrm{zero}} \alpha_{j \to i}^k$; recall from Sec. 3.1 that “zero” means “no connection”), and then choosing the most likely operator on each retained edge by taking the argmax.
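A toy sketch (ours) of this decoding rule, where each edge’s operator weights are given as a dict and 'zero' is excluded when ranking predecessors:

```python
def decode_block(strengths):
    """Decode one block: keep the 2 strongest predecessor edges, then pick
    the argmax operator on each. `strengths[j]` maps operator name -> weight
    for the edge from hidden state j; 'zero' (no connection) is excluded
    when measuring edge strength."""
    def edge_strength(j):
        return max(w for op, w in strengths[j].items() if op != 'zero')
    top2 = sorted(strengths, key=edge_strength, reverse=True)[:2]
    return [(j, max(((w, op) for op, w in strengths[j].items() if op != 'zero'))[1])
            for j in top2]
```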
Network Architecture
Eq. (7) essentially states that the “outgoing probability” at each of the blue nodes in Fig. 1 sums to 1. In fact, the $\beta$ values can be interpreted as the “transition probability” between different “states” (spatial resolutions) across different “time steps” (layer numbers). Quite intuitively, our goal is to find the path with the “maximum probability” from start to end. This path can be decoded efficiently using the classic Viterbi algorithm, as done in our implementation.
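A compact Viterbi sketch over the resolution trellis (ours; `trans_prob` stands in for the learned β values and should return 0 for disallowed transitions):

```python
def viterbi_path(states, trans_prob, num_layers, start):
    """Max-probability resolution path through the trellis.
    trans_prob(layer, s_prev, s_next) plays the role of the beta weights."""
    best = {s: ((1.0, [start]) if s == start else (0.0, [])) for s in states}
    for layer in range(1, num_layers + 1):
        new = {}
        for s in states:
            # Best predecessor for state s at this layer.
            p, path = max(
                ((prev_p * trans_prob(layer, sp, s), prev_path + [s])
                 for sp, (prev_p, prev_path) in best.items()),
                key=lambda c: c[0],
            )
            new[s] = (p, path)
        best = new
    return max(best.values(), key=lambda b: b[0])[1]
```

Each step keeps only the best-scoring path into every resolution state, so the final maximization recovers the globally most probable path.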
5 Experimental Results
Herein, we report our architecture search implementation details as well as the search results. We then report semantic segmentation results on benchmark datasets with our best found architecture.
5.1 Architecture Search Implementation Details
We consider a total of L layers in the network, and B blocks in a cell. The network level search space alone contains a very large number of unique paths, and the number of possible cell structures is larger still; the size of the joint, hierarchical search space is the product of the two.
We follow the common practice of doubling the number of filters when halving the height and width of the feature tensor. Every blue node in Fig. 1 with downsample rate s therefore has a number of output filters proportional to s, scaled by the filter multiplier F that controls the model capacity; F is kept small during the architecture search. A stride-2 convolution is used for all downsampling connections, both to reduce the spatial size and double the number of filters. Bilinear upsampling followed by a convolution is used for all upsampling connections, both to increase the spatial size and halve the number of filters.
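The “double the filters when halving the resolution” rule can be written as a one-liner; here F denotes the filter multiplier mentioned above, and the base rate of 4 after the stem is an assumption for illustration:

```python
def num_filters(downsample_rate: int, filter_multiplier: int) -> int:
    """Output filters at a trellis node: F at the assumed base rate of 4,
    doubled each time the spatial resolution is halved."""
    return filter_multiplier * downsample_rate // 4
```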
The Atrous Spatial Pyramid Pooling module used in [9] has five branches: one convolution, three convolutions with various atrous rates, and pooled image features. During the search, we simplify ASPP to fewer branches by only using one convolution with a single atrous rate. The number of filters produced by each ASPP branch is unchanged.
We conduct architecture search on the Cityscapes dataset [13] for semantic image segmentation. More specifically, we use random image crops from half-resolution images in the train_fine set. We randomly select half of the images in train_fine as trainA, and the other half as trainB (see Sec. 4.2).
The architecture search optimization is conducted for a fixed budget of epochs. The batch size is small due to the GPU memory constraint. When learning the network weights $w$, we use the SGD optimizer with momentum, a cosine learning rate schedule, and weight decay. When learning the architecture encoding $\alpha, \beta$, we use the Adam optimizer [36] with a smaller learning rate and weight decay. We empirically found that if $\alpha, \beta$ are optimized from the beginning, when $w$ are not yet well trained, the architecture tends to fall into bad local optima; therefore we start optimizing $\alpha, \beta$ only after an initial number of warm-up epochs. The entire architecture search optimization takes about 3 days on one P100 GPU. Fig. 4 shows that the validation accuracy steadily improves throughout this process. We also tried searching for more epochs, but did not observe any benefit.
Fig. 3 visualizes the best architecture found. In terms of network level architecture, higher resolution is preferred at both the beginning (the path stays at a low downsampling rate for longer) and the end (the path terminates at a low downsampling rate). We also show the strongest outgoing connection at each node using gray dashed arrows. We observe a general tendency to downsample in the early layers and upsample in the final layers. In terms of cell level architecture, the conjunction of atrous convolution and depthwise-separable convolution is often used, suggesting that the importance of context has been learned. Note that atrous convolution is rarely found to be useful in cells for image classification.¹

¹ Among NASNet-{A, B, C}, PNASNet-{1, 2, 3, 4, 5}, AmoebaNet-{A, B, C}, ENAS, and DARTS, atrous convolution was used only once, in the AmoebaNet-B reduction cell.
5.2 Semantic Segmentation Results
We evaluate the performance of our found best architecture on Cityscapes [13], PASCAL VOC 2012 [15], and ADE20K [89] datasets.
We follow the same training protocol as in [9, 11]. In brief, during training we adopt a polynomial learning rate schedule [50] and large crop sizes on all datasets (Cityscapes, PASCAL VOC 2012, and resized ADE20K images). Batch normalization parameters [32] are fine-tuned during training. The models are trained from scratch with 1.5M iterations on Cityscapes, 1.5M iterations on PASCAL VOC 2012, and 4M iterations on ADE20K, respectively.
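For reference, a minimal sketch of the “poly” schedule cited above; the power 0.9 is the value conventionally used with this schedule, assumed here rather than stated in the text:

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Polynomial learning rate decay: base_lr * (1 - step/max_steps)^power."""
    return base_lr * (1 - step / max_steps) ** power
```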
Method | ImageNet | F | Multi-Adds | Params | mIOU (%)
Auto-DeepLab-S | | 20 | 333.25B | 10.15M | 79.74
Auto-DeepLab-M | | 32 | 460.93B | 21.62M | 80.04
Auto-DeepLab-L | | 48 | 695.03B | 44.42M | 80.33
FRRN-A [60] | | | | 17.76M | 65.7
FRRN-B [60] | | | | 24.78M |
DeepLabv3+ [11] | ✓ | | 1551.05B | 43.48M | 79.55
Method | itr-500K | itr-1M | itr-1.5M | SDP | mIOU (%)
Auto-DeepLab-S | ✓ | | | | 75.20
Auto-DeepLab-S | | ✓ | | | 77.09
Auto-DeepLab-S | | | ✓ | | 78.00
Auto-DeepLab-S | | | ✓ | ✓ | 79.74
Method | ImageNet | Coarse | mIOU (%)
FRRN-A [60] | | | 63.0
GridNet [17] | | | 69.5
FRRN-B [60] | | | 71.8
Auto-DeepLab-S | | | 79.9
Auto-DeepLab-L | | | 80.4
Auto-DeepLab-S | | ✓ | 80.9
Auto-DeepLab-L | | ✓ | 82.1
ResNet-38 [81] | ✓ | ✓ | 80.6
PSPNet [87] | ✓ | ✓ | 81.2
Mapillary [4] | ✓ | ✓ | 82.0
DeepLabv3+ [11] | ✓ | ✓ | 82.1
DPC [6] | ✓ | ✓ | 82.7
DRN_CRL_Coarse [90] | ✓ | ✓ | 82.8
We adopt a simple encoder-decoder structure similar to DeepLabv3+ [11]. Specifically, our encoder consists of our found best network architecture augmented with the ASPP module [8, 9], and our decoder is the same as the one in DeepLabv3+, which recovers boundary information by exploiting low-level features with a small downsample rate. Additionally, we redesign the “stem” structure with three convolutions (with stride 2 in the first and third convolutions). The first two convolutions share the same number of filters, while the third convolution has more. This “stem” has been shown to be effective for segmentation in [87, 77].
5.2.1 Cityscapes
Cityscapes [13] contains high quality pixel-level annotations split into training, validation, and test sets, plus a larger set of coarsely annotated training images. Following the evaluation protocol [13], the semantic labels are used for evaluation without considering the void label.
In Tab. 2, we report the Cityscapes validation set results. Similar to MobileNets [29, 65], we adjust the model capacity by changing the filter multiplier F. As shown in the table, higher model capacity leads to better performance at the cost of slower speed (indicated by larger Multi-Adds).
In Tab. 3, we show that increasing the training iterations from 500K to 1.5M improves the performance by 2.8% when employing our light-weight model variant, Auto-DeepLab-S. Additionally, adopting the Scheduled Drop Path [40, 92] further improves the performance by 1.74%, reaching 79.74% on the Cityscapes validation set.
We then report the test set results in Tab. 4. Without any pretraining, our best model (Auto-DeepLab-L) significantly outperforms FRRN-B [60] by 8.6% and GridNet [17] by 10.9%. With extra coarse annotations, our model Auto-DeepLab-L, without pretraining on ImageNet [64], achieves a test set performance of 82.1%, outperforming PSPNet [87] and Mapillary [4], and attains the same performance as DeepLabv3+ [11] while requiring fewer Multi-Adds computations. Notably, our light-weight model variant, Auto-DeepLab-S, attains 80.9% on the test set, comparable to PSPNet, while using merely 10.15M parameters and 333.25B Multi-Adds.
5.2.2 PASCAL VOC 2012
PASCAL VOC 2012 [15] contains foreground object classes and one background class. We augment the original dataset with the extra annotations provided by [23], resulting in the augmented train_aug training set.
In Tab. 5, we report our validation set results. Our best model, Auto-DeepLab-L, with single-scale inference significantly outperforms DropBlock [19] by 20.36%. Additionally, for all our model variants, adopting multi-scale inference improves the performance by about 1%. Further pretraining our models on COCO [46] for 4M iterations improves the performance significantly.
Finally, we report the PASCAL VOC 2012 test set results with our COCO-pretrained model variants in Tab. 6. As shown in the table, our best model attains a performance of 85.6% on the test set, outperforming RefineNet [44] and PSPNet [87]. Our model lags behind the top-performing DeepLabv3+ [11] with Xception-65 as network backbone by 2.2%. We think that the PASCAL VOC 2012 dataset is too small to train models from scratch, and pretraining on ImageNet is still beneficial in this case.
Method | MS | COCO | mIOU (%)
DropBlock [19] | | | 53.4
Auto-DeepLab-S | | | 71.68
Auto-DeepLab-S | ✓ | | 72.54
Auto-DeepLab-M | | | 72.78
Auto-DeepLab-M | ✓ | | 73.69
Auto-DeepLab-L | | | 73.76
Auto-DeepLab-L | ✓ | | 75.26
Auto-DeepLab-S | | ✓ | 78.31
Auto-DeepLab-S | ✓ | ✓ | 80.27
Auto-DeepLab-M | | ✓ | 79.78
Auto-DeepLab-M | ✓ | ✓ | 80.73
Auto-DeepLab-L | | ✓ | 80.75
Auto-DeepLab-L | ✓ | ✓ | 82.04
Method | ImageNet | COCO | mIOU (%)
Auto-DeepLab-S | | ✓ | 82.5
Auto-DeepLab-M | | ✓ | 84.1
Auto-DeepLab-L | | ✓ | 85.6
RefineNet [44] | ✓ | ✓ | 84.2
ResNet-38 [81] | ✓ | ✓ | 84.9
PSPNet [87] | ✓ | ✓ | 85.4
DeepLabv3+ [11] | ✓ | ✓ | 87.8
MSCI [43] | ✓ | ✓ | 88.0
Method | ImageNet | mIOU (%) | Pixel-Acc (%) | Avg (%)
Auto-DeepLab-S | | 40.69 | 80.60 | 60.65
Auto-DeepLab-M | | 42.19 | 81.09 | 61.64
Auto-DeepLab-L | | 43.98 | 81.72 | 62.85
CascadeNet (VGG-16) [89] | ✓ | 34.90 | 74.52 | 54.71
RefineNet (ResNet-152) [44] | ✓ | 40.70 | – | –
UPerNet (ResNet-101) [82] | ✓ | 42.66 | 81.01 | 61.84
PSPNet (ResNet-152) [87] | ✓ | 43.51 | 81.38 | 62.45
PSPNet (ResNet-269) [87] | ✓ | 44.94 | 81.69 | 63.32
DeepLabv3+ (Xception-65) [11] | ✓ | 45.65 | 82.52 | 64.09
5.2.3 ADE20K
ADE20K [89] has a large set of semantic classes with high quality annotations for its training and validation images. In our experiments, the images are all resized during training so that the longer side has a fixed length.
6 Conclusion
In this paper, we present one of the first attempts to extend Neural Architecture Search beyond image classification to dense image prediction problems. Instead of fixating on the cell level, we acknowledge the importance of spatial resolution changes, and embrace the architectural variations by incorporating the network level into the search space. We also develop a differentiable formulation that allows efficient architecture search (markedly faster than DPC [6]) over our two-level hierarchical search space. The result of the search, Auto-DeepLab, is evaluated by training on benchmark semantic segmentation datasets from scratch. On Cityscapes, Auto-DeepLab significantly outperforms the previous state-of-the-art that does not use ImageNet pretraining, and performs comparably with ImageNet-pretrained top models when exploiting the coarse annotations. On PASCAL VOC 2012 and ADE20K, Auto-DeepLab also outperforms several ImageNet-pretrained state-of-the-art models.
There are many possible directions for future work. Within the current framework, related applications such as object detection should be plausible; we could also try untying the cell architecture across different layers (cf. [76]) with little computation overhead. Beyond the current framework, a more general and relaxed network level search space should be beneficial (cf. Sec. 3.2).
Acknowledgments
We thank Sergey Ioffe for the valuable feedback about the draft; Cloud AI and Mobile Vision team for support.
References
 [1] K. Ahmed and L. Torresani. Maskconnect: Connectivity learning by gradient descent. In ECCV, 2018.
 [2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoderdecoder architecture for image segmentation. arXiv:1511.00561, 2015.
 [3] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
 [4] S. R. Bulò, L. Porzi, and P. Kontschieder. Inplace activated batchnorm for memoryoptimized training of dnns. In CVPR, 2018.
 [5] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI, 2018.
 [6] L.C. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multiscale architectures for dense image prediction. In NIPS, 2018.
 [7] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
 [8] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
 [9] L.C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
 [10] L.C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scaleaware semantic image segmentation. In CVPR, 2016.
 [11] L.C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoderdecoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
 [12] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
 [13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [14] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
 [15] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge – a retrospective. IJCV, 2014.
 [16] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
 [17] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017.
 [18] J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943, 2017.
 [19] G. Ghiasi, T.-Y. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In NIPS, 2018.
 [20] A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013.
 [21] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005.
 [22] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. arXiv:1503.04069, 2015.
 [23] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
 [24] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
 [25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [26] X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
 [27] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [28] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. Springer, 1989.
 [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
 [30] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
 [31] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [32] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [33] M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017.
 [34] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015.
 [35] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller. Augmenting CRFs with boltzmann machine shape priors for image labeling. In CVPR, 2013.
 [36] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [37] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
 [38] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [39] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
 [40] G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In ICLR, 2017.
 [41] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
 [42] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
 [43] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang. Multi-scale context intertwining for semantic segmentation. In ECCV, 2018.
 [44] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In CVPR, 2017.
 [45] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
 [46] T.-Y. Lin et al. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [47] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.
 [48] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
 [49] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv:1806.09055, 2018.
 [50] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. arXiv:1506.04579, 2015.
 [51] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [52] R. Luo, F. Tian, T. Qin, and T.-Y. Liu. Neural architecture optimization. In NIPS, 2018.
 [53] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Evolving deep neural networks. arXiv:1703.00548, 2017.
 [54] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
 [55] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
 [56] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
 [57] G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015.
 [58] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, 2017.
 [59] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.
 [60] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
 [61] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018.
 [62] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
 [63] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
 [64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 [65] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 [66] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, 2016.
 [67] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
 [68] R. Shin, C. Packer, and D. Song. Differentiable neural network architecture search. In ICLR Workshop, 2018.
 [69] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009.
 [70] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
 [71] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [72] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
 [73] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
 [74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [75] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 [76] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv:1807.11626, 2018.
 [77] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In WACV, 2018.
 [78] Z. Wojna, V. Ferrari, S. Guadarrama, N. Silberman, L.-C. Chen, A. Fathi, and J. Uijlings. The devil is in the decoder. In BMVC, 2017.
 [79] Y. Wu and K. He. Group normalization. In ECCV, 2018.
 [80] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
 [81] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv:1611.10080, 2016.
 [82] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
 [83] L. Xie and A. Yuille. Genetic cnn. In ICCV, 2017.
 [84] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
 [85] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
 [86] Z. Zhang, X. Zhang, C. Peng, D. Cheng, and J. Sun. ExFuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.
 [87] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
 [88] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu. Practical block-wise neural network architecture generation. In CVPR, 2018.
 [89] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
 [90] Y. Zhuang, F. Yang, L. Tao, C. Ma, Z. Zhang, Y. Li, H. Jia, X. Xie, and W. Gao. Dense relation network: Learning consistent and context-aware representation for semantic image segmentation. In ICIP, 2018.
 [91] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
 [92] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.