# Meta Architecture Search

###### Abstract

Neural Architecture Search (NAS) has been quite successful in constructing state-of-the-art models on a variety of tasks. Unfortunately, the computational cost can make it difficult to scale. In this paper, we make the first attempt to study Meta Architecture Search, which aims at learning a task-agnostic representation that can be used to speed up the process of architecture search on a large number of tasks. We propose the Bayesian Meta Architecture SEarch (BASE) framework, which takes advantage of a Bayesian formulation of the architecture search problem to learn over an entire set of tasks simultaneously. We show that on Imagenet classification, we can find a model that achieves 25.7% top-1 error and 8.1% top-5 error by adapting the architecture in less than an hour from a meta-network pretrained for 8 GPU days. By learning a good prior for NAS, our method dramatically decreases the required computation cost while achieving comparable performance to current state-of-the-art methods, even finding competitive models for unseen datasets with very quick adaptation. We believe our framework will open up new possibilities for efficient and massively scalable architecture search research across multiple tasks. The code repository is available at https://github.com/ashaw596/meta_architecture_search.

## 1 Introduction

For deep neural networks, the particular structure often plays a vital role in achieving state-of-the-art performance in many practical applications, and there has been much work [LeCBen15, HeZhaRenSun016, HuaSunLiuSedetal16, zhang1707shufflenet, liu2017spherenet, Liu2018DCNets, liu2019NSL, szegedy2015going, simonyan2014very, xie2019exploring] exploring the space of neural network designs. Due to the combinatorial nature of the design space, hand-designing architectures is time-consuming and inevitably sub-optimal. Automated Neural Architecture Search (NAS) has had great success in finding high-performance architectures. However, people may need optimal architectures for several similar tasks at once, such as solving different classification tasks or even optimizing task networks for both high accuracy and efficient inference on multiple hardware platforms [FBNET]. Although there has been success in transferring architectures across tasks [transferable], recent work has increasingly shown that the optimal architectures can vary between even similar tasks; to achieve the best results, NAS would need to be repeatedly run for each task [PROXYLESS] which can be quite costly.

In this work, we present a first effort towards Meta Architecture Search, which aims at learning a task-agnostic representation that can be used to search over multiple tasks efficiently. The overall graphical illustration of the model can be found in Figure 1, where the meta-network represents the collective knowledge of architecture search across tasks. Meta Architecture Search takes advantage of the similarities among tasks and the corresponding similarities in their optimal networks, reducing the overall training time significantly and allowing fast adaptation to new tasks. We formulate the Meta Architecture Search problem from a Bayesian perspective and propose Bayesian Meta Architecture SEarch (BASE), a novel framework that derives a variational inference method to learn optimal weights and architectures for a task distribution. To parameterize the architecture search space, we use a stochastic neural network which contains all the possible architectures within our architecture space as specific paths within the network. By using the Gumbel-Softmax [jang2017categorical] distribution in the parameterization of the path distributions, this network containing an entire architecture space can be optimized differentiably. To account for the task distribution in the posterior distribution of the neural network architecture and weights, we exploit the optimization embedding [DaiDaiHeLiuetal18] technique to design the parameterization of the posterior. This allows us to train it as a meta-network optimized over a task distribution.

To train our meta-network over a wide distribution of tasks with different image sizes, we define a new space of classification tasks by randomly selecting 10 Imagenet [DenDonSocLiEtal09] classes and downsampling the images to 32x32, 64x64, or 224x224. By training on these datasets, we can learn good distributions of architectures optimized for different image sizes. With a meta-network trained for 8 GPU days, we then show that we can achieve very competitive results on full Imagenet by deriving optimal task-specific architectures from the meta-network, obtaining 25.7% top-1 error on ImageNet using an adaptation time of less than one hour. Our method achieves significantly lower computational costs compared to current state-of-the-art NAS approaches. By adapting the multi-task meta-network to the unseen CIFAR10 dataset for less than one hour, we found a model that achieves 2.83% top-1 error. Additionally, we also apply this method to tackle neural architecture search for few-shot learning, demonstrating the flexibility of our framework.

Our research opens new potentials for using Meta Architecture Search across massive amounts of tasks. The nature of the Bayesian formulation makes it possible to learn over an entire collection of tasks simultaneously, bringing additional benefits such as computational efficiency and privacy when performing neural architecture search.

## 2 Related Work

##### Neural Architecture Search

Several evolutionary and reinforcement learning based algorithms have been quite successful in achieving state-of-the-art performances on many tasks [zoph2016neural, transferable, real2018regularized, mobilenetv3]. However, these methods are computationally costly and require tremendous amounts of computing resources. While previous work has achieved good results with sharing architectures across tasks [transferable], [FBNET] and [PROXYLESS] show that task-specific and even platform-specific architecture search is required in order to achieve the best performance. Several methods [DARTS, ENAS, cai2018path, EAS, autodeeplab] have been proposed to reduce the search time, and both FBNet [FBNET] and SNAS [SNAS] utilize the Gumbel-Softmax [jang2017categorical] distribution, similarly to our meta-network design, to allow gradient-based architecture optimization. [SMASH] and [HYPER] also both propose methods to generate optimal weights for one task given any architecture, which our meta-network is also capable of. Their methods, however, do not allow optimization of the architectures and are only trained on a single task, making them inefficient in optimizing over multiple tasks. Similar to our work, [DBLP:journals/corr/abs-1903-03536] recently proposed methods to accelerate search by utilizing knowledge from previous searches and predicting posterior distributions of the optimal architecture. Our approach, however, achieves much better computational efficiency by not limiting ourselves to transferring knowledge from only the performance of discrete architectures on the validation datasets, but instead sharing knowledge for both optimal weights and architecture parameters and implicitly characterizing the entire dataset utilizing optimization embedding.

##### Meta Learning

Meta-learning methods allow networks to be quickly trained on new data and new tasks [MAML, ravi2017optimization]. While previous works have not applied these methods to Neural Architecture Search, our derived Bayesian optimization method bears some similarities to Neural Processes [pmlr-v80-garnelo18a, DBLP:journals/corr/abs-1807-01622, kim2018attentive]. Both can derive a neural network specialized for a dataset by conditioning the model on some samples from the dataset. The use of neural networks allows both to be optimized by gradient descent. However, Neural Processes use specially structured encoder and aggregator networks to build a context embedding from the samples. We use the optimization embedding technique [DaiDaiHeLiuetal18] to condition our neural network using gradient descent in an inner loop, which allows us to avoid explicitly summarizing the datasets with a separate network. This inner-outer loop dynamic shares some similarities to second-order MAML [MAML]. Both algorithms unroll the stochastic gradient descent step. Due to this, we are also able to establish a connection between the heuristic MAML algorithm and Bayesian inference.

## 3 A Bayesian Inference View of Architecture Search

In this section, we propose a Bayesian inference view for neural architecture search which naturally introduces the hierarchical structures across different tasks. Such a view inspires an efficient algorithm which can provide a task-specific neural network with adapted weights and architecture using only *a few* learning steps.

We first formulate the neural architecture search as an operation selection problem. Specifically, we consider the neural network as a composition of layers of cells, where the cells share the same architecture, but have different parameters. In the -th layer, the cell consists of a -layer sub-network with bypass connections. Specifically, we denote the as the output of the -th layer of -th cell

(1) |

where denotes a group of different operations from which depend on parameters , e.g., different nonlinear neurons, convolution kernels with different sizes, or other architecture choices. are all binary variables which are shared across layers. They indicate which layers from the to levels in -th cell should be selected as inputs to the -th layer. Therefore, with different instantiations of , the cell will select different operations to form the output. Figure 1 has an illustration of this structure.
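As a rough illustration of the operation-selection view described above, the following sketch computes one cell node as a sum, over earlier nodes, of the single operation picked by a one-hot selector. The operation list `OPS`, the scalar stand-ins for feature maps, and the names `z` and `node_output` are assumptions made for this example and are not the paper's notation or implementation.

```python
# Toy sketch: a cell node as a sum of selected operations over earlier nodes.
# OPS, z, and node_output are hypothetical names used only for illustration.

OPS = [
    lambda x: 0.0,          # "no connection"
    lambda x: x,            # skip connection
    lambda x: max(x, 0.0),  # stand-in for a nonlinear operation
    lambda x: 2.0 * x,      # stand-in for a parameterized operation
]


def node_output(inputs, z):
    """inputs[i] is the output of earlier node i (a scalar here).

    z[i] is a one-hot list that selects which operation in OPS is
    applied to inputs[i] before the results are summed.
    """
    total = 0.0
    for x_i, z_i in zip(inputs, z):
        assert sum(z_i) == 1, "each selector must be one-hot"
        total += sum(z_ij * op(x_i) for z_ij, op in zip(z_i, OPS))
    return total


# A node with two predecessors: skip connection on the first input,
# the scaling operation on the second.
out = node_output([1.5, -2.0], [[0, 1, 0, 0], [0, 0, 0, 1]])
```

Changing the one-hot selectors changes which operations contribute to the node, which is the sense in which choosing the binary variables amounts to choosing an architecture.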

We assume the probabilistic model as

(2) | ||||

with , , and , . With this probabilistic model, the selection of , i.e., neural network architecture search, is reduced to finding a distribution defined by , and the neural network learning is reduced to finding , both of which are the parameters of the probabilistic model.

The most natural choice here for probabilistic model estimation is the maximum log-likelihood estimation (MLE), i.e.,

(3) |

However, the MLE is intractable due to the integral over latent variable . We apply the classic variational Bayesian inference trick, which leads to the evidence lower bound (ELBO), i.e.,

(4) |

where . As shown in [Zellner88], the optimal solution of (4) in all possible distributions will be the posterior. With such a model, architecture learning can be recast as Bayesian inference.
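The claim that the ELBO is maximized by the posterior can be checked numerically on a toy model. The sketch below uses a single discrete latent variable with three states; the prior and likelihood values are made-up numbers, and the function names are hypothetical. It uses the generic textbook form of the bound, which is assumed (not confirmed) to match the structure of (4).

```python
import math


def log_marginal(prior, lik):
    """log p(y) = log sum_z p(z) p(y | z) for a discrete latent z."""
    return math.log(sum(p * l for p, l in zip(prior, lik)))


def elbo(q, prior, lik):
    """E_q[log p(y | z)] - KL(q || prior), skipping zero-probability states."""
    return sum(
        qz * (math.log(l) + math.log(p) - math.log(qz))
        for qz, p, l in zip(q, prior, lik)
        if qz > 0
    )


def posterior(prior, lik):
    """Exact posterior p(z | y), proportional to p(z) p(y | z)."""
    joint = [p * l for p, l in zip(prior, lik)]
    total = sum(joint)
    return [j / total for j in joint]


prior = [0.5, 0.3, 0.2]  # made-up prior over three latent states
lik = [0.1, 0.6, 0.3]    # made-up likelihood p(y | z) for one observation

gap_at_prior = log_marginal(prior, lik) - elbo(prior, prior, lik)
gap_at_posterior = log_marginal(prior, lik) - elbo(posterior(prior, lik), prior, lik)
```

Here `gap_at_posterior` is zero up to floating-point error while `gap_at_prior` is strictly positive, so the bound is tight only at the posterior.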

### 3.1 Bayesian Meta Architecture Learning

Based on the Bayesian view of architecture search, we can easily extend it to the meta-learning setting, where we have many tasks, i.e., . We are required to learn the neural network architectures and the corresponding parameters jointly while taking the task dependencies on the neural network structure into account.

We generalize the model (2) to handle multiple tasks as follows. For the -th task, we design the model following (2). Meanwhile, the hyperparameters, i.e., , are shared across all the tasks. In other words, the layers and architecture priors are shared between tasks. Then we have the MLE:

(5) |

Similarly, we exploit the ELBO. Due to the structures induced by sharing across the tasks, the posteriors for have special dependencies, i.e.,

(6) |

With the variational posterior distributions, and , introduced into the model, we can directly generate the architecture and its corresponding weights based on the posterior. In a sense, the posterior can be understood as the neural network predictive model.

## 4 Variational Inference by Optimization Embedding

The design of the parameterization of the posterior and is extremely important, especially in our case where we need to model the dependence between w.r.t. the *task distributions* and the *loss information*. Fortunately, we can bypass this problem by applying parameterized Coupled Variational Bayes (CVB), which generates the parameterization automatically through *optimization embedding* [DaiDaiHeLiuetal18].

Specifically, we assume the is Gaussian and the is a product of the categorical distribution. We approximate the categorical with the Gumbel-Softmax distribution [jang2017categorical, MadMniTeh16], which leads to a valid gradient so that the model will be fully differentiable. Therefore, we have

(7) |

Then, we can sample by following,

(8) | ||||

with and denotes the Gumbel distribution. We emphasize that we do not have any explicit form of the parameters and yet, which will be derived by optimization embedding automatically.
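For readers unfamiliar with the Gumbel-Softmax relaxation, here is a minimal standard-library sketch of drawing one relaxed sample. The function names, the `temperature` parameter name, and the example logits are assumptions for illustration; the exact parameterization in (8) may differ.

```python
import math
import random


def sample_gumbel(rng):
    """Draw G = -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = [0.7, 0.2, 0.1]                  # hypothetical categorical distribution
logits = [math.log(p) for p in probs]
sample = gumbel_softmax(logits, temperature=0.5, rng=rng)
```

Each sample lies on the probability simplex, and its largest entry falls on category `i` with probability `probs[i]`. Lower temperatures push samples closer to one-hot vectors, while higher temperatures give smoother samples that are easier to differentiate through.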

Plugging the formulation into the ELBO (6), we arrive at the objective

(9) |

With the ultimate objective (9) we follow the parameterized CVB derivation [DaiDaiHeLiuetal18] for embedding the optimization procedure for . Denoting the where is the stochastic approximation for , then, the stochastic gradient descent (SGD) iteratively updates as

(10) |

We can initialize which is shared across all the tasks. Alternative choices are also possible, e.g., with one more neural network, . We unfold steps of the iteration to form a neural network with output . Plugging the obtained to (8), we have the parameters and architecture as . In other words, we derive the concrete parameterization of and automatically by unfolding the optimization steps. Replacing the parameterization of and into , we have

(11) |

If we apply stochastic gradient ascent in the optimization (11) for updating , the instantiated algorithm from optimization embedding shares some similarities to the second-order MAML [MAML] and DARTS [DARTS] algorithms. Both of these algorithms unroll the stochastic gradient step. However, with the introduction of the Bayesian view, we can exploit the rich literature for the approximation of the distributions on discrete variables. More importantly, we can easily share both the architecture and weights across many tasks. Finally, this establishes the connection between the heuristic MAML algorithm and Bayesian inference, which can be of independent interest.
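To make "unrolling the stochastic gradient step" concrete, the sketch below differentiates an outer loss through a few inner gradient steps on a one-dimensional quadratic, where the chain rule can be written in closed form and checked against a numerical derivative. The quadratic losses and all names (`unroll`, `curvature`, `val_target`) are assumptions chosen for illustration; the sketch uses descent on a loss rather than ascent on the ELBO and is not the paper's objective.

```python
def unroll(w0, curvature, target, lr, steps):
    """Run `steps` gradient-descent steps on 0.5 * curvature * (w - target)^2."""
    w = w0
    for _ in range(steps):
        w = w - lr * curvature * (w - target)
    return w


def outer_loss(w0, curvature, target, val_target, lr, steps):
    """Outer objective evaluated at the adapted parameter w_T."""
    w_t = unroll(w0, curvature, target, lr, steps)
    return 0.5 * (w_t - val_target) ** 2


def outer_grad_unrolled(w0, curvature, target, val_target, lr, steps):
    """Chain rule through the unrolled steps.

    Each step maps (w - target) to (1 - lr * curvature) * (w - target),
    so d w_T / d w0 = (1 - lr * curvature) ** steps.
    """
    w_t = unroll(w0, curvature, target, lr, steps)
    dwt_dw0 = (1.0 - lr * curvature) ** steps
    return (w_t - val_target) * dwt_dw0


def outer_grad_numeric(w0, curvature, target, val_target, lr, steps, eps=1e-6):
    """Central-difference check of the unrolled gradient."""
    hi = outer_loss(w0 + eps, curvature, target, val_target, lr, steps)
    lo = outer_loss(w0 - eps, curvature, target, val_target, lr, steps)
    return (hi - lo) / (2.0 * eps)


args = dict(w0=0.3, curvature=2.0, target=1.0, val_target=1.2, lr=0.1, steps=4)
g_unrolled = outer_grad_unrolled(**args)
g_numeric = outer_grad_numeric(**args)
```

The two gradients agree to numerical precision. For real networks the same chain rule requires second-order terms, which is why unrolling many steps is costly in time and memory.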

Practical algorithm: In the method derivation, for the simplicity of exposition, we assumed there is only one cell shared across all the layers in every task, which may be overly restrictive. Following [transferable], we design two types of cells, named as a normal cell with and a reduction cell with , which appear alternately in the neural network. Please refer to Appendix B.3 for an illustration.

In practice, the multistep-unrolling of the gradient computation is expensive and memory inefficient. We can exploit the finite difference approximation for the gradient. This is similar to the iMAML [rajeswaran2019metalearning] and REPTILE [REPTILE] approximations of MAML. Moreover, we can further accelerate learning by exploiting parallel computation. Specifically, for each task, we start from a local copy of the current and apply stochastic gradient ascent based on the task-specific samples. Then, the shared can be updated by summarizing the task-specific parameters and architecture. The pseudo-code for the concrete algorithm for Bayesian meta-Architecture SEarch (BASE) can be found in Algorithm 1.
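The parallel update described above (local copy per task, a few task-specific gradient steps, then a shared update that summarizes the adapted copies) can be sketched with a REPTILE-style averaging rule on toy quadratic tasks. This is an illustrative stand-in and not Algorithm 1: the scalar parameter, the quadratic losses, the step sizes, and the names `adapt` and `meta_step` are all assumptions, and descent on a loss is used in place of ascent on the ELBO.

```python
def adapt(w_shared, target, inner_lr, steps):
    """Task-specific adaptation: SGD on 0.5 * (w - target)^2 from a local copy."""
    w = w_shared
    for _ in range(steps):
        grad = w - target
        w = w - inner_lr * grad
    return w


def meta_step(w_shared, targets, inner_lr=0.1, steps=5, meta_lr=0.5):
    """Move the shared parameter toward the average of the adapted copies."""
    adapted = [adapt(w_shared, c, inner_lr, steps) for c in targets]
    mean_delta = sum(w_t - w_shared for w_t in adapted) / len(adapted)
    return w_shared + meta_lr * mean_delta


# Three toy tasks whose optima are 1.0, 2.0, and 3.0.
w = 0.0
for _ in range(200):
    w = meta_step(w, targets=[1.0, 2.0, 3.0])
```

In this toy setting the shared parameter settles at 2.0, the point from which all three tasks can be reached with equally short adaptations. The per-task `adapt` calls are independent of each other, which is what makes the scheme easy to parallelize.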

With a meta-network trained with BASE over a series of tasks, for a new task, we can adapt an architecture by sampling from the posterior distribution of through (7) with calculated by (10) given new task which will be used to define the full-sized network. Illustrations of the network motifs used for the search network and the full networks can be found in Appendix A.2. More details about the architecture space can be found in Appendix A.

## 5 Experiments and Results

### 5.1 Experiment Setups

##### Downsampled Multi-task Datasets

To help the meta-network generalize to inputs with different sizes, we create three new multi-task datasets: Imagenet32 (Imagenet downsampled to 32x32), Imagenet64 (Imagenet downsampled to 64x64), and Imagenet224 (Imagenet downsampled to 224x224). Imagenet224 uses the most common inference size for the full Imagenet dataset in the mobile setting. Our tasks are defined by sampling 10 random classes from one of the resized Imagenet datasets, similar to the Mini-Imagenet dataset [vinyals2016matching] in few-shot learning. This allows us to sample tasks from a space of tasks.
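The task construction can be sketched as follows. The dictionary layout, the function name `sample_task`, and the use of integer class indices 0 to 999 (for the 1000 ImageNet classes) are assumptions for illustration; the released code may organize tasks differently.

```python
import random

RESOLUTIONS = (32, 64, 224)   # Imagenet32, Imagenet64, Imagenet224
NUM_IMAGENET_CLASSES = 1000   # class indices 0..999 (assumed labeling)


def sample_task(rng, resolution=None, num_classes=10):
    """Sample one 10-class task from one of the resized datasets.

    If `resolution` is None, the dataset is chosen uniformly at random.
    """
    if resolution is None:
        resolution = rng.choice(RESOLUTIONS)
    classes = sorted(rng.sample(range(NUM_IMAGENET_CLASSES), num_classes))
    return {"resolution": resolution, "classes": classes}


rng = random.Random(0)
task = sample_task(rng)
task_224 = sample_task(rng, resolution=224)
```

Because the classes are drawn without replacement from a large pool, the number of distinct tasks is very large, so the meta-network rarely sees the same task twice.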

##### Featurization Layers

To conduct architecture search on these multi-sized, multi-task datasets, the meta-network uses separate initial featurization layers (heads) for each image size. Using non-shared weights for the initial image featurization both allows the meta-network to learn a better prior and enables the use of different striding in the heads to compensate for the significant difference in image sizes. The Imagenet224 head strides the output to 1/8 of the original input size, while the 32x32 and 64x64 heads both stride to 1/2 of the original input size.
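The effect of the head strides on feature resolution is simple arithmetic, shown here as a small sketch. The dictionary and function names are hypothetical; the stride values are the ones stated above.

```python
# Total stride applied by each featurization head, keyed by input size.
HEAD_STRIDE = {32: 2, 64: 2, 224: 8}


def feature_resolution(image_size):
    """Spatial size of the feature map produced by the head."""
    return image_size // HEAD_STRIDE[image_size]


resolutions = {size: feature_resolution(size) for size in HEAD_STRIDE}
# resolutions == {32: 16, 64: 32, 224: 28}
```

After the heads, the three input sizes differ by at most a factor of two in feature resolution (16, 32, and 28), compared with a factor of seven at the input.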

### 5.2 Search Performance

We validated our meta-network by transferring the results of architectures optimized for CIFAR10, SVHN, and Imagenet224 to full-sized networks. Details of how we trained the full networks can be found in Appendix A.1. To derive the full-sized Imagenet architectures, we select a high-probability architecture from the posterior distribution of architectures given random 10-class Imagenet224 datasets by averaging the sampled architecture distributions for 8 random datasets. To derive the CIFAR10 and SVHN architectures, we adapted the network on the unseen datasets and selected the architecture with the highest probability of being chosen. The meta-network was trained for 130 epochs. At each epoch, we sampled and trained on a total of 24 tasks, sampling eight 10-class discrimination tasks each from Imagenet32, Imagenet64, and Imagenet224. All experiments were conducted with Nvidia 1080 Ti GPUs.
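One plausible reading of the selection step is sketched below: average the per-edge operation probabilities obtained from several adapted datasets, then pick the most probable operation on each edge. The nested-list layout, the function names, and the toy numbers are assumptions; the paper does not spell out the exact selection rule at this level of detail.

```python
def average_distributions(samples):
    """samples[d][e][j]: probability of operation j on edge e for dataset d."""
    n = len(samples)
    num_edges = len(samples[0])
    num_ops = len(samples[0][0])
    return [
        [sum(s[e][j] for s in samples) / n for j in range(num_ops)]
        for e in range(num_edges)
    ]


def most_probable_architecture(dist):
    """Index of the highest-probability operation on each edge."""
    return [max(range(len(p)), key=p.__getitem__) for p in dist]


# Two adapted datasets, two edges, three candidate operations (toy numbers).
samples = [
    [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]],
    [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]],
]
averaged = average_distributions(samples)
architecture = most_probable_architecture(averaged)
```

In this toy case the first dataset alone would pick operation 0 on the first edge, but the averaged distribution picks operation 1, which illustrates why averaging over several sampled datasets can give a more representative architecture.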

Table 1: Results on CIFAR10.

Architecture | Top-1 Test Error (%) | Parameters (M) | Search Time (GPU Days) |
---|---|---|---|
NASNet-A + cutout [transferable] | 2.65 | 3.3 | 1800 |
AmoebaNet-A + cutout [real2018regularized] | | 3.2 | 3150 |
AmoebaNet-B + cutout [real2018regularized] | | 2.8 | 3150 |
Hierarchical Evo [LiuSimVinFeretal17] | | 15.7 | 300 |
PNAS [PNAS] | | 3.2 | 225 |
DARTS (1st order bi-level) + cutout [DARTS] | | 3.3 | 1.5 |
DARTS (2nd order bi-level) + cutout [DARTS] | | 3.3 | 4 |
SNAS (single-level) + cutout [SNAS] | | 2.8 | 1.5 |
SMASH [SMASH] | 4.03 | 16 | 1.5 |
ENAS + cutout [ENAS] | 2.89 | 4.6 | 0.5 |
BASE (Multi-task Prior) | | 3.2 | 8 Meta |
BASE (Imagenet32 Tuned) | | 3.3 | 0.04 Adap / 8 Meta |
BASE (CIFAR10 Tuned) | 2.83 | 3.1 | 0.05 Adap / 8 Meta |

##### Performance on CIFAR10 Dataset

The results of our Meta Architecture Search on CIFAR10 can be found in Table 1. We compared a few variants of our method. BASE (Multi-task Prior) is the architecture derived from training on the multi-task Imagenet datasets only, without further fine-tuning. This model did not have access to any information on the CIFAR10 dataset and is used as a baseline comparison.

BASE (Imagenet32 Tuned) is the network derived from the multi-task prior fine-tuned on Imagenet32. We chose Imagenet32 since it has the same image dimensions as CIFAR10. It does slightly better than BASE (Multi-task Prior) on CIFAR10. We compare these networks to BASE (CIFAR10 Tuned), which is the network derived from the meta-network prior fine-tuned on CIFAR10. Not surprisingly, this network performs the best, as it has access to both the multi-task prior and the target dataset. One thing to note is that for BASE (Imagenet32 Tuned) and BASE (CIFAR10 Tuned), we only fine-tuned the meta-networks for 0.04 GPU days and 0.05 GPU days respectively. The adaptation time required is significantly less than that required for the initial training of the multi-task prior, as well as the search time required by the rest of the baseline NAS algorithms. With respect to the number of parameters, our models are comparable in size to the baseline models. Using adaptation from our meta-network prior, we can find high-performing models while using significantly less compute.

##### Performance on SVHN Dataset

The results of our Meta Architecture Search on SVHN are shown in Table 2. We used the same multi-task prior previously trained on the multi-scale Imagenet datasets and quickly adapted the meta-network to SVHN in less than an hour. We also trained the CIFAR10 specialized architecture found in DARTS [DARTS]. The adapted network architecture achieves the best performance in our experiments and has comparable performance to other work for the model size. This also validates the importance of task-specific specialization, since it significantly improved the network performance over both our multi-task prior and Imagenet32 tuned baselines.

Table 2: Results on SVHN.

Architecture | Top-1 Test Error (%) | Parameters (M) | Search Time (GPU Days) |
---|---|---|---|
WideResnet [zagoruyko2016wide] | 1.30 ± 0.03 | 11.7 | - |
MetaQNN [baker2016designing] | 2.24 | 9.8 | 100 |
DARTS (CIFAR10 Searched) | 2.09 | 3.3 | 4 |
BASE (Multi-task Prior) | 2.13 | 3.2 | 8 Meta |
BASE (Imagenet32 Tuned) | 2.07 | 3.3 | 0.04 Adap / 8 Meta |
BASE (SVHN Tuned) | 2.01 | 3.2 | 0.04 Adap / 8 Meta |

Table 3: Results on ImageNet.

Architecture | Top-1 Err (%) | Top-5 Err (%) | Params (M) | MACs (M) | Search Time (GPU Days) |
---|---|---|---|---|---|
NASNet-A [transferable] | 26.0 | 8.4 | 5.3 | 564 | 1800 |
NASNet-B [transferable] | 27.2 | 8.7 | 5.3 | 488 | 1800 |
NASNet-C [transferable] | 27.5 | 9.0 | 4.9 | 558 | 1800 |
AmoebaNet-A [real2018regularized] | 25.5 | 8.0 | 5.1 | 555 | 3150 |
AmoebaNet-B [real2018regularized] | 26.0 | 8.5 | 5.3 | 555 | 3150 |
AmoebaNet-C [real2018regularized] | 24.3 | 7.6 | 6.4 | 570 | 3150 |
PNAS [PNAS] | 25.8 | 8.1 | 5.1 | 588 | 225 |
DARTS [DARTS] | 26.9 | 9.0 | 4.9 | 595 | 4 |
SNAS [SNAS] | 27.3 | 9.2 | 4.3 | 522 | 1.5 |
BASE (Multi-task Prior) | | | 4.6 | 544 | 8 Meta |
BASE (Imagenet Tuned) | 25.7 | 8.1 | 4.9 | 559 | 0.04 Adap / 8 Meta |

##### Performance on ImageNet Dataset

The results of our Meta Architecture Search on Imagenet can be found in Table 3. We compare BASE (Multi-task Prior) with BASE (Imagenet Tuned), which is the multi-task prior tuned on 224x224 Imagenet. The performance of our Imagenet Tuned model exceeds that of the existing differentiable NAS approaches DARTS [DARTS] and SNAS [SNAS] on both top-1 and top-5 error. In terms of the number of parameters and multiply-accumulate operations (MACs), our found models are comparable to state-of-the-art networks. Considering running time, while the multi-task pretraining took 8 GPU days, we only needed 0.04 GPU days to adapt to full-sized Imagenet. In Figure 2, we compare our models with other NAS approaches with respect to top-1 error and search time. For fairness, we include the time required to learn the architecture prior, and we still achieve significant accuracy gains for our computational cost.

Figure 3: PCA visualization of the posterior distributions. (a) PCA of weights. (b) PCA of architecture.

## 6 Empirical Analysis

In this section, we analyze the task-dependent parameter distributions derived from meta-network adaptation and demonstrate the abilities of the proposed method for fast adaptation as well as architecture search for few-shot learning.

### 6.1 Visualization of Posterior Distributions

Figure 3 shows the PCA visualization of the posterior distributions of the convolutional weights and the architecture parameters. The CIFAR10 optimized distributions were derived by quickly adapting the pretrained meta-network for the CIFAR10 dataset, while the other distributions were adapted for tasks sampled from the corresponding multi-task datasets. We see that the distribution of weights is more concentrated for CIFAR10 than for other datasets, likely since it corresponds to a single task instead of a task distribution. It also seems that the Imagenet224 and Imagenet64 posterior weight and architecture distributions are close to each other. This is likely because they are the closest to each other in feature resolution after being strided down by the feature heads, to 28x28 (224/8) and 32x32 (64/2) respectively. Considering the visualization of the architecture parameter distributions, it is notable that while the closeness of clusters seems to indicate a similarity between Imagenet32 and CIFAR10, CIFAR10 still has a clearly distinct cluster. This seems to support that even though the meta-network prior was never trained on CIFAR10, an optimized architecture posterior distribution can be quickly derived for CIFAR10.

### 6.2 Fast Adaptations

In this section, we explore the direct transfer of both architecture and convolutional weights from the meta-network by comparing the test accuracy we get on CIFAR10 with meta-networks adapted for six epochs. The results are shown in Figure 4. We compare against the baseline accuracy of the DARTS [DARTS] super-network trained from scratch on CIFAR10. Our meta-network adapted normally from a multi-task prior, achieves an accuracy of around after only one epoch. We also experimented with freezing the architecture parameters, which greatly degraded the performance. This shows the importance of co-optimizing both the weight and architecture parameters.

### 6.3 Few-Shot Learning

In order to show the generalizability of our algorithm, we used it to conduct an architecture search for the few-shot learning problem. Since few-shot learning targets adaptation from very few samples, we can avoid using the finite difference approximation and directly use the optimization embedding technique in these experiments. These experiments were run on a commonly used benchmark for few-shot learning, the Mini-Imagenet dataset as proposed in [vinyals2016matching], specifically on the 5-way, 5-shot classification problem.
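For readers new to the few-shot setting, the sketch below builds one 5-way, 5-shot episode from a class-to-images mapping. The data layout, the function name `sample_episode`, and the query-set size of 15 images per class are assumptions for illustration and are not taken from the paper.

```python
import random


def sample_episode(class_to_items, rng, n_way=5, k_shot=5, n_query=15):
    """Build one N-way, K-shot episode.

    class_to_items maps a class id to a list of examples. Returns
    (support, query), each a list of (example, episode_label) pairs,
    where episode labels run from 0 to n_way - 1.
    """
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        items = rng.sample(class_to_items[cls], k_shot + n_query)
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query


# Placeholder data: 64 classes with 30 image names each.
data = {c: ["img_%d_%d" % (c, i) for i in range(30)] for c in range(64)}
support, query = sample_episode(data, random.Random(0))
```

The model adapts on the 25 support examples and is evaluated on the query examples of the same five classes, so every episode is a small classification task of its own.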

Table 4: Results on 5-way, 5-shot Mini-Imagenet.

Architecture | 5-shot Test Accuracy | Params (M) | Few-shot Algorithm |
---|---|---|---|
MAML [MAML] | 63.11 ± 0.92% | 0.1 | MAML |
REPTILE [REPTILE] | 65.99 ± 0.58% | 0.1 | REPTILE |
DARTS Architecture | | 1.6 | MAML |
BASE (Softmax) | | 1.2 | MAML |
BASE (Gumbel) | 66.2 ± 0.7% | 1.2 | MAML |

The full-sized network is trained on the few-shot learning problem using second-order MAML [MAML]. Search and full training were run twice for each method. A variation of our algorithm was also run using a simple softmax approximation of the categorical distribution, as proposed in [DARTS], to test the effect of the Gumbel-Softmax architecture parameterization. The full results are shown in Table 4. Our searched architectures achieved significantly better average testing accuracies than our baselines on five-shot learning on the Mini-Imagenet dataset in the same architecture space. The CIFAR10 optimized DARTS architecture also achieved results that were significantly better than those of the original MAML baseline [MAML], showing some transferability between CIFAR10 and meta-learning on Mini-Imagenet. That architecture, however, also had considerably more parameters than our found architectures and trained significantly slower. The Gumbel-Softmax meta-network parameterization also found better architectures than the simple softmax parameterization.

## 7 Conclusion

In this work, we present a Bayesian Meta Architecture SEarch (BASE) algorithm that can learn the optimal neural network architectures for an entire task distribution simultaneously. The algorithm, derived from a novel Bayesian view of architecture search, utilizes the optimization embedding technique [DaiDaiHeLiuetal18] to automatically incorporate the task information into the parameterization of the posterior. We demonstrate the algorithm by training a meta-network simultaneously on a distribution of tasks derived from Imagenet and achieve state-of-the-art results given our search time on CIFAR10, SVHN, and Imagenet with quickly adapted task-specific architectures. This work paves the way for future extensions of Meta Architecture Search, such as direct fast adaptation to derive both optimal task-specific architectures and optimal weights, and demonstrates the great efficiency gains possible by conducting architecture search over task distributions.

#### Acknowledgments

We would like to thank the anonymous reviewers for their comments and suggestions. Part of this work was done while Bo Dai and Albert Shaw were at Georgia Tech. Le Song was supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, SaTC-1704701, and CAREER IIS-1350983.

## References


## Appendix A Architecture Space Details

For comparability in architectures, the particular search space used is very similar to that used in [DARTS] and includes the same operation space: , , depth-wise separable convolutions, and dilated depth-wise separable convolutions, max pooling, average pooling, a followed by a convolution, skip connections, and no connection. In our search, each cell is made up of a total of six nodes with 2 input nodes. The input to each cell is the output from the previous 2 cells. The output for each cell is the concatenated output from all 4 non-input nodes in the cell. Following the same methods as [DARTS, transferable], non-dilated depth-wise separable convolutions were applied twice, all depth-wise separable convolutions did not have batch-norms between the grouped and 1x1 convolutions, convolutions had RELUs and batch-norms applied in ReLU-Conv-BN order, and all operations were padded as necessary to preserve spatial resolution as to only be reduced by the reducing layers whose first operations were applied with a stride of 2.
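The cell topology described above implies a fixed number of candidate connections. The sketch below counts them under the assumption, consistent with the formulation in Section 3 and with DARTS-style cells, that each non-input node may receive input from every earlier node in the cell. The function names and the example channel count are hypothetical.

```python
def count_candidate_edges(num_nodes=6, num_inputs=2):
    """Number of (earlier node -> later node) connections in one cell.

    Nodes 0..num_inputs-1 are the cell inputs. A non-input node with
    index k may read from all k nodes that precede it.
    """
    return sum(k for k in range(num_inputs, num_nodes))


def cell_output_channels(channels_per_node, num_nodes=6, num_inputs=2):
    """Channels after concatenating the outputs of all non-input nodes."""
    return channels_per_node * (num_nodes - num_inputs)


edges = count_candidate_edges()       # 2 + 3 + 4 + 5 = 14
out_channels = cell_output_channels(16)  # 4 non-input nodes * 16 = 64
```

Each candidate edge carries its own choice among the operations listed above, so the architecture distribution for one cell type is a product of 14 categorical distributions under this assumption.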

### A.1 CIFAR10 and ImageNet Training Details

##### CIFAR10

The architecture is transferred to a network with 20 cells following the motif shown in Appendix A.2. The network was trained for 600 epochs with cutout augmentation and a batch size of 96. We follow the same training strategy as [DARTS], with cutout, a drop-path probability of 0.2, and auxiliary towers with weight 0.4.

##### SVHN

The architecture is transferred to a network with 20 cells following the motif shown in Appendix A.2. The network was trained for 160 epochs with cutout augmentation. We used a batch size of 96, a drop-path probability of 0.2, and auxiliary towers with weight 0.4.

##### ImageNet

The architecture is transferred to a network with 14 cells following the motif shown in Appendix A.2. We train and evaluate in the mobile setting with input images of size 224x224. We train with a batch size of 256 for 375 epochs. We use the SGDR [DBLP:journals/corr/LoshchilovH16a] learning rate schedule with and . We optimize with SGD with an initial learning rate of 0.1 decayed by a factor of 0.97 each epoch. We use a weight decay of . For the remaining parameters we follow the same training strategy as [transferable].

### A.2 Motifs for Single-Task Scalable Architectures

Figure panels: Motif for the Search Network; Motif for the CIFAR10 Full Network; Motif for the ImageNet Full Network.

These are the network motifs used in the experiments for search over single-task networks. Our search space has two unique cell architectures, "Normal Conv" and "Reduction" Cells.

### A.3 Sample ImageNet Adapted Cell Designs

Cell Design for normal cell

Cell Design for reduction cell

## Appendix B Few Shot Learning

### B.1 Motifs for Scalable Architectures

Motif for Search Network

Motif for Full Network.

These are the network motifs used in the experiments for search over few-shot learning. Our search space has two unique cell architectures, "Normal" and "Reduction" Cells. The Meta Architecture Search was run with the "Search Network", and then for evaluation, the architectures were transferred to the full network.

### B.2 High-Level Diagrams of the Meta Architecture Search Method

Figure panels: (a) Meta Architecture Search. (b) One-Shot Architecture Adaptation.

### B.3 Diagram of Cell Space Concept

The architecture parameters are shared between all architecture "normal cells" and describe the architecture distribution within the cells. are shared between all reduce cells. All weight parameters are not unique to each layer.

### B.4 Few-shot Training Details

In our experiments on the Mini-Imagenet dataset, only the 64 training classes were used during training. The 12 validation classes were ignored, and evaluation was conducted on the 24 testing classes. Search was run for iterations. For each iteration, the meta-network was updated with the combined gradients from randomly sampled tasks. For each task steps of inner optimization were run. For the full training, all network architectures were trained with the same setting on the -shot learning problem using the second-order MAML algorithm [MAML]. The full training was run for iterations. Similarly, for each iteration, the network was again updated with the combined gradients from randomly sampled tasks, but each task was optimized with steps of inner optimization for second-order MAML.

### B.5 Sample Top Found Cell Architectures from Few-shot BASE Search

Cell Design for sample normal cell

Cell Design for sample reduction cell