Fast Task-Aware Architecture Inference
Neural architecture search has been shown to hold great promise towards the automation of deep learning. However, in spite of its potential, neural architecture search remains quite costly. To this end, we propose a novel gradient-based framework for efficient architecture search that shares information across several tasks. We start by training many model architectures on several related (training) tasks. When a new unseen task is presented, the framework performs architecture inference in order to quickly identify a good candidate architecture, before any model is trained on the new task. At the core of our framework lies a deep value network that predicts the performance of input architectures on a task by utilizing task meta-features and the previous model training experiments performed on related tasks. We adopt a continuous parametrization of the model architecture, which allows for efficient gradient-based optimization. Given a new task, an effective architecture is quickly identified by maximizing the estimated performance with respect to the model architecture parameters using simple gradient ascent. We emphasize that our goal is to achieve reasonable performance at the lowest possible cost. We provide experimental results showing the effectiveness of the framework despite its very low computational cost.
Designing high-performing neural networks is a time-consuming task that typically requires substantial human effort. In the past few years, neural architecture search and algorithmic solutions to model building have received growing research interest, as they can automate the manual process of model design. Although these methods produce impressive results that compete with human-designed models, neural architecture search requires a large amount of computational resources for each new task. This cost becomes a major limitation in setups that impose strict resource constraints on model design, and recent methods have therefore focused on reducing it (see, e.g., [13, 15]). For example, in cloud machine learning services, the client uploads a new data set and an effective model should ideally be auto-designed in minutes (or seconds). In such settings, architecture search has to be very efficient, which is the main motivation for this work.
At the same time, applying automated model-building methods independently to each new task requires many models to be trained, as well as learning how to generate high-performing models from scratch. Such an approach requires a formidable amount of computational resources and is far from scalable. Human experts, on the other hand, can design state-of-the-art models using prior knowledge about how existing architectures perform across different data sets. Similar to human experts, we aim to learn across several task data sets and leverage prior knowledge.
In this paper, we present a framework that amortizes the cost of architecture search across several tasks and remains effective thanks to knowledge transfer between tasks. Architecture search aims at learning a mapping from a data set to a high-performing architecture. We propose to formulate architecture search as a structured prediction problem and build on top of previously proposed deep value networks [8]. Given a candidate model architecture and meta-features about the task, a deep value network provides a differentiable mapping whose output estimates the performance of the input architecture on the task data set. We also adopt a continuous parametrization of the model architecture, which allows for efficient gradient-based optimization of the estimated performance. Moreover, in contrast to previous work (e.g., [10]) that uses pre-computed meta-features for the task, we present a solution for learning the meta-features directly from the raw task samples as part of the deep value network weights.
The framework consists of an offline training phase and an online inference phase (see Fig. 1 for a conceptual illustration). Assuming that we have trained several model architectures on several related (training) tasks, when a new unseen task is presented, the framework performs fast architecture optimization in order to quickly identify a good candidate architecture, before any model training is performed. In particular, the best candidate architecture is efficiently identified by maximizing the estimated performance with respect to the model architecture parameters with simple gradient ascent. In summary, the paper contributions are the following:
Efficient architecture search using gradient-based architecture optimization.
Ability to learn the task meta-features directly from the raw task data samples.
Cross learning across many tasks (by leveraging information about how various architectures perform across many tasks data sets).
We provide experimental results showing the potential of the proposed framework. The rest of the paper is organized as follows. Section 2 formally defines the problem we are interested in. Next, in Section 3, we introduce the proposed framework and present it in detail. Section 4 reviews related methods from the literature. We present experimental results in Section 5 and the conclusions and future work in Section 6.
2 Problem formulation
We are interested in task-aware, efficient neural architecture search. Given a new (unseen) task data set, we would like to quickly identify an effective model architecture before any model is trained. We want to learn across data sets in order to amortize the cost of neural architecture search. In particular, we want to learn collectively from all the model training experiments and leverage this wealth of information. Instead of performing architecture search independently for each new data set, we would like to transfer the knowledge obtained from past training experiments on related tasks. In summary, the proposed framework should have the following properties:
High scalability in terms of computing resources.
Ability to scale and learn collectively across task data sets.
Ability to propose a good architecture for a new related task without training any model.
In the next section we propose a general framework that has these desired properties.
3 Proposed framework
We want to automatically discover the model architecture that achieves the best quality for a given data set. Essentially, we seek to learn a mapping from an input data set to a high-performing model architecture. We propose to formalize the architecture search problem as a structured output prediction problem. The key intuition is that learning to criticize candidate architectures is easier than learning to directly predict the optimal architecture. In particular, the proposed framework builds on top of the Deep Value Networks (DVNs) [8] that were originally developed in the context of structured prediction applied to image segmentation. In our context, a deep value network acts as a meta-model that tunes the architecture of a child model. We consider child model families parametrized by a vector θ, assuming for now that θ consists of continuous variables.
A deep value network in our context takes as input: (i) descriptive meta-features m derived from a certain task data set and (ii) the child model architecture parameters θ, and predicts how well the architecture performs on the task data set described by m. The performance metric can take various forms (e.g., accuracy, AUC), but the framework is agnostic to it. In this paper, we use the validation accuracy as the performance metric. The deep value network is shown conceptually in Fig. 2. When training the value network, our hope is that it learns which types of child model architectures work well on certain types of data. This mimics the human expert during manual architecture design: human experts rely on intuition and prior knowledge when developing new candidate architectures. Here, our hope is that such an ‘intuition’ is encoded in the weights of the value network and that it is generally applicable and transferable across data sets.
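To make the idea concrete, the following is a minimal NumPy sketch of such a value network: a single hidden layer acting on the concatenation of the task meta-features m and the architecture parameters θ. The network size, names, and initialization are our own illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def init_value_net(meta_dim, arch_dim, hidden=16, seed=0):
    """Random weights for a toy one-hidden-layer value network v(m, theta)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (meta_dim + arch_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, hidden),
        "b2": 0.0,
    }

def value_net(params, m, theta):
    """Estimated performance of architecture theta on the task with meta-features m."""
    x = np.concatenate([m, theta])  # the two inputs of the value network
    h = np.tanh(x @ params["W1"] + params["b1"])
    return float(h @ params["W2"] + params["b2"])
```

A forward pass through this mapping is the "evaluate a candidate architecture" use of the value network described below; its differentiability with respect to theta is what enables gradient-based inference.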
In the following sections, we provide more details about the proposed framework. We discuss the meta-features of a task in Section 3.1. The framework has two phases: an offline training phase and an online inference phase, detailed below in Sections 3.2 and 3.3 respectively. Section 3.4 discusses the child model architecture parameters θ.
3.1 The meta-features of a task
The meta-features of a task describe its characteristics and statistics, and they are typically derived from the task data set itself. The meta-features may include the following: total number of samples, number of classes and their distribution, label entropy, total number of features and statistics about them (min, max, median), mutual information of the features with the label, and the task id. The latter can be used to learn an embedding for each data set; similar data sets should get similar embeddings (see, e.g., [20]).
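As an illustration, a few of the meta-features listed above can be computed as follows (the selection and ordering of features are ours, purely for illustration):

```python
import numpy as np

def meta_features(X, y):
    """A few hand-crafted meta-features of a (features X, labels y) task data set:
    sample count, class count, label entropy, feature count, and min/max/median."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.array([
        float(len(y)),                  # total number of samples
        float(len(counts)),             # number of classes
        float(-(p * np.log(p)).sum()),  # label entropy
        float(X.shape[1]),              # total number of features
        float(X.min()),                 # feature statistics
        float(X.max()),
        float(np.median(X)),
    ])
```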
Learning the meta-features.
On top of using pre-computed meta-features such as those listed above, one can also learn them directly from the task data set. In this case, the data set (or a large fraction of it) is given as input to the deep value network, and a task embedding is learned directly from the raw task data set samples. Note that we use both the features and the labels of the task data set samples when learning the task embedding. This task embedding plays the role of the meta-features and is learned jointly with the rest of the weights of the deep value network.
The task embedding should be invariant to the order of the samples in the task data set. According to [21], such a permutation-invariant function can be decomposed in the form ρ(∑_i φ(x_i)) for suitable transformations φ and ρ. These transformations are typically implemented by a few layers (e.g., fully connected layers, non-linearities, etc.). The main idea is to transform each sample from the task data set using φ and then aggregate the transformed samples, so that the task embedding becomes permutation-invariant before it is fed into ρ. This process is shown conceptually in Fig. 2, where the deep value network essentially consists of φ and ρ that are jointly learned, i.e., the embedding of a task data set D is ρ(∑_{(x, y) ∈ D} φ(x, y)).
We assume here that the data samples of different tasks are expressed in a common feature space that can be ingested by φ.
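A minimal sketch of this permutation-invariant embedding, with φ and ρ taken to be single tanh layers for brevity (in the paper both features and labels are fed to φ; the sketch pools plain feature vectors):

```python
import numpy as np

def task_embedding(samples, W_phi, W_rho):
    """Permutation-invariant task embedding rho(sum_i phi(x_i)), with phi and rho
    implemented here as single tanh layers."""
    phi = lambda x: np.tanh(x @ W_phi)                   # per-sample transformation
    pooled = np.sum([phi(x) for x in samples], axis=0)   # order-independent pooling
    return np.tanh(pooled @ W_rho)                       # post-aggregation transformation
```

Because the sum ignores sample order, shuffling the task data set leaves the embedding unchanged, which is exactly the invariance required above.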
3.2 Off-line training phase
Assume we have T tasks with corresponding data sets denoted:

D_t = {(x_i^t, y_i^t)}, i = 1, ..., n_t, t = 1, ..., T,

where n_t is the number of data samples in the t-th task and (x_i^t, y_i^t) is the i-th sample and its corresponding label in the t-th task data set.
For each task data set, we generate child model architectures, train them, and collect the model performances on the validation set in a life-long database of model training experiments (see Fig. 1). This database is used to generate the training set for the deep value network, which consists of triplets of the form:

(m_t, θ, v),

where the value v holds the child model performance obtained when training with the model architecture θ on the task data set with meta-features m_t. In this paper, the model performance metric used is the validation accuracy. As more tasks are ingested into the database and more models get trained, the deep value network improves its predictions. In the Appendix we provide experimental results demonstrating this behaviour.
Once the child model training experiments have been collected in the database, we can start training the deep value network. Algorithm 1 shows the main steps of this offline training phase.
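As a rough illustration of this offline phase, the sketch below fits a tiny value network to stored (meta-features, architecture, performance) triplets with an L2 regression loss. The network size, optimizer settings, and function names are our own illustrative choices rather than the paper's actual setup.

```python
import numpy as np

def train_value_net(triplets, meta_dim, arch_dim, hidden=16, lr=0.1, epochs=500, seed=0):
    """Fit a small one-hidden-layer value network to (meta-features, architecture,
    performance) triplets using an L2 regression loss and full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (meta_dim + arch_dim, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, hidden)
    b2 = 0.0
    X = np.stack([np.concatenate([m, t]) for m, t, _ in triplets])
    y = np.array([v for _, _, v in triplets], dtype=float)
    n = len(X)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)        # (n, hidden) activations
        err = H @ W2 + b2 - y           # residuals of the L2 loss
        gW2 = H.T @ err / n
        gb2 = err.mean()
        dH = np.outer(err, W2) * (1.0 - H ** 2)  # backprop through tanh
        gW1 = X.T @ dH / n
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def predict_performance(params, m, theta):
    """Estimated performance of architecture theta on a task with meta-features m."""
    W1, b1, W2, b2 = params
    return float(np.tanh(np.concatenate([m, theta]) @ W1 + b1) @ W2 + b2)
```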
3.3 Online inference phase
After training the deep value network v(m, θ), its weights are kept fixed. At inference time, given a new task data set, we first extract its meta-features m. At this point we can employ the value network in two ways. First, if we have a candidate architecture, we can evaluate it by simply doing a forward pass on the deep value network to get the estimated child model performance. Alternatively, we can compute the gradient of v with respect to θ and perform simple gradient-based optimization to find a candidate architecture that maximizes the estimated child model performance.
In practice, we noticed that gradient ascent is sensitive to initialization. Hence, we run the process several times with different initial guesses and at the end pick the one that resulted in the maximum estimated performance. Note also that in order to perform gradient-based inference, we need to relax the model architecture parameters to live in a continuous space; Section 3.4 below discusses this parametrization in detail. The main steps of the online phase are shown in Algorithm 2. This online process is also illustrated conceptually in Fig. 1.
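The online phase above can be sketched as gradient ascent on the estimated performance from several random starting points, keeping the best candidate. Here value_fn stands in for the trained value network with the task meta-features held fixed, and the finite-difference gradient is only a stand-in for the analytic gradient used in practice; all names and defaults are illustrative.

```python
import numpy as np

def finite_diff_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return g

def infer_architecture(value_fn, dim, n_starts=10, n_iters=200, lr=0.1, seed=0):
    """Maximize the estimated performance over architecture parameters by gradient
    ascent from several random starting points; return the best candidate found."""
    rng = np.random.default_rng(seed)
    best_theta, best_val = None, -np.inf
    for _ in range(n_starts):
        theta = rng.uniform(-1.0, 1.0, dim)  # random initial guess
        for _ in range(n_iters):
            theta = theta + lr * finite_diff_grad(value_fn, theta)
        val = value_fn(theta)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```

The multiple restarts mirror the initialization sensitivity noted above: each start may climb to a different local maximum, and only the best is kept.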
3.4 Architecture parametrization
We discuss in this section the parametrization of the child model architectures. Previous work [13, 18] has shown that relaxing the parametrization from discrete to continuous space allows for efficient gradient-based optimization schemes while still providing competitive model performances. Our approach goes along the lines of this previous work. The main idea is that in order to make the architecture space continuous we move away from the categorical nature of design choices to a parametrized softmax over all possible choices. We provide below a few examples where this is applied.
Continuous parametrization for one layer
Assume that we have implemented a basis set consisting of J base layers b_1, ..., b_J, corresponding to different sizes and different activation functions. We associate a weight β_j with each base layer and define a new parametrized layer ℓ as follows:

ℓ(x) = ∑_{j=1}^{J} softmax(β)_j b_j(x). (4)
The softmax values allow the final parametrized layer to ‘morph’ from one size to another and/or from one activation function to another. We use zero padding whenever needed to resolve the dimension mismatch among base layers of different sizes.
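A minimal sketch of such a parametrized layer, with the base layers given as plain callables and zero padding to the widest output (names and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(beta):
    e = np.exp(beta - np.max(beta))  # shift for numerical stability
    return e / e.sum()

def parametrized_layer(x, base_layers, beta, out_dim):
    """Softmax-weighted mixture of base layers; narrower base-layer outputs are
    zero-padded to out_dim so that they can be summed."""
    weights = softmax(beta)
    out = np.zeros(out_dim)
    for w_j, layer in zip(weights, base_layers):
        y = layer(x)
        out[: y.size] += w_j * y  # zero padding resolves the size mismatch
    return out
```

Driving one beta component to a large value makes the mixture collapse onto the corresponding base layer, which is how the layer ‘morphs’ between sizes and activations.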
Continuous parametrization for a child network
Leveraging the continuous parametrization for one layer introduced above, we can put several parametrized layers together. We attach a superscript to the layer parameters to denote the layer they belong to, i.e., β_j^(k) is the parameter that multiplies the output of the j-th base layer in the k-th parametrized layer of the final network. We also add the ability for each layer to be enabled or disabled independently from the other layers. For this, we add extra parameters that control the presence or absence of each layer. This is shown conceptually in Fig. 3 below.
Putting everything together, we consider child models that are standard feedforward neural networks (FFNNs) composed of an embedding module followed by several parametrized layers and a final softmax classification layer. The reason for using an embedding module is that it speeds up the training of the child models and improves their quality, especially when the training set is small. The embedding module is soft-selected from an input set of pre-trained embedding modules (available via the TensorFlow Hub service, https://www.tensorflow.org/hub; see the Appendix for more details) using the same softmax trick as in Eq. (4), where we denote by γ the corresponding softmax parameters. After this relaxation, architecture search reduces to learning the continuous variables introduced above (the β's, the layer-presence parameters, and γ); we refer to this collection θ of continuous variables as the encoding of the model architecture. Finally, we would like to emphasize that this parametrization is just one example among many possible options; any parametrization will work with the proposed framework as long as it is continuous.
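One possible soft gating for the layer-presence parameters is sketched below; treating a disabled layer as an identity pass-through is our assumption, since the text does not specify how an absent layer is realized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_forward(x, layers, presence):
    """Stack of (already parametrized) layers, each softly enabled or disabled by a
    sigmoid of its presence parameter; a disabled layer acts as an identity
    pass-through (our assumption)."""
    h = x
    for layer, p in zip(layers, presence):
        gate = sigmoid(p)
        h = gate * layer(h) + (1.0 - gate) * h
    return h
```

Because the gates are sigmoids of continuous parameters, the presence or absence of each layer remains differentiable, exactly as required for gradient-based inference.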
4 Related work
Automated model building is an important and challenging research problem, and several related methods have been proposed in the past few years. In general, previous works can be broadly categorized into the following classes:
Evolutionary methods such as [17] form a population of model architectures. The population evolves over time by picking individuals and mutating them (e.g., inserting a new layer). The quality of the population improves over time as individuals with poor performance are removed.
The proposed framework belongs to the category of performance prediction methods. However, our deep value network is task-aware: it takes as input not only the architecture but also meta-features about the task, with the extra ability of learning them directly from the raw task samples. Hence, the proposed framework in its current form is novel (to the best of our knowledge). However, it shares connections and similarities with existing works that we outline below. The previously proposed SMAC method [9] for general algorithm configuration also uses a history of past configuration experiments as well as descriptive features for the problem instances. However, this method uses an expensive Bayesian optimization process, as opposed to the efficient gradient-based architecture search that our framework enables (implied by the structured prediction formulation).
The TAPAS system proposed in [10] also uses a history of past configuration experiments stored in a database of experiments. The paper proposes a performance predictor that takes as input the difficulty of the data set as well as a candidate network architecture. However, TAPAS uses only pre-computed meta-features, and its architecture parametrization is not differentiable.
The work in [20] proposes multi-task training of RL-based architecture search methods. For each task, it learns a task embedding that captures task similarity, and the task embedding is provided as input to the controller at each time step. In contrast to our work, where the task embedding is derived directly from the data samples of the task, the task embedding in [20] is derived from the task id. Moreover, this method still requires some child model training and evaluation on the test task, whereas our method requires no child model training.
5 Experimental results

Table 1: Main characteristics of the data sets.

Data set                | Train examples | Val. examples | Test examples | Classes | Reference
US Economic Performance | 3961           | 495           | 496           | 2       | crowdflower.com
The framework has been implemented in TensorFlow [1] (we plan to make the code publicly available). For the experimental results, we use publicly available NLP data sets whose main characteristics are shown in Table 1. We have performed several leave-one-out experiments, where each task in our set is considered to be a test task and the rest of the tasks are used as the training tasks. For each such leave-one-out experiment, we train a DVN, study its predictive performance, and use it for fast architecture inference. More details are provided below.
The child models have been implemented using the parametrization discussed in Section 3.4. Each parametrized layer combines base layers of six distinct sizes, each paired with two distinct activation functions (relu and tanh); hence a single parametrized layer is composed of twelve base layers, and each child model has seven such parametrized layers. The child models have been trained using the Adam optimizer [11] with a fixed learning rate for 20 training epochs.
Deep value network
The value network was trained on the child model training experiments stored in the database, which was populated with about 500 child model architectures per task (generated as random one-hot architecture encodings). We used a simple value network consisting of two fully connected layers of size 50 each for the task meta-features tower (φ in Fig. 2) and two fully connected layers of sizes 50 and 10 for the tower that produces the final prediction (ρ in Fig. 2). The value network used a standard L2 loss for regression and was trained using stochastic gradient descent with momentum [16] (using 0.5 as the momentum parameter) and a constant learning rate. We set kOuterIters to 1 and kInnerIters to 2 in Algorithm 1. When training the value network, we normalized the child performances using the mean and standard deviation of the population of child performances for each task. Each task has its own level of difficulty, and we noticed that this normalization step factors out the difficulty of the task and improves the performance of the value network.
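The per-task normalization step can be sketched as a z-score within each task (function names are ours):

```python
import numpy as np

def normalize_per_task(performances, task_ids):
    """Z-score the child model performances within each task, so the value network
    regresses task-difficulty-adjusted quality rather than raw accuracy."""
    performances = np.asarray(performances, dtype=float)
    task_ids = np.asarray(task_ids)
    out = np.empty_like(performances)
    for t in np.unique(task_ids):
        mask = task_ids == t
        mu = performances[mask].mean()
        sd = performances[mask].std()
        out[mask] = (performances[mask] - mu) / (sd if sd > 0 else 1.0)
    return out
```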
5.2 Predicting the model performance
Table 2: Spearman's rank correlation between predicted and actual child model performances, per task (mean ± std over ten repetitions).

Task name               | Without meta-features | With meta-features
airline                 | 0.8003 ± 0.0180       | 0.8260 ± 0.0125
emotion                 | 0.8269 ± 0.0138       | 0.8523 ± 0.0088
global warming          | 0.8072 ± 0.0116       | 0.8179 ± 0.0076
corporate messaging     | 0.8090 ± 0.0067       | 0.8527 ± 0.0076
disasters               | 0.8066 ± 0.0053       | 0.7933 ± 0.0121
political message       | 0.4915 ± 0.0114       | 0.5078 ± 0.0091
political bias          | 0.5408 ± 0.0138       | 0.5210 ± 0.0111
progressive opinion     | 0.8164 ± 0.0130       | 0.8338 ± 0.0063
progressive stance      | 0.7244 ± 0.0212       | 0.7883 ± 0.0200
us economic performance | 0.3051 ± 0.0103       | 0.2851 ± 0.0114
Table 3: R2 metric of the value network's predictions, per task (mean ± std over ten repetitions).

Task name               | Without meta-features | With meta-features
airline                 | 0.7709 ± 0.0143       | 0.8294 ± 0.0174
emotion                 | 0.7570 ± 0.0164       | 0.8011 ± 0.0116
global warming          | 0.7002 ± 0.0102       | 0.7403 ± 0.0138
corporate messaging     | 0.7218 ± 0.0102       | 0.7746 ± 0.0099
disasters               | 0.7805 ± 0.0120       | 0.8039 ± 0.0143
political message       | 0.7345 ± 0.0124       | 0.7451 ± 0.0138
political bias          | 0.1718 ± 0.0148       | 0.1382 ± 0.0148
progressive opinion     | 0.4473 ± 0.0060       | 0.4720 ± 0.0092
progressive stance      | 0.4189 ± 0.0112       | 0.4614 ± 0.0138
us economic performance | 0.6886 ± 0.0164       | 0.6970 ± 0.0146
We study the predictive performance of the value network in each of the leave-one-out experiments. In particular, given the predicted performances and their corresponding actual performances, we quantify the predictive performance in terms of Spearman's rank correlation coefficient as well as the standard R2 metric for regression. In order to get more accurate results, we repeat this process ten times and report the statistics of the obtained performances. Table 2 shows the obtained Spearman's rank correlations for each task and Table 3 shows the corresponding R2 values. Notice that in most cases the Spearman's rank correlations lie around 0.8, which is rather satisfactory for a method that does not train any child models on the test task.
We have also studied experimentally the effect of the meta-features and report the predictive performances with and without meta-features. The results in both tables suggest that the meta-features are helpful, as expected, since both metrics increase. The meta-features provide task-specific information to the value network that helps towards estimating the relative performance of various architectures for the task at hand.
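For reference, the two reported metrics can be computed as follows (a simple sketch; the Spearman version below does not average tied ranks):

```python
import numpy as np

def spearman(pred, actual):
    """Spearman's rank correlation: Pearson correlation of the rank vectors.
    (Ties are not averaged in this simple version.)"""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(len(v))
        return r
    return float(np.corrcoef(ranks(np.asarray(pred)), ranks(np.asarray(actual)))[0, 1])

def r2_score(pred, actual):
    """Standard coefficient of determination for regression."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```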
5.3 Architecture search
Table 4: Test accuracy per task of the first 10 models tried by NAS, the best NAS model (with the number of child models trained to find it), and the proposed method.

Task name               | First 10 (NAS)  | NAS     | Child models trained | Proposed
airline                 | 0.7904 ± 0.0366 | 0.83197 | 751                  | 0.8222 ± 0.0129
global warming          | 0.7806 ± 0.0249 | 0.79196 | 1927                 | 0.8017 ± 0.0108
disasters               | 0.8193 ± 0.0105 | 0.83425 | 1283                 | 0.8235 ± 0.0119
political bias          | 0.7770 ± 0.0151 | 0.778   | 1989                 | 0.7686 ± 0.0108
progressive opinion     | 0.6750 ± 0.0428 | 0.73276 | 1505                 | 0.7052 ± 0.0381
progressive stance      | 0.4181 ± 0.0645 | 0.57759 | 1635                 | 0.4724 ± 0.0537
us economic performance | 0.7494 ± 0.0112 | 0.76411 | 1966                 | 0.7498 ± 0.0132
corporate messaging     | 0.8006 ± 0.0492 | 0.85897 | 968                  | 0.8247 ± 0.0387
emotion                 | 0.2998 ± 0.0278 | 0.35425 | 1779                 | 0.3397 ± 0.0238
political message       | 0.4230 ± 0.0075 | 0.414   | 1974                 | 0.4214 ± 0.0061
In this section we look into the performance of the child model architectures suggested by our method and we report their test accuracy. When we apply our algorithm we set kNumStartingPoints to 10 and kMaxIters to 1000 in Algorithm 2. For each task we run our method ten times in order to get more accurate statistics on the performances.
We compare against the NAS method for architecture search using reinforcement learning [22]. In particular, we applied NAS to the same child models as our method. Table 4 shows, for each task, the test accuracy of the child model that NAS found as having the best validation accuracy. We also report the number of trained child models that were needed to achieve this accuracy. For completeness, the table also includes the performance of the first 10 models that NAS tried.
Notice that the performance of the proposed method is not too far from that of NAS. This is very promising given that the proposed method requires no child model training in its online phase and is very efficient.
Note finally that the experiments above have been all performed in the continuous architecture space. However, we acknowledge that inference with continuous child model architectures can be expensive for some applications, since it involves computations over all possible design choices. In such cases, one may want to prune the architecture (in order to make inference faster) but still keep the same model quality. This will be the subject of a forthcoming study.
6 Conclusions and future work
We presented a framework for efficient architecture inference that cross learns from several tasks. This is feasible thanks to a deep value network that predicts the performance of a candidate architecture on a certain task based on learned meta-features derived from the raw data. Given a new task, the proposed method uses simple gradient ascent to infer a candidate architecture for it and experimental results confirm that the performance of the found architecture is relatively close to that of the very expensive baseline. In our future work, we plan to explore different child model parametrizations, study the effect of pruning the architecture and apply the method to other data modalities beyond text (e.g., images).
Acknowledgments

The authors would like to thank Thomas Deselaers for his valuable comments, fruitful discussions, and support.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[2] B. Baker, O. Gupta, R. Raskar, and N. Naik. Accelerating neural architecture search using performance prediction. arXiv preprint, November 2017.

[3] Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 874–883. JMLR.org, 2017.

[4] B. Deng, J. Yan, and D. Lin. Peephole: Predicting network performance before training. arXiv preprint, December 2017.

[5] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS, 2015.

[6] N. Fusi, R. Sheth, and H. M. Elibol. Probabilistic matrix factorization for automated machine learning. In NeurIPS, Montréal, Canada, 2018.

[7] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In CVPR, 2018.

[8] M. Gygli, M. Norouzi, and A. Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. In ICML, 2017.

[9] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In 5th International Conference on Learning and Intelligent Optimization, pages 507–523, 2011.

[10] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. C. I. Malossi. TAPAS: Train-less accuracy predictor for architecture search. arXiv preprint, 2018.

[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[12] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, September 2018.

[13] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.

[14] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter. Towards automatically-tuned neural networks. JMLR: Workshop and Conference Proceedings, 1:1–8, 2016.

[15] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In ICML, PMLR 80, pages 4095–4104, Stockholm, Sweden, 2018.

[16] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, January 1999.

[17] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, Sydney, Australia, 2017.

[18] R. Shin, C. Packer, and D. Song. Differentiable neural network architecture search. In ICLR Workshop track, 2018.

[19] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.

[20] Catherine Wong, Neil Houlsby, Yifeng Lu, and Andrea Gesmundo. Transfer learning with neural AutoML. In NeurIPS, pages 8366–8375, 2018.

[21] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. In NIPS, 2017.

[22] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.