RNAS: Architecture Ranking for Powerful Networks
Abstract
Neural Architecture Search (NAS) is attractive for automatically producing deep networks with excellent performance and acceptable computational costs. In most existing NAS algorithms, the performance of an intermediate network is represented by its result on a small proxy dataset with insufficient training, in order to save computational resources and time. Although these representations help to distinguish some searched architectures, they are still far from the exact performance or ranking order of all networks sampled from the given search space. Therefore, we propose to learn a performance predictor for ranking different models in the searching period, using only a few networks pretrained on the entire dataset. We represent each neural architecture as a feature tensor and use the predictor to further refine the representations of networks in the search space. The resulting performance predictor can be utilized for searching desired architectures without additional evaluation. Experimental results illustrate that we can use only a tiny fraction of the entire NASBench dataset (424 models, roughly 0.1%) to construct an accurate predictor that efficiently finds an architecture whose accuracy ranks near the top of the whole search space, exceeding that of state-of-the-art methods.
1 Introduction
Convolutional Neural Networks (CNNs) have achieved state-of-the-art results in many real-world applications. However, most of these CNNs (e.g. VGG, ResNet, MobileNet) are designed based on abundant expert experience. Therefore, it is attractive to design neural network architectures without human intervention, in order to save human effort. Neural Architecture Search (NAS) is a framework for automatically producing deep networks with excellent performance and low computational costs.
Based on different searching strategies and assumptions, a number of NAS algorithms have been proposed to increase the search speed and the performance of the resulting network, including discrete searching methods such as Evolutionary Algorithm (EA) based methods [9, 12, 14, 17, 18, 21] and Reinforcement Learning (RL) based methods [1, 13, 24, 25], and continuous searching methods such as DARTS [10] and CARS [22].
Nevertheless, the algorithms mentioned above mainly focus on designing the searching method, while the evaluation criterion used in various NAS algorithms has not been fully investigated. For instance, discrete searching methods utilize an early stop strategy and evaluate searched architectures on a relatively small dataset for efficiency. Continuous searching methods usually use a series of learnable parameters to control the selected layers or operations in deep neural networks, and these parameters are updated according to the intermediate performance of the supernet. In fact, most evaluation criteria exploited in the above-mentioned schemes are inaccurate and prevent us from selecting the best neural network, as shown in Fig. 1 (detailed information can be found in Sec. 3.1). In addition, there is a recent consensus that the performance of the network selected by current NAS frameworks is similar to that of random search [15].
Besides using the intermediate performance to represent the exact network performance on the entire dataset, Domhan et al. proposed a weighted probabilistic model to extrapolate the performance from the first part of a learning curve and speed up hyperparameter search for CNNs [4]. Klein et al. used a Bayesian neural network to predict unobserved learning curves [7]. Such methods rely on Markov chain Monte Carlo (MCMC) sampling procedures and hand-crafted curve functions, which are computationally expensive. Deng et al. developed a unified way to encode individual layers into vectors and bring them together to form an integrated description via an LSTM, directly predicting the performance of a network architecture [3]. Sun et al. proposed an end-to-end offline performance predictor based on random forests [20]. However, the results of the above methods are still not accurate enough.
To make the searching in NAS effective, we utilize the NASBench dataset to study the performance predictor. Specifically, we use general and fundamental features, e.g. the flops and parameters of each layer (or each node), to represent a specific architecture when training the predictor. Then, a pairwise ranking based loss function is used instead of an element-wise loss function such as the Mean Squared Error (MSE) loss or L1 loss, since keeping the rankings between different neural networks is more important than predicting their absolute performance for most searching methods. The experimental results show that the proposed predictor achieves higher prediction performance than other state-of-the-art methods, and can efficiently find an architecture with near-top accuracy in the whole search space using only a tiny fraction of the dataset.
The rest of the paper is organized as follows: Sec. 2 introduces the related works on NAS and state-of-the-art architecture evaluation methods. Sec. 3 starts with a toy experiment and then introduces the details of the proposed method. Several experiments conducted on the NASBench dataset are shown in Sec. 4, and finally Sec. 5 concludes the paper.
2 Related Works
In this section, we give a brief introduction of the Neural Architecture Search (NAS) and the network performance predictor.
2.1 Neural Architecture Search
Neural architecture search aims to automatically find the optimal model hyperparameters or architectures for specific tasks under given constraints, such as higher accuracy and lower computational cost.
From the viewpoint of methodology, the latest algorithms for NAS fall into two categories. The first is discrete search space based methods, such as Evolutionary Algorithm (EA) based methods [9, 12, 14, 17, 18, 21] and Reinforcement Learning (RL) based methods [1, 13, 24, 25]. In EA based methods, each individual neural network is regarded as an architectural component. Searching is performed through generations by using mutations and recombinations of architectural components, and the components with better performance on the validation set are picked and inherited to the next generation during evolution. In RL based methods, the choice of different architectures is regarded as a sequence of actions, and the performance on the validation set is usually used as the reward. The second is differentiable searching methods such as DARTS [10, 11]. This kind of method allows efficient search of the architecture using gradient descent.
From the viewpoint of the searching target, methods for NAS also fall into two categories: searching for the architecture of neural networks and searching for the hyperparameters of neural networks [2, 5, 19, 8]. When searching for the architecture, it is essential to design the search space in order to speed up the searching process: a high quality search space influences not only the duration of the search but also the quality of the solution. In this situation, the hyperparameters of the network are fixed to empirical values. When searching for hyperparameters, the focus is primarily on obtaining good optimization hyperparameters for training a fixed network architecture.
Although recent NAS approaches can achieve promising performance based on a well-designed supernet and search space, the performance predictors utilized in these methods are not accurate enough, which makes the searching procedure unstable.
2.2 Network Performance Predictor
There are limited works on predicting the performance of neural networks. One kind of method predicts the validation accuracy using part of the learning curve after training the architectures for several epochs. A series of validation accuracies is collected for each neural network configuration as the training data, and different regression models are used to predict the performance of a neural network. Domhan et al. used a weighted probabilistic model [4], Klein et al. used a Bayesian neural network [7], and Baker et al. proposed a sequential regression model to predict the validation accuracy [2]. Such methods rely heavily on the smoothness of the learning curve, and are not effective when an abrupt change occurs, e.g. a change of the learning rate, which frequently happens during neural network training. Moreover, predicting the performance of an architecture still requires training it for several epochs, which is time-consuming.
Another kind of method produces performance predictors without any training of the neural architectures, which can largely accelerate the search process. Deng et al. encoded each individual layer into a vector and brought the vectors together to form an integrated description via an LSTM; a multilayer perceptron (MLP) is then used to predict the final accuracy [3]. Istrate et al. proposed a performance predictor that estimates classification performance for unseen input datasets in fractions of a second; it is designed to transfer knowledge from familiar datasets to unseen ones [6]. Sun et al. proposed to use less training data via an end-to-end offline performance predictor based on random forests [20].
However, the features or representations of neural architectures in these works are not well designed, and most of them adopt MSE based loss functions to guide the training of the predictors. Thus, an effective framework for ranking different neural networks and improving NAS algorithms is urgently required.
3 Problem Formulation
In this section, we first instantiate the problem with the evaluation criteria used in previous methods via a toy experiment. Then, we give an elaborate introduction of the proposed performance predictor. Specifically, designing a predictor involves three parts: encoding the network architecture into a feature, designing the deep regressor, and designing the loss. The first two parts received most of the attention in previous works, while the loss is usually the simplest element-wise MSE loss (or L1 loss). In this paper, we utilize the NASBench dataset [23] and propose our own way to encode neural network architectures into feature tensors and to design the regressor. Furthermore, we propose a pairwise ranking loss to optimize the regressor.
3.1 Toy Experiments
We begin the problem formulation of our method with a toy experiment. To save computational resources and time, a commonly used evaluation criterion in previous NAS methods is to train the model on part of the training dataset with an early stop strategy. The model is then tested on the testing dataset, and the intermediate accuracy is used to evaluate the performance of the model in the subsequent searching algorithms. However, lighter architectures often converge faster on a smaller dataset than cumbersome architectures, yet perform worse when using the whole training set.
As shown in Fig. 1, we construct different models based on the VGG-small network by multiplying the channel width in each layer by a consistent ratio. The ground-truth performance of these networks is derived by training them on the complete CIFAR-100 training dataset for 200 epochs and then validating on the CIFAR-100 test dataset (Fig. 1 (f)). Besides, different intermediate accuracies are derived by training the networks on the CIFAR-100 training dataset for different numbers of epochs, and then validating on the CIFAR-100 test dataset (Fig. 1 (a)-(e)). The results show that the network with the best ground-truth performance does not stand out under the intermediate evaluations, no matter how many epochs it is trained. Thus, the intermediate accuracy used in previous methods is not accurate enough to evaluate the true performance of a given network architecture.
3.2 Feature Tensor of NASBench dataset
In this section, we give a brief introduction of the NASBench dataset, and propose our own way to encode architectures in the NASBench dataset into feature tensors.
NASBench is a recently proposed dataset for neural architecture search. The dataset contains over 423k unique CNN architectures and their accuracies when trained on the CIFAR-10 dataset. The architectures share the same skeleton, as shown in column 1 of Fig. 2. The search space is restricted to small feedforward structures called cells. Each cell is stacked 3 times, followed by a downsampling layer which halves the height and width and doubles the channels. This 'stack-downsample' pattern is repeated 3 times, followed by global average pooling and a dense layer. The first layer is fixed as a convolution layer with 128 output channels. Different cells produce different CNN architectures. Each cell contains no more than 7 nodes, in which the first and the last nodes are fixed to represent the input and output tensors of the cell, respectively. The other nodes are randomly selected from 3 different operations: 1x1 convolution, 3x3 convolution and 3x3 max-pooling. The edges are limited to no more than 9.
How a neural architecture is encoded is important for a predictor. Peephole [3] chose the layer type, kernel width, kernel height and channel number as the representation of each layer. E2EPP [20] forced the network architecture to be composed of DenseNet blocks, ResNet blocks and pooling blocks, and generated features based on these blocks.
However, these features are not strong enough to encode a network architecture. Different from the methods mentioned above, we focus on generating more general and fundamental features, e.g. the flops and parameters of each layer (or each node). Note that a network architecture in the NASBench dataset is completely determined by the architecture of the corresponding cell. The cell can be represented by a 0-1 adjacency matrix \(A \in \{0,1\}^{N \times N}\) and a type vector \(t\) (5 different node types: input, 1x1 convolution, 3x3 convolution, 3x3 max-pooling and output), in which \(N\) is the number of nodes. Furthermore, we calculate the flops and parameters of each node and derive a flop vector \(f\) (assuming the input image size of CIFAR-10, i.e. 32x32) and a parameter vector \(p\).
Specifically, the adjacency matrix can always be represented as an upper triangular matrix. Given a cell architecture as a directed acyclic graph (DAG) with \(N\) nodes, the input node is always the root node. The depth of a node is defined as the maximum distance from the root to that node (there may exist multiple paths between them). By sorting the nodes by their depth, the adjacency matrix becomes upper triangular.
Note that \(N \le 7\), and it differs across architectures. Thus, we pad the adjacency matrix with 0 so that its size is fixed at \(7 \times 7\). The type vector \(t\), flop vector \(f\) and parameter vector \(p\) are padded accordingly. Note that the input and the output node should remain the first and the last node, so the zero-padding is added at the penultimate row and column each time (see Fig. 2). After that, we broadcast the vectors into matrices and make an element-wise multiplication with the adjacency matrix to get the type matrix \(T\), flop matrix \(F\) and parameter matrix \(P\). Note that there are 9 cells in a network architecture. Thus, there are 9 different flop matrices and 9 different parameter matrices, and they share the same type matrix. We concatenate them together to get a tensor representing a specific architecture in the NASBench dataset. The detailed implementation is shown in Fig. 2.
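To make the encoding concrete, the following NumPy sketch (function and variable names are ours, not the paper's) pads one cell's adjacency matrix and its type/flop/parameter vectors to 7 nodes, inserting the zero-padding at the penultimate row and column so the input stays first and the output stays last, then masks the broadcast vectors with the adjacency matrix:

```python
import numpy as np

MAX_NODES = 7  # NAS-Bench-101 cells contain at most 7 nodes

def encode_cell(adj, types, flops, params):
    """Encode one cell as stacked type/flop/parameter matrices.

    adj    : (n, n) 0-1 upper-triangular adjacency matrix, nodes sorted by depth
    types  : (n,) integer node-type codes (the coding scheme here is illustrative)
    flops  : (n,) FLOPs of each node
    params : (n,) parameter count of each node
    """
    n = adj.shape[0]
    pad = MAX_NODES - n
    # Zero-pad at the penultimate row/column: input stays first, output stays last.
    where = [n - 1] * pad

    def pad_mat(m):
        m = np.insert(m, where, 0, axis=0)
        return np.insert(m, where, 0, axis=1)

    def pad_vec(v):
        return np.insert(np.asarray(v, dtype=float), where, 0)

    a = pad_mat(np.asarray(adj, dtype=float))
    t, f, p = (pad_vec(v) for v in (types, flops, params))
    # Broadcast each vector over the columns and mask by the adjacency matrix.
    return np.stack([a * t[None, :], a * f[None, :], a * p[None, :]])
```

For a whole NASBench architecture, the 9 per-cell flop matrices, the 9 per-cell parameter matrices and the shared type matrix would then be stacked along the channel dimension to form the final feature tensor.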
3.3 Architecture Performance Predictor
Given the feature tensor mentioned above, we propose the architecture performance predictor and introduce the ranking based loss function in this section.
CNNs have shown promising performance on tensor-like input. In practice, there is usually limited training data for the predictor, due to the massive time and resources spent on training a single neural architecture. Thus, to prevent overfitting, we use a simple modified LeNet-5 architecture (as shown in Table 1) to predict the final accuracy of a given network architecture tensor. Batch normalization (BN) is used after each convolutional layer, and the ReLU activation function is used after each BN layer and FC layer.
Layer name | Output size | Parameters
conv1      | -           | -
conv2      | -           | -
fc1        | 120         | -
fc2        | 84          | -
fc3        | 1           | -
In order to train the predictor, a commonly used loss function is the element-wise MSE or L1 loss [3, 6, 20]. Specifically, let \(\hat{y} = \{\hat{y}_1, \dots, \hat{y}_n\}\) be the outputs of the predictor, in which \(n\) is the number of samples, and let \(y = \{y_1, \dots, y_n\}\) be the ground-truth accuracies. Previous works optimize the predictor by using the MSE loss function:
\[ \mathcal{L}_{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \tag{1} \]
or the L1 loss function:
\[ \mathcal{L}_{1} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|. \tag{2} \]
The above loss functions focus on fitting the absolute accuracy of a single network, and assume that a lower MSE or L1 loss leads to better ranking results. However, this is not always the case, as shown in [15]. We believe that directly focusing on the ranking of the predicted accuracies between different architectures is more important than fitting their absolute values when applying the network performance predictor to different searching methods. Thus, we apply a pairwise ranking based loss function to the predictor.
Besides utilizing the final output of the predictor, we believe that the feature extracted before the final FC layer is also useful. Note that continuity is a common assumption in machine learning, i.e. the performance changes continuously along the feature space. However, this is not the case for the raw network architecture, in which a slight change of the architecture may lead to a radical change of the performance (e.g. adding a skip connection). Thus, we consider learning features with the property of continuity. The effects of the proposed predictor are therefore two-fold: the first is to predict accuracies with the correct ranking, and the second is to generate features with the property of continuity.
Specifically, let \(\hat{y} = \{\hat{y}_1, \dots, \hat{y}_n\}\) be the outputs of the predictor, which are the objects to be ranked, and let \(y = \{y_1, \dots, y_n\}\) be the ground-truth accuracies. We define the pairwise ranking based loss function as:
\[ \mathcal{L}_{rank} = \sum_{i=1}^{n}\sum_{j=i+1}^{n} \phi\big((\hat{y}_i - \hat{y}_j)\,\mathrm{sign}(y_i - y_j)\big), \tag{3} \]
in which \(\phi(z) = \max(0, m - z)\) is the hinge function with margin \(m\). Note that given a pair of examples, the loss is 0 only when the examples are in the right order and differ by at least the margin. Other functions, such as the logistic function and the exponential function, can also be applied here.
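A minimal sketch of this pairwise hinge ranking loss follows; the margin value and function names are illustrative placeholders, not the paper's settings:

```python
def pairwise_ranking_loss(pred, target, margin=0.1):
    """Pairwise hinge ranking loss in the spirit of Eq. 3.

    For each pair (i, j), the hinge phi(z) = max(0, margin - z) is applied
    to the predicted gap, signed by the ground-truth order. The loss is 0
    only when every pair is correctly ordered with a gap of at least margin."""
    loss = 0.0
    n = len(pred)
    for i in range(n):
        for j in range(i + 1, n):
            # sign of the ground-truth order, without numpy
            sign = (target[i] > target[j]) - (target[i] < target[j])
            loss += max(0.0, margin - sign * (pred[i] - pred[j]))
    return loss
```

A correctly ordered prediction with sufficiently large gaps incurs zero loss, while any inversion contributes a positive penalty.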
In order to generate features with continuity, consider a triplet \((x_i, x_j, x_k)\), in which \(f_i\) denotes the feature of \(x_i\) extracted before the final FC layer. The Euclidean distance between two features is computed as \(d^f_{i,j} = \|f_i - f_j\|_2\), and the difference in performance between two architectures is simply computed as \(d^y_{i,j} = |y_i - y_j|\). Thus, we encourage the property of continuity by defining the loss function as:
\[ \mathcal{L}_{cont} = \sum_{i,j,k} \phi\big((d^f_{i,j} - d^f_{i,k})\,\mathrm{sign}(d^y_{i,j} - d^y_{i,k})\big), \tag{4} \]
Note that although Eq. 3 and Eq. 4 are similar in form, the purposes behind them are quite different. Given the equations above, the final loss function is a combination of them:
\[ \mathcal{L} = \mathcal{L}_{rank} + \lambda \mathcal{L}_{cont}, \tag{5} \]
in which \(\lambda\) is a hyperparameter that controls the relative importance of the two loss functions.
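The continuity term can be sketched in the same way. This version (margin value assumed, looping over all ordered triplets for clarity) penalizes triplets whose feature-distance ordering disagrees with their accuracy-difference ordering; the full objective of Eq. 5 would add the pairwise ranking term weighted by \(\lambda\):

```python
from itertools import permutations
from math import dist

def continuity_loss(feats, target, margin=0.1):
    """Triplet continuity loss in the spirit of Eq. 4 (margin assumed).

    Feature distances measured before the final FC layer should order the
    same way as accuracy differences: if |y_i - y_j| > |y_i - y_k|, then
    f_i should lie farther from f_j than from f_k, by at least the margin."""
    loss = 0.0
    for i, j, k in permutations(range(len(target)), 3):
        d_f = dist(feats[i], feats[j]) - dist(feats[i], feats[k])
        d_y = abs(target[i] - target[j]) - abs(target[i] - target[k])
        sign = (d_y > 0) - (d_y < 0)
        loss += max(0.0, margin - sign * d_f)
    return loss
```

Both the margin and \(\lambda\) are tuning knobs; the paper's actual values are not reproduced here.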
Finally, the performance predictor is integrated into a searching algorithm (an EA based searching method in the following experiments). An individual is fed into the predictor, and the output of the predictor is treated as the fitness of the model in the EA method; obtaining this fitness takes only milliseconds. The whole process is detailed in Algorithm 1.
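The integration can be sketched as a minimal EA loop in which fitness evaluation is a single predictor call. `predict`, `random_arch` and `mutate` are placeholders for the trained predictor and NAS-Bench-style sampling/mutation operators; only the select-mutate-rerank structure is taken from the text (crossover is omitted for brevity):

```python
import random

def evolve(predict, random_arch, mutate, generations=10, pop_size=20, keep=10):
    """Minimal EA loop using the predictor's output as fitness.

    Fitness is one forward pass of the predictor, so no network is ever
    trained during the search."""
    population = [random_arch() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the population by predicted accuracy, best first.
        population.sort(key=predict, reverse=True)
        survivors = population[:keep]
        # Refill the population by mutating randomly chosen survivors.
        offspring = [mutate(random.choice(survivors))
                     for _ in range(pop_size - keep)]
        population = survivors + offspring
    return max(population, key=predict)
```

In the paper's setting the experiments additionally use selection, crossover and mutation probabilities; those details are not modeled in this sketch.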
4 Experiments
In this section, we conduct several experiments to verify the effectiveness of the proposed network performance predictor. After that, the best CNN architecture is found by embedding the predictor into the EA algorithm, and it is compared to the architectures found with other state-of-the-art predictors.
The parameter settings for training the predictor and searching for the best architecture are detailed below. When training the predictor, we used the Adam algorithm to train the LeNet architecture. When searching for the best CNN architecture, the search space is the same as in the NASBench dataset: the maximum numbers of nodes and edges in each cell are 7 and 9, and each intermediate node is randomly chosen from 1x1 convolution, 3x3 convolution and 3x3 max-pooling. The maximum generation number, population size, and the probabilities for selection, crossover and mutation of the EA algorithm are fixed across all experiments.
Method | KTau with an increasing proportion of training data →
Peephole [3] | 0.4556 | 0.4769 | 0.4963 | 0.4977 | 0.4972 | 0.4975 | 0.4951
E2EPP [20] | 0.5038 | 0.6734 | 0.7009 | 0.6997 | 0.7011 | 0.6992 | 0.6997
Proposed v1 (type matrix + MSE) | 0.3465 | 0.5911 | 0.7914 | 0.8229 | 0.8277 | 0.8344 | 0.8350
Proposed v2 (tensor + MSE) | 0.4856 | 0.6090 | 0.8103 | 0.8430 | 0.8399 | 0.8504 | 0.8431
Proposed v3 (type matrix + pairwise) | 0.6039 | 0.7943 | 0.8752 | 0.8894 | 0.8949 | 0.8976 | 0.8997
Proposed v4 (tensor + pairwise) | 0.6335 | 0.8136 | 0.8762 | 0.8900 | 0.8957 | 0.8979 | 0.8995
4.1 Predictor Performance Comparison
We compared the proposed predictor with the methods introduced in Peephole [3] and E2EPP [20]. The NASBench dataset is selected as the training and testing sets of the predictors.
Recall that one of the fundamental ideas in our proposed method is that the ranking of the predicted values is more important than their absolute values when embedding the predictor into different searching methods. Thus, for the quantitative comparison, we use Kendall's Tau (KTau) [16] as the indicator. KTau measures the correlation between the ranking of the predicted values and the ranking of the actual values, which makes it suitable for judging the quality of predictive rankings. KTau ranges from -1 to 1, and a higher value indicates a better ranking.
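For reference, KTau can be computed in a few lines (this is a plain pair-counting implementation in which ties count as neither concordant nor discordant; library versions such as `scipy.stats.kendalltau` apply tie corrections):

```python
def kendall_tau(pred, target):
    """Kendall's Tau: (concordant - discordant) pairs over all pairs.

    +1 means the predicted ranking matches the true ranking exactly,
    -1 means it is exactly reversed."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pred[i] - pred[j]) * (target[i] - target[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```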
To clearly examine the influence of using the fundamental flop and parameter features, and the influence of using the pairwise loss, we evaluate the following variants of the proposed method, fixing the predictor as LeNet while varying the feature encoding and the loss function:

- Proposed v1: using only the type matrix as the feature, with the MSE loss function.
- Proposed v2: using the proposed feature encoding method (tensor feature), with the MSE loss function.
- Proposed v3: using only the type matrix as the feature, with the pairwise loss function.
- Proposed v4: using the proposed feature encoding method (tensor feature), with the pairwise loss function.
Note that the search spaces of the NASBench dataset and E2EPP differ from each other, and the encoding method proposed in E2EPP cannot be used directly on the NASBench dataset. In order to apply E2EPP to the NASBench dataset, we build a surrogate for E2EPP that uses the feature encoding method proposed in the previous section instead of the original encoding produced by E2EPP. The other parts remain unchanged.
The experimental results are shown in Table 2. Considering that in reality the training samples can only cover a small proportion of the search space, we focus on the second column, where only a small fraction of the NASBench dataset is used as the training set; larger proportions are included for completeness. The results show that the proposed encoding method utilizing flops and parameters represents an architecture better than the type matrix alone: the KTau indicator increases noticeably both when using the MSE loss and when using the pairwise loss. Replacing the element-wise MSE loss with the pairwise loss also increases the KTau indicator substantially, both when using only the type matrix as the feature and when using the proposed tensor feature. This means the pairwise loss is better than the MSE loss at ranking, regardless of the input feature.
Compared to other state-of-the-art methods: Peephole uses the kernel size and channel number as features in addition to the layer (node) type, and shows better results than the proposed v1 method, which uses only the layer (node) type as its feature. However, it performs worse than the proposed v2 method, which uses all the proposed features; this again shows the benefit of using the fundamental flops and parameters of each layer as features. E2EPP uses a random forest as the predictor, which has advantages only when training samples are extremely rare. In almost all circumstances, the proposed method with the pairwise loss achieves the best KTau performance.
A qualitative comparison on the NASBench dataset is shown in Fig. 3. We show the results of predictors trained with the same amount of training data; the x axis of each point represents its true ranking among all the points, and the y axis denotes the corresponding predicted ranking. The results show that the predicted ranking made by the proposed method is notably better than those of other state-of-the-art methods.
4.2 Architecture Search Results
When searching for the best CNN architecture, the size of the training set of the predictor should be limited. This is because the search space of the EA algorithm is the same as that of the NASBench dataset, and we cannot prevent the EA algorithm from visiting architectures that appear in the training set. Thus, in order to reduce the influence of the training set, we used only 424 models (roughly 0.1%) of the NASBench dataset as training samples for the predictor, and subsequently ran the EA algorithm. The final performance on the CIFAR-10 dataset of the best architectures searched by the EA algorithm with the proposed predictor and with the peer competitors mentioned above is shown in Table 3. Specifically, the best performance among the top-10 architectures selected by the EA algorithm with each predictor is reported, and the experiments are repeated 20 times with different random seeds to alleviate randomness.
Method | Accuracy (%) | Ranking (%)
Peephole [3] | 90.99 ± 0.61 | 43.22
E2EPP [20] | 93.47 ± 0.44 | 1.23
Proposed v1 | 91.36 ± 0.27 | 35.97
Proposed v2 | 93.03 ± 0.21 | 6.09
Proposed v3 | 93.43 ± 0.26 | 1.50
Proposed v4 | 93.90 ± 0.21 | 0.04
The second column reports the classification accuracies of the selected models on the CIFAR-10 test set, and the third column reports the true ranking of the selected models among all the different models in the NASBench dataset. The proposed method outperforms the other competitors, and finds a network architecture with near-top performance in the search space while using only a tiny fraction of the dataset. Achieving good performance with little training data is reasonable for two reasons. The first is that the fundamental features of flops and parameters represent the architecture well, and the tensor-like input is suitable for a CNN. The second is that using the pairwise loss expands the training set to some extent: given \(n\) individuals, there are actually \(n(n-1)/2\) pairs and \(\binom{n}{3}\) triplets available for training.
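The claim that pairwise and triplet losses expand the effective training set can be checked with a few lines of combinatorics (the counts below are for unordered pairs and triplets):

```python
from math import comb

def pairs_and_triplets(n):
    """Number of unordered pairs and triplets obtainable from n labelled
    networks, i.e. n(n-1)/2 and n(n-1)(n-2)/6."""
    return comb(n, 2), comb(n, 3)

# With a few hundred pretrained models, the pairwise and triplet losses see
# orders of magnitude more training signals than there are raw labels.
```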
Note that when using a performance predictor in practice, the search space often differs from the NASBench dataset, which means the training samples need to be collected from scratch. Thus, we give some intuition about selecting model architectures from a search space as training samples. Samples are selected from the NASBench dataset as training samples by three methods: random selection, selection by parameters and selection by flops. When selecting by parameters (flops), all samples are sorted by their total parameters (flops) and then selected uniformly. Different predictors are trained on the different training samples using the proposed method, and are then integrated into the EA algorithm for searching. The performance of the best architectures is shown in Table 4.
Method | Accuracy (%) | Ranking (%)
Random selection | 93.90 ± 0.21 | 0.04
Select by parameters | 93.84 ± 0.21 | 0.08
Select by flops | 93.76 ± 0.13 | 0.16
The results show that random selection performs best. A possible reason is that architectures with similar parameters (flops) perform diversely, so uniformly selected architectures cannot represent the true performance distribution of architectures with similar parameters (flops). Thus, random selection is our choice, and it is worth trying when generating training samples from a search space in practice. The best cell architecture searched by the EA algorithm using the proposed predictor trained with randomly selected training samples is shown in column 2 of Fig. 4.
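The 'select by parameters' and 'select by flops' baselines can be sketched as follows; the function and argument names are illustrative, not from the paper:

```python
def select_uniform_by(models, key, k):
    """Sort models by `key` (e.g. total parameters or total FLOPs) and pick
    k models at evenly spaced ranks, as in the sort-and-select baselines."""
    order = sorted(models, key=key)
    step = (len(order) - 1) / (k - 1)  # spacing between selected ranks
    return [order[round(i * step)] for i in range(k)]
```

Random selection, by contrast, is just `random.sample(models, k)`; the experiment above suggests it yields a slightly better predictor than either sorted scheme.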
In the following, we give an intuitive representation of the best architectures selected by the performance predictor with different numbers of training samples, as shown in Fig. 4.
Note that the rank-1 architecture in the NASBench dataset cannot be selected by the predictor even when a large portion of the training data is used. This is because, with the pairwise ranking based loss function, there are \(O(n^2)\) training pairs, and it is very inefficient to train on all of them in a single batch. Thus, a mini-batch updating method is used, and a single architecture is compared with only a limited number of architectures in one epoch, which causes a lack of global information about this architecture, especially when the number of training samples is large. In fact, the mini-batch size is set to 1024 in the experiment, which is a compromise between effectiveness and efficiency.
For the same reason, the architecture found by a predictor trained on a smaller portion of the dataset is sometimes marginally better than that found with a larger portion. Specifically, we divide the settings into two regimes. In the first regime, the number of training samples is on the same order of magnitude as the mini-batch size (1024): the global ranking information about a single model is easy to obtain, and performance improves as more training data is added. In the second regime, the number of training samples is significantly larger than the mini-batch size. On the one hand, increasing the number of samples helps training; on the other hand, the global ranking information becomes harder to obtain. Thus, performance is not guaranteed to improve when using more training samples.
Finally, there are some common characteristics among these architectures. The first is that the distance between the input node and the output node is at most 2, which shows the significance of skip-connections. The second is that a particular operation appears in each of these architectures. Based on these observations, we separate the NASBench dataset according to the distance between the input node and the output node, and whether this operation is used. Some statistics are shown in Table 5.
Op used | Distance | #Models | Best acc (%) | Average acc (%)
yes | 1 | 68552 | 94.32 | 91.97
yes | 2 | 153056 | 94.05 | 91.02
yes | 3 | 110863 | 93.68 | 89.31
yes | 4 | 27227 | 92.36 | 87.40
yes | 5 | 2516 | 90.54 | 86.51
yes | 6 | 211 | 88.87 | 84.91
no | 1 | 12468 | 91.62 | 88.40
no | 2 | 26282 | 90.81 | 86.69
no | 3 | 17735 | 90.24 | 83.53
no | 4 | 4282 | 88.95 | 80.20
no | 5 | 400 | 88.16 | 78.84
no | 6 | 32 | 86.71 | 74.93
The table shows that the shorter the distance between the input node and the output node, the better the performance. Besides, the operation in question helps the architecture to perform better. Based on these observations, we may form a better search space from the NASBench dataset by using only the 68552 models with the operation and a skip-connection between the input node and the output node. An experiment training and evaluating the performance predictor on this sub-dataset shows that a predictor trained and evaluated on the sub-dataset performs better than the previous one, as shown in Table 6. A better search space thus helps to produce a better performance predictor.
Dataset | Accuracy (%) | Ranking (%)
Whole dataset | 93.90 ± 0.21 | 0.04
Sub-dataset | 94.02 ± 0.14 | 0.01
5 Conclusion
We proposed a new method for predicting network performance based on its architecture before training. We encode an architecture in the NASBench dataset into a feature tensor utilizing flops and parameters, which is a general and fundamental way to represent an architecture. A pairwise ranking based loss function is used for the performance predictor instead of an element-wise loss function, since the rankings between different architectures matter more than their absolute values for most searching methods. We then embed the proposed predictor into an EA based method and search for the best architecture. Several experiments conducted on the NASBench dataset show the superiority of the proposed predictor in sorting the performance of different architectures, and it finds an architecture with near-top performance in the search space using only a tiny fraction (424 models) of the dataset. A better sub search space is also generated by utilizing the common statistics of the best architectures found by the predictor, and this subspace helps to produce a better predictor.
References
 [1] (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §1, §2.1.
 [2] (2017) Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823. Cited by: §2.1, §2.2.
 [3] (2017) Peephole: predicting network performance before training. arXiv preprint arXiv:1712.03351. Cited by: §1, §2.2, §3.2, §3.3, §4.1, Table 2, Table 3.
 [4] (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In TwentyFourth International Joint Conference on Artificial Intelligence, Cited by: §1, §2.2.
 [5] (2011) Sequential modelbased optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §2.1.
 [6] (2018) Tapas: trainless accuracy predictor for architecture search. arXiv preprint arXiv:1806.00250. Cited by: §2.2, §3.3.
 [7] (2016) Learning curve prediction with bayesian neural networks. Cited by: §1, §2.2.
 [8] (2016) Hyperband: a novel banditbased approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560. Cited by: §2.1.
 [9] (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §1, §2.1.
 [10] (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.1.
 [11] (2018) Neural architecture optimization. In Advances in neural information processing systems, pp. 7816–7827. Cited by: §2.1.
 [12] (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §1, §2.1.
 [13] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1, §2.1.
 [14] (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548. Cited by: §1, §2.1.
 [15] (2019) Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142. Cited by: §1, §3.3.
 [16] (1968) Estimates of the regression coefficient based on kendall’s tau. Journal of the American statistical association 63 (324), pp. 1379–1389. Cited by: §4.1.
 [17] (2019) Searching for accurate binary neural architectures. arXiv preprint arXiv:1909.07378. Cited by: §1.
 [18] (2019) Coevolutionary compression for unpaired image translation. arXiv preprint arXiv:1907.10804. Cited by: §1.
 [19] (2015) Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §2.1.
 [20] (2019) Surrogateassisted evolutionary deep learning using an endtoend random forestbased performance predictor. IEEE Transactions on Evolutionary Computation. Cited by: §1, §2.2, §3.2, §3.3, §4.1, Table 2, Table 3.
 [21] (2018) Towards evolutionary compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2476–2485. Cited by: §1.
 [22] (2019) CARS: continuous evolution for efficient neural architecture search. arXiv preprint arXiv:1909.04977. Cited by: §1.
 [23] (2019) Nasbench101: towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635. Cited by: §3.
 [24] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.1.
 [25] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1, §2.1.