RNAS: Architecture Ranking for Powerful Networks

Yixing Xu, Yunhe Wang, Kai Han, Hanting Chen, Yehui Tang
Shangling Jui, Chunjing Xu, Qi Tian, Chang Xu
Huawei Noah’s Ark Lab, Huawei Kirin Solution
Key Laboratory of Machine Perception (MOE), CMIC, School of EECS, Peking University, China
The University of Sydney, Darlington, NSW 2008, Australia
{yixing.xu, yunhe.wang, kai.han, jui.shangling, xuchunjing, tian.qi1}@huawei.com
{htchen, yhtang}@pku.edu.cn, c.xu@sydney.edu.au
Abstract

Neural Architecture Search (NAS) is attractive for automatically producing deep networks with excellent performance and acceptable computational costs. In most existing NAS algorithms, the performance of intermediate networks is represented by results evaluated on a small proxy dataset with insufficient training, in order to save computational resources and time. Although these representations can help us distinguish among some searched architectures, they are still far away from the exact performance or the ranking order of all networks sampled from the given search space. Therefore, we propose to learn a performance predictor that ranks different models during the search, using only a few networks pre-trained on the entire dataset. We represent each neural architecture as a feature tensor and use the predictor to further refine the representations of networks in the search space. The resulting performance predictor can be utilized for searching desired architectures without additional evaluation. Experimental results illustrate that we can use only a small fraction of the entire NASBench dataset (424 models) to construct an accurate predictor, which efficiently finds an architecture whose accuracy ranks near the top of the whole search space and is higher than that found by state-of-the-art methods.

1 Introduction

Convolutional Neural Networks (CNNs) have achieved state-of-the-art results in many real-world applications. However, most of these CNNs (e.g., VGG, ResNet, MobileNet) are designed by hand based on abundant expert experience, which demands considerable human effort. Therefore, it is attractive to design neural network architectures without human intervention. Neural Architecture Search (NAS) is a framework for automatically producing deep networks with excellent performance and low computational costs.

Based on different searching strategies and assumptions, a number of NAS algorithms have been proposed to increase the search speed and the performance of the resulting network, including discrete searching methods such as Evolutionary Algorithm (EA) based methods [9, 12, 14, 17, 18, 21] and Reinforcement Learning (RL) based methods [1, 13, 24, 25], and continuous searching methods such as DARTS [10] and CARS [22].

Nevertheless, the algorithms mentioned above mainly focus on designing the searching method, while the evaluation criterion used in various NAS algorithms has not been fully investigated. For instance, discrete searching methods utilize an early stop strategy and evaluate searched architectures on a relatively small dataset for the sake of efficiency. Continuous searching methods usually use a series of learnable parameters to control the selected layers or operations in deep neural networks, and these parameters are updated according to the intermediate performance of the super net. In fact, most of the evaluation criteria exploited in the above-mentioned schemes are inaccurate and prevent us from selecting the best neural network, as shown in Fig. 1 (detailed information can be found in Sec. 3.1). In addition, there is a recent consensus that the performance of the network selected by current NAS frameworks is similar to that of random search [15].

Figure 1: An illustration of the inaccurate evaluation criteria of recent NAS methods on the CIFAR-100 dataset. Five different networks are generated by multiplying the channel width of each layer of the standard VGG-small network by a consistent ratio. (a)-(e) Evaluation criteria used in most NAS methods: networks are trained on part of the CIFAR-100 dataset with an early stop strategy of 5, 10, 15, 20 and 200 epochs, respectively. (f) Ground-truth performance: networks are trained on the complete CIFAR-100 dataset until convergence.

Besides using the intermediate performance to represent the exact network performance on the entire dataset, Domhan et al. proposed a weighted probabilistic model to extrapolate the performance from the first part of a learning curve and speed up hyper-parameter search of CNNs [4]. Klein et al. used a Bayesian neural network to predict unobserved learning curves [7]. Such methods rely on Markov chain Monte Carlo (MCMC) sampling procedures and hand-crafted curve functions, which are computationally expensive. Deng et al. developed a unified way to encode individual layers into vectors and bring them together to form an integrated description via an LSTM, directly predicting the performance of a network architecture [3]. Sun et al. proposed an end-to-end offline performance predictor based on random forests [20]. However, the results of the above methods are still not accurate enough.

To make the search in NAS effective, we utilize the NASBench dataset to study the performance predictor. Specifically, we use general and fundamental features, e.g., the flops and parameters of each layer (or each node), to represent a specific architecture when training the predictor. Then, a pairwise ranking based loss function is used instead of an element-wise loss function such as the Mean Squared Error (MSE) loss or L1 loss, since keeping the correct ranking between different neural networks is more important than predicting their absolute performance for most searching methods. The experimental results show that the proposed predictor achieves a higher prediction performance than other state-of-the-art methods, and can efficiently find an architecture with near-top accuracy in the whole search space using only a small fraction of the dataset.

The rest of the paper is organized as follows: Sec. 2 introduces related works on NAS and state-of-the-art architecture evaluation methods. Sec. 3 starts with a toy experiment and then introduces the details of the proposed method. Several experiments conducted on the NASBench dataset are shown in Sec. 4, and finally Sec. 5 concludes the paper.

2 Related Works

In this section, we give a brief introduction to Neural Architecture Search (NAS) and network performance predictors.

2.1 Neural Architecture Search

Neural architecture search aims to automatically find the optimal model hyper-parameters or architectures for specific tasks under given constraints, such as higher accuracy and lower computational cost.

From the viewpoint of methodology, the latest NAS algorithms fall into two categories. The first is discrete search space based methods, such as Evolutionary Algorithm (EA) based methods [9, 12, 14, 17, 18, 21] and Reinforcement Learning (RL) based methods [1, 13, 24, 25]. In EA based methods, each individual neural network is regarded as an architectural component. Searching is performed through generations by using mutations and re-combinations of architectural components, and the components with better performance on the validation set are picked and inherited to the next generation during evolution. In RL based methods, the choice of different architectures is regarded as a sequence of actions, and the performance on the validation set is usually used as the reward. The second is differentiable searching methods such as DARTS [10, 11]. This kind of method allows efficient search of the architecture using gradient descent.

From the viewpoint of the searching target, NAS methods also fall into two categories: searching for the architecture of neural networks and searching for the hyper-parameters of neural networks [2, 5, 19, 8]. When searching for architectures, it is essential to design the search space in order to speed up the searching process: a high-quality search space influences not only the duration of the search but also the quality of the solution. In this situation, the hyper-parameters of the network are fixed to some empirical values. When searching for hyper-parameters, the focus is primarily on obtaining good optimization hyper-parameters for training a fixed network architecture.

Although recent NAS approaches can achieve promising performance based on well-designed super nets and search spaces, the performance predictors utilized in these methods are not accurate enough, which makes the searching procedure unstable.

2.2 Network Performance Predictor

There are limited works on predicting the performance of neural networks. One kind of method predicts the validation accuracy using part of the learning curve after training the architectures for several epochs. A series of validation accuracies is collected for each neural network configuration as the training data, and different regression models are used to predict the performance of a neural network. Domhan et al. used a weighted probabilistic model [4], Klein et al. used a Bayesian neural network [7], and Baker et al. proposed a sequential regression model to predict the validation accuracy [2]. Such methods rely heavily on the smoothness of the learning curve, and are not effective when an abrupt change occurs, e.g., a change of the learning rate, which frequently happens during neural network training. Moreover, to predict the performance of an architecture, it still needs to be trained for several epochs, which is time-consuming.

Another kind of method produces performance predictors without any training of the neural architectures, which can largely accelerate the search process. Deng et al. encoded each individual layer into a vector and brought them together to form an integrated description via an LSTM; a multi-layer perceptron (MLP) is then used to predict the final accuracy [3]. Istrate et al. proposed a performance predictor that estimates classification performance for unseen input datasets in fractions of a second, and is thus designed to transfer knowledge from familiar datasets to unseen ones [6]. Sun et al. proposed to use less training data with an end-to-end offline performance predictor based on random forests [20].

However, the features or representations of neural architectures used in these works are not well designed, and most of them adopt MSE based loss functions to guide the training of the predictors. Thus, an effective framework for ranking different neural networks and improving NAS algorithms is urgently required.

3 Problem Formulation

In this section, we first instantiate the problem of the evaluation criteria used in previous methods with a toy experiment. Then, we give an elaborate introduction of the proposed performance predictor. Specifically, designing a predictor involves three parts: encoding the network architecture into a feature, designing the deep regressor, and designing the loss. The first two parts received more attention in previous works, while the loss is usually the simple element-wise MSE loss (or L1 loss). In this paper, we utilize the NASBench dataset [23], propose our own way to encode neural network architectures into feature tensors, and design the regressor. Furthermore, we propose a pairwise ranking loss to optimize the regressor.

3.1 Toy Experiments

We begin the problem formulation of our method with a toy experiment. To save computational resources and time, a commonly used evaluation criterion in previous NAS methods is to train the model on part of the training dataset with an early stop strategy. The model is then tested on the testing dataset, and the intermediate accuracy is used to evaluate the performance of the model in the subsequent searching algorithm. However, lighter architectures often converge faster on a smaller dataset than cumbersome architectures, yet perform worse when using the whole training set.

Figure 2: An example of encoding a neural network architecture into a feature tensor. Column 1: The skeleton of the neural network architecture. Column 2: A specific cell architecture with 6 nodes. Column 3: The corresponding adjacency matrix, type vector, flop vector and parameter vector of the cell. Column 4: Padding the adjacency matrix to $7\times 7$ and the vectors accordingly. Note that the zero-padding is added at the penultimate row and column, since the last row and column represent the output node. Column 5: The vectors are broadcast into matrices, and an element-wise multiplication with the adjacency matrix is made to get the type matrix, flop matrix and parameter matrix. Column 6: There are 9 cells in the network, thus producing 9 different flop matrices and parameter matrices. All the cells share the same type matrix. We concatenate all the matrices to get the final tensor.

As shown in Fig. 1, we construct different models based on the VGG-small network by multiplying the channel width of each layer by a consistent ratio. The ground-truth performance of these networks is derived by training them on the complete CIFAR-100 training dataset for 200 epochs and then validating on the CIFAR-100 test dataset (Fig. 1 (f)). Besides, different intermediate accuracies are derived by training the networks on part of the CIFAR-100 training dataset for different numbers of epochs and then validating on the CIFAR-100 test dataset (Fig. 1 (a)-(e)). The results show that the network with the best ground-truth performance does not perform well under this proxy setting, no matter how many epochs it is trained for. Thus, the intermediate accuracy used in previous methods is not accurate enough to evaluate the true performance of a given network architecture.

3.2 Feature Tensor of NASBench dataset

In this section, we give a brief introduction to the NASBench dataset, and propose our own way to encode the architectures in the NASBench dataset into feature tensors.

The NASBench dataset is a recently proposed dataset for neural architecture search. It contains over 423k unique CNN architectures and their accuracies when trained on the CIFAR-10 dataset. The architectures share the same skeleton, as shown in column 1 of Fig. 2. The search space is restricted to small feed-forward structures called cells. Each cell is stacked 3 times, followed by a downsampling layer which halves the height and width and doubles the channels. This 'stack-downsample' pattern is repeated 3 times, followed by global average pooling and a dense layer. The first layer is fixed as a $3\times 3$ convolution layer with 128 output channels. Different cells produce different CNN architectures. Each cell contains no more than 7 nodes, in which the input and output nodes are fixed to represent the input and output tensors of the cell, respectively. The other nodes are randomly selected from 3 different operations: $1\times 1$ convolution, $3\times 3$ convolution and $3\times 3$ max-pooling. The number of edges is limited to no more than 9.

Encoding a neural architecture is important for a predictor to predict its performance. Peephole [3] chose the layer type, kernel width, kernel height and channel number as the representation of each layer. E2EPP [20] forced the network architecture to be composed of DenseNet blocks, ResNet blocks and pooling blocks, and generated features based on these blocks.

However, those features are not strong enough to encode a network architecture. Different from the methods mentioned above, we focus on generating more general and fundamental features, e.g., the flops and parameters of each layer (or each node). Note that a network architecture in the NASBench dataset is completely determined by the architecture of the corresponding cell. The cell can be represented by a 0-1 adjacency matrix $A \in \{0,1\}^{N\times N}$ and a type vector $t \in \mathbb{R}^{N}$ (with 5 different node types: input, $1\times 1$ convolution, $3\times 3$ convolution, $3\times 3$ max-pooling and output), in which $N$ is the number of nodes. Furthermore, we calculate the flops and parameters of each node and derive a flop vector $f \in \mathbb{R}^{N}$ (assuming the input image size is $32\times 32$) and a parameter vector $p \in \mathbb{R}^{N}$.
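As an illustration of how such per-node statistics could be computed, the following sketch counts flops and parameters for the three cell operations under simplifying assumptions (fixed spatial size and fixed input/output channels for each node); the actual channel bookkeeping inside NASBench cells is more involved and is not part of this sketch.

```python
# Sketch: per-node flop/parameter features for the three NASBench cell operations.
# Assumes a fixed feature-map size and fixed in/out channel counts for the node;
# the real NASBench cells split channels among nodes, which is omitted here.

def node_features(op, h, w, c_in, c_out):
    """Return (flops, params) for a single node; a multiply-add is counted as 2 flops."""
    if op == "conv3x3":
        k = 3
    elif op == "conv1x1":
        k = 1
    elif op == "maxpool3x3":
        return h * w * c_in * 9, 0           # comparisons only, no learnable parameters
    else:                                     # input / output nodes carry no computation
        return 0, 0
    params = k * k * c_in * c_out + c_out     # convolution weights + bias
    flops = 2 * h * w * k * k * c_in * c_out  # multiply-adds over the whole feature map
    return flops, params

# Example: a 3x3 convolution node on a 32x32 feature map with 128 channels.
print(node_features("conv3x3", 32, 32, 128, 128))
```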

Specifically, the adjacency matrix can always be represented as an upper triangular matrix. Given a cell architecture as a directed acyclic graph (DAG) with $N$ nodes, the input node is always the root node $v_1$. The depth of a node $v_i$ can be defined as the maximum distance between $v_1$ and $v_i$ (there may exist multiple paths from $v_1$ to $v_i$). Thus, by sorting the nodes according to their depth, the adjacency matrix becomes an upper triangular matrix.
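A small sketch of this depth-based reordering is given below, assuming the cell is provided as a 0-1 adjacency matrix with the input node at index 0; the relaxation loop and the stable tie-breaking are implementation choices, not part of the original method.

```python
import numpy as np

def sort_by_depth(adj):
    """Sort the nodes of a DAG by depth (longest distance from the input node) so that
    the permuted adjacency matrix becomes upper triangular (a sketch)."""
    n = adj.shape[0]
    depth = np.zeros(n, dtype=int)
    # Relax longest-path distances; n passes suffice for a DAG with n nodes.
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                if adj[i, j] and depth[j] < depth[i] + 1:
                    depth[j] = depth[i] + 1
    order = np.argsort(depth, kind="stable")  # stable sort keeps the original tie order
    return adj[np.ix_(order, order)], order
```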

Note that $N \le 7$ and it differs across architectures. Thus, we pad the adjacency matrix with zeros so that its size is fixed at $7\times 7$. The type vector $t$, flop vector $f$ and parameter vector $p$ are padded accordingly. Note that the input and output nodes should remain the first and last nodes, so the zero-padding is added at the penultimate row and column each time (see Fig. 2). After that, we broadcast the vectors into matrices and make an element-wise multiplication with the adjacency matrix to get the type matrix $T$, flop matrix $F$ and parameter matrix $P$. Note that there are 9 cells in a network architecture. Thus, there are 9 different flop matrices and 9 different parameter matrices, and all cells share the same type matrix. We concatenate them together to get a $19\times 7\times 7$ tensor that represents a specific architecture in the NASBench dataset. The detailed implementation is illustrated in Fig. 2.
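The padding and broadcasting step described above could look roughly as follows. This is a minimal sketch assuming the adjacency matrix is already in upper-triangular form with the output node last; broadcasting the node vectors along the rows is one plausible reading of the description.

```python
import numpy as np

MAX_NODES = 7  # NASBench cells contain at most 7 nodes

def encode_cell(adj, types, flops, params):
    """Pad an N-node cell (N <= 7) and broadcast the node vectors into matrices.

    adj:    (N, N) 0-1 upper-triangular adjacency matrix, output node last
    types, flops, params: (N,) per-node vectors
    Returns (type_matrix, flop_matrix, param_matrix), each of shape (7, 7).
    """
    n = adj.shape[0]
    pad = MAX_NODES - n
    # Zero rows/columns are inserted before the last row/column so the output node stays last.
    a = np.zeros((MAX_NODES, MAX_NODES))
    a[:n - 1, :n - 1] = adj[:n - 1, :n - 1]
    a[:n - 1, -1] = adj[:n - 1, -1]

    def pad_vec(v):
        v = np.asarray(v, dtype=float)
        return np.concatenate([v[:n - 1], np.zeros(pad), v[n - 1:]])

    # Broadcast each padded vector along the rows and mask it with the adjacency matrix.
    t, f, p = (pad_vec(v) for v in (types, flops, params))
    return a * t, a * f, a * p

# A network has 9 cells sharing one type matrix: stacking that type matrix with the
# 9 flop matrices and 9 parameter matrices gives the 19 x 7 x 7 tensor described above.
```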

3.3 Architecture Performance Predictor

Given the feature tensor mentioned above, we propose the architecture performance predictor and introduce the ranking based loss function in this section.

CNNs have shown promising performance on tensor-like inputs. In practice, there is usually limited training data for the predictor, due to the massive time and resources spent on training even a single neural architecture. Thus, we use a simple modified LeNet-5 architecture (as shown in Table 1) to predict the final accuracy of a given network architecture tensor, in order to prevent over-fitting. Batch normalization (BN) is used after each convolutional layer, and the ReLU activation function is used after each BN layer and FC layer.

Layer name output size parameters
conv1
conv2
fc1 120
fc2 84
fc3 1
Table 1: Architecture of the predictor.
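Since the exact convolutional layer settings are not listed above, the following PyTorch sketch shows one plausible LeNet-5-style regressor consistent with Table 1: the FC widths 120, 84 and 1 follow the table, while the convolution kernel sizes, channel counts and the use of global average pooling are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """LeNet-5-style regressor over the 19 x 7 x 7 architecture tensor (a sketch;
    conv widths and kernel sizes are assumptions, only the FC widths follow Table 1)."""
    def __init__(self, in_channels=19):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc1 = nn.Linear(64, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 1)
        self.relu = nn.ReLU()

    def forward(self, x, return_feature=False):
        h = self.features(x).flatten(1)
        h = self.relu(self.fc1(h))
        feat = self.relu(self.fc2(h))     # feature used by the continuity loss (Eq. 4)
        out = self.fc3(feat).squeeze(-1)  # predicted accuracy; no activation on the output
        return (out, feat) if return_feature else out
```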

In order to train the predictor, a commonly used loss function is the element-wise MSE or L1 loss [3, 6, 20]. Specifically, let $\hat{y} = \{\hat{y}_1, \dots, \hat{y}_n\}$ be the outputs of the predictor, in which $n$ is the number of samples, and let the ground-truth accuracies be $y = \{y_1, \dots, y_n\}$. Previous works optimize the predictor by using the MSE loss function:

$$\mathcal{L}_{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \qquad (1)$$

or L1 loss function:

$$\mathcal{L}_{L1} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|. \qquad (2)$$

The above loss functions focus on fitting the absolute accuracy of a single network, and assume that a lower MSE or L1 loss leads to a better ranking result. However, this is not always the case, as shown in [15]. We believe that directly focusing on the ranking of the predicted accuracies between different architectures is more important than fitting their absolute values when applying the network performance predictor to different searching methods. Thus, we apply a pairwise ranking based loss function to the predictor.
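A tiny numerical illustration of this point, with hypothetical predictor outputs, is shown below: one predictor attains a much lower MSE yet ranks the two architectures in the wrong order.

```python
# Two hypothetical predictors score two architectures with true accuracies 0.90 and 0.92.
true = [0.90, 0.92]
pred_a = [0.915, 0.905]   # MSE = 2.25e-4, but the predicted order is wrong
pred_b = [0.850, 0.870]   # MSE = 2.50e-3 (an order of magnitude larger), order is correct

def mse(pred):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

print(mse(pred_a) < mse(pred_b))                        # True:  A "wins" on MSE
print((pred_a[0] < pred_a[1]) == (true[0] < true[1]))   # False: A ranks them incorrectly
print((pred_b[0] < pred_b[1]) == (true[0] < true[1]))   # True:  B ranks them correctly
```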

Besides utilizing the final output of the predictor, we believe that the feature extracted before the final FC layer is also useful. Note that continuity is a common assumption in machine learning, i.e., the performance changes continuously along the feature space. However, this is not the case for the raw network architecture, in which a slight change of the architecture may lead to a radical change of the performance (e.g., adding or removing a skip connection). Thus, we consider learning a feature with the property of continuity. Therefore, the objectives of the proposed predictor are twofold: the first is to predict accuracies with the correct ranking, and the second is to generate features with the property of continuity.

Specifically, let $\hat{y} = \{\hat{y}_1, \dots, \hat{y}_n\}$ be the outputs of the predictor, which are the objects to be ranked, and let the ground-truth accuracies be $y = \{y_1, \dots, y_n\}$. We define the pairwise ranking based loss function as:

$$\mathcal{L}_{rank} = \sum_{i=1}^{n}\sum_{j=i+1}^{n} \phi\left((\hat{y}_i - \hat{y}_j)\cdot \operatorname{sign}(y_i - y_j)\right), \qquad (3)$$

in which $\phi(z) = \max(0, m - z)$ is the hinge function with margin $m$. Note that, given a pair of examples, the loss is 0 only when the examples are in the right order and their predicted scores differ by at least the margin. Other functions such as the logistic function and the exponential function can also be applied here.

In order to generate features with continuity, consider the triplet $(x_i, y_i, f_i)$, in which $f_i$ is the feature of architecture $x_i$ generated before the final FC layer. The Euclidean distance between two features is computed as $d_{i,j} = \|f_i - f_j\|_2$, and the difference of the performance between two architectures is simply computed as $|y_i - y_j|$. Thus, we encourage the property of continuity by defining the loss function as:

$$\mathcal{L}_{cont} = \sum_{i=1}^{n}\sum_{j \ne i}\sum_{k \ne i} \phi\left((d_{i,j} - d_{i,k})\cdot \operatorname{sign}(|y_i - y_j| - |y_i - y_k|)\right), \qquad (4)$$

Note that although Eq. 3 and Eq. 4 are similar in form, the purposes behind them are quite different. Given the equations above, the final loss function is their combination:

$$\mathcal{L} = \mathcal{L}_{rank} + \lambda \mathcal{L}_{cont}, \qquad (5)$$

in which $\lambda$ is a hyper-parameter that controls the relative importance of the two loss functions.
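A sketch of how Eq. 3, Eq. 4 and their combination in Eq. 5 could be implemented for a mini-batch is given below; the margin and the weight lambda are hyper-parameters, and the exact form of the continuity term follows our reading of the description above rather than a reference implementation.

```python
import torch

def ranking_loss(pred, target, margin=0.1):
    """Pairwise hinge ranking loss (Eq. 3 sketch): penalize pairs predicted in the wrong order."""
    diff_pred = pred.unsqueeze(1) - pred.unsqueeze(0)          # [i, j] = y_hat_i - y_hat_j
    sign_true = torch.sign(target.unsqueeze(1) - target.unsqueeze(0))
    hinge = torch.clamp(margin - diff_pred * sign_true, min=0)
    mask = torch.triu(torch.ones_like(hinge), diagonal=1)      # use each pair (i < j) once
    return (hinge * mask).sum() / mask.sum()

def continuity_loss(feat, target, margin=0.1):
    """Continuity loss (Eq. 4 sketch): feature distances should be ordered like accuracy gaps."""
    dist = torch.cdist(feat, feat)                             # d_ij = ||f_i - f_j||_2
    gap = (target.unsqueeze(1) - target.unsqueeze(0)).abs()    # |y_i - y_j|
    d_diff = dist.unsqueeze(2) - dist.unsqueeze(1)             # [i, j, k] = d_ij - d_ik
    g_sign = torch.sign(gap.unsqueeze(2) - gap.unsqueeze(1))   # sign(|y_i - y_j| - |y_i - y_k|)
    # Degenerate triplets (repeated indices) only add a constant and do not affect gradients.
    return torch.clamp(margin - d_diff * g_sign, min=0).mean()

def total_loss(pred, feat, target, lam=0.5):
    """Combined objective (Eq. 5 sketch)."""
    return ranking_loss(pred, target) + lam * continuity_loss(feat, target)
```

Here `pred` and `feat` stand for the predicted accuracies and the features before the final FC layer, e.g., the pair returned by the predictor sketch above when called with `return_feature=True`.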

Finally, the performance predictor is integrated into searching algorithms (an EA based searching method in the following experiments). An individual is fed into the predictor, and the output of the predictor is treated as the fitness of the model in the EA method, which takes only milliseconds. The whole process is detailed in Algorithm 1.


Algorithm 1 Training the predictor and integrating it into the EA algorithm
Input: Search space $\mathcal{S}$; training set $\{(a_i, y_i)\}_{i=1}^{n}$ with $n$ samples, in which $a_i$ is a network architecture and $y_i$ is its classification performance on a particular dataset.
Training:
1:  Initialize the parameters of the performance predictor $P$;
2:  Encode each architecture $a_i$ into a tensor $T_i$;
3:  Feed $T_i$ into $P$ and get the output $\hat{y}_i$;
4:  Calculate the pairwise ranking based loss function using Eq. 5;
5:  Update the parameters of $P$ using SGD;
Searching:
6:  Generate a population of $K$ individuals $\{b_1, \dots, b_K\}$, in which each $b_k$ is a randomly generated network architecture from the search space $\mathcal{S}$ and $K$ is the population size;
7:  for each generation do
8:      Encode each individual $b_k$ into a tensor $T_k$;
9:      Feed $T_k$ into the predictor $P$ and use the output as the fitness of the individual;
10:     Use the fitness to select, cross over and mutate within the current population and generate the next population;
11: end for
Output: The best individual $b^{*}$.
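The sketch below shows how the predictor could act as the fitness function inside a very simplified EA loop. The helpers `arch_to_tensor`, `sample_arch` and `mutate`, as well as the generation and population counts, are hypothetical placeholders standing in for the encoding of Sec. 3.2 and the EA operators of Algorithm 1.

```python
import random
import torch

def predictor_fitness(predictor, architectures, arch_to_tensor):
    """Score a population with the trained predictor instead of training each network.
    `arch_to_tensor` is assumed to return the 19 x 7 x 7 encoding as a torch tensor."""
    with torch.no_grad():
        batch = torch.stack([arch_to_tensor(a) for a in architectures])
        return predictor(batch).tolist()

def evolve(predictor, arch_to_tensor, sample_arch, mutate,
           generations=100, pop_size=50, keep=10):
    """Minimal EA loop: the predictor output is used directly as the fitness."""
    population = [sample_arch() for _ in range(pop_size)]
    for _ in range(generations):
        fitness = predictor_fitness(predictor, population, arch_to_tensor)
        ranked = [a for _, a in sorted(zip(fitness, population), key=lambda t: -t[0])]
        parents = ranked[:keep]                                    # selection
        children = [mutate(random.choice(parents)) for _ in range(pop_size - keep)]
        population = parents + children                            # next generation
    fitness = predictor_fitness(predictor, population, arch_to_tensor)
    return max(zip(fitness, population), key=lambda t: t[0])[1]    # best individual
```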

4 Experiments

In this section, we conduct several experiments to verify the effectiveness of the proposed network performance predictor. After that, the best CNN architecture is found by embedding the predictor into the EA algorithm, and it is compared with architectures found by other state-of-the-art predictors to verify its performance.

The parameter settings for training the predictor and searching for the best architecture are detailed below. When training the predictor, we used the Adam algorithm to train the LeNet architecture with an initial learning rate of , a weight decay of , and a batch size of , trained for epochs. When searching for the best CNN architecture, the search space is the same as in the NASBench dataset. We set the maximum numbers of nodes and edges in each cell to 7 and 9, respectively, and each node is randomly chosen from $1\times 1$ convolution, $3\times 3$ convolution and $3\times 3$ max-pooling. We set the maximum number of generations to and the population size to . The probabilities for selection, crossover and mutation are set to , and , respectively.

Peephole [3] 0.4556 0.4769 0.4963 0.4977 0.4972 0.4975 0.4951
E2EPP [20] 0.5038 0.6734 0.7009 0.6997 0.7011 0.6992 0.6997
Proposed v1 (type matrix + MSE) 0.3465 0.5911 0.7914 0.8229 0.8277 0.8344 0.8350
Proposed v2 (tensor + MSE) 0.4856 0.6090 0.8103 0.8430 0.8399 0.8504 0.8431
Proposed v3 (type matrix + pairwise) 0.6039 0.7943 0.8752 0.8894 0.8949 0.8976 0.8997
Proposed v4 (tensor + pairwise) 0.6335 0.8136 0.8762 0.8900 0.8957 0.8979 0.8995
Table 2: The Kendall’s Tau (KTau) of Peephole, E2EPP and the proposed algorithms on the NASBench dataset with different proportions of training samples.

4.1 Predictor Performance Comparison

We compared the proposed predictor with the methods introduced in Peephole [3] and E2EPP [20]. The NASBench dataset is selected as the training and testing sets of the predictors.

Recall that one of the fundamental ideas of our proposed method is that the ranking of the predicted values is more important than their absolute values when embedding the predictor into different searching methods. Thus, for the quantitative comparison, we use Kendall's Tau (KTau) [16] as the indicator. KTau measures the correlation between the ranking of the predicted values and the ranking of the actual values, which is suitable for judging the quality of predicted rankings. KTau ranges from $-1$ to $1$, and a higher value indicates a better ranking.
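For reference, the KTau indicator can be computed directly with SciPy, as in the short sketch below; the two accuracy lists are made-up values used only to show the call.

```python
# Sketch: computing Kendall's Tau between predicted and ground-truth accuracies.
from scipy.stats import kendalltau

predicted = [0.91, 0.87, 0.93, 0.89]      # hypothetical predictor outputs
ground_truth = [0.92, 0.88, 0.94, 0.86]   # hypothetical true accuracies
ktau, _ = kendalltau(predicted, ground_truth)
print(f"KTau = {ktau:.4f}")               # 1.0: identical rankings, -1.0: reversed rankings
```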

In order to clearly examine the influence of using the fundamental flop and parameter features, and the influence of using the pairwise loss, we evaluate the following versions of the proposed method by fixing the predictor as LeNet and varying the feature encoding method and the loss function:

  • proposed v1: Using only the type matrix as feature and MSE loss function.

  • proposed v2: Using the proposed feature encoding method (tensor feature) and MSE loss function.

  • proposed v3: Using only the type matrix as feature and pairwise loss function.

  • proposed v4: Using the proposed feature encoding method (tensor feature) and pairwise loss function.

Note that the search spaces of the NASBench dataset and E2EPP are different from each other, and the encoding method proposed in E2EPP cannot be used directly on the NASBench dataset. In order to apply the NASBench dataset to E2EPP, we build a surrogate for E2EPP and use the feature encoding method proposed in the previous section instead of the original encoding method of E2EPP. The other parts remain unchanged.

The experimental results are shown in Table 2. Considering that the training samples can only cover a small proportion of the search space in practice, we focus on the second column, where only a small proportion of the NASBench dataset is used as the training set; the other proportions are reported for completeness. The results show that the proposed encoding method utilizing flops and parameters represents an architecture better than encoding without them: the KTau indicator increases by about 0.14 when using the MSE loss and by about 0.03 when using the pairwise loss. When using the pairwise loss instead of the element-wise MSE loss, the KTau indicator increases by about 0.26 when using only the type matrix as the feature, and by about 0.15 when using the proposed tensor feature. This means that the pairwise loss is better than the MSE loss at ranking, regardless of the input feature.

Figure 3: The predicted ranking versus the true ranking of (a) Peephole, (b) E2EPP and (c) the proposed method on the NASBench dataset. 1000 models are randomly selected for exhibition purposes. The x-axis denotes the true ranking, and the y-axis denotes the corresponding predicted ranking.

Compared to the other state-of-the-art methods, Peephole uses the kernel size and channel number as features in addition to the layer (node) type, and shows better results than the proposed v1 method, which uses only the layer (node) type as the feature. However, it performs worse than the proposed v2 method, which uses all the proposed features; this again shows the benefit of using the fundamental flops and parameters of each layer as features. E2EPP uses a random forest as the predictor, which has advantages only when the training samples are extremely rare. In almost all circumstances, the proposed method with the pairwise loss achieves the best KTau performance.

A qualitative comparison on the NASBench dataset is shown in Fig. 3. We plot the results of the trained predictors: the x-axis of each point represents its true ranking among all the points, and the y-axis denotes the corresponding predicted ranking. The results show that the predicted ranking made by the proposed method is notably better than that of the other state-of-the-art methods.

4.2 Architecture Search Results

When searching for the best CNN architecture, the size of the training set of the predictor should be limited. This is because the search space of the EA algorithm is the same as that of the NASBench dataset, and we cannot prevent the EA algorithm from revisiting architectures in the training set. Thus, in order to reduce the influence of the training set, we used only a small portion of the NASBench dataset as training samples for the predictor, which was subsequently used in the EA algorithm. The final performance on the CIFAR-10 dataset of the best architectures searched by the EA algorithm with the proposed predictor and with the peer competitors mentioned above is shown in Table 3. Specifically, the best performance among the top-10 architectures selected by the EA algorithm with each predictor is reported, and the experiments are repeated 20 times with different random seeds to alleviate randomness.

Method    accuracy (%)    ranking (%)
Peephole [3]    90.99 ± 0.61    43.22
E2EPP [20]    93.47 ± 0.44    1.23
Proposed v1    91.36 ± 0.27    35.97
Proposed v2    93.03 ± 0.21    6.09
Proposed v3    93.43 ± 0.26    1.50
Proposed v4    93.90 ± 0.21    0.04
Table 3: The classification accuracy (%) on the CIFAR-10 dataset and the true ranking (%) among the different architectures in the NASBench dataset, using the EA algorithm with the proposed predictor and the peer competitors. Predictors are trained with samples randomly selected from the NASBench dataset.

The second column reports the classification accuracies of the selected models on the CIFAR-10 test set, and the third column reports the true ranking of the selected models among all the different models in the NASBench dataset. The proposed method outperforms the other competitors, and finds a network architecture whose performance ranks near the top of the search space using only a small fraction of the dataset. Achieving good performance with so little training data is reasonable for two reasons. The first is that the fundamental features of flops and parameters represent the architecture well, and the tensor-like input is suitable for a CNN. The second is that using the pairwise loss expands the training set to some extent: given $n$ individuals, there are on the order of $n^2$ pairs and $n^3$ triplets available for training.
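For instance, taking the 424 pre-trained models mentioned in the abstract as the training set, the numbers of distinct (unordered) pairs and triplets are:

```latex
\binom{424}{2} = \frac{424\cdot 423}{2} = 89{,}676 \ \text{pairs}, \qquad
\binom{424}{3} = \frac{424\cdot 423\cdot 422}{6} = 12{,}614{,}424 \approx 1.3\times 10^{7} \ \text{triplets}.
```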

Note that when using a performance predictor in practice, the search space is often different from the NASBench dataset, which means the training samples need to be collected from scratch. Thus, we give some intuitions on selecting model architectures from the search space as training samples. Training samples are selected from the NASBench dataset by three methods: random selection, selection by parameters and selection by flops. When selecting by parameters (flops), all samples are sorted by their total number of parameters (flops) and selected uniformly. Different predictors are trained with the different training samples using the proposed method, and are further integrated into the EA algorithm for searching. The performance of the best architectures is shown in Table 4.

Method    accuracy (%)    ranking (%)
random selection    93.90 ± 0.21    0.04
select by parameters    93.84 ± 0.21    0.08
select by flops    93.76 ± 0.13    0.16
Table 4: The classification accuracy (%) on the CIFAR-10 dataset and the ranking (%) among the different architectures in the NASBench dataset, using predictors trained with samples selected from the NASBench dataset by different selection methods.
Figure 4: The best architectures found by the predictor with different ratios of training samples.

The results show that random selection performs best. A possible reason is that architectures with similar numbers of parameters (flops) perform diversely, so uniformly selected architectures cannot represent the true performance distribution of architectures with similar parameters (flops). Thus, random selection is our choice, and is worth trying when generating training samples from a search space in practice. The best cell architecture searched by the EA algorithm using the proposed predictor trained with randomly selected training samples is shown in column 2 of Fig. 4.

In the following, we give an intuitive presentation of the best architectures selected by the performance predictor with different numbers of training samples, as shown in Fig. 4.

Note that the rank-1 architecture in the NASBench dataset cannot be selected by the predictor even when using a large portion of the training data. This is because, when using the pairwise ranking based loss function, the number of training pairs grows quadratically, and it is very inefficient to train on all of them in a single batch. Thus, a mini-batch updating method is used and a single architecture is compared with only a limited number of architectures in one epoch, which causes a lack of global ranking information about this architecture, especially when the number of training samples is large. In fact, the mini-batch size is set to 1024 in the experiment, which is a compromise between effectiveness and efficiency.

For the same reason, the performance of the architecture found by the predictor does not improve monotonically with the proportion of training data. Specifically, we divide the settings into two parts: the first part uses the smaller proportions of the dataset, and the second part uses the rest. In the first part, the number of training samples is on the same order of magnitude as the mini-batch size (1024), so the global ranking information of a single model is easy to obtain, and the performance becomes better when there are more training data. In the second part, the number of training samples is significantly larger than the mini-batch size. On the one hand, increasing the number of samples helps training; on the other hand, the global ranking information is harder to obtain. Thus, the performance is not guaranteed to be better when using more training samples.

Finally, there are some common characteristics among these architectures. The first is that the distance between the input node and the output node is at most 2, which shows the significance of skip connections. The second is that the same operation appears in every one of these architectures. Based on these observations, we divide the NASBench dataset according to the distance between the input node and the output node, and according to whether this operation is used. Some statistics are shown in Table 5.

Op used    Distance    #Models    Best acc    Average acc
yes 1 68552 94.32 91.97
2 153056 94.05 91.02
3 110863 93.68 89.31
4 27227 92.36 87.40
5 2516 90.54 86.51
6 211 88.87 84.91
no 1 12468 91.62 88.40
2 26282 90.81 86.69
3 17735 90.24 83.53
4 4282 88.95 80.20
5 400 88.16 78.84
6 32 86.71 74.93
Table 5: Statistics on the NASBench dataset. 'Op used' refers to whether the model uses the operation mentioned above. 'Distance' refers to the distance between the input node and the output node. '#Models' refers to the number of models. 'Best acc' refers to the performance of the best architecture among these models on the CIFAR-10 dataset. 'Average acc' refers to their average performance on the CIFAR-10 dataset.

The table shows that the shorter the distance between the input node and the output node, the better the performance. Besides, the operation helps the architecture to perform better. Based on these observations, we may form a better search space for the NASBench dataset by using only the 68552 models that contain this operation and a skip connection between the input node and the output node. An experiment of training and evaluating the performance predictor is conducted on this sub-dataset, and the results in Table 6 show that the predictor trained and evaluated on the sub-dataset performs better than the previous one. This indicates that a better search space helps to produce a better performance predictor.

Datasets    accuracy (%)    ranking (%)
whole dataset    93.90 ± 0.21    0.04
sub-dataset    94.02 ± 0.14    0.01
Table 6: Predictors trained and evaluated with the whole NASBench dataset and with the sub-dataset. The experiments are repeated 20 times to alleviate the randomness of the results.

5 Conclusion

We proposed a new method for predicting network performance based on its architecture before training. We encode an architecture in the NASBench dataset into a tensor utilizing flops and parameters, which are general and fundamental quantities for representing an architecture. A pairwise ranking based loss function is used for the performance predictor instead of an element-wise loss function, since the ranking between different architectures is more important than their absolute performance values in various searching methods. We then embed the proposed predictor into an EA based method and search for the best architecture. Several experiments conducted on the NASBench dataset show the superiority of the proposed predictor in sorting the performance of different architectures and in finding an architecture with near-top performance in the search space using only a small fraction of the dataset. Finally, a better sub search space is generated by utilizing the common statistics of the best architectures found by the predictor, and this sub-space helps to produce a better predictor.

References

  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §1, §2.1.
  • [2] B. Baker, O. Gupta, R. Raskar, and N. Naik (2017) Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823. Cited by: §2.1, §2.2.
  • [3] B. Deng, J. Yan, and D. Lin (2017) Peephole: predicting network performance before training. arXiv preprint arXiv:1712.03351. Cited by: §1, §2.2, §3.2, §3.3, §4.1, Table 2, Table 3.
  • [4] T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §1, §2.2.
  • [5] F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §2.1.
  • [6] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. C. I. Malossi (2018) Tapas: train-less accuracy predictor for architecture search. arXiv preprint arXiv:1806.00250. Cited by: §2.2, §3.3.
  • [7] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter (2016) Learning curve prediction with bayesian neural networks. Cited by: §1, §2.2.
  • [8] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2016) Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560. Cited by: §2.1.
  • [9] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §1, §2.1.
  • [10] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.1.
  • [11] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in neural information processing systems, pp. 7816–7827. Cited by: §2.1.
  • [12] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §1, §2.1.
  • [13] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1, §2.1.
  • [14] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548. Cited by: §1, §2.1.
  • [15] C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann (2019) Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142. Cited by: §1, §3.3.
  • [16] P. K. Sen (1968) Estimates of the regression coefficient based on kendall’s tau. Journal of the American statistical association 63 (324), pp. 1379–1389. Cited by: §4.1.
  • [17] M. Shen, K. Han, C. Xu, and Y. Wang (2019) Searching for accurate binary neural architectures. arXiv preprint arXiv:1909.07378. Cited by: §1.
  • [18] H. Shu, Y. Wang, X. Jia, K. Han, H. Chen, C. Xu, Q. Tian, and C. Xu (2019) Co-evolutionary compression for unpaired image translation. arXiv preprint arXiv:1907.10804. Cited by: §1.
  • [19] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §2.1.
  • [20] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang (2019) Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor. IEEE Transactions on Evolutionary Computation. Cited by: §1, §2.2, §3.2, §3.3, §4.1, Table 2, Table 3.
  • [21] Y. Wang, C. Xu, J. Qiu, C. Xu, and D. Tao (2018) Towards evolutionary compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2476–2485. Cited by: §1.
  • [22] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian, and C. Xu (2019) CARS: continuous evolution for efficient neural architecture search. arXiv preprint arXiv:1909.04977. Cited by: §1.
  • [23] C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter (2019) Nas-bench-101: towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635. Cited by: §3.
  • [24] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.1.
  • [25] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1, §2.1.