Efficient Neural Architecture Search
with Network Morphism
While neural architecture search (NAS) has drawn increasing attention for automatically tuning deep neural networks, existing search algorithms usually suffer from expensive computational cost. Network morphism, which keeps the functionality of a neural network while changing its architecture, could help NAS by enabling more efficient training during the search. However, network morphism based NAS is still computationally expensive due to the inefficient process of selecting the proper morph operation for existing architectures. Bayesian optimization has been widely used to optimize functions based on a limited number of observations, which motivates us to explore whether it can accelerate the morph operation selection process. In this paper, we propose a novel framework enabling Bayesian optimization to guide the network morphism for efficient neural architecture search by introducing a neural network kernel and a tree-structured acquisition function optimization algorithm. With Bayesian optimization selecting the network morphism operations, the exploration of the search space is more efficient. Moreover, we carefully wrapped our method into an open-source software package, namely Auto-Keras (the code is available at http://autokeras.com), for people without a rich machine learning background to use. Intensive experiments on real-world datasets demonstrate the superior performance of the developed framework over the state-of-the-art baseline methods.
Department of Computer Science and Engineering
Texas A&M University
Neural Architecture Search; Bayesian Optimization; Network Morphism; Gaussian Process; Kernel Methods
Neural architecture search (NAS), which aims to search for the best neural network architecture given a learning task, has become an effective computational tool in automated machine learning (AutoML). Unfortunately, existing NAS algorithms are usually computationally expensive. The time complexity of NAS can be roughly computed as $O(n\bar{t})$, where $n$ is the number of neural architectures evaluated during the search, and $\bar{t}$ is the average time consumption for evaluating each of the $n$ neural networks. Many NAS approaches, such as deep reinforcement learning [?, ?, ?, ?] and evolutionary algorithms [?, ?, ?, ?, ?, ?], require a large $n$ to reach good performance. Also, each of the $n$ neural networks is trained from scratch, which is very slow.
Network morphism has been successfully applied to neural architecture search [?, ?]. Network morphism is a technique to morph the architecture of a neural network while keeping its functionality [?, ?]. Therefore, we are able to modify a trained neural network into a new architecture using the network morphism operations, e.g., inserting a layer or adding a skip-connection. Only a few more epochs are required to further train the new architecture for better performance. Using network morphism would reduce the average training time in neural architecture search. The most important problem to solve for network morphism based NAS methods is the selection of operations, i.e., selecting from the network morphism operation set to morph an existing architecture into a new one. The state-of-the-art network morphism based method [?] uses a deep reinforcement learning controller, which requires a large number of training examples, i.e., a large number $n$ of evaluated architectures in the $O(n\bar{t})$ search cost. Another simple approach [?] is to use a random algorithm and hill-climbing, which can only explore the neighborhoods of the searched area each time, and could potentially be trapped in a local optimum.
Bayesian optimization has been widely adopted for finding the optimum value of a function based on a limited number of observations. It is usually used to find the optimum point of a black-box function, whose observations are expensive to obtain. For example, it has been used in hyperparameter tuning for machine learning models [?, ?, ?, ?], each observation of which involves the training and testing of a machine learning model, which is very similar to the NAS problem. The unique properties of Bayesian optimization motivate us to explore its capability in guiding the network morphism to reduce the number of trained neural networks to make the search more efficient.
It is a non-trivial task to design a Bayesian optimization method for network morphism based neural architecture search due to the following challenges. First, the underlying Gaussian process (GP) is traditionally used for Euclidean space. To update the Bayesian optimization model with observations, the underlying GP is to be trained with the searched architectures and their performance. However, neural network architectures are not in Euclidean space and are hard to parameterize into a fixed-length vector. Second, an acquisition function needs to be optimized for Bayesian optimization to generate the next architecture to observe. However, for morphing the neural architectures, the task is not to maximize a function in Euclidean space, but to select a node to expand in a tree-structured search space, where each node represents an architecture and each edge a morph operation. The traditional Newton-like or gradient-based methods cannot be simply applied. Third, a network morphism operation changing one layer in the neural architecture may invoke many changes to other layers to maintain the input and output consistency, which is not defined in previous work. The network morphism operations are complicated in a search space of neural architectures with skip-connections.
In this paper, an efficient neural architecture search with network morphism is proposed, which utilizes Bayesian optimization to guide the search through the search space by selecting the most promising operations each time. To tackle the aforementioned challenges, an edit-distance based neural network kernel is constructed. Consistent with the key idea of network morphism, it measures how many operations are needed to change one neural network into another. Besides, a novel acquisition function optimizer is designed specifically for the tree-structured search space to enable Bayesian optimization to select from the operations. The optimization method balances exploration and exploitation during the optimization. In addition, a network-level morphism is defined, based on previous layer-level network morphism, to address the complicated changes in the neural architectures. Our method is wrapped into an open-source software package, namely Auto-Keras. The proposed approach is evaluated on benchmark datasets and compared with the state-of-the-art baseline methods. The main contributions of this paper are summarized as follows:
An efficient neural architecture search algorithm with network morphism is proposed.
Bayesian optimization for NAS with neural network kernel, tree-structured acquisition function optimization, and a network-level morphism is proposed.
An open-source software, namely Auto-Keras, is developed based on our method for neural architecture search.
Intensive experiments are conducted on benchmark datasets to evaluate the proposed method.
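The general neural architecture search problem we study in this paper is defined as follows: given a neural architecture search space $\mathcal{F}$, the input data $D$, and the cost metric $Cost(\cdot)$, we aim at finding an optimal neural network $f^{*} \in \mathcal{F}$ with its trained parameters $\theta_{f^{*}}$, which achieves the lowest cost metric value on the given dataset $D$. Mathematically, this definition is equivalent to finding $f^{*}$ satisfying:
$$f^{*} = \operatorname*{argmin}_{f \in \mathcal{F}} \; \min_{\theta_f} \; Cost(f(\theta_f), D),$$
where $\theta_f \in \mathbb{R}^{w(f)}$ denotes the parameter set of network $f$, and $w(f)$ is the number of parameters in $f$.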
Before explaining the proposed algorithm, we first define the target search space $\mathcal{F}$. Let $G_f = (V_f, E_f)$ denote the computational graph of a neural network $f$. Each node $v \in V_f$ denotes an intermediate output tensor of a layer of $f$. Each directed edge $e_{u \to v} \in E_f$ denotes a layer of $f$, whose input tensor is $u \in V_f$ and output tensor is $v \in V_f$. $u \prec v$ indicates that $u$ is before $v$ in the topological order of the nodes, i.e., by traveling through the edges in $E_f$, $v$ is reachable from node $u$. The search space $\mathcal{F}$ in this work is defined as a space consisting of any neural network architecture $f$ which satisfies two conditions: (1) $G_f$ is a directed acyclic graph (DAG). (2) $\forall u, v \in V_f$ with $u \neq v$, $(u \prec v) \lor (v \prec u)$. It is worth pointing out that although skip-connections are allowed, there should be only one main chain in $G_f$. Moreover, the search space defined here is large enough to cover a wide range of well-known neural architectures, e.g., DenseNet and ResNet.
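The two search-space conditions can be checked mechanically on a small graph. The following is a minimal sketch, assuming condition (2) is read as "every pair of distinct nodes is ordered by reachability"; the function names and the toy edge lists are hypothetical and not part of Auto-Keras.

```python
from itertools import combinations


def reachable(edges, src):
    """Return the set of nodes reachable from src by following directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def in_search_space(nodes, edges):
    """Check (1) acyclicity and (2) that every pair of nodes is ordered."""
    reach = {n: reachable(edges, n) for n in nodes}
    # (1) A cycle exists exactly when some node can reach itself.
    if any(n in reach[n] for n in nodes):
        return False
    # (2) For every pair of nodes, one must be reachable from the other.
    return all(v in reach[u] or u in reach[v] for u, v in combinations(nodes, 2))


# A single main chain 0 -> 1 -> 2 -> 3 with one skip-connection 0 -> 2.
chain_with_skip = [(0, 1), (1, 2), (2, 3), (0, 2)]
# Two parallel branches: nodes 1 and 2 cannot reach each other.
two_branches = [(0, 1), (0, 2), (1, 3), (2, 3)]

print(in_search_space([0, 1, 2, 3], chain_with_skip))
print(in_search_space([0, 1, 2, 3], two_branches))
```

Under this reading, a chain with skip-connections is accepted, while a graph with parallel branches is rejected because two of its nodes are not ordered.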
The key idea of the proposed method is to explore the search space via morphing the network architectures guided by an efficient Bayesian optimization algorithm. Traditional Bayesian optimization consists of a loop of three steps: update, generation, and observation. In the context of NAS, our proposed Bayesian optimization algorithm iteratively conducts: (1) Update: train the underlying Gaussian process model with the existing architectures and their performance; (2) Generation: generate the next architecture to observe by optimizing a delicately defined acquisition function; (3) Observation: train the generated neural architecture to obtain its performance. Compared with existing approaches, this greatly reduces the required number of trained neural architectures and avoids exploring only the neighborhood of the searched architectures. There are three main challenges in designing a method for morphing the neural architectures with Bayesian optimization. In the subsequent sections, we introduce three key components, each coping with one of the three design challenges. The time complexity of update and generation is low compared with that of the observation.
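The update, generation, and observation loop can be sketched as a small driver function. This is only an illustration of the control flow: the surrogate and the generator below are deliberately trivial stand-ins (a lookup table and a greedy neighbor rule), not a Gaussian process or the acquisition optimizer of this paper, and all names are hypothetical.

```python
def bayesian_search(initial, train, fit_surrogate, generate, rounds):
    """Run the update / generation / observation loop for a fixed number of rounds."""
    history = [(initial, train(initial))]      # observe the seed architecture
    for _ in range(rounds):
        model = fit_surrogate(history)         # (1) update
        candidate = generate(model, history)   # (2) generation
        cost = train(candidate)                # (3) observation
        history.append((candidate, cost))
    return min(history, key=lambda pair: pair[1])


# Toy stand-ins: an "architecture" is a single integer width.
def toy_train(width):
    """Pretend training: the cost is lowest at width 5."""
    return (width - 5) ** 2


def toy_fit(history):
    """Stand-in surrogate: simply remember the observed costs."""
    return dict(history)


def toy_generate(model, history):
    """Stand-in generator: propose an unobserved neighbor of the best width."""
    best_width = min(model, key=model.get)
    for candidate in (best_width + 1, best_width - 1, best_width + 2, best_width - 2):
        if candidate not in model:
            return candidate
    return max(model) + 1


best = bayesian_search(1, toy_train, toy_fit, toy_generate, rounds=5)
print(best)
```

Only the observation step calls the expensive `train` function, which is why the number of loop iterations dominates the total search cost.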
The first challenge we need to address is that the NAS space is not a Euclidean space, which does not satisfy the assumption of the traditional Gaussian process. It is impractical to vectorize every neural architecture, because the number of layers and parameters it may contain is not fixed. Since the Gaussian process is a kernel method, instead of vectorizing a neural architecture, we propose to tackle the challenge by designing a neural network kernel function. The intuition behind the kernel function is the edit-distance for morphing one neural architecture to another.
Kernel Definition: Suppose $f_a$ and $f_b$ are two neural networks. Inspired by Deep Graph Kernels [?], we propose an edit-distance kernel for neural networks, which is consistent with our idea of using network morphism. Edit-distance here means how many operations are needed to morph one neural network to another. The concrete kernel function is defined as follows:
$$\kappa(f_a, f_b) = e^{-\rho^{2}(d(f_a, f_b))},$$
where the function $d(\cdot, \cdot)$ denotes the edit-distance of two neural networks, whose range is $[0, +\infty)$, and $\rho$ is the Bourgain algorithm [?], which distorts the distance to ensure the validity of the kernel.
Calculating the edit-distance of two neural networks can be mapped to calculating the edit-distance of two graphs, which is an NP-hard problem [?]. Based on the search space defined above, we solve the problem by proposing an approximated solution as follows:
$$d(f_a, f_b) = D_l(L_a, L_b) + \lambda D_s(S_a, S_b),$$
where $D_l$ denotes the edit-distance for morphing the layers, i.e., the minimum edits needed to morph $f_a$ to $f_b$ if the skip-connections are ignored, $L_a = \{l_a^{(1)}, l_a^{(2)}, \ldots\}$ and $L_b = \{l_b^{(1)}, l_b^{(2)}, \ldots\}$ are the layer sets of neural networks $f_a$ and $f_b$, $D_s$ is the approximated edit-distance for morphing skip-connections between two neural networks, $S_a = \{s_a^{(1)}, s_a^{(2)}, \ldots\}$ and $S_b = \{s_b^{(1)}, s_b^{(2)}, \ldots\}$ are the skip-connection sets of neural networks $f_a$ and $f_b$, and $\lambda$ is the balancing factor.
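Calculating $D_l$: We assume $|L_a| < |L_b|$. The edit-distance for morphing the layers of two neural architectures $f_a$ and $f_b$ is calculated by minimizing the following equation:
$$D_l(L_a, L_b) = \min \sum_{i=1}^{|L_a|} d_l\big(l_a^{(i)}, \varphi_l(l_a^{(i)})\big) + \big|\, |L_b| - |L_a| \,\big|,$$
where $\varphi_l : L_a \to L_b$ is an injective matching function of layers satisfying: $\forall i < j$, $\varphi_l(l_a^{(i)}) \prec \varphi_l(l_a^{(j)})$ if the layers in $L_a$ and $L_b$ are all sorted in topological order. $d_l(\cdot, \cdot)$ denotes the edit-distance of widening a layer into another, defined as
$$d_l(l_a, l_b) = \frac{|w(l_a) - w(l_b)|}{\max[w(l_a), w(l_b)]},$$
where $w(l)$ is the width of layer $l$.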
The intuition of the definition of $D_l$ is consistent with the idea of network morphism, as shown in the corresponding figure of two matched networks. Suppose a matching is provided between the nodes in two neural networks. The numbers on the nodes are the widths of the intermediate tensors. The matchings between the nodes are marked in light blue. The nodes are intermediate tensors output by the previous layers, which are indicators of the width of the previous layers (e.g., the output vector length of a fully-connected layer or the number of filters of a convolutional layer). So a matching between the nodes can be seen as a matching between the layers. To morph $f_a$ to $f_b$, we need to first widen the three nodes in $f_a$ to the same width as their matched nodes in $f_b$, and then insert a new node of width 20 after the first node in $f_a$. Based on this morphing scheme, the overall edit-distance of the layers is defined as $D_l$ above.
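Since there are many ways to morph $f_a$ to $f_b$, to find the best matching between the nodes that minimizes $D_l$, we propose a dynamic programming approach by defining a matrix $A_{|L_a| \times |L_b|}$, which is recursively calculated as follows:
$$A_{i,j} = \min\big[A_{i-1,j} + 1,\; A_{i,j-1} + 1,\; A_{i-1,j-1} + d_l(l_a^{(i)}, l_b^{(j)})\big],$$
where $A_{i,j}$ is the minimum value of $D_l(L_a^{(i)}, L_b^{(j)})$, where $L_a^{(i)} = \{l_a^{(1)}, l_a^{(2)}, \ldots, l_a^{(i)}\}$ and $L_b^{(j)} = \{l_b^{(1)}, l_b^{(2)}, \ldots, l_b^{(j)}\}$.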
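The dynamic program for the layer edit-distance can be sketched in a few lines. The sketch assumes that widening a layer of width $w_a$ into one of width $w_b$ costs $|w_a - w_b| / \max(w_a, w_b)$ and that every unmatched layer costs 1; a network is represented only by the list of its layer widths in topological order, and the example widths are hypothetical.

```python
def layer_distance(width_a, width_b):
    """Cost of widening one layer into another, normalized to [0, 1)."""
    return abs(width_a - width_b) / max(width_a, width_b)


def layers_edit_distance(widths_a, widths_b):
    """Order-preserving layer matching via dynamic programming."""
    rows, cols = len(widths_a), len(widths_b)
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    # Matching against an empty prefix means every layer is unmatched (cost 1 each).
    for i in range(1, rows + 1):
        table[i][0] = float(i)
    for j in range(1, cols + 1):
        table[0][j] = float(j)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = min(
                table[i - 1][j] + 1,   # layer i of the first network is unmatched
                table[i][j - 1] + 1,   # layer j of the second network is unmatched
                table[i - 1][j - 1] + layer_distance(widths_a[i - 1], widths_b[j - 1]),
            )
    return table[rows][cols]


# Three layers versus four: one insertion (cost 1) plus widening 64 -> 128 (cost 0.5).
print(layers_edit_distance([16, 32, 64], [16, 20, 32, 128]))
```

The table has $(|L_a| + 1) \times (|L_b| + 1)$ entries, each filled in constant time, so the running time is quadratic in the number of layers.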
Calculating $D_s$: The intuition of the term $D_s$ is to measure the edit-distance of matching the most similar skip-connections in two neural networks into pairs. As shown in the corresponding figure, the skip-connections with the same color are matched pairs. Similar to $D_l(\cdot, \cdot)$, $D_s(\cdot, \cdot)$ is defined as follows:
$$D_s(S_a, S_b) = \min \sum_{i=1}^{|S_a|} d_s\big(s_a^{(i)}, \varphi_s(s_a^{(i)})\big) + \big|\, |S_b| - |S_a| \,\big|,$$
where we assume $|S_a| < |S_b|$. The term $\big|\,|S_b| - |S_a|\,\big|$ measures the total edit-distance for unmatched skip-connections. Each of the unmatched skip-connections in $S_b$ means a new skip-connection that needs to be inserted in $f_a$. The mapping function $\varphi_s : S_a \to S_b$ is an injective function. The edit-distance for two matched skip-connections is defined as:
$$d_s(s_a, s_b) = \frac{|u(s_a) - u(s_b)| + |\delta(s_a) - \delta(s_b)|}{\max[u(s_a), u(s_b)] + \max[\delta(s_a), \delta(s_b)]},$$
where $u(s)$ is the topological rank of the layer the skip-connection $s$ starts from, and $\delta(s)$ is the number of layers between the start and end points of the skip-connection $s$.
The minimization problem in the definition of $D_s$ can be mapped to a bipartite graph matching problem, where $S_a$ and $S_b$ are the two parts of the graph and each skip-connection is a node in its corresponding part. The edit-distance between two skip-connections is the weight of the edge between them. The bipartite graph matching problem is solved by the Hungarian algorithm (Kuhn-Munkres algorithm) [?].
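The skip-connection matching can be illustrated on tiny inputs. In this sketch the Hungarian algorithm is swapped for an exhaustive search over injective matchings, which gives the same minimum but is only practical for a handful of skip-connections. Each skip-connection is assumed to be a pair (topological rank of its start layer, number of layers it spans), the pairwise cost is assumed to be the normalized difference of these two quantities, and the final kernel value omits the Bourgain distortion; all names are hypothetical.

```python
import math
from itertools import permutations


def skip_distance(skip_a, skip_b):
    """Normalized difference of start rank and span of two skip-connections."""
    (rank_a, span_a), (rank_b, span_b) = skip_a, skip_b
    denominator = max(rank_a, rank_b) + max(span_a, span_b)
    if denominator == 0:
        return 0.0
    return (abs(rank_a - rank_b) + abs(span_a - span_b)) / denominator


def skips_edit_distance(skips_a, skips_b):
    """Best injective matching (brute force) plus 1 per unmatched skip-connection."""
    small, large = sorted((skips_a, skips_b), key=len)
    best_matching_cost = min(
        sum(skip_distance(s, t) for s, t in zip(small, chosen))
        for chosen in permutations(large, len(small))
    )
    return best_matching_cost + (len(large) - len(small))


def kernel_from_distances(layer_term, skip_term, balance=1.0):
    """Generalized RBF form exp(-d^2) with d = layer_term + balance * skip_term."""
    distance = layer_term + balance * skip_term
    return math.exp(-distance ** 2)


# One matched pair with zero cost, one skip-connection left unmatched.
print(skips_edit_distance([(1, 2)], [(1, 2), (3, 1)]))
print(kernel_from_distances(1.5, 1.0))
```

Identical architectures have distance 0 and therefore kernel value 1, and the kernel value decays toward 0 as more morph operations separate the two networks.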
Proof of Kernel Validity: A Gaussian process requires the kernel to be valid, i.e., the kernel matrices must be positive semidefinite, to keep the distributions valid. The edit-distance $d$ defined above is a metric distance, as proved by Theorem 1. Although a generalized RBF kernel in the form of $e^{-\gamma D^{2}(x, y)}$ based on a distance $D$ in metric space may not always be a valid kernel, our kernel $\kappa$ defined above is proved to be valid by Theorem 2.
Theorem 1. $d(f_a, f_b)$ is a metric space distance.
Theorem 1 is proved by proving the non-negativity, definiteness, symmetry, and triangle inequality of $d$.
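From the definition of $w(l)$, $\forall l$, $w(l) > 0$. $\therefore \forall l_a, l_b$, $d_l(l_a, l_b) \geq 0$. $\therefore \forall L_a, L_b$, $D_l(L_a, L_b) \geq 0$. Similarly, $\forall s_a, s_b$, $d_s(s_a, s_b) \geq 0$, and $\forall S_a, S_b$, $D_s(S_a, S_b) \geq 0$. In conclusion, $\forall f_a, f_b \in \mathcal{F}$, $d(f_a, f_b) \geq 0$.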
is trivial. To prove , let . , and , . Let and be the layer sets of and . Let and be the skip-connection sets of and .
and . , and , . , , , , , , , . According to Equation (Efficient Neural Architecture Search with Network Morphism), each of the layers in has the same width as the matched layer in , According to the restrictions of , the matched layers are in the same order, and all the layers are matched, i.e. the layers of the two networks are exactly the same. Similarly, the skip-connections in the two neural networks are exactly the same. . So , let . Finally, .
Let and be two neural networks in , Let and be the layer sets of and . If , since it will always swap and if has more layers. If , since is undirected, and is symmetric. Similarly, is symmetric. In conclusion, , .
Let , , be neural network layers of any width. If , . If , . If , . By the symmetry property of , the rest of the orders of , and also satisfy the triangle inequality. , .
, given and used to compute and , we are able to construct to compute satisfies .
Let . , , , , .
From the definition of , with the current matching functions and , and . First, is matched to . Since the triangle inequality property of , . Second, the rest of the and are free to match with each other.
Let , , , , , .
From the definition of , with the current matching functions and , and . and . So . Similarly, . Finally, , .
In conclusion, $d(f_a, f_b)$ is a metric space distance.
Theorem 2. $\kappa(f_a, f_b)$ is a valid kernel.
Proof of Theorem 2: The network edit-distance $d$ is a distance in metric space, as proved in Theorem 1 in the Appendix. The Bourgain algorithm [?], denoted as $\rho$ in the kernel definition, preserves the symmetry and definiteness of $d$. Therefore, $\kappa(f_a, f_b) = \kappa(f_b, f_a)$, and $\kappa(f_a, f_b) = 1$ if and only if $f_a = f_b$. The kernel matrix of a generalized RBF kernel in the form of $e^{-\gamma D^{2}(x, y)}$ is positive definite if and only if there is an isometric embedding in Euclidean space for the metric space with metric $D$ [?]. Any finite metric space distance can be isometrically embedded into Euclidean space by changing the scale of the distance measurement [?]. The Bourgain algorithm therefore distorts $d$ so that it is isometrically embeddable in Euclidean space, and the kernel matrix is always positive definite. So $\kappa(f_a, f_b)$ is a valid kernel.
The second challenge we need to address is acquisition function optimization. Traditional acquisition functions are defined on Euclidean space, and the methods for optimizing them are not applicable to the tree-structured search via network morphism. A novel method to optimize the acquisition function is therefore proposed for the tree-structured space.
The upper-confidence bound (UCB) is chosen as our acquisition function:
$$\alpha(f) = \mu(y_f) - \beta \sigma(y_f),$$
where $y_f = Cost(f, D)$, $\beta$ is the balancing factor, and $\mu(y_f)$ and $\sigma(y_f)$ are the posterior mean and standard deviation of the variable $y_f$. UCB has two properties that fit our problem. First, it has an explicit balancing factor $\beta$ for exploration and exploitation. Second, the function value is directly comparable with the cost function value $c^{(i)}$ in the search history $\mathcal{H} = \{(f^{(i)}, \theta^{(i)}, c^{(i)})\}$, which is a property to be used in our algorithm. With the acquisition function, $\hat{f} = \operatorname{argmin}_f \alpha(f)$ is the generated architecture for the next observation.
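A small numeric illustration of how the balancing factor changes the selected candidate, assuming the minimization form of the bound (posterior mean minus the balancing factor times the posterior standard deviation). The candidate names and the posterior values are made up for the example.

```python
def ucb_for_minimization(mean, std, beta):
    """Optimistic estimate of the cost: lower is more promising."""
    return mean - beta * std


# Hypothetical posterior (mean error rate, standard deviation) of three candidates.
candidates = {
    "deeper": (0.12, 0.01),
    "wider": (0.13, 0.05),
    "skip": (0.15, 0.02),
}

beta = 2.0
scores = {name: ucb_for_minimization(mean, std, beta)
          for name, (mean, std) in candidates.items()}
best = min(scores, key=scores.get)

# With beta = 0 the choice is purely exploitative (lowest posterior mean).
greedy = min(candidates, key=lambda name: ucb_for_minimization(*candidates[name], 0.0))

print(best, greedy)
```

With a positive balancing factor the uncertain candidate is preferred even though its posterior mean is slightly worse, which is the exploration behavior described above.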
The tree-structured space is defined as follows. During the minimization of $\alpha(f)$, $\hat{f}$ should be obtained from $f^{(i)}$ and $O$, where $f^{(i)}$ is an observed architecture in the search history $\mathcal{H}$, and $O$ is a sequence of operations to morph the architecture into a new one. Morphing $f$ to $\hat{f}$ with $O$ is denoted as $\hat{f} \leftarrow \mathcal{M}(f, O)$, where $\mathcal{M}(\cdot, \cdot)$ is the function to morph $f$ with the operations in $O$. Therefore, the search can be viewed as a tree-structured search, where each node is a neural architecture, whose children are morphed from it by network morphism operations.
The state-of-the-art acquisition function maximization techniques, e.g., gradient-based or Newton-like methods, are designed for numerical data and do not apply in the tree-structured space. TreeBO [?] has proposed a way to maximize the acquisition function in a tree-structured parameter space. However, only its leaf nodes have acquisition function values, which is different from our case. Moreover, the proposed solution is a surrogate multivariate Bayesian optimization model. NASBOT [?] uses an evolutionary algorithm to optimize the acquisition function. Both are very expensive in computing time. We therefore need a method that efficiently minimizes our acquisition function in the tree-structured space.
Inspired by the various heuristic search algorithms for exploring tree-structured search spaces and by the optimization methods that balance exploration and exploitation, we propose a new method based on A* search, which is good at tree-structured search, and simulated annealing, which is good at balancing exploration and exploitation.
As shown in Algorithm 1, the algorithm takes the minimum temperature $T_{low}$, the temperature decreasing rate $r$ for simulated annealing, and the search history $\mathcal{H}$ described above as input. It outputs a neural architecture $f \in \mathcal{H}$ and a sequence of operations $O$. From line 2 to line 3, the searched architectures are pushed into the priority queue, in which they are sorted according to their cost function value or acquisition function value. Since UCB is chosen as the acquisition function, $\alpha(f)$ is directly comparable with the history observation values $c^{(i)}$. The algorithm also records which architecture in the history each candidate is morphed from, and which operations are applied to reach it. Lines 4 to 11 form the loop minimizing the acquisition function. Following the setting of A* search, in each iteration the architecture with the lowest acquisition function value is popped to be expanded on lines 5 to 6, where $\Omega(f)$ is the set of all possible operations to morph the architecture $f$, and $\mathcal{M}(f, o)$ is the function to morph the architecture $f$ with the operation sequence $o$. However, for exploration purposes, not all the children are pushed into the priority queue. The decision is made by simulated annealing on line 8, where $e^{\frac{c_{min} - \alpha(f')}{T}}$ is a typical acceptance function in simulated annealing. Notably, on line 10, the cost value $c_{min}$ is directly compared with the acquisition function value $\alpha(f')$, which is possible given the property of UCB.
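The following is a sketch of the search loop as described in the paragraph above, not the Auto-Keras implementation. It assumes that the temperature starts at 1, that architectures are tuples of layer widths, and that the acquisition function, the operation set, and the morph function are supplied by the caller; the toy versions of these three functions and all identifiers are hypothetical.

```python
import heapq
import itertools
import math
import random


def optimize_acquisition(history, acquisition, operations, morph,
                         t_low=0.01, rate=0.9, seed=0):
    """A*-style expansion with a simulated-annealing acceptance test.

    history is a list of (architecture, observed_cost) pairs. Returns
    (root, ops, child): the observed architecture to start from, the operation
    sequence, and the resulting architecture, or None if nothing improves.
    """
    rng = random.Random(seed)
    tie = itertools.count()          # tie-breaker so the heap never compares architectures
    temperature = 1.0
    c_min = min(cost for _, cost in history)
    best = None
    # Observed architectures enter the queue with their observed cost.
    queue = [(cost, next(tie), arch, arch, ()) for arch, cost in history]
    heapq.heapify(queue)
    while queue and temperature > t_low:
        temperature *= rate
        _, _, arch, root, ops = heapq.heappop(queue)   # lowest value first
        for op in operations(arch):
            child = morph(arch, op)
            value = acquisition(child)
            child_ops = ops + (op,)
            # Acceptance probability; the exponent is clipped at 0 to avoid overflow.
            accept = math.exp(min(0.0, (c_min - value) / temperature))
            if accept > rng.random():
                heapq.heappush(queue, (value, next(tie), child, root, child_ops))
            if value < c_min:
                c_min, best = value, (root, child_ops, child)
    return best


# Toy problem: the acquisition value is the distance to a target architecture.
TARGET = (32, 64)


def toy_acquisition(arch):
    depth_gap = abs(len(arch) - len(TARGET))
    width_gap = sum(abs(w - t) / t for w, t in zip(arch, TARGET))
    return depth_gap + width_gap


def toy_operations(arch):
    return [("widen", i) for i in range(len(arch))] + [("deepen",)]


def toy_morph(arch, op):
    if op[0] == "widen":
        i = op[1]
        return arch[:i] + (arch[i] * 2,) + arch[i + 1:]
    return arch + (16,)


seed_arch = (16,)
history = [(seed_arch, toy_acquisition(seed_arch))]
root, ops, found = optimize_acquisition(history, toy_acquisition, toy_operations, toy_morph)
print(root, ops, found)
```

Children that improve on the current minimum are always accepted, while worse children are kept with a probability that shrinks as the temperature decreases, so the search is broad at the beginning and greedy toward the end.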
The third challenge is to maintain the input and output tensor shape consistency when morphing the architectures. Previous work showed how to preserve the functionality of the layers that the operators are applied to, namely layer-level morphism. However, from a network-level view, any change of a single layer could have a butterfly effect on the entire network. If the affected layers are not changed accordingly, the input and output tensor shape consistency is broken. To tackle the challenge, a network-level morphism is proposed to find and morph the layers influenced by a layer-level operation in the entire network.
There are four network morphism operations we could perform on a neural network $f \in \mathcal{F}$ [?], which can all be reflected in the change of the computational graph $G$. The first operation is inserting a layer into $f$ to make it deeper, denoted as $deep(G, u)$, where $u$ is the node marking the place to insert the layer. The second one is widening a node in $f$, denoted as $wide(G, u)$, where $u$ is the intermediate output tensor to be widened. Widening here could be making the output vector of a fully-connected layer longer, or adding more filters to the previous convolutional layer of $u$. The third one is adding an additive connection from node $u$ to node $v$, denoted as $add(G, u, v)$. The fourth one is adding a concatenative connection from node $u$ to node $v$, denoted as $concat(G, u, v)$. For $deep(G, u)$, no other operation is needed except for initializing the weights of the newly added layer as described in [?]. However, for all the other three operations, more changes are required to $G$.
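To make the idea of a function-preserving widen operation concrete, the sketch below widens the hidden layer of a tiny two-layer ReLU network by duplicating one hidden unit and splitting its outgoing weights in half. This is one common way to widen without changing the output and is shown as an assumption; the source only cites [?] for the weight initialization, and the weights and helper names here are hypothetical.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(w1, w2, x):
    """Two-layer network: output = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def widen_hidden(w1, w2, unit):
    """Duplicate one hidden unit and halve its outgoing weights."""
    new_w1 = [row[:] for row in w1] + [w1[unit][:]]   # copy the incoming weights
    new_w2 = []
    for row in w2:
        half = row[unit] / 2.0
        new_row = row[:]
        new_row[unit] = half       # the original unit keeps half of the weight
        new_row.append(half)       # the duplicate receives the other half
        new_w2.append(new_row)
    return new_w1, new_w2


w1 = [[1.0, -2.0], [0.5, 0.25]]    # 2 inputs -> 2 hidden units
w2 = [[2.0, -1.0]]                 # 2 hidden units -> 1 output
x = [3.0, 1.0]

wide_w1, wide_w2 = widen_hidden(w1, w2, unit=0)
before = forward(w1, w2, x)
after = forward(wide_w1, wide_w2, x)
print(before, after)
```

Note that widening the hidden layer forces a matching change in the next layer (`w2` gains a column), which is a small instance of the shape-consistency problem that the network-level morphism addresses.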
First, we define an effective area of as to better describe where to change in the network. The effective area is a set of nodes in the computational graph, which can be recursively defined by the following rules: 1. . 2. , if , . 3. , if , . is the set of fully-connected layers and convolutional layers. Operation needs to change two set of layers, the previous layer set , which needs to output a wider tensor, and next layer set , which needs to input a wider tensor. Second, for operator , additional pooling layers may be needed on the skip-connection. and have the same number of channels, but their shape may differ because of the pooling layers between them. So we need a set of pooling layers whose effect is the same as the combination of all the pooling layers between and , which is defined as . where could be any path between and , is the pooling layer set. Third, the effect area of can be similarly defined by the following rules: 1. . 2. . 3. , if , . 4. , if , . The and is the same as defined in the wide operation. Additional pooling layers are also needed for the skip-connection.
As described at the start of Section 3 in the paper, Bayesian optimization can be roughly divided into three steps: update, generation, and observation. The bottleneck of the efficiency of the algorithm is observation, which involves the training of the generated neural architecture. However, the efficiency of the update and the generation is also important, since they must not become the bottleneck. Let $n$ be the number of architectures in the search history. The time complexity of the update is $O(n^2 \log_2 n)$. In each generation, the kernel is computed between the new architectures produced while optimizing the acquisition function and the ones in the search history, the number of values in which is $O(nm)$, where $m$ is the number of architectures computed during the optimization of the acquisition function. The time complexity for computing $d(\cdot, \cdot)$ once is $O(l^2 + s^3)$, where $l$ and $s$ are the number of layers and skip-connections. So the overall time complexity is $O(nm(l^2 + s^3) + n^2 \log_2 n)$. The magnitude of these factors is within the scope of tens. So the time consumption of update and generation is trivial compared to the observation time.
An open-source software package, namely Auto-Keras, is developed based on our method for neural architecture search, in which Keras [?] is used for the construction and training of the neural networks. Similar to SMAC [?], Auto-WEKA [?], and Auto-Sklearn [?], the goal is to enable domain experts who are not familiar with machine learning technologies to use deep neural networks easily. Although there are several AutoML services available on large cloud computing platforms, three things prevent users from using them. First, the cloud services are not free to use, which may not be affordable for everyone who wants to use AutoML techniques. Second, cloud-based AutoML usually requires complicated configurations of Docker containers and Kubernetes. Third, the AutoML service providers are honest but curious, so the security and privacy of the data cannot be guaranteed. An open-source software package, which is easily downloadable and runs locally, would solve these problems and make AutoML accessible to everyone.
The Auto-Keras package consists of four major components: Classifier, Searcher, Graph, and Trainer. The Classifier is the program interface class, which is responsible for calling the corresponding modules to complete certain tasks. The Searcher is the module containing Bayesian optimization. Each time its search function is called, it runs one round of Bayesian optimization, which consists of update, generation, and observation. The Graph is the class of the computational graph of neural networks, which has member functions implementing the network morphism operations. It is called by the Searcher to morph the neural architectures. The Trainer is the class that trains a given neural network with the training data in a separate process to avoid GPU memory leaks. It supports various training techniques to improve the final performance of the neural network, including data augmentation and automated detection of convergence.
The design of the application programming interface (API) follows the classic design of the Scikit-Learn API [?, ?]. The training of a neural network requires as few as three lines of code, calling the constructor, the fit function, and the predict function respectively. Users can also specify the model trainer’s hyperparameters through the default-valued parameters of these functions. Several accommodations have been implemented to enhance the user experience with the Auto-Keras package. First, the user can restore and continue a previous search which might have been accidentally killed. From the users’ perspective, the main difference between using Auto-Keras and other similar packages is that it takes much longer, since it needs to train a number of deep neural networks. An accident may kill the process before the search finishes. Therefore, the search outputs all the searched neural network architectures with their trained parameters into a specific directory on the disk. As long as the path to the directory is provided, the previous search can be restored. Second, all the searched architectures are visualized in the saved directory as PNG files. Third, the user can export the search results, which are neural architectures, as saved Keras models for other usages. Fourth, advanced users can specify all kinds of hyperparameters of the search process and the neural network optimization process through the default-valued parameters in the interface.
To fully automate the entire process from input data to the final trained neural network, automated detection of convergence is needed both during the search and during the final training of the best architecture found. We use the same strategy as the early stopping strategy in the multi-layer perceptron algorithm in Scikit-Learn [?]. It sets a maximum threshold $\tau$. If the loss on the validation set does not decrease in $\tau$ epochs, the training stops. Since different architectures may require different numbers of training epochs, this convergence detection strategy is more adaptive to different architectures than the many state-of-the-art methods that use a fixed number of training epochs. It better ensures the correlation between the performance of a neural architecture during the search and its final performance when fully trained, which is essential for the entire neural architecture search process.
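The stopping rule can be sketched as a function over a sequence of validation losses. This is a minimal illustration under the assumption that "does not decrease" means "does not fall below the best loss seen so far"; the function name and the loss values are hypothetical, and the real Trainer would compute the losses epoch by epoch instead of receiving a list.

```python
def epochs_until_stop(val_losses, patience):
    """Return the epoch (1-based) at which training stops.

    Training stops once the validation loss has failed to improve on the best
    value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.6, 0.5]
print(epochs_until_stop(losses, patience=3))
print(epochs_until_stop(losses, patience=5))
```

A small threshold stops early and may miss a late improvement (the last loss in the example), while a larger one trains longer, which is the trade-off the threshold controls.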
The program can run across multiple GPUs and CPUs at the same time. It relies on the internal parallel mechanism of Keras to run across multiple GPUs during the training of the neural networks. The functional programming paradigm in Python enables the rest of the computation to run in parallel across multiple CPUs. However, if the observation, update, and generation of Bayesian optimization are done in a sequential order, the GPUs are idle during the update and generation, and the CPUs are idle during the observation. To improve the efficiency, the observation is run in parallel with the update and generation in separate processes. A training queue is maintained as a buffer. In each Bayesian optimization cycle, the Trainer takes one architecture from the queue and trains it. The Searcher runs the generation in parallel to search for the next architecture to train. After observation, the model is updated with the architecture and the observed performance. After generation, the newly generated architecture is pushed into the queue. In this way, the idle time of the GPUs and CPUs is dramatically reduced, which improves the efficiency of the search process.
In the experiments, we aim at answering the following questions. 1) What is the effectiveness of the search algorithm with limited running time? 2) What are the influences of the important hyperparameters of the search algorithm? 3) Does the edit-distance neural network kernel correctly predict the similarity in actual performance?
Datasets: Three benchmark datasets, MNIST [?], CIFAR10 [?], and FASHION [?], are used for the experiments. They require very different neural architectures to achieve good performance.
Baselines: Four categories of baseline methods are compared with our work: two straightforward solutions, random search (RAND) and grid search (GRID); two traditional baselines, MCMC [?] and SMAC [?]; two state-of-the-art neural architecture search methods, SEAS [?] and NASBOT [?]; and two variants of our proposed method, BFS and BO. MCMC and SMAC tune the 16 hyperparameters of a three-layer convolutional neural network, including the width, dropout rate, and regularization rate of each layer. We carefully implemented SEAS as described in its paper. For NASBOT, since the experimental settings are very similar, we directly trained the searched neural architecture published in its paper. The architecture is implemented and trained on the dataset originally used for the search. BFS is a variant of our own method which replaces the Bayesian optimization with breadth-first search. BO is another variant, which does not use network morphism for acceleration. Finally, our proposed method is NASNM.
Table: Classification error rate. Columns: MNIST, CIFAR10, FASHION. Rows: RANDOM, GRID, MCMC, SMAC, SEAS, NASBOT (NA for two of the datasets), BFS, BO, NASNM.
The experiments evaluating the effectiveness of the proposed method are conducted as follows. First, each dataset is split 60-20-20 into training, validation, and testing sets. Second, the method is run for 12 hours on a single GPU (NVIDIA GeForce GTX 1080 Ti) using the training and validation sets. Third, the output architecture is trained on both the training and validation sets. Fourth, the testing set is used to evaluate the trained architecture. Error rate is selected as the evaluation metric since all the datasets are for classification. For a fair comparison, the same model trainer is used to train the neural networks for 200 epochs in all the experiments. The trainer includes several techniques to enhance performance, including layer regularization, data augmentation, and learning rate control.
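The data protocol above can be sketched as follows. The function name `split_60_20_20` is hypothetical, and shuffling with a fixed seed is an assumption, since the text does not say how samples are assigned to each set.

```python
import random


def split_60_20_20(n_samples, seed=0):
    """Return index lists for the training, validation, and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    train_end = n_samples * 60 // 100     # integer arithmetic avoids rounding drift
    valid_end = n_samples * 80 // 100
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = split_60_20_20(1000)

# Steps two and three use training and validation data; step four uses only the test set.
final_training_idx = train_idx + valid_idx
```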
The results are shown in the classification error rate table. Our method achieved the lowest error rate on all of the datasets. MNIST requires only simple neural architectures to achieve a low error rate, and complicated neural architectures are likely to overfit it. As the results show, our method is able to avoid the overfitting issue during the search. In contrast, CIFAR10 requires more complicated neural architectures and is likely to be underfitted. FASHION is an intermediate dataset, which can be both underfitted and overfitted. On these two datasets, our method also achieved the lowest error rate compared with the baselines. Most of the simple and traditional approaches performed well on MNIST, but not on CIFAR10. Simple approaches like random search and grid search do not work well on CIFAR10 since they can only try a limited number of architectures blindly. For traditional approaches, the main reason is their inability to change the depth and skip-connections of the architectures. SEAS [?] also performed well on all three datasets, although its error rate on CIFAR10 is a little higher. This is because the good architectures are farther from the initial architecture in terms of network morphism operations, and the hill-climbing strategy takes only one step at a time in morphing the current best architecture. NASBOT [?] performed well on CIFAR10. It also uses Bayesian optimization, but it is not a network morphism based method, so the training of the neural networks takes longer. Its low error rate was achieved by searching in parallel on multiple GPUs. BFS has a problem similar to hill-climbing: it always searches a large number of neighbors first, which makes it unlikely to reach good results far from the initial architecture. BO can jump far from the initial architecture, but without network morphism it needs to train each neural architecture for much longer, which limits the number of architectures it can search within a given time.
Some results may not be as good as those reported in other papers. The main reason is that all the methods use the default training settings, including data preprocessors, optimizers, and batch size, to eliminate the influence of unwanted factors.
There are several hyperparameters in our proposed method: the weight in the kernel function that balances the distance of layers against the distance of skip-connections, the weight of the variance in the acquisition function, and the simulated annealing parameters in Algorithm 1. Since the latter are just ordinary hyperparameters of simulated annealing, the experiments focus on the first two. The kernel weight balances between the distance of layers and the distance of skip-connections in the kernel function. The variance weight in the acquisition function balances the exploration and exploitation of the search strategy. The setup for the parameter experiments is similar to that of the performance experiments, except for the final training in step three.
As shown in the top part of the figure, as the variance weight increases over the tested range, the error rate first decreases and then increases. If the variance weight is too low, the search process is not explorative enough to reach architectures far from the initial architecture. If it is too high, the search process keeps exploring distant points instead of trying the most promising architectures. As shown in the bottom part of the figure, the kernel weight influences the performance in a similar way. If the kernel weight is too low, the differences in the skip-connections of two neural architectures are ignored. If it is too high, the differences in the convolutional or fully-connected layers are ignored. The differences in layers and skip-connections should be balanced in the kernel function for the entire framework to perform well.
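The effect of the variance weight can be illustrated with a small sketch. It assumes the acquisition function has a lower-confidence-bound form, the predicted mean minus the weighted standard deviation, which is minimized because the cost is an error rate. The function names and all numbers are made up for illustration.

```python
def lower_confidence_bound(mean, std, variance_weight):
    """Assumed acquisition form: predicted cost minus weighted uncertainty."""
    return mean - variance_weight * std


# (predicted error mean, predicted standard deviation) for two hypothetical candidates.
candidates = {
    "near": (0.10, 0.01),  # close to known architectures: good and certain
    "far": (0.15, 0.10),   # far from known architectures: worse mean, very uncertain
}


def pick(variance_weight):
    """Return the candidate with the lowest acquisition value."""
    return min(
        candidates,
        key=lambda name: lower_confidence_bound(*candidates[name], variance_weight),
    )


exploit_choice = pick(0.0)  # ignores uncertainty, stays near known architectures
explore_choice = pick(1.0)  # rewards uncertainty, jumps to the distant candidate
```

With a weight of zero the search never leaves the neighborhood of what it already knows. With a large weight, uncertainty alone is enough to make a candidate attractive.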
Figure: Kernel and performance matrix visualization
To show the quality of the edit-distance neural network kernel, we investigate the difference between two matrices: the kernel matrix and the performance matrix. Each entry of the kernel matrix is the kernel value between a pair of neural architectures in the search history. The performance matrix describes the similarity of the actual performance between pairs of neural networks, where each entry is computed from the cost function values recorded in the search history described earlier. Here we use CIFAR10 as the dataset and error rate as the cost metric. Since the values in the two matrices are on different scales, both matrices are normalized to the range from -1 to 1.
The kernel matrix and the performance matrix are visualized in Figure (a) and (b), respectively. A lighter color means a larger value. Several patterns appear in the figures. First, both Figure (a) and (b) have a white diagonal. Because of the definiteness of the kernel, the kernel value of any architecture with itself is 1, so the diagonal of the kernel matrix is always 1. The same holds for the performance matrix, since no difference exists in the performance of the same neural network. Second, there is a small square on the upper left of Figure (a). These are the initial neural architectures used to train the Bayesian optimizer, which are neighbors of each other in terms of network morphism operations. The similar pattern in Figure (b) indicates that, when the kernel measures two architectures as similar, they tend to have similar performance. Third, there is a dark region on the top and left of Figure (a). The main reason is that the rest of the architectures are dissimilar to the initial architectures. The similar pattern in Figure (b) shows that, when the kernel measures two architectures as dissimilar, they tend to have different performance. The dark color in the upper right corner of Figure (a) shows a small flaw of the kernel: the magnitude of the difference in performance is not accurately measured. Fourth, the kernel matrix is smoother than the performance matrix, because there is noise in the measured performance due to various training issues. Finally, we quantitatively measure the difference between the two matrices with the mean squared error (MSE). The range of the values is (-1, 1). The MSE is .
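The comparison between the kernel matrix and the performance matrix can be sketched in a few lines. This is an illustration under stated assumptions. The performance similarity of two networks is taken to be the negative absolute difference of their costs, which is an assumption not given in the text. Normalization is a min-max rescaling to [-1, 1]. The cost and kernel values are made up, and all function names are hypothetical.

```python
def performance_similarity(costs):
    """Assumed similarity: negative absolute difference between costs."""
    return [[-abs(ci - cj) for cj in costs] for ci in costs]


def normalize(matrix):
    """Min-max rescale all entries to [-1, 1]; assumes the matrix is not constant."""
    flat = [value for row in matrix for value in row]
    low, high = min(flat), max(flat)
    return [[2 * (value - low) / (high - low) - 1 for value in row] for row in matrix]


def mean_squared_error(a, b):
    """Mean of squared entry-wise differences between two equally sized matrices."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count


costs = [0.10, 0.20, 0.40]       # made-up error rates from a search history
kernel = [[1.0, 0.8, 0.2],       # made-up kernel values for the same architectures
          [0.8, 1.0, 0.5],
          [0.2, 0.5, 1.0]]

P = normalize(performance_similarity(costs))
K = normalize(kernel)
mse = mean_squared_error(K, P)
```

After normalization both diagonals equal 1, which matches the white diagonals described above. A small MSE means the kernel ranks pairs of architectures much as their measured performance does.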
In this paper, a novel method for efficient neural architecture search with network morphism is proposed. It enables Bayesian optimization to guide the search by designing a neural network kernel and an algorithm for optimizing the acquisition function in a tree-structured space. The proposed method is wrapped into open-source software, which can be easily downloaded and used through an extremely simple interface. The method has shown good performance in the experiments and outperformed several traditional hyperparameter-tuning methods and state-of-the-art neural architecture search methods. In the future, the search space may be expanded to recurrent neural networks (RNNs). It is also important to tune the neural architecture and the hyperparameters of the training process together to further reduce manual labor.
-  Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016).
-  Bourgain, J. On lipschitz embedding of finite metric spaces in hilbert space. Israel Journal of Mathematics 52, 1-2 (1985), 46–52.
-  Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning (2013), pp. 108–122.
-  Cai, H., Chen, T., Zhang, W., Yu, Y., and Wang, J. Efficient architecture search by network transformation. In AAAI (2018).
-  Chen, T., Goodfellow, I., and Shlens, J. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641 (2015).
-  Chollet, F., et al. Keras. https://keras.io, 2015.
-  Desell, T. Large scale evolution of convolutional neural networks using volunteer computing. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (2017), ACM, pp. 127–128.
-  Elsken, T., Metzen, J.-H., and Hutter, F. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528 (2017).
-  Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (2015), pp. 2962–2970.
-  Haasdonk, B., and Bahlmann, C. Learning with distance substitution kernels. In Joint Pattern Recognition Symposium (2004), Springer, pp. 220–227.
-  Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. LION 5 (2011), 507–523.
-  Jenatton, R., Archambeau, C., González, J., and Seeger, M. Bayesian optimization with tree-structured dependencies. In International Conference on Machine Learning (2017), pp. 1655–1664.
-  Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., and Xing, E. Neural architecture search with bayesian optimisation and optimal transport. arXiv preprint arXiv:1802.07191 (2018).
-  Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., and Leyton-Brown, K. Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka. Journal of Machine Learning Research 17 (2016), 1–5.
-  Krizhevsky, A., and Hinton, G. Learning multiple layers of features from tiny images.
-  Kuhn, H. W. The hungarian method for the assignment problem. Naval Research Logistics (NRL) 2, 1-2 (1955), 83–97.
-  LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
-  Liu, H., Simonyan, K., Vinyals, O., Fernando, C., and Kavukcuoglu, K. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017).
-  Maehara, H. Euclidean embeddings of finite metric spaces. Discrete Mathematics 313, 23 (2013), 2848–2856.
-  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
-  Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018).
-  Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018).
-  Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Le, Q., and Kurakin, A. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041 (2017).
-  Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (2012), pp. 2951–2959.
-  Suganuma, M., Shirakawa, S., and Nagao, T. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference (2017), ACM, pp. 497–504.
-  Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (2013), ACM, pp. 847–855.
-  Wei, T., Wang, C., Rui, Y., and Chen, C. W. Network morphism. In International Conference on Machine Learning (2016), pp. 564–572.
-  Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
-  Xie, L., and Yuille, A. Genetic cnn. arXiv preprint arXiv:1703.01513 (2017).
-  Yanardag, P., and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1365–1374.
-  Zeng, Z., Tung, A. K., Wang, J., Feng, J., and Zhou, L. Comparing stars: On approximating graph edit distance. Proceedings of the VLDB Endowment 2, 1 (2009), 25–36.
-  Zhong, Z., Yan, J., and Liu, C.-L. Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552 (2017).
-  Zoph, B., and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).