ImmuNeCS: Neural Committee Search by an Artificial Immune System
Current Neural Architecture Search techniques can suffer from a few shortcomings, including high computational cost, excessive bias from the search space, conceptual complexity or uncertain empirical benefits over random search. In this paper, we present ImmuNeCS, an attempt at addressing these issues with a method that offers a simple, flexible, and efficient way of building deep learning models automatically, and we demonstrate its effectiveness in the context of convolutional neural networks. Instead of searching for the 1-best architecture for a given task, we focus on building a population of neural networks that are then ensembled into a neural network committee, an approach we dub Neural Committee Search. To ensure sufficient performance from the committee, our search algorithm is based on an artificial immune system that balances individual performance with population diversity. This allows us to stop the search when accuracy starts to plateau, and to bridge the performance gap through ensembling. In order to justify our method, we first verify that the chosen search space exhibits the locality property. To further improve efficiency, we also combine partial evaluation, weight inheritance, and progressive search. First, experiments are run to verify the validity of these techniques. Then, preliminary experimental results on two popular computer vision benchmarks show that our method consistently outperforms random search and yields promising results within reasonable GPU budgets. An additional experiment also shows that ImmuNeCS’s solutions transfer effectively to a more difficult task, where they achieve results comparable to a direct search on the new task. We believe these findings can open the way for new, accessible alternatives to traditional NAS.
Keywords: Neural Architecture Search, NAS, Neural Network Committee, Artificial Immune Systems
Neural architecture search (NAS) has become one of the most active topics within deep learning. Its purpose is to develop algorithms that are able to automatically discover optimal neural network architectures for a given task. The main objective is to reduce the amount of time spent by human practitioners in designing architectures, thus saving costs or making deep learning more accessible to organizations and people lacking the expertise. Moreover, there are two additional potential benefits of using NAS: firstly, some specific use-cases might not be efficiently tackled with existing standard architectures from the literature and may therefore require an approach that is not biased by human priors. Secondly, many experiments show that NAS-based models often achieve extremely competitive performance on benchmarks while being more compact or efficient. This can find applications in resource-limited environments such as mobile devices.
Although early works [1, 2] outperformed human-designed architectures on benchmark datasets, they required thousands of GPU-days and were therefore accessible to only a few organizations around the world. As a result, one of the main challenges in NAS has become efficiency improvement, regardless of the search strategy employed. Many solutions have been proposed, examples of which include performance prediction [3, 4, 5, 6], weight inheritance [2, 7, 8], network morphism [9, 8, 10], and parameter sharing [11, 12, 13, 14, 15, 16].
While they effectively improve efficiency, some of the efforts mentioned above also significantly increase the conceptual and algorithmic complexity of their associated methods. To some extent, this defeats the purpose of making deep learning more accessible, in so far as a user might want to understand the algorithm they are using and not simply execute a series of instructions.
Most recent NAS approaches [5, 14, 12, 8] use a cell-based architecture search space, as per [17, 18]. While this approach dramatically reduces the size of the search space (see Section 2.1), it is motivated by the human affinity for regular patterns, as seen in classical architectures such as ResNet and DenseNet. To the best of our knowledge, there is no theoretical justification for such regular structures, and a machine-led search might end up discovering better architectures if freed from this prior.
The series of limitations mentioned above have led us to approach the NAS problem from a different angle, with the aim of providing a method that is simple, flexible, effective, and efficient. We first observe that typical NAS learning curves show rapid progress in the early phases of the search, followed by a long plateau period where only minor improvements are made. Most of the search time is spent in that plateau phase. Secondly, NAS methods typically search for a single architecture or cell, and discard the others. If instead, we retain the members of the final population and ensemble them into a neural network committee (NNC), we can gain an economical boost in the accuracy of our model. In turn, this allows us to stop the search earlier, towards the beginning of the plateau phase, and recover the missing performance by means of ensembling.
To enable such a gain, the members of the population must be individually competent, but also diverse in the classification errors they make [21, 22, 23, 24, 25, 26]. This is achieved by employing an artificial immune system (AIS), a class of bio-inspired algorithms that has proven to be able to find multiple high-quality local optima [27, 28].
Another contribution of this work is that we verify three assumptions commonly made in the field of NAS research, but which are rarely verified: the locality property of the search space, the correlation between partial evaluation and final performance of architectures, and the validity of progressively growing neural networks, in other words the correlation between the performance of an architecture, and the performance of the same architecture with a slight increase in complexity.
The remainder of this paper is organized as follows: we first give a brief overview of the research in fields related to this work in Section 2. In Section 3, we then describe ImmuNeCS (Immune-inspired Neural Committee Search) in detail, including the search space and search algorithm. Section 4 presents a series of experiments that we ran to 1) verify three common assumptions in NAS that are rarely validated; 2) assess the performance and efficiency of our method on two common datasets, and 3) evaluate the ability of an ImmuNeCS-produced population of neural nets to transfer to another task. We conclude by discussing the advantages and limitations of our approach and exploring future research directions.
2 Related Work
In this section, we present some key advances in the field of NAS, particularly around efficiency improvement. We then give a brief overview of AIS algorithms and NNCs.
2.1 Neural Architecture Search
NAS is a field of deep learning that has attracted much attention over the last few years. Early works [1, 2] achieved promising results on benchmark datasets at the cost of thousands of GPU-days. The promise shown by these papers triggered intensive research to try and improve the efficiency of NAS methods. Here, we describe some solutions that have come out of this general effort, which can often be combined. We refer the reader to the survey literature for a comprehensive review of NAS.
Cell-based search space
Given the complexity of deep neural networks, the search space can potentially be infinite, so all NAS approaches come up with some way of reducing it to a more computationally tractable size. Inspired by state-of-the-art hand-crafted architectures [19, 31, 32], most recent NAS works search for motifs called cells rather than full architectures, an idea first proposed in [17, 18]. These cells are repeated in a pre-defined way to generate models. This greatly reduces the search space and enables transferability to more complex tasks by simply increasing the number of assembled cells. It also allows the search to be performed on relatively small architectures, with the final model then being "augmented" with additional cells prior to training. The speed-up associated with this form of architecture search is around one order of magnitude.
However, cell-based search limits one’s ability to explore one of the most intriguing aspects of NAS, namely its potential for discovering macro-architectures that humans would not have come up with. In this research, we choose not to follow this strategy and instead use a limited selection of low- or high-level blocks that the search algorithm can assemble in any way it deems most effective, similar to the work in .
Some of the most significant gains in efficiency have been obtained by training only one large model and sampling subgraphs from it that share its weights [11, 12, 14, 15, 16]. The approaches using this strategy routinely report speed-ups of two to three orders of magnitude. However, recent research [34, 35] has cast some doubt over the superiority of weight sharing methods over a well-designed random search, because the performance ranking of subgraphs using the shared weights might not accurately represent the ranking of final, trained-from-scratch models.
Given a set of $K$ possible operations for each of $B$ positions (i.e. layer or cell element), the number of possible architectures is $K^B$. However, by making the search progressive, it is possible to achieve $O(K \times B)$: starting from a trivial or small architecture, one can search among the $K$ possible configurations for the next position, and repeat the process to incrementally grow the network. This approach is adopted in [36, 37] in the context of a cell-based search, and has also been applied to full architectures. Progressive search has been reported to yield efficiency gains of 3 to 5 times in the number of models evaluated.
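To make the gap between the two regimes concrete, the short sketch below counts candidate evaluations for an exhaustive search and for a greedy progressive search. The function names, and the assumption that a single configuration is kept after each growth step, are illustrative choices and are not taken from the cited works.

```python
def exhaustive_count(num_ops: int, num_positions: int) -> int:
    """Number of distinct architectures when every position is chosen freely."""
    return num_ops ** num_positions


def progressive_count(num_ops: int, num_positions: int) -> int:
    """Evaluations needed when positions are filled one at a time.

    Assumes (for illustration) that only the single best configuration
    is kept after each growth step.
    """
    return num_ops * num_positions


if __name__ == "__main__":
    for positions in (2, 5, 10):
        print(positions,
              exhaustive_count(8, positions),
              progressive_count(8, positions))
```

With 8 operations and 5 positions, the exhaustive count is 32768 while the greedy progressive count is 40. Population-based progressive methods keep more than one candidate per step, so their actual cost lies between these two extremes.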
Progressive search assumes that there is a better chance of obtaining a strong candidate when building up from an architecture that is already strong, an assumption that we test in this research (see Section 4.1).
In any NAS strategy, candidate evaluation is one of the key bottlenecks, as training each neural network can take several GPU-hours. As a result, a common solution to reduce this time is to partially evaluate candidates, i.e. train them on a subset of the training set and/or for only a few epochs [17, 4, 5]. This is based on the assumption that there is strong rank-correlation between the accuracy of partially trained networks and that of the same architectures subjected to full training. This assumption is not always true in practice; we therefore run an empirical verification in this paper before adopting this strategy.
NAS methods generally derive new architectures from previously evaluated ones by applying some form of transformation. Weight inheritance is the idea that in such cases, the weights learned by the parent architectures are carried over to the child candidate in all components (i.e. layers or cell elements) that have not been altered during the transformation process. The inherited weights can then be fine-tuned, or even frozen completely. In this paper, we use weight inheritance as a starting point for fine-tuning.
Network morphism [39, 40, 7, 10] is a more radical form of weight inheritance. Modifications to the architecture are designed to preserve the function that they represent, either exactly or approximately, so that all the weights learned so far can be re-used.
Both forms of network morphism allow child architectures to be warm-started with weights that have been learned by their parent, thus reducing the evaluation time.
2.2 Artificial Immune Systems
AIS algorithms are a class of evolutionary algorithms that date back to the early 2000s and are inspired by theories related to the mammalian immune system [41, 42]. They rely on the idea of a population of antibodies that can proliferate and mutate depending on their affinity to detected antigens, while also interacting among themselves to maintain diversity. When an antibody has been able to bind to an antigen, it enters a memory bank that speeds up and amplifies the immune response in case of a future encounter with the same antigen or one closely related to it (secondary response).
In the context of optimization tasks, one of the simplest AIS algorithms is the Clonal Selection Algorithm, or Clonalg. It has shown its ability to locate multiple optima and implicitly account for multiple possible solutions. Subsequent developments have taken advantage of interactions between population members to promote diversity, leading to the Opt-AINet algorithm. In this paper, we use a variant of Clonalg, with inspiration from Opt-AINet and provisions made for progressive search (see Section 2.1). Compared to the Genetic Algorithm (GA), another popular population-based search method, AIS have the added benefit of not requiring crossover, i.e. the recombination of different parts of the “genomes” of two individuals to create an offspring. It is not obvious how one would define such a crossover operation: layers co-adapt during training, and simply combining a section of one candidate with a section of another candidate seems ill-suited to the neural network paradigm. Related work using a GA-like algorithm seems to reach the same conclusion and forgoes crossover entirely.
2.3 Neural Network Committees
Neural network committees are simply ensemble models using neural networks. As such, they have been studied for a long time, including in the context of convolutional neural networks (CNNs), and can benefit from most of the research around ensembling strategies. The performance of an ensemble essentially depends on two factors: the quality of each model in the ensemble, and their error diversity. However, it is not always clear how to achieve this error diversity, or even how to define it. A natural idea is to explicitly optimize models for both accuracy and diversity. One such approach promotes diversity by using a modified fitness function within an evolutionary algorithm that discovers a population of neural networks. That work presents similarities with ours; however, the emphasis of its search algorithm is on general hyperparameters rather than complete architectures, and its experiments are limited to the MNIST dataset and simple models.
The alternative to explicit diversity enforcement is to use a search algorithm that is inherently able to maintain diversity. AIS algorithms appear to be effective in this regard. In particular, a prior study compares Opt-AINet trained with accuracy as its sole objective with another AIS (Omni-AINet) specifically designed to optimize for both accuracy and diversity. It shows that the single-objective Opt-AINet can outperform the multi-objective setting.
3 Description of ImmuNeCS
In this section, we first explain the observations that motivate our method, then detail the main components of ImmuNeCS.
3.1 Key Observations
As with many optimization processes, NAS tends to exhibit a fast rate of improvement in the early phases of the search. However, progress gradually becomes more difficult, and more and more time is required to achieve further gains (regardless of the metric used to measure them). If we had at our disposal a method to economically boost the accuracy of our model(s), we might be able to stop the search much earlier and still recover a similar level of performance.
Plurality of solutions
A common feature of all NAS methods is that they evaluate many candidates during the search, whose performance gradually improves. In the context of evolutionary algorithms for instance, this means that the final generation of candidates should all have relatively good performance. Most NAS methods retain the absolute best candidate based on some criterion (typically validation accuracy, or Pareto dominance in multi-objective search) and discard the knowledge accumulated by the others.
Neural Committee Search
Motivated by the two observations above, we choose to follow a different approach from most existing NAS methods. Instead of discarding the final generation of network architectures, we ensemble them, which achieves a significant improvement in prediction accuracy compared to even the best network in the population (typically close to 1 percentage point in our experiments). This in turn allows us to use relatively aggressive termination criteria for the search (see Section 3.4), thus completing the whole process within a reasonable compute budget. This idea, which we dub Neural Committee Search (NCS), shifts the NAS problem from focusing on a single architecture to growing a competent but diverse population of classifiers.
We represent an architecture as a directed acyclic graph where nodes correspond to layers/blocks and edges to tensors (see Figure 1). Node $n_0$ is the start or input node, which receives data samples. Each subsequent node $n_i$ ($i \geq 1$) has at least one incoming edge, which comes from node $n_{i-1}$. On the CIFAR-10 task, each hidden layer except $n_1$ can have a second incoming edge from any node $n_j$ with $j < i-1$, thus allowing skip connections. The start node is the only node that is allowed an outdegree greater than 2. On the Fashion-MNIST task, all nodes have indegree 1.
Each node has an Aggregation and an Operation. Aggregations are used to combine tensors from earlier layers. They are None when the node has an indegree of 1 but can be Add or Concatenate when it is 2, as described in Section 3.3. Operations apply some mathematical transformation to their input tensor (e.g. convolution, ResNet block…). As per Table 1, each Operation can have several hyperparameters.
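The connectivity rules above can be captured in a small data structure. The following is a minimal sketch rather than the encoding actually used by ImmuNeCS: the `Node` class, its field names, the `is_valid` helper, and the reading that a skip connection must originate strictly before the direct predecessor are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    operation: str                            # e.g. "Conv", "Pool", "Identity"
    hyperparams: Dict[str, float] = field(default_factory=dict)
    skip_from: Optional[int] = None           # index of the optional second input
    aggregation: Optional[str] = None         # None, "Add" or "Concatenate"


def is_valid(nodes: List[Node]) -> bool:
    """Check the connectivity rules on a list of nodes (index 0 = start node)."""
    out_degree = [0] * len(nodes)
    for i, node in enumerate(nodes):
        if i == 0:
            # The start node has no incoming edge at all.
            if node.skip_from is not None or node.aggregation is not None:
                return False
            continue
        out_degree[i - 1] += 1                # mandatory edge from the predecessor
        if node.skip_from is None:
            # Indegree 1: no aggregation is needed.
            if node.aggregation is not None:
                return False
        else:
            # Indegree 2: the skip must come from before the predecessor
            # (assumed rule), so node 1 can never have one.
            if not 0 <= node.skip_from < i - 1:
                return False
            if node.aggregation not in ("Add", "Concatenate"):
                return False
            out_degree[node.skip_from] += 1
    # Only the start node may feed more than two other nodes.
    return all(d <= 2 for d in out_degree[1:])
```

A strictly sequential network, as used on Fashion-MNIST, is simply a list in which every node keeps `skip_from=None`.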
3.3 Search Space
The search space of any NAS method requires careful consideration. If it is too permissive, the dimensionality becomes unmanageable and the search becomes intractable. If it is too restrictive, the algorithm may be prevented from discovering interesting architectures. Any search space design will necessarily include some human bias in the choice of operations that are accessible, the hyperparameters of these operations, or the ways they can be connected to one another. However, as stated in Section 2.1, we would still like the search to be able to surprise us with solutions that humans would not have thought of. In this research, we design two different search spaces depending on the dataset used in the experiments. We chose to design relatively small search spaces, but with solutions that might appear unconventional compared to other NAS methods.
Fashion-MNIST search space
Fashion-MNIST is a dataset of small, 28×28-pixel greyscale pictures of ten classes of clothing items. It is a good choice for developing and testing a classification model because it is simple enough that our method can complete the search in about one day on two NVIDIA RTX 2080 Ti GPUs, yet complex enough that there can be significant differences in final accuracy between different experimental settings. On this task, we restrict the search space to strictly sequential architectures (no skip connections). Each layer can be a convolution, a depthwise separable convolution, a pooling layer or an Identity function (the raison d’être of the Identity function is to make it possible to neutralize an existing layer during subsequent mutations, see Section 3.4). Each of these operations except Identity has three hyperparameters with a number of discrete values that they can take (see Table 1). Note that we deliberately make two decisions:
Somewhat contrary to common practice, Conv and DSepConv layers do not necessarily include batch normalization nor an activation function. In practice, the algorithm does indeed make use of this freedom.
For simplicity, pooling layers are solely responsible for widening via their Channel multiplier parameter. Other layers do not change the number of channels.
The input layer takes data batches and applies a pointwise convolution to them to increase the channel count from 1 to 64, with neither batch normalization nor activation. The network’s final layer, the classifier, is also fixed and follows a common structure in CNNs. It first applies adaptive global concatenation pooling, whose number of output channels is twice that of the feature extraction part of the network. Then the data goes through batch normalization and dropout. The dropout rate is set at 20% based on side experiments. Finally, a fully-connected layer reduces the number of activations to the number of classes of the task at hand (10).
CIFAR-10 search space
CIFAR-10 is one of the most popular computer vision tasks in recent NAS research. It consists of small, 32×32-pixel color images, distributed across ten classes. It is a much more challenging task than Fashion-MNIST and requires deeper architectures with skip connections. Letting the algorithm incrementally assemble simple layers as with Fashion-MNIST would be very slow, yet for reasons explained in Section 2.1, we want to avoid the cell-based search paradigm.
We therefore design the search space around a menu of high-level blocks, an idea inspired by prior work. These blocks are taken from the classical deep learning literature [19, 20, 31], but we do not impose any restrictions on the order or the number of each block in the final architecture. As before, each block type has hyperparameters that the AIS can choose from (see Table 1). Here again, the input layer is simply a pointwise convolution with an output width of 32. The classifier part is the same as with Fashion-MNIST.
Each node $n_i$ is decoded into a block function $f_i$, which receives a tensor $x_{i-1}$ from block $f_{i-1}$ and, optionally, a tensor $x_j$ from any block $f_j$ with $j < i-1$. The tensors $x_{i-1}$ and $x_j$ can be either summed or concatenated along the channel dimension. Mismatches between spatial dimensions are resolved by bilinear interpolation, whereas channel counts are aligned by pointwise convolutions without batch normalization or activation function. The output $x_i$ from $f_i$ can thus be:
$x_i = f_i(x_{i-1})$ if $f_i$ has only one input (which is only certain for $f_1$),
$x_i = f_i(x_{i-1} + x_j)$ or $x_i = f_i([x_{i-1}, x_j])$ otherwise (where $[\cdot, \cdot]$ denotes concatenation along the channel dimension, ignoring spatial and width adjustments with slight abuse of notation).
Unlike most recent works in NAS, we do not predefine where the data should be downsampled or the number of channels increased in the macro-architecture. Instead, these manipulations are freely decided by the search algorithm. In theory, and given enough network evaluations, this allows the algorithm to adapt the receptive field and number of channels to local requirements within each architecture, rather than mandating a cell structure that can work around fixed values for both these structural decisions.
3.4 Search by an AIS
We first give an overview of our Clonalg-derived search algorithm, before detailing its key components: cloning and mutation, random insertions, and augmentation.
The search is conducted by an AIS, starting from a population of networks comprising a small number of random hidden layers. The networks then go through partial training and are evaluated on a validation set. Their validation accuracy represents their affinity to the task.
Subsequently, at every generation, a fixed number of clones is generated for each network. These clones undergo mutations to their connections, aggregation and operation (see details below). The resulting architectures are trained and evaluated, and the best candidates from the pool of parents and clones are retained to form the next generation. The mean affinity of the population is then computed, and a few new random networks are inserted into the population to explore new regions of the search space. If the mean affinity does not improve by more than a threshold $\tau$ within a patience period of $p$ generations, the population goes through augmentation: the algorithm generates clones of each individual and appends one random layer to each of them. This progressive search mechanism allows the AIS to generate minimal networks for each task. A layer can be modified through mutation even after the addition of subsequent layers, which prevents models from being locked in the sub-region of the search space defined by the previous layers.
The whole process then restarts from the cloning and mutation step. The search terminates when $p$ consecutive augmentation phases have not yielded sufficient improvement, i.e. the mean affinity has not improved by more than $\tau$. Note that these two hyperparameters control the threshold and patience for both the inner loop (how many generations to wait for an improvement before augmenting the population) and the outer loop (how many augmentations to wait for an improvement before stopping the search). By changing the values of $\tau$ and $p$, one can make the search stop earlier or later in the learning curve. In our experiments, we used $p = 2$. The threshold $\tau$ was usually set at 0.0075 for Fashion-MNIST and 0.003 for CIFAR-10, which correspond to relatively steep parts of the respective learning curves (an improvement of 0.75 or 0.3 percentage points in mean accuracy within two generations). The pseudo-code is provided in Algorithm 1.
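The control flow of the two nested loops can be sketched on a toy problem. The code below is not Algorithm 1: architectures are replaced by lists of floats, training is replaced by a hypothetical `toy_affinity` function, and the mutation is a simplified single-gene Gaussian perturbation. Only the loop structure (clone, mutate, select, insert random individuals, augment on stagnation, stop after repeated unproductive augmentations) follows the description above; all names and default values are assumptions.

```python
import random


def toy_affinity(arch):
    """Stand-in for partial training + validation accuracy (hypothetical)."""
    return sum(arch) / (len(arch) + 1.0)


def mutate(arch, parent_affinity, rng):
    """Perturb one gene; weaker parents receive larger perturbations."""
    child = list(arch)
    k = rng.randrange(len(child))
    step = rng.gauss(0.0, 1.0 - parent_affinity)
    child[k] = min(1.0, max(0.0, child[k] + step))
    return child


def mean_affinity(population):
    return sum(toy_affinity(a) for a in population) / len(population)


def search(pop_size=6, n_clones=3, n_random=1, threshold=0.003,
           patience=2, max_generations=200, seed=0):
    rng = random.Random(seed)

    def random_arch(depth):
        return [rng.random() for _ in range(depth)]

    population = [random_arch(2) for _ in range(pop_size)]
    ref_gen = ref_aug = mean_affinity(population)   # inner / outer loop references
    stale_gens = stale_augs = 0
    history = [ref_gen]
    for _ in range(max_generations):
        # Cloning and mutation, then selection among parents and clones.
        pool = list(population)
        for parent in population:
            affinity = toy_affinity(parent)
            pool.extend(mutate(parent, affinity, rng) for _ in range(n_clones))
        pool.sort(key=toy_affinity, reverse=True)
        population = pool[:pop_size]
        mean = mean_affinity(population)
        history.append(mean)
        # Random insertions at the current average depth.
        avg_depth = round(sum(len(a) for a in population) / pop_size)
        population.extend(random_arch(avg_depth) for _ in range(n_random))
        # Inner loop: wait `patience` generations for an improvement.
        if mean > ref_gen + threshold:
            ref_gen, stale_gens = mean, 0
            continue
        stale_gens += 1
        if stale_gens < patience:
            continue
        # Outer loop: did the last augmentation phase pay off?
        if mean > ref_aug + threshold:
            ref_aug, stale_augs = mean, 0
        else:
            stale_augs += 1
        if stale_augs >= patience:
            break
        # Augmentation: originals stay, clones gain one random layer.
        population.extend(arch + [rng.random()] for arch in list(population))
        stale_gens = 0
    return population, history
```

Because parents compete with their clones during selection, the mean affinity recorded in `history` never decreases in this sketch, which is what makes a plateau-based stopping rule meaningful.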
Cloning and mutation
The mutation strategy is an important difference of AISs compared to traditional evolutionary methods such as the genetic algorithm. Here, the magnitude of the mutations that clones undergo is influenced by the affinity of their respective parents. This is a way of balancing exploitation and exploration: if a parent has high affinity, the AIS will focus on a small region around this parent. Conversely, when a parent has poor affinity, the AIS will extend the search to a wider area in an attempt to find more promising solutions. In addition, as shown in Equation (2), we assign linearly larger mutation variances to more recent layers, i.e. closer to the network’s head. This is because earlier layers have already had several opportunities to mutate and we do not want to generate more mutations than required, as they are costly in terms of re-training the network.
In practice, given a clone of depth (i.e. with layers/building blocks), the mutation rate is computed as:
with being the affinity of the clone’s parent architecture. Then for the layer or block, the strength of the mutation is randomly sampled as:
However, the mutation operator above assumes continuous values, whereas we are dealing with discrete features conditioned by other features. To resolve this issue, we store continuous values in the range for each hyperparameter and discretize them on the fly through binning when constructing each network (see Figure 2). This allows us to remember all mutations even when they were insufficient to alter the integer value of a hyperparameter (a similar strategy was employed in ).
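The interplay between affinity-dependent mutation and on-the-fly discretization can be sketched as follows. The exponential rate, the parameter `rho`, the Gaussian form of the perturbation, and the [0, 1] storage range are assumptions made for illustration; they are not claimed to match Equations (1) and (2).

```python
import math
import random


def mutation_rate(parent_affinity: float, rho: float = 3.0) -> float:
    """ASSUMED Clonalg-style rate: decays exponentially with parent affinity."""
    return math.exp(-rho * parent_affinity)


def mutation_strength(layer_idx: int, depth: int, rate: float,
                      rng: random.Random) -> float:
    """ASSUMED form: zero-mean Gaussian whose variance grows linearly with
    the 1-based layer position, so layers near the head mutate more."""
    variance = rate * layer_idx / depth
    return rng.gauss(0.0, math.sqrt(variance))


def apply_mutation(value: float, strength: float) -> float:
    """Perturb a stored continuous value and clip it to [0, 1] (assumed range)."""
    return min(1.0, max(0.0, value + strength))


def discretize(value: float, choices):
    """Map a continuous value in [0, 1] to one discrete choice by binning."""
    idx = min(int(value * len(choices)), len(choices) - 1)
    return choices[idx]


if __name__ == "__main__":
    kernel_sizes = [1, 3, 5]
    stored = 0.30
    print(discretize(stored, kernel_sizes))      # first bin
    stored = apply_mutation(stored, 0.02)        # too small to change the bin
    print(discretize(stored, kernel_sizes))
    stored = apply_mutation(stored, 0.02)        # accumulated shift crosses 1/3
    print(discretize(stored, kernel_sizes))
```

Storing the continuous value is what lets two small perturbations add up: neither changes the kernel size on its own, but together they move the stored value into the next bin.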
Each clone’s mutation sequence is described in Algorithm 2. To avoid overly drastic changes, we do not let mutations affect both the connectivity of the clone and its layers: if at least one connection has changed, we immediately move on to the next clone. Otherwise, we start mutating nodes: for each of them, we first perturb the aggregation type (if applicable, i.e. the node has two incoming edges). Next, we try to perturb the node’s operation type, then all of the node’s operation hyperparameters. All these perturbations use Equation (2), and at each step, we assess whether the discretized value has changed as per Figure 2. If it has changed, we skip the next steps and move on to the next clone. By proceeding in this way, we ensure that only small elements of the networks are changed at every round of mutation, which keeps the clones’ performance consistent with that of their parents (see the experiment on locality and mutation operation in Section 4.1).
To fully utilize the population’s capacity, we keep track of all architectures already evaluated using a unique string encoding scheme and do not allow mutated clones to be identical to previously encountered networks. All mutated clones inherit the weights learned by their parent, except in the layers that have been modified by mutations (as assessed after discretization), or whose number of channels has changed due to mutations to upstream components. Layers that do not inherit weights are initialized with He initialization.
Novelty introduction is a way to promote exploration during the search. In our method, it is also a way of improving diversity, which as we saw is essential to NNCs. Other evolutionary algorithms frequently resort to crossover operations. However, as argued in Section 2.2, such approaches seem ill-suited when it comes to neural networks. Instead, we simply insert a small number of random individuals into the population at every generation. To make sure that their capacity is comparable to the rest of the population, and therefore to give them a chance to survive the next selection operation, their depth is set at the current average depth of the population.
As mentioned in Section 2.1, we adopt a progressive search strategy. To this end, we periodically increase the capacity of the population by cloning all networks, removing the heads of all clones, appending one random layer to each (with the number of incoming edges sampled randomly), and rebuilding their heads (see Figure 3). It is worth noting that the original, non-augmented networks remain part of the population so that augmentations that do not bring any benefits can be ignored. As with mutations, clones inherit the weights learned by their parent, except for the new layer, which is initialized with He initialization.
Evaluation during search
Network graphs are only decoded into the actual neural network architectures at evaluation time or when inheriting weights from their parent. Otherwise, we only keep the graph representations in memory. As the training and evaluation of candidate solutions is by far the most resource-hungry part of the algorithm, we resort to partial evaluation. Each candidate is trained on only 20% of the training set, and we define an aggressive early stopping policy: If the validation accuracy does not improve by more than 0.5% on Fashion-MNIST (0.3% on CIFAR-10) for 2 epochs, training stops. In all cases, training is not allowed to exceed 15 epochs on Fashion-MNIST and 30 on CIFAR-10. At that point, the weights of the best epoch seen so far are saved, and the network is assigned the best validation accuracy as its affinity score. In addition to saving time, this strategy favors architectures that train fast.
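The early stopping policy can be expressed as a small function over a sequence of per-epoch validation accuracies. This is a minimal sketch under two assumptions: the improvement threshold is read as an absolute difference in accuracy (0.005 for 0.5%), and improvement is measured against the last accuracy that cleared the threshold. The function and parameter names are hypothetical.

```python
def partial_training_outcome(val_accuracies, min_delta, patience=2, max_epochs=15):
    """Return (epochs_run, affinity) under the early stopping policy.

    val_accuracies: validation accuracy after each epoch, in order.
    min_delta:      minimum improvement that resets the patience counter
                    (assumed to be an absolute accuracy difference).
    """
    reference = float("-inf")   # last accuracy that counted as an improvement
    best = float("-inf")        # best accuracy seen so far, used as affinity
    stale = 0
    epochs_run = 0
    for acc in val_accuracies[:max_epochs]:
        epochs_run += 1
        best = max(best, acc)
        if acc > reference + min_delta:
            reference, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run, best


if __name__ == "__main__":
    curve = [0.80, 0.85, 0.852, 0.853, 0.90]
    print(partial_training_outcome(curve, min_delta=0.005))
```

On the example curve, training stops after the fourth epoch because epochs three and four each gain less than 0.005, so the later jump to 0.90 is never observed. This illustrates how the policy favors architectures that train fast.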
Driven by the objective of improving the whole population, we monitor progress by computing the average population affinity at every generation, rather than the single best affinity as is typically the case in NAS.
3.5 Neural committee building
Once the search is completed, we are left with partially trained architectures. On the Fashion-MNIST task, all architectures are retained. On CIFAR-10, due to the longer time required to train each model, we retain the best candidates based on their affinity scores. We train the retained networks further without reinitializing their weights, on the full training set and for more epochs (see experimental details in Section 4).
The NNC’s class predictions are obtained by weighted soft majority vote. Given a data sample $x$ and a committee made up of $N$ neural nets with affinity scores $a_i$ ($1 \leq i \leq N$), we get a collection of $C$-dimensional probability vectors $p_i(x)$, where $C$ is the number of classes in our classification task. We then compute the following weighted sum:
$$s(x) = \sum_{i=1}^{N} a_i \, p_i(x)$$
Finally, the committee’s prediction is given by $\hat{y}(x) = \arg\max_{c \in \{1, \dots, C\}} s_c(x)$.
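A weighted soft vote of this kind takes only a few lines. The sketch below assumes that each member's weight is its raw affinity score, without normalization; the function name and the example numbers are hypothetical.

```python
def committee_predict(prob_vectors, affinities):
    """Weighted soft majority vote.

    prob_vectors: one class-probability list per committee member.
    affinities:   one weight per member (assumed to be its affinity score).
    Returns (predicted_class_index, weighted_scores).
    """
    if len(prob_vectors) != len(affinities):
        raise ValueError("one affinity score is required per committee member")
    n_classes = len(prob_vectors[0])
    scores = [0.0] * n_classes
    for probs, weight in zip(prob_vectors, affinities):
        for c, p in enumerate(probs):
            scores[c] += weight * p
    predicted = max(range(n_classes), key=scores.__getitem__)
    return predicted, scores


if __name__ == "__main__":
    member_probs = [[0.60, 0.40], [0.45, 0.55], [0.45, 0.55]]
    member_affinities = [0.9, 0.5, 0.3]
    print(committee_predict(member_probs, member_affinities)[0])
```

In this example two of the three members lean towards class 1, yet the committee predicts class 0 because the most competent member is both more confident and more heavily weighted. A hard majority vote would have returned class 1.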
4 Experiments and Results
In this section, we first describe experiments that were conducted to verify a number of common assumptions in NAS. We then present preliminary results from full-scale experiments on the two tasks and compare them to several existing NAS methods. Finally, we run a comparison to random search to verify that the performance of our method does not result solely from the search space design.
4.1 Assumption Validation
A central, although implicit, assumption made by NAS is that the search space of all architectures exhibits the locality property: architectures that are nearby in the hyperparameter space will perform similarly. However, this assumption is not commonly tested, so we run an experiment to verify it, and at the same time verify that our mutation operator is defined in a way that exploits this locality property. Moreover, as described in Section 3.4, we employ a few tricks to speed up the search: weight inheritance, partial evaluation, and progressive network growth. Weight inheritance is a common practice in many papers [2, 8] and its validity has been verified in prior work. However, the other two techniques require further validation.
Locality and mutation operator
To test for locality, we generate a population of 100 random networks in the Fashion-MNIST search space. To test the assumption for both shallow and deep networks, half the population has depth 3 and the other half, depth 9. Each network is evaluated against the Fashion-MNIST task, then 10 mutated clones are generated for each parent, using the mutation operator described in Section 3.4. For each parent, we compute the mean and standard deviation of its clones’ affinity scores and analyse their correlation to the parent’s affinity.
If the locality assumption holds, and if the mutation operator can take advantage of it, we can expect two things to happen. Firstly, the mean affinity of the clones should be correlated to the parent’s affinity. Secondly, as the mutation operator applies mutations with a larger variance to clones whose parent has a low affinity score, we should observe an inverse correlation between parent affinity and the variance in clone affinity.
Figure 4 illustrates the correlations found for the mean and the standard deviation, which are in line with expectations. It also appears that correlations are stronger in the deep network regime, which makes intuitive sense, as a single mutated value modifies a smaller proportion of the network.
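The locality check described above boils down to two correlation coefficients per depth regime. The sketch below shows one way to compute them; it assumes Pearson correlation (the text does not say which coefficient was used), and the function names and toy affinity values are hypothetical.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def locality_correlations(parent_affinity, clone_affinities):
    """Correlate each parent's affinity with the mean and the standard
    deviation of its mutated clones' affinities."""
    clone_means = [statistics.fmean(c) for c in clone_affinities]
    clone_stds = [statistics.stdev(c) for c in clone_affinities]
    return pearson(parent_affinity, clone_means), pearson(parent_affinity, clone_stds)


# Made-up affinities for three parents with three clones each.
parents = [0.70, 0.80, 0.90]
clones = [
    [0.60, 0.70, 0.80],  # weak parent: clones spread widely
    [0.75, 0.80, 0.85],
    [0.88, 0.90, 0.92],  # strong parent: clones stay close
]
r_mean, r_std = locality_correlations(parents, clones)
print(r_mean, r_std)
```

Under the locality assumption, the first coefficient should be positive and the second negative, which is the pattern this toy data was constructed to show.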
Here, we refer to partial evaluation as the training of candidate architectures with a subset of the training set and an aggressive early stopping policy, as presented in Section 3.4. As in the previous experiment, we generate random architectures of depths 3 and 9 in the Fashion-MNIST search space and evaluate them under the partial evaluation regime. Next, we train them to convergence on the full dataset (training details in Section 4.2), from the inherited weights.
We plot the full-training test accuracy values against the partial-training validation accuracy values in Figure 5. The correlation is very strong for shallow networks and somewhat weaker for deeper architectures. This can be explained by the fact that deeper architectures typically learn more slowly, so applying the same termination criteria to all network sizes might be suboptimal. Exploring alternative policies might constitute future work; nevertheless, the correlation is still robust even at depth 9, so partial evaluation remains indicative of the final accuracy. Moreover, by the time a depth of 9 is reached, most networks perform reasonably well and they are going to be ensembled anyway, so it is less critical for partial evaluation to provide an exact indication of the ultimate performance, which can be seen as a further benefit of our approach. Further experiments have also shown that these correlations are slightly weaker when weights are not inherited but reinitialized, which justifies the combination of partial training and weight inheritance.
Progressive search is the idea of incrementally building architectures by adding layers one by one. The underlying assumptions need to be verified by answering these two questions: 1) Does adding a layer to a strong candidate, in expectation, yield an even stronger candidate? 2) Given two candidates, does, in expectation, augmenting the stronger one yield a stronger candidate than augmenting the weaker one?
To test these assumptions, we again generate a population of random architectures of depths 3 and 9. We train them under the partial evaluation regime, measure their affinity score (validation accuracy), then augment each one of them with an additional layer (with inherited weights in all other layers) and repeat the evaluation process. Figure 6 addresses the two questions above: 1) Shallow networks almost always benefit from augmentation and, while this is less true for deep networks, there is no catastrophic loss of performance due to augmentation; 2) there is significant correlation between affinity scores of networks before and after augmentation.
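The two questions can be turned into two simple statistics: the mean affinity gain from augmentation (question 1) and the fraction of candidate pairs whose relative order is preserved after augmentation (question 2). The following is a minimal sketch under that reading; the pairwise-concordance measure is our choice for illustration rather than the statistic used in the paper, and the function name and numbers are hypothetical.

```python
from itertools import combinations


def augmentation_summary(before, after):
    """Summarize the effect of adding one layer to each candidate.

    before / after: affinity scores of the same candidates, measured
    before and after augmentation.
    Returns (mean gain, fraction of pairs whose ordering is preserved).
    """
    gains = [a - b for b, a in zip(before, after)]
    mean_gain = sum(gains) / len(gains)

    pairs = list(combinations(range(len(before)), 2))
    concordant = sum(
        1
        for i, j in pairs
        # Same sign before and after: the stronger candidate stays stronger.
        if (before[i] - before[j]) * (after[i] - after[j]) > 0
    )
    return mean_gain, concordant / len(pairs)


# Made-up affinities for four candidates.
before = [0.80, 0.85, 0.88, 0.90]
after = [0.84, 0.87, 0.91, 0.90]
mean_gain, concordance = augmentation_summary(before, after)
print(mean_gain, concordance)
```

A positive mean gain supports the first assumption, and a concordance well above 0.5 supports the second.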
4.2 Full-scale experiments
Having verified the assumptions behind ImmuNeCS, we can now turn to full-scale experiments on the Fashion-MNIST and CIFAR-10 tasks. All experiments were run on NVIDIA 2080Ti GPUs.
Search: population size (the small population size is made possible by the relatively small search space and the progressive search strategy), number of clones per parent , number of augmented copies per parent , number of random networks inserted at each generation , mutation factor , patience , threshold , maximum number of generations ;
Partial evaluation: training set size samples (20% of the total training data available), validation set size samples, batch size, data pre-processing: pad with zeros to 34x34 and crop back to 28x28, initial learning rate , optimizer: Adam with cosine annealing  (no restart; the choice of cosine annealing is justified by its good anytime performance, as in ), , , weight decay , early stopping patience epochs, early stopping threshold , maximum number of epochs (most evaluations stop around 10 epochs given these early stopping criteria);
Final training: retain architectures; training hyperparameters are identical to partial evaluation except: training set size samples, initial learning rate , 1 restart after 35 epochs, fixed number of epochs (no early stopping).
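The two scheduling ingredients in these settings, cosine annealing (optionally with restarts) and patience-based early stopping, can be sketched as follows. This is an illustrative sketch only: it assumes that the learning rate is annealed to zero within each cycle and that an epoch counts as an improvement only if validation accuracy rises by more than the threshold. The function names and all numeric values other than the restart after 35 epochs are hypothetical placeholders.

```python
import math


def cosine_annealing_lr(epoch, base_lr, total_epochs, restarts=()):
    """Learning rate at the start of `epoch` (0-indexed).

    The rate follows half a cosine from `base_lr` down towards 0 within each
    cycle; `restarts` lists the epochs at which a new cycle begins.
    """
    boundaries = [0, *sorted(restarts), total_epochs]
    for start, end in zip(boundaries, boundaries[1:]):
        if start <= epoch < end:
            progress = (epoch - start) / (end - start)
            return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
    raise ValueError("epoch is outside the schedule")


def stopping_epoch(val_accuracies, patience, threshold):
    """Number of epochs actually run before early stopping triggers."""
    best = float("-inf")
    stale = 0  # consecutive epochs without sufficient improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best + threshold:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracies)


# Hypothetical values: 70 epochs in total with one restart after 35 epochs.
lr_start = cosine_annealing_lr(0, base_lr=1e-3, total_epochs=70, restarts=(35,))
lr_before_restart = cosine_annealing_lr(34, base_lr=1e-3, total_epochs=70, restarts=(35,))
lr_after_restart = cosine_annealing_lr(35, base_lr=1e-3, total_epochs=70, restarts=(35,))

# Made-up validation curve that plateaus after the third epoch.
stopped_at = stopping_epoch([0.50, 0.60, 0.65, 0.651, 0.652, 0.70],
                            patience=2, threshold=0.005)
print(lr_start, lr_before_restart, lr_after_restart, stopped_at)
```

Note that the toy validation curve would have improved again at the sixth epoch; an aggressive stopping policy trades such late gains for a cheaper evaluation.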
The results obtained are summarized and compared to representative works in Table 2. Our method achieves an average over 6 runs of 94.17% (standard deviation 0.15%), outperforming the others by a significant margin while requiring a comparable amount of resources. Admittedly, measuring resources in GPU-days is somewhat inaccurate because hardware improves over time; however, this metric can still give us an estimate of each method’s efficiency.
|Model||Reported||# Runs||Test accuracy||GPU-days||Method||Cell-based||Comments|
|NASH (as implemented and reported in )||Best||1||91.95%||0.5||
|Gradient Evolution||Median (Best)||10||90.58% (91.36%)||–||Evolution||No|
|DeepSwarm||Mean (Best)||5||93.25% (93.56%)||1.2||Swarm Optimization||No|
|ImmuNeCS||Mean (Best)||6||94.17% (94.39%)||2.4||AIS||No|
Given the larger number of possible operation types (and thus, the denser sampling space), we sample them from , keeping the definition of provided in Section 3.4. All other items are sampled as before.
Partial evaluation: training set size samples, validation set size samples, batch size=, data pre-processing: reflection padding to 40x40 and crop back to 32x32 + cutout  and random flip, initial learning rate , early stopping threshold , maximum number of epochs .
Final training: to save on retraining time, we retain only the best architectures ranked by affinity score; training hyperparameters are identical to partial evaluation except: training set size samples, initial learning rate , 2 restarts after 30 and 90 epochs, fixed number of epochs (no early stopping).
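The pad-and-crop augmentation used during CIFAR-10 training can be illustrated without any dependency. The sketch below handles a single-channel image stored as a list of rows and assumes that "crop back" means a random crop; cutout is omitted for brevity. The function names are hypothetical.

```python
import random


def reflect_index(i, size):
    """Mirror an out-of-range index back into [0, size) without repeating
    the border pixel (valid while the padding is smaller than `size`)."""
    if i < 0:
        return -i
    if i >= size:
        return 2 * (size - 1) - i
    return i


def pad_crop_flip(image, pad, rng):
    """Reflection-pad by `pad` pixels on each side, take a random crop of the
    original size, and flip it horizontally with probability 0.5."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, 2 * pad)   # crop offset inside the padded image
    left = rng.randint(0, 2 * pad)
    flip = rng.random() < 0.5
    out = []
    for r in range(h):
        src_r = reflect_index(top + r - pad, h)
        row = [image[src_r][reflect_index(left + c - pad, w)] for c in range(w)]
        out.append(row[::-1] if flip else row)
    return out


# 32x32 toy image whose pixel values encode their own position.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
# pad=4 corresponds to padding 32x32 up to 40x40 before cropping back.
augmented = pad_crop_flip(image, pad=4, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```

Computing source indices directly avoids materializing the padded 40x40 image; a real pipeline would typically rely on the padding and cropping transforms of its deep learning framework instead.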
|Model||Reported||# Runs||Test accuracy||GPU-days||Method||Cell-based||Comments|
|ResNet 110||Mean (Best)||5||93.39% ± 0.16||–||Manual||–|
|NAS depth 15||Best||1||94.50%||12600||Reinforcement Learning||No|
|NAS depth 39||Best||1||95.53%||12600||Reinforcement Learning||No|
|NAS depth 39 + more filters||Best||1||96.35%||12600||Reinforcement Learning||No|
|NASNet-A 28M||Best||1||97.60%||1800||Reinforcement Learning||Yes||Augmentation of final model|
|NASNet-A 3.3M||Best||1||97.35%||1800||Reinforcement Learning||Yes||Augmentation of final model|
|Large-Scale Evolution||Mean (Best)||5||94.1% ± 0.4 (94.6%)||3000||Evolution||No|
|AmoebaNet-B 2.8M||Mean||5||97.45% ± 0.05||4500 (TPU)||Evolution||Yes||Augmentation of final model|
|AmoebaNet-B 34M||Mean||5||97.87% ± 0.04||4500 (TPU)||Evolution||Yes||Augmentation of final model|
|NASH single model||Mean||4||94.80%||1||
|NASH snapshot ensemble||Mean||4||95.30%||2||
|LEMONADE Search Space I||Best||1||96.50%||56||
|LEMONADE Search Space II||Best||1||96.60%||56||
|Yes||Augmentation of final model|
|ENAS Macro||Best||1||95.77%||0.3||Weight sharing||No|
|ENAS Micro||Best||1||97.11%||0.5||Weight sharing||Yes||Augmentation of final model|
|DARTS 1st Order||Best||1||97.05%||1.5||Differentiable||Yes||Augmentation of final model|
|DARTS 2nd Order||Best of 4 runs||4||97.17% ± 0.06||4||Differentiable||Yes||Augmentation of final model|
|CGP-CNN||Mean (Best)||3||93.95% (94.34%)||27||EA||No|
The initial results and comparisons on the CIFAR-10 task are summarized in Table 3. Although further experiments are required, ImmuNeCS seems to achieve a competitive balance of performance and efficiency, particularly among methods that do not use a cell-based search space. We point out that the search space has not been optimized in any way; we simply reproduced six blocks from traditional hand-crafted architectures. It is very likely that a more thorough analysis of appropriate block candidates would improve these results significantly. One could even imagine a cell-based approach to design a few blocks, followed by the same high-level architecture search as above to combine these blocks (similar to ). Moreover, whereas a number of papers apply some form of post-processing specifically at the final retraining stage (hyperparameter search, augmentation, etc.), in our method the final architectures found by the algorithm are retrained without any modification or hyperparameter optimization.
4.3 Comparison with Random Search
Some NAS methods have come under question because their superiority over Random Search (RS) could not be established in rigorous experiments [34, 35]. This might indicate that the techniques they use add complexity without helping performance, and that the performance mostly comes from the designed search spaces rather than the search algorithms. It is therefore important to compare ImmuNeCS with RS to ensure this is not the case with our approach. To this end, we run ImmuNeCS seven times on Fashion-MNIST at different evaluation budgets, which are obtained by varying the search-stopping threshold and the maximum number of generations. We then run a full RS at different budget values spanning the range covered by ImmuNeCS. In each of these RS runs, and to ensure a fair comparison, the generated architectures’ number of layers is sampled around the mean depth of the corresponding AIS runs.
The results in Figure 7 show that the search by an AIS provides substantial benefits over RS ( for the null hypothesis: "Both methods yield the same average committee test accuracy"). The wider distribution of the AIS results is due to the fact that our method exhibits a stronger correlation between performance and the number of network evaluations than RS. Therefore, varying the evaluation budget spreads the RS results less.
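For readers who want to run a comparable significance check on their own results, a two-sample permutation test on the mean committee accuracy is one simple option. The test used in the paper is not specified here, so this is a stand-in technique rather than a reproduction; the function name is hypothetical and the accuracy values are made-up placeholders, not results from our experiments.

```python
import random


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups yield the same average value, so group
    labels are exchangeable.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the runs
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Placeholder accuracies for two search methods (illustrative only).
method_a = [0.91, 0.93, 0.92, 0.94, 0.93, 0.92, 0.94]
method_b = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.87]
p_value = permutation_p_value(method_a, method_b)
print(p_value)
```

A permutation test makes no normality assumption, which is convenient when only a handful of runs per method are available.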
We hypothesize that the plurality of ImmuNeCS’s solutions helps them transfer to a different task and achieve competitive results even when the architecture search was conducted on an easier task. We therefore run ImmuNeCS on MNIST, a very simple task by modern standards, using the same search space and search parameters as described in Sections 3.3 and 4.2 for the Fashion-MNIST task. We obtain an NNC of 12 CNNs that are typically shallower than those obtained through a direct search on the Fashion-MNIST task. We train them first on MNIST and obtain a final accuracy of 99.6%. We then train the same architectures on Fashion-MNIST and measure a final accuracy of 92.8%, which is still among the best results presented in Table 2. Crucially, the performance ranking of architectures on MNIST is completely different from that on Fashion-MNIST, indicating that a 1-best-focused search would be unlikely to return the best architecture for both tasks. This opens up some interesting perspectives, such as a form of transfer learning where an NNC could be discovered on a task with plenty of labelled data, then used to produce predictions on a task with sparser data. It could also allow an NNC to be discovered on a relatively simple task, and further refined on the more complex one.
5 Conclusion and Discussion
We have presented ImmuNeCS, a novel approach to the automatic design of deep learning systems. Instead of focusing on producing a single architecture, we promote a diverse population of competent models that are ensembled to achieve competitive results with reasonable resource requirements. To improve efficiency, we use techniques whose underlying assumptions are verified through dedicated experiments. We also show that our method outperforms random search in a fair comparison, and remark that the NNC found on a simple task is able to generalize well to a more complex task.
Furthermore, our method presents other benefits: 1) it is conceptually simple and therefore approachable for non-experts, 2) it does not enforce an architecture based on repeated identical cells, and is thus able to discover irregular patterns, 3) it is flexible and can accommodate many search spaces. One obvious drawback is that inference is slower, as we need to aggregate the predictions of all members of the NNC rather than a single model. However, in applications where predictions are not required in real time, this could be an acceptable trade-off.
Future work includes expanding the CIFAR-10 search space to further improve performance, and running experiments on tasks from the field of medical imaging to assess ImmuNeCS’s flexibility and usefulness in real-life situations. In order to make the algorithm more efficient, we also intend to investigate methods that could help guide the search towards promising areas of the search space.
-  Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. In International Conference on Learning Representations, 2017.
-  Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-Scale Evolution of Image Classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2902–2911. JMLR.org, 2017.
-  Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI, volume 15, pages 3460–3468, 2015.
-  Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search. arXiv preprint arXiv:1807.06906, 2018.
-  Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized Evolution for Image Classifier Architecture Search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789, 2019.
-  Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating Neural Architecture Search Using Performance Prediction. arXiv preprint arXiv:1705.10823, 2017.
-  Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Simple and Efficient Architecture Search for Convolutional Neural Networks. arXiv preprint arXiv:1711.04528, 2017.
-  Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient Multi-Objective Neural Architecture Search via Lamarckian Evolution. arXiv preprint arXiv:1804.09081, 2018.
-  Haifeng Jin, Qingquan Song, and Xia Hu. Auto-Keras: An Efficient Neural Architecture Search System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1946–1956. ACM, 2019.
-  Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Shreyas Saxena and Jakob Verbeek. Convolutional Neural Fabrics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4053–4061. Curran Associates, Inc, 2016.
-  Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient Neural Architecture Search via Parameter Sharing. In International Conference on Machine Learning, pages 4092–4101, 2018.
-  Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and Simplifying One-Shot Architecture Search. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 550–559. PMLR, 10–15 Jul 2018.
-  Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable Architecture Search. arXiv preprint arXiv:1806.09055, 2018.
-  Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In International Conference on Learning Representations, 2019.
-  Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic Neural Architecture Search. In International Conference on Learning Representations, 2019.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.
-  Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423–2432, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
-  G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity Creation Methods: A Survey and Categorisation. Information Fusion, 6(1):5–20, 2005.
-  G. P. Coelho and F. J. Von Zuben. The Influence of the Pool of Candidates on the Performance of Selection and Combination Techniques in Ensembles. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 5132–5139, 2006.
-  Rodrigo Pasti and Leandro Nunes de Castro. The influence of diversity in an immune-based algorithm to train mlp networks. In Leandro Nunes de Castro, Fernando Jose Von Zuben, and Helder Knidel, editors, Proceedings of the 6th International Conference on Artificial Immune Systems (ICARIS), pages 71–82, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
-  Rodrigo Pasti, Leandro Nunes de Castro, Guilherme Palermo Coelho, and Fernando Jose Von Zuben. Neural Network Ensembles: Immune-Inspired Approaches to the Diversity of Components. Natural Computing, 9(3):625–653, 2010.
-  P. A. D. Castro and F. J. Von Zuben. Learning Ensembles of Neural Networks by Means of a Bayesian Artificial Immune System. IEEE Transactions on Neural Networks, 22(2):304–316, 2011.
-  X. Zeng, D. F. Wong, and L. S. Chao. Constructing Better Classifier Ensemble Based on Weighted Accuracy and Diversity Measure. The Scientific World Journal, 2014:961747, 2014.
-  Leandro N. De Castro and Fernando J. Von Zuben. Learning and Optimization Using the Clonal Selection Principle. IEEE Transactions on Evolutionary Computation, 6(3):239–251, 2002.
-  L. N. de Castro and J. Timmis. An Artificial Immune Network for Multimodal Function Optimization. In Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on, volume 1, pages 699–704, 2002.
-  Marius Lindauer and Frank Hutter. Best practices for scientific research on neural architecture search. arXiv preprint arXiv:1909.02453, 2019.
-  Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Architecture Search: A Survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
-  Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. arXiv preprint arXiv:1704.00764, 2017.
-  Christian Sciuto, Kaicheng Yu, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the Search Phase of Neural Architecture Search. arXiv preprint arXiv:1902.08142, 2019.
-  Liam Li and Ameet Talwalkar. Random Search and Reproducibility for Neural Architecture Search. arXiv preprint arXiv:1902.07638, 2019.
-  Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan L. Yuille, Jonathan Huang, and Kevin Murphy. Progressive Neural Architecture Search. arXiv preprint arXiv:1712.00559, 2017.
-  Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. Ppp-net: Platform-aware progressive search for pareto-optimal neural architectures. In International Conference on Learning Representations (ICLR) Workshop, 2018.
-  Edvinas Byla and Wei Pang. Deepswarm: Optimising convolutional neural networks using swarm intelligence. arXiv preprint arXiv:1905.07350, 2019.
-  Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.
-  Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. In International Conference on Machine Learning, pages 564–572, 2016.
-  F. M. Burnet. Clonal selection and after, volume 63 of Theoretical Immunology, page 85. Marcel Dekker Inc, 1978.
-  N. K. Jerne. Towards a network theory of the immune system. Collect. Ann. Inst. Pasteur, 125(1-2):373–389, 1974.
-  John Henry Holland et al. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992.
-  Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, 1990.
-  D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Convolutional neural network committees for handwritten character classification. In International Conference on Document Analysis and Recognition (ICDAR), 2011, pages 1135–1139, 2011.
-  E. Bochinski, T. Senst, and T. Sikora. Hyper-parameter optimization for convolutional neural network committees based on evolutionary algorithms. In Proceedings - International Conference on Image Processing, ICIP; 24th IEEE International Conference on Image Processing, ICIP 2017, volume 2017-September, pages 3924–3928, 2018.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 1(4), 2009.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
-  Jeremy Howard. Fast.ai Deep Learning 2019 - Lesson 6, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016.
-  N. Mitschke, M. Heizmann, K. Noffz, and R. Wittmann. Gradient based evolution to optimize the structure of convolutional neural networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3438–3442, Oct 2018.
-  Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, and N. Duffy. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.