ImmuNeCS: Neural Committee Search by an Artificial Immune System

Luc Frachon
l.frachon.18@abdn.ac.uk
Wei Pang
pang.wei@abdn.ac.uk
George M. Coghill
g.coghill@abdn.ac.uk
Department of Computing Science, University of Aberdeen, UK
Abstract

Current Neural Architecture Search techniques can suffer from a few shortcomings, including high computational cost, excessive bias from the search space, conceptual complexity or uncertain empirical benefits over random search. In this paper, we present ImmuNeCS, an attempt at addressing these issues with a method that offers a simple, flexible, and efficient way of building deep learning models automatically, and we demonstrate its effectiveness in the context of convolutional neural networks. Instead of searching for the 1-best architecture for a given task, we focus on building a population of neural networks that are then ensembled into a neural network committee, an approach we dub Neural Committee Search. To ensure sufficient performance from the committee, our search algorithm is based on an artificial immune system that balances individual performance with population diversity. This allows us to stop the search when accuracy starts to plateau, and to bridge the performance gap through ensembling. In order to justify our method, we first verify that the chosen search space exhibits the locality property. To further improve efficiency, we also combine partial evaluation, weight inheritance, and progressive search. First, experiments are run to verify the validity of these techniques. Then, preliminary experimental results on two popular computer vision benchmarks show that our method consistently outperforms random search and yields promising results within reasonable GPU budgets. An additional experiment also shows that ImmuNeCS’s solutions transfer effectively to a more difficult task, where they achieve results comparable to a direct search on the new task. We believe these findings can open the way for new, accessible alternatives to traditional NAS.

Keywords: Neural Architecture Search, NAS, Neural Network Committee, Artificial Immune Systems

1 Introduction

Neural architecture search (NAS) has become one of the most active topics within deep learning. Its purpose is to develop algorithms that are able to automatically discover optimal neural network architectures for a given task. The main objective is to reduce the amount of time spent by human practitioners in designing architectures, thus saving costs or making deep learning more accessible to organizations and people lacking the expertise. Moreover, there are two additional potential benefits of using NAS: firstly, some specific use-cases might not be efficiently tackled with existing standard architectures from the literature and may therefore require an approach that is not biased by human priors. Secondly, many experiments show that NAS-based models often achieve extremely competitive performance on benchmarks while being more compact or efficient. This can find applications in resource-limited environments such as mobile devices.

While early works [1, 2] could outperform human-designed architectures on benchmark datasets, they required thousands of GPU-days and are therefore only accessible to a few organizations around the world. Consequently, one of the main challenges in NAS has become improving efficiency, regardless of the search strategy employed. Many solutions have been proposed, examples of which include performance prediction [3, 4, 5, 6], weight inheritance [2, 7, 8], network morphism [9, 8, 10], and parameter sharing [11, 12, 13, 14, 15, 16].

While they effectively improve efficiency, some of the efforts mentioned above also significantly increase the conceptual and algorithmic complexity of their associated methods. To some extent, this defeats the purpose of making deep learning more accessible, insofar as a user might want to understand the algorithm they are using rather than simply execute a series of instructions.

Most recent NAS approaches [5, 14, 12, 8] use a cell-based architecture search space, as per [17, 18]. While this approach dramatically reduces the size of the search space (see Section 2.1), it is motivated by the human affinity for regular patterns, as seen in classical architectures such as ResNet [19] and DenseNet [20]. To the best of our knowledge, there is no theoretical justification for such regular structures and a machine-led search might end up discovering better architectures if free from this prior.

This series of limitations has led us to approach the NAS problem from a different angle, with the aim of providing a method that is simple, flexible, effective, and efficient. We first observe that typical NAS learning curves show rapid progress in the early phases of the search, followed by a long plateau period where only minor improvements are made. Most of the search time is spent in that plateau phase. Secondly, NAS methods typically search for a single architecture or cell and discard the others. If, instead, we retain the members of the final population and ensemble them into a neural network committee (NNC), we can gain an economical boost in the accuracy of our model. In turn, this allows us to stop the search earlier, towards the beginning of the plateau phase, and recover the missing performance by means of ensembling.

To enable such a gain, the members of the population must be individually competent, but also diverse in the classification errors they make [21, 22, 23, 24, 25, 26]. This is achieved by employing an artificial immune system (AIS), a class of bio-inspired algorithms that has proven to be able to find multiple high-quality local optima [27, 28].

Another contribution of this work is that we verify three assumptions commonly made in the field of NAS research but rarely tested: the locality property of the search space, the correlation between partial evaluation and final performance of architectures, and the validity of progressively growing neural networks, in other words the correlation between the performance of an architecture and that of the same architecture with a slight increase in complexity.

The remainder of this paper is organized as follows: we first give a brief overview of the research in fields related to this work in Section 2. In Section 3, we then describe ImmuNeCS (Immune-inspired Neural Committee Search) in detail, including the search space and search algorithm. Section 4 presents a series of experiments that we ran to 1) verify three common assumptions in NAS that are rarely validated; 2) assess the performance and efficiency of our method on two common datasets; and 3) evaluate the ability of an ImmuNeCS-produced population of neural nets to transfer to another task. We conclude by discussing the advantages and limitations of our approach and exploring future research directions.

2 Related Work

In this section, we present some key advances in the field of NAS, particularly around efficiency improvement. We then give a brief overview of AIS algorithms and NNCs.

2.1 Neural Architecture Search

NAS is a field of deep learning that has attracted much attention over the last few years [29]. Early works [1, 2] achieved promising results on benchmark datasets at the cost of thousands of GPU-days. The promise shown by these papers triggered intensive research to improve the efficiency of NAS methods. Here, we describe some solutions that have come out of this general effort, which can often be combined (we refer to [30] for a comprehensive review of NAS).

Cell-based search space

Given the complexity of deep neural networks, the search space can potentially be infinite, so all NAS approaches come up with some way of reducing it to a more computationally tractable size. Inspired by state-of-the-art hand-crafted architectures [19, 31, 32], most recent NAS works search for motifs called cells rather than full architectures, an idea first proposed in [17, 18]. These cells are repeated in a pre-defined way to generate models. This greatly reduces the search space and enables transferability to more complex tasks by simply increasing the number of assembled cells. It also allows the search to be performed on relatively small architectures, with the final model "augmented" before training by increasing the number of cells. The speed-up associated with this form of architecture search is around one order of magnitude [17].

However, cell-based search limits one’s ability to explore one of the most intriguing aspects of NAS, namely its potential for discovering macro-architectures that humans would not have come up with. In this research, we choose not to follow this strategy and instead use a limited selection of low- or high-level blocks that the search algorithm can assemble in any way it deems most effective, similar to the work in [33].

Weight sharing

Some of the most significant gains in efficiency have been obtained by training only one large model and sampling subgraphs from it that share its weights [11, 12, 14, 15, 16]. The approaches using this strategy routinely report speed-ups of two to three orders of magnitude. However, recent research [34, 35] has cast some doubt over the superiority of weight sharing methods over a well-designed random search, because the performance ranking of subgraphs using the shared weights might not accurately represent the ranking of final, trained-from-scratch models.

Progressive search

Given a set of K possible operations for each of L positions (i.e. layers or cell elements), the number of possible architectures is K^L. However, by making the search progressive, it is possible to reduce the number of evaluations to the order of K·L: starting from a trivial or small architecture, one can search among the K possible configurations for the next position, and repeat the process to incrementally grow the network. This approach is adopted in [36, 37] in the context of a cell-based search and in [38] on full architectures. The progressive search method allows [36] to report efficiency gains of 3 to 5 times in the number of models evaluated.
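As a toy illustration of the difference (using the K and L defined above, with arbitrary values):

# Toy illustration: exhaustive vs. progressive search with K options per position
K, L = 10, 12                      # arbitrary example values
exhaustive = K ** L                # every full architecture evaluated separately
progressive = K * L                # K candidates evaluated at each of the L growth steps
print(exhaustive, progressive)     # 1000000000000 vs. 120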

Progressive search assumes that there is a better chance of obtaining a strong candidate when building up from an architecture that is already strong, an assumption that we test in this research (see Section 4.1).

Partial evaluation

In any NAS strategy, candidate evaluation is one of the key bottlenecks, as training each neural network can take several GPU-hours. As a result, a common solution to reduce this time is to partially evaluate candidates, i.e. train them on a subset of the training set and/or for only a few epochs [17, 4, 5]. This is based on the assumption that there is strong rank-correlation between the accuracy of partially trained networks and that of the same architectures subjected to full training. This assumption is not always true in practice [4]; we therefore run an empirical verification in this paper before adopting this strategy.

Weight inheritance

NAS methods generally derive new architectures from previously evaluated ones by applying some form of transformation. Weight inheritance is the idea that, in such cases, the weights learned by the parent architecture are carried over to the child candidate in all components (i.e. layers or cell elements) that have not been altered during the transformation. The inherited weights can then be fine-tuned [2], or even frozen completely [8]. In this paper, we use weight inheritance as a starting point for fine-tuning.

Network morphism [39, 40, 7, 10] is a more radical form of weight inheritance. Modifications to the architecture are designed to preserve the function that they represent, either exactly or approximately [8], so that all the weights learned so far can be re-used.

Both forms of network morphism allow child architectures to be warm-started with weights that have been learned by their parent, thus reducing the evaluation time.

2.2 Artificial Immune Systems

AIS algorithms are a class of evolutionary algorithms that date back to the early 2000s and are inspired by theories related to the mammalian immune system [41, 42]. They rely on the idea of a population of antibodies that can proliferate and mutate depending on their affinity to detected antigens, while also interacting among themselves to maintain diversity. When an antibody has been able to bind to an antigen, it enters a memory bank that speeds up and amplifies the immune response in case of a future encounter with the same antigen or one closely related to it (the secondary response).

In the context of optimization tasks, one of the simplest AIS algorithms is the Clonal Selection Algorithm, or Clonalg [27]. It has shown its ability to locate multiple optima and implicitly account for multiple possible solutions. Subsequent developments have taken advantage of interactions between population members to promote diversity, leading to the Opt-AINet algorithm [28]. In this paper, we use a variant of Clonalg, with inspiration from Opt-AINet and provisions made for progressive search (see Section 2.1). Compared to the Genetic Algorithm (GA) [43], another popular population-based search method, AISs have the added benefit of not requiring crossover or recombination of candidates, i.e. the recombination of different parts of the “genomes” of two individuals to create an offspring. It is not obvious how one would define such a crossover operation: layers co-adapt during training, and simply combining a section of one candidate with a section of another seems ill-suited to the neural network paradigm. The authors of [2] seem to reach the same conclusion and forego crossover entirely, despite using an algorithm related to GA.

2.3 Neural Network Committees

Neural network committees are simply ensemble models using neural networks. As such, they have been studied for a long time [44], including in the context of convolutional neural networks (CNN) [45], and can benefit from most of the research around ensembling strategies. The performance of an ensemble essentially depends on two factors: the quality of each model in the ensemble, and their error diversity. However, it is not always clear how to achieve this error diversity, or even how to define it [21]. A natural idea is to explicitly optimize models for accuracy and diversity [26]. The approach in [46] promotes diversity by using a modified fitness function to discover a population of neural networks with an evolutionary algorithm. Their work presents similarities with ours; however, the emphasis of their search algorithm is on general hyperparameters rather than complete architectures, and their experiments are limited to the MNIST dataset [47] and simple models.

The alternative to explicit diversity enforcement is to use a search algorithm that is inherently able to maintain diversity. It seems that AIS algorithms are effective in this regard [23]. In particular, the authors of [24] compare Opt-AINet trained with accuracy as its sole objective with another AIS (Omni-AINet) specifically designed to optimize for both accuracy and diversity. They show that the single-objective Opt-AINet can outperform the multi-objective setting.

3 Description of ImmuNeCS

In this section, we first explain the observations that motivate our method, then detail the main components of ImmuNeCS.

3.1 Key Observations

Diminishing returns

As with many optimization processes, NAS tends to exhibit a fast rate of improvement in the early phases of the search. However, progress gradually becomes more difficult, and more and more time is required to achieve further gains (regardless of the metric used to measure them). If we had at our disposal a method to economically boost the accuracy of our model(s), we might be able to stop the search much earlier and still recover a similar level of performance.

Plurality of solutions

A common feature of all NAS methods is that they evaluate many candidates during the search, whose performance gradually improves. In the context of evolutionary algorithms for instance, this means that the final generation of candidates should all have relatively good performance. Most NAS methods retain the absolute best candidate based on some criterion (typically validation accuracy, or Pareto dominance in multi-objective search) and discard the knowledge accumulated by the others.

Neural Committee Search

Motivated by the two observations above, we choose to follow a different approach to most existing NAS methods. Instead of discarding the final generation of network architectures, we ensemble them, which achieves a significant improvement in the prediction accuracy compared to even the best network in the population (typically close to 1%pt in our experiments). This in turn allows us to use relatively aggressive termination criteria for the search (see Section 3.4), thus completing the whole process within a reasonable compute budget. This idea, which we dub Neural Committee Search (NCS), shifts the NAS problem from focusing on a single architecture to growing a competent but diverse population of classifiers.

3.2 Representation

Figure 1: Graph representation of an architecture with three hidden layers.

We represent an architecture as a directed acyclic graph where nodes correspond to layers/blocks and edges to tensors (see Figure 1). Node v_0 is the start or input node, which receives data samples. Each subsequent node v_i has at least one incoming edge, which comes from node v_{i-1}. On the CIFAR-10 [48] task, each hidden layer except v_1 can have a second incoming edge from any node v_j with j < i-1, thus allowing skip connections. The start node is the only node that is allowed an outdegree greater than 2. On the Fashion-MNIST [49] task, all nodes have indegree 1.

Each node has an Aggregation and an Operation. Aggregations are used to combine tensors from earlier layers. They are None when the node has an indegree of 1 but can be Add or Concatenate when it is 2, as described in Section 3.3. Operations apply some mathematical transformation to their input tensor (e.g. convolution, ResNet block…). As per Table 1, each Operation can have several hyperparameters.
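For illustration, the sketch below shows one possible Python encoding of this representation; the class and field names (Node, ArchitectureGraph, skip_from, etc.) are ours and not taken from the ImmuNeCS implementation.

from dataclasses import dataclass, field
from typing import Optional

# Illustrative encoding of one node of the architecture graph.
# Hyperparameters are stored as continuous values in [0, 1] and
# discretized on the fly when the network is built (see Section 3.4).
@dataclass
class Node:
    operation: str                                     # e.g. "Conv", "Pool", "ResNetBlock", "Identity"
    hyperparams: dict = field(default_factory=dict)    # continuous values in [0, 1]
    aggregation: Optional[str] = None                  # None, "Add" or "Concat" (indegree 2 only)
    skip_from: Optional[int] = None                    # index of an optional second input node

@dataclass
class ArchitectureGraph:
    nodes: list                                        # nodes[0] is the fixed input stem

# Example: a three-hidden-layer sequential architecture (Fashion-MNIST style)
arch = ArchitectureGraph(nodes=[
    Node("InputStem"),
    Node("Conv", {"kernel_size": 0.6, "batchnorm": 0.8, "relu": 0.2}),
    Node("Pool", {"type": 0.1, "kernel_size": 0.4, "ch_multiplier": 0.9}),
    Node("DSepConv", {"kernel_size": 0.3, "batchnorm": 0.1, "relu": 0.7}),
])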

3.3 Search Space

The search space of any NAS method requires careful consideration. If it is too permissive, the dimensionality becomes unmanageable and the search intractable. If it is too restrictive, the algorithm may be prevented from discovering interesting architectures. Any search space design will necessarily include some human bias in the choice of operations that are accessible, the hyperparameters of these operations, or the ways they can be connected to one another. However, as stated in Section 2.1, we would still like the search to be able to surprise us with solutions that humans would not have thought of. In this research, we design two different search spaces depending on the dataset used in the experiments. We choose to design relatively small search spaces, but ones whose solutions might appear unconventional compared to those of other NAS methods.

Fashion-MNIST search space

Fashion-MNIST is a dataset of small, 28×28-pixel greyscale pictures of ten classes of clothing items. It is a good choice for developing and testing a classification model because it is simple enough that our method can complete the search in about one day on two NVidia RTX 2080Ti GPUs, yet complex enough that there can be significant differences in final accuracy between different experimental settings. On this task, we restrict the search space to strictly sequential architectures (no skip connections). Each layer can be a convolution, a depthwise separable convolution, a pooling layer or an Identity function (the raison d’être of the Identity function is to possibly neutralize an existing layer during subsequent mutations, see Section 3.4). Each of these operations except Identity has three hyperparameters, each with a set of discrete values it can take (see Table 1). Note that we deliberately make two decisions:

  • Somewhat contrary to common practice, Conv and DSepConv layers do not necessarily include batch normalization or an activation function. In practice, the algorithm does indeed make use of this freedom.

  • For simplicity, pooling layers are solely responsible for widening via their Channel multiplier parameter. Other layers do not change the number of channels.

The input layer takes data batches and applies a pointwise convolution to increase the channel count from 1 to 64, with no batch normalization or activation. The network’s final layer, the classifier, is also fixed and follows a common structure in CNNs. It first applies adaptive global concatenation pooling, whose output has twice as many channels as the feature-extraction part of the network [50]. The data then goes through batch normalization and dropout. The dropout rate is set at 20% based on side experiments. Finally, a fully-connected layer reduces the number of activations to the number of classes of the task at hand (10).
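As an illustration, a PyTorch-style sketch of this fixed classifier head is given below, assuming that the adaptive global concatenation pooling concatenates adaptive average and max pooling (following [50]); the module and function names are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatPool2d(nn.Module):
    """Adaptive global pooling that concatenates average- and max-pooled
    features, doubling the channel count of the feature extractor's output."""
    def forward(self, x):
        avg = F.adaptive_avg_pool2d(x, 1)
        mx = F.adaptive_max_pool2d(x, 1)
        return torch.cat([avg, mx], dim=1).flatten(1)

def make_classifier_head(feature_channels: int, num_classes: int = 10) -> nn.Module:
    # Concat pooling -> batch norm -> 20% dropout -> fully-connected layer
    return nn.Sequential(
        ConcatPool2d(),
        nn.BatchNorm1d(2 * feature_channels),
        nn.Dropout(p=0.2),
        nn.Linear(2 * feature_channels, num_classes),
    )

# Example: head for a feature extractor that outputs 128 channels
head = make_classifier_head(feature_channels=128)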

CIFAR-10 search space

CIFAR-10 is one of the most popular computer vision tasks in recent NAS research. It consists of small, 32×32-pixel color images, distributed across ten classes. It is a much more challenging task than Fashion-MNIST and requires deeper architectures with skip connections. Letting the algorithm incrementally assemble simple layers as with Fashion-MNIST would be very slow, yet for the reasons explained in Section 2.1, we want to avoid the cell-based search paradigm.

Inspired by [33], we therefore decide to design the search space around a menu of high-level blocks. These are taken from the classical deep learning literature [19, 20, 31] but we do not impose any restrictions on the order nor the number of each block in the final architecture. As before, each block type has hyperparameters that the AIS can choose from (see Table 1). Here again, the input layer is simply a pointwise convolution with an output width of 32. The classifier part is the same as with Fashion-MNIST.

Fashion-MNIST search space:

Operation Type | Hyperparameter 1 {Possible values} | Hyperparameter 2 {Possible values} | Hyperparameter 3 {Possible values}
Conv | Kernel size {1, 3, 5, 7} | BatchNorm {yes/no} | ReLU {yes/no}
DSepConv | Kernel size {1, 3, 5, 7} | BatchNorm {yes/no} | ReLU {yes/no}
Pool | Type {Max, Avg} | Kernel size {3, 5} | Ch. multiplier {1, , , 2}
Identity | | |

CIFAR-10 search space:

Operation Type | Hyperparameter 1 {Possible values} | Hyperparameter 2 {Possible values}
ResNet Block [19] | Kernel size {3, 5} | Downsample {yes/no}
ResNet Bottleneck Block [19] | Kernel size {3, 5} | Downsample {yes/no}
DenseNet Block [20] | Growth factor {12, 24, 36} | Transition layer {yes/no}
DenseNet Bottleneck Block [20] | Growth factor {12, 24, 36} | Transition layer {yes/no}
Inception-ResNet Block A [31] | Kernel size {3, 5} | Bottleneck factor {0.1, 0.4, 0.75}
Inception-ResNet Block B [31] | Kernel size {3, 5} | Bottleneck factor {0.1, 0.4, 0.75}
Pool | Type {Max, Avg} | Kernel size {3, 5}
Identity | |

Table 1: Operations, hyperparameters and their possible values included in the Fashion-MNIST (top) and CIFAR-10 (bottom) search spaces. Ch. multiplier refers to the factor by which pooling layers increase the number of channels. Downsample indicates whether the block should reduce the spatial size of the activations by a factor 2 and increase the channel count by the same factor. Growth factor is the number of channels that are added to the data tensors as they go through the block. Transition layer indicates whether a compression layer is appended to reduce the channel count after the block by 50%. Bottleneck factor is an internal compression factor in the channel count that affects the middle branch of the block. The output of the block retains the same number of channels as its input. See the respective papers for details of the blocks.

Each node v_i is decoded into a block function f_i which receives a tensor x_{i-1} from block v_{i-1}, and optionally, a tensor x_j from any block v_j with j < i-1. x_{i-1} and x_j can be either summed up, or concatenated along the channel dimension. Mismatches between spatial dimensions are resolved by bilinear interpolation, whereas channel counts are aligned by pointwise convolutions without batch normalization or activation function. The output from v_i can thus be:

  • x_i = f_i(x_{i-1}) if v_i has only one input (which is only certain for v_1),

  • x_i = f_i(x_{i-1} + x_j) or x_i = f_i([x_{i-1}, x_j]) otherwise (where [·, ·] denotes concatenation along the channel dimension, ignoring spatial and width adjustments with slight abuse of notation).
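A minimal PyTorch sketch of this aggregation step is given below; the function names are ours, and in a real implementation the pointwise projection would be a registered module created when the graph is decoded, not on the fly.

import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate(x_prev, x_skip=None, mode=None):
    """Combine the mandatory input x_prev with an optional skip input x_skip."""
    if x_skip is None:
        return x_prev
    # Spatial mismatches: resolved by bilinear interpolation
    if x_skip.shape[2:] != x_prev.shape[2:]:
        x_skip = F.interpolate(x_skip, size=x_prev.shape[2:], mode="bilinear",
                               align_corners=False)
    if mode == "Add":
        # Channel mismatches: pointwise convolution, no BN, no activation
        if x_skip.shape[1] != x_prev.shape[1]:
            proj = nn.Conv2d(x_skip.shape[1], x_prev.shape[1], kernel_size=1).to(x_skip.device)
            x_skip = proj(x_skip)
        return x_prev + x_skip
    if mode == "Concat":
        return torch.cat([x_prev, x_skip], dim=1)
    raise ValueError("mode must be 'Add' or 'Concat' when two inputs are given")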

Unlike most recent works in NAS, we do not predefine where the data should be downsampled or the number of channels increased in the macro-architecture. Instead, these manipulations are freely decided by the search algorithm. In theory, and given enough network evaluations, this allows the algorithm to adapt the receptive field and number of channels to local requirements within each architecture, rather than mandating a cell structure that can work around fixed values for both these structural decisions.

3.4 Search by an AIS

We first give an overview of our Clonalg-derived search algorithm, before detailing its key components: cloning and mutation, random insertions, and augmentation.

Overview

The search is conducted by an AIS, starting from a population of N networks, each comprising a small number of random hidden layers. The networks then go through partial training and are evaluated on a validation set. Their validation accuracy represents their affinity to the task.

Subsequently, at every generation, a fixed number N_c of clones is generated for each network. These clones undergo mutations to their connections, aggregations and operations (see details below). The resulting architectures are trained and evaluated, and the N best candidates from the pool of parents and clones are retained to form the next generation. The mean affinity of the population is then computed, and a few new random networks are inserted into the population to explore new regions of the search space. If the mean affinity does not improve by more than a threshold T within a patience period of P generations, the population goes through augmentation: the algorithm generates N_a clones of each individual and appends one random layer to each of them. This progressive search mechanism allows the AIS to generate minimal networks for each task. A layer can be modified through mutation even after the addition of subsequent layers, which prevents models from being locked into the sub-region of the search space defined by the previous layers.

The whole process then restarts from the cloning and mutation step. The search terminates when P consecutive augmentation phases have not yielded sufficient improvement, i.e. the mean affinity has not improved by more than T. Note that these two hyperparameters control the threshold and patience for both the inner loop (how many generations to wait for an improvement before augmenting the population) and the outer loop (how many augmentations to wait for an improvement before stopping the search). By changing the values of P and T, one can make the search stop earlier or later in the learning curve. In our experiments, we used P = 2. T was usually set at 0.0075 for Fashion-MNIST and 0.003 for CIFAR-10, which correspond to relatively steep parts of the respective learning curves (0.75/0.3%pt improvement in mean accuracy within two generations). The pseudo-code is provided in Algorithm 1.

procedure SEARCH(N: population size, L_0: initial number of layers, β: mutation factor, N_c: number of clones per parent, N_r: number of random insertions, N_a: number of augmented networks per parent, P: patience, T: threshold)
     pop ← MakeRandomArchitectures(N, L_0)
     pop ← pop ∪ MakeAugmentedCopies(pop, N_a)            ▷ Population size = N·(1 + N_a)
     Evaluate(pop)                                        ▷ Assign affinity to each network
     while ExitCondition(P, T) not met do:
          Make training and validation datasets
          clones ← Clone(pop, N_c)                        ▷ N_c clones per parent
          clones ← MUTATE(clones, β)                      ▷ See Algorithm 2
          Evaluate(clones)
          pop ← SelectNBest(pop ∪ clones, N)              ▷ Population size = N
          avg_affinity ← ComputeAverageAffinity(pop)
          pop ← pop ∪ MakeRandomArchitectures(N_r)        ▷ Population size = N + N_r, with N_r small (1 or 2)
          Evaluate(pop)                                   ▷ Train and evaluate new architectures only
          if avg_affinity improved by less than T for P generations then:
               pop ← pop ∪ MakeAugmentedCopies(pop, N_a)  ▷ Population size = (N + N_r)·(1 + N_a)
               Evaluate(pop)                              ▷ Train and evaluate new architectures only
     return pop
Algorithm 1: ImmuNeCS Search

Cloning and mutation

The mutation strategy is an important difference between AISs and traditional evolutionary methods such as the genetic algorithm. Here, the magnitude of the mutations that clones undergo is influenced by the affinity of their respective parents. This is a way of balancing exploitation and exploration: if a parent has high affinity, the AIS will focus on a small region around this parent. Conversely, when a parent has poor affinity, the AIS will extend the search to a wider area in an attempt to find more promising solutions. In addition, as shown in Equation (2), we assign linearly larger mutation variances to more recent layers, i.e. those closer to the network’s head. This is because earlier layers have already had several opportunities to mutate, and we do not want to generate more mutations than required, as they are costly in terms of re-training the network.

Figure 2: Perturbation of continuous hyperparameter values and binning into discrete values. Top example: the strength of the perturbation was sufficient to change the discretized value of the operation type. Bottom example: the perturbation was too small to change the expression of the kernel size hyperparameter, but the change in its continuous value is recorded and becomes the starting point for future mutations. Best viewed in color.

In practice, given a clone of depth D (i.e. with D layers/building blocks), the mutation rate is computed as:

α = exp(−β · f)     (1)

with f being the affinity of the clone’s parent architecture and β the mutation factor. Then for the l-th layer or block (l = 1, …, D), the strength of its mutation is randomly sampled as:

δ_l ∼ N(0, (α · l / D)²)     (2)

However, the mutation operator above assumes continuous values, whereas we are dealing with discrete features conditioned by other features. To resolve this issue, we store continuous values in the range [0, 1] for each hyperparameter and discretize them on the fly through binning when constructing each network (see Figure 2). This allows us to remember all mutations even when they were insufficient to alter the integer value of a hyperparameter (a similar strategy was employed in [2]).
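The sketch below illustrates this perturb-then-discretize mechanism described by Equations (1) and (2) and Figure 2; the function names and the exact clipping behaviour are illustrative assumptions.

import math
import random

def mutation_rate(parent_affinity: float, beta: float) -> float:
    # Higher parent affinity -> smaller mutations (exploitation);
    # lower affinity -> larger mutations (exploration).
    return math.exp(-beta * parent_affinity)

def perturb(value: float, alpha: float, layer_idx: int, depth: int) -> float:
    # Layers closer to the head (larger layer_idx) get linearly larger variance.
    sigma = alpha * layer_idx / depth
    return min(1.0, max(0.0, value + random.gauss(0.0, sigma)))

def discretize(value: float, choices: list):
    # Bin a continuous value in [0, 1] onto one of the discrete choices.
    idx = min(int(value * len(choices)), len(choices) - 1)
    return choices[idx]

# Example: perturb the kernel-size hyperparameter of layer 3 in a 5-layer clone
alpha = mutation_rate(parent_affinity=0.92, beta=2.0)
continuous = perturb(0.41, alpha, layer_idx=3, depth=5)   # stored for future mutations
kernel_size = discretize(continuous, [1, 3, 5, 7])        # expressed (discrete) value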

procedure MUTATE(clones, β: mutation factor)
     for each clone c, with parent affinity f do:
          for each node (layer) v_i, i ≥ 2 do:                               ▷ Mutate connections, CIFAR-10 only
               PERTURB v_i's indegree                                        ▷ see explanation of PERTURB in Figure 2
               if v_i's indegree == 2 then:
                    if indegree was 1 before, sample a second input node     ▷ Uniform random from predecessors v_j, j < i−1
                    else PERTURB the second input node's index               ▷ The first input is always from v_{i−1}
          if c's connections are unchanged then:                             ▷ Assessed after discretization
               for each node (layer) v_i do:                                 ▷ Mutate aggregations and operations
                    if v_i's indegree == 2, PERTURB Aggregation type         ▷ Add or Concat
                    if (discrete) Aggregation type has not changed then:
                         PERTURB Operation type
                         if (discrete) Operation type has changed then:
                              Sample new Operation hyperparameters           ▷ Uniform random from [0, 1]
                         else
                              PERTURB each Operation hyperparameter
     return clones
Algorithm 2: Mutation operator pseudo-code

Each clone’s mutation sequence is described in Algorithm 2. To avoid overly drastic changes, we do not let mutations affect both the connectivity of the clone and its layers, so if at least one connection has changed, we immediately move on to the next clone. Otherwise, we start mutating nodes: for each of them, we first perturb the aggregation type (if applicable, i.e. when the node has two incoming edges). Next, we try to perturb the node’s operation type, then all of the node’s operation hyperparameters. All these perturbations use Equation (2), and at each step, we assess whether the discretized value has changed, as per Figure 2. If it has changed, we skip the next steps and move on to the next clone. By proceeding in this way, we ensure that only small elements of the networks are changed at every round of mutation, which maintains consistency between the parents’ and clones’ performance (see the experiment on locality and mutation operation in Section 4.1).

To fully utilize the population’s capacity, we keep track of all architectures already evaluated using a unique string encoding scheme and do not allow mutated clones to be identical to previously encountered networks. All mutated clones inherit the weights learned by their parent, except in the layers that have been modified by mutations (as assessed after discretization), or whose number of channels has changed due to mutations to upstream components. Layers that do not inherit weights are initialized with He initialization [51].
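A simplified sketch of this weight-inheritance step might look as follows (PyTorch-style; the helper names and the dictionary-of-layers bookkeeping are illustrative, and the actual implementation also tracks channel changes propagated from upstream mutations).

import torch.nn as nn

def inherit_weights(parent_layers: dict, clone_layers: dict, changed: set) -> None:
    """Copy weights from parent to clone for every layer that was not modified
    by mutation and whose shapes still match; other layers keep their fresh
    initialization."""
    for name, clone_layer in clone_layers.items():
        if name in changed or name not in parent_layers:
            continue  # mutated or new layer: keep fresh initialization
        p_state = parent_layers[name].state_dict()
        c_state = clone_layer.state_dict()
        if all(p_state[k].shape == c_state[k].shape for k in c_state):
            clone_layer.load_state_dict(p_state)

def he_init(module: nn.Module) -> None:
    # He (Kaiming) initialization for layers that do not inherit weights
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)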

Random insertions

Novelty introduction is a way to promote exploration during the search. In our method, it is also a way of improving diversity, which as we saw is essential to NNCs. Other evolutionary algorithms frequently resort to crossover operations. However, as argued in Section 2.2, such approaches seem ill-suited when it comes to neural networks. Instead, we simply insert a small number of random individuals into the population at every generation. To make sure that their capacity is comparable to the rest of the population, and therefore to give them a chance to survive the next selection operation, their depth is set at the current average depth of the population.

Network augmentations

As mentioned in Section 2.1, we adopt a progressive search strategy. To this end, we periodically increase the capacity of the population by cloning all networks, removing the heads of all clones, appending one random layer to each (with the number of incoming edges sampled randomly), and rebuilding their heads (see Figure 3). It is worth noting that the original, non-augmented networks remain part of the population so that augmentations that do not bring any benefits can be ignored. As with mutations, clones inherit the weights learned by their parent, except for the new layer, which is initialized with He initialization.

Figure 3: Network augmentation strategy for progressive search. Best viewed in color.
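In terms of the graph representation, the augmentation step amounts to something like the following sketch; the dictionary-based encoding and the operation menu shown here are illustrative only.

import copy
import random

# Illustrative menu of operations and their hyperparameter names
OPERATIONS = {
    "Conv": ["kernel_size", "batchnorm", "relu"],
    "DSepConv": ["kernel_size", "batchnorm", "relu"],
    "Pool": ["type", "kernel_size", "ch_multiplier"],
    "Identity": [],
}

def augment(architecture: list) -> list:
    """Clone an architecture (a list of layer dicts) and append one random layer
    with freshly sampled continuous hyperparameters in [0, 1] and a randomly
    sampled optional skip input. The classifier head is rebuilt later, when the
    graph is decoded for evaluation; the original architecture stays in the
    population."""
    clone = copy.deepcopy(architecture)
    op = random.choice(list(OPERATIONS))
    layer = {
        "operation": op,
        "hyperparams": {h: random.random() for h in OPERATIONS[op]},
        "skip_from": random.choice([None] + list(range(len(clone)))),
    }
    clone.append(layer)
    return clone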

Evaluation during search

Network graphs are only decoded into the actual neural network architectures at evaluation time or when inheriting weights from their parent. Otherwise, we only keep the graph representations in memory. As the training and evaluation of candidate solutions is by far the most resource-hungry part of the algorithm, we resort to partial evaluation. Each candidate is trained on only 20% of the training set, and we define an aggressive early stopping policy: If the validation accuracy does not improve by more than 0.5% on Fashion-MNIST (0.3% on CIFAR-10) for 2 epochs, training stops. In all cases, training is not allowed to exceed 15 epochs on Fashion-MNIST and 30 on CIFAR-10. At that point, the weights of the best epoch seen so far are saved, and the network is assigned the best validation accuracy as its affinity score. In addition to saving time, this strategy favors architectures that train fast.
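A sketch of this partial-evaluation policy is given below; train_one_epoch and validate stand for standard training and validation routines, and the default threshold, patience and epoch cap correspond to the Fashion-MNIST values quoted above.

def partial_evaluate(model, train_one_epoch, validate,
                     threshold=0.005, patience=2, max_epochs=15):
    """Train a candidate under the aggressive early-stopping policy and return
    its affinity (the best validation accuracy seen so far)."""
    best_acc, best_state, epochs_without_gain = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        acc = validate(model)
        if acc > best_acc + threshold:
            best_acc, epochs_without_gain = acc, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)  # keep the weights of the best epoch
    return best_acc  # used as the network's affinity score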

Driven by the objective of improving the whole population, we monitor progress by computing the average population affinity at every generation, rather than the single best affinity as is typically the case in NAS.

3.5 Neural committee building

Once the search is completed, we are left with partially trained architectures. On the Fashion-MNIST task, all architectures are retained. On CIFAR-10, due to the longer time required to train each model, we retain the best candidates based on their affinity scores. We train the retained networks further without reinitializing their weights, on the full training set and for more epochs (see experimental details in Section 4).

The NNC’s class predictions are obtained by weighted soft majority vote. Given a data sample x and a committee made up of M neural nets with affinity scores a_1, …, a_M, we get a collection of C-dimensional probability vectors p_1(x), …, p_M(x), where C is the number of classes in our classification task. We then compute the following weighted sum:

s(x) = Σ_{m=1}^{M} a_m · p_m(x)

Finally, the committee’s prediction is given by ŷ = argmax_c s_c(x).
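Concretely, the weighted soft vote can be computed as in the following NumPy sketch (variable names are ours):

import numpy as np

def committee_predict(probs: np.ndarray, affinities: np.ndarray) -> np.ndarray:
    """Weighted soft majority vote.

    probs:      array of shape (M, B, C) with each member's class probabilities
                for a batch of B samples and C classes.
    affinities: array of shape (M,) with each member's affinity score.
    Returns the predicted class index for each of the B samples.
    """
    weighted = np.tensordot(affinities, probs, axes=([0], [0]))  # shape (B, C)
    return weighted.argmax(axis=1)

# Example: 3 committee members, 2 samples, 4 classes
probs = np.random.dirichlet(np.ones(4), size=(3, 2))   # shape (3, 2, 4)
affinities = np.array([0.91, 0.93, 0.90])
print(committee_predict(probs, affinities))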

4 Experiments and Results

In this section, we first describe experiments that were conducted to verify a number of common assumptions in NAS. We then present preliminary results from full-scale experiments on the two tasks and compare them to several existing NAS methods. Finally, we run a comparison to random search to assert that the performance of our method does not only result from the search space design.

4.1 Assumption Validation

A central, although implicit, assumption made by NAS is that the search space of all architectures exhibits the locality property: Architectures that are nearby in the hyperparameter space will perform similarly. However, this is not an assumption that is commonly tested, so we run an experiment to verify it, and at the same time verify that our mutation operator is defined in a way that exploits this locality property. Moreover, as described in Section 3.4, we employ a few tricks to speed up the search: weight inheritance, partial evaluation, and progressive network growth. Weight inheritance is a common practice in many papers [2, 8] and its validity is verified in [7]. However, the other two techniques require further validation.

Locality and mutation operator

To test for locality, we generate a population of 100 random networks in the Fashion-MNIST search space. To test the assumption for both shallow and deep networks, half the population has depth 3 and the other half, depth 9. Each network is evaluated against the Fashion-MNIST task, then 10 mutated clones are generated for each parent, using the mutation operator described in Section 3.4. For each parent, we compute the mean and standard deviation of its clones’ affinity scores and analyse their correlation to the parent’s affinity.

If the locality assumption holds, and if the mutation operator can take advantage of it, we can expect two things to happen. Firstly, the mean affinity of the clones should be correlated to the parent’s affinity. Secondly, as the mutation operator applies mutations with a larger variance to clones whose parent has a low affinity score, we should observe an inverse correlation between parent affinity and the variance in clone affinity.
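The correlation analysis itself only requires a few lines of SciPy; the sketch below assumes arrays holding the parent affinities and the affinities of the 10 clones generated for each parent.

import numpy as np
from scipy.stats import spearmanr

def locality_check(parent_affinities, clone_affinities):
    """parent_affinities: shape (P,); clone_affinities: shape (P, 10), the
    affinities of the 10 mutated clones generated for each parent."""
    clone_mean = clone_affinities.mean(axis=1)
    clone_std = clone_affinities.std(axis=1)
    rho_mean, _ = spearmanr(parent_affinities, clone_mean)
    rho_std, _ = spearmanr(parent_affinities, clone_std)
    # Expectation if locality holds: rho_mean > 0 (clones track their parent)
    # and rho_std < 0 (weaker parents receive larger, noisier mutations).
    return rho_mean, rho_std

# Example with random placeholder data
rho_mean, rho_std = locality_check(np.random.rand(100), np.random.rand(100, 10))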

Figure 4 illustrates the correlations found for the mean and standard deviation, which are in line with expectations. It also appears that correlations are stronger in the deep-network regime, which makes intuitive sense, as a single mutated value modifies a smaller proportion of the network.

Figure 4: Validity check of the locality assumption, using the mean (a) and standard deviation (b) of clones' affinities vs. their respective parent's affinity; each panel reports Spearman rank correlations for depth = 3 and depth = 9. Dashed lines and grey areas indicate LOESS regression and its 95% confidence interval, respectively. Best viewed in color.

Partial evaluation

Here, we refer to partial evaluation as the training of candidate architectures with a subset of the training set and an aggressive early stopping policy, as presented in Section 3.4. As in the previous experiment, we generate random architectures of depths 3 and 9 in the Fashion-MNIST search space and evaluate them under the partial evaluation regime. Next, we train them to convergence on the full dataset (training details in Section 4.2), from the inherited weights.

We plot the full-training test accuracy values against the partial-training validation accuracy values in Figure 5. The correlation is very strong for shallow networks and somewhat weaker for deeper architectures. This can be explained by the fact that deeper architectures typically learn more slowly, so applying the same termination criteria to all network sizes might be suboptimal. Exploring alternative policies might constitute future work; nevertheless, the correlation is still robust even at depth 9, so that partial evaluation is still indicative of the final accuracy. Moreover, by the time a depth of 9 is reached, most networks perform reasonably well and they are going to be ensembled anyway, so it is less critical for partial evaluation to provide an exact indication of the ultimate performance – which can be seen as a further benefit of our approach. Further experiments have also shown that these correlations are slightly weaker when weights are not inherited but reinitialized, which justifies the combination of partial training and weight inheritance.

Figure 5: Validity check of the partial evaluation strategy. Spearman rank correlations are reported overall and for depth = 3 and depth = 9. Dashed lines and grey areas indicate LOESS regression and its 95% confidence interval, respectively. The solid line is the identity function, for reference. Best viewed in color.

Figure 6: Validity check of progressive search. Spearman rank correlations are reported overall and for depth = 3 and depth = 9. Dashed lines and grey areas indicate LOESS regression and its 95% confidence interval, respectively. The solid line is the identity function, for reference. Best viewed in color.

Progressive search

Progressive search is the idea of incrementally building architectures by adding layers one by one. The underlying assumptions need to be verified by answering these two questions: 1) Does adding a layer to a strong candidate, in expectation, yield an even stronger candidate? 2) Given two candidates, does, in expectation, augmenting the stronger one yield a stronger candidate than augmenting the weaker one?

To test these assumptions, we again generate a population of random architectures of depths 3 and 9. We train them under the partial evaluation regime, measure their affinity score (validation accuracy), then augment each one of them with an additional layer (with inherited weights in all other layers) and repeat the evaluation process. Figure 6 addresses the two questions above: 1) shallow networks almost always benefit from augmentation, and while this is less true for deep networks, there is no catastrophic loss of performance due to augmentation; 2) there is significant correlation between affinity scores of networks before and after augmentation.

4.2 Full-scale experiments

Having verified the assumptions behind ImmuNeCS, we can now turn to full scale experiments on the Fashion-MNIST and the CIFAR-10 tasks. All experiments were run on NVidia 2080Ti GPUs.

Fashion-MNIST

We use sequential architectures sampled from the Fashion-MNIST search space described in Section 3.3 and Table 1 (top). The meta-parameters used in the main experiment are:

  • Search: population size (the small population size is made possible by the relatively small search space and the progressive search strategy), number of clones per parent , number of augmented copies per parent , number of random networks inserted at each generation , mutation factor , patience , threshold , maximum number of generations ;

  • Partial evaluation: training set size samples (20% of the total training data available), validation set size samples, batch size, data pre-processing: pad with zeros to 34x34 and crop back to 28x28, initial learning rate , optimizer: Adam with cosine annealing [52] (no restart; the choice of cosine annealing is justified by its good anytime performance, as in [7]), , , weight decay , early stopping patience epochs, early stopping threshold , maximum number of epochs (most evaluations stop around 10 epochs given these early stopping criteria);

  • Final training: retain architectures; training hyperparameters are identical to partial evaluation except: training set size samples, initial learning rate , 1 restart after 35 epochs, fixed number of epochs (no early stopping).

The results obtained are summarized and compared to representative works in Table 2. Our method achieves an average over 6 runs of 94.17% (standard deviation 0.15%), outperforming the others by a significant margin while requiring a comparable amount of resources. Admittedly, measuring resources in GPU-days is somewhat inaccurate because hardware improves over time; however, this metric can still give us an estimate of each method's efficiency.

Model | Reported | # Runs | Test accuracy | GPU-days | Method | Cell-based | Comments
Auto-Keras [9] | Best | 1 | 92.56% | 0.5 | Network morphism + Bayesian optimization | No | Time-constrained
NASH [7] (as implemented and reported in [9]) | Best | 1 | 91.95% | 0.5 | Network morphism + Evolution | No | Time-constrained
Gradient Evolution [53] | Median (Best) | 10 | 90.58% (91.36%) | | Evolution | No |
DeepSwarm [38] | Mean (Best) | 5 | 93.25% (93.56%) | 1.2 | Swarm Optimization | No |
ImmuNeCS | Mean (Best) | 6 | 94.17% (94.39%) | 2.4 | AIS | No |
Table 2: Comparison of results of our method with others on the Fashion-MNIST task.

CIFAR-10

We sample architectures from the CIFAR-10 search space described in Section 3.3 and Table 1 (bottom). The meta-parameters are the same as with Fashion-MNIST, with the following exceptions:

  • Given the larger number of possible operation types (and thus, the denser sampling space), we adjust the distribution from which operation types are sampled, keeping the definition of the mutation rate provided in Section 3.4. All other items are sampled as before.

  • Partial evaluation: training set size samples, validation set size samples, batch size=, data pre-processing: reflection padding to 40x40 and crop back to 32x32 + cutout [54] and random flip, initial learning rate , early stopping threshold , maximum number of epochs .

  • Final training: to save on retraining time, we retain only the best architectures ranked by affinity score; training hyperparameters are identical to partial evaluation except: training set size samples, initial learning rate , 2 restarts after 30 and 90 epochs, fixed number of epochs (no early stopping).

Model | Reported | # Runs | Result | GPU-days | Method | Cell-based | Post-processing
ResNet 110 [19] | Mean (Best) | 5 | 93.39% ± 0.16 | | Manual | |
NAS depth 15 [1] | Best | 1 | 94.50% | 12600 | Reinforcement Learning | No | Search on hyperparameters before final retraining
NAS depth 39 [1] | Best | 1 | 95.53% | 12600 | Reinforcement Learning | No | Search on hyperparameters before final retraining
NAS depth 39 + more filters [1] | Best | 1 | 96.35% | 12600 | Reinforcement Learning | No | Search on hyperparameters before final retraining
NASNet-A 28M [17] | Best | 1 | 97.60% | 1800 | Reinforcement Learning | Yes | Augmentation of final model
NASNet-A 3.3M [17] | Best | 1 | 97.35% | 1800 | Reinforcement Learning | Yes | Augmentation of final model
Large-Scale Evolution [2] | Mean (Best) | 5 | 94.1% ± 0.4 (94.6%) | 3000 | Evolution | No |
Large-Scale Evolution, top-2 ensemble [2] | Best | 1 | 95.60% | 3000 | Evolution | No |
AmoebaNet-B 2.8M [5] | Mean | 5 | 97.45% ± 0.05 | 4500 (TPU) | Evolution | Yes | Augmentation of final model
AmoebaNet-B 34M [5] | Mean | 5 | 97.87% ± 0.04 | 4500 (TPU) | Evolution | Yes | Augmentation of final model
NASH single model [7] | Mean | 4 | 94.80% | 1 | Network morphism + Evolution | No |
NASH snapshot ensemble [7] | Mean | 4 | 95.30% | 2 | Network morphism + Evolution | No |
LEMONADE Search Space I [8] | Best | 1 | 96.50% | 56 | Network morphism + Evolution | No |
LEMONADE Search Space II [8] | Best | 1 | 96.60% | 56 | Network morphism + Evolution | Yes | Augmentation of final model
ENAS Macro [12] | Best | 1 | 95.77% | 0.3 | Weight sharing | No |
ENAS Macro + augmented final model | Best | 1 | 96.13% | 0.3 | Weight sharing | No |
ENAS Micro [12] | Best | 1 | 97.11% | 0.5 | Weight sharing | Yes | Augmentation of final model
DARTS 1st Order [14] | Best | 1 | 97.05% | 1.5 | Differentiable | Yes | Augmentation of final model
DARTS 2nd Order [14] | Best of 4 runs | 4 | 97.17% ± 0.06 | 4 | Differentiable | Yes | Augmentation of final model
CGP-CNN [33] | Mean (Best) | 3 | 93.95% (94.34%) | 27 | EA | No |
PNAS [36] | Mean | 15 | 96.59% ± 0.09 | 300 | Progressive search + Surrogate function | Yes | Search on hyperparameters before final retraining
ImmuNeCS | Best | 1 | 94.42% | 14 | AIS | No |
Table 3: Comparison of results of our method with others on the CIFAR-10 task.

The initial results and comparisons on this task are summarized in Table 3. Although further experiments are required, ImmuNeCS seems to achieve a competitive balance of performance and efficiency, particularly among methods that do not use a cell-based search space. We point out that the search space has not been optimized in any way: we simply reproduced six blocks from traditional hand-crafted architectures. It is very likely that a more thorough analysis of appropriate block candidates would improve these results significantly. One could even imagine a cell-based approach to design a few blocks, followed by the same high-level architecture search as above to combine these blocks (similar to [55]). Moreover, whereas a number of papers apply some form of post-processing specifically at the final retraining stage (hyperparameter search, augmentation, etc.), in our method the final architectures found by the algorithm are retrained without any modification or hyperparameter optimization.

4.3 Comparison with Random Search

Figure 7: Comparison of the Clonalg-based AIS with Random Search.

Some NAS methods have come under question because their superiority over Random Search (RS) could not be established in rigorous experiments [34, 35]. This might indicate that the techniques they use add complexity without helping performance, and that the performance mostly comes from the designed search spaces rather than the search algorithms. It is therefore important to compare ImmuNeCS with RS to ensure this is not the case with our approach. To this end, we run ImmuNeCS seven times on Fashion-MNIST at different evaluation budgets, obtained by varying the search-stopping threshold and the maximum number of generations. We then run a full RS at different budget values spanning the range covered by ImmuNeCS. In each of these RS runs, and to ensure a fair comparison, the generated architectures’ number of layers is sampled around the mean depth of the corresponding AIS runs.

The results in Figure 7 show that the search by an AIS provides substantial benefits over RS (rejecting the null hypothesis that both methods yield the same average committee test accuracy). The wider distribution of the AIS results is due to the fact that our method exhibits a stronger correlation between performance and the number of network evaluations than RS. Therefore, varying the evaluation budget spreads the RS results less.

4.4 Transferability

We hypothesize that the plurality of ImmuNeCS’s solutions helps them transfer to a different task and achieve competitive results even when the architecture search was conducted on an easier task. We therefore run ImmuNeCS on MNIST, a very simple task by modern standards, using the same search space and search parameters as described in Sections 3.3 and 4.2 for the Fashion-MNIST task. We obtain an NNC of 12 CNNs that are typically shallower than those obtained through a direct search on the Fashion-MNIST task. We train them first on MNIST and obtain a final accuracy of 99.6%. We then train the same architectures on Fashion-MNIST and measure a final accuracy of 92.8%, which is still among the best results presented in Table 2. Crucially, the performance ranking of architectures on MNIST is completely different from that on Fashion-MNIST, indicating that a 1-best-focused search would be unlikely to return the best architecture for both tasks. This opens up some interesting perspectives, such as a form of transfer learning where an NNC could be discovered on a task with plenty of labelled data, then used to produce predictions on a task with sparser data. It could also allow an NNC to be discovered on a relatively simple task, and further refined on the more complex one.

5 Conclusion and Discussion

We have presented ImmuNeCS, a novel approach to the automatic design of deep learning systems. Instead of focusing on producing a single architecture, we promote a diverse population of competent models that are ensembled to achieve competitive results with reasonable resource requirements. To improve efficiency, we use techniques whose underlying assumptions are verified through dedicated experiments. We also show that our method outperforms random search in a fair comparison, and remark that the NNC found on a simple task is able to generalize well to a more complex task.

Furthermore, our method presents other benefits: 1) it is conceptually simple and therefore approachable by non-experts; 2) it does not enforce an architecture based on repeated identical cells, and is thus able to discover irregular patterns; 3) it is flexible and can accommodate many search spaces. One obvious drawback is that inference is slower, as we need to aggregate the predictions of all members of the NNC rather than those of a single model. However, in applications where predictions are not required in real-time, this could be an acceptable trade-off.

Future work includes expanding the CIFAR-10 search space to further improve performance, and running experiments on tasks from the field of medical imaging to assess ImmuNeCS’s flexibility and usefulness in real-life situations. In order to make the algorithm more efficient, we also intend to investigate methods that could help guide the search towards promising areas of the search space.

References
