The Ant Swarm Neuro-Evolution Procedure for Optimizing Recurrent Networks


AbdElRahman A. ElSaid
aelsaid@mail.rit.edu
Alexander G. Ororbia
ago@cs.rit.edu

Golisano College of Computing and Information Sciences
Rochester Institute of Technology
Rochester, NY 14623
Travis J. Desell
tjdvse@rit.edu
https://people.rit.edu/aae8800/home.html
https://www.cs.rit.edu/~ago/
http://www.se.rit.edu/~travis/index.php
Abstract

Hand-crafting effective and efficient structures for recurrent neural networks (RNNs) is a difficult, expensive, and time-consuming process. To address this challenge, we propose a novel neuro-evolution algorithm based on ant colony optimization (ACO), called ant swarm neuro-evolution (ASNE), for directly optimizing RNN topologies. The procedure selects from multiple modern recurrent cell types such as Δ-RNN, GRU, LSTM, MGU, and UGRNN cells, as well as recurrent connections which may span multiple layers and/or steps of time. In order to introduce an inductive bias that encourages the formation of sparser synaptic connectivity patterns, we investigate several variations of the core algorithm. We do so primarily by formulating different functions that drive the underlying pheromone simulation process (which mimic L1 and L2 regularization in standard machine learning) as well as by introducing ant agents with specialized roles (inspired by how real ant colonies operate), i.e., explorer ants that construct the initial feed forward structure and social ants which select nodes from the feed forward connections to subsequently craft recurrent memory structures. We also incorporate a Lamarckian strategy for weight initialization which reduces the number of backpropagation epochs required to locally train candidate RNNs, speeding up the neuro-evolution process. Our results demonstrate that the sparser RNNs evolved by ASNE significantly outperform traditional one- and two-layer architectures consisting of modern memory cells, as well as the well-known NEAT algorithm. Furthermore, we improve upon prior state-of-the-art results on the time series dataset utilized in our experiments.


1 Introduction

Given their success across a wide swath of pattern recognition tasks, artificial neural networks (ANNs) have become a popular tool when attempting to solve data-driven problems. However, in order to solve increasingly complicated problems, neural architectures are becoming vastly more complex. Increasing the complexity of an ANN typically entails adding more layers of neural processing elements, most of which are wider and more densely connected, greatly complicating the model design process. The resulting increase in complexity introduces new challenges when fitting these ANN models to actual data. These problems are further compounded when ANNs are meant to process temporal data, entailing recurrent connections which can span varying periods of time. As a result, crafting performant ANNs becomes expensive and incredibly difficult for engineers, highlighting a grand challenge facing the domain of machine learning: the automation of ANN architecture design, which includes selecting the form of the underlying synaptic topology as well as the values of the weights themselves. The key to this automation might lie in developing optimization procedures that can effectively explore the vast, combinatorial search space of possible topological structures that could be constructed from a large set of neuronal units and the wide variety of synaptic connectivity patterns that relate them to one another.

Recent interest in automated architecture search has resulted in many proposed ideas related to deep feed forward and convolutional networks, including those based on nature-inspired metaheuristics [yang2010nature]. However, few, if any, have focused on the far more difficult problem of optimizing recurrent neural networks (RNNs) aimed at processing temporal, sequential data such as time series, i.e., automated RNN design.

This study addresses the challenge of automated RNN design by developing a novel ANN topology optimizer based on concepts from artificial evolution and ant colony optimization (ACO). Specifically, we propose an algorithm called Ant Swarm Neuro-Evolution (ASNE), which automatically constructs and optimizes the topology of RNNs, with a focus on time series data prediction. In developing our optimization approach, we furthermore develop and experiment with variations of our method in the following ways:

  • In order to encourage the discovery of more sparsely-connected neural topologies, we investigate different schemes for dynamically modifying the pheromone traces deposited by ant agents that compose the swarm. Specifically, we introduce functions for introducing regularization into the overall optimization, slowly clearing out densely-connected synaptic areas by depriving poorly performing weights/edges of pheromone accumulation.

  • We incorporate and analyze various weight initialization schemes and find that a Lamarckian inheritance strategy is highly effective.

  • Inspired by the role-specialization that ants operate under within the context of real-world ant colonies, we extend ASNE to utilize different specialized ant agents to modularize the underlying synaptic connectivity construction process, which we find greatly improves solutions found by our metaheuristic.

Experimentally, we validate our proposed nature-inspired metaheuristic on an open-access real-world time series data set collected from a coal-fired power plant. A rigorous ablation study of the ASNE algorithm is conducted by analyzing the candidate network topologies it finds. A large number of experiments with varying heuristics and hyperparameters were performed, which entailed training many different RNNs. Our results indicate that ASNE is able to build well-performing, arbitrary RNN structures with connections that span both structure and time, using both simple and complex memory cells. More importantly, ASNE is shown to significantly outperform the well-known neuro-evolutionary algorithm NEAT [stanley2002evolving], as well as the state-of-the-art evolutionary optimizer EXAMM [ororbia2019examm], which have held the prior best results on this data set.

2 Related Work

With respect to neuro-evolution of recurrent network topologies, a great deal of work already exists, ranging from stochastic alteration of the topology as in dropout [srivastava2014dropout] to something more sophisticated like the original NEAT [stanley2002evolving] and its more modern incarnation HyperNEAT [stanley2009hypercube]. Other proposed approaches include EPNet [yao1997new], EANT [kassahun2005efficient], GeNet [xie2017genetic], CoDeepNEAT [miikkulainen2019evolving], and EXACT [desell2017large]. EXACT was recently extended to evolve RNNs that used LSTM memory cells (named EXALT) and shown to perform quite well on time-series prediction problems [elsaid2019evolving]. Later, the algorithm, renamed EXAMM, was generalized to evolve networks consisting of a library of recurrent memory cells [ororbia2019examm]. These previously proposed ideas center around the use of a genetic algorithm [holland1992adaptation], where optimization is inspired by the evolution of organisms, of either Darwinian and/or Lamarckian nature. More recently, work by Camero et al. has shown that a Mean Absolute Error (MAE) random sampling strategy can provide good estimates of the performance of RNNs [camero2018low]; they have successfully used it in place of actually evaluating or training RNNs to speed up neuro-evolution of LSTM RNNs [camero2019specialized].

Nonetheless, very few studies in the body of work described above consider ant colony optimization (ACO) [dorigo1992optimization] as the central optimizer for network topology, and even fewer in general focus on exploring how to evolve complex temporal models like the RNN, with a few exceptions, such as EXALT and EXAMM. Of the few that have investigated ACO, most existing work has used it to strictly optimize feed forward networks and, even in that case, have dominantly focused on either initializing the weights of the connections [mavrovouniotis2013evolving], or on reducing the dimension of the input vector solution space [sivagaminathan2007hybrid]. One notable effort that has used ACO for RNN optimization in some form is [desell2015evolving], which used ACO to optimize smaller neural network structures based on Elman recurrent networks [elman1990finding].

This paper contributes to the domain of nature-inspired neural network topology optimization by proposing a novel metaheuristic for evolving the full structure of an RNN, as opposed to prior studies that have applied the technique as only a partial component of the optimization process [elsaid2018optimizing] or in smaller Elman RNN topologies with limited recurrent connectivity [desell2015evolving]. Furthermore, our algorithm is capable of utilizing the same full suite of recurrent memory cells as the state-of-the-art evolutionary algorithm EXAMM (LSTM, GRU, MGU, UGRNN, and Δ-RNN cells). To the best of our knowledge, we are the first to propose an ACO-based approach to automate RNN design, offering a powerful procedure that combines concepts of both neuro-evolution and ant colony metaheuristic optimization.

3 Ant Swarm Neuro-Evolution (ASNE)

ASNE handles the optimization of ANN structures by constructing a simple multi-agent system, where each agent treats the ANN as a graph structure, considering neuronal processing elements (PEs) as the nodes and the synaptic weights that connect PEs as the edges. In order to design the operations that these agents perform, as well as the manner in which they traverse the ANN graph, we appeal to the metaphor of ants and the collective they holistically form, i.e., the ant colony. As a result, the agents function based on simplifications of myrmecological principles, such as the mechanics of ant-to-ant social interaction.

At a high level, in ASNE, the individual ant agents operate on a single massively connected “superstructure”, which contains all possible ways that PEs may connect with each other both in terms of structure, i.e., all possible feedforward pathways that start from the input/sensory PEs and end at the output/actuator PEs, and time, i.e., all possible recurrent connections that span multiple time delays. In our implementation, ants choose to move over connections between nodes (or neurons) probabilistically, as a function of a simulated chemical known as the “pheromone”. In nature, pheromone is a primary driver of how ants communicate with each other; its traces allow the collective to “know” of potential food sources, ensuring the survival of the colony in the long term. When an ant finds food, it will start marking the path it takes back to the colony, and other ants will subsequently follow the resulting pheromone trace. In the ANN superstructure, these traces, which are simulated by an additional, dynamic scalar weight (or importance value) assigned to a given synapse, bias any given ant agent to favor selecting some (more rewarding) synaptic pathways over others.

procedure Master
       construct the fully connected superstructure, with edges holding initial pheromone and weight values
      
      for each generation do
            new_nn ← GenerateNetwork()
            send new_nn to an available worker and receive a trained RNN with its fitness
            if the returned fitness ranks among the population’s best then
                  deposit pheromone along the returned RNN’s structure (DepositPheromone)
                  insert the RNN into the population
            if Lamarckian weight inheritance is enabled then
                  if φ is fitness-based then
                        update colony’s weights from the returned RNN using the φ equation
                  else
                        update colony’s weights with a constant fraction φ of the returned weights
            if forward-path bias is enabled then
                  deposit pheromone on forward edges until Sum (Fwd Edge/Recurrent Pheromone) = Sum (Bkwd Recurrent Edges Pheromone)

procedure Worker
      receive new_nn from the master
      train new_nn locally with backpropagation through time
      return the trained weights and validation fitness to the master

procedure GenerateNetwork
      if layer jumping is disabled then
             ants only move one layer at a step
      else
             ants can jump over layers
      if standard ants are used then
            for each ant do
                  the ant chooses the nodes, edges, and recurrent edges
      else (multi-role ants)
            for each explorer ant do
                  the ant chooses the nodes and edges from the colony
            if forward social ants are used then
                  for each forward social ant do
                        the ant chooses rec_edges only from fwd rec_edges
            else if backward social ants are used then
                  for each backward social ant do
                        the ant chooses rec_edges only from bwd rec_edges
            else if both are used then
                  for each forward social ant do
                        the ant chooses rec_edges only from fwd rec_edges
                  for each backward social ant do
                        the ant chooses rec_edges only from bwd rec_edges
      return new_nn

procedure DepositPheromone(rnn)
      for each node, edge, and recurrent edge in rnn do
            if constant deposit then
                  apply Equation 2
            else if fitness-guided deposit then
                  apply Equation 3
            else if L1 deposit then
                  apply Equation 4
            else if L2 deposit then
                  apply Equation 5
Algorithm 1 Ant Colony Algorithm

The few existing efforts on using forms of ACO for RNN optimization [elsaid2018optimizing, elsaid2019evolving] restrict the ACO process to operate within individual LSTM memory cells. In contrast, ASNE allows individual ants to traverse a single massively connected “superstructure”, which contains all possible ways that the nodes of an RNN may connect with each other both in terms of structure (i.e., all possible feed forward connections) and in time (i.e., all possible recurrent connections spanning multiple time delays). Note that this superstructure is more connected than a standard fully connected neural network: each layer is also fully connected to every other layer, allowing forward and backward layer-skipping connections, with additional recurrent connections between node pairs for each allowed time skip. The high-level pseudocode for our ASNE topology optimizer is depicted in Algorithm 1.

ASNE was developed as an asynchronous parallel system for use on high performance computing resources. It has a master process that maintains the colony information and worker processes that (locally) train the RNNs. The implementation is asynchronous: the master process generates new RNNs as needed for worker processes (which operate on separate, dedicated CPU or GPU resources) and updates colony information and pheromones as trained RNN results are returned. This results in a naturally load-balanced algorithm with high scalability.

Within the master process itself, ASNE operates by having a fixed number of ant agents traverse the neural superstructure. Ants choose to move over connections between nodes (neurons) randomly, but they are probabilistically biased towards connections with higher simulated “pheromone” values. Pheromone deposits are periodically evaporated to prevent the search process from becoming stuck in local minima. Interestingly, modifying the evaporation function could be considered a way to encode certain priors into the ANN itself.
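As a concrete illustration, the pheromone-biased movement can be sketched as classic ACO roulette-wheel selection, where the probability of taking an edge is proportional to its pheromone value (the function and data layout below are illustrative, not taken from the ASNE codebase):

```python
import random

def select_edge(edges, pheromone, rng=random.random):
    """Pick the next edge with probability proportional to its
    pheromone value (roulette-wheel selection)."""
    total = sum(pheromone[e] for e in edges)
    r = rng() * total
    acc = 0.0
    for e in edges:
        acc += pheromone[e]
        if r <= acc:
            return e
    return edges[-1]  # guard against floating-point round-off
```

Because selection is proportional rather than greedy, under-used edges retain a small but nonzero chance of being explored, which is what keeps the search from collapsing onto a single topology.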

From the overall superstructure, which the ant agents exclusively operate on, RNN subnetworks are extracted (as dictated by the current pheromone trace network available at the current simulation time step, which yields a map of nodes and connecting synapses, both recurrent and feedforward, visited by the ant agents) and then further fine-tuned locally with only a few epochs of backpropagation (backprop) through time. After a particular worker is done locally training an RNN subnetwork, the candidate’s weight values and cost (fitness) function value (measured on a validation subset of the data) are communicated back to the swarm and superstructure (housed in the master process), adjusting the pheromone trace network and affecting future ant agent traversal behavior.

One crucial element in our ASNE procedure is the introduction of different ant agent types, which is inspired by how real ants specialize to act according to specific roles to serve the needs of the colony [odonnell2018antroles]. Specifically, we consider designing ant agents that serve specific roles in constructing parts of candidate RNN subnetworks: some ants exclusively traverse feedforward synaptic pathways while others only explore recurrent synaptic pathways. The full code is available at: https://github.com/travisdesell/exact/tree/adding_ant_colony.

Within the framework of ASNE, we investigate variations of its various underlying mechanisms. These include the use of Lamarckian weight initialization, allowing ant agents to also select from multiple memory cell types as opposed to operating exclusively with simple neurons, introducing specialized ants that have different graph traversal strategies, and constraining ant movement and manipulating the pheromone evaporation function in order to encourage the discovery of sparse RNN topologies.

3.1 Lamarckian Weight Initialization

Edge and recurrent edge weights can be randomly initialized each time a new RNN is generated by the ants. However, initializing parameters this way requires many epochs of local tuning (via backprop) for the RNN to reach a suitable generalization error, as the parameters do not make use of any information gained by prior trained RNN candidates. Further, it has been shown that reusing parental weights (i.e., epigenetic or Lamarckian weight initialization) can significantly speed up the neuro-evolution process and result in better performing, smaller ANNs in general [desell2018accelerating].

To apply Lamarckian weight initialization to ASNE, each edge in the ant swarm’s connectivity super-structure also tracks a weight value in addition to its pheromone value. These weights are initially drawn from a uniform random distribution. Each time a generated RNN performs well, its best weights, as measured on a validation data subset, are used to update the weight values in the swarm’s super-structure internal bookkeeping.

Formally, we define φ as a function of the population’s best and worst evaluated RNN fitness, w_c as the colony’s edge weight, w_g as the corresponding neural network’s edge weight, f_g as that network’s fitness, f_best as the population’s best fitness, and f_worst as the population’s worst fitness. Weight initialization then proceeds as follows:

(1a)   φ = (f_worst − f_g) / (f_worst − f_best)
(1b)   w_c ← (1 − φ) · w_c + φ · w_g
(1c)   w_g ← w_c   (when a new RNN is extracted from the superstructure)

With respect to the function φ, we investigated two variations. The first variant, as shown in Equation 1, used the fitness of the evaluated RNN to determine how much its new (locally found) weight values affect those of the colony. In the second variant, φ was set to a predetermined constant instead of being calculated from fitness. This process essentially maintains a running average (with either a fixed or a fitness-dependent update) of the best weights found for each connection in the superstructure. When a new RNN is generated, it uses the current weight values of whatever edges were extracted from the superstructure on the master process. This allows for Lamarckian evolution of edge weights, as prior RNNs with the best fitness scores are allowed to pass on their weights to future generations.
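A minimal sketch of the fitness-scaled running-average update follows, assuming φ is the candidate's fitness normalized between the population's worst and best (lower fitness, e.g., mean absolute error, is better); the exact functional form of φ in the original implementation may differ:

```python
def phi(f_g, f_best, f_worst):
    """Update strength in [0, 1]: 1 when the candidate matches the
    population's best fitness, 0 when it matches the worst."""
    if f_worst == f_best:
        return 1.0
    return (f_worst - f_g) / (f_worst - f_best)

def update_colony_weight(w_c, w_g, p):
    """Blend a trained RNN's weight w_g into the colony's weight w_c,
    forming a running average controlled by p (either phi(...) or a
    predetermined constant)."""
    return (1.0 - p) * w_c + p * w_g
```

With a constant `p`, every well-performing candidate moves the colony weight by the same fraction; with the fitness-based φ, better candidates pull the colony weight more strongly toward their own values.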

3.2 Memory Cell Selection

For any particular node in the super-structure, ASNE also has the ability to utilize the pheromones present to select which memory cell type that node will be in the generated network. A node can be chosen to be either an LSTM [hochreiter1997long], a GRU [chung2014empirical], an MGU [zhou2016minimal], a UGRNN [collins2016capacity], or a Δ-RNN cell [ororbia2017learning]. We refer the reader to these works for the formulations of these memory cells. Pheromones are deposited and updated for each of these memory cell possibilities as described below.
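Cell selection can be sketched the same way edges are chosen: each candidate cell type at a node carries its own pheromone value, and the type is drawn in proportion to those values (the names and data layout below are illustrative assumptions, not the ASNE API):

```python
import random

CELL_TYPES = ["LSTM", "GRU", "MGU", "UGRNN", "Delta-RNN", "simple"]

def select_cell_type(node_pheromones, rng=None):
    """Draw a node's memory cell type in proportion to the pheromone
    accumulated for each type at that node."""
    rng = rng or random.Random()
    weights = [node_pheromones[c] for c in CELL_TYPES]
    return rng.choices(CELL_TYPES, weights=weights, k=1)[0]
```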

3.3 Altering Graph Traversal with Ant Species

As mentioned above, we explored various strategies for guiding ant traversal over the connectivity superstructure. Inspired by role specialization in real colonies, we implemented ant agents that explored the connectivity graph in specific ways. First, we started with a generic ant agent, called the standard ant, which was allowed to traverse the massively connected colony superstructure in an unbiased manner. This, in essence, recovers the simple ant agent in classic ACO, which has complete freedom to explore any piece of a given graph structure. However, it quickly became apparent that this type of ant would get “stuck” in the network, generating a large number of recurrent connections before finally reaching an output node. This meant that the RNN candidates extracted for local fine-tuning were rather dense and, in turn, compute-heavy (featuring many extraneous parameters, as is characteristic of over-parameterized models).


Figure 1: Potential paths an ant can take from a given node (in orange) with the massively-connected superstructure. The number of recurrent paths (red) far outnumber the forward paths (green). This problem is exacerbated as the possible recurrent time scale increases, which results in multiple backward recurrent connections for each red connection, each going back a different number of time steps in the past.

Why do standard/simple ants get stuck or meander too long in the superstructure? In the superstructure, nodes (especially at the final hidden layer) can select from potential backward recurrent paths, which significantly outnumber the potential forward moving paths (see Figure 1). Assuming that each connection starts with an equal amount of pheromone (a standard setting for pheromone initialization), agents will circle around the colony using these backward paths, yielding RNN candidates with very dense recurrent structure.

To prevent this problem, our first tactic was to alter the pheromone deposit function by adding extra pheromone to forward paths upon initialization, as well as after every pheromone update. This biasing method yielded better proportions of forward and backward paths. Algorithm 2 illustrates this process.

for each node in the superstructure do
      if at initialization or after a pheromone update then
            for each forward edge leaving the node do
                  deposit extra bias pheromone on the edge
Algorithm 2 Forward Connections Bias Algorithm
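One plausible reading of the balancing condition (an assumption on our part, since the exact deposit amount is elided above) is to scale forward-edge pheromone so that its total matches the total on the backward recurrent edges:

```python
def bias_forward_paths(forward_edges, recurrent_edges, pheromone):
    """Scale pheromone on forward edges so their total matches the
    total on the (far more numerous) backward recurrent edges,
    counteracting the bias toward dense recurrent structure."""
    fwd_total = sum(pheromone[e] for e in forward_edges)
    rec_total = sum(pheromone[e] for e in recurrent_edges)
    if fwd_total <= 0.0 or rec_total <= fwd_total:
        return pheromone  # already balanced or nothing to scale
    scale = rec_total / fwd_total
    for e in forward_edges:
        pheromone[e] *= scale
    return pheromone
```

After this adjustment, an ant at a node is as likely (in aggregate) to move forward as to take any backward recurrent path, regardless of how many backward paths exist.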

Even with this forward path bias added to the pheromone deposit function, when using standard ants, we found that ASNE still tended to favor the generation of fairly dense networks. Altering the number of ant agents used to explore the structure as a means of controlling the density of RNN candidates helped somewhat, but was rather unwieldy and entailed far too much external human intervention. Instead, we developed an ant agent role specialization scheme that we found worked far better as an automatic mechanism for controlling network size and synaptic density.


Figure 2: In multi-role traversal, explorer ants (red) first select the forward paths in the network, creating a basic structure for the RNN. The social ant agents then select from the nodes chosen by the explorer ants. Within the social ant agent role, there is a sub-specialization consisting of forward recurrent ants (blue) that create additional forward recurrent connections between these nodes and backward recurrent ants (green) that move backwards from the output toward the input, creating backward recurrent connections between the same nodes.

An agent assigned the first role, the explorer ant, is only allowed to choose from forward connections in the connectivity superstructure. The connections selected by this specialized agent are utilized to generate the base neural structure to which recurrent connections can then be added. After the explorer ants select the possible nodes and forward connections, two additional specializations of what we call social ants are used: i) forward recurrent ants and ii) backward recurrent ants. Social ants are restricted to only visiting nodes that have already been selected by the explorer ants. In the case of the forward recurrent ants, when a path is chosen, the agent creates a recurrent connection that moves forward in the network along the same path, along with a selected time skip (determined by pheromones). Backward recurrent ants, on the other hand, move backwards through the network and, for each path they take, a backward recurrent connection is added, along with a selected time skip (also determined by pheromones). Figure 2 provides an example of possible pathways that these specialized agents can take in a colony superstructure.

In addition to the development of specialized ant agents as described above, we explored two modes for general ant movement: i) ants were allowed to pick edges that could jump over layers in the colony (i.e., the superstructure is massively connected, with a plethora of skip connections), or ii) ants were only allowed to select edges between consecutive layers (i.e., the superstructure is fully connected, with no skip connections). This was tested to see the impact that layer skipping would have on the sparsity and performance of generated RNNs. Jumping and non-jumping modes were tested for both the standard ants (with and without forward-path bias) and the specialized ant agent roles.

3.4 Updating Pheromone Values

In this section, we describe the various schemes we experimented with in designing the ASNE optimization procedure. We define τ as the pheromone value, ρ as the pheromone decay parameter, w as the weights of the evaluated (candidate) RNN, f as the candidate model’s fitness, c as a constant deposit amount, and λ as a regularization coefficient. Specifically, we describe four different functional schemes used to model pheromone deposits.

The first strategy we implemented for ASNE is also standard for classical ACO setups. This deposit scheme rewards well performing RNNs with a fixed (constant) pheromone deposit while penalizing ill-performing RNN models by decaying their pheromone trace by the constant decay parameter ρ. Specifically, this approach is defined as:

τ ← τ + c   (if the RNN performs well);    τ ← ρ · τ   (otherwise)    (2)

The second strategy we implemented was one that used the fitness (value) as a parameter to guide pheromone deposit. This has been shown to improve ACO performance in prior studies [sivagaminathan2007hybrid]. This scheme is defined as follows:

τ ← τ + c / f    (3)

The third strategy was to use the values of the synaptic weights themselves to control/guide the deposit of pheromones. Specifically, we inserted an L1 penalty on the weights (assuming a Laplacian prior over the synaptic weight values) in order to encourage regularization that favors sparser connectivity structure. This form of weight decay is sometimes applied to ANNs when controlling for over-parameterization, and sparse weight matrices (with many near hard-zero values) are highly desirable. L1 regularization was applied to the pheromone deposition calculation in the following manner:

τ ← τ + c / (f + λ · Σ_i |w_i|)    (4)

The fourth and final strategy we employed was to insert an L2 penalty to regularize the RNN candidate weights. This assumes a Gaussian prior over the synaptic weight values and is sometimes referred to in ANN literature as “weight decay”. We incorporate L2 regularization into pheromone deposition according to the following formula:

τ ← τ + c / (f + λ · Σ_i w_i²)    (5)

We developed these L1 and L2 functional variations of the pheromone deposit scheme in the hopes that they would ultimately encourage the discovery of sparse, compact RNN predictive models.
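Since the exact functional forms are not fully recoverable here, the sketch below assumes reciprocal-fitness deposits (lower fitness, i.e., error, is better), with the L1/L2 weight penalty added to the denominator so that heavier weights earn less pheromone; `c` and `lam` stand in for the assumed deposit constant and regularization coefficient:

```python
def deposit(tau, scheme, c=1.0, f=None, w=None, lam=0.1):
    """Return an edge's new pheromone value under the four deposit
    schemes of Section 3.4; lower fitness f (mean absolute error)
    is better, so better RNNs deposit more pheromone."""
    if scheme == "constant":          # Equation 2, reward case
        return tau + c
    if scheme == "fitness":           # Equation 3
        return tau + c / f
    if scheme == "l1":                # Equation 4
        return tau + c / (f + lam * sum(abs(wi) for wi in w))
    if scheme == "l2":                # Equation 5
        return tau + c / (f + lam * sum(wi * wi for wi in w))
    raise ValueError(scheme)
```

Under the L1/L2 schemes, two RNNs with equal fitness deposit different amounts of pheromone: the one with smaller weights deposits more, nudging the colony toward sparser candidates.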

3.5 Pheromone Evaporation

Pheromone trace values (deposited on the superstructure’s synaptic edge pathways) evaporate or “decay” after each generation of an RNN in order to reduce the amount of pheromone on synaptic edges that are not being used much by the ant agent collective [sivagaminathan2007hybrid, mavrovouniotis2013evolving, liu2006evolving]. Pheromone values are updated (or decayed) according to the following equation:

τ′ = (1 − ε) · τ + ε · τ₀    (6)

where τ′ is the pheromone value after the update, τ is the current pheromone value, τ₀ is the original baseline pheromone value, and ε is the pheromone evaporation rate. This function evaporates the pheromone back towards the original baseline value.
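A minimal implementation of this baseline-reverting evaporation, assuming the convex-blend form described above (variable names are illustrative):

```python
def evaporate(tau, tau_0, eps):
    """Decay a pheromone value toward its baseline tau_0 at
    evaporation rate eps: tau' = (1 - eps) * tau + eps * tau_0."""
    return (1.0 - eps) * tau + eps * tau_0
```

Repeated application moves τ geometrically toward τ₀, so edges that stop receiving deposits gradually lose their selection advantage.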

4 Results

All ASNE and EXAMM experiments generated total RNNs, training each for epochs. NEAT, on the other hand, was allowed to generate RNNs. If we assume that a forward pass (forward propagation) and a backward pass (backprop calculation) are approximately the same computationally, this generously gave NEAT approximately times the amount of compute time (as RNNs trained for epochs would equate to forward and backward passes). The RNNs with non-evolvable (fixed) architectures were allowed to train for epochs. Every experiment was repeated times to compute means and standard deviations in order to ensure a proper statistical comparison.

ASNE used a colony superstructure with input nodes, hidden layers, each with hidden nodes, and a single output node. Recurrent synapses could span , or steps in time. The resulting connectivity superstructure consisted of nodes, edges, and recurrent edges. While this may seem modest compared to modern convolutional architectures, which may consist of millions of connections, it is important to note that the RNNs generated from this superstructure are unrolled over time steps (according to the time series length of the training and testing data samples) when trained locally via backpropagation through time (BPTT). This means algorithms such as ASNE must handle (fully-unrolled) networks of up to nodes, edges, and recurrent edges with errors from the final output (predictor) potentially back-propagated over up to synaptic connections.

The dataset utilized in this study is an open-access time series dataset taken from a coal-fired power plant. The data was introduced in previous neuro-evolution studies for time series data prediction [elsaid2019evolving, alex2019investigating]. It consists of possible parameters, recorded for days with each parameter recorded each minute. These parameters were used to predict the flame intensity parameter (the response variable, in regression parlance). Results were generated by training RNNs on days worth of data taken from one of the coal burners in this data set. Fitness values (mean absolute error) were calculated on the other days, which were treated as a test set.

Experiments were conducted to include all combinations of the ASNE options/variations (described below). Each experiment was repeated times to obtain robust results. These ASNE experiments generated, trained, and evaluated million RNNs. Experiments were scheduled on a high performance computing cluster with Intel® Xeon® Gold 6150 CPUs, each with 36 cores and 375 GB RAM (total 2304 cores and 24 TB of RAM). Each experiment utilized nodes. Overall, it took approximately days to complete the entire battery of experiments. Given the unstructured nature of the RNNs evolved in this work, utilizing CPUs was found to be more efficient than GPUs, as there are no wide, fully connected layers that would benefit from parallelized matrix algebra on a GPU. Further, it allows the use of large-scale high performance computing clusters, which typically have many more CPUs than GPUs available.

4.1 Backpropagation Hyperparameters

All ANNs were trained with backprop and stochastic gradient descent (SGD) using the same hyperparameters. SGD was run with a fixed learning rate and used Nesterov momentum to smooth out the local gradient descent. No dropout regularization was used, since it has been shown in other work to reduce performance when training RNNs for time series prediction [elsaid2018optimizing]. To prevent exploding gradients, gradients were re-scaled (as prescribed by Pascanu et al. [pascanu2013difficulty]) to a unit Gaussian ball when the norm of the gradient was above a threshold. To improve performance for vanishing gradients, gradient boosting (the opposite of clipping) was used when the norm of the gradient was below a threshold. A constant was added to the forget gate bias of the LSTM cells, as this has been shown to yield significant improvements in training time by Jozefowicz et al. [jozefowicz2015empirical]. Weights for RNNs in all other cases were initialized as described in Section 3.1 (our Lamarckian weight initialization scheme) for ASNE, and as in Ororbia et al. [ororbia2019examm] for EXAMM.
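The norm-based clipping and boosting can be sketched as follows (the thresholds are illustrative placeholders, since the actual values are elided in the text above):

```python
import math

def rescale_gradient(grad, high=1.0, low=0.05):
    """Rescale a gradient to unit norm when its norm exceeds `high`
    (clipping) and scale it up to `low` when its norm falls below
    `low` (boosting); otherwise leave it unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)
    if norm > high:
        return [g / norm for g in grad]          # clip to the unit ball
    if norm < low:
        return [g * (low / norm) for g in grad]  # boost up to the threshold
    return list(grad)
```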

4.2 ASNE Options and Hyper-parameters

The influence of individual ASNE hyper-parameters was carefully investigated in this study. The pheromone decay and evaporation rates were chosen because they were shown to be effective in preliminary tests and fall within the recommended standard range [sivagaminathan2007hybrid]. The other ASNE parameters we considered were:

  1. Number of ants: {20, 40, 80, 160}.

  2. Regularization update parameter: {0.25, 0.65, 0.90}.

  3. Initializing RNN weights with constant values ({0.3, 0.6, 0.9}), with values calculated as a function of fitness, or with non-Lamarckian randomized weight initialization.

The examined heuristics, as they appear in the figures and tables that follow, are labeled as follows:

  1. Function :

  2. Constant :

  3. L1 Pheromone regularization: (Equation 4)

  4. L2 Pheromone regularization: (Equation 5)

  5. Standard Ant Species:

    • Standard Ants:

    • Standard Ants with Bias:

  6. Multi Species Ants:

    • Explorer Ants:

    • Explorer Ants and Forward Social Ants:

    • Explorer Ants and Backward Social Ants:

    • Explorer Ants, Forward and Backward Social Ants:

  7. Layer Jumping:

  8. No Layer Jumping:
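The L1 and L2 pheromone regularization heuristics above can be sketched as a pheromone update step. Since Equations 4 and 5 are not reproduced in this section, the deposit amount and penalty forms below are illustrative assumptions; `reg` plays the role of the regularization update parameter ({0.25, 0.65, 0.90} in the experiments).

```python
def update_pheromones(pheromones, used_edges, weights, mode="l2",
                      deposit=0.15, evaporation=0.1, reg=0.65):
    """Sketch of a pheromone update with weight-based regularization.

    Every edge evaporates; edges used by the evaluated RNN receive a
    deposit that is reduced by an L1- or L2-style weight penalty, so
    edges carrying large weights accumulate pheromone more slowly.
    """
    for edge, tau in pheromones.items():
        tau *= (1.0 - evaporation)            # evaporation on every edge
        if edge in used_edges:
            w = weights[edge]
            if mode == "l1":
                penalty = reg * abs(w)        # L1-style: penalize |w|
            else:
                penalty = reg * w * w         # L2-style: penalize w^2
            tau += max(deposit - penalty, 0.0)
        pheromones[edge] = tau
    return pheromones
```

The intended effect mirrors L1/L2 regularization in standard machine learning: connections whose weights balloon are made less attractive to subsequent ants, biasing the search toward sparser structures.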

4.3 Performance of Individual Heuristics

Figure 3 presents the performance of ASNE when each heuristic is applied separately, along with the performance of the state-of-the-art EXAMM, NEAT, and traditional fixed RNNs for comparison. While ASNE in this case (augmented by only one heuristic) did not outperform EXAMM except for some outliers, both EXAMM and ASNE performed dramatically better than NEAT, even though NEAT was given a significant amount of extra compute time. ASNE, EXAMM, and NEAT also significantly outperformed the traditional RNNs. Some of the gain over NEAT is most likely due to the use of backpropagation by EXAMM and ASNE, since NEAT uses fairly simple, non-gradient-based recombination operations to adjust weights.

4.4 Performance of Combined Heuristics

                   |      Top 10      |      Top 25       |       Top 100       |        Top 250         |        Top 500
                   | Mean  Medn  Best | Mean  Medn  Best  | Mean   Medn   Best  | Mean    Medn    Best   | Mean    Medn    Best
Function           | 3(0)  4(0)  3(0) | 9(0)  7(0)  9(0)  | 26(0)  23(0)  31(8) | 58(0)   54(0)   49(8)  | 108(1)  96(0)   100(14)
Const              | 7(0)  6(0)  7(0) | 14(0) 14(0) 12(0) | 60(0)  63(0)  54(8) | 147(0)  149(0) 155(16) | 294(0)  301(0)  299(43)
No                 | 0(0)  0(0)  0(0) | 2(0)  4(0)  4(0)  | 14(0)  14(0)  15(0) | 45(0)   47(0)   46(0)  | 98(0)   103(0)  101(0)
L1                 | 2(0)  4(0)  0(0) | 9(0)  8(0)  3(3)  | 42(0)  34(0)  30(4) | 96(0)   96(0)   91(4)  | 190(0)  186(1)  186(21)
L2                 | 5(0)  5(0)  6(0) | 13(0) 12(0) 16(1) | 40(0)  45(0)  38(3) | 100(0)  98(0)   95(12) | 189(0)  192(0)  185(21)
StdAnts            | 0     0     0    | 1     0     0     | 3      0      0     | 20      19      0      | 80      77      7
StdBiasAnts        | 0     0     0    | 0     0     0     | 3      1      0     | 23      16      0      | 83      83      11
ExpAnts            | 0     0     10   | 0     0     25    | 1      0      100   | 10      6       250    | 92      85      440
ExpFrdAnts         | 6     7     0    | 14    15    0     | 45     49     0     | 98      103     0      | 123     128     40
ExpBkwAnts         | 0     0     0    | 0     0     0     | 0      0      0     | 0       0       0      | 0       0       0
ExpFrdBkwAnts      | 4     3     0    | 10    10    0     | 48     50     0     | 99      106     0      | 122     127     2
No Jump            | 0     0     5    | 0     0     13    | 0      0      52    | 0       0       128    | 2       9       282
Layer Jump         | 10    10    5    | 25    25    12    | 100    100    48    | 250     250     122    | 498     491     218
20 Ants            | 0     0     2    | 0     0     6     | 0      0      24    | 0       0       65     | 3       6       220
40 Ants            | 2     0     3    | 5     1     7     | 14     15     23    | 50      57      63     | 97      87      120
80 Ants            | 4     3     2    | 8     11    6     | 44     45     26    | 82      80      60     | 175     173     80
160 Ants           | 4     7     3    | 12    13    6     | 42     40     27    | 118     113     62     | 225     234     80

Table 1: Heuristic Ranking Statistics
Figure 3: Performance of NEAT, EXAMM, & individually applied ASNE heuristics against fixed memory cell RNNs.
Figure 4: Performance of EXAMM and the top 25 ASNE experiments.

The combined application of multiple different heuristics, as illustrated in Figure 4, yielded ASNE results that outperformed all baselines, including the fixed RNNs, NEAT, as well as EXAMM. Table 1 provides statistics ranking each of the heuristics by how many times the experiments utilizing them appeared in the top 10, 25, 100, 250, and 500 best results, as determined by the mean, median, and best performance of the RNNs generated in each experiment’s 10 repeats. Values in parentheses are the number of times an experiment that only utilized that heuristic appeared in that top ranking. The utilization of multiple heuristics dominated the top results, with individually-applied heuristics not appearing in the top 10 at all, and only a few times in the top 25 (and then only as best results).

Lamarckian weight inheritance also proved to be important, yielding strong performance, with all of the top 10 utilizing either functional or constant parameters. Furthermore, it also occurred 86 (mean), 86 (median), and 85 (best) times in the top 100, and 402 (mean), 397 (median), and 399 (best) times in the top 500.
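The Lamarckian inheritance idea can be sketched as follows: a child RNN reuses the trained weights of the edges it inherits from its parent, and only newly added edges are randomly initialized. This is an illustrative sketch, not the paper's implementation; `init_scale` is a hypothetical range for new weights.

```python
import random

def lamarckian_init(child_edges, parent_weights, init_scale=0.3):
    """Sketch of Lamarckian weight inheritance.

    Edges already present in the locally trained parent keep their
    learned weights, so the child starts training from an informed
    point instead of from scratch; only new edges start random.
    """
    weights = {}
    for edge in child_edges:
        if edge in parent_weights:
            weights[edge] = parent_weights[edge]          # inherit trained weight
        else:
            weights[edge] = random.uniform(-init_scale, init_scale)
    return weights
```

Because most of each candidate's weights start near a previously trained solution, far fewer backpropagation epochs are needed per candidate, which is what speeds up the overall neuro-evolution loop.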

Additionally, all of the best performing RNNs used layer-jumping ants, which tend to favor sparser connectivity patterns. Most of the best results used pheromone weight-regularization, with L2 regularization appearing at a nearly 50% rate in the top 10, 25, and 100 results. The regularization update parameter was also high, at 0.65 or 0.90, for most of the best experiments that used it.

All of the top 10 best results utilized the multiple ant species heuristic, which strongly supports the use of specialized ants. The number of ants varied between 40 and 160 for all the top results in the mean and median case, with a larger number of ants tending to perform better. However, the case of 20 ants did occasionally appear among the best cases, even sometimes in the top 10, and these networks tended to be rather sparse yet very well performing. This may suggest that the experiments which utilized more ants had an easier time finding the most important structures, but also potentially retained extraneous connections which were not needed. In contrast, the experiments with fewer ants had less of a chance of finding these important structures due to lower (overall) connectivity. This suggests that further optimizations could be designed to better guide ASNE towards the discovery of more efficient network architectures.

Perhaps one of the most interesting observations is the performance distribution when multiple ant agent roles were used in ASNE. The entirety of the best found RNNs, up to the top 250, came from explorer ants only, so these generated RNNs only had recurrent connectivity in terms of whatever the various memory cells offered. However, for the mean and median performance of the experiments, nearly all of the top 10, 25, 100, and 250 consisted of explorer and forward recurrent roles, or explorer, forward, and backward recurrent ant specializations, with only a very few of the explorer-ant-only configurations showing up in the top 250 and 500. First, this suggests that backward recurrent connections (which are most commonly utilized in RNNs) were less effective than forward recurrent connections. Second, it also appears that adding these recurrent connections tended to make the RNNs perform significantly better in the average and median cases, while the experiments with only explorer ants were occasionally able to find RNNs that generalized quite well. These results certainly suggest further study in order to better understand the effect of combining recurrent connections and memory cells. In addition, perhaps alternative strategies can be developed that retain the stability of adding recurrent connections while still efficiently finding well-generalizing RNNs.

4.5 RNN Density

Tables 2 and 3 show the number of nodes, edges, and recurrent edges in the best evolved RNNs for the experiments where ASNE was augmented with only single, individual heuristics. EXAMM found the simplest structures, but these were not always the best-performing, which may suggest that EXAMM, as powerful as it is, still sometimes gets trapped in local minima. Utilizing the multiple ant agent roles and L2 pheromone regularization proved to be very effective in generating smaller, sparser RNNs. The smaller RNN sizes, combined with strong performance in the top rankings, suggest that modeling ant role specialization can significantly improve how well an ACO/neuro-evolutionary algorithm, such as the proposed ASNE, generates candidate RNNs.

4.6 Fitness Structure Coefficient

Figure 5 examines the relationship between the size of a network and its fitness. Results for RNNs from the top best performing experiments are shown, along with RNNs taken from the individual heuristic experiments.

The following equation was used to calculate a measure of the contribution of each weight to the fitness of the RNN:

(7)

where the structural coefficient is calculated from the mean absolute error of the RNN and the number of weights currently contained in the candidate RNN structure. Higher values represent RNNs in which individual weights contribute more to the performance of the network.
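One plausible reading of this per-weight contribution score can be sketched as below. The exact form of Equation 7 is not reproduced in this section, so the formula here is an assumption: the reciprocal of the mean absolute error times the weight count, which rises as the RNN becomes both more accurate and smaller.

```python
def structure_coefficient(mae, num_weights):
    """Hypothetical form of the fitness structure coefficient:
    higher when MAE is lower (better fit) and when fewer weights
    are carrying the network's performance."""
    return 1.0 / (mae * num_weights)
```

Under this reading, a sparse RNN that matches a dense RNN's error receives a proportionally higher score, which is the comparison Figure 5 is drawing.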

Figure 5: Ranges of how much each weight in an RNN contributed to its performance (Equation 7).
        Min   Max   Avg   Reduce%
L1
L2
EXAMM

Table 2: Number of Edges (40 Ants)
        Min   Max   Avg   Reduce%
L1
L2
EXAMM

Table 3: Number of Recurrent Edges (40 Ants)

5 Discussion

To the best of our knowledge, this work represents the first application of ant colony optimization (ACO) to the problem of neuro-evolution/neural architecture search for recurrent neural networks with varying recurrent time spans and more complex connectivity patterns (the only prior related study that investigated ACO for evolving RNNs was critically constrained to small RNNs with a single recurrent timestep and Elman-style connections [desell2015evolving]). Specifically, we proposed the novel ant swarm neuro-evolution (ASNE) algorithm for metaheuristically searching the massive space of possible RNNs with complex connectivity patterns (of both recurrent and feedforward forms). ASNE generates candidates from a massively-connected superstructure (the colony/swarm), taking advantage of ACO for structural optimization and of concepts from neuro-evolutionary/genetic approaches for maintaining populations of RNN candidates that are trained locally and asynchronously (making ASNE a memetic procedure as well). A hallmark of ASNE is its computational formalization of the role specialization found in real ant colonies: ant agents are prevented from getting stuck “wandering” around the superstructure through the use of different kinds of ant agents that are constrained to only explore different components of the underlying complex graph space. This is a form of modularization that proves particularly useful in partitioning large, complex search spaces under ASNE.

Our experimental results show that using ants with different roles generated RNNs that were not only sparse but performant; these candidates almost entirely outperformed the more standard ant traversal strategy, even when standard ants were biased to more likely select forward paths. This innovation of utilizing multiple ant types improves the ACO core of ASNE when searching for effective RNNs. Furthermore, Lamarckian weight inheritance greatly improved the accuracy of the generated RNNs, corroborating prior studies that have also shown the benefits of such an initialization scheme [desell2018accelerating, ororbia2019examm], and allowing ants to jump (or skip) layers proved to not only boost performance but also to increase sparsity. Lastly, to our knowledge, the introduction of L1 and L2 regularization into the ACO pheromone deposition process is novel, if a bit unconventional. Our results show that by adjusting the form of the pheromone adjustment function, we can increase the likelihood of finding sparser RNNs that also outperform schemes which do not incorporate regularization/constraints. The strategies we formalize in this work are generic and could be applied to any other ACO algorithm’s pheromone update process.

The proposed ASNE metaheuristic not only provides advances and new concepts for the field of ant colony optimization research to further explore but also shows strong promise for its use as an alternative neuro-evolution algorithm for automated RNN architecture search. It significantly outperforms the well-known NEAT algorithm (even when NEAT is given an order of magnitude more computation), and, more importantly, ASNE outperforms the state-of-the-art EXAMM genetic evolutionary algorithm on the time series problem studied in this paper.

This work also opens up a number of avenues for future study and presents some interesting questions. In particular, why were explorer ants able to find the best networks and yet perform quite poorly in the mean and median cases? Why did explorer ants combined with social recurrent ants perform extremely well in the mean and median cases but not in the best cases? Answering experimental questions such as these might lead to insights as to how recurrent connections that skip multiple steps of time interact with recurrent memory cells, potentially leading to the design of more expressive RNN structures that better capture longer-term dependencies in sequential data. Finally, future work should investigate ASNE on other time series datasets as well as on sequence modeling (and classification) problems more commonly explored in mainstream statistical learning research, such as language modeling [mikolov2010recurrent, ororbia2017learning].

Acknowledgements

This material is in part supported by the U.S. Department of Energy, Office of Science, Office of Advanced Combustion Systems under Award Number #FE0031547. We also thank Microbeam Technologies, Inc. for their help in collecting and preparing the coal-fired power plant dataset. Most of the computation of this research was done on the high performance computing clusters of Research Computing at Rochester Institute of Technology. We would like to thank the Research Computing team for their assistance and the support they generously offered to ensure that the heavy computation this study required was available.

References

Appendix A Complete Results

Minimum Maximum Mean Median Std. Dev.
NEAT
EXAMM
-L2.65--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrd
-L1.25--AJ-ExpFrd
-L2.65--AJ-ExpFrd
-L2.25--AJ-ExpFrd
-L2.9--AJ-ExpFrd
-L2.65--AJ-ExpFrd
-L1.65--AJ-ExpFrd
-L2.65--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrd
L1.9--AJ-ExpFrd
-L1.25--AJ-ExpFrdBkw
L1.65--AJ-ExpFrd
-L2.25--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrd
-L2.25--AJ-ExpFrdBkw
-L1.65--AJ-ExpFrd
-L1.65--AJ-ExpFrd
-L1.65--AJ-ExpFrd
L1.65--AJ-ExpFrd
--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
-L1.9--AJ-ExpFrd
L2.9--AJ-ExpFrd
L2.65--AJ-ExpFrd
--AJ-ExpFrd
-L2.65--AJ-ExpFrdBkw
--AJ-ExpFrd
-L2.65--AJ-ExpFrd
-L1.25--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrd
--AJ-ExpFrdBkw
--AJ-ExpFrd
--AJ-ExpFrdBkw
-L1.9--AJ-ExpFrd
L2.65--AJ-ExpFrd
L2.9--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrd
-L1.25--AJ-ExpFrd
-L1.25--AJ-ExpFrd
--AJ-ExpFrd
-L2.25--AJ-ExpFrd
L1.25--AJ-ExpFrd
-L1.9--AJ-ExpFrd
-L1.9--AJ-ExpFrdBkw
-L1.25--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrd
-L1.9--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrd
-L1.9--AJ-ExpFrdBkw
-L1.9--AJ-ExpFrdBkw
L1.65--AJ-ExpFrdBkw
-L1.65--AJ-ExpFrdBkw
-L1.25--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
L2.25--AJ-ExpFrd
--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrdBkw
-L1.9--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-AJ-ExpFrdBkw
-L1.25--AJ-ExpFrd
-L1.65--AJ-ExpFrd
-L2.9--AJ-ExpFrdBkw
-L1.25--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrdBkw
-L1.25--AJ-ExpFrdBkw
--AJ-ExpFrd
-L1.9--AJ-ExpFrdBkw
L1.9--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrdBkw
L2.9--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrd
L2.65--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
-L2.25--AJ-Exp
-L2.65--AJ-ExpFrdBkw
-L1.25--AJ-ExpFrdBkw
-AJ-ExpFrd
-AJ-ExpFrdBkw
-L2.9--AJ-ExpFrdBkw
-L1.9--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
L1.25--AJ-ExpFrd
-L2.65--AJ-ExpFrd
-L1.25--AJ-ExpFrd
-AJ-ExpFrd
-AJ-ExpFrd
-L2.9--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrd
-L2.9--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrdBkw
-L2.65--AJ-ExpFrdBkw
--AJ-ExpFrd
-L2.65--AJ-ExpFrd
-L2.9--AJ-ExpFrd
--AJ-ExpFrd
-L1.25--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrd
-L2.65--AJ-ExpFrd
-L2.25--AJ-ExpFrd
-L2.9--AJ-ExpFrd
--AJ-ExpFrdBkw
--AJ-ExpFrd
-L1.9--AJ-ExpFrd
--AJ-ExpFrd
L2.65--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrd
L1.25--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L1.9--AJ-ExpFrd
--AJ-ExpFrd
-L1.65--AJ-ExpFrdBkw
-L1.65--AJ-ExpFrd
-L1.25--AJ-ExpFrdBkw
-L2.25--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
L1.9--AJ-ExpFrdBkw
--AJ-ExpFrdBkw
-L2.9--AJ-ExpFrd
-L2.25--AJ-ExpFrd
-L2.25--AJ-ExpFrdBkw
-L1.65--AJ-ExpFrd
--AJ-ExpFrdBkw
--AJ-ExpFrd
-L1.25--AJ-ExpFrdBkw
-L1.65--AJ-ExpFrd
L1.9--AJ-ExpFrd
--AJ-Exp
--AJ-ExpFrd
--AJ-ExpFrd
-L2.9--AJ-Exp
--AJ-ExpFrd
L1.65--AJ-ExpFrdBkw
-L1.65--AJ-ExpFrd
L2.65--AJ-ExpFrd