The Ant Swarm NeuroEvolution Procedure for Optimizing Recurrent Networks
Abstract
Handcrafting effective and efficient structures for recurrent neural networks (RNNs) is a difficult, expensive, and timeconsuming process. To address this challenge, we propose a novel neuroevolution algorithm based on ant colony optimization (ACO), called ant swarm neuroevolution (ASNE), for directly optimizing RNN topologies. The procedure selects from multiple modern recurrent cell types such as RNN, GRU, LSTM, MGU and UGRNN cells, as well as recurrent connections which may span multiple layers and/or steps of time. In order to introduce an inductive bias that encourages the formation of sparser synaptic connectivity patterns, we investigate several variations of the core algorithm. We do so primarily by formulating different functions that drive the underlying pheromone simulation process (which mimic L1 and L2 regularization in standard machine learning) as well as by introducing ant agents with specialized roles (inspired by how real ant colonies operate), i.e., explorer ants that construct the initial feed forward structure and social ants which select nodes from the feed forward connections to subsequently craft recurrent memory structures. We also incorporate a Lamarckian strategy for weight initialization which reduces the number of backpropagation epochs required to locally train candidate RNNs, speeding up the neuroevolution process. Our results demonstrate that the sparser RNNs evolved by ASNE significantly outperform traditional one and two layer architectures consisting of modern memory cells, as well as the wellknown NEAT algorithm. Furthermore, we improve upon prior stateoftheart results on the time series dataset utilized in our experiments.
mycapequ[Equation][]
1 Introduction
Given their success across a wide swath of pattern recognition tasks, artificial neural networks (ANNs) have become a popular tool to use when attempting to solve datadriven problems. However, in order to solve increasingly more complicated problems, neural architectures are becoming vastly more complex. Increasing the complexity of an ANN entails having to operate with more layers of neural processing elements required, most of which are usually wider and more denselyconnected, greatly complicating the model design process. The resulting increase in complexity introduces new challenges and complications when fitting these ANN models to actual data. These problems are further compounded when ANNs are meant to process temporal data, entailing recurrent connections which can span varying periods of time. As a result, crafting performant ANNs becomes expensive and incredibly difficult for engineers, highlighting a grand challenge facing the domain of machine learning – the automation of ANN architecture design, which includes selecting the form of the underlying synaptic topology as well as the values of the weights themselves. The key to this automation might lie in developing optimization procedures that can effectively explore the vast, combinatorial search space of possible topological structures that could be constructed from a large set of neuronal units and the wide variety of synaptic connectivity patterns that relate them to one another.
Recent interest in automated architecture search has resulted in many proposed ideas related to deep feed forward and convolutional networks, including those based on natureinspired metaheuristics [yang2010nature]. However, few, if any, have focused on the far more difficult problem of optimizing recurrent neural networks (RNNs) aimed at processing temporal, sequential data such as time series, i.e., automated RNN design.
This study addresses the challenge of automated RNN design by developing a novel ANN topology optimizer based on concepts from artificial evolution and ant colony optimization (ACO). Specifically, we propose an algorithm called Ant Swarm NeuroEvolution (ASNE), which automatically constructs and optimizes the topology of RNNs, with a focus on time series data prediction. In developing our optimization approach, we furthermore develop and experiment with variations of our method in the following ways:

In order to encourage the discovery of more sparselyconnected neural topologies, we investigate different schemes for dynamically modifying the pheromone traces deposited by ant agents that compose the swarm. Specifically, we introduce functions for introducing regularization into the overall optimization, slowly clearing out denselyconnected synaptic areas by depriving poorly performing weights/edges of pheromone accumulation.

We incorporate and analyze various weight initialization schemes and find that a Lamarckian inheritance strategy is highly effective.

Inspired by the rolespecialization that ants operate under within the context of realworld ant colonies, we extend ASNE to utilize different specialized ant agents to modularize the underlying synaptic connectivity construction process, which we find greatly improves solutions found by our metaheuristic.
Experimentally, we validate our proposed natureinspired metaheuristic on an openaccess realworld time series data set collected form a coalfired power plant. A rigorous ablation study of the ASNE algorithm is conducted by analyzing the candidate network topologies it finds. A total over experiments with varying heuristics and hyperparameters were performed, which entailed training different RNNs. Our results indicate that ASNE is able to build well performing, arbitrary RNN structures with connections that span both structure and time using both simple and complex memory cells. More importantly, ASNE is shown to significantly outperform the wellknown neuroevolutionary algorithm, NEAT [stanley2002evolving], as well as the stateoftheart evolutionary optimizer, EXAMM [ororbia2019examm], which have held the prior best results on this data set.
2 Related Work
With respect to neuroevolution of recurrent network topologies, a great deal of work already exists, ranging from stochastic alteration of the topology as in dropout [srivastava2014dropout] to something more sophisticated like that in the original NEAT [stanley2002evolving] and its more modern incarnate HyperNEAT [stanley2009hypercube]. Other proposed approaches include EPNet [yao1997new], EANT [kassahun2005efficient], GeNet[xie2017genetic], CoDeepNEAT [miikkulainen2019evolving], and EXACT [desell2017large]. EXACT was recently extended to evolve RNNs that used LSTM memory cells (named EXALT) and shown to perform quite well on timeseries prediction problems [elsaid2019evolving]. Later, the algorithm, named EXAMM, was generalized to evolve networks consisting of a library of recurrent memory cells [ororbia2019examm]. These previously proposed ideas center around the use of a genetic algorithm [holland1992adaptation], where optimization is inspired by approaches that draw from the evolution of organisms, of either Darwinian and/or Lamarckian nature. More recently, work by Camero et alhave shown that a Mean Absolute Error (MAE) random sampling strategy can provide good estimates the performance of RNNs [camero2018low] and have successfully used it instead of actually evaluating or training RNNs to speed up neuroevolution of LSTM RNNs [camero2019specialized].
Nonetheless, very few studies in the body of work described above consider ant colony optimization (ACO) [dorigo1992optimization] as the central optimizer for network topology, and even fewer in general focus on exploring how to evolve complex temporal models like the RNN, with a few exceptions, such as EXALT and EXAMM. Of the few that have investigated ACO, most existing work has used it to strictly optimize feed forward networks and, even in that case, have dominantly focused on either initializing the weights of the connections [mavrovouniotis2013evolving], or on reducing the dimension of the input vector solution space [sivagaminathan2007hybrid]. One notable effort that has used ACO for RNN optimization in some form is [desell2015evolving], which used ACO to optimize smaller neural network structures based on Elman recurrent networks [elman1990finding].
This paper contributes to the domain of natureinspired neural network topology optimization by proposing a novel metaheuristic for evolving the full structure of an RNN as opposed to prior studies that have applied the technique as only a partial component of the optimization process [elsaid2018optimizing] or in smaller Elman RNN topologies with limited recurrent connectivity [desell2015evolving]. Furthermore, our algorithm is capable of utilizing the same full suite of recurrent memory cells as the stateoftheart evolutionary algorithm EXAMM (LSTM, GRU, MGU, UGRNN, and RNN cells). To the best of our knowledge, we are the first to propose an ACObased approach to automate RNN design, offering a powerful procedure that combines concepts of both neuroevolution and ant colony metaheuristic optimization.
3 Ant Swarm NeuroEvolution (ASNE)
ASNE handles the optimization of ANN structures by constructing a simple multiagent system, where each agent treats the ANN as graph structure, considering neuronal processing elements (PEs) as the nodes and the synaptic weights that connect PEs as the edges. In order to design the operations that these agents perform as well as the manner in which they traverse the ANN graph, we may appeal to the metaphor of ants and the collective they holistically form, i.e., the ant colony. As a result, the agents will function based on simplifications of myrmecological principles, such as the mechanics of anttoant social interaction.
At a high level, in ASNE, the individual ant agents operate on a single massively connected “superstructure”, which contains all possible ways that PEs may connect with each other both in terms of structure, i.e., all possible feedforward pathways that start from the input/sensory PEs and end at the output/actuator PEs, and time, i.e., all possible recurrent connections that span many multiple time delays. In our implementation, ants choose to move over connections between nodes (or neurons), probabilistically and as a function of a simulated chemical known as the “pheromone”. In nature, the pheromone is one primary driver of how ants communicate with each other, the traces of which allow the collective to “know” of potential food sources ensuring the survival of the colony in the long term. When an ant finds food, the ant will start marking the path it takes to return back to the colony, the pheromone trace of which other ants will then subsequently follow. In the ANN superstructure, these traces, which are simulated by an additional, dynamic scalar weight (or importance value) assigned to a given synapse, will bias any given ant agent to favor selecting some possible (more rewarding) synaptic pathways over others.
The few existing efforts on using forms of ACO for RNN optimization [elsaid2018optimizing, elsaid2019evolving] restrict the ACO process to operate within individual LSTM memory cells. In contrast, ASNE allows individual ants traverse a single massively connected “superstructure”, which contains all possible ways that the nodes of an RNN may connect with each other both in terms of structure (i.e., all possible feed forward connections), and in time (i.e., all possible recurrent connections spanning many multiple time delays)^{1}^{1}1Note that this superstructure is more connected than a standard fully connected neural network – each layer is also fully connected to each other layer as well, allowing for forward and backward layer skipping connections, with additional recurrent connections between node pairs for each time skip allowed.. The highlevel pseudocode for our ASNE topology optimizer is depicted in Algorithm 1.
ASNE was developed as an asynchronous parallel system for use on high performance computing resources, which has a master process that maintains the colony information and worker processes to (locally) train the RNNs. This parallel implementation is asynchronous, the master process generates new RNNs as needed for worker processes (which operate on separate, dedicated CPU or GPU resources) and updates colony information and pheromones as trained RNN results are returned. This results in a naturally load balanced algorithm with high scalability.
Within the master process itself, ASNE operates by having a fixed number of ant agents traverse the neural superstructure. Ants choose to move over connections between nodes (neurons) randomly, but they are probabilistically biased towards connections with higher simulated “pheromone” values. Pheromone deposit values are periodically evaporated to prevent the search process from becoming stuck in local minima. Interestingly enough, the modification of the evaporation function could be considered to a way in which one could encode certain priors into the ANN itself.
From the overall superstructure, which the ant agents exclusively operate on, RNN subnetworks are extracted (as dictated by the current pheromone trace network available at the current simulation time step, which yields a map of nodes and connecting synapses, both recurrent and feedforward, visited by the ant agents) and then further finetuned locally with only a few epochs of backpropagation (backprop) through time. After a particular worker is done locally training a RNN subnetwork, the candidate’s weight values and cost (fitness) function (measure on a validation subset of data) are communicated back to the swarm and superstructure (housed in the master process), adjusting the pheromone trace network and affecting future ant agent traversal behavior.
One crucial element in our ASNE procedure is the introduction of different ant agent types, which is inspired by how real ants specialize to act according to specific roles to serve the needs of the colony [odonnell2018antroles]. Specifically, we consider designing ant agents that serve specific roles in constructing parts of candidate RNN subnetworks – some ants exclusively traverse feedforward synaptic pathways while others only explore recurrent synaptic pathways. The highlevel pseudocode for our ASNE topology optimizer is depicted in Algorithm 1^{2}^{2}2The full code is posted at: https://github.com/travisdesell/exact/tree/adding_ant_colony..
Within the framework of ASNE, we investigate variations of its various underlying mechanisms. These include the use of Lamarckian weight initialization, allowing ant agents to also select from multiple memory cell types as opposed to operating exclusively with simple neurons, introducing specialized ants that have different graph traversal strategies, and constraining ant movement and manipulating the pheromone evaporation function in order to encourage the discovery of sparse RNN topologies.
3.1 Lamarckian Weight Initialization
Edges and recurrent edges’ weights can be randomly initialized each time a new RNN is generated by the ants. However, initializing parameters this way requires local tuning (via backprop) for many epochs for the RNN to reach suitable generalization error, as they do not make use of any information gained by prior trained RNN candidates. Further, it has been shown that reusing of parental weights (i.e., epigenetic or Lamarckian weight initialization) can significantly speed up the neuroevolution process and result in better performing, smaller ANNs in general [desell2018accelerating].
To apply Lamarckian weight initialization to ASNE, each edge in the ant swarm’s connectivity superstructure also tracks a weight value in addition to its pheromone value. These weights are randomly initialized uniformly . Each time a generated RNN performs well, the weights of its best performance, as measured on a validation data subset, are used to update the weight values in the swarm’s superstructure internal bookkeeping.
Formally, we define is a function of the population’s best and worse evaluated RNN fitness, as the colony’s edge weight,, as the corresponding neural network’s edge weight, as the population’s best fitness, and as the population’s worst fitness. Weight initialization then proceeds as follows:
(1a)  
(1b)  
(1c) 
With respect to the function , we investigated two variations. The first variant, as shown in Equation 1, used the fitness of the RNN used to update the weights to determine how much these new (locally found) weight values effect those of the colony. The second variant of was set to a predetermined constant instead of being calculated or adjusted by fitness. This process essentially allows for a running average (either with a fixed update or dynamic update based on fitness) of the best weights found for each connection in the superstructure. When a new RNN is generated, it uses the current weight values in whatever edges that were extracted from the superstructure on the master process. This allows for a Lamarckian evolution for edges weights, as prior RNNs with the best fitness scores are allowed to pass on their weights to future generations.
3.2 Memory Cell Selection
For any particular node in the superstructure, ASNE also has the ability to utilize the pheromones present to select which memory cell type a particular node will be in the generated network. A node could chosen to be either an LSTM [hochreiter1997long], a GRU [chung2014empirical], an MGU [zhou2016minimal], a UGRNN [collins2016capacity], or a RNN cell [ororbia2017learning]. We refer the reader to these works for the formulations of these memory cells. Pheromones are deposited and updated for each of these memory cell possibilities as described below.
3.3 Altering Graph Traversal with Ant Species
As mentioned above, we explored various strategies for guiding ant traversal over the connectivity superstructure. Inspired by role specialization in real colonies, we implemented ant agents that explored the connectivity graph in specific ways. First, we started with a generic ant agent, called the standard ant, which was allowed to traverse through the massively connected colony superstructure in an unbiased manner. This, in essence, recovers the standard simple ant agent in classic ACO, which has complete freedom to explore any piece of a given graph structure. However, it became quickly apparent that this type of ant would get “stuck” in the network, generating a significantly high number of recurrent connections before finally reaching an output node. This meant that the RNN candidates extracted for local finetuning were rather dense, and in turn, computeheavy (featuring many extraneous parameters as is characteristic of overparameterized models).
Why do standard/simple ants get stuck or meander too long in the superstructure? In the superstructure, nodes (especially at the final hidden layer) have the option of selecting potential backward recurrent paths, which significantly outnumber the number of potential forward moving paths (see Figure 1). Assuming that each connection has an equal number of pheromones (which is a standard setting for pheromone initialization), agents will circle around the colony using these backward paths, yielding RNN candidates with very dense recurrent structure.
To prevent this problem, our first tactic was to alter the pheromone deposit function by adding extra pheromones to forward paths upon initialization as well as after every pheromone update. The biasing method yielded better proportions of forward and backward paths. Algorithm 2 illustrates this process.
Even with this forward path bias added to the pheromone deposit function, when using standard ants, we found that ASNE still tended to favor the generation of fairly dense networks. Altering the number of ant agents used to explore the structure as a means to control density of RNN candidates proved to help somewhat but was rather unwieldy and entailed far too much external human intervention. Instead, we developed an ant agent role specialization scheme that we found worked far better as an automatic control mechanism to control the network size and synaptic density.
The first agent role, the explorer ant, means that the agent is only allowed to choose from forward connections in the connectivity superstructure. The connections selected by this specialized agent would utilized to generate the base neural structure upon which recurrent connections could then be added to. After the explorer ants selected the possible nodes and forward connections, two additional specializations of what we call social ants would then be used, i) forward recurrent ants and, ii) backward recurrent ants. Social ants are first restricted to only visiting nodes that have already been selected by the explorer ants. In the case of the forward recurrent ants, when a path is chosen, the agent would specifically create a recurrent connection that moved forward in the network along the same path, along with a selected time skip (determined by pheromones). Backward recurrent ants, on the other hand, move backwards through the network and, for each path they take, a backward recurrent connection is added, along with a selected time skip (also determined by pheromones). Figure 2 provides an example of possible pathways that these specialized agents can take in a colony superstructure.
In addition to the development of specialized ant agents as described above, we explored two modes for general ant movement; i) ants were allowed to pick edges that could jump over layers in the colony (i.e., the superstructure is massively connected, with a plethora of skip connections), or ii) ants were only allowed to select edges between consecutive layers (i.e., the superstructure is fully connected, with no skip connections). This was tested to see the impact that layer skipping would have on the sparsity and performance of generated RNNs. Jumping and nonjumping modes were tested for both the standard ants (with and without forwardpath bias) and the specialized ant agent roles.
3.4 Updating Pheromone Values
In this section, we describe the various schemes we experimented with in designing the ASNE optimization procedure. We define as the pheromone value, as the pheromone decay parameter, as the weights of the evaluated (candidate) RNN, and as the candidate model’s fitness. Specifically, we describe four different functional schemes used to model pheromone deposits.
The first strategy we implemented for ASNE is also standard for classical ACO setups. This deposit scheme rewards well performing RNNs with a fixed (constant) pheromone deposit while penalized illperforming RNN models by evaporating the pheromone trace by a constant evaporation value, . Specifically, this approach is defined as:
(2) 
The second strategy we implemented was one that used the fitness (value) as a parameter to guide pheromone deposit. This has been shown to improve ACO performance in prior studies [sivagaminathan2007hybrid]. This scheme is defined as follows:
(3) 
The third strategy was to use the values of the neural synaptic weights themselves to control/guide the deposit of pheromones. Specifically, we inserted a penalty on the weights, specifically an L1 penalty (assuming a Laplacian prior of the synaptic weight values), in order to encourage regularization that favored sparser connectivity structure. This form of weight decay is sometimes applied to ANNs when controlling for overparameterization and sparse weight matrices (with many near hardzero values) are highly desirable. L1 regularization was applied to the pheromone deposition calculation in the following manner:
(4) 
The fourth and final strategy we employed was to insert an L2 penalty to regularize the RNN candidate weights. This assumes a Gaussian prior over the synaptic weight values and is sometimes referred to in ANN literature as “weight decay”. We incorporate L2 regularization into pheromone deposition according to the following formula:
(5) 
We developed these L1 and L2 functional variations of pheromone deposit schemes in the hopes that they would ultimately encourage/reward the uncovery of sparse, compact RNN predictive models.
3.5 Pheromone Evaporation
Pheromone trace values (deposited on the superstructures synaptic edge pathways) evaporate or “decay” after each generation of an RNN in order to reduce the amount of pheromones on synaptic edges that are not being used much by ant agent collective [sivagaminathan2007hybrid, mavrovouniotis2013evolving, liu2006evolving]. Pheromone values are updated (or decayed) according to the following equation:
(6) 
where is the pheromone value after the update, is the current pheromone value, is the original baseline pheromone value, and is the pheromone evaporation rate. This function evaporates the pheromone back towards the original baseline value.
4 Results
All ASNE and EXAMM experiments generated total RNNs, training each for epochs. NEAT, on the other hand, was allowed to generate RNNs. If we assume that a forward pass (forward propagation) and a backward pass (backprop calculation) are approximately the same computationally, this generously gave NEAT approximately times the amount of compute time (as RNNs trained for epochs would equivocate to forward and backward passes). The RNNs with nonevolvable (fixed) architectures were allowed to train for epochs. Every experiment was repeated times to compute means and standard deviations in order to ensure a proper statistical comparison.
ASNE used a colony superstructure with input nodes, hidden layers, each with hidden nodes, and a single output node. Recurrent synapses could span , or steps in time. The resulting connectivity superstructure consisted of nodes, edges, and recurrent edges. While this may seem modest compared to modern convolutional architectures, which may consist of millions of connections, it is important to note that the RNNs generated from this superstructure are unrolled over time steps (according to the time series length of the training and testing data samples) when trained locally via backpropagation through time (BPTT). This means algorithms such as ASNE must handle (fullyunrolled) networks of up to nodes, edges, and recurrent edges with errors from the final output (predictor) potentially backpropagated over up to synaptic connections.
The dataset utilized in this study is an open access time series dataset taken from a coal fired powerplant. The data was introduced in previous neuroevolution studies for time series data prediction [elsaid2019evolving, alex2019investigating]. It consists of possible parameters, recorded for days with each parameter recorded at each minute. These parameters were used to predict the flame intensity parameter (the response variable, in regression parlance). Results were generated by training RNNs on days worth of data taken from one of the coal burners from this data set. Fitness values (mean absolute error) were calculated on the other days, which was data that was treated as a test set.
experiments were conducted in order to include all combinations of the ASNE options/variations (described below). Each experiment was repeated times to obtain robust results. These ASNE experiments generated, trained, and evaluated million RNNs. Experiments were scheduled on a high performance computing cluster with Intel^{®} Xeon^{®} Gold 6150 CPUs, each with 36 cores and 375 GB RAM (total 2304 cores and 24 TB of RAM). Each experiment utilized nodes. Overall, it took approximately days to complete the entire battery of experiments. Given the unstructured nature of the RNNs evolved in this work, utilizing CPUs has been found to be more efficient than GPUs as there are no wide, fully connected layers which would benefit from parallelized matrix algebra on a GPU. Further, it allows the use of large scale high performance computing clusters which typically have many more CPUs than GPUs available.
4.1 Backpropagation Hyperparameters
All ANNs were trained with backprop and stochastic gradient descent (SGD) using the same hyperparameters. SGD was run with a learning rate and used Nesterov momentum () to smooth out the local gradient descent. No dropout regularization was used since it has been shown in other work to reduce performance when training RNNs for time series prediction [elsaid2018optimizing]. To prevent exploding gradients, gradients were rescaled (as prescribed by Pascanu et al [pascanu2013difficulty]) to a unit Gaussian ball when the norm of the gradient was above a threshold of . To improve performance for vanishing gradients, gradient boosting (the opposite of clipping) was used when the norm of the gradient was below a threshold of . The forget gate bias of the LSTM cells had added to it as this has been shown to yield significant improvements in training time by Jozefowicz et al [jozefowicz2015empirical]. Weights for RNN in all other cases were initialized as described in the section describing our Lamarckian Weight Initialization 3.1 scheme for ASNE and in Ororbia et al for EXAMM [ororbia2019examm].
4.2 ASNE Options and Hyperparameters
The influence/effect of individual ASNE hyperparameters was carefully investigated in this study. A pheromone decay rate of and a pheromone evaporation rate of were chosen as they were shown to be effective in preliminary tests and is within the recommended standard range [sivagaminathan2007hybrid]. The other ASNE parameters we considered were:

Number of ants : {20, 40, 80, 160}.

Regularization update parameter: {0.25, 0.65, 0.90}.

Initializing RNN weights with constant values of ({0.3, 0.6, 0.9}), using as calculated by a function of fitness, and nonLamarckian randomized weight initialization.
The application of the examined heuristics that appear in the figures and tables that follow are labeled as follows:

Function :

Constant :

L1 Pheromone regularization: (Equation 4)

L2 Pheromone regularization: (Equation 5)

Standard Ant Species:

Standard Ants:

Standard Ants with Bias:


Multi Species Ants:

Explorer Ants:

Explorer Ants and Forward Social Ants:

Explorer Ants and Backward Social Ants:

Explorer Ants, Forward and Backward Social Ants:


Layer Jumping:

No Layer Jumping:
4.3 Performance of Individual Heuristics
Figure 3 presents the performance of ASNE when each each heuristic is applied separately. Furthermore, it presents for comparison the performance of the stateoftheart EXAMM, NEAT, and traditional fixed standard RNNs. While ASNE in this case (augmented by only one heuristic) did not outperform EXAMM except for some outliers, both EXAMM and ASNE showed dramatically better performance than NEAT, even though NEAT was given a significant amount of extra compute time. ASNE, EXAMM and NEAT also significantly outperformed traditional RNNs. Some of the gain over NEAT is most likely due to the use of backpropagation by EXAMM and ASNE since NEAT uses fairly simple and nongradient based recombination operations to adjust weights.
4.4 Performance of Combined Heuristics
Top 10  Top 25  Top 100  Top 250  Top 500  
Mean  Median  Best  Mean  Median  Best  Mean  Median  Best  Mean  Median  Best  Mean  Median  Best  

3(0)  4(0)  3(0)  9(0)  7(0)  9(0)  26(0)  23(0)  31(8)  58(0)  54(0)  49(8)  108(1)  96(0)  100(14) 
Const  7(0)  6(0)  7(0)  14(0)  14(0)  12(0)  60(0)  63(0)  54(8)  147(0)  149(0)  155(16)  294(0)  301(0)  299(43) 
No  0(0)  0(0)  0(0)  2(0)  4(0)  4(0)  14(0)  14(0)  15(0)  45(0)  47(0)  46(0)  98(0)  103(0)  101(0) 
L1  2(0)  4(0)  0(0)  9(0)  8(0)  3(3)  42(0)  34(0)  30(4)  96(0)  96(0)  91(4)  190(0)  186(1)  186(21) 
L2  5(0)  5(0)  6(0)  13(0)  12(0)  16(1)  40(0)  45(0)  38(3)  100(0)  98(0)  95(12)  189(0)  192(0)  185(21) 
StdAnts 
0  0  0  1  0  0  3  0  0  20  19  0  80  77  7 
StdBiasAnts  0  0  0  0  0  0  3  1  0  23  16  0  83  83  11 
ExpAnts 
0  0  10  0  0  25  1  0  100  10  6  250  92  85  440 
ExpFrdAnts  6  7  0  14  15  0  45  49  0  98  103  0  123  128  40 
ExpBkwAnts  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
ExpFrdBkwAnts  4  3  0  10  10  0  48  50  0  99  106  0  122  127  2 
No Jump 
0  0  5  0  0  13  0  0  52  0  0  128  2  9  282 
Layer Jump  10  10  5  25  25  12  100  100  48  250  250  122  498  491  218 
20 Ants  0  0  2  0  0  6  0  0  24  0  0  65  3  6  220 
40 Ants  2  0  3  5  1  7  14  15  23  50  57  63  97  87  120 
80 Ants  4  3  2  8  11  6  44  45  26  82  80  60  175  173  80 
160 Ants  4  7  3  12  13  6  42  40  27  118  113  62  225  234  80 

The combined application of multiple different heuristics, as illustrated in Figure 4, yielded ASNE results that outperformed all baselines, including the fixed RNNs, NEAT, as well as EXAMM. Table 1 provides statistics ranking each of the heuristics based on how many times the experiments that utilized them appeared in the top , , , , and best results as determined by the mean, median, and the best performance of the RNN generated in the experiment’s 10 repeats. Values in parentheses are the number of times an experiment that only utilized that heuristic appeared in that top ranking. The utilization of multiple heuristics dominated the top results, with individuallyapplied heuristics not appearing in the top , and only times in the top (only as best results).
Lamarckian weight inheritance also proved to be important, yielding strong performance, with all of the top utilizing either functional or constant parameters. Furthermore, it also occurred (mean), (median) and (best) times in the top , and (mean), (median), and (best) times in the top .
Additionally, all of the best performing RNNs used layerjumping ants, which tend favor more sparse connectivity patterns. Most of the best results used pheromone weightregularization, with L2 regularization appearing at a nearly 50% rate in the top , and results. The regularization factor was also high, at 65% or 90%, for most of the best experiments that used it.
All of the top best results utilized the multiple ant species heuristic, which strongly supports the use of specialized ants. The number of ants varied between and for all the top results in the mean and median case, with a larger number of ants tending to perform better. However the case of ants did occasionally appear in the best cases, even sometimes in the top and, furthermore, these networks tended to be rather sparse but very well performing. This may suggest that the experiments that utilized more ants had an easier time finding the most important structures, but also potentially had extraneous connections which were not needed. In contrast, the experiments with less ants had less of a chance of finding these important structures due to lower (overall) connectivity. This suggests that further optimizations could be designed to better guide ASNE towards the discovery of more efficient network architectures.
Perhaps one the most interesting items to observe is the performance distribution when multiple ant agent roles was used in ASNE. The entirety of the best found RNNs, up to the top were from explorer ants only, so these generated RNNs only had recurrent connectivity in terms of whatever the various memory cells offered. However, for the mean and median performance of the experiments, nearly all the top , , and consisted of explorer and forward recurrent roles or explorer, forward, and backward recurrent ant specializations – with only a very few of the only explorer ant only configurations showing up in the top and . First, this suggests that backward recurrent connections (which are most commonly utilized in RNNs) were less effective than forward recurrent connections. Second, it also appears that adding these recurrent connections tended to make the RNNs perform significantly better on the average and median cases, while the RNNs which were generated with only explorer ants had the ability to occasionally find RNNs that generalized quite well. These results certainly suggest further study in order to better understand the effect of combining recurrent connections and memory cells. In addition, perhaps alternative strategies can be developed that retain the stability of adding recurrent connections while still efficiently finding wellgeneralizing RNNs.
RNN Density
Tables 2 and 3 show the number of nodes, edges, and recurrent edges in the best evolved RNNs for the experiments related to ASNE augmented only with single, individual heuristics. EXAMM found the simplest structures but these were not always the bestperforming, which may suggest that EXAMM, as powerful as it is, still sometimes gets trapped in local minima. Utilizing the multiple ant agent roles and L2 pheromone regularization proved to be very effective in generating smaller, sparser RNNs. The smaller RNN size combined with its strong performance in the top rankings suggest that modeling ant role specialization can significantly improve how well an ACO/neuroevolutionary, such as the proposed ASNE, generates candidate RNNs.
Fitness Structure Coefficient
Figure 5 examines the relationship between the size of the network and its fitness. Results for RNNs from the top best performing experiments are shown along with RNNs (taken from the individual heuristic experiments).
The following equation was used to calculate a measure of the contribution of each weight to the fitness of the RNN:
(7) 
where is a structural coefficient calculated by (the mean absolute error of the RNN) and is the number of weights currently contained in the candidate RNN structure. Higher values represent RNNs where weights contribute more to the performance of the network.
Min  Max  Avg  Reduce%  


%  
%  
%  
%  
%  
%  
%  
L1  %  
L2  %  
EXAMM  %  

Min  Max  Avg  Reduce %  

%  
%  
%  
  
%  
%  
%  
L1  %  
L2  %  
EXAMM  %  

5 Discussion
To the best of our knowledge, this work represents the first application of ant colony optimization (ACO) to the problem of neuroevolution/neural architecture search for recurrent neural networks with varying recurrent time spans and more complex connectivity patterns (the only prior related study that investigated ACO for evolving RNNs was critically constrained to small RNNs with a single recurrent timestep and Elmanstyle connectons [desell2015evolving]). Specifically, we proposed the novel ant swarm neuroevolution algorithm (ASNE) for metaheuristically searching the the massive search space of possible RNNs with complex connectivity patterns (of both recurrent and feedforward forms). ASNE generates candidates from a massivelyconnected superstructure (the colony/swarm), taking advantage of ACO for structural optimization and concepts from neuroevolutionary/genetic approaches for maintaining populations of RNN candidates that are trained locally and asynchronously (making ASNE a memetic procedure as well). A hallmarkk of ASNE is its computational formalization of role/specialization in real ant colonies – ant agents internally are prevented from getting stuck “wandering” around the superstructure through the use of different kinds of ant agents that are constrained to only explore different components of the underlying complex graph space. This is a form of modularization that proves particularly useful in cutting up large, complex search spaces under ASNE.
Our experimental results show that using ants with different roles generated RNNs that were not only sparse but performant – these candidates almost entirely outperformed the more standard ant traversal strateg even when standard ants were biased to more likely select forward paths. This innovation of utilizing multiple ant types improves the ACO core of ASNE when searching for effective RNNs. Furthermore, Lamarckian weight inheritance greatly improved the accuracy of the generated RNNs^{3}^{3}3Corraborating prior studies that have also shown the benefits of such an initialization scheme [desell2018accelerating, ororbia2019examm]. and allowing ants to jump (or skip) layers proved to not only boost performance but also to increase sparsity. Lastly, to our knowledge the introduction of L1 and L2 regularization into the ACO pheromone deposition process is quite novel if albeit a bit unconventional. Our results show by playing with the form of the pheromone adjustment function, we can increase the likelihood that sparser RNNs are found that also outperform schemes that do not incorporate regularization/constraints. The strategies we formalize in this work are generic and could be applied to any other ACO algorithm’s pheromone update process.
The proposed ASNE metaheuristic not only provides advances and new concepts for the field of ant colony optimization research to further explore but also shows strong promise for its use as an alternative neuroevolution algorithm for automated RNN architecture search. It significantly outperforms the wellknown NEAT algorithm (even when NEAT is given an order of magnitude more computation), and, more importantly, ASNE outperforms the stateoftheart EXAMM genetic evolutionary algorithm on the time series problem studied in this paper.
The work also opens up a number of avenues for future study as well as presents some interesting questions. In particular, why were explorer ants able to find the best networks and yet performed quite poorly in the mean and median cases? Why did explorer ants combined with social recurrent ants perform extremely well in the mean and median cases but not in the best cases? Answering experimental queestions such as these might lead to insights as to how recurrent connections that skip multiple steps of time interact with recurrent memory cells, potentially leading to the design of more expressive, RNN structures that better capture longerterm dependencies in sequential data. Finally, future work should entail investigation of ASNE on other time series datasets as well as sequence modeling (and classification) problems more commonly explored in mainstream statistical learning research, such as language modeling [mikolov2010recurrent, ororbia2017learning].
Acknowledgements
This material is in part supported by the U.S. Department of Energy, Office of Science, Office of Advanced Combustion Systems under Award Number #FE0031547. We also thank Microbeam Technologies, Inc. for their help in collecting and preparing the coalfired power plant dataset. Most of the computation of this research was done on the high performance computing clusters of Research Computing at Rochester Institute of Technology. We would like to thank the Research Computing team for their assistance and the support they generously offered to ensure that the heavy computation this study required was available.
References
Appendix A Complete Results
Minimum  Maximum  Mean  Median  Std. Dev.  
NEAT  
EXAMM  
L2.65AJExpFrdBkw  
AJExpFrdBkw  
AJExpFrdBkw  
AJExpFrd  
L1.65AJExpFrdBkw  
L2.25AJExpFrd  
L1.25AJExpFrd  
L2.65AJExpFrd  
L2.25AJExpFrd  
L2.9AJExpFrd  
L2.65AJExpFrd  
L1.65AJExpFrd  
L2.65AJExpFrdBkw  
L2.65AJExpFrdBkw  
L2.9AJExpFrdBkw  
L2.65AJExpFrd  
L1.9AJExpFrd  
L1.25AJExpFrdBkw  
L1.65AJExpFrd  
L2.25AJExpFrdBkw  
L2.9AJExpFrd  
L2.25AJExpFrdBkw  
L1.65AJExpFrd  
L1.65AJExpFrd  
L1.65AJExpFrd  
L1.65AJExpFrd  
AJExpFrd  
L1.65AJExpFrdBkw  
L1.9AJExpFrd  
L2.9AJExpFrd  
L2.65AJExpFrd  
AJExpFrd  
L2.65AJExpFrdBkw  
AJExpFrd  
L2.65AJExpFrd  
L1.25AJExpFrdBkw  
AJExpFrdBkw  
L2.65AJExpFrd  
AJExpFrdBkw  
AJExpFrd  
AJExpFrdBkw  
L1.9AJExpFrd  
L2.65AJExpFrd  
L2.9AJExpFrdBkw  
AJExpFrdBkw  
L2.25AJExpFrd  
L1.65AJExpFrdBkw  
L2.65AJExpFrd  
L1.25AJExpFrd  
L1.25AJExpFrd  
AJExpFrd  
L2.25AJExpFrd  
L1.25AJExpFrd  
L1.9AJExpFrd  
L1.9AJExpFrdBkw  
L1.25AJExpFrdBkw  
L2.9AJExpFrd  
L1.9AJExpFrdBkw  
AJExpFrdBkw  
L2.9AJExpFrdBkw  
L2.25AJExpFrd  
L1.9AJExpFrdBkw  
L1.9AJExpFrdBkw  
L1.65AJExpFrdBkw  
L1.65AJExpFrdBkw  
L1.25AJExpFrd  
L1.65AJExpFrdBkw  
L2.25AJExpFrd  
AJExpFrdBkw  
L2.25AJExpFrdBkw  
L1.9AJExpFrdBkw  
L2.9AJExpFrdBkw  
AJExpFrdBkw  
AJExpFrdBkw  
L1.25AJExpFrd  
L1.65AJExpFrd  
L2.9AJExpFrdBkw  
L1.25AJExpFrdBkw  
AJExpFrdBkw  
L2.25AJExpFrdBkw  
L2.9AJExpFrdBkw  
L1.25AJExpFrdBkw  
AJExpFrd  
L1.9AJExpFrdBkw  
L1.9AJExpFrdBkw  
L2.65AJExpFrdBkw  
L2.9AJExpFrd  
L1.65AJExpFrdBkw  
L2.25AJExpFrd  
L2.65AJExpFrdBkw  
L2.65AJExpFrdBkw  
L2.9AJExpFrd  
L1.65AJExpFrdBkw  
L2.25AJExp  
L2.65AJExpFrdBkw  
L1.25AJExpFrdBkw  
AJExpFrd  
AJExpFrdBkw  
L2.9AJExpFrdBkw  
L1.9AJExpFrdBkw  
AJExpFrdBkw  
L1.25AJExpFrd  
L2.65AJExpFrd  
L1.25AJExpFrd  
AJExpFrd  
AJExpFrd  
L2.9AJExpFrdBkw  
L2.65AJExpFrd  
L2.9AJExpFrdBkw  
L2.65AJExpFrdBkw  
L2.65AJExpFrdBkw  
AJExpFrd  
L2.65AJExpFrd  
L2.9AJExpFrd  
AJExpFrd  
L1.25AJExpFrdBkw  
AJExpFrdBkw  
L2.9AJExpFrd  
L2.65AJExpFrd  
L2.25AJExpFrd  
L2.9AJExpFrd  
AJExpFrdBkw  
AJExpFrd  
L1.9AJExpFrd  
AJExpFrd  
L2.65AJExpFrdBkw  
L2.25AJExpFrd  
L1.25AJExpFrdBkw  
AJExpFrdBkw  
L1.9AJExpFrd  
AJExpFrd  
L1.65AJExpFrdBkw  
L1.65AJExpFrd  
L1.25AJExpFrdBkw  
L2.25AJExpFrdBkw  
AJExpFrdBkw  
AJExpFrdBkw  
L1.9AJExpFrdBkw  
AJExpFrdBkw  
L2.9AJExpFrd  
L2.25AJExpFrd  
L2.25AJExpFrdBkw  
L1.65AJExpFrd  
AJExpFrdBkw  
AJExpFrd  
L1.25AJExpFrdBkw  
L1.65AJExpFrd  
L1.9AJExpFrd  
AJExp  
AJExpFrd  
AJExpFrd  
L2.9AJExp  
AJExpFrd  
L1.65AJExpFrdBkw  
L1.65AJExpFrd  
L2.65AJExpFrd 