Towards coevolution of fitness predictors and Deep Neural Networks
Abstract
Deep neural networks have proved to be a very useful and powerful tool with many practical applications. They especially excel at learning from large data sets with labeled samples. However, in order to achieve good learning results, the network architecture has to be carefully designed. Creating an optimal topology requires a lot of experience and knowledge, and unfortunately there are no practically applicable algorithms which could help in this situation. Using an evolutionary process to develop new network topologies might solve this problem. The limiting factor in this case is the speed of evaluating a single specimen (a single network architecture), which includes training on the whole large dataset. In this paper we propose to overcome this problem by using a fitness prediction technique: use subsets of the original training set to conduct the training process and use its results as an approximation of a specimen's fitness. We discuss the feasibility of this approach in the context of the desired fitness predictor features and analyze whether subsets obtained in an evolutionary process can be used to estimate the fitness of a network topology. Finally, we draw conclusions from our experiments and outline plans for future work.
Keywords: evolutionary algorithm, Deep Learning, neural networks, fitness predictors, fitness approximation

1 Introduction
Deep neural networks (DNN) are a very powerful machine learning technique. They have numerous practical applications, with state-of-the-art performance reported in several domains, ranging from visual object recognition ([1]), through text processing ([2], [3]), to speech recognition ([4], [5]).
Neural network models are especially well suited to tackle problems where large data sets of labeled samples are available. Model capacity can be easily increased by adding more units (neurons) to layers or by adding more layers. Unfortunately, choosing the correct architecture is not straightforward. Even given a set of layers with optimal types and sizes, the learning process may ultimately fail: the resulting model can underperform in terms of accuracy or can be overfitted. To combat these kinds of problems, a number of techniques were developed: L1/L2 regularization [6], dropout/dropconnect ([7], [8]), early stopping, pretraining [9], adaptive learning rate [10], etc. Each of them has its own limitations and has to be used in a specific context to actually improve the results. Building a well-performing model requires a lot of experience and experimentation with the actually analyzed dataset.
Using automated methods to create deep neural network models would greatly improve their quality and speed up the creation of innovative structures. As demonstrated by Koza in [11], the use of evolutionary algorithms can provide complex problem solutions whose quality is comparable to those created by a human. There are examples of successful applications of this approach to benchmark problems ([12], [13]) and real-world challenges ([14]). The factor which limits the usability of such an approach for DNNs is the time of training the network, i.e. the time it takes to evaluate the model. The mentioned models were relatively simple (hundreds of neurons) and had limited training datasets (thousands of samples). Deep Neural Networks are much more complex. Additionally, increasing the scale of deep learning, with respect to both the number of training examples and the number of parameters, is recognized as the main factor which improves the quality of the learning results. Conducting the training requires a lot of processing resources. In recent years there have been many reports of successfully using GPUs ([1], [15], [4]) or powerful large-scale clusters ([16], [17]) to scale up training and inference algorithms. Unfortunately, due to the increase in data set sizes, the speed of research still suffered. It is still necessary to wait hours, days or even weeks to learn whether a chosen topology combined with specific learning parameters provides optimal results.
Evolutionary methods would become applicable if only the evaluation time could be reduced. One method which helps in this situation is the coevolution of so-called fitness predictors ([18], [19], [20]). It assumes that a low-cost heuristic can be used to compare individuals in the population, instead of performing a time-consuming evaluation over the full dataset. Since the processing time is greatly reduced, evolutionary algorithms become feasible again. In this paper we present our approach to evaluating the use of subsets of the training set as fitness predictors for Deep Neural Networks. We analyzed the properties of such a solution and verified the hypothesis with experiments using the MNIST dataset [21].
The paper is structured as follows: the next section describes the background and related work. Next, we discuss the feasibility of using subsets of the training set as fitness predictors, present sample results of fitness predictor evolution and analyze the potential problems. Finally, we summarize the results and provide directions for further research.
2 Background and related work
In this section we present the related work which sets the foundation for our research.
2.1 Deep Neural Networks
In the standard approach a neural network consists of many simple connected processing units called neurons. Each of them produces a real number, which is the result of an activation function applied to the sum of inputs multiplied by corresponding weights. The neurons are grouped into so-called layers. The first one (the input layer) gets activated by sensors observing some environment. Further layers are formed by connecting the outputs of one layer to the inputs of another. Finally, some of the neurons, typically forming the last (output) layer, might influence the environment by triggering some actions or can be used as a model of some phenomenon.
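The computation described above can be summarized in a few lines of code. The following is only an illustrative sketch (the choice of tanh as the activation function and all numeric values are arbitrary examples, not taken from the paper's experiments):

```python
import math

def neuron_output(inputs, weights, bias, activation=math.tanh):
    """A neuron: activation function applied to the weighted sum of inputs."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)

def layer_output(inputs, weight_matrix, biases, activation=math.tanh):
    """A layer is a collection of neurons sharing the same input vector."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_matrix, biases)]
```

Stacking several such `layer_output` calls, each feeding the next, yields the multi-stage computation of a deep network.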
Learning in neural networks is the process of finding weight values that make the network exhibit a desired behavior. Depending on the problem and environment complexity, such behavior might require many computational stages, each of which transforms, usually in a nonlinear way, the aggregated activation of the network. In Deep Neural Networks there are many such stages.
Shallow networks have been known for a long time, with the earliest works published in the 1940s ([22]). Models with successive layers of neurons were introduced a bit later ([23], [24]), however it took time to develop practical, efficient learning methods [25]. This approach was not popular at first because of practical problems: the learning algorithm was not proven to reliably find a nearly optimal global set of weights for complex problems in a reasonable amount of time. It required many subsequent improvements, like introducing convolutional and subsampling layers [26], using GPUs ([27], [28]), max-pooling [29], L1/L2 regularization ([6]), dropout and dropconnect ([7], [8]), early stopping, pretraining [9] and adaptive learning rates ([10]), to successfully apply networks with many hidden layers (deep neural networks) to practical problems like visual object recognition ([1]), text processing ([2], [3]) or speech recognition ([4], [5]). Nowadays, thanks to successes in international competitions ([30], [31], [32], [33], [34]), neural networks have gained widespread attention and are a very dynamic field of research.
2.2 Evolution Algorithms and Neural Networks
Using the numerous layer types, techniques and learning algorithms introduces an additional layer of parameters to the machine learning system: the hyperparameters (as opposed to the parameters of the model, i.e. the weights of the neurons' inputs). Choosing them properly has a great impact on the overall performance and accuracy of a specific network topology. Together with the variety of network topologies, this makes the task of designing, training and applying a deep neural network very complicated.
In shallow neural networks, neuroevolution, i.e. the artificial evolution of neural networks using genetic algorithms, has shown great promise to improve the situation. Evolution has been applied in different scenarios, with the majority of research done in the following three: evolving the connection weight values, evolving the network topology, or evolving both.
In the first case the evolution effectively replaces the backpropagation algorithm [35]. It promises to overcome the drawbacks of gradient-descent based methods, like getting trapped in local minima or the inability to find a global minimum of a non-differentiable function. In practice, however, those problems are not commonly observed and the advantages of gradient descent, speed and scalability, outweigh them. Furthermore, this approach still requires creating and tuning the network architecture.
Evolving only the network topologies ([36], [37]) is the second option. In this approach, a topology can be created using different strategies: extending a very basic, minimal network with new neurons and connections (growing) ([38], [39]) or starting with a big network and gradually removing its elements (pruning) [40]. During the evaluation, the networks are trained and tested against a separate test set. Topologies obtained in this way were reported to generalize better. Processing them should also be faster, as they contain only the necessary layers and neurons, thus limiting the amount of needed computation.
The algorithms which form the third group are often referred to as Topology and Weight Evolving Artificial Neural Network (TWEANN) algorithms. The most widely known ones include: NeuroEvolution of Augmenting Topologies (NEAT) [38] and HyperNEAT [41], the Cartesian Genetic Programming Artificial Neural Network (CGPANN) algorithm [42], and GeNeralized Acquisition of Recurrent Links (GNARL) [43]. Evolving the topology and weights together is reported to provide better results than evolving either of them alone [44]. Unfortunately, due to technical limitations, this method has been applied only to benchmark problems like single pole balancing [42]. Applying this method to more complex use cases like visual pattern recognition has not been widely investigated yet.
The area of neuroevolution in deep neural networks is not explored to such an extent yet. In [45] HyperNEAT is used to train a neural network which learns to classify images from the classic MNIST dataset ([21]). In this approach, the topology of the network was predefined and an evolutionary algorithm was used to find the weight values. Experiments included two scenarios:
- finding weights across all layers
- finding weights for the feature-extracting layers of a convolutional network and combining them with a traditional neural network trained with backpropagation
In the best configuration, that approach achieved 92.1% accuracy, which is subpar to the results obtained with gradient-descent methods.
In [46] a genetic algorithm is used together with backpropagation to conduct the training of a neural network. For each layer of the network there are multiple sets of weights. In each iteration this set is evolved and the fittest individual (the set of weights with the smallest root mean squared error over the training samples) is chosen and further tuned with backpropagation. The authors describe how this approach was applied to the problem of classifying images from the MNIST dataset and report achieving a classification test error of only 1.44%.
As recognized in [47], in the context of deep neural networks, the main objective of using evolutionary algorithms was to improve the learning mechanisms. Unfortunately, this approach, at least at this stage of development, cannot be used to replace the gradient-based methods. On the other hand, topology evolution in deep neural networks has not been explored extensively yet. The use of traditional algorithms, like NEAT and GNARL, is limited by the time which has to be spent on training and testing a single individual. We believe that to unlock the potential of those algorithms, research has to focus on making the evaluation time as short as possible.
2.3 Coevolution of fitness predictors
Coevolution is a kind of evolutionary algorithm where one individual, within the same or a separate population, is used to determine the relative ranking between other individuals ([48], [49]). In other words, whether individual A is inferior or superior to individual B may depend on a third individual C rather than on some external fitness metric which would provide an absolute ranking. There are a number of different forms of coevolution: antagonistic (e.g. predator-prey), cooperative (e.g. symbiosis) or non-symmetric systems (e.g. host-parasite or teacher-learner). Fitness in the context of coevolution has two notions: objective and subjective. The former is the well-defined absolute ordering metric used in classical evolutionary algorithms. The latter is defined by the coevolving individuals and may be only weakly correlated with the objective fitness.
One of the major limiting factors in evolutionary computation is the time of evaluating a single individual. One approach to tackle this problem is to use fitness modelling [50] techniques, which attempt to approximate the exact fitness by using a model or coarse simulation of the analyzed system. In the context of evolutionary computation the major techniques include:
- fitness inheritance: fitness values are transferred from parents to children during crossover
- fitness imitation: individuals are clustered using a distance metric; the central individual of each cluster is evaluated in full and the resulting fitness is assigned to all elements of the cluster
- partial evaluation: fitness for some individuals is calculated exactly, for others it is inherited or modeled
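The first two techniques listed above can be illustrated with a short sketch. All names here are hypothetical, and for brevity the imitation example treats the whole population as a single cluster:

```python
def inherited_fitness(parent_a, parent_b):
    """Fitness inheritance: approximate a child's fitness as the
    average of its parents' fitness values."""
    return (parent_a["fitness"] + parent_b["fitness"]) / 2.0

def imitated_fitness(population, distance, exact_fitness):
    """Fitness imitation: fully evaluate only the most central
    individual and copy its fitness to every member of the cluster."""
    centre = min(population,
                 key=lambda ind: sum(distance(ind, other)
                                     for other in population))
    fitness = exact_fitness(centre)
    return {id(ind): fitness for ind in population}
```

In a real system the population would be partitioned into several clusters first, with one exact evaluation per cluster.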
A chosen modelling method can be incorporated into the evolutionary process in many ways, e.g. to initialize the population, guide the crossover and mutation, or replace fitness evaluations. Such an approach has numerous advantages: it reduces the evaluation cost and frequency while maintaining evolutionary progress, destabilizes local optima, helps avoid bloated solutions and can be applied in situations where no explicit fitness function exists. Unfortunately, there are also many challenges which come with using fitness approximation, like choosing the correct model, training it properly or dealing with a loss of fitness accuracy.
Both ideas (coevolution and fitness approximation) have been combined and applied successfully to a field which also suffers from long evaluation times on big data sets: symbolic regression ([51], [18], [19]). In this approach, the fitness prediction technique was used. It replaces exact fitness evaluations with a lightweight approximation which adapts together with the solution population. To achieve that, a population of so-called fitness predictors is coevolved with the population of problem solutions. Their objective is to maximize prediction accuracy. The best of the predictors is used to evaluate the solutions to the original problem. In the case of symbolic regression, fitness predictors were encoded as small subsets of the full training data set. This dramatically sped up computations and improved the quality of the results: it reduced bloat and increased the fitness values of solutions to the original problem.
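The overall structure of this scheme, two interleaved evolutionary loops, can be sketched as follows. This is only an outline of the idea, not the implementation from [18], [19]; `evolve_step`, `predicted_fitness` and `prediction_error` are hypothetical callables supplied by the surrounding system:

```python
import random

def coevolve(solutions, predictors, predicted_fitness, prediction_error,
             evolve_step, iterations):
    """Alternate between evolving solutions (ranked by the current best
    predictor) and evolving predictors (ranked by prediction accuracy)."""
    best_predictor = predictors[0]
    for _ in range(iterations):
        # Solutions see only the cheap, predicted fitness.
        solutions = evolve_step(
            solutions, lambda s: predicted_fitness(s, best_predictor))
        # Predictors are scored on how well they assess sample solutions.
        sample = random.sample(solutions, min(3, len(solutions)))
        predictors = evolve_step(
            predictors, lambda p: -prediction_error(p, sample))
        best_predictor = min(
            predictors, key=lambda p: prediction_error(p, sample))
    return solutions, best_predictor
```

The key property is that exact fitness is never computed for the whole solution population; only the small sample used to score predictors needs anything close to a full evaluation.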
3 Subsets of training set as fitness predictors
There are many ways to represent a fitness predictor and to decide how it is evolved and used. It is very important to choose an appropriate form. The chosen predictors are used for all fitness evaluations within an iteration of the evolutionary algorithm, hence they influence the direction of development of the main population. As stated in [18], the following constraints have to be met by fitness predictors:
- They have to be able to approximate the fitness of candidate solutions.
- They have to be processed significantly faster than the exact fitness calculation.
- They have to differentiate the fitness between a pair of individuals from a given population.
Fitness prediction can be conducted with different estimation methods and each of them imposes a different representation of a single predictor. One might use e.g. a decision tree, and this would force the use of a decision tree representation which could be subjected to evolution. In our case, where we want to estimate the fitness of Deep Neural Networks, we propose to conduct the estimation by training and testing with an unchanged algorithm, but each time using a different subset of the full training data set. This allows a single fitness predictor to be represented in a very simple way: as an array of indexes into the full data set. This approach has a number of advantages:
- It is easy to tune the speed vs. accuracy of the approximation with the size of the subset.
- The representation is very simple, so the implementation is less error prone.
- The evaluation of a single individual is basically the same procedure as training the neural network. It is possible to carry out experiments with different hyperparameters of the network.
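Under this representation a predictor is nothing more than a fixed-size array of sample indexes. A minimal sketch (our experiments used Torch7; this Python fragment is purely illustrative, with the dataset size matching the MNIST training set):

```python
import random

FULL_DATASET_SIZE = 60000  # size of the MNIST training set

def random_predictor(size, rng=random):
    """A fitness predictor: a list of distinct indexes into the full set."""
    return rng.sample(range(FULL_DATASET_SIZE), size)

def training_subset(predictor, full_dataset):
    """Materialize the training subset encoded by a predictor."""
    return [full_dataset[i] for i in predictor]
```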
The major disadvantage of our approach is the risk of overfitting the model because of the relatively small training sets. Given the complexity of a typical Deep Neural Network, where millions of parameters are not uncommon, overfitting is almost certain. However, we believe that this effect can be limited by using certain techniques, e.g. dropout layers.
We evaluated our idea in the environment described in [52], where a Convolutional Neural Network was used to recognize handwritten digits from [21]. First we created a topology based on LeNet-5 (Table 1) using the Torch7 framework [53]. Initially we used the full original dataset consisting of 60000 training samples. Training was conducted using the Stochastic Gradient Descent method with minibatches. With this setup, we achieved an accuracy of 99.21% on the provided validation test set. The learning parameters are listed in Table 2. Such a result is on par with the original model.
No. | Layer Type | Number of parameters | Output size
1. | Convolution (ReLU [54] activation) | 832 |
2. | MaxPooling | 0 |
3. | Convolution (ReLU activation) | 53248 |
4. | MaxPooling | 0 |
5. | Fully connected | 25600 | 200
6. | Fully connected | 2000 | 10
7. | LogSoftMax | 0 | 1
Parameter | Value
Learning rate | 0.1
Learning rate decay | 0.00001
Minibatch size | 128
Learning epochs | 20
L1/L2 coefficients | 0
Momentum | 0
3.1 Approximation of fitness
First, we evaluated whether the proposed fitness predictors can accurately predict the fitness of a given candidate, i.e. a deep neural network. Using a subset of the training data set has an obvious advantage: in the worst-case scenario the model will be able to recognize at least a part of the full data set. However, the samples should not be picked randomly, as the quality of the chosen subset has a great influence on the result of learning. To quantify the impact of this effect, we measured the difference in prediction accuracy between random and preselected subsets.
To find the subset with the best training performance we implemented a genetic algorithm which evolved a population of potential fitness predictors of a fixed size. The mutation operation was implemented as a simple replacement of a single training sample with another, randomly chosen one. For simplicity we chose the one-point variant of crossover. We experimented with different values of the crossover point location (fixed and random), however we did not observe any differences in the algorithm's convergence speed. We also investigated using a niching technique (Deterministic Crowding [55]) versus simply choosing the best individuals from the combined parent and child populations. The latter, simpler approach turned out to converge faster. The fitness in subsequent iterations with and without niching is presented in Fig. 1. The parameters of the evolutionary algorithm are listed in Table 3.
Parameter | Value
Population size | 128
Evolution iterations | 100
Crossover probability | 75%
Mutation probability | 1%
Crossover variant | Single point
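On the index-array representation, the mutation and one-point crossover operators described above are straightforward. A sketch (illustrative Python rather than our Torch7 code; the dataset size is the MNIST training set size used in our experiments):

```python
import random

DATASET_SIZE = 60000

def mutate(predictor, rng=random):
    """Replace a single, randomly chosen training-sample index
    with another random index from the full data set."""
    child = list(predictor)
    child[rng.randrange(len(child))] = rng.randrange(DATASET_SIZE)
    return child

def one_point_crossover(parent_a, parent_b, point=None, rng=random):
    """Exchange the tails of two index arrays at a fixed or random point."""
    if point is None:
        point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])
```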
Each predictor was evaluated using the same procedure:
1. A new instance of the neural network was instantiated.
2. The network was trained using the samples specified by the genotype of a given individual. We used the same training method as described earlier.
3. Network accuracy was evaluated using the full test dataset.
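The three steps above can be expressed as a single evaluation function. Here `build_network`, `train` and `accuracy` are placeholders standing in for the actual Torch7 routines used in the experiments:

```python
def evaluate_predictor(predictor, full_training_set, test_set,
                       build_network, train, accuracy):
    """Fitness of a predictor = test accuracy of a fresh network
    trained only on the samples the predictor selects."""
    network = build_network()                       # step 1: new instance
    subset = [full_training_set[i] for i in predictor]
    train(network, subset)                          # step 2: train on subset
    return accuracy(network, test_set)              # step 3: full test set
```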
The results of evolution with different sizes can be compared reliably thanks to the use of the same validation set for each fitness predictor size.
The maximum accuracy across the population of subsets in the subsequent evolution iterations is presented in Fig. 2. It is clear that as the size of the fitness predictor increases, the accuracy of the trained neural network is higher. Regardless of the size, the evolution was able to improve the prediction results. The effect is weaker for bigger fitness predictors; however, it is clear that using just a random subset might result in poor performance even if the model has an optimal architecture. The differences between the worst and best individuals are presented in Table 4.
Size | Min | Max | Max - Min | Full dataset - Max
100 | 9.86% | 77.76% | 67.90% | 21.45%
250 | 50.77% | 88.21% | 37.44% | 11.00%
500 | 44.89% | 94.24% | 49.35% | 4.95%
1000 | 84.86% | 96.44% | 11.58% | 2.77%
2000 | 94.23% | 97.56% | 3.33% | 1.65%
4000 | 95.94% | 98.28% | 2.34% | 0.93%
Based on these results we conclude that training with only a subset of the available training samples can be used to approximate the prediction accuracy obtained with the full training dataset. However, the subset has to be chosen very carefully: it is possible to pick a 250-element and a 1000-element dataset which will result in training models with similar recognition accuracy.
3.2 Time of processing
Cutting down the time of evaluating a single individual (a neural network) is the main objective of this research. In this section we compare the time of training our deep neural network with the full training set and with fitness predictors of different sizes.
Table 5 presents the average time of a single iteration (epoch) of learning for each fitness predictor size. We used the same test set to evaluate the model's accuracy in each case, therefore the table excludes the evaluation time completely, as it is invariant to the input size.
Fitness Predictor Size | Average training time (ms) | Standard deviation (ms)
100 | 100.37 | 9.94
250 | 245.12 | 12.15
500 | 492.63 | 17.26
1000 | 981.26 | 29.67
2000 | 1923.69 | 46.93
4000 | 3890.54 | 87.64
60000 | 57698.10 | 369.26
The times were measured on an Intel Xeon processor with 8 cores. For each fitness predictor size, a single epoch of training was executed 100 times.
The results show a linear relationship between the data set size and the training time: the time grows by roughly 1 ms per sample. It is clear that by limiting the number of samples a considerable amount of time can be saved. By connecting this relationship to the size-prediction accuracy dependence (Table 4) we obtain a tool which enables tuning the accuracy of the trained neural network (the approximated fitness) to a level where the processing time is acceptable. The only parameter which needs to change is the size of the fitness predictor: more elements result in improved accuracy at the cost of longer training time.
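Given the roughly linear relationship, the per-sample cost can be estimated directly from the measurements in Table 5 and used to predict the training time for an arbitrary predictor size. A back-of-the-envelope sketch (times in milliseconds, copied from the table; the fit through the origin is our simplification):

```python
# Average epoch times (ms) from Table 5, keyed by fitness-predictor size.
EPOCH_TIME_MS = {100: 100.37, 250: 245.12, 500: 492.63, 1000: 981.26,
                 2000: 1923.69, 4000: 3890.54, 60000: 57698.10}

def per_sample_cost_ms(timings=EPOCH_TIME_MS):
    """Least-squares slope through the origin: time ~ slope * size."""
    num = sum(size * t for size, t in timings.items())
    den = sum(size * size for size in timings)
    return num / den

def estimated_epoch_time_ms(size):
    """Predicted epoch time for a predictor of the given size."""
    return per_sample_cost_ms() * size
```

The slope comes out just below 1 ms per sample, consistent with the observation above.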
We acknowledge that the presented timing results should not be treated as a benchmark of Torch performance: they relate only to a single dataset and the system was not fine-tuned for peak performance.
3.3 Differentiating the fitness between individuals
One important feature of computing a fitness score is the possibility to compare different individuals according to specific criteria. In our case this means comparing different neural network architectures with respect to their capability to generalize knowledge from the provided training samples. Using only a subset of the original training set increases the pressure to generalize even more. All models start with the same subset of the training data set and are evaluated using the same test set. This means they are exposed in the same way to any flaws of the fitness predictors, e.g. a distribution of samples that is extremely biased towards one of the classification categories (in the case of the MNIST dataset, one of the digits). The fitness score might not create an objective (absolute) ordering of elements, but it provides a reliable subjective (relative) comparison within a population. Given that the fitness predictors improve over time, this enables improvement of the general population of networks.
4 Conclusions and further work
In this paper we formulated and discussed the concept of using subsets of training data sets as fitness predictors for deep neural networks. We analyzed whether this form meets the fitness predictor criteria. Further, we presented the results of an experiment which attempted to improve the quality of fitness predictors with an evolutionary algorithm. The classification accuracy of networks trained with fitness predictors of different sizes, combined with the time of computations, showed that accuracy can be traded off for a shorter processing time.
These results show that the proposed approach to fitness prediction of deep neural networks is a viable alternative to evaluation by training and verifying over the complete training and test sets. It does not replace the technique of using the largest available datasets to achieve high-quality results with the optimal network structure. However, it enables the optimization of time and resources where many instances of Deep Neural Networks are compared to each other, e.g. when using evolutionary algorithms to find the optimal number of layers.
Furthermore, the subsets of the training set found by the evolutionary algorithm can be reused by other researchers to speed up experimentation with different network structures. This will enable further improvements in the area of deep learning.
We plan to continue this research by implementing a coevolutionary algorithm which will adjust the architecture of a deep neural network to a specific problem. Fitness predictors will be used to significantly cut down the evaluation time of a single network. This will optimize resource usage and allow many more models to be evaluated. We also want to explore methods of improving the evolution process by introducing another population which could be used to tune the values of hyperparameters.
Acknowledgements
We would like to thank the PL Grid project for providing computational resources to carry out computational experiments.
Funding
This research is partly supported by AGH grant no. 11.11.230.124.
References
 [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NIPS2012), pages 1–9, 2012.
 [2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
 [3] Ronan Collobert and Jason Weston. A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning (ICML '08), 20(1):160–167, 2008.
 [4] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Contextdependent pretrained deep neural networks for largevocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 20(1):30–42, 2012.
 [5] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [6] Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. Twenty-first International Conference on Machine Learning (ICML '04), page 78, 2004.
 [7] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.
 [8] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using dropconnect. ICML, (1):109–111, 2013.
 [9] Aaron Courville, Yoshua Bengio, and Pascal Vincent. Why Does Unsupervised Pretraining Help Deep Learning? Journal of Machine Learning Research, 9(2007):201–208, 2010.
 [10] KyungHyun Cho, Tapani Raiko, and Alexander Ilin. Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines. Neural Computation, 25:805–831, 2013.
 [11] John R. Koza. Humancompetitive results produced by genetic programming. Genetic Programming and Evolvable Machines, 11(34):251–284, 2010.
 [12] Sunil Kr. Jha and Filip Josheski. Artificial evolution using neuroevolution of augmenting topologies (NEAT) for kinetics study in diverse viscous mediums. Neural Computing and Applications, 2016.
 [13] Yamina Mohamed Ben Ali. Advances in evolutionary feature selection neural networks with coevolution learning. Neural Computing and Applications, 17(3):217–226, 2008.
 [14] Gul Muhammad Khan and Faheem Zafari. Dynamic feedback neuroevolutionary networks for forecasting the highly fluctuating electrical loads. Genetic Programming and Evolvable Machines, 17(4):391–408, 2016.
 [15] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. ICML, 9:873–880, 2009.
 [16] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, and Quoc V Le. Large scale distributed deep networks. Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
 [17] Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building highlevel features using large scale unsupervised learning. International Conference in Machine Learning, page 38115, 2011.
 [18] Michael Schmidt and Hod Lipson. Coevolving Fitness Predictors for Accelerating and Reducing Evaluations. GPTP 2006, 1, 2006.
 [19] Michael D. Schmidt and Hod Lipson. Coevolution of fitness predictors. IEEE Transactions on Evolutionary Computation, 12(6):736–749, 2008.
 [20] Wlodzimierz Funika and Pawel Koperek. Genetic Programming in Automatic Discovery of Relationships in Computer System Monitoring Data, pages 371–380. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.
 [21] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
 [22] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
 [23] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386–408, 1958.
 [24] Kumpati S. Narendra and Mandayam A. L. Thathatchar. Learning automata – a survey. IEEE Transactions on Systems, Man, and Cybernetics, 4:323–334, 1974.
 [25] Paul J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8–4.9, NYC, pages 762–770, 1981.
 [26] Kunihiko Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position: Neocognitron. Trans. IECE, J62-A(10):658–665, 1979.
 [27] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
 [28] Marc Ranzato, Christopher Poultney, Sumit Chopra, and Yann Lecun. Efficient learning of sparse representations with an energybased model. In Advances in Neural Information Processing Systems (NIPS 2006), pages 1137–1144, 2006.
 [29] Juyang Weng, Narendra Ahuja, and Thomas S. Huang. Cresceptron: a self-organizing neural network which grows adaptively. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages 576–581. IEEE, 1992.
 [30] Viren Jain and Sebastian Seung. Natural image denoising with convolutional networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS) 21, pages 769–776. Curran Associates, Inc., 2009.
 [31] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German traffic sign recognition benchmark: A multiclass classification competition. In International Joint Conference on Neural Networks (IJCNN 2011), pages 1453–1460. IEEE Press, 2011.
 [32] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
 [33] Dan C. Ciresan, Ueli Meier, Luca M. Gambardella, and Jürgen Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
 [34] Dan C. Ciresan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages 1918–1921, 2011.
 [35] David J. Montana and Lawrence Davis. Training Feedforward Neural Networks Using Genetic Algorithms. In Proceedings of the 11th International Joint Conference on Artificial Intelligence – Volume 1, pages 762–767, 1989.
 [36] Martin Mandischer. Evolving recurrent neural networks with non-binary encoding. In IEEE International Conference on Evolutionary Computation, volume 2, pages 584–589, November 1995.
 [37] Hiroaki Kitano. Designing neural networks using genetic algorithms with graph generation system. Complex Systems Journal, 4(4):461–476, 1990.
 [38] Kenneth O. Stanley and Risto Miikkulainen. Efficient Reinforcement Learning Through Evolving Neural Network Topologies. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pages 569–577, 2002.
 [39] Xinjian Qiang, Guojian Cheng, and Zheng Wang. An overview of some classical growing neural networks and new developments. In 2010 2nd International Conference on Education Technology and Computer, volume 3, pages V3-351–V3-355, June 2010.
 [40] Nils T. Siebel, Jonas Bötel, and Gerald Sommer. Efficient neural network pruning during neuroevolution. Proceedings of the International Joint Conference on Neural Networks, pages 2920–2927, 2009.
 [41] Kenneth O. Stanley, David B. D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.
 [42] Maryam M. Khan, Gul M. Khan, and Julian F. Miller. Evolution of neural networks using Cartesian Genetic Programming. IEEE Congress on Evolutionary Computation, pages 1–8, 2010.
 [43] Peter J. Angeline, Gregory M. Saunders, and Jordan B. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54–65, 1994.
 [44] Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
 [45] Phillip Verbancsics and Josh Harguess. Generative NeuroEvolution for Deep Learning. arXiv:1312.5355 [cs], pages 1–9, 2013.
 [46] Omid E. David and Iddo Greental. Genetic Algorithms for Evolving Deep Neural Networks. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2014) Companion, pages 1451–1452, 2014.
 [47] Sreenivas S. Tirumala. Implementation of evolutionary algorithms for deep architectures. CEUR Workshop Proceedings, 1315:164–171, 2014.
 [48] Josh C. Bongard and Hod Lipson. Nonlinear System Identification Using Coevolution of Models and Tests. IEEE Transactions on Evolutionary Computation, 9(4):361–384, 2005.
 [49] Björn Olsson. Co-evolutionary search in asymmetric spaces. Information Sciences, 133(3–4):103–125, 2001.
 [50] Yaochu Jin. A Comprehensive Survey of Fitness Approximation in Evolutionary Computation. Soft Computing, 9(1):3–12, 2005.
 [51] Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
 [52] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
 [53] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, pages 1–6, 2011.
 [54] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. In Proceedings of AISTATS, volume 15, pages 315–323, 2011.
 [55] Ole J Mengshoel and David E Goldberg. The crowding approach to niching in genetic algorithms. Evolutionary computation, 16(3):315–354, 2008.