Evolving Loss Functions with Multivariate Taylor Polynomial Parameterizations
Abstract
Loss function optimization for neural networks has recently emerged as a new direction for metalearning, with Genetic Loss Optimization (GLO) providing a general approach for the discovery and optimization of such functions. GLO represents loss functions as trees that are evolved and further optimized using evolutionary strategies. However, searching in this space is difficult because most candidates are not valid loss functions. In this paper, a new technique, Multivariate Taylor expansionbased genetic lossfunction optimization (TaylorGLO), is introduced to solve this problem. It represents functions using a novel parameterization based on Taylor expansions, making the search more effective. TaylorGLO is able to find new loss functions that outperform those found by GLO in many fewer generations, demonstrating that loss function optimization is a productive avenue for metalearning.
I Introduction
As deep learning systems have become more complex, their architectures and hyperparameters have become increasingly difficult and timeconsuming to optimize by hand. In fact, many good designs may be overlooked by humans with prior biases. Therefore, automating this process, known as metalearning, has become an essential part of the modern machine learning toolbox. Metalearning approaches this problem from numerous angles, both by optimizing different aspects of the architecture from hyperparameters [1] to topologies [2], and by using different approaches from Bayesian optimization to evolutionary computation [3].
Recently, lossfunction discovery and optimization has emerged as a new type of metalearning. It aims to tackle neural network’s root training goal, by discovering better ways to define what is being optimized. In doing so, it makes it possible to regularize the solutions automatically. Genetic Loss Optimization (GLO) [4] provided an initial implementation of this idea using a combination of genetic programming and evolutionary strategies.
However, loss functions can be challenging to represent in an optimization system because they have a discrete nested structure as well as continuous coefficients. GLO tackled this problem by discovering and optimizing loss functions in two separate steps: evolution of structure and optimization of coefficients. Such separate processes make it challenging to find a mutually optimal structure and coefficients. Furthermore, the structured search space is difficult since small changes to the genotype do not always result in small changes in the phenotype and can easily make a function invalid.
In an ideal case, loss functions would be smoothly mapped into arbitrarily long, fixedlength vectors in a Hilbert space. This mapping should be smooth, wellbehaved, welldefined, incorporate both a function’s structure and coefficients, and should by its very nature make large classes of infeasible loss functions a mathematical impossibility.
This paper introduces such an approach: Multivariate Taylor expansionbased genetic lossfunction optimization (TaylorGLO). By using a novel parameterization for loss functions, the key pieces of information that affect a loss function’s behavior can be compactly represented in a vector. Such vectors can then be optimized for a specific task using a CovarianceMatrix Adaptation Evolutionary Strategy (CMAES). Select techniques can be used to narrow down the search space and speed up evolution.
Loss functions discovered by TaylorGLO outperform the standard crossentropy loss (or log loss) on the MNIST dataset as well as the Baikal loss, discovered by the original GLO technique, with significantly fewer evaluations than GLO. Like Baikal, TaylorGLO loss functions have good performance characteristics on reduced training datasets.
The following section reviews literature in metalearning and loss function optimization, including the original GLO technique, and provides a background on multivariate Taylor approximations. A description of how such Taylor approximations are leveraged in a new loss function parameterization follows, and the TaylorGLO approach is described. TaylorGLO’s efficacy is then evaluated on the MNIST image classification task.
Ii Related Work
Designing performant deep neural networks involve a highlevel of manual tuning of both the architecture and hyperparameters. The field of metalearning was developed to tackle this issue algorithmically [1]. Significant work exists in neural network metalearning using a variety of approaches [3], with recent work tackling, for example, the evolution of entire architectures [2]. It is therefore compelling to apply metalearning to the design of loss functions as well.
Iia Loss Function Metalearning
Deep neural networks are traditionally trained iteratively, whereby model parameters (i.e., weights and biases) are updated using gradients propagated backwards through the network, starting from an error given by a loss function [5]. Loss functions represent the primary training objective of a neural network. In many tasks, such as classification and language modeling, the crossentropy loss (also known as the log loss) is used almost exclusively. While in some approaches a regularization term (e.g. weight regularization [6]) is added to the the loss function definition, the core component is still the crossentropy loss. The crossentropy loss is motivated by information theory; it aims to minimize the number of bits needed to identify a message from the true distribution, using a code from the predicted distribution.
Other types of tasks that do not fit neatly into a singlelabel classification framework often have different taskspecific loss functions [7, 8, 9, 10, 11]. This observation suggests that it is useful to change the loss function appropriately. Indeed, in statistics it is known that different regression loss functions have unique properties, such as the Huber Loss [12] which is resilient to outliers compared to other loss functions. Every instance where a loss function is chosen for a task without a specific justification is an opportunity to find a more optimal loss function automatically.
Genetic Loss Optimization (GLO) [4] provided an initial study into metalearning of loss functions. GLO is based on a twophase approach that (1) evolves a function structure using a tree representation, and (2) optimizes a structure’s coefficients using an evolutionary strategy. GLO was able to discover Baikal, a new loss function that outperformed the crossentropy loss on tested image classification tasks, and made it possible to learn with smaller datasets. Baikal appeared to have these characteristics due to an implicit regularization effect that GLO discovered.
Treebased representations, as used in GLO, have been dominant in genetic programming because they are flexible and can be applied to a variety of function evolution domains. For example, this genetic programming succeeded in discovering nontrivial mathematical formulas that underly physical systems using noisy experimental data [13].
However, due to the twostep approach that GLO takes, GLO is unable to discover a function with mutually beneficial structure and coefficients. Additionally, the majority of loss function candidates in GLO are not useful, since, for example, many of them have discontinuities. Further mutations to treebased representations can have disproportionate effects on the functions they represent. GLO’s search is thus inefficient, requiring large populations that are evolved for many generations.
The technique presented in this paper, TaylorGLO, aims to solve the representation problems by using evolutionary strategies and a novel loss function parameterization based on multivariate Taylor expansions. Furthermore, since such representations are continuous, the approach can take advantage of CMAES [14] as the search method, resulting in faster search.
IiB Multivariate Taylor Expansions
Taylor expansions [15] are a wellknown function approximator that can represent differentiable functions within the neighborhood of a point using a polynomial series. Below, the common univariate Taylor expansion formulation is presented, followed by a natural extension to arbitrarilymultivariate functions.
Given a smooth (i.e., first through derivatives are continuous), realvalued function, , a thorder Taylor approximation at point , , where , can be constructed as,
(1) 
Conventional, univariate Taylor expansions have a natural extension to arbitrarily highdimensional inputs of . Given a smooth, realvalued function, , a thorder Taylor approximation at point , , where , can be constructed. The stricter smoothness constraint compared to the univariate case allows for the application of Schwarz’s theorem on equality of mixed partials, obviating the need to take the order of partial differentiation into account.
Let us define an thdegree multiindex, , where , , . , and . Multivariate partial derivatives can be concisely written using a multiindex
(2) 
Thus, discounting the remainder term, the multivariate Taylor expansion for at is
(3) 
The unique partial derivatives in and are parameters for a th order Taylor expansion. Thus, a th order Taylor expansion of a function in variables requires parameters to define the center, , and one parameter for each unique multiindex , where . That is:
(4) 
The multivariate Taylor expansion can be leveraged for a novel lossfunction parameterization. It enables TaylorGLO, a way to efficiently optimize loss functions, as will be described in subsequent sections.
Iii Loss functions as multivariate Taylor expansions
Let an class classification loss function be defined as
(5) 
The function can be replaced by its thorder, bivariate Taylor expansion, . More sophisticated loss functions can be supported by having more input variables, beyond and , such as a time variable or unscaled logits. This approach can be useful, for example, to evolve loss functions that change as training progresses.
For example, a loss function in and has the following 3rdorder parameterization with parameters (where ):
(6)  
Notably, the reciprocalfactorial coefficients can be integrated to be a part of the parameter set by direct multiplication if desired.
As will be shown in this paper, the technique makes it possible to train neural networks that are more accurate and learn faster, than those with treebased loss function representations. Representing loss functions in this manner confers several useful properties:

Guarantees smooth functions;

Functions do not have poles (i.e., discontinuities going to infinity or negative infinity) within their relevant domain;

Can be implemented purely as compositions of addition and multiplication operations;

Can be trivially differentiated;

Nearby points in the search space yield similar results (i.e., the search space is locally smooth), making the fitness landscape easier to search;

Valid loss functions can be found in fewer generations and with higher frequency;

Loss function discovery is consistent and does not depend as much on a specific initial population; and

The search space has a tunable complexity parameter (i.e., the order of the expansion).
These properties are not necessarily held by alternative function approximators. For instance:
 Fourier series

are well suited for approximating periodic functions [16], therefore, they are not as well suited for loss functions, whose local behavior within a narrow domain is important. Being a composition of waves, Fourier series tend to have many critical points within the domain of interest. Gradients fluctuate around such points, making gradient descent infeasible. Additionally, close approximations require a large number of terms, which in itself can be injurious, causing large, highfrequency fluctuations, known as “ringing”, due to Gibb’s phenomenon [17].
 Padé approximants

can be more accurate approximations than Taylor expansions; indeed, Taylor expansions are a special case of Padé approximants where [18]. However, unfortunately, Padé approximants can model functions with one or more poles, which valid loss functions typically should not have. These problems still exist, and are exacerbated, for Chisholm approximants [19] (a bivariate extension) and Canterbury approximants [20] (a multivariate generalization).
 Laurent polynomials

can represent functions with discontinuities, the simplest being . While Laurent polynomials provide a generalization of Taylor expansions into negative exponents, the extension is not useful because it results in the same issues as Padé approximants.
 Polyharmonic splines

seem like an excellent fit since they can represent continuous functions within a finite domain. However, the number of parameters is prohibitive in multivariate cases.
The multivariate Taylor expansion is therefore a better choice than the alternatives. It makes it possible to optimize loss functions efficiently in TaylorGLO, as will be described next.
Iv The TaylorGLO Approach
TaylorGLO (Figure 1) aims to find the optimal parameters for a loss function parameterized as a multivariate Taylor expansion, as described in Section III. The parameters for a Taylor approximation (i.e., the center point and partial derivatives) are referred to as . , . TaylorGLO strives to find the vector that parameterizes the optimal loss function for a task. Because the values are continuous, as opposed to discrete graphs of the original GLO, it is possible to use continuous optimization methods.
In particular, Covariance Matrix Adaptation Evolutionary Strategy (CMAES) [14] is a popular populationbased, blackbox optimization technique for rugged, continuous spaces. CMAES functions by maintaining a covariance matrix around a mean point that represents a distribution of solutions. At each generation, CMAES adapts the distribution to better fit evaluated objective values from sampled individuals. In this manner, the area in the search space which is being sampled at each step dynamically grows, shrinks, and moves as needed to maximize sampled candidates’ fitnesses. The specific variant of CMAES that TaylorGLO uses is ()CMAES [21], which incorporates weighted rank updates [22] to reduce the number of objective function evaluations that are needed.
In TaylorGLO, CMAES is used to find try to find . At each generation, CMAES samples points in whose fitness is determined; this is accomplished by training a model with the corresponding loss function and evaluating the model on a validation dataset. Fitness evaluations may be distributed across multiple machines in parallel. An initial vector of is chosen as a starting point in the search space to avoid bias.
Note that fully training a model can prove to be prohibitively expensive for all but the simplest problems. Fundamentally, there is a positive correlation between performance near the beginning of training and at the end of training. In order to identify the most promising candidates, it is enough to train the models only partially. This type of approximate evaluation is widely done in the field [23, 24]. An additional positive effect is that evaluation then favors loss functions that learn more quickly.
For a loss function to be useful, it must have a derivative that depends on the prediction. Therefore, internal terms that do not contribute to can be trimmed away. This implies that any term, within , where , can be replaced with .
For example, this refinement simplifies Equation 6 to:
(7)  
providing a reduction in the number of parameters from twelve to eight.
V Experimental Setup
This section presents the experimental setup that was used to evaluate the TaylorGLO technique. The MNIST image classification task was used as to measure the technique’s efficacy, and provide a point of comparison against the original GLO technique and the standard crossentropy loss function.
Va Domain
The domain used for evaluation was the MNIST Handwritten Digits [25], a widely used dataset where the goal is to classify pixel images as one of ten digits. MNIST has 55,000 training samples, 5,000 validation samples, and 10,000 testing samples. The dataset is well understood and relatively quick to train. Being a classification problem, MNIST training is traditionally framed with the standard crossentropy loss:
(8) 
where is sampled from the true distribution, is from the predicted distribution, and is the number of classes. The crossentropy loss is used as a baseline in this paper’s experiments.
The same CNN architecture presented in the GLO study [4] is used to provide a direct point of comparison. The model architecture is standard, consisting of convolution and pooling layers. Notably, this architecture contains a dropout layer [26] to explicitly provide regularization. Additionally, as in GLO’s experimentation methodology, training is based on stochastic gradient descent (SGD) with a batch size of 100, a learning rate of 0.01, and, unless otherwise specified, occurred over 20,000 steps.
VB TaylorGLO
A CMAES was set to have a population size and an initial step size . These values were found to work well experimentally. The candidates were thirdorder (i.e., ) TaylorGLO loss functions (Equation 7). Such functions were found experimentally to have a better tradeoff between evolution time and performance compared to second and fourthorder TaylorGLO loss functions (although the differences were relatively).
During candidate evaluation, models were trained for 10% of a full training run. On MNIST, this equates to 2,000 steps (i.e., 4 epochs).
VC Implementation Details
Due to the number of partial training sessions that are needed to evaluate TaylorGLO loss function candidates, training was distributed across the network to a cluster, composed of dedicated machines with NVIDIA GeForce GTX 1080Ti GPUs. Training itself was implemented with TensorFlow [27] in Python. The primary components of TaylorGLO (i.e., the genetic algorithm and CMAES) were implemented in the Swift programming language which allows for easy parallelization. These components run centrally on one machine and asynchronously dispatch work to the cluster.
Vi Results
The subsequent subsections present experimental results that show both how the TaylorGLO evolution process functions, and how the loss functions from TaylorGLO can be used as highperformance, dropin replacements for the crossentropy loss function.
Via Baseline
On MNIST, the average testing accuracy for models trained with the crossentropy loss was 0.9903, with a standard deviation of 0.0005 (). At 10% through the training process, corresponding to the amount of training undergone to evaluate TaylorGLO candidates, the average testing accuracy was 0.9656 with a standard deviation of 0.0015 ().
ViB TaylorGLO Evolution
Figure 2 shows an overview of the evolution process over 60 generations on the MNIST dataset, more than sufficient to reach convergence. TaylorGLO is able to quickly discover highlyperforming loss functions, with all improvements being discovered within 20 generations. Generations’ average validation accuracy approaches generations’ best accuracy as evolution progresses, indicating that populations as a whole are improving. Notably, every single partial training session completed successfully without any instability, a stark contrast to GLO, whose unbounded search space often results in pathological functions.
Figure 3 shows the shapes and parameters of each generation’s highestscoring loss function. They are plotted as if they were being used for binary classification, using the procedure described in the GLO study [4]. The functions are colored according to their generation. Additionally, plotted loss function curves are vertically shifted such that their loss at is zero; the raw value of a loss function is not relevant, the derivative, however, is. The functions have a distinct pattern through the evolution process, where early generations show a wider variety of shapes that converge in later generations towards curves with a shallow minimum around (the best loss function found on MNIST—described below—specifically had a minimum at ). Most notably, these well performing loss functions are not monotonically decreasing as the crossentropy loss is. They each have a local minima near, but not at, , after which the loss increases again as approaches 1. This shape provides an implicit regularization effect in the same manner that the Baikal loss function does [4]: it discourages overfitting by preventing the model from being too confident in its predictions.
This structure becomes quite evident when dimensionality reduction using tSNE [28] is performed on every candidate loss function within a run of TaylorGLO evolution, as illustrated in Figure 4. Points on the plot quickly migrate and spread in initial generations as CMAES grows the parameter space that is being explored (demonstrating resilience to the initial step size ). This pattern concurs with the spreading that occurs in early generations in the bottom plot of Figure 3. The points gradually migrate across the tSNE space and finally coalesce in the smaller region of red points.
The best loss function obtained from running TaylorGLO on MNIST was found in generation 74. This function, with parameters , , , , , , , (rounded to four decimalplaces), achieved a 2kstep validation accuracy of 0.9950 on its single evaluation, higher than 0.9903 for the cross entropy loss. This loss function was a modest improvement over the previous best loss function from generation 16, which had a validation accuracy of 0.9958.
Over 10 fullytrained models, the best TaylorGLO loss function achieved a mean testing accuracy of 0.9951, with a standard deviation of 0.0005, while the crossentropy loss only reached 0.9899, and the BaikalCMA loss function from the original GLO [4] technique reached 0.9947. These figures are shown in Figure 5. The testing accuracy improvement the TaylorGLO loss function confers over the crossentropy loss is highly statistically significant, with a value of , in a heteroscedastic, twotailed Ttest, with 10 samples from each distribution. Using the same type of Ttest, the TaylorGLO loss function’s improvement over BaikalCMA has a value of 0.0625.
Notably, the top loss function from TaylorGLO on MNIST slightly outperforms the top loss function from the original GLO technique, while requiring significantly fewer generations and partial training sessions. BaikalCMA required 11,120 partial evaluations (i.e., 100 individuals over 100 GP generations plus 32 individuals over 35 CMAES generations, ignoring evaluations subsequent to the discovery of BaikalCMA), while the top TaylorGLO loss function only required 448 partial evaluations (that is, as many in this case). Thus, TaylorGLO is a practical replacement for GLO which can achieve comparable results with significantly fewer evaluations.
ViC Performance on Reduced Datasets
The best TaylorGLO loss function on MNIST has significantly better performance characteristics than the crossentropy loss function when applied on reduced datasets. This property appears to be a characteristic of optimized loss functions [4]. Figure 6 presents a comparison of accuracies for models trained on different portions of the MNIST dataset. Overall, TaylorGLO significantly outperforms the crossentropy loss. All models were trained for the same number of steps (i.e., 20,000).
ViD Evaluation Length Sensitivity
200step
TaylorGLO is surprisingly resilient when evaluations during evolution are shortened to 200 steps (i.e., 0.4 epochs) of training. With so little training, returned accuracies are noisy and dependent on each individual network’s particular random initialization. On a 60generation run with 200step evaluations, the best evolved loss function had a mean testing accuracy of 0.9946 across ten samples, with a standard deviation of 0.0016. While slightly lower, and significantly more variable, than the accuracy for the best loss function that was found on the main 2,000step run, the accuracy is still significantly higher than that of the crossentropy baseline, with a value of . This loss function was discovered in generation 31, requiring 1,388.8 2,000stepequivalent partial evaluations. That is, evolution with 200step partial evaluations is over threetimes less sample sample efficient than evolution with 2,000step partial evaluations.
20,000step
On the other extreme, where evaluations consist of the same number of steps as a full training session, one would expect better loss functions to be discovered, and more reliably, because the fitness estimates are less noisy. Surprisingly, that is not the case: The best loss function had a mean testing accuracy of 0.9945 across ten samples, with a standard deviation of 0.0015. While also slightly lower, and also significantly more variable, than the accuracy for the best loss function that was found on the main 2,000step run, the accuracy is significantly higher than the crossentropy baseline, with a value of . This loss function was discovered in generation 45, requiring 12,600 2,000stepequivalent partial evaluations; over 28times less sample sample efficient as evolution with 2,000step partial evaluations.
These results thus suggest that there is an optimal way to evaluate candidates during evolution, resulting in lower computational cost and better loss functions. Notably, the best evolved loss functions from all three runs (i.e., 200, 2,000, and 20,000step) have similar shapes, reinforcing the idea that partialevaluations can provide useful performance estimates.
Vii Discussion and Future Work
TaylorGLO’s sample efficiency compared to GLO allows it to scale to larger networks and more complex datasets. In preliminary experiments TaylorGLO was applied to the more challenging CIFAR10 dataset [29] on a ResNet model [30] with five residual blocks, commonly known as ResNet32, using standard hyperparameters. This setup has been heavily engineered and manually tuned by the research community, and the architecture is also narrow and deep. TaylorGLO was able to find a loss function on generation 12 with very similar performance characteristics to the crossentropy loss, without needing any prior human knowledge. Over ten runs each, the TaylorGLO loss function reached a mean testing accuracy of 0.9115, compared to 0.9105 for crossentropy, and a standard deviation of 0.0032, compared to 0.0033. While it may be possible to improve upon this result, it is also possible that loss function optimization is less effective on architectures like ResNet that are narrow and deep. An important direction of future work is therefore to evolve both loss functions and architectures together, taking advantage of possible synergies between them.
Function evolution is a wide field. Genetic programming (GP) approaches are a dominant technique in it, as they have been in prior loss function metalearning work [4]. GP systems are often able to discover solutions to problems that outperform humanbuilt solutions [31]. Early work aimed at developing pattern recognition systems using rulesets represented as trees [32] and has grown to be a flexible framework that can be applied to different domains. However, while GP is very flexible, it can be overly so. Thoughtfullydesigned search spaces for continuous optimization can lead to better results. TaylorGLO was developed by working backwards from the desiderata of a loss function evolution system, leading to loss functions that can outperform those found by GLO in a more sample efficient manner.
Future work will involve leveraging additional input variables in TaylorGLO loss functions, such as the percentage of training steps completed. TaylorGLO may then find loss functions that are best suited for different points in training, where, for example, different kinds of regularization work best [33]. Unintuitive changes to the training process, such as cycling learning rates [34], have been able to improve model performance; evolution could be a useful way to discover similar techniques.
The proper choice of loss function may depend on other types of state as well. For example, batch statistics could help evolve loss functions that are more welltuned to each batch; intermediate network activations could expose information that may help tune the function for deeper networks like ResNet; deeper information about the characteristics of a model’s weights and gradients, such as that from spectral decomposition of the Hessian matrix [35], could assist the evolution of loss functions that are able to adapt to the current fitness landscape.
Viii Conclusion
This paper proposes multivariate Taylor expansionbased genetic lossfunction optimization (TaylorGLO), a new technique for lossfunction metalearning. TaylorGLO leverages a novel parameterization for loss functions that is designed to have desirable properties and allows for a continuous optimization approach to be used, instead of genetic programming. TaylorGLO loss functions are able to significantly outperform the crossentropy loss on MNIST, and provide a slight improvement over the original GLO technique while requiring many fewer candidates to be evaluated. Analyses suggest that TaylorGLO is a mature technique that can provide customized loss functions that result in higher testing accuracies and better data utilization.
References
 J. Schmidhuber, “Evolutionary principles in selfreferential learning, or on learning how to learn: the metameta… hook,” Ph.D. dissertation, Technische Universität München, 1987.
 R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 2019, pp. 293–312.
 C. Lemke, M. Budka, and B. Gabrys, “Metalearning: a survey of trends and technologies,” Artificial Intelligence Review, vol. 44, no. 1, pp. 117–130, 2015.
 S. Gonzalez and R. Miikkulainen, “Improved training speed, accuracy, and data utilization through loss function optimization,” arXiv preprint arXiv:1905.11528, 2019.
 D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
 A. N. Tikhonov, “Solution of incorrectly formulated problems and the regularization method,” in Proceedings of the USSR Academy of Sciences, vol. 4, 1963, pp. 1035–1038.
 S. Gonzalez, J. Landgraf, and R. Miikkulainen, “Faster training by selecting samples using embeddings,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
 R. Gao and K. Grauman, “2.5D visual sound,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 324–333.
 D. Kingma and M. Welling, “Autoencoding variational Bayes,” in Proceedings of the Second International Conference on Learning Representations (ICLR), 12 2014.
 Y. Zhou, C. Liu, and Y. Pan, “Modelling sentence pairs with treestructured attentive encoder,” arXiv preprint arXiv:1610.02806, 2016.
 H. Dong, S. Yu, C. Wu, and Y. Guo, “Semantic image synthesis via adversarial learning,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5706–5714.
 P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, pp. 73–101, 1964.
 M. Schmidt and H. Lipson, “Distilling freeform natural laws from experimental data,” Science, vol. 324, no. 5923, pp. 81–85, 2009. [Online]. Available: http://science.sciencemag.org/content/324/5923/81
 N. Hansen and A. Ostermeier, “Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation,” in Proceedings of IEEE international conference on evolutionary computation. IEEE, 1996, pp. 312–317.
 B. Taylor, Methodus incrementorum directa & inversa. Auctore Brook Taylor, LL. D. & Regiae Societatis Secretario. typis Pearsonianis: prostant apud Gul. Innys ad Insignia Principis in â¦, 1715.
 J. B. Fourier, “La théorie analytique de la chaleur,” Mémoires de l’Académie Royale des Sciences de l’Institut de France, vol. 8, pp. 581–622, 1829.
 H. Wilbraham, “On a certain periodic function,” The Cambridge and Dublin Mathematical Journal, vol. 3, pp. 198–201, 1848.
 P. GravesMorris, “The numerical calculation of Padé approximants,” in Padé approximation and its applications. Springer, 1979, pp. 231–245.
 J. Chisholm, “Rational approximants defined from double power series,” Mathematics of Computation, vol. 27, no. 124, pp. 841–848, 1973.
 P. GravesMorris and D. Roberts, “Calculation of Canterbury approximants,” Computer Physics Communications, vol. 10, no. 4, pp. 234–244, 1975.
 N. Hansen and A. Ostermeier, “Completely derandomized selfadaptation in evolution strategies,” Evolutionary computation, vol. 9, no. 2, pp. 159–195, 2001.
 N. Hansen and S. Kern, “Evaluating the CMA evolution strategy on multimodal test functions,” in International Conference on Parallel Problem Solving from Nature. Springer, 2004, pp. 282–291.
 J. J. Grefenstette and J. M. Fitzpatrick, “Genetic search with approximate function evaluations,” in Proceedings of an International Conference on Genetic Algorithms and Their Applications, 1985, pp. 112–120.
 Y. Jin, “Surrogateassisted evolutionary computation: Recent advances and future challenges,” Swarm and Evolutionary Computation, vol. 1, pp. 61–70, 06 2011.
 Y. LeCun, C. Cortes, and C. Burges, “The MNIST dataset of handwritten digits,” 1998.
 G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing coadaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
 M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for largescale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/conference/osdi16/technicalsessions/presentation/abadi
 L. v. d. Maaten and G. Hinton, “Visualizing data using tSNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
 J. R. Koza, “Humancompetitive results produced by genetic programming,” Genetic Programming and Evolvable Machines, vol. 11, no. 34, pp. 251–284, 2010.
 R. Forsyth, “BEAGLE â a Darwinian approach to pattern recognition,” Kybernetes, vol. 10, no. 3, pp. 159–166, 1981.
 A. S. Golatkar, A. Achille, and S. Soatto, “Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence,” in Advances in Neural Information Processing Systems 32, 2019, pp. 10 677–10 687.
 L. N. Smith, “Cyclical learning rates for training neural networks,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017, pp. 464–472.
 L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, “Empirical analysis of the Hessian of overparametrized neural networks,” arXiv preprint arXiv:1706.04454, 2017.