# Guided evolutionary strategies: escaping the curse of dimensionality in random search

###### Abstract

Many applications in machine learning require optimizing a function whose true gradient is unknown, but where surrogate gradient information (directions that may be correlated with, but not necessarily identical to, the true gradient) is available instead. This arises when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in certain reinforcement learning applications, or when using synthetic gradients). We propose Guided Evolutionary Strategies, a method for optimally using surrogate gradient directions along with random search. We define a search distribution for evolutionary strategies that is elongated along a guiding subspace spanned by the surrogate gradients. This allows us to estimate a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace, and we use this to derive a setting of the hyperparameters that works well across problems. Finally, we apply our method to example problems including truncated unrolled optimization and a synthetic gradient problem, demonstrating improvement over both standard evolutionary strategies and first-order methods that directly follow the surrogate gradient. We provide a demo of Guided ES at: github.com/brain-research/guided-evolutionary-strategies.


Niru Maheswaranathan (Google Brain, nirum@google.com), Luke Metz (Google Brain, lmetz@google.com), George Tucker (Google Brain, gjt@google.com), Jascha Sohl-Dickstein (Google Brain, jaschasd@google.com)

Preprint. Work in progress.

## 1 Introduction

Optimization of machine learning models often involves minimizing a cost function where the gradient of the cost with respect to the model parameters is known. When gradient information is available, first-order methods such as gradient descent and variants are popular due to their ease of implementation, memory efficiency (typically requiring storage on the order of the parameter dimension), and convergence guarantees [1]. When gradient information is not available, however, we turn to zeroth-order optimization methods, including random search methods such as the recently re-popularized evolutionary strategies [2, 3, 4].

However, what if only partial gradient information is available? That is, what if one has access to surrogate gradients that are correlated with the true gradient, but may be biased in some unknown fashion? This situation arises across a variety of problems in machine learning. For example, in unrolled optimization (computing gradients through an unrolled optimization process), Wu et al. [5] showed that the common practice of taking gradients with respect to a small number of unrolled steps was biased compared to computing the (costly) gradient after many unrolled steps. In other applications, the true gradients do not provide a learning signal, and we might use surrogate gradients as a proxy. For example, in quantization of neural networks, one wishes to train neural networks with discrete (or even binary) weights and/or activations. One approach is to use straight-through estimators [6] which generate a surrogate gradient by smoothing (or ignoring) nodes in the network that perform the quantization, and using that gradient to train the network. However, there is no guarantee that this direction is a good descent direction. Finally, surrogate gradients also arise in reinforcement learning algorithms including actor-critic methods [7] and Q-learning [8, 9, 10].

Naïvely, there are two extremal approaches to optimization with surrogate gradients. On one hand, you could ignore the surrogate gradient information entirely and perform zeroth-order optimization, using methods such as evolutionary strategies to estimate a descent direction. These methods exhibit poor convergence properties [11] when the parameter dimension is large. On the other hand, you could directly feed the surrogate gradients to a first-order optimization algorithm. However, bias in the surrogate gradients will interfere with optimizing the target problem [12]. Ideally, we would like a method that combines the complementary strengths of these two approaches: we would like to combine the unbiased descent direction estimated with evolutionary strategies with the low-variance estimate given by the surrogate gradient. In this work, we propose a method for doing this called guided evolutionary strategies (Guided ES).

Our idea is to keep track of a low-dimensional subspace defined by the recent history of surrogate gradients during optimization (inspired by quasi-Newton methods) which we call the guiding subspace. We then perform a finite difference random search (as in evolutionary strategies) preferentially within this subspace. By concentrating our search samples in a low-dimensional subspace where the true gradient has non-negative support, we can dramatically reduce the variance of our search direction.

Our contributions in this work are:

For a demo of the method, please see:

https://github.com/brain-research/guided-evolutionary-strategies

## 2 Related Work

This work builds heavily upon a random search method known as evolutionary strategies [2, 3], or ES for short, which generates a descent direction via finite differences over random perturbations of the parameters. ES has enjoyed a resurgence in popularity in recent years [4, 13]. Our method can primarily be thought of as a modification to the standard ES algorithm, where we augment the search distribution using surrogate gradients.

Extensions of ES that modify the search distribution include using natural gradient updates in the search distribution [14] and constructing alternative (non-Gaussian) search distributions [15]. The idea of using gradients in concert with evolutionary algorithms was proposed by Lehman et al. [16], who use gradients of a network with respect to its inputs to augment ES.

Other methods for adapting the search distribution include covariance matrix adaptation ES (CMA-ES) [17], which uses the recent history of descent steps to adapt the distribution over parameters, or variational optimization [18], which optimizes the parameters of a probability distribution over model weights. Guided ES, by contrast, adapts the search distribution using surrogate gradient information. In addition, we never need to work with or compute a full covariance matrix.

The critical assumption underlying Guided ES is that we have access to surrogate gradient information, but not the true gradient. This scenario arises in a wide variety of machine learning problems, which typically fall into two categories: cases where the true gradient is unknown or not defined, and cases where the true gradient is hard or expensive to compute. Examples of the former include models with discrete stochastic variables (where straight-through estimators [6, 19] or Concrete/Gumbel-Softmax methods [20, 21] are commonly used), and learned models in reinforcement learning (e.g. for Q functions [10, 22] or value estimation [23]). For the latter, examples include optimization using truncated backprop through time [24, 25, 5]. The assumption of having surrogate gradients is also applicable in situations where the gradients are explicitly modified during training, as in feedback alignment [26] and related methods [27, 28].

## 3 Guided evolutionary strategies

### 3.1 Vanilla ES

We wish to minimize a function $f(x)$ over a parameter space in $n$ dimensions ($x \in \mathbb{R}^n$), where $\nabla f$ is either unavailable or uninformative. A popular approach is to estimate a descent direction with stochastic finite differences (commonly referred to as evolutionary strategies [2] or random search [29]). Here, we use antithetic sampling [30] (using a pair of function evaluations at $x + \epsilon$ and $x - \epsilon$) to reduce variance. This estimator is defined as:

$$g = \frac{\beta}{2\sigma^2 P} \sum_{i=1}^{P} \epsilon_i \left( f(x + \epsilon_i) - f(x - \epsilon_i) \right), \tag{1}$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$, and $P$ is the number of sample pairs. We will set $P$ to one for all experiments, and when analyzing optimal hyperparameters. The overall scale of the estimate ($\beta$) and variance of the perturbations ($\sigma^2$) are constants, to be chosen as hyperparameters of the algorithm. This estimate relies solely on computing function evaluations. However, it tends to have high variance, thus requiring a large number of samples to be practical, and scales poorly with the dimension $n$. We refer to this estimator as vanilla evolutionary strategies (or vanilla ES) in subsequent sections.
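A minimal NumPy sketch of the antithetic estimator in eq. (1), assuming perturbations drawn from $\mathcal{N}(0, \sigma^2 I)$, a scale `beta`, and `num_pairs` sample pairs. The function name and the default values are illustrative assumptions, not taken from the authors' demo.

```python
import numpy as np


def vanilla_es_grad(f, x, sigma=0.1, beta=1.0, num_pairs=1, rng=None):
    """Antithetic ES estimate of a descent direction for f at x.

    Computes beta / (2 sigma^2 P) * sum_i eps_i * (f(x + eps_i) - f(x - eps_i))
    with eps_i ~ N(0, sigma^2 I) and P = num_pairs.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = np.zeros_like(x, dtype=float)
    for _ in range(num_pairs):
        eps = sigma * rng.standard_normal(x.shape[0])
        total += eps * (f(x + eps) - f(x - eps))
    return beta * total / (2.0 * sigma**2 * num_pairs)


# Usage on f(x) = 0.5 ||x||^2, whose true gradient is x itself.
x0 = np.ones(5)
estimate = vanilla_es_grad(lambda v: 0.5 * v @ v, x0, num_pairs=20000,
                           rng=np.random.default_rng(0))
```

With a single pair the estimate is very noisy; averaging many pairs on this quadratic recovers `beta` times the gradient, which is why the number of samples matters so much for vanilla ES.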

### 3.2 Guided search

Even when we do not have access to $\nabla f$, we frequently have additional information about $f$, either from prior knowledge or gleaned from previous iterates during optimization. To formalize this, we assume we are given a set of vectors which may correspond to biased or corrupted gradients. That is, these vectors are correlated (but need not be perfectly aligned) with the true gradient. If we are given a single vector or surrogate gradient for a given parameter iterate, we can generate a subspace by keeping track of the previous $k$ surrogate gradients encountered during optimization. We use $U \in \mathbb{R}^{n \times k}$ to denote an orthonormal basis for the subspace spanned by these vectors (i.e., $U^T U = I_k$).

We propose to leverage this information by changing the distribution of $\epsilon_i$ in eq. (1) to $\mathcal{N}(0, \sigma^2 \Sigma)$ with

$$\Sigma = \frac{\alpha}{n} I_n + \frac{1 - \alpha}{k} U U^T,$$

where $k$ and $n$ are the subspace and parameter dimensions, respectively, and $\alpha$ is a hyperparameter that trades off variance between the full parameter space and the subspace. Setting $\alpha = 1$ recovers the vanilla ES estimator (and ignores the guiding subspace), but as we show, choosing $\alpha$ below one can result in significantly improved performance. The other hyperparameter is the scale $\beta$ in (1), which controls the size of the estimated descent direction. The parameter $\sigma^2$ controls the overall scale of the variance, and will drop out of the analysis of the bias and variance below, due to the $1/\sigma^2$ factor in (1). In practice, if $f(x)$ is stochastic, then increasing $\sigma^2$ will dampen noise in the gradient estimate, while decreasing $\sigma^2$ reduces the error induced by third and higher-order terms in the Taylor expansion of $f$ below. For an exploration of the effects of $\sigma^2$ on ES-style algorithms, see Lehman et al. [31].

Samples of $\epsilon_i$ can be generated efficiently as $\epsilon_i = \sigma \sqrt{\frac{\alpha}{n}}\, \epsilon + \sigma \sqrt{\frac{1 - \alpha}{k}}\, U \epsilon'$ where $\epsilon \sim \mathcal{N}(0, I_n)$ and $\epsilon' \sim \mathcal{N}(0, I_k)$. Our estimator requires $2P$ function evaluations in addition to the cost of computing the surrogate gradient. Furthermore, it may be possible to parallelize the forward pass computations. We found that in practice, using $P = 1$ was sufficient.
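The sampling scheme can be sketched in NumPy as follows, assuming the search covariance has the form $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} U U^T$, where $\alpha$ is the tradeoff hyperparameter, $n$ and $k$ are the parameter and subspace dimensions, and $U$ has orthonormal columns. The helper names and the defaults `alpha=0.5`, `beta=2.0` are illustrative assumptions.

```python
import numpy as np


def orthonormal_basis(surrogate_grads):
    """Return U (n x k) with orthonormal columns spanning the given n x k matrix."""
    U, _ = np.linalg.qr(surrogate_grads)  # reduced QR factorization
    return U


def sample_guided_perturbation(U, alpha, sigma, rng):
    """Draw eps ~ N(0, sigma^2 * (alpha/n * I + (1 - alpha)/k * U U^T))."""
    n, k = U.shape
    full = rng.standard_normal(n)  # isotropic component in the full space
    sub = rng.standard_normal(k)   # component inside the guiding subspace
    return sigma * (np.sqrt(alpha / n) * full
                    + np.sqrt((1.0 - alpha) / k) * (U @ sub))


def guided_es_grad(f, x, U, alpha=0.5, beta=2.0, sigma=0.1, num_pairs=1, rng=None):
    """Guided ES estimate using 2 * num_pairs evaluations of f."""
    rng = np.random.default_rng() if rng is None else rng
    total = np.zeros_like(x, dtype=float)
    for _ in range(num_pairs):
        eps = sample_guided_perturbation(U, alpha, sigma, rng)
        total += eps * (f(x + eps) - f(x - eps))
    return beta * total / (2.0 * sigma**2 * num_pairs)
```

Sampling this way only needs matrix-vector products with `U`, so the full $n \times n$ covariance matrix is never formed.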

Figure 1a depicts the geometry underlying our method. Instead of the true gradient (blue arrow), we are given a surrogate gradient (white arrow) which is correlated with the true gradient. We use this to form a guiding distribution (denoted with white contours) and use this to draw samples (white dots) which we use as part of a random search procedure. (Figure 1b demonstrates the performance of the method on a toy problem, and is discussed in §4.1.)

For the purposes of analysis, suppose $\nabla f$ exists. We can approximate the function in the local neighborhood of $x$ using a second order Taylor approximation: $f(x + \epsilon) \approx f(x) + \epsilon^T \nabla f(x) + \frac{1}{2} \epsilon^T \nabla^2 f(x)\, \epsilon$. For the remainder of §3, we take this second order Taylor expansion to be exact. By substituting this expression into (1), we see that our estimate $g$ is equal to

$$g = \frac{\beta}{\sigma^2 P} \sum_{i=1}^{P} \left( \epsilon_i \epsilon_i^T \right) \nabla f(x). \tag{2}$$

Note that even terms in the Taylor expansion cancel out in the expression for $g$ due to antithetic sampling. The computational and memory costs of using Guided ES to compute parameter updates, compared to standard (vanilla) ES and gradient descent are outlined in Appendix §D.
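The cancellation of even-order terms can be checked numerically. The sketch below uses a hypothetical quadratic test function (for which the second order expansion is exact) and confirms that the antithetic difference $f(x + \epsilon) - f(x - \epsilon)$ equals $2\, \epsilon^T \nabla f(x)$, with the constant and Hessian terms dropping out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
H = M.T @ M                       # symmetric Hessian of the test quadratic
c = rng.standard_normal(n)


def f(x):
    return 0.5 * x @ H @ x + c @ x


def grad_f(x):
    return H @ x + c


x = rng.standard_normal(n)
eps = 0.1 * rng.standard_normal(n)

# Antithetic difference versus the first-order term alone.
antithetic_diff = f(x + eps) - f(x - eps)
linear_term = 2.0 * eps @ grad_f(x)
```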

### 3.3 Tradeoff between variance and safe bias

As we have alluded to, there is a bias-variance tradeoff lurking within our estimate $g$. In particular, by emphasizing the search in the full space (i.e., choosing the tradeoff hyperparameter $\alpha$ close to 1), we reduce the bias in our estimate at the cost of increased variance. Emphasizing the search along the guiding subspace (i.e., choosing $\alpha$ close to 0) will induce a bias in exchange for a potentially large reduction in variance, especially if the subspace dimension $k$ is small relative to the parameter dimension $n$. Below, we analytically and numerically characterize this tradeoff.

#### 3.3.1 Guided ES estimator always provides a descent direction, so bias is “safe”

Importantly, regardless of the choice of $\alpha$ and $\beta$, the Guided ES estimator always provides a descent direction in expectation (i.e., taking a sufficiently small step in the negative direction, $-\mathbb{E}[g]$, is guaranteed to reduce the function value $f(x)$). This can be seen by observing that the mean of the estimator in eq. (2) is $\mathbb{E}[g] = \beta \Sigma \nabla f(x)$. Thus, $-\mathbb{E}[g]$ corresponds to the negative gradient multiplied by a positive semi-definite (PSD) matrix, which remains a descent direction. This desirable property of our estimator ensures that $\alpha$ trades off variance for “safe” bias. That is, the bias will never produce an ascent direction when we are trying to minimize $f$.
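The descent property can be illustrated numerically. This sketch assumes the search covariance $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} U U^T$ and a mean estimate $\beta \Sigma \nabla f(x)$; it checks on a random instance that $\Sigma$ has no negative eigenvalues and that the mean estimate has a non-negative inner product with the gradient. The dimensions and the value of `beta` are arbitrary illustrative choices.

```python
import numpy as np


def guided_covariance(U, alpha):
    """Sigma = alpha/n * I + (1 - alpha)/k * U U^T for orthonormal U (n x k)."""
    n, k = U.shape
    return (alpha / n) * np.eye(n) + ((1.0 - alpha) / k) * (U @ U.T)


rng = np.random.default_rng(0)
n, k, beta = 8, 2, 2.0
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
grad = rng.standard_normal(n)

checks = []
for alpha in (0.0, 0.5, 1.0):
    Sigma = guided_covariance(U, alpha)
    mean_g = beta * Sigma @ grad                      # expected estimate
    is_psd = np.linalg.eigvalsh(Sigma).min() >= -1e-12
    is_descent = grad @ mean_g >= 0.0                 # -mean_g does not point uphill
    checks.append((alpha, bool(is_psd), bool(is_descent)))
```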

#### 3.3.2 Alignment between gradient and guiding subspace

The alignment between the $k$-dimensional orthonormal guiding subspace ($U$) and the true gradient ($\nabla f(x)$) will be a key quantity for understanding the bias-variance tradeoff. We characterize this alignment using a $k$-dimensional vector of uncentered correlation coefficients $\rho$, whose elements are the correlation between the gradient and every column of $U$. That is, $\rho_i = \frac{\nabla f(x)^T U_{\cdot, i}}{\|\nabla f(x)\|_2 \, \|U_{\cdot, i}\|_2}$. The norm of this correlation, $\|\rho\|_2$, varies between zero (if the gradient is orthogonal to the subspace) and one (if the gradient is fully contained in the subspace).
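A short NumPy sketch of this alignment measure, assuming `U` has orthonormal columns so that each column norm is one. The function name and the test vectors are illustrative.

```python
import numpy as np


def subspace_alignment(U, grad):
    """Uncentered correlation between grad and each (unit-norm) column of U."""
    return U.T @ grad / np.linalg.norm(grad)


# Guiding subspace spanned by the first two coordinate axes of R^4.
U = np.eye(4)[:, :2]

rho_inside = subspace_alignment(U, np.array([3.0, 4.0, 0.0, 0.0]))   # gradient in the subspace
rho_outside = subspace_alignment(U, np.array([0.0, 0.0, 1.0, 0.0]))  # gradient orthogonal to it
```

Here `np.linalg.norm(rho_inside)` is one and `np.linalg.norm(rho_outside)` is zero, the two extremes of the alignment.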

#### 3.3.3 Safe bias in gradient estimate

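We can evaluate the squared norm of the bias of our estimate $g$ as

$$\|b\|_2^2 = \left\| \mathbb{E}[g] - \nabla f(x) \right\|_2^2 = \left\| (\beta \Sigma - I) \nabla f(x) \right\|_2^2. \tag{3}$$

We additionally define the normalized squared bias, $\tilde{b}$, as the squared norm of the bias divided by the squared norm of the true gradient (this quantity is independent of the overall scale of the gradient). Plugging in our estimate for $g$ from eq. (2) yields the following expression for the normalized squared bias (see Appendix §A.1 for derivation):

$$\tilde{b} = \left( \beta \frac{\alpha}{n} - 1 \right)^2 + \|\rho\|_2^2 \, \beta \left( \frac{1 - \alpha}{k} \right) \left( \beta \left( 2 \frac{\alpha}{n} + \frac{1 - \alpha}{k} \right) - 2 \right), \tag{4}$$

where again $\beta$ is a scale factor and $\alpha$ is part of the parameterization of the covariance matrix that trades off variance in the full parameter space for variance in the guiding subspace ($\Sigma = \frac{\alpha}{n} I_n + \frac{1 - \alpha}{k} U U^T$). We see that the normalized squared bias consists of two terms: the first is a contribution from the search in the full space and is thus independent of $\rho$, whereas the second depends on the squared norm of the uncentered correlation, $\|\rho\|_2^2$.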

#### 3.3.4 Variance in gradient estimate

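In addition to the bias, we are also interested in the variance of our estimate. We use total variance (i.e., $\operatorname{tr}(\operatorname{Var}(g))$) to quantify the variance of our estimator

$$\operatorname{tr}(\operatorname{Var}(g)) = \mathbb{E}[g^T g] - \mathbb{E}[g]^T \mathbb{E}[g] = \beta^2 \operatorname{tr}(\Sigma) \, \nabla f(x)^T \Sigma \nabla f(x) + \beta^2 \, \nabla f(x)^T \Sigma^2 \nabla f(x),$$

using an identity for the fourth moment of a Gaussian (see Appendix §A.2) and the fact that the trace is linear and invariant under cyclic permutations.

We are interested in the normalized variance, $\tilde{v}$, which we define as the quantity above divided by the squared norm of the gradient. Plugging in our estimate, and using $\operatorname{tr}(\Sigma) = 1$, yields the following expression for the normalized variance (see Appendix §A.2):

$$\tilde{v} = \beta^2 \left( \frac{\alpha}{n} + \frac{\alpha^2}{n^2} \right) + \|\rho\|_2^2 \, \beta^2 \left( \frac{1 - \alpha}{k} \right) \left( 1 + 2 \frac{\alpha}{n} + \frac{1 - \alpha}{k} \right). \tag{5}$$

Equations (4) and (5) quantify the bias and variance of our estimate as a function of the subspace and parameter dimensions ($k$ and $n$), the parameters of the distribution ($\alpha$ and $\beta$), and the correlation $\|\rho\|_2$. Note that for simplicity we have set the number of pairs of function evaluations, $P$, to one. As $P$ increases, the variance will decrease in proportion to $1/P$ (without any increased bias), at the cost of extra function evaluations.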
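The closed-form bias and variance can be cross-checked against their matrix definitions. The sketch below assumes the covariance $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} U U^T$, a single sample pair, and a second order expansion that is exact; the closed forms are the ones implied by those assumptions, and the numerical values of `n`, `k`, `alpha`, and `beta` are arbitrary.

```python
import numpy as np


def normalized_bias(alpha, beta, n, k, rho_sq):
    """||E[g] - grad||^2 / ||grad||^2 in closed form."""
    a, c = alpha / n, (1.0 - alpha) / k
    return (beta * a - 1.0) ** 2 + rho_sq * beta * c * (beta * (2.0 * a + c) - 2.0)


def normalized_variance(alpha, beta, n, k, rho_sq):
    """tr(Var(g)) / ||grad||^2 in closed form, for one sample pair."""
    a, c = alpha / n, (1.0 - alpha) / k
    return beta**2 * (a + a**2) + rho_sq * beta**2 * c * (1.0 + 2.0 * a + c)


# Cross-check against the matrix expressions on a random instance.
rng = np.random.default_rng(0)
n, k, alpha, beta = 50, 3, 0.5, 2.0
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
grad = rng.standard_normal(n)
Sigma = (alpha / n) * np.eye(n) + ((1.0 - alpha) / k) * (U @ U.T)
rho_sq = np.sum((U.T @ grad) ** 2) / (grad @ grad)

bias_matrix = np.sum(((beta * Sigma - np.eye(n)) @ grad) ** 2) / (grad @ grad)
var_matrix = beta**2 * (np.trace(Sigma) * (grad @ Sigma @ grad)
                        + grad @ Sigma @ Sigma @ grad) / (grad @ grad)
```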

Figure 2 explores the tradeoff between normalized bias and variance for different settings of the relevant hyperparameters ($\alpha$ and $\beta$) for example values of $k$, $n$, and $\|\rho\|_2$. Figure 2c shows the sum of the normalized bias plus variance, the global minimum of which (blue star) can be used to choose optimal values for the hyperparameters, discussed in the next section.

### 3.4 Choosing optimal hyperparameters by minimizing error in the gradient estimate

The expressions for the normalized bias and variance depend on the subspace and parameter dimensions ($k$ and $n$, respectively), the hyperparameters of the guiding distribution ($\alpha$ and $\beta$) and the uncentered correlation between the true gradient and the subspace ($\rho$). All of these quantities except for the correlation are known or defined in advance.

#### 3.4.1 Criteria for optimal hyperparameters

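To choose optimal hyperparameters, we minimize the sum of the normalized bias and variance, $\tilde{b} + \tilde{v}$ (equivalent to the expected normalized square error in the gradient estimate, $\mathbb{E}\left[ \|g - \nabla f(x)\|_2^2 \right] / \|\nabla f(x)\|_2^2$). This objective becomes:

$$\tilde{b} + \tilde{v} = \left( \beta \frac{\alpha}{n} - 1 \right)^2 + \beta^2 \left( \frac{\alpha}{n} + \frac{\alpha^2}{n^2} \right) + \|\rho\|_2^2 \, \beta \left( \frac{1 - \alpha}{k} \right) \left( \beta \left( 1 + 4 \frac{\alpha}{n} + 2 \frac{1 - \alpha}{k} \right) - 2 \right), \tag{6}$$

subject to the feasibility constraints $\beta \ge 0$ and $0 \le \alpha \le 1$.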

#### 3.4.2 Numerical solution for optimal hyperparameters

We can solve for the optimal tradeoff ($\alpha^*$) and scale ($\beta^*$) hyperparameters as a function of $\|\rho\|_2$, $k$, and $n$. Figure 3a shows the optimal value for the tradeoff hyperparameter ($\alpha^*$) in the 2D plane spanned by the correlation ($\|\rho\|_2$) and ratio of the subspace dimension to the parameter dimension ($k/n$). Remarkably, we see that for large regions of the plane, the optimal value for $\alpha$ is either 0 or 1. In the upper left (blue) region, the subspace is of high quality (highly correlated with the true gradient) and small relative to the full space, so the optimal solution is to place all of the weight in the subspace, setting $\alpha$ to zero (therefore $\Sigma = \frac{1}{k} U U^T$). In the bottom right (orange) region, we have the opposite scenario, where the subspace is large and low-quality, thus the optimal solution is to ignore the subspace and place all of the weight in the full space, setting $\alpha$ to one (equivalent to vanilla ES, $\Sigma = \frac{1}{n} I_n$). For the strip in the middle, we have an intermediate regime where the optimal $\alpha$ is between 0 and 1.
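One simple way to reproduce this behaviour is a brute-force grid search over the two hyperparameters. The sketch below is not the authors' solver; it assumes the objective is the sum of normalized bias and variance for $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} U U^T$ with one sample pair, and the grid resolution and `beta_max` are arbitrary choices.

```python
import numpy as np


def expected_sq_error(alpha, beta, n, k, rho_sq):
    """Normalized bias plus normalized variance (works elementwise on arrays)."""
    a = alpha / n
    c = (1.0 - alpha) / k
    bias = (beta * a - 1.0) ** 2 + rho_sq * beta * c * (beta * (2.0 * a + c) - 2.0)
    var = beta**2 * (a + a**2) + rho_sq * beta**2 * c * (1.0 + 2.0 * a + c)
    return bias + var


def optimal_hyperparameters(n, k, rho, num_alpha=201, num_beta=401, beta_max=2.0):
    """Grid search for the (alpha, beta) pair minimizing the expected squared error."""
    alphas = np.linspace(0.0, 1.0, num_alpha)
    betas = np.linspace(0.0, beta_max, num_beta)
    A, B = np.meshgrid(alphas, betas, indexing="ij")
    err = expected_sq_error(A, B, n, k, rho**2)
    i, j = np.unravel_index(np.argmin(err), err.shape)
    return float(alphas[i]), float(betas[j])


# Three regimes for n = 100, k = 3: high, low, and intermediate correlation.
a_hi, b_hi = optimal_hyperparameters(100, 3, 0.9)
a_lo, b_lo = optimal_hyperparameters(100, 3, 0.05)
a_mid, b_mid = optimal_hyperparameters(100, 3, 0.21)
```

On this grid, the high-correlation case puts all weight on the subspace (`a_hi` is 0), the low-correlation case ignores it (`a_lo` is 1), and the intermediate case lands strictly in between.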

#### 3.4.3 Analytic properties of optimal hyperparameters

We can derive an expression for when this transition in optimal hyperparameters occurs. To do this, we use the reparameterization $\theta_1 = \alpha \beta$ and $\theta_2 = (1 - \alpha) \beta$. This allows us to express the objective in (6) as a least squares problem $\frac{1}{2} \|A \theta - b\|_2^2$, subject to a non-negativity constraint ($\theta \succeq 0$), where $A$ and $b$ depend solely on the problem data $k$, $n$, and $\|\rho\|_2$ (see Appendix §B.1 for details). In addition, $A$ is always a positive semi-definite matrix, so the reparameterized problem is a convex optimization problem. We are particularly interested in the point where the non-negativity constraint becomes tight, as this point bounds the regions in Figure 3a. Formulating the Lagrange dual of this problem and solving for the KKT conditions allows us to identify this point using the complementary slackness conditions [32]. This yields the equations $\|\rho\|_2 = \sqrt{\frac{k + 4}{n + 4}}$ and $\|\rho\|_2 = \sqrt{\frac{k}{n}}$ (see Appendix §B.2), which are shown in Figure 3a, and line up with the numerical solution.
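The transition points can be turned into a small lookup. The sketch below assumes the two boundaries are $\|\rho\|_2 = \sqrt{k/n}$ and $\|\rho\|_2 = \sqrt{(k+4)/(n+4)}$, which is what the first-order optimality conditions give for a covariance of the form $\frac{\alpha}{n} I + \frac{1-\alpha}{k} U U^T$ with one sample pair; the function name and the returned labels are illustrative.

```python
import math


def optimal_regime(n, k, rho):
    """Classify which search regime minimizes the expected squared error."""
    lower = math.sqrt(k / n)                 # at or below this: ignore the subspace
    upper = math.sqrt((k + 4) / (n + 4))     # at or above this: search only the subspace
    if rho >= upper:
        return "subspace only (alpha = 0)"
    if lower >= rho:
        return "full space only (alpha = 1)"
    return "intermediate (alpha strictly between 0 and 1)"


regimes = [optimal_regime(100, 3, rho) for rho in (0.05, 0.21, 0.9)]
```

For $n = 100$ and $k = 3$ the intermediate strip is narrow (roughly $0.17$ to $0.26$), which is consistent with the nearly binary behaviour of the optimal tradeoff.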

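Figure 3b further demonstrates this tradeoff, specifically for the intermediate regime. For fixed $n$, we plot four curves for $k$ ranging from 1 to 30. As $\|\rho\|_2$ increases, the optimal hyperparameters sweep out a curve from $\left( \alpha^* = 1, \beta^* = \frac{n}{n + 2} \right)$ to $\left( \alpha^* = 0, \beta^* = \frac{k}{k + 2} \right)$.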

#### 3.4.4 Practical choice of hyperparameters

In practice, the correlation between the gradient and the guiding subspace ($\rho$) is typically unknown. However, we find that ignoring $\rho$ and fixing $\beta$ and $\alpha$ to constant values works well (these are the values used for all of the experiments in this paper). A direction for future work would be to estimate the correlation online during optimization, and to use this to choose the hyperparameters $\alpha$ and $\beta$ by minimizing eq. (6).

## 4 Applications

### 4.1 Quadratic function with a biased gradient

We first test our method in a scenario where we control the bias and variance of the surrogate gradient explicitly. We generated random quadratic problems of the form $f(x) = \frac{1}{2} \|Ax - b\|_2^2$ where the entries of $A$ and $b$ were drawn independently from a standard normal distribution. When optimizing the function, the algorithms only had access to surrogate gradients that were corrupted with noise (resampled at every iteration) and a bias (sampled once at the beginning of optimization). In Figure 1b, we compare the performance of stochastic gradient descent (SGD) with standard (vanilla) evolutionary strategies (ES), CMA-ES, and our method, Guided ES. For this, and all of the results in this paper, we set the hyperparameters $\beta$ and $\alpha$ as described in §3.4.4.

We see that Guided ES proceeds in two phases: it initially quickly descends the loss as it follows the biased gradient, and then transitions into random search. Vanilla ES and CMA-ES, however, do not get to take advantage of the information available in the surrogate gradient, and converge more slowly. We see this also in the plot of the uncentered correlation ($\rho$) between the true gradient and the surrogate gradient in Figure 1c. In particular, note that for Guided ES, the quick descent occurs when the surrogate gradient is still correlated with the true gradient. This is exactly the transition discussed in §3.4, where as $\rho$ varies we move from the regime where we want to only search in the subspace to the regime where we want to ignore the subspace (diagrammed in Figure 3a). Although we know $\rho$ in this case, we find the practical choice of fixing the hyperparameters (discussed in §3.4) independently of $\rho$ works well. Further experimental details are provided in §E.1.
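A toy version of this experiment can be sketched in a few lines. Everything specific here is an assumption made for illustration: the least-squares form of the quadratic, the problem size, the bias and noise scales, the step size rule, a one-dimensional guiding subspace built from the current surrogate gradient, and the values `alpha=0.5` and `beta=2.0`.

```python
import numpy as np


def run_guided_es(num_steps=300, n=20, m=30, alpha=0.5, beta=2.0, sigma=0.1, seed=0):
    """Guided ES on f(x) = 0.5 ||Ax - b||^2 with a biased, noisy surrogate gradient."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    bias = rng.standard_normal(n)                      # sampled once, fixed afterwards

    def f(x):
        return 0.5 * np.sum((A @ x - b) ** 2)

    lr = 0.2 / np.linalg.eigvalsh(A.T @ A).max()       # conservative step size
    x = np.zeros(n)
    losses = [f(x)]
    for _ in range(num_steps):
        noise = 0.1 * rng.standard_normal(n)           # resampled every iteration
        surrogate = A.T @ (A @ x - b) + bias + noise
        u = surrogate / np.linalg.norm(surrogate)      # guiding subspace with k = 1
        eps = sigma * (np.sqrt(alpha / n) * rng.standard_normal(n)
                       + np.sqrt(1.0 - alpha) * rng.standard_normal() * u)
        g = beta / (2.0 * sigma**2) * eps * (f(x + eps) - f(x - eps))
        x = x - lr * g                                 # plain gradient-descent update
        losses.append(f(x))
    return np.array(losses)


losses = run_guided_es()
```

Note that the optimizer only ever evaluates `f` and the surrogate gradient; the true gradient is never used directly in the update.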

### 4.2 Unrolled optimization

Another application where surrogate gradients are available is in unrolled optimization. Unrolled optimization refers to taking derivatives through an optimization process, with the goal of optimizing some part of the optimization process. For example, this approach has been used to optimize hyperparameters [33, 34, 35], to stabilize training [36], and even to train neural networks to act as optimizers [37, 38, 39, 40]. Taking derivatives through an optimization process with a large number of unrolled steps is prohibitive (both in terms of memory and computational costs), so a common approach is to instead choose a small number of unroll iterations, and use that as a target for training. However, Wu et al. [5] recently showed that this approach yields biased gradients, even when optimizing quadratic problems.

We generated a stochastic quadratic problem (as described in §4.1) which exhibits this unrolled truncation bias. Figure 4a shows the loss surface for optimizing the learning rate of gradient descent for this target problem, as a function of the number of unrolled optimization steps ranging from one iteration (orange) to 50 (blue). The optimal learning rate (which can be computed as a function of the eigenvalues of the Hessian) is shown as a dashed red line. The minimum of the truncated objective approaches this optimal value as the number of unrolled steps grows.
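The truncation bias can be seen on a deterministic toy problem. The sketch below assumes a diagonal quadratic with hypothetical eigenvalues between 0.1 and 1, an all-ones starting point, and takes the reference learning rate to be the classical worst-case choice $2 / (\lambda_{\min} + \lambda_{\max})$; none of these specifics come from the paper's experiment.

```python
import numpy as np


def unrolled_loss(lr, num_steps, eigenvalues, x0):
    """Loss of 0.5 * sum_i lambda_i x_i^2 after num_steps gradient-descent steps."""
    x_final = (1.0 - lr * eigenvalues) ** num_steps * x0
    return 0.5 * np.sum(eigenvalues * x_final**2)


def best_learning_rate(num_steps, eigenvalues, x0, grid):
    """Learning rate on the grid that minimizes the loss after num_steps unrolled steps."""
    losses = [unrolled_loss(lr, num_steps, eigenvalues, x0) for lr in grid]
    return float(grid[int(np.argmin(losses))])


eigenvalues = np.linspace(0.1, 1.0, 10)
x0 = np.ones(10)
grid = np.linspace(0.0, 2.0, 2001)

lr_1 = best_learning_rate(1, eigenvalues, x0, grid)     # short-horizon optimum
lr_50 = best_learning_rate(50, eigenvalues, x0, grid)   # longer-horizon optimum
lr_opt = 2.0 / (eigenvalues.min() + eigenvalues.max())  # asymptotic reference value
```

The one-step objective prefers a noticeably smaller learning rate than the 50-step objective, so gradients of the one-step loss are biased toward overly cautious learning rates.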

Using this model, we trained fully connected neural networks, or multi-layer perceptrons (MLPs), to predict the learning rate to use for the quadratic problem. The features given to the MLP were the eigenvalues of the Hessian of the target quadratic. The MLP consisted of 3 hidden layers with 32 units each, with rectified linear (ReLU) activations after each hidden layer. We compute the surrogate gradient of the parameters in the neural network with respect to the loss after one iteration of gradient descent, and use the loss after 50 iterations as the desired function to minimize ($f$). In Figure 4b, we show the absolute value of the difference between the optimal learning rate and the MLP prediction for different optimization algorithms. Further experimental details are provided in §E.2.

### 4.3 Synthesizing gradients for a guiding subspace

Finally, we explore using Guided ES in the scenario where the surrogate gradient is not provided, but where we instead train a model to generate surrogate gradients (we call these synthetic gradients). In real-world applications, training a model to produce synthetic gradients is the basis of model-based and actor-critic methods in RL [22, 41] and has recently been applied to decouple training across neural network layers [42] and to generate policy gradients [43]. A key challenge with such an approach is that early in training, the model generating the synthetic gradients is untrained, and thus will produce corrupted or biased gradients. In general, it is unclear during training when following these synthetic gradients will be beneficial. Here, we feed synthetic gradients to Guided ES and demonstrate that it automatically interpolates between vanilla evolutionary strategies and gradient descent as the quality of the synthetic gradients improves.

We trained a model to produce synthetic gradients for a simple target problem, $f(x)$ (a 300-dimensional quadratic). We define a parametric model (an MLP) which approximates the target function and provides synthetic gradients for the target problem $f$. These synthetic gradients are extracted by taking gradients of the model's output with respect to its input $x$. The model is trained online to minimize mean squared error against evaluations of $f(x)$. We take inspiration from reinforcement learning and sample data for training from a replay buffer containing historical evaluations [9].
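The mechanics of extracting a synthetic gradient from a learned model can be sketched with a much simpler stand-in than an MLP. The example below swaps in a three-term quadratic surrogate fit in one batch by least squares (rather than online training), on a hypothetical 10-dimensional target; the feature map, buffer size, and target are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x_star = rng.standard_normal(n)


def f(x):
    """Hypothetical target problem: a simple quadratic centered at x_star."""
    return 0.5 * np.sum((x - x_star) ** 2)


def features(x):
    """Features of the surrogate h(x) = t0 * 0.5 ||x||^2 + t1^T x + t2."""
    return np.concatenate(([0.5 * x @ x], x, [1.0]))


def fit_surrogate(buffer_x, buffer_f):
    """Least-squares fit of the surrogate to (x, f(x)) pairs from the replay buffer."""
    Phi = np.stack([features(x) for x in buffer_x])
    theta, *_ = np.linalg.lstsq(Phi, np.asarray(buffer_f), rcond=None)
    return theta


def synthetic_grad(theta, x):
    """Gradient of the surrogate with respect to its input x: t0 * x + t1."""
    return theta[0] * x + theta[1:-1]


# Replay buffer of historical evaluations of f.
buffer_x = [rng.standard_normal(n) for _ in range(50)]
buffer_f = [f(x) for x in buffer_x]
theta = fit_surrogate(buffer_x, buffer_f)

x = rng.standard_normal(n)
true_grad = x - x_star
g = synthetic_grad(theta, x)
correlation = g @ true_grad / (np.linalg.norm(g) * np.linalg.norm(true_grad))
```

Because this surrogate family contains the target exactly, a well-filled buffer gives a synthetic gradient that is almost perfectly correlated with the true one; with an MLP or a sparse buffer the correlation would be lower and would vary during training, which is the situation Guided ES is meant to handle.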

Figure 5 compares vanilla ES, Guided ES, and directly passing synthetic gradients to the Adam optimizer [44]. We show training curves for these methods in Figure 5a, and the correlation between the synthetic gradient and the true gradient for Guided ES in Figure 5b. Despite the fact that the quality of the synthetic gradients varies wildly during optimization, Guided ES consistently makes progress on the target problem. These experiments suggest that using Guided ES is a promising approach for improving the stability and convergence of synthetic gradient methods. Further experimental details are provided in §E.3.

## 5 Discussion

We have introduced guided evolutionary strategies (Guided ES), an optimization algorithm which combines the benefits of first-order methods and random search, when we have access to surrogate gradients that are correlated with the true gradient. We analyzed the bias-variance tradeoff inherent in our method analytically, and demonstrated the generality of the technique by applying it to unrolled optimization and a problem with synthetic gradients.

One caveat of the method is that up to second order, the expected Guided ES estimator is equivalent to multiplying the true gradient by a positive semi-definite (PSD) matrix. The product of PSD matrices is not in general PSD, so care must be taken when passing the Guided ES update vector to an optimization method that also multiplies its input by a PSD matrix (such as RMSProp or Adam). Future work will determine if this is an issue in practice.

The nearly binary nature of the tradeoff in optimal hyperparameters (Figure 3) suggests that methods that switch between biased gradient descent and vanilla evolutionary strategies would also likely work, if the switch between methods occurs close to the appropriate value of $\rho$. An interesting direction for future work is to estimate the correlation during optimization, and use this to adapt the hyperparameters online.

In conclusion, Guided ES is a general technique which has widespread applications in machine learning due to the ubiquity of problems where the gradient is either uninformative or only noisy gradient estimates are tractable.

## 6 Acknowledgements

The authors would like to thank the Google Brain team for general discussion, and Roy Frostig, Alex Alemi, Katherine Lee, and Benjamin Eysenbach for comments on the manuscript.

## References

- Sra et al. [2012] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for machine learning. MIT Press, 2012.
- Rechenberg [1973] Ingo Rechenberg. Evolutionsstrategie–optimierung technischer systeme nach prinzipien der biologischen evolution. 1973.
- Nesterov and Spokoiny [2011] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.
- Salimans et al. [2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- Wu et al. [2018] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021, 2018.
- Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. 1998.
- Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Duchi et al. [2015] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
- Tucker et al. [2017] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633, 2017.
- Mania et al. [2018] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
- Wierstra et al. [2008] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In Evolutionary Computation, 2008. CEC 2008.(IEEE World Congress on Computational Intelligence). IEEE Congress on, pages 3381–3387. IEEE, 2008.
- Glasmachers et al. [2010] Tobias Glasmachers, Tom Schaul, Sun Yi, Daan Wierstra, and Jürgen Schmidhuber. Exponential natural evolution strategies. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pages 393–400. ACM, 2010.
- Lehman et al. [2017a] Joel Lehman, Jay Chen, Jeff Clune, and Kenneth O Stanley. Safe mutations for deep and recurrent neural networks through output gradients. arXiv preprint arXiv:1712.06563, 2017a.
- Hansen [2016] N. Hansen. The CMA Evolution Strategy: A Tutorial. ArXiv e-prints, April 2016.
- Staines and Barber [2012] Joe Staines and David Barber. Variational optimization. arXiv preprint arXiv:1212.4507, 2012.
- van den Oord et al. [2017] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural Discrete Representation Learning. ArXiv e-prints, November 2017.
- Maddison et al. [2016] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- Rumelhart et al. [1985] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
- Williams and Peng [1990] Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural computation, 2(4):490–501, 1990.
- Lillicrap et al. [2014] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247, 2014.
- Nøkland [2016] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.
- Gilmer et al. [2017] Justin Gilmer, Colin Raffel, Samuel S Schoenholz, Maithra Raghu, and Jascha Sohl-Dickstein. Explaining the learning dynamics of direct feedback alignment. 2017.
- Rastrigin [1963] LA Rastrigin. About convergence of random search method in extremal control of multi-parameter systems. Avtomat. i Telemekh, 24(11):1467–1473, 1963.
- Owen [2013] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
- Lehman et al. [2017b] Joel Lehman, Jay Chen, Jeff Clune, and Kenneth O Stanley. ES is more than just a traditional finite-difference approximator. arXiv preprint arXiv:1712.06568, 2017b.
- Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.
- Maclaurin et al. [2015] D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based Hyperparameter Optimization through Reversible Learning. ArXiv e-prints, February 2015.
- Baydin et al. [2017] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782, 2017.
- Metz et al. [2016] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
- Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- Wichrowska et al. [2017] Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. International Conference on Machine Learning, 2017.
- Li and Malik [2017] Ke Li and Jitendra Malik. Learning to optimize. International Conference on Learning Representations, 2017.
- Lv et al. [2017] K. Lv, S. Jiang, and J. Li. Learning Gradient Descent: Better Generalization and Longer Horizons. ArXiv e-prints, March 2017.
- Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
- Jaderberg et al. [2016] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
- Houthooft et al. [2018] R. Houthooft, R. Y. Chen, P. Isola, B. C. Stadie, F. Wolski, J. Ho, and P. Abbeel. Evolved Policy Gradients. ArXiv e-prints, February 2018.
- Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014.

## Appendix

## Appendix A Derivation of the bias and variance of the Guided ES update

### A.1 Bias

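The squared bias norm is defined as:

$$\|b\|_2^2 = \|\mathbb{E}[g] - \nabla f(x)\|_2^2 = \|(\beta\Sigma - I)\,\nabla f(x)\|_2^2,$$

where $\epsilon \sim \mathcal{N}(0, \Sigma)$ and the covariance is given by: $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} UU^T$. This expression reduces to (recall that $U$ is orthonormal, so $U^TU = I$):

$$\|b\|_2^2 = \left(\beta\frac{\alpha}{n} - 1\right)^2 \|\nabla f(x)\|_2^2 + \beta^2\,\frac{1-\alpha}{k}\left(\frac{1-\alpha}{k} + \frac{2\alpha}{n} - \frac{2}{\beta}\right)\|U^T\nabla f(x)\|_2^2.$$

Dividing by the squared norm of the gradient ($\|\nabla f(x)\|_2^2$) yields the expression for the normalized bias (eq. (4) in the main text).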
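As a numerical sanity check, the sketch below compares the squared bias computed directly from the matrix $\beta\Sigma - I$ with the reduced closed form. It assumes the covariance $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} UU^T$ and the expected estimate $\mathbb{E}[g] = \beta\Sigma\nabla f(x)$; the function names and the problem sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def squared_bias_direct(grad, U, alpha, beta):
    """||(beta * Sigma - I) grad||^2 with Sigma built explicitly (assumed form)."""
    n, k = U.shape
    Sigma = (alpha / n) * np.eye(n) + ((1 - alpha) / k) * (U @ U.T)
    return float(np.sum(((beta * Sigma - np.eye(n)) @ grad) ** 2))

def squared_bias_closed_form(grad, U, alpha, beta):
    """Closed form that only needs ||grad||^2 and ||U^T grad||^2."""
    n, k = U.shape
    a = beta * alpha / n - 1.0       # coefficient of the identity
    c = beta * (1 - alpha) / k       # coefficient of U U^T
    full_term = a ** 2 * float(grad @ grad)
    subspace_term = (c ** 2 + 2 * a * c) * float(np.sum((U.T @ grad) ** 2))
    return full_term + subspace_term

# Hypothetical problem sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, k = 20, 3
U, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal basis (n x k)
grad = rng.standard_normal(n)

direct = squared_bias_direct(grad, U, alpha=0.3, beta=2.0)
closed = squared_bias_closed_form(grad, U, alpha=0.3, beta=2.0)
print(direct, closed)
```

The closed form avoids building any $n \times n$ matrix, which is why the analysis can be carried out in terms of $\|\nabla f(x)\|$ and $\|U^T\nabla f(x)\|$ alone.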

### A.2 Variance

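First, we state a useful identity. Suppose $\epsilon \sim \mathcal{N}(0, \Sigma)$, then

$$\mathbb{E}\left[\epsilon\epsilon^T v v^T \epsilon\epsilon^T\right] = 2\,\Sigma v v^T \Sigma + (v^T\Sigma v)\,\Sigma.$$

We can see this by observing that the $(i,j)$ entry of $\mathbb{E}[\epsilon\epsilon^T v v^T \epsilon\epsilon^T]$ is

$$\mathbb{E}\Big[\epsilon_i\epsilon_j \sum_{p,q} v_p v_q\, \epsilon_p\epsilon_q\Big] = \sum_{p,q} v_p v_q \left(\Sigma_{ij}\Sigma_{pq} + \Sigma_{ip}\Sigma_{jq} + \Sigma_{iq}\Sigma_{jp}\right)$$

by Isserlis’ theorem, and then we recover the identity by rewriting the terms in matrix notation.

The total variance is given by:

$$\text{total variance} = \operatorname{tr}\left(\operatorname{Var}(g)\right) = \mathbb{E}[g^T g] - \mathbb{E}[g]^T\mathbb{E}[g].$$

Using the identity above with $v = \nabla f(x)$ and the first-order estimate $g = \beta\,\epsilon\epsilon^T\nabla f(x)$ for a single perturbation, we can express the total variance as:

$$\begin{aligned}
\text{total variance} &= \beta^2 \operatorname{tr}\left(\mathbb{E}\left[\epsilon\epsilon^T \nabla f(x)\nabla f(x)^T \epsilon\epsilon^T\right]\right) - \beta^2\,\nabla f(x)^T\Sigma^2\nabla f(x) \\
&= \beta^2\left(2\,\nabla f(x)^T\Sigma^2\nabla f(x) + \operatorname{tr}(\Sigma)\,\nabla f(x)^T\Sigma\nabla f(x)\right) - \beta^2\,\nabla f(x)^T\Sigma^2\nabla f(x) \\
&= \beta^2\,\nabla f(x)^T\left(\operatorname{tr}(\Sigma)\,\Sigma + \Sigma^2\right)\nabla f(x).
\end{aligned}$$

Since the trace of the covariance matrix is 1, we can expand the quantity $\operatorname{tr}(\Sigma)\Sigma + \Sigma^2$ as:

$$\Sigma + \Sigma^2 = \left(\frac{\alpha}{n} + \frac{\alpha^2}{n^2}\right) I + \left(\frac{1-\alpha}{k} + \frac{(1-\alpha)^2}{k^2} + \frac{2\alpha(1-\alpha)}{kn}\right) UU^T.$$

Thus the expression for the total variance reduces to:

$$\text{total variance} = \beta^2\left(\frac{\alpha}{n} + \frac{\alpha^2}{n^2}\right)\|\nabla f(x)\|_2^2 + \beta^2\left(\frac{1-\alpha}{k} + \frac{(1-\alpha)^2}{k^2} + \frac{2\alpha(1-\alpha)}{kn}\right)\|U^T\nabla f(x)\|_2^2,$$

and dividing by the squared norm of the gradient yields the expression for the normalized variance (eq. (5) in the main text).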
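The variance expression can be checked empirically. The sketch below assumes a single perturbation $\epsilon \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} UU^T$ and the first-order estimate $g = \beta\,\epsilon\epsilon^T\nabla f(x)$; it compares a Monte Carlo estimate of $\operatorname{tr}(\operatorname{Var}(g))$ against the closed form. Function names, sizes, and the sample count are illustrative assumptions.

```python
import numpy as np

def total_variance_closed_form(grad, U, alpha, beta):
    """beta^2 * grad^T (Sigma + Sigma^2) grad, written without forming Sigma."""
    n, k = U.shape
    full = alpha / n + alpha ** 2 / n ** 2
    sub = ((1 - alpha) / k + (1 - alpha) ** 2 / k ** 2
           + 2 * alpha * (1 - alpha) / (k * n))
    grad_sq = float(grad @ grad)
    proj_sq = float(np.sum((U.T @ grad) ** 2))
    return beta ** 2 * (full * grad_sq + sub * proj_sq)

def total_variance_monte_carlo(grad, U, alpha, beta, num_samples, rng):
    """Sample eps ~ N(0, Sigma) and estimate E||g||^2 - ||E g||^2."""
    n, k = U.shape
    z_full = rng.standard_normal((num_samples, n))
    z_sub = rng.standard_normal((num_samples, k))
    # Rows of eps have covariance (alpha/n) I + ((1-alpha)/k) U U^T.
    eps = np.sqrt(alpha / n) * z_full + np.sqrt((1 - alpha) / k) * (z_sub @ U.T)
    g = beta * eps * (eps @ grad)[:, None]  # g = beta * eps * (eps^T grad)
    return float(np.mean(np.sum(g ** 2, axis=1)) - np.sum(np.mean(g, axis=0) ** 2))

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
n, k = 6, 2
alpha, beta = 0.5, 1.5
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
grad = rng.standard_normal(n)

closed = total_variance_closed_form(grad, U, alpha, beta)
mc = total_variance_monte_carlo(grad, U, alpha, beta, num_samples=200_000, rng=rng)
print(closed, mc)
```

The Monte Carlo estimate involves fourth moments of a Gaussian, so it converges slowly; agreement to within a few percent is the most one should expect at this sample size.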

## Appendix B Optimal hyperparameters

### B.1 Reparameterization

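We wish to minimize the sum of the normalized bias and variance, eq. (3.4.1) in the main text. First, we use a reparameterization by using the substitution $\theta_1 = \alpha\beta$ and $\theta_2 = (1-\alpha)\beta$. This substitution yields:

$$\tilde{b} + \tilde{v} = \left(\frac{2}{n^2} + \frac{1}{n}\right)\theta_1^2 + \left(\frac{4\rho^2}{kn} + \frac{1}{n} + \frac{\rho^2}{k}\right)\theta_1\theta_2 + \rho^2\left(\frac{2}{k^2} + \frac{1}{k}\right)\theta_2^2 - \frac{2}{n}\theta_1 - \frac{2\rho^2}{k}\theta_2 + 1,$$

which is quadratic in $\theta$. Therefore, we can rewrite the problem as: $\tilde{b} + \tilde{v} = \frac{1}{2}\theta^T A\theta - b^T\theta + 1$, where $A$ and $b$ are given by:

$$A = \begin{pmatrix} \dfrac{4}{n^2} + \dfrac{2}{n} & \dfrac{4\rho^2}{kn} + \dfrac{1}{n} + \dfrac{\rho^2}{k} \\[2ex] \dfrac{4\rho^2}{kn} + \dfrac{1}{n} + \dfrac{\rho^2}{k} & \rho^2\left(\dfrac{4}{k^2} + \dfrac{2}{k}\right) \end{pmatrix}, \qquad b = \begin{pmatrix} \dfrac{2}{n} \\[2ex] \dfrac{2\rho^2}{k} \end{pmatrix}. \tag{7}$$

Note that $A$ and $b$ depend on the problem data ($n$, $k$, and $\rho$), and that $A$ is a positive semi-definite matrix (as $k$ and $n$ are non-negative integers, and $\rho$ is between 0 and 1). In addition, we can express the constraints on the original parameters ($0 \le \alpha \le 1$ and $\beta \ge 0$) as a non-negativity constraint in the new parameters ($\theta \succeq 0$).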

### B.2 KKT conditions

The optimal hyperparameters are defined (see main text) as the solution to the minimization problem:

$$\min_{\theta}\; \frac{1}{2}\theta^T A\theta - b^T\theta \quad \text{subject to} \quad \theta \succeq 0, \tag{8}$$

where $\theta = (\theta_1, \theta_2)$ are the hyperparameters to optimize, and $A$ and $b$ are specified above in eq. (7).

The Lagrangian for (8) is given by $\mathcal{L}(\theta, \lambda) = \frac{1}{2}\theta^T A\theta - b^T\theta - \lambda^T\theta$, and the corresponding dual problem is:

$$\max_{\lambda \succeq 0}\; \min_{\theta}\; \mathcal{L}(\theta, \lambda). \tag{9}$$

Since the primal is convex, we have strong duality and the Karush-Kuhn-Tucker (KKT) conditions guarantee primal and dual optimality. These conditions include primal and dual feasibility, the condition that the gradient of the Lagrangian vanishes ($\nabla_\theta \mathcal{L}(\theta, \lambda) = 0$), and complementary slackness (which ensures that for each inequality constraint, either the constraint is tight, $\theta_i = 0$, or $\lambda_i = 0$).

Solving the condition on the gradient of the Lagrangian for $\lambda$ yields that the Lagrange multipliers are simply the residual $\lambda = A\theta - b$. Complementary slackness tells us that $\lambda_i\theta_i = 0$, for all $i$. We are interested in when this constraint becomes tight. To solve for this, we note that there are two regimes where each of the two inequality constraints is tight (the blue and orange regions in Figure 3a). These occur for the solutions $\theta = \left(0, \frac{k}{k+2}\right)$ (when the first inequality is tight) and $\theta = \left(\frac{n}{n+2}, 0\right)$ (when the second inequality is tight). To solve for the transition point, we solve for the point where the constraint is tight and the Lagrange multiplier ($\lambda_i$) equals zero. We have two inequality constraints, and thus will have two solutions (which are the two solid curves in Figure 3a). Since the Lagrange multiplier is the residual, these points occur when $(A\theta - b)_1 = 0$ at $\theta = \left(0, \frac{k}{k+2}\right)$ and $(A\theta - b)_2 = 0$ at $\theta = \left(\frac{n}{n+2}, 0\right)$.

The first solution yields the upper bound:

$$\rho = \sqrt{\frac{k+4}{n+4}}.$$

And the second solution yields the lower bound:

$$\rho = \sqrt{\frac{k}{n}}.$$

These are the equations for the lines separating the regimes of optimal hyperparameters in Figure 3.
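The regimes can be reproduced numerically with a small solver. The sketch below assumes the objective $\frac{1}{2}\theta^T A\theta - b^T\theta$ with $\theta_1 = \alpha\beta$, $\theta_2 = (1-\alpha)\beta$, and entries of $A$ and $b$ obtained by expanding the normalized bias plus variance; these entries, like the function name `optimal_hyperparameters`, are assumptions of this example. Rather than relying on a general QP solver, it enumerates the candidate KKT points of the two-dimensional problem (the origin, the two edges, and the interior stationary point) and keeps the feasible one with the lowest objective.

```python
import numpy as np

def optimal_hyperparameters(n, k, rho):
    """Return (alpha, beta) minimizing 0.5 * th^T A th - b^T th over th >= 0."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    r2 = rho ** 2
    off_diag = 4 * r2 / (k * n) + 1 / n + r2 / k
    A = np.array([[4 / n ** 2 + 2 / n, off_diag],
                  [off_diag, r2 * (4 / k ** 2 + 2 / k)]])
    b = np.array([2 / n, 2 * r2 / k])

    def objective(theta):
        return 0.5 * theta @ A @ theta - b @ theta

    # Candidates: origin, theta_2 = 0 edge, theta_1 = 0 edge.
    candidates = [np.zeros(2),
                  np.array([b[0] / A[0, 0], 0.0]),
                  np.array([0.0, b[1] / A[1, 1]])]
    try:
        interior = np.linalg.solve(A, b)  # stationary point with lambda = 0
        if np.all(interior >= 0):
            candidates.append(interior)
    except np.linalg.LinAlgError:
        pass  # singular A: only boundary candidates remain

    theta = min(candidates, key=objective)
    beta = float(theta.sum())
    alpha = float(theta[0] / beta)
    return alpha, beta

# Hypothetical problem size, for illustration only.
n, k = 100, 1
lower, upper = np.sqrt(k / n), np.sqrt((k + 4) / (n + 4))
for rho in (0.05, 0.15, 0.5):
    print(rho, optimal_hyperparameters(n, k, rho), (lower, upper))
```

With these sizes the thresholds are about 0.10 and 0.22: a weakly correlated surrogate ($\rho = 0.05$) should select $\alpha = 1$ (plain random search over the full space), a strongly correlated one ($\rho = 0.5$) should select $\alpha = 0$ (search restricted to the guiding subspace), and values in between mix the two.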

## Appendix C Alternative motivation for optimal hyperparameters

Choosing hyperparameters which most rapidly descend the simple quadratic loss in eq. (10) is equivalent to choosing hyperparameters which minimize the expected square error in the estimated gradient, as is done in §3.4. This provides further support for the method used to choose hyperparameters in the main text. Here we derive this equivalence.

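Assume a loss function of the form

$$f(x) = \frac{1}{2}\|x\|_2^2, \tag{10}$$

and that updates are performed via gradient descent with learning rate 1,

$$x \leftarrow x - g.$$

The expected loss after a single training step is then

$$\mathbb{E}_g\left[f(x - g)\right] = \frac{1}{2}\,\mathbb{E}_g\left[\|x - g\|_2^2\right]. \tag{11}$$

For this problem, the true gradient is simply $\nabla f(x) = x$. Substituting this into eq. (11), we find

$$\mathbb{E}_g\left[f(x - g)\right] = \frac{1}{2}\,\mathbb{E}_g\left[\|\nabla f(x) - g\|_2^2\right].$$

Up to a multiplicative constant, this is exactly the expected square error between the descent direction and the gradient which is used as the objective for choosing hyperparameters in §3.4.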

## Appendix D Computational and memory cost

Here, we outline the computational and memory costs of Guided ES and compare them to standard (vanilla) evolutionary strategies and gradient descent. As elsewhere in the paper, we define the parameter dimension as $n$ and the number of pairs of function evaluations (for evolutionary strategies) as $P$. We denote the cost of computing the full loss as $F_0$, and (for Guided ES and gradient descent), we assume that at every iteration we compute a surrogate gradient which has cost $F_1$. Note that for standard training of neural networks with backpropagation, these quantities have similar cost; however, for some applications (such as unrolled optimization discussed in §4.2) these can be very different.

Algorithm | Computational cost | Memory cost |
---|---|---|
Gradient descent | $F_1$ | $n$ |
Vanilla evolutionary strategies | $2PF_0$ | $n$ |
Guided evolutionary strategies | $F_1 + 2PF_0$ | $(k+1)n$ |
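The cost structure is easiest to see in code. The following is a minimal sketch of one Guided ES gradient estimate, assuming antithetic sampling with perturbations drawn from $\mathcal{N}(0, \sigma^2\Sigma)$, $\Sigma = \frac{\alpha}{n} I + \frac{1-\alpha}{k} UU^T$; the function name `guided_es_gradient`, its signature, and the toy loss are hypothetical. Each call evaluates the loss exactly twice per sample pair, and the only storage beyond the parameters is the $n \times k$ basis `U`.

```python
import numpy as np

def guided_es_gradient(f, x, U, alpha, beta, sigma, num_pairs, rng):
    """Antithetic estimate of a descent direction (sketch, assumed form)."""
    n, k = U.shape
    scale_full = sigma * np.sqrt(alpha / n)        # isotropic component
    scale_sub = sigma * np.sqrt((1 - alpha) / k)   # guiding-subspace component
    g = np.zeros(n)
    for _ in range(num_pairs):
        # Sampling never forms the n x n covariance: O(nk) work and memory.
        eps = (scale_full * rng.standard_normal(n)
               + scale_sub * (U @ rng.standard_normal(k)))
        g += eps * (f(x + eps) - f(x - eps))       # two loss evaluations
    return beta / (2 * sigma ** 2 * num_pairs) * g

# Toy usage with a counter to make the evaluation cost visible.
rng = np.random.default_rng(0)
n, k, num_pairs = 50, 2, 4
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
calls = {"count": 0}

def loss(x):
    calls["count"] += 1
    return 0.5 * float(x @ x)

x0 = rng.standard_normal(n)
g = guided_es_gradient(loss, x0, U, alpha=0.5, beta=2.0, sigma=0.1,
                       num_pairs=num_pairs, rng=rng)
print(calls["count"], g.shape)
```

In a full training loop the surrogate gradient would also be computed once per iteration to update `U`, which is where the additional surrogate-gradient cost enters.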

## Appendix E Experimental details

Below, we give detailed methods used for each of the experiments from §4. For each problem, we specify the loss function $f(x)$ that we would like to minimize, as well as the method for generating a surrogate or approximate gradient.

### E.1 Quadratic function with a biased gradient

Our target problem is linear regression, , where is a random matrix and is a random -dimensional vector. The elements of and were drawn IID from a standard Normal distribution. We chose and for this problem. The surrogate gradient was generated by adding a random bias (drawn once at the beginning of optimization) and noise (resampled at every iteration) to the gradient. These quantities were scaled to have the same norm as the gradient. Thus, the surrogate gradient is given by: , where and are unit norm random vectors that are fixed (bias) or resampled (noise) at every iteration.

The plots in Figure 1b show the loss suboptimality (), where is the minimum of for a particular realization of the problem. The parameters were initialized to the zeros vector and optimized for 10,000 iterations. Figure 1b shows the mean and spread (std. error) over 10 random seeds. For each optimization algorithm, we performed a coarse grid search over the learning rate for each method, scanning 17 logarithmically spaced values over the range . The learning rates chosen were: 5e-3 for gradient descent, 0.2 for guided and vanilla ES, and 1.0 for CMA-ES. For the two evolutionary strategies algorithms, we set the overall variance of the perturbations as and used pair of samples per iteration. The subspace dimension for Guided ES was set to . The results were not sensitive to the choices for , , or .
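The surrogate gradient construction described above (a fixed bias plus resampled noise, each scaled to the norm of the true gradient) can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the least-squares loss $\frac{1}{2}\|Ax - b\|_2^2$, the small problem dimensions, and the helper names are all assumptions of this example.

```python
import numpy as np

def unit_vector(rng, n):
    """Random direction with unit Euclidean norm."""
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

def make_surrogate_gradient(grad_fn, n, rng):
    """Return (surrogate, bias): bias is drawn once, noise on every call."""
    bias = unit_vector(rng, n)
    def surrogate(x):
        g = grad_fn(x)
        noise = unit_vector(rng, n)
        # Bias and noise are each rescaled to the norm of the true gradient.
        return g + np.linalg.norm(g) * (bias + noise)
    return surrogate, bias

# Hypothetical small instance (the paper's dimensions are not reproduced here).
rng = np.random.default_rng(0)
m, n = 30, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
grad_fn = lambda x: A.T @ (A @ x - b)   # gradient of 0.5 * ||A x - b||^2

surrogate, bias = make_surrogate_gradient(grad_fn, n, rng)
x = np.zeros(n)
s = surrogate(x)
cosine = float(s @ grad_fn(x)) / (np.linalg.norm(s) * np.linalg.norm(grad_fn(x)))
print(cosine)
```

Because the bias is fixed across iterations, following this surrogate directly converges to the wrong point, while the resampled noise averages out over time.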

### E.2 Unrolled optimization

We define the target problem as the loss of a quadratic after running steps of gradient descent. The quadratic has the same form as described above, , but with and . The learning rate for the optimizer was taken as the output of a multilayer perceptron (MLP), with three hidden layers containing 32 hidden units per layer and with rectified linear (ReLU) activations after each hidden layer. The inputs to the MLP were the 10 eigenvalues of the Hessian, , and the output was a single scalar that was passed through a softplus nonlinearity (to ensure a positive learning rate). Note that the optimal learning rate for this problem is , where and are the minimum and maximum eigenvalues of , respectively.
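For intuition about the target learning rate, the sketch below uses the classical result that, for gradient descent on a quadratic with Hessian eigenvalues between $\lambda_{\min}$ and $\lambda_{\max}$, the step size $2/(\lambda_{\min} + \lambda_{\max})$ minimizes the worst-case per-step contraction factor. Treating this as the optimum referred to above is an assumption of this example, as are the matrix sizes and function name.

```python
import numpy as np

def contraction_factor(lr, eigvals):
    """Worst-case per-step error contraction of gradient descent on a quadratic."""
    return float(np.max(np.abs(1.0 - lr * eigvals)))

# Hypothetical quadratic with a 10-dimensional Hessian H = B^T B.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 10))
H = B.T @ B
eigvals = np.linalg.eigvalsh(H)          # sorted in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

lr_star = 2.0 / (lam_min + lam_max)
best = contraction_factor(lr_star, eigvals)
print(lr_star, best, (lam_max - lam_min) / (lam_max + lam_min))
```

At this step size the slowest and fastest modes contract at the same rate, $(\lambda_{\max} - \lambda_{\min})/(\lambda_{\max} + \lambda_{\min})$; any smaller or larger step makes one of the two extremes worse.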

The surrogate gradients for this problem were generated by backpropagation through the optimization process, but by unrolling only optimization steps (truncated backprop). Figure 4b shows the distance between the MLP predicted learning rate and the optimal learning rate , during the course of optimization of the MLP parameters. That is, Figure 4b shows the progress on the meta-optimization problems (optimizing the MLP to predict the learning rate) using the three different algorithms (SGD, vanilla ES, and guided ES).

As before, the mean and spread (std. error) over 10 random seeds are shown, and the learning rate for each of the three methods was chosen by a grid search over the range . The learning rates chosen were 0.3 for gradient descent, 0.5 for guided ES, and 10 for vanilla ES. For the two evolutionary strategies algorithms, we set the variance of the perturbations to and used pair of samples per iteration. The results were not sensitive to the choices for , , or .

### E.3 Synthesizing gradients for a guiding subspace

Here, the target problem consisted of a mean squared error objective, , where was randomly sampled from a uniform distribution between [-1, 1]. The surrogate gradient was defined as the gradient of a model, , with inputs and parameters . We parameterize this model using a multilayer perceptron (MLP) with two 64-unit hidden layers and ReLU activations. The surrogate gradients were taken as the gradients of with respect to : .

The model was optimized online during optimization of by minimizing the mean squared error with the (true) function observations: . The data used to train were randomly sampled in batches of size 512 from the most recent 8192 function evaluations encountered during optimization. This is equivalent to uniformly sampling from a replay buffer, a strategy commonly used in reinforcement learning. We performed one update per update with Adam with a learning rate of 1e-4.
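The data pipeline described above can be sketched as a bounded replay buffer. The capacity (8192) and batch size (512) come from the text; the class name, its methods, sampling with replacement, and the toy objective are assumptions of this example.

```python
from collections import deque

import numpy as np

class ReplayBuffer:
    """Keeps only the most recent `capacity` (x, f(x)) pairs."""

    def __init__(self, capacity=8192):
        self.data = deque(maxlen=capacity)  # oldest entries are dropped first

    def add(self, x, fx):
        self.data.append((np.asarray(x, dtype=float), float(fx)))

    def __len__(self):
        return len(self.data)

    def sample(self, batch_size, rng):
        """Uniformly sample a batch (with replacement) from the buffer."""
        idx = rng.integers(0, len(self.data), size=batch_size)
        xs = np.stack([self.data[i][0] for i in idx])
        ys = np.array([self.data[i][1] for i in idx])
        return xs, ys

# Toy usage: log more evaluations than the buffer can hold, then draw a batch.
rng = np.random.default_rng(0)
buffer = ReplayBuffer(capacity=8192)
for step in range(10_000):
    x = rng.uniform(-1.0, 1.0, size=4)
    buffer.add(x, float(np.mean(x ** 2)))   # hypothetical objective value
xs, ys = buffer.sample(512, rng)
print(len(buffer), xs.shape, ys.shape)
```

Each sampled batch would then be used for one regression step of the model on the observed function values, so the model tracks the region the optimizer is currently exploring.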

The two evolutionary strategies algorithms inherently generate samples of the function during optimization. In order to make a fair comparison when optimizing with the Adam baseline, we similarly generated function evaluations for training the model by sampling points around the current iterate from the same distribution used in vanilla ES (Normal with ). This ensures that the amount and spread of training data for (in the replay buffer) when optimizing with Adam is similar to the data in the replay buffer when training with vanilla or guided ES.

Figure 5a shows the mean and spread (standard deviation) of the performance of the three algorithms over 10 random instances of the problem. We set and used pair of samples per iteration. For Guided ES, we used a subspace dimension of . The results were not sensitive to the number of samples , but did vary with , as this controls the spread of the data used to train .