A view of Estimation of Distribution Algorithms through the lens of ExpectationMaximization
Abstract
We show that under mild conditions, Estimation of Distribution Algorithms (EDAs) can be written as variational ExpectationMaximization (EM) that uses a mixture of weighted particles as the approximate posterior. In the infinite particle limit, EDAs can be viewed as exact EM. Because EM sits on a rigorous statistical foundation and has been thoroughly analyzed, this connection provides a coherent framework with which to reason about EDAs. Importantly, the connection also suggests avenues for possible improvements to EDAs owing to our ability to leverage general statistical tools and generalizations of EM. For example, we make use of results about known EM convergence properties to propose an adaptive, hybrid EDAgradient descent algorithm; this hybrid demonstrates better performance than either component of the hybrid on several canonical, nonconvex test functions. We also demonstrate empirically that although one might hypothesize that reducing the variational gap could prove useful, it actually degrades performance of EDAs. Finally, we show that the connection between EM and EDAs provides us with a new perspective on why EDAs are performing approximate natural gradient descent.
[subfigure]style=plain, subcapbesideposition=top
1 Modelbased optimization
Many practical problems of interest can be cast as optimizing a function over a space of discrete or continuous inputs, . The solution to this problem is given by In many settings, such as reinforcement learning Peters2007 (), protein and molecule design Brookes2019 (), blackbox optimization Bengoetxea2001 (), Bayesian optimization Wilson2018 () and combinatorial optimization Zlochin2004 (); Mania2018 (), this optimization problem is transformed into the related problem, , where is a probability density over , parameterized by , for parameters. That is, the search over is replaced by a search over a parameterized family of distributions on . Herein we refer to this as modelbased optimization (MBO). When has the capacity to represent a point mass centered on , then these two formulations have the same objective values at their respective optima. Reasons for using the MBO formulation include the desire to obtain a set of good candidates rather than a single candidate Brookes2019 (), enabling of rigorous analysis of associated optimization algorithmsZlochin2004 (), and the leveraging of probabilistic formulation to incorporate auxiliary information Brookes2019 ().
Estimation of Distribution Algorithms
A common approach to solving the MBO objective is with a class of iterative algorithms known as Estimation of Distribution Algorithms (EDAs) Bengoetxea2001 (); Baluja1995 (), of which the Covariance Matrix Adaptation Evolution Strategy (CMAES)Hansen2001 (); Stulp2012 () and Cross Entropy Method (CEM)Rubinstein1999 () are perhaps the most wellknown. The general template for an EDA is as follows. Beginning with an initial parameter setting, , of the “search model", , an EDA algorithm generally proceeds in three steps at each iteration :

Draw samples, , from .

Evaluate for each .

Find a that uses the samples and corresponding function evaluations to move towards regions of that have large function values.
The final step can be performed in a variety of ways, but is typically done by solving
(1) 
which is simply a weighted maximum likelihood problem with weights, .
In some variations of EDAs, is transformed monotonically through a shaping function Wierstra2008 (); Wierstra2014 (); Yi2009 (), which alters convergence properties but does not change the optima.
An important aspect of any EDA is the choice of search model. Many EDAs use exponential family models, most commonly the multivariate Gaussian used in CMAES. However, Bayesian networks Ocenasek2002 (), Boltzmann machines Hu2011 (); Shim2013 (), Markov Random Fields Shakya2007 (), Variational Autoencoder Brookes2019 (), and others Kern2004 () have also been used. The choice of search model is driven by prior knowledge and empirics. Variants of some EDAs include reusing samples from previous iterations, and/or smoothing of the parameter estimates across iterations Hansen2001 (); Ollivier2017 (); Brookes2019 (); however, for clarity of exposition, we do not consider these variations.
Our contributions
We (i) show that under mild conditions EDAs can be viewed as variational ExpectationMaximization (EM) Beal1998 () with an approximate posterior given by a set of weighted particles (i. e., a mixture of delta functions); and, as exact EM in the limit of infinite particles Dempster77 (), (ii) derive insights from this viewpoint that suggest an improved adaptive hybrid EDAgradient descent approach, and show that it demonstrates superior performance than either of its component algorithms; we also show empirically that reducing the variational gap is not helpful and likely harmful, and (iii) show that the equivalence between EM and EDAs provides a new perspective on why EDAs can be seen to be performing approximate natural gradient descent.
2 Related Work
The Information Geometry Optimization (IGO) Ollivier2017 () framework encompasses a large class of MBO approaches by discretizing a natural gradient flow. Instantiations of IGO are most readily seen as tantamount to using the “log derivative trick" on the MBO objective, combined with natural gradient, as in Natural Evolution Strategies Wierstra2008 (); Wierstra2014 (). However, Ollivier et al. use the IGO framework to unify numerous approaches for tackling MBOs, including EDAs such as CEM Rubinstein1999 () and CMAES Hansen2001 (). The IGO framework does not make use of EM in achieving its unification, directly or otherwise; therefore, IGO provides a complementary viewpoint to that presented herein.
Staines et al. introduce the notion of variational optimization to tackle MBOs by considering the MBO formulation of the original optimization problem as an upper bound to be minimized; they clearly delineate when the bound can be satiated, sufficient conditions for it to be convex, and for when the derivative of the MBO objective exists Staines2013 (). However, they do not connect to the EDA literature.
In the reinforcement learning (RL) community, EM has been used to explain why the Relative Payoff Procedure is guaranteed, in some circumstances, to improve at each iteration Dayan97 (). Wierstra et al. directly use EM with heuristic addons such as a “forget factor" to optimize an MBO objective with a Gaussian density Wierstra2008b (). The relationship between parameterbased policy search and ExpectationMaximization is also discussed in Furmston16 (); Kober2009 (). These papers are focused on problems in RL and do not make connections to the EDA literature.
Finally, parenthetically, EM is used inside of the model fitting part of evolutionary algorithms for binary latent variables in Guiraud2018 (), which is conceptually orthogonal to our contributions.
In contrast to these works, we provide an exposition of how a broad class of EDAs can be viewed as EM, from which we derive insights that yield a promising new EDA approach.
3 Connection between EDAs, EM and gradient descent
First we review the Free Energy viewpoint of EM. On that basis, we then describe the connection between EM and EDAs before concluding the section with an exposition of how EM connects to gradientbased optimization.
3.1 ExpectationMaximization (EM)
EM is a method for performing maximum likelihood parameter estimation in latent variable models that exploits the structure of the joint likelihood between latent and observed variables. Intuitively, each Estep imputes the latent variables, and the subsequent Mstep then uses these “filled in" data to do standard maximum likelihood, typically in closed form. EM iterates between these E and M steps until convergence. We use the Free Energy interpretation of EM and its accompanying generalizations Neal2012 () in order to most easily connect EDAs to EM.
Let and be respectively observed and latent variables. The task of maximum likelihood is to find where , for some model density parameterized by . In Neal2012 (), the authors define a function known as the free energy, given by:
(2) 
where is any density over the latent variables and is the entropy of . The free energy lower bounds the loglikelihood, , and this bound is satiated only when is equal to the posterior, . If the posterior cannot be obtained exactly (e. g., analytically), one may approximate it to get a variational EM algorithm Beal1998 (). The variational posterior can be either a parameterized variational family, , or an implicit one based on a set of samples as in Monte Carlo EM Caffo2005 (); Fort2003 (). The samplebased posterior takes the form, with drawn from . In variational EM, typically the bound cannot be satiated, leading to a variational gap given by , where denotes the KL divergence—equal to zero if and only if is equal to the posterior .
EM and variational EM can both be viewed as alternating coordinate descent on the free energy function. The alternating updates at iteration are given by:

Estep: . It can be shown that this is equivalent to minimizing the variational gap by solving . This is where EM intuitively imputes the “missing" data, .

Mstep: , which is tantamount to doing weighted maximum likelihood of with weights .
When the variational gap can be driven to zero, as in exact EM, this procedure is guaranteed to never decrease the likelihood. Moreover, its convergence properties, to both local and global minima, have been carefully studied (e. g., Dempster77 (); Neal2012 (); Balakrishnan2017 ()). Rigorous generalizations of variational and exact EM to partial E and Msteps also emerge naturally from this viewpoint Neal2012 ().
3.2 Estimation of Distribution Algorithms as variational ExpectationMaximization
As described in the introduction, EDAs seek to solve the MBO objective,
(3) 
where is the function to be optimized, is what we refer to as the search model, parameterized by , and we define —which can be thought of as an EDA ‘log marginal likelihood’.
Note that it is common in EDAs to monotonically transform with a shaping function, , which may be, for example, a cumulative density function (CDF) Brookes2019 (), an exponential Peters2007 (), or a quantilebased transformation Ollivier2017 (); Rubinstein1999 (). Although this transformation does not change the optima, it may alter the optimization dynamics. Often this shaping function is changed at each iteration in a sampledependent, adaptive manner (which links these methods to annealed versions of EM and VI Hofmann2001 (); Mandt16 ()). In such a setting, the connection that we will show between EDA and EM holds within each full iteration, and the insights derived from this connection have bearing on the entire algorithm. For notational simplicity, we drop the and assume that has already been transformed.
To link (3) to EM, we introduce a density, , which allows us to derive a lower bound on using Jensen’s inequality:
(4)  
(5)  
(6)  
(7) 
where is the same free energy function appearing in the previous section on EM, except that the complete likelihood is replaced with the term . When is normalizable, then it can be shown that , where we define the ‘tilted density’
(8) 
which is the EDA counterpart to the exact posterior in EM. We can now construct a coordinate ascent algorithm on the free energy defined in (7) that mirrors the EM algorithm. In particular, this algorithm iterates between Esteps that solve , and Msteps that solve . To make the precise connection between practically implemented EDA and EM, we introduce a particular approximate posterior for the Estep that is given by a mixture of weighted particles:
(9) 
where are samples drawn from , as in EDAs. Using this posterior approximation, the Mstep amounts to solving the objective:
(10)  
(11)  
(12)  
(13) 
which is exactly the final step in one iteration of EDA (1). Therefore, we can view EDA as an EM algorithm that uses the particlebased posterior approximation given by (9). In the limit of infinite particles we trivially obtain by the Law of Large Numbers. In this limit, the approximate posterior exactly matches the tilted distribution—the EDA “exact posterior"—and our algorithm inherits the same guaranteed as exact EM, such as guaranteed improvement of the objective function at each iteration, as well as local and global convergence properties Neal2012 (); Balakrishnan2017 (); Salakhutdinov2008 ().
In the finite sample case of EDAs we are not performing exact EM. Instead, we are performing a “weak" form of variational EM—we say weak because typically in a variational setting, one would minimize the KLdivergence between the approximate posterior and the true posterior within the variational class, which is not done in EDAs. A “stronger" variational EDA EM would entail finding the weighted particles in the approximate posterior that minimizes the variational gap—a generally intractable problem, but one that we explore in Section 5.
From this exposition it’s also interesting to note that EDA can be seen as performing Importance Sampling Mackay2003 () with proposal distribution, , (sampled in the Estep), and then using importance weights in the posterioraveraged “complete log likelihood", , in the Mstep, to correct for the fact that these samples were not taken form the exact posterior.
3.3 Connection to gradientbased methods
Instead of using EDAs, one could alternatively solve the MBO objective directly with gradient descent (or natural gradient descent) where updates are given by , and is a step size parameter. This is known as REINFORCE Williams92 () in the RL community and is related to IGOML Ollivier2017 (). Typically, one cannot compute the gradient term exactly, in which case the “log derivative trick", (which, notably, does not require the gradient of Recht2018 ()) is often is often combined with Monte Carlo estimation to arrive at an update:
(14) 
where are samples drawn from .
We can connect the gradient based optimization in (14) to the EDA “EM" optimization presented in Section 3.2. In particular, consider the partial Mstep version of EM, where instead of fully solving for the new parameter, one instead performs one gradient step Neal2012 (); Balakrishnan2017 () —so called “firstorder" EM. Instantiating firstorder EM into EDA “EM" by partially solving the objective in (13) with a single gradient step (with step size ) will result in the same update as (14). In other words, performing “firstorder" EM in the EDA context is identical to performing gradient descent on the original MBO objective. In our experiments, we refer to “firstorder" EM for EDAs as SGD. It’s interesting to note that as one performs more gradient steps within an Mstep, one traces out a continuum of methods between gradient descent on the original objective, and EM.
In the next section we describe how we leverage the viewpoint of EDAs as EM to propose a new algorithm with promising results.
4 An adaptive, hybrid EDAgradient algorithm
It is wellknown that EM can exhibit superlinear convergence behaviour near a local optimum and in plateau regions, but can exhibit extremely slow convergence rates otherwise—in regions where gradientbased methods can do better Xu1996 (); Salakhutdinov2008 (). In order to get the best of both worlds Salakhutdinov et al. leverage a precise analysis of EM, including how to diagnose in an online, datadriven manner when EM is likely to be in a poor convergence phase. The technically correct diagnostic to achieve this requires difficult computations of the derivatives of the EM “mapping matrix" (the implicit matrix that maps ); instead, they propose to use the entropy of the posterior as a proxy to this quantity, using the intuition that both quantities inform on how much "information" EM has about the latent variables at any given point in the algorithm. This intuition follows from a technical condition that holds when, for example, the expected complete log likelihood of the Mstep has a single optimum, such as for (mixtures of) Gaussians, (mixtures of) Factor Analysis, HMMs, and probabilistic (mixtures of) Principle Components Analysis Salakhutdinov2008 (). For the gradientbased method, they use conjugate gradient descent (CGD) with a line search; they find that the hybrid method never does worse, and often performs better than both standard EM and SGD.
On this basis, we explore an analogous adaptive, hybrid EDAgradient approach, although we use the SGD updates described in (14) without a line search, instead using AdaGrad Duchi2011 () to control the learning rate; we do so in the hopes of better understanding the source of differences in performance—differences that do not depend on a line search. Our hybrid, denoted Hybrid, therefore switches between (i) standard EDA (13) which we denote EDA, and (ii) a gradientbased approach (14) which we denote SGD (we augment this by using natural gradients). As in Salakhutdinov2008 (), we switch from SGD to EDA when the estimated posterior entropy is above a specified cutoff, , and vice versa.
4.1 Experiments
We compared our Hybrid method to the pure EDA and SGD algorithms, in the minimization of 2 and 20 dimensional Rastrigin and Ackley functions—canonical multimodal, nonconvex functions commonly used to benchmark optimization approaches Hansen2001 ().
We used a Gaussian search model with diagonal covariance matrix and particles at each iteration. Each algorithm had a fixed budget of 50,000 function evaluations (which amounts to a maximum of 5,000 EM iterations). Crucially, as in Wierstra2014 (), each algorithms was allowed to randomly restart after convergence until such time as the function budget was used up because this ties in to practical use cases. Convergence was determined to be the point at which , where is the covariance matrix of the search model and is the dimensionality of the search space. Restarts were randomly initialized with particles drawn from a Gaussian with unit variance and mean randomly positioned on the hypersphere of radius centered at the global minimum (the origin)—the larger , the further the initialization is from the global optimum. The radius, , was set to 20 and 30 for the Rastrigin and Ackley functions, respectively, to make the optimizations sufficiently difficult such that none of the methods tended to reach the global minimum—this , allowed us to better distinguish the relative performance of methods, whose ordering was not sensitive this setting.
As in Wierstra2014 (); Ollivier2017 (); Rubinstein1999 (); Hansen2001 (), we used an adaptive shaping function, to monotonically transform the function values . In particular, we use a sigmoid weighting function given by , where and were estimated from samples at each iteration. This sigmoid shaping function is much like the CDF used in Brookes2019 (), but is differentiable which was useful for Section 5.
We approximated the entropy of our approximate posterior, a collection of weighted particles, by using the entropy of the search model itself, which can be shown to be . We found that this quantity tends to lie in the range during the course of the optimizations, and therefore chose as the entropy cutoff, , for switching in Hybrid . Even in the case where it is not a priori obvious how to set the , this hyperparameter can readily be set for new problems based on empirics from a few sample runs.
As shown in Figures 1 and 2, EDA converges very quickly, but to poor local minima. In constrast, SGD converges very slowly, but to better local minima. Hybrid effectively balances these two extremes by converging to similar local optima as SGD, but much more quickly. This allows Hybrid to more effectively use its iteration budget by restarting more often, and consequently finding lower values of (Figure 1d, 2). For this reason, on a practical use case with limited budget, one would expect Hybrid on average to obtain better optima then either SGD or EDA.
As an aside, we note that EM implicitly sets its own stepsize. However, in SalRoweis03 (), using an “overrelaxed bound optimization" technique, an adaptive step size is used instead of the implicit one. The authors demonstrate that this provides faster convergence. We tried using such an approach for EDAs, and indeed, found it converged more quickly, but to worse local optima, and did not pursue this avenue further.
In summary: we leveraged our insights about EDA viewed as EM to propose an adaptive, hybrid EDASGD method that demonstrates better performance than either of its component methods on several canonincal test functions in both low and high dimensions.
5 Decreasing the EDA variational gap is not helpful
As shown in Section 3.2, EDA is doing a “weak" form of variational EM—“weak" in the sense that it is not actually optimizing the variational family. This suggests a possible improvement to EDA where the weighted particles used in the approximate posterior (9) are adjusted to reduce the variational gap. We explored this possibility by using Stein Variational Gradient Descent (SVGD) Liu2016 () to iteratively propagate particles such that collectively they better approximate the specified target distribution—in this case, the EDA “exact posterior" (8). Although rigorously derived, intuitively, SVGD pushes the particle in the direction of the gradient of with respect to , while maintaining a repulsive force between the particles to ensure diversity. SVGD depends on the choice of a kernel, for which we used a Radial Basis Function (RBF) kernel with lengthscale parameter chosen as suggested in Liu2016 (). Additionally, to keep the method blackbox, we used a finite difference for the gradient of (required for (8)) after verifying that it recapitulated the analytical gradient, although there are entirely gradientfree SVGD approaches emerging Han2018 ().
In order to test the potential utility of reducing the variational gap in EDA, we used an increasing number of SVGD updates in the Estep of EDA during minimizations of the Rastrigin function. We use the same experimental setup described in the previous section but without a function evaluation budget, such that for each setting, , of the number of SVGD updates, we ran exactly 100 optimizations to convergence. Removing the evaluation budget allows us to clearly determine if decreasing the variational gap is helpful—if it is, then one should compare it to other approaches with a function budget experimental setup. Here we used to best expose differences between the varying number of SVGD updates, but our conclusions are not sensitive to this setting. As shown in Figure 3, as we decrease the variational gap by using an increasing number of SVGD updates, performance tends to degrade.
Although it has conventionally been deemed beneficial to reduce the variational gap—indeed, this is precisely what is done in variational inference—results are emerging that suggest it is not always advantageous to do so. For example, in Rainforth18 (), reducing the variational gap in VAEs is shown to be potentially harmful. Similarly, our results show that using SVGD in the Estep of EDAs yields worse performance. However, we speculate that the reason for this may be quite different than in Rainforth18 (). We speculate that SVGD within EDA may degrade performance because it more quickly converges to “closest" local optimum, rather than more thoroughly exploring the landscape. In other words, in MBO, the existence of a variational gap may induce better exploration, similarly to how stochasticity in SGD may be doing so, although this latter point itself remains in debate Neelakantan2016 ().
In summary: We showed that decreasing the EDA ‘variational gap’, by way of Stein Variational gradient descent, in the EDA ‘Estep’, tends to degrade the performance of EDAs.
6 Relationship of EDAs to natural gradient descent by way of EM
Akimoto et al. showed that CMAES, a particular EDA, can be viewed as approximate natural gradient descent (NGD) Akimoto2010 (). This result also emerges from the IGO framework of Ollivier et al.Ollivier2017 () who additionally show that CEM and several other EDAs can be viewed as NGD. Malago et al. show how EDAs using MRF for the search model can be interpreted as performing NGD Malago2013 ().
One might wonder whether using the relationship between EDA and EM might provide a different perspective on these connections. Indeed, it turns out that Chrétien and Hero showed that EM can be formulated as an optimization method known as the proximal point method (PPM)—in particular they showed that EM is using the PPM with the reverse KL divergence to maximize the loglikelihood function chretien98 (); chretien2000kullback (). Precisely, they show that each iteration of EM algorithm,
is equivalent to
the latter being a formulation of the PPM method (see also the Supplementary Information). The derivation of this equivalence requires use of the exact posterior and therefore does not technically hold for EDA which is more akin to variational EM. Nevertheless, the connection still exposes an interesting new perspective on EDAs in the asymptotic regime, as we shall see.
As noted in beck2003mirror (), mirror descent is a linearization of the PPM. In the case of relevance here—a PPM with a KLdivergence—the mirror descent update is
If one uses a secondorder Taylor series approximation to the KL divergence, then the resulting approximate mirror descent update is precisely NGD. In particular, the second order approximation is given by , where is the Fisher information metricmartens2014new (). When the KL divergence in mirror descent is replaced with its approximation, we obtain
which can be shown to be equivalent to (see Supplementary Information) the standard NGD update rule
In summary: EM is a PPM with KL divergence; linearization of the EM PPM is mirror descent; a second order approximation of the KL divergence in mirror descent yields NGD. Thus in the infinite sample limit of EDAs, they can be seen as approximate NGD.
A result related to that of EM performing approximate NGD has previously been shown. Specifically, Sato, and later others, showed that for exponential family models with hidden variables and a conjugate prior (to the joint observed and hidden model), that classical (nonamortized) meanfield variational inference is performing NGD Sato2001 (); Hensman2012 (); Hoffman2013 (). In contrast, the result we describe is for any exact EM, and is only approximate NGD.
On a separate note, Salimbeni et al. demonstrate empirically that the best step size in NGD (according to a Brent line search to find the best step size at each iteration) increases over iterations to 1.0, for Gaussian approximations to several models Salimbeni2018 (). This result nicely aligns with the fact shown herein that that EM implicitly sets its own step size, and that EM, viewed as approximate NGD, has a step size of precisely 1.0.
7 Discussion
We have shown an an explicit connection between Estimation of Distribution Algorithms and ExpectationMaximization. From this mapping we leveraged insights that yielded a promising new approach, a hybrid EDAgradient method, which may have implications in numerous fields, such as RL where one or the other approach are used, but to the best of our knowledge, never both. For example, REINFORCE type algorithms are gradientbased Williams92 (), while Reward Reweighted Regression type algorithms are EDAbased Peters2007 (). Furthermore, we anticipate that the connection between EDAs and EM may spur further improvements.
In light of EDAs as EM, one can ask whether this viewpoint sheds any light on the choice of EDA search models. Suppose for a moment that rather than a samplebased approximate posterior, one instead used a parametric posterior, as in standard VI. In such a case, when the search model were in some sense “conjugate" to , then the posterior could take on an analytical form. When might such an approach make sense? Suppose one were using EDAs to perform protein design Brookes2019 (), and that represented protein stability, and that it could be approximated by using an exponential model of linear additive marginal and pairwise terms. Then if one used a Potts model Marks2011 () as the search model, the exact posterior could also be written in this form, and a “parametric" EDA could be pursued.
EDAs are often improved by smoothing, such as exponentially decaying averaging of previous updates, or modelspecific schemes such as in CMAES Hansen2001 (). We did not employ such smoothing here, but expect that doing so would not alter the message of our paper.
Given the tight connection between EDAs with adaptive shape functions, and annealed versions of EM and VI Hofmann2001 (); Mandt16 (), we believe that these latter approaches could benefit from the simple and robust implicit annealing schemes found in the EDA literature, which arise not from an annealing schedule, but from simple quantile transformations and the like.
Acknowledgments
We thank Ben Recht for helpful discussion, Dilin Wang and Qiang Liu for help with SVGD, and also David Duvenaud, Roger Grosse and James Hensman for pointers to variational Bayes as NGD.
References
 [1] Jan Peters and Stefan Schaal. Reinforcement learning by rewardweighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 745–750, New York, NY, USA, 2007. ACM.
 [2] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In Proceedings, Intl. Conf. on Machine Learning (ICML), 2019.
 [3] Endika Bengoetxea, Pedro Larrañaga, Isabelle Bloch, and Aymeric Perchant. Estimation of distribution algorithms: A new evolutionary computation approach for graph matching problems. In Mário Figueiredo, Josiane Zerubia, and Anil K. Jain, editors, Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 454–469, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
 [4] James T. Wilson, Frank Hutter, and Marc Peter Deisenroth. Maximizing acquisition functions for Bayesian optimization. Technical report, 2018.
 [5] Mark Zlochin, Mauro Birattari, Nicolas Meuleau, and Marco Dorigo. Modelbased search for combinatorial optimization: A critical survey. Annals of Operations Research, 131(14):373–395, 2004.
 [6] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
 [7] Shumeet Baluja and Rich Caruana. Removing the genetics from the standard genetic algorithm. In Armand Prieditis and Stuart Russell, editors, Machine Learning Proceedings 1995, pages 38 – 46. Morgan Kaufmann, San Francisco (CA), 1995.
 [8] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized selfadaptation in evolution strategies. Evol. Comput., 9(2):159–195, June 2001.
 [9] Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, pages 1547–1554, USA, 2012. Omnipress.
 [10] Reuven Rubinstein and William Davidson. The CrossEntropy Method for Combinatorial and Continuous Optimization. Methodol. Comput. Appl. Probab., 1:127–190, 1999.
 [11] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 3381–3387, June 2008.
 [12] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15:949–980, 2014.
 [13] Sun Yi, Daan Wierstra, Tom Schaul, and Jürgen Schmidhuber. Stochastic search using the natural gradient. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1161–1168, New York, NY, USA, 2009. ACM.
 [14] J. Ocenasek and J. Schwarz. Estimation of distribution algorithm for mixed continuousdiscrete optimization problems. In Proceedings of the 29th International Coference on International Conference on Machine Learning, Slovakia, 2012. IOS Press.
 [15] Jiaqiao Hu and Ping Hu. Annealing adaptive search, crossentropy, and stochastic approximation in global optimization. Naval Research Logistics (NRL), 58(5):457–477, aug 2011.
 [16] V. A. Shim, K. C. Tan, J. Y. Chia, and A. Al Mamun. Multiobjective optimization with estimation of distribution algorithm in a noisy environment. Evolutionary Computation, 21(1):149–177, March 2013.
 [17] Siddhartha Shakya and John McCall. Optimization by estimation of distribution with deum framework based on markov random fields. International Journal of Automation and Computing, 4(3):262–272, Jul 2007.
 [18] Stefan Kern, Sibylle D. Müller, Nikolaus Hansen, Dirk Büche, Jiri Ocenasek, and Petros Koumoutsakos. Learning probability distributions in continuous evolutionary algorithms – a comparative review. Natural Computing, 3(1):77–112, Mar 2004.
 [19] Yann Ollivier, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. InformationGeometric Optimization Algorithms: A Unifying Picture via Invariance Principles. Technical report, 2017.
 [20] Matthew J Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, 1998.
 [21] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
 [22] J Staines and D Barber. Optimization by variational bounding. In European Symposium on ANNs, 2013.
 [23] Peter Dayan and Geoffrey E. Hinton. Using expectationmaximization for reinforcement learning. Neural Comput., 9(2):271–278, February 1997.
 [24] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Fitness expectation maximization. In PPSN 2008, pages 337–346, Berlin, Germany, September 2008. MaxPlanckGesellschaft, Springer.
 [25] Thomas Furmston, Guy Lever, and David Barber. Approximate newton methods for policy search in markov decision processes. Journal of Machine Learning Research, 17:227:1–227:51, 2016.
 [26] Jens Kober and Jan R. Peters. Policy search for motor primitives in robotics. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 849–856. Curran Associates, Inc., 2009.
 [27] Enrico Guiraud, Jakob Drefs, and Jörg Lücke. Evolutionary expectation maximization. In Proc. Genet. Evol. Comput. Conf.  GECCO ’18, pages 442–449, New York, New York, USA, 2018. ACM Press.
 [28] Radford M. Neal and Geoffrey E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants. pages 355–368, 2012.
 [29] Brian S Caffo, Wolfgang Jank, and Galin L. Jones. Ascentbased monte carlo expectationmaximization. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 67(2):235–251, 2005.
 [30] Gersende Fort and Eric Moulines. Convergence of the monte carlo expectation maximization for curved exponential families. Ann. Statist., 31(4):1220–1259, 08 2003.
 [31] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to samplebased analysis. Ann. Statist., 45(1):77–120, 02 2017.
 [32] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, Jan 2001.
 [33] Stephan Mandt, James McInerney, Farhan Abrol, Rajesh Ranganath, and David M. Blei. Variational tempering. In AISTATS, volume 51, pages 704–712. JMLR.org, 2016.
 [34] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and ExpectationConjugateGradient. In Proceedings, Intl. Conf. on Machine Learning (ICML), pages 672–679, 2003.
 [35] David J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
 [36] Ronald J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Mach. Learn., 8(34):229–256, May 1992.
 [37] Benjamin Recht. A tour of reinforcement learning: The view from continuous control. arXiv, abs/1806.09460, 2018.
 [38] Lei Xu and Michael I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput., 8(1):129–151, January 1996.
 [39] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
 [40] Ruslan Salakhutdinov and Sam Roweis. Adaptive overrelaxed bound optimization methods. In Proceedings of the International Conference on Machine Learning, volume 20, pages 664–671, 2003.
 [41] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Neural Information Processing Systems, pages 2370–2378, 2016.
 [42] Jun Han and Qiang Liu. Stein variational gradient descent without gradient. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1900–1908, 2018.
 [43] Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, pages 4274–4282, 2018.
 [44] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. CoRR, abs/1511.06807, 2015.
 [45] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Bidirectional relation between cma evolution strategies and natural evolution strategies. In Robert Schaefer, Carlos Cotta, Joanna Kołodziej, and Günter Rudolph, editors, Parallel Problem Solving from Nature, PPSN XI, pages 154–163, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
 [46] L. Malagò, M. Matteucci, and G. Pistone. Natural gradient, fitness modelling and model selection: A unifying perspective. In IEEE Congress on Evolutionary Computation, pages 486–493, June 2013.
 [47] Stephane Chrétien and Alfred O Hero III. Acceleration of the em algorithm via proximal point iterations. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), 1998.
 [48] Stéphane Chrétien and Alfred O. III Hero. Kullback proximal algorithms for maximumlikelihood estimation. IEEE transactions on information theory, 46(5):1800–1810, 2000.
 [49] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
 [50] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 [51] MasaAki Sato. Online model selection based on the variational bayes. Neural Comput., 13(7):1649–1681, July 2001.
 [52] James Hensman, Magnus Rattray, and Neil D. Lawrence. Fast variational inference in the conjugate exponential family. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25, Cambridge, MA, 00 2012.
 [53] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013.
 [54] Hugh Salimbeni, Stefanos Eleftheriadis, and James Hensman. Natural gradients in practice: Nonconjugate variational inference in gaussian process models. In Uncertainty In Artificial Intelligence, 03 2018.
 [55] Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina, and Chris Sander. Protein 3d structure computed from evolutionary sequence variation. PLOS One, 6(12):1–20, 12 2011.
Supplementary Information: A view of Estimation of Distribution Algorithms through the lens of ExpectationMaximization
Appendix S1 Additional hybrid EDAgradient experiments
Appendix S2 Equivalence between ExpectationMaximization and the proximal point method
In Section 6, we discuss the equivalence between EM and using the proximal point method (PPM) with reverse KL divergence to maximize the loglikelihood function [47, 48]. Here, we provide the derivation of this equivalence, beginning with a formulation of the PPM:
Note that the last line is precisely the update rule for EM.
Appendix S3 Details of natural gradient descent
Recall from Section 6 that replacing the loglikelihood with its linearization and the KLdivergence with its secondorder Taylor series approximation in the PPM formulation of EM yields the update
One can further derive a closedform update by taking the derivative of the argument on the righthand side, setting it equal to zero as per firstorder optimality conditions, and solving:
This establishes the equivalence between the standard natural gradient descent update rule and the mirror descent update derived using the PPM formulation of EM as a starting point.