# A view of Estimation of Distribution Algorithms through the lens of Expectation-Maximization

## Abstract.

We show that a large class of Estimation of Distribution Algorithms (EDAs), including, but not limited to, Covariance Matrix Adaptation, can be written exactly as a Monte Carlo Expectation-Maximization (EM) algorithm, and as exact EM in the limit of infinite samples. Because EM sits on a rigorous statistical foundation and has been thoroughly analyzed, this connection provides a new coherent framework with which to reason about EDAs—one complementary to that of Information-Geometric Optimization (IGO). To illustrate the potential benefits of such a connection, we leverage it to (i) formally show that this class of EDAs can be seen as approximating natural gradient descent, and (ii) apply a rigorously-derived adaptive EM-based algorithm to EDAs, demonstrating potentially advantageous directions for adaptive hybrid approaches.


## 1. Introduction

Estimation of Distribution Algorithms (EDAs) are a widely used class of algorithms designed to solve optimization problems of the form $\max_{z \in \mathcal{Z}} f(z)$, where $f$ is a function over a space of discrete or continuous inputs, $\mathcal{Z}$. Instead of solving this objective directly, EDAs solve the related objective

$$\theta^* = \arg\max_\theta \mathbb{E}_{p(z|\theta)}[f(z)], \tag{1}$$

where $p(z|\theta)$ is a probability density over $\mathcal{Z}$, parameterized by parameters $\theta$. When $p(z|\theta)$ has the capacity to represent a point mass centered on the maximizer of $f$, then these two formulations have the same optimal values. Reasons for using the latter formulation (1) include convenience for derivative-free optimization (Hansen and Ostermeier, 2001), the desire to obtain a set of good candidates rather than a single candidate (Brookes et al., 2019), enabling of rigorous analysis of associated optimization algorithms (Zlochin et al., 2004), and the leveraging of a probabilistic formulation to incorporate auxiliary information (Brookes et al., 2019). In many cases this objective is further modified by applying a monotonic shaping function, $W$, to $f$ (Wierstra et al., 2008, 2014; Yi et al., 2009), which alters the convergence properties of algorithms used to solve it, but does not change the optima.

The general template for EDA algorithms is as follows. Beginning with an initial parameter setting, $\theta^{(0)}$, of the parametrized density, $p(z|\theta)$, each iteration of an EDA algorithm generally consists of three steps:

1. Draw samples, $\{z_i\}_{i=1}^N$, from $p(z|\theta^{(t)})$.

2. Evaluate $f(z_i)$ for each sample $z_i$.

3. Find a $\theta^{(t+1)}$ that uses the samples and corresponding function evaluations to move $p(z|\theta^{(t+1)})$ towards regions of $\mathcal{Z}$ that have large function values.

The last step is generally accomplished by fully or partially solving the objective

$$\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^N W(f(z_i)) \log p(z_i|\theta), \tag{2}$$

which can be seen as a weighted maximum likelihood problem with weights $W(f(z_i))$. We refer to this set of steps as a “core” EDA algorithm because it ties together all EDAs.
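As a concrete sketch of this template, the loop below implements the three steps with hypothetical `sample` and `update` hooks supplied by the chosen search model; the 1-D Gaussian instantiation is our own illustrative example, not taken from any particular EDA:

```python
import numpy as np

def eda_core(f, theta0, sample, update, n_samples=100, n_iters=50):
    """Core EDA loop: sample, evaluate, re-fit the search model."""
    theta = theta0
    for _ in range(n_iters):
        z = sample(theta, n_samples)         # step 1: z_i ~ p(z | theta)
        w = np.array([f(zi) for zi in z])    # step 2: evaluate f(z_i)
        theta = update(z, w)                 # step 3: weighted max-likelihood fit
    return theta

# Hypothetical instantiation: 1-D Gaussian search model, theta = (mu, sigma).
rng = np.random.default_rng(0)
sample = lambda th, n: rng.normal(th[0], th[1], size=n)

def update(z, w):
    w = w / w.sum()                          # normalized weights W(f(z_i))
    mu = np.sum(w * z)                       # weighted mean
    sigma = np.sqrt(np.sum(w * (z - mu) ** 2)) + 1e-8
    return (mu, sigma)

# Maximize E[f(z)] for a function peaked at z = 3.
theta = eda_core(lambda z: np.exp(-(z - 3.0) ** 2), (0.0, 5.0), sample, update)
```

With these choices the search model concentrates near the maximizer of $f$ within a few iterations; this is essentially a cross-entropy-method-style instantiation of the template.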

In the case of Covariance Matrix Adaptation (CMA-ES), this core algorithm is often modified in a variety of ways to improve performance. For example, samples from previous iterations may be used, directly or indirectly, in the last step, resulting in smoothing of the parameter estimates across iterations. Adaptive setting of parameter-specific step sizes, and “path evolution” heuristics (Hansen and Ostermeier, 2001; Ollivier et al., 2017; Brookes et al., 2019) are also common. These layers on top of the core EDA have generally been derived in a manner specific only to CMA-ES, and are not readily generalizable to other EDAs. For this reason, we restrict ourselves to the core EDA algorithm just described, although our formalism also allows for a large class of EDA parameter smoothing variations which encompass many, but not all, variations of CMA-ES.

EDAs are also distinguished by the choice of parametrized density, $p(z|\theta)$, which we refer to here as the “search model”. Many EDAs use exponential family models for the search model, most commonly the multivariate Gaussian as in CMA-ES. However, Bayesian networks (Ocenasek and Schwarz, 2012), Boltzmann machines (Hu and Hu, 2011; Shim et al., 2013), Markov Random Fields (Shakya and McCall, 2007), Variational Autoencoders (Brookes et al., 2019), and others (Kern et al., 2004) have also been used. Unless otherwise specified, our analysis in the following is quite general and makes no assumptions on the choice of search model.
**Our contributions.** We show, for the first time, that a large class of EDAs—including, but not limited to, variants of CMA-ES—can be written precisely as Monte Carlo (MC) Expectation-Maximization (EM) (Beal, 1998), and in the limit of infinite samples, as exact EM (Dempster et al., 1977). Because EM sits on a rigorous statistical foundation and has been thoroughly analyzed, this connection provides a new framework with which to reason about EDAs. Within this framework, we also show how many parameter-smoothed variants of CMA-ES can be written as Monte Carlo maximum a posteriori-EM (MAP-EM), which suggests a possible avenue to develop similar smoothing algorithms for EDAs with other search models. After presenting the connection between EDAs and EM, we then demonstrate its potential value with two illustrative examples. First, we provide a new perspective on how EDAs can be seen to be performing approximate natural gradient descent. Second, we leverage a rigorously-derived adaptive hybrid EM-gradient descent approach and apply it to EDAs. We expect others will similarly build on the EM-EDA connection toward a more rigorous understanding and development of EDAs.

### 1.1. Related work

The Information-Geometric Optimization (IGO) framework (Ollivier et al., 2017) unifies a large class of approaches—including EDAs such as CEM (Rubinstein and Davidson, 1999) and CMA-ES (Hansen and Ostermeier, 2001)—for solving the objective in (1) by discretizing a natural gradient flow. Instantiations of IGO are most readily seen as tantamount to using the score function estimator (sometimes referred to as the “log derivative trick”) on (1), combined with natural gradient, as in Natural Evolution Strategies (Wierstra et al., 2008, 2014). The IGO framework does not connect to EM, directly or otherwise; therefore, IGO provides a complementary viewpoint to that presented herein.

Akimoto et al. (2012) show how CMA-ES with rank-$\mu$ updates, but with the global step size and evolution paths removed, can be viewed as a natural evolution strategy. In this context, they briefly remark on how this simplified CMA-ES can be viewed as performing generalized EM with a partial M-step using a natural gradient. The technical underpinnings of this result are restricted specifically to CMA-ES because of the dependence on a Gaussian search model, and do not readily generalize to the broader EDA setting considered herein.

Staines et al. introduce the notion of variational optimization by explicitly considering the fact that the optimum of the objective function in (1) is a lower bound on the optimum of $f$. They clearly delineate conditions under which the bound can be satiated, and under which the objective function is convex and has derivatives that exist (Staines and Barber, 2013); no connections to EDAs are made.

## 2. Background: Free Energy view of EM

Before deriving the connection between EM and EDAs, we first provide some needed background on EM. EM is a well-known method for performing maximum likelihood parameter estimation in latent variable models that exploits the structure of the joint likelihood between latent and observed variables (Dempster et al., 1977). Intuitively, each E-step imputes the latent variables, and the subsequent M-step then uses these “filled in” data to do standard maximum likelihood estimation, typically in closed form. EM iterates between these E- and M- steps until convergence. We use the Free Energy interpretation of EM and its accompanying generalizations (Neal and Hinton, 1999) in order to reveal the connection between EDAs and EM in a particularly insightful way.

Let $x$ and $z$ be observed and latent variables, respectively. The task of maximum likelihood estimation is to find $\phi^* = \arg\max_\phi \mathcal{L}(\phi)$, where

$$\mathcal{L}(\phi) = \log p(x|\phi) = \log \int p(x,z|\phi)\,dz,$$

for some model density $p(x,z|\phi)$ parameterized by $\phi$. In Neal and Hinton (1999), the authors define a function known as the free energy, given by

$$F(q,\phi) = \mathbb{E}_{q(z)}[\log p(x,z|\phi)] + H[q], \tag{3}$$

where $q(z)$ is any probability density over the latent variables, $H[q]$ is the entropy of $q$, and the term preceding the entropy is known as the expected complete log-likelihood. The free energy lower bounds the log-likelihood, $F(q,\phi) \le \mathcal{L}(\phi)$, and this bound is satiated only when $q(z)$ is equal to the true posterior, $p(z|x,\phi)$. If the true posterior cannot be calculated exactly, one may approximate it in one of several ways, two of which are described next.

In the first, the posterior is approximated by restricting it to a parameterized family of distributions, where both the parameters of the likelihood and those of the variational family must now be estimated. This is known as variational EM. Unless the true posterior lies in the parameterized class, which is not typically the case, the bound cannot be satiated, leading to a variational gap given by $\mathcal{L}(\phi) - F(q,\phi) = D_{KL}(q(z)\,\|\,p(z|x,\phi))$, where $D_{KL}$ denotes the KL divergence.

In another approximation, one draws samples from the true posterior, $p(z|x,\phi)$, to estimate the expected complete log likelihood. This is known as Monte Carlo EM (MC-EM). This also induces a “gap” between the true and approximate posterior, which has the same form as the variational gap, and hence we will refer to it as such.

The major distinction between variational and MC-EM is that variational EM explicitly attempts to minimize the variational gap at each iteration through optimization, while MC-EM only implicitly closes the gap as the number of samples goes to infinity.

EM, MC-EM, and variational EM can all be viewed as alternating coordinate ascent on the free energy function $F(q,\phi)$ (Neal and Hinton, 1999). The alternating updates at iteration $t$ are given by:

• E-step: $q^{(t+1)} = \arg\max_q F(q, \phi^{(t)})$; that is, compute or estimate the posterior over the hidden variables. It can be shown that this is equivalent to minimizing the variational gap by solving $q^{(t+1)} = \arg\min_q D_{KL}(q(z)\,\|\,p(z|x,\phi^{(t)}))$.

• M-step: $\phi^{(t+1)} = \arg\max_\phi F(q^{(t+1)}, \phi)$, which is equivalent to maximizing the expected complete log likelihood, $\mathbb{E}_{q^{(t+1)}(z)}[\log p(x,z|\phi)]$, with respect to $\phi$.

When the variational gap can be driven to zero, as in exact EM, this procedure is guaranteed to never decrease the likelihood. Moreover, its convergence properties to both local and global minima have been carefully studied (e. g., Dempster et al. (1977); Neal and Hinton (1999); Balakrishnan et al. (2017)). Rigorous generalizations of EM to partial E- and M-steps also emerge naturally from this viewpoint (Neal and Hinton, 1999).
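To make the alternation concrete, the following is a minimal exact-EM example in a standard textbook setting not drawn from the text—a two-component mixture of unit-variance Gaussians with equal weights, where the E-step computes posterior responsibilities and the M-step re-estimates the means. The tracked log-likelihood should never decrease:

```python
import numpy as np

def em_mixture(x, mu, n_iters=25):
    """Exact EM for a 2-component mixture of unit-variance Gaussians
    with equal mixing weights; only the means are estimated."""
    loglik = []
    for _ in range(n_iters):
        # E-step: responsibilities q(z_i = k), the exact posterior here.
        d = -0.5 * (x[:, None] - mu[None, :]) ** 2   # log N(x; mu_k, 1) + const
        r = np.exp(d - d.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)
        # Track the marginal log-likelihood at the current parameters.
        p = 0.5 * np.exp(d) / np.sqrt(2 * np.pi)
        loglik.append(np.log(p.sum(1)).sum())
        # M-step: maximize the expected complete log-likelihood (weighted means).
        mu = (r * x[:, None]).sum(0) / r.sum(0)
    return mu, loglik

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
mu, loglik = em_mixture(x, mu=np.array([-0.5, 0.5]))
```

EM's guarantee of monotone improvement can be checked directly on `loglik`; the recovered means approach the true component means of $-2$ and $2$.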

One can also use the free energy viewpoint of EM to optimize a maximum a posteriori (MAP) objective given by $\mathcal{L}_{MAP}(\phi) = \log p(x|\phi) + \log p_0(\phi)$, where $p_0(\phi)$ is some prior distribution over parameters. This yields a corresponding free energy,

$$F_{MAP}(q,\phi) \equiv \mathbb{E}_{q(z)}\left[\log p(x,z|\phi) + \log p_0(\phi)\right] + H[q], \tag{4}$$

upon which one can perform the same coordinate ascent algorithm just described. This formulation is referred to as MAP-EM. We will draw connections between MAP-EM and EDAs with smoothed parameter updates.

## 3. Formal connection between EM and EDAs

We are now ready to use the EM framework presented in Section 2, to show that EDAs using the update rule in (2) can be written as MC-EM, and as exact EM in the infinite sample limit. We then show that generalizations of (2) that allow for parameter smoothing—such as CMA-ES’s smoothing with the previous estimate and use of an evolutionary path—can be readily incorporated into the EM-EDA connection by using the MAP-EM framework. Finally, we also show a connection between EM and standard gradient-based optimization, which will be useful for the hybrid algorithm presented in Section 4.2.

### 3.1. Derivation of EDAs as MC-EM

As described in the introduction, EDAs seek to solve the objective defined in (1),

$$\begin{aligned} \hat\theta &\equiv \arg\max_\theta \mathbb{E}_{p(z|\theta)}[f(z)] && (5)\\ &= \arg\max_\theta \log \mathbb{E}_{p(z|\theta)}[f(z)] && (6)\\ &= \arg\max_\theta \mathcal{L}_{EDA}(\theta), && (7)\end{aligned}$$

where $f$ is the black-box function to be optimized, $p(z|\theta)$ is what we refer to as the search model, parameterized by $\theta$, and we define $\mathcal{L}_{EDA}(\theta) \equiv \log \mathbb{E}_{p(z|\theta)}[f(z)]$—which can be thought of as an EDA equivalent to the ‘log marginal likelihood’ in EM, only without any observed data, $x$.

Some EDAs monotonically transform $f$ with a shaping function, $W$, which may be, for example, an exponential (Peters and Schaal, 2007), a quantile-based transformation (Ollivier et al., 2017; Rubinstein and Davidson, 1999), or a cumulative density function (CDF) (Brookes et al., 2019). Although this transformation does not change the optima, it may alter the optimization dynamics. Often this shaping function is changed at each iteration in a sample-dependent, adaptive manner (which links these methods to annealed versions of EM and VI (Hofmann, 2001; Mandt et al., 2016)). In such a setting, the connection that we will show between EDAs and EM holds within each full iteration, and the insights derived from this connection have bearing on the entire algorithm. For notational simplicity, we drop the $W$ and assume that $f$ has already been transformed. We additionally assume that the transformation is such that $f(z) > 0$ for all $z$.

To link (7) to EM, we introduce a probability density, $q(z)$, which allows us to derive a lower bound on $\mathcal{L}_{EDA}$ using Jensen’s inequality:

$$\begin{aligned} \mathcal{L}_{EDA}(\theta) &= \log \mathbb{E}_{p(z|\theta)}[f(z)] && (8)\\ &= \log \mathbb{E}_{q(z)}\!\left[\frac{p(z|\theta)\, f(z)}{q(z)}\right] && (9)\\ &\ge \mathbb{E}_{q(z)}\left[\log\left(p(z|\theta)\, f(z)\right)\right] + H[q] && (10)\\ &= F(q,\theta), && (11)\end{aligned}$$

where $F(q,\theta)$ is the same free energy function appearing in the previous section on EM, except that the complete likelihood is replaced with the term $p(z|\theta)f(z)$. When $p(z|\theta)f(z)$ is normalizable, then it can be shown that there is an EDA “variational gap” given by $\mathcal{L}_{EDA}(\theta) - F(q,\theta) = D_{KL}(q(z)\,\|\,\tilde p(z|\theta))$, where we define the tilted density,

$$\tilde p(z|\theta) = \frac{p(z|\theta)\, f(z)}{\int_{\mathcal{Z}} p(z|\theta)\, f(z)\, dz}, \tag{12}$$

which is the EDA counterpart to the exact posterior in EM. We can now construct a coordinate ascent algorithm on the free energy defined in (11) that mirrors the EM algorithm. In particular, this algorithm iterates between E-steps that solve $q^{(t+1)} = \arg\max_q F(q, \theta^{(t)})$, and M-steps that solve $\theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta)$. To make the precise connection between practically implemented EDA and EM, we introduce a particular approximate ‘posterior’ for the E-step that is given by a mixture of weighted particles:

$$q^{(t+1)}(z) = \frac{\sum_{i=1}^N f(z_i)\, \delta_{z_i}(z)}{\sum_{i=1}^N f(z_i)}, \tag{13}$$

where $\{z_i\}_{i=1}^N$ are samples drawn from $p(z|\theta^{(t)})$, as in EDAs, and $\delta_{z_i}$ denotes a point mass at $z_i$. Using this posterior approximation, the M-step amounts to solving the objective:

$$\begin{aligned} \theta^{(t+1)} &= \arg\max_\theta \mathbb{E}_{q^{(t+1)}(z)}\left[\log\left(p(z|\theta)\, f(z)\right)\right] && (14)\\ &= \arg\max_\theta \frac{\sum_{i=1}^N \int_{\mathcal{Z}} f(z_i)\, \delta_{z_i}(z) \log p(z|\theta)\, dz}{\sum_{i=1}^N f(z_i)} && (15)\\ &= \arg\max_\theta \sum_{i=1}^N f(z_i) \log p(z_i|\theta), && (16)\end{aligned}$$

which is exactly the generic EDA update step defined in (2).

From this exposition it also becomes clear that EDAs can be thought of as performing a type of MC-EM that uses Importance Sampling (MacKay, 2003) rather than direct sampling from the posterior. In particular, the EDA sampling procedure uses the proposal distribution $p(z|\theta^{(t)})$ (sampled in the E-step), and self-normalized importance weights proportional to $f(z_i)$, to estimate the expected “complete log likelihood” in the M-step.

This is our main result, as it shows that many EDAs can be viewed as an EM algorithm that uses the particle-based posterior approximation given by (13). For any $\theta^{(t)}$, we have $q^{(t+1)}(z) \to \tilde p(z|\theta^{(t)})$ as $N \to \infty$ by the law of large numbers and Slutsky’s theorem (Grimmett and Stirzaker, 2001). In this limit of infinite particles, the approximate posterior matches the tilted distribution—the EDA equivalent to the “exact posterior”—and our algorithm inherits the same guarantees as exact EM, such as guaranteed improvement of the objective function at each iteration, as well as local and global convergence properties (Neal and Hinton, 1999; Balakrishnan et al., 2017; Salakhutdinov et al., 2003).
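The importance-sampling view above is easy to check numerically. In the hypothetical setting below, the search model is $N(0,1)$ and $f(z) = e^z$, for which completing the square shows the tilted density (12) is exactly $N(1,1)$; the self-normalized estimate of the tilted mean should therefore approach 1 as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)   # proposal: the search model p(z|theta)
w = np.exp(z)                            # unnormalized importance weights f(z_i)
w /= w.sum()                             # self-normalize, as in (13)
est = np.sum(w * z)                      # estimate of the mean under p~(z|theta)
```

Here the exact answer is available because a Gaussian tilted by $e^z$ is again Gaussian; for a black-box $f$ the same estimator applies without any closed form.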

In many cases, EDAs use a member of an exponential family for the search model. Letting $\theta$ denote the expectation parameters, the search model has the density:

$$p(z|\theta) = h(z) \exp\left(\eta(\theta)^T T(z) - A(\theta)\right), \tag{17}$$

where $h(z)$ is the base measure, $\eta(\theta)$ are the natural parameters, $T(z)$ are the sufficient statistics, and $A(\theta)$ is the log-partition function. Then the update of (16) takes the simple form of a weighted maximum likelihood estimate for the exponential family:

$$\theta^{(t+1)} = \frac{\sum_{i=1}^N f(z_i)\, T(z_i)}{\sum_{i=1}^N f(z_i)}. \tag{18}$$
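For instance, for a full-covariance Gaussian search model the sufficient statistics are $T(z) = (z, zz^\top)$, and the update (18) reduces to weighted mean and covariance estimates. The sketch below assumes that parameterization, reporting the covariance directly rather than the raw second moment:

```python
import numpy as np

def gaussian_weighted_mle(z, fvals):
    """Weighted MLE update (18) for a Gaussian search model: the
    expectation parameters reduce to a weighted mean and covariance."""
    w = fvals / fvals.sum()                  # self-normalized weights
    mu = w @ z                               # weighted mean
    diff = z - mu
    cov = (w[:, None] * diff).T @ diff       # weighted covariance
    return mu, cov

# With uniform weights this reduces to the ordinary empirical estimates.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 3))
mu, cov = gaussian_weighted_mle(z, np.ones(500))
```

This is precisely the update used (before smoothing) by Gaussian EDAs such as CEM and the rank-$\mu$ part of CMA-ES.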

Next we show how exponential family search models can be used to connect parameter-smoothed EDAs to MAP-EM.

### 3.2. Smoothed EDAs as MAP-EM

In CMA-ES, the parameter updates are smoothed between iterations in a number of ways (Hansen et al., 2003). For example, the covariance estimate is typically smoothed with the previous estimate. Additionally, it may be further smoothed using a rank one covariance matrix obtained from the “evolution path” that the algorithm has recently taken. (See the two terms in Eq 11 of (Hansen et al., 2003)). Next, we consider these types of smoothing in detail. However, any smoothing update, or combination thereof, can be similarly derived by adjusting the form of the prior distribution. A benefit of viewing CMA-ES updates in this general form is that it becomes more straightforward to determine how to do similar types of smoothing for EDAs without Gaussian search models.

When smoothing with the previous parameter estimate, the updates can be written as

$$\theta^{(t+1)} = (1-\gamma)\,\theta^{(t)} + \gamma\,\tilde\theta^{(t+1)}, \tag{19}$$

where $\tilde\theta^{(t+1)}$ is the solution to (16) and $\gamma \in [0,1]$ is a hyperparameter that controls the amount of smoothing. In the case where the search model is a member of an exponential family, as is defined in (17), we will show that the smoothed update (19) is equivalent to a particular MAP-EM update that uses the tilted density (12) as the approximate posterior. To see this, consider the conjugate prior to the exponential family,

$$p_0(\theta|\lambda) = \exp\left(\lambda_1^T \eta(\theta) - \lambda_2 A(\theta) - B(\lambda)\right), \tag{20}$$

where $\lambda = (\lambda_1, \lambda_2)$, $B(\lambda)$ is the log-partition function of the prior, and we have assumed that the base measure is constant. We now consider the modified EDA objective:

$$\hat{\mathcal{L}}_{EDA}(\theta) = \log \mathbb{E}_{p(z|\theta)}\left[f(z)\, p_0(\theta|\lambda)\right],$$

which is analogous to the MAP objective defined in Section 2. It can be shown that by replacing the loss in (8) with this modified objective, performing analogous steps as those in (9-11) and (14-16), and using the same definition of the tilted density and approximate posterior as in (12) and (13), respectively, we arrive at the update equation:

$$\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^N f(z_i) \log p(z_i|\theta)\, p_0(\theta|\lambda). \tag{21}$$

If we now let $\lambda_1 = \lambda_2\,\theta^{(t)}$, allowing the prior parameters to change at every iteration, it can then be shown that the solution to this objective is given by:

$$\begin{aligned} \theta^{(t+1)} &= \frac{\lambda_2}{1+\lambda_2}\,\theta^{(t)} + \frac{1}{1+\lambda_2}\,\tilde\theta^{(t+1)} && (22)\\ &= (1-\gamma)\,\theta^{(t)} + \gamma\,\tilde\theta^{(t+1)}, && (23)\end{aligned}$$

where $\tilde\theta^{(t+1)}$ is (18), the solution of the original EDA objective, and $\gamma = \frac{1}{1+\lambda_2}$. Since (23) is identical to the smoothed update (19), we can see that smoothing can be viewed as a consequence of performing MAP-EM in the context of EDAs, with a particular choice of prior parameters. See Appendix A.2 for a full derivation.
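The equivalence between (22) and (19) is straightforward to verify numerically; the values below are arbitrary, chosen only to exercise the algebra:

```python
import numpy as np

theta_prev = np.array([1.0, -2.0])    # previous estimate theta^(t)
theta_tilde = np.array([0.5, 3.0])    # weighted-MLE solution (18)
lam2 = 4.0
lam1 = lam2 * theta_prev              # the choice of prior parameters in the text

# MAP-EM solution (22) ...
theta_map = lam1 / (1 + lam2) + theta_tilde / (1 + lam2)

# ... equals the smoothed update (19) with gamma = 1 / (1 + lam2).
gamma = 1.0 / (1 + lam2)
theta_smooth = (1 - gamma) * theta_prev + gamma * theta_tilde
```

Larger $\lambda_2$ (a stronger prior) corresponds to smaller $\gamma$, i.e., heavier smoothing toward the previous parameter estimate.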

### 3.3. A continuum between EM-like algorithms and gradient descent

Instead of using EDAs, one could alternatively solve the objective (1) directly with gradient descent (or natural gradient descent), where updates are given by $\theta^{(t+1)} = \theta^{(t)} + \alpha\, \nabla_\theta \mathbb{E}_{p(z|\theta)}[f(z)]\big|_{\theta=\theta^{(t)}}$, and $\alpha$ is a step size parameter. This is known as REINFORCE (Williams, 1992) in the RL community and is related to IGO-ML (Ollivier et al., 2017). Typically, one cannot compute the gradient term exactly, in which case the “log derivative trick”,

$$\begin{aligned} &\nabla_\theta\, \mathbb{E}_{p(z|\theta)}[f(z)]\Big|_{\theta=\theta^{(t)}} && (24)\\ &\quad= \mathbb{E}_{p(z|\theta^{(t)})}\left[f(z)\, \nabla_\theta \log p(z|\theta)\big|_{\theta=\theta^{(t)}}\right], && (25)\end{aligned}$$

is often combined with Monte Carlo estimation to arrive at an update:

$$\theta^{(t+1)} = \theta^{(t)} + \alpha \sum_{i=1}^N f(z_i)\, \nabla_\theta \log p(z_i|\theta)\Big|_{\theta=\theta^{(t)}}, \tag{26}$$

where $\{z_i\}_{i=1}^N$ are samples drawn from $p(z|\theta^{(t)})$. Notably, this update does not require the gradient of $f$ (Recht, 2018).

We can connect the gradient-based optimization in (26) to the EDA “EM” optimization presented in Section 3.1. In particular, consider the partial M-step version of EM, where instead of fully solving for the new parameter, one instead performs one gradient step (Neal and Hinton, 1999; Balakrishnan et al., 2017)—so-called “first-order” EM. Instantiating first-order EM into EDA “EM” by partially solving the objective in (16) with a single gradient step (with step size $\alpha$) will result in the same update as (26). In other words, performing “first-order” EM in the EDA context is identical to performing gradient descent on (1). In our experiments, we refer to “first-order” EM for EDAs as MC-GD, due to the fact that the gradient is estimated with Monte Carlo. As we can see, as one performs more gradient steps within an M-step, one traces out a continuum of methods between gradient descent on the original objective and EM.
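A minimal sketch of one MC-GD step follows, assuming a hypothetical search model $N(\mu, 1)$ whose score is $\nabla_\mu \log p(z|\mu) = z - \mu$; the $1/N$ of the Monte Carlo average is absorbed into the step size relative to (26):

```python
import numpy as np

def mc_gd_step(mu, f, alpha, n, rng):
    """One MC-GD ('first-order EM') step as in (26), for a search
    model N(mu, 1) with score grad_mu log p(z|mu) = (z - mu)."""
    z = rng.normal(mu, 1.0, size=n)
    g = np.mean(f(z) * (z - mu))    # Monte Carlo score-function gradient
    return mu + alpha * g

rng = np.random.default_rng(0)
mu = 0.0
f = lambda z: np.exp(-0.5 * (z - 3.0) ** 2)   # E[f] is maximized at mu = 3
for _ in range(400):
    mu = mc_gd_step(mu, f, alpha=0.5, n=200, rng=rng)
```

Fully solving (16) instead of taking one such step recovers the EDA update; interleaving several gradient steps per M-step interpolates between the two extremes.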

## 4. Illustrative use cases of the EM-EDA connection

In the next two sections, we provide two illustrative examples of how the connection between EDAs and EM can be leveraged. First, we use the framework to provide a new perspective on how EDAs can be seen as approximating natural gradient descent. Then, separately, we borrow results about a hybrid EM algorithm and apply it to EDAs, with promising results.

### 4.1. A unifying view of EDAs as natural gradient descent

Akimoto et al. (2010) showed that CMA-ES, a particular EDA, can be viewed as approximate natural gradient descent (NGD). This result also emerges from the IGO framework of Ollivier et al. (2017), who additionally showed that CEM and several other EDAs can be viewed as NGD. Malagò et al. (2013) also show how EDAs using Markov Random Fields for the search model can be interpreted as NGD. The relationship we derived between EDA and EM provides a unifying perspective on these connections, which also allows for them to be readily generalized to any EDA that can be written as EM. Next we develop this unifying view.

#### EM as NGD

It does not appear widely known that EM can be viewed as approximate NGD. Thus we next synthesize a series of known results to show this. As a method for maximizing the log-likelihood function, Chrétien and Hero showed that EM can be formulated as the proximal point method (PPM) with the reverse KL divergence (Chrétien and Hero III, 1998; Chrétien and Hero, 2000). Specifically, this formulation shows that in our context, each update

$$\theta^{(t+1)} = \arg\max_\theta \int_{\mathcal{Z}} \tilde p(z|\theta^{(t)}) \log\left(p(z|\theta)\, f(z)\right) dz \tag{27}$$

is equivalent to

$$\theta^{(t+1)} = \arg\max_\theta \left[\mathcal{L}_{EDA}(\theta) - D_{KL}\!\left(\tilde p(z|\theta^{(t)})\,\|\,\tilde p(z|\theta)\right)\right], \tag{28}$$

which is a PPM (see also Appendix B). This equivalence requires the exact posterior (tilted distribution), $\tilde p(z|\theta^{(t)})$, and therefore does not technically hold for EDA in practice, which is more akin to MC-EM. Nevertheless, the connection exposes an interesting new perspective on EDAs in the asymptotic regime, as we shall see.

Next, consider the mirror descent update, which is a linearization of the PPM (Beck and Teboulle, 2003):

$$\begin{aligned} \theta^{(t+1)} = \arg\max_\theta\; & \mathcal{L}_{EDA}(\theta^{(t)}) && (29)\\ &+ \left\langle \nabla_\theta \mathcal{L}_{EDA}(\theta)\big|_{\theta=\theta^{(t)}},\; \theta - \theta^{(t)} \right\rangle && (30)\\ &- D_{KL}\!\left(\tilde p(z|\theta^{(t)})\,\|\,\tilde p(z|\theta)\right). && (31)\end{aligned}$$

Mirror descent with KL divergence is equivalent to NGD on the dual Riemannian manifold (Raskutti and Mukherjee, 2015). Even when the original and dual manifolds differ, the two algorithms can be related: using a second-order Taylor series approximation of the KL divergence yields an approximate mirror descent update that is precisely NGD. Specifically, replacing the KL divergence in mirror descent with the second-order approximation as follows,

$$\begin{aligned} D_{KL}&\left(\tilde p(z|\theta^{(t)})\,\|\,\tilde p(z|\theta)\right) && (32)\\ &= \frac{1}{2}(\theta-\theta^{(t)})^\top I(\theta^{(t)})\,(\theta-\theta^{(t)}) + O\!\left((\theta-\theta^{(t)})^3\right), && (33)\end{aligned}$$

where $I(\theta^{(t)})$ is the Fisher information metric (Martens, 2014), yields the update

$$\begin{aligned} \theta^{(t+1)} = \arg\max_\theta\; & \mathcal{L}_{EDA}(\theta^{(t)}) && (34)\\ &+ \left\langle \nabla_\theta \mathcal{L}_{EDA}(\theta)\big|_{\theta=\theta^{(t)}},\; \theta - \theta^{(t)} \right\rangle && (35)\\ &- \frac{1}{2}(\theta-\theta^{(t)})^\top I(\theta^{(t)})\,(\theta-\theta^{(t)}), && (36)\end{aligned}$$

which can be shown to be equivalent to the NGD update

$$\theta^{(t+1)} = \theta^{(t)} + I(\theta^{(t)})^{-1}\, \nabla_\theta \mathcal{L}_{EDA}(\theta)\Big|_{\theta=\theta^{(t)}}. \tag{37}$$

See Appendix C for further details.

In summary: EM is a PPM with KL divergence; mirror descent is a linearization of the PPM; and a second-order approximation of the KL divergence in mirror descent yields NGD. Thus in the infinite sample limit, EDAs can be seen as approximate NGD.

Note that Sato (2001), and later others, showed the related result that for exponential family models with hidden variables and a conjugate prior (to the joint observed and hidden model), classical (non-amortized) mean-field variational inference performs NGD (Hensman et al., 2012; Hoffman et al., 2013). In contrast, our result applies to any exact EM, but only connects it to approximate NGD.

An interesting side note is that Salimbeni et al. (2018) demonstrated empirically that the best step size for NGD, according to Brent line searches, increases over iterations to 1.0, for Gaussian approximations of several models. Their result aligns nicely with our results here that EM, viewed as approximate NGD, has a step size of precisely 1.0.

### 4.2. Leveraging an adaptive hybrid EM algorithm for EDAs

It is well-known that EM can exhibit super-linear convergence behaviour near a local optimum and in plateau regions, but can exhibit extremely slow convergence rates otherwise—in regions where gradient-based methods can do better (Xu and Jordan, 1996; Salakhutdinov et al., 2003). In order to get the best of both worlds, Salakhutdinov et al. leverage a precise analysis of EM, including how to diagnose in an on-line, data-driven manner when EM is likely to be in a poor convergence phase. The technically correct diagnostic requires difficult computations of the derivatives of the EM “mapping matrix” (the implicit matrix that maps $\theta^{(t)}$ to $\theta^{(t+1)}$); instead, they propose to use the entropy of the posterior as a proxy for this quantity, using the intuition that both quantities inform on how much “information” EM has about the latent variables at any given point in the algorithm. This intuition follows from a technical condition that holds when, for example, the expected complete log likelihood of the M-step has a single optimum, such as for (mixtures of) Gaussians, (mixtures of) Factor Analyzers, HMMs, and probabilistic (mixtures of) Principal Components Analyzers (Salakhutdinov et al., 2003). For the gradient-based method, they use conjugate gradient descent (CGD) with a line search; they find that the hybrid method never does worse, and often performs better, than both standard EM and CGD.

On this basis, we explore an analogous adaptive, hybrid EDA-gradient approach, although we use the approximate GD updates described in (26) without a line search, instead using AdaGrad (Duchi et al., 2011) to control the learning rate—we do so in the hope of better understanding the source of differences in performance, differences that do not depend on a line search. Our hybrid approach, denoted Hybrid, switches between (i) a generic EDA algorithm that uses the basic update rule of (16), which we denote EDA, and (ii) a gradient-based approach (26), which we denote MC-GD. As in Salakhutdinov et al. (2003), we switch from MC-GD to EDA when the estimated posterior entropy is above a specified cutoff, and vice versa.

Although we could in principle devise an algorithm similar to our Hybrid method without use of the EM connection, such a method would lack a rational switching mechanism. In contrast, the EM viewpoint directly provides us with a rigorous switching mechanism for EDAs by way of the analysis of (Salakhutdinov et al., 2003).

#### Experiments

To get a sense of the potential utility of the EM-derived hybrid approach, we compared our Hybrid method to the pure EDA and MC-GD algorithms, in the minimization of 2- and 10-dimensional Rastrigin and Ackley functions—canonical multi-modal, non-convex functions commonly used to benchmark optimization approaches (Hansen and Ostermeier, 2001).

We used a Gaussian search model of the appropriate dimensionality, with full-rank covariance matrix and a fixed number of particles drawn at each iteration. The algorithms had a fixed budget of 50,000 and 1 million function evaluations for the 2- and 10-dimensional problems, respectively (which amounts to a maximum of 5,000 and 100,000 EM iterations, respectively). Crucially, as in Wierstra et al. (2014), each algorithm was allowed to randomly restart after convergence until the function budget was used up, because this ties in to practical use cases. Convergence was determined to be the point at which a fixed criterion on the covariance matrix of the search model, scaled by the dimensionality of the search space, was met. Restarts were randomly initialized with particles drawn from a Gaussian with unit variance and mean randomly positioned on the hypersphere of radius $r$ centered at the global minimum (the origin)—the larger $r$, the further the initialization is from the global optimum. The radius, $r$, was set to 20 and 30 for the Rastrigin and Ackley functions, respectively, to make the optimizations sufficiently difficult that none of the methods tended to reach the global minimum—this allowed us to better distinguish the relative performance of the methods, whose ordering was not sensitive to this setting.

As in Wierstra et al. (2014); Ollivier et al. (2017); Rubinstein and Davidson (1999); Hansen and Ostermeier (2001), we used an adaptive shaping function, $W$, to monotonically transform the function values $f(z_i)$. In particular, we used a sigmoid weighting function whose shift and scale parameters were estimated from the samples at each iteration.

We approximated the entropy of our approximate posterior, a collection of weighted particles, by using the entropy of the search model itself, which for a Gaussian is $\frac{1}{2}\log\det(2\pi e \Sigma)$. We found that this quantity tends to lie within a consistent range during the course of the optimizations, and chose the entropy cutoff for switching in Hybrid accordingly. Even when it is not a priori obvious how to set the cutoff, this hyper-parameter can readily be set for new problems based on empirics from a few sample runs.
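The entropy proxy and switching rule can be sketched as follows; `choose_step` and the illustrative cutoff are hypothetical, not taken from the original implementation:

```python
import numpy as np

def gaussian_entropy(cov):
    """Entropy of a multivariate Gaussian search model,
    0.5 * log det(2*pi*e*Sigma), used as a proxy for the posterior entropy."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def choose_step(cov, cutoff):
    # High entropy -> take an EDA (EM) step; low entropy -> take an MC-GD step.
    return "EDA" if gaussian_entropy(cov) > cutoff else "MC-GD"
```

For example, with an identity covariance in two dimensions the entropy is $\log(2\pi e) \approx 2.84$; as the search model contracts around an optimum, $\log\det\Sigma$ falls and the rule switches to MC-GD.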

As shown in Figures 1 and 2, EDA converges very quickly, but to poor local minima. In contrast, MC-GD converges very slowly, but to better local minima. Hybrid effectively balances these two extremes by converging to similar local optima as MC-GD, but much more quickly. This allows Hybrid to use its iteration budget more effectively by restarting more often, and consequently finding lower function values (Figures 1d and 2). For this reason, our results suggest that in a practical use case with a limited budget, Hybrid may obtain better optima than either MC-GD or EDA.

As an aside, we note that EM implicitly sets its own step-size. However, in Salakhutdinov and Roweis (2003), using an “overrelaxed bound optimization” technique, an adaptive step size is used instead of the implicit one. The authors demonstrate that this provides faster convergence. We tried using such an approach for EDAs, and indeed, found it converged more quickly, but to worse local optima, and did not pursue this avenue further.

In summary: using our connection between EM and EDAs, we leveraged a rigorously derived, already-known, adaptive, hybrid EM-GD method for EDAs. The results of our experiments suggest this avenue may prove promising. More importantly, we have shown an example of how rich algorithmic results from EM can be rather directly applied to EDAs.

## 5. Discussion

We have shown a novel, rigorous connection between a broad class of Estimation of Distribution Algorithms and Expectation-Maximization. Additionally, we have presented two illustrative examples of insights from EM that can be applied to EDAs by way of this connection. Specifically, we presented a new connection between Natural Gradient Descent and EDAs, by way of EM. Additionally, we leveraged a hybrid EDA-gradient method, which may have implications in numerous fields, such as Reinforcement Learning, where one or the other approach is used but, to the best of our knowledge, never both. For example, REINFORCE-type algorithms are gradient-based (Williams, 1992), while Reward Weighted Regression-type algorithms are EDA-based (Peters and Schaal, 2007). We anticipate that this connection will open a large number of possibilities for further insights and improvements to EDAs.

In light of EDAs as EM, one can ask whether this viewpoint sheds any light on the choice of EDA search models. Suppose for a moment that rather than a sample-based approximate posterior, one instead used a parametric posterior, as in standard VI. In such a case, if the search model were in some sense “conjugate” to $f$, then the posterior could take on an analytical form. When might such an approach make sense? Suppose one were using EDAs to perform protein design (Brookes et al., 2019), that $f$ represented protein stability, and that it could be approximated by an exponential model of linear additive marginal and pairwise terms (Bryngelson and Wolynes, 1987). Then if one used a Potts model (Marks et al., 2011) as the search model, the exact posterior could also be written in this form, and a “parametric” EDA could be pursued.

Given the tight connection between EDAs with adaptive shaping functions and annealed versions of EM and VI (Hofmann, 2001; Mandt et al., 2016), we believe that these latter approaches could benefit from the simple and robust implicit annealing schemes found in the EDA literature, which arise not from an explicit annealing schedule, but from simple and effective quantile transformations and the like.
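As a concrete illustration of such a quantile transformation, the following sketch (our own, hypothetical illustration, not code from any particular EDA implementation) shows a shaping function of the kind used in, e.g., the cross-entropy method: samples whose objective values fall in the top $q$-quantile receive weight one, all others weight zero. Because the weights depend only on the ranks of the objective values, they are invariant to any monotonic transformation of $f$, which is the source of the implicit annealing behavior.

```python
import numpy as np

def quantile_shaping(f_values, q=0.2):
    """Hypothetical quantile-based shaping function: weight 1 for samples
    in the top q-quantile of objective values, 0 otherwise. The weights
    depend only on ranks, so any monotone transform of f gives the same
    weights."""
    f_values = np.asarray(f_values, dtype=float)
    threshold = np.quantile(f_values, 1.0 - q)
    return (f_values >= threshold).astype(float)
```

For example, `quantile_shaping([1, 2, 3, 4, 5], q=0.4)` and `quantile_shaping([1, 4, 9, 16, 25], q=0.4)` produce identical weight vectors, since the second input is a monotone transform of the first.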

## Appendix A Derivations in the case of Exponential Family search models

In equations (18) and (22) we present exact update equations for EDAs when the search model is a member of an exponential family. We derive these updates here.

### a.1. EDA Update

For the first update (18), we use the definition of the exponential family search model density (17) in the EDA update objective (16):

$$
\begin{aligned}
\theta^{(t+1)} &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} f(z_i)\log p(z_i\,|\,\theta) \qquad (38)\\
&= \operatorname*{argmax}_{\theta} \left(\sum_{i=1}^{N} f(z_i)\,\eta(\theta)^T T(z_i)\right) - A(\theta)\sum_{i=1}^{N} f(z_i) \qquad (39)
\end{aligned}
$$

To solve this, we now take the gradient of the argument with respect to $\theta$:

$$
\begin{aligned}
&\nabla_\theta\left(\sum_{i=1}^{N} f(z_i)\,\eta(\theta)^T T(z_i)\right) - \nabla_\theta A(\theta)\sum_{i=1}^{N} f(z_i) \qquad (40)\\
&= \nabla_\theta\eta(\theta)^T\sum_{i=1}^{N} f(z_i)\,T(z_i) - \nabla_\theta\eta(\theta)^T\,\mathbb{E}_\eta[T(z)]\sum_{i=1}^{N} f(z_i) \qquad (41)
\end{aligned}
$$

where we have used the chain rule and the properties of the log-partition function (i.e., cumulant generating function) to equate $\nabla_\theta A(\theta) = \nabla_\theta\eta(\theta)^T\,\mathbb{E}_\eta[T(z)]$. Now, recognizing that $\theta$ are expectation parameters, so $\theta = \mathbb{E}_\eta[T(z)]$, and setting (41) equal to zero, we arrive at the update equation (18):

$$
\begin{aligned}
&\nabla_\theta\eta(\theta)^T\sum_{i=1}^{N} f(z_i)\,T(z_i) - \nabla_\theta\eta(\theta)^T\,\mathbb{E}_\eta[T(z)]\sum_{i=1}^{N} f(z_i) = 0 \qquad (42)\\
&\Rightarrow\ \theta^{(t+1)} = \frac{\sum_{i=1}^{N} f(z_i)\,T(z_i)}{\sum_{i=1}^{N} f(z_i)} \qquad (43)
\end{aligned}
$$
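To make update (43) concrete, here is a minimal numerical sketch (our own illustration; the sampling distribution, objective, and variable names are assumptions, not from the paper). For a one-dimensional Gaussian search model with known variance, the sufficient statistic is $T(z)=z$, so the updated expectation parameter is simply the $f$-weighted sample mean.

```python
import numpy as np

def eda_update(z, f_weights, T):
    """Update (43): the new expectation parameters are the f-weighted
    average of the sufficient statistics T(z_i) over the current sample."""
    stats = np.array([T(zi) for zi in z])
    w = np.asarray(f_weights, dtype=float)
    return (w @ stats) / w.sum()

# Toy example: 1-D Gaussian search model with known variance, T(z) = z,
# and an objective that favors samples near 0.5.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=1000)
f = np.exp(-(z - 0.5) ** 2)
theta_new = eda_update(z, f, lambda zi: zi)   # f-weighted mean of samples
```

With uniform weights the update reduces to the ordinary sample mean, as expected from (43).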

### a.2. MAP Update

For the second update (22), we use the definition of the exponential family search model density (17), and the prior (20), in the MAP-EDA update objective (21):

$$
\begin{aligned}
\theta^{(t+1)} &= \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} f(z_i)\log\bigl[p(z_i\,|\,\theta)\,p_0(\theta\,|\,\lambda)\bigr]\\
&= \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} f(z_i)\,\eta(\theta)^T T(z_i) + \bigl(\lambda_1^T\eta(\theta) - (1+\lambda_2)A(\theta)\bigr)\sum_{i=1}^{N} f(z_i)
\end{aligned}
$$

Taking the gradient of the argument with respect to $\theta$ and setting it equal to zero:

$$
\begin{aligned}
&\nabla_\theta\eta(\theta)^T\sum_{i=1}^{N} f(z_i)\,T(z_i) + \Bigl(\nabla_\theta\eta(\theta)^T\lambda_1 - (1+\lambda_2)\,\nabla_\theta\eta(\theta)^T\,\mathbb{E}_\eta[T(z)]\Bigr)\sum_{i=1}^{N} f(z_i) = 0\\
&\Rightarrow\ \theta\,(1+\lambda_2)\sum_{i=1}^{N} f(z_i) = \sum_{i=1}^{N} f(z_i)\,T(z_i) + \lambda_1\sum_{i=1}^{N} f(z_i)\\
&\Rightarrow\ \theta^{(t+1)} = \frac{1}{1+\lambda_2}\,\frac{\sum_{i=1}^{N} f(z_i)\,T(z_i)}{\sum_{i=1}^{N} f(z_i)} + \frac{\lambda_1}{1+\lambda_2}
\end{aligned}
$$

where again we use $\theta = \mathbb{E}_\eta[T(z)]$. Now, we can recognize that $\tilde\theta^{(t+1)} = \sum_{i=1}^{N} f(z_i)\,T(z_i)\,/\,\sum_{i=1}^{N} f(z_i)$ is the solution to the original EDA objective, and letting $\lambda_1 = \left(\frac{1}{\gamma}-1\right)\theta^{(t)}$ and $\lambda_2 = \frac{1}{\gamma}-1$ we arrive at:

$$
\theta^{(t+1)} = \frac{1}{1+\frac{1}{\gamma}-1}\,\tilde\theta^{(t+1)} + \frac{\frac{1}{\gamma}-1}{1+\frac{1}{\gamma}-1}\,\theta^{(t)} = \gamma\,\tilde\theta^{(t+1)} + (1-\gamma)\,\theta^{(t)}
$$

which is the desired smoothed update.
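The algebra above can be checked numerically. In this sketch (function names are our own), the MAP form of the update, with $\lambda_1 = (1/\gamma - 1)\,\theta^{(t)}$ and $\lambda_2 = 1/\gamma - 1$, coincides exactly with exponential smoothing of the parameters:

```python
def smoothed_update(theta_prev, theta_tilde, gamma):
    """Smoothed update: theta(t+1) = gamma * theta_tilde + (1 - gamma) * theta(t).
    gamma = 1 recovers the plain EDA update; gamma -> 0 keeps theta(t)."""
    return gamma * theta_tilde + (1.0 - gamma) * theta_prev

def map_update(theta_prev, theta_tilde, gamma):
    """The same update in its MAP form, substituting
    lambda_1 = (1/gamma - 1) * theta(t) and lambda_2 = 1/gamma - 1."""
    lam2 = 1.0 / gamma - 1.0
    lam1 = lam2 * theta_prev
    return theta_tilde / (1.0 + lam2) + lam1 / (1.0 + lam2)
```

For any `theta_prev`, `theta_tilde`, and `gamma` in (0, 1], the two functions agree up to floating-point error.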

## Appendix B Equivalence between Expectation-Maximization and the proximal point method

In Section 6, we discuss the equivalence between EM and using the proximal point method (PPM) with reverse KL divergence to maximize the log-likelihood function (Chrétien and Hero III, 1998; Chrétien and Hero, 2000). Here, we provide the derivation of this equivalence, beginning with a formulation of the PPM:

$$
\begin{aligned}
\phi^{(t+1)} &= \operatorname*{argmax}_{\phi}\ \log L(\phi) - D_{\mathrm{KL}}\bigl(p(z\,|\,y,\phi^{(t)})\,\|\,p(z\,|\,y,\phi)\bigr)\\
&= \operatorname*{argmax}_{\phi}\ \log p(y\,|\,\phi) - \int p(z\,|\,y,\phi^{(t)})\log\frac{p(z\,|\,y,\phi^{(t)})}{p(z\,|\,y,\phi)}\,dz\\
&= \operatorname*{argmax}_{\phi}\ \log p(y\,|\,\phi) + \int p(z\,|\,y,\phi^{(t)})\log p(z\,|\,y,\phi)\,dz\\
&= \operatorname*{argmax}_{\phi}\ \log p(y\,|\,\phi)\int p(z\,|\,y,\phi^{(t)})\,dz + \int p(z\,|\,y,\phi^{(t)})\log p(z\,|\,y,\phi)\,dz\\
&= \operatorname*{argmax}_{\phi}\ \int p(z\,|\,y,\phi^{(t)})\bigl(\log p(y\,|\,\phi) + \log p(z\,|\,y,\phi)\bigr)\,dz\\
&= \operatorname*{argmax}_{\phi}\ \int p(z\,|\,y,\phi^{(t)})\log p(y,z\,|\,\phi)\,dz.
\end{aligned}
$$

Note that the last line is precisely the update rule for EM.

## Appendix C Details of natural gradient descent

Recall from Section 6 that replacing the log-likelihood with its linearization and the KL-divergence with its second-order Taylor series approximation in the PPM formulation of EM yields the update

$$
\phi^{(t+1)} = \operatorname*{argmax}_{\phi}\ \log L(\phi^{(t)}) + \Bigl\langle \nabla_\phi\log L(\phi)\big|_{\phi=\phi^{(t)}},\ \phi-\phi^{(t)}\Bigr\rangle - \tfrac{1}{2}\,(\phi-\phi^{(t)})^\top I(\phi^{(t)})\,(\phi-\phi^{(t)}).
$$

One can further derive a closed-form update by taking the derivative of the argument on the right-hand side, setting it equal to zero as per first-order optimality conditions, and solving:

$$
\begin{aligned}
&\nabla_\phi\Bigl(\log L(\phi^{(t)}) + \Bigl\langle \nabla_\phi\log L(\phi)\big|_{\phi=\phi^{(t)}},\ \phi-\phi^{(t)}\Bigr\rangle - \tfrac{1}{2}\,(\phi-\phi^{(t)})^\top I(\phi^{(t)})\,(\phi-\phi^{(t)})\Bigr)\Big|_{\phi=\phi^{(t+1)}} = 0\\
&\Longrightarrow\ \nabla_\phi\log L(\phi)\big|_{\phi=\phi^{(t)}} - I(\phi^{(t)})\,(\phi^{(t+1)}-\phi^{(t)}) = 0\\
&\Longrightarrow\ \phi^{(t+1)} = \phi^{(t)} + I(\phi^{(t)})^{-1}\,\nabla_\phi\log L(\phi)\big|_{\phi=\phi^{(t)}}.
\end{aligned}
$$

This establishes the equivalence between the standard natural gradient descent update rule and the mirror descent update derived using the PPM formulation of EM as a starting point.
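As a sanity check of this update rule, the following sketch (our own example, not from the paper) applies one natural gradient step to a Gaussian likelihood with unknown mean and known variance. Because the log-likelihood is exactly quadratic in the mean, a single step $\phi + I(\phi)^{-1}\nabla_\phi \log L$ lands exactly on the maximum-likelihood estimate, the sample mean, regardless of the starting point.

```python
import numpy as np

def natural_gradient_step(phi, grad_log_lik, fisher):
    """One natural gradient ascent step: phi + I(phi)^{-1} grad_log_lik."""
    return phi + np.linalg.solve(fisher, grad_log_lik)

# Gaussian y_i ~ N(mu, sigma2) with known sigma2: the Fisher information
# is n / sigma2 and the score is sum(y - mu) / sigma2.
y = np.array([1.0, 2.0, 6.0])
sigma2 = 2.0
mu = np.array([-10.0])                            # arbitrary starting point
score = np.array([np.sum(y - mu[0]) / sigma2])
fisher = np.array([[len(y) / sigma2]])
mu_new = natural_gradient_step(mu, score, fisher)  # lands on y.mean()
```

A plain gradient step with a fixed step size would instead require many iterations, with a rate depending on `sigma2`; preconditioning by the inverse Fisher information removes that dependence here.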

###### Acknowledgements.
We thank Ben Recht for helpful discussion, David Duvenaud, Roger Grosse and James Hensman for pointers to variational Bayes as NGD, Petros Koumoutsakos for encouraging us to think about how to incorporate smoothing, Nikolaus Hansen for pointing us to relevant work, Youhei Akimoto for replies to questions about his papers, and Meng Xiangming for catching technical typos. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

### Footnotes

2. This object is often simply referred to as the "probability distribution" (Kern et al., 2004); however, we use "search model" to distinguish it from the distinct probability distributions that will be defined further on.

### References

1. Bidirectional relation between CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving from Nature, PPSN XI, R. Schaefer, C. Cotta, J. Kołodziej and G. Rudolph (Eds.), Berlin, Heidelberg, pp. 154–163.
2. Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica 64 (4), pp. 698–716.
3. Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Statist. 45 (1), pp. 77–120.
4. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis.
5. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31 (3), pp. 167–175.
6. Conditioning by adaptive sampling for robust design. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 773–782.
7. Spin glasses and the statistical mechanics of protein folding. Proceedings of the National Academy of Sciences 84 (21), pp. 7524–7528.
8. Kullback proximal algorithms for maximum-likelihood estimation. IEEE Transactions on Information Theory 46 (5), pp. 1800–1810.
9. Acceleration of the EM algorithm via proximal point iterations. In Proceedings of the IEEE International Symposium on Information Theory (ISIT).
10. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1), pp. 1–38.
11. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, pp. 2121–2159.
12. Probability and Random Processes. Vol. 80, Oxford University Press.
13. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evol. Comput. 11 (1), pp. 1–18.
14. Completely derandomized self-adaptation in evolution strategies. Evol. Comput. 9 (2), pp. 159–195.
15. Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems, Vol. 25, Cambridge, MA.
16. Stochastic variational inference. J. Mach. Learn. Res. 14 (1), pp. 1303–1347.
17. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42 (1), pp. 177–196.
18. Annealing adaptive search, cross-entropy, and stochastic approximation in global optimization. Naval Research Logistics (NRL) 58 (5), pp. 457–477.
19. Learning probability distributions in continuous evolutionary algorithms – a comparative review. Natural Computing 3 (1), pp. 77–112.
20. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
21. Natural gradient, fitness modelling and model selection: a unifying perspective. In IEEE Congress on Evolutionary Computation, pp. 486–493.
22. Variational tempering. In AISTATS, Vol. 51, pp. 704–712.
23. Protein 3D structure computed from evolutionary sequence variation. PLOS One 6 (12), pp. 1–20.
24. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.
25. Learning in Graphical Models. M. I. Jordan (Ed.), pp. 355–368.
26. Estimation of distribution algorithm for mixed continuous-discrete optimization problems. In Proceedings of the 29th International Conference on Machine Learning, Slovakia.
27. Information-geometric optimization algorithms: a unifying picture via invariance principles. J. Mach. Learn. Res. 18 (1), pp. 564–628.
28. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, New York, NY, USA, pp. 745–750.
29. The information geometry of mirror descent. IEEE Transactions on Information Theory 61 (3), pp. 1451–1457.
30. A tour of reinforcement learning: the view from continuous control. arXiv abs/1806.09460.
31. The Cross-Entropy Method for Combinatorial and Continuous Optimization. Methodol. Comput. Appl. Probab. 1, pp. 127–190.
32. Optimization with EM and Expectation-Conjugate-Gradient. In Proceedings of the International Conference on Machine Learning (ICML), pp. 672–679.
33. Adaptive overrelaxed bound optimization methods. In Proceedings of the International Conference on Machine Learning, Vol. 20, pp. 664–671.
34. Natural gradients in practice: non-conjugate variational inference in Gaussian process models. In Uncertainty in Artificial Intelligence.
35. Online model selection based on the variational Bayes. Neural Comput. 13 (7), pp. 1649–1681.
36. Optimization by estimation of distribution with DEUM framework based on Markov random fields. International Journal of Automation and Computing 4 (3), pp. 262–272.
37. Multi-objective optimization with estimation of distribution algorithm in a noisy environment. Evolutionary Computation 21 (1), pp. 149–177.
38. Optimization by variational bounding. In European Symposium on ANNs.
39. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381–3387.
40. Natural evolution strategies. Journal of Machine Learning Research 15, pp. 949–980.
41. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8 (3-4), pp. 229–256.
42. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput. 8 (1), pp. 129–151.
43. Stochastic search using the natural gradient. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1161–1168.
44. Model-based search for combinatorial optimization: a critical survey. Annals of Operations Research 131 (1-4), pp. 373–395.