Quality Gain Analysis of the Weighted Recombination Evolution Strategy on General Convex Quadratic Functions
Footnote 1: This is an extension of our extended abstract presented at FOGA 2017 Akimoto2017foga ().
Quality gain is the expected relative improvement of the function value in a single step of a search algorithm. Quality gain analysis reveals how the quality gain depends on the parameters of a search algorithm, from which one can derive the optimal values for those parameters. In this paper, we investigate evolution strategies with weighted recombination on general convex quadratic functions. We derive a bound for the quality gain and two limit expressions of the quality gain. From the limit expressions, we derive the optimal recombination weights and the optimal step-size, and find that the optimal recombination weights are independent of the Hessian of the objective function. Moreover, the dependencies of the optimal parameters on the dimension and the population size are revealed. In contrast to previous works, where the population size is implicitly assumed to be smaller than the dimension, our results cover population sizes proportional to or greater than the dimension. Numerical simulation shows that the asymptotically optimal step-size closely approximates the empirically optimal step-size for a finite dimensional convex quadratic function.
Keywords: Evolution strategy, weighted recombination, quality gain analysis, optimal step-size, general convex quadratic function
Evolution Strategies (ES) are randomized search algorithms for minimizing a black-box function in a continuous domain, where neither the gradient nor the Hessian matrix of the objective function is available. The most advanced and commonly used category of evolution strategies is the covariance matrix adaptation evolution strategy (CMA-ES) Hansen2004ppsn (); Hansen:2013vt (), which is recognized as the state-of-the-art black-box continuous optimizer. It generates multiple candidate solutions from a multivariate normal distribution, and these are evaluated on the objective function. The distribution parameters, such as the mean vector and the covariance matrix, are updated using the candidate solutions and their ranking information; the objective function values are not used directly. Due to its population-based and comparison-based nature, the algorithm is invariant to any strictly increasing transformation of the objective function, in addition to being invariant to scaling, translation, and rotation of the search space Hansen2000ppsn (). These invariance properties guarantee that the algorithm shows exactly the same behavior on a function and on its transformation , where is a strictly increasing function and is a combination of scaling, translation and rotation defined as with a positive real , an -dimensional vector , and an -dimensional orthogonal matrix . These invariance properties are essential to the success of CMA-ES.
The performance evaluation of evolutionary algorithms is often based on empirical studies such as benchmarking on a test function suite Hansen2010geccobbobres (); rios2013derivative () and well-considered performance assessment hansen2014assess (); Krause2017foga (). It is easier to check the performance of an algorithm on a specific problem in simulation than to analyze it mathematically. The invariance properties of an algorithm then generalize an empirical result to a class of infinitely many functions defined by the invariance relation. On the other hand, theoretical studies often require simplification of the algorithms and assumptions on the objective function, because the comparison-based and population-based nature of advanced algorithms and their complex adaptation mechanisms make them difficult to analyze. Nevertheless, theoretical studies lead us to a better understanding of algorithms and reveal the dependency of the performance on the internal parameter settings. For example, the recombination weights in CMA-ES are selected based on the mathematical analysis of an evolution strategy Arnold2005foga ().
Footnote 2: The weights of CMA-ES were set before the publication of Arnold2005foga () because the theoretical result on the optimal weights on the sphere was known earlier.
The theoretical result on the optimal step-size on the sphere function is used to design a box constraint handling technique Hansen2009tec () and a termination criterion for a restart strategy Yamaguchi2017BBOB (). A recent variant of CMA-ES Akimoto2016ppsn () exploits the theoretical result on the optimal rate of convergence of the step-size to estimate the condition number of the product of the covariance matrix and the Hessian matrix of the objective function.
Quality Gain Analysis
Quality gain and progress rate analysis Rechenberg1994 (); Beyer1994ppsn (); BeyerBOOK2001 () measure the expected progress of the mean vector in one step. On the one hand, unlike convergence analysis (e.g., Auger2005tcs ()), analyses based on these quantities do not guarantee convergence and often take a limit to derive an explicit formula. Moreover, the step-size adaptation and the covariance matrix adaptation are not taken into account. On the other hand, one can derive quantitative explicit estimates of these quantities, which is not the case in convergence analysis. The quantitative explicit formulas are particularly useful for understanding the dependency of the expected progress on the parameters of the algorithm, such as the population size, the number of parents, and the recombination weights, which we may not recognize from empirical studies of algorithms. The above-mentioned recombination weights in CMA-ES are derived from the quality gain analysis of evolution strategies Arnold2005foga ().
Although the quality gain analysis is not meant to guarantee the convergence of the algorithm, since it analyzes only the expected improvement of a single step, the progress rate is linked to the convergence rate of algorithms. It is directly related to the convergence rate of an “artificial” algorithm whose step-size is proportional to the distance to the optimum on the sphere function (see e.g., Auger2006gecco ()). Moreover, the convergence rate of this artificial algorithm gives a bound on the convergence rate of algorithms that implement a proper step-size adaptation. For or ESs the bound holds on any function with a unique global optimum; that is, any step-size adaptive -ES optimizing any function with a unique global optimum cannot achieve a convergence rate faster than the convergence rate of the artificial algorithm on the sphere function where the step-size is the distance to the optimum times the optimal constant jebalia2008log (); jebalia2010log (); Auger2015 ().
Footnote 3: More precisely, a -ES optimizing any function (that may have more than one global optimum) cannot converge towards a given optimum faster in the search space than the artificial algorithm with step-size proportional to the distance to .
For algorithms implementing recombination, this bound still holds on spherical functions jebalia2010log (); Auger2015 ().
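As a concrete illustration of this artificial algorithm, the sketch below runs a (1, lambda)-ES on the sphere function with the step-size set proportional to the distance to the optimum. The constant 0.1 and all names are our own choices for illustration, not the optimal constant from the cited works.

```python
import numpy as np

def artificial_es(n=10, lam=6, c=0.1, iters=200, seed=4):
    """(1, lam)-ES on the sphere with the 'artificial' step-size rule
    sigma = c * ||m||, i.e., proportional to the distance to the optimum."""
    rng = np.random.default_rng(seed)
    m = np.ones(n)
    dists = [np.linalg.norm(m)]
    for _ in range(iters):
        sigma = c * np.linalg.norm(m)               # distance-proportional step-size
        x = m + sigma * rng.standard_normal((lam, n))
        m = x[np.argmin(np.sum(x**2, axis=1))]      # keep the best of lam samples
        dists.append(np.linalg.norm(m))
    return np.array(dists)

d = artificial_es()
# the distance to the optimum decreases geometrically, i.e., linearly on a
# log scale, which is exactly the behavior the progress rate quantifies
```

Plotting `np.log(d)` against the iteration counter shows a roughly straight line, the signature of linear convergence.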
In this paper, we investigate ESs with weighted recombination on a general convex quadratic function. An ES with weighted recombination samples multiple candidate solutions at a time and computes the weighted average of the candidate solutions to update the distribution mean vector. Weighted recombination ESs are among the most important categories of ESs, since the standard CMA-ES and most of the recent variants of CMA-ES Ros2008ppsn (); Loshchilov2014 (); Akimoto2016gecco () employ weighted recombination.
The first analysis of weighted recombination ESs was conducted in Arnold2005foga (), where the quality gain was derived on the infinite dimensional sphere function , and the optimal step-size and the optimal recombination weights were obtained. Reference Arnold2007gecco () studied a variant of weighted recombination ESs called the -ES, where stands for intermediate recombination and the recombination weights are equal for the best candidate solutions and zero for the others. The analysis was performed on the quadratic functions with Hessian , where the number of diagonal elements equal to is controlled by the ratio of short axes. Reference Jagerskupper:2006cf () studied the -ES with the one-fifth success rule on the same function and showed a convergence rate of . Reference Finck2009foga () studied the ES with weighted recombination on the same function. Their results, the progress rate and the quality gain, depend on the so-called localization parameter, whose steady-state value is then analyzed to obtain the steady-state quality gain. References beyer2014dynamics (); Beyer2016ec () studied the progress rate and the quality gain of the -ES on the general convex quadratic model.
The quality gain analysis and the progress rate analysis in the above listed references rely on a geometric intuition of the algorithm in the infinite dimensional search space and on various approximations. On the other hand, the rigorous derivation of the progress rate (or of the convergence rate of the algorithm with step-size proportional to the distance to the optimum) on the sphere function, provided for instance in Auger2006gecco (); Auger2015 (); abh2011b (); jebalia:inria-00495401 (), only holds on spherical functions and provides solely a limit, without a bound between the finite dimensional convergence rate and its asymptotic limit. The results of this paper differ in that we consider general weighted recombination on a general convex quadratic objective and cover finite dimensional cases as well as the limit .
We study the weighted recombination ES on a general convex quadratic function on the finite dimensional search space. We investigate the quality gain , that is, the expectation of the relative function value decrease. We decompose as the product of two functions: , a function that depends only on the mean vector of the sampling distribution and the Hessian , and , the so-called normalized quality gain that depends essentially on all the algorithm parameters such as the recombination weights and the step-size. We approximate by an analytically tractable function . We call the asymptotic normalized quality gain. The main contributions are summarized as follows.
First, we derive the error bound between and for finite dimension . To the best of our knowledge, this is the first work that performs the quality gain analysis for finite and provides an error bound. The asymptotic normalized quality gain and the bounds in this paper are improved over the previous work Akimoto2017foga (). Thanks to the explicit error bound derived in the paper, we can treat the population size increasing with and provide (for instance) a rigorous sufficient condition on the dependency between and such that the per-iteration quality gain scales with for algorithms with intermediate recombination BeyerBOOK2001 ().
Second, we show that the error bound between and converges to zero as the learning rate for the mean vector update tends to infinity. We derive the optimal step-size and the optimal recombination weights for , revealing the dependencies of these optimal parameters on and . In contrast, the previous works of quality gain analysis mentioned above take the limit while is fixed, hence assuming . Therefore, they do not reveal the dependencies of and the optimal parameters on when . We validate in experiments that the optimal step-size derived for provides a reasonable estimate of the optimal step-size even for .
Third, we prove that converges toward as under the condition , where is the limit of on the sphere function for derived in Arnold2005foga (). The condition holds, for example, for positive definite with bounded eigenvalues. It also holds for some positive semi-definite and for some positive definite with unbounded eigenvalues, for example with eigenvalues in . The result implies that the optimal recombination weights are independent of , whereas the optimal step-size heavily depends on and the distribution mean. This part of the contribution generalizes the earlier results in beyer2014dynamics (); Beyer2016ec (), but the proof methodology is rather different. Furthermore, the error bound between and derived in this paper allows us to further investigate how fast converges toward as , depending on the eigenvalue distribution of .
This paper is organized as follows. In Section 2, we formally define the evolution strategy with weighted recombination. The quality gain analysis on the infinite dimensional sphere function is revisited. In Section 3, we derive the quality gain bound for a finite dimensional convex quadratic function. In Section 4, important consequences of the quality gain bound are discussed. In Section 5, we conclude our paper. Properties of the normal order statistics that are important to understand our results are summarized in A and the detailed proofs of lemmas are provided in B.
We apply the following mathematical notation throughout the paper. For integers , such that , we denote the set of integers between and (including and ) by . Binomial coefficients are denoted as . For real numbers , such that , the open and the closed intervals are denoted as and , respectively. For an -dimensional real vector , let denote the -th coordinate of . A sequence of length is denoted as , or just as , and an infinite sequence is denoted as . For , the absolute value of is denoted by . For , the Euclidean norm is denoted by . Let be the indicator function, which is if condition is true and otherwise. Let be the cumulative distribution function (c.d.f.) of the (one-dimensional) standard normal distribution . Let be the -th smallest random variable among independent and standard normally distributed random variables, i.e., . The expectation of a random variable (or vector) is denoted as . The conditional expectation of given is denoted as . For a function of random variables , the conditional expectation of given for some is denoted as . Similarly, the conditional expectation of given and for different is denoted as .
2.1 Evolution Strategy with Weighted Recombination
We consider an evolution strategy with weighted recombination. At each iteration , it draws independent random vectors from the -dimensional standard normal distribution , where is the zero vector and is the identity matrix of dimension . The candidate solutions are computed as , where is the mean vector and is the standard deviation, also called the step-size or the mutation strength. The candidate solutions are evaluated on a given objective function . Without loss of generality (w.l.o.g.), we assume that is to be minimized. Let be the index of the -th best candidate solution among , i.e., , and let be the real-valued recombination weights. W.l.o.g., we assume . Let denote the so-called variance effective selection mass. The mean vector is updated according to
where is the learning rate of the mean vector update.
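A single step of the update above can be sketched in code as follows; the function and variable names (`es_step`, `c_m`, etc.) are ours, and the weights in the usage example are an arbitrary illustration, not the optimal ones.

```python
import numpy as np

def es_step(m, sigma, weights, f, c_m=1.0, rng=None):
    """One step of a weighted recombination ES (illustrative sketch).

    m: mean vector; sigma: step-size (mutation strength); weights:
    recombination weights sorted from best to worst rank; f: objective
    to minimize; c_m: learning rate of the mean vector update.
    """
    rng = rng or np.random.default_rng(0)
    lam, n = len(weights), len(m)
    z = rng.standard_normal((lam, n))           # independent N(0, I) samples
    x = m + sigma * z                           # candidate solutions
    order = np.argsort([f(xi) for xi in x])     # indices of best, 2nd best, ...
    step = sum(w * z[order[k]] for k, w in enumerate(weights))
    return m + c_m * sigma * step               # weighted recombination update

# usage on the sphere function with example weights summing to one
sphere = lambda x: float(x @ x)
w = np.array([0.5, 0.3, 0.2, 0.0])
m = es_step(np.ones(5), 0.3, w, sphere)
```

Note that the recombination is applied to the standard normal samples `z`, scaled by `c_m * sigma`, matching the update with a learning rate for the mean vector.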
In this paper we reformulate (1) to investigate the algorithm with mathematical rigor. Hereunder, we write the candidate solutions, , and the corresponding random vectors, , as sequences and for short. First, we introduce the weight function
i.e., and are the numbers of strictly and weakly better candidate solutions than , respectively. The weight value for is the arithmetic average of the weights for the tie candidate solutions. In other words, all the tie candidate solutions have the same weight values. If there is no tie, the weight value for the -th best candidate solution is simply . In the following, we drop the subscripts and the superscripts for sequences unless they are unclear from the context and write simply as . With the weight function, we rewrite the mean vector update (1) as
The above update (3) is equivalent to the original update (1) if there is no tie among the candidate solutions. If the objective function is a convex quadratic function, there is no tie with probability one; therefore, the two updates are equivalent with probability one. Algorithm 1 summarizes a single step of the algorithm, where we rewrite (3) by using .
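A direct implementation of this tie-averaging weight function might look as follows (function and variable names are ours): for each candidate, the number of strictly better and the number of weakly better candidates determine the range of ranks the candidate (jointly) occupies, and its weight is the arithmetic mean of the base weights over that range.

```python
import numpy as np

def tie_weights(fvals, w):
    """Per-candidate weights under the tie-averaging rule (sketch).

    fvals: objective values of the lambda candidates; w: base weights
    sorted from best to worst rank. Tied candidates share the arithmetic
    mean of the base weights of the ranks they jointly occupy; with no
    ties this reduces to assigning w[k] to the (k+1)-th best candidate.
    """
    fvals = np.asarray(fvals)
    out = np.empty(len(fvals))
    for i, fi in enumerate(fvals):
        strictly = np.sum(fvals < fi)           # strictly better candidates
        weakly = np.sum(fvals <= fi)            # weakly better ones (incl. i)
        out[i] = np.mean(w[strictly:weakly])    # average over the tied ranks
    return out

print(tie_weights([3.0, 1.0, 2.0], np.array([0.6, 0.3, 0.1])))  # no tie
print(tie_weights([1.0, 1.0, 2.0], np.array([0.6, 0.3, 0.1])))  # first two tied
```

In the first call each candidate gets exactly its rank's weight; in the second, the two tied candidates each receive (0.6 + 0.3) / 2 = 0.45.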
The above formulation is motivated by two considerations. The first is to make the update well defined even when there is a tie: in our formulation, tied candidate solutions receive equal recombination weights. The second is technical. In (1), the sorted candidate solutions are all correlated and are no longer normally distributed. However, they are assumed to be normally distributed in the previous works Arnold2005foga (); beyer2014dynamics (); Beyer2016ec (). To ensure that such an approximation leads to the asymptotically true quality gain limit, a mathematically involved analysis has to be carried out; see Auger2006gecco (); abh2011b (); jebalia:inria-00495401 () for details. In (3), the weight function explicitly includes the ranking computation and the are still independent and normally distributed. This allows us to derive the quality gain on a convex quadratic function rigorously.
2.2 Quality Gain Analysis on the Spherical Function
The quality gain is defined as the expectation of the relative decrease of the function value. Formally, it is the conditional expectation of the relative decrease of the function value conditioned on the mean vector and the step-size , defined as follows.
The quality gain of Algorithm 1 given and is
where is (one of) the global minimum point of . Note that the quality gain depends also on , , and the dimension .
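The quality gain defined above, i.e., the expected relative decrease of the function value given the mean vector and the step-size, can be estimated by straightforward Monte Carlo simulation. The sketch below does this for the sphere function with its optimum at the origin; the trial count and all names are our own choices.

```python
import numpy as np

def quality_gain_mc(m, sigma, weights, c_m=1.0, trials=2000, seed=1):
    """Monte Carlo estimate of the quality gain on the sphere f(x) = ||x||^2.

    Returns the empirical average of (f(m) - f(m_new)) / f(m) over
    independently simulated single steps of the weighted recombination ES.
    """
    rng = np.random.default_rng(seed)
    n, lam = len(m), len(weights)
    f = lambda x: float(x @ x)
    gains = []
    for _ in range(trials):
        z = rng.standard_normal((lam, n))
        order = np.argsort([f(m + sigma * zi) for zi in z])
        step = sum(w * z[order[k]] for k, w in enumerate(weights))
        m_new = m + c_m * sigma * step
        gains.append((f(m) - f(m_new)) / f(m))
    return float(np.mean(gains))

# a small step-size sweep: the gain peaks at an intermediate step-size
m = np.ones(20)
w = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])
for s in (0.01, 0.1, 1.0):
    print(s, quality_gain_mc(m, s, w))
```

The sweep reproduces, on a small scale, the qualitative picture of the analysis: too small a step-size wastes progress, too large a step-size makes the expected gain negative, and the optimum lies in between.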
respectively. This normalization of the step-size suggests that is proportional to and inversely proportional to and to . The normalized step-size is proportional to the ratio between the actual step-size and the distance between the current mean and the optimal solution. This reflects the scale invariance of the algorithm on the sphere function; that is, the single-step response is solely determined by the normalized step-size. The dimension in the numerator implies that the step-size needs to be inversely proportional to . The normalized quality gain is simply the quality gain given scaled by . The scaling by reflects the fact that the convergence speed cannot exceed for any comparison-based algorithm Teytaud2006ppsn (). By taking , the normalized quality gain converges pointwise (w.r.t. ) to
where denotes the normalized step-size optimizing given and is given by
Consider the optimal recombination weights that maximize in (6). The optimal recombination weights are given independently of by
and is written as
Note that . Given and , we achieve the maximal value of that is .
The optimal normalized step-size (7) and the normalized quality gain (6) given depend on . In particular, they are proportional to . For instance, under the optimal weights (8), we have for a sufficiently large .
Footnote 4: We used the facts and . See A for details.
Then, from (6) and (7) we know and . Moreover, using the relation , we can rephrase this as: the optimal step-size and the normalized quality gain given are proportional to . Figure 1 shows how scales with when the optimal step-size is set. This shows that the normalized quality gain, and hence the optimal normalized step-size, are proportional to for standard weight schemes. When the optimal weights are used, goes up to as increases. On the other hand, nonnegative weights cannot achieve a value of above . The CMA-type weights are designed to approximate the optimal nonnegative weights, where the first half of the weights are proportional to the optimal setting and the last half are zero. The truncation weights result in a smaller normalized quality gain. It is shown in BeyerBOOK2001 () that the truncation weights achieve .
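If, as in the sphere-function analysis of Arnold2005foga (), the optimal weight for rank k is taken to be proportional to the negated expectation of the k-th smallest of lambda standard normal variables, these expectations can be estimated by plain Monte Carlo. This sketch assumes that characterization; the normalization by the sum of absolute values is our own choice (the weights are only determined up to a positive factor).

```python
import numpy as np

def optimal_weights_mc(lam, trials=20000, seed=2):
    """Estimate weights proportional to negated expected normal order statistics.

    Draws `trials` batches of lam standard normal variables, sorts each
    batch, and averages columnwise to estimate E[N_{k:lam}]; the k-th
    weight is then proportional to -E[N_{k:lam}].
    """
    rng = np.random.default_rng(seed)
    samples = np.sort(rng.standard_normal((trials, lam)), axis=1)
    w = -samples.mean(axis=0)          # w_k proportional to -E[N_{k:lam}]
    return w / np.abs(w).sum()         # arbitrary scale-fixing normalization

w = optimal_weights_mc(10)
# weights decrease monotonically, are positive for the best ranks and
# negative for the worst, and sum to (approximately) zero by symmetry
```

The negative weights for the worst ranks are exactly the feature that nonnegative weight schemes, such as the CMA-type truncation weights, give up.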
The normalized quality gain limit depends only on the normalized step-size and the weights . Since the normalized step-size does not change if we multiply by some factor and divide by the same factor, has no impact on , and hence on the quality gain . This is counterintuitive, and it does not hold in a finite dimensional space. The step-size determines the standard deviation of the sampling distribution and thus affects the ranking of the candidate solutions. On the other hand, the product is the step-size of the -update, which depends on the ranking of the candidate solutions. The normalized quality gain limit provided above tells us that the ranking of the candidate solutions is independent of in the infinite dimensional space. We will discuss this further in Section 4.
The quality gain measures the improvement in one iteration. If we generate and evaluate candidate solutions every iteration, the quality gain per evaluation (-call) is times smaller, i.e., the quality gain per evaluation is rather than . This implies that the number of iterations required to achieve the same amount of quality gain is inversely proportional to . This is the best we can hope for when the algorithm is implemented on a parallel computer. However, since the above result is obtained in the limit while is fixed, it is implicitly assumed that . The optimal down-scaling of the number of iterations indeed only holds for . In practice, the quality gain per iteration tends to level out as increases. We will revisit this point in Section 4 and see how the optimal values for and depend on and when both are finite.
3 Quality Gain Analysis on General Quadratic Functions
In this section we investigate the normalized quality gain of Algorithm 1 minimizing a quadratic function with its Hessian assumed to be nonnegative definite and symmetric, i.e.,
where is the global optimal solution.
Footnote 5: We use the following terminology in this paper. A nonnegative definite matrix is a matrix having only nonnegative eigenvalues, i.e., for all . A nonnegative definite matrix is called positive definite if for all ; otherwise, it is called positive semi-definite. If is positive semi-definite, the optimum is not unique.
W.l.o.g., we assume .
Footnote 6: None of the algorithmic components and quality measures used in this paper are affected by multiplying a positive constant to , or equivalently to . To consider a general , simply replace with in the remainder of the paper.
For the sake of notational simplicity, we denote the directional vector of the gradient of at by . To make the dependency of on clear, we sometimes write it as .
3.1 Normalized Quality Gain and Normalized Step-Size
We introduce the normalized step-size and the normalized quality gain. First of all, if the objective function is homogeneous around the optimal solution , the optimal step-size must be a homogeneous function of degree with respect to . This is formally stated in the following proposition. The proof is found in B.1.
Let be a homogeneous function of degree , i.e., for a fixed integer for any and any . Consider Algorithm 1 minimizing a function . Then, the quality gain is scale-invariant, i.e., for any . Moreover, the optimal step-size , if it is well-defined, is a function of . For the sake of simplicity we write the optimal step-size as a map . It is a homogeneous function of degree , i.e., for any .
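The scale invariance stated in the proposition can be checked numerically for the sphere function, which is homogeneous of degree 2: scaling both the mean vector and the step-size by the same factor leaves the relative function-value decrease of every sample, and hence the ranking and the quality gain, unchanged. The names and constants below are our own illustration.

```python
import numpy as np

# The same standard normal samples z are reused for both parameter settings;
# homogeneity of f gives f(g*(m + s*z)) = g^2 * f(m + s*z), so the relative
# decreases (f(m) - f(m + s*z)) / f(m) coincide exactly.
f = lambda x: float(x @ x)                 # sphere: homogeneous of degree 2
rng = np.random.default_rng(5)
n, lam, gamma = 5, 4, 7.0
m, sigma = np.ones(n), 0.4
z = rng.standard_normal((lam, n))

rel = lambda mv, s: [(f(mv) - f(mv + s * zi)) / f(mv) for zi in z]
assert np.allclose(rel(m, sigma), rel(gamma * m, gamma * sigma))
```

Because the relative decreases agree sample by sample, taking expectations shows that the quality gain itself is invariant under the scaling, which is the content of the proposition.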
Note that the quadratic function is homogeneous of degree , and the function is homogeneous of degree around . The latter is our candidate for the optimal step-size. We define the normalized step-size, the scale-invariant step-size, and the normalized quality gain for a quadratic function as follows.
For a convex quadratic function (10), the normalized step-size and the scale-invariant step-size given are defined as and .
Let be the -dependent scaling factor of the normalized quality gain defined as . The normalized quality gain for a quadratic function is defined as .
Note that the normalized step-size and the normalized quality gain defined above coincide with (5) if , where , and . Moreover, they are equivalent to Eq. (4.104) in BeyerBOOK2001 (), introduced to analyze the -ES and the -ES. The same normalized step-size has been used for the -ES beyer2014dynamics (); Beyer2016ec (). See Section 4.3.1 of BeyerBOOK2001 () for the motivation of these normalizations.
Non-Isotropic Gaussian Sampling
Throughout the paper, we assume that the multivariate normal sampling distributions have an isotropic covariance matrix. We can generalize all the following results to an arbitrary positive definite symmetric covariance matrix by considering a linear transformation of the search space. Indeed, let , and consider the coordinate transformation . In the latter coordinate system the function can be written as . The multivariate normal distribution is transformed into by the same transformation. Then, it is easy to prove that the quality gain on the function given the parameter is equivalent to the quality gain on the function given . The normalization factor of the quality gain and the normalized step-size are then rewritten as
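The coordinate-transformation argument above can be verified numerically: sampling with a general covariance (written as A Aᵀ via a Cholesky factor) for the original objective yields exactly the same function values, and hence the same rankings and quality gain, as isotropic sampling for the transformed objective. The particular Hessian and covariance below are our own examples.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
H = np.diag([1.0, 2.0, 4.0, 8.0])           # example Hessian (ours)
C = np.linalg.inv(H)                        # an example positive definite covariance
A = np.linalg.cholesky(C)                   # C = A @ A.T
f = lambda x: 0.5 * float(x @ H @ x)        # original quadratic objective
g = lambda y: f(A @ y)                      # objective in transformed coordinates

m, sigma = np.ones(n), 0.3
z = rng.standard_normal(n)
lhs = f(m + sigma * (A @ z))                # sampling with covariance C
rhs = g(np.linalg.solve(A, m) + sigma * z)  # isotropic sampling for g
assert np.isclose(lhs, rhs)                 # equal values -> equal rankings
```

The identity g(A⁻¹m + sigma z) = f(m + sigma A z) holds deterministically for every sample, which is why all the results for the isotropic case carry over to an arbitrary positive definite covariance.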
3.2 Conditional Expectation of the Weight Function
The quadratic objective (10) can be written as
where , and and are the conditional expectations given and , respectively.
The following lemma provides the expression of the conditional expectation of the weight function, which allows us to derive the bound for the difference between and . In the following, let
denote the probability mass functions of the binomial and trinomial distributions, respectively, where , , , , and . The proof of the lemma is provided in B.2.
Let and be i.i.d. copies of . Let be the c.d.f. of the function value . Then, we have for any , ,
Thanks to Lemma 5 and the fact that are i.i.d., we can further rewrite the normalized quality gain as
Here and are independent and -distributed, and and , where is the scale-invariant step-size. Note that and are independent and -distributed.
The following lemma shows the Lipschitz continuity of , , and . The proof is provided in B.3.
The functions , , and are -Lipschitz continuous, i.e., , , and , with the Lipschitz constants
Upper bounds for the above Lipschitz constants are discussed in B.4.
3.3 Theorem: Normalized Quality Gain on Convex Quadratic Functions
The following main theorem provides the error bound between and .
Consider Algorithm 1 and let be a convex quadratic objective function (10). Let the normalized step-size and the normalized quality gain be as defined in Definition 3 and Definition 4, respectively. Let and . Define
and let , , be the Lipschitz constants of , and defined in Lemma 5, respectively. Then,
The above theorem claims that if the right-hand side (RHS) of (18) is sufficiently small, then the normalized quality gain is well approximated by the asymptotic normalized quality gain defined in (17). Compared to in (6), derived for the infinite dimensional sphere function, is different even when . We investigate the properties of in Section 4.1. The situations where the RHS of (18) is sufficiently small are discussed in Section 4.2 and Section 4.3. We remark that Theorem 3.4 in Akimoto2017foga () provides a bound for the difference between and , instead of the difference between and . Introducing allows us to consider a finite dimensional case and to derive a tighter bound.
3.4 Outline of the Proof of the Main Theorem
In the remainder of this section and in B, let , , and for . Then, and , and they are independent. Define