Analysis of a Natural Gradient Algorithm on Monotonic Convex-Quadratic-Composite Functions
In this paper we investigate the convergence properties of a variant of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Our study is based on the recent theoretical foundation that the pure rank-$\mu$ update CMA-ES performs natural gradient descent on the parameter space of Gaussian distributions. We derive a novel variant of the natural gradient method where the parameters of the Gaussian distribution are updated along the natural gradient so as to improve a newly defined function on the parameter space. We study this algorithm on composites of a monotone function with a convex quadratic function. We prove that our algorithm adapts the covariance matrix so that it becomes proportional to the inverse of the Hessian of the original objective function. We also derive the speed of covariance matrix adaptation and the speed of convergence of the parameters. We introduce a stochastic algorithm that approximates the natural gradient with finite samples and present simulated results to evaluate how precisely the stochastic algorithm approximates the deterministic, ideal one under finite samples and to see how similarly our algorithm and the CMA-ES perform.
GECCO’12, July 7–11, 2012, Philadelphia, Pennsylvania, USA. Copyright 2012 ACM 978-1-4503-1177-9/12/07.
Categories and Subject Descriptors: G.1.6 [Numerical Analysis]: Optimization (Global optimization, Gradient methods, Unconstrained optimization); F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems.
Errata in the original GECCO’2012 paper (which have been revised in this version)
Eq. (21) in Theorem 4
In the original paper:
In this version:
The paragraph after (24)
In the original paper:
Moreover, since the right-hand side of the above inequality is maximized when is minimized and is bounded from below by because of (23), we have
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [15, 14, 13]) is a stochastic search algorithm for non-separable and ill-conditioned black-box continuous optimization. In the CMA-ES, search points are generated from a Gaussian distribution, and the mean vector and the covariance matrix of the Gaussian distribution are adapted by using the sampled points and their objective value ranking. These update rules are designed to enhance the probability of generating superior points in the next iteration, in a way similar to but slightly different from (weighted) maximum likelihood estimation. Adaptive ESs, including the CMA-ES, have been successfully applied in practice. However, their theoretical analysis is complicated even on simple functions, and linear convergence has been proven only for algorithms much simpler than the CMA-ES [6, 17].
Recent studies [1, 11] demonstrate the link between the parameter update rules in the CMA-ES and the natural gradient method, the latter of which is the steepest ascent/descent method on a Riemannian manifold and is often employed in machine learning [22, 8, 23, 2, 21, 4]. The natural gradient view of the CMA-ES has been developed and extended, and the Information-Geometric Optimization (IGO) algorithm has been introduced as a unified framework of natural gradient based stochastic search algorithms. Given a family of probability distributions parameterized by , the IGO transforms the original objective function, , into a fitness function, , defined on the parameter space. The IGO algorithm performs a natural gradient ascent aiming at maximizing . For the family of Gaussian distributions, the IGO algorithm recovers the pure rank-$\mu$ update CMA-ES; for the family of Bernoulli distributions, PBIL is recovered. The IGO algorithm can be viewed as the deterministic model of the recovered stochastic algorithm in the limit of the number of sample points going to infinity.
The IGO offers a mathematical tool for analyzing the behavior of stochastic algorithms. In this paper, we analyze the behavior of the deterministic model of the pure rank-$\mu$ update CMA-ES, which is slightly different from the IGO algorithm. We are interested in knowing what the target matrix of the covariance matrix update is and how fast the covariance matrix learns the target. The CMA is designed to solve ill-conditioned objective functions efficiently by adapting the metric (the covariance matrix in the CMA-ES), as do other variable metric methods such as quasi-Newton methods. The speed of optimization depends to a great extent on the precision and the speed of metric adaptation. There is a lot of empirical evidence that the covariance matrix tends to become proportional to the inverse of the Hessian matrix of the objective function in the CMA-ES. However, this has not been mathematically proven yet. We are also interested in the speed of convergence of the mean vector and the covariance matrix. Convergence of the CMA-ES has not been reported up to this time. We tackle these issues in this work.
In this paper, we derive a novel natural gradient algorithm in a similar way to the IGO algorithm, where the objective function is transformed to a function in a different way from the IGO so that we can derive the explicit form of the natural gradient for composite functions of a strictly increasing function and a convex quadratic function. We call the composite functions monotonic convex-quadratic-composite functions. The resulting algorithm inherits important properties of the IGO and the CMA-ES, such as invariance under monotone transformation of the objective function and invariance under affine transformation of the search space. We theoretically study this natural gradient method on monotonic convex-quadratic-composite functions. We prove that the covariance matrix adapts to be proportional to the inverse of the Hessian matrix of the objective function. We also investigate the speed of the covariance matrix adaptation and the speed of convergence of the parameters.
The rest of the paper is organized as follows. In Section 2 we propose a novel natural gradient method and present a stochastic algorithm that approximates the natural gradient from finite samples. The basic properties of both algorithms are described. In Section 3 we study the convergence properties of the deterministic algorithm on monotonic convex-quadratic-composite functions. The convergence of the condition number of the product of the covariance matrix and the Hessian matrix of the objective function to one and its linear convergence are proven. Moreover, the rate of convergence of the parameter is shown. In Section 4, we conduct experiments to see how accurately the stochastic algorithm approximates the deterministic algorithm and to see how similarly our algorithm and the CMA-ES behave on a convex quadratic function. Finally, we summarize and conclude this paper in Section 5.
2 The algorithms
We first introduce a generic framework of the natural gradient algorithm that includes the IGO algorithm.
The original objective is to minimize , where is a metric space. Let and be the Borel -field and a measure on . Hereunder, we assume that is -measurable. Let represent any monotonically increasing set function on , i.e., for any , s.t. . We transform to an invariant cost function defined as . Given a family of probability distributions on , we define a quasi-objective function on the parameter space as the expected value of over , namely
Our algorithm performs the natural gradient descent on a Riemannian manifold equipped with the Fisher metric . The Fisher metric is the unique metric that does not depend on the choice of parameterization . The natural gradient—the gradient taken w.r.t. the Fisher metric—is given by the product of the inverse of the Fisher information matrix and the “vanilla” gradient of the function. Therefore, the natural gradient of is and the parameter update follows
where is the learning rate.
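In standard notation (the symbols below are ours, matching the verbal description above: parameter $\theta$, quasi-objective $J$, Fisher information matrix $F$, learning rate $\eta$), the natural gradient update (1) reads:

```latex
\theta^{(t+1)} \;=\; \theta^{(t)} \;-\; \eta\, F(\theta^{(t)})^{-1}\, \nabla_{\theta} J(\theta^{(t)})
```

that is, a steepest-descent step measured in the Fisher metric instead of the Euclidean metric.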
2.1 Deterministic NGD Algorithm on
In the following, we focus on the optimization in . Thus, , is the Lebesgue measure on , and is the Borel -field on .
We choose as the sampling distribution the Gaussian parameterized by , where the mean vector is in and the covariance matrix is a symmetric and positive definite matrix of dimension .
We define the invariant cost by using the Lebesgue measure as . Then, the infimum of is zero, attained on a boundary of the domain where the mean vector equals the global minimum point of and the covariance matrix is zero.
The choice of the parameterization of Gaussian distributions affects the behavior of the natural gradient update (1) with a finite learning rate , although the steepest direction of on the statistical manifold is invariant under the choice of parameterization. We choose the mean vector and the covariance matrix of the Gaussian distribution as the parameter, just as they are chosen in the CMA-ES and in other algorithms such as EMNA  and the cross entropy method . Let , where is the vectorization of such that the (, )th element of corresponds to the th element of (see ). Then the Fisher information matrix has an analytical form
where denotes the Kronecker product operator. Under some regularity conditions for the exchange of integration and differentiation we have
where is the log-likelihood. The gradient of the log-likelihood can be written as
Then, the natural gradient at can be written blockwise as
With different learning rates for mean vector and covariance matrix updates, the natural gradient descent (1) reads
We refer to (5) as the deterministic natural gradient descent (NGD) method.
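For reference, the closed forms for Gaussian families under the mean/covariance parameterization are well known; a sketch in our own notation (mean $m$, covariance $C$, invariant cost $W_f$, expectation $\mathbb{E}_\theta$ under the current distribution) is:

```latex
% Fisher information matrix, block-diagonal in (m, vec(C)):
F(\theta) \;=\; \begin{pmatrix} C^{-1} & 0 \\ 0 & \tfrac{1}{2}\, C^{-1} \otimes C^{-1} \end{pmatrix}

% Natural gradient of J(\theta) = \mathbb{E}_\theta[W_f(x)], written blockwise:
\tilde{\nabla}_m J \;=\; \mathbb{E}_\theta\big[\, W_f(x)\,(x-m) \,\big], \qquad
\tilde{\nabla}_C J \;=\; \mathbb{E}_\theta\big[\, W_f(x)\,\big((x-m)(x-m)^{\top} - C\big) \,\big]

% Update with separate learning rates \eta_m and \eta_c:
m \;\leftarrow\; m - \eta_m\, \tilde{\nabla}_m J, \qquad
C \;\leftarrow\; C - \eta_c\, \tilde{\nabla}_C J
```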
2.2 Stochastic NGD Algorithm on
When is not given, we need to estimate the gradient. We approximate the natural gradient and simulate the natural gradient descent as follows. Initialize the mean vector and the covariance matrix and repeat the following steps until some termination criterion is satisfied:
1. Compute the eigenvalue decomposition of , , where is an orthogonal matrix and is a diagonal matrix such that .
2. Compute the square root of , .
3. Generate normal random vectors .
4. Compute , for .
5. Evaluate the objective values .
6. Estimate the invariant costs of the samples.
7. Compute the baseline .
8. Compute the weights .
9. Estimate the natural gradient and as
10. Compute the learning rates and .
11. Update the parameters as and .
We refer to this algorithm as the stochastic NGD algorithm.
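The sampling-and-update loop above can be sketched in Python. This is one possible concretization, not the paper's exact algorithm: the invariant costs are estimated by importance-weighted Monte Carlo (the volume of the sublevel set), while the learning-rate rule based on the largest eigenvalue of the matrix in (8) is replaced by fixed rates together with a standard-deviation normalization of the centered weights; all names are ours.

```python
import numpy as np

def stochastic_ngd(f, m, C, n_iter=200, n_samples=100,
                   eta_m=0.1, eta_c=0.02, rng=None):
    """Sketch of the stochastic NGD loop of Section 2.2 (assumptions: fixed
    learning rates and weight normalization replace the eigenvalue-based rule)."""
    rng = np.random.default_rng() if rng is None else rng
    d = m.size
    for _ in range(n_iter):
        # Steps 1-2: eigendecomposition and matrix square root of C.
        eigvals, B = np.linalg.eigh(C)
        eigvals = np.clip(eigvals, 1e-12, None)       # numerical guard
        sqrtC = B @ np.diag(np.sqrt(eigvals)) @ B.T
        # Steps 3-4: x_i = m + C^{1/2} z_i with z_i standard normal.
        Z = rng.standard_normal((n_samples, d))
        X = m + Z @ sqrtC.T
        # Step 5: objective values.
        fx = np.array([f(x) for x in X])
        # Step 6: Monte-Carlo estimates of the invariant costs, i.e. the
        # Lebesgue volume of {y : f(y) <= f(x_i)}, via weights 1/p(x_j).
        log_p = (-0.5 * np.sum(Z ** 2, axis=1)
                 - 0.5 * np.sum(np.log(eigvals))
                 - 0.5 * d * np.log(2.0 * np.pi))
        inv_p = np.exp(-log_p)
        leq = fx[None, :] <= fx[:, None]              # 1{f(x_j) <= f(x_i)}
        W = (leq * inv_p[None, :]).mean(axis=1)
        # Steps 7-8: baseline (mean cost) and centered weights;
        # the scale normalization is our own stabilization.
        w = W - W.mean()
        w = w / (w.std() + 1e-12)
        # Step 9: natural gradient estimates in the (m, C) coordinates.
        Y = X - m
        g_m = (w[:, None] * Y).mean(axis=0)
        g_C = np.einsum('i,ij,ik->jk', w, Y, Y) / n_samples - w.mean() * C
        # Steps 10-11: descent step with fixed learning rates;
        # re-symmetrize C against floating-point round-off.
        m = m - eta_m * g_m
        C = C - eta_c * g_C
        C = 0.5 * (C + C.T)
    return m, C

# Demo on an ill-conditioned convex quadratic.
f = lambda x: float(x[0] ** 2 + 10.0 * x[1] ** 2)
m_end, C_end = stochastic_ngd(f, np.array([3.0, 2.0]), np.eye(2),
                              rng=np.random.default_rng(0))
```

With a quadratic objective the mean is driven toward the minimizer and the covariance stays symmetric positive definite throughout.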
This algorithm generates samples from in steps 1–4 and evaluates their objective values in step 5. In step 6, the invariant costs are evaluated. The estimates are obtained as follows. By definition we have
Applying Monte-Carlo approximation we have
Since , we have the estimates in step 6. Step 7 computes the baseline, which is often introduced to reduce the estimation variance of gradients while adding no bias . We simply choose the mean value of the as the baseline. Replacing the expectation in (5) with the sample mean and adding the baseline (in step 8), we have the Monte-Carlo estimate of the natural gradient in step 9. Finally, in step 11, we update the parameters along the estimated natural gradient with the learning rates computed in step 10. The learning rates are chosen in the following so that they are inversely proportional to the largest eigenvalue of the following matrix
2.3 Difference from the IGO
The difference between the IGO algorithm and our deterministic algorithm is that the invariant cost in the IGO algorithm is defined by the negative of the weighted quantile, , where is a non-increasing weight function. Since the quantile depends on the current parameter , the invariant cost for each in the IGO algorithm changes from iteration to iteration, whereas it is fixed in our algorithm. This property makes our algorithm easier to analyze mathematically.
The difference between our stochastic algorithm and the pure rank-$\mu$ update CMA-ES  is the same as the difference between the deterministic one and the IGO algorithm. The pure rank-$\mu$ update CMA-ES approximates the quantile value by the number of better solutions divided by the number of samples , . Therefore, the pure rank-$\mu$ update CMA-ES follows the same lines as the stochastic NGD algorithm described in Section 2.2 with the weights .
In Section 4 we compare the stochastic NGD algorithm with the pure rank-$\mu$ update CMA-ES where
2.4 Basic Properties
Invariance. Our algorithms inherit two important invariance properties from the IGO and the CMA-ES: invariance under monotonic transformation of the objective function and invariance under affine transformation of the search space (with the same transformation of the initial parameters). Invariance under monotonic transformation of the objective function makes the algorithm perform equally on a function and on any composite function where is any strictly increasing function. For example, the convex sphere function is equivalent to the non-convex function for this algorithm, whereas conventional gradient methods, e.g., Newton's method, assume the convexity of the objective function and require a fine line search to solve non-convex functions. This invariance property is obtained as a result of the transformation . Invariance under affine transformation of the search space is the essence of variable metric methods such as Newton's method. By adapting the covariance matrix, this algorithm attains performance that is essentially independent of the conditioning of the objective function.
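The monotone-transformation invariance can be checked numerically: Monte-Carlo invariant-cost estimates of the kind used in Section 2.2 depend on the objective only through pairwise comparisons of function values, which any strictly increasing transformation preserves. A minimal sketch (function names ours; the uniform importance weights are a simplifying assumption standing in for the density-dependent weights of the actual estimator, which do not depend on the objective either):

```python
import numpy as np

def invariant_cost_estimates(f, X, importance):
    """Weighted count of samples with smaller-or-equal f-value: depends on
    f only through the comparisons f(x_j) <= f(x_i)."""
    fx = np.array([f(x) for x in X])
    leq = fx[None, :] <= fx[:, None]        # leq[i, j] = 1{f(x_j) <= f(x_i)}
    return (leq * importance[None, :]).mean(axis=1)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))            # sample points
importance = np.ones(50)                    # uniform placeholder weights (assumption)

f = lambda x: x @ x                         # convex sphere function
g_of_f = lambda x: (x @ x) ** 0.25          # strictly increasing, non-convex composite

W1 = invariant_cost_estimates(f, X, importance)
W2 = invariant_cost_estimates(g_of_f, X, importance)
assert np.allclose(W1, W2)                  # identical estimates
```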
Positivity. The covariance matrix of the Gaussian distribution must be positive definite and symmetric at each iteration. The next proposition gives the condition on the learning rate such that the covariance matrix is always positive definite symmetric.
Suppose that the learning rate for the covariance update in the deterministic NGD algorithm, where denotes the largest eigenvalue of the argument matrix. If is positive definite symmetric, is positive definite symmetric for each . Similarly, if in the stochastic NGD algorithm, where is defined in (8), and if is positive definite symmetric, the same result holds.
Consider the deterministic case (5). Suppose that is positive definite and symmetric. Then,
Since by the assumption, all the eigenvalues of are smaller than one. Thus, the matrix inside the brackets is positive definite symmetric and hence is positive definite symmetric. By mathematical induction, we have that is positive definite and symmetric for all . The analogous result for the stochastic case is obtained in the same way.
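The congruence argument in this proof can be checked numerically. A minimal sketch (the matrices and the 0.99 safety factor are our own; the symmetric matrix G plays the role of the expectation term in the covariance update):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# A positive definite symmetric covariance C and a symmetric matrix G
# standing in for the expectation term of the covariance update.
M = rng.standard_normal((d, d))
C = M @ M.T + 4.0 * np.eye(d)               # positive definite symmetric
G = rng.standard_normal((d, d))
G = 0.5 * (G + G.T)                          # symmetric, possibly indefinite

# Writing C - eta*G = C^{1/2} (I - eta * C^{-1/2} G C^{-1/2}) C^{1/2},
# the update stays positive definite whenever eta times the largest
# eigenvalue of the inner matrix is below one.
w, B = np.linalg.eigh(C)
sqrtC_inv = B @ np.diag(w ** -0.5) @ B.T
lam = np.linalg.eigvalsh(sqrtC_inv @ G @ sqrtC_inv)
eta = 0.99 / lam.max() if lam.max() > 0 else 1.0  # just inside the admissible range

C_new = C - eta * G
assert np.linalg.eigvalsh(C_new).min() > 0   # still positive definite
```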
Consistency. The gradient estimator (6) is not necessarily unbiased, yet it is consistent, as shown in the following proposition. Therefore, one can expect that the stochastic NGD approximates the deterministic NGD well when the sample size is large. Let be the natural gradient operator.
By the Cauchy-Schwarz inequality we have that . Note that . By Jensen’s inequality we have that . Therefore, (10) implies
Define and decompose as
First, we show the following almost sure convergence
By the definition of , we have
and since (12) implies almost everywhere, we have by LLN
almost surely and almost everywhere in . This implies almost everywhere in .
For , we have
Since almost everywhere in as , we have almost everywhere in as . By Lebesgue's dominated convergence theorem we have as . Therefore, we have that for large enough. Then, by applying LLN, we have that the rightmost side of (15) converges to as , and this expectation converges to as . This ends the proof of (14).
Now we can obtain the almost sure convergence of the third term of (13) to zero. Indeed, the almost sure convergence (14) implies that the limit agrees with and we have from (12) and LLN that . Also, by LLN we have that as . Therefore, the third term of (13) converges to zero almost surely.
To show the convergence of the second term of (13) to zero, we apply the Cauchy-Schwarz inequality to it and we have
By LLN we have that the second term on the right-hand side converges to . So it remains to prove that the first term on the right-hand side converges to zero almost surely. The proof of this convergence is done in the same way as above with replacing .
3 Convergence Properties of the Deterministic NGD Algorithm
We investigate the convergence properties of the deterministic NGD algorithm (5) on a monotonic convex-quadratic-composite function , where is any strictly increasing function and is a positive definite symmetric matrix.
The natural gradient can be written as
Since is equivalent to the volume of the ellipsoid , we have that
where denotes the volume of the sphere with radius and is proportional to . Therefore . Since the proportionality constant is independent of , we have
Differentiating both sides of the above relation, we have
Premultiplying by , we obtain the intended result.
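Concretely, the volume computation in the proof can be written out in our own notation (minimizer $x^{\ast}$, positive definite matrix $A$, dimension $d$, unit-ball volume $V_d$): for $f(x) = g\big((x-x^{\ast})^{\top} A\,(x-x^{\ast})\big)$ the sublevel set of $x$ is an ellipsoid, so

```latex
W_f(x) \;=\; \mathrm{Vol}\{\, y : f(y) \le f(x) \,\}
       \;=\; V_d \,\det(A)^{-1/2} \,\big( (x-x^{\ast})^{\top} A\,(x-x^{\ast}) \big)^{d/2}

% Differentiating with respect to x:
\nabla W_f(x) \;=\; d\, V_d \,\det(A)^{-1/2}\,
  \big( (x-x^{\ast})^{\top} A\,(x-x^{\ast}) \big)^{(d-2)/2} \, A\,(x-x^{\ast})
```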
Now the deterministic NGD algorithm on is implicitly written as
where is the proportionality constant appearing in the proof of Proposition 3.
In the following, we work under the following assumption: there are and such that
These assumptions are satisfied, for example, if and are set for each iteration so that . In this case the natural gradient can be considered to be normalized by and the pseudo-learning rate is .
The next theorem states that the covariance matrix converges proportionally to the inverse of the Hessian matrix.
Since is positive definite and symmetric, there exists the square root . Premultiplying and postmultiplying both sides of the covariance matrix update (17) by , we have
Since is positive definite and symmetric, there exists an eigenvalue decomposition , where the diagonal elements of are the eigenvalues of and each column of is the eigenvector corresponding to each diagonal element of . Then,
This means that also diagonalizes . By mathematical induction we have that an orthogonal matrix which diagonalizes also diagonalizes for any , and we have
Next, we show that the condition number of converges to as . Recall . We have . Then, by assumption (19) we have
Moreover, since for any , we have for any , .
Suppose without loss of generality. From (22) and the inequality , we have
with equality holding if and only if . Therefore, if , then , which implies that if th and th diagonal elements of are the maximum and minimum elements, th and th elements of are also the maximum and minimum elements of . Without loss of generality we suppose for any for all . Then, . According to (22) we have
and since the right-most side of the above inequality is bounded by because of (23), we have
This proves (20).
Note that if and , we have that . We have from (24) that
and the rate of convergence becomes .
The next theorem states the global convergence of and and the speed of the convergence. In the following, we let denote the Frobenius norm of , namely .
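For reference, J. von Neumann's trace inequality, invoked in the proof below, states that for any $d \times d$ matrices $A$ and $B$ with singular values $\sigma_1(\cdot) \ge \dots \ge \sigma_d(\cdot)$:

```latex
\big|\operatorname{tr}(AB)\big| \;\le\; \sum_{i=1}^{d} \sigma_i(A)\,\sigma_i(B)
```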
Let denote the th largest singular value of the argument matrix. According to J. von Neumann's trace inequality  we have , where and are any matrices in . Let be nonnegative definite and be nonnegative definite symmetric. From the above inequality, we have