Analysis of a Natural Gradient Algorithm on Monotonic Convex-Quadratic-Composite Functions

# Analysis of a Natural Gradient Algorithm on Monotonic Convex-Quadratic-Composite Functions

Youhei Akimoto
Project TAO INRIA Saclay, LRI Université Paris-Sud, 91405 Orsay Cedex, France
###### Abstract

In this paper we investigate the convergence properties of a variant of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Our study is based on the recent theoretical foundation that the pure rank- update CMA-ES performs the natural gradient descent on the parameter space of Gaussian distributions. We derive a novel variant of the natural gradient method where the parameters of the Gaussian distribution are updated along the natural gradient to improve a newly defined function on the parameter space. We study this algorithm on composites of a monotone function with a convex quadratic function. We prove that our algorithm adapts the covariance matrix so that it becomes proportional to the inverse of the Hessian of the original objective function. We also show the speed of covariance matrix adaptation and the speed of convergence of the parameters. We introduce a stochastic algorithm that approximates the natural gradient with finite samples and present some simulated results to evaluate how precisely the stochastic algorithm approximates the deterministic, ideal one under finite samples and to see how similarly our algorithm and the CMA-ES perform.

Covariance Matrix Adaptation, Natural Gradient, Hessian Matrix, Information Geometric Optimization, Theory
\newdef

remarkRemark \newdefdefinitionDefinition \newdefassumptionAssumption \conferenceinfoGECCO’12, July 7-11, 2012, Philadelphia, Pennsylvania, USA. \CopyrightYear2012 \crdata978-1-4503-1177-9/12/07

\numberofauthors

1

\category

G.1.6Numerical AnalysisOptimization[Global optimization, Gradient methods, Unconstrained optimization] \categoryF.2.1Analysis of Algorithms and Problem ComplexityNumerical Algorithms and Problems

\terms

Algorithms, Theory

## Errata

Errata in the original GECCO’2012 paper (which have been revised in this version)

• Eq. (21) in Theorem 4
In the original paper:

 Cond(CtA)⩽1+(1−2γC,min1−γC,min)t(Cond(C0A−1).

In this version:

 Cond(CtA)⩽1+(1−γC,min)t(Cond(C0A)−1).
• The paragraph after (24)
In the original paper:
Since

 1−ηtC(λt1+λtd)(1−ηtCλtd)=1−ηtCλt1(1+λtd/λt1)1−ηtCλt1(λtd/λt1)⩽1−2ηtCλt11−ηtCλt1,

we have

 Cond(Ct+1A)−1Cond(CtA)−1⩽1−2ηtCλt11−ηtCλt1.

Moreover, since the right-hand side of the above inequality is maximized when is minimized and is bounded from below by because of (23), we have

 Cond(Ct+1A)−1Cond(CtA)−1⩽1−2γC,min1−γC,min.

This implies . The rate of convergence (20) and the upper bound (21) are immediate consequences of the above inequality.

In this version:
Since

 1−ηtC(λt1+λtd)(1−ηtCλtd)=1−ηtCλt11−ηtCλt1Cond−1(CtA)⩽1−ηtCλt1

and the right-most side of the above inequality is bounded by because of (23), we have

 Cond(Ct+1A)−1Cond(CtA)−1⩽1−γC,min.

This implies and (21). Moreover, since , we have from (24) that

 limsupt→∞Cond(Ct+1A)−1Cond(CtA)−1 =limsupt→∞1−2ηtCλt11−ηtCλt1 ⩽1−2γC,min1−γC,min.

This proves (20).

## 1 Introduction

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [15, 14, 13]) is a stochastic search algorithm for non-separable and ill-conditioned black-box continuous optimization. In the CMA-ES, search points are generated from a Gaussian distribution and the mean vector and the covariance matrix of the Gaussian distribution are adapted by using the sampled points and their objective value ranking. These parameters’ update rules are designed so as to enhance the probability of generating superior points in the next iteration in a way similar to but slightly different from the (weighted) maximum likelihood estimation. Adaptive-ESs including the CMA-ES are successfully applied in practice. However, their theoretical analysis even on a simple function is complicated and linear convergence has been proven only for simple algorithms compared to the CMA-ES [6, 17].

Resent studies [1, 11] demonstrate the link between the parameter update rules in the CMA-ES and the natural gradient method, the latter of which is the steepest ascent/descent method on a Riemannian manifold and is often employed in machine learning [22, 8, 23, 2, 21, 4]. The natural gradient view of the CMA-ES has been developed and extended in [5] and the Information-Geometric Optimization (IGO) algorithm has been introduced as the unified framework of natural gradient based stochastic search algorithms. Given a family of probability distributions parameterized by , the IGO transforms the original objective function, , to a fitness function, , defined on . The IGO algorithm performs a natural gradient ascent aiming at maximizing . For the family of Gaussian distributions, the IGO algorithm recovers the pure rank- update CMA-ES [14], for the family of Bernoulli distributions, PBIL [7] is recovered. The IGO algorithm can be viewed as the deterministic model of a recovered stochastic algorithm in the limit of the number of sample points going to infinity.

The IGO offers a mathematical tool for analyzing the behavior of stochastic algorithms. In this paper, we analyze the behavior of the deterministic model of the pure rank- update CMA-ES, which is slightly different from the IGO algorithm. We are interested in knowing what is the target matrix of the covariance matrix update and how fast the covariance matrix learns the target. The CMA is designed to solve ill-conditioned objective function efficiently by adapting the metric—covariance matrix in the CMA-ES—as well as other variable metric methods such as quasi-Newton methods [20]. Speed of optimization depends on the precision and the speed of metric adaptation to a great extend. There is a lot of empirical evidence that the covariance matrix tends to be proportional to the inverse of the Hessian matrix of the objective function in the CMA-ES. However, it has not been mathematically proven yet. We are also interested in the speed of convergence of the mean vector and the covariance matrix. Convergence of the CMA-ES has not been reported up to this time. We tackle these issues in this work.

In this paper, we derive a novel natural gradient algorithm in a similar way to the IGO algorithm, where the objective function is transformed to a function in a different way from the IGO so that we can derive the explicit form of the natural gradient for composite functions of a strictly increasing function and a convex quadratic function. We call the composite functions monotonic convex-quadratic-composite functions. The resulting algorithm inherits important properties of the IGO and the CMA-ES, such as invariance under monotone transformation of the objective function and invariance under affine transformation of the search space. We theoretically study this natural gradient method on monotonic convex-quadratic-composite functions. We prove that the covariance matrix adapts to be proportional to the inverse of the Hessian matrix of the objective function. We also investigate the speed of the covariance matrix adaptation and the speed of convergence of the parameters.

The rest of the paper is organized as follows. In Section 2 we propose a novel natural gradient method and present a stochastic algorithm that approximates the natural gradient from finite samples. The basic properties of both algorithms are described. In Section 3 we study the convergence properties of the deterministic algorithm on monotonic convex-quadratic-composite functions. The convergence of the condition number of the product of the covariance matrix and the Hessian matrix of the objective function to one and its linear convergence are proven. Moreover, the rate of convergence of the parameter is shown. In Section 4, we conduct experiments to see how accurately the stochastic algorithm approximates the deterministic algorithm and to see how similarly our algorithm and the CMA-ES behave on a convex quadratic function. Finally, we summarize and conclude this paper in Section 5.

## 2 The algorithms

We first introduce a generic framework of the natural gradient algorithm that includes the IGO algorithm.

The original objective is to minimize , where is a metric space. Let and be the Borel -field and a measure on . Hereunder, we assume that is -measurable. Let represent any monotonically increasing set function on , i.e., for any , s.t. . We transform to an invariant cost function defined as . Given a family of probability distributions on , we define a quasi-objective function on the parameter space as the expected value of over , namely

 J(θ)=EX∼Pθ[Vf(X)].

Our algorithm performs the natural gradient descent on a Riemannian manifold equipped with the Fisher metric . The Fisher metric is the unique metric that does not depend on the choice of parameterization [3]. The natural gradient—the gradient taken w.r.t. the Fisher metric—is given by the product of the inverse of the Fisher information matrix and the “vanilla” gradient of the function. Therefore, the natural gradient of is and the parameter update follows

 θt+1=θt−ηtI−1θt∇J(θt), (1)

where is the learning rate.

### 2.1 Deterministic NGD Algorithm on Rd

In the following, we focus on the optimization in . Thus, , is the Lebesgue measure on , and is the Borel -field on .

We choose as the sampling distribution the Gaussian parameterized by , where the mean vector is in and the covariance matrix is a symmetric and positive definite matrix of dimension .

We define the invariant cost by using the Lebesgue measure as . Then, the infimum of is zero located on a boundary of the domain where the mean vector equals the global minimum of and the covariance matrix is zero.

The choice of the parameterization of Gaussian distributions affects the behavior of the natural gradient update (1) with finite learning rate , although the steepest direction of on the statistical manifold is invariant under the choice of parameterization. We choose the mean vector and the covariance matrix of the Gaussian distribution as the parameter as well as are chosen in the CMA-ES and in other algorithms such as EMNA [18] and cross entropy method [10]. Let , where be the vectorization of such that the (, )th element of corresponds to th element of (see [16]). Then the Fisher information matrix has an analytical form

 Iθ=[C−10012(C−1⊗C−1)], (2)

where denotes the Kronecker product operator. Under some regularity conditions for the exchange of integration and differentiation we have

 ∇J(θ)=EX∼Pθ[Vf(X)∇l(θ;X)], (3)

where is the log-likelihood. The gradient of the log-likelihood can be written as

 (4)

Then, the natural gradient at can be written by part

With different learning rates for mean vector and covariance matrix updates, the natural gradient descent (1) reads

 mt+1=mt−ηtmδmt,Ct+1=Ct−ηtCδCt. (5)

We refer to (5) for the deterministic natural gradient descent (NGD) method.

### 2.2 Stochastic NGD Algorithm on Rd

When is not given, we need to estimate the gradient. We approximate the natural gradient and simulate the natural gradient descent as follows. Initialize the mean vector and the covariance matrix and repeat the following steps until some termination criterion is satisfied:

• Compute the eigenvalue decomposition of , , where is an orthogonal matrix and is a diagonal matrix such that .

• Compute the square root of , .

• Generate normal random vectors .

• Compute , for .

• Evaluate the objective values ;

• Estimate as

 ˆVf(xi)=(2π)d/2det(D)1/2n∑j:f(xj)⩽f(xi)exp(∥zj∥22).
• Compute the baseline .

• Compute the weights .

• Estimate the natural gradient and as

 ˆδmt=n∑i=1wi(xi−mt)ˆδCt=n∑i=1wi((xi−mt)(xi−mt)T−Ct). (6)
• Compute the learning rates and .

• Update the parameters as and .

We refer to this algorithm for the stochastic NGD algorithm.

This algorithm generates samples from in steps 1–4 and evaluates their objective values in step 5. In step 6, the invariant costs are evaluated. The estimates are obtained as follows. By definition we have

 Vf(x)=(∫1{f(y)⩽f(x)}pθt(y)pθt(y)dy)2/d.

Applying Monte-Carlo approximation we have

 ˆVf(x)=(1nn∑j=11{f(xj)⩽f(x)}pθt(xj))2/d. (7)

Since , we have the estimates in step 6. Step 7 computes the baseline that is often introduced to reduce the estimation variance of gradients while adding no bias [12]. We simply choose the mean value of the as the baseline. Replacing the expectation in (5) with the sample mean and adding the baseline (in step 8) we have the Monte-Carlo estimate of the natural gradient in step 9. Finally in step 11, we update the parameters along the estimated natural gradient with the learning rates computed in step 10. The learning rates are chosen in the following so that they are inverse proportional to the largest eigenvalue of the following matrix

 Zt=(Ct)−1/2ˆδCt(Ct)−1/2=n∑i=1wi(zizTi−Id). (8)

### 2.3 Difference from the IGO

The difference between the IGO algorithm and our deterministic algorithm is that the invariant cost in the IGO algorithm is defined by negative of the weighted quantile, , where is non-increasing weight function. Since the quantile depends on the current parameter , for each in the IGO algorithm changes from iteration to iteration, whereas it is fixed in our algorithm. This property makes our algorithm easier to analyze mathematically.

The difference between our stochastic algorithm and the pure rank- update CMA-ES [14] is the same as the difference between the deterministic one and the IGO algorithm. The pure rank- update CMA-ES approximates the quantile value by the number of better solutions divided by the number of samples , . Therefore, the pure rank- update CMA-ES simulates the same lines as the stochastic NGD algorithm described in Section 2.2 with the weights .

In Section 4 we compare the stochastic NGD algorithm with the pure rank- update CMA-ES where

 wi={1/⌊n/4⌋if Ri/n⩽⌊n/4⌋0otherwize. (9)

### 2.4 Basic Properties

Invariance. Our algorithms inherit two important invariance properties from the IGO and the CMA-ES: invariance under monotonic transformation of the objective function and invariance under affine transformation of the search space (with the same transformation of the initial parameters). Invariance under monotonic transformation of the objective function makes the algorithm perform equally on a function and on any composite function where is any strictly increasing function. For example, the convex sphere function is equivalent to the non-convex function for this algorithm, whereas conventional gradient methods, e.g. Newton method, assume the convexity of the objective function and require a fine line search to solve non-convex functions. This invariance property is obtained as a result of the transformation . Invariance under affine transformation of the search space is the essence of variable metric methods such as Newton’s method. By adapting the covariance matrix, this algorithm attains universal performance on ill-conditioned objective functions.

Positivity. The covariance matrix of the Gaussian distribution must be positive definite and symmetric at each iteration. The next proposition gives the condition on the learning rate such that the covariance matrix is always positive definite symmetric.

###### Proposition 1

Suppose that the learning rate for the covariance update in the deterministic NGD algorithm, where denotes the largest eigenvalue of the argument matrix. If is positive definite symmetric, is positive definite symmetric for each . Similarly, if in the stochastic NGD algorithm, where is defined in (8), and if is positive definite symmetric, the same result holds.

{proof}

Consider the deterministic case (5). Suppose that is positive definite and symmetric. Then,

 Ct+1=√Ct(Id−ηtC√Ct−1δCt√Ct−1)√Ct.

Since by the assumption, all the eigenvalues of is smaller than one. Thus, the inside of the brackets is positive definite symmetric and hence is positive definite symmetric. By mathematical induction, we have that is positive definite and symmetric for all . The analogous result for the stochastic case is obtained in the same way.

Consistency. The gradient estimator (6) is not necessarily unbiased, yet it is consistent as is shown in the following proposition. Therefore, one can expect that the stochastic NGD approximates the deterministic NGD well when the sample size is large. Let be the natural gradient operator.

###### Proposition 2

Let be independent and identically distributed random vectors following . Let and be the invariant cost (7) and the natural gradient (6) where are replaced with . Suppose that

 E[Vf(X)2]<∞. (10)

Then, , where represents almost sure convergence.

{proof}

By the Cauchy-Schwarz inequality we have that . Note that . By Jensen’s inequality we have that . Therefore, (10) implies

 E[∥Vf(X)~∇l(θ;X)∥]<∞ and (11) E[Vf(X)]<∞. (12)

Define and decompose as

 Gnθ=1nn∑i=1Vf(Xi)~∇l(θ;Xi)+1nn∑i=1hn(Xi)~∇l(θ;Xi)−(1nn∑i=1ˆVf(Xi)=b)(1nn∑i=1~∇l(θ;Xi)). (13)

By (11) and the strong law of large numbers (LLN), the first summand converges to almost surely as . So we have to show that the second term and the third term of (13) converge almost surely to zero.

First, we show the following almost sure convergence

 1nn∑i=1hn(Xi)\lx@stackrela.s.→0as n→∞. (14)

By the definition of , we have

 limn→∞ˆVf(x) =(limn→∞1nn∑j=11f(Xj)⩽f(x)pθ(Xj))2/d and since (12) implies μLeb[y:f(y)⩽f(x)]<∞ almost everywhere, we have by LLN =μ2/dLeb[y:f(y)⩽f(x)]=Vf(x)

almost surely and almost everywhere in . This implies almost everywhere in .

For , we have

 limn→∞∣∣1nn∑i=1hn(Xi)∣∣⩽limn→∞1nn∑i=1supj⩾n|hj(Xi)|⩽limn→∞1nn∑i=1supj⩾m|hj(Xi)|. (15)

Since almost everywhere in as , we have almost everywhere in as . By the Lebesgue’s dominated convergence theorem we have as . Therefore, we have that for large enough. Then, by applying LLN, we have that the right most side of (15) converges to as and this expectation converges to as . This ends the proof of (14).

Now we can obtain the almost sure convergence of the third term of (13) to zero. Indeed, the almost sure convergence (14) implies that the limit agrees with and we have from (12) and LLN that . Also, by LLN we have that as . Therefore, the third term of (13) converges to zero almost surely.

To show the convergence of the second term of (13) to zero, we apply the Cauchy-Schwarz inequality to it and we have

 ∣∣ ∣∣n∑i=1hn(Xi)~∇l(θ;Xi)n∣∣ ∣∣2⩽n∑i=1hn(Xi)2nn∑i=1∥~∇l(θ;Xi)∥2n.

By LLN we have that the second term of the right hand side converges to . So we have to prove that the first term on the right hand side converges to zero almost surely. The proof of this convergence is done in the same way as above with replacing .

We remark that (12) is the necessary and sufficient condition for to exist and that (11) is a sufficient condition for the exchange of integral and differentiation used in (3). See e.g. [9, Theorem 16.8].

## 3 Convergence Properties of the Deterministic NGD Algorithm

We investigate the convergence properties of the deterministic NGD algorithm (5) on a monotonic convex-quadratic composite function , where is any strictly increasing function and is a positive definite symmetric matrix.

###### Proposition 3

The natural gradient can be written as

 I−1θ∇J(θ)∝[CAmvec(CAC)].
{proof}

Since is equivalent to the volume of the ellipsoid , we have that

 μLeb[y:f(y)⩽f(x)]=2det(A)Vd(√xTAx),

where denotes the volume of the sphere with radius and is proportional to . Therefore . Since the proportionality constant is independent of , we have

 J(θ)=EX∼Pθ[Vf(X)]∝EX∼Pθ[XTAX]=mTAm+Tr(AC).

Differentiating the both side of the above relation, we have

 ∇J(θ)∝[2Amvec(A)].

Premultiplying by , we obtain the intended result.

Now the deterministic NGD algorithm on is implicitly written as

 mt+1 =mt−ηtmδmt, δmt =cCtAmt (16) Ct+1 =Ct−ηtCδCt, δCt =cCtACt, (17)

where is the proportionality constant appearing in the proof of Proposition 3.

In the following, we work on the following assumption: There are and such that

 γm,min⩽ηtmλ1((Ct)−1δCt)⩽1, (18) γC,min⩽ηtCλ1((Ct)−1δCt)⩽1/2. (19)

These assumptions are satisfied, for example, if and are set for each iteration so that . In this case the natural gradient can be considered to be normalized by and the pseudo-learning rate is .

The next theorem states that the covariance matrix converges proportionally to the inverse of the Hessian matrix.

###### Theorem 4

Assume (19). The condition number of converges to one with the rate of convergence

 limsupt→∞Cond(Ct+1A)−1Cond(CtA)−1⩽1−2γC,min1−γC,min. (20)

Moreover, we have an upper bound

 Cond(CtA)⩽1+(1−γC,min)t(Cond(C0A)−1). (21)

If the limit exists, is replaced with in (20).

{proof}

Since is positive definite and symmetric, there exists the square root . Premultiplying and postmultiplying both side of covariance matrix update (17) by , we have

 c√ACt+1√A=c√ACt√A−ηtC(c√ACt√A)2.

Since is positive definite and symmetric, there exists an eigenvalue decomposition , where the diagonal elements of are the eigenvalues of and each column of is the eigenvector corresponding to each diagonal element of . Then,

 c√ACt+1√A=Qt(Λt−ηtC(Λt)2)(Qt)T.

This means, also diagonalizes . By mathematical induction we have that an orthogonal matrix which diagonalizes diagonalizes for any and we have

 Λt+1=Λt−ηtC(Λt)2. (22)

Next, we show that the condition number of converses to as . Remember . We have . Then, by assumption (19) we have

 γC,min⩽ηtCλ1(Λt)⩽1/2. (23)

Moreover, since for any , we have for any , .

Suppose without loss of generality. From (22) and the inequality , we have

 λt+1i−λt+1j=λti(1−ηtCλti)−λtj(1−ηtCλtj)=(1−ηtC(λti+λtj)⩽1)(λti−λtj⩾0)⩾0

with equality holding if and only if . Therefore, if , then , which implies that if th and th diagonal elements of are the maximum and minimum elements, th and th elements of are also the maximum and minimum elements of . Without loss of generality we suppose for any for all . Then, . According to (22) we have

 λt+11−λt+1dλt+1dCond(Ct+1A)−1 =λt1(1−ηtCλt1)−λtd(1−ηtCλtd)λtd(1−ηtCλtd) =(λt1−λtd)λtdCond(CtA)−11−ηtC(λt1+λtd)(1−ηtCλtd). (24)

Since

 1−ηtC(λt1+λtd)(1−ηtCλtd)=1−ηtCλt11−ηtCλt1Cond−1(CtA)⩽1−ηtCλt1

and the right-most side of the above inequality is bounded by because of (23), we have

 Cond(Ct+1A)−1Cond(CtA)−1⩽1−γC,min. (25)

This implies and (21). Moreover, since , we have from (24) that

 limsupt→∞Cond(Ct+1A)−1Cond(CtA)−1 =limsupt→∞1−2ηtCλt11−ηtCλt1 ⩽1−2γC,min1−γC,min.

This proves (20).

If the limit exists, it is easy to see from (25) that can be replaced with in (20). This completes the proof.

Note that if and , we have that . We have from (24) that

 Cond(Ct+1A)=Cond(CtA)1−α1−αCond−1(CtA) (26)

and the rate of convergence becomes .

The next theorem states the global convergence of and and the speed of the convergence. In the following, we let denote the Frobenius norm of , namely .

###### Theorem 5

Assume (19) and (18). Then, and converge to zero with the rate of convergence

 limsup∥κt+1∥∥κt∥⩽1−γκ,min, (27)

where is either or and is either or . If the limit exists, is replaced with in (27).

{proof}

Let denote the th largest singular value of the argument matrix. According to J. von Neumann’s trace inequality [19] we have , where and are any matrices in . Let be nonnegative definite and is nonnegative definite symmetric. From the above inequality, we have