A Framework for Generalising the Newton Method and Other Iterative Methods from Euclidean Space to Manifolds

Jonathan H. Manton Department of Electrical and Electronic Engineering
The University of Melbourne, Victoria 3010, Australia
j.manton@ieee.org
Abstract

The Newton iteration is a popular method for minimising a cost function on Euclidean space. Various generalisations to cost functions defined on manifolds appear in the literature. In each case, the convergence rate of the generalised Newton iteration had to be established from first principles. The present paper develops a framework for generalising iterative methods from Euclidean space to manifolds that ensures local convergence rates are preserved. It applies to any (memoryless) iterative method computing a coordinate independent property of a function (such as a zero or a local minimum). All possible Newton methods on manifolds are believed to come under this framework. Changes of coordinates, and not any Riemannian structure, are shown to play a natural role in lifting the Newton method to a manifold. The framework also gives new insight into the design of Newton methods in general.

keywords:
Newton iteration, Newton method, convergence rates, optimisation on manifolds, geometric computing
journal: Numerische Mathematik (Accepted on 4 April 2014)

1 Introduction

The Newton iteration function $N_f \colon \mathbb{R}^n \to \mathbb{R}^n$ associated with a smooth cost function $f \colon \mathbb{R}^n \to \mathbb{R}$ is

(1) $N_f(x) = x - \big(\nabla^2 f(x)\big)^{-1}\,\nabla f(x),$

where $\nabla f$ and $\nabla^2 f$ are the gradient and Hessian of $f$, respectively; $N_f$ does not depend on the choice of inner product with respect to which the gradient and Hessian are defined. Starting with an initial guess $x_0$, the Newton method uses the Newton iteration function to generate the iterates $x_{k+1} = N_f(x_k)$. Under certain conditions bk:Polak:opt (), this sequence is well-defined and converges to a critical point of $f$, meaning $x_{k+1} = N_f(x_k)$ exists for all $k$, and $x_\star = \lim_{k \to \infty} x_k$ exists and satisfies $\nabla f(x_\star) = 0$.
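To fix ideas, here is a minimal numerical sketch of the iteration (1), assuming a NumPy environment; the particular cost function and its hand-coded gradient and Hessian are illustrative only.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One step of (1): x  ->  x - (Hessian at x)^{-1} (gradient at x)."""
    return x - np.linalg.solve(hess(x), grad(x))

# Illustrative cost f(x, y) = exp(x) - x + y^2, with critical point (0, 0).
grad = lambda v: np.array([np.exp(v[0]) - 1.0, 2.0 * v[1]])
hess = lambda v: np.array([[np.exp(v[0]), 0.0], [0.0, 2.0]])

x = np.array([0.5, 0.3])
for _ in range(6):
    x = newton_step(grad, hess, x)   # the error is roughly squared each iteration
print(x)                             # close to (0, 0) to machine precision
```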

Let now $f \colon M \to \mathbb{R}$ be a smooth cost function defined on an $n$-dimensional manifold $M$. Since $M$ locally looks like $\mathbb{R}^n$, it is natural to ask how the Newton iteration function (1) can be extended to an iteration function $N_f \colon M \to M$ such that the iterates $x_{k+1} = N_f(x_k)$ enjoy the same locally quadratic rate of convergence as do the Euclidean Newton iterates.

One approach Gabay:optm () is to endow the manifold $M$ with a Riemannian metric and define $N_f$ by a formula analogous to (1) but with $\nabla f$ and $\nabla^2 f$ replaced by the Riemannian gradient and Hessian of $f$, and the straight-line increment replaced by an increment along a geodesic, namely

(2) $N_f(x) = \exp_x\!\big( -(\operatorname{Hess} f(x))^{-1} \operatorname{grad} f(x) \big),$

where $\exp_x$ is the Riemannian exponential map centred at $x$.

The Riemannian Newton method (2) has some disadvantages and other Newton methods on manifolds are possible Adler:2002cc (); Manton:opt_mfold ().

What is the most general form of a Newton method on a manifold? Here, a Newton method is defined as any iterative algorithm $x_{k+1} = N_f(x_k)$ that converges locally quadratically to every non-degenerate critical point of every reasonable cost function $f$, where $N_f(x)$ depends only on the 2-jet of the function $f$ at $x$; if $f$ and $g$ agree to second order at $x$ then $N_f(x) = N_g(x)$.

Theorem 11 affords an answer, expressed in terms of parametrisations. A parametrisation of a manifold is a function whose restriction to the tangent space of at any point provides a (not necessarily one-to-one) correspondence between a neighbourhood of and a neighbourhood of ; the former is a subset of a vector space and therefore easier to work with. It suffices for to be -smooth and satisfy , but interestingly, there exist valid parametrisations that are not continuous. (Precise definitions are given in the body of the paper.)

Theorem 11 states that for any pair of parametrisations and , the iteration function is a Newton method on the manifold , where is the Euclidean Newton iteration function (1) but on the abstract vector space rather than . (Since (1) does not depend on the choice of inner product, there is no need for a Riemannian metric on .) Justification is given in the body of the paper for believing this to be the most general form possible of a Newton method on a manifold.

Requiring a Newton method to be strictly of the form places an unnecessary global topological constraint on the parametrisations. Instead, could be constructed on demand by “transporting” the old parametrisation from to . As transport is generally path dependent, may depend on where is relative to . A uniformity constraint on the family of possible parametrisations allows for the generalisation of Theorem 11 to this situation; see Section 6 for details.

The expression “lifts” the Newton iteration function from Euclidean space to a manifold. Section 8 explores in generality the lifting of an iteration function from Euclidean space to a manifold.

1.1 Implications, Limitations and Examples

Two broad types of optimisation problems can be distinguished. One is when little is known in advance about the possible cost functions (save perhaps that they are convex, for example) and an algorithm is desired that scales well with increasing dimension. The other is when the family of possible cost functions is known in advance and an algorithm is desired that works well for all members of the family. The latter is the implicit focus of the current paper and relates to real-time optimisation problems in signal processing: at each instance, a new observation is made; this serves to select a cost function ; it is required to find quickly an that maximises .

Although generic choices are possible for the pair of parametrisations defining a Newton method on a particular manifold, the fact remains that for large-scale problems, the Newton method is generally abandoned in favour of quasi-Newton methods that build up approximations of the Hessian over time, thereby making computational savings by not evaluating the Hessian at each iteration. Quasi-Newton methods have memory and thus are not of the form . An intended sequel will study how to lift algorithms with memory to manifolds.

How can a Newton method be customised for a given family of cost functions? It is propounded that thinking in terms of parametrisations and offers greater insight into the design of optimisation algorithms. Notwithstanding that identifying a “killer application” for the theory is work in progress, the following example may sway some readers.

Generalising the Rayleigh quotient to higher dimensions yields two well-studied optimisation problems Helmke:1994ec (). Recall that the -Stiefel manifold is the set of matrices satisfying , where superscript denotes transpose and is the identity matrix. (The manifold structure is inherited from .) Let be symmetric and diagonal, both with distinct positive eigenvalues. A minimising of subject to has as its columns the eigenvectors of corresponding to the smallest eigenvalues of . If it was only required to find the subspace spanned by these minor eigenvectors, known as the minor subspace of , then it suffices to minimise on the Grassmann manifold. The Grassmann manifold is a quotient space obtained from the Stiefel manifold by declaring two matrices as equivalent whenever there exists an orthogonal matrix such that . In other words, each point on the -Grassmann manifold represents a particular -dimensional subspace of .

The rate of convergence of is dictated by how close to being quadratic is about whenever is near a critical point of . As is already quadratic, the parametrisation should be as linear as possible. One possibility is defining as the point on the Stiefel manifold closest (in the Euclidean metric) to the matrix ; Section 7.4 proves that parametrisations based on projections are linear to at least second order. The role of is to map back to the manifold with a minimum of fuss. Choosing to be the same as suffices. (The option exists of choosing to be an approximation of that makes overall less computationally demanding to evaluate numerically than .) Since the Grassmann manifold is a quotient of the Stiefel manifold, the above argument readily extends to minimising on a Grassmann manifold; see Manton:opt_mfold () for the precise calculations.
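The projection-based parametrisation itself is straightforward to compute: the closest point on the Stiefel manifold to a full-rank matrix $Y$, in the Frobenius norm, is the polar factor $UV^\top$ obtained from the thin SVD $Y = U\Sigma V^\top$. The following is a minimal sketch of this parametrisation; the matrix sizes and names are illustrative only, and the pulled-back Newton step itself is not shown here.

```python
import numpy as np

def stiefel_project(Y):
    """Closest point of the Stiefel manifold St(p, n) to Y in the Frobenius
    norm (Y assumed full rank): the polar factor U V^T of the thin SVD."""
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt

def parametrisation(X, Z):
    """Map an increment Z (in practice, a tangent vector at X) back onto the
    manifold by projecting X + Z; such maps are linear to second order."""
    return stiefel_project(X + Z)

# Quick check that the image lies on St(p, n), i.e. has orthonormal columns.
n, p = 6, 2
X = stiefel_project(np.random.randn(n, p))
Z = 0.1 * np.random.randn(n, p)
Y = parametrisation(X, Z)
print(np.allclose(Y.T @ Y, np.eye(p)))   # True
```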

The above algorithm was trivial to derive yet is a sound starting point upon which clever refinements are possible Absil:2002js (); Absil:2004bd (). The cubic rate of convergence is readily explained in terms of being quadratic to third order at critical points; compare with Example 10. A feature of the derivation is choosing with purpose rather than by trial and error.

How should a theory of optimisation on manifolds be framed? This third italicised objective of the paper is in response to misconceptions including: a connection is required for a Newton method to be definable; only Riemannian Newton methods are “true” Newton methods; and, methods not exploiting the curvature of the manifold must be inferior. These misconceptions come from overplaying the geometry of the manifold itself.

The most relevant geometry is that of the family of cost functions Manton:OG (). Knowing the possible cost functions allows for the customisation of the Newton method by choosing a parametrisation that makes approximately quadratic, and such a choice depends not on the manifold but on the family of cost functions. (Placing a sensible geometry on might be advantageous — perhaps computational burden can be reduced by exploiting symmetry — but the overall benefit nevertheless will depend on the cost functions.)

It is not pragmatic to insist that only Riemannian Newton methods (2) are true Newton methods. Different methods work better for some cost functions and worse for others; no single method can be superior for every smooth cost function. Any method achieving a locally quadratic rate of convergence is worthy of the title Newton method, provided of course it depends only on the 2-jet of the function; see the definition given earlier.

Under this more general definition, there are Newton methods that cannot be defined in terms of a connection. A connection must vary smoothly whereas no such requirement exists for the parametrisation . More importantly, thinking of parametrisations instead of connections is more conducive to customising a Newton method for a given family of cost functions. (The Riemannian approach (2) does not offer explicit insight into which metric to use if there are two or more competing metrics, or what to do if there is no convenient choice of metric.)

This paper avoids any need of Riemannian geometry by framing the theory of optimisation on manifolds in terms of robustness of the iteration function to changes of coordinates; see Section 8 for details. This appears to be the most natural point of view.

1.2 Motivation and Relationship with Other Work

Given the extensive background and bibliography made available in the book Absil:2008uy (), only a handful of papers are discussed below.

The Riemannian Newton method (2) was introduced in Gabay:optm () but apparently went unnoticed. The same methodology was rediscovered in the influential paper Edelman:1998ei (). The mindset is that the Newton method is defined by its formula (1), and its extension to a manifold thus necessitates endowing the manifold with a Riemannian metric so the gradient and Hessian can be defined.

Numerically evaluating the Riemannian exponential map in (2) can be costly. It is common to replace the exponential map by an approximation that is cheaper to evaluate numerically. This is formalised in Adler:2002cc (), with a precursor in Shub:1986ub (). It corresponds to . A Riemannian metric is still required for computing the Newton increment, with what is termed a retraction mapping the result back to the manifold. The retraction must satisfy several conditions, including being smooth. In the present paper, need not be continuous and hence is not even a retraction in the topological sense. The way retractions are commonly used in topology differs in spirit from how and are being used to lift the Newton method to a manifold, hence the persistence here of calling them parametrisations.

The Riemannian mindset was challenged in Manton:opt_mfold (). The basic idea is that since the Newton method is a local method, the cost function in a neighbourhood of the current point can be pulled back to a cost function on Euclidean space via a parametrisation, one step of the Newton method carried out in Euclidean space, and the result mapped back to the manifold. No Riemannian metric is necessary. This corresponds to . Using projections to define the parametrisations was emphasised. (The resulting algorithms differ significantly from projected Newton methods that take a Newton step in the ambient space then project back to the constraint surface.)

Combining the use of in Adler:2002cc () and the use of in Manton:opt_mfold () immediately yields the general form that is the protagonist of the present paper. This form is developed systematically in Sections 4 and 5 in a way that suggests it is the most general form possible of a Newton method.

The use of projections to define parametrisations, advocated in Manton:opt_mfold (), was studied in Absil:2012eb (), but for . Convergence proofs were based on calculus techniques requiring more orders of differentiability than necessary; see Section 3.

Another active stream of research is finding lower bounds on the radius of convergence of Riemannian Newton methods Alvarez:2008hv (); Argyros:2007cw (); Ferreira:2002uj (). This has not been addressed in the present paper, although in principle, a careful study of the constants in the bounds derived here would provide that information.

The question of the most general form of a Newton method on a manifold appears not to have been addressed before.

2 Basic Notation and Definitions

For a function between Euclidean spaces, the following definitions are made. The Euclidean norm on is used throughout. The norm of the second-order derivative is . All other norms are operator norms. Gradients and Hessians are calculated with respect to the Euclidean inner product. The identity operator is denoted by (or sometimes by in the one-dimensional case). The notation and its abbreviation denote the open ball centred at of radius . Its closure is .

An iteration function $N$, which may not be defined on the whole of $\mathbb{R}^n$, is said to converge locally to $x_\star$ with rate $r \ge 1$ and constant $\gamma > 0$ if there exists an open set $U$ containing $x_\star$ such that $N$ is defined on $U$ and

(3) $\|N(x) - x_\star\| \le \gamma\,\|x - x_\star\|^r, \qquad x \in U.$

If $r = 1$ then it is further required that $\gamma < 1$, and convergence is called linear. If $r > 1$ the convergence is super-linear, and if $r = 2$ the convergence is quadratic.

Although (3) implies , the sequence need not converge to for an arbitrary . Nevertheless, define if , or if and . Then is mapped into itself by whenever . Moreover, implies .

The focus of this paper is on convergence rates greater than one.
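The rate in (3) can be estimated numerically from a run of the iteration: since the error satisfies $e_{k+1} \le \gamma\, e_k^r$, the ratio $\log e_{k+1} / \log e_k$ tends to $r$ as $e_k \to 0$. A small sketch follows; the scalar cost and its Newton iteration are illustrative assumptions only.

```python
import numpy as np

# Newton iteration for the illustrative cost f(x) = exp(x) - x; minimiser x_star = 0.
N = lambda x: x - (np.exp(x) - 1.0) / np.exp(x)

x, x_star, errs = 0.8, 0.0, []
for _ in range(5):
    x = N(x)
    errs.append(abs(x - x_star))

# Estimated rates log(e_{k+1}) / log(e_k); they approach 2 (quadratic convergence).
print([np.log(errs[k + 1]) / np.log(errs[k]) for k in range(len(errs) - 1)])
```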

3 Local Convergence of the Newton Iteration on Euclidean Space

Convergence proofs for the Newton method include the Newton-Kantorovich theorem (applicable for the Newton method on Banach spaces) and the Newton-Mysovskikh theorem; see bk:Kantorovich:fn_analysis (); bk:Ortega:iter_soln () and the bibliographic note (bk:Ortega:iter_soln, , p. 428). These theorems give sufficient but not necessary conditions, concentrating instead on explicitly finding a region within which the Newton method is guaranteed to converge. The affine invariance of the Newton method is exploited in bk:Deuflhard:newton () to sharpen these classical results.

In pursuit of the most general Newton method on a manifold, it is informative to derive a necessary and sufficient condition for the standard Newton method to converge to a non-degenerate critical point.

Theorem 1.

Let be -smooth. Let be a non-degenerate critical point, that is, and is invertible. A necessary and sufficient condition for in (1) to be locally quadratically convergent to is for there to exist such that implies

(4)
Proof.

Define the second-order Taylor series remainder term

(5)

Since is , so is . Moreover,

(6)
(7)

Substitution into (1) shows

(8)
(9)

Since is continuous, for any there exists a such that implies: is invertible; ; and .

To prove sufficiency, first observe

(10)

Choose as in the theorem. If then

(11)
(12)
(13)

Choosing as above, if then is well-defined and

(14)

proving local quadratic convergence.

To prove necessity, first note from (9) that

(15)

Thus, choosing as above, if then

(16)

By hypothesis, converges locally quadratically to , hence by shrinking if necessary, there exists a such that implies

(17)

Define the closed ball and the function . Setting ensures is well-defined and continuous on . Assume to the contrary, for all , the scalar satisfies . For any ,

(18)
(19)
(20)
(21)

Let be such that . Since and

(22)
(23)

choosing makes (23) contradict (17), proving the theorem. ∎

Corollary 2.

Let be -smooth and a non-degenerate critical point. Then in (1) converges locally quadratically to .

Proof.

If is then is , hence (4) holds. ∎

Corollary 3.

Let and satisfy the conditions in Theorem 1, including (4). The perturbed iteration function converges locally quadratically to if there exists a such that the operator norm of the matrix satisfies in a neighbourhood of .

Proof.

Observe

(24)

Therefore,

(25)

In a sufficiently small neighbourhood of , is bounded above by a constant and the three other terms are bounded by a constant times ; refer to (13) and the hypotheses on and . ∎

Despite calculus offering a simpler and more elegant alternative, convergence proofs are based here on hard analysis because calculus requires a higher order of smoothness than necessary, as now demonstrated. (See also the opening paragraph of C.2.) Recall the basic principle.

Lemma 4.

Let be -smooth for some integer . If for then converges locally to with rate .

Applying Lemma 4 to (1) shows that being -smooth is sufficient for to converge locally quadratically to a nondegenerate critical point. If were only then would only be and Lemma 4 could not be applied. The actual condition (4) falls strictly between -smoothness and -smoothness.

Example 5.

Define . The origin is a non-degenerate critical point. The Newton iteration function is and has super-linear but not quadratic convergence, despite being -smooth.
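The specific function of Example 5 is not reproduced here, but the phenomenon is easy to observe numerically. The sketch below uses the illustrative choice $f(x) = \tfrac12 x^2 + |x|^{5/2}$, whose second derivative is continuous but not Lipschitz at the origin; the Newton iterates converge with rate $3/2$, super-linear but short of quadratic.

```python
import numpy as np

# Illustrative cost f(x) = x^2/2 + |x|^(5/2); the origin is a non-degenerate
# critical point, but the second derivative is not Lipschitz there.
df  = lambda x: x + 2.5 * np.abs(x) ** 1.5 * np.sign(x)
d2f = lambda x: 1.0 + 3.75 * np.abs(x) ** 0.5
N   = lambda x: x - df(x) / d2f(x)

x, errs = 0.5, []
for _ in range(8):
    x = N(x)
    errs.append(abs(x))

# Estimated rates log(e_{k+1}) / log(e_k) settle near 1.5: super-linear,
# but not quadratic.
print([np.log(errs[k + 1]) / np.log(errs[k]) for k in range(len(errs) - 1)])
```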

Remark 6.

The quadratic convergence rate of the Newton method is coordinate independent, in the following sense. Assume satisfies the conditions in Theorem 1 about the point . If is a -diffeomorphism then is a non-degenerate critical point of , and by Proposition 37, condition (4) holds for about the point . Thus, if converges locally quadratically to then converges locally quadratically to .

4 The Coordinate Adapted Newton Iteration

The most general form of a Newton method in Euclidean space is explored.

4.1 Coordinate Adaptation

Applying a change of coordinates to (1) yields the new iteration function . Expedient choices of can increase the domain of attraction, decrease the computational complexity per iteration and improve the convergence rate. As an extreme example, if is such that is quadratic then converges in a single iteration. Although Morse’s Lemma guarantees the existence of such a locally, finding it is generally not practical. This motivates using a different change of coordinates at each iteration, namely . When varies with , the convergence properties of need not follow from the convergence properties of . Significantly then, it is established that under mild conditions, converges locally quadratically to non-degenerate critical points of .

Coordinate adaptation is defined in terms of a function , alternatively written , satisfying the condition that, , , , , the following hold:

P1

;

P2

.

Implicit in P1 is the requirement that exists, which in turn requires the existence of for sufficiently close to .

Given such a , the coordinate adapted Newton iteration function is

(26)

where is the Newton iteration function (1). This agrees with the earlier expression for because P2 implies .
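The following sketch shows one coordinate adapted step under the assumption, consistent with P2, that the adapted coordinate change sends the origin to the current iterate; the step then amounts to a Euclidean Newton step of the pulled-back cost taken at the origin, pushed forward again. Finite differences stand in for exact derivatives purely to keep the sketch self-contained; all names are illustrative.

```python
import numpy as np

def fd_grad(g, v, h=1e-5):
    """Central-difference gradient of a scalar function g at v."""
    out = np.zeros(v.size)
    for i in range(v.size):
        e = np.zeros(v.size); e[i] = h
        out[i] = (g(v + e) - g(v - e)) / (2.0 * h)
    return out

def fd_hess(g, v, h=1e-4):
    """Finite-difference Hessian of g at v, symmetrised."""
    H = np.zeros((v.size, v.size))
    for i in range(v.size):
        e = np.zeros(v.size); e[i] = h
        H[:, i] = (fd_grad(g, v + e, h) - fd_grad(g, v - e, h)) / (2.0 * h)
    return 0.5 * (H + H.T)

def adapted_newton_step(f, theta, x):
    """One coordinate adapted step: pull f back through theta (theta(0) = x
    assumed), take a Euclidean Newton step at the origin, push forward."""
    g = lambda v: f(theta(v))
    zero = np.zeros(x.size)
    v = -np.linalg.solve(fd_hess(g, zero), fd_grad(g, zero))
    return theta(v)

# With the trivial choice theta(v) = x + v this reduces to the ordinary Newton
# step (1); other choices can make the pulled-back cost closer to quadratic.
```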

Theorem 7.

Let and satisfy the conditions in Theorem 1, including (4). Let satisfy P1 and P2, defined above. Then the coordinate adapted Newton iteration function , defined in (26), converges locally quadratically to .

Proof.

P2 implies and . Hence

(27)
(28)

Let be the matrix representation of . Then is symmetric and satisfies for any . If P1 holds, it can be shown that . Thus, in a neighbourhood of , there exist constants such that implies . Since , it follows from Corollary 3 that, for a possibly smaller , there exists a such that whenever . To be able to apply P2, shrink if necessary to ensure . Then

(29)
(30)
(31)
(32)

whenever , proving the theorem. ∎

As now explained, P1 and P2 are not only mild; it is conjectured they cannot be weakened. For (26) to be defined, the Hessian of at must exist, necessitating the existence of implicit in P1. The local bound in P1 ensures converges locally quadratically. A side-effect of P2 is that and ; this loses no generality because the Newton iteration function is invariant to affine changes of coordinates. The main purpose of P2 is to prevent the residual term from being unbounded locally, ensuring that if is an arbitrary iteration function converging locally quadratically to then continues to converge locally quadratically to . The situation in which, for a sufficiently large class of cost functions , fails to have local quadratic convergence yet has local quadratic convergence is conjectured to be impossible. The claim that P1 and P2 are mild comes from the fact, deducible from B, that any -smooth with and , satisfies P1 and P2. Furthermore, neither nor need be continuous except on the diagonal , and need not be locally -smooth.

The following example shows that if arbitrary changes of coordinates are allowed then the coordinate adapted Newton method may not even be defined, much less converge at a quadratic rate.

Example 8.

Let be an arbitrary scalar. Consider the coordinate adapted Newton iteration function applied to using when and when . If then which is not defined if . If then , which in general exhibits at best linear convergence.

4.2 The Generalised Coordinate Adapted Newton Iteration

The proof of Theorem 7 can be modified trivially to prove the following result. A new function analogous to is introduced and property P2 in Section 4.1 is replaced by

P2’

.

Theorem 9.

Let and satisfy the conditions in Theorem 1, including (4). Let satisfy P1 in Section 4.1. Assume further that and . Let satisfy P2’ above; the qualifiers for and in P2’ are the same as for P2. Then the generalised coordinate adapted Newton iteration function

(33)

converges locally quadratically to .

Being able to change both and allows greater control over the computational complexity, the domain of attraction and the rate of convergence of the iteration function (33), as now discussed.

4.3 Discussion

The choice of coordinate changes and in (33) determines which class of cost functions the generalised coordinate adapted Newton method will perform well for. The challenge then is to determine suitable coordinate changes to use for the class of cost functions at hand. For inherently difficult optimisation problems this will not be easy by definition. Nevertheless, thinking in terms of coordinate adaptation leads to the following new strategy.

The closer the cost function is to being quadratic, the faster the convergence rate of (1). Ideally then, in (26) makes approximately quadratic for every cost function in the given family. For improving local convergence, it suffices to restrict attention to cost functions with a critical point near because, by definition of local, it can be assumed a critical point is nearby , and this limits the possibilities of which cost function from the family has been selected to be minimised. For the special case when has the form for some unknown scalar , where has a minimum at the origin, it suffices for to be approximately quadratic when .

Example 10.

Consider the family of cost functions . The coordinate adapted Newton iteration function using the coordinate systems is . This converges cubically to the critical point for any cost function in the family. Here, was chosen so has no cubic term.

If the domain of attraction is of primary concern then similar intuition suggests choosing such that, for any belonging to the given class of cost functions, has a relatively large domain of attraction, especially if is at all close to the minimum of .

The extra freedom afforded by in (33) can be used to reduce the computational complexity per iteration without compromising the rate of convergence; in some cases, an expedient choice of leads to cancellations, so becomes less computationally intensive to evaluate than on its own.

The coordinate adapted Newton method is different from variable metric methods. Variable metric methods explicitly or implicitly perform a change of coordinates and then take a steepest-descent (not Newton) step in the new coordinate system. They do not evaluate the Hessian of the cost function but instead build up an approximation to the Hessian from current and past gradient information. They are of the form ; see bk:Polak:opt (). The generalised coordinate adapted Newton method (33) with can be written as ; see the proof of Theorem 7. This differs from a variable metric method in several ways; makes use of the Hessian of but does not; has no “memory” but is built up over time; variable metric methods generally only achieve super-linear convergence whereas has quadratic convergence. The philosophy is also different; variable metric methods wish for to be as close as possible to the true Hessian, whereas the generalised coordinate adapted Newton method intentionally uses a perturbed version of the true Hessian to improve the performance of the algorithm.

5 Generalised Newton Methods on Manifolds

Throughout this section, will be a -smooth cost function defined on an -dimensional -differentiable manifold . Recall from A that if is an iteration function with local quadratic convergence to a point then converges locally quadratically to for any chart with . Fix . For cost functions with a critical point in , the coordinate adapted Newton method of Section 4 can be extended to manifolds by seeking an such that is a coordinate adapted Newton iteration function for the equivalent cost function . For functions with critical points outside , in principle a different coordinate chart needs to be taken, but as shown presently, it is straightforward to guess an appropriate form for globally.

Solving yields

(34)
(35)

The affine invariance of the Newton method allows this to be rewritten as where . Although this defines only locally, an obvious extension is where, for each , is a parametrisation of a neighbourhood on the -dimensional manifold centred at , that is, . This extension is justified by the proof of Theorem 11 in which it is shown that does indeed take the form of a coordinate adapted Newton method for any chart on .

Although it is tempting to generalise and in Section 4 to maps from to , the global geometry of can prevent any such map from being smooth. The tangent bundle , being equivalent to locally, offers an alternative. As twists in the “right” way, smooth parametrisations from to can be anticipated to exist; this was appreciated by Shub Shub:1986ub (); Adler:2002cc (). (While smoothness is not essential, in practice it may be convenient to work with smooth parametrisations.)

The functions and will be required to satisfy conditions C1–C2 below, which generalise P1 and P2’ in Sections 4.1 and 4.2. Local coordinates are needed. Let be the projection taking a tangent vector to its base point . A -chart induces the -chart on , sending to where is the linear isomorphism taking to . The local coordinate representation of is .

Conditions C1–C2 are satisfied if, , -chart with , :

C1

satisfies H1 and H2, defined below;

C2

satisfies H3, defined below.

Consider a function which need not be defined everywhere; its domain of definition will be clarified shortly. It satisfies H1 if implies and . It satisfies H2 if there exists a constant such that for every . It satisfies H3 if there exists a constant such that implies . If the subscript is omitted, the existence of an appropriate is implied.

For the derivatives to exist, if satisfies H1 or H2 then its domain of definition must include a set of the form where is a function of . Such a set need not contain a neighbourhood of the origin. For H3 though, it is required that lies in the domain of . See B for further properties.

A generalised Newton iteration function is any of the form

(36)

where and are the restrictions of and to the tangent space at the point on . In (36), represents the Newton iteration (1) but on the abstract vector space rather than .
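As a concrete instance of (36), the sketch below lifts the Newton step to the unit sphere $S^{n-1} \subset \mathbb{R}^n$, taking both parametrisations to be the metric projection $v \mapsto (x+v)/\|x+v\|$ restricted to the tangent space at $x$, and applying the result to the Rayleigh quotient $f(x) = x^\top A x$. The step taken is the Euclidean Newton step of the pulled-back cost at the origin of the tangent space, which is the form described in the Introduction; finite differences are used only to keep the sketch self-contained, and all names are illustrative.

```python
import numpy as np

def tangent_basis(x):
    """Orthonormal basis (as columns) of the tangent space {v : x^T v = 0}."""
    _, _, Vt = np.linalg.svd(x.reshape(1, -1))
    return Vt[1:].T

def project_sphere(y):
    """Metric projection onto the unit sphere; used for both parametrisations."""
    return y / np.linalg.norm(y)

def lifted_newton_step(f, x, h=1e-4):
    """x -> pi_x(Newton step of f o mu_x at 0), with mu_x = pi_x = projection."""
    B = tangent_basis(x)                          # coordinates on T_x S^{n-1}
    g = lambda c: f(project_sphere(x + B @ c))    # pulled-back cost
    m = B.shape[1]
    grad, H = np.zeros(m), np.zeros((m, m))
    for i in range(m):
        ei = np.zeros(m); ei[i] = h
        grad[i] = (g(ei) - g(-ei)) / (2.0 * h)
        for j in range(m):
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (g(ei + ej) - g(ei - ej) - g(-ei + ej) + g(-ei - ej)) / (4.0 * h * h)
    c = -np.linalg.solve(H, grad)                 # Euclidean Newton step in T_x S^{n-1}
    return project_sphere(x + B @ c)              # map back to the manifold

# Rayleigh quotient on the sphere: the minimiser is the unit eigenvector of the
# smallest eigenvalue of A (here +-e_1), approached rapidly from nearby points.
A = np.diag([1.0, 3.0, 6.0])
f = lambda x: x @ A @ x
x = project_sphere(np.array([1.0, 0.2, 0.1]))
for _ in range(6):
    x = lifted_newton_step(f, x)
print(x)
```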

The local coordinate representation of can be written as , alternatively denoted . The local coordinate representation of is , which can be written in terms of , namely, . Analogously for .

Theorem 11.

Let be a -smooth cost function on a -smooth manifold . Let be a non-degenerate critical point, that is, and if then . Assume there is a chart on , with , and an such that satisfies, for ,

(37)

Let satisfy C1–C2, defined above. Then the generalised Newton iteration function (36) converges locally quadratically to .

Proof.

Let be as in the theorem. Proposition 38 implies there exists a such that satisfies H1 and H2, and satisfies H3. (By Lemma