# Pták’s nondiscrete induction and its application to matrix iterations

Jörg Liesen (Institute of Mathematics, Technical University of Berlin, Straße des 17. Juni 136, 10623 Berlin, Germany; liesen@math.tu-berlin.de)
###### Abstract

Vlastimil Pták’s method of nondiscrete induction is based on the idea that in the analysis of iterative processes one should aim at rates of convergence as functions rather than just numbers, because functions may give convergence estimates that are tight throughout the iteration rather than just asymptotically. In this paper we motivate and prove a theorem on nondiscrete induction originally due to Potra and Pták, and we apply it to the Newton iterations for computing the matrix polar decomposition and the matrix square root. Our goal is to illustrate the application of the method of nondiscrete induction in the finite dimensional numerical linear algebra context. We show the sharpness of the resulting convergence estimate analytically for the polar decomposition iteration and for special cases of the square root iteration, as well as on some numerical examples for the square root iteration. We also discuss some of the method’s limitations and possible extensions.

Keywords: nondiscrete induction, matrix iterations, matrix polar decomposition, matrix square root, matrix functions, Newton’s method, convergence analysis

AMS subject classifications: 65F30, 65H05, 65J05

## 1 Introduction

In the late 1960s, Vlastimil Pták (1925–1999) derived the method of nondiscrete induction. This method for estimating the convergence of iterative processes was originally motivated by a quantitative refinement Pták had found for the closed graph theorem from functional analysis in 1966 [10]. He published about 15 papers on the method, five of them jointly with Potra in the 1980s, and the work on the method culminated with their 1984 joint monograph [9]. For historical remarks on the development of the method see [9, Preface] or [14, pp. 67–68]. Pták described the general motivation for the method of nondiscrete induction in his paper “What should be a rate of convergence?”, published in 1977 [13]:

It seems therefore reasonable to look for another method of estimating the convergence of iterative processes, one which would satisfy the following requirements.

• It should relate quantities which may be measured or estimated during the actual process.

• It should describe accurately in particular the initial stage of the process, not only its asymptotic behaviour since, after all, we are interested in keeping the number of steps necessary to obtain a good estimate as low as possible.

This seems to be almost too much to ask for. Yet, for some iterations the method of nondiscrete induction indeed leads to analytical convergence estimates which satisfy the above requirements. The essential idea, as we will describe in more detail below, is that in the method of nondiscrete induction the rate of convergence of an iterative process is considered a function rather than just a number. Pták derived convergence results satisfying the above requirements for a handful of examples including Newton’s method [11] (this work was refined in [8]), and an iteration for solving a certain eigenvalue problem [12]. For these examples it was shown that the convergence estimates resulting from the method of nondiscrete induction are indeed optimal in the sense that in certain cases they are attained in every step of the iteration. In addition to the original papers, comprehensive statements of such sharpness results are given in the book of Potra and Pták; see, e.g., [9, Proposition 5.10] for Newton’s method and [9, Proposition 7.5] for the eigenvalue iteration.

Despite these strong theoretical results, it appears that the method of nondiscrete induction never became widely known. Even in the literature on Newton methods for nonlinear problems it is often mentioned only marginally (if at all); see, e.g., Deuflhard’s monograph [2, p. 49]. Part of the reason for this neglect of the method may be the lack of numerically computed examples in the original publications.

The goals of this paper are: (1) to motivate and prove a theorem on nondiscrete induction due to Potra and Pták that is directly applicable in the analysis of iterative processes, (2) to explain, using two examples (namely the Newton iterations for the matrix polar decomposition and the matrix square root), how the method of nondiscrete induction can be applied to matrix iterations in numerical linear algebra, and (3) to demonstrate the method’s effectiveness as well as discuss some of its weaknesses in the numerical linear algebra context. It must be stressed upfront that most theoretical results in Sections 3 and 4 could also be derived using the general theory of Newton’s method for nonlinear operators in Banach spaces as described in [9, Chapters 2 and 5] and in the related papers of Potra and Pták, in particular [8, 11]. The strategy in this paper is, however, to apply the method of nondiscrete induction without any functional analytic framework and differentiability assumptions directly to the given algebraic matrix iterations.

The paper is organized as follows. In Section 2 we describe the method of nondiscrete induction and derive the theorem of Potra and Pták. We then apply this theorem in the analysis of the Newton iterations for computing the matrix polar decomposition (Section 3) and the matrix square root (Section 4). For both iterations we prove convergence results and illustrate them numerically. In Section 5 we give concluding remarks and an outlook to further work.

## 2 The method of nondiscrete induction

In most of the publications on the method of nondiscrete induction, Pták formulated the method based on his “Induction Theorem”, which is an inclusion result for certain subsets of a metric space; see, e.g., [11, p. 280], [12, p. 225], [13, p. 282], [14, p. 52], or the book of Potra and Pták [9, Proposition 1.7]. Instead of the original Induction Theorem we will in this paper use a result that was stated and proven by Potra and Pták as a “particular case of the Induction Theorem” in [9, Proposition 1.9]; also cf. [8, p. 66]. Unlike the Induction Theorem, this result is formulated directly in terms of an iterative algorithm, and therefore it is easier to apply in our context. In order to fully explain the result and its consequences for the analysis of iterative algorithms we will below give a motivation of the required concepts as well as a complete proof of the assertion without referring to the Induction Theorem.

Let $(E,d)$ be a complete metric space, where $d(x,y)$ denotes the distance between $x,y\in E$. Consider a mapping $G:D(G)\to E$, where $D(G)\subset E$ denotes the domain of definition of $G$, i.e., the subset of all $x\in E$ for which $G(x)$ is well defined. For each $x_0\in D(G)$ we may then consider the iterative algorithm with iterates given by

 $x_{k+1}=G(x_k),\qquad k=0,1,2,\ldots. \qquad (1)$

If all iterates are well defined, i.e., $x_k\in D(G)$ for all $k\geq 0$, then the iterative algorithm is called meaningful.

It is clear from (1) that a (meaningful) iterative algorithm can converge to some $x_*\in E$ only when the distances between $x_k$ and $x_{k+1}$ decrease to zero for $k\to\infty$. This requirement, and the rate by which the convergence to zero occurs, are formalized in the method of nondiscrete induction by means of a (nondiscrete) family of subsets $Z(t)\subset E$ that depend on a positive real parameter $t\in T$. The key idea is that for each $t\in T$, where possibly $T=(0,\infty)$, the set $Z(t)$ should contain all elements $x\in D(G)$ for which the distance between $x$ and $G(x)$ is at most $t$, i.e., $d(x,G(x))\leq t$, and then to analyze what happens with the elements of $Z(t)$ under one application of the mapping $G$. This means that the method compares the values

 $d(x,G(x))\qquad\text{and}\qquad d(G(x),G(G(x))).$

Hopefully, the new distance $d(G(x),G(G(x)))$ is (much) smaller than the previous distance $d(x,G(x))$. Using the family of sets this can formally be written as

 $G(x)\in Z(\omega(t))\qquad\text{for all }x\in Z(t),$

where $\omega(t)$ should be a (positive) “small function” of $t$ that needs to be determined. This function is called the rate of convergence of the iterative algorithm. Since in the limit the distances between the iterates should approach zero, the limiting point(s) of the algorithm are contained in the limit set $Z(0)$, which is not to be constructed explicitly but is defined as

 $Z(0)\equiv\lim_{t\to 0}Z(t)=\bigcap_{0<s}\ \overline{\bigcup_{0<t\leq s}Z(t)}, \qquad (2)$

where $\overline{M}$ denotes the closure of a set $M$. Note that the set $Z(0)$ possibly is not a subset of $D(G)$.

We need the following formal definition of a rate of convergence.

{definition}

Let $T=(0,t^*)$ be a given real interval, where possibly $t^*=\infty$. Let $\omega:T\to T$ be a function and denote by $\omega^{(k)}$ its $k$th iterate, i.e., $\omega^{(0)}(t)\equiv t$ and $\omega^{(k+1)}(t)\equiv\omega(\omega^{(k)}(t))$ for $k\geq 0$. The function $\omega$ is called a rate of convergence on $T$ when the corresponding series

 $\sigma(t)\equiv t+\omega(t)+\omega^{(2)}(t)+\cdots=\sum_{k=0}^{\infty}\omega^{(k)}(t) \qquad (3)$

converges for all $t\in T$.

The reason for the convergence of the series in (3) will become apparent in the proof of the following result of Potra and Pták, cf. [9, Proposition 1.9].
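To make the definition concrete, here is a toy illustration (an addition to the text, not one of Pták’s examples): for $\omega(t)=t/2$ on $T=(0,\infty)$ the iterates are $\omega^{(k)}(t)=t/2^k$, so the series (3) converges with $\sigma(t)=2t$, and $\omega$ is a rate of convergence. A short numerical sketch:

```python
# Toy rate of convergence (illustrative, not from the paper): omega(t) = t/2.
# Its iterates are omega^(k)(t) = t / 2^k, so the series (3) sums to 2t.
def omega(t):
    return t / 2.0

def sigma_partial(t, n_terms=60):
    """Partial sum t + omega(t) + omega^(2)(t) + ... (n_terms terms)."""
    s, term = 0.0, t
    for _ in range(n_terms):
        s += term
        term = omega(term)
    return s

print(sigma_partial(0.3))  # approaches sigma(0.3) = 2 * 0.3 = 0.6
```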

{theorem}

Let $(E,d)$ be a complete metric space, let $G:D(G)\to E$ be a given mapping, and let $x_0\in D(G)$. If there exist a rate of convergence $\omega$ on a real interval $T=(0,t^*)$, with the corresponding function $\sigma$ as in (3), and a family of sets $Z(t)\subset D(G)$ for all $t\in T$, such that

• (i) $x_0\in Z(t_0)$ for some $t_0\in T$,

• (ii) $d(x,G(x))\leq t$ and $G(x)\in Z(\omega(t))$ for each $t\in T$ and $x\in Z(t)$,

then:

1. $x_{k+1}=G(x_k)$, $k=0,1,2,\ldots$, is a meaningful iterative algorithm and the sequence of its iterates converges to a point $x_*\in E$. (In particular, $x_*\in Z(0)$.)

2. The following relations are satisfied for all $k\geq 0$:
a) $x_k\in Z(\omega^{(k)}(t_0))$,
b) $d(x_k,x_{k+1})\leq\omega^{(k)}(t_0)$,
c) $d(x_k,x_*)\leq\sigma(\omega^{(k)}(t_0))$.
The right hand side of the bound in (2.c) is a strictly monotonically decreasing function of $k$. Moreover, if equality holds in (2.c) for some $k=k_0\geq 0$, then equality holds for all $k\geq k_0$.

{proof}

Since $x_0\in Z(t_0)\subset D(G)$, we know that $x_1=G(x_0)$ exists. Now (ii) implies that

 $x_1\in Z(\omega(t_0))\qquad\text{and}\qquad d(x_0,x_1)\leq t_0.$

We can apply the same reasoning to $x_1\in Z(\omega(t_0))$, which yields the existence of $x_2=G(x_1)$ with

 $x_2\in Z(\omega^{(2)}(t_0))\qquad\text{and}\qquad d(x_1,x_2)\leq\omega(t_0).$

Continuing in this way we obtain a sequence of well defined iterates $x_{k+1}=G(x_k)$ with

 $x_k\in Z(\omega^{(k)}(t_0))\qquad\text{and}\qquad d(x_k,x_{k+1})\leq\omega^{(k)}(t_0),\qquad k\geq 0.$

This shows items (2.a) and (2.b).

Next observe that, for all $k\geq 0$ and $m\geq 1$,

 $d(x_k,x_{k+m})\leq d(x_k,x_{k+1})+\cdots+d(x_{k+m-1},x_{k+m})\leq\omega^{(k)}(t_0)+\cdots+\omega^{(k+m-1)}(t_0),$

where we have used (2.b). Convergence of the series (3) for all $t\in T$ implies that the sequence $\{x_k\}$ is a Cauchy sequence, which by completeness of the space $E$ converges to a well defined limit point $x_*\in E$. From $x_k\in Z(\omega^{(k)}(t_0))$ and $\omega^{(k)}(t_0)\to 0$ for $k\to\infty$, we obtain $x_*\in Z(0)$. Hence we have shown (1).

For all $k\geq 0$ and $m\geq 1$,

 $d(x_k,x_*)\leq d(x_k,x_{k+1})+\cdots+d(x_{k+m-1},x_{k+m})+d(x_{k+m},x_*)\leq\omega^{(k)}(t_0)+\cdots+\omega^{(k+m-1)}(t_0)+d(x_{k+m},x_*)=d(x_{k+m},x_*)+\sigma_{m-1}(\omega^{(k)}(t_0)),\quad\text{where}\quad\sigma_{m-1}(t)\equiv\sum_{j=0}^{m-1}\omega^{(j)}(t).$

Taking the limit $m\to\infty$ shows that $d(x_k,x_*)\leq\sigma(\omega^{(k)}(t_0))$, which proves (2.c).

Since $\sigma(\omega^{(k)}(t_0))=\sum_{j=k}^{\infty}\omega^{(j)}(t_0)$ and $\omega^{(k)}(t_0)>0$, we have for all $k\geq 0$,

 $\sigma(\omega^{(k+1)}(t_0))=\sum_{j=k+1}^{\infty}\omega^{(j)}(t_0)=\left(\sum_{j=k}^{\infty}\omega^{(j)}(t_0)\right)-\omega^{(k)}(t_0)<\sigma(\omega^{(k)}(t_0)).$

For the proof of the last assertion about the equality in (2.c) we refer to the proof of [9, Proposition 1.11]. (In that proof it should read “If (10) is verified” instead of “If (9) is verified”.)

From the formulation of Theorem 2 it becomes apparent why Pták considered his method a “continuous analogue” of the classical mathematical induction; also cf. his own explanation in [12, pp. 225–226]: In condition (i) we require the existence of at least one set $Z(t_0)$ that contains the initial value $x_0$. This corresponds to the base step in the classical mathematical induction. Condition (ii) then considers what happens with the elements of the set $Z(t)$ under one application of the mapping $G$. This corresponds to the inductive step.

In the classical theory of iterative processes, an iteration is called convergent of order $p\geq 1$ when there exist a constant $c>0$ and a positive integer $k_0$, such that

 $d(x_*,x_{k+1})\leq c\,d(x_*,x_k)^p\qquad\text{for all }k\geq k_0. \qquad (4)$

In particular, $p=1$ and $p=2$ give “linear” and “quadratic” convergence, respectively, where in the first case we require that $c<1$. As Pták pointed out, e.g., in [13, 14], such classical bounds compare quantities that are not available at any finite step of the process, simply because $x_*$ is unknown. In addition, since $c$ is the same fixed number for all $k\geq k_0$ and often is large, such bounds in many cases cannot give a good description of the initial stages of the iteration; cf. the quote from [13] given in the Introduction above. Item (2.c) in Theorem 2, on the other hand, gives an a priori bound on the error in each step of the iterative algorithm. Moreover, since the convergence rate $\omega$ is a function (rather than just a number), there appears to be a better chance that the bound is tight throughout the iteration.

Finally, we note that from the statement of Theorem 2 it is clear that if $x_*\in D(G)$ and $G$ is continuous at $x_*$, then $G(x_*)=x_*$, i.e., $x_*$ is a fixed point of $G$. In contrast to classical fixed point results, however, Theorem 2 contains no explicit continuity assumption on $G$. As shown in [9, pp. 10–12] (see also [12, Section 3]), the Banach fixed point theorem, which assumes Lipschitz continuity of $G$, can be easily derived as a corollary of Theorem 2.

## 3 Application to the Newton iteration for the matrix polar decomposition

Let $A\in\mathbb{C}^{n\times n}$ be nonsingular, and let $A=W\Sigma V^*$ be a singular value decomposition with $\Sigma=\operatorname{diag}(\sigma_1,\ldots,\sigma_n)$, $\sigma_j>0$ for $j=1,\ldots,n$, and with unitary matrices $W,V\in\mathbb{C}^{n\times n}$. The factorization

 $A=UH,\qquad\text{where}\quad U\equiv WV^*,\quad H\equiv V\Sigma V^*,$

is called a polar decomposition of $A$. Here $U$ is unitary and $H$ is Hermitian positive definite. The theory of the matrix polar decomposition and many algorithms for computing this decomposition are described in Higham’s monograph on matrix functions [5, Chapter 8].

One of the algorithms for computing the matrix polar decomposition is Newton’s iteration [5, Equation (8.17)]:

 $X_0=A, \qquad (5)$
 $X_{k+1}=\tfrac{1}{2}\left(X_k+X_k^{-*}\right),\qquad k=0,1,2,\ldots. \qquad (6)$

This iteration can be derived by applying Newton’s method to the matrix equation $X^*X=I$; cf. [5, Problem 8.18]. As shown in [5, Theorem 8.12], its iterates converge to the unitary polar factor $U$, and they satisfy the convergence bound

 $\|U-X_{k+1}\|\leq\tfrac{1}{2}\|X_k^{-1}\|\,\|U-X_k\|^2,\qquad k=0,1,2,\ldots,$

where $\|\cdot\|$ denotes any unitarily invariant and submultiplicative matrix norm. This is a bound of the form (4) that shows quadratic convergence of the iteration (5)–(6). The derivation of this bound as well as further a posteriori convergence bounds for this iteration in terms of the singular values of $A$ can also be found in [3, Theorem 3.1].
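For orientation, the iteration (5)–(6) can be sketched in a few lines of NumPy (an illustration added here; the function name and the random test matrix are ours, not from the paper):

```python
import numpy as np

def newton_polar(A, steps=20):
    """Newton iteration (5)-(6): X_{k+1} = (X_k + X_k^{-*}) / 2, with X_0 = A."""
    X = np.array(A, dtype=complex)
    for _ in range(steps):
        X = 0.5 * (X + np.linalg.inv(X).conj().T)
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
U = newton_polar(A)                    # numerical unitary polar factor
H = U.conj().T @ A                     # then A = U H with H Hermitian pos. def.
print(np.linalg.norm(U.conj().T @ U - np.eye(5)))  # close to machine precision
```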

### 3.1 A convergence theorem for the iteration (5)–(6)

We will now analyze the algebraic matrix iteration (5)–(6) using Theorem 2. In the notation of the theorem we consider $E=\mathbb{C}^{n\times n}$ and we write the distance between $X,Y\in\mathbb{C}^{n\times n}$ as $d(X,Y)=\|X-Y\|$, where $\|\cdot\|$ denotes the spectral norm. We also consider the mapping $G$ defined by

 $G(X)\equiv\tfrac{1}{2}\left(X+X^{-*}\right);\qquad\text{cf. (6)}.$

We need to determine $t^*$ and define the sets $Z(t)$ for some interval $T=(0,t^*)$ on which the nondiscrete induction is based. The value of $t^*$ is not specified yet; it will be determined when the rate of convergence has been found.

Obviously, $D(G)$ is equal to the set of the invertible matrices in $\mathbb{C}^{n\times n}$, and hence every $X\in Z(t)$ must be invertible. We also require that every $X\in Z(t)$ satisfies $\|X-G(X)\|\leq t$; cf. the motivation of Theorem 2 and the first assumption in (ii) of the theorem. Moreover, for $X_0=A=W\Sigma V^*$ we get $G(X_0)=\tfrac{1}{2}W(\Sigma+\Sigma^{-1})V^*$. Inductively we see that every $X$ that can possibly occur in the iteration (5)–(6) is of the form $X=WSV^*$ for some diagonal matrix $S$ with real and positive diagonal elements. For the given matrix $A$ and each $t\in T$ we therefore define

 $Z(t)\equiv\{X\in\mathbb{C}^{n\times n}\ |\ X=WSV^*,\ S=\operatorname{diag}(s_1,\ldots,s_n),\ s_j>0,\ j=1,\ldots,n,\ \|X-G(X)\|\leq t\}.$

In particular, $Z(t)\subset D(G)$.

Now let $X\in Z(t)$, so that $X=WSV^*$ for some $S=\operatorname{diag}(s_1,\ldots,s_n)$ with $s_j>0$ for $j=1,\ldots,n$. Then $\widehat{X}\equiv G(X)=\tfrac{1}{2}W(S+S^{-1})V^*$, and

 $X-\widehat{X}=\tfrac{1}{2}W(S-S^{-1})V^*,\qquad\|X-\widehat{X}\|=\max_{1\leq j\leq n}f(s_j),\quad\text{where}\quad f(s)\equiv\tfrac{1}{2}|s-s^{-1}|. \qquad (7)$

From (7) we see that in the limit $t\to 0$ we obtain $S=I$, i.e., the limiting set $Z(0)$ consists only of the unitary polar factor $U=WV^*$ of $A$.

In order to determine a rate of convergence we compute

 $\widehat{X}-G(\widehat{X})=\tfrac{1}{2}\widehat{X}^{-*}\left(\widehat{X}^*\widehat{X}-I\right)=W\left(S+S^{-1}\right)^{-1}\tfrac{1}{4}\left(S-S^{-1}\right)^2V^*, \qquad (8)$

and hence

 $\|\widehat{X}-G(\widehat{X})\|=\max_{1\leq j\leq n}g(s_j),\quad\text{where} \qquad (9)$
 $g(s)\equiv\frac{\tfrac{1}{4}\left(s-s^{-1}\right)^2}{s+s^{-1}}=\frac{f^2(s)}{2\left(f^2(s)+1\right)^{1/2}}. \qquad (10)$

Here we have used the definition of $f$ from (7) and that $\tfrac{1}{4}(s+s^{-1})^2=\tfrac{1}{4}(s-s^{-1})^2+1=f^2(s)+1$. Note that $g(s)\leq f(s)$ and $g(s)\leq\tfrac{1}{2}f^2(s)$ for all positive $s$.

{lemma}

The maxima in (7) and (9) are attained at the same point $s_i$, $i\in\{1,\ldots,n\}$.

{proof}

Without loss of generality we can assume that $0<s_1\leq s_2\leq\cdots\leq s_n$. Elementary computations show that

 $f'(s)<0\ \text{for}\ 0<s<1\qquad\text{and}\qquad f'(s)>0\ \text{for}\ s>1,$
 $g'(s)<0\ \text{for}\ 0<s<1\qquad\text{and}\qquad g'(s)>0\ \text{for}\ s>1,$

i.e., both $f$ and $g$ are strictly monotonically decreasing for $0<s<1$ and strictly monotonically increasing for $s>1$. This means that

 $\max_{1\leq j\leq n}f(s_j)=\max\{f(s_1),f(s_n)\}\qquad\text{and}\qquad\max_{1\leq j\leq n}g(s_j)=\max\{g(s_1),g(s_n)\}.$

If $s_n\leq 1$ resp. $s_1\geq 1$, then obviously the maximum for both functions is attained at $s_1$ resp. $s_n$. In the remaining case $s_1<1<s_n$ we will use that $f(s)=f(s^{-1})$ and $g(s)=g(s^{-1})$ for all $s>0$, which can be easily shown. Suppose that the maximum in (7) is attained at $s_1$, i.e., $f(s_1)\geq f(s_n)$. Then also $f(s_1^{-1})\geq f(s_n)$, and thus we must have $s_1^{-1}\geq s_n$ because of the (strict) monotonicity of $f$ for $s>1$. But since $g$ is also (strictly) monotonically increasing for $s>1$, we get $g(s_1)=g(s_1^{-1})\geq g(s_n)$. An analogous argument applies when the maximum in (7) is attained at $s_n$.

For $X\in Z(t)$ there exist an index $i$ and a real number $\widetilde{t}$ such that

 $\|X-\widehat{X}\|=\max_{1\leq j\leq n}f(s_j)=f(s_i)=\widetilde{t}\leq t.$

The previous lemma then implies

 $\|\widehat{X}-G(\widehat{X})\|=\max_{1\leq j\leq n}g(s_j)=g(s_i)=\frac{f^2(s_i)}{2\left(f^2(s_i)+1\right)^{1/2}}=\frac{\widetilde{t}^{\,2}}{2\left(\widetilde{t}^{\,2}+1\right)^{1/2}} \qquad (11)$
 $\leq\frac{t^2}{2\left(t^2+1\right)^{1/2}}\equiv\omega(t), \qquad (12)$

where we have used that the function $\omega$ introduced in (12) is (strictly) monotonically increasing for $t>0$. Pták showed in [11] (also cf. [8, 13] or [9, Example 1.4]) that $\omega$ is a rate of convergence on $T=(0,\infty)$ in the sense of Definition 2 with the corresponding series given by (here the notation $\sigma$ should not be confused with the singular values of $A$; this slight inconvenience is the price for retaining the original notation of Pták [11, 12, 13])

 $\sigma(t)=t-1+\left(t^2+1\right)^{1/2}. \qquad (13)$

He pointed out in [13] that $\omega(t)\approx\tfrac{1}{2}t$ for large $t$, and $\omega(t)\approx\tfrac{1}{2}t^2$ for small $t$. Thus, the rate of convergence $\omega$ can describe both a linear and a quadratic phase of the convergence; cf. the numerical examples in Section 3.2.

We have shown above that $X\in Z(t)$ implies $G(X)\in Z(\omega(t))$ for $\omega$ defined in (12), and hence we have satisfied condition (ii) in Theorem 2. In order to satisfy condition (i) we simply set $t_0\equiv\|X_0-G(X_0)\|=\tfrac{1}{2}\|A-A^{-*}\|$, then $X_0\in Z(t_0)$ is guaranteed (recall that $X_0=A=W\Sigma V^*$). Theorem 2 now gives the following result.

{theorem}

Let $A=W\Sigma V^*\in\mathbb{C}^{n\times n}$ with $\Sigma=\operatorname{diag}(\sigma_1,\ldots,\sigma_n)$, $\sigma_j>0$ for $j=1,\ldots,n$, and unitary matrices $W,V\in\mathbb{C}^{n\times n}$. Then the iterative algorithm (5)–(6) is meaningful and its sequence of iterates converges to the unitary polar factor $U=WV^*$ of $A$. Moreover, with $t_0\equiv\tfrac{1}{2}\|A-A^{-*}\|$, and $\omega$ resp. $\sigma$ defined as in (12) resp. (13), we have

 $\|U-X_k\|\leq\sigma(\omega^{(k)}(t_0)),\qquad k=0,1,2,\ldots. \qquad (14)$

The inequality (14) is an equality for all $k\geq 1$.

{proof}

It remains to show that (14) is an equality for all $k\geq 1$. According to the last part of Theorem 2 it suffices to show that $\|U-X_1\|=\sigma(\omega(t_0))$. With $X_0=A=W\Sigma V^*$ we get $X_1=\tfrac{1}{2}W(\Sigma+\Sigma^{-1})V^*$, so that

 $\|U-X_1\|=\left\|I-\tfrac{1}{2}\left(\Sigma+\Sigma^{-1}\right)\right\|=\max_{1\leq j\leq n}\left|1-\tfrac{1}{2}\left(\sigma_j+\sigma_j^{-1}\right)\right|=\max_{1\leq j\leq n}\tfrac{1}{2}\left(\sigma_j+\sigma_j^{-1}\right)-1=\max_{1\leq j\leq n}\left(\tfrac{1}{4}\left(\sigma_j-\sigma_j^{-1}\right)^2+1\right)^{1/2}-1=\left(t_0^2+1\right)^{1/2}-1=\sigma(t_0)-t_0=\sigma(\omega(t_0)).$

(Recall that $\sigma(t)=t+\sigma(\omega(t))$ for all $t\in T$ by definition of the series $\sigma$.)

The fact that (14) is an equality for all $k\geq 1$ clearly demonstrates the advantage of using functions rather than just numbers for describing the convergence of iterative algorithms. The equality could be expected when considering that Potra and Pták showed that the convergence estimates obtained by the method of nondiscrete induction applied to Newton’s method for nonlinear operators in Banach spaces are sharp for special cases of scalar functions; see, e.g., [9, Proposition 5.10]. One can interpret the iteration (5)–(6) in terms of $n$ uncoupled scalar iterations, and can then derive Theorem 3.1 using results in [9, Chapter 5] or the original papers [8, 12].
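The sharpness of (14) is also easy to check numerically. The following sketch (an addition to the text; variable names and the prescribed singular values are ours) constructs a real matrix with known SVD and compares $\|U-X_k\|_2$ with $\sigma(\omega^{(k)}(t_0))$:

```python
import numpy as np

def omega(t):                      # the rate of convergence (12)
    return t**2 / (2.0 * np.sqrt(t**2 + 1.0))

def sigma(t):                      # the corresponding series (13)
    return t - 1.0 + np.sqrt(t**2 + 1.0)

rng = np.random.default_rng(1)
W, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([0.5, 0.9, 1.1, 2.0])             # prescribed singular values
A = W @ np.diag(s) @ V.T
U = W @ V.T                                     # polar factor (orthogonal, A real)

t = 0.5 * np.linalg.norm(A - np.linalg.inv(A).T, 2)   # t0 = max_j f(s_j) = 0.75
X = A.copy()
for k in range(1, 5):
    X = 0.5 * (X + np.linalg.inv(X).T)          # real data, so X^{-*} = X^{-T}
    t = omega(t)                                # now t = omega^(k)(t0)
    print(k, np.linalg.norm(U - X, 2), sigma(t))   # the two columns agree
```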

### 3.2 Numerical examples

We will illustrate the convergence of the iteration (5)–(6) and the bound (14) on four examples. All experiments shown in this paper were computed in MATLAB. For numerical purposes we first computed a singular value decomposition $A=W\Sigma V^*$ using MATLAB’s command svd and then applied (5)–(6) with $X_0=A$. The error in every step $k$, given by $\|U-X_k\|$, is shown by the solid lines in Figures 1 and 2. The pluses ($+$) in these figures indicate the corresponding values of the right hand side of (14). It should be stressed that the right hand side of (14) is a simple scalar function that only depends on the value $t_0$. This function may also be computed by the more explicit expressions given in [8] or [9, Chapter 5].

We present numerical results for four different matrices:

Moler matrix. Generated using MATLAB’s command gallery(’moler’,16), this is a symmetric positive definite matrix with 15 eigenvalues between and , and one small eigenvalue of order . For this matrix ; the numerical results are shown in the left part of Figure 1.

Fiedler matrix. Generated using gallery(’fiedler’,88), this is a symmetric indefinite matrix with 87 negative eigenvalues and 1 positive eigenvalue. For this matrix ; the numerical results are shown in the right part of Figure 1.

Jordan blocks matrix. The matrix , where is an Jordan block with eigenvalue . For this matrix ; the numerical results are shown in the left part of Figure 2.

Frank matrix. Generated using gallery(’frank’,12), this is a nonnormal matrix with ill-conditioned real eigenvalues; the condition number of its eigenvector matrix computed in MATLAB is . For this matrix ; the numerical results are shown in the right part of Figure 2.

In the examples we observe the typical behavior of Newton’s method: An initial phase of linear convergence is followed by a phase of quadratic convergence in which the error is reduced very quickly. While the convergence bound (14) is an equality for all $k\geq 1$, it may happen that $\|U-X_0\|<\sigma(t_0)$ as well as $\|U-X_1\|>\|U-X_0\|$; see the results for the Moler matrix and the Frank matrix. As shown in Theorem 2, the error is guaranteed to decrease strictly monotonically after the first step. In practice the iteration can be scaled for stability as well as for accelerating the convergence; see [1, 3] or [5, Section 8.6] for details. We give some brief comments on scaled iterations in Section 5.

## 4 Application to the Newton iteration for the matrix square root

Let $A\in\mathbb{C}^{n\times n}$ be a given, possibly singular, matrix. Any matrix $X$ that satisfies $X^2=A$ is called a square root of $A$. For the theory of matrix square roots and many algorithms for computing them we refer to [5, Chapter 6].

One of the algorithms for computing matrix square roots is Newton’s iteration; cf. [5, Equation (6.12)]:

 $X_0\ \text{is invertible and commutes with}\ A, \qquad (15)$
 $X_{k+1}=\tfrac{1}{2}\left(X_k+X_k^{-1}A\right),\qquad k=0,1,2,\ldots. \qquad (16)$

Similar to the Newton iteration for the matrix polar decomposition, the iteration (15)–(16) can be derived by applying Newton’s method to a matrix equation, here $X^2=A$; see [5, pp. 139–140] for details. As shown in [5, Theorem 6.9], the iterates converge (under some natural assumptions) to the principal square root $A^{1/2}$, and they satisfy the convergence bound

 $\|A^{1/2}-X_{k+1}\|\leq\tfrac{1}{2}\|X_k^{-1}\|\,\|A^{1/2}-X_k\|^2,$

where $\|\cdot\|$ denotes any submultiplicative matrix norm. This is a bound of the classical form (4) that shows quadratic convergence.
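As a quick sanity check, a minimal sketch of (15)–(16) (an illustration added here; we assume a symmetric positive definite $A$ and take $X_0=I$, which is invertible and commutes with $A$; note that the unscaled iteration can be numerically unstable for ill-conditioned $A$):

```python
import numpy as np

def newton_sqrt(A, steps=25):
    """Newton iteration (15)-(16) with X_0 = I: X_{k+1} = (X_k + X_k^{-1} A) / 2."""
    X = np.eye(A.shape[0])
    for _ in range(steps):
        X = 0.5 * (X + np.linalg.solve(X, A))   # X_k^{-1} A via a linear solve
    return X

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                      # symmetric positive definite
X = newton_sqrt(A)
print(np.linalg.norm(X @ X - A))                # close to machine precision
```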

### 4.1 A convergence theorem for the iteration (15)–(16)

The approach for analyzing the iteration (15)–(16) using Theorem 2 contains some differences in comparison with the one for the matrix polar decomposition. These differences will be pointed out in the derivation. Ultimately they will lead to a convergence result giving formally the same bound as in Theorem 3.1 (with the appropriate modifications), but containing a strong assumption on the initial matrix $X_0$.

As in Section 3 we consider $E=\mathbb{C}^{n\times n}$ and the spectral norm $\|\cdot\|$. Now the mapping $G$ is defined by

 $G(X)\equiv\tfrac{1}{2}\left(X+X^{-1}A\right);\qquad\text{cf. (16)}.$

We need to determine $t^*$ and define the sets $Z(t)$ for some interval $T=(0,t^*)$, where we will again specify $t^*$ in the end.

Clearly, $D(G)$ is equal to the set of the invertible matrices in $\mathbb{C}^{n\times n}$, so that each $X\in Z(t)$ must be invertible. The first major difference in comparison to the iteration for the polar decomposition, where the iterates converge to a unitary matrix, is that in case of (15)–(16) the conditioning of the iterates may deteriorate during the process. In fact, when $A$ is (close to) singular, any of its square roots will be (close to) singular as well. As originally suggested by Pták in [11, 12] (also cf. [9, pp. 20–22]), such deterioration may be dealt with by requiring that each $X\in Z(t)$ satisfies

 $\sigma_{\min}(X)\geq h(t),$

where $h$ is some positive function on $T$ that will be specified below. Further requirements for each $X\in Z(t)$ are that $X$ commutes with $A$ (cf. (15)) and that $\|X-G(X)\|\leq t$, which is the first condition in (ii) of Theorem 2. We thus define for the given matrix $A$ and each $t\in T$ the set

 $Z(t)\equiv\{X\in\mathbb{C}^{n\times n}\ |\ AX=XA,\ \sigma_{\min}(X)\geq h(t),\ \|X-G(X)\|\leq t\}\subset D(G).$

Satisfying (i) and the second condition in (ii) of Theorem 2 will now prove the convergence of Newton’s method (15)–(16) to a matrix $X_*\in Z(0)$ with the error norm in each step of the iteration bounded as in (2.c) of Theorem 2. We will have a closer look at the set $Z(0)$ below.

For the second condition in (ii), let $t\in T$ and $X\in Z(t)$ be given. Since $X$ commutes with $A$, we see that the matrix $\widehat{X}\equiv G(X)=\tfrac{1}{2}(X+X^{-1}A)$ commutes with $A$. We need to show that

 $\sigma_{\min}(\widehat{X})\geq h(\omega(t))\qquad\text{and}\qquad\|\widehat{X}-G(\widehat{X})\|\leq\omega(t)$

for a positive function $h$ on $T$ and a rate of convergence $\omega$ on $T$. Using $\sigma_{\min}(X)\geq h(t)$ and $\|X-\widehat{X}\|=\|X-G(X)\|\leq t$, we get

 $\sigma_{\min}(\widehat{X})=\sigma_{\min}(X-(X-\widehat{X}))\geq\sigma_{\min}(X)-\|X-\widehat{X}\|\geq h(t)-t,$

from which we obtain the condition

 $h(t)-t\geq h(\omega(t)). \qquad (17)$

Next, note that $\widehat{X}=\tfrac{1}{2}(X+X^{-1}A)$ can equivalently be written as $2X\widehat{X}=X^2+A$. Using this and the fact that $X$ and $\widehat{X}$ commute, we get

 $\widehat{X}^2-A=\widehat{X}^2-A+(X^2-2X\widehat{X}+A)=(X-\widehat{X})^2. \qquad (18)$

Consequently,

 $\|\widehat{X}-G(\widehat{X})\|=\tfrac{1}{2}\|\widehat{X}^{-1}(A-\widehat{X}^2)\|\leq\tfrac{1}{2}\|\widehat{X}^{-1}\|\,\|\widehat{X}^2-A\|=\frac{\|(X-\widehat{X})^2\|}{2\,\sigma_{\min}(\widehat{X})}\leq\frac{1}{2}\,\frac{t^2}{h(t)-t},$

which gives the condition

 $\frac{1}{2}\,\frac{t^2}{h(t)-t}\leq\omega(t). \qquad (20)$

The inequality in the estimate of $\|\widehat{X}-G(\widehat{X})\|$ above represents the second major difference in the derivation compared to the matrix polar decomposition. There we have used the submultiplicativity of the matrix norm, while in (8)–(10) no inequalities occur because the matrices are effectively diagonal when considered in the unitarily invariant norm. The absence of inequalities in the derivation ultimately led to a convergence bound in Theorem 3.1 that is attained in every step $k\geq 1$.

The two (sufficient) conditions (17) and (20) for $h$ and $\omega$ form a system of functional inequalities that was ingeniously solved by Pták in [11]. Here it suffices to say that for any $\gamma\geq 0$ the rate of convergence $\omega$ on $T=(0,\infty)$ and the corresponding function $\sigma$ given by

 $\omega(t)\equiv\frac{1}{2}\,\frac{t^2}{\left(t^2+\gamma^2\right)^{1/2}}\qquad\text{and}\qquad\sigma(t)\equiv t-\gamma+\left(t^2+\gamma^2\right)^{1/2} \qquad (21)$

(i.e., (12) and (13) for $\gamma=1$) and the function

 $h(t)\equiv\gamma+\sigma(t),$

which is positive on $T$, satisfy the conditions (17) and (20).
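That these choices work can be verified directly; in fact, for this $\omega$, $\sigma$, and $h$ both conditions hold with equality, since $h(t)-t=(t^2+\gamma^2)^{1/2}$ and $\sigma(\omega(t))=\sigma(t)-t$. A small numerical confirmation (an addition to the text; the value of $\gamma$ is arbitrary):

```python
import numpy as np

gamma = 0.7                      # any fixed positive parameter

def omega(t):                    # rate of convergence from (21)
    return 0.5 * t**2 / np.sqrt(t**2 + gamma**2)

def sigma(t):                    # corresponding series from (21)
    return t - gamma + np.sqrt(t**2 + gamma**2)

def h(t):                        # the function h(t) = gamma + sigma(t)
    return gamma + sigma(t)

for t in [0.01, 0.5, 3.0, 100.0]:
    assert np.isclose(h(t) - t, h(omega(t)))               # condition (17)
    assert np.isclose(0.5 * t**2 / (h(t) - t), omega(t))   # condition (20)
print("conditions (17) and (20) hold (with equality) for gamma =", gamma)
```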

It remains to satisfy the condition (i) in Theorem 2, i.e., to show that for some $t_0\in T$ there exists an $X_0\in Z(t_0)$. This will determine the parameter $\gamma$, which so far is not specified. We require that $X_0$ commutes with $A$, and that

 $\sigma_{\min}(X_0)\geq h(t_0)=t_0+\left(t_0^2+\gamma^2\right)^{1/2}\qquad\text{and}\qquad\tfrac{1}{2}\|X_0^{-1}A-X_0\|\leq t_0.$

If we define

 $t_0\equiv\tfrac{1}{2}\|X_0^{-1}A-X_0\|,$

then the second inequality is automatically satisfied. Some simplifications of the first inequality lead to

 $\sigma_{\min}(X_0)\left(\sigma_{\min}(X_0)-2t_0\right)\geq\gamma^2.$

Hence we must have $\sigma_{\min}(X_0)\geq 2t_0$, and

 $\gamma=\gamma_0\equiv\sigma_{\min}^{1/2}(X_0)\left(\sigma_{\min}(X_0)-2t_0\right)^{1/2} \qquad (22)$

is a feasible choice of the parameter in order to satisfy all conditions. Theorem 2 now shows that under these conditions the iterative algorithm (15)–(16) is meaningful and converges to some $X_*\in Z(0)$.

Let us now consider the set $Z(0)$. First note that, for any $t\in T$ and $X\in Z(t)$,

 $\|\widehat{X}^2-A\|\leq\|X-\widehat{X}\|^2\leq t^2;$

see (18). Taking the limit $t\to 0$ we see that each $X_*\in Z(0)$ must satisfy $X_*^2=A$. Moreover, each $X_*\in Z(0)$ commutes with $A$ and satisfies $\sigma_{\min}(X_*)\geq h(0)=\gamma$. If $\gamma>0$, then all matrices in the limiting set $Z(0)$ are invertible and $Z(0)\subset D(G)$. If $\gamma=0$, then a singular limiting matrix is possible and in this case $Z(0)$ is not a subset of $D(G)$. This will be discussed further below and numerically illustrated in the next section.

In summary, we have shown the following result.

{theorem}

Let $A\in\mathbb{C}^{n\times n}$ and let $X_0\in\mathbb{C}^{n\times n}$ be invertible and commute with $A$. If

 $\sigma_{\min}(X_0)\geq\|X_0^{-1}A-X_0\|\equiv 2t_0, \qquad (23)$

then the iterative algorithm (15)–(16) is meaningful and its sequence of iterates converges to a matrix $X_*$ that commutes with $A$ and satisfies $X_*^2=A$. Moreover, with $\omega$ and $\sigma$ as in (21), where $\gamma=\gamma_0$ as in (22), we have

 $\|X_*-X_k\|\leq\sigma(\omega^{(k)}(t_0)),\qquad k=0,1,2,\ldots. \qquad (24)$

If $\gamma_0>0$, then $X_*$ is invertible with $\sigma_{\min}(X_*)\geq\gamma_0$.

We will now discuss the condition (23). First note that from

 $\|X_0^{-1}A-X_0\|\leq\|X_0^{-1}\|\,\|A-X_0^2\|=\sigma_{\min}^{-1}(X_0)\,\|A-X_0^2\|$

it follows that (23) is implied by $\sigma_{\min}^2(X_0)\geq\|A-X_0^2\|$. It therefore appears that the condition is satisfied whenever $X_0$ commutes with $A$, is well conditioned and sufficiently close (in norm) to a square root of $A$. However, as we will see below, the condition (23) seriously limits the applicability of Theorem 4.1.

The fact that the initial matrix $X_0$ has to satisfy some conditions (unlike in the case of Theorem 3.1) is to be expected, since not every matrix has a square root; cf. [5, Section 1.5]. If an invertible matrix $X_0$ that commutes with $A$ and satisfies (23) can be found, then the convergence of the Newton iteration guarantees the existence of a square root of $A$. Pták referred to this as an iterative existence proof [12].

Now let $A$ be real and symmetric with the real eigenvalues $\lambda_1\leq\cdots\leq\lambda_n$ and let $X_0=\alpha I$ for some real $\alpha\neq 0$. Then $X_0$ is invertible and commutes with $A$, and (23) becomes

 $\alpha^2\geq\|A-\alpha^2 I\|=\max_{1\leq j\leq n}|\lambda_j-\alpha^2|=\max\{|\lambda_1-\alpha^2|,|\lambda_n-\alpha^2|\}. \qquad (25)$

This condition cannot be satisfied if $\lambda_1<0$. (Note that in this case the principal square root of $A$ does not exist [5, Theorem 1.29].) On the other hand, if $\lambda_1\geq 0$ and hence $A$ is positive semidefinite, then a simple computation shows that each matrix of the form

 $X_0=\alpha I\qquad\text{with}\qquad|\alpha|\geq\left(\frac{\lambda_1+\lambda_n}{2}\right)^{1/2}>0 \qquad (26)$

satisfies (23) and thus the conditions of Theorem 4.1. If $\lambda_1=0$ and hence $A$ is singular, then for every $\alpha$ as in (26) equality holds in (25). We then get $\gamma_0=0$ in (22), and in this case the rate of convergence is

 $\omega(t)=\frac{t}{2},\qquad\text{giving}\qquad\omega^{(k)}(t)=\frac{t}{2^k}\qquad\text{and}\qquad\sigma(\omega^{(k)}(t))=\frac{t}{2^{k-1}}. \qquad (27)$

A numerical example with a real, symmetric and singular matrix is shown in the right part of Figure 3.
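The $\gamma_0=0$ case is easy to reproduce (an illustration added here, not one of the paper’s figures): for $A=\operatorname{diag}(0,1)$ and $X_0=\alpha I$ with $\alpha=((\lambda_1+\lambda_n)/2)^{1/2}=1/\sqrt{2}$, the bound (24) is attained and the convergence is purely linear, $\|A^{1/2}-X_k\|=t_0/2^{k-1}$ for $k\geq 1$:

```python
import numpy as np

A = np.diag([0.0, 1.0])            # singular, symmetric positive semidefinite
Asqrt = np.diag([0.0, 1.0])        # principal square root of A (here A itself)
alpha = 1.0 / np.sqrt(2.0)
X0 = alpha * np.eye(2)
t0 = 0.5 * np.linalg.norm(np.linalg.inv(X0) @ A - X0, 2)   # t0 = |alpha| / 2

X = X0.copy()
for k in range(1, 8):
    X = 0.5 * (X + np.linalg.solve(X, A))      # Newton step (16)
    print(k, np.linalg.norm(Asqrt - X, 2), t0 / 2**(k - 1))  # columns agree
```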

The following result shows that, despite the inequality used in the estimate of $\|\widehat{X}-G(\widehat{X})\|$ in Section 4.1, the bound (24) is attained in every step for certain $A$ and $X_0$.

{corollary}

If $A$ is real symmetric positive semidefinite with eigenvalues $0\leq\lambda_1\leq\cdots\leq\lambda_n$ and $X_0$ is as in (26), then the conditions of Theorem 4.1 are satisfied, the sequence of iterates of (15)–(16) converges to the principal square root $A^{1/2}$ of $A$, and the bound (24) is an equality in each step $k\geq 0$.

{proof}

We already know that $A$ and $X_0$ satisfy the conditions of Theorem 4.1. Let $A=Q\Lambda Q^T$ be an eigendecomposition of $A$ with $Q$ orthogonal and $\Lambda=\operatorname{diag}(\lambda_1,\ldots,\lambda_n)$. Then $A^{1/2}=Q\Lambda^{1/2}Q^T$ with $\Lambda^{1/2}=\operatorname{diag}(\lambda_1^{1/2},\ldots,\lambda_n^{1/2})$ is the principal square root of $A$. It suffices to show that $\|A^{1/2}-X_0\|=\sigma(t_0)$, then the result follows from the last part of Theorem 2.

Using $A=Q\Lambda Q^T$ and $X_0=\alpha I$ we get

 $t_0=\tfrac{1}{2}\|\alpha^{-1}\Lambda-\alpha I\|=\frac{1}{2|\alpha|}\max_{1\leq j\leq n}|\lambda_j-\alpha^2|=\frac{\alpha^2-\lambda_1}{2|\alpha|},$
 $\gamma_0=\sigma_{\min}^{1/2}(X_0)\left(\sigma_{\min}(X_0)-2t_0\right)^{1/2}=\left(\alpha^2-2|\alpha|t_0\right)^{1/2}=\lambda_1^{1/2},$

and thus

 $\sigma(t_0)=t_0-\gamma_0+\left(t_0^2+\gamma_0^2\right)^{1/2}=t_0-\gamma_0+\left(t_0^2+\alpha^2-2|\alpha|t_0\right)^{1/2}=t_0-\gamma_0+(|\alpha|-t_0)$