Convergence of the randomized Kaczmarz method for phase retrieval

# Convergence of the randomized Kaczmarz method for phase retrieval

Halyun Jeong1  and C. Sinan Güntürk2
Courant Institute, NYU
11email: halyun@cims.nyu.edu
22email: gunturk@cims.nyu.edu
June 30, 2017; revised July 16, 2017
###### Abstract

The classical Kaczmarz iteration and its randomized variants are popular tools for fast inversion of linear overdetermined systems. This method extends naturally to the setting of the phase retrieval problem via substituting at each iteration the phase of any measurement of the available approximate solution for the unknown phase of the measurement of the true solution. Despite the simplicity of the method, rigorous convergence guarantees that are available for the classical linear setting have not been established so far for the phase retrieval setting. In this short note, we provide a convergence result for the randomized Kaczmarz method for phase retrieval in . We show that with high probability a random measurement system of size will be admissible for this method in the sense that convergence in the mean square sense is guaranteed with any prescribed probability. The convergence is exponential and comparable to the linear setting.

## 1 Introduction

The classical Kaczmarz iteration is a popular and convenient method for the recovery of any real or complex -dimensional vector from a collection of sufficient linear measurements , , where denotes the Euclidean inner product of and . Starting with any initial point , the algorithm produces a succession of iterates defined by

 xk+1=xk+(yt−xk⋅ϕt)ϕt∥ϕt∥2, (1)

where is the index of the selected vector (and the corresponding measurement) at time . This equation has a simple interpretation: is the orthogonal projection of on the solution hyperplane . In other words, the update is the orthogonal projection of the error on the chosen direction . Kaczmarz’s original scheme cycles through the indices periodically, but it has been shown that random selection generally yields faster convergence. For this and other results, see [11, 8, 7, 2].

This method can be adapted in a straightforward manner to the phase retrieval problem where we only have access to the intensities : By simply using the sign (phase) of the approximate measurement in place of that of , we get the phase-adapting Kaczmarz iteration

 (2)

where is the sign (or phase) of the scalar , defined by the relation . We will assume the convention that . This method has been proposed by various authors (e.g. [14, 6]) and has been observed to perform well in practice. For general theory and some other main approaches to the phase retrieval problem, such as PhaseLift and PhaseCut, see [4, 1, 13].

Intuitively, this scheme has the biggest chance of success if the iterates can be guaranteed to stay reasonably close to one of the solutions of the phaseless equations so that the approximate signs have a chance to frequently match (or approximate, in the complex case) the true signs and make progress. Each time there is a phase mismatch, the iterate gets an update in the wrong direction, so it is important that this event does not happen too frequently. Hence, unlike the linear classical Kaczmarz scheme (1) which is not susceptible to the initial condition, a good initialization is needed for the nonlinear phase-adapting version (2). There are now good methods for this, such as the truncated spectral initialization [3].

### 1.1 Contribution

This paper will be about the real case, i.e. both and the are in . Without loss of generality, we assume that the are of unit norm, since we can always run the iteration (2) with normalized vectors and intensity measurements . Hence we will work with the iteration

 xk+1=xk+(σ(xk⋅ϕt)|x⋅ϕt|−xk⋅ϕt)ϕt. (3)

There will be two sources of randomness in this paper. The first and the primary source of randomness is the following: Given any measurement system , we will assume that the indices are chosen uniformly and independently from . We will call the resulting method phase-adapting randomized Kaczmarz iteration, irrespective of how may have been chosen. In Section 3, we present a certain deterministic condition on called “-admissibility”(which consists of four individual properties), and show that with a -admissible (and for a sufficiently small ), if the starting relative error is less than , then after one iteration the error shrinks in conditional expectation (with respect to the random choice of ). We then carry out a probabilistic analysis of convergence in Section 4 via “drift analysis” and “hitting-time” bounds.

The secondary source of randomness will come into play when we want to show that most measurement systems are -admissible in the regime. To achieve this, we will assume that the are chosen independently from the uniform distribution on the unit sphere in . The standard Gaussian distribution on can also be used.

For convenience, we state here a summarized theorem combining these two types of randomness. Individual (and stronger) results are stated separately in Sections 3 and 4. We use the notation

 dist(u,v):=min(∥u−v∥,∥u+v∥)

to denote the distance between and up to a global phase.

###### Theorem 1.1.

Let be chosen independently and uniformly on . There exist absolute positive constants , , and such that if , then with probability the system satisfies the following property:

For any , if the phase-adapting randomized Kaczmarz method with respect to is applied to any initial point satisfying the relative error bound

 dist(x,x0)∥x∥≤δ0ε,

then the stability event

 Σ:={dist(x,xk)∥x∥≤δ0  % for all  k≥1}

holds with probability at least , and conditioned on this event the expected squared error decays exponentially. More precisely, we have

 E[dist2(x,xk)1Σ]≤e−k/4ddist2(x,x0)

for all .

We prove this theorem at the end of Section 4. Some remarks are in order:

• As is the case for the randomized Kaczmarz method for linear inverse problems, the exponential convergence of to is achieved in the mean-squared sense. However, an important distinction is that this is conditional on a stability event. (In the linear case, this event is automatic due to the fact that error decreases deterministically.) We handle this problem using methods that are known as “drift analysis” (see [5]).

• The above stated probability lower bound for the stability event is not tight. Furthermore, our preliminary calculations suggest that the methods of this paper can be extended to achieve an improved probabilistic guarantee of the form for any fixed . For the sake of exposition we do not pursue this extension in this manuscript.

• We have left out performance guarantees regarding the initialization procedure from the above theorem because we have no new results to offer here. One may simply use the truncated spectral method [3] which is capable of providing the kind of guarantee that is compatible with the above theorem in that for any accuracy guarantee it can operate in the regime and succeed with probability .

Note for the revision: We would like to note here that simultaneously with the initial posting of this paper, Y. Shuo Tan and R. Vershynin posted a manuscript (see [10]) on the randomized Kaczmarz method for phase retrieval, with results that are somewhat similar to ours, but established using different methods. Subsequently, we were also informed that Zhang et al. [15] had previously established a conditional error contractivity result for the Gaussian measurement model and using the so-called “reshaped Wirtinger flow” method.

## 2 Basic relations

Let . Then (3) can be rewritten as

 zk+1 = zk−(σ(xk⋅ϕt)|x⋅ϕt|−xk⋅ϕt)ϕt (4) = zk−(zk⋅ϕt)ϕt+[σ(x⋅ϕt)−σ(xk⋅ϕt)]|x⋅ϕt|ϕt.

Since and are orthogonal, we obtain

 ∥zk+1∥2 = ∥zk−(zk⋅ϕt)ϕt∥2+|σ(x⋅ϕt)−σ(xk⋅ϕt)|2|x⋅ϕt|2 (5) = ∥zk∥2−|zk⋅ϕt|2+|σ(x⋅ϕt)−σ(xk⋅ϕt)|2|x⋅ϕt|2.

When and have opposite signs we have so that

 |σ(x⋅ϕt)−σ(xk⋅ϕt)||x⋅ϕt|≤|σ(x⋅ϕt)−σ(xk⋅ϕt)||zk⋅ϕt|

is always valid. Hence (5) implies

 ∥zk+1∥2≤∥zk∥2+[|σ(x⋅ϕt)−σ(xk⋅ϕt)|2−1]|zk⋅ϕt|2. (6)

Note that (3) is invariant under the transformation . Hence we actually have

 ∥±x−xk+1∥2≤∥±x−xk∥2+[|σ(±x⋅ϕt)−σ(xk⋅ϕt)|2−1]|(±x−xk)⋅ϕt|2, (7)

i.e. the analysis is identical for and . For convenience of notation and without loss of generality we will work to analyze and make our initial condition assumption on .

### 2.1 Heuristic for convergence

Let be uniformly distributed on . It is a standard fact that

 E |z⋅ϕ|2=1d∥z∥2,

and an easy calculation (see Appendix) yields

 E |z⋅ϕ|4=3d(d+2)∥z∥4.

It can also be checked easily that for any two nonzero and we have

 P{σ(x⋅ϕ)≠σ(y⋅ϕ)}=1πθx,y=:d(^x,^y),

where is the angle between and , and therefore is the normalized geodesic distance on between and . Hence, by Cauchy-Schwarz inequality, we obtain

 E|σ(x⋅ϕ)−σ(y⋅ϕ)|2|(x−y)⋅ϕ|2 ≤ 4(P{σ(x⋅ϕ)≠σ(y⋅ϕ)})1/2(E|(x−y)⋅ϕ|4)1/2 (8) ≤ 4d(3θx,yπ)1/2∥x−y∥2.

Hence, if is sufficiently small (e.g., less than ), then

Guided by these calculations, we turn to the error bound (6). We see that if is sufficiently small (which, for a fixed , would be guaranteed by a sufficiently small ) and if we were to choose each uniformly and independently on the unit sphere, then we would have

 E[∥zk+1∥2∣∣Fk,θx,xk<164]≤(1−12d)∥zk∥2,

where is the sigma-algebra generated by , and for any event , is the sigma-algebra in formed by intersecting elements of with .

Hence the stochastic process is contractive in conditional expectation which is also conditional on the size of . Without the size condition on , the analysis would have been fairly straightforward, similar to the situation of the randomized Kaczmarz iteration for linear systems. As we will see, this condition makes the task non-trivial.

However, we must also establish a similar contractivity result (conditional and in expectation) for the actual random model used in this paper, i.e., when is chosen uniformly from a fixed collection . This collection itself may also have been chosen randomly, though with the above observation we can now define certain deterministic properties of that are needed for the algorithm to work.

Let and be a given collection of nonzero vectors in . Following [9], we say that , or more appropriately, the linear hyperspaces produce a -uniform tessellation of if for all and in , we have

 ∣∣∣1mcard{1≤i≤m:σ(x⋅ϕi)≠σ(y⋅ϕi)}−d(x,y)∣∣∣<δ. (9)

Then by Theorem 1.2 of [9], there exists two positive absolute constants and such that if and the are chosen independently from the uniform distribution on , then with probability at least , we get a -uniform tessellation of .

If is chosen from the collection uniformly at random and is any function on , then we define the empirical mean

 Eϕ∼Φ f(ϕ):=1mm∑i=1f(ϕi).

With a that yields a -uniform tessellation, we have that the empirical mean is within of the ensemble mean . The upper part of this bound obviously yields

 Eϕ∼Φ 1{σ(x⋅ϕ)≠σ(y⋅ϕ)}≤δ+d(^x,^y)    for all x,y. (10)

The above result provides a pathway for mimicking the argument in Section 2.1 with replaced by . Under the same random model for , a useful concentration result (i.e. for the regime ) holds for the empirical mean . Indeed as it follows from [12, Theorem 5.39], there exist absolute positive constants and such that for and with probability we have

 ∥z∥22d≤Eϕ∼Φ |z⋅ϕ|2≤3∥z∥22d    for all z. (11)

(If desired, the constants and can be chosen closer to without changing the form of this statement.)

In order to continue on the same path, one would wish to have with high probability. As it turns out,333We thank Y. Shuo Tan and R. Vershynin for bringing this fact to our attention. this is impossible in the regime . We will circumvent this obstacle by tightening the Cauchy-Schwarz argument of Section 2.1: In order to do this, we will invoke (10) coupled with the Cauchy-Schwarz inequality only in the event does not exceed a fixed multiple of its mean value , and show that the above desirable upper bound is then achievable with high probability. At the same time, we will show that the second moment contribution from the large values is in fact small, so in this event we will only invoke the trivial bound on .

To this end, given , consider the alternative weaker conditions

 Eϕ∼Φ |z⋅ϕ|41{|z⋅ϕ|2≤∥z∥2/δd}≤4∥z∥4d2   % for all z, (12)

and

 Eϕ∼Φ |z⋅ϕ|21{|z⋅ϕ|2>∥z∥2/δd}≤4δ∥z∥2d    % for all z. (13)

We will say that is -admissible if all of the four conditions (10), (11), (12), and (13) hold. Note that all of these are deterministic conditions on . We will show in Lemma 3.2 that a random measurement system is -admissible with high probability when , but first let us show how these two alternative conditions are used instead of a bound on . Suppose is -admissible. Noting that , we have

 Eϕ∼Φ |σ(x⋅ϕ)−σ(y⋅ϕ)|2|(x−y)⋅ϕ|2 = 4 Eϕ∼Φ 1{σ(x⋅ϕ)≠σ(y⋅ϕ)} |(x−y)⋅ϕ|21{|(x−y)⋅ϕ|2≤∥x−y∥2/δd} +4 Eϕ∼Φ 1{σ(x⋅ϕ)≠σ(y⋅ϕ)} |(x−y)⋅ϕ|21{|(x−y)⋅ϕ|2>∥x−y∥2/δd} +4 Eϕ∼Φ |(x−y)⋅ϕ|21{|(x−y)⋅ϕ|2>∥x−y∥2/δd}. (14)

Invoking (12) and (13), we get that

 Eϕ∼Φ |σ(x⋅ϕ)−σ(y⋅ϕ)|2|(x−y)⋅ϕ|2 ≤ (8(δ+d(^x,^y))1/2+16δ)∥x−y∥2d (15) ≤ 14d∥x−y∥2,

provided and is sufficiently small (e.g. ). Hence, together with the lower bound of (11), we have

 Eϕ∼Φ (|σ(x⋅ϕ)−σ(y⋅ϕ)|2−1)|(x−y)⋅ϕ|2≤−14d∥x−y∥2

and therefore

 E[∥zk+1∥2∣∣Fk,d(^x,^xk)≤δ]≤(1−14d)∥zk∥2,

where again is the sigma-algebra generated by .

At this point, it will be helpful to replace the condition by a size condition on . Note that for any two nonzero vectors and , we have

so that the condition implies . Therefore we have

 E[∥zk+1∥2∣∣Fk,∥zk∥≤δ∥x∥]≤(1−14d)∥zk∥2. (16)

With the above discussion we have established the following result:

###### Lemma 3.1.

There exists such that, if and is -admissible, then

 E[∥zk+1∥2∣∣Fk,∥zk∥≤b]≤ρ∥zk∥2    for all k≥0, (17)

where and .

We now show that a random is -admissible with high probability in the regime .

###### Lemma 3.2.

For every , there exists positive constants and depending only on such that if , then a random measurement system that is chosen independently from the uniform distribution on is -admissible with probability at least .

###### Proof.

The property (10) is proven in [9] and (11) is covered by [12, Theorem 5.39]. Hence we only need to establish (12) and (13). Note that by homogeneity we may assume .

We start with (12). As is standard in this type of question, we would like to establish the stated inequality for fixed first (with high probability) and then use approximation over an -net of to achieve uniformity over . However is a discontinuous function of the random variable , presenting a difficulty for the approximation argument. The solution will follow by incorporating a suitable Lipschitz extension, as also done in [3].

For this purpose, let be defined by

 γ1(s):=⎧⎪⎨⎪⎩s2,s≤δ−1,(2δ−1−s)δ−1,δ−1

Then is a Lipschitz function with Lipschitz constant . Furthermore,

 s2χ[0,δ−1](s)≤γ1(s)≤s2

so that for any we have

 = 1d2Eϕ∼Φ |z⋅(√dϕ)|4χ[0,δ−1](|z⋅(√dϕ)|2) (19) ≤ 1d2Eϕ∼Φ γ1(|z⋅(√dϕ)|2).

Now, let denote the random vector uniformly distributed on so that is a spherical random vector in (see [12, Section 5.2.5]). Let stand for the sub-exponential norm and the sub-Gaussian norm (see [12, Section 5.2.3 and 5.2.4]). Noting that , we have

 ∥∥γ1(|z⋅(√dϕ)|2)∥∥ψ1≤δ−1∥∥|z⋅(√dϕ)|2∥∥ψ1≤2δ−1∥∥|z⋅(√dϕ)|∥∥2ψ2≲δ−1

where in the second step we have used [12, Lemma 5.14]) and in the last step the fact that the sub-Gaussian norm of a spherical random vector is bounded by an absolute constant (see [12, Section 5.2.5]; a direct computation is also possible).

Hence, by the Bernstein-type inequality [12, Proposition 5.16], there is an absolute constant such that for any we have, with probability at least ,

 Eϕ∼Φ γ1(|z⋅(√dϕ)|2) ≤ E γ1(|z⋅(√dϕ)|2) + t (20) ≤ E |z⋅(√dϕ)|4 + t ≤ 3 + t,

where in the second step we have used instead.

Now pick an -net of the unit sphere of cardinality at most where . For each and such that , we have

 Eϕ∼Φ ∣∣γ1(|z′⋅(√dϕ)|2)−γ1(|z⋅(√dϕ)|2)∣∣ ≤ 2δ−1Eϕ∼Φ ∣∣|z′⋅(√dϕ)|2−|z⋅(√dϕ)|2∣∣ (21) = 2δ−1Eϕ∼Φ ∣∣[(z′−z)⋅(√dϕ)][(z′+z)⋅(√dϕ)]∣∣ ≤ 6δ−1ϵ,

where in the first step we have utilized the Lipschitz continuity of , and in the last step Cauchy-Schwarz inequality coupled with the upper bound of (11). Combining (19), (20), and (21), we find that with probability at least , we have

 Eϕ∼Φ |z′⋅ϕ|41{|z′⋅ϕ|2≤1/δd}≤1d2(3+t+6δ−1ϵ)

for every . We may choose and so that and therefore (12) holds with probability at least provided .

We continue with (13). We will use the same method, but with a different Lipschitz function. Let be defined by

 γ2(s):={δs2,s≤δ−1s,s>δ−1. (22)

Then is a Lipschitz function that fixes with Lipschitz constant . We have

 sχ(δ−1,∞)(s)≤γ2(s)=min(δs2,s)

so that for any fixed we have

 = 1dEϕ∼Φ |z⋅(√dϕ)|2χ(δ−1,∞)(|z⋅(√dϕ)|2) (23) ≤ 1dEϕ∼Φ γ2(|z⋅(√dϕ)|2).

Noting that , we now have

 ∥∥γ2(|z⋅(√dϕ)|2)∥∥ψ1≲1

so that by the Bernstein-type inequality (and reducing the value of if necessary), for any we have, with probability at least ,

 Eϕ∼Φ γ2(|z⋅(√dϕ)|2) ≤ E γ2(|z⋅(√dϕ)|2) + t (24) ≤ δE |z⋅(√dϕ)|4 + t ≤ 3δ + t,

where in the second step we have used instead. We again pick an -net of the unit sphere of cardinality at most . For each and such that , this time we have

 Eϕ∼Φ ∣∣γ2(|z′⋅(√dϕ)|2)−γ2(|z⋅(√dϕ)|2)∣∣ ≤ 2Eϕ∼Φ ∣∣|z′⋅(√dϕ)|2−|z⋅(√dϕ)|2∣∣ (25) ≤ 6ϵ.

Hence by the union bound, with probability at least we have

 Eϕ∼Φ |z′⋅ϕ|21{|z′⋅ϕ|2>1/δd}≤1d(3δ+t+6ϵ)

for every . We may choose and so that and therefore (13) holds with probability at least provided . ∎

## 4 Probabilistic analysis of the error sequence

Our goal in this section will be to bound the probability that exceeds at some point and to obtain probabilistic guarantees on the exponential decay of . Lemma 3.1 uses the randomness present in the selection of only. To be able to iterate this result recursively we need to condition on the event . We define the “hitting time”

 τb:=min{j≥0:∥zj∥>b}.

Hence is the same as and the event means for all .

###### Lemma 4.1.

Suppose (17) holds. Then

 E[∥zk+1∥2∣∣τb>k]≤ρ E[∥zk∥2∣∣τb>k−1] (26)

and therefore

 E[∥zk∥2∣∣τb>k−1]≤ρk∥z0∥2 (27)

for all .

###### Proof.

Note that is the trivial sigma-algebra. We have

 E[∥zk+1∥2∣∣τb>k] = E[∥zk+1∥2∣∣τb>k,F0] (28) = E[E[∥zk+1∥2∣∣τb>k,Fk]∣∣τb>k,F0] ≤ ρ E[E[∥zk∥2∣∣τb>k,Fk]∣∣τb>k,F0] = ρ E[∥zk∥2∣∣τb>k].

Note also that

 E[∥zk∥2∣∣τb>k]≤b2≤E[∥zk∥2∣∣τb=k]. (29)

Hence we have

 E[∥zk∥2∣∣τb>k] = P(∥zk∥≤b∣∣τb>k−1)E[∥zk∥2∣∣τb>k] (30) + P(∥zk∥>b∣∣τb>k−1)E[∥zk∥2∣∣τb>k] ≤ P(∥zk∥≤b∣∣τb>k−1)E[∥zk∥2∣∣τb>k] + P(∥zk∥>b∣∣τb>k−1)E[∥zk∥2∣∣τb=k] = P(∥zk∥≤b∣∣τb>k−1)E[∥zk∥2∣∣∥zk∥≤b,τb>k−1] + P(∥zk∥>b∣∣τb>k−1)E[∥zk∥2∣∣∥zk∥>b,τb>k−1] = E[∥zk∥21{∥zk∥≤b}∣∣τb>k−1]+E[∥zk∥21{∥zk∥>b}∣∣τb>k−1] = E[∥zk∥2∣∣τb>k−1].

The result now follows by combining (28) and (30). ∎

We can now use Lemma 4.1 to control (i) the probability of the event that the error exceeds at some point (i.e. ), and (ii) the expected decay of squared error conditional on the event that the error remains bounded by (i.e. ).

###### Lemma 4.2.

Suppose (17) holds. Then for any we have

 E[∥zk∥2∣∣τb=∞]≤11−P(τb<∞)ρk∥z0∥2. (31)

and for any

 P(∥zk∥≥ρk/2a)≤(∥z0∥a)2+P(τb<∞). (32)
###### Proof.

For the first claim it suffices to observe that so that

 P(τb=∞)E[∥zk∥2∣∣τb=∞]≤P(τb>k−1)E[∥zk∥2∣∣τb>k−1].

The result follows by bounding the right hand side of this inequality using (27).

For the second claim, note that

 P(∥zk∥≥ϵ) = P(∥zk∥≥ϵ∣∣τb>k−1)P(τb>k−1)+P(∥zk∥≥ϵ∣∣τb≤k−1)P(τb≤k−1) (33) ≤ P(∥zk∥≥ϵ∣∣τb>k−1)+P(τb<∞) ≤ ϵ−2E[∥zk∥2∣∣τb>k−1]+P(τb<∞).

The result follows by setting and using (27) again. ∎

Next we give a bound on .

###### Lemma 4.3.

Suppose (17) holds. Then

 b2P(τb<∞)+(ρ−1−1)∞∑k=1P(τb>k)E[∥zk+1∥2∣∣τb>k]≤ρ∥z0∥2, (34)

in particular we have