Solving Systems of Random Quadratic Equations via Truncated Amplitude Flow

# Solving Systems of Random Quadratic Equations via Truncated Amplitude Flow

Gang Wang,  Georgios B. Giannakis,
and Yonina C. Eldar,
G. Wang and G. B. Giannakis were supported in part by NSF grants 1500713 and 1514056. G. Wang and G. B. Giannakis are with the Digital Technology Center and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. G. Wang is also with the State Key Lab of Intelligent Control and Decision of Complex Systems, Beijing Institute of Technology, Beijing 100081, P. R. China. Y. C. Eldar is with the Department of Electrical Engineering, Technion – Israel Institute of Technology, Haifa 32000, Israel. Emails: {gangwang,georgios}@umn.edu; yonina@ee.technion.ac.il.
###### Abstract

This paper presents a new algorithm, termed truncated amplitude flow (TAF), to recover an unknown vector from a system of quadratic equations of the form , where ’s are given random measurement vectors. This problem is known to be NP-hard in general. We prove that as soon as the number of equations is on the order of the number of unknowns, TAF recovers the solution exactly (up to a global unimodular constant) with high probability and complexity growing linearly with both the number of unknowns and the number of equations. Our TAF approach adopts the amplitude-based empirical loss function, and proceeds in two stages. In the first stage, we introduce an orthogonality-promoting initialization that can be obtained with a few power iterations. Stage two refines the initial estimate by successive updates of scalable truncated generalized gradient iterations, which are able to handle the rather challenging nonconvex and nonsmooth amplitude-based objective function. In particular, when vectors and ’s are real-valued, our gradient truncation rule provably eliminates erroneously estimated signs with high probability to markedly improve upon its untruncated version. Numerical tests using synthetic data and real images demonstrate that our initialization returns more accurate and robust estimates relative to spectral initializations. Furthermore, even under the same initialization, the proposed amplitude-based refinement outperforms existing Wirtinger flow variants, corroborating the superior performance of TAF over state-of-the-art algorithms.

Index terms— Nonconvex optimization, phase retrieval, amplitude-based cost function, initialization, truncated gradient, linear convergence to global minimum.

## I Introduction

Consider a system of quadratic equations

 yi=|⟨ai,x⟩|2,1≤i≤m\vspace−.em (1)

where the data vector and feature vectors or are known, whereas the vector or is the wanted unknown. When and/or are complex, the magnitudes of their inner-products are given but phase information is lacking; in the real case only the signs of are unknown. Assuming that the system of quadratic equations in (1) admits a unique solution (up to a global unimodular constant), our objective is to reconstruct from phaseless quadratic equations, or equivalently, to recover the missing signs/phases of under real-/complex-valued settings. It has been established that or generic measurements as in (1) suffice for uniquely determining an -dimensional real-valued or complex-valued vector  [1, 2], respectively, while the former with has also been shown to be necessary [1, 3].

The problem in (1) constitutes an instance of nonconvex quadratic programming, that is generally known to be NP-hard [4]. Specifically for real-valued vectors and , problem (1) can be understood as a combinatorial optimization since one seeks a series of signs , such that the solution to the system of linear equations , where , obeys the given quadratic system. Evidently, there are a total of different combinations of , among which only two lead to up to a global sign. The complex case becomes even more complicated, where instead of a set of signs , one must determine a collection of unimodular complex scalars . Special cases with (entry-wise inequality), , and , correspond to the so-called stone problem [5, Section 3.4.1][6].

In many fields of physical sciences and engineering, the problem of recovering the phase from intensity/magnitude-only measurements is commonly referred to as phase retrieval [7, 8, 9]. Relevant application domains include X-ray crystallography [10], optics [11, 12], array and high-power coherent diffractive imaging [13, 14], astronomy [15], and microscopy [16]. In these settings, due to physical limitations, optical sensors and detectors such as charge-coupled device (CCD) cameras, photosensitive films, and human eyes can record only the (squared) modulus of the Fresnel or Fraunhofer diffraction pattern, while losing the phase of the incident light striking the object. It has been shown that reconstructing a discrete, finite-duration signal from its Fourier transform magnitudes is generally NP-complete [17]. Even checking quadratic feasibility (i.e., whether a solution to a given quadratic system exists or not) is itself an NP-hard problem [18, Theorem 2.6]. Thus, despite its simple form and practical relevance across various fields, tackling the quadratic system in (1) is challenging and NP-hard in general.

### I-a Prior art

Adopting the least-squares criterion, the task of recovering from data observed in additive white Gaussian noise (AWGN) can be recast as that of minimizing the intensity-based empirical loss [19]

 minimizez∈Rn/Cn  f(z):=12mm∑i=1(yi−|⟨ai,z⟩|2)2. (2)

An alternative is to consider an amplitude-based loss, in which is observed instead of in AWGN [7, 20]

 minimizez∈Rn/Cn  h(z):=12mm∑i=1(ψi−|⟨ai,z⟩|)2. (3)

Unfortunately, the presence of quadratic terms in (2) or the modulus in (3) renders the corresponding objective function nonconvex. Minimizing nonconvex objectives, which may exhibit many stationary points, is in general NP-hard [21]. In fact, even checking whether a given point is a local minimum or establishing convergence to a local minimum turns out to be NP-complete [21].

In the classical discretized one-dimensional (D) phase retrieval, the amplitude vector corresponds to the -point Fourier transform of the -dimensional signal  [22]. It has been shown based on spectral factorization that in general there is no unique solution to D phase retrieval, even if we disregard trivial ambiguities [23]. To overcome this ill-posedness, several approaches have been suggested. One possibility is to assume additional constraints on the unknown signal such as sparsity [24, 25, 14, 26]. Other approaches rely on introducing redundancy into the measurements using for example, the short-time Fourier transform, or masks [27, 28]. Finally, recent works assume random measurements (e.g., Gaussian designs) [29, 8, 26, 30, 19, 6, 31]. Henceforth, this paper focuses on random measurements obtained from independently and identically distributed (i.i.d.) Gaussian designs.

Existing approaches to solving (2) (or related ones using the Poisson likelihood; see, e.g., [6]) or (3) fall under two categories: nonconvex and convex ones. Popular nonconvex solvers include alternating projection such as Gerchberg-Saxton [32] and Fineup [7], AltMinPhase [29], (Truncated) Wirtinger flow (WF/TWF) [19, 6, 33], and Karzmarz variants [34] as well as trust-region methods [35]. Inspired by WF, other relevant judiciously initialized counterparts have also been developed for faster semidefinite optimization [36, 37], blind deconvolution [38], and matrix completion [39]. Convex counterparts on the other hand rely on the so-called matrix-lifting technique or Shor’s semidefinite relaxation to obtain the solvers abbreviated as PhaseLift [30], PhaseCut [40], and CoRK [41]. Further approaches dealing with noisy or sparse phase retrieval are discussed in [24, 31, 42, 43, 44, 45, 46].

In terms of sample complexity, it has been proven that111The notation means that there is a constant such that . noise-free random measurements suffice for uniquely determining a general signal [26]. It is also self-evident that recovering a general -dimensional requires at least measurements. Convex approaches enable exact recovery from the optimal bound of noiseless Gaussian measurements [47]; they are based on solving a semidefinite program with a matrix variable of size , thus incurring worst-case computational complexity on the order of  [40] that does not scale well with the dimension . Upon exploiting the underlying problem structure, can be reduced to  [40]. Solving for vector variables, nonconvex approaches achieve significantly improved computational performance. Using formulation (3) and adopting a spectral initialization commonly employed in matrix completion [48], AltMinPhase establishes exact recovery with sample complexity under i.i.d. Gaussian designs with resampling [29].

Concerning formulation (2), WF iteratively refines the spectral initial estimate by means of a gradient-like update, which can be approximately interpreted as a stochastic gradient descent variant [19], [33]. The follow-up TWF improves upon WF through a truncation procedure to separate gradient components of excessively extreme (large or small) sizes. Likewise, due to the heavy tails present in the initialization stage, data are pre-screened to yield improved initial estimates in the so-termed truncated spectral initialization method [6]. WF allows exact recovery from measurements in time/flops to yield an -accurate solution for any given  [19], while TWF advances these to measurements and time [6]. Interestingly, the truncation procedure in the gradient stage turns out to be useful in avoiding spurious stationary points in the context of nonconvex optimization, as will be justified in Section IV by the numerical comparison between our amplitude flow (AF) algorithms with or without the judiciously designed truncation rule. It is also worth mentioning that when for some sufficiently large positive constant , the objective function in (3) is shown to admit benign geometric structure that allows certain iterative algorithms (e.g., trust-region methods) to efficiently find a global minimizer with random initializations [35]. Hence, the challenge of solving systems of random quadratic equations lies in the case where a near-optimal number of equations are involved, e.g., in the real-valued setting.

Although achieving a linear (in the number of unknowns ) sample and computational complexity, the state-of-the-art TWF approach still requires at least equations to yield stable empirical success rate (e.g., ) under the noiseless real-valued Gaussian model [6, Section 3], which is more than twice the known information-limit of  [1]. Similar though less obvious results hold in the complex-valued scenario. While the truncated spectral initialization in [6] improves upon the “plain-vanilla” spectral initialization, its performance still suffers when the number of measurements is relatively small and its advantage (over the untruncated one) diminishes as the number of measurements grows; see more details in Fig. 4 and Section II. Furthermore, extensive numerical and experimental validation confirms that the amplitude-based cost function performs significantly better than the intensity-based one [49]; that is, formulation (3) is superior to (2). Hence, besides enhancing initialization, markedly improved performance in the gradient stage can be expected by re-examining the amplitude-based cost function and incorporating judiciously designed gradient regularization rules.

### I-B This paper

Along the lines of suitably initialized nonconvex schemes [19, 6] and inspired by [49], the present paper develops a linear-time (i.e., the computational time linearly in both dimensions and ) algorithm to minimize the amplitude-based cost function, referred to as truncated amplitude flow (TAF). Our approach provably recovers an -dimensional unknown signal exactly from a near-optimal number of noiseless random measurements, while also featuring near-perfect statistical performance in the noisy setting. TAF operates in two stages: In the first stage, we introduce an orthogonality-promoting initialization that is computable using a few power iterations. Stage two refines the initial estimate by successive updates of truncated generalized gradient iterations.

Our initialization is built upon the hidden orthogonality characteristics of high-dimensional random vectors [50], which is in contrast to spectral alternatives originating from the strong law of large numbers (SLLN) [14, 19, 6]. Furthermore, one challenge of phase retrieval lies in reconstructing the signs/phases of in the real-/complex-valued settings. Our TAF’s refinement stage leverages a simple yet effective regularization rule to eliminate the erroneously estimated phases in the generalized gradient components with high probability. Simulated tests corroborate that the proposed initialization returns more accurate and robust initial estimates than its spectral counterparts in the noiseless and noisy settings. In addition, our TAF (with gradient truncation) markedly improves upon its “plain-vanilla” version AF. Empirical results demonstrate the advantage of TAF over its competing alternatives.

Focusing on the same amplitude-based cost function, an independent work develops the so-termed reshaped Wirtinger flow (RWF) algorithm [51], which coincides with amplitude flow (AF). A slightly modified variant of spectral initialization [19] is used to obtain an initial guess, followed by a sequence of non-truncated generalized gradient iterations [51]. Numerical comparisons show that the proposed TAF method performs better than RWF especially when the number of equations approaches the information-theoretic limit ( in the real case).

The remainder of this paper is organized as follows. The amplitude-based cost function, as well as the two algorithmic stages is described and analyzed in Section II. Section III summarizes the TAF algorithm and establishes its theoretical performance. Extensive simulated tests comparing TAF with Wirtinger-based approaches are presented in Section IV. Finally, main proofs are given in Section V, while technical details are deferred to the Appendix.

## Ii Truncated Amplitude Flow

In this section, the two stages of our TAF algorithm are detailed. First, the challenge of handling the nonconvex and nonsmooth amplitude-based cost function is analyzed, and addressed by a carefully designed gradient regularization rule. Limitations of (truncated) spectral initializations are then pointed out, followed by a simple motivating example to inspire our orthogonality-promoting initialization method. For concreteness, the analysis will focus on the real-valued Gaussian model with and i.i.d. design vectors . Numerical experiments using the complex-valued Gaussian model with and i.i.d. will be discussed briefly.

To start, let us define the Euclidean distance of any estimate to the solution set: for real signals, and for complex ones [19], where denotes the Euclidean norm. Define also the indistinguishable global phase constant in the real-valued setting as

 ϕ(z):={0,∥z−x∥≤∥z+x∥,π,otherwise.\vspace−.em (4)

Henceforth, fixing to be any solution of the given quadratic system (1), we always assume that ; otherwise, is replaced by , but for simplicity of presentation, the constant phase adaptation term will be dropped whenever it is clear from the context.

### Ii-a Truncated generalized gradient stage

For brevity, collect all vectors in the matrix , and all amplitudes to form the vector . One can rewrite the amplitude-based cost function in matrix-vector representation as

 \vspace−.emminimizez∈Rn  ℓ(z):=1mm∑i=1ℓi(z)=12m∥∥ψ−|Az|∥∥2\vspace−.em (5)

where with the superscript () denoting (Hermitian) transpose; and with a slight abuse of notation, . Apart from being nonconvex, is also nondiffentiable, hence challenging the algorithmic design and analysis. In the presence of smoothness or convexity, convergence analysis of iterative algorithms relies either on continuity of the gradient (ordinary gradient methods) [52], or, on the convexity of the objective functional (subgradient methods) [53]. Although subgradient methods have found widespread applicability in nonsmooth optimization, they are limited to the class of convex functions [54, Page 4]. In nonconvex nonsmooth optimization settings, the so-termed generalized gradient broadens the scope of the (sub)gradient to the class of almost everywhere differentiable functions [55].

Consider a continuous but not necessarily differentiable function defined over an open region . We then have the following definition.

###### Definition 1.

[56, Definition 1.1] The generalized gradient of a function at , denoted by , is the convex hull of the set of limits of the form , where as , i.e.,

 \vspace−0.em∂h(z):=conv{limk→+∞∇h(zk):zk→z,zk∉Gℓ}

where the symbol ‘conv’ signifies the convex hull of a set, and denotes the set of points in at which fails to be differentiable.

Having introduced the notion of a generalized gradient, and with denoting the iteration count, our approach to solving (5) amounts to iteratively refining the initial guess (returned by the orthogonality-promoting initialization method to be detailed shortly) by means of the ensuing truncated generalized gradient iterations

 zt+1=zt−μt∂ℓtr(zt). (6)

Here, is the step size, and the (truncated) generalized gradient is given by

 ∂ℓtr(zt):=1m∑i∈It+1(aTizt−ψiaTizt|aTizt|)ai\vspace−.em (7)

for some index set to be designed next. The convention is adopted, if . It is easy to verify that the update in (6) with a full generalized gradient in (7) monotonically decreases the objective function value in (5).

Any stationary point of can be characterized by the following fixed-point equation [57, 58]

 AT(Az∗−ψ⊙Az∗|Az∗|)=0\vspace−.em (8)

for entry-wise product , which may have many solutions. Clearly, if is a solution, then so is . Furthermore, both solutions/global minimizers and satisfy (8) due to the fact that . Considering any stationary point that has been adapted such that , one can write

 z∗=x+(ATA)−1AT[ψ⊙(Az∗|Az∗|−Ax|Ax|)]. (9)

Thus, a necessary condition for in (9) is . Expressed differently, there must be sign differences between and whenever one gets stuck with an undesirable stationary point . Inspired by this observation, it is reasonable to devise algorithms that can detect and separate out the generalized gradient components corresponding to mistakenly estimated signs along the iterates .

Precisely, if and lie at different sides of the hyperplane , then the sign of will be different than that of ; that is, . Specifically, one can re-write the -th generalized gradient component as

 ∂ℓi(z) =(aTiz−ψiaTiz|aTiz|)ai =(aTiz−|aTix|aTix|aTix|)ai+(aTix|aTix|−aTiz|aTiz|)ψiai =aiaTi(z−x)+(aTix|aTix|−aTiz|aTiz|)ψiai =aiaTih+(aTix|aTix|−aTiz|aTiz|)ψiai△=ri, (10)

where . Intuitively, the SLLN asserts that averaging the first term over instances approaches , which qualifies it as a desirable search direction. However, certain generalized gradient entries involve erroneously estimated signs of ; hence, nonzero terms exert a negative influence on the search direction by dragging the iterate away from , and they typically have sizable magnitudes as will be further elaborated in Remark 2 shortly.

Figure 1 demonstrates this from a geometric perspective, where the black dot denotes the origin, and the red dot the solution ; here, is omitted for ease of exposition. Assume without loss of generality that the -th missing sign is positive, i.e., . As will be demonstrated in Theorem 1, with high probability, the initial estimate returned by our orthogonality-promoting method obeys for some sufficiently small constant . Therefore, all points lying on or within the circle (or sphere in high-dimensional spaces) in Fig. 1 satisfy . If does not intersect with the circle, then all points within the circle satisfy qualifying the -th generalized gradient as a desirable search (descent) direction in (10). If, on the other hand, intersects the circle, then points lying on the same side of with in Fig. 1 admit correctly estimated signs, while points lying on different sides of with would have . This gives rise to a corrupted search direction in (10), implying that the corresponding generalized gradient component should be eliminated.

Nevertheless, it is difficult or even impossible to check whether the sign of equals that of . Fortunately, as demonstrated in Fig. 1, most spurious generalized gradient components (those corrupted by nonzero terms) hover around the watershed hyperplane . For this reason, TAF includes only those components having sufficiently away from its watershed, i.e.,

 It+1:={1≤i≤m∣∣ ∣∣|aTizt||aTix|≥11+γ},t≥0\vspace−.em (11)

for an appropriately selected threshold . To be more specific, the light yellow color-coded area denoted by in Fig. 1 signifies the truncation region of : if satisfies the condition in (11), then the corresponding generalized gradient component will be thrown out. However, the truncation rule may mis-reject certain ‘good’ gradients if lies in the upper part of ; ‘bad’ gradients may be missed as well if belongs to the spherical cap . Fortunately, as we will show in Lemmas 5 and 6, the probabilities of misses and mis-rejections are provably very small, hence precluding a noticeable influence on the descent direction. Although not perfect, it turns out that such a regularization rule succeeds in detecting and eliminating most corrupted generalized gradient components with high probability, therefore maintaining a well-behaved search direction.

Regarding our gradient regularization rule in (11), two observations are in order.

###### Remark 1.

The truncation rule in (11) includes only relatively sizable ’s, hence enforcing the smoothness of the (truncated) objective function at . Therefore, the truncated generalized gradient employed in (6) and (7) boils down to the ordinary gradient/Wirtinger derivative in the real-/complex-valued case.

###### Remark 2.

As will be elaborated in (80) and (82), the quantities and in (10) have magnitudes on the order of and , respectively. In contrast, Proposition 1 asserts that the first term in (10) obeys for a sufficiently small . Thus, spurious generalized gradient components typically have large magnitudes. It turns out that our gradient regularization rule in (11) also throws out gradient components of large sizes. To see this, for all such that  in (28), one can re-express

 m∑i=1∂ℓi(z)=m∑i=1(1−|aTix||aTiz|)△=βiaiaTiz (12)

for some weight assigned to the direction due to . Then of an excessively large size corresponds to a large in (12), or equivalently a small in (11), thus causing the corresponding to be eliminated according to the truncation rule in (11).

Our truncation rule deviates from the intuition behind TWF, which throws away gradient components corresponding to large-size in (11). As demonstrated by our analysis in Appendix A-E, it rarely happens that a gradient component having large yields an incorrect sign of under a sufficiently accurate initialization. Moreover, discarding too many samples (those for which in TWF [6, Section 2.1]) introduces large bias into , so that TWF does not work well when is close to the information-limit of . In sharp contrast, the motivation and objective of our truncation rule in (11) is to directly sense and eliminate gradient components that involve mistakenly estimated signs with high probability.

To demonstrate the power of TAF, numerical tests comparing all stages of (T)AF and (T)WF will be presented throughout our analysis. The basic test settings used in this paper are described next. For fairness, all pertinent algorithmic parameters involved in all compared schemes were set to their default values. Simulated estimates are averaged over independent Monte Carlo (MC) realizations without mentioning this explicitly each time. Performance of different schemes is evaluated in terms of the relative root mean-square error, i.e.,

 Relative error:=dist(z,x)∥x∥, (13)

and the success rate among trials, where a success is claimed for a trial if the returned estimate incurs a relative error less than  [6]. Simulated tests under both noiseless and noisy Gaussian models are performed, corresponding to  [29] with and , respectively, with i.i.d. or .

Numerical comparison depicted in Fig. 2 using the noiseless real-valued Gaussian model suggests that even when starting with the same truncated spectral initialization, TAF’s refinement outperforms those of TWF and WF, demonstrating the merits of our gradient update rule over TWF/WF. Furthermore, comparing TAF (gradient iterations in (6)-(7) with truncation in (11) initialized by the truncated spectral estimate) and AF (gradient iterations in (6)-(7) initialized by the truncated spectral estimate) corroborates the power of the truncation rule in (11).

### Ii-B Orthogonality-promoting initialization stage

Leveraging the SLLN, spectral initialization methods estimate as the (appropriately scaled) leading eigenvector of , where is an index set accounting for possible data truncation. As asserted in [6], each summand follows a heavy-tail probability density function lacking a moment generating function. This causes major performance degradation especially when the number of measurements is small. Instead of spectral initializations, we shall take another route to bypass this hurdle. To gain intuition into our initialization, a motivating example is presented first that reveals fundamental characteristics of high-dimensional random vectors.

Fixing any nonzero vector , generate data using i.i.d. , . Evaluate the following squared normalized inner-product

 cos2θi:=|⟨ai,x⟩|2∥ai∥2∥x∥2=ψ2i∥ai∥2∥x∥2,1≤i≤m (14)

where is the angle between vectors and . Consider ordering all in an ascending fashion, and collectively denote them as with . Figure 3 plots the ordered entries in for varying by from to with . Observe that almost all vectors have a squared normalized inner-product with smaller than , while half of the inner-products are less than , which implies that is nearly orthogonal to a large number of ’s.

This example corroborates the folklore that random vectors in high-dimensional spaces are almost always nearly orthogonal to each other [50]. This inspired us to pursue an orthogonality-promoting initialization method. Our key idea is to approximate by a vector that is most orthogonal to a subset of vectors , where is an index set with cardinality that includes indices of the smallest squared normalized inner-products . Since appears in all inner-products, its exact value does not influence their ordering. Henceforth, we assume with no loss of generality that .

Using data , evaluate according to (14) for each pair and . Instrumental for the ensuing derivations is noticing from the inherent near-orthogonal property of high-dimensional random vectors that the summation of over all indices should be very small; rigorous justification is deferred to Section V. Therefore, the sum is also small, or according to (14), equivalently,

 ∑i∈I0|⟨ai,x⟩|2∥ai∥2∥x∥2=x∥x∥(∑i∈I0aiaTi∥ai∥2)x∥x∥ (15)

is small. Therefore, a meaningful approximation of can be obtained by minimizing the former with replaced by the optimization variable , namely

 minimize∥z∥=1  zT⎛⎝1|I0|∑i∈I0aiaTi∥ai∥2⎞⎠z. (16)

This amounts to finding the smallest eigenvalue and the associated eigenvector of (the symbol means positive semidefinite). Finding the smallest eigenvalue calls for eigen-decomposition or matrix inversion, each typically requiring computational complexity on the order of . Such a computational burden may be intractable when grows large. Applying a standard concentration result, we show how the computation can be significantly reduced.

Since has unit norm and is uniformly distributed on the unit sphere, it is uniformly spherically distributed.222A random vector is said to be spherical (or spherically symmetric) if its distribution does not change under rotations of the coordinate system; that is, the distribution of coincides with that of for any given orthogonal matrix . Spherical symmetry implies that has zero mean and covariance matrix  [59]. Appealing again to the SLLN, the sample covariance matrix approaches as grows. Simple derivations lead to

 (17)

where is the complement of in the set . Define , and form by removing the rows of whose indices belong to . Seeking the smallest eigenvalue of then reduces to computing the largest eigenvalue of the matrix

 (18)

namely,

 ~z0:=argmax∥z∥=1  zT\lx@overaccentset\cc@style––Y0z\vspace−.em (19)

which can be efficiently solved via simple power iterations.

When , the estimate from (19) is scaled so that its norm matches approximately that of , which is estimated as , or more accurately . To motivate these estimates, using the rotational invariance property of normal distributions, it suffices to consider the case where , with denoting the first canonical vector of . Indeed,

 (20)

where is some unitary matrix, and means that terms on both sides of the equality have the same distribution. It is then easily verified that

 1mm∑i=1yi=1mm∑i=1a2i,1∥x∥2≈∥x∥2, (21)

where the last approximation arises from the following concentration result using again the SLLN. Regarding the second estimate, one can rewrite its square as

 n∑mi=1yi∑mi=1∥ai∥2=1mm∑i=1yi⋅n(1/m)⋅∑mi=1∥ai∥2. (22)

It is clear from (21) that the first term on the right hand side of (22) approximates . The second term approaches because the denominator appealing to the SLLN again and the fact that . For simplicity, we choose to work with the first norm estimate

 z0=√∑mi=1yim~z0. (23)

It is worth highlighting that, compared to the matrix used in spectral methods, our constructed matrix in (18) does not depend on the observed data explicitly; the dependence is only through the choice of the index set . The novel orthogonality-promoting initialization thus enjoys two advantages over its spectral alternatives: a1) it does not suffer from heavy-tails of the fourth-order moments of Gaussian vectors common in spectral initialization schemes; and, a2) it is less sensitive to noisy data.

Figure 4 compares three different initialization schemes including spectral initialization [29, 19], truncated spectral initialization [6], and the proposed orthogonality-promoting initialization. The relative error of their returned initial estimates versus the measurement/unknown ratio is depicted under the noiseless and noisy real-valued Gaussian models, where was randomly generated and increases by from to . Clearly, all schemes enjoy improved performance as increases in both noiseless and noisy settings. The orthogonality-promoting initialization achieves consistently superior performance over its competing spectral alternatives under both noiseless and noisy Gaussian data. Interestingly, the spectral and truncated spectral schemes exhibit similar performance when becomes sufficiently large (e.g., in the noiseless setup or in the noisy one). This confirms that the truncation helps only if is relatively small. Indeed, the truncation discards measurements of excessively large or small sizes emerging from the heavy tails of the data distribution. Hence, its advantage over the non-truncated spectral initialization diminishes as the number of measurements increases, which gradually straightens out the heavy tails.

## Iii Main Results

The TAF algorithm is summarized in Algorithm 1. Default values are set for pertinent algorithmic parameters. Assuming independent data samples drawn from the noiseless real-valued Gaussian model, the following result establishes the theoretical performance of TAF.

###### Theorem 1 (Exact recovery).

Let be an arbitrary signal vector, and consider (noise-free) measurements , in which , . Then with probability at least for some universal constant , the initialization returned by the orthogonality-promoting method in Algorithm 1 satisfies

 dist(z0,x)≤ρ∥x∥\vspace−.em (24)

with (or any sufficiently small positive constant), provided that for some numerical constants , and sufficiently large . Furthermore, choosing a constant step size along with a truncation level , and starting from any initial guess satisfying (24), successive estimates of the TAF solver (tabulated in Algorithm 1) obey

 dist(zt,x)≤ρ(1−ν)t∥x∥,t=0,1,2,…\vspace−.em (25)

for some , which holds with probability exceeding .

22footnotetext: The symbol is the ceiling operation returning the smallest integer greater than or equal to the given number.

Typical parameter values for TAF in Algorithm 1 are , and . The proof of Theorem 1 is relegated to Section V. Theorem 1 asserts that: i) TAF reconstructs the solution exactly as soon as the number of equations is about the number of unknowns, which is theoretically order optimal. Our numerical tests demonstrate that for the real-valued Gaussian model, TAF achieves a success rate of when is as small as , which is slightly larger than the information limit of (Recall that is necessary for the uniqueness.) This is a significant reduction in the sample complexity ratio, which is for TWF and for WF. Surprisingly, TAF also enjoys a success rate of over when is the information limit , which has not yet been presented for any existing algorithms. See further discussion in Section IV; and, ii) TAF converges exponentially fast with convergence rate independent of the dimension . Specifically, TAF requires at most iterations to achieve any given solution accuracy (a.k.a., ), with iteration cost . Since the truncation takes time on the order of , the computational burden of TAF per iteration is dominated by the evaluation of the gradient components. The latter involves two matrix-vector multiplications that are computable in flops, namely, yields , and the gradient, where . Hence, the total running time of TAF is , which is proportional to the time taken to read the data .

In the noisy setting, TAF is stable under additive noise. To be more specific, consider the amplitude-based data model . It can be shown that the truncated amplitude flow estimates in Algorithm 1 satisfy

 dist(zt,x)≲(1−ν)t∥x∥+1√m∥η∥,t=0,1,… (26)

with high probability for all , provided that for sufficiently large and the noise is bounded with , where , and are some universal constants. The proof can be directly adapted from those of Theorem 1 above and Theorem 2 in [6].

## Iv Simulated Tests

In this section, we provide additional numerical tests evaluating performance of the proposed scheme relative to (T)WF 333Matlab codes directly downloaded from the authors’ websites: http://statweb.stanford.edu/~candes/TWF/algorithm.html; http://www-bcf.usc.edu/~soltanol/WFcode.html. and AF. The initial estimate was found based on power iterations, and was subsequently refined by gradient-type iterations in each scheme. The Matlab implementations of TAF are available at https://gangumn.github.io/TAF/ for reproducibility.

Left panel in Fig. 5 presents the average relative error of three initialization methods on a series of noiseless/noisy real-valued Gaussian problems with fixed, and varying from to , while those for the corresponding complex-valued Gaussian instances are shown in the right panel. Clearly, the proposed initialization method returns more accurate and robust estimates than the spectral ones. Under the same condition for the real-valued Gaussian model, Fig. 6 compares the initialization implemented in Algorithm 1 obtained by solving the maximum eigenvalue problem in (19) with the one obtained by tackling the minimum eigenvalue problem in (16) via the Lanczos method [60]. When the number of equations is relatively small (less than about ), the former performs better than the latter. Interestingly though, the latter works remarkably well and almost halves the error incurred by the implemented initialization of Alg