# Sparse Phase Retrieval via Truncated Amplitude Flow

## Abstract

This paper develops a novel algorithm, termed *SPARse Truncated Amplitude flow* (SPARTA), to reconstruct a sparse signal from a small number of magnitude-only measurements. It deals with what is also known as sparse phase retrieval (PR), which is *NP-hard* in general and emerges in many science and engineering applications. Upon formulating sparse PR as an amplitude-based nonconvex optimization task, SPARTA works iteratively in two stages: In stage one, the support of the underlying sparse signal is recovered using an analytically well-justified rule, and subsequently a sparse orthogonality-promoting initialization is obtained via power iterations restricted on the support;
and, in the second stage, the initialization is successively refined by means of hard thresholding based gradient-type iterations. SPARTA is a simple yet effective, scalable, and fast sparse PR solver. On the theoretical side, for any -dimensional -sparse () signal with minimum (in modulus) nonzero entries on the order of , SPARTA recovers the signal exactly (up to a global unimodular constant) from about random Gaussian measurements with high probability. Furthermore, SPARTA incurs computational complexity on the order of with total runtime proportional to the time required to read the data, which improves upon the state-of-the-art by at least a factor of . Finally, SPARTA is robust against additive noise of bounded support. Extensive numerical tests corroborate markedly improved recovery performance and speedups of SPARTA relative to existing alternatives.

Index terms— Nonconvex optimization, support recovery, iterative hard thresholding, compressive sampling, linear convergence.

## 1 Introduction

In many fields of engineering and applied physics, one is often tasked with reconstructing a signal from the (squared) modulus of its Fourier (or any linear) transform, which is also known as phase retrieval (PR). Such a task arises naturally in applications such as X-ray crystallography, microscopy and ptychography, astronomy, optics, as well as array and coherent diffraction imaging. In these settings, optical sensors and detectors such as charge-coupled device cameras, photosensitive films, and human eyes record only the intensity (squared magnitude) of a light wave, but not the phase. In particular, solution to PR has led to significant accomplishments, including the discovery in of DNA double helical structure from diffraction patterns, and the characterization of aberrations in the Hubble Space Telescope from measured point spread functions [1]. Due to the absence of Fourier phase information, the one-dimensional (D) Fourier PR problem is generally ill-posed. It can be shown that there are in fact exponentially many non-equivalent solutions beyond trivial ambiguities in the D PR case [2]. A common approach to overcome this ill-posedness is exploiting additional information on the unknown signal such as non-negativity, sparsity, or bounded magnitude [3, 4, 5]. Other viable solutions consist of introducing redundancy into the measurement transforming system to obtain over-sampled and short-time Fourier transform (STFT) measurements [6], random Gaussian measurements [7, 8, 9], and coded diffraction patterns using structured illumination and random masks [10, 11, 7], just to name a few; see [10] for contemporary reviews on the theory and practice of PR.

Past PR approaches can be mainly categorized as convex and nonconvex ones. A popular class of nonconvex approaches is based on alternating projections including the seminal works by Gerchberg-Saxton [12] and Fienup [4], [13], [14], alternating minimization with re-sampling (AltMinPhase) [5], (stochastic) truncated amplitude flow (TAF) [15, 9, 16, 17, 18] and the Wirtinger flow (WF) variants [7, 8, 19, 20], trust-region [21], (stochastic) proximal linear algorithms [22, 23]. See also related discussion in [10, 1, 24, 25, 26, 27, 28, 29]. Specifically, the WF variants and the trust-region methods minimize the intensity (modulus squared) based empirical risk, while AltMinPhase and TAF cope with the amplitude-based empirical risk. The convex alternatives either rely on the so-called Shor’s relaxation to obtain semidefinite programming (SDP) based solvers abbreviated as PhaseLift [30] and PhaseCut [31], or solve a basis pursuit problem in the dual domain as in PhaseMax [32, 33, 34].

Nevertheless, in various applications, especially those related to imaging, the underlying signal is naturally sparse or admits a sparse representation after some known and deterministic linear transformation [35].
For example, astronomical imaging centers around sparsely distributed stars, while electron microscopy deals with sparsely distributed atoms or molecules. As PR of sparse signals is of practical relevance, SDP, AltMinPhase, and WF recovery methods have been generalized to sparse PR producing
solvers termed compressive phase retrieval via lifting (CPRL) [36], sparse AltMinPhase [5], thresholded Wirtinger flow (TWF) [37], SparsePhaseMax [38].
CPRL in particular, accounts for the sparsity by adding an -regularization term on the wanted signal to the original PhaseLift formulation.
The other two approaches are two-stage iterative counterparts consisting of a (sparse) initialization, and a series of refinements of the initialization with gradient-type iterations.
The greedy sparse phase retrieval (GESPAR) algorithm is based on a fast -opt local search [3]. A probabilistic approach is developed based on the generalized approximate message passing (GAMP) algorithm [39]. Majorization-minimization algorithms are devised in [40].
Assuming noise-free Gaussian random measurements, CPRL recovers any -sparse -dimensional ()
signal exactly from ^{1}

Building on TWF and TAF, we propose here a novel sparse PR algorithm, which we call
*SPARse Truncated Amplitude flow* (SPARTA). Adopting an amplitude-based nonconvex formulation of the sparse PR, SPARTA emerges as a two-stage iterative solver: In stage one, the support of the underlying signal is estimated first using a well-justified rule,
and subsequently power iterations are employed to obtain an initialization restricted on the recovered support;
while the second stage successively refines the initialization with a series of
hard thresholding based truncated gradient iterations.
Both stages are conceptually simple, scalable, and fast.
Moreover, we demonstrate that SPARTA recovers any -sparse -dimensional real-/complex-valued signal
() with minimum nonzero entries (in modulus) on the order of
from measurements.
Further, to reach any given solution accuracy ,
SPARTA incurs total computational cost of , which improves upon the state-of-the-art by at least a factor of . This computational advantage is paramount in large-scale imaging applications, where the basis factor is large,
typically on the order of millions.
In addition, SPARTA can be shown robust to additive noise of bounded support. Extensive simulated tests demonstrate markedly improved exact recovery performance (in the absence of noise), robustness to noise, and runtime speedups relative to the state-of-the-art algorithms.

The remainder of this paper is organized as follows. Section 2 reviews the sparse PR problem, and also presents known necessary and sufficient conditions for uniqueness. Section 3 details the two stages of the proposed algorithm, whose analytic performance analysis is the subject of Section 4. Finally, numerical tests are reported in Section 5, proof details are given in Section 6, and conclusions are drawn in Section 7. Supporting lemmas are presented in the Appendix.

Regarding common notation used throughout the paper, lower- (upper-) case boldface letters denote column vectors (matrices) of suitable dimensions, and symbol () as superscript stands for matrix/vector transposition (conjugate transposition). Calligraphic letters are reserved for sets, e.g., . For vectors, represents the Euclidean norm, while denotes the pseudo-norm counting the number of nonzero entries. Finally, the ceiling operation returns the smallest integer greater than or equal to the given number, and the cardinality reports the number of elements in the set .

## 2 Sparse Phase Retrieval

Succinctly stated, the sparse PR task amounts to reconstructing a sparse (or ) given a system of phaseless quadratic equations taking the form [42]

(1) |

where are the observed modulus data, and are known sensing (feature) vectors.
The sparsity level is assumed known *a priori* for theoretical analysis purposes,
while numerical implementations with unknown values will be tested as well.
Alternatively, the data can be given in modulus squared (i.e., intensity) form as .
It has been established that generic ^{2}

For concreteness of our analytical results, the present paper focuses on the real-valued Gaussian model, which assumes independently and identically distributed (i.i.d.) standard Gaussian sensing vectors , , and . Nevertheless, our proposed algorithm works also for the complex-valued Gaussian model with and i.i.d. . Given and assuming also the existence of a unique -sparse solution (up to a global sign), our objective is to develop simple yet effective algorithms to provably reconstruct any -sparse -dimensional signal from a small number (far less than ) of phaseless quadratic equations as in (1).

Adopting the least-squares criterion (which coincides with the maximum likelihood one when assuming additive white Gaussian noise in (1)), the problem of recovering a -sparse solution from phaseless quadratic equations naturally boils down to that of minimizing the ensuing amplitude-based empirical loss function

(2) |

Clearly,
both the objective function and the -norm constraint in (2)
are nonconvex, which render the optimization problem *NP-hard* in general [47], and thus computationally intractable. Besides nonconvexity, another notable challenge here involves the non-smoothness of the cost function.
It is worth emphasizing that (thresholded) Wirtinger alternatives dealt with the smooth counterpart of (2) based on squared magnitudes ,
which was numerically and experimentally shown to be less effective than the amplitude-based one even when no sparsity is exploited [9, 48].
Although focusing on a formulation similar to (but different than) (2),
sparse AltMinPhase first estimates the support of the underlying signal, and performs standard PR of signals with dimension . More importantly, sparse AltMinPhase relying on alternating minimization with re-sampling entails solving a series of least-squares problems, and performs matrix inversion at every iteration. Numerical tests suggest that a very large number of measurements are
required to estimate the support exactly. Once wrong, sparse AltMinPhase confining the PR task on the estimated support would be impossible to recover the underlying sparse signal. On the other hand, motivated by the iterative hard thresholding (IHT) algorithms for compressive sensing [49, 50],
an adaptive hard thresholding procedure that maintains only certain largest entries per iteration during the gradient refinement stage turns out to be effective [37].
Yet both sparse AltMinPhase and TWF were based on the simple spectral initialization, which was recently shown to be less accurate and robust than the orthogonality-promoting initialization [9].

Broadening the TAF approach and the sparse PR solver TWF, the present paper puts forth a novel iterative solver for (2) that proceeds in two stages: S1) a sparse orthogonality-promoting initialization is obtained by solving a PCA-type problem with a few simple power iterations on an estimated support of the underlying sparse signal; and, S2) successive refinements of the initialization are effected by means of a series of truncated gradient iterations along with a hard thresholding per iteration to set all entries to zero, except for the ones of largest magnitudes. The two stages are presented in order next.

## 3 Algorithm: Sparse Truncated Amplitude Flow

In this section, the initialization stage and the gradient refinement stage of SPARTA will be described in detail. To begin, let us introduce the distance from any estimate to the solution set to be . Define also the indistinguishable global phase constant in the real case as

(3) |

Hereafter, assume to be the fixed solution to problem (1) with ; otherwise, one can replace by , but the constant phase shift shall be dropped for notational brevity. Assume also without loss of generality that , which will be justified and generalized shortly.

### 3.1 Sparse Orthogonality-promoting Initialization

When no sparsity is exploited, the orthogonality-promoting initialization proposed in [9] starts with a popular folklore in stochastic geometry: High-dimensional random vectors are almost always nearly orthogonal to each other [51]. The key idea is approximating the unknown by another vector that is most orthogonal to a carefully chosen subset of sensing vectors , where is some index set to be designed next. It is well known that the orthogonality between two vectors can be interpreted by their squared normalized inner-product . Intuitively, the smaller the squared normalized inner-product between two vectors and is, the more orthogonal they are to each other. Upon evaluating the inner-product between each and for all pairs , one can construct to include the indices of ’s corresponding to the -smallest squared normalized inner-products with . Therefore, it is natural to approximate by computing a vector most orthogonal to the set of sensing vectors [9]. Mathematically, this is equivalent to solving a smallest eigenvector (defined to be the eigenvector associated with the smallest eigenvalue of a symmetric positive definite matrix) problem

(4) |

The smallest eigenvalue (eigenvector) problem can be solved by fully eigen-decomposing the matrix at computational complexity (assuming to be on the order of ). Upon defining to be the complement of the set in , one can rewrite Recall that for i.i.d. standard Gaussian sensing vectors , the following concentration result holds [52]

(5) |

where denotes the expected value. It follows from (5) that the smallest eigenvector problem in (4) can be approximated by the largest (principal) eigenvector

(6) |

whose solution can be well approximated with a few (e.g., ) power iterations at a much cheaper computational complexity [than required for solving (4)]. When is unknown, from (6) can be scaled by the norm estimate of to obtain [7, 9]. If is large enough, it has been shown that the orthogonality-promoting initialization can produce an estimate of any given constant relative error [9].

When is *a priori* known to be -sparse with , one may expect to recover from a significantly smaller number () of measurements. The orthogonality-promoting initialization (and spectral based alternatives) requiring to be on the order of would fail in the case of PR for sparse signals given a small number of measurements [5, 7, 8, 9, 19]. By accounting for the sparsity prior information with the regularization, the same rationale as the orthogonality-promoting initialization in (4) would lead to

(7) |

The problem at hand is *NP-hard* in general due to the combinatorial constraint. Additionally, it can not be readily converted to a (sparse) PCA problem since the number of data samples available is much smaller than the signal dimension , thus hardly validating the non-asymptotic result in (5). Although at much higher computational complexity than power iterations, semidefinite relaxation could be applied [53]. Instead of coping with (7) directly, we shall take another route and develop our sparse orthogonality-promoting initialization approach to obtain a meaningful sparse initialization from the given limited number of measurements.

#### Exact support recovery

Along the lines of sparse AltMinPhase and sparse PCA [54], our approach is to first estimate the support of the underlying signal based on a carefully-designed rule; next, we will rely on power iterations to solve (6) restricted on the estimated support, thus ensuring a -sparse estimate ; and, subsequently we will scale by the norm estimate to yield a -sparse orthogonality-promoting initialization .

Starting with the support recovery procedure, assume without loss of generality that is supported on with . Consider the random variables , . Recalling that for standardized Gaussian variables, we have , , the rotational invariance property of Gaussian distributions confirms for all that

(8) |

where is obtained by deleting the -th entry from ; and likewise for . If , then yielding in (3.1.1). If on the other hand , it holds that , which leads to . It is now clear that there is a separation of in the expected values of for and . As long as the gap is sufficiently large, the support set can be recovered exactly in this way. Specifically, when all values are available, the set of indices corresponding to the -largest values recover exactly the support of . In practice, are not available. One has solely access to a number of their independent realizations. Appealing to the strong law of large numbers, the sample average approaches the ensemble one, namely, as increases. Hence, the support can be estimated as

(9) |

which will be shown to recover exactly with high probability provided that measurements are taken and the minimum nonzero entry is on the order of . The latter is postulated to guarantee such a separation between quantities having their indices belonging or not belonging to the support set. It is worth stressing that when , hence largely reducing the sampling size and also the computational complexity.

#### Orthogonality-promoting intialization

When the estimated support in (9) turns out to be exact, i.e., , one can rewrite , , where includes the -th entry of if and only if ; and likewise for . Instead of seeking directly an -dimensional initialization as in (7), one can apply the orthogonality-promoting initialization steps in (4)-(6) on the dimensionality reduced data to produce a -dimensional vector

(10) |

and subsequently reconstruct a -sparse -dimensional initialization by zero-padding at entries with indices not belonging to . Similarly, in the case of , in (10) is rescaled by the norm estimate of to obtain . We also note that our proposed algorithm can recover the underlying sparse signal when , as long as is sufficiently close to regardless of support mismatch, which is described further in Lemma 3.

### 3.2 Thresholded Truncated Gradient Stage

Upon obtaining a sparse orthogonality-promoting initialization , our approach to solving (2) boils down to iteratively refining by means of a series of -sparse hard thresholding based truncated gradient iterations, namely,

(11) |

where is the iteration index, a constant step size, and denotes a -sparse hard thresholding operation that sets all entries in to zero except for the entries of largest magnitudes. If there are multiple such sets comprising the -largest entries, a set can be chosen either randomly or according to a predefined ordering of the elements. Similar to [9], the truncated (generalized) gradient is

(12) |

where the index set is defined to be

(13) |

for some to be determined shortly, where are the given modulus data.

It is clear now that the difficulty of minimizing our nonconvex objective function reduces to that of correctly estimating the signs of by at each iteration. The truncation rule in (13) was shown capable of eliminating most “bad” gradient components involving erroneously estimated signs, i.e., . This rule improved performance of TAF [9] considerably. Recall that our objective function in (2) is also non-smooth at points obeying . Evidently, the gradient regularization rule in (13) keeps only the gradients of component functions (i.e., the summands in (2)) that bear large enough values; this rule thus maintains away from and protects the cost function in (2) from being non-smooth at points satisfying . As a consequence, the (truncated) generalized gradient employed in (12) reduces to the (truncated) gradient at such points, which also simplifies theoretical convergence analysis.

## 4 Main Results

The proposed sparse phase retrieval solver is summarized in Algorithm 1 along with default parameter values. Given data samples generated from i.i.d. sensing vectors, the following result establishes the statistical convergence rate for the proposed SPARTA algorithm in the case of .

###### Theorem 1 (Exact recovery).

Fix to be any -sparse () vector of the minimum nonzero entry on the order of , namely, for some number . Consider the noiseless measurements from i.i.d. , . If , Step 3 of SPARTA (tabulated in Algorithm 1) recovers the support of exactly with probability at least . Furthermore, there exist numerical constants such that with a fixed step size , and a truncation threshold , successive estimates of SPARTA obey

(14) |

which holds with probability exceeding provided that . Here, , and are some numerical constants.

Proof of Theorem 1 is deferred to Section 6 with supporting lemmas presented in the Appendix. We typically take parameters , and , which will also be validated by our analytical results on the feasible region of the step size. The constant depends on , on and , and and rely on both and . In the case of PR of unstructured signals, existing algorithms such as TAF ensures exact recovery when the number of measurements is about the number of unknowns , i.e., . Hence, it would be more meaningful to study the sample complexity bound for PR of sparse signals when . To this end, the sample complexity bound in Theorem 1 can often be rewritten as for some constant and large enough . Regarding Theorem 1, three observations are in order.

###### Remark 1.

###### Remark 2.

SPARTA converges at a linear rate to the globally optimal solution with convergence rate independent of the signal dimension . In other words, for any given solution accuracy , after running at most SPARTA iterations (11), the returned estimate is at most away from the global solution .

###### Remark 3.

SPARTA enjoys a low computational complexity of , and incurs a total runtime of to produce an -accurate solution. The runtime is proportional to the time taken to read the data . To see this, recall that the support recovery incurs computational complexity , power iterations incur complexity , and thresholded truncated gradient iterations have complexity ; hence, leading to a total complexity on the order of . Given the linear convergence rate, SPARTA takes a total runtime of to achieve any fixed solution accuracy .

Besides exact recovery guarantees in the case of noiseless measurements, it is worth mentioning that SPARTA exhibits robustness to additive noise, especially when the noise has bounded values. Numerical results using SPARTA for noisy sparse PR will be presented in the ensuing section.

## 5 Numerical Experiments

Simulated tests evaluating performance of SPARTA relative to truncated amplitude flow (TAF) [9] (which does not exploit the sparsity) and thresholded Wirtinger flow (TWF) [37] are presented in this section. For fair comparisons, the algorithmic parameters involved in all schemes were set to their suggested values. The initialization in each scheme was obtained based upon power iterations, and was subsequently refined by gradient iterations. In all reported experiments, the true -sparse signal vector or was generated first using or , followed by setting of its entries to zero uniformly at random. For reproducibility, the Matlab implementation of SPARTA is publicly available at https://gangwg.github.io/SPARTA/.

The first experiment evaluates the exact recovery performance of various approaches in terms of the empirical success rate over independent Monte Carlo trials, where the true signals are real-valued. A success is declared for a trial provided that the returned estimate incurs a relative mean-square error defined as

less than . We fixed the signal dimension to , and the sparsity level at , while the number of measurements increases from to by . Curves in Fig. 1 clearly demonstrate markedly improved performance of SPARTA over state-of-the-art alternatives. Even when the exact number of nonzero elements in , namely, is unknown, setting in Algorithm 1 as an upper limit on the theoretically affordable sparsity level (e.g., when is about according to Theorem 1) works well too (see the magenta curve, denoted SPARTA0). Comparison between TAF and SPARTA shows the advantage of exploiting sparsity in sparse PR settings.

The second experiment examines how SPARTA recovers real-valued signals of various sparsity levels given a fixed number of measurements. Figure 2 depicts the empirical success rate versus the sparsity level , where equals the exact number of nonzero entries in . The results suggest that with a total of phaseless quadratic equations, TAF representing the state-of-the-art for PR of unstructured signals fails, as shown by the blue curve. Although TWF works in some cases, SPARTA significantly outperforms TWF, and it ensures exact recovery of sparse signals with up to about nonzero entries (due to existence of polylog factors in the sample complexity), hence justifying our analytical results.

The next experiment validates the robustness of SPARTA against additive noise present in the data. Postulating the noisy Gaussian data model [5], we generated i.i.d. Gaussian noise according to , . From Fig. 1, it is clear that to achieve exact recovery, SPARTA requires about measurements, TAF about measurements, and TWF much more than . In this case, parameters were taken as , , and , with the number of measurements large enough to guarantee that TWF and TAF also work. It is worth mentioning that SPARTA can work with a far smaller number of measurements than . As seen from the plots, SPARTA performs only a few gradient iterations to achieve the most accurate solution among the three approaches, while its competing TAF and TWF require nearly an order more number of iterations to converge to less accurate estimates.

To demonstrate the stability of SPARTA in the presence of additive noise, the relative MSE is plotted as a function of the signal-to-noise (SNR) values in dB. Our experiments are based on the additive Gaussian noise model with a -sparse signal and the noise , where the variance is chosen such that certain values are achieved. The ratio takes values , and the SNR in dB is varied from dB to dB. Averaging over Monte Carlo realizations, Fig. 4 demonstrates that the relative MSE for all values scales inversely proportional to SNR, hence corroborating the stability of SPARTA in the presence of additive noise.

The last experiment tested the efficacy of SPARTA in the complex-valued setting, where the underlying -sparse signal was generated using , and the design vectors for . The relative MSE versus iteration count was plotted in Fig. 5, which validates the scalability and effectiveness of SPARTA in recovering complex signals. In terms of runtime, SPARTA recovers exactly a -dimensional complex-valued signal from magnitude-only measurements in a few seconds.

Regarding computation times, SPARTA converges much faster (both in time and in the number of iterations required to achieve certain solution accuracy) than TWF and TAF in all reported experiments. All numerical experiments were implemented with MATLAB Ra on an Intel CPU @ GHz ( GB RAM) computer.

## 6 Proof of Theorem 1

The proof of Theorem 1 will be provided in this section. To that end, we will first evaluate the performance of our sparse orthogonality-promoting initialization. The following result demonstrates that if the number of measurements is sufficiently large (on the order of within polylog factors), Step 3 of the SPARTA algorithm 1 reconstructs the support of exactly with high probability.

###### Lemma 1.

###### Proof of Lemma 1.

As elaborated in Section 3.1, there is a clear separation in the expected values for and ; that is,

(15) |

Consider the case of first. Based on with being a positive integer and the symbol denoting the double factorial, has second-order moment

(16) |

where is some index from different than . Letting for all , it holds that . Furthermore, one has , and

Appealing to Lemma 4, one establishes for all that

Taking leads to

Recalling our assumption that is on the order of , i.e., for certain constant , the following holds with probability at least for all

(17) |

provided that for some absolute constant .

Now let us turn to the case of , in which is a weighted sum of random variables. According to Lemma 5, it holds that

(18) |

In addition, for any constants , Chebyshev’s inequality together with the union bound confirms that

(19a) | |||