Convolutional Phase Retrieval via Gradient Descent
We study the convolutional phase retrieval problem, of recovering an unknown signal from measurements consisting of the magnitude of its cyclic convolution with a given kernel . This model is motivated by applications such as channel estimation, optics, and underwater acoustic communication, where the signal of interest is acted on by a given channel/filter, and phase information is difficult or impossible to acquire. We show that when is random and the number of observations is sufficiently large, with high probability can be efficiently recovered up to a global phase shift using a combination of spectral initialization and generalized gradient descent. The main challenge is coping with dependencies in the measurement operator. We overcome this challenge by using ideas from decoupling theory, suprema of chaos processes and the restricted isometry property of random circulant matrices, and recent analysis of alternating minimization methods.
We consider the problem of convolutional phase retrieval where our goal is to recover an unknown signal from the magnitude of its cyclic convolution with a given filter . Specifically, the measurements take the form
where is cyclic convolution modulo and denotes entrywise absolute value. This problem can be rewritten in the common matrix-vector form111The convolution can be written in matrix-vector form as , where denotes the circulant matrix generated by and is a zero padding operator. In other words, with being a matrix formed by the first columns of .
This problem is motivated by applications in areas such as channel estimation [Walk:asilomar2015], noncoherent optical communication [gagliardi1976optical], and underwater acoustic communication [stojanovic1994phase]. For example, in millimeter-wave (mm-wave) wireless communications for 5G networks [shahmansoori20155g], one important problem is to estimate the angle of arrival (AoA) of a signal from measurements taken by the convolution of an antenna pattern and AoAs. As the phase measurements are often very noisy, unreliable, and expensive to acquire, it may be preferred to only take measurements of signal magnitude in which case the phase information is lost.
Most known results on the exact solution of phase retrieval problems [candes2013phaselift, soltanolkotabi2014algorithms, chen2015solving, wang2016solving, waldspurger2015phase, waldspurger2016phase] pertain to generic random matrices, where the entries of are independent subgaussian random variables. We term problem (1.2) with a generic sensing matrix as generalized phase retrieval222We use this term to make distinctions from Fourier phase retrieval, where the matrix is an oversampled DFT matrix. Here, generalized phase retrieval refers to the problem with any generic measurement other than Fourier. . However, in practice it is difficult to implement purely random measurement matrices. In most applications, the measurement is much more structured – the convolutional model studied here is one such structured measurement operator. Moreover, structured measurements often admit more efficient numerical methods: by using the fast Fourier transform for matrix-vector products, the benign structure of the convolutional model (1.1) allows to design methods with memory and computation cost per iteration. In contrast, for generic measurements, the cost is around .
In this work, we study the convolutional phase retrieval problem (1.1) under the assumption that the kernel is randomly generated from an i.i.d. standard complex Gaussian distribution , i.e.,
Compared to the generalized phase retrieval problem, the random convolution model (1.1) we study here is far more structured: it is parameterized by only independent complex normal random variables, whereas the generic model involves random variables. This extra structure poses significant challenges for analysis: the rows and columns of the sensing matrix are probabilistically dependent, so that classical probability tools (based on concentration of functions of independent random vectors) do not apply.
We propose and analyze a local gradient descent type method, minimizing a weighted, nonconvex and nonsmooth objective
where denotes the Hadamard product. Here, is a weighting vector, which is introduced mainly for analysis purposes. The choice of is discussed in Section 3. Our result can be informally summarized as follows.
Here, denotes the circulant matrix corresponding to cyclic convolution with a length zero padding of , and denotes a polynomial in . Compared to the results of generalized phase retrieval with i.i.d. Gaussian measurement, the sample complexity here has extra dependency on . The operator norm is inhomogeneous over : for a typical333e.g., is drawn uniformly at random from . , is of the order and the sample complexity matches that of the generalized phase retrieval up to factors; the “bad” case is when is sparse in the Fourier domain: and can be as large as .
Our proof is based on ideas from decoupling theory [de1999decoupling], the suprema of chaos processes and restricted isometry property of random circulant matrices [rauhut2010compressive, krahmer2014suprema], and is also inspired by a new iterative analysis of alternating minimization methods [waldspurger2016phase]. Our analysis draws connections between the convergence properties of gradient descent and the classical alternating direction method. This allows us to avoid the need to argue uniform concentration of high-degree polynomials in the structured random matrix , as would be required by a straightforward translation of existing analysis to this new setting. Instead, we control the bulk effect of phase errors uniformly in a neighborhood around the ground truth. This requires us to develop new decoupling and concentration tools for controlling nonlinear phase functions of circulant random matrices, which could be potentially useful for analyzing other random circulant convolution problems, such as sparse blind deconvolution [zhang2017global] and convolutional dictionary learning [heide2015fast].
1.1 Comparison with literature
Prior arts on phase retrieval.
The challenge of developing efficient, guaranteed methods for phase retrieval has attracted substantial interest over the past several decades [shechtman2015phase, jaganathan2015phase]. The problem is motivated by applications such as X-ray crystallography [millane1990phase, robert1993phase], microscopy [jianwei2002high], astronomy [fienup1987phase], diffraction and array imaging [bunk2007diffractive, anwei2011array], optics [walther1963question], and more. The most classical method is the error reduction algorithm derived by Gerchberg and Saxton [gerchberg1972practical], also known as the alternating direction method. This approach has been further improved by the hybrid input-output (HIO) algorithm [fienup1982phase]. For oversampled Fourier measurements, it often works surprisingly well in practice, while its global convergence properties still largely remains as a mystery [pauwels2017fienup].
For the generalized phase retrieval where the sensing matrix is i.i.d. Gaussian, the problem is better-studied: in many cases, when the number of measurements is large enough, the target solution can be exactly recovered by using either convex or nonconvex optimization methods. The first theoretical guarantees for global recovery of generalized phase retrieval with i.i.d. Gaussian measurement are based on convex optimization – the so-called Phaselift/Phasemax methods [candes2013phaselift, candes2013phase-matrix, waldspurger2015phase]. These methods lift the problem to a higher dimension and solve a semi-definite programming (SDP) problem. However, the high computational cost of SDP limits their practicality. Quite recently, [bahmani2016phase, goldstein2016phasemax, hand2016elementary] reveal that the problem can also be solved in the natural parameter space via linear programming.
Recently, nonconvex approaches have led to new computational guarantees for global optimizations of generalized phase retrieval. The first result of this type is due to Netrapalli et al. [netrapalli2013phase], showing that the alternating minimization method provably converges to the truth when initialized using a spectral method and provided with fresh samples at each iteration. Later on, Candès et al. [candes2015phase] showed that with the same initialization, gradient descent for the nonconvex least squares objective,
provably recovers the ground truth, with near-optimal sample complexity . The subsequent work [chen2015solving, zhang2016reshaped, wang2016solving] further reduced the sample complexity to by using different nonconvex objectives and truncation techniques. In particular, recent work by [zhang2016reshaped, wang2016solving] studied a nonsmooth objective that is similar to ours (1.4) with weighting . Compared to the SDP-based techniques, these methods are more scalable and closer to the approaches used in practice. Moreover, Sun et. al. [sun2016geometric] reveal that the nonconvex objective (1.5) actually has a benign global geometry: with high probability, it has no bad critical points with samples444[soltanolkotabi2017] further tightened the sample complexity to by using more advanced probability tools.. Such a result enables initialization-free nonconvex recovery555For convolutional phase retrieval, it would be nicer to characterize the global geometry of the problem as in [ge2015escaping, sun2015nonconvex, sun2016geometric, sun2016complete_a, sun2016complete_b]. However, the inhomogeneity of over causes tremendous difficulties for concentration with samples. [chen2018gradient, gilboa2018efficient].
Structured random measurements.
The study of structured random measurements in signal processing has quite a long history [krahmer2014structured]. For compressed sensing [candes2006robust], the work [candes2006stable, candes2006near, eldar2012compressed] studied random Fourier measurements, and later [rauhut2010compressive, krahmer2014suprema] proved similar results for partial random convolution measurements. However, the study of structured random measurements for phase retrieval is still quite limited. In particular, [david2013partial] and [candes2015diffraction] studied t-designs and coded diffraction patterns (i.e., random masked Fourier measurements) using semidefinite programming. Recent work studied nonconvex optimization using coded diffraction patterns [candes2015phase] and STFT measurements [bendory2016non], both of which minimize a nonconvex objective similar to (1.5). These measurement models are motivated by different applications. For instance, coded diffraction is designed for imaging applications such as X-ray diffraction imaging, STFT can be applied to frequency resolved optical gating [tsang1996frequency] and some speech processing tasks [lim1979enhancement]. Both of the results show iterative contraction in a region that is at most -close to the optimum. Unfortunately, for both results either the radius of the contraction region is not large enough for initialization to reach, or they require extra artificial technique such as resampling the data. In comparison, the contraction region we show for the random convolutional model is larger , which is achievable in the initialization stage via the spectral method. For a more detailed review of this subject, we refer the readers to Section 4 of [krahmer2014structured].
The convolutional measurement can also be reviewed as a single masked coded diffraction patterns [candes2015diffraction, candes2015phase], since , where is the Fourier transform of and is the oversampled Fourier transform of . The sample complexity for coded diffraction patterns in [candes2015phase] suggests that the dependence of our sample complexity on for convolutional phase retrieval might not be necessary and can be improved in the future. On the other hand, our results suggest that the contraction region is larger than for coded diffraction patterns, and resampling for initialization might not be necessary.
1.2 Notations, Wirtinger Calculus, and Organizations
We use to denote a circulant matrix generated from , i.e.,
where denotes a circulant shift by samples. We use to represent the unit complex sphere in , and let be the standard complex Gaussian distribution as introduced in (1.3). We use and to denote the real and Hermitian transpose of a vector or matrix, respectively, and use and to denote the real and imaginary parts of a complex variable, respectively. We use to denote the independence of two random variables . Given a matrix , and are its column and row space. For any vector , we define
to be the projection onto the span of and its orthogonal complement, respectively. We use and to denote the Frobenius norm and spectral norm of a matrix, respectively. For a random variable , its norm is defined as for any positive . For a smooth function , its norm is defined as . For an arbitrary set , we use to denote the cardinality of , and use to denote the support set of . If is the indicator function of the set , then
where is the th coordinate of a given vector. If , we use to denote a mapping that maps a vector into its coordinates restricted to the set . Let denote a unnormalized Fourier matrix with , and let be an oversampled Fourier matrix. For all theorems and proofs, we use and to denote positive numerical constants.
Consider a real-valued function . The function is not holomorphic, so that it is not complex differentiable unless it is constant [ken2009complex]. However, if one identifies with and treats as a function in the real domain, can be differentiable in the real sense. Doing calculus for directly in the real domain tends to produce cumbersome expressions. A more elegant way is adopting the Wirtinger calculus [wirtinger1927formalen], which can be considered as a neat way of organizing the real partial derivatives (see also [soltanolkotabi2014algorithms] and Section 1 of [sun2016geometric]). The Wirtinger derivatives can be defined formally as
Basically it says that when evaluating , one just writes in the pair of , and conducts the calculus by treating as if it was a constant. We compute in a similar fashion. To evaluate the individual partial derivatives, such as , all the usual rules of calculus apply. For more details on Wirtinger calculus, we refer interested readers to [ken2009complex].
The rest of the paper is organized as follows. In Section 2, we introduce the basic formulation of the problem and the proposed algorithm. In Section 3, we present the main results and a sketch of the proof; detailed analysis is postponed to Section 6. In Section 4, we corroborate our analysis with numerical experiments. We discuss the potential impacts of our work in Section 5. Finally, all the basic probability tools that are used in this paper are described in the appendices.
2 Nonconvex Optimization via Gradient Descent
In this work, we develop an approach to convolutional phase retrieval based on local nonconvex optimization. Our proposed algorithm has two components: (1) a careful data-driven initialization using a spectral method; (2) local refinement by gradient descent. We introduce the two steps below.
2.1 Minimization of a nonconvex and nonsmooth objective
We consider minimizing a weighted nonconvex and nonsmooth objective introduced in (1.4). The adoption of the positive weights facilitates our analysis, by enabling us to compare certain functions of the dependent random matrix to functions involving more independent random variables. We will substantiate this claim in the next section. As aforementioned, we consider the generalized Wirtinger gradient of (1.4),
Here, because of the nonsmoothness of (1.4), is not differentiable everywhere even in the real sense. To deal with this issue, we specify
for any complex number and . Starting from some initialization , we minimize the objective (1.4) by generalized gradient descent
where is the stepsize. Indeed, can be interpreted as the subgradient of in the real case; this method can be seen as a variant of amplitude flow [wang2016solving].
2.2 Initialization via spectral method
Similar to [netrapalli2013phase, soltanolkotabi2014algorithms], we compute the initialization via a spectral method, detailed in Algorithm 1. More specifically, is a scaled version of the leading eigenvector of the following matrix
which is constructed from the knowledge of the sensing vectors and observations. The leading eigenvector of can be efficiently computed via the power method. Note that , so the leading eigenvector of is proportional to the target solution . Under the random convolutional model of , by using probability tools from [krahmer2014structured], we show that concentrates to its expectation for all whenever , ensuring that the initialization is close to the optimal set . It should be noted that several variants of this initialization approach in Algorithm 1 have been introduced in the literature. They improve upon the factors of sample complexity for generalized phase retrieval with i.i.d. measurements. Those methods include the truncated spectral method [chen2015solving], null initialization [chen2015phase] and orthogonality-promoting initialization [wang2016solving]. For the simplicity of analysis, here we only consider Algorithm 1 for the convolutional model.
3 Main Result and Sketch of Analysis
In this section, we introduce our main theoretical result, and sketch the basic ideas behind the analysis. Without loss of generality, we assume the ground truth signal to be . Because the problem can only be solved up to a global phase shift, we define the optimal solution set as , and correspondingly define
which measures the distance from a point to the optimal set .
3.1 Main Result
Suppose the weighting vector in (1.4), where
with . Our main theoretical result shows that with high probability, the generalized gradient descent (2.1) with spectral initialization converges linearly to the optimal set .
Theorem 3.1 (Main Result)
Our result shows that by initializing the problem -close to the optimum via the spectral method, the gradient descent (2.1) converges linearly to the optimal solution. As we can see, the sample complexity here also depends on , which is quite different from the i.i.d. case. For a typical (e.g., is drawn uniformly random from ), is on the order of , and the sample complexity matches the i.i.d. case up to log factors. However, is nonhomogeneous over : if is sparse in the Fourier domain (e.g., ), the sample complexity can be as large as . Such a behavior is also demonstrated in the experiments of Section 4. We believe the (very large!) number of logarithms in our result is an artifact of our analysis, rather than a limitation of the method. We expect to reduce the sample complexity to by a tighter analysis, which is left for future work. The choices of the weighting in (3.1), , and the stepsize are purely for the purpose of analysis. In practice, the algorithm converges with and a choice of small stepsize , or by using backtracking linesearch for the stepsize .
3.2 A Sketch of the Analysis
In this subsection, we briefly highlight some major challenges and new ideas behind the analysis. All the detailed proofs are postponed to Section 6. The core idea behind the analysis is to show that the iterate contracts once we initialize close enough to the optimum. In the following, we first describe the basic ideas of proving iterative contraction, which critically depends on bounding a certain nonlinear function of a random circulant matrix. We sketch the core ideas of how to bound such a complicated term via the decoupling technique.
3.2.1 Proof sketch of iterative contraction
Our iterative analysis is inspired by the recent analysis of alternating direction method (ADM) [waldspurger2016phase]. In the following, we draw connections between the gradient descent method (2.1) and ADM, and sketch the basic ideas of convergence analysis.
ADM is a classical method for solving phase retrieval problems [gerchberg1972practical, netrapalli2013phase, waldspurger2016phase], which can be considered as a heuristic method for solving the following nonconvex problem
At every iterate , ADM proceeds in two steps:
which leads to the following update
where is the pseudo-inverse of . Let . The distance between and is bounded by
Gradient descent with .
For simplicity and illustration purposes, let us first consider the gradient descent update (2.1) with . Let , with stepsize . The distance between the iterate and the optimal set is bounded by
Towards iterative contraction.
By measure concentration, it can be shown that
for some constant sufficiently small, where such that . By borrowing ideas from controlling (3.6) in the ADM method [waldspurger2016phase], this observation provides a new way of analyzing the gradient descent method. As an attempt to show (3.6) for the random circulant matrix , we invoke Lemma A.1 in the appendix, which controls the error in a first order approximation to . Let us decompose
where with , and . Notice that , so that by Lemma A.1, for any we have
The first term is relatively much smaller than , which can be bounded by a small numerical constant using the restricted isometry property of a random circulant matrix [krahmer2014suprema], together with some auxiliary analysis. The detailed analysis is provided in Section 6.4. The second term involves a nonlinear function of the random circulant matrix . Controlling this nonlinear, highly dependent random process for all is a nontrivial task. In the next subsection, we explain why bounding is technically challenging, and describe the key ideas on how to control a smoothed variant of , by using the weighting introduced in (3.1). We also provide intuitions for why the weighting is helpful.
3.2.2 Controlling a smoothed variant of the phase term
As elaborated above, the major challenge of showing iterative contraction is bounding the suprema of the nonlinear, dependent random process over the set
By using the fact that for any , we have
for some constant .
Let () be a row vector of , then the term
is a summation of dependent random variables. To address this problem, we deploy ideas from decoupling [de1999decoupling]. Informally, decoupling allows us to compare moments of random functions to functions of more independent random variables, which are usually easier to analyze. The book [de1999decoupling] provides a beautiful introduction to this area. In our problem, notice that the random vector occurs twice in the definition of – one in the phase term , and another in the quadratic term. The general spirit of decoupling is to seek to replace one of these copies of with an independent copy of the same random vector, yielding a random process with fewer dependencies. Here, we seek to replace with
The utility of this new, decoupled form of is that it introduces extra randomness — is now a chaos process of conditioned on . This makes analyzing amenable to existing analysis of suprema of chaos processes for random circulant matrices [krahmer2014structured]. However, achieving the decoupling requires additional work; the most general existing results on decoupling pertain to tetrahedral polynomials, which are polynomials with no monomials involving any power larger than one of any random variable. By appropriately tracking cross terms, these results can also be applied to more general (non-tetrahedral) polynomials in Gaussian random variables [kwapien1987decoupling]. However, our random process involves a nonlinear phase term which is not a polynomial, and hence is not amenable to a direct appeal to existing results.
Decoupling is “recoupling”.
Existing results [kwapien1987decoupling] for decoupling polynomials of Gaussian random variables are derived from two simple facts:
orthogonal projections of Gaussian variables are independent666If two random variables are jointly Gaussian, they are statistically independent if and only if they are uncorrelated.;
For the random vector , let us introduce an independent copy . Write
Because of Footnote 6, and are two independent vectors. Now, by taking conditional expectation with respect to , we have
Thus, we can see that the key idea of decoupling into , is essentially “recoupling” via conditional expectation – the “recoupled” term can be reviewed as an approximation of . Notice that by Fact 2, Jensen’s inequality, for any convex function ,
Thus, by choosing appropriately, i.e., as , we can control all the moments of via
This type of inequality is very useful because it relates the moments of to those of . As discussed previously, is a chaos process of conditioned on . Its moments can be bounded using existing results [krahmer2014suprema].
If was a tetrahedral polynomial, then we have , i.e., the approximation is exact. As the tail bound of can be controlled via its moments bounds [foucart2013mathematical, Chapter 7.2], this allows us to directly control the object of interest . The reason of achieving this bound is because the conditional expectation operator “recouples” back to the target . In other words, (Gaussian) decoupling is recoupling.
“Recoupling” is Gaussian smoothing.
In convolutional phase retrieval, a distinctive feature of the term is that is a phase function and therefore is not a polynomial. Hence, it may be challenging to posit a which “recouples” back to . In other words, as in the existing form, we need to tolerate an approximation error. Although is not exactly , we can still control through its approximation ,
As we discussed above, the term can be controlled by using decoupling and the moments bound in (3.10). Therefore, the inequality (3.11) is useful to derive a sufficiently tight bound for if is very close to uniformly, i.e., the approximation error is small. Now the question is: for what is it possible to find a “well-behaved” such that the approximation error is small? To understand this question, recall that the mechanism that links to is the conditional expectation operator . For our case, from (3.9) orthogonality leads to
Note that the function is not exactly , but generated by convolving with a multivariate Gaussian pdf: indeed, recoupling is Gaussian smoothing. The Fourier transform of a multivariate Gaussian is again a Gaussian; it decays quickly with frequency. So, in order to admit a small approximation error, the target must be smooth. However, in our case, the function is discontinuous at ; it changes extremely rapidly in the vicinity of , and hence its Fourier transform (appropriately defined) does not decay quickly at all. Therefore, the term is a poor target for approximation by using a smooth function . From Figure 1, the difference between and increases as . The poor approximation error results in a trivial bound for instead of the desired bound (3.7).
Decoupling and convolutional phase retrieval.
To reduce the approximation error caused by the nonsmoothness of at , we smooth . More specifically, we introduce a new weighted objective (1.4) with Gaussian weighting in (1.2) , replacing the analyzing target with
Consequently, we obtain a smoothed variant of ,
Similar to (3.13), we obtain
Now the approximation error in (3.13) is replaced by . As observed from Figure 1, the function smoothes especially near the vicinity of , such that the new approximation error is significantly reduced. Thus, by using similar ideas above, we can provide a nontrivial bound
for some , which is sufficient for showing iterative contraction. Finally, because of the weighting , it should be noticed that the overall analysis needs to be slightly modified accordingly. For a more detailed analysis, we refer the readers to Section 6.
In this section, we conduct experiments on both synthetic and real datasets to demonstrate the effectiveness of the proposed method.
4.1 Experiments on synthetic dataset
Dependence of sample complexity on .
First, we investigate the dependence of the sample complexity on . We assume the ground truth , and consider three cases:
with to be the standard basis vector, such that ;
is uniformly random generated on the complex sphere ;
, such that .
For each case, we fix the signal length and vary the ratio . For each ratio , we randomly generate the kernel in (1.1) and repeat the experiment times. We initialize the algorithm by the spectral method in Algorithm 1 and run the gradient descent (2.1). Given the algorithm output , we judge the success of recovery by
Another observation is that the larger is, the more samples we needed for the success of recovery. One possibility is that the sample complexity depends on , another possibility is that the extra logarithmic factors in our analysis are truly necessary for worst case (here, spectral sparse) inputs.
Necessity of initializations.
As has been shown in [sun2016geometric, soltanolkotabi2017], for phase retrieval with generic measurement, when the sample complexity satisfies , with high probability the landscape of the nonconvex objective (1.5) is nice enough that it enables initialization free global optimization. This raises an interesting question of whether spectral initialization is necessary for the random convolutional model. We consider a similar setting as the previous experiment, where the ground truth is drawn uniformly at random from . We fix the dimension and change the ratio . For each ratio, we randomly generate the kernel in (1.1) and repeat the experiment times. For each instance, we start the algorithm from random and spectral initializations, respectively. We choose the stepsize via backtracking linesearch and terminate the experiment either when the number of iterations is larger than or the distance of the iterate to the solution is smaller than . As we can see from Figure 3, the number of samples required for successful recovery with random initializations is only slightly more than that with the spectral initialization. This implies that the requirement of spectral initialization is an artifact of our analysis. For convolutional phase retrieval, the result in [chen2018gradient] shows some promises for analyzing global convergence of gradient methods with random initializations.
Effects of weighting .
Although the weighting in (3.1) that we introduced in Theorem 3.1 is mainly for analysis, here we investigate its effectiveness in practice. We consider the same three cases for as we did before. For each case, we fix the signal length and vary the ratio . For each ratio , we randomly generate the kernel in (1.1) and repeat the experiment times. We initialize the algorithm by the spectral method in Algorithm 1 and run the gradient descent (2.1) with weighting and in (3.1), respectively. We judge success of recovery once the error (4.1) is smaller than . From Figure 4, we can see that the sample complexity is slightly larger for , the benefit of weighting here is more for the ease of analysis.
Comparison with generic random measurements.
Another interesting question is that, in comparison with a pure random model, how many more samples are needed for the random convolutional model in practice? We investigate this question numerically. We consider the same three cases for as we did before, and consider two random measurement models
where , and is a row vector of . For each case, we fix the signal length and vary the ratio . We repeat the experiment times. We initialize the algorithm by the spectral method in Algorithm 1 for both models, and run gradient descent (2.1). We judge success of recovery once the error (4.1) is smaller than . From Figure 5, we can see that when is typical (e.g., or is uniformly random generated from ), under the same settings, the samples needed for the two random models are almost the same. However, when is Fourier sparse (e.g., ), more samples are required for the random convolution model.