Near-optimal bounds for phase synchronization
The problem of phase synchronization is to estimate the phases (angles) of a complex unit-modulus vector from their noisy pairwise relative measurements , where is a complex-valued Gaussian random matrix. The maximum likelihood estimator (MLE) is a solution to a unit-modulus constrained quadratic programming problem, which is nonconvex. Existing works have proposed polynomial-time algorithms such as a semidefinite relaxation (SDP) approach or the generalized power method (GPM) to solve it. Numerical experiments suggest both of these methods succeed with high probability for up to , yet, existing analyses only confirm this observation for up to . In this paper, we bridge the gap, by proving SDP is tight for , and GPM converges to the global optimum under the same regime. Moreover, we establish a linear convergence rate for GPM, and derive a tighter bound for the MLE. A novel technique we develop in this paper is to track (theoretically) closely related sequences of iterates, in addition to the sequence of iterates GPM actually produces. As a by-product, we obtain an perturbation bound for leading eigenvectors. Our result also confirms intuitions that use techniques from statistical mechanics.
Keywords: angular synchronization, nonconvex optimization, semidefinite relaxation, power method, maximum likelihood estimator, eigenvector perturbation bound.
Phase synchronization is the problem of estimating angles in based on noisy measurements of their differences . This is equivalent to estimating phases from measurements of relative phases .
A typical noise model for this estimation problem is as follows. The target parameter (the signal) is the vector with entries . The measurements are stored in a matrix such that, for ,
where is the noise level and are independent standard complex Gaussian variables. Under this model, defining and for consistency, the model is compactly written in matrix notation as
where both and are Hermitian. An easy derivation111Since is Gaussian, an MLE minimizes the squared Frobenius norm: . Owing to , this is equivalent to maximizing . shows that a maximum likelihood estimator (MLE) for the signal is a global optimum of the following quadratically constrained quadratic program (we define ):
Problem (P) is non-convex and hard in general ([39, Prop. 3.5]). Yet, numerical experiments in [3, 2] suggest that, provided ,222The notation suppresses potential log factors. the following convex semidefinite relaxation for (P) admits as its unique global optimum with high probability (more generally, if the problem below admits a solution of rank 1, , then the relaxation is said to be tight and is an optimum of (P)):
In this paper, we give a rigorous proof for this observation, improving on the previous best result which only handles . Our result also provides some justification for the analytical prediction in  on optimality of the semidefinite relaxation approach.333To be precise, in , it is predicted—but not proved—via statistical mechanics arguments that the SDP relaxation is nearly optimal when . Instead of showing a solution of (SDP) has rank one, a rescaled leading eigenvector of its solution is used as an estimator.
In the theorem statement, “up to phase” refers to the fact that the measurements are relative: the distribution of does not change if is replaced by for any angle , so that if is a solution of (P), then necessarily so is for any , and can only be recovered up to a global phase shift.
Theorem 1 shows that the non-convex problem (P) enjoys what is sometimes called hidden convexity, that is, in the proper noise regime, it is equivalent to a (tractable) convex problem. As a consequence, it is not a hard problem in that regime, suggesting local solvers may be able to solve it in its natural dimension. This is desirable, since the relaxation (SDP), while convex, has the disadvantage of lifting the problem from to dimensions.
And indeed, numerical experiments in  suggest that local optimization algorithms applied to (P) directly succeed in the same regime as (SDP). This was confirmed theoretically in  for , using both a modification of the power method called the generalized power method (GPM) and local optimization algorithms acting directly on the search space of (P), which is a manifold. Results pertaining to GPM have been rapidly improved to allow for in .
In this paper, we consider a version of GPM listed as Algorithm 1 and prove that it works in the same regime as the semidefinite relaxation, thus better capturing the empirical observation. Note that GPM, as a local algorithm, is a more desirable approach versus semidefinite relaxation in practice. GPM and its variants are also considered in a number of related problems [21, 15, 31, 11], and can be seen as special cases of the conditional gradient algorithm [21, 24].
To establish both results, we develop an original proof technique based on following separate but closely related sequences of feasible points for (P), designed so that they will have suitable statistical independence properties. Furthermore, as a necessary step toward proving the main theorems, we prove an perturbation bound for eigenvectors, which is of independent interest.
It is worth noting that for , it is impossible to reliably detect, with probability tending to 1, whether is of the form or if it is only of the form [28, Thm. 6.11], which suggests that is necessary in order for a good estimator to exist. This can be made precise by considering the simpler problem of synchronization,444The problem is formulated as follows: , noise is a real random matrix, e.g., a Gaussian Wigner matrix, and the goal is to recover from . where we have the stronger knowledge that . For the synchronization problem, non-rigorous arguments that use techniques from statistical mechanics show is the information-theoretic threshold for mean squared estimation error (MSE): when is above this threshold, no estimator is able to beat the trivial estimator as . In , it was rigorously proved that is the threshold for a different notion of MSE. These results a fortiori imply, for phase synchronization, that is necessary555Suppose the prior is supported and uniformly distributed on . By independence, the Bayes-optimal estimator for phase synchronization is a function of , so we can use information-theoretic results about the synchronization problem. in order for an estimator to have nontrivial MSE (better than the trivial estimator ). It is also known that both the eigenvector estimator and the MLE have nontrivial MSE as soon as [10, 5, 19]. Whether the extra logarithmic factor is necessary to compute the MLE efficiently up to the threshold remains to be determined.
To close this introduction, we state the relevance of the MLE as an estimator for .
Let be a global optimum of (P), with global phase such that . Then, deterministically,
Furthermore, if , then with high probability for large ,
The bound on error appears in [2, Lem. 4.1], while the bound on error improves on [2, Lem. 4.2] as a by-product of the results obtained here. We remark that is necessary for a nontrivial error (smaller than , which is trivially attained by ) due to [2, 1].666In [2, 1], it is suggested information-theoretically exact recovery with high probability is impossible for synchronization over if . We can use this result to show is necessary for a nontrivial error by putting a uniform prior on .
It is important to state that the eigenvector estimator mentioned above is order-wise as good an estimator as the MLE, in that it satisfies the same error bounds as in Theorem 3 up to constants. From the perspective of optimization, the main merit of Theorems 1 and 2 is that they rigorously explain the empirically observed tractability of (P) despite non-convexity.
The difficulty: statistical dependence
As will be argued momentarily, the main difficulty in the analysis is proving a sharp bound for , which involves two dependent random quantities: the noise matrix and a solution of (P), which is a nontrivial function of . While in the -norm the simple bound is sharp, no such simple argument is known to bound the -norm. The need to study perturbations in -norm appears inescapable, as it arises from the entry-wise constraints of (P) and the aim to control as well as .
where is an optimum of (P) which is then unique up to phase. From this expression, it is apparent that if , it only remains to show that to conclude that solving (SDP) is equivalent to solving (P). This reduces the task to that of carefully bounding this scalar, random variable:
where are the columns of the random noise matrix . If and were statistically independent, this would be bounded with high probability by , as desired. Indeed, since the vector contains only phases and since the Gaussian distribution is isotropic (the distribution is invariant under rotation in the complex plane), would be distributed identically to a sum of independent standard complex Gaussians. The modulus of such a variable concentrates close to . Taking the maximum over incurs an additional factor.
Unfortunately, the intricate dependence between and has not been satisfactorily resolved in previous work, where only suboptimal bounds have been produced for , eventually leading to suboptimal bounds on the acceptable noise levels [2, eq. (4.11)][8, Lemma 12][23, Proof of Thm. 2].
As a key step to overcome this difficulty, we (theoretically) introduce auxiliary problems to transform the question of controlling into one about the sensitivity of the optimum to perturbations of the data . This is outlined next.
Introducing auxiliary problems to reduce dependence
Since the main concern in controlling is the statistical dependence between and , we introduce new optimization problems of the form (P), where, for each value of in , the cost matrix is replaced by
where is the indicator function. In other terms, is with the th row and column set to 0, so that is statistically independent from . As a result, a global optimum of (P) with set to is also independent from . This usefully informs the following observation, where the global phases of and are chosen so that :
Crucially, independence of and implies the first term is with high probability, by the argument laid out after eq. (7). In the second term, a standard concentration argument shows with high probability—see Section 3. Hence, to control , it is sufficient to show that, with high probability for all , the solutions and are within distance of each other, in the sense.
This claim about the proximity of and turns out to be a delicate statement about the sensitivity of the global optimum of (P) to perturbations of only measurements which involve the th phase, . To establish it, we need precise control of the properties of the optima of (P). To this end, we develop a strategy to track the properties of sequences which converge to as well as to for each .
To ease further discussion about distances up to phase, consider the following distance-like function:
Restricted to complex vectors of given -norm or to complex vectors with unit-modulus entries, is a true metric on the quotient space induced by the equivalence relation :
Thus, is appropriate as a distance between estimators for (P) and as a distance between candidate eigenvectors, being invariant under global phase shifts. Moreover, the quotient space is a complete metric space with . More details will follow.
Coming back to our problem, the core argument is an analysis of recursive error bounds of GPM, and this analysis leads to the proof that all iterates stay in , where
and are some constants (determined in Section 4). On one hand, we show that, with high probability, the nonlinear mapping iterated by GPM is Lipschitz continuous over with constant . On the other hand, we show that, with high probability, all iterates of GPM are in . Together, these two properties imply that, with high probability, is a contraction mapping over the set of iterates of GPM. By a completeness argument, this implies that the sequence of iterates of GPM converges in . The roadmap of our proof is the following:
(this requires developing new bounds for eigenvector perturbation—see Theorem 8),
is -Lipschitz with on with respect to (by completeness, this implies —see Lemma 14), and
As hinted above, we follow this reasoning not only for the sequence which is expected to converge to , but also for auxiliary sequences expected to converge to . It is only through exploitation of the strong links between these sequences and reduction in statistical dependence they offer that we are able to go through with the proof program above.
Note that might not be a contraction mapping on all of since we do not show that . Nevertheless, is a contraction on the iterates, which is sufficient for our purpose; henceforth, we say the mapping has the local contraction property.
We remark that, in the study of high-dimensional -estimation , the idea of introducing auxiliary problems (and associated optimizers) is also used to tackle dependence, and it yields powerful analysis. While sharing similarity with that approach, our analysis relies on studying auxiliary sequences of iterates, as will be discussed soon—also see Figure 1.
As a necessary and useful warm-up, we first focus on the task of showing that (a leading eigenvector of ) is in , via analysis of the related (leading eigenvectors of ). This requires sharp bounds for . The outcome of this analysis is an eigenvector perturbation bound in the -norm, which is another motivation for the introduction of auxiliary problems.
First analysis: an perturbation bound for eigenvectors
As the initializer of Algorithm 1, the leading eigenvector of has several good properties necessary for analysis, and we will discuss them in depth in Section 3. Theorems and lemmas in this direction are stated separately and proved first, because their proof is illustrative of the techniques deployed to prove results about (P). Most notably, we prove a sharp perturbation bound for leading eigenvectors.
If , then, with high probability for large , a leading eigenvector of the data matrix (2) scaled such that satisfies
where the global phase of is suitably chosen (e.g., such that .)
The crux of the proof lies in a sharp bound on . This is obtained by using (a suitable version of) the Davis–Kahan theorem (Lemma 11): when ,
To reach the first conclusion, we view as the perturbed version of due to perturbation . Note that has nonzero entries only in the th row and th column, and they are independent of . Compared to the full perturbation which perturbs the eigenvector to , the matrix results in a much smaller distance between and . Notice that, as will be detailed later, these results combined with the reasoning of (9) imply that is in , as desired.
Comparing Theorem 4 to Theorem 3 readily shows that the eigenvector is an excellent estimator for (up to the fact that its entries are not necessarily unit-modulus, which can be easily corrected—see Theorem 8). Further efforts in this paper are dedicated to characterizing the performance and tractability of the MLE .
Analysis of iterations: tracking auxiliary sequences
While analyzing the eigenvector is relatively straightforward, the optima of (P) are more difficult to tame due to the unit-modulus constraints. As hinted above, the novel idea we develop in this paper is to track the sequences produced by Algorithm 1 with inputs (8) instead of , for each . These auxiliary sequences—which only serve for the analysis and are not (and could not be) computed in practice—enjoy the crucial proximity property desired in the previous subsection—see Figure 1.
Indeed, we will show by induction that there exist absolute constants such that, for all and for ,
, proximity property
The proximity and local contraction properties888Formally, we only prove ; but proximity implies is also in a (slightly larger) contraction region, with different constants . are both crucial and complementary for the analysis: the proximity property allows to control quantities in the presence of the random matrix despite statistical dependence (as shown in (9)), and the local contraction property is used to establish (14) below, making sure remains small.
For the high-level idea, consider the nonlinear operators and implicitly defined by Algorithm 1 so that and . If we can show that is -Lipschitz with constant with respect to , then a recursive error bound follows:
This ensures does not accumulate with , provided the discrepancy error—which is caused by the difference between and —is small enough. This is assured with high probability, because is independent of , causing the discrepancy error to be —considerably smaller than . In spirit, this is the same argument as in the analysis of the eigenvector estimator.
The above recursive error bound hinges on the other important property, that is, staying in the contraction region. Crucially, to establish , we need a tight bound on . Fortunately, we have seen how to control this quantity in (9): for any ,
This, in turn, requires a proximity result for . This insight naturally motivates an analysis of each iteration by induction.
There are two technical issues we briefly address before ending this introduction with a remark.
The first issue concerns the probabilistic argument in the proof. In (14) and (15), we invoke concentration inequalities to obtain tight bounds. However, since there is a (small) probability that such inequalities fail, we cannot use union bounds for , which is infinite. To overcome this obstacle, we use concentration inequalities only for the first iterations, and resort to a deterministic analysis for iterations . The critical observation is that, decays exponentially for due to contraction, so the amount of update is tiny after iterations. Using another inductive argument, we can secure exponential decay for as well. The rationale is that we already established good properties about , and is tiny for , so we can easily relate to and show also has good properties. Essentially, remains in a contraction region with slightly larger constants.
We close with a remark about the initializer (and ). Algorithm 1 uses the leading eigenvector of for initialization, and our analysis relies on perturbation bounds to verify the base case of the induction. However, we point out that, even in the absence of such perturbation results, we could set in theory, deduce that and thus prove Theorem 1 about the SDP relaxation. In other words, even if we leave out the discussion about the initializer altogether, there is still enough material to secure tightness of the SDP relaxation. The proof augmented with the analysis of the eigenvector perturbation has the advantage of also providing a statement about Algorithm 1 which is actually runnable in practice.
2 Main results
In this section we will state our main theorems formally. The assumption on random noise will also be relaxed to a broader class of random matrices. To begin with, let us first clarify the “up to phase” statements in Section 1.
The quotient space
For any , whether the true phases are or does not affect the measurements (2). As a result, the available data are insufficient to distinguish from . Clearly the program (P) is invariant to global phase shifts as well. It thus makes sense to ignore the global phase in defining distances between estimators. A reasonable notion of error then becomes
where the optimal phase is the phase (Arg) of . Similarly, a notion of error can be defined:
Formally, one can partition all points in into equivalence classes via the equivalence relation (10). The resulting quotient space contains the equivalence classes for all . Specifically, the feasible set of (P),
reduces to under this equivalence relation. It is easily verified that defines a distance on . In particular, it satisfies the triangular inequality (where is understood to mean ):
Moreover, is a complete metric space under (see Theorem 17). Similarly, is also a distance. As will be shown, the sequence described in Section 1 satisfies the local contraction property (see (14)) on the metric space , hence converges to a fixed point which is exactly (understood as ).
The noise matrix
In Section 1 we assume that has independent standard complex Gaussian variables above its diagonal. However, this restricted assumption is only for expository convenience, and can be relaxed to the class of Hermitian Wigner matrices with sub-gaussian entries. Statements about Algorithm 1 and about tightness of the SDP relaxation continue to hold, although of course the solution of (P) now no longer necessarily corresponds to the MLE.
The class of sub-gaussian variables subsumes Gaussian variables, but has one defining feature similar to Gaussian variables, that is, the tail probability decaying no slower than Gaussian variables. In our model, each entry of satisfies the tail bound
for both real and imaginary parts, where is an absolute constant. Formally, we assume the Hermitian matrix satisfies the following: are jointly independent, have zero mean, and satisfy the sub-gaussian tail bound (19); the diagonal elements are zero, and for any . Note there are equivalent definitions of sub-gaussian variables (up to constants) .
This random model is a much richer class of noise matrices, containing the Gaussian model introduced in Section 1 as a special case. Each random variable in can be, for example, a symmetric Bernoulli variable, any other centered and bounded variable, or simply zero.
The SDP approach tries to solve (P) via its convex semidefinite relaxation (SDP). It is a relaxation of (P) in the following sense. For any feasible , the corresponding matrix is feasible for (SDP). Likewise, any feasible matrix of rank 1 can be factored as such that is feasible for (P). Thus, the relaxation consists in allowing solutions of rank more than 1 in (SDP). Consequently, if (SDP) admits a solution of rank 1, , then the corresponding is a global optimum for (P). Furthermore, if the rank-1 solution of (SDP) is unique, it can be recovered in polynomial time. For this reason, the regime of interest is one where (SDP) admits a unique solution of rank 1.
Note that the exponent 2 in the failure probability can be replaced by any positive numerical constant, only affecting other absolute constants in the theorem (and all other theorems).
The generalized power method (Algorithm 1) is an iterative algorithm similar to the classical power method, but instead of projecting vectors onto a sphere after matrix-vector multiplication , it extracts the phases from , which is an entry-wise projection. It is much faster than SDP, and converges linearly to a limit, which is the optimum up to phase (optimality is stated in Theorem 7). The next theorem is a precise version of Theorem 2.
There exists an absolute constant such that, if then, with probability , the sequence produced by Algorithm 1 has a linear convergence rate to some (up to phase):
The proof is based on induction: in each iteration, we will establish the proximity property and the contraction property for . The proof is simply a rigorous justification of the heuristics we discussed in Section 1. It also leads to and error bounds for . The following theorem is a precise version of Theorem 3.
We denote, henceforth, the leading eigenvector of by , and similarly the leading eigenvector of by . Note that and (similarly and ) are identical.999We use notation (similarly ) to emphasize the leading eigenvector as an estimator, as opposed to merely an initializer. We highlight the significance of the eigenvector estimator in the following theorem, which is a precise version of Theorem 4.
Let be a leading eigenvector of with scale and global phase chosen such that and . There exists an absolute constant such that, if , then, with probability ,
where is some absolute constant. Moreover, the projected leading eigenvector, namely, , satisfies the same bounds with replaced by .
The eigenvector estimator has been studied extensively in recent years, prominently in the statistics literature [27, 20], under the spiked covariance model. While the perturbation is usually studied under norms, the norm received much less attention. A recent perturbation result appeared in , but it is a deterministic bound and would produce a suboptimal result here.
3 Proof organization for eigenvector perturbations
The first concentration result is standard and is a direct consequence of, for example, Proposition 2.4 in .
With probability , the following holds for any :
where is an absolute constant.
Let be the set of unit vectors in . Suppose for each we have a finite (random) set whose elements are independent of , and the cardinality of is not random. Concentration inequalities enable us to bound uniformly over all with high probability. We state this formally in the next lemma.
Suppose for all . Let be the th entry of a vector and denote . Then, with probability ,
where is an absolute constant. In particular, we can choose such that, if (defined in (18)), then, with probability ,
Introducing auxiliary eigenvector problems
As is well known, the leading eigenvectors of are the solutions to the following optimization problem (note that this problem is a relaxation of (P)):
We aim to show that a solution of () is close to in the sense of . As before, the major difficulty of the analysis is obtaining a sharp bound on . This is apparent when we write for the leading eigenvalue of and use to obtain (choosing the global phase of such that ):
While it is easy to analyze and , bounding requires more work.
For , let be the solution to an auxiliary problem () in which is replaced by —thus, is equivalent to . Following the same strategy as in (9), we can now split into two terms and try to bound separately:
where is the dominant term, and can be easily bounded—see the paragraph below Lemma 10; and is the higher-order discrepancy error, which is the price we pay for replacing with .
The crucial point is that , which is much smaller than . This is because the difference between and results from a sparse perturbation , whose effect on the leading eigenvector is small. This point is formalized in the next lemma, which follows from .101010By Davis-Kahan Theorem and Weyl’s inequality, . Thus, the lemma follows from .
Lemma 11 (Davis–Kahan Theorem).
Suppose that are Hermitian matrices, and . Let be the gap between the top two eigenvalues of , and be leading eigenvectors of and respectively, normalized such that . If , then
The benefit of this perturbation result is pronounced when is a sparse random matrix: if we set , then the numerator in (24) becomes with high probability (by Lemma 10 and a bound on ). If, however, we set , then the numerator is with high probability. This is why is so small and (23) yields a tight bound.
We remark that in many later uses of perturbation results (e.g., [37, 29]), especially in statistics and theoretical computer science, it is common to invoke a variant of the Davis–Kahan theorem in which is replaced by in (24), which would lead to a suboptimal result here. This is because with high probability when is a random sparse matrix and is not large. Our analysis here is an example that shows the merit of using the more precise version of the Davis–Kahan theorem.
4 Proof organization for phase synchronization
We begin with some useful lemmas about the local contraction property. These will prove useful to establish the desired properties of the iterates of Algorithm 1, by induction. These properties extend to the limit by continuity. Finally, we will use a known optimality certificate for (SDP) to validate .
Local contraction lemmas
First let us denote the rescaled matrix by , which can also be viewed as a linear operator in :