Gradient Descent with Random Initialization:
Fast Global Convergence for Nonconvex Phase Retrieval00footnotetext: Author names are sorted alphabetically.
This paper considers the problem of solving systems of quadratic equations, namely, recovering an object of interest from quadratic equations / samples , . This problem, also dubbed as phase retrieval, spans multiple domains including physical sciences and machine learning.
We investigate the efficiency of gradient descent (or Wirtinger flow) designed for the nonconvex least squares problem. We prove that under Gaussian designs, gradient descent — when randomly initialized — yields an -accurate solution in iterations given nearly minimal samples, thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need of (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data.
Suppose we are interested in learning an unknown object , but only have access to a few quadratic equations of the form
where is the sample we collect and is the design vector known a priori. Is it feasible to reconstruct in an accurate and efficient manner?
The problem of solving systems of quadratic equations (1) is of fundamental importance and finds applications in numerous contexts. Perhaps one of the best-known applications is the so-called phase retrieval problem arising in physical sciences [CESV13, SEC15]. In X-ray crystallography, due to the ultra-high frequency of the X-rays, the optical sensors and detectors are incapable of recording the phases of the diffractive waves; rather, only intensity measurements are collected. The phase retrieval problem comes down to reconstructing the specimen of interest given intensity-only measurements. If one thinks of as the specimen under study and uses to represent the intensity measurements, then phase retrieval is precisely about inverting the quadratic system (1).
Moving beyond physical sciences, the above problem also spans various machine learning applications. One example is mixed linear regression, where one wishes to estimate two unknown vectors and from unlabeled linear measurements [CYC14, BWY14]. The acquired data take the form of either or , without knowing which of the two vectors generates the data. In a simple symmetric case with (so that ), the squared measurements become the sufficient statistics, and hence mixed linear regression can be converted to learning from . Furthermore, the quadratic measurement model in (1) allows to represent a single neuron associated with a quadratic activation function, where are the data and encodes the parameters to be learned. As described in [SJL17, LMZ17], learning neural nets with quadratic activations involves solving systems of quadratic equations.
1.1 Nonconvex optimization via gradient descent
A natural strategy for inverting the system of quadratic equations (1) is to solve the following nonconvex least squares estimation problem
Under Gaussian designs where , the solution to (2) is known to be exact — up to some global sign — with high probability, as soon as the number of equations (samples) exceeds the order of the number of unknowns [BCMN14]. However, the loss function in (2) is highly nonconvex, thus resulting in severe computational challenges. With this issue in mind, can we still hope to find the global minimizer of (2) via low-complexity algorithms which, ideally, run in time proportional to that taken to read the data?
Fortunately, in spite of nonconvexity, a variety of optimization-based methods are shown to be effective in the presence of proper statistical models. Arguably, one of the simplest algorithms for solving (2) is vanilla gradient descent (GD), which attempts recovery via the update rule
with being the stepsize/learning rate. The above iterative procedure is also dubbed Wirtinger flow for phase retrieval, which can accommodate the complex-valued case as well [CLS15]. This simple algorithm is remarkably efficient under Gaussian designs: in conjunction with carefully-designed initialization and stepsize rules, GD provably converges to the truth at a linear rate111An iterative algorithm is said to enjoy linear convergence if the iterates converge geometrically fast to the minimizer . , provided that the ratio of the number of equations to the number of unknowns exceeds some logarithmic factor [CLS15, Sol14, MWCC17].
One crucial element in prior convergence analysis is initialization. In order to guarantee linear convergence, prior works typically recommend spectral initialization or its variants [CLS15, CC17, WGE17, ZZLC17, MWCC17, LL17, MM17]. Specifically, the spectral method forms an initial estimate using the (properly scaled) leading eigenvector of a certain data matrix. Two important features are worth emphasizing:
falls within a local -ball surrounding with a reasonably small radius, where enjoys strong convexity;
is incoherent with all the design vectors — in the sense that is reasonably small for all — and hence falls within a region where enjoys desired smoothness conditions.
These two properties taken collectively allow gradient descent to converge rapidly from the very beginning.
1.2 Random initialization?
The enormous success of spectral initialization gives rise to a curious question: is carefully-designed initialization necessary for achieving fast convergence? Obviously, vanilla GD cannot start from arbitrary points, since it may get trapped in undesirable stationary points (e.g. saddle points). However, is there any simpler initialization approach that allows to avoid such stationary points and that works equally well as spectral initialization?
A strategy that practitioners often like to employ is to initialize GD randomly. The advantage is clear: compared with spectral methods, random initialization is model-agnostic and is usually more robust vis-a-vis model mismatch. Despite its wide use in practice, however, GD with random initialization is poorly understood in theory. One way to study this method is through a geometric lens [SQW16]: under Gaussian designs, the loss function (cf. (2)) does not have any spurious local minima as long as the sample size is on the order of . Moreover, all saddle points are strict [GHJY15], meaning that the associated Hessian matrices have at least one negative eigenvalue if they are not local minima. Armed with these two conditions, the theory of Lee et al. [LSJR16, LPP17] implies that vanilla GD converges almost surely to the truth. However, the convergence rate remains unsettled. In fact, we are not aware of any theory that guarantees polynomial-time convergence of vanilla GD for phase retrieval in the absence of carefully-designed initialization.
Motivated by this, we aim to pursue a formal understanding about the convergence properties of GD with random initialization. Before embarking on theoretical analyses, we first assess its practical efficiency through numerical experiments. Generate the true object and the initial guess randomly as
We vary the number of unknowns (i.e. ), set , and take a constant stepsize . Here the measurement vectors are generated from Gaussian distributions, i.e. for . The relative errors of the GD iterates in a random trial are plotted in Figure 1, where
represents the distance between and modulo the unrecoverable global sign.
In all experiments carried out in Figure 1, we observe two stages for GD: (1) Stage 1: the relative error of stays nearly flat; (2) Stage 2: the relative error of experiences geometric decay. Interestingly, Stage 1 lasts only for a few tens of iterations. These numerical findings taken together reveal appealing numerical efficiency of GD in the presence of random initialization — it attains 5-digit accuracy within about 200 iterations!
To further illustrate this point, we take a closer inspection of the signal component and the perpendicular component , where we normalize for simplicity. Denote by the norm of the perpendicular component. We highlight two important and somewhat surprising observations that allude to why random initialization works.
Exponential growth of the magnitude ratio of the signal to the perpendicular components. The ratio, , grows exponentially fast throughout the execution of the algorithm, as demonstrated in Figure 2(a). This metric in some sense captures the signal-to-noise ratio of the running iterates.
Exponential growth of the signal strength in Stage 1. While the estimation error of may not drop significantly during Stage 1, the size of the signal component increases exponentially fast and becomes the dominant component within several tens of iterations, as demonstrated in Figure 2(b). This helps explain why Stage 1 lasts only for a short duration.
The central question then amounts to whether one can develop a mathematical theory to interpret such intriguing numerical performance. In particular, how many iterations does Stage 1 encompass, and how fast can the algorithm converge in Stage 2?
1.3 Main findings
The objective of the current paper is to demystify the computational efficiency of GD with random initialization, thus bridging the gap between theory and practice. Assuming a tractable random design model in which ’s follow Gaussian distributions, our main findings are summarized in the following theorem. Here and throughout, the notation or (resp. , ) means that there exist constants such that (resp. , ).
Fix with . Suppose that for , , and for some sufficiently small constant . Then with probability approaching one, there exist some sufficiently small constant and such that the GD iterates (3) obey
for some absolute constant , provided that the sample size .
The readers are referred to Theorem 2 for a more general statement.
Here, the stepsize is taken to be a fixed constant throughout all iterations, and we reuse all data across all iterations (i.e. no sample splitting is needed to establish this theorem). The GD trajectory is divided into 2 stages: (1) Stage 1 consists of the first iterations, corresponding to the first tens of iterations discussed in Section 1.2; (2) Stage 2 consists of all remaining iterations, where the estimation error contracts linearly. Several important implications/remarks follow immediately.
Stage 1 takes iterations. When seeded with a random initial guess, GD is capable of entering a local region surrounding within iterations, namely,
for some sufficiently small constant . Even though Stage 1 may not enjoy linear convergence in terms of the estimation error, it is of fairly short duration.
Stage 2 takes iterations. After entering the local region, GD converges linearly to the ground truth with a contraction rate . This tells us that GD reaches -accuracy (in a relative sense) within iterations.
Near linear-time computational complexity. Taken collectively, these imply that the iteration complexity of GD with random initialization is
Given that the cost of each iteration mainly lies in calculating the gradient , the whole algorithm takes nearly linear time, namely, it enjoys a computational complexity proportional to the time taken to read the data (modulo some logarithmic factor).
Near-minimal sample complexity. The preceding computational guarantees occur as soon as the sample size exceeds . Given that one needs at least samples to recover unknowns, the sample complexity of randomly initialized GD is optimal up to some logarithmic factor.
Saddle points? The GD iterates never hit the saddle points (see Figure 3 for an illustration). In fact, after a constant number of iterations at the very beginning, GD will follow a path that increasingly distances itself from the set of saddle points as the algorithm progresses. There is no need to adopt sophisticated saddle-point escaping schemes developed in generic optimization theory (e.g. perturbed GD [JGN17]).
Weak dependency w.r.t. the design vectors. As we will elaborate in Section 4, the statistical dependency between the GD iterates and certain components of the design vectors stays at an exceedingly weak level. Consequently, the GD iterates proceed as if fresh samples were employed in each iteration. This statistical observation plays a crucial role in characterizing the dynamics of the algorithm without the need of sample splitting.
It is worth emphasizing that the entire trajectory of GD is automatically confined within a certain region enjoying favorable geometry. For example, the GD iterates are always incoherent with the design vectors, stay sufficiently away from any saddle point, and exhibit desired smoothness conditions, which we will formalize in Section 4. Such delicate geometric properties underlying the GD trajectory are not explained by prior papers. In light of this, convergence analysis based on global geometry [SQW16] — which provides valuable insights into algorithm designs with arbitrary initialization — results in suboptimal (or even pessimistic) computational guarantees when analyzing a specific algorithm like GD. In contrast, the current paper establishes near-optimal performance guarantees by paying particular attention to finer dynamics of the algorithm. As will be seen later, this is accomplished by heavily exploiting the statistical properties in each iterative update.
2 Why random initialization works?
Before diving into the proof of the main theorem, we pause to develop intuitions regarding why gradient descent with random initialization is expected to work. We will build our understanding step by step: (i) we first investigate the dynamics of the population gradient sequence (the case where we have infinite samples); (ii) we then turn to the finite-sample case and present a heuristic argument assuming independence between the iterates and the design vectors; (iii) finally, we argue that the true trajectory is remarkably close to the one heuristically analyzed in the previous step, which arises from a key property concerning the “near-independence” between and the design vectors .
Without loss of generality, we assume throughout this section, where denotes the first standard basis vector. For notational simplicity, we denote by
the first entry and the 2nd through the th entries of , respectively. Since , it is easily seen that
represent respectively the components of along and perpendicular to the signal direction. In what follows, we focus our attention on the following two quantities that reflect the sizes of the preceding two components222Here, we do not take the absolute value of . As we shall see later, the ’s are of the same sign throughout the execution of the algorithm.
Without loss of generality, assume that .
2.1 Population dynamics
To start with, we consider the unrealistic case where the iterates are constructed using the population gradient (or equivalently, the gradient when the sample size approaches infinity), i.e.
Here, represents the population gradient given by
which can be computed by assuming that and the ’s are independent. Simple algebraic manipulation reveals the dynamics for both the signal and the perpendicular components:
Assuming that is sufficiently small and recognizing that , we arrive at the following population-level state evolution for both and (cf. (7)):
This recursive system has three fixed points:
which correspond to the global minimizer, the local maximizer, and the saddle points, respectively, of the population objective function.
We make note of the following key observations in the presence of a randomly initialized , which will be formalized later in Lemma 1:
the ratio of the size of the signal component to that of the perpendicular component increases exponentially fast;
the size of the signal component keeps growing until it plateaus around ;
the size of the perpendicular component eventually drops towards zero.
In other words, when randomly initialized, converges to rapidly, thus indicating rapid convergence of to the truth , without getting stuck at any undesirable saddle points. We also illustrate these phenomena numerically. Set , and . Figure 4 displays the dynamics of , , and , which are precisely as discussed above.
2.2 Finite-sample analysis: a heuristic treatment
We now move on to the finite-sample regime, and examine how many samples are needed in order for the population dynamics to be reasonably accurate. Notably, the arguments in this subsection are heuristic in nature, but they are useful in developing insights into the true dynamics of the GD iterates.
Rewrite the gradient update rule (3) as
where . Assuming (unreasonably) that the iterate is independent of , the central limit theorem (CLT) allows us to control the size of the fluctuation term . Take the signal component as an example: simple calculations give
with the first entry of . Owing to the preceding independence assumption, is the sum of i.i.d. zero-mean random variables. Assuming that never blows up so that , one can apply the CLT to demonstrate that
with high probability, which is often negligible compared to the other terms. For instance, for the random initial guess one has with probability approaching one, telling us that
as long as . Similar observations hold true for the perpendicular component .
In summary, by assuming independence between and , we arrive at an approximate state evolution for the finite-sample regime:
with the proviso that .
2.3 Key analysis ingredients: near-independence and leave-one-out tricks
The preceding heuristic argument justifies the approximate validity of the population dynamics, under an independence assumption that never holds unless we use fresh samples in each iteration. On closer inspection, what we essentially need is the fluctuation term (cf. (10)) being well-controlled. For instance, when focusing on the signal component, one needs for all . In particular, in the beginning iterations, is as small as . Without the independence assumption, the CLT types of results fail to hold due to the complicated dependency between and . In fact, one can easily find many points that result in much larger remainder terms (as large as ) and that violate the approximate state evolution (13). See Figure 5 for a caricature of the region where the fluctuation term is well-controlled. As can be seen, it only occupies a tiny fraction of the neighborhood of .
Fortunately, despite the complicated dependency across iterations, one can provably guarantee that always stays within the preceding desirable region in which is well-controlled. The key idea is to exploit a certain “near-independence” property between and . Towards this, we make use of a leave-one-out trick for analyzing nonconvex iterative methods. In particular, we construct auxiliary sequences that are
independent of certain components of the design vectors ; and
extremely close to the original gradient sequence .
As it turns out, we need to construct several auxiliary sequences , and , where is independent of the th sampling vector , is independent of the sign information of the first entries of all ’s, and is independent of both. In addition, these auxiliary sequences are constructed by slightly perturbing the original data (see Figure 6 for an illustration), and hence one can expect all of them to stay close to the original sequence throughout the execution of the algorithm. Taking these two properties together, one can propagate the above statistical independence underlying each auxiliary sequence to the true iterates , which in turn allows us to obtain near-optimal control of the fluctuation term . The details are postponed to Section 4.
3 Related work
Solving systems of quadratic equations, or phase retrieval, has been studied extensively in the recent literature; see [SEC15] for an overview. One popular method is convex relaxation (e.g. PhaseLift [CSV13]), which is guaranteed to work as long as exceeds some large enough constant [CL14, DH14, CCG15, CZ15, KRT17]. However, the resulting semidefinite program is computationally prohibitive for solving large-scale problems. To address this issue, [CLS15] proposed the Wirtinger flow algorithm with spectral initialization, which provides the first convergence guarantee for nonconvex methods without sample splitting. Both the sample and computation complexities were further improved by [CC17] with an adaptive truncation strategy. Other nonconvex phase retrieval methods include [NJS13, CLM16, Sol17, WGE17, ZZLC17, WGSC17, CL16, DR17, GX16, CFL15, Wei15, BEB17, TV17, CLW17, ZWGC17, QZEW17, ZCL16, YYF17, CWZG17, Zha17, MXM18]. Almost all of these nonconvex methods require carefully-designed initialization to guarantee a sufficiently accurate initial point. One exception is the approximate message passing algorithm proposed in [MXM18], which works as long as the correlation between the truth and the initial signal is bounded away from zero. This, however, does not accommodate the case when the initial signal strength is vanishingly small (like random initialization). Other works [Zha17, LGL15] explored the global convergence of alternating minimization / projection with random initialization which, however, require fresh samples at least in each of the first iterations in order to enter the local basin. In addition, [LMZ17] explored low-rank recovery from quadratic measurements with near-zero initialization. Using a truncated least squares objective, [LMZ17] established approximate (but non-exact) recovery of over-parametrized GD. Notably, if we do not over-parametrize the phase retrieval problem, then GD with near-zero initialization is (nearly) equivalent to running the power method for spectral initialization333More specifically, the GD update when , which is equivalent to a power iteration (without normalization) w.r.t. the data matrix . , which can be understood using prior theory.
Another related line of research is the design of generic saddle-point escaping algorithms, where the goal is to locate a second-order stationary point (i.e. the point with a vanishing gradient and a positive-semidefinite Hessian). As mentioned earlier, it has been shown by [SQW16] that as soon as , all local minima are global and all the saddle points are strict. With these two geometric properties in mind, saddle-point escaping algorithms are guaranteed to converge globally for phase retrieval. Existing saddle-point escaping algorithms include but are not limited to Hessian-based methods [SQW16] (see also [AAZB16, AZ17, JGN17] for some reviews), noisy stochastic gradient descent [GHJY15], perturbed gradient descent [JGN17], and normalized gradient descent [MSK17]. On the one hand, the results developed in these works are fairly general: they establish polynomial-time convergence guarantees under a few generic geometric conditions. On the other hand, the iteration complexity derived therein may be pessimistic when specialized to a particular problem.
Take phase retrieval and the perturbed gradient descent algorithm [JGN17] as an example. It has been shown in [JGN17, Theorem 5] that for an objective function that is -gradient Lipschitz, -Hessian Lipschitz, -strict saddle, and also locally -strongly convex and -smooth (see definitions in [JGN17]), it takes444When applied to phase retrieval with , one has , , (see [SQW16, Theorem 2.2]), , and (ignoring logarithmic factors).
iterations (ignoring logarithmic factors) for perturbed gradient descent to converge to -accuracy. In fact, even with Nesterov’s accelerated scheme [JNJ17], the iteration complexity for entering the local region is at least
Both of them are much larger than the complexity established herein. This is primarily due to the following facts: (i) the Lipschitz constants of both the gradients and the Hessians are quite large, i.e. and (ignoring log factors), which are, however, treated as dimension-independent constants in the aforementioned papers; (ii) the local condition number is also large, i.e. . In comparison, as suggested by our theory, the GD iterates with random initialization are always confined within a restricted region enjoying much more benign geometry than the worst-case/global characterization.
Furthermore, the above saddle-escaping first-order methods are often more complicated than vanilla GD. Despite its algorithmic simplicity and wide use in practice, the convergence rate of GD with random initialization remains largely unknown. In fact, Du et al. [DJL17] demonstrated that there exist non-pathological functions such that GD can take exponential time to escape the saddle points when initialized randomly. In contrast, as we have demonstrated, saddle points are not an issue for phase retrieval; the GD iterates with random initialization never get trapped in the saddle points.
Finally, the leave-one-out arguments have been invoked to analyze other high-dimensional statistical estimation problems including robust M-estimators [EKBB13, EK15], statistical inference for Lasso [JM15], likelihood ratio test for logistic regression [SCC17], etc. In addition, [ZB17, CFMW17, AFWZ17] made use of the leave-one-out trick to derive entrywise perturbation bounds for eigenvectors resulting from certain spectral methods. The techniques have also been applied by [MWCC17, LMCC18] to establish local linear convergence of vanilla GD for nonconvex statistical estimation problems in the presence of proper spectral initialization.
In this section, we first provide a more general version of Theorem 1 as follows. It spells out exactly the conditions on in order for the gradient method with random initialization to succeed.
Fix . Suppose and for some sufficiently large constant . Assume that the initialization is independent of and obeys
and that the stepsize satisfies for some sufficiently small constant . Then there exist a sufficiently small absolute constant and such that with probability at least ,
the GD iterates (3) converge linearly to after , namely,
the magnitude ratio of the signal component to the perpendicular component obeys
for some constant .
Several remarks regarding Theorem 2 are in order.
Our current sample complexity reads , which is optimal up to logarithmic factors. It is possible to further reduce the logarithmic factors using more refined probabilistic tools, which we leave for future work.
The remainder of this section is then devoted to proving Theorem 2. Without loss of generality555This is because of the rotational invariance of Gaussian distributions., we will assume throughout that
Given this, one can decompose
where and as introduced in Section 2. For notational simplicity, we define
Intuitively, represents the size of the signal component, whereas measures the size of the component perpendicular to the signal direction. In view of (16), we have .
4.1 Outline of the proof
To begin with, it is easily seen that if and (cf. (18)) obey and , then
Therefore, our first step — which is concerned with proving — comes down to the following two steps.
Show that if and satisfy the approximate state evolution (see (13)), then there exists some such that
which would immediately imply that
Along the way, we will also show that the ratio grows exponentially fast.
Justify that and satisfy the approximate state evolution with high probability, using (some variants of) leave-one-out arguments.
After , we can invoke prior theory concerning local convergence to show that with high probability,
for some constant independent of and .
4.2 Dynamics of approximate state evolution
This subsection formalizes our intuition in Section 2: as long as the approximate state evolution holds, then one can find obeying condition (19). In particular, the approximate state evolution is given by
where and represent the perturbation terms. Our result is this:
Let be some sufficiently small constant, and consider the approximate state evolution (20). Suppose the initial point obeys
and the perturbation terms satisfy
for some sufficiently small constant .
Then for any sufficiently large and and any sufficiently small constant , one has
and there exist some constants independent of and such that
(b) If we define
for some arbitrarily small constants , then
; ; ;
For , one has .
See Appendix B.∎
Recall that is sufficiently small and represents the global minimizer. Since , one has , which denotes the first time when the iterates enter the local region surrounding the global minimizer. In addition, the fact that gives and , both of which indicate the first time when the signal strength is sufficiently large.
Lemma 1 makes precise that under the approximate state evolution, the first stage enjoys a fairly short duration . Moreover, the size of the signal component grows faster than that of the perpendicular component for any iteration , thus confirming the exponential growth of .
In addition, Lemma 1 identifies two midpoints and when the sizes of the signal component become sufficiently large. These are helpful in our subsequent analysis. In what follows, we will divide Stage 1 (which consists of all iterations up to ) into two phases:
Phase I: consider the duration ;
Phase II: consider all iterations with .
We will justify the approximate state evolution (20) for these two phases separately.
4.3 Motivation of the leave-one-out approach
As we have alluded in Section 2.3, the main difficulty in establishing the approximate state evolution (20) lies in controlling the perturbation terms to the desired orders (i.e. in Lemma 1). To achieve this, we advocate the use of (some variants of) leave-one-out sequences to help establish certain “near-independence” between and certain components of .
We begin by taking a closer look at the perturbation terms. Regarding the signal component, it is easily seen from (11) that
where the perturbation term obeys
Here and throughout the paper, for any vector , denotes the 2nd through the th entries of . Due to the dependency between and , it is challenging to obtain sharp control of some of these terms.
In what follows, we use the term to explain and motivate our leave-one-out approach. As discussed in Section 2.3, needs to be controlled to the level . This precludes us from seeking a uniform bound on the function over all (or even all within the set incoherent with ), since the uniform bound can be times larger than the desired order.
In order to control to the desirable order, one strategy is to approximate it by a sum of independent variables and then invoke the CLT. Specifically, we first rewrite as
with . Here denotes the usual sign function. To exploit the statistical independence between and , we would like to identify some vector independent of that well approximates . If this can be done, then one may treat as a weighted independent sum of . Viewed in this light, our plan is the following:
Construct a sequence independent of obeying , so that
One can then apply standard concentration results (e.g. the Bernstein inequality) to control , as long as none of the weight is exceedingly large.
Demonstrate that the weight is well-controlled, or equivalently, () is not much larger than its typical size. This can be accomplished by identifying another sequence independent of