The estimation error of general first order methods

# The estimation error of general first order methods

## Abstract

Modern large-scale statistical models require to estimate thousands to millions of parameters. This is often accomplished by iterative algorithms such as gradient descent, projected gradient descent or their accelerated versions. What are the fundamental limits to these approaches? This question is well understood from an optimization viewpoint when the underlying objective is convex. Work in this area characterizes the gap to global optimality as a function of the number of iterations. However, these results have only indirect implications in terms of the gap to statistical optimality.

Here we consider two families of high-dimensional estimation problems: high-dimensional regression and low-rank matrix estimation, and introduce a class of ‘general first order methods’ that aim at efficiently estimating the underlying parameters. This class of algorithms is broad enough to include classical first order optimization (for convex and non-convex objectives), but also other types of algorithms. Under a random design assumption, we derive lower bounds on the estimation error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge. These lower bounds are optimal in the sense that there exist algorithms whose estimation error matches the lower bounds up to asymptotically negligible terms. We illustrate our general results through applications to sparse phase retrieval and sparse principal component analysis.

\AtAppendix\counterwithin

lemmasection

## 1 Introduction

High-dimensional statistical estimation problems are often addressed by constructing a suitable data-dependent cost function , which encodes the statistician’s knowledge of the problem. This cost is then minimized using an algorithm which scales well to large dimension. The most popular algorithms for high-dimensional statistical applications are first order methods, i.e., algorithms that query the cost by computing its gradient (or a subgradient) at a sequence of points ,…. Examples include (projected) gradient descent, mirror descent, and accelerated gradient descent.

This raises a fundamental question: What is the minimal statistical error achieved by first order methods? In particular, we would like to understand in which cases these methods are significantly sub-optimal (in terms of estimation) with respect to statistically optimal but potentially intractable estimators, and what is the optimal tradeoff between number of iterations and estimation error.

These questions are relatively well understood only from the point of view of convex optimization, namely if estimation is performed by minimizing a convex cost function , see e.g. [CT07, BRT09]. The seminal work of Nemirovsy and Yudin [NY83] characterizes the minimum gap to global optimality , where is the algorithm’s output after iterations (i.e., after gradient evaluations). For instance, if is a smooth convex function, there exists a first order algorithm which achieves . At the same time, no algorithm can be guaranteed to achieve a better convergence rate over all functions in this class.

In contrast, if the cost is nonconvex, there cannot be general guarantees of global optimality. Substantial effort has been devoted to showing that –under suitable assumptions about the data distribution– certain nonconvex costs can be minimized efficiently, e.g. by gradient descent [KMO10, LW11, CC15]. This line of work resulted in upper bounds on the estimation error of first order methods. Unlike in the convex case, worst case lower bounds are typically overly pessimistic since non-convex optimization is NP-hard. Our work aims at developing precise average-case lower bounds for a restricted class of algorithms, which are applicable both to convex and nonconvex problems.

We are particularly interested in problems that exhibit an information-computation gap: we know that the optimal statistical estimator has high accuracy, but existing upper bounds on first order methods are substantially sub-optimal (see examples below). Is this a limitation of our analysis, of the specific algorithm under consideration, or of first order algorithms in general? The main result of this paper is a tight asymptotic characterization of the minimum estimation error achieved by first order algorithms for two families of problems. This characterization can be used –in particular– to delineate information-computation gaps.

Our results are novel even in the case of a convex cost function , for two reasons. First, classical theory [Nes18] lower bounds the objective value after iterations. This has only indirect implications on estimation error, e.g., (here is the true value of the parameters, not the minimizer of the cost ). Second, the classical lower bounds on the objective value are worst case with respect to the function and do not take into account the data distribution.

Concretely, we consider two families of estimation problems:

High-dimensional regression.

Data are i.i.d. pairs , where is a label and is a feature vector. We assume and for a vector . Our objective is to estimate the coefficients from data (the matrix whose -th row is vector ) and (the vector whose -th entry is label ).

Low-rank matrix estimation.

Data consist of a matrix where with and . We denote by and the matrices whose rows are and respectively. Our objective is to to estimate from data .

In order to discuss these two examples in a unified fashion, we will introduce a dummy vector (e.g., the all-zeros vector) as part of the data in the low-rank matrix estimation problem. Let us point out that our normalizations are somewhat different from, but completely equivalent to, the traditional ones in statistics.

The first question to address is how to properly define ‘first order methods.’ A moment of thought reveals that the above discussion in terms of a cost function needs to be revised. Indeed, given either of the above statistical models, there is no simple way to construct a ‘statistically optimal’ cost function.3 Further, it is not clear that using a faster optimization algorithm for that cost will result in faster decrease of the estimation error.

We follow instead a different strategy and introduce the class of general first order methods (GFOM). In words, these include all algorithms that keep as state sequences of matrices , and , which are updated by two types of operations: row-wise application of a function, or multiplication by or . We will then show that standard first order methods, for common choices of the cost , are in fact special examples of GFOMs.

Formally, a GFOM is defined by sequences of functions , , with the ’s indexed by and the ’s indexed by . In the high-dimensional regression problem, we set . The algorithm produces two sequences of matrices (vectors for ) , , and , ,

 vt+1 =X⊤F(1)t(u1,…,ut;y,u)+F(2)t(v1,…,vt;v) (1a) ut =XG(1)t(v1,…,vt;v)+G(2)t(u1,…,ut−1;y,u), (1b)

where it is understood that each function is applied row-wise. For instance

 F(1)t(u1,…,ut;u)=(F(1)t(u1i,…,uti;ui))i≤n∈Rn×r,

where is the row of . Here are either deterministic or random and independent of everything else. In particular, the iteration is initialized with . The unknown matrices (or vectors) and are estimated after iterations by and , where the latter only applies in the low-rank matrix estimation problem. Let us point out that the update also depend on additional information encoded in the two vectors , . This enables us to model side information provided to the statistician (e.g., an ‘initialization’ correlated with the true signal) or auxiliary randomness.

We study the regime in which with and is fixed. We assume the number of iterations is fixed, or potentially after . In other words, we are interested in linear-time or nearly linear-time algorithms (complexity being measured relative to the input size ). As mentioned above, our main result is a general lower bound on the minimum estimation error that is achieved by any GFOM in this regime.

The paper is organized as follows: Section 2 illustrates the setting introduced above in two examples; Section 3 contains the statement of our general lower bounds; Section 4 applies these lower bounds to the two examples; Section 5 presents an outline of the proof, deferring technical details to appendices.

## 2 Two examples

### Example #1: M-estimation in high-dimensional regression and phase retrieval

Consider the high-dimensional regression problem. Regularized M-estimators minimize a cost

 Ln(ϑ):=n∑i=1ℓ(yi;⟨xi,ϑ⟩)+Ωn(ϑ)=^ℓn(y,Xϑ)+Ωn(ϑ), (2)

Here is a loss function, is its empirical average, and is a regularizer. It is often the case that is smooth and is separable, i.e., . We will assume this to be the case in our discussion.

The prototypical first order method is proximal gradient [PB13]:

 θt+1=ProxγtΩ1(θt−γt∇ϑ^ℓn(y,Xθt)), ProxγΩ1(y):=argminθ∈R{12(y−θ)2+γΩ1(θ)}.

Here is a sequence of step sizes and acts on a vector coordinate-wise. Notice that

 ∇ϑ^ℓn(y,Xθt)=XTs(y,Xθt),s(h,^y)i≡∂ℓ∂^yi(y,^yi). (3)

Therefore proximal gradient –for the cost function (2)– is an example of a GFOM. Similarly, mirror descent with a separable Bregman divergence and accelerated proximal gradient methods are easily shown to fit in the same framework.

Among the countless applications of regularized M-estimation, we will focus on the sparse phase retrieval problem. We want to reconstruct a sparse signal but only have noisy measurements of the modulus ; that is, we lose the ‘phase’ of these projections. (We will consider for simplicity the case of a real-valued signal, but the generalization of our results to the complex case should be immediate.)

As a concrete model, we will assume that number of non-zero entries of is . From an information-theoretic viewpoint, it is known that can be reconstructed accurately as soon as the number of measurements satisfies , with a sufficiently large constant [LV13]. Several groups have investigated practical reconstruction algorithms by exploiting either semidefinite programming relaxations [LV13] or first order methods [SR14, CLS15, CLM16]. A standard approach would be to apply a proximal gradient algorithm to the cost function (2) with . However, all existing global convergence guarantees for these methods require . Is the dependence on due to a fundamental computational barrier or an artifact of the theoretical analysis? Recently [Sol19] presented partial evidence towards the possibility of ‘breaking’ this barrier, by proving that a first order method can accurately reconstruct the signal for , if it is initialized close enough to the true signal .

### Example #2: Sparse PCA

In a simple model for sparse principal component analysis (PCA), we observe a matrix , where has entries , is a sparse vector with non-zero entries, and is a noise matrix with entries . Given data , we would like to reconstruct the signal . From an information-theoretic viewpoint, it is known that accurate reconstruction of is possible if , with a sufficiently large constant [AW08].

A number of polynomial time algorithms have been studied, ranging from simple thresholding algorithms [JL09, DM16] to sophisticated convex relaxations [AW08, MW15]. Among other approaches, one natural idea is to modify the power iteration algorithm of standard PCA by computing

 θt+1=ctXTXη(θt;γt). (4)

Here is a deterministic normalization, and is a thresholding function at level , e.g., soft thresholding . It is immediate to see that this algorithm is a GFOM. More elaborate versions of non-linear power iteration were developed, for example, by [JNRS10, Ma13], and are typically equivalent to suitable GFOMs.

Despite these efforts, no algorithm is known to succeed unless . Is this a fundamental barrier or a limitation of present algorithms or analysis? Evidence towards intractability was provided by [BR13, BBH18] via reduction from the planted clique problem. Our analysis provides new evidence towards the same conclusion.

## 3 Main results

In this section we state formally our general results about high-dimensional regression and low-rank matrix estimation. The next section will apply these general results to concrete instances. Throughout we make the following assumptions:

• The functions , , are Lipschitz continuous, with the ’s indexed by and the ’s indexed by .

• The covariates matrix (for high-dimensional regression) or the noise matrix (for low-rank estimation) have entries , .

Also, we denote by the set of probability distributions with finite -th moment on and those with compact support. We say a function is pseudo-Lipschitz of order 2 if there exists constant such that for all . We call a function a quadratically-bounded loss if it is non-negative and pseudo-Lipschitz of order 2 and there exists such that for all we have .

### 3.1 High-dimensional regression

We make the following additional assumptions for the regression problem:

• We sample , for .

• There exists a measurable function such that . Moreover, there exists constant such that for all .

Notice that the description in terms of a probability kernel is equivalent to the one in terms of a ‘noisy’ function in most cases of interest.

Our lower bound is defined in terms of a one-dimensional recursion. Let . Let be the minimum mean square error for estimation of given observations and where independent of . Set and , and define recursively

 ~τ2s=1δ% mmseΘ,V(τ2s),σ2s=1δ(τ2Θ−mmseΘ,V(τ2s)),1τ2s+1=1~τ2sE[E[G1|Y,G0,U]2], (5)

where and the expectation is with respect to and independent.

###### Theorem 1.

Under assumptions A1, A2, R1, R2 in the high-dimensional regression model and under the asymptotics , , let be output of any GFOM after iterations ( matrix-vector multiplications). Then

 limn→∞1p∥^θt−θ∥22≥mmseΘ,V(τ2t).

More generally, for any quadratically-bounded loss ,

 limn→∞1pp∑j=1ℓ(θj,^θtj)≥inf^θ(⋅)E{ℓ(Θ,^θ(Θ+τtG,V))}, (6)

where independent of , and the infimum on the right-hand side is over measurable functions . The limits are in probability and to a constant, and they are guaranteed to exist. For all , there exist GFOMs which satisfy these bounds to within tolerance .

### 3.2 Low-rank matrix estimation

We make the following additional assumption:

• We sample and for .

Again, our lower bound is defined in terms of recursion, which this time is defined over positive semidefinite matrices , . Set , and define recursively

 Qt+1=VΛ,U(^Qt),^Qt=1δVΘ,V(Qt), (7)

where we define the second moment of the conditional expectation by

 VΘ,V(Q):=E{E[Θ|Q1/2Θ+G=Y;V]E[Θ|Q1/2Θ+G=Y;V]T},

and analogously for . Here the expectation is with respect to and an independent Gaussian vector . Notice in particular that is the vector minimum mean square error when is observed in Gaussian noise with covariance . For , Eq. (7) is a simple scalar recursion.

###### Theorem 2.

Under assumptions A1, A2, M1 in the low-rank matrix estimation model and under the under the asymptotics , let be output of any GFOM after iterations ( matrix-vector multiplications). Then

 limn→∞1p∥^θt−θ∥2F≥E{∥Θ∥2}−TrVΘ,V(Qt).

More generally, for any quadratically-bounded loss ,

 limn→∞1pp∑j=1ℓ(θj,^θtj)≥inf^θ(⋅)E{ℓ(Θ,^θ(Q1/2tΘ+G,V))}, (8)

where the infimum on the right-hand side is over functions . The limits are in probability and to a constant, and they are guaranteed to exist. As above, for all there exist GFOMs which satisfy these bounds to within tolerance .

### 3.3 Discussion

Our motivations are similar to the ones for statistical query (SQ) lower bounds [FGR17, FGV17]: we want to provide estimation lower bounds under a restricted computational model, that are sensitive to the data distribution. However the scope of our approach is significantly different from SQ algorithms: the latter can query data distributions and compute approximate expectations with respect to that distribution. In contrast, our algorithms work with a fixed sample (the data matrix and responses ), which is queried multiple times. These queries can be thought as weighted averages of both rows and columns of and, as such, cannot be simulated by the SQ oracle. For instance, the proximal gradient method or the nonlinear power iteration of Section 2 cannot be framed as a SQ algorithms.

The lower bounds of Theorems 1 and 2 are satisfied with equality by a specific first order method that is an approximate message passing (AMP) algorithm, with Bayes updates. This can be regarded as a version of belief propagation (BP) for densely connected graphs [KF09], or an iterative implementation of the TAP equations from spin glass theory [MPV87].

Our proof builds on the asymptotically exact analysis of AMP algorithms developed in [Bol14, BM11, JM18, BMN19]. However we need to overcome three technical obstacles:  Show that any GFOM can be reduced (in a suitable sense) to a certain AMP algorithms, whose behavior can be exactly tracked.  Show that Bayes-AMP is optimal among all AMP algorithms. We achieve this goal by considering an estimation problem on trees and showing that, in a suitable large degree limit, it has the same asymptotic behavior as AMP on the complete graph. On trees it is immediate to see that BP is the optimal local algorithm.  We need to prove that the asymptotic behavior of BP for trees of large degree is equivalent to the one of Bayes-AMP on the original problem. This amounts to proving a Gaussian approximation theorem for BP. While similar results were obtained in the past for discrete models [Sly09, MX16], the current setting is technically more challenging because the underlying variables are continuous and unbounded.

While the line of argument above is –in hindsight– very natural, the conclusion is broadly useful. For instance, [AFUZ19] study a class of of message passing algorithms inspired to replica symmetry breaking and survey propagation [MPZ02], and observe that they do not perform better than Bayes AMP. These algorithms are within the scope of our Theorem 2, which implies that indeed they cannot outperform Bayes AMP, for any constant number of iterations.

Finally, a sequence of recent papers characterize the asymptotics of the Bayes-optimal estimation error in the two models described above [LM19, BKM19]. It was conjectured that, in this context, no polynomial-time algorithm can outperform Bayes AMP, provided these algorithms have access to an arbitrarily small amount of side information.4 Theorems 1 and 2 establish this result within the restricted class of GFOMs.

## 4 Applying the general lower bounds

In our two examples, we will refer to the sets of -sparse vectors and of vectors with -norm bounded by .

### Example #1: Sparse phase retrieval

For the reader’s convenience, we follow the standard normalization in phase retrieval, whereby the ‘sensing vectors’ (i.e. the rows of the design matrix) have norm concentrated around one. In other words, we observe , where .

In order to model the phase retrieval problem, we assume that the conditional density satisfies the symmetry condition . In words: we only observe a noisy version of the absolute value . An important role is played by the following critical value of the number of observations per dimension

 δ\tiny{sp}:=(∫REG[p(y|G)(G2−1)]EG[p(y|G)]dy)−1. (9)

Here expectation is with respect to . It was proved in [MM19] that, if and , for some bounded away from zero, then there exists a simple spectral estimator that achieves weak recovery, i.e., a positive correlation with the true signal. Namely, is bounded away from zero as .

In the case of a dense signal and observation model , the oversampling ratio is known to be information-theoretically optimal: for no estimator can achieve a correlation that is bounded away from [MM19]. On the other hand, if has at most nonzero entries, it is information-theoretically possible to reconstruct it from phaseless measurements per dimension [LV13].

Our next result implies that no GFOM can achieve reconstruction from measurements per dimension, unless it is initialized close enough to the true signal. In order to model the additional information provided by the initialization we assume to be given

 ¯¯¯v=√αθ/∥θ∥2+√1−α~g,(~gi)i≤p\lx@stackreliid∼N(0,1/p),. (10)

Notice that with this normalization concentrates tightly around , and can be interpreted as the cosine of the angle between and .

###### Corollary 1.

Consider the phase retrieval model, for a sequence of deterministic signals , and let . Assume the noise kernel to satisfy the conditions of Theorem 1 and to be be twice differentiable with respect to .

Then, for any , there exists and such that, if , then

 supt≥0limn,p→∞infθ∈\mathscrsfsT(ε,√p)E⟨θ,^θt⟩∥θ∥2∥^θt∥2≤C∗√α. (11)

The same conclusion holds if is drawn randomly with i.i.d. entries , .

### Example #2: Sparse PCA

For ease of interpretation, we assume the observation model , where and . Equivalently, conditional on , the rows of are i.i.d. samples , . We also assume to have access to an initialization correlated with , as per Eq. (10). In order to apply Theorem 2, we choose a specific distribution for the spike. Defining , we assume that the entries of follow a three-points sparse distribution . The next lemma specializes Theorem 2.

###### Lemma 1.

Assume the sparse PCA model with the distribution of given above. Define by

 qt+1 =V±(qt+~α)1+V±(qt+~α),q0=0, (12) V±(q) :=e−δqμ2μ2ε2E{sinh(μ√δqG)21−ε+εe−δqμ2/2cosh(μ√δqG)}, (13)

where . Then, for any GFOM

 limn,p→∞⟨¯¯¯θ,^θt⟩∥¯¯¯θ∥2∥^θt∥2≤√V±(qt+~α)μ2ε. (14)

The bound in the last lemma holds for random vectors with i.i.d. entries from the three-points distribution. As a consequence, it implies a minimax bound for non-random vectors with given -norm and sparsity. We state this bound in the corollary below. In order to develop explicit expressions, we analyze the recursion of Eqs. (12), (13).

###### Corollary 2.

Assume the sparse PCA model, for a deterministic vector and , random, and consider the parameter space .

1. If , then there exists such that, for , and any GFOM

 supt≥0limn,p→∞inf¯¯¯θ∈\mathscrsfsT(ε,R)E⟨¯¯¯θ,^θt⟩∥¯¯¯θ∥2∥^θt∥2≤C∗√α. (15)
2. If , then the above statement holds with , .

In words, the last corollary implies that for , no estimator achieves a non-vanishing correlation with the true signal , unless sufficient side information about is available. Notice that for is the threshold above which the principal eigenvector of the empirical covariance becomes correlated with . Hence, our result implies that, simple PCA fails, then every GFOM will fail.

Viceversa, if simple PCA succeed, then it can be implemented via a GFOM, provided arbitrarily weak side information if available. Indeed, assume side information , with , and an arbitrarily small constant. Then the power method initialized at converges to an estimate that has correlation with bounded away from zero in iterations.

## 5 Proof of main results

In this section, we prove Theorems 1 and 2 under stronger assumptions than in their statements. In the high-dimensional regression model, these assumptions are as follows.

• Given and for some , we sample , .

• There exists Lipschitz function such that . Measure has regular conditional probability distribution such that, for all fixed , the distribution of when has positive and bounded density with respect Lebesgue measure. Further, for exists and is bounded.

In the low-rank matrix estimation model, this assumption is as follows.

• Given , we sample , .

In Appendix E, we show that Theorem 1 (resp. Theorem 2) under assumptions R3 and R4 (resp. M2) implies the theorem under the weaker assumptions R1 and R2 (resp. M1).

### 5.1 Reduction of GFOMs to approximate message passing algorithms

Approximate message passing (AMP) algorithms are a special class of GFOMs that admit an asymptotic characterization called state evolution [BM11]. We show that, in both models we consider, any GFOM is equivalent to an AMP algorithm after a change of variables.

An AMP algorithm is defined by sequences of Lipschitz functions , . It generates sequences , of matrices in and , respectively, according to

 at+1=XTft(b1,…,bt;y,u)−t∑s=1gs(a1,…,as;v)ξTt,s,bt=Xgt(a1,…,at;v)−t−1∑s=0fs(b1,…,bs;y,u)ζTt,s, (16)

with initialization . Here , are deterministic matrices. The we refer to the recursion (16) as to an AMP algorithm if only if the matrices , are determined by the functions , in a specific way, which depends on the model under consideration, and we describe in Appendix B. For this special choice of the matrices , , the iterates are asymptotically Gaussian, with a covariance that can be determined via the state evolution recursion.

The next lemma, proved in Appendix B, makes this precise and describes the state evolution of the resulting AMP algorithm.

###### Lemma 2.

Under assumptions A1, A2, R3, R4 (for high-dimensional regression) or assumptions A1, A2, M2 (for low-rank matrix estimation), there exist Lipschitz functions as above and , such that the following holds. Let be matrices determined by the general AMP prescription (see Appendix B), and define via the AMP algorithm (16). Then we have

 vt=φt(a1,…,at;v),t≥1, ut=ϕt(b1,…,bt;y,u),t≥1.

Further, state evolution determines two collections of of matrices such that for all pseudo-Lipschitz functions of order 2,

 1pp∑j=1ψ(a1j,…,atj,vj,θj)\lx@stackrelp→E[ψ(α1Θ+Z1,…,αtΘ+Zt,V,Θ)], (17)

where independent of . Here is a positive semi-definite block matrix with block given by .5

Lemma 2 implies that the estimator in Theorem 1 and 2 can alternatively be viewed as a Lipschitz function of the AMP iterates and side information , applied row-wise. Thus, can be viewed as a pseudo-Lipschitz function of order 2 applied to ; namely, . Then, Lemma 2 implies that the limits in Theorems 1 and 2 exist and have lower bound

 infRℓ(g∗,(αs),(Ts,s′)):=infE[ℓ(Θ,g∗(α1Θ+Z1,…,αtΘ+Zt,V))], (18)

where the infimum is taken over Lipschitz functions and matrices generated by the state evolution of some AMP algorithm. This lower bound is characterized in the following sections.

### 5.2 Models and message passing on the computation tree

We introduce two statistical models on trees and a collection of algorithms which correspond, in a sense we make precise, to the high-dimensional regression and low-rank matrix estimation models, and AMP algorithms. We derive lower bounds on the estimation error in these models using information-theoretic, rather than algorithmic, techniques. We then transfer these to lower bounds on (18). The models are defined using an infinite connected tree consisting of infinite collections of variable nodes , factor nodes , and edges . Factor nodes have degree and have only variables nodes as neighbors, and variable nodes have degree and have only factor nodes as neighbors. These properties define the tree uniquely up to isomorphism. We denote the set of neighbors of a variable by , and similarly define . We call the computation tree.

The statistical models are joint distributions over random variables associated to the nodes and edges of the computation tree.

High-dimensional regression on the computation tree.

The random variables , , and are generated independently. We assume