Algorithmic Analysis and Statistical Estimation of SLOPE via Approximate Message Passing
SLOPE is a relatively new convex optimization procedure for high-dimensional linear regression via the sorted penalty: the larger the rank of the fitted coefficient, the larger the penalty. This non-separable penalty renders many existing techniques invalid or inconclusive in analyzing the SLOPE solution. In this paper, we develop an asymptotically exact characterization of the SLOPE solution under Gaussian random designs through solving the SLOPE problem using approximate message passing (AMP). This algorithmic approach allows us to approximate the SLOPE solution via the much more amenable AMP iterates. Explicitly, we characterize the asymptotic dynamics of the AMP iterates relying on a recently developed state evolution analysis for non-separable penalties, thereby overcoming the difficulty caused by the sorted penalty. Moreover, we prove that the AMP iterates converge to the SLOPE solution in an asymptotic sense, and numerical simulations show that the convergence is surprisingly fast. Our proof rests on a novel technique that specifically leverages the SLOPE problem. In contrast to prior literature, our work not only yields an asymptotically sharp analysis but also offers an algorithmic, flexible, and constructive approach to understanding the SLOPE problem.
Consider observing linear measurements that are modeled by the equation
where is a known measurement matrix, is an unknown signal, and is the measurement noise. Among numerous methods that seek to recover the signal from the observed data, especially in the setting where is sparse and is larger than , SLOPE has recently emerged as a useful procedure that allows for estimation and model selection . This method reconstructs the signal by solving the minimization problem
where denotes the norm, (with at least one strict inequality) is a sequence of thresholds, and are the order statistics of the fitted coefficients in absolute value. The regularizer is a sorted -norm (denoted as henceforth), which is non-separable due to the sorting operation involved in its calculation. Notably, SLOPE has two attractive features that are not simultaneously present in other methods for linear regression including the LASSO  and knockoffs . Explicitly, on the estimation side, SLOPE achieves minimax estimation properties under certain random designs without requiring any knowledge of the sparsity degree of [37, 7]. On the testing side, SLOPE controls the false discovery rate in the case of independent predictors [9, 11]. For completeness, we remark that [10, 39, 19] proposed similar non-separable regularizers to encourage grouping of correlated predictors.
This work is concerned with the algorithmic aspects of SLOPE through the lens of approximate message passing (AMP) [4, 16, 23, 31]. AMP is a class of computationally efficient and easy-to-implement algorithms for a broad range of statistical estimation problems, including compressed sensing and the LASSO . When applied to SLOPE, AMP takes the following form: at initial iteration , assign , and for
The non-increasing sequence is proportional to and will be given explicitly in Section 2. Here, is the proximal operator of the sorted norm, that is,
and denotes the divergence of the proximal operator (see an equivalent, but more explicit form, of this algorithm in Section 2 and further discussion of SLOPE and the prox operator in Section 5.1). Compared to the proximal gradient descent (ISTA) [13, 14, 29], AMP has an extra correction term in its residual step that adjusts the iteration in a non-trivial way and seeks to provide improved convergence performance .
The empirical performance of AMP in solving SLOPE under i.i.d. Gaussian matrix is illustrated in Figure 1 and Table 1, which suggest the superiority of AMP over ISTA and FISTA —perhaps the two most popular proximal gradient descent methods—in terms of speed of convergence in this setting. However, the vast AMP literature thus far remains silent on whether AMP provably solves SLOPE and, if so, whether one can leverage AMP to get insights into the statistical properties of SLOPE. This vacuum in the literature is due to the non-separability of the SLOPE regularizer, making it a major challenge to apply AMP to SLOPE directly. In stark contrast, AMP theory has been rigorously applied to the LASSO , showing both good empirical performance and nice theoretical properties of solving the LASSO using AMP. Moreover, AMP in this setting allows for asymptotically exact statistical characterization of its output, which converges to the LASSO solution, thereby providing a powerful tool in fine-grained analyses of the LASSO [3, 36, 28, 35].
Main contributions. In this work, we prove that the AMP algorithm (1.3) solves the SLOPE problem in an asymptotically exact sense under independent Gaussian random designs. Our proof uses the recently extended AMP theory for non-separable denoisers  and applies this tool to derive the state evolution that describes the asymptotically exact behaviors of the AMP iterates in (1.3). The next step, which is the core of our proof, is to relate the AMP estimates to the SLOPE solution. This presents several challenges that cannot be resolved only within the AMP framework. In particular, unlike the LASSO, the number of nonzeros in the SLOPE solution can exceed the number of observations. This fact imposes substantially more difficulties on showing that the distance between the SLOPE solution and the AMP iterates goes to zero than in the LASSO case due to the possible non-strong convexity of the SLOPE problem, even restricted to the solution support. To overcome these challenges, we develop novel techniques that are tailored to the characteristics of the SLOPE solution. For example, our proof relies on the crucial property of SLOPE that the unique nonzero components of its solution never outnumber the observation units.
As a byproduct, our analysis gives rise to an exact asymptotic characterization of the SLOPE solution under independent Gaussian random designs through leveraging the statistical aspect of the AMP theory. In more detail, the probability distribution of the SLOPE solution is completely specified by a few parameters that are the solution to a certain fixed-point equation in an asymptotic sense. This provides a powerful tool for fine-grained statistical analysis of SLOPE as it was for the LASSO problem. We note that a recent paper —which takes an entirely different path—gives an asymptotic characterization of the SLOPE solution that matches our asymptotic analysis deduced from our AMP theory for SLOPE. However, our AMP-based approach is more algorithmic in nature and offers a more concrete connection between the finite-sample behaviors of the SLOPE problem and its asymptotic distribution via the computationally efficient AMP algorithm.
Paper outline. In Section 2 we develop an AMP algorithm for finding the SLOPE estimator in (1.2). Specifically, it is through the threshold values in the AMP algorithm in (1.3) that one can ensure the AMP estimates converge to the SLOPE estimator with parameter , so in Section 2 we provide details for how one should calibrate the thresholds of the AMP iterations in (1.3) in order for the algorithm to solve SLOPE cost in (1.2). Then in Section 3, we state theoretical guarantees showing that the AMP algorithm solves the SLOPE optimization asymptotically and we leverage theoretical guarantees for the AMP algorithm to exactly characterize the mean square error (more generally, any pseudo-Lipschitz error) of the SLOPE estimator in the large system limit. This is done by applying recent theoretical results for AMP algorithms that use a non-separable non-linearity , like the one in (1.3). Finally, Sections 4-7 prove rigorously the theoretical results stated in Section 3 and we end with a discussion in Section 8.
2 Algorithmic Development
To begin with, we state assumptions under which our theoretical results will hold and give some preliminary ideas about SLOPE that will be useful in the development of the AMP algorithm.
The measurement matrix has independent and identically-distributed (i.i.d.) Gaussian entries that have mean and variance .
The signal has elements that are i.i.d. , with .
The noise is elementwise i.i.d. , with .
The vector is elementwise i.i.d. , with and .
The ratio approaches a constant in the large system limit, as .
Remark: (A4) can be relaxed as having an empirical distribution that converges weakly to probability measure on with and and . A similar relaxation can be made for the distributional assumptions (A2) and (A3).
SLOPE preliminaries. For a vector , the divergence of the proximal operator, , is given by the following:
where [37, proof of Fact 3.4],
Hence the divergence takes the simplified form
where counts the unique non-zero magnitudes in a vector, e.g. . This explicit form of divergence not only waives the need to use approximation in calculation but also speed up the recursion, since it only depends on the proximal operator as a whole instead of on . Therefore, we have
In AMP, (1.3b) is equivalent to
2.1 AMP Background
An attractive feature of AMP is that its statistical properties can be exactly characterized at each iteration , at least asymptotically, via a one-dimensional recursion known as state evolution [4, 8, 35, 21]. Specifically, it can be shown that the pseudo-data, meaning the input for the estimate of the unknown signal in (1.3a), is asymptotically equal in distribution to the true signal plus independent, Gaussian noise, i.e. , where the noise variance is defined by the state evolution. For this reason, the function used to update the estimate in (1.3a), in our case, the proximal operator, , is usually referred to as a ‘denoiser’ in the AMP literature.
This statistical characterization of the pseudo-data was first rigorously shown to be true in the case of ‘separable’ denoisers by Bayati and Montanari , and an analysis of the rate of this convergence was given in . A ‘separable’ denoiser is one that applies the same (possibly non-linear) function to each element of its input. Recent work  proves that the pseudo-data has distribution asymptotically, even when the ‘denoisers’ used in the AMP algorithm are non-separable, like the SLOPE prox operator in (1.3a).
As mentioned previously, the dynamics of the AMP iterations are tracked by a recursive sequence referred to as the state evolution, defined as follows. For elementwise i.i.d. independent of , let and for ,
We note that throughout, we let denote the Gaussian density with mean and variance and we use to indicate a identity matrix.
2.2 Analysis of the AMP State Evolution
As the state evolution (2.4) predicts the performance of the AMP algorithm (1.3) (the pseudo-data, , is asymptotically equal in distribution ), it is of interest to study the large asymptotics of (2.4). Moreover, recall that through the sequence of thresholds , one can relate the AMP algorithm to the SLOPE estimator in (1.2) for a specific , and the explicit form of this calibration, given in Section 2.3, is motivated by such asymptotic analysis of the state evolution.
It turns out that a finite-size approximation, which we denote , will be easier to analyze than (2.4). The definition of is stated explicitly in (2.5) below. Throughout the work, we will define thresholds for every iteration where the vector is fixed via a calibration made explicit in Section 2.3. We can interpret this to mean that within the AMP algorithm, plays the role of the regularizer parameter, . Now we define , for large , as a finite-sample approximation to (2.4), namely
where the difference between (2.5) and the state evolution (2.4) is via the large system limit in . When we refer to the recursion in (2.5), we will always specify the dependence explicitly as An analysis of the limiting properties (in ) of (2.5) is given in Theorem 1 below, after which it is then argued that because interchanging limits and differentiation is justified, the large analysis of (2.5) holds for (2.4) as well. Before presenting Theorem 1, however, we give the following result which motivates why the AMP iteration should relate at all to the SLOPE estimator.
Proof of Lemma 2.2.
Now we present Theorem 1, which provides results about the asymptotics of the recursion in (2.5) and its proof is given in Appendix A. First, some notation must be introduced: let be the set of solutions to
Here represents elementwise multiplication of vectors and for vector , is defined elementwise as if and otherwise. Let . The expectation in (2.7) is taken with respect to a -length vector of i.i.d. standard Gaussians. Finally, for the notation and we say is strictly larger than if for all elements . For the simple case of , we illustrate an example of the set in Figure 2.
For any strictly larger than at least one element in the set , the recursion in (2.5) has a unique fixed point that we denote as . Then monotonically for any initial condition. Define a function as
where is elementwise i.i.d. independent of , so that . Then at . Moreover, for defined in (2.7), we show that .
Beyond providing the large asymptotics of the state evolution sequence, notice that Theorem 1 gives necessary conditions on the calibration vector under which the recursion in (2.5), and equivalently, the calibration detailed in Section 2.3 below are well-defined.
Recall that it is actually the state evolution in (2.4) (and not that in (2.5)) that predicts the performance of the AMP algorithm, and therefore we would really like a version of Theorem 1 studying the large system limit in . We argue that because interchanging differentiation and the limit, the proof of Theorem 1 analyzing (2.5), can easily be used to give an analogous result for (2.4). In particular analyzing (2.4) via the strategy given in the proof of Theorem 1 requires that we study the partial derivative of with respect to . Indeed, to directly make use our proof for the finite- case given in Theorem 1, it is enough that
Suppose is a sequence of functions that converge pointwise to on a compact domain and whose derivatives converge uniformly to a function on . Then on .
Therefore, taking , it suffices to show that if
then the sequence converges uniformly as . The main tool for proving such a result is given in the following lemma.
Suppose is a sequence of -Lipschitz functions (where is independent of ) that converge pointwise to a function on a compact domain . Then, the convergence is also uniform on .
Using this lemma, the essential idea is to show that there exists a constant , independent of , such that for all and all , in a bounded set ,
This follows by the mean value theorem and (A.14), with .
2.3 Threshold Calibration
Motivated by Lemma 2.2 and the result of Theorem 1, we define a calibration from the regularization parameter , to the corresponding threshold used to define the AMP algorithm. In practice, we will be given finite-length and then we want to design the AMP iteration to solve the corresponding SLOPE cost. We do this by choosing as the vector that solves where
where is elementwise i.i.d. independent of and is the limiting value defined in Theorem 1. We note the fact that the calibration in (2.10) sets as a vector in the same direction as , but that is scaled by a constant value (for each ), where the scaling constant value is
and in Algorithm 1 we show that determining the calibration is straightforward in practice.
The function defined in (2.10) is continuous on for defined in (2.7) with and (where the limit is taken elementwise). Therefore the function satisfying (2.10) exists. As , the function becomes invertible (given , satisfying (2.10) exists uniquely). Furthermore, the inverse function is continuous non-decreasing for any .
In [5, Proposition 1.4 (first introduced in ) and Corollary 1.7] this is proven rigorously for the analogous LASSO calibration and in Appendix A we show how to adapt this proof to SLOPE case. This proposition motivates Algorithm 1 which uses a bisection method to find the unique for each . It suffices to find two guesses of parallel to that, when mapped via (2.10), sandwich the true .
The calibration in (2.10) is exact when , so we study the mapping between and in this limit. Recall from (A4), that the sequence of vectors are drawn i.i.d. from distribution . It follows that the sequence defined for each by the finite-sample calibration (2.10) are i.i.d. from a distribution , where satisfies , and is defined via
We note, moreover, that the calibrations presented in this section are well-defined:
This fact is proven in Appendix C. One idea used in the proof of Fact 2.7 is that the prox operator is asymptotically separable, a result shown by [20, Proposition 1]. Specifically, for sequences of input, , and thresholds, , having empirical distributions that weakly converge to distributions and , respectively, then there exists a limiting scalar function (determined by and ) of the proximal operator . Further details are given in Lemma 3.3 in Section 3. Using , this argument implies that (2.4) can be represented as
and if we denote as the Lebesgue measure, then the limit in (2.11) can be represented as
In other words, the limit in (2.11) is the Lebesgue measure of the domain of the quantile function of for which the quantile of assumes unique values (i.e., is not flat).
3 Asymptotic Characterization of SLOPE
3.1 AMP Recovers the SLOPE Estimate
Here we show that the AMP algorithm converges in to the SLOPE estimator, implying that the AMP iterates can be used as a surrogate for the global optimum of the SLOPE cost function. The schema of the proof is similar to [5, Lemma 3.1], however, major differences lie in the fact that the proximal operator used in the AMP updates (1.3a)-(1.3b) is non-separable. We sketch the proof here, and a forthcoming article will be devoted to giving a complete and detailed argument.
The proof of Theorem 2 can be found in Section 4. At a high level, the proof requires dealing carefully with the fact that the SLOPE cost function, given in (1.2) is not necessarily strongly convex, meaning that we could encounter the undesirable situation where is close to but is not close to , meaning the statistical recovery of would be poor.
In the LASSO case, one works around this challenge by showing that the (LASSO) cost function does have nice properties when considering just the elements of the non-zero support of at any (large) iteration . In the LASSO case, the non-zero support of has size no larger than .
In the SLOPE problem, however, it is possible that the support set has size exceeding , and therefore the LASSO analysis is not immediately applicable. Our proof develops novel techniques that are tailored to the characteristics of the SLOPE solution. Specifically, when considering the SLOPE problem, one can show nice properties (similar to those in the LASSO case) by considering a support-like set, that being the unique non-zeros in the estimate at any (large) iteration . In other words, if we define an equivalence relation when , then entries of AMP estimate at any iteration are partitioned into equivalence classes. Then we observe from (2.10), and the non-negativity of , that the number of equivalence classes is no larger than . We see an analogy between SLOPE’s equivalence class (or ‘maximal atom’ as described in Appendix 5.1) and LASSO’s support set. This approach allows us to deal with the lack of a strongly convex cost.
3.2 Exact Asymptotic Characterization of the SLOPE Estimate
A consequence of Theorem 4.1, is that the SLOPE estimator inherits performance guarantees provided by the AMP state evolution, in the sense of Theorem 3 below. Theorem 3 provides as asymptotic characterization of pseudo-Lipschitz loss between and the truth .
Uniformly pseudo-Lipschitz functions : For , a function is pseudo-Lipschitz of order if there exists a constant , such that for ,
A sequence (in ) of pseudo-Lipschitz functions is uniformly pseudo-Lipschitz of order if, denoting by the pseudo-Lipschitz constant of , for each and .
Under assumptions (A1) - (A5), for any uniformly pseudo-Lipschitz sequence of functions and for ,
where is defined in (2.4) and the expectation is taken with respect to
Theorem 3 tells us that under uniformly pseudo-Lipschitz loss, in the large system limit, distributionally the SLOPE optimizer acts as a ‘denoised’ version of the truth corrupted by additive Gaussian noise where the denoising function is given by the proximal operator, i.e. within uniformly pseudo-Lipschitz loss can be replaced with for large .
The proof of Theorem 3 can be found in Section 4. We show that Theorem 3 follows from Theorem 2 and recent AMP theory dealing with the state evolution analysis in the case of non-separable denoisers , which can be used to demonstrate that the state evolution given in (2.4) characterizes the performance of the SLOPE AMP (1.3) via pseudo-Lipschitz loss functions.
For any function that is asymptotically separable, in the sense that there exists some function , such that
where is Lebesgue integrable then where i.i.d. .
Now to show the result [20, Theorem 1], consider a special case of Theorem 3 where for function that is pseudo-Lipschitz of order . It is easy to show that is uniformly pseudo-Lipschitz of order . The result of Theorem 3 then says that
Then [20, Theorem 1] follows by [20, Proposition 1], restated below in Lemma 3.3, the Law of Large Numbers, and Theorem 1. Now we restate in Lemma 3.3, the result given in [20, Proposition 1], which says that becomes asymptotically separable as , for convenience.
Lemma 3.3 (Proposition 1, ).
For an input sequence , and a sequence of thresholds , both having empirical distributions that weakly converge to distributions and , respectively, then there exists a limiting scalar function (determined by and ) such that as
where applies coordinate-wise to (hence it is separable) and is Lipschitz(1).
Then [20, Theorem 1] follows from Theorem 3 by using the asymptotic separability of the prox operator. Namely, the result of Lemma 3.3 (using that has an empirical distribution that converges weakly to for defined by (2.11)), along with Cauchy-Schwarz and the fact that is pseudo-Lipschitz, allow us to apply a dominated convergence argument (see Lemma B.2), from which it follows for some limiting scalar function as specified by Lemma 3.3,
Then the above allows us to apply Lemma 3.2 and the Law of Large Numbers to show
Finally we note that the result of [20, Theorem 1] follows since
We highlight that our Theorem 3 allows the consideration of a non-asymptotic case in . While Theorem 1 motivates an algorithmic way to find a value which approximates well, Theorem 3 guarantees the accuracy of such approximation for use in practice. One particular use of Theorem 3 is to design the optimal sequence that achieves the minimum and equivalently minimum error , though a concrete algorithm for doing so is still under investigation.
Under assumptions ,
Applying Theorem 3 to the pseudo-Lipschitz loss function , defined as , we find The desired result follows since . To see this, note that and
for elementwise i.i.d. independent of . A rigorous argument for the above requires showing that the assumptions of Lemma 3.2 are satisfied and follows similarly to that used to prove property (P2) stated in Section 4 and proved in Appendix B. ∎
4 Proof for Asymptotic Characterization of the SLOPE Estimate
In this section we prove Theorem 3. To do this, we use a result guaranteeing that the state evolution given in (2.4) characterizes the performance of the SLOPE AMP algorithm (1.3b), given in Lemma 4.1 below. Specifically, Lemma 4.1 relates the state evolution (2.4) to the output of the AMP iteration (1.3b) for pseudo-Lipschitz loss functions. This result follows from [8, Theorem 14], which is a general result relating state evolutions to AMP algorithm with non-separable denoisers. In order to apply [8, Theorem 14], we need to demonstrate that our denoiser, i.e. the proximal operator defined in (1.4), satisfies two additional properties labeled (P1) and (P2) below.
Define a sequence of denoisers where to be those that apply the proximal operator defined in (1.4), i.e. for a vector , define
For any with a pair of length- vectors, where for , the pair i.i.d. with any covariance matrix, the following limits exist and are finite.
We will show that properties (P1) and (P2) are satisfied for our problem in Appendix B.
Proof of Theorem 3.
First, for any fixed and , the following bound uses that is uniformly pseudo-Lipschitz of order and the Triangle Inequality,
Now we take limits on either side of the above, first with respect to and then with respect to . We note that the term vanishes by Theorem 2. Then as long as