Asymptotic Analysis of LASSO’s Solution Path with Implications for Approximate Message Passing
This paper concerns the performance of the LASSO (also known as basis pursuit denoising) for recovering sparse signals from undersampled, randomized, noisy measurements. We consider the recovery of the signal $x_o \in \mathbb{R}^N$ from $n$ random and noisy linear observations $y = A x_o + w$, where $A$ is the measurement matrix and $w$ is the noise. The LASSO estimate is given by the solution to the optimization problem $\hat{x}_\lambda = \arg\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1$. Despite major progress in the theoretical analysis of the LASSO solution, little is known about its behavior as a function of the regularization parameter $\lambda$. In this paper we study two questions in the asymptotic setting (i.e., where $N \to \infty$ and $n \to \infty$, while the ratio $n/N$ converges to a fixed number in $(0,1)$): (i) how does the size of the active set $\|\hat{x}_\lambda\|_0 / N$ behave as a function of $\lambda$, and (ii) how does the mean square error $\|\hat{x}_\lambda - x_o\|_2^2 / N$ behave as a function of $\lambda$? We then employ these results in a new, reliable algorithm for solving LASSO based on approximate message passing (AMP).
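Consider the problem of recovering a vector $x_o \in \mathbb{R}^N$ from a set of undersampled random linear measurements $y = A x_o + w$, where $A \in \mathbb{R}^{n \times N}$ is the measurement matrix and $w \in \mathbb{R}^n$ denotes the noise. One of the most successful recovery algorithms, called basis pursuit denoising or LASSO [Tiblasso96, ChDoSa98], employs the following optimization problem to obtain an estimate of $x_o$:
$$\hat{x}_\lambda = \arg\min_{x} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1. \tag{1.1}$$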
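As a concrete illustration of the optimization problem, the following minimal sketch minimizes $\frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$ with plain proximal gradient descent (ISTA). ISTA is a stand-in solver chosen for brevity, not the method studied in this paper. The function names, problem sizes, noise level, and value of `lam` are arbitrary illustration choices.

```python
import numpy as np

def soft_threshold(v, tau):
    """Component-wise soft thresholding: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(A, y, lam, n_iter=2000):
    """Approximately minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by ISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)             # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * lam)
    return x

def lasso_objective(A, y, x, lam):
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))

# Hypothetical test problem: n = 100 measurements of a 10-sparse vector in R^200.
rng = np.random.default_rng(0)
n, N, k = 100, 200, 10
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))
x_o = np.zeros(N)
x_o[:k] = rng.normal(size=k)
y = A @ x_o + 0.01 * rng.normal(size=n)

x_hat = lasso_ista(A, y, lam=0.05)
```

One property that is easy to check with this sketch: whenever `lam` is at least the largest entry of `|A.T @ y|`, the minimizer is the all-zero vector, so that value bounds the useful range of the regularization parameter.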
A rich literature has provided a detailed analysis of this algorithm [DET, Tropp, ZhYu06, MeYu09, BiRiTs08, MeBu06, GeBu09, BuTsWe07, KnFu2000, ZoHaTib2007, DoTa05, Do05, DoTa08, DoTa09, DoTa09b, MalekiThesis, BaMo10, BaMo11, amelunxen2013living, oymak2012relation]. Most of the work published in this area falls into two categories: (i) non-asymptotic and (ii) asymptotic results. The non-asymptotic results consider $N$ and $n$ to be large but finite numbers and characterize the reconstruction error as a function of $N$ and $n$. These analyses provide qualitative guidelines on how to design compressed sensing (CS) systems. However, they suffer from loose constants and are incapable of providing quantitative guidelines. Therefore, inspired by the seminal work of Donoho and Tanner [DoTa05], researchers have started the asymptotic analysis of LASSO. Such analyses provide sharp quantitative guidelines for designing CS systems.
Despite the major progress in our understanding of LASSO, one aspect of the method that is of considerable algorithmic importance has remained unexplored. In most of the theoretical work, it is assumed that an oracle has given the optimal value of $\lambda$ to the statistician/engineer, and the analysis is performed for that optimal value. However, in practice the optimal value of $\lambda$ is not known a priori. One analysis that may help both in searching for the optimal value of $\lambda$ and in designing efficient algorithms for solving LASSO is the behavior of the solution $\hat{x}_\lambda$ as a function of $\lambda$. In this paper, we conduct such an analysis and demonstrate how its results can be employed for designing efficient approximate message passing algorithms.
1.2 Analysis of LASSO’s solution path
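In this paper we aim to analyze the properties of the solution of the LASSO as $\lambda$ changes. The two main problems that we address are:
Q1: How does $\frac{1}{N}\|\hat{x}_\lambda\|_0$ change as $\lambda$ varies?
Q2: How does $\frac{1}{N}\|\hat{x}_\lambda - x_o\|_2^2$ change as $\lambda$ varies?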
The first question is about the number of active elements in the solution of the LASSO, and the second one is about the mean squared error. Intuitively speaking, one would expect the size of the active set to shrink as $\lambda$ increases and the mean square error to be a bowl-shaped function of $\lambda$. Unfortunately, the peculiar behavior of LASSO breaks this intuition. See Figure 1 for a counter-example; we will clarify the details of this experiment in Section LABEL:sec:simulationdetails. This figure exhibits the number of active elements in the solution as we increase the value of $\lambda$. It is clear that the size of the active set is not monotonically decreasing.
Such pathological examples have discouraged further investigation of these problems in the literature. The main objective of this paper is to show that such pathological examples are quite rare, and if we consider the asymptotic setting (that will be described in Section 2.2), then we can provide quite intuitive answers to the two questions raised above. Let us summarize our results here in a non-rigorous way. We will formalize these statements and clarify the conditions under which they hold in Section LABEL:sec:lassopath.
A1: In the asymptotic setting, $\frac{1}{N}\|\hat{x}_\lambda\|_0$ is a decreasing function of $\lambda$.
A2: In the asymptotic setting, $\frac{1}{N}\|\hat{x}_\lambda - x_o\|_2^2$ is a quasi-convex function of $\lambda$.
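The two quantities in Q1 and Q2 can be traced numerically at finite problem sizes. The sketch below sweeps a grid of regularization values and records the active-set fraction and the per-coordinate squared error. It uses ISTA with warm starts as a stand-in solver, and the sizes, grid, and names such as `lasso_path` are hypothetical choices. At finite $N$ the curves need not be monotone or quasi-convex, so the sketch only computes them and asserts nothing about their shape.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_path(A, y, x_o, lambdas, n_iter=500):
    """For each lambda (in the given order, warm-starting from the previous
    solution), run ISTA and record ||x||_0 / N and ||x - x_o||_2^2 / N."""
    N = A.shape[1]
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(N)
    active, mse = [], []
    for lam in lambdas:
        for _ in range(n_iter):
            x = soft_threshold(x - step * (A.T @ (A @ x - y)), step * lam)
        active.append(np.count_nonzero(x) / N)
        mse.append(np.sum((x - x_o) ** 2) / N)
    return np.array(active), np.array(mse)

# Hypothetical test problem.
rng = np.random.default_rng(1)
n, N, k = 100, 200, 10
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))
x_o = np.zeros(N)
x_o[:k] = 1.0
y = A @ x_o + 0.05 * rng.normal(size=n)

# Start just above the value of lambda at which the solution becomes all-zero.
lam_max = 1.01 * np.max(np.abs(A.T @ y))
lambdas = np.geomspace(lam_max, 1e-3 * lam_max, 20)
active_frac, mse = lasso_path(A, y, x_o, lambdas)
```

Plotting `active_frac` and `mse` against `lambdas` for several random draws and growing `N` is one way to see how quickly the finite-size behavior approaches the asymptotic statements.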
1.3 Implications for approximate message passing algorithms
Traditional techniques for solving LASSO, such as the interior point method, have failed to address high-dimensional CS-type problems. Therefore, researchers have started exploring iterative algorithms with inexpensive per-iteration computations. One such algorithm is called approximate message passing (AMP) [DoMaMo09]; it is given by the following iteration:
$$x^{t+1} = \eta(x^t + A^* z^t; \tau^t), \qquad z^t = y - A x^t + \frac{|I^t|}{n} z^{t-1}. \tag{1.3}$$
AMP is an iterative algorithm, and $t$ is the index of iteration. $x^t$ is the estimate of $x_o$ at iteration $t$. $\eta$ is the soft thresholding function applied component-wise to the elements of the vector. For $a \in \mathbb{R}$, $\eta(a; \tau) = (|a| - \tau)_+ \mathrm{sign}(a)$. The set $I^t = \{i : x_i^t \neq 0\}$ is the active set of $x^t$. Finally, $\tau^t$ is called the threshold parameter. One of the most interesting features of AMP is that, in the asymptotic setting (which will be clarified later), the distribution of $v^t = x^t + A^* z^t - x_o$ is Gaussian at every iteration, and it can be considered to be independent of $x_o$. Figure 2 shows the empirical distribution of $v^t$ at three different iterations.
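A minimal sketch of an AMP iteration of this kind is given below, assuming the standard form $x^{t+1} = \eta(x^t + A^* z^t; \tau^t)$, $z^t = y - Ax^t + \frac{|I^t|}{n} z^{t-1}$ with $x^0 = 0$. The threshold is supplied by a user-defined function. The demo uses the residual-based rule $\tau^t = \beta \|z^t\|_2 / \sqrt{n}$, which is a common heuristic in the AMP literature and is an assumption here, not a policy prescribed by the text. All sizes and the value of `beta` are arbitrary.

```python
import numpy as np

def soft_threshold(v, tau):
    """eta(v; tau) = (|v| - tau)_+ * sign(v), applied component-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def amp(A, y, threshold_fn, n_iter=30):
    """Run n_iter AMP iterations starting from x = 0.
    threshold_fn(pseudo_data, z) returns the threshold for the current step."""
    n, N = A.shape
    x = np.zeros(N)
    z = y.copy()                          # residual for x = 0 (no correction term yet)
    for _ in range(n_iter):
        pseudo = x + A.T @ z              # behaves like x_o plus Gaussian noise
        tau = threshold_fn(pseudo, z)
        x = soft_threshold(pseudo, tau)
        # Residual with the Onsager correction (|active set| / n) * previous residual.
        z = y - A @ x + (np.count_nonzero(x) / n) * z
    return x

# Hypothetical test problem: delta = n/N = 0.5, 25 nonzeros.
rng = np.random.default_rng(0)
n, N, k = 250, 500, 25
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))
x_o = np.zeros(N)
x_o[rng.choice(N, size=k, replace=False)] = rng.normal(size=k)
y = A @ x_o + 0.01 * rng.normal(size=n)

beta = 2.0  # assumed tuning constant for the heuristic threshold
x_amp = amp(A, y, lambda pseudo, z: beta * np.linalg.norm(z) / np.sqrt(len(z)))
rel_err = np.linalg.norm(x_amp - x_o) / np.linalg.norm(x_o)
```

Passing the threshold rule as a function keeps the iteration itself separate from the thresholding policy, so different policies can be swapped in without touching the loop.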
As is clear from (1.3), the only parameter that exists in the AMP algorithm is the threshold parameter $\tau^t$ at different iterations. It turns out that different choices of this parameter can lead to very different performance. One choice that has interesting theoretical properties was first introduced in [DoMaMo09, DoMaMoNSPT] and is based on the Gaussianity of $v^t$. Suppose that an oracle gives us the standard deviation of $v^t$ at time $t$, called $\sigma^t$. Then one way of determining the threshold is to set $\tau^t = \beta \sigma^t$, where $\beta$ is a fixed number. This is called the fixed false alarm thresholding policy. It turns out that if we set $\beta$ properly in terms of $\lambda$ (the regularization parameter of LASSO), then $x^t$ will eventually converge to $\hat{x}_\lambda$. The nice theoretical properties of the fixed false alarm thresholding policy come at a price, however: $\sigma^t$ must be estimated at every iteration, which is not straightforward since we observe $x_o + v^t$ and not $v^t$. However, the fact that the size of the active set of LASSO is a monotonic function of $\lambda$ provides a practical and easy way of setting $\tau^t$. We call this approach fixed detection thresholding.
Definition 1.1 (Fixed detection thresholding policy).
Let $0 < \gamma < 1$. Set the threshold value $\tau^t$ to the absolute value of the $\lfloor \gamma n \rfloor$th largest element (in absolute value) of $x^t + A^* z^t$.
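The thresholding rule in this definition amounts to an order statistic of the magnitudes of the pseudo-data. The sketch below takes the rank `k` as an explicit argument. Choosing `k = floor(gamma * n)` for a fraction `gamma` and `n` measurements is an assumption made for illustration, as are the function name and the toy vector.

```python
import numpy as np

def fixed_detection_threshold(pseudo_data, k):
    """Return the magnitude of the k-th largest entry (in absolute value)."""
    if not 1 <= k <= pseudo_data.size:
        raise ValueError("k must be between 1 and the vector length")
    mags = np.sort(np.abs(pseudo_data))[::-1]   # magnitudes in descending order
    return mags[k - 1]

# Toy example: n and gamma are hypothetical values.
pseudo = np.array([3.0, -5.0, 1.0, 0.5, -2.0, 0.1])
n, gamma = 4, 0.5
k = int(np.floor(gamma * n))                    # k = 2
tau = fixed_detection_threshold(pseudo, k)      # second largest magnitude

# Entries whose magnitude strictly exceeds tau survive soft thresholding.
survivors = np.count_nonzero(np.abs(pseudo) > tau)
```

Because soft thresholding maps an entry of magnitude exactly `tau` to zero, this rule leaves `k - 1` nonzero entries when there are no ties. For long vectors, `np.partition` would avoid the full sort.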
Note that a similar thresholding policy has been employed for iterative hard thresholding [BlDaCS09, BlDa09], iterative soft thresholding [Maleki09], and AMP [MaMoCISS10] in a slightly different way. In these works, it is assumed that the signal is sparse and its sparsity level is known, and $\gamma$ is set according to the sparsity level. Here, however, $\gamma$ is assumed to be a free parameter. In the asymptotic setting, AMP with this thresholding policy is also equivalent to the LASSO in the following sense: for every $\lambda > 0$ there exists a unique $\gamma \in (0,1)$ for which AMP converges to the solution of LASSO as $t \to \infty$. This result is a consequence of the monotonicity of the size of the active set of LASSO in terms of $\lambda$. We will formally state our results regarding the AMP algorithm with the fixed detection thresholding policy in Section LABEL:sec:ampimplic.
1.4 Organization of the paper
The organization of the paper is as follows: Section 2 sets up the framework and formally states the main contributions of the paper. Section LABEL:sec:Thms proves the main results of this paper. Section LABEL:sec:simulations summarizes our simulation results and finally, Section LABEL:sec:con includes the conclusion of the paper.
2 Main contributions
Capital letters denote both matrices and random variables. As we may consider a sequence of vectors with different sizes, sometimes we denote $x$ with $x(N)$ to emphasize its dependency on the ambient dimension. For a matrix $A$, $A^*$, $\sigma_{\min}(A)$, and $\sigma_{\max}(A)$ denote the transpose of $A$, the minimum, and maximum singular values of $A$, respectively. Calligraphic letters such as $\mathcal{A}$ denote sets. For a set $\mathcal{A}$, $|\mathcal{A}|$ and $\mathcal{A}^c$ are the size of the set and its complement, respectively. For a vector $x \in \mathbb{R}^n$, $x_i$, $\|x\|_p$, and $\|x\|_0$ represent the $i$th component, $\ell_p$, and $\ell_0$ norms, respectively. We use $\mathbb{P}$ and $\mathbb{E}$ to denote the probability and expected value with respect to the measure that will be clear from the context. The notation $\mathbb{E}_X$ denotes the expected value with respect to the randomness in the random variable $X$. The two functions $\phi$ and $\Phi$ denote the probability density function and cumulative distribution function of the standard normal distribution. Finally, $\mathbb{I}(\cdot)$ and $\mathrm{sign}(\cdot)$ denote the indicator and sign functions, respectively.
2.2 Asymptotic CS framework
In this paper we consider the problem of recovering an approximately sparse vector $x_o \in \mathbb{R}^N$ from $n$ noisy linear observations $y = A x_o + w$. Our main goal is to analyze the properties of the solution of LASSO, defined in (1.1), on CS problems with the following two main features: (i) the measurement matrix has i.i.d. Gaussian elements (with the recent results in CS [BaMoLe12], our results can be easily extended to subgaussian matrices; for notational simplicity we consider the Gaussian setting here), and (ii) the ambient dimension and the number of measurements are large. We adopt the asymptotic framework to incorporate these two features. Here is the formal definition of this framework [DoMaMoNSPT, BaMo10]. Let $n, N \to \infty$ while $\delta = n/N$ is fixed. We write the vectors and matrices as $x_o(N)$, $A(N)$, $y(N)$, and $w(N)$ to emphasize the ambient dimension of the problem. Clearly, the number of rows of the matrix $A$ is equal to $\delta N$, but we assume that $\delta$ is fixed and therefore we do not include $n$ in our notation for $A$. The same argument applies to $y(N)$ and $w(N)$.
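A sequence of instances $\{x_o(N), A(N), w(N)\}$ is called a converging sequence if the following conditions hold:
The empirical distribution of $x_o(N) \in \mathbb{R}^N$ converges weakly to a probability measure $p_X$ with bounded second moment.
The empirical distribution of $w(N) \in \mathbb{R}^n$ ($n = \delta N$) converges weakly to a probability measure $p_W$ with bounded second moment.
If $\{e_i\}_{i=1}^N$ denotes the standard basis for $\mathbb{R}^N$, then $\max_i \|A(N) e_i\|_2 \to 1$ and $\min_i \|A(N) e_i\|_2 \to 1$ as $N \to \infty$.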
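The condition on the columns of the measurement matrix can be checked empirically. The sketch below assumes i.i.d. Gaussian entries with variance $1/n$ (an assumption made here so that columns have unit norm on average) and reports the smallest and largest column norms as $N$ grows with the ratio `delta = n/N` held fixed. The sizes and the function name are hypothetical.

```python
import numpy as np

def column_norm_range(n, N, rng):
    """Draw an n x N matrix with i.i.d. N(0, 1/n) entries and return
    (min column norm, max column norm)."""
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))
    norms = np.linalg.norm(A, axis=0)   # Euclidean norm of each column
    return norms.min(), norms.max()

rng = np.random.default_rng(0)
delta = 0.5                              # assumed undersampling ratio n / N
ranges = {}
for N in (100, 400, 1600):
    n = int(delta * N)
    ranges[N] = column_norm_range(n, N, rng)
    print(N, ranges[N])
```

Both extremes should drift toward 1 as `N` increases, since each squared column norm is an average of `n` independent terms with mean `1/n` each, scaled by `n`.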
Note that we have not imposed any constraint on the limiting distributions $p_X$ or $p_W$. In fact, for the purpose of this section, $p_X$ is not necessarily a sparsity promoting prior. Furthermore, unlike most other works, which assume that $p_W$ is Gaussian, we do not even impose this constraint on the noise. Also, the last condition is equivalent to saying that all the columns have asymptotically unit norm. For each problem instance $x_o(N)$, $A(N)$, and $w(N)$ we solve LASSO and obtain $\hat{x}_\lambda(N)$ as the estimate. We would now like to evaluate certain measures of performance for this estimate, such as the mean squared error. The next definition formalizes the types of measures we are interested in.
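Let $\hat{x}_\lambda(N)$ be the sequence of solutions of the LASSO problem for the converging sequence of instances $\{x_o(N), A(N), w(N)\}$. Consider a function $\psi : \mathbb{R}^2 \to \mathbb{R}$. An observable $J_\psi$ is defined as
$$J_\psi(x_o, \hat{x}_\lambda) \triangleq \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \psi\big(x_{o,i}(N), \hat{x}_{\lambda,i}(N)\big).$$
A popular choice of the function is $\psi(u, v) = (u - v)^2$. For this function the observable has the form:
$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \big(x_{o,i}(N) - \hat{x}_{\lambda,i}(N)\big)^2 = \lim_{N \to \infty} \frac{1}{N} \|x_o - \hat{x}_\lambda\|_2^2.$$
Another example of a function that we consider in this paper is $\psi(u, v) = \mathbb{I}(v \neq 0)$, which leads us to
$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(\hat{x}_{\lambda,i}(N) \neq 0\big) = \lim_{N \to \infty} \frac{\|\hat{x}_\lambda\|_0}{N}.$$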
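At a finite dimension, an observable of this kind can be approximated by the empirical average $\frac{1}{N}\sum_i \psi(x_{o,i}, \hat{x}_{\lambda,i})$. That form, with the true signal as the first argument of $\psi$, is assumed in the sketch below. The helper name and the toy vectors are hypothetical, and the estimate `x_hat` is made up for illustration rather than computed by LASSO.

```python
import numpy as np

def empirical_observable(psi, x_o, x_hat):
    """Finite-N average (1/N) * sum_i psi(x_o[i], x_hat[i]) for a vectorized psi."""
    return float(np.mean(psi(x_o, x_hat)))

def psi_mse(u, v):
    # Squared error between the true entry u and the estimate v.
    return (u - v) ** 2

def psi_active(u, v):
    # Indicator that the estimate is nonzero.
    return (v != 0).astype(float)

x_o = np.array([1.0, 0.0, -2.0, 0.0])
x_hat = np.array([0.5, 0.0, -2.0, 0.1])   # made-up estimate for illustration

mse = empirical_observable(psi_mse, x_o, x_hat)        # (0.25 + 0 + 0 + 0.01) / 4
active = empirical_observable(psi_active, x_o, x_hat)  # 3 nonzero entries out of 4
```

Other performance measures fit the same pattern by swapping in a different `psi`, which is the point of phrasing them all as observables.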
Some of the popular observables are summarized in Table LABEL:table:observables with their corresponding functions. Note that so far we have not made any major assumption about the sequence of matrices. Following other works in CS, we now consider random measurement matrices. While all our discussion can be extended to more general classes of random matrices [BaMoLe12], for notational simplicity we consider $A_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1/n)$. Clearly, these matrices satisfy the unit norm column condition of converging sequences with high probability. Since $A$ is random, there are two questions that need to be addressed about the limit defining $J_\psi$: (i) Does it exist, and in what sense (e.g., in probability or almost surely)? (ii) Does it converge to a random variable or to a deterministic quantity? The following theorem, conjectured in [DoMaMoNSPT] and proved in [BaMo11], shows that under some restrictions on the function $\psi$, not only does the almost sure limit exist in this scenario, but it is also a non-random number.
|Mean Square Error||MSE|
|False Alarm Rate||FA|