Asymptotic Analysis of LASSO’s Solution Path with Implications for Approximate Message Passing


Abstract

This paper concerns the performance of the LASSO (also known as basis pursuit denoising) for recovering sparse signals from undersampled, randomized, noisy measurements. We consider the recovery of the signal $x_o \in \mathbb{R}^N$ from random and noisy linear observations $y = Ax_o + w$, where $A \in \mathbb{R}^{n \times N}$ is the measurement matrix and $w$ is the noise. The LASSO estimate is given by the solution to the optimization problem $\hat{x}_\lambda = \arg\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$ with $\lambda > 0$. Despite major progress in the theoretical analysis of the LASSO solution, little is known about its behavior as a function of the regularization parameter $\lambda$. In this paper we study two questions in the asymptotic setting (i.e., where $n, N \rightarrow \infty$, while the ratio $n/N$ converges to a fixed number in $(0,1)$): (i) How does the size of the active set behave as a function of $\lambda$, and (ii) How does the mean square error behave as a function of $\lambda$? We then employ these results in a new, reliable algorithm for solving LASSO based on approximate message passing (AMP).

1 Introduction

1.1 Motivation

Consider the problem of recovering a vector $x_o \in \mathbb{R}^N$ from a set of undersampled random linear measurements $y = Ax_o + w$, where $A \in \mathbb{R}^{n \times N}$ is the measurement matrix and $w \in \mathbb{R}^n$ denotes the noise. One of the most successful recovery algorithms, called basis pursuit denoising or LASSO [1], employs the following optimization problem to obtain an estimate of $x_o$:

$$\hat{x}_\lambda = \arg\min_x \; \frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1.$$
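As a concrete reference point, the following minimal sketch solves this optimization problem with plain iterative soft thresholding (ISTA); the choice of solver, step size, and iteration count are illustrative assumptions and not part of the paper.

```python
import numpy as np

def soft_threshold(x, tau):
    # eta(x; tau) = sign(x) * max(|x| - tau, 0), applied component-wise
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_ista(y, A, lam, n_iter=500):
    """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the quadratic part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x
```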

A rich literature has provided a detailed analysis of this algorithm [3]. Most of the work published in this area falls into two categories: (i) non-asymptotic and (ii) asymptotic results. The non-asymptotic results consider $n$ and $N$ to be large but finite numbers and characterize the reconstruction error as a function of $n$ and $N$. These analyses provide qualitative guidelines on how to design compressed sensing (CS) systems. However, they suffer from loose constants and are incapable of providing quantitative guidelines. Therefore, inspired by the seminal work of Donoho and Tanner [13], researchers have started the asymptotic analysis of LASSO. Such analyses provide sharp quantitative guidelines for designing CS systems.

Despite the major progress in our understanding of LASSO, one aspect of the method that is of major algorithmic importance has remained unexplored. In most of the theoretical work, it is assumed that an oracle has given the optimal value of $\lambda$ to the statistician/engineer, and the analysis is performed for that optimal value of $\lambda$. However, in practice the optimal value of $\lambda$ is not known a priori. One important analysis that may help both in searching for the optimal value of $\lambda$ and in designing efficient algorithms for solving LASSO is the behavior of the solution as a function of $\lambda$. In this paper, we conduct such an analysis and demonstrate how the results can be employed for designing efficient approximate message passing algorithms.

1.2 Analysis of LASSO’s solution path

In this paper we aim to analyze the properties of the solution of the LASSO as $\lambda$ changes. The two main problems that we address are:

  • Q1: How does $\|\hat{x}_\lambda\|_0$ change as $\lambda$ varies?

  • Q2: How does $\|\hat{x}_\lambda - x_o\|_2^2/N$ change as $\lambda$ varies?

The first question is about the number of active elements in the solution of the LASSO, and the second one is about the mean squared error. Intuitively speaking, one would expect the size of the active set to shrink as $\lambda$ increases and the mean square error to be a bowl-shaped function of $\lambda$. Unfortunately, the peculiar behavior of LASSO breaks this intuition. See Figure 1 for a counterexample; we will clarify the details of this experiment in Section 4.2. This figure exhibits the number of active elements in the solution as we increase the value of $\lambda$. It is clear that the size of the active set is not monotonically decreasing.

Figure 1: The number of active elements in the solution of LASSO as a function of $\lambda$. It is clear that this function does not match the intuition: the size of the active set at one location grows as we increase $\lambda$. See the details of this experiment in Section 4.2.

Such pathological examples have discouraged further investigation of these problems in the literature. The main objective of this paper is to show that such pathological examples are quite rare, and if we consider the asymptotic setting (that will be described in Section 2.2), then we can provide quite intuitive answers to the two questions raised above. Let us summarize our results here in a non-rigorous way. We will formalize these statements and clarify the conditions under which they hold in Section 2.3.

  • A1: In the asymptotic setting, $\|\hat{x}_\lambda\|_0/N$ is a decreasing function of $\lambda$.

  • A2: In the asymptotic setting, $\|\hat{x}_\lambda - x_o\|_2^2/N$ is a quasi-convex function of $\lambda$.

1.3 Implications for approximate message passing algorithms

Traditional techniques for solving LASSO, such as the interior point method, have failed to address high-dimensional CS-type problems. Therefore, researchers have started exploring iterative algorithms with inexpensive per-iteration computations. One such algorithm is called approximate message passing (AMP) [23]; it is given by the following iteration:

$$x^{t+1} = \eta\left(x^t + A^T z^t;\, \tau_t\right), \qquad z^t = y - A x^t + \frac{z^{t-1}}{\delta}\left\langle \eta'\left(x^{t-1} + A^T z^{t-1};\, \tau_{t-1}\right)\right\rangle.$$

AMP is an iterative algorithm, and $t$ is the index of the iteration. $x^t$ is the estimate of $x_o$ at iteration $t$, and $\eta(\cdot\,;\tau)$ is the soft thresholding function applied component-wise to the elements of the vector. For $x \in \mathbb{R}$, $\eta(x;\tau) = (|x| - \tau)_+\,\mathrm{sign}(x)$. Here $\delta = n/N$, and $\langle u \rangle \triangleq \frac{1}{N}\sum_{i=1}^{N} u_i$ denotes the average of the entries of a vector $u$. Finally, $\tau_t$ is called the threshold parameter. One of the most interesting features of AMP is that, in the asymptotic setting (which will be clarified later), the distribution of $v^t \triangleq x^t + A^T z^t - x_o$ is Gaussian at every iteration, and it can be considered to be independent of $x_o$. Figure 2 shows the empirical distribution of $v^t$ at three different iterations.

Figure 2: Histogram of $v^t$ for three different iterations. The red curve displays the best Gaussian fit.
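A minimal sketch of the AMP iteration just described is given below. The Onsager correction uses the empirical average of the derivative of the soft thresholding function, and the thresholding policy is left as a callback so that the policies discussed in this section can be plugged in; this is a generic sketch under the stated conventions, not the authors' implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def amp(y, A, threshold_policy, n_iter=30):
    """Generic AMP iteration; threshold_policy(t, z, pseudo_data) returns tau_t."""
    n, N = A.shape
    delta = n / N
    x = np.zeros(N)
    z = y.copy()
    for t in range(n_iter):
        pseudo_data = x + A.T @ z                 # behaves like x_o + v^t
        tau = threshold_policy(t, z, pseudo_data)
        x = soft_threshold(pseudo_data, tau)
        # Onsager term: (z/delta) * <eta'(pseudo_data; tau)>, where <eta'> equals
        # the fraction of coordinates that survive the threshold
        z = y - A @ x + (z / delta) * np.mean(x != 0)
    return x
```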

As is clear from the iteration above, the only parameter that exists in the AMP algorithm is the threshold parameter $\tau_t$ at different iterations. It turns out that different choices of this parameter can lead to very different performance. One choice that has interesting theoretical properties was first introduced in [23] and is based on the Gaussianity of $v^t$. Suppose that an oracle gives us the standard deviation of $v^t$ at iteration $t$, called $\sigma_t$. Then one way to determine the threshold is to set $\tau_t = \alpha \sigma_t$, where $\alpha$ is a fixed number. This is called the fixed false alarm thresholding policy. It turns out that if we set $\alpha$ properly in terms of $\lambda$ (the regularization parameter of LASSO), then $x^t$ will eventually converge to $\hat{x}_\lambda$. The nice theoretical properties of the fixed false alarm thresholding policy come at a price, however, and that is the requirement of estimating $\sigma_t$ at every iteration, which is not straightforward since we observe $x^t + A^T z^t = x_o + v^t$ and not $v^t$. However, the fact that the size of the active set of LASSO is a monotonic function of $\lambda$ provides a practical and easy way for setting $\tau_t$. We call this approach fixed detection thresholding.
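A sketch of the fixed false alarm policy, compatible with the `amp` sketch above, is given below. Since $\sigma_t$ is not observed, the sketch substitutes the commonly used surrogate $\|z^t\|_2/\sqrt{n}$; this surrogate is an assumption of the example, not part of the oracle policy analyzed in the text.

```python
import numpy as np

def fixed_false_alarm_policy(alpha):
    """Threshold policy tau_t = alpha * sigma_t for the AMP sketch above."""
    def policy(t, z, pseudo_data):
        # sigma_t is approximated by ||z^t||_2 / sqrt(n) (an assumed surrogate)
        sigma_hat = np.linalg.norm(z) / np.sqrt(len(z))
        return alpha * sigma_hat
    return policy
```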

Note that a similar thresholding policy has been employed for iterative hard thresholding [25], iterative soft thresholding [27], and AMP [28] in a slightly different way. In these works, it is assumed that the signal is sparse and its sparsity level is known, and the threshold is set according to the sparsity level. Here, in contrast, the size of the active set is treated as a free parameter. In the asymptotic setting, AMP with this thresholding policy is also equivalent to the LASSO in the following sense: for every $\lambda$ there exists a unique value of this parameter for which AMP converges to the solution of LASSO as $t \rightarrow \infty$. This result is a consequence of the monotonicity of the size of the active set of LASSO in terms of $\lambda$. We will formally state our results regarding the AMP algorithm with the fixed detection thresholding policy in Section 2.4.
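The fixed detection policy can be sketched in the same interface: the threshold is chosen at every iteration so that exactly $k$ coefficients survive the soft threshold, with $k$ playing the role of the free parameter discussed above. The helper below is hypothetical and matches the `threshold_policy` callback of the earlier sketch.

```python
import numpy as np

def fixed_detection_policy(k):
    """Keep exactly k coefficients in the active set at every AMP iteration."""
    def policy(t, z, pseudo_data):
        mags = np.sort(np.abs(pseudo_data))[::-1]
        # setting tau to the (k+1)-th largest magnitude lets the top k survive
        return mags[k] if k < len(mags) else 0.0
    return policy
```

For instance, `amp(y, A, fixed_detection_policy(k))` runs AMP while keeping $k$ coefficients active at every iteration.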

1.4 Organization of the paper

The organization of the paper is as follows. Section 2 sets up the framework and formally states the main contributions of the paper. Section 3 proves the main results of the paper. Section 4 summarizes our simulation results, and finally, Section 5 concludes the paper.

2 Main contributions

2.1 Notation

Capital letters denote both matrices and random variables. As we may consider a sequence of vectors with different sizes, we sometimes denote $x_o$ by $x_o(N)$ to emphasize its dependence on the ambient dimension. For a matrix $A$, $A^T$, $\sigma_{\min}(A)$, and $\sigma_{\max}(A)$ denote the transpose, the minimum singular value, and the maximum singular value of $A$, respectively. Calligraphic letters such as $\mathcal{A}$ denote sets. For a set $\mathcal{A}$, $|\mathcal{A}|$ and $\mathcal{A}^c$ are the size of the set and its complement, respectively. For a vector $x$, $x_i$, $\|x\|_1$, and $\|x\|_2$ represent the $i$th component, the $\ell_1$ norm, and the $\ell_2$ norm, respectively. We use $\mathbb{P}$ and $\mathbb{E}$ to denote the probability and expected value with respect to the measure that will be clear from the context. The notation $\mathbb{E}_X$ denotes the expected value with respect to the randomness in the random variable $X$. The two functions $\phi$ and $\Phi$ denote the probability density function and cumulative distribution function of the standard normal distribution. Finally, $\mathbb{I}(\cdot)$ and $\mathrm{sign}(\cdot)$ denote the indicator and sign functions, respectively.

2.2 Asymptotic CS framework

In this paper we consider the problem of recovering an approximately sparse vector $x_o$ from the noisy linear observations $y = Ax_o + w$. Our main goal is to analyze the properties of the solution of LASSO, defined above, on CS problems with the following two main features: (i) the measurement matrix $A$ has iid Gaussian elements,1 and (ii) the ambient dimension $N$ and the number of measurements $n$ are large. We adopt the asymptotic framework to incorporate these two features. Here is the formal definition of this framework [24]. Let $N \rightarrow \infty$ while $\delta = n/N$ is fixed. We write the vectors and matrices as $x_o(N)$, $w(N)$, and $A(N)$ to emphasize the ambient dimension of the problem. Clearly, the number of rows of the matrix $A(N)$ is equal to $n = \delta N$, but we assume that $\delta$ is fixed and therefore we do not include $n$ in our notation for $A(N)$. The same argument applies to $y(N)$ and $w(N)$.
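For concreteness, one instance of such a problem can be generated as follows; the Bernoulli-Gaussian prior and the noise level are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, delta = 2000, 0.5
n = int(delta * N)

A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))   # iid N(0, 1/n) entries
x_o = rng.binomial(1, 0.1, N) * rng.normal(size=N)   # Bernoulli-Gaussian signal (assumed prior)
w = 0.2 * rng.normal(size=n)                         # iid noise (assumed level)
y = A @ x_o + w

print(np.linalg.norm(A, axis=0).mean())  # column norms concentrate around 1
```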

Note that we have not imposed any constraint on the limiting distributions of $x_o$ or $w$. In fact, for the purpose of this section, the distribution of $x_o$ is not necessarily a sparsity promoting prior. Furthermore, unlike most other works, which assume $w$ is Gaussian, we do not even impose this constraint on the noise. Also, the last condition is equivalent to saying that all the columns of $A$ have asymptotically unit norm. For each problem instance $x_o(N)$, $w(N)$, and $A(N)$ we solve LASSO and obtain $\hat{x}_\lambda(N)$ as the estimate. We would now like to evaluate certain measures of performance for this estimate, such as the mean squared error. The next definition formalizes the types of measures we are interested in.

A popular choice of the function $\psi$ is $\psi(u, v) = (u - v)^2$. For this function the observable has the form:

$$\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_{\lambda,i} - x_{o,i}\right)^2.$$

Another example of a function that we consider in this paper is $\psi(u, v) = \mathbb{I}(u \neq 0)$, which leads us to

$$\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left(\hat{x}_{\lambda,i} \neq 0\right) = \frac{\|\hat{x}_\lambda\|_0}{N}.$$

Some of the popular observables are summarized in Table ? with their corresponding $\psi$ functions. Note that so far we have not made any major assumption on the sequence of matrices. Following the other works in CS, we now consider random measurement matrices. While all our discussion can be extended to more general classes of random matrices [29], for notational simplicity we consider matrices with iid $N(0, 1/n)$ entries. Clearly, these matrices satisfy the unit norm column condition of converging sequences with high probability. Since $A$ is random, there are two questions that need to be addressed about the limit of such observables: (i) Does it exist, and in what sense (e.g., in probability or almost surely)? (ii) Does it converge to a random variable or to a deterministic quantity? The following theorem, conjectured in [24] and proved in [20], shows that under some restrictions on the $\psi$ function, not only does the almost sure limit exist in this scenario, but it also converges to a non-random number.

Some observables and their abbreviations. The function for each observable is also specified.
Name Abbreviation
Mean Square Error MSE
False Alarm Rate FA
Detection Rate DR
Missed Detection MD
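Empirical versions of these observables can be computed directly from an estimate. In the sketch below, the exact conventions for FA, DR, and MD (which coordinates are counted and the normalization by $N$) are plausible assumptions rather than the paper's definitions.

```python
import numpy as np

def observables(x_hat, x_o):
    """Empirical observables for an estimate x_hat of x_o (all normalized by N)."""
    active = x_hat != 0
    support = x_o != 0
    return {
        "MSE": np.mean((x_hat - x_o) ** 2),
        "FA": np.mean(active & ~support),   # declared active although x_o is zero
        "DR": np.mean(active & support),    # declared active and x_o is nonzero
        "MD": np.mean(~active & support),   # declared inactive although x_o is nonzero
    }
```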

This theorem will provide the first step in our analysis of the LASSO’s solution path. Before we proceed to the implications of this theorem, let us explain some of its interesting features. Suppose that $\hat{x}_\lambda$ had iid elements, and that each element were in law equal to $\eta(X_o + \sigma Z; \tau)$, where $Z \sim N(0,1)$ and $X_o$ is drawn from the limiting distribution of $x_o$ and is independent of $Z$. Also, suppose that $\sigma$ and $\tau$ satisfy the two fixed point equations of the theorem. If these two assumptions were true, then we could use the strong law of large numbers (SLLN) and argue that the conclusion of the theorem holds under some mild conditions (required for the SLLN). While this heuristic is not quite correct, and the elements of $\hat{x}_\lambda$ are not necessarily independent, at the level of calculating the observables defined in Definition ? (and pseudo-Lipschitz $\psi$) this theorem confirms the heuristic. Note that the key elements that have led to this heuristic are the randomness in the measurement matrix and the large size of the problem.
As we see in the theorem, there are two constants, $\sigma$ and $\tau$, that are calculated from the two fixed point equations in terms of $\lambda$ and the distributions of $x_o$ and $w$. [24] have shown that for a fixed $\lambda$, these two equations have a unique solution $(\sigma, \tau)$. Note that here $\sigma \geq \sigma_w$, where $\sigma_w$ denotes the standard deviation of the input noise; i.e., the variance of the noise that we observe after the reconstruction, $\sigma^2$, is larger than the variance of the input noise (according to the first fixed point equation). The extra noise that we observe after the reconstruction is due to subsampling. In fact, if we keep $\sigma_w$ fixed and decrease $\delta$, then we see that $\sigma$ increases. This phenomenon is sometimes called noise folding in the CS literature [30].
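To make the role of these two constants concrete, the following Monte Carlo sketch iterates the standard state-evolution/calibration fixed point from the AMP literature: for a threshold-to-noise ratio $\alpha$ (with $\tau = \alpha\sigma$) it solves for $\sigma$, and then reads off the corresponding $\lambda$ through the usual calibration $\lambda = \tau\left(1 - \frac{1}{\delta}\mathbb{E}[\eta']\right)$. Sweeping $\alpha$ traces out the asymptotic MSE and active-set fraction as functions of $\lambda$. The prior, the noise level, and the calibration identity itself should be read as assumptions of this sketch rather than statements of the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(1)
delta, sigma_w = 0.5, 0.2
X_o = rng.binomial(1, 0.1, 200_000) * rng.normal(size=200_000)  # samples from an assumed prior
Z = rng.normal(size=X_o.size)

def eta(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def solution_path_point(alpha, n_iter=200):
    sigma2 = sigma_w**2 + np.mean(X_o**2) / delta        # a standard initialization
    for _ in range(n_iter):
        mse = np.mean((eta(X_o + np.sqrt(sigma2) * Z, alpha * np.sqrt(sigma2)) - X_o) ** 2)
        sigma2 = sigma_w**2 + mse / delta                # noise folding: sigma >= sigma_w
    sigma = np.sqrt(sigma2)
    active = np.mean(np.abs(X_o + sigma * Z) > alpha * sigma)
    lam = alpha * sigma * (1.0 - active / delta)         # assumed tau-to-lambda calibration
    return lam, mse, active

for alpha in (1.0, 1.5, 2.0, 2.5):
    print(solution_path_point(alpha))
```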

One of the main applications of Theorem ? is in characterizing the normalized mean squared error of the LASSO problem as is summarized by the next corollary.

As we mentioned before, we are also interested in another observable, namely the normalized number of detections $\|\hat{x}_\lambda\|_0/N$. As described above, this observable can be constructed by using $\psi(u, v) = \mathbb{I}(u \neq 0)$. However, it is not difficult to see that for this observable the function $\psi$ is not pseudo-Lipschitz, and hence Theorem ? does not apply. However, as conjectured in [24] and proved in [20], we can still characterize the almost sure limit of this observable.

2.3 LASSO’s solution path

In Section 2.2 we characterized two simple expressions for the asymptotic behavior of normalized mean square error and normalized number of detections. These two expressions enable us to formalize the two questions that we raised in the Introduction.

As mentioned in the Introduction, if we consider a generic CS problem, there are some pathological examples for which the behavior of LASSO is quite unpredictable and inconsistent with our intuition. See Figure 1 for an example and Section 4.2 for a detailed description of it. Here, we consider the asymptotic regime introduced in the last section. It turns out that in this setting the solution of LASSO behaves as expected.

We summarize the proof of this theorem in Section 3.2.
Intuitively speaking, Theorem ? claims that, as we increase the regularization parameter $\lambda$, the number of elements in the active set decreases. Also, the largest value that the normalized size of the active set can reach is $\delta$; since the number of active elements is a decreasing function of $\lambda$, this extreme appears only in the limit $\lambda \rightarrow 0$. Figure 3 shows the number of active elements as a function of $\lambda$ for a setting described in Section ?. In the next section, we will exploit this property to design and tune AMP for solving the LASSO.

Figure 3: The number of active elements in the solution of LASSO as a function of $\lambda$. The size of the active set decreases monotonically as we increase $\lambda$.

Our next result concerns the behavior of the normalized MSE in terms of the regularization parameter $\lambda$. In the asymptotic setting, we prove that the normalized MSE is a quasi-convex function of $\lambda$. See Section 3.4 of [32] for a short introduction to quasi-convex functions. Figure 4 exhibits the behavior of the MSE as a function of $\lambda$. The detailed description of this problem instance can be found in Section 4.2. Before we proceed further, we define bowl-shaped functions.

Here is the formal statement of this result.

Figure 4: Behavior of the MSE as a function of $\lambda$ for two different noise variances.

See the proof in Section 3.3.
In many applications, such as imaging, it is important to find the optimal value of $\lambda$ that leads to the minimum MSE. We believe that the combination of the quasi-convexity of the MSE in terms of $\lambda$ and certain risk estimation techniques, such as SURE, may lead to efficient algorithms for estimating this optimal $\lambda$. We leave this as an avenue for future research.

2.4 Implications for AMP

AMP in asymptotic settings

In this section we show how the result of Theorem ? can lead to an efficient method for setting the threshold parameter $\tau_t$ in the AMP algorithm. We first review some background on the asymptotic analysis of AMP. This section is mainly based on the results in [23], and the interested reader is referred to those papers for further details. As we mentioned in Section 1.3, AMP is an iterative thresholding algorithm. Therefore, we would like to know the discrepancy of its estimate at every iteration from the original vector $x_o$. The following definition formalizes different discrepancy measures for the AMP estimates.

As before, we can consider $\psi(u, v) = (u - v)^2$, which leads to the normalized MSE of AMP at iteration $t$. The following result, conjectured in [23] and finally proved in [19], provides a simple description of the almost sure limits of these observables.

Similar to our discussion of the solution of the LASSO, this theorem claims that, as far as the calculation of pseudo-Lipschitz observables is concerned, we can assume that the estimate of AMP at iteration $t$ has iid elements, with each element modeled in law as $\eta(X_o + \sigma_t Z; \tau_t)$, where $Z \sim N(0,1)$ and $X_o$ is drawn from the limiting distribution of $x_o$ and is independent of $Z$. As before, we are also interested in the normalized number of detections. The following theorem establishes this result.
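For reference, the state evolution recursion that generates the sequence $\sigma_t$ referenced above takes the following standard form in the AMP literature; the recursion and its initialization are reproduced here for convenience and should be read as an assumption of this summary rather than a restatement of the theorem.

```latex
\sigma_{t+1}^2 \;=\; \sigma_w^2 \;+\; \frac{1}{\delta}\,
   \mathbb{E}\Big[\big(\eta(X_o + \sigma_t Z;\, \tau_t) - X_o\big)^2\Big],
\qquad
\sigma_0^2 \;=\; \sigma_w^2 + \frac{\mathbb{E}[X_o^2]}{\delta},
```

where $X_o$ is distributed according to the limiting distribution of $x_o$, $Z \sim N(0,1)$ is independent of $X_o$, and $\sigma_w^2$ is the variance of the noise.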

In other words, the result of Theorem ? can be extended to $\psi(u, v) = \mathbb{I}(u \neq 0)$, even though this function is not pseudo-Lipschitz.

Connection between AMP and LASSO

The AMP algorithm in its general form can be considered as a sparse signal recovery algorithm.2 The choice of the threshold parameter $\tau_t$ has a major impact on the performance of AMP. It turns out that if we set $\tau_t$ “appropriately,” then the fixed point of AMP corresponds to the solution of LASSO in the asymptotic regime. One such choice of parameters is the fixed false alarm threshold given by $\tau_t = \alpha \sigma_t$, where $\alpha$ is chosen to satisfy a calibration equation determined by $\lambda$. The following result, conjectured in [24] and later proved in [20], formalizes this statement.

This promising result indicates that AMP can potentially be used as a fast iterative algorithm for solving the LASSO problem. However, it is not readily useful for practical scenarios in which $\sigma_t$ is not known (since neither $x_o$ nor its distribution is known). Therefore, in the first implementations of AMP, $\sigma_t$ has been estimated at every iteration from the observations $x^t + A^T z^t$. From Section 1.3 we know that $v^t = x^t + A^T z^t - x_o$ can be modeled as Gaussian with standard deviation $\sigma_t$. Therefore, if we had access to $v^t$ we could easily estimate $\sigma_t$. However, we only observe $x_o + v^t$, and we have to estimate $\sigma_t$ from this observation. The estimates that have been proposed so far exploit the fact that $x_o$ is sparse and provide a biased estimate of $\sigma_t$. While such biased estimates still work well in practice, our discussion of LASSO provides an easier way to set the threshold. In the next section, based on our analysis of LASSO, we discuss the performance of the fixed detection thresholding policy introduced in Section 1.3, and show that not only can this thresholding policy be implemented in its exact form, but it also has the nice properties of the fixed false alarm threshold.

Fixed detection thresholding

AMP looks for the sparsest solution of $y = Ax$ through the following iterations (introduced in Section 1.3):

$$x^{t+1} = \eta\left(x^t + A^T z^t;\, \tau_t\right), \qquad z^t = y - A x^t + \frac{z^{t-1}}{\delta}\left\langle \eta'\left(x^{t-1} + A^T z^{t-1};\, \tau_{t-1}\right)\right\rangle.$$

As was discussed in Section 1, a good choice of the threshold parameter is vital to the good performance of AMP. We proved in Section 2.3 that the number of active elements in the solution of LASSO is a monotonic function of the parameter $\lambda$. This motivates us to set the threshold of AMP in a way that, at every iteration, a certain number of coefficients remains in the active set. To understand this claim better, compare the fixed point characterization of the LASSO solution with the fixed points of the AMP iterations. Let us replace the threshold parameter by its limiting value in the AMP iteration. In addition, assume that the threshold is chosen such that the size of the active set is equal to a fixed number of coefficients. Under these two assumptions, the equations can be converted to

Let us now consider the fixed point of AMP. By letting $t \rightarrow \infty$ in the AMP iteration we obtain

where $\hat{x}$ and $\hat{z}$ denote the limits of $x^t$ and $z^t$. Comparing the two sets of equations, we conclude that if we set $\tau_t$ in such a way that its limit corresponds to the value of $\lambda$ we are interested in, then AMP has a fixed point that corresponds to the solution of LASSO. One such approach is the fixed detection thresholding policy that was introduced in Section 1.3. According to this thresholding policy, we keep the size of the active set of AMP fixed at every iteration. Then clearly, if the algorithm converges, the final solution will have the desired number of active elements. In other words, the final solution of AMP will also satisfy the two equations:

The first question that we shall address here is whether the above two equations have a unique fixed point; otherwise, depending on the initialization, AMP may converge to different fixed points.

See Section 3.4 for the proof of this lemma.

The heuristic discussion we have had so far shows that the fixed point of the AMP algorithm with fixed detection threshold converges to the solution of LASSO. The following theorem formalizes this result.

As we will show in Section 3.5, the proof of this theorem is essentially the same as the proof of Theorem 3.1 in [20]. There is a slight change in the proof due to the different thresholding policy that we consider here.

3 Proofs of the main results

3.1 Background

Quasiconvex functions

Here, we briefly mention several properties of quasi-convex functions. For a more detailed introduction, refer to Section 3.4 of [32]. The following basic theorem regarding quasi-convex functions is a key element in our proofs.

The following simple lemma shows that shifting and scaling preserve quasi-convexity.

First, assume that is a quasi-convex function. According to the definition of the quasi-convexity we can write

Hence, is quasi-convex as well. On the other hand, suppose that is a quasi-convex function. Then, according to the definition, we can write

Therefore, is quasi-convex as well.

Risk of the soft thresholding function

In this section we review some of the basic results that have been proved elsewhere and will be used in this paper. We also extend some of these results; as we will see, these extensions will be used later in the paper. Let

$$r(\mu, \tau) \triangleq \mathbb{E}\left[\left(\eta(\mu X + Z; \tau) - \mu X\right)^2\right],$$

where $X$ and $Z \sim N(0,1)$ are two independent random variables. Note that $r$ is a function of $\mu$, $\tau$, and the distribution of $X$, but here we assume that all these parameters are fixed, and we study $r$ as a function of $\tau$. The following lemma is adopted from [23].

One major implication of this theorem is related to the fixed points of the state evolution equation, summarized in the next lemma. This lemma is adopted from [24].

Another interesting result that will play a crucial role below is the quasi-convexity of the risk of soft thresholding in terms of the threshold [33]. Let $B$ be a random variable with distribution $p_B$ independent of $Z \sim N(0,1)$, and define

$$r(p_B, \tau) \triangleq \mathbb{E}\left[\left(\eta(B + Z; \tau) - B\right)^2\right].$$

We claim that $r(p_B, \tau)$ is a quasi-convex function of $\tau$. It turns out that the proof of this claim is similar to the proof of quasi-convexity of soft thresholding with a fixed signal value [33]. However, since the version of [33] that is publicly available did not have this theorem at the time of the publication of this manuscript, and since we need to make some modifications to the proof, we provide it here. Define $\delta_0$ as a point mass at zero and, more generally, $\delta_a$ as a point mass at $a$. For instance, $\frac{1}{2}\delta_{\mu} + \frac{1}{2}\delta_{-\mu}$ is the distribution of a random variable that takes the values $\mu$ and $-\mu$ each with probability half.

In order to prove this result we use Theorem ?. It is straightforward to show that $r$ is a differentiable function of $\tau$. This derivative is shown in Figure 5. According to Theorem ?, we have to prove that the derivative either has no sign change or has one sign change from negative to positive. Note that if we had $\frac{\partial^2 r}{\partial \tau^2} \geq 0$, this would immediately show that the function is convex and hence also quasi-convex. However, this is not true here. Figure 5 shows $\frac{\partial r}{\partial \tau}$ as a function of $\tau$. As is clear from the figure, the derivative is not strictly increasing, and hence we should not expect the second derivative to be always positive.

Figure 5: The derivative $\frac{\partial}{\partial \tau} r(\mu, \tau)$ as a function of $\tau$ for $\mu = 1$ and $0 < \tau < \infty$. Note that the derivative of the risk has only one sign change: below that point the derivative is negative, and above that point it is positive (even though it converges to zero as $\tau \rightarrow \infty$). Hence we expect the risk to be quasi-convex.
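The single sign change of the derivative can also be checked numerically. The sketch below evaluates the risk for a signal fixed at $\mu$ (the case plotted in Figure 5, i.e., the signal distribution is a point mass) by integrating against the Gaussian density, and reports the value of $\tau$ at which the numerical derivative turns from negative to positive; the grids are arbitrary choices.

```python
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def risk(mu, tau, z=np.linspace(-10, 10, 20001)):
    # r(mu, tau) = E[(eta(mu + Z; tau) - mu)^2], Z ~ N(0,1), by numerical integration
    pdf = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    return np.trapz((soft(mu + z, tau) - mu) ** 2 * pdf, z)

taus = np.linspace(1e-3, 4.0, 400)
dr = np.gradient(np.array([risk(1.0, t) for t in taus]), taus)
print(taus[np.argmax(dr > 0)])   # first tau at which the derivative becomes positive
```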

Instead, we will show that the ratio of $\frac{\partial r}{\partial \tau}$ to a certain strictly positive function of $\tau$ is strictly increasing in $\tau$. Once we prove this, we conclude that this ratio will have at most one sign change, and since the denominator is always positive, we can conclude that $\frac{\partial r}{\partial \tau}$ will have at most one sign change. Clearly, according to Theorem ?, this completes the proof of quasi-convexity. Therefore, our main goal in the rest of the proof is to show that this ratio is strictly increasing. We have

Therefore

Changing integral variables to in and in results in

Consequently, we can write

Let . Taking the derivative of with respect to gives

can be simplified to

Therefore, since , we have , which proves that is an increasing function. This, in addition to the fact that , completes the proof of quasi-convexity.

So far, we have been able to prove that the ratio is strictly increasing, which proves (when combined with the fact that the denominator is positive) the quasi-convexity of the risk. However, this does not yet mean that the function is bowl-shaped (see Definition ?). In fact, the function is bowl-shaped only if the sign change happens at a finite value of $\tau$. We would now like to prove that the zero crossing in fact happens under our assumptions.

We have

If we prove that for large enough $\tau$ we have $\frac{\partial r}{\partial \tau} > 0$, then we can conclude that $\frac{\partial r}{\partial \tau}$ has at least one zero-crossing. On the other hand, according to Lemma ?, $\frac{\partial r}{\partial \tau}$ has at most one zero crossing. Therefore, we would actually prove that $\frac{\partial r}{\partial \tau}$ has exactly one zero-crossing and, consequently, that the risk of the soft-thresholding function is a bowl-shaped function. We require the following two lemmas in the proof of our main claim, which is summarized in Proposition ?.

This lemma is a straightforward consequence of the assumption .

We again prove this lemma using contradiction. Suppose that such does not exist. Therefore, for every we have . This would give . However, this is in contradiction with the property of in Lemma ? and hence the proof is complete.

Without loss of generality and for the simplicity of notation, we assume that . Using the same change of variable technique as in (Equation 9), we have

Note that , and . Our goal is to show that for large values of , . To achieve this goal, we find an upper bound for and a lower bound for . Simplifying gives

Therefore, it is sufficient to prove that . To achieve this goal, we first use the assumptions that and to prove the following two lemmas.

Using and introduced in Lemmas ? and ?, we can obtain a lower bound for as the following:

where

Combining (Equation 10) and (Equation 11) would give

Therefore, is positive when is large enough, i.e., if . Hence, the proof is complete.

Combining Corollary ?, Proposition ?, and the fact that $\frac{\partial r}{\partial \tau}$ has at most one zero-crossing proves that $\frac{\partial r}{\partial \tau}$ has exactly one zero-crossing, from negative to positive, and hence $r$ is a quasi-convex and bowl-shaped function.

We can write

Therefore, taking derivative with respect to gives

3.2 Proof of Theorem ?

First note that according to Theorem ? we have

where $\sigma$ and $\tau$ satisfy the following equations:

Therefore, we have to prove that this limit is a non-increasing function of $\lambda$. The main difficulty of the problem is clear from these equations: $\sigma$, $\tau$, and $\lambda$ are complicated functions of each other whose explicit formulations are not known. Using the chain rule we have

This simple expression enables us to break the proof into the following two simpler parts:

  1. We prove that .

  2. We prove that .

Combining these two results with completes the proof.

Since if and only if , it is sufficient to prove . We have

The rest of the proof has four main steps:

  1. Calculation of .

  2. Calculation of .

  3. Calculation of .

  4. Plugging the results of the above three steps in and proving that .

Here are the four steps in detail:
Step 1: Calculation of

Simple algebra leads us to

Step 2: Calculation of


Similar to Step 1, we can write

Step 3: Calculation of