A problem dependent analysis of SOCP algorithms in noisy compressed sensing

Mihailo Stojnic

School of Industrial Engineering
Purdue University, West Lafayette, IN 47907
e-mail: mstojnic@purdue.edu

Abstract

Under-determined systems of linear equations with sparse solutions have been the subject of extensive research in the last several years, above all due to the results of [9, 10, 21]. In this paper we will consider noisy under-determined linear systems. In a breakthrough [10] it was established that in noisy systems for any linear level of under-determinedness there is a linear sparsity that can be approximately recovered through an SOCP (second order cone programming) optimization algorithm so that the approximate solution vector is (in an $\ell_2$-norm sense) guaranteed to be no further from the sparse unknown vector than a constant times the noise. In our recent work [53] we established an alternative framework that can be used for statistical performance analysis of the SOCP algorithms. To demonstrate how the framework works we then showed in [53] how one can use it to precisely characterize the generic (worst-case) performance of the SOCP. In this paper we present a different set of results that can be obtained through the framework of [53]. The results will relate to the problem dependent performance analysis of SOCP's. We will consider specific types of unknown sparse vectors and characterize the SOCP performance when used for recovery of such vectors. We will also show that our theoretical predictions are in solid agreement with the results one can get through numerical simulations.

Index Terms: Noisy systems of linear equations; SOCP; $\ell_1$-optimization; compressed sensing.

1 Introduction

In recent years there has been an enormous interest in studying under-determined systems of linear equations with sparse solutions. With potential applications ranging from high-dimensional geometry, image reconstruction, single-pixel camera design, decoding of linear codes, channel estimation in wireless communications, to machine learning, data-streaming algorithms, DNA micro-arrays, magneto-encephalography etc. (see, e.g. [8, 26, 11, 49, 3, 18, 65, 46, 67, 42, 47, 38, 48, 56] and references therein), studying these systems seems to be of substantial theoretical/practical importance in a variety of different areas. In this paper we study mathematical aspects of under-determined systems and put an emphasis on the theoretical analysis of particular algorithms used for solving them.

In its simplest form solving an under-determined system of linear equations amounts to finding a, say, $k$-sparse $x$ such that

$Ax = y$    (1)

where $A$ is an $m \times n$ ($m < n$) matrix and $y$ is an $m \times 1$ vector (see Figure 1; here and in the rest of the paper, under a $k$-sparse vector we assume a vector that has at most $k$ nonzero components). Of course, the assumption will be that such an $x$ exists. To make writing in the rest of the paper easier, we will assume the so-called linear regime, i.e. we will assume that $k = \beta n$ and that the number of equations is $m = \alpha n$, where $\alpha$ and $\beta$ are constants independent of $n$ (more on the non-linear regime, i.e. on the regime when $m$ is larger than linearly proportional to $k$, can be found in e.g. [17, 30, 31]).

Figure 1: Model of a linear system; vector $x$ is $k$-sparse

Clearly, if $k$ is small enough the solution of (1) is unique and can be found through an exhaustive search. However, in the linear regime that we have just assumed above (and will consider throughout the paper) the exhaustive search is clearly of exponential complexity. Instead one can of course design algorithms of much lower complexity while sacrificing some of the recovery ability, i.e. the recoverable size of the nonzero portion of vector $x$. Various algorithms have been introduced and analyzed in recent years throughout the literature, from those that relate to the parallel design of the matrix $A$ and the recovery algorithms (see, e.g. [45, 1, 39, 68, 35, 34]) to those that assume only the design of the recovery algorithms (see, e.g. [62, 63, 44, 25, 43, 19, 10, 21]). If one restricts to the algorithms of polynomial complexity and allows for the design of $A$ then the results from [45, 1, 39] that guarantee recovery of any $k$-sparse $x$ in (1) for any $\alpha$ and any $\beta \leq \frac{\alpha}{2}$ are essentially optimal. On the other hand, if one restricts to the algorithms of polynomial complexity and does not allow for the parallel design of $A$ then the results of [10, 21] established that for any $0 < \alpha \leq 1$ there still exists a $\beta > 0$ such that any $k$-sparse $x$ in (1) can be recovered via a polynomial time Basis pursuit (BP) algorithm.

Since the BP algorithm is fairly closely related to what we will be presenting later in the paper we will now briefly introduce it (we will often refer to it as the $\ell_1$-optimization concept; a slight modification/adaptation of it will actually be the main topic of this paper). Variations of the standard $\ell_1$-optimization from e.g. [12, 16, 51], as well as those from [50, 28, 32] related to $\ell_q$-optimization, $0 \leq q \leq 1$, are possible as well; moreover, they can all be incorporated in what we will present below. The $\ell_1$-optimization concept suggests that one can maybe find the $k$-sparse $x$ in (1) by solving the following $\ell_1$-norm minimization problem

min $\|x\|_1$
subject to $Ax = y$    (2)

As mentioned above, the results from [10, 21] were instrumental in the theoretical characterization of (2) and its popularization in sparse recovery, and even more so in generating an unprecedented interest in sparse recovery algorithms. The main reason is of course the quality of the results achieved in [10, 21]. Namely, [10] established that for any $0 < \alpha \leq 1$ there is a $\beta > 0$ such that the solution of (2) is the $k$-sparse $x$ in (1). In a statistical and large dimensional context, in [21] and later in [57, 55], for any given value of $\alpha$ the exact value of the maximum possible $\beta$ was determined.
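Although the present paper is purely analytical, the $\ell_1$-minimization in (2) is easy to experiment with numerically. The following is a minimal sketch (an illustration only, with arbitrarily chosen dimensions and the cvxpy package assumed to be available); it is not part of the analysis in this paper.

```python
# Minimal numerical sketch of the Basis pursuit / l1-minimization from (2).
# Dimensions, sparsity and the random seed are illustrative assumptions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m, k = 200, 100, 20                     # ambient dimension, equations, sparsity

A = rng.standard_normal((m, n))            # random measurement matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                             # noiseless measurements, as in (1)

x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()

print("l2 recovery error:", np.linalg.norm(x.value - x_true))
```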

The above sparse recovery scenario is in a sense idealistic. Namely, it assumes that $y$ in (2) was obtained through (1). On the other hand, in many applications only a noisy version of $Ax$ may be available for $y$ (this is especially so in measuring type of applications), see, e.g. [10, 33, 66] (another somewhat related version of “imperfect” linear systems are the under-determined systems with the so-called approximately sparse solutions; more in this direction can be found in e.g. [10, 59]). When that happens one has the following equivalent to (1) (see Figure 2)

$y = Ax + v$    (3)

where $v$ is an $m \times 1$ so-called noise vector (the so-called ideal case presented above is of course a special case of the noisy one given in (3)).

Figure 2: Model of a linear system; vector $x$ is $k$-sparse

Finding the $k$-sparse $x$ in (3) is now incredibly hard. In fact, it is pretty much impossible. Basically, one is looking for a $k$-sparse $x$ such that (3) holds, and on top of that $v$ is unknown. Although the problem is hard there are various heuristics throughout the literature that one can use to solve it approximately. The majority of these heuristics are based on appropriate generalizations of the corresponding algorithms one would use in the noiseless case. Thinking along the same lines as in the noiseless case one can distinguish two scenarios depending on the availability of the freedom to choose/design $A$. If one has the freedom to design $A$ then one can adapt the corresponding noiseless algorithms to the noisy scenario as well (more on this can be found in e.g. [5]). However, in this paper we mostly focus on the scenario where one has no control over $A$. In such a scenario one can again make a parallel to the noiseless case and look at e.g. the CoSaMP algorithm from [43] or the Subspace pursuit from [20]. Essentially, in a statistical context, these algorithms can provably recover a linear sparsity while maintaining the approximation error proportional to the $\ell_2$-norm of the noise vector, which is pretty much the benchmark of what is currently known. These algorithms are in a way perfected noisy generalizations of the so-called Orthogonal matching pursuit (OMP) algorithms.
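For a flavor of the greedy algorithms just mentioned, here is a minimal Orthogonal matching pursuit sketch; it is only an illustration (it is neither CoSaMP from [43] nor the Subspace pursuit from [20]) and plays no role in the analysis that follows.

```python
# Minimal Orthogonal matching pursuit (OMP) sketch; purely illustrative.
import numpy as np

def omp(A, y, k):
    """Greedily build a k-sparse approximation x_hat with y ~ A @ x_hat."""
    m, n = A.shape
    support, residual = [], y.copy()
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # most correlated column
        if j not in support:
            support.append(j)
        # least-squares fit on the current support, then update the residual
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat
```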

On the other hand, in this paper we will focus on generalizations of BP that can handle the noisy case. To introduce a bit of tractability in finding the $k$-sparse $x$ in (3) one usually assumes a certain amount of knowledge about either $x$ or $v$. As far as the tractability assumptions on $v$ are concerned, one typically (and possibly fairly reasonably in applications of interest) assumes that $\|v\|_2$ is bounded (or highly likely to be bounded) from above by a certain known quantity. The following second-order cone programming (SOCP) analogue to (or, say, noisy generalization of) (2) is one of the approaches that utilizes such an assumption (more on this approach and its variations can be found in e.g. [10])

min $\|x\|_1$
subject to $\|y - Ax\|_2 \leq r$    (4)

where $r$ is a quantity such that $\|v\|_2 \leq r$ (or a quantity such that $\|v\|_2 \leq r$ is, say, highly likely). For example, in [10] a statistical context is assumed and, based on the statistics of $v$, $r$ was chosen such that $\|v\|_2 \leq r$ happens with overwhelming probability (as usual, under overwhelming probability we in this paper assume a probability that is no more than a number exponentially decaying in $n$ away from $1$). Given that (4) is now among the few almost standard choices when it comes to finding an approximation to the $k$-sparse $x$ in (3), the literature on its properties is vast (see, e.g. [10, 24, 61, 33] and references therein). We here briefly mention only what we consider to be the most influential work on this topic in recent years. Namely, in [10] the authors analyzed the performance of (4) and showed a result similar in flavor to the one that holds in the ideal, noiseless, case. In a nutshell, the following was shown in [10]: let $\tilde{x}$ be a $k$-sparse vector such that (3) holds and let $\hat{x}$ be the solution of (4). Then

$\|\hat{x} - \tilde{x}\|_2 \leq C r$    (5)

where $r$ is a constant independent of $n$ and $C$ is a constant independent of $n$ and of course dependent on $\alpha$ and $\beta$. This result in a sense establishes a noisy equivalent to the fact that a linear sparsity can be recovered from an under-determined system of linear equations. In informal language, it states that a linear sparsity can be approximately recovered in polynomial time from a noisy under-determined system with the norm of the recovery error guaranteed to be within a constant multiple of the noise norm (as mentioned above, the same was also established later in [43] for CoSaMP and in [20] for Subspace pursuit). Establishing such a result is, of course, a feat in its own class, not only because of its technical contribution but even more so because of the amount of interest that it generated in the field.
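As a quick numerical sanity check of (4) and (5), one can solve the SOCP with a generic convex solver and compare $\|\hat{x}-\tilde{x}\|_2$ with the radius $r$. The sketch below does exactly that; the dimensions, noise level and the factor $1.1$ in the choice of $r$ are illustrative assumptions (roughly mimicking a choice for which $\|v\|_2 \leq r$ is highly likely) and not the precise choices analyzed in [10].

```python
# Minimal sketch of the SOCP from (4) applied to a noisy instance (3).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, m, k, sigma = 200, 100, 20, 0.1

A = rng.standard_normal((m, n))
x_tilde = np.zeros(n)
x_tilde[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
v = sigma * rng.standard_normal(m)
y = A @ x_tilde + v                         # noisy measurements, as in (3)

r = 1.1 * sigma * np.sqrt(m)                # heuristic radius; ||v||_2 ~ sigma*sqrt(m)

x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm2(y - A @ x) <= r]).solve()

print("||x_hat - x_tilde||_2 =", np.linalg.norm(x.value - x_tilde), " r =", r)
```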

In our recent work [53] we designed a framework for performance analysis of the SOCP algorithm from (4). We then went further in [53] and showed how the framework practically works through a precise characterization of the generic (worst-case) performance of the SOCP from (4). In this paper we will again focus on the general framework developed in [53]. This time though we will focus on the problem dependent performance analysis of the SOCP. In other words, we will consider specific types of unknown sparse vectors in (3) and provide a performance analysis of the SOCP when applied for recovery of such vectors.

Before going into the details of the SOCP approach we should also mention that the SOCP algorithms are of course not the only possible generalizations (adaptations) of $\ell_1$-optimization to the noisy case. For example, LASSO algorithms (more on these algorithms can be found in e.g. [60, 14, 15, 7, 64, 41] as well as in recent developments [22, 4, 52]) are a very successful alternative. In our recent work [52] we established a nice connection between some of the algorithms from the LASSO group and certain SOCP algorithms and showed that with respect to a certain performance measure they can be equivalent. Besides the LASSO algorithms, the so-called Dantzig selector introduced in [13] is another alternative to the SOCP algorithms that is often encountered in the literature (more on the Dantzig selector as well as on its relation to the LASSO or SOCP algorithms can be found in e.g. [40, 6, 29, 27, 2, 36, 37]). Depending on the scenarios they are to be applied in, each of these algorithms can have certain advantages/disadvantages over the other ones. A simple (general) characterization of these advantages/disadvantages does not seem easy to us. In rather informal language, one could say that LASSO and SOCP are expected to perform better (i.e. to provide a solution vector that is under various metrics closer to the unknown one) in a larger set of different scenarios but, as quadratic programs, could be slower than the Dantzig selector, which happens to be a linear program. Of course, whether the LASSO or SOCP algorithms are indeed going to be slower or not, or how much larger the set of different scenarios where they are expected to perform better would be, are interesting/important questions. While clearly of interest, answering these questions certainly goes way beyond the scope of the present paper and we will not pursue it here any further.
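To make the comparison above a bit more concrete, the following sketch writes the three formulations (SOCP, LASSO, Dantzig selector) side by side as convex programs; the data and the parameters $r$, $\lambda$ and $\delta$ are arbitrary illustrative choices, not the tunings studied in [10, 13, 52].

```python
# The three noisy-case formulations mentioned above, written side by side.
# Parameters r, lam and delta are illustrative assumptions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n = 100, 200
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x = cp.Variable(n)
r, lam, delta = 1.0, 0.1, 0.1

socp = cp.Problem(cp.Minimize(cp.norm1(x)),
                  [cp.norm2(y - A @ x) <= r])                        # as in (4)
lasso = cp.Problem(cp.Minimize(cp.sum_squares(y - A @ x) + lam * cp.norm1(x)))
dantzig = cp.Problem(cp.Minimize(cp.norm1(x)),
                     [cp.norm(A.T @ (y - A @ x), "inf") <= delta])   # a linear program

for name, prob in [("SOCP", socp), ("LASSO", lasso), ("Dantzig", dantzig)]:
    prob.solve()
    print(name, "objective:", prob.value)
```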

Before we proceed with the exposition we briefly summarize the organization of the rest of the paper. In Section 2, we recall a set of powerful results presented in [53] and discuss further how they can be utilized to analyze the problem dependent performance of the SOCP from (4). The results that we will present in Section 2 will relate to the so-called general sparse signals $\tilde{x}$. In Section 3 we will show how these results from Section 2 that relate to the general sparse vectors can be specialized further so that they cover the so-called signed vectors $\tilde{x}$. Finally, in Section 4 we will discuss the obtained results.

2 SOCP’s problem dependent performance – general

In this section we first recall the basic properties of the statistical SOCP performance analysis framework developed in [53]. We will then show how the framework can be used to characterize the SOCP from (4) in certain situations when the SOCP's performance substantially depends on $\tilde{x}$.

2.1 Basic properties of the SOCP’s framework

Before proceeding further we will now first state the major assumptions that will be in place throughout the paper (clearly, since we will be utilizing the framework of [53], a majority of these assumptions was already present in [53]). Namely, as mentioned above, we will consider noisy under-determined systems of linear equations. The systems will be defined by a random matrix $A$ where the elements of $A$ are i.i.d. standard normal random variables. Also, we will assume that the elements of the noise vector $v$ are i.i.d. Gaussian random variables with zero mean and variance $\sigma^2$. $\tilde{x}$ will be assumed to be the original $x$ in (3) that we are trying to recover (or, a bit more precisely, approximately recover). We will assume that $\tilde{x}$ is any $k$-sparse vector with a given fixed location of its nonzero elements and a given fixed combination of their signs. Due to the assumed statistics, the analysis (and the performance of (4)) will clearly be irrelevant with respect to what particular location and what particular combination of signs of the nonzero elements are chosen. We will therefore, for the simplicity of the exposition and without loss of generality, assume that the first $n-k$ components of $\tilde{x}$ are equal to zero and the remaining $k$ components of $\tilde{x}$ are greater than or equal to zero. Moreover, throughout the paper we will call such an $\tilde{x}$ $k$-sparse and positive. In a more formal way we will set

$\tilde{x}_1 = \tilde{x}_2 = \dots = \tilde{x}_{n-k} = 0, \quad \tilde{x}_{n-k+1} \geq 0, \; \tilde{x}_{n-k+2} \geq 0, \; \dots, \; \tilde{x}_n \geq 0$    (6)

We also now take the opportunity to point out a rather obvious detail. Namely, the fact that $\tilde{x}$ is positive is assumed for the purpose of the analysis. However, this fact is not known a priori and is not available to the solving algorithm (this will of course change in Section 3).
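For concreteness, a small sketch of the statistical model just described (i.i.d. standard normal $A$, i.i.d. zero-mean variance $\sigma^2$ Gaussian noise, and a positive $k$-sparse $\tilde{x}$ whose first $n-k$ components are zero, as in (6)) is given below; the magnitudes of the nonzero components used here are an arbitrary illustrative choice.

```python
# Sketch of the statistical model assumed in this section; illustrative only.
import numpy as np

def generate_instance(n, m, k, sigma, rng=None):
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((m, n))                    # i.i.d. standard normal matrix
    x_tilde = np.zeros(n)
    x_tilde[n - k:] = np.abs(rng.standard_normal(k))   # last k components nonnegative, as in (6)
    v = sigma * rng.standard_normal(m)                 # i.i.d. N(0, sigma^2) noise
    y = A @ x_tilde + v                                # noisy measurements, as in (3)
    return A, x_tilde, y
```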

Before proceeding further we will introduce a few definitions that will be useful in formalizing/presenting our results as well as in conducting the entire analysis. Following what was done in [53] let us define the optimal value of a slightly changed objective of (4) in the following way

subject to (7)

To make the writing easier we will typically drop the arguments and write just the name of the corresponding function. A similar convention will be applied to a few other functions throughout the paper. On many occasions, though (especially where we deem it substantial to the presentation), we will also keep all (or a majority of) the arguments of the corresponding functions.

Also, let $\hat{x}$ be the solution of (4) (or, equivalently, the solution of (7)) and let $w$ be the so-called error vector defined in the following way

$w = \hat{x} - \tilde{x}$    (8)

Our main goal in this paper will then boil down to various characterizations of the optimal value of (7) and of the error vector $w$. Throughout the paper we will heavily rely on the following theorem from [53] that provides a general characterization of exactly these quantities.

Theorem 1.

([53] — SOCP’s performance characterization) Let $v$ be an $m \times 1$ vector of i.i.d. zero-mean, variance $\sigma^2$ Gaussian random variables and let $A$ be an $m \times n$ matrix of i.i.d. standard normal random variables. Further, let $g$ and $h$ be $m \times 1$ and $n \times 1$ vectors of i.i.d. standard normals, respectively, and let $\textbf{1}$ be a vector of all ones. Consider a $k$-sparse $\tilde{x}$ defined in (6) and a $y$ defined in (3) for $x = \tilde{x}$. Let the solution of (4) be $\hat{x}$ and let the so-called error vector of the SOCP from (4) be $w = \hat{x} - \tilde{x}$. Let $r$ in (4) be a positive scalar. Let $n$ be large and let the constants $\alpha = \frac{m}{n}$ and $\beta_w = \frac{k}{n}$ be below the following so-called fundamental characterization of $\ell_1$-optimization

(9)

Consider the following optimization problem:

subject to (10)

Let and be the solution of (10). Set

(11)

Then:

(12)

and

(13)

where $\epsilon$ is an arbitrarily small constant and the other constant is dependent on $\epsilon$ and $\sigma$ but independent of $n$.

Proof.

Presented in [53]. ∎

Remark: A pair $(\alpha, \beta_w)$ lies below the fundamental characterization (9) if $\alpha$ and $\beta_w$ are such that (9) holds.

2.2 Problem dependent properties of the framework

To facilitate the exposition that will follow we similarly to what was done in [57, 54, 53] set

(14)

where the former are the magnitudes of the components of $\tilde{x}$ sorted in increasing order and the latter are the elements of $h$ sorted in decreasing order (possible ties in the sorting processes are of course broken arbitrarily). One can then rewrite the optimization problem from (10) in the following way

subject to (15)

In what follows we will restrict our attention to a specific class of unknown vectors $\tilde{x}$. Namely, we will consider vectors $\tilde{x}$ whose nonzero components all have magnitude equal to $1$. In the noiseless case these problem instances are typically the hardest to solve (at least as long as one uses the $\ell_1$-optimization from (2)). We will again emphasize that the fact that the magnitudes of the nonzero elements of $\tilde{x}$ are equal to $1$ is not known a priori and can not be used in the solving algorithm (i.e. one can not add constraints that would exploit this knowledge to the optimization problem (4)). It is just that we will consider how the SOCP from (4) behaves when used to solve problem instances generated by such an $\tilde{x}$. Also, for such an $\tilde{x}$, (15) can be rewritten in the following way

subject to (16)

Now, let and be the solution of (16). Then analogously to (11) we can set

(17)

In what follows we will determine these quantities, or more precisely, their concentrating points. All other parameters that appear along the way can (and some of them will) be computed through the framework as well. We do, however, mention right here that what we present below assumes a fair share of familiarity with the techniques introduced in our earlier papers [57, 52, 53]. To shorten the exposition we will skip many details presented in those papers and present only the key differences.
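Before turning to the analytical computation, we note that the concentrating points discussed here can also be estimated empirically. The sketch below runs a small Monte Carlo experiment for the class of $\tilde{x}$ considered in this subsection (all nonzero components set to a common magnitude, taken to be $1$ here for illustration); the dimensions, noise level and radius are again illustrative assumptions, and such simulations are only meant as a sanity check of the theoretical predictions.

```python
# Rough Monte Carlo estimate of the concentrating point of ||w||_2 = ||x_hat - x_tilde||_2
# for unit-magnitude positive k-sparse x_tilde; all parameters are illustrative.
import numpy as np
import cvxpy as cp

def socp_error(n, alpha, beta, sigma, r_scale=1.1, rng=None):
    rng = rng or np.random.default_rng()
    m, k = int(alpha * n), int(beta * n)
    A = rng.standard_normal((m, n))
    x_tilde = np.zeros(n)
    x_tilde[n - k:] = 1.0                                # common (unit) nonzero magnitude
    y = A @ x_tilde + sigma * rng.standard_normal(m)
    r = r_scale * sigma * np.sqrt(m)
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm2(y - A @ x) <= r]).solve()
    return np.linalg.norm(x.value - x_tilde)

errs = [socp_error(n=200, alpha=0.5, beta=0.1, sigma=0.1) for _ in range(20)]
print("empirical mean / std of ||w||_2:", np.mean(errs), np.std(errs))
```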

We proceed by following the line of thought presented in [57, 53]. Since is the solution of (16) there will be parameters , , and such that

and obviously , , and . At this point let us assume that these parameters are known and fixed. Then following [57, 53] the optimization problem from (16) can be rewritten in the following way

subject to (18)

To make writing of what will follow somewhat easier we set

(19)

We then proceed by solving the optimization in (18) over the inner optimization variables. To do so we first look at the derivatives of the objective in (18) with respect to those variables. Computing the derivatives and equating them to zero gives

From the second to last line in the above equation one then has

(21)

and after an easy algebraic transformation

(22)

Using (22) we further have

(23)

Plugging the value for from (18) in (19) gives

(24)

Combining (23) and (24) we finally obtain

(25)

Equating the derivative of the objective with respect to the remaining variable to zero further gives

Let

(27)

Then combining (26) and (27) one obtains

(28)

After solving (28) over we have

(29)

Following what was done in [57, 53], we have that a combination of (20) and (29) gives the following three equations that can be used to determine the remaining parameters (strictly speaking, the equations are inequalities; since we will assume a large dimensional scenario we will, instead of any of the inequalities below, write equalities; this will make the writing much easier).

The last term that appears on the right hand side of the first two of the above equations can be further simplified based on (23) in the following way

(31)

where we of course recognized that . Combining (27) and (31) one can then simplify the equations from (30) in the following way