(1+\epsilon)-approximate Sparse Recovery

Abstract

The problem central to sparse recovery and compressive sensing is that of stable sparse recovery: we want a distribution $\mathcal{A}$ of matrices $A \in \mathbb{R}^{m \times n}$ such that, for any $x \in \mathbb{R}^n$ and with probability $1 - \delta$ over $A \in \mathcal{A}$, there is an algorithm to recover $\hat{x}$ from $Ax$ with

\[ \|\hat{x} - x\|_p \le (1+\epsilon) \min_{k\text{-sparse } x'} \|x - x'\|_p \qquad (1) \]

for some constant $\epsilon$ and norm $p$.

The measurement complexity of this problem is well understood for constant $\epsilon$. However, in a variety of applications it is important to obtain $\epsilon = o(1)$, and this complexity is not well understood. We resolve the dependence on $\epsilon$ in the number of measurements required of a $k$-sparse recovery algorithm, up to polylogarithmic factors for the central cases of $p = 1$ and $p = 2$. Namely, we give new algorithms and lower bounds that show the number of measurements required is $k\,\epsilon^{-p/2}\,\mathrm{polylog}(n)$. For $p = 2$, our bound of $\frac{1}{\epsilon}k\log(n/k)$ is tight up to constant factors. We also give matching bounds when the output is required to be $k$-sparse, in which case we achieve $k\,\epsilon^{-p}\,\mathrm{polylog}(n)$. This shows that the distinction between the complexity of sparse and non-sparse outputs is fundamental.

1 Introduction

Over the last several years, substantial interest has been generated in the problem of solving underdetermined linear systems subject to a sparsity constraint. The field, known as compressed sensing or sparse recovery, has applications to a wide variety of areas, including data stream algorithms [Mut05], medical or geological imaging [CRT06, Don06], and genetics testing [SAZ10]. The approach uses the power of a sparsity constraint: a vector $x$ is $k$-sparse if at most $k$ coefficients are non-zero. A standard formulation for the problem is that of stable sparse recovery: we want a distribution $\mathcal{A}$ of matrices $A \in \mathbb{R}^{m \times n}$ such that, for any $x \in \mathbb{R}^n$ and with probability $1 - \delta$ over $A \in \mathcal{A}$, there is an algorithm to recover $\hat{x}$ from $Ax$ with

\[ \|\hat{x} - x\|_p \le (1+\epsilon) \min_{k\text{-sparse } x'} \|x - x'\|_p \qquad (2) \]

for some constant $\epsilon$ and norm $p$. We call this a $(1+\epsilon)$-approximate recovery scheme with failure probability $\delta$. We refer to the elements of $Ax$ as measurements.

It is known [CRT06, GLPS10] that such recovery schemes exist for $p \in \{1, 2\}$ with $\epsilon = \Theta(1)$ and $m = O(k \log(n/k))$. Furthermore, it is known [DIPW10, FPRU10] that any such recovery scheme requires $\Omega(k \log(n/k))$ measurements. This means the measurement complexity is well understood for constant $\epsilon$, but not for small $\epsilon$.

A number of applications would like to have $\epsilon = o(1)$. For example, a radio wave signal can be modeled as $x = x^* + w$, where $x^*$ is $k$-sparse (corresponding to a signal over a narrow band) and the noise $w$ is i.i.d. Gaussian [TDB09]. Then sparse recovery with a small $\epsilon$ allows the recovery of a large fraction of the true signal $x^*$. Since $x^*$ is concentrated in a small band while $w$ is spread over a large region, the relative noise level, and hence the $\epsilon$ one needs, is often quite small.

The difficulty of $(1+\epsilon)$-approximate recovery has seemed to depend on whether the output $\hat{x}$ is required to be $k$-sparse or can have more than $k$ elements in its support. Having $k$-sparse output is important for some applications (e.g. the aforementioned radio waves) but not for others (e.g. imaging). Algorithms that output a $k$-sparse $\hat{x}$ have used $O(\frac{1}{\epsilon^p}k\log n)$ measurements [CCF02, CM04, CM06, Wai09]. In contrast, [GLPS10] uses only $O(\frac{1}{\epsilon}k\log(n/k))$ measurements for $p = 2$ and outputs a non-$k$-sparse $\hat{x}$.

  • $k$-sparse output, $\ell_1$: lower bound $\tilde{\Omega}(k/\epsilon)$ (this paper); upper bound $O(\frac{1}{\epsilon}k\log n)$ [CM04]

  • $k$-sparse output, $\ell_2$: lower bound $\tilde{\Omega}(k/\epsilon^2)$ (this paper); upper bound $O(\frac{1}{\epsilon^2}k\log n)$ [CCF02, CM06, Wai09]

  • Non-$k$-sparse output, $\ell_1$: lower bound $\tilde{\Omega}(k/\sqrt{\epsilon})$ (this paper); upper bound $\frac{k}{\sqrt{\epsilon}}\,\mathrm{polylog}(n)$ (this paper)

  • Non-$k$-sparse output, $\ell_2$: lower bound $\Omega(\frac{1}{\epsilon}k\log(n/k))$ (this paper); upper bound $O(\frac{1}{\epsilon}k\log(n/k))$ [GLPS10]

Figure 1: Our results, along with existing upper bounds (bounds stated up to the polylogarithmic factors detailed in the theorem statements). Fairly minor restrictions on the relative magnitude of parameters apply; see the theorem statements for details.

Our results

We show that the apparent distinction between the complexity of sparse and non-sparse outputs is fundamental, for both $p = 1$ and $p = 2$. We show that for sparse output, $\tilde{\Omega}(k/\epsilon^p)$ measurements are necessary, matching the upper bounds up to a polylogarithmic factor. For general output and $p = 2$, we show that $\Omega(\frac{1}{\epsilon}k\log(n/k))$ measurements are necessary, matching the upper bound up to a constant factor. In the remaining case of general output and $p = 1$, we show that $\tilde{\Omega}(k/\sqrt{\epsilon})$ measurements are necessary. We then give a novel algorithm that uses $\frac{k}{\sqrt{\epsilon}}\,\mathrm{polylog}(n)$ measurements, beating the $1/\epsilon$ dependence given by all previous algorithms. As a result, all our bounds are tight up to factors logarithmic in $n$. The full results are shown in Figure 1.

In addition, for $p = 2$ and general output, we show that thresholding the top $2k$ elements of a Count-Sketch [CCF02] estimate gives $(1+\epsilon)$-approximate recovery with $O(\frac{1}{\epsilon}k\log n)$ measurements. This is interesting because it highlights the distinction between sparse output and non-sparse output: [CM06] showed that thresholding the top $k$ elements of a Count-Sketch estimate requires $\Theta(\frac{1}{\epsilon^2}k\log n)$ measurements. While [GLPS10] achieves $O(\frac{1}{\epsilon}k\log(n/k))$ measurements for the same regime, it only succeeds with constant probability while ours succeeds with probability $1 - n^{-\Omega(1)}$; hence ours is the most efficient known algorithm when sub-constant failure probability is required.

Related work

Much of the work on sparse recovery has relied on the Restricted Isometry Property [CRT06]. None of this work has been able to do better than a constant approximation factor, so there are relatively few papers achieving $(1+\epsilon)$-approximate recovery. The existing ones are surveyed above (except for [IR08], which has a worse dependence on $\epsilon$ than [CM04] for the same regime).

A couple of previous works have studied the $\ell_\infty$ variant of the problem, in which every coordinate must be estimated with small error. This problem is harder than sparse recovery with sparse output. [Wai09] showed a lower bound on the number of measurements required by schemes using Gaussian matrices, and [CM05] showed a lower bound on the number of bits (rather than measurements) that any sketch requires.

Independently of this work and of each other, multiple authors [CD11, IT10, ASZ10] have matched our lower bound for $p = 2$ in related settings. The details vary, but all proofs are broadly similar in structure to ours: they consider observing a large set of “well-separated” vectors under Gaussian noise. Fano’s inequality gives a lower bound on the mutual information between the observation and the signal; then, an upper bound on the mutual information is given by either the Shannon-Hartley theorem or a KL-divergence argument. This technique does not seem useful for the other problems we consider in this paper, such as lower bounds for $p = 1$ or for the sparse output setting.

Our techniques

For the upper bounds for non-sparse output, we observe that the hard case for sparse output is when the noise is fairly concentrated, in which case the estimation of the top $k$ elements can have significant error. Our goal is to recover enough mass from outside the top $k$ elements to cancel this error. The upper bound for $p = 2$ is a fairly straightforward analysis of the top $2k$ elements of a Count-Sketch data structure.

The upper bound for $p = 1$ proceeds by subsampling the vector at a geometric sequence of rates and performing a Count-Sketch of an appropriately chosen size on each subsample. The intuition is that if the noise is well spread over many coordinates, then the $\ell_2$ bound from the first (unsubsampled) Count-Sketch already gives a very good $\ell_1$ bound, so the output is $(1+\epsilon)$-approximate. However, if the noise is concentrated on a small number of coordinates, then the error from the first Count-Sketch is proportionally larger. But in this case, one of the subsamples will contain only a few of the coordinates carrying large noise. We can then recover those coordinates with the Count-Sketch for that subsample. Those coordinates contain a large enough fraction of the total noise that recovering them cancels the error induced by the first Count-Sketch.

The lower bounds use substantially different techniques for sparse output and for non-sparse output. For sparse output, we use reductions from communication complexity to show a lower bound in terms of bits. Then, as in [DIPW10], we embed several copies of this communication problem into a single vector. This multiplies the bit complexity by the number of copies; we also show that we can round $A$ to $O(\log n)$ bits per measurement without affecting recovery, giving a lower bound in terms of measurements.
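As a sanity check on this conversion (a sketch with illustrative constants, not the precise accounting of the proofs): if solving the embedded problem requires $b$ bits in total and each of the $m$ measurements can be rounded to $O(\log n)$ bits without affecting recovery, then the sketch $Ax$ can be described in $m \cdot O(\log n)$ bits, so

\[ m \cdot O(\log n) \ge b \qquad\Longrightarrow\qquad m = \Omega\!\left(\frac{b}{\log n}\right). \]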

We illustrate the lower bound on bit complexity for sparse output in the simplest case, $k = 1$ and $p = 2$. Consider a vector that encodes a bit string $y$ in a block of small coordinates. For any index $i$ of that block, we can plant a single dominant spike at position $i$; successful $(1+\epsilon)$-approximate sparse recovery must then estimate coordinate $i$ accurately enough to reveal $y_i$. Hence we can recover each bit of $y$ with constant probability, requiring a number of bits linear in the length of $y$. We can generalize this to $k$-sparse output, and separately to failure probability $\delta$. However, the two generalizations do not seem to combine.

For non-sparse output, we split between $p = 2$ and $p = 1$. For $p = 2$, we consider $x = z + w$ where $z$ is $k$-sparse and $w$ is uniform Gaussian noise with $\|w\|_2^2 \approx \|z\|_2^2/\epsilon$. Then each measurement of $x$ behaves like a Gaussian channel with signal-to-noise ratio $\Theta(\epsilon)$. This channel has capacity $O(\epsilon)$ bits per measurement, showing $I(z; Ax) = O(\epsilon m)$. Correct sparse recovery must either recover most of $z$ or recover an $\epsilon$ fraction of $w$; the latter requires $m = \Omega(\epsilon n)$ and the former requires $\epsilon m = \Omega(k \log(n/k))$. This gives a tight result. Unfortunately, this does not easily extend to $p = 1$, because it relies on the Gaussian distribution being both stable and of maximum entropy under an $\ell_2$ constraint; the corresponding distributions for $\ell_1$ (Cauchy and Laplace, respectively) are not the same.
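A back-of-the-envelope version of this argument, under the assumption that the per-measurement signal-to-noise ratio is $\Theta(\epsilon)$ as described above (the constants are illustrative):

\[ I(z; Ax) \;\le\; m \cdot \tfrac{1}{2}\log_2\!\big(1 + \Theta(\epsilon)\big) \;=\; O(\epsilon m), \qquad \text{while recovering most of } z \text{ requires } I(z; Ax) = \Omega\!\big(k \log(n/k)\big), \]

so $m = \Omega\!\big(\tfrac{1}{\epsilon} k \log(n/k)\big)$.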

Therefore, for non-sparse output and $p = 1$, we have yet another argument. The hard instances for $p = 1$ must have one large value (or else $\hat{x} = 0$ is a valid output) but small other values (or else the best $2$-sparse approximation would be significantly better than the best $1$-sparse approximation). Suppose $x$ has one large value and $d$ small values of equal magnitude spread through a vector of size $n$. Then a $(1+\epsilon)$-approximate recovery scheme must either locate the large element or guess the locations of the small values with more correct than incorrect. The former requires $\Omega(1/\sqrt{\epsilon})$ bits by the difficulty of a novel version of the Gap-$\ell_\infty$ problem. The latter requires $\Omega(d)$ bits because it allows recovering an error correcting code. Setting $d = \Theta(1/\sqrt{\epsilon})$ balances the terms. Because some of these reductions are very intricate, this extended abstract does not manage to embed multiple copies of the problem into a single vector. As a result, we lose a logarithmic factor in the universe size when converting from bit complexity to measurement complexity.

2 Preliminaries

Notation

We use $[n]$ to denote the set $\{1, \dots, n\}$. For any set $S \subseteq [n]$, we use $\bar{S}$ to denote the complement of $S$, i.e., the set $[n] \setminus S$. For any $x \in \mathbb{R}^n$, $x_i$ denotes the $i$th coordinate of $x$, and $x_S$ denotes the vector $x' \in \mathbb{R}^n$ given by $x'_i = x_i$ if $i \in S$, and $x'_i = 0$ otherwise. We use $\mathrm{supp}(x)$ to denote the support of $x$.
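For concreteness, here are small Python helpers matching this notation; they are our own illustrative definitions (used only in the code sketches below), not part of the paper.

    import numpy as np

    def restrict(x, S):
        """x_S: the vector equal to x on the index set S and zero elsewhere."""
        out = np.zeros_like(x)
        idx = list(S)
        out[idx] = x[idx]
        return out

    def supp(x):
        """supp(x): the set of indices whose coordinates are non-zero."""
        return set(np.flatnonzero(x).tolist())

    def tail_norm(x, k, p=2):
        """The l_p norm of x with its k largest-magnitude entries removed."""
        rest = np.argsort(-np.abs(x))[k:]
        return np.linalg.norm(x[rest], ord=p)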

3 Upper bounds

The algorithms in this section are indifferent to permutation of the coordinates. Therefore, for simplicity of notation in the analysis, we assume the coefficients of $x$ are sorted such that $|x_1| \ge |x_2| \ge \cdots \ge |x_n| \ge 0$.

Count-Sketch

Both our upper bounds use the Count-Sketch [CCF02] data structure. The structure consists of $O(\log n)$ hash tables of size $O(q)$, for $O(q \log n)$ total space; it can be represented as $Ax$ for a matrix $A$ with $O(q \log n)$ rows. Given $Ax$, one can construct $x^*$ with

\[ \|x^* - x\|_\infty^2 \le \frac{1}{q}\,\|x_{\overline{[q]}}\|_2^2 \qquad (3) \]

with failure probability $n^{-\Omega(1)}$.
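A minimal Python sketch of this data structure, assuming the standard Count-Sketch construction (one random bucket and one random sign per coordinate in each of $O(\log n)$ rows, decoded by a median of signed bucket values); the class name and parameters are ours:

    import numpy as np

    class CountSketch:
        """Count-Sketch with n_tables hash tables, each of size table_size."""

        def __init__(self, n, table_size, n_tables, seed=0):
            rng = np.random.default_rng(seed)
            # bucket[t, i] and sign[t, i]: where coordinate i lands in table t, and with what sign.
            self.bucket = rng.integers(0, table_size, size=(n_tables, n))
            self.sign = rng.choice([-1.0, 1.0], size=(n_tables, n))
            self.n, self.table_size, self.n_tables = n, table_size, n_tables

        def sketch(self, x):
            """The measurements Ax: n_tables * table_size signed bucket sums."""
            tables = np.zeros((self.n_tables, self.table_size))
            for t in range(self.n_tables):
                np.add.at(tables[t], self.bucket[t], self.sign[t] * x)
            return tables

        def estimate(self, tables):
            """The estimate x*: for each coordinate, the median over tables of its signed bucket."""
            per_table = self.sign * tables[np.arange(self.n_tables)[:, None], self.bucket]
            return np.median(per_table, axis=0)

With table_size on the order of $q$ and n_tables on the order of $\log n$, each coordinate's estimate should obey a guarantee of the form (3) with high probability.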

3.1 Non-sparse $\ell_2$

It was shown in [CM06] that, if $x^*$ is the result of a Count-Sketch with hash table size $O(k/\epsilon^2)$, then outputting the top $k$ elements of $x^*$ gives a $(1+\epsilon)$-approximate recovery scheme. Here we show that a seemingly minor change (selecting $2k$ elements rather than $k$ elements) gives a $(1+\epsilon)$-approximate recovery scheme from a Count-Sketch with hash table size only $O(k/\epsilon)$.

Theorem 3.1.

Let $\hat{x}$ be the top $2k$ estimates from a Count-Sketch structure with hash table size $O(k/\epsilon)$. Then with failure probability $n^{-\Omega(1)}$,

\[ \|\hat{x} - x\|_2^2 \le (1+\epsilon)\,\|x_{\overline{[k]}}\|_2^2 . \]

Therefore, there is a $(1+\epsilon)$-approximate recovery scheme with $O(\frac{1}{\epsilon}k\log n)$ rows.

Proof.

Let the hash table size be $Ck/\epsilon$ for a sufficiently large constant $C$, and let $x^*$ be the vector of estimates for each coordinate. Define $S$ to be the indices of the $2k$ largest values in $x^*$, so that $\hat{x} = x^*_S$, and let $\mathrm{Err}^2 = \|x_{\overline{[k]}}\|_2^2$.

By (3), the standard analysis of Count-Sketch:

\[ \|x^* - x\|_\infty^2 \le \frac{\epsilon}{Ck}\,\|x_{\overline{[Ck/\epsilon]}}\|_2^2 \le \frac{\epsilon}{Ck}\,\mathrm{Err}^2, \]

so

\[ \|x^* - x\|_\infty \le \delta, \qquad \text{where } \delta^2 := \frac{\epsilon}{Ck}\,\mathrm{Err}^2. \qquad (4) \]

Let $T = [k] \setminus S$ be the set of top-$k$ indices that are not selected, and let $v = \max_{i \in T} |x_i|$ (if $T$ is empty there is nothing to prove). Since $|S| = 2k$, the set $S \setminus [k]$ has $k + |T|$ elements; match each $i \in T$ with a distinct $j(i) \in S \setminus [k]$ and let $U$ be the $k$ remaining elements of $S \setminus [k]$. The algorithm only passes over an element of value $|x_i|$ to choose one of estimated value at least as large, hence of true value at least $|x_i| - 2\delta$, so

\[ \|x_{\bar{S}}\|_2^2 - \mathrm{Err}^2 = \sum_{i \in T} |x_i|^2 - \sum_{j \in S \setminus [k]} |x_j|^2 \le \sum_{i \in T}\Big(|x_i|^2 - \max(|x_i| - 2\delta, 0)^2\Big) - \sum_{j \in U}|x_j|^2 \le 4k\delta v - k\,\max(v - 2\delta, 0)^2 \le 12k\delta^2. \]

Then

\[ \|\hat{x} - x\|_2^2 = \|x_{\bar{S}}\|_2^2 + \|(x^* - x)_S\|_2^2 \le \|x_{\bar{S}}\|_2^2 + 2k\delta^2, \]

and combining this with (4) gives

\[ \|\hat{x} - x\|_2^2 \le \mathrm{Err}^2 + 14k\delta^2 = \Big(1 + \frac{14\epsilon}{C}\Big)\mathrm{Err}^2, \]

or

\[ \|\hat{x} - x\|_2^2 \le (1+\epsilon)\,\|x_{\overline{[k]}}\|_2^2 \]

for $C \ge 14$, which proves the theorem. ∎
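As a usage sketch of the scheme in Theorem 3.1, reusing the CountSketch class above (the concrete constants stand in for the $O(\cdot)$'s in the theorem and are not tuned):

    import numpy as np

    def recover_top_2k(x, k, eps, seed=0):
        """l2 recovery sketch: keep the 2k largest Count-Sketch estimates."""
        n = len(x)
        cs = CountSketch(n, table_size=max(1, int(4 * k / eps)),
                         n_tables=max(1, int(2 * np.log2(n + 1))), seed=seed)
        est = cs.estimate(cs.sketch(x))
        top = np.argsort(-np.abs(est))[:2 * k]   # indices of the 2k largest estimates
        xhat = np.zeros(n)
        xhat[top] = est[top]
        return xhat

On most draws of the sketch's randomness, np.linalg.norm(recover_top_2k(x, k, eps) - x) should be at most about (1 + eps) times tail_norm(x, k) from the helpers above.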

3.2 Non-sparse $\ell_1$

Theorem 3.2.

There exists a $(1+\epsilon)$-approximate $\ell_1$ recovery scheme with $\frac{k}{\sqrt{\epsilon}}\,\mathrm{polylog}(n)$ measurements and failure probability $o(1)$.

Our goal in this section is to get $(1+\epsilon)$-approximate $\ell_1$ recovery with $\frac{k}{\sqrt{\epsilon}}\,\mathrm{polylog}(n)$ measurements.

For intuition, consider 1-sparse recovery of the following vector $x$: let $t = \lceil \epsilon^{-1/2} \rceil$ and set $x_1 = t$ and $x_i = 1$ for $i \in \{2, \dots, t+1\}$, with $x_i = 0$ elsewhere. Then we have

\[ \|x_{\overline{[1]}}\|_1 = t \qquad \text{and} \qquad \|x_{\overline{[1]}}\|_2 = \sqrt{t}, \]

and by (3), a Count-Sketch with $O(t)$-sized hash tables returns an $x^*$ with

\[ \|x^* - x\|_\infty \le \|x_{\overline{[1]}}\|_2 / \sqrt{t} = 1. \]

The reconstruction algorithm therefore cannot reliably find any of the $x_i$ for $i > 1$, and its error on $x_1$ can be as large as about $1 = \sqrt{\epsilon}\,\|x_{\overline{[1]}}\|_1$. Hence the algorithm will not do better than a $(1 + \sqrt{\epsilon})$-approximation.

However, consider what happens if we subsample a $1/t$ fraction of the vector. The result probably has about one non-zero small coordinate, so an $O(1)$-width Count-Sketch of the subsample can reconstruct it exactly. Putting this in our output improves the overall error by about $1$. Since this is comparable to the error the initial Count-Sketch makes on $x_1$, it cancels that error, giving an approximation factor better than $1 + \sqrt{\epsilon}$.

This tells us that subsampling can help. We do not need to subsample at scales where the noise is concentrated on so few coordinates that we can already reconstruct it well, or at scales where it is spread over so many coordinates that the error bound is already small enough, but in the intermediate range we need to subsample. Our algorithm subsamples at all rates in between these two endpoints, and combines the heavy hitters from each.
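The following Python skeleton shows the shape of the resulting algorithm. The per-level hash-table sizes, the number of levels, and the number of elements kept per level are placeholders (the analysis below is what pins them down); only the structure (a Count-Sketch of each geometrically subsampled copy of the vector, with new heavy hitters collected at each level) is meant to match the text.

    import numpy as np

    def subsampled_recovery(x, k, eps, n_levels=None, seed=0):
        """Combine heavy hitters from Count-Sketches of geometrically subsampled copies of x."""
        n = len(x)
        rng = np.random.default_rng(seed)
        if n_levels is None:
            n_levels = max(1, int(np.log2(1.0 / eps) / 2) + 1)   # placeholder: ~log(1/eps) levels
        xhat = np.zeros(n)
        found = set()
        per_level = 2 * k                                        # placeholder: elements kept per level
        for j in range(n_levels):
            keep_prob = 2.0 ** (-j)                              # level 0 keeps the whole vector
            mask = rng.random(n) < keep_prob
            y = np.where(mask, x, 0.0)
            cs = CountSketch(n, table_size=max(1, int(2 * k / np.sqrt(eps))),
                             n_tables=max(1, int(2 * np.log2(n + 1))), seed=seed + 1 + j)
            est = cs.estimate(cs.sketch(y))
            new = 0
            for i in np.argsort(-np.abs(est)):
                if new >= per_level:
                    break
                if mask[i] and i not in found:
                    xhat[i] = est[i]
                    found.add(i)
                    new += 1
        return xhat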

First we analyze how subsampled Count-Sketch works.

Lemma 3.3.

Suppose we subsample with probability $p$ and then apply Count-Sketch with $O(\log n)$ rows and $O(q)$-sized hash tables. Let $y$ be the subsample of $x$. Then with failure probability $n^{-\Omega(1)}$ we recover a $y^*$ with

\[ \|y^* - y\|_\infty^2 \le O\!\Big(\frac{p}{q}\Big)\,\big\|x_{\overline{[q/p]}}\big\|_2^2 . \]

Proof.

Recall the following form of the Chernoff bound: if $X_1, \dots, X_m$ are independent random variables with $0 \le X_i \le 1$, and $\mu = \mathrm{E}[\sum_i X_i]$, then for any $\delta > 0$,

\[ \Pr\Big[\sum_i X_i \ge (1+\delta)\mu\Big] \le \Big(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Big)^{\mu}. \]

Let $S$ be the set of coordinates in the sample. Then $\mathrm{E}\big[|S \cap [q/p]|\big] = q$, so

Suppose this event does not happen, so $|S \cap [q/p]| = O(q)$. We also have

Let if and if . Then

For we have

giving by Chernoff that

But if this event does not happen, then

By (3), using $O(q)$-size hash tables gives a $y^*$ with

with failure probability $n^{-\Omega(1)}$, as desired. ∎

Our algorithm is as follows: for each level $j = 0, 1, \dots, r$, with $r = O(\log(1/\epsilon))$ levels, we find and estimate the largest elements not found at previous levels using a Count-Sketch of a subsample taken with probability $2^{-j}$ and with an appropriately chosen hash table size. We output $\hat{x}$, the union of all these estimates. Our goal is to show

For each level $j$, let $S_j$ be the set of the largest coordinates in our estimate at level $j$ that were not found at previous levels, and let $S = \bigcup_j S_j$. By Lemma 3.3, for each $j$ we have (with failure probability $n^{-\Omega(1)}$) that

and so

(5)

By standard arguments, the bound for gives

(6)

Combining Equations (5) and (6) gives

(7)

We would like to convert the first term to depend on the $\ell_1$ norm. For any vector $z$ and integer $s$ we have, by splitting the sorted coordinates of $z$ into chunks of size $s$, that

\[ \|z_{\overline{[s]}}\|_2 \le \frac{\|z\|_1}{\sqrt{s}} . \]

Along with the triangle inequality, this gives us that

so

(8)

Define . The first term grows as so it is fine, but can grow as . We need to show that they are canceled by the corresponding . In particular, we will show that with high probability—at least wherever .

Let be the set of with , so that . We have

(9)

For , we have

so, along with , we turn Equation (9) into

When choosing , let be the set of indices chosen in the sample. Applying Lemma 3.3 the estimate of has

for .

Let . We have so and with failure probability . Conditioned on , since has at least possible choices of value at least , must have at least elements at least . Therefore, for ,

and therefore

(10)

Using (8) and (10) we get

for some constant. Hence we use a total of $\frac{k}{\sqrt{\epsilon}}\,\mathrm{polylog}(n)$ measurements for $(1+\epsilon)$-approximate recovery.

For each level we had failure probability $n^{-\Omega(1)}$ (from Lemma 3.3 and the Chernoff bounds above). By the union bound, our overall failure probability is at most

proving Theorem 3.2.

4 Lower bounds for non-sparse output and $p = 2$

In this case, the lower bound follows fairly straightforwardly from the Shannon-Hartley information capacity of a Gaussian channel.

We will set up a communication game. Let $\mathcal{F}$ be a family of $k$-sparse supports such that:

  • $|S \triangle S'| \ge k$ for $S \ne S' \in \mathcal{F}$,

  • $\Pr_{S \in \mathcal{F}}[i \in S] = k/n$ for all $i \in [n]$, and

  • $\log|\mathcal{F}| = \Omega(k \log(n/k))$.

This is possible; for example, a random linear code on $[n/k]^k$ with relative distance $1/2$ has these properties [Gur10].
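For intuition only, here is a simple rejection-sampling construction of such a family in Python; it is not the coding-theoretic construction from [Gur10], and the target size is an illustrative parameter, but it enforces the pairwise-distance condition above.

    import numpy as np

    def support_family(n, k, target_size, seed=0, max_tries=200000):
        """Collect k-subsets of [n] whose pairwise symmetric differences all have size >= k."""
        rng = np.random.default_rng(seed)
        family = []
        for _ in range(max_tries):
            if len(family) >= target_size:
                break
            S = frozenset(rng.choice(n, size=k, replace=False).tolist())
            if all(len(S ^ T) >= k for T in family):
                family.append(S)
        return family

The point of the coding-theoretic construction is that it provably achieves $\log|\mathcal{F}| = \Omega(k \log(n/k))$ with the stated marginals, which a naive sampler does not certify.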

Let $X = \{x \in \{0, \pm 1\}^n : \mathrm{supp}(x) \in \mathcal{F}\}$. Let $w \in \mathbb{R}^n$ be i.i.d. normal with variance $\alpha k/n$ in each coordinate, for a parameter $\alpha$. Consider the following process:

Procedure

First, Alice chooses $S \in \mathcal{F}$ uniformly at random, then $x \in X$ uniformly at random subject to $\mathrm{supp}(x) = S$, then $w$ as above. She sets $y = A(x + w)$ and sends $y$ to Bob. Bob performs sparse recovery on $y$ to recover $x' \approx x$, rounds $x'$ to $X$ by $\hat{x} = \arg\min_{\tilde{x} \in X} \|\tilde{x} - x'\|_2$, and sets $S' = \mathrm{supp}(\hat{x})$. This gives a Markov chain $S \to x \to y \to x' \to S'$.

If sparse recovery works for any $x + w$ with probability $1 - \delta$ as a distribution over $A$, then there is some specific $A$ and random seed such that sparse recovery works with probability $1 - \delta$ over the choice of $x$ and $w$; let us choose this $A$ and the random seed, so that Alice and Bob run deterministic algorithms on their inputs.

Lemma 4.1.

$I(S; S') = O\!\big(m \log(1 + \tfrac{1}{\alpha})\big)$.

Proof.

Let the columns of $A^T$ be $v_1, \dots, v_m$. We may assume that the $v_i$ are orthonormal, because this can be accomplished by an invertible change of basis applied to $y$, which Bob can perform before running his algorithm. Then we have that $y_i = \langle v_i, x \rangle + w'_i$, where $w'_i = \langle v_i, w \rangle$ and

\[ w' \sim N\!\Big(0, \frac{\alpha k}{n}\, I_m\Big). \]

Hence each $y_i$ is the output of a Gaussian channel with power constraint $\mathrm{E}[\langle v_i, x \rangle^2] = k/n$ and noise variance $\alpha k/n$.
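The next step (the text is truncated here) is presumably the Shannon-Hartley bound, which we state for reference: a Gaussian channel with power constraint $P$ and noise variance $\sigma^2$ has capacity

\[ C \;=\; \tfrac{1}{2}\log_2\!\left(1 + \frac{P}{\sigma^2}\right) \]

bits per use, so with $P = k/n$ and $\sigma^2 = \alpha k/n$ as above, each measurement conveys at most $\tfrac{1}{2}\log_2(1 + 1/\alpha) = O(1/\alpha)$ bits of information about $S$, consistent with the form of Lemma 4.1.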