(1+eps)-approximate Sparse Recovery
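The problem central to sparse recovery and compressive sensing is that of stable sparse recovery: we want a distribution $\mathcal{A}$ of matrices $A \in \mathbb{R}^{m \times n}$ such that, for any $x \in \mathbb{R}^n$ and with probability $1 - \delta > 2/3$ over $A \in \mathcal{A}$, there is an algorithm to recover $\hat{x}$ from $Ax$ with
$$\|\hat{x} - x\|_p \le C \min_{k\text{-sparse } x'} \|x - x'\|_p$$
for some constant $C > 1$ and norm $p$.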
The measurement complexity of this problem is well understood for constant $C > 1$. However, in a variety of applications it is important to obtain $C = 1 + \epsilon$ for a small $\epsilon > 0$, and this complexity is not well understood. We resolve the dependence on $\epsilon$ in the number of measurements required of a $k$-sparse recovery algorithm, up to polylogarithmic factors, for the central cases of $p = 1$ and $p = 2$. Namely, we give new algorithms and lower bounds that show the number of measurements required is $(1/\epsilon^{p/2})\, k \operatorname{polylog}(n)$. For $p = 2$, our bound of $(1/\epsilon)\, k \log(n/k)$ is tight up to constant factors. We also give matching bounds when the output is required to be $k$-sparse, in which case we achieve $(1/\epsilon^{p})\, k \operatorname{polylog}(n)$. This shows the distinction between the complexity of sparse and non-sparse outputs is fundamental.
Over the last several years, substantial interest has been generated in the problem of solving underdetermined linear systems subject to a sparsity constraint. The field, known as compressed sensing or sparse recovery, has applications to a wide variety of fields, including data stream algorithms [Mut05], medical or geological imaging [CRT06, Don06], and genetics testing [SAZ10]. The approach uses the power of a sparsity constraint: a vector $x'$ is $k$-sparse if at most $k$ coefficients are non-zero. A standard formulation for the problem is that of stable sparse recovery: we want a distribution $\mathcal{A}$ of matrices $A \in \mathbb{R}^{m \times n}$ such that, for any $x \in \mathbb{R}^n$ and with probability $1 - \delta > 2/3$ over $A \in \mathcal{A}$, there is an algorithm to recover $\hat{x}$ from $Ax$ with
$$\|\hat{x} - x\|_p \le C \min_{k\text{-sparse } x'} \|x - x'\|_p$$
for some constant $C > 1$ and norm $p$.
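To make the recovery guarantee concrete, the following minimal sketch checks it numerically. It assumes the guarantee has the standard form $\|\hat{x} - x\|_p \le C \min_{k\text{-sparse } x'} \|x - x'\|_p$, where the minimum is attained by keeping the $k$ largest-magnitude entries of $x$. The function names `best_k_term_error` and `is_c_approximate` are illustrative and do not come from the paper.

```python
def best_k_term_error(x, k, p):
    """l_p norm of x after zeroing its k largest-magnitude entries.

    This equals the minimum of ||x - x'||_p over all k-sparse x'.
    """
    tail = sorted((abs(v) for v in x), reverse=True)[k:]
    return sum(v ** p for v in tail) ** (1.0 / p)


def is_c_approximate(x_hat, x, k, C, p):
    """Check ||x_hat - x||_p <= C * (best k-term approximation error)."""
    err = sum(abs(a - b) ** p for a, b in zip(x_hat, x)) ** (1.0 / p)
    # Small slack guards against floating-point rounding only.
    return err <= C * best_k_term_error(x, k, p) + 1e-12


# Hypothetical example: two large entries plus a small tail.
x = [10.0, -7.0, 0.5, 0.25, -0.25]
good = [10.0, -7.0, 0.0, 0.0, 0.0]   # keeps both heavy entries
bad = [10.0, 0.0, 0.0, 0.0, 0.0]     # misses the second heavy entry

print(best_k_term_error(x, 2, 1))
print(is_c_approximate(good, x, 2, 1.0, 1))
print(is_c_approximate(bad, x, 2, 1.5, 1))
```

The benchmark depends only on the tail of $x$, so an estimate that misses a single heavy coordinate can fail the check for any $C$ close to 1.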
It is known [CRT06, GLPS10] that such recovery schemes exist for $p \in \{1, 2\}$ with $C = O(1)$ and $m = O(k \log(n/k))$. Furthermore, it is known [DIPW10, FPRU10] that any such recovery scheme requires $\Omega(k \log_{1+C}(n/k))$ measurements. This means the measurement complexity is well understood for $C = 1 + \Omega(1)$, but not for $C = 1 + o(1)$.
A number of applications would like to have $C = 1 + \epsilon$ for small $\epsilon$. For example, a radio wave signal can be modeled as $x = x^* + w$ where $x^*$ is $k$-sparse (corresponding to a signal over a narrow band) and the noise $w$ is i.i.d. Gaussian with $\|w\|_p \approx D \|x^*\|_p$ [TDB09]. Then sparse recovery with $C = 1 + \alpha/D$ allows the recovery of a $(1 - \alpha)$ fraction of the true signal $x^*$. Since $x^*$ is concentrated in a small band while $w$ is located over a large region, it is often the case that $\alpha/D \ll 1$.
The difficulty of $(1+\epsilon)$-approximate recovery has seemed to depend on whether the output $\hat{x}$ is required to be $k$-sparse or can have more than $k$ elements in its support. Having $k$-sparse output is important for some applications (e.g. the aforementioned radio waves) but not for others (e.g. imaging). Algorithms that output a $k$-sparse $\hat{x}$ have used $\Theta(\frac{1}{\epsilon^p} k \log n)$ measurements [CCF02, CM04, CM06, Wai09]. In contrast, [GLPS10] uses only $\Theta(\frac{1}{\epsilon} k \log(n/k))$ measurements for $p = 2$ and outputs a non-$k$-sparse $\hat{x}$.
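Figure 1: Lower and upper bounds on the number of measurements, by output type (sparse or non-sparse) and norm. Columns: Lower bound, Upper bound. Of the entries, only one upper-bound citation, [CCF02, CM06, Wai09], is recoverable.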
We show that the apparent distinction between complexity of sparse and non-sparse outputs is fundamental, for both $p = 1$ and $p = 2$. We show that for sparse output, $\Omega(k/\epsilon^p)$ measurements are necessary, matching the upper bounds up to a $\log n$ factor. For general output and $p = 2$, we show $\Omega(\frac{1}{\epsilon} k \log(n/k))$ measurements are necessary, matching the upper bound up to a constant factor. In the remaining case of general output and $p = 1$, we show $\tilde{\Omega}(k/\sqrt{\epsilon})$ measurements are necessary. We then give a novel algorithm that uses $\tilde{O}(k/\sqrt{\epsilon})$ measurements, beating the $1/\epsilon$ dependence given by all previous algorithms. As a result, all our bounds are tight up to factors logarithmic in $n$. The full results are shown in Figure 1.
In addition, for $p = 2$ and general output, we show that thresholding the top $2k$ elements of a Count-Sketch [CCF02] estimate gives $(1+\epsilon)$-approximate recovery with $\Theta(\frac{1}{\epsilon} k \log n)$ measurements. This is interesting because it highlights the distinction between sparse output and non-sparse output: [CM06] showed that thresholding the top $k$ elements of a Count-Sketch estimate requires $m = \Theta(\frac{1}{\epsilon^2} k \log n)$. While [GLPS10] achieves $m = \Theta(\frac{1}{\epsilon} k \log(n/k))$ for the same regime, it only succeeds with constant probability while ours succeeds with probability $1 - n^{-\Omega(1)}$; hence ours is the most efficient known algorithm when and .
Much of the work on sparse recovery has relied on the Restricted Isometry Property [CRT06]. None of this work has been able to get better than $2$-approximate recovery, so there are relatively few papers achieving $(1+\epsilon)$-approximate recovery. The existing ones with $O(k \log n)$ measurements are surveyed above (except for [IR08], which has worse dependence on $\epsilon$ than [CM04] for the same regime).
A couple of previous works have studied the problem, where every coordinate must be estimated with small error. This problem is harder than sparse recovery with sparse output. For , [Wai09] showed that schemes using Gaussian matrices require . For , [CM05] showed that any sketch requires bits (rather than measurements).
Independently of this work and of each other, multiple authors [CD11, IT10, ASZ10] have matched our $\Omega(\frac{1}{\epsilon} k \log(n/k))$ bound for $p = 2$ in related settings. The details vary, but all proofs are broadly similar in structure to ours: they consider observing a large set of “well-separated” vectors under Gaussian noise. Fano’s inequality gives a lower bound on the mutual information between the observation and the signal; then, an upper bound on the mutual information is given by either the Shannon-Hartley theorem or a KL-divergence argument. This technique does not seem useful for the other problems we consider in this paper, such as lower bounds for $p = 1$ or the sparse output setting.
For the upper bounds for non-sparse output, we observe that the hard case for sparse output is when the noise is fairly concentrated, in which case the estimation of the top $k$ elements can have error. Our goal is to recover enough mass from outside the top $k$ elements to cancel this error. The upper bound for $p = 2$ is a fairly straightforward analysis of the top $2k$ elements of a Count-Sketch data structure.
The upper bound for proceeds by subsampling the vector at rate and performing a Count-Sketch with size proportional to , for . The intuition is that if the noise is well spread over many (more than ) coordinates, then the bound from the first Count-Sketch gives a very good bound, so the approximation is -approximate. However, if the noise is concentrated over a small number of coordinates, then the error from the first Count-Sketch is proportional to . But in this case, one of the subsamples will only have of the coordinates with large noise. We can then recover those coordinates with the Count-Sketch for that subsample. Those coordinates contain an fraction of the total noise, so recovering them decreases the approximation error by exactly the error induced from the first Count-Sketch.
The lower bounds use substantially different techniques for sparse output and for non-sparse output. For sparse output, we use reductions from communication complexity to show a lower bound in terms of bits. Then, as in [DIPW10], we embed $\Theta(\log n)$ copies of this communication problem into a single vector. This multiplies the bit complexity by $\log n$; we also show we can round $Ax$ to $O(\log n)$ bits per measurement without affecting recovery, giving a lower bound in terms of measurements.
We illustrate the lower bound on bit complexity for sparse output using . Consider a vector containing ones and zeros elsewhere, such that for all . For any , set and elsewhere. Then successful -approximate sparse recovery from returns with . Hence we can recover each bit of with probability ,
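For non-sparse output, we split between $p = 2$ and $p = 1$. For $p = 2$, we consider $A(x + w)$ where $x$ is sparse and $w$ has uniform Gaussian noise with $\|w\|_2^2 \approx \|x\|_2^2 / \epsilon$. Then each coordinate of $y = A(x + w) = Ax + Aw$ is a Gaussian channel with signal to noise ratio $\epsilon$. This channel has channel capacity $O(\epsilon)$, showing $I(y; x) \le O(\epsilon m)$. Correct sparse recovery must either get most of $x$ or an $\epsilon$ fraction of $w$; the latter requires $m = \Omega(\epsilon n)$ and the former requires $I(y; x) = \Omega(k \log(n/k))$. This gives a tight $\Theta(\frac{1}{\epsilon} k \log(n/k))$ result. Unfortunately, this does not easily extend to $p = 1$, because it relies on the Gaussian distribution being both stable and maximum entropy under $\ell_2$; the corresponding distributions in $\ell_1$ are not the same.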
Therefore for non-sparse output, we have yet another argument. The hard instances for must have one large value (or else is a valid output) but small other values (or else the -sparse approximation is significantly better than the -sparse approximation). Suppose has one value of size and values of size spread through a vector of size . Then a -approximate recovery scheme must either locate the large element or guess the locations of the values with more correct than incorrect. The former requires bits by the difficulty of a novel version of the Gap- problem. The latter requires bits because it allows recovering an error correcting code. Setting balances the terms at bits. Because some of these reductions are very intricate, this extended abstract does not manage to embed copies of the problem into a single vector. As a result, we lose a factor in a universe of size when converting to measurement complexity from bit complexity.
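We use $[n]$ to denote the set $\{1, \dots, n\}$. For any set $S \subset [n]$, we use $\overline{S}$ to denote the complement of $S$, i.e., the set $[n] \setminus S$. For any $x \in \mathbb{R}^n$, $x_i$ denotes the $i$th coordinate of $x$, and $x_S$ denotes the vector $x' \in \mathbb{R}^n$ given by $x'_i = x_i$ if $i \in S$, and $x'_i = 0$ otherwise. We use $\operatorname{supp}(x)$ to denote the support of $x$.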
3 Upper bounds
The algorithms in this section are indifferent to permutation of the coordinates. Therefore, for simplicity of notation in the analysis, we assume the coefficients of $x$ are sorted such that $x_1 \ge x_2 \ge \dots \ge x_n \ge 0$.
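Both our upper bounds use the Count-Sketch [CCF02] data structure. The structure consists of $c \log n$ hash tables of size $O(q)$, for $O(q \log n)$ total space; it can be represented as $Ax$ for a matrix $A$ with $O(q \log n)$ rows. Given $Ax$, one can construct $x^*$ with
$$\|x^* - x\|_\infty^2 \le \frac{1}{q} \|x_{\overline{[q]}}\|_2^2 \tag{3}$$
with failure probability $n^{1-c}$.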
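As an illustration of the data structure described above, here is a minimal Count-Sketch in plain Python. It is a sketch under stated assumptions, not the construction analyzed in [CCF02]: explicit random bucket and sign tables stand in for hash functions of limited independence, and the names `CountSketch`, `width`, `depth`, and `seed` are chosen here for illustration. Each of the `depth` tables maps every coordinate to one of `width` signed buckets; a coordinate is estimated by the median, over the tables, of its signed bucket value.

```python
import random
import statistics


class CountSketch:
    def __init__(self, n, width, depth, seed=0):
        rng = random.Random(seed)
        self.n, self.width, self.depth = n, width, depth
        # Assumption: fully random tables replace the usual hash functions.
        self.bucket = [[rng.randrange(width) for _ in range(n)]
                       for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)]
                     for _ in range(depth)]

    def measure(self, x):
        """Linear measurements of x: depth tables of width signed sums."""
        tables = [[0.0] * self.width for _ in range(self.depth)]
        for r in range(self.depth):
            for i, v in enumerate(x):
                tables[r][self.bucket[r][i]] += self.sign[r][i] * v
        return tables

    def estimate(self, tables):
        """Per-coordinate median of the signed bucket values."""
        return [
            statistics.median(
                self.sign[r][i] * tables[r][self.bucket[r][i]]
                for r in range(self.depth)
            )
            for i in range(self.n)
        ]


# Hypothetical usage: a 1-sparse vector is estimated exactly at its support.
n = 50
sketch = CountSketch(n, width=8, depth=5, seed=1)
x = [0.0] * n
x[7] = 3.0
estimates = sketch.estimate(sketch.measure(x))
print(estimates[7])
```

Because `measure` is linear in `x`, the `depth * width` table entries correspond to the rows of a measurement matrix, which is how the structure fits the $Ax$ framework.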
It was shown in [CM06] that, if $x^*$ is the result of a Count-Sketch with hash table size $O(k/\epsilon^2)$, then outputting the top $k$ elements of $x^*$ gives a $(1+\epsilon)$-approximate recovery scheme for $p = 2$. Here we show that a seemingly minor change (selecting $2k$ elements rather than $k$ elements) turns this into a $(1+\epsilon^2)$-approximate recovery scheme.
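Let $\hat{x}$ be the top $2k$ estimates from a Count-Sketch structure with hash table size $O(k/\epsilon)$. Then with failure probability $n^{-\Omega(1)}$,
$$\|\hat{x} - x\|_2 \le (1+\epsilon) \|x_{\overline{[k]}}\|_2.$$
Therefore, there is a $(1+\epsilon)$-approximate recovery scheme for $p = 2$ with $O(\frac{1}{\epsilon} k \log n)$ rows.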
Let the hash table size be for constant , and let be the vector of estimates for each coordinate. Define to be the indices of the largest values in , and .
By (3), the standard analysis of Count-Sketch:
Let and , and let . The algorithm passes over an element of value to choose one of value , so
and combining this with (4) gives
which proves the theorem for . ∎
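The thresholding step in this theorem can be sketched as follows, assuming we are handed a vector of per-coordinate estimates (for example from a Count-Sketch) and keep only the largest-magnitude entries. The helper names `keep_top` and `l2_error` and the toy numbers are hypothetical. The example uses $k = 1$ with two nearly equal heavy coordinates, the situation where choosing a single coordinate from noisy estimates can pick the wrong one.

```python
def keep_top(estimates, count):
    """Zero out all but the `count` largest-magnitude entries."""
    order = sorted(range(len(estimates)),
                   key=lambda i: abs(estimates[i]), reverse=True)
    kept = set(order[:count])
    return [v if i in kept else 0.0 for i, v in enumerate(estimates)]


def l2_error(x_hat, x):
    """Euclidean distance between an estimate and the true vector."""
    return sum((a - b) ** 2 for a, b in zip(x_hat, x)) ** 0.5


# Toy instance with k = 1: noise in the estimates swaps the top two entries.
k = 1
x = [1.0, 0.9, 0.0, 0.0]
noisy = [0.95, 1.0, 0.05, -0.05]

top_k = keep_top(noisy, k)        # keeps only index 1
top_2k = keep_top(noisy, 2 * k)   # keeps indices 0 and 1

print(l2_error(top_k, x))
print(l2_error(top_2k, x))
```

In this instance the $k$-sparse output pays for the whole missed coordinate, while the $2k$-sparse output pays only for the estimation noise on the coordinates it keeps.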
There exists a -approximate recovery scheme with measurements and failure probability .
Set , so our goal is to get -approximate recovery with measurements.
For intuition, consider 1-sparse recovery of the following vector : let and set and . Then we have
and by (3), a Count-Sketch with -sized hash tables returns with
The reconstruction algorithm therefore cannot reliably find any of the for , and its error on is at least . Hence the algorithm will not do better than a -approximation.
However, consider what happens if we subsample an fraction of the vector. The result probably has about non-zero values, so a -width Count-Sketch can reconstruct it exactly. Putting this in our output improves the overall error by about . Since , this more than cancels the error the initial Count-Sketch makes on , giving an approximation factor better than .
This tells us that subsampling can help. We don’t need to subsample at a scale below (where we can reconstruct well already) or above (where the bound is small enough already), but in the intermediate range we need to subsample. Our algorithm subsamples at all rates in between these two endpoints, and combines the heavy hitters from each.
First we analyze how subsampled Count-Sketch works.
Suppose we subsample with probability and then apply Count-Sketch with rows and -sized hash tables. Let be the subsample of . Then with failure probability we recover a with
Recall the following form of the Chernoff bound: if are independent with , and , then
Let be the set of coordinates in the sample. Then , so
Suppose this event does not happen, so . We also have
Let if and if . Then
For we have
giving by Chernoff that
But if this event does not happen, then
By (3), using -size hash tables gives a with
with failure probability , as desired. ∎
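The subsampling step that this lemma analyzes can be sketched as below. It assumes each coordinate is kept independently with probability $p$, which is the setting in which a Chernoff bound on the sample size applies; the function name `subsample` and the parameter values are illustrative only. The subsampled vector would then be passed to a Count-Sketch, which is omitted here.

```python
import random


def subsample(x, p, rng):
    """Keep each coordinate independently with probability p.

    Returns the list of sampled indices and the vector with every
    non-sampled coordinate set to zero.
    """
    kept = [i for i in range(len(x)) if rng.random() < p]
    kept_set = set(kept)
    return kept, [v if i in kept_set else 0.0 for i, v in enumerate(x)]


# Hypothetical usage: the sample size concentrates around p * n.
rng = random.Random(0)
n, p = 10000, 0.1
x = [1.0] * n
indices, x_sub = subsample(x, p, rng)

print(len(indices), p * n)
```

With $n = 10000$ and $p = 0.1$ the standard deviation of the sample size is $\sqrt{np(1-p)} = 30$, so deviations of more than a few hundred from $pn = 1000$ are extremely unlikely.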
Let . Our algorithm is as follows: for , we find and estimate the largest elements not found in previous in a subsampled Count-Sketch with probability and hash size for some parameter . We output , the union of all these estimates. Our goal is to show
For each level , let be the largest coordinates in our estimate not found in . Let . By Lemma 3.3, for each we have (with failure probability ) that
By standard arguments, the bound for gives
We would like to convert the first term to depend on the norm. For any and we have, by splitting into chunks of size , that
Along with the triangle inequality, this gives us that
Define . The first term grows as so it is fine, but can grow as . We need to show that they are canceled by the corresponding . In particular, we will show that with high probability—at least wherever .
Let be the set of with , so that . We have
For , we have
so, along with , we turn Equation (9) into
When choosing , let be the set of indices chosen in the sample. Applying Lemma 3.3 the estimate of has
Let . We have so and with failure probability . Conditioned on , since has at least possible choices of value at least , must have at least elements at least . Therefore, for ,
for some . Hence we use a total of measurements for -approximate recovery.
4 Lower bounds for non-sparse output and $p = 2$
In this case, the lower bound follows fairly straightforwardly from the Shannon-Hartley information capacity of a Gaussian channel.
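For reference, the Shannon-Hartley capacity of a real Gaussian channel with signal-to-noise ratio $\mathrm{SNR}$ is $\frac{1}{2} \log_2(1 + \mathrm{SNR})$ bits per use. The sketch below evaluates this formula and the resulting cap on the information carried by $m$ measurements, under the simplifying assumption that each measurement is treated as one use of such a channel. The function names are hypothetical.

```python
import math


def gaussian_capacity_bits(snr):
    """Capacity in bits per use of a real Gaussian channel at this SNR."""
    return 0.5 * math.log2(1.0 + snr)


def max_information_bits(m, snr):
    """Upper bound on the information carried by m channel uses."""
    return m * gaussian_capacity_bits(snr)


# At small SNR the capacity is roughly linear: about snr / (2 ln 2) bits.
snr = 0.01
print(gaussian_capacity_bits(snr), snr / (2 * math.log(2)))

# If B bits must get through, at least B / capacity channel uses are needed.
bits_needed = 1000.0
print(bits_needed / gaussian_capacity_bits(snr))
```

Since the capacity scales linearly with a small signal-to-noise ratio, the number of measurements needed to convey a fixed number of bits scales inversely with it, which is the source of the $1/\epsilon$ factor in a bound of this type.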
We will set up a communication game. Let be a family of -sparse supports such that:
for all , and
This is possible; for example, a random linear code on with relative distance has these properties.
Let . Let be i.i.d. normal with variance in each coordinate. Consider the following process:
First, Alice chooses uniformly at random, then uniformly at random subject to , then . She sets and sends to Bob. Bob performs sparse recovery on to recover , rounds to by , and sets . This gives a Markov chain .
If sparse recovery works for any with probability as a distribution over , then there is some specific and random seed such that sparse recovery works with probability over ; let us choose this and the random seed, so that Alice and Bob run deterministic algorithms on their inputs.
Let the columns of be . We may assume that the are orthonormal, because this can be accomplished via a unitary transformation on . Then we have that , where and
Hence is a Gaussian channel with power constraint and noise variance