One-bit compressed sensing by linear programming

Yaniv Plan and Roman Vershynin
Department of Mathematics, University of Michigan, 530 Church St., Ann Arbor, MI 48109, U.S.A.
{yplan,romanv}@umich.edu
September 19, 2011
Abstract.

We give the first computationally tractable and almost optimal solution to the problem of one-bit compressed sensing, showing how to accurately recover an $s$-sparse vector $x \in \mathbb{R}^n$ from the signs of $O(s \log^2(n/s))$ random linear measurements of $x$. The recovery is achieved by a simple linear program. This result extends to approximately sparse vectors $x$. Our result is universal in the sense that with high probability, one measurement scheme will successfully recover all sparse vectors simultaneously. The argument is based on solving an equivalent geometric problem on random hyperplane tessellations.

2000 Mathematics Subject Classification:
94A12; 60D05; 90C25
Y.P. is supported by an NSF Postdoctoral Research Fellowship under award No. 1103909. R.V. is supported by NSF grants DMS 0918623 and 1001829.

1. Introduction

Compressed sensing is a modern paradigm of data acquisition which is having an impact on several disciplines; see [21]. The scientist has access to a measurement vector $y \in \mathbb{R}^m$ obtained as

$y = Ax, \qquad (1.1)$

where $A$ is a given $m \times n$ measurement matrix and $x \in \mathbb{R}^n$ is an unknown signal that one needs to recover from $y$. One would like to take $m \ll n$, rendering $A$ non-invertible; the key ingredient for successful recovery of $x$ is to take into account its assumed structure – sparsity. Thus one assumes that $x$ has at most $s$ nonzero entries, although the support pattern is unknown. The strongest known results are for random measurement matrices $A$. In particular, if $A$ has Gaussian i.i.d. entries, then we may take $m = O(s \log(n/s))$ and still recover $x$ exactly with high probability [10, 7]; see [26] for an overview. Furthermore, this recovery may be achieved in polynomial time by solving the convex minimization program

$\min \|x'\|_1 \quad \text{subject to} \quad Ax' = y. \qquad (1.2)$

Stability results are also available when noise is added to the problem [9, 8, 3, 27].

However, while the focus of compressed sensing is signal recovery with minimal information, the classical set-up (1.1), (1.2) assumes infinite bit precision of the measurements. This disaccord raises an important question: how many bits per measurement (i.e. per coordinate of $y$) are sufficient for tractable and accurate sparse recovery? This paper shows that one bit per measurement is enough.

There are many applications where such severe quantization may be inherent or preferred — analog-to-digital conversion [20, 18], binomial regression in statistical modeling and threshold group testing [12], to name a few.

1.1. Main results

This paper demonstrates that a simple modification of the convex program (1.2) is able to accurately estimate $x$ from the extremely quantized measurement vector $y = \mathrm{sign}(Ax)$.

Here $\mathrm{sign}(Ax)$ is the vector of signs of the coordinates of $Ax$. (To be precise, for a scalar $z$ we define $\mathrm{sign}(z) = 1$ if $z > 0$, $\mathrm{sign}(z) = -1$ if $z < 0$, and $\mathrm{sign}(0) = 0$. We allow the sign function to act on a vector by acting individually on each element.)

Note that $y$ contains no information about the magnitude of $x$, and thus we can only hope to recover the normalized vector $x/\|x\|_2$. This problem was introduced and first studied by Boufounos and Baraniuk [6] under the name of one-bit compressed sensing; some related work is summarized in Section 1.2.
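To make the measurement model concrete, the following sketch (our own illustration, assuming NumPy; the variable names are not from the paper) generates one-bit measurements and confirms that they are invariant under rescaling of the signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 1000, 200, 10          # ambient dimension, measurements, sparsity

# s-sparse signal with unknown magnitude
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)

A = rng.standard_normal((m, n))  # Gaussian measurement matrix
y = np.sign(A @ x)               # one bit per measurement

# The measurements carry no magnitude information:
assert np.array_equal(y, np.sign(A @ (5.0 * x)))
```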

We shall show that the signal can be accurately recovered by solving the following convex minimization program

$\min \|x'\|_1 \quad \text{subject to} \quad \mathrm{sign}(Ax') = y \quad \text{and} \quad \|Ax'\|_1 = m. \qquad (1.3)$

The first constraint, $\mathrm{sign}(Ax') = y$, keeps the solution consistent with the measurements. It is defined by the relation $y_i = \mathrm{sign}(\langle a_i, x'\rangle)$ for $i = 1, \ldots, m$, where $a_i$ is the $i$-th row of $A$. The second constraint, $\|Ax'\|_1 = m$, serves to prevent the program from returning a zero solution. Moreover, this constraint is linear, as under the first constraint it can be represented as the single linear equation $\sum_{i=1}^m y_i \langle a_i, x'\rangle = m$, where $y_i$ denote the coordinates of $y$ (indeed, the first constraint forces $y_i \langle a_i, x'\rangle = |\langle a_i, x'\rangle|$). Therefore (1.3) is indeed a convex minimization program; furthermore, one can easily represent it as a linear program, see (5.3) below. Note also that the number $m$ in the second constraint of (1.3) is chosen for convenience of the analysis; it can be replaced by any other fixed positive number.
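To illustrate the linear programming reformulation, here is a minimal sketch of one way to pose (1.3) as an LP, in the spirit of (5.3) but not necessarily identical to it. The function name `one_bit_recover` and the use of SciPy's `linprog` are our own choices, not the authors' code; we split $x' = u - v$ with $u, v \ge 0$, so that $\|x'\|_1 = \sum_i (u_i + v_i)$.

```python
import numpy as np
from scipy.optimize import linprog

def one_bit_recover(A, y):
    """Sketch: min ||x'||_1 s.t. y_i <a_i, x'> >= 0 and sum_i y_i <a_i, x'> = m,
    via the split x' = u - v with u, v >= 0."""
    m, n = A.shape
    YA = y[:, None] * A                      # rows y_i * a_i
    c = np.ones(2 * n)                       # objective: sum(u) + sum(v) = ||x'||_1
    A_ub = np.hstack([-YA, YA])              # -y_i <a_i, u - v> <= 0 (sign consistency)
    b_ub = np.zeros(m)
    row = YA.sum(axis=0)                     # sum_i y_i a_i
    A_eq = np.hstack([row, -row])[None, :]   # sum_i y_i <a_i, u - v> = m
    b_eq = np.array([float(m)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
    u, v = res.x[:n], res.x[n:]              # res.status == 0 indicates success
    return u - v

# usage: x_hat = one_bit_recover(A, y); the direction estimate is x_hat / np.linalg.norm(x_hat)
```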

Theorem 1.1 (Recovery from one-bit measurements).

Let $n, m, s > 0$ and $\delta > 0$, and let $A$ be an $m \times n$ random matrix with independent standard normal entries. Set

$m \ge C \delta^{-5} s \log^2(2n/s). \qquad (1.4)$

Then, with probability at least $1 - C\exp(-c\delta m)$, the following holds uniformly for all signals $x \in \mathbb{R}^n$ satisfying $\|x\|_1/\|x\|_2 \le \sqrt{s}$. Let $y = \mathrm{sign}(Ax)$. Then the solution $\hat{x}$ of the convex minimization program (1.3) satisfies

$\Big\| \frac{\hat{x}}{\|\hat{x}\|_2} - \frac{x}{\|x\|_2} \Big\|_2 \le \delta.$

Here and in what follows, $C$ and $c$ denote positive absolute constants; other standard notation is explained in Section 1.3.

Remark 1 (Effective sparsity).

The Cauchy–Schwarz inequality implies that $\|x\|_1/\|x\|_2 \le \sqrt{\|x\|_0}$, where $\|x\|_0$ is the number of nonzero elements of $x$. Therefore one can view the parameter $(\|x\|_1/\|x\|_2)^2$ as a measure of effective sparsity of the signal $x$. The effective sparsity is thus a real-valued and robust extension of the sparsity parameter $\|x\|_0$, which allows one to handle approximately sparse vectors.
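As a small numerical illustration of effective sparsity (our own example, not from the paper), a geometrically decaying vector has full nominal support but small effective sparsity:

```python
import numpy as np

def effective_sparsity(x):
    return (np.linalg.norm(x, 1) / np.linalg.norm(x, 2)) ** 2

n = 1000
x = 0.5 ** np.arange(n)          # geometrically decaying, so every entry is nonzero
print(np.count_nonzero(x))       # nominal sparsity ||x||_0: equal to n here
print(effective_sparsity(x))     # effective sparsity: about 3, bounded as n grows
```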

Let us now state the special case of Theorem 1.1 for sparse signals:

Corollary 1.2 (Sparse recovery from one-bit measurements).

Let $n, m, s > 0$ and $\delta > 0$, and set $m$ as in (1.4). Then, with probability at least $1 - C\exp(-c\delta m)$, the following holds uniformly for all signals $x \in \mathbb{R}^n$ satisfying $\|x\|_0 \le s$. Let $y = \mathrm{sign}(Ax)$. Then the solution $\hat{x}$ of the convex minimization program (1.3) satisfies

$\Big\| \frac{\hat{x}}{\|\hat{x}\|_2} - \frac{x}{\|x\|_2} \Big\|_2 \le \delta.$

Remark 2 (Number of measurements).

The conclusion of Corollary 1.2 can be stated in the following useful way. With high probability, an arbitrarily accurate estimation of every $s$-sparse vector $x \in \mathbb{R}^n$ can be achieved from

$m = O\big(s \log^2(2n/s)\big)$

one-bit random measurements. The implicit factor in the $O(\cdot)$ notation depends only on the desired accuracy level $\delta$; more precisely, $m = \delta^{-5} s \log^2(2n/s)$ up to an absolute constant factor. The same holds if $x$ is only effectively $s$-sparse as in Theorem 1.1. The central point here is that the number of measurements $m$ is almost linear in the sparsity $s$, which can be much smaller than the ambient dimension $n$.
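Assuming the hypothetical `one_bit_recover` sketch given in Section 1.1, a minimal end-to-end experiment in the spirit of Corollary 1.2 might look as follows (the parameters are arbitrary and the snippet is our illustration, not an experiment reported in the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, s = 500, 1500, 5

x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
y = np.sign(A @ x)

x_hat = one_bit_recover(A, y)   # LP sketch from Section 1.1
err = np.linalg.norm(x_hat / np.linalg.norm(x_hat) - x / np.linalg.norm(x))
print(err)                      # error in the recovered direction
```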

Remark 3 (Non-gaussian measurements).

Most results in compressed sensing, and in random matrix theory in general, are valid not only for Gaussian random matrices but also for general random matrix ensembles. In one-bit compressed sensing, since the measurements $y = \mathrm{sign}(Ax)$ do not depend on the scaling of the rows of $A$, it is clear that our results will not change if the rows of $A$ are sampled independently from any rotationally invariant distribution in $\mathbb{R}^n$ (for example, the uniform distribution on the unit Euclidean sphere $S^{n-1}$).

However, in contrast to the widespread universality phenomenon, one-bit compressed sensing cannot be generalized to some of the simplest discrete distributions, such as Bernoulli. Indeed, suppose the entries of $A$ are independent $\pm 1$-valued symmetric random variables. Then for the vectors $x_1 = (1, 0, 0, \ldots, 0)$ and $x_2 = (1, \tfrac{1}{2}, 0, \ldots, 0)$ one can easily check that $\mathrm{sign}(Ax_1) = \mathrm{sign}(Ax_2)$ for any number of measurements $m$ (adding half of a $\pm 1$ entry never flips the sign of a $\pm 1$ entry). So one-bit measurements cannot distinguish between the two fixed distinct signals $x_1$ and $x_2$, no matter how many measurements are taken.
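A short numerical check of this counterexample (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10_000, 5
A = rng.choice([-1.0, 1.0], size=(m, n))       # symmetric Bernoulli (+/-1) entries
x1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
assert np.array_equal(np.sign(A @ x1), np.sign(A @ x2))   # identical one-bit measurements
```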

Remark 4 (Optimality).

For a fixed level of accuracy $\delta$, our estimate (1.4) on the number of measurements matches the best known number of measurements in the classical (not quantized) compressed sensing problem, $m = O(s\log(n/s))$, up to the exponent of the logarithm and up to an absolute constant factor. However, we believe that the exponent $2$ of the logarithm can be reduced to $1$. We also believe that the error in Theorem 1.1 may decrease more quickly as $m$ grows. In particular, Jacques et al. [18] demonstrate that if $x$ is exactly sparse and is estimated using an $\ell_0$-minimization-based approach, the error is upper bounded by a quantity of order $s/m$ (up to a logarithmic factor); they also demonstrate a lower error bound of the same order, regardless of what algorithm is used. In fact, such a result is not possible when $x$ is only known to be effectively sparse (i.e., $\|x\|_1/\|x\|_2 \le \sqrt{s}$). Instead, the best possible bound is of the order $\sqrt{s/m}$ (this can be checked via entropy arguments). We believe this is achievable (and is optimal) for the convex program (1.3).

1.2. Prior work

While there have been several numerical results for quantized compressed sensing [6, 4, 5, 20, 28], as well as guarantees on the convergence of many of the algorithms used for these numerical results, theoretical accuracy guarantees have been much less developed. One may endeavor to circumvent this problem by considering quantization errors as a source of noise, thereby reducing the quantized compressed sensing problem to the noisy classical compressed sensing problem. Further, in some cases the theory and algorithms of noisy compressed sensing may be adapted to this problem, as in [28, 11, 17, 25]; the method of quantization may be specialized in order to minimize the recovery error. As noted in [19], if the range of the signal is unspecified, then such a noise source is unbounded, and so the classical theory does not apply. However, in the setup of our paper we may assume without loss of generality that $\|x\|_2 = 1$, and thus it is possible that the methods of Candes and Tao [8] can be adapted to derive a version of Corollary 1.2 for a fixed sparse signal $x$. Nevertheless, we do not see any way to deduce by these methods a uniform result over all sparse signals $x$.

In a complementary line of research, Ardestanizadeh et al. [2] consider compressed sensing with a finite number of bits per measurement. However, the number of bits per measurement there is not one (or constant); this number depends on the sparsity level $s$ and the dynamic range of the signal $x$. Similarly, in the work of Gunturk et al. [14, 15] on sigma-delta quantization, the number of bits per measurement depends on the dynamic range of $x$. On the other hand, by considering sigma-delta quantization and multiple bits, Gunturk et al. are able to provide excellent guarantees on the speed of decay of the error as the ratio $s/m$ decreases.

The framework of one-bit compressed sensing was introduced by Boufounos and Baraniuk in [6]. Jacques et al. [18] show that $O(s \log n)$ one-bit measurements are sufficient to recover an $s$-sparse vector with arbitrary precision; their results are also robust to bit flips. However, their results require the estimate to be as sparse as $x$, have unit norm, and be consistent with the data. The difficulty is that the first two of these constraints are non-convex, and thus the only program known to return such an estimate is $\ell_0$ minimization with the unit-norm constraint, which is generally considered to be intractable. Gupta et al. [16] demonstrate that one may tractably recover the support of $x$ from $O(s \log n)$ measurements. They give two measurement schemes. One is non-adaptive, but the number of measurements has a quadratic dependence on the dynamic range of the signal. The other has no such dependence but is adaptive. Our results settle several of these issues: (a) we make no assumption about the dynamic range of the signal, (b) the one-bit measurements are non-adaptive, and (c) the signal is recovered by a tractable algorithm (linear programming).

1.3. Notation and organization of the paper

Throughout the paper, $C$, $c$, $C_1$, $c_1$, etc. denote absolute constants whose values may change from line to line. For an integer $n$, we denote $[n] = \{1, \ldots, n\}$. Vectors are written in bold italics, e.g., $\boldsymbol{x}$, and their coordinates are written in plain text, so that the $i$-th component of $\boldsymbol{x}$ is $x_i$. For a subset $T \subseteq [n]$, $x_T$ is the vector $x$ restricted to the elements indexed by $T$. The $\ell_1$ and $\ell_2$ norms of a vector $x$ are defined as $\|x\|_1 = \sum_i |x_i|$ and $\|x\|_2 = \big(\sum_i x_i^2\big)^{1/2}$ respectively. The number of non-zero coordinates of $x$ is denoted by $\|x\|_0$. The unit balls with respect to the $\ell_1$ and $\ell_2$ norms are denoted by $B_1^n$ and $B_2^n$ respectively. The unit Euclidean sphere is denoted $S^{n-1}$.

The rest of the paper is devoted to proving Theorem 1.1. In Section 2 we reduce this task to the following two ingredients: (a) Theorem 2.3, which states that a solution to (1.3) is effectively sparse, and (b) Theorem 2.2, which analyzes a simpler but non-convex version of (1.3) in which the constraint $\|Ax'\|_1 = m$ is replaced by $\|x'\|_2 = 1$. The latter result can be interpreted in a geometric way in terms of random hyperplane tessellations of a subset of the Euclidean sphere, specifically of the set of effectively sparse signals on the sphere. In Section 3 we estimate the metric entropy of this set, and we use this in Section 4 to prove our main geometric result of independent interest: $O(s\log(2n/s))$ random hyperplanes are enough to cut the set into small pieces, yielding that all cells of the resulting tessellation have arbitrarily small diameter. This will complete part (b) above. For part (a), we prove Theorem 2.3 on the effective sparsity of solutions in Section 5. The proof is based on counting all possible solutions of (1.3), which are the vertices of the feasible polytope. This will allow us to use standard concentration inequalities from the Appendix and to conclude the argument by a union bound.

Acknowledgement

The authors are grateful to Sinan Güntürk for pointing out an inaccuracy in the statement of Lemma 3.4 in an earlier version of this paper.

2. Strategy of the proof

Our proof of Theorem 1.1 has two main ingredients which we explain in this section. Throughout the paper, $a_1, \ldots, a_m$ will denote the rows of $A$, which are i.i.d. standard normal vectors in $\mathbb{R}^n$.

Let us revisit the second constraint in the convex minimization program (1.3). Consider a fixed signal $x' \in \mathbb{R}^n$ for the moment. Taking the expectation with respect to the random matrix $A$, we see that

$\mathbb{E}\,\|Ax'\|_1 = \sum_{i=1}^m \mathbb{E}\,|\langle a_i, x'\rangle| = m\,\mathbb{E}|g| \cdot \|x'\|_2 = \sqrt{2/\pi}\; m \,\|x'\|_2,$

where $g \sim N(0,1)$. Here we used that the first absolute moment of the standard normal random variable equals $\sqrt{2/\pi}$. So in expectation, the constraint $\|Ax'\|_1 = m$ is equivalent to $\|x'\|_2 = 1$ up to the constant factor $\sqrt{\pi/2}$.
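A quick Monte Carlo sanity check of this expectation identity (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 50, 100, 2000
x = rng.standard_normal(n)

vals = [np.linalg.norm(rng.standard_normal((m, n)) @ x, 1) for _ in range(trials)]
print(np.mean(vals))                                   # empirical E ||Ax||_1
print(m * np.sqrt(2 / np.pi) * np.linalg.norm(x))      # m * sqrt(2/pi) * ||x||_2
```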

This observation suggests that we may first try to analyze the simpler minimization program

$\min \|x'\|_1 \quad \text{subject to} \quad \mathrm{sign}(Ax') = y \quad \text{and} \quad \|x'\|_2 = 1. \qquad (2.1)$

This optimization program was first proposed in [6]. Unfortunately, it is non-convex due to the constraint $\|x'\|_2 = 1$, and therefore it seems to be computationally intractable. On the other hand, we find that the non-convex program (2.1) is more amenable to theoretical analysis than the convex program (1.3).

The first ingredient of our theory will be to demonstrate that the non-convex optimization program (2.1) leads to accurate recovery of an effectively sparse signal $x$. One can reformulate this as a geometric problem about random hyperplane tessellations. We will discuss tessellations in Section 4; the main result of that section is Theorem 4.2, which immediately implies the following result:

Theorem 2.1.

Let $n, m, s > 0$ and $\delta > 0$, and set

$m \ge C \delta^{-5} s \log(2n/s). \qquad (2.2)$

Then, with probability at least $1 - C\exp(-c\delta m)$, the following holds uniformly for all $x, x' \in S^{n-1}$ that satisfy $\|x\|_1 \le \sqrt{s}$, $\|x'\|_1 \le \sqrt{s}$, and $\mathrm{sign}(Ax) = \mathrm{sign}(Ax')$:

$\|x - x'\|_2 \le \delta.$

Theorem 2.1 yields a version of our main Theorem 1.1 for the non-convex program (2.1):

Theorem 2.2 (Non-convex recovery).

Let $n, m, s > 0$ and $\delta > 0$, and set $m$ as in (2.2). Then, with probability at least $1 - C\exp(-c\delta m)$, the following holds uniformly for all signals $x \in \mathbb{R}^n$ satisfying $\|x\|_1/\|x\|_2 \le \sqrt{s}$. Let $y = \mathrm{sign}(Ax)$. Then the solution $\hat{x}$ of the non-convex minimization program (2.1) satisfies

$\Big\| \hat{x} - \frac{x}{\|x\|_2} \Big\|_2 \le \delta.$

Proof.

We can assume without loss of generality that $\|x\|_2 = 1$, and thus $\|x\|_1 \le \sqrt{s}$. Since $x$ is feasible for the program (2.1), we also have $\|\hat{x}\|_1 \le \|x\|_1$, and thus $\|\hat{x}\|_1 \le \sqrt{s}$. Therefore Theorem 2.1 applies to the pair $x$, $\hat{x}$, and it yields that $\|\hat{x} - x\|_2 \le \delta$, as required. ∎

Remark 5 (Prior work).

A version of Theorem 2.1 was recently proved in [18] for exactly sparse signals, i.e. for $x, x' \in S^{n-1}$ such that $\|x\|_0 \le s$, $\|x'\|_0 \le s$, and $\mathrm{sign}(Ax) = \mathrm{sign}(Ax')$. This latter result holds with a similar number of measurements; cf. Remark 4. However, from the proof of Theorem 2.2 given above one sees that the result of [18] would not be sufficient to deduce our main results, even Corollary 1.2 for exactly sparse vectors. The reason is that our goal is to solve a tractable program that involves the $\ell_1$ norm, and thus we cannot directly assume that our estimate will lie in the low-dimensional set of exactly sparse vectors. Our proof of Theorem 2.1 has to overcome some additional difficulties compared to [18], caused by the absence of any control of the supports of the signals $x$, $x'$. In particular, the metric entropy of the set of unit-normed sparse vectors only grows logarithmically with the inverse of the covering accuracy. This allows the consideration of a very fine cover in the proofs of [18]. In contrast, the metric entropy of the set of vectors $x$ satisfying $\|x\|_2 = 1$ and $\|x\|_1 \le \sqrt{s}$ is much larger at fine scales, thus necessitating a different strategy of proof.

Theorem 1.1 would follow if we could demonstrate that the convex program (1.3) and the non-convex program (2.1) were equivalent. Rather than doing this explicitly, we shall prove that the solution $\hat{x}$ of the convex program (1.3) essentially preserves the effective sparsity of the signal $x$, and we finish off by applying Theorem 2.1.

Theorem 2.3 (Preserving effective sparsity).

Let $s \ge 1$ and suppose that $m \ge C s \log(2n/s)$. Then, with probability at least $1 - C\exp(-cm)$, the following holds uniformly for all signals $x \in \mathbb{R}^n$ satisfying $\|x\|_1/\|x\|_2 \le \sqrt{s}$. Let $y = \mathrm{sign}(Ax)$. Then the solution $\hat{x}$ of the convex minimization program (1.3) satisfies

$\frac{\|\hat{x}\|_1}{\|\hat{x}\|_2} \le C\sqrt{s \log(2n/s)}.$

This result is the second main ingredient of our argument, and it will be proved in Section 5. Now we are ready to deduce Theorem 1.1.

Proof of Theorem 1.1.

Consider a signal $x$ as in Theorem 1.1, so $\|x\|_1/\|x\|_2 \le \sqrt{s}$. In view of the application of Theorem 2.3, we may assume without loss of generality that $m \ge C s \log(2n/s)$. Indeed, otherwise (1.4) forces $\delta$ to exceed an absolute constant, and the conclusion of Theorem 1.1 becomes trivial after adjusting the constants. So Theorem 2.3 applies and gives

$\frac{\|\hat{x}\|_1}{\|\hat{x}\|_2} \le C\sqrt{s \log(2n/s)} =: \sqrt{\bar{s}}.$

Also, as we noted above, $\|x\|_1/\|x\|_2 \le \sqrt{s} \le \sqrt{\bar{s}}$. So Theorem 2.1 applies to the normalized vectors $x/\|x\|_2$ and $\hat{x}/\|\hat{x}\|_2$, with the effective sparsity level $\bar{s}$ in place of $s$. Note that $\mathrm{sign}(A\hat{x}) = y = \mathrm{sign}(Ax)$ because $\hat{x}$ is a feasible vector for the program (1.3). Therefore Theorem 2.1 yields

$\Big\| \frac{\hat{x}}{\|\hat{x}\|_2} - \frac{x}{\|x\|_2} \Big\|_2 \le \delta,$

where we used that the number of measurements in (1.4) satisfies the requirement (2.2) for the sparsity level $\bar{s}$, after adjusting the absolute constant $C$.

This completes the proof. ∎

For the rest of the paper, our task will be to prove the two ingredients above – Theorem 2.1, which we shall relate to a more general hyperplane tessellation problem, and Theorem 2.3 on the effective sparsity of the solution.

3. Geometry of signal sets

Our arguments are based on the geometry of the set of effectively $s$-sparse signals

$K_{n,s} := \{x \in \mathbb{R}^n : \|x\|_2 \le 1, \ \|x\|_1 \le \sqrt{s}\},$

and the set of $s$-sparse signals

$S_{n,s} := \{x \in \mathbb{R}^n : \|x\|_2 \le 1, \ \|x\|_0 \le s\}.$

While the set $S_{n,s}$ is not convex, $K_{n,s}$ is, and moreover it is a convexification of $S_{n,s}$ in the following sense. Below, for a set $K \subseteq \mathbb{R}^n$, we define $\mathrm{conv}(K)$ to be its convex hull.

Lemma 3.1 (Convexification).

One has $\mathrm{conv}(S_{n,s}) \subseteq K_{n,s} \subseteq 2\,\mathrm{conv}(S_{n,s})$.

Proof.

The first containment follows from the Cauchy–Schwarz inequality, which implies for each $x \in S_{n,s}$ that $\|x\|_1 \le \sqrt{s}\,\|x\|_2 \le \sqrt{s}$. The second containment is proved using a common technique in the compressed sensing literature. Let $x \in K_{n,s}$. Partition the support of $x$ into disjoint subsets $T_1, T_2, \ldots$ so that $T_1$ indexes the $s$ largest elements of $x$ (in magnitude), $T_2$ indexes the next $s$ largest elements, and so on. Since all $x_{T_j}/\|x_{T_j}\|_2 \in S_{n,s}$, in order to complete the proof it suffices to show that $\sum_{j \ge 1} \|x_{T_j}\|_2 \le 2$.

To prove this, first note that $\|x_{T_1}\|_2 \le \|x\|_2 \le 1$. Second, note that for $j \ge 2$, each element of $x_{T_j}$ is bounded in magnitude by $\|x_{T_{j-1}}\|_1/s$, and thus $\|x_{T_j}\|_2 \le \|x_{T_{j-1}}\|_1/\sqrt{s}$. Combining these two facts we obtain

$\sum_{j \ge 1} \|x_{T_j}\|_2 \le 1 + \frac{1}{\sqrt{s}} \sum_{j \ge 1} \|x_{T_j}\|_1 = 1 + \frac{\|x\|_1}{\sqrt{s}} \le 2, \qquad (3.1)$

where in the last inequality we used that $\|x\|_1 \le \sqrt{s}$ for $x \in K_{n,s}$. The proof is complete. ∎
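A small numerical check of the block estimate (3.1) (our own illustration; the vector is rescaled so that it lies in the set of effectively sparse signals):

```python
import numpy as np

rng = np.random.default_rng(4)
n, s = 1000, 25

# draw a vector and rescale it into {||x||_2 <= 1, ||x||_1 <= sqrt(s)}
x = rng.standard_normal(n)
x /= max(np.linalg.norm(x, 2), np.linalg.norm(x, 1) / np.sqrt(s))

# sort by magnitude and split the coordinates into consecutive blocks of size s
order = np.argsort(-np.abs(x))
blocks = [order[j:j + s] for j in range(0, n, s)]
block_norm_sum = sum(np.linalg.norm(x[T]) for T in blocks)
print(block_norm_sum)                                          # sum of block l2 norms
print(np.linalg.norm(x) + np.linalg.norm(x, 1) / np.sqrt(s))   # upper bound, at most 2
```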

Our arguments will rely on entropy bounds for the set $K_{n,s}$. Consider a more general situation, where $K$ is a bounded subset of $\mathbb{R}^n$ and $\varepsilon > 0$ is a fixed number. A subset $N$ is called an $\varepsilon$-net of $K$ if for every $x \in K$ one can find $z \in N$ so that $\|x - z\|_2 \le \varepsilon$. The minimal cardinality of an $\varepsilon$-net of $K$ is called the covering number and denoted $N(K, \varepsilon)$. The number $\log N(K, \varepsilon)$ is called the metric entropy of $K$. The covering numbers are (almost) increasing by inclusion:

$K' \subseteq K \ \Longrightarrow \ N(K', \varepsilon) \le N(K, \varepsilon/2). \qquad (3.2)$

Specializing to our sets of signals $K_{n,s}$ and $S_{n,s}$, we come across a useful example of an $\varepsilon$-net:

Lemma 3.2 (Sparse net).

Let $t \ge s$. Then $S_{n,t}$ is a $\sqrt{s/t}$-net of $K_{n,s}$.

Proof.

Let $x \in K_{n,s}$, and let $T$ denote the set of the indices of the $t$ largest coefficients of $x$ (in magnitude). Using the decomposition $x = x_T + x_{T^c}$ and noting that $x_T \in S_{n,t}$, we see that it suffices to check that $\|x_{T^c}\|_2 \le \sqrt{s/t}$. This will follow from the same steps as in (3.1). In particular, we have

$\|x_{T^c}\|_2 \le \frac{\|x\|_1}{\sqrt{t}} \le \sqrt{\frac{s}{t}},$

as required. ∎

Next we pass to quantitative entropy estimates. The entropy of the Euclidean ball can be estimated using a standard volume comparison argument, as follows (see [24, Lemma 4.16]):

$N(B_2^n, \varepsilon) \le \Big(1 + \frac{2}{\varepsilon}\Big)^n \le \Big(\frac{3}{\varepsilon}\Big)^n, \qquad \varepsilon \in (0, 1]. \qquad (3.3)$

From this we deduce a known bound on the entropy of $S_{n,s}$:

Lemma 3.3 (Entropy of ).

For $s \le n$ and $\varepsilon \in (0, 1]$, we have

$\log N(S_{n,s}, \varepsilon) \le s \log\Big(\frac{9n}{s\varepsilon}\Big).$

Proof.

We represent $S_{n,s}$ as the union of the unit Euclidean balls $B_2^T$ in all $s$-dimensional coordinate subspaces $\mathbb{R}^T$, $|T| = s$. Each ball $B_2^T$ has an $\varepsilon$-net of cardinality at most $(3/\varepsilon)^s$, according to (3.3). The union of these nets forms an $\varepsilon$-net of $S_{n,s}$, and since the number of possible sets $T$ is $\binom{n}{s} \le (en/s)^s$, the resulting net has cardinality at most $(en/s)^s (3/\varepsilon)^s \le (9n/(s\varepsilon))^s$. Taking the logarithm completes the proof. ∎

As a consequence, we obtain an entropy bound for $K_{n,s}$:

Lemma 3.4 (Entropy of ).

For , we have

Proof.

First note that . Then the monotonicity property (3.2) followed by the volumetric estimate (3.3) yield the first desired bound for all .

Next, suppose that . Then set . Lemma 3.2 states that is an -net of . Furthermore, to find an -net of , we use Lemma 3.3 for and for . Taking into account the monotonicity property (3.2), we see that there exists an -net of and such that

It follows that is an -net of , and its cardinality is as required. ∎

4. Random hyperplane tessellations

In this section we prove a generalization of Theorem 2.1. We consider a set $K \subseteq \mathbb{R}^n$ and a collection of $m$ random hyperplanes in $\mathbb{R}^n$, chosen independently and uniformly from the Haar measure. The resulting partition of $K$ by this collection of hyperplanes is called a random tessellation of $K$. The cells of the tessellation are formed by intersections of $K$ with the random half-spaces with particular orientations. The main interest in the theory of random tessellations is the typical shape of the cells.

Figure 1. Hyperplane tessellation of a subset of a sphere

We shall study the situation where $K$ is a subset of the sphere $S^{n-1}$; see Figure 1. This setting is a natural model of random hyperplane tessellation of the spherical space $S^{n-1}$. The more classical and well-studied model of random hyperplane tessellation is in Euclidean space $\mathbb{R}^n$, where the hyperplanes are allowed to be affine; see [23] for the history of this field. Random hyperplane tessellations of the sphere are studied in particular in [22].

Here we focus on the following question. How many random hyperplanes ensure that all the cells of the tessellation of $K$ have small diameter (such as $\delta$)? For the purposes of this paper, we shall address this problem for a specific set, namely for

$K = K_{n,s} \cap S^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1, \ \|x\|_1 \le \sqrt{s}\}.$

We shall prove that $m = O(\delta^{-5} s \log(2n/s))$ hyperplanes suffice with high probability. Our argument can be extended to more general sets $K$, but we defer generalizations to a later paper.

Theorem 4.1 (Random hyperplane tessellations).

Let $n$, $m$, and $s$ be positive integers. Consider the tessellation of the set $K_{n,s} \cap S^{n-1}$ by $m$ random hyperplanes in $\mathbb{R}^n$ chosen independently and uniformly from the Haar measure. Let $\delta \in (0, 1)$, and assume that

$m \ge C \delta^{-5} s \log(2n/s).$

Then, with probability at least $1 - C\exp(-c\delta m)$, all cells of the tessellation of $K_{n,s} \cap S^{n-1}$ have diameter at most $\delta$.

It is convenient to represent the random hyperplanes in Theorem 4.1 as $H_i = \{x \in \mathbb{R}^n : \langle a_i, x\rangle = 0\}$, $i = 1, \ldots, m$, where $a_1, \ldots, a_m$ are i.i.d. standard normal vectors in $\mathbb{R}^n$. The claim that all cells of the tessellation of $K_{n,s} \cap S^{n-1}$ have diameter at most $\delta$ can be restated in the following way. Every pair of points $x, y \in K_{n,s} \cap S^{n-1}$ satisfying $\|x - y\|_2 > \delta$ is separated by at least one of the hyperplanes, so there exists $i \in [m]$ such that

$\mathrm{sign}(\langle a_i, x\rangle) \ne \mathrm{sign}(\langle a_i, y\rangle).$

Theorem 4.1 is then a direct consequence of the following slightly stronger result.

Theorem 4.2 (Separation by a set of hyperplanes).

Let $n$, $m$, and $s$ be positive integers. Consider the set $K_{n,s} \cap S^{n-1}$ and independent standard normal random vectors $a_1, \ldots, a_m$ in $\mathbb{R}^n$. Let $\delta \in (0,1)$, and assume that

$m \ge C \delta^{-5} s \log(2n/s).$

Then, with probability at least $1 - C\exp(-c\delta m)$, the following holds. For every pair of points $x, y \in K_{n,s} \cap S^{n-1}$ satisfying $\|x - y\|_2 \ge \delta$, there is a set of at least $c\delta m$ of the indices $i \in [m]$ that satisfy

$\mathrm{sign}(\langle a_i, x\rangle) \ne \mathrm{sign}(\langle a_i, y\rangle).$

We will prove Theorem 4.2 by the following covering argument, which will allow us to uniformly handle all pairs $x, y$ satisfying $\|x - y\|_2 \ge \delta$. We choose an $\varepsilon$-net $N_\varepsilon$ of $K_{n,s} \cap S^{n-1}$ as in Lemma 3.4. We decompose the vector $x = x_0 + x'$, where $x_0 \in N_\varepsilon$ is a “center” and $x'$ is a “tail”, and we do similarly for $y = y_0 + y'$. An elementary probabilistic argument and a union bound will allow us to nicely separate each pair of centers $x_0, y_0$ satisfying $\|x_0 - y_0\|_2 \ge \delta/2$ by many hyperplanes. (Specifically, it will follow that the centers are separated with a definite margin for a fixed proportion of the indices $i$.)

Furthermore, the tails can be uniformly controlled using Lemma 5.4, which implies that all tails are in a good position with respect to the hyperplanes. (Specifically, for small $\varepsilon$ one can deduce that the tails have small inner products with $a_i$ for all but a small proportion of the indices $i$.) Putting the centers and the tails together, we shall conclude that $x$ and $y$ are separated by at least the required number of hyperplanes.

Now we present the full proof of Theorem 4.2.

4.1. Step 1: Decomposition into centers and tails

Let $\varepsilon > 0$ be a number to be determined later. Let $N_\varepsilon$ be an $\varepsilon$-net of $K_{n,s} \cap S^{n-1}$. Since $K_{n,s} \cap S^{n-1} \subseteq K_{n,s}$, Lemma 3.4 along with the monotonicity property of entropy (3.2) guarantee that $N_\varepsilon$ can be chosen so that

(4.1)
Lemma 4.3 (Decomposition into centers and tails).

Let . Then every vector can be represented as

(4.2)

where , .

Proof.

Since is an -net of , representation (4.2) holds for some . Since , it remains to check that . Note that and . By the triangle inequality this implies that . Thus , as claimed. ∎

4.2. Step 2: Separation of the centers

Our next task is to separate the centers , of each pair of points that are far apart by hyperplanes. For a fixed pair of points and for one hyperplane, it is easy to estimate the probability of a nice separation.

Lemma 4.4 (Separation by one hyperplane).

Let and assume that for some . Let . Then for we have

Proof.

Note that

The inequality above follows by the union bound. Now, since we have

Also, denoting the geodesic distance in $S^{n-1}$ by $d_G(\cdot, \cdot)$, it is not hard to show that

(see [13, Lemma 3.2]). Thus

as claimed. ∎
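The elementary fact underlying estimates of this kind is that a random central hyperplane with standard normal direction separates two unit vectors with probability equal to their normalized geodesic distance, $\arccos\langle x, y\rangle/\pi$. A quick Monte Carlo check of this fact (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 20, 200_000

# two fixed unit vectors
x = rng.standard_normal(n); x /= np.linalg.norm(x)
y = rng.standard_normal(n); y /= np.linalg.norm(y)

a = rng.standard_normal((trials, n))                 # random hyperplane normals
separated = np.sign(a @ x) != np.sign(a @ y)
print(separated.mean())                              # empirical separation probability
print(np.arccos(np.clip(x @ y, -1, 1)) / np.pi)      # geodesic distance / pi
```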

Now we will pay attention to the number of hyperplanes that nicely separate a given pair of points.

Definition 4.5 (Separating set).

Let . The separating index set of a pair of points is defined as

The cardinality is a binomial random variable, which is the sum of indicator functions of the independent events . The probability of each such event can be estimated using Lemma 4.4. Indeed, suppose for some , and let . Then the probability of each of the events above is at least . Then with . A standard deviation inequality (e.g. [1, Theorem A.1.13]) yields

(4.3)

Now we take a union bound over pairs of centers in the net that was chosen in the beginning of Section 4.1.

Lemma 4.6 (Separation of the centers).

Let , and let be an -net of whose cardinality satisfies (4.1). Assume that

(4.4)

Then, with probability at least , the following event holds:

(4.5)
Proof.

For a fixed pair as above, we can rewrite (4.3) as

A union bound over all pairs implies that the event in (4.5) fails with probability at most

By (4.1) and (4.4), this quantity is further bounded by

provided the absolute constant is chosen sufficiently large. The proof is complete. ∎

4.3. Step 3: Control of the tails

Now we provide a uniform control of the tails that arise from the decomposition given in Lemma 4.3. The next result is a direct consequence of Lemma 5.4.

Lemma 4.7 (Control of the tails).

Let and let be independent random vectors in . Assume that

(4.6)

Then, with probability at least , the following event holds:

4.4. Step 4: Putting the centers and tails together

Let $\varepsilon = c\delta$ for a sufficiently small absolute constant $c > 0$. To separate the centers, we choose an $\varepsilon$-net $N_\varepsilon$ of $K_{n,s} \cap S^{n-1}$ as in Lemma 4.6, and we shall apply this lemma with $\delta/2$ in place of $\delta$. Note that requirement (4.4) becomes

and it is satisfied by the assumption of Theorem 4.2, for a sufficiently large absolute constant . So Lemma 4.6 yields that with probability at least , the following separation of centers holds:

(4.7)

To control the tails, we choose as in Decomposition Lemma 4.3, and we shall apply Lemma 4.7. Note that requirement (4.6) becomes

and it is satisfied by the assumption of Theorem 4.2, for a sufficiently large absolute constant . So Lemma 4.7 yields that with probability at least , the following control of tails holds:

(4.8)

Now we combine the centers and the tails. With probability at least , both events (4.7) and (4.8) hold. Suppose both these events indeed hold, and consider a pair of vectors as in the assumption, so . We decompose these vectors according to Lemma 4.3:

(4.9)

where and . By the triangle inequality and the choice of , the centers are far apart:

Then event (4.7) implies that the separating set

(4.10)

Furthermore, using (4.8) for the tails and we see that

By Markov’s inequality, the set

We claim that

is a set of indices that satisfies the conclusion of Theorem 4.2. Indeed, the number of indices in is as required since

Further, let us fix . Using decomposition (4.9) we can write

Since , we have , while from we obtain . Thus

where the last estimate follows by the choice of for a sufficiently small absolute constant . In a similar way one can show that

This completes the proof of Theorem 4.2. ∎

5. Effective sparsity of solutions

In this section we prove Theorem 2.3 about the effective sparsity of the solution $\hat{x}$ of the convex optimization problem (1.3). Our proof consists of two steps – a lower bound on $\|\hat{x}\|_2$ proved in Lemma 5.1 below, and an upper bound on $\|\hat{x}\|_1$ which we can deduce from Lemma 5.4 in the Appendix.

Lemma 5.1 (Euclidean norm of solutions).

Let . Then, with probability at least , the following holds uniformly for all signals . Let . Then the solution of the convex minimization program (1.3) satisfies

Remark 6.

Note that the sparsity of the signal $x$ plays no role in Lemma 5.1; the result holds uniformly for all signals $x \in \mathbb{R}^n$.

Let us assume for a moment that Lemma 5.1 holds, and show how, together with Lemma 5.4, it implies Theorem 2.3.

Proof of Theorem 2.3.

With probability at least