One-bit compressed sensing by linear programming

Abstract

We give the first computationally tractable and almost optimal solution to the problem of one-bit compressed sensing, showing how to accurately recover an -sparse vector from the signs of random linear measurements of . The recovery is achieved by a simple linear program. This result extends to approximately sparse vectors . Our result is universal in the sense that with high probability, one measurement scheme will successfully recover all sparse vectors simultaneously. The argument is based on solving an equivalent geometric problem on random hyperplane tessellations.

1 Introduction

Compressed sensing is a modern paradigm of data acquisition, which is having an impact on several disciplines, see [21]. The scientist has access to a measurement vector obtained as

where is a given measurement matrix and is an unknown signal that one needs to recover from . One would like to take , rendering non-invertible; the key ingredient to successful recovery of is to take into account its assumed structure – sparsity. Thus one assumes that has at most nonzero entries, although the support pattern is unknown. The strongest known results are for random measurement matrices . In particular, if has i.i.d. Gaussian entries, then we may take and still recover exactly with high probability [10]; see [26] for an overview. Furthermore, this recovery may be achieved in polynomial time by solving the convex minimization program
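
For concreteness, the program referred to here is presumably the standard \(\ell_1\)-minimization (basis pursuit) program

\[
\min \|x'\|_1 \quad \text{subject to} \quad A x' = y .
\]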

Stability results are also available when noise is added to the problem [9].

However, while the focus of compressed sensing is signal recovery with minimal information, the classical set-up , assumes infinite bit precision of the measurements. This disaccord raises an important question: how many bits per measurement (i.e. per coordinate of ) are sufficient for tractable and accurate sparse recovery? This paper shows that one bit per measurement is enough.

There are many applications where such severe quantization may be inherent or preferred — analog-to-digital conversion [20], binomial regression in statistical modeling and threshold group testing [12], to name a few.

1.1 Main results

This paper demonstrates that a simple modification of the convex program is able to accurately estimate from the extremely quantized measurement vector

Here is the vector of signs of the coordinates of .1

Note that contains no information about the magnitude of , and thus we can only hope to recover the normalized vector . This problem was introduced and first studied by Boufounos and Baraniuk [6] under the name of one-bit compressed sensing; some related work is summarized in Section 1.2.

We shall show that the signal can be accurately recovered by solving the following convex minimization program

The first constraint, , keeps the solution consistent with the measurements. It is defined by the relation for , where is the -th row of . The second constraint, , serves to prevent the program from returning a zero solution. Moreover, this constraint is linear as it can be represented as one linear equation where denote the coordinates of . Therefore is indeed a convex minimization program; furthermore one can easily represent it as a linear program, see below. Note also that the number in is chosen for convenience of the analysis; it can be replaced by any other fixed positive number.
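
For concreteness, here is a reconstruction of the program in the form suggested by the description above; the right-hand side of the linear equation is an assumption, consistent with the remark that it can be replaced by any other fixed positive number:

\[
\min \|x'\|_1
\quad \text{subject to} \quad
y_i \langle a_i, x' \rangle \ge 0 \ \ (i = 1, \dots, m)
\quad \text{and} \quad
\sum_{i=1}^m y_i \langle a_i, x' \rangle = m .
\]

The first group of constraints encodes consistency with the sign measurements, and under these constraints the linear equation is the same as \(\|A x'\|_1 = m\), which rules out the zero solution.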

Here and thereafter and denote positive absolute constants; other standard notation is explained in Section 1.3.

Let us then state the special case of Theorem ? for sparse signals:

1.2 Prior work

While there have been several numerical results for quantized compressed sensing [6], as well as guarantees on the convergence of many of the algorithms used for these numerical results, theoretical accuracy guarantees have been much less developed. One may endeavor to circumvent this problem by considering quantization errors as a source of noise, thereby reducing the quantized compressed sensing problem to the noisy classical compressed sensing problem. Further, in some cases the theory and algorithms of noisy compressed sensing may be adapted to this problem as in [28]; the method of quantization may be specialized in order to minimize the recovery error. As noted in [19], if the range of the signal is unspecified, then such a noise source is unbounded, and so the classical theory does not apply. However, in the setup of our paper we may assume without loss of generality that , and thus it is possible that the methods of Candes and Tao [8] can be adapted to derive a version of Corollary ? for a fixed sparse signal . Nevertheless, we do not see any way to deduce by these methods a uniform result over all sparse signals .

In a complementary line of research, Ardestanizadeh et al. [2] consider compressed sensing with a finite number of bits per measurement. However, the number of bits per measurement there is not one (or constant); this number depends on the sparsity level and the dynamic range of the signal . Similarly, in the work of Gunturk et al. [14] on sigma-delta quantization, the number of bits per measurement depends on the dynamic range of . On the other hand, by considering sigma-delta quantization and multiple bits, Gunturk et al. are able to provide excellent guarantees on the speed of decay of the error as decreases.

The framework of one-bit compressed sensing was introduced by Boufounos and Baraniuk in [6]. Jacques et al. [18] show that one-bit measurements are sufficient to recover an -sparse vector with arbitrary precision; their results are also robust to bit flips. In particular, their results require the estimate to be as sparse as , have unit norm, and be consistent with the data. The difficulty is that the first two of these constraints are non-convex, and thus the only program known to return such an estimate is minimization with the unit norm constraint; this is generally considered to be intractable. Gupta et al. [16] demonstrate that one may tractably recover the support of from measurements. They give two measurement schemes. One is non-adaptive, but the number of measurements has a quadratic dependence on the dynamic range of the signal. The other has no such dependence but is adaptive. Our results settle several of these issues: (a) we make no assumption about the dynamic range of the signal, (b) the one-bit measurements are non-adaptive, and (c) the signal is recovered by a tractable algorithm (linear programming).

1.3 Notation and organization of the paper

Throughout the paper, , , , etc. denote absolute constants whose values may change from line to line. For integer , we denote . Vectors are written in bold italics, e.g., , and their coordinates are written in plain text so that the -th component of is . For a subset , is the vector restricted to the elements indexed by . The and norms of a vector are defined as and respectively. The number of non-zero coordinates of is denoted by . The unit balls with respect to and norms are denoted by and respectively. The unit Euclidean sphere is denoted .

The rest of the paper is devoted to proving Theorem ?. In Section 2 we reduce this task to the following two ingredients: (a) Theorem ? which states that a solution to is effectively sparse, and (b) Theorem ? which analyzes a simpler but non-convex version of where the constraint is replaced by . The latter result can be interpreted in a geometric way in terms of random hyperplane tessellations of a subset of the Euclidean sphere, specifically for the set of effectively sparse signals . In Section 3 we estimate the metric entropy of , and we use this in Section 4 to prove our main geometric result of independent interest: random hyperplanes are enough to cut into small pieces, yielding that all cells of the resulting tessellation have arbitrarily small diameter. This will complete part (b) above. For part (a), we prove Theorem ? on the effective sparsity of solutions in Section 5. The proof is based on counting all possible solutions of , which are the vertices of the feasible polytope. This will allow us to use standard concentration inequalities from the Appendix and to conclude the argument by a union bound.

Acknowledgement

The authors are grateful to Sinan Güntürk for pointing out an inaccuracy in the statement of Lemma in an earlier version of this paper.

2 Strategy of the proof

Our proof of Theorem ? has two main ingredients which we explain in this section. Throughout the paper, will denote the rows of , which are i.i.d. standard normal vectors in .

Let us revisit the second constraint in the convex minimization program . Consider a fixed signal for the moment. Taking the expectation with respect to the random matrix , we see that

where . Here we used that the first absolute moment of the standard normal random variable equals . So in expectation, the constraint is equivalent to up to constant factor .
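
A sketch of this computation, assuming the second constraint of the program is \(\sum_{i=1}^m y_i \langle a_i, x \rangle = m\): since \(y_i = \operatorname{sign}\langle a_i, x\rangle\) for the fixed signal, \(\langle a_i, x \rangle \sim N(0, \|x\|_2^2)\), and \(\mathbb{E}|g| = \sqrt{2/\pi}\) for \(g \sim N(0,1)\), we have

\[
\mathbb{E} \sum_{i=1}^m y_i \langle a_i, x \rangle
= \sum_{i=1}^m \mathbb{E} \, |\langle a_i, x \rangle|
= m \sqrt{\frac{2}{\pi}} \, \|x\|_2 ,
\]

so the constraint holds in expectation precisely when \(\|x\|_2 = \sqrt{\pi/2}\).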

This observation suggests that we may first try to analyze the simpler minimization program

This optimization program was first proposed in [6]. Unfortunately, it is non-convex due to the constraint , and therefore seems to be computationally intractable. On the other hand, we find that the non-convex program is more amenable to theoretical analysis than the convex program .

The first ingredient of our theory will be to demonstrate that the non-convex optimization program leads to accurate recovery of an effectively sparse signal . One can reformulate this as a geometric problem about random hyperplane tessellations. We will discuss tessellations in Section 4; the main result of that section is Theorem ? which immediately implies the following result:

Theorem ? yields a version of our main Theorem ? for the non-convex program :

We can assume without loss of generality that and thus . Since is feasible for the program , we also have , and thus . Therefore Theorem ? applies to , and it yields that as required.

Theorem ? would follow if we could demonstrate that the convex program and the non-convex program were equivalent. Rather than doing this explicitly, we shall prove that the solution of the convex program essentially preserves the effective sparsity of a signal , and we finish off by applying Theorem ?.

This result is the second main ingredient of our argument, and it will be proved in Section 5. Now we are ready to deduce Theorem ?.

Consider a signal as in Theorem ?, so . In view of the application of Theorem ?, we may assume without loss of generality that . Indeed, otherwise we have and the conclusion of Theorem ? is trivial. So Theorem ? applies and gives

Also, as we noted above, . So Theorem ? applies for the normalized vectors , and for . Note that because is a feasible vector for the program . Therefore Theorem ? yields

where

This completes the proof.

For the rest of the paper, our task will be to prove the two ingredients above – Theorem ?, which we shall relate to a more general hyperplane tessellation problem, and Theorem ? on the effective sparsity of the solution.

3 Geometry of signal sets

Our arguments are based on the geometry of the set of effectively -sparse signals

and the set of -sparse signals

While the set is not convex, is, and moreover it is a convexification of in the following sense. Below, for a set , we define to be its convex hull.

The first containment follows from the Cauchy-Schwarz inequality, which implies for each that . The second containment is proved using a common technique in the compressed sensing literature. Let . Partition the support of into disjoint subsets so that indexes the largest elements of (in magnitude), indexes the next largest elements, and so on. Since all , in order to complete the proof it suffices to show that

To prove this, first note that . Second, note that for , each element of is bounded in magnitude by , and thus . Combining these two facts we obtain

where in the last inequality we used that for . The proof is complete.
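
A hedged reconstruction of the chain of estimates just used, writing \(T_0\) for the indices of the \(s\) largest coordinates of \(x\) (in magnitude), \(T_1\) for the next \(s\) largest, and so on, and assuming the lemma asserts the containment with constant \(2\):

\[
\|x_{T_i}\|_2 \le \sqrt{s} \, \|x_{T_i}\|_\infty \le \frac{\|x_{T_{i-1}}\|_1}{\sqrt{s}} \quad (i \ge 1),
\qquad \text{hence} \qquad
\sum_{i \ge 0} \|x_{T_i}\|_2 \le \|x_{T_0}\|_2 + \frac{\|x\|_1}{\sqrt{s}} \le 1 + 1 = 2 .
\]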

Our arguments will rely on entropy bounds for the set . Consider a more general situation, where is a bounded subset of and is a fixed number. A subset is called an -net of if for every one can find so that . The minimal cardinality of an -net of is called the covering number and denoted . The number is called the metric entropy of . The covering numbers are (almost) increasing by inclusion:
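
The monotonicity property in question is presumably the standard one: for bounded sets,

\[
K \subseteq L \quad \Longrightarrow \quad N(K, \varepsilon) \le N(L, \varepsilon/2) \qquad \text{for all } \varepsilon > 0 ,
\]

since every point of an \((\varepsilon/2)\)-net of \(L\) lying within \(\varepsilon/2\) of \(K\) can be replaced by a nearby point of \(K\) at the cost of doubling the radius.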

Specializing to our sets of signals and , we come across a useful example of an -net:

Let , and let denote the set of the indices of the largest coefficients of (in magnitude). Using the decomposition and noting that , we see that it suffices to check that . This will follow from the same steps as in . In particular, we have

as required.

Next we pass to quantitative entropy estimates. The entropy of the Euclidean ball can be estimated using a standard volume comparison argument, as follows (see [24]):
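
The volumetric bound being invoked is presumably the standard one (see, e.g., [24]):

\[
N(B_2^n, \varepsilon) \le \Big( 1 + \frac{2}{\varepsilon} \Big)^n ,
\qquad \text{so} \qquad
\log N(B_2^n, \varepsilon) \le n \log \frac{3}{\varepsilon} \quad \text{for } \varepsilon \le 1 .
\]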

From this we deduce a known bound on the entropy of :

We represent as the union of the unit Euclidean balls in all -dimensional coordinate subspaces, . Each ball has an -net for of cardinality at most , according to . The union of these nets forms an -net of , and since the number of possible is , the resulting net has cardinality at most . Taking the logarithm completes the proof.
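
Spelling out this count, with \(S\) denoting the set of \(s\)-sparse vectors in the unit Euclidean ball (the notation here is ours):

\[
N(S, \varepsilon) \le \binom{n}{s} \Big( 1 + \frac{2}{\varepsilon} \Big)^{s}
\le \Big( \frac{en}{s} \Big)^{s} \Big( 1 + \frac{2}{\varepsilon} \Big)^{s} ,
\qquad \text{so} \qquad
\log N(S, \varepsilon) \le s \log \frac{en}{s} + s \log \Big( 1 + \frac{2}{\varepsilon} \Big) .
\]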

As a consequence, we obtain an entropy bound for :

First note that . Then the monotonicity property followed by the volumetric estimate yield the first desired bound for all .

Next, suppose that . Then set . Lemma states that is an -net of . Furthermore, to find an -net of , we use Lemma for and for . Taking into account the monotonicity property , we see that there exists an -net of and such that

It follows that is an -net of , and its cardinality is as required.

4 Random hyperplane tessellations

In this section we prove a generalization of Theorem ?. We consider a set and a collection of random hyperplanes in , chosen independently and uniformly according to the Haar measure. The resulting partition of by this collection of hyperplanes is called a random tessellation of . The cells of the tessellation are formed by intersection of and the random half-spaces with particular orientations. The main interest in the theory of random tessellations is the typical shape of the cells.

Figure 1: Hyperplane tessellation of a subset K of a sphere

We shall study the situation where is a subset of the sphere , see Figure 1. The particular example of is a natural model of random hyperplane tessellation in the spherical space . The more classical and well-studied model of random hyperplane tessellation is in Euclidean space , where the hyperplanes are allowed to be affine; see [23] for the history of this field. Random hyperplane tessellations of the sphere are studied in particular in [22].

Here we focus on the following question. How many random hyperplanes ensure that all the cells of the tessellation of have small diameter (such as )? For the purposes of this paper, we shall address this problem for a specific set, namely for

We shall prove that hyperplanes suffice with high probability. Our argument can be extended to more general sets , but we defer generalizations to a later paper.

It is convenient to represent the random hyperplanes in Theorem ? as , , where are i.i.d. standard normal vectors in . The claim that all cells of the tessellation of have diameter at most can be restated in the following way. Every pair of points satisfying is separated by at least one of the hyperplanes, so there exists such that
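
In other words, some hyperplane \(\{ z : \langle a_i, z \rangle = 0 \}\) assigns opposite signs to the two points:

\[
\operatorname{sign} \langle a_i, x \rangle \ne \operatorname{sign} \langle a_i, x' \rangle .
\]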

Theorem ? is then a direct consequence of the following slightly stronger result.

We will prove Theorem ? by the following covering argument, which will allow us to uniformly handle all pairs satisfying . We choose an -net of as in Lemma ?. We decompose the vector where is a “center” and is a “tail”, and we do similarly for . An elementary probabilistic argument and a union bound will allow us to nicely separate each pair of centers satisfying by hyperplanes. (Specifically, it will follow that , for at least of the indices .)

Furthermore, the tails can be uniformly controlled using Lemma ?, which implies that all tails are in a good position with respect to hyperplanes. (Specifically, for small one can deduce that , for at least of the indices .) Putting the centers and the tails together, we shall conclude that and are separated by at least hyperplanes, as required.

Now we present the full proof of Theorem ?.

4.1 Step 1: Decomposition into centers and tails

Let be a number to be determined later. Let be an -net of . Since , Lemma ?, along with the monotonicity property of entropy, guarantees that can be chosen so that

Since is an -net of , representation holds for some . Since , it remains to check that . Note that and . By the triangle inequality this implies that . Thus , as claimed.

4.2 Step 2: Separation of the centers

Our next task is to separate the centers , of each pair of points that are far apart by hyperplanes. For a fixed pair of points and for one hyperplane, it is easy to estimate the probability of a nice separation.

Note that

The inequality above follows by the union bound. Now, since we have

Also, denoting the geodesic distance in by it is not hard to show that

(see [13]). Thus

as claimed.
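
For reference, the classical identity behind this step, stated here as a reconstruction (the event actually bounded in the lemma may carry an extra margin): for \(x, x' \in S^{n-1}\) and a standard normal vector \(a\),

\[
\mathbb{P} \big\{ \operatorname{sign} \langle a, x \rangle \ne \operatorname{sign} \langle a, x' \rangle \big\}
= \frac{d(x, x')}{\pi} ,
\qquad
\|x - x'\|_2 \le d(x, x') \le \frac{\pi}{2} \, \|x - x'\|_2 ,
\]

where \(d(x, x') = \arccos \langle x, x' \rangle\) is the geodesic distance on the sphere.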

Now we estimate the number of hyperplanes that nicely separate a given pair of points.

The cardinality is a binomial random variable, which is the sum of indicator functions of the independent events . The probability of each such event can be estimated using Lemma ?. Indeed, suppose for some , and let . Then the probability of each of the events above is at least . Then with . A standard deviation inequality (e.g. [1]) yields
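
A standard bound of the kind cited here: if \(S \sim \mathrm{Binomial}(m, p)\), then a Chernoff-type deviation inequality gives

\[
\mathbb{P} \{ S \le pm/2 \} \le \exp ( - pm/8 ) .
\]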

Now we take a union bound over pairs of centers in the net that was chosen in the beginning of Section 4.1.

For a fixed pair as above, we can rewrite as

A union bound over all pairs implies that the event in fails with probability at most

By and , this quantity is further bounded by

provided the absolute constant is chosen sufficiently large. The proof is complete.

4.3 Step 3: Control of the tails

Now we provide a uniform control of the tails that arise from the decomposition given in Lemma ?. The next result is a direct consequence of Lemma ?.

4.4 Step 4: Putting the centers and tails together

Let for a sufficiently small absolute constant . To separate the centers, we choose an -net of as in Lemma ?, and we shall apply this lemma with instead of . Note that requirement becomes

and it is satisfied by the assumption of Theorem ?, for a sufficiently large absolute constant . So Lemma ? yields that with probability at least , the following separation of centers holds:

To control the tails, we choose as in Decomposition Lemma ?, and we shall apply Lemma ?. Note that requirement becomes

and it is satisfied by the assumption of Theorem ?, for a sufficiently large absolute constant . So Lemma ? yields that with probability at least , the following control of tails holds:

Now we combine the centers and the tails. With probability at least , both events and hold. Suppose both these events indeed hold, and consider a pair of vectors as in the assumption, so . We decompose these vectors according to Lemma ?:

where and . By the triangle inequality and the choice of , the centers are far apart:

Then event implies that the separating set

Furthermore, using for the tails and we see that

By Markov’s inequality, the set

We claim that

is a set of indices that satisfies the conclusion of Theorem ?. Indeed, the number of indices in is as required since

Further, let us fix . Using decomposition we can write

Since , we have , while from we obtain . Thus

where the last estimate follows by the choice of for a sufficiently small absolute constant . In a similar way one can show that

This completes the proof of Theorem ?.

5 Effective sparsity of solutions

In this section we prove Theorem ? about the effective sparsity of the solution of the convex optimization problem . Our proof consists of two steps – a lower bound for proved in Lemma ? below, and an upper bound on which we can deduce from Lemma ? in the Appendix.

Let us assume for the moment that Lemma ? is true, and show how, together with Lemma ?, it implies Theorem ?.

With probability at least , the conclusions of both Lemma ? and Lemma ? with hold. Assume this event occurs. Consider a signal as in Theorem ? and the corresponding solution of . By Lemma ?, the latter satisfies

Next, consider

Since by the assumption on we have , Lemma ? with implies that

By definition of , the vector is feasible for the program , so the solution of this program satisfies

Putting this together with and , we conclude that

This completes the proof of Theorem ?.

In the rest of this section we prove Lemma ?. The argument is based on the observation that the set of possible solutions of the convex program for all and corresponding is finite, and its cardinality can be bounded by . For each fixed solution , a lower bound on will be deduced from Gaussian concentration inequalities, and the argument will be finished by taking a union bound over .

It may be convenient to recast the convex minimization program as a linear program by introducing the dummy variables :

The feasible set of the linear program is a polytope in , and the linear program attains a solution on a vertex of this polytope. Further, since the random Gaussian vectors are in general position, one can check that the solution of the linear program is unique with probability . Thus, by characterizing these vertices and pointing out the relationship between and , we may reduce the space of possible solutions . This is the content of our next lemma. Given subsets , , we define to be the submatrix of with columns indexed by and rows indexed by .
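
To make the recast concrete, here is a minimal numerical sketch (not the paper's code). It assumes the program has the form \(\min \|x\|_1\) subject to \(y_i \langle a_i, x \rangle \ge 0\) and \(\sum_i y_i \langle a_i, x \rangle = m\), introduces the dummy variables \(t_j \ge |x_j|\) described above, and calls scipy's generic LP solver; the function name and the toy parameters are hypothetical.

import numpy as np
from scipy.optimize import linprog

def one_bit_lp(A, y):
    # Linear-program recast in the variables (x, t):
    #   minimize sum_j t_j  subject to  -t_j <= x_j <= t_j,
    #   y_i <a_i, x> >= 0,  and  sum_i y_i <a_i, x> = m.
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])      # objective: sum of dummy variables
    I = np.eye(n)
    YA = y[:, None] * A                                 # i-th row is y_i * a_i
    A_ub = np.block([
        [I, -I],                                        #  x_j - t_j <= 0
        [-I, -I],                                       # -x_j - t_j <= 0
        [-YA, np.zeros((m, n))],                        # sign consistency: -y_i <a_i, x> <= 0
    ])
    b_ub = np.zeros(2 * n + m)
    A_eq = np.concatenate([YA.sum(axis=0), np.zeros(n)]).reshape(1, -1)
    b_eq = np.array([float(m)])                         # sum_i y_i <a_i, x> = m
    bounds = [(None, None)] * n + [(0, None)] * n       # x free, t nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:n]

# Toy usage: estimate the direction of a sparse vector from one-bit measurements.
rng = np.random.default_rng(0)
n, m, s = 100, 500, 5
x = np.zeros(n); x[:s] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
y = np.sign(A @ x)
x_hat = one_bit_lp(A, y)
print(np.linalg.norm(x_hat / np.linalg.norm(x_hat) - x / np.linalg.norm(x)))

The dummy variables make the \(\ell_1\) objective linear, which is the standard trick behind the recast described in the text.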

Part (1) follows since we are minimizing . Part (5) follows since

combined with the fact that we are implicitly minimizing . Parts (2)–(4) will follow from the fact that achieves its minimum at a vertex. The vertices are precisely the feasible points for which some of the inequality constraints achieve equality, provided is the unique solution to those equalities. Since at least of the constraints must be equalities. We now count equalities based on and .

We first consider the constraints . If we have two equalities, and , otherwise, we have one. This gives equalities. Part (5) gives one more equality. This leaves us with at least equalities that must be satisfied out of the equations . Thus, we may take .

We may disregard the dummy variables and consider that the solution must satisfy conditions (2)–(5) above for some and . We will show that with high probability, any such vector is lower bounded in the Euclidean norm.

Let us first fix sets and , and consider a vector satisfying (2)–(5). We represent it as

Our goal is to lower bound . By condition (4) above, we have which, with probability 1, completely determines the vector up to multiplication by (since and ). Moreover, since , we have , so for . Using this with together with condition (5), we obtain

and thus

We proceed to upper bound .

Since the random vector depends entirely on , it is independent of for . Thus, by the rotational invariance of the Gaussian distribution, for any fixed vector with unit norm, we have the following distributional estimates:2

The last term is a sum of independent sub-Gaussian random variables, and it can be bounded using standard concentration inequalities. Specifically, applying Lemma ? from the Appendix, we obtain

Using , this is equivalent to

It remains to upper bound the number of vectors satisfying conditions (2)–(5) and to apply a union bound. Since , the total number of possible choices for and (and hence of ) is

Thus, by picking with a sufficiently large absolute constant , we find that all uniformly satisfy the required estimate with probability at least . Lemma ? is proved.

Appendix. Uniform concentration inequality.

In this section we prove concentration inequalities for

In the situation where the vector is fixed, we have a sum of independent random variables, which can be controlled by standard concentration inequalities:

Without loss of generality we can assume that . Then are independent standard normal random variables, so . Therefore are independent and identically distributed centered random variables. Moreover, are sub-gaussian random variables with , see [26]. An application of a Hoeffding-type inequality (see [26]) yields

This completes the proof.
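
For reference, one standard formulation of the Hoeffding-type inequality from [26] used here: if \(Z_1, \dots, Z_m\) are independent centered sub-gaussian random variables with \(\|Z_i\|_{\psi_2} \le K\), then for every \(t \ge 0\),

\[
\mathbb{P} \Big\{ \Big| \frac{1}{m} \sum_{i=1}^m Z_i \Big| \ge t \Big\}
\le e \cdot \exp \Big( - \frac{c \, m \, t^2}{K^2} \Big) .
\]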

We will now prove a stronger version of Lemma ? that is uniform over all effectively sparse signals .

This is a standard covering argument, although the approximation step requires a little extra care. Let be a -net of . Since , we can arrange by Lemma ? that

By definition, for any one can find such that . So the triangle inequality yields

Note that . Together with this means that

Consequently,

We bound the terms and separately. For simplicity of notation, we assume that is an integer, as the non-integer case will have no significant effect on the result.

A bound on follows from the concentration estimate in Lemma ? and a union bound: