# Robust one-bit compressed sensing with non-Gaussian measurements

Sjoerd Dirksen  and  Shahar Mendelson
###### Abstract.

We study memoryless one-bit compressed sensing with non-Gaussian measurement matrices. We show that by quantizing at uniformly distributed thresholds, it is possible to accurately reconstruct low-complexity signals from a small number of one-bit quantized measurements, even if the measurement vectors are drawn from a heavy-tailed distribution. Our reconstruction results are uniform in nature and robust in the presence of pre-quantization noise on the analog measurements as well as adversarial bit corruptions in the quantization process. If the measurement matrix is subgaussian, then accurate recovery can be achieved via a convex program. Our reconstruction theorems rely on a new random hyperplane tessellation result, which is of independent interest.

Lehrstuhl C für Mathematik (Analysis), RWTH Aachen University, dirksen@mathc.rwth-aachen.de
Mathematical Sciences Institute, The Australian National University and Department of Mathematics,
Technion, I.I.T, shahar.mendelson@gmail.com

## 1. Introduction

In any signal processing application, an important procedure is the quantization of analog signals to a finite number of bits. This essential step allows one to digitally transmit, process, and reconstruct signals. The area of quantized compressed sensing investigates how to design a measurement procedure, quantizer, and reconstruction algorithm that together recover low-complexity signals, such as signals that have a sparse representation in a given basis. An efficient system has to be able to reconstruct signals based on a minimal number of measurements, each of which is quantized to the smallest number of bits, and to do so via a computationally efficient reconstruction algorithm. In addition, the system should be reliable: it should be robust to pre-quantization noise (noise in the analog measurement process) and to post-quantization noise (bit corruptions that occur during the quantization process).

In this article we study the reconstruction of signals from measurements that are quantized to a single bit using an efficient quantizer. We focus on the one-bit compressed sensing model, in which one observes quantized measurements of the form

$$ q = \operatorname{sign}(Ax + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}}), \tag{1.1} $$

where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, sign is the sign function applied element-wise, $\nu_{\mathrm{noise}}$ is a vector modelling the noise in the analog measurement process and $\tau_{\mathrm{thres}}$ is a (possibly random) vector consisting of quantization thresholds. We restrict ourselves to memoryless quantization, meaning that the thresholds are set in a non-adaptive manner. In this case, the one-bit quantizer can be implemented using an energy-efficient comparator to a fixed voltage level (if all entries of $\tau_{\mathrm{thres}}$ equal the same constant), combined with dithering (if $\tau_{\mathrm{thres}}$ is random). Because of its efficiency, this quantizer has been very popular in the engineering literature, in particular in applications where analog-to-digital converters represent a significant factor in the energy consumption of the measurement system (e.g. in distributed sensing and massive MIMO).

In spite of its popularity, there are few rigorous results that show that one-bit compressed sensing is viable: the vast majority of the mathematical literature has focused on the special case where $A$ is a standard Gaussian matrix, and the practical relevance of such results is limited, since Gaussian matrices cannot be realized in a real-world measurement setup. As an additional difficulty, one-bit compressed sensing may perform poorly outside the Gaussian setup. In fact, it can very easily fail, even if the measurement matrix is known to perform optimally in ‘unquantized’ compressed sensing. For example, if the threshold vector $\tau_{\mathrm{thres}}$ is zero, there are sparse vectors that cannot be distinguished based on their one-bit Bernoulli measurements (see [1] and Section 2 for more details).

The purpose of this work is to show that one-bit compressed sensing can perform well in scenarios that are far more general than the Gaussian setting. What makes all the difference is the rather striking effect that dithering has on the one-bit quantizer. We show that thanks to dithering, accurate recovery from one-bit measurements is possible even if the measurement vectors are drawn from a heavy-tailed distribution. Moreover, the recovery results that we establish are robust to both adversarial and potentially heavy-tailed stochastic noise on the analog measurements, as well as to adversarial bit corruptions that may occur during quantization.

Let us formulate our main recovery results. Consider an arbitrary signal set $T$ contained in the Euclidean ball of radius $R$ in $\mathbb{R}^n$. Assume that at most a fraction $\beta$ of the bits are arbitrarily corrupted during quantization, that is, instead of $q$ one observes a binary vector $q_{\mathrm{corr}}$ whose Hamming distance to $q$ is at most $\beta m$. The measurement vectors, i.e., the rows of the matrix $A$, are independent and identically distributed as $X$ and the entries of $\nu_{\mathrm{noise}}$ are independent and identically distributed as $\nu$. Dithering is generated by the vector $\tau_{\mathrm{thres}}$, whose coordinates are independent and distributed uniformly in $[-\lambda, \lambda]$. We assume that $X$, $\nu$, and $\tau_{\mathrm{thres}}$ are independent.
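The measurement model described above can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function name `one_bit_measurements`, the convention `sign(0) = +1`, the Gaussian choice for both the measurement vectors and the noise, and the dither range $[-\lambda, \lambda]$ (written `lam`) are all assumptions made for illustration. Bit corruptions are placed at random positions here, whereas the text allows them to be adversarial.

```python
import random


def sign(t):
    # Convention (an assumption): sign(0) = +1, so every bit is +1 or -1.
    return 1 if t >= 0 else -1


def one_bit_measurements(A, x, lam, rng, noise_std=0.0, flip_fraction=0.0):
    """Return (q, q_corr, tau) for the dithered one-bit model (1.1)."""
    m = len(A)
    # Dither: independent thresholds, assumed uniform on [-lam, lam].
    tau = [rng.uniform(-lam, lam) for _ in range(m)]
    q = []
    for row, t in zip(A, tau):
        analog = sum(a * xi for a, xi in zip(row, x))
        analog += rng.gauss(0.0, noise_std)  # pre-quantization noise
        q.append(sign(analog + t))
    # Post-quantization corruption: flip a fixed fraction of the bits.
    q_corr = list(q)
    for i in rng.sample(range(m), int(flip_fraction * m)):
        q_corr[i] = -q_corr[i]
    return q, q_corr, tau


rng = random.Random(0)
n, m = 5, 200
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x = [0.6, 0.0, -0.8, 0.0, 0.0]  # a 2-sparse signal of unit norm
q, q_corr, tau = one_bit_measurements(
    A, x, lam=2.0, rng=rng, noise_std=0.1, flip_fraction=0.05
)
```

By construction, the Hamming distance between `q` and `q_corr` equals the number of flipped bits, which plays the role of the corruption budget in the text.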

Although the method we introduce can be used in other situations, our focus is on two scenarios. The first is an -subgaussian scenario, in which is an isotropic, symmetric random vector that is -subgaussian, that is, for every and , . Moreover, we assume that is -subgaussian as well: for every , . (Recall that a random vector is isotropic if its covariance matrix is the identity; thus, for every , .)

In the second scenario we explore heavy-tailed random variables: again $X$ is isotropic and symmetric, but in addition we only assume that it satisfies an $L_1$-$L_2$ norm equivalence: for every $x \in \mathbb{R}^n$,

$$ \|\langle X, x\rangle\|_{L_2} \le L\,\|\langle X, x\rangle\|_{L_1}. $$

In this scenario, we also assume that has finite variance and satisfies an - equivalence.

There are two different complexity parameters corresponding to these two scenarios, which dictate the number of measurements required for reconstruction. In the general, heavy-tailed case, let

$$ E(K) := \mathbb{E}\sup_{x \in K}\Big|\Big\langle \frac{1}{\sqrt{m}}\sum_{i=1}^{m}\varepsilon_i X_i,\, x\Big\rangle\Big|, $$

where is a sequence of independent, symmetric -valued random variables that is independent of . The sets we consider are usually localizations of , that is, the sets ; alternatively, at times we consider localizations of , that is the sets .

In the subgaussian scenario, the empirical processes parameter $E(K)$ is replaced by its natural upper bound, the Gaussian mean width of $K$, which is defined by

$$ \ell_*(K) := \mathbb{E}\sup_{x \in K}|\langle G, x\rangle|, $$

where $G$ is a standard Gaussian vector in $\mathbb{R}^n$. The fact that $E(K)$ is dominated by the Gaussian mean width of $K$ is one of the features of subgaussian processes and is an outcome of Talagrand’s Majorizing Measures Theorem. Finding upper bounds on $E(K)$ when $X$ is not subgaussian is a challenging question that has been studied extensively over the last 30 years or so and which will not be pursued here.

Finally, we let denote the covering number of with respect to the Euclidean norm.

In what follows we study two reconstruction programs. The first is simply ‘empirical risk minimization’ performed in $T$:

$$ \min\ d_H\big(q_{\mathrm{corr}}, \operatorname{sign}(Az + \tau_{\mathrm{thres}})\big) \quad \text{s.t.} \quad z \in T. \tag{1.2} $$

Thus, one selects $z \in T$ whose noiseless one-bit measurements minimize the Hamming distance to the corrupted vector of quantized noisy measurements.
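To make the selection rule in (1.2) concrete, here is a minimal sketch for the special case where the signal set is a small finite list of candidates, so that the minimum can be found by exhaustive search. The helper names (`hamming`, `one_bit`, `erm_recover`), the toy data and the convention `sign(0) = +1` are illustrative assumptions; for the signal sets considered in the text no such brute-force search is implied.

```python
def sign(t):
    # Convention (an assumption): sign(0) = +1.
    return 1 if t >= 0 else -1


def hamming(u, v):
    # Unnormalized Hamming distance between two sign vectors.
    return sum(a != b for a, b in zip(u, v))


def one_bit(A, z, tau):
    # Noiseless dithered one-bit measurements sign(Az + tau).
    return [sign(sum(a * zi for a, zi in zip(row, z)) + t)
            for row, t in zip(A, tau)]


def erm_recover(candidates, A, tau, q_corr):
    # Exhaustive version of (1.2): pick the candidate whose noiseless
    # bits are closest, in Hamming distance, to the observed bits.
    return min(candidates, key=lambda z: hamming(q_corr, one_bit(A, z, tau)))


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
tau = [0.5, -0.5, 0.2, -0.2]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Bits of the candidate [1, 0] with the third bit flipped.
q_corr = [1, -1, -1, 1]
best = erm_recover(candidates, A, tau, q_corr)
```

In this toy instance the three candidates are at Hamming distance 1, 3 and 2 from `q_corr`, so the search returns `[1.0, 0.0]` despite the flipped bit.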

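The second recovery program is based on regularization and is performed in $\operatorname{conv}(T)$: for $\lambda > 0$,

$$ \max\ \frac{1}{m}\langle q_{\mathrm{corr}}, Az\rangle - \frac{1}{2\lambda}\|z\|_2^2 \quad \text{s.t.} \quad z \in \operatorname{conv}(T). \tag{1.3} $$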

As we discuss in detail in Section 2, (1.3) is essentially the convexification of (1.2). It can be solved in low-order polynomial time for several signal sets of interest, e.g., for sparse vectors and low-rank matrices.

To present our recovery results, fix a target reconstruction error , recall that the quantization thresholds are uniformly distributed on and that at most of the bits are corrupted during quantization. The adversarial component of the pre-quantization noise is , is its variance and is its norm.

Our first recovery result for (1.2) is in the -subgaussian scenario; as we explain in what follows, it extends and improves significantly on the current state of the art in various ways; see Section 1.1 for details.

###### Theorem 1.1.

There exist constants depending only on such that the following holds. Let , set and put . Assume that

$$ m \ge c_1 \lambda\left(\frac{\ell_*^2(T_r)}{\rho^3} + \frac{\log N(T, r)}{\rho}\right), $$

and that , and .

Then with probability at least , for every , any solution of (1.2) satisfies .

To put Theorem 1.1 in some context, consider an arbitrary and assume , so that is a constant that depends only on . By Sudakov’s inequality,

$$ \log N(T, r) \le c\,\frac{\ell_*^2(T)}{r^2} \le c_1(L)\,\frac{\log(e/\rho)}{\rho^2}\,\ell_*^2(T), \tag{1.4} $$

and trivially , which means that a sample size of

$$ m = c'(L)\,\frac{\log(e/\rho)}{\rho^3}\,\ell_*^2(T) $$

suffices for recovery. In the special case of , the subset of consisting of -sparse vectors, a much better estimate is possible. Indeed, it is standard to verify that there is an absolute constant such that for any ,

$$ \ell_*(\Sigma_{s,n}) \simeq \sqrt{s\log(en/s)} \quad \text{and} \quad \log N(\Sigma_{s,n}, r) \le c\,s\log\Big(\frac{en}{sr}\Big). $$

Moreover, since it follows that

$$ \ell_*(T_r) \le c_1 r\sqrt{s\log(en/s)} = c_2(L)\,\frac{\rho}{\sqrt{\log(e/\rho)}}\cdot\sqrt{s\log(en/s)}, $$

implying that a sample size of

$$ m = c_2(L)\,\rho^{-1} s\log\Big(\frac{en}{s\rho}\Big) $$

guarantees that with high probability one can recover with accuracy any -sparse vector via (1.2).

When it comes to the heavy-tailed scenario, the connection between the sample size and the accuracy is less explicit, because depends on . And although the uniform central limit theorem shows that converges to as tends to infinity, here one is interested in quantitative estimates, which are, in general, nontrivial. Because that is not the main focus of this article we shall not pursue the question of estimating any further.

###### Theorem 1.2.

There exist constants depending only on such that the following holds. Assume that and are as above and satisfy an - norm equivalence and that . Let , set , and suppose that satisfies

$$ m \ge c_2\left(\Big(\frac{\lambda E(T_r)}{\rho^2}\Big)^2 + \frac{\lambda\log N(T, r)}{\rho}\right). $$

Assume further that , and .

Then with probability at least , for every , any solution of (1.2) satisfies .

Next, let us turn to the convex recovery procedure (1.3) which we only explore in the -subgaussian scenario. Just like Theorem 1.1 and Theorem 1.2, Theorem 1.3 is a significant improvement on the current state of the art (see Section 1.1 for more details).

We assume for the sake of simplicity that and denote and .

###### Theorem 1.3.

There exist constants that depend only on for which the following holds. Let , fix , set

$$ \lambda \ge c_1(\sigma + R)\sqrt{\log(c_1\lambda/\rho)} $$

and let . If

$$ m \ge c_3\left(\Big(\frac{\lambda\,\ell_*(U_\rho)}{\rho^2}\Big)^2 + \frac{\lambda^2\log N(T, r)}{\rho^2}\right), $$

then with probability at least

$$ 1 - 4\exp\Big(-c_4\min\Big\{\frac{m\rho^2}{\lambda^2},\ \beta m\log(e/\beta)\Big\}\Big), $$

for any , a solution of (1.3) satisfies

$$ \|x^{\#} - x\|_2 \le \max\big\{\rho,\ c_5\lambda\beta\sqrt{\log(e/\beta)}\big\}. $$

As an example, let , the set of approximately -sparse vectors in the Euclidean unit ball. Observe that and that one may set . Also, for , , and it is standard to verify that . Taking the estimate (1.4) for into account, it is evident that if

$$ m = c(L)\,\frac{s\log(en/s)\,\log^2(e\lambda/\rho)}{\rho^4} $$

then with high probability one may recover any using the convex recovery procedure (1.3), even in the presence of pre- and post quantization noise.

At the heart of our analysis is a generalization of a beautiful result due to Plan and Vershynin [13]. They showed that if is a subset of the Euclidean unit sphere , , and is an Gaussian matrix, then with high probability for all ,

$$ d_{S^{n-1}}(x, y) - \rho \le \frac{1}{m}\,d_H\big(\operatorname{sign}(\Gamma x), \operatorname{sign}(\Gamma y)\big) \le d_{S^{n-1}}(x, y) + \rho; \tag{1.5} $$

in other words, if and are ‘far enough’, the fraction of the random Gaussian hyperplanes that separate and approximates their geodesic distance in a very sharp way. It was later shown in [10] that (1.5) remains true if . Moreover, for certain ‘simple’ sets (e.g., if is the set of unit norm sparse vectors) measurements suffice (see [10, 13]).

The main technical result of this article is an ‘isomorphic’ version of (1.5): the fact that distances are exhibited by the number of separating hyperplanes happens to be a general phenomenon rather than being a Gaussian one; it holds for hyperplanes generated by a subgaussian or a heavy-tailed random vector , aided by the crucial impact of dithering.

More accurately, we show that as long as are ‘far enough’, the number of hyperplanes given by that separate and is , and is of the order of . As an example, our results lead to the following:

###### Theorem 1.4.

Let be an isotropic, symmetric, -subgaussian random vector, let and set

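$$ d(x, y) = \frac{1}{m}\big|\{i : \operatorname{sign}(\langle X_i, x\rangle + \tau_i) \ne \operatorname{sign}(\langle X_i, y\rangle + \tau_i)\}\big|. $$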

If , and

$$ m \ge c_1\,\frac{R\log^{3/2}(eR/\rho)}{\rho^3}\,\ell_*^2(T), $$

then with probability at least , for any such that , one has

$$ c_2\,\frac{\|x - y\|_2}{R} \le d(x, y) \le c_3\sqrt{\log(eR/\rho)}\cdot\frac{\|x - y\|_2}{R}. \tag{1.6} $$

The connection of our recovery problem with this separation property, and more generally with random tessellations is explained in detail in Section 2; the study of generalizations of (1.5) can be found in Section 3.
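The quantity $d(x,y)$ from Theorem 1.4, the fraction of dithered hyperplanes that separate two points, is straightforward to compute empirically. The sketch below is illustrative only: Gaussian measurement vectors, dither uniform on $[-\lambda, \lambda]$ with `lam = 1.0`, the function name `separation_fraction` and the convention `sign(0) = +1` are assumptions, and the code makes no claim about the constants in (1.6).

```python
import random


def sign(t):
    # Convention (an assumption): sign(0) = +1.
    return 1 if t >= 0 else -1


def separation_fraction(X, tau, x, y):
    """Fraction of shifted hyperplanes {<X_i, .> + tau_i = 0} separating x and y."""
    count = 0
    for row, t in zip(X, tau):
        sx = sign(sum(a * b for a, b in zip(row, x)) + t)
        sy = sign(sum(a * b for a, b in zip(row, y)) + t)
        if sx != sy:
            count += 1
    return count / len(X)


rng = random.Random(1)
n, m, lam = 3, 500, 1.0
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
tau = [rng.uniform(-lam, lam) for _ in range(m)]
x = [0.5, 0.0, 0.0]
y = [0.0, 0.5, 0.0]
d_xy = separation_fraction(X, tau, x, y)
```

The function is symmetric in its two points and returns zero when they coincide, as a distance-like quantity should.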

###### Remark 1.5.

At the expense of substantial additional technicalities, the proof strategies developed in this work lead to recovery results for sparse vectors when is a random partial circulant matrix generated by a subgaussian random vector. The latter model occurs in several practical measurement setups, including SAR radar imaging, Fourier optical imaging and channel estimation (see e.g. [14] and the references therein). To keep this work accessible to a general audience and clearly expose the main ideas, we choose to defer the additional technical developments needed for the circulant case to a companion work [6].

### 1.1. Related work

To put our main results in some perspective, let us describe some closely related results in detail. We refer to [5] and the references therein for further facts on one-bit compressed sensing.

Gaussian measurements. As was mentioned previously, almost all the signal reconstruction results in (memoryless) one-bit compressed sensing concern standard Gaussian measurement matrices. Let us first consider known results when there is no dithering (). In that case it is only possible to recover signals located on the unit sphere. It was shown in [8, Theorem 2] that if is standard Gaussian and then, with high probability, any -sparse for which satisfy . In particular, one can approximate with accuracy by solving the non-convex program

$$ \min\ \|z\|_0 \quad \text{s.t.} \quad \operatorname{sign}(Ax) = \operatorname{sign}(Az),\ \|z\|_2 = 1. $$

In comparison, Theorem 1.1 shows that this estimate holds in the subgaussian scenario—attaining the same quantitative estimate on the sample size and making it robust to pre- and post-quantization noise. Clearly, such a generalization is possible thanks to the effect of dithering. Theorem 1.2 shows that this result can be extended further to heavy-tailed measurements, at the expense of worse parameter dependencies.

Plan and Vershynin were the first to propose a tractable method for stable reconstruction of sparse vectors [11]. They showed that by using Gaussian measurements, with high probability one can recover every with and up to reconstruction error via a linear program. However, this reconstruction result is not robust to quantization errors. To mend that, Plan and Vershynin introduced in [12] a different, convex program (see (2.4) below) and proved recovery results for signal sets of two different flavours. In a non-uniform recovery setting they showed that measurements suffice to reconstruct a fixed signal, even if pre-quantization noise is present and quantization bits are randomly flipped with a probability that is allowed to be arbitrarily close to . (In the uniform recovery setting one attains a high probability event on which recovery is possible for all , whereas in non-uniform recovery the event depends on the signal .) In the uniform recovery setting, they showed that if , one can achieve a reconstruction error even if a fraction of the received bits are corrupted in an adversarial manner while quantizing. Theorem 1.3 extends the latter result to subgaussian measurements with a better condition on and , and at the same time incorporates pre-quantization noise and allows the reconstruction of signals that need not be located on the unit sphere.

Non-Gaussian measurements. When the measurements are not standard Gaussian, there are very few reconstruction results available. The work [1] generalized the non-uniform recovery results from [12] to subgaussian measurements under additional restrictions. For a fixed with they showed that suffice to reconstruct up to error via (2.4) provided that either (meaning that the signal must be sufficiently spread) or the total variation distance between the subgaussian measurements and the standard Gaussian distribution is at most . Theorem 1.3 significantly improves on these results.

### 1.2. Notation

We use to denote the -norm of and denotes the -unit ball in . For a subgaussian random variable we let

 ∥ξ∥ψ2:=supp≥1∥ξ∥Lp√p∥ξ∥L2.

We use to denote the uniform distribution. For set and for a set let denote its cardinality. is the (unnormalized) Hamming distance on the discrete cube and is the set of -sparse vectors in the Euclidean unit ball. Finally, and denote absolute constants; their value may change from line to line. or denotes a constant that depends only on the parameter . We write if , and means that and .

## 2. The connection between signal recovery and hyperplane tessellations

The problem of finding a ‘good’ tessellation of an arbitrary subset of is of obvious interest—independent of recovery problems—and we address it in Section 3. Still, it is rather surprising that tessellations have such strong connections with signal reconstruction. We shall now explain those connections and show how they naturally lead to the two recovery programs we explore in this article.

Let us first assume that the signal is an -sparse vector; that no bit corruptions occur in the quantization process; and that there is no pre-quantization noise. Also, for the time being, consider zero thresholds. Thus, one observes

$$ q_{\mathrm{corr}} = q = \operatorname{sign}(Ax). $$

Assume further that . Clearly, ‘encodes’ the location of on the sphere: each measurement vector determines a hyperplane , and the corresponding quantized measurement indicates on which side of the hyperplane is located (see Figure 1). Therefore, the measurements ‘split’ into (at most) ‘cells’, and the sequence of bits encodes the location of within this tessellation (see Figure 2).

Thus, one may try to recover the signal via the program

$$ \min_{z \in \mathbb{R}^n}\ \|z\|_0 \quad \text{s.t.} \quad \operatorname{sign}(Ax) = \operatorname{sign}(Az);\ \|z\|_2 = 1. \tag{2.1} $$

Any solution of the program is a vector that is located in the same ‘cell’ as ; it is at least as sparse as (since is feasible for the program); and it is located on the Euclidean unit sphere. Hence, if the goal is to show that with high probability , it suffices that for any -sparse lying in the same cell as . Of course, there is no information on the identity of the cell in which is located, and therefore one has to ensure that any two -sparse points in the same cell are ‘close’.

In other words, the success of the recovery program (2.1) forces the measurement vectors to endow with a -uniform tessellation. Phrased differently, if are -sparse vectors on whose distance is at least , that fact must be exhibited by the hyperplanes : at least one of the hyperplanes must separate and .

One should note that there is nothing special about and a similar statement would be true for any set : a tessellation of consisting of cells whose diameter in is at most allows one to uniformly recover signals from using only as data. Moreover, the reverse direction is clearly true: the degree of accuracy in uniform recovery results in is determined by the largest diameter (in ) of a cell of the tessellation endowed by the hyperplanes .

As it happens, this type of separation is a very ‘Gaussian’ property. The most striking example is when is a Bernoulli matrix: it is straightforward to verify that and for , which are -sparse, cannot be separated based on their one-bit measurements, regardless of how many measurements are taken. In fact, even using all possible quantized measurements produced by vectors does not help (see Figure 3).

Another problem that exposes some of the difficulties in one-bit compressed sensing with zero thresholds—even in the Gaussian case—is the recovery of signals that are not located on the unit sphere. Clearly, two signals lying on a straight line produce the same quantized measurements, even if they are far apart (see Figure 3).

Both these issues can be addressed using dithering in the quantization process: intentionally adding ‘noise’ to the measurements before quantizing. As a result of dithering, every measurement vector determines the hyperplane which is a (random) parallel shift of the original hyperplane . Following dithering, indicates the side of the shifted hyperplane on which is located. The ability to shift hyperplanes gives an additional degree of freedom, and the hope is that the random tessellation endowed by will indeed have uniformly small cells.
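A tiny deterministic example illustrates the effect described above: with zero thresholds, two points on the same ray through the origin always produce identical bits, while shifted hyperplanes can tell them apart. The three measurement vectors, the thresholds and the convention `sign(0) = +1` are hypothetical values chosen by hand for illustration.

```python
def sign(t):
    # Convention (an assumption): sign(0) = +1.
    return 1 if t >= 0 else -1


def one_bit(X, z, tau):
    # Bits sign(<X_i, z> + tau_i): the side of each shifted hyperplane on which z lies.
    return [sign(sum(a * b for a, b in zip(row, z)) + t)
            for row, t in zip(X, tau)]


X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.2, 0.0]
y = [0.8, 0.0]  # y = 4x: same direction, different norm

# Zero thresholds: the bits depend only on the direction of the signal.
bits_x_plain = one_bit(X, x, [0.0, 0.0, 0.0])
bits_y_plain = one_bit(X, y, [0.0, 0.0, 0.0])

# Dithered thresholds (hand-picked here, random in the text).
tau = [-0.5, 0.3, 0.5]
bits_x_dith = one_bit(X, x, tau)
bits_y_dith = one_bit(X, y, tau)
separating = sum(a != b for a, b in zip(bits_x_dith, bits_y_dith))
```

Without dithering both points yield the bits `[1, 1, -1]`; with the shifts, two of the three hyperplanes separate them.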

With that optimistic viewpoint in mind, one can attempt recovery of the sparse signal using the program

$$ \min_{z \in \mathbb{R}^n}\ \|z\|_0 \quad \text{s.t.} \quad \operatorname{sign}(Ax + \tau_{\mathrm{thres}}) = \operatorname{sign}(Az + \tau_{\mathrm{thres}}),\ \|z\|_2 \le 1; \tag{2.2} $$

however, even if induces a good tessellation of , there is still the question of pre- and post-quantization noise one has to contend with. To understand the effect of post-quantization noise (i.e., bit corruptions that occur during quantization), assume that one observes a corrupted sequence of bits , where the -th bit being corrupted means that instead of receiving from the quantizer, one observes ; thus, one is led to believe that is on the ‘wrong side’ of the -th hyperplane . As a consequence, in the best case scenario (2.2) will search for a vector in the wrong cell of the tessellation, and in the worst case the corrupted bit may cause a conflict and there will be no sparse vector satisfying (see Figure 4 for an illustration). The conclusion is clear: at the end of the day, all the above-mentioned programs can very easily fail in the presence of post-quantization noise.

The effect of pre-quantization noise (i.e., noise in the analog measurement process) is equally problematic: noise simply causes a parallel shift of the hyperplane , and one has no control over the size of this ‘noise-induced’ shift. Again, the recovery programs (2.1) and (2.2) can easily fail if pre-quantization noise is present (see Figure 5).

One possible way of overcoming this ‘infeasibility problem’ due to noise is to design a program that is stable: its output does not change by much even if some of the given bits are misleading. For example, one may try to search for a vector whose uncorrupted quantized measurements are closest to the observed corrupted vector . However, since one does not have access to , one can only try to match its proxy to , as is done by (1.2). In the context of sparse recovery, the latter program is

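$$ \min\ d_H\big(q_{\mathrm{corr}}, \operatorname{sign}(Az + \tau_{\mathrm{thres}})\big) \quad \text{s.t.} \quad \|z\|_0 \le s,\ \|z\|_2 \le 1. \tag{2.3} $$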

To ensure that (1.2) yields an accurate reconstruction, the uniform tessellation has to be finer than in the corruption-free case: even if some signs are ‘flipped’, the distance between points in the resulting cell and points in the true one should still be small. The generalized version of (1.5) ensures this: for any that are at least -separated there are many hyperplanes that separate the two points—of the order of . Thus, even after corrupting bits one may still detect that and are ‘far away’ from one another.

Although (1.2) can guarantee robust signal recovery, there are no guarantees that it can be solved efficiently. In addition, since (1.2) matches , rather than , to , it is still quite sensitive to pre-quantization noise. Both problems can be mended by convexification. Observe that

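$$ d_H\big(q_{\mathrm{corr}}, \operatorname{sign}(Az + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}})\big) = \frac{1}{2}\sum_{i=1}^{m}\Big(1 - (q_{\mathrm{corr}})_i\operatorname{sign}(\langle X_i, z\rangle + \nu_i + \tau_i)\Big). $$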

One may relax this objective function by replacing by and relax the constraint to leading to the convex program

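$$ \min\ \frac{1}{2}\sum_{i=1}^{m}\Big(1 - (q_{\mathrm{corr}})_i\big(\langle X_i, z\rangle + \nu_i + \tau_i\big)\Big) \quad \text{s.t.} \quad z \in \operatorname{conv}(T). $$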

An equivalent formulation of this program, which only requires the known data and , is

$$ \max\ \frac{1}{m}\langle q_{\mathrm{corr}}, Az\rangle \quad \text{s.t.} \quad z \in \operatorname{conv}(T). \tag{2.4} $$

This program was proposed in [12]. As was mentioned in the introduction, we explore the regularized version (1.3) of (2.4). In the context of sparse recovery, this corresponds to the tractable program

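$$ \max\ \frac{1}{m}\langle q_{\mathrm{corr}}, Az\rangle - \frac{1}{2\lambda}\|z\|_2^2 \quad \text{s.t.} \quad \|z\|_1 \le \sqrt{s},\ \|z\|_2 \le 1. $$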
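One possible way to solve this last program numerically is sketched below. Completing the square shows that the objective equals $-\frac{1}{2\lambda}\|z - v\|_2^2$ plus a constant, where $v = \frac{\lambda}{m}A^{T}q_{\mathrm{corr}}$, so the maximizer is the Euclidean projection of $v$ onto the constraint set. The projection onto the intersection of the two balls is approximated here with Dykstra's alternating projection scheme, which is a choice made for this sketch and not a method taken from the text; the function names, the iteration count and the test data (Gaussian rows, dither uniform on $[-\lambda, \lambda]$) are likewise assumptions.

```python
import math
import random


def project_l2_ball(v, radius=1.0):
    norm = math.sqrt(sum(t * t for t in v))
    if norm <= radius:
        return list(v)
    return [t * radius / norm for t in v]


def project_l1_ball(v, radius):
    # Sort-based Euclidean projection onto the l1 ball (soft thresholding).
    if sum(abs(t) for t in v) <= radius:
        return list(v)
    u = sorted((abs(t) for t in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        cand = (cumsum - radius) / k
        if uk > cand:  # holds for a prefix of indices; keep the last one
            theta = cand
    return [math.copysign(max(abs(t) - theta, 0.0), t) for t in v]


def solve_regularized(A, q_corr, lam, s, iters=1000):
    """Approximate maximizer over {|z|_1 <= sqrt(s), |z|_2 <= 1} via Dykstra."""
    m, n = len(A), len(A[0])
    # v = (lam / m) * A^T q_corr is the unconstrained maximizer.
    v = [lam / m * sum(A[i][j] * q_corr[i] for i in range(m)) for j in range(n)]
    z, p, r = list(v), [0.0] * n, [0.0] * n
    for _ in range(iters):
        y = project_l2_ball([a + b for a, b in zip(z, p)])
        p = [a + b - c for a, b, c in zip(z, p, y)]
        z = project_l1_ball([a + b for a, b in zip(y, r)], math.sqrt(s))
        r = [a + b - c for a, b, c in zip(y, r, z)]
    return z


def objective(A, q_corr, lam, z):
    corr = sum(qi * sum(a * zi for a, zi in zip(row, z))
               for row, qi in zip(A, q_corr))
    return corr / len(A) - sum(t * t for t in z) / (2.0 * lam)


rng = random.Random(1)
n, m, s, lam = 20, 500, 2, 2.0
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x = [0.6, 0.8] + [0.0] * (n - 2)
q = [1 if sum(a * b for a, b in zip(row, x)) + rng.uniform(-lam, lam) >= 0 else -1
     for row in A]
z_hat = solve_regularized(A, q, lam, s)
```

Since the true signal `x` is feasible for this program, a reasonable sanity check is that the objective at `z_hat` is at least as large as at `x`, up to numerical tolerance.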

## 3. Random tessellations

The main result of this section is an ‘isomorphic’ version of (1.5), which says that distances between any two points in that are ‘far-enough’ are reflected by the number of hyperplanes that separate the points. Our main interest is in the lower estimate, which is the essential component in the proofs of Theorem 1.1 and Theorem 1.2.

As a starting point, consider a random vector that is isotropic, symmetric and satisfies an - norm equivalence, i.e., for every ,

$$ \|t\|_2 = \|\langle X, t\rangle\|_{L_2} \le L\,\|\langle X, t\rangle\|_{L_1}. \tag{3.1} $$

###### Theorem 3.1.

There exist constants that depend only on for which the following holds. Let and set . Let that satisfy and assume that

$$ \log N(T, r) \le c_2\,\frac{m\rho}{\lambda} \quad \text{and} \quad E(T_r) \le c_2\,\rho^2\sqrt{m}. $$

Then with probability at least , for every that satisfy ,

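$$ \big|\{i : \operatorname{sign}(\langle X_i, x\rangle + \tau_i) \ne \operatorname{sign}(\langle X_i, y\rangle + \tau_i)\}\big| \ge c_4\,m\,\frac{\|x - y\|_2}{\lambda}. $$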

When is -subgaussian one may establish a sharper, two-sided estimate that holds with an improved probability estimate. As it happens, the upper bound requires that satisfies a mild structural assumption:

###### Definition 3.2.

Let be a metric space. is -metrically convex in if for every there are such that

$$ \gamma r \le d(z_i, z_{i+1}) \le r \quad \text{and} \quad \sum_{i=0}^{\ell} d(z_i, z_{i+1}) \le \gamma^{-1} d(x, y), $$

where we set , . If we say that is -metrically convex.

The idea behind this notion is straightforward: it implies that controlling ‘local oscillations’ of a function ensures that it satisfies a Lipschitz condition for long distances. Indeed, assume that . For any that satisfy let be as in Definition 3.2. Then

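$$ |f(x) - f(y)| \le \Big|\sum_{i=0}^{\ell}\big(f(z_i) - f(z_{i+1})\big)\Big| \le \kappa(\ell + 1) \le \frac{\kappa}{\gamma r}\sum_{i=0}^{\ell} d(z_i, z_{i+1}) \le \frac{\kappa}{\gamma^2 r}\,d(x, y). \tag{3.2} $$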

Therefore, satisfies a Lipschitz condition for long distances with constant .

Observe that if is a convex subset of a normed space then it is -metrically convex for any ; also, every subset of a normed space is -metrically convex in its convex hull. Finally, is -metrically convex in for an absolute constant . We omit the straightforward proofs of these claims.

###### Theorem 3.3.

There exist constants that depend only on for which the following holds. Let , set and consider an isotropic, symmetric, -subgaussian random vector . Let and satisfy that

$$ \rho \ge c_1 r\sqrt{\log(e\lambda/r)}, $$

and

$$ m \ge c_2\max\Big\{\frac{\lambda}{r}\log N(T, r),\ \frac{\lambda\,\ell_*^2(T_r)}{\rho^3}\Big\}. $$

Then with probability at least , for every such that , one has

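$$ \big|\{i : \operatorname{sign}(\langle X_i, x\rangle + \tau_i) \ne \operatorname{sign}(\langle X_i, y\rangle + \tau_i)\}\big| \ge c_4\,m\,\frac{\|x - y\|_2}{\lambda}. $$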

Moreover, if is -metrically convex then on the same event, if ,

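$$ \big|\{i : \operatorname{sign}(\langle X_i, x\rangle + \tau_i) \ne \operatorname{sign}(\langle X_i, y\rangle + \tau_i)\}\big| \le c_5\,\frac{\sqrt{\log(e\lambda/\rho)}}{\gamma^2}\cdot m\,\frac{\|x - y\|_2}{\lambda}. $$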

Proof of Theorem 1.4. Theorem 1.4 is an immediate outcome of Theorem 3.3 for . Indeed, is metrically convex for any , , and by Sudakov’s inequality, .

In the context of tessellations, Theorem 3.1 and the first part of Theorem 3.3 improve the estimate from (1.5) in several ways: firstly, Theorem 3.1 holds for a very general collection of random vectors - the vector has to satisfy a small-ball condition rather than being Gaussian. Secondly, both are valid for any subset of and not just for subsets of the sphere; and, finally, if happens to be -subgaussian, it yields the best known estimate on the diameter of each ‘cell’ in the random tessellation.

###### Remark 3.4.

Let us mention that because upper estimates are of lesser interest as far as the applications we have in mind are concerned, we formulated an upper bound only in the subgaussian scenario. For an upper estimate in the heavy-tailed scenario see Theorem 3.9 below.

### 3.1. The heavy-tailed scenario

A fundamental question that is at the heart of our arguments has to do with stability: given two points and , how ‘stable’ is the set

$$ \{i : \operatorname{sign}(\langle X_i, x\rangle + \tau_i) \ne \operatorname{sign}(\langle X_i, y\rangle + \tau_i)\} = (*) $$

to perturbations? If one believes that the cardinality of reflects distances , it stands to reason that if is significantly smaller than and , , then should not be very different from .

Unfortunately, stability is not true in general. If either or are ‘too close’ to many of the separating hyperplanes, then even a small shift in either one of them can have a dramatic effect on the signs of and destroy the separation. Thus, to ensure stability one requires a stronger property than mere separation: points need to be separated by a large margin.

###### Definition 3.5.

The hyperplane -well-separates and if

• ,

• , and

• .

Denote by the set of indices for which -well-separates and .

The condition that is precisely what ensures that perturbations of or of the order of do not spoil the fact that the hyperplane separates the two points.

We begin by showing that even in the heavy-tailed scenario and with high probability, for any two (fixed) points and . Let us mention that the high probability estimate is crucial: it will lead to a uniform control on a net of a large cardinality.

###### Theorem 3.6.

There are constants that depend only on for which the following holds. Let and set . With probability at least

$$ 1 - 4\exp\Big(-c_2\,m\min\Big\{\frac{\|x - y\|_2}{\lambda},\ 1\Big\}\Big), $$

$$ |I_{x,y}(c_3)| \ge c_4\,m\,\frac{\|x - y\|_2}{\lambda}. $$

The proof of Theorem 3.6 requires two preliminary results. Consider a random variable that satisfies the small ball estimate

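$$ \sup_{u \in \mathbb{R}} \mathbb{P}(|\tau - u| \le \varepsilon) \le C_\tau\,\varepsilon \quad \text{for all } \varepsilon \ge 0, \tag{3.3} $$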

and let be independent of . Then clearly

$$ \mathbb{P}(|Z + \tau| \le \varepsilon) \le C_\tau\,\varepsilon, \quad \text{for all } \varepsilon \ge 0. \tag{3.4} $$

If then (3.3) holds for . Therefore, by the Chernoff bound, if and are independent copies of and respectively, then with probability at least ,

$$ \big|\{i : |Z_i + \tau_i| \ge \varepsilon\}\big| \ge \Big(1 - \frac{2\varepsilon}{\lambda}\Big)m. \tag{3.5} $$
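The small-ball estimate (3.3) can be checked by direct computation in the dithering setting. The sketch below assumes that $\tau$ is uniform on $[-\lambda, \lambda]$, in which case the probability of landing in an interval of half-width $\varepsilon$ is the length of its overlap with $[-\lambda, \lambda]$ divided by $2\lambda$, and is therefore at most $\varepsilon/\lambda$. Both the uniform assumption and the function name `uniform_small_ball` are introduced here for illustration; the bound is consistent with the factor $2\varepsilon/\lambda$ appearing in (3.5).

```python
def uniform_small_ball(u, eps, lam):
    """P(|tau - u| <= eps) for tau uniform on [-lam, lam] (assumed distribution)."""
    lo = max(u - eps, -lam)
    hi = min(u + eps, lam)
    # Overlap length of [u - eps, u + eps] with [-lam, lam], over total length.
    return max(hi - lo, 0.0) / (2.0 * lam)


lam = 2.0
centers = [-3.0 + 0.25 * k for k in range(25)]   # u ranging over [-3, 3]
widths = [0.0, 0.1, 0.5, 1.0, 2.0, 5.0]
worst_gap = max(uniform_small_ball(u, eps, lam) - eps / lam
                for u in centers for eps in widths)
```

The largest gap over the grid is never positive, and the bound is attained exactly for intervals that lie entirely inside $[-\lambda, \lambda]$.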

The second observation is somewhat more involved. Consider a random variable that satisfies

$$ \mathbb{P}(\alpha < \tau \le \beta) \ge c_\tau(\beta - \alpha) \tag{3.6} $$

for all . Let and be square integrable whose difference satisfies a small-ball condition: there are constants and such that

$$ \mathbb{P}\big(|Z - W| \ge \kappa\,\|Z - W\|_{L_1}\big) \ge \delta. $$

###### Lemma 3.7.

There are absolute constants and and constants such that the following holds. Assume that and are independent of and that

$$ \lambda \ge \frac{c_0}{\sqrt{\delta}}\max\{\|Z\|_{L_2}, \|W\|_{L_2}\}. $$

If , and are independent copies of , and respectively, then with probability at least

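$$ 1 - 2\exp(-c_1 m\delta) - 2\exp\big(-c_2\,m\,\|Z - W\|_{L_1}\big), $$

$$ \big|\{i : \operatorname{sign}(Z_i + \tau_i) \ne \operatorname{sign}(W_i + \tau_i)\}\big| \ge c_3\,m\,\|Z - W\|_{L_1}. $$

###### Proof.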

Set to be named later and observe that . Hence, with probability at least ,

$$ \big|\{i : |Z_i| \ge \|Z\|_{L_2}/\sqrt{\theta}\}\big| \le 2\theta m, $$

where is an absolute constant; a similar estimate holds for .

At the same time, recall that , implying that with probability at least ,

$$ \big|\{i : |Z_i - W_i| \ge \kappa\,\|Z - W\|_{L_1}\}\big| \ge \frac{\delta m}{2}. $$

Set and let . The above shows that there is an event of -probability at least