
# Linear signal recovery from b-bit-quantized linear measurements: precise analysis of the trade-off between bit depth and number of measurements

Martin Slawski
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
martin.slawski@rutgers.edu
Ping Li
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu
###### Abstract

We consider the problem of recovering a high-dimensional structured signal x* from m independent Gaussian linear measurements each of which is quantized to b bits. Our interest is in linear approaches to signal recovery, where "linear" means that the non-linearity resulting from quantization is ignored and the observations are treated as if they arose from a linear measurement model. Specifically, the focus is on a generalization of a method for one-bit observations due to Plan and Vershynin [IEEE Trans. Inform. Theory, 59 (2013), 482–494]. At the heart of the present paper is a precise characterization of the optimal trade-off between the number of measurements m and the bit depth b per measurement, given a total budget of B = m · b bits, when the goal is to minimize the ℓ2-error in estimating the signal. It turns out that the choice b = 1 is optimal for estimating the unit vector (direction) corresponding to the signal for any level of additive Gaussian noise before quantization as well as for a specific model of adversarial noise, while the choice b = 2 is optimal for estimating the direction and the norm (scale) of the signal. Moreover, Lloyd-Max quantization is shown to be an optimal quantization scheme w.r.t. ℓ2-estimation error. Our analysis is corroborated by numerical experiments showing nearly perfect agreement with our theoretical predictions. The paper is complemented by an empirical comparison to alternative methods of signal recovery taking the non-linearity resulting from quantization into account. The results of that comparison point to a regime change depending on the noise level: in a low-noise setting, linear signal recovery falls short of more sophisticated competitors, while being competitive in moderate- and high-noise settings.

## 1 Introduction

One of the celebrated results in compressed sensing (CS) states that it is possible to recover a high-dimensional signal x* ∈ ℝ^n from a small number of Gaussian linear measurements if 1) x* exhibits "low-dimensional structure" and 2) signal recovery is tailored to the underlying low-dimensional structure. Moreover, 2) can typically be accomplished in a computationally tractable manner, e.g., by solving a linear program. There is an enormous amount of literature on the subject spanning different areas, in particular mathematics, computer science, and engineering; we refer to [14] for an overview.

The concept of signal recovery from incomplete data, in the sense of having fewer measurements available than would ordinarily be required, has subsequently been developed further by considering settings in which the linear measurement process is subject to quantization, with the extreme case of single-bit quantization (e.g., [7, 17, 18, 22, 28, 32, 33, 44]). In general, one can think of b-bit quantization, b ≥ 1. Assuming that one is free in choosing b and a corresponding scalar quantizer given a fixed budget of B bits yields a trade-off between the number of measurements m and the bit depth b per measurement. An optimal balance of these two quantities minimizes a criterion of interest like the ℓ2-error in recovering the signal. Such a trade-off arises naturally in the presence of communication constraints. For example, signal acquisition and signal recovery may have to be carried out at different locations, and transmitting the acquired data is subject to a limited rate. The optimal trade-off depends on the signal, the noise mechanism and the noise level, the suitability of the parameters of the scalar quantizer relative to the signal, and the specific approach used for signal recovery. The dependency on the latter may be sidestepped by considering information-theoretic lower and upper bounds. The corresponding analysis is valuable as it would yield fundamental limits (see the survey [8] for such limits in specific settings), but it does not necessarily have immediate practical consequences unless there exist computationally tractable recovery algorithms achieving those limits. In this paper, we follow a different route by focusing on "linear" signal recovery as proposed in [33] with follow-up work in [34, 35]. Here, "linear" means that the non-linearity resulting from quantization is ignored, and that the observations are treated as if they arose from a linear measurement model. Linear signal recovery may appear overly simple.
In fact, it is known to be suboptimal in a noiseless setting (i.e., the only source of distortion is quantization), with an ℓ2-error decaying of order 1/√m, compared to a decay of order 1/m achievable by other methods [8]. In spite of that, there is quite some justification for having a closer look at the linear approach. It turns out that ignoring the non-linearity does not have a dramatic effect in a regime where the norm of the signal and the level of additive noise are comparable, in which case the ℓ2-error of linear signal recovery can only be improved in terms of a multiplicative constant. Empirically, as is demonstrated herein, the improvements achieved by more sophisticated methods tend to be rather small. Moreover, linear signal recovery typically comes with minimum requirements in terms of computation and storage. Apart from that, linear signal recovery constitutes a natural baseline. It is thus helpful to understand the aforementioned trade-off between the number of measurements and bit depth in this simple case.

#### Outline and summary of contributions.

In §2, we provide an overview of linear signal recovery as pioneered in [33, 34, 35] and adopt the specific formulation in [33] for estimating the "direction" x*_u := x*/∥x*∥_2. The corresponding analysis of the trade-off between m and b when estimating x*_u is laid out in §3. The analysis builds on ideas in [33, 35] to a considerable extent, and it is complemented by modifications/extensions to deal with the specific measurement model of b-bit quantization. Out of that analysis, we deduce explicit, easy-to-compute expressions for the relative performance of b-bit vs. b′-bit measurements (b ≠ b′) under three different noise models (additive Gaussian noise before quantization, adversarial bin flips, and random bin flips). We are then in position to decide on the optimal choice of b given a fixed budget of bits. It turns out that the choice b = 1 is optimal in the noiseless case, under additive Gaussian noise, as well as under adversarial bin flips. In §4, we discuss an issue that has largely been neglected in the literature, namely the estimation of the "scale" ψ* := ∥x*∥_2. We show that as long as b ≥ 2, this can be done at a fast rate by maximum likelihood estimation, separately from estimating x*_u. Combining the results for estimating the direction and those for estimating the scale, we conclude that for the specific recovery algorithm under consideration, it does not pay off to take b > 2. Along the way, we prove that classic Lloyd-Max quantization [29, 31] constitutes an optimal b-bit quantization scheme in the sense that it leads to a minimization of an upper bound on the ℓ2-estimation error. Our theoretical results are corroborated by numerical experiments in §5. Regarding the relative performance of one-bit vs. two-bit measurements, the experimental results sharply agree with our theoretical predictions. One set of experiments sheds some light on the performance of the linear signal recovery under study relative to alternative approaches.
While the performance of the former is noticeably inferior to more sophisticated methods in a low-noise setting, it becomes competitive as the noise level increases. Altogether, the findings of this paper point to the conclusion that noisy settings are the domain of a two-fold simple approach, consisting of a basic recovery algorithm and measurements of a low bit depth (i.e., one- or two-bit measurements). More conclusions can be found in §6. Proofs and complementary derivations have been moved to the appendix.

The present paper considerably extends a previous conference publication of the authors [37]. In particular, the analysis therein is limited to sparse signals, whereas the present paper covers general low-complexity signals as quantified by the Gaussian width. In addition, we provide an account of the asymptotic sharpness of our analysis, discuss an extension to anisotropic measurements, draw additional connections to existing literature, and present a more comprehensive set of numerical results.

#### Related work.

There is a plethora of papers discussing various aspects of compressed sensing with quantization. We refer to [8] for an excellent overview of the problem, including basic performance limits, different approaches to quantization and signal recovery, and the associated references. On the other hand, comparatively little seems to be known about the trade-off between the number of measurements and the bit depth that is the focus of the present paper. An important reference in this regard is [25], where this very trade-off is studied for sparse signals. The analysis in [25] concerns "oracle-assisted" least squares (i.e., least squares with knowledge of the set of non-zero entries of the signal), which is of theoretical interest, but not a practical approach to signal recovery. The authors point out the role of the signal-to-noise ratio (SNR) for the optimal trade-off, which leads to the distinction of two basic regimes: the so-called measurement compression regime with high SNR, small m and large b, as opposed to the quantization compression regime with low SNR, large m and small b. Some of the results in our paper can be related to this finding.

We here focus on linear signal recovery. An alternative that is more principled, as it uses knowledge about the quantizer, is "consistent reconstruction". This approach has been studied in a series of recent papers by Jacques and collaborators [21, 1, 20]. At the moment, this line of research only addresses the case without noise. The trade-off between m and b appears in [21], where a specific version of Iterative Hard Thresholding [5] tailored to quantized measurements is considered. It is shown via experimental results (which are partially reproduced in §5) that, unlike in the main result of the present paper, increasing b beyond one or two yields improvements, which underlines that the trade-off can be rather different depending on the recovery algorithm used.

After submitting the conference paper [37], we became aware of the work [38] in which Lloyd-Max quantization is found to be optimal for linear signal recovery as in the present paper. The derivation in [38] is not fully rigorous, though, as it is only shown that Lloyd-Max quantization yields a stationary point, whereas herein global optimality is established.

#### Notation.

For the convenience of the reader, we here gather notation used throughout the paper. For a positive integer d, we use the shortcut [d] := {1, …, d}. I(P) denotes the indicator function of an expression P, with I(P) = 1 if P is true and I(P) = 0 otherwise. For a matrix A, a_i denotes its i-th row and A_j its j-th column. We use ∥A∥_2 for the spectral norm of A. The identity matrix of dimension d is denoted by I_d. For a vector v, we write v_J for the sub-vector corresponding to an index set J. For q ≥ 1, ∥v∥_q denotes the ℓ_q-norm of v. A class of signals is denoted by K, and we let C := K ∩ B_2^n, where for q ≥ 1, B_q^n denotes the unit ℓ_q-ball in ℝ^n. We further write x*_u := x*/ψ*, where ψ* := ∥x*∥_2. The unit sphere of ℝ^n is denoted by S^{n−1}. For a set S, we write |S| for its cardinality, conv(S) for its convex hull, and for λ ≥ 0, we let λS := {λs : s ∈ S}. The letter g refers to a standard Gaussian random variable or a canonical Gaussian random vector, i.e., g ∼ N(0, 1) or g ∼ N(0, I_n), depending on the context. The probability density function (pdf) and the cumulative distribution function (cdf) of the standard Gaussian distribution are denoted by φ and Φ, respectively. In addition to the usual Landau notation, we occasionally make use of the stochastic order symbol O_P: a sequence of random variables (X_m) satisfies X_m = O_P(a_m) if for all ε > 0 there is a finite C = C(ε) such that P(|X_m| > C a_m) ≤ ε for all m.

## 2 Linear signal recovery based on quantized linear measurements

In this section, we first fix the problem setup and then introduce the approach that will be studied in depth in subsequent sections.

Measurement model. Let x* ∈ ℝ^n be the signal to be recovered. We think of K ⊆ ℝ^n as a set describing a class of signals having a certain low-dimensional structure, e.g., K = B_0(s; n) := {x ∈ ℝ^n : ∥x∥_0 ≤ s}, the set of s-sparse signals. More examples are given in §3.3. The set K is assumed to be known. Let A be an m × n random matrix with i.i.d. N(0, 1) entries, whose rows and columns are denoted by {a_i}_{i=1}^m and {A_j}_{j=1}^n, respectively. The observations y = (y_i)_{i=1}^m arise from the model

y_i = Q(⟨a_i, x*⟩ + σ ε_i), i = 1, …, m, (1)

where ε = (ε_i)_{i=1}^m has i.i.d. N(0, 1) entries and σ ≥ 0 is referred to as the "noise level"; when σ = 0, we speak of a noiseless setting. The map Q is called the quantization map or quantizer; it is piecewise constant, monotonically increasing, and odd. It partitions the real axis into 2^b bins, where b is the bit depth per measurement. Because of symmetry, it suffices to define a partitioning of the positive real axis into K = 2^{b−1} bins resulting from distinct thresholds t = (t_1, …, t_{K−1}) (in increasing order) and t_0 = 0, t_K = +∞, such that R_k = [t_{k−1}, t_k), k ∈ [K]. Each bin R_k is assigned a distinct representative from the codebook μ = (μ_1, …, μ_K) with 0 < μ_1 < … < μ_K. Accordingly, Q is defined as

z ↦ Q(z) = sign(z) ∑_{k=1}^{K} μ_k I(|z| ∈ R_k). (2)

For convenience, Figure 1 visualizes this definition. Since noise (if any) is added before Q is applied, we speak of "additive noise before quantization". Other noise mechanisms acting after quantization are possible, too; see §3.5 for specific examples.
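As a concrete illustration, the map (2) can be implemented in a few lines. The sketch below is our own (the function name `make_quantizer` and its argument conventions are not from the paper); it represents the K bins of the positive axis by their edges and looks up the bin of |z|:

```python
import numpy as np

def make_quantizer(t, mu):
    """b-bit quantizer Q of Eq. (2): thresholds t = (t_1, ..., t_{K-1})
    partition the positive axis into K bins R_k = [t_{k-1}, t_k) with
    t_0 = 0, t_K = +inf; mu = (mu_1, ..., mu_K) are the representatives."""
    edges = np.concatenate(([0.0], np.asarray(t, float), [np.inf]))
    mu = np.asarray(mu, float)

    def Q(z):
        z = np.asarray(z, float)
        # index k such that |z| lies in [t_{k-1}, t_k)
        k = np.clip(np.searchsorted(edges, np.abs(z), side="right") - 1,
                    0, len(mu) - 1)
        return np.sign(z) * mu[k]

    return Q

# Example: a 2-bit quantizer (K = 2 bins on the positive axis).
Q = make_quantizer(t=[1.0], mu=[0.5, 1.5])
```

Note that Q(0) = 0 under the sign convention, so Q is odd and piecewise constant as required.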

Linear Signal Recovery. Linear signal recovery is carried out by means of an estimator based on a linear map Wy of the observations, where the matrix W may depend on A and C but not directly on y. It is perhaps surprising that even when restricting oneself to this class, it is possible to construct consistent estimators of the "direction" x*_u = x*/∥x*∥_2. This was first established by Brillinger [9] in the traditional setting of asymptotic statistics with a fixed parameter set and m tending to infinity. It turns out that Gaussianity of the {a_i} plays a crucial role here. Brillinger's result has been generalized recently to modern high-dimensional settings in various ways [33, 34, 35, 38]. We follow this line of research in the present paper. Linear signal recovery is limited to estimating x*_u, as estimating the "scale" ψ* = ∥x*∥_2 entails using knowledge of the quantizer. Indeed, expanding model (1), we obtain

y_i = sign(⟨a_i, x*⟩ + σ ε_i) ∑_{k=1}^{K} μ_k I(|⟨a_i, x*⟩ + σ ε_i| ∈ R_k) = sign(⟨a_i, x*_u⟩ + τ ε_i) ∑_{k=1}^{K} μ_k I(|⟨a_i, x*_u⟩ + τ ε_i| ∈ R_k/ψ*), i ∈ [m], (3)

where τ := σ/ψ* and R_k/ψ* := [t_{k−1}/ψ*, t_k/ψ*), k ∈ [K]. We conclude that model (1) can always be transformed into one with ψ* = 1 by re-scaling the thresholds accordingly. Hence, in order to be able to estimate ψ*, the thresholds have to be taken into account, while linear signal recovery discards this information. Moreover, a second consequence is that when studying linear signal recovery with regard to the estimation of x*_u in the sequel, we may assume w.l.o.g. that ψ* = 1. Linear estimation of the direction can trivially be combined with separate non-linear estimation of the scale, an approach whose discussion is postponed to §4.

Marginal Regression. The choice W = A⊤/m gives rise to what we will refer to as the "canonical linear estimator":

x̂ ∈ argmin_{x ∈ C} −⟨η, x⟩, where η := A⊤y/m, and C := K ∩ B_2^n. (4)

This estimator is proposed in [33] for b = 1 and convex C. In the conference paper preliminary to the present work [37], general b and general C are considered. The heading "Marginal Regression" is rooted in the fact that η is essentially proportional to the vector of univariate (or marginal) regression coefficients corresponding to n separate linear regressions of y onto the columns A_j, j ∈ [n], which are given by

β̂_j = A_j⊤y/∥A_j∥_2^2 = A_j⊤y/(m + O_P(√m)),

which implies that β̂_j = η_j (1 + O_P(1/√m)), j ∈ [n].
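The near-equivalence of η and the marginal regression coefficients is easy to check numerically. The following simulation is our own sketch (one-bit noiseless measurements; the only external fact used is E[|g|] = √(2/π) for g ∼ N(0, 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20000, 5
x_star = np.zeros(n)
x_star[0] = 1.0                          # a unit-norm signal x*_u

A = rng.standard_normal((m, n))
y = np.sign(A @ x_star)                  # b = 1, sigma = 0

eta = A.T @ y / m                        # statistic entering Eq. (4)
beta = (A.T @ y) / (A ** 2).sum(axis=0)  # marginal regression coefficients

# eta concentrates around E[|g|] * x*_u = sqrt(2/pi) * x*_u,
# and beta_j agrees with eta_j up to O_P(1/sqrt(m)).
```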

A formulation rather similar to (4) is studied in [35]:

x̂_λ ∈ λ^{−1} Π_{λK}(η) = λ^{−1} argmin_{x ∈ λK} ½∥η − x∥_2^2 = λ^{−1} argmin_{x ∈ λK} ½∥x∥_2^2 − ⟨η, x⟩, (5)

where λ is defined by the relation (7) below and Π_{λK} denotes the Euclidean projection onto λK. As derived in Appendix H, if K is a cone (i.e., αx ∈ K for all α ≥ 0 and x ∈ K), then x̂_λ is identical to x̂ in (4) up to a constant of proportionality. This applies to all except one of the examples considered for K herein. In general, (4) and (5) are different, though. A notable disadvantage of (5) compared to (4) is that it requires knowledge of λ, which depends on the (typically unknown) noise level σ. For this reason, we concentrate on (4) in the following.

Least Squares. The choice W = (A⊤A)^{−1}A⊤ would yield the least squares estimator. Since the inverse does not exist once m < n, one needs to take advantage of low-dimensional structure. This yields the constrained least squares estimator

x̂_LS ∈ λ^{−1} argmin_{x ∈ λK} ½ x⊤(A⊤A/m)x − ⟨η, x⟩. (6)

In [34] and [38], this approach is studied under the names "Generalized Lasso" respectively "K-Lasso", in allusion to the popular choice of the set K as an ℓ1-ball [39]. Observe that (5) differs from (6) only in that the matrix A⊤A/m is replaced by its expectation, the identity matrix. While this may appear minor, an important consequence is that (6) can achieve exact recovery of x*_u as σ → 0 and b → ∞, unlike (5). Outside the high signal-to-noise regime, however, both formulations indeed perform similarly, as can be concluded from the analysis of (5) in [35] and the analysis of (6) in [34]. The latter yields a bound that is essentially of the same form, resulting from a lower bound of the form ∥Av∥_2^2/m ≥ c > 0 for all v in the so-called tangent cone of C at x*_u. There are two more aspects that are relevant to a comparison.

• From the viewpoint of computation, the optimization problems (4) and (5) tend to be easier than the one in (6), as the objectives in (4) and (5) are linear respectively coordinate-wise separable.

• More generally, one can consider the case of i.i.d. anisotropic measurements, i.e., a_i ∼ N(0, Σ), i ∈ [m]. If Σ is known, then (4) and (5) remain applicable with η replaced by Σ^{−1}η, cf. §3.4 below. The approach (6) has its merits in the situation that Σ is not known. In general, estimating Σ resp. its inverse is statistically more difficult and computationally more demanding than estimating x*_u; hence the use of (4) or (5) combined with plug-in estimation of Σ^{−1} is not a suitable option.

## 3 Analysis

The following section is dedicated to the analysis of the ℓ2-error of the canonical linear estimator (4) under the b-bit quantization model as defined by (1) and (2). Our main result is an asymptotic bound for ∥x̂ − x*_u∥_2 that allows for a precise (asymptotically sharp) characterization of the dependence on the bit depth b, the thresholds t, and the representatives μ parameterizing the quantization map. Given this result, we are in position to address the trade-off between m and b. Along the way, we show that Lloyd-Max quantization [29, 31] constitutes an optimal quantization scheme in the sense that it leads to a minimization of the error bound w.r.t. t and μ. Finally, an extension to two other natural noise models is discussed.

At a technical level, the main ingredients of our analysis appear in related literature, in particular in [33]. Certain adjustments are necessary, though, to deal effectively with the specific measurement model herein, and to end up with a result suitable for the purpose of studying the trade-off between m and b.

### 3.1 Preparations

Our main result depends on three quantities and one condition, which are given below. Throughout this section, we assume w.l.o.g. that ψ* = ∥x*∥_2 = 1, so that x* = x*_u, as this can always be achieved by re-scaling the thresholds t, cf. (2).

(Q1) The first quantity has initially been introduced in [33]:

λ = λ_b = λ_{b,σ} = λ_{b,σ}(t, μ) := E[θ(g)g], g ∼ N(0, 1), (7)

where the map θ is defined by the relation

E[y_1 | a_1] = θ(⟨a_1, x*_u⟩). (8)

At a high level, λ quantifies the distortion from linearity caused by quantization. It can be shown (Appendix A) that E[η] = λ x*_u; hence λ also equals the constant of proportionality up to which x*_u can be recovered by linear estimation. The quantity λ is positive, increases with b, and approaches one as b → ∞. A precise expression for λ is the content of Lemma 1 below.

(Q2) The second quantity is given by

Ψ = Ψ_b = Ψ_{b,σ} = Ψ_{b,σ}(t, μ) := √(E[y_1^2]), (9)

which is simply the (marginal) standard deviation of the y_i. The error bound of Theorem 1 below is proportional to the ratio Ψ/λ. We refer to Lemma 1 for a more specific expression for Ψ.

(Q3) The third quantity is a measure of complexity of the class of signals under consideration, the so-called Gaussian width of the tangent cone of C at x*_u, a notion which can be considered standard in the context of high-dimensional linear inverse problems [10, 13, 33, 34, 35]. We set

Δ(C; x*_u) := {α(x − x*_u) : α ≥ 0, x ∈ C}, Δ̄(C; x*_u) := Δ(C; x*_u) ∩ S^{n−1}. (10)

We suppress the dependence on x*_u, i.e., we use Δ(C) and Δ̄(C) for the tangent cone of C at x*_u and its spherical part, respectively. The latter enters Theorem 1 below via its Gaussian width. For compact K ⊂ ℝ^n and g ∼ N(0, I_n), the Gaussian width of K is defined by

w(K) := E[sup_{v ∈ K} |⟨g, v⟩|]. (11)

(C) Finally, we require the following condition. At a technical level, rather than being fixed, we think of x*_u as a sequence whose elements are contained in unit spheres of growing dimension n, satisfying ∥x*_u∥_∞ → 0 as n → ∞. This condition is easily met for typical signal classes of interest under natural sampling models for x*_u, as is elaborated in the discussion after Theorem 1 below.

### 3.2 Main result

We now state our main result along with a brief general discussion. Further implications are subsequently discussed in separate subsections.

###### Theorem 1.

Consider the measurement model given by (1), (2), and consider the canonical linear estimator given by (4). Under condition (C), as m → ∞, we have

∥x̂ − x*_u∥_2 ≤ 24 (Ψ_{b,σ}(t, μ)/λ_{b,σ}(t, μ)) · w(Δ̄(C))/√m (12)

with probability at least 1 − c_1 exp(−c_2 w^2(Δ̄(C))) as m → ∞, for constants c_1, c_2 > 0.

It turns out that the bound (12) does not leave much room for further improvement in general. The term w(Δ̄(C))/√m is identified as the rate of estimation. It follows from existing literature [10, 13] that in the presence of additive noise, one has a corresponding lower bound in terms of the so-called Sudakov minoration of Δ̄(C), which typically yields matching upper and lower bounds modulo constant factors. For the examples of K resp. C given in §3.3 below, the upper bounds on w(Δ̄(C)) yield a match to known minimax lower bounds, e.g., [11, 12, 30].

We point out, however, that the rate in (12) can be suboptimal in the noiseless case. For b = 1, it is shown in [22] that there exists a (computationally intractable) recovery algorithm that achieves an error decay of the order 1/m (modulo logarithmic factors), compared to 1/√m in (12).

In the sequel, a lot of attention will be paid to the leading constant Ψ_{b,σ}(t, μ)/λ_{b,σ}(t, μ). The dependency on this quantity is asymptotically sharp, as it is already encountered in the traditional asymptotic setup in which m → ∞ with n being of a smaller order of magnitude than m. In fact, one can show (cf. Appendix D) that under a double asymptotic framework in which m, n → ∞ with n/m → 0 and (C) holds, for any fixed j ∈ [n],

√m (x̂_j − x*_{u,j}) →_D N(0, Ψ_{b,σ}^2(t, μ)/λ_{b,σ}^2(t, μ)), (13)

where →_D denotes convergence in distribution. In other words, the asymptotic estimation error for any single coordinate is proportional to the leading constant of our bound.

The numerical constant "24" in the bound (12) does not appear to be optimal. As it comes to the key point of the paper, namely the ratio of estimation errors for different choices of the bit depth b, this is not an issue, as all terms not depending on b cancel out.

### 3.3 Classes of signals

Theorem 1 can be specialized to popular signal classes by bounding the associated Gaussian widths. We here provide several examples including a short discussion of computational aspects. We also discuss condition (C) in light of those examples. Derivations have been relegated to Appendix J.

1) Sparsity.
For C = B_0(s; n) ∩ B_2^n, following the argument in the proof of Lemma 2.3 in [33], one can show that

w(Δ̄(C)) ≤ w(B_0(2s; n) ∩ S^{n−1}) ≤ √(2s) + √(4s log(en/(2s))) + √(π/2) ≤ 3.5 √(2s log(en/(2s))),

i.e., we recover the usual √(s log(en/s)/m) rate in Theorem 1.
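The width bound above can be sanity-checked by Monte Carlo: over B_0(k; n) ∩ S^{n−1}, the supremum in (11) is attained by normalizing the k largest-magnitude entries of g, so the width is the expected ℓ2-norm of those entries. The code below is our own sketch (function name and sample sizes are arbitrary):

```python
import numpy as np

def width_sparse(n, k, draws=2000, seed=0):
    """Monte Carlo estimate of w(B_0(k; n) ∩ S^{n-1}): for each draw of g,
    sup |<g, v>| over unit-norm k-sparse v equals the l2-norm of the
    k largest magnitudes of g."""
    rng = np.random.default_rng(seed)
    G = np.abs(rng.standard_normal((draws, n)))
    top = np.sort(G, axis=1)[:, -k:]          # k largest magnitudes per draw
    return np.linalg.norm(top, axis=1).mean()

n, s = 1000, 5
est = width_sparse(n, 2 * s)
bound = (np.sqrt(2 * s) + np.sqrt(4 * s * np.log(np.e * n / (2 * s)))
         + np.sqrt(np.pi / 2))
```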

2) Fused Sparsity. Let D denote the first-order difference operator and set C = {x ∈ B_2^n : ∥Dx∥_0 ≤ s}; this is the set of all signals in B_2^n that are piecewise constant with at most s breakpoints. One can show that w(Δ̄(C)) satisfies the same upper bound as in 1).

3) Group Sparsity. Let G_1, …, G_L be a partition of [n] into L groups. Define ∥x∥_{2,0} := |{l ∈ [L] : ∥x_{G_l}∥_2 ≠ 0}| and, for s ∈ [L], let C = {x ∈ B_2^n : ∥x∥_{2,0} ≤ s}, in which case we say that x is s-group sparse (w.r.t. the given partition). Then, w(Δ̄(C)) admits a bound of the order √(s(d + log L)), where d denotes the maximum group size.

4) Low-rank matrices. Our framework can accommodate a set of matrices by identifying ℝ^n with ℝ^{d_1 × d_2}, n = d_1 · d_2. Consider K = {X ∈ ℝ^{d_1 × d_2} : rank(X) ≤ r} and accordingly C = K ∩ B_F, with B_F denoting the unit ball of the Frobenius norm ∥·∥_F. Then, w(Δ̄(C)) admits a bound of the order √(r(d_1 + d_2)).

5) ℓ1-ball constraint. C = R · B_1^n ∩ B_2^n, R > 0. In addition, suppose that x*_u is s-sparse with R = ∥x*_u∥_1 ≤ √s. Then the tangent cone at x*_u is contained in that of the sparse case, and we may resort to the bound in 1).

In the same way as 5) arises as the convex counterpart to 1), one can consider a total variation constraint in place of 2), an ℓ1/ℓ2-ball ("group lasso" [43]) constraint in place of 3), and a Schatten-1-norm ball constraint in place of 4). Under a sparsity assumption for x*_u, the Gaussian widths of the convex formulations equal – up to numerical constants – those of the corresponding non-convex formulations; for the sake of brevity, and since this is known in the literature, we omit explicit statements/derivations here.

Computation. The constraint sets in examples 1) to 4) are non-convex. Nevertheless, due to the simplicity of the objective in (4) and the specific structure of the constraint sets, all of the resulting optimization problems are computationally tractable: 1) and 3) even have closed-form solutions, 4) can be reduced to a singular value decomposition, and 2) can be solved by dynamic programming [3]. A formal derivation for 1), representative of 1) – 3), is contained in Appendix I. For 2), the convex counterpart has a much better computational complexity, which scales only linearly in n [24], while achieving comparable statistical performance.
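As an illustration of the closed-form solution for 1), maximizing ⟨η, x⟩ over unit-norm s-sparse vectors keeps the s largest-magnitude entries of η and normalizes them. The sketch below is our own (the function name is not from the paper):

```python
import numpy as np

def sparse_linear_recovery(eta, s):
    """Solve Eq. (4) for C = B_0(s; n) ∩ B_2^n: the maximizer of <eta, x>
    over unit-norm s-sparse x retains the s largest-magnitude entries
    of eta and normalizes the result to unit l2-norm."""
    x = np.zeros_like(eta)
    idx = np.argsort(np.abs(eta))[-s:]   # support of the solution
    x[idx] = eta[idx]
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else x

eta = np.array([0.05, -0.8, 0.1, 0.6, -0.02])
x_hat = sparse_linear_recovery(eta, s=2)
```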

Discussion of (C). For examples 1) – 4) above, condition (C) can be shown to be satisfied with probability tending to one when sampling x*_u at random according to natural generating mechanisms.

1) For a given sparsity level s, pick the support of x* at random and sample the non-zero entries i.i.d. from a distribution with finite fourth moment; then normalize to unit ℓ2-norm. It follows from Markov's inequality that (C) is satisfied with probability tending to one as s → ∞.

2) Pick a random partition B_1, …, B_{s+1} of [n] such that min_j n_j → ∞ as n → ∞, where n_j denotes the number of elements in the j-th element of the partition, j ∈ [s+1]. For each of those, sample the corresponding entries at random from a distribution with finite first moment, scale them by the square root of the respective block size, and then normalize to unit ℓ2-norm.

3) Suppose that as n → ∞, the sizes of the non-zero blocks are of the same order. Sampling the non-zero entries as in example 1), condition (C) is fulfilled with probability tending to one as the sizes of the non-zero blocks or the number of non-zero blocks go to infinity.

4) Draw a random d_1 × d_2 matrix with i.i.d. N(0, 1) entries and compute its SVD. Keep the top r left and right singular vectors, and replace the corresponding singular values by an arbitrary set of positive numbers (followed by normalization to unit Frobenius norm).

Clearly, there may exist different or more general sampling schemes for (C) to be satisfied.

### 3.4 Extension to anisotropic measurements

We extend Theorem 1 to the case of anisotropic Gaussian measurements. More precisely, we now suppose that the a_i are i.i.d. from a N(0, Σ) distribution, where Σ is invertible and assumed to be known. In the anisotropic case, the linear estimator (4) is replaced by

x̂_Σ ∈ argmin_{x ∈ C} −⟨x, Σ^{−1}η⟩, (14)

with η as in (4). We then have the following counterpart to Theorem 1.

###### Theorem 2.

Consider the anisotropic measurement model as above, let κ(Σ^{1/2}) denote the condition number of Σ^{1/2}, and let x̂_Σ be as in (14). Under condition (C), as m → ∞, it holds that

∥x̂_Σ − x*_u∥_2 ≤ 24 κ(Σ^{1/2}) (Ψ_{b,σ}(t, μ)/λ_{b,σ}(t, μ)) · w(Δ̄(C))/√m, (15)

with probability at least 1 − c_1 exp(−c_2 w^2(Δ̄(C))) as m → ∞, for constants c_1, c_2 > 0.

The bound for the anisotropic case thus only involves the additional factor κ(Σ^{1/2}). Setting Σ = I_n, we recover Theorem 1. Regarding the trade-off between m and b to be studied in the next sections, the extra factor does not have any influence, as it does not depend on b.

### 3.5 Implications for the optimal trade-off between m and b

We now study in detail the implications of Theorem 1 for the central question of this paper. Suppose we have a fixed budget B of bits available and are free to choose the number of measurements m and the number of bits per measurement b subject to B = m · b, such that the ℓ2-error of x̂ is as small as possible. What is the optimal choice of b? At this point, we still confine ourselves to the direction x*_u. In §4 below, we provide an answer for x* in place of x*_u by linking the findings of the present section to the results on scale estimation.

In virtue of Theorem 1 and the comments that follow, the asymptotic ℓ2-error as m → ∞ depends on b only via Ω_{b,σ}(t, μ) := Ψ_{b,σ}(t, μ)/λ_{b,σ}(t, μ). In a first step, we provide more specific expressions for λ_{b,σ} and Ψ_{b,σ} that will prove useful in the subsequent analysis. Below, ⊙ denotes the entry-wise (Hadamard) multiplication of vectors.

###### Lemma 1.

Let t, μ and K be as in (2). We have λ_{b,σ} = ⟨μ, α(t) ⊙ E(t)⟩/(1 + σ²) and Ψ_{b,σ} = ⟨μ ⊙ μ, α(t)⟩^{1/2}, where

α(t) = (α_1(t), …, α_K(t))⊤, α_k(t) = P{|g̃| ∈ R_k(t)}, g̃ ∼ N(0, 1 + σ²), k ∈ [K], E(t) = (E_1(t), …, E_K(t))⊤, E_k(t) = E[g̃ | g̃ ∈ R_k(t)], g̃ ∼ N(0, 1 + σ²), k ∈ [K].
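These quantities reduce to one-dimensional Gaussian integrals and are straightforward to evaluate. The sketch below is our own; it assumes the closed-form expressions λ_{b,σ} = ⟨μ, α(t) ⊙ E(t)⟩/(1 + σ²) and Ψ_{b,σ} = ⟨μ ⊙ μ, α(t)⟩^{1/2}, and uses the truncated-Gaussian mean formula:

```python
import math

def phi(x):   # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def lam_psi(t, mu, sigma):
    """lambda_{b,sigma} and Psi_{b,sigma} for thresholds t, representatives
    mu, and noise level sigma, with g~ ~ N(0, 1 + sigma^2)."""
    s = math.sqrt(1 + sigma ** 2)
    edges = [0.0] + list(t) + [float("inf")]
    lam, psi2 = 0.0, 0.0
    for k in range(len(mu)):
        a, b = edges[k] / s, edges[k + 1] / s
        alpha = 2 * (Phi(b) - Phi(a))            # alpha_k = P(|g~| in R_k)
        E = 2 * s * (phi(a) - phi(b)) / alpha    # E_k = E[|g~| | |g~| in R_k]
        lam += mu[k] * alpha * E
        psi2 += mu[k] ** 2 * alpha
    return lam / (1 + sigma ** 2), math.sqrt(psi2)

# One-bit, noiseless, mu_1 = sqrt(2/pi): lambda = 2/pi, Psi = sqrt(2/pi).
lam, psi = lam_psi(t=[], mu=[math.sqrt(2 / math.pi)], sigma=0.0)
```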

Optimal choice of t and μ. In order to eliminate the dependence on t and μ, we minimize Ω_{b,σ} = Ψ_{b,σ}/λ_{b,σ} w.r.t. these two quantities. In this manner, we also obtain an optimal parameterization of the quantization map yielding minimum ℓ2-estimation error. It turns out that the solution coincides with that of the classical Lloyd-Max quantization problem [29, 31] stated below. Let h be a random variable with finite variance and consider the optimization problem

min_{t,μ} E[{h − Q(h; t, μ)}^2] = min_{t,μ} E[{h − sign(h) ∑_{k=1}^{K} μ_k I(|h| ∈ R_k(t))}^2]. (16)

Problem (16) can be solved by an iterative scheme known as the Lloyd-Max algorithm (§3.2.3 in [16]) that alternates between optimization of t for fixed μ and vice versa. For h from a log-concave distribution (e.g., Gaussian), that scheme can be shown to deliver the global optimum [23].
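A minimal sketch of that scheme for h ∼ N(0, 1) (our own implementation; initialization and iteration count are ad hoc) alternates the two optimality conditions, exploiting symmetry so that only the K bins of the positive axis are tracked:

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def lloyd_max(K, iters=200):
    """Lloyd-Max quantizer for h ~ N(0, 1): mu_k is set to the conditional
    mean of its bin, t_k to the midpoint of adjacent representatives."""
    t = [2.0 * k / K for k in range(1, K)]       # crude initial thresholds
    mu = [0.0] * K
    for _ in range(iters):
        edges = [0.0] + t + [float("inf")]
        for k in range(K):
            a, b = edges[k], edges[k + 1]
            mu[k] = (phi(a) - phi(b)) / (Phi(b) - Phi(a))  # truncated mean
        t = [(mu[k] + mu[k + 1]) / 2 for k in range(K - 1)]
    return t, mu

t2, mu2 = lloyd_max(K=2)   # two-bit quantizer for N(0, 1)
```

The fixed point reproduces the classical values for the Gaussian (K = 2: t ≈ 0.9816, μ ≈ (0.4528, 1.5104); K = 1: μ_1 = E[|h|] = √(2/π)).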

###### Theorem 3.

Consider the minimization problem min_{t,μ} Ω_{b,σ}(t, μ) with Ω_{b,σ} = Ψ_{b,σ}/λ_{b,σ}. Its minimizer (t*, μ*) equals that of the Lloyd-Max problem (16) for h ∼ N(0, σ² + 1). Moreover,

Ω_{b,σ}(t*, μ*) = √((σ² + 1)/λ_{b,0}(t_0*, μ_0*)), (17)

where (t_0*, μ_0*) denotes the optimal choice of (t, μ) in the sense of (16) when σ = 0 (i.e., h ∼ N(0, 1)).

Regarding the choice of h, the result of Theorem 3 may not come as a surprise, as the quantities ⟨a_i, x*_u⟩ + σε_i to be quantized are i.i.d. N(0, σ² + 1). It is less immediate, though, that this specific choice can also be motivated as the one leading to the minimization of the error bound (12).

The second part of Theorem 3 implies that the ratio Ω_{b,σ}(t*, μ*)/Ω_{b′,σ}(t*, μ*) for two bit depths b and b′ does not depend on σ. Combining Theorem 1, Lemma 1 and Theorem 3, we are eventually in position to determine the optimal trade-off between m and b. Theorem 1 yields that the ℓ2-error decays as Ω_{b,σ}/√m. Therefore, for b-bit measurements to improve over b′-bit measurements, b > b′, at the level of a fixed budget of bits, it is required that Ω_{b′,σ}/Ω_{b,σ} > √(b/b′). In fact, when using b′ < b bits per measurement, we may multiply the number of measurements by a factor of b/b′ so that the bit budgets are balanced, i.e., b · m_b = b′ · m_{b′}, where m_b and m_{b′} denote the number of b-bit and b′-bit measurements, respectively. Using the Lloyd-Max algorithm to determine λ_{b,0}(t_0*, μ_0*) and invoking (17) as well as Lemma 1, it is straightforward to evaluate the ratios Ω_{b,σ}/Ω_{b′,σ} numerically. In Table 1, we provide the results for selected pairs of b and b′.

From these figures, we see that increasing b reduces the error when the number of measurements is held fixed. However, the reduction is not substantial enough to yield an improvement when thinking in terms of a budget of bits instead of a budget of measurements. Reducing b from two to one increases the error by a factor of about 1.18, which is below √2 ≈ 1.41, the factor required for b = 1 to be inferior compared to b = 2. The reduction factor for increasing b becomes even smaller for the transitions from two to three and three to four bits, and quickly approaches one. The overall conclusion is that for estimating x*_u, the optimal trade-off is achieved by one-bit quantization: instead of increasing the bit depth b, one should rather increase the number of measurements m. This conclusion is valid regardless of the noise level σ, as a consequence of Theorem 3. Even more, the figures in Table 1 assume optimal quantization for σ > 0, and in turn knowledge of σ, which may not be fulfilled in practice.
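The b = 1 vs. b = 2 comparison can be reproduced with a few lines. The sketch below is ours; it plugs in the classical Gaussian Lloyd-Max parameters for b = 2 (t ≈ 0.9816, μ ≈ (0.4528, 1.5104)) and uses the self-consistency property of the Lloyd-Max solution that λ_{b,0} = E[Q(g)g] = E[Q(g)²]:

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

lam1 = 2 / math.pi                         # lambda_{1,0}: mu_1 = sqrt(2/pi)
t, mu = 0.9816, (0.4528, 1.5104)           # classical 2-bit Gaussian values
a1 = 2 * (Phi(t) - 0.5)                    # P(|g| < t)
a2 = 2 * (1 - Phi(t))                      # P(|g| >= t)
lam2 = mu[0] ** 2 * a1 + mu[1] ** 2 * a2   # lambda_{2,0} = E[Q(g)^2]

ratio = math.sqrt(lam2 / lam1)             # error inflation of b = 1 vs b = 2
# ratio is about 1.18 < sqrt(2), so b = 1 wins once bit budgets are balanced.
```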

Beyond additive noise. Additive Gaussian noise is perhaps the most studied form of perturbation, but one can of course think of numerous other mechanisms whose effect can be analyzed along the path used for additive noise, as long as it is feasible to obtain the corresponding expressions for λ and Ψ. We here do so for the following mechanisms acting after quantization (2).

(I) Random bin flip. For $i = 1, \dots, m$: with probability $1 - p$, $y_i$ remains unchanged. With probability $p$, $y_i$ is changed to an element from $\{\pm \mu_1, \dots, \pm \mu_K\} \setminus \{y_i\}$ uniformly at random.

(II) Adversarial bin flip. For $i = 1, \dots, m$: write $y_i = s\, \mu_k$ for $s \in \{-1, 1\}$ and $k \in \{1, \dots, K\}$. With probability $1 - p$, $y_i$ remains unchanged. With probability $p$, $y_i$ is changed to $-s\, \mu_K$.

Note that for $b = 1$, (I) and (II) coincide as both amount to a sign flip with probability $p$. Depending on the magnitude of $p$, the corresponding value of $\lambda$ may even be negative, which is unlike the case of additive noise. Recall that the error bound (12) requires $\lambda > 0$. Borrowing terminology from robust statistics, we consider $p^* = \min\{p : \lambda \leq 0\}$ as the breakdown point, i.e., the (expected) proportion of contaminated observations that can still be tolerated so that (12) continues to hold. Mechanism (II) produces a natural counterpart to gross corruptions when the linear measurements are not subject to quantization. It is not hard to see that among all maps applied randomly to the observations with a fixed probability $p$, (II) maximizes the ratio $\Omega/\lambda^2$, hence the attribute "adversarial". In Figure 2 we display $\lambda$ as a function of $p$ for both (I) and (II) and selected values of $b$. Table 2 provides the corresponding breakdown points. For simplicity, the quantization levels and thresholds are not optimized but set to the optimal (in the sense of Lloyd-Max) choice in the noiseless case. The underlying derivations can be found in Appendix K.
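The derivations of Appendix K are not reproduced here, but the breakdown point for mechanism (II) can be sketched in the noiseless case: a flipped observation contributes $-\mu_K\, \mathbb{E}|g|$ to $\lambda$, so for a Lloyd-Max quantizer (where $\lambda(0) = \Omega$) one gets $\lambda(p) = (1-p)\,\Omega - p\,\mu_K \sqrt{2/\pi}$. The snippet below is our own sketch under these assumptions; the 2-bit levels are the standard Lloyd-Max values for the Gaussian.

```python
import math

E_ABS_G = math.sqrt(2.0 / math.pi)  # E|g| for g ~ N(0,1)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def breakdown_adversarial(levels, thresholds):
    """p* solving (1-p)*Omega - p*mu_K*E|g| = 0 (noiseless Lloyd-Max sketch)."""
    t = [0.0] + list(thresholds) + [math.inf]
    omega = sum(m * m * 2.0 * (Phi(t[k + 1]) - Phi(t[k]))
                for k, m in enumerate(levels))
    return omega / (omega + levels[-1] * E_ABS_G)

p_star_1bit = breakdown_adversarial([E_ABS_G], [])             # exactly 1/2
p_star_2bit = breakdown_adversarial([0.4528, 1.5104], [0.9816])
# p_star_2bit < p_star_1bit: one-bit measurements tolerate more adversarial flips
```

For $b = 1$ this reduces to the sign-flip breakdown point $p^* = 1/2$.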

Figure 2 and Table 2 provide one more argument in favour of one-bit measurements, as they offer better robustness vis-à-vis adversarial corruptions. In fact, once the fraction of such corruptions is large enough, $b = 1$ performs best even on the measurement scale. For the milder corruption scheme (I), $b = 2$ turns out to be the best choice for significant but moderate $p$.

## 4 Scale estimation

In Section 2, we have decomposed $x^*$ into a product of a unit vector $x_u^*$ and a scale parameter $\psi > 0$. We have pointed out that $x_u^*$ can be estimated from a linear map of the observations separately from $\psi$, since the latter can be absorbed into the definition of the bins $\{R_k\}$. Accordingly, we may estimate $x^*$ as $\hat{x} = \hat{\psi}\, \hat{x}_u$, with $\hat{x}_u$ as in (4) estimating $x_u^*$ and $\hat{\psi}$ estimating $\psi$. We here consider the maximum likelihood estimator (MLE) for $\psi$.

Noiseless case. To begin with, let us consider the case $\sigma = 0$, so that the $\langle a_i, x^* \rangle$, $i = 1, \dots, m$, are i.i.d. $N(0, \psi^2)$. The likelihood function is then given by

$$L(\psi) = \prod_{i=1}^{m} \sum_{k=1}^{K} I(y_i \in R_k)\, P(|y_i| \in R_k) = \prod_{k=1}^{K} \bigl\{ 2\bigl( \Phi(t_k/\psi) - \Phi(t_{k-1}/\psi) \bigr) \bigr\}^{m_k}, \qquad (18)$$

where $m_k = |\{i : y_i \in R_k\}|$, $k = 1, \dots, K$, and $\Phi$ denotes the standard Gaussian cdf. Note that for $b = 1$, $L(\psi)$ is constant (i.e., does not depend on $\psi$), which confirms that for $b = 1$ it is impossible to recover $\psi$. For $b = 2$ (i.e., $K = 2$), the MLE has a simple closed-form expression given by $\hat{\psi} = t_1 \big/ \Phi^{-1}\bigl( (1 + m_1/m)/2 \bigr)$. The following tail bound establishes fast convergence of $\hat{\psi}$ to $\psi$.
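The closed form $\hat{\psi} = t_1 / \Phi^{-1}((1 + m_1/m)/2)$, which follows from maximizing (18) for $K = 2$, is easy to check on simulated noiseless data. The snippet below is our own illustration; the bisection-based `Phi_inv` is a stand-in for a library quantile function.

```python
import math
import random

def Phi_inv(p):
    # bisection inverse of the standard Gaussian cdf (demo stand-in for a
    # proper quantile routine)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def scale_mle_2bit(m1, m, t1):
    """Closed-form MLE for psi from 2-bit data.

    m1: number of observations whose magnitude falls below the threshold t1."""
    return t1 / Phi_inv((1.0 + m1 / m) / 2.0)

random.seed(0)
psi_true, t1, m = 2.0, 2.0, 200_000
z = [psi_true * random.gauss(0.0, 1.0) for _ in range(m)]  # noiseless <a_i, x*>
m1 = sum(1 for v in z if abs(v) < t1)                      # count in inner bin
psi_hat = scale_mle_2bit(m1, m, t1)                        # close to psi_true
```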

###### Proposition 1.

Let $u = t_1/\psi$ and $v(u) = u\,\phi(u)$, where $\phi$ denotes the standard Gaussian pdf and $\phi'$ its derivative (so that $v'(u) = \phi(u) + u\,\phi'(u) = (1 - u^2)\,\phi(u)$). With probability at least $1 - 2\exp(-c\, m\, \varepsilon^2\, v(u)^2)$ for an absolute constant $c > 0$, we have $|\hat{\psi}/\psi - 1| \leq \varepsilon$.

The exponent is maximized for $u = 1$, i.e., for $t_1 = \psi$, and becomes smaller as $u$ moves away from one. While scale estimation from two-bit measurements is possible, convergence can be slow if the ratio $t_1/\psi$ is too small or too large.
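The location of this maximum is a simple calculus fact, independent of the constants in the tail bound: $u\,\phi(u)$ has derivative $(1 - u^2)\,\phi(u)$ and therefore peaks at $u = 1$. A minimal numerical check (our own):

```python
import math

phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# grid search over the threshold-to-scale ratio u = t1/psi
grid = [k / 1000.0 for k in range(1, 5001)]
u_best = max(grid, key=lambda u: u * phi(u))  # -> 1.0 up to grid resolution
```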

One may profit from taking $b > 2$, as this introduces multiple thresholds $t_1, \dots, t_{K-1}$, so that scale estimation is less dependent on the choice of each individual threshold. Moreover, if the thresholds are well chosen, convergence becomes faster as $b$ increases; see [27] for a detailed analysis regarding this aspect. On the other hand, for $b > 2$, the MLE is no longer available in closed form.

Additive noise. We now turn to the case $\sigma > 0$. Since the $\langle a_i, x^* \rangle + \sigma \varepsilon_i$ are now i.i.d. $N(0, \psi^2 + \sigma^2)$, the MLE based on (18) systematically over-estimates $\psi$. Therefore, a different approach is needed. Suppose that $x_u^*$ is known and denote by $[l_i, u_i]$ the interval the $i$-th observation is contained in before quantization, $i = 1, \dots, m$. Then the joint likelihood for $(\psi, \tilde{\sigma})$ is given by

$$L(\psi, \tilde{\sigma}) = \prod_{i=1}^{m} \left\{ \Phi\!\left( \frac{u_i - \psi \langle a_i, x_u^* \rangle}{\tilde{\sigma}} \right) - \Phi\!\left( \frac{l_i - \psi \langle a_i, x_u^* \rangle}{\tilde{\sigma}} \right) \right\}. \qquad (19)$$

Existence and uniqueness of the MLE. It is easy to see that $L(\psi, \tilde{\sigma}) \to 0$ as $\psi \to \infty$ or $\tilde{\sigma} \to \infty$, and thus $-\log L(\psi, \tilde{\sigma}) \to \infty$. Hence, the negative log-likelihood is coercive, so that the MLE always exists. However, the MLE is not necessarily unique. In fact, if there exists $\psi$ so that $\psi \langle a_i, x_u^* \rangle \in [l_i, u_i]$ for all $i = 1, \dots, m$ (i.e., the bin assignment of the "linear predictions" perfectly matches that of the observations), the likelihood (19) attains the maximum possible value of one by letting $\tilde{\sigma}$ tend to zero. There are potentially multiple values of $\psi$ satisfying the above condition; any of these is an MLE. However, this is a rather unlikely scenario as long as there is a noticeable noise level.

Computation of the MLE. A straightforward strategy to deal with the resulting two-dimensional optimization problem is coordinate descent, i.e., $\psi$ is optimized for fixed $\tilde{\sigma}$ and vice versa. Both of these sub-problems are smooth univariate optimization problems which can be solved using methods based on golden section search as implemented in common software packages. Alternating between the two univariate problems until the objective can no longer be decreased yields a root of the likelihood equation $\nabla \log L(\psi, \tilde{\sigma}) = 0$. It is known that the negative logarithm of $L$ is convex under the reparametrization $(\psi, \tilde{\sigma}) \mapsto (\psi/\tilde{\sigma}, 1/\tilde{\sigma})$ [45], but it is not clear to us whether it is jointly convex in $(\psi, \tilde{\sigma})$. Hence, the above likelihood equation may have multiple roots in general. Empirically, we have not encountered any issues with spurious solutions when using the MLE from the noiseless case as the starting point. Moreover, since the optimization problem is only two-dimensional, it is feasible to perform a grid search to locate a smaller region within which the MLE resides, which reduces the chance of getting trapped in an undesired stationary point different from the MLE.
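A minimal self-contained sketch of this coordinate descent is given below. It is our own illustration, not the authors' code: the bin edges, sample size, and initialization are arbitrary choices for the demo, and the golden-section routine is a bare-bones stand-in for a library optimizer.

```python
import math
import random

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def neg_log_lik(psi, sigma, lo, hi, v):
    # negative log of (19); v[i] = <a_i, x_u*>, (lo[i], hi[i]) = bin of obs. i
    s = max(sigma, 1e-12)
    return -sum(math.log(max(Phi((u - psi * vi) / s) - Phi((l - psi * vi) / s),
                             1e-300))
                for l, u, vi in zip(lo, hi, v))

def golden(f, a, b, iters=40):
    # golden-section search for a univariate minimum on [a, b]
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

def coordinate_descent(lo, hi, v, psi0=1.0, sigma0=1.0, sweeps=10):
    psi, sigma = psi0, sigma0
    for _ in range(sweeps):
        psi = golden(lambda p: neg_log_lik(p, sigma, lo, hi, v), 1e-4, 10.0)
        sigma = golden(lambda s: neg_log_lik(psi, s, lo, hi, v), 1e-4, 10.0)
    return psi, sigma

# demo: psi = 1.5, sigma = 0.5, signed bins with edges (-inf, -1, 0, 1, inf)
random.seed(1)
edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]
v, lo, hi = [], [], []
for _ in range(400):
    vi = random.gauss(0.0, 1.0)
    z = 1.5 * vi + 0.5 * random.gauss(0.0, 1.0)  # pre-quantization observation
    k = next(j for j in range(4) if edges[j] <= z < edges[j + 1])
    v.append(vi); lo.append(edges[k]); hi.append(edges[k + 1])
psi_hat, sigma_hat = coordinate_descent(lo, hi, v)
```

Note that each $\psi$-subproblem is convex for fixed $\tilde{\sigma}$, so golden-section search is safe there; the joint problem is handled exactly as described above, by alternating.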

So far, we have assumed that $x_u^*$ is known. We may follow the plug-in principle and replace $x_u^*$ by its estimator $\hat{x}_u$. The resulting estimator for $\psi$ is no longer an MLE, but it can be a reasonable proxy depending on the distance between $\hat{x}_u$ and $x_u^*$.