# Sigma Delta quantization with Harmonic frames and partial Fourier ensembles

###### Abstract

Sigma Delta ($\Sigma\Delta$) quantization, a quantization method which first surfaced in the 1960s, is now widely used in digital products such as cameras, cell phones, and radars. The method samples an input signal at a rate higher than the Nyquist rate and thereby achieves great robustness to quantization noise.

Compressed Sensing (CS) is a frugal acquisition method that utilizes the possible sparsity of the signals to reduce the number of samples required for a lossless acquisition. One can regard the reduced number as an effective dimensionality of the set of sparse signals and, accordingly, define an effective oversampling rate as the ratio between the actual sampling rate and the effective dimensionality. A natural conjecture is that the error of Sigma Delta quantization, previously shown to decay with the vanilla oversampling rate, should decay with the effective oversampling rate when carried out in the regime of compressed sensing. Confirming this intuition is one of the main goals in this direction.

The study of quantization in CS has so far been limited to proving error convergence results for Gaussian and sub-Gaussian sensing matrices, as the number of bits and/or the number of samples grows to infinity. In this paper, we provide a first result for the more realistic Fourier sensing matrices. The main idea is to randomly permute the Fourier samples before feeding them into the quantizer. We show that the random permutation can effectively increase the low frequency power of the measurements and thus enhance the quality of quantization.

## 1 Introduction

Over the last decade, Sigma Delta quantization has been extensively studied in quantizing band-limited functions [10, 11, 14, 16] and redundant samples of discrete signals [3, 4, 5, 20, 17, 22, 27]. It has also been investigated under the setup of Compressed Sensing [15, 29, 21, 12] for (sub)-Gaussian matrices. The popularity of Sigma Delta primarily comes from its strong robustness to quantization errors, which it achieves by oversampling the objective signal and exploiting the correlations among the samples. If the sample correlations are not considered, we arrive at the traditional Memoryless Scalar Quantization (MSQ). MSQ quantizes each sample independently by directly recording its binary representation. This independent encoding makes the scheme unstable under measurement noise as well as quantization errors. As a result, MSQ is not extensively used in high-accuracy analog-to-digital converters (ADCs), although it offers the optimal bit efficiency (i.e., bit rate distortion) when no redundancy exists.

Besides its stability, Sigma Delta quantization has also been used to overcome hardware limitations. One notable example is its use in certain digital camera designs to enhance the quality of images captured in low light. Despite the ever-increasing resolution, there were technical difficulties in increasing the dynamic range of the digital camera. The dynamic range is the ratio between the maximum and minimum measurable light intensities. Compared to the film camera, the dynamic range of digital cameras was much smaller due to the limitations of their sensing devices, the complementary metal-oxide-semiconductor (CMOS) sensor or the charge-coupled device (CCD). A direct manifestation is that, when there is not enough light, the digital camera tends to produce many noisy spots in the photo. Technically speaking, in this case the voltage (or current) converted from light intensity via the photodiode is too low compared to quantization and other types of noise. The architecture of Sigma Delta quantization contains a feedback loop that performs the oversampling. This feedback loop can be used in this case to accumulate the voltage over several samples and hence magnify the signal and increase the signal-to-noise ratio. We refer interested readers to [24] for an explanation of the exact design of this scheme as well as a demonstration of its efficacy in increasing the dynamic range.

Compressed Sensing is an effective acquisition method for sampling sparsely structured signals. It is followed by a signal inversion procedure that involves (approximately) solving an $\ell_0$ or $\ell_1$ minimization problem. Due to the high complexity of the available solvers, this step must be carried out on a computer. Therefore an analysis of how it works with various quantization methods is inevitable. When using Sigma Delta quantization, we naturally ask the question: can the method utilize oversampling to compress quantization errors as it does for band-limited functions and, if so, what is the relation between the error compression rate and the oversampling rate? This question has been answered affirmatively in [15, 29, 21, 12] for sub-Gaussian matrices, with a decent error compression rate provided. The proof techniques in these papers apply to all sizes of sensing matrices and all bit-depths (the number of bits allocated for each sample). This paper aims to extend this result to Fourier sensing matrices.

A parallel line of research investigates MSQ in compressed sensing with only 1 bit of storage per measurement [1, 18, 19, 25, 26]. Most results in this direction do not easily generalize to multi-bit schemes, even though having more bits normally reduces the quantization error. In addition, it has been shown that the reconstruction error of 1-bit MSQ is asymptotically larger than that of 1-bit Sigma Delta. This will be discussed in detail later.

To make the discussion more concrete, we now introduce the mathematical model. Denote the set of all sparse signals by . Suppose the objective signal lies in , and the measurement vector is obtained under a linear map , i.e., .

We say a sensing matrix satisfies the -Restricted Isometry Property (RIP) if

and the smallest that satisfies the above inequality is called the Restricted Isometry Constant (RIC).

A well-known sufficient condition for the minimization problem

to successfully recover all -sparse vectors requires to satisfy the -RIP [8]. A tighter condition was also developed in [6]. The Fourier ensemble studied in this paper refers to the matrix consisting of randomly subsampled rows of the discrete Fourier matrix in . Matrices of this type are known [28] to satisfy the -RIP with probability provided for any , where and are both absolute constants. Here the lower bound can be considered as the effective dimensionality of the -sparse signals associated with the Fourier ensemble.
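To illustrate the measurement model, the following minimal Python sketch applies randomly chosen rows of an unnormalized DFT matrix to a sparse vector. The sizes `N`, `m`, `k`, the sign convention of the exponent, the absence of normalization, and the choice to draw rows without replacement are all assumptions made for this example and are not taken from the text.

```python
import cmath
import math
import random

def partial_fourier_measure(x, rows):
    """Apply the selected rows of the unnormalized N x N DFT matrix to x.

    y_t = sum_n x[n] * exp(-2*pi*i*rows[t]*n/N); zero entries of x are skipped.
    """
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * r * n / N) for n in range(N) if x[n] != 0)
        for r in rows
    ]

# Hypothetical sizes: ambient dimension N, number of measurements m, sparsity k.
N, m, k = 64, 16, 3
rng = random.Random(1)

# Build a k-sparse real signal with a random support.
x = [0.0] * N
for n in rng.sample(range(N), k):
    x[n] = rng.uniform(-1.0, 1.0)

# Assumption: rows are subsampled uniformly without replacement.
rows = rng.sample(range(N), m)
y = partial_fourier_measure(x, rows)
```

Each measurement is a sum of at most `k` unit-modulus terms weighted by the nonzero entries of `x`, so its magnitude never exceeds the $\ell_1$ norm of the signal.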

Denote by the quantization operator that digitalizes

where is called the quantization alphabet, which is the computer’s dictionary. All values in must have been assigned digital labels so that they can be accurately recorded in the computer with finitely many bits.

Now we define the two types of quantizations. A scalar quantization is a simple rounding off operation.

The MSQ, , of the measurement vector , is defined as

where is the space of measurement. Each is acquired independently using only the th sample . The method ignores the possible correlation between samples.
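As a concrete illustration of memoryless scalar quantization, the sketch below rounds the real and imaginary parts of each sample independently to a uniform grid. The step size `delta`, the unbounded grid, and the tie-breaking rule are assumptions for the example.

```python
import math

def round_to_grid(v, delta):
    """Nearest point of the uniform grid delta*Z (ties are rounded up)."""
    return delta * math.floor(v / delta + 0.5)

def msq(y, delta):
    """Memoryless scalar quantization: every sample is rounded on its own."""
    return [complex(round_to_grid(z.real, delta), round_to_grid(z.imag, delta)) for z in y]

delta = 0.25  # hypothetical step size
y = [0.3 + 0.1j, -0.62 + 0.49j, 0.05 - 0.9j]
q = msq(y, delta)

# Per-sample error: at most delta/2 in each of the real and imaginary parts.
errors = [abs(a - b) for a, b in zip(y, q)]
```

Because no information is shared between samples, the error of each sample is bounded only by the grid resolution, here `delta / sqrt(2)` in modulus, no matter how many samples are taken.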

The Sigma Delta quantization with order , , is defined as

where | |||

where is some pre-assigned constant. A usual choice of is 0. The method records the quantization error in the so-called state variable . The next quantization leverages this error together with the new incoming sample, so that the common information contained in several adjacent samples is refined. For interested readers, the architecture of $\Sigma\Delta$ quantization is explained at length in [15, 29].
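The role of the state variable can be seen in a minimal sketch of a first-order scheme. The code below assumes the standard greedy rule $q_i = Q(u_{i-1} + y_i)$, $u_i = u_{i-1} + y_i - q_i$ with real inputs and an unbounded grid of step `delta`; these choices, the constant test input, and the averaging decoder are assumptions for illustration and are not claimed to match the exact definition above.

```python
import math

def round_to_grid(v, delta):
    """Nearest point of the uniform grid delta*Z (ties are rounded up)."""
    return delta * math.floor(v / delta + 0.5)

def sigma_delta_first_order(y, delta, u0=0.0):
    """Greedy first-order Sigma Delta quantizer (assumed standard form).

    Returns the quantized samples q and the state variables u_1, ..., u_m.
    """
    q, u = [], [u0]
    for sample in y:
        target = u[-1] + sample        # previous error plus the new sample
        qi = round_to_grid(target, delta)
        q.append(qi)
        u.append(target - qi)          # new state: accumulated quantization error
    return q, u[1:]

# Oversample a constant x with m samples on a coarse grid.
delta, m, x = 1.0, 50, 0.3
y = [x] * m

q_sd, u = sigma_delta_first_order(y, delta)
q_msq = [round_to_grid(v, delta) for v in y]

# Decode both by simple averaging.
err_sd = abs(sum(q_sd) / m - x)    # equals |u_m - u_0| / m
err_msq = abs(sum(q_msq) / m - x)  # does not improve with m
```

In this toy setting the averaged Sigma Delta output has error at most `delta / (2 * m)`, since the errors telescope through the state variable, while the averaged MSQ output keeps the full rounding error of a single sample.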

For simplicity, in this paper we use a uniform alphabet . Here is the imaginary unit, and is the set of integers. defines how dense the alphabet is so is called the quantization step size. denotes the ball in with center and radius . We assume where is the smallest integer greater than or equal to . This assumption guarantees the quantization error to be bounded by , thus fulfilling the stability requirement of quantization (see [15]).

Quantizers are essentially encoders that map real or complex values to a finite codebook of the digital device, so decoders are needed. When has large bit-depth, i.e., each sample is recorded with many bits, itself is already a good approximation of and we can directly use to recover . For small bit-depth, however, one decoder is needed to find an estimate of from , and then another decoder is needed to invert/reconstruct the original signal from . In many cases including ours, the two decoding steps are integrated into a single algorithm, for which the input is and the output is , an estimate of . The performance of these algorithms is evaluated based on their accuracy, efficiency, robustness, etc., which we explain as follows. Efficiency refers to an algorithm’s tractability. Robustness refers to the algorithm’s stability to additive noise and quantization error. Accuracy is measured by the maximal reconstruction error, i.e., distortion, of all signals among a given class ,

Here denotes a quantization method and denotes a reconstruction algorithm. Previous works [15, 29] have demonstrated that for Gaussian matrices, the Sigma Delta encoder coupled with several decoders is both robust and efficient. The same is true for our setting by a straightforward generalization. We therefore focus only on evaluating the accuracy, i.e., finding a formula for the distortion as a function of . The accuracy for Gaussian matrices has been found in [20, 29, 18]. With the -th order Sigma Delta quantization, we have

(1.1) |

and by optimizing over it achieves

The constants in both expressions depend on and , the order of and the signal class’s sparsity level. Both are assumed to be fixed beforehand.

In contrast, the distortion for one bit compressed sensing with Gaussian measurement matrix is known [2, 9] to at best be

There exist more sophisticated quantization schemes that achieve a better, exponential asymptotic rate [9], but they come at the expense of complicating the design of the analog hardware of the quantizer, for example by involving analog multipliers (devices that perform analog multiplications). In contrast, Sigma Delta quantization only requires analog additions and negations.

In this paper, we prove results similar to (1.1) for Fourier sensing matrices, which requires a completely different technique. Since Sigma Delta is a recursive scheme, it is not very surprising that it displays great sensitivity to permutations of the input sequence. In fact, we demonstrate that one needs to randomly permute the entries of the Fourier samples in order to obtain uniformly good reconstruction results for all the signals of a given sparsity level.

### 1.1 Notation

For a matrix , we use to denote the Schatten $p$-norm and when , we simply use to denote its spectral norm. In addition, is the trace of , stands for the entry-wise $\ell_\infty$ norm, and stands for the operator norm from the -space to the -space. As usual, and are ’s smallest and largest singular values, respectively. If is an index set , then denotes the collection of columns of indexed in . For a vector , stands for the norm of . The set of all -sparse signals is . For simplicity, we also use the convention for any positive integer .

## 2 The proposed method

Sigma Delta quantization is conventionally designed to quantize low frequency signals. By definition, the quantizer has the so-called noise shaping effect because it pushes the quantization noise towards high frequencies, which lie outside the band of interest. The Gaussian measurements of sparse signals are not of low frequency. Rather, they have flat spectra. The results in [29] are interesting as they show that $\Sigma\Delta$ quantization also performs reasonably well in quantizing this type of input.

###### Proposition 2.1 ([29]).

Suppose is a sub-Gaussian matrix with zero-mean and unit-variance. Then there exist a convex decoder , and constants , and , such that with probability over on the draw of , the reconstruction errors for all -sparse signals obey

(2.1) |

provided that .

Although it is stated for sub-Gaussian matrices, a close examination of its proof in [29] suggests that the above result is obtained by utilizing only the spectral properties of sub-matrices of , i.e., the bounds on the quantities

(2.2) |

Here is an integer depending only on , is the collection of columns of indexed in , and consists of the last left singular vectors of the finite difference matrix. Clearly, since are singular vectors, stands for a projection of onto the span of . The following proposition shows that essentially contains low frequency sinusoids. Therefore, the above quantities measure the level of low frequency energy in .
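The low frequency nature of these singular vectors can be checked numerically. The sketch below assumes that the difference matrix $D$ has ones on the main diagonal and minus ones on the subdiagonal, and tests a closed form that we derived for this illustration (it is not quoted from [20]): the vector with entries $\cos((l-\tfrac12)\theta_j)$, $\theta_j = (2j-1)\pi/(2m+1)$, is an eigenvector of $DD^T$ with eigenvalue $4\sin^2(\theta_j/2)$, so small $j$ corresponds to the smallest singular values and the slowest oscillations. The size `m` is hypothetical.

```python
import math

def apply_D(u):
    """(D u)_i = u_i - u_{i-1} with u_0 = 0 (ones on the diagonal, minus ones below)."""
    return [u[i] - (u[i - 1] if i > 0 else 0.0) for i in range(len(u))]

def apply_Dt(v):
    """(D^T v)_i = v_i - v_{i+1} with v_{m+1} = 0."""
    m = len(v)
    return [v[i] - (v[i + 1] if i + 1 < m else 0.0) for i in range(m)]

def low_freq_vector(m, j):
    """Candidate left singular vector of D and its singular value (assumed closed form)."""
    theta = (2 * j - 1) * math.pi / (2 * m + 1)
    vec = [math.cos((l - 0.5) * theta) for l in range(1, m + 1)]
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec], 2 * math.sin(theta / 2)

m = 64  # hypothetical size
residuals = []
for j in (1, 2, 3):
    u, sigma = low_freq_vector(m, j)
    ddt_u = apply_D(apply_Dt(u))  # D D^T u, computed without forming the matrix
    # Largest entry of D D^T u - sigma^2 u; it should vanish up to rounding.
    residuals.append(max(abs(a - sigma ** 2 * b) for a, b in zip(ddt_u, u)))
```

Vanishing residuals confirm that, under the stated assumptions, the singular vectors attached to the smallest singular values are slowly varying cosines.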

###### Proposition 2.2.

([20]) Let be the discrete difference matrix with ones and minus ones on the main and sub diagonals, respectively, and zeros otherwise. Let be the singular value decomposition of . Then

and

The following propositions are extracted from [29]. They show that in the frame case (i.e., is a tall matrix with full column rank), the distortion is controlled by the amount of low frequency energy of the analysis frame operator; in the compressed sensing case, the error is determined by the Restricted Isometry Constant of .

###### Proposition 2.3.

Let be an matrix with normalized rows. Then there exists a decoder, such that for any (the unit ball in ), the reconstruction from the quantization using this decoder obeys

(2.3) |

for any with . Here is an absolute constant and denotes the smallest singular value of .

This proposition says that in the frame measurement case, small distortion is guaranteed by large values of .

###### Proposition 2.4.

Let be an matrix. Suppose that there exist some constant and integer such that has -RIP with RIC , i.e.,

Then there exists a decoder, such that for any -sparse signal , the reconstruction from the quantization using this decoder obeys

(2.4) |

Here is a constant that depends only on .

This proposition proves a polynomial convergence rate for the distortion under the condition that the quantity satisfies a RIP condition. The condition implies that all with have a similar level of low frequency energy.

When is the DFT matrix, there exist choices of for which contains only high frequency sinusoids, and also choices for which contains only low frequency sinusoids. Together they induce a large RIC bound on , in fact too large for Proposition 2.4 to be informative. Even when the support is known, in which case the measurement reduces to , Proposition 2.3 would yield a suboptimal result due to the small value of when contains only high frequency components.

To overcome this intrinsic issue of Fourier ensembles, and inspired by the success of the sub-Gaussian ensembles, we propose to randomly permute the entries of the Fourier measurements so as to whiten their spectrum before they enter the quantizer. The random permutation we consider in this paper is with replacement, i.e., each entry of the new sequence is drawn independently and uniformly from the original entries. The result for random permutation without replacement is similar but has a more complicated proof, so we omit it to avoid distraction from the main topic.

In what follows, we first demonstrate that the random permutation increases the value of , so that we have the desired distortion bound for the frame case. The result for the compressed sensing case then follows from it.

We start with a definition of the harmonic frame as sub-matrices of the DFT matrices.

###### Definition 2.5 (Harmonic Frames).

An harmonic frame has the form

(2.5) |

where

By definition, all measurements under are band-limited signals. Denote by the map that creates a new measurement sequence via randomly selecting elements from with replacement. stands for the index chosen in the th random draw, i.e., means that the first element in the new sequence is the third one in the original. We denote the new sequence by hence is the matrix formed by randomly selecting rows of according to . The next theorem shows that with large probability on the draw of , all measurements under have a somewhat uniform spectrum.
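The effect of the random selection on the low frequency energy can be illustrated with a small experiment. The sketch below takes a single highest-frequency column of a DFT matrix with unit-norm normalization, draws its entries with replacement, and measures the energy captured by the `L` slowest cosine vectors. These cosines stand in for the last left singular vectors of the difference matrix under an assumed closed form, and the sizes `m`, `L`, the frequency, the normalization, and the seed are all hypothetical.

```python
import cmath
import math
import random

def low_freq_basis(m, L):
    """L orthonormal low frequency cosines (assumed closed form for the
    left singular vectors of the difference matrix with smallest singular values)."""
    basis = []
    for j in range(1, L + 1):
        theta = (2 * j - 1) * math.pi / (2 * m + 1)
        vec = [math.cos((l - 0.5) * theta) for l in range(1, m + 1)]
        norm = math.sqrt(sum(c * c for c in vec))
        basis.append([c / norm for c in vec])
    return basis

def low_freq_energy(col, basis):
    """Squared norm of the projection of col onto the span of the basis."""
    return sum(abs(sum(b * c for b, c in zip(vec, col))) ** 2 for vec in basis)

def harmonic_column(m, freq):
    """Unit-norm DFT column with the given frequency (normalization is an assumption)."""
    return [cmath.exp(2j * math.pi * freq * l / m) / math.sqrt(m) for l in range(m)]

m, L = 256, 16
basis = low_freq_basis(m, L)
col = harmonic_column(m, m // 2)  # highest frequency: entries alternate in sign

rng = random.Random(0)
selection = [rng.randrange(m) for _ in range(m)]  # m draws with replacement
permuted = [col[s] for s in selection]

e_orig = low_freq_energy(col, basis)       # almost no low frequency energy
e_perm = low_freq_energy(permuted, basis)  # on average about L / m
```

In the original order the alternating column is nearly orthogonal to every slow cosine, whereas after the random selection its entries behave like independent random signs and a fraction of roughly `L / m` of the energy lands in the low frequency subspace.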

###### Theorem 2.6.

Fix integers , an absolute constant , and two index sets with , . Let be the random selection map defined by selecting indices from the set randomly times, i.e., for each , . Then there exist constants and , such that, with probability on the draws of , it holds,

and

provided that . Here denotes with normalized columns and denotes a harmonic frame in . is the matrix formed by stacking rows of according to the order defined in . is an indicator function that indicates whether the first column (the all-one column) belongs to .

If all the frequency bands carry the same level of energy, then the relative energy in arbitrary out of bands should be . Hence the upper and lower bounds in Theorem 2.6 are tight up to a constant when the column of ones is not in .

The following theorem replaces in the above theorem by , providing a direct estimate of the key quantity .

###### Theorem 2.7.

Let be any harmonic frame defined in (2.5). Let be an matrix whose rows are randomly chosen from those of with replacement. Then there exist such that for any , any and satisfying , it holds

(2.6) |

where are the first right singular vectors of . If in addition, does not contain the column of all ones, then it also holds that

(2.7) |

Theorems 2.6 and 2.7 will be proved together as corollaries of the following more general result. It shows that the only property of needed to obtain results like Theorems 2.6 and 2.7 is the wide-spreadness of its entries.

###### Theorem 2.8.

Let be a tight frame with frame bound (i.e., ) and assume where is the identity matrix and . Let be some orthonormal matrix. Assume constants are such that and that . Let be an matrix whose rows are randomly chosen from those of with replacement. Then there exists a positive function such that for any , as long as satisfies , it holds with probability that

and

where with being an absolute constant.

## 3 Applications to frame and compressed sensing settings

Applying Theorem 2.7, we can show that with some appropriate decoder, the distortion under randomly permuted partial Fourier measurements obeys

(3.1) |

where depends on the signal’s sparsity level and the ambient dimension . This error bound has the same asymptotic order in as that for the sub-Gaussian matrices.

### 3.1 Decoders

The existing decoders in the literature are sufficient to show our results. When is underdetermined, we use the following decoder proposed in [29] to reconstruct from the th order Sigma Delta quantized measurements ,

() |

When is overdetermined with full column rank, we can either use a simplified version of (),

() |

or the th order Sobolev dual frame proposed in [4],

() |

### 3.2 High order Sigma Delta

To generalize (3.1) to higher order quantizations, we need some as yet unproven properties of the singular vectors of to derive the necessary estimate on in Theorem 2.8. Explicitly, numerical experiments suggest the following conjecture.

###### Conjecture 3.1.

There exists a constant such that for any , the singular vectors of satisfy

where is the element-wise norm of .

If this conjecture is true, then the result (3.1) can be generalized to as

### 3.3 Main results

We are now ready to state the main theorems of the paper.

###### Theorem 3.2.

Denote by the unit ball in . Let be an Harmonic frame, and be a matrix with randomly selected rows from with replacement. Suppose is the signal and is the first order quantization of using the uniform quantization alphabet . Then there exist absolute constants and such that for any , the reconstruction from or from

obeys

with probability provided that .

###### Theorem 3.3.

Let be an unnormalized DFT matrix of dimension , and let be a matrix with randomly selected rows from with replacement. Assume is a -sparse signal. Let be the first order Sigma Delta quantization of the compressed measurements with the quantization alphabet and suppose is the solution to

(3.2) |

Then there exist absolute constants and such that for any ,

(3.3) |

with probability over provided that . Here denotes the set of -sparse signals in .

## 4 Proofs of the theorems

### 4.1 Auxiliary Lemmas

In this section, we list a few large deviation results from probability theory as well as some preliminary lemmas that will be employed later.

###### Proposition 4.1 (Bernstein inequality).

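Let $X_1,\dots,X_n$ be independent zero-mean random variables. Suppose that $|X_i|\le M$ almost surely, for all $i$. Then for all positive $t$,

$$\mathbb{P}\Big(\sum_{i=1}^{n} X_i \ge t\Big) \le \exp\Big(-\frac{t^2/2}{\sum_{i=1}^{n}\mathbb{E}[X_i^2] + Mt/3}\Big).$$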
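As a sanity check, the classical scalar Bernstein bound $\exp\big(-\tfrac{t^2/2}{\sum_i \mathbb{E}X_i^2 + Mt/3}\big)$ can be compared with an exactly computable tail. The sketch below does this for sums of independent uniform $\pm 1$ variables, for which $M = 1$ and each variance equals 1. The choice of distribution, the sample size `n`, and the grid of thresholds are assumptions made for the example.

```python
import math

def rademacher_tail(n, t):
    """Exact P(S >= t) for S a sum of n independent uniform +/-1 variables.

    S = 2B - n with B binomial(n, 1/2), so S >= t iff B >= (n + t) / 2.
    """
    b_min = max(math.ceil((n + t) / 2), 0)
    return sum(math.comb(n, b) for b in range(b_min, n + 1)) / 2 ** n

def bernstein_bound(n, t, M=1.0, var=1.0):
    """exp(-(t^2/2) / (n*var + M*t/3)) for n variables bounded by M with variance var."""
    return math.exp(-(t * t / 2) / (n * var + M * t / 3))

n = 200  # hypothetical number of summands
table = [(t, rademacher_tail(n, t), bernstein_bound(n, t)) for t in range(10, 101, 10)]
for t, exact, bound in table:
    print(f"t={t:3d}  exact tail={exact:.3e}  Bernstein bound={bound:.3e}")
```

The exact tail stays below the bound for every threshold, and the gap widens for large `t`, where the linear term `M * t / 3` in the denominator starts to matter.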

###### Proposition 4.2 (Decoupling [31]).

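Let $A$ be an $n\times n$ matrix with zero diagonal. Let $X=(X_1,\dots,X_n)$ be a random vector with independent mean zero coefficients. Then, for every convex function $F$, one has

$$\mathbb{E}\,F(\langle AX, X\rangle) \le \mathbb{E}\,F(4\langle AX, X'\rangle),$$

where $X'$ is an independent copy of $X$.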

###### Proposition 4.3 (Matrix Rosenthal inequality [23]).

Suppose that or . Consider a finite sequence of centered, independent, random Hermitian matrices, and assume that . Then

###### Proposition 4.4 (Matrix Bernstein [23]).

Consider an independent sequence of random matrices in that satisfy and for any almost surely. Then, for all

for

###### Proposition 4.5 ([13]).

Let , and . Suppose that has -RIP with , then for any , we have

with constants , only depending on .

###### Lemma 4.6.

Let be an diagonal positive definite matrix, and let be such that , where is the identity matrix. In addition, suppose with some constant . Then, for any such that , the random matrix whose rows are independently and uniformly chosen from the rows of satisfies

(4.1) |

### 4.2 Proof of Proposition 2.3 and 2.4

As mentioned above, the proof is essentially contained in the proof of Theorem 9 of [29]. We extract the key steps and present them here for completeness.

###### Proof of Proposition 2.3.

We can use either the Sobolev dual decoder or the minimization decoder to produce the stated reconstruction error. (a) If , then

where denotes the th largest singular value of and we have used the fact that

(4.3) |

derived in [15].

(b) If is a feasible solution to (), then by the triangle inequality,

Rearranging the above inequality yields the conclusion of the proposition. ∎

### 4.3 Proof of Theorem 2.6, 2.7 and 2.8

###### Proof of Theorem 2.8.

For simplicity, we assume is even. The odd case follows a similar line of argument.

Let be the diagonal matrix containing the main diagonal of the matrix . By definition is positive definite.

Let , and . By the normality of , we have , and

Here equals the maximum row norm of . Applying Lemma 4.6 with , we are led to

(4.5) |

The following is devoted to finding an upper bound for the quantity . Let be a random vector which randomly selects indices from a total of indices. By a similar argument as in [30],

Using the convexity of the norm, we have

(4.6) |

where , and , means restricting to the columns indexed by , and means restricting to the submatrix with indices .

Next we shall calculate for a fixed . Since , and are now independent of each other.