Sequential Sensing with Model Mismatch

Ruiyang Song, Yao Xie, and Sebastian Pokutta. Ruiyang Song (songry12@mails.tsinghua.edu.cn) is with the Dept. of Electronic Engineering, Tsinghua University, Beijing, China. Yao Xie (yao.xie@isye.gatech.edu) and Sebastian Pokutta (sebastian.pokutta@isye.gatech.edu) are with the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA. This work is partially supported by NSF grants CMMI-1300144 and CCF-1442635.
Abstract

We characterize the performance of sequential information guided sensing, Info-Greedy Sensing [1], when there is a mismatch between the true signal model and the assumed model, which may be a sample estimate. In particular, we consider a setup where the signal is low-rank Gaussian and the measurements are taken in the directions of eigenvectors of the covariance matrix in a decreasing order of eigenvalues. We establish a set of performance bounds when a mismatched covariance matrix is used, in terms of the gap of signal posterior entropy, as well as the additional amount of power required to achieve the same signal recovery precision. Based on this, we further study how to choose an initialization for Info-Greedy Sensing using the sample covariance matrix, or using an efficient covariance sketching scheme.

Index Terms: compressed sensing, information theory, sequential methods, high-dimensional statistics, sketching algorithms

I. Introduction

Sequential compressed sensing is a promising new information acquisition and recovery technique to process big data that arise in various applications such as compressive imaging [2, 3, 4], power network monitoring [5], and large scale sensor networks [6]. The sequential nature of the problems arises either because the measurements are taken one after another, or due to the fact that the data is obtained in a streaming fashion so that it has to be processed in one pass.

To harvest the benefits of adaptivity in sequential compressed sensing, various algorithms have been developed (see [1] for a review). We may classify these algorithms as (1) being agnostic about the signal distribution and, hence, using random measurements [7, 8, 9, 10, 11, 12, 13]; (2) exploiting additional structure of the signal (such as graphical structure [14] and tree-sparse structure [15, 16]) to design measurements; (3) exploiting the distributional information of the signal in choosing the measurements, possibly through maximizing mutual information: the seminal Bayesian compressive sensing work [17], Gaussian mixture models (GMM) [18, 19], and our earlier work [1], which presents a general framework for information guided sensing referred to as Info-Greedy Sensing.

In this paper we consider the setup of Info-Greedy Sensing [1], as it provides certain optimality guarantees. Info-Greedy Sensing aims at designing subsequent measurements to maximize the mutual information conditioned on previous measurements. Conditional mutual information is a natural metric here, as it captures exclusively the useful new information between the signal and the result of the measurement, disregarding noise and what has already been learned from previous measurements. It was shown in [1] that Info-Greedy Sensing for a Gaussian signal is equivalent to choosing the sequential measurement vectors as the orthonormal eigenvectors of the signal covariance matrix $\Sigma$ in decreasing order of the eigenvalues.

In practice, we do not know the signal covariance matrix $\Sigma$ and have to use a sample covariance matrix as an estimate. As a consequence, the measurement vectors are calculated from the estimate $\hat\Sigma$ and deviate from the optimal directions. Since we almost always have to use some estimate for the signal covariance, it is important to quantify the performance of sensing algorithms with model mismatch.

In this paper, we characterize the performance of Info-Greedy Sensing for Gaussian signals [1] when the true signal covariance matrix is replaced with a proxy, which may be an estimate from direct samples or from a covariance sketching scheme. We establish a set of theoretical results, including (1) relating the error in the covariance matrix to the entropy of the signal posterior distribution after each sequential measurement, and thus characterizing the gap between this entropy and the entropy when the correct covariance matrix is used; (2) establishing an upper bound on the amount of additional power required to achieve the same precision of the recovered signal when using an estimated covariance matrix; (3) when initializing Info-Greedy Sensing via a sample covariance matrix, finding the minimum number of samples required so that such an initialization achieves good performance; and (4) presenting a covariance sketching scheme to initialize Info-Greedy Sensing and finding the conditions under which such an initialization is sufficient. We also present a numerical example to demonstrate the good performance of Info-Greedy Sensing compared to a batch method (where measurements are not adaptive) when there is mismatch.

Our notation is standard: $\|X\|$ is the spectral norm of a matrix $X$, $\|X\|_F$ denotes the Frobenius norm of a matrix $X$, and $\|X\|_*$ represents the nuclear norm of a matrix $X$; $\|x\|_1$ is the $\ell_1$ norm of a vector $x$, and $\|x\|_2$ is the $\ell_2$ norm of a vector $x$; let $Q_k(p)$ be the quantile function of the chi-squared distribution with $k$ degrees of freedom; let $\mathbb{E}[z]$ and $\mathrm{Var}[z]$ denote the mean and the variance of a random variable $z$; $X \succeq 0$ means that the matrix $X$ is positive semi-definite.

II. Problem setup

A typical sequential compressed sensing setup is as follows. Let $x \in \mathbb{R}^n$ be an unknown $n$-dimensional signal. We make measurements of $x$ sequentially

$y_k = a_k^\top x + w_k, \quad w_k \sim \mathcal{N}(0, \sigma^2), \quad k = 1, 2, \ldots,$

and the power of the $k$th measurement is $\beta_k = \|a_k\|_2^2$. The goal is to recover $x$ using the measurements $y_1, y_2, \ldots$. Consider a Gaussian signal $x \sim \mathcal{N}(0, \Sigma)$ with known zero mean and covariance matrix $\Sigma$ (here without loss of generality we have assumed the signal has zero mean). Assume the rank of $\Sigma$ is $s$ and the signal can be low-rank, $s < n$. Info-Greedy Sensing [1] chooses each measurement to maximize the conditional mutual information

$a_{k+1} \in \arg\max_{a:\, \|a\|_2^2 \le \beta_{k+1}} \; I\!\left(x;\, a^\top x + w \,\big|\, y_1, \ldots, y_k\right).$

The goal is to use a minimum number of measurements (or minimum total power) so that the estimated signal $\hat{x}$ is recovered with precision $\varepsilon$: $\|\hat{x} - x\|_2 \le \varepsilon$ with high probability.

In [1], we devised a solution to the above problem and established that Info-Greedy Sensing for a low-rank Gaussian signal measures in the directions of the eigenvectors of $\Sigma$ in decreasing order of the eigenvalues, with power allocation depending on the noise variance $\sigma^2$, the signal recovery precision $\varepsilon$, and the confidence level $p$, as given in Algorithm 1.

Ideally, if we knew the true signal covariance matrix $\Sigma$, we would use its eigenvectors to form the measurements. However, in practice, we have to use an estimate $\hat\Sigma$ of the covariance matrix, which usually has errors. To establish performance bounds when there is a mismatch between the assumed and the true covariance matrix, we adopt as a metric the posterior entropy of the signal conditioned on previous measurement outcomes. The entropy of a Gaussian signal $x \sim \mathcal{N}(\mu, \Sigma)$ is given by

$H[x] = \frac{1}{2}\log\!\left((2\pi e)^{n} \det\Sigma\right).$

Hence, the conditional mutual information is essentially the log of the determinant of the conditional covariance matrix, or equivalently the log of the volume of the ellipsoid defined by the covariance matrix. Here, to accommodate the scenario where the covariance matrix is low-rank, we consider a modified definition of conditional entropy, which is the log of the volume of the ellipsoid in the low-dimensional space. Let $\Sigma_k$ be the underlying true signal covariance conditioned on the previous $k$ measurements; denote by $\hat\Sigma_k$ the observed covariance matrix, which is also the output of the sequential algorithm. Assume the rank of $\Sigma_k$ is $s$. Then the metric we use to track the progress of our algorithm is, up to an additive constant,

$\frac{1}{2}\log \mathrm{vol}(\Sigma_k),$

where $\mathrm{vol}(\Sigma_k)$ is the volume of the ellipsoid defined by the covariance matrix $\Sigma_k$, which is equal to the product of its non-zero eigenvalues.
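To make this metric concrete, the following minimal sketch (ours, not from the paper) evaluates the log-volume surrogate for a possibly low-rank covariance matrix; the function name and the eigenvalue tolerance are arbitrary choices.

```python
import numpy as np

def posterior_entropy_surrogate(Sigma, tol=1e-10):
    """Log-volume surrogate for a (possibly low-rank) Gaussian:
    half the sum of the logs of the non-zero eigenvalues of Sigma."""
    eigvals = np.linalg.eigvalsh(Sigma)      # real eigenvalues, ascending
    nonzero = eigvals[eigvals > tol]         # keep only non-degenerate directions
    return 0.5 * np.sum(np.log(nonzero))     # (1/2) log prod(lambda_i), up to additive constants

# Example: a rank-2 covariance matrix in R^3 with non-zero eigenvalues 4 and 1
U = np.linalg.qr(np.random.randn(3, 3))[0][:, :2]
Sigma = U @ np.diag([4.0, 1.0]) @ U.T
print(posterior_entropy_surrogate(Sigma))    # 0.5 * (log 4 + log 1)
```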

Fig. 1: Parameter update in the algorithm and for the true distribution.
0:  assumed signal mean $\mu$ and covariance matrix $\hat\Sigma$, noise variance $\sigma^2$, recovery accuracy $\varepsilon$, confidence level $p$
1:  repeat
2:     
3:     $\lambda \leftarrow$ largest eigenvalue of $\hat\Sigma$ {largest eigenvalue}
4:     $u \leftarrow$ normalized eigenvector of $\hat\Sigma$ for eigenvalue $\lambda$
5:     form measurement: $a \leftarrow \sqrt{\beta}\, u$, where $\beta$ is the allocated power
6:     measure: $y \leftarrow a^\top x + w$
7:     update mean: $\mu \leftarrow \mu + \hat\Sigma a\,(y - a^\top \mu)/(a^\top \hat\Sigma a + \sigma^2)$
8:     update covariance: $\hat\Sigma \leftarrow \hat\Sigma - \hat\Sigma a a^\top \hat\Sigma/(a^\top \hat\Sigma a + \sigma^2)$
9:  until  {all eigenvalues become small}
10:  return  posterior mean as a signal estimate
Algorithm 1 Info-Greedy Sensing for Gaussian signals
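For illustration, here is a possible Python sketch of the sensing loop in Algorithm 1 under the Gaussian model above. Because the exact power-allocation and stopping rules of Algorithm 1 are not reproduced here, the sketch uses a fixed per-measurement power `beta` and a plain eigenvalue threshold `eig_tol` as stand-ins; the paper derives both from the noise variance, the accuracy $\varepsilon$, and the confidence level $p$.

```python
import numpy as np

def info_greedy_gaussian(x_true, mu, Sigma_hat, sigma2, beta, eig_tol, max_iter=100, rng=None):
    """Sketch of Info-Greedy Sensing for a Gaussian signal: measure along the top
    eigenvector of the current assumed covariance, then apply the standard Gaussian
    posterior update to (mu, Sigma_hat)."""
    rng = np.random.default_rng() if rng is None else rng
    mu, Sigma_hat = mu.copy(), Sigma_hat.copy()
    for _ in range(max_iter):
        eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
        lam, u = eigvals[-1], eigvecs[:, -1]               # largest eigenpair
        if lam <= eig_tol:                                 # "all eigenvalues become small"
            break
        a = np.sqrt(beta) * u                              # measurement vector with power beta
        y = a @ x_true + np.sqrt(sigma2) * rng.standard_normal()
        g = Sigma_hat @ a
        denom = a @ g + sigma2
        mu = mu + g * (y - a @ mu) / denom                 # posterior mean update (line 7)
        Sigma_hat = Sigma_hat - np.outer(g, g) / denom     # posterior covariance update (line 8)
    return mu, Sigma_hat
```

The returned posterior mean plays the role of the signal estimate in line 10.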

III. Performance bounds

We analyze the performance of Info-Greedy Sensing when an assumed covariance matrix $\hat\Sigma$, different from the true signal covariance matrix $\Sigma$, is used for measurement design, i.e., $\hat\Sigma$ is used to initialize Algorithm 1. Let the eigenpairs of $\hat\Sigma$, with the eigenvalues (which can be zero) ranked from the largest to the smallest, be $(\lambda_i, u_i)$, $i = 1, \ldots, n$, and let the eigenpairs of $\Sigma$, ranked the same way, be $(\tau_i, v_i)$. Let the updated covariance matrix in Algorithm 1, starting from $\hat\Sigma$, after $k$ measurements be $\hat\Sigma_k$, and let the true conditional covariance matrix of the signal after these measurements be $\Sigma_k$. The evolution of the covariance matrices in Algorithm 1 is illustrated in Fig. 1. Hence, by this notation, since each time we measure in the direction of the dominating eigenvector of the updated covariance matrix, the $(k+1)$th measurement direction is given by the largest eigenpair of $\hat\Sigma_k$, whereas the ideal direction under the true model would be given by the largest eigenpair of $\Sigma_k$. Furthermore, denote the difference between the true and the assumed conditional covariance matrices after we obtain $k$ measurements by

$E_k = \Sigma_k - \hat\Sigma_k,$

and let

$\delta_k = \|E_k\|.$

Assume the eigenvalues of $E_k$ are $e_1 \ge e_2 \ge \cdots \ge e_n$. Then $\delta_k = \max\{e_1, -e_n\}$.

III-A Deterministic error

The following theorem shows that when the error is sufficiently small, the performance of Info-Greedy Sensing will not degrade much. Note, however, that if the power allocations are calculated using the eigenvalues of the assumed covariance matrix $\hat\Sigma$, then after all iterations we do not necessarily reach the desired precision $\varepsilon$ with probability $p$.

Theorem 1.

Assume the power allocations are calculated using the eigenvalues of $\hat\Sigma$, the noise variance $\sigma^2$, the recovery accuracy $\varepsilon$, and the confidence level $p$ in Algorithm 1. Suppose the rank of the covariance matrix is $s$ and the total number of measurements is $K$. For some constant $c$, if the error satisfies

then

(1)

where

In the proof of Theorem 1, we use the trace of the underlying actual covariance matrix as a potential function, which serves as a surrogate for the product of eigenvalues that determines the entropy, since the trace of the observed covariance matrix is much easier to calculate. Note that for an assumed covariance matrix $\hat\Sigma$, after measuring in the direction of a unit-norm eigenvector $u$ with eigenvalue $\lambda$ using power $\beta$, the updated matrix takes the form of

$\hat\Sigma' = \frac{\lambda\sigma^2}{\lambda\beta + \sigma^2}\, u u^\top + P,$   (2)

where $P$ is the component of $\hat\Sigma$ in the orthogonal complement of $u$. Thus, the only change in the eigen-decomposition of $\hat\Sigma$ is the update of the eigenvalue of $u$ from $\lambda$ to $\lambda\sigma^2/(\lambda\beta + \sigma^2)$. Based on the update above in (2), after one measurement, the trace of the covariance matrix that the algorithm keeps track of becomes

$\mathrm{tr}(\hat\Sigma') = \mathrm{tr}(\hat\Sigma) - \frac{\beta\lambda^2}{\lambda\beta + \sigma^2}.$
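A quick numerical check of this update (our own, with arbitrary parameters) confirms the two facts used in the proof: the measured eigenvalue shrinks from $\lambda$ to $\lambda\sigma^2/(\lambda\beta + \sigma^2)$, and the trace drops by $\beta\lambda^2/(\lambda\beta + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, beta = 5, 0.1, 2.0

A = rng.standard_normal((n, n))
Sigma_hat = A @ A.T                                   # a random PSD "assumed" covariance
eigvals, V = np.linalg.eigh(Sigma_hat)
lam, u = eigvals[-1], V[:, -1]                        # top eigenpair

a = np.sqrt(beta) * u                                 # measure along u with power beta
Sigma_post = Sigma_hat - np.outer(Sigma_hat @ a, Sigma_hat @ a) / (a @ Sigma_hat @ a + sigma2)

# Eigenvalue along u shrinks to lam*sigma2/(lam*beta + sigma2) ...
print(np.isclose(u @ Sigma_post @ u, lam * sigma2 / (lam * beta + sigma2)))
# ... and the trace drops by beta*lam^2/(lam*beta + sigma2)
print(np.isclose(np.trace(Sigma_hat) - np.trace(Sigma_post),
                 beta * lam**2 / (lam * beta + sigma2)))
```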

Remark 1.

The upper bound on the posterior signal entropy in (1) shows the amount of uncertainty reduction achieved by the $k$th measurement.

Remark 2.

Using an elementary inequality, we have that in (1)

On the other hand, if the true covariance matrix $\Sigma$ is used, the posterior entropy of the signal is given by

(3)

Hence, we have

(4)

This upper bound has a nice interpretation: it characterizes the amount of uncertainty reduction with each measurement. For example, when the numbers of measurements required using the assumed covariance matrix and using the true covariance matrix are the same, the third term in (4) can be bounded so that the amount of reduction in entropy is roughly 1/2 nat per measurement.

Remark 3.

Consider the special case where the errors occur only in the eigenvalues of the matrix but not in the eigenspace, i.e.,

$\Sigma = \sum_i \tau_i v_i v_i^\top \quad \text{and} \quad \hat\Sigma = \sum_i \lambda_i v_i v_i^\top,$

so that $u_i = v_i$; then the upper bound in (3) can be further simplified. Suppose only the first $K$ largest eigenvalues of $\hat\Sigma$ are larger than the stopping criterion required by the precision, i.e., the algorithm takes $K$ steps in total. Then

This characterizes the gap between the signal posterior entropy using the correct versus the incorrect covariance matrices after all measurements have been used.

If we allow more total power and use a different power allocation scheme than the one prescribed in Algorithm 1, we are able to reach the desired precision $\varepsilon$. The following theorem establishes an upper bound on the amount of extra total power needed to reach the same precision, compared with the total power required when using the correct covariance matrix.

Theorem 2.

Given the recovery precision $\varepsilon$, the confidence level $p$, and the rank $s$ of the true covariance matrix $\Sigma$, assume the non-zero eigenvalues of $\Sigma$ are larger than a threshold determined by $\varepsilon$ and $p$. If

then to reach precision $\varepsilon$ at confidence level $p$, the total power required by Algorithm 1 when using $\hat\Sigma$ is upper bounded by

Remark 4.

In a special case when the eigenvalues of $\Sigma$ are sufficiently large, then under the condition of Theorem 2, we have a simpler expression for the upper bound

Note that the additional power required is only linear in the size of the covariance mismatch, which is quite small. All other parameters are independent of the input matrix.

Also, note that when there is a mismatch in the assumed covariance matrix, better performance can be achieved by making many low-power measurements rather than one full-power measurement, because we update the assumed covariance matrix in between.
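The following toy simulation (ours; all parameters are arbitrary and the mismatch model is hypothetical) illustrates this point by comparing one measurement that spends the entire power budget against eight measurements of one-eighth the power each, with the assumed covariance updated in between.

```python
import numpy as np

def sense(x, mu, S, sigma2, powers, rng):
    """Sequential measurements along the top eigenvector of the assumed covariance S,
    one measurement per entry of `powers`, with Gaussian posterior updates in between."""
    for beta in powers:
        u = np.linalg.eigh(S)[1][:, -1]
        a = np.sqrt(beta) * u
        y = a @ x + np.sqrt(sigma2) * rng.standard_normal()
        g, d = S @ a, a @ S @ a + sigma2
        mu, S = mu + g * (y - a @ mu) / d, S - np.outer(g, g) / d
    return mu

rng = np.random.default_rng(1)
n, sigma2, total_power = 20, 0.5, 40.0
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
Sigma = U @ np.diag(np.linspace(5, 0.1, n)) @ U.T          # true covariance
B = rng.standard_normal((n, n))
Sigma_hat = Sigma + 0.3 * (B @ B.T) / n                    # mismatched assumed covariance

err_one, err_many = [], []
for _ in range(200):
    x = rng.multivariate_normal(np.zeros(n), Sigma)
    err_one.append(np.linalg.norm(sense(x, np.zeros(n), Sigma_hat.copy(), sigma2, [total_power], rng) - x))
    err_many.append(np.linalg.norm(sense(x, np.zeros(n), Sigma_hat.copy(), sigma2, [total_power / 8] * 8, rng) - x))
print(np.mean(err_one), np.mean(err_many))                 # the split budget typically does better
```

Spreading the budget helps mainly because the updated covariance steers later measurements toward new directions, which is exactly the effect of updating the assumed covariance matrix in between.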

III-B Initialization with sample covariance matrix

In practice, we usually use a sample covariance matrix for $\hat\Sigma$. When the samples are Gaussian distributed, the sample covariance matrix follows a Wishart distribution. By finding the tail probability of the Wishart distribution, we are able to establish a lower bound on the number of samples needed to form the sample covariance matrix so that the conditions required by Theorem 1 are met with high probability and, hence, Algorithm 1 has good performance with the assumed matrix $\hat\Sigma$.

Corollary 1.

Suppose the sample covariance matrix $\hat\Sigma$ is obtained from $N$ training samples $x_1, \ldots, x_N$ that are drawn i.i.d. from $\mathcal{N}(0, \Sigma)$:

$\hat\Sigma = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top.$

When

we have, with high probability, that the error $\|\Sigma - \hat\Sigma\|$ is small enough for the conditions of Theorem 1 to be met.
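As a small illustration (ours; the dimensions and sample size are arbitrary), the snippet below draws $N$ training samples, forms the sample covariance, and reports the spectral-norm error that the sample-size condition in Corollary 1 is meant to control before the estimate is handed to Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rank, N = 30, 5, 500
spectrum = np.linspace(10, 2, rank)

# Ground-truth low-rank covariance
U = np.linalg.qr(rng.standard_normal((n, n)))[0][:, :rank]
Sigma = U @ np.diag(spectrum) @ U.T

# N i.i.d. zero-mean training samples and their sample covariance
X = (rng.standard_normal((N, rank)) * np.sqrt(spectrum)) @ U.T     # shape (N, n)
Sigma_hat = X.T @ X / N

# Spectral-norm deviation between the true and the sample covariance
print(np.linalg.norm(Sigma - Sigma_hat, ord=2))
# Sigma_hat (together with the zero mean) would then initialize Algorithm 1
```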

III-C Initialization with covariance sketching

We may also use a covariance sketching scheme to form an estimate of the covariance matrix to initialize the algorithm, as illustrated in Fig. 2. Covariance sketching is based on sketches $b_1, \ldots, b_M$ of the samples $x_1, \ldots, x_N$ drawn from the signal distribution. The sketches are formed by linearly projecting these samples via random sketching vectors, and then computing the average energy over repetitions. The sketching can be shown to be a linear operator $\mathcal{A}$ applied to the original covariance matrix $\Sigma$, as demonstrated in Appendix A. Then we may recover the original covariance matrix from these sketches by solving the following convex program

$\min_{X \succeq 0}\ \mathrm{tr}(X) \quad \text{subject to} \quad \|\mathcal{A}(X) - b\|_1 \le \eta,$   (5)

where $\mathcal{A}$ and $b$ are the sketching operator and data constructed in Appendix A, and $\eta$ is a user parameter that specifies the noise level.
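As an illustration of the recovery step, here is a hedged CVXPY sketch of program (5); the inputs `A_vecs` (the sketching vectors stacked row-wise) and `b` (the quadratic sketches) correspond to the quantities constructed in Appendix A, and `eta` is the noise-level parameter.

```python
import numpy as np
import cvxpy as cp

def recover_covariance(A_vecs, b, eta):
    """Solve the trace-minimization program (5):
    minimize tr(X) subject to X >= 0 and ||A(X) - b||_1 <= eta,
    where [A(X)]_m = a_m^T X a_m."""
    n = A_vecs.shape[1]
    X = cp.Variable((n, n), PSD=True)
    sketches = cp.hstack([cp.quad_form(A_vecs[m], X) for m in range(len(b))])
    prob = cp.Problem(cp.Minimize(cp.trace(X)),
                      [cp.norm(sketches - b, 1) <= eta])
    prob.solve()
    return X.value
```

The recovered matrix can then be used as the assumed covariance $\hat\Sigma$ to initialize Algorithm 1.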

Fig. 2: Diagram of covariance sketching in our setting. The circle aggregates quadratic sketches from branches and computes the average.

We further establish conditions on the covariance sketching so that such an initialization for Info-Greedy Sensing is sufficient.

Theorem 3.

Assume the setup of covariance sketching as above. Then, with high probability, the solution $\hat\Sigma$ to (5) satisfies

for some constant, as long as the parameters of the sketching scheme are chosen such that

Here the unspecified constants are absolute constants.

IV. Numerical example

When the assumed covariance matrix for the signal is equal to its true covariance matrix, Info-Greedy Sensing is identical to the batch method [19] (the batch method measures using the largest eigenvectors of the signal covariance matrix). However, when there is a mismatch between the two, Info-Greedy Sensing outperforms the batch method due to its adaptivity, as shown by the example demonstrated in Fig. 3. Info-Greedy Sensing also outperforms the sensing algorithm in which the measurement vectors are chosen to be random Gaussian vectors with the same power allocation, as it uses prior knowledge (albeit imprecise) about the signal distribution.

Fig. 3: Sensing a low-rank Gaussian signal (only a fraction of the eigenvalues are non-zero) when there is a mismatch between the assumed covariance matrix $\hat\Sigma$ and the true covariance matrix $\Sigma$, using 20 measurements. The batch method measures using the largest eigenvectors of $\hat\Sigma$, and Info-Greedy Sensing updates $\hat\Sigma$ in the algorithm. Info-Greedy Sensing is more robust to mismatch than the batch method.

V. Discussion

In high-dimensional problems, a commonly used low-dimensional model for the signal is to assume that it lies in a subspace plus Gaussian noise, which corresponds to the case considered in this paper where the signal covariance is low-rank. A more general model is the Gaussian mixture model (GMM), which can be viewed as a model for a signal lying in a union of multiple subspaces plus Gaussian noise, and it has been widely used in image and video analysis, among others. Our analysis for a low-rank Gaussian signal can be easily extended to an analysis of a low-rank Gaussian mixture model (GMM). Such results for GMMs are quite general and can be used for an arbitrary signal distribution. In fact, parameterizing via low-rank GMMs is a popular way to approximate complex densities for high-dimensional data. Hence, we may be able to couple the results for Info-Greedy Sensing of GMMs with the recently developed methods of scalable multi-scale density estimation based on empirical Bayes [20] to create powerful tools for information guided sensing for a general signal model. We may also be able to obtain performance guarantees using multiplicative weight update techniques together with the error bounds in [20].

Acknowledgement

This work is partially supported by an NSF CAREER Award CMMI-1452463 and an NSF grant CCF-1442635. Ruiyang Song was visiting the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology while working on this paper.

References

  • [1] G. Braun, S. Pokutta, and Y. Xie, “Info-greedy sequential adaptive compressed sensing,” to appear in IEEE J. Sel. Top. Sig. Proc., 2014.
  • [2] A. Ashok, P. Baheti, and M. A. Neifeld, “Compressive imaging system design using task-specific information,” Applied Optics, vol. 47, no. 25, pp. 4457–4471, 2008.
  • [3] J. Ke, A. Ashok, and M. Neifeld, “Object reconstruction from adaptive compressive measurements in feature-specific imaging,” Applied Optics, vol. 49, no. 34, pp. 27–39, 2010.
  • [4] A. Ashok and M. A. Neifeld, “Compressive imaging: hybrid measurement basis design,” J. Opt. Soc. Am. A, vol. 28, no. 6, pp. 1041–1050, 2011.
  • [5] W. Boonsong and W. Ismail, “Wireless monitoring of household electrical power meter using embedded RFID with wireless sensor network platform,” Int. J. Distributed Sensor Networks, Article ID 876914, 10 pages, vol. 2014, 2014.
  • [6] B. Zhang, X. Cheng, N. Zhang, Y. Cui, Y. Li, and Q. Liang, “Sparse target counting and localization in sensor networks based on compressive sensing,” in IEEE Int. Conf. Computer Communications (INFOCOM), pp. 2255–2258, 2014.
  • [7] J. Haupt, R. Nowak, and R. Castro, “Adaptive sensing for sparse signal recovery,” in IEEE 13th Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop (DSP/SPE), pp. 702–707, 2009.
  • [8] M. A. Davenport and E. Arias-Castro, “Compressive binary search,” arXiv:1202.0937v2, 2012.
  • [9] A. Tajer and H. V. Poor, “Quick search for rare events,” arXiv:1210.2406v1, 2012.
  • [10] D. Malioutov, S. Sanghavi, and A. Willsky, “Sequential compressed sensing,” IEEE J. Sel. Topics Sig. Proc., vol. 4, pp. 435–444, April 2010.
  • [11] J. Haupt, R. Baraniuk, R. Castro, and R. Nowak, “Sequentially designed compressed sensing,” in Proc. IEEE/SP Workshop on Statistical Signal Processing, 2012.
  • [12] S. Jain, A. Soni, and J. Haupt, “Compressive measurement designs for estimating structured signals in structured clutter: A Bayesian experimental design approach,” arXiv:1311.5599v1, 2013.
  • [13] M. L. Malloy and R. Nowak, “Near-optimal adaptive compressed sensing,” arXiv:1306.6239v1, 2013.
  • [14] A. Krishnamurthy, J. Sharpnack, and A. Singh, “Recovering graph-structured activations using adaptive compressive measurements,” in Annual Asilomar Conference on Signals, Systems, and Computers, Sept. 2013.
  • [15] T. Ervin and R. Castro, “Adaptive sensing for estimation of structure sparse signals,” arXiv:1311.7118, 2013.
  • [16] S. Akshay and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Trans. Info. Theory, vol. 60, no. 1, pp. 133–149, 2014.
  • [17] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Sig. Proc., vol. 56, no. 6, pp. 2346–2356, 2008.
  • [18] J. M. Duarte-Carvajalino, G. Yu, L. Carin, and G. Sapiro, “Task-driven adaptive statistical compressive sensing of Gaussian mixture models,” IEEE Trans. Sig. Proc., vol. 61, no. 3, pp. 585–600, 2013.
  • [19] W. Carson, M. Chen, R. Calderbank, and L. Carin, “Communication inspired projection design with application to compressive sensing,” SIAM J. Imaging Sciences, 2012.
  • [20] Y. Wang, A. Canale, and D. Dunson, “Scalable multiscale density estimation,” arXiv:1410.7692, 2014.
  • [21] G. W. Stewart and J.-G. Sun, Matrix perturbation theory. Academic Press, Inc., 1990.
  • [22] Y. Chen, Y. Chi, and A. J. Goldsmith, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Trans. Info. Theory, vol. in revision, 2013.
  • [23] S. Zhu, “A short note on the tail bound of Wishart distribution,” arXiv:1212.5860, 2012.

Appendix A Covariance sketching

We consider the following setup for covariance sketching. Suppose we are able to form measurements of the form $a^\top x + w$, as in the Info-Greedy Sensing algorithm. Suppose there are $N$ copies of the Gaussian signal we would like to sketch, $x_1, \ldots, x_N$, that are i.i.d. sampled from $\mathcal{N}(0, \Sigma)$, and we sketch using $M$ random vectors $a_1, \ldots, a_M$. Then for each fixed sketching vector $a_m$ and each fixed copy of the signal $x_i$, we acquire $L$ noisy realizations of the projection result via

$y_{i,m,\ell} = a_m^\top x_i + w_{i,m,\ell}, \quad \ell = 1, \ldots, L, \quad w_{i,m,\ell} \sim \mathcal{N}(0, \sigma^2).$

We choose the random sampling vectors as i.i.d. Gaussian with zero mean and covariance matrix equal to an identity matrix. Then we average over all $L$ realizations to form the $m$th sketch for a single copy $x_i$:

$z_{i,m} = \frac{1}{L}\sum_{\ell=1}^{L} y_{i,m,\ell} = a_m^\top x_i + \bar w_{i,m}.$

The average is introduced to suppress measurement noise, which can be viewed as a generalization of sketching using just one sample. Denote $\bar w_{i,m} = \frac{1}{L}\sum_{\ell=1}^{L} w_{i,m,\ell}$, which is distributed as $\mathcal{N}(0, \sigma^2/L)$. Then we will use the average energy of the sketches as our data $b_m$, $m = 1, \ldots, M$, for covariance recovery:

$b_m = \frac{1}{N}\sum_{i=1}^{N} z_{i,m}^2.$

Note that $b_m$ can be further expanded as

$b_m = a_m^\top \tilde\Sigma_N\, a_m + \frac{2}{N}\sum_{i=1}^{N} (a_m^\top x_i)\, \bar w_{i,m} + \frac{1}{N}\sum_{i=1}^{N} \bar w_{i,m}^2,$   (6)

where

$\tilde\Sigma_N = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top$

is the maximum likelihood estimate of $\Sigma$ (and is also unbiased). We can write (6) in vector-matrix notation as follows. Let $b = (b_1, \ldots, b_M)^\top$. Define a linear operator $\mathcal{A}: \mathbb{R}^{n\times n} \to \mathbb{R}^{M}$ such that $[\mathcal{A}(X)]_m = a_m^\top X a_m$. Thus, we can write (6) as a linear measurement of the true covariance matrix

$b = \mathcal{A}(\Sigma) + \nu,$

where $\nu$ contains all the error terms and corresponds to the noise in our covariance sketching measurements, with the $m$th entry given by

$\nu_m = a_m^\top (\tilde\Sigma_N - \Sigma)\, a_m + \frac{2}{N}\sum_{i=1}^{N} (a_m^\top x_i)\, \bar w_{i,m} + \frac{1}{N}\sum_{i=1}^{N} \bar w_{i,m}^2.$

Note that we can further bound the norm of the error term as

where

We may recover the true covariance matrix from the sketches using the convex optimization problem (5).
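For concreteness, the following sketch (ours; the symbols $N$, $M$, $L$ follow the notation above, and all numeric values are arbitrary) simulates the forward model just described: average $L$ noisy projections per sketching vector and per signal copy, square, and average over the $N$ copies to obtain the data $b$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, M, L, sigma2, r = 20, 200, 150, 10, 0.05, 4
spectrum = np.linspace(3, 1, r)

# True low-rank covariance and N signal copies x_1, ..., x_N
U = np.linalg.qr(rng.standard_normal((n, n)))[0][:, :r]
Sigma = U @ np.diag(spectrum) @ U.T
X = (rng.standard_normal((N, r)) * np.sqrt(spectrum)) @ U.T        # shape (N, n)

# Sketching vectors a_1, ..., a_M ~ N(0, I)
A_vecs = rng.standard_normal((M, n))

# z_{i,m}: noisy projection averaged over L repetitions; b_m: average energy over copies
proj = X @ A_vecs.T                                                # (N, M), entries a_m^T x_i
noise_bar = np.sqrt(sigma2) * rng.standard_normal((N, M, L)).mean(axis=2)   # variance sigma2 / L
Z = proj + noise_bar
b = np.mean(Z ** 2, axis=0)

# Sanity check: b_m concentrates around a_m^T Sigma a_m (plus the small sigma2/L bias)
print(np.mean(np.abs(b - np.einsum('mi,ij,mj->m', A_vecs, Sigma, A_vecs))))
```

The pair (`A_vecs`, `b`) is exactly the kind of input a solver for (5), such as the CVXPY sketch given after (5), would take.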

Appendix B Background

Lemma 1.

[21] Let $A$ and $E$ be $n \times n$ symmetric matrices, with eigenvalues $\alpha_1 \ge \cdots \ge \alpha_n$ and $e_1 \ge \cdots \ge e_n$, respectively. Suppose $\tilde{A} = A + E$ has eigenvalues $\tilde\alpha_1 \ge \cdots \ge \tilde\alpha_n$. Then for each $i = 1, \ldots, n$,

$\alpha_i + e_n \le \tilde\alpha_i \le \alpha_i + e_1.$
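A quick numerical check of this perturbation bound (ours, with arbitrary matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
A = rng.standard_normal((n, n)); A = (A + A.T) / 2            # symmetric A
E = 0.1 * rng.standard_normal((n, n)); E = (E + E.T) / 2      # symmetric perturbation E

alpha = np.sort(np.linalg.eigvalsh(A))[::-1]                  # eigenvalues of A, descending
e = np.sort(np.linalg.eigvalsh(E))[::-1]                      # eigenvalues of E, descending
alpha_tilde = np.sort(np.linalg.eigvalsh(A + E))[::-1]        # eigenvalues of A + E, descending

# Weyl-type bounds: alpha_i + e_n <= alpha_tilde_i <= alpha_i + e_1 for every i
print(np.all(alpha + e[-1] <= alpha_tilde + 1e-10),
      np.all(alpha_tilde <= alpha + e[0] + 1e-10))
```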

Lemma 2.

[22] Denote by $\mathcal{A}$ the linear operator with $[\mathcal{A}(X)]_m = a_m^\top X a_m$ for $m = 1, \ldots, M$, where the $a_m$ are i.i.d. Gaussian sketching vectors. Suppose the measurement is contaminated by noise $\nu$, i.e., $b = \mathcal{A}(\Sigma) + \nu$, and assume $\|\nu\|_1 \le \eta$. Then with high probability the solution to the trace minimization (5) satisfies

for all $\Sigma$, provided that the number of sketching vectors $M$ is sufficiently large. The unspecified quantities are absolute constants, and the bound involves the best rank-$r$ approximation of $\Sigma$; when $\Sigma$ is exactly rank-$r$, the corresponding approximation error term vanishes.

Lemma 3.

[23] If , then for ,

where

Appendix C Proofs

Lemma 4.

Suppose the power of measurement in the th step is . If , .

Proof.

Let , and ,

Now that , we have .

Lemma 5.

Consider positive semi-definite matrix , for , if

we have

Proof.

Apparently, , , i.e.

Apply a decomposition for the positive semi-definite matrix . For , let , . If , ; otherwise, when , we have

Thus,

Therefore , i.e. , This shows that , which leads to

Lemma 6.

If , the true conditional covariance matrix of the signal conditioned upon the measurements is related to the previous iteration as follows:

Proof.

Let .

Note that , thus , therefore it has at most one nonzero eigenvalue,

Note that is symmetric and is positive semi-definite, we have Hence,

Therefore,

Lemma 7.

Denote , , , if

and

then with probability exceeding we have

Proof.

From Chebyshev’s inequality, we have that

and

Let When

with with Lemma 3, we have

when

we have