Performance Bounds for Sparse Parametric Covariance Estimation in Gaussian Models

# Performance Bounds for Sparse Parametric Covariance Estimation in Gaussian Models

## Abstract

We consider estimation of a sparse parameter vector that determines the covariance matrix of a Gaussian random vector via a sparse expansion into known “basis matrices.” Using the theory of reproducing kernel Hilbert spaces, we derive lower bounds on the variance of estimators with a given mean function. This includes unbiased estimation as a special case. We also present a numerical comparison of our lower bounds with the variance of two standard estimators (hard-thresholding estimator and maximum likelihood estimator).

\ninept\name

Alexander Jung, Sebastian Schmutzhard, Franz Hlawatsch, and Alfred O. Hero III1 \addressInstitute of Telecommunications, Vienna University of Technology, Austria; {ajung,fhlawats}@nt.tuwien.ac.at
NuHAG, Faculty of Mathematics, University of Vienna, Austria; sebastian.schmutzhard@univie.ac.at
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA; hero@eecs.umich.edu

{keywords}

Sparsity, sparse covariance estimation, variance bound, reproducing kernel Hilbert space, RKHS.

## 1 Introduction

We consider a Gaussian signal vector , embedded in white Gaussian noise . The observed vector is

 y=s+n,\vspace−.5mm (1)

where and are independent and the signal mean and noise variance are known. In what follows, we assume since a nonzero can always be subtracted from . The signal covariance matrix is unknown; we will parameterize it according to

 (2)

with unknown nonrandom coefficients and known positive semidefinite “basis matrices” . Thus, estimation of the signal covariance matrix reduces to estimation of the coefficient vector .

Our central assumption is that is -sparse, i.e., at most coefficients are nonzero. We can formulate this as

 (3)

The sparsity degree is supposed known; however, the set of positions of the nonzero entries of (denoted by ; note that ) is unknown. Typically, . We will refer to (1)–(3) as the sparse covariance model (SCM). The SCM and estimation of are relevant, e.g., in time-frequency (TF) analysis [1, 2], where the basis matrices correspond to disjoint TF regions and represents the mean signal power in the th TF region. An application is cognitive radio scene analysis [3].

The problem we will study is estimation of from , where is a known function. This includes estimation of and, less trivially, of a linear combination of the . In the TF application mentioned above, the latter case corresponds to a linear combination of the mean signal powers in the various TF regions.

In this paper, building on [4, 5], we use the theory of reproducing kernel Hilbert spaces (RKHS) to derive lower bounds on the variance of estimators of . The estimators are required to have a prescribed differentiable mean function; this includes the case of unbiased estimation. They are allowed to exploit the known sparsity of . The RKHS framework has been previously proposed for a fundamentally different problem of sparsity-exploiting estimation in [6].

Sparsity-exploiting estimation of and of was considered recently in [7] and in [8], respectively. In both cases, the sparsity assumption was placed on , which corresponds to a sparse graphical model for . Our SCM approach (2), (3) is clearly different: while the coefficient vector is assumed sparse, the matrices or need not be sparse.

This paper is organized as follows. In Section 2, we review minimum-variance estimation and the RKHS framework. In Section 3, we use RKHS theory to derive lower variance bounds for the SCM. The special case of unbiased estimation is considered in Section 4. Finally, Section 5 presents a numerical comparison of our bounds with the variance of two established estimation schemes.

## 2 RKHS formulation of minimum-variance estimation

### 2.1 Minimum-Variance Estimation

The estimation error incurred by an estimator of can be quantified by the mean squared error (MSE) , where the notation indicates that the expectation is taken with respect to the pdf parameterized by . According to our assumptions in Section 1,

 f(y;x)=exp(−12yT~C−1(x)y)[(2π)Mdet{~C(x)}]1/2,with~C(x)≜C(x)+σ2I.

Let and denote the th entries of and , respectively. We have , where denotes the th component MSE. For our scope, minimization of with respect to is equivalent to separate minimization of each component MSE with respect to . We furthermore have

 ε(^zk(⋅),x)=b2(^zk(⋅),x)+v(^zk(⋅),x), (5)

with the component bias and the component variance . A common approach to defining a “locally optimal” estimator is to require for all , with a given bias function , and look for estimators that minimize the variance at a given parameter vector . It follows from (5) that once the bias is fixed, minimizing is equivalent to minimizing . Furthermore, fixing the bias is equivalent to fixing the mean, i.e., requiring that for all , where .

In what follows, we consider a fixed component and drop the subscript for better readability. Furthermore, we consider a given mean function (short for ) and a given nominal parameter vector . We are interested in the minimum variance at achievable by estimators (short for ) that have mean function for all . In order to derive a lower bound on this achievable variance, let us consider some subset . We denote by the set of all scalar estimators whose mean equals for all (however, not necessarily for all ) and whose variance at is finite, i.e.,

If is nonempty, we consider the minimum variance achievable at the given parameter vector by estimators :

 LDγ(x0)≜min^z(⋅)∈BDγ(x0)v(^z(⋅),x0).\vspace−.5mm (6)

The use of (rather than ) in (6) is justified by the fact that the existence of a finite minimum can always be guaranteed by a proper choice of ; a sufficient condition will be provided in Section 2.2.

Because , is a lower bound on the variance at of any estimator whose mean is for all (and not just for all ), i.e.,

 LDγ(x0)≤v(^z(⋅),x0),for any ^z(⋅) such that Ex{^z(y)}=γ(x)∀x∈XS,+. (7)

### 2.2 RKHS Formulation

An inner product of two real random variables , can be defined as , with induced norm . Note the dependence on . One can show that (6) can be rewritten formally as the following constrained norm-minimization problem:

 LDγ(x0) =min^z(⋅)∥^z∥2RV−γ2(x0) subject to⟨^z,ρx⟩RV=γ(x)∀x∈D. (8)

Furthermore, if is nonempty, the existence of a finite minimum in (6), (8) can be guaranteed by choosing such that [4, 5]

 (9)

where

 ρx(y)Ê≜f(y;x)f(y;x0).\vspace1mm (10)

According to [4], the solutions of (8) can be described using an RKHS with kernel given by

 (11)

Note that and depend on . Inserting (10) and (LABEL:equ_param_pdf) into (11) yields the expression

 R(x1,x2) ⋅(~C−1(x1)+~C−1(x2)−~C−1(x0))}]−1/2,

where as before . The RKHS is a Hilbert space of functions that is defined as the closure of the linear span of the set of functions This closure is taken with respect to the topology that is given by the inner product defined via the reproducing property [9]

The induced norm is .

It can be shown [9, 4] that if satisfies (9), then is necessary and sufficient for to be nonempty and the minimum value in (6), (8) to exist and be given by

 LDγ(x0)=∥γ∥2H(R)−γ2(x0).\vspace−.7mm (13)

## 3 Lower Bounds on the Estimator Variance

According to (13), any lower bound on entails a lower bound on . For mathematical tractability, we hereafter assume that the basis matrices in (2) are projection matrices on orthogonal subspaces of . Thus, they can be written as

 (14)

where is an orthonormal basis for and the sets are disjoint, so that they span orthogonal subspaces of . We note that (2) and (14) correspond to a latent variable model with , where the are independent zero-mean Gaussian with variance for all , i.e., . This is similar to the latent variable model used in probabilistic principal component analysis [10] except that our “factors” are fixed. With (14), the kernel expression in (LABEL:equ_kernel_SCM) simplifies to

where, e.g., denotes the th entry of . We will refer to the SCM with basis matrices of the form (14) as the sparse diagonalizable covariance model (SDCM).2 It can be shown that, within the SDCM, a sufficient condition for (9)—and, thus, for the existence of a minimum in (6), (8)—is for all . Therefore, we choose our domain as

 D={x∈XS,+∣∣xk<2x0,k+σ2∀k∈{1,…,N}}.

Note that depends on .

We will now derive a lower bound on for the SDCM. Let us assume for the moment that . Consider functions , , with and , which are orthogonal, i.e., if . Let denote the subspace of spanned by the , and the orthogonal projection operator on . Clearly, a lower bound on is given by

 ∥PVγ∥2H(R)≤∥γ∥2H(R). (15)

This lower bound can be expressed as

 ∥PVγ∥2H(R)=L∑l=1|⟨γ,vl⟩H(R)|2∥vl∥2H(R). (16)

A convenient construction of functions is via partial derivatives of with respect to [4]. Consider an index set containing exactly indices from , i.e., and . Furthermore let be different multi-indices satisfying . We then define

 vl(x)≜∂plR(x,x2)∂xpl2∣∣∣x2=xK0,l=1,…,L, (17)

where and is obtained from by zeroing all entries except those whose indices are in . It can be verified that the functions are orthogonal, i.e.,

 ⟨vl,vl′⟩H(R)=ql(x0)δl,l′Ê, (18)

where . Furthermore [4],

 (19)

Using (18) and (19) in (16), we obtain

 ∥PVγ∥2H(R)=L∑l=11ql(x0)∣∣∣∂plγ(x)∂xpl∣∣∣x=xK0∣∣∣2. (20)

Finally, combining (7), (13), (15), and (20), we arrive at the following bound. (Hereafter, we again explicitly indicate the index .)

###### Theorem 3.1.

For the SDCM, let be any estimator of whose mean equals for all and whose variance at a fixed is finite. Then, this variance satisfies

 v(^zk(⋅),x0)≥L∑l=11ql(x0)∣∣∣∂plγk(x)∂xpl∣∣∣x=xK0∣∣∣2−γ2k(x0), (21)

for any choice of different such that , where is an arbitrary set of different indices. The lower bound (21) is achieved by an estimator if and only if there are nonrandom coefficients such that

 ^zk(y)=L∑l=1al∂plρx(y)∂xpl∣∣∣x=xK0

with the random variables being defined in (10).

Note that the bound in (21) depends on only via a finite number of partial derivatives of at . Thus, it only depends on the local behavior of the prescribed mean or bias. We furthermore note that Theorem 3.1 does not mention the condition we used in its derivation. This is no problem because it can be shown [4] that if , there exists no estimator that has mean for all and finite variance at .

## 4 Special Case: Unbiased Estimation

In this section, we evaluate the bound (21) for the important special case of unbiased estimation of , i.e., for and or equivalently . To obtain a simple expression, we use and particular choices of and (). Specifically, using , where consists of the indices of the largest entries of the vector that is obtained from by zeroing the th entry, and and , where denotes the th column of the identity matrix, the following variance bound is obtained from Theorem 3.1.

###### Corollary 4.1.

For the SDCM, let be any estimator of that is unbiased (i.e., ) for all and whose variance at a fixed is finite. Then, this variance satisfies

 v(^xk(⋅),x0)

where , denote the value and index respectively of the -largest entry of .

The lower bound (LABEL:equ_corr_lower_bound) can be achieved at least in the following two cases: (i) if , and (ii) for any if (note that this condition implies ). In both cases, the estimator given by

 ^xk(y)=βk(y)−σ2,% withβk(y)≜1rkrk∑i=1(uTmk,iy)2,

is unbiased and its variance achieves the bound (LABEL:equ_corr_lower_bound). This estimator does not use the sparsity information and does not depend on .

Let us define a “signal-to-noise ratio” (SNR) quantity as . For , the lower bound (LABEL:equ_corr_lower_bound) is approximately for any , which does not depend on and moreover equals the variance of the unbiased estimator (LABEL:equ_tight-est). Since that estimator does not exploit any sparsity information, Corollary 4.1 suggests that, in the low-SNR regime, unbiased estimators cannot exploit the prior information that is -sparse. However, in the high-SNR regime (), (LABEL:equ_corr_lower_bound) becomes for and for , which can be shown to equal the variance of the oracle estimator that knows (this oracle estimator yields for all ). The transition of the lower bound (LABEL:equ_corr_lower_bound) from the low-SNR regime to the high-SNR regime has a polynomial characteristic; it is thus much slower than the exponential transition of an analogous lower bound recently derived in [6] for the sparse linear model. This slow transition suggests that the optimal estimator for low SNR—which ignores the sparsity information—will also be nearly optimal over a relatively wide SNR range. This further suggests that, for covariance estimation based on the SDCM, prior information of sparsity is not as helpful as for estimating the mean of a Gaussian random vector based on the sparse linear model [6].

In the special case where and , let and denote, respectively, the value and index of the single nonzero entry of . Consider the estimator given componentwise by

 ^x(x0)k(y)={βk(y)−σ2,k=j0α(y;x0)(βk(y)−σ2),k≠j0, (24)

where with and . One can show using RKHS theory that is unbiased and has the minimum variance achievable by unbiased estimators at any with . Note that this estimator depends explicitly on the assumed , at which it achieves minimum variance; its performance may be poor when the true parameter vector is different from .

## 5 Numerical Results

We compare the lower bound (21) for with the variance of two standard estimators. The first is an ad-hoc adaptation of the hard-thresholding (HT) estimator [11] to SDCM-based covariance estimation. It is defined componentwise as (cf. (LABEL:equ_tight-est))

 ^xk,HT(y)≜1rkrk∑i=1φ2τ(uTmk,iy)−σ2,

where denotes the hard-thresholding function with threshold , i.e., is for and else. The second standard method is the maximum likelihood (ML) estimator

 ^xML(y)≜argmaxx′∈XS,+f(y;x′).\vspace−1mm

For the SDCM, one can show that

where consists of the indices for which (with ) is largest, and consists of all indices for which .

For a numerical evaluation, we considered the SDCM with , , , and . We generated parameter vectors with and different . In Fig. 1, we show the variance at , (computed by means of numerical integration), for the HT estimator using various choices of and for the ML estimator. The variance is plotted versus . Along with each variance curve, we display a corresponding lower bound that was calculated by evaluating (21) for each , using for the mean function of the respective estimator (HT or ML), and summing over all . (The mean functions of the HT and ML estimators were computed by means of numerical integration.) In evaluating (21), we used partial derivatives of order at most in (17), and we chose for the evaluation of the lower bound , , , and . In Fig. 1, all variances and bounds are normalized by , which is the variance of the oracle estimator knowing .

It can be seen from Fig. 1 that in the high-SNR regime, for both estimators, the gap between the variance and the corresponding lower bound is quite small. This indicates that the performance of both estimators is nearly optimal. However, in the low-SNR regime, the variances of the estimators tend to be significantly higher than the bounds. This means that there may be estimators with the same bias and mean function as that of the HT or ML estimator but a lower variance. However, the actual existence of such estimators is not shown by our analysis.

## 6 Conclusion

We considered estimation of (a function of) a sparse vector that determines the covariance matrix of a Gaussian random vector via a parametric covariance model. Using RKHS theory, we derived lower bounds on the estimator variance for a prescribed bias and mean function. For the important special case of unbiased estimators of , we found that the transition of our bounds from low to high SNR is significantly slower than that of analogous bounds for the sparse linear model [6]. This suggests that the prior information of sparsity is not as helpful as for the sparse linear model. Numerical results showed that for low SNR, the variance of two standard estimators (hard-thresholding estimator and maximum likelihood estimator) is significantly higher than our bounds. Hence, there might exist estimators that have the same bias and mean function as these standard estimators but a smaller variance.

### Footnotes

1. thanks: This work was supported by the FWF under Grants S10602-N13 and S10603-N13 within the National Research Network SISE, and by the WWTF under Grant MA 07-004 (SPORTS).
2. Indeed, for the SDCM, the covariance matrix can be diagonalized by a signal transformation , with a unitary matrix that does not depend on the true parameter vector .

### References

1. P. Flandrin, Time-Frequency/Time-Scale Analysis.    San Diego (CA): Academic Press, 1999.
2. F. Hlawatsch, Time-Frequency Analysis and Synthesis of Linear Signal Spaces: Time-Frequency Filters, Signal Detection and Estimation, and Range-Doppler Estimation.    Boston (MA): Kluwer, 1998.
3. S. Haykin, “Cognitive radio: Brain-empowered wireless communication,” IEEE J. Sel. Areas Comm., vol. 23, pp. 201–220, Feb. 2005.
4. E. Parzen, “Statistical inference on time series by Hilbert space methods, I.” Appl. Math. Stat. Lab., Stanford University, Stanford, CA, Tech. Rep. 23, Jan. 1959.
5. D. D. Duttweiler and T. Kailath, “RKHS approach to detection and estimation problems – Part V: Parameter estimation,” IEEE Trans. Inf. Theory, vol. 19, no. 1, pp. 29–37, Jan. 1973.
6. S. Schmutzhard, A. Jung, F. Hlawatsch, Z. Ben-Haim, and Y. C. Eldar, “A lower bound on the estimator variance for the sparse linear model,” in Proc. 44th Asilomar Conf. Signals, Systems, Computers, Pacific Grove, CA, Nov. 2010.
7. P. Rütimann and P. Bühlmann, “High dimensional sparse covariance estimation via directed acyclic graphs,” Electron. J. Statist., vol. 3, pp. 1133–1160, 2009.
8. Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero III, “Covariance estimation in decomposable Gaussian graphical models,” IEEE Trans. Signal Processing, vol. 58, no. 3, pp. 1482–1492, March 2010.
9. N. Aronszajn, “Theory of reproducing kernels,” Trans. Am. Math. Soc., vol. 68, no. 3, pp. 337–404, May 1950.
10. M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. Roy. Stat. Soc. Ser. B, vol. 61, pp. 611–622, 1999.
11. D. L. Donoho and I. M. Johnstone, “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, vol. 81, pp. 425–455, 1994.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters