Sparse covariance matrix estimation in high-dimensional deconvolution


Denis Belomestny, Mathias Trabs, and Alexandre B. Tsybakov

Duisburg-Essen University, Faculty of Mathematics, Thea-Leymann-Str. 9, D-45127 Essen, Germany, and National Research University Higher School of Economics, Shabolovka 26, 119049 Moscow, Russia. E-mail: denis.belomestny@uni-due.de

Universität Hamburg, Faculty of Mathematics, Bundesstraße 55, 20146 Hamburg, Germany. E-mail: mathias.trabs@uni-hamburg.de

CREST, ENSAE, Université Paris-Saclay, 5, avenue Henry Le Chatelier, 91120 Palaiseau, France. E-mail: alexandre.tsybakov@ensae.fr
Abstract

We study the estimation of the covariance matrix $\Sigma$ of a $p$-dimensional normal random vector based on $n$ independent observations corrupted by additive noise. Only a general nonparametric assumption is imposed on the distribution of the noise without any sparsity constraint on its covariance matrix. In this high-dimensional semiparametric deconvolution problem, we propose spectral thresholding estimators that are adaptive to the sparsity of $\Sigma$. We establish an oracle inequality for these estimators under model misspecification and derive non-asymptotic minimax convergence rates that are shown to be logarithmic in $n$. We also discuss the estimation of low-rank matrices based on indirect observations as well as the generalization to elliptical distributions. The finite sample performance of the threshold estimators is illustrated in a numerical example.

Running title: Covariance estimation in high-dimensional deconvolution

Acknowledgements: D.B. acknowledges the financial support from the Russian Academic Excellence Project “5-100” and from Deutsche Forschungsgemeinschaft (DFG) through the SFB 823 “Statistical modelling of nonlinear dynamic processes”. M.T. gratefully acknowledges the financial support by the DFG research fellowship TR 1349/1-1. A.B.T.’s research is supported by GENES and by the French National Research Agency (ANR) under the grants IPANEMA (ANR-13-BSH1-0004-02) and Labex Ecodec (ANR-11-LABEX-0047). This work was started while M.T. was affiliated with the Université Paris-Dauphine.

MSC2010 subject classifications: Primary 62H12; secondary 62F12, 62G05.

Keywords: thresholding, minimax convergence rates, Fourier methods, severely ill-posed inverse problem.

1 Introduction

One of the fundamental problems of multivariate data analysis is to estimate the covariance matrix $\Sigma$ of a random vector based on independent and identically distributed (i.i.d.) realizations of that vector. An important feature of data sets in modern applications is high dimensionality. Since it is well known that classical procedures fail if the dimension is large, various novel methods of high-dimensional matrix estimation have been developed in the last decade. However, an important question has not yet been settled: How can $\Sigma$ be estimated in a high-dimensional regime if the observations are corrupted by noise?

Let be i.i.d. random vectors with multivariate normal distribution . The maximum likelihood estimator of is the sample covariance estimator

The estimation error of explodes for large . To overcome this problem, sparsity assumptions can be imposed on , reducing the effective number of parameters. The first rigorous studies of this idea go back to Bickel and Levina [3, 4] and El Karoui [20], who assumed that most entries of are zero or very small. This allows for the construction of banding, tapering and thresholding estimators based on , for which the dimension can grow exponentially in . Subsequently, a rich theory has been developed in this direction, including Lam and Fan [32] who proposed a penalized pseudo-likelihood approach, Cai et al. [11] who studied minimax optimal rates, Cai and Zhou [12] studying the loss, as well as Rothman et al. [44] and Cai and Liu [8] for more general threshold procedures and adaptation, to mention only the papers most related to the present contribution. For current reviews on the theory of large covariance estimation, we refer to [9, 22]. In a direction related to noisy observations, covariance estimation in the presence of missing data has recently been investigated by Lounici [35] as well as by Cai and Zhang [10].

Almost all estimators in the aforementioned results build on the sample covariance estimator . In this paper, we assume that only the noisy observations

are available, where the errors are i.i.d. random vectors in independent of . Then the sample covariance estimator is biased:

where is the covariance matrix of the errors. Assuming known to correct the bias is not very realistic. Moreover, for heavy-tailed errors that do not have finite second moments, is not defined and the argument based on makes no sense. Several questions arising in this context will be addressed below (a small simulation illustrating this bias follows the list):

  1. How much information on the distribution of do we need to consistently estimate ?

  2. Do we need finite second moments of and/or sparsity restrictions on to estimate ?

  3. What is the minimax optimal rate of estimating based on noisy observations?
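The bias of the sample covariance under additive noise is easy to see numerically. The following sketch is purely illustrative: the dimensions, the diagonal target matrix and the Student-t noise law are our choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5                                    # illustrative sample size and dimension
Sigma = np.diag([2.0, 1.5, 1.0, 1.0, 0.5])        # a (here diagonal) sparse target matrix

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # Gaussian signal
eps = rng.standard_t(df=3, size=(n, p))                    # heavy-tailed noise, Var = 3
Y = X + eps                                                # what we actually observe

S_clean = np.cov(X, rowvar=False)   # consistent for Sigma
S_noisy = np.cov(Y, rowvar=False)   # biased: estimates Sigma + Cov(noise)

print(np.round(np.diag(S_clean), 2))
print(np.round(np.diag(S_noisy), 2))  # each diagonal entry inflated by roughly 3
```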

If the covariance matrix of the errors exists and is known, the problem does not differ from the direct observation case, since can be simply subtracted from . If can be estimated, for instance from a separate sample of the error distribution or from repeated measurements, we can proceed similarly. However, in the latter case, we need to assume that is sparse, since otherwise we cannot find a good estimator for large dimensions. Reducing our knowledge about further, we may only assume that the distribution of belongs to a given nonparametric class. This leads to a high-dimensional deconvolution model. The difference from standard deconvolution problems is that the density of ’s is a parametric object known up to a high-dimensional matrix parameter . A related model in the context of stochastic processes has been recently studied by Belomestny and Trabs [2]. Obviously, we need some assumption on the distribution of the errors since otherwise is not identifiable, as, for example, in the case of normally distributed . It turns out that we do not need a sparse covariance structure for the error distribution and we can allow for heavy-tailed errors without any moments.

From the deconvolution point of view, it might seem surprising that and thus the distribution of can be estimated consistently without knowing or estimating the distribution of the errors , but as we will show it is possible. The price to pay for this lack of information is in the convergence rates, which turn out to be very slow: logarithmic in the sample size. In pioneering works on the one-dimensional case, Matias [38] and Butucea and Matias [5] constructed a variance estimator in the deconvolution model with a logarithmic convergence rate and a corresponding lower bound. In this paper, we provide a general multidimensional analysis of the minimax rates on the class of sparse covariance matrices.

To replace the sample covariance matrix by a deconvolution counterpart, we use some ideas from the literature on density deconvolution. Starting with Carroll and Hall [13] and Fan [21], the deconvolution problem has been extensively studied. In particular, unknown (but inferable) error distributions have been analysed by Neumann [40], Delaigle et al. [17], Johannes [29] and Delaigle and Hall [16], among others. For adaptive estimation with unknown error distribution we refer to Comte and Lacour [14], Kappus and Mabon [30], Dattner et al. [15] and references therein. Almost all contributions to the deconvolution literature are restricted to a univariate model. Hence, our study contributes to the deconvolution theory by treating the multivariate case; in particular, our techniques for the lower bounds might be of interest. To our knowledge, only Masry [37], Eckle et al. [19], and Lepski and Willer [33, 34] have studied the setting of multivariate deconvolution. They deal with a different problem, namely that of nonparametric estimation of the density of or its geometric features when the distribution of is known.

Applying a spectral approach, we construct an estimator for the covariance matrix assuming that are normally distributed and that the characteristic function of the distribution of decays slower than the Gaussian characteristic function. A similar idea in a one-dimensional deconvolution problem has been developed by Butucea et al. [6]. The assumption as implies identifiability of and allows us to construct an estimator , which is consistent in the maximal entry norm. Based on , we then construct hard and soft thresholding estimators and , respectively, for sparse matrices. The sparsity is described by an upper bound on the -norm, , of entries of . We establish sparsity oracle inequalities for and when the estimation error is measured in the Frobenius norm. This choice of the norm is naturally related to the distance between two multivariate normal distributions. The oracle bounds reveal that the thresholding estimators adapt to the unknown sparsity . For the soft thresholding estimator we present an oracle inequality, which shows that the estimator adapts also to approximate sparsity.

Assuming that the characteristic function of satisfies for large and some , we prove the following upper bound on the estimation error in the Frobenius norm:

(1)

for some constant and with high probability. The dependence of this bound on the sparsity is the same as found by Bickel and Levina [3] for the case of direct observations; furthermore, the well-known quotient drives the rate. However, the severely ill-posed nature of the inverse problem causes the logarithmic dependence of the rate on . We also see that the estimation problem becomes harder as gets closer to 2, where it is more difficult to distinguish the signal from the noise. Furthermore, we establish a lower bound showing that the rate in (1) cannot be improved in a minimax sense for . Let us emphasise that our observations are by definition not normally distributed. Therefore, the proof of the lower bound differs considerably from the usual lower bounds in high-dimensional statistics, which rely on Gaussian models.

This paper is organized as follows. In Section 2 we construct and analyze the spectral covariance matrix estimator. In Section 3 the resulting thresholding procedures are defined and analyzed. In Section 4 we investigate upper and lower bounds on the estimation error. In Section 5 some extensions of our approach are discussed including the estimation of low-rank matrices based on indirect observations as well as the generalization to elliptical distributions. The numerical performance of the procedure is illustrated in Section 6. Longer and more technical proofs are postponed to Section 7 and to the appendix.

Notation: For any and , the -norm of is denoted by and we write for brevity . For the Euclidean scalar product is written as . We denote by the identity matrix, and by the indicator function. For two matrices the Frobenius scalar product is given by , inducing the Frobenius norm . The nuclear norm is denoted by and the spectral norm by , where stands for the maximal eigenvalue. For and we denote by the -norm of the entries of the matrix if and the number of non-zero entries for . We write or if the matrix is positive definite or semi-definite. We denote by the joint distribution of when the covariance matrix of is and the characteristic function of the noise is . We will write for brevity if there is no ambiguity.

2 Spectral covariance estimators

Let denote the characteristic function of the error distribution:

Then the characteristic function of is given by

Here and throughout we assume that . This assumption is standard in the literature on deconvolution. Allowing for some zeros of has been studied in [39, 18]. Note that our estimation procedure defined below does not rely on all in , but uses only with a certain radius .

The canonical estimator for the characteristic function is the empirical characteristic function
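In code, the empirical characteristic function is simply an average of complex exponentials over the sample; the following helper (the name and the vectorization are ours) evaluates it at a batch of frequency vectors.

```python
import numpy as np

def empirical_cf(Y, U):
    """Empirical characteristic function of the rows of Y.

    Y : (n, p) array of observations; U : (m, p) array of frequency vectors.
    Returns the (m,) complex array with entries (1/n) * sum_j exp(i <u_k, Y_j>).
    """
    return np.mean(np.exp(1j * U @ Y.T), axis=1)
```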

Arguing similarly to Belomestny and Trabs [2], consider the identity

(2)

Both sides are normalized by being the order of the leading term . While the left-hand side of (2) is a statistic based on the observations , the first term on the right-hand side encodes the parameter of interest, namely the covariance matrix . The second term is a deterministic error due to the unknown distribution of . If , i.e., the error distribution is less smooth than the normal distribution, the deterministic error vanishes as . The third term in (2) is a stochastic error term. Using the first order approximation we get

(3)

The latter expression resembles the estimation error in classical deconvolution problems. However, there is a difference since here in the denominator we have rather than the characteristic function of the distribution of errors. A similar structure was detected in the statistical analysis of Lévy processes observed at low frequency by Belomestny and Reiß [1]. Following [1], one can call this type of problems auto-deconvolution problems. Since , and we assume that , the stochastic error grows exponentially in . Thus, the estimation problem is severely ill-posed even in the one-dimensional case.

These remarks lead us to the conclusion that can be estimated consistently without any particular knowledge of the error distribution as soon as , and the spectral radius in (2) is chosen to achieve a trade-off between the stochastic and deterministic errors. To specify more precisely the condition , it is convenient to consider, for any and , the following nonparametric class of functions :

Note that since . Therefore, the condition that determines the class can be written as the lower bound . If the characteristic function of belongs to , the decay for some of the characteristic exponent allows for separating the normal distribution of from the error distribution for large . The decay rate determines the ill-posedness of the estimation problem. Notably, we require neither sparsity restrictions on the joint distribution of nor moment conditions on these random variables.

A typical representative in the class is the characteristic function of a vector of independent -stable random variables. In the case of identically distributed marginals, it has the form for some parameter . A related example with correlated coefficients is a -dimensional stable distribution with characteristic function (note that ). Recalling that stable distributions can be characterized as limit distributions of normalized sums of independent random variables and interpreting the noise as an accumulation of many small measurement errors suggests that these examples are indeed quite natural.
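As a concrete illustration of such error laws, symmetric alpha-stable noise with i.i.d. marginals can be simulated with scipy; for a stability index below 2 the characteristic function decays strictly more slowly than a Gaussian one, while second moments do not exist. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(1)
n, p, alpha = 1000, 5, 1.5   # alpha < 2: heavier tails than Gaussian, infinite variance

# noise vector with independent symmetric alpha-stable coordinates (beta = 0)
eps = levy_stable.rvs(alpha, 0.0, size=(n, p), random_state=rng)

# the running sample variance does not stabilize, reflecting the missing second moment
print(np.var(eps[: n // 2]), np.var(eps))
```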

If , the deterministic error term in (2) is small for large values of . We will choose in (2) in the form where is large, and are -dimensional unit vectors defined by

(4)

Using the symmetry of , we obtain

for any with . Motivated by (2) applied to for some spectral radius , we introduce the spectral covariance estimator:

(5)

Equivalently, we can write for any with . Since concentrates around , cf. Lemma 13, we have with high probability if .
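To make the construction concrete, the following numerical sketch implements a spectral covariance estimator in the spirit of (4)-(5). The particular normalization $-2\log|\varphi_n(Ue)|/U^2$ for a unit direction $e$, and the use of the directions $e_k$ and $(e_k+e_l)/\sqrt{2}$ to recover diagonal and off-diagonal entries, are our reading of the construction, written down for illustration only; the exact definition is the one given in (4)-(5).

```python
import numpy as np

def spectral_cov(Y, U_rad):
    """Illustrative spectral covariance estimator.

    For a unit direction e, -2/U^2 * log|phi_n(U*e)| estimates <e, Sigma e>
    up to the (small) contribution of the noise characteristic function;
    off-diagonal entries are then recovered by polarization.
    """
    n, p = Y.shape
    phi = lambda u: np.mean(np.exp(1j * Y @ u))                     # empirical char. fct.
    quad = lambda e: -2.0 / U_rad**2 * np.log(np.abs(phi(U_rad * e)))

    Sigma_hat = np.zeros((p, p))
    basis = np.eye(p)
    diag = np.array([quad(basis[k]) for k in range(p)])
    Sigma_hat[np.diag_indices(p)] = diag
    for k in range(p):
        for l in range(k + 1, p):
            e_kl = (basis[k] + basis[l]) / np.sqrt(2.0)             # unit direction
            # <e_kl, Sigma e_kl> = (Sigma_kk + Sigma_ll)/2 + Sigma_kl
            Sigma_hat[k, l] = Sigma_hat[l, k] = quad(e_kl) - 0.5 * (diag[k] + diag[l])
    return Sigma_hat
```

Increasing the spectral radius U_rad shrinks the deterministic error caused by the unknown noise but inflates the stochastic error, in line with the trade-off discussed above.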

The spectral covariance estimator can be viewed as a counterpart of the classical sample covariance matrix for the case of indirect observations. The entries of enjoy the following concentration property.

Theorem 1.

Assume that , and for some . Let and satisfy . Set

(6)

Then, for any ,

where .

Proof.

Set . Using (2) we obtain, for all ,

For the last summand in this display is bounded uniformly by on the class . This remark and Corollary 14 in Section 7.1 imply that

if the condition is satisfied for all . Note that for any , and any ,

Therefore, for and satisfying the conditions of the theorem,

A union bound concludes the proof. ∎

The first term in is an upper bound for the stochastic error. We recover the familiar factor which is due to a sub-Gaussian bound on the maximum of the entries . The term is an upper bound for appearing in the linearization (3). Note that for this bound can be written as for . This suggests the choice of spectral radius in the form for some sufficiently small constant . The second term in (6) bounds the deterministic error and determines the resulting rate , cf. Theorem 5.

3 Thresholding

Based on the spectral covariance estimator, we can now propose estimators of high-dimensional sparse covariance matrices. We consider the following sparsity classes of matrices:

(7)

where denotes the sparsity parameter and bounds the largest entry of . We also consider larger classes that differ from only in that the condition is dropped. Note that for the classes , since otherwise the condition does not hold. This restriction on does not apply to the classes , for which the unknown effective dimension of can be smaller than . However, for the classes , the overall model remains, in general, -dimensional since the distribution of the noise can be supported on the whole space .

The sparsity classes considered by Bickel and Levina [3] and in many subsequent papers are given by

for , and with the usual modification for . We have , so that our results can be used to obtain upper bounds on the risk for the classes .

Based on the spectral covariance estimator, we define the spectral hard thresholding estimator for as

(8)

for some threshold value . The following theorem gives an upper bound on the risk of this estimator in the Frobenius norm.
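In code, the hard thresholding step in (8) is elementary; the sketch below thresholds every entry of a pilot estimator (whether the diagonal is also thresholded is a convention we do not fix here). For instance, hard_threshold(spectral_cov(Y, U_rad), rho) combines it with the spectral estimator sketched earlier.

```python
import numpy as np

def hard_threshold(Sigma_hat, rho):
    """Entrywise hard thresholding: keep only entries with |value| > rho."""
    return Sigma_hat * (np.abs(Sigma_hat) > rho)
```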

Theorem 2.

Let , , and . Let be defined in (6) with parameters and satisfying . Then

provided that for , and for . Here, .

Proof.

First, consider the case and . In view of Theorem 1, the event is of probability at least for all . On we have the inclusion , so that . Therefore, on the event ,

Note that, again on , we have . Combining this with the last display implies the assertion of the theorem for .

Consider now the case and . We use the following elementary fact: If for some and , then (cf. [47]). Taking , , and , and using Theorem 1 we obtain that, on the event of probability at least ,

Thus, for any , with probability at least ,

Since all bounds hold uniformly in and , the theorem is proven. ∎

In the direct observation case where we have for all , so that the deterministic error term in (6) disappears. In this case, can be fixed and the threshold can be chosen as a multiple of , analogously to [3]. Together with the embedding , we recover Theorem 2 from Bickel and Levina [3]. In Section 4 we will discuss in detail the optimal choice of the spectral radius and the threshold in the presence of noise.

The spectral soft thresholding estimator is defined as

with some threshold . It is well known, cf., e.g. [47], that

(9)
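For completeness, an entrywise soft thresholding sketch, which (up to the constant convention in front of the penalty) is the closed-form solution of the penalized least squares problem in (9):

```python
import numpy as np

def soft_threshold(Sigma_hat, rho):
    """Entrywise soft thresholding: shrink every entry of Sigma_hat towards zero
    by rho; closed-form minimizer of
    S -> ||S - Sigma_hat||_F^2 / 2 + rho * sum_{k,l} |S_{kl}|."""
    return np.sign(Sigma_hat) * np.maximum(np.abs(Sigma_hat) - rho, 0.0)
```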

Adapting the proof of Theorem 2 in Rigollet and Tsybakov [42], we obtain the following oracle inequality, which is sharp for and loses a factor otherwise.

Theorem 3.

Assume that , and for some . Let where is defined in (6) with parameters and such that . Then,

(10)

with probability at least where . For any we have, with probability at least ,

(11)

where is a constant depending only on .

Proof.

Starting from the characterization (9), we use Theorem 2 by Koltchinskii et al. [31]. To this end, we write where are random variables with exponential concentration around zero due to Theorem 1. Observing is thus a sequence space model in dimension and a special case of the trace regression model considered in [31]. Namely, is the diagonal matrix with diagonal entries and are diagonalisations of the canonical basis in . In particular, Assumption 1 in [31] is satisfied for , i.e., where we use the notation of [31]. Note also that the rank of a diagonal matrix is equal to the number of its non-zero elements. Consequently, Theorem 2 in [31] yields with that

on the event that . To estimate the probability of , we apply Theorem 1. Inequality (11) follows from (10) using the same argument as in Corollary 2 of [42]. ∎

This theorem shows that the soft thresholding estimator allows for estimating matrices that are not exactly sparse but can be well approximated by a sparse matrix. Choosing in the oracle inequalities (10) and (11), we obtain the following corollary analogous to Theorem 2.

Corollary 4.

Let , , and . Let where is defined in (6) with parameters and such that . Then

where for , and for .

4 Minimax optimality

In this section, we study minimax optimal rates for the estimation of on the class . We first state an upper bound on the rate of convergence of the hard thresholding estimator in this high-dimensional semiparametric problem. It is an immediate consequence of Theorem 2. Due to Corollary 4, the result directly carries over to the soft thresholding estimator.

Theorem 5.

Let , , and . For , set

(12)

Let be large enough such that for some numerical constant . Then for any where is defined in (6) we have

(13)

for some numerical constants .

Proof.

It follows from the assumption on that . This and the definition of imply that . Therefore, we can apply Theorem 2, which yields the result since

It is interesting to compare Theorem 5 with the result of Butucea and Matias [5] corresponding to and establishing a logarithmic rate for estimation of the variance in the deconvolution model under exponential decay of the Fourier transform of . Butucea and Matias [5] have shown that, if , their estimator achieves asymptotically a mean squared error of the order . This coincides with the case and of the non-asymptotic bound in (13). A similar rate for has been obtained by Matias [38] under assumptions on the decay of the Laplace transform.

We now turn to the lower bound matching (13) for . Intuitively, the slow rate comes from the fact that the error distribution can mimic the Gaussian distribution up to some frequency in the Fourier domain. A rigorous application of this observation to the construction of lower bounds goes back to Jacod and Reiß [28], though in quite a different setting. For the multidimensional case that we consider here the issue becomes particularly challenging.

Theorem 6.

Let and assume that , , for some constants , and . Then, there are constants such that

where the infimum is taken over all estimators .

The proof of this theorem is postponed to Section 8. We use the method of reduction to testing of many hypotheses relying on a control of the -divergence between the corresponding distributions, cf. Theorem 2.6 in [46]. The present high-dimensional setting introduces some additional difficulties. When the dimension of the sample space is growing, an increasing number of derivatives of the characteristic functions has to be taken into account for the -bound. Achieving bounds of the correct order in causes difficulty when is arbitrarily large. We have circumvented this problem by introducing a block structure to define the hypotheses. The construction of the family of covariance matrices of used in the lower bounds relies on ideas from Rigollet and Tsybakov [42], while the error distributions are chosen as perturbed -stable distributions. To bound the -divergence, we need a lower bound on the probability density of . It is shown by Butucea and Tsybakov [7] that the tails of the density of a one-dimensional stable distribution are polynomially decreasing. We generalize this result to the multivariate case (cf. Lemma 15 below) using properties of infinitely divisible distributions.

We now give some comments on the lower bound of Theorem 6. Assuming of order means that we consider quite a sparse regime. We always have . Recall also that as the diagonal of the covariance matrix is included in the definition of for the class . An alternative strategy pursued in the literature is to estimate a correlation matrix, i.e., to assume that all diagonal entries are known and equal to one. However, this seems not very natural in the present noisy observation scheme. On the other hand, Theorem 6 shows that even in the sparse regime the estimation error tends to as for dimensions growing polynomially in . The logarithmic in rate reflects the fact that the present semiparametric problem is severely ill-posed.

Comparing the lower bound with the upper bound from Theorem 5, we see that they coincide if the dimension satisfies for some and some . Thus, we have established the minimax optimal rate under this condition. Note also that we only lose a factor of order for very large , for instance, if .

5 Discussion and extensions

5.1 The adaptivity issue

Since the threshold in Theorem 5 depends on unknown parameters , and , a natural question is whether it is possible to construct an adaptive procedure independent of these parameters that achieves the same rate. One possibility to explore consists in selecting in a data-driven way. Another option would be to construct estimators corresponding to values of , and on a grid, and then to aggregate them.

For direct observations an adaptive choice of the threshold, more precisely a cross-validation criterion, has been proposed by Bickel and Levina [3] and was further investigated by Cai and Liu [8]. For noisy observations that we consider here, the adaptation problem turns out to be more delicate since not only an optimal constant has to be selected but also the order of magnitude of depends on the unknown parameter .

Often an upper bound on the maximal entry of is known, so that one does not need to consider adaptation to . Ignoring the issue of unknown , the choice of the spectral radius of the order is universal, which reflects the fact that the estimation problem is severely ill-posed with dominating bias. Indeed, in Theorem 5 corresponds to undersmoothing such that the deterministic estimation error dominates the stochastic error without deteriorating the convergence rates. To construct an adaptive counterpart of , we need either an estimator of the error of an optimal procedure for estimating under the -loss or an estimator of the “regularity” . Therefore, extrapolating the argument of Low [36] to our setting, it seems plausible that an adaptive choice of cannot, in general, lead to the optimal rate. This does not exclude the possibility that optimal adaptive estimators can be constructed by other types of procedures, such as aggregation of estimators on a grid as mentioned above.

5.2 Low-rank covariance matrix

Alternatively to the above setting where the covariance matrix is sparse, we can consider a low-rank matrix . This is of particular interest in the context of factor models where, as discussed by Fan et al. [23, 24], an additional observation error should be taken into account. While [23, 24] estimate the covariance matrix of the noisy observations assuming that the errors have a sparse covariance structure, a spectral approach analogous to the one developed above allows for estimating directly the low-rank covariance matrix of without sparsity restrictions on the error distribution.

One such approach, which is quite natural at first sight, would be to use the spectral covariance estimator from (5) together with a nuclear norm penalization. The following oracle inequality is an easy consequence of Theorem 1 in Koltchinskii et al. [31].

Proposition 7.

Assume that is convex and let . On the event , the estimator satisfies

To use this proposition, we need to find a bound on the spectral norm that holds with high probability. The techniques from Cai et al. [11] designed for the case of direct observations allow us to obtain an upper bound on this quantity of order up to a factor logarithmic in . Thus, the convergence rate of this estimator is rather slow.
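When the penalized estimator of Proposition 7 is taken without additional constraints, it has a simple closed form: soft thresholding of the eigenvalues of the (symmetrized) spectral covariance estimator. The sketch below assumes the objective written in its docstring; the exact constant in front of the penalty and any constraint set are conventions we fix only for illustration.

```python
import numpy as np

def nuclear_norm_prox(Sigma_tilde, lam):
    """Minimizer of  ||S - Sigma_tilde||_F^2 + 2*lam*||S||_nuclear  over symmetric S:
    soft thresholding of the eigenvalues of the symmetrized input."""
    A = (Sigma_tilde + Sigma_tilde.T) / 2.0          # symmetrize the pilot estimator
    w, V = np.linalg.eigh(A)                          # eigenvalues / eigenvectors
    w_shrunk = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
    return (V * w_shrunk) @ V.T                       # V diag(w_shrunk) V^T
```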

Let us show now that another estimator can be constructed based on the approach from Belomestny and Trabs [2], which allows for a better dependence on . To this end, we write

For a weight function supported on the annulus and a spectral radius , we set . Motivated by (2), we define the weighted Lasso-type estimator