Rate-optimal estimation of p-dimensional linear functionals in a sparse Gaussian model

Abstract

We consider two problems of estimation in high-dimensional Gaussian models. The first problem is that of estimating a linear functional of the means of independent p-dimensional Gaussian vectors, under the assumption that most of these means are equal to zero. We show that, up to a logarithmic factor, the minimax rate of estimation in squared Euclidean norm is between and . Since the estimator that attains the upper bound is computationally demanding, we investigate suitable versions of group thresholding estimators. An interesting new phenomenon revealed by this investigation is that group thresholding leads to a substantial improvement in the rate as compared to element-wise thresholding. Thus, the rate of the group thresholding is , while the element-wise thresholding has an error of order . To the best of our knowledge, this is the first known setting in which leveraging the group structure leads to a polynomial improvement in the rate.

The second problem studied in this work is the estimation of the common p-dimensional mean of the inliers among independent Gaussian vectors. We show that there is a strong analogy between this problem and the first one. Exploiting it, we propose new strategies of robust estimation that are computationally tractable and have better rates of convergence than the other computationally tractable estimators studied in the literature that are robust to the presence of outliers in the data. However, this tractability comes at the cost of losing minimax rate optimality in some regimes.

Running title: Estimation of multidimensional linear functionals in Gaussian models

MSC subject classifications: Primary 62J05; secondary 62G05.

Keywords: Column-sparsity, Minimax estimation, Group-sparsity, Linear transformation, High-dimensional inference, Robust estimation.

1 Introduction

Linear functionals are of central interest in statistics. The problems of estimating a function at given points, predicting the value of a future observation, testing the validity of a hypothesis, and finding a dimension reduction subspace are all examples of statistical inference on linear functionals. The primary goal of this paper is to investigate the problem of estimating a particular form of linear functional, defined as the sum of the observed multidimensional signals. Although this problem is of independent interest on its own, one of our motivations for studying it is its tight relation with the problem of robust estimation.

Various aspects of the problem of estimating a linear functional of an unknown high-dimensional or even infinite-dimensional parameter have been studied in the literature, mostly focusing on the case of a functional taking real values (as opposed to the vector-valued functional considered in the present work). Early results for smooth functionals were obtained by Koshevnik and Levit (1977). Minimax estimation of linear functionals over various classes and models was thoroughly analyzed by Donoho and Liu (1987); Klemela and Tsybakov (2001); Efromovich and Low (1994); Golubev and Levit (2004); Cai and Low (2004, 2005); Laurent et al. (2008); Butucea and Comte (2009); Juditsky and Nemirovski (2009). There is also a vast literature on the problem of estimating quadratic functionals (Donoho and Nussbaum, 1990; Laurent and Massart, 2000; Cai and Low, 2006; Bickel and Ritov, 1988). Since estimators of (quadratic) functionals can often be used as test statistics, the problem of estimating functionals has close relations with the problem of testing; these relations were successfully exploited in (Comminges and Dalalyan, 2012, 2013; Collier and Dalalyan, 2015; Lepski et al., 1999). The problem of estimating nonsmooth functionals was also tackled in the literature, see (Cai and Low, 2011).

To complete this section, let us note that some statistical problems related to functionals of high-dimensional parameters under various types of sparsity constraints were recently addressed in several papers. The case of real-valued linear and quadratic functionals was studied by Collier et al. (2017) and Collier et al. (2016), focusing on the Gaussian sequence model. Verzelen and Gassiat (2016) analyzed the problem of signal-to-noise ratio estimation in the linear regression model under various assumptions on the design. In a companion paper to the present submission, Collier and Dalalyan (2018) considered the problem of estimating a vector-valued linear functional when the observations are drawn from a Poisson distribution. It turns out that the results established in the present work for the group (hard and soft) thresholding estimators are valid for the Poisson model as well, but this is not the case for the results on the greedy estimator studied in Section 2.1.

1.1 Organization

The rest of the paper is organized as follows. Section 2 is devoted to the problem of linear functional estimation. It contains the statements of the main results concerning the risk bounds of different relevant estimators and some lower bounds on the minimax risk. The problem of robust estimation is addressed in Section 3. We summarize our findings and describe some directions of future research in Section 4. The proofs of the main theorems are postponed to Section 5, whereas the proofs of technical lemmas are gathered in Section 6. Some well-known results frequently used in the present work are recalled in Section 7.

1.2 Notation

We denote by the set of integers . The -dimensional vectors containing only ones and only zeros are denoted by and , respectively. As usual, stands for the Euclidean norm of a vector . The identity matrix is denoted by . For every matrix and every , we denote by the submatrix of obtained by removing the columns with indices lying outside . The Frobenius norm of , denoted by , is defined by . We will use the notation for the linear functional equal to the sum of the columns of .

2 Estimation of a linear functional

We assume that we are given a matrix generated by the following model:

(1)

This means that the deterministic matrix is observed in Gaussian white noise of variance . Equivalently, the columns of satisfy

(2)

Our goal is to estimate the vector , where is the linear transformation defined by

(3)

Let us first explain why this is a nontrivial statistical problem, at least when both and are large. Indeed, the naive solution to the aforementioned problem consists in replacing, in (3), the unknown matrix by the noisy observation . This leads to the estimator , whose risk can be easily shown to be

(4)

When the matrix has at most nonzero columns with being much smaller than , it is possible to design estimators that perform much better than the naive estimator . Indeed, an oracle who knows the sparsity pattern may use the oracle-estimator which has a risk equal to . It is not difficult to show that there is no estimator having a smaller risk uniformly over all the matrices with a given sparsity pattern of cardinality . Thus, we have two benchmarks: the very slow rate attained by the naive estimator and the fast rate attained by the oracle-estimator that is unavailable in practice. The general question that we study in this work is the following: what is the best possible rate in the range that can be obtained by an estimator that does not rely on the knowledge of ?
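For illustration, here is a minimal NumPy sketch of the two benchmark estimators just discussed; the observations are assumed to be stored as the columns of a two-dimensional array, and the function names are ours, introduced only for this illustration.

    import numpy as np

    def naive_estimator(Y):
        # Plug-in estimator: replace the unknown matrix by the noisy
        # observation and sum all of its columns.
        return Y.sum(axis=1)

    def oracle_estimator(Y, S):
        # Oracle that knows the sparsity pattern: sum only the columns
        # whose indices belong to S.
        return Y[:, list(S)].sum(axis=1)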

In what follows, we denote by the set of all matrices with real entries having at most nonzero columns:

(5)

2.1 Greedy subset selection

Let us consider a greedy estimator that tries to successively recover various pieces of the sparsity pattern . We start by setting and . If is empty, then we set and terminate. Otherwise, i.e., when is not empty, we set and . In the next step, we define , and in the same way using as starting point instead of . We repeat this procedure until we get or . Then we set

(6)

The detailed pseudo-code for this algorithm is given in Algorithm 1 below.

input: matrix and noise variance , threshold .
output: vector .
initialization: and .
repeat
      Set .
      if  then
            
      else
            Set .
      end if
      Update . Update .
until  is empty or  is empty
return 
Algorithm 1: Greedy subset selection algorithm
Theorem 1.

Let be a prescribed tolerance level. The greedy subset selection estimator with satisfies

(7)

This result tells us that the worst-case rate of convergence of the GSS estimator over the class is . As a consequence, the minimax risk of estimating the functional over the aforementioned class is at most of order . As we will see below, this rate is optimal up to a logarithmic factor.

However, from a practical point of view, the GSS algorithm has limited applicability because of its high computational cost. It is therefore appealing to look for other estimators that can be computed efficiently, even though their estimation error does not decay at the optimal rate for every possible configuration of . Let us note here that, using standard tools, it is possible to establish an upper bound similar to (7) that holds in expectation.

2.2 Group hard thresholding estimator

A natural approach to the problem of estimating consists in keeping only the signals that have a large norm and computing the sum of these signals. This is equivalent to solving the following optimization problem

(8)

where is a tuning parameter. The estimator , hereafter referred to as group hard thresholding, minimizes the negative log-likelihood penalized by the number of non-zero columns in . One easily checks that the foregoing optimization problem can be solved explicitly and the resulting estimator is

(9)

Using the group hard thresholding estimator of and the method of substitution, we can estimate by

(10)

It is clear that this estimator is computationally far more attractive than the GSS estimator presented above. Indeed, the computation of the GHT estimator requires at most operations. However, as stated in the next theorem, this gain is achieved at the expense of a higher statistical error.
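Before quantifying this error, here is a minimal NumPy sketch of the group hard thresholding plug-in estimator described by (9) and (10); the threshold is passed as an argument, and the function name is ours.

    import numpy as np

    def group_hard_thresholding_estimator(Y, lam):
        # Compute the Euclidean norm of every observed column, keep only the
        # columns whose norm exceeds the threshold lam, and sum them to
        # estimate the linear functional.
        norms = np.linalg.norm(Y, axis=0)
        kept = norms > lam
        return Y[:, kept].sum(axis=1)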

Theorem 2.

Let be the estimator defined in (10) with the tuning parameter

(11)

There exists a universal constant such that, for every , it holds

(12)

Using the fact that , we infer from this theorem that the rate of the group hard thresholding for fixed is of order , up to a logarithmic factor. Moreover, the rate obtained in this theorem cannot be improved, up to logarithmic factors, as stated in the next theorem.

Theorem 3.

Denote by the estimator defined in (10) with a threshold . There are two universal constants and such that, for any and , the following lower bound holds

(13)

The proofs of these theorems are deferred to Section 5; let us now comment on the stated results. At first sight, the presence of the sparsity in the definition of the threshold in Theorem 2 might seem problematic, since this quantity is unknown in most practical situations. However, one can easily modify the claim of Theorem 2 by replacing and , respectively, by and , both in the definition of and in the subsequent risk bound.

A second remark concerns the rate optimality. If we neglect the logarithmic factors in this discussion, the rate of the GHT estimator is shown to be at most of order . This coincides with the optimal rate (and with that of the GSS estimator) when , and has an extra factor in the worst case . When there is a limit on the computational budget, that is, when attention is restricted to estimators computable in polynomial (in ) time, we do not know whether such a deterioration of the risk can be avoided.

An inspection of the proof of Theorem 2 shows that if all the nonzero signals are large enough, that is, when for some constant , the extra factor disappears and the GHT achieves the optimal rate. Put differently, the signals at which the GHT estimator fails to achieve the optimal rate are those having a Euclidean norm of order . This is closely related to the minimax rate of separation in hypothesis testing. It is known that the separation rate for testing against , when one observes , is of order .

Our last remark on Theorem 2 concerns the relation with element-wise hard thresholding. The idea is the following: any column-sparse matrix is also sparse in the most common sense of sparsity, that is, the number of nonzero entries of the matrix is only a small fraction of the total number of entries. Therefore, one can estimate the entries of by thresholding those of and then estimate by the method of substitution. The statistical complexity of this estimator is quantified in the next theorem, the proof of which is similar to that of the corresponding theorem in (Collier et al., 2017).

Theorem 4.

Let be the element-wise hard thresholding estimator defined by for . If the threshold is chosen so that , then

(14)

where is a universal constant.

A striking feature of the problem of linear functional estimation uncovered by Theorem 2 and Theorem 4 is that exploiting the group structure leads to an improvement of the risk which may attain a factor (for the squared Euclidean norm). To the best of our knowledge, this is the first framework in which grouping is proved to have such a strong impact. This can be compared to the problem of estimating the matrix itself under the same sparsity assumptions. Provable guarantees in that setting show only a logarithmic improvement due to the use of the sparsity structure (Lounici et al., 2011).
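For comparison with the group thresholding sketch given earlier, here is the element-wise hard thresholding plug-in estimator of Theorem 4, written in the same conventions; the threshold is again left as an input.

    import numpy as np

    def elementwise_hard_thresholding_estimator(Y, lam):
        # Zero out every entry of Y whose absolute value is below the
        # threshold lam, then sum the columns of the thresholded matrix.
        Y_thr = np.where(np.abs(Y) > lam, Y, 0.0)
        return Y_thr.sum(axis=1)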

2.3 Soft-thresholding estimator

A natural question is whether the results obtained above for the group hard thresholding can be carried over to a suitable version of the soft thresholding estimator. Such an extension could have two potential benefits. First, the soft thresholding is defined as the solution to a convex optimization problem, whereas the hard thresholding minimizes a nonconvex cost function. This difference makes the soft thresholding method better suited to various statistical problems. The simplest example is the problem of linear regression: the extension of the soft thresholding estimator to the case of a non-orthogonal design is the lasso, which can be computed even when the dimension is very large. In the same problem, the extension of the hard thresholding is the BIC-type estimator, whose computation is known to be prohibitively complex when the dimension is large.

A second reason motivating our interest in the soft thresholding is its smooth dependence on the data. This smoothness implies that the estimator is less sensitive to the changes in the data than the hard thresholding. Furthermore, it makes it possible to design a SURE-type algorithm for defining an unbiased estimator of the risk and, eventually, selecting the tuning parameter in a data-driven way.

In the model under consideration, the soft thresholding estimator can be defined as the minimizer of the group-lasso cost function, that is

(15)

This problem has an explicit solution given by

(16)

It is then natural to define the plug-in estimator as . The next theorem establishes the performance of this estimator.
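Before stating it, here is a sketch of this estimator in code; the shrinkage factor applied to each column is the usual group-lasso one appearing in (16), and the tuning parameter is supplied by the user.

    import numpy as np

    def group_soft_thresholding_estimator(Y, lam):
        # Shrink every column towards zero by the factor (1 - lam/||Y_j||)_+ ,
        # which sets to zero the columns with norm below lam, then sum the result.
        norms = np.maximum(np.linalg.norm(Y, axis=0), 1e-12)
        shrink = np.clip(1.0 - lam / norms, 0.0, None)
        return (Y * shrink).sum(axis=1)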

Theorem 5.

The estimator defined in (16) with

(17)

satisfies, for every ,

(18)

where is some universal constant.

The comments made after the statement of Theorem 2 can be repeated here. The dependence of on is not crucial; one can replace by 1 in the expression for , and this will only slightly affect the risk bound. The bound in expectation can be complemented by a bound in deviation. The rate obtained for the soft thresholding is exactly of the same order as the one obtained in Theorem 2 for the hard thresholding. A notable difference, however, is that in the case of the soft thresholding the tuning parameter suggested by the theoretical developments is data dependent.

Remark 1.

Both Theorem 2 and Theorem 5 remain valid if we replace the condition by the following one: the vectors are independent (not necessarily identically distributed) Gaussian with .

2.4 Lower bounds and minimax rate optimality

We now address the question of the optimality of our estimators. In (Collier et al., 2017), the case was solved with lower and upper bounds matching up to a constant. In particular, Theorem 1 in (Collier et al., 2017) yields the following proposition.

Proposition 1.

Assume that , then there is a universal constant such that

(19)

Note that when , this rate is of the order of . It is straightforward to check that this rate generalizes to in the multidimensional case. Furthermore, if we knew in advance the sparsity pattern , then we could restrict the matrix of observations to the indices in , and we would get the oracle rate . These remarks are made formal in the following theorem.

Theorem 6.

Assume that , then there is a universal constant such that

(20)

Therefore, the greedy subset selector of Section 2.1 is provably rate-optimal in the case . A question that remains open is the rate optimality when . The lower bound of Theorem 6 is then of order , whereas the upper bound of Theorem 1 is of order . Taking into account the fact that the naive estimator has a risk of order , we get that the minimax risk is upper bounded by . Thus, there is a gap of order when .

input: matrix , noise variance and confidence level .
output: vector .
Set . Set . Set (if the set is empty, set ).
if  then
      
else
      
end if
return 
Algorithm 2: Adaptive GSS

Note that none of the estimators discussed earlier in this work attains the upper bound ; indeed, the latter is obtained as the minimum of the risks of two estimators. Interestingly, one can design a single estimator that attains this rate, and the previous sections contain all the necessary ingredients for this. We illustrate the trick in the case of the GSS estimator, but a similar technique can be applied to any estimator for which an “in deviation” risk bound is established.

The idea is to combine the GSS estimator and the naive estimator , with the aim of choosing the “best” one. The combination can be performed using the Lepski method (Lepskii, 1991), also known as intersection of confidence intervals (Goldenshluger and Nemirovski, 1997). The method is described in Algorithm 2. The construction is based on the following two facts:

  1. The true value lies with probability in the ball with .

  2. The true value lies with probability in the ball with (cf. Theorem 1).

These two facts imply that with probability at least the balls and have nonempty intersection. As a consequence, in this event, we have and, therefore, . Now, if , then and we have

(21)

along with

(22)
(23)

Thus, . In the second case, , we have , where the last equality follows from the fact that . Thus, we have established the following result.

Proposition 2.

Let be a prescribed confidence level. With probability at least , the adaptive greedy subset selection estimator defined in Algorithm 2 satisfies .
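The following sketch summarizes the intersection-of-confidence-balls idea behind Algorithm 2 and the argument above; it is not a verbatim transcription of Algorithm 2, and the two radii are assumed to be supplied by the user (their expressions come from the risk bound (4) of the naive estimator and from Theorem 1, and are not reproduced here).

    import numpy as np

    def combine_by_confidence_balls(L_gss, r_gss, L_naive, r_naive):
        # With high probability both confidence balls contain the target and
        # therefore intersect; on that event, returning the centre of the
        # smaller ball gives an error at most equal to the smaller radius.
        if np.linalg.norm(L_gss - L_naive) <= r_gss + r_naive:
            return L_gss if r_gss <= r_naive else L_naive
        # If the balls are disjoint (a low-probability event), fall back on
        # the naive estimator, whose confidence radius is always valid.
        return L_naive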

Let us summarize the content of this section. We have established a lower bound on the minimax risk, showing that the latter is at least of order , up to a logarithmic factor. We have also obtained upper bounds, which imply that the minimax risk is at most of order . Furthermore, this rate can be attained by a single estimator (adaptive greedy subset selection).

3 The problem of robust estimation

The problem of linear functional estimation considered in the previous section has multiple connections with the problem of robust estimation of a Gaussian mean. In the latter problem, the observations in are assumed to satisfy

(24)

where is the identity matrix of dimension . We are interested in estimating the vector , under the assumption that most of the vectors are equal to zero. All the observations such that are considered to be inliers, while all the others are outliers. In this problem, the vectors are unknown, but their estimation is not our primary aim; they are rather treated as nuisance parameters. In some cases, it might be helpful to use the matrix notation of (24):

(25)

The obvious connection with the problem considered in the previous section is that if we know that in (24), then we recover model (1). This can be expressed in a more formal way as shown in the next proposition.

Proposition 3.

The problem of estimating the linear functional in model (25) is not easier, in the minimax sense, than that of estimating . More precisely, we have

(26)

where the suprema on the left-hand side and on the right-hand side are taken, respectively, over all and over all .

Proof.

The first inequality is a consequence of the fact that, when all the entries of are zero, the optimal estimator of in the minimax sense is the sample mean of the ’s. To prove the second inequality, let be an estimator of . We can associate with it the following estimator of : . These estimators satisfy

(27)
(28)
(29)

Since is drawn from the Gaussian distribution , we have and the claim of the proposition follows. ∎

Another important point that we would like to emphasize here is the relation between model (24) and the Huber contamination model (Huber, 1964) frequently studied in the statistical literature (we refer the reader to Chen et al. (2015, 2016) for recent overviews). As mentioned earlier in Remark 1, all the results established in the paper remain valid when the assumption on the noise is relaxed; the corresponding model is

(30)

where as before and are unknown covariance matrices satisfying . The next result shows that the model we investigate in this paper—i.e., the one defined by observations (30) with —is more general than Huber’s contamination model. Recall that in Huber’s contamination model, the observations are iid -dimensional vectors drawn from the mixture distribution . The particularity of this model is that it assumes all the outliers to be generated by the same distribution ; the latter, however, can be an arbitrary distribution on .

Proposition 4.

Let be an estimator of . Then, we have

(31)

The supremum on the left-hand side is over all probability distributions on , while the notation stands for the binomial distribution.

The proof of this proposition is a simple exercise and is left to the reader. Although some statistical problems of robust estimation in frameworks of the same spirit as (30) have already been tackled in the literature (Dalalyan and Keriven, 2012; Dalalyan and Chen, 2012; Balmand and Dalalyan, 2015; Nguyen and Tran, 2013), the entire picture in terms of matching upper and lower bounds is not yet available. On the other hand, it has been established in (Chen et al., 2015) that the minimax rate of estimating in Huber’s contamination model is

(32)

It is shown that this rate is achieved by the Tukey depth estimator. An important observation is that the evaluation of Tukey’s depth is a computationally hard problem: no algorithm is known to date that approximates Tukey’s depth in a number of operations scaling polynomially in and in the approximation precision. The best known computationally tractable robust estimator, the element-wise median, has a rate of order (Chen et al., 2015, Prop. 2.1)

(33)
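To make the preceding models and the element-wise median baseline concrete, here is a small simulation sketch that generates data from a Huber-type contamination model, writes it in the column-sparse form (24), and computes the element-wise median; the numerical values and the choice of contaminating shift are ours, for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, eps, sigma = 200, 20, 0.1, 1.0   # illustrative values, not from the paper
    mu_star = np.zeros(p)                  # common mean of the inliers
    is_outlier = rng.random(n) < eps       # a Binomial(n, eps) number of outliers
    Theta = np.zeros((p, n))               # column-sparse nuisance parameter
    Theta[:, is_outlier] = 5.0             # one arbitrary choice of contamination
    Y = mu_star[:, None] + Theta + sigma * rng.standard_normal((p, n))

    mu_median = np.median(Y, axis=1)       # element-wise median baseline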

We shall show in this section that a suitable adaptation of the group soft thresholding estimator presented in the previous section leads to a rate that can be arbitrarily close to

(34)

This shows that if we restrict our attention to estimators whose computational complexity is at most polynomial, the minimax rate satisfies, for every ,

(35)

where means inequality up to logarithmic factors.

3.1 Maximum of profile likelihood with group lasso penalty

A computationally tractable estimator that can efficiently deal with structured sparsity and has provably good statistical performance is the group lasso (Yuan and Lin, 2006; Lin and Zhang, 2006; Chesneau and Hebiri, 2008; Meier et al., 2009; Lounici et al., 2011). We define the group-lasso estimator by

(36)

where the are some positive numbers to be specified later. The estimator can be seen as a maximizer of a penalized profile likelihood, where the penalty is proportional to the norm (also known as the group lasso penalty) of the nuisance parameter . The above optimization problem is convex and can be solved numerically even when the dimension and the sample size are large. It is also well known that from (36) is exactly the Huber M-estimator (Donoho and Montanari, 2016, Section 6). In addition, these estimators can also be written as

(37)
(38)

where denotes the orthogonal projection in onto the orthogonal complement of the constant vector . Unfortunately, we were unable to establish a risk bound for this estimator that improves on the element-wise median. The best result that we get is the following.

Theorem 7.

Consider the estimators of and defined in (36) with . Then, with probability at least and provided that , we have

(39)
(40)

This result, proved in Section 5.2, shows that the penalized profile likelihood estimator of , with a group lasso penalty, converges at the rate , which coincides with the rate obtained by Chen et al. (2015). In the rest of this section, we propose an estimator that improves on this rate. To this end, we start by obtaining a simplified expression for the group lasso estimator .

First, using the fact that , we get , so that

(41)

Recall that . The first-order necessary conditions imply that, for every such that ,

(42)

Furthermore, if and only if . We infer that

(43)

for every . Finally, denoting , we get

(44)

This formula shows the clear analogy between the group lasso estimator and the soft thresholding estimator studied in the previous section. This analogy suggests choosing the tuning parameters in a data-driven way; namely, it is tempting to set

(45)

Unfortunately, such a choice is impossible to realize, since it depends on the solution of the optimization problem, which is in turn defined through . To circumvent this problem, we suggest using an iterative algorithm that starts from an initial estimator of , defines the vectors , and then updates by the formula , where the columns of the matrix are defined by the second equality in (45). This algorithm, called iterative soft thresholding, is described in Algorithm 3.

input: matrix , noise variance , number of outliers .
parameters: number of iterations , confidence level .
output: vectors and .
initialization: solution of (38) with .
for  do
      Set 
      Set 
      for  do
            Set 
            Set 
      end for
      update  .
end for
return  and .
Algorithm 3: Iterative Soft Thresholding
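A compact NumPy sketch of this iterative scheme is given below; the initialization, the choice of the thresholds, and the stopping rule are simplified with respect to Algorithm 3, so it should be read as an illustration rather than a faithful transcription.

    import numpy as np

    def iterative_soft_thresholding(Y, lam, n_iter=10):
        # Y contains one observation per column; lam is a (scalar or per-column)
        # threshold. Alternate between soft-thresholding the residuals Y_i - mu
        # to estimate the shifts theta_i and re-estimating the common mean mu.
        mu = np.median(Y, axis=1)                        # crude initial estimate
        Theta = np.zeros_like(Y)
        for _ in range(n_iter):
            R = Y - mu[:, None]                          # residuals
            norms = np.maximum(np.linalg.norm(R, axis=0), 1e-12)
            shrink = np.clip(1.0 - lam / norms, 0.0, None)
            Theta = R * shrink                           # soft-thresholded shifts
            mu = (Y - Theta).mean(axis=1)                # updated mean estimate
        return mu, Theta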

Prior to stating the theorem that describes the statistical complexity of this estimator, we present a result that explains why such an iterative algorithm succeeds in improving the convergence rate.

Theorem 8.

Let us set , where is a preliminary estimator of . Let be a tolerance level. Consider the estimator of defined by concatenating the vectors

(46)

where is a tuning parameter. Define the event

(47)

There is an event (the same for all estimators ) of probability at least , such that on , we have

(48)

It follows from this theorem that at each iteration of the algorithm we improve the precision of estimation of . Indeed, if is an upper bound on the error at the th iteration, then we get from the last theorem that

(49)

with .

Lemma 1.

If , and , then

(50)

Combining all these results, we arrive at the following risk bound for the iterative soft thresholding estimator.

Theorem 9.

Let , and let be the iterative soft thresholding estimator obtained after iterations. Assume that and . There are universal strictly positive constants such that, if the condition

(51)

is satisfied, then, with probability at least , the following inequalities hold:

(52)