Mixing Least-Squares Estimators when the Variance is Unknown

# Mixing Least-Squares Estimators when the Variance is Unknown

Christophe Giraud Université de Nice Sophia-Antipolis, Laboratoire J-A Dieudonné, Parc Valrose, 06108 Nice cedex 02
First draft 03/02/2007, Revision 02/11/2007
###### Abstract.

We propose a procedure to handle the problem of Gaussian regression when the variance is unknown. We mix least-squares estimators from various models according to a procedure inspired by that of Leung and Barron [17]. We show that in some cases the resulting estimator is a simple shrinkage estimator. We then apply this procedure in various statistical settings such as linear regression or adaptive estimation in Besov spaces. Our results provide non-asymptotic risk bounds for the Euclidean risk of the estimator.

###### Key words and phrases:
Gibbs mixture - shrinkage estimator - oracle inequalities - adaptive estimation - linear regression
62G08

## 1. Introduction

We consider the regression framework, where we have noisy observations

 (1) Yi=μi+σεi,  i=1,…,n

of an unknown vector . We assume that the ’s are i.i.d standard Gaussian random variables and that the noise level is unknown. Our aim is to estimate .

In this direction, we introduce a finite collection of linear spaces of , which we call henceforth models. To each model , we associate the least-squares estimator of on , where denotes the orthogonal projector onto . The -risk of the estimator with respect to the Euclidean norm on is

 (2) E[∥μ−^μm∥2]=∥μ−ΠSmμ∥2+dim(Sm)σ2.

Two strategies have emerged to handle the problem of the choice of an estimator of in this setting. One strategy is to select a model with a data driven criterion and use to estimate . In the favorable cases, the risk of this estimator is of order the minimum over of the risks (2). Model selection procedures have received a lot of attention in the literature, starting from the pioneer work of Akaike [1] and Mallows [18]. It is beyond the scope of this paper to make an historical review of the topic. We simply mention in the Gaussian setting the papers of Birgé and Massart [7, 8] (influenced by Barron and Cover [5] and Barron, Birgé and Massart [4]) which give non-asymptotic risk bounds for a selection criterion generalizing Mallows’.

An alternative to model selection is mixing. One estimates by a convex (or linear) combination of the s

 (3) ^μ=∑m∈Mwm^μm,

with weights which are -measurable random variables. This strategy is not suitable when the goal is to select a single model , nevertheless it enjoys the nice property that may perform better than the best of the s. Various choices of weights have been proposed, from an information theoretic or Bayesian perspective. Risk bounds have been provided by Catoni [11], Yang [25, 28], Tsybakov [23] and Bunea et al. [9] for regression on a random design and by Barron [3], Catoni [10] and Yang [26] for density estimation. For the Gaussian regression framework we consider here, Leung and Barron [17] propose a mixing procedure for which they derive a precise non-asymptotic risk bound. When the collection of models is not too complex, this bound shows that the risk of their estimator is close to the minimum over of the risks (2). Another nice feature of their mixing procedure is that both the weights and the estimators are build on the same data set, which enable to handle cases where the law of the data set is not exchangeable. Unfortunately, their choice of weights depends on the variance , which is usually unknown.

In the present paper, we consider the more practical situation where the variance is unknown. Our mixing strategy is akin to that of Leung and Barron [17], but is not depending on the variance . In addition, we show that both our estimator and the estimator of Leung and Barron are simple shrinkage estimators in some cases. From a theoretical point of view, we relate our weights to a Gibbs measure on and derive a sharp risk bound for the estimator . Roughly, this bound says that the risk of is close to the minimum over of the risks (2) in the favorable cases. We then discuss the choice of the collection of models in various situations. Among others, we produce an estimation procedure which is adaptive over a large class of Besov balls.

Before presenting our mixing procedure, we briefly recall that of Leung and Barron [17]. Assuming that the variance is known, they use the weights

 (4) wm=πmZexp(−β[∥Y−^μm∥2/σ2+2dim(Sm)−n]),m∈M

where is a given prior distribution on and normalizes the sum of the s to one. These weights have a Bayesian flavor. Indeed, they appear with in Hartigan [15] which considers the Bayes procedure with the following (improper) prior distribution: pick an in according to and then sample ”uniformly” on . Nevertheless, in Leung and Barron [17] the role of the prior distribution is to favor models with low complexity. Therefore, the choice of is driven by the complexity of the model rather than from a prior knowledge on . In this sense their approach differs from the classical Bayesian point of view. Note that the term appearing in the weights (4) is an unbiased estimator of the risk (2) rescaled by . The size of the weight then depends on the difference between this estimator of the risk (2) and , which can be thought as a complexity-driven penalty (in the spirit of Barron and Cover [5] or Barron et al. [4]). The parameter tunes the balance between this two terms. For , Theorem 5 in [17] provides a sharp risk bound for the procedure.

The rest of the paper is organized as follows. We present our mixing strategy in the next section and express in some cases the resulting estimator as a shrinkage estimator. In Section 3, we state non-asymptotic risk bounds for the procedure and discuss the choice of the tuning parameters. Finally, we propose in Section 4 some weighting strategies for linear regression or for adaptive regression over Besov balls. Section 5 is devoted to a numerical illustration and Section 6 to the proofs. Additional results are given in the Appendix.

We end this section with some notations we shall use along this paper. We write for the cardinality of a finite set , and for the inner product of two vectors and in . To any real number , we denote by its positive part and by its integer part.

## 2. The estimation procedure

We assume henceforth that .

### 2.1. The estimator

We start with a finite collection of models and to each model we associate the least-squares estimator of on . We also introduce a probability distribution on , which is meant to take into account the complexity of the family and favor models with low dimension. For example, if the collection has (at most) models per dimension , we suggest to choose , see the example at the end of Section 3.1. As mentioned before, the quantity can be interpreted as a complexity-driven penalty associated to the model (in the sense of Barron et al. [4]). The performance of our estimation procedure depends strongly on the choice of the collection of models and the probability distribution . We detail in Section 4 some suitable choices of these families for linear regression and estimation of BV or Besov functions.

Hereafter, we assume that there exists a linear space of dimension , such that for all . We will take advantage of this situation and estimate the variance of the noise by

 (5) ^σ2=∥Y−ΠS∗Y∥2N∗

where . We emphasize that we do not assume that and the estimator is (positively) biased in general. It turns out that our estimation procedure does not need a precise estimation of the variance and the choice (5) gives good results. In practice, we may replace the residual estimator by a difference-based estimator (Rice [20], Hall et al. [14], Munk et al. [19], Tong and Wang [22], Wang et al. [29], etc) or by any non-parametric estimator (e.g. Lenth [16]), but we are not able to prove any bound similar to (12) or (13) when using one of these estimators.

Finally, we associate to the collection of models , a collection of non-negative weights. We recommend to set , but any (sharp) upper bound of this quantity may also be appropriate, see the discussion after Theorem 1. Then, for a given positive constant we define the estimator by

 (6) ^μ=∑m∈Mwm^μmwithwm=πmZexp(β∥^μm∥2^σ2−Lm),

where is a constant that normalizes the sum of the s to one. An alternative formula for is with . We can interpret the term appearing in the exponential as a (biased) estimate of the risk (2) rescaled by . As in (4), the balance in the weight between this estimate of the risk and the penalty is tuned by . We refer to the discussion after Theorem 1 for the choice of this parameter. We mention that the weights can be viewed as a Gibbs measure on and we will use this property to assess the performance of the procedure.

We emphasize in the next section, that is a simple shrinkage estimator in some cases.

### 2.2. A simple shrinkage estimator

In this section, we focus on the case where consists of all the subsets of , for some and with an orthonormal family of vectors in . We use the convention . An example of such a setting is given in Section 4.2, see also the numerical illustration Section 5. Note that corresponds here to and .

To favor models with small dimensions, we choose the probability distribution

 (7) πm=(1+1pα)−pp−α|m|,m∈M,

with . We also set for some .

###### Proposition 1.

Under the above assumptions, we have the following expression for

 (8) ^μ=p∑j=1(cjZj)vj,withZj=andcj=exp(βZ2j/^σ2)pαexp(b)+exp(βZ2j/^σ2).

The proof of this proposition is postponed to Section 6.1. The main interest of Formula (8) is to allow a fast computation of . Indeed, we only need to compute the coefficients instead of the weights of formula (6).

The coefficients are shrinkage coefficients taking values in . They are close to one when is large and close to zero when is small. The transition from 0 to 1 occurs when . The choice of the tuning parameters , and will be discussed in Section 3.2.

Remark 1. Other choices are possible for and they lead to different s. Let us mention the choice for which the s are given by

 cj=∫10q∏k≠j[q+(1−q)exp(−βZ2k/^σ2+b)]dq∫10∏pk=1[q+(1−q)exp(−βZ2k/^σ2+b)]dq, for j=1,…,p.

This formula can be derived from the Appendix of Leung and Barron [17].

Remark 2. When the variance is known, we can give a formula similar to (8) for the estimator of Leung and Barron [17]. Let us consider the same setting, with . Then, when the distribution is given by (7), the estimator (3) with weights given by (4) takes the form

 (9) ^μ=p∑j=1⎛⎝eβZ2j/σ2pαe2β+eβZ2j/σ2Zj⎞⎠vj.

## 3. The performance

### 3.1. A general risk bound

The next result gives an upper bound on the -risk of the estimation procedure. We remind the reader that and set

 (10) ϕ:]0,1[→]0,+∞[x↦(x−1−logx)/2 ,

which is decreasing.

###### Theorem 1.

Assume that and fulfill the condition

 (11) β<1/4andN∗≥2+lognϕ(4β),

with defined by (10). Assume also that , for all . Then, we have the following upper bounds on the -risk of the estimator

 (12) spanspanE(∥μ−^μ∥2) ≤ −(1+εn) ¯σ2β log[∑m∈Mπme−β[∥μ−ΠSmμ∥2−dim(Sm)σ2]/¯σ2−Lm]+σ22logn (13) ≤ (1+εn)infm∈M{∥μ−ΠSmμ∥2+¯σ2β(Lm−logπm)}+σ22logn,

where and .

The proof Theorem 1 is delayed to Section 6.3. Let us comment this result.

To start with, the Bound (12) may look somewhat cumberstone but it improves (13) when there are several good models to estimate . For example, we can derive from (12) the bound

 E(∥μ−^μ∥2)≤(1+εn)infm∈M{∥μ−ΠSmμ∥2+¯σ2β(Lm−logπm)}+infδ≥0{δ−¯σ2βlog|Mδ|}+σ22logn,

where is the set made of those in fulfilling

 ∥μ−ΠSm∗μ∥2+¯σ2β(Lm∗−logπm∗)≤δ+infm∈M{∥μ−ΠSmμ∥2+¯σ2β(Lm−logπm)}.

In the extreme case where all the quantities are equal, (12) then improves (13) by a factor .

We now discuss the choice of the parameter and the weights . The choice seems to be the more accurate since it satisfies the conditions of Theorem 1 and minimizes the right hand side of (12) and (13). We shall mostly use this one in the following, but there are some cases where it is easier to use some (sharp) upper bound of the dimension of instead of dim itself, see for example Section 4.3.

The largest parameter fulfilling Condition (11) is

 (14) β=14 ϕ−1(lognN∗−2)<14.

We suggest to use this value, since it minimizes the right hand side of (12) and (13). Nevertheless, as discussed in Section 3.2 for the situation of Section 2.2, it is sometimes possible to use larger values for .

Finally, we would like to compare the bounds of Theorem 1 with the minimum over of the risks given by (2). Roughly, the Bound (13) states that the estimator achieves the best trade-off between the bias and the complexity term . More precisely, we derive from (13) the (cruder) bound

 (15) E(∥μ−^μ∥2)≤(1+εn)infm∈M{∥μ−ΠSmμ∥2+1βCmσ2}+R∗nσ2,

with

 εn=12nlognandR∗n=12logn+∥μ−ΠS∗μ∥2βN∗σ2supm∈MCm.

In particular, if is of order , then (15) allows to compare the risk of with the infimum of the risks (2). We discuss this point in the following example.

Example. Assume that the family has an index of complexity , as defined in [2], which means that

 |{m∈M, dim(Sm)=d}|≤Mead,for all d≥1.

If we choose, example given,

 (16) πm=e−(a+1/2)dim(Sm)∑m′∈Me−(a+1/2)dim(Sm′)% andLm=dim(Sm)/2,

we have . Therefore, when is given by (14) and for some , we have

 E(∥μ−^μ∥2)≤(1+εn)infm∈M{∥μ−ΠSmμ∥2+a+1βdim(Sm)σ2}+R′nσ2,

with

 R′n=log(3M)β+12logn+∥μ−ΠS∗μ∥2σ2×(a+1)κ+n−1log(3M)β(1−κ).

In particular, for a given index of complexity and a given , the previous bound gives an oracle inequality.

### 3.2. On the choice of the parameter β.

The choice of the tuning parameter is important in practice. Theorem 1 or Theorem 5 in [17] justify the choice of a smaller than . Nevertheless, Bayesian arguments [15] suggest to take a larger value for , namely . In this section, we discuss this issue on the example of Section 2.2.

For the sake of simplicity, we will restrict to the case where the variance is known. We consider the weights (4) proposed by Leung and Barron, with the probability distribution given by (7) with , namely111Note that this choice of minimizes the rate of growth of when goes to infinity since when is given by (7) with .

 πm=(1+p−1)−pp−|m|.

According to (9), the estimator takes the form

 (17) ^μ=p∑j=1sβ(Zj/σ)Zjvjwith% Zj=andsβ(z)=eβz2pe2β+eβz2.

To start with, we note that a choice is not to be recommanded. Indeed, we can compare the shrinkage coefficient to a threshold at level since

For , the risk of is then larger than a quarter of the risk of the threshold estimator , namely

Now, when the threshold is of order with , the threshold estimator is known to behave poorly for , see [7] Section 7.2. Therefore, a choice would give poor results at least when .

On the other hand, next proposition justifies the use of any by a risk bound similar to (13). For and , we introduce the numerical constants and

 cβ(p)=supx∈[0,4γβ(p)]⎡⎣∫R(x−(x+z)sβ(x+z))2e−z2/2dz/√2πmin(x2,γβ(p)2)+γβ(p)2/p⎤⎦∨0.6.

This constant can be numerically computed. For example, for any , see Figure 1.

###### Proposition 2.

For and , the Euclidean risk of the estimator (17) is upper bounded by

 (18) E(∥μ−^μ∥2)≤∥μ−ΠS∗μ∥2+cβ(p)infm∈M[∥ΠS∗μ−ΠSmμ∥2+(2+β−1logp)(|m|+1)σ2].

The constant is (crudely) bounded by 16 when and .

We delayed the proof of Proposition 2 to Section 6.4. We also emphasize that the bound is crude.

In light of the above result, seems to be a good choice in this case and corresponds to the choice of Hartigan [15]. Note that the choice has no reason to be a good choice in other situations. Indeed, a different choice of in (7) would give different ”good” values for . For , one may check that the ”good” values for are .

Remark 3. Proposition 2 does not provide an oracle inequality. The Bound (18) differs from the best trade off between the bias and the variance term by a factor. This is unavoidable from a minimax point of view as noticed in Donoho and Johnstone [13].

Remark 4. A similar analysis can be done for the estimator (8) when the variance is unknown. When the parameters and in (8) equal 1, we can justify the use of values of fulfilling

 β≤12 ϕ−1(logpn−p),for% n>p≥3,

see the Appendix.

## 4. Choice of the models and the weights in different settings

In this section, we propose some choices of weights in three situations: the linear regression, the estimation of functions with bounded variation and regression in Besov spaces.

### 4.1. Linear regression

We consider the case where the signal depends linearly on some observed explanatory variables , namely

 μi=p∑j=1θjx(j)i,i=1,…,n.

The index usually corresponds to an index of the experiment or to a time of observation. The number of variables may be large, but we assume here that is bounded by .

#### 4.1.1. The case of ordered variables

In some favorable situations, the explanatory variables are naturally ordered. In this case, we will consider the models spanned by the first explanatory variables, with ranging from 0 to . In this direction, we set and for , where . This collection of models is indexed by and contains one model per dimension. Note that coincides here with .

We may use in this case the priors

 πm=eα−1eα−e−αp e−αm,m=0,…,p,

with , set and takes the value (14) for , with . Then, according to Theorem 1 the performance of our procedure is controlled by

 spanspanE(∥μ−^μ∥2) ≤ (1+εn)infm∈M{∥μ−ΠSmμ∥2+¯σ2β(α+1/2)m}+1.2¯σ2βlog(eαeα−1)+σ22logn.

with and . As mentioned at the end of Section 3.1, the previous bound can be formulated as an oracle inequality when imposing the condition , for some .

#### 4.1.2. The case of unordered variables

When there is no natural order on the explanatory variables, we have to consider a larger collection of models. Set some which represents the maximal number of explanatory variables we want to take into account. Then, we write for all the subsets of of size less than and for any nonempty . We also set and . Note that the cardinality of is of order , so when is large the value should remain small in practice.

A possible choice for is

 πm=[(p|m|)(|m|+1)Hq]−1withHq=q∑d=01d+1 ≤ 1+log(q+1).

Again, we choose the value (14) for with and . With this choice, combining the inequality with Theorem 1 gives the following bound on the risk of the procedure

 spanspanE(∥μ−^μ∥2) ≤ [1+εn]infm∈M{∥μ−ΠSmμ∥2+¯σ2β[|m|(3/2+logp|m|)+log(|m|+1)]} +σ22logn+¯σ2βloglog[(q+1)e],

with and .

Remark. When the family is orthogonal and , we fall into the setting of Section 2.2. An alternative in this case is to use given by (8), which is easy to compute numerically.

### 4.2. Estimation of BV functions

We consider here the functional setting

 (19) μi=f(xi),i=1,…,n

where is an unknown function and are deterministic points of . We assume for simplicity that and . We set and with and for . For we define by

 [vj,k]i=2(j−1)/2(1I+j,k(i)−1I−j,k(i)),  i=1,…,n

with and . The family corresponds to the image of the points by a Haar basis (see Section 6.5) and it is orthonormal for the scalar product

 n=1nn∑i=1xiyi.

We use the collection of models indexed by and fall into the setting of Section 2.2. We choose the distribution given by (7) with and . We also set and take some fulfilling

 β≤12ϕ−1(2log(n/2)n).

According to Proposition 1 the estimator (6) then takes the form

 (20) ^μ=J∗∑j=0∑k∈Λ(j)⎛⎜ ⎜⎝Zj,kexp(nβZ2j,k/^σ2)en/2+exp(nβZ2j,k/^σ2)⎞⎟ ⎟⎠vj,k ,

with and

 ^σ2=2⎛⎝2n−J∗∑j=0∑k∈Λ(j)Z2j,k⎞⎠.

Next corollary gives the rate of convergence of this estimator when has bounded variation, in terms of the norm induced by the scalar product .

###### Corollary 1.

In the setting described above, there exists a numerical constant such that for any function with bounded variation

 E(∥μ−^μ∥2n)≤Cmax⎧⎨⎩(V(f)σ2lognn)2/3 , V(f)2n , σ2lognn⎫⎬⎭.

The proof is delayed to Section 6.5. The minimax rate in this setting is . So, the rate of convergence of the estimator differs from the minimax rate by a factor. We can actually obtain a rate-minimax estimator by using a smaller collection of models similar to the one introduced in the next section, but we lose then Formula (20).

### 4.3. Regression on Besov space Bαp,∞[0,1]

We consider again the setting (19) with and introduce a -orthonormal family of compactly support wavelets with regularity . We will use models generated by finite subsets of wavelets. If we want that our estimator shares some good adaptive properties on Besov spaces, we shall introduce a family of models induced by the compression algorithm of Birgé and Massart [7]. This collection turns to be slightly more intricate than the family used in the previous section. We start with some and set . The largest approximation space we will consider is

 F∗=span{ϕj,k,j=0…J∗,k=1…2j},

whose dimension is bounded by . For , we define

 MJ={m=J∗⋃j=0{j}×Aj, with Aj∈Λj,J},

where when and

 Λj,J={A⊂{1,…,2j}: |A|=⌊2J/(j−J+1)3⌋},when J≤j≤J∗.

To in , we associate and define the model by

 Sm={(f(x1),…,f(xn)),f∈Fm}⊂S∗={(f(x1),…,f(xn)),f∈F∗}.

When , the dimension of is bounded from above by

 (21) dim(Sm)≤J−1∑j=02j+J∗∑j=J2J(j−J+1)3≤2J[1+J∗−J+1∑k=1k−3]≤2.2⋅2J

and . Note also that the cardinality of is

 |MJ|=J∗∏j=J(2j⌊2J/(j−J+1)3⌋).

To estimate , we use the estimator given by (6) with given by (14) and

 Lm=1.1⋅2Jandπm=[2J(1−2J∗)J∗∏j=J(2j⌊2J/(j−J+1)3⌋)]−1,for m∈MJ.

Next corollary gives the rate of convergence of the estimator when belongs to some Besov ball with (we refer to De Vore and Lorentz [12] for a precise definition of Besov spaces). As it is usual in this setting, we express the result in terms of the norm on .

###### Corollary 2.

For any and , there exists some constant not depending on and such that the estimator defined above fulfills

 E(∥μ−^μ∥2n)≤Cmax{(σ2n)2α/(2α+1) , 1n2(α−1/p) , σ2n}

for any given by (19) with .

The proof is delayed to Section 6.6. We remind that the rate is minimax in this framework, see Yang and Barron [24].

## 5. A numerical illustration

We illustrate the use of our procedure on a numerical simulation. We start from a signal

 f(x)=0.7cos(x)+cos(7x)+1.5sin(x)+0.8sin(5x)+0.9sin(8x),x∈[0,1]

which is in black in Figure 2. We have noisy observations of this signal

 Yi=f(xi)+σεi,i=1