# PAC-Bayesian aggregation of affine estimators

Lucie Montuelle (RTE, La Défense, France, lucie.montuelle@rte-france.com) and Erwan Le Pennec (CMAP/XPOP, École Polytechnique, France, erwan.le-pennec@polytechnique.edu)
January 2018
###### Abstract

Aggregating estimators using exponential weights depending on their risk appears optimal in expectation but not in probability. We use here a slight overpenalization to obtain oracle inequalities in probability for such an explicit aggregation procedure. We focus on the fixed design regression framework and the aggregation of affine estimators, and obtain results for a large family of affine estimators under a sub-Gaussian noise assumption that does not require independence.

## 1 Introduction

We consider here a classical fixed design regression model

$$\forall i \in \{1,\dots,n\}, \quad Y_i = f_0(x_i) + W_i$$

with $f_0$ an unknown function, $(x_i)_{1\le i\le n}$ the fixed design points and $W=(W_i)_{1\le i\le n}$ a centered sub-Gaussian noise. We assume that we have at hand a family of affine estimates $\{\hat f_t, t \in T\}$, for instance a family of projection estimators or of linear ordered smoothers, in a basis or in a family of bases. The most classical way to use such a family is to select one of the estimates according to the observations, for instance using a penalized empirical risk principle. A better way is to combine linearly those estimates with weights depending on the observations. A simple strategy is the Exponential Weighting Average (EWA), in which all those estimates are averaged with a weight proportional to $\exp(-\tilde r_t/\beta)\pi(t)$, where $\tilde r_t$ is a (penalized) estimate of the risk of $\hat f_t$. This strategy is neither new nor optimal, as explained below, but is widely used in practice. In this article, we analyze the performance of this simple EWA estimator by providing oracle inequalities in probability under a mild sub-Gaussian assumption on the noise.

Our aim is to obtain the best possible estimate of the function $f_0$ at the grid points. This setting is probably one of the most common in statistics and many regression estimators are available in the literature. For nonparametric estimation, the Nadaraya-Watson estimator [40, 53] and its fixed design counterpart [26] are widely used, just like projection estimators using trigonometric, wavelet [24] or spline [52] bases for example. In the parametric framework, least squares or maximum likelihood estimators are commonly employed, sometimes with minimization constraints, leading to lasso [48], ridge [34], elastic net [61], AIC [1] or BIC [46] estimates.

Facing this variety, the statistician may wonder which procedure provides the best estimation. Unfortunately, the answer depends on the data. For instance, a rectangular function is well approximated by wavelets but not by trigonometric functions. Since the best estimator is not known in advance, our aim is to mimic its performance in terms of risk. This is theoretically guaranteed by an oracle inequality:

$$R(f_0,\tilde f) \le C_n \inf_{t\in T} R(f_0,\hat f_t) + \epsilon_n$$

comparing the risk of the constructed estimator $\tilde f$ to the risk of the best available procedure in the collection $\{\hat f_t, t\in T\}$. Our strategy is based on convex combinations of these preliminary estimators and relies on PAC-Bayesian aggregation to obtain a single adaptive estimator. We focus on a wide family, commonly used in practice: affine estimators with a common recentring.

Aggregation procedures have been introduced by Vovk [51], Littlestone and Warmuth [38], Cesa-Bianchi et al. [14], Cesa-Bianchi and Lugosi [13]. They are a central ingredient of bagging [9], boosting [25, 45] or random forest (Amit and Geman [3] or Breiman [10]; or more recently Biau et al. [8], Biau and Devroye [7], Biau [6], Genuer [27]).

The general aggregation framework is detailed in Nemirovski [41] and studied in Catoni [11, 12] through a PAC-Bayesian framework as well as in Yang [54, 55, 56, 57, 58, 59, 60]. See for instance Tsybakov [50] for a survey. Optimal rates of aggregation in regression and density estimation are studied by Tsybakov [49], Lounici [39], Rigollet and Tsybakov [43], Rigollet [42] and Lecué [36].

A way to translate the confidence in each preliminary estimate is to aggregate according to a measure that decreases exponentially when the estimate's risk rises. This widely used strategy is called exponentially weighted aggregation. More precisely, as explained before, the weight of each element $\hat f_t$ in the collection is proportional to $\exp(-\tilde r_t/\beta)\pi(t)$, where $\tilde r_t$ is a (penalized) estimate of the risk of $\hat f_t$, $\beta$ is a positive parameter, called the temperature, that has to be calibrated and $\pi$ is a prior measure over $T$. The main interest of exponential weights resides in Lemma 1 [12], since they explicitly minimize the aggregated risk penalized by the Kullback-Leibler divergence to the prior measure $\pi$. Our aim is to give sufficient conditions on the risk estimate $\tilde r_t$ and the temperature $\beta$ to obtain an oracle inequality for the risk of the aggregate. Note that when the family $T$ is countable, the exponentially weighted aggregate is a weighted sum of the preliminary estimates.

This procedure has shown its efficiency, offering lower risk than model selection because we bet on several estimators. Aggregation of projections has already been addressed by Leung and Barron [37]. They have proved, by means of an oracle inequality, that in expectation the aggregate performs almost as well as the best projection in the collection. Those results have been extended to several settings and noise conditions [20, 21, 22, 29, 23, 5, 18, 30, 47, 44] under a frozen estimator assumption: the preliminary estimators should not depend on the observed sample. This restriction, not present in the work by Leung and Barron [37], has been removed by Dalalyan and Salmon [19] within the context of affine estimators and exponentially weighted aggregation. Nevertheless, they make additional assumptions on the matrices $A_t$ and the Gaussian noise to obtain an optimal oracle inequality in expectation for affine estimates. Very sharp results have been obtained in Golubev [31], Chernousova et al. [15] and Golubev and Ostobski [32]. Those papers, except the last one, study a risk in expectation.

Indeed, Exponential Weighting Aggregation is no longer optimal in probability: Dai et al. [16] have proved the sub-optimality in deviation of exponential weighting, which rules out a sharp oracle inequality in probability. Under strong assumptions and independent noise, Bellec [4] provides a sharp oracle inequality with optimal rate for another aggregation procedure called Q-aggregation. It is similar to exponential weights but the criterion to minimize is modified and the weights are no longer explicit. Results for the original EWA scheme exist nevertheless, but with a constant strictly larger than $1$ in the oracle inequality. Dai et al. [17] obtain for instance a result under a Gaussian white noise assumption by penalizing the risk in the weights and taking a temperature at least 20 times greater than the noise variance. Golubev and Ostobski [32] do not use an overpenalization but assume some ordered structure on the estimates to obtain a result valid even for low temperature. An unpublished work [28] also provides a weak oracle inequality with high probability for projection estimates on non linear models. Alquier and Lounici [2] consider frozen and bounded preliminary estimators and obtain a sharp oracle inequality in deviation for the excess risk under a sparsity assumption, if the regression function is bounded, with again a modified version of exponential weights.

In this article, we will play on both the temperature and the penalization. We will be able to obtain oracle inequalities for the Exponential Weighting Aggregation under a general sub-Gaussian noise assumption that does not require a coordinate independent setting. We conduct an analysis of the relationship between the choice of the penalty and the minimal temperature. In particular, we show that there is a continuum between the usual noise based penalty and a sup norm type one allowing a sharp oracle inequality.

## 2 Framework and estimate

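Recall that we observe

$$\forall i \in \{1,\dots,n\}, \quad Y_i = f_0(x_i) + W_i$$

with $f_0$ an unknown function and $(x_i)_{1\le i\le n}$ the fixed grid points. Our only assumption will be on the noise. We do not assume any independence between the coordinates $W_i$ but only that $W=(W_i)_{1\le i\le n}$ is a centered sub-Gaussian variable. More precisely, we assume that $\mathbb{E}[W]=0$ and there exists $\sigma^2 > 0$ such that

$$\forall \alpha \in \mathbb{R}^n, \quad \mathbb{E}\left[\exp\left(\alpha^\top W\right)\right] \le \exp\left(\frac{\sigma^2}{2}\|\alpha\|_2^2\right),$$

where $\|\cdot\|_2$ is the usual euclidean norm in $\mathbb{R}^n$. If $W$ is a centered Gaussian vector with covariance matrix $\Sigma$ then $\sigma^2$ is nothing but the largest eigenvalue of $\Sigma$.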
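The Gaussian case can be checked directly from the moment generating function. A short derivation, writing $\Sigma$ for the covariance matrix and $\lambda_{\max}(\Sigma)$ for its largest eigenvalue (notation introduced here for illustration):

$$\mathbb{E}\left[\exp\left(\alpha^\top W\right)\right] = \exp\left(\frac{1}{2}\alpha^\top \Sigma \alpha\right) \le \exp\left(\frac{\lambda_{\max}(\Sigma)}{2}\|\alpha\|_2^2\right),$$

with equality when $\alpha$ is an eigenvector associated with $\lambda_{\max}(\Sigma)$, so that the smallest admissible sub-Gaussian constant is $\lambda_{\max}(\Sigma)$.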

The quality of our estimate will be measured through its error at the design points. More precisely, we will consider the classical euclidean loss, related to the squared norm

$$\|g\|_2^2 = \sum_{i=1}^n g(x_i)^2.$$

Thus, our unknown is the vector $(f_0(x_i))_{1\le i\le n}$ rather than the function $f_0$.

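As announced, we will consider affine estimators corresponding to affine smoothed projections.

We will assume that

$$\hat f_t(Y) = A_t(Y-b) + b + b_t = \sum_{i=1}^n \rho_{t,i}\langle Y-b, g_{t,i}\rangle g_{t,i} + b + b_t$$

where $(g_{t,i})_{1\le i\le n}$ is an orthonormal basis, $(\rho_{t,i})_{1\le i\le n}$ a sequence of non-negative real numbers and $b_t \in \mathbb{R}^n$. By construction, $A_t$ is thus a symmetric positive semi-definite real matrix. We assume furthermore that the matrix collection $\{A_t\}_{t\in T}$ is such that $\sup_{t\in T}\|A_t\|_2 \le 1$. For the sake of simplicity, we only use the notation $\hat f_t$ for $\hat f_t(Y)$ in the following.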

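To define our estimate from the collection $\{\hat f_t\}_{t\in T}$, we specify the estimate $\tilde r_t$ of the (penalized) risk of the estimator $\hat f_t$, choose a prior probability measure $\pi$ over $T$ and a temperature $\beta > 0$. We define the exponentially weighted measure $\rho_{EWA}$, a probability measure over $T$, by

$$d\rho_{EWA}(t) = \frac{\exp\left(-\frac{1}{\beta}\tilde r_t\right)}{\int \exp\left(-\frac{1}{\beta}\tilde r_{t'}\right)d\pi(t')}\,d\pi(t)$$

and the exponentially weighted aggregate $f_{EWA}$ by $f_{EWA} = \int \hat f_t\,d\rho_{EWA}(t)$. If $T$ is countable then

$$f_{EWA} = \sum_{t\in T} \frac{e^{-\tilde r_t/\beta}\pi_t}{\sum_{t'\in T} e^{-\tilde r_{t'}/\beta}\pi_{t'}}\,\hat f_t.$$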

This construction naturally favors low risk estimates. When the temperature $\beta$ goes to zero, this estimator becomes very similar to the one minimizing the risk estimate, while it becomes an indiscriminate average when $\beta$ grows to infinity. The choice of the temperature thus appears to be crucial and a low temperature seems to be desirable.
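For a countable (here finite) collection, the exponential weights and the aggregate can be sketched in a few lines of Python. This is only an illustration: the function names `ewa_weights` and `aggregate`, the numerical risk values and the uniform prior are assumptions made for the example, and the prior is assumed to be strictly positive.

```python
import math

def ewa_weights(risks, prior, beta):
    """Weights proportional to prior[t] * exp(-risks[t] / beta) (prior must be > 0)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in the log domain and subtract the maximum for numerical stability.
    logs = [math.log(p) - r / beta for r, p in zip(risks, prior)]
    shift = max(logs)
    unnormalized = [math.exp(v - shift) for v in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def aggregate(estimates, weights):
    """Weighted sum of the preliminary estimates (each a list of n values)."""
    n = len(estimates[0])
    return [sum(w * est[i] for w, est in zip(weights, estimates)) for i in range(n)]

# Hypothetical penalized risk estimates for three estimators, uniform prior.
risks = [10.0, 12.0, 30.0]
prior = [1 / 3, 1 / 3, 1 / 3]
cold = ewa_weights(risks, prior, beta=0.01)  # nearly selects the risk minimizer
hot = ewa_weights(risks, prior, beta=1e6)    # nearly returns the prior weights
```

With a very low temperature almost all the mass sits on the estimator with the smallest risk estimate, whereas a very high temperature gives back weights close to the prior, which illustrates the two regimes described above.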

Our choice for the risk estimate $\tilde r_t$ is to use the classical Stein unbiased estimate, which is sufficient to obtain optimal oracle inequalities in expectation,

$$r_t = \|Y - \hat f_t(Y)\|_2^2 + 2\sigma^2\,\mathrm{Tr}(A_t) - n\sigma^2$$

and add a penalty $\mathrm{pen}(t)$, so that $\tilde r_t = r_t + \mathrm{pen}(t)$. We will consider simultaneously the case of a penalty independent of $f_0$ and the one where the penalty may depend on an upper bound of a (kind of) sup norm of $f_0$.
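As an illustration, the Stein unbiased estimate and its penalized version can be computed as follows in the simplest situation. The sketch assumes smoothers that are diagonal in the canonical basis, $b = 0$ and $b_t = 0$, and uses the penalty shape $\kappa\sigma^2\mathrm{Tr}(A_t^2)$ considered in Section 4; the function names and the parameter `kappa` are chosen for the example.

```python
def sure_risk(y, rho, sigma2):
    """Stein unbiased risk estimate r_t for the diagonal smoother A_t = diag(rho).

    Assumes b = 0 and b_t = 0, so that the estimate is f_t(Y)_i = rho[i] * y[i].
    """
    n = len(y)
    residual = sum(((1.0 - r) * v) ** 2 for r, v in zip(rho, y))  # ||Y - f_t(Y)||^2
    trace = sum(rho)                                             # Tr(A_t)
    return residual + 2.0 * sigma2 * trace - n * sigma2

def penalized_risk(y, rho, sigma2, kappa):
    """r_t + pen(t) with the penalty pen(t) = kappa * sigma2 * Tr(A_t^2)."""
    trace_sq = sum(r * r for r in rho)                           # Tr(A_t^2)
    return sure_risk(y, rho, sigma2) + kappa * sigma2 * trace_sq
```

For instance, with `y = [1, 2, 3]` and `sigma2 = 0.5`, the identity smoother `rho = [1, 1, 1]` has zero residual and its risk estimate reduces to $n\sigma^2 = 1.5$.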

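More precisely, we allow the use, at least in the analysis, of an upper bound $\widetilde{\|f_0-b\|}_\infty$ which can be thought of as the supremum of the sup norm of the coefficients of $f_0-b$ in any basis appearing in $T$. Indeed, we define $\widetilde{\|f_0-b\|}_\infty$ as the smallest non-negative real number $C$ such that for any $t \in T$,

$$\|A_t(f_0-b)\|_2^2 \le C^2\,\mathrm{Tr}(A_t^2).$$

By construction, $\widetilde{\|f_0-b\|}_\infty$ is smaller than the sup norm of the coefficients of $f_0-b$ in any basis appearing in the collection of estimators. Note that $\widetilde{\|f_0-b\|}_\infty$ can also be upper bounded by $\|f_0-b\|_1$, $\|f_0-b\|_2$ or $\sqrt{n}\|f_0-b\|_\infty$, where the $\ell_1$ and sup norms can be taken in any basis.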

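Our aim is to obtain sufficient conditions on the penalty $\mathrm{pen}(t)$ and the temperature $\beta$ so that an oracle inequality of type

$$\|f_0 - f_{EWA}\|_2^2 \le \inf_{\mu\in\mathcal{M}_+^1(T)} (1+\epsilon)\int \|f_0-\hat f_t\|_2^2\,d\mu(t) + (1+\epsilon')\left(\int \mathrm{price}(t)\,d\mu(t) + 2\beta\,\mathrm{KL}(\mu,\pi) + \beta\ln\frac{1}{\eta}\right)$$

holds either in probability or in expectation. Here, $\epsilon$ and $\epsilon'$ are some small non-negative numbers, possibly equal to $0$, and $\mathrm{price}(t)$ is a loss depending on the choice of $\mathrm{pen}(t)$ and $\beta$. When $T$ is countable, such an oracle inequality proves that the risk of our aggregate estimate is of the same order as the one of the best estimate in the collection, as it implies

$$\|f_0 - f_{EWA}\|_2^2 \le \inf_{t\in T}\left\{(1+\epsilon)\|f_0-\hat f_t\|_2^2 + (1+\epsilon')\left(\mathrm{price}(t) + \beta\ln\frac{1}{\pi(t)^2\eta}\right)\right\}.$$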
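The countable version follows by restricting the infimum to Dirac measures. As a brief worked step, writing $\delta_t$ for the Dirac measure at $t$ (notation assumed here), one has

$$\mathrm{KL}(\delta_t,\pi) = \ln\frac{1}{\pi(t)}, \qquad\text{so that}\qquad 2\beta\,\mathrm{KL}(\delta_t,\pi) + \beta\ln\frac{1}{\eta} = \beta\ln\frac{1}{\pi(t)^2\eta},$$

while the two integrals against $\delta_t$ reduce to $\|f_0-\hat f_t\|_2^2$ and $\mathrm{price}(t)$.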

Before stating our more general result, which is in Section 4, we provide a comparison with some similar results in the literature on the countable setting.

## 3 Penalization strategies and preliminary results

The most similar result in the literature is the one from Dai et al. [17], which holds under a Gaussian white noise assumption and uses a penalty proportional to the known variance $\sigma^2$:

###### Proposition 1 (Dai et al. [17]).

If , and , then for all , with probability at least ,

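$$\|f_0 - f_{EWA}\|^2 \le \min_t\left\{\left(1+\frac{128\sigma^2}{3\beta}\right)\|f_0-\hat f_t\|^2 + 8\sigma^2\,\mathrm{Tr}(A_t) + 3\beta\ln\frac{1}{\pi_t} + 3\beta\ln\frac{1}{\eta}\right\}.$$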

Our result generalizes this result to sub-Gaussian noise that is not necessarily independent. We obtain

###### Proposition 2.

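If $\beta > 20\sigma^2$, there exists $\gamma \in [0,1/2)$ such that if $\mathrm{pen}(t) \ge \frac{4\sigma^2}{\beta-4\sigma^2}\sigma^2\,\mathrm{Tr}(A_t^2)$ for any $t$, then for any $\eta \in (0,1)$, with probability at least $1-\eta$,

$$\|f_0 - f_{EWA}\|^2 \le \inf_t\left\{\left(1+\frac{4\gamma}{1-2\gamma}\right)\|f_0-\hat f_t\|^2 + \left(1+\frac{2\gamma}{1-2\gamma}\right)\left(\mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t) + 2\beta\ln\frac{1}{\pi_t} + \beta\ln\frac{1}{\eta}\right)\right\}.$$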

The parameter is explicit and satisfies . We recover thus a similar weak oracle inequality under a weaker assumption on the noise. It should be noted that [4] obtains a sharp oracle inequality for a slightly different aggregation procedure but only under the very strong assumption that .

Following Guedj and Alquier [33], a lower bound on the penalty that involves the sup norm of $f_0$ can be given. In that case, the oracle inequality is sharp as $\epsilon = \epsilon' = 0$. Furthermore, the parameter $\gamma$ is not necessary and the minimum temperature is lower.

###### Proposition 3.

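If $\beta > 4\sigma^2$ and

$$\mathrm{pen}(t) \ge \frac{4\sigma^2}{\beta-4\sigma^2}\left(\sigma^2\,\mathrm{Tr}(A_t^2) + 2\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]\right),$$

then for any $\eta \in (0,1)$, with probability at least $1-\eta$,

$$\|f_0 - f_{EWA}\|^2 \le \inf_t\left\{\|f_0-\hat f_t\|^2 + 2\sigma^2\,\mathrm{Tr}(A_t) + \frac{8\sigma^2}{\beta-4\sigma^2}\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right] + \mathrm{pen}(t) + 2\beta\ln\frac{1}{\pi_t} + \beta\ln\frac{1}{\eta}\right\}.$$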

The two results can be combined in a single one. Indeed, to obtain the first oracle inequality, we rely in the proof on bounds of type

$$\|(A_t-A_u)f_0 + b_t - b_u\|_2^2 \le C_1\|\hat f_t - f_0\|_2^2 + C_2\|\hat f_u - f_0\|_2^2,$$

with some constants and depending on which allows to link to and . Whereas, for the second inequality we rely on bounds of type

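$$\begin{aligned}
\|(A_t-A_u)f_0 + b_t - b_u\|_2^2 &\le 4\left(\|A_tf_0\|_2^2 + \|A_uf_0\|_2^2 + \|b_t\|_2^2 + \|b_u\|_2^2\right)\\
&\le 4\left[\widetilde{\|f_0\|}_\infty^2\left(\mathrm{Tr}(A_t^2) + \mathrm{Tr}(A_u^2)\right) + \|b_t\|_2^2 + \|b_u\|_2^2\right].
\end{aligned}$$

Combining these two upper bounds produces weak oracle inequalities for a wider range of temperatures than Proposition 2, drawing a continuum between Proposition 2 and Proposition 3. More precisely, one obtains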

###### Proposition 4.

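For any $\delta \in [0,1]$, if $\beta > 4\sigma^2V(1+4\delta)$ and $\sup_{t\in T}\|A_t\|_2 \le V$, there exists $\gamma \ge 0$ such that if

$$\mathrm{pen}(t) \ge \frac{4\sigma^2}{\beta-4\sigma^2V}\left(\sigma^2\,\mathrm{Tr}(A_t^2) + 2(1-\delta)(1+2\gamma V)^2\left[\widetilde{\|f_0\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]\right),$$

then for any $\eta \in (0,1)$, with probability at least $1-\eta$,

$$\|f_0 - f_{EWA}\|^2 \le \inf_t\left\{(1+\epsilon)\|f_0-\hat f_t\|^2 + (1+\epsilon')\left(\mathrm{price}(t) + 2\beta\ln\frac{1}{\pi_t} + \beta\ln\frac{1}{\eta}\right)\right\}.$$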

with , and

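$$\mathrm{price}(t) = \mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t) + \frac{8\sigma^2(1-\delta)(1+2\gamma V)^2}{\beta-4\sigma^2V}\left[\widetilde{\|f_0\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right].$$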

The convex combination parameter $\delta$ measures how much the signal to noise ratio is accounted for in the penalty. We are now ready to state the central result of this paper, which gives an explicit expression for $\gamma$ and introduces an optimization parameter $\nu$.

## 4 A general oracle inequality

We consider now the general case for which $T$ is not necessarily countable. Recall that we have defined the exponentially weighted measure $\rho_{EWA}$, a probability measure over $T$, by

$$d\rho_{EWA}(t) = \frac{\exp\left(-\frac{1}{\beta}\tilde r_t\right)}{\int \exp\left(-\frac{1}{\beta}\tilde r_{t'}\right)d\pi(t')}\,d\pi(t)$$

and the exponentially weighted aggregate $f_{EWA}$ by $f_{EWA} = \int \hat f_t\,d\rho_{EWA}(t)$. We will directly consider a lower bound on the penalty of the same type as in Proposition 4; propositions similar to Propositions 2 and 3 will be obtained as straightforward corollaries.

Our main contribution is the following two similar theorems:

###### Theorem 4.1.

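For any $\beta > 20\sigma^2$, let

$$\gamma = \frac{\beta - 12\sigma^2 - \sqrt{\beta-4\sigma^2}\sqrt{\beta-20\sigma^2}}{16\sigma^2}.$$

If for any $t \in T$,

$$\mathrm{pen}(t) \ge \frac{4\sigma^2}{\beta-4\sigma^2}\,\sigma^2\,\mathrm{Tr}(A_t^2),$$

then

• for any $\eta \in (0,1)$, with probability at least $1-\eta$,

$$\|f_0 - f_{EWA}\|_2^2 \le \inf_{\mu\in\mathcal{M}_+^1(T)} \left(1+\frac{4\gamma}{1-2\gamma}\right)\int \|f_0-\hat f_t\|_2^2\,d\mu(t) + \left(1+\frac{2\gamma}{1-2\gamma}\right)\int \left(\mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t)\right)d\mu(t) + \beta\left(1+\frac{2\gamma}{1-2\gamma}\right)\left(2\,\mathrm{KL}(\mu,\pi) + \ln\frac{1}{\eta}\right).$$

• Furthermore,

$$\mathbb{E}\|f_0 - f_{EWA}\|_2^2 \le \inf_{\mu\in\mathcal{M}_+^1(T)} \left(1+\frac{4\gamma}{1-2\gamma}\right)\int \mathbb{E}\|f_0-\hat f_t\|_2^2\,d\mu(t) + \left(1+\frac{2\gamma}{1-2\gamma}\right)\int \left(\mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t)\right)d\mu(t) + 2\beta\left(1+\frac{2\gamma}{1-2\gamma}\right)\mathrm{KL}(\mu,\pi).$$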

and

###### Theorem 4.2.

For any , if , If for any

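$$\mathrm{pen}(t) \ge \frac{4\sigma^2}{\beta-4\sigma^2}\left(\sigma^2\,\mathrm{Tr}(A_t^2) + 2\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]\right),$$

then

• for any $\eta \in (0,1)$, with probability at least $1-\eta$,

$$\|f_0 - f_{EWA}\|_2^2 \le \inf_{\mu\in\mathcal{M}_+^1(T)} \int \|f_0-\hat f_t\|_2^2\,d\mu(t) + \int \left(\mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t) + \frac{8\sigma^2}{\beta-4\sigma^2}\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]\right)d\mu(t) + \beta\left(2\,\mathrm{KL}(\mu,\pi) + \ln\frac{1}{\eta}\right).$$

• Furthermore,

$$\mathbb{E}\|f_0 - f_{EWA}\|_2^2 \le \inf_{\mu\in\mathcal{M}_+^1(T)} \left(1+\frac{4\gamma}{1-2\gamma}\right)\int \mathbb{E}\|f_0-\hat f_t\|_2^2\,d\mu(t) + \int \left(\mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t) + \frac{8\sigma^2}{\beta-4\sigma^2}\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]\right)d\mu(t) + 2\beta\,\mathrm{KL}(\mu,\pi).$$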

When $T$ is discrete, one can replace the minimization over all the probability measures in $\mathcal{M}_+^1(T)$ by the minimization over all Dirac measures $\delta_t$ with $t \in T$. Propositions 2 and 3 are then straightforward corollaries. Note that the result in expectation is obtained with the same penalty, which is known not to be necessary, at least in the Gaussian case, as shown by Dalalyan and Salmon [19].
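To get a feel for the constants of Theorem 4.1, the following Python sketch evaluates $\gamma$, the two leading factors and the penalty lower bound for a given temperature. The function names are chosen for the example, and the strict condition $\beta > 20\sigma^2$ is an assumption of this sketch, made so that the square roots are defined and $1-2\gamma$ stays positive.

```python
import math

def gamma_thm41(beta, sigma2):
    """gamma of Theorem 4.1 (this sketch assumes beta > 20 * sigma2)."""
    if beta <= 20.0 * sigma2:
        raise ValueError("this sketch assumes beta > 20 * sigma2")
    root = math.sqrt(beta - 4.0 * sigma2) * math.sqrt(beta - 20.0 * sigma2)
    return (beta - 12.0 * sigma2 - root) / (16.0 * sigma2)

def leading_constants(beta, sigma2):
    """Return (gamma, factor of the risk term, factor of the price term)."""
    g = gamma_thm41(beta, sigma2)
    risk_factor = 1.0 + 4.0 * g / (1.0 - 2.0 * g)
    price_factor = 1.0 + 2.0 * g / (1.0 - 2.0 * g)
    return g, risk_factor, price_factor

def minimal_penalty(trace_sq, beta, sigma2):
    """Lower bound 4*sigma2 / (beta - 4*sigma2) * sigma2 * Tr(A_t^2) on pen(t)."""
    return 4.0 * sigma2 / (beta - 4.0 * sigma2) * sigma2 * trace_sq
```

Evaluating `leading_constants` for increasing values of `beta` shows the trade-off at play: a larger temperature brings both factors closer to 1 and lowers the required penalty, but it inflates the $\beta\,\mathrm{KL}(\mu,\pi)$ term of the bound.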

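If we assume the penalty is given by

$$\mathrm{pen}(t) = \kappa\,\mathrm{Tr}(A_t^2)\,\sigma^2,$$

one can rewrite the assumption in terms of $\kappa$. The weak oracle inequality holds for any temperature greater than $20\sigma^2$ as soon as $\kappa \ge \frac{4\sigma^2}{\beta-4\sigma^2}$, while an exact oracle inequality holds for any vector $f_0$ and any temperature greater than $4\sigma^2$ as soon as

$$\frac{\beta-4\sigma^2}{4\sigma^2}\,\kappa - 1 \ge 2\,\frac{\widetilde{\|f_0-b\|}_\infty^2 + \|b_t\|_2^2/\mathrm{Tr}(A_t^2)}{\sigma^2}.$$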

For fixed and , this corresponds to a low peak signal to noise ratio up to the term which vanishes when . Note that similar results hold for a penalization scheme but with much larger constants and some logarithmic factor in .

Finally, the minimal temperature of can be replaced by some smaller values if one further restrict the smoothed projections used. As it appears in the proof, the temperature can be replaced by or even when the smoothed projections are respectively classical projections and projections in the same basis. The question of the minimality of such temperature is still open. Note that in this proof, there is no loss due to the sub-Gaussianity assumption, since the same upper bound on the exponential moment of the deviation as in the Gaussian case are found, providing the same penalty and bound on temperature.

The two results can be combined in a single one producing weak oracle inequalities for a wider range of temperatures than Theorem 4.1. In the Appendix, we prove that a continuum between those two cases exists: a weak oracle inequality, with smaller leading constant than the one of Theorem 4.1, holds as soon as there exists $\delta \in [0,1]$ such that $\beta > 4\sigma^2(1+4\delta)$ and

$$\frac{\beta-4\sigma^2}{4\sigma^2}\,\kappa - 1 \ge 2(1-\delta)(1+2\gamma)^2\,\frac{\widetilde{\|f_0-b\|}_\infty^2 + \|b_t\|_2^2/\mathrm{Tr}(A_t^2)}{\sigma^2},$$

where the signal to noise ratio guides the transition. The temperature required nevertheless always remains above $4\sigma^2$. The convex combination parameter $\delta$ measures how much the signal to noise ratio is accounted for in the penalty.

Note that in practice, the temperature can often be chosen smaller. It is an open question whether the limit $4\sigma^2$ is an artifact of the proof or a real lower bound. In the Gaussian case, Golubev and Ostobski [32] have been able to show that this is mainly technical. Extending this result to our setting is still an open challenge.

## Appendix A Proof of the oracle inequalities

The proof of this result is quite long and thus postponed in Appendix A.1. We provide first the generic proof of the oracle inequalities, highlighting the role of Gibbs measure and of some control in deviation. Then, we focus on the aggregation of projection estimators in the Gaussian model. This example already conveys all the ideas used in the complete proof of the deviation lemma : exponential moments inequalities for Gaussian quadratic form and the control of the bias by on the one hand, to obtain an exact oracle inequality, and by on the other hand, giving a weak inequality.

The extension to the general case is obtained by showing that similar exponential moments inequalities can be obtained for quadratic form of sub-Gaussian random variables, working along the fact that the systematic bias is no longer always smaller than and providing a fine tuning optimization allowing the equality in the constraint on and an optimization on the parameters .

We provide in the next section the sketch of proof of Theorem A.1, an extended version of the Theorems as well as its proof in the sub-Gaussian case and a simplified case dealing with Gaussian noise and orthonormal projection meant to be compared with the one of Dai et al. [17].

### A.1 Extended result in the sub-Gaussian case

We will consider affine estimators corresponding to affine smoothed projections. We will assume that

$$\hat f_t(Y) = A_t(Y-b) + b + b_t = \sum_{i=1}^n \rho_{t,i}\langle Y-b, g_{t,i}\rangle g_{t,i} + b + b_t$$

where $(g_{t,i})_{1\le i\le n}$ is an orthonormal basis, $(\rho_{t,i})_{1\le i\le n}$ a sequence of non-negative real numbers and $b_t \in \mathbb{R}^n$. By construction, $A_t$ is thus a symmetric positive semi-definite real matrix. We only assume here that the matrix collection $\{A_t\}_{t\in T}$ is such that there exists a finite $V$ for which $\sup_{t\in T}\|A_t\|_2 \le V$. For the sake of simplicity, we only use the notation $\hat f_t$ for $\hat f_t(Y)$ in the following.

We obtain a theorem in which $V$ plays a role and in which a parameter $\nu$ can be optimized.

###### Theorem A.1.

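For any $\delta \in [0,1]$, if $\beta > 4\sigma^2V(1+4\delta)$ and $\sup_{t\in T}\|A_t\|_2 \le V$, let

$$\gamma = \frac{\beta - 4\sigma^2V(1+2\delta) - \sqrt{\beta-4\sigma^2V}\sqrt{\beta-4\sigma^2V(1+4\delta)}}{16\sigma^2\delta V^2}\,\mathbb{1}_{\delta>0}.$$

If for any $t \in T$,

$$\mathrm{pen}(t) \ge \frac{4\sigma^2}{\beta-4\sigma^2V}\left(\sigma^2\,\mathrm{Tr}(A_t^2) + 2(1-\delta)(1+2\gamma V)^2\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]\right),$$

then

• for any $\eta \in (0,1)$, with probability at least $1-\eta$,

$$\|f_0 - f_{EWA}\|_2^2 \le \inf_{\nu\in N}\,\inf_{\mu\in\mathcal{M}_+^1(T)} (1+\epsilon(\nu))\int \|f_0-\hat f_t\|_2^2\,d\mu(t) + (1+\epsilon'(\nu))\int \mathrm{price}(t)\,d\mu(t) + \beta(1+\epsilon'(\nu))\left(2\,\mathrm{KL}(\mu,\pi) + \ln\frac{1}{\eta}\right).$$

• Furthermore,

$$\mathbb{E}\|f_0 - f_{EWA}\|_2^2 \le \inf_{\nu\in N}\,\inf_{\mu\in\mathcal{M}_+^1(T)} (1+\epsilon(\nu))\int \mathbb{E}\|f_0-\hat f_t\|_2^2\,d\mu(t) + (1+\epsilon'(\nu))\int \mathrm{price}(t)\,d\mu(t) + 2\beta(1+\epsilon'(\nu))\,\mathrm{KL}(\mu,\pi),$$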

with , ,

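$$\mathrm{price}(t) = \mathrm{pen}(t) + 2\sigma^2\,\mathrm{Tr}(A_t) + \frac{8\sigma^2(1-\delta)}{\beta-4\sigma^2V}(1+2\gamma V)^2\left[\widetilde{\|f_0-b\|}_\infty^2\,\mathrm{Tr}(A_t^2) + \|b_t\|_2^2\right]$$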

and .

The parameter is a technical parameter that can be optimized, provided is non empty. If , then for any . Thus as soon as with if we assume that . If we assume , we have to impose in order to have a non empty . Finally, if then and , and no optimization is required. Theorems 4.1 and 4.2 correspond to the case and the choice .

### A.2 General sketch of proof

Theorem A.1 relies on the characterization of Gibbs measure (Lemma 1) and a control of deviation of the empirical risk of any aggregate around its true risk.

$\rho_{EWA}$ is a Gibbs measure. Therefore it maximizes the entropy for a given expected energy. That is the subject of Lemma 1.1.3 in Catoni [12]:

###### Lemma 1.

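For any bounded measurable function $h$ and any probability distribution $\rho \in \mathcal{M}_+^1(T)$ such that $\mathrm{KL}(\rho,\pi) < +\infty$,

$$\log\left(\int \exp(h)\,d\pi\right) = \int h\,d\rho - \mathrm{KL}(\rho,\pi) + \mathrm{KL}(\rho,\pi_{\exp(h)}),$$

where by definition $\frac{d\pi_{\exp(h)}}{d\pi} = \frac{\exp(h)}{\int \exp(h)\,d\pi}$. Consequently,

$$\log\left(\int \exp(h)\,d\pi\right) = \sup_{\rho\in\mathcal{M}_+^1(T)} \int h\,d\rho - \mathrm{KL}(\rho,\pi).$$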
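Lemma 1 can be checked numerically on a finite set. The sketch below is only an illustration, with an arbitrary function `h` and prior `pi` chosen for the example: it verifies that the Gibbs measure, with weights proportional to $\pi(t)\exp(h(t))$, attains the supremum, while other measures stay below it.

```python
import math

def kl(rho, pi):
    """Kullback-Leibler divergence KL(rho, pi) on a finite set (0 * log 0 = 0)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def objective(h, rho, pi):
    """Value of int h d(rho) - KL(rho, pi)."""
    return sum(r * v for r, v in zip(rho, h)) - kl(rho, pi)

def gibbs(h, pi):
    """Gibbs measure with weights proportional to pi[t] * exp(h[t])."""
    shift = max(h)  # shift for numerical stability
    w = [p * math.exp(v - shift) for v, p in zip(h, pi)]
    total = sum(w)
    return [x / total for x in w]

def log_partition(h, pi):
    """log of int exp(h) d(pi)."""
    shift = max(h)
    return shift + math.log(sum(p * math.exp(v - shift) for v, p in zip(h, pi)))

# Arbitrary illustrative values.
h = [-1.0, -0.5, -3.0]
pi = [0.2, 0.5, 0.3]
rho_star = gibbs(h, pi)
```

Here `objective(h, rho_star, pi)` coincides with `log_partition(h, pi)` up to rounding, whereas the prior itself or a Dirac mass gives a smaller value.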

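With $h(t) = -\tilde r_t/\beta$ and $\rho = \rho_{EWA}$, this lemma states that for any probability distribution $\mu \in \mathcal{M}_+^1(T)$ such that $\mathrm{KL}(\mu,\pi) < +\infty$,

$$\int h\,d\rho - \mathrm{KL}(\rho,\pi) \ge \int h\,d\mu - \mathrm{KL}(\mu,\pi).$$

Equivalently,

$$\begin{aligned}
&\int \|f_0-\hat f_t\|_2^2\,d\rho(t) + \int \left(r_t - \|f_0-\hat f_t\|_2^2 + \mathrm{pen}(t)\right)d\rho(t) + \beta\,\mathrm{KL}(\rho,\pi)\\
&\qquad\le \int \|f_0-\hat f_t\|_2^2\,d\mu(t) + \int \left(r_t - \|f_0-\hat f_t\|_2^2 + \mathrm{pen}(t)\right)d\mu(t) + \beta\,\mathrm{KL}(\mu,\pi)\\
\Leftrightarrow\quad &\int \|f_0-\hat f_t\|_2^2\,d\rho(t) - \int \|f_0-\hat f_t\|_2^2\,d\mu(t)\\
&\qquad\le \int \left(\|f_0-\hat f_t\|_2^2 - r_t\right)d\rho(t) - \beta\,\mathrm{KL}(\rho,\pi) - \int \left(\|f_0-\hat f_t\|_2^2 - r_t\right)d\mu(t)\\
&\qquad\quad - \int \mathrm{pen}(t)\,d\rho(t) + \int \mathrm{pen}(t)\,d\mu(t) + \beta\,\mathrm{KL}(\mu,\pi).
\end{aligned}$$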

The key is to upper bound the right-hand side with terms that may depend on but only through and Kullback-Leibler distance. We will obtain two different controls in the sub-Gaussian case and the Gaussian one that provide upper bounds in probability (and in expectation) of type:

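$$\begin{aligned}
&\int \left(\|f_0-\hat f_t\|_2^2 - r_t\right)d\rho(t) - \int \left(\|f_0-\hat f_u\|_2^2 - r_u\right)d\mu(u)\\
&\qquad\le C_1\int \|f_0-\hat f_t\|_2^2\,d\rho(t) + C_2\int \|f_0-\hat f_u\|_2^2\,d\mu(u) + \int \left(C_3\,\mathrm{Tr}(A_t^2) + C_4\|b_t\|_2^2\right)d\rho(t)\\
&\qquad\quad + C_5\int \mathrm{Tr}(A_u)\,d\mu(u) + \int \left(C_6\,\mathrm{Tr}(A_u^2) + C_7\|b_u\|_2^2\right)d\mu(u) + \beta\left(\mathrm{KL}(\rho,\pi) + \mathrm{KL}(\mu,\pi) + \ln\frac{1}{\eta}\right)
\end{aligned}$$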

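where $C_1$ to $C_7$ are known functions. Combining with the previous inequality and taking $\mathrm{pen}(t) \ge C_3\,\mathrm{Tr}(A_t^2) + C_4\|b_t\|_2^2$ gives

$$(1-C_1)\int \|f_0-\hat f_t\|_2^2\,d\rho(t) - (1+C_2)\int \|f_0-\hat f_t\|_2^2\,d\mu(t) \le C_5\int \mathrm{Tr}(A_u)\,d\mu(u) + \int \left(C_6\,\mathrm{Tr}(A_u^2) + C_7\|b_u\|_2^2\right)d\mu(u) + \int \mathrm{pen}(u)\,d\mu(u) + \beta\left(2\,\mathrm{KL}(\mu,\pi) + \ln\frac{1}{\eta}\right).$$

The additional condition $C_1 < 1$ allows us to conclude. It is now clear that the whole work lies in the proof of the deviation lemma.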
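The concluding step can be made explicit. As a short sketch, assuming $C_1 < 1$ and using the convexity of the squared norm (Jensen's inequality) with $\rho = \rho_{EWA}$ and $f_{EWA} = \int \hat f_t\,d\rho(t)$:

$$\|f_0 - f_{EWA}\|_2^2 \le \int \|f_0-\hat f_t\|_2^2\,d\rho(t) \le \frac{1+C_2}{1-C_1}\int \|f_0-\hat f_t\|_2^2\,d\mu(t) + \frac{1}{1-C_1}\left(C_5\int \mathrm{Tr}(A_u)\,d\mu(u) + \int \left(C_6\,\mathrm{Tr}(A_u^2) + C_7\|b_u\|_2^2\right)d\mu(u) + \int \mathrm{pen}(u)\,d\mu(u) + \beta\left(2\,\mathrm{KL}(\mu,\pi) + \ln\frac{1}{\eta}\right)\right).$$

This has the announced form of the oracle inequality, with $1+\epsilon = (1+C_2)/(1-C_1)$ and $1+\epsilon' = 1/(1-C_1)$.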

### A.3 Proof of Theorem A.1

The proof follows from the scheme described in section A.2. The main point is to control

$$\int \left(\|f_0-\hat f_t\|_2^2 - r_t\right)d\rho(t) - \int \left(\|f_0-\hat f_t\|_2^2 - r_t\right)d\mu(t).$$

We recall that $A_t$ is a symmetric positive semi-definite matrix, that there exists $V$ such that $\sup_{t\in T}\|A_t\|_2 \le V$ and that $W$ is a centered sub-Gaussian noise. For any we denote