Bayesian Model Averaging with Exponentiated Least Squares Loss

Bayesian Model Averaging with Exponentiated Least Squares Loss

Dong Dai
IMS Health, PA, USA
   Ting Yang
Rutgers University, NJ, USA
   Tong Zhang
Baidu Inc, Beijing, China & Rutgers University, NJ, USA
Abstract

The model averaging problem is to average multiple models to achieve a prediction accuracy not much worse than that of the best single model in terms of mean squared error. It is known that if the models are misspecified, model averaging is superior to model selection. Specifically, let be the sample size, then the worst case regret of the former decays at a rate of while the worst case regret of the latter decays at a rate of . The recently proposed -aggregation algorithm (Dai et al., 2012) solves the model averaging problem with the optimal regret of both in expectation and in deviation; however it suffers from two limitations: (1) for continuous dictionary, the proposed greedy algorithm for solving -aggregation is not applicable; (2) the formulation of -aggregation appears ad hoc without clear intuition. This paper examines a different approach to model averaging by considering a Bayes estimator for deviation optimal model averaging by using exponentiated least squares loss. We establish a primal-dual relationship of this estimator and that of -aggregation and propose new greedy procedures that satisfactorily resolve the above mentioned limitations of -aggregation.

1 Introduction

This paper considers the model averaging problem, where the goal is to average multiple models in order to achieve improved prediction accuracy.

Let be given design points in a space , let be a given dictionary of real valued functions on and denote for each . The goal is to estimate an unknown regression function at the design points based on observations

where are i.i.d .

The performance of an estimator is measured by its mean squared error (MSE) defined by

We want to find an estimator that mimics the function in the dictionary with the smallest MSE. Formally, a good estimator should satisfy the following exact oracle inequality in a certain probabilistic sense:

(1)

where the remainder term should be as small as possible.

The problem of model averaging has been well-studied, and it is known (see, e.g., Tsybakov, 2003; Rigollet, 2012) that the smallest possible order for is for oracle inequalities in expectation, where “smallest possible” is understood in the following minimax sense. There exists a dictionary such that the following lower bound holds. For any estimator , there exists a regression function such that

for some positive constant . It also implies that the lower bound holds not only in expectation but also with positive probability.

Although our goal is to achieve an MSE as close as that of the best model in  , it is known (see Rigollet and Tsybakov, 2012, Theorem 2.1) that there exists a dictionary  such that any estimator taking values restricted to the elements of (such an estimator is referred to as a model selection estimator) cannot achieve an oracle inequality of form (1) with a remainder term of order smaller than ; in other words, model selection is suboptimal for the purpose of competing with the best single model from a given family.

Instead of model selection, we can employ model averaging to derive oracle inequalities of form (1) that achieves the optimal regret in expectation (see the references in Rigollet and Tsybakov, 2012). More recently, several work has produced optimal oracle inequalities for model averaging that not only hold in expectation but also in deviation (Audibert, 2008; Lecué and Mendelson, 2009; Gaïffas and Lecué, 2011; Dai and Zhang, 2011; Rigollet, 2012; Dai et al., 2012). In particular, the current work is closely related to the -aggregation estimator investigated in (Dai et al., 2012) which solves the optimal model averaging problem both in expectation and in deviation with a remainder term of order ; the authors also proposed a greedy algorithm GMA-0 for -aggregation which improves the Greedy Model Averaging (GMA) algorithm firstly proposed by Dai and Zhang (2011). Yet there are still two limitations of -aggregation: (1) -aggregation can be generalized for continuous candidates dictionary , but the greedy model averaging method GMA-0 can not be applied in such setting; (2) -aggregation can be regarded intuitively as regression with variance penalty, but it lacks a good decision theoretical interpretation.

In this paper we introduce a novel method called Bayesian Model Averaging with Exponentiated Least Squares Loss (BMAX). We note that the previously studied exponential weighted model aggregation estimator EWMA (e.g. Rigollet and Tsybakov, 2012) is the Bayes estimator under the least squares loss (posterior mean), which leads to optimal regret in expectation but is suboptimal in deviation. In contrast, the new BMAX model averaging estimator is essentially a Bayes estimator under an appropriately defined exponentiated least squares loss, and we will show that the -aggregation formulation (with Kullback-Leibler entropy) in Dai et al. (2012) is essentially a dual representation of the newly introduced BMAX formulation, and it directly implies the optimality of the aggregate by BMAX. Computationally, the new model aggregation method BMAX can be approximately solved by a greedy model averaging algorithm that is applicable to continuous candidates dictionary. In summary, this paper establishes a natural Bayesian interpretation of -aggregation, and provides additional computational procedure that is applicable for the continuous dictionary setting. This relationship provides deeper understanding for modeling averaging procedures, and resolves the above mentioned limitations of the -aggregation scheme.

2 Notations

This section introduces some notations used in this paper. In the following, we denote by the observation vector, the model output, and the noise vector. The underlying statistical model can be expressed as

(2)

with . We also denote norm as , and the inner product as .

Let be the flat simplex in defined by

and be a given prior.

Each yields a model averaging estimator as ; that is, using the vector notation we have .

The Kullback-Leibler divergence for is defined as

and in the definition we use the convention .

For matrix , is equivalent to is positive semi-definite.

3 Bayesian Model Averaging with Exponentiated Least Squares Loss

The traditional Bayesian model averaging estimator is the exponential weighted model averaging estimator EWMA (Rigollet and Tsybakov, 2012) which optimizes the least squares loss. Although the estimator is optimal in expectation, it is suboptimal in deviation (Dai et al., 2012). In this section we introduce a different Bayesian model averaging estimator called BMAX that optimizes an exponentiated least squares loss.

In order to introduce the BMAX estimator, we consider the following Bayesian framework, where we should be noted that the assumptions below are only used to derive BMAX, and these assumptions are not used in our theoretical analysis. is a normally distributed observation vector with mean and covariance matrix :

(3)

and for , the prior for each model is

(4)

In this setting, the posterior distribution of given is

In the Bayesian decision theoretical framework considered in this paper, the quantity of interest is , and we consider a loss function which we would like to minimize with respect to the posterior distribution. The corresponding Bayes estimator minimizes the posterior expected loss from as follows:

(5)

It is worth pointing out that the above Bayesian framework is only used to obtain decision theoretically motivated model averaging estimators (Bayesian estimators have good theoretical properties such as admissibility, etc). In particular we do not assume that the model itself is correctly specified. That is, in this paper we allow misspecified models, where the parameters and are not necessarily equal to the true mean and the true variance in (2), and does not necessarily belong to the dictionary .

The Bayesian estimator of (5) depends on the underlying loss function . For example, under the standard least squares loss , the Bayes estimator is the posterior mean, which leads to the Exponential Weighted Model Aggregation (EWMA) estimator (Rigollet and Tsybakov, 2012) :

(6)

This estimator is optimal in expectation  (Dalalyan and Tsybakov, 2007, 2008), but suboptimal in deviation  (Dai et al., 2012).

In this paper, we introduce the following exponentiated least squares loss motivated from the exponential moment technique for proving large deviation tail bounds for sums of random variables:

(7)

where the parameter . It is easy to verify that the Bayes estimator defined by (5) with the loss function defined in (7) can be written as

(8)

where

(9)

The estimator will be referred to as the Bayesian model aggregation estimator with exponentiated least squares loss (BMAX).

To minimize , it is equivalent to minimize . Lemma 1 below shows the strong convexity and smoothness (under some conditions) of .

Given and , we define

(10)

moreover, if -norm of every is bounded by a constant :

(11)

and are defined as

(12)
(13)
Lemma 1.

For any , define the Hessian matrix of as , then we have

(14)

If satisfies condition (11), then

(15)

where and are defined in (10) and (12).

4 Dual Representation and -aggregation

In this section, we will show that the -aggregation scheme of Dai et al. (2012) with the standard Kullback-Leibler entropy solves a dual representation of the BMAX formulation defined by (8) and (9).

Given and , -aggregation is defined as following:

(16)

where such that

(17)
(18)

for some , where the -entropy is defined as

(19)

where is a real valued function on satisfying

(20)

When , becomes , the Kullback-Leibler entropy. When , , a linear entropy in , and in particular the penalty in  (18) becomes a constant when is a flat prior.

To illustrate duality, we shall first introduce a function as

(21)

and denote the maximizer of as

(22)

Define function as

(23)

It is not difficult to verify that for , is convex in and concave in . The following duality lemma states the relationship between and .

Lemma 2.

When , we have the following result

where the equality is achieved at . Moreover, we have

where and are two hyper-surfaces in defined as

(24)

Lemma 2 states that, is the only joint of hyper-surfaces and , and the saddle point of function over space .

With defined as in (21), we can employ the transformation and it is easy to verify that

(25)

where is defined in (9). Since is strict convex, is strictly concave and is unique.

It follows that maximizing is equivalent to minimizing , and thus

We can combine this representation with

from Lemma 2 to obtain . Therefore we have the following relationship:

Theorem 1.

When ,

where is defined by (8) and (9), and is defined by (16),(17) and (18).

Theorem 1 states that, when , becomes the Kullback-Leibler entropy, and -aggregation with the Kullback-Leibler entropy leads to an estimator that is essentially a dual representation of the BMAX estimator . It follows that, should share the same optimality (both in expectation and deviation) as in solving the model averaging problem (optimality of is shown in Theorem 3.1 of Dai et al. (2012) with more general where only needs to satisfy the condition (20)).

However, unlike the primal objective function which is defined on , the dual objective function is defined on . When is large or infinity, the optimization of is non-trivial. Although greedy algorithms are proposed in (Dai et al., 2012), they cannot handle the standard KL-divergence; instead they can only work with the linear entropy where ; it gives a larger penalty than the standard KL-divergence (and thus worse resulting oracle inequality), and it cannot be generalized to handle continuous dictionaries (because in such case, the linear entropy with will always be ). Therefore the numerical greedy procedures of (Dai et al., 2012) converges to a solution with a worse oracle bound than that of the solution for the primal formulation considered in this paper.

The following two corollaries (Corollary 1 and Corollary 2) are listed for illustration convenience, they are directly implied from Theorem 1 and optimality of in Theorem 3.1 of Dai et al. (2012).

Corollary 1.

Assume that and . For any , the following oracle inequality holds

(26)

with probability at least . Moreover,

(27)

Since , our theorem implies that can compete with an arbitrary in the convex hull with as long as the variance term and the divergence term are small.

Although for notation simplicity, the result is stated for finite dictionary , the analysis of the paper directly applies to infinite dictionaries where as well as continuous dictionaries. For example, given a matrix , we may consider a continuous dictionary indexed by vector as . We may consider the uniform prior on , and the corollary is well-defined as long as a distribution on is concentrated around a single model (so that the variance term corresponding to is small) with finite KL divergence with respect to . For example, can be chosen as the uniform distribution on a small ball . In such case, and the variance term is small when is small. Corollary 1 can be applied to derive an oracle inequality that competes with any single model .

In the case that is finite, we can more directly obtain an oracle inequality that competes with the best single model, which is the situation that is at a vertex of the simplex :

Corollary 2.

Under the assumptions of Corollary 1, satisfies

(28)

with probability at least . Moreover,

(29)

It is also worth pointing out that the condition implies that is at least greater than (when ), and intuitively this inflation of noise allows the Bayes estimator to handle misspecification of the true mean , which is not necessarily included in the candidate dictionary .

Finally we note that in the Bayesian framework stated in this section, when we change the underlying loss function from the standard least squares loss to the exponentiated least squares loss (7), Bayes estimator changes from EWMA which is optimal only in expectation to BMAX which is optimal both in expectation and in deviation. The difference is that the least squares loss only controls the bias, while the exponentiated least squares loss controls both bias and variance (as well as higher order moments) simultaneously. This can be seen by using Taylor expansion

Since deviation bounds require us to control high order moments, the exponentiated least squares loss is naturally suited for obtaining deviation bounds.

5 Greedy Model Averaging Algorithm (GMA-BMAX)

Strong convexity of directly implies that the minimizer is unique. Moreover, it implies the following result which means that an estimator that approximately minimizes satisfies an oracle inequality slightly worse than that of in Corollary 1. This result suggests that we can employ appropriate numerical procedures to approximately solve (8), and Corollary 1 implies an oracle inequality for such approximate solutions.

Proposition 1.

Let be an -approximate minimizer of for some :

Then we have

Proof.

The strong convexity of in (14) implies that

Now plug the above inequality to the following equation

we obtain the desired bound. ∎

The GMA-BMAX algorithm given in Algorithm 1 is a greedy algorithm that adds at most one function from the dictionary at each iteration. This feature is attractive as it outputs a -sparse solution that depends on at most functions from the dictionary after iterations. Similar algorithms for model averaging have appeared in Dai and Zhang (2011) and Dai et al. (2012).

0:  Noisy observation , dictionary , prior , parameters .
0:  Aggregate estimator .
  Let .
  for  do
     Set
     
     
  end for
Algorithm 1 Greedy Model Averaging Algorithm (GMA-BMAX)

The following proposition follows from the standard analysis in Frank and Wolfe (1956); Jones (1992); Barron (1993). It shows that the estimator from Algorithm 1 converges to .

Proposition 2.

For as defined in Algorithm 1 (GMA-BMAX), if satisfies condition (11), then

(30)

Proposition 2 states that, after running algorithm GMA-BMAX for steps to obtain , the corresponding objective value converges to the optimal objective value at a rate . Combine this result with Proposition 1, we obtain the following oracle inequality, which shows that the regret of the estimator after running steps of GMA-BMAX converges to that of in Corollary 1 at a rate .

Corollary 3.

Assume and if . Consider as in Algorithm 1 (GMA-BMAX). For any , the following oracle inequality holds

(31)

with probability at least . Moreover,

(32)

From Corollary 3, if , for any we have

with probability at least , and

When , achieves the optimal deviation bound. However, it does not imply optimal deviation bound for with small , while the greedy algorithms described in Dai and Zhang (2011) (GMA) and Dai et al. (2012) (GMA-0) achieve optimal deviation bounds for small when . The advantage of GMA-BMAX is that the resulting estimator competes with any with under the KL entropy, and such a result can be applied even for infinity dictionaries containing functions indexed by continuous parameters, as long as the KL divergence is well-defined (see relevant discussions in Section 3). On the other hand, the greedy estimators of Dai et al. (2012) for the -aggregation scheme can only deal with an upper bound of KL divergence referred to as linear entropy (see Section 4) that is not well-defined for continuous dictionaries. This means that GMA-BMAX is more generally applicable than the corresponding greedy algorithm GMA-0 in (Dai et al., 2012).

6 Experiments

Although the contribution of this work is mainly theoretical, we include some simulations to illustrate the performance of greedy model averaging algorithm GMA-BMAX proposed for the BMAX formulation. We focus on the average performance of different algorithms and configurations.

Set and . We identify a function with a vector . Let denote the identity matrix of and let be a random vector, and define as

(33)

where are independent random vectors.

Let be a random vector. The regression function is defined by . Note that typically will be the closest function to but not necessarily. The noise vector is drawn independently of where .

We define the oracle model (OM) , where . The model is clearly not a valid estimator because it depends on the unobserved , however it can be used as a performance benchmark. The performance difference between an estimator and the oracle model is measured by the regret defined as:

(34)

Since the target is , and and are random Gaussian vectors, the oracle model is likely (but it may not be due to the misspecification vector ). The noise is relatively large, which implies a situation where the best convex aggregation does not outperform the oracle model. This is the scenario we considered here. For simplicity, all algorithms use a flat prior for all . The experiment is performed with 100 replications.

One method we compare to is the STAR algorithm of Audibert (2008) which is optimal both in expectation and in deviation under the uniform prior. Mathematically, suppose is the empirical risk minimizer among functions in , where

(35)

the STAR estimator is defined as

(36)

where

(37)

Another natural solution to solve the model averaging problem is to take the vector of weights defined by

(38)

which minimizes the empirical risk. We call the vector of projection weights since the aggregate estimator is the projection of onto the convex hull of the ’s.

-aggregation (with Kullback-Leibler entropy when ) is a dual representation of BMAX, which will be solved by greedy model averaging algorithm GMA-BMAX, while -aggregation (with linear entropy when ) can be solved by GMA-0 from Dai et al. (2012) (see Algorithm 2 below).

0:  Noisy observation , dictionary , prior , parameters .
0:  Aggregate estimator .
  Let , .
  for  do
     Set
     
     
  end for
Algorithm 2 GMA-0 Algorithm

We adopt flat priors () for simplicity. From the definition of (18), it is easy to see that, the minimizer of (when with flat prior) becomes in (38) by setting , so is approximated by GMA-0 with and 200 iterations, and the projection algorithm is denoted by “PROJ”. GMA-BMAX and GMA-0 are run for iterations up to , with . Parameter for GMA-BMAX and exponential weighted model averaging (denoted by “EWMA”) is tuned by 10-fold cross-validation. STAR estimator is also included. Regrets of all algorithms defined in (34) are reported for comparisons.

In the following, we consider two scenarios. The first situation is when the basis are not very correlated; in such case GMA-0 can perform better than GMA-BMAX because the former (which employs linear entropy) produces sparser estimators. The second situation is when the basis are highly correlated; in such case GMA-BMAX is superior to GMA-0 because the former (which employs strongly convex KL-entropy) gives the clustered basis functions similar weights while the latter tends to select one from the clustered basis functions which may lead to model selection error. The correlated basis situation occurs in the continuous dictionary setting.

6.1 Experiment 1: when and , basis are not very correlated

Table 1 is a comparison of commonly used estimators (STAR, EWMA and PROJ) with GMA-BMAX and GMA-0. The regrets are reported using the “” format. Table 2 is the cumulative frequency table of GMA-BMAX, GMA-0 and EWMA with 100 replicates and fixed iteration . For each entry, we summarize the number of replicates with regrets which are smaller than or equal to the upper boundary value.

STAR EWMA PROJ
GMA-BMAX
GMA-0
Table 1: Performance Comparison ( and )
Upper Boundary
GMA-BMAX
GMA-0
EWMA
Table 2: Cumulative Frequency of Regret (, and )

The results in Table 1 indicate that GMA-0 has the best performance as iteration increases, and GMA-0 outperforms STAR, EWMA and PROJ after as small as iterations, which still gives a relatively sparse averaged model. This is consistent with Theorem 4.1 and 4.2 in Dai et al. (2012) which states that GMA-0 has optimal bounds for small ().

GMA-BMAX prefers dense model putting similar weights on similar candidates. Specifically, GMA-BMAX is greedy algorithm with estimators converge to , the aggregate by BMAX as defined in (8). It is easy to verify that with defined as

thus two similar candidates and will have similar weights.

In contrast, GMA-0 prefers sparse model by selecting candidate less related to estimator from previous iteration. Specifically, with flat prior assumed, the choice of in GMA-0 algorithm can be further simplified to

(39)

thus at each iteration in GMA-0 algorithm, estimator is preferred if it has similar distance to yet being less correlated to current aggregate estimator (because the minimization requires to be large while being small).

In Experiment 1, the first 50 candidates are closer to truth than other candidate (), yet they are not very correlated when . Sparsity is preferred when correlations are not strong among predicting features (in our experiment, the first 50 candidates), and GMA-0 is to output sparser estimator than GMA-BMAX. Therefore, we would expect GMA-0 achieving smaller regret than GMA-BMAX in this situation.

Although GMA-BMAX is worse than GMA-0 when the basis are not very correlated, it beats EWMA, STAR and PROJ when iteration is large enough.

Figure 1: (a)-(c) show the histograms of regrets for GMA-BMAX, GMA-0 and EWMA with ; (d) reports the performance by plotting regrets versus iterations for case and .

Figure 1 compares the performance of GMA-BMAX, GMA-0 and EWMA. (a), (b) and (c) illustrate the histograms of regrets with 100 replicates. The corresponding cumulative frequencies are presented in Table 2. Since Corollary 3 indicates the optimal deviation bound is obtained by , we pick up for GMA-BMAX and GMA-0 in order to make a fair comparison. As the histograms show, although EWMA has the most replicates which are close to zero, the distribution of EWMA estimator is the most dispersive with the most extreme values among these three methods. The performance is consistent with  Dalalyan and Tsybakov (2007, 2008) and Dai et al. (2012) which state that EWMA estimator is optimal in expectation but sub-optimal in deviation. Therefore, we would expect GMA-BMAX and GMA-0 enjoy more concentrated distribution than EWMA because they are also optimal in deviation. (d) shows the convergence of GMA-BMAX and GMA-0. Note that GMA-BMAX and GMA-0 both initialize with , but they produce difference estimators after the first iteration (), GMA-BMAX selects that minimizes , whiles GMA-0 selects that minimizes and the first stage output is actually the empirical risk minimizer