Meta-Learning PAC-Bayes Priors in Model Averaging

# Meta-Learning PAC-Bayes Priors in Model Averaging

## Abstract

Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model’s uncertainty to improve reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor quality data.

## 1 Introduction

It is very common in practice that the distributions generating the observed data are described more adequately by multiple models. A standard procedure to make the inference is that according to some criteria, such as model predictive ability, model fitting ability, and many different information criteria the best model is chosen and assumed as the true model. After selection, all the inferences and conclusions are made based on the assumption.

However, the drawbacks of this approach exist. The selection of one particular model may lead to riskier decisions since it ignores the model uncertainty. In other words, if we choose the wrong model, the consequence will be disastrous. \citeauthorMoral2015Model \shortciteMoral2015Model already pointed out the concern, “From a pure empirical viewpoint, model uncertainty represents a concern because estimates may well depend on the particular model considered.” Therefore, combining multiple models to reduce the model uncertainty is very desirable.

As an alternative strategy, combining multiple models which is called model averaging enables researchers to draw conclusions based on the whole universe of candidate models. In particular, researchers estimate all the candidate models and then compute a weighted average of all the estimates. There are two different approaches to model averaging in the literature, including Frequentist Model Averaging (FMA) and Bayesian Model Averaging (BMA). Frequentist approaches focus on improving prediction and use the weighted mean of estimates from different models while Bayesian approaches focus on the probability that a model is true and consider priors and posteriors for different models.

The FMA approach does not consider priors, so the corresponding estimators depend solely on data. For its simplicity, the FMA approach has received some attention over the last decade. See \citeauthorYang2001Adaptive \shortciteYang2001Adaptive, \citeauthorleung2006information \shortciteleung2006information, \citeauthorhansen2007least \shortcitehansen2007least and a detailed review \citeauthorwang2009frequentist \shortcitewang2009frequentist for reference.

\citeauthor

leamer1978specification \shortciteleamer1978specification suggested to use Bayesian inference to reduce the model uncertainty as a framework and pointed out the importance of the fragility of regression analysis to arbitrary decisions about the choice of control variables. Bayesian Model Averaging considers model uncertainty through the prior distribution. The model posteriors are obtained by Bayes’ theorem, and therefore allow for combined estimation and prediction. Compared with the FMA approaches, there is a huge literature on the use of BMA in statistics.

Influenced by \citeauthorleamer1978specification \shortciteleamer1978specification, most works were concentrated on the linear models only. \citeauthorRaftery1996Approximate \shortciteRaftery1996Approximate extended in generalized linear models by providing a straightforward approximation. For more details, refer to a landmark review \citeauthorMoral2015Model \shortciteMoral2015Model on BMA.

The Bayesian approaches have the advantage of using arbitrary domain knowledge through a proper prior. However, as commented by \citeauthorhjort2003frequentist \shortcitehjort2003frequentist, how to set prior probabilities and how to deal with the priors when they conflict with each other are still problems. The PAC-Bayes framework, first formulated by \citeauthorMcallester1999PAC \shortciteMcallester1999PAC, was proposed to take the priors into account. In the beginning, most researches assumed that loss functions were bounded. For detailed information, see \citeauthorcatoni2007pac \shortcitecatoni2007pac. For unbounded loss, \citeauthorcatoni2004statistical \shortcitecatoni2004statistical provided a result under exponential moment assumptions. In last decade, it has been widely developed. Different types of PAC-Bayes bounds were presented under various assumptions, for example, \citeauthorSeeger2002PAC \shortciteSeeger2002PAC, \citeauthorseldin2010pac \shortciteseldin2010pac, \citeauthorguedj2013pac \shortciteguedj2013pac, \citeauthoralquier2016properties \shortcitealquier2016properties, \citeauthorgrunwald2016fast \shortcitegrunwald2016fast, \citeauthorcatoni2016pac \shortcitecatoni2016pac, \citeauthorlugosi2019regularization \shortcitelugosi2019regularization, and \citeauthoralquier2018simpler \shortcitealquier2018simpler. And, many distribution-dependent priors are used to derive tighter PAC-Bayes bounds such as \citeauthorLever2013Tighter \shortciteLever2013Tighter, \citeauthoroneto2016pac \shortciteoneto2016pac, \citeauthordziugaite2018data \shortcitedziugaite2018data and \citeauthorrivasplata2018pac \shortciterivasplata2018pac. Here, we must distinguish between obtaining the tighter bounds by distribution-dependent priors and using part of the data to meta-learn a prior and the rest to learn the function.

Note that for getting the posterior distribution of the weights, \citeauthorYe2016Sparsity \shortciteYe2016Sparsity gave a method without choosing a proper prior. For meta-learning the prior, some meta-learners [\citeauthoryearFinn, Abbeel, and Levine2017, \citeauthoryearLi et al.2017, \citeauthoryearAmit and Meir2018] are limited to their use of gradient. \citeauthoramit2018meta-learning \shortciteamit2018meta-learning provided an extended PAC-Bayes bound for learning the proper priors. Nevertheless, it involved reusing the data which increased the probability of overfitting, and they gave an implementation only for the normal distribution of the weights.

In this paper, we propose a specific risk bound under our settings and two data-based methods for adjusting the priors in the PAC-Bayes framework. And, two practical algorithms are given accordingly. The main contributions of this work are the following. First, when the historical data existed, we use similar old tasks to extract mutual knowledge with the current task for adjusting the priors. Second, a sequential batch sampling method is proposed to deal with the base-learner for learning posterior by subsampling with the rules made by researchers. Third, two theoretical risk bounds are provided for these two situations respectively. Fourth, empirical demonstration shows that the proposed meta-methods have excellent performances in the numerical studies.

The remainder of this paper is organized as follows. In Section 2, an upper bound for the averaging model and a practical historical data related algorithm are established for obtaining a better prior. In case that there is no historical data, Section 3 proposes another method called a sequential batch sampling algorithm to adjust the prior step by step. Illustrative simulations including regression and classification tasks given in Section 4 show that our algorithms will lead to more effective prediction. We further apply the proposed methods to two real datasets and confirm the higher prediction accuracy of the minimizing risk bound method. Some proofs of theories are delegated to the supplementary materials.

## 2 Learning the Prior in Meta-Learner

In a traditional supervised learning task, the learner needs to find an optimal model (or hypothesis) to fit the data and then uses the learned model to make predictions. In the Bayesian approach, various models are allowed to fit the data. In particular, the learner needs to learn an optimal model distribution over the candidate models and then uses the learned model distribution to make predictions.

More specifically, in a supervised learning task, we are given a set of i.i.d. samples drawn from an unknown distribution over , i.e., . The goal is to find a model in the candidate model set , a set of functions mapping features (feature vector) to responses, that minimizes the expected loss function , where is a bounded loss function. Without loss of generality, we assume is bounded by . Note that this assumption is often used at the beginning of the PAC-Bayes framework. This paper uses McAllester’s bound, so this assumption is necessary. Note that our procedure can use other, possibly tighter bounds. In other words, under different regularization conditions, it can be replaced by many other PAC-type bounds as long as the assumption matches the corresponding PAC bound. In the Bayesian framework, a distribution over is the purpose instead of searching a specific optimal model . Therefore, the goal turns to find the optimal model distribution , which minimizes . Then one could use the weighted average of these models over to make predictions, namely, . More generally, we further assume that the candidate model set consists of classes of models with . Each model class is associated with a probability , and for each model class , there is a distribution over . For example, a model class could be a group of models obtained from the Lasso method, and the hyper-parameter in Lasso follows a distribution . Another common example is that is a group of neural networks with a certain architecture, and the hyperparameters of neural networks follow a joint distribution . In this way, the total distribution over can be written as , where consists of with . The goal of the learning task is to find an optimal distribution , the posterior of , which minimizes the expected risk , and then the prediction is made by .

Since sample distribution is unknown, the expected risk cannot be computed directly. Therefore, it is usually approximated by the empirical risk in practice, and is learned by minimizing the empirical risk . When the sample size is large enough, it would be a good approximation. However, in many situations, we do not have so much data, which may lead to a large difference between them. Thus, using the empirical risk to approximate the expected risk is not appropriate any longer.

We first study the difference between the empirical risk and the expected risk . Based on the literature [\citeauthoryearMcallester1999], we can obtain an upper bound of their difference which is stated as the following lemma.

###### Lemma 1.

Let be a prior distribution over that must be chosen before observing the samples, and let . Then with probability at least , the following inequality holds for all posterior distributions over ,

 R(ξ,D)≤^R(ξ,S) + ⎷KL(w||w0)+∑Kk=1wkKL(Qk||Q0k)+lnnδ2(n−1), (1)

where is the cardinality of sample set , and is the Kullback-Leibler (KL) divergence between two distributions1.

According to the above lemma, it reveals two facts. (a) The upper bound can be divided into two terms. The first term is called sample complexity caused by the randomness of sampling. The second term is called model complexity caused by the difference between the true distribution of data generation models and the distribution learned for predicting the data. (b) It is clear that only when the sample size is large, the difference can be guaranteed to be small. Thus, minimizing may not lead to the minimizer of , which matches our intuition. To avoid the risk of the approximation, one can minimize the upper bound of the expected risk instead of using the empirical risk as an approximation. In particular, we denote the right hand side of Eq.(1) by . Then one can learn the model distribution by minimizing . Intuitively, such a choice of for the learning task makes the worst case best.

Lemma 1 also indicates that the prior plays an important role. Since the choice of balances the tradeoff between the empirical risk and the regularization term, if the prior is far away from the true optimal model distribution , the posterior will also be bad. The best situation for optimizing the posterior is that the prior exactly equals the true optimal model distribution . Then, the regularization term disappears. In other words, if there is a good prior which is close to , the upper bound will be small. However, without any prior knowledge, one can only use data to help obtain a better prior. The naive method is directly using the non-informative prior as for minimizing to get the posterior .

When the extra data of historical tasks have been collected, the learner has the chance to learn a good prior to more reliable inferences. To get a good prior, it is helpful to extract mutual knowledge from similar tasks. Figure 1 schematically illustrates this process.

In particular, there are sample tasks i.i.d. generated from an unknown task distribution . For each sample task , a sample set with samples is generated from an unknown distribution . Without ambiguity, we use the notation to denote the posterior under the prior after observing the sample set . Note that, the proposed meta-learner still works if the base-learner of getting the posterior is replaced by other popular methods. The quality of a prior is measured by . Thus, the expected loss we want to minimize is

 R(ξ0,τ)=EDi∼τESi∼DniiR(ξ(ξ0,Si),Di).

The above expected risk cannot be computed directly for the unknown distribution , thus the following empirical risk is used to estimate it:

 ^R(ξ0,S1,…,Sm)=1mm∑i=1^R(ξ(ξ0,Straini),Svalidationi),

where each sample set is divided into a training set and a validation set .

Consider the regression setting for task . Suppose the true model is

 yT=fT(xT)+σT(xT)⋅εT,

where is the function to be learned, the error term is assumed to be independent of and has a known probability density with mean and a finite variance. The unknown function controls the variance of the error at . There are i.i.d. samples drawn from an unknown joint distribution of . Assume that there is a candidate model set , each of which is a function mapping feature (feature vector) to response, i.e., . To take the information of the old tasks, which can reflect the importance of each , the following Algorithm 1 (meta-learner) is proposed.

This algorithm is based on the cross-validation framework. First, using to obtain the candidate priors by any Bayesian procedure, for example, minimizing the PAC bound introduced in Lemma 1 with non-informative prior. Cross-validation determines the importance of the priors. The -th task is divided into two parts randomly. The first part is used to learn the posterior with the prior . The second part is to evaluate the performance of the posterior by its likelihood function. This evaluation is inspired by \citeauthorRaftery1995Bayesian \shortciteRaftery1995Bayesian. To simplify the determination of the weights, \citeauthorRaftery1995Bayesian \shortciteRaftery1995Bayesian proposed a frequentist approach to BMA. The Bayes’ theorem was replaced by the Schwarz asymptotic approximation which could be viewed as using maximized likelihood function as the weights of the candidate models. The on the denominator of makes the weight larger if the model is accurate. This procedure repeats many times for each pair . Their averages reveal the importance of the priors. In the end, the is obtained by weighted averaging them all. The property of this algorithm can be guaranteed by Lemma 2.

The following regularization conditions are assumed for the results. First, is assumed to be a known distribution with and variance .

• The functions and are uniformly bounded, i.e., and for constants and .

• The error distribution satisfies that for each and , there exists a constant such that

 ∫q(x)lnq(x)1sq(x−ts)μ(dx)≤B((1−s)2+t2)

for all and .

• The risks of the estimators for approximating and decrease as the sample size increases.

For the condition (C1), note that when we deal with -way classification tasks, the responses belong to which is bounded obviously. Moreover, if the input space is a finite region that often happens in real datasets, the most common functions are bounded uniformly. The constants are involved in the derivation of the risk bounds, but they can be unknown in practice when we implement the Algorithm 1. The condition (C2) is satisfied by Gaussian, (with the degree of freedom larger than two), double-exponential, and so on. The condition (C3) usually holds for a good estimating procedure, like consistent estimators. An estimator is called consistent if the expected risk tends to zero when the experimental size tends to infinity.

###### Lemma 2.

Assume three regularization conditions are satisfied. The loss function and is known. Then, the combined prior as given above satisfies

 R(ξ∗,τ)≤infj(C1∑i≠j(ni−n′i) +C2∑i≠j(ni−n′i)∑i≠j(ni−n′i)R(ξj,Di)),

with probability at least , where the constant depend on the regularization conditions.

Note that we assume a known just for simplifying the expression. It has a more general version for unknown . The proof is given briefly with unknown in Supplementary Materials.

In this general prove, it can be seen that (i) Variance estimation is also important for the Algorithm 1. Even if a procedure estimates very well, a bad estimator of can substantially reduce its weight in the final estimator. (ii) Under the condition (C3), the risk of a good procedure for estimating and usually decreases as the sample size increases. The influence of the number of testing points is quite clear. Smaller decreases the first penalty term but increases the main terms that involve the risks of each . (iii) Lemma 2 reveals the vital property that if one alternative model is consistent, the combined model will also have consistency.

If Algorithm 1 is used to obtain the prior from multi-tasks, we could get the following theorem theoretically by simply combining Lemmas 1 and 2. See supplementary materials for the detailed proofs.

###### Theorem 1.

Assume conditions (C1), (C2) and (C3) are satisfied. The loss function and is known. Then, the combined posterior as given above satisfies

 R(ξ∗,τ)≤infj⎛⎝C1∑i≠j(ni−n′i) +C2∑i≠j(ni−n′i)∑i≠j(ni−n′i)[^R(ξ∗j,S(2)i,n′i) + ⎷KL(ω∗j||ωj)+∑Kk=1ωj,kKL(Q∗j,k||Qj,k)+lnniδ2(ni−1)]⎞⎟⎠

with probability at least , where the constant depend on the regularization conditions, is the initial prior which should be non-informative prior and is the minimizer of Eq.(1) with and .

The major contribution of Theorem 1 is that it implies a simple consequence of consistency. The penalty for adaptation (the first term in RHS of Theorem 1) is basically of order , which is negligible for nonparametric rates. Thus, for any bounded regression function, the combined model performs asymptotically as well as any model in the candidate model set . The detailed expression refers to Corollary 1 in \citeauthoryang2000combining \shortciteyang2000combining. Further, even if the underlying true model is not in the candidate set, the combined model may still be able to approach the true model, e.g, there exists a sequence of models in the set approaching an optimal one.

For classification tasks, the one-hot response is used which means the -th response in the model is a vector describing the probability of each class. The loss function still uses to maintain consistency with the regression case. The prediction of a new observation is to choose the class with the largest probability in . Thus, we do not need to change the condition in Theorem 1, but just change the response to be one-hot for classification. Consequently, the results for regression still hold for classification.

Besides the risk that we consider, other performance measures (e.g., cross entropy loss which is used in our last experiment with the MNIST dataset or hinge loss) are useful from both theoretical and practical points of view. It is thus of interest to investigate whether similar adaptation procedures exist for other loss functions and if not, what prices are one needs to pay for adaptation, which we leave for future work.

## 3 Adjusting the Prior in Base-Learner

In this section, we will discuss how to adjust the prior of models if there is no information from extra similar tasks.

In the following, we consider an iterative procedure of adjusting the prior in the base-learner. In each round, the learner can sample the data according to the prior distribution in the current round. Such iterative procedure updates the prior step by step. Ultimately, compared with dealing with the whole data at once, this procedure of adjusting prior leads to a smaller upper bound. Moreover, it also gives an opportunity to choose some good sample sets for reducing the volatility of the estimators which is measured by . The function is defined to measure the volatility of the posterior at the sample set . The complete algorithm for sequential batch sampling is shown in Algorithm 2.

For Algorithm 2, we do not handle the whole data at once. Instead, the data is processed in steps. First, a space-filling design is used as the initial experiment points to reduce the probability of overfitting caused by unbalanced sampling. The traditional space-filling design aims to fill the input space with design points that are as “uniform” as possible in the input space. The uniformity of space-filling design is illustrated in Figure 2. For the next steps, uncertain points are needed to be explored. And, the uncertainty is measured by the volatility . Hence, the batch with large volatility will be chosen. Note that if we set a huge , we will just explore a small region of the input space.

The setting of refers to [\citeauthoryearZhou et al.2018]. However, in practice, it is found that this parameter does not matter much since the results are similar to a wide range of . This procedure helps to reduce the variance of the estimator which is proved in [\citeauthoryearZhou et al.2018] by sequential sampling. Furthermore, it also helps to adjust the prior in each step which is called learning the prior. The proposition is stated below.

###### Proposition 1.

For , let , is the minimizer of the RHS of Eq.(1) with non-informative prior and obtained by Algorithm 2, then we have .

The above proposition can be understood straightforwardly. First, since we adjust the prior through the data step by step, the final prior is better than the non-informative prior. Consequently, it receives a smaller expected risk. Second, we choose the sample sets sequentially with large volatility to do experiments in order to reduce uncertainty. The performance of the Sequential Batch Sampling (SBS) method is also demonstrated in Section 4.

## 4 Experiments

In this section, some examples are shown to illustrate the procedure of Algorithms 1 and 2. The method of minimizing the upper bound in Lemma 1 with non-informative prior is denoted by RBM (Risk Bound Method). First, we begin with linear regression models that have the same setting in \citeauthorYe2016Sparsity \shortciteYe2016Sparsity. Hence, their method called SOIL is under comparison. The optimization for RHS of Eq.(1) in our algorithms is dealt with gradient descend. R package “SOIL” is used to obtain the results of the SOIL method.

### 4.1 Synthetic Dataset

###### Example 1.

The simulation data is generated for the RBM from the linear model , where , and . For each element of , or with . The sequential batch sampling has steps, and each step uses samples following Algorithm 2.

All the specific settings for parameters are summarized in Table 1, and the confidence level in Lemma 1 is set to . The Mean Squared Prediction Error (MSPE) and volatility defined in the base-learner are compared. They are obtained by sampling samples from the same distribution and computing their empirical MSPE and volatility. For each model setting with a specific choice of the parameters , we repeat times and compute the average empirical value. The comparison among RBM, SOIL and SBS is shown in Tables 2, 3 and 4.

The volatility of the SOIL method is the smallest and very close to zero. This phenomenon shows that SOIL is focused on a few models, even just one model when the volatility equals zero. Consequently, its MSPE is larger than the other two methods. SBS as a modification of RBM has similar results with RBM when is small. However, when is large, SBS performs much better than RBM. In this situation, the information of data is easily covered by big noises. Hence, a good prior which can provide more information is vital for this procedure.

The next example considers the same comparison but in non-linear models. In the last example, the alternative models include the true model, but now the true non-linear model is approximated by many linear models.

###### Example 2.

The simulation data is generated for the RBM from the non-linear models

1. ,

2. ,

where , and . The sequential batch sampling has steps, and each step uses samples following Algorithm 2.

The results of Example 2 are listed in Table 5. Mostly, it is similar to the results of Example 1. The difference is that the volatility of SOIL becomes large when the model is completely non-linear. Using linear models to fit non-linear model increases the model uncertainty since none of the fitting models is correct.

The final example is under the situation that the data has been already collected. Hence, we cannot use the SBS method to get the data. However, we have the extra data of many old similar tasks. In particular, we have the data of Example 1. Now, the new task is to fit a new model.

###### Example 3.

The data of Example 1 with is given. The new task data is generated from the linear model , where , , and .

The method described in Algorithm 1 is denoted by HDR (Historical Data Related). The results in Figure 3 show high consistency with the last two examples. When is small, the different priors lead to a similar result since the current data has a key influence. However, when is large, the difference between RMB and HDR is huge. The reason is that the current data has been polluted by the strong noise. Hence, a good prior can provide vital information about the model distribution.

### 4.2 Real-World Dataset

Here, we apply the proposed methods to two real datasets, BGS data and Bardet data, which are also used in \citeauthorYe2016Sparsity \shortciteYe2016Sparsity.

First, the BGS data is with a small and from the Berkeley Guidance Study (BGS) by \citeauthortuddenham1954physical \shortcitetuddenham1954physical. The dataset records boys’ physical growth measures from birth to eighteen years. Following \citeauthorYe2016Sparsity \shortciteYe2016Sparsity, we consider the same regression model. The response is age height and the factors include weights at ages two (WT2) and nine (WT9), heights at ages two (HT2) and nine (HT9), age nine leg circumference (LG9) and age strength (ST18).

Second, for large , the Bardet data collect tissue samples from the eyes of twelve-week-old male rats. For each tissue, the RNAs of selected probes are measured by the normalized intensity valued. The gene intensity values are in log scale. Gene TRIM32, which causes the Bardet-Biedl syndrome, is the response in this study. The genes that are related to it are investigated. A screening method [\citeauthoryearHuang, Ma, and Zhang2008] is applied to the original probes. This screened data with probes for each of tissues are also used in \citeauthorYe2016Sparsity \shortciteYe2016Sparsity.

Both cases are data-given cases that we cannot use the sequential batch sampling method. For the different settings of , we assign corresponding similar historical data for two real datasets. The data of model in Example 1 for the BGS data with small . The data of model in Example 1 for the Bardet data with large .

We randomly sample rows from the data as the test set to calculate empirical MSPE and volatility. The results are summarized in Table 6. From Table 6, we can see that both RBM and HDR have smaller MSPE than SOIL. However, HDR does not perform much better than RBM. This can be explained intuitively as follows. In theory, the historical tasks and the current task are assumed that they come from the same task distribution. But in practice, how to measure the similarity between tasks is still a problem. Hence, an unrelated historical dataset may provide less information for the current prediction.

In this section, the performance of our HDR method for classification tasks is demonstrated. Here, we use the same image classification example in \citeauthoramit2018meta-learning \shortciteamit2018meta-learning. The hypothesis class is the set of neural networks including the architecture given in \citeauthoramit2018meta-learning \shortciteamit2018meta-learning and other CNN architectures provided in Keras. Different architectures are weighted by the parameter . And, the distribution is to characterize the hyperparameter of the -th architecture. The cross-entropy loss is used.

The task environment is constructed based on augmentations of the MNIST dataset [\citeauthoryearLeCun1998]. Each task is created by a permutation of the image pixels. We randomly pick , and pixel swaps, and find that they have similar results. Thus we just show the results with pixel swaps. For the meta-learner, it is trained by the meta-training tasks each with training samples and validation samples. For a new task with fewer training samples and test samples, we randomly sample training samples times and compare the average test error percentage of different learning methods. ’ CI’ in Table 7 means the confidence interval. MLAP represents the method in \citeauthoramit2018meta-learning \shortciteamit2018meta-learning, and different subscripts mean that different PAC bounds are used. The Model-Agnostic-Meta-Learning [\citeauthoryearFinn, Abbeel, and Levine2017] is denoted by MAML. See \citeauthoramit2018meta-learning \shortciteamit2018meta-learning for detailed settings of the example.

Table 7 summarizes the results for the permuted pixel environment with pixel swaps and training tasks. We find that the best results are obtained by our HDR method. Note that for classification tasks, the weighted prediction is for the one-hot response. Hence, the prediction can be viewed as picking the largest probability among all the models not just in one model. Consequently, model averaging used in classification has much better performance than regression cases.

We also investigate whether the number of training tasks affects the error rate of the predictions on the new test tasks, and find that it is improved a lot if the meta-learner is used. But, the number of training tasks does not have a significant effect.

## Acknowledgment

We sincerely thank Prof. Zhihua Zhang (Peking University) and all reviewers for their valuable comments which have led to further improvement of this work.

## Supplementary Materials

For Lemma 1, we review the classical PAC-Bayes bound [\citeauthoryearMcallester1999] with general notations first.

###### Lemma 3.

Let be a sample space and be a function space over . Define a loss function , and be a sequence of independent identical distributed random samples. Let be some prior distribution over . For any , the following bound holds for all posterior distributions over ,

 PS(EXEfg(f,X)≤1nn∑i=1Efg(f,Xi)+ ⎷ρ||π+lnnδ2(n−1))≥1−δ. (2)

Proof of Lemma 1: We use Lemma 3 to bound the expected risk with the following substitutions. The samples are . The function where . The loss function . The prior is defined by , in which we first sample from according to corresponding weights and then sample from . The posterior is defined similarly, .

The KL-divergence term is

 ρ||π=Eflnρ(f)π(f)=Ek∈{1,…,K}Eh∈MklnwkQk(h)w0,kQ0k(h)=K∑k=1wkEh∈MklnwkQk(h)w0,kQ0k(h)=ω||ω0+K∑k=1ωk(Qk||Q0k). (3)

Substituting the above into Eq.(2), it follows that

 (4)

Using the notations in Section 2, we can rewrite the above as below,

 PS(R(ξ,D)≤^R(ξ,S)+ ⎷(ω||ω0+K∑k=1ωk(Qk||Q0k)+lnnδ)/(2n−2))≥1−δ. (5)

Proof of Proposition 1:

First, we prove that for ,

 ¯¯¯¯R(ξi,ξi−1,Bi)≤¯¯¯¯R(ξi−1,ξi−2,Bi−1).

By definition of ,

 ¯¯¯¯R(ξi,ξi−1,Bi)≤¯¯¯¯R(ξi−1,ξi−1,Bi) =^R(ξi−1,Bi)+√lnnδ/(2n−2) ≤¯¯¯¯R(ξi−1,ξi−2,Bi)=¯¯¯¯R(ξi−1,ξi−2,Bi−1).

Following these inequalities,

 ¯¯¯¯R(ξb,ξb−1,S)=¯¯¯¯R(ξb,ξb−1,Bb) ≤¯¯¯¯R(ξ1,ξ0,B1)=¯¯¯¯R(ξ∗,ξ0,S).

This finishes the proof. ∎

Proof of Lemma 2:

According to Theorem in [\citeauthoryearYang2001], we have

 (6)

where is the minimizer of Eq.(1) with and denoted by .

For any and an estimator satisfied the condition (C3), the inequalities and hold. Plugging into (6) for , it follows that

 R(ξ∗,τ)≤infj(C1∑i≠j(ni−n′i)

where is the minimizer of Eq.(1) with and .

In order to obtaining the form in Theorem 2, it only needs to note that if is known, the term vanishes. ∎

### Footnotes

1. is defined as .

### References

1. Alquier, P., and Guedj, B. 2018. Simpler PAC-Bayesian bounds for hostile data. Machine Learning 107(5):887–902.
2. Alquier, P.; Ridgway, J.; and Chopin, N. 2016. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research 17(1):8374–8414.
3. Amit, R., and Meir, R. 2018. Meta-learning by adjusting priors based on extended PAC-Bayes theory. international conference on machine learning 205–214.
4. Catoni, O. 2004. Statistical learning theory and stochastic optimization: Ecole d’Eté de Probabilités de Saint-Flour XXXI-2001. Springer.
5. Catoni, O. 2007. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. IMS.
6. Catoni, O. 2016. PAC-Bayesian bounds for the gram matrix and least squares regression with a random design. arXiv preprint arXiv:1603.05229.
7. Dziugaite, G. K., and Roy, D. M. 2018. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, 8430–8441.
8. Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1126–1135. JMLR. org.
9. Grünwald, P. D., and Mehta, N. A. 2016. Fast rates with unbounded losses. arXiv preprint arXiv:1605.00252 2:12.
10. Guedj, B.; Alquier, P.; et al. 2013. PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics 7:264–291.
11. Hansen, B. E. 2007. Least squares model averaging. Econometrica 75(4):1175–1189.
12. Hjort, N. L., and Claeskens, G. 2003. Frequentist model average estimators. Journal of the American Statistical Association 98(464):879–899.
13. Huang, J.; Ma, S.; and Zhang, C. H. 2008. Adaptive lasso for sparse high-dimensional regression. Stat Sin 18(4):1603–1618.
14. Leamer, E. E. 1978. Specification searches. New York: Wiley.
15. LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
16. Leung, G., and Barron, A. R. 2006. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52(8):3396–3410.
17. Lever, G.; Laviolette, F.; and Shawe-Taylor, J. 2013. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science 473(2):4–28.
18. Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
19. Lugosi, G.; Mendelson, S.; et al. 2019. Regularization, sparse recovery, and median-of-means tournaments. Bernoulli 25(3):2075–2106.
20. Mcallester, D. A. 1999. PAC-Bayesian model averaging. In In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 164–170.
21. Moral-Benito, E. 2015. Model averaging in economics: An overview. Journal of Economic Surveys 29(1):46–75.
22. Oneto, L.; Anguita, D.; and Ridella, S. 2016. PAC-bayesian analysis of distribution dependent priors: Tighter risk bounds and stability analysis. Pattern Recognition Letters 80:200–207.
23. Raftery, A. E. 1995. Bayesian model selection in social research. Sociological Methodology 25(25):111–163.
24. Raftery, A. E. 1996. Approximate bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83(2):251–266.
25. Rivasplata, O.; Szepesvari, C.; Shawe-Taylor, J. S.; Parrado-Hernandez, E.; and Sun, S. 2018. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems, 9214–9224.
26. Seeger, M. 2002. PAC-Bayesian generalisation error bounds for gaussian process classification. Journal of Machine Learning Research 3(2):233–269.
27. Seldin, Y., and Tishby, N. 2010. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research 11(Dec):3595–3646.
28. Tuddenham, R. D., and Snyder, M. M. 1954. Physical growth of california boys and girls from birth to eighteen years. Publications in Child Development 1:183–364.
29. Wang, H.; Zhang, X.; and Zou, G. 2009. Frequentist model averaging estimation: a review. Journal of Systems Science and Complexity 22(4):732.
30. Yang, Y. 2000. Combining different procedures for adaptive regression. Journal of multivariate analysis 74(1):135–161.
31. Yang, Y. 2001. Adaptive regression by mixing. Journal of the American Statistical Association 96(454):574–588.
32. Ye, C.; Yang, Y.; and Yang, Y. 2016. Sparsity oriented importance learning for high-dimensional linear regression. Journal of the American Statistical Association (2):1–16.
33. Zhou, Q.; Ernst, P. A.; Morgan, K. L.; Rubin, D. B.; and Zhang, A. 2018. Sequential rerandomization. Biometrika 105(3):745–752.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters