Uncertainty in Multitask Transfer Learning
Using variational Bayes neural networks, we develop an algorithm capable of accumulating knowledge into a prior from multiple different tasks. The result is a rich and meaningful prior capable of few-shot learning on new tasks. The posterior can go beyond the mean field approximation and yields good uncertainty on the performed experiments. Analysis on toy tasks shows that it can learn from significantly different tasks while finding similarities among them. Experiments on Mini-Imagenet yield a new state of the art with 74.5% accuracy on 5-shot learning. Finally, we provide experiments showing that other existing methods can fail to perform well on different benchmarks.
While conventional supervised learning is becoming more stable and is used in a wide range of applications, learning a complex model may require a daunting amount of labeled data. For this reason, transfer learning is often considered as an option to reduce the sample complexity of learning a new task.
Recently, significant progress has been made to scale Bayesian neural networks to large tasks and to provide better approximations of the posterior distribution (Blundell et al., 2015; Louizos and Welling, 2017; Krueger et al., 2017). This, however, comes with an important question: “What does the posterior distribution actually represent?”
For neural networks, the prior is often chosen for convenience and the approximate posterior is often very limited (Blundell et al., 2015). For sufficiently large datasets, the observations overcome the prior, and the posterior becomes a single mode around the true model.
However, many usages of the posterior distribution require a meaningful prior, that is, a prior expressing our current knowledge on the task and, most importantly, our lack of knowledge on the task. In addition to that, a good approximation of the posterior under the small sample size regime is required, including the ability to model multiple modes. This is indeed the case for Bayesian optimization (Snoek et al., 2012), Bayesian active learning (Gal et al., 2017), continual learning (Kirkpatrick et al., 2017), safe reinforcement learning (Berkenkamp et al., 2017), and the exploration-exploitation trade-off in reinforcement learning (Houthooft et al., 2016). Gaussian processes (Rasmussen, 2004) have historically been used for these applications, but an RBF kernel is too generic a prior for many tasks. More recent tools such as deep Gaussian processes (Damianou and Lawrence, 2013) show great potential, yet their scalability when learning from multiple tasks needs to be improved.
Our aim in this work is to learn a good prior across multiple tasks and transfer it to a new task. To be able to express a rich and flexible prior learned across a large number of tasks, we use neural networks learned with a variational Bayes procedure. By doing so, we are able to (i) isolate a small number of task specific parameters and (ii) obtain a rich posterior distribution over this space. Additionally, the knowledge accumulated from the previous tasks provides a meaningful prior on the target task, yielding a meaningful posterior distribution which can be used in a small data regime.
The rest of the paper is organized as follows: We first describe the proposed approach in Section 2 while reviewing hierarchical Bayes modeling. In Section 3, we extend to three levels of hierarchy to obtain a model more suited for classification. Section 4 outlines key differences between our approach and related methods. In Section 5, we conduct experiments on toy tasks to gain insight into the behavior of the algorithm. Finally, we show that we can obtain a new state of the art on the Mini-Imagenet benchmark (Vinyals et al., 2016).
2 Learning a Deep Prior
By leveraging the variational Bayes approach, we show how we can learn a prior over models with neural networks. Also, by factorizing the posterior distribution into a task agnostic and task specific component, we show an important simplification resulting in a scalable algorithm, which we refer to as deep prior.
2.1 Hierarchical Bayes
We consider learning a prior from previous tasks by learning a probability distribution over the weights of a network parameterized by . This is done using a hierarchical Bayes approach across tasks, with hyper-prior . Each task has its own parameters , with . Using all datasets , we have the following posterior:
The term corresponds to the likelihood of sample of task given a model parameterized by e.g. the probability of class from the softmax of a neural network parameterized by with input . For the posterior , we assume that the large amount of data available across multiple tasks will be enough to overcome a generic prior such as an isotropic Normal distribution. Hence, we consider a point estimate of the posterior using maximum a posteriori.
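For readers following the factorization, a generic two-level hierarchical Bayes posterior can be sketched as below. The symbols are assumptions introduced here for illustration and may differ from the notation of the equations in this section: $\alpha$ denotes the shared prior parameters, $w_j$ the weights of task $j$, $S_j = \{(x_{ij}, y_{ij})\}_{i=1}^{n_j}$ its dataset, and $N$ the number of tasks.

```latex
p(w_1, \dots, w_N, \alpha \mid S_1, \dots, S_N)
  \;\propto\;
  p(\alpha) \prod_{j=1}^{N} p(w_j \mid \alpha)
  \prod_{i=1}^{n_j} p(y_{ij} \mid x_{ij}, w_j)
```

Under this sketch, conditioning on $\alpha$ makes the tasks independent, which is what allows the per-task treatment used in the remainder of the section.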
We can now focus on the remaining term: . Since is potentially high dimensional with intricate correlations among the different dimensions, we cannot use a simple Gaussian distribution. Following inspiration from generative models such as GANs (Goodfellow et al., 2014) and VAE (Kingma and Welling, 2013), we use an auxiliary variable and a deterministic function projecting the noise to the space of i.e. . Marginalizing , we have: , where is the Dirac delta function. Unfortunately, directly marginalizing is intractable for general . To overcome this issue, we add to the joint inference and marginalize it at inference time. Considering the point estimation of , the full posterior is factorized as follows:
where is the conventional likelihood function of a neural network with weight matrices generated from the function i.e.: . A similar architecture has been used in Krueger et al. (2017) and Louizos and Welling (2017), but we will soon show that it can be reduced to a simpler architecture in the context of multi-task learning. The other terms are defined as follows:
The task will consist of jointly learning a function common to all tasks and a posterior distribution for each task. At inference time, predictions are performed by marginalizing i.e.: .
2.2 Hierarchical Variational Bayes Neural Network
In the previous section, we described the different components for expressing the posterior distribution of Equation 4. While all those components are tractable, the normalization factor hidden behind the “∝” sign is still intractable. To address this issue, we follow the Variational Bayes approach (Blundell et al., 2015).
Conditioning on , we saw in Equation 1 that the posterior factorizes independently for all tasks. This reduces the joint Evidence Lower BOund (ELBO) to a sum of individual ELBOs, one for each task.
Given a family of distributions , parameterized by and , the Evidence Lower Bound for task is:
Notice that after simplification
But, most importantly, any explicit reference to has now vanished from both Equation 5 and Equation 6. This simplification has an important positive impact on the scalability of the proposed approach. Since we no longer need to explicitly calculate the KL on the space of , we can simplify the likelihood function to , which can be a deep network parameterized by , taking both and as inputs. This contrasts with the previous formulation, where produces all the weights of a network, yielding an extremely high dimensional representation and slow training.
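As a hedged illustration of the simplified objective, the per-task bound can be written in the following generic form. The notation is assumed for this sketch and is not taken from the original equations: $z_j$ is the latent task variable, $q_{\theta_j}(z_j)$ its approximate posterior, $p(z_j)$ a prior on the latent space (a standard Normal is a common choice and is an assumption here), and $\alpha$ the parameters of the shared network.

```latex
\mathrm{ELBO}_j
  = \mathbb{E}_{z_j \sim q_{\theta_j}}
    \left[ \sum_{i=1}^{n_j} \ln p(y_{ij} \mid x_{ij}, z_j, \alpha) \right]
  - \mathrm{KL}\left[ q_{\theta_j}(z_j) \,\Vert\, p(z_j) \right]
```

Only the low dimensional $z_j$ appears in the KL term of this form, which is consistent with the scalability argument made above.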
2.3 Posterior Distribution
For modeling , we can use , where and can be learned individually for each task. This, however, limits the posterior family to expressing a single mode. For more flexibility, we also explore the usage of more expressive posteriors, such as Inverse Autoregressive Flow (IAF) (Kingma et al., 2016).
This gives a flexible tool for learning a rich variety of multivariate distributions. In principle, we can use a different IAF for each task, but for memory and computational reasons, we use a single IAF for all tasks and we condition
Note that with IAF, we cannot evaluate for any values of efficiently, only for those which we just sampled, but this is sufficient for estimating the KL term with a Monte-Carlo approximation i.e.:
where . It is common to approximate with a single sample and let the mini-batch average the noise incurred on the gradient. We experimented with , but this did not significantly improve the rate of convergence.
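The Monte-Carlo estimate of the KL term can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior and a standard Normal prior instead of IAF, so that the estimate can be checked against the closed form; the function names and the numerical values are hypothetical.

```python
import numpy as np


def gaussian_log_prob(z, mean, std):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    var = std ** 2
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (z - mean) ** 2 / var, axis=-1)


def mc_kl(mean, std, num_samples, rng):
    """Monte-Carlo estimate of KL(q || p) with q = N(mean, diag(std^2)), p = N(0, I).

    Only log q(z) at freshly sampled z is needed, which is the property
    that also makes the estimator usable with flow-based posteriors.
    """
    eps = rng.standard_normal((num_samples, mean.shape[0]))
    z = mean + std * eps  # reparameterized samples from q
    log_q = gaussian_log_prob(z, mean, std)
    log_p = gaussian_log_prob(z, np.zeros_like(mean), np.ones_like(std))
    return np.mean(log_q - log_p)


def closed_form_kl(mean, std):
    """Exact KL between N(mean, diag(std^2)) and N(0, I), for comparison."""
    return 0.5 * np.sum(std ** 2 + mean ** 2 - 1.0 - 2.0 * np.log(std))


rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0])  # illustrative values only
std = np.array([0.8, 1.5])
kl_mc = mc_kl(mean, std, num_samples=200_000, rng=rng)
kl_exact = closed_form_kl(mean, std)
```

With a single sample per data point, the same estimator is much noisier, and the averaging is left to the mini-batch as described above.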
2.4 Training Procedure
In order to compute the loss proposed in Equation 5, we would need to evaluate every sample of every task. To accelerate the training, we describe a procedure following the mini-batch principle. First we replace summations with expectations:
Now it suffices to approximate the gradient with samples across all tasks. Thus, we simply concatenated all datasets into a meta-dataset and added as an extra field. Then, we sample uniformly
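The meta-dataset construction can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the toy data are made up, and sampling with replacement is a choice made here, since the exact sampling scheme is not specified above.

```python
import random


def build_meta_dataset(task_datasets):
    """Flatten per-task datasets into one list of (task_id, x, y) records."""
    return [(j, x, y) for j, samples in enumerate(task_datasets) for (x, y) in samples]


def sample_minibatch(meta_dataset, batch_size, rng):
    """Draw a mini-batch uniformly (with replacement) from the meta-dataset."""
    return [rng.choice(meta_dataset) for _ in range(batch_size)]


# Three toy tasks of different sizes; the values are illustrative only.
task_datasets = [
    [(0.0, 1.0), (0.5, 1.2)],
    [(1.0, -0.3)],
    [(2.0, 0.1), (2.5, 0.4), (3.0, 0.9)],
]
meta_dataset = build_meta_dataset(task_datasets)
batch = sample_minibatch(meta_dataset, batch_size=4, rng=random.Random(0))
```

Each record carries its task index, so the task specific parameters can be looked up per sample when the loss is evaluated.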
3 Extending to 3 Levels of Hierarchy
Deep prior gives rise to a very flexible way to transfer knowledge from multiple tasks. However, there is still an important assumption at the heart of deep prior (and other VAE-based approaches such as Edwards and Storkey (2016)): the task information must be encoded in a low dimensional variable . In Section 5, we show that it is appropriate for regression, but for image classification, it is not the most natural assumption. Hence, we propose to extend to a third level of hierarchy by introducing a latent classifier on the obtained representation.
In Equation 5, for a given
To compute the ELBO in Equation 5 and update the parameters , the only requirement is to be able to compute the marginal likelihood . There are closed form solutions for, e.g., linear regression with Gaussian prior, but our aim is to compare with algorithms such as Prototypical Networks (Proto Net) (Snell et al., 2017) on a classification benchmark. Alternatively, we can factor the marginal likelihood as follows: . If a well calibrated task uncertainty is not required, one can also use a leave one out procedure . Both of these factorizations correspond to training times the latent classifier on a subset of the training set and evaluating on a left out sample. We refer the reader to Rasmussen (2004, Chapter 5) for a discussion on the difference between leave one out cross validation and marginal likelihood.
For a practical algorithm, we propose a closed form solution for leave one out in prototypical networks. In its standard form, the prototypical network produces a prototype by averaging all representations of class i.e. , where . Then, predictions are made using .
Let be the prototypes computed without example in the training set. Then,
We defer the proof to the supplementary materials. Hence, we only need to compute the prototypes once and rescale the Euclidean distance when comparing with a sample that was used for computing the current prototype. This gives an efficient algorithm with the same complexity as the original one and a good proxy for the marginal likelihood.
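The leave one out shortcut can be illustrated with a short sketch. It relies on an identity derived here from the definition of a class mean, not copied from the original equation: if a class has n samples with mean c, the mean without sample g is (n c - g) / (n - 1), so the squared distance from g to that reduced mean is (n / (n - 1))^2 times its squared distance to c. Function names and the toy data are hypothetical, and every class is assumed to contain at least two samples.

```python
import numpy as np


def prototypes(features, labels, num_classes):
    """Class prototypes: the mean feature vector of each class."""
    return np.stack([features[labels == k].mean(axis=0) for k in range(num_classes)])


def loo_sq_distances(features, labels, num_classes):
    """Squared distances to all prototypes, with each sample excluded from
    its own class prototype. Prototypes are computed only once."""
    protos = prototypes(features, labels, num_classes)
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    diff = features[:, None, :] - protos[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)  # shape (n_samples, num_classes)
    n_own = counts[labels]                # size of each sample's own class
    scale = (n_own / (n_own - 1.0)) ** 2  # assumes at least 2 samples per class
    rows = np.arange(len(labels))
    sq_dist[rows, labels] *= scale        # rescale own-class distances only
    return sq_dist


def loo_sq_distances_naive(features, labels, num_classes):
    """Reference version that recomputes the prototypes for every sample."""
    out = np.zeros((len(labels), num_classes))
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i
        protos = prototypes(features[keep], labels[keep], num_classes)
        out[i] = np.sum((features[i] - protos) ** 2, axis=-1)
    return out


rng = np.random.default_rng(0)
features = rng.standard_normal((10, 3))  # stand-in for learned representations
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
fast = loo_sq_distances(features, labels, num_classes=3)
slow = loo_sq_distances_naive(features, labels, num_classes=3)
```

Class probabilities would then follow from a softmax over the negated distances, as in standard prototypical networks.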
4 Related Work
Hierarchical Bayes algorithms for multitask learning have a long history (Daumé III, 2009; Wan et al., 2012; Bakker and Heskes, 2003). However, most of the literature focuses on simple statistical models and does not consider transferring to new tasks.
More recently, Edwards and Storkey (2016) and Bouchacourt et al. (2017) explore hierarchical Bayesian inference with neural networks and evaluate on new tasks. Both of them use a two-level hierarchical VAE for modeling the observations. While similar, our approach differs in a few ways. We use a discriminative approach and focus on model uncertainty. We show that we can obtain a posterior on without having to explicitly encode . We also explore the usage of more complex posterior families such as IAF. Those differences make our algorithm simpler to implement and easier to scale to larger datasets.
Some recent works on meta-learning also target transfer learning from multiple tasks. Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) finds a shared parameter such that for a given task, one gradient step on using the training set will yield a model with good predictions on the test set. Then, a meta-gradient update is performed from the test error through the one gradient step in the training set, to update . This yields a simple and scalable procedure which learns to generalize. Recently, Grant et al. (2018) consider a Bayesian version of MAML. Additionally, Ravi and Larochelle (2016) consider a meta-learning approach where an encoding network reads the training set and generates the parameters of a model, which is trained to perform well on the test set.
Finally, recent interest in few-shot learning has given rise to various algorithms capable of transferring from multiple tasks. Many of these approaches (Vinyals et al., 2016; Snell et al., 2017) find a representation where a simple algorithm can produce a classifier from a small training set. Bauer et al. (2017) use a neural network pre-trained on a standard multi-class dataset to obtain a good representation and use class statistics to transfer prior knowledge to new classes.
5 Experimental Results
Through experiments, we want to answer: i) Can deep prior learn a meaningful prior on tasks? ii) Can it compete against the state of the art on a strong benchmark? iii) In which situations do deep prior and other approaches fail?
5.1 Regression on one dimensional Harmonic signals
To gain a good insight into the behavior of the prior and posterior, we choose a collection of one dimensional regression tasks. We also want to test the ability of the method to learn the task and not just match the observed points. For this, we will use periodic functions and test the ability of the regressor to extrapolate outside of its domain.
Specifically, each dataset consists of pairs (noisily) sampled from a sum of two sine waves with different phase and amplitude and a frequency ratio of 2: , where . We construct a meta-training set of 5000 tasks, sampling , and independently for each task. To evaluate the ability to extrapolate outside of the task’s domain, we make sure that each task has a different domain. Specifically, values are sampled according to , where is sampled from the meta-domain . The number of training samples ranges from 4 to 50 for each task, and evaluation is performed on 100 samples from tasks never seen during training.
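A task generator of this kind can be sketched as follows. All numerical ranges (amplitudes, base frequency, noise level, meta-domain) are assumptions chosen for illustration, since the sampling distributions are not reproduced above; only the structure (two sine waves with a frequency ratio of 2, a task specific input domain, and 4 to 50 points per task) follows the text.

```python
import numpy as np


def sample_task(rng, num_points):
    """Sample one regression task: noisy points from a sum of two sine waves."""
    amp = rng.uniform(0.2, 1.0, size=2)            # assumed amplitude range
    phase = rng.uniform(0.0, 2.0 * np.pi, size=2)  # assumed phase range
    omega = rng.uniform(5.0, 7.0)                  # assumed base frequency range
    center = rng.uniform(-4.0, 4.0)                # assumed meta-domain
    x = rng.normal(center, 1.0, size=num_points)   # task specific input domain
    clean = amp[0] * np.sin(omega * x + phase[0]) + amp[1] * np.sin(2.0 * omega * x + phase[1])
    y = clean + rng.normal(0.0, 0.1, size=num_points)  # assumed noise level
    return x, y


rng = np.random.default_rng(0)
# 100 tasks here for brevity; the meta-training set described above uses 5000.
tasks = [sample_task(rng, num_points=int(rng.integers(4, 51))) for _ in range(100)]
```

Because each task draws its own input center, a regressor can only do well on held out points by inferring the underlying function and not by interpolating locally.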
Once is sampled from IAF, we simply concatenate it with and use 12 densely connected layers of 128 neurons with residual connections between every other layer. The final layer linearly projects to 2 outputs and , where is used to produce a heteroskedastic noise, . Finally, we use to express the likelihood of the training set. To help gradient flow, we use ReLU activation functions and Layer Normalization (Ba et al., 2016).
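A heteroskedastic Gaussian likelihood on top of a two-output head can be sketched as below. The softplus mapping from the second output to a standard deviation and the small floor `min_std` are assumptions made for this illustration; the exact transformation used for the noise is not reproduced above.

```python
import numpy as np


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return np.logaddexp(0.0, a)


def gaussian_nll(y, mu, s, min_std=1e-3):
    """Per-sample negative log-likelihood of y under N(mu, std^2),
    where std = softplus(s) + min_std is predicted per input."""
    std = softplus(s) + min_std
    return 0.5 * np.log(2.0 * np.pi * std ** 2) + 0.5 * ((y - mu) / std) ** 2


# Illustrative network outputs for three inputs.
y = np.array([0.1, -0.4, 1.2])
mu = np.array([0.0, -0.5, 0.2])
s = np.array([0.0, -1.0, 1.0])
nll = gaussian_nll(y, mu, s)
```

Summing this quantity over the training points of a task gives the negative log-likelihood term of the objective.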
Figure 1(a) depicts examples of tasks with 1, 2, 8, and 64 samples. The true underlying function is in blue while 10 samples from the posterior distributions are faded in the background. The thickness of the line represents 2 standard deviations. The first plot has only one single data point and mostly represents samples from the prior, passing near this observed point. Interestingly, all samples are close to some parametrization of Equation 5.1. Next, with only 2 points, the posterior is starting to predict curves highly correlated with the true function. However, note that the uncertainty is overoptimistic and that the posterior failed to fully represent all possible harmonics fitting those two points. We discuss this issue in more depth in the supplementary materials. Next, with 8 points, it managed to mostly capture the task, with reasonable uncertainty. Finally, with 64 points the model is certain of the task.
To add a strong baseline, we experimented with MAML (Finn et al., 2017). After exploring a variety of hyper-parameter values and architecture designs, we could not make it work for our two harmonics meta-task. We thus reduced the meta-task to a single harmonic and reduced the base frequency range by a factor of two. With those simplifications, we managed to make it converge, but the results are far behind those of deep prior even in this simplified setup. Figure 1(b) shows some form of adaptation with 16 samples per task, but the result is jittery and the extrapolation capacity is very limited. Those results were obtained with a densely connected network of 8 hidden layers of 64 units.
Finally, to provide a stronger baseline, we removed the KL regularizer of deep prior and reduced the posterior to a deterministic distribution centered on . The mean square error is reported in Figure 2 for an increasing dataset size. This highlights how the uncertainty provided by deep prior yields a systematic improvement.
5.2 Mini-Imagenet Experiment
Vinyals et al. (2016) proposed to use a subset of Imagenet to generate a benchmark for few-shot learning. Each task is generated by sampling 5 classes uniformly and 5 training samples per class; the remaining images from the 5 classes are used as query images to compute accuracy. There are 100 unique classes in total, each with 600 example images. To perform meta-validation and meta-test on unseen tasks (and classes), we isolate 16 and 20 classes respectively from the original set of 100, leaving 64 classes for the training tasks. This follows the procedure suggested in Ravi and Larochelle (2016).
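The task generation described above can be sketched as follows. This is a minimal illustration with hypothetical names: `class_to_items` stands in for the 64 training classes with 600 images each, and image identifiers replace actual image data.

```python
import random


def sample_episode(class_to_items, n_way, k_shot, rng):
    """Sample an n-way, k-shot task: k training items per class,
    with all remaining items of the chosen classes used as queries."""
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = list(class_to_items[cls])
        rng.shuffle(items)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query


# Placeholder identifiers standing in for images of the 64 training classes.
class_to_items = {f"class_{c}": [f"img_{c}_{i}" for i in range(600)] for c in range(64)}
support, query = sample_episode(class_to_items, n_way=5, k_shot=5, rng=random.Random(0))
```

Labels are re-indexed from 0 to 4 within each task, so the same class can receive a different label in different tasks.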
The training procedure proposed in Section 2 requires training on a fixed set of tasks. We found that 1000 tasks yield enough diversity and that over 9000 tasks, the embeddings are not being visited often enough over the course of the training. To increase diversity during training, the training and test sets are re-sampled every time from a fixed train-test split of the given task.
We first experimented with the vanilla version of deep prior (Section 2). In this formulation, we use a ResNet (He et al., 2016) network, where we inserted FiLM layers (Perez et al., 2017; de Vries et al., 2017) between each residual block to condition on the task. Then, after flattening the output of the final convolution layer and reducing to 64 hidden units, we apply a 64 × 5 matrix generated from a transformation of . Finally, predictions are made through a softmax layer. We found this architecture to be slow to train, as the generated last layer is noisy for a long time and prevents the rest of the network from learning. Nevertheless, we obtained 62.6% accuracy on Mini-Imagenet, on par with many strong baselines.
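The task conditioning through FiLM can be illustrated with a small sketch. FiLM applies a per-channel scale and shift computed from a conditioning vector; the linear generators, shapes, and names below are assumptions for illustration and do not describe the exact layers used in the experiments.

```python
import numpy as np


def film(features, z, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of a feature map by a task vector.

    features: array of shape (batch, channels, height, width)
    z:        task vector of shape (z_dim,)
    """
    gamma = z @ w_gamma + b_gamma  # per-channel scale, shape (channels,)
    beta = z @ w_beta + b_beta     # per-channel shift, shape (channels,)
    return gamma[None, :, None, None] * features + beta[None, :, None, None]


rng = np.random.default_rng(0)
features = rng.standard_normal((2, 4, 3, 3))  # stand-in for a residual block output
z = rng.standard_normal(8)                    # stand-in for the task variable
w_gamma = 0.1 * rng.standard_normal((8, 4))
w_beta = 0.1 * rng.standard_normal((8, 4))
out = film(features, z, w_gamma, np.ones(4), w_beta, np.zeros(4))
```

Initializing the scale bias at one and the shift bias at zero keeps the layer close to the identity at the start of training, which is a common design choice and an assumption here.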
To enhance the model, we combine task conditioning with prototypical networks as proposed in Section 3. This approach alleviates the need to generate the final layer of the network, thus accelerating training and improving generalization performance. While we no longer have a well calibrated task uncertainty, the KL term still acts as an effective regularizer and prevents overfitting on small datasets.
5.3 Heterogeneous Collection of Tasks
In Section 5.2, we saw that conditioning helps, but only yields a minor improvement. This is due to the fact that Mini-Imagenet is a very homogeneous collection of tasks where a single representation is sufficient to obtain good results. To support this claim, we provide a new benchmark
|Method|Accuracy|
|---|---|
|Matching Networks (Vinyals et al., 2016)|60.0%|
|Meta-Learner LSTM (Ravi and Larochelle, 2016)|60.6%|
|MAML (Finn et al., 2017)|63.2%|
|Prototypical Networks (Snell et al., 2017)|68.2%|
|SNAIL (Mishra et al., 2018)|68.9%|
|Discriminative k-shot (Bauer et al., 2017)|73.9%|
|adaResNet (Munkhdalai et al., 2018)|71.9%|
|Deep Prior (Ours)|62.7%|
|Deep Prior + Proto Net (Ours)|74.5%|

|Model|5-way, 5-shot|4-way, 4-shot|
|---|---|---|
|Proto Net (ours)|68.6 ± 0.5%|69.6 ± 0.8%|
|+ ResNet(12)|72.4 ± 1.0%|76.8 ± 0.4%|
|+ Conditioning|72.3 ± 0.6%|80.1 ± 0.9%|
|+ Leave One Out|73.9 ± 0.4%|82.7 ± 0.2%|
|+ KL|74.5 ± 0.5%|83.5 ± 0.4%|
6 Conclusion
Using variational Bayes, we developed a scalable algorithm for hierarchical Bayes learning of neural networks, called deep prior. This algorithm is capable of transferring information from tasks that are potentially remarkably different. Results on the Harmonics dataset show that the learned manifold across tasks exhibits the properties of a meaningful prior. Finally, we found that MAML, while very general, has a hard time adapting when tasks are too different. Also, we found that algorithms based on a single image representation only work well when all tasks can succeed with a very similar set of features. Together, those findings allowed us to develop the new state of the art on Mini-Imagenet.
Appendix A
A.1 Proof of Leave One Out
Let be the prototypes computed without example in the training set. Then,
Let , and assume then,
When , the result is trivially . ∎
A.2 Limitations of IAF
When experimenting with the Harmonics toy dataset in Section 5.1, we observed issues with repeatability, most likely due to local minima. We decided to investigate further on the multimodality of posterior distributions with small sample size and the capacity of IAF to model them. For this purpose we simplified the problem to a single sine function and removed the burden of learning the prior. The likelihood of the observations is defined as follows:
where is given and . Only the frequency and the bias are unknown.
We observe a high amount of multi-modality in the posterior distribution (Figure 3-middle). Some of the modes are just the mirror of another mode and correspond to the same functions e.g. or . But most of the time they correspond to different functions, and modeling them is crucial for some applications. The number of modes varies a lot with the choice of observed dataset, ranging from a few to several dozen. Now, the question is: “How many of those modes can IAF model?” Unfortunately, Figure 3-bottom reveals poor capability for this particular case. After carefully adjusting the hyperparameters
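The source of this multi-modality can be illustrated with a small sketch. It is a simplification of the setup above and rests on assumptions: the bias is fixed to zero so that only the frequency is unknown, the observations are noise free, and the noise level, grid, and input locations are chosen for illustration.

```python
import numpy as np


def log_likelihood(omega_grid, x, y, noise_std):
    """Gaussian log-likelihood (up to a constant) of y = sin(omega * x)
    evaluated for every candidate frequency on the grid."""
    pred = np.sin(omega_grid[:, None] * x[None, :])
    return -0.5 * np.sum(((y[None, :] - pred) / noise_std) ** 2, axis=1)


def count_modes(values):
    """Count interior local maxima of a one dimensional array."""
    left = values[1:-1] >= values[:-2]
    right = values[1:-1] > values[2:]
    return int(np.sum(left & right))


true_omega = 2.0
x = np.array([1.0, 2.0])    # two observed inputs
y = np.sin(true_omega * x)  # noise-free observations for clarity
omega_grid = np.linspace(0.1, 20.0, 4001)
ll = log_likelihood(omega_grid, x, y, noise_std=0.1)
num_modes = count_modes(ll)
```

With these two inputs, every frequency of the form 2 + 2πk fits the observations exactly, so the likelihood already has several equally good modes over the grid before any bias term is introduced.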
- A task is defined as modeling the underlying distribution from a dataset of observations.
- The true model must have positive probability under the prior. Also, when the true model can be parameterized differently, modeling one or multiple modes is equivalent.
- cancelled with itself from the denominator since it does not depend on nor . This would have been different for a generative approach.
- This can be done through simply minimizing the cross entropy of a neural network with regularization.
- We can justify the cancellation of the Dirac delta functions by instead considering a Gaussian with finite variance, . For all , the cancellation is valid, so letting , we recover the result.
- We follow the architecture proposed in Kingma et al. (2016).
- We also explored a sampling scheme that always makes sure to have at least samples from the same task. The aim was to reduce gradient variance on task specific parameters, but we did not observe any benefits.
- We removed from equations to alleviate the notation.
- Layer norm only marginally helped.
- We also experimented with various other architectures.
- If the train and test split is not fixed for a given task, one could leak the test information through the task embeddings across different resampling of the task.
- We had to cross validate the weight of the KL term and obtained our best results using values around 0.1.
- Code and dataset will be provided.
- We scale and by a factor of 5 so that the range of interesting values fits well in the interval . This makes it more approachable by IAF.
- 12 layers with 64 hidden units MADE network for each layer, learned with Adam at a learning rate of .
- J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.
- M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
- F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–919, 2017.
- C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- D. Bouchacourt, R. Tomioka, and S. Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.
- A. Damianou and N. Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
- H. Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 135–142. AUAI Press, 2009.
- H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pages 6597–6607, 2017.
- H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning, pages 1126–1135, 2017.
- Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
- Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
- D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
- J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
- D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
- B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
- C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
- N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
- T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
- E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
- C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
- S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
- J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
- J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
- O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638. 2016.
- J. Wan, Z. Zhang, J. Yan, T. Li, B. D. Rao, S. Fang, S. Kim, S. L. Risacher, A. J. Saykin, and L. Shen. Sparse bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in alzheimer’s disease. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 940–947. IEEE, 2012.