Meta Cyclical Annealing Schedule: A Simple Approach to Avoiding Meta-Amortization Error
Abstract
The ability to learn new concepts from small amounts of data is a crucial aspect of intelligence that has proven challenging for deep learning methods. Meta-learning for few-shot learning offers a potential solution to this problem: by learning to learn across data from many previous tasks, few-shot learning algorithms can discover the structure among tasks that enables fast learning of new tasks. However, a critical challenge in few-shot learning is task ambiguity: even when a powerful prior can be meta-learned from a large number of prior tasks, a small dataset for a new task can simply be too ambiguous to acquire a single model for that task. Bayesian meta-learning models can naturally resolve this problem by placing a sophisticated prior distribution and letting the posterior be well regularized through Bayesian decision theory. However, currently known Bayesian meta-learning procedures such as VERSA suffer from the so-called information preference problem: the posterior distribution degenerates to one point and is far from the exact one. To address this challenge, we design a novel meta-regularization objective using a cyclical annealing schedule and the maximum mean discrepancy (MMD) criterion. The cyclical annealing schedule is quite effective at avoiding such degenerate solutions. This procedure involves a difficult KL-divergence estimation, but we resolve the issue by employing MMD instead of the KL-divergence. Experimental results show that our approach substantially outperforms standard meta-learning algorithms.
1 Introduction
The human visual system is efficient at grasping the main concepts of any new image from only a single or a few examples. Over the last few years, few-shot learning techniques have been developed by many researchers to achieve human-level performance on image recognition tasks. Generally, a "good" few-shot learning technique is expected to satisfy properties such as the following: (i) it is able to learn new tasks efficiently from few-shot examples, thus learning the new categories fast; (ii) its performance improves as increasing numbers of input samples are given for a new task; (iii) performance on the initial tasks at training time is not sacrificed (no forgetting).
Although many few-shot classification algorithms have been proposed, organizing them into a single best unified framework is a difficult task. Metric-learning methods [25, 22, 24] aim to learn a data-dependent metric to reduce intra-class distances and increase inter-class distances. Gradient-based meta-learning [12, 5, 22] attempts to learn the commonalities among various tasks. MAML [5] is an effective meta-learning method that directly optimizes the gradient-descent procedure for task-specific learners. In the amortized Bayesian inference framework, [19, 20] proposed a method for predicting the weights of classes from the activations of a pretrained network to transfer from a high-shot classification task to a separate low-shot classification task. Recently, [9] proposed a general meta-learning framework (ML-PIP) with approximate probabilistic inference and an implementation for few-shot learning tasks (VERSA). ML-PIP unifies a number of important approaches to meta-learning, including both gradient- and metric-based meta-learning [12, 5, 22, 19, 20], with amortized inference frameworks (neural processes) [7, 8]. It is a general framework because of its end-to-end training, and it supports full multi-task learning by sharing information between many tasks. In particular, VERSA replaces test-time optimization with efficient posterior inference by generating a distribution over the task-specific parameters in a single forward pass. Therefore, this framework amortizes the cost of inference and removes the need for second derivatives during few-shot adaptation at test time. It is also worth noting that this inference framework focuses on the posterior predictive distribution: it aims to minimize the KL-divergence between the true and approximate predictive distributions rather than maximizing the ELBO, which is generally utilized in VAE-based methods [13].
In state-of-the-art models for few-shot learning tasks, an amortized inference distribution is practically utilized because it is efficient, scalable to large datasets, and requires only the specified parameters of the neural network. However, to obtain proper amortized inference, we need to tackle the amortization gap problem and the information preference problem, as stated below. As analyzed in [3], the inference mismatch between the true and approximate posterior consists of two gaps: (i) the approximation gap and (ii) the amortization gap. Their conclusions are that increasing the capacity of the encoder reduces the amortization error, and that when efficient test-time inference is required, encoder generalization is important and expressive approximations in the decoder are likely advantageous. Another example of the estimation difficulty of amortized inference is that cosine-similarity-based non-amortized models [2] achieved superior performance over models with amortized inference on few-shot learning. This implies that an effective estimation methodology for amortized inference has still not been established.
Our contributions in this paper are as follows:

- We show that one of the amortization gap problems comes from the information preference problem of the latent distribution.
- We adapt both an annealing method and a regularization of the parameter estimation in the amortized inference network to avoid the information preference problem, by applying a cyclical annealing schedule and the maximum mean discrepancy.
- Our proposal meets the "good" properties of few-shot learning and obtains better performance on standard few-shot classification tasks.

Despite the simplicity of our proposed method, it significantly improves performance. Through several experimental analyses, we show that our methodology outperforms other state-of-the-art few-shot learning algorithms.
2 Preliminaries
Along with the many few-shot learning methods, a number of measures for assessing their actual performance have also been proposed. The ML-PIP model unifies a number of important approaches to meta-learning, including both gradient- and metric-based meta-learning [12, 5, 22, 19], with an amortized inference framework [7, 8]. Although ML-PIP is similar to these models, it is more general, employing end-to-end training and supporting full multi-task learning by sharing information between many tasks. In this section, we describe the multi-task meta-learning problem that we deal with in this paper, and we review VERSA (an implementation of ML-PIP for meta-learning) and neural processes (NPs) [9, 8].
2.1 Meta-learning problem
In this paper, we mainly consider few-shot classification problems, in which we are given few-shot (say, k-shot) observations consisting of input-output pairs for each of C classes (we call this C-way), and we perform C-class classification for an unseen test input. We call this problem the C-way k-shot meta-learning problem. One typical approach to this problem is to construct an "encoder" h_θ beforehand, estimate a weight vector w from the few-shot observations, and apply the softmax operation to the linear discriminator (which we call the "decoder"). The encoder is usually trained on other training data (which typically does not contain the C classes of the few-shot observations) so that h_θ extracts informative features that can distinguish the unseen classes. In the training phase, we are given training data D_t = {(x_i^(t), y_i^(t))}_{i=1}^{N_t} for several tasks (t is the task index, t = 1, …, T), where x_i^(t) is an input and y_i^(t) its label; the number of observations N_t for each task is supposed to be small. Based on the support dataset D_t, we train the encoder h_θ and the network that produces the weight vector w. This procedure can be seen as a kind of learning a training procedure. In the test phase, we are given test data D̃_s (s is the task index, s = 1, …, S) of new unseen tasks. Based on the support dataset D̃_s, the encoder produces the new weight vector w̃. In the few-shot learning setting, y is a class label among the C classes and the total number of support data points is kC.
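To make the episodic setup concrete, the construction of a single C-way k-shot task can be sketched as follows (a minimal sketch: the helper name `sample_episode` and the string-valued examples are illustrative only, not part of our method):

```python
import random

def sample_episode(pool, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one n_way-way k_shot-shot episode from a dict mapping
    class name -> list of examples.

    Returns a support set (k_shot examples per class) and a query set
    (n_query examples per class), with labels remapped to 0..n_way-1.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(pool), n_way)          # pick n_way classes for this task
    support, query = [], []
    for new_label, c in enumerate(classes):
        examples = rng.sample(pool[c], k_shot + n_query)
        support += [(x, new_label) for x in examples[:k_shot]]
        query += [(x, new_label) for x in examples[k_shot:]]
    return support, query

# Toy pool: 7 classes with 20 instances each.
pool = {c: [f"{c}_{i}" for i in range(20)] for c in "abcdefg"}
support, query = sample_episode(pool, n_way=5, k_shot=1, n_query=15, seed=0)
```

During meta-training, one such episode plays the role of a task: the support set is fed to the encoder, and the query set is used to evaluate the resulting predictor.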
2.2 Meta-learning via amortized Bayesian inference
VERSA is a Bayesian meta-learning framework that is also used for the few-shot classification task. Its graphical model is shown in Figure 1 (a). VERSA consists of two parts: an encoder and an amortization network. The encoder h_θ maps an input to a feature vector, and its parameter θ is called the global latent variable. We use the same θ across all tasks. As described in the previous section, this encoder is trained through the support datasets D_t. When a new task appears at test time, the same encoder as the one estimated at training time is used for the test as well; that is, for a newly observed task, the shared statistical encoder is fed an input x and outputs h_θ(x) as a representation of the input. The amortization network outputs the predictive distribution from the representation of the input. It is characterized by the task-specific parameter ψ, which parameterizes the mapping from the encoded input to the approximated posterior distribution over the output label. In the VERSA model, ψ has to be inferred from few-shot samples at training time using the training data D_t. In practice, the amortized function, essentially a neural network with parameter φ, takes a representation variable as input and outputs the mean and variance parameters of the predictive distribution for each task.
For the few-shot classification task, VERSA encodes class c by the average of the encoded inputs: w_c = (1/k) Σ_{i=1}^{k} h_θ(x_i^(c)). This acts like the weight vector for the classification. Basically, the predictive distribution for y* is given by the softmax value of (w_1ᵀ h_θ(x*), …, w_Cᵀ h_θ(x*)). To obtain the approximated posterior predictive distribution, VERSA generates ψ as a stochastic version of (w_1, …, w_C) from the Gaussian distribution with mean and variance specified by the output of the amortization network, and samples the predictive distribution corresponding to ψ.
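Our reading of this prediction step can be sketched as follows (illustrative only: the `amortizer` signature, the diagonal-Gaussian sampling, and the Monte-Carlo averaging are assumptions about the general recipe, not VERSA's exact implementation):

```python
import numpy as np

def predict_versa_style(features_by_class, test_feature, amortizer,
                        n_samples=10, rng=None):
    """Monte-Carlo predictive distribution over classes.

    features_by_class: list of (k, d) arrays of encoded support inputs.
    amortizer: maps a class's mean feature to (mu, sigma) of a Gaussian
               posterior over that class's weight vector (assumed API).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = np.zeros(len(features_by_class))
    for _ in range(n_samples):
        logits = []
        for feats in features_by_class:
            mu, sigma = amortizer(feats.mean(axis=0))        # class representation -> posterior params
            w = mu + sigma * rng.standard_normal(mu.shape)   # sample a weight vector
            logits.append(float(w @ test_feature))
        logits = np.array(logits)
        e = np.exp(logits - logits.max())                    # numerically stable softmax
        probs += e / e.sum()
    return probs / n_samples                                 # average over posterior samples

# Toy usage: two classes in a 3-d feature space, near-deterministic amortizer.
amortizer = lambda m: (m, 0.01 * np.ones_like(m))
p = predict_versa_style([np.ones((2, 3)), -np.ones((2, 3))], np.ones(3), amortizer)
```

Averaging the softmax outputs over several sampled weight vectors is what turns a point prediction into an (approximate) posterior predictive distribution.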
This framework approximates the posterior predictive distribution by an amortized distribution as follows. The predictive distribution of the test output y* given the input x* and the few-shot sample D_t is given as
p(y* | x*, D_t, θ) = ∫ p(y* | x*, ψ, θ) p(ψ | D_t, θ) dψ,    (1)
where p(y* | x*, ψ, θ) corresponds to the softmax function applied to the linear discriminator with weight ψ. However, the posterior distribution of ψ is difficult to calculate. Therefore, VERSA approximates the predictive distribution by the amortized distribution obtained from the approximated posterior distribution q_φ(ψ | D_t, θ). VERSA employs a Gaussian distribution as the approximated posterior, characterized by the network output: q_φ(ψ | D_t, θ) = N(ψ; μ_φ(D_t, θ), σ_φ²(D_t, θ)). Then, the amortized predictive distribution is given as
q_φ(y* | x*, D_t, θ) = ∫ p(y* | x*, ψ, θ) q_φ(ψ | D_t, θ) dψ.    (2)
Since VERSA aims to approximate the predictive distribution as accurately as possible, the end-to-end stochastic training objective to be minimized with respect to θ and φ is given as follows:
L(θ, φ) = −E_{D_t, (x*, y*)} [ log (1/L) Σ_{l=1}^{L} p(y* | x*, ψ_l, θ) ],   ψ_l ~ q_φ(ψ | D_t, θ).    (3)
However, in general, learning a "good" latent code is difficult because, even when a powerful prior can be meta-learned from a large number of prior tasks, a small dataset for a new task can simply be too ambiguous to acquire a single accurate model. Here, we consider a more general objective which includes a regularization term:
L(θ, φ) + E_{D_t, T_t} [ KL( q_φ(ψ | D_t ∪ T_t, θ) ‖ p(ψ | D_t, θ) ) ],    (4)

where T_t denotes the query (target) set of task t.
In this objective, the KL-divergence between the posterior distributions works as a regularizer. Unfortunately, the conditional prior p(ψ | D_t, θ) in the above expression is intractable. To resolve this issue, we instead use the approximated posterior q_φ(ψ | D_t, θ), which gives:
L(θ, φ) + E_{D_t, T_t} [ KL( q_φ(ψ | D_t ∪ T_t, θ) ‖ q_φ(ψ | D_t, θ) ) ].    (5)
It is interesting to note that, replacing the task-specific parameter ψ with a latent variable z and the linear discriminator with a decoder network g, we obtain the NPs objective from the above objective:
−E_{q_φ(z | D_t ∪ T_t)} [ Σ_{(x*, y*) ∈ T_t} log p(y* | g(x*, z)) ] + E_{D_t, T_t} [ KL( q_φ(z | D_t ∪ T_t) ‖ q_φ(z | D_t) ) ],    (6)
where the function g is a neural network. NPs combine the strengths of neural networks and Gaussian processes to achieve both flexible learning and fast prediction in stochastic processes. While VERSA uses a linear discriminator as the decoder, NPs use a neural network. Both models are represented in Figure 1 (b).
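The data flow shared by these models (encode and aggregate the context, sample a global latent, decode per target input) can be sketched with untrained random weights (a structural sketch only; the layer sizes, ReLU encoder, and fixed noise scale are our illustrative choices, not the architecture of [8]):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

class TinyNeuralProcess:
    """Minimal NP-style computation graph with random, untrained weights."""

    def __init__(self, dim_r=8, dim_z=4):
        self.W_enc = rng.standard_normal((2, dim_r)) * 0.5   # (x, y) pair -> representation
        self.W_mu = rng.standard_normal((dim_r, dim_z)) * 0.5
        self.W_dec = rng.standard_normal((1 + dim_z, 1)) * 0.5

    def forward(self, x_ctx, y_ctx, x_tgt):
        pairs = np.stack([x_ctx, y_ctx], axis=1)             # (n_ctx, 2)
        r = relu(pairs @ self.W_enc).mean(axis=0)            # permutation-invariant aggregation
        mu = r @ self.W_mu
        z = mu + 0.1 * rng.standard_normal(mu.shape)         # sample global latent z ~ q(z | context)
        inp = np.concatenate([x_tgt[:, None],
                              np.tile(z, (len(x_tgt), 1))], axis=1)
        return (inp @ self.W_dec)[:, 0]                      # decoder g(x, z), one output per target

model = TinyNeuralProcess()
pred = model.forward(np.array([0.0, 1.0]), np.array([0.0, 1.0]),
                     np.linspace(0.0, 1.0, 5))
```

Because z is sampled once per forward pass and shared across all target inputs, repeated passes yield correlated function samples, which is the stochastic-process behaviour that distinguishes NPs from a plain regressor.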
The point is that, in VERSA and NPs, the central (stochastic) function being learned has the form p(y | x, D_t, θ): it maps an input x, a support dataset D_t, and the encoder's parameter (global latent variable) θ to a distribution over the output y.
2.3 Information preference property
As in VERSA and NPs, we consider the following generative process for the output y:
p(y | x, θ) = ∫ p(y | x, z, θ) p(z) dz,    (7)
where p(z) is the prior and p(y | x, z, θ) is given by a generative model with parameter θ. Under ideal conditions, optimizing the objective using sufficiently flexible model families for q_φ and the decoder over θ and φ will achieve both goals of correctly capturing the data distribution and performing correct amortized inference. However, this approach suffers from the following problem: the decoder tends to neglect the latent variable altogether, that is, the mutual information between z and y conditioned on x becomes negligibly small. For example, for any distribution r(z) that does not depend on T_t, we have E[ KL( q_φ(z | D_t ∪ T_t) ‖ r(z) ) ] ≥ I_q(z; T_t | D_t); taking r(z) = q_φ(z | D_t) gives
I_q(z; T_t | D_t) ≤ E_{D_t, T_t} [ KL( q_φ(z | D_t ∪ T_t) ‖ q_φ(z | D_t) ) ].
Therefore, the more the learning procedure drives the regularization term down (that is, the right-hand side decreases), the more the mutual information between z and the task data decreases. It follows that the mutual information between z and y decreases as well. Intuitively, the reason is that the distribution of z tends to shrink to a single point that is almost uniquely identified by a given D_t and T_t. This is undesirable because such a shrunken posterior of z is far from the true posterior and severely loses the variation of the posterior sampling of z. This effect, which we shall refer to as the information preference problem, was studied in the VAE framework with a coding-efficiency argument [13]. In the VAE framework, the issue causes two undesirable outcomes: (1) the learned features are almost identical to the uninformative Gaussian prior for all observed tasks; and (2) the decoder completely ignores the latent code, and the learned model reduces to a simpler model [6].
3 Proposed Method
As seen in the previous section, the unified Bayesian inference framework ML-PIP and its implementations VERSA and NPs utilize the amortized inference distribution because it is efficient, scalable to large datasets, and requires only the specified parameters of the neural network. However, as we have also seen, the regularization in the objective causes the information preference problem (a well-known issue in the VAE framework) and inaccurate estimation of the amortized inference distributions.
One approach to remedy this issue is to introduce a hyperparameter β that controls the strength of the regularization [11]. Furthermore, [6] found that scheduling β during model training substantially improves performance. In addition, [26] reported an alternative approach: replacing the KL-divergence between the latent distributions in the objective with an alternative divergence.
However, these previous studies applied their methodologies only to the single-task learning framework. In contrast, this paper considers a cyclical annealing schedule for β during multi-task learning (meta-training) and replaces the divergence with the maximum mean discrepancy criterion. This procedure leads to high mutual information between z and y, which encourages the model to use the latent code and avoids the information preference problem.
3.1 Cyclical annealing schedule
Several attempts have been made to ameliorate the information preference problem in the VAE framework. Among them, the simplest solution is monotonic KL annealing, where the weight of the KL penalty term is scheduled to gradually increase during training (the monotonic schedule) [1] (see Figure 3 (a)). In the VAE framework, the latent code learned early in training can be viewed as an initialization; such latent variables are much more informative than random ones and are thus ready for the decoder to use. Therefore, to mitigate the information preference problem, it is key to have a meaningful latent code at the beginning of training the decoder, so that the code can be utilized. Furthermore, [6] found that simply repeating the monotonic schedule multiple times (the cyclical annealing schedule) substantially improves on the above method (see Figure 3 (b)).
We apply this type of regularization to the multi-task learning context. We basically consider the following objective, which anneals the second term on the right-hand side of Eq. (6) by a factor β with 0 ≤ β ≤ 1:

L_β(θ, φ) = −E_{q_φ(z | D_t ∪ T_t)} [ Σ_{(x*, y*) ∈ T_t} log p(y* | g(x*, z)) ] + β E [ KL( q_φ(z | D_t ∪ T_t) ‖ q_φ(z | D_t) ) ].

We can see that, by annealing the regularization term (the second term), the effect of the information preference property discussed in Section 2.3 is mitigated. However, it is experimentally observed that a fixed β does not produce good results. Thus, we gradually change the penalty weight during training. For that purpose, we decompose the objective into each task and control β depending on each task as
L(θ, φ) = Σ_{t=1}^{T} [ ℓ_t(θ, φ) + β_t R_t(θ, φ) ],    (8)
where ℓ_t is the loss corresponding to task t and R_t(θ, φ) = KL( q_φ(z | D_t ∪ T_t) ‖ q_φ(z | D_t) ) is the regularization term for each training task. During training, we randomly sample tasks one after another and update the parameters, applying a different β_t at each update. There are several possible schedules for β_t (Figure 3), but we employ the cyclical annealing schedule during the training phase. In this way, each task can be trained with several different values of β_t throughout training, which helps avoid the information preference problem. Our experimental results show that this approach helps each task-specific learner avoid falling into local minima. We call this approach meta cyclical annealing (MCA).
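A β schedule of this kind can be written in a few lines (a sketch: the cycle count, ramp ratio, and linear ramp shape follow the style of [6], but the exact values used in our experiments are not implied here):

```python
def cyclical_beta(step, total_steps, n_cycles=4, ratio=0.5, beta_max=1.0):
    """Cyclical annealing schedule: within each cycle, beta ramps linearly
    from 0 to beta_max over the first `ratio` fraction of the cycle, then
    stays at beta_max for the remainder."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len   # position within the current cycle, in [0, 1)
    if pos < ratio:
        return beta_max * pos / ratio      # ramp-up phase
    return beta_max                        # plateau phase

# The weight beta_t applied to the regularization term at each update step.
schedule = [cyclical_beta(s, total_steps=100, n_cycles=4) for s in range(100)]
```

Under such a schedule, every cycle begins with β near 0, so the reconstruction term dominates and the latent code is re-learned before the regularizer is switched back on.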
3.2 Maximum mean discrepancy
Unfortunately, the KL term on the right-hand side of Eq. (8) is difficult to compute. Therefore, we employ the maximum mean discrepancy (MMD) [10] as the discrepancy measure between the distributions, which enables us to compute the corresponding term. MMD is a framework for quantifying the distance between two distributions by comparing all of their moments via a kernel technique. Let k be any positive definite kernel, such as the Gaussian kernel k(x, x') = exp(−‖x − x'‖² / (2σ²)). The MMD between distributions P and Q is defined through

MMD²(P, Q) = E_{x, x' ~ P}[k(x, x')] + E_{y, y' ~ Q}[k(y, y')] − 2 E_{x ~ P, y ~ Q}[k(x, y)].

It is known that if the kernel is characteristic, MMD(P, Q) = 0 if and only if P = Q [17]. A rough intuition of MMD is that the differences between the moments of the two distributions P and Q are measured through the (characteristic) kernel to determine how different the distributions are. MMD can accomplish this efficiently via the kernel embedding trick.
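An empirical estimate of the squared MMD follows directly from replacing the expectations with sample averages (a sketch with a fixed kernel bandwidth; in practice σ would be tuned):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs."""
    d = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(d ** 2, axis=-1) / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """Biased empirical estimate of MMD^2(P, Q) from samples x ~ P, y ~ Q."""
    kxx = gaussian_kernel(x, x, sigma).mean()   # estimate of E[k(x, x')]
    kyy = gaussian_kernel(y, y, sigma).mean()   # estimate of E[k(y, y')]
    kxy = gaussian_kernel(x, y, sigma).mean()   # estimate of E[k(x, y)]
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
same = mmd_squared(rng.normal(0, 1, (500, 2)), rng.normal(0, 1, (500, 2)))
diff = mmd_squared(rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2)))
```

Unlike the KL-divergence, this estimator needs only samples from the two distributions (no density evaluations), which is what makes it convenient inside a stochastic training objective.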
We propose to employ MMD as the alternative divergence because MMD is easy to calculate and is stable against support mismatch between the two distributions. To do so, instead of optimizing the objective introduced in Eq. (8), we minimize the following objective:

L(θ, φ) = Σ_{t=1}^{T} [ ℓ_t(θ, φ) + β_t MMD²( q_φ(z | D_t ∪ T_t), q_φ(z | D_t) ) ].    (9)
4 Experimental Results
In this section, we experimentally show that the information preference problem of the posterior distribution actually occurs in non-regularized amortized inference, and that our proposal, which restricts the parameters of the amortized distributions with MCA and MMD, significantly improves performance compared with existing methods.
4.1 Omniglot
Omniglot [14] consists of 1623 characters from 50 different alphabets. Each character was hand-drawn by 20 different people, giving 20 instances per class (character). We follow the preprocessing and training procedure of [25] and [9].
The training, validation and test sets consist of a random split of 1100, 100, and 423 characters, respectively. Each training iteration processes a mini-batch of random tasks extracted from the training set. During training, the k-shot samples of each task are used for adaptation and the remaining 15 instances are used as query inputs. Evaluation after training is conducted on 600 tasks randomly selected from the test set. At test time, k-shot instances of tasks unseen by the trained model are provided as inputs. We use the Adam [13] optimizer with a constant learning rate of 0.0001 and 16 tasks per batch to train all models.
4.2 miniImageNet
The miniImageNet dataset consists of a subset of 100 classes from the ImageNet dataset [4] and contains 600 images per class. The 100 classes are divided into 64 training, 16 validation, and 20 test classes. This dataset is complex and difficult enough for evaluating few-shot classification. Training proceeds in the same episodic manner as for Omniglot.
4.3 Effect of regularization
Here, we check whether our model estimates the latent code as expected via MCA and MMD. MCA and MMD regularize the latent distribution to be close to the standard Gaussian distribution to avoid the information preference problem. Figure 4 shows the distributions of the latent code. We can see that the distributions of MCA+NPs and MCA+MMD+NPs are well regularized and close to Gaussian (see Figure 4 (c) and (d)). This is the effect of our MCA and MMD regularization. On the other hand, the distribution for plain NPs is far from Gaussian, and the distribution of each class tends to degenerate to a point (like a delta distribution), which loses the variation of the latent code and results in a worse posterior approximation (see Figure 4 (a)). This supports our expectation that the MMD regularization effectively avoids the information preference problem.
4.4 Fewshot classification
To compare with existing methods, we evaluate our method on standard few-shot classification tasks: 20-way classification for Omniglot and 5-way classification for miniImageNet. We do not evaluate 5-way classification for Omniglot because existing methods already achieve more than 99% accuracy there, which is too high for comparing methods.
The results for Omniglot are shown in Table 1. Our proposals, MCA+NPs and MCA+MMD+NPs, perform well. For 20-way 1-shot classification on Omniglot, our model achieves a new state-of-the-art result (99.81 ± 0.14%), a significant improvement over existing methods. The results on miniImageNet are shown in Table 2. For miniImageNet, MCA+NPs achieves 77.37 ± 1.67% on 5-way 1-shot classification, and MCA+MMD+NPs achieves 91.78 ± 0.89% on 5-way 5-shot classification. Both results are also new state of the art. Furthermore, our models surpass VERSA, which suggests that mitigating the amortization error provides the improvement.
5 Conclusions
In this paper, we proposed the MCA+NPs and MCA+MMD+NPs models, which improve the amortized inference distribution with regularization techniques built on the recent few-shot learning frameworks VERSA [9] and NPs [8]. Comparing methods on common ground, our results show that the MCA+MMD+NPs model is comparable to state-of-the-art models under standard conditions, and the MCA+NPs model achieves comparable performance to recent state-of-the-art meta-learning algorithms on both the Omniglot and miniImageNet benchmarks. Additionally, our analysis suggests that our proposal avoids the information preference problem.
Acknowledgments
We would like to thank Naonori Ogasahara and Iwato Amano for insightful discussions. Taiji Suzuki was partially supported by JSPS Kakenhi (26280009, 15H05707 and 18H03201), and JSTCREST.
References
[1] (2016) Generating sentences from a continuous space. In Proceedings of the SIGNLL Conference on Computational Natural Language Learning (CoNLL).
[2] (2019) A closer look at few-shot classification. In International Conference on Learning Representations (ICLR).
[3] (2018) Inference suboptimality in variational autoencoders. In International Conference on Learning Representations (ICLR).
[4] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML).
[6] (2019) Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
[7] (2018) Conditional neural processes. In International Conference on Machine Learning (ICML).
[8] (2018) Neural processes. In International Conference on Machine Learning (ICML).
[9] (2019) Meta-learning probabilistic inference for prediction. In International Conference on Learning Representations (ICLR).
[10] (2007) A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems (NIPS).
[11] (2017) β-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR).
[12] (2018) Semi-amortized variational autoencoders. In International Conference on Learning Representations (ICLR).
[13] (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
[14] (2015) Human-level concept learning through probabilistic program induction. Science.
[15] (2019) Meta-learning with differentiable convex optimization.
[16] (2017) Meta-SGD: learning to learn quickly for few-shot learning.
[17] (2017) Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning 10 (1–2), pp. 1–141.
[18] (2018) On first-order meta-learning algorithms.
[19] (2018) Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2019) Amortized Bayesian meta-learning. In International Conference on Learning Representations (ICLR).
[21] (2019) Meta-learning with latent embedding optimization.
[22] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS).
[23] (2019) Fast and generalized adaptation for few-shot learning.
[24] (2018) Learning to compare: relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS).
[26] (2018) InfoVAE: information maximizing variational autoencoders.