Few-shot Learning: A Survey
The quests of “can machines think” and “can machines do what humans do” drive the development of artificial intelligence. Although recent artificial intelligence succeeds in many data-intensive applications, it still lacks the ability to learn from limited exemplars and to generalize fast to new tasks. To tackle this problem, one has to turn to machine learning, which supports the scientific study of artificial intelligence. In particular, a machine learning problem called Few-Shot Learning (FSL) targets exactly this case. It can rapidly generalize to new tasks with limited supervised experience by turning to prior knowledge, which mimics humans’ ability to acquire knowledge from few examples through generalization and analogy. FSL has been seen as a test-bed for real artificial intelligence, a way to reduce laborious data gathering and computationally costly training, and an antidote for learning from rare cases. With extensive works on FSL emerging, we give a comprehensive survey of it. We first give the formal definition of FSL. Then we point out the core issues of FSL, which turn the problem from “how to solve FSL” into “how to deal with the core issues”. Accordingly, existing works from the birth of FSL to the most recently published ones are categorized in a unified taxonomy, with thorough discussion of the pros and cons of the different categories. Finally, we envision possible future directions for FSL in terms of problem setup, techniques, applications and theory, hoping to provide insights to both beginners and experienced researchers. (Correspondence to Q. Yao at email@example.com.)
“Can machines think (Turing, 1950)?” This is the question raised in Alan Turing’s seminal paper entitled “Computing Machinery and Intelligence”, published in 1950. He made the statement that “The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer”. In other words, the ultimate goal of machines is to be as intelligent as humans. This opened the door to Artificial Intelligence (AI), a name coined by McCarthy et al. (1955). Since its birth, AI has gone through an initial flourish starting in 1956, two AI winters (Crevier, 1993; McCorduck, 2009) starting in the 1970s, and a revival since the 2000s. In recent years, owing to powerful computing devices such as GPUs, large-scale data sets such as ImageNet (Deng et al., 2009), and advanced models and algorithms such as CNNs (Krizhevsky et al., 2012), AI has quickened its pace towards human-like behavior and even defeats humans in many fields. To name a few, AlphaGo (Silver et al., 2016) has defeated human champions at the ancient game of Go, and ResNet (He et al., 2016) has surpassed the human classification accuracy on the 1000-class ImageNet data. In other fields, AI is involved in humans’ daily life as a set of highly intelligent tools, such as voice assistants, search engines, autonomous driving cars and industrial robots.
Despite its prosperity, AI still has some important steps to take before it acts like a human, and one of them is to rapidly generalize from a few data points to perform a task. Recall that humans can rapidly generalize what they learn to new task scenarios. For example, a child who has been taught how to add can rapidly transfer this knowledge to get the hang of multiplication given a few examples, e.g., 2 × 3 = 2 + 2 + 2. Another example is that, given one photo of a stranger, a child can easily identify that person among a large number of photos. Humans can do this because they combine what they learned in the past with new examples, and can therefore rapidly generalize to new tasks. In contrast, the aforementioned successful applications rely on exhaustive learning from large-scale data.
Bridging this gap between AI and human-like learning is an important direction. This can be tackled by turning to machine learning, a sub-field of AI which supports its scientific study with models, algorithms and theories. Concretely, machine learning is concerned with the question of how to construct computer programs that automatically improve with experience (Mitchell, 1997). Driven by the need to learn from limited supervised information to get the hang of a task, a new machine learning problem called Few-Shot Learning (FSL) (Fink, 2005; Fei-Fei et al., 2006) emerged. When there is only one exemplar to learn from, FSL is also called the one-shot learning problem. FSL can learn new tasks with limited supervised information by incorporating prior knowledge.
As discussed, FSL acts as a test-bed for real artificial intelligence. It first applies to applications that are well understood by humans, so as to fully learn like a human. A typical example is character recognition (Lake et al., 2015), where computer programs are asked to classify, parse and generate new handwritten characters given only a few shots. To deal with this task, one can decompose the characters into smaller parts transferable across characters, and then aggregate these smaller components into new characters. This is a way of learning like a human (Lake et al., 2017). Naturally, FSL also advances the development of robotics (Craig, 2009), which aims to develop machines that can replicate human actions so as to replace humans in some scenarios. Examples are one-shot imitation (Duan et al., 2017), multi-armed bandits (Duan et al., 2017), visual navigation (Duan et al., 2017), and continuous control in locomotion (Finn et al., 2017).
Besides being a test for real AI, FSL can also help relieve the burden of collecting large-scale supervised data for industrial usage. For example, ResNet (He et al., 2016) has surpassed the human classification accuracy on the 1000-class ImageNet data. However, this is under the circumstance that each class has sufficient labeled images. In contrast, humans can recognize around 30,000 classes (Biederman, 1987), and collecting sufficient images of each class for machines would be very laborious, almost a mission impossible. Instead, FSL can help reduce the data gathering effort for these data-intensive applications, such as image classification (Vinyals et al., 2016), image retrieval (Triantafillou et al., 2017), object tracking (Bertinetto et al., 2016), gesture recognition (Pfister et al., 2014), image captioning and visual question answering (Dong et al., 2018), video event detection (Yan et al., 2015), and language modeling (Vinyals et al., 2016). Likewise, being able to perform FSL can reduce the cost of computationally expensive applications such as one-shot architecture search (Brock et al., 2018). And when models and algorithms succeed at FSL, they can naturally be applied to data sets with many shots, which are easier to learn.
Another classic scenario for FSL comprises tasks where supervised information is hard or impossible to acquire for reasons such as privacy, safety or ethics. For example, drug discovery is the process of discovering the properties of new molecules so as to identify useful ones as new drugs (Altae-Tran et al., 2017). However, due to possible toxicity, low activity, and low solubility, these new molecules do not have many real biological records on clinical candidates. This makes drug discovery an FSL problem. Similar rare-case learning applications include few-shot translation (Kaiser et al., 2017) and cold-start item recommendation (Vartak et al., 2017), where the target tasks do not have many exemplars. It is through FSL that learning suitable models for these rare cases becomes possible.
With both the academic dream of real AI and the industrial need for cheap learning, FSL draws much attention and has become a hot topic. As a learning paradigm, many methods endeavor to solve it, such as meta-learning methods (Santoro et al., 2016), embedding methods (Vinyals et al., 2016) and generative modeling (Edwards and Storkey, 2017). However, there is no organized taxonomy that connects them, explains why some methods work while others fail, and discusses their pros and cons. Therefore, we give a comprehensive survey of the FSL problem. The contributions of this survey are summarized as follows:
We give the formal definition of FSL. It naturally links to the classic machine learning definition proposed in (Mitchell, 1997). The definition is not only general enough to include all existing FSL problems, but also specific enough to clarify the goal of FSL and how we can solve it. Such a definition is helpful for setting future research targets in the FSL area.
We point out the core issue of FSL based on error decomposition in machine learning: it is the unreliable empirical risk minimizer that makes FSL hard to learn. This can be relieved by satisfying or reducing the sample complexity of learning. Understanding the core issue helps categorize different works into data, model and algorithm according to how they deal with it. More importantly, this provides insights for improving FSL methods in a more organized and systematic way.
We perform an extensive literature review from the birth of FSL to the most recently published works, and categorize them in a unified taxonomy. The pros and cons of the different categories are thoroughly discussed. We also present a summary of the insights underlying each category. These will serve as a good guideline for both beginners and experienced researchers.
We envision four promising future directions for FSL in terms of problem setup, techniques, applications and theory. These insights are based on the weaknesses of the current development of FSL, with possible directions to explore in the future. We hope this part can provide some insights, contribute to solving the FSL problem, and strive towards real AI.
In comparison to the existing FSL-related survey on concept learning and experience learning for small samples (Shu et al., 2018), we provide a formal definition of what FSL is, why FSL is hard, and how FSL combines few-shot supervised information with prior knowledge to make learning possible. We conduct an extensive literature review based on the proposed taxonomy, with detailed discussion of pros and cons, summaries and insights. We also discuss the relatedness and differences between FSL and relevant topics such as semi-supervised learning, imbalanced learning, transfer learning and meta-learning.
The remainder of this survey is organized as follows. Section 2 provides an overview of the survey, including the definition of FSL, its core issues, related learning problems and a taxonomy of existing works. Section 3 presents the FSL methods that manipulate data to solve the FSL problem, Section 4 discusses the FSL methods that constrain the model so as to make FSL feasible, and Section 5 illustrates how the algorithm can be altered to help the FSL problem. In Section 6, we envision future directions for FSL from the perspectives of problem setup, techniques, applications and theory. Finally, the survey closes with conclusions in Section 7.
In this section, we first give the notation used throughout the paper in Section 2.1. A formal definition of the FSL problem is given in Section 2.2 with concrete examples. Considering that the FSL problem is related to many machine learning problems, we discuss the relatedness and differences between them and FSL in Section 2.3. Then, in Section 2.4, we reveal the core issue that makes the FSL problem hard. According to how existing works deal with this core issue, we present a unified taxonomy in Section 2.5.
Consider a supervised learning task T. FSL deals with a data set D = {D_train, D_test} with a training set D_train = {(x_i, y_i)}_{i=1}^I of small I and a test set D_test. Usually, people consider the N-way K-shot classification task, where D_train contains I = KN examples from N classes, each with K examples. Let p(x, y) be the ground-truth joint distribution of input x and output y. FSL learns to discover the optimal hypothesis ĥ from x to y by fitting D_train, and performs well on D_test. To approximate ĥ, the model determines a hypothesis space H of hypotheses h(·; θ) parameterized by θ (parametric h is used, as non-parametric models count on large-scale data to fit the shape of the distribution, hence they are not suitable for FSL). The optimization algorithm is the strategy to search through H in order to find the θ that parameterizes the best h* in H for D_train. The performance is measured by a loss function ℓ(ŷ, y) defined over the prediction ŷ = h(x; θ) and the observed output y.
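The N-way K-shot setup above can be sketched in code. This is a minimal illustration with our own function names (not code from any cited work): sample N classes, take K labeled examples per class as D_train, and hold out further examples of those classes as D_test.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(x, y, n_way=3, k_shot=2, n_query=2):
    # Build one N-way K-shot task: D_train has I = K*N examples.
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    train, test = [], []
    for c in classes:
        idx = rng.permutation(np.where(y == c)[0])
        train += list(idx[:k_shot])                    # K shots per class
        test += list(idx[k_shot:k_shot + n_query])     # held-out queries
    return (x[train], y[train]), (x[test], y[test])

# Toy pool: 10 classes with 10 examples each.
x = rng.normal(size=(100, 8))
y = np.repeat(np.arange(10), 10)
(x_tr, y_tr), (x_te, y_te) = sample_episode(x, y)
print(len(y_tr), len(y_te))  # I = K*N = 6 training examples, 6 test examples
```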
2.2. Problem Definition
As FSL is naturally a sub-area of machine learning, before giving the definition of FSL, let us recall how machine learning is defined in the literature. We adopt Mitchell’s definition here, shown in Definition 2.1.
Definition 2.1 (Machine learning (Mitchell, 1997)).
A computer program is said to learn from experience E with respect to some classes of tasks T and performance measure P if its performance can improve with E on T as measured by P.
As we can see, a machine learning problem is specified by E, T and P. For example, let T be an image classification task: machine learning programs can improve their P, measured by classification accuracy, through E obtained by training with large-scale labeled images, e.g., the ImageNet data set (Krizhevsky et al., 2012). Another example is the recent computer program AlphaGo (Silver et al., 2016), which has defeated human champions at the ancient game of Go (T). It improves its winning rate (P) against opponents through E consisting of training on a database of around 30 million recorded moves of human experts as well as playing against itself repeatedly.
The above-mentioned typical applications of machine learning require a lot of supervised information for the given tasks. However, as mentioned in the introduction, this may be difficult or even impossible to obtain. FSL is a special case of machine learning, which exactly targets getting good learning performance with the limited supervised information provided in the experience E. Formally, FSL is defined in Definition 2.2.
Definition 2.2 (Few-Shot Learning).
Few-Shot Learning (FSL) is a type of machine learning problem (specified by E, T and P) where E contains little supervised information for the target T.
To understand this definition better, let us show three typical scenarios of FSL (Table 1):
Test bed for human-like learning: To move towards human intelligence, computer programs with the ability to solve FSL problems are vital. A popular task (T) is to generate samples of a new character given only a few examples (Lake et al., 2015). Training a computer program with solely the given examples is not enough. Inspired by how humans learn, the computer program learns to recognize this character based on prior knowledge of parts and relations. Now E contains both the given examples in the data set as supervised information and the pre-trained concepts as prior knowledge. The generated characters are evaluated through the pass rate of a visual Turing test (P), which discriminates whether the images are generated by humans or machines. With this enlarged experience, computer programs can also classify, parse and generate new handwritten characters from only a few shots, like humans.
Few-shot to reduce data gathering effort and computation cost: FSL can also help relieve the burden of collecting large-scale supervised information. Consider the task (T) of classifying classes given only a few shots through FSL (Fei-Fei et al., 2006). The image classification accuracy (P) improves with the E obtained from the few labeled images of each class of the target T, together with prior knowledge extracted from other classes, such as raw images for co-training, pre-trained models to adapt, or a good initialization point for the algorithm to start with. Models that succeed in this task usually have higher generality, hence can easily be applied to many-shot scenarios.
Few-shot due to rare cases: Finally, consider tasks where supervised information is hard or impossible to acquire for reasons such as privacy, safety or ethics. It is through FSL that learning suitable models for these rare cases becomes possible. For example, drug discovery is the process of discovering the properties of new molecules so as to identify useful ones as new drugs (Altae-Tran et al., 2017). However, due to possible toxicity, low activity, and low solubility, these new molecules do not have many real biological records on clinical candidates. This makes drug discovery an FSL problem. Concretely, consider a common drug discovery task (T) which is to predict whether a new molecule brings toxic effects. To make FSL feasible, E contains both the new molecule’s limited assay and many similar molecules’ assays as prior knowledge. The P is measured by the percentage of molecules correctly assigned as toxic or non-toxic.
| T | supervised information (in E) | prior knowledge (in E) | P |
| --- | --- | --- | --- |
| character generation (Lake et al., 2015) | a few examples of the new character | pre-learned knowledge of parts and relations | pass rate of visual Turing test |
| image classification (Koch, 2015) | few labeled images for each class of the target | raw images of other classes, or pre-trained models | classification accuracy |
| drug toxicity discovery (Altae-Tran et al., 2017) | new molecule’s limited assay | similar molecules’ assays | classification accuracy |
As only a little supervised information directly related to T is contained in E, it is natural that common supervised machine learning approaches fail on FSL problems. Therefore, FSL methods combine the available supervised information in E with prior knowledge to make learning the target T feasible.
2.3. Relevant Learning Problems
In this section, we discuss the learning problems relevant to FSL. Their relatedness to and differences from FSL are specifically clarified.
Semi-supervised learning (Zhu, 2005) learns the optimal hypothesis from input x to output y from an experience E consisting of both labeled and unlabeled examples. Example applications are text and web page classification, where obtaining the output y for every input x is not possible due to the large scale of the data. Usually the unlabeled examples come in large quantity while the labeled examples are few. The unlabeled data can be used to form clusters in the space of input x; a decision boundary is then constructed by separating these clusters. Learning in this way can achieve better accuracy than using the small-scale labeled data alone. Positive-unlabeled learning (Li et al., 2009) is a special case of semi-supervised learning, where only positive and unlabeled samples are given; the unlabeled samples can be either positive or negative. For example, in friend recommendation in social networks, we can only recommend according to the user’s friend list, while the user’s relationship to other people is unknown. Another popular special case of semi-supervised learning, active learning (Settles, 2009), selects informative unlabeled data to query an oracle for the output y. This is usually used in applications where annotation is costly, such as pedestrian detection. By definition, few-shot learning can be supervised learning, semi-supervised learning or reinforcement learning, depending on what kind of data is available apart from the little supervised information.
Imbalanced learning (He and Garcia, 2008) learns from an experience E with a severely skewed distribution of the output y. This occurs when some values of y are rarely taken, as in fraud detection and catastrophe anticipation. It trains and tests so as to choose among all possible y. In contrast, FSL trains for the values of y with only a few shots, possibly taking other values of y as prior knowledge to help learning, and only predicts for the few-shot y.
Transfer learning (Pan and Yang, 2010) transfers knowledge learned from a source domain and source task, where sufficient training data is available, to a target domain and target task, where training data is limited. A domain is specified by a feature space and a marginal distribution of the input (Pan and Yang, 2010). Transfer learning has been used in cross-domain recommendation, and in WiFi localization across time periods, spaces and mobile devices. Domain adaptation (Ben-David et al., 2007) is a type of transfer learning where the tasks are the same but the domains are different. For example, the task is sentiment analysis, while the source domain data consists of customer comments on movies and the target domain data of customer comments on daily goods. Another transfer learning problem closely related to FSL is zero-shot learning (Lampert et al., 2009). Both FSL and zero-shot learning are extreme cases of transfer learning, as they need to transfer prior knowledge learned from other tasks or domains (Goodfellow et al., 2016). However, FSL and zero-shot learning learn for new classes using different strategies: FSL manages to learn from limited training examples with the help of prior knowledge, while zero-shot learning directly uses prior knowledge from other data sources to construct the hypothesis h. It recognizes new classes with no supervised training examples by linking them to existing classes that one has already learned. Due to the lack of supervised information, the linking between classes is extracted from other data sources. It is suitable for situations where supervised examples are extremely difficult or expensive to get, such as neural activity encoding (Palatucci et al., 2009). For example, in image classification, this relationship can be annotated by humans, mined from text corpora or extracted from a lexical database (Xian et al., 2018).
Meta-learning (Schmidhuber et al., 1996), or learning-to-learn (Hochreiter et al., 2001), improves performance P on task T using the data set D of the task and the meta-knowledge extracted across tasks by a meta-learner. Here, learning occurs at two levels: the meta-learner gradually learns generic information (meta-knowledge) across tasks, and the learner rapidly generalizes the meta-learner for a new task using task-specific information. It can be used in scenarios where meta-knowledge is useful, such as learning optimization algorithms (Li and Malik, 2016; Andrychowicz et al., 2016), reinforcement learning (Finn et al., 2017) and FSL problems (Santoro et al., 2016; Vinyals et al., 2016). Indeed, many methods discussed in this survey are meta-learning methods, hence we introduce it formally for reference. Vividly, the meta-learner gives the sketch of h while the learner completes the concrete h. The learning of the meta-learner needs large-scale data. Let p(T) be the distribution of tasks T. In meta-training, the meta-learner learns from a set of tasks T_s drawn from p(T). Each task T_s operates on a data set D_s = {D_s^train, D_s^test} of N classes. Each learner learns from D_s^train and measures the test error on D_s^test. The parameter θ of the meta-learner is learned to minimize the error across all learners by

θ* = arg min_θ Σ_{T_s} Σ_{(x, y) ∈ D_s^test} ℓ(h(x; θ), y).
Then, in meta-testing, another disjoint set of tasks T_t drawn from p(T) is used to test the generalization ability of the meta-learner. Each T_t works on a data set D_t = {D_t^train, D_t^test} of N classes. Finally, each learner learns from D_t^train and tests on D_t^test to obtain the meta-testing error.
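The two-level setup above can be sketched with a Reptile-style meta-learner on toy one-dimensional regression tasks. The task family, learning rates and function names here are illustrative assumptions for the sketch, not a method from the surveyed papers: the meta-learner accumulates an initialization across tasks, and each learner adapts it with a few gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_update(theta, a, steps=32, lr=0.1):
    # Learner: adapt to one task T_s (fit y = theta * x to y = a * x)
    # using a few examples per gradient step of the squared loss.
    for _ in range(steps):
        x = rng.normal(size=8)
        grad = np.mean(2.0 * (theta - a) * x ** 2)
        theta = theta - lr * grad
    return theta

# Meta-learner (Reptile-style): learn an initialization theta across tasks
# drawn from p(T); here each task is y = a * x with a ~ Uniform(1, 3).
theta, meta_lr = 0.0, 0.1
for _ in range(500):
    a = rng.uniform(1.0, 3.0)                 # sample a task T_s from p(T)
    theta_task = inner_update(theta, a)       # learner adapts on D_s^train
    theta += meta_lr * (theta_task - theta)   # move toward the adapted weights
# theta now sits near the task average E[a] = 2: a good starting point
# from which any new task can be reached with only a few gradient steps.
```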
To aid understanding of meta-learning, an illustration of its setup is given in Figure 1.
2.4. Core Issues
Usually, we cannot get perfect predictions for a machine learning problem; i.e., there are some prediction errors. In this section, we illustrate the core issue underlying FSL based on error decomposition in machine learning (Bottou and Bousquet, 2008; Bottou et al., 2018).
Recall that machine learning is about improving P, with E, on T. In terms of our notation, this can be written as finding the parameter θ that minimizes the loss of h(·; θ) over D_train, i.e.,

θ* = arg min_θ Σ_{(x_i, y_i) ∈ D_train} ℓ(h(x_i; θ), y_i).

Therefore, learning is about the algorithm searching in H for the θ which parameterizes the hypothesis h, chosen by the model, that best fits the data D_train.
2.4.1. Empirical Risk Minimization
In essence, we want to minimize the expected risk R, which is the loss measured with respect to p(x, y). Let ŷ = h(x; θ) be the prediction of hypothesis h for input x. The expected risk R(h) is defined as

R(h) = ∫ ℓ(h(x; θ), y) dp(x, y) = E[ℓ(h(x; θ), y)].

However, p(x, y) is unknown. Hence the empirical risk R_I(h), defined as the average of the sample losses over the training data set D_train (of I samples),

R_I(h) = (1/I) Σ_{i=1}^{I} ℓ(h(x_i; θ), y_i),

is used to estimate the expected risk R(h),
and learning is done by empirical risk minimization (Vapnik, 1992) (perhaps also with some regularizers). For illustrative purposes, let
ĥ = arg min_h R(h), the function at which R(h) attains its minimum;
h* = arg min_{h ∈ H} R(h), the hypothesis at which R(h) is minimized with respect to H;
h_I = arg min_{h ∈ H} R_I(h), the hypothesis at which R_I(h) is minimized with respect to H.
Assume ĥ, h* and h_I are unique, for simplicity. The total error of learning, taken with respect to the random choice of training set, can be decomposed into

E[R(h_I) − R(ĥ)] = E[R(h*) − R(ĥ)] + E[R(h_I) − R(h*)],   (2)

where the approximation error E[R(h*) − R(ĥ)] measures how closely the functions in H can approximate the optimal solution ĥ, and the estimation error E[R(h_I) − R(h*)] measures the effect of minimizing the empirical risk R_I(h) instead of the expected risk R(h). The estimation error is also called the generalization error.
As shown, the total error is affected by the hypothesis space H and the number I of examples in D_train. In other words, learning to reduce the total error can be attempted from the perspectives of data, which offers D_train; model, which determines H; and algorithm, which searches through H for the parameter θ of the best h*.
2.4.2. Sample Complexity
Sample complexity refers to the number of training samples needed to guarantee that the effect of minimizing the empirical risk instead of the expected risk is within accuracy ε of the best possible h* with probability at least 1 − δ. Mathematically, for ε > 0 and δ ∈ (0, 1), the sample complexity is an integer S(ε, δ) such that for any I ≥ S(ε, δ), we have

Pr(R(h_I) − R(h*) ≤ ε) ≥ 1 − δ.
When S(ε, δ) is finite, H is learnable. In conclusion, empirical risk minimization is closely related to sample complexity. To obtain a reliable empirical risk minimizer h_I, we can turn to reducing the sample complexity.
For an infinite hypothesis space H, its complexity can be measured in terms of the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1974). The VC dimension is defined as the size of the largest set of inputs that can be shattered (split in all possible ways) by H. The sample complexity is tightly bounded as

S(ε, δ) = Θ((VC(H) + log(1/δ)) / ε²),

where the lower and upper bounds are proven in (Vapnik and Chervonenkis, 1974) and (Talagrand et al., 1994) respectively. As shown, the sample complexity increases with a more complicated H chosen by the model, a higher probability 1 − δ that the learned h is approximately correct, and a higher demand of optimization accuracy ε from the algorithm.
2.4.3. Unreliable Empirical Risk Minimizer
Note that, for the estimation error E[R(h_I) − R(h*)] in (2), we have

E[R(h_I) − R(h*)] → 0 as I → ∞,   (5)

which means more examples can help reduce the estimation error. Besides, we also have

Pr(R(h_I) − R(h*) > ε) → 0 as I → ∞, for any ε > 0.   (6)
Thus, in the common setting of supervised learning tasks, the training data set is armed with sufficient supervised information, i.e., I is large. The empirical risk minimizer h_I can then provide a good (i.e., by (5)) and stable (i.e., by (6)) approximation to the best possible h* among the hypotheses in H.
However, the number I of available examples is small in FSL, smaller than the required sample complexity S(ε, δ). Therefore, the empirical risk R_I(h) is far from being a good approximation of the expected risk R(h), and the resultant empirical risk minimizer h_I is neither good nor stable. Indeed, this is the core issue underneath FSL: the empirical risk minimizer is no longer reliable. Therefore, FSL is much harder than common machine learning settings. A comparison between the common and few-shot settings is shown in Figure 2.
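To make this unreliability concrete, the following simulation (our own illustration, not from the cited works) shows how the empirical risk R_I of a fixed hypothesis scatters around the expected risk R as the number of samples I varies: with few-shot I, different draws of D_train yield wildly different risk estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(I, trials=2000):
    # Squared loss of the constant predictor h(x) = 0 on targets y ~ N(1, 1);
    # the expected risk is R = E[y^2] = 2.  Each row of `losses` simulates
    # one random draw of a training set D_train with I examples.
    losses = rng.normal(1.0, 1.0, size=(trials, I)) ** 2
    return losses.mean(axis=1)   # one empirical risk R_I per simulated D_train

few, many = empirical_risk(I=5), empirical_risk(I=5000)
# The few-shot estimates scatter far more widely around R = 2,
# so the minimizer chosen from them is correspondingly unreliable.
print(few.std() > 10 * many.std())
```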
Historically, classical machine learning methods learn with regularization (Goodfellow et al., 2016) so that the learned model generalizes to new data. Regularization techniques are deeply rooted in machine learning, helping to reduce the estimation error and obtain better learning performance (Mitchell, 1997). Classical examples include the Tikhonov regularizer (Hoerl and Kennard, 1970), the lasso regularizer (Tibshirani, 1996) and early stopping (Zhang et al., 2005). However, these simple regularization techniques cannot address the FSL problem. The reason is that they neither bring in extra supervised information nor exploit prior knowledge; therefore they cannot satisfy or reduce the sample complexity S(ε, δ), and in turn cannot address the unreliability of the empirical risk minimizer caused by the small I. Thus, learning with regularization alone is not enough to offer good prediction performance for the FSL problem.
In the above sections, we have shown how learning is performed by empirical risk minimization and why the critical problem underneath FSL is the unreliable empirical risk minimizer. We have also linked empirical risk minimization to the sample complexity S(ε, δ), and examined the sample complexity in terms of data, model and algorithm. Existing works try to overcome the unreliability of the empirical risk minimizer from these three perspectives:
Data: use prior knowledge to augment D_train, so as to provide an accurate empirical risk R_I of smaller variance and to meet the sample complexity S(ε, δ) needed by common models and algorithms (Figure 3(a)).
Model: design H based on prior knowledge in the experience E to constrain the complexity of H and reduce its sample complexity S(ε, δ). An illustration is shown in Figure 3(b): the gray areas are not considered in later optimization, as prior knowledge rules them out as possibly containing the optimal h*. For this smaller H, D_train is enough to learn a more reliable h_I, as the sample complexity is reduced.
Algorithm: take advantage of prior knowledge to search for the θ which parameterizes the best h* in H. The prior knowledge alters the search by providing a good initial point to begin the search, or by directly providing the search steps. Meta-learning methods are one popular example of this kind. Instead of working towards the unreliable h_I, this perspective directly targets h*. We use a transparent marker for h_I in Figure 3(c) to show that the optimization of the empirical risk can be skipped here. With meta-learned prior knowledge from other tasks, meta-learning methods can offer each task a good initialization point to fine-tune on D_train, or guide the search in the correct optimization direction and at the right pace towards h*.
Following this setup, we provide a taxonomy of how existing works solve FSL in terms of manipulating the sample complexity with the help of prior knowledge. An overview is given in Figure 4, where we explicitly show what prior knowledge is included in the experience E for each category.
Methods of this category solve the FSL problem by augmenting the data D_train using prior knowledge, so as to enrich the supervised information in E. With more samples, the data is sufficient to meet the sample complexity S(ε, δ) needed by subsequent machine learning models and algorithms, and to obtain a more reliable h_I with smaller variance.
Next, we introduce in detail how data is augmented in FSL using prior knowledge. Depending on the type of prior knowledge, we classify these methods into four kinds, as shown in Table 2. Accordingly, an illustration of how the transformation works is shown in Figure 5. As the augmentation of each of the N classes in D_train is done independently, we illustrate using one class in D_train as an example.
| prior knowledge | input | transformation | output |
| --- | --- | --- | --- |
| handcrafted rule | original (x_i, y_i) | handcrafted rule on x_i | (transformed x_i, y_i) |
| learned transformation | original (x_i, y_i) | learned transformation on x_i | (transformed x_i, y_i) |
| unlabeled data set | unlabeled data | predictor trained on D_train | (unlabeled data, label predicted by the predictor) |
| similar data set | samples from a similar data set | aggregate a new (x, y) by a weighted average of samples of the similar data set | aggregated sample |
3.1. Duplicate with Transformation
This strategy augments D_train by duplicating each (x_i, y_i) into several samples, with some transformation applied to bring in variation. The transformation procedure, which can be learned from similar data or handcrafted using human expertise, is included in the experience E as prior knowledge. It has only been applied to images so far, as the synthesized images can be easily evaluated by humans.
3.1.1. Handcrafted Rule
On image recognition tasks, many works augment D_train by transforming the original examples in D_train using handcrafted rules as a pre-processing routine, e.g., translating (Shyam et al., 2017; Lake et al., 2015; Santoro et al., 2016; Benaim and Wolf, 2018), flipping (Shyam et al., 2017; Qi et al., 2018), shearing (Shyam et al., 2017), scaling (Lake et al., 2015; Zhang et al., 2018b), reflecting (Edwards and Storkey, 2017; Kozerawski and Turk, 2018), cropping (Qi et al., 2018; Zhang et al., 2018b) and rotating (Santoro et al., 2016; Vinyals et al., 2016) the given examples.
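A minimal sketch of such handcrafted rules (our own illustration, independent of any cited implementation): each few-shot image is duplicated under label-preserving transformations such as flipping, rotating, and translating.

```python
import numpy as np

def augment(image):
    # Duplicate one image into several samples via handcrafted rules;
    # all copies keep the original label y_i.
    shifted = np.roll(image, shift=1, axis=1)  # translate by one pixel
    return [
        image,
        np.fliplr(image),                      # horizontal flip
        np.rot90(image),                       # 90-degree rotation
        shifted,
    ]

img = np.arange(9).reshape(3, 3)
aug = augment(img)
print(len(aug))  # 4 samples obtained from one original image
```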
3.1.2. Learned Transformation
In contrast, this strategy augments D_train by duplicating the original examples into several samples, which are then modified by a learned transformation. The learned transformation itself is the prior knowledge in E, while neither its training samples nor its learning procedure are needed for the current FSL task.
The earliest paper on FSL (Miller et al., 2000) uses exactly this strategy to solve FSL image classification. A set of geometric transformations is learned from a similar class by iteratively aligning each sample with the other samples. These learned transformations are then applied to each (x_i, y_i) to form a large data set which can be learned from normally. Similarly, Schwartz et al. (2018) learn a set of auto-encoders from a similar class, each representing one intra-class variability, to generate new samples by adding the learned variation to x_i. Assuming all categories share a general transformable variability across samples, a single transformation function is learned in (Hariharan and Girshick, 2017) to transfer the variation between sample pairs learned from other classes to (x_i, y_i) by analogy. In object recognition, objects often have transient attributes, such as sunny for a scene or white for snow. In contrast to enumerating the variability within pairs, Kwitt et al. (2016) transform each x_i into several new samples using a set of independent attribute strength regressors learned from a large set of scene images with fine-grained annotations, and assign these new samples the label of the original x_i. Based on (Kwitt et al., 2016), Liu et al. (2018) further propose learning a continuous attribute subspace so as to easily interpolate and embed any attribute variation into x_i.
Duplicating by handcrafted rules is task-invariant. It is popularly used in deep models to reduce the risk of overfitting (Goodfellow et al., 2016). However, deep models are usually applied to large-scale data sets, where there are enough samples to roughly estimate the underlying distribution (either the conditional distribution for discriminative models or the generating distribution for generative models) (Mitchell, 1997). In this case, augmenting with more samples can make the shape of the distribution clearer. In contrast, an FSL task contains only a little supervised information, so its distribution is barely revealed. Directly applying handcrafted rules without considering the task or the desired data property can easily lead the estimated distribution astray. Therefore, this strategy can only mediate rather than solve the FSL problem, and is mainly used as a pre-processing step for image data.
As for duplicating by learned transformation, it can augment more suitable samples, as it is data-driven and exploits prior knowledge akin to the current task. However, this prior knowledge needs to be extracted from similar tasks, which may not always be available and can be costly to collect.
3.2. Borrow From Other Data Sets
This strategy borrows samples from other data sets and adapts them to resemble samples of the target task, so as to augment the few-shot supervised information.
3.2.1. Unlabeled Data Set
This strategy uses a large set of unlabeled samples as prior knowledge, which possibly contains samples with the same labels as the few-shot examples. The crux is to find those samples and add them as augmentation. As this unlabeled data set is usually large, it can contain enormous variations of samples, and adding them can help depict a more precise distribution. This strategy is used for gesture recognition from videos in (Pfister et al., 2014): a classifier learned from the few-shot examples is used to pick out the same gesture from a large but weakly supervised gesture reservoir, which contains large variations of continuous gestures of different people but no clear break between gestures. The final gesture classifier is then built using these selected samples. In (Douze et al., 2018), label propagation is used to label the unlabeled data set directly.
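The selection step can be sketched as confidence-thresholded pseudo-labeling. Here the classifier is a nearest-centroid model with a softmax over negative squared distances, a simplified stand-in (not the actual classifier of Pfister et al., 2014):

```python
import numpy as np

def select_confident(support_x, support_y, pool_x, threshold=0.9):
    """Pseudo-label a large unlabeled pool with a classifier built from
    the few-shot support set, keeping only confident picks.

    A sketch: nearest-centroid with softmax confidence over negative
    squared distances; the threshold is a hypothetical hyper-parameter.
    """
    classes = np.unique(support_y)
    centroids = np.stack([support_x[support_y == c].mean(axis=0)
                          for c in classes])
    d = ((pool_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    conf = p.max(axis=1)
    keep = conf > threshold
    return pool_x[keep], classes[p.argmax(axis=1)[keep]]

# Two 1-D "classes"; the ambiguous pool sample (5.0) is rejected.
sel_x, sel_y = select_confident(
    np.array([[0.0], [10.0]]), np.array([0, 1]),
    np.array([[0.1], [9.9], [5.0]]))
```

Only the confidently labeled pool samples are then added to the few-shot training set.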
3.2.2. Similar Data Set
This strategy augments by aggregating sample pairs from other, similar many-shot data sets. By similar, we mean the classes in these data sets are similar, such as one data set of different kinds of tigers and another data set of different kinds of cats. The underlying assumption is that the same hypothesis applies to all classes, and that the similarity among classes of the auxiliary data set can be transferred to the few-shot classes. New samples can therefore be generated as a weighted average of sample pairs from the similar data set, where the weight is usually some similarity measure: for each few-shot class, the similarity is measured between that class and each class in the similar data set. In this way, the few-shot set can be augmented using aggregated samples from the similar data set, which serves as the prior knowledge in the training experience. The similarity itself can be extracted from other information sources, such as a text corpus or a class hierarchy (Tsai and Salakhutdinov, 2017). However, as this kind of similarity is not designed for the target task, it can be misleading. Besides, directly adding the aggregated samples can introduce high bias, as these samples are not from the target FSL class. Gao et al. (2018) design a method based on the generative adversarial network (GAN) (Goodfellow et al., 2014) to generate indistinguishable synthetic samples aggregated from a many-shot data set, where both the mean and covariance of each many-shot class are used in the aggregation to allow more variability in the generating process. The similarity between classes of the many-shot data set and the current class is measured using the class means only.
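The core aggregation can be sketched as a similarity-weighted average between a few-shot sample and samples borrowed from the similar data set; the similarity weights here are hypothetical inputs (in practice they would come from, e.g., a text corpus or class hierarchy):

```python
import numpy as np

def synthesize(x, borrowed, sims):
    """Generate new samples for a few-shot class as similarity-weighted
    averages of the few-shot sample x and borrowed samples.

    A sketch of the weighted-average aggregation; `sims` in [0, 1] are
    assumed class-similarity weights.
    """
    return np.stack([s * x + (1 - s) * b for b, s in zip(borrowed, sims)])

new = synthesize(np.array([0.0, 0.0]),
                 np.array([[2.0, 2.0]]),
                 [0.5])
```

The synthesized samples inherit the few-shot class label, which is exactly why a poorly chosen similarity can bias the augmented set.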
The use of unlabeled data is usually cheap, as no human effort is needed for labeling. However, accompanying this cheapness, the data quality is usually low, e.g., labels are coarse and there is no strict data collection and scrutinizing procedure, resulting in uncertain synthesizing quality. Besides, it is costly to pick useful samples out of such a large data set.
A similar data set shares some properties with the target task and contains sufficient supervised information, making it a more informative data source to exploit. However, determining the key property by which to seek a similar data set can be subjective, and collecting such a data set is laborious.
By augmenting the few-shot data, methods in this section reach the desired sample complexity and obtain a reliable empirical risk minimizer. The first kind of method duplicates each original sample by handcrafted or learned transformation rules. As it augments based on the original samples, the constructed new samples will not stray too far from them; but for the same reason, given the few shots and some transformation rules, there may not be many combination choices. The second kind of method borrows samples from other data sets and adapts them to mimic the few-shot samples. Considering the scale of the data sets to be borrowed from, either unlabeled or similar ones, there are tremendous samples for transformation. However, adapting those samples to resemble the few-shot samples can be hard.
In general, solving FSL from the perspective of data augmentation is straightforward. The data can be augmented with the target of the problem in mind, which eases learning, and the augmentation procedure is usually interpretable to humans. If the prior knowledge guiding the augmentation were ideal, it could generate as many samples as the required sample complexity, and any common machine learning model and algorithm could then be used. However, as the ground-truth distribution is unknown, perfect prior knowledge is not possible. This means the augmentation procedure is not precise: the gap between the estimate and the ground truth largely interferes with the data quality, and can even lead to concept drift.
The model determines a hypothesis space of hypotheses, parameterized by some parameter, that approximates the optimal mapping from input to output.
If common machine learning models are used to deal with the few-shot data, they have to choose a small hypothesis space. As shown in (4), a small hypothesis space has small sample complexity, thus requiring fewer samples to be trained (Mitchell, 1997). When the learning problem is simple, e.g., the feature dimension is low, a small hypothesis space can indeed achieve good learning performance. However, learning problems in the real world are usually very complex and cannot be well represented by a hypothesis from a small hypothesis space (Goodfellow et al., 2016). Therefore, a large hypothesis space is preferred for FSL, which makes common machine learning models infeasible. As we will see in the sequel, methods in this section handle a large hypothesis space by complementing the lack of samples with prior knowledge in the training experience. Specifically, the prior knowledge affects the design choices by constraining the hypothesis space. In this way, the sample complexity is reduced, empirical risk minimization becomes more reliable, and the risk of overfitting drops. In terms of what prior knowledge is used, methods of this kind can be further classified into four strategies, as summarized in Table 3.
| strategy | prior knowledge | how to constrain the hypothesis space |
| --- | --- | --- |
| multi-task learning | other tasks with their data sets | share parameters |
| embedding learning | embedding learned from/together with other tasks | project samples to a smaller embedding space where similar and dissimilar samples can easily be discriminated |
| learning with external memory | embedding learned from other tasks, interacting with memory | refine samples by the contents stored in memory to incorporate task-specific information |
| generative models | prior for the parameter learned from other tasks | restrict the form of the distribution |
4.1. Multitask Learning
Multitask learning (Caruana, 1997) learns multiple learning tasks simultaneously, exploiting both the generic information shared across tasks and the specific information of each task. These tasks are usually related. For example, in document classification, a task may be classification for one specific category, such as cats; it shares some similarity with other tasks, such as classification for tigers or dogs, that can be exploited. When the tasks are from different domains, this is also called domain adaptation (Goodfellow et al., 2016). Multitask learning is popularly used in applications where there exist multiple related tasks, each with a limited number of training examples; hence it can be used to solve FSL problems. Here we present some instantiations of using multitask learning for FSL problems. For a comprehensive introduction to multitask learning, please refer to (Zhang and Yang, 2017) and (Ruder, 2017).
Formally, we are given a set of related tasks, including both few-shot and many-shot tasks, each operating on its own data set. We call the few-shot tasks target tasks and the many-shot tasks source tasks. Multitask learning learns from all these data sets to obtain one hypothesis per task. As the tasks are related, they are assumed to have similar or overlapping hypothesis spaces. Explicitly, this is done by sharing parameters among the tasks, and the shared parameters can be viewed as a way for the jointly learned tasks to constrain each other's hypothesis. In terms of whether parameter sharing is explicitly enforced, we separate methods of this strategy into hard and soft parameter sharing. Illustrations of hard and soft parameter sharing are in Figure 6.
4.1.1. Hard parameter sharing
This strategy explicitly shares parameters among tasks to promote overlapping hypothesis spaces, and can additionally learn a task-specific parameter for each task to account for task specialties. In (Zhang et al., 2018b), this is done by sharing the first several layers of two networks to learn the generic information from both the source and the target task, while learning a different last layer for each task to deal with its different output; the paper also proposes a method to select only the most relevant samples from the source tasks to contribute to learning. Benaim and Wolf (2018) operate in the opposite way for domain adaptation: they learn a separate embedding for the source and target tasks in different domains to map them into a task-invariant space, then learn a shared classifier for samples from all tasks. Finally, Motiian et al. (2017) consider one-shot domain translation, i.e., generating source-domain samples conditioned on the few-shot target task in the target domain. Similar to (Zhang et al., 2018b), they first pre-train a variational auto-encoder on the source tasks in the source domain and clone it for the target task. They then share the layers that capture generic information, i.e., the top layers of the encoder and the lower layers of the decoder, while letting both tasks keep some task-specific layers. The target task can only update its task-specific layers, while the source task can update both the shared and its specific layers. This avoids using the few-shot data to directly update the shared layers, reducing the risk of overfitting: the shared layers are only indirectly adjusted by the target task's information so as to translate.
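Hard sharing can be sketched as one shared feature layer with a separate head per task (a toy linear model standing in for the shared CNN layers of Zhang et al., 2018b; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class HardSharedModel:
    """Two tasks sharing a first-layer weight matrix, each with its own
    task-specific output head.

    A sketch of hard parameter sharing; real methods share the first
    several layers of deep networks instead of a single linear map.
    """
    def __init__(self, d_in, d_hidden, n_classes_per_task):
        self.W_shared = rng.normal(size=(d_in, d_hidden))  # shared layer
        self.heads = [rng.normal(size=(d_hidden, c))       # per-task heads
                      for c in n_classes_per_task]

    def forward(self, x, task):
        h = np.maximum(x @ self.W_shared, 0)  # shared ReLU feature
        return h @ self.heads[task]           # task-specific output

model = HardSharedModel(d_in=8, d_hidden=4, n_classes_per_task=[3, 5])
x = rng.normal(size=(2, 8))
out0 = model.forward(x, task=0)  # shape (2, 3)
out1 = model.forward(x, task=1)  # shape (2, 5)
```

Gradients from every task would flow into `W_shared`, so the shared layer is constrained by all tasks jointly while each head stays task-specific.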
4.1.2. Soft parameter sharing
This strategy does not explicitly share parameters across tasks. Instead, each task has its own hypothesis space and parameter, and the parameters of different tasks are merely encouraged to be similar, resulting in similar hypothesis spaces. This can be done by regularization. Yan et al. (2015) penalize the pairwise differences among all parameter combinations, forcing all tasks to be learned similarly. If the relations between tasks are given, this regularizer can become a graph Laplacian regularizer on the task similarity graph, so that the relations guide the information flow between tasks. Apart from regularizing the parameters directly, another way to force soft parameter sharing is to adjust the parameters through the loss; after optimization, the learned parameters then also utilize information from each other. Luo et al. (2017) initialize the CNN for the target tasks in the target domain by a CNN pre-trained on the source tasks in the source domain. During training, an adversarial loss, calculated from representations at multiple layers of the CNNs, forces the two CNNs to project samples into a task-invariant space. They additionally leverage unlabeled data from the target task as data augmentation.
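The pairwise-difference regularizer of Yan et al. (2015) can be sketched as an L2 penalty over all task-parameter pairs, added to the tasks' losses during training (the trade-off weight `lam` is a hypothetical hyper-parameter):

```python
import numpy as np

def soft_sharing_penalty(params, lam=0.1):
    """Pairwise L2 penalty that encourages the parameters of different
    tasks to stay close, yielding similar hypotheses.

    A sketch of the regularizer; `params` is a list of per-task
    parameter vectors of equal shape.
    """
    total = 0.0
    for i in range(len(params)):
        for j in range(i + 1, len(params)):
            total += np.sum((params[i] - params[j]) ** 2)
    return lam * total

pen = soft_sharing_penalty([np.array([0.0]), np.array([2.0])], lam=0.1)
```

With given task relations, the all-pairs sum would be replaced by a graph Laplacian term weighted by task similarity.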
Multitask learning constrains the hypothesis learned for each task by the set of jointly learned tasks. By sharing parameters explicitly or implicitly among tasks, the tasks together eliminate infeasible regions of the hypothesis space. Hard parameter sharing is suitable for multiple similar tasks, such as classification for different categories: a shared hypothesis space captures the commonality while each task builds its specific hypothesis space on top of it, and sharing this way can be enforced easily. In contrast, soft parameter sharing only encourages similar hypotheses, which is a more flexible way to constrain the hypothesis space, but enforcing the similarity constraint needs careful design.
4.2. Embedding Learning
Embedding learning (Spivak, 1970; Jia et al., 2014) methods learn to embed samples into a smaller embedding space where similar and dissimilar pairs can be easily identified. The embedding function is mainly learned from prior knowledge, and can additionally use the few-shot data to bring in task-specific information.
Embedding learning methods have the following key components: a function that embeds the test sample to be predicted into the embedding space, a function that embeds the training examples into the same space, and a similarity measure defined on that space. Note that the test sample and the training examples are often embedded differently: the training examples can be embedded without considering the test sample, while the test sample usually needs to be embedded depending on information from the training set so as to adjust what to compare (Bertinetto et al., 2016; Vinyals et al., 2016). Prediction is then done by assigning the test sample to the class of the most similar training example. Usually, a set of auxiliary data sets is used for learning the embedding; these data sets can be either many-shot or few-shot.
Table 4 presents the details of existing embedding learning methods in terms of the two embedding functions and the similarity measure, and an illustration of the embedding learning strategy is shown in Figure 7. Next, according to what information is encoded in the embedding, we classify these methods into task-invariant (in other words, general), task-specific, and a combination of the two.
| method | test-sample embedding | training-example embedding | similarity measure | embedding type |
| --- | --- | --- | --- | --- |
| class relevance pseudo-metric (Fink, 2005) | kernel | kernel | squared distance | invariant |
| mAP-DLM/SSVM (Triantafillou et al., 2017) | CNN | CNN | cosine similarity/mAP | specific |
| convolutional siamese net (Koch, 2015) | CNN | CNN | weighted distance | invariant |
| Micro-Set (Tang et al., 2010) | logistic projection | logistic projection | distance | combined |
| Learnet (Bertinetto et al., 2016) | adaptive CNN | adaptive CNN | weighted distance | combined |
| DyConNet (Zhao et al., 2018) | adaptive CNN | - | - | combined |
| R2-D2 (Bertinetto et al., 2019) | adaptive CNN | - | - | combined |
| Matching Nets (Vinyals et al., 2016) | CNN, then LSTM with attention | CNN, biLSTM | cosine similarity | combined |
| resLSTM (Altae-Tran et al., 2017) | GCN, then LSTM with attention | GCN, then LSTM with attention | cosine similarity | combined |
| Active MN (Bachman et al., 2017) | CNN | biLSTM | cosine similarity | combined |
| ProtoNet (Snell et al., 2017) | CNN | CNN | squared distance | combined |
| semi-supervised ProtoNet (Ren et al., 2018) | CNN | CNN | squared distance | combined |
| PMN (Wang et al., 2018) | CNN, then LSTM with attention | CNN, then biLSTM | cosine similarity | combined |
| TADAM (Oreshkin et al., 2018) | CNN | CNN | squared distance | combined |
| ARC (Shyam et al., 2017) | RNN with attention, then biLSTM | - | - | combined |
| Relation Net (Sung et al., 2018) | CNN | CNN | - | combined |
| GNN (Satorras and Estrach, 2018) | CNN, then GNN | - | learned distance | combined |
| TPN (Liu et al., 2019) | CNN | - | Gaussian similarity | combined |
| SNAIL (Mishra et al., 2018) | CNN with attention | - | - | combined |
4.2.1. Task-Specific Embedding
Task-specific embedding methods learn an embedding function tailored to the task at hand. Given the few shots, the sample complexity issue is largely reduced by enumerating all pairwise comparisons between the examples as input pairs; a model is then learned to verify whether an input pair has the same label or not. In this way, each original example is included in multiple input pairs, enriching the supervised information in the training experience. Triantafillou et al. (2017) construct a ranking list for each example, where those of the same class rank higher and others lower, and learn an embedding that maintains these ranking lists in the embedding space via a ranking loss.
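The pair-enumeration step above can be sketched directly: N labeled examples yield N(N-1)/2 verification pairs labeled same (1) or different (0):

```python
from itertools import combinations

def make_pairs(samples, labels):
    """Turn N labeled few-shot examples into N*(N-1)/2 verification
    pairs, each labeled same (1) or different (0).

    A sketch of how pairwise comparison enriches the supervision.
    """
    pairs = []
    for (xa, ya), (xb, yb) in combinations(zip(samples, labels), 2):
        pairs.append(((xa, xb), int(ya == yb)))
    return pairs

pairs = make_pairs(["a1", "a2", "b1"], [0, 0, 1])  # 3 pairs, 1 "same"
```

The verification model is then trained on these pairs instead of the raw (and far fewer) labeled examples.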
4.2.2. Task-Invariant Embedding
Task-invariant embedding methods learn the embedding function from a large set of data sets that does not include the current task. The assumption is that if many data sets are well separated in the embedding space, the embedding can be general enough to work well on the new task without retraining; the learned embedding is therefore task-invariant. Fink (2005) proposes the first embedding method for FSL: it learns a kernel space from auxiliary data sets, embeds both training and test samples into it, and assigns the test sample the class of its nearest neighbor. A recent deep model, the convolutional siamese net (Koch, 2015), learns twin convolutional neural networks that embed sample pairs from a large set of data sets into a common embedding space. It also constructs input pairs from the original samples and reformulates the classification task as a verification/matching task, which verifies whether the embeddings of an input pair belong to the same class or not. This idea has been used in many embedding learning papers, such as (Vinyals et al., 2016; Bertinetto et al., 2016), to reduce the sample complexity.
4.2.3. Combine Task-invariant and Task-specific
Task-specific embedding methods learn the embedding solely based on the task at hand, while task-invariant embedding methods can rapidly generalize to new tasks without re-training. A trend is to combine the best of both: learning to adapt the generic task-invariant embedding space learned from prior knowledge using the task-specific information contained in the few-shot data. Tang et al. (2010) first propose to optimize over a distribution of FSL tasks, under the name micro-sets. They learn a logistic projection as the embedding from these FSL tasks; given a new task, the test sample is then classified by a nearest neighbor classifier in the embedding space.
Recent works mainly use meta-learning methods to combine the task-invariant knowledge shared across tasks with the specialty of each task. For these methods, the auxiliary data sets are the meta-training data sets, and the new task is one of the meta-testing tasks. We group them by their core ideas and highlight the representative works.
Learnet (Bertinetto et al., 2016) improves upon the convolutional siamese net (Koch, 2015) by incorporating the specialty of each task. It learns a meta-learner that maps an exemplar to the parameters of each layer of the convolutional siamese net (the learner). However, the meta-learner would need a huge number of parameters to capture this mapping. To reduce the computation cost, Bertinetto et al. (2016) factorize the weight matrices of each layer of the siamese net, which in turn reduces the parameter space of the meta-learner. To further reduce the number of learner parameters, Zhao et al. (2018) pre-train a large set of basis filters, so that the meta-learner only needs to map the exemplars to combination weights that linearly combine those basis filters for the learner. A recent work (Bertinetto et al., 2019) replaces the final classification layer of Learnet by ridge regression: the meta-learner now learns both the conditional convolutional siamese net and the hyper-parameters of the ridge regression, and each learner just uses its embedded few-shot data to compute the ridge regression parameters in closed form. Note that Learnet performs pairwise matching to decide whether the provided sample pairs are from the same class, as in (Koch, 2015). In contrast, both (Zhao et al., 2018) and (Bertinetto et al., 2019) classify samples directly, which makes prediction more efficient, but the model needs to be re-trained if the number of classes changes.
Matching Nets (Vinyals et al., 2016) assign the test sample to the most similar training example, where the two are embedded by different functions. The meta-learner learns the parameters of both embedding functions from the meta-training sets, and the learner is a nearest neighbor classifier; once learned, the meta-learner can be applied to a new task's data set and nearest neighbor search performed directly. Specifically, task information is utilized by the so-called fully conditional embedding (FCE), where the test sample is embedded by an LSTM on top of a CNN with attention over all training examples, and the training examples are embedded by a bi-directional LSTM on top of a CNN. However, the bi-directional LSTM implicitly enforces an order among the training examples: due to the vanishing gradient problem, nearby examples have larger influence on each other. To remove this unnatural order, Altae-Tran et al. (2017) replace the biLSTM with an LSTM with attention, and further iteratively refine both embeddings to encode contextual information. As they deal with molecular structures, a GCN rather than a CNN is used to extract sample features beforehand. An active learning variant (Bachman et al., 2017) extends matching nets with a sample selection stage, which labels the most beneficial unlabeled sample and adds it to the training set.
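The read-out of Matching Nets can be sketched as softmax attention over the support embeddings, producing a label distribution as an attention-weighted sum of one-hot labels (the embeddings themselves are assumed to come from the learned embedding functions, which are omitted here):

```python
import numpy as np

def matching_predict(z_test, z_support, y_support, n_classes):
    """Cosine-similarity attention over support embeddings, then a
    label distribution as the attention-weighted sum of one-hot labels.

    A sketch of the Matching Nets read-out with pre-computed embeddings.
    """
    cos = (z_support @ z_test) / (
        np.linalg.norm(z_support, axis=1) * np.linalg.norm(z_test) + 1e-8)
    att = np.exp(cos) / np.exp(cos).sum()   # softmax attention
    onehot = np.eye(n_classes)[y_support]
    return att @ onehot                     # class probabilities

probs = matching_predict(np.array([1.0, 0.0]),
                         np.array([[1.0, 0.0], [0.0, 1.0]]),
                         np.array([0, 1]), n_classes=2)
```

Because every support example contributes through its attention weight, no extra classifier parameters need to be fit on the few-shot data.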
Prototypical Networks (ProtoNet) (Snell et al., 2017) assign the test sample to the most similar class prototype, hence only one comparison is needed between the test sample and each class. The prototype of a class is defined as the mean of the embeddings of that class's training examples. In this way, ProtoNet does not suffer from the class-imbalance problem, but it can only capture the mean while the variance information is dropped. A semi-supervised variant (Ren et al., 2018) learns to soft-assign related unlabeled samples to augment the training set during learning. ProtoNet embeds both the test sample and the training examples with the same CNN, ignoring the specialty of different tasks, while the LSTM with attention used in matching nets makes it hard to attend to rare classes. A combination of the best of matching nets and ProtoNet is proposed in (Wang et al., 2018): it embeds the test sample as in matching nets, while computing class prototypes to feed to the LSTM with attention; during nearest neighbor search, the comparison is also done against the prototypes to reduce the computation cost. Also considering task-dependent information, Oreshkin et al. (2018) average the class prototypes into a task embedding, which is then mapped to the scaling and bias parameters of the CNN used in ProtoNet.
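The prototype computation and nearest-prototype classification can be sketched on pre-computed embeddings (the CNN encoder is omitted):

```python
import numpy as np

def proto_predict(z_test, z_support, y_support):
    """Classify by squared distance to class prototypes, each prototype
    being the mean of that class's support embeddings, as in ProtoNet.
    """
    classes = np.unique(y_support)
    protos = np.stack([z_support[y_support == c].mean(axis=0)
                       for c in classes])
    d = ((protos - z_test) ** 2).sum(axis=1)   # squared distances
    return classes[d.argmin()]

pred = proto_predict(np.array([0.1, 0.0]),
                     np.array([[0.0, 0.0], [0.2, 0.0],
                               [5.0, 5.0], [5.2, 5.0]]),
                     np.array([0, 0, 1, 1]))
```

One comparison per class, rather than per support example, is what keeps the read-out cheap and balanced across classes.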
Relative representations further embed the test sample and each class representation computed from the training set jointly, and map the result directly to a similarity score, as in classification. This idea was independently developed in attentive recurrent comparators (ARC) (Shyam et al., 2017) and relation net (Sung et al., 2018). ARC uses an RNN with attention to recurrently compare different regions of the test sample and each class prototype to produce the relative representation, and additionally uses a biLSTM to embed the information of other comparisons into the final embedding. Relation net first uses a CNN to embed the test sample and the training examples, simply concatenates the embeddings as the relative representation, and outputs the similarity score through another CNN.
Relation graphs replace the ranking lists learned among samples (Triantafillou et al., 2017). Such a graph is constructed using samples from both the training set and the test set as nodes, with edges determined by a learned similarity function. Building the relation graph requires transductive learning, where the test samples are provided during training; the label of a test sample is then predicted using neighborhood information. A GCN is used in (Satorras and Estrach, 2018) to learn the relation graph between the training examples and a test example, and the resultant node embedding is used to predict the label. In contrast, Liu et al. (2019) meta-learn an embedding function that maps both training and test samples into the embedding space, build a relation graph there, and label the test samples by closed-form label propagation rules.
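Label propagation on such a graph can be sketched as iterating a diffusion step on a Gaussian-similarity affinity matrix (an iterative approximation in the spirit of, but not identical to, the closed-form rule of Liu et al., 2019; `sigma`, `alpha` and the iteration count are hypothetical hyper-parameters):

```python
import numpy as np

def propagate(z, y_partial, sigma=1.0, alpha=0.5, iters=20):
    """Propagate labels over a Gaussian-similarity graph built on the
    embeddings of support + query samples.

    y_partial: one-hot rows for labeled nodes, zero rows for unlabeled
    query nodes. A transductive sketch: query samples are in the graph.
    """
    d = ((z[:, None] - z[None]) ** 2).sum(axis=-1)
    w = np.exp(-d / (2 * sigma ** 2))
    np.fill_diagonal(w, 0)
    s = w / w.sum(axis=1, keepdims=True)   # row-normalized affinities
    f = y_partial.copy()
    for _ in range(iters):
        f = alpha * s @ f + (1 - alpha) * y_partial
    return f.argmax(axis=1)

z = np.array([[0.0], [0.1], [5.0], [5.1]])
y_partial = np.zeros((4, 2))
y_partial[0, 0] = 1.0   # node 0 labeled class 0
y_partial[2, 1] = 1.0   # node 2 labeled class 1
labels = propagate(z, y_partial)
```

Nodes near a labeled node inherit its label through the graph, which is how the query samples are classified without fitting extra parameters.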
SNAIL (Mishra et al., 2018) designs a special embedding network consisting of interleaved temporal convolution layers and attention layers. The temporal convolutions aggregate information from past time steps, while the attention selectively attends to specific time steps relevant to the current input. Within each task, the network takes the labeled examples sequentially and predicts for each new input immediately; the network's parameters are then optimized across tasks.
Task-specific embedding fully considers the domain knowledge of the task. However, as the given few shots are biased, they may not be proper representatives of the class, and modeling ranking lists among them carries a high risk of overfitting; the resulting model may then not work well. Besides, an embedding learned this way cannot generalize to new tasks or be adapted easily.
Learning task-invariant embeddings means using a pre-trained general embedding on the new task without re-training, so the computation cost for new FSL tasks is low. However, the learned embedding function does not consider any task-specific knowledge. Common tasks obey common rules, but the very reason a task has only a few shots is often that it is special, so directly applying a task-invariant embedding function may not be suitable.
Combining the efficiency of task-invariant embedding methods with the task specialty exploited by task-specific embedding methods is usually done via meta-learning. Meta-learning models a general task distribution and captures the generic information across tasks, then quickly specializes to different tasks through the learner. The learner usually performs nearest neighbor search to classify the test sample; this nonparametric model suits few-shot learning as it does not need to fit extra parameters on the few-shot data. One weakness is that meta-learning methods usually assume the tasks are similar, yet there is no scrutinizing step to guarantee this. How to generalize to a new but unrelated task without negative transfer, and how to keep unrelated tasks from contaminating the meta-learner, remain open questions.
4.3. Learning with External Memory
External memory, such as the neural Turing machine (Graves et al., 2014) and memory networks (Weston et al., 2014; Sukhbaatar et al., 2015), allows short-term memorization and rule-based manipulation (Graves et al., 2014). Note that learning is a process of mapping the useful information of training samples into model parameters; given new data, the model has to be re-trained to incorporate its information, which is costly. Instead, learning with external memory directly memorizes the needed knowledge in an external memory, to be retrieved or updated, hence relieving the burden of learning and allowing fast generalization. Formally, the memory consists of a number of memory slots. Given a sample, it is first embedded as a query, which then attends to each memory slot through some similarity measure, e.g., cosine similarity. The similarities determine which knowledge is extracted from the memory, and the prediction is based on the extracted content.
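The content-based read described above can be sketched as cosine-similarity attention over memory keys, returning a weighted combination of the stored values (a generic sketch, not a specific method's memory):

```python
import numpy as np

def read_memory(query, keys, values):
    """Content-based memory read: cosine-similarity attention over
    memory keys, returning a weighted combination of the values.
    """
    cos = (keys @ query) / (
        np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
    att = np.exp(cos) / np.exp(cos).sum()   # softmax over slots
    return att @ values

out = read_memory(np.array([1.0, 0.0]),
                  keys=np.eye(2),
                  values=np.array([[1.0, 0.0], [0.0, 1.0]]))
```

The subsequent model then predicts from this read-out instead of from the raw sample, which is how stored few-shot knowledge enters the prediction without re-training.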
For FSL, the training data is limited and re-training the model is infeasible. Learning with external memory helps by storing knowledge extracted from the few-shot data into an external memory. The embedding function learned from prior knowledge is not re-trained, hence the initial hypothesis space is not changed. When a new sample comes, relevant contents are extracted from the memory and combined into a local approximation for this sample; in other words, the memory re-interprets the sample using the stored few-shot information. The approximation is then fed to the subsequent model for prediction, which is also unchanged. As the few-shot data is stored in the memory, the task-specific information is effectively used: samples are refined and re-interpreted by the memory contents, consequently reshaping the hypothesis space.
| method | memory contents | similarity measure |
| --- | --- | --- |
| MANN (Santoro et al., 2016) |  | cosine similarity |
| abstraction memory (Xu et al., 2017) |  | dot product |
| CMN (Zhu and Yang, 2018) | and age | dot product |
| MN-Net (Cai et al., 2018) |  | dot product |
| life-long memory (Kaiser et al., 2017) | and age | cosine similarity |
| MetaNet (Munkhdalai and Yu, 2017) | fast weight | cosine similarity |
| CSNs (Munkhdalai et al., 2018) | fast weight | cosine similarity |
| APL (Ramalho and Garnelo, 2019) |  | squared distance |
Usually, when the memory is not full, new samples can be written to vacant memory slots. When the memory is full, one has to decide which memory slots to update or replace according to some designed rule, e.g., favoring life-long learning or selectively updating the memory. According to the preferences revealed in their update rules, existing works can be separated into the following groups.
Update the least recently used memory slot. The earliest work to use memory for the FSL problem, memory-augmented neural networks (MANN) (Santoro et al., 2016), designs its memory based on the neural Turing machine (NTM) (Graves et al., 2014) with a modified addressing mechanism: it wipes out the least recently used memory slot for the new sample when the memory is full. As the image-label binding is shuffled across tasks, MANN cares more about mapping samples of the same class to the same label; in turn, samples of the same class together refine their class representation kept in the memory. MANN is a meta-learning method, where the embedding is meta-learned across tasks and the memory is wiped out at the start of each task.
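The least-recently-used replacement rule can be sketched as a fixed-size memory that timestamps each access and overwrites the slot with the oldest timestamp when full (a simplification of MANN's actual addressing, which is differentiable):

```python
import numpy as np

class LRUMemory:
    """Fixed-size key memory that overwrites the least recently used
    slot when full. A sketch of the replacement rule only.
    """
    def __init__(self, n_slots, dim):
        self.keys = np.zeros((n_slots, dim))
        self.used = np.zeros(n_slots)   # last-access timestamps
        self.t = 0
        self.size = 0

    def write(self, key):
        self.t += 1
        if self.size < len(self.keys):
            slot = self.size            # fill vacant slots first
            self.size += 1
        else:
            slot = int(self.used.argmin())   # least recently used
        self.keys[slot] = key
        self.used[slot] = self.t
        return slot

    def read(self, query):
        slot = int((self.keys[:self.size] @ query).argmax())
        self.t += 1
        self.used[slot] = self.t        # reading refreshes the slot
        return self.keys[slot]

mem = LRUMemory(n_slots=2, dim=2)
mem.write(np.array([1.0, 0.0]))   # slot 0
mem.write(np.array([0.0, 1.0]))   # slot 1
mem.read(np.array([1.0, 0.0]))    # refreshes slot 0
slot = mem.write(np.array([0.5, 0.5]))  # evicts slot 1, the LRU slot
```

Frequently read slots survive, so contents that keep being useful for the current task are protected from eviction.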
Update by location-based addressing of the neural Turing machine (Graves et al., 2014). Some works use location-based addressing, which updates all memory slots by back-propagated gradients. Xu et al. (2017) propose a new high-level key-value memory (Miller et al., 2016) called abstraction memory, which concentrates useful information from a large fixed external memory for the current FSL task. The large fixed external memory contains large-scale auxiliary data, where image features extracted by a pre-trained CNN are used as keys and word embeddings of the labels are used as values. In each task, the meta-learner first extracts relevant contents from the large fixed memory, then further embeds and puts them into the abstraction memory. The output of the abstraction memory is then used for prediction.
Update according to the age of memory slots. Life-long memory (Kaiser et al., 2017) is a memory module that protects rare classes by designing an update rule which prefers new samples to update memory slots of the same class. It assigns an age to each memory slot; each time a memory slot is attended to, its age increases by one. Memory slots with old age thus usually contain rare events, and by preferring to update memory slots of the same class, rare events are protected. In contrast, CMN (Zhu and Yang, 2018) updates the oldest memory slot when the memory is full: as it only puts the few-shot data in memory, each class occupies a comparable number of memory slots, and the oldest one most likely carries information that is no longer useful. CMN extends the idea of abstraction memory to video classification. An abstraction memory still helps scrutinize and summarize relevant information, but CMN removes the large fixed external memory and updates the memory by designed rules rather than gradient descent. It puts the original embedding matrix of each video into an extra field of the abstraction memory; the embedding matrix is obtained by embedding the multiple frames of the video with multiple saliency descriptors for different genres, and is then compressed into a single embedding used as the key, while the original embedding is kept for later memory updates. A test video is embedded the same way into a single query embedding, a nearest neighbor is searched among the keys of the abstraction memory, and the corresponding value is extracted as the predicted label.
Only update the memory when the loss is high. The surprise-based memory module (Ramalho and Garnelo, 2019) designs an update rule that only writes a sample to the memory when its prediction loss is above a threshold. The computation cost is therefore reduced compared with a fully differentiable memory, and the memory contains minimal but diverse information sufficient for equivalent prediction.
Use the memory as storage only and wipe it out across tasks. MetaNet (Munkhdalai and Yu, 2017) uses memory to contain model parameters. It meta-learns two weight-generating models, producing task-level and sample-level fast weights, together with an embedding model and a classification model conditioned on the fast weights. A memory is used to contain the sample-level fast weights of the training samples, from which relevant fast weights are extracted for a test sample. In this way, for each test sample, its embedding and classifier incorporate task- and sample-specific information into the general ones. MetaNet outputs one fast weight which is repeatedly applied to selected layers of a CNN. However, as in Learnet (Bertinetto et al., 2016), learning to produce the parameters of a layer as a whole is computationally expensive. To solve this problem, Munkhdalai et al. (2018) instead learn fast weights that change the activation value of each neuron. Since the number of neurons is much smaller than the number of parameters of a layer, computational efficiency is obtained.
Aggregate the new information into the most similar slot. Memory Matching Networks (Cai et al., 2018) merges the information of a new sample into its most similar memory slot. It replaces direct classification by a matching procedure as in matching nets (Vinyals et al., 2016). Recall that both MANN and abstraction memory networks directly perform N-way K-shot classification through a softmax loss, hence they can only deal with fixed N and K. However, this setting is not natural; at least during inference, these settings should be allowed to change to show generality. In (Cai et al., 2018), the memory still contains all contextual information of the training set. However, instead of directly predicting for a test sample, the memory is used to refine the embeddings of the training samples and to parameterize a CNN as in Learnet (Bertinetto et al., 2016). A test sample is then embedded by this conditional CNN and matched with the refined training embeddings by nearest neighbor search.
Adapting to new tasks can be done by simply putting the training samples into the memory, hence fast generalization comes easily. Besides, preferences such as lifelong learning or reduction of memory updates can be incorporated into the design of memory updating and accessing rules. However, this relies on human knowledge to design a desired rule, and the existing works have no clear winner. How to design or choose update rules automatically according to different settings is an important open issue.
4.4. Generative Model
Generative modeling methods here refer to methods that involve learning a probability distribution p(x|y). They use both prior knowledge and D_train to obtain the estimated distribution. The prior knowledge usually takes the form of learned prior probability distributions over the parameters of p(x|y), learned from a set of auxiliary data sets. Usually, each auxiliary data set is large, and the few-shot classes are not among its classes. Finally, the probability distribution is updated using D_train for prediction.
Concretely, the posterior, i.e., the probability of class y given a sample x, is computed by Bayes' rule as

p(y|x) = p(x|y) p(y) / p(x).
This can be expanded by parametrization as

p(y|x) ∝ p(x|y; θ) p(y),

where θ is the parameter of p(x|y). If D_train is large enough, we can use it to learn a well-peaked posterior over θ, and obtain θ by maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation. However, in FSL tasks D_train has limited samples, which is not enough to learn θ. Consequently, a good p(x|y) cannot be learned.
Generative models for FSL assume the prior over θ is transferable across different tasks (e.g., classes). Hence the prior p(θ) can instead be learned from a large set of auxiliary data sets. In detail, p(θ) expands as

p(θ) = p(θ; γ),

which has its own parameter γ. Since γ is the parameter of the parameter θ, we call it a hyper-parameter. Then we can obtain p(x|y) by adapting the distribution p(θ; γ) or learning the hyper-parameter γ using D_train.
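Putting the pieces together, the generative-model view can be summarized in one chain. The symbols below are our own reconstruction (sample x, class y, parameter θ, hyper-parameter γ), since the original notation was introduced earlier in the survey:

```latex
% Bayes' rule for classification
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}
% likelihood parameterized by \theta, with a prior governed by \gamma
p(x \mid y) = \int p(x \mid y, \theta)\, p(\theta; \gamma)\, d\theta
% \gamma is estimated from the large auxiliary data sets, and only the
% posterior over \theta (or \gamma itself) is adapted using D_train
```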
By learning the prior probability from prior knowledge, the shape of p(x|y) is restricted. According to how θ is defined and shared across tasks, we classify existing methods into part and relation, super class, and latent variable.
| category | method | prior from auxiliary data sets | how to use |
| --- | --- | --- | --- |
| Part and Relation | Bayesian One-Shot (Fei-Fei et al., 2006) | parts and relations | learn hyper-parameters |
| Part and Relation | BPL (Lake et al., 2015) | parts and relations | fine-tune part of the parameters |
| Super Class | HB (Salakhutdinov et al., 2012) | a hierarchy of classes | treat D_train as one of the auxiliary data sets |
| Super Class | HDP-DBM (Torralba et al., 2011) | a hierarchy of classes | treat D_train as one of the auxiliary data sets |
| Latent Variable | Neural Statistician (Edwards and Storkey, 2017) | latent variables | D_train as input |
| Latent Variable | VERSA (Gordon et al., 2019) | latent variables | D_train as input |
| Latent Variable | SeqGen (Rezende et al., 2016) | latent variables | D_train as input |
| Latent Variable | GMN (Bartunov and Vetrov, 2018) | latent variables | D_train as input |
| Latent Variable | MetaGAN (Zhang et al., 2018a) | latent variables | D_train as input |
| Latent Variable | Attention PixelCNN (Reed et al., 2018) | latent variables | D_train as input |
4.4.1. Part and Relation
This strategy learns parts and relations from a large set of auxiliary data sets as prior knowledge. Although the few-shot classes have few samples at the granularity defining their labels, at a finer granularity, parts such as shapes and appearances exist in many classes. For example, visually, animals of different classes are simply combinations of different colors, shapes and organs. Although one particular class has few shots, its color can be found in many classes. The ways to relate these parts into samples are also limited and reused across classes. With many more samples to learn from, learning parts and relations is much easier. For a test sample, the model needs to infer the correct combination of related parts and relations, then decide which target class this combination belongs to. Therefore, instead of using D_train directly, D_train is used to learn the few hyper-parameters (the parameters of the posterior over θ) (Fei-Fei et al., 2006) or to adapt part of θ (Lake et al., 2015).
The two famous one-shot learning works, Bayesian One-Shot (Fei-Fei et al., 2006) and BPL (Lake et al., 2015), fall into this category. Bayesian One-Shot leverages shapes and appearances of objects to help recognize objects, while BPL separates a character into type, token and further template, part and primitives to model characters. As the inference procedure is costly, a handful of parts is used in Bayesian One-Shot, which largely reduces the combinatorial space of parts and relations, while only the five most probable combinations are considered in BPL.
4.4.2. Super Class
Part-and-relation methods model smaller parts of samples, while super-class methods cluster similar classes by unsupervised learning. Considering each task to be classification of one class, this strategy finds the best hyper-parameters that parameterize the priors for these super classes as prior knowledge. A new class is first assigned to a super class, and its parameter is then found by adapting the super class's prior.
In (Salakhutdinov et al., 2012), they learn to form a hierarchy of classes using the auxiliary data sets together with D_train, whose structure is learned using MCMC. In this way, similar classes together contribute to learning a precise general prior representing super classes, and in return each super class can provide guidance to its assigned classes, especially for classes with few shots. The feature learning part of (Salakhutdinov et al., 2012) is further improved in (Torralba et al., 2011) by incorporating deep Boltzmann machines to learn more complicated image features.
4.4.3. Latent Variable
Separating samples into parts and relations is handcrafted, and relies heavily on human expertise. Instead, this strategy models latent variables with no explicit meaning shared across classes. Without decomposition, the prior learned from the auxiliary data sets no longer needs to be adjusted, thus the computation cost for the new task is largely reduced. In order to handle more complicated distributions, the models used in this strategy are usually deep models.
Consider the density estimation problem, where a probability distribution is approximated using training samples, i.e., the auxiliary data sets. Rezende et al. (2016) propose to model it using a set of sequentially inferred latent variables. By repeatedly attending to different regions of each sample and analyzing the capability of the current model to provide feedback, the proposed sequential generative models can perform density estimation well. To generate a new sample for a class, one just needs to feed in one sample of that class as support. Improving upon it, an auto-regressive model is proposed in (Reed et al., 2018), which decomposes the density estimation of an image pixel-wise. It sequentially generates each pixel conditioned on both the already generated pixels and related information acquired from a memory storing the support set. To generate a new sample, the whole support set can be put in the memory, and the relevant information can be pinpointed by attention. Edwards and Storkey (2017) use a latent variable to model the common generative process of classes, and learn a set of latent variables to model the relations between samples and the context. Besides the inference network which infers these latent variables, they learn from the auxiliary data sets an inference network that maps a data set to the parameter of its generative distribution. Besides generation, this inference network can do FSL classification: it maps each class's samples and the test sample to their distributions, and classifies by nearest neighbor search, using KL divergence to measure the distance between distributions. Also learning an inference network by amortized variational inference, Gordon et al. (2019) learn to map D_train to the parameter of a variational distribution which approximates the predictive posterior distribution over the output. In this way, the uncertainty of the prediction for each test sample can be quantified.
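The KL-based nearest neighbor classification just described for the Neural Statistician can be sketched as follows. For illustration only, we assume each class and the test sample are summarized by diagonal Gaussian latents; the closed-form KL between diagonal Gaussians is standard, but the Gaussian assumption is ours, not necessarily that of Edwards and Storkey (2017).

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def classify_by_kl(test_latent, class_latents):
    """Pick the class whose latent distribution is closest to the test's in KL."""
    mu_t, var_t = test_latent
    dists = {c: kl_diag_gauss(mu_t, var_t, mu_c, var_c)
             for c, (mu_c, var_c) in class_latents.items()}
    return min(dists, key=dists.get)
```

Because KL divergence is asymmetric, the direction of the comparison (test towards class, or class towards test) is itself a design choice.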
The aforementioned works use large-scale auxiliary data sets. However, directly using FSL tasks at training can force the model to absorb useful information from limited resources, thus people tend to use a collection of training tasks where each is itself few-shot. With fewer data in each training task, meta-learning is used to learn a good generic parameter shared across tasks. Matching networks is extended to generative modeling in (Bartunov and Vetrov, 2018). Considering the generation task, it first embeds the support samples and a randomly generated latent variable into an embedding space, and uses another embedding function to map them to a second embedding space. The similarity between the latent variable and each support sample is calculated and used to linearly combine the support embeddings in the new embedding space. This resultant embedding and the latent variable are then fed to a decoder network to generate a new sample. Finally, an imperfect GAN is jointly learned with discriminative models such as relation networks (Sung et al., 2018) for the classification task (Zhang et al., 2018a). The imperfect GAN is learned to augment each D_train with fake data, which is similar to examples in D_train but slightly different. As the model learns to discriminate between real samples and the slightly different fake data, the learned decision boundary can be sharper. The learned model is then applied to the test samples.
Learning each class by decomposing objects into smaller parts and relations leverages human knowledge to perform the decomposition. In contrast to the other types of generative modeling methods discussed, using parts and relations is more interpretable. However, human knowledge carries strong bias, which may not suit the given data set. Besides, it can be hard or expensive to obtain, hence putting a strict restriction on application scenarios.
In contrast, considering each task to be classification of one class, learning a super class can aggregate information from many related classes and act as a general prior for the new class. It can be a good initialization for the new class, but may not be optimal since no class-specific information is utilized.
Finally, when a latent variable is used as the information shared among tasks, no re-training is needed to generalize to new tasks. In contrast to the above-mentioned types, it is more efficient and needs less human knowledge. However, as the exact meaning of the latent variable is unknown, the model is harder to understand.
In sum, all methods in this section design models based on prior knowledge from past experience so as to constrain the complexity of the hypothesis space and reduce the sample complexity.
Multitask learning constrains the learned parameters through regularization by a set of jointly learned tasks. When these tasks are highly related, they can guide each other in complement and prevent overfitting. In contrast to other methods which extract useful generic information (e.g., models, priors) beforehand, multitask learning can communicate between different tasks and improve them along the optimization process. In other words, these tasks are learned with full consideration of generic information and task-specific knowledge in a dynamic way. Besides constraining the hypothesis space and consequently reducing sample complexity, multitask learning also implicitly augments data, as some of the parameters are learned jointly by many tasks. Therefore, it is possible to learn a large number of parameters without synthesizing samples based on human expertise. However, the target task must be one of the jointly trained tasks. Hence for each new task, one has to learn from scratch, which can be costly and slow. It is not suitable for tasks which only have one shot or prefer fast inference. Finding related auxiliary tasks for a given few-shot task can be another issue.
Embedding learning methods embed samples from the input space into a smaller embedding space, where similar and dissimilar pairs can be easily identified. These embedding functions are usually deep models, which can approximate complicated functions (Goodfellow et al., 2016). Most works learn task-invariant information from large-scale data sets, and can take in the specialty of new tasks. Besides, many embedding learning methods construct input pairs from the original samples of D_train, and reformulate the classification task as a verification/matching task which verifies whether the embeddings of an input pair belong to the same class or not. As each original sample can be included in many input pairs, the supervised training data is enlarged; in other words, the sample complexity is reduced. Once learned, most methods can easily generalize to new tasks by a forward pass and a nearest neighbor search among the embedded samples. But training the embedding function usually needs a large number of tasks. Besides, the model parameters memorize both invariant and task-specific information, and how the two should be mixed is unclear.
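The pair-construction trick mentioned above is why verification-style training enlarges the supervised signal: n labeled samples yield on the order of n squared labeled pairs. A minimal sketch:

```python
from itertools import combinations

def make_pairs(samples):
    """Turn a small labeled set [(x, y), ...] into verification pairs.

    Each pair is labeled 1 if the two samples share a class, else 0.
    """
    pairs = []
    for (x1, y1), (x2, y2) in combinations(samples, 2):
        pairs.append(((x1, x2), int(y1 == y2)))
    return pairs
```

For example, a 2-way 3-shot task has only 6 labeled samples, but yields 15 labeled pairs (6 same-class, 9 different-class) to train the verification model on.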
Learning with an external memory refines and re-interprets each sample by the contents stored in memory, consequently reshaping the hypothesis. By explicitly storing D_train in memory, it avoids laborious re-training when adapting to new tasks. By refining each sample with the memory, the original hypothesis is reshaped by the task-specific information of D_train. This task-specific information is effectively used and not easily forgotten. However, learning with an external memory incurs additional space and computational cost, which grows with an enlarged memory. Therefore, current external memories have limited size, and consequently cannot memorize much information.
Generative models for FSL learn prior probabilities from prior knowledge, hence shaping the form of the hypothesis. They offer good interpretability, causality and compositionality (Lake et al., 2015). By learning the joint distribution, they can deal with broader types of tasks such as generation and view reconstruction. They understand the data in a thorough way, as they learn in an analysis-by-synthesis manner. The learned generative model can generate many samples of the new class to perform data augmentation, hence alleviating the FSL problem. Besides, it can be used to inject new design preferences so as to produce novel variants of samples, such as new fonts and styles. However, generative models typically have high computational cost and are difficult to derive compared with other categories. To make them computationally feasible, severe simplifications of the structure are required, which leads to inaccurate approximations. Besides, training these generative models to obtain the prior knowledge usually needs a lot of data.
An algorithm is a strategy to search the hypothesis space for the parameter of the best hypothesis that fits D_train. For example, gradient descent and its variants are one popular type of search strategy. Let ℓ(θ; x_i, y_i) denote the composition of the prediction function and the loss function, which measures the loss incurred by parameter θ with respect to the i-th sample (x_i, y_i). Using gradient descent, θ is updated through a sequence of iterations. At the t-th iteration, θ is updated by

θ_t = θ_{t-1} − α_t ∇_{θ_{t-1}} ℓ(θ_{t-1}; x_t, y_t),
where α_t is the step size, which has to be hand-tuned. In standard settings, there are enough training data to update θ to arbitrary precision and to find an appropriate step size through cross-validation. However, with the few-shot D_train this is not possible, as the required sample complexity is not met and the empirical risk minimizer is unreliable.
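As a concrete instance of the update rule above, here is plain gradient descent on a least-squares loss, with the step size hand-tuned as noted (the linear model and loss are our illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, steps=200):
    """Minimize the mean squared error of a linear model by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ theta - y
        grad = 2.0 * X.T @ residual / len(y)  # gradient of the mean squared error
        theta = theta - alpha * grad          # theta_t = theta_{t-1} - alpha * grad
    return theta
```

With few samples (fewer rows of X than columns), many parameter vectors fit D_train exactly and the empirical risk minimizer found this way is unreliable, which is exactly the FSL difficulty described above.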
Methods in this section do not restrict the shape of the hypothesis space, so common models can still be used. Instead, they take advantage of prior knowledge to alter the search for the best hypothesis within the general hypothesis space, so as to solve the FSL problem. In terms of how the search strategy is affected by prior knowledge, we classify methods in this section into three kinds (Table 7):
Refining existing parameters. An initial parameter learned from other tasks is used to initialize the search, and is then refined by D_train.
Refining meta-learned parameters. A meta-learner is learned from a set of tasks drawn from the same task distribution as the few-shot task to output a general parameter; each learner then refines the parameter provided by the meta-learner using D_train.
Learning search steps. Similarly to the refining-meta-learned strategy, meta-learning is used. This strategy learns a meta-learner to output search steps or update rules that guide each learner directly. Instead of learning a better initialization, it alters the search steps, such as direction or step size.
| strategy | prior knowledge | how to search in the hypothesis space |
| --- | --- | --- |
| refining existing parameters | parameters learned as initialization | refine by D_train |
| refining meta-learned parameters | meta-learner learned from a task distribution | refine the meta-learned parameters by D_train |
| learning search steps | meta-learner learned from a task distribution | search steps provided by the meta-learner |
5.1. Refining Existing Parameters
This strategy takes the parameter of a pre-trained model as a good initialization, and adapts it to D_train. The assumption is that the pre-trained parameter captures general structures by learning from large-scale data, hence it can be adapted with a few iterations to work well on D_train.
5.1.1. Fine-tune with regularization
This strategy fine-tunes the given parameter with some regularization. Fine-tuning is popularly used in practice; it adapts the parameters of a (deep) model trained on large-scale data such as ImageNet to smaller data sets through back-propagation (Donahue et al., 2014). The single parameter set which contains the generic knowledge usually belongs to a deep model, hence contains a large number of parameters. Given the few-shot D_train, simply fine-tuning by gradient descent easily leads to overfitting. How to adapt the parameters without overfitting to the limited D_train is the key design issue. An illustration of this strategy is shown in Figure 10.
Early stopping is used when fine-tuning a model trained on multiple users to a new user (Arik et al., 2018). However, it requires a validation set separate from D_train to monitor training, which further reduces the number of samples available for training, and using a small validation set makes the search strategy highly biased. Therefore, other works design explicit regularization. Keshari et al. (2018) fix the filters of a CNN to those pre-trained on large-scale data by dictionary learning, and only learn a filter strength, a scalar controlling the magnitude of all elements within a filter, from D_train. With fewer free parameters, D_train is enough to train this CNN without overfitting. Yoo et al. (2018) reduce the effort of searching the parameter space by clustering redundant parameters. They use auxiliary data (not D_train) to group the filters of a pre-trained CNN, and fine-tune the CNN by group-wise back-propagation using D_train. Also considering a pre-trained CNN, Qi et al. (2018) directly add a weight for each new class in D_train as a new column in the weight matrix of the final layer, set to a scaled version of the mean embedding of the class, while leaving the pre-trained weights unchanged. Without training, this CNN classifies the new classes well, and can be improved by slight fine-tuning.
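The common thread of these regularizers is to shrink the number of free parameters so that few shots suffice. A hedged sketch of one such scheme: freeze a pre-trained feature extractor entirely and train only a small softmax head on D_train (the shapes, learning rate, and iteration count below are illustrative choices, not taken from any of the cited papers):

```python
import numpy as np

def finetune_head(features, labels, num_classes, lr=0.5, steps=300):
    """Train only a softmax head on frozen pre-trained features.

    Freezing the feature extractor leaves few free parameters,
    so a few-shot D_train can fit them without overfitting.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = features.T @ (probs - onehot) / n      # cross-entropy gradient
        W -= lr * grad
    return W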
Another series of works implicitly regularizes the parameters when refining them. Model regression networks are proposed in (Wang and Hebert, 2016b), assuming there exists a generic, class-agnostic and task-agnostic transformation from a parameter trained with few shots to a parameter trained with enough data. Hence a network is trained from a large collection of such parameter pairs to capture this transformation. Wang and Hebert (2016b) then use it to refine the parameter learned on a fixed-way, fixed-shot problem. Likewise, one can assume there exists a generic transformation from the embedding of a class's samples to its classifier parameter. Kozerawski and Turk (2018) learn to transform the embedding of a sample into a classification decision boundary, while Qiao et al. (2018) map embeddings to new columns in the weight matrix of the final layer of a CNN to classify samples of the new class.
5.1.2. Pick and combine a set of existing parameters
Usually we do not have a single suitable parameter set to fine-tune. Instead, we may have many model parameters learned from related tasks; for example, the task is face recognition while only eye, nose and ear recognition models are available. Therefore, one can pick the relevant ones from a set of existing parameters and combine them into a suitable initialization to be adapted by D_train. An illustration of this strategy is shown in Figure 11.
This set of parameters is usually pre-trained from other data sources. Bart and Ullman (2005) consider classification by image fragments for a new class with one image. The classifier for the new class is built by replacing features from already learned classes with similar features taken from the novel class, reusing their classifier parameters. Only the classification threshold is adjusted to avoid confusion with those similar classes. Similar to (Qi et al., 2018) and (Qiao et al., 2018), a pre-trained CNN is adapted to deal with a new class in (Gidaris and Komodakis, 2018). But instead of solely using the embedding of the new class as its classifier parameter, which is highly biased, it leverages the already learned classes of the CNN. Recall that the columns of the weight matrix of the final layer of a CNN correspond to classifiers; it adds a column for the new class which is a linear combination of the embedding of the new class and a classifier built from already learned knowledge by attending to the other classes' classifier parameters (columns in the weight matrix of the final layer of the pre-trained CNN). The linear combination weights are then learned using D_train. The aforementioned works use labeled data sets. In fact, parameters learned from unlabeled data can also be discriminative enough to separate samples. Given unlabeled data, pseudo-labels are iteratively adjusted so as to learn decision boundaries that split the data (Wang and Hebert, 2016a). A binary feature is learned from these decision boundaries, where each dimension of the binary feature marks the side of a decision boundary that a sample lies on. These decision boundaries trained on unlabeled data are incorporated into a pre-trained CNN in (Wang and Hebert, 2016a). It adds a special layer in front of the fully connected layer for classification, fixes the rest of the pre-trained parameters, and learns new parameters which can separate the samples well. Note that in a pre-trained CNN, the captured embedding transits from generic to specific layer by layer; by learning to separate the samples, the generality of the embedding in the last layers is improved. To classify the few-shot samples, one only needs to learn the linear layer for final classification and reuse the rest of the CNN.
5.1.3. Fine-tune with new parameters
The pre-trained parameters may not suit the structure of the new FSL task. An illustration of this strategy is shown in Figure 12.
For example, the pre-trained model is trained for image classification while the new task is fine-grained image classification of desks. In other words, the pre-trained parameters contain coarse information that can be helpful to the current task, but this general information may not be discriminative enough. Hence we also need additional parameters for the specialty of the new task. Therefore, this strategy fine-tunes the pre-trained parameters while learning new ones, so that the model parameter is now the union of the two. Hoffman et al. (2013) use the parameters of the lower layers of a pre-trained CNN for feature embedding, and learn a linear classifier on top of it using D_train. Considering a font style transfer task, Azadi et al. (2018) pre-train a network to capture the fonts of gray-scale images, and fine-tune it together with the training of a network for generating stylish colored fonts.
Methods discussed in this section reduce the effort of doing architecture search from scratch. Existing parameters are trained on large-scale data sets, hence they contain generic information that can be helpful to the current FSL task. Since direct fine-tuning can easily overfit, fine-tuning with regularization turns to regularizers or modifications based on the existing parameters. However, these designs are heuristic, relying on human knowledge. They usually consider a single set of deep model parameters, but suitable existing parameters are not always easy to find. In contrast, a set of parameters from related tasks can be aggregated into a suitable initialization. Some works even refine parameters learned from unsupervised tasks, which learn from abundant and cheap unlabeled data. However, one must make sure that the knowledge embedded in these existing parameters is useful to the current task. Besides, one usually has to search over a large set of existing parameters to find the relevant ones, which incurs a huge computation cost. Fine-tuning with new parameters has more flexibility, as it can additionally add parameters according to the FSL task. However, given the few-shot D_train, one can only add a limited number of parameters, otherwise the sample complexity remains high and overfitting may occur.
5.2. Refining Meta-learned Parameters
Methods in the following sections are all meta-learning methods, whose definition is given in Section 4.2.3. Instead of working towards the unreliable empirical risk minimizer, this perspective directly targets the parameter of a good task-specific hypothesis. In the following, we denote the parameter of the meta-learner as θ, and the task-specific parameter for task s as φ_s. Iteratively, the meta-learner (optimizer), parameterized by θ, provides information about parameter updates to the parameter φ_s of task s's learner (optimizee), and the learner returns error signals to the meta-learner to improve it.
In this section, the meta-learner is used to provide a good initialization for each task-specific parameter, while in the next section the meta-learner is learned to directly output search steps for each learner. In contrast to refining existing parameters, here the initialization is learned from a set of tasks drawn from a task distribution. An illustration of this strategy is shown in Figure 13.
5.2.1. Refining by gradient descent
This strategy refines the meta-learned initialization by gradient descent.
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is a representative method of this kind. It meta-learns θ as a good initialization, which can be adjusted effectively through a few gradient descent steps on task s's training set to obtain a good task-specific φ_s. Mathematically, this is done by φ_s = θ − α ∇_θ L_s(θ), where L_s is the training loss of task s and α is a fixed step size to be chosen. Since the loss sums over all samples, the update is permutation-invariant with respect to the training set. Then, the meta-learner updates θ through the gradient averaged across all tasks, θ ← θ − β ∇_θ Σ_s L_s(φ_s), where β is also a fixed step size. Grant et al. (2018) re-interpret MAML as approximate inference in a hierarchical Bayesian model. It connects MAML to maximum a posteriori (MAP) estimation, with the prior induced by taking gradient descent steps on each task's training set.
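The two-level update above can be sketched on toy quadratic losses L_s(w) = (w − a_s)**2, where each task s is defined by a target a_s. Closed-form gradients stand in for back-propagation, and the second-order term is dropped, a common first-order simplification rather than Finn et al.'s exact algorithm:

```python
def maml_first_order(task_targets, alpha=0.1, beta=0.05, meta_steps=500):
    """First-order MAML on toy losses L_s(w) = (w - a_s)**2.

    Inner loop: one gradient step from the shared initialization theta.
    Outer loop: move theta along the averaged post-update gradients.
    """
    theta = 0.0
    for _ in range(meta_steps):
        meta_grad = 0.0
        for a in task_targets:
            grad_inner = 2.0 * (theta - a)     # d L_s / d w evaluated at theta
            phi = theta - alpha * grad_inner   # task-specific parameter phi_s
            meta_grad += 2.0 * (phi - a)       # post-update gradient (first-order)
        theta -= beta * meta_grad / len(task_targets)
    return theta
```

On this toy family the learned initialization converges towards the mean of the task targets, the point from which each task is reachable with the fewest inner gradient steps.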
MAML provides the same initialization for all tasks, neglecting task-specific information. This only suits a set of very similar tasks, and works badly when tasks are distinct. Lee and Choi (2018) instead learn to choose a subset of the meta-learned parameters as the initialization for each task. In other words, they meta-learn a task-specific subspace and metric in which the learner performs gradient descent. Therefore, different initializations are provided for different tasks.
Refining the initialization simply by gradient descent may not be reliable, so regularization is used to correct the descent direction biased by the few-shot data. The refined parameter is further adapted by a model regression network (Wang and Hebert, 2016b) in (Gui et al., 2018). As in the original paper, this model regression network captures a general transformation from a model trained with few shots to a model trained with many shots. Therefore, the adapted parameter is regularized to be closer to a model trained with many shots. The parameter of the model regression network itself is learned by gradient descent as well.
5.2.2. Refining in consideration of uncertainty
Learning with few shots inevitably results in a model with high uncertainty (Finn et al., 2018). Can the learned model predict for the new task with high confidence? Will the model improve with more samples? The ability to measure this uncertainty provides a signal for active learning or further data collection.
There are three kinds of uncertainty considered so far.
Uncertainty over the shared parameter θ. A single θ may not act as a good initialization for all tasks. Therefore, by modeling the posterior distribution of θ, one can sample appropriate initializations