Boosting Few-Shot Visual Learning with Self-Supervision
Few-shot learning and self-supervised learning address different facets of the same problem: how to train a model with little or no labeled data. Few-shot learning aims for optimization methods and models that can learn efficiently to recognize patterns in the low data regime. Self-supervised learning focuses instead on unlabeled data and looks into it for the supervisory signal to feed high capacity deep neural networks. In this work we exploit the complementarity of these two domains and propose an approach for improving few-shot learning through self-supervision. We use self-supervision as an auxiliary task in a few-shot learning pipeline, enabling feature extractors to learn richer and more transferable visual representations while still using few annotated samples. Through self-supervision, our approach can be naturally extended towards using diverse unlabeled data from other datasets in the few-shot setting. We report consistent improvements across an array of architectures, datasets and self-supervision techniques.
1 Introduction
Deep learning based methods have achieved impressive results on various image understanding tasks, such as image classification [21, 29, 51, 53], object detection, or semantic segmentation. However, in order to successfully learn these tasks, such models need to access large volumes of manually labeled training data. If not, the trained models might suffer from poor generalization performance on the test data. In image classification for instance, learning to reliably recognize a set of categories with convolutional neural networks (convnets) requires hundreds or thousands of training examples per class. In contrast, humans are perfectly capable of learning new visual concepts from only one or few examples, generalizing without difficulty to new data. The aim of few-shot learning [9, 10, 26, 30, 54] is to endow artificial perception systems with a similar ability, especially with the help of modern deep learning.
Hence, the goal of few-shot visual learning is to devise recognition models that can efficiently learn to recognize a set of classes even when very few training examples are available for them (e.g., only 1 or 5 examples per class). In order to avoid overfitting due to data scarcity, few-shot learning algorithms rely on transfer learning techniques and have two learning stages. During a first stage, the model is usually trained using a different set of classes, called base classes, which is associated with a large set of annotated training examples. The goal of this stage is to let the few-shot model acquire transferable visual analysis abilities, typically in the form of learned representations, that are mobilized in the second stage. In this subsequent step, the model learns to recognize novel classes, unseen during the first learning stage, using only a few training examples per class.
Few-shot learning is related to self-supervised representation learning [6, 7, 14, 31, 39, 60]. The latter is a form of unsupervised learning that trains a model on an annotation-free pretext task defined using only the visual information present in images. The purpose of this self-supervised task is to make the model learn image representations that would be useful when transferred to other image understanding tasks. For instance, in the seminal work of Doersch et al., a network trained on the self-supervised task of predicting the relative location of image patches manages to learn image representations that are successfully transferred to the vision tasks of object recognition, object detection, and semantic segmentation. Therefore, as in few-shot learning, self-supervised methods also have two learning stages: the first learns image representations with a pretext self-supervised task, and the second adapts those representations to the actual task of interest. Also, both learning approaches try to limit the dependence of deep learning methods on the availability of large amounts of labeled data.
Inspired by the connection between few-shot learning and self-supervised learning, we propose to combine the two methods to improve the transfer learning abilities of few-shot models. More specifically, we propose to add a self-supervised loss to the training loss that a few-shot model minimizes during its first learning stage (see Figure 1). Hence, we artificially augment the training task(s) that a few-shot model solves during its first learning stage. We argue that this task augmentation forces the model to learn a more diversified set of image features, and this in turn improves its ability to adapt to novel classes with few training data. Moreover, since self-supervision does not require data annotations, one can include extra unlabeled data in the first learning stage. By extending the size and variety of training data in this manner, one might expect to learn richer image features and to get further performance gain in few-shot learning. At the extreme, using only unlabeled data in the first learning stage, thus removing the use of base classes altogether, is also appealing. We will show that both these semi-supervised and unsupervised regimes can indeed be put to work for few-shot recognition thanks to self-supervised tasks.
In summary, the contributions of our work are: (1) We propose to weave self-supervision into the training objective of few-shot learning algorithms. The goal is to boost the ability of the latter to adapt to novel classes with few training data. (2) We study the impact of the added self-supervised loss by performing exhaustive quantitative experiments on the MiniImageNet, CIFAR-FS, and tiered-MiniImageNet few-shot datasets. In all of them, self-supervision improves the few-shot learning performance, leading to state-of-the-art results. (3) Finally, we extend the proposed few-shot recognition framework to semi-supervised and unsupervised setups, getting further performance gain in the former, and showing with the latter that our framework can be used for evaluating and comparing self-supervised representation learning approaches on few-shot object recognition.
2 Related work
There is a broad array of few-shot learning approaches, including, among many: gradient descent-based approaches [1, 11, 38, 44], which learn how to rapidly adapt a model to a given few-shot recognition task via a small number of gradient descent iterations; metric learning based approaches that learn a distance metric between a query, i.e., test image, and a set of support images, i.e., training images, of a few-shot task [26, 52, 54, 56, 58]; methods learning to map a test example to a class label by accessing memory modules that store training examples for that task [12, 25, 34, 37, 49]; approaches that learn how to generate the weights of a classifier [13, 16, 42, 43] or of a multi-layer neural network [3, 18, 19, 57] for the new classes given the few available training data for each of them; methods that “hallucinate” additional examples of a class from a reduced amount of data [20, 56].
Self-supervised learning is a recent paradigm which is mid-way between unsupervised and supervised learning, and aims to mitigate the challenging need for large amounts of annotated data. Self-supervised learning defines an annotation-free pretext task, in order to provide a surrogate supervision signal for feature learning. Predicting the colors of images [31, 60], the relative position of image patches [6, 39], the random rotation that has been applied to an image, or the missing central part of an image, are some of the many methods [15, 32, 35, 55, 61] for self-supervised feature learning. The intuition is that, by solving such tasks, the trained model extracts semantic features that can be useful for other downstream tasks. In our case, we consider a multi-task setting where we train the backbone convnet using joint supervision from the supervised end-task and an auxiliary self-supervised pretext task. Unlike most multi-task settings, which aim at good results on all tasks simultaneously, we are interested in improving performance on the main task only, by leveraging supervision from the surrogate task, as has also been shown in prior work. We expect that, in a few-shot setting where squeezing out generalizable features from the available data is highly important, the use of self-supervision as an auxiliary task will bring improvements. Also, related to our work, Chen et al. recently added rotation prediction self-supervision to generative adversarial networks, leading to significant quality improvements of the synthesized images.
3 Methodology
As already explained, few-shot learning algorithms have two learning stages and two corresponding sets of classes. Here, we denote by $D_b = \{(\mathbf{x}, y)\}$ the training set of base classes used during the first learning stage, where $\mathbf{x}$ is an image with label $y$ in the label set $Y_b$ of size $N_b$. Also, we denote by $D_n = \{(\mathbf{x}, y)\}$ the training set of $N_n$ novel classes used during the second learning stage, where each class has $K$ samples ($K = 1$ or $5$ in benchmarks). One talks about $N_n$-way $K$-shot learning. Importantly, the label sets $Y_n$ and $Y_b$ are disjoint.
In the remainder of this section, we first describe in §3.1 the two standard few-shot learning methods that we consider and introduce in §3.2 the proposed method to boost their performance with self-supervision.
3.1 Explored few-shot learning methods
The main component of all few-shot algorithms is a feature extractor $F_\theta(\cdot)$, which is a convnet with parameters $\theta$. Given an image $\mathbf{x}$, the feature extractor will output a $d$-dimensional feature $F_\theta(\mathbf{x})$. In this work we consider two few-shot algorithms, Prototypical Networks (PN) and Cosine Classifiers (CC) [13, 42], described below. They are fairly similar, with their main difference lying in the first learning stage: only CC learns actual base classifiers along with the feature extractor, while PN simply relies on class-level averages.
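Prototypical Networks (PN).
During the first stage of this approach, the feature extractor $F_\theta$ is learned on sampled few-shot classification sub-problems that are analogous to the targeted few-shot classification problem. In each training episode of this learning stage, a subset $Y_* \subset Y_b$ of the base classes is sampled (they are called “support classes”) and, for each of them, $K$ training examples are randomly picked from within the base training set $D_b$. This yields a training set $D_*$. Given the current feature extractor $F_\theta$, the average feature for each class $j \in Y_*$, its “prototype”, is computed as
$$\mathbf{p}_j = \frac{1}{K} \sum_{\mathbf{x} \in X_*^j} F_\theta(\mathbf{x}), \quad \text{with } X_*^j = \{\mathbf{x} \mid (\mathbf{x}, y) \in D_*,\ y = j\}, \tag{1}$$
and used to build a simple similarity-based classifier. Then, given a new image $\mathbf{x}_q$ from a support class but different from samples in $D_*$, the classifier outputs for each class $j$ the normalized classification score
$$C^j(F_\theta(\mathbf{x}_q); D_*) = \mathrm{softmax}_j\big[\mathrm{sim}\big(F_\theta(\mathbf{x}_q), \mathbf{p}_i\big)_{i \in Y_*}\big], \tag{2}$$
where $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, which may be cosine similarity or negative squared Euclidean distance. So, in practice, the image $\mathbf{x}_q$ will be classified to its closest prototype. Note that the classifier is conditioned on $D_*$ in order to compute the class prototypes. The first learning stage finally amounts to iteratively minimizing the following loss w.r.t. $\theta$:
$$L_{few}(\theta; D_b) = \mathbb{E}_{D_* \sim D_b,\ (\mathbf{x}_q, y_q)}\big[-\log C^{y_q}(F_\theta(\mathbf{x}_q); D_*)\big], \tag{3}$$
where $(\mathbf{x}_q, y_q)$ is a training sample from a support class defined in $D_*$ but different from images in $D_*$.
In the second learning stage, the feature extractor $F_\theta$ is frozen and the classifier of novel classes is simply defined as $C(\cdot; D_n)$, where $D_n$ is the training set of novel classes, with prototypes defined as in (1) with $D_* = D_n$.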
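The prototype classifier described above can be illustrated with a minimal sketch. It assumes features have already been extracted (the 2-D vectors below are hand-written toy values, not outputs of any real network), uses negative squared Euclidean distance as the similarity, and relies on plain Python lists to stay dependency-free. The function names are hypothetical.

```python
import math

def prototype(features):
    """Average a list of equal-length feature vectors (one class's support set)."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def neg_sq_euclidean(a, b):
    """Similarity used in this sketch: negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pn_scores(query, support):
    """support maps class label -> list of feature vectors; returns label -> score."""
    labels = sorted(support)
    protos = [prototype(support[c]) for c in labels]
    probs = softmax([neg_sq_euclidean(query, p) for p in protos])
    return dict(zip(labels, probs))

# Toy 2-way 2-shot episode with hand-written 2-D "features" (illustrative only).
support = {"cat": [[1.0, 0.0], [0.8, 0.2]], "dog": [[0.0, 1.0], [0.2, 0.8]]}
scores = pn_scores([0.9, 0.1], support)
predicted = max(scores, key=scores.get)  # class of the closest prototype
```

Swapping `neg_sq_euclidean` for a cosine similarity gives the other variant mentioned in the text; the rest of the sketch is unchanged.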
Cosine Classifiers (CC).
In CC few-shot learning, the first stage trains the feature extractor $F_\theta$ together with a cosine-similarity based classifier on the (standard) supervised task of classifying the base classes. Denoting by $W_b = [\mathbf{w}_1, \dots, \mathbf{w}_{N_b}]$ the matrix of the $d$-dimensional classification weight vectors, one per base class, the normalized score for an input image $\mathbf{x}$ reads
$$C^j(F_\theta(\mathbf{x}); W_b) = \mathrm{softmax}_j\big[\gamma \cos\big(F_\theta(\mathbf{x}), \mathbf{w}_i\big)_{i \in Y_b}\big], \tag{4}$$
where $\cos(\cdot, \cdot)$ is the cosine operation between two vectors, $Y_b$ is the set of base class labels, and the scalar $\gamma$ is the inverse temperature parameter of the softmax operator (specifically, $\gamma$ controls the peakiness of the probability distribution generated by the softmax operator).
The first learning stage aims at minimizing w.r.t. $\theta$ and $W_b$ the negative log-likelihood loss:
$$L_{few}(\theta, W_b; D_b) = \mathbb{E}_{(\mathbf{x}, y) \sim D_b}\big[-\log C^{y}(F_\theta(\mathbf{x}); W_b)\big], \tag{5}$$
where $D_b$ is the labeled training set of base classes.
One of the reasons for using the cosine-similarity based classifier instead of the standard dot-product based one is that the former learns feature extractors that reduce intra-class variations and thus can generalize better on novel classes. By analogy with PN, the weight vectors $\mathbf{w}_j$ can be interpreted as learned prototypes for the base classes, to which input image features are compared for classification.
As with PN, the second stage boils down to computing one representative feature $\mathbf{w}_j$ for each new class by simple averaging of its associated samples in the novel training set $D_n$, and to defining the final classifier the same way as in (4).
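To make the role of the inverse temperature concrete, here is a small sketch of a cosine-similarity classifier and its negative log-likelihood. The weight vectors, the feature, and the two temperature values are arbitrary toy choices for illustration, not values from the experiments, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cc_scores(feature, weights, gamma):
    """Softmax over gamma-scaled cosine similarities to each class weight vector."""
    logits = [gamma * cosine(feature, w) for w in weights]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(scores, label):
    """Negative log-likelihood of the true class."""
    return -math.log(scores[label])

# Three toy class weight vectors and one toy feature (illustrative values).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feature = [3.0, 1.0]
soft = cc_scores(feature, weights, gamma=1.0)    # flatter distribution
sharp = cc_scores(feature, weights, gamma=10.0)  # peakier distribution
```

Because only the angle matters, rescaling `feature` (e.g. to `[6.0, 2.0]`) leaves the scores unchanged, while a larger `gamma` concentrates probability mass on the most similar class.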
3.2 Boosting few-shot learning via self-supervision
A major challenge in few-shot learning is encountered during the first stage of learning. How to make the feature extractor learn image features that can be readily exploited for novel classes with few training data during the second stage? With this goal in mind, we propose to leverage the recent progress in self-supervised feature learning to further improve current few-shot learning approaches.
Through solving a non-trivial pretext task that can be trivially supervised, such as recovering the colors of images from their intensities, a network is encouraged to learn rich and generic image features that are transferable to other downstream tasks such as image classification. In the first stage of few-shot learning, we propose to extend the training of the feature extractor by including such a self-supervised task besides the main task of recognizing base classes.
We consider two ways for incorporating self-supervision into few-shot learning algorithms: (1) by using an auxiliary loss function based on a self-supervised task, and (2) by exploiting unlabeled data in a semi-supervised way during training. In the following we will describe the two techniques.
3.2.1 Auxiliary loss based on self-supervision
We incorporate self-supervision into a few-shot learning algorithm by adding an auxiliary self-supervised loss during its first learning stage. More formally, let $L_{self}(\theta, \phi; X_b)$ be the self-supervised loss applied to the set $X_b = \{\mathbf{x} \mid (\mathbf{x}, y) \in D_b\}$ of training examples in the base training set $D_b$ deprived of their class labels. The loss is a function of the parameters $\theta$ of the feature extractor and of the parameters $\phi$ of a network only dedicated to the self-supervised task. The first training stage of few-shot learning now reads
$$\min_{\theta, [W_b], \phi}\ L_{few}(\theta, [W_b]; D_b) + \alpha\, L_{self}(\theta, \phi; X_b), \tag{6}$$
where $L_{few}$ stands either for the PN few-shot loss (3) or for the CC one (5), with the additional argument $W_b$ in the latter case (hence bracket notation). The positive hyper-parameter $\alpha$, whose value is fixed in our experiments, controls the importance of the self-supervised term. An illustration of the approach is provided in Figure 1.
For the self-supervised loss, we consider two tasks in the present work: predicting the rotation incurred by an image, which is simple and readily incorporated into a few-shot learning algorithm; and predicting the relative location of two patches from the same image, a seminal task in self-supervised learning. In a recent study, both methods have been shown to achieve state-of-the-art results.
In the rotation prediction task, the convnet must recognize which of the four possible 2D rotations in $R = \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$ has been applied to an image (see Figure 1). Specifically, given an image $\mathbf{x}$, we first create its four rotated copies $\{\mathbf{x}^r \mid r \in R\}$, where $\mathbf{x}^r$ is the image $\mathbf{x}$ rotated by $r$ degrees. Based on the features $F_\theta(\mathbf{x}^r)$ extracted from such a rotated image, the new network $R_\phi$ attempts to predict the rotation class $r$. Accordingly, the self-supervised loss of this task is defined as:
$$L_{self}(\theta, \phi; X) = \mathbb{E}_{\mathbf{x} \sim X}\Big[\sum_{r \in R} -\log R_\phi^r\big(F_\theta(\mathbf{x}^r)\big)\Big], \tag{7}$$
where $X$ is the original training set of non-rotated images and $R_\phi^r(\cdot)$ is the predicted normalized score for rotation $r$. Intuitively, in order to do well on this task the model should reduce the bias towards up-right oriented images, typical for ImageNet-like datasets, and learn more diverse features to disentangle classes in the low-data regime.
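The rotation pretext task can be sketched as follows, under stated assumptions: images are tiny 2-D lists of pixel values, the four rotations are multiples of 90 degrees, and `predict` is a hypothetical stand-in for the rotation head applied on top of the feature extractor (here a uniform guesser, so the loss equals its chance-level value).

```python
import math

def rot90(img):
    """Rotate a 2-D list-of-rows image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def rotated_copies(img):
    """Return [(rotated image, rotation class r)] for r = 0..3 (0/90/180/270 degrees)."""
    copies, current = [], [list(row) for row in img]
    for r in range(4):
        copies.append((current, r))
        current = rot90(current)
    return copies

def rotation_loss(images, predict):
    """Mean over images of the summed negative log-likelihoods of the 4 rotations.

    `predict` maps an image to a list of 4 probabilities (hypothetical model stub).
    """
    total = 0.0
    for img in images:
        for rotated, r in rotated_copies(img):
            total += -math.log(predict(rotated)[r])
    return total / len(images)

def uniform_predictor(_img):
    """A model that has learned nothing: equal probability for every rotation."""
    return [0.25, 0.25, 0.25, 0.25]

image = [[1, 2], [3, 4]]
copies = rotated_copies(image)
chance_loss = rotation_loss([image], uniform_predictor)  # equals 4 * log(4)
```

A trained rotation head should drive this loss well below the chance-level value computed here.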
Relative patch location.
Here, we create random pairs of patches from an image and then predict, among eight possible positions, the location of the second patch w.r.t. the first, e.g., “on the left and above” or “on the right and below”. Specifically, given an image $\mathbf{x}$, we first divide it into $9$ regions over a $3 \times 3$ grid and sample a patch within each region. Let us denote by $\bar{\mathbf{x}}$ the central image patch, and by $\mathbf{x}^0, \dots, \mathbf{x}^7$ its eight neighbors lexicographically ordered. We compute the representation of each patch (if the architecture of the feature extractor $F_\theta$ is fully convolutional, we can apply it to both big images and smaller patches) and then generate patch feature pairs $\big(F_\theta(\bar{\mathbf{x}}), F_\theta(\mathbf{x}^p)\big)$ by concatenation. We train a fully-connected network $P_\phi(\cdot, \cdot)$ to predict the position of $\mathbf{x}^p$ from each pair.
The self-supervised loss of this task is defined as:
$$L_{self}(\theta, \phi; X) = \mathbb{E}_{\mathbf{x} \sim X}\Big[\sum_{p=0}^{7} -\log P_\phi^p\big(F_\theta(\bar{\mathbf{x}}), F_\theta(\mathbf{x}^p)\big)\Big], \tag{8}$$
where $X$ is a set of images and $P_\phi^p$ is the predicted normalized score for the relative location $p$. Intuitively, a good model on this task should somewhat recognize objects and parts, even in the presence of occlusions and background clutter. Note that, in order to prevent the model from learning low-level image statistics such as chromatic aberration, the patches are preprocessed with aggressive color augmentation (i.e., randomly converting to grayscale with a given probability and normalizing the pixels of each patch individually to have zero mean and unit standard deviation).
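The construction of training pairs for the relative patch location task can be sketched as below. This is a simplification under stated assumptions: each whole grid region is used as the patch (no random sampling within the region, no color augmentation), images are 2-D lists, and the function names are hypothetical.

```python
def grid_patches(img, grid=3):
    """Split a 2-D list-of-rows image into grid x grid equal regions, row-major order.

    Simplification: each whole region serves as the patch.
    """
    h, w = len(img) // grid, len(img[0]) // grid
    return [
        [row[c * w:(c + 1) * w] for row in img[r * h:(r + 1) * h]]
        for r in range(grid)
        for c in range(grid)
    ]

def location_pairs(img):
    """Return [(central patch, neighbor patch, position label p)] for p = 0..7."""
    patches = grid_patches(img)
    center = patches[4]                    # middle cell of the 3x3 grid
    neighbors = patches[:4] + patches[5:]  # the eight others, lexicographic order
    return [(center, nb, p) for p, nb in enumerate(neighbors)]

# A 6x6 toy image whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(6)]
pairs = location_pairs(image)
```

Each of the eight pairs would then be passed through the feature extractor, the two features concatenated, and the position label `p` used as the classification target.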
3.2.2 Semi-supervised few-shot learning
The self-supervised term in the training loss (6) does not depend on class labels. We can easily extend it to learn as well from additional unlabeled data. Indeed, if a set $X_u$ of unlabeled images is available besides the labeled base set $D_b$, we can make the self-supervised task benefit from them by redefining the first learning stage as:
$$\min_{\theta, [W_b], \phi}\ L_{few}(\theta, [W_b]; D_b) + \alpha\, L_{self}(\theta, \phi; X_b \cup X_u), \tag{9}$$
where $X_b$ denotes the images of $D_b$ deprived of their labels.
By training the feature extractor to also minimize the self-supervised loss on these extra unlabeled images, we open up its visual scope with the hope that this will further improve its ability to accommodate novel classes with scarce data. This can be seen as a semi-supervised training approach for few-shot algorithms. An interesting aspect of this semi-supervised training approach is that it does not require the extra unlabeled data to be from the same (base) classes as those in the labeled dataset $D_b$. Thus, it is much more flexible w.r.t. the source of the unlabeled data than standard semi-supervised approaches.
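How the two terms see different data in the supervised and semi-supervised regimes can be sketched as follows. Everything here is a toy assumption: `few_shot_loss` and `self_loss` are hypothetical callables standing in for the real losses, each "image" is just a number playing the role of its per-image self-supervised loss, and the weight `alpha=0.5` is an arbitrary illustrative value, not the one used in the experiments.

```python
def first_stage_loss(few_shot_loss, self_loss, labeled, alpha, unlabeled=()):
    """Few-shot loss on labeled pairs plus alpha times the self-supervised loss.

    labeled: list of (image, label); unlabeled: iterable of images.
    The self-supervised term ignores labels, so it covers both sources.
    """
    images = [img for img, _ in labeled] + list(unlabeled)
    return few_shot_loss(labeled) + alpha * self_loss(images)

# Toy stand-ins (assumptions for illustration only).
def toy_few_shot_loss(batch):
    return 1.0  # pretend the supervised loss on the labeled batch is 1.0

def toy_self_loss(images):
    return sum(images) / len(images)  # mean of per-image toy losses

labeled = [(0.2, "a"), (0.4, "b")]
unlabeled = [0.9, 0.5]

supervised_only = first_stage_loss(toy_few_shot_loss, toy_self_loss, labeled, alpha=0.5)
semi_supervised = first_stage_loss(
    toy_few_shot_loss, toy_self_loss, labeled, alpha=0.5, unlabeled=unlabeled
)
```

The only difference between the two calls is the image set passed to the self-supervised term, which is what makes the extension to unlabeled data straightforward.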
4 Experimental Results
In this section we evaluate self-supervision as auxiliary loss function in §4.2 and in §4.3 as a way of exploiting unlabeled data in semi-supervised training. Finally, in §4.4 we use the few-shot object recognition task for evaluating self-supervised methods.
We perform experiments on three few-shot datasets: MiniImageNet, tiered-MiniImageNet, and CIFAR-FS. MiniImageNet consists of classes randomly picked from the ImageNet dataset and split into base, validation, and novel test classes; all of its images share one fixed resolution. tiered-MiniImageNet likewise consists of classes randomly picked from ImageNet and split into base, validation, and novel test classes, with images of the same size as in MiniImageNet. Finally, CIFAR-FS is a few-shot dataset created by dividing the classes of CIFAR-100 into base, validation, and novel test classes; its images are thumbnails, smaller than those of the two ImageNet-based datasets.
Few-shot classification algorithms are evaluated based on the classification performance in their second learning stage (when the learned classifier is applied to test images from the novel classes). More specifically, a large number of $N_n$-way $K$-shot tasks are sampled from the available set of novel classes. Each task is created by randomly selecting $N_n$ novel classes from the available test (validation) classes and then, within the selected classes, randomly selecting $K$ training images and $M$ test images per class (making sure that train and test images do not overlap). The classification performance is measured on the test images and is averaged over all the sampled few-shot tasks. Unless otherwise stated, all experiments use fixed values of $M$ and $N_n$, and $K = 1$ or $K = 5$ (1-shot and 5-shot settings respectively).
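The evaluation protocol above can be sketched in a few lines. The assumptions are explicit: items are scalars standing in for images, the classifier is a hypothetical nearest-class-mean rule, the episode sizes are arbitrary toy values, and the interval is a normal-approximation 95% half-width, which is one common convention and not necessarily the one behind the reported numbers.

```python
import random
import statistics

def sample_episode(data, n_way, k_shot, n_query, rng):
    """data maps class -> list of items. Returns (support, query) with disjoint items."""
    classes = rng.sample(sorted(data), n_way)
    support, query = {}, []
    for c in classes:
        picked = rng.sample(data[c], k_shot + n_query)  # sampled without replacement
        support[c] = picked[:k_shot]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query

def evaluate(data, classify, n_tasks, n_way, k_shot, n_query, seed=0):
    """Mean accuracy over sampled tasks and a normal-approximation 95% half-width."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_tasks):
        support, query = sample_episode(data, n_way, k_shot, n_query, rng)
        correct = sum(classify(x, support) == y for x, y in query)
        accuracies.append(correct / len(query))
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, half_width

def nearest_mean(x, support):
    """Toy classifier: pick the class whose support mean is closest to x."""
    return min(support, key=lambda c: abs(x - sum(support[c]) / len(support[c])))

# Eight well-separated toy classes of scalar "images" (illustrative data).
data = {c: [10.0 * c + 0.1 * i for i in range(20)] for c in range(8)}
mean_acc, half_width = evaluate(data, nearest_mean, n_tasks=50, n_way=5, k_shot=1, n_query=3)
```

Averaging over many sampled tasks is what makes the reported accuracy robust to the particular choice of classes and support images in any single episode.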
4.1 Implementation details
We conduct experiments with three different feature extractor architectures: Conv-4-64, Conv-4-512, and WRN-28-10. Conv-4-64 consists of 4 convolutional blocks, each implemented with a convolutional layer followed by BatchNorm + ReLU + max-pooling units. All blocks of Conv-4-64 have 64 feature channels. The final feature map is flattened into the output feature vector. Conv-4-512 is derived from Conv-4-64 by gradually increasing its width across its 4 convolutional blocks, up to 512 feature channels in the last block; its output feature vector is therefore larger after flattening. Finally, WRN-28-10 is a 28-layer Wide Residual Network with width factor 10. Its output feature map is reduced to a single feature vector by global average pooling.
The network specific to the rotation prediction task gets as input the output feature maps of the feature extractor and is implemented as a convnet. Given two patches, the network specific to the relative patch location task gets as input the concatenation of their feature vectors, extracted with the feature extractor, and forwards it to two fully connected layers.
For further architecture details see Appendix B.1.
Training optimization routine for first learning stage.
The training loss is optimized with mini-batch stochastic gradient descent (SGD). For the labeled data we apply both recognition and self-supervised losses. For the semi-supervised training, at each step we sample mini-batches that consist of labeled data, for which we use both losses, and unlabeled data, in which case we apply only the self-supervised loss. Learning rates, number of epochs, and batch sizes were cross-validated on the validation sets. For further implementation details see Appendices B.2 and B.3.
Implementation of relative patch location task.
Due to the aggressive color augmentation of the patches in the patch localization task, and the fact that the patches are around 9 times smaller than the original images, the data distribution that the feature extractor “sees” from them is very different from that of the images. To overcome this problem we apply an extra auxiliary classification loss to the features extracted from the patches. Specifically, during the first learning stage of CC we merge the features of the 9 patches of an image (e.g., with concatenation or averaging) and then apply the cosine classifier (4) to the resulting feature (this classifier does not share its weight vectors with the classifier applied to the original image features). Note that this patch based auxiliary classification loss has the same weight as the original classification loss. Also, during the second learning stage we do not use the patch based classifier.
4.2 Self-supervision as auxiliary loss function
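Table 1: Rotation prediction as auxiliary loss on MiniImageNet (few-shot classification accuracy).
| Model | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| CC | Conv-4-64 | 54.31 ± 0.42% | 70.89 ± 0.34% |
| CC+rot | Conv-4-64 | 54.83 ± 0.43% | 71.86 ± 0.33% |
| CC | Conv-4-512 | 55.68 ± 0.43% | 73.19 ± 0.33% |
| CC+rot | Conv-4-512 | 56.27 ± 0.43% | 74.30 ± 0.33% |
| CC | WRN-28-10 | 61.09 ± 0.44% | 78.43 ± 0.33% |
| CC+rot | WRN-28-10 | 62.93 ± 0.45% | 79.87 ± 0.33% |
| PN | Conv-4-64 | 52.20 ± 0.46% | 69.98 ± 0.36% |
| PN+rot | Conv-4-64 | 53.63 ± 0.43% | 71.70 ± 0.36% |
| PN | Conv-4-512 | 54.60 ± 0.46% | 71.59 ± 0.36% |
| PN+rot | Conv-4-512 | 56.02 ± 0.46% | 74.00 ± 0.35% |
| PN | WRN-28-10 | 55.85 ± 0.48% | 68.72 ± 0.36% |
| PN+rot | WRN-28-10 | 58.28 ± 0.49% | 72.13 ± 0.38% |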
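Table 2: Rotation prediction as auxiliary loss on CIFAR-FS (few-shot classification accuracy).
| Model | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| CC | Conv-4-64 | 61.80 ± 0.30% | 78.02 ± 0.24% |
| CC+rot | Conv-4-64 | 63.45 ± 0.31% | 79.79 ± 0.24% |
| CC | Conv-4-512 | 65.26 ± 0.31% | 81.14 ± 0.23% |
| CC+rot | Conv-4-512 | 65.87 ± 0.30% | 81.92 ± 0.23% |
| CC | WRN-28-10 | 71.83 ± 0.31% | 84.63 ± 0.23% |
| CC+rot | WRN-28-10 | 73.62 ± 0.31% | 86.05 ± 0.22% |
| PN | Conv-4-64 | 62.82 ± 0.32% | 79.59 ± 0.24% |
| PN+rot | Conv-4-64 | 64.69 ± 0.32% | 80.82 ± 0.24% |
| PN | Conv-4-512 | 66.48 ± 0.32% | 80.28 ± 0.23% |
| PN+rot | Conv-4-512 | 67.94 ± 0.31% | 82.20 ± 0.23% |
| PN | WRN-28-10 | 68.35 ± 0.34% | 81.79 ± 0.23% |
| PN+rot | WRN-28-10 | 68.60 ± 0.34% | 81.25 ± 0.24% |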
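Table 3: Relative patch location as auxiliary loss on MiniImageNet (few-shot classification accuracy).
| Model | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| CC | Conv-4-64 | 53.72 ± 0.42% | 70.96 ± 0.33% |
| CC+loc | Conv-4-64 | 54.30 ± 0.42% | 71.58 ± 0.33% |
| CC | Conv-4-512 | 55.58 ± 0.42% | 73.52 ± 0.33% |
| CC+loc | Conv-4-512 | 56.87 ± 0.42% | 74.84 ± 0.33% |
| CC | WRN-28-10 | 58.43 ± 0.46% | 75.45 ± 0.34% |
| CC+loc | WRN-28-10 | 60.71 ± 0.46% | 77.64 ± 0.34% |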
Rotation prediction as auxiliary loss function.
We first study the impact of adding rotation prediction as self-supervision to the few-shot learning algorithms of Cosine Classifiers (CC) and Prototypical Networks (PN). We perform this study using the MiniImageNet and CIFAR-FS datasets and report results in Tables 1 and 2 respectively. For the CC case, we use as strong baselines CC models without self-supervision but trained to recognize all 4 rotated versions of an image. The reason for using this baseline is that during the first learning stage, the model “sees” the same type of data, i.e., rotated images, as the model with rotation prediction self-supervision. Note that despite the rotation augmentations of the first learning stage, during the second stage the model uses as training examples for the novel classes only the up-right versions of the images. Nevertheless, using rotation augmentations improves the classification performance of the baseline models when adapted to the novel classes. Therefore, for fair comparison, we also apply rotation augmentations to the CC models with rotation prediction self-supervision. For the PN case, we do not use rotation augmentation since in our experiments this led to performance degradation.
Relative patch location prediction as auxiliary loss function.
As explained in §3.2.1, we consider a second self-supervised task, namely relative patch location prediction. For simplicity, we restrict its assessment to the CC few-shot algorithm, which in our experiments proved to perform better than PN and to be simpler to train. Also, for this study we consider only the MiniImageNet dataset and not CIFAR-FS, since the latter contains thumbnail images from which it does not make sense to extract patches: the patches would be too small for any of the evaluated architectures. We report results on MiniImageNet in Table 3. As a strong baseline we used CC models without self-supervision but with the auxiliary patch based classification loss described in §4.1.
Based on the results of Table 3 we observe that: (1) relative patch location also manages to improve the few-shot classification performance and, as in the rotation prediction case, the improvement is larger for high capacity network architectures. (2) Compared to the rotation prediction case, relative patch location offers a smaller performance improvement.
Comparison with prior work.
In Tables 4, 5, and 6, we compare our approach with prior few-shot methods on the MiniImageNet, CIFAR-FS, and tiered-MiniImageNet datasets respectively. For our approach we used CC with rotation prediction self-supervision, which gave the best results in the experiments above. In all cases we achieve state-of-the-art results, surpassing prior methods by a significant margin. For instance, in the 1-shot and 5-shot settings of MiniImageNet, our CC+rot model with WRN-28-10 (62.93% and 79.87%) outperforms the previous leading method LEO (61.76% and 77.59%) by about 1.2 and 2.3 percentage points respectively (see Table 4).
More detailed results are provided in Appendix A.
Table 4: Comparison with prior work on MiniImageNet (few-shot classification accuracy).
| Model | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| MAML | Conv-4-64 | 48.70 ± 1.84% | 63.10 ± 0.92% |
| Prototypical Nets | Conv-4-64 | 49.42 ± 0.78% | 68.20 ± 0.66% |
| LwoF | Conv-4-64 | 56.20 ± 0.86% | 72.81 ± 0.62% |
| RelationNet | Conv-4-64 | 50.40 ± 0.80% | 65.30 ± 0.70% |
| R2-D2 | Conv-4-64 | 48.70 ± 0.60% | 65.50 ± 0.60% |
| R2-D2 | Conv-4-512 | 51.20 ± 0.60% | 68.20 ± 0.60% |
| TADAM | ResNet-12 | 58.50 ± 0.30% | 76.70 ± 0.30% |
| Munkhdalai et al. | ResNet-12 | 57.10 ± 0.70% | 70.04 ± 0.63% |
| SNAIL | ResNet-12 | 55.71 ± 0.99% | 68.88 ± 0.92% |
| Qiao et al. | WRN-28-10 | 59.60 ± 0.41% | 73.74 ± 0.19% |
| LEO | WRN-28-10 | 61.76 ± 0.08% | 77.59 ± 0.12% |
| CC+rot | Conv-4-64 | 54.83 ± 0.43% | 71.86 ± 0.33% |
| CC+rot | Conv-4-512 | 56.27 ± 0.43% | 74.30 ± 0.34% |
| CC+rot | WRN-28-10 | 62.93 ± 0.45% | 79.87 ± 0.33% |
| CC+rot+unlabeled | WRN-28-10 | 63.77 ± 0.45% | 80.70 ± 0.33% |
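Table 5: Comparison with prior work on CIFAR-FS (few-shot classification accuracy).
| Model | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| PN | Conv-4-64 | 55.50 ± 0.70% | 72.00 ± 0.60% |
| PN | Conv-4-512 | 57.90 ± 0.80% | 76.70 ± 0.60% |
| PN (same values as in Table 2) | Conv-4-64 | 62.82 ± 0.32% | 79.59 ± 0.24% |
| PN (same values as in Table 2) | Conv-4-512 | 66.48 ± 0.32% | 80.28 ± 0.23% |
| MAML | Conv-4-64 | 58.90 ± 1.90% | 71.50 ± 1.00% |
| MAML | Conv-4-512 | 53.80 ± 1.80% | 67.60 ± 1.00% |
| RelationNet | Conv-4-64 | 55.00 ± 1.00% | 69.30 ± 0.80% |
| R2-D2 | Conv-4-64 | 60.00 ± 0.70% | 76.10 ± 0.60% |
| R2-D2 | Conv-4-512 | 64.00 ± 0.80% | 78.90 ± 0.60% |
| CC+rot | Conv-4-64 | 63.45 ± 0.31% | 79.79 ± 0.24% |
| CC+rot | Conv-4-512 | 65.87 ± 0.30% | 81.92 ± 0.23% |
| CC+rot | WRN-28-10 | 73.62 ± 0.31% | 86.05 ± 0.22% |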
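Table 6: Comparison with prior work on tiered-MiniImageNet (few-shot classification accuracy).
| Model | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| MAML | Conv-4-64 | 51.67 ± 1.81% | 70.30 ± 0.08% |
| Prototypical Nets | Conv-4-64 | 53.31 ± 0.89% | 72.69 ± 0.74% |
| RelationNet | Conv-4-64 | 54.48 ± 0.93% | 71.32 ± 0.78% |
| Liu et al. | Conv-4-64 | 57.41 ± 0.94% | 71.55 ± 0.74% |
| LEO | WRN-28-10 | 66.33 ± 0.05% | 81.44 ± 0.09% |
| CC | WRN-28-10 | 70.04 ± 0.51% | 84.14 ± 0.37% |
| CC+rot | WRN-28-10 | 70.53 ± 0.51% | 84.98 ± 0.36% |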
4.3 Semi-supervised few-shot learning
Next, we evaluate the proposed semi-supervised training approach. In these experiments we use CC models with rotation prediction self-supervision. We perform two types of semi-supervised experiments: (1) training with unlabeled data from the same base classes, and (2) training with unlabeled data that are not from the base classes.
Training with unlabeled data from the same base classes.
From the base classes of MiniImageNet, we use only a small percentage of the training images (e.g., 5% of images per class) as annotated training data, while the rest of the images (e.g., 95%) are used as the unlabeled data in the semi-supervised training. We provide results in the first two sections of Table 7. The proposed semi-supervised training approach is compared with a CC model without self-supervision and with a CC model with self-supervision but no recourse to the unlabeled data. The results demonstrate that our method indeed manages to improve few-shot classification performance by exploiting unlabeled images. Compared to Ren et al., who also propose a semi-supervised method, our method with Conv-4-64 and 20% annotations achieves better results than their method with Conv-4-64 and 40% annotations (i.e., our 51.21% and 68.89% MiniImageNet accuracies vs. their 50.41% and 64.39% for the 1-shot and 5-shot settings respectively).
Training with unlabeled data not from the base classes.
This is a more realistic setting, since it is hard to constrain the unlabeled images to be from the same classes as the base classes. For this experiment, we used as unlabeled data the training images of the tiered-MiniImageNet base classes minus the classes that are common with the base, validation, or test classes of MiniImageNet. We report results in the last row of Table 7. Even in this difficult setting, our semi-supervised approach is still able to exploit unlabeled data and improve the classification performance. Furthermore, we ran an extra experiment in which we trained a WRN-28-10 based model using 100% of MiniImageNet training images and unlabeled data from tiered-MiniImageNet. This model achieved 63.77% and 80.70% accuracies for the 1-shot and 5-shot settings respectively on MiniImageNet (see entry CC+rot+unlabeled of Table 4), which improves over the already very strong CC+rot model (see Table 4).
4.4 Few-shot object recognition to assess self-supervised representations
Given that our framework allows the easy combination of any type of self-supervised learning approach with the adopted few-shot learning algorithms, we also propose to use it as an alternative way for comparing/evaluating the effectiveness of different self-supervised approaches. To this end, the only required change to our framework is to use uniquely the self-learning loss (i.e., no labeled data is now used) in the first learning stage (for implementation details see Appendix B.2). The performance of the few-shot model resulting from the second learning stage can then be used for evaluating the self-supervised method under consideration.
Comparing competing self-supervised techniques is not straightforward since it must be done by setting up another, somewhat contrived task that exploits the learned representations [6, 28]. Instead, given the very similar goals of few-shot and self-supervised learning, we argue that the proposed comparison method could be more meaningful for assessing different self-supervised features. Furthermore, it is quite simple and fast to perform when compared to some alternatives such as fine-tuning the learned representations on the PASCAL  detection task , with the benefit of obtaining more robust statistics aggregated over evaluations of thousands of episodes with multiple different configurations of classes and training/testing samples.
To illustrate our point, we provide in Table 8 quantitative results of this type of evaluation on the MiniImageNet dataset, for the self-supervision methods of rotation prediction and relative patch location prediction. For self-supervised training we used the training images of the base classes of MiniImageNet, and for the few-shot classification step we used the test classes of MiniImageNet. We observe that the explored self-supervised approaches achieve relatively competitive classification performance when compared to the supervised method of CC, and obtain results that are on par with or better than other, more complex, unsupervised systems. We leave as future work a more detailed and thorough comparison of self-learned representations in this evaluation setting.
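The episode-based evaluation described in this section can be sketched as follows. This is a minimal illustration rather than the evaluation code behind Table 8: it assumes features have already been extracted with a frozen network, and it classifies query features by cosine similarity to class prototypes, which stands in for the second-stage few-shot model. The function names, the default episode counts, and the normal-approximation confidence interval are all assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def run_episode(features_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode and return its query accuracy."""
    classes = rng.sample(sorted(features_by_class), n_way)
    prototypes, queries = {}, []
    for c in classes:
        picked = rng.sample(features_by_class[c], k_shot + n_query)
        # The first k_shot features form the support set of class c.
        prototypes[c] = mean_vector(picked[:k_shot])
        queries += [(f, c) for f in picked[k_shot:]]
    correct = sum(
        max(classes, key=lambda c: cosine(f, prototypes[c])) == label
        for f, label in queries
    )
    return correct / len(queries)


def evaluate(features_by_class, n_way=5, k_shot=1, n_query=15,
             n_episodes=2000, seed=0):
    """Mean accuracy and 95% confidence half-width over many episodes."""
    rng = random.Random(seed)
    accs = [run_episode(features_by_class, n_way, k_shot, n_query, rng)
            for _ in range(n_episodes)]
    mean = sum(accs) / len(accs)
    var = sum((a - mean) ** 2 for a in accs) / (len(accs) - 1)
    half_width = 1.96 * math.sqrt(var / len(accs))
    return mean, half_width
```

Because each episode draws a fresh set of classes and samples, averaging over many episodes gives an accuracy estimate together with an interval, which is the kind of aggregated statistic the section argues for.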
5 Conclusions
Inspired by the close connection between few-shot and self-supervised learning, we propose to add an auxiliary loss based on self-supervision during the training of few-shot recognition models. The goal is to boost the ability of the latter to recognize novel classes using only a few training examples. Our detailed experiments on the MiniImageNet, CIFAR-FS, and tiered-MiniImageNet few-shot datasets reveal that adding self-supervision indeed leads to significant improvements in few-shot classification performance, which makes the employed few-shot models achieve state-of-the-art results. Furthermore, the annotation-free nature of the self-supervised loss allows us to exploit diverse unlabeled data in a semi-supervised manner, which further improves the classification performance. Finally, we show that the proposed framework can also be used for evaluating self-supervised or unsupervised methods based on few-shot object recognition.
References
- [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
- [2] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. In ICLR, 2019.
- [3] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
- [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. PAMI, 40(4):834–848, 2018.
- [5] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby. Self-supervised generative adversarial networks. arXiv preprint arXiv:1811.11212, 2018.
- [6] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
- [7] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.
- [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
- [9] L. Fei-Fei et al. A bayesian approach to unsupervised one-shot learning of object categories. In ICCV, 2003.
- [10] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. PAMI, 28(4):594–611, 2006.
- [11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
- [12] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
- [13] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
- [14] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
- [15] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
- [16] F. Gomez and J. Schmidhuber. Evolving modular fast-weight networks for control. In ICANN, 2005.
- [17] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- [18] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- [19] C. Han, S. Shan, M. Kan, S. Wu, and X. Chen. Face recognition with contrastive convolution. In ECCV, 2018.
- [20] B. Hariharan and R. B. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
- [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [22] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [23] K. Hsu, S. Levine, and C. Finn. Unsupervised learning via meta-learning. In ICLR, 2019.
- [24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [25] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129, 2017.
- [26] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.
- [27] I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
- [28] A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
- [29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- [30] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In An. Meeting of the Cognitive Science Society, 2011.
- [31] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
- [32] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
- [33] Y. Liu, J. Lee, M. Park, S. Kim, and Y. Yang. Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002, 2018.
- [34] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
- [35] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
- [36] T. Mordan, N. Thome, G. Henaff, and M. Cord. Revisiting multi-task learning with rock: a deep residual auxiliary block for visual detection. In NIPS, 2018.
- [37] T. Munkhdalai and H. Yu. Meta networks. arXiv preprint arXiv:1703.00837, 2017.
- [38] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
- [39] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
- [40] B. N. Oreshkin, A. Lacoste, and P. Rodriguez. Tadam: Task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123, 2018.
- [41] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- [42] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In CVPR, 2018.
- [43] S. Qiao, C. Liu, W. Shen, and A. Yuille. Few-shot image recognition by predicting parameters from activations. arXiv preprint arXiv:1706.03466, 2017.
- [44] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
- [45] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
- [46] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- [48] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
- [49] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
- [50] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
- [51] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [52] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
- [53] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [54] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
- [55] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy. Tracking emerges by colorizing videos. In ECCV, 2018.
- [56] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. arXiv preprint arXiv:1801.05401, 2018.
- [57] Y.-X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In NIPS, 2017.
- [58] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
- [59] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
- [60] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
- [61] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
Appendix A Extra experimental results
A.1 Rotation prediction self-supervision: impact of rotation augmentation
In the experiments reported in Section 4, we use rotation augmentation when training the baselines that we compare against the CC models with self-supervised rotation prediction. In Table 9 of this Appendix we also provide results without rotation augmentation, in order to examine the impact of this augmentation technique. We observe that (1) in this setting the improvements yielded by rotation prediction self-supervision are more significant, and (2) in some cases rotation augmentation actually reduces the few-shot classification performance.
A.2 Relative patch location self-supervision: impact of patch-based object classification loss
When we study the impact of adding relative patch location self-supervision to CC-based models in Section 4, we use an auxiliary patch-based object classification loss. In Table 10 we also provide results without this auxiliary loss when training CC models, in order to examine its impact. We observe that the improvement brought by this auxiliary loss is small (or non-existent) compared to the improvement brought by the relative patch location self-supervision.
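Table 9: Impact of rotation augmentation. A ✓ in the "Rot. aug." column marks models trained with rotation augmentation.
| Model | Backbone | Rot. aug. | MiniImageNet 1-shot | MiniImageNet 5-shot | CIFAR-FS 1-shot | CIFAR-FS 5-shot |
|---|---|---|---|---|---|---|
| CC | Conv-4-64 | | 53.94 ± 0.42% | 71.13 ± 0.34% | 62.83 ± 0.31% | 79.14 ± 0.24% |
| CC+rot | Conv-4-64 | | 55.41 ± 0.43% | 72.98 ± 0.33% | 63.98 ± 0.31% | 80.44 ± 0.23% |
| CC | Conv-4-64 | ✓ | 54.31 ± 0.42% | 70.89 ± 0.34% | 61.80 ± 0.30% | 78.02 ± 0.24% |
| CC+rot | Conv-4-64 | ✓ | 54.83 ± 0.43% | 71.86 ± 0.33% | 63.45 ± 0.31% | 79.79 ± 0.24% |
| CC | Conv-4-512 | | 54.51 ± 0.42% | 72.52 ± 0.34% | 65.64 ± 0.31% | 81.10 ± 0.23% |
| CC+rot | Conv-4-512 | | 56.59 ± 0.43% | 74.67 ± 0.34% | 67.00 ± 0.30% | 82.55 ± 0.23% |
| CC | Conv-4-512 | ✓ | 55.68 ± 0.43% | 73.19 ± 0.33% | 65.26 ± 0.31% | 81.14 ± 0.23% |
| CC+rot | Conv-4-512 | ✓ | 56.27 ± 0.43% | 74.30 ± 0.33% | 65.87 ± 0.30% | 81.92 ± 0.23% |
| CC | WRN-28-10 | | 58.59 ± 0.45% | 76.59 ± 0.33% | 70.43 ± 0.31% | 83.84 ± 0.23% |
| CC+rot | WRN-28-10 | | 60.10 ± 0.45% | 77.40 ± 0.33% | 72.49 ± 0.31% | 84.77 ± 0.22% |
| CC | WRN-28-10 | ✓ | 61.09 ± 0.44% | 78.43 ± 0.33% | 71.83 ± 0.31% | 84.63 ± 0.23% |
| CC+rot | WRN-28-10 | ✓ | 62.93 ± 0.45% | 79.87 ± 0.33% | 73.62 ± 0.31% | 86.05 ± 0.22% |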
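Table 10: Impact of the auxiliary patch-based object classification loss on MiniImageNet. A ✓ in the "Patch cls. loss" column marks models trained with this auxiliary loss.
| Model | Backbone | Patch cls. loss | 1-shot | 5-shot |
|---|---|---|---|---|
| CC | Conv-4-64 | | 53.63 ± 0.42% | 70.74 ± 0.34% |
| CC | Conv-4-64 | ✓ | 53.72 ± 0.42% | 70.96 ± 0.33% |
| CC+loc | Conv-4-64 | ✓ | 54.30 ± 0.42% | 71.58 ± 0.33% |
| CC | Conv-4-512 | | 54.51 ± 0.42% | 72.52 ± 0.34% |
| CC | Conv-4-512 | ✓ | 55.58 ± 0.42% | 73.52 ± 0.33% |
| CC+loc | Conv-4-512 | ✓ | 56.87 ± 0.42% | 74.84 ± 0.33% |
| CC | WRN-28-10 | | 58.59 ± 0.45% | 76.59 ± 0.33% |
| CC | WRN-28-10 | ✓ | 58.43 ± 0.46% | 75.45 ± 0.34% |
| CC+loc | WRN-28-10 | ✓ | 60.71 ± 0.46% | 77.64 ± 0.34% |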
Appendix B Additional implementation details
B.1 Network architectures
Conv-4-64.
It consists of 4 convolutional blocks, each implemented with a convolutional layer with 64 channels followed by BatchNorm + ReLU + max-pooling units. In the MiniImageNet experiments for which the image size is pixels, its output feature map has size and is flattened into a final -dimensional feature vector. For the CIFAR-FS experiments, the image size is pixels, the output feature map has size and is flattened into a -dimensional feature vector.
Conv-4-512.
It is derived from Conv-4-64 by gradually increasing its width across layers leading to , , , and feature channels for its 4 convolutional blocks respectively. Therefore, for a sized image (i.e., MiniImageNet experiments) its output feature map has size and is flattened into a final -dimensional feature vector, while for a sized image (i.e., CIFAR-FS experiments) its output feature map has size and is flattened into a final -dimensional feature vector.
WRN-28-10.
It is a Wide Residual Network with 28 convolutional layers and width factor 10. The 12 residual layers of this architecture are grouped into 3 residual blocks (4 residual layers per block). In the MiniImageNet and tiered-MiniImageNet experiments, the network gets as input images of size (rescaled from ), and during feature extraction each residual block downsamples the processed feature maps by a factor of 2. Therefore, the output feature map has size which, after global average pooling, creates a -dimensional feature vector. In the CIFAR-FS experiments, the input images have size and during feature extraction only the last two residual blocks downsample the processed feature maps. Therefore, in the CIFAR-FS experiments, the output feature map has size which again after global average pooling creates a -dimensional feature vector.
Rotation prediction network, .
This network gets as input the output feature maps of and is implemented as a convnet. More specifically, for the Conv-4-64 and Conv-4-512 feature extractor architectures (regardless of the dataset), consists of two convolutional layers with BatchNorm + ReLU units, followed by a fully connected classification layer. For Conv-4-64, those two convolutional layers have and feature channels respectively, while for Conv-4-512 both convolutional layers have feature channels. In the WRN-28-10 case, consists of a -residual-layer residual block that actually replicates the last (3rd) residual block of WRN-28-10. This residual block is followed by global average pooling plus a fully connected classification layer.
Relative patch location network, .
Given two patches, gets the concatenation of their feature vectors extracted with as input, and forwards it to two fully connected layers. The single hidden layer, which includes BatchNorm + ReLU units, has , , and channels for the Conv-4-64, Conv-4-512, and WRN-28-10 architectures respectively.
B.2 Incorporating self-supervision during training
Here we provide more implementation details regarding how we incorporate self-supervision during the first learning stage.
Training with rotation prediction self-supervision.
During training for each image of a mini-batch we create its rotated copies and apply to them the rotation prediction task (i.e., loss). When training the object classifier with rotation augmentation (e.g., CC-based models) the object classification task (i.e., loss) is applied to all rotated versions of the images. Otherwise, only the upright images (i.e., the degrees images) are used for the object classification task. Note that in the PN-based models, we apply the rotation task to both the support and the query images of a training episode, and also we do not use rotation augmentation for the object classification task.
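The batch construction described above can be sketched as follows. This is an illustrative example, not the training code: the number of rotated copies is missing from the text here, so the sketch assumes the four rotations of 0, 90, 180, and 270 degrees used by the rotation prediction method, and it represents an image as a plain 2D list of pixel values. The function names and the `rotation_augmentation` flag are hypothetical.

```python
def rot90(image):
    """Rotate a 2D image (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotated_copies(image, n_rotations=4):
    """Return the image rotated by 0, 90, 180, ... degrees (assumed: 4 copies)."""
    copies = [image]
    for _ in range(n_rotations - 1):
        copies.append(rot90(copies[-1]))
    return copies


def build_rotation_batch(images, labels, rotation_augmentation):
    """Split a mini-batch into examples for the two losses.

    rot_examples: (image, rotation_index) pairs for the rotation prediction loss.
    cls_examples: (image, class_label) pairs for the object classification loss.
    """
    rot_examples, cls_examples = [], []
    for image, label in zip(images, labels):
        for r, rotated in enumerate(rotated_copies(image)):
            # Every rotated copy contributes to the self-supervised task.
            rot_examples.append((rotated, r))
            # Only upright images are classified unless augmentation is on.
            if rotation_augmentation or r == 0:
                cls_examples.append((rotated, label))
    return rot_examples, cls_examples
```

With `rotation_augmentation=False` the classifier sees one upright image per input, while the rotation head still sees every rotated copy; with `rotation_augmentation=True` both heads see all copies.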
Training with relative patch location self-supervision.
In this case, during training, each mini-batch includes two types of visual data, images and patches. Similar to , in order to create patches, an image is: (1) resized to pixels (from ), (2) converted to grayscale with probability , and then (3) divided into regions of size with a regular grid. From each sized region we (4) randomly sample a patch, and then (5) normalize the pixels of the patch individually to have zero mean and unit standard deviation. The object classification task is applied to the image data of the mini-batch, while the relative patch location task is applied to the patch data. As already explained, we also apply an extra auxiliary object classification loss to the patch data.
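Steps (3) to (5) of the patch creation procedure can be sketched as follows. The sizes are missing from the text here, so every number in the example (a 3 × 3 grid, the region and patch sizes) is a placeholder rather than a value from the experiments; resizing and grayscale conversion are omitted, and the image is assumed to be a square, single-channel 2D list. Pairing the central patch with each of its neighbours is likewise an assumption about how the relative location task is posed.

```python
import math
import random


def normalize(patch):
    """Normalize one patch to zero mean and unit standard deviation."""
    flat = [p for row in patch for p in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((p - mean) ** 2 for p in flat) / len(flat))
    return [[(p - mean) / (std + 1e-8) for p in row] for row in patch]


def sample_patches(image, grid=3, patch_size=2, rng=None):
    """Split a square image into grid x grid regions and sample one
    randomly placed, individually normalized patch from each region."""
    rng = rng or random.Random()
    region = len(image) // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            # Random top-left corner that keeps the patch inside its region.
            y0 = gy * region + rng.randint(0, region - patch_size)
            x0 = gx * region + rng.randint(0, region - patch_size)
            pixels = [image[y][x0:x0 + patch_size]
                      for y in range(y0, y0 + patch_size)]
            patches.append(normalize(pixels))
    return patches


def relative_location_pairs(patches, grid=3):
    """Pair the central patch with every other patch; the label is the
    index of the neighbour's position (0 .. grid*grid - 2)."""
    center = (grid * grid) // 2
    pairs, label = [], 0
    for i, patch in enumerate(patches):
        if i == center:
            continue
        pairs.append((patches[center], patch, label))
        label += 1
    return pairs
```

Sampling the patch at a random offset inside its region, rather than always at the region centre, leaves a variable gap between neighbouring patches, which makes it harder to solve the task from low-level boundary cues alone.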
B.3 Training routine for first learning stage
To optimize the training loss we use mini-batch SGD optimizer with momentum and weight decay . In the MiniImageNet and CIFAR-FS experiments, we train the models for epochs (each with SGD iterations), starting with a learning rate of which is decreased by a factor of every epochs. In the tiered-MiniImageNet experiments we train for epochs (each with SGD iterations), starting with a learning rate of which is decreased by a factor of every epochs. The mini-batch sizes were cross-validated on the validation split. For instance, the models based on CC and Conv-4-64, Conv-4-512, or WRN-28-10 architectures are trained with mini-batch sizes equal to , , or respectively. Finally, we perform early stopping w.r.t. the few-shot classification accuracy on the validation novel classes (for the CC-based models we use the 1-shot classification accuracy).
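The optimization schedule described above (SGD with momentum and weight decay, with a learning rate divided by a fixed factor at regular epoch intervals) can be sketched as follows. The hyperparameter values are missing from the text here, so the function arguments are placeholders to be supplied by the reader, not the values used in the experiments. The update rule follows one common convention for momentum SGD with L2 weight decay; it is an assumption that the experiments used exactly this form.

```python
def step_learning_rate(epoch, base_lr, decay_factor, step_epochs):
    """Learning rate at a 0-indexed epoch: base_lr is divided by
    decay_factor once every step_epochs epochs."""
    return base_lr / (decay_factor ** (epoch // step_epochs))


def sgd_momentum_step(weights, grads, velocity, lr, momentum, weight_decay):
    """One SGD update with momentum and L2 weight decay on lists of scalars.

    Returns the updated weights and the updated velocity buffer.
    """
    new_velocity = [
        momentum * v + g + weight_decay * w
        for v, g, w in zip(velocity, grads, weights)
    ]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

For example, `step_learning_rate(epoch, base_lr=0.1, decay_factor=10, step_epochs=20)` keeps the rate at 0.1 for epochs 0 to 19 and drops it to 0.01 at epoch 20; these particular numbers are illustrative only.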
Here each mini-batch consists of both labeled and unlabeled data. Specifically, for the experiments that use the Conv-4-64 network architecture, and , , or of MiniImageNet as labeled data, each mini-batch consists of labeled images and unlabeled images. For the experiments that use the WRN-28-10 network architecture, and , , or of MiniImageNet as labeled data, each mini-batch consists of labeled images and unlabeled images. For the experiment that uses of MiniImageNet as labeled data and tiered-MiniImageNet for unlabeled data, each mini-batch consists of labeled images and unlabeled images.
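A mixed mini-batch of this kind can be sketched as follows. The sketch assumes that the object classification loss is computed only on the labeled images, that the self-supervised loss is averaged over all images in the batch, and that the two terms are combined with a weight `alpha`; the function names, the weight, and the way the losses are averaged are illustrative assumptions, and the per-batch counts are left as arguments because they are missing from the text here.

```python
import random


def sample_mixed_batch(labeled, unlabeled, n_labeled, n_unlabeled, rng):
    """Draw labeled (image, label) pairs and unlabeled images for one batch."""
    return rng.sample(labeled, n_labeled), rng.sample(unlabeled, n_unlabeled)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def semi_supervised_loss(cls_losses, self_losses_labeled,
                         self_losses_unlabeled, alpha=1.0):
    """Combine per-example losses from one mixed mini-batch.

    cls_losses: classification losses, available for labeled images only.
    self_losses_*: self-supervised losses, available for every image.
    """
    cls_term = mean(cls_losses)
    self_term = mean(self_losses_labeled + self_losses_unlabeled)
    return cls_term + alpha * self_term
```

Because the self-supervised term needs no annotations, unlabeled images contribute gradients to the feature extractor even though they never reach the classification head.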
B.4 Assessing self-supervised representations based on the few-shot object recognition task
Here we provide implementation details for the experiments in §4.4, which assess the self-supervised representations using the few-shot object recognition task. Except for the facts that, during the first learning stage, (1) there is no object-based supervision (i.e., no loss) and (2) there is no early stopping based on a validation set, the remaining implementation details are the same as in the other CC-based experiments. A minor difference is that, when evaluating the WRN-28-10 and relative patch location based model, we create the representation of each image by averaging the extracted feature vectors of its 9 patches (similar to ).