Object-Level Representation Learning for Few-Shot Image Classification

Liangqu Long, Wei Wang, Jun Wen, Meihui Zhang, Qian Lin, Beng Chin Ooi
National University of Singapore, Zhejiang University, Beijing Institute of Technology
dcslong@nus.edu.sg, {wangwei, linqian, ooibc}@comp.nus.edu.sg,
jungel2star@gmail.com, meihui_zhang@bit.edu.cn


Few-shot learning that trains image classifiers over few labeled examples per category is a challenging task. In this paper, we propose to exploit an additional big dataset with different categories to improve the accuracy of few-shot learning over our target dataset. Our approach is based on the observation that images can be decomposed into objects, which may appear in images from both the additional dataset and our target dataset. We use the object-level relation learned from the additional dataset to infer the similarity of images in our target dataset with unseen categories. Nearest neighbor search is applied to do image classification, which is a non-parametric model and thus does not need fine-tuning. We evaluate our algorithm on two popular datasets, namely Omniglot and MiniImagenet. We obtain 8.5% and 2.7% absolute improvements for 5-way 1-shot and 5-way 5-shot experiments on MiniImagenet, respectively. Source code will be published upon acceptance.




Preprint. Work in progress.

1 Introduction

Real-world data typically follows power-law distributions, where the majority of the data categories have only a small number of examples. For instance, to train an image classifier for food images, one would probably crawl only a few images for some local dishes but many images for common dishes such as burgers. Similarly, there are few images of new products, e.g. new toys. However, state-of-the-art image classifiers, i.e. deep convolutional neural networks (ConvNets) Krizhevsky et al. [2012], are extremely hungry for data. The popular benchmark datasets for ConvNets, including CIFAR10 and ImageNet Deng et al. [2009], usually have a large number of images per category. Fine-tuning ConvNets Yosinski et al. [2014] by transferring the knowledge (i.e. parameters) learned from a big dataset can narrow the gap to some degree, but does not resolve the issue. One reason is that the widely used gradient-based optimization algorithms need many iterations over plenty of examples to adapt the ConvNets (with a large number of parameters) to new categories Ravi and Larochelle [2016].

Two approaches have been proposed to address this challenge. Both are referred to as few-shot learning, which trains classifiers over datasets with only a few examples per category. The first approach Ravi and Larochelle [2016], Li et al. [2017], Finn et al. [2017] is based on meta-learning. MAML Finn et al. [2017] trains a meta-learner to provide a good initialization for the parameters of the classifier. The meta-learner of Meta-SGD Li et al. [2017] generates adaptive learning rates for training the classifier. Ravi and Larochelle [2016] replace the gradient-based optimizer with an LSTM to train the classifier. The second approach Koch et al. [2015], Vinyals et al. [2016], Santoro et al. [2016], Snell et al. [2017], Sung et al. [2017] is based on metric learning. It learns an embedding function to project the images into an embedding space and then classifies images from new categories via nearest neighbour search (NNS). No fine-tuning is required over the new categories because the NNS classifier is non-parametric. The embedding function is vital to the classification accuracy: it must be general enough to extract effective features for measuring the distance or similarity between images from the unseen categories.

Figure 1: Left: Segway. Right: electric scooter. Humans can easily recognize a Segway as a kind of motor vehicle by its components: wheels, pedals and rider.

In this paper, we follow the second approach and take the following observation into consideration. Humans usually recognize a new (unseen) object by decomposing it and comparing its components with those of other objects they have seen before. Take the Segway in Figure 1 as an example (Lake et al. [2015b]). Although a Segway may be new to us, we are familiar with its components (i.e. objects), e.g. wheels and pedals, which are similar to those of the motor vehicles or electric scooters in our memory. By comparing the components of the Segway with those of electric scooters, we can tell that a Segway is a kind of motor vehicle for riding.

Motivated by human intelligence, we propose a novel approach that learns image similarity based on object-level relations. This approach is called OLFSL, short for Object-Level Few-Shot Learning. OLFSL compares the objects from two images to learn the object-level relations, which are used to infer the image-level similarity. More specifically, OLFSL is composed of three modules: representation learning $F$, object relation learning $G$ and similarity learning $H$. $F$ extracts features of objects from each image. The object-level relations of two images are learned by feeding the object features into $G$. All relations learned by $G$ are aggregated and fed into $H$ to generate the image-level similarity score $s_{a,b}$. The three modules are trained using the additional big dataset. Nearest neighbor search is applied over the target dataset to do few-shot classification.

The primary contribution of this work is exploiting the object-level relation learned from known categories (the additional dataset) to infer the similarity of samples from unseen categories (the target dataset) for few-shot learning. We evaluate our approach on two popular datasets, Omniglot Lake et al. [2015b] and MiniImagenet Vinyals et al. [2016]. The experimental results show that we reach state-of-the-art performance on Omniglot. For MiniImagenet, we achieve 8.5% and 2.7% absolute improvements over the state-of-the-art methods for the 5-way 1-shot and 5-way 5-shot experiments, respectively. In addition, OLFSL is model-agnostic, scalable, and free of fine-tuning on new tasks.

Figure 2: OLFSL architecture. Image $x_a$ from the support set and image $x_b$ from the query set are fed into the representation network $F$ separately. $O_a$ (resp. $O_b$) denotes the object representations extracted from the raw image feature $x_a$ (resp. $x_b$). Objects from $O_a$ are concatenated pairwise with objects from $O_b$. The concatenated vectors are fed into the relation network $G$ to learn object-level relations. All object-level relations are aggregated, and the similarity network $H$ maps the aggregate to the similarity score $s_{a,b}$.

2 Methodology

In this section, we first give the problem definition of few-shot learning, and then introduce the model architecture, and finally explain the details about the training and inference procedure. We use 1-shot learning to explain the idea and then extend it to K-shot learning (K>1).

2.1 Problem Definition

For few-shot image learning, we are given a support set of labeled images $S = \{(x_i, y_i)\}$, where $x_i$ is the feature of an image and $y_i$ is the label. If the number of images per category is $K$ and the number of classes is $N$, the task is denoted as N-way K-shot learning, which classifies the images from a query set (denoted as $Q$) by assigning each image a label from the $N$ classes of $S$. The support set and query set together form a training (or test) episode. Typically, $K$ is a small number in the few-shot setting, e.g. $K = 1$ or $K = 5$.
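The episode construction described above can be sketched as follows. This is only an illustrative sketch: the function name `sample_episode`, the dictionary layout of the dataset and the toy image identifiers are assumptions made for the example, not part of the original method description.

```python
import random

def sample_episode(labels_to_images, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode: a support set and a query set."""
    # Pick N classes, then K support and n_query query images per class.
    classes = rng.sample(sorted(labels_to_images), n_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(labels_to_images[label], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query

# Toy dataset (assumption): 10 classes with 20 placeholder image ids each.
toy = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 75
```

Sampling the support and query images of a class in one draw keeps the two sets disjoint within an episode.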

2.2 Model Architecture

Our model consists of three parts as shown in Figure 2. The architecture configuration of each part is illustrated in Figure 3.

2.2.1 Object Feature Learning

For each image sample $(x, y)$, we denote $x$ as the raw image feature and $y$ as the label. $x$ is fed into a convolutional neural network $F$ to extract the object features. In particular, we decompose the feature maps from the last convolution layer to get the object features. There are in total $d \times d$ objects, i.e. $F(x) = \{o^1, o^2, \dots, o^{d^2}\}$, where $d$ stands for the number of objects along the horizontal and vertical dimensions (we assume $x$ has equal height and width). The i-th object is denoted as $o^i$.
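As a hedged illustration of this decomposition, the sketch below treats a feature map with C channels and spatial size d x d as d*d object vectors of length C, one per spatial location. The nested-list layout and the name `feature_map_to_objects` are assumptions for the example; a real implementation would operate on tensors.

```python
def feature_map_to_objects(fmap):
    """Split a C x d x d feature map into d*d object vectors of length C."""
    channels = len(fmap)
    d = len(fmap[0])
    # One object per spatial location, collecting that location across channels.
    return [
        [fmap[c][row][col] for c in range(channels)]
        for row in range(d)
        for col in range(d)
    ]

# Toy feature map with C=3 and d=2; each value encodes (channel, row, col).
fmap = [[[100 * c + 10 * r + k for k in range(2)] for r in range(2)]
        for c in range(3)]
objects = feature_map_to_objects(fmap)
print(len(objects), objects[1])  # 4 [1, 101, 201]
```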

2.2.2 Object-Level Relation Learning

This module learns the object-level relations from the object-level representations extracted by $F$. Given two images $x_a$ and $x_b$, $F$ extracts the object-level representations $O_a = \{o_a^i\}$ and $O_b = \{o_b^j\}$, respectively. We compare $O_a$ and $O_b$ following a rule $C$. A simple rule is to concatenate the objects from $O_a$ and $O_b$ pairwise, i.e.

$$C(O_a, O_b) = \{[o_a^i, o_b^j] \mid i, j = 1, \dots, d^2\} \qquad (1)$$

Other rules can also be applied, e.g. concatenating objects at the same spatial location.

Each concatenated vector is fed into another fully connected neural network $G$ to learn the object-level relation. The output feature vector, denoted as $r_{i,j}$, represents the relation between $o_a^i$ and $o_b^j$. In total, there are $d^2 \times d^2$ such relation feature vectors.

All object relation vectors are aggregated to get the image-level relation,

$$R_{a,b} = \sum_{i,j} r_{i,j} = \sum_{i,j} G([o_a^i, o_b^j]) \qquad (2)$$

where $\sum$ stands for the element-wise add operation.
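The pairwise concatenation and element-wise aggregation can be illustrated with a small sketch. The function `toy_relation` is a hypothetical stand-in for the learned relation network (it simply returns the sum and the maximum of its input); it is an assumption for this example and not the model used in the paper.

```python
def pairwise_concat(objects_a, objects_b):
    """Concatenate every object of image a with every object of image b."""
    return [oa + ob for oa in objects_a for ob in objects_b]

def aggregate(relations):
    """Element-wise sum of all relation vectors."""
    return [sum(values) for values in zip(*relations)]

def toy_relation(pair):
    """Hypothetical stand-in for the learned relation network."""
    return [sum(pair), max(pair)]

objects_a = [[1, 2], [3, 4]]   # two toy objects from image a
objects_b = [[5, 6], [7, 8]]   # two toy objects from image b
pairs = pairwise_concat(objects_a, objects_b)
image_relation = aggregate([toy_relation(p) for p in pairs])
print(len(pairs), image_relation)  # 4 [72, 28]
```

Because the aggregation is a sum, the image-level relation does not depend on the order in which the object pairs are processed.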

2.2.3 Similarity Learning

The image-level relation feature vector $R_{a,b}$ is fed into a fully connected neural network $H$ to generate the final image-level similarity score,

$$s_{a,b} = H(R_{a,b}) \qquad (3)$$

where $s_{a,b}$ is the similarity between two samples $x_a$ and $x_b$. It is normalized to $[0, 1]$, where 0 stands for completely distinct and 1 for almost the same.

2.3 Training and Inference

We design the training procedure following Vinyals et al. [2016] so that the training and inference conditions match. More specifically, we divide the whole dataset into three parts, namely $D_{train}$, $D_{val}$ and $D_{test}$, with disjoint label spaces. $D_{train}$ serves as the additional big dataset used to learn the object-level relation. For each sub-dataset, we create an N-way K-shot episode (or task), denoted as $T$, by randomly sampling images to construct the support set and query set. Cross-entropy is employed as the training objective, as shown in Equation 4, where the ground truth $y_{a,b} = 1$ if $x_a$ and $x_b$ are from the same category and $y_{a,b} = 0$ otherwise. By optimizing this objective, the predicted similarity $s_{a,b}$ becomes close to the ground truth similarity.

$$L = -\sum_{(a,b)} \left[ y_{a,b} \log s_{a,b} + (1 - y_{a,b}) \log (1 - s_{a,b}) \right] \qquad (4)$$
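A minimal sketch of this cross-entropy objective over image pairs is given below, assuming the predicted similarity lies in [0, 1] and the loss is summed over the pairs of an episode. The names `pair_loss` and `episode_loss` and the clipping constant `eps` are assumptions introduced for the example.

```python
import math

def pair_loss(score, same_class, eps=1e-12):
    """Binary cross-entropy between a similarity score and its 0/1 target."""
    score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
    y = 1.0 if same_class else 0.0
    return -(y * math.log(score) + (1.0 - y) * math.log(1.0 - score))

def episode_loss(scores, same_class_flags):
    """Sum of the pair losses over all compared pairs in one episode."""
    return sum(pair_loss(s, y) for s, y in zip(scores, same_class_flags))

print(round(pair_loss(0.5, True), 4))  # 0.6931
```

A confident and correct prediction (score near 1 for a same-class pair, near 0 otherwise) drives the loss toward zero.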


During inference, for each query, we compare it with each image from the support set by feeding them through the model. Nearest neighbor search then assigns the label of the image with the largest similarity to the query. Fine-tuning is not required for OLFSL.
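The inference step can be sketched as a plain nearest neighbor search over the support set. Here `toy_similarity` is a hypothetical placeholder for the trained model, and scalar "images" are used only to keep the example small; both are assumptions for illustration.

```python
def nearest_neighbor_label(query, support, similarity):
    """Return the label of the support image most similar to the query."""
    best_label, best_score = None, float("-inf")
    for image, label in support:
        score = similarity(image, query)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def toy_similarity(a, b):
    """Hypothetical similarity in (0, 1]; stands in for the trained model."""
    return 1.0 / (1.0 + abs(a - b))

support = [(0.0, "cat"), (5.0, "dog"), (10.0, "bird")]
print(nearest_neighbor_label(4.2, support, toy_similarity))  # dog
```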

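For the case of $K > 1$, there is more than one example per category in the support set. The feature maps of images from the same category are averaged to get an average representation, which reduces the representation variance,

$$\bar{F}_c = \frac{1}{K} \sum_{k=1}^{K} F(x_{c,k}) \qquad (5)$$

where $x_{c,k}$ denotes the k-th support image of category $c$.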


The averaged representation is also used in Prototypical Network Snell et al. [2017]. Other modules for K-shot learning work in the same way as one-shot learning. The pseudo code of our algorithm is shown in Algorithm 1.
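The averaging step for K-shot learning can be sketched as follows. For brevity the feature maps are assumed to be flattened into plain vectors, and the name `class_prototypes` is hypothetical.

```python
from collections import defaultdict

def class_prototypes(features, labels):
    """Average the (flattened) feature maps of support images per category."""
    grouped = defaultdict(list)
    for feature, label in zip(features, labels):
        grouped[label].append(feature)
    return {
        label: [sum(values) / len(group) for values in zip(*group)]
        for label, group in grouped.items()
    }

features = [[1, 2], [3, 4], [10, 10]]
labels = ["a", "a", "b"]
print(class_prototypes(features, labels))
# {'a': [2.0, 3.0], 'b': [10.0, 10.0]}
```

With one averaged representation per category, the rest of the pipeline compares the query against N representations instead of N x K images.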

Training stage:
    randomly initialize the parameters of $F$, $G$ and $H$
    while not done do
        sample one episode $T$ from $D_{train}$
        randomly sample $x_a$ and $x_b$ from the support set and query set of $T$
        extract the object-level representations $O_a = F(x_a)$ and $O_b = F(x_b)$
        compute the similarity score $s_{a,b}$ via Equation 3
        compute Equation 4 and update the parameters via SGD
    end while
Testing stage:
    given the support set $S$ and one query sample $x_b$
    for each sample $x_a$ in $S$ do
        compute the similarity score $s_{a,b}$ via Equation 3
    end for
    return the label with the largest (aggregated) similarity score
Algorithm 1: OLFSL pseudo-code for training and testing

3 Experimental Evaluation

We evaluate our approach on two popular benchmark datasets: Omniglot and MiniImagenet. We partition each dataset into training, validation and testing subsets: $D_{train}$, $D_{val}$ and $D_{test}$. The N-way K-shot learner is trained by sampling $N$ classes and $K$ examples per class for each training episode $T$. We describe the experiments on the two datasets in the following subsections.

3.1 Evaluation on Omniglot

Omniglot Lake et al. [2015a] contains 1623 characters (classes) from 50 different alphabets. Each class has 20 samples drawn by different people. We use 1200 classes for training (including validation), and the remaining 423 classes for testing. Following Sung et al. [2017], Snell et al. [2017], all input images are augmented by rotations in multiples of 90 degrees. In every testing episode, 15 query images per class are tested.

Model                      | FT | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
---------------------------+----+--------------+--------------+---------------+--------------
Koch et al. [2015]         | N  | 96.7%        | 98.4%        | 88.0%         | 96.5%
Koch et al. [2015]         | Y  | 97.3%        | 98.4%        | 88.1%         | 97.0%
MANN Santoro et al. [2016] | N  | 82.8%        | 94.9%        | -             | -
Vinyals et al. [2016]      | N  | 98.1%        | 98.9%        | 93.8%         | 98.5%
Vinyals et al. [2016]      | Y  | 97.9%        | 98.7%        | 93.5%         | 98.7%
Kaiser et al. [2017]       | N  | 98.4%        | 99.6%        | 95.0%         | 98.6%
Munkhdalai and Yu [2017]   | N  | 99.0%        | -            | 97.0%         | -
Snell et al. [2017]        | N  | 98.8%        | 99.7%        | 96.0%         | 98.9%
MAML Finn et al. [2017]    | Y  | 98.7 ± 0.4%  | 99.9 ± 0.1%  | 95.8 ± 0.3%   | 98.9 ± 0.2%
Li et al. [2017]           | Y  | 99.5 ± 0.3%  | 99.9 ± 0.1%  | 95.9 ± 0.4%   | 99.0 ± 0.2%
L2C Sung et al. [2017]     | N  | 99.6 ± 0.2%  | 99.8 ± 0.1%  | 97.6 ± 0.2%   | 99.1 ± 0.1%
Ours                       | N  | 99.8 ± 0.1%  | 99.9 ± 0.1%  | 98.2 ± 0.1%   | 99.5 ± 0.1%
Table 1: Performance comparison on the Omniglot dataset. FT indicates whether the model is fine-tuned (Y) or not (N).

Figure 3: Details of the network architecture for the Omniglot and MiniImagenet experiments.

The detailed configuration of our networks is illustrated in Figure 3. Each convolution layer is annotated with its input channel dimension and output channel dimension. All convolutional layers have a kernel size of 3x3. The numbers associated with MaxPool (resp. AvgPool) stand for the pooling kernel size and stride size. Previous papers Sung et al. [2017], Snell et al. [2017] resize the images to 28x28 or 20x20, which results in small feature maps from the last convolution layer, e.g. 1x1x64. In order to get a large feature map for object relation modeling, we resize the input images to 84x84. Consequently, the output from $F$ has 64 feature maps, each of size 7x7. Therefore, there are $49 \times 49 = 2401$ combinations, i.e. object relation features, each of size $2 \times 64 = 128$. The network $G$ processes these 2401 object relation features independently through an MLP model. The output feature of each relation is of dimension 256. All features are summed into a single feature, which is then fed into $H$ to generate the image similarity score $s_{a,b}$. All Omniglot experiments are trained with Adam Kingma and Ba [2014] with a learning rate of 0.001 and no weight decay.
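The counts quoted above follow directly from the feature map size. The short sketch below reproduces the arithmetic for the two settings reported in this paper (7x7 maps for Omniglot and 10x10 maps for MiniImagenet, both with 64 channels); the helper name `relation_input_shape` is an assumption for the example.

```python
def relation_input_shape(d, channels):
    """Number of object pairs and length of each concatenated pair vector."""
    num_objects = d * d                      # one object per spatial location
    return num_objects * num_objects, 2 * channels

print(relation_input_shape(7, 64))   # (2401, 128)
print(relation_input_shape(10, 64))  # (10000, 128)
```

The number of pairs grows with the fourth power of d, which is why the feature map size has a direct impact on memory cost.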

Following the experimental setting in previous papers, we compare our approach with existing methods on four tasks, namely 5-way 1-shot, 5-way 5-shot, 20-way 1-shot and 20-way 5-shot classification. The results in terms of classification accuracy are presented in Table 1. For existing methods, we copy the performance reported in the original papers or in other published papers. Both meta-learning and metric-learning based approaches are compared. Meta-learning based solutions need to fine-tune the model over the support set of the test dataset. For metric-learning based approaches, fine-tuning is not necessary. It may improve or decrease the performance, as reported for Matching Nets Vinyals et al. [2016] and shown in the 4th and 5th rows of the table. The second column indicates whether the model is fine-tuned over the test support set or not.

Our results are averaged over 600 test episodes and are reported with 95% confidence intervals. The variance is also reported. We can see that our approach outperforms existing methods on 3 out of 4 tasks. Note that the accuracy of existing solutions is very high, especially for 5-way tasks. Hence, even a small improvement over the state-of-the-art should be considered significant. The improvement for 20-way tasks is clearer. 20-way tasks are more difficult than 5-way tasks because the model needs to be more discriminative to differentiate more classes. To confirm the advantage of our approach, we perform a comparison on a more difficult dataset in the next subsection.
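One common way to obtain such an interval from per-episode accuracies is the normal approximation, sketched below. The paper does not state how its intervals were computed, so the 1.96 factor and the use of the sample standard deviation are assumptions; `mean_and_ci95` is a hypothetical helper name.

```python
import math
import statistics

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    mean = statistics.mean(accuracies)
    std_err = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * std_err

# Toy input with two episodes; the paper averages over 600 test episodes.
mean, half_width = mean_and_ci95([0.5, 0.7])
print(round(mean, 3), round(half_width, 3))  # 0.6 0.196
```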

In addition, we observe that the training progress is rather stable. From Figure 4, we can see that overfitting is not a problem for our algorithm although the weight decay is 0, the datasets are not large and there are many fully connected layers in $G$ and $H$. One possible reason is that the aggregation of object-level relations behaves like a model ensemble, which helps prevent overfitting.

3.2 Evaluation on MiniImagenet

The MiniImagenet dataset Vinyals et al. [2016] consists of 60,000 colour images with 100 classes sampled from ImageNet Deng et al. [2009]. Each class has 600 examples. We follow the partition scheme as in the original paper Vinyals et al. [2016] to get 64, 16, 20 classes for training, validation and testing, respectively. We resize the images to 224x224 and do channel-wise standardization. No data augmentation is conducted. MiniImagenet is a more difficult benchmark than Omniglot because it has a larger number of classes and greater variations among the images within each class.

The configurations of $G$ and $H$ are kept the same as in the Omniglot tasks, as shown in Figure 2. The network $F$ is almost the same as that in Figure 3, except that the final average pooling layer has a larger kernel and stride size. This is to reduce the memory cost caused by the large input images. $F$ generates 64 feature maps, each of size 10x10. Consequently, we have $100 \times 100 = 10{,}000$ combinations of object features. $G$ processes the 10,000 combinations independently via a 3-layer MLP model. The output is summed to generate a 256-d feature. The network $H$ from Figure 3 is used again to generate the image pair score $s_{a,b}$. All MiniImagenet experiments are trained with Adam Kingma and Ba [2014] with a learning rate of 0.001 and no weight decay.

We evaluate on four tasks, namely 5-way 1-shot, 5-way 5-shot, 20-way 1-shot and 20-way 5-shot classification. In Table 2, we report the classification accuracy, including the mean and variance over 600 test episodes. The performance of existing methods is copied from their original papers or from other papers. We can see that our model achieves absolute improvements of 8.5% and 2.7% over existing methods on the 5-way 1-shot and 5-way 5-shot tasks, respectively. These observations are consistent with the results on the Omniglot dataset, which indicates that our approach has a larger capacity for modelling more difficult tasks.

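Model             | FT | 5-way 1-shot  | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
------------------+----+---------------+--------------+---------------+--------------
Matching Nets     | N  |               |              |               |
Meta-SGD          | Y  |               |              |               |
Meta Nets         | N  | 49.2 ± 0.9%   | -            | -             | -
Prototypical Nets | N  | 49.4 ± 0.8%   | 68.2 ± 0.7%  | -             | -
L2C               | N  | 50.44 ± 0.82% | 65.32 ± 0.7% | -             | -
Ours              | N  | 59.0 ± 1.0%   | 70.9 ± 0.5%  | 22.2 ± 0.3%   | 32.2 ± 0.2%
Table 2: Performance comparison on the MiniImagenet dataset. FT indicates whether the model is fine-tuned (Y) or not (N).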

4 Discussion

In this paper, we learned an object-level representation for few-shot learning. In particular, we consider each vector $o^i$ from the feature map $F(x)$ of a sample $x$ as an object. We explore the similarity of samples $x_a$ and $x_b$ at the level of their objects $o_a^i$ and $o_b^j$. We then learn a metric $H$ that maps the aggregated vector $R_{a,b}$ into a space in which nearest neighbor search can be applied to predict the label of the query sample.

Our best performance on the MiniImagenet experiments is achieved with an object dimension of $d = 10$, which means that one representation includes $d^2 = 100$ objects. We also explored the effect of the number of objects on the MiniImagenet dataset, as shown in Figure 4. It demonstrates that the number of objects in the representation has an important effect on performance. When $d$ is too small, e.g. $d = 1$, only one object can be learned and the rich object information cannot be exploited. Meanwhile, when the number of objects is increased further, the improvement becomes marginal. Therefore, $d$ should be set to an appropriate value.

In our work, we resize the MiniImagenet images to 224x224. Some previous work, e.g. Sung et al. [2017], resizes the raw images to 84x84. We argue that our algorithm needs to extract rich information from the raw sample; for example, we extract $d^2 = 100$ objects when $d = 10$. We also observe that the image size has only a minor influence on performance. For a fair comparison, we re-ran the experiments based on the published source code of Sung et al. [2017]. Table 3 indicates that the performance of L2C varies only slightly with the 224x224 image size, even decreasing by 0.28% on the 5-way 1-shot experiment.

Figure 4: (a) Influence of the number of objects on accuracy for the MiniImagenet experiments. (b) Training and testing accuracy curves for the 5-way 5-shot Omniglot experiment. (c) Training and testing loss curves for the 5-way 5-shot Omniglot experiment.

Image size (L2C, Sung et al. [2017]) | 5-way 1-shot | 5-way 5-shot
-------------------------------------+--------------+-------------
84x84                                | 50.44%       | 65.32%
224x224                              | 50.16%       | 65.98%
Table 3: L2C experiments on MiniImagenet with different image sizes.

5 Related Work

Few-shot learning is the task of learning over datasets with few examples per category. It is useful for recognizing new categories, e.g. new products. With the resurgence of deep learning, most few-shot image classifiers are based on ConvNets. A simple solution is to fine-tune ConvNets trained on a similar dataset with many examples per category. However, the widely used gradient-based optimization algorithms (e.g. mini-batch Stochastic Gradient Descent, SGD) need a lot of examples to adapt (fine-tune) the ConvNets Ravi and Larochelle [2016] to the new categories. Therefore, two types of approaches have recently been proposed.

Meta Learning. To address the optimization challenge, meta learning trains a meta learner that guides the optimization algorithm in fine-tuning the learner (i.e. classifier). It is also called learning to learn. The meta learner is trained iteratively and learns slowly. In each iteration, an episode is sampled from the training dataset with the same setting as the test scenario. In other words, an episode has the same number of categories and the same number of examples per category as the test. The meta learner is trained to fine-tune the classifier over a large number of episodes. After training, the meta learner is expected to guide the learning of the basic learner, e.g. the classifier for the test episodes. The meta learner of MAML Finn et al. [2017] learns a good parameter initialization such that fine-tuning can adapt the parameters quickly and effectively for the test episodes. The meta learner of Meta-SGD Li et al. [2017] generates both the initialization and the learning rate for the fine-tuning optimization algorithm. A more aggressive approach Ravi and Larochelle [2016] is to learn an LSTM Hochreiter and Schmidhuber [1997] that generates the updates for fine-tuning, i.e. the LSTM replaces the optimizer of the basic learner.

Metric Learning. Another set of approaches eliminates the fine-tuning step to avoid the optimization problem. They learn a general embedding function to project both training and testing images into a metric space, where nearest neighbor search can be used as the classifier. Koch et al. [2015] adapt a Siamese network to do the feature embedding by training the network to predict the relation (from the same category or from different categories) of two training images.

Meta-LSTM Vinyals et al. [2016] and memory-augmented networks Santoro et al. [2016] are also applied to learn the embedding space. Prototypical Network Snell et al. [2017] learns an embedding for each category by averaging the features of all samples in the category. L2C Sung et al. [2017] is motivated by the Relation network Santoro et al. [2017], which explores the relation between a question and each object in an image. L2C compares two images within one episode to learn their relation scores, which serve as the metric distance. However, it ignores the rich object-level information when comparing images.

6 Conclusion

Supervised learning has shown great success in computer vision and audio recognition when large-scale datasets are available. However, few-shot learning is challenging because the training algorithms of neural networks, i.e. gradient-based optimization algorithms, require many iterations over a lot of examples to fine-tune the parameters for new image classes. In this paper, we learn an object-level representation and exploit rich object-level information to infer image similarity. We show absolute improvements on the MiniImagenet dataset and state-of-the-art performance on the Omniglot dataset. Our algorithm is intuitive and model-agnostic, and it maintains good generalization performance.


  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Finn et al. [2017] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017. URL http://arxiv.org/abs/1703.03400.
  • Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
  • Kaiser et al. [2017] L. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. CoRR, abs/1703.03129, 2017. URL http://arxiv.org/abs/1703.03129.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koch et al. [2015] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop, 2015.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • Lake et al. [2015a] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015a. ISSN 0036-8075. doi: 10.1126/science.aab3050. URL http://science.sciencemag.org/content/350/6266/1332.
  • Lake et al. [2015b] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015b.
  • Li et al. [2017] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-sgd: Learning to learn quickly for few shot learning. CoRR, abs/1707.09835, 2017. URL http://arxiv.org/abs/1707.09835.
  • Munkhdalai and Yu [2017] T. Munkhdalai and H. Yu. Meta networks. CoRR, abs/1703.00837, 2017. URL http://arxiv.org/abs/1703.00837.
  • Ravi and Larochelle [2016] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. ICLR, 2016.
  • Santoro et al. [2016] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap. One-shot learning with memory-augmented neural networks. CoRR, abs/1605.06065, 2016. URL http://arxiv.org/abs/1605.06065.
  • Santoro et al. [2017] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. P. Lillicrap. A simple neural network module for relational reasoning. CoRR, abs/1706.01427, 2017. URL http://arxiv.org/abs/1706.01427.
  • Snell et al. [2017] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. CoRR, abs/1703.05175, 2017. URL http://arxiv.org/abs/1703.05175.
  • Sung et al. [2017] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. CoRR, abs/1711.06025, 2017. URL http://arxiv.org/abs/1711.06025.
  • Vinyals et al. [2016] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. CoRR, abs/1606.04080, 2016. URL http://arxiv.org/abs/1606.04080.
  • Yosinski et al. [2014] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? CoRR, abs/1411.1792, 2014. URL http://arxiv.org/abs/1411.1792.