Low-shot learning with large-scale diffusion

Matthijs Douze    Arthur Szlam    Bharath Hariharan    Hervé Jégou
Facebook AI Research

Abstract

This paper considers the problem of inferring image labels for which only a few labelled examples are available at training time. This setup is often referred to as low-shot learning in the literature, where a standard approach is to re-train the last few layers of a convolutional neural network learned on separate classes. We consider a semi-supervised setting in which we exploit a large collection of unlabelled images to support label propagation. This is made possible by leveraging recent advances in large-scale similarity graph construction.

We show that despite its conceptual simplicity, scaling up label propagation to hundreds of millions of images leads to state-of-the-art accuracy in the low-shot learning regime.

 


1 Introduction

Large, diverse collections of images are now commonplace; these often contain a “long tail” of visual concepts. Some concepts like “person” or “cat” appear in many images, but the vast majority of the visual classes do not occur frequently. Even though the total number of images may be large, it is hard to collect enough labeled data for most of the visual concepts. Thus if we want to learn them, we must do so with few labeled examples. This task is named low-shot learning in the literature.

In order to learn new classes with little supervision, a standard approach is to leverage classifiers already learned for the most frequent classes, in a so-called transfer learning strategy. For instance, for new classes with few labels, only the last few layers of a convolutional neural network are re-trained, which limits the number of parameters to be learned and reduces over-fitting.

In this paper, we consider the low-shot learning problem described above, where the goal is to learn to detect new visual classes with only a few annotated images per class, but we also assume that we have many unlabelled images. This is called semi-supervised learning [42, 40] (considered, e.g., for face annotation [14]). The motivation of this work is threefold. First, we want to show that with modern computational tools, classical semi-supervised learning methods scale gracefully to hundreds of millions of unlabeled points. A limiting factor in previous evaluations was the cost of constructing the similarity graph that supports the diffusion. This is no longer a bottleneck: thanks to advances both in computing architectures and in algorithms, one can routinely compute the similarity graph for 100 million images in a few hours [23]. Second, we want to answer the question: does exploiting a very large number of images help for semi-supervised learning? Finally, by comparing the results of these methods on Imagenet and the YFCC100M dataset [36], we highlight how they reveal some artificial aspects of Imagenet that can influence the performance of low-shot learning algorithms.

In summary, the contribution of our paper is a study of semi-supervised learning in the scenario where we have a very large number of unlabeled images. Our main result is that in this setting, semi-supervised learning leads to state-of-the-art low-shot learning performance. In more detail, we make the following contributions: we carry out a large-scale evaluation of diffusion methods for semi-supervised learning and compare them to recent low-shot learning papers. Our experiments are all carried out on the public benchmark Imagenet [11] and the YFCC100M dataset [36]. We show that our approach is efficient and that the diffusion process scales up to hundreds of millions of images, which is order(s) of magnitude larger than what we are aware of in the literature on image-based diffusion [21, 20]. This is made possible by leveraging the recent state of the art for efficient k-nearest neighbor graph construction [23]. We evaluate several variants and hypotheses involved in diffusion methods, such as exploiting class frequency priors [41]. This scenario is realistic in situations where this statistic is known a priori. We propose a simple way to estimate and exploit it without this prior knowledge, and extend this assumption to a multiclass setting by introducing a probabilistic projection step derived from the Sinkhorn-Knopp algorithm. Our experimental study shows that a simple propagation process, when carried out at a large scale with many unlabelled images, significantly outperforms some state-of-the-art approaches in low-shot visual learning when (i) the number of annotated images per class is small and (ii) the number of unlabeled images is large.

This paper is organized as follows. Section 2 reviews related work and Section 3 describes the label propagation methods and variants. The experimental study is presented in Section 4, and Section 5 summarizes our findings.

2 Related work

Low-shot learning

Recently there has been renewed interest in low-shot learning, i.e., learning with few examples by exploiting prior learning on other classes. Such works include metric learning [28], learning a kNN classifier [38], regularization and feature hallucination [16], or predicting the parameters of the network [5]. Ravi and Larochelle introduce a meta-learner that learns the optimization updates involved in the low-shot learning regime [32]. Most of these works consider small datasets like Omniglot, CIFAR, or a small subset of Imagenet. In this paper we focus solely on large datasets, in particular the Imagenet collection [33] associated with the ILSVRC challenge.

Diffusion methods

We refer the reader to [3, 12] for a review of diffusion processes and matrix normalization options. Such methods are an efficient way of clustering images given a matrix of input similarity, or a kNN graph, and have been successfully used in a semi-supervised discovery setup [14]. They share some connections with spectral clustering [6]. In [31], a kNN graph is clustered with spectral clustering, which amounts to computing the eigenvectors associated with the largest eigenvalues of the graph, and clustering these eigenvectors. Since the eigenvalues are obtained via Lanczos iterations [15, Chapter 10], the basic operation is similar to a diffusion process. This is also related to power iteration clustering [27], as in the work of Cho et al. [8] to find clusters.

Semi-supervised learning

The kNN graph can be used for transductive and semi-supervised learning (see e.g. [3, 42] for an introduction). In transductive learning, a relatively small number of labeled points is used together with a large set of unlabeled data, and the goal is to extend the labeling to the unlabeled data (which is given at train time). Semi-supervised learning is similar, except there may be a separate set of test points that are not seen at train time. In our work, we consider the simple proposal of Zhu et al. [41], where powers of the (normalized) kNN graph are used to find smooth functions on the graph with desired values at the labeled points. There are many variations on this algorithm; e.g., Zhou et al. [40] weight the edges based on distances and introduce a loss trading off a classification fitting constraint and a smoothness term enforcing consistency of neighboring nodes.

Label propagation is a transductive method. In order to evaluate on new data, i.e., new classes, we need to extend the smooth functions out of the training data. A standard method is to use a weighted sum of nearest neighbors from the training data [4]. Here we instead use a deep convolutional network trained similarly to the one used to build the features. Deep networks have been used before for out of sample extension, e.g., unsuccessfully in [7] but successfully in [22], in the speech domain. Furthermore, in addition to regressing the actual predicted labels from the label propagation, we also show results regressing the distributions, similar to [19].

Efficient kNN-graph construction

Diffusion methods take as input a matrix containing the similarity between all images of the dataset. For a collection of $n$ images with $n$ large, it is not possible to store a dense matrix of size $n \times n$. However, most image pairs are not related and have a similarity close to 0, so diffusion methods are usually implemented with sparse matrices. This means that we compute a graph connecting each image to its neighbors, as determined by the similarity metric between image representations. In particular, we consider the k-nearest neighbor graph (kNN graph) over a set of vectors. Several approximate algorithms [10, 25, 1, 17] have been proposed to efficiently produce the kNN graph used as input of iterative/diffusion methods, since an exact construction has quadratic complexity in the number of images. In this paper, we employ the Faiss library, which was shown to be capable of constructing a graph connecting up to 1 billion vectors [23].

3 Propagating labels

This section describes the initial stage of our proposal, which estimates the class of the unlabelled images with a diffusion process. It includes an image description step, the construction of a kNN graph connecting similar images, and a label diffusion algorithm.

3.1 Image description

A meaningful semantic image representation and an associated metric are required to match instances of classes that have not been seen beforehand. While early works on semi-supervised labelling [14] used ad hoc semantic global descriptors like GIST [29], classification performance has improved substantially in recent years with deep CNN architectures [34, 18], which are a compelling choice for our purpose. Therefore, for the image description, we extract activation maps from a CNN trained on base classes that are independent from the novel classes on which the evaluation is performed. See the experimental section for more details about the training process for the descriptors.

The mean class classifier introduced for low-shot learning [28] is another way to perform dimensionality reduction while improving accuracy thanks to a better comparison metric. We do not consider this approach since it can be seen as part of the descriptor learning.

3.2 Affinity matrix: approximate kNN graph

As discussed in the related work, most diffusion processes use as input the kNN graph representing the sparse similarity matrix, denoted by $W$, which connects the images of the collection. We build this graph using approximate k-nearest neighbor search. Thanks to recent advances in efficient similarity search [10, 23], trading some accuracy against efficiency drastically improves the graph construction time. As an example, with the Faiss library [23], building the graph associated with 600k images takes 2 minutes on 1 GPU. From preliminary experiments, the approximation in the kNN graph construction does not induce any sub-optimality, possibly because the diffusion process compensates for the artifacts induced by the approximation.
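As an illustration, here is a minimal sketch (not the exact code used in the paper) of an approximate kNN search with a Faiss IVFFlat index; the index parameters (nlist, nprobe) and the function name are illustrative assumptions.

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss

def knn_search(x, k=30, nlist=4096, nprobe=256):
    """Approximate k-nearest-neighbor search with an IVFFlat index.
    x: (n, d) float32 descriptors. Returns, for each vector, the distances
    and indices of its k nearest neighbors (the vector itself excluded)."""
    n, d = x.shape
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(x)            # learn the coarse clustering on the data itself
    index.add(x)
    index.nprobe = nprobe     # number of inverted lists visited per query
    distances, indices = index.search(x, k + 1)  # +1: a point typically retrieves itself first
    return distances[:, 1:], indices[:, 1:]
```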

Different strategies exist to set the weights of the affinity matrix $W$. We choose to search the $k$ nearest neighbors of each image and set a 1 for each of the $k$ neighbors in the corresponding row of a sparse matrix $A$. Then we symmetrize the matrix by adding it to its transpose. We subsequently $L_1$-normalize the rows to produce a sparse stochastic matrix: $W = D^{-1}(A + A^\top)$, with $D$ the diagonal matrix of row sums.
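A minimal SciPy sketch of this construction, under the assumption that `neighbors` holds the kNN lists returned by the search above (the naming is ours):

```python
import numpy as np
import scipy.sparse as sp

def build_transition_matrix(neighbors):
    """neighbors: (n, k) array, row i = indices of the k nearest neighbors of
    image i (self excluded). Returns the row-stochastic W = D^-1 (A + A^T)."""
    n, k = neighbors.shape
    rows = np.repeat(np.arange(n), k)
    data = np.ones(n * k, dtype=np.float32)
    a = sp.csr_matrix((data, (rows, neighbors.ravel())), shape=(n, n))
    a = a + a.T               # symmetrize: most kNN edges are not reciprocal
    a.data[:] = 1.0           # constant (unweighted) edges
    inv_rowsum = 1.0 / np.asarray(a.sum(axis=1)).ravel()
    return (sp.diags(inv_rowsum) @ a).tocsr()
```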

The handling of test points is different: test points do not participate in label propagation because we classify each of them independently of the others. Therefore, there are no outgoing edges on test points; they only receive incoming edges from their nearest neighbors.

3.3 Label propagation

Figure 1: The diffusion setup. The arrows indicate the direction of diffusion. No diffusion is performed from the test points; for the rest of the graph the edges are bidirectional (i.e., the graph matrix is symmetric). Unless mentioned otherwise, the edges have no weights.

We now give details about the diffusion process itself, which is summarized in Figure 1. We build on the straightforward label propagation algorithm of [41]. The set of images on which we perform diffusion is composed of $n_{\mathrm{L}}$ labelled seed images and $n_{\mathrm{B}}$ unlabelled background images ($n = n_{\mathrm{L}} + n_{\mathrm{B}}$). Define the matrix $L \in \mathbb{R}^{n \times C}$, where $C$ is the number of classes for which we want to diffuse the labels, i.e., the new classes not seen in the training set. Each row of $L$ is associated with a given image and represents the probabilities of each class for that image. A given column corresponds to a given class and gives its probabilities for each image. The method initializes the rows of $L$ corresponding to the seeds to one-hot vectors encoding their labels. Background images are initialized with 0 probabilities for all classes. Diffusing from the known labels, the method iterates as $L_{t+1} := W L_t$.

We can optionally reset the rows corresponding to seeds to their one-hot ground-truth after each update. When iterating to convergence, the columns of $L$ would eventually converge to the eigenvector of $W$ with the largest eigenvalue (when not resetting), or to the harmonic function with respect to $W$ with boundary conditions given by the seeds (when resetting). Empirically, for low-shot learning, we observe that resetting has a negative impact on accuracy. Also, early stopping performs better in both cases, so we cross-validate the number of diffusion iterations.
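The core update amounts to a handful of lines; the sketch below shows the plain iteration $L_{t+1} := W L_t$ with the optional seed reset (names and default values are ours, the number of iterations being cross-validated as described above):

```python
import numpy as np

def diffuse_labels(w, seed_idx, seed_cls, n_classes, n_iter=3, reset_seeds=False):
    """w: (n, n) sparse row-stochastic matrix; seed_idx/seed_cls: indices and
    class labels of the seed images. Returns the diffused class scores L."""
    n = w.shape[0]
    l = np.zeros((n, n_classes), dtype=np.float32)
    l[seed_idx, seed_cls] = 1.0          # one-hot initialization of the seeds
    for _ in range(n_iter):              # early stopping: n_iter is cross-validated
        l = w @ l                        # L <- W L
        if reset_seeds:                  # optionally re-impose the ground truth
            l[seed_idx] = 0.0
            l[seed_idx, seed_cls] = 1.0
    return l
```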

Classification decision & combination with logistic regression

We predict the class of a test example as the column of $L$ that maximizes its score. Similarly to Zhou et al. [40], we have also optimized a loss balancing a fitting constraint with the diffusion smoothing term. However, we found that a simple late fusion (weighted mean of log-probabilities, parametrized by a single cross-validated coefficient) of the scores produced by diffusion and logistic regression achieves better results.
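For concreteness, this late fusion amounts to the following sketch, where the weight is the single cross-validated coefficient (function and argument names are ours):

```python
import numpy as np

def fuse_and_predict(log_p_diff, log_p_logreg, weight, topk=5):
    """Weighted mean of log-probabilities (weight 0 = diffusion only,
    1 = logistic regression only), followed by a top-k decision."""
    scores = (1.0 - weight) * log_p_diff + weight * log_p_logreg
    return np.argsort(-scores, axis=1)[:, :topk]   # top-k classes per test image
```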

3.4 Variations

Exploiting priors

The label propagation can take into account several priors, depending on the assumptions of the problem, which are integrated by defining a normalization operator $\eta$ and by modifying the update equation as

    $L_{t+1} := \eta(W L_t)$                (1)

Multiclass assumption. For instance, in the ILSVRC challenge built upon the Imagenet dataset [33], there is only one label per image, therefore we can define $\eta$ as a function that $L_1$-normalizes each row of $L$ to provide a distribution over labels, with by convention no effect if the $L_1$ norm is 0.

Class frequency priors. Additionally, we point out that labels are evenly distributed in Imagenet. If we translate this setup to our semi-supervised setting, it means that we may assume that the distribution of the unlabelled images is uniform over labels. This assumption can be taken into account by defining $\eta$ as the function performing an $L_1$ normalization of the columns of $L$.

While one could argue that this is not realistic in general, a more realistic scenario is to consider that we know the marginal distribution of the labels, as proposed by Zhu et al. [41], who show that the prior can be simply enforced (i.e., apply a column-wise normalization to $L$ and multiply each column by the prior class probability). This arises in situations such as tag prediction, if we can empirically measure the relative probabilities of tags, possibly regularized for the lowest values.

Combined multiclass assumption and class frequency priors. We propose a variant that exploits both the multiclass setting and prior class probabilities by enforcing the matrix $L$ to jointly satisfy the following marginal properties:

    $L \mathbf{1}_C = \mathbf{1}_n \quad\text{and}\quad L^\top \mathbf{1}_n = n\,p,$                (2)

where $p$ is the prior distribution over labels and $\mathbf{1}_C$, $\mathbf{1}_n$ denote all-ones vectors. For this purpose, we adopt a strategy similar to that of Cuturi [9] in his work on optimal transport, in which he shows that the Sinkhorn-Knopp algorithm [35] provides an efficient and theoretically grounded way to project a matrix so that it satisfies such marginals. The Sinkhorn-Knopp algorithm alternately enforces the two marginal conditions, as

    $L \leftarrow \mathrm{diag}(L \mathbf{1}_C)^{-1} L, \qquad L \leftarrow n\, L\, \mathrm{diag}(p)\, \mathrm{diag}(L^\top \mathbf{1}_n)^{-1},$                (3)

until convergence. Here we assume that the algorithm only operates on rows and columns whose sum is strictly positive. As discussed by Knight [26], the convergence of this algorithm is fast, so we stop after 5 iterations. This projection is performed after each update of Eqn. (1). Note that Zhu et al. [41] solely considered the second constraint in Eqn. (2), which can be obtained by enforcing the prior, as discussed by Bengio et al. [3]. We evaluate both variants in Section 4.
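A sketch of this projection, assuming the prior $p$ sums to 1 and, as in the text, leaving all-zero rows and columns essentially untouched (the scaling of the column marginal is our reading of Eq. (2)):

```python
import numpy as np

def sinkhorn_projection(l, prior, n_iter=5, eps=1e-12):
    """Alternately enforce the marginals of Eq. (2): each non-zero row of L
    sums to 1, and the column sums follow the class prior."""
    l = l.copy()
    for _ in range(n_iter):
        row_sum = l.sum(axis=1, keepdims=True)
        l /= np.maximum(row_sum, eps)                 # rows: unit mass per image
        col_target = prior[np.newaxis, :] * l.sum()   # column masses follow the prior
        col_sum = l.sum(axis=0, keepdims=True)
        l *= col_target / np.maximum(col_sum, eps)
    return l
```

Keeping only the column step recovers the simpler prior enforcement of Zhu et al. [41] mentioned above.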

Non-linear updates.

Markov Clustering (MCL) [13] is another diffusion algorithm with non-linear updates, originally proposed for clustering. In contrast to the previous algorithm, MCL iterates directly over the similarity matrix as

    $W_{t+1} := \Gamma_r(W_t \times W_t),$                (4)

where $\Gamma_r$ is an element-wise raising of the matrix to the power $r$, followed by a column-wise normalization [13]. The power $r$ is a bandwidth parameter: when $r$ is high, small edges quickly vanish along the iterations, while a smaller $r$ preserves the edges longer. The clustering is performed by extracting connected components from the final matrix. In Section 4 we evaluate the role of the non-linear update of MCL by introducing the non-linearity $\Gamma_r$ into our diffusion procedure. More precisely, we modify Equation (1) as

    $L_{t+1} := \eta(\Gamma_r(W L_t)).$
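A sketch of the $\Gamma_r$ operator as we read Eq. (4), i.e., an element-wise power followed by a column-wise normalization (the MCL "inflation" step):

```python
import numpy as np

def gamma_r(m, r):
    """Element-wise power followed by column-wise L1 normalization;
    r = 1 leaves the linear update unchanged."""
    m = np.power(m, r)
    return m / np.maximum(m.sum(axis=0, keepdims=True), 1e-12)
```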

3.5 Complexity

For the complexity evaluation, we distinguish two stages. In the off-line stage, (i) the CNN is trained on the base classes, (ii) descriptors are extracted for the background images, and (iii) a kNN graph is computed for the background images. In the on-line stage, we receive training and test images from the novel classes; we (i) compute features for them, (ii) complement the kNN graph matrix to include the training and test images, and (iii) perform the diffusion iterations. Here we assume that the graph matrix is decomposed into four blocks,

    $W = \begin{pmatrix} W_{\mathrm{BB}} & W_{\mathrm{BL}} \\ W_{\mathrm{LB}} & W_{\mathrm{LL}} \end{pmatrix},$                (5)

where the subscripts B and L refer to the background images and the new labelled (and test) images, respectively. The largest block, $W_{\mathrm{BB}}$, is computed off-line. On-line, we compute the three other blocks. We combine $W_{\mathrm{BB}}$ and $W_{\mathrm{BL}}$ by merging similarity search result lists, hence each row of $W$ contains exactly $k$ non-zero elements. This requires storing the distances along with $W_{\mathrm{BB}}$.

We are mostly interested in the complexity of the on-line phase. Therefore we exclude the descriptor extraction, which is independent of the classification cost, and the cost of handling the test images, which is negligible compared to the training operations. We consider the logistic regression as a baseline for the complexity comparison:

Logistic regression

the training cost is of the order of $n_{\mathrm{iter}} \times b \times d \times C$ multiply-adds, where $d$ denotes the descriptor dimensionality, $C$ the number of classes, $n_{\mathrm{iter}}$ the number of iterations and $b$ the batch size.

Diffusion

the cost decomposes into: computing the matrices $W_{\mathrm{BL}}$, $W_{\mathrm{LB}}$ and $W_{\mathrm{LL}}$, which involves on the order of $n_{\mathrm{L}} \times n_{\mathrm{B}} \times d$ multiply-adds using brute-force distance computations; and performing the diffusion iterations of sparse-dense matrix multiplications, which incur on the order of $k \times n \times C$ multiply-adds per iteration (note that sparse matrix operations are limited more by irregular memory access patterns than by arithmetic). Therefore the diffusion cost is linear in the number of background images $n_{\mathrm{B}}$. See Appendix E for more details.

Memory usage.

One important bottleneck of the algorithm is its memory usage. The sparse matrix $W$ occupies of the order of $8kn$ bytes in RAM before symmetrization (one float value and one index per stored entry), and almost twice this amount afterwards, because most nearest neighbors are not reciprocal; the dense matrix $L$ is $4nC$ bytes. Fortunately, the iterations can be performed one column of $L$ at a time, reducing this to $4n$ bytes for $L_t$ and $L_{t+1}$ (in practice, when memory is an issue, we iterate on a slice of columns at a time).
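As a rough sanity check of these orders of magnitude (assuming 4-byte values and 4-byte indices per stored non-zero), the F100M figures reported in Section 4.4 can be reproduced as follows:

```python
# Graph matrix: 5.3 billion non-zeros are reported for F100M in Section 4.4.
nnz_w = 5.3e9
print(8 * nnz_w / 2**30)       # ~39.5 GiB for the sparse matrix W

# Dense iterate L: one float32 column over ~10^8 images.
n = 1e8
print(4 * n / 2**30)           # ~0.37 GiB per column: iterating on a slice of
                               # columns keeps L_t and L_{t+1} within a fixed budget
```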

4 Experiments

4.1 Datasets and evaluation protocol

We use Imagenet 2012 [11] and follow a recent setup [16] introduced for low-shot learning. The 1000 Imagenet classes are split randomly into two groups, each containing base and novel classes. Group 1 (193 base and 300 novel classes) is used for hyper-parameter tuning and group 2 (196+311 classes) for testing with fixed hyper-parameters. We assume the full Imagenet training data is available for the base classes, while for the novel classes only $n$ images per class are available for training. Similarly to [16], the subset of $n$ images is drawn randomly and the random selection is performed 5 times with different random seeds.

As a large source of unlabelled images, we use the YFCC100M dataset [36]. It consists of 99 million representative images from the Flickr photo sharing site (of the 100M original files, some are videos and some are not available anymore; we replace them with uniform white images). Note that some works have used this dataset with tags or GPS metadata as weak supervision [24].

Learning the image descriptors.

Similarly to [16], we train a 50-layer ResNet CNN [18] on all base classes (group 1 + group 2), to ensure that the descriptor computation has never seen any image of the novel classes. We run the CNN on all images and extract a 2048-dim vector from the 49th layer, just before the last fully connected layer. This descriptor is used directly as input for the logistic regression. For the diffusion, we PCA-reduce the feature vector to 256 dimensions and L2-normalize it, as is standard in prior works on unsupervised image matching with pre-learned image representations [2, 37].
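A sketch of this pre-processing with plain NumPy (whether the components are additionally whitened is not specified here; this version does not whiten, and the function name is ours):

```python
import numpy as np

def pca_reduce_l2(x_fit, x, d_out=256):
    """Estimate a PCA on x_fit (e.g., descriptors of base-class images),
    project x to d_out dimensions and L2-normalize the result."""
    mean = x_fit.mean(axis=0)
    _, _, vt = np.linalg.svd(x_fit - mean, full_matrices=False)
    z = (x - mean) @ vt[:d_out].T           # project onto the top components
    z /= np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    return z
```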

Performance measure and baseline

In a given group (1 or 2), we classify the Imagenet validation images from both the base and novel classes, and measure the top-5 accuracy; the class distribution is therefore heavily unbalanced. Since the seed images are drawn randomly, we repeat the random draws 5 times with different random seeds and average the obtained top-5 accuracy (the ±xx notation gives the standard deviation).

The baseline is a logistic regression applied on the labelled points. We employ a per-class image sampling strategy to circumvent the unbalanced number of examples per class. We optimize the learning rate, batch size and L2 regularization factor of the logistic regression on the group 1 images.

Background images for diffusion

We consider the following sets of background images:

  1. None: the diffusion is directly from the seed images to the test images;

  2. In-domain setting: the background images are the Imagenet training images from the novel classes, but without labels. This corresponds to a use case where all images are known to belong to a set of classes, but only a subset of them has been labelled;

  3. Out-of-domain setting: background images are taken from YFCC100M. We denote this setting by F100k, F1M, F10M or F100M, depending on the number of images we use (e.g., F1M corresponds to one million background images). This corresponds to a more challenging setting where we have no prior knowledge about the images used in the diffusion.

4.2 Parameters of diffusion

We compare a few settings of the diffusion algorithm that are of interest. In all cases, we fix the number of nearest neighbors $k$ and the number of annotated images per class for this parameter study. The nearest neighbors are computed with Faiss [23], using the IVFFlat index, which computes exact distances but may occasionally miss a few neighbors (see Appendix E for details).

Graph edge weighting.

We experimented with different edge weightings for $W$ that were proposed in the literature. We compared a constant weight, a Gaussian weighting [27, 3] of the form $\exp(-d^2/\sigma^2)$ (with $\sigma$ a hyper-parameter), and a weighting based on the “meaningful neighbors” theory [30].

Table 1 shows that the results are remarkably independent of the choice of edge weighting, which is why we use constant weights. The best normalization $\eta$ that can be applied to the iterate $L$ is a simple column-wise $L_1$ normalization. Thanks to the linear iteration formula, it can be applied once at the end of the iterations.

background                    none          F1M           imnet
edge weighting
  constant                    62.7 ±0.68    65.4 ±0.55    73.3 ±0.72
  Gaussian weighting*         62.7 ±0.66    65.4 ±0.58    73.6 ±0.71
  meaningful neighbors*       62.7 ±0.68    40.0 ±0.20    73.6 ±0.62
operator η
  none                        40.6 ±0.18    41.1 ±0.10    42.3 ±0.19
  Sinkhorn                    61.1 ±0.69    56.8 ±0.50    72.3 ±0.72
  column-wise                 62.7 ±0.68    65.4 ±0.55    73.3 ±0.72
  non-linear transform*       62.7 ±0.68    65.4 ±0.55    73.3 ±0.72
  class frequency prior*      62.7 ±0.66    65.4 ±0.60    73.3 ±0.65
Table 1: Variations on the weighting of edges and on the normalization steps applied to the iterates of $L$, with 5 runs on the group 1 validation images. Variants that require a parameter (e.g., the $\sigma$ of the Gaussian weighting) are indicated with a “*”; in this case we report only the best result, see Appendix B for full results. In the rest of the paper, we use the constant edge weighting and the column-wise normalization, since they are simple and do not add any parameter.

Class-specific weighting

In out-of-domain diffusion with many background images, the distribution over classes is not determined by the seed or test images, but predominantly by the background images. Therefore, the class prior applied during diffusion can be adjusted to match that of the background images.

To do this, we classify the YFCC100M images using a logistic regression classifier and use the resulting class distribution as a prior for per-class normalization. Since the estimated class distribution is relatively flat, we make it more peaky using a power transform that gives a trade-off with the uniform distribution. The single hyper-parameter (the power) is cross-validated.
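A sketch of this prior estimation; whether the class distribution is estimated from soft probabilities or hard decisions of the logistic regression is our assumption (soft, here), and the naming is ours:

```python
import numpy as np

def background_class_prior(probs, power):
    """probs: (n_background, C) class probabilities predicted by the logistic
    regression on the background images. Average them into a class
    distribution, then sharpen it with the cross-validated power
    (power = 0 gives the uniform distribution, 1 keeps the estimate as is)."""
    prior = probs.mean(axis=0)
    prior = prior ** power
    return prior / prior.sum()
```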

4.3 Large-scale diffusion

Figure 2 reports experiments varying the number of background images and the number of neighbors $k$. In Appendix C, we also show how fast $L$ “fills up” (it is dense after a few iterations). The maximum accuracy is also reached quickly; it occurs later when the set of background images is larger and when $k$ is smaller. The plot also shows that it is important to do early stopping.

The plot on the right reports the large-scale behavior of the diffusion. All the curves have an optimal point in terms of accuracy vs. computational cost at $k=30$. This may be an intrinsic property of the descriptor manifold. It is worth noting that before starting the diffusion iterations, with $k=1000$ and no background images (the best such setting), we obtain an accuracy of 60.5%. This amounts to a kNN classifier, and it is the fastest setting because the kNN graph does not need to be constructed. Another advantage is that it does not require storing the graph.

Figure 2: Classification performance. Left: accuracy as a function of the iteration. Right: various settings of $k$ and of the number of background images, ordered by total number of edges (average of 5 test runs, with a cross-validated number of iterations). Appendix E provides a complementary analysis.

4.4 Complexity: Runtime and memory

We measured the runtime of the diffusion process on a 48-core 2.5GHz machine:

background                 none      F1M       F10M      F100M
optimal iteration          2         3         4         5
timing: graph completion   2m57s     8m36s     40m41s    4h08m (23m on 8 GPUs)
timing: diffusion          4.4s      19s       3m44s     54m

The graph construction time is linear in the number of background images, thanks to the pre-computation of the graph matrix for the background images (see Section 3.5). For comparison, training the logistic regression takes between 2m27s and 12m, depending on the cross-validated parameters.

In terms of memory usage, the biggest F100M experiments need to simultaneously keep in RAM a sparse matrix $W$ with 5.3 billion non-zero values (39.5 GiB), as well as the iterates $L_t$ and $L_{t+1}$ (35.8 GiB, using slices of columns). This is the main drawback of using diffusion. However, Table 2 shows that restricting the diffusion to 10 million images already provides most of the gain, while reducing memory and computation by an order of magnitude.

4.5 Comparison with baseline classifiers

We compare the performance of diffusion against some baseline classifiers, see Table 2. For low-shot learning (small $n$), the in-domain diffusion outperforms the other methods by a large margin. As stated in Section 3.2, we do not include the test points in the diffusion, which is standard for a classification setting. However, if we allow this, as in a fully transductive setting, we obtain a top-5 accuracy of 67.9% ±0.76 with $n=2$ and diffusion over F1M, i.e., on par with diffusion over F100M.

In the following, we comment on the out-of-domain setting. Our diffusion outperforms logistic regression only for $n \le 2$.

Classifier combination.

We experimented with a very simple late fusion: to combine the scores of the two classifiers, we take a weighted average of their predictions (log-probabilities) and cross-validate the weight factor. Table 2 shows that the result is significantly above the best of the two input classifiers. This shows that the logistic regression classifier and the diffusion classifier access different aspects of the image collection. We also experimented with more complicated combination methods, such as using the graph edges as a regularizer during the logistic regression, which did not improve this result.

This combination outperforms the state-of-the-art result of [16] (which, itself, outperforms or is closely competitive with [38, 39] in this setting). This is remarkable because their method is a combination of a specific loss and a learned data augmentation procedure that is specifically tailored to the experimental setup with base and novel classes. In contrast, our diffusion procedure is generic and has only three parameters, among them the number of neighbors $k$ and the number of iterations.

        out-of-domain diffusion                            in-domain    logistic      combined                   [16]
n       none        F1M         F10M        F100M          Imagenet     regression    +F10M        +F100M
1       57.1 ±0.53  60.0 ±0.52  61.4 ±0.68  62.3 ±0.59     68.0 ±0.64   57.3 ±0.51    62.0 ±0.75   62.6 ±0.63    60.6
2       62.5 ±0.50  65.5 ±0.50  66.8 ±0.44  67.8 ±0.62     73.2 ±0.51   66.0 ±0.59    68.7 ±0.43   69.2 ±0.60    68.9
5       68.4 ±0.38  70.6 ±0.31  71.9 ±0.48  73.1 ±0.61     77.8 ±0.35   76.4 ±0.26    76.9 ±0.23   77.4 ±0.27    77.3
10      72.7 ±0.16  74.2 ±0.30  75.3 ±0.05  76.2 ±0.15     80.1 ±0.14   80.9 ±0.21    81.3 ±0.17   81.5 ±0.18    80.6
20      76.0 ±0.28  77.0 ±0.21  77.5 ±0.17  78.6 ±0.15     81.4 ±0.10   83.7 ±0.15    83.9 ±0.15   84.1 ±0.09    82.5
Table 2: Comparison of classifiers for different values of $n$, the number of labelled images per novel class. For the out-of-domain setup, the “none” column indicates that the diffusion solely relies on the labelled images. The in-domain diffusion corresponds to the “Imagenet” column. The results of the right-most column [16] are, to our knowledge, state of the art on this benchmark, generally outperforming the results of matching networks and model regression [38, 39] in this setting.

5 Conclusion

We experimented with large-scale label propagation for low-shot learning. Unsurprisingly, we have found that performing diffusion over images from the same domain works much better than over images from a different domain. We clearly observe that, as the number of images over which we diffuse grows, the accuracy steadily improves. The main performance factor is the total number of edges, which also reasonably reflects the complexity. We also report neutral results for the more sophisticated variants; for instance, we show that edge weights are not useful. Furthermore, labeled images should be included in the diffusion process and not just used as sources, i.e., not forced to keep their label.

The main outcome of our study is to show that diffusion over a large image set is superior to state-of-the-art methods for low-shot learning when very few labels are available. Interestingly, late fusion with a standard classifier’s result is effective, which shows the complementarity of the approaches.

References

  • [1] Y. Avrithis, Y. Kalantidis, E. Anagnostopoulos, and I. Z. Emiris. Web-scale image clustering revisited. In ICCV, 2015.
  • [2] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, September 2014.
  • [3] Y. Bengio, O. Delalleau, and N. L. Roux. Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, chapter 11, pages 195–216. MIT Press, Boston, 2006.
  • [4] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet. Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In NIPS, pages 177–184, 2003.
  • [5] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
  • [6] B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Technical report, Arxiv, 2008.
  • [7] T. Chin, L. Wang, K. Schindler, and D. Suter. Extrapolating learned manifolds for human activity recognition. In ICIP, pages 381–384, 2007.
  • [8] M. Cho and K. M. Lee. Mode-seeking on graphs via random walks. In CVPR, June 2012.
  • [9] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
  • [10] W. Dong, M. Charikar, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, March 2011.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, June 2009.
  • [12] M. Donoser and H. Bischof. Diffusion processes for retrieval revisited. In CVPR, pages 1320–1327, 2013.
  • [13] A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic acids research, 30(7), 2002.
  • [14] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, pages 522–530, 2009.
  • [15] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins University Press, 2013.
  • [16] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. arXiv preprint arXiv:1606.02819, 2016.
  • [17] B. Harwood and T. Drummond. Fanng: Fast approximate nearest neighbour graphs. In CVPR, 2016.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • [19] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
  • [20] A. Iscen, Y. Avrithis, G. Tolias, T. Furon, and O. Chum. Fast spectral ranking for similarity search. In CVPR, June 2017.
  • [21] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. In CVPR, June 2017.
  • [22] A. Jansen, G. Sell, and V. Lyzinski. Scalable out-of-sample extension of graph embeddings using deep neural networks. CoRR, abs/1508.04422, 2015.
  • [23] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
  • [24] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
  • [25] Y. Kalantidis, L. Kennedy, H. Nguyen, C. Mellina, and D. A. Shamma. Loh and behold: Web-scale visual search, recommendation and clustering using locally optimized hashing. arXiv preprint arXiv:1604.06480, 2016.
  • [26] P. A. Knight. The Sinkhorn-Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.
  • [27] F. Lin and W. W. Cohen. Power iteration clustering. In ICML, 2010.
  • [28] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, December 2012.
  • [29] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
  • [30] D. Omercevic, O. Drbohlav, and A. Leonardis. High-dimensional feature matching: employing the concept of meaningful nearest neighbors. In ICCV, October 2007.
  • [31] J. Philbin and A. Zisserman. Object mining using a matching graph on very large image collections. In Computer Vision, Graphics & Image Processing, 2008.
  • [32] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, April 2017.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [35] R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967.
  • [36] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
  • [37] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of CNN activations. In ICLR, 2016.
  • [38] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
  • [39] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
  • [40] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, volume 16, pages 321–328, 2003.
  • [41] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
  • [42] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.

Appendices

We present a few additional results and details to complement the paper. Section A reports another evaluation protocol that restricts the evaluation to novel classes. Sections B and D are parametric evaluations, Section C comments on the speed of the diffusion process, and Section E gives some details about the graph computation.

Appendix A Evaluation results on novel classes

In the paper, we evaluated the classification performance on all the test images from group 2. The performance restricted to the novel classes only is also reported in prior work [16]. Table 3 shows the results in this setting.

        out-of-domain diffusion                            in-domain    logistic      combined                   [16]
n       none        F1M         F10M        F100M          Imagenet     regression    +F10M        +F100M
1       38.3 ±1.08  43.0 ±1.02  45.4 ±1.19  46.9 ±1.00     56.3 ±1.06   38.7 ±0.73    45.3 ±1.21   46.6 ±0.97    40.9
2       47.0 ±0.81  51.9 ±0.76  54.1 ±0.60  55.7 ±0.82     64.7 ±0.74   50.8 ±0.88    55.5 ±0.72   56.2 ±0.97    55.7
5       56.6 ±0.51  60.1 ±0.46  62.4 ±0.77  64.0 ±0.84     71.7 ±0.50   67.7 ±0.36    68.3 ±0.39   69.1 ±0.45    69.7
10      63.4 ±0.31  65.9 ±0.48  67.6 ±0.14  69.1 ±0.27     75.3 ±0.19   75.4 ±0.31    75.8 ±0.29   76.0 ±0.34    75.9
20      69.1 ±0.43  70.4 ±0.36  71.4 ±0.23  72.6 ±0.31     77.2 ±0.13   80.0 ±0.26    80.1 ±0.20   80.4 ±0.14    79.3
Table 3: Comparison of classifiers for different values of $n$, evaluating only on the novel classes.

As expected, the results are lower than in the setup where all test images are classified, because novel classes are harder to classify than base classes. Otherwise, the ordering of the methods is preserved and the conclusions are identical. The diffusion is effective in the low-shot regime and, by itself, better than the state of the art by a large margin when only one example is available. The combination with late fusion significantly outperforms the state of the art in the out-of-domain setup.

Appendix B Details of the parametric evaluation

In the paper we reported results for the edge weighting and graph normalization with the best parameter setting. Here, we report results for all parameters. We evaluate the following edge weightings (Figure 3, first row):

  • Gaussian weighting. The edge weight is $\exp(-d^2/\sigma^2)$, with $d$ the distance between the edge nodes. Note that $\sigma \to \infty$ corresponds to a constant weighting;

  • Weighting based on the “meaningful neighbors” proposal [30]. It relies on an exponential fit of the neighbor distances. For a given graph node, the weight of its $i$-th neighbor is a decreasing exponential of its distance, after the distances of the result list have been remapped linearly so that the first neighbor maps to 0 and the $k$-th neighbor to 1. We vary the decay parameter in the plot.

For normalizations of the matrix, we compare (Figure 3, second row):

  • the non-linear normalization: all elements of the iterate $L$ are raised to a power $r$. We vary the parameter $r$; $r=1$ corresponds to the identity transform;

  • we classify all images in the graph with a logistic regression classifier. We use the predicted frequency of each class over the whole graph and raise it to some power (the parameter) to reduce or increase its peakiness. This gives a per-class normalization factor that we enforce for each column of $L$, instead of the default uniform distribution.

The conclusion of these experiments is that these variants do not improve over constant weights and a standard diffusion, most of them having a neutral effect. Therefore, we conclude that the diffusion process mostly depends on the topology of the graph.

Figure 3: Evaluation of edge weightings (top row: Gaussian weighting, meaningful-neighbors model) and matrix normalizations (bottom row: non-linear normalization, normalization with class weights) used in the diffusion. The evaluation is averaged over 5 runs on the validation set (group 1), and we select the best iteration.

Appendix C Analysis of the diffusion process

In the paper we analyze the performance attained along the iterations. In this section we complement this analysis with the rate of nodes reached by the diffusion process: we consider very large graphs, few seeds and a relatively small graph degree. While the graph is not necessarily fully connected, we observe that most images can be reached by all labels in practice. Figure 4 measures the sparsity of the matrix $L$ (on one run of validation), which indicates the rate of (label, image) tuples that have not been reached by the diffusion process at each diffusion step.

Figure 4: Rate of non-zero elements in the matrix $L$, for a varying graph degree $k$.

The number of nodes reached by all labels grows rapidly and, for sufficiently large degrees $k$, converges to a value close to 1 in only a few iterations. We have generally observed that the iteration at which the fill rate of the matrix gets close to 1 is similar to the iteration at which the accuracy is maximal, as selected by cross-validation.

Appendix D Late fusion weights

If $p_{\mathrm d}$ and $p_{\mathrm l}$ are the distributions over classes found by the diffusion and the logistic regression for a given image, the top-5 prediction is given by the 5 largest values of

    $(1-\beta)\,\log p_{\mathrm d} + \beta\,\log p_{\mathrm l},$                (6)

where $\beta$ is the optimal coefficient for a given number of seed images per class, found by cross-validation.

Figure 5 shows the optimal mixing factors. Since the logistic regression is better at classifying with many training examples, the weight $\beta$ increases with $n$.

Figure 5: Performance as a function of the late fusion weight on the validation (group 1) images, averaged over 5 runs. A weight of 0 is pure diffusion, 1 is pure logistic regression. The combination factors selected for the test are indicated with circles.

Appendix E Computation of the blocks

As stated in the paper, we need to compute the 4 blocks of the matrix $W$:

    $W = \begin{pmatrix} W_{\mathrm{BB}} & W_{\mathrm{BL}} \\ W_{\mathrm{LB}} & W_{\mathrm{LL}} \end{pmatrix},$                (7)

where usually $n_{\mathrm B} \gg n_{\mathrm L}$. Each block requires a $k$-nearest-neighbor search, performed using the Faiss library (http://github.com/facebookresearch/faiss), which is well optimized for this task [23]:

  • For $W_{\mathrm{BB}}$ we use a Faiss “IVFFlat” index, with a relatively high nprobe setting (256). This guarantees that most of the actual neighbors are retrieved. With the recommended settings of Faiss, the cost of one search is proportional to $\sqrt{n_{\mathrm B}}$, so the total cost is in $n_{\mathrm B}^{3/2}$. This is super-linear in $n_{\mathrm B}$, but it is done off-line;

  • for $W_{\mathrm{LB}}$ we can re-use the same index, this time with only $n_{\mathrm L}$ similarity search operations;

  • for $W_{\mathrm{BL}}$ we need an index on the seed image descriptors. We found that in practice, constructing an index on these images is at best 1.4× faster than brute-force search. Therefore, we use brute-force search in this case, which costs $n_{\mathrm B} \times n_{\mathrm L} \times d$ multiply-adds;

  • the computation of $W_{\mathrm{LL}}$ has negligible complexity.

The fusion of the result lists of $W_{\mathrm{BB}}$ and $W_{\mathrm{BL}}$ to get $k$ results per row of $W$ is done in a single pass and has negligible cost. Therefore the dominant on-line complexity is the $n_{\mathrm B} \times n_{\mathrm L} \times d$ brute-force computation of $W_{\mathrm{BL}}$. A typical breakdown of the timings for F100M is (in seconds):

Timings (s)
  on CPU:     65+2783   12003    32
  on 8 GPUs:  25929     732+64   533   1

For $W_{\mathrm{BB}}$ we decompose the timing into: loading of the precomputed IVFFlat index (and moving it to GPU if appropriate) and the actual computation of the neighbors.
