Low-shot learning with large-scale diffusion
Abstract
This paper considers the problem of inferring image labels for which only a few labelled examples are available at training time. This setup is often referred to as low-shot learning in the literature, where a standard approach is to retrain the last few layers of a convolutional neural network learned on separate classes. We consider a semi-supervised setting in which we exploit a large collection of images to support label propagation. This is made possible by leveraging recent advances on large-scale similarity graph construction.
We show that despite its conceptual simplicity, scaling label propagation up to hundreds of millions of images leads to state-of-the-art accuracy in the low-shot learning regime.
Matthijs Douze Arthur Szlam Bharath Hariharan Hervé Jégou
1 Introduction
Large, diverse collections of images are now commonplace; these often contain a “long tail” of visual concepts. Some concepts like “person” or “cat” appear in many images, but the vast majority of visual classes do not occur frequently. Even though the total number of images may be large, it is hard to collect enough labeled data for most of the visual concepts. Thus, if we want to learn them, we must do so with few labeled examples. This task is called low-shot learning in the literature.
In order to learn new classes with little supervision, a standard approach is to leverage classifiers already learned for the most frequent classes, employing a so-called transfer learning strategy. For instance, for new classes with few labels, only the last few layers of a convolutional neural network are retrained. This limits the number of parameters that need to be learned and reduces overfitting.
In this paper, we consider the low-shot learning problem described above, where the goal is to learn to detect new visual classes with only a few annotated images per class, but we also assume that we have many unlabelled images. This is called semi-supervised learning [42, 40] (considered, e.g., for face annotation [14]). The motivation of this work is threefold. First, we want to show that with modern computational tools, classical semi-supervised learning methods scale gracefully to hundreds of millions of unlabeled points. A limiting factor in previous evaluations was the cost of constructing the similarity graph supporting the diffusion. This is no longer a bottleneck: thanks to advances in both computing architectures and algorithms, one can routinely compute the similarity graph for 100 million images in a few hours [23]. Second, we want to answer the question: does exploiting a very large number of images help for semi-supervised learning? Finally, by comparing the results of these methods on Imagenet and the YFCC100M dataset [36], we highlight some artificial aspects of Imagenet that can influence the performance of low-shot learning algorithms.
In summary, the contribution of our paper is a study of semi-supervised learning in the scenario where we have a very large number of unlabeled images. Our main result is that in this setting, semi-supervised learning leads to state-of-the-art low-shot learning performance. In more detail, we make the following contributions. We carry out a large-scale evaluation of diffusion methods for semi-supervised learning and compare it to recent low-shot learning papers. Our experiments are all carried out on the public Imagenet benchmark [11] and the YFCC100M dataset [36]. We show that our approach is efficient and that the diffusion process scales up to hundreds of millions of images, which is order(s) of magnitude larger than anything we are aware of in the literature on image-based diffusion [21, 20]. This is made possible by leveraging the recent state of the art in efficient k-nearest neighbor graph construction [23]. We evaluate several variants and hypotheses involved in diffusion methods, such as exploiting class frequency priors [41], a scenario that is realistic in situations where this statistic is known a priori. We propose a simple way to estimate and exploit this prior without such knowledge, and extend this assumption to a multiclass setting by introducing a probabilistic projection step derived from the Sinkhorn-Knopp algorithm. Our experimental study shows that a simple propagation process, when carried out at a large scale with many unlabelled images, significantly outperforms some state-of-the-art approaches in low-shot visual learning when (i) the number of annotated images per class is small and (ii) the number of unlabeled images is large.
2 Related work
Low-shot learning
Recently there has been renewed interest in low-shot learning, i.e., learning with few examples by exploiting prior learning on other classes. Such works include metric learning [28], learning kNN [38], regularization and feature hallucination [16], and predicting the parameters of the network [5]. Ravi and Larochelle introduce a meta-learner to learn the optimization parameters involved in the low-shot learning regime [32]. Most of these works consider small datasets like Omniglot, CIFAR, or a small subset of Imagenet. In this paper we focus solely on large datasets, in particular the Imagenet collection [33] associated with the ILSVRC challenge.
Diffusion methods
We refer the reader to [3, 12] for a review of diffusion processes and matrix normalization options. Such methods are an efficient way of clustering images given a matrix of input similarity, or a kNN graph, and have been successfully used in a semisupervised discovery setup [14]. They share some connections with spectral clustering [6]. In [31], a kNN graph is clustered with spectral clustering, which amounts to computing the eigenvectors associated with the largest eigenvalues of the graph, and clustering these eigenvectors. Since the eigenvalues are obtained via Lanczos iterations [15, Chapter 10], the basic operation is similar to a diffusion process. This is also related to power iteration clustering [27], as in the work of Cho et al. [8] to find clusters.
Semi-supervised learning
The kNN graph can be used for transductive and semi-supervised learning (see e.g. [3, 42] for an introduction). In transductive learning, a relatively small number of labels are used to augment a large set of unlabeled data, and the goal is to extend the labeling to the unlabeled data (which is given at train time). Semi-supervised learning is similar, except there may be a separate set of test points that are not seen at train time. In our work, we consider the simple proposal of Zhu et al. [41], where powers of the (normalized) kNN graph are used to find smooth functions on the kNN graph with desired values at the labeled points. There are many variations on this algorithm; e.g., Zhou et al. [40] weight the edges based on distances and introduce a loss trading off a classification fitting constraint and a smoothness term enforcing consistency of neighboring nodes.
Label propagation is a transductive method. In order to evaluate on new data, i.e., new classes, we need to extend the smooth functions out of the training data. A standard method is to use a weighted sum of nearest neighbors from the training data [4]. Here we instead use a deep convolutional network trained similarly to the one used to build the features. Deep networks have been used before for out-of-sample extension, e.g., unsuccessfully in [7] and successfully in [22], in the speech domain. Furthermore, in addition to regressing the actual predicted labels from the label propagation, we also show results when regressing the distributions, similar to [19].
Efficient kNN-graph construction
Diffusion methods take as input a matrix containing the similarity between all images of the dataset. For a large number N of images, e.g., N = 10^8, it is not possible to store a dense matrix of size N × N. However, most image pairs are not related and have a similarity close to 0, so diffusion methods are usually implemented with sparse matrices. This means that we compute a graph connecting each image to its neighbors, as determined by the similarity metric between image representations. In particular, we consider the k-nearest neighbor graph (kNN-graph) over a set of vectors. Several approximate algorithms [10, 25, 1, 17] have been proposed to efficiently produce the kNN-graph used as input of iterative/diffusion methods, since an exact construction has quadratic complexity in the number of images. In this paper, we employ the Faiss library, which was shown capable of constructing a graph connecting up to 1 billion vectors [23].
3 Propagating labels
This section describes the initial stage of our proposal, which estimates the class of the unlabelled images with a diffusion process. It includes an image description step, the construction of a kNN graph connecting similar images, and a label diffusion algorithm.
3.1 Image description
A meaningful semantic image representation and an associated metric are required to match instances of classes that have not been seen beforehand. While early works on semi-supervised labelling [14] used ad-hoc semantic global descriptors like GIST [29], classification performance has substantially improved in recent years with deep CNN architectures [34, 18], which are a compelling choice for our purpose. Therefore, for the image description, we extract activation maps from a CNN trained on base classes that are independent from the novel classes on which the evaluation is performed. See the experimental section for more details about the training process for the descriptors.
The mean class classifier introduced for low-shot learning [28] is another way to perform dimensionality reduction while improving accuracy thanks to a better comparison metric. We do not consider this approach since it can be seen as part of the descriptor learning.
3.2 Affinity matrix: approximate kNN graph
As discussed in the related work, most diffusion processes use as input the kNN graph representing the sparse similarity matrix, denoted by W, which connects the images of the collection. We build this graph using approximate k-nearest neighbor search. Thanks to recent advances in efficient similarity search [10, 23], trading some accuracy against efficiency drastically improves the graph construction time. As an example, with the Faiss library [23], building the graph associated with 600k images takes 2 minutes on 1 GPU. From preliminary experiments, the approximation in the kNN-graph construction does not induce any suboptimality, possibly because the diffusion process compensates for the artifacts induced by the approximation.
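To make the neighbor-list computation concrete, here is a minimal brute-force sketch in plain numpy, standing in for Faiss's approximate search (with IVFFlat the lists would be identical up to an occasional missed neighbor; the function name is ours):

```python
import numpy as np

def knn_lists(X, k):
    """Exact k-nearest-neighbor lists by brute force over descriptors X
    of shape (N, d); Faiss computes (approximately) the same lists faster."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared L2
    np.fill_diagonal(d2, np.inf)                         # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]                 # (N, k) neighbor ids
```

For N in the millions this O(N²d) computation is exactly what the approximate indexes avoid.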
Different strategies exist to set the weights of the affinity matrix. We choose to search the k nearest neighbors of each image and set a 1 for each neighbor in the corresponding row of a sparse matrix A. We then symmetrize the matrix by adding it to its transpose. We subsequently normalize the rows to produce a sparse stochastic matrix: W = D^{-1}(A + A^⊤), with D the diagonal matrix of row sums.
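The construction above can be sketched as follows (a minimal scipy version with constant edge weight 1; `knn_transition_matrix` is our name for the helper):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def knn_transition_matrix(neighbors):
    """Build W = D^{-1}(A + A^T) from an (N, k) array of neighbor indices:
    weight-1 edges, symmetrization, then row normalization."""
    n, k = neighbors.shape
    rows = np.repeat(np.arange(n), k)
    a = csr_matrix((np.ones(n * k), (rows, neighbors.ravel())), shape=(n, n))
    a = a + a.T                                    # symmetrize: A + A^T
    inv_rowsum = 1.0 / np.asarray(a.sum(axis=1)).ravel()
    return diags(inv_rowsum) @ a                   # rows now sum to 1
```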
Test points are handled differently: they do not participate in the label propagation because we classify each of them independently of the others. Therefore, test points have no outgoing edges; they only receive incoming edges from their nearest neighbors.
3.3 Label propagation
We now give details about the diffusion process itself, which is summarized in Figure 1. We build on the straightforward label propagation algorithm of [41]. The set of images on which we perform diffusion is composed of n_L labelled seed images and n_B unlabelled background images (N = n_L + n_B). Define the N × C matrix L, where C is the number of classes for which we want to diffuse the labels, i.e., the new classes not seen in the training set. Each row of L is associated with a given image and represents the probabilities of each class for that image. A given column corresponds to a given class and gives its probabilities for each image. The method initializes each seed row of L to a one-hot vector. Background images are initialized with 0 probabilities for all classes. Diffusing from the known labels, the method iterates as L_{t+1} = W L_t.
We can optionally reset the rows corresponding to seeds to their one-hot ground-truth. When iterating to convergence, all columns of L would eventually converge to the eigenvector of W with the largest eigenvalue (when not resetting), or to the harmonic function with respect to W with boundary conditions given by the seeds (when resetting). Empirically, for low-shot learning, we observe that resetting has a negative impact on accuracy. Also, early stopping performs better in both cases, so we cross-validate the number of diffusion iterations.
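The plain propagation (no reset, early stopping via a cross-validated iteration count) can be sketched as follows, assuming a precomputed row-stochastic W; the function name is ours:

```python
import numpy as np

def propagate_labels(W, seed_labels, n_classes, n_iter):
    """Iterate L <- W L from one-hot seed rows; background rows start at 0.
    seed_labels maps image index -> class index; n_iter is cross-validated."""
    L = np.zeros((W.shape[0], n_classes))
    for i, c in seed_labels.items():
        L[i, c] = 1.0                       # one-hot initialization of seeds
    for _ in range(n_iter):                 # early stopping, no reset
        L = W @ L
    return L
```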
Classification decision & combination with logistic regression
We predict the class of a test example as the column that maximizes the corresponding score in L. Similar to Zhou et al. [40], we have also optimized a loss balancing the fitting constraint with the diffusion smoothing term. However, we found that a simple late fusion (a weighted mean of log-probabilities, parametrized by a single cross-validated coefficient) of the scores produced by diffusion and logistic regression achieves better results.
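The late fusion is just a convex combination in log-probability space (α is the single cross-validated coefficient; the function name is ours):

```python
import numpy as np

def fuse_and_predict(log_p_diffusion, log_p_logreg, alpha):
    """Weighted mean of the two classifiers' log-probabilities, then
    pick the highest-scoring class for each test image."""
    fused = alpha * log_p_diffusion + (1.0 - alpha) * log_p_logreg
    return fused.argmax(axis=1)
```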
3.4 Variations
Exploiting priors
The label propagation can take into account several priors, depending on the assumptions of the problem, which are integrated by defining a normalization operator η and modifying the update equation as
L_{t+1} = η(W L_t)    (1)
Multiclass assumption. For instance, in the ILSVRC challenge built upon the Imagenet dataset [33], there is only one label per image, so we can define η as a function that normalizes each row of L to provide a distribution over labels, with the convention that rows with norm 0 are left unchanged.
Class frequency priors. Additionally, we point out that labels are evenly distributed in Imagenet. Translated to our semi-supervised setting, this means we may assume that the distribution of the unlabelled images is uniform over labels. This assumption can be taken into account by defining η as the function performing a normalization of the columns of L.
While one could argue that this is not realistic in general, a more plausible scenario is to assume that we know the marginal distribution of the labels, as proposed by Zhu et al. [41], who show that this prior can be simply enforced (i.e., apply column-wise normalization to L and multiply each column by the prior class probability). This arises in situations such as tag prediction, where we can empirically measure the relative probabilities of tags, possibly regularized for the rarest values.
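The enforcement step of Zhu et al. can be sketched as follows (column-wise normalization followed by scaling with the prior; `apply_class_prior` is our name):

```python
import numpy as np

def apply_class_prior(L, prior):
    """Column-normalize L, then multiply column j by the prior p_j,
    so that the mass of each class matches its known frequency."""
    colsum = L.sum(axis=0, keepdims=True)
    colsum[colsum == 0] = 1.0            # leave all-zero columns untouched
    return (L / colsum) * prior
```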
Combined multiclass assumption and class frequency priors. We propose a variant that exploits both the multiclass setting and prior class probabilities by enforcing the matrix L to jointly satisfy the following marginal properties:
∀i: Σ_j L_{ij} = 1   and   ∀j: Σ_i L_{ij} = N p_j    (2)
where p = (p_1, …, p_C) is the prior distribution over labels. For this purpose, we adopt a strategy similar to that of Cuturi [9] in his work on optimal transport, in which he shows that the Sinkhorn-Knopp algorithm [35] provides an efficient and theoretically grounded way to project a matrix so that it satisfies such marginals. The Sinkhorn-Knopp algorithm proceeds by alternately enforcing the marginal conditions, as
L_{ij} ← L_{ij} / Σ_{j'} L_{ij'},   then   L_{ij} ← N p_j L_{ij} / Σ_{i'} L_{i'j}    (3)
until convergence. Here we assume that the algorithm only operates on rows and columns whose sum is strictly positive. As discussed by Knight [26], the convergence of this algorithm is fast, so we stop after 5 iterations. This projection is performed after each update of Eqn. 1. Note that Zhu et al. [41] solely considered the second constraint in Eqn. 2, which can be obtained by enforcing the prior, as discussed by Bengio et al. [3]. We evaluate both variants in Section 4.
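A dense sketch of this projection (rows and columns with zero sum are left at zero, as assumed above; 5 iterations as in the text; the function name is ours):

```python
import numpy as np

def sinkhorn_project(L, prior, n_iter=5):
    """Alternately enforce row sums = 1 and column sums = N * p_j
    on a non-negative matrix L (Sinkhorn-Knopp)."""
    n = L.shape[0]
    for _ in range(n_iter):
        r = L.sum(axis=1, keepdims=True)
        L = np.divide(L, r, out=np.zeros_like(L), where=r > 0)
        c = L.sum(axis=0, keepdims=True)
        L = np.divide(L * n * prior, c, out=np.zeros_like(L), where=c > 0)
    return L
```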
Nonlinear updates.
Markov Clustering (MCL) [13] is another diffusion algorithm with nonlinear updates, originally proposed for clustering. In contrast to the previous algorithm, MCL iterates directly over the similarity matrix as
A_{t+1} = Γ_r(A_t × A_t)    (4)
where Γ_r is an element-wise raising of the matrix to the power r, followed by a column-wise normalization [13]. The power r is a bandwidth parameter: when r is high, small edges quickly vanish along the iterations, while a smaller r preserves the edges longer. The clustering is performed by extracting connected components from the final matrix. In Section 4 we evaluate the role of the nonlinear update of MCL by introducing the nonlinearity in our diffusion procedure. More precisely, we modify Equation 1 as L_{t+1} = η(Γ_r(W L_t)).
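The Γ_r operator can be sketched as follows (element-wise power followed by column-wise L1 normalization; the helper name is ours):

```python
import numpy as np

def gamma_r(M, r):
    """MCL inflation: raise entries to the power r, then normalize each
    column to sum to 1. Large r makes small edges vanish faster."""
    M = np.power(M, r)
    colsum = M.sum(axis=0, keepdims=True)
    colsum[colsum == 0] = 1.0
    return M / colsum
```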
3.5 Complexity
For the complexity evaluation, we distinguish two stages. In the offline stage, (i) the CNN is trained on the base classes, (ii) descriptors are extracted for the background images, and (iii) a kNN-graph is computed for the background images. In the online stage, we receive training and test images from the novel classes, and we (i) compute features for them, (ii) complement the kNN-graph matrix to include the training and test images, and (iii) perform the diffusion iterations. Here we assume that the graph matrix is decomposed into four blocks
W = [ W_bb  W_bn ; W_nb  W_nn ]    (5)
where b indexes the background images and n the novel (training and test) images. The largest block, W_bb, is computed offline. Online, we compute the three other blocks. We combine W_nb and W_nn by merging similarity search result lists, hence each row of W contains exactly k non-zero elements. This requires storing the distances along with W_bb.
We are mostly interested in the complexity of the online phase. Therefore we exclude the descriptor extraction, which is independent of the classification cost, and the cost of handling the test images, which is negligible compared to the training operations. We consider the logistic regression as a baseline for the complexity comparison:
 Logistic regression

the training cost is O(n_iter × n_batch × d × C) multiply-adds, where d denotes the descriptor dimensionality and C the number of classes; the number of iterations n_iter and the batch size n_batch are set by cross-validation.
 Diffusion

the cost decomposes into: computing the matrices W_bn, W_nb and W_nn, which involves O(n_n × n_B × d) multiply-adds using brute-force distance computations (n_n being the number of novel training and test images); and performing n_iter iterations of sparse-dense matrix multiplications, which incur O(n_iter × k × N × C) multiply-adds (note that sparse matrix operations are limited more by irregular memory access patterns than by arithmetic). Therefore, the diffusion cost is linear in the number of background images n_B. See Appendix E for more details.
Memory usage.
One important bottleneck of the algorithm is its memory usage. The sparse matrix W stores k entries per row, i.e., on the order of k × N index-value pairs in RAM, and almost twice this amount because most nearest neighbors are not reciprocal; the matrix L takes 4 × N × C bytes in float32. Fortunately, the iterations can be performed one column of L at a time, reducing the footprint for L_t and L_{t+1} accordingly (in practice, when memory is an issue, we iterate on a few columns at a time).
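The column-sliced iteration can be sketched as follows (for simplicity everything still lives in RAM here; in the large-scale setting only the current slice of L_t and L_{t+1} would be resident, which this sketch does not do):

```python
import numpy as np

def diffuse_by_column_slices(W, L, n_iter, width):
    """Run L <- W L while touching only `width` columns of L at a time,
    so the dense working set is (N, width) instead of (N, C)."""
    n, c = L.shape
    for _ in range(n_iter):
        nxt = np.empty_like(L)
        for j in range(0, c, width):
            nxt[:, j:j + width] = W @ L[:, j:j + width]
        L = nxt
    return L
```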
4 Experiments
4.1 Datasets and evaluation protocol
We use Imagenet 2012 [11] and follow a recent setup [16] previously introduced for low-shot learning. The 1000 Imagenet classes are split randomly into two groups, each containing base and novel classes. Group 1 (193 base and 300 novel classes) is used for hyperparameter tuning and group 2 (196+311 classes) for testing with fixed hyperparameters. We assume the full Imagenet training data is available for the base classes, whereas for the novel classes only n images per class are available for training. As in [16], the subset of n images is drawn randomly, and the random selection is performed 5 times with different random seeds.
As a large source of unlabelled images, we use the YFCC100M dataset [36]. It consists of 99 million representative images from the Flickr photo-sharing site (of the 100M original files, some are videos and some are no longer available; we replace them with uniform white images). Note that some works have used this dataset with tags or GPS metadata as weak supervision [24].
Learning the image descriptors.
Similarly to [16], we train a 50-layer Resnet CNN [18] on all base classes (group 1 + group 2), to ensure that the descriptor computation has never seen any image of the novel classes. We run the CNN on all images and extract a 2048-dim vector from the 49th layer, just before the last fully connected layer. This descriptor is used directly as input for the logistic regression. For the diffusion, we PCA-reduce the feature vector to 256 dimensions and L2-normalize it, as is standard in prior works on unsupervised image matching with pre-learned image representations [2, 37].
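The PCA + L2 step can be sketched as follows (plain SVD-based PCA; `pca_reduce_l2` is our name, with d_out = 256 in the paper):

```python
import numpy as np

def pca_reduce_l2(X, d_out):
    """Center X (N, d), project on the d_out leading principal axes,
    then L2-normalize each reduced descriptor."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:d_out].T
    norms = np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return Y / norms
```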
Performance measure and baseline
In a given group (1 or 2), we classify the Imagenet validation images from both the base and novel classes, and measure the top-5 accuracy; the class distribution is therefore heavily unbalanced. Since the seed images are drawn randomly, we repeat the random draws 5 times with different random seeds and average the obtained top-5 accuracy (the ± notation gives the standard deviation).
The baseline is a logistic regression applied on the labelled points. We employ a per-class image sampling strategy to circumvent the unbalanced number of examples per class. We optimize the learning rate, batch size and L2 regularization factor of the logistic regression on the group 1 images.
Background images for diffusion
We consider the following sets of background images:

- None: the diffusion is directly from the seed images to the test images;

- In-domain setting: the background images are the Imagenet training images from the novel classes, but without labels. This corresponds to a use case where all images are known to belong to a set of classes, but only a subset of them have been labelled;

- Out-of-domain setting: background images are taken from YFCC100M. We denote this setting by F100k, F1M, F10M or F100M, depending on the number of background images used (e.g., F1M uses 10^6 images). This corresponds to a more challenging setting where we have no prior knowledge about the images used in the diffusion.
4.2 Parameters of diffusion
We compare a few settings of the diffusion algorithm that are of interest. In all cases, we set the number of nearest neighbors to k = 30 and evaluate with n = 2 labelled images per class. The nearest neighbors are computed with Faiss [23], using the IVFFlat algorithm, which computes exact distances but may occasionally miss a few neighbors (see Appendix E for details).
Graph edge weighting.
We experimented with different edge weightings for W that were proposed in the literature: a constant weight, a Gaussian weighting of the distances [27, 3] (with a bandwidth hyperparameter), and a weighting based on the “meaningful neighbors” proposal [30].
Table 1 shows that the results are remarkably independent of the choice of edge weighting, which is why we set the weights to a constant 1. The best normalization that can be applied to the matrix L is a simple column-wise L1 normalization. Thanks to the linear iteration formula, it can be applied once at the end of the iterations.
Table 1: top-5 accuracy for different edge weightings and normalization operators.

background             | none        | F1M         | imnet
edge weighting         |             |             |
constant               | 62.7 ± 0.68 | 65.4 ± 0.55 | 73.3 ± 0.72
Gaussian weighting*    | 62.7 ± 0.66 | 65.4 ± 0.58 | 73.6 ± 0.71
meaningful neighbors*  | 62.7 ± 0.68 | 40.0 ± 0.20 | 73.6 ± 0.62
operator η             |             |             |
none                   | 40.6 ± 0.18 | 41.1 ± 0.10 | 42.3 ± 0.19
Sinkhorn               | 61.1 ± 0.69 | 56.8 ± 0.50 | 72.3 ± 0.72
column-wise            | 62.7 ± 0.68 | 65.4 ± 0.55 | 73.3 ± 0.72
nonlinear transform*   | 62.7 ± 0.68 | 65.4 ± 0.55 | 73.3 ± 0.72
class frequency prior* | 62.7 ± 0.66 | 65.4 ± 0.60 | 73.3 ± 0.65
Class-specific weighting
In out-of-domain diffusion with many background images, the distribution over classes is not determined by the seed or test images, but predominantly by the background images. Therefore, the class prior applied to the images can be adjusted to match that of the background.
To do this, we classify the YFCC100M images using a logistic regression classifier and use the resulting class distribution as a prior for per-class normalization. Since the estimated class distribution is relatively flat, we make it peakier using a power transform that trades off against the uniform distribution. The single hyperparameter (the power) is cross-validated.
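The power transform amounts to the following (tau is the cross-validated power; tau = 0 yields the uniform distribution and tau = 1 keeps the estimated one; the function name is ours):

```python
import numpy as np

def sharpen_prior(p, tau):
    """Renormalized element-wise power of a class distribution p."""
    q = np.power(p, tau)
    return q / q.sum()
```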
4.3 Large-scale diffusion
Figure 2 reports experiments varying the number of background images n_B and the number of neighbors k. In Appendix C, we also show how fast L “fills up” (it is dense after a few iterations). The maximum in accuracy is also reached quickly; this maximum occurs later when n_B is larger and when k is smaller. The plot also shows that early stopping is important.
The plot on the right reports the large-scale behavior of the diffusion. All the curves have an optimal point in terms of accuracy vs. computational cost at k = 30, which may be an intrinsic property of the descriptor manifold. It is worth noting that before starting the diffusion iterations, with k = 1000 and no background images (the best such setting), we obtain an accuracy of 60.5%. This amounts to a kNN classifier, and it is the fastest setting because the kNN-graph does not need to be constructed. Another advantage is that it does not require storing the graph.
4.4 Complexity: Runtime and memory
We measured the runtime of the diffusion process on a 48-core 2.5 GHz machine:
background               | none  | F1M   | F10M   | F100M
optimal iteration        | 2     | 3     | 4      | 5
timing: graph completion | 2m57s | 8m36s | 40m41s | 4h08m (23m on 8 GPUs)
timing: diffusion        | 4.4s  | 19s   | 3m44s  | 54m
The graph construction time is linear in the number of background images, thanks to the precomputation of the graph matrix for the background images (see Section 3.5). For comparison, training the logistic regression takes between 2m27s and 12m, depending on the cross-validated parameters.
In terms of memory usage, the biggest F100M experiments need to simultaneously keep in RAM a matrix of 5.3 billion non-zero values (39.5 GiB), plus L_t and L_{t+1} (35.8 GiB, using slices of columns). This is the main drawback of using diffusion. However, Table 2 shows that restricting the diffusion to 10 million images already provides most of the gain, while reducing memory and computation complexity by an order of magnitude.
4.5 Comparison with baseline classifiers
We compare the performance of diffusion against some baseline classifiers, see Table 2. For very low-shot learning (n = 1, 2), the in-domain diffusion outperforms the other methods by a large margin. As stated in Section 3.2, we do not include the test points in the diffusion, which is standard for a classification setting. However, if we allow this, as in a fully transductive setting, we obtain a top-5 accuracy of 67.9% ± 0.76 for n = 2 with diffusion over F1M, i.e., on par with diffusion over F100M.
In the following, we comment on the out-of-domain setting. Our diffusion outperforms logistic regression only for n ≤ 2.
Classifier combination.
We experimented with a very simple late fusion: to combine the scores of the two classifiers, we take a weighted average of their predictions (log-probabilities) and cross-validate the weighting factor. Table 2 shows that the result is significantly above the best of the two input classifiers. This shows that the logistic regression classifier and the diffusion classifier exploit different aspects of the image collection. We also experimented with more complicated combination methods, like using the graph edges as a regularizer during the logistic regression, which did not improve this result.
This combination outperforms the state-of-the-art result of [16] (which itself outperforms or is closely competitive with [38, 39] in this setting). This is remarkable because their method is a combination of a specific loss and a learned data augmentation procedure that is specifically tailored to the experimental setup with base and novel classes. In contrast, our diffusion procedure is generic and has only 3 parameters: k, the late-fusion weight, and the number of diffusion iterations.
Table 2: top-5 accuracy as a function of the number n of labelled images per novel class (first four columns: out-of-domain diffusion; in-domain: diffusion over unlabelled Imagenet; combined: late fusion of logistic regression with diffusion).

n  | none        | F1M         | F10M        | F100M       | in-domain   | logistic    | comb.+F10M  | comb.+F100M | [16]
1  | 57.1 ± 0.53 | 60.0 ± 0.52 | 61.4 ± 0.68 | 62.3 ± 0.59 | 68.0 ± 0.64 | 57.3 ± 0.51 | 62.0 ± 0.75 | 62.6 ± 0.63 | 60.6
2  | 62.5 ± 0.50 | 65.5 ± 0.50 | 66.8 ± 0.44 | 67.8 ± 0.62 | 73.2 ± 0.51 | 66.0 ± 0.59 | 68.7 ± 0.43 | 69.2 ± 0.60 | 68.9
5  | 68.4 ± 0.38 | 70.6 ± 0.31 | 71.9 ± 0.48 | 73.1 ± 0.61 | 77.8 ± 0.35 | 76.4 ± 0.26 | 76.9 ± 0.23 | 77.4 ± 0.27 | 77.3
10 | 72.7 ± 0.16 | 74.2 ± 0.30 | 75.3 ± 0.05 | 76.2 ± 0.15 | 80.1 ± 0.14 | 80.9 ± 0.21 | 81.3 ± 0.17 | 81.5 ± 0.18 | 80.6
20 | 76.0 ± 0.28 | 77.0 ± 0.21 | 77.5 ± 0.17 | 78.6 ± 0.15 | 81.4 ± 0.10 | 83.7 ± 0.15 | 83.9 ± 0.15 | 84.1 ± 0.09 | 82.5
5 Conclusion
We experimented with large-scale label propagation for low-shot learning. Unsurprisingly, we found that performing diffusion over images from the same domain works much better than over images from a different domain. We clearly observe that, as the number of images over which we diffuse grows, the accuracy steadily improves. The main performance factor is the total number of edges, which also reasonably reflects the complexity. We also report neutral results for the more sophisticated variants; for instance, we show that edge weights are not useful. Furthermore, labeled images should be included in the diffusion process and not just used as sources, i.e., not forced to keep their label.
The main outcome of our study is to show that diffusion over a large image set is superior to state-of-the-art methods for low-shot learning when very few labels are available. Interestingly, late fusion with a standard classifier's result is effective, which shows the complementarity of the approaches.
References
 [1] Y. Avrithis, Y. Kalantidis, E. Anagnostopoulos, and I. Z. Emiris. Webscale image clustering revisited. In ICCV, 2015.
 [2] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, September 2014.
 [3] Y. Bengio, O. Delalleau, and N. L. Roux. Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, and A. Zien, editors, SemiSupervised Learning, chapter 11, pages 195–216. MIT Press, Boston, 2006.
 [4] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet. Outofsample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In NIPS, pages 177–184, 2003.
 [5] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feedforward oneshot learners. In NIPS, 2016.
 [6] R. R. C. Boaz Nadler, Stephane Lafon and I. G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Technical report, Arxiv, 2008.
 [7] T. Chin, L. Wang, K. Schindler, and D. Suter. Extrapolating learned manifolds for human activity recognition. In ICIP, pages 381–384, 2007.
 [8] M. Cho and K. M. Lee. Modeseeking on graphs via random walks. In CVPR, June 2012.
 [9] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
 [10] W. Dong, M. Charikar, and K. Li. Efficient knearest neighbor graph construction for generic similarity measures. In WWW, March 2011.
 [11] W. Dong, R. Socher, L. LiJia, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In CVPR, June 2009.
 [12] M. Donoser and H. Bischof. Diffusion processes for retrieval revisited. In CVPR, pages 1320–1327, 2013.
 [13] A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for largescale detection of protein families. Nucleic acids research, 30(7), 2002.
 [14] R. Fergus, Y. Weiss, and A. Torralba. Semisupervised learning in gigantic image collections. In NIPS, pages 522–530, 2009.
 [15] G. H. Golub and C. V. Loan. Matrix computations. John Hopkinks University Press, 2013.
 [16] B. Hariharan and R. Girshick. Lowshot visual recognition by shrinking and hallucinating features. arXiv preprint arXiv:1606.02819, 2016.
 [17] B. Harwood and T. Drummond. Fanng: Fast approximate nearest neighbour graphs. In CVPR, 2016.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 [19] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
 [20] A. Iscen, Y. Avrithis, G. Tolias, T. Furon, and O. Chum. Fast spectral ranking for similarity search. In CVPR, June 2017.
 [21] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. In CVPR, June 2017.
 [22] A. Jansen, G. Sell, and V. Lyzinski. Scalable outofsample extension of graph embeddings using deep neural networks. CoRR, abs/1508.04422, 2015.
 [23] J. Johnson, M. Douze, and H. Jégou. Billionscale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
 [24] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
 [25] Y. Kalantidis, L. Kennedy, H. Nguyen, C. Mellina, and D. A. Shamma. LOH and behold: Web-scale visual search, recommendation and clustering using locally optimized hashing. arXiv preprint arXiv:1604.06480, 2016.
 [26] P. A. Knight. The Sinkhorn-Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.
 [27] F. Lin and W. W. Cohen. Power iteration clustering. In ICML, 2010.
 [28] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
 [29] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
 [30] D. Omercevic, O. Drbohlav, and A. Leonardis. High-dimensional feature matching: employing the concept of meaningful nearest neighbors. In ICCV, October 2007.
 [31] J. Philbin and A. Zisserman. Object mining using a matching graph on very large image collections. In Computer Vision, Graphics & Image Processing, 2008.
 [32] S. Ravi and H. Larochelle. Optimization as a model for fewshot learning. In ICLR, April 2017.
 [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
 [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [35] R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967.
 [36] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
 [37] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral maxpooling of CNN activations. In ICLR, 2016.
 [38] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
 [39] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
 [40] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, volume 16, pages 321–328, 2003.
 [41] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semisupervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
 [42] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.
Appendices
We present a few additional results and details to complement the paper. Section A reports an alternative evaluation protocol that restricts the evaluation to the novel classes. Sections B and D are parametric evaluations, Section C comments on the speed of the diffusion process, and Section E gives details about the graph computation.
Appendix A Evaluation results on novel classes
In the paper, we evaluated the classification performance on all the test images from group 2. Prior work [16] also reports the performance restricted to only the novel classes. Table 3 shows our results in this setting.
Table 3: Classification accuracy (mean ± standard deviation over runs), restricted to the novel classes, for $n$ training examples per novel class. Columns 2–5: out-of-domain diffusion (no additional images, F1M, F10M, F100M); column 6: in-domain diffusion on Imagenet; columns 8–9: combination of the logistic regression with F10M/F100M diffusion.

| $n$ | none | F1M | F10M | F100M | Imagenet | logistic regression | +F10M | +F100M | [16] |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 38.3±1.08 | 43.0±1.02 | 45.4±1.19 | 46.9±1.00 | 56.3±1.06 | 38.7±0.73 | 45.3±1.21 | 46.6±0.97 | 40.9 |
| 2 | 47.0±0.81 | 51.9±0.76 | 54.1±0.60 | 55.7±0.82 | 64.7±0.74 | 50.8±0.88 | 55.5±0.72 | 56.2±0.97 | 55.7 |
| 5 | 56.6±0.51 | 60.1±0.46 | 62.4±0.77 | 64.0±0.84 | 71.7±0.50 | 67.7±0.36 | 68.3±0.39 | 69.1±0.45 | 69.7 |
| 10 | 63.4±0.31 | 65.9±0.48 | 67.6±0.14 | 69.1±0.27 | 75.3±0.19 | 75.4±0.31 | 75.8±0.29 | 76.0±0.34 | 75.9 |
| 20 | 69.1±0.43 | 70.4±0.36 | 71.4±0.23 | 72.6±0.31 | 77.2±0.13 | 80.0±0.26 | 80.1±0.20 | 80.4±0.14 | 79.3 |
As expected, the results are inferior to those in the setup where all test images are classified, because novel classes are harder to classify than base classes. Otherwise the ordering of the methods is preserved and the conclusions are identical. The diffusion is effective in the low-shot regime and, by itself, beats the state of the art by a large margin when only one example is available. The combination with late fusion significantly outperforms the state of the art in the out-of-domain setup.
Appendix B Details of the parametric evaluation
In the paper we reported results for the edge weighting and graph normalization with the best parameter setting. Here, we report results for all parameters. We evaluate the following edge weightings (Figure 3, first row):

Gaussian weighting. The edge weight is $e^{-d^2/\sigma^2}$, where $d$ is the distance between the edge's nodes. Note that $\sigma \to \infty$ corresponds to a constant weighting;

Weighting based on the "meaningful neighbors" proposal [30]. It relies on an exponential fit of the neighbor distances. For a given graph node, the $i$-th neighbor of its list of $k$ results receives the weight $e^{-\alpha \tilde{d}_i}$, where $\tilde{d}_i$ is the distance remapped linearly to $[0,1]$ so that the first neighbor has $\tilde{d}_1 = 0$ and the $k$-th neighbor has $\tilde{d}_k = 1$. We vary the parameter $\alpha$ in the plot.
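The two edge weightings can be illustrated on a toy neighbor list. This is a minimal numpy sketch under our own notation: the parameter names `sigma` and `alpha`, and the exact exponential form of the "meaningful neighbors" weighting, are assumptions beyond what the text states.

```python
import numpy as np

# Distances from one graph node to its k nearest neighbors (toy values).
d = np.array([0.2, 0.5, 0.9, 1.4, 2.0])

# Gaussian weighting: w = exp(-d^2 / sigma^2).
# sigma -> infinity recovers a constant (uniform) weighting.
sigma = 1.0
w_gauss = np.exp(-d**2 / sigma**2)

# "Meaningful neighbors"-style weighting (assumed form): remap distances
# linearly to [0, 1] (first neighbor -> 0, k-th neighbor -> 1), then
# apply an exponential decay controlled by the varied parameter alpha.
alpha = 2.0
d_remap = (d - d[0]) / (d[-1] - d[0])
w_mn = np.exp(-alpha * d_remap)
```

Both schemes assign larger weights to closer neighbors; the Gaussian weighting uses the raw distances, while the second one only depends on the distances relative to the node's own neighbor list.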
For normalizations of the graph matrix $W$, we compare (Figure 3, second row):

the non-linear normalization: all elements of $W$ are raised to a power $\gamma$. We vary the parameter $\gamma$; $\gamma = 1$ corresponds to the identity transform;

we classify all images in the graph with a logistic regression classifier. We use the predicted frequency of each class over the whole graph, raised to a power (the parameter) to reduce or increase its peakiness. This gives a per-class normalization factor that we enforce on each column of the label matrix, instead of the default uniform distribution.
The conclusion of these experiments is that these variants do not improve over constant weights and a standard diffusion; most of them have a neutral effect. We therefore conclude that the diffusion process depends mostly on the topology of the graph.
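The non-linear normalization can be sketched as follows. The element-wise power $\gamma$ is from the text; the final row-stochastic rescaling is one standard choice we assume for the baseline diffusion, not necessarily the paper's exact normalization.

```python
import numpy as np

# Toy symmetric affinity matrix W (dense here for clarity; the real
# graph matrix is sparse, with k non-zeros per row).
W = np.array([[0.0, 0.8, 0.3],
              [0.8, 0.0, 0.5],
              [0.3, 0.5, 0.0]])

gamma = 0.5          # gamma = 1 is the identity transform
W_pow = W ** gamma   # element-wise non-linear normalization

# Follow with a row-stochastic normalization so that each row of the
# matrix sums to 1 (assumed baseline; other normalizations exist).
W_norm = W_pow / W_pow.sum(axis=1, keepdims=True)
```

With $\gamma < 1$ the relative gap between strong and weak edges shrinks, so the power mainly redistributes mass within each neighbor list before the diffusion iterations.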
Figure 3: Parametric evaluation. First row: Gaussian weighting (left) and the meaningful neighbors model (right). Second row: non-linear normalization (left) and normalization with class weights (right).
Appendix C Analysis of the diffusion process
In our paper we analyze the performance attained along iterations. In this section we complement this analysis by reporting the fraction of nodes reached by the diffusion process: we consider very large graphs, few seeds and a relatively small graph degree. While the graph is not necessarily fully connected, we observe that in practice most images can be reached by all labels. Figure 4 measures the sparsity of the label matrix (on one run of validation), which indicates the fraction of (label, image) pairs that have not yet been attained by the diffusion process at each diffusion step.
The number of nodes reached by all labels grows rapidly and converges to a value close to 1 in only a few iterations. We have generally observed that the iteration at which the density of the matrix gets close to 1 matches the iteration at which the accuracy is maximal, as selected by cross-validation.
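The reach statistic above can be reproduced on a toy graph. This is a minimal numpy sketch under assumed names (`A`, `reached`, `density`): it only tracks which (label, image) pairs have been touched, ignoring the actual diffusion weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy k-NN graph: each of n nodes links to k random neighbors.
n, k, n_labels = 200, 5, 3
A = np.zeros((n, n), dtype=np.int32)
for i in range(n):
    A[i, rng.choice(n, size=k, replace=False)] = 1

# reached[i, c] == 1 iff label c has reached image i.
# The first n_labels nodes are the seeds, one label each.
reached = np.zeros((n, n_labels), dtype=np.int32)
reached[:n_labels] = np.eye(n_labels, dtype=np.int32)

density = []  # fraction of (label, image) pairs reached per iteration
for _ in range(10):
    # One diffusion step: a node is reached by a label as soon as one
    # of its neighbors is; clip to keep a 0/1 indicator matrix.
    reached = np.minimum(1, reached + A @ reached)
    density.append(reached.mean())
```

The `density` curve is non-decreasing by construction; on the large graphs of the paper it approaches 1 within a few iterations.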
Appendix D Late fusion weights
If $p_\mathrm{d}(x)$ and $p_\mathrm{l}(x)$ are the distributions over classes found by the diffusion and the logistic regression classifiers for image $x$, the top-5 prediction is given by the five largest entries of
$$(1 - \beta_n)\, p_\mathrm{d}(x) + \beta_n\, p_\mathrm{l}(x), \qquad (6)$$
where $\beta_n$ is the optimal mixing coefficient for $n$ seed points, found by cross-validation.
Figure 5 shows the optimal mixing factors. Since the logistic regression is better at classifying with many training examples, the weight assigned to it increases with $n$.
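The late fusion of Equation (6) amounts to a convex combination followed by a top-5 selection. A minimal numpy sketch, with assumed names (`p_diff`, `p_logreg`, `beta` correspond to $p_\mathrm{d}$, $p_\mathrm{l}$, $\beta_n$):

```python
import numpy as np

def late_fusion_top5(p_diff, p_logreg, beta):
    """Mix the diffusion and logistic-regression class distributions
    with coefficient beta (found by cross-validation in the paper)
    and return the indices of the 5 highest-scoring classes."""
    combined = (1.0 - beta) * p_diff + beta * p_logreg
    return np.argsort(-combined)[:5]

# Toy distributions over 100 classes.
rng = np.random.default_rng(1)
p_d = rng.random(100); p_d /= p_d.sum()
p_l = rng.random(100); p_l /= p_l.sum()

top5 = late_fusion_top5(p_d, p_l, beta=0.7)
```

Setting `beta=0` recovers the pure diffusion prediction and `beta=1` the pure logistic regression, so sweeping `beta` on the validation set yields the mixing factors of Figure 5.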
Appendix E Computation of the blocks
As stated in the paper, we need to compute the four blocks of the matrix $W$:
$$W = \begin{bmatrix} W_{\mathrm{BB}} & W_{\mathrm{BS}} \\ W_{\mathrm{SB}} & W_{\mathrm{SS}} \end{bmatrix}, \qquad (7)$$
where the subscripts B and S refer to the $N$ background images and the $n$ seed images, with usually $n \ll N$. Each block requires a nearest-neighbor search, performed using the Faiss library (http://github.com/facebookresearch/faiss), which is well optimized for this task [23]:

For $W_{\mathrm{BB}}$ we use a Faiss "IVFFlat" index, with a relatively high nprobe setting (256). This guarantees that most of the actual neighbors are retrieved. With the recommended settings of Faiss, the cost of one search is proportional to $\sqrt{N}$, so the total cost is $N\sqrt{N}$. This is superlinear in $N$, but it is done offline;

for $W_{\mathrm{SB}}$ we can reuse the same index to do the similarity search, this time with only $n$ queries;

for $W_{\mathrm{BS}}$ we need an index on the seed image descriptors. We found that in practice, constructing an index on these few images is at best 1.4× faster than brute-force search. Therefore, we use brute-force search in this case, which costs $n \times N$ distance computations;

the computation of $W_{\mathrm{SS}}$ has negligible complexity.
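The two cross searches can be sketched with a brute-force stand-in (the paper uses Faiss for the large blocks; the toy sizes, the names `B`, `S`, and the helper `knn_bruteforce` are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, n, k = 32, 1000, 10, 5          # toy sizes; in the paper N >> n
B = rng.standard_normal((N, d)).astype(np.float32)  # background images
S = rng.standard_normal((n, d)).astype(np.float32)  # seed images

def knn_bruteforce(queries, base, k):
    # Squared L2 distances between all query/base pairs: O(|Q| * |B|).
    d2 = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]               # k nearest per query
    return idx, np.take_along_axis(d2, idx, axis=1)

# W_SB: each seed searches its k neighbors among the background images
# (in the paper this reuses the IVFFlat index built for W_BB).
I_sb, D_sb = knn_bruteforce(S, B, k)

# W_BS: each background image searches among the n seeds; brute force,
# cost O(n * N), which the paper found faster than indexing n vectors.
I_bs, D_bs = knn_bruteforce(B, S, k)
```

At realistic scale the $W_{\mathrm{BB}}$ search dominates, which is why it is the only block worth an approximate index.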
The fusion of the result lists of $W_{\mathrm{BB}}$ and $W_{\mathrm{BS}}$, to keep the $k$ nearest neighbors per row of $W$, is done in a single pass and has negligible cost. Therefore the dominant complexity is that of $W_{\mathrm{BB}}$, i.e. $N\sqrt{N}$. A typical breakdown of the timings for F100M is (in seconds):
| Timings (s) | $W_{\mathrm{BB}}$ | $W_{\mathrm{SB}}$ | $W_{\mathrm{BS}}$ | $W_{\mathrm{SS}}$ |
|---|---|---|---|---|
| on CPU | — | 65+2783 | 12003 | 32 |
| on 8 GPUs | 25929 | 732+64 | 533 | 1 |
For $W_{\mathrm{SB}}$ we decompose the timing into the loading of the precomputed IVFFlat index (and moving it to the GPU if appropriate), plus the actual computation of the neighbors.