LOH and Behold: Web-scale visual search, recommendation and clustering using Locally Optimized Hashing
We propose a novel hashing-based matching scheme, called Locally Optimized Hashing (LOH), based on a state-of-the-art quantization algorithm that can be used for efficient, large-scale search, recommendation, clustering, and deduplication. We show that matching with LOH only requires set intersections and summations to compute and so is easily implemented in generic distributed computing systems. We further show application of LOH to: a) large-scale search tasks where performance is on par with other state-of-the-art hashing approaches; b) large-scale recommendation where queries consisting of thousands of images can be used to generate accurate recommendations from collections of hundreds of millions of images; and c) efficient clustering with a graph-based algorithm that can be scaled to massive collections in a distributed environment or can be used for deduplication for small collections, like search results, performing better than traditional hashing approaches while only requiring a few milliseconds to run. In this paper we experiment on datasets of up to 100 million images, but in practice our system can scale to larger collections and can be used for other types of data that have a vector representation in a Euclidean space.
The rapid rise in the amount of visual multimedia created, shared, and consumed requires the development of better large-scale methods for querying and mining large data collections. Similarly, with increased volume of data comes a greater variety of use cases, requiring simple and repurposeable pipelines that can flexibly adapt to growing data and changing requirements.
Recent advances in computer vision have shown a great deal of progress in analyzing the content of very large image collections, pushing the state-of-the-art for classification, detection, and visual similarity search. Critically, deep Convolutional Neural Networks (CNNs) have allowed processing pipelines to become much simpler by reducing complex engineered systems to simpler systems learned end-to-end and by providing powerful, generic visual representations that can be used for a variety of downstream visual tasks. Recently it has been shown that such deep features can be used to reduce visual search to nearest neighbor search in the deep feature space. Complementary work has recently produced efficient algorithms for approximate nearest neighbor search that can scale to billions of vectors.
In this paper, we present a novel matching signature, called Locally Optimized Hashing (LOH). LOH extends LOPQ, a state-of-the-art nearest neighbor search algorithm, by treating the quantization codes of LOPQ as outputs of hashing functions. When applied to deep features, our algorithm provides a very flexible solution to a variety of related large-scale search and data mining tasks, including fast visual search, recommendation, clustering, and deduplication. Moreover, unlike , our system does not necessarily require specialized resources (i.e. dedicated cluster nodes and indexes for visual search) and is easily implemented in generic distributed computing environments.
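As a rough illustration of treating quantization codes as hash outputs, the sketch below scores a pair of images by the code positions on which they collide, using only a set intersection and a sum. The code layout (one tuple of integer codes per image), the function names, and the optional per-position weights are assumptions made for illustration; this is not the exact LOH scoring.

```python
from typing import Optional, Sequence, Set, Tuple


def hash_set(codes: Sequence[int]) -> Set[Tuple[int, int]]:
    """Turn a tuple of codes into a set of (position, code) hash outputs."""
    return {(pos, code) for pos, code in enumerate(codes)}


def match_score(
    a: Sequence[int],
    b: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Score two images by the hash outputs they share.

    With no weights, the score is the number of colliding positions.
    With weights (hypothetical), each collision adds the weight of its position.
    """
    shared = hash_set(a) & hash_set(b)  # set intersection
    if weights is None:
        return float(len(shared))
    return float(sum(weights[pos] for pos, _ in shared))  # summation
```

For example, `match_score((3, 7, 1, 4), (3, 2, 1, 9))` counts the collisions at positions 0 and 2. Because the score depends only on shared keys, the same computation can be expressed as a join on `(position, code)` in a generic distributed system.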
Our approach sacrifices precision for speed and generality as compared to more exact quantization approaches, but it enables applications that wouldn’t be computationally feasible with more exact approaches. LOH can trivially cope with large multi-image query sets. In practice, our approach allows datasets of hundreds of millions of images to be efficiently searched with query sets of thousands of images. We are in fact able to query with multiple large query sets, e.g. from Flickr groups, simultaneously and get visual recommendations for all the sets in parallel. We are also able to cluster web-scale datasets with MapReduce by simply thresholding LOH matches and running a connected components algorithm. The same approach can be used for deduplication of, e.g. search results.
Our contributions can be summarized as follows:
- We propose Locally Optimized Hashing (LOH), a novel hashing-based matching method that competes favorably with the state-of-the-art hashing methods for search and allows approximate ordering of results.
- We extend LOH to multiple image queries and provide a simple and scalable algorithm that can provide visual recommendations in batch for query sets of thousands of images.
- We show that this same representation can be used to efficiently deduplicate image search results and cluster collections of hundreds of millions of images.
Although in this paper we experiment on datasets of up to 100 million images (i.e. using the YFCC100M dataset, the largest publicly available image dataset), in practice our system is suited to web-scale multimedia search applications with billions of images. In fact, on a Hadoop cluster with 1000 nodes, our approach can find and rank similar images for millions of users from a search set of hundreds of millions of images in a runtime on the order of one hour. The method can be adapted to other data types that have vector representations in Euclidean space.
Large-scale nearest neighbor search was traditionally based on hashing methods because they offer low memory footprints for index codes and fast search in Hamming space. However, even recent hashing approaches suffer in terms of performance compared to quantization-based approaches for the same amount of memory. On the other hand, quantization-based approaches traditionally performed worse in terms of search times, and it was only recently with the use of novel indexing methods that quantization-based search was able to achieve search times of a few milliseconds in databases of billions of vectors.
A benefit of quantization approaches is that, unlike classic hashing methods, they provide a ranking for the retrieved points. Recently, approaches for binary code re-ranking have been proposed; both papers propose a secondary, computationally heavier re-ranking step that, although performed only on the retrieved points, makes search slower than state-of-the-art quantization-based approaches. In the approach presented here, we try to keep the best of both worlds by producing an approximate ordering of retrieved points without re-ranking. We argue that for use cases involving multiple queries, this approximation can be tolerated since many ranked lists are aggregated in this case.
A similar approach to ours, i.e. one that aims to produce multipurpose, polysemous codes, is presented at the current ECCV conference. After training a product quantizer, the authors propose to optimize the so-called index assignment of the centroids to binary codes, such that distances between similar centroids are small in the Hamming space.
For multi-image queries, there are two broad categories based on the semantic concepts that the query image set represents. The first is query sets that share the same semantic concept or even the same specific object (i.e. a particular building in Oxford). The second category is multi-image queries with multiple semantics. This category has been recently studied, and the authors propose a Pareto-depth approach on top of Efficient Manifold Ranking for such queries. Their approach is, however, not scalable to very large databases, and they limit query sets to image pairs.
The current work uses visual features from a CNN trained for classification, thus similarities in our visual space capture broader category-level semantics. We focus on the first category of multi-image queries, i.e. multiple-image query sets with a single semantic concept, and provide a simple and scalable approach which we apply to Flickr group set expansion. However, it is straightforward to tackle the second category with our approach by introducing a first step of (visual or multi-modal) clustering on the query set with multiple semantics before proceeding with the LOH-based set expansion.
3 Locally Optimized Hashing
4.1 Approximate nearest neighbor search with LOH
We first investigate how the LOH approach compares with the hashing literature for the task of retrieving the true nearest neighbor within the first samples seen. We compare against classic hashing methods like Locality Sensitive Hashing (LSH), Iterative Quantization, and the recent Sequential Projection Learning Hashing (USPLH), and report results in Figure ?. One can see that LOH, built on the inverted multi-index after balancing the variance of the two subspaces, compares well with the state-of-the-art in the field, even outperforming recently proposed approaches like  for large enough .
In ?, we evaluate LOH ranking, i.e. how well LOH orders the true nearest neighbor after looking at a fraction of the database. We compare against Spectral Hashing (SH), which, like LOH, also provides a ranking of the results. LOH performs similarly for small values of but outperforms SH when retrieving more than results, which is the most common case.
4.2 Visual recommendations for Flickr groups
We conduct an experiment to evaluate the ability of the proposed approach to visually find images that might be topically relevant to a group of photos already curated by a group of users. On Flickr, such activity is common as users form groups around topical photographic interests and seek out high-quality photos relevant to the group. Group moderators may contact photo owners to ask them to submit to their group.
To evaluate this, we select 7 public Flickr groups that are representative of the types of topical interests common in Flickr groups, selected due to their clear thematic construction (graffiti, sailing, glacier, windmill, columns, cars & trucks, portrait & face), for ease of objective evaluation. For each group, we construct a large query of 10 thousand images randomly sampled from the group pool. We perform visual search using our proposed method on the YFCC100M dataset, aggregate results from all 10 thousand images, and report precision after manual inspection of the top 500 results. We visually scanned the photo pools of the groups and consider as true positives all images that look like images in the photo pool and follow the group rules as specified by the administrators of each group.
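The aggregation step for a multi-image query can be sketched as follows, assuming each image is represented by a tuple of integer codes and that every code collision contributes one vote. The inverted-index layout, the uniform vote, and the names `build_index` and `recommend` are illustrative assumptions rather than the exact procedure used in the experiments.

```python
from collections import Counter, defaultdict


def build_index(database):
    """Map each (position, code) hash output to the database images holding it."""
    index = defaultdict(list)
    for image_id, codes in database.items():
        for pos, code in enumerate(codes):
            index[(pos, code)].append(image_id)
    return index


def recommend(query_set, index, top_k):
    """Sum collision votes over all query images and return the top_k image ids."""
    scores = Counter()
    for codes in query_set:
        for pos, code in enumerate(codes):
            # .get avoids creating empty posting lists in the defaultdict
            for image_id in index.get((pos, code), ()):
                scores[image_id] += 1
    return [image_id for image_id, _ in scores.most_common(top_k)]
```

Because votes from different query images are simply added, queries of thousands of images need no special handling beyond a larger sum, which is what makes the batch setting map naturally onto a group-by and count in a distributed job.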
Precision for each group is shown in Figure ?. For the group “Portraits & Faces”, for example, all 500 top results were high-quality portraits. We see some confusion due to the nature of the visual representation chosen (e.g. our visual representation may confuse desert and cloud images with snow images), but overall, Precision@500 was over for five out of the seven groups we tested.
Example results for the set expansion with our method and a baseline tag-based search are shown in Figure ? for Flickr group “vintage cars and trucks”. For the proposed approach, precision is high, as is the aesthetic quality of the results. The tag-based Flickr search returns more false positives for such a specific group, as irrelevant images are likely improperly tagged.
4.3 Clustering and deduplication results
We evaluate the performance of LOH on the deduplication task using a dataset of Flickr searches for the query “Brad Pitt” with the LOH codes learned on the YFCC100M dataset. To measure precision and recall, we enumerate all pairs of images in the dataset and define a “positive” sample as a pair of images that belong to the same group in our dataset, and a “negative” sample as a pair of images that belong to different groups in our dataset.
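The pairwise evaluation described above can be sketched as follows. A predicted "duplicate" pair is one that a method places in the same cluster. The function name, the dictionary inputs, and the convention of returning 1.0 when a denominator is zero are assumptions made for this illustration.

```python
from itertools import combinations


def pairwise_precision_recall(true_group, predicted_cluster):
    """Compute precision and recall over all unordered image pairs.

    true_group and predicted_cluster both map image id -> label.
    A pair is positive if both images share a ground-truth group.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(true_group), 2):
        same_true = true_group[a] == true_group[b]
        same_pred = predicted_cluster[a] == predicted_cluster[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    # Assumed convention: an empty denominator yields 1.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Sweeping the match threshold that produces `predicted_cluster` and recomputing these two values yields one point per threshold on a precision-recall curve.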
In Figure ? we plot the precision-recall curves for LOH versus LSH and PCA-E. For LSH, we transform our PCA-reduced 128-dimensional image descriptor into a 128-bit binary code computed from random binary projection hash functions. For PCA-E, we compute a 128-bit binary code by subtracting the mean of our PCA-reduced 128-dimensional image descriptor and taking the sign.
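The two baseline encodings can be sketched in a few lines. Drawing the random projection directions from a standard Gaussian, mapping a value of exactly zero to bit 1, and all function names are assumptions for illustration; the text only specifies random binary projections for LSH and mean subtraction followed by the sign for PCA-E.

```python
import random


def random_projections(num_bits, dim, seed=0):
    """Draw num_bits random directions (assumed Gaussian) of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_code(x, projections):
    """One bit per projection: 1 if the dot product is non-negative, else 0."""
    return [
        1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
        for p in projections
    ]


def pca_e_code(x, mean):
    """One bit per dimension: sign of the mean-centered descriptor."""
    return [1 if x_i - m_i >= 0 else 0 for x_i, m_i in zip(x, mean)]


def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))
```

With 128 projections over a 128-dimensional descriptor, `lsh_code` yields a 128-bit code, and `pca_e_code` yields one bit per descriptor dimension; in both cases candidate duplicates are pairs whose Hamming distance falls below a chosen threshold.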
We run LOH clustering for the 100 million images of the YFCC100M dataset on a small Hadoop cluster and show sample clusters in Figure 7. We first did some cleaning and preprocessing by using a stoplist of codes (i.e. remove all triplets that appear more than or less than times) for efficiency. For a threshold of , we get million edges and approximately million connected components. Of those, about 6.5 million are small components of size smaller than . The graph construction for the 100M images took a couple of hours, while connected components runs in a few minutes.
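A single-machine sketch of this pipeline is given below: a stoplist drops codes that are too frequent or too rare, match scores are thresholded into edges, and connected components are found with union-find. The parameters `min_count`, `max_count`, and `threshold` are hypothetical placeholders, and the union-find formulation is an assumption standing in for the distributed connected components job.

```python
from collections import Counter


def stoplist_filter(image_codes, min_count, max_count):
    """Keep only codes whose image frequency lies in [min_count, max_count]."""
    counts = Counter(
        code for codes in image_codes.values() for code in set(codes)
    )
    return {
        image_id: [c for c in codes if min_count <= counts[c] <= max_count]
        for image_id, codes in image_codes.items()
    }


def connected_components(edges, threshold):
    """Group nodes linked by edges whose score is at least threshold.

    edges is an iterable of (node_a, node_b, score). Nodes with no
    qualifying edge are omitted, i.e. singletons are not returned.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

Raising `threshold` yields fewer edges and smaller, purer components, which suits deduplication; lowering it merges more images per component, which suits discovering broader visual concepts.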
By visual inspection, we notice that a large set of medium-sized clusters (i.e. clusters with images) contain visually consistent higher level concepts (e.g. from Figure 7: “motorbikes in the air”, “Hollywood St stars” or “British telephone booths”). Such clusters can be used to learn classifiers in a semi-supervised framework that incorporates noisy labels. Clustering YFCC100M gives us about 32K such clusters.
In this paper we propose a novel matching scheme, Locally Optimized Hashing or LOH, that is computed on the very powerful and compact LOPQ codes. We show how LOH can be used to efficiently perform visual search, recommendation, clustering and deduplication for web-scale image databases in a distributed fashion.
While LOPQ distance computation gives high quality, fast distance estimation for nearest neighbor search, it is not as well suited for large-scale, batch search and clustering tasks. LOH, however, enables these use-cases by allowing implementations that use only highly parallelizable set operations and summations. LOH can therefore be used for massively parallel visual recommendation and clustering in generic distributed environments with only a few lines of code. Its speed also allows its use for deduplication of, e.g., search result sets at query time, requiring only a few milliseconds to run for sets of thousands of results.
- Arandjelovic, R., Zisserman, A.: Multiple queries for large scale specific object retrieval. In: BMVC (2012)
- Avrithis, Y., Kalantidis, Y., Anagnostopoulos, E., Emiris, I.Z.: Web-scale image clustering revisited. In: ICCV (2015)
- Avrithis, Y., Kalantidis, Y., Tolias, G., Spyrou, E.: Retrieving landmark and non-landmark images from community photo collections. In: ACM Multimedia (2010)
- Babenko, A., Lempitsky, V.: The inverted multi-index. In: CVPR (2012)
- Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: ECCV (2014)
- Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.: Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on Computational Geometry (2004)
- Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. PAMI 36(8), 1532–1545 (2014)
- Douze, M., Jégou, H., Perronnin, F.: Polysemous codes. In: ECCV (2016)
- Fernando, B., Tuytelaars, T.: Mining multiple queries for image retrieval: On-the-fly learning of an object-specific mid-level representation. In: ICCV (2013)
- Ge, T., He, K., Ke, Q., Sun, J.: Optimized product quantization. Tech. Rep. 4 (2014)
- Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
- Gong, Y., Lazebnik, S.: Iterative quantization: A procrustean approach to learning binary codes. In: CVPR (2011)
- Gong, Y., Pawlowski, M., Yang, F., Brandy, L., Bourdev, L., Fergus, R.: Web scale photo hash clustering on a single machine. In: CVPR (2015)
- Gordo, A., Perronnin, F., Gong, Y., Lazebnik, S.: Asymmetric distances for binary embeddings. PAMI 36(1), 33–47 (2014)
- Hsiao, K., Calder, J., Hero, A.O.: Pareto-depth for multiple-query image retrieval. arXiv preprint arXiv:1402.5176 (2014)
- Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1) (2011)
- Jégou, H., Tavenard, R., Douze, M., Amsaleg, L.: Searching in one billion vectors: Re-rank with source coding. In: ICASSP (2011)
- Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
- Jin, Z., Hu, Y., Lin, Y., Zhang, D., Lin, S., Cai, D., Li, X.: Complementary projection hashing. In: ICCV (2013)
- Kalantidis, Y., Tolias, G., Avrithis, Y., Phinikettos, M., Spyrou, E., Mylonas, P., Kollias, S.: Viral: Visual image retrieval and localization. MTAP (2011)
- Kalantidis, Y., Avrithis, Y.: Locally optimized product quantization for approximate nearest neighbor search. In: CVPR (2014)
- Kennedy, L., Naaman, M., Ahern, S., Nair, R., Rattenbury, T.: How flickr helps us make sense of the world: Context and content in community-contributed media collections. In: ACM Multimedia. vol. 3, pp. 631–640 (2007)
- Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
- Norouzi, M., Fleet, D.: Cartesian k-means. In: CVPR (2013)
- Norouzi, M., Punjani, A., Fleet, D.J.: Fast search in Hamming space with multi-index hashing. In: CVPR (2012)
- Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31(11), 1348–1358 (Aug 2010)
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 (2014)
- Shen, F., Shen, C., Liu, W., Shen, H.T.: Supervised discrete hashing. In: CVPR (2015)
- Thomee, B., Elizalde, B., Shamma, D.A., Ni, K., Friedland, G., Poland, D., Borth, D., Li, L.J.: Yfcc100m: The new data in multimedia research. Communications of the ACM 59(2), 64–73 (2016)
- Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817 (2015)
- Tolias, G., Avrithis, Y., Jégou, H.: Image search with selective match kernels: Aggregation across single and multiple images. International Journal of Computer Vision pp. 1–15 (2015)
- Wang, J., Shen, H.T., Yan, S., Yu, N., Li, S., Wang, J.: Optimized distances for binary code ranking. In: Proceedings of the ACM International Conference on Multimedia. pp. 517–526. ACM (2014)
- Wang, J., Kumar, S., Chang, S.F.: Sequential projection learning for hashing with compact codes. In: ICML (2010)
- Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS (2008)
- Xu, B., Bu, J., Chen, C., Cai, D., He, X., Liu, W., Luo, J.: Efficient manifold ranking for image retrieval. In: SIGIR (2011)
- Zhang, L., Zhang, Y., Tang, J., Lu, K., Tian, Q.: Binary code ranking with weighted hamming distance. In: CVPR (2013)
- Zheng, Y., Zhao, M., Song, Y., Adam, H., Buddemeier, U., Bissacco, A., Brucher, F., Chua, T.S., Neven, H.: Tour the world: Building a web-scale landmark recognition engine. In: CVPR (2009)
- Zhu, C.Z., Huang, Y.H., Satoh, S.: Multi-image aggregation for better visual object retrieval. In: ICASSP (2014)