Detect-to-Retrieve: Efficient Regional Aggregation for Image Search
Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes images with manually curated boxes from unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we further introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially at no additional memory cost, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.
1 Introduction

In this paper, we address the image retrieval problem: given a query image, a system should efficiently retrieve similar images from a database. Image retrieval systems are usually composed of two main stages: (1) filtering, where an efficient technique ranks database images according to their similarity with respect to the query; (2) re-ranking, where a small number of the most similar database images from the first stage are inspected in more detail and re-ranked.
Traditionally, hand-crafted local features [21, 6] were coupled with Bag-of-Words-inspired techniques [35, 26, 27, 14, 15, 16, 37] to construct high-dimensional representations used in the filtering step. Local feature matching and geometric verification [26, 27, 3] (commonly using RANSAC) have been used as effective re-ranking strategies. Recently, several deep learning techniques have been proposed for these two stages. Global image representations based on convolutional neural networks (CNNs) can produce compact embeddings that enable fast similarity computation in the filtering step [5, 4, 39, 1, 9, 30]. Local image representations can also be extracted using CNNs, making them suitable for re-ranking via spatial matching and geometric verification [25, 24, 23].
Today’s image retrieval systems tend to fail when relevant objects do not occupy a large enough fraction of database images, typically in cluttered scenes. Often, these objects produce local features that could be used to find local matches against the query in the re-ranking stage. However, such cluttered images usually fail to reach the re-ranking stage, since their initial representation does not lead to high similarity when compared to the query during the filtering stage. The most common solution to estimate an improved similarity with respect to the query image is to extract and separately store image representations for regions-of-interest in the database, using a fixed regional grid [2, 31] or a class-agnostic detector [36, 17]. However, the existing region selection techniques produce a large number of irrelevant regions. In a recent large-scale experimental image retrieval evaluation, Radenovic \etal concluded that such regional search approaches impose too high a cost in terms of memory and latency, with only small accuracy gains.
(1) Our first contribution is aimed at improving region selection: we introduce a dataset of manually boxed landmark images, with images from unique classes, and we show that detectors can be trained for robust landmark localization. (2) Our second contribution is to leverage the trained detector and produce more efficient regional search systems, which improve accuracy for small objects with only a modest increase to the database size – much more efficiently than previously proposed techniques. (3) In our third contribution, we propose regional aggregated match kernels to leverage selected image regions and produce a discriminative image representation, illustrated in \figreffig:key_fig. This new representation outperforms regional search systems significantly, while at the same time being more efficient: only one descriptor needs to be stored per image. Our image retrieval system outperforms previously published results by significant absolute mean average precision margins on the Revisited Oxford-Hard and Revisited Paris-Hard datasets. Towards the goal of facilitating further research, we will release both code and data.
2 Related Work
To the best of our knowledge, no manually curated datasets of landmark bounding boxes exist. Gordo \etal use SIFT matching to estimate boxes in landmark images. Such boxes are biased towards the feature extraction and matching technique, and may contain localization errors. Their dataset contains boxed images, from landmarks. In comparison, we use human raters to annotate the regions of interest, and produce boxed images from landmarks. The OpenImages dataset contains images annotated with generic object bounding boxes. Some of them may be considered landmarks, for example: buildings, towers, skyscrapers, billboards. However, these classes make up only a small fraction of the entire dataset.
Regional search and aggregation.
Region selection has been explored in image retrieval systems, with two different purposes: (i) regional search, where selected regions are encoded independently in the database, allowing for retrieval of subimages; (ii) regional aggregation, where selected regions are used to improve holistic image representations. In the following, we review these two types of approaches.
Regional search. Many papers propose to describe regions using VLAD  or Fisher Vectors : Arandjelovic and Zisserman  use a multi-scale grid to extract regions per image; Tao \etal use Selective Search  with thousands of regions per image; Kim \etal use maximally stable extremal regions (MSER) . Razavian \etal use a multi-scale grid with regions per image, and compute the similarity of two images by taking into account the distances between all region pairs. Iscen \etal[13, 12] leverage multi-scale grids in conjunction with CNN features , to enable query expansion via diffusion. More recently, Radenovic \etal performed a comprehensive evaluation of retrieval techniques and concluded that existing regional search methods may improve recognition accuracy, albeit at significantly larger memory and complexity costs. In contrast, our Detect-to-Retrieve framework aims at efficient regional search via the use of a custom trained detector.
Regional aggregation. Tolias \etal leverage the grid structure from  to pool pretrained CNN features [18, 34] into compact representations; approximately regions are selected per image. Radenovic \etal build upon  by re-training features on a dataset collected in an unsupervised manner. Gordo \etal train a region proposal network  from semi-automatic bounding box annotations, to replace the grid from . Hundreds of regions per image are considered in this case. Our work departs from these papers by using a small set of regions (fewer than per image), and by formulating regional aggregation as a new match kernel (instead of regional sum-pooling as in [39, 9]).
3 Landmark Boxes Dataset
In this section, we introduce our newly collected Landmark Boxes dataset, describing the manual box annotation process. Our work builds upon the recent Google Landmarks dataset (GLD) , whose training set contains images of unique landmarks, with a wide variety of objects including buildings, monuments, bridges, statues as well as natural landmarks such as mountains, lakes and waterfalls.
Each image in this dataset is considered to only depict one landmark. In some cases, a landmark may consist of a set of buildings: for example, skylines, which are common in this dataset, are considered as a single landmark. Since GLD is collected in a semi-automatic manner considering popular touristic locations, it is sometimes ambiguous what the landmark of interest may be. When collecting bounding box annotations, our goal is to capture the most prominent landmark in the image, consistent with the fact that each image is assigned only one landmark label. Each box should reflect the main object (or set of objects) which is showcased in each dataset image. For this reason, we instructed human operators to draw at most one box per image.
One of the main challenges in such a fine-grained dataset is the inherent long tail in the number of image samples per class. In GLD, some landmarks are associated with several thousands of images, while for about half of the classes only or fewer images are provided. Our goal is to represent landmarks in a balanced manner in our new dataset, such that trained detectors are able to localize a wide variety of objects. For this reason, we first separate part of the training set into a validation set. We randomly select four training and four validation images per landmark. In total, this yields and boxed images for training and validation, respectively. Note that this means that for about of landmarks, all available images are annotated.
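The balanced per-landmark split described above can be sketched as follows. This is an illustrative snippet, not the authors' curation tooling; `image_ids_by_class` is a hypothetical input mapping each landmark label to its image identifiers.

```python
import random

def balanced_split(image_ids_by_class, n_train=4, n_val=4, seed=0):
    """Randomly select up to n_train training and n_val validation
    images per landmark; classes with fewer images contribute all
    they have (training images are filled first)."""
    rng = random.Random(seed)
    train, val = [], []
    for landmark, ids in image_ids_by_class.items():
        ids = list(ids)
        rng.shuffle(ids)
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
    return train, val
```

Classes with four or fewer images end up fully annotated in the training split, matching the long-tail behavior noted above.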
Examples of annotated images are shown in \figreffig:dataset_examples. In some cases, it is not possible to identify a prominent landmark (see \figreffig:corner_cases): the landmark of interest may be occluded, or the image may actually show the surroundings of a landmark. We remove such corner cases from our dataset (this applied to about of images which were initially selected). We will make all annotations public to stimulate progress in the area of landmark recognition and image retrieval.
4 Regional Search and Aggregation
We present techniques that enhance image retrieval performance by utilizing bounding boxes predicted by a trained landmark detector. In particular, our approach builds on top of deep local features (DELF)  and aggregated selective match kernels (ASMK) , which were recently shown to achieve state-of-the-art performance on a large-scale image retrieval benchmark .
4.1 Regional Search
In this section, we consider image retrieval systems where regional descriptors are stored independently in the database. We adopt aggregated match kernels, originally introduced by Tolias \etal, which we briefly review next.
An image is described by a set $\mathcal{X}$ containing local descriptors, each of dimension $D$. A codebook $\mathcal{C}$ comprising $k$ visual words, learned using $k$-means, is used to quantize the descriptors. Denote $\mathcal{X}_c$ as the subset of descriptors from $\mathcal{X}$ which are assigned to visual word $c$ by the nearest neighbor quantizer $q(\cdot)$.
According to this framework, the similarity between two images, represented by local descriptor sets $\mathcal{X}$ and $\mathcal{Y}$, can be computed as:
\begin{equation}
K(\mathcal{X}, \mathcal{Y}) = \gamma(\mathcal{X})\,\gamma(\mathcal{Y}) \sum_{c \in \mathcal{C}} \sigma\!\left(\phi(\mathcal{X}_c)^\top \phi(\mathcal{Y}_c)\right)
\end{equation}
where $\phi(\cdot)$ is an aggregated vector representation, $\sigma(\cdot)$ denotes a scalar selectivity function and $\gamma(\cdot)$ is a normalization factor. This formulation encompasses popular local feature aggregation techniques, such as Bag-of-Words , VLAD  and ASMK .
In particular, for VLAD, $\sigma(u) = u$ and $\phi(\mathcal{X}_c)$ corresponds to an aggregated residual $V(\mathcal{X}_c) = \sum_{x \in \mathcal{X}_c} (x - c)$. For ASMK, $\sigma$ corresponds to a thresholded polynomial selectivity function
\begin{equation}
\sigma_\alpha(u) =
\begin{cases}
\operatorname{sign}(u)\,|u|^\alpha & \text{if } u > \tau \\
0 & \text{otherwise}
\end{cases}
\end{equation}
where usually $\alpha = 3$ and $\tau = 0$; and $\phi(\mathcal{X}_c)$ corresponds to a normalized aggregated residual $\hat{V}(\mathcal{X}_c) = V(\mathcal{X}_c) / \|V(\mathcal{X}_c)\|$.
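To make the kernel concrete, here is a minimal NumPy sketch of VLAD/ASMK-style aggregation and the resulting similarity. This is an illustrative re-implementation, not the authors' released code; the function names and the defaults used for the selectivity parameters are our choices.

```python
import numpy as np

def quantize(X, codebook):
    """Assign each local descriptor (row of X) to its nearest visual word."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def vlad(X, codebook):
    """Per-word sum of residuals (unnormalized VLAD): V(X_c) = sum(x - c)."""
    k, d = codebook.shape
    assign = quantize(X, codebook)
    V = np.zeros((k, d))
    for c in range(k):
        Xc = X[assign == c]
        if len(Xc):
            V[c] = (Xc - codebook[c]).sum(axis=0)
    return V

def asmk_phi(X, codebook):
    """ASMK aggregated representation: L2-normalize each word's residual."""
    V = vlad(X, codebook)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    nonzero = norms[:, 0] > 0
    V[nonzero] /= norms[nonzero]
    return V

def selectivity(u, alpha=3.0, tau=0.0):
    """Thresholded polynomial selectivity: sign(u)|u|^alpha if u > tau, else 0."""
    return np.where(u > tau, np.sign(u) * np.abs(u) ** alpha, 0.0)

def asmk_similarity(Vq, Vd, alpha=3.0, tau=0.0):
    """Sum over visual words of the selectivity of per-word dot products.
    Normalization factors gamma are omitted for brevity."""
    dots = (Vq * Vd).sum(axis=1)
    return float(selectivity(dots, alpha, tau).sum())
```

For an image compared against itself, each word with at least one assigned descriptor contributes exactly 1 (since its normalized residual has unit norm), so the self-similarity equals the number of active visual words.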
Now, consider comparing a query image $\mathcal{X}$ against a database of $N$ images $\mathcal{Y}_i$, $i \in \{1, \ldots, N\}$. We are mainly interested in the experimental configuration where a query contains a well-localized region-of-interest (\ie, the query in practice contains only one region), which is a common setting in image retrieval. For the $i$-th database image, $m_i$ regions are predicted by a landmark detector, defining the subimages $\mathcal{Y}_i^{(j)}$, $j \in \{1, \ldots, m_i\}$. We denote $\mathcal{Y}_i^{(0)}$ as the subimage corresponding to the original image, and always consider it as a valid region. To leverage uncluttered representations, we store aggregated descriptors independently for each subimage, which leads to a total of $\sum_{i=1}^{N}(m_i + 1)$ items in the database.
To compute the similarity between the query $\mathcal{X}$ and a database image $\mathcal{Y}$ with subimages $\mathcal{Y}^{(0)}, \ldots, \mathcal{Y}^{(m)}$, we consider max-pooling or average-pooling individual regional similarities, respectively:
\begin{equation}
S_{\max}(\mathcal{X}, \mathcal{Y}) = \max_{j \in \{0, \ldots, m\}} K(\mathcal{X}, \mathcal{Y}^{(j)})
\label{eq:max_pooling_similarity}
\end{equation}
\begin{equation}
S_{\mathrm{avg}}(\mathcal{X}, \mathcal{Y}) = \frac{1}{m+1} \sum_{j=0}^{m} K(\mathcal{X}, \mathcal{Y}^{(j)})
\label{eq:average_pooling_similarity}
\end{equation}
Max-pooling corresponds to assigning a database image’s score considering only its highest-scoring subimage. Average pooling, on the other hand, aggregates contributions from all subimages. These two variants are compared in \secrefsec:experiments.
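The two pooling variants can be sketched as follows; a plain inner product stands in here for the full aggregated match kernel, and the function name is ours.

```python
import numpy as np

def pooled_similarity(query_vec, region_vecs, pooling="max"):
    """Pool per-region similarities into a single image score.
    region_vecs[0] is, by convention, the whole-image representation;
    the inner product stands in for the aggregated match kernel K."""
    sims = np.array([float(query_vec @ r) for r in region_vecs])
    return float(sims.max()) if pooling == "max" else float(sims.mean())
```

Max-pooling scores the image by its best-matching subimage, while average-pooling dilutes the score of any single region by the number of regions.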
4.2 Regional Aggregated Match Kernels
Storing descriptors of each region independently in the database incurs additional cost for both storage and search computation. In this section, we consider utilizing the detected bounding boxes to instead improve the aggregated representations of database images – producing discriminative descriptors at no additional cost. We extend the aggregated match kernel framework of Tolias \etal to regional aggregated match kernels, as follows.
We start by noting that the average pooling similarity \equrefeq:average_pooling_similarity can be rewritten by swapping the sums over regions and visual words:
\begin{equation}
S_{\mathrm{avg}}(\mathcal{X}, \mathcal{Y}) = \frac{\gamma(\mathcal{X})}{m+1} \sum_{c \in \mathcal{C}} \sum_{j=0}^{m} \gamma(\mathcal{Y}^{(j)})\, \sigma\!\left(\phi(\mathcal{X}_c)^\top \phi(\mathcal{Y}^{(j)}_c)\right)
\end{equation}
Simple regional aggregation.
For VLAD, where $\sigma$ is the identity, this can be further expanded as:
\begin{equation}
S_{\mathrm{avg}}(\mathcal{X}, \mathcal{Y}) = \gamma(\mathcal{X}) \sum_{c \in \mathcal{C}} V(\mathcal{X}_c)^\top \phi_{\text{R-VLAD}}(\mathcal{Y}_c)
\label{eq:regional_aggregated_avg_vlad}
\end{equation}
where we define $\phi_{\text{R-VLAD}}(\mathcal{Y}_c) = \frac{1}{m+1} \sum_{j=0}^{m} \gamma(\mathcal{Y}^{(j)})\, V(\mathcal{Y}^{(j)}_c)$. This derivation indicates that average pooling of regional VLAD similarities can be performed using aggregated regional descriptors and does not require storage of each region’s representation separately.\footnote{Another way to see that this applies to VLAD kernels is to note that VLAD similarity is computed via a simple inner product, and that the average inner product with a set of vectors equals the inner product with the set average; \ie, for a vector $x$ and a set $\{y_j\}_{j=0}^{m}$, $\frac{1}{m+1}\sum_{j} x^\top y_j = x^\top \big(\frac{1}{m+1}\sum_{j} y_j\big)$.} We refer to this simple regional aggregated kernel as R-VLAD.
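The identity underlying R-VLAD is easy to verify numerically. The sketch below treats VLAD similarity as a plain inner product, folding normalization factors into the vectors; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
query = rng.normal(size=d)             # query VLAD (flattened across words)
regions = rng.normal(size=(5, d))      # per-region VLADs of one database image

# Average-pooling of per-region similarities...
avg_of_sims = float(np.mean(regions @ query))
# ...equals a single inner product with the averaged (R-VLAD) descriptor.
r_vlad = regions.mean(axis=0)
sim_with_aggregate = float(r_vlad @ query)

assert np.isclose(avg_of_sims, sim_with_aggregate)
```

Only `r_vlad` needs to be stored per image, which is why R-VLAD incurs no extra memory over a single-image representation.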
A similar derivation can be obtained for ASMK in the case where $\sigma$ is the identity function (\ie, no selectivity is applied), by replacing $V(\cdot)$ with $\hat{V}(\cdot)$ in \equrefeq:regional_aggregated_avg_vlad. A straightforward matching kernel using this idea would apply the selectivity function when comparing the query ASMK representation against this aggregated representation. We refer to this aggregation variant as Naive-R-ASMK.
Both the R-VLAD and Naive-R-ASMK kernels present an important problem when using many detected regions per image and large codebooks. For a given image region, most visual words will not be associated with any local feature, leading to many all-zero residuals for the region. For visual words that correspond to visual patterns observed in only a small number of regions, the averaging then produces substantially downweighted residuals. We propose to fix this weakness by developing the R-ASMK kernel as follows, inspired by the changes introduced by the original ASMK with respect to VLAD.
We define the R-ASMK similarity between a query $\mathcal{X}$ and a database image $\mathcal{Y}$ as:
\begin{equation}
K(\mathcal{X}, \mathcal{Y}) = \gamma(\mathcal{X})\,\gamma(\mathcal{Y}) \sum_{c \in \mathcal{C}} \sigma_\alpha\!\left(\hat{V}(\mathcal{X}_c)^\top \phi_{\text{R-ASMK}}(\mathcal{Y}_c)\right)
\end{equation}
where $\phi_{\text{R-ASMK}}(\mathcal{Y}_c)$ is the normalized regionally aggregated residual corresponding to visual word $c$, obtained by summing the per-region normalized residuals $\hat{V}(\mathcal{Y}^{(j)}_c)$ over $j$ and L2-normalizing the result.
The kernels we presented in this section can be regarded as different instantiations of a general regional aggregated match kernel (R-AMK), defined as follows:
\begin{equation}
K(\mathcal{X}, \mathcal{Y}) = \gamma(\mathcal{X})\,\gamma(\mathcal{Y}) \sum_{c \in \mathcal{C}} \sigma\!\left(\Phi\!\left(\{\mathcal{X}^{(j)}_c\}_j\right)^\top \Phi\!\left(\{\mathcal{Y}^{(j)}_c\}_j\right)\right)
\end{equation}
where $\{\mathcal{Y}^{(j)}_c\}_j$ denotes the sets of local descriptors quantized to visual word $c$, from each region of $\mathcal{Y}$. $\Phi$ specializes to $\phi_{\text{R-VLAD}}$ for R-VLAD, and to $\phi_{\text{R-ASMK}}$ for R-ASMK. Note that this definition involves regional aggregation for both images, while in this work we focus on the asymmetric case where regional aggregation is applied to the database image only. As previously mentioned, the asymmetric case is more relevant when the query image is itself a well-localized region-of-interest, which is a common setup in image retrieval benchmarks.
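A NumPy sketch of the database-side R-ASMK aggregation follows: each region's per-word residual is normalized before summing across regions, then re-normalized, so that words observed in only a few regions are not down-weighted. This is an illustration under our own naming, not the released implementation.

```python
import numpy as np

def per_word_residuals(X, codebook):
    """Aggregated residual per visual word (zero row if the word is unused)."""
    k, d = codebook.shape
    assign = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
    V = np.zeros((k, d))
    for c in range(k):
        if (assign == c).any():
            V[c] = (X[assign == c] - codebook[c]).sum(0)
    return V

def normalize_rows(V):
    """L2-normalize each row, leaving all-zero rows as zeros."""
    n = np.linalg.norm(V, axis=1, keepdims=True)
    return np.where(n > 0, V / np.maximum(n, 1e-12), 0.0)

def r_asmk(region_descs, codebook):
    """R-ASMK database-side representation: normalize each region's
    per-word residual, sum across regions, then re-normalize per word."""
    S = sum(normalize_rows(per_word_residuals(X, codebook))
            for X in region_descs)
    return normalize_rows(S)
```

At query time, the selectivity function would then be applied to the per-word dot products between the query's normalized residuals and this representation.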
For codebooks with a large number of visual words, the storage cost for such aggregated representations may be prohibitive. Binarization is an effective strategy to allow scalable retrieval in these cases. We adopt a similar binarization strategy as , where a binarized version of $\phi$ can be obtained by the elementwise $\operatorname{sign}(\cdot)$ function. We denote the binarized version by a superscript $\star$ (\eg, R-ASMK$^\star$ is the binarized version of R-ASMK).
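The binarization step can be sketched as follows; mapping ties at zero to $+1$ is a convention we choose here, and the per-word binary similarity is included only to illustrate that it reduces to a scaled Hamming comparison.

```python
import numpy as np

def binarize(V):
    """Elementwise sign of the aggregated representation;
    ties at zero are mapped to +1 (a convention choice)."""
    return np.where(V >= 0, 1.0, -1.0)

def binary_word_similarity(bq, bd):
    """Normalized inner product of {-1,+1}^D codes;
    equals 1 - 2 * (Hamming distance) / D."""
    D = bq.shape[-1]
    return float((bq * bd).sum()) / D
```

In practice the binary codes would be packed into machine words so the inner product becomes XOR plus popcount.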
5 Experiments

We present two types of experiments: first, landmark detection, to assess the quality of object detector models trained on the new dataset. Second, we utilize the detected landmarks to enhance image retrieval systems.
5.1 Landmark Detection
We train two types of detection models on the bounding box data collected and described in \secrefsec:data: a single-shot MobileNet-V2-based  SSD detector  and a two-stage ResNet-50-based  Faster-RCNN . The standard object detection evaluation metric, Average Precision (AP) measured at a fixed Intersection-over-Union ratio, is used during evaluation. Both models reach about AP on the validation set within steps (, respectively). The models are trained with the publicly available Tensorflow Object Detection API . The results indicate that accurate landmark localization models can be trained using our dataset. The MobileNet-V2-SSD variant runs at ms per image, while the ResNet-50-Faster-RCNN runs at ms, both measured on a TitanX GPU. Please refer to the appendix for details on learning curves and visualizations of detection results compared against ground-truth.
5.2 Image Retrieval
We perform regional search and regional aggregation experiments. The following describes the experimental setup.
We use the Oxford  and Paris  datasets, which have recently been revisited to correct annotation mistakes, add new query images and introduce new evaluation protocols ; the datasets are referred to as Oxf and Par, respectively. There are query images for each dataset, with () database images in the Oxf (Par) dataset. We report results on the Medium and Hard setups, as is common practice; for ablations, we focus more specifically on the Hard setup. Performance is measured using standard metrics for these datasets: mean average precision (mAP) and mean precision at rank 10 (mP@10).
We use the following setup in our experiments, except where indicated otherwise. The released DELF model  (pre-trained on the dataset from ) is used, with the default configuration (maximum of features per region are extracted, with a required minimum attention score of ), except that the feature dimensionality is set to as in previous work . A -sized codebook is used when computing aggregated kernels; as is common practice, codebooks are trained on Oxf for retrieval experiments on Par, and vice-versa. We focus on improving the core image representations for retrieval, and do not consider query expansion (QE)  techniques such as Hamming QE , QE  or diffusion [13, 12]; these methods could be incorporated into our system to obtain even stronger retrieval performance.
Region selection techniques.
For our Detect-to-Retrieve (D2R) framework, we adopt the trained Faster R-CNN detector described in \secrefsubsec:landmark_detection. We compare against previously proposed region selection techniques for image retrieval: the uniform grid from [31, 39] (denoted RMACB, for “RMAC boxes”) and Selective Search (SS) [40, 36]. To vary the number of regions per image, we do as follows: (i) for D2R, we vary the landmark detector threshold; (ii) for RMACB, we sweep the number of levels from to ; (iii) for SS, we select the top boxes per image (as in this case there are no confidence scores associated to regions). For all region selection techniques, we add the original image as one of the selected regions.
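For D2R, varying the number of regions per image amounts to sweeping the detector's score threshold, always keeping the whole image as a valid region. A minimal sketch (names are ours):

```python
import numpy as np

def select_regions(boxes, scores, image_box, threshold=0.5):
    """Keep detections scoring at least `threshold`; the whole image
    (`image_box`) is always included as the first region.
    Lowering the threshold yields more regions per image."""
    keep = np.asarray(scores) >= threshold
    return [image_box] + [b for b, k in zip(boxes, keep) if k]
```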
We implemented the aggregated kernel framework in Python. As a comparison against the reference MATLAB implementation , our ASMK with a -sized codebook and DELF features obtains mAP on the Oxf-Hard dataset, while the reference implementation obtains . Note that the reference binarized implementation uses a similar configuration as Hamming Embedding (HE) , with a projection matrix before binarization, residuals computed with respect to the median, and IDF. We did not find consistent improvements using these, so we use the simpler version as described in \secrefsec:method. Similarly, the reference implementation uses multiple visual word assignments, but our preliminary experiments show improved results using single assignment, making retrieval faster and simpler – therefore we adopt single assignment in our experiments. We extend the implementation to support our regional search and aggregation techniques, and plan to release code to foster reproducibility of results.
5.2.1 Regional Search
[Table: mAP and mP@10 comparison of regional search and regional aggregation variants on Oxf-Hard and Par-Hard; rows include DELF–ASMK+SP [25, 28] and our DELF-GLD variants.]
We compare aggregated match kernels, region selection techniques and similarity computation methods on the Oxf-Hard dataset. When performing regional search, multiple regions are selected per image and stored independently in the database, leading to increased memory cost. \figreffig:box_comparison presents results for ASMK variants, where all techniques use max-pooling similarity from \equrefeq:max_pooling_similarity, except for D2R-ASMK, which uses average-pooling similarity from \equrefeq:average_pooling_similarity. Combining our proposed D2R regions with ASMK enhances mAP by when using an average of regions per image.
We compare the different region selection approaches using ASMK. Our D2R-ASMK achieves mAP when using regions per image, improvement of over the single-image ASMK baseline. Other region selection approaches improve retrieval accuracy, but with significantly larger memory requirements. RMACB-ASMK requires regions/image to achieve mAP (this is mAP below the previously mentioned D2R-ASMK operating point, despite requiring more memory). SS-ASMK benefits from some regions, while performance decreases when a large number of regions are selected, since many of those regions are irrelevant.
Average pooling of individual regional similarities improves upon the single-image baseline significantly, with low memory overhead: D2R-ASMK achieves mAP with only storage cost. Note that in this case performance drops significantly as more regions are added, since irrelevant regional similarities are added into the final image similarity. We also experimented with a D2R-VLAD representation: mAP improves from (single-image) to ( regions/image).
\tabreftab:search_aggregation_comparison further presents D2R-ASMK results on the Par-Hard dataset. Regional search enables mAP improvement at regions/image. Note that our D2R approach is effective even though the landmarks in the Landmark Boxes dataset exhibit much larger variability than those encountered in the Oxf/Par datasets.
5.2.2 Regional Aggregated Match Kernels
In this section, we evaluate the proposed regional aggregated match kernels. In this experiment, region selection is used to produce an improved image representation, with no additional memory requirements for the retrieval system. \figreffig:box_comparison_2 compares different aggregation methods and region selection approaches, on the Oxf-Hard dataset. Both our proposed D2R-R-ASMK and D2R-R-ASMK$^\star$ variants achieve substantial improvements compared to their baselines, which do not use boxes for aggregation: and absolute mAP improvements, respectively. We also compare our D2R approach against other region selection methods. RMACB and SS improve upon the baseline, though with a limited gain of at most mAP.
More interestingly, our proposed kernels outperform even the regional search configuration where each region is indexed separately in the database. \tabreftab:search_aggregation_comparison compiles experimental results on Oxf-Hard and Par-Hard. Our D2R-R-ASMK method outperforms the best regional search variant on both datasets, respectively by and absolute mAP, with storage savings of and .
In another ablation experiment, we assess the performance of simpler regional aggregation methods: R-VLAD and Naive-R-ASMK. We use the trained detector to select regions. For R-VLAD, mAP on Oxf improves from (single-image) to when using regions per image, but degrades quickly as more regions are considered. In particular, when setting a very low detection threshold () to obtain regions per image, performance degenerates to mAP – this agrees with the intuition that a large number of regions is detrimental to R-VLAD. For Naive-R-ASMK, no improvement is obtained when detected regions are used: mAP drops from to when regions per image are used, and similarly degenerates to when using regions per image. In comparison, using the same detection threshold of , R-ASMK obtains mAP, \ie, performance is high even if using a large number of regions, due to the improved aggregation technique.
5.2.3 Comparison Against State-of-the-Art
We compare our D2R-R-ASMK technique against state-of-the-art image retrieval systems. To make our system comparable with previously published results , for this experiment we use a codebook with visual words. We also further experiment with re-training the DELF local feature on the Google Landmarks dataset (denoted as DELF-GLD). Spatial verification (SP) is used to re-rank the top database images (we use RANSAC with an Affine model), again following the protocol from previous work .
Table 2 presents experimental results on Oxf and Par, using the Medium and Hard protocols. Our proposed D2R-R-ASMK representation by itself, without spatial verification, already improves mAP when compared against all previously published results. SP further boosts performance by about mAP on Oxf. Surprisingly, it actually degrades performance on the Par dataset, by about . Re-training DELF on GLD improves performance by around . Our best results improve upon the previous state-of-the-art by mAP on Oxf-Medium, mAP on Par-Medium, mAP on Oxf-Hard and on Par-Hard.
5.3 Discussion

Our experiments demonstrate that selecting relevant image regions can help boost image retrieval performance significantly. In our regional aggregation method, the detected regions allow for effective re-weighting of local feature contributions, emphasizing relevant visual patterns in the final image representation. Note, however, that it is crucial to perform both region selection and regional aggregation in a suitable manner. If the selected regions are not relevant to the objects of interest, regional aggregation cannot be very effective, as shown in \figreffig:box_comparison_2. Also, our experiments with naive versions of regional aggregation indicate that the aggregation needs to be performed in the right way: this is reflected in the poor R-VLAD and Naive-R-ASMK results.
It may initially seem unintuitive that the regional search method underperforms when compared to our regional aggregation technique. However, this can be understood by observing some retrieval result patterns, which are presented in \figreffig:qualitative. The addition of separate regional representations to the database may help retrieval of relevant small objects in cluttered scenes, as illustrated with the successful bottom-right D2R-ASMK retrieved image. However, it also increases the chances of finding localized regions which are similar but do not correspond to the same landmark, as illustrated with the top two cases.
Regional aggregation, on the other hand, can help retrieval by re-balancing the visual information presented in an image. The top-right D2R-R-ASMK result shows a database image where the detected boxes do not precisely cover the query object; instead, several selected regions cover it, and consequently its features are boosted. A similar case is illustrated in the bottom-left example, where the main detected region in the database image does not cover the object of interest entirely. The features inside the main box are boosted but those outside are also used, generating a more suitable representation for image retrieval.
6 Conclusions

In this paper, we present an efficient regional aggregation method for image retrieval. We first introduce a dataset of landmark bounding boxes, and show that landmark detectors can be trained and leveraged for extracting regional representations. Regional search using our detectors not only provides superior retrieval performance but also much better efficiency than existing regional methods. In addition, we propose a novel regional aggregated match kernel framework that further boosts the retrieval accuracy with no increase in memory. Our full system achieves state-of-the-art performance by a large margin on two image retrieval datasets.
Appendix A. Discussion: Detection Helps Finding Relevant Features
In this section, we analyze the detector’s ability to focus on relevant landmarks by empirically estimating the proportion of relevant local features located inside or outside predicted bounding boxes. We extract and match local features for image pairs that are known to depict the same landmark. A local feature is declared to be relevant if it is an inlier to a high-confidence estimated geometric transformation.
More specifically, we use DELF local features  and a Faster-RCNN  landmark detector trained on our new dataset. Image pairs are collected from the Google Landmarks dataset . Local feature matching is performed via nearest neighbor search followed by geometric verification (RANSAC  with an affine model). \figreffig:plot_matching plots the relevance probabilities as a function of the DELF local feature attention scores (these attention scores can be interpreted as a measure of a local feature’s “landmarkness”). The blue curve denotes features that are located within bounding boxes, while the red curve represents features located outside bounding boxes.
The curves show that local features located within bounding boxes are much more likely to be relevant: for two features with the same attention score, the relevance probability for a feature located within a predicted box is approximately to larger than that for a local feature located outside the box. Note how feature relevance increases with attention scores, as expected, but the predicted boxes can provide important extra information to effectively select the best features. This can be interpreted as the merging of two information streams: bottom-up (DELF attention scores estimate per-local feature relevance) and top-down (landmark detector estimates relevance of large regions).
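The analysis above can be sketched as follows, assuming matched keypoints, their reprojection errors under the estimated affine model, and the predicted boxes are already computed; all names and the inlier threshold are our illustrative choices.

```python
import numpy as np

def relevance_by_location(pts, errors, boxes, inlier_thresh=4.0):
    """Estimate how often features are geometric inliers, split by
    whether they fall inside any predicted bounding box.
    pts: (n, 2) keypoint locations; errors: (n,) reprojection errors
    under the estimated affine model; boxes: list of (x0, y0, x1, y1)."""
    inside = np.zeros(len(pts), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside |= ((pts[:, 0] >= x0) & (pts[:, 0] <= x1) &
                   (pts[:, 1] >= y0) & (pts[:, 1] <= y1))
    relevant = errors < inlier_thresh
    p_in = float(relevant[inside].mean()) if inside.any() else float("nan")
    p_out = float(relevant[~inside].mean()) if (~inside).any() else float("nan")
    return p_in, p_out
```

Aggregating these two rates over many verified image pairs, binned by attention score, yields curves analogous to \figreffig:plot_matching.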
Our proposed R-ASMK can be seen as a local feature re-weighting mechanism, which favors features located within detected regions. The experimental results obtained on the Oxford and Paris datasets (as presented in the main paper) confirm that re-weighting features within detected regions boosts image retrieval performance substantially.
Appendix B. Detection Experiments
We present learning curves for the trained detectors, and examples of detected regions compared to ground-truth.
We train both Faster-RCNN and SSD based object detection models on our dataset. \figreffig:map_curves compares the learning progression of the two models. Both models converge to around 85% mAP within 600k training steps. The MobileNet-V2 SSD model trains much faster than the ResNet-50 Faster-RCNN, due to its much smaller model size and larger batch size ( vs. , respectively). We also observe that the SSD-based model slightly outperforms the Faster-RCNN-based model despite having a smaller/weaker feature extractor. We conjecture that the advantage is due to the multi-scale feature map of SSD capturing the landmarks at different scales better than Faster-RCNN, which operates on a single feature map.
To illustrate the effectiveness of our trained detectors, we present examples of detections using the SSD model. \figreffig:gld_detection_examples shows examples for a variety of landmarks under different scales, occlusion and lighting conditions. In addition, we also show some failure cases in \figreffig:gld_detection_failures_examples, where the object of interest has an ambiguous semantic boundary (resulting in double detections) or is very hard to distinguish from the scene (resulting in missed detections). For both figures, only detections with confidence greater than are shown.
Appendix C. Region Selection Comparison
In this section, we present landmark detection results on the Oxford and Paris datasets (\figref{fig:roxford_detection} and \figref{fig:rparis_detection}, respectively), comparing against the regions selected by competing approaches (RMAC boxes and Selective Search). The three methods are configured to produce a roughly similar number of regions per image: D2R with a fixed detection threshold, RMAC boxes with a fixed number of levels, and Selective Search with a fixed number of selected regions per image. Note that our image retrieval experiments always use the whole image as a valid region, but in these visualizations we do not draw a box around the whole image, for a more concise presentation.
The figures show that our trained landmark detector tends to focus on the most prominent landmark regions in the image. RMAC boxes correspond to a fixed multi-scale grid, where the selected regions depend only on the input image size, not on its contents. This leads to regularly spaced boxes which usually do not overlap well with landmarks. Selective Search produces boxes corresponding to prominent objects in the scene, which may or may not correspond to landmarks.
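The content-independence of RMAC boxes can be made concrete with a simplified sketch of the region sampling: square regions at several scales, placed on a uniform grid with a fixed fractional overlap. This is written in the spirit of the R-MAC scheme of Tolias et al., not a faithful reimplementation; the `levels` and `overlap` defaults are illustrative.

```python
import math

def grid_boxes(width, height, levels=3, overlap=0.4):
    """Fixed multi-scale grid of square regions (simplified R-MAC style).

    At level l, region size is 2*min(W, H)/(l+1); regions are placed
    uniformly so consecutive ones overlap by at least `overlap`.
    The output depends only on the image size, never on its content.
    """
    boxes = []
    for l in range(1, levels + 1):
        size = 2.0 * min(width, height) / (l + 1)
        step = size * (1.0 - overlap)

        def positions(dim):
            # Uniformly spaced offsets from 0 to dim - size.
            if size >= dim:
                return [0.0]
            n = int(math.ceil((dim - size) / step)) + 1
            return [i * (dim - size) / (n - 1) for i in range(n)]

        for x in positions(width):
            for y in positions(height):
                boxes.append((x, y, x + size, y + size))
    return boxes
```

For a square image, level 1 yields a single region covering the whole image, and deeper levels tile it with progressively smaller, overlapping squares, which is exactly why the resulting boxes track the image frame rather than any landmark.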
-  R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proc. CVPR, 2016.
-  R. Arandjelović and A. Zisserman. All About VLAD. In Proc. CVPR, 2013.
-  Y. Avrithis and G. Tolias. Hough Pyramid Matching: Speeded-up Geometry Re-ranking for Large Scale Image Retrieval. IJCV, 2014.
-  A. Babenko and V. Lempitsky. Aggregating Local Deep Features for Image Retrieval. In Proc. ICCV, 2015.
-  A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural Codes for Image Retrieval. In Proc. ECCV, 2014.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008.
-  O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. In Proc. ICCV, 2007.
-  M. Fischler and R. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981.
-  A. Gordo, J. Almazan, J. Revaud, and D. Larlus. Deep Image Retrieval: Learning Global Representations for Image Search. In Proc. ECCV, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proc. CVPR, 2016.
-  J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/Accuracy Trade-offs for Modern Convolutional Object Detectors. In Proc. CVPR, 2017.
-  A. Iscen, Y. Avrithis, G. Tolias, T. Furon, and O. Chum. Fast Spectral Ranking for Similarity Search. In Proc. CVPR, 2018.
-  A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum. Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations. In Proc. CVPR, 2017.
-  H. Jégou, M. Douze, and C. Schmid. Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search. In Proc. ECCV, 2008.
-  H. Jégou, M. Douze, C. Schmid, and P. Perez. Aggregating Local Descriptors into a Compact Image Representation. In Proc. CVPR, 2010.
-  H. Jégou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating Local Image Descriptors into Compact Codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
-  H. J. Kim, E. Dunn, and J.-M. Frahm. Predicting Good Features for Image Geo-Localization Using Per-Bundle VLAD. In Proc. ICCV, 2015.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.
-  A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale. arXiv:1811.00982, 2018.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot Multibox Detector. In Proc. ECCV, 2016.
-  D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004.
-  J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions. Image and Vision Computing, 2004.
-  A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas. Working Hard to Know your Neighbor’s Margins: Local Descriptor Learning Loss. In Proc. NIPS, 2017.
-  D. Mishkin, F. Radenovic, and J. Matas. Repeatability Is Not Enough: Learning Affine Regions via Discriminability. In Proc. ECCV, 2018.
-  H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proc. ICCV, 2017.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In Proc. CVPR, 2007.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. In Proc. CVPR, 2008.
-  F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. In Proc. CVPR, 2018.
-  F. Radenović, G. Tolias, and O. Chum. CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples. In Proc. ECCV, 2016.
-  F. Radenović, G. Tolias, and O. Chum. Fine-Tuning CNN Image Retrieval with no Human Annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki. Visual Instance Retrieval with Deep Convolutional Networks. ITE Transactions on Media Technology and Applications, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proc. NIPS, 2015.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. In Proc. CVPR, 2018.
-  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR, 2015.
-  J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proc. ICCV, 2003.
-  R. Tao, E. Gavves, C. G. M. Snoek, and A. W. M. Smeulders. Locality in Generic Instance Search from One Example. In Proc. CVPR, 2014.
-  G. Tolias, Y. Avrithis, and H. Jégou. Image Search with Selective Match Kernels: Aggregation Across Single and Multiple Images. International Journal of Computer Vision, 2015.
-  G. Tolias and H. Jegou. Visual Query Expansion with or without Geometry: Refining Local Descriptors by Feature Aggregation. Pattern Recognition, 2014.
-  G. Tolias, R. Sicre, and H. Jégou. Particular Object Retrieval with Integral Max-Pooling of CNN Activations. In Proc. ICLR, 2015.
-  J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.