Detect-to-Retrieve: Efficient Regional Aggregation for Image Search

Marvin Teichmann*
University of Cambridge, UK

André Araujo*, Menglong Zhu, Jack Sim
Google AI, USA
{andrearaujo, menglong, jacksim}

* Both authors contributed equally to this work.

Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes images with manually curated boxes from unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we further introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially at no additional memory cost, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.


1 Introduction

In this paper, we address the image retrieval problem: given a query image, a system should efficiently retrieve similar images from a database. Image retrieval systems are usually composed of two main stages: (1) filtering, where an efficient technique ranks database images according to their similarity with respect to the query; (2) re-ranking, where a small number of the most similar database images from the first stage are inspected in more detail and re-ranked.

Figure 1: Overview of our proposed regional aggregation method. Deep local features (stars) and object regions (boxes) are extracted from an image. Regional aggregation proceeds in two steps, using a large codebook of visual words (red and yellow visual words are depicted): first, per-region VLAD description; second, sum pooling and per-visual-word normalization. Our final regionally aggregated image representation can be combined with selective match kernels to provide improved image similarity estimation: we refer to this technique as regional aggregated selective match kernels (R-ASMK). It leverages detected regions to improve image retrieval at no additional memory cost compared to the original ASMK method [37].

Traditionally, hand-crafted local features [21, 6] were coupled with Bag-of-Words-inspired techniques [35, 26, 27, 14, 15, 16, 37] to construct high-dimensional representations used in the filtering step. Local feature matching and geometric verification [26, 27, 3] (commonly using RANSAC [8]) have been used as effective re-ranking strategies. Recently, several deep learning techniques have been proposed for these two stages. Global image representations based on convolutional neural networks (CNN) can produce compact embeddings to enable fast similarity computation in the filtering step [5, 4, 39, 1, 9, 30]. Local image representations can also be extracted using CNNs, and are suitable for re-ranking via spatial matching and geometric verification [25, 24, 23].

Today’s image retrieval systems tend to fail when relevant objects do not occupy a large enough fraction of database images, typically in cluttered scenes. Often, these objects produce local features that can be used to find local matches against the query in the re-ranking stage. However, such cluttered images usually fail to reach the re-ranking stage, since their initial representation does not lead to high similarity when compared to the query during the filtering stage. The most common solution to estimate an improved similarity with respect to the query image is to extract and separately store image representations for regions-of-interest in the database, using a fixed regional grid [2, 31] or a class-agnostic detector [36, 17]. However, the existing region selection techniques produce a large number of irrelevant regions. In a recent large-scale experimental image retrieval evaluation, Radenovic et al. [28] concluded that such regional search approaches impose too high a cost in terms of memory and latency, with only small accuracy gains.


(1) Our first contribution is aimed at improving region selection: we introduce a dataset of manually boxed landmark images, with images from unique classes, and we show that detectors can be trained for robust landmark localization. (2) Our second contribution is to leverage the trained detector and produce more efficient regional search systems, which improve accuracy for small objects with only a modest increase to the database size – much more efficiently than previously proposed techniques. (3) In our third contribution, we propose regional aggregated match kernels to leverage selected image regions and produce a discriminative image representation, illustrated in Figure 1. This new representation outperforms regional search systems significantly, while at the same time being more efficient: only one descriptor needs to be stored per image. Our image retrieval system outperforms previously published results by absolute mean average precision on the Revisited Oxford-Hard dataset, and on the Revisited Paris-Hard dataset [28]. Towards the goal of facilitating further research, we will release both code and data.

2 Related Work


To the best of our knowledge, no manually curated datasets of landmark bounding boxes exist. Gordo et al. [9] use SIFT [21] matching to estimate boxes in landmark images. Such boxes are biased towards the feature extraction and matching technique, and may contain localization errors. Their dataset contains boxed images, from landmarks. In comparison, we use human raters to annotate the regions of interest, and produce boxed images from landmarks. The OpenImages dataset [19] contains images, annotated with generic object bounding boxes. Some of them may be considered landmarks, for example: buildings, towers, skyscrapers, billboards. However, these classes make up only a small fraction of the entire dataset.

Regional search and aggregation.

Region selection has been explored in image retrieval systems. They have been used with two different purposes: (i) regional search: selected regions are encoded independently in the database, allowing for retrieval of subimages; (ii) regional aggregation: selected regions are used to improve image representations. In the following, we review these two types of approaches.

Regional search. Many papers propose to describe regions using VLAD [15] or Fisher Vectors [16]: Arandjelovic and Zisserman [2] use a multi-scale grid to extract regions per image; Tao et al. [36] use Selective Search [40] with thousands of regions per image; Kim et al. [17] use maximally stable extremal regions (MSER) [22]. Razavian et al. [31] use a multi-scale grid with regions per image, and compute the similarity of two images by taking into account the distances between all region pairs. Iscen et al. [13, 12] leverage multi-scale grids in conjunction with CNN features [29], to enable query expansion via diffusion. More recently, Radenovic et al. [28] performed a comprehensive evaluation of retrieval techniques and concluded that existing regional search methods may improve recognition accuracy, but at significantly larger memory and complexity costs. In contrast, our Detect-to-Retrieve framework aims at efficient regional search via the use of a custom trained detector.

Regional aggregation. Tolias et al. [39] leverage the grid structure from [31] to pool pretrained CNN features [18, 34] into compact representations; approximately regions are selected per image. Radenovic et al. [29] build upon [39] by re-training features on a dataset collected in an unsupervised manner. Gordo et al. [9] train a region proposal network [32] from semi-automatic bounding box annotations, to replace the grid from [39]. Hundreds of regions per image are considered in this case. Our work departs from these papers by using a small set of regions (fewer than per image), and by formulating regional aggregation as a new match kernel (instead of regional sum-pooling as in [39, 9]).

3 Landmark Boxes Dataset

Figure 2: Examples of annotated images from our Landmark Boxes dataset. A box is drawn around the most prominent landmark depicted in the image. The dataset contains a wide variety of objects, ranging from man-made to natural landmarks.

In this section, we introduce our newly collected Landmark Boxes dataset, describing the manual box annotation process. Our work builds upon the recent Google Landmarks dataset (GLD) [25], whose training set contains images of unique landmarks, with a wide variety of objects including buildings, monuments, bridges, statues as well as natural landmarks such as mountains, lakes and waterfalls.

Each image in this dataset is considered to only depict one landmark. In some cases, a landmark may consist of a set of buildings: for example, skylines, which are common in this dataset, are considered as a single landmark. Since GLD is collected in a semi-automatic manner considering popular touristic locations, it is sometimes ambiguous what the landmark of interest may be. When collecting bounding box annotations, our goal is to capture the most prominent landmark in the image, according to the fact that each image is only assigned one landmark label. Each box should reflect the main object (or set of objects) which is showcased in each dataset image. For this reason, we instructed human operators to draw at most one box per image.

One of the main challenges in such a fine-grained dataset is the inherent long tail in the number of image samples per class. In GLD, some landmarks are associated with several thousands of images, while for about half of the classes only or fewer images are provided. Our goal is to represent landmarks in a balanced manner in our new dataset, such that trained detectors are able to localize a wide variety of objects. For this reason, we first separate part of the training set into a validation set. We randomly select four training and four validation images per landmark. In total, this yields and boxed images for training and validation, respectively. Note that this means that for about of landmarks, all available images are annotated.

Examples of annotated images are shown in Figure 2. In some cases, it is not possible to identify a prominent landmark (see Figure 3): the landmark of interest may be occluded, or the image may actually show the surroundings of a landmark. We remove such corner cases from our dataset (this applied to about of images which were initially selected). We will make all annotations public to stimulate progress in the area of landmark recognition and image retrieval.

Figure 3: Examples of Google Landmarks dataset images which do not depict a prominent landmark. In such cases (about of images), no boxes were drawn, and the images were not included in the Landmark Boxes dataset.

4 Regional Search and Aggregation

We present techniques that enhance image retrieval performance by utilizing bounding boxes predicted by a trained landmark detector. In particular, our approach builds on top of deep local features (DELF) [25] and aggregated selective match kernels (ASMK) [37], which were recently shown to achieve state-of-the-art performance on a large-scale image retrieval benchmark [28].

4.1 Regional Search

In this section, we consider image retrieval systems where regional descriptors are stored independently in the database. We adopt aggregated match kernels, originally introduced by Tolias et al. [37], which we briefly review next.

An image is described by a set containing local descriptors, each of dimension . A codebook comprising visual words, learned using $k$-means, is used to quantize the descriptors. Denote as the subset of descriptors from which are assigned to visual word by the nearest neighbor quantizer .
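The quantization step described above can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation; the function names `quantize` and `group_by_visual_word` and the toy two-word codebook are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(descriptor, codebook):
    """Nearest-neighbor quantizer: index of the closest visual word."""
    return min(range(len(codebook)),
               key=lambda c: squared_distance(descriptor, codebook[c]))


def group_by_visual_word(descriptors, codebook):
    """Map each visual word index to the descriptors assigned to it."""
    groups = defaultdict(list)
    for d in descriptors:
        groups[quantize(d, codebook)].append(d)
    return dict(groups)


# Toy example: a 2-word codebook in 2-D (all values are made up).
codebook = [(0.0, 0.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (0.2, -0.1)]
groups = group_by_visual_word(descriptors, codebook)
```

Visual words that receive no descriptor are simply absent from `groups`, which mirrors the sparsity of aggregated representations when the codebook is large.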

According to this framework, the similarity between two images and , represented by local descriptor sets and , can be computed as:


where is an aggregated vector representation, denotes a scalar selectivity function and is a normalization factor. This formulation encompasses popular local feature aggregation techniques, such as Bag-of-Words [35], VLAD [15] and ASMK [37].
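For reference, the general aggregated match kernel of [37] can be written as below. The symbols are notation assumed here for illustration: $\mathcal{X}$ and $\mathcal{Y}$ are the two descriptor sets, $\mathcal{C}$ the codebook, $\mathcal{X}_c$ the descriptors of $\mathcal{X}$ assigned to visual word $c$, $\Phi$ the aggregated vector representation, $\sigma$ the selectivity function and $\gamma$ the normalization factor, commonly chosen so that an image has unit similarity with itself.

```latex
\begin{align}
K(\mathcal{X}, \mathcal{Y}) &= \gamma(\mathcal{X})\,\gamma(\mathcal{Y})
  \sum_{c \in \mathcal{C}} \sigma\!\left( \Phi(\mathcal{X}_c)^{\top} \Phi(\mathcal{Y}_c) \right), \\
\gamma(\mathcal{X}) &= \left( \sum_{c \in \mathcal{C}}
  \sigma\!\left( \Phi(\mathcal{X}_c)^{\top} \Phi(\mathcal{X}_c) \right) \right)^{-1/2}.
\end{align}
```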

In particular, for VLAD, and corresponds to an aggregated residual . For ASMK, corresponds to a thresholded polynomial selectivity function


where usually and ; and corresponds to a normalized aggregated residual .
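The VLAD and ASMK instantiations can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the reference implementation: the function names are hypothetical, the global normalization factor is omitted, and the defaults `alpha=3.0` and `tau=0.0` are the values commonly used with [37], assumed here.

```python
import math


def aggregate_residuals(groups, codebook):
    """Per visual word, sum the residuals (descriptor minus its visual word)."""
    aggregated = {}
    for c, descriptors in groups.items():
        word = codebook[c]
        aggregated[c] = [sum(d[i] - word[i] for d in descriptors)
                         for i in range(len(word))]
    return aggregated


def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def selectivity(u, alpha=3.0, tau=0.0):
    """Thresholded polynomial: sign(u) * |u|**alpha if u > tau, else 0."""
    return math.copysign(abs(u) ** alpha, u) if u > tau else 0.0


def asmk_similarity(agg_x, agg_y, alpha=3.0, tau=0.0):
    """Unnormalized ASMK score over visual words present in both images."""
    score = 0.0
    for c in agg_x.keys() & agg_y.keys():
        vx, vy = l2_normalize(agg_x[c]), l2_normalize(agg_y[c])
        score += selectivity(sum(a * b for a, b in zip(vx, vy)), alpha, tau)
    return score


# Toy aggregated residuals, keyed by visual word index (values are made up).
agg_x = {0: [1.0, 0.0], 1: [0.0, 2.0]}
agg_y = {0: [2.0, 0.0], 2: [1.0, 1.0]}
score = asmk_similarity(agg_x, agg_y)
```

Only visual word 0 is shared by the two toy images, so it is the only word contributing to `score`. Because each residual is normalized per visual word, the magnitude of the aggregated residual does not affect the result.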

Now, consider comparing a query image against a database of images , . We are mainly interested in the experimental configuration where a query contains a well-localized region-of-interest (i.e., the query in practice contains only one region), which is a common setting in image retrieval. For the -th database image, regions are predicted by a landmark detector, defining the subimages . We denote as the subimage corresponding to the original image, and always consider it as a valid region. To leverage uncluttered representations, we store aggregated descriptors independently for each subimage, which leads to a total of items in the database.

To compute the similarity between the query and a database image , we consider max-pooling or average-pooling individual regional similarities, respectively:


Max-pooling corresponds to assigning a database image’s score considering only its highest-scoring subimage. Average pooling, on the other hand, aggregates contributions from all subimages. These two variants are compared in Section 5.
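The two pooling variants can be sketched as follows. This is an illustrative example only: the score values and image identifiers are made up, and each list holds hypothetical kernel scores between one query and the subimages of one database image, with the first entry standing for the whole image.

```python
def max_pooling_similarity(regional_scores):
    """Image score = score of its best-matching subimage."""
    return max(regional_scores)


def average_pooling_similarity(regional_scores):
    """Image score = mean score over all of its subimages."""
    return sum(regional_scores) / len(regional_scores)


def rank_database(db_scores, pooling):
    """Sort database image ids by pooled similarity, highest first."""
    return sorted(db_scores, key=lambda k: pooling(db_scores[k]), reverse=True)


# Hypothetical per-subimage scores for two database images.
db_scores = {"img_a": [0.10, 0.45, 0.05], "img_b": [0.30, 0.25]}
ranking_max = rank_database(db_scores, max_pooling_similarity)
ranking_avg = rank_database(db_scores, average_pooling_similarity)
```

In this toy case the two poolings rank the images differently: one strongly matching region is enough under max-pooling, while average pooling penalizes an image whose other regions match poorly.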

4.2 Regional Aggregated Match Kernels

Storing descriptors of each region independently in the database incurs additional cost for both storage and search computation. In this section, we consider utilizing the detected bounding boxes to instead improve the aggregated representations of database images – producing discriminative descriptors at no additional cost. We extend the aggregated match kernel framework of Tolias et al. [37] to regional aggregated match kernels, as follows.

Simple regional aggregation.

We start by noting that the average-pooling similarity defined in Section 4.1 can be rewritten as:

For VLAD, this can be further expanded as:


where we define . Using this definition, note that . This derivation indicates that average pooling of regional VLAD similarities can be performed using aggregated regional descriptors and does not require storage of each region’s representation separately. (Another way to see that this applies to VLAD kernels is to note that VLAD similarity is computed via a simple inner product, and that the average inner product with a set of vectors equals the inner product with the set average; i.e., for a vector $a$ and a set $\{b_i\}_{i=1}^{N}$, $\frac{1}{N}\sum_{i} a^{\top} b_i = a^{\top}\big(\frac{1}{N}\sum_{i} b_i\big)$.) We refer to this simple regional aggregated kernel as R-VLAD.
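The rewriting behind R-VLAD can be spelled out as follows. The notation is assumed for illustration and may differ from the original symbols: $\mathcal{X}$ is the query descriptor set, $\mathcal{Y}^{(r)}$ for $r = 1, \dots, R$ are the subimages of one database image (the whole image counted as one of them), $V(\cdot)$ is the aggregated VLAD residual, and $\gamma$ is the normalization factor.

```latex
\begin{align}
\frac{1}{R} \sum_{r=1}^{R} K\!\left(\mathcal{X}, \mathcal{Y}^{(r)}\right)
 &= \frac{1}{R} \sum_{r=1}^{R} \gamma(\mathcal{X})\, \gamma\!\left(\mathcal{Y}^{(r)}\right)
    \sum_{c \in \mathcal{C}} V(\mathcal{X}_c)^{\top} V\!\left(\mathcal{Y}^{(r)}_c\right) \\
 &= \gamma(\mathcal{X}) \sum_{c \in \mathcal{C}} V(\mathcal{X}_c)^{\top}
    \underbrace{\left[ \frac{1}{R} \sum_{r=1}^{R}
    \gamma\!\left(\mathcal{Y}^{(r)}\right) V\!\left(\mathcal{Y}^{(r)}_c\right) \right]}_{\text{regionally aggregated residual for word } c}
\end{align}
```

The bracketed term depends only on the database image, so under these assumptions it can be precomputed and stored once per visual word instead of once per region.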

A similar derivation can be obtained for ASMK in the case where is the identity function (i.e., no selectivity is applied), by replacing by in the regionally aggregated VLAD expression above. A straightforward matching kernel using this idea would apply the selectivity function when comparing the query ASMK representation against this aggregated representation. We refer to this aggregation variant as Naive-R-ASMK.

Both the R-VLAD and Naive-R-ASMK kernels present an important problem when using many detected regions per image and large codebooks. For a given image region, most visual words will not be associated with any local feature, leading to many all-zero residuals for the region. For visual words that correspond to visual patterns observed in only a small number of regions, averaging over all regions will therefore lead to substantially downweighted residuals. We propose to fix this weakness by developing the R-ASMK kernel as follows, inspired by the changes introduced by the original ASMK with respect to VLAD.


We define the R-ASMK similarity between a query and a database image as:


where is the normalized regionally aggregated residual corresponding to visual word .
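The two aggregation steps named in Figure 1 (per-region VLAD description, then sum pooling and per-visual-word normalization) can be sketched as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the global L2 normalization applied to each region before pooling is an assumption of this sketch.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def region_vlad(descriptors_by_word, codebook):
    """Per-region VLAD: residual sum per visual word, followed by a global
    L2 normalization over the region (assumed in this sketch)."""
    vlad = {c: [sum(d[i] - codebook[c][i] for d in ds)
                for i in range(len(codebook[c]))]
            for c, ds in descriptors_by_word.items()}
    total = math.sqrt(sum(l2_norm(v) ** 2 for v in vlad.values()))
    if total == 0:
        return vlad
    return {c: [x / total for x in v] for c, v in vlad.items()}


def r_asmk_aggregate(regions, codebook):
    """Sum-pool regional VLADs per visual word, then L2-normalize each word."""
    pooled = {}
    for descriptors_by_word in regions:
        for c, v in region_vlad(descriptors_by_word, codebook).items():
            acc = pooled.setdefault(c, [0.0] * len(v))
            for i, x in enumerate(v):
                acc[i] += x
    # Per-visual-word normalization; words with a zero pooled residual are dropped.
    return {c: [x / l2_norm(v) for x in v]
            for c, v in pooled.items() if l2_norm(v) > 0}


# Toy example: the whole image plus one detected box (values are made up).
codebook = [(0.0, 0.0), (1.0, 1.0)]
whole_image = {0: [(0.5, 0.0)], 1: [(1.0, 2.0)]}
detected_box = {1: [(1.0, 3.0)]}
representation = r_asmk_aggregate([whole_image, detected_box], codebook)
```

Visual word 0 appears in only one of the two regions, yet its final residual has unit norm, just like that of word 1. The per-word normalization is what prevents rarely observed visual words from being downweighted by regions in which they do not occur.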


The kernels we presented in this section can be regarded as different instantiations of a general regional aggregated match kernel (R-AMK), defined as follows:


where denotes the sets of local descriptors quantized to visual word , from each region of . specializes to for R-VLAD, and to for R-ASMK. Note that this definition involves regional aggregation for both images, while in this work we focus on the asymmetric case where regional aggregation is applied to the database image only. As previously mentioned, the asymmetric case is more relevant when the query image is itself a well-localized region-of-interest, which is a common setup in image retrieval benchmarks.


For codebooks with a large number of visual words, the storage cost for such aggregated representations may be prohibitive. Binarization is an effective strategy to allow scalable retrieval in these cases. We adopt a binarization strategy similar to that of [37], where a binarized version of can be obtained by the elementwise function . We denote the binarized version by a superscript (e.g., R-ASMK is the binarized version of R-ASMK).
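A sign-based binarization in the spirit of [37] can be sketched as follows. The exact elementwise function is not recoverable from the text above, so mapping non-negative entries to +1 and negative entries to -1 is an assumption, and all function names are hypothetical.

```python
def binarize(v):
    """Elementwise sign binarization: +1 for non-negative entries, -1 otherwise."""
    return [1 if x >= 0 else -1 for x in v]


def binary_similarity(b1, b2):
    """Normalized inner product of two {-1, +1} vectors, in [-1, 1]."""
    return sum(x * y for x, y in zip(b1, b2)) / len(b1)


def pack_bits(b):
    """Pack a {-1, +1} vector into an integer bit mask (1 bit per dimension)."""
    bits = 0
    for i, x in enumerate(b):
        if x > 0:
            bits |= 1 << i
    return bits


def binary_similarity_packed(m1, m2, dim):
    """Same similarity computed from bit masks via the Hamming distance."""
    hamming = bin(m1 ^ m2).count("1")
    return (dim - 2 * hamming) / dim


# Toy aggregated residuals for one visual word (values are made up).
v1 = [0.3, -0.2, 0.0, -0.7]
v2 = [0.1, 0.4, -0.5, -0.1]
b1, b2 = binarize(v1), binarize(v2)
sim = binary_similarity(b1, b2)
sim_packed = binary_similarity_packed(pack_bits(b1), pack_bits(b2), len(b1))
```

Storing one bit per dimension instead of a floating-point value is where the memory saving comes from, and the packed form lets the similarity be computed with XOR and a bit count.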

5 Experiments

We present two types of experiments. First, landmark detection experiments assess the quality of object detector models trained on the new dataset. Second, retrieval experiments use the detected landmarks to enhance image retrieval systems.

5.1 Landmark Detection

We train two types of detection models on the bounding box data collected as described in Section 3: a single-shot MobileNet-V2 [33] based SSD detector [20] and a two-stage ResNet-50 [10] based Faster-RCNN [32]. We evaluate with the standard object detection metric, Average Precision (AP), measured at Intersection-over-Union ratio . Both models reach about AP on the validation set within steps (, respectively). The models are trained with the publicly available TensorFlow Object Detection API [11]. The results indicate that accurate landmark detectors can be trained using our dataset. The MobileNet-V2-SSD variant runs at ms per image, while the ResNet-50-Faster-RCNN runs at ms, both measured on a TitanX GPU. Please refer to the appendix for learning curves and visualizations of detection results compared against ground truth.
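The Intersection-over-Union criterion underlying the AP metric can be sketched as below. This is an illustrative helper, not part of any detection library: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the matching threshold is left as a parameter.

```python
def box_area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes."""
    intersection = box_area((max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                             min(box_a[2], box_b[2]), min(box_a[3], box_b[3])))
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted, ground_truth, iou_threshold):
    """A detection counts as correct if its IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= iou_threshold


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Since each image in the dataset carries at most one ground-truth box, matching a prediction against the annotation reduces to a single IoU comparison per image.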

5.2 Image Retrieval

We perform regional search and regional aggregation experiments. The following describes the experimental setup.


We use the Oxford [26] and Paris [27] datasets, which have recently been revisited to correct annotation mistakes, add new query images and introduce new evaluation protocols [28]; the datasets are referred to as Oxf and Par, respectively. There are query images for each dataset, with () database images in the Oxf (Par) dataset. We report results on the Medium and Hard setups, as is common practice; for ablations, we focus more specifically on the Hard setup. Performance is measured using standard metrics for these datasets: mean average precision (mAP) and mean precision at rank 10 (mP@10).
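The two metrics can be sketched as follows. This simplified version assumes each query comes with a full ranking of the database expressed as a list of booleans (relevant or not); benchmark-specific details of the revisited protocol [28], such as images that are ignored during evaluation, are omitted, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean precision at the ranks of the relevant items."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def precision_at_k(ranked_relevance, k=10):
    """Fraction of relevant items among the top-k results."""
    return sum(1 for r in ranked_relevance[:k] if r) / k


def mean_over_queries(metric, rankings):
    """Average a per-query metric over all queries."""
    return sum(metric(r) for r in rankings) / len(rankings)


# Toy rankings for two queries (True marks a relevant database image).
rankings = [
    [True, False, True, False, False],
    [False, True, False, False, False],
]
mean_ap = mean_over_queries(average_precision, rankings)
```

For the first toy query the relevant images sit at ranks 1 and 3, giving precisions 1 and 2/3; for the second, the single relevant image at rank 2 gives precision 1/2.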

(a) Regional search evaluation.
(b) Regional aggregation evaluation.
Figure 4: Regional search and aggregation evaluations of different image representations, on Oxf-Hard. (a) Regional search: each regional representation is stored independently in the database, leading to increased memory requirements. Our D2R-ASMK variants achieve significant improvements over the single-image baseline while requiring substantially fewer boxes compared to other region selection approaches. (b) Regional aggregation: each region contributes to the aggregated representation for the entire image. The memory requirements are identical for the single-image baseline that does not use regions. Our D2R-R-ASMK variants leverage the different landmark regions to compose a strong image representation, which is even more effective than storing each regional representation separately.

Image representation.

We use the following setup in our experiments, except where indicated otherwise. The released DELF model [25] (pre-trained on the dataset from [9]) is used, with the default configuration (a maximum of features per region is extracted, with a required minimum attention score of ), except that the feature dimensionality is set to as in previous work [28]. A -sized codebook is used when computing aggregated kernels; as is common practice, codebooks are trained on Oxf for retrieval experiments on Par, and vice-versa. We focus on improving the core image representations for retrieval, and do not consider query expansion (QE) [7] techniques such as Hamming QE [38], QE [30] or diffusion [13, 12]; these methods could be incorporated into our system to obtain even stronger retrieval performance.

Region selection techniques.

For our Detect-to-Retrieve (D2R) framework, we adopt the trained Faster R-CNN detector described in Section 5.1. We compare against previously proposed region selection techniques for image retrieval: the uniform grid from [31, 39] (denoted RMACB, for “RMAC boxes”) and Selective Search (SS) [40, 36]. To vary the number of regions per image, we proceed as follows: (i) for D2R, we vary the landmark detector threshold; (ii) for RMACB, we sweep the number of levels from to ; (iii) for SS, we select the top boxes per image (as in this case there are no confidence scores associated with regions). For all region selection techniques, we add the original image as one of the selected regions.
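Region selection by detector threshold can be sketched as below. This is an illustrative helper with hypothetical names and made-up detections; it only shows how lowering the threshold increases the number of regions while the full image is always kept.

```python
def select_regions(detections, image_size, score_threshold):
    """Keep boxes whose detector score reaches the threshold.

    detections: list of ((x_min, y_min, x_max, y_max), score) pairs.
    The full image is always included as the first region.
    """
    width, height = image_size
    regions = [(0, 0, width, height)]
    regions += [box for box, score in detections if score >= score_threshold]
    return regions


# Hypothetical detector output for one 640x480 image.
detections = [
    ((10, 20, 200, 220), 0.90),
    ((50, 60, 120, 160), 0.40),
    ((0, 0, 30, 30), 0.05),
]
regions_strict = select_regions(detections, (640, 480), score_threshold=0.95)
regions_medium = select_regions(detections, (640, 480), score_threshold=0.30)
regions_loose = select_regions(detections, (640, 480), score_threshold=0.01)
```

With a strict threshold only the full image remains, which reduces the system to the single-image baseline. Lower thresholds trade a larger number of regions for potentially less relevant ones.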

Method Det. Oxf-Hard Par-Hard
Thresh. mAP Size mAP Size
ASMK 42.0
Table 1: Retrieval mAP and relative database size for the different region-based techniques introduced in this work, on the Oxf-Hard and Par-Hard datasets, as a function of the landmark detector threshold used for region selection. D2R-ASMK uses the max-pooling similarity of Section 4.1. The performances of both D2R-ASMK and D2R-R-ASMK tend to improve as the detection threshold decreases (more regions are selected). D2R-R-ASMK outperforms D2R-ASMK consistently, with a smaller memory footprint.

Implementation details.

We implemented the aggregated kernel framework in Python. As a comparison against the reference MATLAB implementation [37], our ASMK with a -sized codebook and DELF features obtains mAP in the Oxf-Hard dataset, while the reference implementation obtains . Note that the reference binarized implementation uses a configuration similar to Hamming Embedding (HE) [14], with a projection matrix before binarization, residuals computed with respect to the median, and IDF. We did not find consistent improvements using these, so we use the simpler version as described in Section 4. Similarly, the reference implementation uses multiple visual word assignments, but our preliminary experiments show improved results using single assignment, making retrieval faster and simpler – therefore we adopt single assignment in our experiments. We extend the implementation to support our regional search and aggregation techniques, and plan to release code to foster reproducibility of results.

5.2.1 Regional Search

Method Medium Hard
Oxf Par Oxf Par
mAP mP@10 mAP mP@10 mAP mP@10 mAP mP@10
HesAff-rSIFT-ASMK [37]
HesAff-rSIFT-ASMK+SP [37]
HesAffNet–HardNet++–ASMK+SP [24]
DELF–ASMK+SP [25, 28] 93.4
AlexNet-GeM [30]
VGG16-GeM [30]
ResNet101-GeM [30]
ResNet101-R-MAC [9]
DELF–ASMK (reimpl.)
DELF–D2R-R-ASMK (ours)
— DELF-GLD (ours) 80.7 61.3 93.4
DELF–ASMK+SP (reimpl.)
DELF–D2R-R-ASMK+SP (ours) 99.4
— DELF-GLD (ours) 76.0 93.4 52.4 70.9
Table 2: Comparison of our proposed techniques against state-of-the-art methods. We report mAP and mP@10 on the Oxford (Oxf) and Paris (Par) datasets, with Medium and Hard evaluation protocols. Previously published results are presented in the first block of rows. The second and third block of rows present our experimental results, considering systems without and with spatial verification (SP), respectively. In this experiment, we use codebooks with visual words, to make our results comparable to previous work [28]. DELF-GLD indicates a version of DELF which we re-trained on the Google Landmarks dataset. Our proposed methods achieve equal or improved performance for all evaluation protocols, datasets and metrics.

We compare aggregated match kernels, region selection techniques and similarity computation methods on the Oxf-Hard dataset. When performing regional search, multiple regions are selected per image and stored independently in the database, leading to increased memory cost. Figure 4(a) presents results for ASMK variants, where all techniques use the max-pooling similarity of Section 4.1, except for D2R-ASMK, which uses the average-pooling similarity. Combining our proposed D2R regions with ASMK enhances mAP by when using an average of regions per image.

We compare the different region selection approaches using ASMK. Our D2R-ASMK achieves mAP when using regions per image, an improvement of over the single-image ASMK baseline. Other region selection approaches improve retrieval accuracy, but with significantly larger memory requirements. RMACB-ASMK requires regions/image to achieve mAP (this is mAP below the previously mentioned D2R-ASMK operating point, despite requiring more memory). SS-ASMK benefits from some regions, while performance decreases when a large number of regions are selected, since many of those regions are irrelevant.

Average pooling of individual regional similarities improves upon the single-image baseline significantly, at low overhead memory requirements: D2R-ASMK achieves mAP with only storage cost. Note that in this case performance drops significantly as more regions are added, since irrelevant regional similarities are added to the final image similarity. We also experimented with a D2R-VLAD representation: mAP improves from (single-image) to ( regions/image).


Table 1 further presents D2R-ASMK results on the Par-Hard dataset. Regional search enables mAP improvement at regions/image. Note that our D2R approach is effective even though the landmarks in the Landmark Boxes dataset present much larger variability than the landmarks encountered in the Oxf/Par datasets.

Figure 5: Qualitative results for ASMK (baseline single-image method), D2R-ASMK (regional search) and D2R-R-ASMK (regional aggregation) on Oxf-Hard. Four queries are presented, with their regions-of-interest highlighted. For each method, we show the first ranked image where the methods disagree. Red borders indicate incorrect results, and green borders indicate correct results. For D2R-ASMK, we box the region used for the result (or leave unboxed if the region corresponds to the entire image). For D2R-R-ASMK, we box all regions used for aggregation. We also present average precision (AP) for each method and query.

5.2.2 Regional Aggregated Match Kernels

In this section, we evaluate the proposed regional aggregated match kernels. In this experiment, region selection is used to produce an improved image representation, with no additional memory requirements to the retrieval system. Figure 4(b) compares different aggregation methods and region selection approaches, on the Oxf-Hard dataset. Both our proposed D2R-R-ASMK and D2R-R-ASMK variants achieve substantial improvements compared to their baselines which do not use boxes for aggregation: and absolute mAP improvements, respectively. We also compare our D2R approach against other region selection methods. RMACB and SS improve upon the baseline, however with limited gain of at most mAP.

More interestingly, our proposed kernels outperform even the regional search configuration where each region is indexed separately in the database. Table 1 compiles experimental results on Oxf-Hard and Par-Hard. Our D2R-R-ASMK method outperforms the best regional search variant on both datasets, respectively by and absolute mAP, with storage savings of and .

In another ablation experiment, we assess the performance of simpler regional aggregation methods: R-VLAD and Naive-R-ASMK. We use the trained detector to select regions. For R-VLAD, mAP on Oxf improves from (single-image) to when using regions per image, but degrades quickly as more regions are considered. In particular, when setting a very low detection threshold () to obtain regions per image, performance degenerates to mAP – this agrees with the intuition that a large number of regions is detrimental to R-VLAD. For Naive-R-ASMK, no improvement is obtained when detected regions are used: mAP drops from to when regions per image are used, and similarly degenerates to when using regions per image. In comparison, using the same detection threshold of , R-ASMK obtains mAP, i.e., performance is high even if using a large number of regions, due to the improved aggregation technique.

5.2.3 Comparison Against State-of-the-Art

We compare our D2R-R-ASMK technique against state-of-the-art image retrieval systems. To make our system comparable with previously published results [28], for this experiment we use a codebook with visual words. We also further experiment with re-training the DELF local feature on the Google Landmarks dataset (denoted as DELF-GLD). Spatial verification (SP) is used to re-rank the top database images (we use RANSAC with an Affine model), again following the protocol from previous work [28].

Table 2 presents experimental results on Oxf and Par, using the Medium and Hard protocols. Our proposed D2R-R-ASMK representation by itself, without spatial verification, already improves mAP when comparing against all previously published results. SP further boosts performance by about mAP on Oxf. Surprisingly, it actually degrades performance on the Par dataset, by about . Re-training DELF on GLD improves performance by around . Our best results improve upon the previous state-of-the-art by mAP on Oxf-Medium, mAP on Par-Medium, mAP on Oxf-Hard and in Par-Hard.

5.2.4 Discussion

Our experiments demonstrate that selecting relevant image regions can help boost image retrieval performance significantly. In our regional aggregation method, the detected regions allow for effective re-weighting of local feature contributions, emphasizing relevant visual patterns in the final image representation. Note, however, that it is crucial to perform both region selection and regional aggregation in a suitable manner. If the selected regions are not relevant to the objects of interest, regional aggregation cannot be very effective, as shown in Figure 4(b). Also, our experiments with naive versions of regional aggregation indicate that the aggregation needs to be performed in the right way: this is reflected in the poor R-VLAD and Naive-R-ASMK results.

It may initially seem unintuitive that the regional search method underperforms when compared to our regional aggregation technique. However, this can be understood by observing some retrieval result patterns, which are presented in Figure 5. The addition of separate regional representations to the database may help retrieval of relevant small objects in cluttered scenes, as illustrated with the successful bottom-right D2R-ASMK retrieved image. However, it also increases the chances of finding localized regions which are similar but do not correspond to the same landmark, as illustrated with the top two cases.

Regional aggregation, on the other hand, can help retrieval by re-balancing the visual information presented in an image. The top-right D2R-R-ASMK result shows a database image where the detected boxes do not precisely cover the query object; instead, several selected regions cover it, and consequently its features are boosted. A similar case is illustrated in the bottom-left example, where the main detected region in the database image does not cover the object of interest entirely. The features inside the main box are boosted but those outside are also used, generating a more suitable representation for image retrieval.

6 Conclusions

In this paper, we present an efficient regional aggregation method for image retrieval. We first introduce a dataset of landmark bounding boxes, and show that landmark detectors can be trained and leveraged for extracting regional representations. Regional search using our detectors not only provides superior retrieval performance but also much better efficiency than existing regional methods. In addition, we propose a novel regional aggregated match kernel framework that further boosts the retrieval accuracy with no increase in memory. Our full system achieves state-of-the-art performance by a large margin on two image retrieval datasets.

Appendix A. Discussion: Detection Helps Finding Relevant Features

In this section, we analyze the detector’s ability to focus on relevant landmarks by empirically estimating the proportion of relevant local features located inside or outside predicted bounding boxes. We extract and match local features for image pairs that are known to depict the same landmark. A local feature is declared to be relevant if it is an inlier to a high-confidence estimated geometric transformation.

More specifically, we use DELF local features [25] and a Faster-RCNN [32] landmark detector trained on our new dataset. Image pairs are collected from the Google Landmarks dataset [25]. Local feature matching is performed via nearest neighbor search followed by geometric verification (RANSAC [8] with an affine model). Figure 6 plots the relevance probabilities as a function of the DELF local feature attention scores (these attention scores can be interpreted as a measure of a local feature’s “landmarkness”). The blue curve denotes features that are located within bounding boxes, while the red curve represents features located outside bounding boxes.
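The matching step described above can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: the brute-force nearest neighbor search, the number of RANSAC iterations, the pixel threshold, and all function names are assumptions, and synthetic keypoints stand in for DELF features.

```python
import numpy as np


def nearest_neighbor_matches(desc_a, desc_b):
    """For each descriptor in A, return the index of its nearest descriptor in B."""
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


def fit_affine(src, dst):
    """Least-squares affine map (3x2 matrix M) such that [src, 1] @ M ~= dst."""
    ones = np.ones((len(src), 1))
    M, *_ = np.linalg.lstsq(np.hstack([src, ones]), dst, rcond=None)
    return M


def ransac_affine(src, dst, n_iters=200, threshold=3.0, seed=0):
    """Return a boolean inlier mask for putative matches src[i] <-> dst[i]."""
    rng = np.random.default_rng(seed)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        # An affine model is determined by 3 point correspondences.
        sample = rng.choice(len(src), size=3, replace=False)
        M = fit_affine(src[sample], dst[sample])
        errors = np.linalg.norm(src_h @ M - dst, axis=1)
        inliers = errors < threshold
        if inliers.sum() > best.sum():
            best = inliers
    return best


# Synthetic example: 30 matches follow one affine map, 10 are gross outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 500, size=(40, 2))
A = np.array([[1.1, 0.1], [-0.1, 0.9]])
dst = src @ A + np.array([20.0, -10.0])
dst[30:] = rng.uniform(1000, 2000, size=(10, 2))
inlier_mask = ransac_affine(src, dst)  # True marks "relevant" features
```

Features flagged in `inlier_mask` play the role of the relevant features in this analysis; a production system would typically use an approximate nearest neighbor index instead of the dense distance matrix.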

The curves show that local features located within bounding boxes are much more likely to be relevant: for two features with the same attention score, the relevance probability for a feature located within a predicted box is approximately to larger than that for a local feature located outside the box. Note how feature relevance increases with attention scores, as expected, but the predicted boxes can provide important extra information to effectively select the best features. This can be interpreted as the merging of two information streams: bottom-up (DELF attention scores estimate per-local feature relevance) and top-down (landmark detector estimates relevance of large regions).
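As a rough sketch of how such curves could be estimated, the snippet below bins features by attention score and computes the fraction of relevant (inlier) features per bin, separately for features inside and outside predicted boxes. The bin edges, the toy data, and the function name are hypothetical and chosen only for illustration.

```python
def relevance_by_attention(scores, is_inlier, in_box, bin_edges):
    """Estimate P(relevant | attention bin) for in-box and out-of-box features."""
    n_bins = len(bin_edges) - 1
    # Per bin: [number of relevant features, total number of features].
    stats = {True: [[0, 0] for _ in range(n_bins)],
             False: [[0, 0] for _ in range(n_bins)]}
    for score, inlier, inside in zip(scores, is_inlier, in_box):
        for b in range(n_bins):
            if bin_edges[b] <= score < bin_edges[b + 1]:
                stats[bool(inside)][b][0] += int(inlier)
                stats[bool(inside)][b][1] += 1
                break

    def ratios(rows):
        # None marks an empty bin, where no probability can be estimated.
        return [rel / tot if tot else None for rel, tot in rows]

    return ratios(stats[True]), ratios(stats[False])


# Toy data (not from the paper): attention score, inlier flag, in-box flag.
scores = [10, 12, 15, 18, 30, 35, 32, 38]
is_inlier = [True, False, False, False, True, True, True, False]
in_box = [True, True, False, False, True, True, False, False]
inside_curve, outside_curve = relevance_by_attention(
    scores, is_inlier, in_box, bin_edges=[0, 20, 40])
```

With this toy input, the in-box curve lies above the out-of-box curve in both bins, which is the qualitative pattern the analysis describes.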

Our proposed R-ASMK can be seen as a local feature re-weighting mechanism, which favors features located within detected regions. The experimental results obtained on the Oxford and Paris datasets (as presented in the main paper) confirm that re-weighting features within detected regions boosts image retrieval performance substantially.
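The re-weighting view can be illustrated with a small sketch. It is not the R-ASMK kernel itself: it only counts, for each local feature, how many selected regions contain it, assuming the whole image is always one of the regions. A feature covered by more regions takes part in more regional aggregates, so it carries more weight in the final representation. The function name and the coordinate conventions are hypothetical.

```python
def region_counts(keypoints, boxes, image_size):
    """Count how many selected regions contain each keypoint.

    keypoints: list of (x, y) feature locations.
    boxes: list of (x_min, y_min, x_max, y_max) detected boxes.
    image_size: (width, height); the whole image is always a region.
    """
    width, height = image_size
    regions = [(0, 0, width, height)] + list(boxes)
    counts = []
    for x, y in keypoints:
        n = sum(1 for (x0, y0, x1, y1) in regions
                if x0 <= x <= x1 and y0 <= y <= y1)
        counts.append(n)
    return counts


keypoints = [(50, 60), (300, 200), (310, 220)]
boxes = [(250, 150, 400, 300), (280, 180, 350, 260)]
counts = region_counts(keypoints, boxes, image_size=(640, 480))
```

Here the first keypoint lies only in the whole-image region, while the other two also fall inside both detected boxes and are therefore emphasized.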

Figure 6: Relevance probability of a DELF local feature, as a function of its attention score. The blue curve denotes features inside predicted bounding boxes, while the red curve denotes features outside them. The detected boxes provide valuable information that can be used to improve image representations for retrieval tasks.

Appendix B. Detection Experiments

We present learning curves for the trained detectors, and examples of detected regions compared to ground-truth.

Learning Curves.

We train both Faster-RCNN and SSD based object detection models on our dataset. Figure 7 compares the learning progression of the two models. Both models converge to around 85% mAP within 600k training steps. The MobilenetV2 SSD model trains much faster than the Resnet-50 Faster-RCNN, due to its much smaller model size and larger batch size ( vs. , respectively). We also observe that the SSD-based model slightly outperforms the Faster-RCNN-based model despite having a smaller/weaker feature extractor. We conjecture that the advantage is due to the multi-scale feature maps of SSD capturing the landmarks at different scales better than Faster-RCNN, which operates on a single feature map.

Figure 7: Mean average precision @ IOU= for the two trained landmark detectors, as a function of the number of training steps.
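For reference, the intersection-over-union (IoU) criterion that underlies this metric can be sketched as below. The box format and function names are assumptions, and the threshold of 0.5 in the example is purely illustrative, not necessarily the value used for the curves.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detection, ground_truth, iou_threshold):
    """A detection counts as correct if it overlaps the ground truth enough."""
    return iou(detection, ground_truth) >= iou_threshold


detection = (0, 0, 100, 100)
ground_truth = (50, 0, 150, 100)
overlap = iou(detection, ground_truth)                    # 5000 / 15000
correct = is_true_positive(detection, ground_truth, 0.5)  # illustrative threshold
```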
Detection Examples.

To illustrate the effectiveness of our trained detectors, we present examples of detection using the SSD model. Figure 8 shows examples for a variety of landmarks with different scales, occlusion and lighting conditions. In addition, we show some failure cases in Figure 9, where the object of interest has an ambiguous semantic boundary (resulting in double-detection) or is very hard to distinguish from the scene (resulting in missed detection). For both figures, only detections with confidence probability more than are shown.

Figure 8: Detection (on the left) versus ground-truth (on the right) on the Google Landmarks dataset.
Figure 9: Two failure detection cases. On the right are the ground-truth images, and on the left are the outputs of the detector (if any).

Appendix C. Region Selection Comparison

In this section, we present landmark detection results on the Oxford and Paris datasets (Figure 10 and Figure 11, respectively), comparing them with the regions selected by competing approaches (RMAC boxes and Selective Search). The three methods use a configuration that produces a roughly similar number of regions per image: D2R with detection threshold (about regions per image), RMAC boxes with levels ( regions per image), and Selective Search with selected regions per image. Note that our image retrieval experiments always use the whole image as a valid region, but in these visualizations we do not box the whole image, for a more concise presentation.
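The D2R region selection described here (confident detections plus the whole image) can be sketched as follows. The detection format, the function name, and the threshold value in the example are hypothetical.

```python
def select_regions(detections, score_threshold, image_size):
    """Keep confident detections and always add the whole image as a region.

    detections: list of (box, score), with box = (x_min, y_min, x_max, y_max).
    """
    width, height = image_size
    regions = [(0, 0, width, height)]  # the whole image is always a valid region
    regions += [box for box, score in detections if score > score_threshold]
    return regions


detections = [
    ((10, 20, 200, 300), 0.9),
    ((50, 60, 120, 180), 0.4),
    ((300, 100, 600, 400), 0.7),
]
regions = select_regions(detections, score_threshold=0.5, image_size=(640, 480))
```

Raising the threshold yields fewer regions per image, which is the knob used to match the number of regions across the compared methods.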

The figures show that our trained landmark detector tends to focus on the most prominent landmark regions in the image. RMAC boxes correspond to a fixed multi-scale grid, where the selected regions only depend on the input image size, not on its contents. This leads to regularly spaced boxes which do not usually overlap well with landmarks. Selective search produces boxes corresponding to prominent objects in the scene, which may or may not correspond to landmarks.
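To make the content-independence of a fixed multi-scale grid concrete, the sketch below generates square regions from the image size alone. It is a simplified construction loosely modeled on the RMAC region sampling, not the exact procedure of [39]; the side-length formula, the 40% overlap target, and the function names are assumptions.

```python
import math


def _positions(length, side, step):
    """Evenly spaced offsets of windows of size `side` covering `length`."""
    count = math.ceil((length - side) / step) + 1
    if count == 1:
        return [(length - side) / 2.0]
    return [i * (length - side) / (count - 1) for i in range(count)]


def grid_regions(width, height, levels, overlap=0.4):
    """Square regions on a fixed multi-scale grid; depends only on image size."""
    regions = []
    short = min(width, height)
    for level in range(1, levels + 1):
        side = 2.0 * short / (level + 1)  # regions shrink at finer levels
        step = side * (1.0 - overlap)     # target stride between neighbors
        for y in _positions(height, side, step):
            for x in _positions(width, side, step):
                regions.append((x, y, x + side, y + side))
    return regions


# A 600x600 image with 2 levels: 1 region at level 1, 2x2 regions at level 2.
regions = grid_regions(600, 600, levels=2)
```

Because the image pixels never enter the computation, two different images of the same size always receive identical boxes, unlike a detector-based selection.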

Figure 10: Examples of selected regions for the three methods compared in the paper, on the Oxford dataset. Left: our D2R approach, with detection threshold of ( regions per image). Center: RMAC boxes (fixed multi-scale grid), with 2 levels ( regions per image). Right: Selective search, with regions per image. Note that edges for some regions overlap in some cases, so not all regions may be clearly visible.
Figure 11: Examples of selected regions for the three methods compared in the paper, on the Paris dataset. Left: our D2R approach, with detection threshold of ( regions per image). Center: RMAC boxes (fixed multi-scale grid), with 2 levels ( regions per image). Right: Selective search, with regions per image. Note that edges for some regions overlap in some cases, so not all regions may be clearly visible.

References


  • [1] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proc. CVPR, 2016.
  • [2] R. Arandjelović and A. Zisserman. All About VLAD. In Proc. CVPR, 2013.
  • [3] Y. Avrithis and G. Tolias. Hough Pyramid Matching: Speeded-up Geometry Re-ranking for Large Scale Image Retrieval. International Journal of Computer Vision, 2014.
  • [4] A. Babenko and V. Lempitsky. Aggregating Local Deep Features for Image Retrieval. In Proc. ICCV, 2015.
  • [5] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural Codes for Image Retrieval. In Proc. ECCV, 2014.
  • [6] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008.
  • [7] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. In Proc. ICCV, 2007.
  • [8] M. Fischler and R. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981.
  • [9] A. Gordo, J. Almazan, J. Revaud, and D. Larlus. Deep Image Retrieval: Learning Global Representations for Image Search. In Proc. ECCV, 2016.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proc. CVPR, 2016.
  • [11] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/Accuracy Trade-offs for Modern Convolutional Object Detectors. In Proc. CVPR, 2017.
  • [12] A. Iscen, Y. Avrithis, G. Tolias, T. Furon, and O. Chum. Fast Spectral Ranking for Similarity Search. In Proc. CVPR, 2018.
  • [13] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum. Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations. In Proc. CVPR, 2017.
  • [14] H. Jégou, M. Douze, and C. Schmid. Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search. In Proc. ECCV, 2008.
  • [15] H. Jégou, M. Douze, C. Schmid, and P. Perez. Aggregating Local Descriptors into a Compact Image Representation. In Proc. CVPR, 2010.
  • [16] H. Jégou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating Local Image Descriptors into Compact Codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
  • [17] H. J. Kim, E. Dunn, and J.-M. Frahm. Predicting Good Features for Image Geo-Localization Using Per-Bundle VLAD. In Proc. ICCV, 2015.
  • [18] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.
  • [19] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale. arXiv:1811.00982, 2018.
  • [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. In Proc. ECCV, 2016.
  • [21] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004.
  • [22] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions. Image and Vision Computing, 2004.
  • [23] A. Mishchuk, D. Mishkin, F. Radenović, and J. Matas. Working Hard to Know your Neighbor’s Margins: Local Descriptor Learning Loss. In Proc. NIPS, 2017.
  • [24] D. Mishkin, F. Radenović, and J. Matas. Repeatability Is Not Enough: Learning Affine Regions via Discriminability. In Proc. ECCV, 2018.
  • [25] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proc. ICCV, 2017.
  • [26] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In Proc. CVPR, 2007.
  • [27] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. In Proc. CVPR, 2008.
  • [28] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. In Proc. CVPR, 2018.
  • [29] F. Radenović, G. Tolias, and O. Chum. CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples. In Proc. ECCV, 2016.
  • [30] F. Radenović, G. Tolias, and O. Chum. Fine-Tuning CNN Image Retrieval with no Human Annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [31] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki. Visual Instance Retrieval with Deep Convolutional Networks. ITE Transactions on Media Technology and Applications, 2016.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proc. NIPS, 2015.
  • [33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. In Proc. CVPR, 2018.
  • [34] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR, 2015.
  • [35] J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proc. ICCV, 2003.
  • [36] R. Tao, E. Gavves, C. G. M. Snoek, and A. W. M. Smeulders. Locality in Generic Instance Search from One Example. In Proc. CVPR, 2014.
  • [37] G. Tolias, Y. Avrithis, and H. Jégou. Image Search with Selective Match Kernels: Aggregation Across Single and Multiple Images. International Journal of Computer Vision, 2015.
  • [38] G. Tolias and H. Jégou. Visual Query Expansion with or without Geometry: Refining Local Descriptors by Feature Aggregation. Pattern Recognition, 2014.
  • [39] G. Tolias, R. Sicre, and H. Jégou. Particular Object Retrieval with Integral Max-Pooling of CNN Activations. In Proc. ICLR, 2015.
  • [40] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective Search for Object Recognition. International Journal of Computer Vision, 2013.