Coarse2Fine: Two-layer Fusion for Image Retrieval

Le Dong, Gaipeng Kong, Wenpu Dong, Liang Zheng and Qi Tian

This paper addresses the problem of large-scale image retrieval. We propose a two-layer fusion method which takes advantage of global and local cues and ranks database images from coarse to fine (C2F). Departing from previous methods that fuse multiple image descriptors simultaneously, C2F features a layered procedure composed of filtering and refining. In particular, C2F consists of three components. 1) Distractor filtering. With holistic representations, noise images are filtered out from the database, so the number of candidate images to be compared with the query is greatly reduced. 2) Adaptive weighting. For a given query, the similarity of candidate images is estimated by holistic similarity scores, which complement the local ones. 3) Candidate refining. Accurate retrieval is conducted via local features, combined with the pre-computed adaptive weights. Experiments are presented on two benchmarks, i.e., the Holidays and Ukbench datasets. We show that our method outperforms recent fusion methods in terms of storage consumption and computational complexity, and that its accuracy is competitive with the state of the art.

Image retrieval, Coarse-to-fine, Holistic representation, Local feature.

I Introduction

This paper considers the task of accurate image retrieval on a large scale. Given a query image, we aim at finding similar images in the database. A number of retrieval models have been proposed for this scenario, among which the local feature based model, i.e., bag-of-words (BOW) [1], has been widely applied. Nevertheless, traditional BOW lacks spatial layout information and may suffer from quantization errors. Besides, local feature based models usually lead to huge storage consumption, which limits their use in real-world applications where the image volume expands rapidly. In contrast, holistic representations such as HSV histograms represent an image via a global vector and scale well, but their retrieval accuracy is less desirable.

Fig. 1: A sample query from the Holidays dataset and its retrieval results obtained by holistic representation (first row), BOW (second row) and the proposed two-layer fusion framework (third row).

Generally speaking, local descriptors and holistic representations demonstrate distinct strengths in finding similar images. Local features are capable of capturing local image patterns or textures, while holistic representations delineate overall feature distributions in images. Thus, a variety of methods [15, 16, 17, 11, 30] are proposed to integrate the strengths of local and holistic features to yield more satisfying retrieval results.

Zhang et al. [15] propose a graph based fusion approach to merge and rerank multiple retrieval sets. Although this method achieves desirable accuracy, it has some disadvantages, e.g., not being robust to dynamic changes of the dataset or being unable to find reciprocal neighborhoods. Zheng et al. [17] propose a coupled Multi-Index (c-MI) framework to perform feature fusion at the indexing level. Specifically, color and SIFT features are coupled into a multi-dimensional inverted index to filter out false positive SIFT matches. The c-MI achieves higher accuracy with less query time, but its storage consumption is still considerable when the image dataset expands rapidly. Another representative work is the query-adaptive late fusion method proposed in [11]. For a given query, [11] extracts five different image descriptors, and each of them is used to search the whole dataset independently. The initial score curves obtained via these representations are normalized by the references and later used to evaluate the effectiveness of the corresponding descriptors. This query-adaptive method has competitive retrieval performance, but it heavily relies on the number of to-be-fused features and needs a considerable amount of storage to save these features, as well as the reference curves.

Different from the above fusion methods utilizing multiple image descriptors to search the entire database simultaneously, C2F proposed in this paper is a two-layer fusion procedure working from coarse to fine. The first stage aims at filtering out distractors from the dataset, which helps enhance the retrieval efficiency and reduce the memory consumption to be encountered in the next step. The subsequent operation focuses on refining the rank list from the filtered candidates, which further improves the retrieval accuracy (see Figure 1 for an illustration).

To be more specific, we first exploit the holistic descriptors to filter out distractor images that are dissimilar to the query from the database, so the number of candidate images to be compared with the query is greatly reduced. This process is efficient, and the retrieved candidates usually appear globally similar. Then, according to the similarity scores obtained via holistic features, we acquire candidate images which are more similar to the query than other images in the database. Moreover, for each candidate image, we design an adaptive weight from the holistic similarity scores to measure its degree of similarity to the query. Finally, accurate image search is conducted on the candidate image pool via local features. In this step, the adaptive weights of the candidate images are used in refining the initial retrieval list.

The main contribution of the proposed framework consists in the coarse-to-fine fusion of retrieval sets given by different methods, which has two merits: 1) by filtering out distractors from the database, the number of candidate images to be used for comparison with the query can be greatly reduced. 2) given a certain query, the relative weight of each candidate can be automatically evaluated via global similarity scores, which helps ensure the stability of the retrieval performance. We validate the performance of C2F on two public datasets, i.e., Ukbench and Holidays datasets. The evaluation shows that our method compares favorably with the recent state of the art.

The remainder of the paper is organized as follows: Section II highlights the related works; Section III describes the formulation details of C2F; Section IV provides comprehensive experimental results to validate the superiority of C2F; finally, Section V concludes the paper and states directions for future work.

II Related Work

A variety of works have been proposed to improve image retrieval performance. In this section, we briefly introduce several closely related methods.

To obtain a discriminative image representation with local features, BOW based methods are usually adopted. In BOW model, a codebook is generated off-line by unsupervised clustering algorithms, such as k-means, hierarchical k-means [3], and approximate k-means [18]. Each local feature is quantized to one or a few visual words by Approximate Nearest Neighbor algorithms. Then the image is represented by a high-dimensional histogram over the visual codebook. This process is usually accompanied by the quantization error [7, 8].

To correct quantization defects, hard assignment can be replaced with schemes such as soft assignment [7], multiple assignment [19], sparse coding [20], etc. Nevertheless, quantizing a single feature to several visual words [7, 19, 20] introduces more storage burden and higher search complexity. Representatively, [19] proposes to generate a binary signature for each local feature with a hashing approach, using a smaller codebook, for example of 20,000 visual words. The true matches are defined as those local features that are not only quantized to the same visual word but also have a small Hamming distance between their binary signatures.

Fig. 2: C2F via coupling filtering and refining. 1) Given a query, the holistic representation (HSV histograms) is used to filter out distractors from all images. 2) Top-K candidates with adaptive weights are selected via global similarity scores. 3) Accurate retrieval is conducted on the candidates via local features (BOW), combined with the adaptive weights.

On the other hand, [49] proposes a compact feature based clustering (CFC) to represent images. [47] improves the retrieval performance by automatically learning robust visual features and hash functions. [50] proposes a discriminative light unsupervised learning network to learn a compact image representation. [10] employs a two-layer hierarchical scheme to extract the global information and statistics of the local feature set of image datasets. Methods based on compact holistic features are efficient in computation and memory usage, and are more suitable for large-scale datasets. However, since holistic features tend to be less invariant than local features and more sensitive to image transformations, their retrieval precision is often lower than that of local feature based methods.

As a consequence, it would be desirable for one method to effectively combine the complementary cues of local and holistic features. Rank aggregation [29] is a solution to fuse local and holistic cues at the rank level, but its effectiveness is discounted when there is no intersection among the top candidates retrieved by the local and holistic feature based methods, which does occasionally occur. To overcome this weakness, [15] proposes an undirected-graph based fusion approach to further enhance the retrieval precision. Built on this undirected graph, [30] proposes a directed graph which is robust to outlier distraction. [48] proposes a graph-based optimization framework to leverage category independent object proposals for logo search in a large-scale image database. [16] proposes a semantic-aware co-indexing algorithm to jointly embed local invariant features and semantic attributes into the inverted indexes. [31] introduces a simple color signature generation procedure and embeds local color features into the inverted index to provide color information by Bag of Colors. In [17], a coupled Multi-Index framework is proposed to perform feature fusion at the indexing level by coupling complementary features into a multi-dimensional inverted index. [11] proposes a late fusion at the score level. In particular, the effectiveness of each to-be-fused feature is estimated in an unsupervised, query-adaptive manner. [46] proposes a bag-of-words based deep neural network for the image retrieval task, which learns a high-level image representation and maps images into the bag-of-words space.

From the above analysis, it can be seen that considerable efforts have been made to improve the retrieval accuracy via fusion schemes, such as graph-based [15, 30], query-adaptive late fusion [11]. Usually, with more heterogeneous features, higher accuracy can be obtained, while more information needs to be stored and higher computational complexity incurred, correspondingly. Therefore, how to maintain a good balance between accuracy and complexity is an important issue when focusing on large-scale data. Different from the above fusion methods, in this work, we propose a two-layer fusion framework to rank database images from coarse to fine (C2F). This method improves retrieval efficiency and accuracy reliably.

III Two-layer fusion framework

In this section, we present the detailed structure of the proposed C2F aiming at accurate image retrieval on large-scale databases. C2F is a two-layer fusion framework which consists of three key components. 1) It filters distractors from the whole database with holistic representations, and thus the number of images to be used for comparison with the query can be greatly reduced. 2) For a certain query, we design an adaptive weight for each selected candidate image according to the similarity scores obtained via holistic features. 3) We further refine the candidate set with local features, considering the impact of adaptive weights. The mechanism of C2F is presented in Figure 2. In the next subsections, we will elaborate each component of the framework.

III-A Distractor filtering via holistic representations

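We assume that there are a total of $N$ color images in the database $\mathcal{D}$. Using holistic features, such as HSV, each database image is represented as an $n$-dimensional histogram vector $\mathbf{h}_i$, and the image set can be defined as $\mathcal{D} = \{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_N\}$. In order to reduce the impact of bins with large values, HSV histograms of both query and dataset images are l1-normalized and square-root scaled, similar to rootSIFT [35]:

$$\mathbf{h} \leftarrow \left( \frac{\mathbf{h}}{\|\mathbf{h}\|_1} \right)^{\alpha},$$

where the power is applied element-wise and $\alpha$ is a coefficient whose value is set to 0.5.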
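As an illustration of the normalization just described, the following is a minimal sketch in plain Python. The function name `power_normalize` and the parameter name `alpha` are hypothetical; only the l1 normalization and the exponent of 0.5 come from the text.

```python
def power_normalize(hist, alpha=0.5):
    """L1-normalize a non-negative histogram, then raise every bin to `alpha`.

    `alpha=0.5` mirrors the coefficient stated in the text; the function
    name and signature are illustrative assumptions.
    """
    total = sum(hist)
    if total == 0:
        # An empty histogram has nothing to normalize.
        return [0.0 for _ in hist]
    return [(value / total) ** alpha for value in hist]


# Toy 4-bin histogram; a real HSV histogram here would have 1,000 bins.
normalized = power_normalize([16.0, 4.0, 0.0, 5.0])
print(normalized)
```

With an exponent of 0.5, the square-rooted vector has unit l2 norm, because its squared entries are exactly the l1-normalized bins, which sum to 1.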

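To select candidate images from the dataset, we adopt the cosine similarity between the vectors of the query and the database images as the similarity measurement:

$$S_G(q, d) = \frac{\mathbf{h}_q \cdot \mathbf{h}_d}{\|\mathbf{h}_q\|_2 \, \|\mathbf{h}_d\|_2},$$

where $q$ is the query, $d$ is a database image, $\mathbf{h}_q$ and $\mathbf{h}_d$ are their histogram vectors, and $S_G$ represents the similarity score obtained by the global feature. The cosine similarities between the query and all $N$ dataset images $d_1, \ldots, d_N$ can be defined as:

$$\mathbf{S}_G(q) = \{ S_G(q, d_1), S_G(q, d_2), \ldots, S_G(q, d_N) \}.$$

Particularly, a larger cosine score corresponds to higher similarity with the query.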

Note that there exist large variations among the database images. Thus, we expect that most database images differ significantly from the query. Hence, the holistic image representation can be used to quickly filter out these noise images. For each query, candidates that share similar information with the query are adaptively selected from the original database. Different queries commonly have different candidate images.
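The filtering stage can be sketched as follows: score every database vector against the query by cosine similarity and keep only the best-scoring images as candidates. This is a minimal sketch in plain Python; the names `cosine_similarity`, `filter_distractors` and the argument `k` (the candidate number) are illustrative assumptions, and a real system would use vectorized operations.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_distractors(query_vec, database_vecs, k):
    """Return [(image_id, score)] for the k most similar images, best first.

    Image ids are simply positions in `database_vecs` (an assumption made
    for this sketch).
    """
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(database_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = filter_distractors([1.0, 0.0], database, k=2)
print(candidates)  # image 1 is orthogonal to the query and is filtered out
```

The returned scores are kept alongside the ids because the next stage derives the adaptive weights from them.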

III-B Adaptive weights for candidates

After the filtering stage, candidate images more similar to the query are selected from the original image database. The similarity of the query with different candidates varies extensively. It is important to design an adaptive weight for each candidate image to evaluate its similarity level with the query. Considering that holistic representations, i.e., HSV histograms, delineate overall color feature distributions in images, the cosine values computed in Sec. III-A are used as the basis of learning weights. In particular, the similarity scores of the query with candidates are represented as .

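The motivation behind the weighting method is that larger cosine values correspond to higher importance of the candidate images. To prevent candidates with larger scores from dominating, the cosine similarity values of the candidate images are min-max normalized to $[0, 1]$:

$$\hat{S}_G(q, d_k) = \frac{S_G(q, d_k) - \min_j S_G(q, d_j)}{\max_j S_G(q, d_j) - \min_j S_G(q, d_j)},$$

where $S_G(q, d_k)$ is the cosine similarity between the query $q$ and the $k$-th of the $K$ candidates. Then, the normalized scores are used as the adaptive weights of the corresponding candidate images:

$$w_k = \hat{S}_G(q, d_k), \quad k = 1, \ldots, K.$$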
Experimental results show that adaptive weights significantly improve the accuracy of image retrieval on both the Holidays and Ukbench datasets. The performance of C2F with and without weights is compared in Figure 4.
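The weighting step can be sketched as a min-max normalization of the candidates' holistic scores. This is a minimal sketch under two assumptions that the text does not spell out: the target range is taken to be [0, 1], and all-equal scores are mapped to a weight of 1.0. The name `adaptive_weights` is hypothetical.

```python
def adaptive_weights(scores):
    """Min-max normalize holistic similarity scores into weights.

    Assumptions of this sketch: the output range is [0, 1], and when all
    scores are equal every candidate receives the weight 1.0.
    """
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [1.0 for _ in scores]
    span = highest - lowest
    return [(score - lowest) / span for score in scores]


# Cosine scores of three candidates for one query (toy values).
print(adaptive_weights([0.9, 0.7, 0.5]))
```

Because the minimum and maximum are taken over the candidates of one query, the weights are recomputed for every query without any offline step.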

III-C Further refinement with local features

Given the candidate images and their corresponding weights of a certain query, we focus on utilizing local features to find out ground truths.

Firstly, we extract SIFT descriptors from an image with the Hessian-affine detector. The image contains local descriptors in dimensions.

Then, a codebook trained on the Flickr60k dataset with the Approximate K-Means (AKM) [18] clustering algorithm is used to quantize the SIFT features. Specifically, each local feature of an image is assigned to its nearest visual word of the codebook, represented by the ID of the corresponding visual word.
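The assignment of a descriptor to its nearest visual word can be sketched as below. Note that this sketch swaps in an exhaustive nearest-neighbor search for the approximate search used in practice, and the name `quantize` as well as the toy 2-D codebook are illustrative assumptions (real SIFT descriptors are 128-dimensional).

```python
def quantize(descriptor, codebook):
    """Return the ID (index) of the codeword nearest to `descriptor`.

    Exhaustive search over squared Euclidean distances; a stand-in for the
    approximate nearest-neighbor search used with large codebooks.
    """
    best_id, best_dist = -1, float("inf")
    for word_id, center in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(descriptor, center))
        if dist < best_dist:
            best_id, best_dist = word_id, dist
    return best_id


# Toy codebook with three 2-D "visual words".
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(quantize([0.9, 1.2], codebook))
```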


Next, to enhance the discriminative power of the visual words, a -bit binary signature is generated for each SIFT feature , which encodes the location of the SIFT descriptor within the Voronoi cell. The distance between two descriptors and lying in the same cell is approximated by the Hamming distance of their corresponding binary signatures.


At this point, a SIFT descriptor is represented by and , and a standard inverted index is generated. Then the matching function is defined as:


where is the Hamming distance defined in Eq. 9, and is a fixed Hamming threshold.
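The matching rule described here (same visual word, and Hamming distance between signatures within a fixed threshold) can be sketched as follows. This is a minimal sketch: signatures are modeled as Python integers, the names `hamming_distance` and `is_match` are hypothetical, and the binary match/no-match outcome is an assumption, since a weighted vote is also common with Hamming Embedding.

```python
def hamming_distance(sig_a, sig_b):
    """Number of differing bits between two binary signatures stored as ints."""
    return bin(sig_a ^ sig_b).count("1")


def is_match(word_a, sig_a, word_b, sig_b, threshold):
    """True if both features share a visual word and their signatures are close.

    `threshold` plays the role of the fixed Hamming threshold in the text.
    """
    return word_a == word_b and hamming_distance(sig_a, sig_b) <= threshold


# Two features quantized to visual word 7, with 4-bit toy signatures.
print(is_match(7, 0b1011, 7, 0b0010, threshold=2))  # signatures differ in 2 bits
```

With the 128-bit signatures and threshold of 52 reported in the experimental setup, the same functions apply unchanged because Python integers have arbitrary precision.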

The similarity score between the query and the candidate image is:


where is the number of local features in the query image , and are the local features of the query and dataset image, respectively, and L indicates that the score is computed via local features.

Considering the similarity degrees measured by holistic representations, the sorted list of the candidate images obtained via local features are refined according to the adaptive weights learned in Sec. III-B. The final similarity scores of the query and the candidate images are defined as:


where is the final similarity score between the query and the dataset image d, is the weight obtained by the intuitively or steadily weighted method.
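The refinement step can be sketched as a re-ranking of the candidates in which each local-feature score is modulated by the candidate's adaptive weight. The multiplicative combination used below is an assumption of this sketch, not a formula confirmed by the text, and the name `refine` is hypothetical.

```python
def refine(candidate_ids, local_scores, weights):
    """Combine local scores with adaptive weights and sort best first.

    Assumption: the final score is the product weight * local score.
    Returns a list of (candidate_id, final_score).
    """
    final = [(cid, weight * score)
             for cid, score, weight in zip(candidate_ids, local_scores, weights)]
    final.sort(key=lambda pair: pair[1], reverse=True)
    return final


ids = ["a", "b", "c"]
local = [0.5, 0.6, 0.1]      # scores from local features
weights = [1.0, 0.5, 0.0]    # adaptive weights from holistic scores
print(refine(ids, local, weights))
```

In this toy example the local scores alone would rank "b" first, while the weighted scores promote "a", which is the globally more similar candidate.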

Overall, we propose a two-layer fusion framework which takes advantage of global and local cues and ranks database images from coarse to fine. Considering the efficiency and low storage consumption of holistic features, we use the HSV histograms to coarsely search the original database and filter out noise images. Then, accurate image search is conducted on the candidate images, combined with the adaptive weights learned via global similarity scores to further enhance the retrieval accuracy. Experimental results in Sec. IV-C4 demonstrate that the retrieval accuracy of C2F is competitive with the state of the art.

III-D Complexity and Scalability

To make the following analysis clearer, we first list the quantities involved: the number of dataset images, the number of query images, the number of candidate images $K$, and the number of features to be fused.

The rising processing time of image retrieval is mainly caused by the increasing number of candidate images that are compared with the query. In this paper, we propose a two-layer fusion method which takes advantage of global and local cues and ranks database images from coarse to fine. The total number of comparisons in C2F is relatively small compared with other fusion methods [15, 11]. For example, the number of comparisons in [11] depends on the number of fused features, which is set to 5 in the experiments. Correspondingly, the reduced number of comparisons in C2F contributes to its efficiency and low storage consumption. The memory consumed by C2F is much smaller than that of [11]. This is because [11] needs to save multiple features of all images in the database, while C2F stores holistic features of all database images and local features of only the candidate images.
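To give a feel for the comparison counts discussed above, the sketch below contrasts the two schemes under a simplified counting model. The formulas are assumptions of this sketch, not taken from the text: C2F is charged one holistic comparison per database image plus one local comparison per candidate, and a simultaneous fusion method is charged one comparison per database image per fused feature. The function names are hypothetical, and the counts ignore that holistic and local comparisons differ in cost.

```python
def c2f_comparisons(num_db, num_candidates, num_queries=1):
    """Assumed count for C2F: scan all images holistically, then refine K candidates."""
    return num_queries * (num_db + num_candidates)


def simultaneous_fusion_comparisons(num_db, num_features, num_queries=1):
    """Assumed count when every fused feature searches the whole database."""
    return num_queries * num_db * num_features


# Holidays (1,491 images) plus 1M distractors, K = 1000, five fused features.
num_db = 1_491 + 1_000_000
print(c2f_comparisons(num_db, 1000))
print(simultaneous_fusion_comparisons(num_db, 5))
```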

Holidays (K)        0      100    200    300    400    500    600    700    800    900    1000   1491
HSV+HE, mAP (%)     61.95  79.81  80.65  80.68  80.73  81.23  81.55  82.11  82.27  82.25  82.25  80.16

Ukbench (K)         0      200    400    600    800    1000   1200   1400   1600   1800   2000   10200
HSV+HE, N-S score   3.19   3.621  3.623  3.623  3.621  3.617  3.614  3.609  3.606  3.604  3.603  3.484

TABLE I: The filtering and refining effect of C2F. K is the number of candidates selected via the holistic representation. We use HSV histograms and BOW to filter out distractors and refine candidates, respectively. No weights are embedded in C2F in this experiment.

The computational cost incurred by the proposed two-layer fusion framework is quite small. C2F computes the adaptive weights via holistic similarity scores; this processing time is less than 1 second for over a million database images in the experiments. The weighting method used in C2F is straightforward compared with the normalization method in [11]. Moreover, compared with the graph fusion method [15], which constructs a weighted graph offline for each query from the retrieval results of one method, C2F does not need any offline processing scheme and can automatically adapt to dynamic changes in the database.

IV Experiments

In the following sections, we describe the datasets and the evaluation protocols (Sec. IV-A), the baseline features used in C2F (Sec. IV-B), the analysis of the experimental results (Sec. IV-C), the efficiency and storage of C2F (Sec. IV-D), and the discussion on C2F (Sec. IV-E).

IV-A Datasets and Evaluation Protocol

We evaluate our framework on two commonly used benchmarks, i.e., the Holidays and Ukbench datasets. With Holidays, we add 1M distractor images from ImageNet [33] for large-scale experiments.

The Holidays [19] dataset is composed of 1,491 personal holiday images undergoing various transformations. There are 500 queries in total, and the other 991 images are the similar ones corresponding to the queries. Most queries have 2 to 3 ground truth images. For evaluation, we adopt Mean Average Precision (mAP) [18] to measure retrieval accuracy. It is the mean value of Average Precision (AP), which computes the area under the precision-recall curve for each query. Specifically, precision is defined as the ratio of retrieved positive images to the total number retrieved, and recall is the ratio of the number of retrieved positive images to the total number of positive images in the dataset. The ideal precision-recall curve has a precision of 1 at every recall level.
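The evaluation measure can be sketched as below. This minimal sketch uses the common non-interpolated form of AP (the mean of the precision values at the ranks of the relevant images), which is an assumption; benchmark evaluation scripts may integrate the precision-recall curve slightly differently. The function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated AP: mean precision at the rank of each relevant image."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant images that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)


runs = [([3, 1, 2, 4], [1, 4]), ([1, 2], [1])]
print(mean_average_precision(runs))
```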

The Ukbench [3] dataset contains 10,200 images of 2,550 objects, each of which has four images taken from different viewpoints. In the experiments, all 10,200 images are used as queries in turn. The expected results for each query are the four images of the same object. For evaluation, the N-S score is used to measure retrieval accuracy. The accuracy of each query is measured by counting the number of correct images among the top-4 ranked images of the returned results. The retrieval performance on Ukbench is then averaged over all test queries (the maximum N-S score is 4).
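The N-S score described above can be sketched in a few lines. The function names are hypothetical; the logic follows the description: count the correct images among the top four results, then average over all queries.

```python
def ns_score(ranked_ids, relevant_ids):
    """Number of correct images among the top-4 ranked results (0 to 4)."""
    relevant = set(relevant_ids)
    return sum(1 for image_id in ranked_ids[:4] if image_id in relevant)


def mean_ns_score(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(ns_score(r, g) for r, g in runs) / len(runs)


# One query whose object is depicted by images 10 to 13; image 99 is a false match.
print(ns_score([10, 11, 99, 12, 13], [10, 11, 12, 13]))
```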

ImageNet 1M. ImageNet [33] is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds or thousands of images. Since the ImageNet dataset contains sufficient images with large variations and is readily accessible, it is well suited to benchmarking the accuracy, computation, and memory usage of large-scale image retrieval. In the experiments, we use 1 million images of 1,000 categories in ImageNet as distractors to evaluate the scalability of C2F. Following the experimental settings of [18, 32], we use the original 500 queries and the corresponding ground truths of Holidays as the queries and ground truths of the large-scale dataset. We assume there is no ground truth in the newly added images from ImageNet. mAP is used to measure the retrieval accuracy.

IV-B Baseline Features

Holistic representations and local features are used in C2F to enhance the retrieval performance. In particular, we consider the baselines of HSV histograms [11] and BOW [19], as well as some recent techniques that improve retrieval accuracy.

Datasets HSV BOW
Ukbench (N-S score) 3.19 3.484
Holidays (mAP(%)) 61.95 80.16
TABLE II: Retrieval accuracy with baseline features.

HSV We compute a 1,000-dimensional histogram for each image. The parameters of bins for H, S, V are 20, 10, 5, respectively.
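A histogram with 20, 10 and 5 bins for H, S and V has 20 × 10 × 5 = 1,000 joint bins, which matches the stated dimensionality. The sketch below builds such a histogram with the standard-library `colorsys` module. The function name, the uniform bin edges, the bin ordering and the pixel format (RGB components in [0, 1]) are assumptions of this sketch.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 20, 10, 5  # bin counts stated in the text


def hsv_histogram(rgb_pixels):
    """Joint HSV histogram with H_BINS * S_BINS * V_BINS = 1000 bins.

    `rgb_pixels` is an iterable of (r, g, b) tuples with components in [0, 1].
    """
    hist = [0] * (H_BINS * S_BINS * V_BINS)
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        # Clamp so that a component equal to 1.0 falls into the last bin.
        h_idx = min(int(h * H_BINS), H_BINS - 1)
        s_idx = min(int(s * S_BINS), S_BINS - 1)
        v_idx = min(int(v * V_BINS), V_BINS - 1)
        hist[(h_idx * S_BINS + s_idx) * V_BINS + v_idx] += 1
    return hist


# Two pure-red pixels and one black pixel.
hist = hsv_histogram([(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
print(len(hist), sum(hist))
```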

BOW Following the implementation setup in [19], scale-invariant keypoints are detected with the Hessian-affine [18] detector. For each SIFT feature, we use the quantization scheme as in [18]. To improve recall, rootSIFT [35] and Multiple Assignment (MA) are applied. Moreover, complying with [11], we use 128-bit Hamming signature with the Hamming threshold and weighting parameter set to 52 and 26, respectively. Retrieval accuracy of the baseline features is presented in Table II.

IV-C Experimental results

In the following, we make a detailed analysis of the performance of C2F with different implementation details.

IV-C1 Retrieval performance of C2F

Filtering. In order to evaluate the performance of the filtering mechanism, we conduct experiments by filtering out different numbers of distractors. In particular, noise images are discarded via HSV, and accurate image retrieval is further conducted on the selected candidates via local features. Besides, to better illustrate the effectiveness of the filtering and refining, the retrieval accuracies reported in Table I are obtained without adaptive weights.

Table I presents the performance of C2F on the Holidays and Ukbench datasets with different elimination degrees. For example, on Holidays, K = 200 means that 1291 images are regarded as distractors and discarded, while K = 1491 means that the retrieval experiments are performed on all dataset images via BOW. When the number of candidates is set to 800, the retrieval result of C2F is 82.27, which is 2.11% higher than that obtained by retrieving the whole dataset with local features (mAP = 80.16). The filtering performance of C2F is also verified on Ukbench. The N-S score of C2F with 1000 candidates is 0.133 higher than that with 10200 images (no image is filtered out from Ukbench). Experimental results demonstrate that filtering out distractor images is quite important for improving the final retrieval performance.

Refining. We also conduct complementary experiments to evaluate the performance of the refining strategy, namely, whether the candidates selected from the image dataset are further refined via local features. On both Holidays and Ukbench, K = 0 means that no candidates are used to further enhance the retrieval results computed via the HSV histogram.

Experimental results in Table I indicate that further refining on candidates can significantly enhance the retrieval accuracy at the cost of a little more memory consumption. For example, on Holidays dataset, the accuracy of the C2F with 100 candidates is 18.6% higher than that of K = 0. And this growth will be more apparent when more candidates are selected, e.g., the accuracy of the C2F with 500 candidates is 85.11. The effectiveness of the refining on Ukbench is consistent with that on Holidays. When 200 candidates are selected to refine the retrieval results obtained via HSV (N-S = 3.19), search accuracy is increased to 3.693. When K is set to 1000, we get a desirable accuracy (N-S = 3.725). In general, the experimental results on both Holidays and Ukbench demonstrate that the filtering and refining strategies adopted in C2F can significantly enhance the retrieval accuracy in a simple but effective manner.

Fig. 3: Sensitivity to the candidate number K on (a) Holidays and (b) Ukbench datasets. We use the HSV histograms to filter out distractors and BOW to further refine candidates.

IV-C2 Impact of the candidate number

In this section, we conduct experiments to validate the performance of C2F with different numbers of candidates. In particular, since Holidays only contains 1491 images in total, we vary the number of candidates K from 100 to 1000. For the Ukbench dataset, we vary K from 200 to 2000.

From Figure 3, we observe that on both the Holidays and Ukbench datasets, the retrieval accuracy improves when the candidate number increases. More candidates generally help improve the retrieval performance; however, such an increase is not always guaranteed. Once all ground truths of a query are recalled, the retrieval performance shows only subtle improvement, because more candidates mean more distractors. We attribute this phenomenon to the negative impact of noisy images, that is, some noisy images will be selected when K is set to a large value. The noisy images contribute nothing and may even drag down the final retrieval performance.

Methods             Ours   [11]   [17]  [8]   [16]   [36]  [45]  [38]  [4]   [15]  [42]  [31]  [39]  [40]
Ukbench (N-S score) 3.737  3.755  3.71  3.62  3.60   3.75  -     -     3.52  3.77  3.56  3.50  3.67  3.42
Holidays (mAP, %)   86.78  84.47  84.0  81.9  80.86  84.7  82.2  82.1  -     84.6  78.1  78.9  42.3  81.3

TABLE III: Performance comparison with the state of the art.

Candidates (K)  100    200    300    400    500    600    700     800     900     1000
Memory (MB)     17.24  33.82  50.21  66.45  82.67  98.91  115.26  131.92  149.00  166.30
Time (s)        1.14   1.99   2.78   3.49   4.83   5.66   6.88    7.99    9.14    10.43

TABLE IV: Memory consumption (MB) and average query time (s) on Holidays against the number of candidates. Feature extraction and quantization time is excluded.

Candidates (K)  200    400    600    800    1000   1200    1400    1600    1800    2000
Memory (MB)     17.69  34.80  52.13  69.65  87.34  105.14  123.16  141.26  159.49  177.93
Time (s)        0.27   0.51   0.65   1.03   1.28   1.56    1.98    2.13    2.34    2.56

TABLE V: Memory consumption (MB) and average query time (s) on Ukbench against the number of candidates.

From Figure 3(a), it can be observed that, on the Holidays dataset, K = 1000 obtains the best retrieval accuracy among all values. With the adaptive weights, the mAP with 800 candidates is 1.67% higher than that with 500 (mAP = 85.11). When the number of candidates continues to grow, the retrieval accuracy starts to drop slightly. For example, the mAPs of candidate-1200 and candidate-1400 are 0.07% and 0.09% lower than that achieved by candidate-1000, respectively. Another noteworthy phenomenon is that the mAP with the fewest candidates (K = 100) is the lowest, e.g., mAP = 80.56 with adaptive weights. This accuracy loss is caused by improper filtering. In other words, the ground truths of some queries are filtered out due to the small number of candidate images.

Fig. 4: The impact of adaptive weights. We compare the performance with the case where no weight is used. The red bar represents the C2F result without weights, while green bar shows results with adaptive weights.

Figure 3(b) shows that candidate-1600 achieves desirable accuracy (N-S = 3.733). With a further increase of the candidates, the retrieval performance shows only subtle improvement. This means that when K = 1600, the ground truths of most queries are recalled via the holistic representation, and more candidate images cannot significantly enhance the retrieval precision. What is more, C2F without any weights achieves the best accuracy when the candidate number is set to 800 (N-S = 3.629), and the retrieval performance declines with a further increase of K. For example, with 1400 and 1600 candidates the N-S scores are 3.601 and 3.603, which are 0.028 and 0.025 lower than that with 800. All of the above analysis demonstrates that the choice of the candidate number plays a dominant role in C2F.

IV-C3 Evaluation on the adaptive weights

To demonstrate the effectiveness of the adaptive weights, we compare with the case in which no weights are used. In other words, the holistic representation is only used to filter out noisy images, and the selected candidates are considered to be equally important. The results are shown in Figure 4. It is clear that the use of adaptive weights benefits accurate image retrieval.

Compared with C2F without weights, adaptive weights help to achieve desirable retrieval performance on the Holidays dataset, as shown in Figure 4(a). When the candidate number is set to 500, the mAP of C2F with weights is 3.88% higher than that without weights. This gap widens as the number of candidate images increases. Taking the retrieval result with candidate number K = 1400 as an example, the accuracy of C2F with weights is 85.93, which is 4.70% higher than that without weights.

From Figure 4(b), we can observe that, on the Ukbench dataset, C2F with adaptive weights significantly outperforms C2F without weights. When the candidate number is set to 1000, the N-S score of C2F with weights is 3.725, which is 0.108 higher than that without weights. As the number of candidate images increases, the advantage of the adaptive weights persists. In general, the retrieval results show that the weighting strategy significantly influences the final retrieval performance when all other settings are kept the same.

IV-C4 Comparison with state-of-the-arts

Table III shows that our method yields competitive results when compared with the state of the art. On the Holidays dataset, C2F with adaptive weights achieves desirable performance among these methods. With the same holistic and local descriptors, e.g., the 1000-dimensional HSV histogram and BOW embedded with Hamming signatures, C2F achieves an accuracy of 86.78, which is 3.31% higher than the query-adaptive fusion method [11]. Meanwhile, we observe that our result on Ukbench is slightly lower than [11] and [15] by 0.013 and 0.018, respectively. This is because the BOW result used in [11] is 0.098 higher than ours, i.e., the N-S score of BOW is 3.582 in [11], while in C2F it is 3.484. Compared with the graph based method in [15], we trade a small loss of accuracy for large storage savings and improved efficiency.

Distractors    C2F      C2F+wt
0              82.25    86.78
1000           80.23    83.95
10,000         77.36    79.30
100,000        71.64    72.73
500,000        68.39    69.41
1,000,000      67.26    69.15
TABLE VI: The retrieval performance (mAP) of C2F with different numbers of distractor images. The candidate number is set to 1000, and adaptive weights (C2F+wt) are used to enhance the retrieval performance.

IV-C5 Large-scale image retrieval analysis

To further evaluate the performance of C2F on large-scale image datasets, we merge Holidays with ImageNet [33]. In the experiments, we use different numbers of distractors to test the scalability of the proposed two-layer fusion framework. Specifically, we set the number of candidates to 1000 and adopt adaptive weights to enhance the retrieval accuracy.

Table VI shows the retrieval results with different numbers of distractor images. As expected, the performance of C2F gradually decreases as distractors are added to the database. When the number of distractor images is 0, C2F obtains its best accuracy. When all 1M ImageNet distractors are added to the Holidays database, the mAP of C2F drops significantly, to 12.63% lower than that without distractors. Moreover, we do not increase the number of candidates as the distractors increase, so C2F improves efficiency and reduces memory consumption at the cost of some accuracy loss.

IV-D Efficiency and Storage

It is well known that BOW combined with Hamming Embedding can achieve good performance. However, a major bottleneck in BOW-based methods is the length of the inverted list, which grows almost linearly with the database size. Traversing the inverted index built from the whole database can be prohibitively expensive. The problem becomes more severe in the typical case where the benchmark contains millions of images, since the inverted list can occupy gigabytes. This means that, for each query feature, a naive full comparison with all images in the dataset takes considerable time. Table IV and Table V give the average per-query memory consumption and the corresponding query time on the Holidays and Ukbench datasets, respectively, as the number of candidates increases.

With the same candidate number, the memory consumed on Ukbench is far less than that on Holidays because of the different image sizes, while the increasing trend is consistent on the two benchmarks. Similarly, from Table IV and Table V, we observe that the average query time also increases significantly with the number of candidates. This is because more candidates mean a larger inverted index and hence more search time per query. Thus, an effective filtering mechanism is necessary for large-scale image retrieval, and a moderate candidate set is important for balancing retrieval accuracy against memory consumption.
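The relation between candidate count and search cost described above can be illustrated with a small sketch. It assumes a plain inverted index over visual-word ids with simple vote counting (no Hamming Embedding or TF-IDF weighting); the function names and the toy database are hypothetical, and rebuilding the index on every call is done only to keep the example short.

```python
from collections import defaultdict


def build_inverted_index(image_words):
    """Map each visual-word id to the list of image ids that contain it."""
    index = defaultdict(list)
    for image_id, words in image_words.items():
        for word in set(words):
            index[word].append(image_id)
    return index


def vote(query_words, image_words, candidates=None):
    """Count shared visual words per image.

    Returns (scores, visited), where visited is the number of inverted-list
    entries traversed for this query.
    """
    if candidates is not None:
        # Restrict the index to the candidate set chosen by the coarse layer.
        image_words = {i: image_words[i] for i in candidates}
    index = build_inverted_index(image_words)
    scores = defaultdict(int)
    visited = 0
    for word in set(query_words):
        for image_id in index.get(word, []):
            scores[image_id] += 1
            visited += 1
    return dict(scores), visited


# Toy database: image id -> visual-word ids (made-up values).
database = {0: [1, 2, 3], 1: [2, 3, 4], 2: [1, 5], 3: [3, 6], 4: [7, 8]}
query = [1, 3]

full_scores, full_visited = vote(query, database)
cand_scores, cand_visited = vote(query, database, candidates=[0, 1])
```

With the whole toy database, five inverted-list entries are traversed; restricted to two candidates, only three are, which mirrors the trend of query time growing with the candidate number.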

IV-E Discussion

Fig. 5: Visualization of false retrieval results from Holidays. For each query, the top returned images are obtained via HSV histograms (first row), BOW (second row), and C2F (third row), respectively.

Figure 5 presents some false retrieval results of C2F on the Holidays dataset. These failure cases have two characteristics: 1) The query image itself contains ambiguous information. Taking the query in the upper left corner as an example, we expect the top returned images to contain the waterfall, but the query in fact contains abundant greenery. This is why neither the global representation nor the local features yield satisfactory results. 2) In the experiments, C2F uses HSV histograms, which describe the overall color distribution of an image, to filter out distractors, so the top-ranked candidates share a similar color composition. Errors introduced at this stage affect the accuracy of the next stage; see the queries in the upper right and lower left corners of Figure 5.

V Conclusion

In this paper, we propose a two-layer fusion method that takes advantage of global and local cues and ranks database images from coarse to fine. The main purpose of C2F is to reduce memory consumption and computational complexity without compromising retrieval accuracy. To achieve this goal, C2F adopts a holistic representation to filter out noisy images from the database and chooses the images with high similarity scores as candidates. For each candidate, an adaptive weight is learned from the holistic similarity scores. Retrieval is then conducted on the candidate set, taking the adaptive weights into account. Comprehensive experiments are conducted to evaluate the accuracy and scalability of C2F. With the same holistic and local descriptors, the accuracy of C2F on Holidays is 3.31% higher than the query-adaptive fusion method [11]. In future work, we will explore more effective holistic representations and design more adaptive weighting functions.


  • [1] Sivic J, Zisserman A. Video Google: a text retrieval approach to object matching in videos[C]. International Conference on Computer Vision, 2003.
  • [2] Lowe D. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004.
  • [3] Nister D, Stewenius H. Scalable Recognition with a Vocabulary Tree[C]. Computer Vision and Pattern Recognition, 2006.
  • [4] Shen X, Lin Z, Brandt J, et al. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking[C]. Computer Vision and Pattern Recognition, 2012.
  • [5] Zheng L, Wang S. Visual Phraselet: Refining Spatial Constraints for Large Scale Image Search[J]. IEEE Signal Processing Letters, 2013, 20(4): 391-394.
  • [6] Zhou W, Lu Y, Li H, et al. Spatial coding for large scale partial-duplicate web image search[C]. ACM Multimedia, 2010.
  • [7] Philbin J, Chum O, Isard M, et al. Lost in quantization: Improving particular object retrieval in large scale image databases[C]. Computer Vision and Pattern Recognition, 2008.
  • [8] Zheng L, Wang S, Zhou W, et al. Bayes Merging of Multiple Vocabularies for Scalable Image Retrieval[C]. Computer Vision and Pattern Recognition, 2014.
  • [9] Jegou H, Perronnin F, Douze M, et al. Aggregating Local Image Descriptors into Compact Codes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(9): 1704-1716.
  • [10] Dong L, Liang Y, Kong G, et al. Holons Visual Representation for Image Retrieval[J]. IEEE Transactions on Multimedia, 2016, 18(4): 714-725.
  • [11] Zheng L, Wang S, Tian L, et al. Query-adaptive late fusion for image search and person re-identification[C]. Computer Vision and Pattern Recognition, 2015.
  • [12] Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional Architecture for Fast Feature Embedding[C]. ACM Multimedia, 2014.
  • [13] Oliva A, Torralba A. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope[J]. International Journal of Computer Vision, 2001, 42(3): 145-175.
  • [14] Wright J, Yang A Y, Ganesh A, et al. Robust Face Recognition via Sparse Representation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(2): 210-227.
  • [15] Zhang S, Yang M, Cour T, et al. Query specific fusion for image retrieval[C]. European Conference on Computer Vision, 2012.
  • [16] Zhang S, Yang M, Wang X, et al. Semantic-Aware Co-indexing for Image Retrieval[C]. International Conference on Computer Vision, 2013.
  • [17] Zheng L, Wang S, Liu Z, et al. Packing and Padding: Coupled Multi-index for Accurate Image Retrieval[C]. Computer Vision and Pattern Recognition, 2014.
  • [18] Philbin J, Chum O, Isard M, et al. Object retrieval with large vocabularies and fast spatial matching[C]. Computer Vision and Pattern Recognition, 2007.
  • [19] Jegou H, Douze M, Schmid C, et al. Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search[C]. European Conference on Computer Vision, 2008.
  • [20] Yang J, Yu K, Gong Y, et al. Linear spatial pyramid matching using sparse coding for image classification[C]. Computer Vision and Pattern Recognition, 2009.
  • [21] Mikulik A, Perdoch M, Chum O, et al. Learning a fine vocabulary[C]. European Conference on Computer Vision, 2010.
  • [22] Niu Z, Zhang S, Gao X, et al. Personalized Visual Vocabulary Adaption for Social Image Retrieval[C]. ACM Multimedia, 2014.
  • [23] Liu Z, Li H, Zhou W, et al. Embedding spatial context information into inverted file for large-scale image retrieval[C]. ACM Multimedia, 2012.
  • [24] Liu Z, Li H, Zhou W, et al. Uniting Keypoints: Local Visual Information Fusion for Large-Scale Image Search[J]. IEEE Transactions on Multimedia, 2015, 17(4): 538-548.
  • [25] Jegou H, Douze M, Schmid C, et al. Product Quantization for Nearest Neighbor Search[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(1).
  • [26] Torralba A, Fergus R, Weiss Y, et al. Small codes and large image databases for recognition[C]. Computer Vision and Pattern Recognition, 2008.
  • [27] Gong Y, Lazebnik S. Iterative quantization: A procrustean approach to learning binary codes[C]. Computer Vision and Pattern Recognition, 2011.
  • [28] Weiss Y, Torralba A, Fergus R, et al. Spectral Hashing[C]. Neural Information Processing Systems, 2009.
  • [29] Fagin R, Kumar R, Sivakumar D, et al. Efficient similarity search and classification via rank aggregation[C]. International Conference on Management of Data, 2003.
  • [30] Liu Z, Wang S, Zheng L, et al. Visual reranking with improved image graph[C]. International Conference on Acoustics, Speech, and Signal Processing, 2014.
  • [31] Wengert C, Douze M, Jegou H, et al. Bag-of-colors for improved image search[C]. ACM Multimedia, 2011.
  • [32] Yang L, Geng B, Cai Y, et al. Object Retrieval Using Visual Query Context[J]. IEEE Transactions on Multimedia, 2011, 13(6): 1295-1307.
  • [33] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]. Computer Vision and Pattern Recognition, 2009.
  • [34] Muja M, Lowe D G. Fast approximate nearest neighbors with automatic algorithm configuration[C]. International Conference on Computer Vision Theory and Applications, 2009.
  • [35] Arandjelovic R, Zisserman A. Three things everyone should know to improve object retrieval[C]. Computer Vision and Pattern Recognition, 2012.
  • [36] Deng C, Ji R, Liu W, et al. Visual Reranking through Weakly Supervised Multi-graph Learning[C]. International Conference on Computer Vision, 2013.
  • [37] Jegou H, Schmid C, Harzallah H, et al. Accurate Image Search Using the Contextual Dissimilarity Measure[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(1): 2-11.
  • [38] Qin D, Wengert C, Van Gool L, et al. Query Adaptive Similarity for Large Scale Object Retrieval[C]. Computer Vision and Pattern Recognition, 2013.
  • [39] Qin D, Gammeter S, Bossard L, et al. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors[C]. Computer Vision and Pattern Recognition, 2011.
  • [40] Jegou H, Douze M, Schmid C, et al. Improving Bag-of-Features for Large Scale Image Search[J]. International Journal of Computer Vision, 2010, 87(3).
  • [41] Zhang Y, Jia Z, Chen T, et al. Image retrieval with geometry-preserving visual phrases[C]. Computer Vision and Pattern Recognition, 2011.
  • [42] Wang X, Yang M, Cour T, et al. Contextual weighting for vocabulary tree based image retrieval[C]. International Conference on Computer Vision, 2011.
  • [43] Douze M, Ramisa A, Schmid C, et al. Combining attributes and Fisher vectors for efficient image retrieval[C]. Computer Vision and Pattern Recognition, 2011.
  • [44] Jegou H, Douze M, Schmid C, et al. On the burstiness of visual elements[C]. Computer Vision and Pattern Recognition, 2009.
  • [45] Tolias G, Avrithis Y, Jegou H, et al. To Aggregate or Not to Aggregate: Selective Match Kernels for Image Search[C]. International Conference on Computer Vision, 2013.
  • [46] Bai Y, Yu W, Xiao T, et al. Bag-of-Words Based Deep Neural Network for Image Retrieval[C]. ACM Multimedia, 2014.
  • [47] Zhou K, Liu Y, Song J, et al. Deep Self-taught Hashing for Image Retrieval[C]. ACM Multimedia, 2015.
  • [48] Bhattacharjee S D, Yuan J, Tan Y, et al. Query-Adaptive Logo Search using Shape-Aware Descriptors[C]. ACM Multimedia, 2015.
  • [49] Liang Y, Dong L, Xie S, et al. Compact feature based clustering for large-scale image retrieval[C]. International Conference on Multimedia and Expo, 2014.
  • [50] Dong L, He L, Zhang Q, et al. Discriminative Light Unsupervised Learning Network for Image Representation and Classification[C]. ACM Multimedia, 2015.