Profile Based Sub-Image Search in Image Databases
Sub-image search with high accuracy in natural images remains a challenging problem. This paper proposes a new feature vector, called a profile, for a keypoint in a bag-of-visual-words model of an image. The profile of a keypoint captures the spatial geometry of all the other keypoints in the image with respect to itself, and is very effective in discriminating true matches from false matches. Sub-image search using profiles is a single-phase process that requires no geometric validation, yields high precision on natural images, and works well with a small visual codebook. This differs from traditional methods, which first generate a set of candidates disregarding spatial information, then verify them geometrically, and typically rely on large codebooks. We achieve a precision of 81% on a combined dataset of synthetic and real natural images using a codebook of size 500 for top-10 queries; this is 31% higher than the conventional candidate-generation approach.
1 Introduction and Motivation
Community-contributed image sites (e.g., Flickr.com) and stock photo sites (e.g., Gettyimages.com) have witnessed unprecedented growth over the last decade. Search for images by example is one of the most common tasks performed on these collections. A related task is sub-image retrieval: it extends traditional full-image search by allowing users to select a region in an image and then search for similar regions in a repository. Sub-image search is a very important tool for harnessing the potential of ever-growing image repositories. It is also being incorporated into online shopping (e.g., like.com) to help customers search for products using images, and it is an important tool for the analysis and study of biological and medical image datasets.
Community-contributed repositories contain images of scenes or objects taken under varying imaging conditions. These images exhibit affine, viewpoint, and photometric differences, as well as varying degrees of occlusion. Local descriptors like SIFT have been used in the literature with fair success to measure similarity between natural images. Images are scanned to detect keypoints, covariant regions are extracted around each keypoint, and finally a high-dimensional local feature vector representation of each region is computed. We thus know the spatial location, the geometry of the covariant region, and the high-dimensional feature vector of each keypoint detected in an image, and each image is transformed into a bag of feature vectors.
Researchers have pursued two-phase techniques to retrieve similar images using local descriptors. The first phase generates candidates while disregarding spatial relationships between keypoints, and the second phase performs geometric verification. Candidates are generated using two common approaches. One approach transforms each image into an orderless bag of visual words: the feature vectors of all images are clustered, and each feature vector is assigned the symbol of its cluster, called a visual word. All the feature vectors of an image are thus transformed into visual words, and the image is finally represented as a histogram of these visual words. This enables leveraging and extending existing text-retrieval techniques for image search. The distance between image histograms is computed using a vector norm. A given query image is likewise transformed into a histogram of visual words, and its top-k candidates are retrieved. The other approach finds the top-k nearest neighbors of each keypoint feature vector of the query image in the dataset of local feature vectors of all images. It ranks images by the total number of nearest neighbors detected in them and retains the top-k images as candidates. Both approaches are made efficient through indexing techniques.
We present two pathological results obtained by sub-image search using the first-phase methods described above. Figure ? is a query sub-image and Figure ? is an image among the top-5 results retrieved from the dataset. The components of the query pattern are scattered randomly in Figure ?, so it is not a meaningful match. A local descriptor like SIFT is computed using only the neighborhood pixels around a keypoint; similar keypoints are therefore detected in both the query and the retrieved image, because the retrieved image contains all the components of the query. The only way to distinguish between these two images is to consider the spatial layout of the keypoints. These methods generate this false match because they do not take spatial relationships between keypoints into account. Figure ? is another query sub-image and Figure ? is a true match that is not returned in the top-5 results. An explanation for this anomaly is that the query region is a very small fraction of the whole image; random outliers therefore score better than the true match when similarity is computed using only an orderless bag-of-words representation. Both examples motivate the necessity of using spatial relationships between matched points in sub-image search to obtain high accuracy. Existing literature applies geometric verification (Hough transform, RANSAC) on the candidate images to find the best match or to re-rank the top-k candidates. Generating a small set of high-quality candidates is very important for all conventional approaches, because geometric verification algorithms are iterative in nature and costly.
Sub-image search with high accuracy is very challenging and still remains an open problem, and better retrieval techniques are needed. The success of sub-image search tools in optimally utilizing image repositories for various applications is directly constrained by the accuracy of their results. The focus of our work is to develop a feature vector, and a sub-image similarity measure based on it, that yield high accuracy for sub-image search.
In this paper, we present a new feature vector called a profile. We construct a concentric profile for each keypoint in a bag-of-visual-words representation of an image (Section 3). A keypoint's profile approximately captures the spatial relationships of the other keypoints in the image with respect to itself. Sub-image retrieval for a given query is a single-phase search for the best matching profile in the database (Section 4). Our feature produces high-quality results for sub-image search without geometric verification, using a single-step search and a small visual codebook. We perform experiments to empirically validate our method in Section 5. We show that our profile achieves comparatively higher accuracy on the landmark dataset provided by Philbin et al. We also prepared a dataset that is a collection of natural and synthetic images. We used a small codebook of size 500 for the bag-of-words representation of images, compared to the thousands or even a million words used in the literature. A typical user mostly views the top-k results of a query returned by a search engine, so high accuracy in the top-k results is particularly important. We therefore focus on empirically analyzing the accuracy of the top-k matching images retrieved using our feature vector. Our approach achieves 81% precision on our dataset for top-10 results using a codebook of size 500; this is 31% higher than the precision of conventional methods.
The main contribution of our paper is the development of a robust feature vector for each keypoint that captures the spatial relationship of keypoints in an image, and a measure of similarity between the feature vectors for retrieving similar images.
2 Related Work
Lazebnik et al. proposed constructing a single feature vector for the whole image by concatenating histograms of local features of sub-regions obtained by repeatedly splitting the image at various scales in a principled way (the spatial pyramid). They used it for full-image categorization. This is a global feature vector and is not directly designed for sub-image search. It may fail to find matching sub-regions of database images that get split into multiple regions during feature vector construction. The similarity score between a sub-region of an image and its full image under this feature vector may be small, depending on the similarity measure used, making it difficult to distinguish the true match from false matches. Further, this feature is neither rotation nor translation invariant because of its use of orthogonal split lines. Our profile is computed for each keypoint of an image, and matches are retrieved based on keypoint profile similarity rather than full-image similarity. Our profile is therefore naturally suited to sub-image search. We discuss the scale, rotation, and translation invariance of our profile in Section 3.
Wang et al. proposed spatially clustering local descriptors per image, with bounds on the number of points in a cluster and on its radius. They represented each cluster as a bag of visual words and each image as a collection of these clusters. Images were ranked based on the count of their clusters that were top-k nearest neighbors of the clusters of a given query image; they achieved a precision of 65%. Philbin et al. clustered local descriptors to build a visual codebook of size 1 million and represented each image as a bag of visual words. They used a bag-of-words similarity measure to find candidates and then validated them using LO-RANSAC for restricted sets of transformations. They reported a mean average precision of 66% on a specialized dataset of landmark images using 1 million visual words and geometric verification. Ke et al. proposed generating candidates by nearest neighbor search using Locality Sensitive Hashing, validated with RANSAC, for sub-image retrieval aimed at near-duplicate image detection and copyright protection; they experimented only on a synthetic dataset. Lowe performed nearest neighbor search using the BBF algorithm and geometric validation by the Hough transform to recognize objects in images.
3 The Profile Feature Vector
In this section, we design our new feature vector, called a profile, which is created for each keypoint in an image. The profile of a keypoint is a structural representation of the spatial layout of all other keypoints around it. We assume that an image has been preprocessed and transformed into an orderless bag of visual words, and that we know the coordinates of each keypoint detected in the image. To form the profile, we draw concentric circles around each keypoint; Figure ? shows the concentric circles for two keypoints. Each ring is represented as a histogram of the visual words of the keypoints lying in it. The profile of a keypoint with m rings is the concatenated list of its m ring histograms, ordered from the center towards the outer rings. The dimension of a profile is directly proportional to the codebook size and the number of rings around a keypoint. The number of points in a ring (called its size) increases as we move away from the center and is a fixed function of the ring index. The number of points in the first ring, n1, is user defined and is the same for every keypoint's profile across all images. The i-th ring contains the number of points prescribed by this function, except for the last ring, where the required number of points may not exist. Note that we do not fix the radii of the rings; we fix the number of points in a ring, which is a function of n1. As a result, the number of rings in the profiles of all the keypoints in a given image is the same, but the radii of the rings may vary: they depend on the spatial density of the other keypoints around a given keypoint. An image with more keypoints will have more rings than an image with fewer keypoints, irrespective of image size, and an image with a high spatial density of keypoints will have narrower rings than one with a low spatial density.
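As an illustration, the ring construction can be sketched as follows. This is a hypothetical minimal implementation, not the authors' code: for simplicity it uses a constant number of points per ring (`ring_size`), whereas the paper grows the ring size with the ring index.

```python
from collections import Counter
from math import hypot

def build_profile(center, keypoints, words, ring_size=4):
    """Build the concentric profile of `center` as a list of ring histograms.

    keypoints: list of (x, y) coordinates; words: visual word of each keypoint.
    The remaining keypoints are sorted by distance from `center` and grouped
    into rings holding a fixed number of points each (a simplification: the
    paper lets ring sizes grow with the ring index).
    """
    others = [(hypot(x - center[0], y - center[1]), w)
              for (x, y), w in zip(keypoints, words) if (x, y) != center]
    others.sort(key=lambda t: t[0])
    return [Counter(w for _, w in others[i:i + ring_size])  # one histogram per ring
            for i in range(0, len(others), ring_size)]
```

As in the paper, the outermost ring may hold fewer points than required when the image runs out of keypoints.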
The profile of each keypoint of an image differs depending on the keypoint's position in the image, because the keypoints captured in the rings of different profiles are different.
Similarity between profiles: The similarity between two profiles P and Q, having m and n rings respectively, is given by

sim(P, Q) = Σ_{i=1..r} λ^(i-1) · s(P_i, Q_i),  where r = min(m, n)

and P_i, Q_i denote the i-th ring histograms of P and Q.
Here, λ (0 < λ < 1) is a decaying parameter that is always positive and is learned empirically. The similarity s(P_i, Q_i) between corresponding ring histograms can be computed using Jaccard's coefficient or the cosine measure. The distance between two profiles is computed as the complement of their similarity. For the purpose of explanation, assume Jaccard's coefficient as the measure of similarity. For this measure, the maximum similarity between two corresponding histograms is 1. Therefore, the best possible similarity between two profiles is

sim_max = Σ_{i=1..r} λ^(i-1) = (1 - λ^r) / (1 - λ).
The distance between two profiles is then given by

dist(P, Q) = sim_max - sim(P, Q) = Σ_{i=1..r} λ^(i-1) · (1 - s(P_i, Q_i)).
This can be represented as the recurrence

D_k = D_{k-1} + λ^(k-1) · (1 - s(P_k, Q_k)),  with D_0 = 0,
where D_k is the distance computed over the first k corresponding ring histograms. Other similarity measures and their complements can be used analogously to compute the similarity and distance between two profiles. Empirically, we found Jaccard's coefficient to perform better than the other measures.
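A minimal sketch of this computation, assuming Jaccard's coefficient on ring histograms stored as dicts mapping visual words to counts (function names are ours, not the paper's):

```python
def jaccard(h1, h2):
    """Generalized Jaccard coefficient between two visual-word histograms."""
    keys = set(h1) | set(h2)
    inter = sum(min(h1.get(k, 0), h2.get(k, 0)) for k in keys)
    union = sum(max(h1.get(k, 0), h2.get(k, 0)) for k in keys)
    return inter / union if union else 0.0

def profile_similarity(p, q, lam=1/3):
    """sim(P, Q): exponentially decayed sum over the first min(m, n) rings."""
    return sum(lam ** i * jaccard(hp, hq)
               for i, (hp, hq) in enumerate(zip(p, q)))

def profile_distance(p, q, lam=1/3):
    """dist(P, Q): complement of the similarity against the best score."""
    r = min(len(p), len(q))
    best = (1 - lam ** r) / (1 - lam)  # sum of lam^(i-1) for i = 1..r
    return best - profile_similarity(p, q, lam)
```

Identical profiles get distance 0, and disagreements in inner rings cost more than the same disagreements farther out.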
Similarity in the proximity of a keypoint should be weighted highly, with the weight decreasing as we move away from it, to diminish the effect of noise and overfitting during matching. This is the reason for the exponentially decaying aggregation of similarity between corresponding rings and for keeping increasingly more keypoints in the rings farther from the center.
A keypoint is localized in an image, and its local descriptor captures the properties of a small neighborhood around it. This locality makes keypoints naturally suited to sub-image search. Our concentric profile captures the spatial layout of the keypoints of the whole image with respect to a given keypoint, giving that keypoint a structured global view of the image and making the profile semantically richer than the keypoint itself. The profile of a keypoint can therefore distinguish effectively between a true match and a false match, and a sub-image search based on profiles achieves higher accuracy than a bag-of-visual-words based search.
Profile Vector Robustness:
The visual words that form our profile's histograms are obtained from local descriptors. Note that although local descriptors like SIFT provide invariance to affine changes for similarity search, feature vectors created by spatially dividing the image around a keypoint may compromise that invariance. Our profile vector retains invariance to rotation, scaling, translation, and occlusion. We divide the image only in the radial direction, so our method yields the same concentric profile for a given keypoint irrespective of the rotation of the image, making it rotation invariant; in Figure ?, the profile of the marked keypoint in the query image remains the same in the rotated image. Scaling an image alters the relative distances between points, as seen in Figure ?, so a fixed-radius concentric division would fail to provide scale invariance even though the local descriptors are scale invariant. Instead, we keep an equal number of keypoints in the corresponding rings while constructing the profile of a given keypoint, irrespective of the size and scale of the image. This generates the same profile for a given keypoint in two images regardless of scale; in Figure ?, the profile of the marked keypoint is the same in the original and scaled images. Our profile feature therefore remains invariant to scaling. Our search methodology automatically preserves translation invariance and provides robustness to occlusion, as discussed in Section 4.
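The scale and rotation invariance can be checked empirically with a toy sketch (our own illustration, not the authors' code): because rings hold a fixed number of points rather than a fixed radius, a uniform scaling or a rotation leaves every ring histogram unchanged.

```python
from collections import Counter
from math import hypot

def rings(center, pts, words, ring_size=2):
    """Ring histograms of visual words around `center` (fixed points per ring)."""
    others = sorted((hypot(x - center[0], y - center[1]), w)
                    for (x, y), w in zip(pts, words) if (x, y) != center)
    return [Counter(w for _, w in others[i:i + ring_size])
            for i in range(0, len(others), ring_size)]

pts = [(0, 0), (1, 0), (0, 2), (3, 3), (5, 1)]
words = ['a', 'b', 'a', 'c', 'b']

scaled = [(2 * x, 2 * y) for x, y in pts]   # uniform 2x scaling
rotated = [(-y, x) for x, y in pts]         # 90-degree rotation about the origin

# Fixed point counts per ring keep the histograms identical after scaling
# or rotation, since the relative distance ordering is preserved.
assert rings(pts[0], pts, words) == rings(scaled[0], scaled, words)
assert rings(pts[0], pts, words) == rings(rotated[0], rotated, words)
```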
4 Sub-Image Search Using Profiles
In this section, we describe how the similarity between pairs of profiles is used for sub-image search. Local descriptor methods extract similar keypoints from a query image and from the matching sub-region of a database image. Therefore, the keypoints captured in the more heavily weighted (inner) histograms of the profile of a keypoint in the query image will largely coincide with those in the profile of the corresponding keypoint in the matching sub-region of a database image, giving a high similarity score between the two profiles. We exploit this property to retrieve the best matching sub-regions from the database. All the images, represented as bags of words, are converted into bags of profiles, and the query is likewise processed into a bag of profiles. For each query profile we search for the best matching profile in the database, and we finally choose the highest scoring (query profile, database profile) pair among these. Formally, if the query has profiles {p_1, ..., p_s} and the database contains profiles {q_1, ..., q_N}, the best matching sub-image is the region around q_j*, where (i*, j*) = argmax_{i,j} sim(p_i, q_j).
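This search can be sketched as a brute-force scoring of all (query profile, database profile) pairs; the snippet below is an illustration under our assumptions (dict-based histograms, Jaccard ring similarity), not an indexed implementation.

```python
def profile_similarity(p, q, lam=1/3):
    """Decayed ring-by-ring Jaccard similarity (stand-in for the Section 3 measure)."""
    def jac(h1, h2):
        ks = set(h1) | set(h2)
        u = sum(max(h1.get(k, 0), h2.get(k, 0)) for k in ks)
        return (sum(min(h1.get(k, 0), h2.get(k, 0)) for k in ks) / u) if u else 0.0
    return sum(lam ** i * jac(a, b) for i, (a, b) in enumerate(zip(p, q)))

def best_match(query_profiles, db_profiles, lam=1/3):
    """Score every (query profile, database profile) pair; return the indices
    of the best pair and its score. The region around the winning database
    keypoint is reported as the best matching sub-image."""
    best, best_s = None, float("-inf")
    for qi, qp in enumerate(query_profiles):
        for di, dp in enumerate(db_profiles):
            s = profile_similarity(qp, dp, lam)
            if s > best_s:
                best_s, best = s, (qi, di)
    return best, best_s
```

In practice the database profiles would be grouped per image so that the winning profile also identifies the best matching image.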
Our search algorithm inherently provides translation invariance because it searches for the best matching profile, and we create a profile for each keypoint of an image. Since local descriptors are translation invariant, similar keypoints are detected in the matching sub-region of a translated image and in the given query. The profiles of the query therefore have higher similarity with the profiles of the keypoints detected in the matching sub-region than with the profiles of random keypoints, and our algorithm successfully finds such translated matches. The image in Figure ? contains a translation of a few objects from the original image as well as additional new objects; the concentric profile of the marked keypoint is very similar in both images. Our algorithm also finds a partially occluded but true matching sub-region of a database image: the profile of a keypoint detected in the preserved part of the matching sub-region has a relatively high similarity score with some keypoint profile of the given query, as illustrated in Figure ?. The true matching sub-region therefore ranks higher on similarity score than random profiles.
5 Experimental Evaluation
In this section, we first present a comparative study with a state-of-the-art technique on its authors' dataset. Next, we describe our own dataset and perform experiments showing that our profile-based approach yields high precision for sub-image search with a small visual codebook.
We applied our profile-based search to the Oxford dataset provided by Philbin et al.
| Rank k | Precision (%): profile-based search | Precision (%): bag-of-words based search |
We downloaded natural images from Flickr and also manually photographed a large number of scenes under varying conditions. We chose random queries from this real dataset of natural images for our experiments. To pose even tougher challenges, we synthetically created test images: we randomly chose sub-images from the natural images and embedded them into other large natural images, and we added varying levels of noise to these images. Among the operations used to add noise were rotation, scaling, shearing, Gaussian blur, and averaging noise. We also split a sub-image and scattered its fragments across other images. Finally, we obtained a dataset of natural and synthetic images; a snapshot of the dataset is shown in Figure 2. We used a total of 52 queries drawn from both the natural and the synthetic images.
Next, we created a bag-of-words representation of each image. We extracted covariant regions from each image along with the corresponding 128-dimensional SIFT vectors. We clustered the feature vectors using k-means with randomly chosen initial centers. We chose k-means because it is not only simple but also embarrassingly parallelizable. A naive computation of k-means has complexity O(kni), where k is the number of cluster centers, n is the number of data points, and i is the number of iterations until convergence; this can be reduced by exploiting the metric property (triangle inequality). We repeated the clustering several times and chose the clustering with the minimum average sum of squared error (scatter). We assigned a symbol to each cluster and mapped each SIFT feature vector to the symbol of the cluster to which it belonged, giving the bag-of-words representation of an image. For creating profiles, we chose n1 = 50 points in the first ring and used λ = 1/3 as the decaying parameter for aggregation. We found Jaccard's coefficient to yield better results than other measures and used it to compute similarities between the ring histograms of our profiles.
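Once the cluster centers are fixed, mapping SIFT vectors to visual words is a nearest-center assignment; a brute-force sketch with hypothetical names (a real pipeline would use an indexed or vectorized search):

```python
def assign_visual_words(features, centers):
    """Map each local feature vector to the index of its nearest cluster
    center (its visual word), using brute-force squared Euclidean distance."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda c: d2(f, centers[c]))
            for f in features]
```

The resulting list of word indices per image is exactly the bag-of-words representation described above.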
We computed the precision of the top-k results obtained using our profile-based search and compared it against the precision of conventional methods, which compute similarity only on the bag of words without taking spatial relationships into account. For the conventional methods we used cosine similarity, Jaccard's coefficient, and a vector norm as similarity measures, and we considered candidate generation both with standard tf-idf weighting of visual words and without it. In a tf-idf approach, commonly occurring visual words are weighted less, as they are less discriminative. We did not apply tf-idf weighting in our profile-based approach. Figure 3 and Figure 4 show the comparative precision of the various methods for the non-weighted and the weighted cases, respectively. Our approach yields 81% precision for top-10 results. Without tf-idf weighting, the conventional method with the cosine measure gives the best precision, 50%, which is 31% lower than the profile-based method. With tf-idf weighting, the conventional method with Jaccard's coefficient gives the best precision, 39%, for top-10 results; tf-idf weighting thus achieves lower precision than unweighted search for the conventional methods. We also experimented with varying k for our profile similarity and achieved more than 20% higher precision for every k compared to conventional methods. We further experimented with weighting the symbols by the area of their covariant regions but obtained lower precision. Overall, our method achieves higher precision than the conventional candidate-generation methods, using a small codebook and without any geometric verification.
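The precision-at-k metric used throughout this comparison is the standard one; a short sketch with hypothetical names:

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved images that are true matches."""
    top = ranked_ids[:k]
    return sum(1 for r in top if r in relevant_ids) / len(top) if top else 0.0
```

For example, if 8 of the top 10 retrieved images are true matches, precision at k = 10 is 0.8.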
We present the top-5 visual results for real queries from our search in Figure ?. Our profile-based approach retrieves high-quality results irrespective of the kind of noise present in the dataset. We outline the matching sub-region in each result image with a red box. We obtained all true matches in the top-5 results for the 3rd query, which yielded the pathological result under conventional methods shown in Figure ?. We also verified the second pathological case shown in Figure ? and found that the true match was returned in the top-10 results, with all higher-scoring images being true matches.
6 Conclusion
In this paper, we developed a simple but effective profile-based feature vector that captures the spatial relationships between the keypoints of an image. Including the spatial layout in the feature vector improves search results dramatically without the need for geometric verification. We proposed a method to measure the similarity between profiles. Our technique produced higher accuracy on landmark images than the state-of-the-art method. We evaluated our technique on a mixed dataset of synthetic and real natural images over 52 queries and obtained 81% precision for top-10 matches using a small codebook of size 500. This was 31% higher than the conventional methods.
References
- http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/index.html
- S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891–923, 1998.
- J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In CVPR, pages 1000–1006, 1997.
- C. Elkan. Using the triangle inequality to accelerate k-means. In ICML, 2003.
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
- C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Fourth Alvey Vision Conference, pages 147–151, 1988.
- Y. Ke, R. Sukthankar, and L. Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In MM '04: Proceedings of the 12th Annual ACM International Conference on Multimedia, 2004.
- S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
- S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
- J. Luo and M. A. Nascimento. Content based sub-image retrieval via hierarchical tree matching. In MMDB '03: Proceedings of the 1st ACM International Workshop on Multimedia Databases, 2003.
- Y. Meng, E. Chang, and B. Li. Enhancing DPF for near-replica image recognition. In CVPR, page 416, 2003.
- K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. European Conf. Computer Vision, pages 128–142. Springer Verlag, 2002.
- K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Trans. PAMI, 27(10), 2005.
- K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. IJCV, 65, 2005.
- J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, pages 1–8, 2007.
- T. Quack, B. Leibe, and L. Van Gool. World-scale mining of objects and events from community photo collections. In CIVR, 2008.
- G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, pages 513–523, 1988.
- N. Sebe, M. S. Lew, and D. P. Huijsmans. Multi-scale sub-image search. In MM '99: Proceedings of the Seventh ACM International Conference on Multimedia, 1999.
- J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
- Z. Guan and E. Cutrell. An eye tracking study of the effect of target rank on web search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2007.
- W. Wang, Y. Luo, and G. Tang. Object retrieval using configurations of salient regions. In Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, 2008.