GIFT: A Real-time and Scalable 3D Shape Search Engine
Abstract
Projective analysis is an important solution for 3D shape retrieval, since human visual perception of 3D shapes relies on various 2D observations from different viewpoints. Although multiple informative and discriminative views are utilized, most projection-based retrieval systems suffer from heavy computational cost, and thus cannot satisfy the basic requirement of scalability for search engines.
In this paper, we present a real-time 3D shape search engine based on the projective images of 3D shapes. The real-time property of our search engine results from the following aspects: (1) efficient projection and view feature extraction using GPU acceleration; (2) the first inverted file, referred to as F-IF, is utilized to speed up the procedure of multi-view matching; (3) the second inverted file (S-IF), which captures a local distribution of 3D shapes in the feature manifold, is adopted for efficient context-based re-ranking. As a result, for each query the retrieval task can be finished within one second despite the necessary cost of IO overhead. We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File (Twice), GIFT. Besides its high efficiency, GIFT also outperforms the state-of-the-art methods significantly in retrieval accuracy on various shape benchmarks and competitions.
1 Introduction
3D shape retrieval is a fundamental issue in computer vision and pattern recognition. With the rapid development of large scale public 3D repositories, e.g., Google 3D Warehouse or TurboSquid, and large scale shape benchmarks, e.g., ModelNet [40], SHape REtrieval Contest (SHREC) [14], the scalability of 3D shape retrieval algorithms becomes increasingly important for practical applications. However, the efficiency issue has been more or less ignored by previous works, though enormous efforts have been devoted to retrieval effectiveness, that is, to designing informative and discriminative features [12] that boost the retrieval accuracy. As suggested in [14], plenty of these algorithms do not scale up to large 3D shape databases due to their high time complexity.
Meanwhile, owing to the fact that human visual perception of 3D shapes depends upon 2D observations, projective analysis [21] has long been a basic and inherent tool in the 3D shape domain, with applications to segmentation [39], matching [25], reconstruction, etc. Specifically in 3D shape retrieval, projection-based methods demonstrate impressive performances. Especially in recent years, the success of planar image representation [7] makes it easier to describe 3D models using depth or silhouette projections.
Generally, a typical 3D shape search engine is comprised of the following four components (see also Figure 1):
Projection rendering.
With a 3D model as input, the output of this component is a collection of projections. Most methods set an array of virtual cameras at pre-defined view points to capture views. These view points can be the vertices of a dodecahedron [4], located on the unit sphere [36], or around the lateral surface of a cylinder [25]. In most cases, pose normalization [23] is needed for the sake of invariance to translation, rotation and scale changes.
View feature extraction.
The role of this component is to obtain multiple view representations, which largely affect the retrieval quality. A widely-used paradigm is the Bag-of-Words (BoW) model [7], since it has shown its superiority as a natural image descriptor. However, in order to get better performance, many features [14] are of extremely high dimension. As a consequence, raw descriptor extraction (e.g., SIFT [20]), quantization and distance calculation are all time-consuming.
Multi-view matching.
This component establishes the correspondence between two sets of view features, and returns a matching cost between two 3D models. Since at least a set-to-set matching strategy [26] is required, this stage suffers from high time complexity even when using the simplest Hausdorff matching. Hence, the usage of algorithms incorporated with some more sophisticated matching strategies on large scale 3D datasets is limited due to their heavy computational cost.
Re-ranking.
It aims at refining the initial ranking list by using some extra information. For retrieval problems, since no prior or supervised information is available, contextual similarity measure is usually utilized. A classic context-based re-ranking methodology for shape retrieval is diffusion process [5], which exhibits outstanding performance on various datasets. However, as graph-based and iterative algorithms, many variants of diffusion process (e.g., locally constrained diffusion process [42]), generally require the computational complexity of , where is the total number of shapes in the database and is the number of iterations. In this sense, diffusion process does not seem to be applicable for real-time analysis.
In this paper, we present a real-time 3D shape search engine using projections that includes all the aforementioned components. It combines Graphics Processing Unit (GPU) acceleration and Inverted File (Twice), hence we name it GIFT. In on-line processing, once a user submits a query shape, GIFT can react and present the retrieved shapes within one second (the off-line preprocessing operations, such as CNN model training and inverted file establishment, are excluded). GIFT is evaluated on several popular 3D benchmark datasets, especially on one track of SHape REtrieval Contest (SHREC) which focuses on scalable 3D retrieval. The experimental results on retrieval accuracy and query time demonstrate the capability of GIFT in handling large scale data.
In summary, our main contributions are as follows. Firstly, GPU is used to speed up the procedure of projection rendering and feature extraction. Secondly, in the multi-view matching procedure, a version of Hausdorff distance that is robust to noisy data is approximated with an inverted file, which allows for extremely efficient matching between two view sets without impairing the retrieval performance too much. Thirdly, in the re-ranking component, a new context-based algorithm based on fuzzy set theory is proposed. Different from diffusion processes of high time complexity, our re-ranking is highly time efficient on account of using an inverted file again.
2 Proposed Search Engine
2.1 Projection Rendering
Prior to projection rendering, pose normalization for each 3D shape is needed in order to attain invariance to some common geometrical transformations. However, unlike many previous algorithms [24] that require rotation normalization using some Principal Component Analysis (PCA) techniques, we only normalize the scale and the translation in our system. Our concerns are two-fold: 1) PCA techniques are not always stable, especially when dealing with some specific geometrical characteristics such as symmetries, large planar or bumpy surfaces; 2) the view feature used in our system can tolerate the rotation issue to a certain extent, though it cannot be completely invariant to such changes. In fact, we observe that if enough projections (more than in our experiments) are used, one can achieve reliable performances.
The projection procedure is as follows. Firstly, we place the centroid of each 3D shape at the origin of a spherical coordinate system, and resize the maximum polar distance of the points on the surface of the shape to unit length. Then virtual cameras are set on the unit sphere evenly, and they are located by the azimuth and the elevation angles. At last, we render one projected view in depth buffer at each combination of and . For the sake of speed, GPU is utilized here such that for each 3D shape, the average time cost of rendering projections is only .
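The normalization and camera placement described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the centroid is approximated by the mean of the vertices, the cameras are placed on a regular azimuth/elevation grid (which is only approximately even on the sphere), and the function names `normalize_shape` and `camera_positions` as well as the grid sizes are hypothetical, not taken from the paper. Depth rendering on the GPU is not shown.

```python
import math

def normalize_shape(vertices):
    """Move the vertex centroid to the origin and scale the farthest vertex to distance 1."""
    n = len(vertices)
    cx = sum(v[0] for v in vertices) / n
    cy = sum(v[1] for v in vertices) / n
    cz = sum(v[2] for v in vertices) / n
    centered = [(x - cx, y - cy, z - cz) for x, y, z in vertices]
    radius = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered)
    if radius == 0.0:
        return centered  # degenerate shape: all vertices coincide
    return [(x / radius, y / radius, z / radius) for x, y, z in centered]

def camera_positions(n_azimuth, n_elevation):
    """Camera centers on the unit sphere, one per (azimuth, elevation) pair."""
    cameras = []
    for i in range(n_azimuth):
        theta = 2.0 * math.pi * i / n_azimuth  # azimuth in [0, 2*pi)
        for j in range(n_elevation):
            # Elevation sampled strictly inside (-pi/2, pi/2) so the poles are not duplicated.
            phi = -math.pi / 2 + math.pi * (j + 0.5) / n_elevation
            cameras.append((math.cos(phi) * math.cos(theta),
                            math.cos(phi) * math.sin(theta),
                            math.sin(phi)))
    return cameras
```

Each returned camera center would look towards the origin; one depth image per camera then forms the view set of the shape.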
2.2 Feature Extraction via GPU Acceleration
Feature design has been a crucial problem in 3D shape retrieval for a long time owing to its great influence on the retrieval accuracy. Though extensively studied, almost all the existing algorithms ignore the efficiency of the feature extraction.
To this end, our search engine adopts GPU to accelerate the procedure of feature extraction. Impressed by the superior performance of deep learning approaches in various visual tasks, we propose to use the activation of a Convolutional Neural Network (CNN). The CNN used here takes depth images as input, and the loss function is exerted on the classification error for projections. The network architecture consists of five successive convolutional layers and three fully connected layers as in [3]. We normalize each activation by its Euclidean norm to avoid scale changes. It only takes on average to extract the view features for a 3D model.
Since no prior information is available to judge the discriminative power of activations of different layers, we propose a robust re-ranking algorithm described in Section 2.4. It can fuse those homogeneous features efficiently based on fuzzy set theory.
2.3 Inverted File for Multi-view Matching
Consider a query shape $x_q$ and a shape $x_p$ from the database $X = \{x_1, x_2, \dots, x_N\}$. Let $\mathcal{V}(\cdot)$ denote a mapping function from 3D shapes to their feature sets. We can obtain two sets $\mathcal{V}(x_q) = \{q_1, q_2, \dots, q_{N_v}\}$ and $\mathcal{V}(x_p) = \{p_1, p_2, \dots, p_{N_v}\}$ respectively, where $N_v$ is the number of views. $q_i$ (or $p_i$) denotes the view feature assigned to the $i$-th view of shape $x_q$ (or $x_p$).
A 3D shape search engine requires a multi-view matching component to establish a correspondence between two sets of view features. These matching strategies are usually metrics defined on sets (e.g., Hausdorff distance) or graph matching algorithms (e.g., Hungarian method, Dynamic Programming, clock-matching). However, these pairwise strategies are time-consuming for a real-time search engine. Among them, Hausdorff distance may be the most efficient one, since it only requires some simple algebraic operations without sophisticated optimizations.
Recall that the standard Hausdorff distance measures the difference between two sets, and it is defined as

$$D(x_q, x_p) = \max_{q_i \in \mathcal{V}(x_q)} \; \min_{p_j \in \mathcal{V}(x_p)} d(q_i, p_j) \tag{1}$$

where function $d(\cdot,\cdot)$ measures the distance between two input vectors. In order to eliminate the disturbance of isolated views in the query view set, a more robust version of Hausdorff distance is given by

$$D(x_q, x_p) = \frac{1}{N_v} \sum_{q_i \in \mathcal{V}(x_q)} \; \min_{p_j \in \mathcal{V}(x_p)} d(q_i, p_j) \tag{2}$$

For the convenience of analysis, we consider its dual form in the similarity space as

$$S(x_q, x_p) = \frac{1}{N_v} \sum_{q_i \in \mathcal{V}(x_q)} \; \max_{p_j \in \mathcal{V}(x_p)} s(q_i, p_j) \tag{3}$$

where $s(\cdot,\cdot)$ measures the similarity between the two input vectors. In this paper, we adopt the cosine similarity.
As can be seen from Eq. (1) and Eq. (2), Hausdorff matching requires the time complexity $O(N \times N_v^2)$ for retrieving a given query (assuming that there are $N$ shapes in the database). Though the complexity grows linearly with respect to the database size, it is still intolerable when $N$ gets larger. However, by analyzing Eq. (3), we can make several observations: (1) let $p^* = \arg\max_{p_j} s(q_i, p_j)$, the similarity calculations of $s(q_i, p_j)$ are unnecessary when $p_j \neq p^*$, since these similarity values are unused due to the $\max$ operation, i.e., only $s(q_i, p^*)$ is kept; (2) when considering from the query side, we can find that $s(q_i, p^*)$ counts little to the final matching cost if $s(q_i, p^*) < \xi$ and $\xi$ is a small threshold. Those observations suggest that although the matching function in Eq. (3) requires the calculation of all the pairwise similarities between two view sets, some similarity calculations, which generate small values, can be eliminated without impairing the retrieval performance too much.
In order to avoid these unnecessary operations and improve the efficiency of the multi-view matching procedure, we adopt an inverted file to approximate Eq. (3) by adding the Kronecker delta response as

$$S(x_q, x_p) = \frac{1}{N_v} \sum_{q_i \in \mathcal{V}(x_q)} \; \max_{p_j \in \mathcal{V}(x_p)} s(q_i, p_j) \, \delta_{c(q_i), c(p_j)} \tag{4}$$

where $\delta_{x,y} = 1$ if $x = y$, and $\delta_{x,y} = 0$ if $x \neq y$. The quantizer $c(\cdot)$ maps the input feature into an integer index that corresponds to the nearest codeword of the given vocabulary $C = \{c_1, c_2, \dots, c_K\}$. As a result, the contribution of $p_j$, which satisfies $c(q_i) \neq c(p_j)$, to the similarity measure can be directly set to zero, without estimating $s(q_i, p_j)$ explicitly.
In conclusion, our inverted file for multi-view matching is built as illustrated in Figure 2. For each view feature, we store it and its corresponding shape ID in the nearest codeword. It should be mentioned that we can also use Multiple Assignment (MA), i.e., assign each view to multiple codewords, to improve the matching precision at the sacrifice of memory cost and on-line query time.
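A minimal sketch of how such an inverted file could be built and queried is given below. It is an illustration under assumptions rather than the authors' implementation: features are plain Python sequences, the codebook is given in advance (the paper builds it with k-means), single assignment is used, "nearest codeword" is taken to mean highest cosine similarity, and the names `cosine`, `quantize`, `build_inverted_file` and `query` are hypothetical.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def quantize(feature, codebook):
    """Index of the codeword most similar to the feature."""
    return max(range(len(codebook)), key=lambda k: cosine(feature, codebook[k]))

def build_inverted_file(database, codebook):
    """database: {shape_id: [view_feature, ...]}.
    Returns {codeword_index: [(shape_id, view_feature), ...]}."""
    inverted = defaultdict(list)
    for shape_id, views in database.items():
        for feat in views:
            inverted[quantize(feat, codebook)].append((shape_id, feat))
    return inverted

def query(query_views, inverted, codebook):
    """Approximate matching: each query view is compared only with the
    database views stored under the same codeword."""
    scores = defaultdict(float)
    for q in query_views:
        best = {}  # best similarity per shape for this query view
        for shape_id, feat in inverted.get(quantize(q, codebook), []):
            s = cosine(q, feat)
            if s > best.get(shape_id, 0.0):
                best[shape_id] = s
        for shape_id, s in best.items():
            scores[shape_id] += s / len(query_views)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Shapes that never share a codeword with any query view do not appear in the result at all, which is where the saving over exhaustive pairwise comparison comes from.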
2.4 Inverted File for Re-ranking
A typical search engine usually involves a re-ranking component [22], aiming at refining the initial candidate list by using some contextual information. In GIFT, we propose a new contextual similarity measure called Aggregated Contextual Activation (ACA), which follows the same principles as the diffusion process [5], i.e., the similarity between two shapes should go beyond their pairwise formulation and is influenced by their contextual distributions along the underlying data manifold. However, unlike the diffusion process, which has high time complexity, ACA enables real-time re-ranking, which can be applied to large scale data.
Let $N_k(x_q)$ denote the neighbor set of $x_q$, which contains its top-$k$ neighbors. Similar to [43], our basic idea is that the similarity between two shapes can be more reliably measured by comparing their neighbors using Jaccard similarity as

$$S'(x_q, x_p) = \frac{|N_k(x_q) \cap N_k(x_p)|}{|N_k(x_q) \cup N_k(x_p)|} \tag{5}$$

One can find that the neighbors are treated equally in Eq. (5). However, the top-ranked neighbors are more likely to be true positives. So a more proper behavior is increasing the weights of top-ranked neighbors.
To achieve this, we propose to define the neighbor set using fuzzy set theory. Different from classical (crisp) set theory where each element either belongs or does not belong to the set, fuzzy set theory allows a gradual assessment of the membership of elements in a set. We utilize $S(x_q, x_i)$ to measure the membership grade of $x_i$ in the neighbor set of $x_q$, with the grade taken as zero for shapes outside the neighbor set. Accordingly, Eq. (5) is re-written as

$$S'(x_q, x_p) = \frac{\sum_{x_i \in N_k(x_q) \cap N_k(x_p)} \min\big(S(x_q, x_i), S(x_p, x_i)\big)}{\sum_{x_i \in N_k(x_q) \cup N_k(x_p)} \max\big(S(x_q, x_i), S(x_p, x_i)\big)} \tag{6}$$

Since considering equal-sized vector comparison is more convenient in real computational applications, we use $F_q \in \mathbb{R}^N$ to encode the membership values. The $i$-th element in $F_q$ is given as

$$F_q[i] = \begin{cases} S(x_q, x_i) & \text{if } x_i \in N_k(x_q) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

Based on this definition we replace Eq. (6) with

$$S'(x_q, x_p) = \frac{\sum_{i=1}^{N} \min\big(F_q[i], F_p[i]\big)}{\sum_{i=1}^{N} \max\big(F_q[i], F_p[i]\big)} \tag{8}$$

Considering that vector $F_q$ is sparse, we can view it as a sparse activation of shape $x_q$, where the activation at coordinate $i$ is the membership grade of $x_i$ in the neighbor set $N_k(x_q)$. Eq. (8) utilizes the sparse activations $F_q$ and $F_p$ to define the new contextual shape similarity measure.
Note that all the above analysis is carried out for only one similarity measure. However, in our specific scenario, the outputs of different layers of CNN are usually at different abstraction resolutions.
For example, two different layers of CNN lead to two different similarities $S^{(1)}$ and $S^{(2)}$ by the multi-view matching of Section 2.3, which in turn yield two different sparse activations $F_q^{(1)}$ and $F_q^{(2)}$ by Eq. (7). Since no prior information is available to assess their discriminative power, our goal now is to fuse them in an unsupervised way. For this we utilize the aggregation operation in fuzzy set theory, by which several fuzzy sets are combined in a desirable way to produce a single fuzzy set. We consider two fuzzy sets represented by the sparse activations $F_q^{(1)}$ and $F_q^{(2)}$ (the extension to more than two activations is similar). Their aggregation is then defined as

$$F_q = \left( \frac{\big(F_q^{(1)}\big)^{\alpha} + \big(F_q^{(2)}\big)^{\alpha}}{2} \right)^{1/\alpha} \tag{9}$$

which computes the element-wise generalized means with exponent $\alpha$ of $F_q^{(1)}$ and $F_q^{(2)}$. Instead of using the arithmetic mean, we use this generalized mean ($\alpha$ is set to a fixed value throughout our experiments). Our concern for this is to avoid the problem that some artificially large elements in $F_q$ dominate the similarity measure. This motivation is very similar to handling bursty visual elements in the Bag-of-Words (BoW) model (see [10] for examples).
In summary, we call the feature in Eq. (9) Aggregated Contextual Activation (ACA). Next, we will introduce some improvements of Eq. (9) concerning its retrieval accuracy and computational efficiency.
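The three ingredients of this subsection (sparse activation from the top-ranked neighbors, Jaccard similarity over fuzzy memberships, and aggregation by a generalized mean) can be sketched as below. This is a hedged illustration, not the authors' code: sparse vectors are stored as dictionaries mapping a shape index to its membership grade, and the function names are hypothetical.

```python
def sparse_activation(similarities, k):
    """Keep the k largest similarities of one shape as {index: membership grade}."""
    top = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return {i: s for i, s in top if s > 0.0}

def fuzzy_jaccard(fq, fp):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(fq) | set(fp)
    num = sum(min(fq.get(i, 0.0), fp.get(i, 0.0)) for i in keys)
    den = sum(max(fq.get(i, 0.0), fp.get(i, 0.0)) for i in keys)
    return num / den if den > 0.0 else 0.0

def aggregate(f1, f2, alpha):
    """Element-wise generalized mean of two sparse activations (requires alpha > 0)."""
    keys = set(f1) | set(f2)
    return {i: ((f1.get(i, 0.0) ** alpha + f2.get(i, 0.0) ** alpha) / 2.0) ** (1.0 / alpha)
            for i in keys}
```

As a small worked case, aggregating a grade of 1.0 with a grade of 0.0 gives 0.5 under the arithmetic mean (`alpha=1.0`) but 0.25 under `alpha=0.5`, which shows how an exponent below one damps an element that is large in only one of the two activations. The exponent 0.5 here is purely illustrative.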
Improving Accuracy
Similar to the diffusion process, the proposed ACA requires an accurate estimation of the context in the data manifold. Here we provide two alternative ways to improve the retrieval performance of ACA without sacrificing its efficiency.
Neighbor Augmentation. The first one is to augment $F_q$ using the neighbors of second order, i.e., the neighbors of the neighbors of $x_q$. Inspired by query expansion [25], the second order neighbors are added as

$$F_q \leftarrow \frac{1}{|N_k(x_q)|} \sum_{x_i \in N_k(x_q)} F_i \tag{10}$$

Neighbor Co-augmentation. Our second improvement is to use a so-called “neighbor co-augmentation”. Specifically, the neighbors generated by one similarity measure are used to augment contextual activations of the other similarity measure, formally defined as

$$F_q^{(1)} \leftarrow \frac{1}{|N_k^{(2)}(x_q)|} \sum_{x_i \in N_k^{(2)}(x_q)} F_i^{(1)}, \qquad F_q^{(2)} \leftarrow \frac{1}{|N_k^{(1)}(x_q)|} \sum_{x_i \in N_k^{(1)}(x_q)} F_i^{(2)} \tag{11}$$

This formula is inspired by “co-training” [45]. Essentially, one similarity measure tells the other one that “I think these neighbors to be true positives, and lend them to you such that you can improve your own discriminative power”.
Note that the size of the neighbor set used here may be different from that used in the definition of the sparse activation (Eq. (7)). In order to distinguish them, we denote the size of the neighbor set in Eq. (7) as $k_1$, while that used in Eq. (10) and Eq. (11) as $k_2$.
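The two augmentation schemes can be illustrated with a short sketch. It assumes that augmentation averages the sparse activations of the neighbors (in the spirit of query expansion) and that a neighbor list may or may not contain the shape itself; both are assumptions of this example, and the names `augment` and `co_augment` are hypothetical.

```python
def augment(activations, neighbors, query):
    """Average the sparse activations of the query's neighbors.

    activations: {shape_id: {index: grade}}, neighbors: {shape_id: [shape_id, ...]}.
    """
    hood = neighbors[query]
    out = {}
    for j in hood:
        for i, grade in activations[j].items():
            out[i] = out.get(i, 0.0) + grade / len(hood)
    return out

def co_augment(act1, act2, nbrs1, nbrs2, query):
    """Cross-wise variant: neighbors found by one measure augment the other measure."""
    return augment(act1, nbrs2, query), augment(act2, nbrs1, query)
```

The neighbor lists passed in here would be the smaller second-order neighborhoods, kept separate from the neighborhood size used to build the activations themselves.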
Improving Efficiency
Considering that the length of $F_q$ is $N$, one may doubt the efficiency of the similarity computation in Eq. (8), especially when the database size $N$ is large. In fact, $F_q$ is a sparse vector, since $F_q$ only encodes the neighborhood structure of $x_q$, and the number of non-zero values is only determined by the size of $N_k(x_q)$. This observation motivates us to utilize an inverted file again to leverage the sparsity of $F_q$. Now we derive the feasibility of applying an inverted file to the Jaccard similarity theoretically.
The numerator in Eq. (8) is computed as

$$\sum_{i} \min\big(F_q[i], F_p[i]\big) = \sum_{i \mid F_q[i] \neq 0,\, F_p[i] \neq 0} \min\big(F_q[i], F_p[i]\big) + \sum_{i \mid F_q[i] = 0} \min\big(F_q[i], F_p[i]\big) + \sum_{i \mid F_q[i] \neq 0,\, F_p[i] = 0} \min\big(F_q[i], F_p[i]\big) \tag{12}$$

Since all values of the aggregated contextual activation are non-negative, the last two items in Eq. (12) are equal to zero. Consequently, Eq. (12) can be simplified as

$$\sum_{i} \min\big(F_q[i], F_p[i]\big) = \sum_{i \mid F_q[i] \neq 0,\, F_p[i] \neq 0} \min\big(F_q[i], F_p[i]\big) \tag{13}$$

which only requires accessing non-zero entries of the query, and hence can be computed efficiently on-the-fly.
Although the calculation of the denominator in Eq. (8) seems sophisticated, it can be expressed as

$$\sum_{i} \max\big(F_q[i], F_p[i]\big) = \|F_q\|_1 + \|F_p\|_1 - \sum_{i} \min\big(F_q[i], F_p[i]\big) \tag{14}$$

Besides the query-dependent operations (the first and the last items), Eq. (14) only involves an operation of $L_1$ norm calculation of $F_p$, which is simply equal to the cardinality of the fuzzy set $N_k(x_p)$ and can be pre-computed off-line.
Our inverted file for re-ranking is built as illustrated in Figure 3. It has exactly $N$ entries, and each entry corresponds to one shape in the database. For each entry, we first store the cardinality of its fuzzy neighbor set. Then, we find those shapes which have non-zero membership values in this entry. Those shape IDs and the membership values are stored in this entry.
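The derivation above can be turned into a small sketch of the second inverted file. It is a simplified illustration under assumptions: sparse activations are dictionaries, the pre-computed cardinalities are kept in a separate dictionary rather than inside each entry, and the names `build_second_inverted_file` and `rerank` are hypothetical.

```python
from collections import defaultdict

def build_second_inverted_file(activations):
    """activations: {shape_id: {coordinate: membership grade}}.

    Returns (entries, cardinality), where entries[coordinate] lists the
    (shape_id, grade) pairs with a non-zero grade at that coordinate and
    cardinality[shape_id] is the L1 norm of the shape's activation.
    """
    entries = defaultdict(list)
    cardinality = {}
    for shape_id, f in activations.items():
        cardinality[shape_id] = sum(f.values())  # pre-computed off-line
        for coord, grade in f.items():
            if grade > 0.0:
                entries[coord].append((shape_id, grade))
    return entries, cardinality

def rerank(fq, entries, cardinality):
    """Jaccard similarity of the query activation to every shape sharing a non-zero coordinate."""
    numer = defaultdict(float)
    for coord, q_grade in fq.items():  # only non-zero query entries are visited
        for shape_id, p_grade in entries.get(coord, []):
            numer[shape_id] += min(q_grade, p_grade)
    q_norm = sum(fq.values())
    scores = {}
    for shape_id, inter in numer.items():
        # Sum of maxima = |F_q|_1 + |F_p|_1 - sum of minima.
        denom = q_norm + cardinality[shape_id] - inter
        scores[shape_id] = inter / denom if denom > 0.0 else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The cost of a query depends on the number of non-zero query coordinates and the length of the visited entries, not on the full database size.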
3 Experiments
In this section, we evaluate the performance of GIFT on different kinds of 3D shape retrieval tasks. The evaluation metrics used in this paper include mean average precision (MAP), area under curve (AUC), Nearest Neighbor (NN), First Tier (FT) and Second Tier (ST). Refer to [40] for their detailed definitions.
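For readers unfamiliar with the tier-based metrics, the sketch below computes NN, FT and ST for a single query following the conventions commonly used with the Princeton Shape Benchmark, as we understand them; the exact definitions used in the paper are those of [40]. The function name `tier_metrics` and its arguments are hypothetical, and the ranked list is assumed to exclude the query itself.

```python
def tier_metrics(ranked_labels, query_label, class_size):
    """Return (NN, FT, ST) for one query.

    ranked_labels: class labels of the retrieved shapes, best match first.
    class_size: number of shapes in the query's class, including the query.
    """
    relevant = class_size - 1  # other members of the query's class
    if relevant <= 0:
        return 0.0, 0.0, 0.0
    hits = [label == query_label for label in ranked_labels]
    nn = 1.0 if hits and hits[0] else 0.0           # top-1 match is correct
    ft = sum(hits[:relevant]) / relevant            # recall within the first tier
    st = sum(hits[:2 * relevant]) / relevant        # recall within the second tier
    return nn, ft, st
```

Dataset-level scores are then obtained by averaging the per-query values.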
If not specified, we adopt the following setup throughout our experiments. The projection rendered for each shape is . For multi-view matching procedure, the approximate Hausdorff matching defined in Eq. with an inverted file of entries is used. Multiple Assignment is set to . We use two pairwise similarity measures, which are calculated using features from convolutional layer and fully-connected layer respectively. In re-ranking component, each similarity measure generates one sparse activation to capture the contextual information for the 3D shape , and neighbor co-augmentation in Eq. is used to produce and . Finally, both and are integrated by with exponent .
3.1 ModelNet
ModelNet is a large-scale 3D CAD model dataset introduced by Wu et al. [40] recently, which contains 3D CAD models divided into object categories. Two subsets are used for evaluation, i.e., ModelNet40 and ModelNet10. The former one contains models, and the latter one contains models. We evaluate the performance of GIFT on both subsets and adopt the same training and test split as in [40], namely randomly selecting unique models per category from the subset, in which models are used for training the CNN model and the rest for testing the retrieval performance.
For comparison, we collected all the publicly available retrieval results.
Methods | ModelNet40 AUC | ModelNet40 MAP | ModelNet10 AUC | ModelNet10 MAP |
---|---|---|---|---|
SPH | 34.47% | 33.26% | 45.97% | 44.05% |
LFD | 42.04% | 40.91% | 51.70% | 49.82% |
PANORAMA | 45.00% | 46.13% | 60.72% | 60.32% |
ShapeNets | 49.94% | 49.23% | 69.28% | 68.26% |
DeepPano | 77.63% | 76.81% | 85.45% | 84.18% |
MVCNN | - | 78.90% | - | - |
 | 63.70% | 63.07% | 78.19% | 77.25% |
 | 77.28% | 76.63% | 89.03% | 88.05% |
GIFT | 83.10% | 81.94% | 92.35% | 91.12% |
Fig. ? compares the precision-recall curves. It demonstrates again the discriminative power of the proposed search engine in 3D shape retrieval. Note that ModelNet also defines the 3D shape classification tasks. Considering GIFT is initially developed for real-time retrieval, its classification results are given in the supplementary material.
3.2 Large Scale Competition
As the most authoritative 3D retrieval competition held each year, SHape REtrieval Contest (SHREC) has gradually paid more attention to the development of scalable algorithms. Especially in recent years, several large scale tracks [32], such as SHREC14LSGTB [14], are organized to test the scalability of algorithms. However, most algorithms that the participants submit are of high time complexity, and cannot be applied when the dataset becomes larger (millions or more). Here we choose SHREC14LSGTB dataset for a comprehensive evaluation. This dataset contains 3D models classified into classes, and each 3D shape is taken in turn as the query. As for the feature extractor, we collected unrelated models from ModelNet [40] divided into categories to train a CNN model.
To keep the comparison fair, we choose two types of results from the survey paper [14] to present in Table 2. The first type consists of the top- best-performing methods on retrieval accuracy, including PANORAMA [25], DBSVC, MR-BF-DSIFT, MR-D1SIFT and LCDR-DBSVC. The second type is the most efficient one, i.e., ZFDR [13].
As can be seen from the table, excluding GIFT, the best performance is achieved by LCDR-DBSVC. However, it requires to return the retrieval results per query, which means that days are needed to finish the query task on the whole dataset. The reason behind such a high complexity lies in two aspects: 1) its visual feature is dimensional, which is time consuming to compute, store and compare; 2) it adopts locally constrained diffusion process (LCDP) [42] for re-ranking, while it is known that LCDP is an iterative graph-based algorithm of high time complexity. As for ZFDR, its average query time is shortened to by computing parallel on cores. Unfortunately, ZFDR achieves much less accurate retrieval performance, and its FT is smaller than LCDR-DBSVC. In summary, a conclusion can be drawn that no method can achieve a good enough performance at a low time complexity.
By contrast, GIFT outperforms all these methods, including a very recent algorithm called Two Layer Coding (TLC) [1] which reports in FT. More importantly, GIFT can provide the retrieval results within , which is orders of magnitude faster than LCDR-DBSVC. Meanwhile, the two baseline methods and incur heavy query cost due to the usage of exact Hausdorff matching, which testifies to the advantage of the proposed F-IF.
Methods | NN | FT | ST | Query time |
---|---|---|---|---|
ZFDR | 0.879 | 0.398 | 0.535 | 1.77 |
PANORAMA | 0.859 | 0.436 | 0.560 | 370.2 |
DBSVC | 0.868 | 0.438 | 0.563 | 62.66 |
MR-BF-DSIFT | 0.845 | 0.455 | 0.567 | 65.17 |
MR-D1SIFT | 0.856 | 0.465 | 0.578 | 131.04 |
LCDR-DBSVC | 0.864 | 0.528 | 0.661 | 668.6 |
 | 0.879 | 0.460 | 0.592 | |
 | 0.884 | 0.507 | 0.642 | |
GIFT | 0.889 | 0.567 | 0.689 | |
3.3 Generic 3D Retrieval
Following [35], we select three popular datasets for a generic evaluation, including Princeton Shape Benchmark (PSB) [30], Watertight Models track of SHape REtrieval Contest 2007 (WM-SHREC07) [8] and McGill dataset [31]. Among them, PSB dataset is probably the first widely-used generic shape benchmark, and it consists of polygonal models divided into categories. WM-SHREC07 contains watertight models evenly distributed in classes, and is a representative competition held by SHREC community. McGill dataset focuses on non-rigid analysis, and contains articulated objects classified into classes. We train CNN on an independent TSB dataset [37], and then use the trained CNN to extract view features for the shapes in all the three testing datasets.
In Table 3, a comprehensive comparison between GIFT and various state-of-the-art methods is presented, including LFD [4], the curve-based method of Tabia et al. [34], DESIRE descriptor [38], total Bregman Divergences (tBD) [19], Covariance descriptor [35], the Hybrid of 2D and 3D descriptor [24], Two Layer Coding (TLC) [1] and PANORAMA [25]. As can be seen, GIFT exhibits encouraging discriminative ability in retrieval accuracy and achieves state-of-the-art performances consistently on all the three evaluation metrics.
Methods | PSB NN | PSB FT | PSB ST | WM-SHREC07 NN | WM-SHREC07 FT | WM-SHREC07 ST | McGill NN | McGill FT | McGill ST |
---|---|---|---|---|---|---|---|---|---|
LFD | 0.657 | 0.380 | 0.487 | 0.923 | 0.526 | 0.662 | - | - | - |
Tabia et al. | - | - | - | 0.853 | 0.527 | 0.639 | - | - | - |
DESIRE | 0.665 | 0.403 | 0.512 | 0.917 | 0.535 | 0.673 | - | - | - |
Makadia et al. | 0.673 | 0.412 | 0.502 | - | - | - | - | - | - |
tBD | 0.723 | - | - | - | - | - | - | - | - |
Covariance | - | - | - | 0.930 | 0.623 | 0.737 | 0.977 | 0.732 | 0.818 |
2D/3D Hybrid | 0.742 | 0.473 | 0.606 | 0.955 | 0.642 | 0.773 | 0.925 | 0.557 | 0.698 |
PANORAMA | 0.753 | 0.479 | 0.603 | 0.957 | 0.673 | 0.784 | 0.929 | 0.589 | 0.732 |
PANORAMA + LRF | 0.752 | 0.531 | 0.659 | 0.957 | 0.743 | 0.839 | 0.910 | 0.693 | 0.812 |
TLC | 0.763 | 0.562 | 0.705 | 0.988 | 0.831 | 0.935 | 0.980 | 0.807 | 0.933 |
 | 0.849 | 0.588 | 0.721 | 0.980 | 0.777 | 0.877 | 0.984 | 0.747 | 0.881 |
 | 0.837 | 0.653 | 0.784 | 0.980 | 0.805 | 0.898 | 0.980 | 0.763 | 0.897 |
GIFT | 0.849 | 0.712 | 0.830 | 0.990 | 0.949 | 0.990 | 0.984 | 0.905 | 0.973 |
3.4 Execution Time
In addition to state-of-the-art performances on several datasets and competitions, the most important property of GIFT is the “real-time” performance with the potential of handling large scale shape corpora. In Table 4, we give a deeper analysis of the time cost. The off-line operations mainly include projection rendering and feature extraction for database shapes, training CNN, and building two inverted files. As the table shows, the time cost of off-line operations varies significantly for different datasets. Among them, the most time-consuming operation is training CNN, followed by building the first inverted file with k-means. However, the average query time for different datasets can be controlled within one second, even for the biggest SHREC14LSGTB dataset.
Datasets | Off-line | On-line Indexing |
---|---|---|
ModelNet40 | ||
ModelNet10 | ||
SHREC14LSGTB | ||
PSB | ||
WMSHREC07 | ||
McGill | ||
3.5 Parameter Discussion
Due to the space limitation, the discussion is conducted only on PSB dataset.
Improvements Over Baseline. In Table 5, a thorough discussion is given about the influence of various components of GIFT. We can observe a consistent performance boost by those improvements. The performance jumps a lot especially when re-ranking component is embedded. One should note a slight performance decrease when approximate Hausdorff matching with F-IF is used as compared with its exact version. However, as discussed below, the embedding with inverted file does not necessarily result in a poorer performance, but shortens the query time significantly.
Feature |
Hausdorff | First Tier | ||
NA | ||||
0.588 | ||||
0.653 | ||||
1 | 0.688 | |||
0.5 | 0.692 | |||
0.5 | 0.710 | |||
0.5 | 0.717 | |||
0.5 | 0.712 | |||
Discussion on F-IF. In Fig. ?, we plot the retrieval performance and the average query time using feature , as the number of entries used in the first inverted file changes. As Fig. ? shows, the retrieval performance generally decreases with more entries, and multiple assignment can boost the retrieval performance significantly. However, it should be noted that a better approximation to Eq. using fewer entries (decreasing ) or larger multiple assignments (increasing MA) does not necessarily imply a better retrieval performance. For example, when and MA, the performance of approximate Hausdorff matching using inverted file surpasses the baseline using exact Hausdorff matching. The reason for this “abnormal” observation is that the principle of the inverted file here is to reject those view matching operations that lead to smaller similarities, and sometimes they are noisy and false matching pairs which can be harmful to retrieval performance.
As can be seen from Fig. ?, the average query time is higher at smaller and larger MA, since the two cases both increase the number of candidate matchings in each entry. The baseline query time using exact Hausdorff matching is , which is at least one order of magnitude larger than the approximate one.
Discussion on S-IF. Two parameters, and , are involved in the second inverted file, which are determined empirically. We plot the influence of them in Figure 4. As can be drawn from the figure, when increases, the retrieval performance increases at first. Since noise contextual information can be included at a larger , we can observe the performance decreases after . Meanwhile, neighbor augmentation can boost the performance further. For example, the best performance is achieved when . However, when , the performance tends to decrease. One may find that the optimal value of is much smaller than that of . The reason for this is that defines the size of the second order neighbor, which is more likely to return noise context compared with the first order neighbor defined by .
4 Conclusions
In the past years, 3D shape retrieval was evaluated with only small numbers of shapes. In this sense, the problem of 3D shape retrieval has stagnated for a long time. Only recently has the shape community gradually started to pay more attention to the scalable retrieval issue. However, as suggested in [14], most classical methods encounter severe obstacles when dealing with larger databases.
In this paper, we focus on the scalability of 3D shape retrieval algorithms, and build a well-designed 3D shape search engine called GIFT. In our retrieval system, GPU is utilized to accelerate the speed of projection rendering and view feature extraction, and two inverted files are embedded to enable real-time multi-view matching and re-ranking. As a result, the average query time is controlled within one second, which clearly demonstrates the potential of GIFT for large scale 3D shape retrieval. What is more impressive is that while preserving the high time efficiency, GIFT outperforms state-of-the-art methods in retrieval accuracy by a large margin. Therefore, we view the proposed search engine as a promising step towards larger 3D shape corpora.
We submitted a version of GIFT to the latest SHREC2016 large scale track (the results are available at https://shapenet.cs.stanford.edu/shrec16/), and won the first place on the perturbed dataset.
References
[1] 3d shape matching via two layer coding. X. Bai, S. Bai, Z. Zhu, and L. J. Latecki. TPAMI, 37(12):2361–2373, 2015.
[2] Scale-invariant heat kernel signatures for non-rigid shape recognition. M. M. Bronstein and I. Kokkinos. In CVPR, pages 1704–1711, 2010.
[3] Return of the devil in the details: Delving deep into convolutional nets. K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. CoRR, abs/1405.3531, 2014.
[4] On visual similarity based 3d model retrieval. D. Y. Chen, X. P. Tian, Y. T. Shen, and M. Ouhyoung. Comput. Graph. Forum, 22(3):223–232, 2003.
[5] Diffusion processes for retrieval revisited. M. Donoser and H. Bischof. In CVPR, pages 1320–1327, 2013.
[6] 3d deep shape descriptor. Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong. In CVPR, pages 2319–2328, 2015.
[7] A bayesian hierarchical model for learning natural scene categories. L. Fei-Fei and P. Perona. In CVPR, pages 524–531, 2005.
[8] Shape retrieval contest 2007: Watertight models track. D. Giorgi, S. Biasotti, and L. Paraboschi. SHREC competition, 8, 2007.
[9] Vocmatch: Efficient multiview correspondence for structure from motion. M. Havlena and K. Schindler. In ECCV, pages 46–60, 2014.
[10] On the burstiness of visual elements. H. Jégou, M. Douze, and C. Schmid. In CVPR, pages 1169–1176, 2009.
[11] Rotation invariant spherical harmonic representation of 3d shape descriptors. M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. In SGP, pages 156–164, 2003.
[12] Intrinsic shape context descriptors for deformable shapes. I. Kokkinos, M. M. Bronstein, R. Litman, and A. M. Bronstein. In CVPR, pages 159–166, 2012.
[13] 3d model retrieval using hybrid features and class information. B. Li and H. Johan. Multimedia tools and applications, 62(3):821–846, 2013.
[14] A comparison of 3d shape retrieval methods based on a large-scale benchmark supporting multimodal queries. B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, et al. CVIU, 131:1–27, 2015.
[15] Persistence-based structural recognition. C. Li, M. Ovsjanikov, and F. Chazal. In CVPR, pages 2003–2010, 2014.
[16] Cm-bof: visual similarity-based 3d shape retrieval using clock matching and bag-of-features. Z. Lian, A. Godil, X. Sun, and J. Xiao. Mach. Vis. Appl., 24(8):1685–1704, 2013.
[17] Shape classification using the inner-distance. H. Ling and D. W. Jacobs. TPAMI, 29(2):286–299, 2007.
[18] Supervised learning of bag-of-features shape descriptors using sparse coding. R. Litman, A. Bronstein, M. Bronstein, and U. Castellani. Computer Graphics Forum, 33(5):127–136, 2014.
[19] Shape retrieval using hierarchical total bregman soft clustering. M. Liu, B. C. Vemuri, S. ichi Amari, and F. Nielsen. TPAMI, 34(12):2407–2419, 2012.
[20] Distinctive image features from scale-invariant keypoints. D. G. Lowe. IJCV, 60(2):91–110, 2004.
[21] Spherical correlation of visual representations for 3d model retrieval. A. Makadia and K. Daniilidis. IJCV, 89(2-3):193–210, 2010.
[22] Multimedia search reranking: A literature survey. T. Mei, Y. Rui, S. Li, and Q. Tian. ACM Comput. Surv., 46(3):38:1–38:38, 2014.
[23] Efficient 3d shape matching and retrieval using a concrete radialized spherical projection representation.
P. Papadakis, I. Pratikakis, S. J. Perantonis, and T. Theoharis. Pattern Recognition, 40(9):2437–2452, 2007. - 3d object retrieval using an efficient and compact hybrid shape descriptor.
P. Papadakis, I. Pratikakis, T. Theoharis, G. Passalis, and S. J. Perantonis. In 3DOR, pages 9–16, 2008. - Panorama: A 3d shape descriptor based on panoramic views for unsupervised 3d object retrieval.
P. Papadakis, I. Pratikakis, T. Theoharis, and S. J. Perantonis. IJCV, 89(2-3):177–192, 2010. - Efficient shape matching using vector extrapolation.
E. Rodolà, T. Harada, Y. Kuniyoshi, and D. Cremers. In BMVC, volume 1, 2013. - Dense non-rigid shape correspondence using random forests.
E. Rodola, S. Rota Bulò, T. Windheuser, M. Vestner, and D. Cremers. In CVPR, pages 4177–4184, 2014. - Elastic net constraints for shape matching.
E. Rodola, A. Torsello, T. Harada, Y. Kuniyoshi, and D. Cremers. In ICCV, pages 1169–1176, 2013. - Deeppano: Deep panoramic representation for 3-d shape recognition.
B. Shi, S. Bai, Z. Zhou, and X. Bai. IEEE Signal Processing Letters, 22(12):2339–2343, 2015. - The princeton shape benchmark.
P. Shilane, P. Min, M. M. Kazhdan, and T. A. Funkhouser. In SMI, 2004. - Retrieving articulated 3-d models using medial surfaces.
K. Siddiqi, J. Zhang, D. Macrini, A. Shokoufandeh, S. Bouix, and S. J. Dickinson. Mach. Vis. Appl., 19(4):261–275, 2008. - Scalability of non-rigid 3d shape retrieval.
I. Sipiran, B. Bustos, T. Schreck, A. Bronstein, S. Choi, L. Lai, H. Li, R. Litman, and L. Sun. In 3DOR, pages 121–128, 2015. - Multi-view convolutional neural networks for 3d shape recognition.
H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. In ICCV, 2015. - A new 3d-matching method of nonrigid and partially similar models using curve analysis.
H. Tabia, M. Daoudi, J.-P. Vandeborre, and O. Colot. TPAMI, 33(4):852–858, 2011. - Covariance descriptors for 3d shape matching and retrieval.
H. Tabia, H. Laga, D. Picard, and P.-H. Gosselin. In CVPR, pages 4185–4192, 2014. - Compact vectors of locally aggregated tensors for 3d shape retrieval.
H. Tabia, D. Picard, H. Laga, and P. H. Gosselin. In 3DOR, pages 17–24, 2013. - A large-scale shape benchmark for 3d object retrieval: Toyohashi shape benchmark.
A. Tatsuma, H. Koyanagi, and M. Aono. In APSIPA, pages 1–10, 2012. - Desire: a composite 3d-shape descriptor.
D. V. Vranic. In ICME, pages 962–965, 2005. - Projective analysis for 3d shape segmentation.
Y. Wang, M. Gong, T. Wang, D. Cohen-Or, H. Zhang, and B. Chen. ACM Trans. Graph., 32(6):192:1–192:12, 2013. - 3d shapenets: A deep representation for volumetric shape modeling.
Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. In CVPR, 2015. - Deepshape: Deep learned shape descriptor for 3d shape matching and retrieval.
J. Xie, Y. Fang, F. Zhu, and E. Wong. In CVPR, pages 1275–1283, 2015. - Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval.
X. Yang, S. Koknar-Tezel, and L. J. Latecki. In CVPR, pages 357–364, 2009. - Query specific rank fusion for image retrieval.
S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. TPAMI, 37(4):803–815, 2015. - Query-adaptive late fusion for image search and person re-identification.
L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, and Q. Tian. In CVPR, pages 1741–1750, 2015. - Semi-supervised regression with co-training.
Z.-H. Zhou and M. Li. In IJCAI, 2005.