# Deep Cross-modality Adaptation via Semantics Preserving Adversarial Learning for Sketch-based 3D Shape Retrieval

###### Abstract

Due to the large cross-modality discrepancy between 2D sketches and 3D shapes, retrieving 3D shapes by sketches is a significantly challenging task. To address this problem, we propose a novel framework to learn a discriminative deep cross-modality adaptation model in this paper. Specifically, we first separately adopt two metric networks, following two deep convolutional neural networks (CNNs), to learn modality-specific discriminative features based on an importance-aware metric learning method. Subsequently, we explicitly introduce a cross-modality transformation network to compensate for the divergence between two modalities, which can transfer features of 2D sketches to the feature space of 3D shapes. We develop an adversarial learning based method to train the transformation model, by simultaneously enhancing the holistic correlations between data distributions of two modalities, and mitigating the local semantic divergences through minimizing a cross-modality mean discrepancy term. Experimental results on the SHREC 2013 and SHREC 2014 datasets clearly show the superior retrieval performance of our proposed model, compared to the state-of-the-art approaches.

## I Introduction

In the last few years, there has been an explosive growth of 3D shape data, driven by increasing demands from industrial applications such as virtual reality and LiDAR-based autonomous vehicles. As a result, 3D shape related techniques have become very active research topics. Retrieving a certain category of 3D shapes from a given database is one of the fundamental problems for 3D shape based applications. Much effort has been devoted to 3D shape retrieval with 3D models as queries [23, 25], which are intuitive and straightforward, but difficult to acquire. Alternatively, freehand sketching is a more convenient way for humans to interact with data collection and processing systems, especially with the sharply increased use of touch-screen devices such as smartphones and tablet computers. As a consequence, sketch-based 3D shape retrieval, i.e., searching 3D shapes queried by sketches, has attracted increasing attention [2, 22, 13, 24].

Despite their succinctness and ease of acquisition, freehand sketches have two disadvantages in the application of 3D shape retrieval, which make sketch-based 3D shape retrieval an extremely challenging task. Firstly, sketches are usually drawn subjectively in uncontrolled environments, resulting in severe intra-class variations as shown in Fig. 3. Secondly, sketches and 3D shapes have heterogeneous data structures, which leads to large cross-modality divergences.

A variety of models have been proposed to address the aforementioned two issues, which can be roughly divided into two categories, i.e., representation based methods and matching based methods. The first category aims to extract robust features for both sketches and 3D shapes [10, 13, 2, 3, 22, 30, 29, 24]. However, due to the heterogeneity of sketches and 3D shapes, it is quite difficult to achieve modality-invariant discriminative representations. On the other hand, matching based methods focus on developing effective models for calculating similarities or distances between sketches and 3D shapes, among which deep metric learning based models [22, 1, 24] have achieved the state-of-the-art performance. Nevertheless, these methods fail to explore the varying importance of different training samples. Besides, they can merely enhance local cross-modality correlations, by selecting data pairs or triplets across modalities, without taking into account the holistic data distributions. As a consequence, the learned deep metrics might be less discriminative, and might generalize poorly to unseen test data.

To overcome the drawbacks of existing works, we propose a novel model, namely Deep Cross-modality Adaptation (DCA), for sketch-based 3D shape retrieval. Fig. 1 shows the framework of our proposed model. We first construct two separate deep convolutional neural networks (CNNs) and metric networks, one for sketches and the other for 3D shapes, to learn discriminative modality-specific features for each modality via importance-aware metric learning (IAML). Through mining the hardest samples in each mini-batch for training, IAML could explore the importance of training data, and therefore learn discriminative representations more efficiently. Furthermore, in order to reduce the large cross-modality divergence between learned features of sketches and 3D shapes, we explicitly introduce a cross-modality transformation network, to transfer features of sketches into the feature space of 3D shapes. An adversarial learning method with class-aware cross-modality mean discrepancy minimization (CMDM-AL) is developed to train the transformation network, which acts as a generator. Since CMDM-AL is able to enhance correlations between distributions of transferred data of sketches and data of 3D shapes, our model can compensate for the cross-modality discrepancy in a holistic way. IAML is also applied to the transformed data, in order to further preserve semantic structures of sketch data after adaptation. The main contributions of this paper are three-fold:

1) We propose a novel deep cross-modality adaptation model via semantics preserving adversarial learning. To the best of our knowledge, this work is the first one that incorporates adversarial learning into sketch-based 3D shape retrieval.

2) We develop a new adversarial learning based method for training the deep cross-modality adaptation network, which simultaneously reduces the holistic cross-modality discrepancy of data distributions, and enhances semantic correlations of local data batches across modalities.

3) We significantly boost the performance of existing state-of-the-art sketch-based 3D shape retrieval methods on two large benchmark datasets.

## II Related Work

In the literature, most existing works on sketch-based 3D shape retrieval concentrate on building modality-invariant representations for sketches and 3D shapes, and on developing discriminative matching models. Various hand-crafted features are employed, such as Zernike moments, the contour-based Fourier descriptor, the eccentricity feature and the circularity feature [14], the chordal axis transform based shape descriptor [26], HoG-SIFT features [27], the local improved Pyramid of Histograms of Orientation Gradients (iPHOG) [12], the sparse coding spatial pyramid matching feature (ScSPM) and the local depth scale-invariant feature transform (LD-SIFT) [30]. Besides, many learning-based features have been developed, including bag-of-features (BoF) with Gabor local line based features (GALIF) [11] and dense SIFT with BoF [3]. Meanwhile, numerous matching approaches have also been developed, such as manifold ranking [3], dynamic time warping [26], sparse coding based matching [27] and adaptive view clustering [10, 12].

Recently, various deep models have been developed for both feature extraction and matching, which are closely related to our proposed method. In [22], two Siamese CNNs were employed to learn discriminative features of sketches and 3D shapes by minimizing within-modality and cross-modality losses. In [30], pyramid cross-domain neural networks were utilized to compensate for cross-domain divergences. In [1] and [24], Siamese metric networks were employed to minimize both within-modality and cross-modality intra-class distances whilst maximizing inter-class distances. In [24], Wasserstein barycenters were additionally employed to aggregate multi-view deep features of rendered images from 3D models. However, these methods only reduce the local cross-modality divergence, and do not consider removing the data distribution shift across modalities. In contrast, our proposed model employs an adversarial learning based method to mitigate the discrepancy between distributions of two modalities in a holistic way, whilst addressing the local divergence issues by introducing a class-aware mean discrepancy term. Moreover, we apply IAML to mine the importance of training samples, which has also been ignored by current works.

Another branch of work related to ours is supervised discriminative adversarial learning for domain adaptation. In [4, 21, 15], a variety of adversarial discriminative models were developed for domain adaptation. The basic idea of these methods is to remove the domain shift between the source and target domains, by employing a domain discriminator and an adversarial loss. However, these works concentrate on scenarios where few labeled data are available in the target domain (despite abundant labeled data in the source domain), and are unable to jointly explore local discriminative semantic structures for both domains, making them unsuitable for our task. In [28], the authors also explicitly adopted a transformation network to transfer data from the source domain to the target domain, where the cross-domain divergence is mitigated by an adversarial loss. However, they used hand-crafted features, while our model employs deep CNNs to learn discriminative modality-specific features, and integrates them with the transformation network as a whole. Moreover, we add a class-aware cross-modality mean discrepancy term to the original adversarial loss. This term can enhance semantic correlations of data distributions across modalities as well as remove domain shift, which is largely neglected by existing works.

## III Deep Cross-modality Adaptation

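As illustrated in Fig. 1, our proposed framework mainly consists of five components: the CNN networks for 2D sketches (denoted by $F_{\mathrm{CNN}}^{1}$) and for 3D shapes (denoted by $F_{\mathrm{CNN}}^{2}$), the fully connected metric networks for 2D sketches (denoted by $F_{\mathrm{metric}}^{1}$) and for 3D shapes (denoted by $F_{\mathrm{metric}}^{2}$), together with the cross-modality transformation network $F_{\mathrm{trans}}$, of which the parameters are $\theta_{\mathrm{CNN}}^{1}$, $\theta_{\mathrm{CNN}}^{2}$, $\theta_{\mathrm{metric}}^{1}$, $\theta_{\mathrm{metric}}^{2}$ and $\theta_{\mathrm{trans}}$, respectively.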

Similar to most existing deep learning methods, we train our model by mini-batches. In order to describe our method more conveniently, we build image batches from the whole training data in a slightly different way from random sampling. Specifically, for 2D sketches, we first select $C$ classes randomly, and then collect $K$ images for each class. The selected images comprise a mini-batch $\mathcal{I}^{1} = \{I_i^{1}\}_{i=1}^{N}$ of size $N = C \times K$, of which the corresponding class labels are denoted by $Y^{1} = \{y_i^{1}\}_{i=1}^{N}$. In the same way, a batch of 3D shapes $\mathcal{I}^{2} = \{I_i^{2}\}_{i=1}^{N}$ is constructed, together with labels $Y^{2} = \{y_i^{2}\}_{i=1}^{N}$. To characterize a 3D shape, we utilize the widely used multi-view representation as in [18, 1, 24], i.e., projecting a 3D shape to $V$ grayscale images from $V$ rendered views that are evenly spaced around the 3D shape. Thereafter, we can represent $\mathcal{I}^{2}$ as a batch of image sets, where $I_i^{2} = \{I_{i,v}^{2}\}_{v=1}^{V}$ consists of $V$ ($V$=12 is used in our paper) 2D rendered images of the $i$-th 3D shape.
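The class-balanced batch construction described above (pick some classes at random, then a fixed number of images from each) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_balanced_batch`, its arguments, and the choice to sample with replacement when a class is too small are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_balanced_batch(labels, num_classes, images_per_class, rng=None):
    """Return dataset indices for one class-balanced mini-batch.

    labels           -- list where labels[i] is the class of sample i
    num_classes      -- number of classes drawn per batch
    images_per_class -- number of samples drawn for each chosen class
    """
    rng = rng or random.Random()

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Step 1: choose the classes for this batch at random.
    chosen = rng.sample(sorted(by_class), num_classes)

    # Step 2: draw a fixed number of samples from every chosen class.
    batch = []
    for label in chosen:
        pool = by_class[label]
        if len(pool) >= images_per_class:
            batch.extend(rng.sample(pool, images_per_class))
        else:
            # Assumption: classes with too few samples are drawn with replacement.
            batch.extend(rng.choices(pool, k=images_per_class))
    return batch
```

With 16 classes per batch and 4 images per class, the values reported in the parameter settings, each batch holds 64 samples. Because every class in the batch contributes several samples, each anchor is guaranteed to have same-class partners, which the hardest-sample mining in the next subsection relies on.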

As demonstrated in Fig. 1, we train the CNN and metric networks for sketches, i.e., $F_{\mathrm{CNN}}^{1}$ and $F_{\mathrm{metric}}^{1}$, jointly by adopting an importance-aware metric learning method, which mines the hardest training samples within a mini-batch. The CNN and metric networks for 3D shapes, i.e., $F_{\mathrm{CNN}}^{2}$ and $F_{\mathrm{metric}}^{2}$, are trained in the same way. The cross-modality transformation network $F_{\mathrm{trans}}$ is learnt by preserving semantic structures of transformed features, and by employing an adversarial learning based training strategy with class-aware cross-modality mean discrepancy minimization.

In the rest of this section, we elaborate the details of the proposed method, including the importance-aware metric learning, the semantics preserving adversarial learning, and the optimization algorithm. Without loss of generality, all loss functions are formulated on the image batches $\mathcal{I}^{1}$ (sketches) and $\mathcal{I}^{2}$ (3D shapes) throughout this paper, and can be easily extended to the whole training data.

### III-A Importance-Aware Feature Learning

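Given a mini-batch $\mathcal{I}^{k} = \{I_i^{k}\}_{i=1}^{N}$ with class labels $\{y_i^{k}\}_{i=1}^{N}$, where $k=1$ denotes sketches and $k=2$ denotes 3D shapes, after successively passing it through the CNN network $F_{\mathrm{CNN}}^{k}$ and the metric network $F_{\mathrm{metric}}^{k}$, we obtain a set of feature vectors:

$$\mathcal{F}^{k} = \{f_1^{k}, f_2^{k}, \ldots, f_N^{k}\},$$

where $f_i^{k} = F_{\mathrm{metric}}^{k}\big(F_{\mathrm{CNN}}^{k}(I_i^{k})\big)$ for $i = 1, \ldots, N$.

Ideally, in order to learn discriminative features for each modality (i.e., the 2D sketches or the 3D shapes), the inter-class distances within the batch need to be larger than the intra-class distances. To achieve this, inspired by [7], we adopt the following loss function for importance-aware metric learning:

$$L_{\mathrm{IAML}}^{k} = \sum_{i=1}^{N} \max\Big(0,\; m + \big\|f_i^{k} - f_{p_i^{*}}^{k}\big\|_2 - \big\|f_i^{k} - f_{n_i^{*}}^{k}\big\|_2\Big), \tag{1}$$

where

$$n_i^{*} = \arg\min_{j:\, y_j^{k} \neq y_i^{k}} \big\|f_i^{k} - f_j^{k}\big\|_2, \tag{2}$$

$$p_i^{*} = \arg\max_{j:\, y_j^{k} = y_i^{k}} \big\|f_i^{k} - f_j^{k}\big\|_2, \tag{3}$$

and $m > 0$ is a constant margin.

As can be seen from Eq. (2), for a given anchor point $f_i^{k}$, $f_{n_i^{*}}^{k}$ is the sample that has the minimal Euclidean distance to $f_i^{k}$ among the samples from different classes. From Eq. (3), we can see that $f_{p_i^{*}}^{k}$ is the sample that has the maximal Euclidean distance to $f_i^{k}$ among the samples belonging to the same class as $f_i^{k}$. In other words, $\|f_i^{k} - f_{p_i^{*}}^{k}\|_2$ and $\|f_i^{k} - f_{n_i^{*}}^{k}\|_2$ are the maximal intra-class Euclidean distance and the minimal inter-class Euclidean distance with respect to $f_i^{k}$ within the batch, respectively. Therefore, $f_{p_i^{*}}^{k}$ and $f_{n_i^{*}}^{k}$ are the batch-wise “hardest positive” and “hardest negative” samples w.r.t. $f_i^{k}$, and should be given higher importance during training. Existing deep metric learning based models [1, 24] treat all training samples equally. In contrast, our IAML first mines the hardest positive and negative training samples within a mini-batch, and enforces them to be consistent with semantics, making it more efficient to learn discriminative features.

By minimizing $L_{\mathrm{IAML}}^{k}$ in Eq. (1), $\|f_i^{k} - f_{n_i^{*}}^{k}\|_2$ is forced to exceed $\|f_i^{k} - f_{p_i^{*}}^{k}\|_2$ by at least the margin, i.e., $\|f_i^{k} - f_{n_i^{*}}^{k}\|_2 \geq \|f_i^{k} - f_{p_i^{*}}^{k}\|_2 + m$. That is to say, by minimizing $L_{\mathrm{IAML}}^{k}$, the minimal inter-class distance is compelled to be larger than the maximal intra-class distance in the feature space, whilst keeping a margin $m$. Consequently, we can learn CNN and metric networks that extract discriminative features for each modality (i.e., 2D sketches or 3D shapes).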
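To make the hardest-sample mining concrete, the following NumPy sketch computes a batch-hard triplet loss in the spirit of [7]: for every anchor it picks the farthest same-class sample and the closest different-class sample, then applies a hinge with a margin. It is an illustration under stated assumptions, not the authors' code: the function name, the default margin of 1.0, and summing (rather than averaging) over anchors are choices made here.

```python
import numpy as np


def batch_hard_triplet_loss(features, labels, margin=1.0):
    """Batch-hard triplet loss summed over all anchors in a mini-batch.

    features -- array of shape (N, d), one feature vector per sample
    labels   -- array of shape (N,), integer class labels
    margin   -- hypothetical margin value; the text only calls it a constant
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (N, N).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    same = labels[:, None] == labels[None, :]

    # Hardest positive: farthest sample sharing the anchor's class.
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # Hardest negative: closest sample from a different class.
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)

    # Hinge on (margin + d_pos - d_neg), summed over anchors.
    return np.maximum(0.0, margin + hardest_pos - hardest_neg).sum()
```

The loss is zero exactly when, for every anchor, the closest sample of another class is at least `margin` farther away than the farthest sample of its own class. In a real training setup the same computation would be written in an automatic-differentiation framework so that gradients flow back to the CNN and metric networks.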

### III-B Cross-modality Transformation based on Adversarial Learning

By applying the importance-aware metric learning, i.e., minimizing the losses $L_{\mathrm{IAML}}^{1}$ (for sketches) and $L_{\mathrm{IAML}}^{2}$ (for 3D shapes), we can learn discriminative features for sketches and shapes, i.e., $\mathcal{F}^{1}$ and $\mathcal{F}^{2}$, respectively. However, due to the large discrepancy between data distributions of different modalities, directly using $\mathcal{F}^{1}$ and $\mathcal{F}^{2}$ for cross-modality retrieval will result in extremely poor performance.

To address this problem, we propose a cross-modality transformation network $F_{\mathrm{trans}}$, which adapts the learnt features of 2D sketches to the feature space of 3D shapes while removing cross-modality discrepancies.

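Suppose $\mathcal{F}^{T} = \{f_1^{T}, f_2^{T}, \ldots, f_N^{T}\}$ is the set of transformed features of sketches with class labels $Y^{1} = \{y_1^{1}, y_2^{1}, \ldots, y_N^{1}\}$, where $f_i^{T} = F_{\mathrm{trans}}(f_i^{1})$ for $i = 1, \ldots, N$, and $f_i^{1}$ is the learnt feature of the $i$-th sketch. Ideally, the transformed features are expected to have the following properties, in order to guarantee good performance for the cross-modality retrieval task:

1) $\mathcal{F}^{T}$ should be semantics preserving, i.e., maintaining small intra-class distances and large inter-class distances.

2) $\mathcal{F}^{T}$ should have a data distribution correlated with that of $\mathcal{F}^{2}$, i.e., the learnt features of 3D shapes.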

The first property aims to compel the transformed features to preserve semantics, whilst the second attempts to remove the cross-modality discrepancy through strengthening correlations between data distributions of two modalities.

As shown in Fig. 2, we introduce a semantics preserving term by repeatedly utilizing the importance-aware metric learning to accomplish 1). And in order to achieve 2), we employ a cross-modality correlation enhancement term based on adversarial learning with class-aware cross-modality mean discrepancy minimization. We will provide details about the aforementioned two terms in the rest of this section.

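Semantics Preserving Term. In order to preserve semantic structures, i.e., keeping small intra-class distances and large inter-class distances, we apply the importance-aware metric learning loss introduced previously to the transformed features $f_i^{T} = F_{\mathrm{trans}}(f_i^{1})$, $i = 1, \ldots, N$, with sketch labels $y_i^{1}$:

$$L_{\mathrm{SeP}} = \sum_{i=1}^{N} \max\Big(0,\; m + \big\|f_i^{T} - f_{p_i^{*}}^{T}\big\|_2 - \big\|f_i^{T} - f_{n_i^{*}}^{T}\big\|_2\Big), \tag{4}$$

where

$$n_i^{*} = \arg\min_{j:\, y_j^{1} \neq y_i^{1}} \big\|f_i^{T} - f_j^{T}\big\|_2, \tag{5}$$

$$p_i^{*} = \arg\max_{j:\, y_j^{1} = y_i^{1}} \big\|f_i^{T} - f_j^{T}\big\|_2, \tag{6}$$

and $m > 0$ is a constant margin.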

Cross-modality Correlation Enhancement Term. Generative adversarial networks (GANs) have recently emerged as an effective method to generate synthetic data [5]. The basic idea is to train two competing networks, a generator $G$ and a discriminator $D$, based on game theory. The generator $G$ is trained to sample from the data distribution $p_{\mathrm{data}}(x)$ given a noise vector $v$. The discriminator $D$ is trained to distinguish synthetic data generated by $G$ from real data sampled from $p_{\mathrm{data}}(x)$. The problem of training GANs is formulated as follows:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{v \sim p_{v}(v)}\big[\log\big(1 - D(G(v))\big)\big], \tag{7}$$

where $p_{v}(v)$ is a prior distribution over $v$. It has been pointed out in [5] that the global equilibrium of the two-player game in Eq. (7) is achieved if and only if $p_{g} = p_{\mathrm{data}}$, where $p_{g}$ is the distribution of generated data.

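In our model, we treat the transformation network $F_{\mathrm{trans}}$ as the generator $G$. Suppose $p_{1}$, $p_{2}$ and $p_{T}$ are the distributions of learnt features of sketches, 3D shapes and transformed data (denoted by $f^{1}$, $f^{2}$ and $f^{T} = F_{\mathrm{trans}}(f^{1})$), respectively. By solving the following problem

$$\min_{F_{\mathrm{trans}}} \max_{D} \; \mathbb{E}_{f^{2} \sim p_{2}}\big[\log D(f^{2})\big] + \mathbb{E}_{f^{1} \sim p_{1}}\big[\log\big(1 - D(F_{\mathrm{trans}}(f^{1}))\big)\big], \tag{8}$$

we can expect that $p_{T} = p_{2}$, i.e., the transformed data has the same distribution as the features of 3D shapes, if problem (8) reaches the global equilibrium. Consequently, the cross-modality discrepancy can be reduced.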

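Conventionally, problem (8) is solved by alternately optimizing the discriminator $D$ and the transformation network $F_{\mathrm{trans}}$ through minimizing the following two loss functions, where $f^{1} \sim p_{1}$ is a learnt sketch feature and $f^{2} \sim p_{2}$ is a learnt 3D shape feature:

$$L_{D} = -\mathbb{E}_{f^{2} \sim p_{2}}\big[\log D(f^{2})\big] - \mathbb{E}_{f^{1} \sim p_{1}}\big[\log\big(1 - D(F_{\mathrm{trans}}(f^{1}))\big)\big], \tag{9}$$

$$L_{G} = \mathbb{E}_{f^{1} \sim p_{1}}\big[\log\big(1 - D(F_{\mathrm{trans}}(f^{1}))\big)\big]. \tag{10}$$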
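As an illustration of the alternating adversarial objective, the sketch below evaluates the two losses of the standard GAN formulation in [5] from discriminator outputs. It assumes the discriminator returns probabilities in (0, 1), with 3D shape features playing the role of real data and transformed sketch features the role of generated data. The function names and the `non_saturating` switch are hypothetical, and the text does not say which generator variant the authors used.

```python
import numpy as np

EPS = 1e-12  # keeps log() finite when a probability hits 0 or 1


def discriminator_loss(d_real, d_fake):
    """Loss minimized by the discriminator.

    d_real -- discriminator outputs in (0, 1) for 3D shape features
    d_fake -- discriminator outputs in (0, 1) for transformed sketch features
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return -(np.log(d_real + EPS).mean() + np.log(1.0 - d_fake + EPS).mean())


def generator_loss(d_fake, non_saturating=False):
    """Loss minimized by the generator (here, the transformation network)."""
    d_fake = np.asarray(d_fake, dtype=float)
    if non_saturating:
        # Practical variant suggested in [5]: maximize log D(G(.)).
        return -np.log(d_fake + EPS).mean()
    # Minimax form: minimize log(1 - D(G(.))).
    return np.log(1.0 - d_fake + EPS).mean()
```

At the equilibrium described above, the discriminator cannot tell the two feature sets apart and outputs 0.5 everywhere, so the discriminator loss settles at 2 log 2.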

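So far, we have trained a transformation network $F_{\mathrm{trans}}$ such that $p_{T} \approx p_{2}$ by minimizing the discriminator loss $L_{D}$ and the generator loss $L_{G}$, where $p_{T}$ is the distribution of transformed sketch features $f^{T}$ and $p_{2}$ is the distribution of 3D shape features $f^{2}$. Although the divergence between $p_{T}$ and $p_{2}$ can be diminished by adversarial learning, the cross-modality semantic structures are not taken into account. To address this problem, we further introduce the following term, namely the class-aware cross-modality mean discrepancy

$$L_{\mathrm{CMD}} = \sum_{c} \Big\| \mathbb{E}_{f^{T} \sim p_{T}}\big[f^{T} \mid y = c\big] - \mathbb{E}_{f^{2} \sim p_{2}}\big[f^{2} \mid y = c\big] \Big\|_2^2 \tag{11}$$

to adversarial learning, where $c$ is the class label. By minimizing $L_{\mathrm{CMD}}$, the mean feature vector of class $c$ from the sketch modality is compelled to be close to the mean feature vector of the same class from the 3D shape modality.

In practice, provided a mini-batch $\mathcal{F}^{T}$ ($\mathcal{F}^{2}$), the conditional expectation can be approximated by the batch-wise mean feature vector, i.e., $\mathbb{E}\big[f^{T} \mid y = c\big] \approx \frac{1}{|\{i:\, y_i^{1} = c\}|} \sum_{i:\, y_i^{1} = c} f_i^{T}$, and likewise for $f^{2}$.

Through minimizing the loss $L_{G} + L_{\mathrm{CMD}}$, we obtain the adversarial learning method with cross-modality mean discrepancy minimization (CMDM-AL), which enhances the semantic correlations across modalities.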
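The batch-wise approximation of the class-aware mean discrepancy can be sketched as follows: for each class, average the transformed sketch features and the 3D shape features in the current batches, and penalize the distance between the two means. This is a minimal illustration with assumptions labeled in the code: the function name is hypothetical, a squared Euclidean distance is used, and classes missing from either batch are skipped.

```python
import numpy as np


def class_mean_discrepancy(feat_t, labels_t, feat_s, labels_s):
    """Sum over shared classes of squared distance between class means.

    feat_t, labels_t -- transformed sketch features (N, d) and labels (N,)
    feat_s, labels_s -- 3D shape features (M, d) and labels (M,)
    """
    feat_t = np.asarray(feat_t, dtype=float)
    feat_s = np.asarray(feat_s, dtype=float)
    labels_t = np.asarray(labels_t)
    labels_s = np.asarray(labels_s)

    # Assumption: only classes present in both batches contribute.
    shared = np.intersect1d(labels_t, labels_s)

    total = 0.0
    for c in shared:
        mean_t = feat_t[labels_t == c].mean(axis=0)  # batch-wise class mean
        mean_s = feat_s[labels_s == c].mean(axis=0)
        total += float(((mean_t - mean_s) ** 2).sum())
    return total
```

Unlike the adversarial loss, which only looks at whether the two feature sets are distinguishable as a whole, this term uses the labels, so it pulls each sketch class towards the matching shape class rather than towards any region of the shape feature space.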

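By combining the semantics preserving loss $L_{\mathrm{SeP}}$ in Eq. (4) and the cross-modality correlation enhancing loss $L_{G} + L_{\mathrm{CMD}}$ from Eqs. (10) and (11), we finally obtain the loss function for training the transformation network $F_{\mathrm{trans}}$:

$$L_{\mathrm{trans}} = L_{G} + L_{\mathrm{CMD}} + L_{\mathrm{SeP}}. \tag{12}$$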

### III-C Optimization

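In Eq. (1), we defined the loss function $L_{\mathrm{IAML}}^{1}$ for jointly training the sketch networks $F_{\mathrm{CNN}}^{1}$ and $F_{\mathrm{metric}}^{1}$, and the loss function $L_{\mathrm{IAML}}^{2}$ for training the networks $F_{\mathrm{CNN}}^{2}$ and $F_{\mathrm{metric}}^{2}$ of 3D shapes. We also developed a loss function $L_{\mathrm{trans}}$ for training the cross-modality transformation network $F_{\mathrm{trans}}$ in Eq. (12).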

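To learn the parameters of the proposed deep cross-modality adaptation model, we optimize the different networks in an alternating, iterative way. Algorithm 1 summarizes the outline of the training process. Specifically, we first pre-train the CNN and metric networks of sketches and 3D shapes based on the loss in Eq. (1), and pre-train the cross-modality transformation network $F_{\mathrm{trans}}$ by minimizing its loss $L_{\mathrm{trans}}$ in Eq. (12) and the discriminator loss $L_{D}$. After initialization, we then alternately update the sketch networks $\{F_{\mathrm{CNN}}^{1}, F_{\mathrm{metric}}^{1}\}$, the shape networks $\{F_{\mathrm{CNN}}^{2}, F_{\mathrm{metric}}^{2}\}$, the transformation network $F_{\mathrm{trans}}$, and the adversarial discriminator $D$, by minimizing $L_{\mathrm{IAML}}^{1}$, $L_{\mathrm{IAML}}^{2}$, $L_{\mathrm{trans}}$ and $L_{D}$, respectively. Throughout the whole training process, we use the Adam stochastic gradient method [8] as the optimizer.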

## IV Experimental Results and Analysis

To evaluate the performance of our method, we conduct experiments on two widely used benchmark datasets for sketch-based 3D shape retrieval, namely SHREC 2013 and SHREC 2014.

SHREC 2013 [10, 11] is a large-scale dataset for sketch-based 3D shape retrieval. This dataset consists of 7,200 sketches and 1,258 shapes from 90 classes, built by collecting human-drawn sketches [2] and 3D shapes from the Princeton Shape Benchmark (PSB) [16] that share common categories. Each class has 80 sketches in total, of which 50 are used for training and 30 for testing. The number of 3D shapes varies across classes, with about 14 per class on average.

SHREC 2014 [14, 13] is a sketch track benchmark larger than SHREC 2013. It contains 13,680 sketches and 8,987 3D shapes in total, grouped into 171 classes. The 3D shapes are collected from various datasets, including SHREC 2012 [9] and the Toyohashi Shape Benchmark (TSB) [20]. Similar to SHREC 2013, each class has 80 sketches, and about 53 3D shapes on average. The sketches are further split into 8,550 training samples and 5,130 test samples, where for each class, 50 images are used for training and the remaining 30 images for testing.

Fig. 3 shows some samples from the two datasets. As illustrated, retrieving 3D shapes by sketches is quite challenging, due to large intra-class variations and cross-modality discrepancies between sketches and 3D shapes.

### IV-A Implementation Details

In this subsection, we will provide implementation details about our proposed method, including network structures and parameter settings.

Network Structures. For the CNN networks of both sketches and shapes, i.e., $F_{\mathrm{CNN}}^{1}$ and $F_{\mathrm{CNN}}^{2}$, we utilize the ResNet-50 network [6]. Specifically, we use the layers of ResNet-50 up to and including the “pooling5” layer. The metric networks of sketches and 3D shapes, i.e., $F_{\mathrm{metric}}^{1}$ and $F_{\mathrm{metric}}^{2}$, both consist of four fully connected layers with sizes 2048-1024-512-256-128. We use the “relu” activation function and batch normalization for all layers in the metric networks, except that the last layer uses the “tanh” activation function. For the cross-modality transformation network $F_{\mathrm{trans}}$, we adopt a network with four fully connected layers with sizes 128-64-32-64-128, where the first three layers use the “relu” activation function, and the last layer uses the “tanh” activation function. The discriminator $D$ is a fully connected network with sizes 128-64-1.
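The layer sizes above can be checked with a small NumPy forward pass. This is only a shape-level sketch under several assumptions: the weights are random placeholders, batch normalization is omitted, the ResNet-50 backbone is replaced by random 2048-dimensional vectors standing in for its pooled output, and the helper names `init_mlp` and `forward_mlp` are hypothetical.

```python
import numpy as np


def init_mlp(widths, rng):
    """Create (weight, bias) pairs for consecutive layer widths."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]


def forward_mlp(params, x):
    """ReLU on every layer except the last, which uses tanh."""
    for weight, bias in params[:-1]:
        x = np.maximum(0.0, x @ weight + bias)
    weight, bias = params[-1]
    return np.tanh(x @ weight + bias)


# Layer widths quoted in the text.
METRIC_WIDTHS = [2048, 1024, 512, 256, 128]
TRANS_WIDTHS = [128, 64, 32, 64, 128]

rng = np.random.default_rng(0)
metric_net = init_mlp(METRIC_WIDTHS, rng)
trans_net = init_mlp(TRANS_WIDTHS, rng)

pooled = rng.normal(size=(4, 2048))                # stand-in for pooled CNN features
sketch_feat = forward_mlp(metric_net, pooled)      # shape (4, 128), values in [-1, 1]
transformed = forward_mlp(trans_net, sketch_feat)  # shape (4, 128), values in [-1, 1]
```

Because both the metric networks and the transformation network end in a tanh layer of width 128, transformed sketch features and 3D shape features live in the same bounded 128-dimensional space, which is what allows a single discriminator to compare them.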

Parameter Settings. We set the maximal number of iterations to 30,000. The initial learning rate is set to , and decays exponentially after 10,000 steps. To generate the data batches $\mathcal{I}^{1}$ and $\mathcal{I}^{2}$, the number of classes per batch $C$ and the number of images per class $K$ are set to 16 and 4, respectively.

### IV-B Evaluation Metrics

We adopt the most widely used metrics for sketch-based 3D shape retrieval as follows: nearest neighbor (NN), first tier (FT), second tier (ST), E-measure (E), discounted cumulated gain (DCG) and mean average precision (mAP) [11, 1, 24]. We also report the precision-recall curve, a common metric for visually evaluating the retrieval performance.
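For readers unfamiliar with these measures, the sketch below computes NN, FT, ST and average precision for a single query from its ranked relevance list. It assumes the definitions commonly used in the SHREC sketch tracks (FT and ST are the recall within the top C and top 2C results, where C is the number of database shapes in the query's class) and assumes the whole database is ranked. The function name is hypothetical, and E-measure and DCG are left out for brevity.

```python
def retrieval_metrics(ranked_relevance, num_relevant):
    """Compute NN, FT, ST and AP for a single query.

    ranked_relevance -- list of booleans, True where the retrieved shape at
                        that rank shares the query's class (best match first)
    num_relevant     -- number of database shapes in the query's class
    """
    rel = [bool(r) for r in ranked_relevance]

    nn = 1.0 if rel[0] else 0.0
    ft = sum(rel[:num_relevant]) / num_relevant
    st = sum(rel[:2 * num_relevant]) / num_relevant

    # Average precision: mean of the precision at each relevant rank.
    hits, precision_sum = 0, 0.0
    for rank, is_rel in enumerate(rel, start=1):
        if is_rel:
            hits += 1
            precision_sum += hits / rank
    ap = precision_sum / num_relevant
    return {"NN": nn, "FT": ft, "ST": st, "AP": ap}
```

The reported scores are then averages over all query sketches; in particular, mAP is the mean of the per-query AP values.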

### IV-C Evaluation of the Proposed Method

In this section, we will evaluate the effect of the proposed adversarial learning with class-aware cross-modality mean discrepancy minimization (CMDM-AL), together with the semantics preserving (SeP) term.

As a baseline, we apply the importance-aware metric learning (IAML) to separately train the networks $F_{\mathrm{CNN}}^{1}$ and $F_{\mathrm{metric}}^{1}$ for 2D sketches, and $F_{\mathrm{CNN}}^{2}$ and $F_{\mathrm{metric}}^{2}$ for 3D shapes, where the learnt feature vectors are directly used for retrieval. This baseline method, denoted by sepIAML, merely learns discriminative features, without considering the cross-modality issues. Based on sepIAML, we employ the cross-modality transformation network $F_{\mathrm{trans}}$, which is trained by minimizing the adversarial loss with class-aware mean discrepancy, $L_{G} + L_{\mathrm{CMD}}$. We denote this method by DCA (CMDM-AL). By further adding the semantics preserving term $L_{\mathrm{SeP}}$, i.e., training $F_{\mathrm{trans}}$ by the full loss in Eq. (12), we obtain the complete model of our proposed method, denoted by DCA (CMDM-AL+SeP). By comparing the performance of sepIAML, DCA (CMDM-AL) and DCA (CMDM-AL+SeP), we can evaluate the effects of the proposed adversarial learning method and the semantics preserving term.

The results are summarized in Tables I and II. As can be seen, the baseline method sepIAML yields rather poor performance, due to its weakness in dealing with cross-modality discrepancies. By introducing the adversarial learning method, DCA (CMDM-AL) significantly boosts the performance of the baseline, implying that the adversarial learning can largely enhance the correlation between data distributions of different modalities. Moreover, DCA (CMDM-AL+SeP) consistently improves over DCA (CMDM-AL) on both benchmarks. This indicates that the semantics preserving term helps to learn a more discriminative cross-modality transformation network.

### IV-D Comparison with the State-of-the-art Methods

Retrieval Performance on SHREC 2013. Here we report experimental results of the proposed method on SHREC 2013, in comparison with the state-of-the-art methods, including the cross domain manifold ranking method (CDMR) [3], the sketch-based retrieval method with view clustering (SBR-VC) [10], the spatial proximity method (SP) [17], Fourier descriptors on 3D model silhouettes (FDC) [10], the edge-based Fourier spectra descriptor (EFSD) [10], the Siamese network (Siamese) [22], the chordal axis transform with dynamic time warping (CAT-DTW) [26], deep correlated metric learning (DCML) [1], and the learned Wasserstein barycentric representation method (LWBR) [24].

Table I: Retrieval performance on the SHREC 2013 dataset.

| Methods | NN | FT | ST | E | DCG | mAP |
| --- | --- | --- | --- | --- | --- | --- |
| CDMR [3] | 0.279 | 0.203 | 0.296 | 0.166 | 0.458 | 0.250 |
| SBR-VC [10] | 0.164 | 0.097 | 0.149 | 0.085 | 0.348 | 0.114 |
| SP [17] | 0.017 | 0.016 | 0.031 | 0.018 | 0.240 | 0.026 |
| FDC [10] | 0.110 | 0.069 | 0.107 | 0.061 | 0.307 | 0.086 |
| Siamese [22] | 0.405 | 0.403 | 0.548 | 0.287 | 0.607 | 0.469 |
| CAT-DTW [26] | 0.235 | 0.135 | 0.198 | 0.109 | 0.392 | 0.141 |
| KECNN [19] | 0.320 | 0.319 | 0.397 | 0.236 | 0.489 | NA |
| DCML [1] | 0.650 | 0.634 | 0.719 | 0.348 | 0.766 | 0.674 |
| LWBR [24] | 0.712 | 0.725 | 0.785 | 0.369 | 0.814 | 0.752 |
| Baseline (sepIAML) | 0.011 | 0.015 | 0.028 | 0.016 | 0.234 | 0.037 |
| DCA (CMDM-AL) | 0.762 | 0.776 | 0.812 | 0.370 | 0.842 | 0.795 |
| DCA (CMDM-AL+SeP) | 0.783 | 0.796 | 0.829 | 0.376 | 0.856 | 0.813 |

Fig. 4 demonstrates the precision-recall curves of the proposed method and compared approaches. As illustrated, the precision rate of our method is significantly higher than those of compared models, when the recall rate is smaller than 0.8. Considering that the top retrieved results are preferable, our method therefore performs significantly better than the state-of-the-art approaches.

We also report NN, FT, ST, E, DCG and mAP of various methods, including CDMR, SBR-VC, SP, FDC, Siamese, DCML, LWBR and the proposed method. As summarized in Table I, our approach yields the best retrieval performance w.r.t. all evaluation metrics. Among all compared approaches, Siamese, DCML and LWBR are deep metric learning based models. They directly map data from different modalities into a common embedding subspace, where both the single-modality and cross-modality intra-class Euclidean distances are decreased, and the inter-class distances are simultaneously enlarged. However, they treat all training samples equally, and fail to explore the varying importance of distinct samples. Besides, they only reduce the local cross-modality divergences between data pairs or triplets, without considering the correlation between data distributions in a holistic way. In contrast, our method learns features by mining the batch-wise hardest positive and hardest negative samples. Through automatically selecting the most important training samples, we can learn discriminative features more efficiently. Moreover, we explicitly introduce a cross-modality transformation network, in order to transfer the features from the sketch modality to the feature space of 3D shapes. By leveraging the semantics preserving adversarial learning, we simultaneously reduce holistic divergences between data distributions from two modalities, and enhance the semantic correlations. As a consequence, our method achieves better retrieval performance. For instance, the mAP of our method reaches 0.813, which is 34.4%, 13.9% and 6.1% higher than those of Siamese, DCML and LWBR, respectively.

Retrieval Performance on SHREC 2014. On this dataset, we compare our proposed model to the following state-of-the-art methods: the BoF with Gabor local line based feature (BF-fGALIF) [2], CDMR [3], SBR-VC [10], depth-buffered vector of locally aggregated tensors (DB-VLAT) [20], SCMR-OPHOG [14], BOF junction-based extended shape context (BOFJESC) [14], Siamese [22], DCML [1], and LWBR [24].

Table II: Retrieval performance on the SHREC 2014 dataset.

| Methods | NN | FT | ST | E | DCG | mAP |
| --- | --- | --- | --- | --- | --- | --- |
| CDMR [3] | 0.109 | 0.057 | 0.089 | 0.041 | 0.328 | 0.054 |
| SBR-VC [10] | 0.095 | 0.050 | 0.081 | 0.037 | 0.319 | 0.050 |
| DB-VLAT [20] | 0.160 | 0.115 | 0.170 | 0.079 | 0.376 | 0.131 |
| CAT-DTW [26] | 0.137 | 0.068 | 0.102 | 0.050 | 0.338 | 0.060 |
| Siamese [22] | 0.239 | 0.212 | 0.316 | 0.140 | 0.496 | 0.228 |
| DCML [1] | 0.272 | 0.275 | 0.345 | 0.171 | 0.498 | 0.286 |
| LWBR [24] | 0.403 | 0.378 | 0.455 | 0.236 | 0.581 | 0.401 |
| Baseline (sepIAML) | 0.016 | 0.016 | 0.023 | 0.005 | 0.263 | 0.028 |
| DCA (CMDM-AL) | 0.745 | 0.766 | 0.808 | 0.392 | 0.845 | 0.782 |
| DCA (CMDM-AL+SeP) | 0.770 | 0.789 | 0.823 | 0.398 | 0.859 | 0.803 |

Fig. 5 provides precision-recall curves for BF-fGALIF, CDMR, SBR-VC, SCMR-OPHOG, OPHOG, DCML, LWBR and the proposed model. As shown, the precision of our proposed method is remarkably higher than those of the compared approaches when the recall is less than 0.8.

Besides the precision-recall curves, we additionally report NN, FT, ST, E, DCG and mAP of CDMR, SBR-VC, DB-VLAT, Siamese, DCML and LWBR in Table II. As can be seen, the performance of existing deep metric learning based methods, including Siamese, DCML and LWBR, drops sharply on SHREC 2014. For example, the mAP of LWBR on SHREC 2014 is 0.401, around 35% lower than the mAP of 0.752 that it achieved on SHREC 2013. The reason might be that SHREC 2014 has many more classes (171 versus 90 on SHREC 2013) and many more 3D shapes (8,987 versus 1,258 on SHREC 2013) with more severe intra-class and cross-modality variations, making SHREC 2014 more challenging than SHREC 2013. In comparison, the mAP of our proposed model drops by only about 1%, and reaches 0.803 on SHREC 2014. This result is 40.2%, 51.7% and 57.5% higher than those of LWBR, DCML and Siamese, respectively, indicating that our method is much more scalable than existing deep models.

## V Conclusions

In this paper, we proposed a novel cross-modality adaptation model for sketch-based 3D shape retrieval. We first learnt modality-specific discriminative features for 2D sketches and 3D shapes, by employing importance-aware metric learning that mines the batch-wise hardest samples. To remove the cross-modality discrepancy, we proposed a transformation network that transfers the features of sketches into the feature space of 3D shapes. We developed an adversarial learning based method for training the network, by enhancing correlations between holistic data distributions and preserving local semantic structures across modalities. As a consequence, we obtained discriminative transformed features of sketches that were also highly correlated with the features of 3D shapes. Extensive experimental results on two benchmark datasets demonstrated the superiority of the proposed method, compared to the state-of-the-art approaches.

## References

- [1] G. Dai, J. Xie, F. Zhu, and Y. Fang. Deep correlated metric learning for sketch-based 3d shape retrieval. In AAAI, pages 4002–4008, 2017.
- [2] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM Trans. Graph., 31(4):31–1, 2012.
- [3] T. Furuya and R. Ohbuchi. Ranking on cross-domain manifold for sketch-based 3d model retrieval. In Cyberworlds (CW), 2013 International Conference on, pages 274–281. IEEE, 2013.
- [4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
- [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [7] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
- [8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- [9] B. Li, A. Godil, M. Aono, X. Bai, T. Furuya, L. Li, R. J. López-Sastre, H. Johan, R. Ohbuchi, C. Redondo-Cabrera, et al. Shrec’12 track: Generic 3d shape retrieval. 3DOR, 6, 2012.
- [10] B. Li, Y. Lu, A. Godil, T. Schreck, M. Aono, H. Johan, J. M. Saavedra, and S. Tashiro. SHREC'13 track: large scale sketch-based 3D shape retrieval. 2013.
- [11] B. Li, Y. Lu, A. Godil, T. Schreck, B. Bustos, A. Ferreira, T. Furuya, M. J. Fonseca, H. Johan, T. Matsuda, et al. A comparison of methods for sketch-based 3d shape retrieval. Computer Vision and Image Understanding, 119:57–80, 2014.
- [12] B. Li, Y. Lu, H. Johan, and R. Fares. Sketch-based 3d model retrieval utilizing adaptive view clustering and semantic information. Multimedia Tools and Applications, 76(24):26603–26631, 2017.
- [13] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, Q. Chen, N. K. Chowdhury, B. Fang, et al. A comparison of 3d shape retrieval methods based on a large-scale benchmark supporting multimodal queries. Computer Vision and Image Understanding, 131:1–27, 2015.
- [14] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, H. Fu, T. Furuya, H. Johan, et al. Shrec'14 track: Extended large scale sketch-based 3d shape retrieval. In Eurographics workshop on 3D object retrieval, volume 2014, 2014.
- [15] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6673–6683, 2017.
- [16] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The princeton shape benchmark. In Shape modeling applications, 2004. Proceedings, pages 167–178. IEEE, 2004.
- [17] P. Sousa and M. J. Fonseca. Sketch-based retrieval of drawings using spatial proximity. Journal of Visual Languages & Computing, 21(2):69–80, 2010.
- [18] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
- [19] H. Tabia and H. Laga. Learning shape retrieval from different modalities. Neurocomputing, 253:24–33, 2017.
- [20] A. Tatsuma, H. Koyanagi, and M. Aono. A large-scale shape benchmark for 3d object retrieval: Toyohashi shape benchmark. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1–10. IEEE, 2012.
- [21] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
- [22] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1875–1883. IEEE, 2015.
- [23] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
- [24] J. Xie, G. Dai, F. Zhu, and Y. Fang. Learning barycentric representations of 3d shapes for sketch-based 3d shape retrieval. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3615–3623, July 2017.
- [25] J. Xie, G. Dai, F. Zhu, E. K. Wong, and Y. Fang. Deepshape: deep-learned shape descriptor for 3d shape retrieval. IEEE transactions on pattern analysis and machine intelligence, 39(7):1335–1345, 2017.
- [26] Z. Yasseen, A. Verroust-Blondet, and A. Nasri. View selection for sketch-based 3d model retrieval using visual part shape description. The Visual Computer, 33(5):565–583, 2017.
- [27] G.-J. Yoon and S. M. Yoon. Sketch-based 3d object recognition from locally optimized sparse features. Neurocomputing, 267:556–563, 2017.
- [28] Y. Zhang, R. Barzilay, and T. Jaakkola. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188, 2017.
- [29] F. Zhu, J. Xie, and Y. Fang. Heat diffusion long-short term memory learning for 3d shape analysis. In European Conference on Computer Vision, pages 305–321. Springer, 2016.
- [30] F. Zhu, J. Xie, and Y. Fang. Learning cross-domain neural networks for sketch-based 3d shape retrieval. In AAAI, pages 3683–3689, 2016.