Bi-Directional Domain Translation for Zero-Shot Sketch-Based Image Retrieval

Jiangtong Li
Shanghai Jiao Tong University
Zhixin Ling
Shanghai Jiao Tong University
Li Niu
Shanghai Jiao Tong University
Liqing Zhang
Shanghai Jiao Tong University

The goal of Sketch-Based Image Retrieval (SBIR) is to use free-hand sketches to retrieve images of the same category from a natural image gallery. However, SBIR requires all categories to be seen during training, which cannot be guaranteed in real-world applications. We therefore investigate the more challenging Zero-Shot SBIR (ZS-SBIR) setting, in which test categories do not appear in the training stage. Traditional SBIR methods tend to perform category-based retrieval and cannot generalize well from seen categories to unseen ones. In contrast, we disentangle image features into structure features and appearance features to facilitate structure-based retrieval. To assist feature disentanglement and take full advantage of the disentangled information, we propose a Bi-directional Domain Translation (BDT) framework for ZS-SBIR, in which the image domain and the sketch domain can be translated to each other through disentangled structure and appearance features. Finally, we perform retrieval in both the structure feature space and the image feature space. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art approaches in P@200 by about 8% on the Sketchy dataset and over 5% on the TU-Berlin dataset.

1 Introduction

In recent years, with the rapid growth of multimedia data on the internet, image retrieval has come to play an increasingly important role in many fields, such as remote sensing and e-commerce. Since sketches can be drawn easily and reveal the characteristics of the target images, sketch-based image retrieval (SBIR), which uses a sketch to retrieve images of the same category, has become widely accepted among users. SBIR has therefore attracted widespread attention in the research community [9, 3, 14, 15, 2, 21, 68, 20, 1, 46, 24, 58, 50, 34, 63, 48, 51, 40]. In the conventional setting, it is assumed that the images and sketches in the training and test sets share the same set of categories. However, in real-world applications, the categories of test sketches/images may fall outside the scope of the training categories.

Figure 1: For both seen categories and unseen categories, we visualize the feature of a query sketch (red star) and image features from different categories (points of different colors) obtained by SBIR method SaN [64], together with the query sketch and an image from the same category.

In this paper, we focus on a more challenging task called zero-shot sketch-based image retrieval (ZS-SBIR) [52], which assumes that test categories do not appear in the training stage. In the remainder of this paper, we refer to training (resp., test) categories as seen (resp., unseen) categories [11]. Traditional SBIR methods suffer a sharp performance drop in the ZS-SBIR setting [62], probably because they are prone to learning category-based retrieval. Specifically, based on the analysis in [62], since the evaluation methodology is category-based, traditional SBIR methods may take a shortcut by correlating sketches/images with their category labels and retrieving the images from the same category as the query sketch, which is very effective when test data share the same categories as training data. However, SBIR methods often fail when the test categories are not present in the training stage. As illustrated in Figure 1, based on the pairwise distance in feature space, the SBIR method SaN [64] succeeds in retrieving the images from a seen category “giraffe” when given a query “giraffe” sketch, but fails on an unseen category “church”. We conjecture that to generalize well from seen categories to unseen categories, a model should learn to align the structure information (e.g., outline, shape) of sketches with the corresponding structure information of images (e.g., the structure of the church spire in Figure 1). We refer to this as structure-based retrieval, which is ignored by traditional SBIR methods like [64].

Existing ZS-SBIR methods can be categorized into three groups: (1) using a generative model based on aligned sketch-image pairs (a sketch is drawn based on a given image and thus has roughly the same outline as this image) to reduce the gap between seen and unseen categories [62]; (2) employing semantic information to reduce the intra-class variance in sketches to stabilize the training process [60, 59, 12, 52]; (3) fine-tuning the pre-trained model in ZS-SBIR task with semantic-aware knowledge preservation to prevent catastrophic forgetting [37]. However, the aligned sketch-image pairs and semantic information are not always available. Moreover, most of the above methods did not achieve the goal of structure-based retrieval. The method in [62] made an attempt at structure-based retrieval but did not explicitly extract structure information from images. In terms of the extraction of image structure information, some prior works relied on sketch tokens, which are obtained by extracting the outlines of images [36, 58, 63]. However, the sketch tokens obtained in this way are not very reliable due to the noisy and redundant information, which significantly limits the performance of these methods.

In this work, to facilitate structure-based retrieval, we disentangle image features into structure features and appearance features, where the former encode the structure information (i.e., outline, shape) and the latter encode the additional detailed information (i.e., color, texture). To assist feature disentanglement and take full advantage of the disentangled information, we propose a Bi-directional Domain Translation (BDT) framework, where sketches and images are deemed as two domains. As shown in Figure 2, we first use a pre-trained model to extract features from sketches (resp., images), which are dubbed sketch (resp., image) features. Then, the image features are disentangled into structure features and appearance features, while the sketch features are also projected to the shared structure feature space. Furthermore, bi-directional domain translation is performed through the structure features and appearance features. Concretely, for image-to-sketch translation, we project image features to structure features and then generate sketch features. For sketch-to-image translation, we project sketch features to structure features, which are combined with variational appearance features to compensate for the uncertainty when we generate image features from sketch features.

Finally, we perform retrieval in both structure feature space and image feature space, to combine the best of two worlds. The effectiveness of our proposed BDT framework is verified by comprehensive experimental results on two benchmark datasets. Our main contributions are summarized as follows:

  • To the best of our knowledge, we are the first to disentangle image features into structure features and appearance features to facilitate structure-based retrieval.

  • We propose a bi-directional domain translation framework for the zero-shot sketch-based image retrieval task.

  • Comprehensive results on two popular large-scale datasets show that our framework significantly outperforms the state-of-the-art methods.

2 Related Work

2.1 SBIR and ZS-SBIR

The main goal of sketch-based image retrieval (SBIR) is to bridge the gap between the image domain and the sketch domain. Previous SBIR methods can be categorized into hand-crafted feature based methods and deep learning based methods. Before deep learning was introduced to this task, hand-crafted feature based methods generally extracted edge maps from natural images and then matched them with sketches using hand-crafted features [50, 20, 15, 21, 14]. In recent years, deep learning based methods have become popular in this area. To reduce the gap between the image domain and the sketch domain, variants of the siamese network [48, 51, 56] and the ranking loss [8, 51] were adapted to this task. Besides, semantic information and adversarial loss were also introduced to preserve domain-invariant information [4].

Zero-shot sketch-based image retrieval (ZS-SBIR) was proposed by [52] and then followed by [62, 60, 59, 37, 12]. To reduce the intra-class variance in sketches and stabilize the training process, semantic information was leveraged in [59, 52, 60, 12]. To reduce the gap between seen and unseen categories, a generative model trained on aligned data pairs was proposed in [62]. To adapt the pre-trained model to ZS-SBIR without forgetting the knowledge of ImageNet [10], a semantic-aware knowledge preservation mechanism was used in [37]. However, none of the above methods attempted to disentangle images into structure information and appearance information, which is explored in this work.

2.2 Disentangled Representation

Disentangled representation learning aims to divide the latent representation into multiple units, with each unit corresponding to one latent factor (e.g., position, scale, identity). Each unit is only affected by its corresponding latent factor, but not influenced by other latent factors. Disentangled representations are more generalizable and semantically meaningful, and thus useful for a variety of tasks.

Disentangled representation learning methods can be categorized into unsupervised methods and supervised methods according to whether supervision for latent factors is available. For unsupervised disentanglement, many methods have been developed, including InfoGAN [6], MTAN [38], β-VAE [19], JointVAE [11], FactorVAE [26], InfoVAE [66] and TCVAE [5]. Most of them encourage statistical independence across different dimensions of the latent representation while maintaining the mutual information between input data and latent representations. For supervised disentanglement, Kingma et al. [30] used disentangled representation to enhance semi-supervised learning. Zheng et al. [67] proposed DG-Net to integrate discriminative and generative learning using disentangled representation. Besides, supervised disentanglement has been applied to different tasks, like person re-id [67], face recognition [35, 39, 53, 57], and image generation [41, 61, 43, 25]. Our work is the first to apply disentangled representation learning to the sketch-based image retrieval task.

2.3 Domain Translation

Many domain translation approaches, like Pix2Pix [23], CycleGAN [69], BiCycleGAN [70], StarGAN [7], and DiscoGAN [27], have been proposed to translate between two domains (e.g., the sketch domain and the image domain). In this subsection, we mainly discuss the domain translation methods [32, 33, 22, 17] based on disentangled representation. Overall, they disentangle the latent representation into a domain-specific representation and a domain-invariant representation. In our problem, structure (resp., appearance) features can be treated as the domain-invariant (resp., domain-specific) representation. The translation between two domains in previous works [32, 33, 22, 17] is generally symmetric. In contrast, the translation between the sketch domain and the image domain is asymmetric because the image domain has an additional domain-specific representation compared with the sketch domain.

Figure 2: An overview of our framework. We first adopt VGG-16 [54] to extract features from image and sketch. Then we disentangle image feature into appearance feature and structure feature, through which bi-directional domain translation is performed between image feature space and sketch feature space.

3 Methodology

In this section, we introduce our proposed Bi-directional Domain Translation (BDT) framework for zero-shot sketch-based image retrieval. In Sec 3.1, we state the problem definition. In Sec 3.2, we elaborate disentangled representation and bi-directional domain translation in detail. In Sec 3.3, we discuss the strategy during training and retrieval.

3.1 Problem Definition

In this paper, we focus on sketch-based image retrieval under zero-shot setting, where only the sketches and images from seen categories are used in the training stage. In the test stage, our proposed framework is expected to use the sketches to retrieve the images, the categories of which are unseen during training.

Formally, we are given a sketch dataset and an image dataset, each consisting of samples paired with category labels drawn from a common category label set. Following the zero-shot setting in [62, 59], we split all categories into a seen set and an unseen set with no overlap between the two label sets. Based on this partition of the label set, we split the sketch (resp., image) dataset into a training subset and a test subset, which cover the seen and unseen categories respectively. In the training stage, our model can only process the training sketches and training images. During testing, given a sketch from the test sketch subset, our model needs to retrieve the images belonging to the same category as the query sketch from the test image gallery.
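The partition described above can be sketched in a few lines. This is only an illustration: the function name `split_zero_shot`, the toy file names, and the choice of which toy category is unseen are assumptions made for the example, not part of the original setup.

```python
def split_zero_shot(samples, unseen_categories):
    """Split (item, label) pairs into a training part (seen labels)
    and a test part (unseen labels)."""
    unseen = set(unseen_categories)
    train = [(item, label) for item, label in samples if label not in unseen]
    test = [(item, label) for item, label in samples if label in unseen]
    return train, test


# Toy data; names and category assignment are illustrative only.
sketches = [("sketch_0", "giraffe"), ("sketch_1", "church"), ("sketch_2", "cat")]
images = [("image_0", "giraffe"), ("image_1", "church"),
          ("image_2", "cat"), ("image_3", "church")]
unseen_categories = {"church"}

train_sketches, test_sketches = split_zero_shot(sketches, unseen_categories)
train_images, test_images = split_zero_shot(images, unseen_categories)

# The label sets used for training and testing must not overlap.
seen_labels = {label for _, label in train_sketches + train_images}
test_labels = {label for _, label in test_sketches + test_images}
print(sorted(seen_labels), sorted(test_labels))
```

Only the training parts would be visible to the model; the test sketches act as queries against the test image gallery.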

The overall framework of our method is illustrated in Figure 2. We input a triplet containing a sketch and an image from the same category (a positive pair) and another image from a different category (a negative image). First, a pre-trained model extracts a sketch feature and two image features from the triplet. Then, each image feature is disentangled into an appearance feature and a structure feature, while the sketch feature is projected to a sketch structure feature. We employ a ranking loss on the structure features of the triplet as well as an orthogonal loss on the appearance and structure features to disentangle appearance features from structure features. Furthermore, we use the structure feature of the positive image to reconstruct the sketch feature with a reconstruction loss and an adversarial loss, because the sketch and the positive image belong to the same category. Similarly, we can use the sketch structure feature along with the image appearance feature to reconstruct the image feature. To support stochastic sampling in the test stage, we additionally infer variational appearance features, which are combined with the sketch structure features to reconstruct the image features. In the test stage, given an image (resp., sketch), we can obtain its structure feature as well as its reconstructed (resp., generated) image feature, so that an image and a sketch can be compared in both the structure feature space and the image feature space.

3.2 Our Framework

3.2.1 Feature Extractor

Since sketches are highly abstract and visually sparse compared with natural images, it is hard to obtain adequate information from sketches when using a pre-trained model as feature extractor. To tackle this problem without using more parameters, we adopt the fusion strategy in [59] to concatenate the features extracted from multiple layers of the pre-trained model for both images and sketches.

In detail, we first use a pre-trained backbone model, i.e., VGG-16 pre-trained on ImageNet [10], to process each sketch and image. The final feature is obtained by concatenating the output of the last fully connected layer with the global average pooling (GAP) of the outputs of the selected convolution layers.
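A minimal sketch of this fusion step, in plain Python, is given below. The choice of convolution layers is an assumption: taking one layer from each of VGG-16's five blocks (64, 128, 256, 512, and 512 channels) together with a 4096-dimensional fully connected output happens to reproduce the 5568-dimensional feature reported in Sec. 4.1.2, but the text does not state which layers are used. The helper names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average a C x H x W feature map (nested lists) over its spatial
    positions, giving one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fuse_features(conv_maps, fc_feature):
    """Concatenate the GAP of every selected conv output with the FC output."""
    fused = []
    for feature_map in conv_maps:
        fused.extend(global_average_pool(feature_map))
    fused.extend(fc_feature)
    return fused


# Assumed layer selection: one conv output per VGG-16 block.
ASSUMED_CONV_CHANNELS = [64, 128, 256, 512, 512]
FC_DIM = 4096

# Dummy 2x2 feature maps stand in for real activations.
conv_maps = [[[[1.0, 3.0], [5.0, 7.0]] for _ in range(channels)]
             for channels in ASSUMED_CONV_CHANNELS]
fc_feature = [0.0] * FC_DIM

fused = fuse_features(conv_maps, fc_feature)
print(len(fused))  # 64 + 128 + 256 + 512 + 512 + 4096 = 5568
```

Because GAP removes the spatial dimensions, layers with different resolutions can be concatenated without adding any trainable parameters.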


3.2.2 Disentangled Representation

To achieve the goal of structure-based retrieval, we aim to disentangle structure information from the image feature. Given an image feature, we adopt two image encoders to disentangle it into an image structure feature and an image appearance feature. Besides, to project the sketch feature to the same structure feature space as the image structure feature, a sketch encoder is adopted to obtain the sketch structure feature.


In each training step, apart from sampling a positive sketch-image pair of the same category, we also sample a negative image, which belongs to another category. Therefore, a triplet (sketch, positive image, negative image) is fed into the network. We expect that the structure features of images and sketches are in the same feature space. Moreover, in the structure feature space shared by sketch and image, we expect intra-class coherence and inter-class separability across different domains (i.e., sketch domain and image domain). Specifically, we expect to pull sketches close to the images of the same category and push sketches apart from the images of a different category. For this purpose, we employ a margin-based ranking loss on the distances between structure features, in which the margin is empirically set as 10.0 in our experiments.
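One common form of such a ranking loss is the triplet hinge loss, sketched below. The margin of 10.0 comes from the text; the Euclidean distance is an assumption, since the distance type is not recoverable here, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ranking_loss(sketch_struct, pos_image_struct, neg_image_struct,
                 margin=10.0, dist=euclidean):
    """Hinge loss that is zero once the negative image is at least `margin`
    farther from the sketch than the positive image."""
    d_pos = dist(sketch_struct, pos_image_struct)
    d_neg = dist(sketch_struct, neg_image_struct)
    return max(0.0, margin + d_pos - d_neg)


sketch = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the sketch
far_negative = [0.0, 20.0]   # distance 20: margin satisfied
near_negative = [6.0, 8.0]   # distance 10: margin violated by 5

print(ranking_loss(sketch, positive, far_negative))
print(ranking_loss(sketch, positive, near_negative))
```

Passing a different `dist` function swaps the metric without touching the hinge logic.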

After enforcing the structure features of images to share the same structure feature space as sketches, we further expect that the appearance features of images only contain information complementary to the structure features (e.g., color, texture). To ensure that the image features are disentangled into the structure feature space and the appearance feature space, we impose an orthogonal constraint on the structure features and appearance features of images based on their cosine similarity, i.e., the dot product between the two vectors divided by the product of their norms. Note that both features are the output of a ReLU activation, so their cosine similarity is always non-negative, and minimizing this orthogonal loss (Eq. (4)) pushes it towards zero.
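The orthogonal constraint can be illustrated with a short sketch that uses the cosine similarity itself as the loss value. The function names are hypothetical, and returning 0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def orthogonal_loss(structure_feat, appearance_feat):
    """Cosine similarity between the two features. For ReLU outputs
    (non-negative entries) it lies in [0, 1], so minimizing it drives
    the two features towards orthogonality."""
    return cosine_similarity(structure_feat, appearance_feat)


# Disjoint active dimensions: already orthogonal, loss is 0.
print(orthogonal_loss([1.0, 0.0, 2.0], [0.0, 3.0, 0.0]))
# Identical direction: loss is (numerically) 1.
print(orthogonal_loss([1.0, 1.0], [2.0, 2.0]))
```

For non-negative vectors, zero loss means the two features never activate the same dimension, which matches the intent that they carry complementary information.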

3.2.3 Bi-directional Domain Translation

To further help learn disentangled representations and fully utilize the disentangled image features, we perform bi-directional domain translation between sketch domain and image domain.

For image-to-sketch translation, we employ a decoder to reconstruct the sketch feature from the structure feature of the paired image, considering that the sketch and the image belong to the same category. We adopt a reconstruction loss between the reconstructed and the real sketch features. Furthermore, we employ an adversarial loss to guarantee that the distribution of generated sketch features is close to that of real sketch features. The adversarial loss is implemented with a sketch discriminator, which distinguishes generated sketch features from real ones. Thus, the total loss of image-to-sketch translation combines the reconstruction loss and the adversarial loss.


For sketch-to-image translation, we aim to use the sketch structure features to reconstruct image features from the same category. However, images contain extra appearance information (e.g., color, texture) compared with sketches, so it is necessary to compensate for the appearance uncertainty when translating from structure features to image features. Therefore, image appearance features should be integrated with sketch structure features to reconstruct image features.

In the test stage, given a sketch, we also hope to generate its imaginary image feature to enable retrieval in the image feature space. Nevertheless, we do not have the corresponding image appearance features in this case. One commonly used solution is stochastic sampling during testing. We introduce a variational estimator to approximate a variational Gaussian distribution over appearance features. Then, we use the Kullback-Leibler divergence to force this distribution to be close to a prior distribution.

After using the reparameterization trick [29] to sample a variational appearance feature, i.e., the estimated mean plus the estimated standard deviation multiplied element-wise by noise sampled from a standard Gaussian, we employ a decoder to reconstruct the image feature from the concatenation of the sketch structure feature and the variational appearance feature. On the reconstructed image feature, we employ a reconstruction loss and an adversarial loss implemented with an image discriminator, which distinguishes generated image features from real ones, leading to the sketch-to-image translation loss.
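The two variational ingredients above can be sketched as follows. Two assumptions are made for illustration: the prior is a standard normal distribution, and the estimator outputs a mean and a log-variance per dimension (the text only says mean and variance). The function names are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample mu + sigma * eps with eps ~ N(0, I), so that the sample is a
    deterministic function of (mu, log_var) given the noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)

# A distribution equal to the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean by 1 in one dimension costs 0.5.
print(kl_to_standard_normal([1.0], [0.0]))
# One stochastic appearance sample around the estimated mean.
print(reparameterize([1.0, -2.0], [0.0, 0.0], rng))
```

At test time no image is available for a query sketch, so appearance features would instead be drawn directly from the prior, which is why the KL term matters.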


By performing image-to-sketch translation, we expect that the image structure features contain the necessary structure information to reconstruct the sketch features of the same category. By performing sketch-to-image translation, we expect that the image appearance features contain the necessary appearance information to compensate for the sketch structure features when reconstructing image features. Therefore, bi-directional domain translation could cooperate with ranking loss and orthogonal loss to assist feature disentanglement.

Finally, recall that the image (resp., sketch) discriminator is trained to distinguish the generated image (resp., sketch) features from the real ones, and the loss functions for the discriminators are defined accordingly.


3.3 Training and Retrieval

The full objective function can be divided into the generation loss and the discrimination loss. The two trade-off weights in the objective are empirically set as 0.5 and 2.0 respectively. Since our model consists of generators and discriminators, we follow the training strategy in GAN [18] to stabilize the training process: the generators and discriminators are updated alternately, each for a fixed number of iterations, to minimize the generation loss and the discrimination loss respectively.
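The alternating update can be sketched as a simple schedule. The iteration counts of 100 (generator) and 50 (discriminator) are the ones reported in Sec. 4.1.2; starting with the generator and alternating in contiguous blocks are assumptions made for this illustration.

```python
def alternating_schedule(total_steps, gen_iters=100, disc_iters=50):
    """Yield 'G' or 'D' for each step: gen_iters generator updates followed
    by disc_iters discriminator updates, repeated until total_steps."""
    cycle = gen_iters + disc_iters
    for step in range(total_steps):
        yield "G" if step % cycle < gen_iters else "D"


schedule = list(alternating_schedule(300))
print(schedule.count("G"), schedule.count("D"))

# In a real training loop each entry would select which loss to minimize:
#   'G' -> update encoders/decoders on the generation loss
#   'D' -> update discriminators on the discrimination loss
```

Keeping one side fixed while the other is updated is the usual way to avoid the two objectives chasing each other within a single step.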

In the test stage, we perform retrieval in both the structure feature space and the image feature space. Specifically, given a sketch and an image, we compare them in both feature spaces.

1) Structure feature space: We project the image feature and the sketch feature into the shared structure feature space with the image structure encoder and the sketch encoder respectively. Then, we calculate the cosine distance between the two structure features.

2) Image feature space: Based on the sketch structure feature and a variational appearance feature sampled from the prior distribution, we can employ the decoder to generate an image feature. We generate multiple image feature vectors by repeated sampling and average them to obtain the final generated image feature. Then, we calculate the cosine distance between this averaged feature and the feature of the gallery image.

Finally, we calculate the weighted average of the two distances for retrieval, where a hyper-parameter balances the two feature spaces.
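The retrieval procedure above can be sketched end to end on toy vectors. Several details are assumptions: the weight value of 0.5 is a placeholder (the default is not recoverable here), the weight is arbitrarily attached to the structure-space distance, and all names and feature values are made up for the example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_vector(vectors):
    """Element-wise average of several generated image features."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fused_distance(d_structure, d_image, weight):
    """Weighted average of the two distances (weight placement is assumed)."""
    return weight * d_structure + (1.0 - weight) * d_image


def rank_gallery(query, gallery, weight=0.5):
    """query: (sketch structure feature, list of generated image features).
    gallery: name -> (image structure feature, image feature).
    Returns gallery names sorted from most to least similar."""
    q_struct, q_generated = query
    q_image = mean_vector(q_generated)
    scores = {
        name: fused_distance(cosine_distance(q_struct, g_struct),
                             cosine_distance(q_image, g_image), weight)
        for name, (g_struct, g_image) in gallery.items()
    }
    return sorted(scores, key=scores.get)


query = ([1.0, 0.0], [[1.0, 0.2], [1.0, -0.2]])  # two sampled image features
gallery = {
    "a": ([1.0, 0.0], [1.0, 0.0]),
    "b": ([0.0, 1.0], [0.0, 1.0]),
    "c": ([1.0, 1.0], [1.0, 0.0]),
}
ranking = rank_gallery(query, gallery)
print(ranking)
```

Setting `weight` to 1.0 or 0.0 reduces the ranking to a single feature space, which corresponds to the BDT-St and BDT-Im variants discussed in the experiments.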

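| Type | Method | Sketchy Ext. (aligned) P@200 (%) | Sketchy Ext. (aligned) mAP@200 (%) | Sketchy Ext. (unaligned) P@200 (%) | Sketchy Ext. (unaligned) mAP@200 (%) | TU-Berlin Ext. P@200 (%) | TU-Berlin Ext. mAP@200 (%) |
|---|---|---|---|---|---|---|---|
| SBIR | Cosine | 9.0 | 5.1 | 9.0 | 5.1 | 4.6 | 2.0 |
| SBIR | 3D shape [58] | 6.1 | 1.0 | 7.0 | 1.8 | 3.6 | 0.5 |
| SBIR | SaN [64] | 15.3 | 5.8 | 18.9 | 8.5 | 10.1 | 4.2 |
| SBIR | Siamese [48] | 24.4 | 14.6 | 25.6 | 15.3 | 8.3 | 3.7 |
| ZSL | ESZSL [49] | 16.0 | 8.3 | 17.2 | 9.5 | 4.8 | 1.7 |
| ZSL | SAE [31] | 24.4 | 14.6 | 27.1 | 17.5 | 11.6 | 5.5 |
| ZSL | CMT [55] | 26.9 | 17.6 | 27.5 | 17.7 | 10.0 | 4.3 |
| ZSL | SSE [65] | 6.9 | 2.3 | 7.3 | 3.3 | 4.1 | 1.2 |
| ZSL | DeViSE [16] | 14.3 | 4.7 | 15.4 | 5.4 | 8.0 | 2.2 |
| ZS-SBIR | CVAE [62] | 33.4 | 22.6 | 31.2 | 19.9 | 10.2 | 4.9 |
| ZS-SBIR | SEM-PCYC [12] | 28.0 | 17.7 | 30.0 | 19.4 | 12.4 | 5.7 |
| ZS-SBIR | Xu et al. [60] | 20.4 | 12.0 | 20.8 | 12.6 | 7.4 | 2.9 |
| ZS-SBIR | BDT-St | 36.1 | 25.5 | 36.9 | 25.8 | 15.2 | 7.9 |
| ZS-SBIR | BDT-Im | 37.2 | 26.8 | 35.1 | 24.9 | 14.7 | 7.1 |
| ZS-SBIR | BDT | **41.2** | **29.9** | **39.7** | **28.1** | **17.6** | **10.2** |

Table 1: Comparison of our BDT method and baselines on Sketchy and TU-Berlin. Best results are denoted in boldface.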

4 Experiment

4.1 Experiment Setup

4.1.1 Dataset

We evaluate our BDT framework and baselines on two large-scale sketch-image datasets: TU-Berlin [13] and Sketchy [51] with extended images obtained from [36].

Sketchy (Extended) [51] originally comprises 75,479 sketches and 12,500 images from 125 categories, where the images and sketches are aligned pairs. Liu et al. [36] extended the image retrieval gallery by collecting an extra 60,502 images, so that the total number of images in extended Sketchy reaches 73,002. Following the standard zero-shot setting in [62], we partition the 125 categories into 104 seen categories and 21 unseen categories according to whether the category appears in the 1,000 categories of ImageNet [10], which avoids violating the zero-shot assumption when utilizing models pre-trained on ImageNet. In the training stage, there were previously two settings for how to utilize the training data: 1) use aligned pairs without extended training images [62], which is referred to as “aligned” in Table 1; 2) do not use the information of aligned pairs but use all training data including extended images [52], which is referred to as “unaligned” in Table 1.

TU-Berlin (Extended) [13] contains 250 categories with a total of 20,000 sketches, extended by [36] with 204,489 natural images based on the sketch categories. Following the same split criterion as Sketchy, we first split TU-Berlin into 165 seen categories and 85 unseen categories according to whether the category appears in the 1,000 categories of ImageNet [10]. As Shen et al. [52] suggest, we re-select the unseen categories with more than 400 images out of the 85 categories. In the end, there are 186 seen and 64 unseen categories (the detailed category split can be found in the Appendix). Compared with the Sketchy dataset, TU-Berlin is much more challenging because of more unseen categories and fewer training sketches.

4.1.2 Implementation Details

We implement our method and all the other baselines using PyTorch [47], and all of them are trained on one GTX 1080Ti GPU. We use a VGG-16 (pre-trained on the ImageNet dataset) to extract the image and sketch features. As mentioned in Sec. 3.2, we concatenate the output of multiple layers, leading to a 5568-dim vector for each image and sketch. For each encoder, we use two fully-connected (FC) layers with Batch Normalization and ReLU as activation. For the variational estimator, we use two individual FC layers to obtain the mean and variance of the approximated distribution separately. For each decoder, we use two FC layers with ReLU activation. For the discriminators, we use two FC layers with Batch Normalization and LeakyReLU as activation. The dimensionality of all structure and appearance features is 1024.

We use the Adam [28] optimizer for the bi-directional translation model and the SGD optimizer with momentum for the discriminators. The batch size for Sketchy (resp., TU-Berlin) is 128 (resp., 64) and the maximum number of training epochs is 30. The numbers of iterations for training the generator and the discriminator are 100 and 50 respectively. For the Sketchy dataset, we conduct experiments in both “unaligned” and “aligned” settings (see Table 1), whereas there is only the “unaligned” setting for the TU-Berlin dataset because TU-Berlin does not have aligned pairs. Following [62], we use mean average precision and precision over the top 200 retrievals (mAP@200 and P@200) as the evaluation metrics.
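For reference, the two metrics can be sketched as below on binary relevance lists (1 if a retrieved image shares the query's category, 0 otherwise). The normalization of average precision varies between papers; dividing by the number of hits within the top k is one common convention and is an assumption here, as are the function names.

```python
def precision_at_k(relevance, k=200):
    """Fraction of the top-k retrieved items that are relevant.
    Assumes at least k items were retrieved."""
    return sum(relevance[:k]) / k


def average_precision_at_k(relevance, k=200):
    """Mean of the precision values at each rank (<= k) where a relevant
    item appears; 0 if no relevant item is in the top k."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision_at_k(relevance_lists, k=200):
    """Average of AP@k over all query sketches."""
    return sum(average_precision_at_k(r, k) for r in relevance_lists) / len(relevance_lists)


# Toy ranking of five gallery images for one query, evaluated at k = 5.
relevance = [1, 0, 1, 1, 0]
print(precision_at_k(relevance, k=5))
print(average_precision_at_k(relevance, k=5))
```

P@200 only counts how many of the top results are correct, whereas mAP@200 also rewards placing the correct images earlier in the ranking.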

4.2 Comparison with Existing Methods

We compare our model with prior methods, which can be divided into three categories: sketch-based image retrieval (SBIR) baselines, zero-shot learning (ZSL) baselines, and zero-shot sketch-based image retrieval (ZS-SBIR) baselines. The SBIR baselines include Siamese [48], SaN [64], and 3D shape [58]. A cosine baseline is also added, which conducts nearest neighbor search based on 4096-dim VGG-16 [54] feature vectors. The ZSL baselines include ESZSL [49], SAE [31], CMT [55], SSE [65], and DeViSE [16]. The ZS-SBIR baselines include CVAE [62], SEM-PCYC [12], and Xu et al. [60]. For a fair comparison, we replace the backbone of all previous models with VGG-16, except for SaN, which designs a new backbone to extract sketch and image features. All the backbones are pre-trained on ImageNet. Note that our method does not rely on semantic information obtained from a large textual corpus (e.g., word vector [44] and WordNet [45]). To make a fair comparison, for those baselines which require additional semantic information, we remove the semantic information [12] or replace the semantic information with the average of image features within each category [58, 49, 31, 55, 65, 16, 60]. (The results of ZSIH [52] become much worse after using this strategy, and thus we omit its results in Table 1.) In fact, we have tried both “remove” and “replace” strategies for all these baselines where applicable, and select the better one for each baseline. Besides, we do not compare with the methods that fine-tune the pre-trained backbone during training, like SAKE [37] and EMS [40], because they learn four times more model parameters than ours.

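| Variant | Sketchy Ext. (aligned) P@200 (%) | Sketchy Ext. (aligned) mAP@200 (%) | Sketchy Ext. (unaligned) P@200 (%) | Sketchy Ext. (unaligned) mAP@200 (%) |
|---|---|---|---|---|
| w/o | 35.1 | 23.2 | 31.7 | 20.3 |
| w/o | 40.3 | 29.1 | 38.4 | 26.9 |
| w/o | 31.7 | 19.8 | 32.1 | 20.5 |
| w/o | 39.9 | 28.3 | 38.3 | 27.8 |
| w/o | 40.0 | 28.3 | 35.5 | 24.9 |
| w/o | 40.7 | 29.6 | 39.3 | 27.9 |
| alternative | 39.1 | 27.4 | 37.9 | 26.6 |
| w/o appearance | 37.2 | 26.0 | 36.5 | 25.9 |

Table 2: Ablation studies of our method on the Sketchy dataset.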

Based on Table 1, we can see that most of the SBIR and ZSL baselines underperform the ZS-SBIR baselines. Compared with Cosine, 3D shape [58] and SSE [65] perform even worse, which indicates that these methods heavily overfit to the seen categories. On the Sketchy dataset, we observe that the results in the “unaligned” setting are usually better than the corresponding results in the “aligned” setting, mainly because the amount of unaligned data is five times larger than that of aligned data. However, CVAE exhibits the opposite tendency because the aligned sketch-image pairs could help reconstruct images from their paired sketches. On the TU-Berlin dataset, the overall results are worse than those reported in previous works [37, 12] due to different seen/unseen splits. In particular, the number of unseen categories under our split is two times larger than that in [37, 12], and our split criterion also guarantees no information leak from ImageNet to unseen categories.

In terms of P@200, our proposed BDT outperforms the best prior method by 7.8 percentage points on the Sketchy (aligned) dataset, 8.5 points on the Sketchy (unaligned) dataset, and 5.2 points on the TU-Berlin dataset. To better understand our method, we also list our results when performing retrieval only in the image feature space or the structure feature space as BDT-Im and BDT-St, respectively. The comparison between BDT-Im and CVAE, as well as the comparison between BDT-St and Siamese, shows that the disentangled representations indeed help the model to generalize from seen to unseen categories. Besides, by comparing BDT with BDT-Im and BDT-St, we can see that the combination of the image feature space and the structure feature space boosts the performance by a large margin, which indicates the complementarity of the two feature spaces.
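The margins quoted above follow directly from the P@200 columns of Table 1, as the following check illustrates. Only a subset of baselines is copied here (it includes the best baseline on each dataset); the helper name is hypothetical.

```python
# P@200 (%) values copied from Table 1 for BDT and a subset of baselines.
p_at_200 = {
    "Sketchy (aligned)": {"BDT": 41.2, "CVAE": 33.4, "SEM-PCYC": 28.0, "SAE": 24.4},
    "Sketchy (unaligned)": {"BDT": 39.7, "CVAE": 31.2, "SEM-PCYC": 30.0, "SAE": 27.1},
    "TU-Berlin": {"BDT": 17.6, "CVAE": 10.2, "SEM-PCYC": 12.4, "SAE": 11.6},
}


def margin_over_best_baseline(scores, ours="BDT"):
    """Return the strongest baseline and our gain over it, in points."""
    baselines = {name: value for name, value in scores.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours] - baselines[best], 1)


margins = {dataset: margin_over_best_baseline(scores)
           for dataset, scores in p_at_200.items()}
for dataset, (best, gain) in margins.items():
    print(f"{dataset}: +{gain} points over {best}")
```

The strongest baseline differs by dataset (CVAE on Sketchy, SEM-PCYC on TU-Berlin), so each margin is measured against a different method.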

Figure 3: (a) The performance variation of our method when varying the hyper-parameter that weights the two retrieval distances, where Sk (a), Sk (u) and TU represent Sketchy (aligned), Sketchy (unaligned) and TU-Berlin respectively. (b) The performance and orthogonality variation of our method along with the training epoch.
Figure 4: The top-5 images retrieved by BDT, BDT-St, BDT-Im, CVAE methods on Sketchy test set. The green (resp., red) border indicates the correct (resp., incorrect) retrieval results.

4.3 Ablation Study

Taking the Sketchy dataset as an example, we analyze the effect of different loss functions and alternative model designs, as well as the effect of the hyper-parameter that balances the two feature spaces during retrieval.

Study on loss terms: We ablate each loss term in Eqs. (3), (4), (5) and (7), and report the results in Table 2. As expected, the ranking loss and the image reconstruction loss are the most important losses, because these two losses mainly control the image-sketch distance in their corresponding feature spaces. Besides, the image reconstruction loss has a larger impact in the “aligned” setting than in the “unaligned” setting, which implies that the reconstruction loss is sensitive to the pose variance in unaligned data. In contrast, the image adversarial loss has a larger impact in the “unaligned” setting, which shows that the adversarial loss can enhance the robustness of our model in the “unaligned” setting.

Study on alternative model designs: In the last two rows of Table 2, we report the results of two alternative designs: (1) move the orthogonal loss to a different pair of features; (2) directly translate from the sketch structure feature to the image feature without using the image appearance feature. We observe a performance drop in both cases, which demonstrates that we have placed the orthogonal loss at the proper position, and that the appearance compensation is crucial for generating image features.

Study on retrieval strategy: In Figure 3a, we plot P@200 against the hyper-parameter that balances the two feature spaces. It can be seen that our method generally achieves competitive results when the hyper-parameter is set in a proper range.

4.4 Disentanglement Analysis

To demonstrate the ability of our model to disentangle the image features, we first plot orthogonal loss and P@200 along with the training epoch in Figure 3b. It can be seen that the orthogonal loss decreases as P@200 increases, which indicates that our method benefits from the disentanglement of image features.

Figure 5: The t-SNE visualization of six types of features on Sketchy test set. Best viewed in color.

Then, in Figure 5, we visualize six types of features from 10 randomly selected unseen categories using t-SNE [42]: image features, image appearance features, image structure features, sketch features, sketch structure features, and sketch-translated image features. According to Figure 5, we have the following observations: 1) Different categories can be separated very well in “image structure” and “sketch structure”, which significantly facilitates the retrieval in the structure feature space; 2) The results in “image structure” and “image appearance” are complementary, in accordance with the disentanglement between structure features and appearance features; 3) The results in “image features” and “translated image features” are similar, which shows the effectiveness of image feature reconstruction; 4) The results in “sketch features” show the relatively poor separability of sketch features, which makes the sketch feature space ill-suited for image retrieval.

4.5 Case Study

In Figure 4, we show the retrieval results of BDT, BDT-St, BDT-Im, and CVAE [62]. One interesting observation is that BDT-St captures the correspondence of local structure information, while BDT-Im behaves like CVAE and focuses on global pose/structure information. For example, given a “door” sketch, the images retrieved by both CVAE and BDT-Im have a global grid structure similar to the given sketch, whereas BDT-St captures the correspondence between the retrieved images and the given sketch w.r.t. certain local structure information such as the door-case. One possible explanation is that the structure features are trained by aligning different domains into a shared space, whereas the reconstructed image features are trained by aligning the sketch features to image features; this makes the structure features more flexible and more tolerant of the difference between image and sketch. Moreover, BDT combines the strengths of both BDT-St and BDT-Im, producing better retrieval results.

5 Conclusion

We have studied zero-shot sketch-based image retrieval (ZS-SBIR) from a new viewpoint, i.e., using disentangled representation to facilitate structure-based retrieval. We have proposed our Bi-directional Domain Translation (BDT) framework, which performs retrieval in two feature spaces. Comprehensive experiments on Sketchy (aligned/unaligned) and TU-Berlin datasets have demonstrated the generalization ability of our framework from seen categories to unseen categories.


  • [1] X. Cao, H. Zhang, S. Liu, X. Guo, and L. Lin (2013) Sym-fish: a symmetry-aware flip invariant sketch histogram shape descriptor. In ICCV, Cited by: §1.
  • [2] Y. Cao, C. Wang, L. Zhang, and L. Zhang (2011) Edgel index for large-scale sketch-based image search. In CVPR, Cited by: §1.
  • [3] Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, and L. Zhang (2010) Mindfinder: interactive sketch-based image search on millions of images. In ACM MM, Cited by: §1.
  • [4] J. Chen and Y. Fang (2018) Deep cross-modality adaptation via semantics preserving adversarial learning for sketch-based 3d shape retrieval. In ECCV, Cited by: §2.1.
  • [5] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS, Cited by: §2.2.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, Cited by: §2.2.
  • [7] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §2.3.
  • [8] S. Chopra, R. Hadsell, Y. LeCun, et al. (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Cited by: §2.1.
  • [9] A. Del Bimbo and P. Pala (1997) Visual image retrieval by elastic matching of user sketches. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (2), pp. 121–132. Cited by: §1.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §2.1, §3.2.1, §4.1.1, §4.1.1.
  • [11] E. Dupont (2018) Learning disentangled joint continuous and discrete representations. In NeurIPS, Cited by: §1, §2.2.
  • [12] A. Dutta and Z. Akata (2019) Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In CVPR, Cited by: §1, §2.1, Table 1, §4.2, §4.2.
  • [13] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. In SIGGRAPH, Cited by: §4.1.1, §4.1.1.
  • [14] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa (2010) An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics 34 (5), pp. 482–498. Cited by: §1, §2.1.
  • [15] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa (2010) Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE transactions on visualization and computer graphics 17 (11), pp. 1624–1636. Cited by: §1, §2.1.
  • [16] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: A deep visual-semantic embedding model. In NeurIPS, Cited by: Table 1, §4.2.
  • [17] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. In NeurIPS, Cited by: §2.3.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §3.3.
  • [19] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: Learning basic visual concepts with a constrained variational framework.. In ICLR, Cited by: §2.2.
  • [20] R. Hu and J. Collomosse (2013) A performance evaluation of gradient field hog descriptor for sketch based image retrieval. Computer Vision and Image Understanding 117 (7), pp. 790–806. Cited by: §1, §2.1.
  • [21] R. Hu, T. Wang, and J. Collomosse (2011) A bag-of-regions approach to sketch-based image retrieval. In ICIP, Cited by: §1, §2.1.
  • [22] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §2.3.
  • [23] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.3.
  • [24] S. James, M. J. Fonseca, and J. Collomosse Reenact: sketch based choreographic design from archival dance footage. In ICMR, Cited by: §1.
  • [25] A. H. Jha, S. Anand, M. Singh, and V. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV, Cited by: §2.2.
  • [26] H. Kim and A. Mnih (2018) Disentangling by factorising. In ICML, Cited by: §2.2.
  • [27] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: §2.3.
  • [28] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.2.
  • [29] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §3.2.3.
  • [30] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In NeurIPS, Cited by: §2.2.
  • [31] E. Kodirov, T. Xiang, and S. Gong (2017) Semantic autoencoder for zero-shot learning. In CVPR, Cited by: Table 1, §4.2.
  • [32] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §2.3.
  • [33] H. Lee, H. Tseng, Q. Mao, J. Huang, Y. Lu, M. Singh, and M. Yang (2019) DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270. Cited by: §2.3.
  • [34] K. Li, K. Pang, Y. Song, T. Hospedales, H. Zhang, and Y. Hu (2016) Fine-grained sketch-based image retrieval: the role of part-aware attributes. In WACV, Cited by: §1.
  • [35] A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. In NeurIPS, Cited by: §2.2.
  • [36] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao (2017) Deep sketch hashing: fast free-hand sketch-based image retrieval. In CVPR, Cited by: §1, §4.1.1, §4.1.1, §4.1.1.
  • [37] Q. Liu, L. Xie, H. Wang, and A. Yuille (2019) Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In ICCV, Cited by: §1, §2.1, §4.2, §4.2.
  • [38] Y. Liu, Z. Wang, H. Jin, and I. Wassell (2018) Multi-task adversarial network for disentangled feature learning. In CVPR, Cited by: §2.2.
  • [39] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang (2018) Exploring disentangled feature representation beyond face identification. In CVPR, Cited by: §2.2.
  • [40] P. Lu, G. Huang, Y. Fu, G. Guo, and H. Lin (2018) Learning large euclidean margin for sketch-based image retrieval. arXiv preprint arXiv:1812.04275. Cited by: §1, §4.2.
  • [41] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In CVPR, Cited by: §2.2.
  • [42] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §4.4.
  • [43] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In NeurIPS, Cited by: §2.2.
  • [44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, Cited by: §4.2.
  • [45] G. A. Miller (1998) WordNet: An electronic lexical database. Cited by: §4.2.
  • [46] S. Parui and A. Mittal (2014) Similarity-invariant sketch-based image retrieval in large databases. In ECCV, Cited by: §1.
  • [47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, Cited by: §4.1.2.
  • [48] Y. Qi, Y. Song, H. Zhang, and J. Liu (2016) Sketch-based image retrieval via siamese convolutional neural network. In ICIP, Cited by: §1, §2.1, Table 1, §4.2.
  • [49] B. Romera and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In ICML, Cited by: Table 1, §4.2.
  • [50] J. M. Saavedra, J. M. Barrios, and S. Orand (2015) Sketch based image retrieval using learned keyshapes (LKS).. In BMVC, Cited by: §1, §2.1.
  • [51] P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG) 35 (4), pp. 119. Cited by: §1, §2.1, §4.1.1, §4.1.1.
  • [52] Y. Shen, L. Liu, F. Shen, and L. Shao (2018) Zero-shot sketch-image hashing. In CVPR, Cited by: §1, §1, §2.1, §4.1.1, §4.1.1, footnote 2.
  • [53] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras (2017) Neural face editing with intrinsic image disentangling. In CVPR, Cited by: §2.2.
  • [54] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: Figure 2, §4.2.
  • [55] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In NeurIPS, Cited by: Table 1, §4.2.
  • [56] J. Song, Q. Yu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In CVPR, Cited by: §2.1.
  • [57] L. Tran, X. Yin, and X. Liu (2017) Disentangled representation learning gan for pose-invariant face recognition. In CVPR, Cited by: §2.2.
  • [58] F. Wang, L. Kang, and Y. Li (2015) Sketch-based 3d shape retrieval using convolutional neural networks. In CVPR, Cited by: §1, §1, Table 1, §4.2, §4.2.
  • [59] H. Wang, C. Deng, X. Xu, W. Liu, X. Gao, and D. Tao (2019) Stacked semantic-guided network for zero-shot sketch-based image retrieval. arXiv preprint arXiv:1904.01971. Cited by: §1, §2.1, §3.1, §3.2.1.
  • [60] X. Xu, H. Wang, L. Li, and C. Deng (2019) Semantic adversarial network for zero-shot sketch-based image retrieval. arXiv preprint arXiv:1905.02327. Cited by: §1, §2.1, Table 1, §4.2.
  • [61] X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2Image: Conditional image generation from visual attributes. In ECCV, Cited by: §2.2.
  • [62] S. K. Yelamarthi, S. K. Reddy, A. Mishra, and A. Mittal (2018) A zero-shot framework for sketch based image retrieval. In ECCV, Cited by: §1, §1, §2.1, §3.1, Table 1, §4.1.1, §4.1.2, §4.2, §4.5.
  • [63] Q. Yu, F. Liu, Y. Song, T. Xiang, T. M. Hospedales, and C. Loy (2016) Sketch me that shoe. In CVPR, Cited by: §1, §1.
  • [64] Q. Yu, Y. Yang, F. Liu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Sketch-a-net: a deep neural network that beats humans. International journal of computer vision 122 (3), pp. 411–425. Cited by: Figure 1, §1, Table 1, §4.2.
  • [65] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In ICCV, pp. 4166–4174. Cited by: Table 1, §4.2, §4.2.
  • [66] S. Zhao, J. Song, and S. Ermon (2017) InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262. Cited by: §2.2.
  • [67] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz (2019) Joint discriminative and generative learning for person re-identification. In CVPR, Cited by: §2.2.
  • [68] R. Zhou, L. Chen, and L. Zhang (2012) Sketch-based image retrieval on a large scale database. In ACM MM, Cited by: §1.
  • [69] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, Cited by: §2.3.
  • [70] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In NeurIPS, Cited by: §2.3.


In the following, we provide the training (seen) and testing (unseen) category split on two datasets used in our experiments.

1) Split for Sketchy dataset

  • Training Categories: squirrel, turtle, tiger, bicycle, crocodilian, frog, bread, hedgehog, hot-air_balloon, ape, elephant, geyser, chicken, ray, fan, hotdog, pizza, duck, piano, armor, axe, hammer, camel, horse, spider, kangaroo, mushroom, owl, seal, table, hermit_crab, zebra, car_(sedan), shark, flower, guitar, bench, wine_bottle, fish, snail, deer, knife, airplane, sea_turtle, hat, eyeglasses, parrot, bee, tank, lion, swan, penguin, violin, rabbit, motorcycle, lobster, sheep, snake, shoe, hamburger, teddy_bear, pretzel, alarm_clock, church, ant, trumpet, candle, chair, hourglass, cat, scorpion, bear, dog, beetle, cannon, pig, cup, crab, pickup_truck, pineapple, apple, lizard, sailboat, spoon, umbrella, rocket, teapot, couch, butterfly, blimp, jellyfish, rifle, starfish, banana, wading_bird, bell, pistol, saxophone, strawberry, jack-o-lantern, castle, racket, harp, volcano

  • Test Categories: bat, cabin, cow, dolphin, door, giraffe, helicopter, mouse, pear, raccoon, rhinoceros, saw, scissors, seagull, skyscraper, songbird, sword, tree, wheelchair, windmill, window

2) Split for TU-Berlin dataset

  • Training Categories: arm, ashtray, axe, baseball bat, blimp, brain, bulldozer, bush, cake, chandelier, cloud, cow, crown, dolphin, donut, dragon, duck, eyeglasses, giraffe, grapes, grenade, head, head-phones, helicopter, horse, lightbulb, megaphone, microscope, mosquito, octopus, paper clip, pear, person walking, pigeon, pipe (for smoking), pumpkin, rainbow, rooster, satellite, satellite dish, scissors, seagull, skateboard, skyscraper, snowboard, stapler, suitcase, sun, sword, tire, toilet, tomato, toothbrush, trousers, walkie talkie, windmill, wrist-watch, carrot, key, palm tree, parrot, rollerblades, suv, tree

  • Test Categories: airplane, alarm clock, angel, ant, apple, armchair, backpack, banana, barn, basket, bathtub, bear (animal), bed, bee, beer-mug, bell, bench, bicycle, binoculars, book, bookshelf, boomerang, bottle opener, bowl, bread, bridge, bus, butterfly, cabinet, cactus, calculator, camel, camera, candle, cannon, canoe, car (sedan), castle, cat, cell phone, chair, church, cigarette, comb, computer monitor, computer-mouse, couch, crab, crane (machine), crocodile, cup, diamond, dog, door, door handle, ear, elephant, envelope, eye, face, fan, feather, fire hydrant, fish, flashlight, floor lamp, flower with stem, flying bird, flying saucer, foot, fork, frog, frying-pan, guitar, hamburger, hammer, hand, harp, hat, hedgehog, helmet, hot air balloon, hot-dog, hourglass, house, human-skeleton, ice-cream-cone, ipod, kangaroo, keyboard, microphone, monkey, moon, motorbike, mouse (animal), mouth, mug, mushroom, nose, owl, panda, parachute, parking meter, pen, penguin, person sitting, piano, pickup truck, pig, pineapple, pizza, potted plant, power outlet, present, pretzel, purse, rabbit, race car, radio, revolver, rifle, sailboat, santa claus, saxophone, scorpion, screwdriver, sea turtle, shark, sheep, ship, shoe, shovel, skull, snail, snake, snowman, socks, space shuttle, speed-boat, spider, sponge bob, spoon, squirrel, standing bird, strawberry, streetlight, submarine, swan, syringe, t-shirt, table, tablelamp, teacup, teapot, teddy-bear, telephone, tennis-racket, tent, tiger, tooth, tractor, traffic light, train, trombone, truck, trumpet, tv, umbrella, van, vase, violin, wheel, wheelbarrow, wine-bottle, wineglass, zebra
