Semi-Heterogeneous Three-Way Joint Embedding Network for Sketch-Based Image Retrieval

Jianjun Lei,  Yuxin Song, Bo Peng, Zhanyu Ma, 
Ling Shao,  and Yi-Zhe Song
This work was supported in part by the Natural Science Foundation of Tianjin (No. 18ZXZNGX00110, 18JCJQJC45800), and the National Natural Science Foundation of China (No. 61931014, 61922015, 61722112). Copyright 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org. (Corresponding author: Bo Peng.) J. Lei, Y. Song, and B. Peng are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: jjlei@tju.edu.cn; songyuxin@tju.edu.cn; bpeng@tju.edu.cn). Z. Ma is with the Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: mazhanyu@bupt.edu.cn). L. Shao is with the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates (e-mail: ling.shao@ieee.org). Y.-Z. Song is with the SketchX Lab, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, Surrey GU2 7XH, U.K. (e-mail: y.song@surrey.ac.uk).
Abstract

Sketch-based image retrieval (SBIR) is a challenging task due to the large cross-domain gap between sketches and natural images. How to align abstract sketches and natural images into a common high-level semantic space remains a key problem in SBIR. In this paper, we propose a novel semi-heterogeneous three-way joint embedding network (Semi3-Net), which integrates three branches (a sketch branch, a natural image branch, and an edgemap branch) to learn more discriminative cross-domain feature representations for the SBIR task. The key insight lies with how we cultivate the mutual and subtle relationships amongst the sketches, natural images, and edgemaps. A semi-heterogeneous feature mapping is designed to extract bottom features from each domain, where the sketch and edgemap branches are shared while the natural image branch is heterogeneous to the other branches. In addition, a joint semantic embedding is introduced to embed the features from different domains into a common high-level semantic space, where all of the three branches are shared. To further capture informative features common to both natural images and the corresponding edgemaps, a co-attention model is introduced to conduct common channel-wise feature recalibration between different domains. A hybrid-loss mechanism is designed to align the three branches, where an alignment loss and a sketch-edgemap contrastive loss are presented to encourage the network to learn invariant cross-domain representations. Experimental results on two widely used category-level datasets (Sketchy and TU-Berlin Extension) demonstrate that the proposed method outperforms state-of-the-art methods.

Index Terms—SBIR, cross-domain learning, co-attention model, hybrid-loss mechanism

I Introduction

Since the number of digital images on the Internet has increased dramatically in recent years, content-based image retrieval technology has become a hot topic in the computer vision community [1]-[4]. With the popularization of touch screen devices, sketch-based image retrieval (SBIR) has attracted extensive attention and achieved remarkable performance [5]-[8]. Given a hand-drawn sketch, the SBIR task aims to retrieve the natural target images from the image database. However, compared with the natural target images, which are full of color and texture information, sketches only contain simple black and white pixels [9], [10]. Therefore, hand-drawn sketches and natural images belong to two heterogeneous data domains, and aligning these two domains into a common feature space remains the most challenging problem in SBIR.

Traditional SBIR methods describe sketches and natural images using hand-crafted features [11]-[19]. Edgemaps of natural images are usually first extracted as sketch approximations. Then, hand-designed features, such as HOG [20], SIFT [21], and Shape Context [22], are exploited to describe both the sketches and edgemaps. Finally, the K-Nearest Neighbor (KNN) ranking process is utilized to evaluate the similarity between the sketches and natural images to obtain the final retrieval results. However, as mentioned above, hand-drawn sketches and natural images belong to two heterogeneous data domains. It is difficult to design a common type of feature applicable to two different data domains. Besides, sketches are usually drawn by non-professionals, making them full of intra-class variations [23]-[25]. Most hand-crafted features have difficulties in dealing with these intra-class variations and ambiguities of hand-drawn sketches, which also negatively impacts the performance of SBIR.

In recent years, convolutional neural networks (CNNs) have been widely used across fields [26]-[29], such as person re-identification, object detection, and video recommendation. In contrast to traditional hand-crafted methods, CNNs can automatically aggregate shallow features learned from the bottom convolutional layers. Inspired by the learning ability of CNNs, several Siamese networks and Triplet networks have been proposed for the SBIR task [30]-[33]. Most of these methods encode a sketch-image or a sketch-edgemap pair, and learn the similarity between the input pair using a contrastive loss or triplet loss. However, there are still several difficulties and challenges to be solved in these methods. 1) The different characteristics of sketches and natural images make the SBIR task challenging. Generally, a sketch only contains the object to be retrieved, meaning the sketches tend to have relatively clean backgrounds. In addition, since sketches are usually drawn by non-professionals, the shapes of objects in sketches are usually deformed and relatively abstract. For natural images, although the objects are not usually significantly deformed, natural images taken by cameras usually have complex backgrounds. Therefore, creating a network that can learn more discriminative features for both sketches and natural images remains a challenge. 2) Since sketches and natural images are from two different data domains, there exists a significant domain gap between the features of these two domains. Most deep SBIR methods that adopt a contrastive loss or a triplet loss to learn the cross-domain similarity are not effective enough to cope with the intrinsic domain gap. Therefore, finding a way to eliminate or reduce the cross-domain gap and embedding features from different domains into a common high-level semantic space is critical for SBIR. 3) More importantly, most existing methods achieve SBIR by exploring the matching relationship between either sketch-edgemap pairs or sketch-image pairs. However, the methods using sketch-edgemap pairs ignore the discriminative features contained in natural images, while the methods using sketch-image pairs ignore the auxiliary role of edgemaps. Enabling full use of the joint relationships among sketches, natural images, and edgemaps provides a novel way to solve the cross-domain learning problem.

To address the above issues, a novel semi-heterogeneous three-way joint embedding network (Semi3-Net) is proposed in this paper to mitigate the domain gap and align sketches, natural images, and edgemaps into a high-level semantic space. The key insight behind our design is how we enforce mutual cooperation amongst the three branches. We importantly recognize that when measured in terms of visual abstraction, sketches and edgemaps are more closely linked compared with sketches and natural images. The sketches are highly abstract and iconic representations of natural images, and edgemaps are reduced versions of natural images, where detailed appearance information such as texture and color are removed. However, compared with edgemaps, natural images contain more discriminative features for SBIR. Motivated by this insight, we purposefully design a semi-heterogeneous joint embedding network, where a semi-heterogeneous weight-sharing setting among the three branches is adopted in the feature mapping part, while a three-branch all-sharing setting is conducted in the joint semantic embedding part. This design essentially promotes edgemaps to act as a “bridge” to help narrow the domain gap between the natural images and sketches. Fig. 1 offers a visualization of the proposed Semi3-Net architecture. More specifically, the semi-heterogeneous feature mapping part is designed to extract the bottom features for each domain, where a co-attention model is introduced to learn informative features common between different domains. Meanwhile, the joint semantic embedding part is proposed to embed the features from different domains into a common high-level semantic space. In addition, a hybrid-loss mechanism is proposed to achieve a more discriminative embedding, where an alignment loss and a sketch-edgemap contrastive loss are introduced to encourage the network to learn invariant cross-domain representations. The main contributions of this paper are summarized as follows.

1) A novel semi-heterogeneous three-way joint embedding network is proposed, in which the semi-heterogeneous feature mapping and the joint semantic embedding are designed to learn joint feature representations for sketches, natural images, and edgemaps.

2) To capture informative features common between natural images and the corresponding edgemaps, a co-attention model is developed between the natural image and edgemap branches.

3) A hybrid-loss mechanism is designed to mitigate the domain gap, where an alignment loss and a sketch-edgemap contrastive loss are presented to encourage the network to learn invariant cross-domain representations.

4) Experiments on two widely-used datasets, Sketchy and TU-Berlin Extension, demonstrate that the proposed method outperforms state-of-the-art methods.

The rest of the paper is organized as follows. Section II reviews the related works. Section III introduces the proposed method in detail. The experimental results and analysis are presented in Section IV. Finally, the conclusion is drawn in Section V.

II Related Work

II-A Traditional SBIR Methods

Traditional SBIR methods usually utilize edge extraction methods to extract edgemaps from natural images first. Then, hand-crafted features are used as descriptors for both sketches and edgemaps. Finally, a KNN ranking process within a Bag-of-Words (BoW) framework is usually utilized to rank the candidate natural images for each sketch. For instance, Hu et al. [11] incorporated the Gradient Field HOG (GF-HOG) into a BoW scheme for SBIR, and obtained promising performance. Saavedra et al. [12] introduced Soft-Histogram of Edge Local Orientations (SHELO) as the descriptors for sketches and edgemaps extracted from natural images, which effectively improves the retrieval accuracy. In [13], a novel method for describing hand-drawn sketches was proposed by detecting learned keyshapes (LKS). Xu et al. [14] proposed an academic coupled dictionary learning method to address the cross-domain learning problem in SBIR. Qian et al. [15] introduced re-ranking and relevance feedback schemes to find more similar natural images based on initial retrieval results, thus improving the retrieval performance.

II-B Deep SBIR Methods

Fig. 1: Illustration of the proposed Semi3-Net. (a) Three-way inputs. (b) Semi-heterogeneous feature mapping. (c) Joint semantic embedding. (d) Hybrid-loss mechanism. Blocks with the same color indicate that their weights are shared.

Recently, many frameworks based on CNNs have been proposed to address the challenges in SBIR [33]-[39]. Aiming to learn the cross-domain similarity between the sketch and natural image domains, several Siamese networks have been proposed to improve the retrieval performance. Qi et al. [33] introduced a novel Siamese CNN architecture for SBIR, which learns the features of sketches and edgemaps by jointly tuning two CNNs. Liu et al. [34] proposed a Siamese-AlexNet based on two AlexNet [40] branches to learn the cross-domain similarity and mitigate the domain gap. Wang et al. [36] proposed a Siamese network to learn the similarity between the input sketches and edgemaps of 3D models, which was originally designed for sketch-based 3D shape retrieval. Meanwhile, several Triplet architectures have also been proposed, which include the sketch branch, the positive natural image branch, and the negative natural image branch. In these methods, a ranking loss function is utilized to constrain the feature distance between a sketch and a positive natural image to be smaller than the one between the sketch and a negative natural image. Sangkloy et al. [38] learned a cross-domain mapping through a pre-training strategy to embed natural images and sketches in the same semantic space, and achieved superior retrieval performance. Recently, deep hashing methods [41], [42] have been exploited for the retrieval task and have achieved significant improvements in retrieval performance. Liu et al. [34] integrated a deep architecture into the hashing framework to capture the cross-domain similarities and speed up the SBIR process. Zhang et al. [39] proposed a Generative Domain-migration Hashing (GDH) approach, which uses a generative model to migrate sketches to their indistinguishable natural image counterparts and achieves the best-performing results on two SBIR datasets.

II-C Attention Models

Attention models have recently been successfully applied to various deep learning tasks, such as natural language processing (NLP) [43], fine-grained image recognition [44], [45], video moment retrieval [46], and visual question answering (VQA) [47]. In the field of image and video processing, the two most commonly used attention models are soft-attention models [48] and hard-attention models [49]. Soft-attention models assign different weights to different regions or channels in an image or a video, by learning an attention mask. In contrast, hard-attention models only indicate one region at a time, using reinforcement learning. For instance, Hu et al. [50] proposed a channel attention model to recalibrate the weights of different channels, which effectively enhances the discriminative power of features and achieves promising classification performance. Gao et al. [51] proposed a novel aLSTMs framework for video captioning, which integrates the attention mechanism and LSTM to capture salient structures for video. In [52], a hierarchical LSTM with adaptive attention approach was proposed for image and video captioning, and achieved state-of-the-art performance on both tasks. Besides, Li et al. [49] proposed a harmonious attention network for person re-identification, where soft attention is used to learn important pixels for fine-grained information matching, and hard attention is applied to search latent discriminative regions. Song et al. [35] proposed a spatial soft-attention method for fine-grained SBIR to capture more discriminative fine-grained features. This model reweights the different spatial regions of the feature map by learning a weight mask for each branch of the triplet network. However, although the attention mechanisms above have strong abilities for feature learning, they generally learn discriminative features using only the input itself. For the SBIR task, we are more concerned with learning discriminative cross-domain features for retrieval. In other words, the common features of different domains should be considered simultaneously. To address this, a co-attention model is exploited in this paper for the SBIR task, to focus on capturing informative features common between natural images and the corresponding edgemaps, and further mitigate the cross-domain gap.

III Semi-Heterogeneous Three-Way Joint Embedding Network

III-A Semi-Heterogeneous Feature Mapping

As shown in Fig. 1 (b), the semi-heterogeneous feature mapping part consists of the natural image branch, the edgemap branch, and the sketch branch. Each branch includes a series of convolutional and pooling layers, which aim to learn the bottom features for each domain. As mentioned above, sketches and edgemaps have similar characteristics, and both lack detailed appearance information such as texture and color; thus, the weights of the sketch branch and the edgemap branch are shared. Besides, since the amount of sketch training data is much smaller than that of natural image training data, sharing weights between the sketch and edgemap branches can partly alleviate problems that arise from the lack of sketch training data.

Fig. 2: Structure of the co-attention model. Yellow blocks represent the layers that belong to the natural image branch, while green blocks represent the layers that belong to the edgemap branch.

Meanwhile, since there exist obvious feature representation differences between natural images and sketches (edgemaps), the bottom convolutional layers of the natural image branch should be learned separately. Accordingly, the natural image branch does not share weights with the other two branches in the feature mapping part. Additionally, with the aim of learning the informative features common between different domains, a co-attention model is introduced to associate the natural image and edgemap branches. By applying the proposed co-attention model, the network is able to focus on the discriminative features common to both natural images and the corresponding edgemaps, and discard information that is not important for the retrieval task. For the sketch branch, a channel-wise attention module is also introduced to learn more discriminative features. The co-attention model will be discussed in Section III-C in detail.

Particularly, the term “semi-heterogeneous” in the proposed architecture refers to the three-branch semi-heterogeneous weight-sharing strategy: the weights of the sketch and edgemap branches are shared, while the natural image branch is independent of the others in the semi-heterogeneous feature mapping part. The three branches are integrated into the semi-heterogeneous feature mapping architecture and interact with each other, which ensures that not only are the bottom features of each domain preserved, but also that the features from different domains are prealigned through the co-attention model.
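
To make the weight-sharing scheme concrete, below is a minimal PyTorch sketch of the semi-heterogeneous feature mapping. The class and attribute names (SemiHeterogeneousMapping, conv_image, conv_sketch_edge) are illustrative assumptions, not the authors' Caffe implementation.

```python
# Minimal sketch: one conv trunk for natural images, one trunk shared by
# sketches and edgemaps, as described in Section III-A.
import torch.nn as nn
from torchvision.models import vgg19


class SemiHeterogeneousMapping(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_image = vgg19().features        # independent trunk for natural images
        self.conv_sketch_edge = vgg19().features  # single trunk shared by sketches and edgemaps

    def forward(self, image, edgemap, sketch):
        f_image = self.conv_image(image)           # image-specific bottom features
        f_edge = self.conv_sketch_edge(edgemap)    # shares weights with the sketch branch
        f_sketch = self.conv_sketch_edge(sketch)   # same module, hence shared weights
        return f_image, f_edge, f_sketch
```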

III-B Joint Semantic Embedding

In the joint semantic embedding part, as shown in Fig. 1 (c), the natural image branch, the edgemap branch, and the sketch branch are developed to embed the features from different domains into a common high-level semantic space. Each branch of the joint semantic embedding part includes several fully-connected (FC) layers. An extra embedding layer, followed by an L2 normalization layer, is also introduced in each branch. As previously stated, the bottom features of the different branches are learned separately by the semi-heterogeneous feature mapping. In order to achieve feature alignment between the natural images, edgemaps, and sketches in a common high-level semantic space, the weights of the three branches are completely shared in the joint semantic embedding part. Based on the features learned in the common high-level semantic space, a hybrid-loss mechanism is proposed to learn the invariant cross-domain representations, and achieve a more discriminative embedding.
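
A hedged sketch of the joint semantic embedding follows: shared FC layers, a 256-d embedding layer, and L2 normalization. The layer sizes and dropout follow the VGG19 classifier head and are assumptions; applying this single module to all three branches realizes the fully shared setting described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointSemanticEmbedding(nn.Module):
    def __init__(self, in_dim=512 * 7 * 7, embed_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        )
        self.embedding = nn.Linear(4096, embed_dim)   # extra 256-d embedding layer

    def forward(self, feature_map):
        x = self.fc(torch.flatten(feature_map, start_dim=1))
        return F.normalize(self.embedding(x), p=2, dim=1)  # L2-normalized embedding
```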

III-C Co-attention Model

To capture the informative features common between natural images and the corresponding edgemaps, a co-attention model is proposed between the natural image and edgemap branches. As previously stated, edgemaps are reduced versions of natural images, where detailed appearance information is removed. Therefore, a natural image and its corresponding edgemap are spatially aligned, which can be fully exploited to narrow the gap between the natural images and edgemaps. Additionally, since the weights of the sketch and edgemap branches are shared, the introduction of the co-attention model actually enforces mutual and subtle cooperation amongst the three branches. Specifically, before the bottom features of the three branches are fed into the joint semantic embedding part, the co-attention model prealigns the different domains by conducting common feature recalibration. As shown in Fig. 2, the proposed co-attention model includes two attention modules and a co-mask learning module. The co-attention model takes the output feature maps of the last pooling layers in the natural image and edgemap branches as the inputs, and learns a common mask which is used to re-weight each channel of the feature maps in both branches.

In the attention module, a channel-wise soft attention mechanism is adopted to capture discriminative features of each input. Specifically, the attention module on the right of Fig. 2 consists of a global average pooling (GAP) layer, two FC layers, a ReLU layer, and a sigmoid layer. Let $U^{I} \in \mathbb{R}^{h \times w \times c}$ and $U^{E} \in \mathbb{R}^{h \times w \times c}$ denote the inputs of the attention module for the natural image branch and edgemap branch, respectively, where $h$, $w$, and $c$ represent the height, width, and channel dimensions of the feature maps. Through the GAP layer, the feature descriptors $v^{I}$ and $v^{E}$ for the natural image and edgemap branches are obtained by aggregating the global spatial information of $U^{I}$ and $U^{E}$, which can be formulated as:

$v^{I} = \frac{1}{h \times w}\sum_{i=1}^{h}\sum_{j=1}^{w} U^{I}(i,j)$ (1)
$v^{E} = \frac{1}{h \times w}\sum_{i=1}^{h}\sum_{j=1}^{w} U^{E}(i,j)$ (2)

Based on $v^{I}$ and $v^{E}$, two FC layers and a ReLU layer are applied to model the channel-wise dependencies, and the attention maps in both domains are obtained. By deploying a sigmoid layer, each channel of the attention map is normalized to $[0, 1]$. For the different domains, the final learned image attention mask $m^{I}$ and edgemap attention mask $m^{E}$ are respectively formulated as:

$m^{I} = \sigma\big(W_{2}^{I}\,\delta(W_{1}^{I} v^{I})\big)$ (3)
$m^{E} = \sigma\big(W_{2}^{E}\,\delta(W_{1}^{E} v^{E})\big)$ (4)

where $W_{1}^{I}$ and $W_{1}^{E}$ denote the weights of the first FC layer, $W_{2}^{I}$ and $W_{2}^{E}$ denote the weights of the second FC layer, and $\delta(\cdot)$ and $\sigma(\cdot)$ denote the ReLU and sigmoid functions, respectively.
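
The per-branch attention module of Eqs. (1)-(4) can be sketched as follows in PyTorch; the reduction ratio of the first FC layer is an assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                   # GAP, Eqs. (1)-(2)
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, u):                                     # u: (N, c, h, w)
        v = self.pool(u).flatten(1)                           # channel descriptor, (N, c)
        m = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))  # Eqs. (3)-(4), values in [0, 1]
        return m
```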

In SBIR, the key challenge is to capture the discriminative information common to different domains and then align the different domains into a common high-level semantic space. Therefore, unlike most existing works that use the obtained attention mask directly to reweight the channel responses, the proposed co-attention model tries to capture the channel-wise dependencies common between different domains by learning a co-mask. Based on the learned image attention mask $m^{I}$ and edgemap attention mask $m^{E}$, the co-mask $m^{C}$ is defined as:

$m^{C} = m^{I} \odot m^{E}$ (5)

where $\odot$ denotes the element-wise product. Elements in $m^{C}$ represent the joint weights of the corresponding channels in $U^{I}$ and $U^{E}$.

Afterwards, the output feature maps $\tilde{U}^{I}$ and $\tilde{U}^{E}$ for the image and edgemap branches are calculated respectively by rescaling the input feature maps $U^{I}$ and $U^{E}$ with the obtained channel-wise co-mask:

$\tilde{U}^{I} = m^{C} \otimes U^{I}$ (6)
$\tilde{U}^{E} = m^{C} \otimes U^{E}$ (7)

where $\otimes$ denotes the channel-wise multiplication between the co-mask and the input feature maps.

The proposed co-attention model not only considers the channel-wise feature responses of each domain by introducing the attention mechanism, but also captures the common channel-wise relationship between different domains, simultaneously. Specifically, by introducing the co-attention model between the natural image and edgemap branches, the common informative features from different domains are highlighted, and the cross-domain gap is effectively reduced.
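
Building on the ChannelAttention sketch above, the full co-attention model can be sketched as follows, covering the co-mask of Eq. (5) and the channel-wise recalibration of Eqs. (6)-(7). Whether the two attention modules share weights is not specified in the text, so using separate modules is an assumption.

```python
import torch.nn as nn


class CoAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.att_image = ChannelAttention(channels)  # attention for the image branch
        self.att_edge = ChannelAttention(channels)   # attention for the edgemap branch

    def forward(self, u_image, u_edge):
        m_image = self.att_image(u_image)            # m^I
        m_edge = self.att_edge(u_edge)               # m^E
        co_mask = (m_image * m_edge).unsqueeze(-1).unsqueeze(-1)  # Eq. (5), broadcast over h, w
        return u_image * co_mask, u_edge * co_mask   # Eqs. (6)-(7): channel-wise recalibration
```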

III-D Hybrid-Loss Mechanism

SBIR aims to obtain a retrieval ranking by learning the similarity between a query sketch and the natural images in a dataset. The distance between the positive sketch-image pairs should be smaller than the distance between the negative sketch-image pairs. In the hybrid-loss mechanism, the alignment loss and the sketch-edgemap contrastive loss are presented to learn the invariant cross-domain representations, and mitigate the domain gap. Besides, we also introduce the cross-entropy loss [38] and the sketch-image contrastive loss [30], which are two typical losses in SBIR. The four types of loss functions are complementary to each other, thus improving the separability between the different sample pairs.

To be clear, for an input natural image $I$, edgemap $E$, and sketch $S$, the features produced by the L2 normalization layers of the three branches are denoted as $f(I;\theta_{I})$, $f(E;\theta_{E})$, and $f(S;\theta_{S})$, where $f(\cdot)$ denotes the mapping function learned by the network branch, and $\theta_{I}$, $\theta_{E}$, and $\theta_{S}$ denote the weights of the natural image, edgemap, and sketch branches, respectively.

1) Alignment loss. To learn the invariant cross-domain representations and align different domains into a high-level semantic space, a novel alignment loss is proposed between the natural image branch and the edgemap branch. Although the image and the corresponding edgemap come from different data domains, they should have similar high-level semantics after the processing of joint semantic embedding. Motivated by this, aiming to minimize the feature distance between an image and its corresponding edgemap in the high-level semantic space, the proposed alignment loss function is defined as:

$L_{align} = \big\| f(I;\theta_{I}) - f(E;\theta_{E}) \big\|_{2}^{2}$ (8)

By introducing the alignment loss, the cross-domain invariant representations between a natural image and its corresponding edgemap are captured for the SBIR task. In other words, the proposed alignment loss provides a novel way of dealing with the domain gap by constructing a correlation between the natural image and its corresponding edgemap. It potentially encourages the network to learn the discriminative features common to both natural image and sketch domains, and successfully aligns different domains into a common high-level semantic space.
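
A minimal sketch of the alignment loss in Eq. (8), assuming it is computed as the squared Euclidean distance between the image and edgemap embeddings and averaged over a batch (the batch averaging is an assumption):

```python
import torch


def alignment_loss(f_image, f_edge):
    # f_image, f_edge: (N, d) L2-normalized embeddings of images and their own edgemaps
    return ((f_image - f_edge) ** 2).sum(dim=1).mean()
```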

2) Sketch-edgemap contrastive loss. Considering the one-to-one correlation between an image and its corresponding edgemap, the sketch-edgemap contrastive loss between the sketch and edgemap branches is proposed to further constrain the matching relationship between the sketch-image pair as follows.

$L_{se} = y\, d^{2}\big(f(S;\theta_{S}), f(E^{+};\theta_{E})\big) + (1-y)\max\big(0,\, \tau_{1} - d\big(f(S;\theta_{S}), f(E^{-};\theta_{E})\big)\big)^{2}$ (9)

where $y$ denotes the similarity label, with 1 indicating a positive sketch-edgemap pair and 0 a negative sketch-edgemap pair, $E^{+}$ and $E^{-}$ denote the edgemaps corresponding to the positive and negative natural images, respectively, $d(\cdot,\cdot)$ denotes the Euclidean distance, and $\tau_{1}$ denotes the margin. The sketch-edgemap contrastive loss aims to measure the similarity between input pairs from the sketch and edgemap branches, thus further aligning different domains into the high-level semantic space.
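
The sketch-edgemap term can be written as a standard contrastive loss, as sketched below; the same helper also covers the sketch-image contrastive loss of Eq. (11) when image embeddings are passed instead of edgemap embeddings. The batch averaging is an assumption.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(f_sketch, f_other, y, margin=0.3):
    d = F.pairwise_distance(f_sketch, f_other)                # Euclidean distance
    pos = y * d.pow(2)                                        # pull positive pairs together
    neg = (1 - y) * torch.clamp(margin - d, min=0.0).pow(2)   # push negatives beyond the margin
    return (pos + neg).mean()
```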

3) Cross-entropy loss. In order to learn the discriminative features from each domain, the cross-entropy losses [38] for the three branches are introduced. For each branch of the proposed network, a softmax cross-entropy loss is exploited, which is formulated as:

$L_{ce} = -\sum_{k=1}^{C} q_{k} \log p_{k}, \quad p_{k} = \frac{\exp(z_{k})}{\sum_{j=1}^{C} \exp(z_{j})}$ (10)

where $p = (p_{1}, \ldots, p_{C})$ denotes the discrete probability distribution of one data sample over the $C$ categories, $q$ denotes the typical one-hot label corresponding to each category, and $z$ denotes the feature vector produced by the last FC layer. In the proposed Semi3-Net, the cross-entropy loss forces the network to extract the discriminative features for each domain.
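
In a PyTorch-style sketch, Eq. (10) reduces to the standard softmax cross-entropy applied to the last-FC-layer logits of one branch, with integer category labels:

```python
import torch.nn.functional as F


def classification_loss(logits, labels):
    return F.cross_entropy(logits, labels)  # softmax + negative log-likelihood
```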

4) Sketch-image contrastive loss. Intuitively, in the SBIR task, the positive sketch-image pair should be close together, while the negative sketch-image pair should be far apart. Given a sketch and a natural image , the sketch-image contrastive loss [30] can be represented as:

$L_{si} = y\, d^{2}\big(f(S;\theta_{S}), f(I^{+};\theta_{I})\big) + (1-y)\max\big(0,\, \tau_{2} - d\big(f(S;\theta_{S}), f(I^{-};\theta_{I})\big)\big)^{2}$ (11)

where $I^{+}$ and $I^{-}$ denote the positive and negative natural images, respectively, and $\tau_{2}$ denotes the margin. By utilizing the sketch-image contrastive loss, the cross-domain similarity between sketches and natural images is effectively measured.

Finally, the alignment loss in Eq. (8), the sketch-edgemap contrastive loss in Eq. (9), the cross-entropy loss in Eq. (10), and the sketch-image contrastive loss in Eq. (11) are combined, thus the overall loss function is derived as:

$L = L_{si} + \lambda_{1} L_{align} + \lambda_{2} L_{se} + \lambda_{3} L_{ce}$ (12)

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ denote the hyper-parameters that control the trade-off among the different types of losses.
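
Assembling the overall objective is then a weighted sum, as sketched below; which term is left unweighted is an assumption, and the default weights follow the values reported in Section IV-A.

```python
def total_loss(loss_si, loss_align, loss_se, loss_ce,
               lambda1=10.0, lambda2=100.0, lambda3=10.0):
    # Eq. (12): sketch-image contrastive loss plus weighted alignment,
    # sketch-edgemap contrastive, and cross-entropy terms.
    return loss_si + lambda1 * loss_align + lambda2 * loss_se + lambda3 * loss_ce
```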

The proposed hybrid-loss mechanism constructs the correlation among sketches, edgemaps, and natural images, providing a novel way to deal with the domain gap by learning the invariant cross-domain representations. By adopting the hybrid-loss mechanism, the proposed network is able to learn more discriminative feature representations and effectively align the sketches, natural images, and edgemaps into a common feature space, thus improving the retrieval accuracy.

III-E Implementation Details and Training Procedure

In the proposed Semi3-Net, each branch is constructed based on VGG19 [53], which consists of sixteen convolutional layers, five pooling layers, and three FC layers. In terms of network architecture, the convolutional layers and pooling layers in the semi-heterogeneous feature mapping part, and the first two FC layers in the joint semantic embedding part, are the same in structure as those in VGG19. Specifically, the extra embedding layer in each branch is a 256-dimensional fully-connected layer, which aims to embed the different domains into the common semantic feature space. Meanwhile, the size of the last FC layer is modified to the number of categories in the SBIR datasets. The entire training process includes a pre-training stage on each individual branch and a training stage on the whole Semi3-Net.
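
A hedged PyTorch sketch of adapting the VGG19 head as described above follows; attaching the classification layer after the embedding layer is an assumption, since the text does not state where the category FC layer sits relative to the embedding layer.

```python
import torch.nn as nn
from torchvision.models import vgg19


def build_branch(num_categories, embed_dim=256):
    backbone = vgg19()
    features = backbone.features                                    # conv + pooling layers
    fc = nn.Sequential(*list(backbone.classifier.children())[:-1])  # first two FC layers (+ReLU/Dropout)
    embedding = nn.Linear(4096, embed_dim)                          # extra 256-d embedding layer
    classifier = nn.Linear(embed_dim, num_categories)               # last FC layer, resized to the categories
    return features, fc, embedding, classifier
```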

In the pre-training stage, each individual branch, including the convolutional layers and pooling layers in the semi-heterogeneous feature mapping part, and the FC layers in the joint semantic embedding part, is trained independently. Specifically, the cross-entropy loss is adopted to pre-train each branch using the corresponding source data in the training dataset. Without learning the common embedding, the pre-training stage aims to learn the weights appropriate for sketch, natural image and edgemap recognition, respectively.

In the training stage, the weights of the three branches are learned jointly, and the cross-domain representations are obtained by training the whole Semi3-Net. The overall loss function in Eq. (12) is utilized in the training stage. As for the sketch-edgemap and the sketch-image contrastive losses illustrated above, the sketch-image and sketch-edgemap pairs for training need to be generated. To this end, for each sketch in the training dataset, we randomly select a natural image (edgemap) from the same category to form the positive pair and a natural image (edgemap) from the other categories to form the negative pair. In the training process, the ratio of positive and negative sample pairs is set to 1:1, and the positive and negative pairs are randomly selected following this rule, for each training batch.
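
The pair generation can be sketched as follows, assuming the training images are grouped by category; the helper name and the images_by_category data layout are illustrative assumptions.

```python
import random


def sample_pairs(sketch_label, images_by_category):
    # positive pair: an image (and its edgemap) from the same category as the sketch
    positive = random.choice(images_by_category[sketch_label])
    # negative pair: an image from a randomly chosen different category (1:1 ratio overall)
    other_label = random.choice([c for c in images_by_category if c != sketch_label])
    negative = random.choice(images_by_category[other_label])
    return (positive, 1), (negative, 0)   # (image, similarity label)
```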

IV Experiments

IV-A Experimental Settings

In this paper, two category-level SBIR benchmarks, Sketchy [38] and TU-Berlin Extension [34], are adopted to evaluate the proposed Semi3-Net. Sketchy consists of 75,471 sketches and 12,500 natural images, from 125 categories with 100 objects per category. In our experiments, we utilize the extended Sketchy dataset [34] with 73,002 natural images in total, which adds an extra 60,502 natural images collected from ImageNet [54]. TU-Berlin Extension [34] consists of 204,489 natural images and 20k free-hand drawn sketches from 250 categories, with 80 sketches per category. The natural images in these two datasets are all realistic with complex backgrounds and large variations, thus bringing great challenges to the SBIR task. Importantly, for fair comparison against state-of-the-art methods, the same training-testing splits used for the existing methods are adopted in our experiments. For Sketchy and TU-Berlin Extension, 50 and 10 sketches, respectively, of each category are utilized as the testing queries, while the rest are used for training.

All experiments are performed on a GeForce GTX 1080 Ti GPU and an Intel i7-8700K processor @ 3.70 GHz. The training process of the proposed Semi3-Net is implemented using SGD on Caffe [55] with a batch size of 32. The initial learning rate is set to 2e-4, and the weight decay and the momentum are set to 5e-4 and 0.9, respectively. For both datasets, the balance parameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are set to 10, 100, and 10, respectively, as they consistently yield promising results. The margins $\tau_{1}$ and $\tau_{2}$ for the two contrastive losses are both set to 0.3.
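
For reference, the reported optimizer settings map to the following PyTorch configuration; the original implementation is in Caffe, so this is only an illustrative equivalent.

```python
import torch


def make_optimizer(params):
    # SGD with the reported learning rate, momentum, and weight decay
    return torch.optim.SGD(params, lr=2e-4, momentum=0.9, weight_decay=5e-4)
```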

In the proposed method, the Gb method [56] is applied to extract the edgemap of each natural image. During the testing phase, sketches, natural images, and the corresponding edgemaps are fed into the trained model, and the feature vectors of the three branches are obtained. Then, the cosine distance between the feature vectors of the query sketch and each natural image in the dataset is calculated. Finally, KNN is utilized to sort all the natural images for the final retrieval result. Similar to the existing SBIR methods, mean average precision (MAP) is used to evaluate the retrieval performance.
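
A minimal sketch of the test-time ranking described above: the query-sketch embedding is compared with every gallery-image embedding by cosine distance, and the gallery is sorted from nearest to farthest.

```python
import torch
import torch.nn.functional as F


def rank_gallery(sketch_embedding, image_embeddings):
    # sketch_embedding: (d,), image_embeddings: (N, d)
    s = F.normalize(sketch_embedding, dim=0)
    g = F.normalize(image_embeddings, dim=1)
    cosine_distance = 1.0 - g @ s          # (N,) distances to the query sketch
    return torch.argsort(cosine_distance)  # gallery indices, nearest to farthest
```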

IV-B Comparison Results

Table I shows the performance of the proposed method compared with the state-of-the-art methods, on the Sketchy and TU-Berlin Extension datasets. The comparison methods include traditional methods (i.e., LKS [13], HOG [20], SHELO [12], and GF-HOG [11]) and deep learning methods (i.e., Siamese CNN [33], SaN [37], GN Triplet [38], 3D shape [36], Siamese-AlexNet [34], Triplet-AlexNet [34], DSH [34] and GDH [39]). It can be observed from the table that: 1) Compared with the methods that utilize edgemaps to replace natural images for retrieval [33], [36], the proposed method obtains better retrieval accuracy. This is mainly because the edgemaps extracted from natural images may lose certain information useful for retrieval, and the CNNs pre-trained on ImageNet are more effective for natural images than edgemaps. 2) Compared with the methods that only utilize sketch-image pairs for SBIR [38], the proposed Semi3-Net achieves better performance, by introducing edgemap information. It demonstrates that an edgemap can be utilized as a bridge to effectively narrow the distance between the natural image and sketch domains. 3) Compared with DSH [34], which fuses the natural image and edgemap features into one feature representation, the proposed Semi3-Net achieves 0.133 and 0.230 improvements in terms of MAP, on Sketchy and TU-Berlin Extension, respectively. This is mainly because the proposed Semi3-Net makes full use of the one-to-one matching relationship between a natural image and its corresponding edgemap to align different domains into a high-level semantic space. Meanwhile, the domain shift process is well achieved under the proposed semi-heterogeneous joint embedding network architecture. Besides, by integrating the co-attention model and hybrid-loss mechanism, the proposed Semi3-Net is encouraged to learn a more discriminative embedding and invariant cross-domain representations, simultaneously. 4) The proposed Semi3-Net not only achieves superior performance over all traditional methods, but also outperforms the current best state-of-the-art deep learning method GDH [39] by 0.106 and 0.110 in MAP on Sketchy and TU-Berlin Extension, respectively. This further validates the effectiveness of the proposed Semi3-Net for the SBIR task.

TABLE I: COMPARISON WITH THE STATE-OF-THE-ART SBIR METHODS ON SKETCHY AND TU-BERLIN EXTENSION DATASETS (MAP). Compared methods: 3D shape [36], HOG [20], GF-HOG [11], SHELO [12], LKS [13], SaN [37], Siamese CNN [33], Siamese-AlexNet [34], GN Triplet [38], Triplet-AlexNet [34], DSH [34], GDH [39], and our method.
Fig. 3: Some retrieval examples for the proposed Semi3-Net on the Sketchy dataset. Incorrect retrieval results are marked with red bounding boxes.

Fig. 3 shows some retrieval examples obtained by the proposed Semi3-Net on the Sketchy dataset. Specifically, 10 relatively challenging query sketches with top-15 retrieval rank lists are presented, where the incorrect retrieval results are marked with red bounding boxes. As can be seen, the proposed method performs well for retrieving natural images with complex backgrounds, such as the queries “Umbrella” and “Racket”. This is mainly because the key information common to different domains can be effectively captured by the proposed semi-heterogeneous feature mapping part with the co-attention model. Additionally, for relatively abstract sketches, such as the sketch “Cat”, the proposed method still achieves consistently superior performance. This indicates that the semantic features of the abstract sketches are extracted by embedding the features from different domains into a high-level semantic space. In other words, although the abstract sketches only consist of black and white pixels, the proposed Semi3-Net can gain a better understanding of the sketch domain by introducing the joint semantic embedding part. However, there are also some incorrect retrieval results in Fig. 3. For example, for the query “Fish”, a dolphin is retrieved with the proposed method. Meanwhile, for the query “Duck”, several swans are retrieved. Nevertheless, the incorrectly retrieved natural images are quite similar to the query: the fish and duck sketches, which lack color and texture information, are very similar in shape to a dolphin and a swan, respectively, which makes the SBIR task more challenging.

TABLE II: EVALUATION OF THE SEMI-HETEROGENEOUS ARCHITECTURE ON THE SKETCHY DATASET (MAP). Compared settings: all sharing, only FC layer sharing, only sketch-edgemap sharing, and the proposed Semi3-Net.

To further verify the effectiveness of the proposed Semi3-Net, the t-SNE [57] visualization of natural images and sketches from ten categories of Sketchy is reported. As shown in Fig. 4, the ten categories are selected randomly from the dataset, and the labels of the selected categories are also illustrated in the upper right corner. We run the t-SNE visualization on both images and sketches together, and then separate the projected data points into Fig. 4 (a) and Fig. 4 (b), respectively. Specifically, the circles in Fig. 4 represent clusters of different categories, and the data in each circle belong to the same category in the high-level semantic space. In other words, circles in the same position of Fig. 4 (a) and Fig. 4 (b) correspond to the natural images and sketches with the same label, respectively. If data samples with the same label but from two different domains are correctly aligned in the common feature space, it indicates that the cross-domain learning is successful. As can be seen, the natural images and sketches with the same label scatter into nearly the same clusters. This further demonstrates that the proposed Semi3-Net effectively aligns sketches and natural images into a common high-level semantic space, thus improving the retrieval accuracy.
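
The visualization procedure can be sketched as follows, assuming the 256-d embeddings have already been extracted for the test images and sketches; the joint projection is split into the two panels of Fig. 4 only after t-SNE is run.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_joint(image_feats, sketch_feats):
    # project images and sketches together so both domains share one 2-D space
    joint = np.concatenate([image_feats, sketch_feats], axis=0)
    points = TSNE(n_components=2).fit_transform(joint)
    n = image_feats.shape[0]
    return points[:n], points[n:]   # projections for Fig. 4 (a) and Fig. 4 (b)
```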

Fig. 4: t-SNE visualization of the natural images and query sketches from ten categories in the Sketchy dataset, during the test phase. Symbols with the same color and shape in (a) and (b) represent the natural images and query sketches with the same label, respectively.

IV-C Ablation Study

IV-C1 Evaluation of the semi-heterogeneous architecture

To evaluate the superiority of the proposed semi-heterogeneous network architecture, networks with different weight-sharing strategies are compared with the proposed Semi3-Net. For fair comparison, the proposed network architecture is fixed when different weight-sharing strategies are verified. In other words, both the proposed co-attention model and the hybrid-loss mechanism are applied in the networks with different weight-sharing strategies. The comparison results are shown in Table II, where “all sharing” indicates the weights of the three branches are completely shared, “only FC layer sharing” indicates only the weights of the three branches are shared in the semantic embedding part, and “only sketch-edgemap sharing” indicates only the weights of the sketch branch and edgemap branch are shared in the feature mapping part. As can be seen, the method with the all sharing strategy does not perform as well as the methods with partially-shared strategies. This supports our view stated previously that the bottom discriminative features should be learned separately for different domains. Compared with the proposed Semi3-Net, the method with only FC layer sharing obtains 0.879 MAP. This is because the intrinsic relevance between the sketches and edgemaps is ignored. Meanwhile, experimental results show that the method with only sketch-edgemap sharing also results in a decrease in MAP. This is because the three branches are not fully aligned in the common high-level semantic space. Importantly, the proposed Semi3-Net, with both the FC layer sharing and sketch-edgemap sharing strategies, achieves the best performance, which proves the effectiveness of the proposed network architecture.

IV-C2 Evaluation of key components

In order to illustrate the contributions of the key components in the proposed method, i.e., the semi-heterogeneous three-way framework, the co-attention model, and the hybrid-loss mechanism, leaving-one-out evaluations are conducted on the Sketchy dataset. The experimental results are shown in Table III. Note that “w/o CAM and w/o HLM” refers to the proposed semi-heterogeneous three-way framework, which has neither a co-attention model nor a hybrid-loss mechanism. For “w/o HLM”, only two types of typical SBIR losses, the cross-entropy loss and the sketch-image contrastive loss, are used in the training phase. As shown in Table III, the proposed semi-heterogeneous three-way framework obtains 0.851 MAP, outperforming the other methods in Table I. Additionally, leaving out either CAM or HLM results in a lower MAP, which verifies that each component contributes to the overall performance. Specifically, the informative features common between natural images and the corresponding edgemaps are captured effectively by the proposed co-attention mechanism, and the three different inputs can be effectively aligned into a common feature space by introducing the hybrid-loss mechanism.

IV-D Discussions

TABLE III: EVALUATION OF KEY COMPONENTS ON THE SKETCHY DATASET (MAP). Compared settings: w/o CAM and w/o HLM, w/o HLM, w/o CAM, and the full Semi3-Net.
TABLE IV: EVALUATION OF DIFFERENT FEATURE SELECTIONS IN THE TEST PHASE ON THE SKETCHY DATASET (MAP): edgemap feature vs. natural image feature.

In this section, the impact of different feature selections in the test phase is discussed for the Sketchy dataset. As mentioned above, either features extracted from the natural image branch or the edgemap branch can be used as the final retrieval feature representation. To evaluate the impact of different feature selections in the test phase, we obtain the 256-d feature vectors from the embedding layers of the natural image branch and the edgemap branch, respectively. The experiments are conducted on the Sketchy dataset, and the experimental results are reported in Table IV. As can be seen from the table, there is a small difference between the retrieval performances when using the edgemap feature and natural image feature. This is mainly because the invariant cross-domain representations of different domains are indeed learned by the proposed method, and the cross-domain gap is narrowed by the joint embedding learning. Besides, the feature extracted from the natural image branch performs a little better than that from the edgemap branch, which is also consistent with the research in [30], [38].

In addition, to verify the performance without the edgemap branch, we conducted experiments on a two-branch framework using only the sketch and natural image branches. Considering that the sketches and natural images belong to two different domains, a non-shared setting is first exploited on the two-branch framework. Note that without the edgemap branch, neither the co-attention model nor the hybrid-loss mechanism is added to the two-branch framework. Compared with the proposed Semi3-Net with 0.916 MAP, the two-branch framework obtains 0.837 MAP on the Sketchy dataset, which sufficiently proves the effectiveness of the edgemap branch in the proposed Semi3-Net. Furthermore, a partially-shared setting is also exploited on the two-branch framework, in which the weights of the convolutional layers are independent while the weights of the fully-connected layers are shared. Compared with the non-shared setting, the partially-shared setting obtains 0.861 MAP. This further verifies the importance of the joint semantic embedding for SBIR.

V Conclusion

In this paper, we propose a novel semi-heterogeneous three-way joint embedding network, where auxiliary edgemap information is introduced as a bridge to narrow the cross-domain gap between sketches and natural images. The semi-heterogeneous feature mapping and the joint semantic embedding are proposed to learn the specific bottom features from each domain and embed different domains into a common high-level semantic space, respectively. Besides, a co-attention model is proposed to capture the informative features common between natural images and the corresponding edgemaps, by recalibrating corresponding channel-wise feature responses. In addition, a hybrid-loss mechanism is designed to construct the correlation among sketches, edgemaps, and natural images, so that the invariant cross-domain representations of different domains can be effectively learned. Experimental results on two datasets demonstrate that Semi3-Net outperforms the state-of-the-art methods, which proves the effectiveness of the proposed method.

In the future, we will focus on extending the proposed cross-domain network to fine-grained image retrieval, and learning the correspondence of the fine-grained details for sketch-image pairs. Besides, further study may also include extending our method to other cross-domain learning problems.

References

  • [1] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware co-indexing for image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2573-2587, 2015.
  • [2] D. Wu, Z. Lin, B. Li, J. Liu, and W. Wang, “Deep uniqueness-aware hashing for fine-grained multi-label image retrieval,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 1683-1687.
  • [3] B. Peng, J. Lei, H. Fu, C. Zhang, T.-S. Chua, and X. Li, “Unsupervised video action clustering via motion-scene interaction constraint,” IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2889514, 2018.
  • [4] X. Shang, H. Zhang, and T.-S. Chua, “Deep learning generic features for cross-media retrieval,” in Proc. International Conference on MultiMedia Modeling (MMM), Jun. 2018, pp. 15-24.
  • [5] J. M. Saavedra, “Rst-shelo: sketch-based image retrieval using sketch tokens and square root normalization,” Multimedia Tools and Applications, vol. 76, no. 1, pp. 931-951, 2017.
  • [6] K. Li, K. Pang, Y.-Z. Song, T. Hospedales, T. Xiang, and H. Zhan, “Synergistic instance-level subspace alignment for fine-grained sketch-based image retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5908-5921, 2017.
  • [7] G. Tolias and O. Chum, “Asymmetric feature maps with application to sketch based retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 6185-6193.
  • [8] J. Lei, K. Zheng, H. Zhang, X. Cao, N. Ling, and Y. Hou, “Sketch based image retrieval via image-aided cross domain learning,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2017, pp. 3685-3689.
  • [9] J. Song, K. Pang, Y-Z Song, T. Xiang, and T M. Hospedales, “Learning to sketch with shortcut cycle consistency,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 801-810.
  • [10] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo, “SketchMate: deep hashing for million-scale human sketch retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8090-8098.
  • [11] R. Hu, M. Barnard, and J. Collomosse, “Gradient field descriptor for sketch based retrieval and localization,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2010, pp. 1025-1028.
  • [12] J. M. Saavedra, “Sketch based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO),” in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2014, pp. 2998-3002.
  • [13] J. M. Saavedra and J. M. Barrios, “Sketch based image retrieval using learned keyshapes (LKS),” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
  • [14] D. Xu, X. Alameda-Pineda, J. Song, E. Ricci, and N. Sebe, “Academic coupled dictionary learning for sketch-based image retrieval,” in Proc. ACM International Conference on Multimedia (ACM MM), Oct. 2016, pp. 1326-1335.
  • [15] X. Qian, X. Tan, Y. Zhang, R. Hong, and M. Wang, “Enhancing sketch-based image retrieval by re-ranking and relevance feedback,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 195-208, 2016.
  • [16] S. Wang, J. Zhang, T. Han, and Z. Miao, “Sketch-based image retrieval through hypothesis-driven object boundary selection with hlr descriptor,” IEEE Transactions on Multimedia, vol. 17, no. 7, pp. 1045-1057, 2015.
  • [17] J. M. Saavedra, B. Bustos, and S. Orand, “Sketch-based image retrieval using keyshapes,” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 164.1-164.11.
  • [18] Y. Zhang, X. Qian, X. Tan, J. Han, and Y. Tang, “Sketch-based image retrieval by salient contour reinforcement,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1604-1615, 2016.
  • [19] Y. Qi, Y.-Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, and J. Guo, “Making better use of edges via perceptual grouping,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1856-1865.
  • [20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2005, pp. 886-893.
  • [21] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE International Conference on Computer Vision (ICCV), Sept. 1999, pp. 1-8.
  • [22] G. Mori, S. Belongie, and J. Malik, “Efficient shape matching using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1832-1837, 2005.
  • [23] M. Eitz, J. Hays, and M. Alexa, “How do humans sketch objects?” ACM Transactions on Graphics, vol. 31, no. 4, pp. 1-10, 2012.
  • [24] U. R. Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Learning deep sketch abstraction,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 8014-8023.
  • [25] J. A. Landay and B. A. Myers, “Sketching interfaces: toward more human interface design,” IEEE Computer, vol. 34, no. 3, pp. 56-64, 2001.
  • [26] J. Lei, L. Niu, H. Fu, B. Peng, Q. Huang, and C. Hou, “Person re-identification by semantic region representation and topology constraint,” IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2018.2866260, 2018.
  • [27] J. Cao, Y. Pang, and X. Li, “Learning multi-layer channel features for pedestrian detection,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3210-3220, 2017.
  • [28] Y. Pang, M. Sun, X. Jiang, and X. Li, “Convolution in convolution for network in network,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1587-1597, 2018.
  • [29] T. Han, H. Yao, C. Xu, X. Sun, Y. Zhang, and J. J. Corso, “Dancelets mining for video recommendation based on dance styles,” IEEE Transactions on Multimedia, vol. 19, no. 4, pp. 712-724, 2017.
  • [30] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse, “Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression,” Computers Graphics, vol. 77, pp. 77-87, 2018.
  • [31] H. Zhang, C. Zhang, and M. Wu, “Sketch-based cross-domain image retrieval via heterogeneous network,” in Proc. IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2017, pp. 1-4,
  • [32] Q. Yu, F. Liu, Y.-Z. Song, and T. Xiang, “Sketch me that shoe,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 799-807.
  • [33] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu, “Sketch-based image retrieval via siamese convolutional neural network,” in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2016, pp. 2460-2464.
  • [34] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao, “Deep sketch hashing: fast free-hand sketch-based image retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2298-2307.
  • [35] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T M. Hospedales, “Deep spatial-semantic attention for fine-grained sketch-based image retrieval,” in Proc. IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 5552-5561.
  • [36] F. Wang, L. Kang, and Y. Li, “Sketch-based 3d shape retrieval using convolutional neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1875-1883.
  • [37] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales, “Sketch-a-net that beats humans,” in Proc. British Machine Vision Conference (BMVC), Sept. 2015, pp. 1-12.
  • [38] P. Sangkloy, N. Burnell, C. Ham, and J. Hays, “The sketchy database: learning to retrieve badly drawn bunnies,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 1-12, 2016.
  • [39] J. Zhang, F. Shen, L. Liu, F. Zhu, M. Yu, L. Shao, H. T. Shen, and L. V. Gool, “Generative domain-migration hashing for sketch-to-image retrieval,” in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 297-314.
  • [40] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Conference and Workshop on Neural Information Processing Systems (NIPS), Dec. 2012, pp. 1097-1105.
  • [41] J. Song, H. Zhang, X. Li, L. Gao, M. Wang, and R. Hong, “Self-supervised video hashing with hierarchical binary auto-encoder,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3210-3221, 2018.
  • [42] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe, “Quantization based hashing: A general framework for scalable image and video retrieval,” Pattern Recognition, vol. 75, pp. 175-187, 2018.
  • [43] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: adaptive attention via a visual sentinel for image captioning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 375-383.
  • [44] M. Sun, Y. Yuan, F. Zhou, and E. Ding, “Multi-attention multi-class constraint for fine-grained image recognition,” in Proc. European Conference on Computer Vision (ECCV), Sept. 2018, pp. 805-821.
  • [45] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 842-850.
  • [46] M. Liu, X. Wang, L. Nie, X. He, B. Chen, and T.-S. Chua, “Attentive moment retrieval in videos,” in Proc. International Conference on Research on Development in Information Retrieval (SIGIR), Jun. 2018, pp. 15-24.
  • [47] H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2156-2164.
  • [48] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 3156-3164.
  • [49] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 2285-2294.
  • [50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 7132-7141.
  • [51] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Videos captioning with attention-based LSTM and semantic consistency,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045-2055, 2017.
  • [52] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive attention for visual captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2019.2894139, 2019.
  • [53] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. International Conference on Learning Representations (ICLR), May. 2015, pp. 1-14.
  • [54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: a large-scale hierarchical image database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2009, pp. 248-255.
  • [55] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: convolutional architecture for fast feature embedding,” in Proc. ACM International Conference on Multimedia (ACM MM), Nov. 2014, pp. 675-678.
  • [56] M. Leordeanu, R. Sukthankar, and C. Sminchisescu, “Efficient closed-form solution to generalized boundary detection,” in Proc. European Conference on Computer Vision (ECCV), Oct. 2012, pp. 516-529.
  • [57] L. V. D. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.

Jianjun Lei (M’11-SM’17) received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2007. He was a visiting researcher at the Department of Electrical Engineering, University of Washington, Seattle, WA, from August 2012 to August 2013. He is currently a Professor at Tianjin University, Tianjin, China. His research interests include 3D video processing, virtual reality, and artificial intelligence.

Yuxin Song received the B.S. degree in communication engineering from Hefei University of Technology, Hefei, Anhui, China, in 2017. She is currently pursuing the M.S. degree with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include image retrieval and deep learning.

Bo Peng received the M.S. degree in communication and information systems from Xidian University, Xi’an, Shaanxi, China, in 2016. Currently, she is pursuing the Ph.D. degree at the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include computer vision, image processing and video action analysis.

Zhanyu Ma has been an Associate Professor at Beijing University of Posts and Telecommunications, Beijing, China, since 2014. He is also an adjunct Associate Professor at Aalborg University, Aalborg, Denmark, since 2015. He received his Ph.D. degree in Electrical Engineering from KTH (Royal Institute of Technology), Sweden, in 2011. From 2012 to 2013, he has been a Postdoctoral research fellow in the School of Electrical Engineering, KTH, Sweden. His research interests include pattern recognition and machine learning fundamentals with a focus on applications in computer vision, multimedia signal processing, and data mining.

Ling Shao is the CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. His research interests include computer vision, machine learning, and medical imaging. He is a fellow of IAPR, IET, and BCS. He is an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems, and several other journals.

Yi-Zhe Song is a Reader of Computer Vision and Machine Learning at the Centre for Vision, Speech and Signal Processing (CVSSP), the UK's largest academic research centre for Artificial Intelligence with approximately 200 researchers. Previously, he was a Senior Lecturer at the Queen Mary University of London, and a Research and Teaching Fellow at the University of Bath. He obtained his PhD in 2008 on Computer Vision and Machine Learning from the University of Bath, and received a Best Dissertation Award for his MSc degree at the University of Cambridge in 2004, after getting a First Class Honours degree from the University of Bath in 2003. He is a Senior Member of IEEE, and a Fellow of the Higher Education Academy. He is a full member of the review college of the Engineering and Physical Sciences Research Council (EPSRC), the UK's main agency for funding research in engineering and the physical sciences, and serves as an expert reviewer for the Czech National Science Foundation.
