Instance Similarity Deep Hashing for Multi-Label Image Retrieval
Abstract
Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Recently, many deep hashing methods have been proposed and have shown largely improved performance over traditional feature-learning-based methods. Most of these methods examine the pairwise similarity on the semantic-level labels, where the pairwise similarity is generally defined in a hard-assignment way. That is, the pairwise similarity is ‘1’ if they share no less than one class label and ‘0’ if they do not share any. However, such a similarity definition cannot reflect the similarity ranking for pairwise images that hold multiple labels. In this paper, a new deep hashing method is proposed for multi-label image retrieval by redefining the pairwise similarity into an instance similarity, where the instance similarity is quantified into a percentage based on the normalized semantic labels. Based on the instance similarity, a weighted cross-entropy loss and a minimum mean square error loss are tailored for loss-function construction, and are efficiently used for simultaneous feature learning and hash coding. Experiments on three popular datasets demonstrate that the proposed method outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval.
I Introduction
With the popular use of smartphone cameras, the amount of image data has been rapidly increasing. As a result, efficient and accurate image retrieval has become more and more important to our daily life. Generally, image retrieval is based on approximate nearest neighbor (ANN) search, in which a practical image-retrieval system is often built on hashing [1]. In hashing methods, high-dimensional data are transformed into compact binary codes, and similar binary codes are expected to be generated for similar data items. Due to the encouraging efficiency in both speed and storage, a number of hashing methods have been proposed in the past decade [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
Generally, the existing hashing methods can be divided into two categories: unsupervised methods and supervised methods. The unsupervised methods use unlabeled data to generate hash functions, and focus on preserving in the Hamming space the distance similarity of the feature space. The supervised methods incorporate human-interactive annotations, e.g., pairwise similarities of semantic labels, into the learning process to improve the quality of hashing, and often outperform the unsupervised methods. In the past five years, inspired by the success of deep neural networks that show superior feature-representation power in image classification [12, 13, 14], object detection [15], face recognition [16], and many other vision tasks [17, 18], many supervised hashing methods propose to use deep neural networks for image abstraction and hash-code learning [19, 20, 21, 22, 23, 24, 25]. These so-called deep hashing methods have achieved state-of-the-art performance on several popular benchmark datasets.
While these supervised deep hashing methods have produced impressive improvements in image retrieval, to the best of our knowledge, all of them examine the similarity of pairwise images using the semantic-level labels, and define the similarity in a hard way. That is, the similarity of pairwise images is ‘1’ if they share at least one object class and ‘0’ (or ‘−1’) if they do not share any object class. However, such a similarity definition cannot reflect the similarity ranking when the pairwise images both have multiple labels. An illustrative example is shown in Fig. 1, where the images in (a), (b) and (c) share the same class label ‘sky’, so each pair of them is taken as similar in the context of image retrieval. However, as the images in (a) and (b) share three class labels, i.e., ‘sky’, ‘bridge’, and ‘water’, the similarity between them should be ranked higher than that between (a) and (c), which have only one class label in common. It can be easily observed that the traditional similarity definition does not take the multi-label information into account and cannot rank the similarity for images with multiple class labels.
To solve this problem, we present a soft definition for the pairwise similarity with regard to the semantic labels each image holds. Specifically, the pairwise similarity is quantified into a percentage using the normalized semantic labels, which we call the instance similarity. Based on the instance similarity, we propose an instance similarity deep hashing method (ISDH) to learn high-quality hash codes. According to the instance similarity matrix, we construct the loss function by jointly considering the cross-entropy loss and the minimum mean square error, for the purpose of preserving the similarity rankings. As the completely similar and dissimilar pairs are observed to contribute most to similarity preservation, a weight coefficient is assigned to the cross-entropy loss to reinforce the completely similar and dissimilar cases. We evaluate the proposed deep hashing method on three popular multi-label image datasets and obtain significantly improved performance over the state-of-the-art hashing methods in image retrieval. The contributions of this work are threefold:

We propose a soft definition for the pairwise similarity which quantifies the pairwise similarity into a percentage using the normalized semantic labels. This soft definition can reflect the similarity ranking for pairwise images that hold multiple labels.

A joint loss function of a weighted cross-entropy loss and a minimum mean square error loss is adapted for preserving the similarity rankings based on the instance similarity.

Experiments have shown that the proposed method outperforms current state-of-the-art methods on three datasets in image retrieval, which demonstrates the effectiveness of the proposed method.
The rest of this paper is organized as follows: Section II briefly reviews the related work. Section III describes the proposed instance similarity deep hashing method, which generates high-quality hash codes in a supervised learning manner. Section IV demonstrates the effectiveness of the proposed model by extensive experiments on three popular benchmark datasets, and Section V concludes our work.
II Related Work
In the past two decades, many hashing methods have been proposed for ANN search in large-scale image retrieval. Hashing-based methods transform high-dimensional data into compact binary codes with a fixed number of bits and generate similar binary codes for similar data items, which greatly reduces the storage and computation consumption. Generally, the existing hashing methods can be divided into two categories: unsupervised methods and supervised methods.
Unsupervised Methods. The unsupervised hashing methods learn hash functions to preserve in the Hamming space the similarity distance of the feature space. Locality-Sensitive Hashing (LSH) [26] is one of the most well-known representatives. LSH aims to maximize the probability that similar items will be mapped to the same buckets. Spectral Hashing (SH) [2] and [27] consider hash encoding as a spectral graph partitioning problem, and learn a nonlinear mapping to preserve the semantic similarity of the original data in the Hamming space. Iterative Quantization (ITQ) [7] searches for an orthogonal matrix by alternating optimization to learn the hash functions. Sparse Product Quantization (SPQ) [28] encodes the high-dimensional feature vectors into sparse representations by decomposing the feature space into a Cartesian product of low-dimensional subspaces and quantizing each subspace via K-means clustering, and the sparse representations are optimized by minimizing their quantization errors. [29] proposes to learn compact hash codes by computing a sort of soft assignment within the k-means framework, called “multi-k-means”, to avoid the expensive memory and computing requirements. Latent Semantic Minimal Hashing (LSMH) [30] refines the latent semantic feature embedded in the image feature based on matrix decomposition, and combines a minimum encoding loss with the latent semantic feature learning process simultaneously to obtain discriminative binary codes.
Supervised Methods. The supervised hashing methods use supervised information to learn compact hash codes, and usually achieve superior performance compared with the unsupervised methods. Binary Reconstruction Embedding (BRE) [3] constructs hash functions by minimizing the squared error loss between the original feature distances and the reconstructed Hamming distances. Semi-supervised hashing (SSH) [4] combines the characteristics of the labeled and unlabeled data to learn hash functions, where the supervised term tries to minimize the empirical error on the labeled data and the unsupervised term pursues effective regularization by maximizing the variance and independence of hash bits over the whole data. Minimal Loss Hashing (MLH) [5] learns hash functions based on structural prediction with latent variables using a hinge-like loss function. Supervised Hashing with Kernels (KSH) [6] is a kernel-based method which learns compact binary codes by maximizing the separability between similar and dissimilar pairs in the Hamming space. Online hashing [31] is also a hot research area in image retrieval. [32] proposes an online multiple kernel learning method, which aims to find the optimal combination of multiple kernels for similarity learning, and [33] improves online multi-kernel learning in a semi-supervised way, utilizing supervision information to estimate the labels of the unlabeled images by introducing a classification confidence that is also instructive for selecting reliably labeled images for training.
In the last few years, approaches built on deep neural networks have achieved state-of-the-art performance on image classification [12, 13, 14] and many other computer vision tasks. Inspired by the powerful representation ability of deep neural networks, some deep hashing methods have been proposed, which show great progress compared with traditional hand-crafted-feature-based methods. A simple way to deep hash learning is to threshold high-level features directly; a typical method is DLBHC [34], which learns hash-like representations by inserting a latent hash layer before the last classification layer in AlexNet [12]. When the network is fine-tuned well on the classification task, the features of the latent hash layer are considered discriminative, and indeed present better performance than hand-crafted features. CNNH [19] was proposed as a two-stage hashing method, which decomposes the hash learning process into a stage of learning approximate hash codes, followed by a stage of deep-network fine-tuning to learn the image features and hash functions. DNNH [21] improves the two-stage CNNH in both the image representations and hash coding by using a joint learning process. DNNH and DSRCH [22] use image triplets as the input of the deep network, and generate hash codes by minimizing a triplet ranking loss. Since the pairwise similarity is more straightforward than the triplet similarity, most of the latest deep hashing networks use pairwise labels for supervised hashing and further improve the performance of image retrieval, e.g., DHN [23], DQN [24], and DSH [25]. DSRH [20] tries to learn deep hash functions by utilizing the ranking information of multi-level similarity, and proposes a surrogate loss to solve the optimization problem of ranking measures.
DSDH [35] proposes to use both pairwise label information and classification information to learn the hash codes under a one-stream framework, and adopts an alternating minimization method to optimize the objective function and output the binary codes directly.
In this work, we aim to improve the hashing quality by exploring the diversities of pairwise semantic similarity on multi-label datasets. To the best of our knowledge, none of the previous hashing methods explore the diversities of pairwise semantic similarity on multi-label datasets. To utilize the multi-label information, we define the instance similarity based on the normalized semantic labels, and construct a joint pairwise loss function to perform simultaneous feature learning and hash-code generation.
III Instance Similarity Deep Hashing
III-A Problem Definition
Given a training set of N images X = {x_i}_{i=1}^N and a pairwise similarity matrix S = {s_ij}, the goal of hash learning for images is to learn a mapping F : X → {−1, 1}^q, so that an input image x_i can be encoded into a q-bit binary code b_i = F(x_i), with the similarities of images being preserved. The similarity label s_ij is usually defined as s_ij = 1 if x_i and x_j have a semantic label, i.e., object class label, in common and s_ij = 0 if x_i and x_j do not share any semantic label. As discussed in the introduction, this definition does not take the multi-label information into account and cannot rank the similarity for images with multiple class labels. In our design, the pairwise similarity is quantified into percentages and the similarity value is defined as
s_ij = cos(l_i, l_j)    (1)
where cos(l_i, l_j) is the cosine similarity of the pairwise label vectors, which is formulated as Eq. (2),
cos(l_i, l_j) = ⟨l_i, l_j⟩ / (‖l_i‖ ‖l_j‖)    (2)
where l_i and l_j denote the semantic label vectors of images x_i and x_j, respectively, and ⟨·,·⟩ calculates the inner product. According to Eq. (1), the similarity of pairwise images falls into three states: completely similar (s_ij = 1), partially similar (0 < s_ij < 1), and dissimilar (s_ij = 0). For approximate nearest neighbor search, we demand that the binary codes should preserve the similarity in S. To be specific, if s_ij = 1, the binary codes b_i and b_j should have a low Hamming distance; if s_ij = 0, the binary codes b_i and b_j should have a high Hamming distance; otherwise, the binary codes b_i and b_j should have a suitable Hamming distance complying with the similarity s_ij.
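As a concrete sketch of Eqs. (1)–(2), the instance similarity can be computed directly from the multi-label vectors; the four-class label set and the example vectors below are made up for illustration, not taken from the paper's datasets:

```python
import numpy as np

def instance_similarity(li, lj):
    """Pairwise instance similarity of Eqs. (1)-(2): the cosine of the
    binary multi-label vectors, a percentage-like value in [0, 1]."""
    li, lj = np.asarray(li, dtype=float), np.asarray(lj, dtype=float)
    inner = li @ lj
    if inner == 0.0:          # no shared label -> dissimilar
        return 0.0
    return inner / (np.linalg.norm(li) * np.linalg.norm(lj))

# Hypothetical labels over 4 classes, e.g. {sky, bridge, water, tree}
a = [1, 1, 1, 0]   # sky, bridge, water
b = [1, 1, 1, 0]   # same three labels  -> completely similar (s = 1)
c = [1, 0, 0, 1]   # only 'sky' shared  -> partially similar (0 < s < 1)
```

This reproduces the ranking from Fig. 1: the pair (a, b) sharing three labels scores higher than the pair (a, c) sharing one.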
Figure 2 shows the pipeline of the proposed deep hashing network for supervised hash-code learning. The proposed method accepts input images in a pairwise form (x_i, x_j, s_ij) and processes them through deep representation learning and hash coding. It includes a sub-network with multiple convolution/pooling layers to perform image abstraction, two fully-connected layers to approximate the optimal dimension-reduced representation, and a fully-connected layer to generate q-bit hash codes. In this framework, a pairwise similarity loss is introduced for similarity-preserving learning, and a quantization loss is used to control the quality of hashing. The pairwise similarity loss consists of two parts: the cross-entropy loss and the square error loss. Details will be introduced in the remainder of this section.
III-B Deep Network Architecture
Without loss of generality, we apply AlexNet as our deep architecture. This deep convolutional neural network (CNN) comprises five convolutional layers conv1–conv5 and three fully-connected layers fc6–fc8. After each hidden layer, a nonlinear mapping a^l = f(W^l a^{l−1} + v^l) is learned by the activation function f(·), where a^l is the l-th layer feature representation of the original input, and W^l and v^l are the weight and bias parameters of the l-th layer. We replace the fc8 layer of the softmax classifier in the original AlexNet with a new fully-connected hashing layer with q hidden nodes, which converts the learned deep features into low-dimensional hash codes. In order to realize hash encoding, we introduce a tanh activation function to map the output of the hashing layer to be within [−1, 1].
III-C Hash Code Learning
For efficient nearest neighbor search, the semantic similarity of the original images should be preserved in the Hamming space. Given a pair of binary codes b_i and b_j, if the pairwise images x_i and x_j do not share any object class, the Hamming distance between b_i and b_j should be large, i.e., close to q in the q-bit hash coding case; if the pairwise images x_i and x_j have some class labels in common, we expect the Hamming distance to be a small value. Previous works have shown that the inner product is a good surrogate of the Hamming distance to quantify the pairwise similarity [23, 24]. In this work, we construct a scaled inner product Ω_ij = γ⟨b_i, b_j⟩, where γ is a positive hyperparameter to control the constraint bandwidth.
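The inner-product surrogate rests on the identity dist_H(b_i, b_j) = (q − ⟨b_i, b_j⟩)/2 for codes in {−1, 1}^q, which the following minimal sketch verifies (the example codes are illustrative):

```python
import numpy as np

def hamming_from_inner(bi, bj):
    """For codes in {-1, +1}^q, the Hamming distance follows from the
    inner product: dist_H = (q - <bi, bj>) / 2."""
    bi, bj = np.asarray(bi), np.asarray(bj)
    q = bi.size
    return (q - bi @ bj) // 2

bi = np.array([1, -1, 1, 1])
bj = np.array([1, 1, 1, -1])
# the direct count of differing bits agrees with the inner-product form
assert hamming_from_inner(bi, bj) == np.sum(bi != bj)
```

A large inner product therefore means a small Hamming distance, which is why maximizing ⟨b_i, b_j⟩ for similar pairs preserves similarity.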
Given the pairwise similarity relationship S = {s_ij}, the maximum a posteriori estimation of the hash codes can be derived as:
p(H | S) ∝ p(S | H) p(H) = ∏_{s_ij ∈ S} p(s_ij | b_i, b_j) p(H)    (3)
where p(S | H) is the likelihood function, and p(H) is the prior distribution. For each pair of images, p(s_ij | b_i, b_j) is the conditional probability of s_ij given their hash codes b_i and b_j, which is defined as follows:
p(s_ij | b_i, b_j) = { σ(Ω_ij),                if s_ij = 1
                     { 1 − σ(Ω_ij),            if s_ij = 0
                     { 1 − d(s_ij, σ(Ω_ij)),   otherwise    (4)
where σ(x) = 1/(1 + e^{−x}) is the sigmoid function, which we use to transform the scaled inner product, a surrogate of the Hamming distance, into a measure of similarity; s_ij is the quantized pairwise similarity calculated by Eq. (1) and Eq. (2), and d(s_ij, σ(Ω_ij)) is the Euclidean distance between the quantized pairwise similarity and σ(Ω_ij). When the pairwise images x_i and x_j are completely similar or dissimilar, it is suitable to measure the pairwise similarity loss with the cross entropy, as formulated by Eq. (5),
L_c = −(s_ij log σ(Ω_ij) + (1 − s_ij) log(1 − σ(Ω_ij)))    (5)
Then, substituting the sigmoid function σ(Ω_ij) = 1/(1 + e^{−Ω_ij}) into Eq. (5), we get
L_c = log(1 + e^{Ω_ij}) − s_ij Ω_ij    (6)
When the pairwise images x_i and x_j are partially similar, we apply the mean square error, i.e., the Euclidean distance, to quantify the similarity error between them. Thus, the pairwise similarity loss can be defined as:
L_s = { log(1 + e^{Ω_ij}) − s_ij Ω_ij,   if s_ij = 0 or s_ij = 1
      { (σ(Ω_ij) − s_ij)^2,              otherwise    (7)
We use c_ij to mark the two conditions, where c_ij = 1 denotes that x_i and x_j are completely similar or dissimilar, and c_ij = 0 denotes that x_i and x_j are partially similar. Under the assumption that the completely similar and dissimilar pairs contribute more to the loss formulation, we use a hyperparameter W to increase the weight of the cross-entropy term. Hence, the pairwise similarity loss is rewritten as:
L_s = Σ_{s_ij ∈ S} [ W c_ij (log(1 + e^{Ω_ij}) − s_ij Ω_ij) + (1 − c_ij)(σ(Ω_ij) − s_ij)^2 ]    (8)
It is challenging to directly optimize Eq. (8), because the binary constraint b_i ∈ {−1, 1}^q requires thresholding the network outputs, which may result in the vanishing-gradient problem in back propagation during the training procedure. Following previous works [1, 25, 23], we apply a continuous relaxation to solve this problem, replacing b_i with u_i, where u_i is the output of the deep hashing network and Ω_ij = γ⟨u_i, u_j⟩ is the weighted inner product between u_i and u_j. Since the network output is not binary, we use a pairwise quantization loss to encourage the network output to be close to standard binary codes. The pairwise quantization loss is defined as
L_q = ‖ |u_i| − 1 ‖_1 + ‖ |u_j| − 1 ‖_1    (9)
where 1 is a vector of all ones, ‖·‖_1 is the L1-norm of a vector, and |·| is the element-wise absolute value operation. By integrating the pairwise similarity loss and the pairwise quantization loss, the final optimization problem is defined as
min L = L_s + λ L_q    (10)
where λ is a weight coefficient for controlling the quantization loss.
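A minimal NumPy sketch of the per-pair objective in Eq. (10), combining the weighted cross-entropy and MSE branches of Eq. (8) with the quantization loss of Eq. (9); the continuous outputs u_i, u_j stand in for the binary codes under the relaxation above, and the default hyper-parameter values here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_objective(ui, uj, s_ij, gamma=0.1, w=10.0, lam=0.1):
    """One-pair sketch of Eq. (10): weighted cross-entropy for completely
    similar/dissimilar pairs (c_ij = 1), mean square error for partially
    similar pairs (c_ij = 0), plus the quantization loss of Eq. (9)."""
    omega = gamma * np.dot(ui, uj)               # scaled inner product
    c_ij = 1.0 if s_ij in (0.0, 1.0) else 0.0    # condition indicator
    ce = np.logaddexp(0.0, omega) - s_ij * omega       # Eq. (6), stable form
    mse = (sigmoid(omega) - s_ij) ** 2                 # MSE branch of Eq. (7)
    quant = (np.abs(np.abs(ui) - 1).sum()
             + np.abs(np.abs(uj) - 1).sum())           # Eq. (9)
    return w * c_ij * ce + (1.0 - c_ij) * mse + lam * quant
```

Identical codes labeled similar incur a small loss, the same codes labeled dissimilar a large one, and outputs away from ±1 pay the quantization penalty.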
III-D Learning Algorithm
During the training process, the standard back-propagation algorithm with the mini-batch gradient descent method is used to optimize the pairwise loss function. By combining Eq. (8) and Eq. (9), we rewrite the optimization objective function as follows:
L = Σ_{s_ij ∈ S} [ W c_ij (log(1 + e^{Ω_ij}) − s_ij Ω_ij) + (1 − c_ij)(σ(Ω_ij) − s_ij)^2 + λ(‖ |u_i| − 1 ‖_1 + ‖ |u_j| − 1 ‖_1) ]    (11)
In order to employ the back-propagation algorithm to optimize the network parameters, we need to compute the derivative of the objective function. The sub-gradient of the similarity loss in Eq. (11) w.r.t. u_ik (the k-th unit of the network output u_i) can be written, for the case c_ij = 1, as:
∂L_s/∂u_ik = W γ (σ(Ω_ij) − s_ij) u_jk    (12)
and, for the case c_ij = 0,
∂L_s/∂u_ik = 2γ σ(Ω_ij)(1 − σ(Ω_ij))(σ(Ω_ij) − s_ij) u_jk    (13)
where σ(Ω_ij) = 1/(1 + e^{−Ω_ij}) and Ω_ij = γ⟨u_i, u_j⟩. The gradient of L w.r.t. z_i (the raw representation of u_i before activation) can be calculated by
∂L/∂z_i = ( ∂L/∂u_i + λ sgn(u_i) ⊙ sgn(|u_i| − 1) ) ⊙ (1 − u_i ⊙ u_i)    (14)
where sgn(·) is an element-wise sign function, ⊙ denotes the element-wise product, and z_i is the output of the hashing layer before activation. The gradient w.r.t. the network parameters is
∂L/∂W^l = (∂L/∂z^l)(a^{l−1})^T,   ∂L/∂v^l = ∂L/∂z^l    (15)
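The cross-entropy derivative used in Eq. (12) can be sanity-checked numerically; this short sketch (with illustrative values of Ω and s) verifies that d/dΩ[log(1 + e^Ω) − sΩ] = σ(Ω) − s:

```python
import numpy as np

def ce_loss(omega, s):
    """Cross-entropy term of Eq. (6): log(1 + e^omega) - s * omega."""
    return np.logaddexp(0.0, omega) - s * omega

def ce_grad(omega, s):
    """Analytic derivative w.r.t. omega, as used in Eq. (12): sigma(omega) - s."""
    return 1.0 / (1.0 + np.exp(-omega)) - s

# finite-difference check of the analytic gradient
omega, s, eps = 0.7, 0.4, 1e-6
numeric = (ce_loss(omega + eps, s) - ce_loss(omega - eps, s)) / (2 * eps)
assert abs(numeric - ce_grad(omega, s)) < 1e-6
```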
Since we have computed the sub-gradients of the hashing layer, the rest of the back-propagation procedure can be done in the standard manner. Note that, after the learning procedure, we have not yet obtained the corresponding binary codes of the input images. The network only generates approximate hash codes with values within [−1, 1]. To finally get the hash codes and evaluate the efficacy of the trained network, we take the test query data as input and forward-propagate the network to generate the hash codes by using Eq. (16),
b_i = sgn(u_i)    (16)
In this way, we can train the deep neural network in an end-to-end fashion, and any new input image can be encoded into binary codes by the trained deep hashing model. By ranking the distances of these binary hash codes in the Hamming space, we obtain efficient image retrieval.
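The binarization of Eq. (16) and the Hamming-ranking retrieval step can be sketched as follows (the query and database codes are made-up examples):

```python
import numpy as np

def binarize(u):
    """Eq. (16): threshold network outputs in [-1, 1] into binary codes.
    By convention we map 0 to +1, since sgn(0) is ambiguous."""
    return np.where(np.asarray(u) >= 0, 1, -1)

def retrieve(query_code, db_codes, topk=3):
    """Rank database items by Hamming distance to the query code,
    computed via the inner-product identity dist_H = (q - <b, b'>)/2."""
    db = np.asarray(db_codes)
    dists = (db.shape[1] - db @ query_code) // 2
    return np.argsort(dists, kind="stable")[:topk]

query = np.array([1, 1, 1, 1])
db = np.array([[1, 1, 1, 1],      # identical code   -> distance 0
               [-1, -1, -1, -1],  # opposite code    -> distance 4
               [1, 1, -1, 1]])    # one bit differs  -> distance 1
```

Here `retrieve(query, db)` returns the database indices ordered by increasing Hamming distance.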
IV Experiments and Results
IV-A Datasets
To verify the performance of the proposed method, we compare the proposed method with several baselines on three widely used benchmark datasets, i.e., NUS-WIDE, Flickr and VOC2012.
NUS-WIDE [36] is a dataset containing 269,648 public web images. It is a multi-label dataset in which each image is annotated with one or more class labels from a total of 81 classes. We follow the settings in [37, 21] to use the subset of images associated with the 21 most frequent labels, where each label associates with at least 5,000 images, resulting in a total of 195,834 images. We resize the images of this subset to 227×227.
Flickr [38] is a dataset containing 25,000 images collected from Flickr. Each image belongs to at least one of the 38 semantic labels. We resize the images to 227×227.
VOC2012 [39] is a widely used dataset for object detection and segmentation, which contains 17,125 images, each belonging to at least one of the 20 semantic labels. We resize the images to 227×227.
IV-B Implementation Details
We compare our method with several state-of-the-art hashing methods, including three unsupervised methods, LSH [26], SH [2] and ITQ [7], and six supervised methods, BRE [3], MLH [5], KSH [6], DLBHC [34], DHN [23] and DQN [24].
For NUS-WIDE, we randomly select 100 images per class to form a test query set of 2,100 images, and 500 images per class to form the training set. For Flickr and VOC2012, we randomly select 1,000 images as the test query set, and use the remaining images as the training set.
For the deep learning based methods, including DLBHC, DQN, DHN and ISDH, we directly use the image pixels as input. For the other baseline methods, we use some effective and widely used feature vectors to represent the images. Following [6, 21], each image in NUS-WIDE is represented as a 500-dimensional bag-of-words vector, and each image in Flickr as a 3,857-dimensional vector concatenating a local SIFT feature, a global GIST feature, etc. In VOC2012, each image is represented as a 7,680-dimensional feature vector [40] built on dense SIFT descriptors [41].
We implement the proposed method (ISDH) with the TensorFlow toolkit [42]. We use the AlexNet architecture [12], fine-tune the convolutional layers conv1–conv5 and fully-connected layers fc6–fc7 with network weight parameters copied from the pre-trained model, and train the hashing layer, all via back-propagation. We use mini-batch SGD with a mini-batch size of 128, and decay the learning rate every 500 iterations with a decay rate of 0.9. For fair comparison, all the deep hashing methods for comparison are implemented with TensorFlow (our code is available at https://github.com/pectinid16/ISDHTensorflow/).
In our models, there are three hyperparameters (γ, W, λ) which impact the performance of the model. γ controls the range of the inner product value after normalization. We notice that the gradient of the sigmoid function is very small at large absolute input values, which may cause gradient vanishing. In order to avoid this and accelerate convergence, we employ the parameter γ and set its value according to the code length in hashing. Empirically, we set γ so that the result of γ⟨u_i, u_j⟩ is constrained to be within [−5, 5], which is a relatively suitable range. We will discuss the effect of γ in the next subsection. We employ another parameter W to make a compromise between the cross-entropy loss and the square-error loss. In this work, we test our model with different values of W, and the test results are shown in Fig. 4. It can be seen that, as W increases, the retrieval performance improves until the value reaches 10; when its value is larger than 10, it does not bring any improvement and even suffers a decline in retrieval performance. Thus, we use W = 10 in the experiments. λ is the weight of the quantization loss. Considering that the quantization loss is less influential than the similarity loss, we assign it a small value λ = 0.1.
IV-C Metrics
We evaluate the image retrieval quality using four widely used metrics: Average Cumulative Gains (ACG) [43], Normalized Discounted Cumulative Gains (NDCG) [44], Mean Average Precision (MAP) [45] and Weighted Mean Average Precision (WAP) [20].
ACG represents the average number of shared labels between the query image and the top retrieved images. Given a query image q, the ACG score of the top n retrieved images is calculated by
ACG@n = (1/n) Σ_{i=1}^{n} C(q, i)    (17)
where n denotes the number of top retrieved images and C(q, i) is the number of shared class labels between q and the i-th retrieved image.
NDCG is a popular evaluation metric in information retrieval. Given a query image q, the DCG score of the top n retrieved images is defined as
DCG@n = Σ_{i=1}^{n} (2^{C(q,i)} − 1) / log_2(1 + i)    (18)
Then, the normalized DCG (NDCG) score at position n can be calculated by NDCG@n = DCG@n / Z_n, where Z_n is the maximum value of DCG@n, which constrains the value of NDCG to the range [0, 1].
MAP is the mean of the average precision for each query, which can be calculated by
MAP = (1/Q) Σ_{q=1}^{Q} AP(q)    (19)
where
AP(q) = (1/N_r(q)) Σ_{i=1}^{n} (N_r(q, i)/i) δ(q, i)    (20)
and δ(q, i) is an indicator function: if q and the i-th retrieved image share some class labels, δ(q, i) = 1; otherwise δ(q, i) = 0. Q is the number of queries, N_r(q, i) indicates the number of relevant images w.r.t. the query image q within the top i images, and N_r(q) = N_r(q, n).
The definition of WAP is similar to that of MAP. The only difference is that WAP computes the average ACG scores at each top retrieved image rather than the average precision. WAP can be calculated by
WAP = (1/Q) Σ_{q=1}^{Q} [ (1/N_r(q)) Σ_{i=1}^{n} ACG@i · δ(q, i) ]    (21)
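The multi-level ranking metrics of Eqs. (17)–(18) can be sketched in a few lines; the input is the list of shared-label counts C(q, i) along the ranking, and the example values are illustrative:

```python
import numpy as np

def acg_at_n(shared_labels):
    """ACG@n (Eq. (17)): average number of labels shared between the
    query and each of the top-n retrieved images."""
    r = np.asarray(shared_labels, dtype=float)
    return r.mean()

def ndcg_at_n(shared_labels):
    """NDCG@n (Eq. (18)): DCG with gain 2^C - 1 and log2 position
    discount, normalized by the ideal (descending-sorted) ordering."""
    r = np.asarray(shared_labels, dtype=float)
    pos = np.arange(1, r.size + 1)
    dcg = np.sum((2.0 ** r - 1.0) / np.log2(1.0 + pos))
    ideal = np.sum((2.0 ** np.sort(r)[::-1] - 1.0) / np.log2(1.0 + pos))
    return dcg / ideal if ideal > 0 else 0.0
```

A ranking already sorted by shared-label count scores NDCG = 1; putting the most-relevant image last lowers the score, which is exactly the ranking sensitivity that plain MAP lacks.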


Table I: MAP results of different methods on NUS-WIDE.

Methods | 12-bit | 24-bit | 36-bit | 48-bit
ISDH    | 0.6987 | 0.7208 | 0.7298 | 0.7346
DQN     | 0.6881 | 0.7109 | 0.7231 | 0.7301
DHN     | 0.6948 | 0.7022 | 0.7074 | 0.7080
DLBHC   | 0.5765 | 0.5970 | 0.6075 | 0.6194
KSH     | 0.4851 | 0.4996 | 0.5045 | 0.5068
MLH     | 0.3895 | 0.4024 | 0.4059 | 0.4091
BRE     | 0.3891 | 0.3963 | 0.3987 | 0.4015
SH      | 0.3465 | 0.3525 | 0.3557 | 0.3674
ITQ     | 0.4021 | 0.4131 | 0.4187 | 0.4219
LSH     | 0.3481 | 0.3750 | 0.3762 | 0.3917

IV-D Results
The MAP results on the three datasets are shown in Tables I to III. It can be seen that the proposed ISDH method substantially outperforms all the comparison methods on these three datasets. Compared to the best baseline among the traditional hashing methods, KSH, our method achieves an improvement of about 22.2%, 12.5% and 18.3% in average MAP over different bits on NUS-WIDE, Flickr and VOC2012, respectively. It can also be seen from Tables I to III that the deep learning methods obtain largely improved performance over the traditional methods. Compared to the state-of-the-art deep hashing method, DQN, the proposed ISDH achieves an improvement of about 0.8%, 0.3% and 1.2% in average MAP on the three datasets, respectively. These results show the advantage of the proposed method.


Table II: MAP results of different methods on Flickr.

Methods | 12-bit | 24-bit | 36-bit | 48-bit
ISDH    | 0.8130 | 0.8304 | 0.8330 | 0.8419
DQN     | 0.8068 | 0.8302 | 0.8325 | 0.8388
DHN     | 0.7985 | 0.8023 | 0.8023 | 0.8078
DLBHC   | 0.6805 | 0.7120 | 0.7160 | 0.7102
KSH     | 0.6955 | 0.7044 | 0.7093 | 0.7113
MLH     | 0.6249 | 0.6321 | 0.6335 | 0.6336
BRE     | 0.5847 | 0.5881 | 0.5901 | 0.5986
SH      | 0.5823 | 0.5856 | 0.5861 | 0.5865
ITQ     | 0.5816 | 0.5817 | 0.5826 | 0.5835
LSH     | 0.5852 | 0.5899 | 0.5854 | 0.5894



Table III: MAP results of different methods on VOC2012.

Methods | 12-bit | 24-bit | 36-bit | 48-bit
ISDH    | 0.6258 | 0.6480 | 0.6575 | 0.6654
DQN     | 0.6115 | 0.6396 | 0.6483 | 0.6501
DHN     | 0.6145 | 0.6241 | 0.6248 | 0.6308
DLBHC   | 0.4879 | 0.5163 | 0.5277 | 0.5424
KSH     | 0.4535 | 0.4667 | 0.4704 | 0.4760
MLH     | 0.3917 | 0.3990 | 0.4028 | 0.4029
BRE     | 0.3870 | 0.3951 | 0.3967 | 0.4015
SH      | 0.3953 | 0.4045 | 0.4003 | 0.3963
ITQ     | 0.3932 | 0.3986 | 0.4026 | 0.4036
LSH     | 0.3595 | 0.3619 | 0.3622 | 0.3638

Figure 5 shows the precision, ACG and NDCG curves of the compared hashing methods w.r.t. different numbers of top returned images with 12, 24, 36 and 48 bits on NUS-WIDE, respectively. On the precision metric, it can be seen that the proposed ISDH has close performance with DHN on 12 bits and outperforms all the comparison methods on 24, 36 and 48 bits w.r.t. different numbers of top returned images. On the ACG and NDCG metrics, our method is slightly lower than DHN on 12 and 24 bits; this may be because a shorter code is less effective in representing the semantic similarity of multi-label images in a large-scale dataset. As the code length increases, the performance of the proposed ISDH improves and shows an obvious advantage over the other compared methods, including DHN. The performance of DQN is relatively poorer than ISDH and DHN on this dataset, and DLBHC shows the worst results among these deep hashing methods, since it directly uses the class label as supervised information rather than the semantic similarity.
Figure 6 shows the precision, ACG and NDCG curves of the compared hashing methods w.r.t. different numbers of top returned images with 12, 24, 36 and 48 bits on Flickr, respectively. It can be seen that the proposed method achieves the state-of-the-art performance compared to the other methods. On this dataset, DHN shows a distinct disadvantage compared with ISDH, which demonstrates that our method is more robust and stable than the compared methods.
Figure 7 shows the precision, ACG and NDCG curves of the compared hashing methods w.r.t. different numbers of top returned images with 12, 24, 36 and 48 bits on VOC2012, respectively. The proposed method also achieves the best performance among the ten hashing methods.
Figures 8 and 9 show the results of MAP and WAP for different numbers of bits. In multi-label image retrieval, MAP can reflect whether two images share a class label or not, but cannot reflect how many class labels the pairwise images share with each other. In our study, high-quality retrieval results should share as many class labels as possible in the nearest retrieved images, so we also use WAP to measure the average number of shared class labels among the retrieved similar images. On Flickr, ISDH has a close performance with DQN, but is still better than the other comparison methods. On NUS-WIDE and VOC2012, the results of the ISDH method are obviously better than all the comparison methods.
Table IV: MAP results of ISDH and its variants.

Methods      | NUS-WIDE | Flickr | VOC2012
ISDH         | 0.7348   | 0.8419 | 0.6654
ISDH-w/o-MSE | 0.7312   | 0.8388 | 0.6605
ISDH-w/o-γ   | 0.7149   | 0.8141 | 0.6190
To justify the necessity of using the MSE term and the γ parameter, we conduct some comparison experiments. Table IV shows the Mean Average Precision results of ISDH and its variants. ISDH-w/o-MSE replaces the square-error loss in the previous formula with 0; in this model, only the pairwise cross-entropy loss is calculated, when the pairwise instance similarity equals 1 or 0. Compared to the standard ISDH, the results of ISDH-w/o-MSE decrease by 0.36%, 0.31% and 0.49% on NUS-WIDE, Flickr and VOC2012, respectively. ISDH-w/o-γ is the variant of ISDH in which the value of γ is set to 1. We apply this model to confirm the effectiveness of constraining the range of the inner product of the pairwise network outputs. It can be seen that without γ, the MAP results suffer a significant decrease of 1.99%, 2.78% and 4.64% on the three datasets, respectively. Such results demonstrate the significance of using the square-error loss for the partially similar situation and the hyperparameter γ for controlling the inner product.
Figure 10 shows some retrieval samples of the four deep learning methods according to the ascending Hamming ranking. We mark a retrieved image with a green box if it includes all the instances in the query image, a blue box if it includes part of the instances, and a red box if it does not include any instance in the query image. The first query image contains two semantic labels: animal and grass. We can see that among the four deep hashing methods, ISDH shows the best suitability between the retrieved images and the query image, because only ISDH’s top-20 retrieval results involve all these labels. The second query image contains two semantic labels: building and window. Among the top-20 retrieved images of each method, only ISDH does not include a wrong instance. These results suggest that the proposed method is more suitable for multi-label image retrieval.
V Conclusion
In this paper, a novel deep hashing method, ISDH, was proposed for multi-label image retrieval, in which an instance-similarity definition was introduced to quantify the pairwise similarity for images holding multiple class labels. ISDH avoids the limitation that the traditional pairwise similarity cannot encode the ranking information of multi-label images. Moreover, based on the instance similarity, a pairwise similarity loss was introduced for similarity-preserving learning, and a quantization loss was used to control the quality of hashing. The proposed deep hashing method performs effective feature learning and hash-code learning. Experiments on three multi-label datasets demonstrated that the proposed ISDH outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval.
Acknowledgement
This research was supported by the National Natural Science Foundation of China under Grants No. 61301277, No. 61572370 and No. 91546106, by the Key Research Base for Humanities and Social Sciences of the Ministry of Education Major Project under Grant No. 16JJD870002, and by the Open Research Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing under Grant 16S01. Yuewei Lin gratefully acknowledges the support of BNL LDRD 18009. The authors would like to thank the researchers for sharing the datasets used in our experiments: NUS-WIDE, Flickr and VOC2012.
References
 [1] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” arXiv preprint arXiv:1408.2927, 2014.
 [2] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
 [3] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Advances in neural information processing systems, 2009, pp. 1042–1050.
 [4] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for scalable image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424–3431.
 [5] M. Norouzi and D. M. Blei, “Minimal loss hashing for compact binary codes,” in International Conference on Machine Learning, 2011, pp. 353–360.
 [6] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
 [7] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2013.
 [8] C. Li, Q. Liu, J. Liu, and H. Lu, “Ordinal distance metric learning for image ranking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 7, p. 1551, 2015.
 [9] L. Liu and L. Shao, “Sequential compact code learning for unsupervised image hashing,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2526–2536, 2016.
 [10] G. Jie, T. Liu, Z. Sun, D. Tao, and T. Tan, “Supervised discrete hashing with relaxation,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–10, 2017.
 [11] Q. Liu, G. Liu, L. Li, X. T. Yuan, M. Wang, and W. Liu, “Reversed spectral hashing,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–9, 2017.
 [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [15] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Advances in neural information processing systems, 2013, pp. 2553–2561.
 [16] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in neural information processing systems, 2014, pp. 1988–1996.
 [17] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
 [18] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam, “Large-scale object classification using label relation graphs,” in European Conference on Computer Vision, 2014, pp. 48–64.
 [19] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning.” in AAAI Conference on Artificial Intelligence, vol. 1, 2014, pp. 2156–2162.
 [20] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
 [21] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3270–3278.
 [22] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4766–4779, 2015.
 [23] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval.” in AAAI Conference on Artificial Intelligence, 2016, pp. 2415–2421.
 [24] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen, “Deep quantization network for efficient image retrieval.” in AAAI Conference on Artificial Intelligence, 2016, pp. 3457–3463.
 [25] H. Liu, R. Wang, S. Shan, and X. Chen, “Deep supervised hashing for fast image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2064–2072.
 [26] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Annual Symposium on Computational Geometry, 2004, pp. 253–262.
 [27] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing with semantically consistent graph for image indexing,” IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141–152, 2013.
 [28] Q. Ning, J. Zhu, Z. Zhong, S. C. H. Hoi, and C. Chen, “Scalable image retrieval by sparse product quantization,” IEEE Transactions on Multimedia, vol. 19, no. 3, pp. 586–597, 2017.
 [29] S. Ercoli, M. Bertini, and A. D. Bimbo, “Compact hash codes for efficient visual descriptors retrieval in large scale databases,” IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2521–2532, 2017.
 [30] X. Lu, X. Zheng, and X. Li, “Latent semantic minimal hashing for image retrieval,” IEEE Transactions on Image Processing, 2016.
 [31] L. K. Huang, Q. Yang, and W. S. Zheng, “Online hashing,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2017.
 [32] H. Xia, S. C. H. Hoi, R. Jin, and P. Zhao, “Online multiple kernel similarity learning for visual search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 536–549, 2013.
 [33] J. Liang, Q. Hu, W. Wang, and Y. Han, “Semi-supervised online multi-kernel similarity learning for image retrieval,” IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 1077–1089, 2017.
 [34] K. Lin, H. F. Yang, J. H. Hsiao, and C. S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in Computer Vision and Pattern Recognition Workshops, 2015, pp. 27–35.
 [35] Q. Li, Z. Sun, R. He, and T. Tan, “Deep supervised discrete hashing,” arXiv preprint arXiv:1705.10999, 2017.
 [36] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in ACM International Conference on Image and Video Retrieval, 2009, p. 48.
 [37] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in International Conference on Machine Learning, 2011, pp. 1–8.
 [38] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in ACM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
 [39] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
 [40] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in European Conference on Computer Vision, 2010, pp. 143–156.
 [41] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [42] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
 [43] K. Jarvelin and J. Kekalainen, “IR evaluation methods for retrieving highly relevant documents,” in ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 41–48.
 [44] ——, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002.
 [45] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval. ACM Press New York, 1999, vol. 463.