Instance Similarity Deep Hashing for Multi-Label Image Retrieval


Zheng Zhang, Qin Zou, Qian Wang, Yuewei Lin, and Qingquan Li. Z. Zhang, Q. Zou and Q. Wang are with the School of Computer Science, Wuhan University, Wuhan 430072, P.R. China (E-mails: {zhangzheng, qzou, qianwang}@whu.edu.cn). Y. Lin is with the Computational Science Initiative, Brookhaven National Laboratory, NY 11973, USA (E-mail: ywlin@bnl.gov). Q. Li is with the Shenzhen Key Laboratory of Spatial Smart Sensing and Service, Shenzhen University, Guangdong 518060, P.R. China (E-mail: liqq@szu.edu.cn).
Abstract

Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Recently, many deep hashing methods have been proposed and have shown large performance improvements over traditional feature-learning-based methods. Most of these methods examine the pairwise similarity on the semantic-level labels, where the pairwise similarity is generally defined in a hard-assignment way. That is, the pairwise similarity of two images is ‘1’ if they share at least one class label and ‘0’ if they do not share any. However, such a similarity definition cannot reflect the similarity ranking for pairwise images that hold multiple labels. In this paper, a new deep hashing method is proposed for multi-label image retrieval by re-defining the pairwise similarity into an instance similarity, where the instance similarity is quantified into a percentage based on the normalized semantic labels. Based on the instance similarity, a weighted cross-entropy loss and a minimum mean square error loss are tailored for loss-function construction, and are efficiently used for simultaneous feature learning and hash coding. Experiments on three popular datasets demonstrate that the proposed method outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval.

image retrieval, convolutional neural network, semantic label, image ranking, deep hashing.

I Introduction

With the popular use of smartphone cameras, the amount of image data has been rapidly increasing. As a result, efficient and accurate image retrieval has become more and more important to our daily life. Generally, image retrieval is based on approximate nearest neighbor (ANN) search, and a practical image-retrieval system is often built on hashing [1]. In hashing methods, high-dimensional data are transformed into compact binary codes, and similar data items are expected to be mapped to similar binary codes. Due to the encouraging efficiency in both speed and storage, a number of hashing methods have been proposed in the past decade [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].

Generally, the existing hashing methods can be divided into two categories: unsupervised methods and supervised methods. The unsupervised methods use unlabeled data to generate hash functions. They focus on preserving in the Hamming space the distance similarity of the feature space. The supervised methods incorporate human-interactive annotations, e.g., pairwise similarities of semantic labels, into the learning process to improve the quality of hashing, and often outperform the unsupervised methods. In the past five years, inspired by the success of deep neural networks that show superior feature-representation power in image classification [12, 13, 14], object detection [15], face recognition [16], and many other vision tasks [17, 18], many supervised hashing methods have used deep neural networks for image abstraction and hash-code learning [19, 20, 21, 22, 23, 24, 25]. These so-called deep hashing methods have achieved the state-of-the-art performance on several popular benchmark datasets.

Fig. 1: An example of the traditional similarity quantization for multi-label images. As long as two images share at least one class label, their similarity is defined as ‘1’. Thus, the similarities of (a) and (b) and of (a) and (c) are both equal to ‘1’, ignoring the fact that (a) and (c) have only one label in common while (a) and (b) share three.

While these supervised deep hashing methods have produced impressive improvements in image retrieval, to the best of our knowledge, all of them examine the similarity of pairwise images using the semantic-level labels, and define the similarity in a hard way. That is, the similarity of pairwise images is ‘1’ if they share at least one object class and ‘0’ (or ‘-1’) if they do not share any object class. However, such a similarity definition cannot reflect the similarity ranking when the pairwise images both have multiple labels. An illustrative example is shown in Fig. 1. In Fig. 1, the images in (a), (b) and (c) share the same class label ‘sky’, so each pair of them is taken as similar in the context of image retrieval. However, as the images in (a) and (b) share three class labels, i.e., ‘sky’, ‘bridge’, and ‘water’, the similarity between them should be ranked higher than that between (a) and (c), which have only one class label in common. It can be easily observed that the traditional similarity definition does not take the multi-label information into account and cannot rank the similarity for images with multiple class labels.

To solve this problem, we present a soft definition for the pairwise similarity with regard to the semantic labels each image holds. Specifically, the pairwise similarity is quantified into a percentage using the normalized semantic labels, which we call the instance similarity. Based on the instance similarity, we propose an instance similarity deep hashing method (ISDH) to learn high-quality hash codes. According to the instance similarity matrix, we construct the loss function by jointly considering the cross-entropy loss and the minimum mean square error, with the purpose of preserving the similarity rankings. As the similarity preservation is observed to be contributed mostly by the completely similar and dissimilar pairs, a weight coefficient is assigned to the cross-entropy loss to reinforce the completely similar and dissimilar cases. We evaluate the proposed deep hashing method on three popular multi-label image datasets and obtain significantly improved performance over the state-of-the-art hashing methods in image retrieval. The contributions of this work are three-fold:

  1. We propose a soft definition for the pairwise similarity which quantifies the pairwise similarity into a percentage using the normalized semantic labels. This soft definition can reflect the similarity ranking for pairwise images that hold multiple labels.

  2. A joint loss function combining a weighted cross-entropy loss and a minimum mean square error loss is adopted for preserving the similarity rankings based on the instance similarity.

  3. Experiments have shown that the proposed method outperforms current state-of-the-art methods on three datasets in image retrieval, which demonstrates the effectiveness of the proposed method.

The rest of this paper is organized as follows: Section II briefly reviews the related work. Section III describes the proposed instance similarity deep hashing method which generates high-quality hash codes in a supervised learning manner. Section IV demonstrates the effectiveness of the proposed model by extensive experiments on three popular benchmark datasets, and Section V concludes our work.

II Related Work

In the past two decades, many hashing methods have been proposed for ANN search in large-scale image retrieval. Hashing-based methods transform high-dimensional data into compact binary codes with a fixed number of bits and generate similar binary codes for similar data items, which greatly reduces the storage and computation cost. Generally, the existing hashing methods can be divided into two categories: unsupervised methods and supervised methods.

Unsupervised Methods. The unsupervised hashing methods learn hash functions to preserve in the Hamming space the distance similarity of the feature space. Locality-Sensitive Hashing (LSH) [26] is one of the most well-known representatives. LSH aims to maximize the probability that similar items will be mapped to the same buckets. Spectral Hashing (SH) [2] and [27] consider hash encoding as a spectral graph partitioning problem, and learn a nonlinear mapping to preserve the semantic similarity of the original data in the Hamming space. Iterative Quantization (ITQ) [7] searches for an orthogonal matrix by alternating optimization to learn the hash functions. Sparse Product Quantization (SPQ) [28] encodes the high-dimensional feature vectors into sparse representations by decomposing the feature space into a Cartesian product of low-dimensional subspaces and quantizing each subspace via K-means clustering, and the sparse representations are optimized by minimizing their quantization errors. [29] proposes to learn compact hash codes by computing a sort of soft assignment within the k-means framework, called ”multi-k-means”, to avoid the expensive memory and computing requirements. Latent Semantic Minimal Hashing (LSMH) [30] uses matrix decomposition to refine the latent semantic features embedded in the image features, and combines a minimum encoding loss with the latent semantic feature learning process to obtain discriminative binary codes.

Supervised Methods. The supervised hashing methods use supervised information to learn compact hash codes, and usually achieve superior performance compared with the unsupervised methods. Binary Reconstruction Embedding (BRE) [3] constructs hash functions by minimizing the squared error loss between the original feature distances and the reconstructed Hamming distances. Semi-Supervised Hashing (SSH) [4] combines the characteristics of the labeled and unlabeled data to learn hash functions, where the supervised term tries to minimize the empirical error on the labeled data and the unsupervised term pursues effective regularization by maximizing the variance and independence of hash bits over the whole data. Minimal Loss Hashing (MLH) [5] learns hash functions based on structural prediction with latent variables using a hinge-like loss function. Supervised Hashing with Kernels (KSH) [6] is a kernel-based method which learns compact binary codes by maximizing the separability between similar and dissimilar pairs in the Hamming space. Online hashing [31] is also an active research area in image retrieval. [32] proposes an online multiple kernel learning method, which aims to find the optimal combination of multiple kernels for similarity learning, and [33] improves online multi-kernel learning in a semi-supervised way, utilizing supervision information to estimate the labels of the unlabeled images by introducing a classification confidence that is also instructive for selecting reliably labeled images for training.

Fig. 2: An overview of the proposed deep hashing learning method. The top frame shows the deep architecture of the neural network that produces the hash codes. The bottom frame shows the instance-similarity-guided loss function construction. Besides the quantization loss, two instance similarity losses are embedded into the loss function to handle different degrees of similarity. If the instance similarity equals 1 or 0, the cross-entropy loss is used; otherwise, the square-error loss is used.

In the last few years, approaches built on deep neural networks have achieved state-of-the-art performance on image classification [12, 13, 14] and many other computer vision tasks. Inspired by the powerful representation ability of deep neural networks, several deep hashing methods have been proposed, which show great progress compared with traditional hand-crafted-feature-based methods. A simple approach to deep hash learning is to threshold high-level features directly; a typical method is DLBHC [34], which learns hash-like representations by inserting a latent hash layer before the last classification layer in AlexNet [12]. When the network is well fine-tuned on the classification task, the features of the latent hash layer are considered discriminative, and indeed perform better than hand-crafted features. CNNH [19] was proposed as a two-stage hashing method, which decomposes the hash learning process into a stage of learning approximate hash codes, followed by a stage of deep network fine-tuning to learn the image features and hash functions. DNNH [21] improves the two-stage CNNH in both the image representations and hash coding by using a joint learning process. DNNH and DSRCH [22] use image triplets as the input of the deep network, and generate hash codes by minimizing a triplet ranking loss. Since the pairwise similarity is more straightforward than the triplet similarity, most of the latest deep hashing networks use pairwise labels for supervised hashing and further improve the performance of image retrieval, e.g., DHN [23], DQN [24], and DSH [25]. DSRH [20] tries to learn deep hash functions by utilizing the ranking information of multi-level similarity, and proposes surrogate losses to solve the optimization problem of ranking measures. DSDH [35] proposes to use both pairwise label information and classification information to learn the hash codes under a one-stream framework, and adopts an alternating minimization method to optimize the objective function and output the binary codes directly.

In this work, we aim to improve the hashing quality by exploring the diversity of pairwise semantic similarity on multi-label datasets. To the best of our knowledge, none of the previous hashing methods explore this diversity of pairwise semantic similarity on multi-label datasets. To utilize the multi-label information, we define the instance similarity based on the normalized semantic labels, and construct a joint pairwise loss function to perform simultaneous feature learning and hash-code generation.

III Instance Similarity Deep Hashing

III-A Problem Definition

Given a training set of $N$ images $\mathcal{I}=\{I_1, I_2, \dots, I_N\}$ and a pairwise similarity matrix $S=\{s_{ij}\}$, the goal of hash learning for images is to learn a mapping $\mathcal{F}: \mathcal{I} \rightarrow \{-1,1\}^{q}$, so that an input image $I_i$ can be encoded into a $q$-bit binary code $\mathbf{b}_i=\mathcal{F}(I_i)$, with the similarities of images being preserved. The similarity label $s_{ij}$ is usually defined as $s_{ij}=1$ if $I_i$ and $I_j$ have a semantic label, i.e., object class label, in common, and $s_{ij}=0$ if $I_i$ and $I_j$ do not share any semantic label. As discussed in the introduction, this definition does not take the multi-label information into account and cannot rank the similarity for images with multiple class labels. In our design, the pairwise similarity is quantified into percentages and the similarity value is defined as

$s_{ij} = \cos(\mathbf{l}_i, \mathbf{l}_j)$,   (1)

where $\cos(\mathbf{l}_i, \mathbf{l}_j)$ is the cosine similarity of the pairwise label vectors, which is formulated as Eq. (2),

$\cos(\mathbf{l}_i, \mathbf{l}_j) = \dfrac{\langle \mathbf{l}_i, \mathbf{l}_j \rangle}{\|\mathbf{l}_i\|_2\,\|\mathbf{l}_j\|_2}$,   (2)

where $\mathbf{l}_i$ and $\mathbf{l}_j$ denote the semantic label vectors of images $I_i$ and $I_j$, respectively, and $\langle \cdot,\cdot \rangle$ calculates the inner product. According to Eq. (1), the similarity of pairwise images falls into three states: completely similar, partially similar, and dissimilar. For approximate nearest neighbor search, we demand that the binary codes preserve the similarity in $S$. To be specific, if $s_{ij}=1$, the binary codes $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a low Hamming distance; if $s_{ij}=0$, the binary codes $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a high Hamming distance; otherwise, $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a moderate Hamming distance complying with the similarity $s_{ij}$.
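As a concrete illustration, the following is a minimal NumPy sketch of the instance similarity of Eqs. (1)-(2); the multi-hot label layout, the fourth label ‘building’ for image (c), and the function name are our own illustrative assumptions, not part of the released implementation.

```python
import numpy as np

def instance_similarity(label_i, label_j):
    """Instance similarity of Eqs. (1)-(2): cosine similarity of two
    multi-hot semantic label vectors, giving a value in [0, 1]."""
    li = np.asarray(label_i, dtype=np.float32)
    lj = np.asarray(label_j, dtype=np.float32)
    denom = np.linalg.norm(li) * np.linalg.norm(lj)
    if denom == 0:  # an unlabeled image is treated as dissimilar to everything
        return 0.0
    return float(np.dot(li, lj) / denom)

# Example inspired by Fig. 1, over hypothetical labels [sky, bridge, water, building]:
a = [1, 1, 1, 0]   # sky, bridge, water
b = [1, 1, 1, 0]   # sky, bridge, water
c = [1, 0, 0, 1]   # sky, building
print(instance_similarity(a, b))   # 1.0   -> completely similar
print(instance_similarity(a, c))   # ~0.41 -> partially similar
```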

Figure 2 shows the pipeline of the proposed deep hashing network for supervised hash-code learning. The proposed method accepts input images in a pairwise form $(I_i, I_j, s_{ij})$ and processes them through deep representation learning and hash coding. It includes a sub-network with multiple convolution/pooling layers to perform image abstraction, two fully-connected layers to approximate the optimal dimension-reduced representation, and a fully-connected layer to generate the $q$-bit hash codes. In this framework, a pairwise similarity loss is introduced for similarity-preserving learning, and a quantization loss is used to control the quality of hashing. The pairwise similarity loss consists of two parts – the cross-entropy loss and the square-error loss. Details are introduced in the rest of this section.

III-B Deep Network Architecture

Without loss of generality, we apply AlexNet as our deep architecture. This deep convolutional neural network (CNN) comprises five convolutional layers conv1–conv5 and three fully-connected layers fc6–fc8. After each hidden layer, a nonlinear mapping is learned by the activation function $\mathbf{a}^{l+1} = f(\mathbf{W}^{l}\mathbf{a}^{l} + \mathbf{v}^{l})$, where $\mathbf{a}^{l}$ is the $l$-th layer feature representation of the original input, and $\mathbf{W}^{l}$ and $\mathbf{v}^{l}$ are the weight and bias parameters of the $l$-th layer. We replace the fc8 layer of the softmax classifier in the original AlexNet with a new fully-connected hashing layer with $q$ hidden nodes, which converts the learned deep features into a low-dimensional hash code. In order to realize hash encoding, we introduce an activation function to map the output of the hashing layer to be within [-1,1].
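As a rough sketch of this design, the snippet below appends a fully-connected head with a bounded hashing layer to convolutional backbone features in TensorFlow/Keras. The layer sizes, the use of tanh as the bounded activation, and all names are our assumptions for illustration; they are not the authors' exact configuration.

```python
import tensorflow as tf

def build_hashing_head(q_bits=48):
    """Fully-connected head appended to conv features (e.g., an AlexNet-style
    conv1-conv5 backbone): two fc layers followed by a q-bit hashing layer
    whose bounded activation keeps outputs in [-1, 1]. Sizes are illustrative."""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(4096, activation="relu"),   # plays the role of fc6
        tf.keras.layers.Dense(4096, activation="relu"),   # plays the role of fc7
        tf.keras.layers.Dense(q_bits, activation="tanh"), # hashing layer, assumed tanh
    ])

# u = build_hashing_head(48)(conv_features)  # relaxed codes u in [-1, 1]
# b = tf.sign(u)                             # binary codes at test time (cf. Eq. (16))
```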

III-C Hash Code Learning

For efficient nearest neighbor search, the semantic similarity of the original images should be preserved in the Hamming space. Given a pair of binary codes $\mathbf{b}_i$ and $\mathbf{b}_j$, if the pairwise images $I_i$ and $I_j$ do not share any object class, the Hamming distance between $\mathbf{b}_i$ and $\mathbf{b}_j$ should be large, i.e., close to $q$ in the $q$-bit hash coding case; if the pairwise images $I_i$ and $I_j$ have some class labels in common, we expect the Hamming distance to be a small value. Previous works have shown that the inner product is a good surrogate of the Hamming distance to quantify the pairwise similarity [23, 24]. In this work, we construct a scaled inner product $\Omega_{ij} = \eta\,\langle \mathbf{b}_i, \mathbf{b}_j \rangle$, where $\eta$ is a positive hyper-parameter that controls the constraint bandwidth.

Given the pairwise similarity relationship $S=\{s_{ij}\}$, the Maximum a Posteriori (MAP) estimation of the hash codes $B=\{\mathbf{b}_i\}_{i=1}^{N}$ can be derived as:

$\log p(B \mid S) \propto \log p(S \mid B)\, p(B) = \sum_{s_{ij} \in S} \log p(s_{ij} \mid \mathbf{b}_i, \mathbf{b}_j)\, p(\mathbf{b}_i, \mathbf{b}_j)$,   (3)

where $p(S \mid B)$ is the likelihood function and $p(B)$ is the prior distribution. For each pair of images, $p(s_{ij} \mid \mathbf{b}_i, \mathbf{b}_j)$ is the conditional probability of $s_{ij}$ given their hash codes $\mathbf{b}_i$ and $\mathbf{b}_j$, which is defined as follows:

$p(s_{ij} \mid \mathbf{b}_i, \mathbf{b}_j) = \begin{cases} \sigma(\Omega_{ij}), & s_{ij}=1 \\ 1-\sigma(\Omega_{ij}), & s_{ij}=0 \\ 1-\big\| s_{ij}-\sigma(\Omega_{ij}) \big\|_2, & \text{otherwise} \end{cases}$   (4)

where $\sigma(x)=1/(1+e^{-x})$ is the sigmoid function, which we use to transform the scaled inner product, as a surrogate of the Hamming distance, into a measure of similarity in [0,1], $s_{ij}$ is the quantized pairwise similarity calculated by Eq. (1) and Eq. (2), and $\|s_{ij}-\sigma(\Omega_{ij})\|_2$ is the Euclidean distance between the quantized pairwise similarity and the sigmoid-mapped scaled inner product. When the pairwise images $I_i$ and $I_j$ are completely similar or dissimilar, it is suitable to measure the pairwise similarity loss with the cross entropy, as formulated by Eq. (5),

$L_1 = -\big( s_{ij}\log\sigma(\Omega_{ij}) + (1-s_{ij})\log(1-\sigma(\Omega_{ij})) \big)$.   (5)

Then, substituting the sigmoid function with $\sigma(x)=1/(1+e^{-x})$, we get

$L_1 = \log\big(1+e^{\Omega_{ij}}\big) - s_{ij}\,\Omega_{ij}$.   (6)

When the pairwise images $I_i$ and $I_j$ are partially similar, we apply the mean square error, i.e., the Euclidean distance, to quantify the similarity error between them. Thus, the pairwise similarity loss can be defined as:

$L_2 = \big(\sigma(\Omega_{ij}) - s_{ij}\big)^2$.   (7)

We use $c_{ij}$ to mark the two conditions, where $c_{ij}=1$ denotes that $I_i$ and $I_j$ are completely similar or dissimilar, and $c_{ij}=0$ denotes that $I_i$ and $I_j$ are partially similar. Under the assumption that the completely similar and dissimilar situations contribute more to the loss formulation, we use a hyper-parameter $\alpha$ to increase the weight of the cross-entropy term. Hence, the pairwise similarity loss is rewritten as:

$L_s = \sum_{s_{ij}\in S}\Big[ \alpha\, c_{ij}\big(\log(1+e^{\Omega_{ij}}) - s_{ij}\Omega_{ij}\big) + (1-c_{ij})\big(\sigma(\Omega_{ij}) - s_{ij}\big)^2 \Big]$.   (8)

It is challenging to directly optimize Eq. (8), because the binary constraint requires thresholding the network outputs, which may result in the vanishing-gradient problem in back propagation during the training procedure. Following previous works [1, 25, 23], we apply the continuous relaxation $\mathbf{b}_i \approx \mathbf{u}_i$ to solve this problem, where $\mathbf{u}_i \in [-1,1]^{q}$ is the output of the deep hashing network and $\Omega_{ij} = \eta\,\langle \mathbf{u}_i, \mathbf{u}_j \rangle$ is the weighted inner product between $\mathbf{u}_i$ and $\mathbf{u}_j$. Since the network output is not a binary code, we use a pairwise quantization loss to encourage the network output to be close to standard binary codes. The pairwise quantization loss is defined as

$L_q = \sum_{s_{ij}\in S}\Big( \big\|\,|\mathbf{u}_i| - \mathbf{1}\,\big\|_1 + \big\|\,|\mathbf{u}_j| - \mathbf{1}\,\big\|_1 \Big)$,   (9)

where $\mathbf{1}$ is a vector of all ones, $\|\cdot\|_1$ is the L1-norm of the vector, and $|\cdot|$ is the element-wise absolute value operation. By integrating the pairwise similarity loss and the pairwise quantization loss, the final optimization problem is defined as

$L = L_s + \lambda L_q$,   (10)

where $\lambda$ is a weight coefficient for controlling the quantization loss.
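To make the combined objective concrete, here is a small NumPy sketch of the per-pair loss under our reconstruction of Eqs. (6)-(10); the function name, the exact float comparison used to detect the completely (dis)similar cases, and the default hyper-parameter values are illustrative assumptions, not the released training code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_loss(u_i, u_j, s_ij, eta, alpha=10.0, lam=0.1):
    """Loss for one image pair, following our reconstruction of Eqs. (6)-(10).

    u_i, u_j : relaxed network outputs in [-1, 1]^q
    s_ij     : instance similarity from Eq. (1), in [0, 1]
    eta      : scale of the inner product (constraint bandwidth)
    alpha    : weight of the cross-entropy term
    lam      : weight of the quantization loss
    """
    omega = eta * np.dot(u_i, u_j)                      # scaled inner product
    if s_ij in (0.0, 1.0):                              # completely similar or dissimilar
        sim_loss = alpha * (np.log1p(np.exp(omega)) - s_ij * omega)
    else:                                               # partially similar
        sim_loss = (sigmoid(omega) - s_ij) ** 2
    # pairwise quantization loss pushes each entry of u towards -1 or +1
    quant_loss = np.sum(np.abs(np.abs(u_i) - 1)) + np.sum(np.abs(np.abs(u_j) - 1))
    return sim_loss + lam * quant_loss
```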

III-D Learning Algorithm

During the training process, the standard back-propagation algorithm with mini-batch gradient descent method is used to optimize the pairwise loss function. By combining Eq. (8) and Eq. (9), we rewrite the optimization objective function as follows:

$L = \sum_{s_{ij}\in S}\Big[ \alpha\, c_{ij}\big(\log(1+e^{\Omega_{ij}}) - s_{ij}\Omega_{ij}\big) + (1-c_{ij})\big(\sigma(\Omega_{ij}) - s_{ij}\big)^2 \Big] + \lambda \sum_{s_{ij}\in S}\Big( \big\|\,|\mathbf{u}_i| - \mathbf{1}\,\big\|_1 + \big\|\,|\mathbf{u}_j| - \mathbf{1}\,\big\|_1 \Big)$.   (11)

In order to employ the back-propagation algorithm to optimize the network parameters, we need to compute the derivative of the objective function. The sub-gradients of Eq. (11) w.r.t. $u_i^{k}$ (the $k$-th unit of the network output $\mathbf{u}_i$) can be written as:

$\dfrac{\partial L}{\partial u_i^{k}} = \alpha\,\eta\big(\sigma(\Omega_{ij}) - s_{ij}\big)\,u_j^{k} + \lambda\,\mathrm{sgn}(u_i^{k})\,\mathrm{sgn}\big(|u_i^{k}|-1\big)$, for $c_{ij}=1$,   (12)

and

$\dfrac{\partial L}{\partial u_i^{k}} = 2\eta\big(\sigma(\Omega_{ij}) - s_{ij}\big)\,\sigma(\Omega_{ij})\big(1-\sigma(\Omega_{ij})\big)\,u_j^{k} + \lambda\,\mathrm{sgn}(u_i^{k})\,\mathrm{sgn}\big(|u_i^{k}|-1\big)$, for $c_{ij}=0$,   (13)

where $\sigma(\Omega_{ij})=1/(1+e^{-\Omega_{ij}})$ and $\Omega_{ij}=\eta\,\langle \mathbf{u}_i, \mathbf{u}_j \rangle$. The gradient of $L$ w.r.t. $\mathbf{z}_i$ (the raw representation of $\mathbf{u}_i$ before activation) can be calculated by

$\dfrac{\partial L}{\partial \mathbf{z}_i} = \dfrac{\partial L}{\partial \mathbf{u}_i} \odot f'(\mathbf{z}_i)$,   (14)

where $\mathrm{sgn}(\cdot)$ is the element-wise sign function appearing in Eqs. (12) and (13), $f'(\cdot)$ is the derivative of the activation function, $\odot$ denotes the element-wise product, and $\mathbf{z}_i$ is the output of the hashing layer before activation. The gradient of the network parameters of the hashing layer is

$\dfrac{\partial L}{\partial \mathbf{W}} = \mathbf{a}_i \left(\dfrac{\partial L}{\partial \mathbf{z}_i}\right)^{\top}, \qquad \dfrac{\partial L}{\partial \mathbf{v}} = \dfrac{\partial L}{\partial \mathbf{z}_i}$,   (15)

where $\mathbf{a}_i$ is the input of the hashing layer, i.e., the output of the fc7 layer.

Since we have computed the sub-gradients of the hashing layer, the rest of the back-propagation procedure can be done in the standard manner. Note that, after the learning procedure, we have not yet obtained the corresponding binary codes of the input images. The network only generates approximate hash codes with values within [-1,1]. To finally obtain the hash codes and evaluate the efficacy of the trained network, we take the test query data as input and forward-propagate the network to generate hash codes by using Eq. (16),

$\mathbf{b}_i = \mathrm{sgn}(\mathbf{u}_i)$.   (16)

In this way, we can train the deep neural network in an end-to-end fashion, and any new input image can be encoded into a binary code by the trained deep hashing model. By ranking the distances of these binary hash codes in the Hamming space, we can obtain efficient image retrieval.
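As an illustration of this retrieval step, the sketch below ranks a database by Hamming distance to a query code, using the standard identity between the inner product and the Hamming distance for ±1 codes; the variable names and the random example data are our own assumptions.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query.

    query_code : (q,)   binary code in {-1, +1}
    db_codes   : (N, q) binary codes in {-1, +1}
    For +/-1 codes, d_H(b1, b2) = (q - <b1, b2>) / 2.
    """
    q = query_code.shape[0]
    dists = (q - db_codes @ query_code) / 2
    return np.argsort(dists), dists       # nearest codes first

# Example: codes obtained as b = sgn(u) in Eq. (16)
db = np.sign(np.random.randn(1000, 48))
qy = np.sign(np.random.randn(48))
order, d = hamming_rank(qy, db)
```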

Fig. 3: Sample images from the three datasets. From top row to the bottom row are the samples from NUS-WIDE, Flickr, and VOC2012, respectively. The labels have been given for each image as provided by the datasets.

IV Experiments and Results

IV-A Datasets

To verify the performance of the proposed method, we compare the proposed method with several baselines on three widely used benchmark datasets, i.e., NUS-WIDE, Flickr and VOC2012.

NUS-WIDE [36] is a dataset containing 269,648 public web images. It is a multi-label dataset in which each image is annotated with one or more class labels from a total of 81 classes. We follow the settings in [37, 21] to use the subset of images associated with the 21 most frequent labels, where each label is associated with at least 5,000 images, resulting in a total of 195,834 images. We resize the images of this subset to 227×227.

Flickr [38] is a dataset containing 25,000 images collected from Flickr. Each image belongs to at least one of the 38 semantic labels. We resize the images to 227×227.

VOC2012 [39] is a widely used dataset for object detection and segmentation, which contains 17,125 images, and each image belongs to at least one of the 20 semantic labels. We resize the images to 227×227.

IV-B Implementation Details

We compare our method with several state-of-the-art hashing methods, including three unsupervised methods LSH [26], SH [2] and ITQ [7], and six supervised methods BRE [3], MLH [5], KSH [6], DLBHC [34], DHN [23] and DQN [24].

For NUS-WIDE, we randomly select 100 images per class to form a test query set of 2,100 images, and 500 images per class to form the training set. For Flickr and VOC2012, we randomly select 1,000 images as the test query set, and use the remaining images as the training set.

For the deep learning based methods, including DLBHC, DQN, DHN and ISDH, we directly use the image pixels as input. For the other baseline methods, we use some effective and widely used feature vectors to represent the images. Following [6, 21], each image in NUS-WIDE is represented as a 500-dimensional bag-of-words vector, and each image in Flickr is represented as a 3,857-dimensional vector concatenating local SIFT features, global GIST features, etc. In VOC2012, each image is represented as a 7,680-dimensional feature vector [40] built on dense SIFT descriptors [41].

We implement the proposed method (ISDH) with the TensorFlow toolkit [42]. We use the AlexNet architecture [12], fine-tune the convolutional layers conv1–conv5 and the fully-connected layers fc6–fc7 with weight parameters copied from the pre-trained model, and train the hashing layer, all via back-propagation. We use mini-batch SGD with a mini-batch size of 128, and decay the learning rate after every 500 iterations with a decay rate of 0.9. For fair comparison, all deep hashing methods under comparison are implemented using TensorFlow. Our code is available at https://github.com/pectinid16/ISDH-Tensorflow/.
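For reference, the step decay described above can be written as follows; the initial learning rate is a placeholder, since the paper does not state its value.

```python
def learning_rate(step, base_lr=1e-3, decay_rate=0.9, decay_steps=500):
    """Step-wise decay: multiply the learning rate by 0.9 every 500 iterations.
    base_lr is an assumed placeholder, not a value reported in the paper."""
    return base_lr * (decay_rate ** (step // decay_steps))
```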

In our model, there are three hyper-parameters ($\eta$, $\alpha$, $\lambda$) which impact the performance of the model. $\eta$ controls the range of the inner product value after normalization. We notice that the gradient of the sigmoid function is very small for inputs with a large absolute value, which may cause gradient vanishing. In order to avoid this and accelerate convergence, we set the value of $\eta$ according to the code length. Empirically, we set $\eta = 5/q$ to constrain the result of $\Omega_{ij}$ to be within [-5,5], which is a relatively suitable range. We will discuss the effect of $\eta$ later in Section IV-D. We employ another parameter, $\alpha$, to make a compromise between the cross-entropy loss and the square-error loss. In this work, we test our model with different values of $\alpha$, and the test results are shown in Fig. 4. It can be seen that, as $\alpha$ increases, the retrieval performance improves until the value reaches 10; when its value is larger than 10, there is no further improvement and the retrieval performance even declines. Thus, we use $\alpha$ = 10 in the experiments. $\lambda$ is the weight of the quantization loss. Considering that the quantization loss is less influential than the similarity loss, we assign it a small value $\lambda$ = 0.1.

Fig. 4: MAP results for different values of $\alpha$ with 24-bit and 48-bit hash codes.

IV-C Metrics

We evaluate the image retrieval quality using four widely-used metrics: Average Cumulative Gains (ACG)  [43], Normalized Discounted Cumulative Gains (NDCG)  [44], Mean Average Precision (MAP)  [45] and Weighted Mean Average Precision (WAP)  [20].

Fig. 5: Performance of different methods on the NUS-WIDE dataset. From top to bottom, there are precision, ACG and NDCG curves w.r.t. different top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.

ACG represents the average number of shared labels between the query image and the top $n$ retrieved images. Given a query image $q$, the ACG score of the top $n$ retrieved images is calculated by

$\mathrm{ACG}@n = \dfrac{1}{n}\sum_{i=1}^{n} C(q, x_i)$,   (17)

where $n$ denotes the number of top retrieved images and $C(q, x_i)$ is the number of shared class labels between $q$ and the $i$-th retrieved image $x_i$.

NDCG is a popular evaluation metric in information retrieval. Given a query image $q$, the DCG score of the top $n$ retrieved images is defined as

$\mathrm{DCG}@n = \sum_{i=1}^{n} \dfrac{2^{C(q, x_i)} - 1}{\log_2(1+i)}$.   (18)

Then, the normalized DCG (NDCG) score at position $n$ can be calculated by $\mathrm{NDCG}@n = \mathrm{DCG}@n / Z_n$, where $Z_n$ is the maximum value of $\mathrm{DCG}@n$, which constrains the value of NDCG to the range [0,1].

MAP is the mean of the average precision (AP) over all queries, which can be calculated by

$\mathrm{MAP} = \dfrac{1}{Q}\sum_{q=1}^{Q} \mathrm{AP}(q)$,   (19)

where

$\mathrm{AP}(q) = \dfrac{1}{N_r}\sum_{i=1}^{n} P_q(i)\,\mathbb{1}(q, x_i)$,   (20)

$P_q(i)$ is the precision within the top $i$ retrieved images, and $\mathbb{1}(q, x_i)$ is an indicator function such that $\mathbb{1}(q, x_i) = 1$ if $q$ and $x_i$ share some class labels, and $\mathbb{1}(q, x_i) = 0$ otherwise. $Q$ is the number of queries and $N_r$ indicates the number of relevant images w.r.t. the query image within the top $n$ images.

The definition of WAP is similar to that of MAP. The only difference is that WAP computes the average ACG score at each top retrieved image rather than the average precision. WAP can be calculated by

$\mathrm{WAP} = \dfrac{1}{Q}\sum_{q=1}^{Q} \left( \dfrac{1}{N_r}\sum_{i=1}^{n} \mathrm{ACG}@i \cdot \mathbb{1}(q, x_i) \right)$.   (21)
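For reference, the sketch below computes the four metrics for a single query from the number of shared labels with each retrieved image, following our reconstruction of Eqs. (17)-(21); the function name and the approximation of $Z_n$ by the ideal ordering of the retrieved list are our own assumptions, not the authors' evaluation script.

```python
import numpy as np

def retrieval_metrics(shared, n):
    """ACG@n, NDCG@n, AP@n and weighted AP@n for one query.

    shared : shared-label counts C(q, x_i) with the retrieved images,
             already sorted by Hamming rank (length >= n assumed).
    """
    r = np.asarray(shared[:n], dtype=np.float64)
    rel = (r > 0).astype(np.float64)                 # relevance indicator
    pos = np.arange(1, n + 1)

    acg_at = np.cumsum(r) / pos                      # ACG@i for i = 1..n
    dcg = np.sum((2.0 ** r - 1) / np.log2(1 + pos))  # DCG@n
    ideal = np.sort(r)[::-1]                         # ideal ordering of this list (approximates Z_n)
    idcg = np.sum((2.0 ** ideal - 1) / np.log2(1 + pos))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    prec_at = np.cumsum(rel) / pos                   # precision@i
    n_rel = rel.sum()                                # relevant images within top n
    ap = np.sum(prec_at * rel) / n_rel if n_rel > 0 else 0.0    # averaged into MAP
    wap = np.sum(acg_at * rel) / n_rel if n_rel > 0 else 0.0    # averaged into WAP
    return acg_at[-1], ndcg, ap, wap
```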

 

Methods 12-bit 24-bit 36-bit 48-bit
ISDH 0.6987 0.7208 0.7298 0.7346
DQN 0.6881 0.7109 0.7231 0.7301
DHN 0.6948 0.7022 0.7074 0.7080
DLBHC 0.5765 0.5970 0.6075 0.6194
KSH 0.4851 0.4996 0.5045 0.5068
MLH 0.3895 0.4024 0.4059 0.4091
BRE 0.3891 0.3963 0.3987 0.4015
SH 0.3465 0.3525 0.3557 0.3674
ITQ 0.4021 0.4131 0.4187 0.4219
LSH 0.3481 0.3750 0.3762 0.3917

 

TABLE I: Results of Mean Average Precision (MAP) for different numbers of bits on the NUS-WIDE dataset.
Fig. 6: Performance of different methods on the Flickr dataset. From top to bottom, there are precision, ACG and NDCG curves w.r.t. different top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.
Fig. 7: Performance of different methods on the VOC2012 dataset. From top to bottom, there are precision, ACG and NDCG curves w.r.t. different numbers of top returned samples with hash codes of 12, 24, 36 and 48 bits, respectively.

Fig. 8: Comparison of the ISDH method and three deep hashing methods on MAP results.

Fig. 9: Comparison of the ISDH method and three deep hashing methods on WAP results.

IV-D Results

The MAP results on the three datasets are shown in Table I to Table III. It can be seen that the proposed ISDH method substantially outperforms all the comparison methods on these three datasets. Compared to the best traditional hashing baseline, KSH, our method achieves an improvement of about 22.2%, 12.5% and 18.3% in average MAP over different code lengths on NUS-WIDE, Flickr and VOC2012, respectively. It can also be seen from Table I to Table III that the deep learning methods obtain largely improved performance over the traditional methods. Compared to the strongest deep hashing baseline, DQN, the proposed ISDH achieves an improvement of about 0.8%, 0.3% and 1.2% in average MAP on the three datasets, respectively. These results show the advantage of the proposed method.

 

Methods 12-bit 24-bit 36-bit 48-bit
ISDH 0.8130 0.8304 0.8330 0.8419
DQN 0.8068 0.8302 0.8325 0.8388
DHN 0.7985 0.8023 0.8023 0.8078
DLBHC 0.6805 0.7120 0.7160 0.7102
KSH 0.6955 0.7044 0.7093 0.7113
MLH 0.6249 0.6321 0.6335 0.6336
BRE 0.5847 0.5881 0.5901 0.5986
SH 0.5823 0.5856 0.5861 0.5865
ITQ 0.5816 0.5817 0.5826 0.5835
LSH 0.5852 0.5899 0.5854 0.5894

 

TABLE II: Results of Mean Average Precision (MAP) for different numbers of bits on the Flickr dataset.

 

Methods 12-bit 24-bit 36-bit 48-bit
ISDH 0.6258 0.6480 0.6575 0.6654
DQN 0.6115 0.6396 0.6483 0.6501
DHN 0.6145 0.6241 0.6248 0.6308
DLBHC 0.4879 0.5163 0.5277 0.5424
KSH 0.4535 0.4667 0.4704 0.4760
MLH 0.3917 0.3990 0.4028 0.4029
BRE 0.3870 0.3951 0.3967 0.4015
SH 0.3953 0.4045 0.4003 0.3963
ITQ 0.3932 0.3986 0.4026 0.4036
LSH 0.3595 0.3619 0.3622 0.3638

 

TABLE III: Results of Mean Average Precision (MAP) for different numbers of bits on the VOC2012 dataset.

Figure 5 shows the precision, ACG and NDCG curves of the compared hashing methods w.r.t. different numbers of top returned images with 12, 24, 36 and 48 bits on NUS-WIDE, respectively. On the precision metric, it can be seen that the proposed ISDH has performance close to DHN at 12 bits and outperforms all the comparison methods at 24, 36 and 48 bits w.r.t. different numbers of top returned images. On the ACG and NDCG metrics, our method is slightly lower than DHN at 12 and 24 bits, which may be because a shorter code is less effective in representing the semantic similarity of multi-label images in a large-scale dataset. As the code length increases, the performance of the proposed ISDH improves and shows an obvious advantage over the other compared methods, including DHN. The performance of DQN is relatively poorer than ISDH and DHN on this dataset, and DLBHC shows the worst results among these deep hashing methods, since it directly uses the class labels as supervised information rather than the semantic similarity.

Figure 6 shows the precision, ACG and NDCG curves of the compared hashing methods w.r.t. different numbers of top returned images with 12, 24, 36 and 48 bits on Flickr, respectively. It can be seen that the proposed method achieves the state-of-the-art performance compared to the other methods. On this dataset, DHN shows a distinct disadvantage compared with ISDH, which demonstrates that our method is more robust and stable than the compared methods.

Figure 7 shows the precision, ACG and NDCG curves of compared hashing methods w.r.t. different numbers of top returned images with 12, 24, 36 and 48 bits on VOC2012, respectively. The proposed method also achieves the best performance among the ten hashing methods.

Figures 8 and 9 show the results of MAP and WAP for different numbers of bits. In multi-label image retrieval, MAP can reflect whether two images share a class label or not, but cannot reflect how many class labels the pairwise images share with each other. In our study, high-quality retrieval results should share as many class labels as possible with the query in the nearest retrieved images, so we also use WAP to measure the average number of shared class labels among the retrieved similar images. On Flickr, ISDH has performance close to DQN, but is still better than the other comparison methods. On NUS-WIDE and VOC2012, the results of ISDH are obviously better than all the comparison methods.

Methods NUS-WIDE Flickr VOC2012
ISDH 0.7348 0.8419 0.6654
ISDH-w/o-MSE 0.7312 0.8388 0.6605
ISDH-w/o-$\eta$ 0.7149 0.8141 0.6190
TABLE IV: The MAP results of ISDH, ISDH-w/o-MSE, and ISDH-w/o-$\eta$ with 48-bit hash codes.

To justify the necessity of using the MSE term and the $\eta$ parameter, we conduct some comparison experiments. Table IV shows the MAP results of ISDH and its variants. ISDH-w/o-MSE replaces the square-error loss in the previous formula with 0; in this model, the pairwise cross-entropy loss is calculated only when the pairwise instance similarity equals 1 or 0. Compared to the standard ISDH, the results of ISDH-w/o-MSE decrease by 0.36%, 0.31% and 0.49% on NUS-WIDE, Flickr and VOC2012, respectively. ISDH-w/o-$\eta$ is the variant of ISDH in which the value of $\eta$ is set to 1. We apply this model to confirm the effectiveness of constraining the range of the inner product of the pairwise network outputs. It can be seen that without $\eta$, the MAP results suffer a significant decrease of 1.99%, 2.78% and 4.64% on these three datasets, respectively. Such results demonstrate the significance of using the square-error loss for the partially similar cases and of the hyper-parameter $\eta$ for controlling the inner product.

Figure 10 shows some retrieval samples of the four deep learning methods according to the ascending Hamming ranking. We mark a retrieved image with a green box if it includes all the instances in the query image, a blue box if it includes some of the instances, and a red box if it does not include any instance in the query image. The first query image contains two semantic labels: animal and grass. We can see that, among these four deep hashing methods, ISDH shows the best agreement between the retrieved images and the query image, because only ISDH's top-20 retrieval results involve all these labels. The second query image contains two semantic labels: building and window. Among the top-20 retrieved images of each method, only ISDH does not include any wrong instance. This result suggests that the proposed method is more suitable for multi-label image retrieval.

Fig. 10: Top 20 retrieved images of the proposed ISDH method and three competing deep hashing methods, DHN, DQN and DLBHC, using the Hamming ranking on 48-bit hash codes. The green box indicates that the retrieved image includes all instances in the query image, the blue box indicates that the retrieved image includes some of the instances, and the red box indicates that the retrieved image does not include any instance in the query image.

V Conclusion

In this paper, a novel deep hashing method, ISDH, was proposed for multi-label image retrieval, in which an instance-similarity definition was introduced to quantify the pairwise similarity for images holding multiple class labels. ISDH avoids the limitation of the traditional pairwise similarity, which cannot encode the ranking information of multi-label images. Moreover, based on the instance similarity, a pairwise similarity loss was introduced for similarity-preserving learning, and a quantization loss was used to control the quality of hashing. The proposed deep hashing method performs effective feature learning and hash-code learning simultaneously. Experiments on three multi-label datasets demonstrated that the proposed ISDH outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval.

Acknowledgement

This research was supported by the National Natural Science Foundation of China under grant No. 61301277, No. 61572370 and No. 91546106, Key Research Base for Humanities and Social Sciences of Ministry of Education Major Project under grant No. 16JJD870002, and the Open Research Fund of State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing under Grant 16S01. Yuewei Lin gratefully acknowledges the support by BNL LDRD 18-009. The authors would like to thank the researchers for sharing these datasets used in our experiments - NUS-WIDE, Flickr and VOC2012.

References

  • [1] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” arXiv preprint arXiv:1408.2927, 2014.
  • [2] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
  • [3] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Advances in neural information processing systems, 2009, pp. 1042–1050.
  • [4] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for scalable image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424–3431.
  • [5] M. Norouzi and D. M. Blei, “Minimal loss hashing for compact binary codes,” in International Conference on Machine Learning, 2011, pp. 353–360.
  • [6] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
  • [7] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2013.
  • [8] C. Li, Q. Liu, J. Liu, and H. Lu, “Ordinal distance metric learning for image ranking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 7, p. 1551, 2015.
  • [9] L. Liu and L. Shao, “Sequential compact code learning for unsupervised image hashing,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2526–2536, 2016.
  • [10] G. Jie, T. Liu, Z. Sun, D. Tao, and T. Tan, “Supervised discrete hashing with relaxation,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–10, 2017.
  • [11] Q. Liu, G. Liu, L. Li, X. T. Yuan, M. Wang, and W. Liu, “Reversed spectral hashing,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–9, 2017.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [15] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Advances in neural information processing systems, 2013, pp. 2553–2561.
  • [16] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in neural information processing systems, 2014, pp. 1988–1996.
  • [17] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [18] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam, “Large-scale object classification using label relation graphs,” in European Conference on Computer Vision, 2014, pp. 48–64.
  • [19] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning.” in AAAI Conference on Artificial Intelligence, vol. 1, 2014, pp. 2156–2162.
  • [20] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
  • [21] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3270–3278.
  • [22] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 4766–4779, 2015.
  • [23] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval.” in AAAI Conference on Artificial Intelligence, 2016, pp. 2415–2421.
  • [24] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen, “Deep quantization network for efficient image retrieval.” in AAAI Conference on Artificial Intelligence, 2016, pp. 3457–3463.
  • [25] H. Liu, R. Wang, S. Shan, and X. Chen, “Deep supervised hashing for fast image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2064–2072.
  • [26] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Annual Symposium on Computational Geometry, 2004, pp. 253–262.
  • [27] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing with semantically consistent graph for image indexing,” IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141–152, 2013.
  • [28] Q. Ning, J. Zhu, Z. Zhong, S. C. H. Hoi, and C. Chen, “Scalable image retrieval by sparse product quantization,” IEEE Transactions on Multimedia, vol. 19, no. 3, pp. 586–597, 2017.
  • [29] S. Ercoli, M. Bertini, and A. D. Bimbo, “Compact hash codes for efficient visual descriptors retrieval in large scale databases,” IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2521–2532, 2017.
  • [30] X. Lu, X. Zheng, and X. Li, “Latent semantic minimal hashing for image retrieval,” IEEE Transactions on Image Processing, 2016.
  • [31] L. K. Huang, Q. Yang, and W. S. Zheng, “Online hashing,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2017.
  • [32] H. Xia, S. C. H. Hoi, R. Jin, and P. Zhao, “Online multiple kernel similarity learning for visual search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 536–549, 2013.
  • [33] J. Liang, Q. Hu, W. Wang, and Y. Han, “Semisupervised online multikernel similarity learning for image retrieval,” IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 1077–1089, 2017.
  • [34] K. Lin, H. F. Yang, J. H. Hsiao, and C. S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in Computer Vision and Pattern Recognition Workshops, 2015, pp. 27–35.
  • [35] Q. Li, Z. Sun, R. He, and T. Tan, “Deep supervised discrete hashing,” arXiv preprint arXiv:1705.10999, 2017.
  • [36] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: a real-world web image database from national university of singapore,” in ACM International Conference on Image and Video Retrieval, 2009, p. 48.
  • [37] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in International Conference on Machine Learning, 2011, pp. 1–8.
  • [38] M. J. Huiskes and M. S. Lew, “The mir flickr retrieval evaluation,” in ACM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
  • [39] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [40] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in European Conference on Computer Vision, 2010, pp. 143–156.
  • [41] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [42] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [43] K. Jarvelin and J. Kekalainen, “IR evaluation methods for retrieving highly relevant documents,” in ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 41–48.
  • [44] ——, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002.
  • [45] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval.   ACM press New York, 1999, vol. 463.