Query-adaptive Image Retrieval by
Deep Weighted Hashing
Abstract
Hashing methods have attracted much attention for large-scale image retrieval. Recently, some deep hashing methods have achieved promising results by taking advantage of the strong representation power of deep networks. However, existing deep hashing methods treat all hash bits equally. On one hand, a large number of images share the same distance to a query image due to the discrete Hamming distance, which raises a critical issue for image retrieval, where fine-grained rankings are very important. On the other hand, different hash bits actually contribute to image retrieval differently, and treating them equally greatly degrades the retrieval accuracy. To address these two problems, we propose the query-adaptive deep weighted hashing (QaDWH) approach, which can perform fine-grained ranking for different queries by weighted Hamming distance. First, a novel deep hashing network is proposed to jointly learn the hash codes and corresponding class-wise weights, so that the learned weights can reflect the importance of different hash bits for different image classes. Second, a query-adaptive image retrieval method is proposed, which rapidly generates hash bit weights for different query images by fusing their semantic probabilities and the learned class-wise weights. Fine-grained image retrieval is then performed by the weighted Hamming distance, which can provide more accurate ranking than the traditional Hamming distance. Experiments on four widely used datasets show that the proposed approach outperforms eight state-of-the-art hashing methods.
I Introduction
WITH the rapid growth of multimedia data on the web, retrieving relevant multimedia content from a massive database has become an urgent need, yet it remains a big challenge. Hashing methods map multimedia data into short binary codes to utilize the storage and computing efficiency of Hamming codes, and thus have been receiving increasing attention in many multimedia application scenarios, such as image retrieval [1, 2, 3, 4, 5, 6, 7, 8], video retrieval [9, 10] and cross-media retrieval [11, 12]. Generally speaking, hashing methods aim to learn mapping functions for multimedia data, so that similar data are mapped to similar binary codes. These hashing methods can be divided into two categories, namely unsupervised methods and supervised methods.
Unsupervised methods are proposed to design the hash functions without using image labels, and they can be divided into data independent methods and data dependent methods. The representative data independent method is Locality Sensitive Hashing (LSH) [1], which maps data into binary codes by random linear projections. There are several extensions of LSH, such as SIKH [13] and Multi-probe LSH [14]. Data dependent methods try to learn hash functions by analyzing the data properties, such as manifold structures and data distributions. For example, Spectral Hashing (SH) [4] is proposed to design the hash codes to be balanced and uncorrelated. Anchor Graph Hashing (AGH) [15] is proposed to use anchor graphs to discover neighborhood structures. Gong et al. propose Iterative Quantization (ITQ) [16] to learn hash functions by minimizing the quantization error of mapping data to the vertices of a binary hypercube. Topology Preserving Hashing (TPH) [17] is proposed to preserve the consistent neighborhood rankings of data points in Hamming space. Irie et al. propose Locally Linear Hashing (LLH) [18] to utilize locality-sensitive sparse coding to capture the local linear structures and then recover these structures in Hamming space.
Unsupervised methods can search nearest neighbors under a certain kind of distance metric (e.g., the $\ell_2$ distance). However, neighbors in the feature space may not be semantically similar. Therefore, supervised hashing methods are proposed, which leverage semantic information to generate effective hash codes. Binary Reconstruction Embedding (BRE) [19] is proposed to learn the hash functions by minimizing the reconstruction error between the original distances and the reconstructed distances in the Hamming space. Wang et al. propose Semi-supervised Hashing (SSH) [3] to learn hash functions by minimizing the empirical error over the labeled data while maximizing the information entropy of the generated hash codes over both labeled and unlabeled data. Liu et al. propose Supervised Hashing with Kernels (KSH) [20] to learn the hash codes by preserving the pairwise relationships between data samples provided by labels. Order Preserving Hashing (OPH) [21] and Ranking Preserving Hashing (RPH) [22] are proposed to learn hash functions by preserving the ranking information, which is obtained based on the number of shared semantic labels between images. Supervised Discrete Hashing (SDH) [23] is proposed to leverage label information to obtain hash codes by integrating hash code generation and classifier training.
Although the aforementioned unsupervised and supervised methods have achieved considerable progress, the image representations of these methods are hand-crafted features (e.g., GIST [24], Bag-of-Visual-Words [25]), which cannot well represent the images' semantic information. Inspired by the successful applications of deep networks on image classification and object detection [26], some deep hashing methods have been proposed recently to take advantage of the superior feature representation power of deep networks. Convolutional Neural Network Hashing (CNNH) [27] is a two-stage framework, which is designed to learn fixed hash codes in the first stage by preserving the pairwise semantic similarity, and learn the hash functions based on the learned hash codes in the second stage. Although the learned hash codes can guide the feature learning, the learned features cannot give feedback for learning better hash codes. To overcome the shortcomings of the two-stage learning scheme, some approaches have been proposed to perform image feature learning and hash code learning simultaneously. Lai et al. propose Network In Network Hashing (NINH) [2] to use a triplet ranking loss [28] to capture the relative similarities of images. NINH is a one-stage supervised method, thus the image representation learning and hash code learning can benefit each other in the deep architecture. Some similar ranking-based deep hashing methods [29, 30, 31] have been proposed recently, which are also designed to preserve the ranking information obtained from labels. Besides the triplet ranking based methods, some pairwise based deep hashing methods [32, 33] have also been proposed, which try to preserve the semantic similarities provided by pairwise labels.
Although deep hashing methods have achieved promising results on image retrieval, existing deep hashing methods [27, 2, 30] treat all hash bits equally. However, Hamming distances are discrete integers, so there are often a large number of images sharing equal Hamming distances to a query, which raises a critical issue for image retrieval, where fine-grained rankings are very important. An example is illustrated in Figure 1: given a query image, there can be 6 images with different hash codes that all lie within Hamming radius 1 of the query's code, while they differ in different hash bits. Existing deep hashing methods cannot perform fine-grained ranking among them. However, if we know which bits of the hash codes are more important for the query image, then we can return a better ranking list.
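To make the issue concrete, the following toy sketch (illustrative only; the bit weights below are made-up values, not weights learned by the paper's network) shows that all codes within Hamming radius 1 of a query are tied under plain Hamming distance, while bitwise weights induce a strict ranking:

```python
import numpy as np

# A 6-bit query code has exactly 6 distinct codes at Hamming radius 1.
query = np.array([1, 0, 1, 0, 1, 0])
neighbors = []
for i in range(len(query)):
    c = query.copy()
    c[i] ^= 1          # flip one bit
    neighbors.append(c)

# Plain Hamming distance: all 6 neighbors are tied at distance 1.
hamming = [int(np.sum(query ^ c)) for c in neighbors]

# Hypothetical per-bit weights break the tie: flipping an important bit
# now costs more than flipping an unimportant one.
weights = np.array([0.9, 0.2, 0.7, 0.1, 0.5, 0.3])
weighted = [float(np.sum(weights * (query ^ c))) for c in neighbors]
order = np.argsort(weighted)   # a strict fine-grained ranking now exists
```

Under the unweighted distance the six candidates are indistinguishable; under the weighted distance the candidate that flips the least important bit ranks first.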
There exist several traditional hashing methods [34, 35, 36, 37, 38, 39] that learn weights for hash codes. However, these methods adopt two-stage frameworks, which first generate hash codes by other methods (e.g., LSH, ITQ), then learn hash weights by analyzing the fixed hash codes and image features. With this two-stage scheme, the learned weights cannot give feedback for learning better hash codes, which limits the retrieval accuracy. Thus we propose the query-adaptive deep weighted hashing (QaDWH) method, which can not only learn hash codes and corresponding class-wise weights jointly, but also perform effective yet efficient query-adaptive fine-grained retrieval. To the best of our knowledge, this is the first deep hashing method that can perform query-adaptive fine-grained ranking. The main contributions of this paper can be summarized as follows:

A novel deep hashing network is designed to learn hash functions and corresponding weights jointly. In the proposed deep network, a hash layer and a class-wise weight layer are designed, of which the hash layer generates hash codes, while the class-wise weight layer learns the class-wise weights for different hash bits. On top of the hash stream, a weighted triplet ranking loss is proposed to preserve the similarity constraints. With the trained deep network, we can not only generate the binary hash codes, but also weigh the importance of each bit for different image classes.

A query-adaptive retrieval method is proposed to perform fine-grained retrieval. A query image's bitwise hash weights are first rapidly generated by fusing the learned class-wise weights and the predicted query class probabilities, so that the generated weights can reflect each query's semantic properties. Based on these weights and the generated hash codes, the weighted Hamming distance measurement is employed to perform fine-grained rankings for different query images.
Extensive experiments on four datasets show that the proposed approach achieves the best retrieval accuracy compared with eight state-of-the-art hashing methods. The rest of this paper is organized as follows: Section II briefly reviews the related work, Section III presents the proposed deep weighted hashing method, Section IV shows the experiments on four widely used image datasets, and Section V concludes this paper.
II Related Work
In this section, we briefly review related work, including some deep hashing methods proposed recently, and traditional weighted hashing methods.
II-A Deep Hashing Methods
Convolutional Neural Network Hashing (CNNH) [27] is the first deep hashing method based on convolutional neural networks (CNNs). CNNH is composed of two stages: a hash code learning stage and a hash function learning stage. Given a training image set $\mathcal{I} = \{I_1, \dots, I_n\}$, in the hash code learning stage (Stage 1), CNNH learns approximate hash codes for the training images by optimizing the following loss function:
$$\min_{H} \; \left\| S - \frac{1}{q} H H^{T} \right\|_F^2 \qquad (1)$$
where $\|\cdot\|_F$ denotes the Frobenius norm; $S = \{s_{ij}\}$ denotes the semantic similarity of image pairs in $\mathcal{I}$, in which $s_{ij} = 1$ when images $I_i$ and $I_j$ are semantically similar, otherwise $s_{ij} = -1$; and $H \in \{-1, 1\}^{n \times q}$ denotes the approximate hash codes, whose $i$-th row encodes the approximate hash code of training image $I_i$ while preserving the pairwise similarities in $S$. Equation (1) is difficult to optimize directly, thus CNNH first relaxes the integer constraints on $H$ and randomly initializes $H$, then optimizes equation (1) using a coordinate descent algorithm, which sequentially or randomly chooses one entry in $H$ to update while keeping the other entries fixed. This is equivalent to optimizing the following equation:
$$\min_{H_{\cdot j}} \; \left\| H_{\cdot j} H_{\cdot j}^{T} - \Big( q S - \sum_{k \neq j} H_{\cdot k} H_{\cdot k}^{T} \Big) \right\|_F^2 \qquad (2)$$
where $H_{\cdot j}$ and $H_{\cdot k}$ denote the $j$-th and the $k$-th column of $H$, respectively. In the hash function learning stage (Stage 2), CNNH uses deep networks to learn image features and hash functions. Specifically, CNNH adopts the deep framework in [40] as its basic network, and designs an output layer with sigmoid activation to generate $q$-bit hash codes. CNNH trains the designed deep network in a supervised way, in which the approximate hash codes learned in Stage 1 are used as the ground truth. However, CNNH is a two-stage framework, where the learned deep features in Stage 2 cannot help to optimize the approximate hash code learning in Stage 1, which limits the performance of hash learning.
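As an illustrative numerical check of Stage 1 (a sketch assuming the common $\|S - \frac{1}{q}HH^{T}\|_F^2$ form of equation (1); all values are toy data, not CNNH's actual training procedure), codes that are identical within a class and opposite across classes drive the objective to zero:

```python
import numpy as np

# Toy pairwise similarity matrix S for 4 images in 2 classes:
# s_ij = 1 for same-class pairs, -1 otherwise.
n, q = 4, 8
S = np.array([[ 1,  1, -1, -1],
              [ 1,  1, -1, -1],
              [-1, -1,  1,  1],
              [-1, -1,  1,  1]], dtype=float)

# "Perfect" approximate codes in {-1, 1}^q: identical within a class,
# opposite across classes, so (1/q) H H^T reproduces S exactly.
h_a = np.ones(q)
H = np.stack([h_a, h_a, -h_a, -h_a])

loss = np.linalg.norm(S - (H @ H.T) / q, ord='fro') ** 2
```

Any bit flip in `H` perturbs the inner products and raises the objective, which is what the coordinate descent in equation (2) exploits entry by entry.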
Different from the two-stage framework of CNNH [27], Network in Network Hashing (NINH) [2] performs image representation learning and hash code learning jointly. NINH constructs its deep framework based on the Network in Network architecture [41], with a shared sub-network composed of several stacked convolutional layers to extract image features, as well as a divide-and-encode module followed by a sigmoid activation function and a piecewise threshold function to output binary hash codes. During the learning process, instead of generating approximate hash codes in advance, NINH utilizes a triplet ranking loss function to exploit the relative similarity of training images to directly guide hash learning:
$$\ell(I, I^{+}, I^{-}) = \max\big(0, \; 1 - (\| h(I) - h(I^{-}) \|_H - \| h(I) - h(I^{+}) \|_H)\big), \quad \text{s.t.} \; h(I), h(I^{+}), h(I^{-}) \in \{0, 1\}^q \qquad (3)$$
where $I$, $I^{+}$ and $I^{-}$ specify the triplet constraint that image $I$ is more similar to image $I^{+}$ than to image $I^{-}$ based on image labels; $h(\cdot)$ denotes the binary hash code, and $\|\cdot\|_H$ denotes the Hamming distance. For easy optimization of equation (3), NINH applies two relaxation tricks: relaxing the integer constraint on the binary hash codes, and replacing the Hamming distance with the Euclidean distance.
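The relaxed triplet objective can be sketched as follows (a minimal illustration with toy code values; the margin of 1 follows equation (3), while the function and variable names are ours):

```python
import numpy as np

def triplet_ranking_loss(h, h_pos, h_neg, margin=1.0):
    """Relaxed NINH-style triplet ranking loss: binary codes relaxed to
    real values, Hamming distance replaced by squared Euclidean distance."""
    d_pos = np.sum((h - h_pos) ** 2)   # anchor vs. similar image
    d_neg = np.sum((h - h_neg) ** 2)   # anchor vs. dissimilar image
    return max(0.0, margin + d_pos - d_neg)

h     = np.array([0.9, 0.1, 0.8])   # relaxed anchor code
h_pos = np.array([0.8, 0.2, 0.9])   # similar image: small distance
h_neg = np.array([0.1, 0.9, 0.1])   # dissimilar image: large distance
loss = triplet_ranking_loss(h, h_pos, h_neg)
```

When the dissimilar pair is already farther than the similar pair by more than the margin, the hinge is inactive (loss 0); swapping the roles of the similar and dissimilar codes produces a large positive loss that drives learning.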
There are several extensions of NINH. The Bit-scalable Deep Hashing method [30] further manipulates the hash code length by weighing each bit of the hash codes. Deep Hashing Network (DHN) [33] additionally minimizes the quantization error besides the triplet ranking loss to improve retrieval precision. Deep semantic preserving and ranking-based hashing (DSRH) [29] introduces orthogonal constraints into the triplet ranking loss to make hash bits independent. The above deep hashing methods treat all hash bits equally, which leads to a coarse ranking among images with the same Hamming distance and achieves limited retrieval accuracy.
II-B Traditional Weighted Hashing Methods
Hamming distances are discrete integers that cannot provide a fine-grained ranking for images sharing the same distance to a query image. Some hash bit weighting methods [34, 35, 36, 37, 38, 39] have been proposed to address this issue. QaRank [35, 36] first learns class-specific weights by minimizing the intra-class similarity and maintaining the inter-class relations, then generates query-adaptive weights by using the labels of the top k similar images. QsRank [39] designs a ranking algorithm for PCA-based hashing methods, which uses the probability that the neighbors of a query map to a given hash code to measure that code's ranking score. WhRank [38] proposes a weighted Hamming distance ranking algorithm based on data-adaptive weights and query-sensitive bitwise weights. QRank [34, 37] learns query-adaptive weights by exploiting both the discriminative power of each hash function and their complementarity for nearest neighbor search. The aforementioned traditional weighted hashing methods are all two-stage schemes, which take hash codes generated by other methods (such as LSH, SH and ITQ) as input, then learn the weights by analyzing the fixed hash codes and image features. With this two-stage scheme, the learned weights cannot give feedback for learning better hash codes, which limits the retrieval accuracy.
III Query-adaptive Deep Weighted Hashing
Given a set of images $\mathcal{X} = \{x_1, \dots, x_n\}$, the goal of hashing methods is to learn a mapping function $h: x \to \{0, 1\}^q$, which encodes an image $x$ into a $q$-dimensional binary code in the Hamming space, while preserving the semantic similarity of images. In this section, we introduce the proposed query-adaptive deep weighted hashing (QaDWH) approach. The overall framework is shown in Figure 2: the proposed deep hashing network consists of two streams, namely the hash stream and the classification stream. In the training stage, the hash stream learns the hash functions and the associated weights, while the classification stream preserves the semantic information. In the query stage, the trained network generates compact hash codes and bitwise weights for a newly input query image, and then the fine-grained ranking can be performed efficiently by the weighted Hamming distance measurement. In the rest of this section, we first introduce the proposed deep hashing network and its training algorithm, and then demonstrate the query-adaptive image retrieval method.
III-A Deep Weighted Hashing Network
As shown in Figure 2, the proposed deep network is composed of the representation learning layers, the hash stream and the classification stream. The representation learning layers serve as a feature extractor, which is a deep network composed of several convolutional layers and fully connected layers. We adopt the VGG19 network [42] as the representation learning layers, in which the first 18 layers follow exactly the same settings in the VGG19 network. The hash stream and the classification stream are both connected with the representation learning layers.
III-A1 The Hash Stream
The hash stream is composed of two layers: the hash code learning layer and the class-wise weight layer. The hash code learning layer is a fully connected layer with $q$ neural nodes, whose outputs are the hash codes defined as:
$$h(x) = \sigma\big(W_h^{T} f(x) + b_h\big) \qquad (4)$$
where $f(x)$ denotes the deep features extracted from the representation learning layers, $\sigma(\cdot)$ is the sigmoid activation function, and $W_h$ and $b_h$ are the parameters of the hash code learning layer. Through the hash code learning layer, the image features are mapped into $[0, 1]^q$. Since these hash codes are continuous real values, we apply a thresholding function to obtain binary codes:
$$b_k(x) = \begin{cases} 1, & h_k(x) \geq 0.5 \\ 0, & \text{otherwise} \end{cases}, \quad k = 1, \dots, q \qquad (5)$$
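The two steps above can be sketched as follows (an illustrative implementation: the parameter names `W_h`, `b_h`, the 0.5 threshold for sigmoid outputs, and the toy dimensions are our assumptions, not the paper's released code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hash_layer(x, W_h, b_h):
    """Eq. (4)-style mapping: deep feature x -> continuous codes in [0, 1]^q."""
    return sigmoid(W_h.T @ x + b_h)

def binarize(h):
    """Eq. (5)-style thresholding of sigmoid outputs at 0.5."""
    return (h >= 0.5).astype(np.uint8)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                 # e.g., a VGG19 fc7 feature
W_h = rng.standard_normal((4096, 12)) * 0.01  # q = 12 hash bits
b_h = np.zeros(12)
h = hash_layer(x, W_h, b_h)                   # continuous codes in (0, 1)
b = binarize(h)                               # binary codes in {0, 1}
```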
In order to learn the class-wise hash weights, we design a class-wise weight layer connected with the hash code learning layer. The class-wise weight layer is an element-wise layer, which is also associated with the image classes. Suppose the number of image classes is $c$ and the hash code length is $q$; then the class-wise weight layer is defined as an element-wise layer with parameters $W = [w_1; \dots; w_c] \in \mathbb{R}^{c \times q}$. The output of the class-wise weight layer is defined as:
$$o(x) = w_{c(x)} \odot h(x) \qquad (6)$$
where $h(x)$ is the output hash code of image $x$, $c(x)$ is the image class index of $x$, and $\odot$ denotes the element-wise product. Here we constrain the weights to be non-negative. For training images with multiple classes, we use the average fusion of the corresponding classes' weights to perform the element-wise product. Through the class-wise weight layer, the hash code of each image is multiplied by the weights associated with its image class.
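A minimal sketch of the class-wise weight layer (toy dimensions and random weights for illustration; the multi-label average fusion follows the description above):

```python
import numpy as np

num_classes, q = 3, 8
rng = np.random.default_rng(1)
# Non-negative class-wise weights: one q-dimensional row per class.
W = np.abs(rng.standard_normal((num_classes, q)))

def weight_layer(h, class_indices):
    """Eq. (6)-style output: element-wise product of the relaxed hash code h
    with its class's weight row; multi-label images average their rows."""
    w = W[class_indices].mean(axis=0)
    return w * h

h = rng.random(q)                     # relaxed hash code in [0, 1]^q
single = weight_layer(h, [2])         # single-label image of class 2
multi = weight_layer(h, [0, 2])       # multi-label image: average fusion
```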
On top of the class-wise weight layer, we propose a weighted triplet ranking loss to train the hash stream. Consider training images $\{(x_i, y_i)\}$, where $y_i$ are the corresponding image labels. We sample a set of triplet tuples depending on the labels, $\mathcal{T} = \{(x_i, x_i^{+}, x_i^{-})\}_{i=1}^{m}$, in which $x_i$ and $x_i^{+}$ are two similar images with the same labels, while $x_i$ and $x_i^{-}$ are two dissimilar images with different labels, and $m$ is the number of sampled triplet tuples. For a triplet tuple $(x, x^{+}, x^{-})$, the weighted triplet ranking loss is defined as:
$$\mathcal{L}_w(x, x^{+}, x^{-}) = \max\big(0, \; \alpha + D_w(b(x), b(x^{+})) - D_w(b(x), b(x^{-}))\big), \quad \text{s.t.} \; b(x), b(x^{+}), b(x^{-}) \in \{0, 1\}^q \qquad (7)$$
where the constant parameter $\alpha$ defines the margin between the relative similarities of the two pairs $(x, x^{+})$ and $(x, x^{-})$. That is to say, we expect the distance of the dissimilar pair $(x, x^{-})$ to be larger than the distance of the similar pair $(x, x^{+})$ by at least $\alpha$. $D_w(\cdot, \cdot)$ denotes the weighted Hamming distance defined as:
$$D_w(b(x), b(x')) = \sum_{k=1}^{q} w_{c(x), k} \, \big( b_k(x) \oplus b_k(x') \big) \qquad (8)$$
where $c(x)$ is the class index of the anchor image $x$, and $\oplus$ denotes the XOR operation. Note that in the weighted triplet ranking loss, the weights of the anchor point $x$ are used to calculate the weighted Hamming distance. Because the anchor point acts like a query in the retrieval process, we treat the anchor point's weights as more important. Minimizing $\mathcal{L}_w$ achieves our goal of preserving the semantic ranking constraints provided by the labels.
In equation (7), the binary hash codes and the Hamming distance make it hard to optimize directly. Similar to NINH [2], the binary hash code $b(x)$ is relaxed to the continuous real-valued hash code $h(x)$, and the Hamming distance is replaced by the weighted Euclidean distance defined as:
$$D_{we}(h(x), h(x')) = \big\| w_{c(x)} \odot (h(x) - h(x')) \big\|_2^2 \qquad (9)$$
Then equation (7) can be rewritten as:
$$\mathcal{L}_w(x, x^{+}, x^{-}) = \max\big(0, \; \alpha + D_{we}(h(x), h(x^{+})) - D_{we}(h(x), h(x^{-}))\big) \qquad (10)$$
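A numerical sketch of equations (9) and (10) (toy codes, toy weights, and illustrative names): the anchor's class-wise weights can separate a triplet that plain, unweighted distances cannot.

```python
import numpy as np

def weighted_triplet_loss(h, h_pos, h_neg, w_anchor, margin=1.0):
    """Relaxed weighted triplet ranking loss: the weighted squared Euclidean
    distance of eq. (9) plugged into the hinge of eq. (10)."""
    d_pos = np.sum((w_anchor * (h - h_pos)) ** 2)
    d_neg = np.sum((w_anchor * (h - h_neg)) ** 2)
    return max(0.0, margin + d_pos - d_neg)

h     = np.array([1.0, 0.0, 1.0, 0.0])   # relaxed anchor code
h_pos = np.array([1.0, 0.0, 0.0, 1.0])   # differs only in the last two bits
h_neg = np.array([0.0, 1.0, 1.0, 0.0])   # differs only in the first two bits

# Uniform weights: both pairs differ in two bits, distances tie, loss = margin.
uniform = weighted_triplet_loss(h, h_pos, h_neg, np.ones(4))
# Anchor weights that stress the first two bits resolve the tie: loss = 0.
w_anchor = np.array([1.0, 1.0, 0.0, 0.0])
weighted = weighted_triplet_loss(h, h_pos, h_neg, w_anchor)
```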
III-A2 The Classification Stream
Besides the hash stream, we also design a classification stream connected with the representation learning layers. On one hand, jointly training the hash stream and the classification stream can improve the retrieval accuracy, which has been shown in previous work [27]. On the other hand, the trained classification stream can be used to generate the query-adaptive hash code weights, which will be introduced in the next part. In the classification stream, a fully connected layer with $c$ neural nodes is connected with the representation learning layers, which predicts the probability of each class. Then a softmax loss is used to train the classification stream:
$$\mathcal{L}_c(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log p_{ij} \qquad (11)$$
where $\theta$ denotes the parameters of the network, $n$ is the number of images in one batch, $y_{ij}$ denotes whether image $x_i$ belongs to class $j$, and $p_{ij}$ is the predicted probability of image $x_i$ belonging to class $j$. Note that this is not a standard softmax loss, but a multi-label softmax loss, which can handle images with multiple labels. When only one element of $y_i$ is 1, the above equation is equal to the standard softmax loss. Incorporating the hash stream and the classification stream, the network can not only preserve the ranking information and semantic information, but also learn the bitwise hash weights for different image classes.
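The loss in equation (11) can be sketched as follows (our own minimal implementation; the paper's exact normalization for multi-label images may differ):

```python
import numpy as np

def multilabel_softmax_loss(logits, labels):
    """Cross-entropy between softmax probabilities and (possibly multi-hot)
    label indicators, averaged over the batch. labels[i, j] = 1 iff image i
    belongs to class j."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(labels * np.log(p), axis=1))

# Single-label case: reduces to the standard softmax cross-entropy.
loss = multilabel_softmax_loss(np.array([[2.0, 1.0, 0.0]]),
                               np.array([[1.0, 0.0, 0.0]]))
```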
III-A3 Network Training
Forward and backward propagation schemes are used in the training phase. For the two streams in the network, we use a co-training method to tune the network parameters jointly. More specifically, in the forward propagation stage, the ranking error in the hash stream is measured by equation (10), and the classification error in the classification stream is measured by equation (11). Then, in the backward propagation stage, the network parameters are tuned by the gradient of each loss function. For the weighted triplet ranking loss, the gradients with respect to $h(x)$, $h(x^{+})$ and $h(x^{-})$ are computed as:
$$\frac{\partial \mathcal{L}_w}{\partial h(x)} = 2\, w_{c(x)}^{2} \odot \big( h(x^{-}) - h(x^{+}) \big) \times \mathbb{1}_{[\alpha + D_{we}(h(x), h(x^{+})) - D_{we}(h(x), h(x^{-})) > 0]}$$
$$\frac{\partial \mathcal{L}_w}{\partial h(x^{+})} = 2\, w_{c(x)}^{2} \odot \big( h(x^{+}) - h(x) \big) \times \mathbb{1}_{[\alpha + D_{we}(h(x), h(x^{+})) - D_{we}(h(x), h(x^{-})) > 0]}$$
$$\frac{\partial \mathcal{L}_w}{\partial h(x^{-})} = 2\, w_{c(x)}^{2} \odot \big( h(x) - h(x^{-}) \big) \times \mathbb{1}_{[\alpha + D_{we}(h(x), h(x^{+})) - D_{we}(h(x), h(x^{-})) > 0]} \qquad (12)$$
where $w_{c(x)}^{2}$ denotes the element-wise square of the anchor's weight vector, and $\mathbb{1}_{[\pi]}$ is an indicator function: $\mathbb{1}_{[\pi]} = 1$ if $\pi$ is true, and $\mathbb{1}_{[\pi]} = 0$ otherwise. The gradient of each image is then fed into the network to update the parameters of each layer, including the hash layer and the weight layer.
For the softmax loss, the gradient with respect to the softmax input $z_{ij}$ is calculated as:
$$\frac{\partial \mathcal{L}_c}{\partial z_{ij}} = \frac{1}{n} \left( p_{ij} - y_{ij} \right) \qquad (13)$$
By equations (12) and (13), these derivative values can be fed into the network via the back-propagation algorithm to update the parameters of each layer in the deep network. The training procedure ends when the loss converges or a predefined maximal iteration number is reached. We briefly summarize the training process in Algorithm 1. Note that after the network is trained, we obtain not only the hash mapping functions, but also the hash weight associated with each bit.
III-B Query-adaptive Image Retrieval
In the query stage, existing deep hashing methods [27, 2, 30] treat each hash bit equally: they usually first map the query image to binary hash codes, then retrieve images in the database by Hamming distance. However, Hamming distances are discrete values, which cannot provide a fine-grained ranking, since a large number of images may share the same distance to a query image. To address this issue, we propose the query-adaptive image retrieval approach. For a given query image $x_q$, we first generate the real-valued hash code $h(x_q)$ from the output of the hash layer, then the binary code $b(x_q)$ is generated by equation (5).
In order to perform query-adaptive fine-grained ranking, besides the hash codes, we also generate query-adaptive hash weights efficiently. From the trained network, we have already obtained the class-wise hash bit weights for the different image classes. The query-adaptive weights are generated rapidly as:
$$w_q = \sum_{j=1}^{c} p_j \, w_j \qquad (14)$$
where $p = (p_1, \dots, p_c)$ is the predicted probability vector generated by the classification stream, in which $p_j$ indicates the probability that $x_q$ belongs to image class $j$, and $w_j$ is the learned weight vector of class $j$. Equation (14) means that we fuse the class-wise weights by the probability of the query belonging to each class, thus the generated hash bit weights can reflect the semantic property of the query image. Fine-grained image ranking can then be performed by the weighted Hamming distance between the query and any image $x$ in the database:
$$D_{w_q}(b(x_q), b(x)) = \sum_{k=1}^{q} w_{q,k} \, \big( b_k(x_q) \oplus b_k(x) \big) \qquad (15)$$
where $q$ is the length of the hash codes. We summarize the query-adaptive retrieval method in Algorithm 2. Note that the proposed query-adaptive image retrieval method is also very fast compared with the original Hamming distance measurement. Equation (14) is a simple matrix multiplication, which is efficient. And in practice, the weighted Hamming distance only needs to be computed on a subset of hash codes, since we can first sort by Hamming distance quickly using the XOR operation, and then compute the weighted distance within a subset of small Hamming distance. Thus the additional computation is very small compared with the original Hamming distance ranking, and effective yet efficient fine-grained ranking can be performed.
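The whole query-stage pipeline can be sketched as follows (a toy end-to-end illustration: random codes, random class-wise weights, and a made-up probability vector stand in for the trained network's outputs; the radius-4 short list is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
num_classes, q, n_db = 3, 16, 1000

W = np.abs(rng.standard_normal((num_classes, q)))   # learned class-wise weights
p = np.array([0.7, 0.2, 0.1])                       # query's class probabilities
w_query = p @ W                                     # eq. (14): fused bit weights

query = rng.integers(0, 2, q)                       # query's binary code
database = rng.integers(0, 2, (n_db, q))            # database binary codes

xor = database ^ query                              # bitwise disagreement
hamming = xor.sum(axis=1)                           # coarse integer distances
candidates = np.flatnonzero(hamming <= 4)           # small-radius short list
weighted = (xor[candidates] * w_query).sum(axis=1)  # eq. (15) on the subset
ranking = candidates[np.argsort(weighted)]          # fine-grained final order
```

Only the short list is re-ranked with the weighted distance, so the extra cost over plain XOR/popcount ranking stays small, matching the efficiency argument above.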
IV Experiments
In this section, we introduce our experiments conducted on four widely used datasets: CIFAR10, NUS-WIDE, MIRFLICKR and ImageNet. We compare with eight state-of-the-art methods in terms of retrieval accuracy and efficiency to verify the effectiveness of our QaDWH approach. In addition, we also conduct baseline experiments to verify the separate contributions of the proposed deep weighted hashing and query-adaptive retrieval.
IV-A Datasets and Experimental Settings
We conduct experiments on four widely used image retrieval datasets. Each dataset is split into a query set, a database and a training set; we summarize the split of each dataset in Table I, and the detailed settings are as follows:

CIFAR10 dataset consists of 60,000 color images from 10 classes, each of which has 6,000 images. Following [2, 27], 1,000 images are randomly selected as the query set (100 images per class). For the unsupervised methods, all the rest images are used as the training set. For the supervised methods, 5,000 images (500 images per class) are further randomly selected from the rest of images to form the training set.

NUS-WIDE [43] dataset contains nearly 270,000 images, each image is associated with one or multiple labels from 81 semantic concepts. Following [2, 27], only the 21 most frequent concepts are used, where each concept has at least 5,000 images, resulting in a total of 166,047 images. 2,100 images are randomly selected as the query set (100 images per concept). For the unsupervised methods, all the rest images are used as the training set. For the supervised methods, 500 images from each of the 21 concepts are randomly selected to form the training set of total 10,500 images.

MIRFLICKR [44] dataset consists of 25,000 images collected from Flickr, and each image is associated with one or multiple labels of 38 semantic concepts. 1,000 images are randomly selected as the query set. For the unsupervised methods, all the rest images are used as the training set. For the supervised methods, 5,000 images are randomly selected from the rest of images to form the training set.

ImageNet [45] dataset contains 1,000 categories with 1.2 million images. ImageNet is a large dataset that can comprehensively evaluate the proposed approach and the compared methods. Since the testing set of ImageNet is not publicly available, following [2], we use the provided training set as the retrieval database and the validation set as the query set. For the training set of each hashing method, we further randomly sample 50,000 images from the retrieval database (50 images per class).
TABLE I: Split of each dataset

          CIFAR10   NUS-WIDE   MIRFLICKR   ImageNet
Query     1,000     2,100      1,000       50,000
Database  54,000    153,447    19,000      1,231,167
Training  5,000     10,500     5,000       50,000
IV-B Evaluation Metrics and Compared Methods
To objectively and comprehensively evaluate the retrieval accuracy of the proposed approach and the compared methods, we use 5 evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves, precision curves of top k retrieved samples, precision within top 500 retrieved samples, and precision within Hamming radius 2. These evaluation metrics are defined as follows:

Mean Average Precision (MAP): MAP presents an overall measurement of the retrieval performance. MAP for a set of queries is the mean of average precision (AP) for each query, where AP is defined as:
$$AP = \frac{1}{R} \sum_{k=1}^{n} \frac{R_k}{k} \times rel(k) \qquad (16)$$
where $n$ is the size of the database set, $R$ is the number of images in the database relevant to the query, $R_k$ is the number of relevant images in the top $k$ returns, and $rel(k) = 1$ if the image ranked at the $k$-th position is relevant and 0 otherwise.

Precision-Recall curves: The precision at certain levels of recall; we calculate Precision-Recall curves over all returned results.

Precision curves of top k retrieved samples: The average precision of top k returned images for each query.

Precision within top 500 retrieved samples: The average precision of the top 500 returned images for each query.

Precision within Hamming radius 2: Precision curve of returned images with the Hamming distance smaller than 2 using hash lookup.
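The AP definition in equation (16) can be sketched as follows (an illustrative implementation, not the paper's evaluation code; MAP is then simply the mean of AP over all queries):

```python
import numpy as np

def average_precision(relevant, R=None):
    """AP as in eq. (16): `relevant` is a 0/1 list over the ranked returns,
    R the total number of relevant images in the database (defaults to the
    number of relevant items appearing in the ranking)."""
    relevant = np.asarray(relevant, dtype=float)
    if R is None:
        R = int(relevant.sum())
    hits = np.cumsum(relevant)                 # R_k: relevant items in top k
    k = np.arange(1, len(relevant) + 1)
    return float(np.sum((hits / k) * relevant) / R)

# Ranked returns with relevant images at positions 1, 3 and 4:
# AP = (1/3) * (1/1 + 2/3 + 3/4) = 29/36 ≈ 0.806
ap = average_precision([1, 0, 1, 1, 0])
```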
We compare the proposed QaDWH approach with eight state-of-the-art methods, including the unsupervised methods LSH, SH and ITQ, the supervised methods SDH, CNNH, NINH and DRSCH, and the traditional query-adaptive hashing method QRank. Brief introductions of these 8 methods are listed below:

LSH [1] is a data independent unsupervised method, which uses randomly generated hash functions to map image features into binary codes.

SH [4] is a data dependent unsupervised method, which learns hash functions by making hash codes balanced and uncorrelated.

ITQ [16] is also a data dependent unsupervised method, which learns hash functions by minimizing the quantization error of mapping data to the vertices of a binary hypercube.

SDH [23] is a supervised method, which leverages label information to obtain hash codes by integrating hash code generation and classifier training.

CNNH [27] is a twostage deep hashing method, which learns hash codes for training images in first stage, and trains a deep hashing network in second stage.

NINH [2] is a onestage deep hashing method, which learns deep hashing network by a triplet loss function to measure the ranking information provided by labels.

DRSCH [30] is also a triplet loss based deep hashing method, which can further leverage hash code length by weighing each bit of hash codes.

QRank [34] is a traditional query-adaptive hashing method, which learns query-adaptive hash weights by exploiting both the discriminative power of each hash function and their complementarity for nearest neighbor search. QRank is a state-of-the-art query-adaptive hashing method, which outperforms other weighted hashing methods (e.g., QsRank [39], WhRank [38]).
Methods  CIFAR10 (MAP)  

12bit  24bit  32bit  48bit  
QaDWH (ours)  0.868  0.883  0.884  0.884 
NINH-QRank [34]  0.800  0.822  0.835  0.832
NINH [2]  0.792  0.818  0.832  0.830 
DRSCH [30]  0.820  0.852  0.850  0.851 
CNNH [27]  0.683  0.692  0.667  0.623 
SDH-VGG19  0.430  0.652  0.653  0.665
ITQ-VGG19  0.339  0.361  0.368  0.375
SH-VGG19  0.244  0.213  0.213  0.209
LSH-VGG19  0.133  0.171  0.178  0.198
SDH [23]  0.255  0.330  0.344  0.360 
ITQ [16]  0.158  0.163  0.168  0.169 
SH [4]  0.124  0.125  0.125  0.126 
LSH [1]  0.116  0.121  0.124  0.131 
IV-C Implementation Details
We implement the proposed approach based on the open-source framework Caffe [46]. The parameters of the first 18 layers in our network are initialized with the VGG19 network [42], which is pre-trained on the ImageNet dataset [45]. A similar initialization strategy has been used in other deep hashing methods [29, 33]. For the weight layer, we initialize all the weights to 1, because we treat each bit equally at the beginning of training. In all experiments, our network is trained with an initial learning rate of 0.001, and we decrease the learning rate by a factor of 10 every 20,000 steps. The mini-batch size is 64, and the weight decay parameter is 0.0005. The margin is the only parameter of our proposed loss function, and we use the same value in all experiments.
For the proposed QaDWH and the compared methods CNNH, NINH and DRSCH, we use raw image pixels as input. The implementations of CNNH and DRSCH are provided by their authors, while NINH is our own implementation. Since the original representation learning layers of CNNH, NINH and DRSCH differ from each other, for a fair comparison we use the same VGG19 network as the base structure for all deep hashing methods, and the network parameters of all the deep hashing methods are initialized with the same pre-trained VGG19 model.
The query-adaptive method QRank uses image features and hash codes generated by other hashing methods as input. In order to compare QRank with the proposed QaDWH approach fairly, we use the hash codes and features generated by the deep hashing method NINH as the input of QRank, and thus denote the result of QRank as NINH-QRank. The implementation of QRank is provided by its author.
For the other compared traditional methods without deep networks, we represent each image by hand-crafted features and deep features respectively. For hand-crafted features, we represent images in CIFAR-10 and MIRFLICKR by 512-dimensional GIST features, and images in NUS-WIDE by 500-dimensional bag-of-words features. For a fair comparison between traditional methods and deep hashing methods, we also conduct experiments on the traditional methods with features extracted from deep networks, where we extract a 4096-dimensional deep feature for each image from the pre-trained VGG19 network. We denote the results of traditional methods using deep features by LSH-VGG19, SH-VGG19, ITQ-VGG19 and SDH-VGG19. The results of SDH, SH and ITQ are obtained from the implementations provided by their authors, while the results of LSH are from our own implementation.
NUS-WIDE (MAP)
Methods  12-bit  24-bit  32-bit  48-bit
QaDWH (ours)  0.867  0.879  0.884  0.882
NINH-QRank [34]  0.813  0.836  0.835  0.833
NINH [2]  0.808  0.827  0.827  0.827
DRSCH [30]  0.814  0.829  0.832  0.824
CNNH [27]  0.768  0.784  0.790  0.740
SDH-VGG19  0.730  0.797  0.819  0.830
ITQ-VGG19  0.777  0.800  0.806  0.817
SH-VGG19  0.712  0.697  0.689  0.682
LSH-VGG19  0.518  0.567  0.618  0.651
SDH [23]  0.460  0.510  0.519  0.525
ITQ [16]  0.472  0.478  0.483  0.476
SH [4]  0.452  0.445  0.443  0.437
LSH [1]  0.436  0.414  0.432  0.442
IV-D Experiment Results and Analysis
IV-D1 Experiment results on the CIFAR-10 dataset
Table II shows the MAP scores with different lengths of hash codes on the CIFAR-10 dataset. Overall, the proposed QaDWH achieves the highest average MAP of 0.880 and consistently outperforms the state-of-the-art methods on all hash code lengths. More specifically, compared with the best-performing deep hashing method DRSCH, which achieves an average MAP of 0.843, the proposed QaDWH has an absolute improvement of 0.037. Compared with the best traditional method using deep features, SDH-VGG19, which achieves an average MAP of 0.600, the proposed method has an absolute improvement of 0.280. The best traditional method using hand-crafted features, SDH, achieves an average MAP of 0.322, over which the proposed approach improves by 0.558. Compared with the traditional weighted hashing method QRank, which achieves an average MAP of 0.822, the proposed QaDWH has an absolute improvement of 0.058. This is because QaDWH benefits from the joint training of hash codes and the corresponding class-wise weights, while QRank can only learn the weights and cannot give feedback for learning better hash codes.
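For reference, the average precision underlying these MAP scores can be computed as follows (a minimal sketch of the standard metric, not the paper's evaluation code; it assumes binary relevance labels listed in ranked order):

```python
def average_precision(relevance):
    """AP for one query: relevance is a 0/1 list in ranked retrieval order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(relevance_lists):
    """MAP: mean of the per-query average precisions."""
    return sum(map(average_precision, relevance_lists)) / len(relevance_lists)
```

For example, a ranked list with relevance [1, 0, 1] gives AP = (1/1 + 2/3) / 2 = 5/6.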
Figure 3(a) shows the precisions within Hamming radius 2 using hash lookup. The precision of the proposed QaDWH consistently outperforms the state-of-the-art methods on all hash code lengths, because QaDWH benefits from the joint training scheme and can generate better hash codes. The precision within the top 500 retrieved samples is shown in Figure 3(b); the proposed QaDWH still achieves the highest precision, which demonstrates the effectiveness of fine-grained ranking. Figure 3(c) shows the precision curves for different numbers of retrieved samples with 48-bit hash codes, where the proposed QaDWH achieves the highest accuracy. Figure 3(d) shows the precision-recall curves using Hamming ranking with 48-bit codes. QaDWH still achieves the best accuracy at all recall levels, which further shows the effectiveness of the proposed approach.
MIRFLICKR (MAP)
Methods  12-bit  24-bit  32-bit  48-bit
QaDWH (ours)  0.791  0.804  0.805  0.802
NINH-QRank [34]  0.777  0.761  0.765  0.781
NINH [2]  0.772  0.756  0.760  0.778
DRSCH [30]  0.780  0.789  0.774  0.788
CNNH [27]  0.763  0.757  0.758  0.744
SDH-VGG19  0.732  0.739  0.737  0.747
ITQ-VGG19  0.686  0.685  0.687  0.689
SH-VGG19  0.618  0.604  0.598  0.595
LSH-VGG19  0.575  0.584  0.604  0.614
SDH [23]  0.595  0.601  0.608  0.605
ITQ [16]  0.576  0.579  0.579  0.580
SH [4]  0.561  0.562  0.563  0.562
LSH [1]  0.557  0.564  0.562  0.569
ImageNet (MAP)
Methods  12-bit  24-bit  32-bit  48-bit
QaDWH (ours)  0.090  0.212  0.245  0.298
NINH-QRank [34]  0.078  0.170  0.198  0.208
NINH [2]  0.076  0.162  0.197  0.236
DRSCH [30]  0.064  0.175  0.188  0.176
CNNH [27]  0.076  0.151  0.204  0.230
SDH-VGG19  0.075  0.182  0.216  0.261
ITQ-VGG19  0.054  0.151  0.201  0.268
SH-VGG19  0.052  0.147  0.201  0.263
LSH-VGG19  0.027  0.079  0.110  0.182
IV-D2 Experiment results on the NUS-WIDE dataset
Table III shows the MAP scores with different lengths of hash codes on the NUS-WIDE dataset. Following [2, 27], we calculate the MAP scores based on the top 5,000 returned images. Similar results can be observed on NUS-WIDE: the proposed QaDWH still achieves the best MAP scores (average 0.878). QaDWH achieves an absolute improvement of 0.053 in average MAP over the best-performing deep hashing method DRSCH (average 0.825). Compared with the best traditional method using deep features, SDH-VGG19, which achieves an average MAP of 0.794, QaDWH has an absolute improvement of 0.084. It is also interesting to observe that, with the deep features extracted from the VGG19 network, the traditional method SDH achieves results comparable to the deep hashing methods. Compared with QRank (average 0.829), the proposed QaDWH still achieves an absolute improvement of 0.049, which shows the advantage of jointly training hash codes and the corresponding weights.
Figures 4(a)-(d) show the retrieval accuracy on NUS-WIDE. Similarly, the proposed QaDWH achieves the best accuracy on all four evaluation metrics, owing to the joint training scheme and the fine-grained ranking for different queries.
Average testing time (ms)
Methods  CIFAR-10  NUS-WIDE  MIRFLICKR  ImageNet
QaDWH (ours)  9.29  10.48  9.22  24.77
NINH-QRank  65.03  66.24  64.88  115.27
DRSCH  9.11  10.04  9.14  23.98
NINH  8.83  9.40  9.04  15.78
CNNH  8.93  9.53  9.04  15.94
SDH-VGG19  9.07  9.45  8.92  15.17
ITQ-VGG19  8.91  9.39  8.76  15.33
SH-VGG19  8.95  9.41  8.78  15.10
LSH-VGG19  8.90  9.39  8.76  15.15
MAP (12-bit / 24-bit / 32-bit / 48-bit)
Methods  CIFAR-10  NUS-WIDE  MIRFLICKR  ImageNet
QaDWH (ours)  0.868/0.883/0.884/0.884  0.867/0.879/0.884/0.882  0.791/0.804/0.805/0.802  0.090/0.212/0.245/0.298
DWH  0.856/0.873/0.879/0.856  0.849/0.867/0.866/0.858  0.774/0.773/0.788/0.784  0.080/0.196/0.233/0.276
NINH  0.792/0.818/0.832/0.830  0.808/0.827/0.827/0.827  0.772/0.756/0.760/0.778  0.076/0.162/0.197/0.236
IV-D3 Experiment results on the MIRFLICKR dataset
The MAP scores with different lengths of hash codes on the MIRFLICKR dataset are shown in Table IV. The proposed QaDWH method achieves an average MAP score of 0.800, which outperforms the other deep hashing methods DRSCH (0.783), NINH (0.766) and CNNH (0.755). Compared with the best traditional method using deep features, SDH-VGG19, which achieves an average MAP of 0.739, QaDWH has an absolute improvement of 0.061. On MIRFLICKR, the proposed QaDWH method still outperforms the traditional weighted hashing method QRank by 0.029, which shows the effectiveness of jointly training hash codes and the corresponding weights. Figure 5(a) shows the precision within Hamming radius 2 using hash lookup, from which we can observe that the proposed QaDWH approach achieves the best result. Figure 5(b) shows the precision curves within the top 500 retrieved samples, where QaDWH achieves the highest precision due to the fine-grained retrieval. Figures 5(c) and (d) show the top-1,000 results and the precision-recall curves with 48-bit hash codes; the proposed QaDWH method still achieves the best results, which further shows the effectiveness of query-adaptive fine-grained ranking.
IV-D4 Experiment results on the ImageNet dataset
The MAP scores with different lengths of hash codes on the ImageNet dataset are shown in Table V. Note that for this large-scale dataset we only report results of traditional methods using deep features, and we calculate the MAP scores based on the top 500 returned images due to the high computational cost of MAP evaluation. From Table V we can observe that the proposed QaDWH approach achieves the best average MAP score of 0.211 on this challenging dataset. Compared with the traditional weighted hashing method QRank, our proposed QaDWH achieves an absolute improvement of 0.048; moreover, QRank cannot achieve stable improvements over NINH on this large dataset. Compared with the best deep hashing method on ImageNet, NINH, the proposed QaDWH achieves an absolute improvement of 0.043. On this large dataset we can also observe that traditional methods such as SDH and ITQ achieve results comparable to the deep hashing methods. Figures 6(a)-(d) show the retrieval accuracy on ImageNet. Similarly, the proposed QaDWH achieves the best accuracy on all four evaluation metrics, owing to the joint training of hash functions and the corresponding weights and the fine-grained ranking for different query images.
IV-D5 Comparison of Testing Time
Besides retrieval accuracy, we also compare the testing time of the proposed approach and the state-of-the-art methods. All experiments are conducted on the same PC with an NVIDIA Titan Black GPU, an Intel Core i7-5930K 3.50GHz CPU and 64 GB memory. A typical retrieval process of hashing methods consists of three stages: feature extraction, hash code generation and image retrieval in the database. We record the time cost of each stage for the different methods; the final testing time is the sum of the three stages. Note that the proposed QaDWH approach and the other deep hashing methods are end-to-end frameworks whose input is raw images and output is hash codes, while the compared traditional hashing methods take image features as input, so for a fair comparison we use deep features for the traditional methods. The compared query-adaptive hashing method QRank takes image features and hash codes generated by other methods as input; its additional computation is the query-adaptive weight calculation. The average testing time of the different methods is shown in Table VI; we run each hashing method 5 times to calculate the average. Comparing the proposed QaDWH with the other deep hashing methods, we can observe that QaDWH is slightly slower but still comparable (by less than 1 millisecond on the small-scale datasets and less than 10 milliseconds on the large-scale ImageNet dataset), which is expected since QaDWH uses the relatively slower weighted Hamming distance. However, QaDWH achieves much better retrieval accuracy at this small extra time cost. Comparing the proposed QaDWH with the traditional query-adaptive method QRank, we can observe that QaDWH is much faster: QRank consumes much more time to calculate the query-adaptive hash weights, while the proposed QaDWH approach requires only a single matrix multiplication to calculate the query-adaptive weights.
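The query-adaptive retrieval step described above reduces to one matrix multiplication followed by a weighted Hamming distance. The following NumPy sketch illustrates the idea (the array shapes and function names are our assumptions for illustration, not the paper's code):

```python
import numpy as np

def query_adaptive_weights(class_probs, class_weights):
    """Fuse the query's predicted class probabilities (shape (C,)) with the
    learned class-wise bit weights (shape (C, B)) into per-bit weights (B,).
    This is the single matrix multiplication mentioned in the text."""
    return class_probs @ class_weights

def weighted_hamming(query_code, db_codes, bit_weights):
    """Weighted Hamming distance: instead of counting differing bits,
    sum their weights. query_code: (B,) 0/1 array; db_codes: (N, B)."""
    return (query_code[None, :] != db_codes) @ bit_weights
```

Because the bit weights are real-valued, the resulting distances break the ties that the discrete Hamming distance would otherwise produce, which is what enables the fine-grained ranking.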
From the result table we can also observe that the deep hashing methods and the traditional hashing methods are comparable in terms of testing time, since hash code generation costs only a matrix multiplication, which is very fast, and all of them use the Hamming distance, which can be efficiently calculated by the bitwise XOR operation.
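The XOR-based Hamming distance mentioned above can be computed on integer-packed codes with one XOR and a popcount (a minimal illustration):

```python
def hamming_distance(code_a, code_b):
    """Hamming distance between two integer-packed binary codes:
    XOR marks the differing bits, popcount counts them."""
    return bin(code_a ^ code_b).count("1")
```

For example, codes 0b1010 and 0b0011 differ in two bit positions, so their Hamming distance is 2.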
IV-E Baseline Experiments and Analysis
We also conduct two baseline experiments to further demonstrate the separate contributions of the proposed deep weighted hashing and the query-adaptive retrieval approach: (1) To verify the effectiveness of the query-adaptive retrieval approach, we perform experiments using fixed weights obtained by averaging the learned class-wise weights, so that every query has the same hash bit weights; we denote this baseline as DWH. (2) To verify the effectiveness of deep weighted hashing, we conduct experiments without using hash weights at all, which is equivalent to the NINH method. The MAP scores of the baseline methods are shown in Table VII. From the table, we can observe that on all four datasets the DWH method outperforms NINH, which shows that the learned class-wise weights reflect the semantic properties of different image classes and thus improve the retrieval accuracy. QaDWH further outperforms DWH on all four datasets, which demonstrates that the query-adaptive image retrieval approach can further improve the retrieval accuracy. Figures 7 to 10 show the other four evaluation metrics on the CIFAR-10, NUS-WIDE, MIRFLICKR and ImageNet datasets. From those figures we can clearly observe that DWH outperforms NINH and QaDWH outperforms DWH on all four evaluation metrics, which further demonstrates the effectiveness of the proposed deep weighted hashing and query-adaptive retrieval approach.
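The fixed weights of the DWH baseline are simply the learned class-wise weights averaged over classes; a one-line NumPy sketch (the variable and function names are ours, for illustration):

```python
import numpy as np

def dwh_fixed_weights(class_weights):
    """DWH baseline: average the learned class-wise bit weights (C, B)
    over the classes, so every query shares the same (B,) weight vector."""
    return class_weights.mean(axis=0)
```

This removes the query-adaptive component while keeping the jointly learned weights, which is exactly what isolates the contribution of query-adaptive retrieval in the comparison above.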
V Conclusion
In this paper, we have proposed a novel query-adaptive deep weighted hashing (QaDWH) approach. First, we design a new deep hashing network that consists of two streams: the hash stream learns compact hash codes and the corresponding class-wise hash bit weights simultaneously, while the classification stream preserves the semantic information and improves the hashing performance. Second, we propose an effective yet efficient query-adaptive image retrieval approach, which first rapidly generates the query-adaptive hash bit weights based on the query's predicted semantic probability and the class-wise weights, and then performs effective image retrieval by the weighted Hamming distance. Experimental results show the effectiveness of QaDWH compared with eight state-of-the-art methods on four widely used datasets. In future work, we intend to extend the deep weighted hashing scheme to a multi-table deep hashing framework, in which different weights are learned for different hash mapping functions.
References
 [1] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in International Conference on Very Large Data Bases (VLDB), vol. 99, no. 6, 1999, pp. 518–529.
 [2] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3270–3278.
 [3] J. Wang, S. Kumar, and S.-F. Chang, “Sequential projection learning for hashing with compact codes,” in International Conference on Machine Learning (ICML), 2010, pp. 1127–1134.
 [4] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Annual Conference on Neural Information Processing Systems (NIPS), 2009, pp. 1753–1760.
 [5] Z. Chen, J. Lu, J. Feng, and J. Zhou, “Nonlinear discrete hashing,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 1, pp. 123–135, 2017.
 [6] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing with semantically consistent graph for image indexing,” IEEE Transactions on Multimedia (TMM), vol. 15, no. 1, pp. 141–152, 2013.
 [7] M. Kafai, K. Eshghi, and B. Bhanu, “Discrete cosine transform locality-sensitive hashes for face retrieval,” IEEE Transactions on Multimedia (TMM), vol. 16, no. 4, pp. 1090–1103, 2014.
 [8] Y. Zhang, L. Zhang, and Q. Tian, “A priorfree weighting scheme for binary code ranking,” IEEE Transactions on Multimedia (TMM), vol. 16, no. 4, pp. 1127–1139, 2014.
 [9] V. E. Liong, J. Lu, Y.P. Tan, and J. Zhou, “Deep video hashing,” IEEE Transactions on Multimedia (TMM), 2016.
 [10] Y. Hao, T. Mu, R. Hong, M. Wang, N. An, and J. Y. Goulermas, “Stochastic multiview hashing for largescale nearduplicate video retrieval,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 1, pp. 1–14, 2017.
 [11] K. Ding, B. Fan, C. Huo, S. Xiang, and C. Pan, “Crossmodal hashing via rankorder preserving,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 3, pp. 571–585, 2017.
 [12] D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” IEEE Transactions on Multimedia (TMM), vol. 17, no. 9, pp. 1404–1416, 2015.
 [13] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Annual Conference on Neural Information Processing Systems (NIPS), 2009, pp. 1509–1517.
 [14] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: efficient indexing for high-dimensional similarity search,” in International Conference on Very Large Data Bases (VLDB), 2007, pp. 950–961.
 [15] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with graphs,” in International Conference on Machine Learning (ICML), 2011, pp. 1–8.
 [16] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 817–824.
 [17] L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian, “Topology preserving hashing for similarity search,” in ACM International Conference on Multimedia (ACMMM), 2013, pp. 123–132.
 [18] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang, “Locally linear hashing for extracting nonlinear manifolds,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2115–2122.
 [19] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in Annual Conference on Neural Information Processing Systems (NIPS), 2009, pp. 1042–1050.
 [20] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2074–2081.
 [21] J. Wang, J. Wang, N. Yu, and S. Li, “Order preserving hashing for approximate nearest neighbor search,” in ACM International Conference on Multimedia (ACMMM), 2013, pp. 133–142.
 [22] Q. Wang, Z. Zhang, and L. Si, “Ranking preserving hashing for fast similarity search,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 3911–3917.
 [23] F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 37–45.
 [24] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision (IJCV), vol. 42, no. 3, pp. 145–175, 2001.
 [25] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 524–531.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Annual Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
 [27] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2014, pp. 2156–2162.
 [28] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” in Advances in Neural Information Processing Systems (NIPS), 2003, pp. 41–48.
 [29] T. Yao, F. Long, T. Mei, and Y. Rui, “Deep semantic-preserving and ranking-based hashing for image retrieval,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 3931–3937.
 [30] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, “Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification,” IEEE Transactions on Image Processing (TIP), vol. 24, no. 12, pp. 4766–4779, 2015.
 [31] F. Zhao, Y. Huang, L. Wang, and T. Tan, “Deep semantic ranking based hashing for multi-label image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1556–1564.
 [32] W.-J. Li, S. Wang, and W.-C. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 1711–1717.
 [33] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in AAAI Conference on Artificial Intelligence (AAAI), 2016, pp. 2415–2421.
 [34] T. Ji, X. Liu, C. Deng, L. Huang, and B. Lang, “Queryadaptive hash code ranking for fast nearest neighbor search,” in ACM International Conference on Multimedia (ACMMM), 2014, pp. 1005–1008.
 [35] Y.-G. Jiang, J. Wang, and S.-F. Chang, “Lost in binarization: query-adaptive ranking for similar image search with compact codes,” in ACM International Conference on Multimedia Retrieval (ICMR), 2011, pp. 16–22.
 [36] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, “Query-adaptive image search with hash codes,” IEEE Transactions on Multimedia (TMM), vol. 15, no. 2, pp. 442–453, 2013.
 [37] X. Liu, C. Deng, B. Lang, D. Tao, and X. Li, “Query-adaptive reciprocal hash tables for nearest neighbor search,” IEEE Transactions on Image Processing (TIP), vol. 25, no. 2, pp. 907–919, 2016.
 [38] L. Zhang, Y. Zhang, J. Tang, K. Lu, and Q. Tian, “Binary code ranking with weighted Hamming distance,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1586–1593.
 [39] X. Zhang, L. Zhang, and H.-Y. Shum, “QsRank: Query-sensitive hash code ranking for efficient ε-neighbor search,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2058–2065.
 [40] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
 [41] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
 [42] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
 [43] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in ACM International Conference on Image and Video Retrieval (CIVR), 2009, p. 48.
 [44] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in ACM International Conference on Multimedia Information Retrieval (MIR), 2008, pp. 39–43.
 [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
 [46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia (ACMMM), 2014, pp. 675–678.