Hierarchy Neighborhood Discriminative Hashing for a Unified View of Single-Label and Multi-Label Image Retrieval
Abstract
Recently, deep supervised hashing methods have become popular for the large-scale image retrieval task. To preserve the notion of semantic similarity between examples, they typically utilize pairwise or triplet supervised information for hash learning. However, these methods usually ignore the semantic class information, which can improve the semantic discriminative ability of hash codes. In this paper, we propose a novel hierarchy neighborhood discriminative hashing method. Specifically, we construct a bipartite graph to build a coarse semantic neighborhood relationship between the subclass feature centers and the embedding features. Moreover, we utilize the pairwise supervised information to construct a fine-grained semantic neighborhood relationship between the embedding features. Finally, we propose a hierarchy neighborhood discriminative hashing loss to unify the single-label and multi-label image retrieval problems with a one-stream deep neural network architecture. Experimental results on two large-scale datasets demonstrate that the proposed method can outperform state-of-the-art hashing methods.
1 Introduction
Hashing [14, 22, 15, 16, 17, 5, 11, 10] has attracted the attention of many researchers for large-scale image retrieval in recent years. The goal of hashing is to transform multimedia data from the original high-dimensional space into a compact binary space while preserving data similarities. Constant or sub-linear search speed can be achieved via the Hamming distance, which is computed using XOR and POPCNT operations on modern CPUs or GPUs. This efficient storage and search make hashing popular for large-scale multimedia retrieval.
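To make the XOR-and-popcount lookup concrete, the following minimal Python sketch (not from the paper; the values and names are illustrative) ranks packed binary codes against a query:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two hash codes packed as integers."""
    # XOR leaves a 1 wherever the codes disagree; counting those bits
    # mirrors the XOR + POPCNT instruction pair mentioned above.
    return bin(code_a ^ code_b).count("1")

# Toy 4-bit database codes and a query code (illustrative values).
database = [0b1011, 0b0011, 0b1111, 0b0000]
query = 0b1010
# One XOR + popcount per code; sorting by distance yields the ranking.
ranking = sorted(database, key=lambda c: hamming_distance(query, c))
```

Each distance evaluation costs a fixed number of machine instructions regardless of the code length, which is the source of the constant-time search behaviour.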
Generally, existing hashing approaches can be divided into two categories: data-independent and data-dependent methods. Data-independent hashing methods [3] map data points from the original feature space into a binary-code space by using random projections as hash functions. These methods provide theoretical guarantees that nearby data points are mapped to the same hash codes with high probability. However, they need long binary codes to achieve high precision. Data-dependent hashing methods (i.e., learning-to-hash methods) [14, 22, 15, 5, 16, 12, 18, 11, 24, 21, 10] learn hash functions and compact binary codes from training data. They can be further divided into unsupervised hashing methods [22, 5, 7, 16, 14] and supervised hashing methods [12, 20, 18], based on whether or not semantic (label) information is used. In many real applications, supervised hashing methods demonstrate superior performance over unsupervised ones. Recently, deep learning based hashing methods [6, 11, 24, 21, 10, 15] have demonstrated superior performance over traditional hashing methods [22, 5, 12, 16, 14]. The main reason is that deep hashing methods perform simultaneous feature learning and hash-code learning in an end-to-end framework. Existing deep supervised hashing methods [1, 9, 23, 11, 24, 21, 15, 6] mainly utilize pairwise or triplet supervised information for hash learning, while ignoring the semantic class information that can improve the semantic discriminative ability of hash codes. Recently, some deep supervised hashing methods [17, 24, 10] have improved retrieval performance by introducing the essential semantic structure of the data in the form of class labels. [24] constructs a two-stream network architecture, with one stream for classification and the other for hashing. However, the semantic labels do not directly guide the hash learning.
[10] assumes that the learned binary codes should be ideal for linear classification, which restricts its scalability to complex scenes. [17] weakens this assumption to support non-linear classification. However, [17] utilizes the geometrical center of the semantically relevant subcenters as supervision information for multi-label hashing, which destroys the intrinsic manifold structure of the subcenter space.
In this paper, we propose a hierarchy neighborhood discriminative hashing method (HNDH). Specifically, we construct a bipartite graph to build a coarse semantic neighborhood relationship between the subclass feature centers and the embedding features. Moreover, we utilize the pairwise supervised information to construct a fine-grained semantic neighborhood relationship between the embedding features. Finally, we propose a hierarchy neighborhood discriminative hashing loss to unify the single-label and multi-label image retrieval problems with a one-stream deep neural network architecture. Compared with the multi-label classification loss in [17], which uses the geometrical center of the relevant subcenters as supervision information, the proposed method employs the intrinsic manifold structure of these subcenters for learning discriminative hash codes. Some preliminary results on a multi-task learning based semantic hashing framework were presented in [17]; the extension to the hierarchy neighborhood discriminative hashing loss and the unification of single-label and multi-label hash learning within a one-stream network are novel.
2 Hierarchy Neighborhood Discriminative Hashing
2.1 Problem Definition
Assume we have a training set $X = \{x_i\}_{i=1}^{N}$, where $N$ denotes the number of training samples. The label information is denoted as $Y = \{y_i\}_{i=1}^{N}$ with $y_i \in \{0,1\}^{C}$, where $C$ denotes the number of categories. In addition, we can define a pairwise supervision matrix $S = \{s_{ij}\}$, with $s_{ij} = 1$ if $x_i$ and $x_j$ are semantically similar and $s_{ij} = 0$ otherwise. Under the supervised information $Y$ and $S$, supervised hashing aims to learn a hash function that transforms the training data into a collection of $r$-bit compact binary codes $B = \{b_i\}_{i=1}^{N}$, $b_i \in \{-1,1\}^{r}$. The Hamming distance between $b_i$ and $b_j$ can be calculated as $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(r - b_i^{\top} b_j)$. Therefore, we can utilize the inner product $b_i^{\top} b_j$ to measure the similarity of hash codes.
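The standard identity $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(r - b_i^{\top} b_j)$ for codes in $\{-1,+1\}^r$ can be checked with a short sketch (names and toy values are illustrative, not from the paper):

```python
def hamming_from_inner(b_i, b_j):
    """Hamming distance via the inner-product identity
    dist_H(b_i, b_j) = (r - <b_i, b_j>) / 2 for codes in {-1, +1}^r."""
    r = len(b_i)
    inner = sum(p * q for p, q in zip(b_i, b_j))
    return (r - inner) // 2

# Two toy 4-bit codes that disagree in the 2nd and 4th positions.
b_i = [1, -1, 1, 1]
b_j = [1, 1, 1, -1]
# Direct count of disagreeing positions, for comparison.
direct = sum(1 for p, q in zip(b_i, b_j) if p != q)
```

Agreeing positions contribute $+1$ to the inner product and disagreeing positions $-1$, so a larger inner product directly means a smaller Hamming distance.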
2.2 Network Architecture
As illustrated in Fig. 1, the proposed one-stream deep architecture mainly contains two components: a feature extraction subnetwork, which consists of the conv1 to fc7 layers of a pre-trained VGG-19 network [8], and an output layer that generates discriminative embedding features and hash codes for retrieval and classification.
2.3 Proposed Method
2.3.1 Coarse neighborhood discriminative hashing loss
The output layer includes an $r$-dimensional fully-connected layer ($4096 \to r$) and a hyperbolic tangent layer to approximate the sign function. We use $u_i$ to denote the real-valued output of the network for image $x_i$, i.e., the embedding feature. For each subclass, we compute its feature center as $c_j = \frac{1}{N_j}\sum_{i: y_{ij}=1} u_i$, where $c_j$ represents the centroid of the $j$-th subclass and $N_j$ is the number of training samples that belong to the $j$-th subclass. Therefore, we can construct a complete bipartite graph to build the relationship between the subclass feature centers $\{c_j\}_{j=1}^{C}$ and the embedding features $\{u_i\}$. The edge weight between the vertex $u_i$ and the vertex $c_j$ is defined as $\|u_i - c_j\|_2^2$. Inspired by [4], we can use a softmax normalization over the edge weights that connect the vertex $u_i$ to define the neighbor probability $p_{ij}$:

$$ p_{ij} = \frac{\exp(-\|u_i - c_j\|_2^2)}{\sum_{k=1}^{C}\exp(-\|u_i - c_k\|_2^2)}, \qquad (1) $$
where $p_{ij}$ is the probability of $u_i$ selecting $c_j$ as its neighbor. This neighborhood relationship is coarse, since we only consider the relations between the embedding features and the subclass centers. If the $j$-th subclass is contained in the assigned labels of the embedding feature $u_i$, i.e., $y_{ij} = 1$, then the subclass center $c_j$ is a relevant semantic neighbor of $u_i$. Therefore, $c_j$ can participate in the class-label voting for $u_i$. Under this definition, the probability that image $x_i$ will be correctly classified can be computed as

$$ p_i = \sum_{j: y_{ij}=1} p_{ij}. \qquad (2) $$
The negative log-likelihood function can be defined as:

$$ L_1 = -\frac{1}{M}\sum_{i=1}^{M} \log p_i, \qquad (3) $$

where $M$ refers to the batch size. This function minimizes the intra-class variation and maximizes the inter-class variation simultaneously to generate powerful embedding representations. It is worth noting that if $y_i$ is a one-hot vector, this function reduces to the single-label classification function in [17, 13]. However, this function is not restricted to the single-label classification problem; it extends naturally to the multi-label classification problem. Compared with the multi-label hashing loss in [17], which uses the geometrical center of the relevant semantic neighbor subcenters as supervision information, the proposed method employs the intrinsic manifold structure of the subcenter space for learning discriminative embedding features.
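As a sanity check, the coarse loss of Eqs. (1)-(3) can be sketched in plain Python (a simplified reconstruction under the NCA-style formulation described above, not the authors' implementation; in practice the embeddings come from the network and the centers from the center update):

```python
import math

def coarse_loss(U, C, Y):
    """Coarse neighborhood discriminative loss, Eqs. (1)-(3) as
    reconstructed here.  U: embedding features (one list per sample),
    C: subclass centers, Y: binary labels with Y[i][j] = 1 iff sample
    i carries label j.  Pure-Python sketch for a single mini-batch."""
    total = 0.0
    for i, u in enumerate(U):
        # Eq. (1): softmax over negative squared distances to all centers.
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(u, c))) for c in C]
        Z = sum(w)
        # Eq. (2): probability mass on the semantically relevant centers.
        p_i = sum(w[j] for j in range(len(C)) if Y[i][j]) / Z
        # Eq. (3): negative log-likelihood, averaged over the batch.
        total -= math.log(p_i)
    return total / len(U)
```

An embedding sitting on a relevant center yields a near-zero loss, while one sitting on an irrelevant center is penalized heavily; a multi-hot $y_i$ is handled by the same expression, which is why the loss covers both the single-label and the multi-label case.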
2.3.2 Fine-grained neighborhood discriminative hashing loss
In the classification task above, we only consider the discrimination and polymerization of the embedding features. The semantic neighborhood relationship between the embedding features is ignored, which may make the distance between dissimilar embedding features smaller than the distance between similar ones. To overcome this limitation, we introduce the following pairwise constraint [11], which is commonly used to preserve semantic similarity in the retrieval task:

$$ L_2 = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right), \qquad (4) $$

where $\Theta_{ij} = \frac{1}{2} u_i^{\top} u_j$. The semantic similarity matrix $S$ encodes a fine-grained neighborhood relationship between the embedding features, i.e., $s_{ij} = 1$ denotes that image $x_i$ and image $x_j$ are neighbors in the semantic space and $s_{ij} = 0$ otherwise. This neighborhood relationship is fine-grained, since we consider the relations between all pairs of embedding features.
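The pairwise term can likewise be sketched (a reconstruction of the negative log-likelihood form used in [11]; names and toy values are illustrative):

```python
import math

def pairwise_loss(U, S):
    """Fine-grained pairwise likelihood term (Eq. 4 as reconstructed):
    sum over pairs of log(1 + exp(theta_ij)) - s_ij * theta_ij, with
    theta_ij = 0.5 * <u_i, u_j>.  Pure-Python sketch."""
    loss, n = 0.0, len(U)
    for i in range(n):
        for j in range(i + 1, n):
            theta = 0.5 * sum(a * b for a, b in zip(U[i], U[j]))
            # Negative log-likelihood of s_ij under sigmoid(theta).
            loss += math.log(1.0 + math.exp(theta)) - S[i][j] * theta
    return loss
```

Similar pairs ($s_{ij}=1$) are pushed toward large inner products and dissimilar pairs toward small ones, which complements the coarse loss above.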
2.4 Objective Function and Learning Algorithm
We formulate the proposed Hierarchy Neighborhood Discriminative Hashing method as the following multi-task learning framework:

$$ \min_{C,\,W}\; L = L_1 + \lambda L_2, \qquad (5) $$

where $\lambda$ balances the impact of the two terms, which involve different numbers of factors. We use an alternating optimization over the class subcenters $C$ and the CNN parameters $W$ as follows:

Fix $W$ and optimize $C$. We can update the feature center of the $j$-th subclass directly as follows:

$$ c_j = \frac{1}{N_j}\sum_{i:\, y_{ij}=1} u_i. \qquad (6) $$
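The center update of Eq. (6) amounts to a per-class mean of the current embeddings; a minimal sketch (illustrative pure Python, with an assumed zero fallback for empty classes):

```python
def update_centers(U, Y, num_classes):
    """Eq. (6): each subclass center becomes the mean of the embeddings
    of the samples that carry that label; a multi-label sample thus
    contributes to every one of its subclass centers."""
    dim = len(U[0])
    centers = []
    for j in range(num_classes):
        members = [u for u, y in zip(U, Y) if y[j]]
        if not members:
            centers.append([0.0] * dim)  # assumed fallback for empty classes
            continue
        centers.append([sum(u[d] for u in members) / len(members)
                        for d in range(dim)])
    return centers
```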
Fix $C$ and optimize $W$. The gradient of the objective with respect to the embedding feature $u_i$ is

$$ \frac{\partial L}{\partial u_i} = \frac{\partial L_1}{\partial u_i} + \lambda \frac{\partial L_2}{\partial u_i}, \qquad (7) $$

where $\frac{\partial L_1}{\partial u_i}$ and $\frac{\partial L_2}{\partial u_i}$ follow from Eqs. (3) and (4), respectively. Then we can compute $\frac{\partial L}{\partial W}$ from $\frac{\partial L}{\partial u_i}$ by using the chain rule. In each iteration, we update the parameters $W$ based on the back-propagation (BP) algorithm.
Table 1: Statistics of the two benchmark datasets.

Dataset     Total     Query / Retrieval / Training    Labels
CIFAR-10    60,000    1,000 / 59,000 / 5,000          10
NUS-WIDE    195,834   2,100 / 193,734 / 10,500        21
3 Experimental Results
3.1 Datasets and Experimental Settings
We evaluate the performance of several deep hashing methods on two public datasets: CIFAR-10 and NUS-WIDE. We split each dataset into a query set and a retrieval set; the training set is randomly selected from the retrieval set. For the CIFAR-10 dataset, following [11, 21, 10], 100 images per class are randomly selected as the query set and the remaining images are used as the retrieval set. Moreover, 500 images per class are randomly sampled from the retrieval set as the training set. For the NUS-WIDE dataset, we only use the images that belong to the 21 most frequent labels, so that each class contains at least 5,000 images. We randomly sample 2,100 images (100 images per class) as the query set and the remaining images form the retrieval set; 500 images per class from the retrieval set are used as the training set. The statistics of the two dataset splits are summarized in Table 1. For CIFAR-10, we use Mean Average Precision (MAP) as the evaluation metric following [11, 21, 10]. For NUS-WIDE, MAP@5K is evaluated on the top 5,000 retrieved images, as in [11, 21, 10].
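For reference, the evaluation protocol can be sketched as follows (a standard MAP implementation, not the authors' code; each relevance list is assumed to be a 0/1 vector ordered by increasing Hamming distance to the query):

```python
def average_precision(relevance, k=None):
    """AP for one query: `relevance` is a 0/1 list ordered by increasing
    Hamming distance; truncate at k for MAP@k (k = 5000 for NUS-WIDE)."""
    if k is not None:
        relevance = relevance[:k]
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at each relevant position
    return score / hits if hits else 0.0

def mean_average_precision(rankings, k=None):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r, k) for r in rankings) / len(rankings)
```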
We use two NVIDIA TITAN Xp GPUs and MatConvNet as the platform to implement the proposed model. The pre-trained VGG-19 model is utilized to initialize the base network of HNDH, and the other parameters of the network are randomly initialized. The number of iterations is set to 100 and the batch size is fixed to 128 for all datasets. The learning rate of the base network is gradually reduced during training for both the CIFAR-10 and NUS-WIDE datasets, and the learning rate of the newly added layers is set to 10 times that of the base network layers. For both datasets, the trade-off parameter is set via cross-validation on the training sets.
Table 2: MAP results on the CIFAR-10 dataset.

Method      12 bits   24 bits   32 bits   48 bits
HNDH        0.805     0.825     0.829     0.838
MLDH [17]   0.805     0.825     0.829     0.838
DDSH [6]    0.769     0.829     0.835     0.819
DSDH [10]   0.740     0.786     0.801     0.820
DTSH [21]   0.710     0.750     0.765     0.774
DPSH [11]   0.713     0.727     0.744     0.757
Table 3: MAP@5K results on the NUS-WIDE dataset.

Method      12 bits   24 bits   32 bits   48 bits
HNDH        0.806     0.832     0.841     0.848
MLDH [17]   0.800     0.828     0.832     0.835
DDSH [6]    0.791     0.815     0.821     0.827
DSDH [10]   0.776     0.808     0.820     0.829
DTSH [21]   0.773     0.808     0.812     0.824
DPSH [11]   0.752     0.790     0.794     0.812
3.2 Results and Discussions
The MAP results on CIFAR-10 and NUS-WIDE are presented in Table 2 and Table 3, respectively. The results of the deep supervised baselines [11, 21, 10] on CIFAR-10 and NUS-WIDE are cited from [10], and those of [6] are cited from [6]. It can be seen that the proposed method outperforms the other baselines in most cases. On the CIFAR-10 dataset, the average MAP of the proposed method is 0.824, which is 1.1 percentage points higher than DDSH's average of 0.813. On the NUS-WIDE dataset, the proposed method performs consistently better than the other baselines across all code lengths; its average MAP@5K is 0.831, which is 0.7 percentage points higher than MLDH's average of 0.824. A likely reason is that the embedding features learned by MLDH tend to locate at the geometrical center of their semantically relevant subclass centers, which destroys the manifold structure of the subcenter space. When the hash code is short (e.g., 12 bits), the proposed method and MLDH perform much better than the state of the art. The reason is that the semantic label information is employed to learn the discrimination and polymerization of hash codes. Although [10] also utilizes the label information, it ignores the polymerization of hash codes.
Compared to single-task learning based hashing methods, including DTSH [21], DPSH [11], HashNet [2], DQN [1], DNNH [9] and CNNH [23], the multi-task learning based hashing methods, including the proposed method, MLDH [17] and DSDH [10], jointly consider the retrieval task and the classification task for learning discrete discriminative hash codes. From Table 2 and Table 3, it can be found that the multi-task learning based hashing methods generally perform better than the single-task learning based ones. In addition, different from DSDH, the proposed method also considers the polymerization of hash codes.
Table 4: Ablation results of HNDH and its two variants (MAP on CIFAR-10, MAP@5K on NUS-WIDE).

Method   CIFAR-10 (12 / 24 / 32 / 48 bits)       NUS-WIDE (12 / 24 / 32 / 48 bits)
HNDH     0.8045 / 0.8250 / 0.8293 / 0.8377       0.8062 / 0.8317 / 0.8410 / 0.8484
—        0.7524 / 0.7865 / 0.7883 / 0.7888       0.7860 / 0.8204 / 0.8270 / 0.8339
—        0.7640 / 0.8085 / 0.8125 / 0.7977       0.7594 / 0.8007 / 0.8102 / 0.8186
3.3 Ablation Experiments
We report the effect of the different components of HNDH on the two benchmark datasets with different numbers of bits in Table 4. The MAP results verify the effectiveness of combining the coarse and the fine-grained neighborhood discriminative hashing losses. In addition, the coarse neighborhood discriminative hashing loss alone performs better than the multi-label hashing loss in [17] alone. To facilitate the analysis, we focus on two HNDH variants: (a) HNDH-C, which removes the fine-grained neighborhood discriminative hashing loss; and (b) HNDH-F, which removes the coarse neighborhood discriminative hashing loss. Fig. 6 shows t-SNE visualizations [19] of the deep representations of HNDH-F, HNDH-C, and HNDH with 48 bits on the query set of the CIFAR-10 dataset. As shown in Fig. 6, the image embeddings generated by HNDH show the most compact and discriminative structures with the clearest boundaries.
3.4 Convergence Analysis
The training convergence curves of the proposed model at 48 bits on the CIFAR-10 and NUS-WIDE datasets are shown in Fig. 6 (d). It can be observed that the proposed model converges within 100 iterations, which validates the effectiveness of the proposed approach.
4 Conclusion
In this paper, we proposed a hierarchy neighborhood discriminative hashing method. First, we construct a bipartite graph to build a coarse semantic neighborhood relationship between the subclass feature centers and the embedding features. Moreover, we utilize the pairwise supervised information to construct a fine-grained semantic neighborhood relationship between the embedding features. Finally, we propose a hierarchy neighborhood discriminative hashing loss to unify the single-label and multi-label image retrieval problems with a one-stream deep neural network architecture. In future work, we plan to extend the proposed single-modal hashing method to cross-modal hashing.
References
 [1] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.
 [2] Z. Cao, M. Long, J. Wang, and P. S. Yu. HashNet: Deep learning to hash by continuation. In ICCV, pages 5609–5618, 2017.
 [3] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of International Conference on Very Large Data Bases, pages 518–529, 1999.
 [4] J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, pages 513–520, 2004.
 [5] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
 [6] Q. Jiang, X. Cui, and W. Li. Deep discrete supervised hashing. IEEE Transactions on Image Processing, 27(12):5996–6009, 2018.
 [7] W. Kang, W. Li, and Z. Zhou. Column sampling based discrete supervised hashing. In AAAI, pages 1230–1236, 2016.
 [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
 [9] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
 [10] Q. Li, Z. Sun, R. He, and T. Tan. Deep supervised discrete hashing. In NIPS, pages 2479–2488, 2017.
 [11] W. Li, S. Wang, and W. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.
 [12] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
 [13] Y. Liu, H. Li, and X. Wang. Rethinking feature discrimination and polymerization for largescale recognition. In NIPS, 2017.
 [14] L. Ma, H. Li, F. Meng, Q. Wu, and K. N. Ngan. Learning efficient binary codes from high-level feature representations for multi-label image retrieval. IEEE Trans. Multimedia, 19(11):2545–2560, 2017.
 [15] L. Ma, H. Li, F. Meng, Q. Wu, and K. N. Ngan. Global and local semantics-preserving based deep hashing for cross-modal retrieval. Neurocomputing, 312:49–62, 2018.
 [16] L. Ma, H. Li, F. Meng, Q. Wu, and L. Xu. Manifold-ranking embedded order preserving hashing for image semantic retrieval. J. Visual Communication and Image Representation, 44:29–39, 2017.
 [17] L. Ma, H. Li, Q. Wu, C. Shang, and K. Ngan. Multi-task learning for deep semantic hashing. In VCIP, pages 1–4, 2018.
 [18] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
 [19] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 9:2579–2605, 2008.
 [20] J. Wang, S. Kumar, and S. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2393–2406, 2012.
 [21] X. Wang, Y. Shi, and K. M. Kitani. Deep supervised hashing with triplet labels. In ACCV, pages 70–84, 2016.
 [22] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2008.
 [23] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, pages 2156–2162, 2014.
 [24] T. Yao, F. Long, T. Mei, and Y. Rui. Deep semantic-preserving and ranking-based hashing for image retrieval. In IJCAI, pages 3931–3937, 2016.