# Unsupervised Deep Hashing for Large-scale Visual Search

###### Abstract

Learning-based hashing plays a pivotal role in large-scale visual search. However, most existing hashing algorithms learn shallow models that do not yield representative binary codes. In this paper, we propose a novel hashing approach based on unsupervised deep learning that hierarchically transforms features into hash codes. Within the heterogeneous deep hashing framework, autoencoder layers with specific constraints model the nonlinear mapping between features and binary codes. A Restricted Boltzmann Machine (RBM) layer with constraints then reduces the dimension in the Hamming space. Extensive experiments on the problem of visual search demonstrate the competitiveness of our proposed approach compared with state-of-the-art methods.

Zhaoqiang Xia, Xiaoyi Feng, Jinye Peng, Abdenour Hadid

School of Electronics and Information, Northwestern Polytechnical University

710129, Xi’an, Shaanxi, China

Center for Machine Vision Research (CMV), University of Oulu, Finland

Index Terms— Learning based hashing, Unsupervised learning, Deep learning, Autoencoder, RBM

## 1 Introduction

In the era of big data, large-scale visual search is vital for accessing and processing huge numbers of images, and it is of great importance in many fields of computer vision. Compared to tree-based approaches, hashing-based methods use several hash functions to project image features into binary codes and are more suitable for large-scale visual search due to their compact representations [1, 2]. Hashing approaches are therefore becoming appealing for dealing with high-dimensional data.

Broadly speaking, existing hashing approaches can be classified into two categories: data-independent methods [3, 4, 5] and data-dependent methods [2, 6, 7, 8, 9]. Data-independent methods project the data into a Hamming space with random hash functions, whereas data-dependent methods (also referred to as learning-based hashing) learn hash functions from training datasets by optimizing an objective function. Because data-independent methods rely on random projections, they require many binary bits to achieve good performance, and as the data grows they tend to suffer from memory constraints. Learning-based methods, on the other hand, can learn more discriminative hash functions with different objective functions, and the number of hash bits does not grow linearly with the data size. Hence, learning-based hashing is more suitable for large-scale data and is the focus of recent research.
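To make the data-independent category concrete, the following is a minimal sketch of random-hyperplane hashing, one common locality-sensitive scheme. It is not taken from any of the cited implementations; the function names, the Gaussian projections, and the toy points are illustrative assumptions.

```python
import random

def random_hyperplane_hash(vectors, n_bits, seed=0):
    """Data-independent hashing: one random Gaussian hyperplane per bit.

    No training data is used; the hyperplanes depend only on the seed.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    codes = []
    for x in vectors:
        # Each bit records on which side of a hyperplane the vector falls.
        code = [1 if sum(w * xi for w, xi in zip(plane, x)) >= 0 else 0
                for plane in planes]
        codes.append(code)
    return codes

def hamming(a, b):
    """Number of positions in which two binary codes differ."""
    return sum(u != v for u, v in zip(a, b))

# Hypothetical toy data: two nearby points and one roughly opposite point.
points = [[1.0, 0.9], [0.9, 1.0], [-1.0, -0.8]]
codes = random_hyperplane_hash(points, n_bits=16)
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))
```

Because each bit comes from an independent random projection, many bits are typically needed before Hamming distances track the original similarities well, which is the memory cost discussed above.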

Most hashing approaches use linear models to map the data into binary codes. As a result, most existing learning-based methods do not capture the nonlinear relationships within images well. Although several improvements have been proposed, e.g. by adding kernelization [10], it is still challenging to select an appropriate kernel function for specific data. Since deep learning techniques have been shown to capture the nonlinear relationships within data well, a deep architecture can effectively boost the learning of hash functions.

In this paper, we propose a novel deep hashing method to learn hierarchical and nonlinear hash functions for obtaining compact binary codes. Our proposed deep architecture includes two heterogeneous types of layers: autoencoder layers and a Restricted Boltzmann Machine (RBM) layer. The autoencoder layers are used to generate the initial binary codes whereas the RBM layer is utilized to reduce the dimensionality of the binary codes. For learning the deep autoencoders and the RBM, we introduce new objective functions that minimize the reconstruction error and the energy function under the constraints of balanced and uncorrelated bits. Extensive experimental analysis on the problem of large-scale visual search demonstrates the validity and competitiveness of our proposed approach compared with state-of-the-art methods.

## 2 Related Work

Learning-Based Hashing. According to whether semantic information is used, learning-based hashing can be divided into three categories: unsupervised hashing, semi-supervised hashing and supervised hashing. Unsupervised hashing approaches do not use semantic information (such as tags) whereas supervised hashing approaches learn hash functions with semantic information. Semi-supervised approaches model the data with labeled as well as unlabeled data. For the first category, Spectral Hashing (SH) [6], ITerative Quantization (ITQ) [7] and K-Means Hashing (KMH) [8] use different objective functions with constraints on the binarization loss and/or the variance of the binary bits. For the second category, Semi-Supervised Hashing (SSH) by Wang et al. [2] constructs an objective function that minimizes the binarization loss on labeled data and maximizes the variance on unlabeled data. The approach was later extended by employing nonlinear hash functions [9]. For the third category, Linear Discriminant Analysis (LDA) [11] and multiple linear SVMs [12] were used as hash functions and trained with a large-margin criterion. While most methods seek a single linear mapping, we propose a new solution based on a deep learning framework to explore hierarchical and nonlinear hash mappings.

Deep Learning. Recently, several deep learning algorithms have been proposed in machine learning and applied to visual object detection and recognition, image classification, face verification and many other research problems [13]. Since the introduction of several foundational deep learning frameworks, such as Convolutional Neural Networks (CNN) [14], Stacked AutoEncoders (SAE) [15] and Deep Belief Networks (DBN) [16], numerous deep learning approaches have been developed on top of them. Some deep learning approaches have been applied to learning binary codes. Liong et al. [17] presented a framework that minimizes a global quantization loss function with two constraints to learn binary codes. In [18, 19, 20], convolutional neural networks were used to extract visual features and were combined with a hashing layer to learn binary codes through supervised learning. In this context, we propose a novel deep learning approach with a heterogeneous architecture and specific constraints for image hashing. In this architecture, we use layer-wise unsupervised learning to learn the model parameters.

## 3 Deep Hashing

Fig. 1 illustrates the framework of our proposed deep hashing method. The framework contains two heterogeneous layers: (1) several deep autoencoder layers; (2) an RBM layer. Given a feature vector , the deep hashing framework can transform the input vector into a binary vector , where .

### 3.1 SAE Layers

Let us assume that there are layers in our deep autoencoder layers, and the hash function in th layer is . represents the input vector in th layer and is the initial input . The output vector in th layer is denoted as . To learn multiple-layers autoencoder, layer-by-layer training has been proposed [15] to minimize the reconstruction error. As shown in Fig. 2, the deep autoencoder (i.e. SAE) can be divided into several three-layers autoencoders for each hidden layer of SAE. represents the reconstructed vector of . The optimization problem of a conventional autoencoder is to minimize the reconstruction error for each hidden layer:

(1) |

where is the number of training samples.
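Since the displayed form of Eq. 1 depends on the paper's notation, a generic statement of the conventional reconstruction objective may help as a reminder. The following is only a sketch in assumed notation: $x_n$ for the $n$-th input to one three-layer autoencoder, $\hat{x}_n$ for its reconstruction, $\theta$ for that layer's parameters, and $N$ for the number of training samples. These symbols, and the absence of any scaling constant, are illustrative assumptions rather than the paper's exact formulation.

```latex
\min_{\theta} \; \sum_{n=1}^{N} \left\| x_n - \hat{x}_n \right\|_2^2
```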

Besides preserving the similarity in the projected space by minimizing the reconstruction error, the representative hash codes should be balanced and uncorrelated [6]. For a balanced i-th bit, we should have . To make each bit more informative, the code bits should also be uncorrelated. This is satisfied by setting . Solving the problem (Eq. 1) under the above constraints is non-trivial as the problem is NP-hard. Since our goal is to obtain hash codes whose bits are as balanced and uncorrelated as possible, we propose to add the above constraints as regularization terms and seek a suboptimal solution. The regularized optimization problem is defined by
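The two code properties can be checked numerically. The sketch below follows the balance and decorrelation conditions as they are commonly stated for Spectral Hashing [6]: each bit sums to zero over the samples, and the averaged outer product of the codes equals the identity. The function name `bit_penalties`, the squared-error form of the penalties, and the toy codes are assumptions made for illustration; the regularization terms actually used in Eq. 2 may be weighted or scaled differently.

```python
def bit_penalties(codes):
    """Return (balance, decorrelation) penalties for codes in {-1, +1}.

    codes: list of N codes, each a list of K values in {-1, +1}.
    Both penalties are zero for perfectly balanced, uncorrelated bits.
    """
    n, k = len(codes), len(codes[0])
    # Balance: the mean of every bit over the samples should be zero.
    bit_means = [sum(c[i] for c in codes) / n for i in range(k)]
    balance = sum(m * m for m in bit_means)
    # Decorrelation: (1/N) * sum_n b_n b_n^T should equal the identity,
    # measured here with a squared Frobenius norm.
    decorrelation = 0.0
    for i in range(k):
        for j in range(k):
            cov = sum(c[i] * c[j] for c in codes) / n
            target = 1.0 if i == j else 0.0
            decorrelation += (cov - target) ** 2
    return balance, decorrelation

# Hypothetical 2-bit codes: `good` uses every combination once,
# `bad` has two identical (fully correlated) and unbalanced bits.
good = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
bad = [[1, 1], [1, 1], [1, 1], [-1, -1]]
print(bit_penalties(good))  # (0.0, 0.0)
print(bit_penalties(bad))   # (0.5, 2.0)
```

Adding such penalties to the reconstruction error turns the hard constraints into soft ones, which is what makes gradient-based training applicable.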

(2) |

where represents the Frobenius norm. is the reconstructed vector and computed as . is the output vector of hidden layers and computed as .

To learn the model parameters for the -layers autoencoders, we employ the BackPropagation (BP) algorithm to solve the optimization problem (Eq. 2). and are updated as

(3) |

where is a learning rate.

The gradients of parameters are derived as

(4) |

where ”” denotes element-wise multiplication and the sample index ”” is marked as a subscript for clarity. In Eq. 4, the local gradient and the derivative of the activation function are computed as

(5) |

In Eq. 5, the parameters and need to be learned when the reconstruction errors are back-propagated. These parameters can be learned similarly to the parameters and .

### 3.2 RBM Layer

We further employ an RBM layer (Fig. 3) to reduce the dimension of the binary codes. Since the variables are binary in the RBM layer, the sign function is used to transform the output vector of SAE layers so that each input unit of the RBM layer can be valued as .

Let us assume that the visible layer and the hidden layer are denoted as and respectively whereas and are the bias and weights of visible layer and hidden layer. The energy of the RBM model is defined as and the joint probability of is .
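For readers unfamiliar with RBMs, the sketch below evaluates the textbook energy function and joint probability of a binary RBM. It assumes units valued in {0, 1} and the standard energy E(v, h) = -a·v - b·h - vᵀWh; the names `a`, `b`, `W` and the toy numbers are illustrative, and the paper's own definition may use a different unit encoding. The partition function is computed by brute-force enumeration, which is only feasible for very small models.

```python
import math
from itertools import product

def rbm_energy(v, h, a, b, W):
    """Textbook RBM energy: E(v, h) = -a.v - b.h - v^T W h."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + interaction)

def joint_prob(v, h, a, b, W):
    """p(v, h) = exp(-E(v, h)) / Z, with Z enumerated exhaustively."""
    z = sum(math.exp(-rbm_energy(list(vv), list(hh), a, b, W))
            for vv in product([0, 1], repeat=len(a))
            for hh in product([0, 1], repeat=len(b)))
    return math.exp(-rbm_energy(v, h, a, b, W)) / z

# Hypothetical tiny model: 2 visible units, 1 hidden unit.
a = [0.5, -0.5]          # visible biases
b = [0.25]               # hidden biases
W = [[1.0], [2.0]]       # weights, W[i][j] links visible i to hidden j
print(rbm_energy([1, 0], [1], a, b, W))  # -1.75
```

Low-energy configurations receive high probability, so maximizing the likelihood of the training samples amounts to lowering their energy relative to all other configurations.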

The optimization problem of a conventional RBM is to maximize the likelihood of training samples as follows

(6) |

In order to keep the binary codes balanced and uncorrelated, we integrate the above constraints into the optimization problem (Eq. 6). Thus, the regularized problem is defined as

(7) |

where . Since the derivative of the sign function is an impulse function, the problem (Eq. 7) is intractable for gradient-based optimization. To seek an approximate solution, we replace the sign function with a differentiable function . Through computing the gradients and , the parameters are updated as

(8) |

We utilize the Contrastive Divergence (CD) algorithm [21] to seek the numerical solution of the problem (Eq. 7).

The gradients are estimated with Gibbs sampling as follows

(9) |

where represents the r-step Gibbs sampling and . The derivative of is .
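As background for the Gibbs-sampling estimate, the following is a minimal sketch of one Contrastive Divergence update for a plain binary RBM, in the spirit of [21]. It deliberately omits the balance and decorrelation regularizers of Eq. 7, assumes units valued in {0, 1} with logistic conditionals, and uses hypothetical names (`cd_step`, `lr`, `r`) and toy data, so it illustrates the sampling scheme rather than the paper's exact update rule.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, b, W, rng):
    """Return p(h_j = 1 | v) for all j and one binary sample of h."""
    probs = [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b))]
    return probs, [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, a, W, rng):
    """Return p(v_i = 1 | h) for all i and one binary sample of v."""
    probs = [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(a))]
    return probs, [1 if rng.random() < p else 0 for p in probs]

def cd_step(v0, a, b, W, lr=0.1, r=1, rng=None):
    """One CD-r update in place; returns the r-step reconstruction."""
    if rng is None:
        rng = random.Random(0)
    ph0, h = sample_hidden(v0, b, W, rng)   # positive phase
    vr, phr = v0, ph0
    for _ in range(r):                      # r steps of Gibbs sampling
        _, vr = sample_visible(h, a, W, rng)
        phr, h = sample_hidden(vr, b, W, rng)
    # Gradient estimate: data statistics minus reconstruction statistics.
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - vr[i] * phr[j])
        a[i] += lr * (v0[i] - vr[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - phr[j])
    return vr

# Hypothetical toy run: 4 visible units, 2 hidden units, 2 patterns.
rng = random.Random(0)
a, b = [0.0] * 4, [0.0] * 2
W = [[0.0] * 2 for _ in range(4)]
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(50):
    for v in data:
        cd_step(v, a, b, W, rng=rng)
```

In a constrained variant, the gradients of the regularization terms would be added to these updates.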

The detailed deep hashing algorithm is summarized in Algorithm 1.

## 4 Experimental Analysis

To evaluate our proposed method, we performed extensive experiments on two datasets: CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) and MIRFLICKR-25K (http://press.liacs.nl/mirflickr/). The CIFAR-10 dataset consists of 60,000 color images in 10 classes, split into 50,000 training images and 10,000 test images. The MIRFLICKR-25K dataset contains 25,000 color images in 26 classes, from which 20,000 training images and 5,000 test images are randomly selected. Each image is represented by the concatenation of a 512-D GIST descriptor [22] and a 512-D Bag-of-Features (BoF) descriptor [23].

For comparative analysis, the KLSH [5], SH [6], ITQ [7] and KMH [8] algorithms are used as baseline methods, based on the code that their authors have shared online. Our approach (denoted as HetDH) uses 3 hidden layers ( architecture) for SAE and 1 hidden layer ( neurons) for RBM due to the dimensions of the images. To gain insight into the impact of the constraints in our proposed deep learning framework, we performed experiments with constraints (HetDH) and without constraints (denoted as HetDH-WC). We report the results of all the approaches as precision-recall curves.
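The evaluation protocol can be illustrated with a small sketch of Hamming-ranking precision and recall. It assumes that a database item is relevant when it shares the query's class label, which is a common convention but is not spelled out in the text (and would need adapting for multi-label data); the function names and toy codes are hypothetical.

```python
def hamming(a, b):
    """Number of positions in which two binary codes differ."""
    return sum(u != v for u, v in zip(a, b))

def precision_recall_curve(query_code, query_label, db_codes, db_labels):
    """Rank the database by Hamming distance to the query and return
    a list of (precision, recall) pairs, one per rank position."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    n_relevant = sum(1 for label in db_labels if label == query_label)
    if n_relevant == 0:
        return []  # recall is undefined without relevant items
    curve, hits = [], 0
    for rank, idx in enumerate(order, start=1):
        if db_labels[idx] == query_label:
            hits += 1
        curve.append((hits / rank, hits / n_relevant))
    return curve

# Hypothetical 4-bit database with two classes.
db_codes = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
db_labels = ["cat", "cat", "dog", "dog"]
curve = precision_recall_curve([0, 0, 0, 0], "cat", db_codes, db_labels)
print(curve[0], curve[1], curve[-1])  # (1.0, 0.5) (1.0, 1.0) (0.5, 1.0)
```

Averaging such curves over all test queries gives one precision-recall curve per method and code length.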

Fig. 4 shows the precision-recall curves on the CIFAR-10 and MIRFLICKR-25K datasets at 32 and 64 bits for all considered methods. It can be observed that our proposed method (HetDH) outperforms all other methods in all configurations. The results also show that all the learning-based methods work better than the data-independent method (KLSH). Compared to shallow learning models (SH, ITQ, KMH), both HetDH and HetDH-WC achieve good performance due to the hierarchical representation of deep learning in our proposed approach. Comparing HetDH with HetDH-WC, the results show that the constraints for each layer effectively boost conventional deep learning and improve the search performance. This supports the effectiveness of our proposed algorithm.

It is worth noting that the dimension of the hash codes (the number of binary bits) affects the performance of image search (see the experimental results at 32 bits vs. 64 bits). The results indicate that larger dimensions improve the precision and recall but at the cost of more memory storage. Finally, the experiments also show that all the hashing methods seem to work better on the CIFAR-10 dataset than on the MIRFLICKR-25K dataset. This is perhaps due to the more diverse nature of the images in the MIRFLICKR-25K dataset compared to those in the CIFAR-10 dataset.
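The memory trade-off is simple arithmetic and can be sketched as follows. The calculation assumes tightly packed bits and, for comparison, 1024-D features (512-D GIST plus 512-D BoF) stored as 4-byte floats; the 4-byte width and the function names are assumptions, not details given in the paper.

```python
def code_storage_bytes(n_items, n_bits):
    """Bytes needed to store n_items binary codes, packed 8 bits per byte."""
    return n_items * n_bits // 8

def float_storage_bytes(n_items, dim, bytes_per_value=4):
    """Bytes needed to store n_items real-valued feature vectors."""
    return n_items * dim * bytes_per_value

cifar_items = 60_000  # total number of CIFAR-10 images
print(code_storage_bytes(cifar_items, 32))     # 240000
print(code_storage_bytes(cifar_items, 64))     # 480000
print(float_storage_bytes(cifar_items, 1024))  # 245760000
```

Doubling the code length doubles the storage, yet under these assumptions even 64-bit codes remain about 500 times smaller than the raw features.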

## 5 Conclusion

We proposed a heterogeneous deep learning architecture for learning hash functions. We learned the parameters of the SAE and RBM layers under two constraints that encourage balanced and uncorrelated binary codes. Experimental results and extensive comparative analysis on the problem of large-scale image search demonstrated the effectiveness of our proposed approach, which outperformed state-of-the-art unsupervised methods.

## References

- [1] L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.
- [2] J. Wang, S. Kumar, and S.F. Chang, “Semi-supervised hashing for large-scale search,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 12, pp. 2393–2406, 2012.
- [3] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 518–529.
- [4] M. Slaney and M. Casey, “Locality-sensitive hashing for finding nearest neighbors,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 128–131, 2008.
- [5] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2130–2137.
- [6] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” Advances in Neural Information Processing Systems, vol. 282, no. 3, pp. 1753–1760, 2008.
- [7] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817–824.
- [8] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2938–2945.
- [9] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu, “Semi-supervised nonlinear hashing using bootstrap sequential projection learning,” IEEE Transactions on Knowledge & Data Engineering, vol. 25, no. 6, pp. 1380–1393, 2013.
- [10] W. Liu, J. Wang, R. Ji, Y.G. Jiang, and S.F. Chang, “Supervised hashing with kernels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
- [11] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua, “Ldahash: Improved matching with smaller descriptors,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 1, pp. 66–78, 2011.
- [12] M. Rastegari, A. Farhadi, and D. Forsyth, “Attribute discovery via predictable discriminative binary codes,” Lecture Notes in Computer Science on ECCV, vol. 7577, no. 1, pp. 876–889, 2012.
- [13] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
- [14] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1090–1098.
- [15] P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103.
- [16] R. Salakhutdinov and G. Hinton, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
- [17] V.E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2475–2483.
- [18] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 2156–2162.
- [19] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015, pp. 27–35.
- [20] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3270–3278.
- [21] A. Fischer and C. Igel, “Training restricted boltzmann machines: An introduction,” Pattern Recognition, vol. 47, no. 1, pp. 25–39, 2014.
- [22] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
- [23] J. Sivic and A. Zisserman, “Efficient visual search of videos cast as text retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591–606, 2009.