Hashing with Mutual Information



Binary vector embeddings enable fast nearest neighbor retrieval in large databases of high-dimensional objects, and play an important role in many practical applications, such as image and video retrieval. We study the problem of learning binary vector embeddings under a supervised setting, also known as hashing. We propose a novel supervised hashing method based on optimizing an information-theoretic quantity, mutual information. We show that optimizing mutual information can reduce ambiguity in the induced neighborhood structure in learned Hamming space, which is essential in obtaining high retrieval performance. To this end, we optimize mutual information in deep neural networks with minibatch stochastic gradient descent, with a formulation that maximally and efficiently utilizes available supervision. Experiments on four image retrieval benchmarks, including ImageNet, confirm the effectiveness of our method in learning high-quality binary embeddings for nearest neighbor retrieval.




Hashing, Deep learning, Nearest neighbor retrieval, Mutual information.


1 Introduction


In computer vision and many other related application areas, there is typically an abundance of data with high-dimensional raw representations, such as mega-pixel images and high-definition videos. Besides obvious storage challenges, high-dimensional data pose additional challenges for semantic-level processing and understanding. One prominent such example is nearest neighbor retrieval. In applications such as image and video search, person and object recognition in photo collections, and action detection and classification in surveillance video, it is often necessary to map high-dimensional data objects to low-dimensional vector representations to allow for efficient retrieval of similar instances in large databases. In addition, the desired semantic similarity can vary from task to task, often prescribed by available supervision, e.g. class labels. Therefore, the mapping process is also responsible for leveraging supervised learning to encode task-specific similarity metrics, such that objects that are semantically similar are mapped to close neighbors in the resulting vector space.

In this paper, we consider the problem of learning low-dimensional binary vector embeddings of high-dimensional data, also known as hashing. Binary embeddings enjoy low memory footprint and permit fast search mechanisms, as Hamming distance computation between binary vectors can be implemented very efficiently in modern CPUs. As a result, across a variety of domains, hashing approaches have been widely utilized in applications requiring fast (approximate) nearest neighbor retrieval. Examples include: image annotation [52], visual tracking [29], 3D reconstruction [6], video segmentation [37], object detection [10], audio search [49], multimedia retrieval [44, 43, 14], and large-scale clustering [20, 18, 19]. Our goal is to learn hashing functions that can result in optimal nearest neighbor retrieval performance. In particular, as motivated above, we approach hashing as a supervised learning problem, such that the learned binary embeddings encode task-specific semantic similarity.

Supervised hashing is a well-studied problem. Although many different formulations exist, all supervised hashing formulations essentially constrain the learned Hamming distance to agree with the given supervision. Such supervision can be specified as pairwise affinity labels: pairs of objects are annotated with binary labels indicating their pairwise relationships as either “similar” or “dissimilar.” In this case, a common learning strategy is affinity matching: the learned binary embedding should evaluate to low Hamming distance between similar pairs, and high Hamming distance between dissimilar pairs. Alternatively, supervision can also be given in terms of local relative distance comparisons, most notably three-tuples of examples, or “triplets”, where one example is constrained to have lower distance to the second example than the third. This can be termed local ranking. Typically, these formulations attempt to improve the overall nearest neighbor retrieval performance through refining and optimizing objective functions that closely match the form of supervision, i.e., defined on pairs or triplets. To this end, it is often necessary to introduce parameters such as margins, thresholds, scaling factors and other regularization parameters.

We argue that approaches such as affinity matching and local ranking are insufficient to achieve optimal nearest neighbor retrieval performance. Instead, we view supervised hashing through an information-theoretic lens, and propose a novel solution tailored for the task of nearest neighbor retrieval. Our key observation is that a good binary embedding should separate neighbors and non-neighbors well in the Hamming space, or, equivalently, achieve low neighborhood ambiguity. An alternative viewpoint is that the learned Hamming embedding should carry a high amount of information regarding the desired neighborhood structure. To quantify neighborhood ambiguity, we use a well-known quantity from information theory, mutual information, and show that it has direct and strong correlations with standard ranking-based retrieval performance metrics. An appealing property of the mutual information objective is that it is free of tuning parameters, unlike others that may require thresholds, margins, and so on. Finally, to optimize mutual information, we relax the original NP-hard discrete optimization problem, and develop a gradient-based optimization framework that can be efficiently applied with minibatch stochastic gradient descent in deep neural networks.

To briefly summarize our contributions, we propose a novel supervised hashing method based on quantifying and minimizing neighborhood ambiguity in the learned Hamming space, using mutual information as the learning objective. An end-to-end gradient-based optimization framework is developed, with an efficient minibatch-based implementation. We conduct image retrieval experiments on four standard benchmarks: CIFAR-10, NUSWIDE, LabelMe, and ImageNet. The proposed method achieves state-of-the-art retrieval performance across all datasets.

The rest of this paper is organized as follows. First, the relevant literature is reviewed in Section 2. We propose and analyze mutual information as a learning objective for hashing in Section 3, and then discuss its optimization using stochastic gradient descent and deep neural networks in Section 4. Section 5 presents experimental results and empirical analysis of the proposed algorithm’s behavior. Finally, Section 6 presents the conclusions.

2 Related Work

Many hashing methods have been introduced over the years. While a precise taxonomy is difficult, a rough grouping can be made as data-independent and dependent techniques. Data-independent techniques do not exploit data distribution during hashing. Instead, similarity, as induced by a particular metric, is often preserved. This is achieved by maximizing the probability of “collision” when hashing similar items. Notable earlier examples include Locality Sensitive Hashing methods [9, 16, 26] where distance functions such as the Euclidean, Jaccard, and Cosine distances are approximated. These methods usually have theoretical guarantees on the approximation quality and conform with sub-linear retrieval mechanisms. However, they are confined to certain metrics as they ignore the data distribution and accompanying supervision.

Contrary to data-independent techniques, recent approaches are data-dependent, in that hash mappings are learned from the training set. While empirical evidence for the superiority of these methods over their data-independent counterparts is plentiful in the literature, a recent study has also theoretically validated their performance advantage [1]. These methods can be considered as binary embeddings that map the data into Hamming space while preserving a specific neighborhood structure. Such a neighborhood is derived from the meta-data (e.g., labels) or is completely determined by the user (e.g., via similarity-dissimilarity indicators of data pairs). With the binary embeddings, distances can be computed efficiently, so that even a linear search is fast for large-scale corpora. These data-dependent methods can be grouped as follows: similarity preserving techniques [51, 25, 36, 31, 55, 45, 58, 15, 42], quantization methods [22, 17, 21] and, more recently, deep-learning based methods [56, 33, 13, 4, 27, 61, 62, 35, 60]. We now review a few of the prominent techniques in each category.

Quantization methods are the first group of data-dependent hashing methods, which are unsupervised in the sense that no supervision is assumed to be provided with the data. Such techniques generally minimize an objective involving a reconstruction error. Notable examples are quantization/PCA based techniques. Among these, Semi-Supervised Hashing [51] learns the hash mapping by maximizing the empirical accuracy on labeled data and also the entropy of the generated hash functions on any unlabeled data. This is shown to be very similar to a PCA analysis in which the hash functions are the eigenvectors of a covariance matrix that is biased by the supervised information. Other noteworthy work includes PCA-inspired methods where the principal components are taken as the hash functions. If “groups” that are suitable for clustering exist within the data, then further refining the principal components for better binarization has been shown to be beneficial, as in Iterative Quantization [17] and K-means Hashing [21].

Unsupervised quantization can also be approached as a special case of generative modeling. Semantic Hashing [40] is one early example of such algorithms based on the autoencoding principle, which learns a generative model to encode data. The generative model is in the form of stacked Restricted Boltzmann Machines (RBM), which are defined on binary variables by nature. Carreira-Perpinan and Raziperchikolaei [4] propose Binary Autoencoders, and construct autoencoders with a binary latent layer. They argue that finding the hash mapping without relaxing the binary constraints will yield better solutions, while in relaxation approaches that are more common in the literature, significant quantization errors can degrade the quality of learned hash functions. More recently, Stochastic Generative Hashing [8] learns a generative hashing model based on the minimum description length principle, and uses stochastic distributional gradient descent to optimize the associated variational objective and handle the difficulty of having binary stochastic neurons.

Similarity preserving methods, on the other hand, aim to construct binary embeddings that optimize loss functions induced from the supervision provided. Both the affinity matching and local ranking methods mentioned in Section 1 belong to this group. Among such techniques, Minimal Loss Hashing [38] considers minimizing a hinge-type loss function motivated from structural SVMs. In Binary Reconstructive Embeddings [25], a kernel-based solution is proposed where the goal is to construct hash functions by minimizing an empirical loss between the input and Hamming space distances via a coordinate descent algorithm. Supervised Hashing with Kernels [36] also proposes a kernel-based solution; but, instead of preserving the equivalence of the input and Hamming space distances, the kernel function weights are learned by minimizing an objective function based on binary code inner products. Spectral Hashing [55] and Self-Taught Hashing [58] are other notable lines of work where the similarity of the instances is preserved by solving a graph Laplacian problem. Rank alignment methods [41, 12] that learn a hash mapping to preserve rankings in the data can also be considered in this group.

Lately, several “two-stage” similarity preserving techniques have also been proposed, where the learning is decomposed into two steps: binary inference and hash function learning. The binary inference step yields hash codes that best preserve the similarity. These hash codes are then used as target vectors in the hash function learning step, for example, by learning binary classifiers to produce each bit. Notable two-stage methods include Fast Hashing with Decision Trees [32, 31], Structured Learning of Binary Codes [30] and Supervised Discrete Hashing [42]. All of these similarity preserving methods assume some type of supervision, such as labels or similarity indicators. Thus, in the literature, such techniques are also regarded as supervised hashing solutions.

Deep hashing methods have recently gained significant prominence following the success of deep neural networks in related tasks such as image classification. Although hashing methods using deep learning can be based on either unsupervised quantization or supervised learning, most existing deep hashing methods are supervised. A deep hashing study typically involves a novel architecture, a loss function or a binary inference formulation. Among such methods, Lai et al. [27] propose jointly learning the hash mapping and image features with a triplet loss formulation. This triplet loss ensures that an image is more similar to the second image than to a third one with respect to their binary codes. In [34], a network-in-network (NIN) type deep net is used as the architecture with a divide-conquer module. The divide-conquer module is shown to reduce redundancy in the hash bits. In [60], the authors follow the work of [42] and [4]. Similar to [42], they propose learning the hash mapping by optimizing a classification objective. In contrast to [42], they use a deep net consisting only of fully connected layers and propose using auxiliary variables, as in [4], to train efficiently and circumvent the vanishing gradient problem. Deep learning based hashing studies have also proposed sampling pairs or triplets of data instances to learn the hash mapping. Notable examples include [28] and [54], which optimize a likelihood function defined so as to ensure that similar (non-similar) pairs or triplets are mapped to nearby (distant) binary embeddings.

As the ultimate goal of hashing is to preserve a neighborhood structure in the Hamming space, we propose an information-theoretic solution and directly quantify the neighborhood ambiguity of the generated binary embeddings using a mutual information based criterion. The proposed mutual information based measure is efficient to compute, amenable to batch-learning, parameter-free and achieves state-of-the-art performances in standard retrieval benchmarks. We utilize a recent study, [47], when optimizing our mutual information based objective, and use their differentiable histogram binning technique as a foundation in deriving gradient-based optimization rules. Note that both our problem setup and objective function are quite different from [47].

Figure 1: Overview of the proposed hashing method. We use a deep neural network to compute $b$-bit binary codes for (1) a query image $\hat{x}$, (2) its neighbors in $\mathcal{N}^+_{\hat{x}}$, and (3) its non-neighbors in $\mathcal{N}^-_{\hat{x}}$. The binary codes are obtained via thresholding the activations in the last layer of the neural network. Computing Hamming distances between the binary code of the query and the binary codes of neighbors and non-neighbors yields two distributions of Hamming distances. The information-theoretic quantity, Mutual Information, can be used to capture the separability between these two distributions, which gives a good quality indicator and learning objective.

3 Hashing with Mutual Information

3.1 Preliminaries

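Let $\mathcal{X}$ be the feature space and $\mathcal{H}^b$ be the $b$-dimensional Hamming space, i.e., $\mathcal{H}^b = \{-1,+1\}^b$. The goal of hashing is to learn an embedding function $\Phi: \mathcal{X} \to \mathcal{H}^b$, which induces a Hamming distance $d_\Phi: \mathcal{X} \times \mathcal{X} \to \{0, 1, \ldots, b\}$ that is equal to the number of bit differences between embedded vectors.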

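We consider a supervised learning setup. For some example $\hat{x} \in \mathcal{X}$, we assume that we have access to a set $\mathcal{N}^+_{\hat{x}}$ containing examples that are labeled as similar to $\hat{x}$ (neighbors), and a set $\mathcal{N}^-_{\hat{x}}$ of dissimilar examples (non-neighbors). We call $\hat{x}$ an anchor example, and refer to $\{\mathcal{N}^+_{\hat{x}}, \mathcal{N}^-_{\hat{x}}\}$ as its neighborhood structure. Then, we can cast the problem of learning the Hamming embedding $\Phi$ as one of preserving the neighborhood structure: the neighbors of $\hat{x}$ should be mapped to the close vicinity of $\hat{x}$ in the Hamming space, while the non-neighbors should be mapped farther away. In other words, we would like to ideally satisfy the following constraint:

$$d_\Phi(\hat{x}, x_p) < d_\Phi(\hat{x}, x_n), \quad \forall x_p \in \mathcal{N}^+_{\hat{x}},\ \forall x_n \in \mathcal{N}^-_{\hat{x}}. \tag{2}$$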

If the learned embedding $\Phi$ successfully satisfies this constraint, then the neighborhood structure of the anchor $\hat{x}$ can be exactly recovered by thresholding the Hamming distance $d_\Phi(\hat{x}, \cdot)$. Generally, the neighborhood structure can be constructed by repeatedly querying a pairwise similarity oracle. In practice, it can often be derived from agreement of class labels, or from thresholding a distance metric (e.g. Euclidean distance) in the original feature space $\mathcal{X}$.
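
As a rough illustration of the two practical ways of deriving a neighborhood structure mentioned above (class-label agreement and distance thresholding), the following sketch splits a small dataset into neighbors and non-neighbors of one anchor. The function names `neighbors_by_class` and `neighbors_by_distance`, and the choice of a fixed `threshold` argument, are assumptions made for this example and do not come from the paper.

```python
def neighbors_by_class(labels, anchor):
    """Split indices into (neighbors, non-neighbors) of `anchor` by class agreement."""
    pos = [j for j, y in enumerate(labels) if j != anchor and y == labels[anchor]]
    neg = [j for j, y in enumerate(labels) if j != anchor and y != labels[anchor]]
    return pos, neg


def neighbors_by_distance(features, anchor, threshold):
    """Split indices by thresholding the Euclidean distance to the anchor."""
    def dist(u, v):
        return sum((a - c) ** 2 for a, c in zip(u, v)) ** 0.5

    pos, neg = [], []
    for j, x in enumerate(features):
        if j == anchor:
            continue  # the anchor is not part of its own neighborhood structure
        (pos if dist(features[anchor], x) <= threshold else neg).append(j)
    return pos, neg


labels = ["cat", "dog", "cat", "bird"]
features = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
by_class = neighbors_by_class(labels, anchor=0)
by_dist = neighbors_by_distance(features, anchor=0, threshold=2.0)
```

Either function returns two index lists, which play the role of the neighbor and non-neighbor sets of the chosen anchor.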

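In this work, we parameterize the functional embeddings by deep neural networks, as they have recently been shown to have superior learning capabilities when coupled with appropriate hardware acceleration. Also, in order to take advantage of end-to-end training by backpropagation, we use gradient-based optimization, and adopt an equivalent formulation of the Hamming distance that is amenable to differentiation:

$$d_\Phi(\hat{x}, x) = \frac{1}{2}\left(b - \Phi(\hat{x})^\top \Phi(x)\right), \tag{3}$$

$$\Phi(x) = \left(\phi_1(x), \ldots, \phi_b(x)\right), \quad \phi_i(x) = \operatorname{sgn}\left(f_i(x; w)\right), \quad i = 1, \ldots, b, \tag{5}$$

where $f_i(x; w)$ are the activations produced by a feed-forward neural network, with learnable parameters $w$.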
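
For codes with entries in {-1, +1}, the number of differing bits equals (b - u·v) / 2, which is the differentiable form of the Hamming distance used here. The short sketch below checks this identity on a toy pair of codes; the helper names and the convention that a zero activation maps to +1 are assumptions of this example.

```python
def sign_code(activations):
    """Threshold real activations into a {-1, +1} code (0 maps to +1 here)."""
    return [1 if a >= 0 else -1 for a in activations]


def hamming_bits(u, v):
    """Hamming distance as a direct count of differing positions."""
    return sum(1 for a, c in zip(u, v) if a != c)


def hamming_inner(u, v):
    """Same distance via the inner product: d = (b - u.v) / 2."""
    b = len(u)
    dot = sum(a * c for a, c in zip(u, v))
    return (b - dot) / 2


u = sign_code([0.3, -1.2, 2.0, -0.1])   # [1, -1, 1, -1]
v = sign_code([-0.5, -0.7, 1.1, 0.4])   # [-1, -1, 1, 1]
d_count = hamming_bits(u, v)
d_inner = hamming_inner(u, v)
```

The inner-product form matters because, once the codes are relaxed to real values, it remains well defined and differentiable, whereas bit counting does not.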

3.2 Minimizing Neighborhood Ambiguity

We now discuss a formulation for learning the Hamming embedding $\Phi$. As mentioned above, given an anchor $\hat{x}$ and its neighborhood structure, we would like to satisfy Equation 2 as much as possible, or, minimize the amount of violation. Indeed, many existing supervised hashing formulations are based on the idea of minimizing violations. For instance, affinity matching methods, mentioned in Section 1, typically enforce the following constraints through their loss functions:

$$d_\Phi(\hat{x}, x_p) \le T_1, \quad d_\Phi(\hat{x}, x_n) \ge T_2, \quad \forall x_p \in \mathcal{N}^+_{\hat{x}},\ \forall x_n \in \mathcal{N}^-_{\hat{x}},$$

where $T_1$ and $T_2$ are threshold parameters. This indirectly enforces Equation 2 by constraining the absolute values of individual Hamming distances. Alternatively, local ranking methods based on triplet supervision encourage the following:

$$d_\Phi(\hat{x}, x_p) + \eta \le d_\Phi(\hat{x}, x_n), \quad \forall x_p \in \mathcal{N}^+_{\hat{x}},\ \forall x_n \in \mathcal{N}^-_{\hat{x}},$$

where $\eta$ is a margin parameter. We note that both formulations are inflexible as the same threshold or margin parameters are applied for all anchors $\hat{x}$, and they can be nontrivial to tune.

Instead, we propose a novel formulation based on the idea of minimizing neighborhood ambiguity. We say that $\Phi$ introduces neighborhood ambiguity if the mapped image of some non-neighbor $x_n \in \mathcal{N}^-_{\hat{x}}$ is closer to that of $\hat{x}$ than some neighbor $x_p \in \mathcal{N}^+_{\hat{x}}$ in the Hamming space; when this happens, it is no longer possible to recover the neighborhood structure by thresholding $d_\Phi(\hat{x}, \cdot)$. Consequently, when $\Phi$ is used to perform retrieval, the retrieved nearest neighbors of $\hat{x}$ would be contaminated by non-neighbors. Therefore, a high-quality embedding should minimize neighborhood ambiguity.

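To concretely formulate the idea, we define the random variable $D_{\hat{x},\Phi}: \mathcal{X} \to \{0, 1, \ldots, b\}$, $x \mapsto d_\Phi(\hat{x}, x)$, and let $C_{\hat{x}}: \mathcal{X} \to \{0, 1\}$ be the membership indicator for the neighbor set $\mathcal{N}^+_{\hat{x}}$. Then, we naturally have two conditional distributions of Hamming distance: $P(D_{\hat{x},\Phi} \mid C_{\hat{x}} = 1)$ and $P(D_{\hat{x},\Phi} \mid C_{\hat{x}} = 0)$. Note that the constraint in Equation 2 can be re-expressed as having no overlap between these two conditional distributions, and that minimizing neighborhood ambiguity amounts to minimizing the overlap. Please see Figure 1 for an illustration.

We use the mutual information between the random variables $D_{\hat{x},\Phi}$ and $C_{\hat{x}}$ to capture the amount of overlap between conditional Hamming distance distributions. The mutual information is defined as

$$\mathcal{I}(D_{\hat{x},\Phi}; C_{\hat{x}}) = H(C_{\hat{x}}) - H(C_{\hat{x}} \mid D_{\hat{x},\Phi}) = H(D_{\hat{x},\Phi}) - H(D_{\hat{x},\Phi} \mid C_{\hat{x}}),$$

where $H$ denotes (conditional) entropy. In the following, for brevity we will drop the subscripts $\hat{x}$ and $\Phi$, and denote the two conditional distributions and the marginal $P(D)$ as $p^+_D$, $p^-_D$, and $p_D$, respectively.

By definition, $\mathcal{I}(D; C)$ measures the decrease in uncertainty in the neighborhood information $C$ when observing the Hamming distances $D$. If $\mathcal{I}$ attains a high value, which means $C$ can be determined with low uncertainty by observing $D$, then $\Phi$ must have achieved good separation (i.e. low overlap) between $p^+_D$ and $p^-_D$. $\mathcal{I}$ is maximized when there is no overlap, and minimized when $p^+_D$ and $p^-_D$ are exactly identical. Recall, however, that $\mathcal{I}$ is defined with respect to a single anchor $\hat{x}$; therefore, for a general quality measure, we integrate $\mathcal{I}$ over the feature space:

$$\mathcal{O}(\Phi) = \int_{\mathcal{X}} \mathcal{I}(D_{\hat{x},\Phi}; C_{\hat{x}})\, p(\hat{x})\, d\hat{x}.$$

$\mathcal{O}(\Phi)$ captures the expected amount of separation between $p^+_D$ and $p^-_D$ achieved by $\Phi$, over all instances in $\mathcal{X}$.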

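The expected separation $\mathcal{O}(\Phi)$ cannot be determined directly as the ground truth distribution over $\mathcal{X}$ is unknown. However, it is possible to approximate $\mathcal{O}(\Phi)$ by simply using a finite set $\mathcal{T}$ of $n$ training elements independently sampled from that distribution:

$$\mathcal{O}_{\mathcal{T}}(\Phi) = \frac{1}{n} \sum_{\hat{x} \in \mathcal{T}} \mathcal{I}\left(D^{\mathcal{T}}_{\hat{x},\Phi}; C^{\mathcal{T}}_{\hat{x}}\right). \tag{11}$$

We use the superscript $\mathcal{T}$ to indicate that when computing the mutual information $\mathcal{I}$, the neighbor and non-neighbor distance distributions $p^+_D$ and $p^-_D$ for the instance $\hat{x}$ are estimated from $\mathcal{T}$. This can be done in $O(n)$ time for each $\hat{x}$, as the discrete distributions can be estimated via histogram binning.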
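
To make the histogram-based estimate concrete, here is a minimal sketch of the mutual information for one anchor, computed as H(D) - H(D | C) from the integer Hamming distances to its neighbors and non-neighbors. The function names, the use of natural logarithms, and the toy distances are assumptions of this example.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution; 0 log 0 is taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def histogram(distances, b):
    """Normalized (b+1)-bin histogram of integer Hamming distances in {0, ..., b}."""
    counts = [0] * (b + 1)
    for d in distances:
        counts[d] += 1
    n = len(distances)
    return [c / n for c in counts]


def mutual_information(d_pos, d_neg, b):
    """I(D; C) = H(D) - H(D | C) for one anchor, estimated by histogram binning."""
    prior_pos = len(d_pos) / (len(d_pos) + len(d_neg))
    prior_neg = 1.0 - prior_pos
    p_pos = histogram(d_pos, b)   # distances to neighbors
    p_neg = histogram(d_neg, b)   # distances to non-neighbors
    p_all = [prior_pos * a + prior_neg * c for a, c in zip(p_pos, p_neg)]
    h_cond = prior_pos * entropy(p_pos) + prior_neg * entropy(p_neg)
    return entropy(p_all) - h_cond


mi_separated = mutual_information([0, 1, 1], [3, 4, 4], b=4)   # no overlap
mi_identical = mutual_information([1, 2, 2], [1, 2, 2], b=4)   # full overlap
```

With no overlap the estimate reaches its maximum, the entropy of the neighbor indicator (log 2 for balanced sets); with identical distributions it drops to zero, matching the behavior described in the text.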

An appealing property of the mutual information objective is that it is parameter-free: the objective encourages the distributions $p^+_D$ and $p^-_D$ to be separated, but does not include parameters dictating the distance threshold at which separation occurs, or the absolute amount of separation. The absence of such fixed parameters also increases flexibility, since the separation could occur at different distance thresholds depending on the anchor $\hat{x}$.

4 Optimizing Mutual Information

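Having shown that mutual information is a suitable measure of hashing quality, we consider its use as a learning objective for hashing. For brevity, we omit the superscripts in Equation 11 and redefine the learning objective as

$$\min_{\Phi}\ \mathcal{O} = -\frac{1}{n} \sum_{\hat{x} \in \mathcal{T}} \mathcal{I}(D_{\hat{x},\Phi}; C_{\hat{x}}). \tag{12}$$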

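Inspired by recent advances in stochastic optimization, we derive gradient descent rules for $\mathcal{O}$. We first derive the gradient of $\mathcal{I}$ with respect to the output of the hash mapping, $\Phi(\hat{x})$. With $b$-bit Hamming distances, we model the discrete distributions $p^+_D$ and $p^-_D$ using $(b+1)$-bin normalized histograms over $\{0, \ldots, b\}$. The mutual information $\mathcal{I}$ is continuously differentiable, and using the chain rule we can write

$$\frac{\partial \mathcal{I}}{\partial \Phi(\hat{x})} = \sum_{l=0}^{b} \left[ \frac{\partial \mathcal{I}}{\partial p^+_{D,l}} \frac{\partial p^+_{D,l}}{\partial \Phi(\hat{x})} + \frac{\partial \mathcal{I}}{\partial p^-_{D,l}} \frac{\partial p^-_{D,l}}{\partial \Phi(\hat{x})} \right], \tag{13}$$

where $p^+_{D,l}$ is the $l$-th element of $p^+_D$. We next focus on terms involving $p^+_D$, and omit derivations for $p^-_D$ due to symmetry. For $l = 0, \ldots, b$, we have

$$\begin{aligned}
\frac{\partial \mathcal{I}}{\partial p^+_{D,l}} &= -\frac{\partial H(D \mid C)}{\partial p^+_{D,l}} + \frac{\partial H(D)}{\partial p^+_{D,l}} \\
&= p^+ \left(\log p^+_{D,l} + 1\right) - \left(\log p_{D,l} + 1\right) \frac{\partial p_{D,l}}{\partial p^+_{D,l}} \\
&= p^+ \left(\log p^+_{D,l} - \log p_{D,l}\right),
\end{aligned} \tag{16}$$

where we used the fact that $p_{D,l} = p^+ p^+_{D,l} + p^- p^-_{D,l}$, with $p^+$ and $p^-$ being shorthands for the priors $P(C = 1)$ and $P(C = 0)$.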
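
The derivative of the mutual information with respect to one neighbor-histogram bin can be sanity-checked numerically. The sketch below compares a central finite difference against the closed form prior_pos * (log p_pos[l] - log p[l]), where p is the prior-weighted mixture of the two histograms. The function names and toy histograms are assumptions of this example, and each bin is perturbed as a free variable (the sum-to-one constraint is ignored, as in the partial derivative).

```python
import math


def mi_from_hists(p_pos, p_neg, prior_pos):
    """I(D; C) = H(D) - H(D | C) from the two conditional histograms and the neighbor prior."""
    prior_neg = 1.0 - prior_pos

    def h(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    p_all = [prior_pos * a + prior_neg * c for a, c in zip(p_pos, p_neg)]
    return h(p_all) - prior_pos * h(p_pos) - prior_neg * h(p_neg)


def analytic_grad(p_pos, p_neg, prior_pos, l):
    """Closed form: prior_pos * (log p_pos[l] - log p[l]), with p the mixture."""
    p_l = prior_pos * p_pos[l] + (1.0 - prior_pos) * p_neg[l]
    return prior_pos * (math.log(p_pos[l]) - math.log(p_l))


def numeric_grad(p_pos, p_neg, prior_pos, l, eps=1e-6):
    """Central finite difference with respect to the l-th bin of p_pos."""
    up = list(p_pos)
    dn = list(p_pos)
    up[l] += eps
    dn[l] -= eps
    return (mi_from_hists(up, p_neg, prior_pos)
            - mi_from_hists(dn, p_neg, prior_pos)) / (2 * eps)


p_pos = [0.5, 0.3, 0.2]
p_neg = [0.1, 0.3, 0.6]
gaps = [abs(analytic_grad(p_pos, p_neg, 0.25, l) - numeric_grad(p_pos, p_neg, 0.25, l))
        for l in range(3)]
```

All entries of `gaps` should be close to zero, which gives some confidence that the closed form is the right building block for the chain rule.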

4.1 Continuous Relaxation

To complete the chain rule, we need to further derive the term $\partial p^+_{D,l} / \partial \Phi(\hat{x})$. However, note that the hash mapping $\Phi$ is discrete by nature, precluding the use of continuous optimization. While it is possible to maintain such constraints and resort to discrete optimization, the resulting problems are usually NP-hard. Instead, we take the relaxation approach, in order to apply gradient-based continuous optimization. Correspondingly, we need to perform a continuously differentiable relaxation to $\Phi$. Recall from Equation 5 that each element in $\Phi$ is obtained from thresholding neural network activations with the sign function. We relax $\Phi$ into a real-valued vector $\hat{\Phi}$ by adopting a standard technique in hashing [2, 28, 36], where the discontinuous sign function is approximated with the sigmoid function $\sigma$:

$$\hat{\Phi}(x) = \left(\hat{\phi}_1(x), \ldots, \hat{\phi}_b(x)\right), \quad \hat{\phi}_i(x) = 2\sigma\left(\gamma f_i(x; w)\right) - 1. \tag{18}$$

We include a tuning parameter $\gamma$ in the approximation. We choose $\gamma > 1$ so as to increase the “sharpness” of the approximation, and reduce the error introduced by the continuous relaxation. Alternatively, it is possible to formulate the quantization error as a penalty term in the objective function with proper weighting, e.g. [28, 54], or use continuation methods [3].
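
The effect of the sharpness parameter can be seen in a small sketch. It assumes the relaxation takes the rescaled form 2 * sigmoid(gamma * f) - 1, so that relaxed codes lie in (-1, 1) and approach the {-1, +1} sign codes as gamma grows; the function names and sample activations are illustrative only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relaxed_code(activations, gamma):
    """Continuous surrogate for sign(f): 2 * sigmoid(gamma * f) - 1, in (-1, 1)."""
    return [2.0 * sigmoid(gamma * a) - 1.0 for a in activations]


def quantization_error(activations, gamma):
    """Largest gap between the relaxed code and the hard sign code."""
    hard = [1.0 if a >= 0 else -1.0 for a in activations]
    soft = relaxed_code(activations, gamma)
    return max(abs(h - s) for h, s in zip(hard, soft))


acts = [0.3, -1.2, 2.0, -0.1]
err_soft = quantization_error(acts, gamma=1.0)
err_sharp = quantization_error(acts, gamma=50.0)
```

A larger gamma shrinks the quantization error but also makes gradients vanish away from zero, which is presumably why the text treats it as a tuning parameter rather than simply making it as large as possible.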

With the continuous relaxation in place, we move on to the partial differentiation of the distance distributions $p^+_D$ and $p^-_D$ with respect to the relaxed hash mapping $\hat{\Phi}$. As mentioned before, these discrete distributions can be estimated via histogram binning; however, histogram binning is a non-differentiable operation. In the following, we describe a differentiable approximation to the discrete histogram binning process, thereby enabling end-to-end backpropagation.

4.2 End-to-End Optimization

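We first elaborate on the histogram binning process. Without the continuous relaxation, the $l$-th element $p^+_{D,l}$ of the neighbor distance distribution is estimated by performing hard assignments on Hamming distances into histogram bins:

$$p^+_{D,l} = \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \mathbb{1}\left[d_\Phi(\hat{x}, x) = l\right], \quad l = 0, \ldots, b. \tag{19}$$

With the continuous relaxation developed above, we first note that $d_\Phi$ in Equation 3 is not integer-valued any more, but is also continuously relaxed into

$$\hat{d}_\Phi(\hat{x}, x) = \frac{b - \hat{\Phi}(\hat{x})^\top \hat{\Phi}(x)}{2}. \tag{20}$$

When $d_\Phi$ is relaxed into $\hat{d}_\Phi$, we employ a technique similar to [47], and replace hard assignment with soft assignment. The key is to approximate the binary indicator $\mathbb{1}[d_\Phi(\hat{x}, x) = l]$ with a differentiable function $\delta_{x,l}$. Specifically, we use a triangular kernel of half-width $\Delta$ centered on the histogram bin, so that $\delta_{x,l}$ linearly interpolates the real-valued $\hat{d}_\Phi(\hat{x}, x)$ into the $l$-th bin with slope $1/\Delta$:

$$\delta_{x,l} = \max\left\{0,\ 1 - \frac{|\hat{d}_\Phi(\hat{x}, x) - l|}{\Delta}\right\}.$$

This soft assignment admits simple subgradients:

$$\frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)} = \begin{cases} 1/\Delta, & \hat{d}_\Phi(\hat{x}, x) \in [l - \Delta,\ l), \\ -1/\Delta, & \hat{d}_\Phi(\hat{x}, x) \in (l,\ l + \Delta], \\ 0, & \text{otherwise}. \end{cases}$$

It is easy to see that the triangular kernel approximation approaches the original binary indicator as $\Delta \to 0$.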
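
The soft assignment can be sketched in a few lines. The code assumes a triangular kernel of the form max(0, 1 - |d - l| / delta) with unit bin spacing; the function names, the default `delta=1.0`, and the subgradient convention at the kink points are choices made for this example.

```python
def triangular_weight(d, l, delta=1.0):
    """Soft membership of a real-valued distance d in bin l (triangular kernel)."""
    return max(0.0, 1.0 - abs(d - l) / delta)


def triangular_subgrad(d, l, delta=1.0):
    """A subgradient of the kernel with respect to d (0 is used at the kinks' outer side and at d == l)."""
    if l - delta <= d < l:
        return 1.0 / delta
    if l < d <= l + delta:
        return -1.0 / delta
    return 0.0


def soft_histogram(distances, b, delta=1.0):
    """Differentiable surrogate of the normalized (b+1)-bin histogram."""
    n = len(distances)
    return [sum(triangular_weight(d, l, delta) for d in distances) / n
            for l in range(b + 1)]


soft = soft_histogram([1.25], b=3)        # mass split between bins 1 and 2
hard = soft_histogram([0, 2, 2], b=3)     # integer distances: ordinary histogram
```

With `delta=1.0`, a distance of 1.25 contributes 0.75 to bin 1 and 0.25 to bin 2, and integer distances reproduce the ordinary histogram, so the relaxation agrees with hard binning wherever the relaxed codes happen to be binary.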

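We are now ready to tackle the term $\partial p^+_{D,l} / \partial \hat{\Phi}(\hat{x})$. From the definition of $p^+_{D,l}$ in Equation 19, we have, for $p^+_{D,l}$:

$$\begin{aligned}
\frac{\partial p^+_{D,l}}{\partial \hat{\Phi}(\hat{x})} &= \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{\Phi}(\hat{x})} \\
&= \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)} \frac{\partial \hat{d}_\Phi(\hat{x}, x)}{\partial \hat{\Phi}(\hat{x})} \\
&= -\frac{1}{2|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)}\, \hat{\Phi}(x).
\end{aligned}$$

For the last step, we used the definition of $\hat{d}_\Phi$ in Equation 20. Next, for $p^-_{D,l}$:

$$\frac{\partial p^-_{D,l}}{\partial \hat{\Phi}(\hat{x})} = \frac{1}{|\mathcal{N}^-_{\hat{x}}|} \sum_{x \in \mathcal{N}^-_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{\Phi}(\hat{x})} = -\frac{1}{2|\mathcal{N}^-_{\hat{x}}|} \sum_{x \in \mathcal{N}^-_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)}\, \hat{\Phi}(x).$$

Lastly, to back-propagate gradients to $\hat{\Phi}$’s inputs and ultimately to the parameters of the underlying neural network, we only need to further differentiate the sigmoid approximation employed in $\hat{\Phi}$ (Equation 18), which has a closed form expression.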

4.3 Efficient Minibatch Backpropagation

So far, our derivations of mutual information and its gradients have assumed a single anchor example $\hat{x}$, omitting the fact that the optimization objective in Equation 12 is the average of mutual information values over all anchors in a finite training set $\mathcal{T}$. In information retrieval terminology, the current derivations are for a single query and a fixed database. However, in many computer vision tasks (e.g. image retrieval), there is usually no clear split of the given training set into a set of queries and a database. Even if we create such a split, it can be arbitrary and does not fully utilize available supervision. Another challenge is adapting the optimization to the stochastic/minibatch setting, since deep neural networks are typically trained by minibatch stochastic gradient descent (SGD), where it is infeasible to access the entire database all at once.

Here, we describe a way to efficiently utilize all the available supervision during minibatch SGD training, simultaneously addressing both challenges. The idea is that, for a minibatch with $M$ examples, each example is used as the query against a database comprising the remaining $M-1$ examples. For each query $\hat{x}$, elements of its neighbor set $\mathcal{N}^+_{\hat{x}}$ and non-neighbor set $\mathcal{N}^-_{\hat{x}}$ are found from the remaining $M-1$ examples according to provided labels. Then, the overall objective value for the minibatch is the average over the $M$ queries. This way, the available supervision in the minibatch is utilized maximally: each example is used as the query once, and as a database item $M-1$ times. As we shall see next, the backpropagation in this case can be efficiently implemented using matrix multiplications.
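
The query-versus-rest scheme can be sketched as a forward pass over a toy minibatch. For simplicity the example uses hard binary codes, hard histograms, and single-label class agreement, and it returns the negative average mutual information (the quantity to be minimized); all function names, the handling of anchors with an empty neighbor or non-neighbor set, and the toy data are assumptions of this sketch rather than details from the paper.

```python
import math


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def hamming(u, v):
    return sum(1 for a, c in zip(u, v) if a != c)


def mutual_information(d_pos, d_neg, b):
    """I(D; C) for one query; returns 0 if either set is empty (assumed convention)."""
    if not d_pos or not d_neg:
        return 0.0

    def hist(ds):
        counts = [0] * (b + 1)
        for d in ds:
            counts[d] += 1
        return [c / len(ds) for c in counts]

    w_pos = len(d_pos) / (len(d_pos) + len(d_neg))
    p_pos, p_neg = hist(d_pos), hist(d_neg)
    p_all = [w_pos * a + (1 - w_pos) * c for a, c in zip(p_pos, p_neg)]
    return entropy(p_all) - w_pos * entropy(p_pos) - (1 - w_pos) * entropy(p_neg)


def minibatch_objective(codes, labels):
    """Negative mean MI, using each example once as the query against the rest."""
    m, b = len(codes), len(codes[0])
    total = 0.0
    for i in range(m):
        d_pos, d_neg = [], []
        for j in range(m):
            if j == i:
                continue  # the query is excluded from its own database
            d = hamming(codes[i], codes[j])
            (d_pos if labels[j] == labels[i] else d_neg).append(d)
        total += mutual_information(d_pos, d_neg, b)
    return -total / m


codes = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [-1, -1, -1, 1]]
labels = [0, 0, 1, 1]
objective = minibatch_objective(codes, labels)
```

In this toy batch every query separates its neighbor from its two non-neighbors, so each per-query mutual information equals the entropy of the neighbor indicator with priors 1/3 and 2/3. A training implementation would replace the hard codes and histograms with their relaxed counterparts so that the objective is differentiable.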

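Now consider a minibatch of size $M$, $\mathcal{B} = \{x_1, \ldots, x_M\}$. For $i = 1, \ldots, M$, let $\hat{\Phi}_i$ be a shorthand for $\hat{\Phi}(x_i)$. We group the hash mapping output for the entire minibatch into the following matrix,

$$\hat{\mathbf{\Phi}} = \left[\hat{\Phi}_1\ \hat{\Phi}_2\ \cdots\ \hat{\Phi}_M\right] \in \mathbb{R}^{b \times M}.$$

Similar to Equation 13, we can write the Jacobian matrix of the minibatch objective $\mathcal{O}_{\mathcal{B}}$ with respect to $\hat{\mathbf{\Phi}}$ as

$$\frac{\partial \mathcal{O}_{\mathcal{B}}}{\partial \hat{\mathbf{\Phi}}} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{l=0}^{b} \left[ \frac{\partial \mathcal{I}_i}{\partial p^+_{i,l}} \frac{\partial p^+_{i,l}}{\partial \hat{\mathbf{\Phi}}} + \frac{\partial \mathcal{I}_i}{\partial p^-_{i,l}} \frac{\partial p^-_{i,l}}{\partial \hat{\mathbf{\Phi}}} \right], \tag{30}$$

where $\mathcal{I}_i$ is the mutual information for anchor $x_i$, and $p^+_{i,l}$ ($p^-_{i,l}$) is the $l$-th element of $p^+_D$ ($p^-_D$) when the anchor is $x_i$. Again, the partial derivative $\partial \mathcal{I}_i / \partial p^+_{i,l}$ is straightforward to compute, and the main issue is evaluating the Jacobian $\partial p^+_{i,l} / \partial \hat{\mathbf{\Phi}}$. We do so by examining each column of the Jacobian. First, for $j \neq i$,

$$\frac{\partial p^+_{i,l}}{\partial \hat{\Phi}_j} = \frac{\mathbb{1}\left[x_j \in \mathcal{N}^+_{x_i}\right]}{|\mathcal{N}^+_{x_i}|} \frac{\partial \delta_{x_j,l}}{\partial \hat{d}_\Phi(x_i, x_j)} \frac{\partial \hat{d}_\Phi(x_i, x_j)}{\partial \hat{\Phi}_j} = \delta^+_l(i,j)\, \hat{\Phi}_i,$$

with the substitutions

$$\frac{\partial \hat{d}_\Phi(x_i, x_j)}{\partial \hat{\Phi}_j} = -\frac{1}{2} \hat{\Phi}_i,$$

$$\delta^+_l(i,j) = -\frac{\mathbb{1}\left[x_j \in \mathcal{N}^+_{x_i}\right]}{2\,|\mathcal{N}^+_{x_i}|} \frac{\partial \delta_{x_j,l}}{\partial \hat{d}_\Phi(x_i, x_j)}. \tag{35}$$

Next, for $j = i$,

$$\frac{\partial p^+_{i,l}}{\partial \hat{\Phi}_i} = \frac{1}{|\mathcal{N}^+_{x_i}|} \sum_{x_k \in \mathcal{N}^+_{x_i}} \frac{\partial \delta_{x_k,l}}{\partial \hat{d}_\Phi(x_i, x_k)} \frac{\partial \hat{d}_\Phi(x_i, x_k)}{\partial \hat{\Phi}_i} = \sum_{k \neq i} \delta^+_l(i,k)\, \hat{\Phi}_k.$$

We can further unify the two cases by defining $\delta^+_l(i,i) = 0$:

$$\frac{\partial p^+_{i,l}}{\partial \hat{\Phi}_j} = \delta^+_l(i,j)\, \hat{\Phi}_i + \mathbb{1}[i = j] \sum_{k=1}^{M} \delta^+_l(i,k)\, \hat{\Phi}_k.$$

Having derived all the columns, we now write down the matrix form of $\partial p^+_{i,l} / \partial \hat{\mathbf{\Phi}}$. Let $\boldsymbol{\Delta}^+_l = \left[\delta^+_l(i,j)\right] \in \mathbb{R}^{M \times M}$, and let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^M$,

$$\frac{\partial p^+_{i,l}}{\partial \hat{\mathbf{\Phi}}} = \hat{\Phi}_i\, e_i^\top \boldsymbol{\Delta}^+_l + \hat{\mathbf{\Phi}} \left(\boldsymbol{\Delta}^+_l\right)^\top e_i e_i^\top.$$

We will next complete Equation 30. First, we define a shorthand, which can be easily evaluated using the result in Equation 16:

$$\alpha^+_{i,l} = \frac{\partial \mathcal{I}_i}{\partial p^+_{i,l}}.$$

Using symmetry, we only consider the first term involving $p^+_{i,l}$ in Equation 30, and we omit the scaling factor $-\frac{1}{M}$ for now:

$$\begin{aligned}
\sum_{i=1}^{M} \sum_{l=0}^{b} \alpha^+_{i,l} \frac{\partial p^+_{i,l}}{\partial \hat{\mathbf{\Phi}}} &= \sum_{l=0}^{b} \sum_{i=1}^{M} \alpha^+_{i,l} \left( \hat{\Phi}_i\, e_i^\top \boldsymbol{\Delta}^+_l + \hat{\mathbf{\Phi}} \left(\boldsymbol{\Delta}^+_l\right)^\top e_i e_i^\top \right) \\
&= \hat{\mathbf{\Phi}} \sum_{l=0}^{b} \sum_{i=1}^{M} \alpha^+_{i,l} \left( e_i e_i^\top \boldsymbol{\Delta}^+_l + \left(\boldsymbol{\Delta}^+_l\right)^\top e_i e_i^\top \right).
\end{aligned} \tag{45}$$

If we define

$$\mathbf{A}^+_l = \operatorname{diag}\left(\alpha^+_{1,l}, \ldots, \alpha^+_{M,l}\right), \tag{46}$$

$$\mathbf{P}^+ = \sum_{l=0}^{b} \left( \mathbf{A}^+_l \boldsymbol{\Delta}^+_l + \boldsymbol{\Delta}^+_l \mathbf{A}^+_l \right), \tag{47}$$

then we can simplify Equation 45 as

$$\hat{\mathbf{\Phi}} \sum_{l=0}^{b} \left( \mathbf{A}^+_l \boldsymbol{\Delta}^+_l + \left(\boldsymbol{\Delta}^+_l\right)^\top \mathbf{A}^+_l \right) = \hat{\mathbf{\Phi}}\, \mathbf{P}^+.$$

The last step is true since $\boldsymbol{\Delta}^+_l$ is symmetric: it can be seen from the definition of $\delta^+_l(i,j)$ in Equation 35 that $\delta^+_l(i,j) = \delta^+_l(j,i)$, since both the neighbor relationship and the Hamming distance are symmetric.

Now, if we define $\mathbf{A}^-_l$ and $\mathbf{P}^-$ for the non-neighbor distance distribution $p^-_D$, analogously as in Equations 46 and 47 (details are very similar and thus omitted), then the full Jacobian matrix in Equation 30 can be evaluated as

$$\frac{\partial \mathcal{O}_{\mathcal{B}}}{\partial \hat{\mathbf{\Phi}}} = -\frac{1}{M}\, \hat{\mathbf{\Phi}} \left( \mathbf{P}^+ + \mathbf{P}^- \right).$$

Since only matrix multiplications and additions are involved, this operation can be implemented efficiently.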

We note that a similar minibatch-based formulation was recently proposed by Triantafillou et al. [46], which is also inspired by information retrieval, and attempts to maximize the utilization of supervision by treating each example in the minibatch as a query. However, [46] specifically tackles the problem of few-shot learning, and its formulation approximately optimizes mean Average Precision using a structured prediction framework, which is very different from ours. Nevertheless, it would be interesting to explore the use of hashing and the mutual information objective for that problem in future work.

5 Experiments

5.1 Datasets and Evaluation Setup

We conduct experiments on widely used image retrieval benchmarks: CIFAR-10 [23], NUSWIDE [7], 22K LabelMe [39] and ImageNet100 [11]. Each dataset is split into a test set and retrieval set, and instances from the retrieval set are used in training. At test time, queries from the test set are used to rank instances from the retrieval set using Hamming distances, and the performance metric is averaged over the queries.

  • CIFAR-10 is a dataset for image classification and retrieval, containing 60K images from 10 different categories. We follow the setup of [27, 62, 28, 54]. This setup corresponds to two distinct partitions of the dataset. In the first case (cifar-1), we sample 500 images per category, resulting in 5,000 training examples to learn the hash mapping. The test set contains 100 images per category (1000 in total). The remaining images are then used to populate the hash table. In the second case (cifar-2), we sample 1000 images per category to construct the test set (10,000 in total). The remaining items are both used to learn the hash mapping and populate the hash table. Two images are considered neighbors if they belong to the same class.

  • NUSWIDE is a dataset containing 269K images from Flickr. Each image can be associated with multiple labels, corresponding to 81 ground truth concepts. For NUSWIDE experiments, following the setup in [27, 62, 28, 54], we only consider images annotated with the 21 most frequent labels. In total, this corresponds to 195,834 images. The experimental setup also has two distinct partitionings: nus-1 and nus-2. For both cases, a test set is constructed by randomly sampling 100 images per label (2,100 images in total). To learn the hash mapping, 500 images per label are randomly sampled in nus-1 (10,500 in total). The remaining images are then used to populate the hash table. In the second case, nus-2, all the images excluding the test set are used in learning and populating the hash table. Following standard practice, two images are considered as neighbors if they share at least one label.

  • 22K LabelMe consists of 22K images, each represented with a 512-dimensional GIST descriptor. Following [25, 2], we randomly partition the dataset into a retrieval and a test set, consisting of 20K and 2K instances, respectively. A 5K subset of the retrieval set is used in learning the hash mapping. As this dataset is unsupervised, we use the Euclidean distance between GIST features in determining the neighborhood structure. Two examples whose Euclidean distance falls below a given distance percentile are considered neighbors.

  • ImageNet100 is a subset of ImageNet [11] containing 130K images from 100 classes. We follow the setup in [3], and randomly sample 100 images per class for training. All images in the selected classes from the ILSVRC 2012 validation set are used as the test set. Two images are considered neighbors if they belong to the same class.

As the performance metric, we use the standard mean Average Precision (mAP), or its variants. We compare the proposed method against both classical and recent state-of-the-art hashing methods. These methods include: Spectral Hashing (SH) [55], Iterative Quantization (ITQ) [17], Sequential Projection Learning for Hashing (SPLH) [50], Supervised Hashing with Kernels (SHK) [36], Fast Supervised Hashing with Decision Trees (FastHash) [31], Structured Hashing (StructHash) [30], Supervised Discrete Hashing (SDH) [42], Efficient Training of Very Deep Neural Networks (VDSH) [60], Deep Supervised Hashing with Pairwise Labels (DPSH) [28], Deep Supervised Hashing with Triplet Labels (DTSH) [54], and Hashing by Continuation (HashNet) [3]. These competing methods have been shown to outperform earlier and other works such as [51, 25, 38, 56, 27, 61].
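
For readers unfamiliar with the metric, the following sketch computes Average Precision for one query ranked by Hamming distance, and its mean over queries. The function names, the tie-breaking rule (ties keep database order), and the normalization of the truncated variant by the number of relevant items found in the top `top_k` results are assumptions of this example; evaluation protocols in the literature differ on these details.

```python
def average_precision(distances, relevant, top_k=None):
    """AP for one query: database items are ranked by increasing distance."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])  # stable sort
    if top_k is not None:
        order = order[:top_k]
    hits = 0
    precision_sum = 0.0
    for rank, j in enumerate(order, start=1):
        if relevant[j]:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_distances, all_relevant, top_k=None):
    """Mean of per-query AP values."""
    aps = [average_precision(d, r, top_k) for d, r in zip(all_distances, all_relevant)]
    return sum(aps) / len(aps)


ap_perfect = average_precision([0, 2, 1, 3], [True, False, True, False])
ap_mixed = average_precision([1, 0, 2], [True, False, True])
m_ap = mean_average_precision([[0, 2, 1, 3], [1, 0, 2]],
                              [[True, False, True, False], [True, False, True]])
```

In the first query both relevant items are ranked ahead of all irrelevant ones, which corresponds to the case of no neighborhood ambiguity; in the second, one non-neighbor is ranked first and the AP drops accordingly.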

We finetune deep Convolutional Neural Network models pretrained on the ImageNet dataset, replacing the final softmax classification layer with a new, randomly initialized fully-connected layer that produces the binary bits. For CIFAR-10 and NUSWIDE experiments, we finetune a VGG-F [5] architecture, as in [28, 54]. For ImageNet100 experiments, following the protocol of HashNet [3], we finetune the AlexNet [24] architecture, and scale down the learning rate for pretrained layers by a factor of 0.01, since the model is finetuned on the same dataset for a different task. For non-deep methods, we use the 4096-dimensional output of the penultimate layer of both architectures as input features. For the 22K LabelMe benchmark, all methods learn shallow models on top of precomputed 512-dimensional GIST descriptors; for gradient-based hashing methods, this corresponds to using a single fully-connected layer. We use SGD with momentum and weight decay, and periodically reduce the learning rate by a predetermined factor, which is standard practice. During SGD training, minibatches are randomly sampled from the training set.

5.2 Results

Table LABEL:table:set-1 gives results for the cifar-1 and nus-1 experimental settings, in which mAP and mAP evaluated on the top 5,000 retrievals are reported for the CIFAR-10 and NUSWIDE datasets, respectively. Deep learning based hashing methods such as DPSH and DTSH outperform most non-deep hashing solutions. This is not surprising, as the hash mapping is learned simultaneously with the features. Non-deep solutions such as FastHash and SDH also perform competitively, especially in NUSWIDE experiments. Our proposed method, MIHash, surpasses all competing methods in the majority of the experiments. For example, with 24- and 32-bit binary embeddings, MIHash surpasses the nearest competitor, DTSH, on CIFAR-10. For NUSWIDE, MIHash demonstrates comparable results, improving over the state of the art with 48-bit codes.

The performance improvement of MIHash is much more significant in the cifar-2 and nus-2 settings. In these settings, a VGG-F architecture pretrained on ImageNet is again finetuned, but with more data. Following standard practice, mAP variants are used to evaluate the retrieval performance. Table LABEL:table:set-2 gives the results. As observed, in both CIFAR-10 and NUSWIDE, MIHash achieves state-of-the-art results at nearly all code lengths. For instance, MIHash outperforms DTSH, the closest competitor, at all but one binary embedding size.

Figure 2: (Left) We plot the training objective value (Equation 12) and the mAP score from the 32-bit cifar-1 experiment. Notice that both the mutual information objective and the mAP value show similar behavior, i.e., exhibit strong correlation. (Middle) We apply min-max normalization in order to scale both measures to the same range. (Right) We conduct an additional set of experiments in which 100 instances are selected as the query set, and the rest are used to populate the hash table. The hash mapping parameters are randomly sampled from a Gaussian, similar to LSH [16]. Each experiment is conducted 50 times. There exists a strong correlation between MI and mAP, as validated by the Pearson Correlation Coefficient score of 0.98.

Retrieval results for ImageNet100 are given in Table LABEL:table:imagenet. In these experiments, we compare against DTSH, the overall best competing method in past experiments, and another recently introduced deep learning based hashing method, HashNet [3]. The evaluation metric follows the setup in [3] for consistency. In this benchmark, MIHash significantly outperforms both DTSH and HashNet for all embedding sizes. Notably, MIHash improves over HashNet with 16-bit codes, indicating its strength in learning high-quality compact binary codes.

To further emphasize the merits of MIHash, we consider shallow model experiments on the 22K LabelMe dataset. In this benchmark, we only consider the overall best non-deep and deep learning methods in past experiments. Also, to focus solely on comparing the hash mapping learning objectives, all deep learning methods use a one-layer embedding on top of the GIST descriptor. This nullifies the feature learning aspect of these deep learning methods. Still, some non-deep methods can learn non-linear hash mappings, such as FastHash and StructHash, which use boosted decision trees. Table LABEL:table:labelme gives the results, and we can see that the non-deep methods FastHash and StructHash outperform the deep learning methods DPSH and DTSH on this benchmark. This indicates that the strength of DPSH and DTSH might come primarily from feature learning. On the other hand, MIHash remains the best performing method across all code lengths, despite using a simpler one-layer embedding function compared to FastHash and StructHash. This further validates the effectiveness of our mutual information based objective in capturing the neighborhood structure.

5.3 Empirical Analysis

Mutual Information and Ranking Metrics

Figure 3: We plot the neighbor and non-neighbor distance distributions, averaged on the CIFAR-10 test set, before and after learning MIHash with a single-layer model and 20K training examples. Optimizing the mutual information objective substantially reduces the overlap between them, resulting in high mAP.

To evaluate the performance of hashing algorithms for retrieval tasks, it is common to use ranking-based metrics, the most notable example being mean Average Precision (mAP). We note that there exist strong correlations between our mutual information criterion and mAP. Figure 2 provides an empirical analysis on the CIFAR-10 benchmark. The left plot displays the training objective value, as computed from Equation 12, and the mAP score with respect to the epoch. These results are obtained from the cifar-1 experiment with 32-bit codes, as specified in the previous section. Notice that both the mutual information objective and the mAP value show similar behavior, i.e., exhibit strong correlation. While the mAP score increases from 0.40 to 0.80, the mutual information objective increases from 0.15 to above 0.40. In the middle figure, we apply min-max normalization in order to scale both measures to the same range.

To further emphasize the correlation between mutual information and mAP, we also conducted an additional experiment in which 100 instances are selected as the query set, and the rest are used to populate the hash table. The hash mapping parameters are randomly sampled from a Gaussian distribution, similar to LSH [16], and each experiment is conducted 50 times. The right figure provides the scatter plot of mAP and the mutual information objective value. We can see that the relationship is almost linear, which is also validated by the Pearson Correlation Coefficient score of 0.98.
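The two statistics used in this analysis, min-max normalization and the Pearson correlation coefficient, can be sketched as follows. The helper names are hypothetical, and the numbers in the usage lines are made-up placeholders chosen to be exactly linear; they are not measurements from the experiments.

```python
import math

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values equal
    return [(v - lo) / (hi - lo) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Placeholder values only (exactly linear, so the correlation is 1 up to
# floating-point rounding).
mi_values = [0.1, 0.2, 0.3, 0.4]
map_values = [0.3, 0.5, 0.7, 0.9]
r = pearson(mi_values, map_values)
```

Because min-max normalization is a positive linear rescaling, it changes how the two curves look when plotted together but leaves the Pearson coefficient unchanged.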

We give an intuitive explanation for the strong correlation. Given a query, mAP is optimized when all of its neighbors are ranked above all of its non-neighbors in the database. On the other hand, mutual information is optimized when the distribution of neighbor distances has no overlap with the distribution of non-neighbor distances. Therefore, mAP and mutual information are simultaneously optimized by the same optimal solution. Conversely, mAP is suboptimal when neighbors and non-neighbors are interleaved in the ranking, and so is mutual information when the distance distributions have nonzero overlap. Although a theoretical analysis of the correlation is beyond the scope of this work, empirically we find that mutual information serves as a general-purpose surrogate metric for ranking.
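The relationship between overlap and mutual information can be checked on a toy example. The sketch below estimates, for a single query, the mutual information between the Hamming distance D and the binary neighbor indicator C from empirical counts, using I(D; C) = H(C) - H(C | D). This is a plain discrete estimate written for illustration, not the differentiable relaxation optimized during training; the symbols D and C and the function names are assumptions of this sketch.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(distances, is_neighbor):
    """Empirical I(D; C) = H(C) - H(C | D) for one query, in bits."""
    n = len(distances)
    n_pos = sum(1 for c in is_neighbor if c)
    h_c = entropy([n_pos / n, (n - n_pos) / n])

    total = Counter(distances)  # number of items at each distance value
    pos = Counter(d for d, c in zip(distances, is_neighbor) if c)

    h_c_given_d = 0.0
    for d, count in total.items():
        p_pos = pos[d] / count  # fraction of neighbors at this distance
        h_c_given_d += (count / n) * entropy([p_pos, 1.0 - p_pos])
    return h_c - h_c_given_d

# No overlap: neighbors at small distances, non-neighbors at large ones.
separated = mutual_information([0, 1, 1, 5, 6, 6],
                               [True, True, True, False, False, False])
# Full overlap: the distance carries no information about the label.
overlapping = mutual_information([2, 2, 2, 2], [True, False, True, False])
```

In the separated case the estimate reaches its maximum, H(C), which is 1 bit for balanced neighbors and non-neighbors; in the fully overlapping case it is zero.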

An advantage of mutual information, as we have demonstrated, is that it is suitable for direct, gradient-based optimization. In contrast, optimizing mAP is much more challenging as it is non-differentiable, and previous works usually resort to approximation and bounding techniques [30, 53, 57].

Distribution Separating Effect

To demonstrate that MIHash indeed separates neighbor and non-neighbor distance distributions, we consider a simple experiment. Specifically, we learn a single-layer model on top of precomputed penultimate-layer features. The learning is done in an online fashion, which means that each training example is processed only once. We train such a MIHash model on the CIFAR-10 dataset with 20K training examples.

In Figure 3, we plot the neighbor and non-neighbor distance distributions, averaged on the CIFAR-10 test set, before and after learning MIHash with 20K training examples. The hash mapping parameters are initialized using LSH, and lead to high overlap between the distributions, although they do not totally overlap due to the use of strong pretrained features. After learning, the overlap is significantly reduced, with the neighbor distance distribution pushed towards zero Hamming distance. Consequently, the mAP value increases.
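The LSH-style initialization mentioned above can be sketched as a set of random Gaussian hyperplanes whose sign pattern gives the binary code. This is an illustrative stand-in only: the function names, the zero-mean unit-variance Gaussian, the thresholding at zero, and the fixed seed are assumptions of this sketch.

```python
import random

def random_hash_params(num_bits, dim, seed=0):
    """Sample one Gaussian projection vector per bit (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(weights, x):
    """Binary code: one bit per hyperplane, set to 1 if the projection is >= 0."""
    return [1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else 0
            for w in weights]

def hamming(code_a, code_b):
    """Number of bit positions in which two codes differ."""
    return sum(u != v for u, v in zip(code_a, code_b))

weights = random_hash_params(num_bits=8, dim=4)
query_code = hash_code(weights, [0.5, -1.0, 2.0, 0.1])
item_code = hash_code(weights, [0.4, -0.9, 2.1, 0.0])
distance = hamming(query_code, item_code)  # between 0 and 8
```

Collecting such distances separately for neighbors and non-neighbors of each query yields the two histograms whose overlap is compared before and after learning.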



Figure 4: t-SNE [48] visualization of the 48-bit binary codes produced by MIHash and HashNet on ImageNet100, for a random subset of 10 different color-coded classes in the test set. MIHash yields well-separated codes with distinct structures, as opposed to HashNet, in which the binary codes have a higher overlap.
Figure 5: We show sample retrieval results from the ImageNet100 dataset. Left: query images, right: top 10 retrieved images from MIHash (top row) and from HashNet (bottom row). Retrieved images marked with a green border belong to the same class as the query image, while ones marked with a red border do not.

t-SNE Visualization of the Binary Embeddings

We also visualize the learned embeddings using t-SNE [48]. In Figure 4, we plot the visualization for 48-bit binary embeddings produced by MIHash and the top competing method, HashNet, on ImageNet100. For ease of visualization, we randomly sample 10 classes from the test set.

MIHash produces binary embeddings that separate different classes well into distinct clusters. This is in fact predictable from the formulation of MIHash, in which the class overlap is quantified via mutual information and minimized. On the other hand, binary codes generated by HashNet have higher overlap between classes. This is also consistent with the fact that HashNet does not specifically optimize for a criterion related to class overlap, but belongs to the simpler “affinity matching” family of approaches.

Example Retrieval Results

In Figure 5, we present example retrieval results for MIHash and HashNet for several image queries from the ImageNet100 dataset. The top 10 retrievals of eight query images from eight distinct categories are presented. Correct retrievals (i.e., having the same class label as the query) are marked in green, and incorrect retrievals are in red. In these examples, many of the retrieved images appear visually similar to the query, even if they do not share the same class label. Nevertheless, MIHash retrieves fewer incorrect images compared to HashNet. For example, HashNet returns bag images for the first query (image of cups), and digital-clock images for the second-to-last query (image of doormat).

6 Conclusion

We take an information-theoretic approach to hashing and propose a novel hashing method, called MIHash, in this work. It is based on minimizing neighborhood ambiguity in the learned Hamming space, which is crucial in maintaining high performance in nearest neighbor retrieval. We adopt the well-studied mutual information from information theory to quantify neighborhood ambiguity, and show that it has strong correlations with standard ranking-based retrieval performance metrics. Then, to optimize mutual information, we take advantage of recent advances in deep learning and stochastic optimization, and parameterize our embedding functions with deep neural networks. We perform a continuous relaxation on the NP-hard optimization problem, and use stochastic gradient descent to optimize the networks. In particular, our formulation maximally utilizes available supervision within each minibatch, and can be efficiently implemented. Our implementation is publicly available.

When evaluated on four standard image retrieval benchmarks, MIHash is shown to learn high-quality compact binary codes, and it achieves superior nearest neighbor retrieval performance compared to existing supervised hashing techniques. We believe that the mutual information based formulation is also potentially relevant for learning real-valued embeddings, and for other applications besides image retrieval, such as few-shot learning.


This research was supported in part by a BU IGNITION award, US NSF grant 1029430, and gifts from NVIDIA.


Fatih Cakir is a Data Scientist at FirstFuel Software. He was previously a member of the Image and Video Computing Group at Boston University, working with Professor Stan Sclaroff as his Ph.D. advisor. His research interests are in the fields of Computer Vision and Machine Learning.


Kun He is a Ph.D. candidate in Computer Science and a member of the Image and Video Computing group at Boston University, advised by Professor Stan Sclaroff. He obtained his M.Sc. degree in Computer Science from Boston University in 2013, and his B.Sc. degree (with honors) in Computer Science and Technology from Zhejiang University, Hangzhou, China, in 2010. He is a student member of the IEEE.


Sarah Adel Bargal is a Ph.D. candidate in the Image and Video Computing group in the Boston University Department of Computer Science. She received her M.Sc. from the American University in Cairo. Her research interests are in the areas of computer vision and deep learning. She is an IBM PhD Fellow and a Hariri Graduate Fellow.


Stan Sclaroff is a Professor in the Boston University Department of Computer Science. He received his Ph.D. from MIT in 1995. His research interests are in computer vision, pattern recognition, and machine learning. He is a Fellow of the IAPR and Fellow of the IEEE.


  1. Our MATLAB implementation of MIHash is publicly available at https://github.com/fcakir/mihash


  1. A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. ACM Symposium on Theory of Computing (STOC), 2015.
  2. F. Cakir and S. Sclaroff. Adaptive hashing for fast similarity search. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2015.
  3. Z. Cao, M. Long, J. Wang, and P. S. Yu. HashNet: Deep learning to hash by continuation. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2017.
  4. M. A. Carreira-Perpinan and R. Raziperchikolaei. Hashing with binary autoencoders. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2015.
  5. K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
  6. J. Cheng, C. Leng, J. Wu, H. Cui, and H. Lu. Fast and accurate image matching with cascade hashing for 3d reconstruction. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  7. T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. ACM CIVR, 2009.
  8. B. Dai, R. Guo, S. Kumar, N. He, and L. Song. Stochastic generative hashing. In Proc. International Conf. on Machine Learning (ICML), 2017.
  9. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. Symposium on Computational Geometry (SCG), 2004.
  10. T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
  11. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
  12. K. Ding, C. Huo, B. Fan, and C. Pan. Knn hashing with factorized neighborhood representation. In Proc. IEEE International Conf. on Computer Vision (ICCV), December 2015.
  13. V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2475–2483, 2015.
  14. L. Gao, J. Song, F. Zou, D. Zhang, and J. Shao. Scalable multimedia retrieval by deep learning hashing with relative similarity learning. In Proceedings of the 23rd ACM International Conference on Multimedia, 2015.
  15. T. Ge, K. He, and J. Sun. Graph cuts for supervised binary coding. In Proc. European Conf. on Computer Vision (ECCV), 2014.
  16. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. International Conf. on Very Large Data Bases (VLDB), 1999.
  17. Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.
  18. Y. Gong, M. Pawlowski, F. Yang, L. Brandy, L. Boundev, and R. Fergus. Web scale photo hash clustering on a single machine. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  19. X. Jin and J. Han. Locality sensitive hashing based clustering. In Encyclopedia of Machine Learning. Springer US, 2010.
  20. T. H. Haveliwala. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, 2000.
  21. K. He, F. Wen, and J. Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
  22. H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. In IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2011.
  23. A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. In University of Toronto Technical Report, 2009.
  24. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2012.
  25. B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. Advances in Neural Information Processing Systems (NIPS), 2009.
  26. B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2009.
  27. H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  28. W. J. Li, S. Wang, and W. C. Kang. Feature learning based deep supervised hashing with pairwise labels. In Proc. International Joint Conf. on Artificial Intelligence (IJCAI), 2016.
  29. X. Li, C. Shen, A. Dick, and A. van den Hengel. Learning compact binary codes for visual tracking. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
  30. G. Lin, F. Liu, C. Shen, J. Wu, and H. T. Shen. Structured learning of binary codes with column generation for optimizing ranking measures. International Journal of Computer Vision (IJCV), 2016.
  31. G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  32. G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general two-step approach to learning-based hashing. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2013.
  33. K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2015.
  34. M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
  35. H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2016.
  36. W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
  37. X. Liu, D. Tao, M. Song, Y. Ruan, C. Chen, and J. Bu. Weakly supervised multiclass video segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  38. M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proc. International Conf. on Machine Learning (ICML), 2011.
  39. B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: a database and web-based tool for image annotation. In International Journal of Computer Vision (IJCV), 2008.
  40. R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
  41. G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter sensitive hashing. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2003.
  42. F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  43. J. Song, L. Gao, Y. Yan, D. Zhang, and N. Sebe. Supervised hashing with pseudo labels for scalable multimedia retrieval. In Proceedings of the 23rd ACM International Conference on Multimedia, 2015.
  44. J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 2013.
  45. C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. In IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2012.
  46. E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 2252–2262, 2017.
  47. E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 4170–4178, 2016.
  48. L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 2008.
  49. A. L. C. Wang. An industrial-strength audio search algorithm. In Proceedings of the 4th International Conference on Music Information Retrieval, 2003.
  50. J. Wang, S. Kumar, and S. F. Chang. Sequential projection learning for hashing with compact codes. In Proc. International Conf. on Machine Learning (ICML), 2010.
  51. J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for large-scale search. In IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2012.
  52. Q. Wang, B. Shen, S. Wang, L. Li, and L. Si. Binary codes embedding for fast image tagging with incomplete labels. In Proc. European Conf. on Computer Vision (ECCV), 2014.
  53. Q. Wang, Z. Zhang, and L. Si. Ranking preserving hashing for fast similarity search. In Proc. International Joint Conf. on Artificial Intelligence (IJCAI), 2015.
  54. X. Wang, Y. Shi, and K. M. Kitani. Deep supervised hashing with triplet labels. In Proc. Asian Conf. on Computer Vision (ACCV), 2016.
  55. Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proc. Advances in Neural Information Processing Systems (NIPS), 2008.
  56. R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In Proc. AAAI Conf. on Artificial Intelligence (AAAI), 2014.
  57. Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In Proc. ACM Conf. on Research & Development in Information Retrieval (SIGIR), 2007.
  58. D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proc. ACM Conf. on Research & Development in Information Retrieval (SIGIR), 2010.
  59. R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. on Image Processing, 2015.
  60. Z. Zhang, Y. Chen, and V. Saligrama. Efficient training of very deep neural networks for supervised hashing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
  61. F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  62. B. Zhuang, G. Lin, C. Shen, and I. Reid. Fast training of triplet-based deep binary embedding networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.