Hashing with Mutual Information
Abstract
Binary vector embeddings enable fast nearest neighbor retrieval in large databases of high-dimensional objects, and play an important role in many practical applications, such as image and video retrieval. We study the problem of learning binary vector embeddings under a supervised setting, also known as hashing. We propose a novel supervised hashing method based on optimizing an information-theoretic quantity, mutual information. We show that optimizing mutual information can reduce ambiguity in the induced neighborhood structure in the learned Hamming space, which is essential for obtaining high retrieval performance. To this end, we optimize mutual information in deep neural networks with minibatch stochastic gradient descent, with a formulation that maximally and efficiently utilizes available supervision. Experiments on four image retrieval benchmarks, including ImageNet, confirm the effectiveness of our method in learning high-quality binary embeddings for nearest neighbor retrieval.
Hashing, Deep learning, Nearest neighbor retrieval, Mutual information.
1 Introduction
In computer vision and many other related application areas, there is typically an abundance of data with high-dimensional raw representations, such as megapixel images and high-definition videos. Besides obvious storage challenges, high-dimensional data pose additional challenges for semantic-level processing and understanding. One prominent example is nearest neighbor retrieval. In applications such as image and video search, person and object recognition in photo collections, and action detection and classification in surveillance video, it is often necessary to map high-dimensional data objects to low-dimensional vector representations to allow for efficient retrieval of similar instances in large databases. In addition, the desired semantic similarity can vary from task to task, and is often prescribed by available supervision, e.g. class labels. Therefore, the mapping process is also responsible for leveraging supervised learning to encode task-specific similarity metrics, such that objects that are semantically similar are mapped to close neighbors in the resulting vector space.
In this paper, we consider the problem of learning low-dimensional binary vector embeddings of high-dimensional data, also known as hashing. Binary embeddings enjoy a low memory footprint and permit fast search mechanisms, as Hamming distance computation between binary vectors can be implemented very efficiently on modern CPUs. As a result, across a variety of domains, hashing approaches have been widely utilized in applications requiring fast (approximate) nearest neighbor retrieval. Examples include: image annotation [52], visual tracking [29], 3D reconstruction [6], video segmentation [37], object detection [10], audio search [49], multimedia retrieval [44, 43, 14], and large-scale clustering [20, 18, 19]. Our goal is to learn hashing functions that result in optimal nearest neighbor retrieval performance. In particular, as motivated above, we approach hashing as a supervised learning problem, such that the learned binary embeddings encode task-specific semantic similarity.
Supervised hashing is a well-studied problem. Although many different formulations exist, all supervised hashing formulations essentially constrain the learned Hamming distance to agree with the given supervision. Such supervision can be specified as pairwise affinity labels: pairs of objects are annotated with binary labels indicating their pairwise relationships as either “similar” or “dissimilar.” In this case, a common learning strategy is affinity matching: the learned binary embedding should yield low Hamming distance between similar pairs, and high Hamming distance between dissimilar pairs. Alternatively, supervision can also be given in terms of local relative distance comparisons, most notably three-tuples of examples, or “triplets”, where one example is constrained to have lower distance to the second example than to the third. This can be termed local ranking. Typically, these formulations attempt to improve the overall nearest neighbor retrieval performance through refining and optimizing objective functions that closely match the form of supervision, i.e., defined on pairs or triplets. To this end, it is often necessary to introduce parameters such as margins, thresholds, scaling factors and other regularization parameters.
We argue that approaches such as affinity matching and local ranking are insufficient to achieve optimal nearest neighbor retrieval performance. Instead, we view supervised hashing through an information-theoretic lens, and propose a novel solution tailored for the task of nearest neighbor retrieval. Our key observation is that a good binary embedding should separate neighbors and non-neighbors well in the Hamming space, or, in other words, achieve low neighborhood ambiguity. An alternative viewpoint is that the learned Hamming embedding should carry a high amount of information regarding the desired neighborhood structure. To quantify neighborhood ambiguity, we use a well-known quantity from information theory, mutual information, and show that it has direct and strong correlations with standard ranking-based retrieval performance metrics. An appealing property of the mutual information objective is that it is free of tuning parameters, unlike others that may require thresholds, margins, and so on. Finally, to optimize mutual information, we relax the original NP-hard discrete optimization problem, and develop a gradient-based optimization framework that can be efficiently applied with minibatch stochastic gradient descent in deep neural networks.
To briefly summarize our contributions, we propose a novel supervised hashing method based on quantifying and minimizing neighborhood ambiguity in the learned Hamming space, using mutual information as the learning objective. An end-to-end gradient-based optimization framework is developed, with an efficient minibatch-based implementation.
Our proposed hashing method is named MIHash.
The rest of this paper is organized as follows. First, the relevant literature is reviewed in Section 2. We propose and analyze mutual information as a learning objective for hashing in Section 3, and then discuss its optimization using stochastic gradient descent and deep neural networks in Section 4. Section 5 presents experimental results and empirical analysis of the proposed algorithm’s behavior. Finally, Section 6 presents the conclusions.
2 Related Work
Many hashing methods have been introduced over the years. While a precise taxonomy is difficult, a rough grouping can be made into data-independent and data-dependent techniques. Data-independent techniques do not exploit the data distribution during hashing. Instead, similarity, as induced by a particular metric, is often preserved. This is achieved by maximizing the probability of “collision” when hashing similar items. Notable earlier examples include Locality Sensitive Hashing methods [9, 16, 26], where distance functions such as the Euclidean, Jaccard, and Cosine distances are approximated. These methods usually have theoretical guarantees on the approximation quality and conform with sublinear retrieval mechanisms. However, they are confined to certain metrics, as they ignore the data distribution and accompanying supervision.
Contrary to data-independent techniques, recent approaches are data-dependent, such that hash mappings are learned from the training set. While empirical evidence for the superiority of these methods over their data-independent counterparts is plentiful in the literature, a recent study has also theoretically validated their performance advantage [1]. These methods can be considered as binary embeddings that map the data into Hamming space while preserving a specific neighborhood structure. Such a neighborhood is derived from the metadata (e.g., labels) or is completely determined by the user (e.g., via similarity-dissimilarity indicators of data pairs). With the binary embeddings, distances can be computed efficiently, allowing even a linear search to be done very efficiently for large-scale corpora. These data-dependent methods can be grouped as follows: similarity preserving techniques [51, 25, 36, 31, 55, 45, 58, 15, 42], quantization methods [22, 17, 21] and, recently, deep-learning based methods [56, 33, 13, 4, 27, 61, 62, 35, 60]. We now review a few of the prominent techniques in each category.
Quantization methods are the first group of data-dependent hashing methods, which are unsupervised in the sense that no supervision is assumed to be provided with the data. Such techniques generally minimize an objective involving a reconstruction error. Notable examples are quantization/PCA based techniques. Among these, Semi-Supervised Hashing [51] learns the hash mapping by maximizing the empirical accuracy on labeled data and also the entropy of the generated hash functions on any unlabeled data. This is shown to be very similar to performing PCA, where the hash functions are the eigenvectors of a covariance matrix that is biased by the supervised information. Other noteworthy work includes PCA-inspired methods where the principal components are taken as the hash functions. If “groups” that are suitable for clustering exist within the data, then further refining the principal components for better binarization has been shown to be beneficial, as in Iterative Quantization [17] and K-means Hashing [21].
Unsupervised quantization can also be approached as a special case of generative modeling. Semantic Hashing [40] is one early example of such algorithms based on the autoencoding principle, which learns a generative model to encode data. The generative model is in the form of stacked Restricted Boltzmann Machines (RBMs), which are defined on binary variables by nature. Carreira-Perpiñán and Raziperchikolaei [4] propose Binary Autoencoders, and construct autoencoders with a binary latent layer. They argue that finding the hash mapping without relaxing the binary constraints will yield better solutions, while in relaxation approaches that are more common in the literature, significant quantization errors can degrade the quality of learned hash functions. More recently, Stochastic Generative Hashing [8] learns a generative hashing model based on the minimum description length principle, and uses stochastic distributional gradient descent to optimize the associated variational objective and handle the difficulty of having binary stochastic neurons.
Similarity preserving methods, on the other hand, aim to construct binary embeddings that optimize loss functions induced from the supervision provided. Both the affinity matching and local ranking methods mentioned in Section 1 belong to this group. Among such techniques, Minimal Loss Hashing [38] considers minimizing a hinge-type loss function motivated by structural SVMs. In Binary Reconstructive Embeddings [25], a kernel-based solution is proposed where the goal is to construct hash functions by minimizing an empirical loss between the input and Hamming space distances via a coordinate descent algorithm. Supervised Hashing with Kernels [36] also proposes a kernel-based solution; but, instead of preserving the equivalence of the input and Hamming space distances, the kernel function weights are learned by minimizing an objective function based on binary code inner products. Spectral Hashing [55] and Self-Taught Hashing [58] are other notable lines of work where the similarity of the instances is preserved by solving a graph Laplacian problem. Rank alignment methods [41, 12] that learn a hash mapping to preserve rankings in the data can also be considered in this group.
Lately, several “two-stage” similarity preserving techniques have also been proposed, where the learning is decomposed into two steps: binary inference and hash function learning. The binary inference step yields hash codes that best preserve the similarity. These hash codes are then used as target vectors in the hash function learning step, for example, by learning binary classifiers to produce each bit. Notable two-stage methods include Fast Hashing with Decision Trees [32, 31], Structured Learning of Binary Codes [30] and Supervised Discrete Hashing [42]. All of these similarity preserving methods assume some type of supervision, such as labels or similarity indicators. Thus, in the literature, such techniques are also regarded as supervised hashing solutions.
Deep hashing methods have recently gained significant prominence following the success of deep neural networks in related tasks such as image classification. Although hashing methods using deep learning can be based on either unsupervised quantization or supervised learning, most existing deep hashing methods are supervised. A deep hashing study typically involves a novel architecture, a loss function or a binary inference formulation. Among such methods, Lai et al. [27] propose jointly learning the hash mapping and image features with a triplet loss formulation. This triplet loss ensures that an image is more similar to the second image than to a third one with respect to their binary codes. In [34], a network-in-network (NIN) type deep net is used as the architecture with a divide-conquer module. The divide-conquer module is shown to reduce redundancy in the hash bits. In [60], the authors follow the work of [42] and [4]. Similar to [42], they propose learning the hash mapping by optimizing a classification objective. In contrast, they use a deep net consisting only of fully connected layers and propose using auxiliary variables, as in [4], to train efficiently and circumvent the vanishing gradient problem. Deep learning based hashing studies have also proposed sampling pairs or triplets of data instances to learn the hash mapping. Notable examples include [28] and [54], which optimize a likelihood function defined so as to ensure that similar (dissimilar) pairs or triplets are mapped to nearby (distant) binary embeddings.
As the ultimate goal of hashing is to preserve a neighborhood structure in the Hamming space, we propose an information-theoretic solution and directly quantify the neighborhood ambiguity of the generated binary embeddings using a mutual information based criterion. The proposed mutual information based measure is efficient to compute, amenable to batch learning, parameter-free, and achieves state-of-the-art performance in standard retrieval benchmarks. We build on a recent study [47] when optimizing our mutual information based objective, and use its differentiable histogram binning technique as a foundation in deriving gradient-based optimization rules. Note that both our problem setup and objective function are quite different from [47].
3 Hashing with Mutual Information
3.1 Preliminaries
Let $\mathcal{X}$ be the feature space and $\mathcal{H}^b$ be the $b$-dimensional Hamming space, i.e., $\mathcal{H}^b = \{-1, +1\}^b$. The goal of hashing is to learn an embedding function $\Phi: \mathcal{X} \to \mathcal{H}^b$, which induces a Hamming distance $d_\Phi: \mathcal{X} \times \mathcal{X} \to \{0, 1, \ldots, b\}$ that is equal to the number of bit differences between embedded vectors.
We consider a supervised learning setup. For some example $\hat{x} \in \mathcal{X}$, we assume that we have access to a set $\mathcal{N}^+_{\hat{x}}$ containing examples that are labeled as similar to $\hat{x}$ (neighbors), and a set $\mathcal{N}^-_{\hat{x}}$ of dissimilar examples (non-neighbors). We call $\hat{x}$ an anchor example, and refer to $(\mathcal{N}^+_{\hat{x}}, \mathcal{N}^-_{\hat{x}})$ as its neighborhood structure. Then, we can cast the problem of learning the Hamming embedding as one of preserving the neighborhood structure: the neighbors of $\hat{x}$ should be mapped to the close vicinity of $\hat{x}$ in the Hamming space, while the non-neighbors should be mapped farther away. In other words, we would ideally like to satisfy the following constraint:
(2)  $d_\Phi(\hat{x}, x_p) < d_\Phi(\hat{x}, x_n), \quad \forall x_p \in \mathcal{N}^+_{\hat{x}},\ \forall x_n \in \mathcal{N}^-_{\hat{x}}$
If the learned $\Phi$ successfully satisfies this constraint, then the neighborhood structure of $\hat{x}$ can be exactly recovered by thresholding the Hamming distance $d_\Phi(\hat{x}, \cdot)$. Generally, the neighborhood structure can be constructed by repeatedly querying a pairwise similarity oracle. In practice, it can often be derived from agreement of class labels, or from thresholding a distance metric (e.g. Euclidean distance) in the original feature space $\mathcal{X}$.
In this work, we parameterize the functional embeddings by deep neural networks, as they have recently been shown to have superior learning capabilities when coupled with appropriate hardware acceleration. Also, in order to take advantage of end-to-end training by backpropagation, we use gradient-based optimization, and adopt an equivalent formulation of the Hamming distance that is amenable to differentiation:
(3)  $d_\Phi(\hat{x}, x) = \frac{1}{2}\left(b - \Phi(\hat{x})^\top \Phi(x)\right)$
(4)  $\Phi(x) = (\phi_1(x), \ldots, \phi_b(x))$
(5)  $\phi_i(x) = \mathrm{sgn}(f_i(x; \theta)), \quad i = 1, \ldots, b$
where $f_i(x; \theta)$ are the activations produced by a feedforward neural network, with learnable parameters $\theta$.
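As a quick sanity check on the inner-product form of the Hamming distance in Equation 3, the following Python sketch compares it against a direct count of differing bits. The function names and the example codes are illustrative assumptions, and codes are assumed to take values in {-1, +1}.

```python
def hamming_bits(u, v):
    """Count the positions where two {-1, +1} codes differ."""
    return sum(1 for a, c in zip(u, v) if a != c)


def hamming_inner(u, v):
    """Hamming distance via the inner product: (b - <u, v>) / 2."""
    b = len(u)
    dot = sum(a * c for a, c in zip(u, v))
    # b - dot is always even for {-1, +1} codes, so integer division is exact.
    return (b - dot) // 2


# Hypothetical 6-bit codes that differ in exactly two positions.
u = [+1, -1, -1, +1, +1, -1]
v = [+1, +1, -1, -1, +1, -1]
print(hamming_bits(u, v), hamming_inner(u, v))
```

Each agreeing bit contributes +1 to the inner product and each disagreeing bit contributes -1, which is why the two functions return the same value.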
3.2 Minimizing Neighborhood Ambiguity
We now discuss a formulation for learning the Hamming embedding $\Phi$. As mentioned above, given an anchor $\hat{x}$ and its neighborhood structure, we would like to satisfy Equation 2 as much as possible, or, equivalently, minimize the amount of violation. Indeed, many existing supervised hashing formulations are based on the idea of minimizing violations. For instance, affinity matching methods, mentioned in Section 1, typically enforce the following constraints on neighbors $x_p$ and non-neighbors $x_n$ of $\hat{x}$ through their loss functions:
(6)  $d_\Phi(\hat{x}, x_p) \le \tau_1, \quad d_\Phi(\hat{x}, x_n) \ge \tau_2$
where $\tau_1, \tau_2$ are threshold parameters. This indirectly enforces Equation 2 by constraining the absolute values of individual Hamming distances. Alternatively, local ranking methods based on triplet supervision encourage the following:
(7)  $d_\Phi(\hat{x}, x_p) + \eta \le d_\Phi(\hat{x}, x_n)$
where $\eta$ is a margin parameter. We note that both formulations are inflexible, as the same threshold or margin parameters are applied for all anchors $\hat{x}$, and they can be nontrivial to tune.
Instead, we propose a novel formulation based on the idea of minimizing neighborhood ambiguity. We say that $\Phi$ introduces neighborhood ambiguity if the mapped image of some non-neighbor $x_n$ is closer to that of $\hat{x}$ than the mapped image of some neighbor $x_p$ in the Hamming space; when this happens, it is no longer possible to recover the neighborhood structure by thresholding $d_\Phi(\hat{x}, \cdot)$. Consequently, when $\Phi$ is used to perform retrieval, the retrieved nearest neighbors of $\hat{x}$ would be contaminated by non-neighbors. Therefore, a high-quality embedding should minimize neighborhood ambiguity.
To concretely formulate the idea, we define the random variable $\mathcal{D}_{\hat{x},\Phi}: \mathcal{X} \to \{0, 1, \ldots, b\}$, $x \mapsto d_\Phi(\hat{x}, x)$, and let $\mathcal{C}_{\hat{x}}: \mathcal{X} \to \{0, 1\}$ be the membership indicator for the set of neighbors of $\hat{x}$. Then, we naturally have two conditional distributions of Hamming distance: $P(\mathcal{D}_{\hat{x},\Phi} \mid \mathcal{C}_{\hat{x}} = 1)$ and $P(\mathcal{D}_{\hat{x},\Phi} \mid \mathcal{C}_{\hat{x}} = 0)$. Note that the constraint in Equation 2 can be re-expressed as having no overlap between these two conditional distributions, and that minimizing neighborhood ambiguity amounts to minimizing the overlap. Please see Figure 1 for an illustration.
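We use the mutual information between random variables $\mathcal{D}_{\hat{x},\Phi}$ and $\mathcal{C}_{\hat{x}}$ to capture the amount of overlap between conditional Hamming distance distributions. The mutual information is defined as
(8)  $\mathcal{I}(\mathcal{D}_{\hat{x},\Phi}; \mathcal{C}_{\hat{x}}) = H(\mathcal{C}_{\hat{x}}) - H(\mathcal{C}_{\hat{x}} \mid \mathcal{D}_{\hat{x},\Phi})$
(9)  $= H(\mathcal{D}_{\hat{x},\Phi}) - H(\mathcal{D}_{\hat{x},\Phi} \mid \mathcal{C}_{\hat{x}})$
where $H$ denotes (conditional) entropy. In the following, for brevity we will drop subscripts $\Phi$ and $\hat{x}$, and denote the two conditional distributions and the marginal $P(\mathcal{D})$ as $p^+_{\mathcal{D}}$, $p^-_{\mathcal{D}}$, and $p_{\mathcal{D}}$, respectively.
By definition, $\mathcal{I}(\mathcal{D}; \mathcal{C})$ measures the decrease in uncertainty in the neighborhood information $\mathcal{C}$ when observing the Hamming distances $\mathcal{D}$. If $\mathcal{I}(\mathcal{D}; \mathcal{C})$ attains a high value, which means $\mathcal{C}$ can be determined with low uncertainty by observing $\mathcal{D}$, then $\Phi$ must have achieved good separation (i.e. low overlap) between $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$. $\mathcal{I}$ is maximized when there is no overlap, and minimized when $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$ are exactly identical. Recall, however, that $\mathcal{I}$ is defined with respect to a single anchor $\hat{x}$; therefore, for a general quality measure, we integrate $\mathcal{I}$ over the feature space:
(10)  $\mathcal{O}(\Phi) = \int_{\mathcal{X}} \mathcal{I}(\mathcal{D}_{\hat{x},\Phi}; \mathcal{C}_{\hat{x}})\, p(\hat{x})\, d\hat{x}$
$\mathcal{O}(\Phi)$ captures the expected amount of separation between $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$ achieved by $\Phi$, over all instances in $\mathcal{X}$.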
The expected separation cannot be determined directly, as the ground truth distribution $p(\hat{x})$ is unknown. However, it is possible to approximate it by simply using a finite set $\mathcal{T}$ of $n$ training examples independently sampled from $p$:
(11)  $\mathcal{O}_{\mathcal{T}}(\Phi) = \frac{1}{n} \sum_{x \in \mathcal{T}} \mathcal{I}_{\mathcal{T}}(\mathcal{D}_{x,\Phi}; \mathcal{C}_{x})$
We use the subscript $\mathcal{T}$ to indicate that when computing the mutual information $\mathcal{I}_{\mathcal{T}}$, the $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$ for the instance $x$ are estimated from $\mathcal{T}$. This can be done in $O(n)$ time for each $x$, as the discrete distributions can be estimated via histogram binning.
An appealing property of the mutual information objective is that it is parameter-free: the objective encourages the distributions $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$ to be separated, but does not include parameters dictating the distance threshold at which separation occurs, or the absolute amount of separation. The absence of such fixed parameters also increases flexibility, since the separation could occur at different distance thresholds depending on the anchor $\hat{x}$.
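To make the criterion concrete, the sketch below estimates the mutual information for a single anchor from two lists of integer Hamming distances (to its neighbors and to its non-neighbors), using normalized histograms as described above. It is a minimal illustration in plain Python: the helper names are hypothetical, natural logarithms are assumed, and the class priors are assumed to be estimated from the sizes of the two lists, which must both be non-empty.

```python
import math


def histogram(distances, b):
    """Normalized (b+1)-bin histogram over integer Hamming distances 0..b."""
    counts = [0] * (b + 1)
    for d in distances:
        counts[d] += 1
    return [c / len(distances) for c in counts]


def entropy(p):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def mutual_information(dist_pos, dist_neg, b):
    """I(D; C) = H(D) - H(D | C) for one anchor (both lists non-empty)."""
    n_pos, n_neg = len(dist_pos), len(dist_neg)
    prior_pos = n_pos / (n_pos + n_neg)
    prior_neg = 1.0 - prior_pos
    p_pos = histogram(dist_pos, b)
    p_neg = histogram(dist_neg, b)
    # Marginal distance distribution is the prior-weighted mixture.
    p_all = [prior_pos * a + prior_neg * c for a, c in zip(p_pos, p_neg)]
    h_cond = prior_pos * entropy(p_pos) + prior_neg * entropy(p_neg)
    return entropy(p_all) - h_cond


# Fully separated distances versus fully overlapping ones (hypothetical data).
separated = mutual_information([0, 1, 1], [3, 4, 4], b=4)
overlapping = mutual_information([1, 2], [1, 2], b=4)
```

With disjoint supports the estimate reaches its maximum, the entropy of the neighbor indicator, while identical distributions give zero; no threshold or margin enters the computation.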
4 Optimizing Mutual Information
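Having shown that mutual information is a suitable measure of hashing quality, we consider its use as a learning objective for hashing. For brevity, we omit the subscripts in Equation 11 and redefine the learning objective as
(12)  $\min_{\Phi}\ \mathcal{O} = -\frac{1}{n} \sum_{x \in \mathcal{T}} \mathcal{I}(\mathcal{D}_{x,\Phi}; \mathcal{C}_{x})$
Inspired by recent advances in stochastic optimization, we derive gradient descent rules for $\mathcal{O}$. We first derive the gradient of $\mathcal{I}$ with respect to the output of the hash mapping, $\Phi(\hat{x})$. With $b$-bit Hamming distances, we model the discrete distributions $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$ using $(b+1)$-bin normalized histograms over $\{0, 1, \ldots, b\}$. The mutual information is continuously differentiable, and using the chain rule we can write
(13)  $\frac{\partial \mathcal{I}}{\partial \Phi(\hat{x})} = \sum_{l=0}^{b} \left[ \frac{\partial \mathcal{I}}{\partial p^+_{\mathcal{D},l}} \frac{\partial p^+_{\mathcal{D},l}}{\partial \Phi(\hat{x})} + \frac{\partial \mathcal{I}}{\partial p^-_{\mathcal{D},l}} \frac{\partial p^-_{\mathcal{D},l}}{\partial \Phi(\hat{x})} \right]$
where $p^+_{\mathcal{D},l}$ is the $l$-th element of $p^+_{\mathcal{D}}$. We next focus on terms involving $p^+_{\mathcal{D}}$, and omit derivations for $p^-_{\mathcal{D}}$ due to symmetry. For $\partial \mathcal{I} / \partial p^+_{\mathcal{D},l}$, we have
(14)  $\frac{\partial \mathcal{I}}{\partial p^+_{\mathcal{D},l}} = -\frac{\partial H(\mathcal{D} \mid \mathcal{C})}{\partial p^+_{\mathcal{D},l}} + \frac{\partial H(\mathcal{D})}{\partial p^+_{\mathcal{D},l}}$
(15)  $= p^+ (\log p^+_{\mathcal{D},l} + 1) - (\log p_{\mathcal{D},l} + 1) \frac{\partial p_{\mathcal{D},l}}{\partial p^+_{\mathcal{D},l}}$
(16)  $= p^+ (\log p^+_{\mathcal{D},l} - \log p_{\mathcal{D},l})$
where we used the fact that $p_{\mathcal{D},l} = p^+ p^+_{\mathcal{D},l} + p^- p^-_{\mathcal{D},l}$, with $p^+$ and $p^-$ being shorthands for the priors $P(\mathcal{C} = 1)$ and $P(\mathcal{C} = 0)$.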
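The closed form for the derivative of the mutual information with respect to a bin of the neighbor histogram, namely the neighbor prior times the difference between the log of that bin and the log of the corresponding marginal bin, can be checked numerically. The sketch below compares it against central finite differences; the function names and the example histograms are assumptions made for illustration, and each histogram bin is treated as a free variable, as in the derivation.

```python
import math


def entropy(p):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def mutual_information(p_pos, p_neg, prior_pos):
    """I = H(D) - H(D | C), with the marginal as the prior-weighted mixture."""
    prior_neg = 1.0 - prior_pos
    p_all = [prior_pos * a + prior_neg * c for a, c in zip(p_pos, p_neg)]
    return entropy(p_all) - prior_pos * entropy(p_pos) - prior_neg * entropy(p_neg)


def analytic_gradient(p_pos, p_neg, prior_pos):
    """Closed form: prior_pos * (log p_pos[l] - log p_all[l]) for each bin l."""
    prior_neg = 1.0 - prior_pos
    grads = []
    for a, c in zip(p_pos, p_neg):
        marginal = prior_pos * a + prior_neg * c
        grads.append(prior_pos * (math.log(a) - math.log(marginal)))
    return grads


def numeric_gradient(p_pos, p_neg, prior_pos, eps=1e-6):
    """Central finite differences, perturbing one neighbor bin at a time."""
    grads = []
    for l in range(len(p_pos)):
        up, down = list(p_pos), list(p_pos)
        up[l] += eps
        down[l] -= eps
        diff = (mutual_information(up, p_neg, prior_pos)
                - mutual_information(down, p_neg, prior_pos))
        grads.append(diff / (2 * eps))
    return grads


# Hypothetical 3-bin histograms with strictly positive entries.
p_pos, p_neg, prior_pos = [0.6, 0.3, 0.1], [0.1, 0.3, 0.6], 0.25
analytic = analytic_gradient(p_pos, p_neg, prior_pos)
numeric = numeric_gradient(p_pos, p_neg, prior_pos)
```

In this example the middle bin has equal mass under both conditionals, so its derivative vanishes, while bins where neighbors are over-represented receive a positive derivative.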
4.1 Continuous Relaxation
To complete the chain rule, we need to further derive the term $\partial p^+_{\mathcal{D},l} / \partial \Phi(\hat{x})$. However, note that the hash mapping $\Phi$ is discrete by nature, precluding the use of continuous optimization. While it is possible to maintain such constraints and resort to discrete optimization, the resulting problems are usually NP-hard. Instead, we take the relaxation approach, in order to apply gradient-based continuous optimization. Correspondingly, we need to perform a continuously differentiable relaxation of $\Phi$. Recall from Equation 5 that each element in $\Phi$ is obtained from thresholding neural network activations with the sign function. We relax $\Phi$ into a real-valued vector $\hat{\Phi}$ by adopting a standard technique in hashing [2, 28, 36], where the discontinuous sign function is approximated with the sigmoid function $\sigma$:
(17)  $\Phi(x) \approx \hat{\Phi}(x) = (\hat{\phi}_1(x), \ldots, \hat{\phi}_b(x))$
(18)  $\hat{\phi}_i(x) = 2\sigma(\gamma f_i(x; \theta)) - 1, \quad i = 1, \ldots, b$
We include a tuning parameter $\gamma$ in the approximation. We choose $\gamma$ so as to increase the “sharpness” of the approximation, and reduce the error introduced by the continuous relaxation. Alternatively, it is possible to formulate the quantization error as a penalty term in the objective function with proper weighting, e.g. [28, 54], or use continuation methods [3].
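The effect of the sharpness parameter can be illustrated with a small Python sketch that compares the sigmoid-based relaxation with the sign function on a few activations. The names `relaxed_code` and `sign_code`, the activation values, and the chosen sharpness values are illustrative assumptions. The sketch relies on the identity 2σ(z) − 1 = tanh(z/2), which avoids overflow for large arguments.

```python
import math


def sign_code(activations):
    """Hard binary code in {-1, +1} obtained by thresholding at zero."""
    return [1 if a >= 0 else -1 for a in activations]


def relaxed_code(activations, gamma):
    """Relaxed code 2 * sigmoid(gamma * a) - 1, computed as tanh(gamma * a / 2)."""
    return [math.tanh(gamma * a / 2.0) for a in activations]


def max_relaxation_error(activations, gamma):
    """Largest absolute gap between the relaxed and the hard code."""
    hard = sign_code(activations)
    soft = relaxed_code(activations, gamma)
    return max(abs(h - s) for h, s in zip(hard, soft))


# Hypothetical network activations for a 3-bit code.
acts = [0.5, -1.2, 2.0]
for gamma in (1.0, 10.0, 50.0):
    print(gamma, max_relaxation_error(acts, gamma))
```

Larger sharpness values bring the relaxed code closer to the binary code, at the cost of smaller gradients away from the decision boundary, which is one reason the alternatives mentioned above (penalty terms, continuation) exist.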
With the continuous relaxation in place, we move on to the partial differentiation of $p^+_{\mathcal{D}}$ and $p^-_{\mathcal{D}}$ with respect to $\hat{\Phi}$. As mentioned before, these discrete distributions can be estimated via histogram binning; however, histogram binning is a non-differentiable operation. In the following, we describe a differentiable approximation to the discrete histogram binning process, thereby enabling end-to-end backpropagation.
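4.2 End-to-End Optimization
We first elaborate on the histogram binning process. Without the continuous relaxation, $p^+_{\mathcal{D},l}$ is estimated by performing hard assignments of Hamming distances into histogram bins, where the sum runs over the set $\mathcal{N}^+_{\hat{x}}$ of neighbors of the anchor $\hat{x}$:
(19)  $p^+_{\mathcal{D},l} = \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \mathbb{1}[d_\Phi(\hat{x}, x) = l]$
With the continuous relaxation developed above, we first note that $d_\Phi$ in Equation 3 is not integer-valued any more, but is also continuously relaxed into
(20)  $\hat{d}_\Phi(\hat{x}, x) = \frac{1}{2}\left(b - \hat{\Phi}(\hat{x})^\top \hat{\Phi}(x)\right)$
When $d_\Phi$ is relaxed into $\hat{d}_\Phi$, we employ a technique similar to [47], and replace hard assignment with soft assignment. The key is to approximate the binary indicator $\mathbb{1}[d_\Phi(\hat{x}, x) = l]$ with a differentiable function $\delta_{x,l}$. Specifically, we use a triangular kernel centered on the $l$-th histogram bin, so that $\delta_{x,l}$ linearly interpolates the real-valued $\hat{d}_\Phi(\hat{x}, x)$ into the $l$-th bin with slope $\Delta$:
(21)  $\delta_{x,l} = \max\left\{0,\ 1 - \Delta\, |\hat{d}_\Phi(\hat{x}, x) - l|\right\}$
This soft assignment admits simple subgradients:
(22)  $\frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)} = \begin{cases} \Delta, & \hat{d}_\Phi(\hat{x}, x) \in [l - 1/\Delta,\ l], \\ -\Delta, & \hat{d}_\Phi(\hat{x}, x) \in (l,\ l + 1/\Delta], \\ 0, & \text{otherwise.} \end{cases}$
It is easy to see that the triangular kernel approximation approaches the original binary indicator as $\Delta \to \infty$.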
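The soft assignment can be sketched in a few lines of Python. The code below assumes the triangular kernel takes the form max(0, 1 − slope · |d − l|) for a real-valued distance d and bin index l, which is one form consistent with the description above; the function names and the default slope of 1 are illustrative choices.

```python
def triangular_weight(d, l, slope=1.0):
    """Soft membership of a real-valued distance d in bin l."""
    return max(0.0, 1.0 - slope * abs(d - l))


def triangular_subgradient(d, l, slope=1.0):
    """One valid subgradient of the soft membership with respect to d."""
    if l - 1.0 / slope <= d <= l:
        return slope
    if l < d <= l + 1.0 / slope:
        return -slope
    return 0.0


def soft_histogram(distances, b, slope=1.0):
    """Soft (b+1)-bin histogram: average kernel weight per bin."""
    n = len(distances)
    return [sum(triangular_weight(d, l, slope) for d in distances) / n
            for l in range(b + 1)]


# Integer distances reproduce hard binning; fractional ones are shared
# between the two adjacent bins (hypothetical values, b = 3).
hard_like = soft_histogram([0, 2, 2, 3], b=3)
shared = soft_histogram([1.25], b=3)
```

With a slope of 1, each distance spreads a total weight of 1 over its two adjacent bins, so the soft histogram stays normalized; with steeper slopes the kernel narrows and approaches the hard indicator, but the bins no longer sum to 1 for fractional distances.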
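We are now ready to tackle the term $\partial p^+_{\mathcal{D},l} / \partial \hat{\Phi}$. From the definition of $p^+_{\mathcal{D},l}$ in Equation 19, with the binary indicator replaced by the soft assignment $\delta_{x,l}$, we have, for the anchor $\hat{x}$:
(23)  $\frac{\partial p^+_{\mathcal{D},l}}{\partial \hat{\Phi}(\hat{x})} = \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{\Phi}(\hat{x})}$
(24)  $= \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)} \frac{\partial \hat{d}_\Phi(\hat{x}, x)}{\partial \hat{\Phi}(\hat{x})}$
(25)  $= -\frac{1}{2|\mathcal{N}^+_{\hat{x}}|} \sum_{x \in \mathcal{N}^+_{\hat{x}}} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)}\, \hat{\Phi}(x)$
For the last step, we used the definition of $\hat{d}_\Phi$ in Equation 20. Next, for a neighbor $x \in \mathcal{N}^+_{\hat{x}}$:
(26)  $\frac{\partial p^+_{\mathcal{D},l}}{\partial \hat{\Phi}(x)} = \frac{1}{|\mathcal{N}^+_{\hat{x}}|} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)} \frac{\partial \hat{d}_\Phi(\hat{x}, x)}{\partial \hat{\Phi}(x)}$
(27)  $= -\frac{1}{2|\mathcal{N}^+_{\hat{x}}|} \frac{\partial \delta_{x,l}}{\partial \hat{d}_\Phi(\hat{x}, x)}\, \hat{\Phi}(\hat{x})$
Lastly, to backpropagate gradients to $\hat{\Phi}$’s inputs and ultimately to the parameters of the underlying neural network, we only need to further differentiate the sigmoid approximation employed in $\hat{\Phi}$ (Equation 18), which has a closed-form expression.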
4.3 Efficient Minibatch Backpropagation
So far, our derivations of mutual information and its gradients have assumed a single anchor example $\hat{x}$, omitting the fact that the optimization objective in Equation 12 is the average of mutual information values over all anchors in a finite training set. In information retrieval terminology, the current derivations are for a single query and a fixed database. However, in many computer vision tasks (e.g. image retrieval), there is usually no clear split of the given training set into a set of queries and a database. Even if we create such a split, it can be arbitrary and does not fully utilize available supervision. Another challenge is adapting the optimization to the stochastic/minibatch setting, since deep neural networks are typically trained by minibatch stochastic gradient descent (SGD), where it is infeasible to access the entire database all at once.
Here, we describe a way to efficiently utilize all the available supervision during minibatch SGD training, simultaneously addressing both challenges. The idea is that, for a minibatch with $m$ examples, each example is used as the query against a database comprising the remaining $m-1$ examples. For each query, its neighbors and non-neighbors are found among the remaining $m-1$ examples according to the provided labels. Then, the overall objective value for the minibatch is the average over the $m$ queries. This way, the available supervision in the minibatch is utilized maximally: each example is used as the query once, and as a database item $m-1$ times. As we shall see next, the backpropagation in this case can be efficiently implemented using matrix multiplications.
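The query-versus-rest scheme can be sketched as follows for the forward computation of the minibatch objective. This plain-Python illustration uses hard binary codes and hard histogram binning rather than the relaxed quantities needed for backpropagation; the function names are hypothetical, neighbors are assumed to be defined by equal class labels, and anchors that have no neighbors or no non-neighbors in the minibatch are skipped, which is an assumption of this sketch.

```python
import math


def _entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def _histogram(distances, b):
    counts = [0] * (b + 1)
    for d in distances:
        counts[d] += 1
    return [c / len(distances) for c in counts]


def _mutual_information(dist_pos, dist_neg, b):
    prior_pos = len(dist_pos) / (len(dist_pos) + len(dist_neg))
    prior_neg = 1.0 - prior_pos
    p_pos, p_neg = _histogram(dist_pos, b), _histogram(dist_neg, b)
    p_all = [prior_pos * a + prior_neg * c for a, c in zip(p_pos, p_neg)]
    return _entropy(p_all) - prior_pos * _entropy(p_pos) - prior_neg * _entropy(p_neg)


def hamming(u, v):
    return sum(1 for a, c in zip(u, v) if a != c)


def minibatch_objective(codes, labels):
    """Negative mean mutual information, using every example as a query once."""
    b = len(codes[0])
    values = []
    for i, (code_i, label_i) in enumerate(zip(codes, labels)):
        dist_pos, dist_neg = [], []
        for j, (code_j, label_j) in enumerate(zip(codes, labels)):
            if j == i:
                continue  # the query is excluded from its own database
            target = dist_pos if label_j == label_i else dist_neg
            target.append(hamming(code_i, code_j))
        if dist_pos and dist_neg:  # assumption: skip degenerate anchors
            values.append(_mutual_information(dist_pos, dist_neg, b))
    return -sum(values) / len(values) if values else 0.0


# Hypothetical 3-bit codes for two classes: perfectly separated vs. collapsed.
labels = [0, 0, 1, 1]
good = minibatch_objective([[1, 1, 1], [1, 1, 1], [-1, -1, -1], [-1, -1, -1]], labels)
bad = minibatch_objective([[1, 1, 1]] * 4, labels)
```

The double loop makes the pairing explicit but costs a quadratic number of distance evaluations in the minibatch size; the matrix formulation discussed next serves the same purpose for the gradients.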
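Now consider a minibatch of size $m$, $\mathcal{B} = \{x_1, \ldots, x_m\}$. For $i = 1, \ldots, m$, let $\hat{\Phi}_i$ be a shorthand for $\hat{\Phi}(x_i)$. We group the hash mapping output for the entire minibatch into the following matrix,
(28)  $\hat{\boldsymbol{\Phi}} = [\hat{\Phi}_1\ \hat{\Phi}_2\ \cdots\ \hat{\Phi}_m] \in \mathbb{R}^{b \times m}$
Similar to Equation 13, and writing $\mathcal{I}_i$ for the mutual information computed with $x_i$ as the anchor, we can write the Jacobian matrix of the minibatch objective $\mathcal{O}_{\mathcal{B}}$ with respect to $\hat{\boldsymbol{\Phi}}$ as
(29)  $\frac{\partial \mathcal{O}_{\mathcal{B}}}{\partial \hat{\boldsymbol{\Phi}}} = -\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{I}_i}{\partial \hat{\boldsymbol{\Phi}}}$
(30)  $= -\frac{1}{m} \sum_{i=1}^{m} \sum_{l=0}^{b} \left[ \frac{\partial \mathcal{I}_i}{\partial p^+_{i,l}} \frac{\partial p^+_{i,l}}{\partial \hat{\boldsymbol{\Phi}}} + \frac{\partial \mathcal{I}_i}{\partial p^-_{i,l}} \frac{\partial p^-_{i,l}}{\partial \hat{\boldsymbol{\Phi}}} \right]$
where $p^+_{i,l}$ ($p^-_{i,l}$) is the $l$-th element of $p^+_{\mathcal{D}}$ ($p^-_{\mathcal{D}}$) when the anchor is $x_i$. Again, the partial derivative $\partial \mathcal{I}_i / \partial p^+_{i,l}$ is straightforward to compute, and the main issue is evaluating the Jacobian $\partial p^+_{i,l} / \partial \hat{\boldsymbol{\Phi}}$. We do so by examining each column of the Jacobian, where $\mathcal{N}^+_i$ denotes the set of neighbors of $x_i$ within the minibatch. First, for $j = i$,
(31)  $\frac{\partial p^+_{i,l}}{\partial \hat{\Phi}_i} = \frac{1}{|\mathcal{N}^+_i|} \sum_{k:\, x_k \in \mathcal{N}^+_i} \frac{\partial \delta_{ik,l}}{\partial \hat{\Phi}_i}$
(32)  $= \frac{1}{|\mathcal{N}^+_i|} \sum_{k:\, x_k \in \mathcal{N}^+_i} \frac{\partial \delta_{ik,l}}{\partial \hat{d}_{ik}} \frac{\partial \hat{d}_{ik}}{\partial \hat{\Phi}_i}$
(33)  $= -\frac{1}{2|\mathcal{N}^+_i|} \sum_{k \ne i} \delta^+_{ik,l}\, \hat{\Phi}_k$
with the substitutions
(34)  $\hat{d}_{ik} = \hat{d}_\Phi(x_i, x_k), \quad \delta_{ik,l} = \delta_{x_k,l} \text{ with anchor } x_i$
(35)  $\delta^+_{ik,l} = \mathbb{1}[x_k \in \mathcal{N}^+_i]\, \frac{\partial \delta_{ik,l}}{\partial \hat{d}_{ik}}$
Next, for $j \ne i$,
(36)  $\frac{\partial p^+_{i,l}}{\partial \hat{\Phi}_j} = \frac{\mathbb{1}[x_j \in \mathcal{N}^+_i]}{|\mathcal{N}^+_i|} \frac{\partial \delta_{ij,l}}{\partial \hat{d}_{ij}} \frac{\partial \hat{d}_{ij}}{\partial \hat{\Phi}_j}$
(37)  $= -\frac{1}{2|\mathcal{N}^+_i|}\, \delta^+_{ij,l}\, \hat{\Phi}_i$
We can further unify the two cases by defining $\delta^+_{ii,l} = 0$:
(38)  $\frac{\partial p^+_{i,l}}{\partial \hat{\Phi}_j} = -\frac{1}{2|\mathcal{N}^+_i|} \left( \delta^+_{ij,l}\, \hat{\Phi}_i + \mathbb{1}[j = i] \sum_{k=1}^{m} \delta^+_{ik,l}\, \hat{\Phi}_k \right)$
Having derived all the columns, we now write down the matrix form of $\partial p^+_{i,l} / \partial \hat{\boldsymbol{\Phi}}$. Let $\boldsymbol{\delta}^+_l = [\delta^+_{ij,l}] \in \mathbb{R}^{m \times m}$, and let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^m$,
(39)  $\frac{\partial p^+_{i,l}}{\partial \hat{\boldsymbol{\Phi}}} = -\frac{1}{2|\mathcal{N}^+_i|} \left( \hat{\Phi}_i e_i^\top \boldsymbol{\delta}^+_l + \hat{\boldsymbol{\Phi}} \boldsymbol{\delta}^+_l e_i e_i^\top \right)$
(40)  $= -\frac{1}{2|\mathcal{N}^+_i|}\, \hat{\boldsymbol{\Phi}} \left( e_i e_i^\top \boldsymbol{\delta}^+_l + \boldsymbol{\delta}^+_l e_i e_i^\top \right)$
We will next complete Equation 30. First, we define a shorthand, which can be easily evaluated using the result in Equation 16:
(41)  $\alpha^+_{i,l} = \frac{1}{|\mathcal{N}^+_i|} \frac{\partial \mathcal{I}_i}{\partial p^+_{i,l}}$
Using symmetry, we only consider the first term involving $p^+_{i,l}$ in Equation 30, and we omit the scaling factor $-\frac{1}{m}$ for now:
(42)  $\sum_{i=1}^{m} \sum_{l=0}^{b} \frac{\partial \mathcal{I}_i}{\partial p^+_{i,l}} \frac{\partial p^+_{i,l}}{\partial \hat{\boldsymbol{\Phi}}}$
(43)  $= -\frac{1}{2} \sum_{i=1}^{m} \sum_{l=0}^{b} \alpha^+_{i,l}\, \hat{\boldsymbol{\Phi}} \left( e_i e_i^\top \boldsymbol{\delta}^+_l + \boldsymbol{\delta}^+_l e_i e_i^\top \right)$
(44)  $= -\frac{1}{2}\, \hat{\boldsymbol{\Phi}} \sum_{l=0}^{b} \sum_{i=1}^{m} \alpha^+_{i,l} \left( e_i e_i^\top \boldsymbol{\delta}^+_l + \boldsymbol{\delta}^+_l e_i e_i^\top \right)$
(45)  $= -\frac{1}{2}\, \hat{\boldsymbol{\Phi}} \sum_{l=0}^{b} \left[ \left( \sum_{i=1}^{m} \alpha^+_{i,l} e_i e_i^\top \right) \boldsymbol{\delta}^+_l + \boldsymbol{\delta}^+_l \left( \sum_{i=1}^{m} \alpha^+_{i,l} e_i e_i^\top \right) \right]$
Define
(46)  $\mathbf{A}^+_l = \sum_{i=1}^{m} \alpha^+_{i,l} e_i e_i^\top = \mathrm{diag}(\alpha^+_{1,l}, \ldots, \alpha^+_{m,l})$
(47)  $\mathbf{P}_l = \mathbf{A}^+_l \boldsymbol{\delta}^+_l$
then we can simplify Equation 45 as
(48)  $-\frac{1}{2}\, \hat{\boldsymbol{\Phi}} \sum_{l=0}^{b} \left( \mathbf{A}^+_l \boldsymbol{\delta}^+_l + \boldsymbol{\delta}^+_l \mathbf{A}^+_l \right)$
(49)  $= -\frac{1}{2}\, \hat{\boldsymbol{\Phi}} \sum_{l=0}^{b} \left( \mathbf{P}_l + \mathbf{P}_l^\top \right)$
The last step is true since $\boldsymbol{\delta}^+_l$ is symmetric (a property also used in the second term of Equation 39): it can be seen from the definition of $\delta^+_{ij,l}$ in Equation 35 that $\delta^+_{ij,l} = \delta^+_{ji,l}$, since both the neighbor relationship and the Hamming distance are symmetric.
Now, if we define $\mathbf{A}^-_l$ and $\mathbf{N}_l$ for the non-neighbor distance distribution $p^-_{\mathcal{D}}$, analogously as in Equations 46 and 47 (details are very similar and thus omitted), then the full Jacobian matrix in Equation 30 can be evaluated as
(50)  $\frac{\partial \mathcal{O}_{\mathcal{B}}}{\partial \hat{\boldsymbol{\Phi}}} = \frac{1}{2m}\, \hat{\boldsymbol{\Phi}} \sum_{l=0}^{b} \left( \mathbf{P}_l + \mathbf{P}_l^\top + \mathbf{N}_l + \mathbf{N}_l^\top \right)$
Since only matrix multiplications and additions are involved, this operation can be implemented efficiently.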
We note that a similar minibatch-based formulation was recently proposed by Triantafillou et al. [46], which is also inspired by information retrieval, and attempts to maximize the utilization of supervision by treating each example in the minibatch as a query. However, [46] specifically tackles the problem of few-shot learning, and its formulation approximately optimizes mean Average Precision using a structured prediction framework, which is very different from ours. Nevertheless, it would be interesting to explore the use of hashing and the mutual information objective for that problem in future work.
5 Experiments
5.1 Datasets and Evaluation Setup
We conduct experiments on widely used image retrieval benchmarks: CIFAR-10 [23], NUS-WIDE [7], 22K LabelMe [39] and ImageNet100 [11]. Each dataset is split into a test set and a retrieval set, and instances from the retrieval set are used in training. At test time, queries from the test set are used to rank instances from the retrieval set using Hamming distances, and the performance metric is averaged over the queries.

CIFAR-10 is a dataset for image classification and retrieval, containing 60K images from 10 different categories. We follow the setup of [27, 62, 28, 54]. This setup corresponds to two distinct partitions of the dataset. In the first case (cifar1), we sample 500 images per category, resulting in 5,000 training examples to learn the hash mapping. The test set contains 100 images per category (1,000 in total). The remaining images are then used to populate the hash table. In the second case (cifar2), we sample 1,000 images per category to construct the test set (10,000 in total). The remaining items are used both to learn the hash mapping and to populate the hash table. Two images are considered neighbors if they belong to the same class.

NUS-WIDE is a dataset containing 269K images from Flickr. Each image can be associated with multiple labels, corresponding to 81 ground truth concepts. For NUS-WIDE experiments, following the setup in [27, 62, 28, 54], we only consider images annotated with the 21 most frequent labels. In total, this corresponds to 195,834 images. The experimental setup also has two distinct partitionings: nus1 and nus2. In both cases, a test set is constructed by randomly sampling 100 images per label (2,100 images in total). To learn the hash mapping, 500 images per label are randomly sampled in nus1 (10,500 in total). The remaining images are then used to populate the hash table. In the second case, nus2, all the images excluding the test set are used in learning and populating the hash table. Following standard practice, two images are considered neighbors if they share at least one label.

22K LabelMe consists of 22K images, each represented with a 512-dimensional GIST descriptor. Following [25, 2], we randomly partition the dataset into a retrieval set and a test set, consisting of 20K and 2K instances, respectively. A 5K subset of the retrieval set is used in learning the hash mapping. As this dataset is unlabeled, we use the Euclidean distance between GIST features in determining the neighborhood structure. Two examples whose Euclidean distance falls below a distance percentile threshold are considered neighbors.

ImageNet100 is a subset of ImageNet [11] containing 130K images from 100 classes. We follow the setup in [3], and randomly sample 100 images per class for training. All images in the selected classes from the ILSVRC 2012 validation set are used as the test set. Two images are considered neighbors if they belong to the same class.
As the performance metric, we use the standard mean Average Precision (mAP), or its variants. We compare against both classical and recent state-of-the-art hashing methods. These methods include: Spectral Hashing (SH) [55], Iterative Quantization (ITQ) [17], Sequential Projection Learning for Hashing (SPLH) [50], Supervised Hashing with Kernels (SHK) [36], Fast Supervised Hashing with Decision Trees (FastHash) [31], Structured Hashing (StructHash) [30], Supervised Discrete Hashing (SDH) [42], Efficient Training of Very Deep Neural Networks (VDSH) [60], Deep Supervised Hashing with Pairwise Labels (DPSH) [28], Deep Supervised Hashing with Triplet Labels (DTSH) [54], and Hashing by Continuation (HashNet) [3]. These competing methods have been shown to outperform earlier and other works such as [51, 25, 38, 56, 27, 61].
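For reference, the retrieval metric can be sketched as follows: for each query, database items are ranked by ascending Hamming distance, and the precision values at the ranks of relevant items are averaged. This is a generic illustration rather than the exact evaluation code used in the experiments; the function names are hypothetical, ties in Hamming distance are broken by database order (Python's sort is stable), and the optional `top_k` truncation normalizes by the number of relevant items found within the top k, which is only one of several conventions in use.

```python
def average_precision(distances, relevant, top_k=None):
    """Average precision for one query, ranking by ascending distance."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    if top_k is not None:
        order = order[:top_k]
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant item
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_distances, all_relevant, top_k=None):
    """Mean of the per-query average precision values."""
    aps = [average_precision(d, r, top_k)
           for d, r in zip(all_distances, all_relevant)]
    return sum(aps) / len(aps)


# Two hypothetical queries against small databases.
ap_first = average_precision([0, 1, 2, 3], [True, False, True, False])
ap_second = average_precision([2, 1, 0], [True, False, False])
overall = mean_average_precision(
    [[0, 1, 2, 3], [2, 1, 0]],
    [[True, False, True, False], [True, False, False]],
)
```

Because Hamming distances are integers, ties are frequent at short code lengths, so the tie-breaking rule can noticeably affect the reported value.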
We fine-tune deep Convolutional Neural Network models pretrained on the ImageNet dataset, by replacing the final softmax classification layer with a new fully-connected layer that produces the binary bits. The new fully-connected layer is randomly initialized. For CIFAR-10 and NUS-WIDE experiments, we fine-tune a VGG-F [5] architecture, as in [28, 54]. For ImageNet100 experiments, following the protocol of HashNet [3], we fine-tune the AlexNet [24] architecture, and scale down the learning rate for pretrained layers by a factor of 0.01, since the model is fine-tuned on the same dataset for a different task. For non-deep methods, we use the output of the penultimate layer (fc7) of both architectures as input features, which are 4096-dimensional. For the 22K LabelMe benchmark, all methods learn shallow models on top of precomputed 512-dimensional GIST descriptors; for gradient-based hashing methods, this corresponds to using a single fully connected layer. We use SGD with momentum and weight decay, and reduce the learning rate periodically by a predetermined factor, which is standard practice. During SGD training, the minibatches are randomly sampled from the training set.
5.2 Results
Table LABEL:table:set1 gives results for the cifar-1 and nus-1 experimental settings, in which mAP and mAP@5K (mAP evaluated on the top 5,000 retrievals) are reported for the CIFAR-10 and NUS-WIDE datasets, respectively. Deep learning based hashing methods such as DPSH and DTSH outperform most non-deep hashing solutions. This is not surprising, as hash mapping is done simultaneously with feature learning. Non-deep solutions such as FastHash and SDH also perform competitively, especially in NUS-WIDE experiments. Our proposed method, MIHash, surpasses all competing methods in the majority of the experiments. For example, with 24- and 32-bit binary embeddings, MIHash surpasses the nearest competitor, DTSH, on CIFAR-10. For NUS-WIDE, MIHash demonstrates comparable results, improving over the state of the art with 48-bit codes.
The performance improvement of MIHash is much more significant in the cifar-2 and nus-2 settings. In these settings, a VGG-F architecture pretrained on ImageNet is again fine-tuned, but with more data. Following standard practice, mAP-based metrics are used to evaluate the retrieval performance. Table LABEL:table:set2 gives the results. As observed, in both CIFAR-10 and NUS-WIDE, MIHash achieves state-of-the-art results for nearly all code lengths. For instance, MIHash consistently outperforms DTSH, the closest competitor, in all but one binary embedding size.
Retrieval results for ImageNet100 are given in Table LABEL:table:imagenet. In these experiments, we compare against DTSH, the overall best competing method in past experiments, and another recently introduced deep learning based hashing method, HashNet [3]. The evaluation metric follows the setup in [3] for consistency. In this benchmark, MIHash significantly outperforms both DTSH and HashNet for all embedding sizes. Notably, MIHash achieves an improvement over HashNet with 16-bit codes, indicating its superiority in learning high-quality compact binary codes.
To further emphasize the merits of MIHash, we consider shallow model experiments on the 22K LabelMe dataset. In this benchmark, we only consider the overall best non-deep and deep learning methods in past experiments. Also, to focus solely on comparing the hash mapping learning objectives, all deep learning methods use a one-layer embedding on top of the GIST descriptor. This nullifies the feature learning aspect of these deep learning methods. Still, some non-deep methods can learn nonlinear hash mappings, such as FastHash and StructHash, which use boosted decision trees. Table LABEL:table:labelme gives the results, and we can see that the non-deep methods FastHash and StructHash outperform the deep learning methods DPSH and DTSH on this benchmark. This indicates that the prowess of DPSH and DTSH might come primarily through feature learning. On the other hand, MIHash remains the best performing method across all code lengths, despite using a simpler one-layer embedding function compared to FastHash and StructHash. This further validates the effectiveness of our mutual information based objective in capturing the neighborhood structure.
5.3 Empirical Analysis
Mutual Information and Ranking Metrics
To evaluate the performance of hashing algorithms for retrieval tasks, it is common to use ranking-based metrics, and the most notable example is mean Average Precision (mAP). We note that there exist strong correlations between our mutual information criterion and mAP. Figure 2 provides an empirical analysis on the CIFAR-10 benchmark. The left plot displays the training objective value as computed from Equation 12 and the mAP score with respect to the epoch. These results are obtained from the cifar-1 experiment with 32-bit codes, as specified in the previous section. Notice that both the mutual information objective and the mAP value show similar behavior, i.e., exhibit strong correlation. While the mAP score increases from 0.40 to 0.80, the mutual information objective increases from 0.15 to above 0.40. In the middle figure, we apply min-max normalization in order to scale both measures to the same range.
To further emphasize the correlation between mutual information and mAP, we also conducted an additional experiment in which 100 instances are selected as the query set, and the rest are used to populate the hash table. The hash mapping parameters are randomly sampled from a Gaussian distribution, similar to LSH [16], and each experiment is conducted 50 times. The right figure provides the scatter plot of mAP and the mutual information objective value. We can see that the relationship is almost linear, which is also validated by the Pearson Correlation Coefficient score of 0.98.
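The ingredients of this experiment can be sketched as follows: hash parameters drawn from a Gaussian distribution, and the Pearson correlation coefficient between two sets of scores. The function names are hypothetical, and thresholding each projection at zero to obtain a bit is an assumption about how the random projections are binarized.

```python
import math
import random


def random_hash_params(dim, n_bits, rng):
    """Draw one Gaussian projection vector per bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(x, params):
    """Binary code of x: one bit per projection, set when the projection is positive."""
    return tuple(1 if sum(w * v for w, v in zip(row, x)) > 0 else 0
                 for row in params)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

In a repeated-trial setup like the one described, each trial would draw fresh parameters with `random_hash_params`, record one mAP value and one mutual information value, and `pearson` would then be applied to the two resulting lists.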
We give an intuitive explanation for the strong correlation. Given a query, mAP is optimized when all of its neighbors are ranked above all of its non-neighbors in the database. On the other hand, mutual information is optimized when the distribution of neighbor distances has no overlap with the distribution of non-neighbor distances. Therefore, we can see that mAP and mutual information are simultaneously optimized by the same optimal solution. Conversely, mAP is suboptimal when neighbors and non-neighbors are interleaved in the ranking, and so is mutual information when the distance distributions have nonzero overlap. Although a theoretical analysis of the correlation is beyond the scope of this work, empirically we find that mutual information serves as a general-purpose surrogate metric for ranking.
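The two extreme cases in this argument can be checked numerically with a small sketch. It assumes that the mutual information is taken between the Hamming distance D to the query and the binary neighbor indicator C, estimated as I(D; C) = H(C) - H(C | D) from empirical distance histograms. The function names are hypothetical, and this discrete estimate is not the differentiable relaxation used for training.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(neighbor_dists, non_neighbor_dists):
    """Estimate I(D; C) = H(C) - H(C | D) from two lists of Hamming distances.

    Assumes both lists together contain at least one distance.
    """
    n_pos, n_neg = len(neighbor_dists), len(non_neighbor_dists)
    total = n_pos + n_neg
    pos_hist, neg_hist = Counter(neighbor_dists), Counter(non_neighbor_dists)
    h_c = entropy([n_pos / total, n_neg / total])
    h_c_given_d = 0.0
    for d in set(pos_hist) | set(neg_hist):
        count = pos_hist[d] + neg_hist[d]  # items at distance d
        h_c_given_d += (count / total) * entropy(
            [pos_hist[d] / count, neg_hist[d] / count])
    return h_c - h_c_given_d
```

When the two lists share no distance value, H(C | D) is zero and the estimate reaches its maximum H(C). When the two histograms are identical, observing the distance gives no information about neighborhood and the estimate is zero.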
Distribution Separating Effect
To demonstrate that MIHash indeed separates neighbor and non-neighbor distance distributions, we consider a simple experiment. Specifically, we learn a single-layer model on top of precomputed penultimate-layer features. The learning is done in an online fashion, which means that each training example is processed only once. We train such a model on the CIFAR-10 dataset with 20K training examples.
In Figure 3, we plot the neighbor and non-neighbor distance distributions, averaged on the CIFAR-10 test set, before and after learning with 20K training examples. The hash mapping parameters are initialized using LSH, which leads to high overlap between the distributions, although they do not totally overlap due to the use of strong pretrained features. After learning, the overlap is significantly reduced, with the neighbor distance distribution pushed towards zero Hamming distance. Consequently, the mAP value increases.
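One simple way to quantify the overlap discussed here is to build normalized Hamming distance histograms and sum their bin-wise minima. This overlap coefficient is an illustrative measure chosen for this sketch, not the quantity optimized by the method, and the function names are hypothetical.

```python
from collections import Counter


def distance_histogram(dists, n_bits):
    """Normalized histogram over the Hamming distances 0..n_bits."""
    counts = Counter(dists)
    return [counts[d] / len(dists) for d in range(n_bits + 1)]


def overlap(p, q):
    """Overlap coefficient of two histograms: 0 if disjoint, 1 if identical."""
    return sum(min(a, b) for a, b in zip(p, q))
```

Computing `overlap` on the neighbor and non-neighbor histograms before and after training would give a single number summarizing how far the two distributions have been separated.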
t-SNE Visualization of the Binary Embeddings
We also visualize the learned embeddings using t-SNE [48]. In Figure 4, we plot the visualization for 48-bit binary embeddings produced by MIHash and the top competing method, HashNet, on ImageNet100. For ease of visualization, we randomly sample 10 classes from the test set.
MIHash produces binary embeddings that separate different classes well into separate clusters. This is in fact predictable from the formulation of MIHash, in which the class overlap is quantified via mutual information and minimized. On the other hand, binary codes generated by HashNet have higher overlap between classes. This is also consistent with the fact that HashNet does not specifically optimize for a criterion related to class overlap, but belongs to the simpler “affinity matching” family of approaches.
Example Retrieval Results
In Figure 5, we present example retrieval results for MIHash and HashNet for several image queries from the ImageNet100 dataset. The top 10 retrievals of eight query images from eight distinct categories are presented. Correct retrievals (i.e., having the same class label as the query) are marked in green, and incorrect retrievals are in red. In these examples, many of the retrieved images appear visually similar to the query, even if not sharing the same class label. Nevertheless, MIHash retrieves fewer incorrect images compared to HashNet. For example, HashNet returns bag images for the first query (image of cups), and digital-clock images for the second-to-last query (image of doormat).
6 Conclusion
We take an information-theoretic approach to hashing and propose a novel hashing method, called MIHash, in this work. It is based on minimizing neighborhood ambiguity in the learned Hamming space, which is crucial in maintaining high performance in nearest neighbor retrieval. We adopt the well-studied mutual information from information theory to quantify neighborhood ambiguity, and show that it has strong correlations with standard ranking-based retrieval performance metrics. Then, to optimize mutual information, we take advantage of recent advances in deep learning and stochastic optimization, and parameterize our embedding functions with deep neural networks. We perform a continuous relaxation of the NP-hard optimization problem, and use stochastic gradient descent to optimize the networks. In particular, our formulation maximally utilizes available supervision within each minibatch, and can be efficiently implemented. Our implementation is publicly available.
When evaluated on four standard image retrieval benchmarks, MIHash is shown to learn high-quality compact binary codes, and it achieves superior nearest neighbor retrieval performance compared to existing supervised hashing techniques. We believe that the mutual information based formulation is also potentially relevant for learning real-valued embeddings, and for other applications besides image retrieval, such as few-shot learning.
Acknowledgment
This research was supported in part by a BU IGNITION award, US NSF grant 1029430, and gifts from NVIDIA.
Fatih Cakir is a Data Scientist at FirstFuel Software. He was previously a member of the Image and Video Computing Group at Boston University, working with Professor Stan Sclaroff as his Ph.D. advisor. His research interests are in the fields of Computer Vision and Machine Learning.
Kun He is a Ph.D. candidate in Computer Science and a member of the Image and Video Computing group at Boston University, advised by Professor Stan Sclaroff. He obtained his M.Sc. degree in Computer Science from Boston University in 2013, and his B.Sc. degree (with honors) in Computer Science and Technology from Zhejiang University, Hangzhou, China, in 2010. He is a student member of the IEEE.
Sarah Adel Bargal is a Ph.D. candidate in the Image and Video Computing group in the Boston University Department of Computer Science. She received her M.Sc. from the American University in Cairo. Her research interests are in the areas of computer vision and deep learning. She is an IBM PhD Fellow and a Hariri Graduate Fellow.
Stan Sclaroff is a Professor in the Boston University Department of Computer Science. He received his Ph.D. from MIT in 1995. His research interests are in computer vision, pattern recognition, and machine learning. He is a Fellow of the IAPR and Fellow of the IEEE.
Footnotes
 Our MATLAB implementation of MIHash is publicly available at https://github.com/fcakir/mihash
References
[1] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. ACM Symposium on Theory of Computing (STOC), 2015.
[2] F. Cakir and S. Sclaroff. Adaptive hashing for fast similarity search. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2015.
[3] Z. Cao, M. Long, J. Wang, and P. S. Yu. HashNet: Deep learning to hash by continuation. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2017.
[4] M. A. Carreira-Perpinan and R. Raziperchikolaei. Hashing with binary autoencoders. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
[6] J. Cheng, C. Leng, J. Wu, H. Cui, and H. Lu. Fast and accurate image matching with cascade hashing for 3D reconstruction. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[7] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. ACM CIVR, 2009.
[8] B. Dai, R. Guo, S. Kumar, N. He, and L. Song. Stochastic generative hashing. In Proc. International Conf. on Machine Learning (ICML), 2017.
[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. Symposium on Computational Geometry (SCG), 2004.
[10] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
[12] K. Ding, C. Huo, B. Fan, and C. Pan. kNN hashing with factorized neighborhood representation. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2015.
[13] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2475–2483, 2015.
[14] L. Gao, J. Song, F. Zou, D. Zhang, and J. Shao. Scalable multimedia retrieval by deep learning hashing with relative similarity learning. In Proc. ACM International Conf. on Multimedia, 2015.
[15] T. Ge, K. He, and J. Sun. Graph cuts for supervised binary coding. In Proc. European Conf. on Computer Vision (ECCV), 2014.
[16] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. International Conf. on Very Large Data Bases (VLDB), 1999.
[17] Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.
[18] Y. Gong, M. Pawlowski, F. Yang, L. Brandy, L. Boundev, and R. Fergus. Web scale photo hash clustering on a single machine. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[19] J. Han and X. Jin. Locality sensitive hashing based clustering. In Encyclopedia of Machine Learning. Springer US, 2010.
[20] T. H. Haveliwala. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, 2000.
[21] K. He, F. Wen, and J. Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
[22] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2011.
[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. University of Toronto Technical Report, 2009.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2012.
[25] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. Advances in Neural Information Processing Systems (NIPS), 2009.
[26] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2009.
[27] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[28] W. J. Li, S. Wang, and W. C. Kang. Feature learning based deep supervised hashing with pairwise labels. In Proc. International Joint Conf. on Artificial Intelligence (IJCAI), 2016.
[29] X. Li, C. Shen, A. Dick, and A. van den Hengel. Learning compact binary codes for visual tracking. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
[30] G. Lin, F. Liu, C. Shen, J. Wu, and H. T. Shen. Structured learning of binary codes with column generation for optimizing ranking measures. International Journal of Computer Vision (IJCV), 2016.
[31] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[32] G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general two-step approach to learning-based hashing. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2013.
[33] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, 2015.
[34] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[35] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[36] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
[37] X. Liu, D. Tao, M. Song, Y. Ruan, C. Chen, and J. Bu. Weakly supervised multiclass video segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[38] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proc. International Conf. on Machine Learning (ICML), 2011.
[39] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision (IJCV), 2008.
[40] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
[41] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2003.
[42] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[43] J. Song, L. Gao, Y. Yan, D. Zhang, and N. Sebe. Supervised hashing with pseudo labels for scalable multimedia retrieval. In Proc. ACM International Conf. on Multimedia, 2015.
[44] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Trans. on Multimedia, 2013.
[45] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2012.
[46] E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 2252–2262, 2017.
[47] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 4170–4178, 2016.
[48] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 2008.
[49] A. L. C. Wang. An industrial-strength audio search algorithm. In Proc. International Conf. on Music Information Retrieval, 2003.
[50] J. Wang, S. Kumar, and S. F. Chang. Sequential projection learning for hashing with compact codes. In Proc. International Conf. on Machine Learning (ICML), 2010.
[51] J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2012.
[52] Q. Wang, B. Shen, S. Wang, L. Li, and L. Si. Binary codes embedding for fast image tagging with incomplete labels. In Proc. European Conf. on Computer Vision (ECCV), 2014.
[53] Q. Wang, Z. Zhang, and L. Si. Ranking preserving hashing for fast similarity search. In Proc. International Joint Conf. on Artificial Intelligence (IJCAI), 2015.
[54] X. Wang, Y. Shi, and K. M. Kitani. Deep supervised hashing with triplet labels. In Proc. Asian Conf. on Computer Vision (ACCV), 2016.
[55] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proc. Advances in Neural Information Processing Systems (NIPS), 2008.
[56] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In Proc. AAAI Conf. on Artificial Intelligence (AAAI), volume 1, page 2, 2014.
[57] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In Proc. ACM Conf. on Research & Development in Information Retrieval (SIGIR), 2007.
[58] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proc. ACM Conf. on Research & Development in Information Retrieval (SIGIR), 2010.
[59] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. on Image Processing, 2015.
[60] Z. Zhang, Y. Chen, and V. Saligrama. Efficient training of very deep neural networks for supervised hashing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[61] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[62] B. Zhuang, G. Lin, C. Shen, and I. Reid. Fast training of triplet-based deep binary embedding networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.