Stochastic Generative Hashing
Abstract
Learning-based binary hashing has become a powerful paradigm for fast search and retrieval in massive databases. However, because hash functions must produce discrete outputs, learning such functions is known to be very challenging. In addition, the objective functions adopted by existing hashing techniques are mostly chosen heuristically. In this paper, we propose a novel generative approach to learn hash functions through the Minimum Description Length principle such that the learned hash codes maximally compress the dataset and can also be used to regenerate the inputs. We also develop an efficient learning algorithm based on the stochastic distributional gradient, which avoids the notorious difficulty caused by binary output constraints, to jointly optimize the parameters of the hash function and the associated generative model. Extensive experiments on a variety of large-scale datasets show that the proposed method achieves better retrieval results than the existing state-of-the-art methods.
1 Introduction
Search for similar items in web-scale datasets is a fundamental step in a number of applications, especially in image and document retrieval. Formally, given a reference dataset $X = \{x_i\}_{i=1}^N$ with $x_i \in \mathbb{R}^d$, we want to retrieve similar items from $X$ for a given query $y$ according to some similarity measure $\mathrm{sim}(x, y)$. When the negative Euclidean distance is used, i.e., $\mathrm{sim}(x, y) = -\|x - y\|_2$, this corresponds to the $L_2$ Nearest Neighbor Search (L2NNS) problem; when the inner product is used, i.e., $\mathrm{sim}(x, y) = x^\top y$, it becomes a Maximum Inner Product Search (MIPS) problem. In this work, we focus on L2NNS for simplicity; however, our method handles MIPS problems as well, as shown in the supplementary material D. Brute-force linear search is expensive for large datasets. To alleviate the time and storage bottlenecks, two research directions have been studied extensively: (1) partition the dataset so that only a subset of data points is searched; (2) represent the data as codes so that similarity computation can be carried out more efficiently. The former often resorts to search-tree or bucket-based lookup, while the latter relies on binary hashing or quantization. These two groups of techniques are orthogonal and are typically employed together in practice.
In this work, we focus on speeding up search via binary hashing. Hashing for similarity search was popularized by influential works such as Locality Sensitive Hashing (Indyk and Motwani, 1998; Gionis et al., 1999; Charikar, 2002). The crux of binary hashing is to utilize a hash function, $h(\cdot): \mathbb{R}^d \to \{0, 1\}^l$, which maps the original samples in $\mathbb{R}^d$ to $l$-bit binary vectors while preserving the similarity measure, e.g., Euclidean distance or inner product. Search with such binary representations can be efficiently conducted using Hamming distance computation, which is supported via POPCNT on modern CPUs and GPUs. Quantization-based techniques (Babenko and Lempitsky, 2014; Jegou et al., 2011; Zhang et al., 2014b) have been shown to give stronger empirical results but tend to be less efficient than Hamming search over binary codes (Douze et al., 2015; He et al., 2013).
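The Hamming search described above can be sketched in a few lines. This is an illustrative sketch only, not the optimized POPCNT routine used in practice: the helper names `pack_bits`, `hamming` and `hamming_search` are hypothetical, codes are stored as plain Python integers, and the bit count is done in software.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into an int (first element = lowest bit)."""
    code = 0
    for i, b in enumerate(bits):
        if b:
            code |= 1 << i
    return code

def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")

def hamming_search(query, database, top_n):
    """Linear scan returning the indices of the top_n codes closest to query."""
    order = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    return order[:top_n]
```

On hardware with a population-count instruction, the XOR-and-count step maps to a handful of machine operations per code word, which is what makes linear scans over binary codes cheap.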
Data-dependent hash functions are well known to perform better than randomized ones (Wang et al., 2014). Learning hash functions or binary codes has been discussed in several papers, including spectral hashing (Weiss et al., 2009), semi-supervised hashing (Wang et al., 2010), iterative quantization (Gong and Lazebnik, 2011), and others (Liu et al., 2011; Gong et al., 2013; Yu et al., 2014; Shen et al., 2015; Guo et al., 2016). The main idea behind these works is to optimize some objective function that captures the preferred properties of the hash function in a supervised or unsupervised fashion.
Even though these methods have shown promising performance in several applications, they suffer from two main drawbacks: (1) the objective functions are often heuristically constructed without a principled characterization of goodness of hash codes, and (2) when optimizing, the binary constraints are crudely handled through some relaxation, leading to inferior results (Liu et al., 2014). In this work, we introduce Stochastic Generative Hashing (SGH) to address these two key issues. We propose a generative model which captures both the encoding of binary codes $h$ from input $x$ and the decoding of input $x$ from $h$. This provides a principled hash learning framework, where the hash function is learned by the Minimum Description Length (MDL) principle. Therefore, its generated codes can compress the dataset maximally. Such a generative model also enables us to optimize distributions over discrete hash codes without the necessity to handle discrete variables. Furthermore, we introduce a novel distributional stochastic gradient descent method which exploits distributional derivatives and generates higher quality hash codes. Prior work on binary autoencoders (Carreira-Perpiñán and Raziperchikolaei, 2015) also takes a generative view of hashing but still uses relaxation of binary constraints when optimizing the parameters, leading to inferior performance as shown in the experiment section. We also show that binary autoencoders can be seen as a special case of our formulation. In this work, we mainly focus on the unsupervised setting. (The proposed algorithm can be extended easily to supervised and semi-supervised settings, as described in the supplementary material E.)
2 Stochastic Generative Hashing (SGH)
We start by formalizing the two key issues that motivate the development of the proposed algorithm.
Generative view. Given an input $x$, most hashing works in the literature emphasize modeling the forward process of generating binary codes from input, i.e., $h(x)$, to ensure that the generated hash codes preserve the local neighborhood structure in the original space. Few works focus on modeling the reverse process of generating input from binary codes, so that the reconstructed input has small reconstruction error. In fact, the generative view provides a natural learning objective for hashing. Following this intuition, we model the process of generating $x$ from $h$, $p(x \mid h)$, and derive the corresponding hash function from the generative process. Our approach is not tied to any specific choice of $p(x \mid h)$ but can adapt to any generative model appropriate for the domain. In this work, we show that even a simple generative model (Section 2.1) already achieves state-of-the-art performance.
Binary constraints. The other issue arises from dealing with binary constraints. One popular approach is to relax the constraints from $\{0, 1\}$ to a continuous range (Weiss et al., 2009), but this often leads to a large optimality gap between the relaxed and non-relaxed objectives. Another approach is to enforce the model parameterization to have a particular structure so that when applying alternating optimization, the algorithm can alternate between updating the parameters and binarization efficiently. For example, Gong and Lazebnik (2011) and Gong et al. (2012) imposed an orthogonality constraint on the projection matrix, Yu et al. (2014) proposed to use circulant constraints, and Zhang et al. (2014a) introduced a Kronecker product structure. Although such constraints alleviate the difficulty with optimization, they substantially reduce the model flexibility. In contrast, we avoid such constraints and propose to optimize the distributions over the binary variables to avoid directly working with binary variables. This is attained by resorting to the stochastic neuron reparametrization (Section 2.4), which allows us to backpropagate through the layers of weights using the stochastic gradient estimator.
Unlike the approach of Carreira-Perpiñán and Raziperchikolaei (2015), which relies on solving expensive integer programs, our model is end-to-end trainable using distributional stochastic gradient descent (Section 3). Unlike iterative quantization (ITQ) (Gong and Lazebnik, 2011), our algorithm requires no alternating optimization steps. The training procedure is much more efficient, with guaranteed convergence, compared to the alternating optimization used for ITQ.
In the following sections, we first introduce the generative hashing model $p(x \mid h)$ in Section 2.1. Then, we describe the corresponding process of generating hash codes given input $x$, $q(h \mid x)$, in Section 2.2. Finally, we describe the training procedure based on the Minimum Description Length (MDL) principle and the stochastic neuron reparametrization in Sections 2.3 and 2.4. We also introduce the distributional stochastic gradient descent algorithm in Section 3.
2.1 Generative Model
Unlike most works, which start with the hash function $h(x)$, we first introduce a generative model that defines the likelihood of generating input $x$ given its binary code $h$, i.e., $p(x \mid h)$. It is also referred to as a decoding function. The corresponding hash codes are derived from an encoding function $q(h \mid x)$, described in Section 2.2.
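We use a simple Gaussian distribution to model the generation of $x$ given $h$:
$$p(x, h) = p(x \mid h)\, p(h), \quad \text{where } p(x \mid h) = \mathcal{N}(Uh, \rho^2 I), \tag{1}$$
and $U = [u_i]_{i=1}^l$, $u_i \in \mathbb{R}^d$, is a codebook with $l$ codewords. The prior $p(h) = \mathcal{B}(\theta) = \prod_{i=1}^l \theta_i^{h_i} (1 - \theta_i)^{1 - h_i}$ is modeled as the multivariate Bernoulli distribution on the hash codes, where $\theta = [\theta_i]_{i=1}^l \in [0, 1]^l$. Intuitively, this is an additive model which reconstructs $x$ by summing the selected columns of $U$ given $h$, with a Bernoulli prior on the distribution of hash codes. The joint distribution can be written as:
$$p(x, h) \propto \exp\left( -\frac{1}{2\rho^2} \left( x^\top x + h^\top U^\top U h - 2 x^\top U h \right) + \left( \log \frac{\theta}{1 - \theta} \right)^\top h \right). \tag{2}$$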
This generative model can be seen as a restricted form of general Markov Random Fields in the sense that the parameters for modeling correlation between latent variables $h$ and correlation between $x$ and $h$ are shared. However, it is more flexible than Gaussian Restricted Boltzmann machines (Krizhevsky, 2009; Marc’Aurelio and Geoffrey, 2010) due to an extra quadratic term for modeling correlation between latent variables. We first show that this generative model preserves the local neighborhood structure of the inputs $x$ when the Frobenius norm of $U$ is bounded.
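Proposition 1
If $\|U\|_F$ is bounded, then the Gaussian reconstruction error, $\|x - U h_x\|_2$, is a surrogate for Euclidean neighborhood preservation.
Proof Given two points $x, y \in \mathbb{R}^d$, their Euclidean distance is bounded by
$$\|x - y\|_2 = \|(x - U h_x) - (y - U h_y) + (U h_x - U h_y)\|_2 \le \|x - U h_x\|_2 + \|y - U h_y\|_2 + \|U\|_F \|h_x - h_y\|_2,$$
where $h_x$ and $h_y$ denote the binary latent variables corresponding to $x$ and $y$, respectively. Therefore, we have
$$\|x - y\|_2 - \|U\|_F \|h_x - h_y\|_2 \le \|x - U h_x\|_2 + \|y - U h_y\|_2,$$
which means minimizing the Gaussian reconstruction error, i.e., $-\log p(x \mid h)$, will lead to Euclidean neighborhood preservation.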
A similar argument can be made with respect to MIPS neighborhood preservation, as shown in the supplementary material D. Note that the choice of $p(x \mid h)$ is not unique, and any generative model that leads to neighborhood preservation can be used here. In fact, one can even use more sophisticated models with multiple layers and nonlinear functions. In our experiments, we find that complex generative models tend to perform similarly to the Gaussian model on datasets such as SIFT1M and GIST1M. Therefore, we use the Gaussian model for simplicity.
2.2 Encoding Model
Even with the simple Gaussian model (1), computing the posterior $p(h \mid x)$ is not tractable, and finding the MAP solution of the posterior involves solving an expensive integer programming subproblem. Inspired by the recent work on variational autoencoders (Kingma and Welling, 2013; Mnih and Gregor, 2014; Gregor et al., 2014), we propose to bypass these difficulties by parameterizing the encoding function as
$$q(h \mid x) = \prod_{k=1}^l q(h_k = 1 \mid x)^{h_k}\, q(h_k = 0 \mid x)^{1 - h_k}, \tag{3}$$
to approximate the exact posterior $p(h \mid x)$. With the linear parametrization, $h = [h_k]_{k=1}^l \sim \mathcal{B}(\sigma(W^\top x))$ with $W = [w_k]_{k=1}^l$. At the training step, a hash code is obtained by sampling from $\mathcal{B}(\sigma(W^\top x))$. At the inference step, it is still possible to sample $h$. More directly, the MAP solution of the encoding function (3) is readily given by
$$h(x) = \arg\max_h q(h \mid x) = \frac{\mathrm{sign}(W^\top x) + 1}{2}.$$
This involves only a linear projection followed by a sign operation, which is common in the hashing literature. Computing $h(x)$ in our model thus has the same amount of computation as ITQ (Gong and Lazebnik, 2011), except without the orthogonality constraints.
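As a rough illustration of the encoding step, the sketch below computes the per-bit probabilities and the MAP code from a linear projection. It assumes the projection matrix is stored as a list of weight vectors (one per bit) and that bits take values in {0, 1}; `sigmoid`, `encode_probs` and `encode_map` are hypothetical names and are not taken from the released code.

```python
import math

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def encode_probs(W, x):
    """Per-bit probabilities q(h_k = 1 | x) = sigmoid(w_k . x)."""
    return [sigmoid(sum(w * v for w, v in zip(w_k, x))) for w_k in W]

def encode_map(W, x):
    """MAP code: bit k is 1 exactly when q(h_k = 1 | x) > 1/2, i.e. w_k . x > 0."""
    return [1 if sum(w * v for w, v in zip(w_k, x)) > 0 else 0 for w_k in W]
```

Because the encoder factorizes over bits, the most probable code is obtained bit by bit, which is why the MAP solution reduces to a sign test on each projection.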
2.3 Training Objective
Since our goal is to reconstruct $x$ using the least information in binary codes, we train the variational autoencoder using the Minimum Description Length (MDL) principle, which finds the best parameters that maximally compress the training data. The MDL principle seeks to minimize the expected amount of information to communicate $x$:
$$L(x) = \sum_h q(h \mid x) \left( L(h) + L(x \mid h) \right),$$
where $L(h) = -\log p(h) + \log q(h \mid x)$ is the description length of the hashed representation $h$ and $L(x \mid h) = -\log p(x \mid h)$ is the description length of $x$ having already communicated $h$ (Hinton and Van Camp, 1993; Hinton and Zemel, 1994; Mnih and Gregor, 2014). By summing over all training examples $x$, we obtain the following training objective, which we wish to minimize with respect to the parameters of $p(x \mid h)$ and $q(h \mid x)$:
$$\min_{\Theta = \{W, U, \beta, \rho\}} H(\Theta) := \sum_x L(x; \Theta) = -\sum_x \sum_h q(h \mid x) \left( \log p(x, h) - \log q(h \mid x) \right), \tag{4}$$
where $U$, $\rho$ and $\beta := \log \frac{\theta}{1 - \theta}$ are parameters of the generative model $p(x, h)$ as defined in (1), and $W$ comes from the encoding function $q(h \mid x)$ defined in (3). This objective is sometimes called the Helmholtz (variational) free energy (Williams, 1980; Zellner, 1988; Dai et al., 2016). When the true posterior $p(h \mid x)$ falls into the family of (3), $q(h \mid x)$ becomes the true posterior $p(h \mid x)$, which leads to the shortest description length to represent $x$.
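To make the description-length objective concrete, the following sketch evaluates it exactly for one input by enumerating all codes, which is only feasible for a handful of bits. It assumes a Gaussian decoder with isotropic variance `rho ** 2`, a Bernoulli prior with parameters `theta`, and a factorized Bernoulli encoder with per-bit probabilities `probs` strictly inside (0, 1); the function names and the storage of the codebook as a list of codewords are illustrative choices, not the authors' implementation.

```python
import itertools
import math

def log_joint(x, h, U, theta, rho):
    """log p(x, h): Gaussian decoder N(sum_k h_k u_k, rho^2 I) times Bernoulli(theta) prior."""
    d = len(x)
    recon = [sum(h_k * u_k[j] for h_k, u_k in zip(h, U)) for j in range(d)]
    sq_err = sum((a - b) ** 2 for a, b in zip(x, recon))
    log_lik = -sq_err / (2 * rho ** 2) - 0.5 * d * math.log(2 * math.pi * rho ** 2)
    log_prior = sum(math.log(t) if h_k else math.log(1 - t) for h_k, t in zip(h, theta))
    return log_lik + log_prior

def log_q(h, probs):
    """log q(h | x) for a factorized Bernoulli encoder."""
    return sum(math.log(p) if h_k else math.log(1 - p) for h_k, p in zip(h, probs))

def description_length(x, probs, U, theta, rho):
    """Exact sum over all 2^l codes of q(h|x) * (log q(h|x) - log p(x, h))."""
    total = 0.0
    for h in itertools.product([0, 1], repeat=len(probs)):
        lq = log_q(h, probs)
        total += math.exp(lq) * (lq - log_joint(x, h, U, theta, rho))
    return total
```

The returned value is never smaller than $-\log p(x)$, and the two coincide when the encoder equals the exact posterior, which is the sense in which a well-matched encoder gives the shortest description.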
We emphasize that this objective no longer includes binary variables as parameters and therefore avoids optimizing with discrete variables directly. This paves the way for continuous optimization methods such as stochastic gradient descent (SGD) to be applied in training. As far as we are aware, this is the first time such a procedure has been used in the problem of unsupervised learning to hash. Our methodology serves as a viable alternative to the relaxation-based approaches commonly used in the past.
2.4 Reparametrization via Stochastic Neuron
Using the training objective of (4), we can directly compute the gradients w.r.t. the parameters of $p(x, h)$. However, we cannot compute the stochastic gradients w.r.t. $W$ because the objective depends on the stochastic binary variables $h$. In order to backpropagate through the stochastic nodes of $h$, two possible solutions have been proposed. The first is the reparametrization trick (Kingma and Welling, 2013), which works by introducing auxiliary noise variables in the model. However, it is difficult to apply when the stochastic variables are discrete, as is the case for $h$ in our model. The second is the family of gradient estimators based on the REINFORCE trick (Bengio et al., 2013), which suffer from high variance. Although some variance reduction remedies have been proposed (Mnih and Gregor, 2014; Gu et al., 2015), they are either biased or require complicated extra computation in practice.
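In the next section, we first provide an unbiased estimator of the gradient w.r.t. $W$ derived from the distributional derivative, and then derive a simple and efficient approximator. Before we derive the estimator, we first introduce the stochastic neuron for reparametrizing the Bernoulli distribution. A stochastic neuron reparameterizes each Bernoulli variable $h_k(z)$ with $z \in (0, 1)$. Introducing random variables $\xi \sim \mathcal{U}(0, 1)$, the stochastic neuron is defined as
$$\tilde{h}(z, \xi) := \begin{cases} 1 & \text{if } z \ge \xi, \\ 0 & \text{if } z < \xi. \end{cases} \tag{5}$$
Because $P(\tilde{h}(z, \xi) = 1) = z$, we have $\tilde{h}(z, \xi) \sim \mathcal{B}(z)$. We use the stochastic neuron (5) to reparameterize our binary variables $h$ by replacing $[h_k]_{k=1}^l(x) \sim \mathcal{B}(\sigma(w_k^\top x))$ with $[\tilde{h}_k(\sigma(w_k^\top x), \xi_k)]_{k=1}^l$. Note that $\tilde{h}$ now behaves deterministically given $\xi$. This gives us the reparameterized version of our original training objective (4):
$$\tilde{H}(\Theta) = \sum_x \tilde{H}(\Theta; x) := \sum_x \mathbb{E}_\xi \left[ \ell(\tilde{h}, x) \right], \tag{6}$$
where $\ell(\tilde{h}, x) := -\log p(x, \tilde{h}(\sigma(W^\top x), \xi)) + \log q(\tilde{h}(\sigma(W^\top x), \xi) \mid x)$ with $\xi \sim \mathcal{U}(0, 1)$. With such a reformulation, the new objective can now be optimized by exploiting the distributional stochastic gradient descent, which will be explained in the next section.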
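The stochastic neuron can be sketched as a threshold against uniform noise. The snippet below is a minimal illustration with hypothetical names (`stochastic_neuron`, `sample_code`); it uses a strict comparison, which differs from a non-strict one only on ties, and ties have probability zero for continuous noise.

```python
import random

def stochastic_neuron(z, xi):
    """Return 1 if z exceeds the noise xi, else 0.

    With xi drawn uniformly from [0, 1), the output is Bernoulli(z),
    and it is a deterministic function of z once xi is fixed.
    """
    return 1 if z > xi else 0

def sample_code(probs, rng):
    """Draw one binary code, using independent uniform noise for every bit."""
    return [stochastic_neuron(z, rng.random()) for z in probs]
```

A quick sanity check is that the empirical mean of `stochastic_neuron(z, rng.random())` over many draws approaches `z`.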
3 Distributional Stochastic Gradient Descent
For the objective in (6), given a point $x$ randomly sampled from $\{x_i\}_{i=1}^N$, the stochastic gradient $\hat{\nabla}_{U, \beta, \rho} \tilde{H}(\Theta; x)$ can be easily computed in the standard way. However, with the reparameterization, the function $\tilde{H}(\Theta; x)$ is no longer differentiable with respect to $W$ due to the discontinuity of the stochastic neuron $\tilde{h}(z, \xi)$. Hence, the SGD algorithm is not readily applicable. To overcome this difficulty, we adopt the notion of distributional derivative for generalized functions or distributions (Grubb, 2008).
3.1 Distributional derivative of Stochastic Neuron
Let $\Omega \subset \mathbb{R}^d$ be an open set. Denote by $\mathcal{C}_0^\infty(\Omega)$ the space of functions that are infinitely differentiable with compact support in $\Omega$. Let $\mathcal{D}'(\Omega)$ be the space of continuous linear functionals on $\mathcal{C}_0^\infty(\Omega)$, which can be considered as the dual space. The elements of $\mathcal{D}'(\Omega)$ are often called general distributions. We emphasize that this definition of distributions is more general than that of traditional probability distributions.
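Definition 2 (Distributional derivative)
(Grubb, 2008) Let $u \in \mathcal{D}'(\Omega)$, then a distribution $v$ is called the distributional derivative of $u$, denoted as $v = Du$, if it satisfies
$$\int_\Omega v\, \phi\, dx = -\int_\Omega u\, \partial\phi\, dx, \quad \forall \phi \in \mathcal{C}_0^\infty(\Omega).$$
It is straightforward to verify that for given $\xi$, the function $\tilde{h}(z, \xi) \in \mathcal{D}'(\Omega)$ and, moreover, $D_z \tilde{h}(z, \xi) = \delta_\xi(z)$, which is exactly the Dirac $\delta$ function. Based on the definition of distributional derivatives and chain rules, we are able to compute the distributional derivative of the function $\tilde{H}(\Theta; x)$, which is provided in the following lemma.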
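Lemma 3
For a given sample $x$, the distributional derivative of the function $\tilde{H}(\Theta; x)$ w.r.t. $W$ is given by
$$D_W \tilde{H}(\Theta; x) = \mathbb{E}_\xi \left[ \Delta_{\tilde{h}} \ell(\tilde{h}(\sigma(W^\top x), \xi)) \bullet \sigma(W^\top x) \bullet (1 - \sigma(W^\top x))\, x^\top \right], \tag{7}$$
where $\bullet$ denotes pointwise product and $\Delta_{\tilde{h}} \ell(\tilde{h})$ denotes the finite difference defined as $[\Delta_{\tilde{h}} \ell(\tilde{h})]_k = \ell(\tilde{h}_k^1) - \ell(\tilde{h}_k^0)$, where $[\tilde{h}_k^i]_j = \tilde{h}_j$ if $j \neq k$, otherwise $[\tilde{h}_k^i]_j = i$, $i \in \{0, 1\}$.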
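The finite-difference estimator of the lemma can be sketched for a single sample as follows. This is an illustrative single-draw version under stated assumptions: `loss` is any hypothetical callable mapping a binary code (a list of 0/1 values) to a scalar, the encoder weights are stored as a list of per-bit weight vectors, and the result is one row per bit holding the estimated derivative with respect to that bit's weights.

```python
import math
import random

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def finite_difference(loss, h):
    """For each bit k: loss with bit k forced to 1 minus loss with bit k forced to 0."""
    diffs = []
    for k in range(len(h)):
        h1 = h[:k] + [1] + h[k + 1:]
        h0 = h[:k] + [0] + h[k + 1:]
        diffs.append(loss(h1) - loss(h0))
    return diffs

def distributional_grad_W(loss, W, x, rng):
    """One-sample estimate: row k is delta_k * z_k * (1 - z_k) * x."""
    z = [sigmoid(sum(w * v for w, v in zip(w_k, x))) for w_k in W]
    h = [1 if z_k > rng.random() else 0 for z_k in z]  # stochastic neuron per bit
    delta = finite_difference(loss, h)
    return [[delta[k] * z[k] * (1.0 - z[k]) * v for v in x] for k in range(len(W))]
```

Each bit needs two evaluations of `loss`, which is the cost that motivates a cheaper approximation when the number of bits is large.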
We can therefore combine the distributional derivative estimator (7) with the stochastic gradient descent algorithm (see, e.g., Nemirovski et al. (2009) and its variants (Kingma and Ba, 2014; Bottou et al., 2016)), which we designate as Distributional SGD. The detail is presented in Algorithm 1, where we denote
$$\hat{D}_\Theta \tilde{H}(\Theta_i; x_i) = \left[ \hat{D}_W \tilde{H}(\Theta_i; x_i),\ \hat{\nabla}_{U, \beta, \rho} \tilde{H}(\Theta_i; x_i) \right] \tag{8}$$
as the unbiased stochastic estimator of the gradient at $\Theta_i$ constructed by sample $(x_i, \xi_i)$. Compared to the existing algorithms for learning to hash, which require substantial effort on optimizing over binary variables, the proposed distributional SGD is much simpler and also amenable to online settings (Huang et al., 2013; Leng et al., 2015).
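In general, the distributional derivative estimator (7) requires two forward passes of the model for each dimension. To further accelerate the computation, we approximate the distributional derivative $D_W \tilde{H}(\Theta; x)$ by exploiting the mean value theorem and Taylor expansion:
$$\tilde{D}_W \tilde{H}(\Theta; x) := \mathbb{E}_\xi \left[ \nabla_{\tilde{h}} \ell(\tilde{h}(\sigma(W^\top x), \xi)) \bullet \sigma(W^\top x) \bullet (1 - \sigma(W^\top x))\, x^\top \right], \tag{9}$$
which can be computed for each dimension in one pass. Then, we can exploit this estimator
$$\hat{\tilde{D}}_\Theta \tilde{H}(\Theta_i; x_i) = \left[ \hat{\tilde{D}}_W \tilde{H}(\Theta_i; x_i),\ \hat{\nabla}_{U, \beta, \rho} \tilde{H}(\Theta_i; x_i) \right] \tag{10}$$
in Algorithm 1. Interestingly, the approximate stochastic gradient estimator of the stochastic neuron we established through the distributional derivative coincides with the heuristic “pseudo-gradient” constructed by Raiko et al. (2014). Please refer to the supplementary material A for the derivation of the approximate gradient estimator (9).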
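A one-pass variant of the estimator can be sketched by replacing the per-bit finite difference with the gradient of the loss with respect to the code. In this sketch `grad_loss` is a hypothetical callable that returns that gradient as a list with one entry per bit; as before, weights are stored as per-bit vectors and a single noise draw is used.

```python
import math
import random

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def approx_grad_W(grad_loss, W, x, rng):
    """One-sample approximate estimate: row k is g_k * z_k * (1 - z_k) * x."""
    z = [sigmoid(sum(w * v for w, v in zip(w_k, x))) for w_k in W]
    h = [1 if z_k > rng.random() else 0 for z_k in z]  # stochastic neuron per bit
    g = grad_loss(h)  # a single evaluation covers all bits
    return [[g[k] * z[k] * (1.0 - z[k]) * v for v in x] for k in range(len(W))]
```

For a loss that is linear in the code, the gradient and the finite difference agree exactly, so the two estimators coincide; for other losses the gradient version is an approximation and is generally biased.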
3.2 Convergence of Distributional SGD
One caveat here is that, due to the potential discrepancy between the distributional derivative and the traditional gradient, whether the distributional derivative is still a descent direction and whether the SGD algorithm integrated with the distributional derivative converges remain unclear in general. However, for our learning to hash problem, one can easily show that the distributional derivative in (7) is indeed the true gradient.
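Proposition 4
The distributional derivative $D_W \tilde{H}(\Theta; x)$ is equivalent to the traditional gradient $\nabla_W H(\Theta; x)$.
Proof First of all, by definition, we have $\tilde{H}(\Theta; x) = H(\Theta; x)$. One can easily verify that under mild conditions, both $D_W \tilde{H}(\Theta; x)$ and $\nabla_W H(\Theta; x)$ are continuous and norm bounded. Hence, it suffices to show that for any distribution $u \in \mathcal{C}^1(\Omega)$ with $Du, \nabla u \in L_1(\Omega)$, $Du = \nabla u$. For any $\phi \in \mathcal{C}_0^\infty(\Omega)$, by definition of the distributional derivative, we have
$$\int_\Omega Du\, \phi\, dx = -\int_\Omega u\, \partial\phi\, dx.$$
On the other hand, we always have
$$\int_\Omega \nabla u\, \phi\, dx = -\int_\Omega u\, \partial\phi\, dx.$$
Hence, $\int_\Omega (Du - \nabla u)\, \phi\, dx = 0$ for all $\phi \in \mathcal{C}_0^\infty(\Omega)$. By the Du Bois-Reymond lemma (see Lemma 3.2 in (Grubb, 2008)), we have $Du = \nabla u$.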
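Consequently, the distributional SGD algorithm enjoys the same convergence property as the traditional SGD algorithm. Applying Theorem 2.1 in (Ghadimi and Lan, 2013), we arrive at
Theorem 5
Under the assumption that $H$ is $L$-Lipschitz smooth and the variance of the stochastic distributional gradient (8) is bounded by $\sigma^2$ in the distributional SGD, for the solution $\Theta_R$ sampled from the trajectory $\{\Theta_i\}_{i=1}^t$ with probability $P(R = i) = \frac{2\gamma_i - L\gamma_i^2}{\sum_{i=1}^t \left( 2\gamma_i - L\gamma_i^2 \right)}$ where $\gamma_i \sim O(1/\sqrt{t})$, we have
$$\mathbb{E}\left[ \left\| \nabla_\Theta \tilde{H}(\Theta_R) \right\|^2 \right] \sim O\left( \frac{1}{\sqrt{t}} \right).$$
In fact, with the approximate gradient estimators (9), the proposed algorithm also converges in terms of first-order conditions, i.e.,
Theorem 6
Under the assumption that the variance of the approximate stochastic distributional gradient (10) is bounded by $\sigma^2$, for the solution $\Theta_R$ sampled from the trajectory $\{\Theta_i\}_{i=1}^t$ with probability $P(R = i) = \frac{\gamma_i}{\sum_{i=1}^t \gamma_i}$ where $\gamma_i \sim O(1/\sqrt{t})$, we have
$$\mathbb{E}\left[ (\Theta_R - \Theta^*)^\top \tilde{\nabla}_\Theta \tilde{H}(\Theta_R) \right] \sim O\left( \frac{1}{\sqrt{t}} \right),$$
where $\Theta^*$ denotes the optimal solution.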
4 Connections
The proposed stochastic generative hashing is a general framework. In this section, we reveal the connection to several existing algorithms.
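Iterative Quantization (ITQ). If we fix some $\rho > 0$, and $U = WR$ where $W$ is formed by eigenvectors of the covariance matrix and $R$ is an orthogonal matrix, we have $U^\top U = I$. If we assume the joint distribution as
$$p(x, h) \propto \mathcal{N}(WRh, \rho^2 I)\, \mathcal{B}(\theta),$$
and parametrize $q(h \mid x_i) = \delta_{b_i}(h)$, then from the objective in (4) and ignoring the irrelevant terms, we obtain the optimization
$$\min_{R, b} \sum_i \left\| x_i - WR b_i \right\|^2, \tag{11}$$
which is exactly the objective of iterative quantization (Gong and Lazebnik, 2011).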
Binary Autoencoder (BA). If we use the deterministic linear encoding function, i.e., $q(h \mid x) = \delta_{\frac{1 + \mathrm{sign}(W^\top x)}{2}}(h)$, fix some $\rho$ and $\theta$ in advance, and ignore the irrelevant terms, the optimization (4) reduces to
$$\min_{U, W} \sum_x \left\| x - U h \right\|^2, \quad \text{s.t. } h = \frac{1 + \mathrm{sign}(W^\top x)}{2}, \tag{12}$$
which is the objective of a binary autoencoder (Carreira-Perpiñán and Raziperchikolaei, 2015).
In BA, the encoding procedure is deterministic; therefore, the entropy term $\mathbb{E}_{q(h \mid x)}[\log q(h \mid x)] = 0$. In fact, the entropy term, if nonzero, acts as a regularizer and helps to avoid wasting bits. Moreover, without the stochasticity, the optimization (12) becomes extremely difficult due to the binary constraints. In the proposed algorithm, by contrast, we exploit the stochasticity to bypass this difficulty in optimization. The stochasticity also enables us to accelerate the optimization, as shown in Section 5.2.
5 Experiments
In this section, we evaluate the performance of the proposed distributional SGD on commonly used datasets in hashing. For efficiency, we conduct the experiments mainly with the approximate gradient estimator (9). We evaluate the model and algorithm from several aspects to demonstrate the power of the proposed SGH: (1) Reconstruction loss. To demonstrate the flexibility of generative modeling, we compare the reconstruction error to that of ITQ (Gong and Lazebnik, 2011), showing the benefits of modeling without the orthogonality constraints. (2) Convergence of the distributional SGD. We evaluate the reconstruction error, showing that the proposed algorithm indeed converges, verifying the theorems. (3) Training time. The existing generative works require a significant amount of time for training the model. In contrast, our SGD algorithm is very fast to train, both in terms of the number of examples needed and the wall-clock time. (4) Nearest neighbor retrieval. We show Recall K@N plots on the standard large-scale nearest neighbor search benchmark datasets MNIST, SIFT1M, GIST1M and SIFT1B, on all of which we achieve state-of-the-art results among binary hashing methods. (5) Reconstruction visualization. Due to the generative nature of our model, we can regenerate the original input with very few bits. On MNIST and CIFAR-10, we qualitatively illustrate the templates that correspond to each bit and the resulting reconstruction.
We used several benchmark datasets: (1) MNIST, which contains 60,000 digit images of size $28 \times 28$ pixels; (2) CIFAR-10, which contains 60,000 $32 \times 32$ pixel color images in 10 classes; (3) SIFT1M and (4) SIFT1B, which contain $10^6$ and $10^9$ samples, respectively, each of which is a 128-dimensional vector; and (5) GIST1M, which contains $10^6$ samples, each of which is a 960-dimensional vector.
5.1 Reconstruction loss
Figure 1: (a) Reconstruction error. (b) Training time.
Table 1: Training time comparison.
Method | 8 bits | 16 bits | 32 bits | 64 bits
SGH | 28.32 | 29.38 | 37.28 | 55.03
ITQ | 92.82 | 121.73 | 173.65 | 259.13
Because our method has a generative model $p(x \mid h)$, we can easily compute the regenerated input $\tilde{x} = \arg\max_x p(x \mid h)$, and then compute the $L_2$ loss between the regenerated input and the original $x$, i.e., $\|x - \tilde{x}\|_2^2$. ITQ also trains by minimizing the binary quantization loss, as described in Equation (2) in (Gong and Lazebnik, 2011), which is essentially the $L_2$ reconstruction loss when the magnitude of the feature vectors is compatible with the radius of the binary cube. We plotted the $L_2$ reconstruction loss of our method and ITQ on SIFT1M in Figure 1(a) and on MNIST and GIST1M in Figure 4, where the x-axis indicates the number of examples seen by the training algorithm and the y-axis shows the average $L_2$ reconstruction loss. The training time comparison is listed in Table 1. Our method (SGH) arrives at a better reconstruction loss with comparable or even less time compared to ITQ. The lower reconstruction loss supports our claim that the flexibility of the proposed model afforded by removing the orthogonality constraints indeed brings extra modeling ability. Note that ITQ is generally regarded as a technique with fast training among the existing binary hashing algorithms, and most other algorithms (He et al., 2013; Heo et al., 2012; Carreira-Perpiñán and Raziperchikolaei, 2015) take much more time to train.
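The reconstruction measurement described above can be sketched as follows, assuming the additive decoder in which the regenerated input is the sum of the codewords selected by the hash code and the loss is the squared Euclidean distance. The names `decode` and `reconstruction_loss`, and the storage of the codebook as a list of codewords, are illustrative assumptions.

```python
def decode(U, h):
    """Regenerate an input by summing the codewords of U selected by the bits of h."""
    d = len(U[0])
    return [sum(h_k * u_k[j] for h_k, u_k in zip(h, U)) for j in range(d)]

def reconstruction_loss(x, x_hat):
    """Squared Euclidean distance between the original and the regenerated input."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

Averaging `reconstruction_loss` over a held-out set at regular intervals during training gives the kind of curve plotted against the number of examples seen.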
5.2 Empirical study of Distributional SGD
We demonstrate the convergence of the distributional derivative with Adam (Kingma and Ba, 2014) numerically on SIFT1M, GIST1M and MNIST from 8 bits to 64 bits. The convergence curves on SIFT1M are shown in Figure 1(a). The results on GIST1M and MNIST are similar and are shown in Figure 4 in the supplementary material C. The proposed algorithm, even with a biased gradient estimator, converges quickly regardless of how many bits are used. As expected, with more bits the model fits the data better and the reconstruction error can be reduced further.
In line with expectation, our distributional SGD trains much faster since it bypasses integer programming. We benchmark the actual time taken to train our method to convergence and compare that to binary autoencoder hashing (BA) (Carreira-Perpiñán and Raziperchikolaei, 2015) on SIFT1M, GIST1M and MNIST. We illustrate the performance on SIFT1M in Figure 1(b). The results on the GIST1M and MNIST datasets follow a similar trend, as shown in the supplementary material C. Empirically, BA takes significantly more time to train on all bit settings due to the expensive cost of solving the integer programming subproblem. Our experiments were run on AMD 2.4GHz Opteron CPUs with 32 GB of memory. Our implementation of the stochastic neuron as well as the whole training procedure was done in TensorFlow. We have released our code on GitHub (https://github.com/doubling/Stochastic_Generative_Hashing). For the competing methods, we directly used the code released by the authors.
5.3 Large scale nearest neighbor retrieval
We compared the stochastic generative hashing on an L2NNS task with several state-of-the-art unsupervised algorithms, including K-means hashing (KMH) (He et al., 2013), iterative quantization (ITQ) (Gong and Lazebnik, 2011), spectral hashing (SH) (Weiss et al., 2009), spherical hashing (SpH) (Heo et al., 2012), binary autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei, 2015), and scalable graph hashing (GH) (Jiang and Li, 2015). We demonstrate the performance of our binary codes through standard benchmark experiments of Approximate Nearest Neighbor (ANN) search, comparing the retrieval recall. In particular, we compare with other unsupervised techniques that also generate binary codes. For each query, linear search in Hamming space is conducted to find the approximate neighbors.
Following the experimental setting of (He et al., 2013), we plot the Recall10@N curve for the MNIST, SIFT1M, GIST1M, and SIFT1B datasets under varying numbers of bits (16, 32 and 64) in Figure 2. On the SIFT1B dataset, we only compared with ITQ since the training cost of the other competitors is prohibitive. The recall is defined as the fraction of retrieved true nearest neighbors to the total number of true nearest neighbors. The Recall10@N is the recall of 10 ground truth neighbors in the N retrieved samples. Note that Recall10@N is generally a more challenging criterion than Recall@N (which is essentially Recall1@N), and better characterizes the retrieval results. For completeness, results of various Recall K@N curves can be found in the supplementary material; they show a similar trend to the Recall10@N curves.
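The Recall K@N metric defined above can be computed per query as in the sketch below. The function name `recall_k_at_n` and the argument layout (ground-truth neighbor ids sorted by true distance, retrieved ids sorted by Hamming distance) are assumptions made for illustration.

```python
def recall_k_at_n(true_neighbors, retrieved, k, n):
    """Fraction of the k true nearest neighbors found among the first n retrieved items."""
    top_true = set(true_neighbors[:k])
    hits = top_true.intersection(retrieved[:n])
    return len(hits) / k
```

A curve is obtained by averaging this value over all queries for each N; setting `k=1` gives the plain Recall@N mentioned in the text.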



Figure 2 shows that the proposed SGH consistently performs the best across all bit settings and all datasets. The search time is the same for the same number of bits, because all algorithms use the same optimized implementation of POPCNT-based Hamming distance computation and priority queue. We point out that many of the baselines need significant parameter tuning for each experiment to achieve a reasonable recall, except for ITQ and our method. For our method, we fix the hyperparameters (batch size and learning rate, with step-size decay) across all experiments, which indicates that it is less sensitive to hyperparameters.
5.4 Visualization of reconstruction


One important aspect of utilizing a generative model for a hash function is that one can generate the input from its hash code. When the inputs are images, this corresponds to image generation, which allows us to visually inspect what the hash bits encode, as well as the differences in the original and generated images.
In our experiments on MNIST and CIFAR-10, we first visualize the “template” which corresponds to each hash bit, i.e., each column of the decoding dictionary $U$. This gives an interesting insight into what each hash bit represents. Unlike PCA components, where the top few look like averaged images and the rest are high-frequency noise, each of our image templates encodes distinct information and looks much like the filter banks of convolutional neural networks. Empirically, each template also looks quite different and encodes somewhat meaningful information, indicating that no bits are wasted or duplicated. Note that we obtain this representation as a by-product, without explicitly setting up the model with supervised information, similar to the case in convolutional neural nets.
We also compare the reconstruction ability of SGH with that of ITQ and real-valued PCA in Figure 3. For ITQ and SGH, we use a 64-bit hash code. For PCA, we kept 64 components, which amounts to $64 \times 32 = 2048$ bits. Compared visually with SGH, the images reconstructed by ITQ look much less recognizable on MNIST and much blurrier on CIFAR-10. Compared to PCA, SGH achieves similar visual quality while using a significantly lower ($32\times$ less) number of bits.
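The storage comparison behind this statement can be written out as a short worked calculation. It assumes that each of the 64 PCA coefficients is stored as a 32-bit floating-point number and that the hash code has 64 bits; both figures are assumptions stated here for the arithmetic.

```latex
\[
  \underbrace{64}_{\text{PCA components}} \times \underbrace{32}_{\text{bits per float}} = 2048 \text{ bits},
  \qquad
  \frac{2048 \text{ bits (PCA)}}{64 \text{ bits (hash code)}} = 32 .
\]
```

Under these assumptions the binary code is 32 times smaller than the real-valued PCA representation.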
6 Conclusion
In this paper, we have proposed a novel generative approach to learn binary hash functions. We have justified from a theoretical angle that the proposed algorithm is able to provide a good hash function that preserves Euclidean neighborhoods, while achieving fast learning and retrieval. Extensive experimental results justify the flexibility of our model, especially in reconstructing the input from the hash codes. Comparisons with approximate nearest neighbor search over several benchmarks demonstrate the advantage of the proposed algorithm empirically. We emphasize that the proposed generative hashing is a general framework which can be extended to semi-supervised settings and other learning to hash scenarios, as detailed in the supplementary material. Moreover, the proposed distributional SGD with the unbiased gradient estimator and its approximator can be applied to general integer programming problems, which may be of independent interest.
Acknowledgements
LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, ONR N00014-15-1-2340, NVIDIA, Intel and Amazon AWS.
References
 Babenko and Lempitsky (2014) Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Bottou et al. (2016) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
 Carreira-Perpiñán and Raziperchikolaei (2015) Miguel A Carreira-Perpiñán and Ramin Raziperchikolaei. Hashing with binary autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 557–566, 2015.
 Charikar (2002) Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388, 2002.
 Dai et al. (2016) Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 985–994, 2016.
 Douze et al. (2015) Matthijs Douze, Hervé Jégou, and Florent Perronnin. Polysemous codes. In European Conference on Computer Vision, 2016.
 Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Gionis et al. (1999) Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
 Gong and Lazebnik (2011) Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 817–824. IEEE, 2011.
 Gong et al. (2012) Yunchao Gong, Sanjiv Kumar, Vishal Verma, and Svetlana Lazebnik. Angular quantization-based binary codes for fast similarity search. In Advances in neural information processing systems, 2012.
 Gong et al. (2013) Yunchao Gong, Sanjiv Kumar, Henry A Rowley, and Svetlana Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 484–491, 2013.
 Gregor et al. (2014) Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In Proceedings of The 31st International Conference on Machine Learning, pages 1242–1250, 2014.
 Grubb (2008) Gerd Grubb. Distributions and operators, volume 252. Springer Science & Business Media, 2008.
 Gu et al. (2015) Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. Muprop: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.
 Guo et al. (2016) Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. Quantization based fast inner product search. 19th International Conference on Artificial Intelligence and Statistics, 2016.
 He et al. (2013) Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2938–2945, 2013.
 Heo et al. (2012) Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon. Spherical hashing. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2957–2964. IEEE, 2012.
 Hinton and Van Camp (1993) Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.
 Hinton and Zemel (1994) Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and helmholtz free energy. In Advances in Neural Information Processing Systems, pages 3–10, 1994.
 Huang et al. (2013) Long-Kai Huang, Qiang Yang, and Wei-Shi Zheng. Online hashing. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages 1422–1428. AAAI Press, 2013.
 Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
 Jegou et al. (2011) Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.
 Jiang and Li (2015) Qing-Yuan Jiang and Wu-Jun Li. Scalable graph hashing with feature transformation. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 Leng et al. (2015) Cong Leng, Jiaxiang Wu, Jian Cheng, Xiao Bai, and Hanqing Lu. Online sketching hashing. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2503–2511. IEEE, 2015.
 Liu et al. (2011) Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 1–8, 2011.
 Liu et al. (2014) Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. Discrete graph hashing. In Advances in Neural Information Processing Systems (NIPS), 2014.
 Marc’Aurelio and Geoffrey (2010) Marc’Aurelio Ranzato and Geoffrey E Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2551–2558. IEEE, 2010.
 Mnih and Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
 Nemirovski et al. (2009) Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
 Raiko et al. (2014) Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
 Shen et al. (2015) Fumin Shen, Wei Liu, Shaoting Zhang, Yang Yang, and Heng Tao Shen. Learning binary codes for maximum inner product search. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4148–4156. IEEE, 2015.
 Wang et al. (2010) Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2010.
 Wang et al. (2014) Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014.
 Weiss et al. (2009) Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances in neural information processing systems, pages 1753–1760, 2009.
 Williams (1980) P. M. Williams. Bayesian conditionalisation and the principle of minimum information. British Journal for the Philosophy of Science, 31(2):131–144, 1980.
 Yu et al. (2014) Felix X Yu, Sanjiv Kumar, Yunchao Gong, and Shih-Fu Chang. Circulant binary embedding. In International conference on machine learning, volume 6, page 7, 2014.
 Zellner (1988) Arnold Zellner. Optimal Information Processing and Bayes’s Theorem. The American Statistician, 42(4), November 1988.
 Zhang et al. (2014a) Peichao Zhang, Wei Zhang, Wu-Jun Li, and Minyi Guo. Supervised hashing with latent factor models. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 173–182. ACM, 2014a.
 Zhang et al. (2014b) Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approximate nearest neighbor search. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 838–846, 2014b.
 Zhu et al. (2016) Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Supplementary Material
Appendix A Distributional Derivative of Stochastic Neuron
Before proving Lemma 3, we first introduce the chain rule for distributional derivatives.
Lemma 7
(Grubb, 2008) Let , we have

(Chain Rule I) The distributional derivative of for any is given by .

(Chain Rule II) The distributional derivative of for any with bounded is given by .
Proof of Lemma 3. Without loss of generality, we first consider the one-dimensional case. Given , , . For , we have
where the last equation comes from . We obtain
We generalize the conclusion to the multi-dimensional case with expectation over , i.e., ; we have the partial distributional derivative for the th coordinate as
Therefore, we have the distributional derivative w.r.t. as
chain rule I  
To derive the approximation of the distributional derivative, we exploit the mean value theorem and Taylor expansion. Specifically, for a continuous and differentiable loss function , there exists
Moreover, for general smooth functions, we rewrite the by Taylor expansion, i.e.,
we have an approximator as
(13) 
Plugging into the distributional derivative estimator (7), we obtain a simple biased gradient estimator,
(14) 
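The finite-difference form of the distributional derivative discussed in this appendix can be checked numerically on a toy problem. The following Python sketch is illustrative only and rests on assumptions that cannot be read off the text as printed here: the stochastic neuron is taken to be h_k = 1 if z_k > xi_k and 0 otherwise, with xi_k drawn uniformly from (0, 1) and z = sigmoid(W^T x); the unbiased estimator is taken to weight the per-bit finite difference of the loss by z_k (1 - z_k) x; the biased variant replaces that finite difference with the gradient of the loss. All function names and the toy loss are hypothetical. For a small number of bits, the sketch enumerates every code and compares the exact expectation of the estimator with a numerical gradient of the expected loss.

```python
import itertools
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def activations(W, x):
    """z_k = sigmoid(sum_i W[i][k] * x[i]) for each hash bit k."""
    return [sigmoid(sum(W[i][k] * x[i] for i in range(len(x))))
            for k in range(len(W[0]))]


def stochastic_neuron(z, rng):
    """Assumed form: h_k = 1 if z_k > xi_k else 0, with xi_k ~ Uniform(0, 1)."""
    return tuple(1 if z_k > rng.random() else 0 for z_k in z)


def delta_loss(h, loss):
    """Per-bit finite difference loss(h | h_k = 1) - loss(h | h_k = 0)."""
    return [loss(h[:k] + (1,) + h[k + 1:]) - loss(h[:k] + (0,) + h[k + 1:])
            for k in range(len(h))]


def scale_by_jacobian(v, z, x):
    """d x l matrix with entries v_k * z_k * (1 - z_k) * x_i."""
    return [[v[k] * z[k] * (1.0 - z[k]) * x_i for k in range(len(z))]
            for x_i in x]


def grad_sample(W, x, loss, rng):
    """One-sample finite-difference (unbiased) gradient estimate w.r.t. W."""
    z = activations(W, x)
    return scale_by_jacobian(delta_loss(stochastic_neuron(z, rng), loss), z, x)


def approx_grad_sample(W, x, grad_loss, rng):
    """Biased variant: the finite difference is replaced by the loss gradient."""
    z = activations(W, x)
    return scale_by_jacobian(grad_loss(stochastic_neuron(z, rng)), z, x)


def code_prob(z, h):
    """Probability of code h under independent Bernoulli(z_k) bits."""
    return math.prod(z_k if h_k else 1.0 - z_k for z_k, h_k in zip(z, h))


def expected_loss(W, x, loss):
    """Exact E_h[loss(h)] by enumerating all 2^l codes (small l only)."""
    z = activations(W, x)
    return sum(code_prob(z, h) * loss(h)
               for h in itertools.product((0, 1), repeat=len(z)))


def expected_grad(W, x, loss):
    """Exact expectation of grad_sample, again by enumeration."""
    z = activations(W, x)
    grad = [[0.0] * len(z) for _ in x]
    for h in itertools.product((0, 1), repeat=len(z)):
        term = scale_by_jacobian(delta_loss(h, loss), z, x)
        p = code_prob(z, h)
        for i in range(len(x)):
            for k in range(len(z)):
                grad[i][k] += p * term[i][k]
    return grad


def numerical_grad(W, x, loss, eps=1e-6):
    """Central finite differences of expected_loss, used only as a check."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for k in range(len(W[0])):
            W_plus = [row[:] for row in W]
            W_minus = [row[:] for row in W]
            W_plus[i][k] += eps
            W_minus[i][k] -= eps
            grad[i][k] = (expected_loss(W_plus, x, loss)
                          - expected_loss(W_minus, x, loss)) / (2 * eps)
    return grad


def toy_loss(h):
    """Hypothetical non-linear loss over a 3-bit code."""
    return (1.5 * h[0] - 0.7 * h[1] + 0.4 * h[2] - 1.0) ** 2 + h[0] * h[2]


def toy_grad(h):
    """Gradient of toy_loss, treating h as continuous."""
    s = 1.5 * h[0] - 0.7 * h[1] + 0.4 * h[2] - 1.0
    return [3.0 * s + h[2], -1.4 * s, 0.8 * s + h[0]]
```

Here W is a d x l list of rows and x a length-d list. Under the stated assumptions, `expected_grad` should agree with `numerical_grad` up to finite-difference error, which is the sense in which the finite-difference estimator is unbiased; `approx_grad_sample` carries no such guarantee.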
Appendix B Convergence of Distributional SGD
Lemma 8
Proof of Theorem 5. Lemma 8 implies that by randomly sampling a search point with probability where from trajectory , we have
Lemma 9
Under the assumption that the variance of the approximate stochastic distributional gradient (10) is bounded by , the proposed distributional SGD outputs such that
where denotes the optimal solution.
Proof Denoting the optimal solution as , we have
Taking expectation on both sides and denoting , we have
Therefore,
Appendix C More Experiments
C.1 Convergence of Distributional SGD and Reconstruction Error Comparison
Figure 4: Reconstruction error comparison between ITQ and SGH. (a) MNIST (b) GIST1M
We show the reconstruction error comparison between ITQ and SGH on MNIST and GIST1M in Figure 4. The results are similar to those on SIFT1M. Because SGH optimizes a more expressive objective than ITQ (without orthogonality) and does not use alternating optimization, it finds better solutions with lower reconstruction error.
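For concreteness, the reconstruction error being compared can be computed as in the short Python sketch below. It assumes that an input x is reconstructed linearly from its binary code h as U h (the mean of a Gaussian reconstruction model with dictionary U) and that the reported quantity is the squared L2 error averaged over the dataset; the function names are hypothetical and the exact normalization used in the figures may differ.

```python
def reconstruct(U, h):
    """Return U h for a d x l dictionary U (list of rows) and an l-bit code h."""
    return [sum(u_ik * h_k for u_ik, h_k in zip(row, h)) for row in U]


def reconstruction_error(X, H, U):
    """Mean squared L2 error between each input x and its reconstruction U h."""
    total = 0.0
    for x, h in zip(X, H):
        x_hat = reconstruct(U, h)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(X)
```

The same function applies to any method that produces binary codes H together with a linear decoder U, which is what makes it usable as a common yardstick across methods.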
C.2 Training Time Comparison
Figure 5: Training time comparison between BA and SGH. (a) MNIST (b) GIST1M
We show the training time comparison between BA and SGH on MNIST and GIST1M in Figure 5. The results are similar to those on SIFT1M. The proposed distributional SGD learns the model much faster.
C.3 More Evaluation on L2NNS Retrieval Tasks
We also use different Recall K@N measures to evaluate the performance of our algorithm and the competitors. We first evaluate the algorithms with Recall 1@N in Figure 6. This is an easier task compared to Recall 10@N. Under this measure, the proposed SGH still achieves state-of-the-art performance.
In Figure 7, we set and plot the recall by varying the length of the bits on MNIST, SIFT1M, and GIST1M. This shows the effect of code length on the different baselines. Similar to Recall 10@N, the proposed algorithm still consistently achieves state-of-the-art performance under this evaluation measure.
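The Recall K@N measure can be sketched in Python as follows. The sketch assumes the usual reading of the measure: the ground truth is the set of K nearest neighbors of the query under Euclidean distance, candidates are ranked by Hamming distance between binary codes, and the score is the fraction of the K true neighbors found among the top N candidates. The function names are hypothetical, and ties are broken by database index, which may differ from the evaluation code used for the figures.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(u != v for u, v in zip(a, b))


def recall_k_at_n(query, query_code, database, database_codes, k, n):
    """Fraction of the k true L2 neighbors found in the top-n Hamming ranking."""
    indices = range(len(database))
    truth = sorted(indices, key=lambda i: squared_l2(query, database[i]))[:k]
    retrieved = sorted(
        indices, key=lambda i: hamming(query_code, database_codes[i]))[:n]
    return len(set(truth) & set(retrieved)) / k
```

Averaging `recall_k_at_n` over all queries, for a range of N, gives one recall curve per method and code length.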
(a) L2NNS on MNIST  (b) L2NNS on SIFT1M  (c) L2NNS on GIST1M 
Appendix D Stochastic Generative Hashing For Maximum Inner Product Search
In the Maximum Inner Product Search (MIPS) problem, we evaluate similarity in terms of the inner product, which can avoid the scaling issue, i.e., the length of the samples in the reference dataset and the queries may vary. The proposed model can also be applied to the MIPS problem. In fact, the Gaussian reconstruction model also preserves inner product neighborhoods. Denoting the asymmetric inner product as , we claim
Proposition 10
The Gaussian reconstruction error is a surrogate for asymmetric inner product preservation.
Proof We evaluate the difference between inner product and the asymmetric inner product,
which means that minimizing the Gaussian reconstruction error, i.e., , will also lead to asymmetric inner product preservation.
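The link between reconstruction error and inner product preservation can be illustrated numerically. The Python sketch below assumes that the asymmetric inner product compares an uncompressed query x with the reconstruction U h_y of a database item y; the symbol is missing from the text as printed, so this form is an assumption, and the function names are hypothetical. Under that assumption, the Cauchy-Schwarz inequality bounds the gap |<x, y> - <x, U h_y>| by ||x|| times the reconstruction residual ||y - U h_y||.

```python
import math


def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(u * v for u, v in zip(a, b))


def reconstruct(U, h):
    """Return U h for a d x l dictionary U (list of rows) and an l-bit code h."""
    return [sum(u_ik * h_k for u_ik, h_k in zip(row, h)) for row in U]


def inner_product_gap(x, y, U, h_y):
    """Return (gap, bound): gap = |<x, y> - <x, U h_y>|,
    bound = ||x|| * ||y - U h_y|| from the Cauchy-Schwarz inequality."""
    y_hat = reconstruct(U, h_y)
    gap = abs(dot(x, y) - dot(x, y_hat))
    residual = [a - b for a, b in zip(y, y_hat)]
    bound = math.sqrt(dot(x, x)) * math.sqrt(dot(residual, residual))
    return gap, bound
```

The bound is tight when the query is parallel to the residual, so reducing the residual norm is the only way to shrink the worst-case gap for a query of fixed length.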
We emphasize that our method is designed primarily for hashing problems. Although it can be used for the MIPS problem, it differs from product quantization and its variants, whose distances are computed from lookup tables. The proposed distributional SGD can be extended to quantization. This is beyond the scope of this paper, and we leave it as future work.
D.1 MIPS Retrieval Comparison
To evaluate the performance of the proposed SGH on the MIPS problem, we tested the algorithm on the WORD2VEC dataset. Besides the hashing baselines, since KMH is the Hamming distance generalization of PQ, we replace KMH with product quantization (Jegou et al., 2011). We trained SGH with 71,291 samples and evaluated the performance with 10,000 queries. As before, we vary the length of the binary codes from , to , and evaluate the performance by Recall 10@N. We computed the ground truth via retrieval with the original inner product. The results are illustrated in Figure 8. The proposed algorithm outperforms the competitors significantly, demonstrating that the proposed SGH is also applicable to the MIPS task.
Appendix E Generalization
We generalize the basic model to a translation and scale invariant extension, a semi-supervised extension, and coding with .
E.1 Translation and Scale Invariant Reduced-MRFs
As is well known, the data may not be zero-mean, and the scale of each sample in the dataset can differ substantially. To eliminate the translation and scale effects, we extend the basic model to translation and scale invariant reduced-MRFs by introducing a parameter to separate the translation effect and a latent variable to model the scale effect in each sample ; the potential function therefore becomes
(15) 
where denotes the element-wise product, and . Compared to (2), we replace with so that the translation and scale effects in both dimension and sample are modeled explicitly.
We treat as parameters and as a latent variable. Assuming independence in the posterior for computational efficiency, we approximate the posterior with , where denotes the parameters of the posterior approximation. With a similar derivation, we obtain the learning objective as
(16) 
Obviously, the proposed distributional SGD is still applicable to this optimization.
E.2 Semi-supervised Extension
Although we only focus on learning the hash function in the unsupervised setting, the proposed model can easily be extended to exploit supervision information by introducing a pairwise model, e.g., (Zhang et al., 2014a; Zhu et al., 2016). Specifically, we are provided with (partial) supervision information for some pairs of data, i.e., , where
and stands for the set of nearest neighbors of . Besides the original Gaussian reconstruction model in the basic model in (2), we introduce the pairwise model into the framework, which results in the joint distribution over as
where is an indicator that outputs when , otherwise . Plugging the extended model into the Helmholtz free energy, we obtain the learning objective as
Obviously, the proposed distributional SGD is still applicable to the semi-supervised extension.
E.3 Binary Coding
In the main text, we mainly focus on coding with . In fact, the proposed model is applicable to coding with with minor modification. Moreover, the proposed distributional SGD is still applicable. We only discuss the basic model here, the model can also be extended to scaleinvariant a