Global Hashing System for Fast Image Search


Dayong Tian and Dacheng Tao are with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway Street, Ultimo, NSW 2007, Australia. Corresponding author: D. Tian. ©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Hashing methods have been widely investigated for fast approximate nearest neighbor searching in large datasets. Most existing methods use binary vectors in lower-dimensional spaces to represent data points, which are usually real vectors of higher dimensionality. However, according to Shannon's Source Coding Theorem (SSCT) in information theory, it is logical to represent low-dimensional real vectors with high-dimensional binary vectors, since a binary bit contains less information than a real number. We design a novel hashing method based on this principle. Data points are first embedded in a low-dimensional space, and then the Global Positioning System (GPS) method is introduced but modified for hashing. We devise data-independent and data-dependent methods to distribute the "satellites" at appropriate locations. Benefiting from the rationale of SSCT and the rules for distributing satellites in a GPS, our data-dependent method outperforms other methods on datasets of different sizes, from 100K to 10M. By incorporating the orthogonality of the code matrix, both our data-independent and data-dependent methods are particularly impressive in experiments on longer bits.

Index Terms: Hashing, image retrieval, Global Positioning System.

I Introduction

Hashing methods are efficient for approximate nearest neighbor (ANN) searching, which is important in computer vision  [7][42][47][40] and machine learning [24][29][36][43]. Hashing methods map original input data points to binary hash codes while preserving their mutual distances; that is, the binary strings of similar data points in the original feature space should have low Hamming distances. Hashing with short codes can substantially reduce storage requirements and boost the ANN searching speed.
Popular hashing methods can be categorized into two groups according to their dependence on data. The best-known data-independent hashing methods are Locality-Sensitive Hashing (LSH) [2] and its variants, e.g., those adopting cosine similarity [6] and kernel similarity [23]. The main drawback of these methods is that randomized hashing demands more bits per hash table [35].
Data-dependent methods have become popular in the machine learning community. Spectral Hashing (SH) [44], one of the most popular data-dependent methods, generates hashing codes by solving a relaxed optimization problem, thereby circumventing both the computation of pairwise distances over the whole dataset, i.e., the affinity matrix, and the constraints that lead to an NP-hard problem. Anchor Graph Hashing (AGH) [28] optimizes the objective function of SH by using anchor points to construct a highly sparse affinity matrix. Discrete Graph Hashing (DGH) [27] follows this idea and incorporates the orthogonality of the hashing code matrix. There are also methods based on linear projections obtained by Principal Component Analysis (PCA) [46][20][19] or Linear Discriminant Analysis [37], and those hashing in kernel space, such as binary reconstructive embeddings (BRE) [22], random maximum margin hashing (RMMH) [18] and kernel-based supervised hashing (KSH) [42]. Unlike Iterative Quantization (ITQ) [46], which rotates the projection matrix obtained by PCA to minimize the loss function, Neighborhood Discriminant Hashing (NDH) [39] incorporates the computation of the projection matrix into the minimization procedure. In general, linear dimensionality-reduction techniques such as PCA are inferior to nonlinear manifold learning methods, which can more effectively preserve the local structure of the input data without assuming global linearity [38]. However, nonlinear manifold techniques may be intractable for large datasets because of their high computational costs. To address this problem, Inductive Manifold Hashing (IMH) [35][33] learns the nonlinear manifold on a small subset and inductively inserts the remainder of the data. In addition, hashing methods that focus on image representations have been developed recently. For example, Zhang et al. [49] unify feature extraction and hashing-function learning, while Zhang et al. [48] and Liu et al. [26] develop their methods on multiple representations.
However, the main theoretical deficit of the data-dependent methods is that they fail to conform to Shannon's Source Coding Theorem (SSCT) [11]. In practice, an image in the dataset is usually represented by a descriptor, e.g., a SIFT [30] or GIST [32] descriptor with more than 128 dimensions of 8-bit characters or 32-bit single-precision real numbers in a computer. In information theory [11], entropy is the average amount of information contained in a message, which, in this context, refers to a descriptor vector or binary code vector. According to SSCT, the code length should be no less than the Shannon entropy of the original data points. Without ambiguity, entropy refers to Shannon entropy in this paper. The entropy is defined as H(X) = -sum_i p(x_i) log2 p(x_i), where X is a random variable and p(x_i) is the probability of X = x_i. For instance, by assuming a uniform distribution, the entropy of a 64-dimensional 8-bit character vector is 512 bits, which means 512-bit binary strings are needed.
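As a quick numerical check of this bound (a sketch; the uniform-distribution assumption is the one made in the text, and the function name is ours):

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniformly distributed 8-bit character has 256 equally likely values,
# so it carries 8 bits of entropy.
per_symbol = entropy_bits([1.0 / 256] * 256)

# A 64-dimensional vector of independent such characters therefore
# needs 64 * 8 = 512 binary bits, as stated above.
vector_entropy = 64 * per_symbol
```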
Exploiting this principle, we first reduce the dimensionality of the original data points, i.e., the descriptor vectors, by PCA. Then, the projections on the first c principal components are encoded by an m-dimensional binary code, where m > c. Hence, we need an over-determined system that can uniquely position every data point. This is similar to Global Positioning Systems (GPS) [13], which use dozens of satellites to position a receiver on the Earth's surface. Since our method is directly inspired by GPS, we name it the Global Hashing System (GHS). We tackle the major issue of how to distribute the satellites and propose two methods: one data-dependent and one data-independent. Unlike most existing methods [46][44][19], which handle a degraded version of the orthogonality of the code matrix in the continuous domain, both our methods approximate the orthogonal code matrix directly in the binary domain, which leads to better performance in long-bit experiments. Note that although SH can be regarded as assigning more bits to PCA directions along which the data have greater ranges, it is somewhat heuristic [46].
After the satellites are well distributed, the distances from the data points to each satellite (for brevity, this distance is denoted as D2S hereafter) are sorted separately. The nearest half is denoted as -1 while the other half is denoted as 1. Hence, our method can easily generate a balanced code matrix. Although a balanced code matrix is considered to be one of the two conditions for good codes [44], it is rarely enforced because it usually results in an NP-hard problem.

II Methodology

Let us define the notation used. A set of n data points in a d-dimensional space is represented by {x_1, ..., x_n}, which form the rows of the data matrix X in R^{n x d}. The projection matrix W in R^{d x c} is obtained from the first c eigenvectors of the data covariance matrix X^T X. Y = XW and y_i is the i-th row vector of Y. A binary code corresponding to x_i is defined by b_i in {-1, 1}^m, where m is the length of the code, and the b_i form the rows of the code matrix B in {-1, 1}^{n x m}.

II-A Global Positioning/Coding System

A satellite in a GPS can measure the distance between itself and a signal receiver on the Earth's surface. This results in a circle on which every point has the same distance to this satellite as the receiver. Hence, at least three satellites are needed to determine the true position, which is the unique intersection of three such circles. More generally, a c-dimensional point can be determined by its Euclidean distances to c + 1 other points in this space [1].
In our GHS, each satellite has only 1 bit to record the Euclidean distances. That is, the receivers far from a satellite are denoted as 1 while the nearby ones are denoted as -1. Hence, our hashing function can be defined as:

B_ik = sgn( f(||y_i - s_k||) - t_k ),   (1)

where ||y_i - s_k|| is the Euclidean distance from y_i to the k-th satellite, f can be any proper function that returns a positive real number, and the threshold t_k is chosen as the median of f(||y_j - s_k||) over all j, which generates a balanced code matrix. s_k is the coordinate of the k-th satellite, and it forms the k-th row of the satellite matrix S.
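A minimal sketch of this hashing function (all names are ours; f is taken as the identity on the Euclidean distance, and the per-satellite threshold is the median D2S, which yields the balanced split described above):

```python
import numpy as np

def ghs_hash(Y, S):
    """Hash embedded points Y (n x c) against satellites S (m x c).

    Bit k of point i is -1 if y_i is among the nearer half of points
    to satellite s_k, and 1 otherwise (median split => balanced bits).
    Returns (B, thresholds) so new queries can reuse the thresholds.
    """
    # D2S: Euclidean distance from every point to every satellite, n x m.
    D = np.linalg.norm(Y[:, None, :] - S[None, :, :], axis=2)
    t = np.median(D, axis=0)        # one threshold per satellite
    B = np.where(D > t, 1, -1)      # far -> 1, near -> -1
    return B, t

# Toy example: 6 points in 2-D, 3 satellites.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 2))
S = rng.standard_normal((3, 2))
B, t = ghs_hash(Y, S)
```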

II-B Data-dependent method (GHS-DD)

Formally, our hashing model can be described as:


Randomly setting S does not produce satisfactory results. Furthermore, Eq. (2) requires the pairwise distance between each pair of data points, which leads to a heavy burden in storage and computation. Inspired by ITQ, we circumvent it by minimizing the quantization loss.
First, let us consider the following quantization loss:


Because the D2S is always non-negative, we scale and shift B to (B + 1)/2 in {0, 1}^{n x m}. The underlying rationale of Eq. (3) is similar to that of ITQ. To uniquely position a data point in c-dimensional space, at least c + 1 satellites are required, and the locations of these satellites should satisfy the following condition [1]:


Eq. (4) is called the existence and uniqueness condition for the GPS solution [1]. It can be satisfied by initializing an orthogonal S. Hence, we create g groups of satellites. Within each group there are c + 1 satellites, c of which are orthogonal to each other. We define rho = c/m, a parameter discussed in Section II-D. Note that there are no more than c mutually orthogonal vectors in a c-dimensional space. Each group is rotated by an orthogonal matrix to find the best location, which gives the following model:


where 1(.) is an indicator function that equals 1 when its argument holds and 0 otherwise. The scalars alpha and beta are used to transform the values of the D2S into a proper interval. Eq. (5) is minimized by iterative minimization.
Initialization. In each group, the orthogonal part of S is initialized by the left singular vectors of a random matrix, and so is each rotation matrix R_g. One additional random vector is added to each group.
Update B. The k-th column of B is calculated by Eq. (1).
Update alpha. Taking the partial derivative of Eq. (5) with respect to alpha and setting it to zero results in


Update beta. Similarly to alpha,


Please note that the balance property of the code matrix is applied when deducing Eq. (7).
Update S. We divide this step into two sub-problems. First, S_g R_g is substituted by S^_g to form the following minimization problem:


which is equivalent to


where d^ collects the transformed D2S values. If we treat s^_k as a receiver, the data points as satellites, and d^ as the D2S, the solution of Eq. (9) is the standard solution of GPS [3].
Following the algebraic GPS solution [3], we construct two matrices for each s^_k from the data points and the corresponding D2S values, where diag(.) returns a row vector containing the diagonal elements of its argument. We then solve the following quadratic equation in a scalar lambda:


Eq. (10) usually has two solutions, lambda_1 and lambda_2; therefore, two possible s^_k can be found by back-substitution, together with a quantity related to the D2S that is useless in our model. To automatically choose a suitable s^_k from the two solutions, we initialize S with ||s_k|| = eta, where eta is a positive real constant. The solution whose norm is closer to eta is chosen for the following steps. eta is also used in our data-independent satellite distribution algorithm and is discussed in Section II-D along with the parameter rho.
After the s^_k are calculated, each R_g is found by minimizing the following problem:

min_{R_g} || S^_g - S_g R_g ||_F^2,  s.t. R_g^T R_g = I,   (11)
Eq. (11) can be solved by singular value decomposition (SVD). Writing the SVD of S_g^T S^_g as U Sigma V^T, we get R_g = U V^T.
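Read this way, the update of R_g is an orthogonal Procrustes problem; a small sketch under that reading (names hypothetical):

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A - B R||_F, via the SVD of B^T A."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return U @ Vt

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a known orthogonal matrix
A = B @ Q                                         # target: rotated copy of B
R = procrustes_rotation(A, B)                     # should recover Q
```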
Convergence. The algorithm terminates when the relative change of the objective falls below a small positive real constant or the maximum number of iterations is reached.
Output. The satellite matrix S and the thresholds t_k in Eq. (1).
Out-of-Sample Hashing. A new query is projected by W, and then its distance to each satellite is cut off by the corresponding threshold t_k.
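Out-of-sample hashing is then just a projection followed by the stored cut-offs (a sketch with hypothetical names and toy shapes):

```python
import numpy as np

def hash_query(x, W, S, t):
    """Hash a new query x: project by W, then threshold its
    distance to each satellite with the trained cut-offs t."""
    y = x @ W                             # PCA projection of the query
    d = np.linalg.norm(y - S, axis=1)     # D2S for the query
    return np.where(d > t, 1, -1)

# Toy shapes: d=5 input dims, c=2 projected dims, m=3 satellites.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))
S = rng.standard_normal((3, 2))
t = np.full(3, 1.0)
code = hash_query(rng.standard_normal(5), W, S, t)
```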

Fig. 1: MAP on the CIFAR-10 dataset for GHS-DI and GHS-DD with varying eta and rho. When eta approaches 0, both methods fail to obtain satisfactory results, and the performance of both becomes stable once eta is larger than 1. GHS-DI and GHS-DD attain their best results for moderate values of eta. For rho, the best results appear when rho approaches 1, because a sufficient number of principal components should be selected.
m                8       12      16      24      32      64      96
GHS-DD (c = m)   0.1890  0.2232  0.2392  0.2761  0.3053  0.3816  0.4131
GHS-DD (c < m)   0.1884  0.2214  0.2412  0.2806  0.3089  0.3972  0.4324
Improvement      -0.32%  -0.81%  0.83%   1.60%   1.17%   3.93%   4.46%
GHS-DI (c = m)   0.1543  0.1838  0.2079  0.2581  0.2757  0.3474  0.4018
GHS-DI (c < m)   0.1537  0.1861  0.2098  0.2688  0.3008  0.3653  0.4144
Improvement      -0.39%  1.24%   0.91%   3.98%   8.34%   4.90%   3.04%

TABLE I: MAP on CIFAR-10 for the parameter settings c = m and c < m

II-C Data-independent method (GHS-DI)

Another condition for good codes is uncorrelation [23], i.e., B^T B = nI. A direct way to satisfy this condition is to distribute the satellites such that only one satellite is close to each receiver; that is, there is no intersection among the spheres B(s_k, r_k), where r_k is the minimum radius that includes the data points near s_k. However, in this situation, each receiver has only one bit set to -1. The Hamming distance between any pair of receivers is then 0 or 2, which means the distances between data points in the input space are not well preserved. What is more, if we strictly satisfy the balance condition as well as the uncorrelation condition in this way, at most 2 satellites can be used.
An alternative is to minimize the intersection of the spheres B(s_k, r_k) and B(s_l, r_l) for any k != l. That is, we put a tolerance on the values of the off-diagonal elements of B^T B: they are allowed to be non-zero numbers with small absolute values.
The intersection of two c-dimensional spheres is too difficult to compute; therefore, the pairwise distance between each pair of satellites is maximized instead. Without constraints, the resulting distances may grow unboundedly. A reasonable constraint is to distribute all satellites on the surface of a sphere of radius eta. As there is no prior knowledge about the data, we assume the data points are uniformly distributed in a sphere. By setting ||s_k|| = eta for all k, the D2S values of the satellites will be comparable.
Under the abovementioned assumption, minimizing the intersections can be achieved by maximizing the pairwise distances between the satellites:

max_S sum_{k != l} || s_k - s_l ||^2,  s.t. ||s_k|| = eta, k = 1, ..., m.   (12)
Eq. (12) can be maximized by the Gradient Projection Algorithm (GPA) [9]. The GPA iteratively updates S by moving along the gradient direction of the objective f(S) = sum_{k != l} ||s_k - s_l||^2 and then projecting back to the boundary defined by the constraint (Algorithm 1). The gradient of f with respect to s_k is

df/ds_k = 4 sum_{l=1}^{m} (s_k - s_l) = 4 ( m s_k - sum_{l=1}^{m} s_l ).
Algorithm 1   Data-Independent Satellite Distribution Algorithm
1: Initialize S with rows of norm eta.
2: while not converged do
3:     S <- S + tau * df/dS   (gradient ascent step with step size tau)
4:     s_k <- eta * s_k / ||s_k|| for each k   (projection onto the sphere)
5: end while
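A sketch of Algorithm 1 in this spirit (step size, iteration count and all names are our illustrative choices):

```python
import numpy as np

def pairwise_sq_dist(S):
    """Sum of squared pairwise distances between the rows of S."""
    diff = S[:, None, :] - S[None, :, :]
    return (diff ** 2).sum()

def spread_satellites(S0, eta=1.0, step=0.01, iters=500):
    """Gradient projection: ascend the pairwise squared-distance
    objective, then project each row back to the radius-eta sphere."""
    m = S0.shape[0]
    S = eta * S0 / np.linalg.norm(S0, axis=1, keepdims=True)
    for _ in range(iters):
        grad = 4 * (m * S - S.sum(axis=0, keepdims=True))       # gradient of f
        S = S + step * grad                                     # ascent step
        S = eta * S / np.linalg.norm(S, axis=1, keepdims=True)  # projection
    return S

rng = np.random.default_rng(0)
S0 = rng.standard_normal((4, 3))
S = spread_satellites(S0)
```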

The projection step can be implemented directly by normalizing each s_k. As the orthogonality of B is considered, our GHS-DI method usually produces the second-best results in the longer-bit experiments. In fact, the way GHS-DD satisfies Eq. (4) intrinsically incorporates orthogonality. When ||s_k|| tends to infinity, the hypersphere surface that separates the near and far data points can be treated as a hyperplane. In this situation, with an orthogonal S and the assumption of uniformly distributed data points, this property is easy to understand in the 2-D and 3-D cases. More generally, we have the following theorem.

Theorem 1.

If (1) the data points are uniformly distributed in a sphere, (2) s_k is orthogonal to s_l, and (3) ||s_k|| and ||s_l|| tend to infinity, then b_k^T b_l = 0, where b_k and b_l are column vectors of B whose elements are the binary hash codes generated by Eq. (1).


Proof. Since the data points are uniformly distributed in a sphere, without loss of generality, let the sphere be centered at the origin. In Eq. (1), if f(||y_i - s_k||) > t_k, the k-th element of b_i is set to 1; otherwise it is set to -1. As ||s_k|| tends to infinity, the hypersphere surface separating the near and far data points degenerates into a hyperplane P_k perpendicular to s_k. To generate a balanced b_k, P_k should cross the origin. Since s_k is orthogonal to s_l, P_k is also perpendicular to P_l, which corresponds to b_l. It is evident that P_k and P_l separate the sphere into four parts of equal volume. Since there are equal numbers of data points in these four parts, it is easy to verify that b_k^T b_l = 0. ∎

In Theorem 1, conditions (1) and (3) are impractical, and therefore only the second sufficient condition can be satisfied, by setting c = m; however, this contravenes the perspective of SSCT and the existence and uniqueness condition for the GPS solution. In Section II-D, we show that c = m usually cannot generate the best results. Although our methods cannot exactly fulfill these three conditions, the benefit of considering orthogonality is demonstrated by their high F-measures in the experiments on longer bits (Section IV).

II-D Parameters eta and rho

There are two key parameters in our methods: eta and rho = c/m. eta should not be too small. Consider the extreme example eta = 0: all bits of the points close to the origin will equal -1 and the bits of the other points will equal 1. Obviously, such codes are inefficient.
rho should be moderate. If rho is too large, the binary codes gradually lose their ability to encode the values of the projections, which are real numbers. On the other hand, when rho becomes small, fewer projections can be used, so the data points reconstructed from these projections cannot approximate the original ones accurately enough.
The mean average precision (MAP) on the CIFAR-10 dataset [21] with varying eta and rho is shown in Fig. 1. CIFAR-10 comprises 60K images from the 80 Million Tiny Images dataset [40], and we use a 1024-dimensional GIST descriptor to represent each image. The PCA projections are normalized by the largest Euclidean norm of all projected data. When testing different values of rho, at most one group containing fewer than c + 1 satellites may exist. Based on the results in Fig. 1, we empirically set eta to 2 for all experiments, and set rho to 1 for the shorter-bit experiments and 0.5 for the others.
We also tested our two methods under the setting c = m (Table I). The percentages shown in Table I denote the relative change obtained by setting c < m instead. We observe that for m >= 16, both methods perform better with c < m, suggesting that the existence and uniqueness condition for the GPS solution is important. For the experiments on m = 8, the situation is the opposite, because the number of PCA projections is too small and this effect dominates the results. However, the differences are slight in these cases (less than 1%), so we did not use the setting c = m in the experiments of Section IV.

III Relations to Existing Methods

During the past several years, many state-of-the-art data-dependent hashing methods have been proposed, deriving from various motivations. In this section, only those related to our proposed methods are briefly reviewed.

III-A Iterative Quantization (ITQ)

Gong et al. [46] formulated ITQ as a minimization problem:

min_{B,R} || B - Y R ||_F^2,  s.t. R^T R = I, B in {-1, 1}^{n x m}.   (18)
Eq. (18) is minimized by iteratively updating B and R. R is required to be orthogonal, which can be considered a rotation of Y. IsoH [20] is directly derived from ITQ by finding a projection with equal variances for different dimensions. HH [45] also rotates Y; however, unlike ITQ, it uses an auxiliary variable for the code matrix during the iterative optimization and puts an orthogonality constraint on it. The auxiliary variable is then thresholded to generate the code matrix. ok-means [31] rotates and scales Y to minimize the quantization loss, whereas our method rotates and scales the D2S. ITQ, IsoH and HH use a number of principal components exactly equal to the bit length of the hash codes; that is, they cannot produce hash codes longer than the data dimension. In principle, our methods can produce hash codes of arbitrary length.
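A compact sketch of the ITQ alternation described above (a hypothetical minimal version; the real method includes PCA preprocessing, and all names are ours):

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Alternate B = sgn(V R) and a Procrustes update of R to reduce
    the quantization loss ||B - V R||_F^2 of Eq. (18)."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random rotation
    for _ in range(n_iter):
        B = np.sign(V @ R)
        B[B == 0] = 1
        U, _, Wt = np.linalg.svd(V.T @ B)  # Procrustes step for R given B
        R = U @ Wt
    B = np.sign(V @ R)                     # final codes under the learned R
    B[B == 0] = 1
    return B, R

rng = np.random.default_rng(1)
V = rng.standard_normal((100, 8))          # stand-in for PCA-projected data
B, R = itq(V)
```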

III-B Inductive Hashing on Manifolds (IMH)

IMH [35] first generates the base matrix by K-means clustering, with each column corresponding to a cluster center. It then embeds the cluster centers into a low-dimensional space by manifold learning methods [41][12]. The embedding method affects the performance of IMH; throughout this paper, t-SNE [41] is used because it achieved the best results in the authors' experiments [35]. Finally, the embedding for the training data is calculated by


where the elements of the weight matrix are defined as


where c_k is the k-th column of the base matrix. Eq. (17) is quite similar to the membership in fuzzy c-means clustering [4]: the embedding of a training point is a linear combination of the embeddings of the cluster centers. In our method, each satellite encodes one bit according to the distances from itself to the data points, and we do not encode the satellites themselves.

III-C Spectral Hashing (SH)

Weiss et al. [44] formulated SH as:

min_B sum_{i,j} W_ij || b_i - b_j ||^2,  s.t. B in {-1, 1}^{n x m}, B^T 1 = 0, B^T B = nI.
This formulation is similar in spirit to Eq. (2). The graph affinity matrix W, with W_ij = exp(-||x_i - x_j||^2 / eps^2), is intractable for large datasets. SH evaluates the smallest eigenvalues for each PCA direction to create a list of eigenvalues, sorts this list to find the overall smallest eigenvalues, and then thresholds the corresponding eigenfunctions. The eigenvalue-list creation step is consistent with the perspective of SSCT; however, it is somewhat heuristic [46]. AGH and DGH compute distances to anchor points to form a highly sparse affinity matrix and minimize a modified SH objective. GHS-DD avoids the computation and storage of the pairwise distances of all data points by minimizing the quantization loss. Furthermore, our method generates a balanced code matrix, which they cannot.

III-D Spherical Hashing (SpH)

The final step of SpH [15] is the same as in our method, so SpH also generates a balanced code matrix. However, SpH searches for the locations of its special points in the entire space, which makes it difficult to find a good solution. The authors argued that the distances between these points should be neither too large nor too small, and hence devised an empirical point-finding procedure that has little theoretical support. With more concrete theoretical analysis, our proposed method appears to outperform SpH.

Fig. 2: Mean F-measure of hash lookup with Hamming radius 2 for different methods on SUN397, GIST1M and SIFT10M.
m       8       12      16      24      32      64      96      128
GHS-DI  0.1336  0.1744  0.2194  0.2290  0.2579  0.3167  0.3588  0.3860
GHS-DD  0.1533  0.1945  0.2447  0.2746  0.2998  0.3492  0.3880  0.4096
ITQ 0.1508 0.1859 0.2301 0.2619 0.2886 0.3317 0.3592 0.3750
IsoH 0.1420 0.1677 0.1881 0.1950 0.2278 0.2578 0.2873 0.2882
HH 0.1478 0.1866 0.2213 0.2554 0.2687 0.3253 0.3543 0.3739
SH 0.1219 0.1369 0.1475 0.1705 0.1758 0.1897 0.2180 0.2206
IMH 0.1296 0.1357 0.1533 0.2453 0.2689 0.2896 0.3077 0.3990
okmeans 0.1469 0.1852 0.2136 0.2524 0.2716 0.3248 0.3507 0.3658
SpH 0.0377 0.0359 0.0364 0.0365 0.0363 0.0599 0.0942 0.2578
TABLE II: MAP on SUN397. m denotes the number of hash bits used in the hashing methods.
m       8       12      16      24      32      64      96      128
GHS-DI  0.1245  0.1552  0.1802  0.2052  0.2191  0.2596  0.2790  0.2885
GHS-DD  0.1358  0.1682  0.1952  0.2211  0.2438  0.2694  0.2854  0.2967
ITQ 0.1260 0.1593 0.1851 0.2098 0.2269 0.2577 0.2703 0.2775
IsoH 0.1121 0.1310 0.1844 0.1939 0.2288 0.2579 0.2712 0.2854
HH 0.1207 0.1603 0.1780 0.2019 0.2247 0.2597 0.2745 0.2880
SH 0.0871 0.0986 0.1033 0.1208 0.1339 0.1682 0.1781 0.1781
IMH 0.1248 0.1449 0.1748 0.1849 0.1965 0.2161 0.2385 0.2638
okmeans 0.1239 0.1610 0.1778 0.2070 0.2201 0.2565 0.2741 0.2809
SpH 0.0369 0.0349 0.0348 0.0359 0.0356 0.0637 0.0788 0.1919
TABLE III: MAP on GIST1M. m denotes the number of hash bits used in the hashing methods.
m       8       12      16      24      32      64      96      128
GHS-DI  0.1738  0.2193  0.2674  0.3342  0.3837  0.5156  0.5569  0.5797
GHS-DD  0.1864  0.2339  0.2769  0.3535  0.4098  0.5277  0.5692  0.5889
ITQ 0.1666 0.2195 0.2655 0.3452 0.3906 0.5025 0.5522 0.5782
IsoH 0.1764 0.2224 0.2469 0.3326 0.3766 0.4653 0.5524 0.5695
HH 0.1701 0.2258 0.2516 0.3143 0.3524 0.4494 0.5163 0.5554
SH 0.1704 0.2170 0.2382 0.2708 0.2810 0.3148 0.3039 0.3157
IMH 0.1833 0.1888 0.2007 0.2254 0.2884 0.3052 0.3358 0.3634
okmeans 0.1814 0.2260 0.2699 0.3233 0.3605 0.4401 0.4538 0.4964
SpH 0.0440 0.0487 0.0400 0.0475 0.0381 0.0615 0.1721 0.1947
TABLE IV: MAP on SIFT10M. m denotes the number of hash bits used in the hashing methods.
TABLE V: Training and testing time in seconds
Fig. 3: Mean F-measure of hash lookup with Hamming radius 2 and MAP for different methods on CIFAR-10.
Fig. 4: The query images and the query results returned by compared methods with 32 hash bits.

IV Experiments

Our experiments were conducted on three datasets of different scales: SUN397 [17], GIST1M [16] and SIFT10M. SUN397 contains about 108K images, and we represent each image by a 512-dimensional GIST descriptor [32]. GIST1M consists of 1 million 960-dimensional GIST descriptors. SIFT10M is a randomly chosen 10 million subset of the SIFT1B dataset [16], which comprises 1 billion 128-dimensional SIFT descriptors [30]. 1K images are randomly selected from the whole of SUN397 to form a separate test set. For GIST1M, a 1K test set is available. For SIFT10M, we randomly selected 1K data points from its 10K test set. Ground-truth neighbors for a given query are defined as the samples within the top 2% of Euclidean distances.

IV-A Protocols and Baselines

We evaluate our methods against seven hashing methods: Iterative Quantization (ITQ) [46], Isotropic Hashing (IsoH) [20], Harmonious Hashing (HH) [45], Spectral Hashing (SH) [44], Inductive Manifold Hashing (IMH) [35], Orthogonal K-means (ok-means) [31] and Spherical Hashing (SpH) [15]. Our data-dependent and data-independent methods are denoted as GHS-DD and GHS-DI, respectively. We use the publicly available code of the compared methods and follow the parameter settings suggested in the corresponding publications. All data are zero-centered, and in our methods the PCA projections are normalized by the largest Euclidean norm of all projected data. Two kinds of experiments, Hamming ranking and hash lookup, were conducted. The performance of Hamming ranking is measured by MAP, while the F1 score, denoted as F-measure and defined as F1 = 2 * precision * recall / (precision + recall), is used to evaluate hash lookup. Ground truths are defined by Euclidean neighbors.
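For reference, the F-measure of a hash-lookup result is the harmonic mean of precision and recall; a tiny helper (names ours):

```python
def f1_score(retrieved, relevant):
    """F1 of a lookup result: harmonic mean of precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)        # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Two of four retrieved items are relevant; half the relevant set is found.
score = f1_score(retrieved=[1, 2, 3, 4], relevant=[3, 4, 5, 6])
```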

IV-B Quantitative Evaluation

The mean average precision (MAP) values are given in Tables II-IV. GHS-DD outperforms all compared methods. The performance of GHS-DI is poorer than that of ITQ and HH except in the 128-bit experiments. Benefiting from its grounding in information theory and its balanced code matrix, GHS-DD exceeds ITQ, IsoH and HH. Due to computational limitations, SpH works on a small subset of the whole dataset, and its empirical satellite-distribution procedure proves less effective than ours. The F-measure is illustrated in Fig. 2; again, GHS-DD exceeds the others. It is worth noting that GHS-DI generated the second-best MAP and F-measure in the experiments on longer bits, because it considers the orthogonality of the code matrix. The way GHS-DD satisfies the condition of uniqueness and existence of the GPS solution, i.e., Eq. (4), together with its data-dependent nature, makes it work better than GHS-DI.

IV-C Computational Efficiency

Training and testing times for 32-bit codes are given in Table V. All experiments were run in MATLAB R2013b on a PC with a 2.85 GHz CPU and 128 GB RAM. The major computational cost of GHS-DI is the calculation of the D2S at the final step, which is linear in the product of the data dimension and the dataset size. Hence, it takes the least time on GIST1M and SIFT10M. Because GHS-DD computes the D2S in every iteration, its computational cost is moderate. When testing a new query, GHS-DI and GHS-DD compute the D2S in the same way, and hence their costs are comparable. Although the testing procedure of SpH is similar to ours, it computes the D2S in the original d-dimensional input space, so its testing time is longer.

IV-D Incorporating Label Information

To incorporate label information, a supervised dimensionality reduction method can be used to better capture the semantic structure of the dataset. Among various supervised dimensionality reduction methods, Canonical Correlation Analysis (CCA) [14] has proven to be efficient for extracting a common latent space from two views [10] and robust to noise [5].

Let z in {0, 1}^t be a label vector, where t is the total number of labels. If the i-th image is associated with the k-th label, z_k = 1, and z_k = 0 otherwise. Z in {0, 1}^{n x t} is the matrix whose rows are the label vectors. The goal of CCA is to maximize the correlation between the projected data matrix X and the label matrix Z by finding two projection directions, w_x and w_y. The correlation is defined as:

C(w_x, w_y) = ( w_x^T X^T Z w_y ) / sqrt( (w_x^T X^T X w_x)(w_y^T Z^T Z w_y) ).
w_x can be obtained by solving the following generalized eigenvalue problem:

X^T Z (Z^T Z + eps I)^{-1} Z^T X w_x = lambda^2 (X^T X + eps I) w_x,

where eps is a small regularization constant, set to 0.0001 here. Just as in the case of PCA, the leading generalized eigenvectors, scaled by their corresponding eigenvalues, form the rows of the projection matrix W, and we obtain the embedded data matrix Y = XW. Finally, either of our data-independent and data-dependent methods can be used to generate the hashing codes.
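The direction w_x above can be sketched as a generalized eigenproblem in plain NumPy (the regularization eps, the toy data and all names are illustrative):

```python
import numpy as np

def cca_directions(X, Z, eps=1e-4):
    """Leading CCA projections for X against labels Z via the
    generalized eigenproblem
      X^T Z (Z^T Z + eps I)^{-1} Z^T X w = lam (X^T X + eps I) w.
    Returns eigenvalues (descending) and eigenvectors as columns."""
    dx, dz = X.shape[1], Z.shape[1]
    Cxx = X.T @ X + eps * np.eye(dx)
    Czz = Z.T @ Z + eps * np.eye(dz)
    Cxz = X.T @ Z
    A = Cxz @ np.linalg.solve(Czz, Cxz.T)
    # Reduce to a standard eigenproblem: solve(Cxx, A) w = lam w.
    lam, W = np.linalg.eig(np.linalg.solve(Cxx, A))
    order = np.argsort(-lam.real)
    return lam.real[order], W.real[:, order]

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Z = (X[:, :2] > 0).astype(float)   # labels correlated with the data
lam, W = cca_directions(X, Z)
```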

The CIFAR-10 dataset is used in this experiment. The 60K images in CIFAR-10 are labelled with 10 classes, with 6,000 samples per class. Again, each image is represented by a 1024-dimensional GIST feature. 1,000 samples are randomly chosen as queries, and the remaining samples are used for training. Our supervised hashing methods are denoted as CCA-GHS-DI and CCA-GHS-DD, respectively. The baseline methods are Supervised Discrete Hashing (SDH) [34], KSH [42], FastHash [25] and CCA-ITQ [46].

The mean F-measure of hash lookup within Hamming distance 2 and the MAP scores of the compared methods are given in Fig. 3. CCA-GHS-DD achieves the best F-measure and MAP for all code lengths, while CCA-GHS-DI is only slightly inferior to SDH for the 16-bit code length. In the hash-lookup experiments, we found that setting the Hamming distance to 2 is favorable for both of our proposed methods, because two groups of satellites were used in these experiments. In Fig. 4, five queries with the corresponding results retrieved by the compared methods using 16-bit hash codes are shown to qualitatively evaluate performance. It can be seen that both CCA-GHS-DI and CCA-GHS-DD outperform the compared methods.

Fig. 5: Classification accuracy (%) on MNIST

IV-E Classification with hashing codes

In this subsection, the MNIST dataset is used to evaluate the hashing codes learned by the compared methods. MNIST consists of 70,000 images of handwritten digits from '0' to '9', each of which is 784-dimensional. BRE, CCA-ITQ, KSH, FastHash and SDH are used as baselines.
A linear Support Vector Machine (SVM) is applied to the hashing codes, trained with the LIBLINEAR [8] solver. The classification results are given in Fig. 5. It can be seen that CCA-GHS-DD achieves the highest classification accuracy over all hash bit lengths, while CCA-GHS-DI is the second best except on 32-bit hash codes, where it trails SDH.

V Conclusion

We have proposed a novel hashing method based on Shannon's Source Coding Theorem, which requires the hashing codes to be longer than the embeddings of the original training data. To circumvent the computation of pairwise distances between each pair of data points, we minimize a new formulation of the quantization loss that is based on the Global Positioning System (GPS). Data-dependent and data-independent methods are proposed to distribute the satellites. According to the experimental results on datasets of three scales, the data-dependent method (GHS-DD) is superior to the other methods, and the data-independent method (GHS-DI) produces promising results in less training time. GHS-DD takes a moderate length of time to train, and its demand on RAM is dominated by the computation of the covariance matrix in PCA. By incorporating Canonical Correlation Analysis (CCA), the proposed methods can also be used for supervised hashing, where CCA-GHS-DI and CCA-GHS-DD perform strongly. Finally, the learned hashing codes are used for a classification problem to further demonstrate the performance of the proposed methods. Future work will focus on improving computational efficiency and on investigating methods to train the model with a few samples from the whole dataset, in order to handle larger datasets such as SIFT1B and Tiny 80M.


  • [1] J. S. Abel and J. W. Chaffee (Nov. 1991) Existence and uniqueness of GPS solutions. IEEE Transactions on Aerospace and Electronic Systems 27 (6), pp. 952–956.
  • [2] A. Andoni and P. Indyk (Jan. 2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51 (1), pp. 117–122.
  • [3] S. Bancroft (Jan. 1985) An algebraic solution of the GPS equations. IEEE Transactions on Aerospace and Electronic Systems 21, pp. 56–59.
  • [4] J. C. Bezdek (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers. ISBN 0306406713.
  • [5] M. B. Blaschko and C. H. Lampert (2008) Correlational spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  • [6] M. S. Charikar (2002) Similarity estimation techniques from rounding algorithms. In ACM Symposium on Theory of Computing, pp. 380–388.
  • [7] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan and J. Yagnik (2013) Fast, accurate detection of 100,000 object classes on a single machine. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1814–1821.
  • [8] R. Fan, K. Chang, C. Hsieh, X. Wang and C. Lin (Nov. 2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9, pp. 1871–1874.
  • [9] M. A. T. Figueiredo, R. D. Nowak and S. J. Wright (Jan. 2007) Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing 1 (4), pp. 586–597.
  • [10] D. P. Foster, S. M. Kakade and T. Zhang (2008) Multi-view dimensionality reduction via canonical correlation analysis. Technical report.
  • [11] R. M. Gray (2011) Entropy and information theory. 2nd edition, Springer-Verlag.
  • [12] G. Hinton and S. Roweis (2002) Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pp. 833–840.
  • [13] B. Hofmann-Wellenhof, H. Lichtenegger and J. Collins (1997) Global positioning system: theory and practice. Springer-Verlag.
  • [14] H. Hotelling (Dec. 1936) Relations between two sets of variates. Biometrika 28, pp. 321–377.
  • [15] J.-P. Heo, Y. Lee, J. He, S.-F. Chang and S.-E. Yoon (2012) Spherical hashing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2957–2964.
  • [16] H. Jegou, M. Douze and C. Schmid (Mar. 2011) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
  • [17] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3485–3492.
  • [18] A. Joly and O. Buisson (2013) Random maximum margin hashing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 873–880.
  • [19] J. Wang, S. Kumar and S.-F. Chang (Sep. 2012) Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (12), pp. 2393–2406.
  • [20] W. Kong and W. Li (2012) Isotropic hashing. In Advances in Neural Information Processing Systems, pp. 1646–1654.
  • [21] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
  • [22] B. Kulis and T. Darrell (2009) Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems, pp. 1042–1050.
  • [23] B. Kulis and K. Grauman (Nov. 2012) Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (6), pp. 1092–1104.
  • [24] P. Li, A. Shrivastava, J. Moore and A. C. Konig (2011) Hashing algorithms for large-scale learning. In Advances in Neural Information Processing Systems.
  • [25] G. Lin, C. Shen, Q. Shi, A. van den Hengel and D. Suter (2014) Fast supervised hashing with decision trees for high-dimensional data. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1978.
  • [26] L. Liu, M. Yu and L. Shao (Mar. 2015) Multiview alignment hashing for efficient image search. IEEE Transactions on Image Processing 24 (3), pp. 956–966.
  • [27] W. Liu, C. Mu, S. Kumar and S. Chang (2014) Discrete graph hashing. In Advances in Neural Information Processing Systems.
  • [28] W. Liu, J. Wang and S. Chang (2011) Hashing with graphs. In International Conference on Machine Learning.
  • [29] W. Liu, J. Wang, Y. Mu, S. Kumar and S. Chang (2012) Compact hyperplane hashing with bilinear functions. In International Conference on Machine Learning.
  • [30] D. G. Lowe (1999) Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision, pp. 1150–1157.
  • [31] M. Norouzi and D. J. Fleet (2013) Cartesian k-means. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3017–3024.
  • [32] A. Oliva and A. Torralba (May 2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42 (3), pp. 145–175.
  • [33] F. Shen, C. Shen, Q. Shi, A. van den Hengel, Z. Tang and H. T. Shen (2015) Hashing on nonlinear manifolds. IEEE Transactions on Image Processing 24 (6), pp. 1839–1851.
  • [34] F. Shen, C. Shen, W. Liu and H. T. Shen (2015) Supervised discrete hashing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 37–45.
  • [35] F. Shen, C. Shen, Q. Shi, A. van den Hengel and Z. Tang (2013) Inductive hashing on manifolds. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1562–1569.
  • [36] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola and S. V. N. Vishwanathan (Nov. 2009) Hash kernels for structured data. Journal of Machine Learning Research 10, pp. 2615–2637.
  • [37] C. Strecha, A. M. Bronstein, M. M. Bronstein and P. Fua (May 2012) LDAHash: improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 66–78.
  • [38] A. Talwalkar, S. Kumar and H. Rowley (2008) Large-scale manifold learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  • [39] J. Tang, Z. Li, M. Wang and R. Zhao (Sep. 2015) Neighborhood discriminant hashing for large-scale image retrieval. IEEE Transactions on Image Processing 24 (9), pp. 2827–2840.
  • [40] A. Torralba, R. Fergus and W. T. Freeman (May 2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (11), pp. 1958–1970.
  • [41] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
  • [42] W. Liu, J. Wang, R. Ji, Y.-G. Jiang and S.-F. Chang (2012) Supervised hashing with kernels. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2074–2081.
  • [43] K. Weinberger, A. Dasgupta, J. Langford, A. Smola and J. Attenberg (2009) Feature hashing for large scale multitask learning. In International Conference on Machine Learning, pp. 1113–1120.
  • [44] Y. Weiss, A. Torralba and R. Fergus (2008) Spectral hashing. In Advances in Neural Information Processing Systems, pp. 1753–1760.
  • [45] B. Xu, J. Bu, Y. Lin, C. Chen, X. He and D. Cai (2013) Harmonious hashing. In International Joint Conference on Artificial Intelligence, pp. 1820–1826.
  • [46] Y. Gong and S. Lazebnik (2011) Iterative quantization: a procrustean approach to learning binary codes. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 817–824.
  • [47] L. Zhang, H. Lu, D. Du and L. Liu (Feb. 2016) Sparse hashing tracking. IEEE Transactions on Image Processing 25 (2), pp. 840–849.
  • [48] L. Zhang, Y. Zhang, R. Hong and Q. Tian (Jul. 2015) Full-space local topology extraction for cross-modal retrieval. IEEE Transactions on Image Processing 24 (7), pp. 2212–2224.
  • [49] R. Zhang, L. Lin, R. Zhang, W. Zuo and L. Zhang (Dec. 2015) Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing 24 (12), pp. 4766–4779.