A Scalable Optimization Mechanism for Pairwise based Discrete Hashing
Abstract
Maintaining the pairwise similarity relationships among originally high-dimensional data in a low-dimensional binary space is a popular strategy for learning binary codes. One simple and intuitive method is to utilize two identical code matrices produced by hash functions to approximate a pairwise real label matrix. However, the resulting quartic problem is difficult to solve directly due to the nonconvex and nonsmooth nature of the objective. In this paper, unlike previous optimization methods using various relaxation strategies, we aim to directly solve the original quartic problem using a novel alternative optimization mechanism that linearizes the quartic problem by introducing a linear regression model. Additionally, we find that gradually learning each batch of binary codes in a sequential mode, i.e. batch by batch, is greatly beneficial to the convergence of binary code learning. Based on this significant discovery and the proposed strategy, we introduce a scalable symmetric discrete hashing algorithm that gradually and smoothly updates each batch of binary codes. To further improve the smoothness, we also propose a greedy symmetric discrete hashing algorithm to update each bit of batch binary codes. Moreover, we extend the proposed optimization mechanism to solve the nonconvex optimization problems for binary code learning in many other pairwise based hashing algorithms. Extensive experiments on benchmark single-label and multi-label databases demonstrate the superior performance of the proposed mechanism over recent state-of-the-art methods.
1 Introduction
Hashing has become a popular tool to tackle large-scale tasks in the information retrieval, computer vision and machine learning communities, since it encodes originally high-dimensional data into compact binary codes while maintaining the similarity between neighbors, leading to significant gains in both computation and storage [1] [2] [3].
Early endeavors in hashing focus on data-independent algorithms, like locality sensitive hashing (LSH) [4] and min-wise hashing (MinHash) [5] [6]. They construct hash functions by using random projections or permutations. However, due to randomized hashing, in practice they usually require long codes to achieve high precision per hash table and multiple tables to boost the recall [2]. To learn compact binary codes, data-dependent algorithms that use available training data to learn hash functions have attracted increasing attention. Depending on whether they utilize semantic label information, data-dependent algorithms can be categorized into two main groups: unsupervised and supervised. Unlike unsupervised hashing [7] [8] [9], which explores intrinsic data structures to preserve similarity relations between neighbors without any supervision, supervised hashing [10] [11] [12] employs semantic information to learn hash functions, and thus it usually achieves better retrieval accuracy than unsupervised hashing on semantic similarity measures.
Among supervised hashing algorithms, pairwise based hashing, which maintains the relationship of similar or dissimilar pairs in a Hamming space, is one popular method to exploit label information. Numerous pairwise based algorithms have been proposed in the past decade, including spectral hashing (SH) [7], linear discriminant analysis hashing (LDAHash) [12], minimal loss hashing (MLH) [10], binary reconstruction embedding (BRE) [8] and kernel-based supervised hashing (KSH) [13]. Although these algorithms have been demonstrated to be effective in many large-scale tasks, their optimization strategies are usually insufficient to exploit the similarity information defined in the nonconvex and non-differentiable objective functions. To handle these nonsmooth and nonconvex problems, four main strategies have been proposed: symmetric relaxation, asymmetric relaxation, asymmetric discrete and symmetric discrete. Symmetric relaxation [7] [11] [13] [14] relaxes discrete binary vectors into a continuous feasible region, followed by thresholding to obtain binary codes. Although symmetric relaxation can simplify the original optimization problem, it often generates large accumulated errors between hash and linear functions. To reduce the accumulated error, asymmetric relaxation [15] utilizes the element-wise product of a discrete matrix and its relaxed continuous counterpart to approximate a pairwise label matrix. Asymmetric discrete hashing [16] [17] [18] usually utilizes the product of two distinct discrete matrices to preserve pair relations in a binary space. Symmetric discrete hashing [19] [20] first learns binary codes while preserving symmetric discrete constraints and then trains classifiers based on the learned discrete codes.
Although most hashing algorithms using these four strategies have achieved promising performance, they have at least one of the following four major disadvantages: (i) Learning binary codes employs relaxation and thresholding strategies, thereby producing large accumulated errors; (ii) Learning binary codes requires high storage and computation costs, i.e. , where is the number of data points, thereby limiting their applications to large-scale tasks; (iii) The used pairwise label matrix usually emphasizes the difference of images among different classes but neglects their relevance within the same class. Hence, existing optimization methods might perform poorly in preserving the relevance information among images; (iv) The employed optimization methods focus on one type of optimization problem and are difficult to apply directly to other problems.
Motivated by the aforementioned observations, in this paper, we propose a novel, simple, general and scalable optimization method that can solve various pairwise based hashing models for directly learning binary codes. The main contributions are summarized as follows:

We propose a novel alternative optimization mechanism to reformulate one typical quartic problem, in terms of hash functions in the original objective of KSH [13], into a linear problem by introducing a linear regression model.

We present and analyze a significant discovery that gradually updating each batch of binary codes in a sequential mode, i.e. batch by batch, is greatly beneficial to the convergence of binary code learning.

We propose a scalable symmetric discrete hashing algorithm that gradually updates each batch of one discrete matrix. To make the update step smoother, we further present a greedy symmetric discrete hashing algorithm to greedily update each bit of batch discrete matrices. Then we demonstrate that the proposed greedy hashing algorithm can be used to solve other optimization problems in pairwise based hashing.
2 Related Work
Based on the manner of using the label information, supervised hashing can be classified into three major categories: pointwise, multiwise and pairwise.
Pointwise based hashing formulates retrieval as a classification problem, based on the rule that the classification accuracy with learned binary codes should be maximized. Supervised discrete hashing (SDH) [24] leverages one linear regression model to generate optimal binary codes. Fast supervised discrete hashing (FSDH) [25] reduces the computational cost of SDH via a fast SDH approximation. Supervised quantization hashing (SQH) [26] introduces composite quantization into a linear model to further boost the discriminative ability of binary codes. Deep learning of binary hash codes (DLBHC) [27] and deep supervised convolutional hashing (DSCH) [28] employ convolutional neural networks to simultaneously learn image representations and hash codes in a pointwise manner. Pointwise based hashing is scalable and its optimization problem is relatively easier than that of multiwise and pairwise based hashing; however, its rule is inferior compared to the other two types of supervised hashing.
Multiwise based hashing, also called ranking based hashing, learns hash functions to maximize the agreement of similarity orders over two items between original and Hamming distances. Triplet ranking hashing (TRH) [29] and column generation hashing (CGH) [30] utilize a triplet ranking loss to maximally preserve the similarity order. Order preserving hashing (OPH) [31] learns hash functions to maximally preserve the similarity order by treating it as a classification problem. Ranking based supervised hashing (RSH) [32] constructs a ranking triplet matrix to maintain orders of ranking lists. Ranking preserving hashing (RPH) [33] learns hash functions by directly optimizing a ranking measure, Normalized Discounted Cumulative Gain (NDCG) [34]. Top rank supervised binary coding (TopRSBC) [35] focuses on boosting the precision of top positions in a Hamming distance ranking list. Discrete semantic ranking hashing (DSeRH) [36] learns hash functions to maintain ranking orders while preserving symmetric discrete constraints. Deep network in network hashing (DNNH) [37], deep semantic ranking hashing (DSRH) [38] and triplet-based deep binary embedding (TDBE) [39] utilize convolutional neural networks to learn image representations and hash codes based on the triplet ranking loss over three items. Most of these multiwise based hashing algorithms relax the ranking order or discrete binary codes into a continuous feasible region to solve their original nonconvex and nonsmooth problems.
Pairwise based hashing preserves the relationships among originally high-dimensional data in a Hamming space by calculating and preserving the relationship of each pair. SH [7] constructs one graph to maintain the similarity among neighbors and then utilizes it to map the high-dimensional data into a low-dimensional Hamming space. Although the original version of SH is unsupervised, it is easily converted into a supervised algorithm. Inspired by SH, many variants including anchor graph hashing [40], elastic embedding [41], discrete graph hashing (DGH) [2], and asymmetric discrete graph hashing (ADGH) [18] have been proposed. LDAHash [12] projects the high-dimensional descriptor vectors into a low-dimensional Hamming space while maximizing the inter-class distances and minimizing the intra-class distances. MLH [10] adopts structured prediction with latent variables to learn hash functions. BRE [8] aims to minimize the difference between Euclidean distances of original data and their Hamming distances. It leverages a coordinate-descent algorithm to solve the optimization problem while preserving symmetric discrete constraints. SSH [11] introduces a pairwise matrix and KSH [13] leverages the Hamming distance between pairs to approximate the pairwise matrix. This objective function is intuitive and simple, but the optimization problem is highly non-differentiable and difficult to solve directly. KSH utilizes a “symmetric relaxation + greedy” strategy to solve the problem. Two-step hashing (TSH) [14] and FastHash [42] relax the discrete constraints into a continuous region . Kernel based discrete supervised hashing (KSDH) [15] adopts asymmetric relaxation to simultaneously learn the discrete matrix and a low-dimensional projection matrix for hash functions. Lin:Lin and Lin:V [16], asymmetric inner-product binary coding (AIBC) [17] and asymmetric discrete graph hashing (ADGH) [18] employ the asymmetric discrete mechanism to learn low-dimensional matrices.
Column sampling based discrete supervised hashing (COSDISH) [20] adopts the same column sampling strategy as latent factor hashing (LFH) [43] but directly learns binary codes by reformulating the binary quadratic programming (BQP) problems into equivalent clustering problems. Convolutional neural network hashing (CNNH) [19] divides the optimization problem into two subproblems [44]: (i) learning binary codes by a coordinate descent algorithm using Newton directions; (ii) training a convolutional neural network using the learned binary codes as labels. After that, deep hashing network (DHN) [45] and deep supervised pairwise hashing (DSPH) [46] simultaneously learn image representations and binary codes using pairwise labels. HashNet [47] learns binary codes from imbalanced similarity data. Deep Cauchy hashing (DCH) [48] utilizes pairwise labels to generate compact and concentrated binary codes for efficient and effective Hamming space retrieval. Unlike previous work, in this paper we aim to present a simpler, more general and scalable optimization method for binary code learning.
3 Symmetric Discrete Hashing via A Pairwise Matrix
In this paper, matrices and vectors are represented by boldface uppercase and lowercase letters, respectively. For a matrix , its th row and th column vectors are denoted as and , respectively, and is one entry at the th row and th column.
3.1 Formulation
KSH [13] is one popular pairwise based hashing algorithm, which preserves pairwise relationships by using two identical binary matrices to approximate one pairwise real matrix. Additionally, it is a quartic optimization problem in terms of hash functions, and thus more representative and more difficult to solve than problems that contain only a quadratic term with respect to hash functions. Therefore, we first propose a novel optimization mechanism to solve the original problem in KSH, and then extend the proposed method to solve other pairwise based hashing models.
Given data points , suppose one pair when they are neighbors in a metric space or share at least one common label, and when they are non-neighbors in a metric space or have different class labels. For the single-label multi-class problem, the pairwise matrix is defined as [11]:
(1) 
For the multi-label multi-class problem, similar to [49], can be defined as:
(2) 
where is the relevance between and , which is defined as the number of common labels shared by and . is the weight to describe the difference between and . In this paper, to preserve the difference between non-neighbor pairs, we empirically set , where is the maximum relevance among all neighbor pairs. We do not set because few data pairs have a relevance of .
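As an illustration of the two pairwise matrices described above, the following minimal sketch builds them from label arrays. It assumes a fully labeled training set (so every pair is either a neighbor pair or a non-neighbor pair), and the function names and the `neg_weight` parameter are hypothetical. The form used for the multi-label case (relevance for neighbors, a fixed negative weight otherwise) is an assumption, not necessarily the exact expression in Eq. (2).

```python
import numpy as np

def pairwise_single_label(labels):
    """S[i, j] = 1 if samples i and j share a class label, else -1."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1, -1)

def relevance_multi_label(Y):
    """Relevance = number of labels shared by each pair (Y is n x c, entries 0/1)."""
    Y = np.asarray(Y, dtype=int)
    return Y @ Y.T

def pairwise_multi_label(Y, neg_weight):
    """Assumed form: neighbors keep their relevance, non-neighbors get -neg_weight."""
    R = relevance_multi_label(Y)
    return np.where(R > 0, R, -neg_weight).astype(float)

# Usage: three samples, the first two share label 0.
Y = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
print(pairwise_single_label([0, 0, 1]))
print(pairwise_multi_label(Y, neg_weight=0.5))
```

Keeping the relevance counts for neighbor pairs, rather than collapsing them to 1, is what lets the matrix carry within-class relevance information in the multi-label setting.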
To encode one data point into bit hash codes, its th hash function can be defined as:
(3) 
where is a projection vector, and if , otherwise . Note that since can be written in the form by adding one dimension and absorbing , for simplicity we utilize in this paper. Let be hash codes of , and then for any pair , we have . To approximate the pairwise matrix , as in [13], a least-squares style objective function is defined as:
(4) 
where and is a low-dimensional projection matrix. Eq. (4) is a quartic problem in terms of hash functions, which can be seen by expanding its objective function.
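The quantities in this subsection can be sketched in a few lines of NumPy. The sketch assumes codes in {-1, +1}, a sign function that maps zero to +1, and a least-squares objective of the form ||r S - H H^T||_F^2 with r the code length. The scaling by r and the names `hash_codes`, `hamming` and `pairwise_objective` are assumptions made for illustration. The relation between inner products and Hamming distances, d = (r - h_i h_j^T) / 2, holds for any {-1, +1} codes.

```python
import numpy as np

def hash_codes(X, A):
    """Sign of a linear projection; zero is mapped to +1 (assumed tie rule)."""
    return np.where(X @ A >= 0, 1, -1)

def hamming(H):
    """Pairwise Hamming distances between rows of a {-1, +1} code matrix."""
    r = H.shape[1]
    return (r - H @ H.T) // 2

def pairwise_objective(H, S):
    """|| r*S - H H^T ||_F^2, an assumed form of the least-squares objective."""
    r = H.shape[1]
    return float(np.linalg.norm(r * S - H @ H.T, "fro") ** 2)

# Usage: three 2-bit codes scored against a single-label pairwise matrix.
H = np.array([[1, 1], [1, -1], [-1, -1]])
S = np.array([[1, 1, -1], [1, 1, -1], [-1, -1, 1]])
print(hamming(H))
print(pairwise_objective(H, S))
```

Because H appears twice in H H^T and the objective squares that product, expanding the norm gives terms of degree four in the entries of H, which is the quartic structure noted above.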
3.2 Symmetric Discrete Hashing
3.2.1 Formulation transformation
In this subsection, we show the procedure to transform Eq. (4) into a linear problem. Since the objective function in Eq. (4) is a highly non-differentiable quartic problem in terms of hash functions , it is difficult to solve this problem directly. Here, we solve the problem in Eq. (4) via a novel alternative optimization mechanism: reformulating the quartic problem in terms of hash functions into a quadratic one and then linearizing the quadratic problem. We present the detailed procedure in the following.
Firstly, we introduce a lemma that shows one of our main motivations for transforming the quartic problem into a linear problem.
Lemma 1.
When the matrix satisfies the condition: , it is a global solution of the following problem:
(5) 
Lemma 1 is easy to verify, because when , makes the objective in Eq. (5) attain the maximum. Since satisfying is a global solution of the problem in Eq. (5), it suggests that the problem in terms of hash functions can be transformed into a linear problem in terms of . Inspired by this observation, we can solve the quartic problem in terms of hash functions. For brevity, in the following we first ignore the constraint in Eq. (4) and aim to transform the quartic problem in terms of into the linear form as the objective in Eq. (5), and then obtain the low-dimensional projection matrix .
To reformulate the quartic problem in terms of into a quadratic one, in the th iteration, we set one discrete matrix to be and aim to solve the following quadratic problem in terms of :
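The intuition behind Lemma 1 can be checked by brute force on a tiny example. The sketch below assumes that Eq. (5) is a trace maximization of the form max Tr(H^T B) over binary matrices H, for a given binary matrix B. This form and the function name `best_binary_match` are assumptions for illustration. Under that assumption the maximum equals the number of entries of B and is attained only at H = B.

```python
import itertools
import numpy as np

def best_binary_match(B):
    """Enumerate every H in {-1,+1}^(n x r) and return the trace maximizer."""
    n, r = B.shape
    best_val, best_H = -np.inf, None
    for bits in itertools.product([-1, 1], repeat=n * r):
        H = np.array(bits).reshape(n, r)
        val = np.trace(H.T @ B)  # linear in H: sum of entrywise products
        if val > best_val:
            best_val, best_H = val, H
    return best_val, best_H

# Usage: exhaustive search over 2^6 candidate code matrices.
B = np.array([[1, -1], [-1, -1], [1, 1]])
val, H = best_binary_match(B)
print(val, np.array_equal(H, B))
```

The enumeration is exponential and is only meant to make the linear structure visible: each entry of H contributes independently, so the optimum is read off entrywise.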
(6) 
Note that the problem in Eq. (6) is not strictly equivalent to the problem in Eq. (4) w.r.t . However, when , it is the optimal solution of both Eq. (6) and Eq. (4) w.r.t . The details are shown in Proposition 1.
Proposition 1.
Proof.
Obviously, if , it is the optimal solution of Eq. (6). Then we can consider the following formulation:
(7) 
Similar to one major motivation of asymmetric discrete hashing algorithms [16] [18], in Eq. (7), the feasible region of , in the left term is more flexible than in the right term (Eq. (4)), i.e. the left term contains both cases and . Only when , . It suggests that when , it is the optimal solution of Eq. (4). Therefore, when , it is the optimal solution of both Eq. (4) and Eq. (6). ∎
Inspired by Proposition 1, we aim to seek through solving the problem in Eq. (6). Because is known and , the optimization problem in Eq. (6) equals:
(8) 
Since is a linear problem in terms of , the main difficulty in solving Eq. (8) is caused by the nonconvex quadratic term . Thus we aim to linearize this quadratic term in terms of by introducing a linear regression model as follows:
Theorem 1.
Given a discrete matrix and one real nonzero matrix , , where is a diagonal matrix and is an identity matrix.
Proof.
It is easy to verify that is the global optimal solution to the problem . Substituting into the above objective, its minimum value is . Therefore, Theorem 1 is proved. ∎
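The regression argument used in this proof can be illustrated numerically. The sketch below assumes a ridge-type model min_P ||B - Z P||_F^2 + lam ||P||_F^2 with a scalar weight `lam` standing in for the diagonal matrix of Theorem 1. That simplification and all function names are assumptions. For this model the minimizer is P* = (Z^T Z + lam I)^{-1} Z^T B and the minimum value is Tr(B^T B) - Tr(B^T Z (Z^T Z + lam I)^{-1} Z^T B), which is how a quadratic term in B can be expressed through a regression problem.

```python
import numpy as np

def ridge_solution(Z, B, lam):
    """Closed-form minimizer of ||B - Z P||_F^2 + lam * ||P||_F^2."""
    M = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(M, Z.T @ B)

def ridge_objective(Z, B, P, lam):
    """Value of the regularized regression objective at P."""
    fit = np.linalg.norm(B - Z @ P, "fro") ** 2
    return float(fit + lam * np.linalg.norm(P, "fro") ** 2)

def closed_form_minimum(Z, B, lam):
    """Tr(B^T B) - Tr(B^T Z (Z^T Z + lam I)^{-1} Z^T B)."""
    M = Z.T @ Z + lam * np.eye(Z.shape[1])
    quad = np.trace(B.T @ Z @ np.linalg.solve(M, Z.T @ B))
    return float(np.trace(B.T @ B) - quad)

# Usage: the two values agree up to floating-point error.
rng = np.random.default_rng(1)
Z = rng.standard_normal((8, 3))
B = np.where(rng.standard_normal((8, 2)) >= 0, 1.0, -1.0)
P = ridge_solution(Z, B, 0.5)
print(ridge_objective(Z, B, P, 0.5), closed_form_minimum(Z, B, 0.5))
```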
Theorem 1 suggests that when , the quadratic problem in Eq. (8) can be linearized as a regression type. We show the details in Theorem 2.
Theorem 2.
When , where is a constant, the problem in Eq. (8) can be reformulated as:
(9) 
Proof.
Next, we demonstrate that there exists and such that . The details are shown in Theorem 3.
Theorem 3.
Suppose that a full rank matrix , where is a positive diagonal matrix and . If and , and a real nonzero matrix satisfies the conditions: , and is a nonnegative real diagonal matrix with the th diagonal element being .
Proof.
Based on singular value decomposition (SVD), there exist matrices and , satisfying the conditions and , such that a real nonzero matrix is represented by , where is a nonnegative real diagonal matrix. Then . Note that when the vectors in and correspond to the zero diagonal elements, they can be constructed by employing a Gram-Schmidt process such that and , and these constructed vectors are not unique.
Since and , it can have when . Since there exists , should satisfy: and . Additionally, based on , there exists . Therefore, Theorem 3 is proved. ∎
is usually a positive-definite matrix thanks to , leading to . Based on Theorem 3, it is easy to construct a real nonzero matrix . Since , we set for simplicity, where and is a constant. Then Eq. (10) can be solved by alternately updating , and . Actually, we can obtain by using an efficient algorithm in Theorem 4 that does not need to compute the matrices and .
Theorem 4.
For the inner th iteration embedded in the outer th iteration, the problem in Eq. (10) can be reformulated as the following problem:
(11) 
where denotes binary codes in the outer th iteration, and represents the obtained binary codes at the inner th iteration embedded in the outer th iteration.
Proof.
In Eq. (10), for the inner th iteration embedded in the outer th iteration, fixing as , it is easy to obtain . Substituting into Eq. (10), it becomes:
(12) 
Based on Theorem 2 and its proof, we have . Substituting it into Eq. (12) yields the optimization problem in Eq. (11). Therefore, Theorem 4 is proved. ∎
For the inner loop embedded in the outer th iteration, there are many choices for the initialization value . Here, we set . At the th iteration, the global solution of Eq. (11) is . Additionally, for the inner loop, since both and are global solutions of the th iteration, the objective of Eq. (11) will be nondecreasing and converge to at least a local optimum. Therefore, we have the following theorem.
Theorem 5.
For the inner loop embedded in the outer th iteration, the objective of Eq. (11) is monotonically nondecreasing in each iteration and will converge to at least a local optimum.
Although Theorem 5 suggests that the objective of Eq. (11) will converge, its convergence is largely affected by the parameter , which is used to balance the convergence and semantic information in . Usually, the larger is, the faster the convergence but the greater the loss of semantic information. Therefore, we empirically set a small nonnegative constant for , i.e. , where .
Based on Eq. (11), the optimal solution can be obtained. Then we utilize to replace in Eq. (6) for the next iteration in order to obtain the optimal solution . Since and , based on Lemma 1, should satisfy . However, it is an overdetermined linear system due to . For simplicity, we utilize a least-squares model to obtain , which is .
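The final least-squares step can be sketched as follows. The sketch assumes that the projection matrix is fitted by ordinary least squares so that X A approximates the learned codes B, i.e. the standard solution A = (X^T X)^{-1} X^T B when X has full column rank. The function name `fit_projection` is hypothetical, and `numpy.linalg.lstsq` is used instead of forming the inverse explicitly.

```python
import numpy as np

def fit_projection(X, B):
    """Least-squares fit of X A ~ B without forming (X^T X)^{-1} explicitly."""
    A, *_ = np.linalg.lstsq(X, B, rcond=None)
    return A

# Usage: fit a projection to synthetic 3-bit codes for 20 points in 4 dimensions.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 4))
B = np.where(X @ rng.standard_normal((4, 3)) >= 0, 1.0, -1.0)
A = fit_projection(X, B)
agreement = np.mean(np.where(X @ A >= 0, 1.0, -1.0) == B)
print(A.shape, agreement)
```

Since the system is overdetermined, sgn(X A) generally reproduces B only approximately, so the fraction of matching bits is a useful sanity check on the fit.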
3.2.2 Scalable symmetric discrete hashing with updating batch binary codes
Remark: The optimal solution of Eq. (6) is at least the local optimal solution of Eq. (4) only when . Given an initialization , can be alternately updated by solving Eq. (11). However, when all binary codes are updated at once on the nonconvex feasible region, might change on two different discrete matrices, which would lead to the error (see Figure 1a) and a worse objective value of Eq. (4) (see Figure 1b). Therefore, we divide into a variety of batches and gradually update each of them in a sequential mode, i.e. batch by batch.
To update one batch of , i.e. , where is one column vector denoting the index of selected binary codes in , the optimization problem derived from Eq. (6) is:
(13) 
where .
Furthermore, although is high-dimensional for large , it is low-rank or can be approximated as a low-rank matrix. Similar to previous algorithms [40] [18], we can select () samples from training samples as anchors and then construct an anchor based pairwise matrix , which preserves almost all similarity information of . Let denote binary codes of anchors, and then utilize to replace for updating , Eq. (13) becomes:
(14) 
where denotes obtained at the th iteration in the outer loop, and .
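The anchor construction can be sketched for the single-label case as follows. The sketch assumes that the anchor based pairwise matrix simply keeps the columns of the full pairwise matrix that correspond to the selected anchors. The function name and this reading of the construction are assumptions. The point is that only an n x m block is ever formed, rather than the full n x n matrix.

```python
import numpy as np

def anchor_pairwise(labels, anchor_idx):
    """n x m block of the single-label pairwise matrix, restricted to anchor columns."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[anchor_idx][None, :]
    return np.where(same, 1, -1)

# Usage: five samples, two of them used as anchors.
labels = [0, 1, 0, 1, 2]
S_a = anchor_pairwise(labels, [0, 4])
print(S_a.shape)
print(S_a)
```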
Algorithm 1: SDH_P
Input: Data , pairwise matrix , bit number , parameters , , batch size , anchor index , outer and inner loop maximum iteration numbers , .
Output: and .
1: Initialize: Let , set to be the left eigenvectors of corresponding to its largest eigenvalues, calculate and .
2: while not converged and maximum iterations not reached
3: ;
4: for to do
5: ;
6: Do the SVD of ;
7: ;
8: repeat
9: ;
10: until convergence
11: ;
12: end for
13: end while
14: Do the SVD of ;
15: ;
16: .
Algorithm 2: GSDH_P
Input: Data , pairwise matrix , bit number , parameters , , batch size , anchor index , outer and inner loop maximum iteration numbers , .
Output: and .
1: Initialize: Let , set to be the left eigenvectors of corresponding to its largest eigenvalues, calculate , and .
2: while not converged and maximum iterations not reached
3: ;
4: for to do
5: ;
6: for to do
7: Calculate with fixing , , ;
8: repeat
9: ;
10: until convergence
11: ;
12: end for
13: end for
14: end while
15: Do the SVD of ;
16: ;
17: .
Similar to Eq. (6), the problem in Eq. (14) can first be transformed into a quadratic problem, and then be reformulated in a form similar to Eq. (11) based on Theorems 1-4, e.g.
(15) 
where denotes the batch binary codes at the th iteration in the outer loop.
For clarity, we present the detailed optimization procedure to attain by updating each batch and calculate the projection matrix in Algorithm 1, namely symmetric discrete hashing via a pairwise matrix (SDH_P). For Algorithm 1, with gradually updating each batch of , the error usually converges to zero (see Figure 1a) and the objective of Eq. (4) also converges to a better local optimum (see Figure 1b). Besides, we also display the retrieval performance in terms of mean average precision (MAP) with a small batch size and different iterations in Figure 1c.
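The batch-by-batch idea can be illustrated with a simplified stand-in for the inner loop. The sketch below is not the update of Eq. (11) or Eq. (15). It fixes the codes from the previous outer step as `B_ref`, considers the surrogate ||r S[idx] - B_batch B_ref^T||_F^2 for one batch, and decreases it with a majorization-minimization sign update: the quadratic term is bounded using the largest eigenvalue of B_ref^T B_ref, after which the minimizer is an entrywise sign. All names, the surrogate and the choice of bound are assumptions made for illustration. The sketch only guarantees that the per-batch surrogate does not increase within the inner loop.

```python
import numpy as np

def surrogate(S, B_batch, idx, B_ref):
    """|| r*S[idx] - B_batch B_ref^T ||_F^2 with B_ref held fixed."""
    r = B_ref.shape[1]
    return float(np.linalg.norm(r * S[idx] - B_batch @ B_ref.T, "fro") ** 2)

def update_batch(S, B, idx, inner_iters=10):
    """Inner loop for one batch: majorize the quadratic term, then take signs."""
    r = B.shape[1]
    B_ref = B.copy()                   # codes from the previous outer step
    G = B_ref.T @ B_ref
    c = np.linalg.eigvalsh(G)[-1]      # curvature bound: largest eigenvalue of G
    lin = r * S[idx] @ B_ref           # linear term, fixed during the inner loop
    Bb = B[idx].copy()
    history = [surrogate(S, Bb, idx, B_ref)]
    for _ in range(inner_iters):
        V = lin - Bb @ G + c * Bb
        new = np.where(V > 0, 1, np.where(V < 0, -1, Bb))  # keep old bit on ties
        if np.array_equal(new, Bb):
            break
        Bb = new
        history.append(surrogate(S, Bb, idx, B_ref))
    return Bb, history

def sweep(S, B, batch_size):
    """One outer pass: refresh the codes batch by batch, in order."""
    B = B.copy()
    for start in range(0, B.shape[0], batch_size):
        idx = np.arange(start, min(start + batch_size, B.shape[0]))
        B[idx], _ = update_batch(S, B, idx)
    return B

# Usage: one sweep over 12 samples with 4-bit codes and batches of 4.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=12)
S = np.where(labels[:, None] == labels[None, :], 1, -1)
B0 = np.where(rng.standard_normal((12, 4)) >= 0, 1, -1)
B1 = sweep(S, B0, batch_size=4)
print(np.linalg.norm(4 * S - B0 @ B0.T) ** 2, np.linalg.norm(4 * S - B1 @ B1.T) ** 2)
```

Because each batch is written back before the next one is processed, later batches see the refreshed codes, which is the sequential behavior described above. Whether the symmetric objective improves after a sweep is an empirical matter in this sketch.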
3.3 Greedy Symmetric Discrete Hashing
To make the update step smoother, we greedily update each bit of the batch matrix . Suppose that is the th bit of , it can be updated by solving the following optimization problem:
(16) 
where and represent the th and th bits of , respectively, and is the th bit of .
The problem in Eq. (16) can also be first transformed into a quadratic problem and then solved using Theorems 1-4. Similar to Eq. (11), the problem in Eq. (16) can be transformed to:
(17) 
where is the obtained at the th iteration in the outer loop, is the obtained at the th iteration in inner loop embedded in the th outer loop and .
In summary, we show the detailed optimization procedure in Algorithm 2, namely greedy symmetric discrete hashing via a pairwise matrix (GSDH_P). The error and the objective of Eq. (4) in Algorithm 2 and its retrieval performance in terms of MAP with different numbers of iterations are shown in Figure 1a, b and c, respectively.
Out-of-sample: In the query stage, is employed as the binary codes of training data. We adopt two strategies to encode the query data point : (i) encoding it using ; (ii) similar to previous algorithms [50] [19], employing as labels to learn a classification model, like least-squares, decision trees (DT) or convolutional neural networks (CNNs), to classify .
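Strategy (i) and the subsequent retrieval step can be sketched as follows. The sketch assumes a learned linear projection `A`, a sign function that maps zero to +1, and retrieval by ascending Hamming distance to the stored training codes. The function names are hypothetical. The Hamming distance is obtained from inner products via d = (r - b q^T) / 2.

```python
import numpy as np

def encode(x, A):
    """Encode one query (or a batch of queries) with a linear hash function."""
    return np.where(np.atleast_2d(x) @ A >= 0, 1, -1)

def hamming_rank(query_code, B):
    """Rank the training codes B by Hamming distance to a single query code."""
    r = B.shape[1]
    dist = (r - B @ query_code.ravel()) // 2
    return np.argsort(dist, kind="stable"), dist

# Usage: three stored 3-bit codes and one query.
B = np.array([[1, 1, 1], [1, -1, 1], [-1, -1, -1]])
q = encode(np.array([0.5, 2.0, 1.0]), np.eye(3))
order, dist = hamming_rank(q, B)
print(order, dist)
```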
3.4 Convergence Analysis
Empirically, when , the proposed algorithms can converge to at least a local optimum, although they cannot be theoretically guaranteed to converge in all cases. Here, we explain why gradually updating each batch of binary codes is beneficial to the convergence of hash code learning.
In Eq. (11), with updating one batch of , i.e. , Eq. (11) becomes:
(18) 
The hash code matrix can be represented as , where . Since , the objective of Eq. (18) is determined by:
(19) 
Based on Theorem 5, the inner loop can theoretically guarantee the convergence of the objective in Eq. (11), and thus the optimal solution of Eq. (19) can be obtained by the inner loop. Then it has:
(20) 
Because of , Eq. (20) usually leads to
(21) 
where and . Eq. (21) suggests that when , updating each batch matrix can usually make the objective of Eq. (11) gradually converge to at least a local optimum.
3.5 Time Complexity Analysis
In Algorithm 1, and . Step 1 calculating matrices and requires and operations, respectively. For the outer loop stage, the time complexity of steps 6, 7, 9 and 11 is , , and , respectively. Hence, the outer loop stage spends operations. Steps 14-16 to calculate the projection matrix spend , and , respectively. Therefore, the total complexity of Algorithm 1 is . Empirically, and .
For Algorithm 2, step 1 calculating , and spends at most . In the loop stage, the major steps, 7 and 9, both require operations. Hence, the total time complexity of the loop stage is . Additionally, calculating the final costs the same time as steps 14-16 in Algorithm 1. Therefore, the time complexity of Algorithm 2 is .
4 Extension to Other Hashing Algorithms
In this section, we illustrate that the proposed algorithm GSDH_P is suitable for solving many other pairwise based hashing models.
Two-step hashing algorithms [14] [42] iteratively update each bit for different loss functions defined on the Hamming distance of data pairs, so that the loss functions of many hashing algorithms such as BRE [8], MLH [10] and EE [41] are incorporated into a general framework, which can be written as:
(22) 
where represents the th bit of binary codes and is obtained based on different loss functions with fixing all bits of binary codes except .
The algorithms [14] [42] first relax into and then employ L-BFGS-B [51] to solve the relaxed optimization problem, followed by thresholding to attain the binary vector . However, our optimization mechanism can directly solve Eq. (22) without relaxing .
Since and , the problem in Eq. (22) can be equivalently reformulated as:
(23) 
whose optimization type is the same as the objective of Eq. (4) w.r.t . Replacing the constraint with , where is the th row vector of , Eq. (23) becomes:
(24) 
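The step from Eq. (22) to Eq. (23) can be checked numerically under one assumption about the form of Eq. (22). Assuming it is a binary quadratic program min b^T L b over b in {-1, +1}^n, where the matrix name `L` is hypothetical, the identity ||b b^T + L||_F^2 = n^2 + 2 b^T L b + ||L||_F^2 holds for every binary b. Minimizing b^T L b is then the same as fitting b b^T to -L in the least-squares sense, which has the same type as the objective of Eq. (4) for a single bit.

```python
import numpy as np

def bqp_value(b, L):
    """Binary quadratic objective b^T L b."""
    return float(b @ L @ b)

def matrix_fit_value(b, L):
    """Least-squares fit of the rank-one matrix b b^T to -L."""
    return float(np.linalg.norm(np.outer(b, b) + L, "fro") ** 2)

# Usage: both sides of the identity for a random symmetric L and a random bit vector.
rng = np.random.default_rng(3)
n = 6
L = rng.standard_normal((n, n))
L = (L + L.T) / 2
b = np.where(rng.standard_normal(n) >= 0, 1.0, -1.0)
lhs = matrix_fit_value(b, L)
rhs = n ** 2 + 2 * bqp_value(b, L) + np.linalg.norm(L, "fro") ** 2
print(lhs, rhs)
```

The identity relies only on b having entries in {-1, +1}, so that ||b b^T||_F^2 = n^2 is constant over the feasible set.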
Since incurs large computation and storage costs for large , we select training anchors to construct based on different loss functions. Replacing in Eq. (24) with , it becomes:
(25) 
Similar to solving Eq. (4), we can firstly obtain and then calculate . To attain , we still gradually update each batch by solving the following problem:
(26) 
where .
The optimization type of Eq. (26) is the same as that of Eq. (16). can be obtained by gradually updating as shown in GSDH_P. After obtaining , can be attained by using .
Based on Eqs. (22)-(26), many pairwise based hashing models can learn binary codes by using GSDH_P. For instance, we show the performance on solving the optimization model in BRE [8] [42]:
(27) 
where is the Hamming distance between and , and is an indicator function. Here, denotes .
One typical model with a hinge loss function [10] [42]: