Short Text Hashing Improved by Integrating Multi-Granularity Topics and Tags
Abstract
Due to the computational and storage efficiency of compact binary codes, hashing has been widely used for large-scale similarity search. Unfortunately, many existing hashing methods based on observed keyword features are not effective for short texts due to their sparseness and shortness. Recently, some researchers have tried to utilize latent topics of a certain granularity to preserve semantic similarity in hash codes beyond keyword matching. However, topics of a single granularity are not adequate to represent the intrinsic semantic information. In this paper, we present a novel unified approach for short text Hashing using Multi-granularity Topics and Tags, dubbed HMTT. In particular, we propose a selection method to choose the optimal multi-granularity topics depending on the type of dataset, and design two distinct hashing strategies to incorporate multi-granularity topics. We also propose a simple and effective method to exploit tags to enhance the similarity of related texts. We carry out extensive experiments on one short text dataset as well as on one normal text dataset. The results demonstrate that our approach is effective and significantly outperforms baselines on several evaluation metrics.
Keywords:
Similarity Search, Hashing, Topic Features, Short Text

1 Introduction
With the explosion of social media, numerous short texts have become available in a variety of genres, e.g., tweets, instant messages, questions on Question and Answer (Q&A) websites and online advertisements [6]. In order to conduct fast similarity search in those massive datasets, hashing, which tries to learn similarity-preserving binary codes for document representation, has been widely used to accelerate similarity search. Unfortunately, many existing hashing methods based on the keyword feature space usually fail to fully preserve the semantic similarity of short texts due to the sparseness of the original feature space. For example, consider the following three short texts:
d1: “Rafael Nadal missed the Australian Open”;
d2: “Roger Federer won Grand Slam title”;
d3: “Tiger Woods broke numerous golf records”.
Obviously, hashing methods based on the keyword space cannot see the similarity among d1, d2 and d3. In recent years, some researchers have sought to address this challenge with latent semantic approaches. For example, Wang et al. [12] preserve the semantic similarity of documents in hash codes by fitting the topic distributions, and Xu et al. [14] directly treat latent topic features as tokens to represent a document for hashing learning. However, topics of a single granularity are not adequate to represent the intrinsic semantic information [4]. As we know, topic models with different predefined numbers of topics can extract topics at different semantic levels. For example, a topic model with a large number of topics can extract fine-grained topic features, such as “Tennis Open Progress” for d1 and d2 and “Golf Star News” for d3, but fails to construct the semantic relevance of d3 to the other texts; a topic model with few topics can extract coarse-grained semantic features, such as “Sport” and “Star” for d1, d2 and d3, but these lack distinguishing information and cannot support effective hash function learning. It is therefore a reasonable assumption that multi-granularity topics are more suitable for preserving semantic similarity and learning the hashing function for short texts.
On the other hand, tags are not fully utilized in many hashing methods. Actually, in various real-world applications, documents are often associated with multiple tags, which provide useful knowledge for learning effective hash codes [12]. For instance, on Q&A websites, each question has category labels or related tags assigned by its questioner. Another example is microblogging, where some tweets are labeled by their authors with hashtags in the form of “#keyword”. Thus, we should fully exploit the information contained in tags to strengthen the semantic relationships of related texts for hashing learning.
Based on the above observations, this paper proposes a unified approach for short text Hashing using Multi-granularity Topics and Tags, referred to as HMTT for simplicity. In HMTT, two different ways are introduced to incorporate multi-granularity topics and tag information for improving short text hashing.
The main contributions of this paper are threefold. First, a novel unified short text hashing approach is proposed. To the best of our knowledge, this is the first attempt to incorporate multi-granularity topics and tags into a unified hashing approach, and experiments are conducted to verify our assumption that short text hashing can be improved by integrating them. Second, the optimal multi-granularity topics can be selected automatically, i.e., effective latent topic features can be extracted for hashing learning. The experimental results indicate that the optimal multi-granularity topics achieve better performance than other multi-granularity topics. Finally, two strategies to incorporate multi-granularity topics for short text hashing are designed and compared through extensive experimental evaluations and analyses.
2 Related Work
Hash-based methods can be mainly divided into two categories. One category is data-oblivious hashing. As the most popular hashing technique, Locality-Sensitive Hashing (LSH) [1], based on random projection, has been widely used for similarity search. However, since such methods are not aware of the data distribution, they may generate quite inefficient hash codes in practice [16]. Recently, more researchers have focused on the other category, data-aware hashing. For example, Spectral Hashing (SpH) [13] generates compact binary codes by enforcing balanced and uncorrelated constraints on the learned codes. Self-Taught Hashing (STH) [18] and Two-Step Hashing (TSH) [9] decompose the learning procedure into two steps, binary code generation and hash function learning, and a supervised version of STH, denoted STHs, is proposed in [16]. However, these hashing methods, working directly in the keyword feature space, usually fail to fully preserve semantic similarity. More recently, Wang et al. [12] proposed Semantic Hashing using Tags and Topic Modeling (SHTTM). The limitations of SHTTM are that, although topic distributions are used to preserve content similarity when generating hash codes, the topics are not utilized to improve hash function learning; moreover, the number of topics must be consistent with the dimensionality of the hash code, an assumption too strict to capture the optimal semantic features for different types of datasets.
3 Algorithm Description
The unified short text hashing approach HMTT is depicted in Fig. 1. Given a training set of n texts denoted as X = {x_1, …, x_n} ⊂ R^d, where d is the dimensionality of the keyword feature space, denote the tag vector of x_i as t_i ∈ {0,1}^m, where m is the total number of possible tags associated with each text. An entry of 1 means a text is associated with a certain tag/category, while an entry of 0 means a missing tag or that the text is not associated with that tag/category. The goal of HMTT is to obtain optimal binary codes Y = {y_1, …, y_n} and a hashing function f: R^d → {0,1}^l, which embeds a query text into its binary vector representation with l bits. To achieve the similarity-preserving property, we require similar texts to have similar binary codes in Hamming space. We first select the optimal topic models from the candidate topic models and extract the multi-granularity topic features. Then the binary codes and hash functions are learned by integrating the multi-granularity topic features and tags. In the second, online phase, the query text is represented by the binary code mapped from the derived hash function, and approximate nearest neighbor search is performed in Hamming space: all pairs of hash codes found within a certain Hamming distance of each other are regarded as semantically similar texts.
The main challenges are: (1) how to select the optimal topic models; (2) how to utilize the tag information efficiently; and (3) how to integrate the multi-granularity topics to preserve semantic similarity. The proposed approach HMTT is described in detail in the following sections.
3.1 Estimate and Select the Optimal Topics
In this work, we straightforwardly obtain a set of candidate topic models by predefining several different topic numbers for Latent Dirichlet Allocation (LDA) [3]. After training the topic models, we can draw multi-granularity topic features, represented as distributions over the topics, from the candidate topic models.
In order to select the optimal topic models, we utilize the tag information to evaluate the quality of topics. Inspired by [4, 7], the selection of the optimal topic sets depends on their capability to help discriminate short texts that do not share any common tags. We denote the candidate topic sets as T = {T_1, …, T_K}. For each topic set T_i, the topic distribution of a text x is denoted θ_x^{(i)}. The weight vector is w = (w_1, …, w_K), where w_i indicates the importance of topic set T_i. The purpose is to select the G optimal topic sets. In [4], Chen et al. evaluate the quality of topics based on two aspects, discrimination and complementarity of the multi-granularity topics. However, balancing these two aspects is a tricky problem, and the latter, complementarity, easily introduces noise into similarity preservation. Thus, we propose a simple and effective method directly based on the key idea of Relief [7], as follows. First, a subset of tagged texts is sampled from the training dataset, and for each sampled text x we find two groups of nearest neighbors: one group from the texts sharing at least one common tag with x (denoted H), and the other from the texts sharing no common tag (denoted M). Then each weight w_i is updated as follows:
\[ w_i \leftarrow w_i - \frac{1}{|H|}\sum_{x_h \in H} d_i(x, x_h) + \frac{1}{|M|}\sum_{x_m \in M} d_i(x, x_m) \tag{1} \]
where d_i(x, x') is the symmetric Kullback-Leibler (KL) divergence between the topic distributions of x and x' under topic set T_i:
\[ d_i(x, x') = \tfrac{1}{2}\left[ D_{KL}\big(\theta_x^{(i)} \,\|\, \theta_{x'}^{(i)}\big) + D_{KL}\big(\theta_{x'}^{(i)} \,\|\, \theta_x^{(i)}\big) \right] \]
After updating the weight vector, we directly select the G optimal topic sets according to the top weight values. In summary, the optimal topic selection procedure is depicted in Algorithm 1.
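To make the selection procedure concrete, the following sketch (illustrative only; function and variable names are our own, not from the paper) updates one weight per candidate topic set using the symmetric KL divergence between topic distributions of nearest same-tag neighbours (hits) and different-tag neighbours (misses):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def update_topic_weights(weights, theta, hits, misses):
    """One Relief-style pass: for each topic set i, decrease the weight by the
    average divergence to same-tag neighbours (hits) and increase it by the
    average divergence to different-tag neighbours (misses).

    weights : list of floats, one per candidate topic set
    theta   : list of dicts, theta[i][doc] -> topic distribution under set i
    hits    : list of (doc, neighbour) pairs sharing a tag
    misses  : list of (doc, neighbour) pairs sharing no tag
    """
    new_w = list(weights)
    for i in range(len(weights)):
        hit_term = np.mean([sym_kl(theta[i][a], theta[i][b]) for a, b in hits])
        miss_term = np.mean([sym_kl(theta[i][a], theta[i][b]) for a, b in misses])
        new_w[i] = weights[i] - hit_term + miss_term
    return new_w
```

A topic set that keeps same-tag texts close and different-tag texts apart ends up with a larger weight, which is exactly the ranking criterion used for the final selection.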
3.2 Content Similarity and Tags Preservation
In a hashing problem, one key component is how to define the affinity matrix S. Diverse approaches can be applied to construct the similarity matrix. In this paper, we choose the cosine function as an example and use the local similarity structure of all text pairs to define the similarity function as follows:
\[ S_{ij} = \begin{cases} \alpha \cdot \cos(x_i, x_j), & x_j \in NN_k(x_i) \text{ and } x_i, x_j \text{ share a common tag} \\ \beta \cdot \cos(x_i, x_j), & x_j \in NN_k(x_i) \text{ and no common tag} \\ 0, & \text{otherwise} \end{cases} \tag{2} \]
where NN_k(x_i) represents the set of k nearest neighbors of x_i, and α and β are confidence coefficients. If two documents x_i and x_j share any common tag, we apply the higher value α; conversely, the lower value β is applied if the two documents are not related. The parameters satisfy α > β > 0. For a particular dataset, the more trustworthy the tags are, the greater the difference between α and β we set. The specific values of α and β are fixed empirically in our experiments.
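A minimal sketch of the tag-adjusted similarity of Eq. 2 might look as follows; the values alpha=0.9 and beta=0.1 are illustrative placeholders, not the settings used in our experiments:

```python
import numpy as np

def tag_adjusted_similarity(X, tags, k=2, alpha=0.9, beta=0.1):
    """Cosine similarity over k-nearest-neighbour pairs, scaled up (alpha)
    when two texts share a tag and down (beta) when they do not."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = unit @ unit.T
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbours of i by cosine similarity (excluding i itself)
        nn = [j for j in np.argsort(-cos[i]) if j != i][:k]
        for j in nn:
            shared = bool(set(tags[i]) & set(tags[j]))
            S[i, j] = (alpha if shared else beta) * cos[i, j]
    return np.maximum(S, S.T)  # symmetrise while keeping the local structure
```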
3.3 Learning to Hash with Multi-Level Topics
Below, from different perspectives, we propose two strategies to integrate multi-granularity topics for improving short text hashing.
Feature-Level Fusion
In order to integrate multi-granularity topics, we adopt a simple but powerful way to combine observed features and latent features for short texts, similar to [10] and [4], and create a high-dimensional vector as:
\[ \tilde{x}_i = \big[ x_i;\; \lambda_1 z_i^{(1)};\; \lambda_2 z_i^{(2)};\; \ldots;\; \lambda_G z_i^{(G)} \big] \tag{3} \]
where z_i^{(g)} are the topic features of x_i under the g-th optimal topic set, and the balance parameters λ_g are derived from the learned weights:
\[ \lambda_g = \frac{w_g}{\min_{g'} w_{g'}} \tag{4} \]
We can straightforwardly construct the similarity matrix S by Eq. 2 with the new features of the training texts. Similar to Two-Step Hashing (TSH) [9], we treat binary code generation and hash function learning as two separate steps. As a concrete example, the Laplacian affinity loss and linear SVM are chosen to solve our problem. In the first step, the training of hash codes can be formulated as the following optimization:
\[ \min_{Y} \sum_{i,j} S_{ij} \, \big\| y_i - y_j \big\|_F^2 \quad \text{s.t. } Y \in \{0, 1\}^{n \times l} \tag{5} \]
where S_ij is the pairwise similarity between documents x_i and x_j, y_i is the hash code of x_i, and ‖·‖_F is the Frobenius norm. To satisfy similarity preservation, we seek to minimize this quantity, because it incurs a heavy penalty if two similar documents are mapped far apart. The problem is relaxed by discarding the discrete constraint, and the optimal l-dimensional real-valued vectors can be obtained by solving a Laplacian Eigenmaps problem [2]. They are then converted into binary codes via the median vector. In the hash function learning step, treating each bit of the binary code as a binary class label for that text, we train l linear SVM classifiers to predict the l-bit binary code for any query document. Algorithm 2 shows the procedure of this strategy.
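The two-step strategy above can be sketched as follows. As an assumption for illustration only, we substitute a least-squares linear predictor for the linear SVMs used in the paper, so the example needs nothing beyond numpy:

```python
import numpy as np

def two_step_hash(S, X, n_bits):
    """Two-step hashing sketch.

    Step 1: relax the Laplacian affinity loss, solve the eigen-problem of
    L = D - S, and threshold the real-valued codes at their median.
    Step 2: fit one linear predictor per bit (least squares here, as a
    lightweight stand-in for the linear SVMs of the paper).
    Assumes the similarity graph S is connected.
    """
    D = np.diag(S.sum(axis=1))
    L = D - S
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    Y = vecs[:, 1:n_bits + 1]            # skip the trivial constant eigenvector
    med = np.median(Y, axis=0)
    codes = (Y > med).astype(int)        # binary codes for the training texts
    W, *_ = np.linalg.lstsq(X, Y - med, rcond=None)  # per-bit linear hash functions
    return codes, W

def hash_query(x, W):
    """Map a query feature vector to its binary code."""
    return (np.asarray(x) @ W > 0).astype(int)
```

The median threshold keeps each bit balanced, and the learned linear functions let an unseen query be hashed with a few dot products.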
Decision-Level Fusion
From another perspective, we can treat the optimal multi-granularity topic feature sets extracted from short texts as multi-view features. In our setting, there are G views of features: Z^{(1)}, …, Z^{(G)}. We take a linear sum of the per-view similarities as follows:
\[ S_{ij} = \sum_{g=1}^{G} S_{ij}^{(g)} \tag{6} \]
where S^{(g)}, constructed as in Eq. 2, is the affinity matrix defined on the g-th view of features. By introducing a diagonal matrix D^{(g)} whose entries are given by D_{ii}^{(g)} = \sum_j S_{ij}^{(g)}, Eq. 6 can be rewritten as \sum_g \mathrm{tr}(Y^T L^{(g)} Y), where L^{(g)} = D^{(g)} - S^{(g)} is the Laplacian matrix defined on the g-th view. By adopting Composite Hashing with Multiple Information Sources (CHMIS) [15], a representative of Multi-View Hashing (MVH), we can simultaneously learn the hash codes Y of the training texts as well as a set of linear hash functions to infer the hash code for a query text. The overall objective function is given as follows:
\[ \min_{Y, \mu, W} \; \sum_{g=1}^{G} \mu_g \, \mathrm{tr}\big(Y^T L^{(g)} Y\big) + \gamma \sum_{g=1}^{G} \big\| Y - Z^{(g)} W^{(g)} \big\|_F^2 + \eta \sum_{g=1}^{G} \big\| W^{(g)} \big\|_F^2 \quad \text{s.t. } Y \in \{0, 1\}^{n \times l} \tag{7} \]
where γ and η are trade-off parameters, tr(·) is the matrix trace function, μ = (μ_1, …, μ_G) is a combination coefficient vector that balances the outputs of the view features, and W^{(1)}, …, W^{(G)} are a series of linear hash function matrices. In order to solve this hard optimization problem, we first relax the discrete constraint on Y, and then iteratively optimize one variable with the other two fixed. More detailed optimization procedures for this method can be found in [15]. Different from the former strategy, we do not need to preallocate the weight of each view, because the combination coefficient vector μ, learned iteratively during optimization, balances the outputs of the views. The procedure of this strategy is shown in Algorithm 3.
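Assuming per-view affinity matrices are available, the fused similarity term of Eq. 6, rewritten with view Laplacians, can be evaluated as in this sketch; the coefficients mu stand in for the combination vector that CHMIS learns iteratively, and here they are simply supplied by the caller:

```python
import numpy as np

def combined_laplacian_objective(Y, S_views, mu):
    """Evaluate sum_g mu_g * tr(Y^T L^(g) Y), the fused multi-view term,
    where L^(g) = D^(g) - S^(g) is the graph Laplacian of the g-th view's
    affinity matrix."""
    total = 0.0
    for S, m in zip(S_views, mu):
        D = np.diag(S.sum(axis=1))      # degree matrix of this view
        L = D - S                       # view-specific Laplacian
        total += m * float(np.trace(Y.T @ L @ Y))
    return total
```

Since tr(Y^T L Y) equals half the similarity-weighted sum of squared code differences, the value is zero when every connected pair shares identical codes and grows with each similar pair mapped to different codes.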
3.4 Complexity Analysis
The training processes, including binary code learning and hash function training, are always conducted offline. Thus, our focus for efficiency is the prediction process. Generating the hash code for a query text only involves some Gibbs sampling iterations to extract the multi-granularity topics and dot products in the hash function, which can be done in O(N_s · K · d̄ + l · (K + d̄)). Here, N_s is the number of Gibbs sampling iterations for topic inference, K is the sum of the multi-granularity topic numbers, l is the dimensionality of the hash code and d̄ denotes the sparsity (number of non-zero keyword features) of a document. The values of these parameters can be regarded as quite small constants; for example, the average sparsity per document is no more than 100 in our experimental datasets. We can see that the major time cost is the Gibbs sampling for topic inference. In recent work, many studies focus on accelerating topic inference. For example, for the Biterm Topic Model (BTM), [5] gives a simple and efficient method without Gibbs sampling iterations, and the time complexity of topic inference can be reduced to O(|b| · K), where |b| is the number of biterms in a query text.
4 Experiment and Analysis
4.1 Dataset and Experimental Settings
We carried out extensive experiments on two publicly available real-world text datasets: one is a typical short text dataset, Search Snippets, and the other is a normal text dataset, 20Newsgroups.
The Search Snippets dataset collected by Phan [10] was selected from the results of web search transactions using predefined phrases of 8 different domains. We further filter stop words and stem the texts. This leaves 20139 distinct words, 10059 training texts and 2279 test texts, and the average text length is 17.1.
The 20Newsgroups corpus was collected by Lang [8]. We use the popular ‘bydate’ version, which contains 20 categories, 26214 distinct words, 11314 training texts and 7532 test texts; the average text length is 136.7.
For these datasets, we take the category labels as tags. For Search Snippets, we use a large-scale corpus [10] crawled from Wikipedia to estimate the topic models; for 20Newsgroups, the original keyword features are directly used to learn the candidate topic models, owing to its sufficient keyword features. In order to evaluate our method's performance, we compute standard retrieval measures, recall and precision, by using each document in the test set as a query to retrieve documents in the training set within a specified Hamming distance. Since the original keyword feature space cannot well reflect the semantic similarity of documents, and even less so for short texts, we simply test whether two documents share any common tag to decide whether they are semantically similar. This methodology is also used in SH [11], STH [18], CHMIS [15] and SHTTM [12].
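Under the ground-truth rule above (a retrieved text counts as relevant when it shares at least one tag with the query), precision within a Hamming radius can be sketched as:

```python
import numpy as np

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def precision_at_radius(query_code, query_tags, train_codes, train_tags, radius=3):
    """Retrieve training texts within `radius` Hamming distance of the query
    and score a hit when the pair shares at least one tag. Returns None when
    nothing falls inside the radius."""
    retrieved = [i for i, c in enumerate(train_codes)
                 if hamming(query_code, c) <= radius]
    if not retrieved:
        return None
    hits = sum(1 for i in retrieved if set(query_tags) & set(train_tags[i]))
    return hits / len(retrieved)
```

Averaging this value over all test queries gives the mP@Hamming Radius measure; ranking by Hamming distance and truncating the list gives mP@Top-N.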
Five alternative hashing methods are compared with our proposed approach: STHs [16], STH [18], LCH [17], LSI [11] and SpH [13]. The results of all baseline methods are obtained with the open-source implementations provided on the corresponding authors' homepages. In order to distinguish the two proposed strategies in our approach, the feature-level fusion method is denoted HMTT-Fea, and the decision-level fusion method is denoted HMTT-Dec.
In our experiments, the candidate topic sets are T = {T_10, T_30, T_50, T_70, T_90, T_120, T_150}, where T_K denotes an LDA model with K topics, and the number of optimal topic sets is fixed to G = 3. The trade-off parameters γ and η in Eq. 7 are tuned over {0.1, 1, 10, 100}. The number of nearest neighbors is fixed to 25 when constructing the graph Laplacians in our approach, as well as in the baseline methods STHs and STH. We evaluate the performance of the different methods by varying the number of hashing bits from 4 to 64. For LDA, we used the open-source implementation GibbsLDA.
4.2 Results and Analysis
We randomly sample 100 tagged texts per category from the training dataset and set the number of nearest neighbors in Eq. 1 to 10 to evaluate the quality of the topic sets by Algorithm 1. As the number of optimal topic sets is fixed to 3, we obtain the optimal topic sets O = {T_10, T_30, T_50} for both datasets coincidentally, with weight vectors w = {3.44, 1.7, 1} for Search Snippets and w = {1.31, 1.22, 1} for 20Newsgroups. It is noteworthy that the weight values of the topic sets are affected by both the type of dataset and the settings of LDA. Below, a series of experiments is conducted to answer the following questions: (1) How does the proposed approach HMTT compare with the baseline methods? (2) Can the optimal multi-granularity topics outperform single-granularity topics and other multi-granularity topics? (3) Which of the two strategies for integrating multi-granularity topics achieves better performance?
Comparison with existing hashing methods: In this section, we design an improved version of STHs, denoted STHs-Tag, by replacing the original construction of the similarity matrix with the proposed method described in Section 3.2. We randomly remove 60 percent of the tags from the training dataset to verify the robustness of HMTT-Fea, HMTT-Dec, STHs and STHs-Tag. The precision-recall curves for retrieved examples are reported in Fig. 2. From these comparison results, we can see that HMTT-Fea and HMTT-Dec significantly outperform the other baseline methods on Search Snippets, as shown in Fig. 2 (a). For 20Newsgroups, HMTT-Dec performs comparably to STHs-Tag in Fig. 2 (b). Two factors explain this: first, 20Newsgroups, as a normal text dataset, has sufficient original features to learn hash codes, so STHs-Tag based on keyword features works well; second, we learn the topic models of 20Newsgroups directly from the training dataset, which imposes some restrictions. Furthermore, STHs performs worse than STHs-Tag on both datasets, because STHs uses a completely supervised approach that utilizes only the pairwise similarity of documents with common tags, and thus cannot deal well with missing or incomplete tags. In our approach, we extract the optimal multi-granularity topics depending on the type of dataset to learn the hash codes and hashing function, and the tags are used only to adjust the similarity, which yields stronger robustness. In the following experiments, we keep all tags to improve the performance of hashing learning.



Table 1. Precision of different topic sets on Search Snippets (mP@Top 200 and mP@Hamming Radius 3). The starred row marks the optimal topic sets selected by Algorithm 1; the W1 row fixes all balance weights to 1.

—  mP@Top 200  mP@Hamming Radius 3
Methods  HMTT-Fea  HMTT-Dec  HMTT-Fea  HMTT-Dec
Code Length  8 bits  16 bits  8 bits  16 bits  8 bits  16 bits  8 bits  16 bits
10-30-50*  0.829  0.799  0.826  0.782  0.411  0.802  0.403  0.778
10-70-90  0.819  0.800  0.797  0.762  0.375  0.789  0.328  0.754
30-90-150  0.802  0.787  0.801  0.755  0.393  0.777  0.382  0.757
10-30  0.810  0.789  0.776  0.757  0.382  0.776  0.374  0.744
10-50  0.813  0.788  0.772  0.752  0.383  0.790  0.334  0.740
30-50  0.806  0.796  0.805  0.777  0.393  0.779  0.369  0.764
10-30-50 (W1)  0.811  0.780  0.822  0.778  0.368  0.761  0.398  0.774
10  0.627  0.624  0.639  0.602  0.316  0.610  0.296  0.576
30  0.792  0.764  0.728  0.708  0.377  0.757  0.335  0.692
50  0.782  0.758  0.731  0.723  0.360  0.730  0.320  0.707
70  0.771  0.755  0.728  0.720  0.365  0.747  0.318  0.704
90  0.757  0.733  0.735  0.708  0.363  0.736  0.332  0.692
120  0.730  0.705  0.707  0.700  0.366  0.714  0.309  0.683
150  0.740  0.727  0.675  0.674  0.370  0.729  0.304  0.660
Comparison with single-granularity and other multi-granularity topic sets: Here, the hashing performance of the optimal multi-granularity topics is compared with that of single-granularity and other multi-granularity topic sets. We further evaluate the balance values of the multi-granularity topics by fixing them to 1; in particular, we keep the parameters λ_g in Eq. 3 and μ_g in Eq. 7 at 1 for HMTT-Fea and HMTT-Dec, respectively. The quantitative results on Search Snippets are reported in Table 1. From the results, we can see that multi-granularity topics significantly outperform single-granularity topics, and the optimal multi-granularity topics achieve better performance in most situations. We observe similar results on 20Newsgroups, but due to space limits we present only the results on the typical short text dataset, Search Snippets.
Comparison between the two proposed strategies: Finally, we discuss the performance of the two proposed strategies, HMTT-Fea and HMTT-Dec. In HMTT-Fea, we directly concatenate the multi-granularity topics to produce one feature vector and decompose the hashing learning problem into two separate stages. In HMTT-Dec, the multi-granularity topics extracted from the text content are treated as multi-view features, and we simultaneously learn the hash codes and hash functions. From the results in Table 1, we can see that HMTT-Fea surpasses HMTT-Dec on several evaluation metrics. Evidently, the former strategy is simpler and more effective for short text hashing in our approach. In summary, whether with HMTT-Fea or HMTT-Dec, the experimental results indicate that short text hashing can be improved by integrating multi-granularity topics.
5 Discussions and Conclusions
Short text hashing is a challenging problem due to the sparseness of text representation. To address this challenge, tags and latent topics should be fully and properly utilized to improve hashing learning. Furthermore, it is better to estimate the topic models from an external large-scale corpus, and the optimal topics should be selected depending on the type of dataset. This paper uses a simple and effective selection method based on the symmetric KL-divergence of topic distributions; we believe many other selection methods are worth exploring further. Another key issue worthy of research is how to integrate multi-granularity topics effectively. In this paper, we propose a novel unified hashing approach for short text retrieval. In particular, the optimal multi-granularity topics are chosen depending on the type of dataset. We then use them to learn hash codes and the hashing function in two distinct ways, while tags are utilized to enhance the semantic similarity of related texts. Extensive experiments demonstrate that the proposed method outperforms competitive methods on two public datasets.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 61203281 and No. 61303172.
Footnotes
 http://jwebpro.sourceforge.net/datawebsnippets.tar.gz
 http://people.csail.mit.edu/jrennie/20Newsgroups/
 https://github.com/jacoxu/shorttexthashingHMTT, http://www.CICLing.org/2015/data/148
 http://jgibblda.sourceforge.net/
References
 Andoni, A., Indyk, P.: Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on. pp. 459–468. IEEE (2006)
 Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15(6), 1373–1396 (2003)
 Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. the Journal of Machine Learning Research 3, 993–1022 (2003)
 Chen, M., Jin, X., Shen, D.: Short text classification improved by learning multigranularity topics. In: Proceedings of the 22nd international joint conference on Artificial Intelligence. pp. 1776–1781. AAAI Press (2011)
 Cheng, X., Lan, Y., Guo, J., Yan, X.: Btm: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering p. 1 (2014)
 Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: CIKM. pp. 775–784. ACM (2011)
 Kononenko, I.: Estimating attributes: analysis and extensions of relief. In: Machine Learning: ECML94. pp. 171–182. Springer (1994)
 Lang, K.: Newsweeder: Learning to filter netnews. In: In Proceedings of the Twelfth International Conference on Machine Learning. Citeseer (1995)
 Lin, G., Shen, C., Suter, D., Hengel, A.v.d.: A general twostep approach to learningbased hashing. In: Computer Vision (ICCV), 2013 IEEE International Conference on. pp. 2552–2559. IEEE (2013)
 Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from largescale data collections. In: Proceedings of the 17th international conference on World Wide Web. pp. 91–100. ACM (2008)
 Salakhutdinov, R., Hinton, G.: Semantic hashing. International Journal of Approximate Reasoning 50(7), 969–978 (2009)
 Wang, Q., Zhang, D., Si, L.: Semantic hashing using tags and topic modeling. In: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. pp. 213–222. ACM (2013)
 Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Advances in neural information processing systems. pp. 1753–1760 (2009)
 Xu, J., Liu, P., Wu, G., Sun, Z., Xu, B., Hao, H.: A fast matching method based on semantic similarity for short texts. In: Natural Language Processing and Chinese Computing, pp. 299–309. Springer (2013)
 Zhang, D., Wang, F., Si, L.: Composite hashing with multiple information sources. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. pp. 225–234. ACM (2011)
 Zhang, D., Wang, J., Cai, D., Lu, J.: Extensions to selftaught hashing: Kernelisation and supervision. practice 29, 38 (2010)
 Zhang, D., Wang, J., Cai, D., Lu, J.: Laplacian cohashing of terms and documents. In: Advances in Information Retrieval, pp. 577–580. Springer (2010)
 Zhang, D., Wang, J., Cai, D., Lu, J.: Selftaught hashing for fast similarity search. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. pp. 18–25. ACM (2010)