Learning Topics using Semantic Locality
Topic modeling discovers the latent topic probabilities of given text documents. To generate more meaningful topics that better represent a given document, we propose a new feature extraction technique that can be used in the data preprocessing stage. The method consists of three steps. First, it generates words and word pairs from every document. Second, it applies a two-way TF-IDF algorithm to the words and word pairs for semantic filtering. Third, it uses the K-means algorithm to merge word pairs that have similar semantic meaning.
Experiments are carried out on the Open Movie Database (OMDb), Reuters Dataset and 20NewsGroup Dataset. The mean Average Precision score is used as the evaluation metric. Compared with other state-of-the-art topic models, such as Latent Dirichlet Allocation and the traditional Restricted Boltzmann Machine, our proposed data preprocessing improves the generated topic accuracy by up to 12.99%.
During the last decades, most collective information has been digitized to form an immense database distributed across the Internet. Among all of it, text-based knowledge is dominant because of its vast availability and numerous forms. For example, news, articles, and even Twitter posts are various kinds of text documents. On one hand, it is difficult for human users to locate their search targets in the sea of countless texts without a well-defined computational model to organize the information. On the other hand, in this big data era, the e-commerce industry takes great advantage of machine learning techniques to discover customers’ preferences. Examples include notifying a customer of the release of “Star Wars: The Last Jedi” if he/she has ever purchased tickets for “Star Trek Beyond”, or recommending “A Brief History of Time” by Stephen Hawking to a reader who has “Relativity: The Special and General Theory” by Albert Einstein in the shopping cart on Amazon. Content-based recommendation is achieved by analyzing the theme of each item, extracted from its text description.
Topic modeling is a collection of algorithms that aim to discover and annotate large archives of documents with thematic information. General topic modeling algorithms usually do not require any prior annotation or labeling of the documents; the abstraction is the output of the algorithms. Topic modeling enables us to convert a large collection of documents into a set of topic vectors. Each entry in this concise representation is a probability of the latent topic distribution. By comparing the topic distributions, we can easily calculate the similarity between two different documents. The availability of many manually categorized online documents, such as Internet Movie Database (IMDb) movie reviews and Wikipedia articles, makes the testing and validation of topic models possible.
Some topic modeling algorithms are frequently used in text mining, preference recommendation and computer vision. Many of the traditional topic models focus on latent semantic analysis with unsupervised learning. Latent Semantic Indexing (LSI) applies Singular-Value Decomposition (SVD) to transform the term-document matrix to a lower dimension where semantically similar terms are merged. It can be used to report the semantic distance between two documents; however, it does not explicitly provide the topic information. The Probabilistic Latent Semantic Analysis (PLSA) model uses maximum likelihood estimation to extract latent topics and topic-word distributions, while the Latent Dirichlet Allocation (LDA) model performs iterative sampling and characterization to search for the same information. The Restricted Boltzmann Machine (RBM) is also a very popular model for topic modeling. By training a two-layer model, the RBM can learn to extract the latent topics in an unsupervised way.
All of the existing works are based on the bag-of-words model, where a document is considered as a collection of words. The semantic information of words and the interactions among objects are assumed to be unknown during model construction. Such a simple representation can be improved by recent research in natural language processing and word embedding. In this paper, we explore the existing knowledge and build a topic model using explicit semantic analysis.
This work studies effective data processing and feature extraction for topic modeling and information retrieval. We investigate how the available semantic knowledge, which can be obtained from language analysis, can assist in the topic modeling.
Our main contributions are summarized as follows:
A new topic model is designed which combines two classes of text features as the model input.
We demonstrate that feature selection based on semantically related word pairs provides richer information than the simple bag-of-words approach.
The proposed semantic-based feature clustering effectively controls the model complexity.
Compared to existing feature extraction and topic modeling approaches, the proposed model improves the accuracy of the topic prediction by up to 12.99%.
The rest of the paper is structured as follows: In Section II, we review the existing methods that inspired this work. This is followed in Section III by details about our topic models. Section IV describes our experimental steps and analyzes the results. Finally, Section V concludes this work.
II Related Work
Many topic models have been proposed in the past decades, including LDA, Latent Semantic Analysis (LSA), word2vec, and RBM. In this section, we compare the pros and cons of these models in terms of their topic modeling performance.
LDA is one of the most widely used topic models. LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words. LSA is another topic modeling technique that is frequently used in information retrieval. LSA learns latent topics by performing a matrix decomposition (SVD) on the term-document matrix. In practice, training the LSA model is faster than training the LDA model, but the LDA model is more accurate than the LSA model.
Traditional topic models do not consider the semantic meaning of each word and cannot represent the relationships between different words. Word2vec can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. During training, the model generates word-context pairs by applying a sliding window to scan through a text corpus. The word2vec model then trains word embeddings from these word-context pairs using the continuous bag-of-words (CBOW) model and the skip-gram model. The vectors of two words can be summed to form a semantically meaningful combination of both words.
The RBM was proposed to extract low-dimensional latent semantic representations from a large collection of documents. The architecture of the RBM is an undirected bipartite graph, in which word-count vectors are modeled as Softmax input units and the output units are binary units. Contrastive Divergence learning is used to approximate the gradient. By running the Gibbs sampler, the RBM reconstructs the distribution of the units. A deeper neural network structure, the Deep Belief Network (DBN), was developed based on stacked RBMs. In that network, the input layer is the same as in the RBM described above, and all other layers are binary units.
In this work, we adopt the Restricted Boltzmann Machine (RBM) for topic modeling, and investigate feature selection for this model. Another state-of-the-art model in topic modeling is the LDA model. As mentioned in Section II, LDA is a statistical model that is widely used for topic modeling. However, previous research shows that RBM-based topic modeling gives 5.45% to 19.94% higher accuracy than the LDA-based model. In Section IV, we also compare the mAP scores of these two models when applied to three different datasets. Our results also show that the RBM model has better efficiency and accuracy than the LDA model. Hence, we focus our discussion only on RBM-based topic modeling.
Our feature selection consists of three steps handled by three different modules: the feature generation module, the feature filtering module and the feature coalescence module. The whole structure of our framework is shown in Figure 1. Each module is elaborated in the following subsections.
The proposed feature selection is based on our observation that word dependencies provide more semantic information than simple word counts. However, there are quadratically more dependent word pair relationships than words. To avoid an explosion of the feature set, filtering and coalescing must be performed. Overall, the three steps perform feature generation, screening and pooling.
III-A Feature Generation: Semantic Word Pair Extraction
The current RBM model for topic modeling uses the bag-of-words approach. Each visible neuron represents the number of occurrences of a dictionary word. We believe that the order of the words also carries rich information, which is not captured by the bag-of-words approach. Our hypothesis is that including word pairs (with specific dependencies) helps to improve topic modeling.
In this work, the Stanford natural language parser is used to analyze sentences in both the training and testing corpora, and to extract word pairs that are semantically dependent. Universal Dependencies (UD) are used during the extraction. For example, consider the sentence “Lenny and Amanda have an adopted son Max who turns out to be brilliant.”, which is part of the description of the movie “Mighty Aphrodite” from the OMDb dataset. Figure 2 shows all the dependent word pairs extracted using the Stanford parser. Their order is illustrated by the arrows connecting them, and their relationship is marked beside the arrows. As the figure shows, the dependent words are not necessarily adjacent to each other; however, they are semantically related.
Because each single word may combine with many other words during the dependency extraction, the total number of word pairs is much larger than the number of words in the training dataset. If we used all dependent word pairs extracted from the training corpus, the size of our dictionary would increase significantly and performance would drop. To retain enough information with manageable complexity, we keep the 10,000 most frequent word pairs as the initial word pair dictionary. Input features of the topic model are selected from this dictionary. Similarly, we use the 10,000 most frequent words to form a word dictionary. For both dictionaries, stop words are removed.
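The dictionary construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parser is not run here, the dependency pairs are hand-written stand-ins for parser output, and the stop-word list and the function name `build_pair_dictionary` are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; the paper does not say which list it uses.
STOP_WORDS = {"a", "an", "the", "and", "to", "be", "who", "out", "have"}

def build_pair_dictionary(parsed_docs, max_size=10_000):
    """Count dependency word pairs over all documents and keep the most frequent.

    parsed_docs: iterable of documents, each a list of (head, dependent)
    pairs as a dependency parser might produce them (assumed input format).
    """
    counts = Counter()
    for pairs in parsed_docs:
        for head, dep in pairs:
            head, dep = head.lower(), dep.lower()
            # Drop any pair that contains a stop word.
            if head in STOP_WORDS or dep in STOP_WORDS:
                continue
            counts[(head, dep)] += 1
    return [pair for pair, _ in counts.most_common(max_size)]

# Hand-written pairs for illustration only; NOT real parser output.
docs = [
    [("son", "adopted"), ("have", "son"), ("son", "Max"), ("turns", "brilliant")],
    [("son", "adopted"), ("turns", "brilliant"), ("son", "the")],
    [("son", "adopted")],
]
pair_dict = build_pair_dictionary(docs, max_size=2)
print(pair_dict)
```

The same counting scheme applies to the word dictionary by counting single tokens instead of pairs.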
III-B Feature Filtering: Two-Step TF-IDF Processing
The word dictionary and word pair dictionary still contain many high-frequency words that are not very informative, such as “first”, “name”, etc. Term frequency-inverse document frequency (TF-IDF) is applied to screen out those unimportant words or word pairs and keep only the important ones. The equation to calculate the TF-IDF weight is as follows:
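For reference, the standard TF-IDF definitions are written out here. Treating them as the numbered equations of this section is an assumption, as are the equation numbering and the unspecified logarithm base.

```latex
\begin{align}
\mathrm{TF}(t, d) &= \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d} \tag{1}\\
\mathrm{IDF}(t) &= \log \frac{\text{total number of documents}}{\text{number of documents containing term } t} \tag{2}\\
\mathrm{TFIDF}(t, d) &= \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t) \tag{3}
\end{align}
```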
Equation 1 calculates the Term Frequency (TF), which measures how frequently a term occurs in a document. Equation 2 calculates the Inverse Document Frequency (IDF), which measures how important a term is. The TF-IDF weight is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
As shown in the Feature Filtering module of Figure 1, a two-step TF-IDF processing is adopted. First, the word-level TF-IDF is computed. The result of the word-level TF-IDF is used as a filter: a word pair is kept only if the TF-IDF scores of both words are higher than the threshold (0.01). After that, we treat each word pair as a single unit, and the TF-IDF algorithm is applied again to the word pairs to further filter out word pairs that are either too common or too rare. Finally, this module generates the filtered word dictionary and the filtered word pair dictionary.
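A minimal sketch of the two-step filtering is given below. Several details are assumptions made only for illustration: each term is scored by its maximum TF-IDF over all documents, the same 0.01 threshold is reused at the word pair level, and the cut-off for pairs that are too rare is omitted because the text does not specify it. The function names are hypothetical.

```python
import math
from collections import Counter

def max_tfidf(docs):
    """Return each term's highest TF-IDF score over all documents.

    docs: list of documents, each a list of hashable terms (words or pairs).
    Aggregating by the maximum is an assumption of this sketch.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), score)
    return best

def two_step_filter(word_docs, pair_docs, threshold=0.01):
    """Step 1: keep pairs whose two words both pass the word-level threshold.
    Step 2: score the surviving pairs as single units and filter again."""
    word_score = max_tfidf(word_docs)
    kept_words = {w for w, s in word_score.items() if s > threshold}
    step1 = [[p for p in doc if p[0] in kept_words and p[1] in kept_words]
             for doc in pair_docs]
    pair_score = max_tfidf(step1)
    kept_pairs = {p for p, s in pair_score.items() if s > threshold}
    return kept_words, kept_pairs

# Toy corpus: "first" occurs in every document, so its IDF is log(1) = 0.
word_docs = [
    ["first", "adopted", "son"],
    ["first", "space", "ship"],
    ["first", "adopted", "ship"],
]
pair_docs = [
    [("son", "adopted"), ("son", "first")],
    [("ship", "space")],
    [("ship", "adopted")],
]
words, pairs = two_step_filter(word_docs, pair_docs)
print(sorted(words))
print(sorted(pairs))
```

In this toy corpus the word “first” is removed at the word level, which in turn removes the pair that contains it before the pair-level scoring runs.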
III-C Feature Coalescence: K-means Clustering
Even with the TF-IDF processing, the size of the word pair dictionary is still prohibitively large. We further cluster semantically close word pairs to reduce the dictionary size. Each word is represented by its embedding vector calculated using Google’s word2vec model. The semantic distance between two words is measured as the Euclidean distance between their embedding vectors. Words that are semantically close to each other are grouped into K clusters.
We use the index of each cluster to replace the words in the word pair. If the cluster IDs of two word pairs are the same, then the two word pairs are semantically similar and are merged. In this way we can reduce the number of word pairs by more than 63%. We also investigate how the number of cluster centroids (i.e., the variable K) affects the model accuracy. The detailed experimental results on three different datasets are given in Section IV.
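The coalescence step can be illustrated with a small, self-contained sketch. It uses a plain Lloyd's k-means with a simplified initialization (the first K vectors) and made-up 2-D embeddings in place of real word2vec vectors; all names and values are hypothetical.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means with Euclidean distance.

    Initial centroids are the first k vectors, a simplification for this
    sketch. Returns one cluster index per input vector.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

def merge_pairs(pairs, word_to_cluster):
    """Replace each word by its cluster ID; pairs with equal IDs collapse."""
    return {(word_to_cluster[a], word_to_cluster[b]) for a, b in pairs}

# Made-up 2-D embeddings; real word2vec vectors have far more dimensions.
embedding = {
    "film": (0.0, 0.1), "space": (5.0, 5.1),
    "movie": (0.1, 0.0), "galaxy": (5.1, 5.0),
}
words = list(embedding)
labels = kmeans([embedding[w] for w in words], k=2)
word_to_cluster = dict(zip(words, labels))

pairs = [("film", "space"), ("movie", "galaxy"), ("space", "film")]
merged = merge_pairs(pairs, word_to_cluster)
print(word_to_cluster, merged)
```

Here the pairs (film, space) and (movie, galaxy) map to the same pair of cluster IDs and are merged, while (space, film) stays separate because the pair order is preserved.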
IV-A Experiment Setup
The proposed topic model is tested in the context of content-based recommendation. Given a query document, the goal is to search the database and find other documents that fall into the same category by analyzing their contents. In our experiment, we generate the topic distribution of each document using the RBM model. Then we retrieve the top N documents whose topics are the closest to the query document, measured by Euclidean distance. The number of hidden units of the RBM is 500, which represents 500 topics. The number of visible units of the RBM equals the total number of different words and word pairs extracted as input features. The weights are updated using a learning rate of 0.01. During training, the momentum, number of epochs, and weight decay are set to 0.9, 15, and 0.0002, respectively.
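The retrieval step can be sketched as a nearest-neighbor search in topic space. The topic vectors below are made up and have three entries instead of 500; the function name is hypothetical.

```python
import math

def retrieve_top_n(query_id, topic_vectors, n):
    """Return the ids of the n documents closest to the query in topic space."""
    query = topic_vectors[query_id]
    others = [doc_id for doc_id in topic_vectors if doc_id != query_id]
    # Sort the remaining documents by Euclidean distance to the query.
    others.sort(key=lambda doc_id: math.dist(query, topic_vectors[doc_id]))
    return others[:n]

# Made-up 3-topic vectors; the model in the text uses 500 hidden units.
topic_vectors = {
    "q": (0.9, 0.1, 0.0),
    "a": (0.8, 0.2, 0.0),
    "b": (0.1, 0.1, 0.8),
    "c": (0.7, 0.1, 0.2),
}
top2 = retrieve_top_n("q", topic_vectors, n=2)
print(top2)
```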
Our proposed method is evaluated on 3 datasets: OMDb, Reuters, and 20NewsGroup. All the datasets are divided into three subsets: training, validation, and testing. The split ratio is 70:10:20. For each dataset, a 5-fold cross-validation is applied.
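A 70:10:20 split can be sketched as below. The shuffling, the seed, and the integer rounding are assumptions; the cross-validation procedure is not covered by this sketch.

```python
import random

def split_70_10_20(items, seed=0):
    """Shuffle and split into training, validation and testing subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * 70 // 100
    n_val = len(items) * 10 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_70_10_20(range(8633))
print(len(train), len(val), len(test))
```

With 8633 items, this rounding yields subsets of 6043, 863, and 1727 items, which agrees with the OMDb subset sizes reported in this section.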
OMDb, the Open Movie Database, is a database of movie information. The OMDb dataset is collected using OMDb APIs . The training dataset contains 6043 movie descriptions; the validation dataset contains 863 movie descriptions and the testing dataset contains 1727 movie descriptions. Based on the genre of the movie, we divided them into 20 categories and tagged them accordingly.
The Reuters dataset consists of documents that appeared on the Reuters newswire in 1987 and were manually classified into 8 categories by personnel from Reuters Ltd. There are 7674 documents in total. The training dataset contains 5485 news documents, the validation dataset contains 768 and the testing dataset contains 1535.
The 20NewsGroup dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The training dataset contains 13174 news documents, the validation dataset contains 1882 and the testing dataset contains 3765. Both the Reuters and 20NewsGroup datasets are downloaded from .
We use the mean Average Precision (mAP) score to evaluate our proposed method. It measures information retrieval quality and takes the order of the retrieved results into account. A higher score is better. If the relevant result appears near the front (i.e., ranks higher in the recommendation), the score will be close to 1; if the relevant result appears near the back (i.e., ranks lower in the recommendation), the score will be close to 0. The mAP at cutoffs 1, 3, 5, and 10 (mAP@1, mAP@3, mAP@5, and mAP@10) is used to evaluate the retrieval performance. For each document, we retrieve the 1, 3, 5, and 10 documents whose topic vectors have the smallest Euclidean distance to that of the query document. Documents are considered relevant if they share the same class label. Before we calculate the mAP, we need to calculate the Average Precision (AP) for each document. The equation of AP is given below,
$$AP = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\text{number of relevant documents}}$$
where $P(k)$ is the precision at cut-off $k$, $n$ is the number of retrieved documents, and $rel(k)$ is an indicator function equaling 1 if the item at rank $k$ is a relevant document, and 0 otherwise. Note that the average is over all relevant documents and the relevant documents not retrieved get a precision score of zero.
The equation of the mAP score is as follows,
$$mAP = \frac{\sum_{q=1}^{Q} AP(q)}{Q}$$
where $Q$ indicates the total number of queries.
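The metric can be sketched in a few lines, assuming the standard definition in which AP sums the precision at each rank that holds a relevant document and divides by the number of relevant documents. How that normalizer is chosen for a fixed retrieval depth is an assumption here, and the function names are hypothetical.

```python
def average_precision(relevance, n_relevant):
    """AP for one query.

    relevance: list of 0/1 flags for the retrieved documents, in rank order.
    n_relevant: number of relevant documents for this query, so that
    relevant documents that were not retrieved contribute zero.
    """
    if n_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at cut-off k
    return total / n_relevant

def mean_average_precision(queries):
    """queries: list of (relevance, n_relevant) tuples, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

ap_front = average_precision([1, 0, 0], n_relevant=1)
ap_back = average_precision([0, 0, 1], n_relevant=1)
map_score = mean_average_precision([([1, 0, 0], 1), ([0, 0, 1], 1)])
print(ap_front, ap_back, map_score)
```

A relevant document at rank 1 gives an AP of 1, while the same document at rank 3 gives an AP of 1/3, which matches the ranking behavior described in the text.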
IV-C1 LDA and RBM Performance Comparison
In the first experiment, we compare the topic modeling performance of LDA and RBM. For the LDA model, the number of training iterations is 15 and the number of generated topics is 500, the same as for the RBM model. As Table I shows, the RBM outperforms the LDA on all datasets. For example, using the mAP evaluation, the RBM score is 30.22% higher than the LDA score on the OMDb dataset, 18.18% higher on the Reuters dataset and 25.25% higher on the 20NewsGroup dataset. To have a fair comparison, the RBM model here is based on word-only features. In the following experiments we show that including word pairs can further improve its score.
IV-C2 Word/Word Pair Performance Comparison
In this experiment, we compare the performance of two RBM models. One of them considers only words as the input features, while the other combines words and word pairs as the input features. The total feature size takes the values 10500, 11000, 11500, 12000, 12500, and 15000. For the word/word pair combined RBM model, the number of word features is fixed at 10000, and the number of word pair features is set to meet the total feature size.
[Table II: mAP of the word-only and word/word pair combined models for total feature sizes F = 10.5K, 11K, 11.5K, 12K, 12.5K, and 15K; each feature size has a "word" column and a "word pair" column.]
Both models are first applied to the OMDb dataset, and the results are shown in Table II, section 1. The word/word pair combined model almost always performs better than the word-only model. For three of the four mAP measures, the most significant improvement occurs when the total feature size is set to F = 11000, with improvements of about 10.48%, 7.97%, and 9.83% over the word-only model. For the remaining measure, the most significant improvement occurs when the total feature size is set to F = 12000, where considering word pairs gives an improvement of about 9.35%.
The two models are further applied to the Reuters dataset, and the results are shown in Table II, section 2. Again, the word/word pair combined model outperforms the word-only model almost all the time. For the four mAP measures, improvements of up to 1.05%, 1.11%, 1.02% and 0.89% are achieved.
The results for the 20NewsGroup dataset are shown in Table II, section 3. Similar to the previous two datasets, all the results from the word/word pair combined model are better than those of the word-only model. For the mAP measures, the most significant improvement occurs when the total feature size is set to F = 11500. Improvements of up to 10.40%, 11.91%, 12.46% and 12.99% can be achieved.
IV-C3 Cluster Centroid Selection
In the third experiment, we focus on how different K values affect the effectiveness of the generated word pairs for topic modeling. The candidate K values are 100, 300, 500, 800 and 1000. We then compare the mAP of our model with that of the baseline model, which uses word-only input features.
The OMDb dataset results are shown in Figure 3. As we can observe, all the K values give better performance than the baseline. The most significant improvement occurs when K = 100. Regardless of the number of word pair features, on average we achieve improvements of 2.41%, 2.15%, 1.46% and 4.46% in the four mAP measures, respectively.
The results for the Reuters dataset are shown in Figure 4. When the K value is greater than 500, all scores for the word/word pair combined model are better than the baseline. Because the mAP score of the original model on the Reuters dataset is already very high (almost all values are higher than 0.9), it is more difficult to further improve the score on this dataset than on OMDb. For one of the mAP measures, disregarding the impact of input feature size, the most significant average improvement happens when K = 500, which is 0.31%. For the other three measures, the most significant improvements happen when K = 800, which are 0.50%, 0.38% and 0.42% respectively.
The results for the 20NewsGroup dataset are shown in Figure 5. Similar to the Reuters dataset, when the K value is greater than 800, all scores for the word/word pair combined model are better than the baseline. For the four mAP measures, the most significant average improvements are 2.82%, 2.90%, 3.2% and 3.33% respectively, and they all happen when K = 1000.
In summary, a larger K value generally gives a better result, as for the Reuters dataset and the 20NewsGroup dataset. However, for some document sets, such as OMDb, where the vocabulary has a wide semantic distribution, keeping the number of clusters small does not lose too much information.
IV-C4 Word Pair Generation Performance
In the last experiment, we compare different word pair generation algorithms with the baseline. Similar to previous experiments, the baseline is the word-only RBM model whose input consists of the 10000 most frequent words. The “semantic” word pair generation is the method we propose in this paper. The proposed technique is compared to a reference approach that applies the idea of the skip-gram algorithm and generates the word pairs from each word’s adjacent neighbor. We call it “N-gram” word pair generation. The window size used here is N = 2. For the “Non-K” word pair generation, we use the same algorithm as the “semantic” one except that no K-means clustering is applied to the generated word pairs.
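The “N-gram” reference approach can be sketched as below. Reading N = 2 as a window that spans two consecutive words, so that each word is paired only with its immediate successor, is an assumption; the function name is hypothetical.

```python
def adjacent_word_pairs(tokens, window=2):
    """Pair each word with the words that follow it inside the window."""
    pairs = []
    for i, word in enumerate(tokens):
        # With window=2, only the immediate successor is paired.
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((word, tokens[j]))
    return pairs

tokens = ["lenny", "amanda", "adopted", "son", "max"]
ngram_pairs = adjacent_word_pairs(tokens)
print(ngram_pairs)
```

Unlike dependency-based extraction, this pairs words purely by position, so pairs such as (amanda, adopted) are produced even though the two words are not grammatically related.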
The first thing we observe from Table III is that both the “semantic” word pair generation and the “Non-K” word pair generation give better mAP scores than the baseline; however, the score of the “semantic” generation is slightly higher than that of the “Non-K” generation. This is because, although both the “Non-K” and “semantic” techniques extract word pairs using natural language processing, without the K-means clustering, semantically similar pairs are considered separately. Hence there are many redundancies in the input space. This either increases the size of the input space or, in order to control the input size, reduces the amount of information captured by the input set. The K-means clustering performs the function of compression and feature extraction.
The second thing we observe is that the score of the “N-gram” word pair generation is even lower than the baseline. Besides the OMDb dataset, the other two datasets show the same pattern. This is because the “semantic” model extracts word pairs through natural language processing, so those word pairs carry semantic meaning and grammatical dependencies. The “N-gram” word pair generation, however, simply extracts words that are adjacent to each other. While it introduces some meaningful word pairs, it also introduces more meaningless word pairs at the same time. These meaningless word pairs act as noise in the input. Hence, including word pairs without semantic importance does not help to improve the model accuracy.
In this paper, we proposed a few techniques to preprocess the dataset and optimize the original RBM model. During dataset preprocessing, we first used a semantic dependency parser to extract the word pairs from each sentence in the text document. Then, by applying a two-way TF-IDF processing, we filtered the data at the word level and the word pair level. Finally, the K-means clustering algorithm helped us merge similar word pairs and remove noise from the feature dictionary. We replaced the original word-only RBM model with one that also takes word pairs as input. We showed that proper selection of the K value and of the word pair generation technique can significantly improve the topic prediction accuracy and the document retrieval performance. Experimental results have verified that, compared to the original word-only RBM model, our proposed word/word pair combined model can improve the mAP score by up to 10.48% on the OMDb dataset, up to 1.11% on the Reuters dataset and up to 12.99% on the 20NewsGroup dataset.
-  D. M. Blei, “Probabilistic topic models,” Communications of the ACM, vol. 55, no. 4, pp. 77–84, 2012.
-  M. Steyvers and T. Griffiths, “Probabilistic topic models,” Handbook of latent semantic analysis, vol. 427, no. 7, pp. 424–440, 2007.
-  I. M. D. Inc., “The Internet Movie Database,” 1990. [Online]. Available: http://www.imdb.com/
-  Q. Mei, D. Cai, D. Zhang, and C. Zhai, “Topic modeling with network regularization,” in Proceedings of the 17th international conference on World Wide Web. ACM, 2008, pp. 101–110.
-  C. Wang and D. M. Blei, “Collaborative topic modeling for recommending scientific articles,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 448–456.
-  X. Wang and E. Grimson, “Spatial latent dirichlet allocation,” in Advances in neural information processing systems, 2008, pp. 1577–1584.
-  T. K. Landauer, Latent semantic analysis. Wiley Online Library, 2006.
-  G. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” Numerische mathematik, vol. 14, no. 5, pp. 403–420, 1970.
-  T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 289–296.
-  D. M. Blei, A. Ng, and M. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
-  G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an undirected topic model,” in Advances in neural information processing systems, 2009, pp. 1607–1614.
-  S. T. Dumais, “Latent semantic analysis,” Annual review of information science and technology, vol. 38, no. 1, pp. 188–230, 2004.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119.
-  R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann machines for collaborative filtering,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 791–798.
-  G. Hinton and R. Salakhutdinov, “Discovering binary codes for documents by learning deep generative models,” Topics in Cognitive Science, vol. 3, no. 1, pp. 74–91, 2011.
-  N. Srivastava, R. Salakhutdinov, and G. E. Hinton, “Modeling documents with deep boltzmann machines,” arXiv preprint arXiv:1309.6865, 2013.
-  D. Chen and C. Manning, “A fast and accurate dependency parser using neural networks,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 740–750.
-  J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. T. McDonald, S. Pyysalo, N. Silveira et al., “Universal dependencies v1: A multilingual treebank collection.” in LREC, 2016.
-  K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, vol. 28, no. 1, pp. 11–21, 1972.
-  G. Salton and E. A. Fox, “Extended boolean information retrieval,” Communications of the ACM, vol. 26, no. 11, pp. 1022–1036, 1983.
-  G. Salton and M. J. McGill, “Introduction to modern information retrieval,” 1986.
-  G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–523, 1988.
-  H. C. Wu, R. W. P. Luk, K. F. Wong, and K. L. Kwok, “Interpreting tf-idf term weights as making relevance decisions,” ACM Transactions on Information Systems (TOIS), vol. 26, no. 3, p. 13, 2008.
-  B. Fritz, “Omdb api.” [Online]. Available: http://www.omdbapi.com/
-  A. M. d. J. C. Cachopo, “Improving methods for single-label text categorization,” Instituto Superior Técnico, Portugal, 2007.
-  A. Turpin and F. Scholer, “User performance versus precision measures for simple search tasks,” in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006, pp. 11–18.