Hierarchical Latent Semantic Mapping for Automated Topic Generation
A large share of today's information is stored in an unprecedented amount of text data. Managing these large-scale text collections is an important problem in many areas, and topic modeling performs well on it. The traditional generative models (PLSA, LDA) are the state-of-the-art approaches in topic modeling, and most recent research on topic generation has focused on improving or extending these models. However, the results of traditional generative models are sensitive to the number of topics, which must be specified manually and determines the rank of the solution space for topic generation. The problem of generating topics from a corpus resembles community detection in networks. Many effective algorithms can automatically detect communities in networks without a manually specified number of communities. Inspired by these algorithms, in this paper we propose a novel method named Hierarchical Latent Semantic Mapping (HLSM), which generates topics from a corpus. HLSM calculates the association between each pair of words in the latent topic space, then constructs a unipartite network of words with this association and hierarchically generates topics from this network. We apply HLSM to several document collections, and the experimental comparisons against several state-of-the-art approaches demonstrate its promising performance.
Guorui Zhou, Guang Chen
School of Information and Communication Engineering
Beijing University of Posts and Telecommunications
1 Introduction
Managing large collections of documents has become a common challenge in many fields. Topic modeling, which assigns topics to documents, offers a promising solution to this challenge.
Topic models generate topics from a set of documents and assign topics to these documents. Based on these topics, we can address cross-domain text classification (Barathi, 2011), text clustering (Chang & Hsu, 2005), text recommendation, and other related text data applications. There has been an exceptional amount of research on topic-model algorithms. PLSA and LDA are highly modular and can therefore be easily extended. Since the introduction of LDA, much research has built on it. The Correlated Topic Model (Blei & Lafferty, 2006) follows this approach, inducing a correlation structure between topics by using the logistic normal distribution instead of the Dirichlet. Another extension is the hierarchical LDA (Blei et al., 2010b), where topics are joined together in a hierarchy by using the nested Chinese restaurant process.
The core assumption of standard topic-model algorithms is that a corpus consists of documents, and that each document is generated by repeatedly selecting one topic with probability $p(\text{topic} \mid \text{doc})$ and then selecting one word from the distinct words of the vocabulary with probability $p(\text{word} \mid \text{topic})$. Our problem then translates into estimating the probabilities $p(\text{topic} \mid \text{doc})$ and $p(\text{word} \mid \text{topic})$. LDA and PLSA both aim to estimate the values of these probabilities with the highest likelihood of generating the corpus (Hofmann, 1999; Blei & Jordan, 2003; Griffiths & Steyvers, 2004; Nallapati et al., 2007). Thus, the inference problem is transformed into an optimization problem (Blei et al., 2010a). However, there exist many competing models with nearly identical likelihoods. Due to the high degeneracy of the likelihood landscape, standard optimization algorithms are more likely to infer different models in different optimization runs than to infer the model with the highest likelihood, as has been previously reported (Blei et al., 2010a; Wallach et al., 2009). A study of the validity of LDA optimization algorithms for inferring topic models also found that current implementations of LDA have low validity (Lancichinetti et al., 2015).
Meanwhile, selecting the number of topics is one of the most problematic modeling choices in finite topic modeling. So far there is no effective method for choosing the number of topics or for evaluating the probability of held-out data across different numbers of topics, and the degree to which LDA is robust to a poor setting of this number is not well understood (Wallach et al., 2009). Ideally, if LDA has sufficient topics to model the data set well, increasing the number of topics would have no impact on the assignments of tokens to topics, i.e., the additional topics would be used with low frequency. For example, if twenty topics are adequate to model the data exactly, then the inferred topic assignments would not be significantly affected by increasing the number of topics to fifty. If this is the case, using a large number of topics would not improve the inference either. In other words, we still need a robust choice of the number of topics. In fact, the number of topics can be seen as the rank of the solution space for topic generation. Setting it by hand amounts to manually selecting the rank of the solution space, which is not reasonable.
The standard topic-model algorithms focus on modeling the process of generating documents from topics. In this paper we propose an approach that obtains an initial guess of the topics from the distribution of words over documents. Consider a simplified problem in which each word can belong to only one topic. Generating topics from a corpus then closely resembles community detection in networks. A substantial amount of work on community detection has produced effective algorithms that reveal the structure of a network using only the network itself, without other prior knowledge. We therefore create a network of the words in the corpus, detect the communities of this network as the initial guess for the topics, and then refine these coarse topics. The top-ranked words in the topics extracted by HLSM are interesting: HLSM appears to distinguish topics at a more concrete level. For example, “image, jpeg, gif” and “3d, graphic, ray” are assigned to different topics.
The contributions of this paper can be summarized as follows:
- We propose a novel approach to constructing a network of words that is closely related to the latent topic space.
- We adapt approaches from community detection in networks to initial hierarchical topic generation, and also propose a method to further refine the topics.
- To evaluate the effectiveness of the proposed approach, we conducted experiments on several real-world text data sets. The experimental results show that our approach improves document classification accuracy over the compared topic models.
2 Hierarchical Latent Semantic Mapping
Hierarchical Latent Semantic Mapping (HLSM) is a network approach to topic modeling. Similar to the well-known topic models, each document is represented as a mixture over latent topics. The key feature that distinguishes the HLSM model from existing topic models is that HLSM directly clusters words, defines each cluster as a topic, and then refines these initial topics. HLSM thus estimates the probability distributions through a novel process.
The HLSM model infers topics in the following steps:
1. Construct the unipartite network. We calculate the association between each pair of words that co-occur in at least one document. Then we construct the unipartite network in which words are connected if their association is above a threshold.
2. Cluster the words hierarchically. The words in the unipartite network are connected by their association in the latent topic space. Naturally, we suppose that topics in the corpus give rise to communities of words in the network. Thus we use the Hierarchical Map Equation (Rosvall & Bergstrom, 2011) to detect the communities. In most corpora, topics come at multiple levels of abstraction: an abstract topic consists of several concrete topics. We therefore first detect large communities corresponding to the abstract topics, and then detect, within them, smaller communities corresponding to the concrete topics. We take the communities as a prior guess for the number of topics and for the word composition of each topic used to generate the documents. It is worth noting that we do not set the number of levels or the number of communities at each level; the Hierarchical Map Equation reveals the multilevel organization of the network of words automatically.
3. Refine the prior guess. After the last level of word clustering, one may obtain some isolated communities of words, and in step 2 some single words may be left outside the network. The prior topics detected in step 2 are therefore rough, and we refine them using a PLSA-like likelihood optimization.
2.1 Construct the unipartite network
The association between words must be closely related to the topics to ensure the validity of clustering words based on this network. However, the topics are latent, and the only observations are the words collected into documents. If we assign topics to documents manually using prior human knowledge, we observe that documents sharing the same topics are also more likely to share words. We can therefore expect that words co-occurring in many documents share the same topic; in other words, these words are more similar in the latent topic space. To calculate the association between words in the latent topic space, we follow the core idea of Latent Semantic Indexing (LSI): we map words to a vector space of reduced dimensionality based on a Singular Value Decomposition (SVD) of the co-occurrence matrix $M$, in which each row corresponds to a word, each column to a document, and each entry $M_{wd}$ is the number of occurrences of word $w$ in document $d$.
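As a minimal sketch of the word-by-document co-occurrence matrix described above, the following builds the count matrix from already tokenized documents. The function name `build_cooccurrence` and the toy documents are illustrative assumptions, not part of the original method description.

```python
from collections import Counter


def build_cooccurrence(docs):
    """Return (vocab, matrix) where matrix[i][j] counts word i in document j.

    `docs` is a list of token lists; tokenization and stop-word removal are
    assumed to have been done beforehand.
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w, count in Counter(doc).items():
            matrix[index[w]][j] = count
    return vocab, matrix


# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["image", "jpeg", "gif", "image"],
    ["3d", "graphic", "ray"],
    ["image", "graphic"],
]
vocab, M = build_cooccurrence(docs)
```

Each row of `M` is the occurrence profile of one word across the documents, which is the input to the decomposition step.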
Starting with the standard SVD given by

$$M = U \Sigma V^{T},$$

the diagonal matrix $\Sigma$ contains the singular values of $M$. The approximation of $M$ is computed by setting all but the largest $k$ singular values in $\Sigma$ to zero ($= \tilde{\Sigma}$), which is rank-$k$ optimal in the sense of the $L_2$-matrix norm.
One obtains the approximation

$$\tilde{M} = U \tilde{\Sigma} V^{T} \approx U \Sigma V^{T} = M.$$

The corresponding low-dimensional latent vectors will typically not be sparse, while the original high-dimensional matrix $M$ is sparse. This implies that one can calculate meaningful association values between pairs of words in the latent topic space. In HLSM, we calculate the cosine similarity between the rows of $\tilde{M}$ as the association of each pair of words in the latent topic space, and connect words $w_i$ and $w_j$ with this association $a_{ij}$:

$$a_{ij} = \frac{\tilde{M}_{i} \cdot \tilde{M}_{j}}{\lVert \tilde{M}_{i} \rVert \, \lVert \tilde{M}_{j} \rVert},$$

where $\tilde{M}_{i}$ denotes the row of $\tilde{M}$ that corresponds to word $w_i$.
After calculating all the connection values, we presume that connections whose association is very low are noise. One can therefore set a threshold and prune the connections whose association falls below it.
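The construction of the word network can be sketched as follows, assuming NumPy is available. The function name `word_network`, the rank `k`, and the threshold value are hypothetical choices for illustration; the source does not state which values were used. The sketch relies on the fact that the cosine similarity between rows of the truncated matrix equals the cosine similarity between the corresponding rows of $U$ scaled by the retained singular values.

```python
import numpy as np


def word_network(M, k, threshold):
    """Return {(i, j): association} for word pairs kept in the network.

    M is a word-by-document count matrix. Words are linked when they
    co-occur in at least one document and their cosine similarity in the
    rank-k latent space exceeds `threshold` (both k and threshold are
    assumptions of this sketch).
    """
    M = np.asarray(M, dtype=float)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    k = min(k, len(s))
    latent = U[:, :k] * s[:k]            # one latent vector per word
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0            # avoid division by zero
    unit = latent / norms
    sim = unit @ unit.T                  # pairwise cosine similarities

    present = (M > 0).astype(int)
    shared_docs = present @ present.T    # number of documents shared by each pair

    edges = {}
    n_words = M.shape[0]
    for i in range(n_words):
        for j in range(i + 1, n_words):
            if shared_docs[i, j] > 0 and sim[i, j] > threshold:
                edges[(i, j)] = float(sim[i, j])
    return edges


# Toy matrix: words 0 and 1 share documents 0 and 2, words 2 and 3 share document 1.
M = [[2, 0, 1], [1, 0, 1], [0, 3, 0], [0, 1, 0]]
edges = word_network(M, k=2, threshold=0.5)
```

On this toy input only the two within-group pairs survive the pruning, which is the behavior the thresholding step is meant to produce.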
2.2 Clustering words hierarchically
In most corpora, the structure of topics is not simple and often spans multiple levels. Several concrete topics sit under the same abstract topic. For example, words in a corpus focusing on “soccer” might be drawn from the topics “stars”, “matches”, “history of soccer”, etc.
We construct the network of words based on the association between words in the latent topic space. If the original structure of topics has multiple levels, the network should also have a multilevel structure. To reveal communities at multiple levels, we choose the Hierarchical Map Equation (Rosvall & Bergstrom, 2011). It is worth noting that we do not set the number of levels or the number of communities at each level. Instead, the Hierarchical Map Equation reveals the multilevel organization of the network of words automatically.
The Map Equation exploits the duality between finding community structure in networks and minimizing the specification length of a random walker’s movements on a network. For a given network partition, the map equation defines the theoretical limit of how concisely one can describe the trajectory of this random walk.
The core idea of the map equation is that if the random walker tends to stay in some blocks of the network for a long time, the code used for the specification can be compressed. Therefore, when a random walk is used as a proxy for real flow on the network, minimizing the map equation over all possible network partitions reveals the structure of the network with respect to the dynamics on the network.
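To make the compression idea concrete, here is a minimal sketch of the two-level map equation for an undirected, unweighted network. It is a simplification of the hierarchical version used by HLSM and only evaluates a given partition rather than searching for the best one. The function name `map_equation` and the toy network are assumptions for illustration.

```python
from collections import defaultdict
from math import log2


def plogp(p):
    """Return p * log2(p), with the convention 0 * log2(0) = 0."""
    return p * log2(p) if p > 0 else 0.0


def map_equation(edges, module_of):
    """Two-level map equation (bits per step) for an undirected network.

    `edges` is a list of (u, v) pairs and `module_of` maps node -> module.
    With equal link weights, all rates come from counting links.
    """
    total = 2 * len(edges)               # number of directed moves
    visit = defaultdict(float)           # node visit rates
    exit_rate = defaultdict(float)       # rate of leaving each module
    for u, v in edges:
        visit[u] += 1 / total
        visit[v] += 1 / total
        if module_of[u] != module_of[v]:
            exit_rate[module_of[u]] += 1 / total
            exit_rate[module_of[v]] += 1 / total

    length = 0.0
    q = sum(exit_rate.values())          # rate of use of the index codebook
    if q > 0:
        length += -q * sum(plogp(qi / q) for qi in exit_rate.values())

    within = defaultdict(list)
    for node, p in visit.items():
        within[module_of[node]].append(p)
    for module, rates in within.items():
        qi = exit_rate.get(module, 0.0)
        use = qi + sum(rates)            # rate of use of this module codebook
        length += -use * (plogp(qi / use) + sum(plogp(p / use) for p in rates))
    return length


# Two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_module = map_equation(edges, {n: 0 for n in range(6)})
two_modules = map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
```

Splitting the toy network into its two triangles gives a shorter description than treating it as one module, because the walker rarely crosses the single bridging link.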
In our problem, consider a hierarchical map $M$ of $n$ nodes, where each node corresponds to one word, partitioned into $m$ modules. Each module $i$ has a submap $M_i$ with $m_i$ submodules. Correspondingly, each submodule $ij$ has a submap $M_{ij}$ with $m_{ij}$ submodules, and so on.
The corresponding hierarchical map equation is

$$L(M) = q_{\curvearrowleft} H(Q) + \sum_{i=1}^{m} L(M_i),$$

with the specification length of submap $M_i$ at intermediary levels given by

$$L(M_i) = q_{\circlearrowright}^{i} H(Q_i) + \sum_{j=1}^{m_i} L(M_{ij}),$$

and at the final modular level by

$$L(M_{ij \ldots k}) = p_{\circlearrowright}^{ij \ldots k} H(P_{ij \ldots k}).$$

The weight of each codebook depends on its rate of use, and $L(M)$ is the sum of the weighted average codeword lengths of all codebooks. $H(Q)$ is the average length of the codewords in the index codebook, weighted by the rate at which the index codebook is used, while the entropy terms depend on the rates at which the codewords in each codebook are used. On any given step the random walker switches between first-level modules with probability $q_{\curvearrowleft}$, which is therefore the rate at which the index codebook is used.
At each submodule level, $H(Q_i)$ is the average length of the codewords in the subindex codebook according to their rates of use, and $q_{\circlearrowright}^{i}$ is the rate of codeword use for entering the submodules or exiting to a higher level. At the last level, $H(P_{ij \ldots k})$ is the average length of the codewords in the submodule codebook according to their rates of use, and $p_{\circlearrowright}^{ij \ldots k}$ is the rate of codeword use for visiting nodes in the submodule or exiting to other submodules. The problem of seeking the hierarchical structure that best represents the network is thus translated into finding the hierarchical partition of the network that minimizes the map equation. Fig. 2 illustrates an example of the map equation.
In this example we can assume that all weights for connections in the network are equal, thus all rates can be calculated by counting links and normalizing. The unpartitioned network has a single codebook, whose specification length is the entropy of the node visit rates. After the network is partitioned, the codewords of the first-level modules are used at a total rate of 2/50 = 0.04 (there are 25 links in the network and hence 50 possible moves when direction is considered, while only 2 moves switch between the first-level modules), with relative rates 1/2 and 1/2, so the index codebook contributes 0.04 × 1 bit = 0.04 bits. The remaining terms follow in the same way from the rates at which the random walker visits nodes, enters submodules, and exits modules (for instance, the rate at which the random walker exits to Module 2). Thus $L(M)$ is:
L(M) = 0.04 bits + 0.61 bits + 2.54 bits = 3.19 bits.
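The arithmetic of this example can be checked with a few lines of Python. The sketch assumes that the two first-level modules are entered equally often (relative rates 1/2 each), which is consistent with the 0.04 bits reported for the first term; the other two terms (0.61 and 2.54 bits) are taken directly from the text rather than recomputed, since the figure they depend on is not reproduced here.

```python
from math import log2


def entropy(rates):
    """Shannon entropy in bits of a list of relative rates summing to 1."""
    return -sum(p * log2(p) for p in rates if p > 0)


# 25 links give 50 directed moves; only 2 of them switch first-level modules.
q_switch = 2 / 50

# Assumption: both first-level modules are entered at the same relative rate.
index_term = q_switch * entropy([0.5, 0.5])

# The remaining two terms are the values stated in the text.
total = index_term + 0.61 + 2.54
```

The index codebook is cheap here because the walker rarely switches modules, so almost all of the description length comes from the lower-level codebooks.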
2.3 Refine the prior guess
Once the network is built, we detect clusters of highly associated words (the modules found by the Hierarchical Map Equation). After the last level of clustering, we get a hard partition of words, meaning that each word belongs to a single cluster. In practice, however, a word may have multiple senses and multiple types of usage in different contexts. Consequently, if we simply define every cluster as a topic, these rough topics cannot provide a reasonable probabilistic interpretation of the corpus in terms of the latent topic space. Therefore we propose a method to further refine these rough topics.
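We now discuss how to compute the distributions $p(t \mid d)$ and $p(w \mid t)$, given a partition of words, where $t$ denotes a topic, $d$ a document, and $w$ a word. In the prior partition of words, we define every cluster as a topic. Each word in the network sits in exactly one module after the Hierarchical Map Equation processing. Therefore, $p(t \mid w) = 1$ only if the word $w$ sits in the module that corresponds to the topic $t$; for every other topic $t'$, $p(t' \mid w) = 0$. Since in this step a word can belong to only one topic $t$, we have $p(t \mid w, d) = p(t \mid w)$, thus:

$$p(t \mid d) = \frac{1}{L_d} \sum_{w} n(w, d)\, p(t \mid w), \qquad p(w \mid t) = \frac{n(w, t)}{\sum_{w'} n(w', t)}. \tag{6}$$

$L_d$ is the number of words in document $d$, and $n(w, d)$ is the number of times word $w$ occurs in document $d$. It is also useful to introduce $n(w, t)$, which is the number of times topic $t$ was chosen and word $w$ was drawn. $L_C$ is the number of words in the corpus. So far, the PLSA-like likelihood of our model is:

$$\mathcal{L} = \prod_{d} \prod_{w} \Big[ \sum_{t} p(w \mid t)\, p(t \mid d) \Big]^{n(w, d)}. \tag{7}$$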
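The following sketch shows one way to compute the two distributions from a hard partition of words, together with a PLSA-style log-likelihood. It assumes the standard PLSA mixture form (each word occurrence is explained by a sum over topics of the word-given-topic and topic-given-document probabilities); the function names and the toy counts are hypothetical.

```python
from collections import defaultdict
from math import log


def distributions_from_partition(counts, topic_of):
    """Compute p(topic | doc) and p(word | topic) from a hard word partition.

    counts[d] maps word -> occurrences in document d;
    topic_of maps each word to its single topic (module).
    """
    p_t_d = []
    n_wt = defaultdict(float)            # times topic t was chosen and word w drawn
    n_t = defaultdict(float)             # total words drawn from topic t
    for doc in counts:
        doc_len = sum(doc.values())
        dist = defaultdict(float)
        for w, n in doc.items():
            t = topic_of[w]
            dist[t] += n / doc_len
            n_wt[(w, t)] += n
            n_t[t] += n
        p_t_d.append(dict(dist))
    p_w_t = {(w, t): n / n_t[t] for (w, t), n in n_wt.items()}
    return p_t_d, p_w_t


def log_likelihood(counts, p_t_d, p_w_t):
    """PLSA-style log-likelihood: sum of n(w, d) * log sum_t p(w|t) p(t|d)."""
    total = 0.0
    for doc, dist in zip(counts, p_t_d):
        for w, n in doc.items():
            p = sum(p_w_t.get((w, t), 0.0) * pt for t, pt in dist.items())
            total += n * log(p)
    return total


# Hypothetical toy corpus with two documents and two word clusters.
counts = [{"image": 2, "gif": 1, "ray": 1}, {"ray": 2, "3d": 2}]
topic_of = {"image": 0, "gif": 0, "ray": 1, "3d": 1}
p_t_d, p_w_t = distributions_from_partition(counts, topic_of)
ll = log_likelihood(counts, p_t_d, p_w_t)
```

Working with the logarithm avoids numerical underflow on realistic corpora and does not change which model has the highest likelihood.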
We can improve this likelihood by simply making documents more specific to fewer topics. To this end, our optimization algorithm finds, for each document, the words assigned to infrequent topics and reassigns these words to the most significant topic in that document.
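For each document $d$, we find the most significant topic, i.e., the one with the smallest $p$-value, considering a null model where each word is independently sampled from topic $t$ with probability $p(t) = \sum_{w} n(w, t) / L_C$, where $n(w, t)$ counts how often word $w$ was drawn from topic $t$ and $L_C$ is the number of words in the corpus. Calling $x$ the number of words in the document which actually come from topic $t$ ($x = L_d\, p(t \mid d)$, with $L_d$ the number of words in $d$, see Eq. (6)), the $p$-value of topic $t$ is then computed using a binomial distribution, $B(x; L_d, p(t))$. The $p$-value represents the significance of a topic better than $p(t \mid d)$ alone, since it also accounts for how common the topic is in the whole corpus.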
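A minimal sketch of this significance test is given below. It assumes the p-value is the upper tail of the binomial distribution (the probability of observing at least as many words from the topic as were actually observed); the text only says that a binomial distribution is used, so the tail choice, the function names, and the toy counts are assumptions.

```python
from math import comb


def binomial_p_value(x, n, p):
    """Probability of at least x successes in n independent trials with rate p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def most_significant_topic(doc_topic_counts, corpus_topic_prob):
    """Return the topic with the smallest p-value in one document.

    doc_topic_counts maps topic -> number of words of the document drawn
    from that topic; corpus_topic_prob maps topic -> corpus-wide probability.
    """
    n = sum(doc_topic_counts.values())
    return min(
        doc_topic_counts,
        key=lambda t: binomial_p_value(doc_topic_counts[t], n, corpus_topic_prob[t]),
    )


# Topic "b" has fewer words in the document but is rare in the corpus,
# so its presence is more surprising under the null model.
doc_counts = {"a": 6, "b": 4}
corpus_prob = {"a": 0.9, "b": 0.1}
winner = most_significant_topic(doc_counts, corpus_prob)
```

The toy example illustrates why the test can prefer a topic that is not the most frequent one in the document.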
For each document $d$, recall that after step 2 some single words may be left outside the network. We simply assign these words to the most significant topic, and we can then calculate a baseline of the PLSA-like likelihood (see Eq. (7)).
For each document $d$, we define the infrequent topics simply as those which occur with probability smaller than a threshold parameter. We assign the most significant topic to the words that belong to any of the infrequent topics. The probability of the most significant topic in the document is incremented by the sum of the probabilities of all infrequent topics, while the probabilities of the infrequent topics are set to zero. Similarly, the count of times an infrequent topic was chosen and word $w$ was drawn (see above) is decreased by the number of occurrences of $w$ in the document, for each word $w$ which belongs to an infrequent topic, and the corresponding count for the most significant topic is increased accordingly.
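After the previous step has been applied to all documents, we recompute the word distribution of each topic by normalizing the counts of how often each word was drawn from that topic, and then compute the likelihood of the model, which depends explicitly on the threshold parameter. We pick the model with the maximum likelihood by looping over all possible values of the threshold (from 0% to 50% in steps of 1%).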
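The per-document reassignment described above can be sketched as follows. The function name `merge_infrequent` and the example values are hypothetical; the sketch only updates the topic distribution of one document and leaves the bookkeeping of word-topic counts and the sweep over threshold values aside.

```python
def merge_infrequent(p_t_d, significant, threshold):
    """Move the mass of infrequent topics onto the most significant topic.

    p_t_d maps topic -> probability of the topic in one document.
    Topics with probability below `threshold` (other than the significant
    one) are removed and their probability is added to `significant`.
    """
    merged = {}
    moved = 0.0
    for topic, prob in p_t_d.items():
        if prob < threshold and topic != significant:
            moved += prob
        else:
            merged[topic] = prob
    merged[significant] = merged.get(significant, 0.0) + moved
    return merged


# Topic 2 is below the 10% threshold, so its mass goes to topic 0.
refined = merge_infrequent({0: 0.70, 1: 0.25, 2: 0.05}, significant=0, threshold=0.10)
```

Because probability is only moved between topics, the refined distribution still sums to one, and a threshold of zero leaves the document unchanged.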
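HLSM estimates the probabilities $p(w \mid t)$ and $p(t \mid d)$ from the training data set and calculates $p(t \mid w)$. For a new document $d$ from the held-out data set, $p(t \mid w)$ is not changed, and $p(t \mid d)$ can be calculated by:

$$p(t \mid d) = \frac{1}{L_d} \sum_{w} n(w, d)\, p(t \mid w),$$

where $n(w, d)$ is the number of times word $w$ occurs in $d$ and $L_d$ is the number of words in $d$.
HLSM fixes the probabilities $p(w \mid t)$ and $p(t \mid w)$ after the training process and hence is prone to overfitting. This is a shortcoming of the HLSM model when the training data set is small.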
3 Experimental Evaluations
HLSM is a topic model for collections of text corpora. It can be applied to many tasks such as classification, clustering, filtering, information retrieval, and related areas. Following Blei & Jordan (2003), in this section we investigate two important applications: document modeling and document classification.
3.1 Document Modeling
The goal of document modeling is to generalize the trained model from the training data set to a new data set. The documents in the corpora are unlabeled and our goal is density estimation, so we wish to obtain a high likelihood on a held-out test set. In particular, we computed the perplexity of a held-out test set to evaluate the models. Models which yield a lower perplexity are considered to generalize better, because the model is less surprised by a portion of the data set which it has never seen before. Formally, for a test set $D_{\text{test}}$ of documents, the perplexity is defined as:

$$\text{perplexity}(D_{\text{test}}) = \exp\left\{ - \frac{\sum_{d \in D_{\text{test}}} \log p(\mathbf{w}_d)}{\sum_{d \in D_{\text{test}}} L_d} \right\},$$

where $\mathbf{w}_d$ denotes the words of document $d$ and $L_d$ the number of words in it.
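As a small illustration, perplexity can be computed from per-document log-likelihoods, assuming the standard definition as the exponential of the negative average per-word log-likelihood on the test set. The function name and the toy numbers are assumptions for illustration.

```python
from math import exp, log


def perplexity(doc_log_probs, doc_lengths):
    """Exponential of the negative average per-word log-likelihood.

    doc_log_probs[i] is the natural-log probability the model assigns to
    test document i, and doc_lengths[i] is its number of words.
    """
    return exp(-sum(doc_log_probs) / sum(doc_lengths))


# Sanity check: a model that spreads probability uniformly over a
# four-word vocabulary gives each word probability 1/4.
uniform = perplexity([3 * log(0.25), 5 * log(0.25)], [3, 5])
```

For such a uniform model the perplexity equals the vocabulary size, which gives an intuitive reading of the measure as an effective number of equally likely word choices.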
We conduct this experiment on a subset of the 20Newsgroups data set, which has been widely used for evaluating the performance of cross-domain text classification algorithms. It contains nearly 20,000 newsgroup documents, partitioned evenly into 20 different newsgroups. We chose 3878 documents (after filtering out some very short documents) from the domains comp.graphics, comp.sys.mac.hardware, sci.crypt, and sci.med as the data set used in the evaluation. We held out 20% of the corpus for testing and trained the models on the remaining 80%. In data preprocessing, we removed 163 stop words from a standard list and the words occurring fewer than 3 times in each corpus. We compare HLSM against PLSA, asymmetric LDA, and TopicMapping. The initial $\alpha$ for asymmetric LDA was set to 0.01 for all topics.
Fig. 3 shows the perplexity results as the number of topics varies from 5 to 100. As can be seen, the HLSM model achieves a slight improvement in terms of perplexity, while TopicMapping is close to asymmetric LDA. The experiment shows that the prior guess of HLSM makes a great difference to topic generation. Table 1 presents the top 12 topics extracted from the Comp and Sci data set; some topics with lower probability are not shown. We sorted the words by the learned topic-word probability. By examining the topical words, we observe that the words in the same topic are always semantically relevant. For example, Topic 1 is about Mac hardware, which matches one domain of the Comp and Sci data set, comp.sys.mac.hardware. It is noteworthy that some topics look similar at an abstract level, but there are still distinctions between them. For instance, the words in Topic 2 and Topic 4 are semantically related, but Topic 2 is more related to medical treatment, while Topic 4 probably describes reports about disease. The result shows that our method can effectively identify the correlations between domain-specific features from different domains. Furthermore, our method can extract narrow topics below the level of a domain. We conduct the next experiment on the whole 20Newsgroups data set.
Table 1: Probabilities of the top 12 topics extracted by HLSM from the Comp and Sci data set.
| Topic | 1 | 2 | 3 | 4 | 5 | 6 |
| Probability | 0.0801 | 0.0672 | 0.0662 | 0.0619 | 0.0607 | 0.0600 |
| Topic | 7 | 8 | 9 | 10 | 11 | 12 |
| Probability | 0.0562 | 0.0557 | 0.0507 | 0.0463 | 0.0455 | 0.0441 |
Table 2: Classification accuracy of each topic model on the six data sets.
| Data set | PLSA | LDA | asymmetric LDA | TopicMapping | HLSM |
| Comp and Sci | 0.761 | 0.771 | 0.792 | 0.831 | 0.855 |
| Comp and Talk | 0.785 | 0.790 | 0.813 | 0.846 | 0.871 |
| Comp and Rec | 0.770 | 0.776 | 0.781 | 0.834 | 0.853 |
| Sci and Rec | 0.724 | 0.723 | 0.767 | 0.803 | 0.822 |
| Talk and Rec | 0.811 | 0.802 | 0.832 | 0.821 | 0.876 |
| Talk and Sci | 0.804 | 0.811 | 0.839 | 0.847 | 0.867 |
Table 3: Data sets generated from 20Newsgroups.
| Data set | Fields |
| Comp and Sci | comp.graphics, comp.sys.mac.hardware, sci.crypt, sci.med |
| Comp and Talk | comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, talk.politics.mideast, talk.politics.misc |
| Comp and Rec | comp.graphics, comp.sys.ibm.pc.hardware, rec.motorcycles, rec.sport.baseball |
| Sci and Rec | sci.crypt, sci.med, rec.autos, rec.sport.baseball |
| Talk and Rec | talk.politics.mideast, talk.politics.misc, rec.autos, rec.sport.baseball |
| Talk and Sci | talk.politics.misc, talk.religion.misc, sci.crypt, sci.med |
3.2 Document classification
In the text classification problem, we wish to classify a document into two or more mutually exclusive classes. The choice of features is a challenging aspect of the document classification problem. By representing the documents in terms of the latent topic space, topic models generate the probabilities $p(\text{topic} \mid \text{doc})$. If one uses the vector of $p(\text{topic} \mid \text{doc})$ as the features of a document for text classification, the probability vectors generated by a more effective model should yield better classification than those generated by other models.
To test the effectiveness of HLSM, we compared it with the following representative topic models, using accuracy (AC) as the evaluation criterion: PLSA, symmetric LDA, asymmetric LDA, and TopicMapping.
We generated six cross-domain text data sets from 20Newsgroups by utilizing its labeled structure. There are 4 fields in each data set; Table 3 summarizes the data sets generated from 20Newsgroups. To make the classification task more convincing, the task was defined as a multi-class classification.
In these experiments, we estimated the probabilities $p(\text{topic} \mid \text{doc})$ using the above topic models on all the documents of each data set, and used the vector of probabilities as the only features to train a support vector machine (SVM) for multi-class classification. For each data set, 20% of the documents were held out as test data, and we trained an SVM with the remaining 80% of labeled documents. We used these classifiers to predict the class labels of the documents in the test data. Since there were 4 fields in each data set, a classification was considered correct only if the document was assigned to its original field.
We applied the same data preprocessing as above, and the number of topics in each data set for LDA, PLSA, and asymmetric LDA was set to 4. Table 2 summarizes the classification performance on each data set; the first three columns show the best accuracy obtained while the number of topics for PLSA, LDA, and asymmetric LDA varies. The last row of the table shows the average accuracy over all data sets. From the table we can observe that HLSM outperformed all other topic models on all six data sets.
4 Conclusion
In this paper we presented HLSM, a topic model that applies an approach from the area of community detection to topic generation. We applied the HLSM model to several document collections for document modeling and document classification, and the experimental comparisons against state-of-the-art approaches demonstrate its promising performance. In the area of community detection, a substantial amount of work has also been done on stochastic block models, which fit a model to reveal community structure in networks. We believe this work, which is similar to topic modeling in spirit, would offer new insights into topic modeling.
References
- Barathi (2011) Barathi, B.U.A. Cross-domain text classification using semantic based approach. In Sustainable Energy and Intelligent Systems (SEISCON 2011), International Conference on, pp. 820–825, July 2011. doi: 10.1049/cp.2011.0479.
- Blei & Jordan (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- Blei et al. (2010a) Blei, D., Carin, L., and Dunson, D. Probabilistic topic models. Signal Processing Magazine, IEEE, 27(6):55–65, Nov 2010a. ISSN 1053-5888. doi: 10.1109/MSP.2010.938079.
- Blei & Lafferty (2006) Blei, David M. and Lafferty, John D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pp. 113–120, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2. doi: 10.1145/1143844.1143859. URL http://doi.acm.org/10.1145/1143844.1143859.
- Blei et al. (2010b) Blei, David M., Griffiths, Thomas L., and Jordan, Michael I. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2):7:1–7:30, February 2010b. ISSN 0004-5411. doi: 10.1145/1667053.1667056. URL http://doi.acm.org/10.1145/1667053.1667056.
- Chang & Hsu (2005) Chang, Hsi-Cheng and Hsu, Chiun-Chieh. Using topic keyword clusters for automatic document clustering. In Information Technology and Applications, 2005. ICITA 2005. Third International Conference on, volume 1, pp. 419–424 vol.1, July 2005. doi: 10.1109/ICITA.2005.303.
- Griffiths & Steyvers (2004) Griffiths, Thomas L. and Steyvers, Mark. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004. doi: 10.1073/pnas.0307752101. URL http://www.pnas.org/content/101/suppl_1/5228.abstract.
- Hofmann (1999) Hofmann, T. Probabilistic latent semantic indexing. SIGIR, pp. 50–57, 1999.
- Lancichinetti et al. (2015) Lancichinetti, Andrea, Sirer, M. Irmak, Wang, Jane X., Acuna, Daniel, Körding, Konrad, and Amaral, Luís A. Nunes. High-reproducibility and high-accuracy method for automated topic classification. Phys. Rev. X, 5:011007, Jan 2015. doi: 10.1103/PhysRevX.5.011007. URL http://link.aps.org/doi/10.1103/PhysRevX.5.011007.
- Nallapati et al. (2007) Nallapati, Ramesh, Cohen, William, and Lafferty, J. Parallelized variational em for latent dirichlet allocation: An experimental evaluation of speed and scalability. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pp. 349–354, Oct 2007. doi: 10.1109/ICDMW.2007.33.
- Rosvall & Bergstrom (2011) Rosvall, Martin and Bergstrom, Carl T. Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLoS ONE, 6(4):e18209, 04 2011. doi: 10.1371/journal.pone.0018209. URL http://dx.doi.org/10.1371%2Fjournal.pone.0018209.
- Wallach et al. (2009) Wallach, Hanna M., Mimno, David M., and McCallum, Andrew. Rethinking lda: Why priors matter. In Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 1973–1981. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3854-rethinking-lda-why-priors-matter.pdf.