On a Topic Model for Sentences
Probabilistic topic models are generative models that describe the content of documents by discovering the latent topics underlying them. However, the structure of the textual input, and for instance the grouping of words in coherent text spans such as sentences, contains much information which is generally lost with these models. In this paper, we propose sentenceLDA, an extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes. We illustrate the advantages of sentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections.
On a Topic Model for Sentences
|Georgios Balikas, Massih-Reza Amini and Marianne Clausel|
|University of Grenoble Alpes|
|Computer Science Laboratory (LIG), Applied Mathematics Laboratory (LJK)|
Text Mining; Topic Modeling; Unsupervised Learning
Statistical topic models are generative unsupervised models that describe the content of documents in large textual collections. Prior research has investigated the application of topic models such as Latent Dirichlet Allocation (LDA) [?] in a variety of domains ranging from image analysis to political science. Most of the work on topic models assumes exchangeability between words and treats documents in a bag-of-words fashion. As a result, the words’ grouping in coherent text segments, such as sentences or phrases, is lost.
However, the inner structure of documents is generally useful, when identifying topics. For instance, one would expect that in each sentence, after standard pre-processing steps such as stop-word removal, only a very limited number of latent topics would appear. Thus, we argue that coherent text segments should pose “constraints” on the amount of topics that appear inside those segments.
In this paper, we propose sentenceLDA (senLDA), whose purpose is to incorporate part of the text structure in the topic model. Motivated by the argument that coherent text spans should be produced by only a handful of topics, we propose to modify the generative process of LDA. Hence, we argue that the latent topics of short text spans should be consistent across the units of those spans. In our approach, such text spans can vary from paragraphs to sentences and phrases depending on the task’s purpose. Also, note that in the extreme case where words are the coherent text segments, the standard LDA model becomes a special case of senLDA.
In the remainder of the paper we present the senLDA and we derive its collapsed Gibbs sampler in Section 2, we illustrate its advantages by comparing it with LDA on intrinsic (in vitro) and extrinsic (ex vivo) evaluation experiments using collections of Wikipedia and PubMed articles in Section 3, and we conclude in Section 4.
A statistical topic model represents the words in a collection of documents as mixtures of “topics”, which are multinomials over a vocabulary of size . In the case of LDA, for each document a multinomial over topics is sampled from a Dirichlet prior with parameters . The probability of a term , given the topic , is represented by . We refer to the complete matrix of word-topic probabilities as . The multinomial parameters are again drawn from a Dirichlet prior parametrized by . Each observed term in the collection is drawn from a multinomial for the topic represented by a discrete hidden indicator variable . For simplicity in the mathematical development and notation, we assume symmetric Dirichlet priors but the extension to the asymmetric case is straightforward. Hence, the values of and are model hyper-parameters.
We extend LDA by adding an extra plate denoting the coherent text segments of a document. In the rest, without loss of generality we use sentences as coherent segments. A finer level of granularity can be achieved though, by analysing the structure of sentences and using phrases as such segments. The graphical representation of the senLDA model is shown in Figure On a Topic Model for Sentences and the generative process of a document collection using senLDA is described in Algorithm 1. For inference, we use a collapsed Gibbs sampling method [?]. We now derive the Gibbs sampler equations by estimating the hidden topic variables.
In senLDA the joint distribution can be factored:
because the first term is independent of and the second from . After standard manipulations as in the paradigm of [?] one arrives at:
where is a multidimensional extension of the beta function used for notation convenience, and , refer to the occurrences of topics with documents and topics with terms respectively. To calculate the full conditional we take into account the structure of the document and the fact that , . The subscript in denotes the words and the topic respectively of sentence . For the full conditional of topic we have:
For the first term of equation Eq. (On a Topic Model for Sentences) we have:
Here, for the generation of A and B we used the recursive property of the function: ; is a term that can occur many times in a sentence and denotes ’s frequency in sentence given that the sentence belongs to topic ; denotes how many words of sentence belong to topic .
The development of the second factor in the final step of Eq. (On a Topic Model for Sentences) is similar to the LDA calculations with the difference that the counts of topics per document are calculated given the allocation of sentences to topics and not the allocation of words to topics. This yields:
where denotes the number of times that topic has been observed with a sentence from document , excluding the sentence currently sampled. Note that Eq. (On a Topic Model for Sentences) reduces to the standard LDA collapsed Gibbs sampling inference equations if the coherent text spans are reduced to words.
The idea of integrating the sentence limits in the LDA model has been previously investigated. For instance, in [?] in the context of summarization the authors combine the unigram language model with topic models over sentences so that the latent topics are represented by sentences instead of terms. In [?] the notion of sentence topics is introduced and they are sampled from separate topic distributions and co-exist with the word topics. Also, Boyd et al. [?] propose an adaptation of topic models to the text structure obtained by the parsing tree of a document. Our method resembles these works in that it integrates the notion of sentences to extend LDA. In our case though, we directly extend LDA maintaining the association of words to topics, we retain its simplicity without adding extra hyper-parameters thus allowing a fast, gibbs sampling inference, and we do not require any language-dependent tools such as parsers.
We conduct experiments to verify the applicability and evaluate the performance of senLDA compared to LDA. The process is divided into two steps: (i) the training phase, where the topic models are trained to learn the their parameters, and (ii) the inference phase that is for new, unseen documents their topic distributions are estimated. We use the Gibbs sampling inference approach given by Eq. (On a Topic Model for Sentences). The hyper-parameters and are set to , with being the number of topics. Table On a Topic Model for Sentences shows the datasets we used. They come from the publicly available collections of Wikipedia [?] and PubMed [?]. The first four datasets (WikiTrain* and PubMedTrain*) were used for learning the topic model parameters; they differ in their respective size. Also, the vocabulary of the PubMed datasets is significantly larger due to the medical terms that appear. During preprocessing we only applied lower-casing, stop-word removal and lemmatization using the WordNet Lemmatizer.111The code and the data are publicly available at https://github.com/balikasg/topicModelling/ The rest of the document collections of Table On a Topic Model for Sentences are used for classification purposes and are discussed later in the section.
Documents Classes Timing (sec) WikiTrain1 10,000 46,051 - 182|271 WikiTrain2 30,000 65,820 - 332|434 PubMedTrain1 10,000 55,115 - 304|433 PubMedTrain2 60,000 150,440 - 1830|2799 Wiki37 2,459 23,559 37 - Wiki46 3,657 27,914 46 - PubMed25 7,248 40,173 25 - PubMed50 9,035 47,199 50 - Table \thetable: Description of the data used after pre-processing. “Timing” refers to the 25 first training iterations with the left (resp. right) values corresponding to senLDA (resp. LDA).
Intrinsic evaluation Topic model evaluation has been the subject of intense research. For intrinsic evaluation we report here perplexity [?], which is probably the dominant measure for topic models evaluation in the bibliography. The perplexity of held out documents given the model parameters is defined as the reciprocal geometric mean of the token likelihoods of those data, given the parameters of the model:
Note that senLDA samples per sentence and thus results in less flexibility at the word level where perplexity is calculated. Even though, the comparison between senLDA and LDA, at word level using perplexity, gives insights in the relative merits of the the proposed model.
Figure On a Topic Model for Sentences depicts the ratio of the perplexity values between senLDA and LDA. We set after grid searching for perplexity with 5-fold cross-validation on the training data. Values higher (resp. lower) than one signify that senLDA achieves lower (resp. higher) perplexity than LDA. The figure demonstrates that in the first iterations before convergence of both models, senLDA performs better. What is more, senLDA converges after only around 30 iterations, whereas LDA converges after 160 iterations on Wikipedia and 200 iterations on the PubMed datasets respectively. We define convergence as the situation where the model’s perplexity does not any more decrease over training iterations. The shaded area in the figure highlights the period while senLDA performs better. It is to be noted, that although competitive, senLDA does not outperform LDA given unlimited time resources. However, that was expected since for senLDA the training instances are sentences, thus the model’s flexibility is restricted when evaluated against a word-based measure.
An important difference between the models however, lies in the way they converge. From Figure On a Topic Model for Sentences it is clear that senLDA converges faster. We highlight this by providing exact timings for the first 25 iterations of the models (column “Timing” of Table On a Topic Model for Sentences) on a machine using an Intel Xeon CPU E5-2643 v3 at 3.40GHz. For both models we use our own Python implementations with the same speed optimisations. Using “WikiTrain2” and 125 topics, for 25 iterations the senLDA needs 332 secs, whereas LDA needs 434 sec., an improvement of 30%. Furthermore, comparing the convergence, senLDA needs 332 secs (25 iterations) whereas LDA needs more than 2770 secs (more than 160 iterations) making senLDA more than 8 times faster. Similarly for the “PubMedTrain2” dataset which is more complex due to its larger vocabulary size, senLDA converges around 12 times (an order of magnitude) faster. Note that senLDA’s fast convergence is a strong advantage and can be highly appreciated in different application scenarios where unlimited time resources are not available.
Extrinsic evaluation Previous studies have shown that perplexity does not always agree with human evaluations of topic models [?] and it is recommended to evaluate topic models on real tasks. To better support our development for senLDA applicability we also evaluate it using text classification as the evaluation task. For text classification, each document is represented by its topic distribution, which is the vectorial input to Support Vector Machines (SVMs). The classification collections are split on train/test (75%/25%) parts. The SVM regularization hyper-parameter is selected from using 5-fold cross-validation on the training part of the classification data. The PubMed testsets are multilabel, that is each instance is associated with several classes, 1.4 in average in the sets of Table On a Topic Model for Sentences. For the multilabel problem with the SVMs we used a binary relevance approach. To assess the classification performance, we report the evaluation measure, which is the harmonic mean of precision and recall.
The classification performance on measure for the different classification datasets is shown in Figure On a Topic Model for Sentences. First note that in the majority of the classification scenarios, senLDA outperforms LDA. In most cases, the performance difference increases when the larger train sets (“WikiTrain2” and “PubMedTrain2”) are used. For instance, in the second line of figures with the PubMed classification experiments, increasing the topic models’ training data benefits both LDA and senLDA , but senLDA still performs better. More importantly though and in consistence with the perplexity experiments, the advantage of senLDA remains: the faster senLDA convergence benefits the classification performance. The senLDA curves are steeper in the first training iterations and stabilize after roughly 30 iterations when the model converges. We believe that assigning the latent topics to coherent groups of words such as sentences results in document representations of finer level. In this sense, spans larger than single words can capture and express the document’s content more efficiently for discriminative tasks like classification.
To investigate the correlation of topic model representations learned on different levels of text, we report the classification performance using as document representations the concatenation of a document’s topic distributions output by LDA and senLDA . For instance, the concatenated vectorial representation of a document when for each model is a vector of 250 dimensions. The resulting concatenated representations are denoted by “senLDA+” in Figure On a Topic Model for Sentences. As it can be seen, “senLDA+” performs better compared to both LDA and senLDA . Its performance combines the advantages of both models: during the first iterations it is as steep as the senLDA representations and in the later iterations benefits by the LDA convergence to outperform the simple senLDA representation. Hence, the concatenation of the two distributions creates a richer representation where the two models contribute complementary information that achieves the best classification performance. Achieving the optimal performance using those representations suggests that the relaxation of the independence assumptions between the text structural units can be beneficial; this is also among the contributions of this work.
We proposed senLDA, an extension of LDA where topics are sampled per coherent text spans. This resulted in very fast convergence and good classification and perplexity performance. LDA and senLDA differ in that the second assumes a very strong dependence of the latent topics between the words of sentences, whereas the first assumes independence between the words of documents in general. In our future research, our goal is to investigate this dependence and further adapt the sampling process of topic models to cope with the rich text structure.
This work is partially supported by the CIFRE N 28/2015.
-  L. Azzopardi, M. Girolami, and K. van Risjbergen. Investigating the relationship between language model perplexity and IR precision-recall measures. In SIGIR, pages 369–370, 2003.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.
-  J. L. Boyd-Graber and D. M. Blei. Syntactic topic models. In Advances in neural information processing systems, pages 185–192, 2009.
-  R.-C. Chen, R. Swanson, and A. S. Gordon. An adaptation of topic modeling to sentences. 2010.
-  T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
-  G. Heinrich. Parameter estimation for text analysis. Technical report, Technical report, 2005.
-  I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras, E. Gaussier, I. Androutsopoulos, M.-R. Amini, and P. Galinari. LSHTC: A benchmark for large-scale text classification. CoRR, abs/1503.08581, march 2015.
-  G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):1, 2015.
-  D. Wang, S. Zhu, T. Li, and Y. Gong. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 297–300. Association for Computational Linguistics, 2009.