Unveiling the semantic structure of text documents using paragraph-aware Topic Models
Classic Topic Models are built under the Bag of Words assumption, in which word position is ignored for simplicity. Besides, symmetric priors are typically used in most applications. In order to easily learn topics with different properties within the same corpus, we propose a new line of work in which the paragraph structure is exploited. Our proposal is based on the following assumption: in many text document corpora there are formal constraints shared across the whole collection, e.g. sections. When this assumption is satisfied, some paragraphs may be related to general concepts shared by all documents in the corpus, while others contain the genuine description of each document. Assuming each paragraph can be semantically more general, specific, or hybrid, we look for ways to measure this, transfer this distinction to topics, and learn what we call specific and general topics. Experiments show that this methodology highlights certain paragraphs in structured documents while learning interesting and more diverse topics.
Topic Modeling refers to a popular set of algorithms that have been widely used for inferring topics or themes –defined by probability vectors over words– present in collections of text documents of any kind –news, books, scientific articles, patents, e-mails, biological data or even tweets–. Topic modeling quickly became popular after the first models were proposed (Deerwester et al., 1990), (Blei et al., 2003). Since then, a large number of related contributions have appeared, on the one hand applying these models to specific problems and challenges, and on the other hand exploring more complex models that add capabilities to their ancestors. These new models try to overcome the assumptions of the old ones. For instance, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) assumes word independence (Bag of Words), a fixed number of topics, etc. These simplifications facilitate inference, both in complexity and in computational cost, while achieving fairly good performance on interpretability and many other tasks. However, this model ignores the fact that the ordering of the words contains valuable information. The interested reader can find reviews of previous and current algorithms in (Blei, 2012) and (Boyd-Graber et al., 2017).
In this paper, the following assumption is analysed: in some corpora, paragraphs are the basic unit of semantic information. More specifically, when the assumption holds, documents are structured in paragraphs, with some paragraphs being semantically more meaningful than others. To keep it simple, we distinguish two types of paragraphs: general ones, which are semantically similar to other general paragraphs contained in most documents of the corpus, and specific ones, which contain the most important (and discriminative) semantic information of the documents. Consequently, we also distinguish between general topics, which are corpus-related and appear in the majority of documents, and specific topics, which contain the information that genuinely describes a subset of documents. We claim that, when the assumption is met, our model is able to learn the document structure and better quality topics.
The rest of the paper is organized as follows: in Section 2, we briefly summarize previous contributions based on similar approaches or objectives, and discuss how they differ from ours. In Section 3, we introduce a simple generative model for documents which allows us to derive a Gibbs Sampling based inference matching our purpose. In Section 4, we apply this inference to synthetic and real datasets, showing that the model works better when its assumptions are met, and exploring the topics learned in real datasets. Finally, in Section 5 we discuss the findings of the experiments and suggest future applications and lines of work.
In this section, we briefly introduce some existing techniques that have already produced algorithms related to this work. This review establishes the grounds for better understanding the motivation of our work.
2.1 Previous work
There are previous models in the literature that propose learning topics separately depending on how document-specific they are. Chemudugunta et al. (2007) propose a model that considers words coming from three possible topical sources –a corpus shared background distribution, a document-specific distribution, or a corpus-specific set of topics–. The model tries to isolate stopwords and other non-relevant words in the background distribution, while the rest of the words are modeled depending on how often they are shared across documents. This improved the ability to match queries, especially for low frequency words, since they matched those of document-specific topics. The assumed generative model presents three paths for generating words, and it is controlled by an additional latent variable, a Multinomial distribution acting as a switch. However, the choice is made word by word and is therefore subject to the limitations of the Bag of Words assumption. In contrast, our work is based on the assumption that there exists a certain structure in the document at the paragraph level, so that general or specific words tend to occur (at least in certain corpora) separately in different paragraphs. Algorithms aware of this structure will then produce better topics and, as a by-product, find the most semantically relevant paragraphs of each document.
Haghighi and Vanderwende (2009) also consider a background distribution, and content topics that may be general (for a collection) or specific (for a document). A sentence may contain background words and specific words, but all the specific words of the sentence must belong to the same topic. Topic transition between sentences is modeled with a Hidden Markov Model (HMM), with a high probability of keeping the same topic. Unlike this work, our model allows several topics both in specific paragraphs and in background ones.
Finally, other authors have suggested models in which text segments, mainly sentences, share common attributes. For instance, in Gruber et al. (2007) all the words of a sentence are assumed to share the same topic. Transitions between topics are also modeled as an HMM. This permits identifying sections in documents like scientific articles. Balikas et al. (2016) suggest a model in which topics are learned at the sentence level. They revisit a typical Gibbs Sampling inference to consider how to compute the probability of a full sentence belonging to a topic. This model differs from ours because it assumes one topic per sentence, whereas in ours each word can be sampled from a different topic.
2.2 Motivation of this contribution
We assume that there are certain text document corpora in which, due to factors like formal structure –job offers, grant proposals, patents, articles, etc.– it is worth modeling paragraphs separately. We expect these documents to manifest, at the paragraph level, contextual information related to the corpus itself on the one hand, and more specific and distinctive information (i.e., document-specific) on the other hand. In the former, we learn topics describing the corpus structure and general content, while in the latter we discover more specific topics. We provide a model that allows detecting these paragraphs when the assumption is met. In the end, our motivation is learning higher quality topics in general, disentangling them when possible. In addition, the model unveils the structure of the document by highlighting the most specific paragraphs. It is worth mentioning that choosing paragraphs as the semantic unit is a trade-off between sentences and longer segments (e.g., sections). The model itself would remain unchanged if the text span were changed.
3 Paragraph-aware LDA
3.1 Mathematical Notation and Generative model
Table 1 summarizes the most important variables and parameters that are necessary for the presentation of our model.
|Observed words in the text collection.|
|Topic assignments for each of the words.|
|Vector with the specific (general) topic distribution of document .|
|Hyperparameter for prior|
|Paragraph assignments: if the paragraph is specific, and if general.|
|Proportion of specific and general paragraphs in a document .|
|Hyperparameter for prior|
|Vector containing the vocabulary distribution for specific (general) topic .|
|Total number of specific (general) topics. In general,|
|Hyperparameter for prior|
|Proportion of words generated by specific topics in a specific paragraph|
To be more specific, the generative model corresponding to our model can be summarized as follows:
Specific (general) topics are sampled as:
For each document,
Topic proportions for specific and general topics are sampled:
(narrower in documents)
(wider in documents)
The proportion of specific and general topics is obtained:
For each paragraph in the document,
Choose whether the paragraph is specific () or general ():
For each word in the paragraph:
if , sample the general topic and word from the selected topic:
Sample whether the word comes from a specific or general topic ( or ), using .
Sample the topic from or , and sample the word from the selected topic: and .
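The generative process above can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: every name is hypothetical (`alpha_spec`, `alpha_gen` and `beta` stand for the symmetric Dirichlet hyperparameters, `delta` for the prior over paragraph types, `lam` for the proportion of specific-topic words in a specific paragraph), and fixed paragraph and document lengths are used for simplicity.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw one vector from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs, n_pars, n_words, vocab_size, k_spec, k_gen,
                    alpha_spec, alpha_gen, beta, delta, lam, seed=0):
    rng = random.Random(seed)
    # Topic-word distributions are drawn once for the whole corpus.
    phi_spec = [sample_dirichlet(beta, vocab_size, rng) for _ in range(k_spec)]
    phi_gen = [sample_dirichlet(beta, vocab_size, rng) for _ in range(k_gen)]
    corpus = []
    for _ in range(n_docs):
        # Per-document topic proportions for each kind of topic.
        theta_spec = sample_dirichlet(alpha_spec, k_spec, rng)
        theta_gen = sample_dirichlet(alpha_gen, k_gen, rng)
        # Per-document probability that a paragraph is specific.
        p_specific = sample_dirichlet(delta, 2, rng)[0]
        doc = []
        for _ in range(n_pars):
            is_specific = rng.random() < p_specific
            words = []
            for _ in range(n_words):
                # Only specific paragraphs can emit specific-topic words,
                # and they do so with probability lam.
                from_spec = is_specific and rng.random() < lam
                if from_spec:
                    topic = sample_index(theta_spec, rng)
                    word = sample_index(phi_spec[topic], rng)
                else:
                    topic = sample_index(theta_gen, rng)
                    word = sample_index(phi_gen[topic], rng)
                words.append((word, from_spec, topic))
            doc.append((is_specific, words))
        corpus.append(doc)
    return corpus
```

Each paragraph is returned together with its type, so a corpus generated this way carries the true paragraph labels.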
The additional variables that are introduced with respect to standard LDA are due to the following differences of our model –see also the graphical model in Fig. 1:
An extra plate is included to reflect the paragraph level. For each paragraph we have included an additional binary variable that can take values and for general and specific paragraphs, respectively. This way, specific topics and general topics are sampled separately, as well as the topic-document proportions of each kind.
Even if a paragraph is described as specific, assuming that all its words are specific is unrealistic. For that reason, we introduce a mixing probability allowing an arbitrarily small proportion of general words. Inferring is out of the scope of this work, and its value is fixed before the training stage.
All the variables whose output is a probability distribution have a Dirichlet prior, whereas all the assignments are modeled as a Multinomial distribution (this choice ensures conjugacy).
All priors are symmetric, i.e., hyperparameters , , , and are considered scalar values.
, forces the model to learn specific topics of which only a few appear in each document.
favours the appearance of all general topics in all documents.
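The effect of the two symmetric priors can be illustrated numerically. The sketch below is an illustration only: the function names, the threshold of 0.05 and the number of draws are assumptions, while the concentrations 0.1 and 2 and the dimensions 30 and 10 mirror the values reported later for the synthetic corpus. It measures which share of the topics receives a non-negligible proportion in a document.

```python
import random


def sample_symmetric_dirichlet(concentration, dim, rng):
    """Draw one vector from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def active_topic_fraction(concentration, dim, n_docs=500, threshold=0.05, seed=0):
    """Average share of topics whose proportion in a document exceeds threshold."""
    rng = random.Random(seed)
    active = 0
    for _ in range(n_docs):
        theta = sample_symmetric_dirichlet(concentration, dim, rng)
        active += sum(p > threshold for p in theta)
    return active / (n_docs * dim)


# A small concentration yields sparse proportions (few topics per document);
# a large one spreads the mass over most topics.
sparse = active_topic_fraction(0.1, 30)
dense = active_topic_fraction(2.0, 10)
```

With these settings, `sparse` should come out well below `dense`, matching the intended behaviour of specific and general topics respectively.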
Inference is based on Collapsed Gibbs Sampling (see Griffiths and Steyvers (2004), Heinrich (2005)). As for LDA, the factorized joint probability of the model is used to obtain full conditionals of the hidden variables we want to estimate. In LDA, and are integrated out (collapsed) since they can be estimated as statistics of topic assignments . In our model, is collapsed too, and the factorized joint probability is the following:
The full conditional for is obtained following a similar approach. First, we integrate out , resulting in the following Dirichlet-Multinomial distribution over :
where expresses the occurrences of specific and general paragraphs in the corpus.
Up to this point, the procedure for the two other factors in the joint probability is the same. However, when obtaining the full conditionals, the Dirichlet-Multinomials over and become proportional to Gamma function quotients. In LDA, we can implicitly make use of the fact that and , and the full conditional for is approximated by:
counts how many times the topic appears in document , ignoring the current assignment ;
refers to how many times the term was sampled from topic in other assignments.
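For reference, the standard collapsed Gibbs update for plain LDA (Griffiths and Steyvers, 2004) can be sketched as follows. This is a minimal sketch of the well-known LDA update, not of the parLDA sampler; the names `doc_topic`, `topic_word`, `topic_total`, `alpha` and `beta` are assumptions, since the original symbols are not reproduced here. All counts exclude the token being resampled.

```python
import random


def lda_full_conditional(doc_topic, topic_word, topic_total, word, alpha, beta):
    """Normalized p(z = k | everything else) for one token.

    doc_topic[k]      tokens of the current document assigned to topic k
    topic_word[k][w]  times term w is assigned to topic k in the corpus
    topic_total[k]    total tokens assigned to topic k in the corpus
    """
    vocab_size = len(topic_word[0])
    weights = [
        (doc_topic[k] + alpha) * (topic_word[k][word] + beta)
        / (topic_total[k] + vocab_size * beta)
        for k in range(len(doc_topic))
    ]
    norm = sum(weights)
    return [w / norm for w in weights]


def resample_token(doc_topic, topic_word, topic_total, word, old_topic,
                   alpha, beta, rng):
    """Remove the current assignment, sample a new topic, and restore the counts."""
    doc_topic[old_topic] -= 1
    topic_word[old_topic][word] -= 1
    topic_total[old_topic] -= 1
    probs = lda_full_conditional(doc_topic, topic_word, topic_total,
                                 word, alpha, beta)
    new_topic = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    doc_topic[new_topic] += 1
    topic_word[new_topic][word] += 1
    topic_total[new_topic] += 1
    return new_topic
```

One sweep of the sampler would call `resample_token` once for every token in the corpus.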
For each paragraph assignment , however, we are now counting the number of words in documents belonging to each type of paragraph. When obtaining the full conditional, we have to take into account that changing one paragraph assignment affects more than one word –, where expresses the words assigned to a certain type of paragraph– so the recursion rule for the Gamma function is in general applied more than once. This leads to the following full conditional, which is analogous to sentence-topics in Balikas et al. (2016).
where expresses how many times term appears in paragraph , assuming this one being .
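The repeated use of the Gamma recursion can be made concrete with a generic Dirichlet-Multinomial computation. The sketch below is not the exact full conditional of the model, which combines several such factors; it only shows, under assumed names (`base_counts`, `block_counts`, `prior`), how the probability of reassigning a whole block of items at once is written with Gamma function quotients, and how two competing log-scores can be normalized.

```python
from math import exp, lgamma


def block_log_predictive(base_counts, block_counts, prior):
    """log-probability of a block of items given the counts outside the block.

    base_counts[i]   items of category i outside the block
    block_counts[i]  items of category i inside the block
    prior            symmetric Dirichlet hyperparameter
    """
    dim = len(base_counts)
    n_base = sum(base_counts)
    n_block = sum(block_counts)
    # Quotient of Gamma functions for the normalizing term.
    log_p = lgamma(n_base + dim * prior) - lgamma(n_base + n_block + dim * prior)
    # One quotient per category that occurs inside the block.
    for base, extra in zip(base_counts, block_counts):
        if extra:
            log_p += lgamma(base + extra + prior) - lgamma(base + prior)
    return log_p


def normalize_log_scores(log_scores):
    """Turn unnormalized log-scores into probabilities (log-sum-exp trick)."""
    top = max(log_scores)
    weights = [exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]
```

For a block containing a single item, the expression reduces to the usual one-step ratio (count + prior) / (total + dimension × prior); for larger blocks it equals the product of those ratios applied one item at a time.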
In this section, we present different results on synthetic and real datasets in order to validate our proposal. We consider two sets of experiments:
Firstly, we construct a synthetic corpus using the generative model from the previous section. Since we create the corpus, we know the real labels for the specific and general paragraphs. These first experiments seek to validate the inference scheme and to show that, when our assumption about document structure holds, we gain over standard LDA both in the quality of the learned topics and in the ability to discriminate between both kinds of paragraphs, even when LDA is used together with a classifier that is given the true labels.
Secondly, we explore a real dataset –a subset of USPTO patents–. The intention here is to study whether the proposed model can obtain better topics and identify relevant paragraphs in a real dataset.
4.1 Experiments on synthetic dataset
|Docs (test)||Paragraphs (test)||Words (test)|
|3000 (500)||62627 (12439)||2191252 (434052)||10 (30)||5000||2 (0.1)||0.1 (0.1)|
Table 2 contains the parameters that were used for generating this corpus, namely: the vocabulary size , the number of topics and , prior hyperparameters and number of documents, paragraphs and words generated, both for a train set and a test set . In addition, to make the problem more difficult some noise was added varying the parameter, that is, the proportion of general words in specific paragraphs. Concretely, for each document, . This may lead to confusing situations in which paragraphs labeled as specific contain only of specific words. Hyperparameter was set to during the inference.
In order to assess the capability of our method to identify document structure, once the corpus was generated, three models were compared w.r.t. their ability to discriminate between general and specific paragraphs:
parLDA (our model): after learning topics on the train set, paragraph probabilities were estimated on the test set.
LDA+SVM: topics were learned on the train set. Then, topic assignments were sampled on the test set. Characterizing each paragraph with its assignments, half the test set was used with its real labels to train an RBF-kernel SVM, whose parameters and were cross-validated. Probability estimates were provided for the final test set.
BoW+SVM: instead of topic assignments, BoW vectors from the paragraphs in the training set were used to train a linear SVM, and to obtain probability estimates for the test set. A linear SVM was chosen because we wanted a baseline on how separable the paragraphs were before applying Topic Modeling.
ROC curves for these methods are shown in Fig. 2. It can be observed that parLDA outperforms the other methods, even though true labels are used by LDA+SVM and BoW+SVM while our method is fully unsupervised.
A second experiment was carried out to evaluate the topics learned by parLDA and LDA. Since we had the real topics, we looked for metrics measuring distances/similarities among probability distributions to compare them against the learned topics. We used a Histogram Intersection similarity metric (see Cha and Srihari (2002)) to see how many original topics the algorithms are able to identify and how close the learned topics are to them, comparing our method and LDA. parLDA similarities were slightly superior to those of LDA. This makes us think that parLDA typically learns better topics than LDA in this specific scenario, where our generative assumption holds. To check whether these higher similarities lead to better learned topics, we defined a ‘correctly guessed’ topic as one which coincides in at least 5 words with the topic it is predicting (in a top-10 representation): parLDA correctly guessed 37 out of 40 topics, whereas LDA guessed 28 out of 40.
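Both criteria are straightforward to compute. The sketch below is one possible reading of the procedure, with assumptions stated explicitly: topics are plain lists of word probabilities over a shared vocabulary, each true topic is matched to its most similar learned topic by Histogram Intersection (the matching rule is our assumption), and the function names are hypothetical.

```python
def histogram_intersection(p, q):
    """Similarity between two distributions: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(p, q))


def top_words(topic, k=10):
    """Indices of the k most probable words of a topic."""
    ranked = sorted(range(len(topic)), key=lambda w: topic[w], reverse=True)
    return set(ranked[:k])


def count_correct_topics(true_topics, learned_topics, k=10, min_overlap=5):
    """Count true topics whose best-matching learned topic shares enough top words."""
    correct = 0
    for true_topic in true_topics:
        # Assumed matching rule: pick the learned topic with highest similarity.
        best = max(learned_topics,
                   key=lambda t: histogram_intersection(true_topic, t))
        if len(top_words(true_topic, k) & top_words(best, k)) >= min_overlap:
            correct += 1
    return correct
```

Histogram Intersection equals 1 for identical distributions and 0 for distributions with disjoint support, so higher values indicate closer topics.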
4.2 Experiments on USPTO patents
In this second scenario we apply our method to a real corpus. Concretely, we selected 3000 patents from the week of January 31st, 2017. Table 3 shows more information about the corpus and the selected parameters for inference.
|3000||410652||8652304||- (15)||8323||2 (0.1)||()|
Table 3 contains the description for this corpus. Now, , the number of general topics, will be variable. is fixed to 15 as we will see below. In order to check which topics our model learns and which paragraphs it highlights, we defined the following experiment:
Then, once the number of topics is fixed, we run our method to learn specific topics and an increasing number of general topics (). This way, we can check whether adding new topics to the general topic set helps improve the quality of the specific topics, compared to LDA topics.
We decided to use Topic Coherence to measure topic quality, since perplexity has been shown not to correlate with human perception in all cases (see Chang et al. (2009)). Some coherence measures are based on co-occurrences of words in a reference corpus and show a high correlation with human judgement (Lau et al., 2014), while there have been interesting proposals based on similarities of the word vectors of topics (Fang et al., 2016). In addition, some measures based on the corpus itself have been proposed too (Mimno et al., 2011).
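As an example of a corpus-based measure, the coherence of Mimno et al. (2011) sums, over pairs of top words, the log ratio between their co-document frequency (plus one) and the document frequency of the higher-ranked word. The sketch below implements that measure only; it is not the measure finally selected in this work, and the function name and the input format (documents as lists of tokens) are assumptions.

```python
from math import log


def umass_coherence(top_words, documents):
    """Corpus-based topic coherence in the style of Mimno et al. (2011).

    top_words  words of a topic, ordered from most to least probable
    documents  iterable of tokenized documents
    Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            score += log((co_freq + 1) / doc_freq(top_words[l]))
    return score
```

Values closer to zero indicate that the top words of a topic tend to appear together in the same documents.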
After looking at the results of some of the abovementioned coherence measures, including one based on pre-trained fastText word vectors (Bojanowski et al., 2016), we observed that most of them behave unexpectedly when the learned topics contain very domain-specific words, since they are not able to capture some semantic relationships. In the end, we selected the measure in Röder et al. (2015), as it provided the most sensible results. Automatic coherence measures that correlate with human judgement are still a challenge to be analyzed in the future.
Fig. 3 (left) shows that, for , LDA learns a set of topics which, on average, have a higher Topic Coherence. Fig. 3 (right) compares that curve with the ones resulting from Topic Coherences for parLDA topics (average of all topics, and average of specific topics only, set to 15). When adding one or two background topics, we can see that coherence is worse. This may be related to how restrictive it is to represent general paragraphs with only one or two topics. However, the more general topics we add, the higher the coherence of the specific topics becomes, exceeding that of LDA. This suggests that adding general topics, into which more general words can fit, helps clean the specific topics, which are the ones we want to focus on. Lastly, Table 4 shows some of the learned topics in this experiment.
|Topic type||coherence||Top 10 words|
|Spec.||0.86||compound, acid, carbon, solution, atom, solvent, formula, polymer, reaction, alkyl|
|Spec.||0.79||acid, composition, peptide, agent, enzyme, amino, aqueous, compound, surfactant, ester|
|Gen.||0.67||optical, electrode, voltage, lens, terminal, substrate, transistor, led, magnetic, coil|
|Gen.||0.63||sensor, vehicle, controller, switch, cell, module, mode, threshold, voltage, battery|
|LDA||0.82||optical, lens, beam, led, wavelength, laser, radiation, emission, mirror, angle|
|LDA||0.71||temperature, polymer, particle, resin, coating, weight, composition, metal, glass, fiber|
5 Conclusions and Further work
In this paper we have identified a specific set of corpora in which changing the semantic unit from words to paragraphs becomes helpful. We have shown with a simple model the benefits of this approach when the proposed generative model is met –in a synthetic dataset–, and we are also satisfied with the resulting topics and structures learned from real datasets. Highlighting paragraphs seems a reasonable way to tell Topic Modeling algorithms where they should focus their efforts in order to learn high quality topics.
Future analysis should lead to improving the way in which a paragraph is classified as relevant. This will require the use of hand-labeled datasets and new metrics, and also more adapted inference models. Once this paragraph characterization is well studied, users of this approach should be able to give the model more information about the type of topics they are looking for.
This work has been partially supported by MINECO Projects TEC2014-52289-R and TEC2017-83838-R. Simón Roca-Sotelo has received financial support through the “la Caixa” Fellowship Grant for Doctoral Studies, “la Caixa” Banking Foundation, Barcelona, Spain.
Balikas, G., M.-R. Amini, and M. Clausel
2016. On a topic model for sentences. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, Pp. 921–924. ACM.
Blei, D. M.
2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.
Blei, D. M., A. Y. Ng, and M. I. Jordan
2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
Bojanowski, P., E. Grave, A. Joulin, and
2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Boyd-Graber, J., Y. Hu, and D. Mimno
2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3):143–296.
Cha, S.-H. and S. N. Srihari
2002. On measuring the distance between histograms. Pattern Recognition, 35(6):1355–1370.
Chang, J., S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M.
2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, Pp. 288–296.
Chemudugunta, C., P. Smyth, and M. Steyvers
2007. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in neural information processing systems, Pp. 241–248.
Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and
1990. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391.
Fang, A., C. Macdonald, I. Ounis, and P. Habel
2016. Using word embedding to evaluate the coherence of topics from twitter data. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, Pp. 1057–1060. ACM.
Griffiths, T. L. and M. Steyvers
2004. Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228–5235.
Gruber, A., Y. Weiss, and M. Rosen-Zvi
2007. Hidden topic markov models. In Artificial intelligence and statistics, Pp. 163–170.
Haghighi, A. and L. Vanderwende
2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Pp. 362–370. Association for Computational Linguistics.
Heinrich, G.
2005. Parameter estimation for text analysis.
Lau, J. H., D. Newman, and T. Baldwin
2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Pp. 530–539.
Mimno, D., H. M. Wallach, E. Talley, M. Leenders, and
2011. Optimizing semantic coherence in topic models. In Proceedings of the conference on empirical methods in natural language processing, Pp. 262–272. Association for Computational Linguistics.
Röder, M., A. Both, and A. Hinneburg
2015. Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining, Pp. 399–408. ACM.