Augmenting word2vec with latent Dirichlet allocation within a clinical application
This paper presents three hybrid models that directly combine latent Dirichlet allocation and word embedding for distinguishing between speakers with and without Alzheimer’s disease from transcripts of picture descriptions. Two of our models achieve F-scores above the current state of the art for automatic methods on the DementiaBank dataset.
Akshay Budhkar University of Toronto The Vector Institute email@example.com Frank Rudzicz University of Toronto Toronto Rehabilitation Institute-UHN The Vector Institute firstname.lastname@example.org
Word embedding projects word tokens into a lower-dimensional latent space that captures semantic, morphological, and syntactic information (Mikolov et al., 2013). Separately but relatedly, the task of topic modelling also discovers latent semantic structures or topics in a corpus. Blei et al. (2003) introduced latent Dirichlet allocation (LDA), which is based on bag-of-words statistics to infer topics in an unsupervised manner. LDA considers each document to be a probability distribution over hidden topics, and each topic is a probability distribution over all words in the vocabulary. Both the topic distributions and the word distributions assume distinct Dirichlet priors.
The inferred probabilities over learned latent topics of a given document (i.e., topic vectors) can be used along with a discriminative classifier, as in the work by Luo and Li (2014), but other approaches such as TF-IDF (Lan et al., 2005) easily outperform this model, as in the case of the Reuters-21578 corpus (Lewis et al., 1987). To address this, Mcauliffe and Blei (2008) introduced a supervised topic model, sLDA, with the intention of inferring latent topics that are predictive of the provided label. Similarly, Ramage et al. (2009) introduced labeled LDA, another graphical model variant of LDA, for text classification. Both of these variants achieve competitive results, but neither addresses the absence of contextual information in these models.
Here, we hypothesize that creating a hybrid of LDA and word2vec models will produce discriminative features. These complementing models have been previously combined for classification by Liu et al. (2015), who introduced topical word embeddings in which topics were inferred on a small local context, rather than over a complete document, and input to a skip-gram model. However, these models are limited when working with small context windows and are relatively expensive to calculate when working with long texts as they involve multiple LDA inferences per document.
We introduce three new variants of hybrid LDA-word2vec models, and investigate the effect of dropping the first component after principal component analysis (PCA). These models can be thought of as extending the conglomeration of topical embedding models. We incorporate topical information into our word2vec models by using the final state of the topic-word distribution in the LDA model during training.
1.1 Motivation and related work
Alzheimer’s disease (AD) is a neurodegenerative disease that affects approximately 5.5 million Americans with annual costs of care up to $259B in the United States, in 2017, alone (Alzheimer’s Association et al., 2017). The existing state-of-the-art methods for detecting AD from speech used extensive feature engineering, some of which involved experienced clinicians. Fraser et al. (2016) investigated multiple linguistic and acoustic characteristics and obtained accuracies up to 81% with aggressive feature selection.
Standard methods that discover latent spaces from data, such as word2vec, allow for problem-agnostic frameworks that don’t involve extensive feature engineering. Yancheva and Rudzicz (2016) took a step in this direction, clinically, by using vector-space topic models, again in detecting AD, and achieved F-scores up to 74%. It is generally expensive to get sufficient labeled data for arbitrary pathological conditions. Given the sparse nature of data sets for AD, Noorian et al. (2017) augmented a clinical data set with normative, unlabeled data, including the Wisconsin Longitudinal Study (WLS), to effectively improve the state of binary classification of people with and without AD.
In our experiments, we train our hybrid models on a normative dataset and apply them for classification on a clinical dataset. While we test and compare these results on detection of AD, this framework can easily be applied to other text classification problems. The goals of this project are i) to effectively augment word2vec with LDA for classification, and ii) to improve the accuracy of dementia detection using automatic methods.
2.1 Wisconsin Longitudinal Study
The Wisconsin Longitudinal Study (WLS) is a normative dataset where residents of Wisconsin (N = 10,317) born between 1938 and 1940 perform the Cookie Theft picture description task from the Boston Diagnostic Aphasia Examination (Goodglass and Barresi, 2000). The audio excerpts from the 2011 survey (N = 1,366) were converted to text using the Kaldi open source automatic speech recognition (ASR) engine, specifically using a bi-directional long short-term memory network trained on the Fisher data set (Cieri et al., 2004). We use this normative dataset to train our topic and word2vec models.
DementiaBank (DB) is part of the TalkBank project (MacWhinney et al., 2011). Each participant was assigned to either the ‘Dementia’ group () or the ‘Control’ group () based on their medical histories and an extensive neuropsychological and physical assessment battery. Additionally, since many subjects repeated their engagement at yearly intervals (up to five years), we use samples from those in the ‘Dementia’ group, and from those in the ‘Control’ group. Each speech sample was recorded and manually transcribed at the word level following the CHAT protocol (MacWhinney, 1992). We use 5-fold group cross-validation (CV) to split this dataset while ensuring that a particular participant does not occur in both the train and test splits. Table 2 presents the distribution of Control and Dementia groups in the test split for each fold.
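The participant-level split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `group_k_fold`, the greedy balancing strategy, and the default `k=5` (taken from the 5-fold split mentioned in the results) are assumptions.

```python
from collections import defaultdict

def group_k_fold(participant_ids, k=5):
    """Yield (train_idx, test_idx) pairs; no participant appears on both sides."""
    # Collect the sample indices that belong to each participant.
    by_participant = defaultdict(list)
    for idx, pid in enumerate(participant_ids):
        by_participant[pid].append(idx)

    # Assign participants with the most samples first, always to the
    # currently smallest fold, so fold sizes stay roughly balanced.
    folds = [[] for _ in range(k)]
    ordered = sorted(by_participant, key=lambda p: (-len(by_participant[p]), str(p)))
    for pid in ordered:
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_participant[pid])

    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx
```

In practice, scikit-learn's `GroupKFold` offers the same guarantee; the version above only makes the constraint explicit.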
|Dataset|Sex (M/F), AD|Sex (M/F), CT|Age (years), AD|Age (years), CT|
|WLS|-/-|681/685|- (-)|71.2 (4.4)|
|DB|82/158|82/151|71.8 (8.5)|65.2 (7.8)|
|Fold 1||Fold 2||Fold 3||Fold 4||Fold 5||Total|
WLS is used to train our LDA, word2vec and hybrid models that are then used to generate feature vectors on the DB dataset. The feature vectors on the train set are used to train a discriminative classifier (e.g., SVM), that is then used to do the AD/CT binary classification on the feature vectors of the test set.
2.3 Text pre-processing
During the training of our LDA and word2vec models, we filter out spaCy’s list of stop words (Honnibal and Montani, 2017) from our datasets. For our LDA models trained on ASR transcripts, we remove the [UNK] and [NOISE] tokens generated by Kaldi. We also exclude the tokens um and uh, as they were the most prevalent words across most of the generated topics. We exclude all punctuation and numbers from our datasets.
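A minimal sketch of this filtering step is given below. The short `STOP_WORDS` set is only a stand-in for spaCy's full list, and the function name `preprocess` and the whitespace tokenization are assumptions made for illustration.

```python
import re

# Tiny illustrative stop list; the paper filters spaCy's full list instead.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "in", "to", "from"}
# Kaldi markers and filled pauses that the text says are removed.
DROPPED_TOKENS = {"[unk]", "[noise]", "um", "uh"}

def preprocess(transcript):
    """Lowercase, split on whitespace, and drop unwanted tokens."""
    kept = []
    for token in transcript.lower().split():
        if token in DROPPED_TOKENS:
            continue
        # Strip punctuation and digits, keeping letters and apostrophes.
        token = re.sub(r"[^a-z']", "", token)
        if token and token not in STOP_WORDS and token not in DROPPED_TOKENS:
            kept.append(token)
    return kept
```

For example, `preprocess("Um the boy is [NOISE] stealing 2 cookies, uh, from the jar.")` keeps only the content words.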
Once an LDA model is trained, it can be used to infer the topic distribution on a given document. We set the number of topics empirically to K=5 and K=25.
We also use a pre-trained word2vec model trained on the Google News dataset (https://code.google.com/archive/p/word2vec/). The model contains embeddings for 3 million unique words, though we extract the most frequent 1 million words for faster performance. Words in our corpus that do not exist in this model are replaced with the UNK token. We also train our own word vectors with 300 dimensions and window size of 2 to be consistent with the pre-trained variant. Words are required to appear at least twice to have a mapped word2vec embedding. Both models incorporate negative sampling to aid with better representations for frequent words as discussed by Mikolov et al. (2013). Unless mentioned otherwise, the same parameters are used for all of our proposed word2vec-based models.
Given these models, we represent a document by averaging the word embeddings for all the words in that document, i.e.:
$$\mathbf{d}_{\mathrm{avg}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i \tag{1}$$
where $n$ is the number of words in the document and $\mathbf{w}_i$ is the word2vec embedding for the $i$-th word. This representation retains the number of dimensions ($N$) in the original model.
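The averaging step can be sketched in a few lines. The dictionary-based `embeddings` lookup and the function name `average_word2vec` are illustrative assumptions; the fallback to an `UNK` vector follows the handling of out-of-vocabulary words described for the pre-trained model.

```python
def average_word2vec(tokens, embeddings, unk="UNK"):
    """Mean of the embeddings of all tokens in one document."""
    # Unknown words fall back to the UNK vector.
    vectors = [embeddings.get(tok, embeddings[unk]) for tok in tokens]
    if not vectors:
        raise ValueError("document has no tokens")
    dim = len(vectors[0])
    # Average each dimension independently; the output keeps the same size.
    return [sum(vec[j] for vec in vectors) / len(vectors) for j in range(dim)]
```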
Third, TF-IDF is a common numerical statistic in information retrieval that weights the number of times a word occurs in a document against how often it occurs across the entire corpus. We use a TF-IDF vector representation for each transcript for the top 1,000 words after preprocessing. Only the train set is used to compute the inverse document frequency values.
Finally, since the goal of this paper is to create a hybrid of LDA and word2vec models, one of the simpler hybrid models, i.e., concatenating LDA probabilities with average word2vec representations, is the fourth baseline model. Every document is represented by $N + K$ dimensions, where $N$ is the word2vec size and $K$ is the number of topics.
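The train-only constraint on the inverse document frequencies can be illustrated as below. The smoothed IDF formula, the use of raw counts as term frequency, and the function names `fit_idf` and `tfidf_vector` are assumptions; the text does not state the exact weighting variant.

```python
import math
from collections import Counter

def fit_idf(train_docs, top_k=1000):
    """Learn the vocabulary and IDF weights from training documents only."""
    term_counts = Counter(tok for doc in train_docs for tok in doc)
    vocab = [tok for tok, _ in term_counts.most_common(top_k)]
    n_docs = len(train_docs)
    doc_freq = Counter(tok for doc in train_docs for tok in set(doc))
    # Smoothed IDF (an assumption): ln((1 + n) / (1 + df)) + 1.
    return {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}

def tfidf_vector(doc, idf):
    """Map one tokenized document onto the fitted vocabulary."""
    counts = Counter(doc)
    # Words unseen during fitting are ignored, so test data never alters the IDF.
    return [counts[tok] * weight for tok, weight in idf.items()]
```

Test transcripts are passed through `tfidf_vector` with the `idf` fitted on the train split of the same fold.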
3.2 Proposed models
3.2.1 Topic vectors
Once an LDA model is trained, we procure the word distribution for every topic. We represent a topic vector as the weighted combination of the word2vec vectors of the words in the vocabulary. This represents every inferred topic as a real-valued vector, with the same dimensions as the word embedding. A topic vector for a given topic $t_k$ is defined as:
$$\mathbf{t}_k = \sum_{i=1}^{V} p(w_i \mid t_k)\, \mathbf{w}_i \tag{2}$$
where $V$ is the vocabulary size of our corpus, $p(w_i \mid t_k)$ is the probability that a given word $w_i$ appears in the topic $t_k$, from LDA, and $\mathbf{w}_i$ is the word2vec embedding of that word.
Furthermore, this approach also represents a given document (or transcript) using these topic vectors as a linear combination of the topics in that document. This combination can be thought of as a topic-influenced point representation of the document in the word2vec space. A document vector is given by:
$$\mathbf{d} = \sum_{k=1}^{K} p(t_k \mid d)\, \mathbf{t}_k \tag{3}$$
where $\mathbf{t}_k$ is the topic vector defined in Equation 2, $K$ is the number of topics of the LDA model, and $p(t_k \mid d)$ is the inferred probability that a given document $d$ contains topic $t_k$.
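The two weighted combinations described in this section can be sketched as follows. This assumes that a topic vector is a plain probability-weighted sum with no extra normalization, which is an interpretation of "weighted combination"; the function names and the dictionary-based inputs are hypothetical.

```python
def topic_vector(word_probs, embeddings):
    """Probability-weighted sum of word embeddings for one topic."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for word, prob in word_probs.items():
        for j, value in enumerate(embeddings[word]):
            vec[j] += prob * value
    return vec

def document_vector(topic_probs, topic_vectors):
    """Mix the topic vectors using the document's inferred topic distribution."""
    dim = len(topic_vectors[0])
    return [
        sum(p * t[j] for p, t in zip(topic_probs, topic_vectors))
        for j in range(dim)
    ]
```

Both outputs live in the same space as the word embeddings, so a document ends up as a single point with the word2vec dimensionality.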
3.2.2 Topical Embedding
To generate topical embeddings, we use the $p(w \mid t)$ from LDA training as the ground truth of how words and topics are related to each other. We normalize that distribution per word, so that $\sum_{k=1}^{K} p(t_k \mid w) = 1$. This gives a topical representation for every word in the vocabulary.
We concatenate this representation to the one-hot encoding of a given word to train a skip-gram word2vec model. Figure 1 shows a single pass of the word2vec training with this added information. There, the input and output are the concatenated representations of the input-output words determined by a context window, and the hidden layer is $N$-dimensional. All the words and the topics are mapped to an $N$-dimensional embedding during inference. Our algorithm also skips the softmax layer at the output of a standard word2vec model, as our vectors are now a combination of one-hot encoding and dense probability matrices. This is akin to what Liu et al. (2015) did with their LDA inference on a local context document; however, we use the state of the distribution at the last step of the training for all our calculations.
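The per-word normalization can be sketched as below, assuming the topic-word distribution is available as one dictionary per topic over a shared vocabulary. The function name `normalize_per_word` and this input layout are illustrative assumptions.

```python
def normalize_per_word(topic_word):
    """Rescale each word's column of topic probabilities to sum to one."""
    vocab = topic_word[0].keys()
    topical = {}
    for word in vocab:
        # One probability per topic for this word.
        column = [row[word] for row in topic_word]
        total = sum(column)
        topical[word] = [value / total for value in column]
    return topical
```

Each resulting list has one entry per topic and serves as that word's topical representation.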
To get document representations, we use the average word2vec approach in Eq 1 on these modified word2vec embeddings. We also propose a new way of representing documents as seen in Figure 3 where we concatenate the average word2vec with the word2vec representation of the most prevalent topic in the document following LDA inference.
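The concatenated document representation can be sketched as follows. The function name and argument layout are hypothetical; `topic_embeddings[k]` stands for the embedding learned for topic `k`.

```python
def average_plus_topic(avg_vec, doc_topic_probs, topic_embeddings):
    """Append the embedding of the document's most prevalent topic."""
    # Index of the topic with the highest inferred probability.
    top = max(range(len(doc_topic_probs)), key=doc_topic_probs.__getitem__)
    return list(avg_vec) + list(topic_embeddings[top])
```

The result has twice the word2vec dimensionality, since both halves come from the same embedding space.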
3.2.3 Topic-induced word2vec
Our final model involves inducing topics into the corpus itself. We represent every topic with the string topic_i where $i$ is its topic number; e.g., topic 1 is topic_1, and topic 25 is topic_25. We also create a sunk topic character (analogous to UNK in vocabulary space) and set it to topic_(K+1), where $K$ is the number of topics in the LDA model.
We normalize $p(w \mid t)$ to get $p(t \mid w)$ (Section 3.2.2). With a probability of 0.5, set empirically, we replace a given word $w$ with the topic string for $\arg\max_k p(t_k \mid w)$, provided the max value is at least a fixed threshold. If this max value is below the threshold, the word is replaced with the sunk topic for that model.
Figure 2 shows an example of topic induction on a snapshot of an ASR transcript of WLS. This process is repeated times and this augmented corpus is now run through a standard skip-gram word2vec model with dimensions set to 400 to accommodate the bigger corpus. The intuition behind this approach is that it allows words to learn how they occur around topics in a corpus, and vice versa.
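One pass of topic induction might look like the sketch below. It assumes the replacement condition is a lower bound on the maximum per-word topic probability; the `threshold` parameter is hypothetical and its value is not taken from the text. The function name and the `topical` mapping (word to per-topic probabilities) are also assumptions.

```python
import random

def induce_topics(tokens, topical, num_topics, threshold, replace_prob=0.5, rng=None):
    """Randomly swap words for topic strings to build an augmented transcript."""
    rng = rng or random.Random(0)
    sunk = f"topic_{num_topics + 1}"  # catch-all topic, analogous to UNK
    out = []
    for tok in tokens:
        probs = topical.get(tok)
        # Keep the word if it has no topical representation or the coin flip fails.
        if probs is None or rng.random() >= replace_prob:
            out.append(tok)
            continue
        best = max(range(num_topics), key=probs.__getitem__)
        # Topics are numbered from 1, as in topic_1 ... topic_K.
        out.append(f"topic_{best + 1}" if probs[best] >= threshold else sunk)
    return out
```

Calling the function several times with differently seeded `rng` objects and appending the outputs to the original transcripts yields the augmented corpus for skip-gram training.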
3.3 PCA update
3.4 Discriminative classifier
Apart from the last experiment, where we compare different classifiers on one model, all experiments use an SVM classifier with a linear kernel and tolerance set to . All other parameters are set to the defaults in the scikit-learn (http://scikit-learn.org) library.
4 Experimental setup
4.1 LDA, word2vec, and hybrid models
We use Rehurek’s Gensim (https://radimrehurek.com/gensim/) topic modelling library to generate our LDA and word2vec models. The LDA model follows Hoffman’s (Hoffman et al., 2010) online learning for LDA, ensuring fast model generation for a large corpus. To train our topical embeddings, we implement the skip-gram variant of word2vec using TensorFlow. For all our word2vec models, we set the window size to and run through the corpus for iterations.
4.2 Metric calculation
We use scikit-learn (Pedregosa et al., 2011) to classify the vectors generated from our models. For all models, unless specified, we use the default parameters while keeping the discriminative models consistent through all experiments. Our random forest and gradient boosting classifiers each have 100 estimators to be consistent with the work of Noorian et al. (2017). We also employ the original pyLDAvis implementation on GitHub (Sievert and Shirley, 2014) to visualize topics across the models. t-SNE (Maaten and Hinton, 2008) is used to reduce the vector representations to two dimensions for plotting purposes.
5.1 Model Visualization
We take the 300-dimensional vector representations of the 25 words closest to dishes in the word2vec model trained on WLS and run t-SNE dimensionality reduction to plot them in two dimensions in Figure 4. Words similar to dishes occur in its vicinity.
Figure 5 does the same for the topic-induced model (for topics) trained on the augmented corpus as discussed in Section 3.2.3. In this scenario, we are able to see words similar to dishes, and topics that tend to occur in its vicinity. It is evident that all the topics occur close to each other in the embedding space.
Our 5-topic LDA model is visualized using pyLDAvis in Figure 6, and the word distribution in topic 1 is shown. Unlike some distinctly varied corpora, like 20 Newsgroups (as seen in AlSumait et al. (2009)), the topics in WLS do not seem to be human-distinguishable, and the same few words dominate all the topics. This is expected given that both the AD and CT patients are describing the same picture (Goodglass and Barresi, 2000), and the top 10 tokens of the stop-word-filtered WLS dataset account for 16.89% of the total words in the corpus.
5.2 DB classification
The LDA-inferred topic probabilities are not discriminative on their own and give an accuracy of (when ). The TF-IDF model sets a very strong baseline with an accuracy of , which is already better than the automatic models of Yancheva and Rudzicz (2016) on the same data. Using trained word2vec models for average word2vec representations gives better accuracy than using a pre-trained model. Simply concatenating the LDA dense matrix to the average word2vec values gives an accuracy of , which is comparable to the TF-IDF model. The PCA update on the trained word2vec model boosts the accuracy, and is in line with the work done by Arora et al. (2016). This is not the case for the pre-trained word vectors, where the accuracy drops to after the update.
The topic vectors end up providing no discriminative information to our classifier. This is the case regardless of whether topic vectors are linearly combined (to get $N$ dimensions) or concatenated (to get $K \times N$ dimensions).
|LDA||Pre-trained word2vec||Trained word2vec||TF-IDF||Concatenation||Topic Vectors|
|5 Topics||25 Topics||PCA Update||PCA Update|
|Topical word2vec||Topical word2vec + topic||Topical word2vec||Topical word2vec + topic||Topic-Induced word2vec||Topic-Induced word2vec + topic||Topic-Induced word2vec||Topic-Induced word2vec + topic||Topic-Induced word2vec||Topic-Induced word2vec + topic||Topic-Induced word2vec||Topic-Induced word2vec + topic|
|25 topics||25 topics and PCA||5 topics||25 topics||5 topics and PCA||25 topics and PCA|
The 25-topic topical embedding model discussed in Section 3.2.2 outperforms the TF-IDF baseline and gives accuracies of when using the average word2vec approach. There is a slight improvement when we concatenate the topic information. All topic-induced models beat the topical embedding model, with the 25-topics variant giving a 5-fold average accuracy of .
The PCA updates to most of these models decrease the accuracy of classification except for the 5-topic topic-induced variant, where the accuracy increases from to when using average word2vec as features to an SVM classifier, and from to when using the concatenated variant.
To check if our accuracies are statistically significant, we calculate our test statistic (Z) as follows:
$$Z = \frac{p_1 - p_2}{\sqrt{2\hat{p}\,(1 - \hat{p})/n}} \tag{4}$$
where $p_1$ and $p_2$ are the proportions of samples correctly classified by the two classifiers respectively, $n$ is the number of samples (which in our case is ) and $\hat{p} = (p_1 + p_2)/2$.
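As a rough numerical aid, the comparison can be sketched as below, assuming the statistic is the standard test for the difference of two proportions with a pooled estimate and a two-sided p-value. The function names are hypothetical, and the inputs in the usage note are made-up values rather than results from the paper.

```python
import math

def two_proportion_z(p1, p2, n):
    """Z statistic for two classifiers evaluated on the same n samples."""
    p_bar = (p1 + p2) / 2.0
    standard_error = math.sqrt(2.0 * p_bar * (1.0 - p_bar) / n)
    return (p1 - p2) / standard_error

def two_sided_p_value(z):
    """Two-sided p-value under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For instance, `two_proportion_z(0.6, 0.4, 100)` is roughly 2.83, which `two_sided_p_value` maps to a p-value below 0.01.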
Augmenting word2vec models with topic information significantly improves accuracy in the topic-induced word2vec model () when compared to the vanilla-trained word2vec model. This change is not significant, however, in the topical embedding model (), though it still outperforms Yancheva and Rudzicz (2016).
5.3 Ablation study
Using the best-performing model (i.e., the 25-topic topic-induced word2vec model with average word2vec as features), we consider other discriminative classifiers. As seen in Table 5, the linear SVM model gives the best accuracy of , though all other models perform similarly, with accuracies upwards of . There is no statistically significant difference between using an SVM vs. a logistic regression () or a gradient boosting classifier ().
|Discriminative classifier|F1 micro|F1 macro|
|SVM w/ linear kernel|77.50%|77.19%|
|Gradient boosting classifier|73.14%|72.39%|
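For reference, the two averaged scores reported in the table can be computed as in the sketch below. The function name is hypothetical and the labels in the test data are toy values. Micro-averaging pools the counts over both classes, while macro-averaging takes the unweighted mean of the per-class F1 scores.

```python
def f1_micro_macro(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Pooled counts give the micro average; for single-label data it equals accuracy.
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    macro = sum(per_class_f1) / len(per_class_f1)
    return micro, macro
```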
Although the topic distributions of the LDA models were not distinctive enough in themselves, they capture subtle differences between the AD and CT patients missed by the vanilla word2vec models. Simple concatenation of this distribution to the document increases the accuracy by ().
Topic vectors on their own do not provide much generative potential for this clinical data set. Our hypothesis is that representing a document as a single point in space, after going through two layers of contraction, removes information relevant to classification.
However, using the same word-topic distribution, normalizing it per word, and combining that information directly into word2vec training increases accuracies . Concatenating the topical embedding to the average word2vec also helps to boost accuracy slightly.
6.1 Topic-induced negative sampling
Our novel topic-induced model performs the best among our proposed models, with an accuracy of on a 5-fold split of the DB dataset. To put this in perspective, Yancheva and Rudzicz (2016)’s automatic vector-space topic models achieved on the same data set, albeit with a slightly different setup.
Adding topics as strings to the corpus is similar in spirit to adding noise during negative sampling in word2vec (Mikolov et al., 2013). However, the vanilla word2vec models already incorporate negative sampling, and are substantially outperformed by our topic-induced variants. The intuition of letting the words ‘know’ the kind of topics that occur around them, and vice versa, seems to be conducive to incorporating that information into the embeddings themselves. These noisy additions to the corpus also get assigned meaningful embeddings, as can be seen in certain cases where the concatenation model outperforms the average word2vec variant.
Applying PCA to the features does not produce a consistent trend.
In this paper, we show the utility of augmenting word2vec with LDA-induced topics. We present three models, two of which outperform vanilla word2vec and LDA models for a clinical binary text classification task. By contrast, topic vector baselines collapse all the relevant information and only perform randomly.
Our topic-induced model with 25 topics trained on WLS and tested on DB achieves an accuracy of . Going forward, we will test this model on other tasks, diagnostic and otherwise, to assess its generalizability. This can provide a starting point for clinical classification problems where labeled data may be scarce.
The Wisconsin Longitudinal Study is sponsored by the National Institute on Aging (grant numbers R01AG009775, R01AG033285, and R01AG041868), and was conducted by the University of Wisconsin.
- AlSumait et al. (2009) Loulwah AlSumait, Daniel Barbará, James Gentle, and Carlotta Domeniconi. 2009. Topic significance ranking of LDA generative models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pages 67–82.
- Alzheimer’s Association et al. (2017) Alzheimer’s Association et al. 2017. 2017 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia 13(4):325–373.
- Arora et al. (2016) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings.
- Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of machine learning research 3(Jan):993–1022.
- Cieri et al. (2004) Christopher Cieri, David Miller, and Kevin Walker. 2004. Fisher English training speech parts 1 and 2. Philadelphia: Linguistic Data Consortium.
- Fraser et al. (2016) Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. 2016. Linguistic features identify Alzheimer’s disease in narrative speech. Journal of Alzheimer’s Disease 49(2):407–422.
- Goodglass and Barresi (2000) Harold Goodglass and Barbara Barresi. 2000. Boston diagnostic aphasia examination: Short form record booklet. Lippincott Williams & Wilkins.
- Hoffman et al. (2010) Matthew Hoffman, Francis R Bach, and David M Blei. 2010. Online learning for latent Dirichlet allocation. In Advances in neural information processing systems. pages 856–864.
- Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- Lan et al. (2005) Man Lan, Chew-Lim Tan, Hwee-Boon Low, and Sam-Yuan Sung. 2005. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Special interest tracks and posters of the 14th international conference on World Wide Web. ACM, pages 1032–1033.
- Lewis et al. (1987) David Lewis et al. 1987. Reuters-21578. Test Collections 1.
- Liu et al. (2015) Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In Association for the Advancement of Artificial Intelligence. pages 2418–2424.
- Luo and Li (2014) Le Luo and Li Li. 2014. Defining and evaluating classification algorithm for high-dimensional data based on latent topics. PloS one 9(1):e82119.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov):2579–2605.
- MacWhinney (1992) Brian MacWhinney. 1992. The CHILDES project: Tools for analyzing talk. Child Language Teaching and Therapy 8(2):217–218. https://doi.org/10.1177/026565909200800211.
- MacWhinney et al. (2011) Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. AphasiaBank: Methods for studying discourse. Aphasiology 25(11):1286–1307.
- Mcauliffe and Blei (2008) Jon D Mcauliffe and David M Blei. 2008. Supervised topic models. In Advances in neural information processing systems. pages 121–128.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.
- Noorian et al. (2017) Zeinab Noorian, Chloé Pou-Prom, and Frank Rudzicz. 2017. On the importance of normative data in speech-based assessment. arXiv preprint arXiv:1712.00069.
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research 12(Oct):2825–2830.
- Ramage et al. (2009) Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, pages 248–256.
- Sievert and Shirley (2014) Carson Sievert and Kenneth Shirley. 2014. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces. pages 63–70.
- Yancheva and Rudzicz (2016) Maria Yancheva and Frank Rudzicz. 2016. Vector-space topic models for detecting Alzheimer’s disease. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 2337–2346.