Auto-Encoding Variational Bayes for Inferring Topics and Visualization
Visualization and topic modeling are widely used approaches for text analysis. Traditional visualization methods find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. In contrast, topic modeling aims to discover topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Recent approaches propose using a generative model to jointly find topics and visualization, allowing the semantics to be infused in the visualization space for a meaningful interpretation. A major challenge that prevents these methods from being used practically is the scalability of their inference algorithms. We present, to the best of our knowledge, the first fast Auto-Encoding Variational Bayes based inference method for jointly inferring topics and visualization. Since our method is black box, it can handle model changes efficiently with little mathematical rederivation effort. We demonstrate the efficiency and effectiveness of our method on real-world large datasets and compare it with existing baselines.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
1 Introduction
Visualization and topic modeling are important tools in the analysis of text corpora. Visualization methods, such as t-SNE, find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. Such visualization is useful for exploratory tasks. However, it lacks semantic interpretation because these visualization methods do not extract topics. In contrast, topic modeling aims to discover semantic topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Since this pipeline approach may not be ideal, there has been recent interest in jointly inferring topics and visualization using a single generative model. This joint approach allows the semantics to be infused in the visualization space, where users can view documents and their topics. The problem of jointly inferring topics and visualization can be formally stated as follows.
Problem. Let $\mathcal{D}$ denote a finite set of $D$ documents and let $\mathcal{V}$ be a finite vocabulary from these documents. Given a number of topics $Z$ and a visualization dimension $g$ (typically 2 or 3), we want to find:
For topic modeling: $Z$ latent topics and their word distributions collectively denoted as $\beta$, and topic distributions of documents collectively denoted as $\theta$, and
For visualization: $g$-dimensional visualization coordinates $x_1, \dots, x_D$ for documents and $\phi_1, \dots, \phi_Z$ for topics, such that the distances between documents and topics in the visualization space reflect the topic-document distributions $\theta$.
To solve this problem, PLSV (Probabilistic Latent Semantic Visualization) was the first model to tie together all the latent variables of topics and visualization (i.e., $\beta$, $\theta$, the document coordinates $x$, and the topic coordinates $\phi$) in a single generative model. Its tight integration of visualization and the underlying topic model can support applications such as user-driven topic modeling, where users interactively provide feedback to the model. PLSV can also be used as a basic building block when developing new models for other analysis tasks, such as visual comparison of document collections.
Relatively less attention has been paid to methods for fast inference of topics and visualization. Existing models often use Maximum a Posteriori (MAP) estimation with the EM algorithm, which is difficult to scale to large datasets. As shown in Figure 12, to run a PLSV model of 50 topics via MAP estimation on a dataset of modest size (e.g., 20 Newsgroups), it takes more than 18 hours using a single core. This long running time limits the usability of these visualization methods in practice.
In this paper, we propose a fast Auto-Encoding Variational Bayes (AEVB) based inference method for inferring topics and visualization. AEVB is a black-box variational method that enables efficient inference and learning in latent Gaussian models with large datasets. However, to apply the AEVB approach to topic models like LDA, one needs to deal with problems caused by the Dirichlet prior and by posterior collapse. One successful AEVB based method proposed to tackle these challenges for topic models is AVITM.
It is not straightforward to apply AEVB or AVITM to our problem because of two main challenges. First, as reviewed in Section 2, PLSV models a document's topic distribution using a softmax function over its Euclidean distances to topics. It is not clear how to express this nonlinear functional relationship between the three categories of latent variables (i.e., topic distribution $\theta_d$, document coordinate $x_d$, and topic coordinate $\phi_z$) when applying AVITM to visualization. Second, AEVB assumes that latent encodings are independently and identically distributed (i.i.d.) across samples. In our case, this assumption works well for the latent document coordinates, where each document is associated with its own latent encoding in the visualization space. However, for the topic coordinates $\phi$ and word probabilities $\beta$, the assumption is too strong. The reason is that the latent encodings of a topic w.r.t. different documents are not independent; in fact, in our extreme case these latent encodings are identical, i.e., topic $z$ has the same coordinate $\phi_z$ for any document. In other words, $\phi_z$ is shared across documents. The same argument also applies to the word probabilities $\beta$.
To address the first challenge, we propose to model the nonlinear functional relationship between $\theta_d$, $x_d$, and $\phi$ using a normalized Radial Basis Function (RBF) neural network. In this model, $\phi_z$ is the center vector of neuron $z$; that is, the topic coordinates $\phi$ are treated as parameters of the RBF network and will be estimated. Similarly, we model the word probabilities $\beta$ as parameters of a linear neural network that is connected to the RBF network to form the decoder in the AEVB approach. Treating $\phi$ and $\beta$ as parameters of the decoder solves the second challenge, though our algorithm then learns their point estimates rather than their posterior distributions. In Section 3, we present our proposed method in detail. We focus on the PLSV model in this work, though the proposed AEVB inference method could be easily adapted to other visualization models.
We summarize our contributions as follows:
We propose, to the best of our knowledge, the first AEVB inference method for the problem of jointly inferring topics and visualization.
In our approach, we design a decoder that includes an RBF network connected to a linear neural network. These networks are parameterized by topic coordinates and word probabilities, ensuring that they are shared across all documents.
We conduct extensive experiments on real-world large datasets, showing the efficiency and effectiveness of our method. While running much faster than PLSV, it gains better visualization quality and comparable topic coherence.
Since our method is black box, it can handle model changes efficiently with little mathematical rederivation effort. We implement different PLSV models that use different RBFs by changing just a few lines of code. We show experimentally that PLSV with the Gaussian or Inverse quadratic RBF consistently produces good performance across datasets.
2 Background and Related Work
2.1 Topic Modeling and Visualization
Topic models [3, 8] are widely used for unsupervised representation learning of text and have found applications in different text mining tasks [23, 2, 28, 12]. Popular topic models such as LDA find a low-dimensional representation of each document in the topic space. Each dimension of the topic space has a meaning attached to it and is modeled as a probability distribution over words. In contrast, t-SNE and LargeVis are visualization methods that aim to find for each document a low-dimensional representation (typically 2D or 3D). However, the visualization space usually lacks the semantic interpretation that topic models provide. Therefore, there have been works attempting to infuse semantics into the visualization space by jointly modeling topics and visualization [10, 18]. These methods often suffer from scalability issues on large datasets. In this work, we aim to scale up these methods by proposing a fast AEVB based inference method. We focus on PLSV when applying our proposed method. PLSV has been used as a basic building block for new models in visual text mining tasks [19, 17], and our proposed method could be easily adapted to these models.
PLSV assumes the following process to generate documents and visualization:
For each topic $z = 1, \dots, Z$:
Draw a word distribution: $\beta_z \sim \mathrm{Dirichlet}(\lambda)$
Draw a topic coordinate: $\phi_z \sim \mathcal{N}(\mathbf{0}, \tau^{-1} I)$
For each document $d = 1, \dots, D$:
Draw a document coordinate: $x_d \sim \mathcal{N}(\mathbf{0}, \gamma^{-1} I)$
For each word $w_{dn}$ in document $d$:
Draw a topic: $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$
Draw a word: $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$
Here each $\beta_z$ has a Dirichlet prior, and the topic and document coordinates have zero-mean Gaussian priors of the forms $\mathcal{N}(\mathbf{0}, \tau^{-1} I)$ and $\mathcal{N}(\mathbf{0}, \gamma^{-1} I)$ respectively. The topic distribution $\theta_d$ of a document is defined using a softmax function over its distances to the topics:

$$\theta_{dz} = p(z \mid x_d, \Phi) = \frac{\exp\left(-\frac{1}{2}\lVert x_d - \phi_z \rVert^2\right)}{\sum_{z'=1}^{Z} \exp\left(-\frac{1}{2}\lVert x_d - \phi_{z'} \rVert^2\right)} \quad (1)$$

where $\Phi = \{\phi_1, \dots, \phi_Z\}$.
As we can see from Eq. 1, the $z$th topic proportion of document $d$ is high when document coordinate $x_d$ is close to topic coordinate $\phi_z$. This relationship ensures that the distances between documents and topics in the visualization space reflect the topic-document distributions $\theta$. In the PLSV paper, the parameters are estimated using MAP estimation with the EM algorithm. As shown in our experiments, this algorithm does not scale to large datasets.
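To make Eq. 1 concrete, the following NumPy sketch computes one document's topic proportions from its coordinate and the topic coordinates. The function name and array shapes are illustrative assumptions, not part of PLSV's reference implementation:

```python
import numpy as np

def topic_proportions(x_d, Phi):
    """Eq. 1: softmax over negative halved squared Euclidean distances
    from document coordinate x_d (shape (g,)) to topic coordinates Phi (Z x g)."""
    sq_dists = np.sum((Phi - x_d) ** 2, axis=1)  # ||x_d - phi_z||^2 for each topic z
    logits = -0.5 * sq_dists
    logits -= logits.max()                       # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

# A document sitting exactly on a topic coordinate gets its highest proportion there.
Phi = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # 3 topics in 2D
theta = topic_proportions(np.array([0.0, 0.0]), Phi)
```

The proportions always sum to one, and they concentrate on the nearest topic as the distances to the other topics grow.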
2.2 Auto-Encoding Variational Bayes for Topic Models
AEVB and its variants, WiSE-ALE and AVITM, are black-box variational inference methods whose purpose is to allow practitioners to quickly explore and adjust a model's assumptions with little rederivation effort. AVITM is an auto-encoding variational inference method for topic models. It approximates the true posterior $p(\theta, z \mid \mathbf{w}, \alpha, \beta)$ using a variational distribution $q(\theta, z \mid \nu, \psi)$, where $\alpha$ is the hyperparameter of the Dirichlet prior and $\nu$ and $\psi$ are the free variational parameters over $\theta$ and $z$ respectively. Different from mean-field variational inference, AVITM computes the variational parameters using an inference neural network, and they are chosen by optimizing the following ELBO (i.e., the lower bound to the marginal log likelihood):

$$\mathcal{L}(\nu, \psi \mid \alpha, \beta) = -D_{KL}\left(q(\theta, z \mid \nu, \psi) \,\big\|\, p(\theta, z \mid \alpha)\right) + \mathbb{E}_{q(\theta, z \mid \nu, \psi)}\left[\log p(\mathbf{w} \mid \theta, z, \beta)\right] \quad (2)$$
By collapsing $z$ and approximating the Dirichlet prior with a logistic normal distribution, the second term in the ELBO (i.e., the expectation with respect to $q$) can be approximated using the reparameterization trick, as in AEVB. The second term is also referred to as the expected negative reconstruction error in variational auto-encoders (VAEs). While AVITM has been successfully applied to LDA, it is not straightforward to apply it to our problem, as discussed in the introduction.
3 Proposed Auto-Encoding Variational Bayes for Inferring Topics and Visualization
We represent a document $d$ as a row vector of word counts $\mathbf{w}_d = (n_{d1}, \dots, n_{d|\mathcal{V}|})$, where $n_{dw}$ is the number of occurrences of word $w$ in the document. The marginal likelihood of a document is given by:

$$p(\mathbf{w}_d \mid \Phi, \beta, \gamma) = \int p(x_d \mid \gamma) \prod_{n=1}^{N_d} \left[ \sum_{z=1}^{Z} p(z \mid x_d, \Phi)\, p(w_{dn} \mid \beta_z) \right] dx_d = \int p(x_d \mid \gamma)\, p(\mathbf{w}_d \mid x_d, \Phi, \beta)\, dx_d \quad (3)$$
The marginal likelihood of the corpus is $p(\mathcal{D} \mid \Phi, \beta, \gamma) = \prod_{d=1}^{D} p(\mathbf{w}_d \mid \Phi, \beta, \gamma)$. Note that here we treat $\Phi$ and $\beta$ as fixed quantities that are to be estimated. We are therefore working with a non-smoothed PLSV in which $\Phi$ and $\beta$ are not endowed with posterior distributions. By treating $\Phi$ and $\beta$ as model parameters, we ensure that they are shared across all documents in the AEVB approach. We will consider a fuller Bayesian treatment of PLSV in future work.
As in AVITM, we collapse the discrete latent variable $z$ to avoid the difficulty of determining a reparameterization function for it. The rightmost integral in Eq. 3 is the marginal likelihood after $z$ is collapsed. We now only need to consider the true posterior over the latent variable $x_d$: $p(x_d \mid \mathbf{w}_d, \Phi, \beta, \gamma)$. Because the integral in Eq. 3 is intractable, so is this posterior. We approximate it by a variational distribution $q_\Lambda(x_d \mid \mathbf{w}_d)$ parameterized by $\Lambda$, which is estimated using an inference network as in AEVB. We have the following lower bound to the marginal log likelihood (ELBO) of a document:

$$\log p(\mathbf{w}_d \mid \Phi, \beta, \gamma) \geq \mathcal{L}(\Lambda, \Phi, \beta \mid \mathbf{w}_d) = -D_{KL}\left(q_\Lambda(x_d \mid \mathbf{w}_d) \,\big\|\, p(x_d \mid \gamma)\right) + \mathbb{E}_{q_\Lambda(x_d \mid \mathbf{w}_d)}\left[\log p(\mathbf{w}_d \mid x_d, \Phi, \beta)\right] \quad (4)$$
Since the prior $p(x_d \mid \gamma)$ is Gaussian, we can let the variational posterior be a Gaussian with a diagonal covariance matrix: $q_\Lambda(x_d \mid \mathbf{w}_d) = \mathcal{N}(x_d; \mu_d, \Sigma_d)$ with $\Sigma_d = \mathrm{diag}(\sigma_{d1}^2, \dots, \sigma_{dg}^2)$. The KL divergence between the two Gaussians in Eq. 4 can then be computed in closed form as follows:

$$D_{KL}\left(q_\Lambda(x_d \mid \mathbf{w}_d) \,\big\|\, p(x_d \mid \gamma)\right) = \frac{1}{2} \sum_{i=1}^{g} \left( \gamma \sigma_{di}^2 + \gamma \mu_{di}^2 - 1 - \log \sigma_{di}^2 - \log \gamma \right) \quad (5)$$
where $\mu_d$ and the diagonal of $\Sigma_d$ are outputs of the encoding feed-forward neural network with variational parameters $\Lambda$. The expectation w.r.t. $q_\Lambda$ in Eq. 4 can be estimated using the reparameterization trick. More specifically, we sample from the posterior by reparameterizing over a random variable $\epsilon \sim \mathcal{N}(\mathbf{0}, I)$, i.e., $x_d = \mu_d + \Sigma_d^{1/2} \epsilon$. The expectation can then be approximated as:

$$\mathbb{E}_{q_\Lambda(x_d \mid \mathbf{w}_d)}\left[\log p(\mathbf{w}_d \mid x_d, \Phi, \beta)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p\left(\mathbf{w}_d \mid x_d^{(l)}, \Phi, \beta\right), \qquad x_d^{(l)} = \mu_d + \Sigma_d^{1/2} \epsilon^{(l)} \quad (6)$$
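The reparameterized sampling and the closed-form KL term can be sketched as follows. This is a minimal NumPy illustration, assuming an isotropic Gaussian prior $\mathcal{N}(\mathbf{0}, \gamma^{-1} I)$ and a diagonal Gaussian posterior; the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_sigma2):
    """Sample x_d = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma2) * eps

def kl_to_isotropic_prior(mu, log_sigma2, gamma=1.0):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, gamma^{-1} I) ), summed over dims."""
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum(gamma * sigma2 + gamma * mu**2 - 1.0 - log_sigma2 - np.log(gamma))

mu = np.zeros(2)
log_sigma2 = np.zeros(2)                 # sigma^2 = 1 in both dimensions
x_sample = reparameterize(mu, log_sigma2)
kl = kl_to_isotropic_prior(mu, log_sigma2, gamma=1.0)  # q equals the prior here
```

When the posterior coincides with the prior (as in the last lines), the KL term is exactly zero, which is a convenient sanity check for an implementation.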
In Eq. 6, the decoding term is computed as:

$$\log p(\mathbf{w}_d \mid x_d, \Phi, \beta) = \mathbf{w}_d \cdot \log\left(\theta_d\, \beta\right) \quad (7)$$

where $\beta$ is the $Z \times |\mathcal{V}|$ topic-word probability matrix, $\mathbf{w}_d$ is the row vector of word counts, and $\theta_d$ is the row vector of topic proportions computed as in Eq. 1. Based on Eq. 7 and Eq. 1, we propose a decoder composed of two connected neural networks:
Normalized RBF Network for computing $\theta_d$. In this network, we have $Z$ neurons in the hidden layer, and $\phi_z$ is the center vector of neuron $z$. The RBF function $\rho$ is a nonlinear function of the distance $r_{dz} = \lVert x_d - \phi_z \rVert$, and $w_{zz'}$ is the influence weight of neuron $z'$ on the output $\theta_{dz}$. While these weights can be estimated by optimizing the ELBO, we choose to fix $w_{zz'} = 1$ when $z = z'$ and 0 otherwise, which yields $\theta_{dz} = \rho(r_{dz}) / \sum_{z'=1}^{Z} \rho(r_{dz'})$. The parameters of this network are then the center vectors of the neurons, which are exactly the coordinates of the topics in the visualization space. The RBF function can have different forms, e.g., Gaussian: $\rho(r) = \exp(-\frac{1}{2} r^2)$, Inverse quadratic: $\rho(r) = \frac{1}{1 + r^2}$, or Inverse multiquadric: $\rho(r) = \frac{1}{\sqrt{1 + r^2}}$, where $r$ is the Euclidean distance between $x_d$ and $\phi_z$. Note that with the Gaussian RBF, $\theta_{dz}$ recovers exactly Eq. 1.
Linear Neural Network for computing $\theta_d \beta$. The output of the above normalized RBF network is the input of a linear neural network that computes $\theta_d \beta$ in the decoding term. We treat $\beta$ as the parameters, i.e., the linear weights $W$, of this network; $\beta$ is computed using a softmax over the network weights to ensure the simplex constraint on each topic's word distribution: $\beta_z = \mathrm{softmax}(W_z)$. The architecture of the whole variational auto-encoder is given in Figure 1. We use batch normalization to mitigate the posterior collapse issue found in the AEVB approach [7, 25].
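A minimal sketch of the decoder described above: a normalized RBF layer produces $\theta_d$, a softmax over linear weights produces $\beta$, and Eq. 7 gives the reconstruction log-likelihood. The identity influence weights and the three RBF forms follow the text; the shapes and function names are illustrative assumptions:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_log_likelihood(x_d, w_d, Phi, W, rbf="gaussian"):
    """Decoder sketch: normalized RBF layer -> theta_d, linear layer with softmax
    over weights W (Z x V) -> beta, then log p(w_d | x_d) = w_d . log(theta_d beta)."""
    r2 = np.sum((Phi - x_d) ** 2, axis=1)   # squared distances to the topic centers
    if rbf == "gaussian":
        act = np.exp(-0.5 * r2)
    elif rbf == "inverse_quadratic":
        act = 1.0 / (1.0 + r2)
    else:                                    # inverse multiquadric
        act = 1.0 / np.sqrt(1.0 + r2)
    theta = act / act.sum()                  # normalized RBF output (identity weights)
    beta = softmax(W)                        # rows of beta lie on the simplex
    return float(w_d @ np.log(theta @ beta))

rng = np.random.default_rng(0)
Phi = rng.standard_normal((3, 2))            # 3 topic coordinates in 2D
W = rng.standard_normal((3, 5))              # unconstrained weights for a 5-word vocabulary
w_d = np.array([2.0, 0.0, 1.0, 0.0, 0.0])   # word counts of one document
ll = decoder_log_likelihood(rng.standard_normal(2), w_d, Phi, W)
```

Swapping the RBF amounts to changing the `rbf` branch only, which mirrors the few-lines-of-code claim made for the different PLSV-VAE variants.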
4 Experiments
We evaluate the effectiveness and efficiency of our proposed AEVB based inference method for visualization and topic modeling both quantitatively and qualitatively. We use four real-world public datasets from different domains, including newswire articles, newsgroup posts, and academic papers:
Reuters: contains 7674 newswire articles from 8 categories.
20 Newsgroups: contains 18251 newsgroup posts from 20 categories.
Web of Science: we use the Web of Science WOS-46985 dataset. It contains the abstracts and keywords of 46,985 published papers from 7 research domains: CS, Psychology, Medical, ECE, Civil, MAE, and Biochemistry.
Arxiv: contains the titles and abstracts of 598,748 research papers from arXiv. The papers are from 7 categories: Math, CS, Nucl, Stat, Astro, Quant, and Physics.
We perform preprocessing by removing stopwords and stemming. The vocabulary sizes are 3000, 3248, 4000, and 5000 for Reuters, 20 Newsgroups, Web of Science, and Arxiv respectively. Note that our problem is unsupervised; the ground-truth class labels are used mainly for evaluation. Before detailing the experiment results, we describe the comparative methods.
Comparative Methods. We compare the following methods for inferring topics and visualization:
PLSV-MAP: the original PLSV using MAP estimation with the EM algorithm.
PLSV-VAE (Gaussian) [this paper]: we apply our proposed variational auto-encoder (VAE) inference to PLSV with the Gaussian RBF. We write PLSV-VAE to refer to PLSV-VAE (Gaussian).
PLSV-VAE (Inverse quadratic) and PLSV-VAE (Inverse multiquadric) [this paper]: these are PLSV-VAE models with the Inverse quadratic and Inverse multiquadric RBFs. Since our method is black box, we can quickly implement these two models by changing just a few lines of the PLSV-VAE (Gaussian) implementation.
Pipeline approach: this is the approach of topic modeling followed by an embedding of the documents' topic proportions for visualization. We compare the above joint models with two pipeline models:
LDA-VAE + t-SNE: LDA with VAE based inference, followed by t-SNE on the documents' topic proportions.
ProdLDA-VAE + t-SNE: similar to the above, but we use ProdLDA-VAE instead of LDA-VAE.
In the next sections, we report experiment results averaged over 10 independent runs. For the PLSV models, we choose hyperparameter values that work well for large datasets in our experiments. We run PLSV-MAP with the number of EM iterations set to 200 and the maximum number of iterations for the quasi-Newton algorithm set to 10. Following AVITM, we set the batch size to 256, the number of samples per document to 1, and the learning rate to 0.002, and we use dropout. We use Adam as our optimizing algorithm. The VAE based models are trained for 1000 epochs. All experiments are conducted on a system with 64GB memory and an Intel(R) Xeon(R) CPU E5-2623 v3 with 16 cores at 3.00GHz. The GPU on this system is an NVIDIA Quadro P2000 with 1024 CUDA cores and 5GB of GDDR5 memory.
4.1 Classification in the Visualization Space
We quantitatively evaluate visualization quality by measuring the $k$-NN accuracy in the visualization space. This evaluation approach is also adopted in t-SNE, LargeVis, and the original PLSV. A $k$-NN classifier is used to classify documents using their visualization coordinates. A good visualization should group documents with the same label together and hence yield a high classification accuracy in the visualization space. Figure 4 shows the $k$-NN accuracy of all methods on each dataset, for varying numbers of nearest neighbors $k$ and numbers of topics $Z$. For some settings, we do not show PLSV-MAP's performance because it does not return any results even after 24 hours of running. We can see that PLSV-VAE consistently achieves the best result, except for 25 topics on Reuters (Figure 4a), where it produces a result comparable with PLSV-MAP. These results show that the joint approach outperforms the pipeline approach and that VAE inference may help improve the visualization quality of PLSV. To verify this qualitatively, we show visualization examples of all methods across datasets in Section 4.3. Note that in this section we show the accuracy of PLSV-VAE with the Gaussian RBF; in Section 4.4, we present the performance of PLSV-VAE with different RBFs.
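The $k$-NN evaluation can be sketched as follows: a plain NumPy majority-vote classifier over the 2D coordinates, with each document excluded from its own neighborhood. This is an illustrative sketch, not the exact implementation used in the experiments:

```python
import numpy as np

def knn_accuracy(coords, labels, k=5):
    """k-NN accuracy in the visualization space: each document is classified by
    the majority label of its k nearest neighbors (itself excluded)."""
    n = len(coords)
    correct = 0
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        d[i] = np.inf                       # exclude the document itself
        nn = np.argsort(d)[:k]
        votes = np.bincount(labels[nn])     # count neighbor labels
        correct += int(votes.argmax() == labels[i])
    return correct / n

# Two well-separated clusters should give perfect k-NN accuracy.
rng = np.random.default_rng(0)
coords = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
acc = knn_accuracy(coords, labels, k=5)
```

The synthetic check captures the intuition in the text: the better the visualization separates classes, the higher this accuracy.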
4.2 Topic Coherence
We quantitatively measure the quality of the topic models produced by all methods in terms of topic coherence. The objective is to show that while having better visualization quality, PLSV-VAE also achieves comparable, if not better, topic coherence. For topic coherence evaluation, we use Normalized Pointwise Mutual Information (NPMI), which has been shown to correlate with human judgments. For a pair of words $w_i$ and $w_j$, NPMI is computed as follows:

$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\, P(w_j)}}{-\log P(w_i, w_j)} \quad (8)$$

We estimate $P(w_i)$, $P(w_j)$, and $P(w_i, w_j)$ using the Wikipedia 7-gram dataset.
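The NPMI computation itself is a one-liner once the probabilities are estimated; the sketch below (with illustrative probability values) shows its two boundary behaviors, perfectly co-occurring words scoring 1 and independent words scoring 0:

```python
import math

def npmi(p_i, p_j, p_ij):
    """NPMI of a word pair: PMI normalized by -log of the joint probability,
    bounded in [-1, 1]."""
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / (-math.log(p_ij))

# Perfectly co-occurring words: p_ij == p_i == p_j  ->  NPMI = 1.
perfect = npmi(0.01, 0.01, 0.01)
# Independent words: p_ij == p_i * p_j  ->  NPMI = 0.
independent = npmi(0.1, 0.1, 0.01)
```

In practice, the probabilities are relative co-occurrence frequencies within a sliding window over the reference corpus, and a topic's coherence is the average NPMI over its top-word pairs.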
4.3 Visualization Examples
We qualitatively compare the visualizations produced by all methods by showing some examples. In these visualizations, each document is represented by a point whose color indicates the class of that document. Figure 7 presents visualizations by PLSV-MAP and PLSV-VAE on Reuters and 20 Newsgroups. We see that PLSV-VAE finds meaningful clusters of documents. For example, PLSV-VAE in Figure 7(b) separates the eight Reuters classes well into different clusters, such as the pink cluster for acq, the orange cluster for earn, and the brown cluster for crude. The visualization by PLSV-MAP in Figure 7(a) also shows clear clusters, but it runs much slower than PLSV-VAE, as shown in Section 4.5. For the more challenging 20 Newsgroups dataset, PLSV-VAE produces better-separated clusters than PLSV-MAP: for example, baseball and hockey are mixed by PLSV-MAP but separated better by PLSV-VAE. We do not show visualizations of Web of Science and Arxiv by PLSV-MAP because it fails to return any results even after 24 hours of running. We instead show visualizations of these two large datasets by PLSV-VAE and ProdLDA-VAE + t-SNE in Figures 7 and 11. As we can see, the visualizations by PLSV-VAE are more intuitive than those by ProdLDA-VAE + t-SNE, which supports the advantage of the joint approach over the pipeline approach.
4.4 Comparing Different Radial Basis Functions
Since our method is black box, we can quickly explore the PLSV-VAE model under different assumptions. In this section, we show how different RBFs affect the performance of PLSV-VAE. Besides PLSV-VAE with the Gaussian RBF, we implement two other variants of PLSV-VAE that use the Inverse quadratic and Inverse multiquadric RBFs. We choose these two because, like the Gaussian, they support the assumption that the $z$th topic proportion of document $d$ is high when document coordinate $x_d$ is close to topic coordinate $\phi_z$. These model changes require no mathematical rederivation; we only change a few lines of code of PLSV-VAE (Gaussian). Figure 11 shows the $k$-NN accuracy and topic coherence of PLSV-VAE with the different RBFs. In general, PLSV-VAE with the Gaussian or Inverse quadratic RBF consistently produces good performance across datasets; in some cases, the Inverse quadratic produces better results.
4.5 Topic Examples and Running Time Comparison
To qualitatively evaluate the topics, in Figure 11 we show visualization and topic examples generated by PLSV-VAE (Inverse quadratic) on Arxiv. In the visualization, each black empty circle represents a topic, which is associated with a list of its top 10 words. We see that the topics are meaningful and reflect different research subdomains discussed in the Arxiv papers. For example, many topics are studied in the CS domain, such as “graph, g, vertex, k”, “model, data, use, method”, and “logic, program, system”. For the Astro domain, we have topics like “galaxi, cluster, star” and “observ, ray, model, star”. Topics such as “energi, nucleu, reaction” and “electron, energi, atom” are discussed in the Nucl domain. By infusing semantics into the visualization space, users can now see not only the documents but also their topics. The joint nature of the model may lead to potential applications in different visual text mining tasks.
Finally, we show the running time of all the methods in Figure 12. As expected, PLSV-MAP running on a single core is very slow and it fails to return any results on large datasets even after 24 hours of running. PLSV-VAE runs much faster. It only needs about 5 hours for 200 topics on the largest dataset Arxiv. For completeness, we also include the running time of LDA-VAE, and ProdLDA-VAE. PLSV-VAE is as fast as these methods. In summary, PLSV-VAE can find good topics and visualization while it can scale well to large datasets, which will increase its usability in practice.
5 Conclusion
We propose, to the best of our knowledge, the first fast AEVB based inference method for jointly learning topics and visualization. In our approach, we design a decoder that includes a normalized RBF network connected to a linear neural network. These networks are parameterized by the topic coordinates and word probabilities, ensuring that they are shared across all documents. Due to our method's black-box nature, we can quickly experiment with different RBFs with minimal reimplementation effort. Our extensive experiments on four real-world large datasets show that PLSV-VAE runs much faster than PLSV-MAP while achieving better visualization quality and comparable topic coherence.
This research is sponsored by NSF #1757207 and NSF #1914635.
- The distance $r$ is the Euclidean distance in our experiments.
- We use the implementation at https://github.com/tuanlvm/SEMAFORE
- The implementation of our method can be found at https://github.com/dangpnh2/plsv_vae
- We use the implementation at https://github.com/akashgit/autoencoding_vi_for_topic_models
- We use the Multicore t-SNE implementation at https://github.com/DmitryUlyanov/Multicore-TSNE
- (1995) Neural Networks for Pattern Recognition. Oxford University Press.
- (2007) A correlated topic model of science. The Annals of Applied Statistics 1(1), pp. 17–35.
- (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), pp. 993–1022.
- (2007) Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
- (2018) Gaussian process prior variational autoencoders. In Advances in Neural Information Processing Systems, pp. 10369–10380.
- (2013) UTOPIAN: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics 19(12), pp. 1992–2001.
- (2019) Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations (ICLR), New Orleans, LA, USA.
- (1999) Probabilistic latent semantic analysis. In UAI.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, pp. 448–456.
- (2008) Probabilistic latent semantic visualization: topic model for visualizing documents. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 363–371.
- (2010) Efficiently learning mixtures of two Gaussians. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, pp. 553–562.
- (2019) TopicSifter: interactive search space reduction through targeted topic modeling. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 35–45.
- (2014) Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
- (2017) HDLTex: hierarchical deep learning for text classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA).
- (2014) Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 530–539.
- (2019) ContraVis: contrastive and visual topic modeling for comparing document collections. In The World Wide Web Conference, pp. 928–938.
- (2014) Manifold learning for jointly modeling topic and visualization. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
- (2014) Semantic visualization for spherical representation. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1007–1016.
- (2016) Semantic visualization with neighborhood graph regularization. Journal of Artificial Intelligence Research 55, pp. 1091–1133.
- (2019) WiSE-ALE: wide sample estimator for aggregate latent embedding. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, LA, USA.
- (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), pp. 2579–2605.
- (2009) Topic modeling for the social sciences. In NIPS 2009 Workshop on Applications for Topic Models: Text and Beyond, Vol. 5, pp. 27.
- (2014) Black box variational inference. ArXiv abs/1401.0118.
- (2019) Preventing posterior collapse with delta-VAEs. In 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
- (2017) Autoencoding variational inference for topic models. In ICLR.
- (2016) Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, pp. 287–297.
- (2019) CompareLDA: a topic model for document comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7112–7119.