Auto-Encoding Variational Bayes for Inferring Topics and Visualization
Abstract
Visualization and topic modeling are widely used approaches for text analysis. Traditional visualization methods find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. In contrast, topic modeling aims to discover topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Recent approaches propose using a generative model to jointly find topics and visualization, allowing the semantics to be infused in the visualization space for a meaningful interpretation. A major challenge that prevents these methods from being used practically is the scalability of their inference algorithms. We present, to the best of our knowledge, the first fast Auto-Encoding Variational Bayes based inference method for jointly inferring topics and visualization. Since our method is black-box, it can handle model changes efficiently with little mathematical re-derivation effort. We demonstrate the efficiency and effectiveness of our method on real-world large datasets and compare it with existing baselines.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
1 Introduction
Visualization and topic modeling are important tools in the analysis of text corpora. Visualization methods, such as t-SNE [22], find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. Such visualizations are useful for exploratory tasks. However, they lack semantic interpretation, as those visualization methods do not extract topics. In contrast, topic modeling aims to discover semantic topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Since this pipeline approach may not be ideal, there has been recent interest in jointly inferring topics and visualization using a single generative model [10]. This joint approach allows the semantics to be infused in the visualization space, where users can view documents and their topics. The problem of jointly inferring topics and visualization can be formally stated as follows.
Problem. Let $\mathcal{D}$ denote a finite set of documents and let $\mathcal{V}$ be a finite vocabulary from these documents. Given a number of topics $K$ and a visualization dimension (typically 2 or 3), we want to find:

For topic modeling: $K$ latent topics and their word distributions, collectively denoted as $\beta$, and the topic distributions of documents, collectively denoted as $\theta$; and

For visualization: visualization coordinates for documents, $\{x_d\}$, and for topics, $\Phi = \{\phi_k\}$, such that the distances between documents and topics in the visualization space reflect the topic-document distributions $\theta$.
To solve this problem, PLSV (Probabilistic Latent Semantic Visualization) was the first model to tie together all latent variables of topics and visualization (i.e., $\theta$, $\beta$, $\{x_d\}$, and $\Phi$) in a generative model. Its tight integration between visualization and the underlying topic model can support applications such as user-driven topic modeling, where users can interactively provide feedback to the model [6]. PLSV can also be used as a basic building block when developing new models for other analysis tasks, such as visual comparison of document collections [17].
Relatively less attention has been paid to methods for fast inference of topics and visualization. Existing models often use Maximum a Posteriori (MAP) estimation with the EM algorithm, which is difficult to scale to large datasets. As shown in Figure 12, to run a PLSV model of 50 topics via MAP estimation on a dataset of modest size (e.g., 20 Newsgroups), it takes more than 18 hours using a single core. This long running time limits the usability of these visualization methods in practice.
In this paper, we aim to propose a fast Auto-Encoding Variational Bayes (AEVB) based inference method for inferring topics and visualization. AEVB [13] is a black-box variational method which is efficient for inference and learning in latent Gaussian models with large datasets. However, to apply the AEVB approach to topic models like LDA, one needs to deal with problems caused by the Dirichlet prior and by posterior collapse [7]. One of the successful AEVB-based methods proposed to tackle those challenges for topic models is AVITM [26].
It is not straightforward to apply AEVB or AVITM to our problem because of two main challenges. First, as reviewed in Section 2, PLSV models a document's topic distribution using a softmax function over its Euclidean distances to topics. It is not clear how to express this nonlinear functional relationship between three categories of latent variables (i.e., topic distribution $\theta_d$, document coordinate $x_d$, and topic coordinates $\Phi$) when applying AVITM to visualization. Second, AEVB assumes that latent encodings are independently and identically distributed (i.i.d.) across samples [5, 21]. In our case, this assumption works well with latent document coordinates, where each document is associated with its own latent encoding $x_d$ in the visualization space. However, for topic coordinates $\Phi$ and word probabilities $\beta$, that assumption is too strong. The reason is that the latent encodings of any topic w.r.t. any documents are not independent; in fact, in our extreme case these latent encodings are identical, i.e., every document sees the same coordinate $\phi_k$ for topic $k$. In other words, $\phi_k$ is shared across documents. The same argument also applies to the word probabilities $\beta$.
To address the first challenge, we propose to model the nonlinear functional relationship between $\theta_d$, $x_d$, and $\Phi$ using a normalized Radial Basis Function (RBF) neural network [1]. In this model, $\phi_k$ is the center vector for neuron $k$, i.e., the topic coordinates $\Phi$ are treated as parameters of the RBF network and will be estimated. Similarly, we model $\beta$ as parameters of a linear neural network that is connected to the RBF network to form the decoder in the AEVB approach. By treating $\Phi$ and $\beta$ as parameters of the decoder, we can solve the second challenge, though our algorithm then learns their point estimates rather than their posterior distributions. In Section 3, we present our proposed method in detail. We focus on the PLSV model in this work, though the proposed AEVB inference method could be easily adapted to other visualization models.
We summarize our contributions as follows:

We propose, to the best of our knowledge, the first AEVB inference method for the problem of jointly inferring topics and visualization.

In our approach, we design a decoder that includes an RBF network connected to a linear neural network. These networks are parameterized by topic coordinates and word probabilities, ensuring that they are shared across all documents.

We conduct extensive experiments on real-world large datasets, showing the efficiency and effectiveness of our method. While running much faster than PLSV, it achieves better visualization quality and comparable topic coherence.

Since our method is black-box, it can handle model changes efficiently with little mathematical re-derivation effort. We implement different PLSV models that use different RBFs by changing just a few lines of code. We experimentally show that PLSV with Gaussian or Inverse quadratic RBFs consistently produces good performance across datasets.
2 Background and Related Work
2.1 Topic Modeling and Visualization
Topic models [3, 8] are widely used for unsupervised representation learning of text and have found applications in different text mining tasks [23, 2, 28, 12]. Popular topic models, such as LDA [3], find a low-dimensional representation of each document in topic space. Each dimension of the topic space has a meaning attached to it and is modeled as a probability distribution over words. In contrast, t-SNE [22] and LargeVis [27] are visualization methods that aim to find for each document a low-dimensional representation (typically 2D or 3D). However, we often do not have the semantic interpretation for that low-dimensional space that topic models provide. Therefore, there have been works attempting to infuse semantics into the visualization space by jointly modeling topics and visualization [10, 18]. These methods often suffer from scalability issues on large datasets. In this work, we aim to scale up these methods by proposing a fast AEVB-based inference method. We focus on PLSV [10] when applying our proposed method. PLSV has been used as a basic building block for new models in visual text mining tasks [19, 17]. Our proposed method could be easily adapted to these models.
PLSV assumes the following process to generate documents and visualization:

For each topic $k = 1, \dots, K$:

Draw a word distribution: $\beta_k \sim \mathrm{Dirichlet}(\lambda)$

Draw a topic coordinate: $\phi_k \sim \mathcal{N}(0, \gamma^{-1} I)$

For each document $d = 1, \dots, D$:

Draw a document coordinate: $x_d \sim \mathcal{N}(0, \gamma^{-1} I)$

For each word $w_{dn}$ in document $d$:

Draw a topic: $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$, with $\theta_d$ defined in Eq. 1

Draw a word: $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$
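This generative process can be sketched end to end in a few lines of NumPy (a toy sampler for illustration only; the sizes and the hyperparameter values $\lambda$ and $\gamma$ below are our assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and hyperparameters (assumed values, for illustration only)
K, V, D, N_d, dim = 3, 10, 5, 8, 2
lam, gamma = 0.1, 1.0

# For each topic: a word distribution and a topic coordinate
beta = rng.dirichlet(np.full(V, lam), size=K)          # (K, V), rows on simplex
phi = rng.normal(0.0, 1.0 / np.sqrt(gamma), (K, dim))  # topic coordinates

docs = []
for d in range(D):
    x_d = rng.normal(0.0, 1.0 / np.sqrt(gamma), dim)   # document coordinate
    # Topic proportions: softmax over negative half squared distances (Eq. 1)
    logits = -0.5 * ((x_d - phi) ** 2).sum(axis=1)
    theta_d = np.exp(logits - logits.max())
    theta_d /= theta_d.sum()
    words = []
    for n in range(N_d):
        z = rng.choice(K, p=theta_d)                   # draw a topic
        words.append(int(rng.choice(V, p=beta[z])))    # draw a word
    docs.append(words)
```
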
Here each word distribution $\beta_k$ has a Dirichlet prior. Topic and document coordinates have zero-mean Gaussian priors $\mathcal{N}(0, \gamma^{-1} I)$. The topic distribution of a document is defined using a softmax function over its distances to topics:

$\theta_{d,k} = P(k \mid x_d, \Phi) = \dfrac{\exp\left(-\frac{1}{2} \lVert x_d - \phi_k \rVert^2\right)}{\sum_{k'=1}^{K} \exp\left(-\frac{1}{2} \lVert x_d - \phi_{k'} \rVert^2\right)}$ (1)

As we can see from Eq. 1, the $k$-th topic proportion $\theta_{d,k}$ of document $d$ is high when the document coordinate $x_d$ is close to the topic coordinate $\phi_k$. This relationship ensures that the distances between documents and topics in the visualization space reflect the topic-document distributions $\theta$. In the PLSV paper, the parameters are estimated using MAP estimation with the EM algorithm. As shown in our experiments, that algorithm does not scale to large datasets.
2.2 Auto-Encoding Variational Bayes for Topic Models
AEVB [14] and its variants, WiSE-ALE [21] and AVITM [26], are black-box variational inference methods whose purpose is to allow practitioners to quickly explore and adjust a model's assumptions with little re-derivation effort [24]. AVITM is an auto-encoding variational inference method for topic models. It approximates the true posterior $p(\theta \mid \mathbf{w}, \alpha, \beta)$ with a variational distribution $q_\eta(\theta \mid \mathbf{w})$, where $\alpha$ is the hyperparameter of the Dirichlet prior and $\eta$ denotes the free variational parameters over $\theta$. Different from mean-field variational inference, AVITM computes the variational parameters using an inference neural network, and they are chosen by optimizing the following ELBO (i.e., the lower bound to the marginal log likelihood):

$\mathcal{L}(\alpha, \beta \mid \mathbf{w}) = -D_{KL}\left(q_\eta(\theta \mid \mathbf{w}) \,\|\, p(\theta \mid \alpha)\right) + \mathbb{E}_{q_\eta(\theta \mid \mathbf{w})}\left[\log p(\mathbf{w} \mid \theta, \beta)\right]$ (2)

By collapsing the discrete topic assignments and approximating the Dirichlet prior with a logistic normal distribution, the second term (i.e., the expectation with respect to $q_\eta(\theta \mid \mathbf{w})$) in the ELBO can be approximated using the reparameterization trick as in AEVB. The second term is also referred to as the expected negative reconstruction error in variational autoencoders (VAE). While AVITM has been successfully applied to LDA, it is not straightforward to apply it to our problem, as discussed in the introduction.
3 Proposed Auto-Encoding Variational Bayes for Inferring Topics and Visualization
We represent a document as a row vector of word counts $\mathbf{w}_d = (w_{d,1}, \dots, w_{d,V})$, where $w_{d,v}$ is the number of occurrences of word $v$ in the document. The marginal likelihood of a document is given by:

$p(\mathbf{w}_d \mid \beta, \Phi) = \int p(x_d) \prod_{n=1}^{N_d} \sum_{k=1}^{K} p(w_{dn} \mid \beta_k)\, P(k \mid x_d, \Phi)\, dx_d = \int p(x_d) \prod_{n=1}^{N_d} (\theta_d \beta)_{w_{dn}}\, dx_d$ (3)

The marginal log likelihood of the corpus is $\sum_{d=1}^{D} \log p(\mathbf{w}_d \mid \beta, \Phi)$. Note that here we treat $\beta$ and $\Phi$ as fixed quantities that are to be estimated. Therefore we are working with a non-smoothed PLSV where $\beta$ and $\Phi$ are not endowed with a posterior distribution. By treating $\beta$ and $\Phi$ as model parameters, we ensure that they are shared across all documents in the AEVB approach. We will consider a fuller Bayesian approach to PLSV in future work.
As in AVITM, we collapse the discrete latent variable $z$ to avoid the difficulty of determining a reparameterization function for it. The rightmost integral in Eq. 3 is the marginal likelihood after $z$ is collapsed. We now only consider the true posterior distribution over the latent variable $x_d$: $p(x_d \mid \mathbf{w}_d, \beta, \Phi)$. Because the integral in Eq. 3 is intractable, so is this posterior. We approximate it by a variational distribution $q_\Theta(x_d \mid \mathbf{w}_d)$ parameterized by $\Theta$. The variational parameters are estimated using an inference network as in AEVB. We have the following lower bound to the marginal log likelihood (ELBO) of a document:

$\mathcal{L}(\Theta, \beta, \Phi \mid \mathbf{w}_d) = -D_{KL}\left(q_\Theta(x_d \mid \mathbf{w}_d) \,\|\, p(x_d)\right) + \mathbb{E}_{q_\Theta(x_d \mid \mathbf{w}_d)}\left[\log p(\mathbf{w}_d \mid x_d, \beta, \Phi)\right]$ (4)
Since the prior $p(x_d)$ is a Gaussian (we use the standard normal $\mathcal{N}(0, I)$), we can let the variational posterior be a Gaussian with a diagonal covariance matrix: $q_\Theta(x_d \mid \mathbf{w}_d) = \mathcal{N}(x_d; \mu_d, \mathrm{diag}(\sigma_d^2))$. The KL divergence between the two Gaussians in Eq. 4 can be computed in closed form as follows [11]:

$D_{KL}\left(q_\Theta(x_d \mid \mathbf{w}_d) \,\|\, p(x_d)\right) = \frac{1}{2} \sum_{i} \left(\sigma_{d,i}^2 + \mu_{d,i}^2 - 1 - \log \sigma_{d,i}^2\right)$ (5)
where $\mu_d$ and the diagonal $\sigma_d^2$ are outputs of the encoding feed-forward neural network with variational parameters $\Theta$. The expectation w.r.t. $q_\Theta(x_d \mid \mathbf{w}_d)$ in Eq. 4 can be estimated using the reparameterization trick [13]. More specifically, we sample from the posterior by reparameterizing over a random variable $\epsilon$, i.e., $x_d = \mu_d + \sigma_d \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. The expectation can then be approximated as:

$\mathbb{E}_{q_\Theta(x_d \mid \mathbf{w}_d)}\left[\log p(\mathbf{w}_d \mid x_d, \beta, \Phi)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{w}_d \mid x_d^{(l)}, \beta, \Phi), \qquad x_d^{(l)} = \mu_d + \sigma_d \odot \epsilon^{(l)}$ (6)
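The reparameterized sampling step can be sketched as follows (a NumPy illustration; in practice this runs inside an autodiff framework so that gradients flow through $\mu_d$ and $\sigma_d$):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x(mu, log_sigma2, n_samples=1):
    """Reparameterization trick: x = mu + sigma * eps, eps ~ N(0, I).

    mu, log_sigma2 : (D, dim) encoder outputs
    returns        : (n_samples, D, dim) samples; randomness enters only
    through eps, which is what makes the estimator differentiable w.r.t.
    mu and sigma in an autodiff framework.
    """
    sigma = np.exp(0.5 * log_sigma2)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu[None, :, :] + sigma[None, :, :] * eps

mu = np.zeros((4, 2))
log_sigma2 = np.full((4, 2), -2.0)  # sigma = exp(-1) for every coordinate
samples = sample_x(mu, log_sigma2, n_samples=1000)
```
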
In Eq. 6, the decoding term is computed as:

$\log p(\mathbf{w}_d \mid x_d, \beta, \Phi) = \sum_{v=1}^{V} w_{d,v} \log \left(\theta_d\, \beta\right)_v$ (7)

where $\beta \in \mathbb{R}^{K \times V}$ is the topic-word probability matrix, $\mathbf{w}_d$ is the row vector of word counts, $\theta_d$ is the row vector of topic proportions, and each $\theta_{d,k}$ is computed as in Eq. 1. Based on Eq. 7 and Eq. 1, we propose using a decoder with two connected neural networks:
Normalized Radial Basis Function Network for computing $\theta_d$. We generalize $\theta_{d,k}$ in Eq. 1 using a Normalized Radial Basis Function (RBF) Network [1] as follows:

$\theta_{d,k} = \dfrac{\sum_{j=1}^{K} a_{jk}\, \varphi(\lVert x_d - \phi_j \rVert)}{\sum_{j=1}^{K} \varphi(\lVert x_d - \phi_j \rVert)}$ (8)

In this network, we have $K$ neurons in the hidden layer, and $\phi_j$ is the center vector for neuron $j$. The RBF $\varphi$ is a nonlinear function of the distance $\lVert x_d - \phi_j \rVert$, and $a_{jk}$ is the influence weight of neuron $j$ on output $k$. While the weights can be estimated by optimizing the ELBO, we choose to fix $a_{jk} = 1$ when $j = k$ and 0 otherwise, which recovers Eq. 1 when $\varphi$ is the Gaussian RBF. The parameters of this network are then the center vectors of the neurons, which are the coordinates of topics in the visualization space. The RBF can have different forms, e.g., Gaussian: $\varphi(r) = e^{-\frac{1}{2}(\varepsilon r)^2}$, Inverse quadratic: $\varphi(r) = \frac{1}{1 + (\varepsilon r)^2}$, or Inverse multiquadric: $\varphi(r) = \frac{1}{\sqrt{1 + (\varepsilon r)^2}}$, where $r = \lVert x_d - \phi_j \rVert$ and $\varepsilon$ is a shape parameter.
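The three RBFs can be written down directly; the sketch below (with an assumed shape parameter `eps`) checks the property that motivates their use, namely that each peaks at zero distance and decays monotonically:

```python
import numpy as np

# The three radial basis functions discussed in the text, as functions of the
# distance r = ||x_d - phi_j||; eps is a shape parameter (an assumed default).
def gaussian(r, eps=1.0):
    return np.exp(-0.5 * (eps * r) ** 2)

def inverse_quadratic(r, eps=1.0):
    return 1.0 / (1.0 + (eps * r) ** 2)

def inverse_multiquadric(r, eps=1.0):
    return 1.0 / np.sqrt(1.0 + (eps * r) ** 2)

r = np.linspace(0.0, 5.0, 50)
for rbf in (gaussian, inverse_quadratic, inverse_multiquadric):
    values = rbf(r)
    # Each RBF peaks at r = 0 and decreases monotonically with distance,
    # which is what lets a normalized RBF layer generalize Eq. 1.
    assert values[0] == 1.0 and np.all(np.diff(values) < 0)
```
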
Linear Neural Network for computing $\theta_d \beta$. The output $\theta_d$ of the above normalized RBF network is the input to a linear neural network that computes $\theta_d \beta$ in the decoding term. We treat $\beta$ as the parameters, i.e., the linear weights $W$, of the network, and $\beta$ is computed using a softmax over the network weights to ensure the simplex constraint on each topic's word distribution: $\beta_k = \mathrm{softmax}(W_k)$. The architecture of the whole variational autoencoder is given in Figure 1. We use batch normalization [9] to mitigate the posterior collapse issue found in the AEVB approach [7, 25].
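Putting the two networks together, a forward pass of the decoder can be sketched as follows (NumPy, illustration only; the shapes, parameter names, and small random inputs are our assumptions, and a real implementation would run inside an autodiff framework with the encoder and batch normalization):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_log_likelihood(w, x, Phi, W, rbf=lambda r: np.exp(-0.5 * r ** 2)):
    """Forward pass of the decoder: Eq. 8 followed by Eq. 7.

    w   : (D, V) word-count vectors
    x   : (D, dim) sampled document coordinates
    Phi : (K, dim) topic coordinates (the RBF centers, learned)
    W   : (K, V) unnormalized linear weights; a softmax over the vocabulary
          axis gives the topic-word matrix beta
    returns (D,) log p(w_d | x_d, beta, Phi)
    """
    dist = np.sqrt(((x[:, None, :] - Phi[None, :, :]) ** 2).sum(-1))  # (D, K)
    phi_vals = rbf(dist)
    theta = phi_vals / phi_vals.sum(axis=1, keepdims=True)            # Eq. 8
    beta = softmax(W, axis=1)                                         # simplex rows
    p_w = theta @ beta                                                # (D, V)
    return (w * np.log(p_w + 1e-12)).sum(axis=1)                      # Eq. 7

rng = np.random.default_rng(1)
D, V, K, dim = 3, 7, 2, 2
w = rng.integers(0, 5, size=(D, V)).astype(float)
ll = decoder_log_likelihood(w, rng.normal(size=(D, dim)),
                            rng.normal(size=(K, dim)), rng.normal(size=(K, V)))
```
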
4 Experiments
We evaluate the effectiveness and efficiency of our proposed AEVB-based inference method for visualization and topic modeling both quantitatively and qualitatively. We use four real-world public datasets from different domains, including newswire articles, newsgroup posts, and academic papers.
Dataset Description

Reuters^{2}: contains 7,674 newswire articles from 8 categories [4].

20 Newsgroups^{3}: contains 18,251 newsgroup posts from 20 categories.

Web of Science^{4}: we use the Web of Science WOS46985 dataset [15]. It contains the abstracts and keywords of 46,985 published papers from 7 research domains: CS, Psychology, Medical, ECE, Civil, MAE, and Biochemistry.

Arxiv^{5}: contains the titles and abstracts of 598,748 research papers from arXiv. The papers are from 7 categories: Math, CS, Nucl, Stat, Astro, Quant, and Physics.
We perform preprocessing by removing stopwords and stemming. The vocabulary sizes are 3000, 3248, 4000, and 5000 for Reuters, 20 Newsgroups, Web of Science, and Arxiv, respectively. Note that our problem is unsupervised, and the ground-truth class labels are used mainly for evaluation. Before detailing the experimental results, we describe the comparative methods.
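As an illustration, the preprocessing step can be sketched as below; the stopword list and the naive suffix stripper are toy stand-ins, since the paper does not specify its exact stopword list or stemmer:

```python
import re
from collections import Counter

# Toy stand-ins: a tiny stopword list and a naive suffix stripper
# (a real pipeline would use a full stopword list and a proper stemmer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for"}

def naive_stem(token):
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

doc = "The networks are trained for visualizing documents and topics"
counts = Counter(preprocess(doc))  # bag-of-words counts after cleaning
```
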
Comparative Methods. We compare the following methods for inferring topics and visualization:
Joint approach:

PLSV-MAP^{6}: the original PLSV using MAP estimation with the EM algorithm [10].

PLSV-VAE (Gaussian) [this paper]^{7}: we apply our proposed variational autoencoder (VAE) inference to PLSV with a Gaussian RBF. We write PLSV-VAE to refer to PLSV-VAE (Gaussian).

PLSV-VAE (Inverse quadratic) and PLSV-VAE (Inverse multiquadric) [this paper]: these are PLSV-VAE models with Inverse quadratic and Inverse multiquadric RBFs. Since our method is black-box, we can quickly implement these two models by changing just a few lines of the PLSV-VAE (Gaussian) implementation.

Pipeline approach: this is the approach of topic modeling followed by an embedding of documents' topic proportions for visualization. We compare the above joint models with two pipeline models:

LDA-VAE + t-SNE^{9}: we learn topics and topic proportions with LDA-VAE and embed the topic proportions with t-SNE for visualization.

ProdLDA-VAE + t-SNE: similar to the above but we use ProdLDA-VAE^{8} instead of LDA-VAE.
In the next sections, we report the experimental results averaged across 10 independent runs. For PLSV models, we choose hyperparameter values that work well for large datasets in our experiments. We run PLSV-MAP with the number of EM iterations set to 200 and the maximum number of iterations for the quasi-Newton algorithm set to 10. Following AVITM, we set the batch size to 256, the number of samples per document to 1, and the learning rate to 0.002, and we use dropout. We use Adam as our optimizer. VAE-based models are trained for 1000 epochs. All experiments are conducted on a system with 64 GB memory and an Intel(R) Xeon(R) CPU E5-2623 v3 with 16 cores at 3.00 GHz. The GPU on this system is an NVIDIA Quadro P2000 with 1024 CUDA cores and 5 GB GDDR5.
4.1 Classification in the Visualization Space
We quantitatively evaluate the visualization quality by measuring the $k$-NN accuracy in the visualization space. This evaluation approach is also adopted in t-SNE, LargeVis, and the original PLSV. A $k$-NN classifier is used to classify documents using their visualization coordinates. A good visualization should group documents with the same label together and hence yield a high classification accuracy in the visualization space. Figures 4 and 4 show the $k$-NN accuracy of all methods on each dataset, for varying numbers of nearest neighbors $k$ and topics $K$. For some settings, we do not show PLSV-MAP's performance because it does not return any results even after 24 hours of running. We can see that PLSV-VAE consistently achieves the best result, except for 25 topics on Reuters (Figure 4a), where it produces a result comparable to PLSV-MAP. These results show that the joint approach outperforms the pipeline approach and that VAE inference may help improve the visualization quality of PLSV. To verify this qualitatively, we show visualization examples of all methods across datasets in Section 4.3. Note that in this section we show the accuracy of PLSV-VAE with the Gaussian RBF; in Section 4.4, we present the performance of PLSV-VAE with different RBFs.
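The $k$-NN accuracy metric can be sketched with a leave-one-out classifier over the 2D coordinates (a minimal NumPy illustration, not the paper's evaluation code):

```python
import numpy as np

def knn_accuracy(coords, labels, k=10):
    """Leave-one-out k-NN accuracy in the visualization space.

    coords : (D, 2) visualization coordinates
    labels : (D,) class labels, used only for evaluation
    Each document is classified by a majority vote among its k nearest
    neighbors (excluding itself); a good layout groups same-class points.
    """
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    sq = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)  # exclude the point itself
    correct = 0
    for i in range(len(labels)):
        votes = labels[np.argsort(sq[i])[:k]].tolist()
        pred = max(set(votes), key=votes.count)  # majority vote
        correct += pred == labels[i]
    return correct / len(labels)

# Two well-separated clusters should give perfect leave-one-out accuracy.
coords = np.vstack([np.random.default_rng(0).normal(0, 0.1, (20, 2)),
                    np.random.default_rng(1).normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
acc = knn_accuracy(coords, labels, k=10)
```
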
4.2 Topic Coherence
We quantitatively measure the quality of the topic models produced by all methods in terms of topic coherence. The objective is to show that, while having better visualization quality, PLSV-VAE also achieves comparable, if not better, topic coherence. For topic coherence evaluation, we use Normalized Pointwise Mutual Information (NPMI), which has been shown to correlate with human judgments [16]. The NPMI of a word pair $(w_i, w_j)$ is computed as follows:

$\mathrm{NPMI}(w_i, w_j) = \dfrac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)}$ (10)

We estimate $P(w_i)$, $P(w_j)$, and $P(w_i, w_j)$ using the Wikipedia 7-gram dataset^{10}.
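Given the estimated probabilities, Eq. 10 is a one-liner; the sketch below uses assumed toy probabilities rather than the Wikipedia n-gram counts:

```python
import math

def npmi(p_ij, p_i, p_j):
    """Eq. 10: normalized pointwise mutual information of a word pair."""
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / (-math.log(p_ij))

# Toy probabilities (assumed values, not from the Wikipedia n-gram data):
# a pair that always co-occurs gets NPMI = 1; independent words give NPMI = 0.
always = npmi(0.1, 0.1, 0.1)        # p_ij = p_i = p_j
independent = npmi(0.01, 0.1, 0.1)  # p_ij = p_i * p_j
```
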
4.3 Visualization Examples
We compare the visualizations produced by all methods qualitatively by showing some visualization examples. In these visualizations, each document is represented by a point, and the color of each point indicates the class of that document. Figures 7 and 7 present visualizations by PLSV-MAP and PLSV-VAE on Reuters and 20 Newsgroups. We see that PLSV-VAE can find meaningful clusters of documents. For example, PLSV-VAE in Figure 7(b) separates the eight classes well into different clusters, such as the pink cluster for acq, the orange cluster for earn, and the brown cluster for crude. The visualization by PLSV-MAP in Figure 7(a) also shows clear clusters, but it runs much slower than PLSV-VAE, as shown in Section 4.5. Figure 7 presents visualization outputs for 20 Newsgroups. For this more challenging dataset, PLSV-VAE produces better-separated clusters than PLSV-MAP. For example, baseball and hockey are mixed in Figure 7(a) by PLSV-MAP, but these classes are separated better in Figure 7(b) by PLSV-VAE. We do not show visualizations of Web of Science and Arxiv by PLSV-MAP because it fails to return any results even after 24 hours of running. We instead show visualizations of these two large datasets by PLSV-VAE and ProdLDA-VAE + t-SNE in Figures 7 and 11. As we can see, the visualizations by PLSV-VAE are more intuitive than the ones by ProdLDA-VAE + t-SNE, which supports the outperformance of the joint approach over the pipeline approach.
4.4 Comparing Different Radial Basis Functions
Since our method is black-box, we can quickly explore the PLSV-VAE model under different assumptions. In this section, we show how different RBFs affect the performance of PLSV-VAE. Besides PLSV-VAE with the Gaussian RBF, we implement two other variants of PLSV-VAE that use the Inverse quadratic and Inverse multiquadric RBFs. We choose these two because, like the Gaussian, they support the assumption that the $k$-th topic proportion of document $d$ is high when the document coordinate $x_d$ is close to the topic coordinate $\phi_k$. For these model changes, we do not need to perform a mathematical re-derivation; we only need to change a few lines of code of PLSV-VAE (Gaussian). Figures 11 and 11 show the $k$-NN accuracy and topic coherence of PLSV-VAE with different RBFs. In general, PLSV-VAE with Gaussian or Inverse quadratic RBFs consistently produces good performance across datasets. In some cases, Inverse quadratic produces better results.
4.5 Topic Examples and Running Time Comparison
To qualitatively evaluate the topics, in Figure 11 we show visualization and topic examples generated by PLSV-VAE (Inverse quadratic) on Arxiv. In the visualization, each black empty circle represents a topic that is associated with a list of its top 10 words. We see that the topics are meaningful and reflect different research subdomains discussed in the Arxiv papers. For example, many topics are studied in the CS domain, such as “graph, g, vertex, k”, “model, data, use, method”, and “logic, program, system”. For the Astro domain, we have topics like “galaxi, cluster, star” and “observ, ray, model, star”. Topics such as “energi, nucleu, reaction” and “electron, energi, atom” are discussed in the Nucl domain. By allowing the semantics to be infused in the visualization space, users can now see not only the documents but also their topics. The joint nature of the model may lead to potential applications in different visual text mining tasks.
Finally, we show the running time of all methods in Figure 12. As expected, PLSV-MAP running on a single core is very slow, and it fails to return any results on the large datasets even after 24 hours of running. PLSV-VAE runs much faster: it needs only about 5 hours for 200 topics on the largest dataset, Arxiv. For completeness, we also include the running times of LDA-VAE and ProdLDA-VAE. PLSV-VAE is as fast as these methods. In summary, PLSV-VAE can find good topics and visualization while scaling well to large datasets, which will increase its usability in practice.
5 Conclusion
We propose, to the best of our knowledge, the first fast AEVB-based inference method for jointly learning topics and visualization. In our approach, we design a decoder that includes a normalized RBF network connected to a linear neural network. These networks are parameterized by topic coordinates and word probabilities, ensuring that they are shared across all documents. Due to our method's black-box nature, we can quickly experiment with different RBFs with minimal re-implementation effort. Our extensive experiments on four real-world large datasets show that PLSV-VAE runs much faster than PLSV-MAP while achieving better visualization quality and comparable topic coherence.
Acknowledgements
This research is sponsored by NSF #1757207 and NSF #1914635.
Footnotes
$\lVert \cdot \rVert$ is the Euclidean distance in our experiments
 http://ana.cachopo.org/datasets-for-single-label-text-categorization
 https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
 https://data.mendeley.com/datasets/9rw3vkcfy4/6
 http://zhang18f.myweb.cs.uwindsor.ca/datasets/
 We use the implementation at https://github.com/tuanlvm/SEMAFORE
 The implementation of our method can be found at https://github.com/dangpnh2/plsv_vae
 We use the implementation at https://github.com/akashgit/autoencoding_vi_for_topic_models
 We use the Multicore t-SNE implementation at https://github.com/DmitryUlyanov/MulticoreTSNE
 https://nlp.cs.nyu.edu/wikipediadata/
References
 (1995) Neural networks for pattern recognition. Oxford University Press. Cited by: §1, §3.
 (2007) A correlated topic model of science. The Annals of Applied Statistics 1 (1), pp. 17–35. Cited by: §2.1.
 (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §2.1.
 (2007) Improving Methods for Single-label Text Categorization. Note: PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa Cited by: 1st item.
 (2018) Gaussian process prior variational autoencoders. In Advances in Neural Information Processing Systems, pp. 10369–10380. Cited by: §1.
 (2013) UTOPIAN: user-driven topic modeling based on interactive non-negative matrix factorization. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 1992–2001. Cited by: §1.
 (2019) Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations (ICLR), New Orleans, LA, USA. External Links: Link Cited by: §1, §3.
 (1999) Probabilistic latent semantic analysis. In UAI, Cited by: §2.1.
 (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456. External Links: Link Cited by: §3.
 (2008) Probabilistic latent semantic visualization: topic model for visualizing documents. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 363–371. Cited by: §1, §2.1, 1st item.
 (2010) Efficiently learning mixtures of two Gaussians. In Proceedings of the forty-second ACM Symposium on Theory of Computing, pp. 553–562. Cited by: §3.
 (2019) TopicSifter: interactive search space reduction through targeted topic modeling. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 35–45. Cited by: §2.1.
 (2014) Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, Cited by: §1, §3.
 (2014) Auto-encoding variational Bayes. CoRR abs/1312.6114. Cited by: §2.2.
 (2017) HDLTex: hierarchical deep learning for text classification. In Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, Cited by: 3rd item.
 (2014) Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 530–539. Cited by: §4.2.
 (2019) ContraVis: contrastive and visual topic modeling for comparing document collections. In The World Wide Web Conference, pp. 928–938. Cited by: §1, §2.1.
 (2014) Manifold learning for jointly modeling topic and visualization. In Twenty-Eighth AAAI Conference on Artificial Intelligence, Cited by: §2.1.
 (2014) Semantic visualization for spherical representation. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1007–1016. Cited by: §2.1.
 (2016) Semantic visualization with neighborhood graph regularization. Journal of Artificial Intelligence Research 55, pp. 1091–1133. Cited by: §3.
 (2019) WiSE-ALE: wide sample estimator for aggregate latent embedding. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, Cited by: §1, §2.2.
 (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §1, §2.1, 1st item.
 (2009) Topic modeling for the social sciences. In NIPS 2009 workshop on applications for topic models: text and beyond, Vol. 5, pp. 27. Cited by: §2.1.
 (2014) Black box variational inference. ArXiv abs/1401.0118. Cited by: §2.2.
 (2019) Preventing posterior collapse with delta-VAEs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, Cited by: §3.
 (2017) Autoencoding variational inference for topic models. In ICLR, Cited by: §1, §2.2, 1st item.
 (2016) Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, pp. 287–297. Cited by: §2.1.
 (2019) Comparelda: a topic model for document comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7112–7119. Cited by: §2.1.