# Streaming dynamic and distributed inference of latent geometric structures

###### Abstract

We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. Our model is nonparametric Bayesian, and the corresponding inference algorithm is able to discover new topics as time progresses. By exploiting the connection between the modeling of topic polytope evolution, the Beta-Bernoulli process, and the Hungarian matching algorithm, our method is shown to be several orders of magnitude faster than existing topic modeling approaches, as demonstrated by experiments processing several million documents in a dozen minutes.

## 1 Introduction

The topic or population polytope (Nguyen, 2015; Tang et al., 2014) is a fundamental geometric object that underlies the presence of latent topic variables in topic and admixture models (Blei et al., 2003; Pritchard et al., 2000). When data and the associated topics are indexed by a time dimension, it is of interest to study the temporal dynamics of such latent geometric structures. In this paper, we study models and algorithms for learning the temporal dynamics of the topic polytopes that arise in the analysis of text corpora. The convex geometry of topic models provides the theoretical basis for posterior contraction analysis of latent topics (Nguyen, 2015; Tang et al., 2014). Furthermore, Yurochkin & Nguyen (2016) and Yurochkin et al. (2017) exploited convex geometry to develop fast and quite accurate inference algorithms in a number of parametric and nonparametric settings.

Several authors have extended the basic topic modeling framework to analyze how topics evolve over time. Dynamic Topic Models (DTM) (Blei & Lafferty, 2006) demonstrated the importance of accounting for non-exchangeability between document groups, particularly when a time index is provided. Another approach is to keep topics fixed and consider only evolving topic popularity (Wang & McCallum, 2006). Hong et al. (2011) extended such an approach to multiple corpora. Ahmed & Xing (2012) proposed a nonparametric construction extending DTM in which topics can appear or eventually die out. The evolution of the latent geometric structure (i.e., the topic polytope) is implicitly present in all these works; however, it was not explicitly addressed and analyzed.

Directly confronting the inference of the temporal dynamics of the topic polytope offers several opportunities and challenges. To start, what is a suitable ambient space in which to represent the topic polytope? As topics evolve, so may the number of topics that are active or dormant, raising distinct modeling choices. Interesting issues arise in the inference, too. For instance, what is a principled way of mix-matching the vertices of a polytope to its next reincarnation? We must also keep in mind that the geometric structure should be modeled in a way that facilitates efficient inference.

In this work we propose to consider an isometric embedding of the unit sphere in the word simplex, so that the evolution of topic polytopes may be modeled by a collection of (random) trajectories of points residing on the unit sphere. Instead of attempting to mix-match vertices in an ad hoc fashion, we appeal to a Bayesian nonparametric modeling framework that allows the number of topic vertices to be random and to vary across time. The mix-matching between topic variables shall be guided by an assumption of smoothness on the collection of global trajectories on the sphere, using von Mises-Fisher dynamics (Mardia & Jupp, 2009). The selection of active topics at each time point will be enabled by a nonparametric prior on random binary matrices via the (hierarchical) Beta-Bernoulli process (Thibaux & Jordan, 2007).

Our contribution includes a sequence of Bayesian nonparametric models in increasing levels of complexity: the simpler model describes a topic polytope evolving over time, while the full model describes the temporal dynamics of a collection of topic polytopes as they arise from multiple corpora. The semantics of topics can be summarized as follows: there is a latent collection of global topics of unknown cardinality that evolve over time (e.g. topics in science or social topics in Twitter) and each year (or day) a subset of the global topics is elucidated by the community (i.e., some topics may be dormant at a given time point). The nature of each global topic may change smoothly (via varying word frequencies). Additionally, different subsets of global topics are associated with different groups (e.g. journals or Twitter location stamps), with some becoming active/inactive over time.

Another key contribution includes a suite of approximate inference algorithms that scale well in an online and distributed setting. In particular, the online MAP update of the latent topic polytope can be viewed as solving an optimal matching problem for which a fast Hungarian matching algorithm can be applied. Our approach is able to perform dynamic nonparametric topic inference on 3 million documents in 12 minutes, which is significantly faster than prior static online and/or distributed topic modeling algorithms (Newman et al., 2008; Hoffman et al., 2010; Wang et al., 2011; Bryant & Sudderth, 2012; Broderick et al., 2013).

The remainder of the paper is organized as follows. In Section 2 we define a Markov process over the space of topic polytopes (simplices). In Section 3 we present a series of models for polytope dynamics and describe our algorithms for online dynamic and/or distributed inference. Section 4 demonstrates experimental results. We conclude with a discussion in Section 5.

## 2 Temporal dynamics of a topic polytope

A fundamental object of inference in this work is the topic polytope arising in topic modeling (Blei et al., 2003; Nguyen, 2015). Given a vocabulary of $V$ words, a topic is defined as a probability distribution on the vocabulary. Thus a topic is taken to be a point in the vocabulary simplex, namely $\Delta^{V-1}$, and a topic polytope for a corpus of documents is defined as the convex hull of the topics associated with the documents. Geometrically, the topics correspond to the vertices (extreme points) of the (latent) topic polytope to be inferred from data.

In order to infer the temporal dynamics of a topic polytope, one might consider the evolution of each topic variable, say $\beta^{(t)}$, which represents a vertex of the polytope at time $t$. A standard approach is due to Blei & Lafferty (2006), who proposed to use a Gaussian Markov chain in $\mathbb{R}^V$ for modeling temporal dynamics and a logistic normal transformation, which sends elements of the unbounded $\mathbb{R}^V$ into the compact convex set $\Delta^{V-1}$. This map is many-to-one, and its behavior is difficult to control near the boundary of $\Delta^{V-1}$.

Motivated by the directional representation of the topic polytope, cf. Yurochkin et al. (2017), we shall represent each topic variable as a point on a unit sphere $\mathbb{S}^{V-2}$, which possesses a natural isometric embedding (i.e., one-to-one) in the vocabulary simplex $\Delta^{V-1}$, so that the temporal dynamics of a topic variable can be identified with a (random) trajectory on $\mathbb{S}^{V-2}$. This trajectory shall be modeled as a Markovian process on $\mathbb{S}^{V-2}$: $\theta^{(t+1)} \mid \theta^{(t)} \sim \mathrm{vMF}(\theta^{(t)}, \tau_0)$. The von Mises-Fisher (vMF) distribution is commonly used in the field of directional statistics (Mardia & Jupp, 2009) to model points on a unit sphere and was previously utilized for text modeling (Banerjee et al., 2005; Reisinger et al., 2010). Its log-likelihood is proportional to the cosine similarity, a similarity measure popular in text mining applications (Feldman & Sanger, 2007).
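To make the role of cosine similarity concrete, here is a minimal numerical sketch (in Python with NumPy; the function name and concentration value are illustrative, not from the paper) of the vMF log-density up to its normalizing constant:

```python
import numpy as np

def vmf_log_kernel(x, mu, tau):
    """Unnormalized vMF log-density: tau times the cosine similarity
    between x and the mean direction mu (both normalized to unit length)."""
    x = x / np.linalg.norm(x)
    mu = mu / np.linalg.norm(mu)
    return tau * float(x @ mu)

# A direction close to mu receives a higher score than a distant one.
mu = np.array([1.0, 0.0, 0.0])
near = np.array([0.9, 0.1, 0.0])
far = np.array([0.1, 0.9, 0.0])
assert vmf_log_kernel(near, mu, tau=10.0) > vmf_log_kernel(far, mu, tau=10.0)
```

The concentration parameter `tau` controls how sharply the density peaks around the mean direction.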

##### Isometric embedding of $\mathbb{S}^{V-2}$ in the vocabulary simplex

We recall the directional representation of a topic polytope (Yurochkin et al., 2017): let $\beta_1, \ldots, \beta_K$ be the collection of vertices of a topic polytope. Each vertex is represented as $\beta_k = C_\beta + R_k \theta_k$, where $C_\beta$ is a reference point in the convex hull of the polytope, $\theta_k$ is a topic direction, and $R_k > 0$. Moreover, $R_k$ is determined so that the tip of the direction vector resides on the boundary of $\Delta^{V-1}$. Since the effective dimensionality of $\Delta^{V-1}$ is $V-1$, we can now define a one-to-one and isometric map sending the directions onto $\mathbb{S}^{V-2}$ as follows: the vocabulary simplex is first translated so that $C_\beta$ becomes the origin and then rotated into $\mathbb{R}^{V-1}$, where the resulting topic directions, say $\tilde\theta_k$, are normalized to unit length. Observe that this geometric map is an isometry and hence invertible. It preserves angles between vectors, therefore we can evaluate the vMF density without performing the map explicitly, by simply setting $\tilde\theta_k = (\beta_k - C_\beta)/\|\beta_k - C_\beta\|_2$.
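The invariance argument above can be checked numerically. The sketch below (hypothetical toy numbers; the helper name is ours) normalizes translated topic vertices to unit directions and verifies that an arbitrary rotation leaves all pairwise inner products, and hence any vMF evaluation, unchanged:

```python
import numpy as np

def topic_directions(betas, C):
    """Map topic vertices beta_k = C + R_k * theta_k in the vocabulary
    simplex to unit directions theta_k = (beta_k - C)/||beta_k - C||."""
    D = betas - C  # translate so the reference point becomes the origin
    return D / np.linalg.norm(D, axis=1, keepdims=True)

# Toy topics on the simplex in R^3 (hypothetical numbers).
betas = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
C = np.array([1 / 3, 1 / 3, 1 / 3])  # reference point inside the polytope
theta = topic_directions(betas, C)

# The rotation into R^{V-1} is an isometry: any orthogonal map yields the
# same Gram matrix of directions, so vMF densities can be evaluated
# without performing the rotation explicitly.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))
theta_rot = topic_directions(betas @ Q.T, C @ Q.T)
assert np.allclose(theta @ theta.T, theta_rot @ theta_rot.T)
```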

Figure 1 provides a geometric illustration for $V = 3$, with the vocabulary simplex $\Delta^{2}$ shown as a red triangle. Two topics on the boundary (face) of the vocabulary simplex are $\beta_1$ and $\beta_2$. The green dot is the reference point $C_\beta$. In Fig. 1 (left) we move $C_\beta$ by translation to the origin and rotate $\Delta^{2}$ from $\mathbb{R}^3$ into the plane $\mathbb{R}^2$. In Fig. 1 (center) we show the resulting image of $\Delta^{2}$ and add a unit sphere (blue) in $\mathbb{R}^2$. Corresponding to the topics are the points $\theta_1$ and $\theta_2$ on the sphere. Now, apply the inverse translation and rotation to both $\Delta^{2}$ and the sphere; the result is shown in Fig. 1 (right): we are back in $\mathbb{R}^3$, with the sphere embedded in the affine hull of $\Delta^{2}$. In Fig. 1(a) we give a geometric illustration of the temporal dynamics.

As described above, each topic evolves along a random trajectory residing on the unit sphere, and so the evolution of a collection of topics can be modeled by a collection of corresponding trajectories on the sphere. Note that the number of "active" topics may be unknown and vary over time. Moreover, a topic may be activated, become dormant, and then resurface after some time. New modeling elements are introduced in the next section to account for these phenomena.

## 3 Hierarchical Bayesian modeling for single or multiple topic polytopes

We shall present a sequence of models with increasing levels of complexity: we start by introducing a hierarchical model for online learning of the temporal dynamics of a single topic polytope, allowing for a varying number of vertices over time. Next, we describe a model for multiple topic polytopes learned on different corpora drawing on a common pool of global topics. Finally, we present a "full" model for the evolution of global topic trajectories over time and across groups of corpora.

### 3.1 Dynamic model for single topic polytope

At a high level, our model maintains a collection of global trajectories taking values on a unit sphere. Each trajectory shall be endowed with the von Mises-Fisher dynamics described in the previous section. At each time point $t$, a random topic polytope is constructed by selecting a (random) subset of points on the trajectories evaluated at time $t$. The random selection is guided by a Beta-Bernoulli process prior (Thibaux & Jordan, 2007). This construction is motivated by a modeling technique of Nguyen (2010), who studied a Bayesian hierarchical model for inference of smooth trajectories on a Euclidean domain using Dirichlet process priors. Our model, using the Beta-Bernoulli process as a building block, is more appropriate for the purpose of topic discovery. Due to the isometric embedding of $\mathbb{S}^{V-2}$ in $\Delta^{V-1}$ described in the previous section, from here on we shall refer to topics as points on $\mathbb{S}^{V-2}$.

First, generate a collection of global topic trajectories using a Beta process prior (cf. Thibaux & Jordan (2007)) with a base measure $H$ on the space of trajectories on $\mathbb{S}^{V-2}$ and concentration parameter $c$:

$$Q \mid H \sim \mathrm{BP}(c, H). \qquad (1)$$

It follows that $Q = \sum_i q_i \delta_{\theta_i}$, where the weights $q_i \in (0,1)$ follow a stick-breaking construction (Teh et al., 2007): $q_i = \prod_{j=1}^{i} \nu_j$ with $\nu_j \sim \mathrm{Beta}(c, 1)$ i.i.d., and each atom $\theta_i = (\theta_i^{(1)}, \theta_i^{(2)}, \ldots)$ is a sequence of random elements on the unit sphere $\mathbb{S}^{V-2}$, which are generated as follows:

$$\theta_i^{(1)} \sim \mathrm{Unif}(\mathbb{S}^{V-2}), \qquad \theta_i^{(t+1)} \mid \theta_i^{(t)} \sim \mathrm{vMF}(\theta_i^{(t)}, \tau_0). \qquad (2)$$
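The stick-breaking weights of the global measure can be simulated in a few lines (NumPy sketch; the parameterization $\nu_j \sim \mathrm{Beta}(c, 1)$ follows the IBP stick-breaking of Teh et al. (2007), and the variable names are ours):

```python
import numpy as np

def beta_process_sticks(c, num_atoms, rng):
    """Stick-breaking weights q_i = prod_{j<=i} nu_j with nu_j ~ Beta(c, 1),
    yielding a non-increasing sequence of atom weights in (0, 1)."""
    nu = rng.beta(c, 1.0, size=num_atoms)
    return np.cumprod(nu)

rng = np.random.default_rng(0)
q = beta_process_sticks(c=2.0, num_atoms=10, rng=rng)
assert np.all((0 < q) & (q < 1)) and np.all(np.diff(q) <= 0)
```

Each weight $q_i$ then serves as the marginal probability that atom $i$ is selected by the Bernoulli process below.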

At any given time $t$, the process induces a marginal measure $Q_t := \sum_i q_i \delta_{\theta_i^{(t)}}$, whose support is given by the atoms of $Q$ as they are evaluated at time $t$. Now, select a subset of the global topics that are active at $t$ via the Bernoulli process $T_t \mid Q \sim \mathrm{BeP}(Q_t)$. This means that

$$T_t = \sum_i b_{it}\, \delta_{\theta_i^{(t)}}, \qquad b_{it} \mid q_i \sim \mathrm{Bernoulli}(q_i), \qquad (3)$$

are supported by the atoms $\{\theta_i^{(t)} : b_{it} = 1\}$, which represent the topics that are active at time $t$. Finally, assume that noisy measurements of each of these topic variables are generated via:

$$x_{kt} \sim \mathrm{vMF}(\theta_i^{(t)}, \tau_1) \quad \text{for each } i \text{ with } b_{it} = 1. \qquad (4)$$

Noisy estimates for the topics at any particular time point may come from either the global topics observed until the previous time point or a topic yet unexplored. The collection of random variables that "link up" observed noisy estimates at any time point to the global topics observed thus far is of interest: let $A^{(t)}$ denote the binary matrix representing the assignment of observed topic estimates to global topics at time point $t$, i.e., $a_{ki}^{(t)} = 1$ if the vector $x_{kt}$ is a noisy estimate for $\theta_i^{(t)}$.

By conditional independence, the joint posterior of the hidden topics given the observed noisy estimates factorizes over time points.

For a time point $t$, this posterior is proportional to

(5)

The equation above represents a product of four quantities: (1) the probability of the $b_{it}$'s, where $m_i^{(t-1)}$ denotes the number of occurrences of topic $i$ up to time point $t-1$ (cf. the popularity of a dish in the Indian Buffet Process (IBP) metaphor (Ghahramani & Griffiths, 2005)); (2) the vMF conditional of $\theta_i^{(t)}$ given $\theta_i^{(t-1)}$ (cf. Eq. (2)), where $C_V(\tau_0)$ denotes a normalizing constant; (3) the number of new global topics at time $t$; and (4) the emission probability (cf. Eq. (4)). Derivation details are given in the Supplement.

##### Streaming Dynamic Matching (SDM)

To perform MAP estimation in the streaming setting, we highlight the connection between the maximization of the posterior (5) and the objective of an optimal matching problem: given a cost matrix, workers should be assigned to tasks, with at most one worker per task and one task per worker. The solution of this problem is obtained by employing the well-known Hungarian algorithm (Kuhn, 1955). This connection is formalized by the following proposition.

###### Proposition 1.

Given the cost matrix

Consider the optimization problem $\max \sum_{k,i} C_{ki}\, a_{ki}^{(t)}$ subject to the constraints that (a) for each fixed $i$, at most one of the $a_{ki}^{(t)}$ is 1, the rest are 0, and (b) for each fixed $k$, exactly one of the $a_{ki}^{(t)}$ is 1, the rest are 0. Then the MAP for Eq. (5) can be obtained by the Hungarian algorithm, which solves for $A^{(t)}$, and the topic estimates are then obtained given $A^{(t)}$.

###### Proof.

Proof of this proposition is given in the Supplement. ∎
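For intuition, the assignment step can be sketched with SciPy's Hungarian solver. In this simplified stand-in for the full cost matrix of Proposition 1, the reward is plain cosine similarity between unit-norm noisy estimates and current global topics (function and variable names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_topics(estimates, global_topics):
    """Assign each noisy topic estimate to a distinct global topic by
    maximizing total cosine similarity (simplified reward)."""
    reward = estimates @ global_topics.T          # pairwise similarities
    rows, cols = linear_sum_assignment(reward, maximize=True)
    return dict(zip(rows.tolist(), cols.tolist()))

# Toy unit-norm topics: estimate 0 is close to global topic 1, and vice versa.
g = np.array([[1.0, 0.0], [0.0, 1.0]])
e = np.array([[0.1, 0.995], [0.995, 0.1]])
e = e / np.linalg.norm(e, axis=1, keepdims=True)
assert match_topics(e, g) == {0: 1, 1: 0}
```

`linear_sum_assignment` also handles rectangular rewards, which matters when the number of estimates differs from the number of global topics.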

### 3.2 Hierarchical Beta process for multiple topic polytopes

We now expand our modeling by allowing the presence of multiple corpora, each of which maintains its own topic polytope. Large text corpora can often be partitioned based on some grouping criterion, e.g., scientific papers by journal, news by media agency, or tweets by location stamp. In this subsection we model the collection of topic polytopes observed at a single time point by employing the Hierarchical Beta Process prior (HBP) (Thibaux & Jordan, 2007). The modeling of a collection of polytopes evolving over time will be described in the following subsection.

First, generate the global topic measure $Q$ as in Eq. (1). Since here we are interested only in a single time point, the base measure $H$ is simply the uniform distribution over $\mathbb{S}^{V-2}$. Next, for each group $j$, generate a group-specific distribution over topics:

$$Q_j \mid Q \sim \mathrm{BP}(c_j, Q), \qquad (6)$$

where the group-level weights $q_{ji}$ vary around the corresponding $q_i$. The distributional properties of $Q_j$ are described in Thibaux & Jordan (2007). We now proceed similarly to Eq. (3):

$$T_j \mid Q_j \sim \mathrm{BeP}(Q_j), \qquad T_j = \sum_i b_{ji}\, \delta_{\theta_i}. \qquad (7)$$

Notice that each group selects only a subset from the collection of global topics, which is consistent with the idea of partitioning by journals: some topics of NIPS are not represented in KDD and vice versa. The next step is analogous to Eq. (4):

$$x_{jk} \sim \mathrm{vMF}(\theta_i, \tau_1) \quad \text{for each } i \text{ with } b_{ji} = 1. \qquad (8)$$

We again use $A_j$ to denote the binary matrix representing the assignment of global topics to the noisy topic estimates, i.e., $a_{ki}^{(j)} = 1$ if the topic estimate $x_{jk}$ for group $j$ arises as a noisy estimate of global topic $\theta_i$. However, the matching problem is now different from before: we do not have any information about the global topics, as there is no history; instead we should match a collection of topic polytopes to a global topic polytope. As shown in the Supplement, the matrix of topic assignments is distributed a priori as an Indian Buffet Process (IBP) with parameter $c$. The conditional probability of the global topics and assignment matrices given the topic estimates has the following form:

(9)

where $K$ is any upper bound on the number of global topics, the number of active global topics depends on the partition induced by the assignment, and $m_i$ represents the popularity of global topic $i$ after the assignment. The constant $C_V$ is the inverse of the surface integral of the unit sphere $\mathbb{S}^{V-2}$ with respect to the Lebesgue measure.

##### Distributed Matching (DM)

Similar to Section 3.1, we look for point estimates of the topic directions and of the topic assignment matrices. Direct computation of the global MAP estimate for Eq. (9) is not straightforward. The problem of matching across groups and topics is not directly amenable to the Hungarian algorithm. However, we show that for a fixed group the assignment optimization reduces to an instance of the Hungarian algorithm. This motivates applying the Hungarian algorithm iteratively, which guarantees convergence to a local optimum.

###### Proposition 2.

Define the cost matrix

where $-j$ denotes the set of groups excluding group $j$ and $K^{-j}$ is the number of global topics before group $j$ is considered (due to the exchangeability of the IBP, group $j$ can always be considered last). Then a locally optimal MAP estimate for Eq. (9) can be obtained by iteratively employing the Hungarian algorithm to solve, for each group $j$, the maximization of $\sum_{k,i} C_{ki}\, a_{ki}^{(j)}$, subject to the constraints: (a) for each fixed $i$ and $j$, at most one of the $a_{ki}^{(j)}$ is 1, the rest are 0, and (b) for each fixed $k$ and $j$, exactly one of the $a_{ki}^{(j)}$ is 1, the rest are 0. After solving for the assignment of group $j$, the updated iterative estimate of the global topics is obtained as:

###### Proof.

Proof of this proposition is given in the Supplement. ∎
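The iterative scheme can be sketched as follows (a simplified illustration: cosine-similarity rewards replace the full cost matrix of Proposition 2, the birth of new global topics is omitted, and all names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def distributed_match(group_estimates, num_iters=5):
    """Iteratively match each group's noisy topic estimates to a shared
    pool of global topics, then re-estimate each global topic as the
    normalized mean direction of its assigned estimates."""
    global_topics = group_estimates[0].copy()    # initialize from one group
    for _ in range(num_iters):
        assigned = [[] for _ in range(len(global_topics))]
        for est in group_estimates:
            reward = est @ global_topics.T       # cosine-similarity rewards
            rows, cols = linear_sum_assignment(reward, maximize=True)
            for r, c in zip(rows, cols):
                assigned[c].append(est[r])
        for c, members in enumerate(assigned):
            if members:
                mean_dir = np.mean(members, axis=0)
                global_topics[c] = mean_dir / np.linalg.norm(mean_dir)
    return global_topics

# Two groups observing the same two underlying topics (toy unit vectors).
g1 = np.array([[1.0, 0.0], [0.0, 1.0]])
g2 = np.array([[0.995, 0.1], [0.1, 0.995]])
g2 = g2 / np.linalg.norm(g2, axis=1, keepdims=True)
topics = distributed_match([g1, g2], num_iters=3)
```

Each pass through the groups can only increase the total matching reward, mirroring the local-optimum guarantee of the iterative Hungarian scheme.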

### 3.3 Dynamic hierarchical Beta process

Our "full" model, the Dynamic Hierarchical Beta Process model (dHBP), combines the constructions described in Subsections 3.1 and 3.2 to enable inference of the temporal dynamics of collections of topic polytopes. We start by specifying the upper-level Beta process given by Eq. (1) with base measure given by Eq. (2). Next, for each group $j$, generate a group-specific distribution over topics:

$$Q_j \mid Q \sim \mathrm{BP}(c_j, Q). \qquad (10)$$

At any given time $t$, each group selects a subset from the common pool of global topics:

$$T_{jt} \mid Q_j \sim \mathrm{BeP}(Q_{jt}), \qquad T_{jt} = \sum_i b_{jit}\, \delta_{\theta_i^{(t)}}. \qquad (11)$$

Let $\{\theta_i^{(t)} : b_{jit} = 1\}$ be the corresponding collection of atoms – the topics active at time $t$ in group $j$. Noisy measurements of these topics are generated by:

$$x_{jkt} \sim \mathrm{vMF}(\theta_i^{(t)}, \tau_1) \quad \text{for each } i \text{ with } b_{jit} = 1. \qquad (12)$$

The conditional distribution of the global topics at time $t$ given the state of the global topics at time $t-1$ is:

(13)

where $f$ is a function dependent on the counts $\{m_{ji}^{(t)}\}$. Analogous to the Chinese Restaurant Franchise (Teh et al., 2006), one can think of an Indian Buffet Franchise in the case of the HBP: a headquarters buffet provides some dishes each day, and the local branches serve a subset of those dishes. Although this analogy seems intuitive, we are not aware of a corresponding Gibbs sampler, and it remains a question for future study. Therefore, unfortunately, we are unable to handle the term $f$ directly and instead propose a heuristic replacement: stripping away the popularity of topics across groups and considering only group-specific topic popularity (the groups still remain dependent through the atom locations).

##### Streaming Dynamic Distributed Matching (SDDM)

We combine our results to perform approximate inference for the model of Section 3.3. Using the Hungarian algorithm, iterating over groups at time $t$, we obtain estimates for the assignment matrices based on the following cost matrix:

where $m_{ji}^{(t-1)} + 1$ denotes the popularity of topic $i$ in group $j$ up to time $t$ (the plus one indicates that a global topic exists even when $m_{ji}^{(t-1)} = 0$) and $J_t$ is the number of groups observed at time $t$ – heuristic terms replacing the Indian Buffet Franchise term. Then update the global topic estimates:

(14)
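A minimal sketch of such a streaming update for a single global topic, assuming (as in the vMF MAP estimates) that directions combine additively and are re-normalized; the weights and names below are illustrative, not the paper's exact update:

```python
import numpy as np

def streaming_topic_update(theta_prev, new_estimates, tau0=1.0, tau1=1.0):
    """Combine the previous direction of a global topic with newly matched
    noisy estimates and re-normalize onto the unit sphere."""
    combined = tau0 * theta_prev + tau1 * np.sum(new_estimates, axis=0)
    return combined / np.linalg.norm(combined)

theta = np.array([1.0, 0.0])               # previous topic direction
ests = np.array([[0.8, 0.6], [1.0, 0.0]])  # matched unit-norm estimates
theta_new = streaming_topic_update(theta, ests)
```

Because only the running direction and counts are retained, the update requires constant memory per topic, which is what makes the streaming setting feasible.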

## 4 Experiments

The goals of the experiments are to demonstrate the learning of latent temporal dynamics and topic discovery, the ability to perform learning in a distributed and streaming fashion, and the scalability of our approaches. We analyze two datasets: Early Journal Content (EJC), available from JSTOR (http://www.jstor.org/dfr/about/sample-datasets), and a collection of Wikipedia articles partitioned by categories and in time according to their popularity.

##### Early Journal Content

The Early Journal Content dataset spans the years 1665 to 1922. Years before 1882 contain very few articles, so we aggregated them into a single timepoint. After preprocessing, the dataset has 400k scientific articles from over 400 unique journals. The vocabulary was truncated to 4516 words. We set all articles from the last available year (1922) aside for testing purposes. We compare three variations of our model with CoSAC (Yurochkin et al., 2017) and parametric models such as Streaming Variational Bayes (SVB) (Broderick et al., 2013) and Dynamic Topic Models (DTM) (Blei & Lafferty, 2006) trained with 100 topics. Perplexity scores on the held-out data, training times, computing resources, and numbers of topics are reported in Table 1. SDM achieves the best perplexity, while SDDM outperforms DM, which suggests that modeling time is beneficial. For this dataset, modeling groups negatively affects perplexity, which may be due to the majority of the groups having very few articles (i.e., fewer than 100), a setup challenging for many topic modeling algorithms. We report details about parameter settings in the Supplement. Next we present a case study of a topic based on the SDM results.

##### Case study: epidemics.

The beginning of the 20th century is known for a vast history of disease epidemics of various kinds, such as smallpox, typhoid, yellow fever, and scarlet fever, to name a few. Vaccines or effective treatments against the majority of them were developed shortly after. One of the journals represented in the EJC dataset is the "Public Health Report"; however, publications from it are only available starting in 1896. One of the primary objectives of the journal was to reflect epidemic disease infections. As one of the goals of our modeling approach is topic discovery, it is interesting to see whether the model can discover an epidemics-related topic around 1896.

Figure 1(b) shows that SDM correctly discovered a new topic in 1896 semantically related to epidemics. We plot the evolution of the probabilities of the top 15 words in this topic across time. We observe that the word "typhoid" increases in probability towards 1910 in the "epidemics" topic, which aligns with historical events such as Typhoid Mary in 1907 and the chlorination of public drinking water in the US in 1908 for controlling typhoid fever. The probability of "tuberculosis" increases too, aligning with the foundation of the National Association for the Study and Prevention of Tuberculosis in 1904.

Table 1: Perplexity, training time, number of topics, and computing resources on the EJC dataset (left block) and the Wiki dataset (right block).

| Method | Perplexity | Time | Topics | Cores | Perplexity | Time | Topics | Cores |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SDM | 1181 | 24min | 125 | 1 | 1255 | 35min | 182 | 20 |
| DM | 1361 | 8min | 125 | 20 | 1260 | 14min | 182 | 20 |
| SDDM | 1262 | 3.2min | 103 | 20 | 1199 | 12min | 238 | 20 |
| DTM | 1194 | 56hours | 100 | 1 | NA | >72hours | 100 | 2 |
| SVB | 1840 | 3hours | 100 | 20 | 1219 | 29.5hours | 100 | 20 |
| CoSAC | 1191 | 51min | 132 | 1 | 1227 | 4.4hours | 173 | 4 |

##### Streaming Wiki corpus

We collected articles from Wikipedia together with their page view counts for the 12 months of 2017 and category information (e.g., Arts, History, Health). We used categories as groups and partitioned the data across time according to the page view counts. A detailed description of the dataset construction is given in the Supplement. The total number of documents is 3 million, and we reduced the vocabulary to 7359 words (similar to Hoffman et al. (2010)). In Table 1 (rightmost columns) we report training times and perplexity on held-out documents from the category Art from December 2017. The time-group partitioning considered by SDDM achieves the best perplexity score, potentially due to a more fine-grained topical representation. This example shows the scalability of our methods: SDDM took 12min to process 3 million data points, which is much faster than prior topic modeling approaches. We also considered Fast DTM (Bhadury et al., 2016), but the implementation available online did not appear efficient enough. We report a relative comparison to our results based on the best run-time reported in their work in the Supplement.

## 5 Discussion and Conclusion

In a different setting, Fox et al. (2009) utilized the Beta-Bernoulli process in time series modeling to capture switching regimes of an autoregressive process, where the corresponding Indian Buffet Process was used to select subsets of the latent states of a Hidden Markov Model. Campbell et al. (2015) utilized the Hungarian algorithm for streaming mean-field variational inference of the Dirichlet Process mixture model. The idea of using time series as the base measure in Bayesian nonparametric settings was previously explored by Nguyen (2010) using hierarchical Dirichlet and Gaussian processes. While his approach is suitable for modeling evolving discrete probability measures, it appears ill-suited for modeling evolving polytopes or richer geometric structures in general, since these require keeping track of an (unknown) number of vertices that may mix-match. Beta process based constructions are more suitable for modeling evolving geometric objects and also lead to a natural matching-based inference.

Our work suggests the naturalness of incorporating sophisticated Bayesian nonparametric techniques in the inference of rich latent geometric structures of interest in an online, dynamic, and distributed fashion. Unfortunately, data size and complex modeling are at odds: training sophisticated models on large data is extremely challenging. Our work demonstrates the feasibility of approximate nonparametric learning at scale, by utilizing suitable geometric representations and devising fast algorithms for obtaining reasonable point estimates of such representations. Further directions include incorporating more meaningful geometric features into the models (e.g., via more elaborate base measure modeling for the Beta process) and developing efficient algorithms for full Bayesian inference.

#### Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research and a Margaret and Herman Sokol Faculty Award to XN.

## References

- Ahmed & Xing (2012) Ahmed, Amr and Xing, Eric P. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. arXiv preprint arXiv:1203.3463, 2012.
- Banerjee et al. (2005) Banerjee, Arindam, Dhillon, Inderjit S, Ghosh, Joydeep, and Sra, Suvrit. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, September 2005.
- Bhadury et al. (2016) Bhadury, Arnab, Chen, Jianfei, Zhu, Jun, and Liu, Shixia. Scaling up dynamic topic models. In Proceedings of the 25th International Conference on World Wide Web, pp. 381–390. International World Wide Web Conferences Steering Committee, 2016.
- Blei & Lafferty (2006) Blei, D. M. and Lafferty, J. D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120, 2006.
- Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.
- Broderick et al. (2013) Broderick, Tamara, Boyd, Nicholas, Wibisono, Andre, Wilson, Ashia C, and Jordan, Michael I. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2013.
- Bryant & Sudderth (2012) Bryant, Michael and Sudderth, Erik B. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pp. 2699–2707, 2012.
- Campbell et al. (2015) Campbell, Trevor, Straub, Julian, Fisher III, John W, and How, Jonathan P. Streaming, distributed variational inference for bayesian nonparametrics. In Advances in Neural Information Processing Systems, pp. 280–288, 2015.
- Feldman & Sanger (2007) Feldman, Ronen and Sanger, James. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge university press, 2007.
- Fox et al. (2009) Fox, Emily, Jordan, Michael I, Sudderth, Erik B, and Willsky, Alan S. Sharing features among dynamical systems with Beta processes. In Advances in Neural Information Processing Systems, pp. 549–557, 2009.
- Ghahramani & Griffiths (2005) Ghahramani, Zoubin and Griffiths, Thomas L. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pp. 475–482, 2005.
- Griffiths & Ghahramani (2011) Griffiths, Thomas L and Ghahramani, Zoubin. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.
- Hoffman et al. (2010) Hoffman, Matthew, Bach, Francis R, and Blei, David M. Online learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pp. 856–864, 2010.
- Hong et al. (2011) Hong, Liangjie, Dom, Byron, Gurumurthy, Siva, and Tsioutsiouliklis, Kostas. A time-dependent topic model for multiple text streams. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 832–840. ACM, 2011.
- Kuhn (1955) Kuhn, Harold W. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.
- Mardia & Jupp (2009) Mardia, Kanti V and Jupp, Peter E. Directional statistics, volume 494. John Wiley & Sons, 2009.
- Newman et al. (2008) Newman, David, Smyth, Padhraic, Welling, Max, and Asuncion, Arthur U. Distributed inference for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pp. 1081–1088, 2008.
- Nguyen (2010) Nguyen, XuanLong. Inference of global clusters from locally distributed data. Bayesian Analysis, 5(4):817–845, 2010.
- Nguyen (2015) Nguyen, XuanLong. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646, 02 2015.
- Pritchard et al. (2000) Pritchard, Jonathan K, Stephens, Matthew, and Donnelly, Peter. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
- Reisinger et al. (2010) Reisinger, Joseph, Waters, Austin, Silverthorn, Bryan, and Mooney, Raymond J. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning, pp. 903–910, 2010.
- Tang et al. (2014) Tang, Jian, Meng, Zhaoshi, Nguyen, Xuanlong, Mei, Qiaozhu, and Zhang, Ming. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the 31st International Conference on Machine Learning, pp. 190–198, 2014.
- Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
- Teh et al. (2007) Teh, Yee Whye, Görür, Dilan, and Ghahramani, Zoubin. Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, pp. 556–563, 2007.
- Thibaux & Jordan (2007) Thibaux, Romain and Jordan, Michael I. Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pp. 564–571, 2007.
- Wang et al. (2011) Wang, Chong, Paisley, John, and Blei, David. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 752–760, 2011.
- Wang & McCallum (2006) Wang, Xuerui and McCallum, Andrew. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 424–433. ACM, 2006.
- Yurochkin et al. (2017) Yurochkin, M., Guha, A., and Nguyen, X. Conic Scan-and-Cover algorithms for nonparametric topic modeling. In Advances in Neural Information Processing Systems, pp. 3881–3890, 2017.
- Yurochkin & Nguyen (2016) Yurochkin, Mikhail and Nguyen, XuanLong. Geometric Dirichlet Means Algorithm for topic inference. In Advances in Neural Information Processing Systems, pp. 2505–2513, 2016.

## Appendix A Derivations of posterior probabilities

### a.1 Dynamic Beta Process posterior

The starting point for arriving at the MAP estimation algorithm for the Dynamic Beta Process proposed in Section 3.1 of the main text is the posterior derivation at a time point $t$ (Eq. (5) of the main text):

(15)

Starting with the vMF emission probabilities,

$$p(x_{kt} \mid \theta_i^{(t)}) = C_V(\tau_1)\, \exp\!\left(\tau_1 \langle x_{kt}, \theta_i^{(t)} \rangle\right), \qquad (16)$$

we obtain the last term of Eq. (15). The conditional distribution of the topics at time $t$ given those at time $t-1$, obtained when the random weights are marginalized out, can be decomposed into two parts: a parametric part, the time-$t$ incarnations of a subset of the previously observed global topics, and a nonparametric part, the number of new topics appearing at time $t$. The middle term can be seen to come from the Poisson prior on the number of new topics induced by the Indian Buffet Process (see Thibaux & Jordan (2007) for details):

$$K^{\mathrm{new}}_t \sim \mathrm{Poisson}\!\left(\frac{c}{t}\right). \qquad (17)$$

Finally, the first term of Eq. (15) is composed of the probability of a previously observed global topic appearing at time $t$:

$$p(b_{it} \mid b_{i,1:t-1}) = \left(\frac{m_i^{(t-1)}}{t}\right)^{b_{it}} \left(1 - \frac{m_i^{(t-1)}}{t}\right)^{1 - b_{it}}, \qquad (18)$$

where $m_i^{(t-1)}$ denotes the number of times topic $i$ appeared up to time $t$. Also, the base measure probability of the vMF dynamics is:

$$p(\theta_i^{(t)} \mid \theta_i^{(t-1)}) = C_V(\tau_0)\, \exp\!\left(\tau_0 \langle \theta_i^{(t)}, \theta_i^{(t-1)} \rangle\right). \qquad (19)$$

Combining Equations 16–19 we arrive at Eq. (15) (Eq. (5) of the main text).

### a.2 Posterior of the Hierarchical Beta process for multiple topic polytopes

First recall Eq. (9) of the main text:

(20)

To arrive at this result, first note that the prior on each global topic is the uniform distribution on the sphere from the model specification of Section 3.2 of the main text, and hence it contributes a constant. Next, consider the likelihood.

It remains to show that the assignment matrix follows an IBP marginally. Consider a finite approximation of the model defined in Section 3.2, i.e., working with measures with $K$ atoms. Then, for $i = 1, \ldots, K$ and $j = 1, \ldots, J$, we have the following generating process:

(21)

We first integrate with respect to the group-level weights:

(22)

where, using the fact that each $b_{ji}$ takes values 0 or 1:

and $B(\cdot, \cdot)$ denotes the Beta function. Plugging this in, we obtain:

(23)

This is the same equation as Eq. (10) obtained in Griffiths & Ghahramani (2011) for the finite feature model. Now, following Griffiths & Ghahramani (2011) and taking the limit $K \to \infty$, we arrive at the IBP as desired.
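The finite approximation is easy to simulate. The sketch below (variable names are ours) draws feature probabilities $\pi_k \sim \mathrm{Beta}(c/K, 1)$ and one Bernoulli row per draw; for large $K$ the expected number of active features per row approaches $c$, consistent with the IBP limit:

```python
import numpy as np

def finite_beta_bernoulli_counts(c, K, num_draws, rng):
    """Simulate the finite model: pi_k ~ Beta(c/K, 1), b_k ~ Bernoulli(pi_k);
    return the number of active features in each independent draw."""
    counts = np.empty(num_draws)
    for m in range(num_draws):
        pi = rng.beta(c / K, 1.0, size=K)
        counts[m] = rng.binomial(1, pi).sum()
    return counts

rng = np.random.default_rng(1)
counts = finite_beta_bernoulli_counts(c=3.0, K=1000, num_draws=300, rng=rng)
# For K = 1000 the mean count concentrates near c = 3.
```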

## Appendix B Proofs of Propositions

### b.1 Proof for Proposition 1.

###### Proposition 1.

Given the cost matrix

Consider the optimization problem $\max \sum_{k,i} C_{ki}\, a_{ki}^{(t)}$ subject to the constraints that (a) for each fixed $i$, at most one of the $a_{ki}^{(t)}$ is 1, the rest are 0, and (b) for each fixed $k$, exactly one of the $a_{ki}^{(t)}$ is 1, the rest are 0. Then the MAP for Eq. (15) can be obtained by the Hungarian algorithm, which solves for $A^{(t)}$, and the topic estimates are then obtained given $A^{(t)}$.

###### Proof.

First we express the logarithm of the posterior distribution in Eq. (15) in the form of a matching problem by splitting the terms related to previously observed topics and new topics:

(24)

with the convention that if .

Next, consider the simultaneous maximization over the assignment and the topics. For a previously observed topic $i$, if $a_{ki}^{(t)} = 1$, i.e., $x_{kt}$ is a noisy version of $\theta_i^{(t)}$, then the increment pertaining to the reward of the Hungarian objective can be computed in closed form. The von Mises-Fisher distribution is conjugate to itself, and so it admits a closed-form MAP estimate: $\theta_i^{(t)} = \frac{\tau_1 x_{kt} + \tau_0 \theta_i^{(t-1)}}{\|\tau_1 x_{kt} + \tau_0 \theta_i^{(t-1)}\|_2}$. On the other hand, if no estimate is assigned to topic $i$, the MAP estimate is $\theta_i^{(t)} = \theta_i^{(t-1)}$, and the increment in reward value is the difference between these two cases.

For a new topic $i$, it is seen easily from our representation in Eq. (24), recalling the uniform prior for the new global topics, that given $a_{ki}^{(t)} = 1$ the objective function is maximized at $\theta_i^{(t)} = x_{kt}$. ∎

### b.2 Proof for Proposition 2.

###### Proposition 2.

Define the cost matrix