# Scalable inference of topic evolution via models for latent geometric structures

###### Abstract

We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. Our model is nonparametric Bayesian, and the corresponding inference algorithm is able to discover new topics as time progresses. By exploiting the connection between the modeling of topic polytope evolution, the Beta-Bernoulli process and the Hungarian matching algorithm, our method is shown to be several orders of magnitude faster than existing topic modeling approaches, as demonstrated by experiments working with several million documents in under two dozen minutes. Code: https://github.com/moonfolk/SDDM

## 1 Introduction

The topic or population polytope is a fundamental geometric object that underlies the presence of latent topic variables in topic and admixture models (Blei et al., 2003; Pritchard et al., 2000; Tang et al., 2014). The geometry of topic models provides the theoretical basis for posterior contraction analysis of latent topics, in addition to helping to develop fast and quite accurate inference algorithms in parametric and nonparametric settings (Nguyen, 2015; Yurochkin and Nguyen, 2016; Yurochkin et al., 2017, 2019c). When data and the associated topics are indexed by a time dimension, it is of interest to study the temporal dynamics of such latent geometric structures. In this paper, we study models and algorithms for learning the temporal dynamics of the topic polytope that arises in the analysis of text corpora.

Several authors have extended the basic topic modeling framework to analyze how topics evolve over time. The Dynamic Topic Models (DTM) (Blei and Lafferty, 2006) demonstrated the importance of accounting for non-exchangeability between document groups, particularly when a time index is provided. Another approach is to keep topics fixed and consider only evolving topic popularity (Wang and McCallum, 2006). Hong et al. (2011) extended such an approach to multiple corpora. Ahmed and Xing (2012) proposed a nonparametric construction extending DTM where topics can appear or eventually die out. Although the evolution of the latent geometric structure (i.e., the topic polytope) is implicitly present in these works, it was neither explicitly addressed nor was its geometry exploited. A related limitation shared by these modeling frameworks is the lack of scalability, due to inefficient joint modeling and learning of topics at each time point and topic evolution over time. To improve scalability, a natural solution is to decouple the two phases of inference.

To this end, we seek to develop a series of topic *meta*-models, i.e. models for temporal dynamics of topic polytopes, assuming that the topic estimates from each time point have already been obtained via some efficient static topic inference technique.
The focus on inference of topic evolution offers novel opportunities and challenges. To start, what is a suitable ambient space in which the topic polytope is represented? As topics evolve, so does the number of topics that may become active or dormant, raising distinct modeling choices. Interesting issues arise in the inference, too. For instance, what is a principled way of *matching* vertices of a collection of polytopes to their next reincarnations? Such a question arises because we consider modeling of topics learned independently across timestamps and text corpora, which entails the need to preserve the topic structure’s invariance to permutations of the vertex labels.

We consider an isometric embedding of the unit sphere in the word simplex, so that the evolution of topic polytopes may be represented by a collection of (random) trajectories of points residing on the unit sphere. Instead of attempting to mix-match vertices in an ad hoc fashion, we appeal to a Bayesian nonparametric modeling framework that allows the number of topic vertices to be random and vary across time. The mix-matching between topics shall be guided by the assumption on the smoothness of the collection of global trajectories on the sphere using von Mises-Fisher dynamics (Mardia and Jupp, 2009). The selection of active topics at each time point will be enabled by a nonparametric prior on the random binary matrices via the (hierarchical) Beta-Bernoulli process (Thibaux and Jordan, 2007).

Our contribution includes a sequence of Bayesian nonparametric models in increasing levels of complexity: the simpler model describes a topic polytope evolving over time, while the full model describes the temporal dynamics of a collection of topic polytopes as they arise from multiple corpora. The semantics of topics can be summarized as follows: there is a collection of latent global topics of unknown cardinality evolving over time (e.g. topics in science or social topics in Twitter). Each year (or day) a subset of the global topics is elucidated by the community (some topics may be dormant at a given time point). The nature of each global topic may change smoothly (via varying word frequencies). Additionally, different subsets of global topics are associated with different groups (e.g. journals or Twitter location stamps), some becoming active/inactive over time.

Another key contribution is a suite of scalable approximate inference algorithms suitable for online and distributed settings. In particular, we focus mainly on MAP updates rather than full Bayesian integration. This is appropriate in an online learning setting; moreover, such updates of the latent topic polytope can be viewed as solving an optimal matching problem, for which a fast Hungarian matching algorithm can be applied. Our approach is able to perform dynamic nonparametric topic inference on 3 million documents in 20 minutes, which is significantly faster than prior static online and/or distributed topic modeling algorithms (Newman et al., 2008; Hoffman et al., 2010; Wang et al., 2011; Bryant and Sudderth, 2012; Broderick et al., 2013).

The remainder of the paper is organized as follows. In Section 2 we define a Markov process over the space of topic polytopes (simplices). In Section 3 we present a series of models for polytope dynamics and describe our algorithms for online dynamic and/or distributed inference. Section 4 demonstrates experimental results. We conclude with a discussion in Section 5.

## 2 Temporal dynamics of a topic polytope

The fundamental object of inference in this work is the topic polytope arising in topic modeling which we shall now define (Blei et al., 2003; Nguyen, 2015). Given a vocabulary of words, a topic is defined as a probability distribution on the vocabulary. Thus a topic is taken to be a point in the vocabulary simplex, namely, , and a topic polytope for a corpus of documents is defined as a convex hull of topics associated with the documents. Geometrically, the topics correspond to the vertices (extreme points) of the (latent) topic polytope to be inferred from data.

To make inferences about the temporal dynamics of a topic polytope, one might consider the evolution of each topic variable, say , which represents a vertex of the polytope at time . A standard approach is due to Blei and Lafferty (2006), who proposed to use a Gaussian Markov chain in for modeling temporal dynamics and a logistic normal transformation , which sends elements of into . In our meta-modeling approach, we consider topics, i.e. points in , learned independently across time and corpora. The logistic normal map is many-to-one, hence it is undesirably ambiguous in mapping a collection of topic polytopes to .
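To see concretely why a logistic (softmax) map is many-to-one, the following minimal Python sketch applies it to a vector and to a shifted copy of that vector. It is our own illustration rather than part of the method; the function name `softmax` and the example values are assumptions.

```python
import math

def softmax(x):
    """Logistic (softmax) map from real vectors to the probability simplex."""
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -1.0, 2.0]
shifted = [xi + 3.7 for xi in x]  # a different point in Euclidean space

p = softmax(x)
q = softmax(shifted)

# Both inputs land on the same point of the simplex.
same_image = all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(p, q))
```

Adding any constant to all coordinates leaves the image unchanged, so a topic on the simplex has infinitely many preimages and cannot be mapped back unambiguously.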

We propose to represent each topic variable as a point in a unit sphere , which possesses a natural isometric embedding (i.e. one-to-one) in the vocabulary simplex , so that the temporal dynamics of a topic variable can be identified as a (random) trajectory on . This trajectory shall be modeled as a Markovian process on : . The von Mises-Fisher (vMF) distribution is commonly used in the field of directional statistics (Mardia and Jupp, 2009) to model points on a unit sphere and was previously utilized for text modeling (Banerjee et al., 2005; Reisinger et al., 2010).
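For reference, the standard von Mises-Fisher density on the unit sphere in $\mathbb{R}^p$ is written out below. The symbols $p$, $x$, $\mu$ and $\kappa$ are generic notation chosen for this note and may differ from the notation used elsewhere in the paper; $I_\nu$ denotes the modified Bessel function of the first kind.

```latex
f(x \mid \mu, \kappa) = C_p(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2}\, I_{p/2 - 1}(\kappa)},
\qquad
x, \mu \in \mathbb{S}^{p-1},\ \kappa > 0 .
```

The density depends on $x$ only through the inner product $\mu^{\top} x$. A Markov chain whose transition is a vMF distribution centered at the previous state therefore favors small angular moves when $\kappa$ is large, and approaches the uniform distribution on the sphere as $\kappa \to 0$.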

##### Isometric embedding of into the vocabulary simplex

We start with the directional representation of the topic polytope (Yurochkin et al., 2017): let be a collection of vertices of a topic polytope. Each vertex is represented as , where is a reference point in a convex hull of , is a topic direction and . Moreover, is determined so that the tip of direction vector resides on the boundary of . Since the effective dimensionality of is , we can now define a one-to-one and isometric map sending onto as follows: map of the vocabulary simplex where it is first translated so that becomes the origin and then rotated into , where resulting topics, say , are normalized to the unit length. Observe that this geometric map is an isometry and hence invertible. It preserves angles between vectors, therefore we can evaluate vMF density without performing the map explicitly, by simply setting . The following lemma formalizes this idea.

###### Lemma 1.

is a homeomorphism, where , and , for any .

The proof of this lemma is given in Supplement B.1. The intuition behind the construction is provided via Figure 1 which gives a geometric illustration for , with the vocabulary simplex shown as a red triangle. Two topics on the boundary (face) of the vocabulary simplex are and . The green dot is the reference point and . In Fig. 1 (left) we move by translation to the origin and rotate from to plane. In Fig. 1 (center left) we show the resulting image of and add a unit sphere (blue) in . Corresponding to topics are the points on the sphere with . Now, apply the inverse translation and rotation to *both* and ; the result is shown in Fig. 1 (center right): we are back to and , where . In Fig. 1 (right) we give a geometric illustration of the temporal dynamics.
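The centering and normalization described above can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy is available; `to_sphere`, the toy topics and the choice of the topic mean as reference point are hypothetical. The explicit rotation is skipped, since only inner products between the normalized directions are needed to evaluate a vMF density.

```python
import numpy as np

def to_sphere(topics, reference):
    """Center topics at the reference point and scale them to unit length.

    topics: (K, V) array whose rows are points in the vocabulary simplex.
    reference: (V,) array, a point inside the convex hull of the topics.
    Returns (K, V) unit vectors lying in the hyperplane {x : sum(x) = 0}.
    """
    directions = topics - reference
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    return directions / norms

# Toy example with V = 3 words and K = 3 topics (values are made up).
topics = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.7, 0.1]])
reference = topics.mean(axis=0)  # hypothetical reference point

theta = to_sphere(topics, reference)
cosine = float(theta[0] @ theta[1])  # the only quantity a vMF density needs
```

Because topics and the reference point all sum to one, the centered directions sum to zero, which is why the resulting unit vectors live on a sphere of one dimension less than the simplex.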

As described above, each topic evolves in a random trajectory residing in a unit sphere, so the evolution of a collection of topics can be modeled by a collection of corresponding trajectories on the sphere. Note that the number of "active" topics may be unknown and vary over time. Moreover, a topic may be activated, become dormant, and then resurface after some time. New modeling elements are introduced in the next section to account for these phenomena.

## 3 Hierarchical Bayesian modeling for single or multiple topic polytopes

We shall present a sequence of models with increasing levels of complexity. We start by introducing a hierarchical model for online learning of the temporal dynamics of a single topic polytope, allowing for a varying number of vertices over time. Next, we describe a static model for *multiple* topic polytopes learned on different corpora drawing on a common pool of global topics. Finally, we present a "full" model for the evolution of global topic trajectories over time and across groups of corpora.

### 3.1 Dynamic model for single topic polytope

At a high level, our model maintains a collection of global trajectories taking values on a unit sphere. Each trajectory shall be endowed with the von Mises-Fisher dynamics described in the previous section. At each time point, a random topic polytope is constructed by selecting a (random) subset of points on the trajectory evaluated at time . The random selection is guided by a Beta-Bernoulli process prior (Thibaux and Jordan, 2007). This construction is motivated by a modeling technique of Nguyen (2010), who studied a Bayesian hierarchical model for inference of smooth trajectories on a Euclidean domain using Dirichlet process priors. Our generative model, using the Beta-Bernoulli process as a building block, is more appropriate for the purpose of topic discovery. Due to the isometric embedding of in described in the previous section, from here on we shall refer to topics as points on .

First, generate a collection of global topic trajectories using a Beta Process prior (cf. Thibaux and Jordan (2007), who write BP, ; we set , and write BP) with a base measure on the space of trajectories on and mass parameter :

(1) |

It follows that , where follows a stick-breaking construction (Teh et al., 2007): , and each is a sequence of random elements on the unit sphere , which are generated as follows:

(2) |

At any given time , the process induces a marginal measure , whose support is given by the atoms of as they are evaluated at time . Now, select a subset of the global topics that are active at via the Bernoulli process Then . are supported by atoms representing topics active at time . Finally, assume that noisy measurements of each of these topic variables are generated via:

(3) |

Noisy estimates for the topics at any particular time point may come from either the global topics observed until the previous time point or a topic yet unexplored. We emphasize that topics for are the quantities we aim to model, hence we refer to our approach as the *meta*-model. These topics may be learned, for each time point independently, by any stationary topic modeling algorithm, and then transformed to the sphere by applying Lemma 1.
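The selection mechanism of this generative process can be illustrated with a short simulation. This is a minimal sketch assuming NumPy and a finite truncation of the stick-breaking representation of Teh et al. (2007), in which each weight is a running product of Beta draws. The names and values of `gamma0`, `num_atoms` and `num_steps` are hypothetical, and the vMF trajectories and emissions are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma0 = 2.0      # mass parameter (assumed value)
num_atoms = 50    # truncation level of the stick-breaking construction
num_steps = 5     # number of time points

# Stick-breaking weights: q_i is a running product of Beta(gamma0, 1) draws,
# so the weights are non-increasing and lie in [0, 1].
sticks = rng.beta(gamma0, 1.0, size=num_atoms)
q = np.cumprod(sticks)

# Bernoulli process: at each time point, topic i is active with probability q_i.
active = rng.random((num_steps, num_atoms)) < q

topics_per_step = active.sum(axis=1)  # number of active topics at each time
popularity = active.sum(axis=0)       # how often each topic has been active
```

Topics with large weights are selected at most time points, while topics deep in the sequence are rarely active, which is what allows a topic to be dormant for a while and then resurface.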

Let denote the binary matrix representing the assignment of observed topic estimates to global topics at time point , i.e., if the vector is a noisy estimate for . In words, these random variables “link up” the noisy estimates at any time point to the global topics observed thus far. By conditional independence, the joint posterior of the hidden given observed noisy is:

At ,

(4) |

The equation above represents a product of four quantities: (1) probability of s, where denotes the number of occurrences of topic up to time point (cf. popularity of a dish in the Indian Buffet Process (IBP) metaphor (Ghahramani and Griffiths, 2005)), (2) vMF conditional of given (cf. Eq. (2)), (3) number of new global topics at time , , and (4) emission probability (cf. Eq. (3)). Derivation details are given in Supplement A.1.

##### Streaming Dynamic Matching (SDM)

To perform MAP estimation in the streaming setting, we highlight the connection of the maximization of the posterior (4) to the objective of an optimal *matching* problem: given a cost matrix, workers should be assigned to tasks, at most one worker per task and one task per worker. The solution of this problem is obtained by employing the well-known Hungarian algorithm (Kuhn, 1955). In the context of dynamic topic modeling, our goal is to match topics learned on the new timestamp to the trajectories of topics learned over the previous timestamps, where the cost is governed by our model. This connection is formalized by the following.

###### Proposition 1.

Given the cost consider the optimization problem subject to the constraints that (a) for each fixed , at most one of is and the rest are , and (b) for each fixed , exactly one of is and the rest are . Then, the MAP estimate for Eq. (4) can be obtained by the Hungarian algorithm, which solves for to obtain as

(5) |

We defer the proof to Supplement B.2. To complete the description of the inference, we discuss how noisy estimates are obtained from the bag-of-words representation of the documents observed at time point . We choose to use the CoSAC algorithm (Yurochkin et al., 2017) to obtain topics from , the collection of documents at time point . CoSAC is a stationary topic modeling algorithm which can infer the number of topics from the data and is computationally efficient for moderately sized corpora. We note that other topic modeling algorithms, e.g., variational inference (Blei et al., 2003) or Gibbs sampling (Griffiths and Steyvers, 2004; Teh et al., 2006), can be used in place of CoSAC. Estimated topics are then transformed to using Lemma 1 and reference point , where is the number of words in the corresponding document. Our reference point is simply an average (computed dynamically) of the normalized documents observed thus far. Finally, we update the MAP estimates of the global topic dynamics based on Proposition 1. Streaming Dynamic Matching (SDM) is summarized in Algorithm 1.
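One streaming update of this kind can be sketched as follows. This is a minimal illustration assuming NumPy and SciPy; it uses `scipy.optimize.linear_sum_assignment` as the assignment solver. The cost is a simplified stand-in (negative scaled cosine similarity, plus a constant `new_topic_cost` for opening a new trajectory) and not the cost of Proposition 1, which also involves topic popularity. The names `sdm_step`, `tau0`, `tau1` and `new_topic_cost` are hypothetical. The update of a matched trajectory relies on the fact that a product of two vMF terms in the same unknown direction is maximized by the normalized weighted sum of their mean vectors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sdm_step(global_topics, estimates, tau0=1.0, tau1=10.0, new_topic_cost=-5.0):
    """Match new unit-norm topic estimates to existing unit-norm global topics.

    global_topics: (L, D) array; estimates: (K, D) array.
    Returns the updated global topics and, for each estimate, its topic index.
    """
    L, K = len(global_topics), len(estimates)
    # Stand-in cost of matching estimate k to existing topic i.
    match_cost = -tau1 * estimates @ global_topics.T           # shape (K, L)
    # One private "new topic" column per estimate; other entries are forbidden.
    new_cost = np.full((K, K), 1e9)
    np.fill_diagonal(new_cost, new_topic_cost)
    cost = np.hstack([match_cost, new_cost])                   # shape (K, L + K)
    rows, cols = linear_sum_assignment(cost)

    updated = [g.copy() for g in global_topics]
    assignment = np.empty(K, dtype=int)
    for k, c in zip(rows, cols):
        if c < L:   # matched: MAP direction is the normalized weighted sum
            mean = tau0 * global_topics[c] + tau1 * estimates[k]
            updated[c] = mean / np.linalg.norm(mean)
            assignment[k] = c
        else:       # unmatched: the estimate starts a new trajectory
            assignment[k] = len(updated)
            updated.append(estimates[k].copy())
    return np.array(updated), assignment

# Toy usage: one estimate is close to the first global topic, one is novel.
global_topics = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
estimates = np.array([[0.0, 0.0, 1.0], [0.98, 0.199, 0.0]])
estimates /= np.linalg.norm(estimates, axis=1, keepdims=True)
updated, assignment = sdm_step(global_topics, estimates)
```

Each estimate receives exactly one column and each global topic is used at most once, mirroring constraints (a) and (b); the extra diagonal columns are what let the number of global topics grow over time.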

##### Additional related literature

Several works utilize similar technical building blocks in different contexts. Fox et al. (2009) utilized the Beta-Bernoulli process in time series modeling to capture switching regimes of an autoregressive process, where the corresponding Indian Buffet Process was used to select subsets of the latent states of the Hidden Markov Model. Williamson et al. (2010) used the Indian Buffet Process in topic models to sparsify document topic proportions. Campbell et al. (2015) utilized the Hungarian algorithm for streaming mean-field variational inference of the Dirichlet Process mixture model.

### 3.2 Beta-Bernoulli Process for multiple topic polytopes

We now consider meta-modeling in the presence of multiple corpora, each of which maintains its own topic polytope. Large text corpora often can be partitioned based on some grouping criteria, e.g. scientific papers by journals, news by different media agencies or tweets by location stamps. In this subsection we model the collection of topic polytopes observed at a single time point by employing the Beta-Bernoulli Process prior (Thibaux and Jordan, 2007). The modeling of a collection of polytopes evolving over time will be described in the following subsection.

First, generate global topic measure as in Eq. (1). Here, we are interested only in a single time point, the base measure is simply a , the uniform distribution over . Next, for each group , select a subset of the global topics:

(6) |

Notice that each group selects only a subset from the collection of global topics, which is consistent with the idea of partitioning by journals: some topics of ICML are not represented in SIGGRAPH and vice versa. The next step is analogous to Eq. (3):

(7) |

We again use to denote the binary matrix representing the assignment of global topics to the noisy topic estimates, i.e., if the topic estimate for group arises as a noisy estimate of global topic .
However, the *matching* problem is now different from before: we do not have any information about the global topics as there is no history; instead, we should match a *collection* of topic polytopes to a global topic polytope.
The matrix of topic assignments is distributed a priori by an Indian Buffet Process (IBP) with parameter . The conditional probability for global topics and assignment matrix given topic estimates has the following form:

(8) |

and IBP is the prior (see Eq. (15) in (Griffiths and Ghahramani, 2011)) with denoting the popularity of global topic .
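As a reminder of what the IBP prior encodes, the following sketch simulates its sequential "buffet" construction: customer j takes each previously sampled dish with probability equal to the dish's popularity divided by j, and then tries a Poisson(alpha / j) number of new dishes. The code assumes NumPy; `sample_ibp`, `alpha` and `num_groups` are hypothetical names, with customers playing the role of groups and dishes the role of global topics.

```python
import numpy as np

def sample_ibp(num_groups, alpha, rng):
    """Sample a binary group-by-topic matrix from the Indian Buffet Process."""
    columns = []  # each column holds 0/1 entries, one per group seen so far
    for j in range(1, num_groups + 1):
        # Existing topics are chosen in proportion to their popularity.
        for col in columns:
            popularity = sum(col)
            col.append(int(rng.random() < popularity / j))
        # New topics: a Poisson(alpha / j) number, used only by group j so far.
        for _ in range(rng.poisson(alpha / j)):
            columns.append([0] * (j - 1) + [1])
    if not columns:
        return np.zeros((num_groups, 0), dtype=int)
    return np.array(columns, dtype=int).T

rng = np.random.default_rng(1)
Z = sample_ibp(num_groups=6, alpha=3.0, rng=rng)
column_popularity = Z.sum(axis=0)  # how many groups use each global topic
```

Popular topics are increasingly likely to be picked by later groups, while the shrinking Poisson rate keeps the number of brand-new topics growing slowly with the number of groups.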

##### Distributed Matching (DM)

Similar to Section 3.1, we look for point estimates for the topic directions and for the topic assignment matrix . Direct computation of the global MAP estimate for Eq. (8) is not straightforward. The problem of matching across groups and topics is not amenable to a single closed-form application of the Hungarian algorithm. However, we show that for a fixed group the assignment optimization reduces to an instance of the Hungarian algorithm. This motivates applying the Hungarian algorithm iteratively, which guarantees convergence to a local optimum.
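The iterative scheme can be pictured as a coordinate-wise procedure: hold the assignments of all other groups fixed, re-estimate the global topics from them, and re-match the current group. The code below is a minimal sketch assuming NumPy and SciPy. Its cost is a simplified stand-in (negative cosine similarity with a fixed `new_topic_cost`), not the cost of Proposition 2, it omits the popularity terms, and all function and parameter names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalize(x):
    """Scale rows to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pooled_topics(groups, labels, skip=None):
    """Global topics implied by the current labels, ignoring group `skip`."""
    sums = {}
    for j, (est, lab) in enumerate(zip(groups, labels)):
        if j == skip or lab is None:
            continue
        for vec, topic_id in zip(est, lab):
            sums[topic_id] = sums.get(topic_id, 0.0) + vec
    ids = sorted(sums)
    if not ids:
        return ids, np.empty((0, groups[0].shape[1]))
    return ids, normalize(np.array([sums[t] for t in ids]))

def distributed_matching(groups, new_topic_cost=-0.5, num_sweeps=3):
    """groups: list of (K_j, D) arrays of unit-norm local topic estimates."""
    labels = [None] * len(groups)
    next_id = 0  # topic ids are arbitrary and need not stay contiguous
    for _ in range(num_sweeps):
        for j, est in enumerate(groups):
            ids, topics = pooled_topics(groups, labels, skip=j)
            K = len(est)
            # Stand-in cost: negative cosine similarity to existing topics,
            # plus one private "new topic" column per local estimate.
            new_cost = np.full((K, K), 1e9)
            np.fill_diagonal(new_cost, new_topic_cost)
            cost = np.hstack([-(est @ topics.T), new_cost])
            _, cols = linear_sum_assignment(cost)
            new_labels = []
            for c in cols:
                if c < len(ids):
                    new_labels.append(ids[c])
                else:  # open a new global topic with a fresh id
                    new_labels.append(next_id)
                    next_id += 1
            labels[j] = new_labels
    return pooled_topics(groups, labels)[1], labels

# Toy usage: two groups share two topics, a third group has its own topic.
groups = [
    normalize(np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]])),
    normalize(np.array([[0.1, 1.0, 0.0], [1.0, 0.0, 0.1]])),
    normalize(np.array([[0.0, 0.1, 1.0]])),
]
global_topics, labels = distributed_matching(groups)
```

Each inner step is a single assignment problem for one group, which is what makes the overall procedure cheap even when many groups are present.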

###### Proposition 2.

Given the cost

where denotes groups excluding group and is the number of global topics before group (due to exchangeability of the IBP, group can always be considered last). Then, a locally optimum MAP estimate for Eq. (8) can be obtained by iteratively employing the Hungarian algorithm to solve: for each group , which maximizes , subject to constraints: (a) for each fixed and , at most one of is , rest are and (b) for each fixed and , exactly one of is , rest are . After solving for , is obtained as

### 3.3 Dynamic Hierarchical Beta Process

Our “full” model, the Dynamic Hierarchical Beta Process model (dHBP), builds on the constructions described in subsections 3.1 and 3.2 to enable the inference of temporal dynamics of collections of topic polytopes. We start by specifying the upper level Beta Process given by Eq. (1) and base measure given by Eq. (2). Next, for each group , we introduce an additional level of hierarchy to model group specific distributions over topics

(9) |

where s vary around corresponding . The distributional properties of are described in (Thibaux and Jordan, 2007).

At any given time , each group selects a subset from the common pool of global topics:

(10) |

Let be the corresponding collection of atoms – topics active at time in group . Noisy measurements of these topics are generated by:

(11) |

The conditional distribution of global topics at given the state of the global topics at is

(12) |

where is the prior term dependent on the popularity counts history from current and previous time points. Analogous to the Chinese Restaurant Franchise (Teh et al., 2006), one can think of an Indian Buffet Franchise in the case of HBP. A headquarters buffet provides some dishes each day and the local branches serve a subset of those dishes. Although this analogy seems intuitive, we are not aware of a corresponding Gibbs sampler, and it remains a question for future study. We are therefore unable to handle this prior term directly and instead propose a heuristic replacement: stripping away the popularity of topics across groups and considering only group-specific topic popularity (groups still remain dependent through the atom locations).

##### Streaming Dynamic Distributed Matching (SDDM)

We combine our results to perform approximate inference of the model in Section 3.3. Using the Hungarian algorithm and iterating over groups at time , we obtain estimates for based on the following cost

where first case is if ; denotes the popularity of topic in group up to time (plus one is used to indicate that global topic exists even when ). Then compute global topic estimates At time point , the noisy topics for each of the groups can be obtained by applying CoSAC to corresponding documents in parallel. SDDM algorithm is described in Supplement C.

## 4 Experiments

We study the ability of our models to learn the latent temporal dynamics and discover new topics that change over time. Next we show that our models scale well by utilizing temporal and group inherent data structures. We also study hyperparameter choices. We analyze two datasets: the Early Journal Content (http://www.jstor.org/dfr/about/sample-datasets), and a collection of Wikipedia articles partitioned by categories and in time according to their popularity.

### 4.1 Temporal Dynamics and Topic Discovery

##### Early Journal Content.

The Early Journal Content dataset spans years from up to . Years before contain very few articles, and we aggregated them into a single timepoint. After preprocessing, the dataset has scientific articles from over unique journals. The vocabulary was truncated to words. We set all articles from the last available year () aside for testing purposes.

##### Case study: epidemics.

The beginning of the th century is known to have a vast history of disease epidemics of various kinds, such as smallpox, typhoid and yellow fever, to name a few. Vaccines or effective treatments for the majority of them were developed shortly after. One of the journals represented in the EJC dataset is the "Public Health Report"; however, publications from it are only available starting . The primary objective of the journal was to reflect epidemic disease infections. As one of the goals of our modeling approach is topic discovery, we verify that the model can discover an epidemics-related topic around . Figure 2(a) shows that SDM correctly discovered a new topic that is semantically related to epidemics. We plot the evolution of probabilities of the top words in this topic across time. We observe that the word "typhoid" increases in probability towards in the "epidemics" topic, which aligns with historical events such as Typhoid Mary in and chlorination of public drinking water in the US in for controlling typhoid fever. The probability of "tuberculosis" also increases, aligning with the foundation of the National Association for the Study and Prevention of Tuberculosis in .

##### Case study: law.

Some of the EJC journals are related to the topic of law. Our DM algorithm identified a global topic semantically similar to law by matching similar topics present in 32 out of the 417 journals. In Figure 2(b) we present the learned global topic and 4 examples of the matched local topics with the corresponding journal names. Our algorithm correctly identified that these journals have a shared law topic.

### 4.2 Scalability

##### Wiki Corpus.

We collected articles from Wikipedia and their page view counts for the months of and category information (e.g., Arts, History). We used categories as groups and partitioned the data across time according to the page view counts. Dataset construction details are given in Supplement G.2. The total number of documents is about 3 million, and we reduced the vocabulary to words similarly to Hoffman et al. (2010). For testing we set aside documents from category Art from December .

##### Modeling Grouping.

In Fig. 3 we present comparisons on Wiki data: CoSAC (Yurochkin et al., 2017) vs. DM under the static distributed setting and SDM vs. SDDM under the dynamic streaming setting. Fig. 3 (left) shows that for data accessible in groups, DM outperforms CoSAC in runtime, as DM runs CoSAC on different data groups in parallel and then matches the outputs. Matching time adds only a small overhead compared to the runtime of CoSAC. Similarly, in Fig. 3 (right), SDDM is faster than SDM, since SDDM can process documents of different groups in parallel and interleaves CoSAC with matching: while matching is being performed on data groups with timestamp , CoSAC can process the data that arrives with timestamp in parallel.
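The per-group parallelism described above can be sketched with Python's standard `concurrent.futures` module. The function `fit_local_topics` is a hypothetical placeholder for a static topic learner such as CoSAC (here it merely counts words so that the example runs), and the group names and documents are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fit_local_topics(documents):
    """Placeholder for a static topic learner; returns word counts only."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts

def fit_all_groups(docs_by_group, max_workers=4):
    """Run the local learner on every group concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {group: pool.submit(fit_local_topics, docs)
                   for group, docs in docs_by_group.items()}
        return {group: future.result() for group, future in futures.items()}

docs_by_group = {
    "Arts": ["painting museum painting", "museum gallery"],
    "History": ["empire war", "war treaty empire"],
}
local_results = fit_all_groups(docs_by_group)
# The per-group outputs would then be passed to the matching step.
```

Threads keep the sketch simple; for a CPU-bound learner, a process pool or separate machines would be the more natural choice, and the matching step only needs the small per-group topic estimates rather than the documents themselves.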

Table 1: Perplexity on held-out data, training time, number of topics and number of cores on the EJC and Wiki datasets.

| Method | Perplexity (EJC) | Time (EJC) | Topics (EJC) | Cores (EJC) | Perplexity (Wiki) | Time (Wiki) | Topics (Wiki) | Cores (Wiki) |
|---|---|---|---|---|---|---|---|---|
| SDM | 1179 | 22 min | 125 | 1 | 1254 | 2.4 hours | 182 | 1 |
| DM | 1361 | 5 min | 125 | 20 | 1260 | 15 min | 182 | 20 |
| SDDM | 1241 | 2.3 min | 103 | 20 | 1201 | 20 min | 238 | 20 |
| DTM | 1194 | 56 hours | 100 | 1 | NA | >72 hours | 100 | 1 |
| SVB | 1840 | 3 hours | 100 | 20 | 1219 | 29.5 hours | 100 | 20 |
| CoSAC | 1191 | 51 min | 132 | 1 | 1227 | 4.4 hours | 173 | 1 |

##### Modeling temporality

Modeling temporality also benefits scalability. We compare our methods with other topic models on both Wiki and EJC datasets: Streaming Variational Bayes (SVB) (Broderick et al., 2013) and Dynamic Topic Models (DTM) (Blei and Lafferty, 2006), both trained with 100 topics. Perplexity scores on the held-out data, training times, computing resources and number of topics are reported in Table 1. On the Wiki dataset, SDDM took only 20 min to process approximately 3 million documents, which is much faster than the other approaches.

Regarding perplexity scores, SDDM generally outperforms DM, which suggests that modeling time is beneficial. For the EJC dataset, SDM outperforms SDDM. Modeling groups might negatively affect perplexity because the majority of the EJC journals (groups) have very few articles (i.e. less than , a setup challenging for many topic modeling algorithms). On the Wiki corpus each category (group) has a sufficient amount of training documents, and the time-group partitioning considered by SDDM achieves the best perplexity score.
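For readers unfamiliar with the metric, held-out perplexity is commonly computed as the exponential of the negative average per-word log-likelihood, so lower values are better. The sketch below assumes this standard definition; the exact evaluation protocol behind Table 1 is not restated here, and `perplexity` and the toy numbers are hypothetical.

```python
import math

def perplexity(word_log_probs):
    """exp(-mean log-probability) over all held-out word tokens."""
    total = sum(word_log_probs)
    return math.exp(-total / len(word_log_probs))

# Toy check: if every token has probability 1/1000, perplexity is 1000.
log_probs = [math.log(1 / 1000)] * 50
value = perplexity(log_probs)
```

Under this reading, a perplexity near 1200 means the model is on average about as uncertain as a uniform choice among roughly 1200 words for each held-out token.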

### 4.3 Parameter choices

The rate of topic dynamics of the SDM and SDDM is effectively controlled by , where smaller values imply higher dynamics rate. Parameter controls variance of local topics around corresponding global topics in all of our models. This variance dictates how likely a local topic is to be matched to an existing global topic. When this variance is small, the model will tend to identify local topics as new global topics more often. Lastly, affects the probability of new topic discovery, which scales with time and number of groups. In the preceding experiments we set for SDM; for DM; for SDDM. In Figure 4 we show heatmaps for perplexity and number of learned topics, fixing and varying and . We see that for large , SDM identifies more topics to fit the smaller variability constraint imposed by the parameter.

## 5 Discussion and Conclusion

Our work suggests the naturalness of incorporating sophisticated Bayesian nonparametric techniques in the inference of rich latent geometric structures of interest. We demonstrated the feasibility of *approximate* nonparametric learning at scale, by utilizing suitable geometric representations and devising fast algorithms for obtaining reasonable point estimates for such representations. Further directions include incorporating more meaningful geometric features into the models (e.g., via more elaborate base measure modeling for the Beta Process) and developing efficient algorithms for full Bayesian inference. For instance, the latent geometric structure of the problem is solely encoded in the base measure. We want to explore choices of base measures for other geometric structures such as collections of k-means centroids, principal components, etc. Once an appropriate base measure is constructed, our Beta process based models can be utilized to enable a new class of Bayesian nonparametric models amenable to scalable inference and suitable for analysis of large datasets. In our concurrent work we have utilized a model construction similar to the one from Section 3.2 to perform Federated Learning of neural networks trained on heterogeneous data (Yurochkin et al., 2019b) and proposed a general framework for model fusion (Yurochkin et al., 2019a).

#### Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research and a Margaret and Herman Sokol Faculty Award to XN.

## References

- Ahmed and Xing (2012) Ahmed, A. and Xing, E. P. (2012). Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. arXiv preprint arXiv:1203.3463.
- Banerjee et al. (2005) Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6, 1345–1382.
- Bhadury et al. (2016) Bhadury, A., Chen, J., Zhu, J., and Liu, S. (2016). Scaling up dynamic topic models. In Proceedings of the 25th International Conference on World Wide Web, pages 381–390. International World Wide Web Conferences Steering Committee.
- Blei and Lafferty (2006) Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120.
- Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
- Broderick et al. (2013) Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735.
- Bryant and Sudderth (2012) Bryant, M. and Sudderth, E. B. (2012). Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 2699–2707.
- Campbell et al. (2015) Campbell, T., Straub, J., Fisher III, J. W., and How, J. P. (2015). Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, pages 280–288.
- Fox et al. (2009) Fox, E., Jordan, M. I., Sudderth, E. B., and Willsky, A. S. (2009). Sharing features among dynamical systems with Beta processes. In Advances in Neural Information Processing Systems, pages 549–557.
- Ghahramani and Griffiths (2005) Ghahramani, Z. and Griffiths, T. L. (2005). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
- Griffiths and Ghahramani (2011) Griffiths, T. L. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12, 1185–1224.
- Griffiths and Steyvers (2004) Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. PNAS, 101(suppl. 1), 5228–5235.
- Hoffman et al. (2010) Hoffman, M., Bach, F. R., and Blei, D. M. (2010). Online learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 856–864.
- Hong et al. (2011) Hong, L., Dom, B., Gurumurthy, S., and Tsioutsiouliklis, K. (2011). A time-dependent topic model for multiple text streams. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 832–840. ACM.
- Kuhn (1955) Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2), 83–97.
- Mardia and Jupp (2009) Mardia, K. V. and Jupp, P. E. (2009). Directional statistics, volume 494. John Wiley & Sons.
- Newman et al. (2008) Newman, D., Smyth, P., Welling, M., and Asuncion, A. U. (2008). Distributed inference for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 1081–1088.
- Nguyen (2010) Nguyen, X. (2010). Inference of global clusters from locally distributed data. Bayesian Analysis, 5(4), 817–845.
- Nguyen (2015) Nguyen, X. (2015). Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1), 618–646.
- Pritchard et al. (2000) Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959.
- Reisinger et al. (2010) Reisinger, J., Waters, A., Silverthorn, B., and Mooney, R. J. (2010). Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning, pages 903–910.
- Tang et al. (2014) Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. (2014). Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the 31st International Conference on Machine Learning, pages 190–198.
- Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
- Teh et al. (2007) Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, pages 556–563.
- Thibaux and Jordan (2007) Thibaux, R. and Jordan, M. I. (2007). Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pages 564–571.
- Wang et al. (2011) Wang, C., Paisley, J., and Blei, D. (2011). Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 752–760.
- Wang and McCallum (2006) Wang, X. and McCallum, A. (2006). Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 424–433. ACM.
- Williamson et al. (2010) Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. (2010). The IBP compound Dirichlet process and its application to focused topic modeling. In Proceedings of the 27th International Conference on Machine Learning, pages 1151–1158.
- Yurochkin and Nguyen (2016) Yurochkin, M. and Nguyen, X. (2016). Geometric Dirichlet Means Algorithm for topic inference. In Advances in Neural Information Processing Systems, pages 2505–2513.
- Yurochkin et al. (2017) Yurochkin, M., Guha, A., and Nguyen, X. (2017). Conic Scan-and-Cover algorithms for nonparametric topic modeling. In Advances in Neural Information Processing Systems, pages 3881–3890.
- Yurochkin et al. (2019a) Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., and Hoang, N. (2019a). Statistical model aggregation via parameter matching. In Advances in Neural Information Processing Systems.
- Yurochkin et al. (2019b) Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. (2019b). Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252–7261.
- Yurochkin et al. (2019c) Yurochkin, M., Guha, A., Sun, Y., and Nguyen, X. (2019c). Dirichlet simplex nest and geometric inference. In International Conference on Machine Learning, pages 7262–7271.

## Appendix A Derivations of posterior probabilities

### A.1 Dynamic Beta Process posterior

The starting point for deriving the MAP estimation algorithm for the Dynamic Beta Process proposed in Section 3.1 of the main text is the posterior at a given time point (Eq. (4) of the main text):

(13) |

Starting with the vMF emission probabilities,

(14) |

we obtain the last term of Eq. (13). The conditional distribution of given , obtained when random variables are marginalized out, can be decomposed into two parts: parametric part – time -th incarnations of subset of previously observed global topics and nonparametric part – number of new topics appearing at time . The middle term can be seen to come from the Poisson prior on the number of new topics induced by the Indian Buffet Process (see Thibaux and Jordan (2007) for details):

(15) |

Finally, the first term of Eq. (13) is composed of a probability of previously observed global topic to appear at time :

(16) |

where denotes the number of times topic appeared up to time . Also, the base measure probability of the vMF dynamics is:

(17) |

Combining Eqs. (14)–(17), we arrive at Eq. (13) (Eq. (4) of the main text).
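As a reminder of the form that vMF terms such as those in Eqs. (14) and (17) take, the generic von Mises-Fisher density on the unit sphere is given below (see Mardia and Jupp, 2009; Banerjee et al., 2005). The symbols $x$, $\mu$, $\kappa$ and $p$ are generic notation assumed here for illustration and are not the notation of the main text.

$$
f(x \mid \mu, \kappa) = C_p(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)},
\qquad
x, \mu \in \mathbb{S}^{p-1},\ \kappa \ge 0,
$$

where $I_{\nu}$ denotes the modified Bessel function of the first kind of order $\nu$. Taking logarithms gives $\log f(x \mid \mu, \kappa) = \kappa\, \mu^{\top} x + \log C_p(\kappa)$, so for a fixed concentration the log-density depends on $x$ only through the inner product $\mu^{\top} x$.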

### A.2 Posterior of the Beta process for multiple topic polytopes

First recall Eq. (8) of the main text:

(18) |

To arrive at this result first note that , where is a uniform distribution on sphere from the model specification of Section 3.2 of the main text and hence is a constant. Next, the likelihood

Integrating out the latent Beta Process, it can be verified that follows an IBP marginally (Thibaux and Jordan, 2007), i.e. .
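To make the IBP marginal concrete, the following is a minimal sketch of the log-probability of a single step of the standard one-parameter Indian Buffet Process (Griffiths and Ghahramani, 2011): a previously seen feature with count m reappears at step t with probability m / t, and the number of new features is Poisson with rate gamma0 / t. The function name `ibp_step_log_prob` and all argument names are hypothetical, and the sketch illustrates the textbook IBP rather than the exact expressions of the main text.

```python
import math


def ibp_step_log_prob(counts, present, num_new, t, gamma0):
    """Log-probability of one step of the standard one-parameter IBP.

    counts[i]  : number of earlier steps (1..t-1) in which feature i appeared
    present[i] : 1 if feature i appears at step t, 0 otherwise
    num_new    : number of brand-new features introduced at step t
    t          : index of the current step (t >= 2 when counts is non-empty)
    gamma0     : mass parameter of the IBP (assumed name)
    """
    log_p = 0.0
    # Existing features: Bernoulli(m / t) for each previously seen feature.
    for m, b in zip(counts, present):
        p = m / t
        log_p += math.log(p) if b else math.log1p(-p)
    # New features: Poisson(gamma0 / t) on their number.
    rate = gamma0 / t
    log_p += num_new * math.log(rate) - rate - math.lgamma(num_new + 1)
    return log_p


if __name__ == "__main__":
    # Two known features seen 1 and 2 times; at step 3 the first reappears,
    # the second does not, and one new feature is introduced.
    print(ibp_step_log_prob([1, 2], [1, 0], 1, t=3, gamma0=1.5))
```

Popular features are more likely to reappear, while the expected number of new features shrinks as 1 / t.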

## Appendix B Proofs for Lemma 1 and Propositions

### B.1 Proof for Lemma 1.

###### Lemma 1.

is a homeomorphism, where , and , for any .

###### Proof.

Given any let . Clearly this is a continuous map. Consider the maps . We show , for . Notice that , since and for all . The boundary condition for some is also satisfied, therefore the range of the map is , when . For any as . The right inverse property is proved similarly. ∎

### B.2 Proof for Proposition 1.

###### Proposition 1.

Given the cost

consider the optimization problem subject to the constraints that (a) for each fixed , at most one of is and the rest are , and (b) for each fixed , exactly one of is and the rest are . Then, the MAP estimate for Eq. (13) can be obtained by the Hungarian algorithm, which solves for to obtain:
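The structure of the constrained problem in Proposition 1 can be illustrated with a small sketch. Each topic observed at the current time must be assigned to exactly one global topic, each existing global topic may be used at most once, and an observed topic may instead be declared new. A common way to encode this is to augment the cost matrix with one "new topic" slot per observed topic. The sketch below uses exhaustive search as a stand-in for the Hungarian algorithm (Kuhn, 1955), which would return the same optimum in polynomial time. The function name `match_topics`, its arguments, and the numeric costs are hypothetical and are not derived from Eq. (13).

```python
from itertools import permutations


def match_topics(cost_existing, cost_new):
    """Solve a small assignment problem with optional 'new topic' slots.

    cost_existing[k][i] : cost of matching observed topic k to existing
                          global topic i
    cost_new[k]         : cost of declaring observed topic k a new topic

    Returns (assignment, total_cost), where assignment[k] is the index of
    the matched existing topic, or None if topic k is declared new.
    """
    K = len(cost_existing)
    L = len(cost_existing[0]) if K else 0
    # Augment with K extra columns so every observed topic can be "new".
    cost = [list(cost_existing[k]) + [cost_new[k]] * K for k in range(K)]

    best_total, best_slots = float("inf"), ()
    # Each injective map from observed topics to slots satisfies both
    # constraints: exactly one slot per observed topic, at most one
    # observed topic per slot. Brute force is only viable for tiny K.
    for slots in permutations(range(L + K), K):
        total = sum(cost[k][slots[k]] for k in range(K))
        if total < best_total:
            best_total, best_slots = total, slots

    assignment = [s if s < L else None for s in best_slots]
    return assignment, best_total


if __name__ == "__main__":
    costs = [[1.0, 5.0], [4.0, 0.5], [6.0, 7.0]]
    print(match_topics(costs, cost_new=[3.0, 3.0, 3.0]))
```

In this toy input the first two observed topics are matched to the two existing topics and the third is declared new, since both of its matching costs exceed its new-topic cost.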

###### Proof.

First, we express the logarithm of the posterior distribution in Eq. (13) in the form of a *matching* problem by separating the terms related to previously observed topics from those related to new topics:

(19) |

Next, consider the simultaneous maximization of and . For , if , i.e., is a noisy version of , then the increment in the posterior probability is:

On the other hand, if , this increment becomes