Conic Scan-and-Cover algorithms for nonparametric topic modeling

Mikhail Yurochkin
Department of Statistics
University of Michigan
moonfolk@umich.edu

Aritra Guha
Department of Statistics
University of Michigan
aritra@umich.edu

XuanLong Nguyen
Department of Statistics
University of Michigan
xuanlong@umich.edu
Abstract

We propose new algorithms for topic modeling when the number of topics is unknown. Our approach relies on an analysis of the concentration of mass and angular geometry of the topic simplex, a convex polytope constructed by taking the convex hull of vertices representing the latent topics. Our algorithms are shown in practice to have accuracy comparable to a Gibbs sampler in terms of topic estimation, even though the sampler requires the number of topics to be given. Moreover, they are among the fastest of several state-of-the-art parametric techniques. (Code is available at https://github.com/moonfolk/Geometric-Topic-Modeling.) Statistical consistency of our estimator is established under some conditions.

 

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

A well-known challenge associated with topic modeling inference can be succinctly summed up by the statement that sampling-based approaches may be accurate but computationally very slow, e.g., Pritchard et al. (2000); Griffiths & Steyvers (2004), while variational inference approaches are faster but their estimates may be inaccurate, e.g., Blei et al. (2003); Hoffman et al. (2013). For nonparametric topic inference, i.e., when the number of topics is a priori unknown, the problem becomes more acute. The Hierarchical Dirichlet Process model (Teh et al., 2006) is an elegant Bayesian nonparametric approach which allows the number of topics to grow with the data size, but its sampling-based inference is much less efficient than that of its parametric counterpart. As pointed out by Yurochkin & Nguyen (2016), the root of the inefficiency can be traced to the need for approximating the posterior distributions of the latent variables representing the topic labels — these are not geometrically intrinsic, as any permutation of the labels yields the same likelihood.

A promising approach in addressing the aforementioned challenges is to take a convex geometric perspective, where topic learning and inference may be formulated as a convex geometric problem: the observed documents correspond to points randomly drawn from a topic polytope, a convex set whose vertices represent the topics to be inferred. This perspective has been adopted to establish posterior contraction behavior of the topic polytope in both theory and practice (Nguyen, 2015; Tang et al., 2014). A method for topic estimation that exploits convex geometry, the Geometric Dirichlet Means (GDM) algorithm, was proposed by Yurochkin & Nguyen (2016), which demonstrates attractive behaviors both in terms of running time and estimation accuracy. In this paper we shall continue to amplify this viewpoint to address nonparametric topic modeling, a setting in which the number of topics is unknown, as is the distribution inside the topic polytope (in some situations).

We will propose algorithms for topic estimation by explicitly accounting for the concentration of mass and angular geometry of the topic polytope, typically a simplex in topic modeling applications. The geometric intuition is fairly clear: each vertex of the topic simplex can be identified by a ray emanating from its center (to be defined formally), while the concentration of mass can be quantified for the cones hinging on the apex positioned at the center. Such cones can be rotated around the center to scan for high density regions inside the topic simplex — under mild conditions such cones can be constructed efficiently to recover both the number of vertices and their estimates.

We also mention another fruitful approach, which casts topic estimation as a matrix factorization problem (Deerwester et al., 1990; Xu et al., 2003; Anandkumar et al., 2012; Arora et al., 2012). A notable recent algorithm coming from the matrix factorization perspective is RecoverKL (Arora et al., 2012), which solves non-negative matrix factorization (NMF) efficiently under assumptions on the existence of so-called anchor words. RecoverKL remains a parametric technique — we will extend it to a nonparametric setting and show that the anchor-word assumption appears to limit the number of topics one can efficiently learn.

Our paper is organized as follows. In Section 2 we discuss recent developments in geometric topic modeling and introduce our approach; Sections 3 and 4 deliver the contributions outlined above; Section 5 demonstrates experimental performance; we conclude with a discussion in Section 6.

2 Geometric topic modeling

Background and related work

In this section we present the convex geometry of the Latent Dirichlet Allocation (LDA) model of Blei et al. (2003), along with related theoretical and algorithmic results that motivate our work. Let V be the vocabulary size and Δ^{V−1} be the corresponding vocabulary probability simplex. Sample K topics (i.e., distributions on words) β_k ~ Dir_V(η), k = 1, …, K, where η ∈ ℝ_+^V. Next, sample the M document-word probabilities p_m residing in the topic simplex B := Conv(β_1, …, β_K) (cf. Nguyen (2015)), by first generating their barycentric coordinates (i.e., topic proportions) θ_m ~ Dir_K(α) and then setting p_m := Σ_k β_k θ_{mk} for m = 1, …, M and α ∈ ℝ_+^K. Finally, word counts of the m-th document can be sampled as w_m ~ Mult(p_m, N_m), where N_m ∈ ℕ is the number of words in document m. The above model is equivalent to LDA when the individual word-to-topic label assignments are marginalized out.
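To make the geometry concrete, the following minimal NumPy sketch simulates the generative process just described (our illustration; the sizes and Dirichlet parameters are placeholders, not the settings used in the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N = 1000, 5, 500, 100          # vocabulary, topics, documents, words per doc

# K topics beta_k ~ Dir_V(eta): the vertices of the topic simplex B.
beta = rng.dirichlet(np.full(V, 0.1), size=K)          # (K, V)

# Topic proportions theta_m ~ Dir_K(alpha) give barycentric coordinates,
# and p_m = sum_k theta_mk * beta_k lies inside B.
theta = rng.dirichlet(np.full(K, 0.1), size=M)         # (M, K)
p = theta @ beta                                       # (M, V)

# Word counts w_m ~ Mult(N_m, p_m); normalized counts estimate p_m.
w = np.vstack([rng.multinomial(N, pm) for pm in p])    # (M, V)
w_bar = w / N
```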

Nguyen (2015) established posterior contraction rates for the topic simplex, provided that mild conditions hold and either the number of topics K is known or the topics are sufficiently separated in terms of the Euclidean distance. Yurochkin & Nguyen (2016) devised an estimate for B, taken to be a fixed unknown quantity, by formulating a geometric objective function which is minimized when the topic simplex is close to the normalized documents w̄_m := w_m / N_m. They showed that the estimation of topic proportions θ_m given B simply reduces to taking the barycentric coordinates of the projection of w̄_m onto B. To estimate B given K, they proposed the Geometric Dirichlet Means (GDM) algorithm, which operates by performing k-means clustering on the normalized documents, followed by a geometric correction of the cluster centroids. The resulting algorithm is remarkably fast and accurate, supporting the potential of the geometric approach. The GDM is not applicable when K is unknown, but it provides the motivation on which our approach is built.

The Conic Scan-and-Cover approach

To enable the inference of K when K is not known, we need to investigate the concentration of mass inside the topic simplex. It suffices to focus on two types of geometric objects: cones and spheres, which provide the basis for a complete coverage of the simplex. To gain intuition for our procedure, which we call the Conic Scan-and-Cover (CoSAC) approach, imagine someone standing at a center point of a triangular dark room trying to find all the corners with a portable flashlight that produces a cone of light. A corner of the room can be identified with the direction of the farthest visible data objects. Once a corner is found, one can turn the flashlight to another direction to scan for the next one. See Fig. 1(a), where red denotes the scanned area. To make sure that all corners are detected, the cones of light have to be open to an appropriate range of angles so that enough data objects can be captured and removed from the room. To make sure no false corners are declared, we also need a suitable stopping criterion, relying only on data points that lie beyond a certain spherical radius; see Fig. 1(b). Hence, we need to be able to gauge the concentration of mass for suitable cones and spherical balls inside the topic simplex. This is the subject of the next section.

(a) An incomplete coverage using 3 cones (containing red points).
(b) Complete coverage using 3 cones (red) and a ball (yellow).
(c) Simplicial cap and cone.
Figure 1: Complete coverage of the topic simplex by cones and a spherical ball, illustrated for K = 3.

3 Geometric estimation of the topic simplex

We start by representing B in terms of its convex and angular geometry. First, B is centered at a point denoted by C_p. The centered probability simplex is denoted by Δ̃^{V−1} := {x ∈ ℝ^V : x + C_p ∈ Δ^{V−1}}. Then, write β̃_k := β_k − C_p for k = 1, …, K and p̃_m := p_m − C_p for m = 1, …, M. Note that re-centering leaves the corresponding barycentric coordinates θ_m unchanged. Moreover, the extreme points of the centered topic simplex B̃ can now be represented by their directions v_k and corresponding radii R_k such that β̃_k = R_k v_k for any k = 1, …, K.
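In code, the centered representation amounts to a few lines (our sketch; `p` stands in for any set of points in the topic simplex):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(10), size=100)   # stand-in points p_m in a simplex (M x V)

C_p = p.mean(axis=0)                       # center estimate (the sample mean in practice)
p_tilde = p - C_p                          # centered points
radii = np.linalg.norm(p_tilde, axis=1)    # distances from the center
directions = p_tilde / radii[:, None]      # unit directions; vertices have the largest radii
```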

3.1 Coverage of the topic simplex

The first step toward formulating a CoSAC approach is to show how the topic simplex can be covered with exactly K cones and one spherical ball positioned at C_p. A cone is defined as the set S_ω(v) := {p̃ : d_cos(v, p̃) < ω}, where we employ the angular distance (a.k.a. cosine distance) d_cos(v, p̃) := 1 − cos(v, p̃), and cos(v, p̃) is the cosine of the angle formed by the vectors v and p̃.
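The cone membership test translates directly into code (our sketch; `X` is a matrix of centered points and `omega` the threshold ω):

```python
import numpy as np

def cone_members(v, X, omega):
    """Indices of rows of X inside the cone S_omega(v), i.e. with d_cos(v, x) < omega."""
    cos = (X @ v) / (np.linalg.norm(X, axis=1) * np.linalg.norm(v) + 1e-12)
    return np.where(1.0 - cos < omega)[0]
```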

The Conical coverage

It is possible to choose ω so that the topic simplex can be covered with exactly K cones, that is, B̃ ⊆ ⋃_{k=1}^{K} S_ω(v_k). Moreover, each cone contains exactly one vertex. Suppose that C_p is the incenter of the topic simplex B̃, with r being the inradius; the incenter and inradius correspond to the maximum volume sphere contained in B̃. Let b_{ik} denote the distance between the i-th and k-th vertex of B̃, with b_min ≤ b_{ik} for all i, k, and let R_min, R_max be such that R_min ≤ R_k ≤ R_max for all k. Then we can establish the following.

Proposition 1.

For the simplex B̃ and ω ∈ (ω_1, ω_2), where the bounds ω_1 and ω_2 are determined by the quantities r, b_min, R_min and R_max (see the proof in the Appendix), the cone S_ω(v_k) around any vertex direction v_k of B̃ contains exactly one vertex. Moreover, complete coverage holds: B̃ ⊆ ⋃_{k=1}^{K} S_ω(v_k).

We say there is angular separation if the angle between β̃_i and β̃_k is at least π/2 for any i ≠ k; in that case the interval (ω_1, ω_2) is nonempty. Thus, under angular separation, the range of ω that allows for full coverage is nonempty independently of K. Our result is in agreement with that of Nguyen (2015), whose result suggested that the topic simplex can be consistently estimated without knowing K, provided there is a minimum edge length b_min > 0. The notion of angular separation leads naturally to the Conic Scan-and-Cover algorithm. Before getting there, we show a series of results allowing us to further extend the range of admissible ω.

The inclusion of a spherical ball centered at C_p allows us to expand substantially the range of ω for which conical coverage continues to hold. In particular, we can reduce the lower bound ω_1 of Proposition 1, since we then only need to cover the regions near the vertices of B̃ with cones, using the following proposition. Fig. 1(b) provides an illustration.

Proposition 2.

Let B(C_p, ℛ) := {p̃ : ‖p̃‖ ≤ ℛ} be the ball of radius ℛ centered at C_p; let ω_2 be as given in Prop. 1, and let ω_3 satisfy

(1)

then we have B̃ ⊆ B(C_p, ℛ) ∪ ⋃_{k=1}^{K} S_ω(v_k) whenever ω ∈ (ω_3, ω_2).

Notice that as ℛ increases, the lower bound ω_3 decreases, so a suitably large choice of ℛ makes the admissible range for ω in Prop. 2 a substantial strengthening of that in Prop. 1. It is worth noting that the above two geometric propositions do not require any distributional properties inside the simplex.

Coverage leftovers

In practice, complete coverage may fail if ω and ℛ are chosen outside of the corresponding ranges suggested by the previous two propositions. In that case it is useful to note that the leftover regions will have very low mass. Next we quantify the mass inside a cone that does contain a vertex, which allows us to reject a cone with low mass, since such a cone is unlikely to contain a vertex.

Proposition 3.

The cone S_ω(v_k) whose axis is a topic direction v_k has mass

(2)

where Λ(v_k) is the simplicial cap of S_ω(v_k), composed of the vertex β̃_k and a base parallel to the corresponding base of B̃, cutting the adjacent edges of B̃ in a fixed ratio determined by ω.

See Fig. 1(c) for an illustration of the simplicial cap described in the proposition. Given this lower bound for the mass of a cone containing a vertex, we arrive at the following guarantee.

Proposition 4.

For λ ∈ (0, 1), let ℛ be chosen to satisfy the coverage condition of Prop. 2 and let ω be such that

(3)

where the auxiliary angle in Eq. (3) is defined as in the proof of Prop. 2. Then, as long as

(4)

the mass of every vertex-containing cone satisfies the bound P(S_ω(v_k)) ≥ λ.

3.2 CoSAC: Conic Scan-and-Cover algorithm

Having laid out the geometric foundations, we are ready to present the Conic Scan-and-Cover (CoSAC) algorithm, a scanning procedure for detecting the presence of simplicial vertices based on data drawn randomly from the simplex. The idea is simple: iteratively pick the point farthest from the center estimate, say v, then construct a cone S_ω(v) for some suitably chosen ω, and remove all the data residing in this cone. Repeat until there is no data point left.

Specifically, let 𝒜 := {1, …, M} be the index set of the initially unseen data; then set v := p̃_{m*} for m* := argmax_{m ∈ 𝒜} ‖p̃_m‖ and update 𝒜 := 𝒜 \ S_ω(v). The parameter ω needs to be sufficiently large to ensure that the farthest point is a good estimate of a true vertex and that the scan will be completed in exactly K iterations; it also needs to be not too large, so that S_ω(v) does not contain more than one vertex. The existence of such an ω is guaranteed by Prop. 1; in particular, for an equilateral B̃ the condition of Prop. 1 is satisfied on a nonempty range of ω that depends only on K.

In our setting, K is unknown. A smaller ω is a more robust choice, and accordingly the set 𝒜 will likely remain non-empty after K iterations. See the illustration of Fig. 1(a), where the blue regions correspond to 𝒜 after K = 3 iterations of the scan. As a result, we proceed by adopting a stopping criterion based on Prop. 2: the procedure is stopped as soon as every point remaining in 𝒜 has norm at most ℛ, which allows us to complete the scan in exactly K iterations (as in Fig. 1(b) for K = 3).

The CoSAC algorithm is formally presented in Algorithm 1. Its running is illustrated in Fig. 2, where we show iterations 1, 26, 29, 30 of the algorithm by plotting the norms of the centered documents in the active set 𝒜 and the cone S_ω(v) against the cosine distance to the chosen topic direction. Iteration 30 (right) satisfies the stopping criterion, and CoSAC therefore recovers the correct K. Note that this type of visual representation can be useful in practice for verifying the choices of ω and ℛ. The following theorem establishes the consistency of the CoSAC procedure.

Theorem 1.

Suppose β_1, …, β_K are the true topics, the incenter C_p is given, and θ_m ~ Dir_K(α), p_m = Σ_k β_k θ_{mk} for m = 1, …, M. Let K̂ be the estimated number of topics and β̂_1, …, β̂_{K̂} be the output of Algorithm 1 trained with ω and ℛ as in Prop. 2. Then K̂ → K and the topic estimates converge to the true topics (up to relabeling) as M → ∞.

Remark

We found the choice of ℛ as the median of {‖p̃_1‖, …, ‖p̃_M‖}, together with a moderate fixed ω, to be robust in practice and in agreement with our theoretical results. From Prop. 3 it follows that choosing ℛ as the median length corresponds to an edge-cut ratio under which roughly half of the mass lies outside the ball, a choice compatible with Eq. (3) for any equilateral topic simplex B̃.

4 Document Conic Scan-and-Cover algorithm

In the topic modeling problem, the p_m for m = 1, …, M are not given. Instead, under the bag-of-words assumption, we are given the frequencies of words in documents, w̄_m := w_m / N_m, which provide a point estimate for p_m. Clearly, if the number of documents M → ∞ and the document lengths N_m → ∞, we can use Algorithm 1 with the plug-in estimates w̄_m in place of p_m, since w̄_m → p_m. Moreover, C_p will be estimated by Ĉ_p := (1/M) Σ_m w̄_m. In practice, M and N_m are finite, and some N_m may take relatively small values. Taking the topic direction to be the farthest point in the topic simplex, i.e., v := w̃_{m*} for m* := argmax_{m ∈ 𝒜} ‖w̃_m‖, where w̃_m := w̄_m − Ĉ_p, may no longer yield a robust estimate, because the variance of this topic direction estimator can be quite high (in Proposition 5 we show that it is upper bounded by a quantity that decreases with the document length).

To obtain improved estimates, we propose a technique that we call "mean-shifting". Instead of taking the farthest point in the simplex, this technique is designed to shift the estimate of a topic to a high density region, where true topics are likely to be found. Precisely, given a (current) cone S_ω(v), we re-position the cone by centering it around the mean direction of the cone weighted by the norms of the data points inside, which is simply given by v := Σ_{m ∈ S_ω(v)} w̃_m / ‖Σ_{m ∈ S_ω(v)} w̃_m‖. This results in reduced variance of the topic direction estimate, due to the averaging over the data residing in the cone.
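A minimal sketch of the mean-shift iteration (ours; it reuses the `cone_members` helper sketched in Section 3.1, with the rows of `X` holding the centered documents):

```python
import numpy as np

def mean_shift_direction(v, X, omega, tol=1e-8, max_iter=100):
    """Re-center the cone on the norm-weighted mean direction of its members."""
    for _ in range(max_iter):
        members = cone_members(v, X, omega)
        if members.size == 0:
            return v
        # Summing centered documents sums their unit directions weighted by
        # their norms, i.e. the weighted mean direction of the cone.
        v_new = X[members].sum(axis=0)
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            return v_new
        v = v_new
    return v
```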

The mean-shifting technique may be slightly modified and taken as a local update for a subsequent optimization which cycles through the entire set of documents and iteratively updates the cones. The optimization is with respect to the following weighted spherical k-means objective:

min over v_1, …, v_K of Σ_{k=1}^{K} Σ_{m ∈ S_k} ‖w̃_m‖ d_cos(v_k, w̃_m),   (5)

where the cones S_k yield a disjoint partition of the data (this is different from the overlapping cones S_ω(v_k)). The rationale of the spherical k-means optimization is to use the full data for the estimation of topic directions, hence further reducing the variance due to short documents. The connection between objective function (5) and topic simplex estimation is given in the Supplement. Finally, we obtain the topic norms R_k along the directions v_k using maximum projection: R_k := max_m ⟨v_k, w̃_m⟩. Our entire procedure is summarized in Algorithm 2.

Remark

In Step 9 of the algorithm, a cone with very low cardinality, i.e., containing fewer than λM documents for some small constant λ, is discarded because such a region is likely an outlier region that does not actually contain a true vertex. The choice of λ is governed by the results of Prop. 4: evaluating the bound of Eq. (3) for an equilateral B̃ and large K gives a sense of the mass guaranteed inside a vertex-containing cone, and we then make a conservative choice λ = 0.001. As a result, a topic is rejected if the corresponding cone contains less than 0.1% of the data.

Finding anchor words using Conic Scan-and-Cover

Another approach to reducing the noise is to consider the problem from a different viewpoint, where Algorithm 1 proves useful. RecoverKL by Arora et al. (2012) can identify topics with diminishing error (in the number of documents M), provided that topics contain anchor words. The problem of finding anchor words geometrically reduces to identifying the rows of the word-to-word co-occurrence matrix that form a simplex containing all other rows of the same matrix (cf. Arora et al. (2012) for details). An advantage of this approach is that the noise in the word-to-word co-occurrence matrix goes to zero as M → ∞ no matter the document lengths, hence we can use Algorithm 1 with "documents" being rows of the word-to-word co-occurrence matrix to learn anchor words nonparametrically, and then run RecoverKL to obtain topic estimates. We call this procedure cscRecoverKL.
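The reduction can be sketched as follows (our simplified construction; the exact co-occurrence estimator and normalization of Arora et al. (2012) may differ in details):

```python
import numpy as np

def word_cooccurrence(w):
    """Simplified word-to-word co-occurrence estimate from a count matrix w (M x V)."""
    n = w.sum(axis=1, keepdims=True).astype(float)   # document lengths N_m
    # For multinomial counts x, E[x x^T - diag(x)] = N (N - 1) p p^T.
    Q = w.T @ w - np.diag(w.sum(axis=0))
    return Q / (n * (n - 1)).sum()

# Rows of Q, normalized to probability vectors, play the role of "documents":
# Algorithm 1 run on them returns candidate anchor words, and their number
# is the K passed on to RecoverKL.
```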

Input: document generating distributions p_1, …, p_M, angle threshold ω, norm threshold ℛ
Output: topics β_1, …, β_K
1: Ĉ_p := (1/M) Σ_m p_m {find center}; p̃_m := p_m − Ĉ_p for m = 1, …, M {center the data}
2: 𝒜 := {1, …, M} {initialize active set}; k := 1 {initialize topic count}
3: while ∃ m ∈ 𝒜 with ‖p̃_m‖ > ℛ do
4:   v_k := p̃_{m*} for m* := argmax_{m ∈ 𝒜} ‖p̃_m‖ {find topic}
5:   S_ω(v_k) := {m ∈ 𝒜 : d_cos(v_k, p̃_m) < ω} {find cone of near documents}
6:   𝒜 := 𝒜 \ S_ω(v_k) {update active set}
7:   β_k := Ĉ_p + v_k, k := k + 1 {compute topic}
8: end while
Algorithm 1 Conic Scan-and-Cover (CoSAC)
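For concreteness, a direct Python transcription of Algorithm 1 might look as follows (a sketch under the notation above, not the authors' released implementation):

```python
import numpy as np

def cosac(p, omega, R):
    """Conic Scan-and-Cover on points p (M x V) drawn from the topic simplex."""
    C = p.mean(axis=0)                        # Step 1: find center
    p_t = p - C                               # center the data
    norms = np.linalg.norm(p_t, axis=1)
    active = set(range(len(p)))               # Step 2: initialize active set
    topics = []
    while any(norms[m] > R for m in active):  # Step 3: points with large norm remain
        m_far = max(active, key=lambda m: norms[m])
        v = p_t[m_far]                        # Step 4: farthest point gives the direction
        cos = (p_t @ v) / (norms * np.linalg.norm(v) + 1e-12)
        cone = {m for m in active if 1.0 - cos[m] < omega}  # Step 5: cone of near documents
        active -= cone                        # Step 6: update active set
        topics.append(C + v)                  # Step 7: compute topic
    return np.array(topics)
```

Per the Remark above, a practical default for the norm threshold is the median of the centered norms, e.g. `R = np.median(norms)`.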

Figure 2: Iterations 1, 26, 29, 30 of Algorithm 1. Red: documents in the cone S_ω(v); blue: documents in the active set 𝒜 for the next iteration; yellow: documents with norm below ℛ.
Input: normalized documents w̄_1, …, w̄_M, angle threshold ω, norm threshold ℛ, outlier threshold λ
Output: topics β_1, …, β_k
1: Ĉ_p := (1/M) Σ_m w̄_m {find center}; w̃_m := w̄_m − Ĉ_p for m = 1, …, M {center the data}
2: 𝒜 := {m : ‖w̃_m‖ > ℛ} {initialize active set}; k := 0 {initialize topic count}
3: while 𝒜 ≠ ∅ do
4:   v := w̃_{m*} for m* := argmax_{m ∈ 𝒜} ‖w̃_m‖ {initialize direction}
5:   while v not converged do {mean-shifting}
6:     S_ω(v) := {m : d_cos(v, w̃_m) < ω} {find cone of near documents}
7:     v := Σ_{m ∈ S_ω(v)} w̃_m / ‖Σ_{m ∈ S_ω(v)} w̃_m‖ {update direction}
8:   end while
9:   𝒜 := 𝒜 \ S_ω(v) {update active set}; if |S_ω(v)| > λM then k := k + 1, v_k := v {record topic direction}
10: end while
11: v_1, …, v_k := weighted spherical k-means(v_1, …, v_k, w̃_1, …, w̃_M)
12: for i in {1, …, k} do
13:   R_i := max_m ⟨v_i, w̃_m⟩ {find topic length along direction v_i}
14:   β_i := Ĉ_p + R_i v_i {compute topic}
15: end for
Algorithm 2 CoSAC for documents
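The refinement in Step 11 can be sketched as a weighted spherical k-means loop for the objective in Eq. (5) (our implementation; `V0` holds the directions found by the scan and `X` the centered documents):

```python
import numpy as np

def weighted_spherical_kmeans(V0, X, n_iter=50):
    """Minimize sum_k sum_{m in S_k} ||x_m|| d_cos(v_k, x_m) over unit directions v_k."""
    V = V0 / np.linalg.norm(V0, axis=1, keepdims=True)
    norms = np.linalg.norm(X, axis=1) + 1e-12
    U = X / norms[:, None]                     # unit directions of the documents
    for _ in range(n_iter):
        assign = np.argmax(U @ V.T, axis=1)    # nearest direction in cosine distance
        for k in range(V.shape[0]):
            members = X[assign == k]
            if len(members):                   # optimal v_k is the normalized sum, since
                V[k] = members.sum(axis=0)     # summing x_m weights directions by ||x_m||
                V[k] /= np.linalg.norm(V[k])
    return V
```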

5 Experimental results

5.1 Simulation experiments

Figure 3: Minimum matching Euclidean distance for (a) varying corpora size, (b) varying length of documents; (c) Running times for varying corpora size; (d) Estimation of number of topics.

Figure 4: Gibbs sampler convergence analysis for (a) Minimum matching Euclidean distance for corpora sizes 1000 and 5000; (b) Perplexity for corpora sizes 1000 and 5000; (c) Perplexity for NYTimes data.

In the simulation studies we compare CoSAC (Algorithm 2) and cscRecoverKL (based on Algorithm 1), neither of which has access to the true K, against popular parametric topic modeling approaches trained with the true K: Stochastic Variational Inference (SVI), the collapsed Gibbs sampler, RecoverKL and GDM (more details in the Supplement). The comparisons are made on the basis of the minimum-matching Euclidean distance, which quantifies the distance between topic simplices (Tang et al., 2014), and of running times (a perplexity comparison is given in Fig. 7). Lastly, we demonstrate the ability of CoSAC to recover the correct number of topics for a varying true K.
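As a reference for the evaluation metric, one way to compute a minimum-matching distance between two topic sets uses an optimal one-to-one assignment (our sketch; Tang et al. (2014) give the precise definition used here):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def min_matching_distance(B_true, B_est):
    """Optimally match estimated topics to true ones; report the mean matched distance."""
    D = cdist(B_true, B_est)               # pairwise Euclidean distances between topics
    rows, cols = linear_sum_assignment(D)  # Hungarian matching; handles K_hat != K
    return D[rows, cols].mean()
```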

Estimation of the LDA topics

First we evaluate the ability of CoSAC and cscRecoverKL to estimate the topics β_1, …, β_K, with the number of topics held fixed. Fig. 3(a) shows performance for the case of fewer but longer documents (e.g., scientific articles, novels, legal documents). CoSAC demonstrates accuracy comparable to the Gibbs sampler and GDM.

Next we consider larger corpora of shorter documents (e.g., news articles, social media posts). Fig. 3(b) shows that this scenario is harder, and CoSAC matches the performance of the Gibbs sampler once the documents are sufficiently long. Indeed, across both experiments CoSAC only made mistakes in the estimated number of topics for the shortest documents, where it underestimated K by 4 topics on average, and in the next-shortest setting, where it was off by around 1 topic, which explains the earlier observation. Experiments varying other parameters of the simulation are given in the Supplement.

It is worth noting that cscRecoverKL appears to be strictly better than its predecessor RecoverKL. This suggests that our procedure for the selection of anchor words is more accurate, in addition to being nonparametric.

Running time

A notable advantage of the CoSAC algorithm is its speed. In Fig. 3(c) we see that Gibbs, SVI, GDM and CoSAC all exhibit linear complexity growth in the number of documents M, but the slopes are very different: for SVI and Gibbs the slope is proportional to the number of training iterations, which has to be large enough for convergence; for GDM it is governed by the number of k-means iterations to convergence; and the slope of the CoSAC procedure is the smallest, making it the fastest algorithm of all under consideration.

Next we compare CoSAC to the per-iteration quality of the Gibbs sampler trained with 500 iterations on corpora of sizes 1000 and 5000. Fig. 4(b) shows that the Gibbs sampler, when the true K is given, can achieve a good perplexity score as quickly as CoSAC and outperforms it as training continues, although Fig. 4(a) suggests that a much longer training time is needed for the Gibbs sampler to achieve good topic estimates and small estimation variance.

Estimating number of topics

Model selection in the LDA context is quite a challenging task and, to the best of our knowledge, there is no "go-to" procedure. One possible approach is based on refitting LDA with multiple choices of K and using the Bayes factor for model selection (Griffiths & Steyvers, 2004). Another option is to adopt the Hierarchical Dirichlet Process (HDP) model, though we should understand that it is not a procedure for estimating K in the LDA model, but rather a particular prior on the number of topics that assumes K grows with the data. A more recent suggestion is to slightly modify LDA and use Bayesian moment matching (Hsu & Poupart, 2016), but, as can be seen from Figure 2 of their paper, the estimation variance is high and the method is not very accurate (we tried it with the true K and it took over 1 hour to fit, finding 35 topics). Next we compare Bayes factor model selection versus CoSAC and cscRecoverKL for a range of true K. Fig. 3(d) shows that CoSAC consistently recovers the exact number of topics across a wide range.

We also observe that cscRecoverKL underestimates K in the higher range. This is expected, because cscRecoverKL finds the number of anchor words, not topics: the former decreases while the latter increases. Attempting to fit RecoverKL with more topics than there are anchor words may lead to deteriorating performance, and our modification addresses this limitation of the RecoverKL method.

5.2 Real data analysis

In this section we demonstrate the CoSAC algorithm for topic modeling on one of the standard bag-of-words datasets — NYTimes news articles. After standard preprocessing of the corpus and vocabulary, the Bayes factor for LDA selected the smallest model among the candidate numbers of topics, while CoSAC selected 159 topics. We think that the disagreement between the two procedures can be attributed to the misspecification of the LDA model when real data is in play, which affects the Bayes factor, while CoSAC is largely based on the geometry of the topic simplex.

The results are summarized in Table 1 — CoSAC found 159 topics in less than 20 minutes; cscRecoverKL estimated the number of anchor words in the data to be 27, leading to fewer topics. Fig. 4(c) compares the CoSAC perplexity score to the per-iteration test perplexity of the LDA (1000 iterations) and HDP (100 iterations) Gibbs samplers. Text files with the top 20 words of all topics are available on GitHub. We note that the CoSAC procedure recovered meaningful topics, contextually similar to those of LDA and HDP (e.g., elections, terrorist attacks, the Enron scandal, etc.), and also recovered more specific topics about Mike Tyson, boxing, and the case of Timothy McVeigh, which were present among the HDP topics but not the LDA ones. We conclude that CoSAC is a practical procedure for topic modeling on large-scale corpora, able to find meaningful topics in a short amount of time.

Method         K̂     Perplexity   Coherence   Time
cscRecoverKL   27     2603         -238        37 min
HDP Gibbs      --     --           --          35 hours
LDA Gibbs      80     --           --          5.3 hours
CoSAC          159    1568         -322        19 min
Table 1: Modeling topics of NYTimes articles

6 Discussion

We have analyzed the problem of estimating the topic simplex without assuming the number of vertices (i.e., topics) to be known. We showed that it is possible to cover the topic simplex using two types of geometric shapes, cones and a sphere, leading to a class of Conic Scan-and-Cover algorithms. We then proposed several geometric correction techniques to account for noisy data. Our procedure is accurate in recovering the true number of topics, while remaining practical due to its computational speed. We think that the angular geometric approach might allow for fast and elegant solutions to other clustering problems, although as of now it does not immediately offer a unifying problem-solving framework like MCMC or variational inference. An interesting direction in a geometric framework is related to building models based on geometric quantities such as distances and angles.

Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research and a Margaret and Herman Sokol Faculty Award.

Appendix A Proofs of main theorems

We start by reminding the reader of our geometric setup. First, the topic simplex B is centered at a point denoted by C_p. Let Δ̃^{V−1} := {x ∈ ℝ^V : x + C_p ∈ Δ^{V−1}} denote the centered probability simplex. Then, write β̃_k := β_k − C_p for k = 1, …, K and p̃_m := p_m − C_p for m = 1, …, M. Note that re-centering leaves the corresponding barycentric coordinates unchanged. Moreover, the extreme points of the centered topic simplex B̃ can now be represented by their directions v_k and corresponding radii R_k such that β̃_k = R_k v_k for any k.

A.1 Coverage of the topic simplex

Suppose that C_p is the incenter of the topic simplex B̃, with r being the inradius. Recall that the incenter and inradius correspond to the maximum volume sphere inside B̃. Let b_{ik} denote the distance between the i-th and k-th vertex of B̃, with b_min ≤ b_{ik} for all i, k, and let R_min, R_max be such that R_min ≤ R_k ≤ R_max for all k.

Proposition 1.

For the simplex B̃ and ω ∈ (ω_1, ω_2), where the bounds ω_1 and ω_2 are determined by the quantities r, b_min, R_min and R_max, the cone S_ω(v_k) around any vertex direction v_k of B̃ contains exactly one vertex. Moreover, complete coverage holds: B̃ ⊆ ⋃_{k=1}^{K} S_ω(v_k).

Proof.

Let ω < ω_2. Then, for any k, the cone S_ω(v_k) does not contain any vertex other than β̃_k. This can be explained as follows. Fix a pair of vertices i ≠ k and consider the angle at C_p made by the sides connecting C_p to vertex i and to vertex k. From the cosine law for triangles, this angle is determined by the radii R_i, R_k and the edge length b_{ik}. Now, for any ω below the cosine distance corresponding to this angle, the cone S_ω(v_k) does not cover any vertex other than vertex k. The quantity ω_2 is defined to satisfy this constraint over all pairs, from which we obtain the upper bound for ω. For the lower bound, consider for each vertex k the cone connecting the incenter to the facial incenters of the facets containing vertex k; these cones jointly cover B̃, and ω_1 is chosen to dominate their opening angles. From this we get the lower bound. The restriction ω_1 < ω_2 is needed to ensure that the set (ω_1, ω_2) is non-empty. ∎

Proposition 2.

Let B(C_p, ℛ) := {p̃ : ‖p̃‖ ≤ ℛ} be the ball of radius ℛ centered at C_p; let ω_2 be as given in Prop. 1, and let ω_3 satisfy

(6)

then we have B̃ ⊆ B(C_p, ℛ) ∪ ⋃_{k=1}^{K} S_ω(v_k) whenever ω ∈ (ω_3, ω_2).

Figure 5: the vertex point β̃_k; the point where the side adjacent to the vertex is cut off by the sphere; the distance R_k from the incenter to the vertex; the radius ℛ of the sphere; and the incenter C_p.
Proof.

Let γ be the angle formed by the line joining the vertex β̃_k to the incenter and the radial vector from the incenter to the point where the sphere cuts the edge connecting vertices i and k (the segment shown in Fig. 5). From the sine law for a triangle we have

(7)

Solving for the angle yields the expression for ω_3. Now, since we must choose the largest such angle over all pairs i and k, the bound follows immediately. Notice that as ℛ increases, the value of ω_3 decreases strictly. Thus the lower bound of Proposition 1 is dominated in this limiting scenario by ω_3, thereby obtaining an improvement in the bound from Proposition 1. ∎

Proposition 3.

The cone S_ω(v_k) whose axis is a topic direction v_k has mass

(8)

where Λ(v_k) is the simplicial cap of S_ω(v_k), composed of the vertex β̃_k and a base parallel to the corresponding base of B̃, cutting the adjacent edges of B̃ in a fixed ratio determined by ω.

The truncated beta probability calculations in Proposition 3 can be found in Olver et al. (2010).

Proposition 4.

For λ ∈ (0, 1), let ℛ be chosen to satisfy the coverage condition of Prop. 2 and let ω be such that

(9)

where the auxiliary angle in Eq. (9) is defined as in the proof of Prop. 2. Then, as long as

(10)

the mass of every vertex-containing cone satisfies the bound P(S_ω(v_k)) ≥ λ.

Proof.

Consider Figure 5, where the cone cuts the edge joining vertex i and vertex k in some proportion. From the sine law of a triangle,

(11)

where γ is as defined in the proof of Proposition 2. The choice of ω satisfies

(12)

which therefore proves the theorem. The relevant trigonometric function is increasing as the angle between β̃_i and β̃_k increases, as can be checked for maxima by the first derivative rule. Using the cosine law,

(13)

Minimizing this quantity over the admissible radii and edge lengths we get the result. ∎

A.2 Consistency of the Conic Scan-and-Cover algorithm

Under the LDA setup (as presented in Section 2), recall that b_{ik} is the length of the edge connecting the i-th and k-th vertex, i.e., b_{ik} = ‖β_i − β_k‖, where ‖·‖ is the Euclidean norm. Let B_ε(x) denote an ε-ball centered at x in this norm. Then the following result states that, with high probability, there exists a document in an ε-neighborhood of every vertex.

Lemma 1.

Let θ_m ~ Dir_K(α) for m = 1, …, M as before. Then for any ε > 0 and any vertex k,