# Topic Modeling via Full Dependence Mixtures

## Abstract

In this paper we introduce a new approach to topic modelling that scales to large datasets by using a compact representation of the data and by leveraging the GPU architecture. In this approach, topics are learned directly from the co-occurrence data of the corpus. In particular, we introduce a novel mixture model which we term the Full Dependence Mixture (FDM) model. FDMs model the second moment under general generative assumptions on the data. While there is previous work on topic modeling using second moments, we develop a direct stochastic optimization procedure for fitting an FDM with a single Kullback-Leibler objective. Moment methods in general have the benefit that an iteration no longer needs to scale with the size of the corpus. Our approach allows us to leverage standard optimizers and GPUs for the problem of topic modeling. In particular, we evaluate the approach on two large datasets, NeurIPS papers and a Twitter corpus, with a large number of topics, and show that the approach performs comparably to or better than the standard benchmarks.

## 1 Introduction

A topic model is a probabilistic model of the joint distribution of the data that is typically used as a dimensionality reduction technique in a variety of applications, such as text mining, information retrieval and recommender systems. In this paper we concentrate on topic models for text data. Perhaps the most widely used topic model for text is the Latent Dirichlet Allocation (LDA) model, [7]. LDA is a fully generative probabilistic model, and is typically learned through a Bayesian approach, by sampling the parameter distribution given the data, via collapsed Gibbs samplers [11], [27], variational methods, [7], [10], [14], or other related approaches. A typical learning procedure for Bayesian methods involves an iteration over the entire corpus, where a topic assignment is sampled for each token or document in the corpus. In order to apply these methods to large corpora, a variety of optimized procedures were developed, where the speed improvement is achieved either via parallelization or via a more economical per-token sampling scheme. An additional level of complexity is added by the fact that LDA has two hyperparameters, the Dirichlet priors of the token distribution in topics and of the topic distribution in documents. This may complicate the application of the methods, since the choice of hyperparameters may influence the results even for relatively large data sizes. A detailed discussion of LDA optimization is given in Section 2.

In this paper we propose an alternative approach to topic modeling, based on principles that are fundamentally different from the standard LDA optimization techniques. We show that using this approach it is possible to analyze large datasets on a single workstation with a GPU, and to obtain results that are comparable to or better than the standard benchmarks.

In the rest of this section we introduce some necessary notation, describe our model and the related loss function, discuss the optimization procedure for this loss, and overview the experimental results of the paper.

### 1.1 Method Description

We assume that the text data was generated from a pLSA (Probabilistic Latent Semantic Analysis) probability model, [15], as follows: Denote by $\mathcal{X}$ the set of distinct tokens in the corpus, and suppose that we are given $T$ topics, $\mu_t$, $t \le T$, where each topic is a probability distribution on the set of tokens $\mathcal{X}$. Then the pLSA assumption is that each document $d$ is generated by independently sampling tokens from a mixture of topics, denoted $\nu_d$:

$$\nu_d = \sum_{t} \theta_d(t)\,\mu_t, \tag{1}$$

where $\theta_d(t) \ge 0$ and $\sum_t \theta_d(t) = 1$ for every document $d$. Note that we do not specify the generative model for $\theta_d$. In this sense, pLSA is a semi-generative model, and is more general than LDA.

Next, for every document in the corpus we construct its token co-occurrence probability matrix, and we take the co-occurrence probability matrix of the corpus to be the average of the document matrices. Let $N = |\mathcal{X}|$ be the dictionary size, the number of distinct tokens in the corpus. Then the co-occurrence matrix of the corpus, $\hat M$, is an $N \times N$ matrix with non-negative entries that sum to $1$. Suppose that one performs the following experiment: sample a document from the corpus at random, and then sample two tokens independently from the document. Then $\hat M_{u,v}$ is the probability of observing the pair $(u,v)$ in this experiment (up to a small modification; full details are given in Section 3).

Now, if one assumes the pLSA model of the corpus, then it can be shown that the expectation of $\hat M$ should be of the form

$$M_{u,v}(\mu,\alpha) = \sum_{i,j=1}^{T} \alpha_{i,j}\,\mu_i(u)\,\mu_j(v), \tag{2}$$

where $\mu_i$ are the topics and the coefficients $\alpha_{i,j} \ge 0$, $\sum_{i,j} \alpha_{i,j} = 1$, represent the corpus level topic-topic correlations. We refer to matrices of the form (2) as Full Dependence Mixture (FDM) matrices. This is due to the analogy with standard multinomial mixture models (also known as categorical mixture models), which can be represented in the form (2) but with $\alpha$ restricted to be a diagonal matrix. Multinomial mixture models correspond to the special case where each document contains samples from only one topic, where the topic may be different for different documents. Equivalently, in multinomial mixtures, each $\theta_d$ is $0$ on all but one coordinate.
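The form (2) can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming the topics are stored as the rows of an array `mu` of shape `(T, N)` and the correlations as a `(T, T)` array `alpha`; these names, the random toy values and the sizes are illustrative assumptions, not part of the paper. Under that layout, the FDM matrix is the product $\mu^\top \alpha\, \mu$, and a diagonal `alpha` gives the multinomial mixture special case.

```python
import numpy as np

def fdm_matrix(mu, alpha):
    """Return M[u, v] = sum_{i,j} alpha[i, j] * mu[i, u] * mu[j, v], as in (2)."""
    return mu.T @ alpha @ mu

rng = np.random.default_rng(0)
T, N = 3, 6  # toy sizes, chosen only for illustration

# Each row of mu is a topic: a probability distribution over N tokens.
mu = rng.dirichlet(np.ones(N), size=T)
# alpha is non-negative and its T*T entries sum to 1.
alpha = rng.dirichlet(np.ones(T * T)).reshape(T, T)

M = fdm_matrix(mu, alpha)

# Special case: a diagonal alpha is a multinomial mixture of the matrices
# mu_t (x) mu_t, weighted by the diagonal entries.
weights = np.array([0.5, 0.3, 0.2])
M_mixture = fdm_matrix(mu, np.diag(weights))
```

Since every topic sums to one and `alpha` sums to one, `M` is itself a probability distribution over token pairs.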

In this paper, we consider a set of topics $\mu$ to be a good fit for the data if there are some correlation coefficients $\alpha$ such that $M(\mu,\alpha)$ is close to $\hat M$, the FDM generated from the data. Specifically, we define the loss by

$$L(\mu,\alpha) = -\sum_{u,v \in \mathcal{X}} \hat M_{u,v} \log M(\mu,\alpha)_{u,v}, \tag{3}$$

and we are interested in minimizing $L(\mu,\alpha)$ over all $\mu, \alpha$. Clearly, minimizing $L(\mu,\alpha)$ is equivalent to minimizing the Kullback-Leibler divergence between $\hat M$ and $M(\mu,\alpha)$, viewed as probability distributions over $\mathcal{X} \times \mathcal{X}$.
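The equivalence can be written out in one line. Denoting by $H(\hat M) = -\sum_{u,v} \hat M_{u,v} \log \hat M_{u,v}$ the entropy of the empirical matrix (a symbol introduced here only for this remark), we have

$$\mathrm{KL}\big(\hat M \,\|\, M(\mu,\alpha)\big) = \sum_{u,v \in \mathcal{X}} \hat M_{u,v} \log \frac{\hat M_{u,v}}{M(\mu,\alpha)_{u,v}} = L(\mu,\alpha) - H(\hat M).$$

Since $H(\hat M)$ does not depend on $(\mu,\alpha)$, the loss (3) and the divergence differ by a constant and have the same minimizers.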

To gain basic intuition as to why that approximates well the empirical matrix should yield good topics, it is useful to consider again the simple case of the multinomial mixture, and moreover, the specific case where the topics are disjoint. In this case, the matrix will be block diagonal (up to a reordering of the dictionary), with disjoint blocks that correspond to the topics. Thus, provided enough samples , the topics can be easily read off from the matrix. An example of a more complicated matrix is shown in Figure 1. Here the ground set is , and the topics are uniform on intervals ,, and respectively. The documents were generated from the mixture (1), with sampled from a non-uniform Dirichlet, , which make the topics appear with different frequencies in the data. Although here the relation between the topics and the matrix is more involved, one can still see that the topics could be traceable from the matrix.

In Section 4 we show the asymptotic consistency of the loss (3) for topics which satisfy the classical anchor words ([9], [3]) assumption. That is, if the topics satisfy the anchor words assumption, then given enough samples the topics can be uniquely reconstructed from the matrix $\hat M$ by minimizing the loss (3). The anchor words assumption roughly states that each topic has a word that is unique to that particular topic. Note that this word does not have to appear in each document containing the topic, and may in fact have a relatively low probability inside the topic itself. It is known, [3], and easy to verify, that natural topics, such as topics produced by learning LDA, do satisfy the anchor words assumption. We refer to Section 4 for further details.

The advantage of using the cost $L(\mu,\alpha)$ is that it depends on the corpus only through the matrix $\hat M$. Therefore, the size of the corpus does not enter the optimization problem directly, and we are dealing with a fixed-size problem. This is a general feature of approaches based on reconstruction through moments, such as [3], [2]. In particular, the number of variables for the optimization is $TN + T^2$ (the entries of $\mu$ and $\alpha$), in contrast to variational or Gibbs sampler based methods, which either have variables for every document, or have a single variable for every token in the corpus, respectively.

### 1.2 Optimization of the Objective L(μ,α).

For smaller problems, one may directly optimize the objective (3) using gradient descent methods. However, note that if one computes $L(\mu,\alpha)$ directly, then one has to compute $M(\mu,\alpha)$, which is a sum of $T^2$ matrices of size $N \times N$. Indeed, denote by $\mu_i \otimes \mu_j$ the matrix whose $(u,v)$ entry is $\mu_i(u)\mu_j(v)$. Then $M(\mu,\alpha) = \sum_{i,j} \alpha_{i,j}\, \mu_i \otimes \mu_j$. On standard GPU computing architectures, all of the $T^2$ matrices will have to be in memory simultaneously, which is prohibitive even for moderate problem sizes. To resolve this issue, we reformulate the optimization of $L$ as a stochastic optimization problem over token pairs $(u,v)$. To this end, note that $L$ is an expectation of the term $-\log M(\mu,\alpha)_{u,v}$ over pairs of tokens $(u,v)$ sampled from $\hat M$, where $\hat M$ is viewed as a probability distribution over $\mathcal{X} \times \mathcal{X}$. Formally,

$$L(\mu,\alpha) = -\,\mathbb{E}_{(u,v) \sim \hat M} \log M(\mu,\alpha)_{u,v}. \tag{4}$$

Therefore, given a pair $(u,v)$, one only has to compute the gradient of $\log M(\mu,\alpha)_{u,v}$ at that pair, rather than the full matrix $M(\mu,\alpha)$. This is a much smaller problem, and it can be solved for moderate batch sizes. This approach makes the optimization of $L$ practically feasible. The full algorithm, given as Algorithm 2 below, is discussed in detail in Section 3. Note that this approach differs from the standard stochastic gradient descent flow on the GPU, where the batches consist of data samples (documents in the case of text data). Instead, here the data is already summarized as $\hat M$, and the batches consist of pairs of tokens that we sample actively from $\hat M$.
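The pair-sampling view can be sketched as follows. This is an illustrative NumPy sketch and not the authors' implementation: the function names, the dense storage of the empirical matrix and the toy sizes are all assumptions. It draws token pairs with probability proportional to the entries of the empirical matrix and evaluates the sampled loss using only the entries of $M(\mu,\alpha)$ at the sampled pairs, without forming the full $N \times N$ matrix inside the loss.

```python
import numpy as np

def sample_pairs(M_hat, batch_size, rng):
    """Draw token pairs (u, v) with probability M_hat[u, v]."""
    N = M_hat.shape[0]
    flat = rng.choice(N * N, size=batch_size, p=M_hat.ravel())
    return np.divmod(flat, N)  # row indices u, column indices v

def minibatch_loss(mu, alpha, us, vs):
    """Monte Carlo estimate of the loss: mean of -log M(mu, alpha)[u, v]."""
    left = mu[:, us].T @ alpha                       # (B, T)
    m_uv = np.einsum("bt,tb->b", left, mu[:, vs])    # M[u_b, v_b] for each pair
    return -np.log(m_uv).mean()

rng = np.random.default_rng(0)
T, N = 3, 8  # toy sizes, for illustration only
mu = rng.dirichlet(np.ones(N), size=T)               # topics as rows
alpha = rng.dirichlet(np.ones(T * T)).reshape(T, T)
M_hat = mu.T @ alpha @ mu  # stand-in for an empirical co-occurrence matrix

us, vs = sample_pairs(M_hat, batch_size=256, rng=rng)
estimate = minibatch_loss(mu, alpha, us, vs)
```

Each batch touches only `2 * batch_size` columns of `mu` and the `T x T` matrix `alpha`, which is what keeps the per-step cost independent of the corpus size.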

### 1.3 Experimental Results

We evaluate the FDM algorithm on a semi-synthetic dataset where the ground truth topics are known and are taken to be realistic (topics learned by LDA on NeurIPS data) and on two real world datasets: the NeurIPS full papers corpus, and a very large (20 million tweets) Twitter dataset that was collected via the Twitter API. For the semi-synthetic dataset the topic quality was measured by comparison to the ground truth topics, while for the real datasets the log-likelihood on a hold-out set was measured. We compare FDM to a state of the art LDA collapsed Gibbs sampler (termed SparseLDA), and to the topic learning algorithm introduced in [3] (termed EP). Additional details on these benchmarks are given in Section 2. For the semi-synthetic models, we find that while, as expected, SparseLDA with true hyperparameters performs somewhat better, FDM performs similarly to SparseLDA with close but different hyperparameters. All algorithms find a reasonable approximation of the topics, although EP performance is notably weaker. On NeurIPS, FDM performs similarly to SparseLDA, while on Twitter data FDM performance is somewhat better. Both algorithms outperform EP.

To summarize, the contributions of this paper are as follows: We introduce a new approach to topic modeling, via the fitting of the topic FDM to the empirical FDM by minimizing a negative log-likelihood (Kullback-Leibler) objective, and prove the consistency of the associated loss. We introduce a practical optimization procedure for this method, and experimentally show that it produces topics that are comparable to or better than those of the state of the art approaches, while using principles that significantly differ from the existing methods.

## 2 Literature

The subject of optimization in topic models has received significant attention in the literature. We first describe the general directions of this research. Variational methods for the LDA objective were developed in the paper that introduced the LDA model, [7]. See also [10], [14]. The collapsed Gibbs sampler for LDA was introduced in [11], [27]. Optimizations of the collapsed sampler that exploit the sparsity of the topics in a document were developed in [31], [30], [18] and yielded further performance improvements. Streaming, or online methods for the LDA objective were proposed in [23], [13]. We also note that the topic modelling problem, and in particular the pLSA model, is closely related to the Non-Negative Matrix Factorization (NMF) problem, [16]. In this context, EM type algorithms for topic models were developed in the paper that introduced pLSA, [15]. Streaming algorithms for NMF in general are also an active field, see for instance [33], [12], which involve Euclidean costs, but could possibly be adapted to the topic modelling setting. Finally, distributed methods for LDA were introduced in [23], [26], [20], [32], [1].

Topic reconstruction from corpus statistics such as the matrix $\hat M$ was previously considered in the theoretical study of topic models. Topic reconstruction from the third moments of the data was proposed in [2]. While highly theoretically significant, these algorithms require the construction and analysis of an $N \times N \times N$ tensor, and thus can not be applied with dictionaries of several thousand tokens. An algorithm that is based on the matrix $\hat M$, as in our approach, was given in [4] and improved in [3]. However, despite the fact that both [3] and our approach use $\hat M$, the underlying principles behind the two algorithms are completely different. The approach of [3] is based on the notion of anchor words (see Section 4), and consists of two steps: First, the algorithm attempts to explicitly identify the anchor words from the matrix $\hat M$. Then, the topics are reconstructed using the obtained anchor words. Due to the structure of the pLSA model, it can be shown that any row of the matrix $\hat M$ can be approximately represented as a convex combination of the rows of $\hat M$ that correspond to the anchor words. Thus, the anchor word identification in [3] is done by identifying the approximate extreme points of a set, where the set is the set of rows of $\hat M$. In contrast, our approach does not attempt to find or use anchor words explicitly and conceptually is a much simpler gradient descent algorithm. We optimize in the space of topics directly, by approximating $\hat M$ by an FDM $M(\mu,\alpha)$. It is also worth mentioning that in the consistency proof of Algorithm 2, Theorem 4.2 (although not in the algorithm itself), we use the characterization of the topics as the extreme points of a certain polytope. However, these are not the same extreme points as in [3]. The extreme points we use for the proof correspond to topics, while the extreme points in [3] correspond to conditional probabilities of tokens given anchor words, and are generally very different from the topics themselves.

Finally, as mentioned earlier, we use the algorithm from [3] as an additional benchmark. While the algorithm of [3] is extremely fast, and can be very precise under certain conditions, the quality of the topics found by that algorithm is significantly lower compared to the topics produced by the standard optimized LDA Gibbs sampler.

A variant of the Gibbs sampler for LDA that is adapted to the computation on GPU was recently proposed in [28]. We note that similarly to the Gibbs samplers or variational methods, this approach maintains a form of a topic assignment for each document. Therefore, the number of variables that need to be stored grows with the number of documents and topics and is , where is the number of documents and number of topics. This is true despite the remarkable memory optimizations described in [28], which address other matrices used by that algorithm. Observe that for GPUs, this problem is quite severe, since the amount of memory typically available on a GPU is much smaller than the CPU memory. This is in contrast to our approach, where the GPU memory requirement is independent of the number of documents or tokens. To put this in context, the datasets we analyze here, NeurIPS and Twitter (both with ) can not be analyzed via the approach of [28] on a high end desktop GPU ( memory).

The MALLET code, [21], is widely used as the standard benchmark. This code is based on a collapsed Gibbs sampler for LDA, and implements a variety of optimizations discussed earlier. In particular it exploits sparsity, based on [31], parallelization (within a single workstation) based on [23], and has an efficient and publicly available implementation.

Finally, we note that, while outside of the scope of this paper, the methods presented here could easily be adapted to distributed, multi-GPU settings, using standard SGD parallelization techniques. This may be achieved either by elementary means, by increasing the batch size and processing it on multiple GPUs in parallel, or via more involved, lock-free methods, such as [25].

## 3 Formal Algorithm Specification

Recall that $N$ is the size of the dictionary. For a document $d$ given as a sequence of tokens $d = \{x_1, \ldots, x_{l_d}\}$, where $l_d$ is the total number of tokens in $d$, denote by $c_d \in \mathbb{R}^N$ the count vector of $d$,

$$c_d(u) = \#\{x_i \mid x_i = u\} \quad \text{for } u \in \mathcal{X}. \tag{5}$$

Thus, $c_d$ is the bag of words representation of $d$, and $c_d(u)$ is the number of times $u$ appears in $d$. With this notation, the construction of the matrix $\hat M$ from the data is described in Algorithm 1.

To motivate this construction, assuming the pLSA model, suppose a document is sampled from a mixture of topics

$$\nu = \sum_{t \le T} \theta(t)\,\mu_t. \tag{6}$$

The co-occurrence matrix of the mixture is

$$(M_\nu)_{u,v} = \nu(u) \cdot \nu(v) = \sum_{i,j \le T} \theta(i)\,\theta(j)\,\mu_i(u)\,\mu_j(v). \tag{7}$$

Thus, $(M_\nu)_{u,v}$ is the probability of obtaining the pair $(u,v)$ when one samples two tokens independently from $\nu$. The co-occurrence matrix of the corpus is the average of the corresponding $M_{\nu_d}$ over all documents $d$. Observe from (7) that $M_\nu$ has the form of an FDM, (2), and hence the co-occurrence matrix of the full corpus also has this form. Next, note that we do not have access to the matrices $M_{\nu_d}$ themselves, only to the documents $d$, which are samples from $\nu_d$. We therefore estimate $M_{\nu_d}$ using the tokens of the document. Specifically, the matrix $\hat M_d$ constructed in (9) in Algorithm 1 is an unbiased estimate of $M_{\nu_d}$ from the tokens in $d$: We have

$$\mathbb{E}_d\big[\hat M_d \mid \theta\big] = M_\nu, \tag{8}$$

where the expectation is over the documents sampled from $\nu$. We provide the proof of this statement in the supplementary material. Therefore, to obtain an approximation to the co-occurrence matrix of the model, in Algorithm 1 we first compute the estimates $\hat M_d$ for each document, and then average over the corpus.
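The two steps (per-document estimate, then corpus average) can be sketched as below. This is a hedged illustration rather than a transcription of Algorithm 1: it uses the per-document formula given in Appendix B of the supplementary material, rewritten in terms of the count vector as $(c_d c_d^\top - \mathrm{Diag}(c_d)) / (l_d (l_d - 1))$. The function names and the dense $N \times N$ storage are assumptions made for brevity; a practical implementation would likely accumulate sparse counts.

```python
import numpy as np

def document_cooccurrence(counts):
    """Unbiased co-occurrence estimate for one document from its count vector.

    Equals l/(l-1) * d_hat (x) d_hat - 1/(l-1) * Diag(d_hat) with
    d_hat = counts / l, and requires at least two tokens.
    """
    counts = np.asarray(counts, dtype=float)
    l = counts.sum()
    if l < 2:
        raise ValueError("a document needs at least two tokens")
    return (np.outer(counts, counts) - np.diag(counts)) / (l * (l - 1))

def corpus_cooccurrence(count_vectors):
    """Average the per-document estimates over the corpus."""
    return np.mean([document_cooccurrence(c) for c in count_vectors], axis=0)

# Two toy documents over a dictionary of N = 3 tokens (illustrative data).
docs = [[2, 1, 0], [0, 3, 1]]
M_hat = corpus_cooccurrence(docs)
```

Each per-document matrix sums to one, so the corpus average `M_hat` is again a probability distribution over token pairs. Removing the diagonal term corresponds to sampling the two tokens of a pair without replacement.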

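Once the matrix $\hat M$ is constructed, our goal is to reconstruct the topics $\mu$ and the corpus level coefficients $\alpha$ from $\hat M$. As discussed in the introduction, the FDM optimization algorithm, given as Algorithm 2, is a stochastic gradient ascent on the log-likelihood of pairs of tokens sampled from $\hat M$ (equivalently, a descent on the loss (3)). Note that since the parameters in which we are interested are constrained to be probability distributions, we parametrize them with free variables via softmax:

$$\alpha_{i,j} = \frac{e^{\alpha'_{ij}}}{\sum_{i',j' \le T} e^{\alpha'_{i'j'}}} \quad \text{and} \quad \mu_t(l) = \frac{e^{\mu'_t(l)}}{\sum_{l' \le N} e^{\mu'_t(l')}}, \tag{11}$$

where $\alpha'_{ij}$ and $\mu'_t(l)$ are unconstrained real variables, with $i, j, t \le T$ and $l \le N$. Thus, in Algorithm 2, $\alpha, \mu$ are functions of $\alpha', \mu'$ and the gradient ascent is over $\alpha', \mu'$. Note that any SGD optimizer, including adaptive rate optimizers, may be used in the update step of Algorithm 2. We use Adam, [17], in the experiments.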
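To make the softmax parametrization concrete, the following NumPy sketch computes the batch average of $\log M(\mu,\alpha)_{u,v}$ and its gradients with respect to the free variables, using hand-derived chain rules. It is an illustration under stated assumptions, not the authors' code: the names `alpha_p`, `mu_p`, the row-per-topic layout, the toy sizes and the plain ascent step with a hypothetical step size are all choices made here, whereas the paper uses Adam on a GPU.

```python
import numpy as np

def softmax(z, axis=None):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def batch_loglik_and_grads(alpha_p, mu_p, us, vs):
    """Mean of log M(mu, alpha)[u, v] over a batch, and its gradients with
    respect to the free variables alpha_p (T x T) and mu_p (T x N)."""
    B = len(us)
    alpha = softmax(alpha_p)             # softmax over all T*T entries
    mu = softmax(mu_p, axis=1)           # each row is one topic
    a = mu[:, us].T                      # a[b, i] = mu_i(u_b)
    c = mu[:, vs].T                      # c[b, j] = mu_j(v_b)
    m = np.einsum("bi,ij,bj->b", a, alpha, c)   # M[u_b, v_b]

    # Gradients with respect to the constrained parameters alpha and mu.
    g_alpha = np.einsum("bi,bj->ij", a / m[:, None], c) / B
    g_mu_t = np.zeros((mu.shape[1], mu.shape[0]))        # (N, T)
    np.add.at(g_mu_t, us, (c @ alpha.T) / m[:, None])    # d / d mu_t(u_b)
    np.add.at(g_mu_t, vs, (a @ alpha) / m[:, None])      # d / d mu_t(v_b)
    g_mu = g_mu_t.T / B

    # Chain rule through the two softmax maps.
    g_alpha_p = alpha * (g_alpha - (g_alpha * alpha).sum())
    g_mu_p = mu * (g_mu - (g_mu * mu).sum(axis=1, keepdims=True))
    return np.log(m).mean(), g_alpha_p, g_mu_p

rng = np.random.default_rng(0)
T, N = 3, 5                              # toy sizes, for illustration only
alpha_p = rng.normal(size=(T, T))
mu_p = rng.normal(size=(T, N))
us = np.array([0, 1, 2, 2])              # a hand-picked batch of token pairs
vs = np.array([3, 1, 0, 4])

value, g_alpha_p, g_mu_p = batch_loglik_and_grads(alpha_p, mu_p, us, vs)

step = 0.1                               # hypothetical step size
alpha_next = alpha_p + step * g_alpha_p  # one plain gradient ascent step
mu_next = mu_p + step * g_mu_p
```

In practice an automatic differentiation framework would produce these gradients; writing them out shows that a batch only involves the $T \times T$ matrix and the columns of the topics at the sampled tokens.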

## 4 Consistency

In this section we discuss the consistency of the loss function (3) under the anchor words assumption. Specifically, we show that if the corpus is generated by a pLSA model with a set of topics $\mu$ that satisfy the anchor words assumption, then among topics satisfying this assumption, the loss (3) can only be asymptotically minimized by an FDM with the true topics $\mu$. It follows that if one minimizes the loss (3) and the resulting topics satisfy the anchor words assumption (this holds empirically, see below), then in the limit the topics must be the true topics.

The anchor words assumption was introduced in [9] as a part of a set of sufficient conditions for identifiability in NMF. A set of topics $\mu = \{\mu_t\}_{t \le T}$ is said to have the anchor words property if for every $t \le T$ there is a point $u_t \in \mathcal{X}$ such that $\mu_t(u_t) > 0$ and $\mu_{t'}(u_t) = 0$ for all $t' \ne t$. The point $u_t$ is called the anchor word of the topic $\mu_t$. As mentioned earlier, natural topics tend to have the anchor words property. For instance, topics found by various LDA based methods have anchor words, [3]. We note that topics found by the FDM optimization algorithm, Algorithm 2, also have anchor words.
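The anchor words property, that every topic has a token on which it is positive while all other topics vanish, is easy to test numerically. The sketch below is an assumption-laden illustration: the function names and the threshold `tol` are hypothetical. A threshold is needed in practice because topics obtained through a softmax are strictly positive, so "zero" has to mean "below some small value".

```python
import numpy as np

def anchor_words(mu, tol=0.0):
    """For each topic t, list the tokens u with mu[t, u] > tol and
    mu[s, u] <= tol for every other topic s."""
    support = mu > tol
    in_one_topic = support.sum(axis=0) == 1
    return [np.flatnonzero(support[t] & in_one_topic).tolist()
            for t in range(mu.shape[0])]

def has_anchor_words(mu, tol=0.0):
    return all(len(tokens) > 0 for tokens in anchor_words(mu, tol))

# Topic 1 shares both of its tokens with a neighbour, so it has no anchor word.
chain = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.0, 0.5, 0.5]])

# Here token 0 anchors the first topic and token 2 anchors the second.
anchored = np.array([[0.5, 0.5, 0.0],
                     [0.0, 0.5, 0.5]])
```

Note that an anchor word may carry little mass within its topic; the check only concerns the support pattern across topics.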

We first develop a few equivalent geometric characterizations of anchor words. While some of the arguments used in the proofs are similar in spirit to those in [9],[4], the particular notions we introduce have not appeared in the literature previously, and significantly clarify the geometric nature of the anchor word assumption. We thus provide these results here for completeness. Some of the equivalences below play an important role in the proof of Theorem 4.2.

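Denote by $\Delta \subset \mathbb{R}^N$ the probability simplex in $\mathbb{R}^N$. A set of probability measures $\mu = \{\mu_t\}_{t \le T}$ is called span-maximal if $\mathrm{span}(\mu) \cap \Delta = \mathrm{conv}(\mu)$. Here $\mathrm{span}(\mu)$ and $\mathrm{conv}(\mu)$ denote the span and the convex hull of the set. A set $\mu$ is positive (P) if the following holds: for every linear combination $\nu = \sum_t a_t \mu_t$, if $\nu \in \Delta$ then $a_t \ge 0$ for all $t \le T$. In other words, a set is positive if only its linear combinations with non-negative coefficients belong to the simplex. Finally, a set $\mu$ is maximal if the following holds: for any set of $T$ probability measures $\mu' = \{\mu'_t\}_{t \le T}$, if $\mathrm{conv}(\mu) \subset \mathrm{conv}(\mu')$, then we must have $\mu_t = \mu'_{\pi(t)}$ for some permutation $\pi$ on $\{1, \ldots, T\}$. Equivalently, a set is maximal if it can not be properly contained in a convex hull of another set of topics of size $T$.

###### Proposition 4.1.

For a set of linearly independent topics $\mu$, the anchor words property, span-maximality, positivity (P), and maximality are equivalent.

The proof is given in supplementary material Section C. Note in addition that the anchor words property implies that the topics $\mu_t$ are linearly independent. We now discuss the relation between $\mu$ and an FDM matrix $M = M(\mu,\alpha)$. In particular, we describe how $\mu$ can be recovered from the image of $M$, viewed as a linear operator on $\mathbb{R}^N$. Note first that the image of $M$ satisfies $\mathrm{Im}(M) \subset \mathrm{span}(\mu)$. Indeed, for a fixed $v$, a column of $M$ can be written as a linear combination, $M_{\cdot,v} = \sum_i \big(\sum_j \alpha_{i,j}\,\mu_j(v)\big)\mu_i$. Moreover, if the matrix $\alpha$ has full rank, $\mathrm{rank}(\alpha) = T$, and if the $\mu_t$ are linearly independent, then $\mathrm{Im}(M) = \mathrm{span}(\mu)$. Therefore, in this case, given $M$ we can recover $\mathrm{span}(\mu)$ as $\mathrm{Im}(M)$. Now, if $\mu$ satisfies the anchor words property, then by Proposition 4.1 it is also span-maximal. It then follows that $\mathrm{conv}(\mu)$ can be recovered as $\mathrm{Im}(M) \cap \Delta$. Finally, if we know $\mathrm{conv}(\mu)$, we can recover $\mu$, since these are simply the extreme points of that polytope, and every polytope is uniquely characterized by its extreme points. This relation between $\mu$ and $M$ is at the basis of the consistency result below.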

We state the consistency for the probabilistic setting: We complement the pLSA model to be a full generative model by assuming that topic distribution is sampled independently for each document from some probability distribution on . If is a Dirichlet distribution, symmetric or asymmetric, this corresponds to an LDA model. Other examples include models with correlated topics, [6], or hierarchical topics, [19], among many others. The only requirement on the topic distribution is the following: Let be the expected topic-topic co-occurrence matrix corresponding to the sampling scheme. We require that is full rank. This assumption holds in all the examples above.

###### Theorem 4.2 (Consistency).

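Consider a generative pLSA model (1), over topics $\mu$ which satisfy the anchor words assumption, and where $\theta_d$ are sampled independently from a fixed distribution on the simplex, with topic-topic expected co-occurrence matrix $\alpha$. Set $M = M(\mu,\alpha)$. Then with probability $1$,

$$\lim_{D \to \infty} \sum_{u,v} \hat M_{u,v} \log M_{u,v} = \sum_{u,v} M_{u,v} \log M_{u,v}. \tag{12}$$

Conversely, let $\mu'$ be a different set of topics satisfying the anchor words assumption. There is a $\gamma > 0$, such that for any FDM $M'$ over $\mu'$, with probability $1$,

$$\lim_{D \to \infty} \sum_{u,v} \hat M_{u,v} \log M'_{u,v} \le \sum_{u,v} M_{u,v} \log M_{u,v} - \gamma. \tag{13}$$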

This result is proved in supplementary material Section D. The probability in Theorem 4.2 is over the samples from the generative model, through the random variables (note that , defined (10), depends on ). The theorem states that the loss (3) is minimized at the true parameters and . Indeed, denote . Note that the cost is a random variable, as it depends on . Then Eq. (12) states that converges to , while Eq. (13) is equivalent to stating that for any other set of topics, , we have for any . The gap will depend on how well approximates the true topics .

## 5 Experiments

In the following sections we discuss experiments on semi-synthetic data, the NeurIPS papers corpus and the Twitter dataset; see Section 1.3 for an overview.

### 5.1 Synthetic Data

To approximate real data in a synthetic setting, we used topics learned by SparseLDA optimization on the NeurIPS papers corpus (Section 5.2) as the ground truth topics. The dictionary size in this case is tokens. The synthetic documents were generated using the LDA model: For each document, a topic distribution was sampled per document from a symmetric Dirichlet with the standard concentration parameter , and tokens were sampled from the mixture of the ground truth topics. The corpus size was documents. Note that for a dictionary of size and non-uniform topics, this is not a very large corpus.
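The generation procedure described above can be sketched as follows. The sizes, the concentration value and the random stand-in topics below are placeholders chosen for illustration; they are not the values used in the paper, where the topics come from a SparseLDA run. The function name and argument names are likewise assumptions.

```python
import numpy as np

def generate_corpus(topics, n_docs, doc_len, concentration, rng):
    """Sample bag-of-words count vectors from an LDA model with fixed topics."""
    T, N = topics.shape
    # One topic distribution theta_d per document, from a symmetric Dirichlet.
    thetas = rng.dirichlet(np.full(T, concentration), size=n_docs)   # (D, T)
    # Row d is the mixture nu_d = sum_t theta_d(t) * mu_t.
    mixtures = thetas @ topics                                       # (D, N)
    # Tokens are i.i.d. from nu_d, so the counts are multinomial.
    counts = np.stack([rng.multinomial(doc_len, nu) for nu in mixtures])
    return counts, thetas

rng = np.random.default_rng(0)
T, N = 4, 20                                  # placeholder sizes
topics = rng.dirichlet(np.full(N, 0.1), size=T)   # stand-in ground truth topics
counts, thetas = generate_corpus(topics, n_docs=50, doc_len=30,
                                 concentration=0.5, rng=rng)
```

Because token order is irrelevant to the co-occurrence construction, sampling the count vector directly is equivalent to sampling the tokens one at a time.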

To reconstruct the topics, we compared three algorithms: (i) SparseLDA, a sparsity optimized parallel collapsed Gibbs sampler for LDA, implemented in the MALLET framework (see Section 2 for details), which was run with 4 parallel threads. (ii) FDM, Algorithm 2. (iii) The topic learning algorithm from [3], to which we refer as EP (Extreme Points) in what follows. All algorithms were run 5 times, until convergence, on the fixed dataset. Note that the EP algorithm does not have a random initialization, but uses a random projection as an intermediate dimensionality reduction step. Therefore different runs might be affected by restarts (although very mildly, in practice). We used as the dimension of the random projection – a value that was specified in [3] as the practical value for the dictionary sizes of the order we use here. Hardware specifications are given in the supplementary material. All the algorithms were run with the true number of topics as a parameter.

The SparseLDA algorithm was run in two modes: With the true hyperparameters, , corresponding to the true of the corpus, and with topic sparsity parameter , a standard setting which was also used to learn the ground truth topics. To model a situation where the true hyperparameters are unknown, we also evaluated SparseLDA with a modified hyperparameter , and same . Note that this is a relatively mild change of the hyperparameter.

The quality of the learned topics was measured by calculating the optimal matching $\ell_1$ distances to the ground truth topics. That is, given the topics returned by the model, $\mu_t$, and the ground truth topics $\nu_t$, we compute

$$\mathrm{err} = \frac{1}{T} \min_{\tau} \sum_{t=1}^{T} \big\| \mu_t - \nu_{\tau(t)} \big\|_1, \tag{14}$$

where $\tau$ is the matching, a permutation of the set $\{1, \ldots, T\}$. The optimal matching was computed using the Hungarian algorithm.
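A compact way to evaluate (14) is to build the $T \times T$ matrix of pairwise $\ell_1$ distances and pass it to an assignment solver. The sketch below uses `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm; the function name, the toy topics and the choice of solver are assumptions, since the paper does not say which implementation was used.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_l1_error(learned, truth):
    """Average l1 distance under the optimal one-to-one matching of topics.

    learned, truth: arrays of shape (T, N), one topic per row.
    Returns the error (14) and the matching tau as an index array.
    """
    # cost[t, s] = || learned_t - truth_s ||_1; this builds a (T, T, N) array,
    # which is fine for small problems but should be chunked for large ones.
    cost = np.abs(learned[:, None, :] - truth[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols

truth = np.array([[0.7, 0.2, 0.1, 0.0],
                  [0.0, 0.1, 0.2, 0.7],
                  [0.25, 0.25, 0.25, 0.25]])
learned = truth[[2, 0, 1]]          # the same topics in a different order
err, tau = matching_l1_error(learned, truth)
```

Here `tau[t]` is the index of the ground truth topic matched to the learned topic `t`, so a pure reordering of the topics gives zero error.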

The results are given in Table 1, which shows for each algorithm the average error and the standard deviation over the different runs. To put the numbers in perspective, the typical distance between two ground truth topics is around . Thus all algorithms learned at least some approximation of the ground truth. By visual inspection, a topic at distance from a given ground truth topic tends to capture the mass at the correct tokens, but the amount of mass deviates somewhat from the correct one.

We observe that SparseLDA with the ground truth hyperparameters attained the best performance. This is not surprising, since the algorithm is based on the generative model that is the true generative model of the corpus, and was provided with the true hyperparameters. Both of these constitute considerable prior knowledge. The performance of the EP algorithm was relatively low, which is likely due to the fact that the corpus size was not sufficiently large for that algorithm, with this set of topics.

Finally, FDM and SparseLDA with the modified hyperparameter attained similar performance. Interestingly, SparseLDA and FDM tend to err slightly differently. In Figure 3, for a set of topics found by each algorithm, we show the histogram of the quantities $\| \mu_t - \nu_{\tau(t)} \|_1$, i.e. the distribution of the distances within the matching. SparseLDA tends to completely miss about 50 to 80 out of 500 topics, while FDM is slightly less precise on the topics that it does approximate well. This figure is remarkably consistent across the different runs of the algorithms.

### 5.2 NeurIPS Dataset

The NeurIPS dataset, [22], consists of all NeurIPS papers from 1987 to 2016. Each document was taken to be a single paragraph. Stop words, numbers and tokens appearing fewer than 50 times were removed. All tokens were stemmed, and documents with fewer than 20 tokens were removed. The preprocessed dataset contained roughly documents over the dictionary of around unique tokens. of the documents were taken at random as a hold-out (test) set. The log-likelihood of the documents in the hold-out set was used as the performance measure. The computation of the (log) likelihood on the hold-out set given topics is standard, with full details provided in supplementary material Section E. The following models were trained: FDM, SparseLDA () and EP ([3] ) with topics. All models were run until convergence.

The mean hold-out log-likelihoods for each method are shown in Table 2, and a histogram of the distribution of the hold-out log-likelihoods for a single run of each algorithm is shown in Figure 1(a). We observe that the performance of SparseLDA and FDM is practically identical, and both perform better than EP.

To obtain some insight into the relation between the models, Figure 1(b) shows the histogram of the optimal matching distances between the topics learned by a fixed run of SparseLDA and a fixed run of FDM (blue). For scale, the distances of SparseLDA and FDM to a fixed topic, the empirical distribution of the corpus, are also shown. It appears that both models find somewhat similar topics.

### 5.3 Twitter Dataset

We first describe the collection and processing of the Twitter corpus. The tweets were collected via the Twitter API, [29]. The data contains about 16M (million) publicly available tweets (after preprocessing, see below), posted by 60K users during roughly the period of 1/2014 to 1/2017. The tweets were preprocessed to remove numbers, non-ASCII characters, mentions (Twitter usernames preceded by an @ character) and URLs. All tokens were stemmed, and stop words and tokens with fewer than 3 characters were removed. The 200 most common tokens and rare tokens (fewer than 1000 appearances in the corpus) were removed. Following this, tweets shorter than 4 tokens were also removed. This resulted in a corpus of about tweets over a dictionary of slightly more than unique tokens. Each tweet was considered as a separate document, and typical tweets have 4 to 8 tokens.

The experiment setting was similar to the NeurIPS papers corpus. of the documents were taken as a hold-out set, and the hold-out log-likelihood of SparseLDA, FDM and EP topics was evaluated. All algorithms were run to convergence, for about 12 hours for each run.

The resulting hold-out log-likelihoods are given in Table 2, and the distribution of the log-likelihoods is shown in Figure 3(a). We observe that in this case the performance of FDM is better than that of SparseLDA, while EP does not produce a good approximation of the dataset.

## 6 Conclusions

In this paper we introduced a new topic modeling approach, FDM topic modeling, which is based on matching, via KL divergence, of the token co-occurrence distribution induced by the topics to the co-occurrence distribution of the corpus. We have shown the asymptotic consistency of the approach under the anchor words assumption and presented an efficient stochastic optimization procedure for this problem. This algorithm enables the approach to leverage GPU computation and efficient SGD optimizers. Our empirical evaluation shows that FDM produces topics of good quality and matches or improves over the performance of SparseLDA.

Topic Modeling via Full Dependence Mixtures - Supplementary Material

## Appendix A Outline

In this supplementary material we provide the proofs of the results stated in the main text. In addition, details on the hardware used in the experiments are given in Section F.

## Appendix B The Unbiased Estimate $\hat M_d$

In this section we discuss the construction of the matrix $\hat M_d$, defined in (9) in Algorithm 1. Specifically, we show the unbiased estimate property of $\hat M_d$, (8).

First, let us introduce some notation. For any two vectors $a, b \in \mathbb{R}^N$, denote by $a \otimes b$ the outer product of $a, b$, an $N \times N$ matrix given by:

$$(a \otimes b)_{u,v} = a(u) \cdot b(v). \tag{15}$$

For any probability distribution $\nu$ on $\mathcal{X}$, $(\nu \otimes \nu)_{u,v}$ is simply the probability of obtaining the pair $(u,v)$ when sampling independently twice from $\nu$.

For a document $d$, recall that $c_d$, defined in (5), is the token counts vector of the document and $l_d$ is the total number of tokens in $d$. Set $\hat d = c_d / l_d$ to be the empirical probability distribution on $\mathcal{X}$ corresponding to $d$.

As described in the main text, assuming the pLSA model, each document $d$ is an i.i.d. sample from some mixture of topics

$$\nu_d = \sum_t \theta_d(t)\, \mu_t. \tag{16}$$

Let us fix some mixture

$$\nu = \sum_t \theta(t)\, \mu_t. \tag{17}$$

The co-occurrence matrix for the mixture $\nu$ is by definition

$$(M_\nu)_{u,v} = \nu(u) \cdot \nu(v) = \sum_{i,j \le T} \theta(i)\theta(j)\, \mu_i(u)\mu_j(v). \tag{18}$$

Note that with our notation, we have

$$M_\nu = \nu \otimes \nu. \tag{19}$$
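As a small sanity check of (18) and (19), the following sketch builds $M_\nu$ both as the outer product $\nu \otimes \nu$ and as the double sum over topic pairs, and the two should agree. The toy topics and weights are hypothetical values chosen for illustration, and the function names are not from the paper. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def mixture(theta, topics):
    """nu(u) = sum_t theta[t] * topics[t][u], as in (17)."""
    n_tokens = len(topics[0])
    return [sum(th * mu[u] for th, mu in zip(theta, topics))
            for u in range(n_tokens)]

def cooccurrence_outer(theta, topics):
    """M_nu as the outer product nu (x) nu, as in (19)."""
    nu = mixture(theta, topics)
    return [[nu_u * nu_v for nu_v in nu] for nu_u in nu]

def cooccurrence_double_sum(theta, topics):
    """M_nu as the double sum over topic pairs (i, j), as in (18)."""
    n_tokens = len(topics[0])
    T = len(topics)
    return [[sum(theta[i] * theta[j] * topics[i][u] * topics[j][v]
                 for i in range(T) for j in range(T))
             for v in range(n_tokens)]
            for u in range(n_tokens)]

# Hypothetical toy example: 2 topics over a vocabulary of 3 tokens.
topics = [[Fraction(7, 10), Fraction(2, 10), Fraction(1, 10)],
          [Fraction(1, 10), Fraction(3, 10), Fraction(6, 10)]]
theta = [Fraction(1, 4), Fraction(3, 4)]

M_outer = cooccurrence_outer(theta, topics)
M_sum = cooccurrence_double_sum(theta, topics)
```

Both matrices are probability distributions over token pairs, so their entries sum to one.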

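Moreover, given a document $d$, note that the empirical co-occurrence matrix $\hat{M}_d$ may be written as

$$\hat{M}_d = \frac{l_d}{l_d - 1}\, \hat{d} \otimes \hat{d} - \frac{1}{l_d - 1}\, \mathrm{Diag}(\hat{d}). \tag{20}$$

Here $\mathrm{Diag}(\hat{d})$ is a diagonal matrix with diagonal entries given by the vector $\hat{d}$.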

We first compute the expectation of $\hat{d} \otimes \hat{d}$ in the following Lemma.

###### Lemma B.1.

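Let $d = (x_1, \ldots, x_{l_d})$ be an i.i.d. sample from a distribution $\nu$. Then

$$\mathbb{E}\, \hat{d} \otimes \hat{d} = \frac{l_d - 1}{l_d}\, \nu \otimes \nu + \frac{1}{l_d}\, \mathrm{Diag}(\nu). \tag{21}$$

###### Proof.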

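Consider the $(u,v)$ coordinate of the matrix $\mathbb{E}\, \hat{d} \otimes \hat{d}$.

\begin{align*}
(\mathbb{E}\, \hat{d} \otimes \hat{d})_{u,v} = \mathbb{E}\, \hat{d}(u) \cdot \hat{d}(v)
&= \frac{1}{l_d^2}\, \mathbb{E} \Big( \sum_{i=1}^{l_d} \mathbb{1}_{\{x_i = u\}} \Big) \cdot \Big( \sum_{j=1}^{l_d} \mathbb{1}_{\{x_j = v\}} \Big) \\
&= \frac{1}{l_d^2} \sum_{i,j=1}^{l_d} \mathbb{E} \big( \mathbb{1}_{\{x_i = u\}} \cdot \mathbb{1}_{\{x_j = v\}} \big).
\end{align*}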

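The result is now obtained by considering separately the cases $u = v$, $u \neq v$, $i = j$, $i \neq j$. Indeed, choose for instance fixed $u \neq v$. For $i \neq j$ we have, by independence of $x_i$ and $x_j$,

$$\mathbb{E} \big( \mathbb{1}_{\{x_i = u\}} \cdot \mathbb{1}_{\{x_j = v\}} \big) = \nu(u) \cdot \nu(v). \tag{22}$$

For $i = j$, since $u \neq v$,

$$\mathbb{E} \big( \mathbb{1}_{\{x_i = u\}} \cdot \mathbb{1}_{\{x_i = v\}} \big) = 0. \tag{23}$$

Since there are $l_d(l_d - 1)$ pairs $(i,j)$ with $i \neq j$, we thus have overall that

$$(\mathbb{E}\, \hat{d} \otimes \hat{d})_{u,v} = \frac{1}{l_d^2} \sum_{i,j=1}^{l_d} \mathbb{E} \big( \mathbb{1}_{\{x_i = u\}} \cdot \mathbb{1}_{\{x_j = v\}} \big) = \frac{l_d(l_d - 1)}{l_d^2}\, \nu(u) \cdot \nu(v) = \frac{l_d - 1}{l_d}\, \nu(u) \cdot \nu(v) = \Big( \frac{l_d - 1}{l_d}\, \nu \otimes \nu \Big)_{u,v}. \tag{24}$$

The diagonal case, $u = v$, is handled similarly. ∎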

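It follows therefore that if $d$ is a document constructed by sampling tokens from $\nu$, then $\hat{d} \otimes \hat{d}$ is not an unbiased estimate of $\nu \otimes \nu$. One can however easily fix this by subtracting the diagonal and renormalizing. Indeed, from Lemma B.1 and the fact that $\mathbb{E}\, \hat{d} = \nu$ we have

\begin{align*}
\nu \otimes \nu
&= \frac{l_d}{l_d - 1}\, \mathbb{E}\, \hat{d} \otimes \hat{d} - \frac{1}{l_d - 1}\, \mathrm{Diag}(\nu) \\
&= \frac{l_d}{l_d - 1}\, \mathbb{E}\, \hat{d} \otimes \hat{d} - \frac{1}{l_d - 1}\, \mathbb{E}\, \mathrm{Diag}(\hat{d}) \\
&= \mathbb{E} \Big( \frac{l_d}{l_d - 1}\, \hat{d} \otimes \hat{d} - \frac{1}{l_d - 1}\, \mathrm{Diag}(\hat{d}) \Big) \\
&= \mathbb{E}\, \hat{M}_d.
\end{align*}

That is, $\hat{M}_d$ is an unbiased estimate of $\nu \otimes \nu = M_\nu$.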
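The unbiasedness claim can be checked exactly on a toy case. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: it computes the estimate in (20) for a document given as a sequence of token ids, and then obtains its exact expectation by enumerating every document of a fixed length drawn i.i.d. from a hypothetical distribution `nu`. The function names and the toy distribution are assumptions made for this example. Rational arithmetic keeps the comparison with $\nu \otimes \nu$ exact.

```python
from fractions import Fraction
from itertools import product

def empirical_cooccurrence(doc, n_tokens):
    """The estimate in (20) for a sequence of token ids; needs len(doc) >= 2."""
    l = len(doc)
    counts = [0] * n_tokens
    for x in doc:
        counts[x] += 1
    d_hat = [Fraction(c, l) for c in counts]          # empirical distribution
    scale = Fraction(l, l - 1)
    M = [[scale * d_hat[u] * d_hat[v] for v in range(n_tokens)]
         for u in range(n_tokens)]
    for u in range(n_tokens):
        M[u][u] -= d_hat[u] / (l - 1)                 # subtract Diag(d_hat)/(l-1)
    return M

def expected_cooccurrence(nu, length):
    """Exact expectation of the estimate over all i.i.d. documents of this length."""
    n = len(nu)
    E = [[Fraction(0)] * n for _ in range(n)]
    for doc in product(range(n), repeat=length):
        prob = Fraction(1)
        for x in doc:
            prob *= nu[x]                             # probability of this document
        M = empirical_cooccurrence(doc, n)
        for u in range(n):
            for v in range(n):
                E[u][v] += prob * M[u][v]
    return E

# Hypothetical toy distribution over 3 tokens, documents of 4 tokens.
nu = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
E = expected_cooccurrence(nu, length=4)
target = [[a * b for b in nu] for a in nu]            # nu (x) nu
```

Entry by entry, the estimate equals $c_d(u)c_d(v)/(l_d(l_d-1))$ off the diagonal and $c_d(u)(c_d(u)-1)/(l_d(l_d-1))$ on it, so it is non-negative and sums to one for every document with at least two tokens.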

## Appendix C Proof of Proposition 4.1

###### Proof.

: Suppose . Since also have , by span-maximality it follows that there are , with such that . Now, by linear independence, there is a unique representation of . Thus . Conversely, to show , suppose . By (P) we have . Note also that . Thus .

We now show . First assume . Let be the anchor words. Choose some . Then by definition of the anchor words, . Since , and this implies . For the converse, we show . Assume . This means that there is a topic such that for every token for which , there is another topic, , such that . In particular, set . It follows then that is absolutely continuous with respect to . That is, for every for which , we have . Using this property, it is clear that for a small enough , we have for all . Thus . However, this expression has a strictly negative coefficient at , thus contradicting .

Finally, we show and .

:
Assume . We show . Since and since are linearly independent, we have . By , this implies . Since any polytope defines its extreme points uniquely, we obtain the conclusion of .

For , we equivalently show . Assume . Consider the distribution constructed earlier. Clearly we may write with , and , strictly positive. It follows that , thus implying . ∎

## Appendix D Proof of Theorem 4.2

In this section we use the notation introduced in Section B.

###### Proof.

It follows from Lemma B.1 and (B) that conditioned on , we have . Therefore by definition,

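$$\mathbb{E}\, \hat{M}_d = \mathbb{E} \sum_{i,j=1}^{T} \theta_d(i)\theta_d(j)\, \mu_i \otimes \mu_j = \sum_{i,j=1}^{T} \Theta_{i,j}\, \mu_i(u)\mu_j(v) = M. \tag{26}$$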

Since are bounded independent variables in , and , by the Law of Large Numbers we have with probability , which in turn implies (12).

Conversely, let

$$M' = \sum_{i,j} \beta_{i,j}\, \mu'_i \otimes \mu'_j \tag{27}$$

be an FDM based on the topics . Set and . Clearly . On the other hand, since has full rank, , we have and . Next, by Lemma 4.1, topics with () property are identified by their span. Since , this implies . Set

$$\gamma' := \inf_{M' \ \text{s.t.}\ \mathrm{Im}(M') \subseteq V'} \big\| M - M' \big\|_{op} > 0, \tag{28}$$

where the norm is (say) the operator norm. The fact that $\gamma' > 0$ is crucial and follows, for instance, from the Davis-Kahan “$\sin\Theta$” Perturbation Theorem, [5] (Section VII.3). Indeed, assume to the contrary that $\gamma' = 0$. This would imply that there is a sequence , with , converging to . By the Davis-Kahan Theorem, the eigenspaces of must converge to those of . However, since and since (and hence , see [5]), this is impossible.

It remains to observe that since all finite dimensional norms are equivalent, implies where is the coordinatewise norm. Next, Pinsker’s Inequality, [8], yields a lower bound on the KL-divergence,

$$0 < \gamma'' \le \big\| M - M' \big\|_1 \le \sqrt{2\, D_{KL}(M | M')}, \tag{29}$$

where

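\begin{align}
D_{KL}(M | M') &= \sum_{u,v} M_{u,v} \log \frac{M_{u,v}}{M'_{u,v}} \tag{30} \\
&= \sum_{u,v} M_{u,v} \log M_{u,v} - \sum_{u,v} M_{u,v} \log M'_{u,v}. \tag{31}
\end{align}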

To summarize,

$$\gamma''' := \inf_{M' \ \text{s.t.}\ \mathrm{Im}(M') \subseteq V'} D_{KL}(M | M') > 0. \tag{32}$$

Finally, we obtain

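\begin{align}
\lim_{D \to \infty} \sum_{u,v} \hat{M}_{u,v} \log (M'_{u,v}) &= \sum_{u,v} M_{u,v} \log (M'_{u,v}) \tag{33} \\
&\le \sum_{u,v} M_{u,v} \log (M_{u,v}) - \gamma''', \tag{34}
\end{align}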

where (33) follows from the first part of the proof and (34) follows from (32) and (31). ∎

## Appendix E Holdout Likelihood Computation

For a document $d$ given as a sequence of tokens $x_1, \ldots, x_{l_d}$, where $l_d$ is the total number of tokens in $d$, recall from the main text that we denote by $c_d$ the count vector of $d$,

$$c_d(u) = \#\{ x_i \mid x_i = u \} \quad \text{for } u \in X. \tag{35}$$

The empirical distribution of $d$, denoted by $\hat{d}$, is the normalized count vector, $\hat{d} = c_d / l_d$.

Given the topics returned by the model, for every document $d$ we first compute the topic assignment $\theta_d$ as follows:

\begin{align}
\theta_d &= \operatorname*{argmin}_{\theta} D_{KL}(\hat{d}, m(\theta)) \tag{36} \\
&= \operatorname*{argmax}_{\theta} \sum_{u \in X} \hat{d}(u) \log \{ m(\theta)(u) \} \tag{37} \\
&= \operatorname*{argmax}_{\theta} \frac{1}{l_d} \sum_{x_i \in d} \log \{ m(\theta)(x_i) \}, \tag{38}
\end{align}

where $m(\theta)$ is the mixture generated by the topics and the assignment $\theta$, $m(\theta) = \sum_t \theta(t)\mu_t$, and $D_{KL}$ is the Kullback-Leibler divergence. Thus $\theta_d$ is the assignment such that the mixture $m(\theta_d)$ best approximates the document in KL divergence. Equivalently, $\theta_d$ is the assignment such that the mixture $m(\theta_d)$ gives the highest likelihood to the document $d$. This is the standard definition of likelihood for pLSA models.

Note that (36) is a convex problem in $\theta$, and can be solved efficiently and in parallel over the documents. Solving (36) is a standard step in most pLSA and Non-Negative Matrix Factorization methods, and existing efficient implementations may be used. See for instance [24].
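One common way to solve a problem of the form (36) with the topics held fixed is the EM (multiplicative) update for mixture weights. The sketch below illustrates that standard technique. It is not necessarily the solver used in the experiments or in [24], and the function names, iteration count, and toy topics are assumptions made for this example. It assumes every token that occurs in the document has positive probability under at least one topic.

```python
import math

def mixture(theta, topics):
    """m(theta)(u) = sum_t theta[t] * topics[t][u]."""
    return [sum(th * mu[u] for th, mu in zip(theta, topics))
            for u in range(len(topics[0]))]

def objective(d_hat, theta, topics):
    """The quantity maximized in (37): sum_u d_hat(u) * log m(theta)(u)."""
    m = mixture(theta, topics)
    return sum(p * math.log(m[u]) for u, p in enumerate(d_hat) if p > 0)

def fit_assignment(d_hat, topics, n_iter=200):
    """EM updates for theta with the topics held fixed.

    Assumes every token with d_hat(u) > 0 has positive probability
    under at least one topic, so that m(theta)(u) > 0 below.
    """
    T = len(topics)
    theta = [1.0 / T] * T                      # uniform start on the simplex
    for _ in range(n_iter):
        m = mixture(theta, topics)
        # theta(t) <- theta(t) * sum_u d_hat(u) * mu_t(u) / m(theta)(u)
        theta = [theta[t] * sum(p * topics[t][u] / m[u]
                                for u, p in enumerate(d_hat) if p > 0)
                 for t in range(T)]
    return theta

# Hypothetical toy example: the document distribution is an exact mixture
# of two topics, so the fitted assignment should recover the true weights.
topics = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6]]
true_theta = [0.25, 0.75]
d_hat = mixture(true_theta, topics)
theta_d = fit_assignment(d_hat, topics)
```

Each update keeps $\theta$ on the probability simplex and does not decrease the objective in (37). Since the problem is separable across documents, the same routine can be applied to each hold-out document independently.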

Next, given we compute the document likelihood as and take the overall likelihood of the holdout set of documents to be the average of