Abstract
The question of how to determine the number of independent latent factors, or topics, in Latent Dirichlet Allocation (LDA) is of great practical importance. In most applications, the exact number of topics is unknown, and depends on the application and the size of the data set. We introduce a spectral model selection procedure for topic number estimation that does not require learning the model’s latent parameters beforehand and comes with probabilistic guarantees. The procedure is motivated by the spectral learning approach and relies on adaptations of results from random matrix theory. In a simulation experiment taken from the nonparametric Bayesian literature, this procedure outperforms the nonparametric Bayesian approach in both accuracy and speed. We also discuss some implications of our results for the sample complexity and accuracy of popular spectral learning algorithms for LDA. The principles underlying the procedure can be extended to spectral learning algorithms for other exchangeable mixture models with similar conditional independence properties.
Fast, Guaranteed Spectral Model Selection for Topic Models
The question of how to determine the model order (that is, the number of independent latent factors) in mixture models such as Latent Dirichlet Allocation (Blei et al., 2003) is of great practical importance. These models are widely used for tasks ranging from bioinformatics to computer vision to natural language processing. Finding the least number of latent factors that explains the data improves predictive performance and increases computational and storage efficiency. In most applications, the exact number of latent factors (also known as topics or components) is unknown: model order often depends on the application and increases as the data set grows. For a fixed training set, the user can subjectively fine-tune the number of topics or optimize it according to objective measures of fit along with the other parameters of the model, but this is a time-consuming process, and it is not intuitively clear how to increase the number of topics as new data points are encountered without an additional round of fine-tuning. Moreover, spectral learning procedures can fail if the number of latent factors is underestimated.
In this paper, we present a simple and efficient procedure that estimates model order from the spectral characteristics of the sample cross-correlation matrix of the observed data. We focus on LDA in this paper in order to illustrate our approach, but our principles can be extended to other mixture models with similar conditional independence properties. Unlike previous approaches to model order selection, the resulting procedure comes with probabilistic guarantees and does not require computationally expensive learning of the hidden parameters of the model in order to return an estimate of the model order. The estimate can be further refined by running a spectral learning procedure that does learn the parameters.
Our approach relies on the assumption that the parameter vectors that characterize each of the topics are randomly distributed. We show that with high probability, the least singular value of the random matrix resulting from collecting these parameter vectors will be well-bounded. Roughly speaking, randomly distributed topics are unlikely to be strongly correlated with each other. We show that as a result, the approximate number of latent factors can be predictably recovered from the spectral characteristics of the observable first and second moments of the data.
For LDA, the requisite moments can be efficiently computed from the sufficient statistics of the model, namely the term-document co-occurrence matrix. The usefulness of our procedure is illustrated by the following proposition for the usual case where the number of topics and the vocabulary size (or dimensionality) are such that (though we also present results for the more general case in this paper):
Proposition 1.1.
Suppose we have an LDA topic model over a vocabulary of size with concentration parameter , and we wish to determine how many nonzero topics there are in the corpus. Suppose and almost surely. Then, for large enough, if we gather independent samples as in Lemma 3.9, we can recover the number of topics whose expected proportion is greater than , with probability greater than .
The results which allow us to prove this guarantee also provide new insights on sample complexity bounds for spectral learning of mixture models, in particular excess correlation analysis (ECA) (Anandkumar et al., 2012b). These spectral algorithms have garnered attention partly because they offer better scalability to large data sets than MCMC methods, and partly because they provide probabilistic guarantees on sample complexity that are elusive for MCMC methods. However, sample complexity results in previous literature bound the estimation error and sample complexity of learning the latent parameter matrix in terms of itself: given that in practice is unknown beforehand, this is of limited practical utility for assessing the confidence of the estimate. In contrast, our results allow sample complexity to be expressed directly in terms of the known quantity :
Proposition 1.2.
Suppose we have an LDA topic model over a vocabulary of size . Suppose the number of topics is fixed, and the variance of the entries of the latent word-topic matrix is fixed and finite. Then, for large enough, if we gather independent samples, we can recover the parameter matrix with error less than , with probability greater than .
Taken together, these two results increase the usefulness of spectral algorithms for mixture models by allowing the number of topics to be set in a datadriven manner, and by providing more explicit sample complexity guarantees, giving the user a better idea of the quality of the learned parameters. This brings spectral methods closer to providing a guaranteed and computationally efficient alternative for nonparametric Bayesian models.
Nonparametric Bayesian models such as the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006) have been useful in addressing the problem of model order estimation. These models allow a distribution over an infinite number of topics. When HDP is fitted using a Markov chain Monte Carlo (MCMC) sampling algorithm to optimize posterior likelihood, new topics are sampled as necessitated by the data. However, training a nonparametric model using MCMC can be impractically slow for the large sample sizes likely to be encountered in many real-world applications. As is common for MCMC methods, the Gibbs sampler for HDP is susceptible to local optima (Griffiths et al., 2004). Indeed, maximum likelihood estimation of topic models in general has been shown to be NP-hard (Arora et al., 2012).
Another class of methods is based on learning models for a finite range of topic numbers, and then optimizing some function of likelihood or performing a likelihood-based hypothesis test over this range (e.g., the Bayes factors method, or optimization of the Bayesian Information Criterion, Akaike Information Criterion, or perplexity). Not only do these methods suffer from the same susceptibility to local optima, but they are even more computationally intensive than nonparametric methods when the range of model orders under consideration is large: the latent parameters of the model must be learned for every candidate model order in order to compute the likelihoods, so computational cost grows linearly with the size of the range, which must be prespecified by the user. In addition, these methods have been outperformed by nonparametric methods in experimental settings (Griffiths et al., 2004).
On the other hand, spectral learning methods (Arora et al., 2012; Anandkumar et al., 2012b) have been shown to provide asymptotic guarantees of exact recovery and to be computationally efficient. However, these techniques require specifying the number of latent factors beforehand, and in some cases produce highly unstable results when the number of latent factors is underestimated (Kulesza et al., ). Therefore, a guaranteed procedure for estimating the true number of latent factors should increase the practicality of these methods for learning probabilistic models.
We will first provide a brief overview of the assumptions of the LDA generative model and discuss how our method is motivated by the spectral learning approach. In the following section, we adapt nonasymptotic results concerning the singular values of random matrices to this setting. Practitioners interested in implementing our model order estimation method can consult the final sections, where we describe our procedure for finding the number of topics, demonstrate that our method outperforms a nonparametric Bayesian method in an experimental setting taken from the literature, and discuss some other implications of our results for the accuracy of spectral learning algorithms.
For a vector , is the Euclidean norm and is the Euclidean distance between and a subspace . For a matrix , is the Moore-Penrose pseudoinverse; is the largest singular value, is the largest eigenvalue; and is the operator norm. "a.s." abbreviates "almost surely," and "w.p." abbreviates "with probability."
Latent Dirichlet Allocation (Blei et al., 2003) is a generative mixture model widely used in topic modeling. This model assumes that the data comprise a corpus of documents. In turn, each document is made up of discrete, observed word tokens. The observed word tokens are assumed to be generated from latent topics as follows:
In the generative process above, the concentration parameter can be seen as controlling how fine-grained the topics are; the smaller the value of , the more distinguishable the topics are from each other. The relative magnitude of each represents the expected proportion of word tokens in the corpus assigned to topic . The concentration parameter governs how topically distinct documents are (in the limit , we have a model where each document has a single topic rather than a mix of topics (Anandkumar et al., 2012a)).
For a large class of mixture models including LDA and Gaussian Mixture Models, the observed data can be represented as a sequence of exchangeable vectors that are conditionally independent given a latent factor vector , which is assumed to be strictly positive. For instance, in an LDA model each data point (word token) can be represented as a canonical basis vector of dimensionality , where is the vocabulary size (number of distinct terms). The th element of is equal to 1 if the word token that it represents is observed to belong to class , and 0 otherwise. For LDA, determines the mixture of topics present in a particular document. Therefore is a vector whose support size is a.s. equal to the number of nonzero topics (the model order).
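To make the encoding concrete, here is a minimal sketch in Python (the vocabulary size and term indices are illustrative assumptions, not values from the paper):

```python
import numpy as np

d = 5   # illustrative vocabulary size

def one_hot(term_index, d):
    """Represent a word token as a canonical basis vector of R^d."""
    x = np.zeros(d)
    x[term_index] = 1.0
    return x

# A three-token document over term indices 2, 0, 2:
doc = [one_hot(t, d) for t in (2, 0, 2)]

# Averaging the token vectors recovers the empirical term frequencies.
freq = np.mean(doc, axis=0)
print(freq)   # -> [1/3, 0, 2/3, 0, 0]
```

This one-hot form is what makes the moment expressions below linear in the latent parameters.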
Although the sufficient statistics of LDA can be represented in other, more succinct ways, this representation turns out to be more than a curiosity. To see why, observe that under this representation the conditional expectation of the observed data generated by the model can be represented as a linear combination of some latent matrix (known in LDA as the word-topic matrix) and the latent membership vector :
For these mixture models, the principal learning problem is to estimate efficiently and accurately. Using the equation above and the conditional independence of any three distinct observed vectors given in the LDA model, we can derive equations for the expectations of the moments of the observed data in terms of . In particular, the expected first moment, which is the vector of the expected probability masses of the terms in the vocabulary, can be written as
(1)  $\mathbb{E}[x_1] = A\,\mathbb{E}[h]$, where $A$ denotes the word-topic matrix and $h$ the latent membership vector,
and the expected second moment, which is the matrix of the expected cross-correlations between any two terms in the vocabulary, can be written as
(2)  $\mathbb{E}[x_1 x_2^\top] = A\,\mathbb{E}[h h^\top]\,A^\top$, where $x_1, x_2$ are any two distinct tokens, $A$ is the word-topic matrix, and $h$ is the membership vector.
Analogous expressions for even higher moments can be expressed using tensors. In fact, Anandkumar et al. (2012b) were able to develop fast spectral algorithms for learning the hidden parameters of mixture models from the second- and third-order moments of the data by taking advantage of this relationship. The resulting algorithm, excess correlation analysis (ECA), comes with probabilistic guarantees of finding the optimal solution, unlike MCMC approaches. In the case of LDA, the only user-specified inputs to the ECA spectral algorithm are the supposed number of topics and the concentration parameter governing the distribution of the membership vector . The matrix is treated as fixed but unknown.
Note that Eqs. (1) and (2) demonstrate an explicit linear-algebraic relationship between the latent parameter matrix , the expected moments of the data and , and the expected moments of . In fact, for LDA, is the vector that specifies the expected proportion of data points assigned to each topic across the entire data set; roughly speaking, if , we expect about half of the word tokens in the data set to belong to topic . Therefore, the model order is the number of nonzero topics (i.e., the size of the support) of . In the case of LDA, some elementary computation (cf. (Anandkumar et al., 2012a), Thm. 4.3) demonstrates that can be written as a product of , , and as follows:
(3)  $\mathbb{E}[h h^\top] = A^{+}\,\mathbb{E}[x_1 x_2^\top]\,\big(A^{+}\big)^\top$,
where is the Moore-Penrose pseudoinverse of and . This suggests that , and therefore the number of nonzero topics, can be recovered by first learning and then estimating according to Eq. 3. The true number of topics is then equal to the number of such that . However, the number of latent factors must be specified beforehand in ECA, since the algorithm involves a truncated matrix decomposition and a truncated tensor decomposition. For low-dimensional data sets, it is possible to do this by setting . However, the time complexity of ECA scales as and the space complexity scales as due to the storage and decomposition of the third-moment tensor (Anandkumar et al., 2012c), so this approach is not tractable for even moderately sized data sets. On the other hand, it is not possible to determine with any certainty whether we have captured all the nonzero topics if we set when is unknown. This is because when , ECA learns highly unstable estimates of , which results in incorrect estimates of . For instance, consider the following toy example: set . Set .
If we try to recover the first two values of from the moments by running ECA with , we get . In a practical setting where a finite number of noisy samples are used to estimate the moments, one might conclude that is noise and that there is only one topic in this model. Similar parameter recovery problems arise when using low-rank approximations in spectral algorithms for other models (see (Kulesza et al., ) for some Hidden Markov Model examples). Thus, iterative methods where is increased or decreased until for some are not guaranteed to provide the correct result.
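To make the role of Eq. (3) concrete, here is a small numerical sketch (the dimensions, topic weights, and the simplified single-topic moment form are our own illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 50, 3                               # illustrative vocabulary and topic sizes
A = rng.dirichlet(np.ones(d), size=k).T    # d x k word-topic matrix, columns on the simplex
alpha = np.array([0.5, 0.3, 0.2])          # expected topic proportions

# Population second moment in the single-topic-per-document limit,
# a simplified stand-in for Eq. (2): M2 = A diag(alpha) A^T.
M2 = A @ np.diag(alpha) @ A.T

# With exact moments, the number of nonzero topics is the rank of M2,
# so truncating below the true rank discards genuine topics.
assert np.linalg.matrix_rank(M2) == k
s = np.linalg.svd(M2, compute_uv=False)
print(s[:4])   # three clearly nonzero values, then a numerical zero
```

The gap between the third and fourth singular values is what a selection procedure can exploit; with a truncated rank-2 decomposition that gap is invisible.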
We suggest a novel approach in this paper, based on singular value bounds. Observe that taking the singular values of both sides of Eq. 3 yields:
(4)  $\sigma_{\min}(A)^2\,\sigma_k\big(\mathbb{E}[h h^\top]\big) \;\le\; \sigma_k\big(\mathbb{E}[x_1 x_2^\top]\big) \;\le\; \sigma_{\max}(A)^2\,\sigma_k\big(\mathbb{E}[h h^\top]\big)$.
Thus, rather than learning , we need only find some reasonably sharp bound on the least singular value of . If we treat the matrix as a random matrix (as in standard Bayesian approaches to LDA) and place an approximate bound on the variance of its entries, then has very predictable spectral characteristics for reasonably large . To prove this, we must adapt some recent results from random matrix theory, where finding the least singular values of random matrices is often referred to as resolving the so-called "hard edge" of the spectrum. While most work on the hard edge has focused on settings where the matrices are square and all entries are i.i.d. with mean zero, these conditions do not hold for in Dirichlet mixture models such as LDA. We use some elementary facts about Dirichlet random variables to adapt the known results to the matrices of interest in our setting.
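As a quick sanity check on this claim, one can simulate Dirichlet random matrices and inspect the hard edge directly (the dimensions and trial count below are arbitrary choices of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def least_singular_values(d, k, a=1.0, trials=200):
    """Smallest singular value of d x k matrices whose columns are
    i.i.d. symmetric Dirichlet(a) vectors, over repeated trials."""
    out = []
    for _ in range(trials):
        A = rng.dirichlet(np.full(d, a), size=k).T   # d x k
        out.append(np.linalg.svd(A, compute_uv=False)[-1])
    return np.array(out)

s = least_singular_values(d=200, k=20)
# Across trials, the hard edge stays bounded away from zero.
print(s.min(), s.mean())
```

In such simulations the smallest singular value concentrates tightly, which is exactly the behavior the theorems below quantify.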
Note that and are not precisely known either, but it is relatively straightforward to derive estimators for them from the observed data. These estimators can be proven to be reasonably accurate via application of standard tail bounds for the eigenvalues and singular values of random matrices.
Thus, we can show that the observed moments of the data contain enough information to reveal the number of underlying topics to arbitrary accuracy with high probability, given enough samples. The principles behind our results can be extended to any exchangeable mixture models that can be represented as in Eqs. (1) and (2), though we will work with the LDA model to make our analysis concrete.
We place some further conditions on the LDA model that allow well-behaved spectral properties. These conditions are generally equivalent to those for ECA (Anandkumar et al., 2012a), with the exception of our assumptions on :

1. The matrix is of full rank. Note that this condition follows a.s. from the generative process described above (Chafaï, 2010).

2. The concentration parameters and are approximately known. Intuitively, as increases, the topics are less distinguishable from each other. Note that varying this assumption only affects our model by increasing the number of samples required to learn the number of topics within a certain level of accuracy. For simplicity of presentation, our derivations below assume that the entries for all . This is known as a symmetric Dirichlet prior and is equivalent to a uniform distribution on the simplex (Bordenave et al., 2012). Setting a symmetric prior on is standard procedure in most applications of Dirichlet mixture models; for an empirical justification of this practice, see (Wallach et al., 2009).

3. In the worst case, the number of topics is equal to the size of the vocabulary, and a.s. In most applications of Dirichlet topic models, the number of topics is in the tens or hundreds, and the size of the vocabulary is in the hundreds or thousands.
Under the assumptions and generative model above, we attempt to recover the number of topics within a margin of error defined by the expected probability mass of the topics, as follows:
Definition 2.1.
A topic is relevant iff the expected proportion of data points in the corpus belonging to the topic exceeds . That is, a topic is relevant iff .
Our procedure, as described below, is guaranteed to find at least all relevant topics, with low probability of detecting topics to which no words are assigned in the corpus. As long as , converges to 0 as the number of samples increases. For a fixed number of samples and a fixed failure probability , the relevance threshold for recovered topics increases when we wish to recover less distinguishable topics (i.e., as increases).
In this section we provide tail bounds on the smallest singular values of rectangular Dirichlet random matrices. Similar results can be derived for other Markov random matrices. These bounds closely mirror the work of Tao & Vu (2008; 2009) and Rudelson & Vershynin (2009), and depend on probabilistic bounds on the distance between any given random vector corresponding to a column of a random matrix and the subspace spanned by the vectors corresponding to the rest of the columns. The estimation of these distances is much simplified for random vectors with independent entries, but for a Dirichlet random matrix, the entries in each column are dependent, as they must sum to one. Fortunately, Dirichlet random vectors are related to vectors with independent entries in an elementary way.
Fact 3.1.
Define a random vector $g = (g_1, \ldots, g_d)$ with independent entries $g_i \sim \operatorname{Gamma}(a_i, 1)$ for some $a_i > 0$, for all $i$. Then the normalized vector $g / \sum_{i=1}^{d} g_i$ is distributed as $\operatorname{Dir}(a_1, \ldots, a_d)$.
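Fact 3.1 is easy to check by simulation; the following sketch (sample size and parameter values are our own choices) compares the empirical moments of normalized Gamma vectors against the known symmetric-Dirichlet moments:

```python
import numpy as np

rng = np.random.default_rng(2)

a, d, n = 0.7, 10, 100_000       # Dirichlet parameter, dimension, sample size
# Independent Gamma(a, 1) entries, each vector normalized to sum to one.
G = rng.gamma(shape=a, scale=1.0, size=(n, d))
X = G / G.sum(axis=1, keepdims=True)

# The normalized vectors lie on the simplex ...
assert np.allclose(X.sum(axis=1), 1.0)

# ... and their marginal moments match a symmetric Dirichlet(a):
# E[x_i] = 1/d and Var[x_i] = (d - 1) / (d^2 (d a + 1)).
assert np.allclose(X.mean(axis=0), 1.0 / d, atol=5e-3)
assert np.allclose(X.var(axis=0), (d - 1) / (d**2 * (d * a + 1)), atol=5e-4)
```

This reduction is what lets the independent-entry machinery of the next results be applied to Dirichlet columns.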
Corollary 3.2.
For any Dirichlet random matrix with i.i.d. columns, and for the corresponding Gamma random matrix with independent entries Gamma, we have that
The following singular value bound for square matrices follows from Tao & Vu (2009) Corollary 4:
Theorem 3.3.
Suppose is an random matrix with independent, identically distributed entries with variance 1, mean , and bounded fourth moment. For any there exist positive constants that depend only on such that
Though this bound also applies to rectangular matrices (i.e., cases where the number of topics grows more slowly than ) by the Cauchy Interlacing Theorem for singular values (cf. (Horn & Johnson, 1990)), it is not sharp when . The following result follows from adapting the arguments in Tao et al. (2010), Section 8:
Theorem 3.4.
Let be positive integers. Suppose is an Gamma random matrix with independent, identically distributed entries with variance 1, mean . Then for every there exist and that depend only on the moments of such that, for all .
In order to prove Theorem 3.4, we need two results, presented here without proof:
Proposition 3.5.
(Distance Tail Bound; (Rudelson & Vershynin, 2009) Thm. 4.1). Let be a vector in whose coordinates are independent and identically distributed random variables with unit variance and bounded fourth moment. Let be a random subspace in spanned by vectors whose coordinates are independent and identically distributed random variables with bounded fourth moment and unit variance, independent of . Then for every there exist constants that depend only on the moments such that
Remark 3.6.
Although this result is stated for mean-zero random variables, it can be extended to noncentered random variables; see, e.g., Tao et al. (2010), Prop. 5.1 for discussion.
Lemma 3.7.
(Negative Second Moment; (Tao et al., 2010) Lemma A.4). Let and let be a full rank matrix with columns . For each , let be the hyperplane generated by the remaining columns of . Then
Now we prove Theorem 3.4.
Proof.
Let be a random matrix as above. By Lemma 3.7, we have that
(5)  $\displaystyle \sum_{i=1}^{k} \sigma_i^{-2} \;=\; \sum_{i=1}^{k} \operatorname{dist}(X_i, W_i)^{-2}$, where $\sigma_1 \ge \cdots \ge \sigma_k$ are the singular values of the matrix, $X_i$ are its columns, and $W_i$ is the span of the remaining columns.
By Proposition 3.5 and the union bound, w.p. we have for all . Thus, with this probability, the right-hand side of Eq. (5) is less than . On the other hand, as the are ordered decreasingly, the left-hand side of Eq. (5) is at least . It follows that, w.p. ,
thus completing the proof. ∎
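The negative second moment identity of Lemma 3.7, on which this proof rests, is easy to verify numerically; a sketch with an arbitrary Gaussian test matrix (our own choice of dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)

d, n = 8, 5
M = rng.standard_normal((d, n))   # full column rank almost surely

# Left-hand side: sum of inverse squared singular values.
s = np.linalg.svd(M, compute_uv=False)
lhs = np.sum(1.0 / s**2)

# Right-hand side: inverse squared distances from each column to the
# span of the remaining columns.
rhs = 0.0
for i in range(n):
    Xi = M[:, i]
    W = np.delete(M, i, axis=1)
    coef, *_ = np.linalg.lstsq(W, Xi, rcond=None)   # project Xi onto span(W)
    rhs += 1.0 / np.linalg.norm(Xi - W @ coef) ** 2

assert np.allclose(lhs, rhs)   # the identity holds exactly
```

The identity is exact (not an inequality), which is why bounding the column-to-subspace distances immediately controls the least singular value.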
Now we are ready to derive a singular value bound for Dirichlet random matrices.
Theorem 3.8.
Let be a random matrix whose columns are independent, identically distributed random vectors drawn from a symmetric Dirichlet distribution with parameter vector and concentration parameter . Then for every there exist and that depend only on the moments such that, for all :
Proof.
Recall that for a symmetric Dirichlet distribution with concentration parameter , each entry of the dimensional vector drawn from this distribution has mean . Fix . Observe that has variance 1 and mean for all . Therefore, by Corollary 3.2, it follows that for any ,
We can control the sum in the denominator on the right-hand side above using standard concentration-of-measure results. For instance, by Chebyshev's inequality and the mutual independence of the columns of , for any ,
(6) 
We can make the second term on the right-hand side arbitrarily small by increasing , and for a fixed we can make the first term on the right-hand side arbitrarily small by decreasing for large enough. Therefore, we can find a for any such that, for all large enough,
∎
From the theorem above, we can deduce that, with high probability,
(7)  
(8) 
We are able to bound the error in estimating from a sample thanks to concentration lemmas for singular values that are analogous to better-known concentration inequalities for scalar random variables (e.g., Markov's inequality).
Lemma 3.9.
(TO BE FINISHED)
Although we are unable to compute the estimator without knowledge of , we can use Theorem 3.8 to provide an upper bound for the elements .
Define . We can now apply Theorem 3.8 to the expression above to infer that there is a constant such that, for large enough,
(9) 
with high probability that depends on (the constant can be chosen so that the failure probability is negligible; for most applications, we recommend . We computed the least singular value for randomly generated Dirichlet random matrices with and ; all of these matrices were dominated by ).
This suggests the following procedure to estimate the number of topics:
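The algorithm box itself is not reproduced in this extraction; the following is only a loose sketch of the selection step the text describes (counting singular values of the estimated cross-correlation matrix against the bound of Eq. (9)), with a placeholder threshold and idealized exact moments of our own choosing:

```python
import numpy as np

def count_topics(M2_hat, tau):
    """Count singular values of an (estimated) term cross-correlation
    matrix that exceed the threshold tau; tau stands in for the
    probabilistic bound of Eq. (9), whose constants we do not reproduce."""
    s = np.linalg.svd(M2_hat, compute_uv=False)
    return int(np.sum(s > tau))

# Idealized check with exact population moments: with M2 = A diag(w) A^T,
# each nonzero topic contributes one singular value above a tiny threshold.
rng = np.random.default_rng(4)
d, k = 100, 3
A = rng.dirichlet(np.ones(d), size=k).T   # d x k word-topic matrix
w = np.full(k, 1.0 / k)                   # expected topic proportions
M2 = A @ np.diag(w) @ A.T
print(count_topics(M2, tau=1e-6))         # -> 3
```

In practice the moment matrix is estimated from finite samples and the threshold must come from the concentration bounds above rather than being hand-picked.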
To compare the performance of our procedure against previous model order estimation methods, we replicated the experimental setting used to demonstrate the model selection capabilities of hLDA (Griffiths et al., 2004). hLDA (a Gibbs sampling method for the nonparametric equivalent of LDA using the Chinese restaurant process prior) was shown to be much faster and more accurate than the Bayes factors method (a likelihood-based hypothesis-testing method) in this setting. 210 corpora of 1000 10-word documents each were generated from an LDA model with , a vocabulary size of 100, and a word-topic matrix with columns randomly generated from a symmetric Dirichlet ( for , so ) and .
hLDA requires the input of a concentration parameter that controls how frequently a new topic is introduced, so the authors set . Since Gibbs sampling is subject to local maxima, the sampler was randomly restarted 25 times for each corpus. Each time, the sampler was burned in for 10000 iterations, and samples were subsequently taken 100 iterations apart for another 1000 iterations. The restart with the highest average likelihood over the post-burn-in period was selected, and the number of topics with nonzero word assignments throughout the burn-in period for that restart was taken as the hLDA prediction of model order. We used the Java implementation of the hLDA Gibbs sampler by Bleier (2010).
For our spectral model selection procedure, we set the topic relevance sensitivity threshold at , which corresponds to an expected error rate of . We implemented our procedure using the Matlab standard library. Both methods are somewhat sensitive to and , so we set these parameters to the ground truth for both methods, just as in (Griffiths et al., 2004).
Figure 1 shows that our procedure outperforms hLDA in this experimental setting (points are jittered slightly to reveal overlapping points). Our procedure correctly estimated the model order for all 210 corpora without the optional ECA step, whereas hLDA erred on 10 of the 210 corpora. (Griffiths et al., 2004) reported an error rate of 15 out of 210 for hLDA in this experimental setting, and an error rate of 80 out of 210 for the Bayes factors method.
The running time of the hLDA Gibbs sampling procedure was sec per corpus on a single thread of a machine with an eight-core 2.67 GHz CPU, while the running time of the spectral model selection procedure without the ECA step was 0.252 sec per corpus. However, hLDA learns the latent matrix while estimating the model order. Including the ECA step in our spectral model selection procedure to learn , the running time increases to 2.05 sec per corpus.
The learnability and sample complexity of spectral algorithms for mixture models depend crucially on the latent variable matrix being well-conditioned. For instance, the algorithm of Anandkumar et al. (2012a) for learning LDA comes with the following guarantee:
Theorem 4.1.
((Anandkumar et al., 2012a) Thm 5.1). Fix . Let and let denote the smallest (nonzero) singular value of . Suppose that we obtain independent samples of in the LDA model. w.p. greater than , the following holds: for sampled uniformly over the sphere , w.p. greater than 3/4, Algorithm 5 in (Anandkumar et al., 2012a) returns a set such that there exists a permutation of the columns of so that for all
Theorem 3.8 allows us to replace the dependence on by a dependence on , , and :
Corollary 4.2.
Let , , , , , and be as in Theorem 4.1. Suppose that we obtain independent samples of in the LDA model. w.p. greater than ,
Proposition 1.2 follows from assuming that the variance parameter of each entry remains constant as increases (so that ), and from assuming that is fixed, so that
In this paper, we have derived a novel procedure for determining the number of latent topics in Latent Dirichlet Allocation. Our experiments suggest that this procedure can outperform nonparametric Bayesian models learned using MCMC.
Our results rely on adapting random-matrix-theoretic results to the case of rectangular noncentered matrices, and on connecting these results to the spectral properties of the moments of data generated by an LDA model. Similar random-matrix-theoretic results should be applicable to the problem of finding the number of latent factors in many other mixture models with similar conditional independence properties, and we plan to present such results in future work.
References
 Anandkumar et al. (2012a) Anandkumar, A., Foster, D.P., Hsu, D., Kakade, S.M., and Liu, Y.K. Two SVDs suffice: spectral decompositions for probabilistic topic models and latent Dirichlet allocation. arXiv preprint arXiv:1204.6703, 2012a.
 Anandkumar et al. (2012b) Anandkumar, A., Hsu, D., and Kakade, S.M. A method of moments for mixture models and hidden Markov models. JMLR: Workshop & Conference Proc., 23:33.1–33.34, 2012b.
 Anandkumar et al. (2012c) Anandkumar, Anima, Ge, Rong, Hsu, Daniel, Kakade, Sham M, and Telgarsky, Matus. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012c.
 Arora et al. (2012) Arora, S., Ge, R., and Moitra, A. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 2012.
 Blei et al. (2003) Blei, D.M., Ng, A.Y., and Jordan, M.I. Latent Dirichlet allocation. J. Machine Learning Research, 3:993–1022, 2003.
 Bleier (2010) Bleier, A. Java Gibbs sampler for the HDP, 2010. https://github.com/arnim/HDP.
 Bordenave et al. (2012) Bordenave, C., Caputo, P., and Chafaï, D. Circular law theorem for random Markov matrices. Prob. Theory & Related Fields, 152:751–779, 2012.
 Chafaï (2010) Chafaï, D. The Dirichlet Markov ensemble. Journal of Multivariate Analysis, 101(3):555–567, 2010.
 Griffiths et al. (2004) Griffiths, T.L., Jordan, M.I., Tenenbaum, J.B., and Blei, D.M. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106–114, 2004.
 Horn & Johnson (1990) Horn, R.A. and Johnson, C.R. Matrix Analysis. Cambridge University Press, 1990.
 (11) Kulesza, A., Rao, N.R., and Singh, S. An exploration of low-rank spectral learning.
 Rudelson & Vershynin (2009) Rudelson, M. and Vershynin, R. The smallest singular value of a random rectangular matrix. Commun. Pure Appl. Math, 62:1707–1739, 2009.
 Tao & Vu (2008) Tao, T. and Vu, V. Random matrices: the circular law. Comm. in Contemp. Math., 10(2):261–307, 2008.
 Tao & Vu (2009) Tao, T. and Vu, V. Smooth analysis of the condition number and the least singular value. arXiv preprint arXiv:0805.3167v2, 2009.
 Tao et al. (2010) Tao, T., Vu, V., and Krishnapur, M. Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5):2023–2065, 2010.
 Teh et al. (2006) Teh, Y.W., Jordan, M.I., Beal, M., and Blei, D.M. Hierarchical Dirichlet processes. J. Am. Stat. Assoc., 101, 2006.
 Tropp (2011) Tropp, J.A. Userfriendly tail bounds for sums of random matrices. Found. Comput. Math., X:X, 2011.
 Wallach et al. (2009) Wallach, H.M., Mimno, D.M., and McCallum, A. Rethinking LDA: Why priors matter. In NIPS, volume 22, pp. 1973–1981, 2009.
Lemma .1.
((Tropp, 2011) Thm. 5.1, Eigenvalue Bennett Inequality). Consider a finite sequence of independent, self-adjoint random matrices with dimension , all of which have zero mean. Given an integer , define . Then, for all ,
where the function for .