The question of how to determine the number of independent latent factors, or topics, in Latent Dirichlet Allocation (LDA) is of great practical importance. In most applications, the exact number of topics is unknown, and depends on the application and the size of the data set. We introduce a spectral model selection procedure for topic number estimation that does not require learning the model’s latent parameters beforehand and comes with probabilistic guarantees. The procedure is motivated by the spectral learning approach and relies on adaptations of results from random matrix theory. In a simulation experiment taken from the nonparametric Bayesian literature, this procedure outperforms the nonparametric Bayesian approach in both accuracy and speed. We also discuss some implications of our results for the sample complexity and accuracy of popular spectral learning algorithms for LDA. The principles underlying the procedure can be extended to spectral learning algorithms for other exchangeable mixture models with similar conditional independence properties.
Fast, Guaranteed Spectral Model Selection for Topic Models
The question of how to determine the model order, that is, the number of independent latent factors, in mixture models such as Latent Dirichlet Allocation (Blei et al., 2003) is of great practical importance. These models are widely used for tasks ranging from bioinformatics to computer vision to natural language processing. Finding the smallest number of latent factors that explains the data improves predictive performance and increases computational and storage efficiency. In most applications, the exact number of latent factors (also known as topics or components) is unknown: model order often depends on the application and increases as the data set grows. For a fixed training set, the user can subjectively fine-tune the number of topics or optimize it according to objective measures of fit along with the other parameters of the model, but this is a time-consuming process, and it is not intuitively clear how to increase the number of topics as new data points are encountered without an additional round of fine-tuning. Moreover, spectral learning procedures can fail if the number of latent factors is underestimated.
In this paper, we present a simple and efficient procedure that estimates model order from the spectral characteristics of the sample cross-correlation matrix of the observed data. We focus on LDA in order to illustrate our approach, but our principles can be extended to other mixture models with similar conditional independence properties. Unlike previous approaches to model order selection, the resulting procedure comes with probabilistic guarantees and does not require computationally expensive learning of the hidden parameters of the model in order to return an estimate of the model order. The estimate can be further refined by running a spectral learning procedure that does learn the parameters.
Our approach relies on the assumption that the parameter vectors that characterize each of the topics are randomly distributed. We show that with high probability, the least singular value of the random matrix resulting from collecting these parameter vectors will be well-bounded. Roughly speaking, randomly distributed topics will be unlikely to be too correlated with each other. We show that as a result, the approximate number of latent factors can be predictably recovered from the spectral characteristics of the observable first and second moments of the data.
For LDA, the requisite moments can be efficiently computed from the sufficient statistics of the model, namely the term-document co-occurrence matrix. The usefulness of our procedure is illustrated by the following proposition for the usual case where the number of topics and the vocabulary size (or dimensionality) are such that (though we also present results for the more general case in this paper):
Suppose we have an LDA topic model over a vocabulary of size with concentration parameter , and we wish to determine how many nonzero topics there are in the corpus. Suppose and almost surely. Then, for large enough, if we gather independent samples as in Lemma 3.9, we can recover the number of topics whose expected proportion is greater than , with probability greater than .
The results which allow us to prove this guarantee also provide new insights on sample complexity bounds for spectral learning of mixture models, in particular excess correlation analysis (ECA) (Anandkumar et al., 2012b). These spectral algorithms have garnered attention partly because they offer better scalability to large data sets than MCMC methods, and partly because they provide probabilistic guarantees on sample complexity that are elusive for MCMC methods. However, sample complexity results in previous literature bound the estimation error and sample complexity of learning the latent parameter matrix in terms of itself: given that in practice is unknown beforehand, this is of limited practical utility for assessing the confidence of the estimate. In contrast, our results allow sample complexity to be expressed directly in terms of the known quantity :
Suppose we have an LDA topic model over a vocabulary of size . Suppose the number of topics is fixed, and the variance of the entries of the latent word-topic matrix is fixed and finite. Then, for large enough, if we gather independent samples, we can recover the parameter matrix with error less than , with probability greater than .
Taken together, these two results increase the usefulness of spectral algorithms for mixture models by allowing the number of topics to be set in a data-driven manner, and by providing more explicit sample complexity guarantees, giving the user a better idea of the quality of the learned parameters. This brings spectral methods closer to providing a guaranteed and computationally efficient alternative for nonparametric Bayesian models.
Nonparametric Bayesian models such as the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006) have been useful in addressing the problem of model order estimation. These models allow a distribution over an infinite number of topics. When HDP is fitted using a Markov chain Monte Carlo (MCMC) sampling algorithm to optimize posterior likelihood, new topics are sampled as necessitated by the data. However, training a nonparametric model using MCMC can be impractically slow for the large sample sizes likely to be encountered in many real-world applications. As is common for MCMC methods, the Gibbs sampler for HDP is susceptible to local optima (Griffiths et al., 2004). Indeed, maximum likelihood estimation of topic models in general has been shown to be NP-hard (Arora et al., 2012).
Another class of methods is based on learning models for a finite range of topics, and then optimizing some function of likelihood or performing a likelihood-based hypothesis test over this range (e.g., the Bayes factor method, or optimization of the Bayesian Information Criterion, Akaike Information Criterion, or perplexity). Not only do these methods suffer from the same susceptibility to local minima, but they are even more computationally intensive than nonparametric methods when the range of model orders under consideration is large. This is because the latent parameters of the model must be learned for every single model order under consideration in order to compute the likelihoods as a basis for comparison. The range of candidate model orders must be pre-specified by the user. Computational complexity increases linearly as the size of the range under consideration increases. In addition, they have been outperformed by nonparametric methods in experimental settings (Griffiths et al., 2004).
On the other hand, spectral learning methods (Arora et al., 2012; Anandkumar et al., 2012b) have been shown to provide asymptotic guarantees of exact recovery and to be computationally efficient. However, these techniques require specifying the number of latent factors beforehand, and in some cases produce highly unstable results when the number of latent factors is underestimated (Kulesza et al., ). Therefore, a guaranteed procedure for estimating the true number of latent factors should increase the practicality of these methods for learning probabilistic models.
We will first provide a brief overview of the assumptions of the LDA generative model and discuss how our method is motivated by the spectral learning approach in Section id1. In Section id1, we adapt non-asymptotic results concerning the singular values of random matrices to this setting. Practitioners interested in implementing our model order estimation method can consult Section id1, where we describe our procedure for finding the number of topics, demonstrate that our method outperforms a nonparametric Bayesian method on an experimental setting taken from the literature, and discuss some other implications of our results for the accuracy of algorithms for learning , .
For a vector , is the Euclidean norm and is the Euclidean distance between and a subspace . For a matrix , is the Moore-Penrose pseudoinverse; is the largest singular value, is the largest eigenvalue; and is the operator norm. a.s. is "almost surely," and w.p. is "with probability."
Latent Dirichlet Allocation (Blei et al., 2003) is a generative mixture model widely used in topic modeling. This model assumes that the data comprise a corpus of documents. In turn, each document is made up of discrete, observed word tokens. The observed word tokens are assumed to be generated from latent topics as follows:
In the generative process above, the concentration parameter can be seen as controlling how fine-grained the topics are; the smaller the value of , the more distinguishable the topics are from each other. The relative magnitude of each represents the expected proportion of word tokens in the corpus assigned to topic . The concentration parameter governs how topically distinct documents are (in the limit , we have a model where each document has a single topic rather than a mix of topics (Anandkumar et al., 2012a)).
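As a concrete illustration of this kind of generative process, the sketch below samples a single document from a standard LDA model (Blei et al., 2003). It is only a sketch under stated assumptions: the function names, the sizes, and the concentration values (`beta` for the topics, `0.5` for the document mixture) are hypothetical and are not taken from this paper.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_document(topic_word, alphas, n_words, rng):
    """Generate one document as a list of term indices.

    topic_word[t] is the term distribution of topic t. It is stored as a
    row here; the paper stores topics as columns of the word-topic matrix.
    """
    h = sample_dirichlet(alphas, rng)      # per-document topic mixture
    topics = range(len(alphas))
    vocab = range(len(topic_word[0]))
    words = []
    for _ in range(n_words):
        t = rng.choices(topics, weights=h)[0]              # latent topic of this token
        w = rng.choices(vocab, weights=topic_word[t])[0]   # observed term
        words.append(w)
    return words

rng = random.Random(0)
k, vocab_size = 3, 20      # hypothetical number of topics and vocabulary size
beta = 0.5                 # hypothetical concentration for the topic vectors
topic_word = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(k)]
doc = sample_document(topic_word, [0.5] * k, n_words=10, rng=rng)
```

Smaller concentration values make each sampled topic put most of its mass on a few terms, which is the sense in which the topics become more distinguishable.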
For a large class of mixture models including LDA and Gaussian Mixture Models, the observed data can be represented as a sequence of exchangeable vectors that are conditionally independent given a latent factor vector which is assumed to be strictly positive. For instance, in an LDA model each data point (word token) can be represented as a canonical basis vector of dimensionality , where is the vocabulary size (number of distinct terms). The -th element of is equal to 1 if the word token that it represents is observed to belong to class , and 0 otherwise. For LDA, determines the mixture of topics present in a particular document. Therefore is a vector whose support size is a.s. equal to the number of nonzero topics (the model order).
Although the sufficient statistics of LDA can be represented in other, more succinct ways, this representation turns out to be more than a curiosity. To see why, observe that under this representation the conditional expectation of the observed data generated by the models can be represented as a linear combination of some latent matrix (known in LDA as the word-topic matrix) and the latent membership vector :
For these mixture models, the principal learning problem is to estimate efficiently and accurately. Using the equation above and the conditional independence of any three distinct observed vectors given in the LDA model, we can derive equations for the expectations of the moments of the observed data in terms of . In particular, the expected first moment, which is the vector of the expected probability masses of the terms in the vocabulary, can be written as
and the expected second moment, which is the matrix of the expected cross-correlations between any two terms in the vocabulary, can be written as
Analogous expressions for even higher moments can be expressed using tensors. In fact, Anandkumar et al. (2012b) were able to develop fast spectral algorithms for learning the hidden parameters of mixture models from the second- and third-order moments of the data by taking advantage of this relationship. The resulting algorithm, excess correlation analysis (ECA), comes with probabilistic guarantees of finding the optimal solution, unlike MCMC approaches. In the case of LDA, the only user-specified inputs to the ECA spectral algorithm are the supposed number of topics and the concentration parameter governing the distribution of the membership vector . The matrix is treated as fixed, but unknown.
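The first and second moments discussed above can be estimated directly from word counts. The following is a minimal sketch, assuming documents are given as lists of integer term indices and that the corpus contains at least one document with two or more tokens. The function name `empirical_moments` and the choice to pool pair counts across documents are illustrative assumptions, not necessarily the estimator used in this paper.

```python
def empirical_moments(docs, vocab_size):
    """Estimate first and second moments from documents of term indices.

    m1[i]    : fraction of word tokens that are term i
    m2[i][j] : fraction of ordered pairs of distinct token positions
               (within a document) whose terms are (i, j)
    """
    m1 = [0.0] * vocab_size
    m2 = [[0.0] * vocab_size for _ in range(vocab_size)]
    n_tokens = 0
    n_pairs = 0
    for doc in docs:
        counts = [0] * vocab_size
        for w in doc:
            counts[w] += 1
        n_tokens += len(doc)
        n_pairs += len(doc) * (len(doc) - 1)
        for i, ci in enumerate(counts):
            m1[i] += ci
            if ci == 0:
                continue
            for j, cj in enumerate(counts):
                # Distinct positions only: c_i * c_j pairs, or c_i * (c_i - 1) when i == j.
                m2[i][j] += ci * (cj - 1 if i == j else cj)
    m1 = [v / n_tokens for v in m1]
    m2 = [[v / n_pairs for v in row] for row in m2]
    return m1, m2

m1, m2 = empirical_moments([[0, 1], [1, 1]], vocab_size=2)
```

Excluding a token paired with itself matters because the moment equations rely on the conditional independence of distinct observations within a document.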
Note that Eqs. (1) and (2) demonstrate an explicit linear-algebraic relationship between the latent parameter matrix , the expected moments of the data and , and the expected moments of . In fact, for LDA, is the vector that specifies the expected proportion of data points assigned to each topic across the entire data set– roughly speaking, if we expect about half of the word tokens in the data set to belong to topic . Therefore, the model order is the number of nonzero topics (i.e., the support) of . In the case of LDA, some elementary computation (cf. (Anandkumar et al., 2012a) Thm. 4.3) demonstrates that can be written as a product of , , and as follows:
where is the Moore-Penrose pseudoinverse of and . This suggests that and therefore the number of nonzero topics can be recovered by first learning and then estimating according to Eq. 3. The true number of topics is then equal to the number of such that . However, the number of latent factors must be specified beforehand in ECA, since the algorithm involves a truncated matrix decomposition and a truncated tensor decomposition. For low-dimensional data sets, it is possible to do this by setting . However, the time complexity of ECA scales as and the space complexity scales as due to the storage and decomposition of the third moment tensor (Anandkumar et al., 2012c), so this approach is not tractable for even moderately-sized datasets. On the other hand, it is not possible to determine with any certainty whether we have captured all the non-zero topics if we set when is unknown. This is because when , ECA learns highly unstable estimates of , which results in incorrect estimates of . For instance, consider the following toy example: set . Set .
If we try to recover the first two values of from the moments by running ECA with , we get . In a practical setting where a finite number of noisy samples are used to estimate the moments, one might conclude that is noise and that there is only one topic in this model. Similar parameter recovery problems arise when using low-rank approximations in spectral learning algorithms for other models (see (Kulesza et al., ) for some Hidden Markov Model examples). Thus, iterative methods where is increased or decreased until for some are not guaranteed to provide the correct result.
We suggest a novel approach in this paper, based on singular value bounds. Observe that taking the singular values of both sides of Eq. 3 yields:
Thus, rather than learning , we need only find some reasonably sharp bound on the least singular value of . If we treat the matrix as a random matrix (as in standard Bayesian approaches to LDA) and place an approximate bound on the variance of the entries of , then has very predictable spectral characteristics for reasonably large . To prove this, we must adapt some recent results from random matrix theory. In random matrix theory, finding the least singular values of random matrices is often referred to as resolving the so-called "hard edge" of the spectrum. While most work on the hard edge of the spectrum has focused on settings where the matrices are square and all entries are i.i.d. with mean zero, these conditions do not hold in the case of for Dirichlet mixture models such as LDA. We use some elementary facts about Dirichlet random variables to adapt the known results to the matrices of interest in our setting.
Note that and are not precisely known either, but it is relatively straightforward to derive estimators for them from the observed data. These estimators can be proven to be reasonably accurate via application of standard tail bounds for the eigenvalues and singular values of random matrices.
Thus, we can show that the observed moments of the data contain enough information to reveal the number of underlying topics to arbitrary accuracy with high probability, given enough samples. The principles behind our results can be extended to any exchangeable mixture models that can be represented as in Eqs. (1) and (2), though we will work with the LDA model to make our analysis concrete.
We place some further conditions on the LDA model that allow well-behaved spectral properties. These conditions are generally equivalent to those for ECA (Anandkumar et al., 2012a), with the exception of our assumptions on :
The matrix is of full rank. Note that this condition follows a.s. from the generative process described above (Chafaï, 2010).
The concentration parameters and are approximately known. Intuitively, as increases, the topics are less distinguishable from each other. Note that varying this assumption only affects our model by increasing the number of samples required to learn the number of topics within a certain level of accuracy. For simplicity of presentation, our derivations below assume that the entries for all . This is known as a symmetric Dirichlet prior and is equivalent to a uniform distribution on the simplex (Bordenave et al., 2012). Setting a symmetric prior on is standard procedure in most applications of Dirichlet mixture models; for an empirical justification of this practice, see (Wallach et al., 2009).
In the worst case, the number of topics is equal to the size of the vocabulary, and a.s. In most applications of Dirichlet topic models, the number of topics is in the tens or hundreds, and the size of the vocabulary is in the hundreds or thousands.
Under the assumptions and generative model above, we attempt to recover the number of topics within a margin of error defined by the expected probability mass of the topics, as follows:
A topic is -relevant iff the expected proportion of data points in the corpus belonging to the topic exceeds . That is, a topic is -relevant iff .
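The definition above amounts to a simple thresholding rule on the expected topic proportions. A minimal sketch, assuming those proportions are available as a list; the name `count_relevant_topics`, the example proportions, and the thresholds are all hypothetical:

```python
def count_relevant_topics(expected_proportions, eps):
    """Count topics whose expected share of the corpus exceeds eps."""
    return sum(1 for p in expected_proportions if p > eps)

# Hypothetical expected proportions for five candidate topics; the last
# topic is never used and the fourth accounts for 2.5% of word tokens.
proportions = [0.5, 0.25, 0.225, 0.025, 0.0]

n_coarse = count_relevant_topics(proportions, eps=0.05)
n_fine = count_relevant_topics(proportions, eps=0.01)
```

Lowering the threshold from 0.05 to 0.01 brings the rare fourth topic into the count, while the unused fifth topic is excluded at any positive threshold.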
Our procedure, as described below, is guaranteed to find at least all -relevant topics with low probability of detecting topics to which no words are assigned in the corpus. As long as , converges to 0 as the number of samples increases. For a fixed number of samples and a fixed failure probability , the relevance threshold for recovered topics increases when we wish to recover less distinguishable topics (i.e., as increases).
In this section we provide tail bounds on the smallest singular values of rectangular Dirichlet random matrices. Similar results can be derived for other Markov random matrices. These bounds closely mirror the work of (Tao & Vu, 2008), (Tao & Vu, 2009), and (Rudelson & Vershynin, 2009), and depend on probabilistic bounds on the distance between any given random vector corresponding to a column of a random matrix and the subspace spanned by the vectors corresponding to the rest of the columns. The estimation of these distances is much simplified for random vectors with independent entries, but for a Dirichlet random matrix, the entries in each column are dependent, as they must sum to one. Fortunately Dirichlet random vectors are related to vectors with independent entries in an elementary way.
Define a vector such that Gamma for some for all . Then the scaled vector Dir
For any Dirichlet random matrix with i.i.d. columns, and for the corresponding Gamma random matrix with independent entries Gamma, we have that
See (Bordenave et al., 2012) Section 2 and Lemma B.4.
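The relationship between Gamma and Dirichlet random matrices can be illustrated numerically. In the sketch below, where all function names and sizes are assumptions made for illustration, a matrix is filled with independent Gamma draws and each column is rescaled by its sum, so that every column is a draw from a symmetric Dirichlet distribution.

```python
import random

def gamma_matrix(n_rows, n_cols, shape, rng):
    """Matrix of independent Gamma(shape, 1) entries."""
    return [[rng.gammavariate(shape, 1.0) for _ in range(n_cols)]
            for _ in range(n_rows)]

def normalize_columns(G):
    """Divide each column by its sum, giving a column-stochastic matrix."""
    n_cols = len(G[0])
    col_sums = [sum(row[j] for row in G) for j in range(n_cols)]
    return [[row[j] / col_sums[j] for j in range(n_cols)] for row in G]

rng = random.Random(0)
G = gamma_matrix(6, 3, shape=0.5, rng=rng)   # independent entries
D = normalize_columns(G)                      # entries within a column are dependent
```

The dependence within each column of `D` comes only from the shared column sum, which is why results for matrices with independent entries can be transferred once that sum is controlled.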
The following singular value bound for square matrices follows from Tao & Vu (2009) Corollary 4:
Suppose is an random matrix with independent, identically distributed entries with variance 1, mean , and bounded fourth moment. For any there exist positive constants that depend only on such that
Though this bound applies also to rectangular matrices (i.e., cases where the number of topics grows more slowly than ) by the Cauchy Interlacing Theorem of singular values (cf. (Horn & Johnson, 1990)), this bound is not sharp when . The following result follows from adapting the arguments in Tao et al. (2010) Section 8:
Let be positive integers. Suppose is an Gamma random matrix with independent, identically distributed entries with variance 1, mean . Then for every there exist and that depend only on the moments of such that, for all .
In order to prove Theorem 3.4, we need two results, presented here without proof:
(Distance Tail Bound; (Rudelson & Vershynin, 2009) Thm. 4.1). Let be a vector in whose coordinates are independent and identically distributed random variables with unit variance and bounded fourth moment. Let be a random subspace in spanned by vectors, whose coordinates are independent and identically distributed random variables with bounded fourth moment and unit variance, independent of . Then for every , there exist that depend only on the moments such that
Although this result is stated for mean-zero random variables, see e.g. Tao et al. (2010), Prop. 5.1 for a discussion of why it can be extended to noncentered random variables.
(Negative Second Moment; (Tao et al., 2010) Lemma A.4). Let and let be a full rank matrix with columns . For each , let be the hyperplane generated by the remaining columns of . Then
Now we prove Theorem 3.4.
By Proposition 3.5 and the union bound, w.p. we have for all . Thus, with this probability, the right-hand side of Eq. (5) is less than . On the other hand, as the are ordered decreasingly, the left-hand side of Eq. (5) is at least It follows that, w.p. ,
thus completing the proof. ∎
Now we are ready to derive a singular value bound for Dirichlet random matrices.
Let be a random matrix whose columns are independent identically distributed random vectors drawn from a symmetric Dirichlet distribution with parameter vector with concentration parameter . Then for every there exist and that depend only on the moments of such that, for all ,
Recall that for a symmetric Dirichlet distribution with concentration parameter , each entry of the -dimensional vector drawn from this distribution has mean . Fix . Observe that has variance 1 and mean for all . Therefore, by Corollary 3.2, it follows that for any ,
We can control the sum in the denominator on the right-hand side above using standard concentration-of-measure results. For instance, by Chebyshev’s inequality and the mutual independence of the columns of , for any ,
We can make the second term on the right-hand side arbitrarily small by increasing , and for a fixed we can make the first term on the right-hand side arbitrarily small by decreasing for large enough. Therefore, we can find a for any such that for all large enough,
From the theorem above, we can deduce that, with high probability,
We are able to bound the error in estimating from a sample thanks to sample concentration lemmas for singular values that are analogous to more well-known concentration lemmas for scalar random variables (e.g., Markov’s inequality).
(TO BE FINISHED)
Although we are unable to compute the estimator without knowledge of , we can use Theorem 3.8 to provide an upper bound for the elements .
with high probability that depends on (the constant can be chosen arbitrarily so that the probability is negligible). (Footnote 1: For most applications, we recommend . We computed the least singular value for randomly generated Dirichlet random matrices with and ; all of these matrices were dominated by .)
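A numerical check of this kind can be sketched in pure Python. The code below is an illustration under stated assumptions, not the computation used in this paper: the matrix sizes, the concentration value, the number of trials, and all function names are hypothetical. The least singular value is obtained from the eigenvalues of the Gram matrix via cyclic Jacobi rotations, which is adequate only for small, reasonably well-conditioned matrices.

```python
import math
import random

def symmetric_eigenvalues(S, max_sweeps=100, rel_tol=1e-24):
    """Eigenvalues (ascending) of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(S)
    A = [row[:] for row in S]
    for _ in range(max_sweeps):
        total = sum(A[i][j] ** 2 for i in range(n) for j in range(n))
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off <= rel_tol * total:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                b = A[p][q]
                if b == 0.0:
                    continue
                diff = A[q][q] - A[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * b / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(A[i][i] for i in range(n))

def least_singular_value(A):
    """Smallest singular value of a matrix with at least as many rows as columns."""
    k = len(A[0])
    gram = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]
    return math.sqrt(max(symmetric_eigenvalues(gram)[0], 0.0))

def dirichlet_matrix(n_rows, n_cols, conc, rng):
    """Matrix whose columns are i.i.d. symmetric Dirichlet draws (normalized Gammas)."""
    cols = []
    for _ in range(n_cols):
        g = [rng.gammavariate(conc, 1.0) for _ in range(n_rows)]
        total = sum(g)
        cols.append([x / total for x in g])
    return [[cols[j][i] for j in range(n_cols)] for i in range(n_rows)]

rng = random.Random(1)
# Hypothetical setting: vocabulary of 100 terms, 5 topics, 20 random trials.
values = [least_singular_value(dirichlet_matrix(100, 5, 1.0, rng)) for _ in range(20)]
smallest = min(values)
```

Comparing `smallest` against a candidate bound over many trials gives an empirical sense of how conservative that bound is for a given vocabulary size and number of topics.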
This suggests the following procedure to estimate the number of topics:
To compare the performance of our procedure against previous model order estimation methods, we replicated the experimental setting used to demonstrate the model selection capabilities of hLDA (Griffiths et al., 2004). hLDA (a Gibbs sampling method for the nonparametric equivalent of LDA using the Chinese restaurant process prior) was shown to be much faster and more accurate than the Bayes factors method (a likelihood-based hypothesis-testing method) in this setting. A total of 210 corpora of 1000 10-word documents each were generated from an LDA model with , a vocabulary size of 100, and a word-topic matrix with columns randomly generated from a symmetric Dirichlet ( for , so ) and .
hLDA requires the input of a concentration parameter that controls how frequently a new topic is introduced, so the authors set . Since Gibbs sampling is subject to local maxima, the sampler is randomly restarted 25 times for each corpus. Each time, the sampler is burned in for 10000 iterations, and samples are subsequently taken 100 iterations apart for another 1000 iterations. The restart with the highest average likelihood over the post-burn-in period is selected, and the number of topics for this restart that had non-zero word assignments throughout the burn-in period is selected as the hLDA prediction of model order. We used the Java implementation of the hLDA Gibbs sampler by Bleier (2010).
For our spectral model selection procedure, we set our topic relevance sensitivity threshold at , which corresponds to an expected error rate of . We implemented our procedure using the Matlab standard library. Both methods are somewhat sensitive to and , so we set these parameters to the ground truth for both methods, just as in (Griffiths et al., 2004).
Figure 1 shows that our procedure outperforms hLDA for this experimental setting (points are jittered slightly to reveal overlapping points). Our procedure correctly estimated the model order for all of the 210 corpora without the optional ECA step, whereas for hLDA the error rate was 10 out of 210. Griffiths et al. (2004) reported an error rate of 15 out of 210 for hLDA in this experimental setting, and an error rate of 80 out of 210 for the Bayes factors method.
The running time for the hLDA Gibbs sampling procedure was sec per corpus on a single thread of a machine with an eight-core 2.67 GHz CPU, while the running time for the spectral model selection procedure without the ECA step was 0.252 sec per corpus. However, hLDA learns the latent matrix while estimating the model order. Including the ECA step in our spectral model selection procedure to learn , the running time increases to 2.05 sec per corpus.
The learnability and sample complexity of spectral algorithms for mixture models depend crucially on the latent variable matrix being well-conditioned. For instance, the algorithm of Anandkumar et al. (2012a) for learning LDA comes with the following guarantee:
((Anandkumar et al., 2012a) Thm 5.1). Fix . Let and let denote the smallest (non-zero) singular value of . Suppose that we obtain independent samples of in the LDA model. w.p. greater than , the following holds: for sampled uniformly over the sphere , w.p. greater than 3/4, Algorithm 5 in (Anandkumar et al., 2012a) returns a set such that there exists a permutation of the columns of so that for all
Theorem 3.8 allows us to replace the dependence on by a dependence on , , and :
Let , , , , , and be as in 4.1. Suppose that we obtain independent samples of in the LDA model. w.p. greater than ,
Proposition 1.2 follows from assuming that the variance parameter of each entry remains constant as increases (so that ), and from assuming that is fixed, so that
In this paper, we have derived a novel procedure for determining the number of latent topics in Latent Dirichlet Allocation. Our experiments suggest that this procedure can outperform nonparametric Bayesian models learned using MCMC.
Our results rely on adapting random-matrix-theoretic results to the case of rectangular noncentered matrices, and connecting these results to the spectral properties of the moments of data generated by an LDA model. Similar random-matrix-theoretic results should be applicable to the problem of finding the number of latent factors in many other mixture models with similar conditional independence properties, and we plan to present such results in future work.
- Anandkumar et al. (2012a) Anandkumar, A, Foster, DP, Hsu, D, Kakade, SM, and Liu, YK. Two SVDs suffice: spectral decompositions for probabilistic topic models and latent dirichlet allocation. arXiv preprint arXiv:1204.6703, 2012a.
- Anandkumar et al. (2012b) Anandkumar, A., Hsu, D., and Kakade, S.M. A method of moments for mixture models and hidden Markov models. JMLR: Workshop & Conference Proc., 23:33.1–33.34, 2012b.
- Anandkumar et al. (2012c) Anandkumar, Anima, Ge, Rong, Hsu, Daniel, Kakade, Sham M, and Telgarsky, Matus. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012c.
- Arora et al. (2012) Arora, S., Ge, R., and Moitra, A. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 2012.
- Blei et al. (2003) Blei, D.M., Ng, Andrew Y, and Jordan, M.I. Latent dirichlet allocation. J. Machine Learning Research, 3:993–1022, 2003.
- Bleier (2010) Bleier, A. Java Gibbs sampler for the HDP, 2010. https://github.com/arnim/HDP.
- Bordenave et al. (2012) Bordenave, C., Caputo, P., and Chafaï, D. Circular law theorem for random Markov matrices. Prob. Theory & Related Fields, 152:751–779, 2012.
- Chafaï (2010) Chafaï, D. The dirichlet markov ensemble. Journal of Multivariate Analysis, 101(3):555–567, 2010.
- Griffiths et al. (2004) Griffiths, T.L., Jordan, M.I, Tenenbaum, J.B., and Blei, D.M. Hierarchical topic models and the nested chinese restaurant process. Advances in neural information processing systems, 16:106–114, 2004.
- Horn & Johnson (1990) Horn, R.A. and Johnson, C.R. Matrix Analysis. Cambridge University Press, 1990.
- (11) Kulesza, Alex, Rao, N Raj, and Singh, Satinder. An exploration of low-rank spectral learning.
- Rudelson & Vershynin (2009) Rudelson, M. and Vershynin, R. The smallest singular value of a random rectangular matrix. Commun. Pure Appl. Math, 62:1707–1739, 2009.
- Tao & Vu (2008) Tao, T. and Vu, V. Random matrices: the circular law. Comm. in Contemp. Math., 10.02:261–307, 2008.
- Tao & Vu (2009) Tao, T. and Vu, V. Smooth analysis of the condition number and the least singular value. pp. 0805.3167v2, 2009.
- Tao et al. (2010) Tao, T., Vu, V., and Krishnapur, M. Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5):2023–2065, 2010.
- Teh et al. (2006) Teh, Y.W., Jordan, M.I., Beal, M., and Blei, D.M. Hierarchical Dirichlet processes. J. Am. Stat. Assoc., 101, 2006.
- Tropp (2011) Tropp, J.A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., X:X, 2011.
- Wallach et al. (2009) Wallach, H.M., Mimno, D.M., and McCallum, A. Rethinking LDA: Why priors matter. In NIPS, volume 22, pp. 1973–1981, 2009.
(Eigenvalue Bennett Inequality; (Tropp, 2011) Thm. 5.1). Consider a finite sequence of independent, random, self-adjoint matrices with dimension , all of which have zero mean. Given an integer , define . Then, for all ,
where the function for .