An end-to-end Differentially Private Latent Dirichlet Allocation Using a Spectral Algorithm

Seyed A. Esmaeili
Department of Computer Science
University of Maryland
esmaeili@cs.umd.edu
   Furong Huang
Department of Computer Science
University of Maryland
furongh@cs.umd.edu
Abstract

Latent Dirichlet Allocation (LDA) is a powerful probabilistic model used to cluster documents based on thematic structure. We provide an end-to-end analysis of differentially private LDA learning, based on a spectral algorithm with established utility guarantees. The spectral algorithm involves a complex data flow, with multiple options for noise injection. We analyze the sensitivity and utility of different configurations of noise injection to characterize the configurations that achieve the least performance degradation under different operating regimes.

1 Introduction

Topic models are probabilistic latent variable models that assume words are drawn conditionally independently given their topics. They use the simple bag-of-words representation, along with a prior from which the topic proportions of the documents are drawn. Learning a topic model involves learning lower dimensional latent structure (e.g., topics) from high dimensional complex observations (e.g., documents). Given a set of documents, topic modeling algorithms map each document to a (small) set of topics, thereby mechanically categorizing large corpora. At their core, topic modeling algorithms output a generative model that describes the input data. Topic modeling has been used extensively to categorize natural language documents, with applications in personalized search, social sciences, and machine translation. In this paper, we focus on a popular topic model, Latent Dirichlet Allocation (LDA) [4], which is used extensively in multiple domains.

The method of moments provides guaranteed recovery of the topic model parameters [1, 3]. The first three data moments of the LDA model can be decomposed into the model parameters. Therefore, matrix/tensor decompositions of the empirically estimated data moments guarantee recovery of the model parameters. Tensor decomposition, such as the simultaneous power method, has been proven to recover the true factors of the tensor [18]. Therefore, we have an end-to-end learning algorithm that is guaranteed to recover the model parameters of Latent Dirichlet Allocation using matrix/tensor decomposition.

Consider a sensitive document corpus $D$ that is kept secret, but assume that an adversary can obtain the output of running LDA on $D$. Unfortunately, depending on the structure of $D$, even the LDA output may leak sensitive information: e.g., if one more document is added to $D$ and the LDA output changes by one topic $t$, then clearly the added document is related to $t$. Differential Privacy (DP) [8] is a general framework for quantifying such leakage of private information. At a high level, Differential Privacy bounds how much the output of a generic algorithm can change if its input changes by one record.

Given an algorithm $\mathcal{A}$, its output may change arbitrarily with a change in the input (indeed, it may simply emit its input!). A generic method to make $\mathcal{A}$ differentially private is to add sufficient noise to $\mathcal{A}$'s output. Note that the amount of noise that has to be added depends on the sensitivity of $\mathcal{A}$, which captures how much the output of $\mathcal{A}$ changes with changes in the input. Different noise models can be used; we focus on the so-called Gaussian Mechanism [7], which adds noise drawn from a Gaussian distribution with variance dependent on $\mathcal{A}$'s sensitivity.

In this paper, we describe different differentially private implementations of LDA topic modeling. Differentially private LDA has many important applications, including privacy-preserving recommendation systems and the classification of user data such as email and social network data. An obvious way to create a differentially private LDA is to simply run LDA over a dataset and then add (Gaussian) noise to the output. While safe, such an implementation may not provide high utility, which captures how "good"/useful the private output is compared to the output without noise. Indeed, there are different ways to add noise to LDA implementations that maintain the same level of differential privacy but yield vastly different utilities.

We introduce a data flow computation graph for LDA in Figure 1, and evaluate different "configurations", which correspond to sets of edges on this graph where noise is added such that the output is differentially private. (Adding noise to the output is simply one configuration.) For each configuration, we compute the sensitivity and utility, and quantify regimes where certain configurations outperform others.

1.1 Summary of Contributions

We illustrate the data flow computation of the method of moments for learning Latent Dirichlet Allocation in Figure 1. Based on this data flow graph, we list combinations of edges (configurations) on which noise can be added to guarantee differentially private LDA.

To solve the challenging problem of where to add the noise so as to achieve the least performance degradation while guaranteeing $(\varepsilon, \delta)$-differential privacy, we characterize the sensitivity of the edges with respect to the input. To quantify the utility achieved by each configuration, we also characterize the utility as a function of the noise for each configuration. Overall, combining the sensitivity and utility characterizations, we obtain an end-to-end analysis of the differentially private method of moments for LDA and identify the regimes under which each configuration guarantees the best performance. (Using this framework, we demonstrate multiple mechanisms, which permit differentially private algorithms whose utilities are advantageous in different regimes, as listed in Remarks 5.1, 5.2, and 5.3.)

The edge sensitivities are listed in Table 3 (in the Appendix) and the utilities are listed in Table 2. Overall, if the constraint required by configuration 3 is met, it will likely provide the best utility for a given level of differential privacy. The configuration achieving the next best utility is configuration 2, which only requires a constraint on the spectral gap of $\widehat{M}_2$. Configuration 1 does not require any constraints; however, its utility is likely to be poor due to its dependence on the vocabulary dimension $d$.

1.2 Related Work

This work focuses on LDA parameter estimation based on spectral algorithms, which, unlike EM-based algorithms [14, 13], guarantee parameter recovery [1, 2]. The spectral estimation method relies on matrix decomposition and tensor decomposition methods. Thus, differentially private PCA and differentially private tensor decomposition are related to our objective.

Differentially private PCA is an established topic; differentially private PCA was achieved using the exponential mechanism in [6, 12]. The algorithm in [12] provides guarantees, but at a significant computational cost; in contrast, [6] introduces an algorithm that is near optimal but without an analysis of convergence time. Although $(\varepsilon, \delta)$-differential privacy is a looser definition than pure $\varepsilon$-differential privacy, it leads to better utility. Comparative experimental results show that the PCA algorithm of [11] outperforms significantly, and [9] introduces a simple input perturbation algorithm which achieves near-optimal utility. In our work, we follow the $(\varepsilon, \delta)$ definition and use [9] to obtain a differentially private matrix decomposition when needed.

Differentially private tensor decomposition is proposed in [19] under an incoherent basis assumption, and it is not clear to what extent such an assumption holds in topic modeling. Moreover, utility bounds are proved only for the top eigenvector in [19]. The authors exclude the possibility of input perturbation, as that would cause the privacy parameter $\varepsilon$ to be lower bounded by the dimension $d$, which may be prohibitive. However, the same analysis applied to a tensor of reduced dimension would conclude that $\varepsilon$ is lower bounded by $k$, which is acceptable for a reduced-dimension whitened tensor since $k \ll d$.

2 Preliminaries

Symbols Description
$n$, $\ell$ Total # of documents, max. # of words in a document
$D$ {$D'$} Document corpus {neighboring corpus}
$\ell_i$ Number of words in document $i$
$\widehat{M}_2$ {$\widehat{M}_2'$} Empirical second order LDA moment {of neighboring corpus}
$W$ {$W'$} Whitening matrix {of neighboring corpus}
$\widehat{T}$ {$\widehat{T}'$} Whitened tensor {from neighboring corpus}
$T$ Whitened tensor (population)
$\widehat{M}_3$ {$\widehat{M}_3'$} Third order LDA moment {of neighboring corpus}
$\widehat{M}_2^{(k)}$ {$\widehat{M}_2^{(k)\prime}$} Rank-$k$ approximation of $\widehat{M}_2$ {of $\widehat{M}_2'$}
$\Delta_E$ Sensitivity of edge $E$: $\Delta_2$ if $E = E_2$, $\Delta_3$ if $E = E_3$, $\Delta_T$ if $E = E_T$
$\lambda_i$ Eigenvalue $i$ from tensor decomposition
$v_i$ Eigenvector $i$ from tensor decomp. before un-whitening
Table 1: List of Symbols. Symbols that refer to a neighboring corpus are listed within {}.

Notation

We represent a corpus of $n$ documents with the word-count matrix $C \in \mathbb{R}^{d \times n}$, where the $i$-th column $c_i$ is equal to the word-count vector of document $i$. Specifically, $C_{j,i}$ equals the number of times word $j$ shows up in document $i$. Clearly, $d$ is the vocabulary size and $n$ is the total number of documents in the corpus. The length (total number of words) of document $i$ is denoted by $\ell_i$. Furthermore, we use $x_j^{(i)}$ to refer to the $j$-th word that appears in document $i$; $x_j^{(i)}$ is a one-hot encoded vector. Table 1 summarizes the notation used in this paper.

We use $\|\cdot\|$ to refer to the $\ell_2$ norm of a vector and to the spectral (operator) norm of a matrix or a tensor. The Frobenius norm of a matrix is referred to by $\|\cdot\|_F$. We use the same notation to refer to the Frobenius norm of a tensor, which follows a similar definition.

Definition 2.1.

Let $\mathcal{M}$ be a random mechanism. If for every $\mathcal{S} \subseteq \mathrm{Range}(\mathcal{M})$ and every possible pair of neighboring inputs $D$ and $D'$ the following holds: $\Pr[\mathcal{M}(D) \in \mathcal{S}] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in \mathcal{S}] + \delta$, then $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private.

Proposition 2.2.

Gaussian Mechanism ([7]). Let $f: \mathcal{D} \to \mathbb{R}^m$, let $D$ and $D'$ be two neighboring inputs, and let $\varepsilon \in (0, 1)$. Define the $\ell_2$ sensitivity of $f$ as $\Delta_2 f = \max_{D, D'} \|f(D) - f(D')\|_2$. Then the output $f(D) + Y$, where $Y \in \mathbb{R}^m$ has coordinates sampled i.i.d. from $\mathcal{N}(0, \sigma^2)$, ensures $(\varepsilon, \delta)$-differential privacy, where $\sigma = \eta(\varepsilon, \delta)\,\Delta_2 f$ with $\eta(\varepsilon, \delta) = \sqrt{2 \ln(1.25/\delta)}/\varepsilon$.
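As a concrete illustration, the following Python sketch implements the Gaussian mechanism of Proposition 2.2; the function name and interface are our own, and the caller must supply a valid $\ell_2$ sensitivity bound.

import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng=None):
    # Gaussian mechanism (Proposition 2.2): add N(0, sigma^2) noise with
    # sigma = eta(eps, delta) * Delta_2 f, eta = sqrt(2 ln(1.25/delta)) / eps.
    # Valid for eps in (0, 1); value is the quantity f(D) to be released.
    rng = rng if rng is not None else np.random.default_rng()
    eta = np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    sigma = l2_sensitivity * eta
    return value + rng.normal(0.0, sigma, size=np.shape(value))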

Proposition 2.3.

The Composition Theorem for Differential Privacy ([8]). Let $\mathcal{M}_1$ and $\mathcal{M}_2$ be two differentially private mechanisms with privacy parameters $(\varepsilon_1, \delta_1)$ and $(\varepsilon_2, \delta_2)$, respectively. Then $(\mathcal{M}_1(D), \mathcal{M}_2(D))$ is $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-differentially private.

Note that Proposition 2.3 implies that $g(\mathcal{M}_1(D), \mathcal{M}_2(D))$, for any function $g$, is also $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-differentially private, by the post-processing property [8].
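In code, composition and post-processing read as follows; M2_hat, M3_hat, and the sensitivities delta_2, delta_3 are placeholders standing in for the quantities bounded later in Theorem 4.1.

# Releasing noisy versions of two quantities costs the sum of the budgets.
M2_priv = gaussian_mechanism(M2_hat, delta_2, eps1, d1)   # (eps1, d1)-DP
M3_priv = gaussian_mechanism(M3_hat, delta_3, eps2, d2)   # (eps2, d2)-DP
# The pair is (eps1 + eps2, d1 + d2)-DP, and any post-processing
# g(M2_priv, M3_priv) keeps the same guarantee.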

3 Differentially Private LDA Topic Model

Latent Dirichlet Allocation (LDA) [4] has become increasingly popular since its introduction in 2003. The single topic model makes the oversimplified assumption that all words in the same document are drawn from a single latent class (topic). LDA, while still a bag-of-words model, allows modeling of mixed topics in a document to account for the more general case where a document belongs to several different latent classes (topics) simultaneously. More details about the topic model are available in Appendix A.

Definition 3.1 (LDA moments).

The first three LDA moments, $M_1$, $M_2$, and $M_3$, are defined as follows:

$M_1 := \mathbb{E}[x_1]$ (1)
$M_2 := \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1} M_1 \otimes M_1$ (2)
$M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0 + 2}\big(\mathbb{E}[x_1 \otimes x_2 \otimes M_1] + \mathbb{E}[x_1 \otimes M_1 \otimes x_2] + \mathbb{E}[M_1 \otimes x_1 \otimes x_2]\big) + \frac{2\alpha_0^2}{(\alpha_0 + 1)(\alpha_0 + 2)} M_1 \otimes M_1 \otimes M_1$ (3)

Using the properties of LDA, the moments can be decomposed into factors, and those factors are exactly the model parameters we aim to estimate.

Lemma 3.2 (Moment Decompositions Recover Model Parameters).

The LDA moments are related to the model parameters $\alpha = (\alpha_1, \ldots, \alpha_k)$ and $\mu = [\mu_1, \ldots, \mu_k]$ as follows:

$M_1 = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0} \mu_i$ (4)
$M_2 = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0 (\alpha_0 + 1)} \mu_i \otimes \mu_i$ (5)
$M_3 = \sum_{i=1}^{k} \frac{2 \alpha_i}{\alpha_0 (\alpha_0 + 1)(\alpha_0 + 2)} \mu_i \otimes \mu_i \otimes \mu_i$ (6)

We empirically estimate the moments $M_1$, $M_2$, and $M_3$ without bias, and obtain the model parameters $\alpha$ and $\mu$ by implementing tensor decomposition on those empirically estimated moments. Given the $n$ observations of word frequency vectors, we define moment estimators in Definition B.1 in Appendix B. According to Lemma B.2, the moment estimators in Definition B.1 are unbiased.
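To make the estimation step concrete, the sketch below computes a standard empirical version of $M_2$ from raw word counts; the normalization follows the common form in the spectral-LDA literature and may differ from the exact estimator of Definition B.1.

import numpy as np

def empirical_m2(C, alpha0):
    # C: (d, n) word-count matrix; column i holds the counts for document i.
    # alpha0: sum of the Dirichlet prior parameters.
    # Computes (alpha0 + 1) E[x1 (x) x2] - alpha0 * M1 (x) M1, where the pair
    # moment is estimated from off-diagonal pair counts within each document.
    d, n = C.shape
    ell = C.sum(axis=0).astype(float)            # document lengths
    pairs = np.zeros((d, d))
    m1 = np.zeros(d)
    for i in range(n):
        c = C[:, i].astype(float)
        pairs += (np.outer(c, c) - np.diag(c)) / (ell[i] * (ell[i] - 1.0))
        m1 += c / ell[i]
    pairs /= n
    m1 /= n
    return (alpha0 + 1.0) * pairs - alpha0 * np.outer(m1, m1)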

3.1 Method of Moments & Tensor Decomposition

The method of moments uses the properties of the data moments of the LDA model (Lemma 3.2) to estimate the parameters of the topic model, $\alpha$ and $\mu_i$ for $i \in [k]$. The flow of the algorithm is depicted in Figure 1 and consists of the following steps: (1) Estimating $\widehat{M}_2$ and $\widehat{M}_3$ using equation (11) (edge $E_2$ in Figure 1) and equation (12) (edge $E_3$ in Figure 1), from the data $\{c_i\}_{i=1}^{n}$, which consists of word frequency vectors for each document $i$. (2) Implementing singular value decomposition on $\widehat{M}_2$ to obtain an estimate of the whitening matrix $W = U_k \Sigma_k^{-1/2}$, where $U_k$ and $\Sigma_k$ contain the top $k$ singular vectors and singular values of $\widehat{M}_2$ (edge $E_W$ in Figure 1). (3) Whitening the tensor using multilinear operations on $\widehat{M}_3$ with $W$, i.e., $\widehat{T} = \widehat{M}_3(W, W, W)$ (edge $E_T$ in Figure 1). (4) Implementing tensor decomposition on the whitened tensor $\widehat{T}$ and denoting the resulting eigenvalue/eigenvector pairs as $\lambda_i$, $v_i$ for $i \in [k]$ (edge $E_{\lambda v}$ in Figure 1). (5) Obtaining the un-whitening matrix $(W^{+})^{\top} = U_k \Sigma_k^{1/2}$ (edge $E_{W^+}$ in Figure 1). (6) Un-whitening the eigenvectors to obtain the word distributions per topic $\mu_i$ and the prior parameters $\alpha_i$ for $i \in [k]$ (edge $E_\mu$ in Figure 1).

Figure 1: Data flow computation graph using the method of moments to learn the topic model.

The method of moments guarantees correct learning of topic models (see Lemma F.1).
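To make the six steps concrete, the following sketch implements the (non-private) pipeline; a simple deflation-based tensor power method stands in for the simultaneous power iteration of [18], all names are our own, and the recovery of $\alpha$ is omitted.

import numpy as np

def spectral_lda(M2_hat, M3_hat, k, n_iter=100, seed=0):
    # Steps (2)-(6) of the data flow in Figure 1, on empirical moments.
    rng = np.random.default_rng(seed)
    # (2) Whitening matrix W = U_k diag(s_k)^(-1/2) from the top-k SVD of M2.
    U, s, _ = np.linalg.svd(M2_hat)
    W = U[:, :k] / np.sqrt(s[:k])                          # (d, k)
    # (3) Whitened tensor T = M3(W, W, W): d^3 entries reduced to k^3.
    T = np.einsum('abc,ai,bj,cl->ijl', M3_hat, W, W, W)
    # (4) Tensor power iterations with deflation (stand-in for [18]).
    lams, vecs = [], []
    for _ in range(k):
        v = rng.normal(size=k)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijl,j,l->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = float(np.einsum('ijl,i,j,l->', T, v, v, v))
        lams.append(lam)
        vecs.append(v)
        T = T - lam * np.einsum('i,j,l->ijl', v, v, v)     # deflate
    # (5) Un-whitening matrix (W^+)^T = U_k diag(s_k)^(1/2).
    B = U[:, :k] * np.sqrt(s[:k])
    # (6) Un-whiten the eigenvectors: mu_i proportional to lam_i * B v_i.
    mus = np.array([lam * (B @ v) for lam, v in zip(lams, vecs)])
    return np.array(lams), mus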

3.2 Differentially Private LDA

Problem Statement

We define two corpora $D$ and $D'$ to be neighbors if they agree on all documents except one, i.e., $D'$ can be formed from $D$ by dropping a column of the word-count matrix and replacing it with a new column. We assume the corpus of data is held by a trusted curator, and an analyst will query for the parameters of the topic model. The curator has to output the model parameters in a differentially private manner. While it is easy to achieve differential privacy, the challenge is in guaranteeing high utility.

Config. (Edge Set) Constraint Utility Loss
1: ($E_2$, $E_3$) None (see Appendix H)
2: ($E_T$, $E_2$) $\Delta_2$ bounded by the spectral gap of $\widehat{M}_2$ (see Appendix H)
3: ($E_{\lambda v}$, $E_2$) as in 2, plus the whitened-tensor condition of Theorem 4.2 (see Appendix H)
4: ($E_\mu$) same as 3 (see Appendix H)
Table 2: Utility of different configurations that guarantee differentially private LDA. The table lists the edges in Figure 1 on which Gaussian noise is added in order to achieve a differentially private topic model using the method of moments. $\Delta_2$ is the sensitivity of $\widehat{M}_2$, and $\Delta_3$ is the sensitivity of $\widehat{M}_3$. We use $\eta(\varepsilon, \delta)$ as defined in Proposition 2.2 to decompose the dependence of the noise standard deviation on both the sensitivity and the privacy parameters, i.e., $\sigma = \Delta \cdot \eta(\varepsilon, \delta)$, where $\Delta$ is the sensitivity. The utility loss expressions are derived in Appendix H.

We will use the Gaussian mechanism in this paper to achieve an $(\varepsilon, \delta)$-differentially private topic model for each of the configurations. In the next section, we compute edge sensitivities in order to obtain the noise level that must be added to each edge for the different configurations listed in Table 2.

4 Sensitivity Analysis of Nodes in Data Flow Computation Graph

A key challenge in this problem is the question of where to add noise in the data flow computation graph shown in Figure 1. A simple addition of noise to the inputs, similar to [5], would guarantee differential privacy but at the expense of utility. It is reasonable to expect an efficient differentially private algorithm to scale well with the corpus size $n$ and the vocabulary dimension $d$: as the data set becomes larger, the algorithm should add less noise and yield higher utility. Furthermore, it is undesirable for the utility to exhibit dependence on the first power of $d$, or for the differential privacy parameter $\varepsilon$ to be lower bounded by $d$.

We follow a principled approach, where we calculate the sensitivities at the different nodes of the data flow computation graph. Further, we consider various options and establish the utilities for the different possible noise addition configurations listed in Table 2. The starting point for the sensitivity calculations are the moments $\widehat{M}_2$ and $\widehat{M}_3$ of LDA. Similar to the empirical covariance matrix, the sensitivities of both quantities fall as $1/n$.

Any given node in the data flow computation graph has a value $f(D)$, where $D$ is the corpus. The $\ell_2$ sensitivity is calculated as $\Delta_2 f = \max_{D, D'} \|f(D) - f(D')\|_2$, and the $\ell_1$ sensitivity as $\Delta_1 f = \max_{D, D'} \|f(D) - f(D')\|_1$, where $D$, $D'$ are neighboring corpora that differ in one record only.
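These worst-case quantities are what the theorems below bound analytically; empirically, swapping a single record only probes (and lower bounds) them. The hypothetical helper below illustrates the definition.

import numpy as np

def probe_sensitivity(f, C, new_doc, j, p=2):
    # Probe the sensitivity of a node f at corpus C by replacing document j
    # with new_doc, per the neighboring-corpus definition. This yields a
    # lower estimate of the worst-case sensitivity, not a proven bound.
    C_prime = C.copy()
    C_prime[:, j] = new_doc
    diff = np.asarray(f(C) - f(C_prime))
    return np.linalg.norm(diff.ravel(), ord=p)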

Theorem 4.1 (Sensitivity of second and third order LDA moments).

Let $\Delta_2$ and $\Delta_3$ be the sensitivities of $\widehat{M}_2$ and $\widehat{M}_3$, respectively. Then both $\Delta_2$ and $\Delta_3$ are $O(1/n)$.

See Appendix G for the proof and the exact forms. We note that the $\ell_1$ sensitivity bounds the $\ell_2$ sensitivity, just as $\|x\|_2 \le \|x\|_1$ for any vector $x$; at times we only bound the $\ell_1$ sensitivity, which in turn bounds the $\ell_2$. While it is easy to establish bounds on the $\widehat{M}_2$ and $\widehat{M}_3$ sensitivities, it is not easy to do so for the whitened tensor $\widehat{T}$: this requires the whitening matrix to be stable when a record is changed in the document corpus, which is not guaranteed in general. However, the fact that the corpus consists of many topics leads to a whitening matrix that is stable under the change of a given record as the data set size increases, i.e., the relevant spectral gap of $\widehat{M}_2$ grows with more data. Furthermore, the amount of perturbation introduced by changing a record becomes smaller as the data set grows, according to Theorem 4.1.

Theorem 4.2 (Sensitivity of the whitened tensor $\widehat{T}$).

Let $\Delta_T$ be the sensitivity of the whitened tensor $\widehat{T}$. Then, if the sensitivity $\Delta_2$ is bounded by the spectral gap between the $k$-th and $(k+1)$-th singular values of $\widehat{M}_2$, the sensitivity $\Delta_T$ is upper bounded by a quantity that falls with the dataset size (see Appendix G for the exact form).

The proof is in Appendix G. Using the guarantees from the simultaneous tensor power method [18], we can establish bounds on the eigenvectors and eigenvalues output by the decomposition. Note that this is a bound on the output before the un-whitening step.

Theorem 4.3 (Sensitivity of the output of tensor decomposition).

Let $\lambda_i$ and $v_i$ be the outputs of tensor decomposition (before un-whitening), and let $\Delta_\lambda$ and $\Delta_v$ be the sensitivities of any $\lambda_i$ and $v_i$, respectively. Assume the spectral gap of the whitened tensor is large enough. Then, based on [18], if $\Delta_T$ is sufficiently small (according to Theorem 1 in [18]), both $\Delta_\lambda$ and $\Delta_v$ are bounded in terms of $\Delta_T$ and the eigenvalues of the whitened tensor; in particular, both fall with the dataset size at the same rate as $\Delta_T$, up to the factors given in Appendix G.

Proof.

Please see Appendix G.7. ∎

Theorem 4.4 (Sensitivity of the final output).

If the conditions of Theorems 4.2 and 4.3 hold and $\Delta_T$ is sufficiently small, then the sensitivities $\Delta_\mu$ and $\Delta_\alpha$ of the final outputs are bounded by quantities that fall with the dataset size (see Appendix G for the exact forms).

See Appendix G for a proof. We see that the sensitivities before the whitening step fall as $1/n$. The whitening step then inflates the sensitivity by a multiplicative factor, the simultaneous power method inflates it further, and the un-whitening changes it by one more factor; the exact factors are given in Appendix G.

5 Differentially Private Latent Dirichlet Allocation

We consider ways to establish differential privacy by adding noise to different edges in the data flow computation graph of Figure 1; we refer to a given choice of edges as a "configuration". Under each configuration, we compute the noise needed to obtain $(\varepsilon, \delta)$-differential privacy based on the sensitivity (the composition theorem is used when edges are combined), thereby characterizing the utility based on the necessary noise. We identify constraints that apply to specific configurations and characterize regimes of optimal utility. The final utility and sensitivity of each configuration are listed in Tables 2 and 3 (in the Appendix), respectively. (Proofs of all utility derivations are in Appendix H.) In what follows, if noise is added to edge $E$, then $\varepsilon_E$ refers to the associated differential privacy parameter.

Though it is possible to perform input perturbation, we exclude this option because the corresponding sensitivity scales with the length $\ell$ of the longest document and does not decay with the number of records. Therefore, the utility of input perturbation is poor even with an infinite number of records.

5.1 Perturbation on $\widehat{M}_2$, $\widehat{M}_3$ ($E_2$, $E_3$)

The composition theorem guarantees that any composition of differentially private outputs remains differentially private, with the privacy parameters summed. Perturbations on $E_2$ and $E_3$ ensure that $\widehat{M}_2$, $W$, and $\widehat{M}_3$ are differentially private, and the output of tensor decomposition is as well, due to the post-processing property. We can obtain a differentially private $W$ and $(W^{+})^{\top}$ by using differentially private PCA [9]. Unlike input perturbation, the sensitivities of $\widehat{M}_2$ (which generates both whitening matrices) and $\widehat{M}_3$ fall as the dataset size increases (Theorem 4.1).

However, an issue with this configuration is that adding noise to $E_3$ leads to a higher noise build-up prior to the tensor decomposition, since the noise is added to a full $d$-dimensional tensor: by (I.6), with high probability the norm of the error grows with the noise standard deviation $\sigma$ and the dimension (the bound would be smaller if the noise were added to a symmetric tensor of reduced size). Tensor decomposition methods, in particular [18], require the spectral norm of the perturbation to the tensor to be below a certain threshold; following arguments similar to [19], this yields an upper bound that $\sigma$ must satisfy in order to establish utility guarantees for tensor decomposition. Following similar arguments, this time using the bound on the spectral norm of the noisy matrices, the differentially private whitening and pseudo-inverse should be close to their non-differentially private values in order to guarantee utility, which in turn bounds the noise added on $E_2$. Although the privacy parameters consequently have a lower bound, that bound also falls with $n$.
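A minimal sketch of the $E_2$ branch of this configuration, in the spirit of the Gaussian input-perturbation PCA of [9]: perturb $\widehat{M}_2$ with a symmetric Gaussian matrix scaled by its sensitivity, then whiten the noisy matrix. The interface and names are our own, and the $E_3$ branch would perturb $\widehat{M}_3$ analogously.

import numpy as np

def private_whitening(M2_hat, delta_2, eps, delta, k, rng=None):
    # Config 1, E_2 branch: symmetric Gaussian perturbation of M2 [9],
    # followed by whitening of the noisy matrix; delta_2 is the sensitivity
    # of M2 from Theorem 4.1.
    rng = rng if rng is not None else np.random.default_rng()
    d = M2_hat.shape[0]
    eta = np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    E = rng.normal(0.0, delta_2 * eta, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T                 # symmetrize the noise
    U, s, _ = np.linalg.svd(M2_hat + E)
    return U[:, :k] / np.sqrt(s[:k])                 # noisy whitening matrix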

5.2 Perturbation on $\widehat{T}$ and $\widehat{M}_2$ ($E_T$, $E_2$)

This configuration has two notable properties. On the one hand, the noise level introduced is low because the whitening step reduces the tensor dimension from $d$ to $k$. On the other hand, even though the dimension of the tensor is reduced, unless the whitening matrix (which results from an eigendecomposition of $\widehat{M}_2$) is stable, the sensitivity of the whitened tensor is not necessarily low.

This configuration requires a spectral gap between $\sigma_k(\widehat{M}_2)$ and $\sigma_{k+1}(\widehat{M}_2)$, and this constraint is more likely to be met with an increasing number of records. Note, moreover, that the sensitivity of $\widehat{M}_2$ falls with $n$ (Theorem 4.1). Therefore, we expect the sensitivity of $\widehat{T}$ to drop with an increasing number of records; in fact, Theorem 4.2 states exactly this, provided the spectral gap condition holds. Thus, given the spectral gap requirement, the sensitivity of the whitened tensor falls with the dataset size. We note here that we are adding noise to a tensor of a smaller dimension, but at the expense of a sensitivity increased by a multiplicative factor. In this configuration, the whitening matrix results from a noiseless $\widehat{M}_2$, but the pseudo-inverse results from a noisy $\widehat{M}_2$; to guarantee utility, both must remain close to their non-private values. The dependence on $d$ still remains, however, as it originates from adding noise to $\widehat{M}_2$, which is still done for the un-whitening step. Utility loss analysis for this method leads to a much lower dependence on $d$.
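A sketch of this configuration, reusing gaussian_mechanism and private_whitening from above; delta_T and the privacy parameters are placeholders for the Theorem 4.2 bound and the per-edge budgets.

# Config 2 sketch: whiten with the noiseless M2, then add noise only to the
# small (k, k, k) whitened tensor.
T = np.einsum('abc,ai,bj,cl->ijl', M3_hat, W, W, W)
T_priv = gaussian_mechanism(T, delta_T, eps_T, delta_T_budget)
# The un-whitening matrix must come from a separately privatized M2
# (e.g., private_whitening above); the two budgets compose by Prop. 2.3.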

5.3 Perturbation on the output of tensor decomposition, $\lambda_i$ and $v_i$ for $i \in [k]$ ($E_{\lambda v}$, $E_2$)

This configuration adds noise to the output of the simultaneous power iteration. While it is true that the sensitivity of the output of the simultaneous power iteration is larger (Theorem 4.3), we find that this method achieves slightly better utility: the dependence on the number of topics in the last term of the utility loss drops. This is because, although the previous configuration adds noise before the decomposition at a lower sensitivity, the decomposition amplifies that error in the output.
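A sketch of this configuration, again reusing gaussian_mechanism; delta_lam and delta_v are placeholders for the Theorem 4.3 sensitivities, and B_priv is an un-whitening matrix derived from a separately privatized $\widehat{M}_2$.

# Config 3 sketch: run the decomposition noiselessly, then perturb the
# eigenpairs before un-whitening.
lams_priv = gaussian_mechanism(np.array(lams), delta_lam, eps_lv, delta_lv)
vecs_priv = gaussian_mechanism(np.array(vecs), delta_v, eps_lv, delta_lv)
# Un-whitening with a privatized M2 is post-processing of DP outputs.
mus_priv = np.array([l * (B_priv @ v) for l, v in zip(lams_priv, vecs_priv)])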

5.4 Perturbation on the final output $\mu$, $\alpha$ ($E_\mu$)

The last option we consider is to add noise to the final output. This method is arguably the simplest: the previous methods (except for input perturbation) involve the composition of two differentially private outputs, whereas this method only adds noise to one branch. We note that the worst-case sensitivity is not substantially increased by adding noise to edge $E_\mu$ instead of edge $E_{\lambda v}$: even though the sensitivity increases by the un-whitening factor, it remains of the same order. Adding noise to $E_\mu$ instead of $E_{\lambda v}$ means, however, that the noise vector increases in dimension from $k$ to $d$, which makes the utility loss larger.
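A sketch of output perturbation, reusing spectral_lda and gaussian_mechanism from above; delta_mu stands in for the Theorem 4.4 sensitivity of the final topic vectors.

# Config 4 sketch: run the whole pipeline noiselessly and add noise once
# at the end, to the d-dimensional topic vectors.
lams, mus = spectral_lda(M2_hat, M3_hat, k)
mus_priv = gaussian_mechanism(mus, delta_mu, eps, delta)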

5.5 Comparison Between the Different Configurations

Configuration 1 ($E_2$, $E_3$) has no constraints and has a sensitivity which falls as $1/n$. Configuration 2 requires that the sensitivity of $\widehat{M}_2$ be bounded by the spectral gap between its $k$-th and $(k+1)$-th singular values. Configurations 3 ($E_{\lambda v}$, $E_2$) and 4 ($E_\mu$) require an extra constraint in addition, specifically that the sensitivity of the whitened tensor be bounded by the product of its minimum spectral gap and its smallest singular value, divided by the square root of the number of topics $k$.

Finally, we present a pairwise comparison between the utilities of different configurations. In the analysis, we refer to the utility loss computations from Table 2.

Remark 5.1.

Configuration 1 vs. 2:

The utility loss in configuration 1 is high compared to configuration 2, owing to the magnitudes of the singular values of $\widehat{M}_2$: configuration 1 incurs a higher utility loss by a multiplicative factor. For configuration 1 to be preferred over configuration 2, a condition on these singular values would have to hold that is usually not met in practice. Therefore, assuming the spectral gap constraint is met, configuration 2 is preferred over configuration 1 in practice.

Remark 5.2.

Configuration 2 vs. 3:

The utility loss for configuration 3 is lower than that of configuration 2 by a multiplicative factor in the last term of the utility losses, assuming the same level of differential privacy. However, configuration 3 has the extra requirement of the whitened-tensor constraint (Theorem 4.2). Therefore, the utility of configuration 3 outperforms that of configuration 2 when the constraint is met. The advantage is enlarged when the number of topics $k$ is large.

Remark 5.3.

Configuration 3 vs. 4:

Both configurations assume the same constraints. The first two terms of the utility loss in configuration 3 are smaller than the utility loss in configuration 4. In order to compare the third term of configuration 3 with the utility loss in configuration 4, we consider different regimes: since moving the noise from $E_{\lambda v}$ to $E_\mu$ increases the noise dimension from $k$ to $d$, the third term of configuration 3 is smaller than that of configuration 4 when $k$ is small relative to $d$. Therefore, configuration 3 is preferred in that regime.

Overall, if the constraint required by configuration 3 is met, it will likely provide the best utility for a given level of differential privacy. The configuration achieving the next best utility is configuration 2, which only requires a constraint on the spectral gap of $\widehat{M}_2$. Configuration 1 does not require any constraints; however, its utility is likely to be poor due to its dependence on the vocabulary dimension $d$.

6 Conclusion

We have provided an end-to-end analysis of a differentially private LDA model using a spectral algorithm. The algorithm involves a data flow that permits different locations for injecting noise. We present a detailed sensitivity and utility analysis for the different differentially private configurations. Our analysis shows that no one configuration dominates, and characterizes the constraints and operating regimes under which different configurations are optimal.

References

  • [1] Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
  • [2] Anima Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.
  • [3] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
  • [4] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
  • [5] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138. ACM, 2005.
  • [6] Kamalika Chaudhuri, Anand Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems, pages 989–997, 2012.
  • [7] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
  • [8] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [9] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 11–20. ACM, 2014.
  • [10] Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
  • [11] Hafiz Imtiaz and Anand D Sarwate. Symmetric matrix perturbation for differentially-private principal component analysis. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 2339–2343. IEEE, 2016.
  • [12] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1395–1414. SIAM, 2013.
  • [13] Mijung Park, James Foulds, Kamalika Chaudhuri, and Max Welling. Private topic modeling. arXiv preprint arXiv:1609.04120, 2016.
  • [14] Mijung Park, Jimmy Foulds, Kamalika Chaudhuri, and Max Welling. Dp-em: Differentially private expectation maximization. arXiv preprint arXiv:1605.06995, 2016.
  • [15] G. W. Stewart. Matrix perturbation theory, 1990.
  • [16] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
  • [17] Ryota Tomioka and Taiji Suzuki. Spectral norm of random tensors. arXiv preprint arXiv:1407.1870, 2014.
  • [18] Po-An Wang and Chi-Jen Lu. Tensor decomposition via simultaneous power iteration. In International Conference on Machine Learning, pages 3665–3673, 2017.
  • [19] Yining Wang and Anima Anandkumar. Online and differentially-private tensor decomposition. In Advances in Neural Information Processing Systems, pages 3531–3539, 2016.
  • [20] James Y Zou, Daniel J Hsu, David C Parkes, and Ryan P Adams. Contrastive learning using spectral methods. In Advances in Neural Information Processing Systems, pages 2238–2246, 2013.

Appendix: An end-to-end Differentially Private Latent Dirichlet Allocation Using a Spectral Algorithm

Appendix A LDA Topic Model

Topic Proportions

The proportion of words belonging to different topics (denoted as $h_i$) for document $i$, also known as the topic proportion, is drawn from a Dirichlet distribution $\mathrm{Dir}(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_k)$ and $\alpha_0 = \sum_{i=1}^{k} \alpha_i$.

Word Distribution for Topics

LDA remains simple in that each word belongs to only one of the $k$ topics. We denote the topic of word $j$ in document $i$ as $z_{i,j}$. Therefore, $z_{i,j} \sim \mathrm{Cat}(h_i)$, where Cat denotes the categorical distribution.

Word Generation

After the topics of the words are determined, words are assumed to be generated conditionally independently, i.e., word $x_j^{(i)} \sim \mathrm{Cat}(\mu_{z_{i,j}})$. The conditional distributions under different topics, $\mu_1, \ldots, \mu_k$, are linearly independent. In a document, if a token is the $t$-th word in the dictionary, we denote it as $e_t$, where $e_t$ is the $t$-th basis vector.

For the LDA model, we define the first three LDA moments in Definition 3.1. Using the properties of LDA, the moments can be decomposed into factors, and those factors are exactly the model parameters we aim to estimate. This is depicted in Lemma 3.2. Therefore, as long as we can empirically estimate the moments $M_1$, $M_2$, and $M_3$ without bias, we obtain the model parameters $\alpha$ and $\mu$ by implementing tensor decomposition on those empirically estimated moments.

Unbiased Empirical Moment Estimators

Here we list the mathematical forms of the first, second, and third order moment estimators for the single topic case [20]. Let $c_i$ be the word-count vector for document $i$, and let $\ell_i = \sum_t (c_i)_t$ denote its length. For notational simplicity, we define some intermediate variables:

(7)
(8)
(9)

Appendix B Method of Moments for Latent Dirichlet Allocation

Definition B.1.

The moment estimators for LDA moments are

(10)
(11)
(12)

where

(13)
(14)

and the remaining terms are formed from the first by permuting the modes accordingly.

Based on the definition of the empirical estimators of the moments, we prove that these estimators are unbiased.

Lemma B.2 (Unbiased Moment Estimators).

The estimators defined in Definition B.1 are unbiased, i.e.,

$\mathbb{E}[\widehat{M}_1] = M_1$ (15)
$\mathbb{E}[\widehat{M}_2] = M_2$ (16)
$\mathbb{E}[\widehat{M}_3] = M_3$ (17)
Proof.

We present a proof for the second-order moment; the results for the first- and third-order moments can be derived through a similar analysis. For the first term of $\widehat{M}_2$, unbiasedness is clear, as this is the single topic case. For the second term, we use an identity for the cross terms and observe that:

Expanding the first term using the identity and canceling with the second term, we get the following: