Consistent Estimation of Low-Dimensional Latent Structure in High-Dimensional Data

Xiongzhi Chen (xiongzhi@princeton.edu)
John D. Storey (jstorey@princeton.edu)
Center for Statistics and Machine Learning and
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544, USA
 
October 2015
Abstract

We consider the problem of extracting a low-dimensional, linear latent variable structure from high-dimensional random variables. Specifically, we show that under mild conditions and when this structure manifests itself as a linear space that spans the conditional means, it is possible to consistently recover the structure using only information up to the second moments of these random variables. This finding, specialized to one-parameter exponential families whose variance function is quadratic in their means, allows for the derivation of an explicit estimator of such latent structure. This approach serves as a latent variable model estimator and as a tool for dimension reduction for a high-dimensional matrix of data composed of many related variables. Our theoretical results are verified by simulation studies and an application to genomic data.


Keywords:

exponential family distribution, factor analysis, high-dimensional data, latent variable model, spectral decomposition

1 Introduction

Low-dimensional latent variable models are often used to capture systematic structure in the conditional mean space of high-dimensional random variables (rv’s). This has been a popular strategy in high-dimensional probabilistic modeling and data analysis, and it serves as an attractive strategy for dimension reduction and recovering latent structure. Examples include factor analysis (Bartholomew et al., 2011), probabilistic principal components analysis (PCA) (Tipping and Bishop, 1999), non-negative matrix factorization (Lee and Seung, 1999), asymptotic PCA (Leek, 2011), latent Dirichlet allocation (LDA) (Blei et al., 2003), and exponential family distribution extensions of PCA (Collins et al., 2001).

Let be an observed data matrix of variables (one variable per row), each with observations, whose entries are rv’s such that

(1)

where is the expectation operator, is a matrix of latent variables, and is a matrix of coefficients relating the latent variables to the observed variables. Furthermore, the dimensions are such that . In model (1), the conditional mean of for a fixed only depends on the th column of , and each row of the conditional mean matrix lies in the row space of . The latent structure in is therefore induced by .
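For concreteness, the model can be sketched as follows, where all symbols are illustrative choices made here (a data matrix Y with m rows and n columns, a d × n matrix Λ of latent variables, and an m × d coefficient matrix Φ) and need not match the authors' notation:

\[
\mathbb{E}\left[\mathbf{Y} \mid \boldsymbol{\Lambda}\right] = \boldsymbol{\Phi}\,\boldsymbol{\Lambda},
\qquad
\mathbf{Y} \in \mathbb{R}^{m \times n},\;
\boldsymbol{\Phi} \in \mathbb{R}^{m \times d},\;
\boldsymbol{\Lambda} \in \mathbb{R}^{d \times n},\;
d \ll \min(m, n).
\]

Under this sketch, each row of the conditional mean matrix is a linear combination of the rows of Λ, so the latent structure of interest is the row space of Λ.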

The above model is a general form of several widely used models. This includes instances of factor analysis (Bartholomew et al., 2011), probabilistic PCA (Tipping and Bishop, 1999), mixed membership clustering in population genetics (Pritchard et al., 2000; Alexander et al., 2009), which is closely related to LDA, and non-negative matrix factorization (Lee and Seung, 1999). Whereas the specialized models are often focused on the probabilistic interpretation of the columns of , we are instead interested here in its row space, . This row space is sufficient for: (i) characterizing systematic patterns of variation in the data , which can be used for exploratory data analysis or dimension reduction; (ii) accounting for the latent variables in downstream modeling of (Leek and Storey, 2007, 2008) that requires adjustment for these variables; (iii) potentially identifying suitable initial values or geometric constraints for algorithms that estimate probabilistically constrained versions of ; or (iv) recovering itself if additional geometric properties are known (e.g., as in Arora et al. (2013)). Furthermore, many of the above models make assumptions about the probability distribution of that may be untrue on a given data set. One can compare our estimate of (which makes minimal assumptions) to the space induced by the model-based estimates of to gauge the accuracy of the model assumptions and fit. Therefore, we focus on estimation of the latent variable space . Estimation of may be tractable, but we do not focus on that here.

Leek (2011) and Anandkumar et al. (2012, 2015) have carried out work that is complementary to that presented here. They both study moment estimators of linear latent variable models applied to high-dimensional data. We explain how our work is related to Leek (2011) in Section 4.3 and how it is related to Anandkumar et al. (2012, 2015) in Appendix A. The strategies employed in these papers have ties to what we do here; however, they each consider different probabilistic models with theory that does not directly apply to the models we study.

We show that both the row space of (in Section 2) and the row rank of (in Section 3) can be consistently estimated using information from a suitably adjusted matrix . In Section 4, we specialize these general results to ’s that, conditional on , come from exponential family distributions. In particular, we explicitly construct a nonparametric, consistent estimator of the row space of for rv’s that follow the natural exponential family (NEF) with quadratic variance function (QVF) using information only up to their second moments, and the estimator is computationally straightforward to implement. In Section 5, we extend the results of previous sections to the case where is random. A simulation study is conducted in Section 6 to verify our theoretical findings, and we apply the estimators to a genomics data set in Section 7. Finally, we end the article with a discussion in Section 8, collect in Appendix B all technical proofs, and present the full set of results from simulation studies in Appendix C and Appendix D.

2 Almost Surely Estimating the Latent Linear Space

The big picture strategy we take is summarized as follows. Carrying out a decomposition related to Leek (2011), we first expand into four components:

(3)

We then show the following about each of the components as , under the assumptions given in detail below:

  • : this term may be estimated arbitrarily well as by a diagonal matrix defined and studied below;

  • and : these terms converge to the zero matrix;

  • : this term converges to a symmetric matrix with positive eigenvalues whose leading eigenvectors span the row space of .

Once the convergence of these terms is rigorously established, the strategy we take is to form an estimate of term , denoted by , and show that the space spanned by the leading eigenvectors of converges to the row space of as . We also provide a procedure to estimate the dimension of the row space of and incorporate it into this estimation framework.
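In the illustrative notation introduced after model (1) (again an assumption made here, not the authors' symbols), writing E = Y − ΦΛ for the matrix of residuals, the expansion in (3) can be sketched, after scaling by the number of variables m, as

\[
\frac{1}{m}\,\mathbf{Y}^{\top}\mathbf{Y}
= \frac{1}{m}\,\boldsymbol{\Lambda}^{\top}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\boldsymbol{\Lambda}
+ \frac{1}{m}\,\boldsymbol{\Lambda}^{\top}\boldsymbol{\Phi}^{\top}\mathbf{E}
+ \frac{1}{m}\,\mathbf{E}^{\top}\boldsymbol{\Phi}\boldsymbol{\Lambda}
+ \frac{1}{m}\,\mathbf{E}^{\top}\mathbf{E},
\]

where the last term is the one approximated by a diagonal matrix of average variances, the two cross terms are the ones that converge to zero, and the first term is the one whose leading eigenvectors asymptotically span the row space of Λ.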

2.1 Model Assumptions

We first define the matrix norm for any real matrix and let denote a generic, finite constant whose value may vary at different occurrences. For now we assume that is deterministic; the results in this section are extended to the case where is random in Section 5. The assumptions on model (1) are as follows:

A1)

and is finite; are jointly independent with variance such that (which implies ), where is the variance operator.

A2)

, where is the th row of . Further, for some and some non-negative sequence ,

(4)

Since we are considering model (1) for which is conditioned on , all random vectors in this model are by default conditioned on unless otherwise noted (e.g., Section 5). For conciseness we will omit stating “conditional on ” in this default setting.

We state a consequence of the assumption A2) as Lemma 2.1, whose proof is straightforward and omitted.

Lemma 2.1.

If the assumption A2) holds, then

for each and . The uniform boundedness results provided by Lemma 2.1 will be used to prove the convergence results later.

2.2 Asymptotically Preserves the Latent Linear Space

We first derive the asymptotic form of with the aid of the strong law of large numbers (SLLN) in Walk (2005). Let be the column-wise average variance of (where as defined above) and

(5)

the diagonal matrix composed of these average variances.

Theorem 2.2.

Under the assumptions for model (1),

(6)

where .

Theorem 2.2 shows that becomes arbitrarily close to as the number of variables . In fact, it gives much more information on the eigensystem of (we take the convention that an eigenvector always has norm ). Let be the eigenvalues of ordered into for (where the ordering among repeated copies of an eigenvalue with multiplicity greater than one is arbitrary),

and be the eigenvalues of ordered into for .

Corollary 2.2.

Under the assumptions for model (1),

(7)

Further, a.s. and

(8)

where denotes the linear space spanned by its arguments, and the symmetric set difference.

Corollary 2.2 reveals that asymptotically as the eigenvalues of converge to those of when both sets of eigenvalues are ordered the same way, that the dimension of the space spanned by all the eigenvectors corresponding to the largest eigenvalues of converges to as , and that is asymptotically spanned by the leading dimensional joint eigenspace induced by as . When the nonzero eigenvalues of are distinct, we easily have

(9)

where, modulo a sign, is the eigenvector corresponding to and that to .

When the dimension of the latent space and the diagonal matrix of the column-wise average variances are known, it follows by Corollary 2.2 that asymptotically spans the latent space , and converges to the row space with probability 1. However, in practice both and need to be estimated, which is the topic of the next three sections. Estimating the number of latent variables is in general a difficult problem. In our setting we also must accurately estimate , which can be a difficult task when the variances may all be different (i.e., heteroskedastic).

3 Consistently Estimating the Latent Linear Space Dimension

The strategy we take to consistently estimate the dimension of the latent variable space is to carefully scale the ordered eigenvalues of and identify the index of the eigenvalue whose magnitude separates the magnitudes of these eigenvalues into two particular groups when is large. Recall that, by Theorem 2.2 and Corollary 2.2, the difference between the vector of descendingly ordered eigenvalues of and that of converges to zero as . However, since and , we know that the largest eigenvalues of are positive but the rest are zero. This means that the largest eigenvalues of are all strictly positive as , while the smallest eigenvalues of converge to as . Depending on the speed of convergence of the smallest eigenvalues of to , if we suitably scale the eigenvalues of , then the scaled, ordered eigenvalues will eventually separate into two groups, those with very large magnitudes and the rest with very small magnitudes. The index at which this separation occurs is then a consistent estimator of . If we replace with an estimator that satisfies a certain level of accuracy detailed below, then the previous reasoning applied to the eigenvalues of will also give a consistent estimator of .

To find the scaling sequence for the magnitudes of the eigenvalues of or equivalently the speed of convergence of the smallest eigenvalues of to , we define the matrix with entry , and study as a whole the random part of defined by

(10)

Note that the terms , and in correspond to those from equation (3). We will show that configured as a vector possesses asymptotic Normality after centering and scaling as . This then reveals that the scaling sequence for the eigenvalues of should be at least proportional to .

Let , and define

(11)

for , where and are respectively the th row of and the th column of . We have the following result.

Proposition 11. Under model (1), is a linear function of with global Lipschitz constant . If we assume A1) and A2) from Section 2 and also

A3)

The sequences , and for any are all convergent as ,

then converges in distribution to a multivariate Normal random vector with mean zero.

With the concentration property of established by Proposition 11, we will be able to explore the possibility of consistently estimating by studying the magnitudes of eigenvalues of

(12)

where each is an estimate of for , and the ’s are ordered into for .

Theorem 3.

Under the assumptions A1), A2) and A3), if

A4)

for some non-negative such that ,

(13)

then

(14)

where is the probability measure and

(15)

Further, for any such that and , as ,

(16)

for any fixed . Therefore, letting

(17)

gives

(18)

where is the indicator of a set .

Theorem 3 shows that the speed of convergence, , of the eigenvalues of (and those of ) is essentially determined by those related to and ; see equations (13)–(15). Further, it reveals that, when is known, with probability approaching as , the scaled, ordered eigenvalues eventually separate into two groups: for all lie above but the rest all lie below for any chosen . In other words, for a chosen , the number of when is large is very likely equal to . However, in practice and are unknown, and even if they are known or can be estimated, unfortunately the hidden constants in (14) are unknown (even when ). Further, when is finite, the smallest eigenvalues of may not yet be identically zero, the hidden constants may have a large impact on estimating the scaling sequence , and rates slightly different from may have to be chosen to balance the effects of the hidden constants; see Section 6 for a brief discussion of the estimation of and the effects of choosing on estimating for finite .
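In the illustrative notation used in the earlier sketches, the estimator defined through (17) can be summarized as counting how many scaled, ordered eigenvalues of the adjusted matrix exceed a fixed threshold:

\[
\hat{r} \;=\; \#\left\{\, k :\; \frac{\hat{\lambda}_{k}}{c_m} > \tau \,\right\},
\]

where \hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots are the ordered eigenvalues, c_m denotes the scaling sequence from (15), and \tau > 0 is a fixed threshold; per (18), this count equals the true dimension with probability tending to one. The specific symbols here are assumptions for exposition only.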

Before we apply our theory to ’s that follow specific parametric distributions, we pause to comment on the key results obtained so far. Theorem 2.2 and Corollary 2.2 together ensure that asymptotically as we can span the row space of by the leading eigenvectors of (see the definition of in (5)). However, the conclusions in these results are based on a known and . In contrast, Proposition 11 and Theorem 3 establish similar assertions to those of Theorem 2.2 and Corollary 2.2 by replacing the unknown by its consistent estimate , and they enable us to construct a consistent estimate of such that the linear space spanned by the leading eigenvectors of consistently estimates as . This reveals that to consistently estimate in a fully data-driven approach using our theory, it is crucial to develop a consistent estimate of .

4 Specializing to Exponential Family Distributions

We specialize the general results obtained in Section 2 and Section 3 for model (1) to the case when follow the single-parameter exponential family probability density function (pdf) given by

(19)

where is the canonical link function, and and are known functions such that is a proper pdf. The values that can take are . The following corollary says that our general results on consistent estimation of the latent space and its dimension hold when the link function is bounded on the closure of the parameter space .

Corollary 19. Suppose is an open subset of such that its closure lies in the interior of . If conditional on each has pdf of the form (19) with and is bounded as a function of in the closure of , then , which implies that Theorem 2.2, Corollary 2.2, Proposition 11 and Theorem 3 hold.

We remark that the boundedness of on the closure of is not restrictive, in that in practice either is bounded in Euclidean norm or is bounded in supremum norm. The proof of Corollary 19 follows from the simple observation that is analytic in when (see, e.g., Letac and Mora, 1990) and that the derivative of in of any order is bounded when is bounded; the proof is thus omitted.

4.1 Estimating

With Corollary 19 (see also the discussion at the end of Section 3), we only need to estimate (which in turn yields ) in order to consistently estimate and . To obtain an estimate of when potentially all are different from each other (i.e., complete heteroskedasticity), we exploit the intrinsic relationship between and when come from a certain class of natural exponential family (NEF) distributions.

Lemma 4.1.

Let have marginal pdf (19). Then there exists a quadratic function such that if and only if parametrized by forms an NEF with quadratic variance function (QVF) defined by Morris (1982) such that

(20)

for some . Specifically, (20) implies

(21)

The proof of Lemma 4.1 is straightforward and omitted. Table 1 lists for the six NEFs with QVF.

Distribution   Function
Normal   1
Poisson
Binomial
NegBin
Gamma
GHS
Table 1: The function such that when has pdf (19) and comes from an NEF with QVF in Morris (1982). Note that for , and “GHS” stands for “generalized hyperbolic secant distribution”.
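As a point of reference, one standard way to obtain such a function under Morris's QVF parameterization, written here (as an assumption made for illustration, not taken from the table) as Var(Y) = v_0 + v_1 μ + v_2 μ² with μ = E[Y], is through the following unbiasedness calculation:

\[
\mathbb{E}\!\left[v_0 + v_1 Y + v_2 Y^{2}\right]
= v_0 + v_1\mu + v_2\left(\mu^{2} + \operatorname{Var}(Y)\right)
= (1 + v_2)\operatorname{Var}(Y),
\qquad\text{so}\qquad
g(Y) = \frac{v_0 + v_1 Y + v_2 Y^{2}}{1 + v_2}
\]

satisfies E[g(Y)] = Var(Y) whenever v_2 ≠ −1. For instance, in the Poisson case (v_0 = v_2 = 0, v_1 = 1) this construction gives g(y) = y.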

Given the function obtained in Lemma 4.1, we now state a general result on how to explicitly construct a that properly estimates .

Lemma 4.2.

Let have pdf (19) such that for some function satisfying

(22)

Then

(23)

satisfies a.s. for each . If additionally, for each and some ,

(24)

then converges in distribution to a Normal random variable as .

Lemma 4.2 shows that in (12) with defined by (23) satisfies . Note that the first assertion in Lemma 4.2, i.e., a.s., clearly applies to that follow NEFs with QVF when their corresponding are in a set described in Corollary 19. We remark that requiring the closure of to be in the interior of is not restrictive, since in practice the ’s are not boundary points of .

4.2 Simultaneously Consistently Estimating the Latent Linear Space and Its Dimension

We are ready to present a consistent estimator of the latent linear space :

  1. Set as (12) with as in (23).

  2. Estimate as using (18) as given in Section 3.

  3. From the spectral decomposition where and , pick columns of corresponding to the largest .

  4. Set

    (25)

    to be the estimate of . Note that can be regarded as an estimator of even though it is not our focus.

The above procedure is supported by the following theorem, whose proof is straightforward and omitted. More specifically, in (25) consistently estimates .

Theorem 4.2. Under the assumptions of Theorem 2.2 and Lemma 4.2, a.s. and a.s. If additionally the conditions of Theorem 3 are satisfied, then equation (18) holds and .
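A minimal numerical sketch of steps 1–4 is given below for Poisson-distributed entries, for which the variance equals the mean and so each column-wise average variance can be estimated by the corresponding column average (g(y) = y in the sense of Table 1). The matrix orientation, the scaling value, and the threshold are illustrative assumptions rather than the authors' exact choices; in particular, Section 6 notes that the scaling must be chosen carefully in practice.

```python
import numpy as np

def estimate_latent_space(Y, scale, tau):
    """Sketch of the latent-space estimator of Section 4.2 for Poisson data.

    Y     : (m, n) data matrix, one variable per row, n observations.
    scale : scaling value for the eigenvalues (see Theorem 3); in practice it
            must be chosen carefully, e.g., via the strategy discussed in
            Section 6.
    tau   : fixed threshold applied to the scaled eigenvalues.
    Returns the estimated dimension r_hat and an (r_hat, n) matrix whose rows
    span the estimated latent linear space.
    """
    m, n = Y.shape

    # Step 1: adjusted matrix.  For Poisson data, g(y) = y, so the column-wise
    # average variances are estimated by the column means of Y.
    sigma2_hat = Y.mean(axis=0)                      # length-n vector
    S_hat = (Y.T @ Y) / m - np.diag(sigma2_hat)      # (n, n) symmetric matrix

    # Step 2: spectral decomposition with eigenvalues in descending order.
    eigvals, eigvecs = np.linalg.eigh(S_hat)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: estimate the dimension by counting scaled eigenvalues above tau.
    r_hat = int(np.sum(eigvals / scale > tau))

    # Step 4: the leading r_hat eigenvectors span the estimated latent space.
    basis = eigvecs[:, :r_hat].T                     # (r_hat, n)
    return r_hat, basis
```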

4.3 Normal Distribution with Unknown Variances

One of the exponential family distributions we considered above is . Suppose instead we assume that , where are unknown. Leek (2011) studies this important case, and obtains several results related to ours. Let be the ordered singular values resulting from the singular value decomposition (SVD) of . If we regress the top right singular vectors in this SVD from each row of , this yields total residual variation that is of proportion to the total variation in . In order to estimate , Leek (2011) employs the estimate

Using our notation, Leek (2011) then sets , where is the identity matrix, and proceeds to estimate based on as we have done. However, it must be the case that in order for to be well-behaved, so the assumptions and theory in Leek (2011) have several important differences from ours. We refer the reader to Leek (2011) for specific details on this case. Taken together with those of Leek (2011), our results cover a large proportion of the models utilized in practice.
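A rough sketch of a residual-variance computation in the spirit of the verbal description above is given below; it follows the idea of regressing the top right singular vectors out of each row of the data and averaging the residual variation, and it is not intended to reproduce Leek's (2011) exact estimator or its normalization.

```python
import numpy as np

def residual_variance_after_svd(Y, r):
    """Average residual variance of Y after removing its top-r right singular
    vectors from each row.  A rough illustration only; see Leek (2011) for the
    estimator actually analyzed in that work."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V_r = Vt[:r].T                         # (n, r) top right singular vectors
    residuals = Y - (Y @ V_r) @ V_r.T      # residuals after projecting rows onto span(V_r)
    return float(residuals.var())
```

The resulting scalar would then multiply the identity matrix to form the adjustment described above.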

5 Letting the Latent Variable Coefficients Be Random

We now discuss the case where is random but then conditioned. Assume that is a random matrix with entries defined on the probability space , and that the entries of , conditional on and , are defined on the probability space . Rather than model (1), consider the following model:

(26)

Suppose assumption A4) holds (see (13)) and:

A1′)

and is finite; conditional on and , are jointly independent with variance such that (which implies ), where and are the expectation and variance wrt .

A2′)

Either A2) holds -a.s., i.e., and

(27)

or A2) holds in probability , i.e., as , and

(28)
A3′)

Conditional on , , and for any are all convergent -a.s. as .

Note that assumptions A1′), A2′) and A3′) are the probabilistic versions of assumptions A1), A2) and A3) that also account for the randomness of . Recall assumption A2) when is deterministic, i.e., for some and some non-negative sequence , . Assumption A2) implies that (see also Lemma 2.1):

(29)

These two uniform boundedness results in equation (29) are then used to show: the a.s. convergence in Theorem 2.2, which in turn yields the a.s. convergence in Corollary 2.2; the validity of Lindeberg’s condition in (34), which yields Proposition 11; and the convergence in probability in equations (14) and (16), which yields Theorem 3, Corollary 19, and Theorem 4.2.

Let be such that (27) does not hold; then . On the other hand, if (28) holds, then for any positive constants and there exists a such that the set

satisfies

whenever . Now if (27) holds, then the results on the a.s. convergence and on convergence in probability for remain true with respect to when is replaced by , as long as each is such that . In contrast, if (28) holds, then the results on the a.s. convergence for reduce to convergence in probability in . But those on convergence in probability for remain true with respect to when is replaced by , as long as each is with (where can be chosen to be small). Therefore, the results (except Lemmas 4.1 and 4.2) obtained in the previous sections hold with their corresponding probability statements, whenever, for some ,

(30)

holds a.s. or in probability in , and they allow to be random (and conditioned) or deterministic. Statistically speaking, an event with very small or zero probability is very unlikely to occur. This means that the practical utility of these results is not affected when (30) holds in the sense of (27) or (28) when is large.

Finally, we remark that Lemmas 4.1 and 4.2 are independent of assumptions A1), A2) and A3), and of A1′), A2′) and A3′). In other words, for model (26), Lemma 4.1 remains valid, and so does Lemma 4.2 when the assumptions involved there are replaced by those on .

6 A Simulation Study

We conducted a simulation study to demonstrate the validity of our theoretical results. The quality of is measured by

where is an estimate of , and . In fact, measures the difference between the orthogonal projections (as linear operators) induced by the rows of and those of respectively, and if and only if . We propose a rank estimator , directly applied to , based on Theorem 3. Specifically, for some positive and is used to scale the eigenvalues of , for which (dependent on ) is determined by a strategy of Hallin and Liska (2007) (also used by Leek, 2011); for more details, we refer the reader to these two references. We do so since the assertions in Theorem 3 are valid only for , assumption A3) may not be satisfied in our simulation study, and the unknown constants in equation (14) for the speed of convergence need to be estimated. However, we caution that, if the scaling sequence in Theorem 3 defined via (15) is not estimated well enough to capture the speed of convergence of the eigenvalues of , defined by (17) as an estimate of can be inaccurate; see our simulation study results.
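For reference, the projection-based discrepancy described above can be computed as in the following sketch; the function names and the use of the spectral norm are assumptions made here for illustration, since the exact norm is not reproduced in this excerpt.

```python
import numpy as np

def row_space_projection(A, tol=1e-10):
    """Orthogonal projection (n x n) onto the row space of the (k, n) matrix A."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[s > tol].T                 # orthonormal basis of the row space
    return V @ V.T

def projection_distance(L_hat, L, ord=2):
    """Norm of the difference between the projections onto the row spaces of
    L_hat and L; it is zero exactly when the two row spaces coincide."""
    P_hat = row_space_projection(L_hat)
    P = row_space_projection(L)
    return float(np.linalg.norm(P_hat - P, ord=ord))
```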

6.1 Simulation Study Design

The settings for our simulation study are described below:

  1. We consider , and .

  2. The mean matrix and the observed data are generated within the following scenarios:

    1. , , and .

    2. , - with degrees of freedom and non-centrality , and .

    3. with and . is such that for , for and , and otherwise.

    4. with , , and .

    5. with shape and mean , , and .

  3. For each combination of values in Step 1 and distributions in Step 2 above, the data matrix is generated independently times and then the relevant statistics are calculated.

For each simulated data set, we measure the quality of by , that of by , and we record .

Number of variables   Binomial   Gamma     NegBin    Normal   Poisson
1000                  3 (97)     24 (4)    3         100      84 (6)
5000                  95 (5)     30 (9)    4         96       94
10,000                100        33 (22)   27 (9)    96       89
100,000               100        90        43        99       94
Table 2: Assessment of the estimator . Data were simulated under model (1) with and under the five distributions listed. Shown is the number of times that among simulated data sets for each scenario. Also shown in parentheses is the number of times that , if any instances occurred. Results from several additional scenarios are considered in Appendix C.

6.2 Simulation Study Results

Figure 1 and Table 2 display the performance of the nonparametric estimator of , that of , and that of when and ; for other values of and , the performance of is provided in Appendix C and that of in Appendix D. The following conclusions can be drawn from the simulation study:

Figure 1: Performance of the nonparametric estimator of when and . Column 1: Boxplots of the difference between the row spaces spanned by