Rich Component Analysis

Rich Component Analysis

Rong Ge
Microsoft Research, New England
rongge@microsoft.com
James Zou
Microsoft Research, New England
jazo@microsoft.com
Abstract

In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might have one set of molecular observations and one set of physiological observations on the same group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations from different views are driven by different sets of latent components, and each component can be a complex, high-dimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the components without having to model the other components. We show how to integrate RCA with stochastic gradient descent into a meta-algorithm for learning general models, and demonstrate substantial improvement in accuracy on several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn latent variable models when we don’t have samples from the true model but only samples after complex perturbations.

1 Introduction

A hallmark of modern data deluge is the prevalence of complex data that capture different aspects of some common phenomena. For example, for a set of patients, it’s common to have multiple modalities of molecular measurements for each individual (gene expression, genotyping, etc.) as well as physiological attributes. Each set of measurements corresponds to a view on the samples. The complexity and the heterogeneity of the data is such that it’s often not feasible to build a joint model for all the data. Moreover, if we are particularly interested in one aspect of the problem (e.g. patterns that are specific to a subset of genes that are not shared across all genes), it would be wasteful of computational and modeling resources to model the interactions across all the data.

More concretely, suppose we have two sets (views) of data, and , on a common collection of samples. We model this as and , where captures the latent component specific to , is specific to , and is common to both and and is related in the two views by an unknown linear transformation . Each component can be a complex, high-dimensional distribution. The observed samples from and are component-wise linear combinations of the unobserved samples from . To model all the data, we would need to jointly model all three , which can have prohibitive sample/computation complexity and also prone to model misspecification. Ideally, if we are only interested in the component that’s unique to the first view, we would simply write down a model for without making any parametric assumptions about and , except that they are independent.

In this paper, we develop a general framework of Rich Component Analysis (RCA) to explore such multi-component, multi-view datasets. Our framework allows for learning an arbitrarily complex model of a specific component of the data, , without having to make parametric assumptions about other components . This allows the analyst to focus on the most salient aspect of data analysis. The main conceptual contribution is the development of new algorithms to learn parameters of complex distributions without any samples from that distribution. In the two-view example, we do not observe samples from our model of interest, . Instead the observations from are compositions of true samples from with complex signal from another process which is shared with . Our approach performs consistent parameter estimation of without modeling , .

Outline.

RCA consists of two stages: 1) from the observed data, extract all the cumulants of the component that we want to model; 2) using the cumulants, perform method-of-moments or maximum likelihood estimation (MLE) of model parameters via polynomial approximations to gradient descent. We introduce the relevant properties of cumulants and tensors in Section 2. In Section 3, we develop the formal models for Rich Component Analysis (RCA) and the cumulant extraction algorithms. We discuss how RCA differs from existing models. Section 4 shows how to integrate the extracted cumulants with method-of-moments or stochastic gradient descent for MLE inference. We show the performance gains of RCA in Section 5. All the proofs are in the Appendix.

2 Preliminaries

In this section we introduce the basics of cumulants. For more information please refer to Appendix A. Cumulants provide an alternative way to describe the correlations of random variables. Unlike moments, cumulants have the nice property that the cumulant of sum of independent random variables equals to the sum of cumulants. For a random variable the cumulant is defined to be the coefficients of the cumulant generating function .

We can also define cross-cumulants which are cumulants for different variables (e.g. covariance). For variables , their cross-cumulant can be computed using the following formula:

 κt(X1,...,Xt)=∑π(|π|−1)!(−1)|π|−1∏B∈πE[∏i∈BXi].

In this formula, is enumerated over all partitions of , is the number of parts and runs through the list of parts. We also use when it’s the same random variable.

We can similarly define cumulants for multivariate distributions. For random vector , the -th order cumulant (and -th order moment) is an object in (a -th order tensor). The -th coordinate of cumulant tensor is . We often unfold tensors into matrices. Tensor unfolds into matrix : Cumulants have several nice properties that we summarize below.

• Suppose are random variables in . The -th order cumulant is a tensor in (X_1,…,X_t)(Y_1,…,Y_t)κ_t(X_1+Y_1,…,X_t+Y_t) = κ_t(X_1,…,X_t) + κ_t(Y_1,…,Y_t)κ_t(c_1 X_1,…,c_t X_t) = c_1c_2⋯c_t κ_t(X_1,…,X_t)A).

• (Computation) The cumulant can be computed in time.

The second order cross-cumulant, is equal to the covariance . Higher cumulants measures higher-order correlations and also provide a measure of the deviation from Gaussianity–all 3rd and higher order cumulants of Gaussian random variables are zero.

3 Rich Component Analysis

In this section, we show how to use cumulant to disentangle complex latent components. The key ideas and applications of RCA are captured in the contrastive learning setting when there are two views. We introduce this model next and then show how to extend it to general settings.

3.1 RCA for contrastive learning

Recall the example in the introduction where we have two views of the data, formally,

 U=S1+S2,V=AS2+S3. (1)

Here, are independent random variables that can have complicated distributions; is an unknown linear transformation. The observations consist of pairs of samples . Each pair is generated by drawing independent samples and adding these samples component-wise to obtain and . Note that the same shows up in both and , introducing correlation between the two views. We are interested in learning properties about , for example learning its maximum likelihood (MLE) parameters. For concreteness, we focus our discussion on learning although our techniques also apply to and . We don’t have any samples from . The observations of involves a potentially complicated perturbation by . Our hope is to remove this perturbation by utilizing the second view , and we would like to do this without assuming a particular model for or .

Note that the problem is inherently under-determined: it is impossible to find the means of , , without any additional information. This is in some sense the only ambiguity, as we will see if we know the mean of one distribution it is possible to extract all order cumulants of . For simplicity throughout this section we assume the means of are 0 (given the mean of any of , we can always use the means of and to compute the means of other distributions, and shift them to have mean 0).

Determining linear transformation

First we can find by the following formula:

 A⊤=unfold(κ4(V,U,U,U))†unfold(κ4(V,U,U,V)). (2)
Lemma 3.1.

Suppose the unfolding of the 4-th order cumulant has full rank, given the exact cumulants and , the above algorithm finds the correct linear transformation in time .

Intuitively, since only appears in both and , the cross-cumulants and depend only on . Also, by linearity of cumulants we must have (see Appendix B.1). In the lemma we could have used third order cumulants, however for many distributions (e.g. all symmetric distributions) the third order cumulant is 0. Most distributions satisfy the condition that is full rank, the only natural distribution that does not satisfy this constraint is the Gaussian distribution (where is 0).

Extracting cumulants

Even when the linear transformation is known, in most cases it is still information theoretically impossible to find the values of the samples as we only have two views. However, we can still hope to learn useful information about the distributions . In particular, we derived the following formulas to estimate the cumulants of the distributions:

 κt(S1) = κt(U)−κt(U,U,...,U,A−1V), (3) κt(S2) = κt(U,U,...,U,A−1V), (4) κt(S3) = κt(V)−κt(AU,V,V,...,V). (5)
Theorem 3.2.

For all , Equations (3)-(5) compute the -th order cumulants for in time

Proof of Theorem 3.2 relies on the fact that since only appears in both and , the cross-cumulant captures the cumulant of . Moreover, by independence, , so we can recover by subtracting off the estimated (and similarly for ). When the dimension of is smaller than the dimension of and has full column rank, the above formula with pseudo-inverse in place of still recovers all cumulants. In Appendix B.1, we prove that both the formulas for computing and for extracting the cumulants are robust to noise. In particular, we give the sample complexity for learning and from samples of and , both are polynomial in relevant quantities.

Given , we can use standard algorithms to compute moments of . Many learning algorithms are based on method-of-moments and can be directly applied (see Section 4.1). Other optimization-based algorithms can also be adapted (Section 4.2).

3.2 General model of Rich Component Analysis

We can extend the cumulant extraction algorithm in contrastive learning to general settings with more views and components. The ideas are very similar, but the algorithm is more technical in order to keep track of all the components. We present the intuition and the main results here and defer the details to Appendix B.2. Consider a set of observations , each is linearly related to a subset of variables , the variable appears in a subset of the observations. That is,

 ∀i∈[k]Ui=p∑j=1A(i,j)Sj, (6)

where are unknown linear transformations, and if . For simplicity we assume all the linear transformations are invertible. The variable models the latent source of signal that is common to the subset of observations . The matrix models the transformation of latent signal in view . In order for the model to be identifiable, it is necessary that all the subsets ’s are distinct (otherwise the latent sources with identical can be collapsed into one ). In the most general setting, we have a latent signal that is uniquely associated with every subset of observations. In this case, and corresponds to all the non-empty subsets of . In some settings, only specific subset of views share common signals and can be a small set. We measure the complexity of the set system using the following notion:

Definition 3.1 (L-distinguishable).

We say a set system is -distinguishable, if for every set , there exists a subset of size at most (called the distinguishing set) such that for any other set , either or .

For example, the set system of the contrastive model is and it is 2-distinguishable. Intuitively, for any set in the set system, there is a subset of size at most that distinguishes from all the other sets (except the supersets of ). We use Algorithm 1 to recover all the linear transformations (for more details of the algorithm see Algorithm 2 in Appendix). Algorithm 1 takes as input a set system that captures our prior belief about how the datasets are related. When we don’t have any prior belief, we can input the most general of size , which is -distinguishable. The algorithm automatically determines if certain variable . In the algorithm, is the smallest element of .

Lemma 3.3.

Given observations ’s as defined in Equation 6, suppose the sets ’s are -distinguishable, all the unknown linear transformations ’s are invertible, unfoldings is either 0 (if ) or have full rank, then given the exact -th order cumulants, Algorithm 1 outputs all the correct linear transformations in time .

Once all the linear transformations are recovered, we follow the same strategy as in the contrastive analysis case 3.1.

Theorem 3.4.

Under the same assumption as Lemma 3.3, for any Algorithm 3 computes the correct -th order cumulants for all the variables in time .

Note that in the most general case it is impossible to find cumulants with order , because there can be many different variables ’s but not enough views. Both Algorithms 1 and 3 are robust to noise, with sample complexity that depends polynomially on the relevant condition numbers, and exponential in the order of cumulant considered. For more details see Appendix B.2.

3.3 Related models

Independent component analysis (ICA)[4] may appear similar to our model, but it is actually quite different. In ICA, let be a vector of latent sources, where ’s are one dimensional independent, non-Gaussian random variables. There is an unknown mixture matrix and the observations are . Given many samples , the goal is to deconvolve and recover each sample . In our setting, each can be a high-dimensional vector with complex correlations. It is information-theoretically not possible to deconvolve and recover the individual samples . Instead we aim to learn the distribution without having explicit samples from it.

Another related model is canonical correlation analysis (CCA)[6]. The generative model interpretation of CCA is: there is a common signal , and view-specific signals . Each view is then sampled according to , where index the view. CCA is equivalent to maximum likelihood estimation of in this generative model. In our framework, CCA corresponds to the very restricted setting where are all Gaussians. RCA learns without making such parametric assumptions about and . Moreover, using CCA, it is not clear how to learn the distribution if it is not orthogonal to the shared subspace . In our experiments, we show that the naive approach of performing CCA (or kernel CCA) followed by taking the orthogonal projection leads to very poor performance. Factor analysis (FA)[5] also corresponds to a multivariate Gaussian model, and hence does not address the general problem that we solve. In FA, latent variables are sampled and the observations are .

A different notion of contrastive learning was introduced in [14]. They focused on settings where there are two mixture models with overlapping mixture components. The method there applies only for Latent Dirichlet allocation and Hidden Markov Models and requires explicit parametric models for each component.

4 Using Cumulants in learning applications

The cumulant extraction techniques of Section 3 constructs unbiased estimators for the cumulants of . In this section we show how to use the estimated cumulants/moments to perform maximum likelihood learning of . For concreteness, we frame the discussion on the contrastive learning setting, where we want to learn . For general RCA the method works when (see Definition 3.1) is small or the distributions have specific relationship between lower and higher cumulants.

4.1 Method-of-Moments

RCA recovers the cumulants of , from which we can construct all the moments of in time . This makes it possible to directly combine RCA with any estimation algorithm based on the method-of-moments. Method-of-moments have numerous applications in machine learning. The simplest (and most commonly used) example is arguably principal component analysis, where we want to find the maximum variance directions in . This is only related to the covariance matrix . RCA removes the covariance due to and constructs an unbiased estimator of , from which we can extract the top eigen-space.

The next simplest model is least squares regression (LSR). Suppose the distribution contains samples and labels , and only the samples are corrupted by perturbations, i.e. is independent of . LSR tries to find a parameter that minimizes . The optimal solution again only depends on the moments of : . Using the second-order cumulants/moments extracted from RCA , we can efficiently estimate .

Method-of-moment estimators, especially together with tensor decomposition algorithms have been successfully applied to learning many latent variable models, including Mixture of Gaussians (GMM), Hidden Markov Model, Latent Dirichlet Allocation and many others (see [1]). RCA can be used in conjunction with all these methods. We’ll consider learning GMM in Section 5.

There are many machine learning models where it’s not clear how to apply method-of-moments. Gradient descent (GD) and stochastic gradient descent (SGD) are general purpose techniques for parameter estimation across many models. Here we show how to combine RCA with gradient descent. The key idea is that the extracted cumulants/moments of forms a polynomial basis. If the gradient of the log-likelihood can be approximated by a low-degree polynomial in , then the extracted cumulants from RCA can be used to approximate this gradient.

Consider the general setting where we have a model with parameter , and for any sample the likelihood is . The maximum likelihood estimator tries to find the parameter that maximizes the likelihood of observed samples: In many applications, this is solved using stochastic gradient descent, where we pick a random sample and move the current guess to the corresponding gradient direction: where is a step size and is the -th sample. For convex functions this is known to converge to the optimal solution [12]. Even for non-convex functions this is often used as a heuristic.

If the gradient of log-likelihood is a low degree polynomial in , then using the lower order moments we can obtain an unbiased estimator for with bounded variance, which is sufficient for stochastic gradient to work. This is the case for linear least-squares regression, and its regularized forms using either or regularizer.

In the case when log-likelihood is not a low degree polynomial in , we approximate the gradient by a low degree polynomial, either through simple Taylor’s expansion or other polynomial approximations (e.g. Chebyshev polynomials, see more in [10]). This will give us a biased estimator for the gradient whose bias decreases with the degree we use. In general, when the (negative) log-likelihood function is strongly convex we can still hope to find an approximate solution:

Lemma 4.1.

Suppose the negative log-likelihood function is -strongly convex and -smooth, given an estimator for the gradient such that , gradient descent using with step size converges to a solution such that .

When high degree polynomials are needed to approximate the gradient, our algorithm requires number of samples that grows exponentially in the degree.

Logistic Regression

We give a specific example to illustrate using RCA and low degree polynomials to simulate gradient descent. Consider the basic logistic regression setting, where the samples , and the log-likelihood function is . The gradient of the log-likelihood is:

We can then approximate the function using a low degree polynomial in . As an example, we use 3rd degree Chebychev: The gradient we take in each step is

 E[∇θlogL(θ,S1)]≈E[YX]−0.5E[X]−0.245E[X(θ⊤X)]+0.014E[X(θ⊤X)3].

To estimate this approximation, we only need quadratic terms and a projection of the 4-th order moment . These terms are computed from the projected 2nd and 4-th order cumulants of that are extracted from the cumulants of and via Section 3. Because of the projection these quantities are much easier to compute (in fact, they can be estimated in linear time).

5 Experiments

In the experiments, we focus on the contrastive learning setting where we are given observations of and and the goal is to estimate the parameters for the distribution. Our approach can also learn the shared component as well as . We tested our method in five settings, where corresponds to: low rank Gaussian (PCA), linear regression, mixture of Gaussians (GMM), logistic regression and the Ising model. The first three settings illustrate combining RCA with method-of-moments and the latter two settings requires RCA with polynomial approximation to stochastic gradient descent. In each setting, we compared the following four algorithms:

1. The standard learning algorithm using the actual samples to learn the parameters . This is the gold-standard, denoted as ‘true samples’.

2. Our contrastive RCA algorithm using paired samples from and to learn .

3. The naive approach that ignores and uses to learn directly, denoted as ‘naive’.

4. First perform Canonical Correlation Analysis (CCA) on and , and project the samples from onto the subspace orthogonal to the canonical correlation subspace. Then learn from the projected samples of . We denote this as ‘CCA’.

In all five settings, we let be sampled uniformly from , where is the appropriate dimension of . The empirical results are robust to other choices of that we have tried, e.g. multivariate Gaussian or mixture of Gaussians.

Contrastive PCA.

was set to have a principal component along direction , i.e. . was sampled from and , are random unit vectors in . RCA constructs an unbiased estimator of from the samples of and . We then report the top eigenvector of this estimator as the estimated . We evaluate each algorithm by the mean squared error (MSE) of the inferred to the true .

Contrastive regression.

is the uniform distribution, and . was sampled from and , are random unit vectors in . Our approach gives unbiased estimator of from which we estimate . All algorithms are evaluated by the MSE between the inferred and the true .

Contrastive mixture of Gaussians.

is a mixture of spherical Gaussians in , . is also a mixture of spherical Gaussians, . RCA gives unbiased estimators of the third-order moment tensor, . We then use the estimator in [7] to get a low rank tensor whose components correspond to center vectors, and apply alternating minimization (see [9]) to infer . Algorithms are evaluated by the MSE between the inferred centers and the true centers .

Contrastive logistic regression.

Let and with probability . was sampled from , and , are unit vectors in . We use the 4-th order Chebychev polynomial approximation to the SGD of logistic regression as in Section 4.2. Evaluation is the MSE error between the inferred and the true .

Contrastive Ising model.

Let be a mean-zero Ising model on -by- grid with periodic boundary conditions. Each of the vertices are connected to four neighbors and can take on values . The edge between vertices and is associated with a coupling . The state of the Ising model, , has probability , where is the partition function. We let also be a -by- grid of spins where half of the spins are independent Bernoulli random variables and the other half are correlated, i.e. they are all 1 or all -1 with probability 0.5. We use composite likelihood to estimate the couplings of , which is asymptotically consistent with MLE of the true likelihood [13]. For the gold-standard baseline (which uses the true samples ), we use the exact gradient of the composite likelihood. For RCA , we used the 4-th order Taylor approximation to the gradient. Evaluation is the MSE between the true and the estimated .

Results.

For the method-of-moment applications–PCA, linear regression, GMM–we used 10 dimensional samples for and . The tradeoff between inference accuracy (measured in MSE) and sample size is shown in the top row of Figure 1. Even with just 100 samples, RCA performs significantly better than the naive approach and CCA. With 1000 samples, the accuracy of RCA approaches that of the algorithm using the true samples from . It is interesting to note that projecting onto the subspace orthogonal to CCA can perform much worse than even the naive algorithm. In the linear regression setting, for example, when the signal of happens to align with , the direction of prediction, projecting onto the subspace orthogonal to loses much of the predictive signal.

In the SGD settings, we used a 10 dimensional logistic model and a 5-by-5 Ising model (50 parameters to infer). RCA also performed substantially better than the two benchmarks (Figure 1 d, e). In all the cases, the accuracy of RCA improved monotonically with increasing sample size. This was not the case for the Naive and CCA algorithms, which were unable to take advantage of larger data due to model-misspecification. In Figure 1 f and g, we plot the learning trajectory of RCA over the SGD steps for representative runs of the algorithm with 1000 samples. RCA converges to the final state at a rate similar to the true-sample case. The residual error of RCA is due to the bias introduced by approximating the sigmoid with low-degree polynomial. When many samples are available, a higher-degree polynomial approximation can be used to reduce this bias.

We also explored how the algorithms perform as the magnitude of the signal in is increased compared to (Figure 1 h-j) with fixed 1000 samples. In these plots the -axis measures the ratio of standard deviations of and .At close to 0, most of the signal of comes from , and all the algorithms are fairly accurate. As the strength of the perturbation increases, RCA performs significantly better than the benchmarks, especially in the Ising model. Finally we empirically explored the sample complexity of the subroutine to recover the matrix from the 4th order cumulants. Figure 1 k shows the MSE between the true (sampled ) and the inferred as a function of the sample size. Even with 1000 samples, we can obtain reasonable estimates of .

Biomarkers experiment.

We applied RCA to a real dataset of DNA methylation biomarkers. Twenty biomarkers (10 test and 10 control) measured the DNA methylation level (a real number between 0 and 1) at twenty genomic loci across 686 individuals [15]. Each individual was associated with a binary disease status . Logistic regression on the ten test biomarkers was used to determine the weight vector, , which quantifies the contribution of the methylation at each of these ten locus to the disease risk. The other ten independent loci are control markers. Getting accurate estimates for the values of is important for understanding the biological roles of these loci. In this dataset, all the samples were measured on one platform, leading to relatively accurate estimate of . In many cases samples are collected from multiple facilities (or by different labs). We simulated this within our RCA framework. We let be the original data matrix of the ten test markers across the 686 samples. We let be the original data matrix of the ten control markers in these same samples. We modeled as a mixture model, where samples are randomly assigned to different components that capture lab specific biases. The perturbed observations are and , i.e. and simulate the measurements for the test and control markers, respectively, when the true signal has been perturbed by this mixtures distribution of lab biases. We assume that we can only access and and do not know , i.e. where each sample is generated. Running logistic regression directly on and the phenotype obtained a MSE of 0.24 (std 0.03) between the inferred and the true measured from directly regressing on . Directly using CCA also introduce significant errors with MSE of 0.25 (std 0.02). Using all the control markers as covariates in the logistic regression, the MSE of the test markers’ was 0.14 (std 0.03). In general, adding as covariates to the regression can eliminate at the expense of adding , and can reduce accuracy when is larger than . Using our RCA logistic regression on and , we obtained significantly more accurate estimates of , with MSE 0.1 (std 0.03). See Appendix for more analysis of this experiment.

References

• [1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
• [2] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, neural algorithms for dictionary learning. In Proceedings of The 28th Conference on Learning Theory, COLT, 2015.
• [3] AV Bulinskii. Bounds for mixed cumulants and higher-order covariances of bounded random variables. Theory of Probability & Its Applications, 19(4):835–839, 1975.
• [4] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press, 2010.
• [5] Harry H Harman. Modern factor analysis. University of Chicago Press, 1976.
• [6] Harold Hotelling. Relations between two sets of variates. Biometrika, pages 321–377, 1936.
• [7] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.
• [8] John Francis Kenney and Ernest Sydney Keeping. Mathematics of statistics-part i. 1954.
• [9] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
• [10] Michael James David Powell. Approximation theory and methods. Cambridge university press, 1981.
• [11] Colin Rose and Murray D Smith. Mathematical statistics with Mathematica, volume 1. Springer New York, 2002.
• [12] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic Convex Optimization. In Proceedings of the Conference on Learning Theory (COLT), 2009.
• [13] Cristiano Varin, Nancy Reid, and David Firth. A overview of composite likelihood methods. Statistica Sinica, 21:5–42, 2011.
• [14] James Zou, Daniel Hsu, David Parkes, and Ryan Adam. Contrastive Learning Using Spectral Methods. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2013.
• [15] James Zou, Christoph Lippert, David Heckerman, Martin Aryee, and Jennifer Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, 11(3), 2014.

Appendix A More Tensor and Cumulant Notations

In this section we introduce the notations and basics for tensors and cumulants.

Matrix Notations

For a matrix we use to denote its spectral norm , to denote its Frobenius norm , and to denote its smallest singular value.

When and the matrix has full column rank, we use to denote its Moore-Penrose pseudoinverse which in particular satisfy .

We also sometimes use the Kronecker product of matrices, for and , is a matrix in that has the following block structure:

 A⊗B=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣A1,1BA1,2B⋯A1,nBA2,1BA2,2B⋯A2,nB⋮⋮⋮Am,1BAm,2B⋯Am,nB⎤⎥ ⎥ ⎥ ⎥ ⎥⎦

The singular values of is just the product of singular values of and .

Tensor Notations

A tensor is a -dimensional array, and is frequently used to represent higher order moments or cumulants. We index the elements in the tensor using a -tuple . The entries of tensor product is simply the product of corresponding entries . We use to denote times.

For a distribution , the -th order moment is a tensor , whose -th entry is equal to . Later we shall see cumulants can also be conveniently represented as tensors.

A tensor can be viewed as a multi-linear form (just as a matrix can be viewed as a bilinear form ). For a tensor we define to be

 T(M1,M2,…,Mt)(i1,...,it)=∑(j1,...,jt)∈[d]tT(j1,...,jt)t∏l=1Ml(jl,il).

This multi-linear form works well with the moment tensors, especially for matrices we always have

 E[X⊗t](M1,...,Mt)=E[(M⊤1X)⊗(M⊤2X)⊗⋯⊗(M⊤tX)].

Often to simplify operations tensors are unfolded to become matrices. There can be many ways to unfold a tensor, but in this paper we mostly use a particular unfolding which makes the tensor into a matrix:

 unfold(T)(i1,...,it−1),it=T(i1,...,it).

Similar to matrices, we also define the Frobenius norm of tensors to be the norm of all its entries, in particular

 ∥T∥F=∥unfold(T)∥F=√∑i1,...,itT2(i1,...,it).

Cumulants

Cumulants provide an alternative way to describe the lower order correlations of a random variable. Unlike moments, cumulants have the nice property that the cumulant of sum of independent random variables equals to the sum of cumulants. Formally, for a random variable the cumulant is defined to be the coefficients of the cumulant generating function ( is just times the coefficient in front of ). When the variables are different the cross-cumulants (similar to covariance) can similarly be defined, and it can be computed as:

 κt(X1,...,Xt)=∑π(|π|−1)!(−1)|π|−1∏B∈πE[∏i∈BXi]. (7)

In this formula, is enumerated over all partitions of , is the number of parts in partition and runs through the list of all parts.

Similarly, it is possible to define cumulants for multivariate distributions. For random variable . This cross cumulant can be computed in a similar way as Equation (7), however the products should be replaced by tensor products and the ordering of coordinates is important when doing the tensor product.

• Suppose are random variables in . The -th order cumulant is a tensor in (X_1,…,X_t)(Y_1,…,Y_t)κ_t(X_1+Y_1,…,X_t+Y_t) = κ_t(X_1,…,X_t) + κ_t(Y_1,…,Y_t)κ_t(M_1^⊤X_1,…,M_t^⊤X_t) = κ_t(X_1,…,X_t)(M_1,…,M_t)ttttO(t!)tdO((td)^d)

Intuitively, cumulants can measure how correlated two distributions are. The simplest case is which is equal to the covariance , and is 0 only if the two variables are not correlated in second order. For more detailed introductions to cumulants see books like [8].

Appendix B Details for Section 3

In this section, we prove the equations and algorithms in Section 3 indeed compute the desirable quantity, and further we give sample complexity bounds.

b.1 Contrastive Learning

We first prove Equation (2) computes the correct linear transformation.

Lemma B.1 (Lemma 3.1 restated).

Suppose the unfolding of the 4-th order cumulant has full rank, given the exact cumulants and , Equation (2) finds the correct linear transformation in time .

Proof.

Since and , we know

 κ4(V,U,U,U) =κ4(AS2+S3,S1+S2,S1+S2,S1+S2) =κ4(0,S1,S1,S1)+κ4(AS2,S2,S2,S2)+κ4(S3,0,0,0) =κ4(AS2,S2,S2,S2).

Here the second step uses the fact that cumulants are additive for independent variables, and third step uses the linearity of cumulants.

Similarly, we know .

For the unfoldings of these cumulants, we have

 unfold(cum4(V,U,U,V))=unfold(cum4(V,U,U,U))A⊤.

Therefore when has full rank we can compute using pseudo-inverse.

For the running time, the main computation is a pseudo-inverse and a matrix product for matrices, both take time. ∎

Next we show given the linear transformation, it is possible to estimate the cumulants using Equations (3 - 5). In fact, we can also avoid computing the cross-cumulants and work with just the cumulants of variables:

 κt(S1) = κt(U)−κt(U+A−1V)−κt(U)−κt(A−1V)2t−2, (8) κt(S2) = κt(U+A−1V)−κt(U)−κt(A−1V)2t−2, (9) κt(S3) = κt(V)−κt(AU+V)−κt(AU)−κt(V)2t−2. (10)
Theorem B.2 (Theorem 3.2 restated).

For all , Equations (3)-(5) or (8)-(10) compute the correct cumulants for in time . Moreover, if has dimension higher than and has full column rank, replacing by still gives correct cumulants.

Proof.

The proof of Equations (3)-(5) is very similar to the previous lemma. Note that

 κt(U,U,...,U,A−1V) =κt(S1+S2,...,S1+S2,S2+A−1S3) =κt(S1,...,S1,0)+κt(S2,S2,S2,S2)+κt(0,...,0,A−1S3) =κt(S2).

So we have Equation (4), and using the fact that we get Equation (3). Equation (5) follows similarly.

In order to get Equations (8)-(10), first note that by the linearity of cumulants, we can write as the sum of terms:

 κt(U+A−1V)=∑z∈{0,1}tκt(z1U+(1−z1)A−1V,z2U+(1−z2)A−1V,...,ztU+(1−zt)A−1V).

Among all these terms, one is equal to , one is equal to , and all the other terms are cross-cumulants that involve both and . Since is the only variable that appears in both and , all the terms are equal to , therefore we have Equation (9). Equation (8) again follows from the fact that , and Equation (10) is very similar.

The moreover part follows by directly replacing with in the above argument. Note that in this case we can still find because