A Posterior distribution for the parameters

## Abstract

One important problem in genome science is to determine sets of co-regulated genes based on measurements of gene expression levels across samples, where the quantification of expression levels includes substantial technical and biological noise. To address this problem, we developed a Bayesian sparse latent factor model that uses a three parameter beta prior to flexibly model shrinkage in the loading matrix. By applying three layers of shrinkage to the loading matrix (global, factor-specific, and element-wise), this model has non-parametric properties in that it estimates the appropriate number of factors from the data. We added a two-component mixture to model each factor loading as being generated from either a sparse or a dense mixture component; this allows dense factors that capture confounding noise, and sparse factors that capture local gene interactions. We developed two statistics to quantify the stability of the recovered matrices for both sparse and dense matrices. We tested our model on simulated data and found that we successfully recovered the true latent structure as compared to related models. We applied our model to a large gene expression study and found that we recovered known covariates and small groups of co-regulated genes. We validated these gene subsets by testing for associations between genotype data and these latent factors, and we found a substantial number of biologically important genetic regulators for the recovered gene subsets.

A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects

Chuan Gao, Christopher D Brown, Barbara E Engelhardt

1 Institute for Genome Sciences & Policy, Duke University, Durham, NC, USA

2 Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA

3 Department of Biostatistics & Bioinformatics and Department of Statistical Science, Duke University, Durham, NC, USA

E-mail: barbara.engelhardt@duke.edu

## 1 Introduction

Fast evolving experimental techniques for assaying genomic data have enabled the generation of large scale gene expression and genotype data at an unprecedented pace [1, 2]. Studies to find genetic variants that regulate gene expression levels (called expression quantitative trait loci, or eQTLs), are now possible [3, 4]. However, due to the complicated nature of experimental assays to quantify cellular traits, substantial technical noise and biological covariates may confound measurements of gene expression levels. These confounding effects include batch effects [5, 6, 7, 8, 9], latent population structure among the samples [10, 11, 12], and biological covariates, including age, sex, or body mass index (BMI).

The most replicable and numerous eQTL associations that have been identified in humans are those for which the single nucleotide polymorphism (SNP) is within the cis region of, or local to, the associated gene [13, 14]. In practice, eQTL analyses are conducted by testing each genetic variant for an additive association with only the genes in cis, or local, which helps to alleviate some of the burden imposed by multiple testing [15, 16]. The biological reality is that genes cannot manifest their function alone; instead, genes tend to work together to achieve biological functions (Figure 1A) [17, 18, 19]. Furthermore, a SNP that regulates a gene in cis that, in turn, drives the expression levels of other genes, such as a transcription factor, may appear to co-regulate a subnetwork of genes (possibly in trans; Figure 1C,D). Methods that identify small, co-regulated groups of genes provide important information to a downstream eQTL analysis, enabling genetic variants that regulate multiple genes (pleiotropic eQTLs) to be identified (Figure 1B,D).

The most effective method to control for confounding effects in gene expression assays in order to have power to identify eQTLs remains an open question. Confounding effects are often controlled by estimating principal components (PCs) of the gene expression matrix and removing the effects of the initial PCs before downstream analysis on the normalized residuals [20, 14]; the downside of this two-step procedure is that it is possible that some of the sparse signal is removed in the first step [7, 21]. We address this problem by developing a Bayesian latent factor model to identify a large number of sparse gene clusters, where individual signals are perturbed by unobserved confounding noise. In this latent factor model, small clusters of co-regulated genes are captured by a large number of sparse factors. To jointly model and implicitly control for confounding noise, our model includes a two-component mixture that allows each factor loading to be regularized by either a sparsity-inducing prior or an alternative prior that does not induce sparsity, where the probability of a factor loading being sparse or dense is estimated from the data.

Latent factor models, and sparse latent factor models in particular, are a common and effective statistical methodology for identifying interpretable, low dimensional structure within a high dimensional matrix, and have frequently been used to identify latent structure in gene expression data [22, 23, 24, 25]. This approach assumes that the gene expression levels for each gene can be described by a linear combination of latent factors, and that the random noise in this matrix is approximately normal; thus each sample is modeled as being drawn from a multivariate normal distribution with a diagonal covariance matrix across genes, where the mean parameter is a linear combination of latent factors with a normal prior, and the variance term is estimated for each feature separately. Latent factor models assume that the total variation within the matrix can be partitioned into covariation among genes and variation specific to genes. This implies that a set of genes with correlated gene expression levels will contribute substantially to (have a substantial loading on) a single factor, because this co-variability will contribute to the overall variability in the matrix. In the setting of gene expression data, sparsity has often been imposed on the loading matrix to facilitate this clustering interpretation: genes with zero contribution to a factor are not included in the associated gene cluster [26].

In this work, we develop a flexible Bayesian sparse latent factor model, and we extend this sparse factor model to capture both sparse and dense latent factors by including a two-component mixture of priors on the loading matrix. We use the flexible three parameter beta () prior to induce local (element-specific), factor-specific, and global shrinkage within the loading matrix [27]. We then add a two-component mixture on the parameters of the factor-level three parameter beta prior to jointly model sparse and dense latent structure. While this model draws upon ideas in our previous work in sparse factor analysis [28, 29], the main contributions of this work are that i) we adapt the Bayesian two group regularization framework for regression [30, 31] to latent factor models in a natural way to create a flexible sparse latent factor model with desirable non-parametric and computational properties, and ii) we take advantage of this flexibility by jointly modeling sparse and dense factors. We believe that this sparse latent factor model will have broad utility in Bayesian statistics.

A general difficulty when working with latent factor models is that, in the basic model, the factors and loadings are only identifiable up to orthogonal rotation, scaling, and label switching [32]. We would like to develop metrics with which to compare both sparse and dense matrices in order to evaluate convergence in parameter estimates and to quantify the similarity of the recovered matrices and the underlying structure. These metrics must be robust to these invariances to be useful in this setting. While sparsity in the loading matrix facilitates rotational identifiability and enables more direct comparisons across fitted sparse latent factors, dense factors are not as trivially comparable because of this rotation invariance. In order to address these issues of comparison, we developed two statistics to quantify the stability across estimated factors and factor loading vectors that are sparse (contain zeros) and dense (do not contain zeros). Both statistics are invariant to label switching and scale. In addition, the dense matrix stability statistic is rotation invariant.

This paper is organized as follows. Section 2 provides a general background to sparse factor analysis to motivate our formulation of a Bayesian sparse latent factor model. Section 3 specifies our factor model with the prior and the equivalent model in terms of a simple hierarchical gamma prior. Section 3.2 extends the model to include a mixture of sparse and dense factors. The parameters are estimated using an approximate EM algorithm outlined in Section 4 and Appendix B. We motivate and describe our stability statistics in Section 5. To evaluate the performance of our model, we simulated data and compared our model to related methods based on these simulations (Section 6.1). In Section 6.4, we applied this model to real gene expression data on samples and genes, revealing interesting patterns in gene expression and confounding factors. Using these factors, we identify relevant genetic associations for the subsets of co-regulated genes.

## 2 Bayesian sparsity and latent factor models

Factor analysis has been used in a variety of settings to extract useful low dimensional features from high dimensional data [33, 22, 23, 24, 25]. We begin with a basic factor analysis model [34, 35], , with , , , and , , where and correspond respectively to the number of samples and the number of genes and, in practice, . To ensure conjugacy, the loading matrix and the latent factors have normal priors. This basic factor analysis model has a number of drawbacks: the latent factors and corresponding loadings are unidentifiable with respect to orthogonal rotation and scaling, and it is difficult to select the dimension of the latent factors, which is fixed a priori. One solution to addresses rotational invariance is to induce sparsity in the loading matrix, which allows for identifiability in the estimated matrices when the latent space is sufficiently sparse [28].

In this sparse factor analysis context, most approaches to inducing sparsity have applied shrinkage through a single parameter (generally, the variance of the factor loading matrix elements) on all loading parameters, which may sacrifice small signals to achieve high levels of sparsity. This behavior has been labeled the one group solution to inducing sparsity, because it effectively considers both signal and noise in a single group and regularizes them the same way [31]. In contrast, the two groups solution models noise and signal differently, strongly shrinking noise to zero but allowing signals to escape extreme shrinkage [30].

The canonical two groups solution in the Bayesian context is the so-called ‘spike-and-slab’ prior, which induces sparsity using a two-component mixture model including a point mass at zero and a normal distribution [50, 51]. The components that are noise are effectively removed from the model through the point mass at zero, while the signals are regularized using the normal distribution but remain in the model; this approach additionally allows an explicit posterior distribution on the inclusion probability of each component [26]. In the factor model framework, a spike-and-slab prior can be put on each element of the loading matrix, as in the Bayesian factor regression model (BFRM) [26]. Unfortunately, there is no closed form solution for the parameter estimates, because of the mixture component, and so MCMC is most generally used to estimate the parameters [26]. Because the parameter space for components includes configurations, this is computational intractable for large matrices [52]. These continuous sparsity-inducing priors all have the property that they impose strong shrinkage around zero but have sub-exponential tails, which allow signals to escape shrinkage. Because of these properties, these types of priors have been described as the ‘one-group answer to the original two-groups question’ [30].

In this work, we use a three parameter beta () distribution [27] to encourage sparsity in the elements of the factor loading matrix by shrinking their variance term. is a generalized form of the Beta distribution, with the third parameter further controlling the shape of the density. It has been shown that a linear transformation of the beta distribution, producing the inverse beta distribution or the horseshoe prior, has desirable shrinkage properties in sparse regression [40]. A linear transformation of the distribution can be used to mimic the inverse beta distribution, with the inverse beta variable scaled by . The distribution can also replicate other distributions, including the Strawderman-Berger prior [27]. The produces a -normal () distribution when coupled with the normal distribution, where, for and , this is equivalent to the normal-exponential-gamma distribution (NEG, Table 1[27]. The is thus appealing as a prior because it can recapitulate the sparsity-inducing properties of continuous one group priors, including the horseshoe, but it is also flexible enough to recapitulate other types of priors including some that do not induce sparsity (Table 1).

We build a sparse factor model using this sparsity-inducing prior following recent work in Bayesian regression [30]. In the regression context, a two groups model is achieved by setting the variance term for the regression coefficients to a scale mixture of normals:

 βj|λj,τ ∼ N(0,τ2λ2j) (1) λj ∼ π(λj) (2) (τ2,ϕ2) ∼ π(τ,ϕ), (3)

where , with a one group prior, is on the local variance component , and the same distribution is on the global variance component . This simple model exhibits two groups behavior, given the proper distributions for and , because effectively shrinks all of the regression coefficients to , then , which is allowed to be very large through a heavy-tailed distribution, rescues individual signals [30] by scaling the global shrinkage parameter that is very small.

To adapt this approach to the setting of latent factor models, we added an additional layer of shrinkage to each individual factor, which maintains the global-local-type model selection from the regression context, but allows factor-specific behavior. This creates, in effect, a three groups model, where signal and noise are modeled in a factor-specific way. In particular, a global parameter heavily shrinks all signals through the loading matrix toward zero, a factor-specific parameter rescues specific factors from global shrinkage, and a local parameter enables within-factor sparsity by shrinking individual elements of a factor. Each of the three layers serves a critical role: global regularization creates a non-parametric effect of removing factors from the model that are not necessary, factor-specific regularization identifies factors that will be included in the model, and local regularization enables sparsity, or model selection, within those selected factors. We impose regularization at all three levels of the loading matrix using the prior because it is continuous and flexible.

Recent work has produced a strong result in the Bayesian sparse factor model setting that, using specific local-global shrinkage priors, one obtains the minimax optimal rate of posterior concentration up to a log factor; this work is the first asymptotic justification for global-local approaches to Bayesian sparse factor analysis [53]. Although we use a different heavy-tailed local prior, this work motivates our general approach to Bayesian sparse factor analysis.

Our sparse latent factor model has a straightforward posterior distribution for which point estimates of the parameters are computed using expectation maximization (EM), making it computationally tractable via the careful application of this continuous distribution. Particularly in the dictionary learning setting of identifying an overcomplete, or , number of factors that may individually contribute minimally to the variation in the response matrix, this approach to inducing sparse in latent factor models is statistically and computationally well motivated [54].

## 3 Bayesian sparse factor model via TPB

We define a Bayesian factor analysis model in the following way:

 Y = XΛ+ϵ (4) Xi ∼ N(0,IK) (5) ϵj ∼ N(0,Ψj), (6)

where is the matrix of observed variables, is the factor matrix with factors, is the loading matrix, and is the residual error matrix. We assume is diagonal (but the diagonal elements are not necessarily the same). In this model, the covariance among the features in is captured in . For the latent factors in , we follow the usual convention by giving each element a standard normal prior, where is the identity matrix.

To induce sparsity in the factor loading matrix , we use the three parameter beta distribution parameterized to have a sparsity-inducing effect [27]. The three parameter beta distribution has the following form:

 f(x:a,b,ϕ)=Γ(a+b)Γ(a)Γ(b)ϕbxb−1(1−x)a−1{1+(ϕ−1)x}−(a+b), (7)

for , , and . We put the prior on the variance of each element of the loading matrix , creating the following hierarchical structure:

 ϱ ∼ TPB(e,f,ν). (8) ζk ∼ TPB(c,d,1ϱ−1) (9) φk,j ∼ TPB(a,b,1ζk−1) (10) Λk,j ∼ N(0,1φk,j−1) (11)

This specification provides three layers of shrinkage on the sparse loading matrix: provides local shrinkage for each element by shrinking the variance term of the normal prior; controls the shrinkage specific to each factor ; shrinks all elements of the matrix globally. This model captures different shrinkage scenarios depending on parameters at each level (Figure 2). We estimate the third term of the factor-specific and local priors, , from the data.

By tuning parameters and , we apply more or less shrinkage on the sparse loading matrix . In practice, we set to recapitulate the horseshoe prior at all three levels.

### 3.1 Equivalent model via gamma priors

We transform the parameter to , and we find that the following relationship holds [27]:

 φ∼TPB(a,b,ν)⇔θν∼Be′(a,b)⇔θ∼Ga(a,δ)andδ∼Ga(b,ν), (12)

where and indicate an inverse beta and a gamma distribution, respectively. For concreteness, we define the inverse beta distribution as follows:

 f(x;α,β)=xα−1(1+x)−α−βB(α,β), (13)

where is the beta function. We apply this transformation to Equations 8, 9, 10, and 11, specifically, the variance terms and . It can be shown that  [27]; the same relationship holds for other variables. This relationship implies the following hierarchical structure for the loading matrix :

 γ ∼ G(f,ν) (14) η ∼ G(e,γ) (15) τk ∼ G(d,η) (16) ϕk ∼ G(c,τk) (17) δk,j ∼ G(b,ϕk) (18) θk,j ∼ G(a,δk,j) (19) Λk,j ∼ N(0,θk,j), (20)

where the parameter controls the global shrinkage, controls the factor-specific shrinkage, and controls the local shrinkage for each element of the factor loading matrix .

### 3.2 Mixture of sparse and dense factors

We will define a sparse factor as factor associated with a loading vector that contains one or more zeros (or minimal contribution from some number of features); we similarly define a dense factor as a factor associated with a loading vector that contains no zeros (or contributions from all features). This formulation of the model (Equation 20) makes it suitable for generating sparse factors and, simultaneously, eliminating unnecessary factors. If we removed the local sparse components, and instead let each element of the loading matrix be generated from the factor-level variance term directly, , the model generates dense factors and simultaneously eliminates unused factors. Although there are other possible ways to model dense factors in this framework, we have found that this approach is both computationally tractable and numerically stable.

Using this approach, we added a mixture model to the prior on in order to jointly model both sparse and dense factors. In particular, we mix over generating each parameter from the gamma prior to encourage sparsity within a factor loading vector, and setting , , to the factor-specific parameter to encourage dense factor loadings:

 θk,j∼πGa(a,δk,j)+(1−π)δ(ϕk), (21)

where is the dirac delta function, which sets for all .

Let be a latent vector that indicates whether a factor is a sparse or a dense component. These indicator variables have a Bernoulli distribution with parameter , which we further assume are generated according to a beta distribution with parameters and . Therefore, we may view the gene expression data as being generated from the following model:

 π|α,β ∼Be(α,β) (22) Zk|π ∼Bern(π), k={1,…,K} (23) Λk,j|Zk ∼{p(Λk,j|θk,j,δk,j,ϕk)if Zk=1;p(Λk,j|ϕk)if Zk=0. (24) Xi,k ∼N(0,1) (25) Yi,j|Λk,j,Xi,k,ψj ∼N(K∑k=1Xi,kΛk,j,ψj). (26)

## 4 Approximate inference via EM

We present a fast expectation maximization (EM) algorithm for parameter estimation in this model; we also derived a Gibbs sampler (Appendix A). In the Expectation step of the EM algorithm, we take expectations of the latent factors and latent variables ; this is simple because and are conditionally independent of each other with respect to . We use maximum a posteriori (MAP) estimates for parameters in the M-step as in the original paper on EM [55] (see Appendix B for complete description). The posterior probability is written as follows:

 p(Λ,X,Z,Θ|Y) ∝p(Y|Λ,X,Θ,Z)p(X|Θ)p(Λ|Θ,Z)p(Z|Θ)p(Θ) (27) ∝[n∏i=1N(Yi|Λ,Xi)N(Xi|0,IK)][K∏k=1p∏j=1N(Λk,j|ϕk)\mathds1Zk=0] ×[K∏k=1p∏j=1{N(Λk,j|θk,j)Ga(θk,j|a,δk,j)Ga(δk,j|b,ϕk)}\mathds1Zk=1] ×[K∏k=1Bern(Zk|π)][K∏k=1Ga(ϕk|c,τk)Ga(τk|d,η)] ×Ga(η|e,γ)Ga(γ|f,ν)Beta(π|α,β)

Key elements of EM include: 1) the posterior of has a normal distribution, with its mode being a function of a weighted sum of the sparse and dense components, 2) the posterior of and are in a Generalized Inverse Gaussian () distribution, with MAP estimates of their modes being a closed form solution to a quadratic function; however, is only associated with the sparse components, whereas is a function of both sparse and dense components. 3) The parameters have a gamma distribution, for which the MAP estimates have a closed form solution because of conjugacy. For parameters , we used their MLE estimates when MAP estimates , which is the case for the horseshoe parameterization of the prior ().

## 5 Stability statistics

Factor models suffer from unidentifiability: in the general model, the likelihood is invariant up to orthogonal rotation and scaling of the factors and loadings, and the factors and loadings may be jointly permuted without affecting the likelihood, called the label switching problem. Because of these invariances, it is difficult to compare the results from fitted factor models, specifically and . However, it is important to be able to compare these fitted matrices because we would like to, for example, quantify how well simulated data are recapitulated or evaluate how sensitive the EM algorithm is to random starting points. In the sparse matrix setting, by imposing significant sparsity on the loading matrix, we eliminate rotational invariance for the most part. We therefore construct a stability measure to compare two sparse matrices that is invariant to scale and label switching. In the dense matrix setting, we develop a stability measure that quantifies the similarity between two matrices based on their underlying basis, which is invariant to rotation, scaling, label switching, and even a varying number of recovered factors.

### 5.1 Stability statistic for sparse factors

We propose the following stability measurement for two sparse matrices. Let , be the number of rows for two sparse matrix and , let denote the correlation matrix generated from two fitted sparse matrices and by computing the absolute value of the pairwise Pearson’s correlations among each sparse matrix column, we consider the following statistic:

 rs =12K1K1∑l=1⎧⎨⎩max(|Σl,.|)−∑K2t=1I(|Σl,t|>|¯¯¯¯¯¯¯Σl,.|)Σl,tK2−1⎫⎬⎭ (28) +12K2K2∑t=1⎧⎨⎩max(|Σ.,t|)−∑K1l=1I(|Σl,t|>¯¯¯¯¯¯¯¯¯¯¯|Σ.,t|)Σl,tK1−1⎫⎬⎭

where and denote the mean for the row and the column. The idea behind this metric is as follows: given two sparse matrices that are perfect matches despite label switching, there should be exactly one for the th row and th column, and the rest should be closer to zero (although, because we do not enforce orthogonal factor loadings, probably not exactly zero). The stability measure should reward this scenario, but penalize the comparisons when there are zero or more than one for the row or column (factor splitting). Conversely, we do not want to penalize small correlations among factors as correlations may exist, so we only penalize correlations that are greater than the mean correlation value for that factor, which may be smaller than the correlation between matching factors (with correlation near one) and larger than the correlation between non-matching factors (with correlation closer to zero).

### 5.2 Stability for dense factors

We built a stability measure to quantify the similarity of two dense matrices based on the estimates of the covariance of the features of matrix , , Although itself is unidentifiable up to an orthogonal rotation, the form is identifiable, so we will compare two dense matrices with these features using their respective covariance matrices. The problem of comparing two covariance matrices and has been well studied [56]. A test statistic that is a function of the determinant of the two covariance matrices will quantify the difference between the two [56, 57] for example. A determinant-based approach was rejected, though, because, in our model , so the covariance matrices are singular and therefore will have a determinant of zero. To address this, a simple squared trace, , was recently proposed to measure the distance between two dense matrices [57]. This metric is rotation invariant, invariant to label switching, and allows singular matrices; to make it scale invariant, we scale each row of the original matrices by . Given two scaled dense matrices and , we compare them by using the trace squared:

 rd=1p2Tr(MT1M1−MT2M2)2, (29)

which is proportional to the distance between the two matrices, with smaller values representing greater similarity in this scenario.

## 6 Results

### 6.1 Simulated data

To test the performance of this model, we simulated two types of gene expression measurements. First, we simulated ten data sets with only sparse components, in consideration of the models in this comparison that do not handle confounders explicitly (Sim 1). Second, we simulated ten data sets with sparse components plus dense confounders (Sim 2). The two types of simulated data were generated from the following model:

 Y=XΛ+FΩ+ϵ (30)

where and correspond to the sparse and dense loading matrices, and and correspond to the sparse and dense factors, respectively. To generate , for each row of , We sampled the number of genes in a single sparse factor from and then assigned values from at random to the included genes and set the loadings for the excluded genes to zero. The indices of the included genes were randomly sampled from the total genes. Some genes may appear in multiple rows, and thus correlations among the factors are possible. We sampled each row of . We simulated the dense matrix , , and the error term . For Sim 1, we simulated ten sparse factors; for Sim 2, we added five dense factors. For both simulations, we used these simulated factors to generate a gene expression matrix with dimension , (Figure 3). The same simulation scheme was replicated ten times to generate ten data sets for both Sim 1 and Sim 2.

### 6.2 Related methods for comparison

We validated our model on these simulations, and compared the results against six other related methods. We ran KSVD [54], BFRM [26], SPCA [37], SBIF [45], our model, after controlling for PCs in the matrix (SFAmix2), and our model, leaving as it was given (SFAmix), in the following way.

We ran KSVD assuming one element in each linear combination, which best recapitulated the sparsity in the simulations for both Sim 1 and Sim 2. The default number of iterations was used. We gave the method the correct number of factors. We also ran the method on Sim 1 setting . We ran BFRM and SPCA with the correct factor numbers; we used default values for the other parameters. We also ran BFRM and SPCA on Sim 1 setting . For SBIF, maximum a posteriori (MAP) estimates of and were used as the final point estimates. This method selected the factor number nonparametrically; However, we seeded the method with the correct number of factors. For Sim 1, we also seeded SBIF with .

For SFAmix2, we ran SFAmix as below, but controlled for confounders in matrix before applying our model to the residuals from a fitted linear model with the original matrix and the first five principal components (PCs) of . For SFAmix, we initialized the program with factors, and we set the parameter values to and to recapitulate the horseshoe. We set for a uniform prior distribution on the mixture proportions. We assessed convergence by checking changes in the number of non-zero elements in each iteration, and stopped when was stable for iterations. We also ran SFAmix using the correct number of factors on Sim 1.

Since a number of methods in this comparison did not recover matrices with substantial sparsity, we post processed the results for these methods to determine the sparse and dense loadings. We chose a cutoff , different for each method, so that, for a factor loading , we thresholded the vector elements to count the number of non-zero features in that factor: . We determined this cutoff based on factor loading histograms, resolving ambiguous cutoff levels in favor of the correct number of sparse and dense factors. Then we set elements for which to zero. For SFAmix, we used the posterior probability of the variables to determine whether a factor was sparse or dense (with a naive cutoff of ). We found for SFAmix that the threshold for removing a feature from a factor was , requiring minimal post-processing to determine the gene clusters.

### 6.3 Comparison between six methods on simulated data

We compared our mixture factor analysis model, SFAmix, and our model with a two-stage approach, SFAmix2, to KSVD, BFRM, SPCA, and SBIF. We evaluated the performance of each method based on the stability statistics between the true simulated and the recovered latent spaces, for both sparse and dense loadings and factor matrices. We ran each of the five methods on the ten data sets in Sim 1, and we compared each recovered sparse factor loadings with the true loading matrix (Figure 4). When the correct factor number was known for all methods other than SFAmix, all methods were able to recover the sparse factor loadings well, all producing an average stability measure over the ten simulations. When the factor number was unknown (SFAmix was given and all other methods were given , for simulated ) SFAmix recovered the sparse loading matrix equally well, followed by SBIF, while the remaining methods performed substantially worse. This suggests a benefit of the nonparametric behavior of SFAmix and SBIF, which both estimated the number of factors effectively when the underlying factor number was unknown a priori.

For Sim 2, we found that SFAmix recovered the sparse loadings well, followed by BFRM, SFAmix2, SBIF, KSVD, and SPCA (Figure 4B). Indeed, SFAmix was able to recover both the sparse and dense loading matrices without knowing the number or proportion of sparse and dense factors beforehand (Figure 4C). The amount of post processing required for BFRM may have artificially inflated the quality of those results relative to SBIF in particular. BFRM and SBIF allow variability on the shrinkage applied across factors; thus, they recover matrices with confounding factors better than KSVD and SPCA, which impose equal shrinkage across factors (Figure 4B). This difference is reflected in the dense stability measure, where SFAmix and SFAmix2 had the smallest average distance between the recovered and the true covariance matrices, followed by BFRM, SBIF, KSVD and SPCA (Figure 4C). We used the dense stability metric to compare the recovered factors corresponding to the dense loadings to the original dense factors, and we find an identical ranking of methods in terms of the factor recovery but with substantially greater variance across the different data sets in Sim 2 (Figure 4D). The results for Sim 2 suggest that estimating the sparse and dense components jointly offers benefits over the two-stage method (SFAmix2), which, even given the correct factor numbers, performs worse than the joint model in recovering the sparse components (Figure 4B,D).

We further investigated the recovered gene clusters in the sparse loadings for both Sim 1 and 2. We found that, for Sim 1, SFAmix and SPCA recapitulated the level of sparsity in the simulated loadings; in particular, the average number of non-zero components in a sparse loading () for SFAmix, KSVD, SPCA, BFRM and SPIF were 10, 50, 23, 495, and 500 respectively, where the simulated average cluster size was 15 (Figure 5). For Sim 2, we found that the sparsity levels were recovered well by SFAmix, and also by BFRM and SBIF when the number of sparse and dense loadings were approximated correctly (Figure 5). KSVD and SPCA do not approximate the sparse clusters well in the presence of dense factors. SFAmix recover the sparse latent structure well relative to other methods in the presence of confounding factors, with minimal post-processing.

### 6.4 Gene expression study

An RNA microarray generates gene transcription levels for tens of thousands of genes from an RNA sample rapidly and at low cost. Biologists have shown that genes are not transcribed into mRNA as independent units, but instead as correlated components of biological networks with various biochemical roles [58, 59]. As a result, genes that share similar biological roles may have correlated expression levels across samples because, for example, they may be regulated for a similar cellular purpose by a common transcription factor that is expressed at different levels across samples. Identifying these correlated sets of genes from high dimensional gene expression measurements is a fundamental biological problem [18, 60, 61] with many downstream applications.

#### Latent factors recovered from the gene expression data

We applied our method to expression levels from genes measured in a sample of human immortalized blood cell lines (LCLs) [14]. The data were processed according to previous work [14]; however, neither known covariates nor PCs were controlled for before quantile normalization. We also removed genes with probes on the gene expression array that aligned to multiple regions of the genome using a BLAST analysis and human reference genome hg19. In this experiment, the number of correlated sets of genes may be large relative to the number of genes in the gene expression matrix (and, certainly, relative to the number of samples) if we indeed identify small clusters of co-regulated genes. We set and ran EM from ten starting points with , and . We recovered approximately factors across different random starting values, approximately 25-30 of which were dense factors.

We present results from the run that produced the most factors. For this run, we found a total of factors, of which were dense (Figure 6A,B). We found that 98% of the sparse factors contained fewer than genes, and 81% contained fewer than genes (Figure 6D). To quantify gene correlation patterns within each factor, we calculated the correlation matrix using the gene expression values in the residual matrix for each gene included in each sparse factor (Figure 6A,B). We found that our model recovered factors containing groups of strongly correlated genes, even when the correlation was confounded by the structure of the dense factors in the original matrix .

A further look at the proportion of variance explained (PVE; Figure 6C) shows that dense factors individually explain as much as 13% of the total variance in the gene expression matrix. The sparse factors individually explain as much as 1.3% of the total variance, which is more than some of the dense factors explain. One feature of our joint mixture model is that sparse factors may capture substantial PVE, instead of controlling for this variance through PCs in a two-stage approach (SFAmix2) or implicitly controlling sparsity via a decreasing prior on the PVE (SBIF). Furthermore, we found that the recovered dense factors correlated well with some known biological and technical covariates, including batch effects, which are known to cause a substantial amount of variation in gene expression levels [62] (Figure 7).

#### Ontology term enrichment validates recovered gene clusters

Genes that have correlated transcription levels often have similar molecular functions [58, 59]. To validate the gene clusters recovered by SFAmix, we applied the Gene Ontology enrichment analysis tool, DAVID [63], to the genes in each sparse factor. Using FDR , we found that sparse factors were enriched for different biological functions (Supplemental Table S1). For example, a sparse factor with genes, including CMPK2, DDX60, and SP110, was enriched for the GO terms response to virus (FDR ) and antiviral defense (FDR ). The substantial enrichment of GO terms in the recovered sparse factors suggests that the induced gene clusters recovered by this model are biologically meaningful. Furthermore, these specific GO terms indicate that these samples have mounted a coordinated cellular response to virus, which, as we discuss later, reflects the immortalization process for the specific type of cells in this study [64].

#### eQTL analysis finds pleiotropic eQTLs

One downstream application of identifying subsets of correlated genes is to find genetic variants that are associated with transcription levels of the recovered subsets of possibly co-regulated genes [65, 66, 67, 68]. To further validate the recovered gene clusters, we performed eQTL association mapping to the sparse factors to identify genetic variants that regulate the corresponding small gene clusters (pleiotropic eQTLs). For this experiment, we projected each recovered factor to the quantiles of a standard normal distribution across the samples; we then tested for associations between each of these normalized latent factors and 2.6 million genetic variants (genotyped in the same individuals) using univariate Bayesian tests [70]. We also ran the same association tests on the permuted normalized latent factors to compute the false discovery rate (FDR) for specific values. We identified associated genetic variants (FDR; ), and associated genetic variants at a more strict FDR (FDR ; ); all identified eQTLs are presented in Supplemental Table S2.

We found that out of of our sparse factors () had at least one eQTL (FDR). We define cis associations as variants located within 1 Mb of the transcription start site (TSS) or the transcription end site (TES) of any gene in the factor. We found () cis-associations of total associations, recapitulating previous studies showing many more significant cis-eQTLs than trans-eQTLs [13, 14]. If we consider only the most significantly associated eQTL for each factor, out of factors with eQTLs (37%) are in cis; however, this proportion goes up to 60% ( out of ) at an FDR of (), and 84% ( out of ) at an FDR of (), suggesting that the cis associations represent stronger genetic effects than the trans associations [72]. All associations with have a cluster size of less than or equal to three genes; generally the eQTL is a short distance from the cis-gene’s transcribed region (Figure 8A). Less significant associations () show more variability in distance to the closest gene and cluster size (Figure 8A). As the number of genes in cis to the eQTL for a cluster increased, the association significance also tended to increase (Figure 8B). These associations suggest that this type of factor model can be used to capture small groups of genes that appear to be co-regulated by pleiotropic genetic loci.

For comparison, we found genetic variants associations with the dense factors (FDR), and associated genetic variants at a stricter FDR (FDR). This proportion of dense factors with eQTLs is smaller than the genetic associations for the sparse factors, supporting the hypothesis that most of the dense factors are not genetically driven but represent biological and experimental confounders. This also suggests that a joint modeling of gene clusters and confounding effects does not remove genetic signal unintentionally, although it is possible that a genetic effect constitutes only a small proportion of the variance explained by a dense factor, so those factors would still not appear to be associated with genetic variants.

Association mapping identified eQTLs associated with two factors, the first including ten genes (DDX58, GMPR, IFIT2, IFIT3, IFIT5, MOV10, OASL, PARP12, PARP9, XAF1), the second including four genes (CD55, CR1, CR2, IFNA2). Both eQTLs are unlinked with (i.e., in trans) all of the genes included in the factors. Both of these factors are enriched for GO terms related to interferon response, or response to invasion of host cells by pathogens including viruses and tumor cells; the first factor is enriched for interferon-induced 56K protein (FDR ), and the second factor is enriched for Sushi4 domain (FDR ) that is activated in response to specific viruses including Epstein-Barr. Both of these factors are relevant to the cell type in this study, lymphoblastoid cell lines, which have been immortalized using the Epstein-Barr virus, and it appears that we are able to observe the response that these cells have mounted against the viral pathogen. For the first factor, the trans-eQTL is located within a K-lysine acetyltransferase (KAT8, also known as MYST1), which is in our gene expression data but not included in this factor, and is a known interferon effector gene [73]. The eQTL for the second factor is similarly located within the TRAPPC9 gene, not in our gene expression data set, which is in the NF-B pathway and is activated during viral stress of host cells.

We also performed a univariate Bayesian test for association between the genes within each factor and the SNPs associated with these factors for SNP-factor associations with an FDR. We found that by jointly testing for association with the clustered genes, we identified associations with greater significance than testing the genes separately (Figure 9).

## 7 Conclusions

We developed a model for sparse factor analysis using a three parameter beta prior to induce shrinkage in the loading matrix at three levels of the hierarchy: globally, factor-specific, and element-wise. We found that this model has favorable properties for estimating possibly high-dimensional latent spaces, including accurate recovery of sparse signals and a non-parametric property of removing unused factors. We extended this model to explicitly include dense factors by adding a two-component mixture model within this hierarchy. We developed two simple metrics for stability across sparse and dense matrices that are invariant to scale, label switching, and (for dense matrices) orthogonal rotation. We validated our model on simulated data, and showed that our model recapitulated both sparse and dense factors with high accuracy relative to current state-of-the-art methods. We applied our model to a large gene expression data set and found biologically meaningful clusters of genes. The recovered dense factors correlate well with known biological and technical covariates. We used the sparse factors to identify genetic variants that are associated with transcriptional regulation of the genes within the individual sparse factors, and our results suggest that our sparse gene clusters capture genes that are co-regulated by genetic variants, and that our method is useful for identifying pleiotrophic eQTLs.

## Appendix A Posterior distribution for the parameters

The posterior probability for our model given matrix is written as follows:

 p(Λ,X,Z,Θ|Y) ∝p(Y|Λ,X,Θ,Z)p(X|Θ)p(Λ|Θ,Z)p(Z|Θ)p(Θ) (31) ∝[n∏i=1N(Yi|Λ,Xi)N(Xi|0,IK)][K∏k=1p∏j=1N(Λk,j|ϕk)\mathds1Zk=0] ×[K∏k=1p∏j=1{N(Λk,j|θk,j)Ga(θk,j|a,δk,j)Ga(δk,j|b,ϕk)}\mathds1Zk=1] ×[K∏k=1Bern(Zk|π)][K∏k=1Ga(ϕk|c,τk)Ga(τk|d,η)] ×Ga(η|e,γ)Ga(γ|f,ν)Beta(π|α,β).

The conditional probability for has the following form:

 Xi|Yi,Θ ∝exp{n∑i=1(12(Yi−XiΛ)TΨ−1(Yi−XiΛ))}exp{n∑i=1(12XTiXi)} (32) ∝exp{n∑i=1(12(Xi−μxi)TΣ−1X(Xi−μxi))}.

Thus, we have the following conditional probability for :

 Xi|Yi∼N(μxi,ΣX) (33)

where

 μxi = YTiΨ−1Λ(ΛΨ−1ΛT+IK)−1 (34) ΣX = (ΛΨ−1ΛT+IK)−1 (35)

The conditional probability for has a Bernoulli distribution:

 p(Zk=1|Θ)=π∏pj=1N(Λk,j|θk,j)Ga(θk,j|a,δk,j)Ga(δk,j|b,ϕk)(1−π)(∏pj=1N(Λk,j|ϕk))+π∏pj=1N(Λk,j|θk,j)Ga(θk,j|a,δk,j)Ga(δk,j|b,ϕk) (36)

Let ; then the conditional probability for is

 Zk|Θ ∼Bern(ρk) (37)

To derive the conditional probabilities for the parameters generating the matrix , we note that many of them have a generalized inverse Gaussian distribution, conditional on :

If

 Λk,j|Y,X,Θk,j,ψj,j ∼N⎛⎜⎝1ψj,j∑ni=1(yi,j−∑~k≠kxi,~kΛ~k,j)xi,k1ψj,j∑ni=1x2i,k+1θk,j,1ψj,jn∑i=1x2i,k+1θk,j⎞⎟⎠ (38) θk,j|Λk,j,δk,j ∼GIG(a−12,2δk,j,Λ2k,j) (39) δk,j|θk,j,ϕk ∼G(a+b,θk,j+ϕk) (40) ϕk|δk,j,τk ∼G(pb+c,p∑j=1δk,j+τk) (41)

If

 Λk,j|Y,X,ϕk,ψj,j Missing or unrecognized delimiter for \right (42) ϕk|τk,Λk,j ∼GIG(c−p2,2τk,p∑j=1Λ2k,j), (43)

where we used to denote any element but element .

The following parameters are not sparse or dense component specific, and they each have a gamma conditional probability because of conjugacy:

 τk|ϕk,η ∼G(c+d,ϕk+η) (44) η|γ,τk ∼G(Kd+e,γ+K∑k=1τk) (45) γ|η,ν ∼G(e+f,η+ν). (46)

The mixing proportion has a beta conditional probability:

 π|α,β,Zk∼Beta(α+K∑k=1\mathds1Zk=0,K−K∑k=1\mathds1Zk=1+β), (47)

where is the indicator function.

Finally, the variance of the error term has an inverse gamma distribution:

 ψj,j|Y,X,Λ∼IG⎛⎜ ⎜⎝n2−1,∑ni=1(yi,j−∑Kk=1xi,kΛk,j)22⎞⎟ ⎟⎠. (48)

## Appendix B Expectation maximization algorithm

We describe an expectation maximization algorithm, where we take the expected values of the latent variables and , enabling conjugate gradient methods for point estimates of the parameters in this space. To derive the EM updates, we write the auxiliary function, using the expected complete log posterior probability in lieu of the likelihood, as:

 Q(Θ) ∝n∑i=1p∑j=1⟨logp(yi,j|Λ,X,Θ,Z)⟩+n∑i=1K∑k=1⟨logp(xi,k|Θ)⟩ (49) +K∑k=1p∑j=1⟨p(Zk|Θ)logp(Λk,j|Θ,Zk)⟩+logp(Θ) ∝−p2ln|Ψ|−n∑i=1p∑j=1(yi,j−∑Kk=1⟨xi,k⟩Λk,j)22ψj,j−n∑i=1K∑k=1⟨x2i,k⟩2 +K∑k=1p∑j=1⟨1−⟨zk⟩⟩{−12lnϕk−Λ2k,j2ϕk} +K∑k=1p∑j=1⟨zk⟩{−12lnθk,j−Λ2k,j2θk,j+alnδkj+(a−1)lnθkj−δk,jθk,j} +K∑k=1p∑j=1⟨zk⟩{blnϕk+(b−1)lnδk,j−ϕkδk,j} +K∑k=1{⟨zk⟩lnπ+(1−⟨zk⟩)ln(1−π)} +K∑k=1{clnτk+(c−1)lnϕk−τkϕk+dlnη+(d−1)lnτk−ητk} +elnγ+(e−1)lnη−γη+flnν+(f−1)lnγ−νγ+αlnπ+βln(1−π)

First we write out the equations for the three expected sufficient statistics identified above. In section A, we established that has a Gaussian distribution; then the expected value is computed in the E-step as:

 ⟨Xi|Yi⟩ =YTiΨ−1Λ(ΛΨ−1ΛT+IK)−1 (50)

Similarly, is computed as follows:

 ⟨x2i,k⟩ =ΣXk,k+⟨xi,k⟩2 (51) ⟨xi,~kxi,k⟩ =⟨xi,~k⟩⟨xi,k⟩+ΣX~k,k. (52)

The expected value of was derived in Section A as:

 ⟨zk|Θ⟩ =p(Zk=1|Θ) (53) =π∏pj=1N(Λk,j|θk,j)Ga(θk,j|a,δk,j)Ga(δk,j|b,ϕk)(1−π)(∏pj=1N(Λk,j|ϕk))+π∏pj=1N(Λk,j|θk,j)Ga(θk,j|a,δk,j)Ga(δk,j|b,ϕk).

The parameter updates are computed in the M-step. We obtain their MAP estimates . Specifically, we solve equation to find the closed form of their MAP estimates. The same updates are obtained by finding the mode of the conditional probability of each parameter, as in Appendix A.

Our MAP estimate for is a function of the weighted sum of the two variance terms and :

 ^Λk,j Missing or unrecognized delimiter for \left (54)

where is calculated in the E-step. was derived in Appendix A as the th element in the matrix, and as the th element in the matrix.

As shown in Appendix A, has a generalized inverse Gaussian conditional probability, and its MAP estimates can either be obtained by directly taking the mode of this distribution, or by solving a quadratic formula. We obtain the following form for the parameter updates:

 ^θk,j =2a−3+√(2a−3)2+8Λ2k,jδk,j4δk,j. (55)

The estimates for are trivially obtained as:

 ^δk,j =a+b−1θk,j+ϕk. (56)

The estimates for are also based on a generalized inverse Gaussian. Unlike