A Critical Survey of Deconvolution Methods for Separating Cell-Types in Complex Tissues
Identifying properties and concentrations of components from an observed mixture, known as deconvolution, is a fundamental problem in signal processing. It has diverse applications in fields ranging from hyperspectral imaging to denoising readings from biomedical sensors. This paper focuses on in-silico deconvolution of signals associated with complex tissues into their constitutive cell-type specific components, along with a quantitative characterization of the cell-types. Deconvolving mixed tissues/cell-types is useful in the removal of contaminants (e.g., surrounding cells) from tumor biopsies, as well as in monitoring changes in the cell population in response to treatment or infection. In these contexts, the observed signal from the mixture of cell-types is assumed to be a convolution, using a linear instantaneous (LI) mixing process, of the expression levels of genes in constitutive cell-types. The goal is to use known signals corresponding to individual cell-types along with a model of the mixing process to cast the deconvolution problem as a suitable optimization problem.
In this paper, we present a survey and in-depth analysis of models, methods, and assumptions underlying deconvolution techniques. We investigate the choice of the different loss functions for evaluating estimation error, constraints on solutions, preprocessing and data filtering, feature selection, and regularization to enhance the quality of solutions, along with the impact of these choices on the performance of commonly used regression-based methods for deconvolution. We assess different combinations of these factors and use detailed statistical measures to evaluate their effectiveness. Some of these combinations have been proposed in the literature, whereas others represent novel algorithmic choices for deconvolution. We identify shortcomings of current methods and avenues for further investigation. For many of the identified shortcomings, such as normalization issues and data filtering, we provide new solutions. We summarize our findings in a prescriptive step-by-step process, which can be applied to a wide range of deconvolution problems.
Source separation, or deconvolution, is the problem of estimating individual signal components from their mixtures. This problem arises when source signals are transmitted through a mixing channel and only the mixed sensor readings are observed. Source separation has applications in a variety of fields. One of its early applications was in processing audio signals [Pedersen2004, Meng2012, Emmanuel2014, Labastie2015]. Here, mixtures of different sound sources, such as speech or music, are recorded simultaneously using several microphones. The sources are convolved with the impulse response of the room, and the goal is to separate one or several sources from this mixture. This has direct applications in speech enhancement, voice removal, and noise cancellation in recordings from populated areas. In hyperspectral imaging, the spectral signature of each pixel is observed. This signal is the combination of pure spectral signatures of constitutive elements mixed according to their relative abundance. In satellite imaging, each pixel represents sensor readings for different patches of land at multiple wavelengths. Individual sources correspond to reflectances of materials at different wavelengths that are mixed according to the material composition of each pixel [Gillis2014, Ma2014, Nuzillard2000, Pauca2006, Villeneuve2012].
Beyond these domains, deconvolution has applications in denoising biomedical sensors. Tracing electrical current in the brain is widely used as a proxy for spatiotemporal patterns of brain activity. These patterns have significant clinical applications in diagnosis and prediction of epileptic seizures, as well as characterizing different stages of sleep in patients with sleep disorders. Electroencephalography (EEG) and magnetoencephalography (MEG) are two of the most commonly used techniques for cerebral imaging. These techniques measure voltage fluctuations and changes in the electromagnetic fields, respectively. Superconducting QUantum Interference Device (SQUID) sensors used in the latter technology are susceptible to magnetic coupling due to geometry and must be shielded carefully against magnetic noise. Deconvolution techniques are used to separate different noise sources and ameliorate the effect of electrical and magnetic coupling in these devices [Tang2000, Hesse2006, Vaya2007, Zhang2012].
At a high level, mixing channels can be classified as follows: (i) linear or nonlinear, (ii) instantaneous, delayed, or convolutive, and (iii) over- or under-determined. When neither the sources nor the mixing process is available, the problem is known as blind source separation (BSS). This problem is highly under-determined in general, and additional constraints (such as independence among sources, sparsity, or non-negativity) are typically enforced on the sources in practical applications. On the other hand, a new class of methods has been developed recently, known as semi-blind or guided BSS [Pedersen2004, Hesse2006, Emmanuel2014, Labastie2015]. In these methods, additional information on the approximate behavior of either the sources or the mixing process is available a priori. In this paper, we focus on the class of over-determined, linear instantaneous (LI) mixing processes for which a deformed prior on sources is available. In this case, the parameters of the linear mixer, as well as the true identity of the original sources, are to be determined.
In the context of molecular biology, deconvolution methods have been used to identify constituent cell-types in a tissue, along with their relative proportions. The inherent heterogeneity of tissue samples makes it difficult to identify separated, cell-type specific signatures, i.e., the precise gene expression levels for each cell-type. Relative changes in cell proportions, combined with variations attributed to changes in biological conditions, such as disease state, complicate the separation of true biological signals from mere technical variations. Changes in tissue composition are often indicative of disease progression or drug response. For example, depletion of specific neuronal cells coupled with a gradual increase in the glial cell population is indicative of neurodegenerative disorders. An increasing proportion of malignant cells, as well as a growing fraction of tumor infiltrating lymphocytes (TIL) compared to surrounding cells, directly influence tumor growth, metastasis, and clinical outcomes for patients [Kuhn2012, Newman2015]. Deconvolving tissue biopsies allows further investigation of the interaction between tumor and micro-environmental cells, along with its role in the progression of cancer.
The expression level of genes, which is a proxy for the number of present copies of each gene product, is one of the most common source factors used for separating cell-types and tissues. In the linear mixing model, the expression of each gene in a complex mixture is estimated as a linear combination of the expression of the same gene in the constitutive cell-types. In silico deconvolution methods for separating complex tissues can be coarsely classified as either full deconvolution, in which both cell-type specific expressions and the percentages of each cell-type are estimated, or partial deconvolution methods, where one of these data sources is used to compute the other [Shen-Orr2013]. These two classes loosely relate to BSS and guided-BSS problems. Note that in cases where relative abundances are used to estimate cell-type-specific expressions, the problem is highly under-determined, whereas in the complementary case of computing percentages from noisy expressions of purified cells, it is highly over-determined. The challenge is to select the most reliable features that satisfy the linearity assumption. We provide an in-depth review of the recent deconvolution methods in Section 2.5.
In contrast to computational methods, a variety of experimental cell separation techniques have been proposed to enrich samples for cell-types of interest. However, these experimental methods not only involve significant time, effort, and expense, but may also result in insufficient RNA abundance for further quantification of gene expression. In this case, amplification steps may introduce technical artifacts into the gene expression data. Furthermore, sorting of cell-types must be embedded in the experiment design for the desired subset of cells, and any subsequent separation is infeasible. Computational methods, on the other hand, are capable of sorting mixtures at different levels of resolution and for arbitrary cell-type subsets of interest.
The organization of the remainder of the paper is as follows: in Section 2.1 we introduce the basic terminology from biology needed to formally define the deconvolution problem in the context of quantifying cell-type fractions in complex tissues. The formal definition of the deconvolution problem and its relationship to linear regression is defined in Section 2.2. Sections 2.3 and 2.4 review different choices and examples of the objective function used in regression. An overview of computational methods for biological deconvolution is provided in Section 2.5. Datasets and evaluation measures used in this study are described in Sections 3.1 and 3.2, respectively. The effect of the loss function, constraint enforcement, range filtering, and feature selection choices on the performance of deconvolution methods is evaluated systematically in Sections 3.4-3.8. Finally, a summary of our findings is provided in Section 4.
2 Background and Notation
2.1 Biological Underpinnings
The central dogma of biology describes the flow of genetic information within cells – the genetic code, represented in DNA molecules, is first transcribed to an intermediate construct, called messenger RNA (mRNA), which in turn is translated into proteins. These proteins are the functional workhorses of the cell. Genes, defined as the minimal coding sections of the DNA, contain the recipe for making proteins, and these instructions are utilized dynamically by the cell to adapt to different conditions. The amount of each protein in a cell at a given time point, referred to as its protein expression level, can be measured directly; however, such measurement is limited by the availability of high-quality antibodies that can specifically target each protein. The amount of active mRNA in a cell, on the other hand, can be measured at the genome scale using high-throughput technologies such as microarrays and RNASeq. The former is an older technology that relies on the binding affinity of complementary base pairs (the alphabet of DNA/RNA molecules), while the latter is a newer technique based on next-generation sequencing (NGS), which estimates gene expression from the overlap of mRNA fragments with known genomic features. Since microarrays have been in use for years, extensive databases from different studies are publicly available. RNASeq datasets, in comparison, are smaller but growing rapidly in scale and coverage. Both of these technologies provide reliable proxies for the amount of proteins in cells, with RNASeq being more sensitive, especially for lowly expressed genes. A drawback of both methods, however, is that true protein expression is also regulated by additional mechanisms, such as post-transcriptional modifications, which cannot be assayed at the mRNA level.
The expression level of genes is tightly regulated at different stages of cellular development and in response to environmental changes. In addition to these biological variations due to cellular state, intermediate steps in each technology introduce technical variations into repeated measurements of gene expression in the same cell-type. To enhance the reproducibility of measurements, one normally includes multiple instances of the same cell-type in each experiment, known as technical replicates. The expression profiles from these experiments provide snapshots of the cell under different conditions. Beyond the biological variation of genes within the same cell-type, there is an additional level of variation across different cell-types. Some genes are ubiquitously expressed in all cell-types to perform housekeeping functions, whereas other genes exhibit specificity for a single cell-type or selectivity for a group of cell-types. A compendium of expression profiles of different cells at different developmental stages is the data substrate for in silico deconvolution of complex tissues.
2.2 Deconvolution: Formal Definition
We introduce formalisms and notation used in discussing different aspects of in silico deconvolution of biological signals. We focus on models that assume linearity, that is, the expression signature of the mixture is a weighted sum of the expression profile for its constitutive cell-types. In this case, sources are cell-type specific references and the mixing process is determined by the relative fraction of cell-types in the mixture.
We first introduce the mathematical constructs used:
$M \in \mathbb{R}^{m \times n}$: Mixture matrix, where each entry $m_{ij}$ represents the raw expression of gene $i$, $1 \le i \le m$, in heterogeneous sample $j$, $1 \le j \le n$. Each sample, represented by a column $\mathbf{m}_j$ of the matrix $M$, is a combination of gene expression profiles of the constituting cell-types in the mixture.
$G \in \mathbb{R}^{m \times q}$: Reference signature matrix for the expression of primary cell-types, with multiple biological/technical replicates for each cell-type. In this matrix, rows correspond to the same set of genes as in $M$, columns represent replicates, and there is an underlying grouping among columns that collects profiles corresponding to the same cell-type.
$H \in \mathbb{R}^{m \times k}$: Reference expression profile for $k$ distinct cell-types, in which the replicates of each cell-type in matrix $G$ are represented by their average value.
$C \in \mathbb{R}^{k \times n}$: Relative proportions of each cell-type in each mixture sample. Here, rows correspond to cell-types and columns represent samples in the mixture matrix $M$.
Using this notation, we can formally define deconvolution as an optimization problem that seeks to identify “optimal” estimates for matrices $H$ and $C$, denoted by $\hat{H}$ and $\hat{C}$, respectively. Since $H$ and/or $C$ are not known a priori, we use an approximation that is based on the linearity assumption. In this case, we aim to find $\hat{H}$ and $\hat{C}$ such that their product is close to the mixture matrix, $M$. Specifically, given a function $\mathcal{L}$ that measures the distance between the true and approximated solutions, also referred to as the loss function, we aim to solve:

$$(\hat{H}, \hat{C}) = \underset{H,\, C}{\operatorname{argmin}}\; \mathcal{L}(M, HC) \tag{1}$$
In partial deconvolution, either $H$ or $C$, or a noisy representation thereof, is known a priori, and the goal is to find the other, unknown matrix. When the matrix $H$, referred to as the reference profile, is known, the problem is over-determined and we seek to distinguish features (genes) that closely conform to the linearity assumption from the rest of the (variable) genes. In this case, we can solve the problem individually for each mixture sample. Let us denote by $\mathbf{m}$ and $\hat{\mathbf{c}}$ the expression profile and estimated cell-type proportions of a given mixture sample, respectively. Then, we can rewrite Equation 1 as:

$$\hat{\mathbf{c}} = \underset{\mathbf{c}}{\operatorname{argmin}}\; \mathcal{L}(\mathbf{m}, H\mathbf{c}) \tag{2}$$
This formulation is essentially a linear regression problem with an arbitrary loss function. On the other hand, in the case of full deconvolution, we can still estimate $C$ in a column-by-column fashion. However, estimating $H$ is highly under-determined, and we must use additional sources of information to restrict the search space. One such source is the variation across samples in $M$, which depends on the cell-type concentrations in the latest estimated value of $C$. In general, most regression-based methods for full deconvolution use an iterative scheme that starts from either noisy estimates of $H$ and $C$, or a random sample that satisfies given constraints on these matrices, and successively improves over this initial approximation. This iterative process can be formalized as follows:

$$H^{(t+1)} = \underset{H}{\operatorname{argmin}}\; \mathcal{L}(M, H C^{(t)}), \qquad C^{(t+1)} = \underset{C}{\operatorname{argmin}}\; \mathcal{L}(M, H^{(t+1)} C) \tag{3}$$
Note that the update of $H$ is typically performed row-wise (for each gene), whereas the update of $C$ is performed column-wise (for each sample). Non-negative matrix factorization (NMF) is a dimension-reduction technique that aims to factor each column of a given input matrix as a non-negative weighted sum of non-negative basis vectors, with the number of basis vectors at most equal to the number of columns in the original matrix. The alternating non-negative least squares (ANLS) formulation for solving NMF can be expressed using the framework introduced in Equation 3. There are additional techniques for solving NMF, including the multiplicative updating rule and the hierarchical alternating least squares (HALS) method, all of which are special cases of block-coordinate descent [Kim2013]. Two of the most common loss functions used in NMF are the Frobenius norm and the Kullback-Leibler (KL) divergence [Kim2013].
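To make the alternating scheme of Equation 3 concrete, the following sketch runs ANLS with the Frobenius loss on a small synthetic mixture (hypothetical data; the dimensions, seed, and iteration count are illustrative, and SciPy's `nnls` is assumed available as the per-row/per-column solver):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
m_genes, k, n = 100, 2, 12

# Synthetic ground truth (hypothetical data, not a real expression set).
H_true = rng.gamma(2.0, 10.0, size=(m_genes, k))
C_true = rng.dirichlet(np.ones(k), size=n).T        # columns sum to one
M = H_true @ C_true

# Alternating non-negative least squares with the Frobenius loss.
H = rng.uniform(size=(m_genes, k))                  # random feasible start
for _ in range(100):
    # C-update, column-wise: one mixture sample at a time.
    C = np.column_stack([nnls(H, M[:, j])[0] for j in range(n)])
    # H-update, row-wise: gene i solves min ||M[i, :] - C^T h_i||.
    H = np.vstack([nnls(C.T, M[i, :])[0] for i in range(m_genes)])

residual = np.linalg.norm(M - H @ C) / np.linalg.norm(M)
```

As with any block-coordinate descent, this converges only to a local optimum, and the recovered factors are identifiable at best up to permutation and scaling.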
In addition to non-negativity (NN), a sum-to-one (STO) constraint is typically applied over columns of the matrix $C$, or over the sample-specific vector $\mathbf{c}$. This constraint restricts the search space, which can potentially enhance the accuracy of the results, and simplifies the interpretation of values in $\hat{C}$ as relative percentages. Finally, another fundamental assumption that is mostly neglected in prior work is the similar cell quantity (SCQ) constraint, which states that all reference profiles and corresponding mixtures must be normalized to ensure that they represent the expression level of the “same number of cells.” If this constraint is not satisfied, differences in cell-type counts directly affect the estimated concentrations by rescaling the coefficients to adjust for the difference.
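A minimal partial-deconvolution sketch that combines these ingredients — per-sample least squares under the NN constraint, with STO enforced post hoc by renormalization — is shown below on simulated data (the reference profile, proportions, and noise level are all hypothetical; SciPy is assumed available):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical reference profile H: m = 200 genes, k = 3 cell-types.
H = rng.gamma(shape=2.0, scale=50.0, size=(200, 3))

# Simulate one mixture sample with known proportions (which sum to one).
c_true = np.array([0.5, 0.3, 0.2])
m = H @ c_true + rng.normal(scale=1.0, size=200)   # small additive noise

# Non-negativity: solve min_c ||m - H c||_2 subject to c >= 0.
c_hat, _ = nnls(H, m)

# Sum-to-one: enforced post hoc by renormalizing the coefficients.
c_hat /= c_hat.sum()
```

Note that renormalizing after the fit only yields meaningful percentages if the SCQ assumption approximately holds; otherwise the raw coefficients absorb differences in per-sample cell quantity.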
In this paper, we focus on different loss functions $\mathcal{L}$, as well as the role of constraint-enforcement strategies, in estimating $\hat{C}$. These constitute the key building blocks of both partial and full deconvolution methods.
2.3 Choice of Objective Function
In linear regression, a slightly different notation is often used, which we describe here and subsequently relate to the deconvolution problem. Given a set of $n$ samples, $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^k$ and $y_i \in \mathbb{R}$, the regression problem seeks to find a function $f$ that minimizes the aggregate error over all samples. Let us denote the fitting error for sample $i$ by $r_i = y_i - f(x_i)$. Using this notation, we can write the regression problem as:

$$\min_{f}\; \sum_{i=1}^{n} \mathcal{L}(r_i) \tag{4}$$
where the loss function $\mathcal{L}$ measures the cost of estimation error. We focus on the class of linear functions, that is, $f(x) = x^{\mathsf{T}}\beta$, for which we have $r_i = y_i - x_i^{\mathsf{T}}\beta$. In this formulation, $y_i$ corresponds to the expression level of a gene in the mixture, the vector $x_i$ holds the expression levels of the same gene in the reference cell-types, and $\beta$ encodes the fraction of each cell-type in the mixture. We can represent the samples in compact form by a matrix $X$, in which row $i$ corresponds to $x_i^{\mathsf{T}}$.
In cases where the number of parameters $k$ is greater than the number of samples $n$, that is, when $X$ is a fat matrix, minimizing Equation 4 directly can result in over-fitting. Furthermore, when features (columns of $X$) are highly correlated, the solution $\hat{\beta}$ may change drastically in response to small changes in the samples, specifically among the correlated features. This condition, known as multicollinearity, can result in inaccurate estimates, in which coefficients of similar features are markedly different. To remedy these problems, we can add a regularization term, which incorporates additional constraints (such as sparsity or flatness) to enhance the stability of results. We re-write the problem with the added regularizer as:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}}\; \sum_{i=1}^{n} \mathcal{L}(y_i - x_i^{\mathsf{T}}\beta) + \lambda\, \Omega(\beta) \tag{5}$$
where the parameter $\lambda$ controls the relative importance of estimation error versus regularization. There are different choices and combinations for the loss function $\mathcal{L}$ and the regularizer $\Omega$, which we describe in the following sections.
Choice of Loss Functions
There are a variety of options for suitable loss functions. Some of these functions are known to be asymptotically optimal for a given noise density, whereas others may yield better performance in practice when assumptions underlying the noise model are violated. We summarize the most commonly used set of loss functions:
If we assume that the underlying model is perturbed by Gaussian white noise, the squared or quadratic loss, denoted by $\mathcal{L}_2$, is known to be asymptotically optimal. This loss function is used in classical least squares regression and is defined as:

$$\mathcal{L}_2(r) = r^2$$
Absolute deviation loss, denoted by $\mathcal{L}_1$, is the optimal choice if noise follows a Laplacian distribution. Formally, it is defined as:

$$\mathcal{L}_1(r) = |r|$$
Compared to $\mathcal{L}_2$, the choice of $\mathcal{L}_1$ is preferred in the presence of outliers, as it is less sensitive to extreme values.
Huber’s loss function, denoted by $\mathcal{L}_H$, is a parametrized combination of $\mathcal{L}_2$ and $\mathcal{L}_1$. The main idea is that the $\mathcal{L}_2$ loss is more susceptible to outliers, while being more sensitive to small estimation errors. To combine the best of these two functions, we can define a half-length parameter $\Delta$, which we use to transition from $\mathcal{L}_2$ to $\mathcal{L}_1$. More formally:

$$\mathcal{L}_H(r) = \begin{cases} r^2, & |r| \le \Delta \\ \Delta\,(2|r| - \Delta), & |r| > \Delta \end{cases}$$
The loss function used in support vector regression (SVR) is the $\varepsilon$-insensitive loss, denoted by $\mathcal{L}_\varepsilon$. Similar to Huber loss, there is a transition between small and large estimation errors. However, $\varepsilon$-insensitive loss does not penalize errors that are smaller than the threshold $\varepsilon$. Formally, we define $\varepsilon$-insensitive loss as:

$$\mathcal{L}_\varepsilon(r) = \max(0,\, |r| - \varepsilon)$$
Figure 1 provides a visual representation of these loss functions, in which representative values of $\Delta$ and $\varepsilon$ are used for the Huber and $\varepsilon$-insensitive loss functions, respectively. Note that for small residual values, $|r| \le \Delta$, Huber and squared loss are equivalent; outside this region, Huber loss grows linearly.
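The four loss functions above translate directly into code; the following NumPy sketch uses illustrative default parameters ($\Delta = 1$, $\varepsilon = 1$ here are arbitrary, not values prescribed by the text):

```python
import numpy as np

def squared_loss(r):
    # L2: quadratic penalty on the residual.
    return r ** 2

def absolute_loss(r):
    # L1: linear penalty, robust to extreme values.
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond (half-length parameter delta);
    # the two pieces agree at |r| = delta.
    return np.where(np.abs(r) <= delta, r ** 2, delta * (2 * np.abs(r) - delta))

def eps_insensitive_loss(r, eps=1.0):
    # Zero inside the eps-tube, linear outside.
    return np.maximum(0.0, np.abs(r) - eps)
```

Plotting these over a range of residuals reproduces the qualitative picture of Figure 1: all four agree in growth order, but differ in how sharply they punish small versus large errors.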
Choice of Regularizers
When the reference profile contains many cell-types that may not exist in mixtures, or in cases where constitutive cell-types are highly correlated, regularizing the objective function can sparsify the solution or enhance the conditioning of the problem. We describe two commonly used regularizers here:
The norm-2 regularizer is used to shrink the regression coefficient vector $\beta$, ensuring that it is as flat as possible. A common use of this regularizer is in conjunction with the $\mathcal{L}_2$ loss to remedy the multicollinearity problem in classical least squares regression. This regularizer is formally defined as:

$$\Omega(\beta) = \|\beta\|_2^2 = \sum_{j=1}^{k} \beta_j^2$$
Another common regularizer is the norm-1 regularizer, which is used to enforce sparsity over $\beta$. Formally, it can be defined as:

$$\Omega(\beta) = \|\beta\|_1 = \sum_{j=1}^{k} |\beta_j|$$
In addition to these two regularizers, their combinations have also been introduced in the literature. One such example is the elastic net, which uses a convex combination of the two, that is, $\Omega(\beta) = \alpha\|\beta\|_1 + (1 - \alpha)\|\beta\|_2^2$ with $0 \le \alpha \le 1$. Another example is group Lasso, which, given a grouping among cell-types, enforces flatness among members of each group while enhancing the sparsity pattern across groups. This regularizer function can be written as $\Omega(\beta) = \sum_{g} w_g \|\beta_g\|_2$, where $w_g$ is the weight of the cell-types in group $g$.
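These penalty functions are simple enough to state in a few lines of NumPy; the group weights below default to $w_g = \sqrt{|g|}$, a common (but not universal) convention that is an assumption on our part:

```python
import numpy as np

def ridge_penalty(beta):
    # Norm-2 (squared): encourages flat, small coefficients.
    return float(np.sum(beta ** 2))

def lasso_penalty(beta):
    # Norm-1: encourages sparse coefficients.
    return float(np.sum(np.abs(beta)))

def elastic_net_penalty(beta, alpha=0.5):
    # Convex combination of norm-1 and squared norm-2.
    return alpha * lasso_penalty(beta) + (1 - alpha) * ridge_penalty(beta)

def group_lasso_penalty(beta, groups, weights=None):
    # groups: list of index arrays, one per cell-type group.
    # Flat within a group (norm-2), sparse across groups (sum of norms).
    if weights is None:
        weights = [np.sqrt(len(g)) for g in groups]  # assumed convention
    return float(sum(w * np.linalg.norm(beta[g]) for w, g in zip(weights, groups)))
```

Setting $\alpha = 1$ in the elastic net recovers the pure norm-1 (LASSO) penalty, and $\alpha = 0$ recovers the ridge penalty.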
2.4 Examples of objective functions used in practice
Ordinary Least Squares (OLS)
The formulation of OLS is based on the squared loss, $\mathcal{L}_2$. Formally, we have:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}}\; \|y - X\beta\|_2^2$$
where row $i$ of the matrix $X$, also known as the design matrix, corresponds to $x_i^{\mathsf{T}}$. This formulation has a closed-form solution given by:

$$\hat{\beta} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y$$
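The closed form can be verified numerically on synthetic data (dimensions and coefficients below are arbitrary); in practice, `numpy.linalg.lstsq`, which works via QR/SVD, is preferred over explicitly forming $X^{\mathsf{T}}X$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = rng.normal(size=(n, k))              # design matrix, rows x_i^T
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                        # noise-free response

# Closed-form OLS via the normal equations: beta = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With a noise-free response, the normal equations recover the generating coefficients exactly (up to floating-point error).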
As we will see shortly, norm-2 regularization is especially useful in cases where the matrix $X^{\mathsf{T}}X$ is ill-conditioned and near-singular, that is, when columns of $X$ are (nearly) linearly dependent. By shifting $X^{\mathsf{T}}X$ towards the identity matrix, the regularizer moves the eigenvalues away from zero, which improves the conditioning of the resulting system.
One of the main issues with the OLS formulation is that the design matrix, $X$, should have full column rank $k$. Otherwise, if we have highly correlated variables, the solution suffers from the multicollinearity problem. This condition can be remedied by incorporating a norm-2 regularizer. The resulting formulation, known as ridge regression, is as follows:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
Similar to OLS, we can differentiate with respect to $\beta$ to find the closed-form solution for ridge regression:

$$\hat{\beta} = (X^{\mathsf{T}} X + \lambda I)^{-1} X^{\mathsf{T}} y$$
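The stabilizing effect of the $\lambda I$ term can be seen on hypothetical near-collinear data (the collinearity scale, noise level, and $\lambda$ below are illustrative choices, not values from this study):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # two nearly collinear features
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.01 * rng.normal(size=n)  # small measurement noise

# OLS via the normal equations: X^T X is near-singular, so the coefficients
# typically explode far away from the balanced answer [1, 1].
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: shifting X^T X by lam*I keeps the solution near [1, 1].
lam = 0.1
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The ridge estimate splits the weight roughly evenly between the two correlated features, which is exactly the "flatness" behavior the norm-2 regularizer is designed to induce.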
Least Absolute Selection and Shrinkage Operator (LASSO) Regression
Combining the OLS loss with a norm-1 regularizer, we obtain the LASSO formulation:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
This formulation is especially useful for producing sparse solutions, as it introduces zero elements in the vector $\hat{\beta}$. However, while convex, it does not have a closed-form solution.
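In the absence of a closed form, LASSO is solved iteratively; one of the simplest such schemes is proximal gradient descent (ISTA), sketched below on a hypothetical sparse-recovery problem (the objective here uses the conventional $\tfrac{1}{2}\|y - X\beta\|_2^2$ scaling, and all parameter values are illustrative):

```python
import numpy as np

def lasso_ista(X, y, lam=5.0, n_iter=1000):
    """Proximal gradient (ISTA) for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    L = np.linalg.norm(X, 2) ** 2             # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - grad / L                   # gradient step on the smooth part
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta

# Hypothetical sparse ground truth: only two active features out of ten.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[0], beta_true[3] = 3.0, -2.0
beta_hat = lasso_ista(X, X @ beta_true)
```

The soft-thresholding step is what zeroes out small coefficients, giving LASSO its feature-selection behavior.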
It is known that $\mathcal{L}_2$ is dominated by the largest elements of the residual vector $r$, which makes it sensitive to outliers. To remedy this problem, different robust regression formulations have been proposed that use alternative loss functions. Two of the best-known formulations are based on the $\mathcal{L}_1$ and Huber loss functions. The $\mathcal{L}_1$ formulation can be written as:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}}\; \|y - X\beta\|_1$$
However, for the Huber loss function, while it can be defined similarly, it is usually formulated as an equivalent convex program:

$$\min_{\beta,\, z}\; \|z\|_2^2 + 2\Delta\, \|y - X\beta - z\|_1$$
which can be solved more efficiently using the following equivalent QP variant [Mangasarian2000]:

$$\min_{\beta,\, z,\, t}\; \|z\|_2^2 + 2\Delta\, \mathbf{1}^{\mathsf{T}} t \quad \text{subject to} \quad -t \le y - X\beta - z \le t$$
In both of these formulations, the scalar $\Delta$ corresponds to the half-length parameter of Huber’s loss function.
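Besides QP solvers, Huber regression can also be approximated by iteratively reweighted least squares (IRLS), which repeatedly down-weights observations in the linear zone of the loss. The following is a didactic NumPy sketch on synthetic outlier-contaminated data (all sizes, seeds, and the choice of $\Delta$ are hypothetical):

```python
import numpy as np

def huber_irls(X, y, delta=1.0, n_iter=50):
    """Huber regression via iteratively reweighted least squares (IRLS)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS warm start
    for _ in range(n_iter):
        r = y - X @ beta
        # Weight 1 in the quadratic zone, delta/|r| in the linear zone.
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.maximum(np.abs(r), 1e-12))
        Xw = X * w[:, None]                       # rows scaled by their weights
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

# Hypothetical demo: a few gross outliers among otherwise clean data.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
y[:5] += 50.0                                    # corrupt five observations
beta_hat = huber_irls(X, y)
```

Because the five corrupted observations receive weights of roughly $\Delta/50$, they barely perturb the fit, illustrating the robustness argument made above.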
Support Vector Regression
In machine learning, Support Vector Regression (SVR) is a commonly used technique that finds a regression function by maximizing the margin around the estimated hyperplane, measured with respect to the closest data points on each side of it. This margin defines the region in which estimation errors are ignored. SVR has recently been used to deconvolve biological mixtures, where it has been shown to outperform other methods [Newman2015]. One of the variants of SVR is $\nu$-SVR, in which the parameter $\nu$ controls the width of the margin, or the $\varepsilon$-tube. The primal formulation of $\nu$-SVR with a linear kernel can be written as [Vapnik1998]:

$$\min_{\beta,\, \varepsilon,\, \xi,\, \xi^*}\; \frac{1}{2}\|\beta\|_2^2 + C\Big(\nu\varepsilon + \frac{1}{n}\mathbf{1}^{\mathsf{T}}(\xi + \xi^*)\Big) \quad \text{subject to} \quad X\beta - y \le \varepsilon\mathbf{1} + \xi, \quad y - X\beta \le \varepsilon\mathbf{1} + \xi^*, \quad \xi, \xi^* \ge 0, \; \varepsilon \ge 0 \tag{10}$$
In this formulation, $\mathbf{1}$ is a vector of all ones, and $\xi$ and $\xi^*$ are slack variables measuring violations above and below the $\varepsilon$-tube. The dual formulation, in which the samples enter only through the kernel matrix $K = XX^{\mathsf{T}}$, is often used to solve $\nu$-SVR, because it can be easily extended to different kernel functions that map to a $d$-dimensional non-linear feature space. Additionally, when $n < d$, as in high-dimensional feature spaces, it provides a better way to solve the SVR problem. However, the primal problem provides a more straightforward interpretation and, in the case where $n \gg d$, superior performance. To show the similarity with Equation 5, we can rewrite Equation 10 using the $\varepsilon$-insensitive loss function as follows:

$$\min_{\beta,\, \varepsilon}\; \frac{1}{2}\|\beta\|_2^2 + C\Big(\nu\varepsilon + \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_\varepsilon(y_i - x_i^{\mathsf{T}}\beta)\Big)$$
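The loss-based view of the primal lends itself to a simple subgradient-descent sketch. The code below fixes the tube width $\varepsilon$ (so it is a didactic linear $\varepsilon$-SVR, not the full $\nu$-SVR program above, where $\varepsilon$ is itself optimized), and all parameters and data are hypothetical; production implementations solve the dual QP instead:

```python
import numpy as np

def linear_svr(X, y, C=10.0, eps=0.1, lr=0.01, n_iter=2000):
    """Subgradient descent on 0.5*||beta||^2 + C * mean(eps-insensitive loss),
    with a fixed eps-tube (didactic sketch only)."""
    n = len(y)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        r = y - X @ beta
        # Subgradient of the eps-insensitive loss w.r.t. the prediction:
        # zero inside the tube, -sign(r) outside.
        g = np.where(np.abs(r) > eps, -np.sign(r), 0.0)
        beta -= lr * (beta + (C / n) * (X.T @ g))
    return beta

# Hypothetical noise-free example.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
beta_true = np.array([1.5, -0.5])
beta_hat = linear_svr(X, X @ beta_true)
```

Samples whose residuals fall inside the tube contribute nothing to the subgradient, which is the algorithmic counterpart of the "ignored errors" region of $\mathcal{L}_\varepsilon$.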
2.5 Overview of Prior in silico Deconvolution Methods
A majority of existing deconvolution methods fall into two groups – they either use a regression-based framework to compute $H$, $C$, or both; or perform statistical inference over a probabilistic model. Abbas et al. [Abbas2009] present one of the early regression-based methods for estimating $C$. This method is designed to identify cell-type concentrations from a known reference profile of immune cells. Their method is based on Ordinary Least Squares (OLS) regression and does not consider either non-negativity or sum-to-one constraints explicitly; rather, it enforces these constraints implicitly after the optimization procedure. An extension of this approach is proposed by Qiao et al. [Qiao2012], which uses non-negative least squares (NNLS) to explicitly enforce non-negativity as part of the optimization. Gong et al. [Gong2011] present a quadratic programming (QP) framework to explicitly encode both constraints in the optimization problem formulation. They also propose an extension to this method, called DeconRNASeq, which applies the same QP framework to RNASeq datasets. More recently, Newman et al. [Newman2015] propose robust linear regression (RLR) and $\nu$-SVR regression instead of $\mathcal{L}_2$-based regression, which is highly susceptible to noise. Digital cell quantification (DCQ) [Altboum2014] is another approach, designed for monitoring the immune system during infection. Compared to prior methods, DCQ enforces sparsity by combining norm-1 and norm-2 regularization into an elastic net. This regularization is essential for successfully identifying the subset of active cells at each stage, given the large number of cell-types included in their panel (213 immune cell sub-populations). In contrast to these techniques, Shen-Orr et al. [Shen-Orr2010] propose a method, called csSAM, which is specifically designed to identify genes that are differentially expressed among purified cell-types. The core of this method is regression over the known concentration matrix $C$ to estimate the matrix $H$.
Full regression-based methods use variations of block-coordinate descent to successively identify better estimates for both $H$ and $C$ [Kim2013]. Venet et al. [Venet2001] present one of the early methods in this class, which uses an NMF-like method coupled with a heuristic to decorrelate columns of $H$ in each iteration. Repsilber et al. [Repsilber2010] propose an algorithm called deconf, which uses alternating non-negative least squares (ANLS) for solving NMF, without the decorrelation step of Venet et al., while implicitly applying constraints on $H$ and $C$ at each iteration. Inspired by the work of Pauca et al. on hyperspectral image deconvolution [Pauca2006], Zuckerman et al. [Zuckerman2013] propose an NMF method based on the Frobenius norm for gene expression deconvolution. They use gradient descent to update $H$ and $C$ at each step, which converges to a local optimum of the objective function. Given that the expression domain of cell-type-specific markers is restricted to unique cells in the reference profile, Gaujoux et al. [Gaujoux2012] present a semi-supervised NMF (ssNMF) method that explicitly enforces an orthogonality constraint at each iteration over the subset of markers in the reference profile. This constraint both enhances the convergence of the NMF algorithm and simplifies the matching of columns in the estimated cell-type expression matrix to the columns of the reference panel, $H$. The Digital Sorting Algorithm (DSA) [Zhong2013] works as follows: if the concentration matrix $C$ is known a priori, it directly uses quadratic programming (QP), with added constraints on the lower/upper bounds of gene expression, to estimate the matrix $H$. Otherwise, if fractions are also unknown, it uses the average expression of given marker genes that are expressed in only one cell-type, combined with the STO constraint, to first estimate the concentration matrix $C$.
Population-specific expression analysis (PSEA) [Kuhn2011] performs linear least squares regression to estimate quantitative measures of cell-type-specific expression levels, in a fashion similar to the update equation for $H$ in Equation 3. In cases where the matrix $C$ is not known a priori, PSEA exploits the average expression of marker genes that are exclusively expressed in one of the reference profiles as reference signals to track the variation of cell-type fractions across multiple mixture samples.
In addition to regression-based methods, a large class of methods is based on probabilistic modeling of gene expression. Erkkilä et al. [Erkkila2010] introduce a method, called DSection, which formulates the deconvolution problem using a Bayesian model. It incorporates a Bayesian prior over the noisy observations of the given concentration parameters to account for their uncertainty, and employs an MCMC sampling scheme to estimate the posterior distribution of the parameters/latent variables, including $C$. The in-silico NanoDissection method [Ju2013] uses a classification algorithm based on a linear SVM, coupled with an iterative adjustment process, to refine a set of provided positive and negative marker genes and infer a ranked list of genome-scale predictions for cell-type-specific markers. Quon et al. [Quon2013] propose a probabilistic deconvolution method, called PERT, which estimates a global, multiplicative perturbation vector to correct for the differences between the provided reference profiles and the true cell-types in the mixture. PERT formulates the deconvolution problem in a framework similar to Latent Dirichlet Allocation (LDA), and uses the conjugate gradient descent method to cyclically optimize the joint likelihood function with respect to each latent variable/parameter. Finally, microarray microdissection with analysis of differences (MMAD) [Liebner2014] incorporates the concept of the effective RNA fraction to account for source- and sample-specific bias in the cell-type fractions for each gene. The authors propose different strategies depending on the availability of additional data sources. In cases where no additional information is available, they identify genes with the highest variation in mixtures as markers, assign them to different reference cell-types using k-means clustering, and finally use these de novo markers to compute cell-type fractions.
MMAD uses a maximum likelihood estimation (MLE) approach over the residual sum of squares to estimate the unknown parameters in its formulation.
3 Results and Discussion
We now present a comprehensive evaluation of various formulations for solving deconvolution problems. Some of these algorithmic combinations have been proposed in the literature, while others represent new algorithmic choices. We systematically assess the impact of these choices on the performance of in-silico deconvolution.
In vivo mixtures with known percentages: We use a total of five datasets with known mixing proportions. We use CellMix [Gaujoux2013] to download and normalize these datasets; it relies on the SOFT-format data available from the Gene Expression Omnibus (GEO).
BreastBlood [Gong2011] (GEO ID: GSE29830): Breast tissue and blood from human specimens are mixed in three different proportions, and each mixture is measured three times, for a total of nine samples.
CellLines [Abbas2009] (GEO ID: GSE11058): Mixture of human cell lines Jurkat (T cell leukemia), THP-1 (acute monocytic leukemia), IM-9 (B lymphoblastoid multiple myeloma) and Raji (Burkitt B-cell lymphoma) in four different concentrations, each of which is repeated three times, resulting in a total of 12 samples.
LiverBrainLung [Shen-Orr2010] (GEO ID: GSE19830): This dataset contains three different rat tissues, namely brain, liver, and lung tissues, which are mixed in 11 different concentrations with each mixture having three technical replicates, for a total of 33 samples.
RatBrain [Kuhn2011] (GEO ID: GSE19380): This dataset contains four different cell-types, namely rat neuronal, astrocytic, oligodendrocytic, and microglial cultures, with two replicates of five different mixing proportions, for a total of 10 samples.
Retina [Siegert2012] (GEO ID: GSE33076): This dataset pools together retinas from two different mouse lines, which are mixed in eight different combinations with three replicates per mixture, resulting in a total of 24 samples.
Mixtures with available cell-sorting data through flow-cytometry: For this experiment, we use two datasets available from Qiao et al. [Qiao2012]. We directly download these datasets from the supplementary material of the paper. These datasets are post-processed by the supervised normalization of microarrays (SNM) method to correct for batch effects. Raw expression profiles are also available for download under GEO ID GSE40830. This dataset contains two sub-datasets:
PERT_Uncultured: This dataset contains uncultured human cord blood mono-nucleated and lineage-depleted (Lin-) cells on the first day.
PERT_Cultured: This dataset contains culture-derived lineage-depleted human blood cells after four days of cultivation.
Table I summarizes overall statistics related to each of these datasets.
Dataset | # features | # samples | # references
3.2 Evaluation Measures
Let us denote the actual and estimated coefficient matrices by C and Ĉ, respectively. We first normalize these matrices to ensure that each column sums to one, and define the corresponding percentage matrices as P and P̂. Finally, let d_ij = |p_ij − p̂_ij| be the residual estimation error of cell-type i in sample j, where K denotes the number of cell-types and N the number of mixture samples. Using this notation, we can define three commonly used measures of estimation error as follows:
Mean absolute difference (mAD): This is among the easiest measures to interpret. It is defined as the average of the absolute differences between estimated and actual cell-type percentages over all mixture samples. More specifically: mAD = (1 / (K·N)) Σ_i Σ_j |p_ij − p̂_ij|, where p_ij and p̂_ij are the actual and estimated percentages of cell-type i in sample j, and K and N denote the numbers of cell-types and samples, respectively.
Root mean squared distance (RMSD): This measure is one of the most commonly used distance functions in the literature. It is formally defined as: RMSD = sqrt( (1 / (K·N)) Σ_i Σ_j (p_ij − p̂_ij)² ).
Pearson’s correlation distance: Pearson’s correlation measures the linear dependence between estimated and actual percentages. Let us vectorize the percentage matrices as p and p̂. Using this notation, the correlation between these two vectors is defined as: r = cov(p, p̂) / (σ_p · σ_p̂),
where cov(·,·) and σ denote the covariance and the standard deviation of the vectors, respectively. Finally, we define the correlation distance measure as 1 − r.
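The three measures can be computed directly from the concentration matrices. The following NumPy sketch is our own illustrative re-implementation (the paper's experiments use Matlab); the function name and the cell-types-by-samples matrix layout are our assumptions:

```python
import numpy as np

def evaluation_measures(C_true, C_est):
    """Compute mAD, RMSD, and Pearson correlation distance between
    the actual and estimated concentration matrices (cell-types x samples)."""
    # Normalize each column (sample) to sum to one, then convert to percentages.
    P = 100.0 * C_true / C_true.sum(axis=0, keepdims=True)
    P_hat = 100.0 * C_est / C_est.sum(axis=0, keepdims=True)
    D = np.abs(P - P_hat)                    # residual estimation errors d_ij
    mAD = D.mean()                           # mean absolute difference
    RMSD = np.sqrt((D ** 2).mean())          # root mean squared distance
    r = np.corrcoef(P.ravel(), P_hat.ravel())[0, 1]  # Pearson correlation
    return mAD, RMSD, 1.0 - r                # correlation distance = 1 - r
```

Note that RMSD is always at least as large as mAD, since the root mean square of the errors dominates their mean.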
All code and experiments have been implemented in Matlab. To implement the different formulations of the deconvolution problem, we used CVX, a package for specifying and solving convex programs [gb08, cvx], together with Mosek, a high-performance solver for large-scale linear and quadratic programs [mosek]. All code and datasets are freely available at github.com/shmohammadi86/DeconvolutionReview.
3.4 Effect of Loss Function and Constraint Enforcement on Deconvolution Performance
We perform a systematic evaluation of the four different loss functions introduced in Section 2.3.1, as well as implicit and explicit enforcement of the non-negativity (NN) and sum-to-one (STO) constraints on the concentration matrix, on the overall performance of deconvolution methods for each dataset. There are 16 configurations of loss functions/constraints for each test case. Additionally, for the Huber and Hinge loss functions, whose loss parameters are unknown, we perform a grid search over 15 values, in multiples of 10, to find the best values for these parameters. In order to evaluate an upper bound on the “potential” performance of these two loss functions, we use the true concentrations in each sample to evaluate each parameter choice. In practical applications, the RMSD of the residual error is often used to select the optimal parameter instead; this choice is not always in agreement with the one based on the known concentrations.
For each test dataset, we compute the three evaluation measures defined in Section 3.2. Additionally, for each of these measures, we compute an empirical p-value by sampling random concentrations from a uniform distribution and enforcing the NN and STO constraints on the resulting random samples; the number of random samples drawn per dataset/measure bounds the smallest p-value we can estimate. Figure 2 presents the per-sample computation time of each loss function, averaged over all constraint combinations. The actual times for the Huber and Hinge losses are roughly 15 times those reported here, since 15 experiments are performed to find their optimal parameters. From these results, one loss function is consistently the fastest to compute and another the slowest, with the remaining two falling between these extremes. These computation times are directly comparable, because we formulate all methods within the same framework; thus, differences in implementation do not affect the comparison.
Computation time, while important, is not the critical measure in our evaluation. The true performance of a configuration (a selection of loss function and constraints) is measured by its estimation error. In order to rank different configurations, we first assess the agreement among the different measures. To this end, we evaluate each dataset as follows: for each experiment, we compute mAD, RMSD, and the correlation distance independently. Then, we apply the Kendall rank correlation test, a non-parametric hypothesis test for statistical dependence between two random variables, to each pair of measures and compute a log-transformed p-value for each correlation. Figure 3 shows the agreement among these measures across different datasets. Overall, the RMSD and correlation measures show higher consistency with each other than with mAD. However, mAD is easier to interpret, as it directly measures the percentage loss for each configuration. Consequently, we choose mAD as the measure for our evaluation in this study.
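The agreement analysis above can be sketched with SciPy's Kendall rank correlation. The mAD and RMSD values below are hypothetical stand-ins for the scores of the 16 loss/constraint configurations; the monotone relationship between them models two measures that rank configurations consistently:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical mAD and RMSD scores for 16 loss/constraint configurations;
# a monotone transform preserves the ranking, so the two measures agree.
mad_scores = np.linspace(0.05, 0.40, 16)
rmsd_scores = mad_scores ** 2          # different scale, same ranking

tau, p_value = kendalltau(mad_scores, rmsd_scores)
log_p = -np.log10(p_value)             # log-transformed p-value (cf. Figure 3)
```

Perfectly concordant rankings give tau = 1 and a very small p-value; discordant measure pairs would drive tau toward zero.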
Using mAD as the measure of performance, we evaluate each configuration over each dataset and sort the results. Figure 4 shows the various combinations for each dataset. The RatBrain, LiverBrainLung, BreastBlood, and CellLines datasets achieve high performance. Among these, RatBrain, LiverBrainLung, and BreastBlood share the same best-performing loss function, while the CellLines dataset is less sensitive to the choice of loss function. Another surprising observation is that, for the majority of configurations, enforcing the sum-to-one constraint worsens the results. We investigate this issue in greater depth in Section 3.5.
For Retina, as well as both PERT datasets, the overall performance is worse than for the other datasets. In the case of PERT, this is expected, since the flow-sorted proportions are only an estimate of the true cell-type proportions. Furthermore, the reference profiles come from a different study and therefore differ more from the true cell-types in the mixture. However, the Retina dataset exhibits unusually low performance, which may be attributed to multiple factors. As an initial investigation, we performed quality control (QC) over the different samples to see whether errors are similarly distributed across samples. Figure 5 presents the per-sample error, measured by mAD, with the median and median absolute deviation (MAD) marked accordingly. Interestingly, for several of the mixing proportions, the third replicate has a much higher error than the rest. In the expression matrix, we observed a lower correlation between these replicates and the other two replicates in the batch. Additionally, for one of the mixtures, all three replicates show high error rates. We expand on these results in later sections to identify additional reasons that contribute to the low deconvolution performance of the Retina dataset.
Finally, we note that in all test cases the performance of the robust loss functions is comparable, while the Huber and Hinge losses require an additional parameter-tuning step. Consequently, we only consider the L1 loss as a representative of this “robust” group of loss functions in the rest of our study.
3.5 Agreement of Gene Expressions with Sum-to-One (STO) Constraint
Considering the lower performance of configurations that explicitly enforce the STO constraint, we investigate whether the features (genes) in each dataset respect this constraint. Under the STO and NN constraints, we use simple bounds to identify violating features, for which no combination of concentration values can satisfy both STO and NN. Let x_i be the expression value of the i-th gene in a given mixture, and g_i1, …, g_iK be the corresponding expression values in the K reference cell-types. Let m_i = min_k g_ik and M_i = max_k g_ik. Given that all concentrations are bounded between zero and one and sum to one, the minimum and maximum values that an estimated mixture value for the i-th gene can attain are m_i and M_i, respectively (attained by setting the concentration of the minimizing or maximizing cell-type to one and all others to zero). Using this notation, we can identify features that violate STO as those with x_i < m_i or M_i < x_i.
The first condition holds when the expression values in the reference profiles are so large that the sum of concentrations would need to be lower than one to match the corresponding gene expression in the mixture. The second condition holds when the expression of a gene in the mixture is so high that the sum of concentrations would need to be greater than one to match it. In other words, for feature i, these conditions identify extreme expression values in the reference profiles and mixture samples, respectively. Using these conditions, we compute the total number of features violating the STO condition in each dataset.
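The violation test reduces to comparing each mixture value against the row-wise minimum and maximum of the reference matrix. A small NumPy sketch (the matrix layout and function name are our own assumptions):

```python
import numpy as np

def sto_violating_features(X, G):
    """Flag features whose mixture value cannot be matched by any
    concentration vector satisfying both NN and STO.
    X: features x samples (mixtures); G: features x cell-types (references)."""
    g_min = G.min(axis=1, keepdims=True)   # minimum attainable mixture value
    g_max = G.max(axis=1, keepdims=True)   # maximum attainable mixture value
    too_low = X < g_min    # references too large: would need sum(c) < 1
    too_high = X > g_max   # mixture too large: would need sum(c) > 1
    return too_low | too_high
```

Each entry of the returned Boolean matrix flags a feature/sample pair for which no convex combination of the reference values can reproduce the observed mixture value.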
Figure 6 presents violating features in mixtures and reference profiles, averaged over all mixture samples in each dataset. We report the percentage of features to account for differences in the total number of features across datasets. We first observe that for the majority of datasets, except Retina and BreastBlood, the percentage of violating features in mixtures is much smaller than that in reference profiles. These two datasets also have the highest number of violating features in their reference profiles, together accounting for a sizable fraction of all features. This observation is likely due to the normalization used in pre-processing the microarray profiles. Specifically, one must not only normalize mixtures and references independently, but also with respect to each other. We suggest using control genes that are expressed in all cell-types with low variation to normalize expression profiles. A recent study aimed to identify subsets of housekeeping genes in human tissues that respect these conditions [Eisenberg2013]. Another choice is ribosomal protein genes, the basic building blocks of the cellular translation machinery, which are expressed across a wide range of species. The Remove Unwanted Variation (RUV) [Gagnon-Bartsch2012] method was developed to remove batch effects from microarray and RNA-Seq expression profiles, and also to normalize them using control genes. A simple extension of this method can be adopted to resolve the normalization differences between mixtures and references.
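As a minimal illustration of control-gene normalization, each reference profile can be rescaled so that its housekeeping genes match their average level in the mixtures. This is a deliberately simplified stand-in for the RUV-style correction suggested above, not the published RUV method; the function name and scaling scheme are our own:

```python
import numpy as np

def normalize_references_by_controls(X, G, control_idx):
    """Rescale each reference profile so that its control (housekeeping)
    genes have the same average expression as in the mixtures.
    X: features x samples (mixtures); G: features x cell-types (references);
    control_idx: indices of control genes assumed invariant across cell-types."""
    target = X[control_idx, :].mean()            # control level in mixtures
    col_means = G[control_idx, :].mean(axis=0)   # control level per reference
    return G * (target / col_means)              # per-column rescaling
```

After this rescaling, each reference column has the same average control-gene expression as the mixtures, which removes one source of the STO violations discussed above.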
Next, we evaluate how filtering these features affects the deconvolution performance on each dataset. For each case, we run deconvolution using all configurations and report the change in estimation error (delta mAD) independently. Figure 7 presents the changes after removing violating features from both mixtures and reference profiles before performing deconvolution. Similar to previous experiments, the Retina dataset behaves very differently from the rest. Setting this dataset aside, we find that the overall performance improves across all datasets, with the exception of RatBrain. In the case of the RatBrain dataset, we hypothesize that the initially superior performance can be attributed to highly expressed features; these outliers happen to agree with the true solution, resulting in a form of beneficial over-fitting. Finally, we note a correlation between the observed improvements and the level of feature violation in the mixtures. Consistent with this observation, we obtain similar results when we filter violating features only from the mixtures, but not from the reference profiles.
3.6 Range Filtering – Finding an Optimal Threshold
Different upper/lower bounds have been proposed in the literature to prefilter expression values prior to deconvolution; for example, Gong et al. [Gong2011] and Ahn et al. [Ahn2013] each suggest different effective ranges. To facilitate the choice of expression bounds, we seek a systematic way to identify an optimal range for different datasets. Kawaji et al. [Kawaji2014] recently reported an experiment to assess whether gene expression is quantified linearly in mixtures. To this end, they mixed two cell-types (THP-1 and HeLa cell lines) and tested whether experimentally measured expression matches computationally simulated datasets. They observed that microarray expression values are skewed for lowly expressed genes. This allows us to choose the lower bound based on experimental evidence. In our study, we search for optimal bounds over a log-linear space; thus, we set the minimum-expression threshold to the grid value closest to the bound proposed by Kawaji et al. [Kawaji2014].
Choosing an upper bound on the expression values is a harder problem, since it relates to enhancing the performance of deconvolution methods by removing outliers. Additionally, there is a known relationship between the mean expression value and its variance [Tu2002], which makes these outliers noisier than the rest of the features. This becomes even more important when dealing with purified cell-types that come from different labs, since highly expressed time- or microenvironment-dependent genes can differ significantly from those in the mixture [Qiao2012]. A simple argument is that the range of expression values in Affymetrix microarray technology is bounded (due to the initial normalization and image-processing steps), and measurements close to this bound are unreliable, as they may be saturated. However, the practical bounds used in previous studies are far from these extreme values. In order to examine the overall distribution of expression values, we analyze the different datasets independently. For each dataset, we separately analyze the mixture samples and the reference profiles. For each of the corresponding matrices, we vectorize the expression values and perform kernel smoothing using a Gaussian kernel to estimate the probability density function.
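The density-estimation step can be sketched with SciPy's Gaussian kernel density estimator. The data below are synthetic stand-ins for the vectorized log-transformed expression values of one dataset:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Hypothetical vectorized log2 expression values for one dataset's matrix
log_expr = rng.normal(loc=7.0, scale=2.0, size=5000)

kde = gaussian_kde(log_expr)        # Gaussian-kernel smoothing
grid = np.linspace(0.0, 16.0, 400)  # evaluation grid over the log2 range
density = kde(grid)                 # estimated probability density function
```

Plotting `density` against `grid` for each dataset reproduces the kind of bell-shaped (or, for Retina, skewed) distributions discussed in the text.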
Figure 8(a) and Figure 8(b) show the distribution of log-transformed expression values above the lower bound for mixtures and reference profiles, respectively. In agreement with our previous results, we observe an unusually skewed distribution for the Retina dataset, which in turn contributes to its lower performance compared to the other ideal mixtures. Additionally, a large fraction of the features in this dataset fall below the lower bound and are therefore filtered out and not shown in the distribution plot. For the rest of the datasets, in both mixtures and references, we observe a bell-shaped distribution, with most of the features captured below a common upper bound. Another exception to this pattern is the CellLines dataset, which has a heavier tail than the other datasets, especially in its reference profiles.
Next, we systematically evaluate the effect of range filtering by analyzing upper bounds increasing by factors of 10. In each case, we remove all features for which at least one of the reference profiles or mixture samples has a value exceeding the upper bound. Figure 9 illustrates the percentage of features that are retained as we increase the upper bound. As mentioned earlier, a large fraction of the features in the Retina dataset fall below the lower bound, which is evident from the maximum percentage of retained features in this figure. Additionally, consistent with our earlier observation of the expression densities, the bulk of the features is covered within this range, except for the CellLines dataset.
Finally, we perform deconvolution using the remaining features for each upper bound. The results are mixed, but a common trend is that removing highly expressed genes decreases the performance on ideal mixtures with known concentrations, while enhancing the performance on the PERT datasets. Figure 10(a) and Figure 10(b) show the changes in mAD error, compared to unfiltered deconvolution, for the PERT datasets; we observe improvements of up to 7 and 8 percent, respectively. The red and green points in the diagram mark the significance of the deconvolution results. Interestingly, while both datasets show similar improvements, all data points for cultured PERT appear insignificant, whereas uncultured PERT shows significance for the majority of data points. This is due to a weakness of our random model, which depends on the number of samples and is therefore not comparable across datasets. Uncultured PERT has twice as many samples as cultured PERT, which makes it less likely that any random sample achieves as good an mAD as the observed estimation error. This dependency on the number of samples could be addressed by defining per-sample p-values. Another observation is that, for the uncultured dataset, all configurations improve except the one with explicit NN and STO constraints. For the cultured dataset, on the other hand, the configurations with the explicit NN constraint perform well, whereas implicitly enforcing NN deteriorates their performance. The cultured and uncultured datasets reach their peak performance at different upper bounds.
For the rest of the datasets, range filtering decreased performance in the majority of cases, except for the Retina dataset, whose performance improved, with the best result achieved under both explicit NN and STO enforcement. This brought the best observed performance of this dataset, measured as mAD, to close to 7. These mixed results make it harder to choose a single upper-bound threshold, so we average the results over all datasets to find a balance between the improvements on PERT and the overall deterioration on the other datasets. Figure 11 presents the averaged mAD difference across all datasets, suggesting a “general” upper-bound filter that is optimal on average.
We use this threshold to filter all datasets and perform deconvolution on them. Figure 12 presents the dataset-specific performance of range filtering with fixed bounds, measured by changes in the mAD value compared to the original deconvolution. As observed from individual performance plots, range filtering is most effective in cases where the reference profiles differ significantly from the true cell-types in the mixture, such as the case with the PERT datasets. In ideal mixtures, since cell-types are measured and mixed at the same time/laboratory, this distinction is negligible. In these cases, highly expressed genes in mixtures and references coincide with each other and provide additional clues for the regression. Consequently, removing these highly expressed genes often degrades the performance of deconvolution methods. This generalization of the upper bound threshold, however, should be adopted with care, since each dataset exhibits different behavior in response to range filtering. Ideally, one must filter each dataset individually based on the distribution of expression values. Furthermore, in practical applications, gold standards are not available to aid in the choice of cutoff threshold.
We now introduce a new method that adaptively identifies an effective range for each dataset. Figure 13 illustrates the normalized maximal expression value of each gene in the mixture and reference matrices, sorted in ascending order. In all cases, intermediate values exhibit a gradual increase, whereas the top and bottom elements of the sorted list show a steep change in expression. We aim to identify the critical points corresponding to these sudden changes for each dataset. To this end, we select the middle point as a reference and analyze the upper and lower halves independently. For each half, we find the point on the curve that has the longest distance from the line connecting the first (respectively, last) element to the middle element. The application of this process to the CellLines dataset is visualized in Figure 14. The green points in this figure correspond to the critical points, which define the lower and upper bounds for the expression values of this dataset.
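The two critical points can be located with a standard farthest-point-from-chord ("knee") heuristic. The following sketch is our own reading of the procedure described above; the exact normalization and tie-breaking details may differ from the authors' implementation:

```python
import numpy as np

def farthest_from_chord(y):
    """Index of the point with the largest perpendicular distance from the
    straight line connecting the first and last points of the curve y."""
    n = len(y)
    x = np.arange(n, dtype=float)
    dx, dy = float(n - 1), float(y[-1] - y[0])
    dist = np.abs(dy * x - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))

def adaptive_range(sorted_vals):
    """Split the sorted expression curve at its midpoint and locate the
    lower and upper critical points independently, as described in the text."""
    mid = len(sorted_vals) // 2
    lo = farthest_from_chord(sorted_vals[: mid + 1])
    hi = mid + farthest_from_chord(sorted_vals[mid:])
    return sorted_vals[lo], sorted_vals[hi]
```

On a curve with steep tails and a gradual middle, the returned pair brackets the gradual region and serves as the dataset-specific lower/upper expression bounds.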
We use this technique to identify adaptive ranges for each dataset prior to deconvolution. Table II summarizes the identified critical points for each dataset. Figure 15 presents the dataset-specific performance of each method after adaptive range filtering. While in most cases the results for fixed and adaptive range filtering are comparable, and in some cases adaptive filtering gives better results, the most notable difference is the degraded performance on LiverBrainLung and, to some extent, RatBrain. To investigate this observation further, we examine the individual experiments for these datasets under fixed thresholds. Figure 16 illustrates the individual plots for each dataset. The common trend is that, in both cases, range filtering generally degrades the performance of deconvolution for all configurations. In other words, the extreme values in these datasets actually help guide the regression, and any filtering negatively impacts performance. This suggests that range filtering is not always helpful in enhancing deconvolution performance and can, in fact, be counterproductive for ideal datasets such as LiverBrainLung, RatBrain, and BreastBlood.
3.7 Selection of Marker Genes – The Good, Bad, and Ugly
Selecting marker genes that uniquely identify a certain tissue or cell-type, prior to deconvolution, can improve the conditioning of the basis matrix, thus improving its discriminating power and the stability of the results, as well as decreasing the overall computation time. A key challenge in identifying “marker” genes is the choice of method used to assess the selectivity of genes. Various parametric and nonparametric methods have been proposed in the literature to identify differentially expressed genes between two groups [Jeanmougin2010, Clark2014] or between one group and all others [VanDeun2009]. In parallel, different methods have been developed to identify tissue-specific and tissue-selective genes that are unique markers with high specificity to their host tissue/cell-type [Cavalli2011, Kadota2006, Birnbaum2011, Mohammadi2015]. While choosing or developing accurate methods for identifying reliable markers is an important challenge, an in-depth discussion is beyond the scope of this article. Instead, we adopt two methods used in the literature. Abbas et al. [Abbas2009] present a framework for choosing genes based on their overall differential expression. For each gene, they use a t-test to compare the cell-type with the highest expression to the second- and third-highest expressing cell-types. They then sort all genes, construct a sequence of basis matrices of increasing size, and finally identify an “optimal” cut among the top-ranked genes that minimizes the condition number of the basis matrix. Newman et al. [Newman2015] propose a modification of this method, in which genes are sorted not by their overall differential expression but by their tissue-specific expression compared to all other cell-types. After prefiltering differentially expressed genes, they sort genes by their expression fold ratio and use a similar cutoff that optimizes the condition number.
Note that the former method grows the basis matrix by one gene at each step, while the latter grows it by the number of cell-types. The method of Newman et al. has the benefit of choosing a similar number of markers per cell-type, which is useful in cases where one of the references has a significantly higher number of markers.
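The condition-number cutoff shared by both methods can be sketched as follows, assuming the genes have already been ranked by the chosen differential-expression criterion. The function name and the minimum prefix size are our own assumptions:

```python
import numpy as np

def condition_number_cut(G, ranked_idx, min_genes=10):
    """Grow the basis matrix along a ranked marker list and return the
    prefix minimizing its condition number (Abbas et al.-style cutoff).
    G: genes x cell-types reference matrix; ranked_idx: gene indices,
    best marker first."""
    best_k, best_cond = min_genes, np.inf
    for k in range(min_genes, len(ranked_idx) + 1):
        c = np.linalg.cond(G[ranked_idx[:k], :])  # 2-norm condition number
        if c < best_cond:
            best_cond, best_k = c, k
    return ranked_idx[:best_k], best_cond
```

For the Newman et al. variant, the loop would instead grow the prefix in steps equal to the number of cell-types, so that each cell-type contributes one marker per step.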
We implement both methods and assess their performance over the datasets. We observe slightly better performance with the second method and use it for the rest of our experiments. Due to the unexpected behavior of the Retina dataset, as well as the low number of significant markers in all our trials, we eliminate this dataset from further study. In identifying differentially expressed genes, a key parameter is the q-value cutoff used to report significant features. The distribution of corrected p-values is highly similar among the ideal mixtures, while differing significantly for the CellLines mixtures and both PERT datasets. We find an intermediate range of cutoffs to provide an optimal balance between these two cases, and perform experiments to test different cutoff values. Figure 17 shows the changes in the mAD measure after applying marker detection, using the q-value cutoff that resulted in the best overall performance in our study. We observe that the PERT_Uncultured and LiverBrainLung datasets have the highest gain across the majority of configurations, while BreastBlood and RatBrain improve in some configurations but deteriorate greatly in others. Finally, for the PERT_Cultured and CellLines datasets, we observe an overall decrease in performance in almost all configurations.
Next, we note that the internal sorting based on fold ratio intrinsically prioritizes highly expressed genes and is thus susceptible to noisy outliers. To test this hypothesis, we apply range selection with a global upper bound prior to the marker selection method and examine whether this combination enhances our previous results; a different q-value threshold proves to be the better choice in this case. Figure 18 shows the changes in performance when we prefilter expression ranges prior to marker selection. The most notable change is that both the PERT_Cultured and CellLines datasets, which were among the worst performers in the previous experiment, are now among the best-performing datasets in terms of overall mAD enhancement. We still observe some negative impact in this case, but the overall amplitude of the effect is dampened in both the BreastBlood and RatBrain datasets.
We note that there is no prior knowledge in the literature as to the “proper” choice of marker selection method, and their effect on deconvolution performance is unclear. An in-depth comparison of marker detection methods would benefit future developments in this field. An ideal marker should serve two purposes: (i) it should be highly informative of the cell-type in which it is expressed, and (ii) it should show low variance due to spatiotemporal changes in the environment (changes in time or microenvironment). Figure 19 shows a high-level classification of genes. An ideal marker is an invariant, cell-type-specific gene, marked in green in the diagram. On the other hand, variant genes, both universally expressed and tissue-specific, are not good candidates, especially in cases where references are adopted from a different study. These genes, however, comprise ideal subsets of genes to update in full deconvolution when updating the basis matrix, since their expression in the reference profile may differ significantly from the true cell-types in the mixture. It is worth mentioning that the proper order for identifying the best markers is to first identify tissue-specific genes and then prune them based on their variability. Otherwise, when selecting invariant genes first, we may select many housekeeping genes, since their expression is known to be more uniform than that of tissue-specific genes.
Another observation relates to the case in which groups of cell-type profiles have high similarity within the group but are significantly distant from the rest. This makes identifying marker genes more challenging for such groups of cell-types. An instance of this problem arises when we consider markers in the PERT datasets: erythrocytes have a much larger number of distinguishing markers than the other references. This phenomenon is primarily attributed to the underlying similarity among the undifferentiated cell-types in the PERT datasets, and their distance from fully differentiated red blood cells. In these cases, it is beneficial to summarize each group of similar tissues using a “representative profile” for the whole group, and to use a hierarchical structure to recursively identify markers at different levels of resolution [Mohammadi2015].
Finally, we examine the common choice of the condition number as the criterion for selecting the number of markers. First, unlike the “U”-shaped plot reported in previous studies, in which the condition number initially decreases to an optimal point and then starts increasing, we observe variable behavior in the condition-number plot for both the Newman et al. and Abbas et al. methods. This makes it infeasible to generalize the condition number as a measure applicable to all datasets. Additionally, we note that the lowest condition number is achieved when the basis matrix is fully orthogonal, that is, when its columns are mutually orthogonal and have equal norms. By selecting tissue-selective markers, we can ensure that the inner products of the columns of the resulting matrix are close to zero; however, the 2-norms of the columns can still differ. We developed a method that grows the basis matrix while specifically accounting for norm equality across columns. We find that in all cases our basis matrix has a lower condition number than those of both the Newman et al. and Abbas et al. methods, but it does not always improve the overall performance of deconvolution under the different loss functions. The optimal choice of the number of markers is another key question that needs further investigation.
3.8 To Regularize or Not to Regularize
We now evaluate the impact of regularization on the performance of different deconvolution methods. To isolate the effect of the regularizer from the prior filtering/feature-selection steps, we apply regularization to the original datasets. An L1 regularizer is typically applied in cases where the solution space is large, that is, when the set of available reference cell-types is a superset of the true cell-types in the mixture. This type of regularization acts as a “selector” that chooses the most relevant cell-types and zeroes out the coefficients of the rest, thereby enforcing a sparsity pattern. The datasets used in this study are all controlled benchmarks in which the references are hand-picked to match the cell-types in the mixture; thus, sparsifying the solution does not add value to the deconvolution process. An L2 regularizer, also known as Tikhonov regularization, on the other hand, is most commonly used when the problem is ill-posed. This is the case, for example, when the underlying cell-types are highly correlated with each other, which introduces dependency among the columns of the basis matrix. In order to quantify the impact of this type of regularization, we perform an experiment similar to the one in Section 3.4 with an added L2 regularizer. In this experiment, we use the L1 and L2 loss functions, as we previously showed that the performance of the other two loss functions is similar to L1. Instead of using the Ridge regression introduced in Section 2.4.2, we implement an equivalent formulation that traces the same regularization path but has higher numerical accuracy. To identify the optimal value of the regularization parameter, which balances the relative importance of the solution fit versus the regularizer, we search over a wide range of values. Notably, when the parameter is close to zero, the solution is identical to the one without regularization, whereas for large values the deconvolution process is guided only by the solution size.
Similar to the range-filtering step in Section 3.6, we use the minimum mAD error to choose the optimal value of λ.
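This λ selection can be sketched as a simple grid search over candidate values, scoring each by mAD against the known concentrations available in these controlled benchmarks (function names are hypothetical illustrations, not the authors' code):

```python
import numpy as np

def mad_error(c_est, c_true):
    """Mean absolute deviation between estimated and true concentrations."""
    return np.mean(np.abs(c_est - c_true))

def select_lambda(A, b, c_true, lambdas):
    """Pick the regularization weight minimizing mAD on a benchmark mixture."""
    best_lam, best_err = None, np.inf
    n = A.shape[1]
    for lam in lambdas:
        # Ridge fit via the augmented system [A; sqrt(lam) I].
        A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
        b_aug = np.concatenate([b, np.zeros(n)])
        c_est, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
        err = mad_error(c_est, c_true)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

In practice, when ground-truth concentrations are unavailable, a criterion such as the L-curve (discussed in the guidelines below) must replace the mAD oracle.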
Figure 20 presents the change in mAD error, relative to the original errors, after adding the L2 regularizer to the L1 and L2 loss functions. From these observations, PERT_Cultured appears to gain the most from regularization, whereas for PERT_Uncultured the changes are smaller. A detailed investigation, however, suggests that in the majority of cases for PERT_Cultured, the performance gain is due to over-shrinkage of the solution vector to the point of being almost uniform. Interestingly, a uniform concentration vector yields lower mAD error on this dataset than most other results. Overall, both PERT datasets show significant improvements over the original solution, which can be attributed to the underlying similarity among hematopoietic cells. On the other hand, an unexpected observation is the performance gain across configurations for the BreastBlood dataset. This is primarily explained by the limited number of cell-types (only two), combined with the similar mixing proportions used in all samples.
To gain additional insight into the parameters used during deconvolution, we plot the optimal λ values for each configuration in each dataset. Figure 21 summarizes these values. Large values of λ indicate a beneficial effect of regularization, whereas small values suggest a negative impact. In all cases where the overall mAD score improved, the corresponding λ was large. However, large values of λ do not necessarily translate into a significant impact on the final solution, as is evident in the CellLines and LiverBrainLung datasets. Finally, we observe that near-zero values of λ arise primarily with one of the two loss functions.
Based on our observations, we propose the following guidelines for the deconvolution of expression datasets:
Pre-process reference profiles and mixtures using invariant, universally expressed (housekeeping) genes to ensure that the similar cell quantity (SCQ) constraint is satisfied.
Filter out features that cannot satisfy the sum-to-one (STO) constraint.
Filter the lower and upper tails of gene-expression values using adaptive range filtering.
Select invariant (among references and between references and samples) cell-type-specific markers to enhance the discriminating power of the basis matrix.
Solve the regression using the L2 loss function with an explicit non-negativity constraint, together with an L2 regularizer (or group LASSO if sparsity is desired among groups of tissues/cell-types).
Use the L-curve method to identify the optimal balance between the regression fit and the regularization penalty.
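The regression step in these guidelines can be sketched as follows, assuming a non-negativity constraint is enforced with SciPy's NNLS solver and the L2 regularizer is folded into the same augmented system; this is a minimal illustration under those assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(A, b, lam=0.0):
    """Estimate non-negative cell-type fractions for mixture b given the
    reference basis A, with an optional L2 (Tikhonov) regularizer."""
    n = A.shape[1]
    # Fold the regularizer into the design matrix: minimize
    # ||Ax - b||^2 + lam * ||x||^2 subject to x >= 0.
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    x, _ = nnls(A_aug, b_aug)
    # Renormalize so estimated fractions sum to one; STO is applied
    # post hoc rather than enforced during the fit.
    s = x.sum()
    return x / s if s > 0 else x
```

The post-hoc renormalization reflects the earlier finding that explicitly enforcing STO during the fit typically degrades performance.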
4 Concluding Remarks
In this paper, we present a comprehensive review of different methods for deconvolving linear mixtures of cell-types in complex tissues. We perform a systematic analysis of the impact of different algorithmic choices on the performance of deconvolution methods, including the choice of the loss function, constraints on solutions, data filtering, feature selection, and regularization. We find the L2 loss to be superior in cases where the reference cell-types are representative of the constitutive cell-types in the mixture, while the L1 loss outperforms L2 in cases where this condition does not hold. Explicit enforcement of the sum-to-one (STO) constraint typically degrades the performance of deconvolution. We propose simple bounds to identify features violating this constraint and evaluate the total number of violating features in each dataset. We observe an unexpectedly high number of features that cannot satisfy the STO condition, which can be attributed to problems with normalization of expression profiles, specifically normalizing references and samples with respect to each other. In terms of filtering the range of expression values, we find that fixed thresholding is not effective and develop an adaptive method for filtering each dataset individually. Furthermore, we observe that range filtering is not always beneficial for deconvolution and, in fact, in some cases can deteriorate performance. We implement two commonly used marker-selection methods from the literature to assess their effect on the deconvolution process. Orthogonalizing reference profiles can enhance the discriminating power of the basis matrix. However, due to the known correlation between the mean and variance of expression values, this process alone does not always provide satisfactory results. Another key factor to consider is the low biological variance of genes, which enhances the reproducibility of the results and allows deconvolution with noisy references.
The combination of range filtering and marker selection eliminates genes with high mean expression, which in turn enhances the observed results. Finally, we address the application of Tikhonov regularization in cases where reference cell-types are highly correlated and the regression problem is ill-posed.
We summarize our findings in a simple set of guidelines and identify open problems that need further investigation. Areas of particular interest for future research include: (i) identifying the proper set of filters based on the datasets, (ii) extending the deconvolution problem to cases with more complex, hierarchical structure among reference vectors, and (iii) selecting optimal features to reduce computation time while maximizing discriminating power.
This work is supported by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, and by NSF Grant BIO 1124962.
Shahin Mohammadi received his Master’s degree in Computer Science from Purdue University in Dec. 2012 and is currently a Ph.D. candidate at Purdue. His research interests include computational biology, machine learning, and parallel computing. His current work spans different areas of Bioinformatics/Systems Biology and aims to develop computational methods coupled with statistical models for data-intensive problems, with application in mining the human tissue-specific transcriptome and interactome.
Neta Zuckerman received her PhD degree in computational biology from Bar-Ilan University, Israel, in June 2010. She completed her postdoctoral training in 2015 as a computational biologist at the Stanford University School of Medicine and at City of Hope, as well as a visiting scholar at the Department of Electrical Engineering, Stanford University. Her research interests focus on the role of immune cells in various diseases, specifically cancer, utilizing algorithm development and microarray data analysis. Neta is currently a computational biologist at Genentech Inc.
Andrea Goldsmith is the Stephen Harris Professor in the School of Engineering and a Professor of Electrical Engineering at Stanford University. Her research interests are in information theory and communication theory, and their application to wireless communications as well as biology and neuroscience. She has received several awards for her work, including the IEEE ComSoc Edwin H. Armstrong Achievement Award, the IEEE ComSoc and Information Theory Society Joint Paper Award, and the National Academy of Engineering Gilbreth Lecture Award. She is the author of 3 textbooks, including Wireless Communications, all published by Cambridge University Press, as well as an inventor on 28 patents.
Ananth Grama received the PhD degree from the University of Minnesota in 1996. He is currently a Professor of Computer Sciences and Associate Director of the Center for Science of Information at Purdue University. His research interests span areas of parallel and distributed computing architectures, algorithms, and applications. On these topics, he has authored several papers and texts. He is a member of the American Association for the Advancement of Science.