Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach

Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach

Gang Fang, Michael Steinbach,
Chad L. Myers, Vipin Kumar
Department of Computer Science,
University of Minnesota, Minneapolis, MN 55455, USA.
Abstract

Gene Set Enrichment Analysis (GSEA) and its variations aim to discover collections of genes that show moderate but coordinated differences in expression. However, such techniques may be ineffective if many individual genes in a phenotype-related gene set have weak discriminative power. A potential solution is to search for combinations of genes that are highly differentiating even when individual genes are not. Although such techniques have been developed, these approaches have not been used with GSEA to any significant degree because of the large number of potential gene combinations and the heterogeneity of measures that assess the differentiation provided by gene groups of different sizes.

To integrate the search for differentiating gene combinations and GSEA, we propose a general framework with two key components: (A) a procedure that reduces the number of scores to be handled by GSEA to the number of genes by summarizing the scores of the gene combinations involving a particular gene in a single score, and (B) a procedure to integrate the heterogeneous scores from combinations of different sizes and from different gene combination measures by mapping the scores to p-values. Experiments on four gene expression data sets demonstrate that the integration of GSEA and gene combination search can enhance the power of traditional GSEA by discovering gene sets that include genes with weak individual differentiation but strong joint discriminative power. Also, gene sets discovered by the integrative framework share several common biological processes and improve the consistency of the results among three lung cancer data sets.

1 Introduction

Microarray technology is an important tool to monitor gene-expression in bio-medical studies [27]. A common experimental design is to compare two sets of samples with different phenotypes, e.g. diseased and normal tissue, with the goal of discovering differentially expressed genes [16]. Statistical testing procedures, such as such as the t-test and significance analysis of microarrays [31], have been extensively studied and widely used. Subsequently, multiple testing corrections are usually applied [11]. A comprehensive review of such approaches are presented in [26].

Differential expression analysis based on univariate statistical tests has several well-known limitations. First, due to the low sample size, high dimensionality and the noisy nature of microarray data, individual genes may not meet the threshold for statistical significance after a correction for multiple hypotheses testing [28]. Second, the lists of differentially expressed genes discovered from different studies on the same phenotype have little overlap [28].

These limitations motivated the creation of Gene Set Enrichment Analysis GSEA [24, 28], which discovers collections of genes, for example, known biological pathways [28], that show moderate but coordinated differentiation. For example, Subramanian and Tamayo et al. [28] report that the p53 hypoxial pathway contains many genes that show moderate differentiation between two lung cancer sample groups with different phenotypes. Although the genes in the pathway are not individually significant after multiple hypothesis correction [28], the pathway is. For those familiar with GSEA and its output, Figure 1 shows the GSEA results for the p53 hypoxial pathway. GSEA also has the advantages of better interpretability and better consistency between the results obtained by different studies on the same phenotype [28]. Ackermann and Strimmer presented a comprehensive review of different GSEA variations in [1].

Unfortunately, GSEA and related techniques may be ineffective if many individual genes in a phenotype-related gene set have weak discriminative power. A potential solution to this problem is to search for combinations of genes that are highly differentiating even when individual genes are not. For this approach, the targets are groups of genes that show much stronger discriminative power when combined together [10]. For example, Figure 2(a) illustrates one type of differentially expressed gene combination discovered in [10]. The two genes have weak individual differentiation indicated by the overlapping class symbols on both the two axes. In contrast, these two genes are highly discriminative in a joint manner indicated by the different correlation structure in the two-dimensional plot, i.e. they are correlated along the blue and red dashed line respectively in the triangle and circle class. Such a joint differentiation may indicate that the interaction of the two genes is associated with the phenotypes even though the two genes, individually, are not.

Figure 1: A gene set (p53 hypoxial pathway) with many moderate but coordinated differential expression (towards the right tail of the ranked list). Figure generated with GSEA software [28]

Figure 2(b) illustrates another type of phenotype-associated gene combination discovered in [10], usually named differential coexpression [18, 12], in which the correlation of the two genes are high in one class but much lower in the other class. As discussed in [10, 15], existing multivariate tests such as Hotelling’s [23], Dempster’s T1 [9] are not suitable to detect such ‘complementary’ gene combinations because they only screen for differences in the multivariate mean vectors, and thus will favor pairs that consist of genes with strong marginal effects by themselves but not the genes like the four in Fig. 2. For clarification, we use differential gene combination search (denoted as DGCS) to refer to the multivariate data analyses that are designed to detect the complementarity of different genes, rather than those designed to model the correlation structure of different genes (such as Hotelling’s and Dempster’s T1 test). A variety of other DGCS measures for complementary gene combination search are proposed for gene pairs [20, 19] in addition to the two illustrated in figure 2. Several measures are designed for higher-order gene combination beyond pairs [12, 34]. These approaches can provide biological insights beyond univariate gene analysis as shown in [10, 12, 34].

(a)
(b)
Figure 2: Two highly differential gene-pairs with weak individual discriminative power. Axes indicate the expression level of indicated genes. Different color and shape of markers indicates the two phenotypes. Figures modified from Dettling et al. [10]. This two types of differential gene combinations are measured respectively by two measures and as described in section 2.

The limitations of GSEA and the capabilities of DGCS motivate a GSEA approach using gene combinations in which the score of a gene set is based on both the scores of individual genes in the set and the scores from the gene combinations in which these genes participate. Unfortunately, gene combination techniques have not been used with the GSEA approach in any significant way because of two key challenges.

  1. Finding a technique to reduce the vast number of gene combinations. There are exponentially more gene combinations than individual genes, i.e. in addition to the univariate genes, there are gene-pairs, gene-triplets, etc. Many variations of GSEA are based on a ranked list of the individual genes as illustrated in Figure 1. Including combinations in the ranked list might work for size-2 combinations [7, 35], but would not be feasible for handling gene combinations of larger sizes. Furthermore, this explosion in the number of gene combinations negatively impacts false discovery rates. Thus, by adding so many gene combinations, we run the risk that neither groups of genes nor individual genes will show statistically significant differentiation.

  2. Combining results from the heterogeneous measures used to score different size gene combinations. Furthermore, because a gene can be associated with the phenotype either as an univariate variable or together with other genes as a combination, the importance of a gene set should be based on both the univariate gene scores and the gene combination based scores of its set members. However, different measures have a different nature, scale and significance, and thus are not directly comparable (to be detailed in section 3.2). Indeed, differences exist even between gene combinations of the same measure but of different sizes. Therefore, the challenge lies in how to design a framework to combine different measures (a univariate measure222Ackermann and Strimmer [1] suggest that different univariate statistics have similar effect in GSEA and thus, we consider only one univariate statistic rather than multiple of them. plus one or more DGCS measures) together within the GSEA framework.

To the best of our knowledge, no existing work has sufficiently addressed these two challenges, although recent work presented in [7, 35] have made initial efforts at adding GSEA capabilities to gene combination search. More specifically, two approaches are proposed in [7, 35] to help the study of a specific type of size-2 differential combinations as illustrated in 2(b). The experiments in these two studies provide some evidence about the benefits of the integration. However, a more general framework is needed that can also handle other types of size-2 differential combinations as illustrated in figure 2(a), higher order differential combinations (e.g. SDC[12] and the n-statistic [34]), and multiple types of differential combinations.

Contributions: In this paper, we propose a general framework to address the above challenges for the effective integration of DGCS and GSEA. Specific contributions are as follows:

  1. A gene-combination-to-gene score summarization procedure (procedure A) that is designed to handle the exponentially increasing number of gene combinations. First, for a given gene combination measure and a certain , the score of a size- combination is partitioned into equal parts which are assigned to each of the genes in the combination. Because each gene can participate in up to size- combinations, each gene will be assigned with a score from each of these combinations. Secondly, an aggregation statistic, e.g., maximum absolute value is used to summarize the different scores for a gene. With such a procedure, scores for all the size- gene combinations are summarized to scores for genes. This procedure can effectively retain the length of the ranked list while handling gene combinations of size- ().

  2. A score-to-pvalue transformation and summarization procedure (procedure B) that is designed to integrate the scores contributed (in procedure ) from different gene combination measures and from gene combinations of different sizes. The transformation is based on p-values obtained from scores derived from phenotype permutations. Such a transformation enables the comparison of scores from different measures (either univariate or gene combination measures) and scores from the gene combinations of different sizes. Subsequently, among all the p-values of a gene, the best is used as an integrated score of statistical significance.

  3. Integration of the above two procedures with GSEA More specifically, after procedures A and B, each gene has a single integrated score. Unlike traditional univariate scores, these integrated scores are based on both the univariate statistic and the gene combination measures. For the type of GSEA variations that depend on phenotype permutation test, lists of integrative scores are computed, one for the real class labels and the other for the permutations. For the type of GSEA variations that are based on gene-set permutation test, only the list of integrated scores for the real class labels are needed. An independent Matlab implementation of the proposed framework is available for download, which allows most existing GSEA frameworks [1] to directly utilize the proposed framework to handle gene combinations with almost-zero modification.

  4. Experimental results that illustrate the effectiveness of the proposed framework. We integrated three gene combination measures and the GSEA approach presented in [28] and produced experimental results from four gene expression datasets. These results demonstrate that the integrative framework can discover gene sets that would have been missed without the consideration of gene combinations. This includes statistically significant gene sets with moderate differential gene combinations whose individual genes have very weak discriminative power. Thus, a gene combination assisted GSEA approach can improve traditional GSEA approaches by discovering additional disease-associated gene sets. Indeed, the integrative approach also improve traditional DGCS since most gene combinations are not statistically significant by themselves. Furthermore, we also show that the biologically relevant gene sets discovered by the integrative framework share several common biological processes and improve the consistency of the results among the three lung cancer data sets.

Overview: The rest of the paper is organized as follow. In section 2, we describe three gene combination measures used in the following discussion and experiments. In Section 3, we present the technical details of the two procedures of the general integrative framework. Experimental design and results are presented in Section 4, followed by conclusions and discussions in Section 5.

2 Differential Gene Combination Measures

In this section, we describe three DGCS measures for use in the following discussion and experiments. Let and be two phenotypic classes of samples of size and respectively. For each sample in and , we have the expression value of genes . First, we have the following two measures (denoted as and ) defined for a pair of genes as presented in [10]:

(1)
(2)

where and are two genes, and represents the correlation of and over the samples in set . As discussed in [10], and can detect the joint differential expression of two genes as illustrated in Figure 2(a) and Figure 2(b) respectively. and are used as two representative measures for gene pairs. Other options for gene combination measures for gene pairs have been investigated in [19, 20].

We use the subspace differential coexpression measure (denoted as ) proposed in [12] as the representative for measures for size- gene combinations, where can be any integer ().

(3)

where is a set of genes such that and . and respectively represent the fraction of samples in and over which the genes in are coexpressed. is a generalization of for detecting the differential coexpression of genes (), i.e. the genes are highly coexpressed over many samples in one class but over far fewer samples in the other. Other options for size-k combinations include the n-statistic [34], SupMaxPair [13], etc.

Signal-to-noise ratio (denoted as ) is used as the representative of traditional univariate statistics as in [28, 24].

(4)

where and are the mean expression of in class and respectively, and and are the standard deviation of the expression of in class and respectively. Many other univariate statistics can be found in [1].

In this paper, these four measures are used as representatives of each category for the illustration of the proposed integrative framework. However, the framework is general enough to handle other measures from each of these categories.

3 Methods

In Section 1, we motivated the integration of DGCS with GSEA, discussed two challenges associated with this integration, and briefly described two main procedures in the proposed framework. In this section, we present the technical details of the two procedures and their integration with GSEA.

3.1 Procedure : combination-to-gene score reduction

There are two steps in procedure . In step , for each DGCS measure and each size- gene combination, its score is divided into equal parts and assigned to each of the genes in the combination. In step , the scores assigned to a gene from all the size-k combinations in which the gene participates are summarized into a single score by an aggregation functions such as . Note that, for most univariate statistic and DGCS measures which can be either positive or negative (e.g. the four measures described in section 2), the maximum is taken over the absolute values of the scores, and the sign of the score with the highest absolute value is recorded for later use. Other simple statistics such as mean or median, or sophisticated ones such as weighted summation [33] can also be used. Since the focus of this paper is the overall integrative framework, we use for simplicity.

We provide a conceptual example of procedure for a gene with a certain DGCS measure . This example considers gene combinations up to size for illustration purpose. The gene is associated with scores assigned from gene combinations of size 2, 3 and 4 (denoted as , and respectively) in which participates. In step , the scores from , and are summarized by three maximum values, respectively. Please refer to the Appendix section for the illustration of this example.

Procedure serves as a general approach to summarize the scores of all the size- combinations into scores for the genes. If we want to integrate GSEA with one DGCS measure and a specific size-, procedure by itself can enable most existing variations of GSEA to search, with almost-zero modification, for statistically significant gene sets with moderate but coordinated gene combinations of size-. Such a GSEA approach can collectively consider the gene combinations affiliated with a gene set, and may provide better statistical power and better interpretability for DGCS, as will be shown in the experiments.

3.2 Procedure : Score-to-pvalue conversion and summarization

The hypothesis tested when one DGCS measure, say , is integrated with GSEA (by procedure ) is that, whether a gene set includes significantly many genes with highly positive (or highly negative) combination-based scores measured by . An extended hypothesis can be whether a gene set includes significantly many genes with highly positive (or highly negative) scores, either univariate or combination-based scores measured by different DGCS measures. The biological motivation of this extended hypothesis is that, a gene can be associated with the phenotype either as an univariate variable or together with other genes as a combination. To test this extended hypothesis, we design a second procedure (B) that can integrate the scores of a gene from different measures.

Before describing the steps in this procedure. We first discuss in detail the challenges of integrating heterogeneous scores from different DGCS measures and combinations of different sizes.

  1. The different nature of different measures: Different measures are designed to capture different aspects of the discriminative power of a gene or a gene combination between the two phenotypic classes. Signal-to-noise ratio (), a univariate gene-level statistic, measures the difference between the means of the expression of a gene in the two classes. In contrast, , a differential coexpression measure for a pair of genes describes the difference of the correlations of a gene-pair in the two classes. Thus, for a gene, the score of itself measured by and the score assigned and summarized from the size-2 gene combinations measured by are not directly comparable. Similarly, the scores of different DGCS measures can also have a different nature, e.g. and as illustrated in figure 2.

  2. The different scales of different measures: Different measures also have different ranges of values. For example, the range of , , , and are , , and respectively. Thus, they are not directly comparable.

  3. Differences in significance between different measures: Even after we normalize the scores of different measures to a single range, say , they are still not comparable because the scores of different measures have different statistical significance. For example, a normalized score of may be less significant than a normalized score of 0.5, if there are many genes with normalized score greater than in the permutation test [28], but very few genes with normalized score greater than in the permutation test. Note that, such differences in statistical significance also exists between gene combinations of different sizes, even for the same measure. Take the subspace differential coexpression measure as an example. A score of for a size-2 combination may not be as significant as a score of for a size-3 combination as discussed in [12].

To handle the above heterogeneity, we propose a score-to-pvalue transformation and summarization procedure that can enable the comparison and integration of the scores of different measures and combinations of different sizes. There are three major steps in procedure .

Figure 3: Illustration of step in procedure (score-to-pvalue transformation) for gene and measure .

3.2.1 Step : Score-to-pvalue transformation

Consider a concrete example. For a gene and a measure , procedure computes a single summarized score. In this step, the original phenotype class labels are permutated say times, and for each permutation, the same procedure is applied, and a corresponding score for and is computed. We denote the score of and summarized with the original label as , where is the gene index, and indicates the measure and means it is the score based on the original label. Similarly, we denote the scores computed in each of the permutation as , where .

These scores are organized in the table on the left in figure 3. The scores computed in the permutations can be considered as the null-distribution for gene and measure , and a p-value can be estimated for . Specifically, if is positive, the p-value is the ratio of the number of scores that are greater or equal to and the number of scores that are positive. Similarly if is negative, the p-value is computed as the ratio of the number of scores which are less or equal to and the number of scores which are negative.333Treating positive and negative scores separately follows the practice of GSEA [30] Note that, such a score-to-pvalue transformation is done for both and each of (), if the GSEA approach to be integrated is based on phenotype permutation test [30]. Otherwise, only needs to be transformed to p-value and will be used by the GSEA approaches that are based on gene-set permutation [30]. In this paper, we illustrate the proposed framework using the GSEA approach presented by Subramanian and Tamayo et al. [28] which is based on phenotype permutation test.

Essentially, step transforms the heterogeneous scores of a gene measured by different measures into their corresponding significance values, which are comparable to each other although their original values are not.

3.2.2 Step : P-value Summarization

Suppose that there are different measures to be integrated, one of which is a univariate statistic, and the others are different DGCS measures for which we consider combinations of sizes up to . After step , each gene has a p-value for the univariate measure and up to p-values for each size of gene combination for each measure. In step , the best444”Best” means it is the lowest raw p-value or the highest transformed p-value p-value associated with a gene is selected as the integrated significance.

Essentially, procedure integrates the scores of different DGCS measures for a gene and the univariate statistic of the gene into a single p-value. Such a statistical significance-based integration of heterogeneous scores enables the comparison and thus the ranking of all the genes. However, this ranked list does not maintain the original directionality of the integrated scores of each gene. In particular, most univariate statistics and DGCS measures (e.g. all the four measures described in section 2) can be either positive or negative. Such directionality information is lost in step and because the p-value is non-negative. Next, we describe a third step to maintain the directionality in the integration.

3.2.3 Step : Maintaining directionality associated with the integrated p-values

In the simple case, the measures to be integrated capture the same type of differentiation between the two phenotype classes, e.g. and . Suppose there are two genes and , whose integrated p-values are transformed respectively from two scores measured by and in step . The signs of these two scores are comparable to each other, because both and capture the change of coexpression of a combination of genes. Thus, we simply use the signs of these two scores as the signs associated with the integrated p-values of and . Similarly, we associate a sign to all the integrated p-values. And these p-values with associated signs can be used to rank the genes based on their significance as well as their direction of differentiation, i.e. p-values associated with positive signs are ranked with descending significance, and afterwards, p-values associated with negative signs are ranked with increasing significance.

In the other case, if the measures to be integrated capture different types of differentiation between the two phenotype classes, the directionality can not be fully maintained. For example, suppose there are two genes and , whose integrated p-values are transformed respectively from two scores measured by and in step . The signs of these two scores are not comparable, because captures the change of mean expression, and captures the change of coexpression of a combination of genes. Specifically, up-regulation of can be associated with either high or low coexpression of another gene-combination in which participates. Thus, it is not reasonable to follow the same strategy to associate signs to the integrated p-values. If we know the correspondence of the signs of different genes in advance, e.g. the up-regulation of is associated with the low coexpression of and , then the signs can be maintained. However, because it is not realistic to assume such prior knowledge, we propose the following heuristic approach which has proved a workable solution for our initial experiments. Specifically, since the focus of step is to integrate different DGCS measures in addition to the univariate statistic , we considered as the base measure. For the integrated p-values that are transformed from scores measured by in step (say there are of them), we use the signs of these scores for the integrated p-values. For the signs of the other genes, we assign positive signs to all of them once and negative signs to all of them a second time. Correspondingly, we have two ranked lists similar to the simple case described above.

Note that, if the directionality of differential measures can be preserved, the power of this approach will be enhanced. To deal with the situation where signs are not comparable, other approaches will be explored.

3.3 Integration with GSEA

From the above description of procedure and , we know that, if only one DGCS measure is used in the GSEA framework, only procedure is needed. If one or multiple DGCS measures are integrated together with the univariate statistic in the GSEA framework, procedure is needed in addition. In the first case, the integrative framework outputs a ranked list of scores with associated signs for the original class label, and lists corresponding to the permutation tests. In the second case, we have two sets of lists respectively for the two rounds of maintaining directionality in step in procedure .

In either case, the ranked lists along with the appropriate parameter settings and specification of gene sets can be used to run GSEA. The only modification to GSEA is the elimination of the initial GSEA step to generate the scores, simulated and actual, that measure the level of differentiation between genes across different phenotypes. The proposed integrative framework is implemented as a Matlab function (available at http://vk.cs.umn.edu/ICG/), independently from the GSEA framework to be integrated in this paper [28]. As summarized by Ackermann and Strimmer [1], hundreds of variations of GSEA are being used by different research groups. This independently implemented integrative framework can be easily applied to other variations of GSEA.

3.4 A further technical detail

In our experiments, in order to have a fair comparison, we transform the ranked lists into the exact sample distribution as the original lists corresponding to GSEA. Specifically, we only use the ranking information in the integrated ranked lists and map to them to the values in the original lists based on GSEA. Essentially, the values in the ranked list passed to the GSEA framework are exactly the same among GSEA, GSEA, GSEA, GSEA and GSEA, while the only difference is that the genes have different ranks in the lists. Such a mapping ensures that the additionally discovered gene sets are because of the integration of gene-combinations in addition to univariate statistic, rather than simply the different value distributions in the lists.

4 Results

In this section, we present the experimental design and results for the evaluation of the proposed integrative framework. We first provide a brief description of the data sets and parameters used in the experiments. Second, we describe and discuss the comparative experiments to study whether the integration of DGCS and GSEA (denoted as DGCSGSEA) improves both DGCS and GSEA. The two major evaluation criteria are the statistical power to discover (additional) significant results, and the consistency of the results across different datasets for the study of the same phenotype classes.

4.1 Data sets

The four datasets used in the experiments are described as follows:

  1. Three lung cancer datasets respectively denoted as Boston [4], Michigan [3] and Standford [14]: all the three data sets consist of gene-expression profiles in tumor samples from respectively 62, 86 and 24 patients with lung adenocarcinomas and provide clinical outcomes (classified as ”good” or ”poor” outcome). The two phenotypic classes in these three datasets are denoted as and as in [28].

  2. A data set from the NCI-60 collection of cancer cell lines for the study of p53 status [25] (denoted as data set): the mutational status of the p53 gene has been reported for 50 of the NCI-60 cell lines, with 17 being classified as normal and 33 as carrying mutations in the gene. The two phenotypic classes in this dataset are denoted as and as in [28].

All four datasets were downloaded from the GSEA website555http://www.broadinstitute.org/gsea/[28], and were already preprocessed as described in the supplementary file of [28]. For all four data sets, we use the gene sets from in MSigDB as in [28], as well as the same parameters.

4.2 Differential gene-combination measures

We consider one univariate statistic (), and three gene-combination measures (, and ) in our experiments. These four measures are described in section 2. and are defined only for size-2 combinations. For , we considered gene-combinations of size- and size- for the illustration of concept.

4.3 Q1: Does GSEA-assisted DGCS improve traditional DGCS?

Boston Michigan Stanford P53
0 2 0 0
645 1 2 1
10 1 0 0
Table 1: Number of gene combinations with FDR less than discovered from the four data sets by each combination measure

In this section, we study whether the question (Q1) of whether integration of DGCS and GSEA can improve traditional DGCS. For this comparison, we consider the integration of DGCS and GSEA as a GSEA-assisted DGCS approach. We first apply the traditional DGCS approaches on the four datasets to find statistically significant gene-combinations. We denote the three DGCS approaches respectively with the names of the three measures, i.e. , and . Second, we apply the integrative framework, in which GSEA is integrated respectively with the three DGCS measures, to find statistically significant gene sets with moderate but coordinated differential gene-combinations. We denote the three instances of the integrative approach respectively as GSEA, GSEA and GSEA. Then, we compare the results of , and , respectively with the results of GSEA, GSEA and GSEA.

Boston Michigan Stanford P53
GSEA 4 1 4 13
GSEA 1 7 4 0
GSEA 0 1 3 2
Table 2: Number of gene sets with FDR less than discovered from the four datasets by integrating GSEA with each of the three combination measures. One or multiple biological process(es) are indicated as superscript, from which we can observe the consistency across three lung cancer data sets. : apoptosis related pathways; : responses to hypoxia; : sppaPathway; : insulin-signaling sets; : oxidative-phosphorylation related sets; : p53-related sets. The names of the discovered gene sets and their FDRs are available in the Appendix section.

In this section, we study whether the question (Q1) of whether integration of DGCS and GSEA can improve traditional DGCS. For this comparison, we consider the integration of DGCS and GSEA as a GSEA-assisted DGCS approach. We first apply the traditional DGCS approaches on the four datasets to find statistically significant gene-combinations. We denote the three DGCS approaches respectively with the names of the three measures, i.e. , and . Second, we apply the integrative framework, in which GSEA is integrated respectively with the three DGCS measures, to find statistically significant gene sets with moderate but coordinated differential gene-combinations. We denote the three instances of the integrative approach respectively as GSEA, GSEA and GSEA. Then, we compare the results of , and , respectively with the results of GSEA, GSEA and GSEA.

Table 1 lists the number of statistically significant gene combinations discovered respectively by the three measures on each of the four datasets, with an FDR threshold of . Table 1 lists the number of statistically significant gene sets discovered by integrating GSEA respectively with the three DGCS measures on each of the four datasets, also with the same FDR threshold of . Three major observations can be made by comparing the two tables:

4.3.1 GSEA-assisted DGCS has better statistical power than traditional DGCS

Table 1 shows that, in most cases, traditional DGCS discovers very few (less than 3) statistically significant gene combinations (although and have and gene-combinations on the Boston data set, none of them have FDR lower than ). In contrast, table 2 shows that the integration of GSEA with the three combination measures discover multiple significant gene sets in most of the cases. This difference implies that the discovered statistically significant gene sets include many moderate but coordinated differential gene combinations, even though the combinations are not significant by themselves as shown in table 1. This comparison demonstrates that traditional DGCS, similar to univariate gene analysis, has limited statistical power, and DGCSGSEA can increase that power.

4.3.2 GSEA-assisted DGCS has better result consistency than traditional DGCS

We further compare DGCS and DGCSGSEA by studying the consistency of their results on the first three data sets that are all from lung cancer studies, as done in [28]. For DGCS, discovered genes on Michigan but nothing from Boston and Stanford; discovered combinations on Boston but only 1 and 2 from Michigan and Stanford, respectively, and there are no common ones between the 645, 1, and 2 gene combinations; discovered genes on Boston but only gene on Michigan and nothing from Stanford, and the 10 and 1 combinations do not overlap. The inconsistent results make the follow-up biological interpretation very difficult.

In contrast, when the three DGCS measures are integrated with GSEA, several consistent themes can be observed: (i) Apoptosis related pathways (marked by in table 2): GSEA discovered four gene sets on Boston, three of which are known to be closely related to cancer and specifically to apoptosis, i.e. nfkbpathway, ST-Gaq-Pathway and TNF-Pathway. This apoptosis theme is shared by the gene sets discovered by -GSEA from Michigan and Stanford, i.e. Monocyte-AD-Pathway, hivnefPathway, deathPathway and caspasePathway. These apoptosis related pathways are enriched with the lung cancer samples with good outcome, which makes sense biologically and also corresponds to the proliferation theme supported by the gene sets enriched with the samples with poor outcome as reported in [28]. Several other examples of the result consistency, as indicated by other superscripts in Table 2, are in the technical report. This comparison demonstrates that traditional DGCS, like univariate gene analysis, has poor result consistency across the three lung cancer data sets, and DGCSGSEA can improve its consistency by integrating DGCS measures with GSEA.

4.3.3 GSEA-assisted DGCS with different DGCS measures complement each other

The number of significant gene sets discovered by the three versions of GSEA varies, i.e. GSEA and GSEA discovered a bit larger number of significant gene sets than GSEA. However, GSEA still discovered several gene sets that are not discovered by GSEA or GSEA, e.g. one gene set from the Michigan data set and three from the Stanford data set. This indicates that GSEA, GSEA and GSEA have complementary perspectives, i.e. different combination measures capture different aspects of the difference between the phenotype classes (recall the two types of combinations in Figure 2). This also demonstrates the proposed framework is general enough to integrate any type of DGCS with GSEA.

4.4 Q2: Does DGCS-assisted GSEA improve GSEA?

In this Section, we want to answer the question (Q2) of whether the integration of DGCS and GSEA can improve traditional GSEA. For this comparison, we consider the integration of DGCS and GSEA as a DGCS-assisted GSEA approach. We design three sets of comparisons. Firstly, we compare the traditional univariate-statistic based GSEA (denoted as GSEA) with the integrative framework where one gene-combinations measure is used instead of . Specifically, we compare the gene sets discovered by GSEA with the gene sets discovered by GSEA, GSEA and GSEA. Then, we compare GSEA with the integrative framework where one gene-combinations measure is used in addition to , i.e. GSEA, GSEA and GSEA. Furthermore, we also study the integration of multiple gene-combinations measure in addition to , e.g. GSEA.

Figure 4 displays the statistically significant gene sets discovered with different (combinations of) measures respectively from the four datasets. An FDR threshold of is used as in [28] for comparison purpose. The results presented in [28] are exactly reproduced, i.e. the gene sets listed in the rows corresponding to GSEA. In each of these four figures, we consider the traditional univariate-statistic based GSEA (GSEA) as the baseline, and compare it with the rows corresponding to GSEA, GSEA, GSEA, GSEA, GSEA, GSEA and GSEA. From these comparisons, the following observations can be made.

Figure 4: Common captions for the four tables: Statistically significant gene sets discovered by different (combinations of) measures from each of the four data sets. The first row of each table shows the name of the data set, and the second row indicates the two phenotype classes in the data set that a gene set can be enriched with. The first column indicates the measures used in the integrative framework. For each data set and each (combination of) measure(s), we list the names of the statistically significant gene sets and the corresponding FDRs for both the classes. The traditional univariate-statistic based GSEA (M0GSEA) is considered as the baseline. For the other rows, we only list a gene set if it is only discovered by the integrative approach (with bolded name), or it has a non-trivially decreased FDR when it is discovered by the integrative approach (with bolded FDR). Please refer to the Appendix section for the complete tables.

4.4.1 DGCS-assisted GSEA discovers additional significant gene sets

First, we compare the rows corresponding to GSEA, GSEA, GSEA with the rows corresponding to GSEA. We bolded the additional gene sets that are only discovered by GSEA, GSEA, GSEA. For example, with GSEA, no statistically significant gene sets have been enriched with class A in the Boston data set. In contrast, GSEA discovered gene sets, three out of which (discussed in ) are related to apoptosis which is consistent with the results on Michigan and Stanford. On the Michigan dataset, GSEA discovered a gene set beta-Alanine-metabolism that is not discovered by GSEA. This gene set is related to the responses of hypoxia, which is consistent with the results on Boston and Stanford. It is worth noting that, although most studies did not report statistically significant gene sets on the Stanford dataset due to the very small sample size, GSEA, GSEA, GSEA respectively discovered 4, 4 and 3 significant gene sets. These additional gene sets were discovered because the three DGCS measures capture different types of the differentiation between the two phenotype classes, compared to the traditional univariate differential expression-based GSEA.

Second, we compare the rows corresponding to GSEA, GSEA, GSEA with the rows corresponding to GSEA. We bolded the additional gene sets that are only discovered by the integrative approach. For example, on the Boston data set, based GSEA discovered 8 gene sets. In addition, GSEA discovered the proteasomePathway gene set, and GSEA discovered the p53-signaling gene set. Both ubiquitin-proteasome pathway and p53-signaling pathway are well-known cancer-related pathways that are also specifically related to lung cancer [21, 6]. (Additional examples are in the technical report.) The gene sets that are discovered by DGCS-assisted GSEA but not by -GSEA illustrate the benefits of using DGCS to assist GSEA.

Next, we also observed that integrating multiple DGCS measures can further discover statistically significant gene sets. For illustration purpose, we compare the rows corresponding to GSEA, GSEA, GSEA with the rows corresponding to GSEA. GSEA discovers the g2Pathway gene set and the gsk3Pathway gene set, respectively from the Boston and the Michigan dataset. Neither of these two pathways are discovered by GSEA, GSEA, GSEA and GSEA. The curated gene set g2Pathway contains the genes related to the G2/M transition, which is shown to be regulated by p53 [29], a well-known cancer-related gene. The curated gene set gsk3Pathway is the signaling pathway of GSK-3-, which has been shown to be related to different types of cancer[5, 22]. These two cancer-related pathways are discovered by GSEA but not by GSEA, GSEA, GSEA and GSEA. This indicates that different members of these two pathways are differential between the two phenotype groups in different manners, i.e. the differentiation of some genes is captured by , some by , some by and some by . These two pathways can be discovered to be statistically significant only when these measures are used together in the integrative framework. This demonstrates the benefits of the proposed framework for integrating multiple DGCS measures with a univariate measure.

It is worth noting that, the gene sets discovered by the integrative framework with multiple measures are not necessarily a superset of those discovered by integrating each individual measure with GSEA since, when different DGCS measures are integrated with GSEA, the null-hypotheses tested in the GSEA framework are correspondingly different. The highlight of the integrative framework is that, additional gene sets can be discovered when different DGCS measures are used to assist the traditional univariate statistic-based GSEA. In practice, these different versions of GSEA should be used collectively.

4.4.2 DGCS-assisted GSEA discovers gene sets with lower FDRs

:

Even when a gene set is discovered both before and after a DGCS measure is integrated into the framework, we can observe several interesting cases where the FDR of a gene set becomes much lower after the integration. We bolded the FDRs that significantly decreased when they are discovered by the integrative approach. For example, -GSEA, in which , , and are integrated together, discovers p53hypoxialPathway with an much lower FDR of , two-order lower than -GSEA. This example indicates that several members of p53hypoxialPathway have weak individual differentiation measured by , but have more significant differentiation when they are measured by . This and other similar examples demonstrates the benefits of the proposed framework for integrating multiple DGCS measures.

4.4.3 DGCS-assisted GSEA further improve the consistency across the three lung cancer data sets

As presented in [28], GSEA discovered and gene sets respectively from the Boston and Michigan data sets, and 5 of the 8 in Boston and 6 of the 11 in Michigan are common. The three unmatched gene sets that are discovered in Boston but not in Michigan are GLUT-DOWN, LEU-DOWN and CellCycleCheckpoint. Interestingly, the latter two are discovered from both the Boston and the Michigan data sets by GSEA666The CellCyclePathway discovered on Michigan and the cell-cycle-checkpoint discovered on Boston are both cell-cycle related gene sets. Such observations suggest that DGCS-assisted GSEA also provides new insights to the consistency between different data sets.

4.4.4 Additional issues of multiple hypothesis testing

Because different combinations of measures are used in the integrative framework, additional issues of multiple hypothesis testing arise, even though multiple hypothesis testing has been addressed for each measure via the phenotype permutation test procedure in the GSEA framework proposed in [28]. To investigate this, we designed experiments with 4 of the possibilities of integrations, i.e. GSEA, GSEA, GSEA and GSEA. Even using a collective (meta-level) multiple hypothesis correction, many discovered gene sets would still be significant. For examples, GSEA discovers p53hypoxialPathway from the Boston data set with a low FDR of , and GSEA discovers deathPathway from the Michigan data set with a lower FDR of . We also did additional permutation tests, in which we generate random gene sets with the same sizes as the sets in MSigDB , and do the same set of experiments as shown in Figure 4. The FDR values of the random gene sets computed in the integrative framework are mostly insignificant (higher than ).

5 Discussion

In this paper we motivated the integration of differential gene-combination search and gene set enrichment analysis for bi-directional benefits on both them. We proposed a general integrative framework that can handle gene-combinations of different sizes () and different gene-combination measures in addition to an univariate statistic used in traditional GSEA. The experimental results demonstrated that, on one hand, GSEA-assisted DGCS has better statistical power and result consistency than traditional DGCS. On the other hand, DGCS-assisted GSEA can discover additional statistically significant gene sets that are ignored by traditional GSEA and further improve the result consistency of the traditional GSEA.

The proposed framework can be extended in several ways. Different variations of GSEA will be considered. Along these lines, we note that the proposed integrative framework is general enough to integrate most existing variations of GSEA approaches summarized in [1] with minimal amount of modification. Also, it should be possible to integrate DGCS and gene-subnetwork discovery. Both GSEA and gene-subnetwork discovery [17, 8] can discover collections of genes, either known gene sets [28] or subnetworks in a molecular network (e.g. protein interaction network), that show moderate but coordinated differentiation. In this paper, we integrate DGCS and GSEA as an illustration of the general framework for integrating scores from different gene-combination measures and gene-combinations of different sizes, in addition to the traditional univariate statistic, but the same framework also applies to the integration of DGCS and gene subnetwork discovery. Another direction is the use of this framework for the analysis of (GWAS) SNP data, by following the methodology proposed in recently work on pathway/network based analysis of GWAS datasets [32, 2]. Finally, it may be possible to use constraints on gene-combinations to improve our framework. In procedure , for each gene-combination measure and an integer , the score of a gene is assigned from all the possible gene-combinations. A further extension of procedure is to only consider the gene combinations, in which the genes appear in a common gene set, e.g. a pathway. Such gene-set-based constraints may better control false positive gene combinations and improve the statistical power of the whole integrative framework.

6 Appendix

6.1 Illustration of procedure

The illustration of procedure is in figure 5.

Figure 5: Illustration of procedure (combination-to-gene score assignment) for . The three ellipses represent the three gene combinations that participates to with respect to measure .

6.2 Complete Gene-set Table

Due to the space limit, Table 2 and Figure 4 are both summarized from the four complete tables that are available at http://vk.cs.umn.edu/ICG/. Specifically, Table 2 is a high-level summary of the number of gene sets discovered and the biological processes associated with each of the gene sets. In figure 4, we listed the complete results for GSEA (the baseline), while for the other rows, we only list a gene set if it is only discovered by the integrative approach (with bolded name), or it has a non-trivially decreased FDR when it is discovered by the integrative approach (with bolded FDR).

References

  • [1] M. Ackermann and K. Strimmer. A general modular framework for gene set enrichment analysis. BMC bioinformatics, 10(1):47, 2009.
  • [2] S. Baranzini, N. Galwey, J. Wang, P. Khankhanian, R. Lindberg, D. Pelletier, et al. Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Human Molecular Genetics, 18(11):2078, 2009.
  • [3] D. Beer, S. Kardia, C. Huang, T. Giordano, A. Levin, D. Misek, L. Lin, G. Chen, T. Gharib, D. Thomas, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 8(8):816–824, 2002.
  • [4] A. Bhattacharjee, W. Richards, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS, 98(24):13790, 2001.
  • [5] Q. Cao, X. Lu, and Y. Feng. Glycogen synthase kinase-3 positively regulates the proliferation of human ovarian cancer cells. Cell Research, 16(7):671–677, 2006.
  • [6] E. Chenette. Signalling: A Ras and NF-B pas de deux. Nature Reviews Cancer, 9(12):847, 2009.
  • [7] Y. Choi and C. Kendziorski. Statistical methods for gene set co-expression analysis. Bioinformatics, 25(21):2780, 2009.
  • [8] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Mol. Sys. Bio., 3:140, 2007.
  • [9] A. Dempster. A high dimensional two sample significance test. The Annals of Mathematical Statistics, pages 995–1010, 1958.
  • [10] M. Dettling, E. Gabrielson, and G. Parmigiani. Searching for differentially expressed gene combinations. Genome Biology, 6(10):R88, 2005.
  • [11] S. Dudoit, J. Shaffer, and J. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71–103, 2003.
  • [12] G. Fang, R. Kuang, G. Pandey, M. Steinbach, C. L. Myers, and V. Kumar. Subspace Differential Coexpression Analysis: Problem Definition and A General Approach. In Pacific Symposium on Biocomputing, pages 145–156, 2010.
  • [13] G. Fang, G. Pandey, M. Gupta, M. Steinbach, and V. Kumar. Mining low-support discriminative patterns from dense and high-dimensional data. Technical Report 09-011, Department of Computer Science, University of Minnesota, 2009.
  • [14] M. Garber, O. Troyanskaya, K. Schluens, S. Petersen, et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS, 98(24):13784, 2001.
  • [15] G. Glazko and F. Emmert-Streib. Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics, 25(18):2348, 2009.
  • [16] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531, 1999.
  • [17] T. Ideker, O. Ozier, et al. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18:233–240, 2002.
  • [18] D. Kostka and R. Spang. Finding disease specific alterations in the co-expression of genes. Bioinformatics, 20(194–199), 2004.
  • [19] Y. Lai, B. Wu, et al. A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics, 20(17):3146–3155, 2004.
  • [20] K. Li. Genome-wide coexpression dynamics: theory and application. PNAS, 99(26):16875–16880, 2002.
  • [21] T. Li, L. Ho, B. Piperdi, et al. Phase II study of the proteasome inhibitor bortezomib (PS-341, Velcade®) in chemotherapy-naive patients with advanced stage non-small cell lung cancer. Lung Cancer, 2009.
  • [22] X. Liao, J. Thrasher, J. Holzbeierlein, S. Stanley, and B. Li. Glycogen Synthase Kinase-3 Activity Is Required for Androgen-Stimulated Gene Expression in Prostate Cancer. Endocrinology, 145(6):2941, 2004.
  • [23] Y. Lu, P. Liu, P. Xiao, and H. Deng. Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics, 21(14):3105, 2005.
  • [24] V. Mootha, C. Lindgren, K. Eriksson, A. Subramanian, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3):267–273, 2003.
  • [25] M. Olivier et al. The IARC TP53 database: new online mutation analysis and recommendations to users. Human Mutation, 19(6):607–614, 2002.
  • [26] W. Pan. A comparative review of statistical methods for discovering differentially expressed genes in microarray experiments. Bioinformatics, 18(4):546, 2002.
  • [27] M. Schena, D. Shalon, R. Davis, P. Brown, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, 1995.
  • [28] A. Subramanian, P. Tamayo, V. Mootha, S. Mukherjee, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43):15545–15550, 2005.
  • [29] W. Taylor and G. Stark. Regulation of the G2/M transition by p53. Oncogene, 20(15):1803–1815, 2001.
  • [30] L. Tian, S. Greenberg, S. Kong, J. Altschuler, I. Kohane, and P. Park. Discovering statistically significant pathways in expression profiling studies. PNAS, 102(38):13544, 2005.
  • [31] V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. PNAS, 98:5116–5121, 2001.
  • [32] K. Wang, M. Li, and M. Bucan. Pathway-based approaches for analysis of genomewide association studies. AJHG, 81(6):1278–1283, 2007.
  • [33] K. Wang, M. Narayanan, H. Zhong, M. Tompa, E. E. Schadt, and J. Zhu. Meta-analysis of inter-species liver co-expression networks elucidates traits associated with common human diseases. PLoS Comp. Bio., 5(12):e1000616, 2009.
  • [34] Y. Xiao, R. Frisina, A. Gordon, L. Klebanov, and A. Yakovlev. Multivariate search for differentially expressed gene combinations. BMC Bioinfo., 5(1):164, 2004.
  • [35] J. Zhang, J. Li, H. Deng, and I. Jordan. Identifying Gene Interaction Enrichment for Gene Expression Data. PLoS ONE, 4(11):e8064, 2009.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
116150
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description