Integration of Differential Genecombination Search and Gene Set Enrichment Analysis: A General Approach
Abstract
Gene Set Enrichment Analysis (GSEA) and its variations aim to discover collections of genes that show moderate but coordinated differences in expression. However, such techniques may be ineffective if many individual genes in a phenotyperelated gene set have weak discriminative power. A potential solution is to search for combinations of genes that are highly differentiating even when individual genes are not. Although such techniques have been developed, these approaches have not been used with GSEA to any significant degree because of the large number of potential gene combinations and the heterogeneity of measures that assess the differentiation provided by gene groups of different sizes.
To integrate the search for differentiating gene combinations and GSEA, we propose a general framework with two key components: (A) a procedure that reduces the number of scores to be handled by GSEA to the number of genes by summarizing the scores of the gene combinations involving a particular gene in a single score, and (B) a procedure to integrate the heterogeneous scores from combinations of different sizes and from different gene combination measures by mapping the scores to pvalues. Experiments on four gene expression data sets demonstrate that the integration of GSEA and gene combination search can enhance the power of traditional GSEA by discovering gene sets that include genes with weak individual differentiation but strong joint discriminative power. Also, gene sets discovered by the integrative framework share several common biological processes and improve the consistency of the results among three lung cancer data sets.
1 Introduction
Microarray technology is an important tool to monitor geneexpression in biomedical studies [27]. A common experimental design is to compare two sets of samples with different phenotypes, e.g. diseased and normal tissue, with the goal of discovering differentially expressed genes [16]. Statistical testing procedures, such as such as the ttest and significance analysis of microarrays [31], have been extensively studied and widely used. Subsequently, multiple testing corrections are usually applied [11]. A comprehensive review of such approaches are presented in [26].
Differential expression analysis based on univariate statistical tests has several wellknown limitations. First, due to the low sample size, high dimensionality and the noisy nature of microarray data, individual genes may not meet the threshold for statistical significance after a correction for multiple hypotheses testing [28]. Second, the lists of differentially expressed genes discovered from different studies on the same phenotype have little overlap [28].
These limitations motivated the creation of Gene Set Enrichment Analysis GSEA [24, 28], which discovers collections of genes, for example, known biological pathways [28], that show moderate but coordinated differentiation. For example, Subramanian and Tamayo et al. [28] report that the p53 hypoxial pathway contains many genes that show moderate differentiation between two lung cancer sample groups with different phenotypes. Although the genes in the pathway are not individually significant after multiple hypothesis correction [28], the pathway is. For those familiar with GSEA and its output, Figure 1 shows the GSEA results for the p53 hypoxial pathway. GSEA also has the advantages of better interpretability and better consistency between the results obtained by different studies on the same phenotype [28]. Ackermann and Strimmer presented a comprehensive review of different GSEA variations in [1].
Unfortunately, GSEA and related techniques may be ineffective if many individual genes in a phenotyperelated gene set have weak discriminative power. A potential solution to this problem is to search for combinations of genes that are highly differentiating even when individual genes are not. For this approach, the targets are groups of genes that show much stronger discriminative power when combined together [10]. For example, Figure 2(a) illustrates one type of differentially expressed gene combination discovered in [10]. The two genes have weak individual differentiation indicated by the overlapping class symbols on both the two axes. In contrast, these two genes are highly discriminative in a joint manner indicated by the different correlation structure in the twodimensional plot, i.e. they are correlated along the blue and red dashed line respectively in the triangle and circle class. Such a joint differentiation may indicate that the interaction of the two genes is associated with the phenotypes even though the two genes, individually, are not.
Figure 2(b) illustrates another type of phenotypeassociated gene combination discovered in [10], usually named differential coexpression [18, 12], in which the correlation of the two genes are high in one class but much lower in the other class. As discussed in [10, 15], existing multivariate tests such as Hotelling’s [23], Dempster’s T1 [9] are not suitable to detect such ‘complementary’ gene combinations because they only screen for differences in the multivariate mean vectors, and thus will favor pairs that consist of genes with strong marginal effects by themselves but not the genes like the four in Fig. 2. For clarification, we use differential gene combination search (denoted as DGCS) to refer to the multivariate data analyses that are designed to detect the complementarity of different genes, rather than those designed to model the correlation structure of different genes (such as Hotelling’s and Dempster’s T1 test). A variety of other DGCS measures for complementary gene combination search are proposed for gene pairs [20, 19] in addition to the two illustrated in figure 2. Several measures are designed for higherorder gene combination beyond pairs [12, 34]. These approaches can provide biological insights beyond univariate gene analysis as shown in [10, 12, 34].
The limitations of GSEA and the capabilities of DGCS motivate a GSEA approach using gene combinations in which the score of a gene set is based on both the scores of individual genes in the set and the scores from the gene combinations in which these genes participate. Unfortunately, gene combination techniques have not been used with the GSEA approach in any significant way because of two key challenges.

Finding a technique to reduce the vast number of gene combinations. There are exponentially more gene combinations than individual genes, i.e. in addition to the univariate genes, there are genepairs, genetriplets, etc. Many variations of GSEA are based on a ranked list of the individual genes as illustrated in Figure 1. Including combinations in the ranked list might work for size2 combinations [7, 35], but would not be feasible for handling gene combinations of larger sizes. Furthermore, this explosion in the number of gene combinations negatively impacts false discovery rates. Thus, by adding so many gene combinations, we run the risk that neither groups of genes nor individual genes will show statistically significant differentiation.

Combining results from the heterogeneous measures used to score different size gene combinations. Furthermore, because a gene can be associated with the phenotype either as an univariate variable or together with other genes as a combination, the importance of a gene set should be based on both the univariate gene scores and the gene combination based scores of its set members. However, different measures have a different nature, scale and significance, and thus are not directly comparable (to be detailed in section 3.2). Indeed, differences exist even between gene combinations of the same measure but of different sizes. Therefore, the challenge lies in how to design a framework to combine different measures (a univariate measure^{2}^{2}2Ackermann and Strimmer [1] suggest that different univariate statistics have similar effect in GSEA and thus, we consider only one univariate statistic rather than multiple of them. plus one or more DGCS measures) together within the GSEA framework.
To the best of our knowledge, no existing work has sufficiently addressed these two challenges, although recent work presented in [7, 35] have made initial efforts at adding GSEA capabilities to gene combination search. More specifically, two approaches are proposed in [7, 35] to help the study of a specific type of size2 differential combinations as illustrated in 2(b). The experiments in these two studies provide some evidence about the benefits of the integration. However, a more general framework is needed that can also handle other types of size2 differential combinations as illustrated in figure 2(a), higher order differential combinations (e.g. SDC[12] and the nstatistic [34]), and multiple types of differential combinations.
Contributions: In this paper, we propose a general framework to address the above challenges for the effective integration of DGCS and GSEA. Specific contributions are as follows:

A genecombinationtogene score summarization procedure (procedure A) that is designed to handle the exponentially increasing number of gene combinations. First, for a given gene combination measure and a certain , the score of a size combination is partitioned into equal parts which are assigned to each of the genes in the combination. Because each gene can participate in up to size combinations, each gene will be assigned with a score from each of these combinations. Secondly, an aggregation statistic, e.g., maximum absolute value is used to summarize the different scores for a gene. With such a procedure, scores for all the size gene combinations are summarized to scores for genes. This procedure can effectively retain the length of the ranked list while handling gene combinations of size ().

A scoretopvalue transformation and summarization procedure (procedure B) that is designed to integrate the scores contributed (in procedure ) from different gene combination measures and from gene combinations of different sizes. The transformation is based on pvalues obtained from scores derived from phenotype permutations. Such a transformation enables the comparison of scores from different measures (either univariate or gene combination measures) and scores from the gene combinations of different sizes. Subsequently, among all the pvalues of a gene, the best is used as an integrated score of statistical significance.

Integration of the above two procedures with GSEA More specifically, after procedures A and B, each gene has a single integrated score. Unlike traditional univariate scores, these integrated scores are based on both the univariate statistic and the gene combination measures. For the type of GSEA variations that depend on phenotype permutation test, lists of integrative scores are computed, one for the real class labels and the other for the permutations. For the type of GSEA variations that are based on geneset permutation test, only the list of integrated scores for the real class labels are needed. An independent Matlab implementation of the proposed framework is available for download, which allows most existing GSEA frameworks [1] to directly utilize the proposed framework to handle gene combinations with almostzero modification.

Experimental results that illustrate the effectiveness of the proposed framework. We integrated three gene combination measures and the GSEA approach presented in [28] and produced experimental results from four gene expression datasets. These results demonstrate that the integrative framework can discover gene sets that would have been missed without the consideration of gene combinations. This includes statistically significant gene sets with moderate differential gene combinations whose individual genes have very weak discriminative power. Thus, a gene combination assisted GSEA approach can improve traditional GSEA approaches by discovering additional diseaseassociated gene sets. Indeed, the integrative approach also improve traditional DGCS since most gene combinations are not statistically significant by themselves. Furthermore, we also show that the biologically relevant gene sets discovered by the integrative framework share several common biological processes and improve the consistency of the results among the three lung cancer data sets.
Overview: The rest of the paper is organized as follow. In section 2, we describe three gene combination measures used in the following discussion and experiments. In Section 3, we present the technical details of the two procedures of the general integrative framework. Experimental design and results are presented in Section 4, followed by conclusions and discussions in Section 5.
2 Differential Gene Combination Measures
In this section, we describe three DGCS measures for use in the following discussion and experiments. Let and be two phenotypic classes of samples of size and respectively. For each sample in and , we have the expression value of genes . First, we have the following two measures (denoted as and ) defined for a pair of genes as presented in [10]:
(1) 
(2) 
where and are two genes, and represents the correlation of and over the samples in set . As discussed in [10], and can detect the joint differential expression of two genes as illustrated in Figure 2(a) and Figure 2(b) respectively. and are used as two representative measures for gene pairs. Other options for gene combination measures for gene pairs have been investigated in [19, 20].
We use the subspace differential coexpression measure (denoted as ) proposed in [12] as the representative for measures for size gene combinations, where can be any integer ().
(3) 
where is a set of genes such that and . and respectively represent the fraction of samples in and over which the genes in are coexpressed. is a generalization of for detecting the differential coexpression of genes (), i.e. the genes are highly coexpressed over many samples in one class but over far fewer samples in the other. Other options for sizek combinations include the nstatistic [34], SupMaxPair [13], etc.
Signaltonoise ratio (denoted as ) is used as the representative of traditional univariate statistics as in [28, 24].
(4) 
where and are the mean expression of in class and respectively, and and are the standard deviation of the expression of in class and respectively. Many other univariate statistics can be found in [1].
In this paper, these four measures are used as representatives of each category for the illustration of the proposed integrative framework. However, the framework is general enough to handle other measures from each of these categories.
3 Methods
In Section 1, we motivated the integration of DGCS with GSEA, discussed two challenges associated with this integration, and briefly described two main procedures in the proposed framework. In this section, we present the technical details of the two procedures and their integration with GSEA.
3.1 Procedure : combinationtogene score reduction
There are two steps in procedure . In step , for each DGCS measure and each size gene combination, its score is divided into equal parts and assigned to each of the genes in the combination. In step , the scores assigned to a gene from all the sizek combinations in which the gene participates are summarized into a single score by an aggregation functions such as . Note that, for most univariate statistic and DGCS measures which can be either positive or negative (e.g. the four measures described in section 2), the maximum is taken over the absolute values of the scores, and the sign of the score with the highest absolute value is recorded for later use. Other simple statistics such as mean or median, or sophisticated ones such as weighted summation [33] can also be used. Since the focus of this paper is the overall integrative framework, we use for simplicity.
We provide a conceptual example of procedure for a gene with a certain DGCS measure . This example considers gene combinations up to size for illustration purpose. The gene is associated with scores assigned from gene combinations of size 2, 3 and 4 (denoted as , and respectively) in which participates. In step , the scores from , and are summarized by three maximum values, respectively. Please refer to the Appendix section for the illustration of this example.
Procedure serves as a general approach to summarize the scores of all the size combinations into scores for the genes. If we want to integrate GSEA with one DGCS measure and a specific size, procedure by itself can enable most existing variations of GSEA to search, with almostzero modification, for statistically significant gene sets with moderate but coordinated gene combinations of size. Such a GSEA approach can collectively consider the gene combinations affiliated with a gene set, and may provide better statistical power and better interpretability for DGCS, as will be shown in the experiments.
3.2 Procedure : Scoretopvalue conversion and summarization
The hypothesis tested when one DGCS measure, say , is integrated with GSEA (by procedure ) is that, whether a gene set includes significantly many genes with highly positive (or highly negative) combinationbased scores measured by . An extended hypothesis can be whether a gene set includes significantly many genes with highly positive (or highly negative) scores, either univariate or combinationbased scores measured by different DGCS measures. The biological motivation of this extended hypothesis is that, a gene can be associated with the phenotype either as an univariate variable or together with other genes as a combination. To test this extended hypothesis, we design a second procedure (B) that can integrate the scores of a gene from different measures.
Before describing the steps in this procedure. We first discuss in detail the challenges of integrating heterogeneous scores from different DGCS measures and combinations of different sizes.

The different nature of different measures: Different measures are designed to capture different aspects of the discriminative power of a gene or a gene combination between the two phenotypic classes. Signaltonoise ratio (), a univariate genelevel statistic, measures the difference between the means of the expression of a gene in the two classes. In contrast, , a differential coexpression measure for a pair of genes describes the difference of the correlations of a genepair in the two classes. Thus, for a gene, the score of itself measured by and the score assigned and summarized from the size2 gene combinations measured by are not directly comparable. Similarly, the scores of different DGCS measures can also have a different nature, e.g. and as illustrated in figure 2.

The different scales of different measures: Different measures also have different ranges of values. For example, the range of , , , and are , , and respectively. Thus, they are not directly comparable.

Differences in significance between different measures: Even after we normalize the scores of different measures to a single range, say , they are still not comparable because the scores of different measures have different statistical significance. For example, a normalized score of may be less significant than a normalized score of 0.5, if there are many genes with normalized score greater than in the permutation test [28], but very few genes with normalized score greater than in the permutation test. Note that, such differences in statistical significance also exists between gene combinations of different sizes, even for the same measure. Take the subspace differential coexpression measure as an example. A score of for a size2 combination may not be as significant as a score of for a size3 combination as discussed in [12].
To handle the above heterogeneity, we propose a scoretopvalue transformation and summarization procedure that can enable the comparison and integration of the scores of different measures and combinations of different sizes. There are three major steps in procedure .
3.2.1 Step : Scoretopvalue transformation
Consider a concrete example. For a gene and a measure , procedure computes a single summarized score. In this step, the original phenotype class labels are permutated say times, and for each permutation, the same procedure is applied, and a corresponding score for and is computed. We denote the score of and summarized with the original label as , where is the gene index, and indicates the measure and means it is the score based on the original label. Similarly, we denote the scores computed in each of the permutation as , where .
These scores are organized in the table on the left in figure 3. The scores computed in the permutations can be considered as the nulldistribution for gene and measure , and a pvalue can be estimated for . Specifically, if is positive, the pvalue is the ratio of the number of scores that are greater or equal to and the number of scores that are positive. Similarly if is negative, the pvalue is computed as the ratio of the number of scores which are less or equal to and the number of scores which are negative.^{3}^{3}3Treating positive and negative scores separately follows the practice of GSEA [30] Note that, such a scoretopvalue transformation is done for both and each of (), if the GSEA approach to be integrated is based on phenotype permutation test [30]. Otherwise, only needs to be transformed to pvalue and will be used by the GSEA approaches that are based on geneset permutation [30]. In this paper, we illustrate the proposed framework using the GSEA approach presented by Subramanian and Tamayo et al. [28] which is based on phenotype permutation test.
Essentially, step transforms the heterogeneous scores of a gene measured by different measures into their corresponding significance values, which are comparable to each other although their original values are not.
3.2.2 Step : Pvalue Summarization
Suppose that there are different measures to be integrated, one of which is a univariate statistic, and the others are different DGCS measures for which we consider combinations of sizes up to . After step , each gene has a pvalue for the univariate measure and up to pvalues for each size of gene combination for each measure. In step , the best^{4}^{4}4”Best” means it is the lowest raw pvalue or the highest transformed pvalue pvalue associated with a gene is selected as the integrated significance.
Essentially, procedure integrates the scores of different DGCS measures for a gene and the univariate statistic of the gene into a single pvalue. Such a statistical significancebased integration of heterogeneous scores enables the comparison and thus the ranking of all the genes. However, this ranked list does not maintain the original directionality of the integrated scores of each gene. In particular, most univariate statistics and DGCS measures (e.g. all the four measures described in section 2) can be either positive or negative. Such directionality information is lost in step and because the pvalue is nonnegative. Next, we describe a third step to maintain the directionality in the integration.
3.2.3 Step : Maintaining directionality associated with the integrated pvalues
In the simple case, the measures to be integrated capture the same type of differentiation between the two phenotype classes, e.g. and . Suppose there are two genes and , whose integrated pvalues are transformed respectively from two scores measured by and in step . The signs of these two scores are comparable to each other, because both and capture the change of coexpression of a combination of genes. Thus, we simply use the signs of these two scores as the signs associated with the integrated pvalues of and . Similarly, we associate a sign to all the integrated pvalues. And these pvalues with associated signs can be used to rank the genes based on their significance as well as their direction of differentiation, i.e. pvalues associated with positive signs are ranked with descending significance, and afterwards, pvalues associated with negative signs are ranked with increasing significance.
In the other case, if the measures to be integrated capture different types of differentiation between the two phenotype classes, the directionality can not be fully maintained. For example, suppose there are two genes and , whose integrated pvalues are transformed respectively from two scores measured by and in step . The signs of these two scores are not comparable, because captures the change of mean expression, and captures the change of coexpression of a combination of genes. Specifically, upregulation of can be associated with either high or low coexpression of another genecombination in which participates. Thus, it is not reasonable to follow the same strategy to associate signs to the integrated pvalues. If we know the correspondence of the signs of different genes in advance, e.g. the upregulation of is associated with the low coexpression of and , then the signs can be maintained. However, because it is not realistic to assume such prior knowledge, we propose the following heuristic approach which has proved a workable solution for our initial experiments. Specifically, since the focus of step is to integrate different DGCS measures in addition to the univariate statistic , we considered as the base measure. For the integrated pvalues that are transformed from scores measured by in step (say there are of them), we use the signs of these scores for the integrated pvalues. For the signs of the other genes, we assign positive signs to all of them once and negative signs to all of them a second time. Correspondingly, we have two ranked lists similar to the simple case described above.
Note that, if the directionality of differential measures can be preserved, the power of this approach will be enhanced. To deal with the situation where signs are not comparable, other approaches will be explored.
3.3 Integration with GSEA
From the above description of procedure and , we know that, if only one DGCS measure is used in the GSEA framework, only procedure is needed. If one or multiple DGCS measures are integrated together with the univariate statistic in the GSEA framework, procedure is needed in addition. In the first case, the integrative framework outputs a ranked list of scores with associated signs for the original class label, and lists corresponding to the permutation tests. In the second case, we have two sets of lists respectively for the two rounds of maintaining directionality in step in procedure .
In either case, the ranked lists along with the appropriate parameter settings and specification of gene sets can be used to run GSEA. The only modification to GSEA is the elimination of the initial GSEA step to generate the scores, simulated and actual, that measure the level of differentiation between genes across different phenotypes. The proposed integrative framework is implemented as a Matlab function (available at http://vk.cs.umn.edu/ICG/), independently from the GSEA framework to be integrated in this paper [28]. As summarized by Ackermann and Strimmer [1], hundreds of variations of GSEA are being used by different research groups. This independently implemented integrative framework can be easily applied to other variations of GSEA.
3.4 A further technical detail
In our experiments, in order to have a fair comparison, we transform the ranked lists into the exact sample distribution as the original lists corresponding to GSEA. Specifically, we only use the ranking information in the integrated ranked lists and map to them to the values in the original lists based on GSEA. Essentially, the values in the ranked list passed to the GSEA framework are exactly the same among GSEA, GSEA, GSEA, GSEA and GSEA, while the only difference is that the genes have different ranks in the lists. Such a mapping ensures that the additionally discovered gene sets are because of the integration of genecombinations in addition to univariate statistic, rather than simply the different value distributions in the lists.
4 Results
In this section, we present the experimental design and results for the evaluation of the proposed integrative framework. We first provide a brief description of the data sets and parameters used in the experiments. Second, we describe and discuss the comparative experiments to study whether the integration of DGCS and GSEA (denoted as DGCSGSEA) improves both DGCS and GSEA. The two major evaluation criteria are the statistical power to discover (additional) significant results, and the consistency of the results across different datasets for the study of the same phenotype classes.
4.1 Data sets
The four datasets used in the experiments are described as follows:

Three lung cancer datasets respectively denoted as Boston [4], Michigan [3] and Standford [14]: all the three data sets consist of geneexpression profiles in tumor samples from respectively 62, 86 and 24 patients with lung adenocarcinomas and provide clinical outcomes (classified as ”good” or ”poor” outcome). The two phenotypic classes in these three datasets are denoted as and as in [28].

A data set from the NCI60 collection of cancer cell lines for the study of p53 status [25] (denoted as data set): the mutational status of the p53 gene has been reported for 50 of the NCI60 cell lines, with 17 being classified as normal and 33 as carrying mutations in the gene. The two phenotypic classes in this dataset are denoted as and as in [28].
All four datasets were downloaded from the GSEA website^{5}^{5}5http://www.broadinstitute.org/gsea/[28], and were already preprocessed as described in the supplementary file of [28]. For all four data sets, we use the gene sets from in MSigDB as in [28], as well as the same parameters.
4.2 Differential genecombination measures
We consider one univariate statistic (), and three genecombination measures (, and ) in our experiments. These four measures are described in section 2. and are defined only for size2 combinations. For , we considered genecombinations of size and size for the illustration of concept.
4.3 Q1: Does GSEAassisted DGCS improve traditional DGCS?
Boston  Michigan  Stanford  P53  

0  2  0  0  
645  1  2  1  
10  1  0  0 
In this section, we study whether the question (Q1) of whether integration of DGCS and GSEA can improve traditional DGCS. For this comparison, we consider the integration of DGCS and GSEA as a GSEAassisted DGCS approach. We first apply the traditional DGCS approaches on the four datasets to find statistically significant genecombinations. We denote the three DGCS approaches respectively with the names of the three measures, i.e. , and . Second, we apply the integrative framework, in which GSEA is integrated respectively with the three DGCS measures, to find statistically significant gene sets with moderate but coordinated differential genecombinations. We denote the three instances of the integrative approach respectively as GSEA, GSEA and GSEA. Then, we compare the results of , and , respectively with the results of GSEA, GSEA and GSEA.
Boston  Michigan  Stanford  P53  

GSEA  4  1  4  13 
GSEA  1  7  4  0 
GSEA  0  1  3  2 
In this section, we study whether the question (Q1) of whether integration of DGCS and GSEA can improve traditional DGCS. For this comparison, we consider the integration of DGCS and GSEA as a GSEAassisted DGCS approach. We first apply the traditional DGCS approaches on the four datasets to find statistically significant genecombinations. We denote the three DGCS approaches respectively with the names of the three measures, i.e. , and . Second, we apply the integrative framework, in which GSEA is integrated respectively with the three DGCS measures, to find statistically significant gene sets with moderate but coordinated differential genecombinations. We denote the three instances of the integrative approach respectively as GSEA, GSEA and GSEA. Then, we compare the results of , and , respectively with the results of GSEA, GSEA and GSEA.
Table 1 lists the number of statistically significant gene combinations discovered respectively by the three measures on each of the four datasets, with an FDR threshold of . Table 1 lists the number of statistically significant gene sets discovered by integrating GSEA respectively with the three DGCS measures on each of the four datasets, also with the same FDR threshold of . Three major observations can be made by comparing the two tables:
4.3.1 GSEAassisted DGCS has better statistical power than traditional DGCS
Table 1 shows that, in most cases, traditional DGCS discovers very few (less than 3) statistically significant gene combinations (although and have and genecombinations on the Boston data set, none of them have FDR lower than ). In contrast, table 2 shows that the integration of GSEA with the three combination measures discover multiple significant gene sets in most of the cases. This difference implies that the discovered statistically significant gene sets include many moderate but coordinated differential gene combinations, even though the combinations are not significant by themselves as shown in table 1. This comparison demonstrates that traditional DGCS, similar to univariate gene analysis, has limited statistical power, and DGCSGSEA can increase that power.
4.3.2 GSEAassisted DGCS has better result consistency than traditional DGCS
We further compare DGCS and DGCSGSEA by studying the consistency of their results on the first three data sets that are all from lung cancer studies, as done in [28]. For DGCS, discovered genes on Michigan but nothing from Boston and Stanford; discovered combinations on Boston but only 1 and 2 from Michigan and Stanford, respectively, and there are no common ones between the 645, 1, and 2 gene combinations; discovered genes on Boston but only gene on Michigan and nothing from Stanford, and the 10 and 1 combinations do not overlap. The inconsistent results make the followup biological interpretation very difficult.
In contrast, when the three DGCS measures are integrated with GSEA, several consistent themes can be observed: (i) Apoptosis related pathways (marked by in table 2): GSEA discovered four gene sets on Boston, three of which are known to be closely related to cancer and specifically to apoptosis, i.e. nfkbpathway, STGaqPathway and TNFPathway. This apoptosis theme is shared by the gene sets discovered by GSEA from Michigan and Stanford, i.e. MonocyteADPathway, hivnefPathway, deathPathway and caspasePathway. These apoptosis related pathways are enriched with the lung cancer samples with good outcome, which makes sense biologically and also corresponds to the proliferation theme supported by the gene sets enriched with the samples with poor outcome as reported in [28]. Several other examples of the result consistency, as indicated by other superscripts in Table 2, are in the technical report. This comparison demonstrates that traditional DGCS, like univariate gene analysis, has poor result consistency across the three lung cancer data sets, and DGCSGSEA can improve its consistency by integrating DGCS measures with GSEA.
4.3.3 GSEAassisted DGCS with different DGCS measures complement each other
The number of significant gene sets discovered by the three versions of GSEA varies, i.e. GSEA and GSEA discovered a bit larger number of significant gene sets than GSEA. However, GSEA still discovered several gene sets that are not discovered by GSEA or GSEA, e.g. one gene set from the Michigan data set and three from the Stanford data set. This indicates that GSEA, GSEA and GSEA have complementary perspectives, i.e. different combination measures capture different aspects of the difference between the phenotype classes (recall the two types of combinations in Figure 2). This also demonstrates the proposed framework is general enough to integrate any type of DGCS with GSEA.
4.4 Q2: Does DGCSassisted GSEA improve GSEA?
In this Section, we want to answer the question (Q2) of whether the integration of DGCS and GSEA can improve traditional GSEA. For this comparison, we consider the integration of DGCS and GSEA as a DGCSassisted GSEA approach. We design three sets of comparisons. Firstly, we compare the traditional univariatestatistic based GSEA (denoted as GSEA) with the integrative framework where one genecombinations measure is used instead of . Specifically, we compare the gene sets discovered by GSEA with the gene sets discovered by GSEA, GSEA and GSEA. Then, we compare GSEA with the integrative framework where one genecombinations measure is used in addition to , i.e. GSEA, GSEA and GSEA. Furthermore, we also study the integration of multiple genecombinations measure in addition to , e.g. GSEA.
Figure 4 displays the statistically significant gene sets discovered with different (combinations of) measures respectively from the four datasets. An FDR threshold of is used as in [28] for comparison purpose. The results presented in [28] are exactly reproduced, i.e. the gene sets listed in the rows corresponding to GSEA. In each of these four figures, we consider the traditional univariatestatistic based GSEA (GSEA) as the baseline, and compare it with the rows corresponding to GSEA, GSEA, GSEA, GSEA, GSEA, GSEA and GSEA. From these comparisons, the following observations can be made.
4.4.1 DGCSassisted GSEA discovers additional significant gene sets
First, we compare the rows corresponding to GSEA, GSEA, GSEA with the rows corresponding to GSEA. We bolded the additional gene sets that are only discovered by GSEA, GSEA, GSEA. For example, with GSEA, no statistically significant gene sets have been enriched with class A in the Boston data set. In contrast, GSEA discovered gene sets, three out of which (discussed in ) are related to apoptosis which is consistent with the results on Michigan and Stanford. On the Michigan dataset, GSEA discovered a gene set betaAlaninemetabolism that is not discovered by GSEA. This gene set is related to the responses of hypoxia, which is consistent with the results on Boston and Stanford. It is worth noting that, although most studies did not report statistically significant gene sets on the Stanford dataset due to the very small sample size, GSEA, GSEA, GSEA respectively discovered 4, 4 and 3 significant gene sets. These additional gene sets were discovered because the three DGCS measures capture different types of the differentiation between the two phenotype classes, compared to the traditional univariate differential expressionbased GSEA.
Second, we compare the rows corresponding to GSEA, GSEA, GSEA with the rows corresponding to GSEA. We bolded the additional gene sets that are only discovered by the integrative approach. For example, on the Boston data set, based GSEA discovered 8 gene sets. In addition, GSEA discovered the proteasomePathway gene set, and GSEA discovered the p53signaling gene set. Both ubiquitinproteasome pathway and p53signaling pathway are wellknown cancerrelated pathways that are also specifically related to lung cancer [21, 6]. (Additional examples are in the technical report.) The gene sets that are discovered by DGCSassisted GSEA but not by GSEA illustrate the benefits of using DGCS to assist GSEA.
Next, we also observed that integrating multiple DGCS measures can further discover statistically significant gene sets. For illustration purpose, we compare the rows corresponding to GSEA, GSEA, GSEA with the rows corresponding to GSEA. GSEA discovers the g2Pathway gene set and the gsk3Pathway gene set, respectively from the Boston and the Michigan dataset. Neither of these two pathways are discovered by GSEA, GSEA, GSEA and GSEA. The curated gene set g2Pathway contains the genes related to the G2/M transition, which is shown to be regulated by p53 [29], a wellknown cancerrelated gene. The curated gene set gsk3Pathway is the signaling pathway of GSK3, which has been shown to be related to different types of cancer[5, 22]. These two cancerrelated pathways are discovered by GSEA but not by GSEA, GSEA, GSEA and GSEA. This indicates that different members of these two pathways are differential between the two phenotype groups in different manners, i.e. the differentiation of some genes is captured by , some by , some by and some by . These two pathways can be discovered to be statistically significant only when these measures are used together in the integrative framework. This demonstrates the benefits of the proposed framework for integrating multiple DGCS measures with a univariate measure.
It is worth noting that, the gene sets discovered by the integrative framework with multiple measures are not necessarily a superset of those discovered by integrating each individual measure with GSEA since, when different DGCS measures are integrated with GSEA, the nullhypotheses tested in the GSEA framework are correspondingly different. The highlight of the integrative framework is that, additional gene sets can be discovered when different DGCS measures are used to assist the traditional univariate statisticbased GSEA. In practice, these different versions of GSEA should be used collectively.
4.4.2 DGCSassisted GSEA discovers gene sets with lower FDRs
:
Even when a gene set is discovered both before and after a DGCS measure is integrated into the framework, we can observe several interesting cases where the FDR of a gene set becomes much lower after the integration. We bolded the FDRs that significantly decreased when they are discovered by the integrative approach. For example, GSEA, in which , , and are integrated together, discovers p53hypoxialPathway with an much lower FDR of , twoorder lower than GSEA. This example indicates that several members of p53hypoxialPathway have weak individual differentiation measured by , but have more significant differentiation when they are measured by . This and other similar examples demonstrates the benefits of the proposed framework for integrating multiple DGCS measures.
4.4.3 DGCSassisted GSEA further improve the consistency across the three lung cancer data sets
As presented in [28], GSEA discovered and gene sets respectively from the Boston and Michigan data sets, and 5 of the 8 in Boston and 6 of the 11 in Michigan are common. The three unmatched gene sets that are discovered in Boston but not in Michigan are GLUTDOWN, LEUDOWN and CellCycleCheckpoint. Interestingly, the latter two are discovered from both the Boston and the Michigan data sets by GSEA^{6}^{6}6The CellCyclePathway discovered on Michigan and the cellcyclecheckpoint discovered on Boston are both cellcycle related gene sets. Such observations suggest that DGCSassisted GSEA also provides new insights to the consistency between different data sets.
4.4.4 Additional issues of multiple hypothesis testing
Because different combinations of measures are used in the integrative framework, additional issues of multiple hypothesis testing arise, even though multiple hypothesis testing has been addressed for each measure via the phenotype permutation test procedure in the GSEA framework proposed in [28]. To investigate this, we designed experiments with 4 of the possibilities of integrations, i.e. GSEA, GSEA, GSEA and GSEA. Even using a collective (metalevel) multiple hypothesis correction, many discovered gene sets would still be significant. For examples, GSEA discovers p53hypoxialPathway from the Boston data set with a low FDR of , and GSEA discovers deathPathway from the Michigan data set with a lower FDR of . We also did additional permutation tests, in which we generate random gene sets with the same sizes as the sets in MSigDB , and do the same set of experiments as shown in Figure 4. The FDR values of the random gene sets computed in the integrative framework are mostly insignificant (higher than ).
5 Discussion
In this paper we motivated the integration of differential genecombination search and gene set enrichment analysis for bidirectional benefits on both them. We proposed a general integrative framework that can handle genecombinations of different sizes () and different genecombination measures in addition to an univariate statistic used in traditional GSEA. The experimental results demonstrated that, on one hand, GSEAassisted DGCS has better statistical power and result consistency than traditional DGCS. On the other hand, DGCSassisted GSEA can discover additional statistically significant gene sets that are ignored by traditional GSEA and further improve the result consistency of the traditional GSEA.
The proposed framework can be extended in several ways. Different variations of GSEA will be considered. Along these lines, we note that the proposed integrative framework is general enough to integrate most existing variations of GSEA approaches summarized in [1] with minimal amount of modification. Also, it should be possible to integrate DGCS and genesubnetwork discovery. Both GSEA and genesubnetwork discovery [17, 8] can discover collections of genes, either known gene sets [28] or subnetworks in a molecular network (e.g. protein interaction network), that show moderate but coordinated differentiation. In this paper, we integrate DGCS and GSEA as an illustration of the general framework for integrating scores from different genecombination measures and genecombinations of different sizes, in addition to the traditional univariate statistic, but the same framework also applies to the integration of DGCS and gene subnetwork discovery. Another direction is the use of this framework for the analysis of (GWAS) SNP data, by following the methodology proposed in recently work on pathway/network based analysis of GWAS datasets [32, 2]. Finally, it may be possible to use constraints on genecombinations to improve our framework. In procedure , for each genecombination measure and an integer , the score of a gene is assigned from all the possible genecombinations. A further extension of procedure is to only consider the gene combinations, in which the genes appear in a common gene set, e.g. a pathway. Such genesetbased constraints may better control false positive gene combinations and improve the statistical power of the whole integrative framework.
6 Appendix
6.1 Illustration of procedure
The illustration of procedure is in figure 5.
6.2 Complete Geneset Table
Due to the space limit, Table 2 and Figure 4 are both summarized from the four complete tables that are available at http://vk.cs.umn.edu/ICG/. Specifically, Table 2 is a highlevel summary of the number of gene sets discovered and the biological processes associated with each of the gene sets. In figure 4, we listed the complete results for GSEA (the baseline), while for the other rows, we only list a gene set if it is only discovered by the integrative approach (with bolded name), or it has a nontrivially decreased FDR when it is discovered by the integrative approach (with bolded FDR).
References
 [1] M. Ackermann and K. Strimmer. A general modular framework for gene set enrichment analysis. BMC bioinformatics, 10(1):47, 2009.
 [2] S. Baranzini, N. Galwey, J. Wang, P. Khankhanian, R. Lindberg, D. Pelletier, et al. Pathway and networkbased analysis of genomewide association studies in multiple sclerosis. Human Molecular Genetics, 18(11):2078, 2009.
 [3] D. Beer, S. Kardia, C. Huang, T. Giordano, A. Levin, D. Misek, L. Lin, G. Chen, T. Gharib, D. Thomas, et al. Geneexpression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 8(8):816–824, 2002.
 [4] A. Bhattacharjee, W. Richards, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS, 98(24):13790, 2001.
 [5] Q. Cao, X. Lu, and Y. Feng. Glycogen synthase kinase3 positively regulates the proliferation of human ovarian cancer cells. Cell Research, 16(7):671–677, 2006.
 [6] E. Chenette. Signalling: A Ras and NFB pas de deux. Nature Reviews Cancer, 9(12):847, 2009.
 [7] Y. Choi and C. Kendziorski. Statistical methods for gene set coexpression analysis. Bioinformatics, 25(21):2780, 2009.
 [8] H.Y. Chuang, E. Lee, Y.T. Liu, D. Lee, and T. Ideker. Networkbased classification of breast cancer metastasis. Mol. Sys. Bio., 3:140, 2007.
 [9] A. Dempster. A high dimensional two sample significance test. The Annals of Mathematical Statistics, pages 995–1010, 1958.
 [10] M. Dettling, E. Gabrielson, and G. Parmigiani. Searching for differentially expressed gene combinations. Genome Biology, 6(10):R88, 2005.
 [11] S. Dudoit, J. Shaffer, and J. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71–103, 2003.
 [12] G. Fang, R. Kuang, G. Pandey, M. Steinbach, C. L. Myers, and V. Kumar. Subspace Differential Coexpression Analysis: Problem Definition and A General Approach. In Pacific Symposium on Biocomputing, pages 145–156, 2010.
 [13] G. Fang, G. Pandey, M. Gupta, M. Steinbach, and V. Kumar. Mining lowsupport discriminative patterns from dense and highdimensional data. Technical Report 09011, Department of Computer Science, University of Minnesota, 2009.
 [14] M. Garber, O. Troyanskaya, K. Schluens, S. Petersen, et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS, 98(24):13784, 2001.
 [15] G. Glazko and F. EmmertStreib. Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics, 25(18):2348, 2009.
 [16] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531, 1999.
 [17] T. Ideker, O. Ozier, et al. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18:233–240, 2002.
 [18] D. Kostka and R. Spang. Finding disease specific alterations in the coexpression of genes. Bioinformatics, 20(194–199), 2004.
 [19] Y. Lai, B. Wu, et al. A statistical method for identifying differential genegene coexpression patterns. Bioinformatics, 20(17):3146–3155, 2004.
 [20] K. Li. Genomewide coexpression dynamics: theory and application. PNAS, 99(26):16875–16880, 2002.
 [21] T. Li, L. Ho, B. Piperdi, et al. Phase II study of the proteasome inhibitor bortezomib (PS341, Velcade®) in chemotherapynaive patients with advanced stage nonsmall cell lung cancer. Lung Cancer, 2009.
 [22] X. Liao, J. Thrasher, J. Holzbeierlein, S. Stanley, and B. Li. Glycogen Synthase Kinase3 Activity Is Required for AndrogenStimulated Gene Expression in Prostate Cancer. Endocrinology, 145(6):2941, 2004.
 [23] Y. Lu, P. Liu, P. Xiao, and H. Deng. Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics, 21(14):3105, 2005.
 [24] V. Mootha, C. Lindgren, K. Eriksson, A. Subramanian, et al. PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3):267–273, 2003.
 [25] M. Olivier et al. The IARC TP53 database: new online mutation analysis and recommendations to users. Human Mutation, 19(6):607–614, 2002.
 [26] W. Pan. A comparative review of statistical methods for discovering differentially expressed genes in microarray experiments. Bioinformatics, 18(4):546, 2002.
 [27] M. Schena, D. Shalon, R. Davis, P. Brown, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, 1995.
 [28] A. Subramanian, P. Tamayo, V. Mootha, S. Mukherjee, et al. Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles. PNAS, 102(43):15545–15550, 2005.
 [29] W. Taylor and G. Stark. Regulation of the G2/M transition by p53. Oncogene, 20(15):1803–1815, 2001.
 [30] L. Tian, S. Greenberg, S. Kong, J. Altschuler, I. Kohane, and P. Park. Discovering statistically significant pathways in expression profiling studies. PNAS, 102(38):13544, 2005.
 [31] V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. PNAS, 98:5116–5121, 2001.
 [32] K. Wang, M. Li, and M. Bucan. Pathwaybased approaches for analysis of genomewide association studies. AJHG, 81(6):1278–1283, 2007.
 [33] K. Wang, M. Narayanan, H. Zhong, M. Tompa, E. E. Schadt, and J. Zhu. Metaanalysis of interspecies liver coexpression networks elucidates traits associated with common human diseases. PLoS Comp. Bio., 5(12):e1000616, 2009.
 [34] Y. Xiao, R. Frisina, A. Gordon, L. Klebanov, and A. Yakovlev. Multivariate search for differentially expressed gene combinations. BMC Bioinfo., 5(1):164, 2004.
 [35] J. Zhang, J. Li, H. Deng, and I. Jordan. Identifying Gene Interaction Enrichment for Gene Expression Data. PLoS ONE, 4(11):e8064, 2009.