Tellipsoid: Exploiting inter-gene correlation for improved detection of differential gene expression

Tellipsoid: Exploiting inter-gene correlation for improved detection of differential gene expression

Keyur Desai 1, J.R. Deller, Jr.  and J. Justin McCormick 
Dept. of Electrical and Computer Engineering
Dept. of Biochemistry and Molecular Biology
Michigan State University, E. Lansing, MI 48824, USA
11to whom correspondence should be addressed; contact: desaikey@egr.msu.edu
February 11, 2008
Abstract

Motivation:

Algorithms for differential analysis of microarray data are vital to modern biomedical research. Their accuracy strongly depends on effective treatment of inter-gene correlation. Correlation is ordinarily accounted for in terms of its effect on significance cut-offs. In this paper it is shown that correlation can, in fact, be exploited to share information across tests, which, in turn, can increase statistical power.

Results:

Vastly and demonstrably improved differential analysis approaches are the result of combining identifiability (the fact that in most microarray data sets, a large proportion of genes can be identified a priori as non-differential) with optimization criteria that incorporate correlation. As a special case, we develop a method which builds upon the widely used two-sample -statistic based approach and uses the Mahalanobis distance as an optimality criterion. Results on the prostate cancer data of Singh et al. (2002) suggest that the proposed method outperforms all published approaches in terms of statistical power.

Availability:

The proposed algorithm is implemented in MATLAB and in R. The software, called Tellipsoid, and relevant data sets are available at www.egr.msu.edu/~desaikey.

1 Introduction

Detecting differentially expressed genes in the presence of substantial inter-gene correlation is a challenging problem. Research has focused largely on understanding the harmful effects of correlation on the threshold settings demarcating null and non-null genes. In fact, however, the nominally confounding correlation can be exploited to increase statistical power of microarray studies. This observation has engendered a fruitful research direction introduced in this paper.

The literature is not devoid of attempts to develop more powerful summary statistics which exploit correlation, but such efforts have not produced compelling results. We posit that the limitations of such developments are due, at least in part, to neglecting identifiability – the fact that in most microarray data sets, a large proportion of genes requires no technical effort to be identified as explicitly non-differential.

In this paper, we present a new differential analysis method which exploits identifiability through an optimization criterion which incorporates inter-gene correlation. The method builds upon the widely-used two-sample -statistic approach with a minimization of the Mahalanobis distance as the optimality criterion. Although this paper focuses primarily on two-sample microarray studies, the framework is readily generalizable for use in studies with multiple or continuous covariates, as well as to other multiple comparison applications.

Let us assume that genes are measured on microarrays, under two different experimental conditions, such as control and treatment. Based on the gene expression matrix , we are interested in identifying a small number of genes that are responsible for class distinction. One widely used strategy [e.g., Tusher et al. (2001) and Efron (2004)] is to represent each gene by its null hypothesis and two-sample -statistic,222null = non-differential; non-null = differential say

Null hypotheses:
Test statistics:

The magnitudes of the -statistics establish a gene-ranking and the
(presumably) genes with the largest -scores are reported as statistically significant discoveries. The investigator can either explicitly supply a or rely on the false discovery rate (FDR) calculations to find a maximal with the allowable FDR.

The issue of correctly estimating the FDR in the presence of correlation has received much recent attention because highly correlated tests increase the variance of the FDR leading to unreliable results (Owen, 2005). As discussed in Efron (2007) and Desai et al. (2007), for “over powered” ’s, there may be significantly fewer tail-area null counts than expected, while for “under powered” ’s, the situation can worsen with many more tail-area null counts than expected. Importantly though, techniques for correctly estimating the FDR do not change the gene-ranking, but only the size of the reported list.

The present research was motivated by the notion that, for “under powered” ’s, it might be possible to exploit correlation across ’s to establish a gene ranking that has better statistical power than the raw based ranking. The method that resulted from an exploration of this question indeed seems to improve the statistical power of all ’s. The proposed method uses, (i) a vector of -statistics () and (ii) an estimate of the covariance matrix of the vector , to output a substantially revised version of , denoted , whose corresponding entries can be used to establish an improved gene-ranking.

The method is summarized as follows. Let333No distributional information nor higher order statistics of are assumed. . Now based on the measured value of , we can estimate , but while doing so we invoke the zero assumption (ZA) (Efron, 2006) that the smallest (%) of ’s are associated with null genes. Based on the ZA, we can set corresponding entries of to zero. For the remaining entries of we obtain minimum Mahalanobis distance (Mahalanobis, 1936) estimates.

Inter-gene correlation causes to distribute around its center of mass in an hyperellipsoidal manner, and the Mahalanobis distance is a natural way to measure vector distances in such a distribution. In fact, the name Tellipsoid is adopted to emphasize the geometric intuition of tracking the center of an ellipsoid. We have done extensive experimentation with both real and simulated data and found that for a truly null which happens to be in tail-area, the corresponding consistently tends to zero (its theoretical value).

Two prior research efforts were particularly useful in formulating the present approach. Storey et al. (2007), present a more general approach to the notion of sharing information across ’s, which they describe as “borrowing strength across the tests,” for a potential increase in statistical power. Tibshirani and Wasserman (2006), discuss a new statistic called the “correlation-shared” -statistic, and derive theoretical bounds on its performance; however, their approach requires strong assumptions regarding the nature of correlation between null and non-null genes which may not hold in many real data sets. The proposed method requires no such assumptions and yet provides considerable increase in power as demonstrated by results reported in Section 4.

The outline of this paper is as follows. Section 2 defines, and obtains closed form expressions for, the minimum Mahalanobis distance estimates in the -statistic space. Section 3 builds on the theory of Section 2 to develop the Tellipsoid gene-ranking method. In Section 4 we apply Tellipsoid to real cancer data of Singh et al. (2002) and compare its accuracy with the state-of-the-art differential analysis tools EDGE (Storey et al., 2007) and SAM (Tusher et al., 2001). Section 4 also discusses the implications of these results. We conclude with Section 5 summarizing the main ideas.

2 Approach

2.1 Per gene summary statistic

Let be the matrix of gene expressions, for genes and samples. We assume that the samples fall into two groups and and there are samples in group with . We start with the standard (unpaired) -statistic:

(1)

where is the mean of gene in group and is the pooled within-group standard deviation of gene . If the gene is indeed null, then we expect , where is the number of degrees of freedom, obtained from either the unpaired -test theory or the permutation null calculations as in Efron (2007). If gene is non-null, then we expect . For non-null genes, the values of and depend on the amount of up / down regulation, the number of samples in each group, and .

2.2 The zero assumption

Without loss of generality, we may assume that the genes are indexed so that

(2)

Then, a reasonable way to impose identifiability on null genes is through the ZA, namely, that (%) of the genes – those with the smallest statistics – are null. Efron (2006) discusses the use of the ZA in a variety of differential analysis approaches. It plays a central role in the literature on estimating the proportion of null genes, as in Pawitan et al. (2005) and Langaas et al. (2005). The ZA is equally crucial for the two-group model approach developed in the Bayesian microarray literature, as in Lee et al. (2000), Newton et al. (2001), and Efron et al. (2001). Furthermore, the claim that the method in Storey (2002) improves upon the well-known Benjamini and Hochberg (1995) FDR procedure (in terms of statistical power) also crucially relies on an adaptive version of the ZA. The use of the ZA is justified in the present situation as long as is sufficiently small so that the bottom (%) genes would almost certainly be declared null for reasonable FDR’s.

Formally, the ZA is stated as follows: Let be the largest integer (gene index) such that , denoted

(3)

then genes with indices are assumed null. Let us partition the set of statistics into those corresponding to genes declared null under the ZA, , and those for the remaining genes which continue to compete for the non-null designation, . For convenience, we introduce the following vector notation,

Then the random vector is distributed in the following way:

(4)

where is the underlying mean vector and the covariance matrix. The corresponding partitions of are denoted

The central hypothesis of this paper is that there is a vector, say , whose elements represent a reordering of the elements of the , such that gene-ranking represented by has better statistical power for detecting non-null genes than that based on itself.

(In the present paper, we focus on mainly the second-moment distributional characteristics of . However, in fact, if the gene expressions are normally distributed, then, perhaps, is described more accurately by the multivariate Student distribution. Exploitation of this additional structure will be considered in future work.)

2.3 Estimating

We are interested in obtaining an estimate of the vector mean based on the observation . This requires an appropriate metric in the space of vectors, with which to quantify the distance of the observed from the center of mass , say . The norm induces a useful metric between and provided that we first decorrelate the vector elements as , thus yielding

(5)

This weighted Euclidean distance is sometimes called the Mahalanobis distance in the pattern classification literature as in Deller Jr et al. (1993), Devijver and Kittler (1982), etc.

We can relate to through the Mahalanobis distance but while doing so we invoke the ZA, which, in turn, implies that the first entries of are zero. This yields the estimate

In effect, combines the identifiability information based on the ZA with the information about the covariance structure of which too can be obtained from the measured itself. Notably the optimization in Eqn. (2.3) enjoys closed form solution:

(7)

The derivation leading from Eqn. (2.3) to Eqn. (7) is provided in the Appendix.

2.4 Estimating

To estimate the required entries of , we make several observations. The validity of these observations can be established through computer simulations using the MATLAB script testtcov.m available with the Tellipsoid software. (Equation (7) does not require the covariance between two non-null ’s.)

Observation 1. If genes and both are null, then

(8)

Efron (2007) and Owen (2005) use this observation for their conditional FDR calculations. This observation maybe intuitive to the reader from Eqn. (1) itself, or it is easily verified through a computer simulation.

Observation 2. Similarly, if the gene is null and non-null (or conversely), then

(9)

where denotes the correlation between gene and within group . Equation (9) accommodates the possibility that the correlation between a null and a non-null gene may change across groups. If this does not occur, then Eqn. (9) reduces to Eqn. (8).

Observation 3. Furthermore, if (which is the case for most microarray studies), then Eqn. (9) simplifies to:

(10)

Equations (8) and (10) suggest that we may use sample correlation to estimate :

(11)

where denotes the expression level of the gene measured on the microarray after subtracting the average response within each treatment group. The scale factors cancel in the terms and , so that estimating /(-2) [see Eqn. (8)] is not required.

2.5 Tellipsoid equation

In light of Eqn. (11), Eqn. (7) takes the practical form

(12)

where is the sample correlation matrix of the gene expression matrix (after removing the treatment effects). In most cases computing the full matrix inverse () is not necessary and solving the term through some efficient linear solver reduces the computation considerably. (See Subsection 3.3 for detail).

3 Algorithm

3.1 Tellipsoid

This section outlines a self-contained differential analysis algorithm based on the ideas discussed in Section 2. Its name Tellipsoid was coined to emphasize the geometric intuition of tracking the center of an ellipsoid.

Tellipsoid takes gene expression matrix and provides a specified number, , of the most differentially expressed genes. In principle, the ranking is based on the set from Eqn. (2.3). In practice, we rely on the estimates from Eqn. (12).

The algorithm begins by reindexing the genes based on their two-sample -statistics [Eqn. (2)]. Then, based on the ZA, the first genes are identified as null, as specified in Eqn. (3). By default, is set to 50(%). Although the choice 50% is somewhat arbitrary, this fraction has worked well empirically in the data sets tested. Future research may yield more rigorous methods for choosing .

In order to nullify any genuine treatment differences, is converted to by subtracting each gene s average response within each treatment group. The sample correlation matrix of is computed subsequently. The crucial step is to compute based on Eqn. (12). The elements of determine the gene-ranking: A gene with bigger is assigned a higher rank. The first genes are reported as top statistical discoveries.

3.2 Numerical stability

Because the number of samples is often less than the number of genes , the sample correlation matrix turns out to be singular and hence non-invertible. Therefore, we add a very small correction term (= ) to its diagonal entries to make it nonsingular, and in effect, invertible. After this correction, Tellipsoid shows excellent numerical stability.

3.3 Computational complexity

Equation (12) involves matrix inversion, which, if performed in a naive way, could be a prohibitive operation, since microarray data sets may have several tens of thousand genes. Indeed, solving the term as a system of simultaneous linear equations () is much faster than explicitly computing . In particular, we can employ the Cholesky decomposition to exploit the fact that the matrix is symmetric and positive definite. MATLAB implementation of Tellipsoid uses the in-built linslove with appropriate settings, which, in turn, uses highly optimized routines of LAPACK (Linear Algebra PACKage—http://www.netlib.org/lapack/).

For the Prostate data (used in Section 4) with genes and samples, Tellipsoid, running on a computer with a 2.2 GHz dual-core AMD Opteron processor with 8 GB of RAM and MATLAB version R2006b, requires just under 40 seconds to report the final gene list. For the same settings, the implementation with explicit matrix inversions takes minutes.

3.4 Summary of Tellipsoid


Tellipsoid: An improved differential analysis algorithm
Input: Labeled gene expression matrix; Size of gene list
Output: The gene list containing top differentially expressed genes

  1. Calculate two-sample (unpaired) -statistics:

  2. Reindex genes such that

  3. Gather first = ’s in a vector ; By default =50(%)

  4. Convert to by subtracting each gene s average response within each treatment group

  5. Compute = the sample correlation matrix of

  6. Find = , where =

  7. Determine gene-ranking: Assign a gene with bigger a higher rank

  8. Report top genes as statistical discoveries

4 Results and Discussion

We compare Tellipsoid with two of the leading techniques, SAM [Significance Analysis of Microarrays (Tusher et al., 2001)] and EDGE [Extraction and analysis of Differential Gene Expression (Leek et al., 2006)]. SAM adds a small exchangeability factor to the pooled sample variance when computing the two-sample -statistic: ; whereas EDGE is based on a general framework for sharing information across tests (see Storey et al. (2007)). EDGE is reported to show substantial improvement (in terms of statistical power) over five of the leading techniques including SAM (Storey et al., 2007). The other four are: (i) /–test of Kerr et al. (2000) and Dudoit et al. (2002); (ii) Shrunken /–test of Cui et al. (2005); (iii) The empirical Bayes local FDR of Efron et al. (2001); (iv) The a posteriori probability approach of Lonnstedt and Speed (2002). It is noteworthy that Tellipsoid shows a major improvement over EDGE itself.

To determine the performance quality of various techniques, we focus primarily on the empirical FDR in the reported gene list: Empirical FDR = , where NoFP = the number of false positives. Broadly speaking, smaller the FDR better the technique.

4.1 Prostate data

(Singh et al., 2002) studied genes on oligonucleotide microarrays, comparing healthy males with prostate cancer patients. The purpose of their study was to identify genes that might anticipate the clinical behavior of Prostate cancer. We downloaded the .CEL files from http://www-genome.wi.mit.edu/MPR/prostate. The software RMAExpress (Irizarry et al., 2003) was used to obtain high quality gene expressions from these .CEL files. We let RMAExpress apply its in-built background adjustment, however, the quantile normalization was skipped. Each gene was represented in the final expression matrix by the logarithm (base 10) of its expression level. Taking the is thought to increase normality and stabilize across group standard deviations (Tsai et al., 2003).

4.1.1 Data with known results

Algorithm testing required an expression matrix with the knowledge of truly non-differential genes. At the same time, we wanted the inter-gene correlation in to resemble that in the real microarray data. These two seemingly conflicting requirements were satisfied concurrently by row standardizing a real . The prostate cancer matrix was transformed to by subtracting each gene’s average response within each treatment group, and by normalizing within group sample mean squares. That is, for each group , and . Here, the sum runs over corresponding samples only. With this transformation, all genes have equal energy and yet the same within group inter-gene correlation structure as the original .444Note. Normalizing within group sample mean squares to unity is not implemented in the Tellipsoid algorithm.

To generate a test data set from , its 102 columns were randomly divided into groups of 50 (=) and 52 (=). Next () genes were randomly chosen for up (down) regulation by adding a positive (negative) offset () to the corresponding entries in group 2. Various choices of were tested to represent a range of differential analysis scenarios encountered in practice.

4.1.2 Results for prostate data

Two cases were studied. In the first, the proportion of truly differential genes, say , was taken to be relatively small: 0.01–0.05. The second case employed a larger 0.1. The former simulates microarray studies seeking genes that distinguish subtypes of cancer, diabetes, etc., whereas the latter resembles studies comparing healthy versus diseased cell activity.

Results were obtained using the subroutines samr.r from the package “samr” and statex.r from the package “edge.” Both routines compute their native gene summary statistics given the matrix and corresponding column labels. These statistics, in turn, can be used to determine top genes.

Case 1 [, =200, =100, =0.1, and =-0.1]. Figure 1 shows plots of the FDRs for 40 different data sets with the size of the reported list, =. A large value of coincides with an attempt to extract as many differential expressions as possible, a desired goal especially in microarray studies performed to identify genes that are to be explored further – experimentally or computationally – to gain better understanding of underlying gene networks. Since the differential signal =0.1 and =-0.1 is rather weak, recovering a good list is not easy as evident from the results – among all methods only Tellipsoid achieved sufficiently low FDRs to rescue a few ’s.

Figure 1: FDRs for Case 1. The number of truly differential gene is 300. Panel (a) =300; Panel (b) =100. Tellipsoid offers better detection. Square marker = Tellipsoid. Lines: solid = raw -statistic; dotted = SAM; dashed = EDGE.
Figure 2: FDRs for Case 2a. The number of truly differential gene is 1200. Panel (a) =1200; Panel (b) =300. Tellipsoid performs much better. Square marker = Tellipsoid. Lines: solid = raw -statistic; dotted = SAM; dashed = EDGE.

Figure 1 presents results for =100. A smaller would be chosen to identify high-quality class distinguishing features for gene-expression-profiling-based clinical diagnosis and prognosis, where the goal is to build accurate classifiers and predictors. Whereas Singh et al. (2002) build a classifier around only 16 of 12625 features, they do discuss the need to include as many reliable features as possible. Remarkably, for 37 out of 40 matrices, Tellipsoid reports gene lists with no false discoveries at all, while the other techniques fail to obtain a single gene list with an FDR0.5.

Case 2a [, =600, =600, =0.02, and =-0.02]. In this set of experiments, is increased, but the differential signal is reduced. This situation also proves to be challenging for the existing techniques. However, Tellipsoid provides the FDR of for =1200, and, again for =300, while it reports most gene lists with no false discoveries at all.

Case 2b [0.1, =600, =600, =0.1, and =-0.1]. This subcase is designed to assess the effects of small sample sizes on performance. and are both reduced to 20. We randomly chose 20 columns per group from the original prostate cancer , and then applied the data generation process (including row standardization) detailed in Subsection 4.1.1. Reduction in the number of samples is compensated by increase in the differential signal. The FDRs for Tellipsoid, Fig. 3, are excellent suggesting that Tellipsoid increases power of small sample data sets too.

Figure 3: FDRs for Case 2b. Panel (a) =1200; Panel (b) =300. The sample size is smaller than that in Cases 1 2a and yet Tellipsoid performs well. Square marker = Tellipsoid. Lines: solid = raw -statistic; dotted = SAM; dashed = EDGE.

4.2 Simulated data

Before devising the test data setup of Subsection 4.1.1, Tellipsoid was tested on several simulated data sets. Below we discuss some simulation results that shed further light on the small sample behavior.

Let us denote by the column of a simulated expression matrix . We assume that the random vector is multivariate Gaussian with mean and covariance matrix . Each such column represents =3226 genes with a covariance matrix that introduces roughly the same amount of correlation as found in the BRCA data of Hedenfalk et al. (2001). We choose , and . Figure 4 shows plots of the FDRs for =50 and =100. Table 1 shows results for some realization from Fig. 4. Shown are the top 100 values of and each corresponding original with concomitant rank. With smaller , preeminence of Tellipsoid with respect to existing techniques scales down a bit. Nevertheless, for =50 case, for 25 out of 40 simulated realizations, Tellipsoid achieves a low FDR of 0.1 or less.

Figure 4: FDRs for simulated data. Panel (a) =100; Panel (b) =50. (Small) sample sizes: =10, =10. Yet, Tellipsoid performs better than the rest. Square marker = Tellipsoid. Lines: solid = raw -statistic; dotted = SAM; dashed = EDGE.
rank rank rank rank
1 4.22 5.87 1 51 2.18 3.45 23
2 -4.17 -5.55 2 52 2.17 3.05 42
3 -3.93 -4.26 5 53 2.16 2.80 82
4 -3.74 -4.12 7 54 -2.15 -2.57 122
5 -3.58 -4.49 4 55 2.15 1.96 357
6 -3.49 -3.34 28 56 -2.14 -1.47 751
7 -3.45 -4.25 6 57 2.13 2.25 229
8 3.35 3.87 10 58 2.13 1.77 486
9 -3.33 -3.20 35 59 2.13 1.44 785
10 3.33 3.77 13 60 2.12 2.14 273
11 3.25 3.42 25 61 -2.11 -1.48 744
12 -3.16 -2.18 260 62 2.10 1.80 453
13 3.14 4.54 3 63 2.09 2.60 114
14 3.10 2.87 65 64 -2.09 -2.05 312
15 -3.08 -3.54 17 65 2.09 2.70 96
16 -3.07 -2.80 80 66 2.09 2.23 237
17 3.06 3.49 20 67 2.08 2.34 188
18 3.02 2.29 213 68 -2.08 -2.24 232
19 -2.99 -3.34 27 69 -2.06 -2.53 130
20 -2.93 -3.13 38 70 -2.04 -2.11 283
21 -2.92 -2.92 57 71 -2.04 -2.95 54
22 2.86 3.26 31 72 -2.03 -3.08 40
23 -2.83 -2.82 74 73 -2.02 -2.30 210
24 2.82 2.37 180 74 -2.01 -3.67 15
25 -2.81 -2.13 276 75 2.00 2.62 109
26 2.81 3.48 21 76 -1.98 -2.38 171
27 -2.79 -3.01 47 77 1.98 1.43 795
28 2.70 2.87 64 78 1.96 1.69 549
29 -2.66 -3.15 37 79 -1.95 -1.47 746
30 -2.58 -3.85 11 80 1.95 1.95 361
31 -2.56 -2.84 71 81 1.95 2.81 77
32 -2.55 -1.72 524 82 -1.94 -1.41 813
33 -2.54 -2.63 106 83 1.94 3.40 26
34 -2.54 -2.69 98 84 1.94 1.30 948
35 2.53 2.30 209 85 -1.94 -3.27 30
36 2.48 2.45 148 86 -1.93 -1.11 1190
37 -2.47 -2.29 212 87 -1.93 -1.37 872
38 -2.46 -3.21 33 88 -1.93 -3.44 24
39 -2.43 -2.44 154 89 -1.92 -3.07 41
40 2.43 2.71 94 90 -1.92 -1.50 726
41 2.40 2.86 66 91 1.90 3.62 16
42 -2.34 -2.60 115 92 -1.90 -2.82 75
43 -2.34 -2.98 50 93 1.89 1.25 1007
44 -2.33 -3.80 12 94 1.87 3.89 8
45 -2.32 -2.06 306 95 -1.86 -3.49 19
46 2.29 1.81 444 96 -1.86 -2.08 300
47 -2.27 -1.17 1110 97 1.85 1.20 1074
48 2.26 1.97 347 98 -1.83 -2.90 60
49 2.24 3.75 14 99 1.83 1.39 833
50 2.20 3.88 9 100 -1.82 -1.95 367
Table 1: Tellipsoid in action with Top 100 ’s. Corresponding ’s and their rank are also shown. The results are for some realization from Fig. 4. Tellipsoid = 22 NoFPs; raw -statistics = 68 NoFPs. Truly null genes are printed in bold-sans typeface.

Interestingly, with a smaller , SAM outperforms the other two techniques. This is not entirely surprising as a smaller can make the noise in the per gene pooled variance (and possibly the equivalent quantity in the EDGE algorithm) more prominent. Nevertheless, SAM does mitigate this issue in some measure by using the exchangeability factor to adjust the effective pooled variance (Tusher et al., 2001).

4.3 Discussion

By allowing researchers to examine the simultaneous expressions of enormous numbers of genes, microarrays promised to revolutionize the understanding of complex diseases and usher in an era of personalized medicine. However, the shift in perception of that promise is palpable in the literature. A 1999 Nature Genetics article (Lander, 1999) is entitled “Array of hope,” but a 2005 Nature Reviews article (Frantz, 2005) is entitled “An array of problems.” It is not unusual for impacts of new technologies to be overestimated when first deployed, then to have the expectations moderated as the technologies reveal new complexities in the problems they are designed to solve. In the study of microarray data, the need for exceeding care in the design and regularization of experiments and data collection are understood to be critical, but the biggest hindrance to progress has been the data interpretation. In particular, as Efron (2007) and Owen (2005) point out, the biggest challenge seems to be the treatment of intrinsic inter-gene correlation.

In most microarray data there are at least three vital resources: (i) identifiability (ii) immense parallel structure, and (iii) inter-gene correlation itself. Thoughtful analyses in the papers by Efron (Efron, 2005, 2000) have suggested this view of the rich information structure inherent in the data. In this light, Tellipsoid can be viewed as exploiting more than correlation as a means of sharing information across tests, as it also involves identifiability.

A crucial step in formulating Tellipsoid was the comprehension of the effects of inter-gene correlation on . In light of Observations 1–3, the choice of the Mahalanobis distance was intuitive, as it is already known to give computationally attractive solutions through the matrix inversion lemma.

Limited time and resources – and perhaps also the necessity for scientific focus – often require biomedical researchers to work on only a small number of “hot (gene) prospects.” Even under such highly conservative conditions, however, misleading results can occur, as evident in the results of Figs. 14. For all their careful development and statistical power, even state-of-the-art tools like EDGE and SAM can report spurious gene lists. The extra statistical power of Tellipsoid promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction.

5 Conclusion

This paper has reported the development and testing of a novel framework for the detection of differential gene expression. The framework combines the exploitation of inter-gene correlation to share information across tests, with identifiability – the fact that in most microarray data sets, a large proportion of genes can be identified a priori as non-differential. When applied to the widely used two-sample -statistic approach, this viewpoint yielded an elegant differential analysis technique, Tellipsoid, which requires as inputs only a gene expression matrix, related two-sample labels, and the size of desired gene-list . Tellipsoid was tested on the prostate cancer data of Singh et al. (2002) and some simulated data. Compared to SAM (Tusher et al., 2001), EDGE (Storey et al., 2007), and the raw -statistic approach itself, Tellipsoid shows substantial improvement in statistical power. Usually, with increase in microarray samples, power tends to increase considerably, but, even for small sample sizes, Tellipsoid’s performance improvement is noticeable. The software (coded in MATLAB and in R) and test data sets are available at www.egr.msu.edu/~desaikey.

Funding

KD is supported by a graduate research fellowship from the Quantitative Biology Initiative at Michigan State University.

Acknowledgement

We are grateful to the anonymous reviewers of an earlier publication (Desai et al., 2007) for their insightful remarks that seeded this work. We also appreciate the support of the Michigan State University High Performance Computing Center.

Appendix

Suppose that

(A-1)

We also have . Substituting these in Eqn. 2.3 yields:

(A-2)

In Eqn. A-2, by setting the gradient w.r.t to 0, we obtain:

(A-3)

Now for , we can appeal to the matrix inversion lemma (Golub and Van Loan, 1996):

where . Plugging this in Eqn. A-3 yields:

(A-4)

Combining Eqn. A-4 with Eqn. A-1 provides the desired expression:

References

  • Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. Methodological, 57(1):289–300, 1995.
  • Cui et al. (2005) X. Cui, J.T.G. Hwang, J. Qiu, N.J. Blades, and G.A. Churchill. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6(1):59–75, 2005.
  • Deller Jr et al. (1993) J.R. Deller Jr, J.G. Proakis, and J.H. Hansen. Discrete Time Processing of Speech Signals. Prentice Hall PTR Upper Saddle River, NJ, USA, 1993.
  • Desai et al. (2007) K. Desai, J.R. Deller, Jr., and J.J. McCormick. The distribution of the number of false discoveries in highly correlated dna microarray data. Submitted to the Annals of Applied Statstics, preprint at http://www.egr.msu.edu/~desaikey, 2007.
  • Devijver and Kittler (1982) P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Englewood Cliffs, New Jersey etc, 1982.
  • Dudoit et al. (2002) S. Dudoit, Y.H. Yang, M.J. Callow, and T.P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1):111–139, 2002.
  • Efron (2000) B. Efron. RA Fisher in the 21st Century. Statistics for the 21st Century: Methodologies for Applications of the Future, 1:09, 2000.
  • Efron (2004) B. Efron. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis. Journal of the American Statistical Association, 99(465):96–105, 2004.
  • Efron (2005) B. Efron. Bayesians, Frequentists, and Scientists. Journal of the American Statistical Association, 100(469):1–6, 2005.
  • Efron (2006) B. Efron. Microarrays, Empirical Bayes, and the Two-Groups Model. Preprint, Dept. of Statistics, Stanford University, 2006.
  • Efron (2007) B. Efron. Correlation and Large-Scale Simultaneous Significance Testing. Journal of the American Statistical Association, 102(477):93–103, 2007.
  • Efron et al. (2001) B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher. Empirical Bayes Analysis of a Microarray Experiment. Journal of the American Statistical Association, 96(456):1151–1161, 2001.
  • Frantz (2005) S. Frantz. An array of problems. Nat Rev Drug Discov, 4(5):362–3, 2005.
  • Golub and Van Loan (1996) G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
  • Hedenfalk et al. (2001) I. Hedenfalk, D. Duggan, Y. Chen, et al. Gene-Expression Profiles in Hereditary Breast Cancer. New England Journal of Medicine, 344(8):539–548, 2001.
  • Irizarry et al. (2003) R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, and T.P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15, 2003.
  • Kerr et al. (2000) M.K. Kerr, M. Martin, and G.A. Churchill. Analysis of Variance for Gene Expression Microarray Data. Journal of Computational Biology, 7(6):819–837, 2000.
  • Lander (1999) E.S. Lander. Array of hope. Nature Genetics, 21(1):3–4, 1999.
  • Langaas et al. (2005) M. Langaas, E. Ferkingstad, and B.H. Lindqvist. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J Roy Stat Soc, Series B, 67:555–72, 2005.
  • Lee et al. (2000) M.L.T. Lee, F.C. Kuo, GA Whitmore, and J. Sklar. Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences of the United States of America, 97(18):9834, 2000.
  • Leek et al. (2006) J.T. Leek, E. Monsen, A.R. Dabney, and J.D. Storey. EDGE: extraction and analysis of differential gene expression, 2006.
  • Lonnstedt and Speed (2002) I. Lonnstedt and T. Speed. Replicated microarray data. Statistica Sinica, 12(1):31–46, 2002.
  • Mahalanobis (1936) P.C. Mahalanobis. On the generalized distance in statistics. Proc Natl Inst Sci India, 2(1):49–55, 1936.
  • Newton et al. (2001) MA Newton, CM Kendziorski, CS Richmond, F.R. Blattner, and KW Tsui. On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data. Journal of Computational Biology, 8(1):37–52, 2001.
  • Owen (2005) A.B. Owen. Variance of the number of false discoveries. Journal of the Royal Statistical Society Series B, 67(3):411–426, 2005.
  • Pawitan et al. (2005) Y. Pawitan, K.R.K. Murthy, S. Michiels, A. Ploner, and O. Journals. Bias in the estimation of false discovery rate in microarray studies. Bioinformatics, 21(20):3865–3872, 2005.
  • Singh et al. (2002) D. Singh, P.G. Febbo, K. Ross, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.
  • Storey (2002) J.D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B(Statistical Methodology), 64(3):479–498, 2002.
  • Storey et al. (2007) J.D. Storey, J.Y. Dai, and J.T. Leek. The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics, 8(2):414, 2007.
  • Tibshirani and Wasserman (2006) R. Tibshirani and L. Wasserman. Correlation-sharing for detection of differential gene expression. Arxiv preprint math.ST/0608061, 2006.
  • Tsai et al. (2003) C.A. Tsai, Y.J. Chen, J.J. Chen, and O. Journals. Testing for differentially expressed genes with microarray data. Nucleic Acids Research, 31(9):e52, 2003.
  • Tusher et al. (2001) V. Tusher, R. Tibshirani, and C Chu. Significance analysis of microarrays applied to ionizing radiation response. Proceedings of the National Academy of Sciences, 98:5116–5121, 2001.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
50623
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description