# Imputation of truncated p-values for meta-analysis methods and its genomic application\thanksrefT2

###### Abstract

Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. A tremendous amount of expression profiles are generated and stored in the public domain and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular to obtain increased statistical power and validated findings. Methods that aggregate transformed -value evidence have been widely used in genomic settings, among which Fisher’s and Stouffer’s methods are the most popular ones. In practice, raw data and -values of DE evidence are often not available in genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain -value threshold (e.g., DE genes with -value) are reported in journal publications. The truncated -value information makes the aforementioned meta-analysis methods inapplicable and researchers are forced to apply a less efficient vote counting method or naïvely drop the studies with incomplete information. The purpose of this paper is to develop effective meta-analysis methods for such situations with partially censored -values. We developed and compared three imputation methods—mean imputation, single random imputation and multiple imputation—for a general class of evidence aggregation methods of which Fisher’s and Stouffer’s methods are special examples. The null distribution of each method was analytically derived and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error, power and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were applied to several genomic applications in colorectal cancer, pain and liquid association analysis of major depressive disorder (MDD). The results showed that imputation methods outperformed existing naïve approaches. Mean imputation and multiple imputation methods performed the best and are recommended for future applications.

10.1214/14-AOAS747 \volume8 \issue4 2014 \firstpage2150 \lastpage2174 \docsubtyFLA

Imputation in microarray meta-analysis

A]\fnmsShaowu \snmTanglabel=e1]sht41@pitt.edu\thanksrefm1, B]\fnmsYing \snmDinglabel=e2]yid5@pitt.edu\thanksrefm2, C]\fnmsEtienne \snmSibillelabel=e3]Etienne.Sibille@camh.ca\thanksrefm3, D]\fnmsJeffrey S. \snmMogillabel=e4]jeffrey.mogil@mcgill.ca\thanksrefm4, E]\fnmsWilliam R. \snmLarivierelabel=e5]lariwr@upmc.edu\thanksrefm5 and F]\fnmsGeorge C. \snmTseng\correflabel=e6]ctseng@pitt.edu\thanksrefm5

T2Supported in part by NIH R21MH094862.

Microarray analysis \kwdmeta-analysis \kwdFisher’s method \kwdStouffer’s method \kwdmissing value imputation

## 1 Introduction and motivation

Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. The rapid development in biological high-throughput technology results in a tremendous amount of experimental data and many data sets are available from public domains such as Gene Expression Omnibus (GEO) and ArrayExpress. Since most microarray studies have relatively small sample sizes and limited statistical power, integrating information from multiple transcriptomic studies using meta-analysis techniques is becoming popular. Microarray meta-analysis usually refers to combining multiple transcriptomic studies for detecting differentially expressed (DE) genes (or candidate markers). DE gene analysis identifies genes differentially expressed across two or more conditions (e.g., cases and controls) with statistical significance and/or biological significance (e.g., fold change). Microarray meta-analysis in many situations refers to performing traditional meta-analysis techniques on each gene repeatedly and then controlling the false discovery rate (FDR) to adjust -values for multiple comparison [Borovecki et al. (2005); Cardoso et al. (2007); Pirooznia, Nagarajan and Deng (2007); Segal et al. (2004)]. Fisher’s method [Fisher (1925)] was the first meta-analysis technique introduced in microarray data analysis in [Rhodes et al. (2002)], followed by Tippett’s minimum -value method in [Moreau et al. (2003)]. Subsequently, many meta-analysis approaches have been used in this field, including extensions of existing meta-analysis techniques and novel methods to encompass the challenges presented in the genomic setting [Choi et al. (2003), Choi et al. (2007), Moerau et al. (2003), Owen (2009), Li and Tseng (2011), and see a review paper by Tseng, Ghosh and Feingold (2012)].

To combine findings from multiple research studies, one needs to know either the effect size or the -value for each study. Since the differences in data structures and statistical hypotheses across multiple studies may make the direct combination of effect sizes impossible or the result suspicious, combining -values from multiple studies is often more appealing. Popular -value combination methods [see review and comparative papers Tseng, Ghosh and Feingold (2012) and Chang et al. (2013)] can be split into two major categories of evidence aggregation methods (including Fisher’s, Stouffer’s and logit methods) and order statistic methods [such as minimum -value, maximum -value and th ordered -value by Song and Tseng (2014)]. Evidence aggregation methods utilize summation of certain transformations of -values as their test statistics to aggregate differential expression evidence across studies. Among evidence aggregation methods, Fisher’s method is the most well known, in which the test statistic is defined as , where is the number of independent studies that are to be combined and is the -value of individual study . Under the null hypothesis of no effect size in all studies and assuming that studies are independent and models for assessing -values are correctly specified, follows a chi-square distribution with degrees of freedom . Fisher’s method has been popular due to its simplicity and some theoretical properties, including admissibility under Gaussian assumption [Birnbaum (1954, 1955)] and asymptotically Bahadur optimality (ABO) under equal nonzero effect sizes across studies [Littel and Folk (1971, 1973)]. Some variations of Fisher’s methods were proposed by using unequal weights or a trimmed version of Fisher’s test statistic [Olkin and Saner (2001)]. Another widely used evidence aggregation method is the Stouffer’s method [Stouffer (1949)], in which the test statistic is defined as , where is the inverse cumulative distribution function (CDF) of standard normal distribution.

In order to combine -values, all -values across studies should be known. In genomic applications, however, raw data and thus -values are often not available and usually only a list of statistically significant DE genes (-value less than a threshold) is provided in the publication [Griffith, Jones and Wiseman (2006)]. Although many journals and funding agencies have encouraged or enforced data sharing policies, the situation has only improved moderately. Many researchers are still concerned about data ownership, and researchers whose studies are sponsored by private funding are not obligated to share data in the public domain. For example, in Chan et al. (2008), publications of colorectal cancer versus normal gene expression profiling studies were collected to perform meta-analysis to identify consistently reported candidate disease-associated genes. However, only one raw data set is available from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/, GSE3294) and most other papers only provided a list of DE genes (and their -values) under a prespecified -value threshold. A second motivating example comes from a microarray meta-analysis study for pain research [LaCroix-Fralish et al. (2011)], in which microarray studies of pain models were collected to detect the gene signature and patterns of pain conditions. Among the studies, only one raw data set was available on the author’s website and all the other papers reported the DE gene lists under different thresholds.

In these two motivating examples (details to be shown in Sections 4.1 and 4.2), the incomplete data forced researchers to either drop studies with incomplete -values or apply the convenient vote counting method [Hedges and Olkin (1980)]. Dropping studies with incomplete information greatly reduces the statistical power and is obviously not applicable in the two motivating examples since the complete data was available in only one study. The conventional vote counting procedure is well known as flawed and low-powered [McCarley et al. (2001)]. Ioannidis et al. (2009) attempted to reproduce microarray studies published in Nature Genetics during 2005–2006. Interestingly, only two were “in principle” replicated, six “partially” replicated and ten could not be reproduced. This result illustrates well the widespread difficulty of obtaining raw data or reproducing published results in the field. Therefore, developing methods to efficiently combine studies with truncated -value information is an important problem in microarray meta-analysis.

In this paper, we assume that studies are combined. In studies, the raw gene expression data matrix and sample annotations are available and the complete -values ( for genes and ) can be reproduced for meta-analysis. For the remaining studies, either the raw data or annotation is not available. Only incomplete information of a DE gene list (under -value threshold for study ) is provided in the journal publication. In this situation, the available information is an indicator function to represent whether the -value of gene in study is smaller than or not. We outline the paper structure as the following. In Section 2 a general class of evidence aggregation meta-analysis methods under a univariate scenario was investigated for the mean imputation, the single random imputation and the multiple imputation methods, respectively, in which the exact or approximate null distributions were derived under the null hypotheses and the results are shown for the Fisher and the Stouffer methods. In Section 3.1 simulations of the expression profile were performed to compare performance of different methods. Simulations were further performed in Section 3.2 using 8 major depressive disorder (MDD) and 7 prostate cancer studies where raw data were completely available and the true best performance (complete case) could be obtained. In Section 4 the proposed methods were applied to the two motivating examples. In Section 4.1 the methods were applied to colorectal cancer studies, where the raw data were available only in studies. In Section 4.2 the proposed methods were applied to microarray studies of pain conditions, where no raw data were available. In Section 4.3 we developed an unconventional application of the proposed methods to facilitate the large computational and data storage needs in a liquid association meta-analysis. Discussions and conclusions are included in Section 5 and all proofs are left in the Appendix.

## 2 Methods and inferences

### 2.1 Evidence aggregation meta-analysis methods

Here we consider a general class of univariate evidence aggregation meta-analysis methods (for gene fixed), in which the test statistics are defined as the sum of selected transformations of -values for each individual study. Without loss of generality, assuming that is the cumulative distribution function (CDF) of a continuous random variable , the test statistic is defined as

(1) |

where is the -value from the th study. Theoretically, can be any continuous random variable. However, in practice, is usually selected such that the test statistic follows a simple distribution. For instance, when , it holds (Fisher’s method) and holds, provided (Souffer’s method).

The hypothesis that corresponds to testing the homogeneous effect sizes of studies by evidence aggregation methods is a union-intersection test (UIT) [Roy (1953)]:

(2) |

In this paper, we focus on two popular special cases: {longlist}[1.]

Fisher’s method [Fisher (1925)]: When , .

Stouffer’s method [Stouffer (1949)]: When , .

Another example is the logit method [Hedges and Olkin (1985)], where . But since this method is rarely used in practice, we will not examine it further in this paper. To apply the evidence aggregation meta-analysis methods mentioned above, all the -values should be observed. However, in genomic applications, it often happens that -values of some studies are truncated and only their ranges are reported. Two naïve methods are commonly used to overcome this situation: the vote counting method or the available-case method which only combines studies with observed -values. The available-case method discards rich information contained in the studies with truncated -values and, therefore, the statistical power is reduced. Hedges and Olkin (1980) showed that the power of vote counting converges to when many studies of moderate effect sizes are combined and, therefore, the vote counting method should be avoided whenever possible. In this section, three imputation methods—mean imputation, single random imputation and multiple imputation method—are proposed and investigated to combine studies with truncated -values and the corresponding null distributions are derived analytically, respectively. We first define some notation.

Assume that independent studies are to be combined and are the corresponding -values. Without loss of generality, assume that all the -values are available in the first studies and only the indicator function of DE evidence is reported in the other studies.

Define a pair for each study, in which is the “censoring” indicator satisfying

(3) |

and is the final observed values which is defined as

(4) |

where is the -value threshold for study . For each , one can impute the missing value by :

with , and . Sections 2.2–2.4 develop three imputation methods for selection of and .

### 2.2 Mean imputation method

The simplest imputation method is the mean imputation method, in which and . Then the test statistic for truncated data satisfies

with

for . Recall that under the null hypothesis, the random variable satisfies for the Fisher method and for the Stouffer method. Obviously, follows a Bernoulli distribution.

The results can be summarized into the following theorem (proof left to Appendix B.1):

###### Theorem 1

For and given , by defining

(7) | |||||

it holds

(8) | |||

where is the CDF of . Given the CDF, the expected values of test statistic under null distributions can be calculated as follows: {longlist}[1.]

For Fisher’s method, it holds

while the expectation of the original is .

For Stouffer’s method, it holds

while the expectation of the original is .

Note that there are terms summation in the right-hand side of equation (1), which may cause severe computing problem when is large. However, when some are equal, the formula can be simplified. Without loss of generality, assume there are different -value thresholds such that

(9) | |||||

then by defining for and , the formula can be simplified as

(10) | |||

Therefore, the summation is reduced from terms to terms.

From the above theorem one concludes that is a biased estimator of the original . This motivates the following two stochastic imputation methods.

### 2.3 Single random imputation method

It is well known that the mean imputation method will underestimate the variance of [Little and Rubin (2002)]. Furthermore, Theorem 1 indicates that the test statistic from the mean imputation method is a biased estimator of the original . To avoid this problem, one can replace the mean by randomly simulating and from and , respectively.

Recall that for , . The next theorem (proof left to Appendix B.2) states that holds under the null hypothesis, that is, and follow the same distribution.

###### Theorem 2

For , it holds

(11) |

The following corollary is a simple consequence of the above theorem. {corol*} For the single random imputation method, the following facts hold for : {longlist}[1.]

For Fisher’s method, it holds and therefore .

For Stouffer method, it holds and therefore .

Therefore, in this case, is an unbiased estimator of defined in equation (1).

### 2.4 Multiple imputation method

Although the single random imputation method allows the use of standard complete-data meta-analysis methods, it cannot reflect the sampling variability from one random sample. The multiple imputation method (MI) overcomes this disadvantage [Little and Rubin (2002)]. In MI, each missing value is imputed times. Therefore, is a sequence of test statistics which are defined as

(12) |

with

(13) |

The test statistic is defined as , which satisfies

Since with probability and with probability , is a mixture distribution of and and, therefore, is a mixture distribution of .

Note that and are independent and identically distributed (i.i.d.) for fixed . Denoting by the mean and variance of and , respectively, then by the central limit theorem one concludes that for large enough it holds

Then the following theorem holds.

###### Theorem 3

For , by defining which satisfies

(14) | |||

then for sufficiently large , it holds approximately that

(15) | |||

The detailed notation is left to Appendix C.

Similar to the mean imputation method, the formula can be simplified when some -value thresholds are equal, that is,

(16) |

with and .

## 3 Simulation results

### 3.1 Simulated expression profiles

To evaluate performance of the proposed imputation methods in the genomic setting, we simulated expression profiles with correlated gene structure and variable effect sizes as follows:

Simulate gene correlation structure for genes, samples in each study and studies. In each study, of the genes belong to independent clusters. {longlist}[Step 1.]

Randomly sample gene cluster labels of genes ( and ), such that clusters each containing genes are generated [] and the remaining genes are unclustered genes [].

For any cluster in study , sample , where denotes the inverseWishart distribution, is the identity matrix and is the matrix with all the entries being . Set vector as the square roots of the diagonal elements in . Calculate such that .

Denote as the indices for genes in cluster . In other words, , where and . Sample the expression of clustered genes by , where and . Sample the expression for unclustered genes for and if . Simulate differential expression pattern. {longlist}[Step 4.]

Sample effect sizes from for as DE genes and set for as non-DE genes.

For the first control samples, . For cases, . In the simulated data sets, studies with genes simulated. Within each study, there were cases and controls. The first genes were DE in all studies with effect sizes randomly simulated from a uniform distribution on , respectively, and the remaining were non-DE genes. We chose this effect size range to produce an averaged standardized effect size at so that the DE analysis generates 500–600 candidate DE genes (Table 1), a commonly seen range in real applications. In each study, gene clusters existed, each containing genes. The correlation structure within each cluster was simulated from an inverse Wishart distribution.

In the simulations, we performed a two sample -test for each gene in each study and then combined the -values using the imputation methods proposed in this paper. For simplicity, we viewed the -values from the last studies as truncated with thresholds , respectively. In most genomic meta-analysis, researchers often use conventional permutation analysis by permuting sample labels to compute the -values to preserve gene correlation structure. However, such a nonparametric approach is not applicable in our situation, since raw data are not available in some studies. In order to control the false discovery rate (FDR), we examined the Benjamini–Hochberg (B–H) method [Benjamini and Hochberg (1995)] and the Benjamini–Yekutieli (B–Y) method [Benjamini and Yekutieli (2001)] separately. The number of DE genes detected at nominal FDR rate were recorded and the true FDR rates were computed for each meta-analysis method by

In the multiple imputation method, was selected. Simulations were repeated for times and the mean and standard errors of numbers of DE genes controlled by B–H and B–Y methods and their true FDR are reported in Table 1. The results showed that the FDRs were controlled well for B–H correction but rather conservative for B–Y correction (the true FDR of B–Y is only of B–H at nominal ). This is consistent with the previous observation that the B–Y adjustment tends to be over-conservative since it guards against any type of correlation structure [Benjamini and Yekutieli (2001)]. As a result, the B–H correction will be used for all applications hereafter. The simulation results showed consistently that imputation methods had higher statistical power than the available-case method, and the mean imputation and multiple imputation methods outperform the single random imputation method with similar performance. Surprisingly, the ratio of detected DE genes compared to the complete case increased from in the available case () to in mean imputation () using Fisher’s method. The improvement is even more significant using Stouffer’s method (from to ), while at the same time the true FDRs were controlled at a similar level for all methods. The result shows that imputation methods successfully utilize the incomplete -value information to greatly recover the detection power.

Fisher | Stouffer | ||||
---|---|---|---|---|---|

Method/Mean (s.e.) | No. DE | True FDR | No. DE | True FDR | |

B–H | Complete cases | 0.046 (0.0015) | |||

Available-case | 0.064 (0.022) | ||||

Mean imputation | 0.047 (0.0022) | ||||

Single imputation | 0.045 (0.0027) | ||||

Multiple imputation | 0.050 (0.0019) | ||||

B–Y | Complete cases | 0.0036 (0.00097) | |||

Available-case | 0.0029 (0.00096) | ||||

Mean imputation | 0.0034 (0.00073) | ||||

Single imputation | 0.0039 (0.0015) | ||||

Multiple imputation | 0.0050 (0.0010) |

We further examined the situation when gene dependence structure does not exist [i.e., steps 1–3 were skipped and ]. Table 2 shows the true type I error control under nominal significance level (i.e., true type I error ). The result shows adequate type I error control and confirms the validity of the closed form or approximated formula of different imputation methods in Section 2.

Fisher | Stouffer | |
---|---|---|

Complete cases | 0.050 (0.00031) | 0.050 (0.00037) |

Available-case | 0.050 (0.00035) | 0.050 (0.00033) |

Mean imputation | 0.050 (0.00031) | 0.050 (0.00033) |

Single imputation | 0.050 (0.00032) | 0.051 (0.00032) |

Multiple imputation | 0.050 (0.00031) | 0.051 (0.00031) |

To investigate the impact of on the performance of the multiple imputation method, simulations were performed for . The result is shown in Appendix A, Figure 3, which demonstrates that the performance of the multiple imputation method is quite robust for different number of imputation . We use throughout this paper.

### 3.2 Simulation from complete real data sets

In this subsection the proposed methods were applied to two real microarray data sets, including prostate cancer studies [Gorlov et al. (2009)] and major depressive disorder (MDD) studies [Wang et al. (2012)]. The details are summarized in Supplement Table [Tang et al. (2014)]. For each data set, about half of the studies (four for MDD and three for prostate cancer) were randomly selected with -value truncation threshold . Five methods including complete data, available-case, single random imputation, mean imputation and multiple imputation methods were applied to the data sets with the simulated incomplete data to impute by Stouffer’s and Fisher’s methods, respectively. The generated -values were corrected by the B–H method and the simulation was repeated for times. Figure 1 shows boxplots of the numbers of differentially expressed (DE) genes at for different methods in MDD and for prostate cancer data. Figure 1 indicates similar conclusions that the multiple imputation and the mean imputation methods detect more DE genes than the available-case method and single random imputation method. In the MDD example, very few DE genes (average of and for Fisher and Stouffer, resp.) were detected using the available-case method if half of the studies have truncated -values. The mean and multiple imputation methods greatly improved the detection sensitivity. About (Fisher) and (Stouffer) of DE genes detected by the mean imputation method overlapped with DE genes detected by complete data analysis in MDD and about (Fisher) and (Stouffer) of DE genes detected by the mean imputation method overlapped with DE genes detected by complete data analysis in prostate cancer, showing the ability of imputation methods to recover DE gene detection power.

## 4 Applications

### 4.1 Application to colorectal cancer

In the first motivating example, we followed Chan et al. (2008) and attempted to collect 23 colorectal cancer versus normal gene expression profiling studies. Raw data were available in only one study [Bianchini et al. (2006)] and of the other studies containing more than DE genes at different -value thresholds were included in our analysis. We searched the GEO database and identified two additional new studies [Jiang et al. (2008) and Bellot et al. (2012)]. The seven studies under analysis were summarized in Table 3. After gene-matching, genes overlapped in all three studies with raw data. The available-case method, the mean imputation method, the single random imputation method and the multiple imputation method were applied for the seven studies for the Fisher and Stouffer methods, respectively, and the results were reported in Table 4. For the single random imputation method and the multiple imputation method, the analyses were repeated times and the mean and standard error of the number of DE genes detected were reported under FDR control by the B–H method. The results demonstrate that for various FDR thresholds, the mean imputation method and the multiple imputation method detected more DE genes than the available-case method and the single random imputation method, which was consistent with previous findings in simulations. Under control, the Fisher and Stouffer mean imputation detected () and () times of DE genes than those by the available-case method, respectively.

No. of | No. of | Raw data | No. of DE | No. of overlapped | -value | |
---|---|---|---|---|---|---|

Study | samples | genes | availability | genes | DE genes | threshold |

Bianchini | 24 | 7403 | GSE3294 | – | – | – |

Bellot | 17 | 18,191 | GSE24993 | – | – | – |

Jiang | 48 | 18,197 | GSE10950 | – | – | – |

Grade | 103 | 21,543 | – | 1950 | 635 | 1e–7 |

Croner | 33 | 22,283 | – | 130 | 47 | 0.006 |

Kim | 32 | 18,861 | – | 448 | 143 | 0.001 |

Bertucci | 50 | 8074 | – | 245 | 97 | 0.009 |

Fisher | Stouffer | |||||||
---|---|---|---|---|---|---|---|---|

FDR | Available | Mean | Single | Multiple | Available | Mean | Single | Multiple |

2587 | 2855 | 2172.4 (2.90) | 2785.4 (2.93) | 1318 | 1675 | 668.4 (3.96) | 1616.0 (2.10) | |

1472 | 1874 | 1265.6 (2.34) | 1805.7 (1.50) | 299 | 709 | 252.7 (1.93) | 680.5 (1.12) | |

571 | 1183 | 748.4 (1.89) | 1138.6 (2.00) | 37 | 383 | 102.5 (1.65) | 366.7 (0.69) |

### 4.2 Application to pain research

The second motivating example comes from the meta-analysis of microarray studies of pain to detect the patterns of pain [LaCroix-Fralish et al. (2011)]. The original meta-analysis utilized DE gene lists from each study under different threshold criteria from -value, FDR or fold change and identified “statistically significant” genes that appeared in the DE gene lists of four or more studies. The vote counting method essentially lost a tremendous amount of information with flawed statistical inference. When we attempted to repeat the meta-analysis, raw data of only one of the studies (Barr) could be found. The old platform used in that study, however, contained only genes and had to be excluded from further meta-analysis. In the remaining studies, studies contained DE gene lists under various -value thresholds (marked bold in Supplement Table 2 [Tang et al. (2014)]) and were included in our application. In other words, this example contained exclusively only studies with truncated -values. Table 5 shows the results of three imputation methods. Fisher and Stouffer identified 280 and 45 genes under FDR control, respectively. Note that the original meta-analysis tested the 79 genes using an overall binomial test and the statistical significance was controlled at an overall -value level, not at a gene-specific FDR level. As a result, DE gene lists from the new imputation methods are theoretically more powerful and accurate.

To validate the finding, we used the Gene Functional Annotation tool from the DAVID Bioinformatics Resources website (http://david.abcc.ncifcrf.gov). DAVID applied a modified Fisher’s exact test to evaluate the association between the DE gene lists and pathways. Functional annotation of the 280 DE genes from the Fisher’s mean imputation method identified 208 pathways at , among which selected important pain-related pathways were grouped into five major biological categories and displayed in Table 6. In contrast, the 79 genes from vote counting identified only 14 pathways, of which the expected pain-related pathways under the categories of inflammation and of differentiation, development and projection are missing (see Table 6). The pathway enrichment -values after multiple comparison control of the “280 gene list” were very significant, while those of the “79 gene list” were not. Since the -value calculation from Fisher’s exact test can be impacted by the DE gene size, we further compared the enrichment odd-ratios of genes in the pathway versus in the DE gene list. Still the enrichment odds-ratios of the “280 gene list” were generally much higher than those for the “79 gene list,” showing stronger pain functional association from the Fisher’s mean imputation method.

Fisher | Stouffer | |
---|---|---|

Mean | 280 | 45 |

Single | 57.04 (1.6228) | 16.44 (0.8605) |

Multiple | 280.36 (0.8105) | 77.56 (0.6616) |

280 DE | 79 DE | ||||||
---|---|---|---|---|---|---|---|

(Fisher’s mean imputation) | (Vote counting) | ||||||

odds | odds | ||||||

Category | Pathway ID | -value | -value | ratio | -value | -value | ratio |

Differentiation, | neuron differentiation | 5.6e–6 | 0.0006 | 0.26 | |||

development and | regulation of neuron differentiation | 1.6e–5 | 0.0011 | 0.37 | |||

projection | neuron development | 2.5e–6 | 0.0003 | 0.24 | |||

regulation of nervous system development | 6.5e–6 | 0.0006 | 0.29 | ||||

neuron projection development | 1.6e–5 | 0.0012 | 0.27 | ||||

cell projection | 3.6e–11 | 3.2e–9 | 0.033 | ||||

neuron projection | 3.0e–11 | 3.4e–9 | 0.043 | ||||

cell projection organization | 1.6e–5 | 0.0012 | 0.24 | ||||

Response | response to wounding | 3.8e–10 | 2.8e–7 | 2.7e–5 | |||

to stimuli | response to endogenous stimulus | 3.2e–8 | 1.7e–5 | 0.35 | |||

positive regulation of response to stimulus | 7.9e–8 | 2.5e–5 | 0.0049 | ||||

regulation of response to external stimulus | 1.1e–5 | 0.001 | 0.043 | ||||

Immune | positive regulation of immune response | 4.2e–7 | 7.6e–5 | 0.018 | |||

positive regulation of immune system process | 1.9e–6 | 0.0003 | 0.0009 | ||||

complement activation | 3.0e–5 | 0.0016 | 0.011 | ||||

antigen processing and presentation | 1.3e–6 | 0.00022 | 0.00098 | ||||

of exogenous peptide antigen | |||||||

Inflammation | regulation of acute inflammatory response | 1.4e–6 | 0.0002 | 0.19 | |||

acute inflammatory response | 7.1e–6 | 0.0007 | 0.012 | ||||

regulation of inflammatory response | 1.9e–5 | 0.0012 | 0.17 | ||||

inflammatory response | 1.5e–5 | 0.0012 | 0.001 | ||||

Regulation of | regulation of transmission of nerve impulse | 6.0e–6 | 0.0006 | 0.057 | |||

Transmission |

### 4.3 Application to a three-way association method (liquid association)

So far the proposed imputation methods were applied successfully to two real microarray data sets of colorectal cancer and pain research in which the actual -values of some genes were not reported in a subset of studies. In this section we show that the proposed imputation methods can be useful in the meta-analysis of “big data” such as GWAS or eQTL, where the main computational problem is often the data storage.

In the literature it has long been argued that positively correlated expression profiles are likely to encode functionally related proteins. Liquid association (LA) analysis [Li (2002)] is an advanced three-way co-expression analysis beyond the traditional pairwise correlations. For any triplet of genes and , the LA score measures the effect that expression of to control on and off of the co-expression between and . For example, high expression of Z turns on positive correlation between and , while when expression of is low, and are negatively or noncorrelated. Theory in Li (2002) simplified calculation of the LA score to a linear order of sample size and made the genome-wide computation barely feasible. Supposing we want to combine studies of the liquid association, liquid association -values of all triplets in all studies have to be stored for meta-analysis. When the number of genes , the number of -values to be stored is GB. For a reasonable genome-wide analysis, storage size for all -values quickly increases to TB. One may argue that univariate (i.e., triplet by triplet) meta-analysis may be applied repeatedly to avoid the need of storing all -value results. There are many other genomic meta-analysis situations when this may not be feasible. For example, in GWAS meta-analysis under a consortium collaboration, raw genotyping data cannot be shared for privacy reasons and only the derived statistics or -values can be transferred for meta-analysis. Below we describe how imputation methods can help circumvent the tremendous data storage problem.

We performed a small scale of analysis on DE genes previously reported from the meta-analysis of the eight MDD studies used in Section 3.2 [Wang et al. (2012)]. The total number of possible triplets was . By setting up a -value threshold at , we only needed to store exact -values for (2.32%) triplets and the remaining were truncated as considered in this paper. Since we also needed to store the truncation index information, we only needed to store of the information and the compression ratio was . To investigate the loss of information by the truncation, Figure 2 shows meta-analysis -values [at scale] from Fisher’s method using full data and the Fisher mean imputation method using truncated data. The result shows high concordance in the top significant triplets, which are the major targets of this exploratory analysis. Among the top triplets detected by Fisher’s method using complete -value information, of them were also identified by the top by Fisher mean imputation. The remaining triplets were still in top ranks (rank between and ) using truncated data in the result of Fisher mean imputation. This result suggests a good potential of applying data truncation to preserve the most informative information and performing imputation to approximate the finding of the top targets when meta-analysis of “big data” is needed. The compression ratio may further increase by a more stringent truncation threshold, but the performance may somewhat decline as a trade-off.

## 5 Discussion and conclusion

When combining multiple genomic studies by -value combination methods, the raw data are often not available and only the ranges of -values are reported for some studies in genomic applications. This is especially true for microarray meta-analysis since owners of many microarray studies tend not to publish their data in the public domain. This incomplete data issue is often encountered when one attempts to perform a large-scale microarray meta-analysis. If raw data are not available, two naïve methods—vote counting method and available-case method—are commonly used. Since these two methods completely or largely neglect the information contained in the truncated -values, their statistical power is generally low. In this paper, we proposed three imputation methods for a general class of evidence aggregation meta-analysis methods to combine independent studies with truncated -values: mean imputation, single random imputation and multiple imputation methods. For each proposed imputation method, the null distribution was derived analytically for the Fisher and Stouffer methods. Theoretical results showed that the test statistics from the single random imputation and the multiple imputation methods were unbiased, while those for mean imputation methods were biased. Simulations were performed for the imputed Fisher method and imputed Stouffer method. The simulation results showed that type I errors were well controlled for all methods, which was consistent with our theoretical derivation. Compared to the naive available-case method, all the imputation methods achieved higher statistical powers, and the mean imputation and the multiple imputation methods recovered much of the power that the complete cases method achieved even when half of the studies had truncated -values. Furthermore, Figure 3 in Appendix A showed that the power of the multiple imputation method was robust to the number of imputation . Although small to moderate provided good results, we recommend choosing being larger than to guarantee that the central limit theorem can approximate well. Applications to two motivating examples in colorectal cancer and pain conditions showed that both mean imputation and multiple imputation performed among the best in terms of detection sensitivity and biological validation by pathway analysis.

In regression-type missing-data imputation methods, the null distribution of the error term is unknown and is assumed to be normally distributed with equal variance, a setting in which the multiple imputation method usually outperforms the mean imputation in practice and in theory [Little and Rubin (2002)], particularly because mean imputation underestimates the true variance. However, our simulation results demonstrated that the power of the two methods were quite similar. Two reasons may contribute to this result. First, although the test statistic from the mean imputation method is biased and neglects the variation of truncated -values, its -value can be computed accurately when the null distribution is derived analytically. Second and more importantly, we find that the test statistic of mean imputation is in fact , while for sufficiently large , the test statistic of multiple imputation converges to in distribution. It is easy to show that these two quantities are very close to each other for a small range of , provided is smooth. Since is infinitely differentiable for the Fisher and Stouffer methods, and the small -value range in is particularly of interest to us, it is not surprising that the mean imputation method and multiple imputation method perform similarly. Since the mean imputation method achieved almost the same power as the multiple imputation method with less computational complexity, it is more appealing and is recommended for microarray meta-analysis, where the imputed meta-analysis method is performed repeatedly for thousands of genes. In this paper only the evidence aggregation meta-analysis methods are investigated and further work will be needed to extend these results to order statistic based methods such as minP and maxP.

Note that although the truncated -value issue discussed in this paper may appear similar to the problem of “publication bias,” it is fundamentally different. Publication bias refers to the fact that a study with a large positive treatment effect is more likely to be published than a study with a relatively small treatment effect, resulting in bias if one only considers published studies. Denote by the -values of all conducted studies that should have been collected. Only a subset of likely more significant -values are observed. Under this setting, is unknown and are unknown as well. Since the number of missing publications is unknown, Duval and Tweedie proposed the “Trim and Fill” method to identify and correct for funnel plot asymmetry arising from publication bias [Duval and Tweedie (2000a) and (2000b)], in which an estimate of the number of missing studies is provided and an adjusted treatment effect is estimated by performing a meta-analysis including the imputed studies. For the truncated -value problem we consider here, the total number of studies, the number of studies with truncated -values and the -value truncation thresholds are all known. Therefore, investigation of the imputation of truncated -values in meta-analysis is different from the traditional “publication bias” problem and has not been studied in the meta-analysis literature, to the best of our knowledge.

In this paper the methods we developed mainly target on microarray meta-analysis, but the issue can happen frequently in other types of genomic meta-analysis [e.g., GWAS; Begurn et al. (2012)]. In Section 4.3 we demonstrated an unconventional application of our methods to meta-analysis of liquid association. Due to the large number of triplets tested in the three-way association, the needed -value storage is huge. By preserving only the most informative data by truncation, the storage burden is greatly alleviated and our imputation methods help approximate and recover the top meta-analysis targets with little power loss. In an ongoing project, we also attempt to combine multiple genome-wide eQTL results via meta-analysis. In eQTL, regression analysis is used to investigate the association of a SNP genotyping and a gene expression. It is impractical to store all genome-wide eQTL -values, as the storage space required is too large ( genes SNPS -values). A practical solution is to record only the eQTL -values smaller than a threshold (say, ) for meta-analysis, which leads to the same statistical setting as discussed in this paper. In another project we combine results from multiple ChIP-seq peak calling algorithms to develop a meta-caller. Since each peak caller algorithm can only report the top peaks with -values smaller than a certain -value threshold, we again encounter the same truncated -value problem in meta-analysis. As more and more complex genomic data are generated and the need for meta-analysis increases, we expect the imputation methods we propose in this paper will find even more applications in the future.

## Appendix A Supplementary figure

Fig. 3. Power analysis at significance level for different numbers of imputation . The dashed lines represent the theoretical asymptotic power obtained by setting .

## Appendix B Proofs of theorems

### b.1 Proof for Theorem 1

Note that in this case, for , it holds

(17) |

Let . Since under the null hypothesis, it holds

and, therefore,

(19) |

For given , it holds