Variation-preserving normalization unveils
blind spots in gene expression profiling
RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, but lack of reproducibility has hindered their application. A key challenge in the data analysis is the normalization of gene expression levels, which is currently performed following the implicit assumption that most genes are not differentially expressed. Here, we present a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much larger than currently believed, and that it can be measured with available assays. Our results also explain, at least partially, the reproducibility problems encountered in transcriptomics studies. We expect that this improvement in detection will help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression.
differential gene expression
gene expression microarrays
normalization of high-throughput data
DEG: Differentially Expressed Gene
FDR: False Discovery Rate
MedianCD normalization: Median Condition-Decomposition normalization
SVCD normalization: Standard-Vector Condition-Decomposition normalization
Since the discovery of DNA structure by Watson and Crick, molecular biology has progressed increasingly fast, with rapid advances in sequencing and related genomic technologies. Among these, DNA microarrays and RNA-Seq have been widely adopted to obtain gene expression profiles, by measuring the concentration of tens of thousands of mRNA molecules in single assays [Schena et al., 1995; Lockhart et al., 1996; Duggan et al., 1999; Mortazavi et al., 2008; Wang et al., 2009]. Despite their enormous potential [Golub et al., 1999; van ’t Veer et al., 2002; Ivanova et al., 2002; Chi et al., 2003], problems of reproducibility and reliability [Tan et al., 2003; Frantz, 2005; Couzin, 2006] have discouraged their use in some areas, e.g. biomedicine [Michiels et al., 2005; Weigelt and Reis-Filho, 2010; Brettingham-Moore et al., 2011; Boutros, 2015].
The normalization of gene expression, which is required to set a common reference level among samples [Irizarry et al., 2003; Tarca et al., 2006; Garber et al., 2011; Conesa et al., 2016], has been reported to be problematic, affecting the reproducibility of results with both microarray [Shi et al., 2006; Shippy et al., 2006; Draghici et al., 2006] and RNA-Seq [Bullard et al., 2010; Dillies et al., 2013; Su et al., 2014; Lin et al., 2016]. Batch effects and their influence on normalization have recently received a great deal of attention [Leek et al., 2010; Reese et al., 2013; Li et al., 2014], resulting in approaches aiming to remove unwanted technical variation caused by differences between batches of samples or by other sources of expression heterogeneity [Listgarten et al., 2010; Gagnon-Bartsch and Speed, 2012; Risso et al., 2014]. A different issue, however, is the underlying assumption made by the most widely used normalization methods to date, such as Median and Quantile normalization [Bolstad et al., 2003] for microarrays, or RPKM (Reads Per Kilobase per Million mapped reads) [Mortazavi et al., 2008], TMM (Trimmed Mean of M-values) [Robinson and Oshlack, 2010], and DESeq [Anders and Huber, 2010] normalization for RNA-Seq, which posit that all or most genes are not differentially expressed [van de Peppel et al., 2003; Hannah et al., 2008; Lovén et al., 2012; Dillies et al., 2013; Hicks and Irizarry, 2015]. Although it may seem reasonable for many applications, this lack-of-variation assumption has not been confirmed. Moreover, results obtained with external controls [van de Peppel et al., 2003; Hannah et al., 2005, 2008; Lovén et al., 2012] or with RT-qPCR [Shi et al., 2006; Bullard et al., 2010] suggest that it may not be valid.
Some methods have been proposed to address this issue, based on the use of spike-ins [van de Peppel et al., 2003; Hannah et al., 2008; Lovén et al., 2012], negative control probes (SQN, Subset Quantile normalization) [Wu and Aryee, 2010], or negative control genes (RUV-2, Remove Unwanted Variation, 2-step) [Gagnon-Bartsch and Speed, 2012]. These methods use external or internal controls that are known a priori not to be differentially expressed [Lippa et al., 2010]. Their applicability, however, has been limited by this requirement of a priori knowledge, which is rarely available for a sufficiently large number of controls. In addition, other methods have been proposed to address the lack-of-variation assumption by identifying a subset of non-differentially expressed genes from the assay data, such as Cross-Correlation normalization [Chua et al., 2006], LVS (Least-Variant Set) normalization [Calza et al., 2008], and NVAS (Nonparametric Variable Selection and Approximation) normalization [Ni et al., 2008]. While LVS normalization requires setting in advance a number for the fraction of genes to be considered as non-differentially expressed, with values in the range 40–60% [Calza et al., 2008], Cross-Correlation and NVAS normalization are expected to degrade in performance when more than 50% of genes are differentially expressed [Chua et al., 2006; Ni et al., 2008]. More recently, CrossNorm has been introduced [Cheng et al., 2016], based on the mixture of gene expression distributions from the experimental conditions. This method, however, has been proposed for two experimental conditions, and specially for paired samples. The extension of this approach to experimental designs with unpaired samples and more than a few experimental conditions would lead, as far as we can hypothesize, to an unmanageable size of the data matrix to process.
Thus, to clarify and overcome the limitations imposed by the lack-of-variation assumption, we have developed an approach to normalization that does not assume lack-of-variation and that is suitable to most real-world applications. Hence, we aimed to avoid the need of spike-ins, a priori knowledge of control genes, or assumptions on the number of differentially expressed genes. The analysis of several gene expression datasets using this approach confirmed that our methods reached these goals. Furthermore, our results show that assuming lack-of-variation can severely undermine the detection of gene expression variation in real assays. We have found that large numbers of differentially expressed genes, with substantial expression changes, are missed or misidentified when data are normalized with methods that assume lack-of-variation.
E. crypticus and Synthetic Datasets
A large gene expression dataset was obtained from biological triplicates of Enchytraeus crypticus (a globally distributed soil organism used in standard ecotoxicity tests), sampled under 51 experimental conditions (42 treatments and 9 controls), involving exposure to several substances, at several concentrations and durations according to a factorial design (Supplementary Table S1). Gene expression was measured using a customized high-density oligonucleotide microarray [Castro-Ferreira et al., 2014], resulting in a dataset with 18,339 gene probes featuring good hybridization signal in all 153 samples. Taking into account the design of the microarray [Castro-Ferreira et al., 2014], we refer to these gene probes as genes in what follows.
To further explore and compare outcomes between normalization methods, two synthetic random datasets were built and analyzed. One of them was generated with identical means and variances gene-by-gene to the real E. crypticus dataset, and under the assumption that no gene was differentially expressed. In addition, normalization factors were applied, equal to those obtained from the real dataset. Thus, this synthetic dataset was similar to the real one, while complying by construction with the lack-of-variation assumption. The other synthetic dataset was also generated with comparable means and variances to the real dataset and with normalization factors, but in this case differential expression was added. Depending on the experimental condition, several numbers of differentially expressed genes and ratios between over- and under-expressed genes were introduced (see Methods). Together, these synthetic datasets with and without differential gene expression represent, respectively, the alternative and null hypotheses for a statistical test of differential gene expression.
The gene expression datasets were normalized with four methods. Two of these methods are the most widely used procedures for microarrays, namely Median (or Scale) normalization and Quantile normalization [Bolstad et al., 2003]. (Note that current methods of normalization for RNA-Seq, such as RPKM [Mortazavi et al., 2008], TMM [Robinson and Oshlack, 2010], and DESeq [Anders and Huber, 2010], perform between-sample normalization by introducing a scaling per sample obtained with some form of mean or median, using all or a large set of genes. Thus their performance, in what concerns the issues addressed here, is expected to be similar to that of Median normalization for microarrays.)
The other two normalization methods were developed for this study, they being called Median Condition-Decomposition normalization and Standard-Vector Condition-Decomposition normalization, respectively MedianCD and SVCD normalization in what follows.
With the exception of Quantile normalization, all used methods apply a multiplicative factor to the expression levels of each sample, equivalent to the addition of a number in the usual -scale for gene expression levels. Solving the normalization problem consists of finding these correction factors. This problem can be exactly and linearly decomposed into several sub-problems: one within-condition normalization for each experimental condition and one final between-condition normalization for the condition averages (see Methods). In the within-condition normalizations, the samples (replicates) subjected to each experimental condition are normalized separately, whereas in the final between-condition normalization average levels for all conditions are normalized together. Because there are no genes with differential expression in any of the within-condition normalizations, the lack-of-variation assumption only affects the final between-condition normalization. The assumption is avoided by using, in this normalization, expression levels only from no-variation genes, defined as genes that show no evidence of differential expression under a statistical test. An important detail is that the within-condition normalizations ensure good estimates of the within-condition variances, which are required by the statistical test for identifying no-variation genes. This requisite also implies that a minimum of two samples is required per experimental condition. Both methods of normalization proposed here, MedianCD and SVCD normalization, follow this condition-decomposition approach.
With MedianCD normalization, all normalizations are performed with median values, as in conventional Median normalization, but only no-variation genes are employed in the between-condition step. Otherwise, if all genes were used in this final step, the resulting total normalization factors would be exactly the same as those obtained with conventional Median normalization.
For SVCD normalization, a vectorial procedure was developed to carry out each normalization step, called Standard-Vector normalization. The samples of any experimental condition, in a properly normalized dataset, must be exchangeable. In mathematical terms, the expression levels of each gene can be considered as an -dimensional vector, where is the number of samples for the experimental condition. After standardization (mean subtraction and variance scaling), these standard vectors are located in a -dimensional hypersphere. The exchangeability mentioned above implies that, when properly normalized, the distribution of standard vectors must be invariant with respect to permutations of the samples and must have zero expected value. These properties allow to obtain a robust estimator of the normalization factors, under fairly general assumptions that do not imply any particular distribution of gene expression (see Methods).
It is worth mentioning that the limit case when the number of samples is two () represents a degenerate case for Standard-Vector normalization, in which the space of standard vectors reduces to a 0-dimensional space with only two points. In this degenerate case, Standard-Vector normalization is equivalent to global Loess normalization [Yang et al., 2002; Smyth and Speed, 2003], i.e. Loess normalization without correction for non-linearities with respect to the level of gene expression or microarray print-tips. In this sense, Standard-Vector normalization is a generalization to any number of samples of the approach underlying the different types of Loess normalization.
Figure 1 displays the results of applying the four normalization methods to the real and two synthetic datasets. Each panel shows the interquartile range of expression levels for the 153 samples, grouped in triplicates exposed to each experimental condition. Both Median and Quantile normalization (second and third rows) yielded similar outputs for the three datasets. In contrast, MedianCD and SVCD normalization (fourth and fifth rows) detected much greater variation between conditions in the real dataset and the synthetic dataset with differential gene expression. Conventional Median normalization makes, by design, the median of all samples to be the same, while Quantile normalization makes the full distribution of gene expression of all samples to be the same. Hence, if there were differences in medians or distributions between experimental conditions, both methods would have removed them. Such variation was indeed present in the synthetic dataset with differential gene expression (Fig. 1k,n), and hence we can hypothesize the same for the real dataset (Fig. 1j,m).
Influence of No-Variation Genes on Normalization
To clarify how MedianCD and SVCD normalization preserved the variation between conditions, we studied the influence of the choice of no-variation genes in the final between-condition normalization. To this end, we obtained the between-condition variation as a function of the number of no-variation genes, in two families of cases. In one family, no-variation genes were chosen in decreasing order of -values from the statistical test used to analyze variation between conditions. In the other family, genes were chosen at random. The first option was similar to the approach implemented to obtain the results presented in Fig. 1j–o, with the difference that, there, the no-variation genes were chosen automatically, by a subsequent statistical test performed on the distribution of -values (see Methods).
For the real dataset (Fig. 2a), the random choice of genes resulted in decays ( being the number of chosen genes), followed by a plateau. The decays reflect the error in the estimation of normalization factors. Selecting the genes by decreasing -values, however, yielded a completely different result. Up to a certain number of genes, the variance remained similar, but for larger numbers of genes the variance dropped rapidly. Figure 2a shows, therefore, that between-condition variation was removed as soon as the between-condition normalizations used genes that changed in expression level across experimental conditions. The big circles in Fig. 2a indicate the working points of the normalizations used for the results displayed in Fig. 1j,m. In fact, these points slightly underestimated the variation between conditions. Although the statistical test for identifying no-variation genes ensured that no evidence of variation was found, the expression of some selected genes varied across conditions.
The results obtained for the synthetic dataset with differential gene expression (Fig. 2b) were qualitatively similar to those of the real dataset, but with two important differences. The amount of between-condition variation detected (by selecting no-variation genes by decreasing -values) was smaller than with the real dataset, implying that the real dataset had larger differential gene expression. Additionally, the variation detected in the synthetic dataset had a simpler dependency on the number of genes, an indication that the differential gene expression introduced in the synthetic dataset had a simpler structure than that of the real dataset.
Figure 2c shows the results for the synthetic dataset without differential gene expression. There were no plateaus when no-variation genes were chosen randomly, only decays, and differences were small when no-variation genes were selected by decreasing -values. Big circles show that the working points of Fig. 1l,o were selected with all available genes as no-variation genes, which is the optimum choice when there is no differential gene expression.
Overall, Fig. 2 shows that the between-condition variation displayed in Fig. 1j,k,m,n is not an artifact caused by using an exceedingly small or extremely particular set of genes in the final between-condition normalization, but that this variation originated from the datasets. The positions of the big circles in Fig. 2 highlight the good performance of the statistical approach for choosing no-variation genes in the normalizations carried out for Fig. 1j–o. Besides, the residual variation displayed by the decays implies that, as estimators of the normalization factors, SVCD normalization features smaller error than MedianCD normalization.
Differential Gene Expression
In what follows, we call detected positives the differentially expressed genes (DEGs) resulting from the statistical analyses, treatment positives the DEGs introduced in the synthetic dataset with differential gene expression, true positives the detected positives which were also treatment positives, and false positives the detected positives which were not treatment positives. Corresponding terms for negatives refer to genes which were not DEGs.
Figure 3 shows the numbers of DEGs detected in the real and synthetic datasets, for each of the 42 experimental treatments compared to the corresponding control (Supplementary Table S2), after normalizing with the four methods. For the real dataset (Fig. 3a), the number of DEGs identified after MedianCD and SVCD normalization were much larger for most treatments, in some cases by more than one order of magnitude. For the synthetic dataset with differential gene expression (Fig. 3b), results were qualitatively similar, but with less differential gene expression detected, consistently with Fig. 2a,b. The number of treatment positives can be displayed in this case (empty black down triangles, Fig. 3b), showing a better correlation, with MedianCD and SVCD normalization, between the number of treatment positives and detected positives. For the synthetic dataset without differential gene expression (Fig. 3c), no DEG was found but for one or two DEGs in two conditions. Given that the false discovery rate was controlled to be less than 5% per treatment, this is expected to happen when evaluating 42 treatments.
Figure 3a reports, for the real dataset, statistically significant changes of gene expression, that is, changes that cannot be explained by chance. Equally important is the effect size, i.e. the scale of detected variation in DEGs, which is displayed by Fig. 4. The boxplots show absolute fold changes of expression level for all DEGs detected after applying each normalization method. MedianCD and SVCD normalization allowed to detect smaller changes of gene expression, which were otherwise missed when using Median and Quantile normalization. This differential gene expression detected with MedianCD and SVCD normalization can hardly be considered negligible, given that, for all treatments, the interquartile range of absolute fold changes was above 1.5-fold, and, for more than 28 (67%) treatments, the median absolute fold change was greater than 2-fold. Interestingly, the scale of differential gene expression detected with MedianCD and SVCD normalization in this assay is of similar magnitude to those reported by studies of global mRNA changes using external controls with microarrays and/or RNA-Seq [van de Peppel et al., 2003; Lovén et al., 2012].
Figure 5 displays the balance of differential gene expression, i.e. the comparison between the number of over- and under-expressed genes, for the real dataset. The quantity in the -axes is the mean of an indicator variable , which assigns to each over-expressed DEG and to each under-expressed DEG. Hence, balance of differential gene expression corresponds to , all DEGs over-expressed to , and, for example, 60% DEGs under-expressed to . As discussed below, and as it has been reported before [Irizarry et al., 2006; Calza et al., 2008; Ni et al., 2008; Zhu et al., 2010], the balance of differential gene expression has a strong impact on the performance of normalization methods. Figure 5 shows that, regardless of the normalization method used, the unbalance of differential gene expression detected in the real dataset was substantial for most conditions. Detected unbalances were (in absolute value) larger with MedianCD and SVCD normalization, in both cases with more than 30 (71%) treatments having , that is, more than 75% of over- or under-expressed genes. Moreover, the differences between the unbalances detected with Median and Quantile normalization, on one hand, and MedianCD and SVCD normalization, on the other, were specially notorious for the treatments with more DEGs (treatments 26–42, Fig. 3a). In those cases, Median and Quantile normalization resulted in the smallest detected unbalances, whereas MedianCD and SVCD normalization yielded the largest ones, with values near for all but two treatments.
True differential expression was known, by construction, for the synthetic dataset with differential gene expression. Thus, Fig. 6 shows for this dataset the true positive rate (ratio between true positives and treatment positives, also known as statistical power or sensitivity) and the false discovery rate (FDR, ratio between false positives and detected positives). With conditions 1 to 20, which correspond to those conditions with less than approximately 10% of treatment positives (Fig. 3b, empty black down triangles), the true positive rate was similarly low for all normalizations. Regarding the FDR, when the (total) number of detected positives was up to a few tens, variability of the FDR around the target bound at 0.05 is to be expected, given that the bound is defined over an average of repetitions of the multiple-hypothesis test. Yet, the FDR obtained after Median and Quantile normalization was higher than the 0.05 bound for most conditions. More striking, however, was the behavior for conditions 21 to 42 (more than 10% of treatment positives). The true positive rates obtained after Median and Quantile normalization were much lower than those obtained with MedianCD and SVCD normalization, while the FDR after Median and Quantile normalization was clearly over the bound at 0.05. In comparison, MedianCD and SVCD normalization, besides offering better sensitivity of differential gene expression, maintained the FDR consistently below the desired bound.
Figure 7 further explores these results, by representing the true positive rate and false discovery rate (FDR) as a function of the unbalance between over- and under-expressed genes. Figure 7 shows that the unbalance of differential gene expression was a key factor in the results obtained with Median and Quantile normalization. When most DEGs were over- or under-expressed, both the true positive rate and FDR degraded markedly after using Median or Quantile normalization. In contrast, the true positive rate and FDR were not affected by the unbalance of differential gene expression when using MedianCD or SVCD normalization.
Concerning the identification of no-variation genes, both MedianCD and SVCD normalization performed well. In the synthetic dataset without differential gene expression, both methods identified all genes as no-variation genes, which is the best possible result. In the synthetic dataset with differential gene expression, 1,834 genes (10% of a total of 18,339 genes) were, by construction, negatives across all treatments. MedianCD and SVCD normalization detected, respectively, 1,723 and 1,827 no-variation genes, among which 96.9% and 95.2% were true negatives.
Analysis of the Golden Spike and Platinum Spike Datasets
To provide additional evidence of the performance of MedianCD and SVCD normalization, we analyzed the Golden Spike [Choe et al., 2005] and Platinum Spike [Zhu et al., 2010] datasets. Both of them are artificial real datasets, the largest ones for which true DEGs are known. Hence, they have been widely used to benchmark normalization methods [Choe et al., 2005; Schuster et al., 2007; Pearson, 2008; Calza et al., 2008; Ni et al., 2008; Zhu et al., 2010; Cheng et al., 2016].
The design of the Golden Spike dataset was questioned for reasons concerning, among others, the anomalous null distribution of -values, the lack of biological replicates, and the high concentration of spike-ins [Dabney and Storey, 2006; Irizarry et al., 2006; Gaile and Miecznikowski, 2007]. Nevertheless, this dataset is worth considering here because it challenges what we claim are key capabilities of our approach, that is, to correctly normalize gene expression data when many genes are differentially expressed, even with large unbalance between over- and under-expression. This dataset consists of microarray data obtained with the Affymetrix GeneChip DrosGenome1, with two experimental conditions and three technical replicates per condition. Excluding Affymetrix internal control probes, the dataset contains a total of 13,966 gene probe sets, of which 3,876 were spiked-in, which we call known in what follows. Among these, 1,328 (34.3%) were over-expressed (known positives) to varying degrees between 1.1- and 4-fold, while the remaining 2,535 (65.4%) were spiked-in at the same concentration in both conditions (known negatives). (Percentages do not add up to 100% because of a very small number of probe sets with weak matching to multiple clones [Choe et al., 2005].)
In addition to the normalization methods used above, we included Cyclic Loess normalization [Yang et al., 2002; Ballman et al., 2004] in this case, because it facilitates a better comparison of results with previous studies [Choe et al., 2005; Schuster et al., 2007; Pearson, 2008; Calza et al., 2008; Ni et al., 2008; Zhu et al., 2010; Cheng et al., 2016]. Figure 8 summarizes the results obtained for the Golden Spike dataset, by displaying Receiver Operating Characteristic (ROC) curves for the detection of differential gene expression. The upper panel shows the true positive rate (as before, ratio between true positives and treatment positives) versus the false positive rate (ratio between false positives and treatment negatives), while the lower panel shows the number of true positives versus the number of false positives. In both cases, detected and treatment positives/negatives were restricted to known genes, following previous studies [Gaile and Miecznikowski, 2007; Schuster et al., 2007; Pearson, 2008]. Doing otherwise would have given an excessively dominant role to the issue of cross-hybridization in the analysis of differential gene expression [Schuster et al., 2007]. Additionally, the analysis was performed using only probe sets with hybridization signal in all samples, with the aim of factoring out differences between normalization methods caused by the response to missing data. Results obtained without this restriction (Supplementary Fig. S3) or with -tests instead of limma [Ritchie et al., 2015] analysis (Supplementary Fig. S4) were very similar to those of Fig. 8.
The comparison of ROC curves shown in Fig. 8 highlight the superior performance of MedianCD and, in particular, SVCD normalization. Dashed lines show results when the list of known negatives was given as an input to some of the normalization methods (something than cannot be done in real assays). It is remarkable that SVCD normalization featured equally well with or without this information.
Points in Fig. 8 indicate the results when controlling the false discovery rate (FDR) to be below 0.01 (left point on each curve) or 0.05 (right point). Figure 8b shows reference lines for actual FDR equal to 0.01, 0.05, 0.1, 0.2 and 0.5 (from left to right). In all cases, the FDR was not adequately controlled, although the difference between intended and actual FDR was notably smaller with MedianCD and SVCD normalization. Lack of control of the FDR in the analysis of this dataset has been previously reported [Choe et al., 2005; Pearson, 2008]. It is caused by the non-uniform (hence anomalous) distribution of -values for negative genes, which results from the analysis of differential gene expression [Dabney and Storey, 2006; Gaile and Miecznikowski, 2007; Fodor et al., 2007; Pearson, 2008]. It has been argued that this anomalous distribution of -values is, in turn, a consequence of the own experimental design of the dataset, in particular the lack of biological replication and the way clone aliquots were mixed to produce each gene group with a given fold change [Dabney and Storey, 2006]. Later studies have attributed this issue mostly to non-linear or intensity-dependent effects, not properly corrected in the within-sample normalization step (e.g. background correction) of the analysis pipeline [Gaile and Miecznikowski, 2007; Fodor et al., 2007; Pearson, 2008; Zhu et al., 2010].
Concerning the identification of no-variation genes, both MedianCD and SVCD normalization worked correctly. MedianCD normalization identified 561 no-variation genes, of which 93.9% were known, and among which 84.1% were known negatives. SVCD normalization, in comparison, featured better detection, with 1,224 no-variation genes identified, of which 94.4% were known, and among which 90.0% were known negatives.
The design of the Platinum Spike dataset [Zhu et al., 2010] took into account the concerns raised by the Golden Spike dataset, offering a dataset with two experimental conditions and nine (three biological three technical) replicates per condition, and including near 50% more spike-ins. Besides, differential gene expression was balanced, with respect to both total mRNA amount and extent of over- and under-expression. Gene expression data was obtained with Affymetrix Drosophila Genome 2.0 microarrays. Excluding Affymetrix internal control probes, the dataset contained a total of 18,769 probe sets, of which 5,587 were spiked-in, called known as above. Among these, 1,940 (34.7%) were differentially expressed (known positives) to varying degrees between 1.2- and 4-fold (1,057 over-expressed, 883 under-expressed), while the remaining 3,406 (61.0%) were spiked-in at the same concentration in both conditions (known negatives).
Figure 9 shows ROC curves for the Platinum Spike dataset. As above, only known genes were considered for detected and treatment positives/negatives. Additionally, gene probes were restricted to those with signal in all samples. Results obtained without this restriction (Supplementary Fig. S5) or with -tests instead of limma analysis (Supplementary Fig. S6) were again very similar. In contrast to the Golden Spike dataset (Fig. 8), the performance concerning true and false positives resulting from the different normalization methods was much more comparable. In this case, MedianCD and SVCD normalization were only marginally better. Note, however, that the FDR was again not properly controlled (Fig. 9b). Similarly to the Golden Spike dataset, and despite biological replication and a different experimental setup, obtained distributions of -values for negative genes have been reported to be non-uniform [Zhu et al., 2010]. This fact is consistent with previous arguments relating the lack of control of the FDR to a general problem concerning the correction of non-linearities in the preprocessing of microarray data [Gaile and Miecznikowski, 2007; Fodor et al., 2007; Pearson, 2008; Zhu et al., 2010].
Regarding the identification of no-variation genes, MedianCD and SVCD normalization also worked correctly with this dataset. MedianCD normalization identified 2,090 no-variation genes, of which 95.4% were known, and among which 98.7% were known negatives. SVCD normalization featured slightly better, with 2,232 no-variation genes identified, of which 95.3% were known, and among which 98.3% were known negatives.
The lack-of-variation assumption underlying current methods of normalization was self-fulfilling, removing variation in gene expression that was actually present. Moreover, it had negative consequences for downstream analyses, as it removed potentially important biological information and introduced errors in the detection of gene expression. The resulting decrease in statistical power or sensitivity is a handicap, which can be addressed by increasing the number of samples per experimental condition. However, degradation of the (already weak) control of the false discovery rate when using Median or Quantile normalization is a major issue for real-world applications.
The removal of variation can be understood as additive errors in the estimation of normalization factors. Considering data and errors vectorially (see Methods), the length of each vector equals, after centering and up to a constant factor, the standard deviation of the data or error. Errors of small magnitude, compared to the data variance, would only have minor effects. However, errors of similar or greater magnitude than the data variance may, depending on the vector lengths and the angle between the vectors, severely distort the observed data variance. This will, in turn, cause spurious results in the statistical analyses. Furthermore, the angles between the data and the correct normalization factors (considered as vectors) are random, given that expression data reflect biological variation while normalization factors respond to technical variation. If the assay is repeated, even with exactly the same experimental setup, the errors in the normalization factors will vary randomly, causing random spurious results in the downstream analyses. This explains, at least partially, the lack of reproducibility found in transcriptomics studies, especially for the detection of changes in gene expression of small-to-medium magnitude (up to 2-fold), because variation of this size is more likely to be distorted by errors in the estimation of normalization factors. Accordingly, the largest differences in numbers of differentially expressed genes detected by Median and Quantile normalization, compared to MedianCD and SVCD normalization, occurred in the treatments with the smallest magnitudes of gene expression changes (Figs. 3a, 4).
The variation between medians displayed in Fig. 1j,k,m,n may seem surprising, given routine expectations based on current methods (Fig. 1d,e,g,h). Nevertheless, this variation inevitably results from the unbalance between over- and under-expressed genes. As an illustration of this issue, let us consider a case with two experimental conditions, in which the average expression of a given gene is less than the distribution median under one condition, but greater than the median under the other. The variation of this gene alone will change the value of the median to the expression level of the next ranked gene. Therefore, if the number of over-expressed genes is different from the number of under-expressed genes, and enough changes cross the median boundary, then the median will substantially differ between conditions. Only when differential expression is negligible or is balanced with the respect to the median, will the median stay the same. Note that this is a related but different requirement from the number of over- and under-expressed genes being the same. This argument applies equally to any other quantile in the distribution of gene expression. The case of Quantile normalization is the least favorable, because it requires that changes of gene expression are balanced with respect to all distribution quantiles.
Compared with other normalization approaches that try to identify no-variation genes from expression data, such as Cross-Correlation [Chua et al., 2006], LVS [Calza et al., 2008], or NVAS [Ni et al., 2008] normalization, our proposal is able to work correctly with higher degrees of variation in gene expression, given that those methods are not expected to work correctly when more than 50–60% of genes vary. The reason for this difference in performance lies in that those methods use a binning strategy over the average expression between conditions (Cross-Correlation, NVAS), or need to assume an a priori fraction (usually 40-60%) of non-differentially expressed genes (LVS). When the majority of genes are differentially expressed, very few of those bins may be suitable for normalization, or the assumed fraction of non-differentially expressed genes may not hold. In contrast, our approach makes one single search in a space of -values, and without assuming any fraction of non-differentially expressed genes. As long as there are a sufficient number of non-differentially expressed genes, of the order of several hundreds, normalization is possible, including cases with global mRNA changes or transcriptional amplification [van de Peppel et al., 2003; Hannah et al., 2005, 2008; Lovén et al., 2012]. In general, it is a matter of comparison between the magnitude of the error in the estimation of normalization factors and the amount of biological variation. The estimation error decreases with the number of no-variation genes detected (Fig. 2), and whenever normalization error is well below biological variation, normalization between samples will be correct and beneficial for downstream analyses.
Our approach to normalization is based in four key ideas: first, decomposing the normalization by experimental conditions and normalizing separately each condition before normalizing the condition means; second, using the novel Standard-Vector normalization (or alternatively median scaling) to perform each normalization; third, identifying no-variation genes from the distribution of -values resulting from a statistical test of variation between conditions; and fourth, employing only no-variation genes for the final between-condition normalization. These four ideas are grounded on rigorous mathematical statistics (see Methods and Supplementary Information). It is also worth noting that both Median and Standard-Vector normalization, as methods for each normalization step, are distribution-free methods; they do not assume Gaussianity or any other kind of probability distribution for the expression levels of genes. MedianCD and SVCD normalization are freely available in the R package cdnormbio, installable from GitHub (see Code Availability).
Previous assumptions that gene variation is rather limited could suggest that there is no need for more comprehensive normalization methods such as our proposal. In line with this, it could be argued that the amount of variation in our real (E. crypticus) dataset is exceptional and much larger than the variation likely to be occur in most experiments. We think that this an invalid belief. Most of the available evidence concerning widespread variation in gene expression is inadequate, because it involves circular reasoning. We have shown here that current normalization methods, used by almost all studies to date, assume no variation in gene expression between experimental conditions, and they remove it if it exists, unless it is balanced. Therefore, these methods cannot be used to discern the extent and balance of global variation in gene expression. Only methods that are able to normalize correctly, whatever these extent and balance are, can be trusted for this task. The fact that our methods perform well with large and unbalanced differential gene expression does not imply that they perform poorly when differential gene expression is more moderate or balanced. Our results show that this is not case. In the design of our methods, no compromise was made to achieve good performance with high variation in exchange for not so good performance with low variation. The downside of our approach lies elsewhere, in a greater algorithmic complexity and a greater demand of computing resources. Yet, we consider this a minor demand, given the capabilities of today’s computers and the resources required by current high-throughput assays.
Our results have being obtained from microarray data, but similar effects are expected to be found in RNA-Seq assays. Current normalization procedures for RNA-Seq, such as RPKM [Mortazavi et al., 2008], TMM [Robinson and Oshlack, 2010], or DESeq [Anders and Huber, 2010], perform between-sample normalization based on some form of global scaling and under the assumption that most genes are not differentially expressed. This makes RPKM, TMM, and DESeq normalization, in what concerns between-sample normalization and the removal or distortion of variation discussed here, similar to conventional Median normalization. An example of this issue, including results from microarray and RNA-Seq assays, has been reported in a study of the transcriptional amplification mediated by the oncogene c-Myc [Lovén et al., 2012].
Importantly, MedianCD and SVCD normalization were designed with no dependencies on any particular aspect of the technology used to globally measure gene expression, i.e. microarrays or RNA-Seq. The numbers in the input data are interpreted as steady state concentrations of mRNA molecules, in order to identify the normalization factors, and irrespectively of whether the concentrations were obtained from fluorescence intensities of hybridized cDNA (microarrays) or from counts of fragments of cDNA (RNA-Seq). Both technologies require between-sample normalization, because in some step of the assay the total mRNA or cDNA mass in each sample must be equalized within a given range required by the experimental platform, This equalization of total mass, together with other sources of variation in the total efficiency of the assay, amounts to a factor multiplying the concentration of each mRNA species. This factor is different for each sample, and it is what between-sample normalization aims to detect and correct for. Moreover, the total mRNA mass in each sample is, in many cases, mostly determined by a few highly expressed genes, rather than an unbiased average over the total mRNA population. This makes between-sample normalization critical regarding comparisons of gene expression between different experimental conditions, as our results illustrate. It is also important to highlight that this between-sample uncertainty in the measurement of mRNA concentrations is different from other issues, such as for example non-linearities. These other problems are usually more specific to each technology, and they are the scope of within-sample normalization (e.g. background correction for microarrays and gene-length normalization for RNA-Seq), which are obviously also necessary and should be applied before between-sample normalization. Similarly, methods that address the influence of biological or technical confounding factors on downstream analyses, such as SVA [Leek and Storey, 2007] or PEER [Stegle et al., 2010], should be applied when necessary, after normalizing.
Finally, the significance of widespread variation in gene expression merits consideration from the viewpoint of molecular and cell biology. Established understanding about the regulation of gene expression considers it as a set of processes that generally switch on or off the expression of genes, performed mostly at transcription initiation, by the combinatorial regulation of a large number of transcription factors, and with an emphasis on gene expression programs associated with cell differentiation and development. Recent studies, however, have expanded this understanding, offering a more complex perspective on the regulation of gene expression, by identifying other rate-limiting regulation points between transcription initiation and protein translation, such as transcription elongation and termination, as well as mRNA processing, transport and degradation. Promoter-proximal pausing of RNA polymerase II (in eukaryotes) [Core et al., 2008; Adelman and Lis, 2012] and transcript elongation [Jonkers and Lis, 2015], in particular, have received a great deal of attention recently, in connection with gene products involved in signal transduction pathways. These mechanisms, which seem to be highly conserved among metazoans, would allow cells to tune the expression of activated genes in response to signals concerning, for example, homeostasis, environmental stress or immune response. As an illustration, studies about the transcription amplification mediated by the oncogene c-Myc have uncovered that it regulates the promoter-proximal pausing of RNA polymerase II, affecting a large number of genes already activated by other regulatory mechanisms [Lin et al., 2012; Nie et al., 2012; Littlewood et al., 2012]. Our results for the toxicity experiment with E. crypticus are consistent with regulatory capabilities for broad fine-tuning of gene expression levels, far beyond what conventional methods of normalization would allow to detect. This contrast underlines that normalization methods that truly preserve variation between experimental conditions are necessary for high-throughput assays exploring genome-wide regulation of gene expression, as required by current research in molecular and cell biology.
In summary, this study proves that large numbers of genes can change in expression level across experimental conditions, and too extensively to ignore in the normalization of gene expression data. Current normalization methods for gene expression microarrays and RNA-Seq, because of a lack-of-variation assumption, likely remove and distort variation in gene expression. The normalization methods proposed here solve this problem, offering a means to investigate broad changes in gene expression that have remained hidden to date. We expect this to provide revealing insights about diverse biomolecular processes, particularly those involving substantial numbers of genes, and to assist efforts to realize the full potential of gene expression profiling.
Test Organism and Exposure Media
The test species was Enchytraeus crypticus. Individuals were cultured in Petri dishes containing agar medium, in controlled conditions [Gomes et al., 2015b].
For copper () exposure, a natural soil collected at Hygum, Jutland, Denmark was used [Gomes et al., 2015b; Scott-Fordsmand et al., 2000]. For silver () and nickel () exposure, the natural standard soil LUFA 2.2 (LUFA Speyer, Germany) was used [Gomes et al., 2015b]. The exposure to ultra-violet (UV) radiation was done in ISO reconstituted water [OECD, 2004a].
The tested Ag forms [Gomes et al., 2015b] included silver nitratre (, Sigma Aldrich), non-coated nanoparticles (-NPs Non-Coated, 20–30 nm, American Elements), Polyvinylpyrrolidone (PVP)-coated nanoparticles (-NPs PVP-Coated, 20–30 nm, American Elements), and NM300K nanoparticles ( NM300K, 15 nm, JRC Repository). The NM300K was dispersed in 4% Polyoxyethylene Glycerol Triolaete and Polyoxyethylene (20) orbitan mono-Laurat (Tween 20), thus the dispersant was tested alone as control (CTdisp).
The tested forms included nickel nitrate (, Fluka) and Ni nanoparticles (-NPs, 20 nm, American Elements).
Spiking for the and materials was done as previously described [Gomes et al., 2015b]. For the materials, the -NPs were added to the soil as powder, following the same procedure as for the materials. , being soluble, was added to the pre-moistened soil as aqueous dispersions.
The concentrations tested were selected based on the reproduction effect concentrations and , for E. crypticus, within 95% of confidence intervals, being: = 290/360 mg/kg, -NPs = 980/1760 mg/kg, -Nwires = 850/1610 mg/kg, -Field = 500/1400 mg/kg, = 45/60 mg/kg, -NP PVP-coated = 380/550 mg/kg, -NP Non-coated = 380/430 mg/kg, NM300K = 60/170 mg/kg, CTdisp = 4% w/w Tween 20, = 40/60 mg/kg, -NPs = 980/1760 mg/kg.
Four biological replicates were performed per test condition, including controls. For exposure, the control condition for all the treatments consisted of soil from a control area at Hygum site, which has a background concentration of 15 mg/kg [Scott-Fordsmand et al., 2000]. For exposure, two control sets were performed: CT (un-spiked LUFA soil, to be the control condition for , -NPs PVP-Coated and -NPs Non-Coated treatments) and CTdisp (LUFA soil spiked with the dispersant Tween 20, to be the control condition for the NM300K treatments). For exposure, the control consisted of un-spiked LUFA soil.
In soil (i.e. for , and ) exposure followed the standard ERT [OECD, 2004b] with adaptations as follows: twenty adults with well-developed clitellum were introduced in each test vessel, containing 20 g of moist soil (control or spiked). The organisms were exposed for three and seven days under controlled conditions of photoperiod (16:8 h light:dark) and temperature without food. After the exposure period, the organisms were carefully removed from the soil, rinsed in deionized water and frozen in liquid nitrogen. The samples were stored at , until analysis.
For UV exposure, the test conditions [OECD, 2004a] were adapted for E. crypticus [Gomes et al., 2015a]. The exposure was performed in 24-well plates, where each well corresponded to a replicate and contained 1 ml of ISO water and five adult organisms with clitellum. The test duration was five days, at . The organisms were exposed to UV on a daily basis, during 15 minutes per day to two UV intensities (280–400nm) of and , corresponding to total UV doses of 7511.6 and 8118.35 , respectively. The remaining time was spent under standard laboratory illumination (16:8 h photoperiod). UV radiation was provided by an UV lamp (Spectroline XX15F/B, Spectronics Corporation, NY, USA, peak emission at 312 nm) and a cellulose acetate sheet was coupled to the lamp to cut-off UVC-range wavelengths [Gomes et al., 2015a]. Thirty two replicates per test condition (including control without UV radiation) were performed to obtain 4 biological replicates for RNA extraction, each one with 40 organisms. After the exposure period, the organisms were carefully removed from the water and frozen in liquid nitrogen. The samples were stored at , until analysis.
RNA Extraction, Labeling and Hybridization
RNA was extracted from each replicate, which contained a pool of 20 and 40 organisms, for soil and water exposure, respectively. Three biological replicates per test treatment (including controls) were used. Total RNA was extracted using SV Total RNA Isolation System (Promega). The quantity and purity were measured spectrophotometrically with a nanodrop (NanoDrop ND-1000 Spectrophotometer) and its quality checked by denaturing formaldehyde agarose gel electrophoresis.
500 ng of total RNA were amplified and labeled with Agilent Low Input Quick Amp Labeling Kit (Agilent Technologies, Palo Alto, CA, USA). Positive controls were added with the Agilent one-color RNA Spike-In Kit. Purification of the amplified and labeled cRNA was performed with RNeasy columns (Qiagen, Valencia, CA, USA).
The cRNA samples were hybridized on custom Gene Expression Agilent Microarrays (4 x 44k format), with a single-color design [Castro-Ferreira et al., 2014]. Hybridizations were performed using the Agilent Gene Expression Hybridization Kit and each biological replicate was individually hybridized on one array. The arrays were hybridized at with a rotation of 10 rpm, during 17 h. Afterwards, microarrays were washed using Agilent Gene Expression Wash Buffer Kit and scanned with the Agilent DNA microarray scanner G2505B.
Fluorescence intensity data was obtained with Agilent Feature Extraction Software v. 10.7.3.1, using recommended protocol GE1_107_Sep09. Quality control was done by inspecting the reports on the Agilent Spike-in control probes.
Analyses were performed with R [R Core Team, 2016] v. 3.3.1, using R packages plotrix [Lemon, 2006] v. 3.6.3 and RColorBrewer [Neuwirth, 2014] v. 1.1.2, and with Bioconductor [Huber et al., 2015] v. 3.3 packages affy [Gautier et al., 2004] v. 1.50.0, drosgenome1.db v .3.2.3, drosophila2.db v. 3.2.3, genefilter v. 1.54.2, and limma [Ritchie et al., 2015] v. 3.28.20. Background correction was carried out by Agilent Feature Extraction software for the real (E. crypticus) dataset, while the Affymetrix MAS5 algorithm, as implemented in the limma package, was used for the Golden and Platinum Spike datasets.
To ensure an optimal comparison between the different normalization methods, only gene probes with good signal quality (flag IsPosAndSignif = True) in all samples were employed for the analysis of the E. crypticus dataset. This implied the selection of 18,339 gene probes from a total of 43,750. For the Golden and Platinum Spike datasets, data were considered as missing when probe sets were not called present by the MAS5 algorithm.
The synthetic dataset without differential gene expression was generated gene by gene as normal variates with mean and variance equal, respectively, to the sample mean and sample variance of the expression levels for each gene, as detected from the real E. crypticus dataset after SVCD normalization. The synthetic dataset with differential gene expression was generated equally, except for the introduction of differences in expression averages between treatments and controls. The magnitude of the difference in averages was equal, for each differentially expressed gene (DEG), to twice the sample variance. The percentage of DEGs for each treatment was chosen randomly, in logarithmic scale, from a range between 0.9% and 90%, while ensuring that 10% of genes were not differentially expressed across the entire dataset. One third of the treatments were mostly over-expressed (for each treatment independently, the probability of a DEG being over-expressed was ), one third of the treatments were mostly under-expressed (), and the remaining third had mostly balanced differential gene expression (). For both synthetic datasets, the applied normalization factors were those detected by SVCD normalization from the real E. crypticus dataset.
Median normalization was performed, for each sample, by subtracting the median of the distribution of expression levels, and then adding the overall median to preserve the global expression level. Quantile normalization was performed as implemented in the limma package.
The two condition-decomposition normalizations, MedianCD and SVCD, proceeded in the same way: first, independent within-condition normalization for each experimental condition, using all genes. Then, one between-condition normalization, iteratively identifying no-variation genes and normalizing until convergence of the set of no-variation genes. And finally, another between-condition normalization using only the no-variation genes detected, to calculate the between-condition normalization factors.
The criterion for convergence of MedianCD normalization was to require that the relative changes in the standard deviation of the normalization factors were less than 0.1%, or less than 10% for 10 steps in a row. In the case of SVCD normalization, convergence required that numerical errors were, compared to estimated statistical errors (see below), less than 1%, or less than 10% for 10 steps in a row. Convergence of the set of no-variation genes was achieved by intersection of the sets found during 10 additional steps under convergence conditions. These default convergence parameters were used for all the MedianCD and SVCD normalizations reported, with the exception of MedianCD with the Golden Spike dataset, which used 30% (instead of 10%) of relative change for 10 steps in a row, to reach convergence.
In SVCD normalization, the distribution of standard vectors was trimmed in each step to remove the 1% more extreme values of variance.
Differentially expressed genes were identified with limma analysis or t-tests, controlling the false discovery rate to be below 5%, independently for each comparison of treatment versus control.
The reference distributions with permutation symmetry shown in the polar plots of Supplementary Movies S1–S3 were calculated through the six possible permutations of the empirical standard vectors. The Watson statistic was calculated with the two-sample test [Durbin, 1973], comparing with an equal number of samples obtained by sampling with replacement the permuted standard vectors.
Condition Decomposition of the Normalization Problem
In a gene expression dataset with genes, experimental conditions and samples per condition, the observed expression levels of gene in condition , , can be expressed in -scale as
where is the vector of true gene expression levels and is the vector of normalization factors.
Given a sample vector , the mean vector is , and the residual vector is . Then, (1) can be linearly decomposed into
For full details about this condition-decomposition approach, see Supplementary Mathematical Methods in the Supplementary Information.
The samples of gene in a given condition can be modeled with the random vectors . Again, , where is a fixed vector of normalization factors. It can be proved under fairly general assumptions (see Supplementary Information), that the true standard vectors have zero expected value
whereas the observed standard vectors verify, as long as ,
At convergence, , which implies and . Convergence is faster the more symmetric the empirical distribution of is on the unit -sphere. Convergence is optimal with spherically symmetric distributions, such as the Gaussian distribution, because in that case
Assuming no dependencies between genes, an approximation of the statistical error at step can be obtained with
This statistical error was compared with the numerical error to assess convergence.
See Supplementary Mathematical Methods in the Supplementary Information for full details about this algorithm. See also Supplementary Movies S1–S3 for normalization examples.
Identification of No-Variation Genes
No-variation genes were identified with one-sided Kolmogorov-Smirnov tests, as goodness-of-fit tests against the uniform distribution, carried out on a distribution of -values. These -values were obtained from ANOVA tests on the expression levels of genes, grouped by experimental condition. The KS test was rejected at .
See Supplementary Mathematical Methods in the Supplementary Information for more details about this approach to identify no-variation genes. See also Supplementary Movies S4–S6 for examples of use.
- Adelman and Lis  K. Adelman and J. T. Lis. Promoter-proximal pausing of RNA polymerase II: emerging roles in metazoans. Nat. Rev. Genet., 13(10):720–31, 2012.
- Anders and Huber  S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome Biol., 11(10):R106, 2010.
- Ballman et al.  K. V. Ballman, D. E. Grill, A. L. Oberg, and T. M. Therneau. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics, 20(16):2778–2786, 2004.
- Bolstad et al.  B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19:185–193, 2003.
- Boutros  P. C. Boutros. The path to routine use of genomic biomarkers in the cancer clinic. Genome Res., 25(10):1508–13, 2015.
- Brettingham-Moore et al.  K. H. Brettingham-Moore, C. P. Duong, A. G. Heriot, R. J. S. Thomas, and W. A. Phillips. Using gene expression profiling to predict response and prognosis in gastrointestinal cancers-the promise and the perils. Ann of Surg Oncol, 18:1484–1491, 2011.
- Bullard et al.  J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit. Evaluation of statistical methods for normalization and differential expression in mrna-seq experiments. BMC Bioinformatics, 11:94, 2010.
- Calza et al.  S. Calza, D. Valentini, and Y. Pawitan. Normalization of oligonucleotide arrays based on the least-variant set of genes. BMC Bioinformatics, 9:140, 2008.
- Castro-Ferreira et al.  M. P. Castro-Ferreira, T. E. de Boer, J. K. Colbourne, R. Vooijs, C. A. M. van Gestel, N. M. van Straalen, A. M. V. M. Soares, M. J. B. Amorim, and D. Roelofs. Transcriptome assembly and microarray construction for enchytraeus crypticus, a model oligochaete to assess stress response mechanisms derived from soil conditions. BMC Genomics, 15:302, 2014.
- Chang et al.  Y. Chang, M. L. Lye, and H. C. Zeng. Large-scale synthesis of high-quality ultralong copper nanowires. Langmuir, 21:3746–3748, 2005.
- Cheng et al.  L. Cheng, L.-Y. Lo, N. L. S. Tang, D. Wang, and K.-S. Leung. CrossNorm: a novel normalization strategy for microarray data in cancers. Sci. Rep., 6:18898, 2016.
- Chi et al.  J.-T. Chi, H. Y. Chang, G. Haraldsen, F. L. Jahnsen, O. G. Troyanskaya, D. S. Chang, Z. Wang, S. G. Rockson, M. van de Rijn, D. Botstein, and et al. Endothelial cell diversity revealed by global expression profiling. Proc Natl Acad Sci USA, 100:10623–10628, 2003.
- Choe et al.  S. E. Choe, M. Boutros, A. M. Michelson, G. M. Church, and M. S. Halfon. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol., 6(2):R16, 2005.
- Chua et al.  S.-W. Chua, P. Vijayakumar, P. M. Nissom, C.-Y. Yam, V. V. T. Wong, and H. Yang. A novel normalization method for effective removal of systematic variation in microarray data. Nucleic Acids Res., 34(5):e38, 2006.
- Conesa et al.  A. Conesa, P. Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson, M. W. Szcześniak, D. J. Gaffney, L. L. L. Elo, X. Zhang, and A. Mortazavi. A survey of best practices for RNA-seq data analysis. Genome Biol., 17:13, 2016.
- Core et al.  L. J. Core, J. J. Waterfall, and J. T. Lis. Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science, 322(5909):1845–8, 2008.
- Couzin  J. Couzin. Genomics. microarray data reproduced, but some concerns remain. Science, 313:1559, 2006.
- Dabney and Storey  A. R. Dabney and J. D. Storey. A reanalysis of a published Affymetrix GeneChip control dataset. Genome Biol., 7(3):401, 2006.
- Dillies et al.  M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, and et al. A comprehensive evaluation of normalization methods for Illumina high-throughput rna sequencing data analysis. Brief Bioinform, 14:671–683, 2013.
- Draghici et al.  S. Draghici, P. Khatri, A. C. Eklund, and Z. Szallasi. Reliability and reproducibility issues in dna microarray measurements. Trends Genet, 22:101–109, 2006.
- Duggan et al.  D. J. Duggan, M. Bittner, Y. Chen, P. Meltzer, and J. M. Trent. Expression profiling using cdna microarrays. Nat Genet, 21:10–14, 1999.
- Durbin  J. Durbin. Distribution Theory for Tests Based on the Sample Distribution Function. Society for Industrial and Applied Mathematics, Philadelphia, 1973.
- Eaton  M. L. Eaton. Multivariate Statistics: A Vector Space Approach. Institute of Mathematical Statistics, Beachwood, Ohio, 2007.
- Fang et al.  K. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman and Hall, New York, 1990.
- Feller  W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. Wiley, New York, 2 edition, 1971.
- Fodor et al.  A. A. Fodor, T. L. Tickle, and C. Richardson. Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biol., 8(5):R69, 2007.
- Frantz  S. Frantz. An array of problems. Nat Rev Drug Discov, 4:362–363, 2005.
- Gagnon-Bartsch and Speed  J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13:539–52, 2012.
- Gaile and Miecznikowski  D. P. Gaile and J. C. Miecznikowski. Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent. BMC Genomics, 8:105, 2007.
- Garber et al.  M. Garber, M. G. Grabherr, M. Guttman, and C. Trapnell. Computational methods for transcriptome annotation and quantification using rna-seq. Nat Methods, 8:469–477, 2011.
- Gautier et al.  L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307–315, 2004.
- Golub et al.  T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, and et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
- Gomes et al. [2015a] S. I. L. Gomes, G. Caputo, N. Pinna, J. J. Scott-Fordsmand, and M. J. B. Amorim. Effect of 10 different and (nano)materials on the soil invertebrate Enchytraeus crypticus. Environ Toxicol Chem, 34:2409–2416, 2015a.
- Gomes et al. [2015b] S. I. L. Gomes, J. J. Scott-Fordsmand, and M. J. B. Amorim. Cellular energy allocation to assess the impact of nanomaterials on soil invertebrates (enchytraeids): The effect of Cu and Ag. Int J Environ Res Public Health, 12:6858–6878, 2015b.
- Gupta et al.  A. K. Gupta, T. Varga, and T. Bodnar. Elliptically Contoured Models in Statistics and Portfolio Theory. Springer, New York, 2013.
- Hannah et al.  M. A. Hannah, A. G. Heyer, and D. K. Hincha. A global survey of gene regulation during cold acclimation in Arabidopsis thaliana. PLoS Genet., 1(2):e26, 2005.
- Hannah et al.  M. A. Hannah, H. Redestig, A. Leisse, and L. Willmitzer. Global mRNA changes in microarray experiments. Nat. Biotechnol., 26(7):741–742, 2008.
- Hicks and Irizarry  S. C. Hicks and R. A. Irizarry. quantro: a data-driven approach to guide the choice of an appropriate normalization method. Genome Biol, 16:117, 2015.
- Huber et al.  W. Huber, V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, S. Davis, L. Gatto, T. Girke, and et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods, 12:115–121, 2015.
- Irizarry et al.  R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4:249–264, 2003.
- Irizarry et al.  R. A. Irizarry, L. M. Cope, and Z. Wu. Feature-level exploration of a published Affymetrix GeneChip control dataset. Genome Biol., 7(8):404, 2006.
- Ivanova et al.  N. B. Ivanova, J. T. Dimos, C. Schaniel, J. A. Hackney, K. A. Moore, and I. R. Lemischka. A stem cell molecular signature. Science, 298:601–604, 2002.
- Jonkers and Lis  I. Jonkers and J. T. Lis. Getting up to speed with transcription elongation by RNA polymerase II. Nat. Rev. Mol. Cell Biol., 16(3):167–177, 2015.
- Kallenberg  O. Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer, New York, 2005.
- Leek and Storey  J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet, 3:1724–35, 2007.
- Leek et al.  J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet, 11:733–739, 2010.
- Lemon  J. Lemon. Plotrix: a package in the red light district of r. R-News, 6(4):8–12, 2006.
- Li et al.  S. Li, P. P. Labaj, P. Zumbo, P. Sykacek, W. Shi, L. Shi, J. Phan, P.-Y. Wu, M. Wang, C. Wang, D. Thierry-Mieg, J. Thierry-Mieg, D. P. Kreil, and C. E. Mason. Detecting and correcting systematic variation in large-scale rna sequencing data. Nat Biotechnol, 32:888–895, 2014.
- Lin et al.  C. Y. Lin, J. Lovén, P. B. Rahl, R. M. Paranal, C. B. Burge, J. E. Bradner, T. I. Lee, and R. A. Young. Transcriptional amplification in tumor cells with elevated c-Myc. Cell, 1511215(1):56–67, 2012.
- Lin et al.  Y. Lin, K. Golovnina, Z.-X. Chen, H. N. Lee, Y. L. S. Negron, H. Sultana, B. Oliver, and S. T. Harbison. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics, 17:28, 2016.
- Lippa et al.  K. A. Lippa, D. L. Duewer, M. L. Salit, L. Game, and H. C. Causton. Exploring the use of internal and external controls for assessing microarray technical performance. BMC Res Notes, 3:349, 2010.
- Listgarten et al.  J. Listgarten, C. Kadie, E. E. Schadt, and D. Heckerman. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci USA, 107:16465–70, 2010.
- Littlewood et al.  T. D. Littlewood, P. Kreuzaler, and G. I. Evan. All things to all people. Cell, 151(1):11–3, 2012.
- Lockhart et al.  D. J. Lockhart, H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo, M. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol, 14:1675–1680, 1996.
- Lovén et al.  J. Lovén, D. A. A. Orlando, A. A. A. Sigova, C. Y. Y. Lin, P. B. B. Rahl, C. B. B. Burge, D. L. L. Levens, T. I. I. Lee, and R. A. A. Young. Revisiting global gene expression analysis. Cell, 151:476–482, 2012.
- Michiels et al.  S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet, 365:488–492, 2005.
- Mortazavi et al.  A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5:621–628, 2008.
- Neuwirth  E. Neuwirth. RColorBrewer: ColorBrewer Palettes, 2014. URL https://CRAN.R-project.org/package=RColorBrewer. R package version 1.1-2.
- Ni et al.  T. T. Ni, W. J. Lemon, Y. Shyr, and T. P. Zhong. Use of normalization methods for analysis of microarrays containing a high degree of gene effects. BMC Bioinformatics, 9:505, 2008.
- Nie et al.  Z. Nie, G. Hu, G. Wei, K. Cui, A. Yamane, W. Resch, R. Wang, D. R. Green, L. Tessarollo, R. Casellas, K. Zhao, and D. Levens. c-Myc is a universal amplifier of expressed genes in lymphocytes and embryonic stem cells. Cell, 151(1):68–79, 2012.
- OECD [2004a] OECD. Guidelines for the Testing of chemicals No 202. Daphnia sp. Acute Immobilization Test. Organization for Economic Cooperation and Development, Paris, 2004a.
- OECD [2004b] OECD. Guidelines for the Testing of chemicals No. 220. Enchytraeid Reproduction Test. Organization for Economic Cooperation and Development, Paris, 2004b.
- Pearson  R. D. Pearson. A comprehensive re-analysis of the Golden Spike data: towards a benchmark for differential expression methods. BMC Bioinformatics, 9:164, 2008.
- R Core Team  R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2016. URL https://www.R-project.org/.
- Reese et al.  S. E. Reese, K. J. Archer, T. M. Therneau, E. J. Atkinson, C. M. Vachon, M. de Andrade, J.-P. A. Kocher, and J. E. Eckel-Passow. A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis. Bioinformatics, 29:2877–83, 2013.
- Risso et al.  D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of rna-seq data using factor analysis of control genes or samples. Nat Biotechnol, 32:896–902, 2014.
- Ritchie et al.  M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. limma powers differential expression analyses for rna-sequencing and microarray studies. Nucleic Acids Res, 43:e47, 2015.
- Robinson and Oshlack  M. D. Robinson and A. Oshlack. A scaling normalization method for differential expression analysis of rna-seq data. Genome Biol, 11:R25, 2010.
- Schena et al.  M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science, 270:467–470, 1995.
- Schuster et al.  E. F. Schuster, E. Blanc, L. Partridge, and J. M. Thornton. Estimation and correction of non-specific binding in a large-scale spike-in experiment. Genome Biol., 8(6):R126, 2007.
- Scott-Fordsmand et al.  J. J. Scott-Fordsmand, P. H. Krogh, and J. M. Weeks. Responses of Folsomia fimetaria (collembola: Isotomidae) to copper under different soil copper contamination histories in relation to risk assessment. Environ Toxicol Chem, 19:1297–1303, 2000.
- Shi et al.  L. Shi, L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, F. de Longueville, E. S. Kawasaki, K. Y. Lee, and et al. The microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol, 24:1151–1161, 2006.
- Shippy et al.  R. Shippy, S. Fulmer-Smentek, R. V. Jensen, W. D. Jones, P. K. Wolber, C. D. Johnson, P. S. Pine, C. Boysen, X. Guo, E. Chudin, and et al. Using rna sample titrations to assess microarray platform performance and normalization techniques. Nat Biotechnol, 24:1123–1131, 2006.
- Smyth and Speed  G. K. Smyth and T. Speed. Normalization of cDNA microarray data. Methods, 31:265–273, 2003.
- Stegle et al.  O. Stegle, L. Parts, R. Durbin, and J. Winn. A bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput Biol, 6:e1000770, 2010.
- Storey  J. D. Storey. The positive false discovery rate: a bayesian interpretation and the q-value. Ann Stat, 31:2013–2035, 2003.
- Storey and Tibshirani  J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proc Natl Acad Sci USA, 100:9440–9445, 2003.
- Su et al.  Z. Su, P. P. Labaj, S. Li, J. Thierry-Mieg, D. Thierry-Mieg, W. Shi, C. Wang, G. P. Schroth, R. A. Setterquist, J. F. Thompson, and et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol, 32:903–914, 2014.
- Tan et al.  P. K. Tan, T. J. Downey, E. L. Spitznagel, P. Xu, D. Fu, D. S. Dimitrov, R. A. Lempicki, B. M. Raaka, and M. C. Cam. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res, 31:5676–5684, 2003.
- Tarca et al.  A. L. Tarca, R. Romero, and S. Draghici. Analysis of microarray experiments of gene expression profiling. Am. J. Obstet. Gynecol., 195(2):373–388, 2006.
- van de Peppel et al.  J. van de Peppel, P. Kemmeren, H. van Bakel, M. Radonjic, D. van Leenen, and F. C. P. Holstege. Monitoring global messenger RNA changes in externally controlled microarray experiments. EMBO Rep., 4(4):387–393, 2003.
- van ’t Veer et al.  L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, and et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.
- Wang et al.  Z. Wang, M. Gerstein, and M. Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10:57–63, 2009.
- Weigelt and Reis-Filho  B. Weigelt and J. S. Reis-Filho. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Res, 12 Suppl. 4:S5, 2010.
- Wu and Aryee  Z. Wu and M. J. Aryee. Subset quantile normalization using negative control features. J Comput Biol, 17:1385–1395, 2010.
- Yang et al.  Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30(4):e15, 2002.
- Zhu et al.  Q. Zhu, J. C. Miecznikowski, and M. S. Halfon. Preferred analysis methods for Affymetrix GeneChips. II. An expanded, balanced, wholly-defined spike-in dataset. BMC Bioinformatics, 11:285, 2010.
This work was supported by the European Union FP7 projects MODERN (Ref. 309314-2, to C.P.R., J.J.S.-F.), MARINA (Ref. 263215, to J.J.S.-F.), and SUN (Ref. 604305, to C.P.R., S.I.L.G., M.J.B.A., J.J.S.-F.), by FEDER funding through COMPETE (Programa Operacional Factores de Competitividade) and Portugal funding through FCT (Fundação para a Ciência e Tecnologia) within the project bio-CHIP (Refs. FCOMP-01-0124-FEDER-041177, FCT EXPL/AAG-MAA/0180/2013, to S.I.L.G., M.J.B.A.), and by a post-doctoral grant (Ref. SFRH/BPD/95775/2013, to S.I.L.G.)
S.I.L.G., M.J.B.A. and J.J.S.-F. designed the toxicity experiment. S.I.L.G. carried out the experimental work and collected the microarray data. C.P.R. designed and implemented the novel normalization methods. C.P.R. performed the statistical analyses. All the authors jointly discussed the results. C.P.R. drafted the paper, with input from all the authors. All the authors edited the final version of the paper.
Data Deposition and Code Availability
MIAME-compliant microarray data were submitted to the Gene Expression Omnibus (GEO) repository at the NCBI website (platform: GPL20310; series: GSE69746, GSE69792, GSE69793 and GSE69794). The novel normalization methods were implemented as R functions. This code, together with R scripts that generate the synthetic datasets and reproduce all reported results starting from the raw microarray data, are available at the GitHub repository https://github.com/carlosproca/gene-expr-norm-paper and in the Supplementary Data File gene-expr-norm.zip. Additionally, MedianCD and SVCD normalization are available via the R package cdnormbio, installable from the GitHub repository https://github.com/carlosproca/cdnormbio.
Competing Financial Interests
The authors declare that they have no competing financial interests.
Supplementary Figure S1
Supplementary Figure S2
Supplementary Figure S3
Supplementary Figure S4
Supplementary Figure S5
Supplementary Figure S6
Legends of Supplementary Movies
Supplementary Movie S1. Example of one within-condition Standard-Vector normalization, for the real (E. crypticus) dataset. The movie shows the 14 steps of the convergence of Standard-Vector normalization performed for the condition Ag.NM300K.EC20.3d (exposure to Ag NM300K nanoparticles, with an dose for three days). Left panels show a subset of 10,000 randomly-chosen sample standard vectors, with one gray line per gene, in the plane of residual vectors, i.e. the plane perpendicular to the vector of coordinates . The lines labeled s1–s3 indicate the projection of the axes onto this plane, the number 1–3 being the sample number. The red line is the estimated vector of normalization factors at each step, with length , which results from the bias of the standard vectors towards that direction. Right panels show the polar distribution of vector angles (black solid curve), compared to the distribution of vector angles after all six possible permutations of the sample labels (blue dashed curve). The Watson statistic provides a measure of the difference between both distributions. In the initial step, there is a large bias towards the first and second sample, compared to the third one. The bias is reduced in each step by subtracting the normalization factor estimate, which makes the distribution of standard vectors more permutationally symmetric and with a correspondingly smaller . After convergence in 14 steps, there is no detectable bias left.
Supplementary Movie S2. Example of one within-condition Standard-Vector normalization, for the synthetic dataset without differential gene expression. The movie displays the Standard-Vector normalization performed for the condition Ag.NM300K.EC20.3d, on the synthetic dataset generated with the standard normal as base distribution. Format and labels are the same as in Supplementary Movie S1. Note the uniform distribution of angles after normalizing, which corresponds to a parametric family of probability distributions with spherical symmetry. The corresponding movie for the synthetic dataset with differential gene expression is virtually identical, given that standard vectors are independent of sample averages.
Supplementary Movie S3. Example of one within-condition Standard-Vector normalization, for synthetic log-normal data. The movie shows the standard-vector normalization performed for the condition Ag.NM300K.EC20.3d, on a synthetic dataset generated in the same way as that of Supplementary Movie S2, except for using as base distribution the log-normal , which has large positive skewness (). Format and labels are the same as in Supplementary Movies S1, S2. Note that the distribution of standard vector angles after normalizing is not uniform, but it has permutation symmetry.
Supplementary Movie S4. Identification of no-variation genes (non-differentially expressed genes) for the real (E. crypticus) dataset. The movie shows the 27 steps of the corresponding between-condition normalization, with SVCD normalization. Both panels show the empirical distribution function of -values obtained from ANOVA tests on expression levels, per gene and grouped by experimental condition. The left panel shows the complete interval , while the right panel depicts the interval close to 1 where the first goodness-of-fit (GoF) test was not rejected. The black portion of the distribution corresponds to -values at which the GoF test was rejected, the big black circle indicates the first -value at which the GoF test was not rejected, and the red portion shows the range of -values whose genes, as a result, were identified as no-variation genes. The dashed blue line and the dotted blue line indicate, respectively, the theoretical distribution function of the uniform distribution and the threshold of the one-sided Kolmogorov-Smirnov test (, equal to the number of -values for the first GoF test that was not rejected). Convergence criteria was met from steps 18 to 27. These last ten steps ensured stability of the detected set of no-variation genes, by cumulative intersection of the successive sets identified, each one with no-variation genes, as shown. The resulting final set had 974 no-variation genes.
Supplementary Movie S5. Identification of no-variation genes for the synthetic dataset with differential gene expression. The movie shows the 15 steps of the corresponding between-condition normalization, with SVCD normalization. Format and labels are the same as in Supplementary Movie S4. Note the similarity with the behavior observed for the real dataset (Supplementary Movie S4).
Supplementary Movie S6. Identification of no-variation genes for the synthetic dataset without differential gene expression. The movie shows the 14 steps of the corresponding between-condition normalization, with SVCD normalization. Format and labels are the same as in Supplementary Movies S4, S5. Note that the distribution of -values at convergence (steps 5–14) is uniform in the whole interval , up to the level detected by the goodness-of-fit test. This corresponds to a dataset with no differentially expressed genes.
Supplementary Mathematical Methods
S1 Vectorial representation of sample data
Let be the samples of independent and identically distributed random variables . Let us represent the samples with the column vector , and let us denote the sample mean by .
Let us define the vectorial operators mean () and residual (), respectively, as
being the all-ones column vector of dimension .
Thus, any sample vector can be decomposed as
The mean vector contains the sample mean, while the residual vector carries the sample variation around the mean.
Proof. Let us denote and .
An essential property of the mean and residual vectors is that they belong to subspaces that are orthogonal complements [Eaton, 2007]. Hence, for any sample vector , the mean vector belongs to the subspace of dimension 1 spanned by the unit vector , while the residual vector belongs to the -dimensional hyperplane orthogonal to .
The lengths of the mean vector and residual vector are equal, up to a scaling factor, to the sample mean and sample standard deviation, respectively. For a set of samples , where , let us denote the sample mean as before by , and the sample variance as . Then, the lengths of the mean and residual vectors obtained from the sample vector are
Finally, let us define the standard vector of the sample vector (), as
whenever , or otherwise as . is the all-zeros column vector of dimension .
For a given number of samples , all the non-zero standard vectors belong to the ()-sphere of radius , embedded in the -dimensional hyperplane perpendicular to . Besides, all the components of a standard vector are equal to the corresponding standardized samples,
For the degenerate case of having only two samples (), the only possible values of a non-zero standard vector are .
S2 Linear decomposition of the normalization problem
Let us consider a gene expression dataset, with genes and experimental conditions. Each condition has samples. The total number of samples is .
Let us denote the observed expression level of gene in the sample of condition by . We assume that the observed level is equal, in the usual -scale, to the addition of the normalization factor to the true gene expression level ,
Solving the normalization problem amounts to finding the normalization factors from the observed values . The normalization factors can be understood as sample-wide changes in the concentration of mRNA molecules by multiplicative factors equal to . These changes are caused by technical reasons in the assay and are independent of the biological variation in the true levels .
Let us represent the true and observed expression levels, and , of gene in the samples of condition , by the -dimensional vectors
Let us also represent the unknown normalization factors of condition by the -dimensional vector
The residual-vector equations (28) correspond to the within-condition normalizations. Each within-condition normalization uses the equations (28) particular to a condition , for the subset of genes that have expression level available and of enough quality in that experimental condition.
Let us denote the condition means for each gene as
being the all-ones column vector of dimension .
Then, the mean-vector equations (27) can be written as
so they reduce to the scalar equations
Let us define the vectors of conditions means as
and let us express the condition-mean equations in vectorial form as
Applying again the mean and variance operators, we obtain
The residual-vector equations on condition means (42) correspond to the single between-condition normalization, in a similar way as (28) do for the each of the within-condition normalizations. There is one equation (42) per gene. The only equations used in the between-condition normalization are those of the subset of genes that show no evidence of variation across experimental conditions, according to a statistical test.
Given that , (41) has the only unknown . The meaning of is a conversion factor between the scale the true and observed expression levels. This factor depends on the technology used to measure the expression levels and finding it is out of the scope of the normalization problem. Therefore, without loss of generality, we assume , so
The solution of the between-condition normalization, , allows to find the mean vectors of the normalization factors , via (34), (39) and (44). The within-condition normalizations yield the residual vectors . The complete solution to the normalization problem is finally obtained, for each condition , with
Thus, the original normalization problem (26) has been divided in normalization sub-problems on residual vectors, stated by (28) and (42). In fact, this linear decomposition is possible for any partition of the set of samples. The choice of the partition as the one defined by the experimental conditions is motivated by the need to control the biological variation among the genes used in each normalization. All the normalizations face the same kind of normalization of residuals problem, which we define in general as follows.
Normalization of Residuals Problem. Let be the -th observed value of feature , in a dataset with observations for each of the features. The observed values are equal to the true values plus the normalization factors , which are constant across features. In vectorial form, there are equations
where the vectors belong to . As a consequence
Solving the normalization of residuals problem amounts to finding the residual vector of normalization factors from the observed residual vectors . In the within-condition normalizations, the features are gene expression levels, with one observation per sample of the corresponding experimental condition. In the between-condition normalization, the features are means of gene expression levels, with one observation per condition.
There is, however, an additional requirement imposed by the methods with which we propose to solve the between-condition normalization. We would like to consider the condition means in (36) as sample data across conditions. This only holds when all the conditions have the same number of samples. Otherwise, we balance the condition means so that they result from the same number of samples in all conditions, according to the procedure described in the following.
Let be the minimum number of samples across conditions, . Let be independent random samples (without replacement) of size from the set of indexes , with one sample per gene and condition . Then, the balanced condition means are defined as
Moreover, the average of across the sampling subsets is equal to the unknown . This implies that (51) are, on average, equivalent to (36). Hence, we use the following vectors of balanced conditions means
S3 Permutation invariance of multivariate data
Let and be, respectively, the true and observed values of a dataset with observations of features, as defined in the normalization of residuals problem above.
We have assumed that the true values of feature are samples of independent and identically distributed random variables . These random variables can be represented with the random vector , carried by the probability space and with induced space . Let us define the random vectors and with the vectorial operators mean (13) and residual (14), respectively,
holds for any random vector , as well as the other properties presented above. Let us assume that and that , which imply that has length 1 almost surely.
The standard random vector is a pivotal quantity, where the location (mean) and scale (standard deviation) of feature have been removed. The probability distribution of across the remaining degrees of freedom over the unit -sphere is governed by the parametric family of the random variables . Moreover, the independence and identity of distribution across the observations implies that the distribution of is exchangeable, i.e. invariant with respect to permutations of the observation labels. As a result, is also permutation invariant, which geometrically corresponds to symmetries with respect to the permutations of the axes in the -dimensional space of random vectors, projected onto the -dimensional hyperplane of residual vectors.
Residual vectors and standard vectors have been widely studied, especially in relation to elliptically symmetric distributions and linear models [Fang et al., 1990; Gupta et al., 2013], and to the invariances of probability distributions [Kallenberg, 2005]. Here, we consider these vectors from the viewpoint of the problem of normalizing multivariate data, and its relationship with permutation invariance.
It is well know that, for a multivariate distribution with independent and identically distributed components, the expected value of the standard vector is zero [Eaton, 2007], given that it is so for each component. We prove this here for completeness, and to show that it is also a necessary consequence of the permutation invariance of the distribution.
Proposition. The expected value of any true (i.e. without normalization issues) standard vector is zero. If the samples of feature are independent and identically distributed, then
Proof. Let be the set of all the permutation matrices in . Then, for any , is equal in distribution to . This implies that
The only vectors that are invariant with respect to all possible permutations are those that have all components identical. Therefore, , with . However, , so that . Hence .
For each true random vector , there is an observed random vector , where is the random vector of normalization factors. The random vectors and are independent, representing biological and technical variation, respectively. Therefore, and without loss of generality, we assume in what follows a fixed vector of normalization factors , i.e. we condition on the event . We also assume that , which implies that has length 1 almost surely.
In contrast to the true standard vector
the observed standard vector
is biased toward the direction of , with the result that the expected value is not zero.
Proposition. If the samples of feature are independent and identically distributed, whenever ,
When , there is the additional requirement that . This threshold of detection only occurs for the degenerate case of .
Proof. Let us consider the projection of