Functional modelling of microarray time series with covariate curves
Maurice Berk
Biological systems are inherently dynamic and gene expression levels may be temporally regulated for a wide range of reasons, including the cell cycle, circadian rhythms, developmental processes or responses to stimuli (e.g. drug treatment or environmental stress) (Spellman et al., 1998; Wang and Kim, 2003; Calvano et al., 2005). Microarrays are a high-throughput assaying technique for measuring the expression levels of thousands of genes simultaneously. Each microarray hybridisation provides a snapshot of expression levels at a single point in time; by carrying out sequential hybridisations on biological samples arising from the same source (e.g. a human patient), the evolution of these expression levels over time can be elucidated.
The resulting microarray time series give rise to data that possess certain characteristics which make their analysis particularly challenging. Specifically, due to the large number of genes under simultaneous study, the data is very high-dimensional and there are many more genes than there are time points. Each time series will typically be replicated no more than ten times, and experiments with no replication are not uncommon. The number of genes will often be in the tens of thousands while there are rarely more than ten time points. Even with the falling cost of microarray technology, the limiting factor is often the ability to obtain biological samples, which may be restricted due to ethical concerns or other practical, experimental issues. Other challenges include noisy data, frequent missing observations and individual heterogeneity.
Our focus is on longitudinal study designs. In this type of microarray experiment, multiple biological units - for example human patients, individual mice or cell lines - are each repeatedly sampled over time to give a collection of observed time series for each gene under study. This type of biological replication is essential for making inference about population parameters but is often overlooked in microarray studies due to experimental issues. A longitudinal microarray experiment is described in Section 2 and provides the data for our case study. The purpose of the study was to follow twelve female and ten male adult human subjects over a period of six months, in order to characterise the change in gene expression levels over time in healthy humans. Figure 1 shows some of the raw data for a probe corresponding to the TMEFF1 gene from this example data set, where some of the characteristics discussed above can be seen to manifest themselves. A key aspect of human data sets is that the gene expression levels are often collected with covariates - for example, the individual's age, sex and other phenotypic data such as height or weight may be recorded. In the case study, the individuals were stratified by age and gender, which allows us to explore not only the evolution of gene expression levels over time but also which genes are differentially expressed between the two gender or age groups.
When modelling experimental data arising from longitudinal microarray experiments there are three distinct challenges: (a) modelling each individual time series, across all genes and individuals, (b) accounting for the correlation between individuals on a gene by gene basis and (c) modelling the correlation between genes. Accounting for each of these sources of correlation - the temporal, the within-gene (between-individual) and the between-gene - is vital for obtaining better parameter estimates and avoiding a loss of power when testing for genes which are differentially or temporally expressed. With fewer than ten time points, achieving (a) is not possible with standard time series analysis techniques - it is unlikely, for instance, that we would observe any periodicity. Instead, a field which has proven to be quite successful in this area is that of functional data analysis (FDA). In the FDA paradigm, it is assumed that our observations are noisy realisations of an underlying smooth function of time which is to be estimated. These estimated functions, or curves, are then treated as the fundamental unit of data in any subsequent analysis. Formally, the signal-in-noise model assumed is that the $i$th observation $y_i$, taken at time $t_i$, is given by

$$y_i = f(t_i) + \epsilon_i, \quad i = 1, \dots, n \qquad (1)$$
where $f$ is the function of interest to be estimated and $\epsilon_i$ is an error term. Typically the infinite-dimensional function $f$ is projected onto some finite-dimensional basis using parameterisations such as splines, wavelets or Fourier bases. In our discussion we will focus on splines in particular, as these occur regularly in both the microarray and the functional data analysis literature. For a thorough treatment of FDA, the monograph of Ramsay and Silverman (2005) provides an excellent introduction.
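As a concrete illustration of the smoothing step, a single noisy time series can be fitted with a cubic smoothing spline in Python using SciPy. The days and expression values below are hypothetical, and `UnivariateSpline` stands in for the penalised estimators developed later in the paper:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical expression measurements (log scale) for one individual
# at the five sampling days used in the case study.
days = np.array([1.0, 14.0, 28.0, 90.0, 180.0])
expr = np.array([7.2, 7.4, 7.1, 6.8, 6.3])

# Cubic smoothing spline; the factor `s` plays the role of the roughness
# penalty: s=0 interpolates, larger s yields a smoother curve.
curve = UnivariateSpline(days, expr, k=3, s=0.05)

# The estimated function can be evaluated anywhere in the observed range.
grid = np.linspace(1.0, 180.0, 200)
fitted = curve(grid)
```

The fitted curve, rather than the raw observations, then becomes the unit of analysis.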
In a longitudinal study, for a particular gene, observations will be collected on not just a single function $f$, but a collection of functions $f_1, \dots, f_n$, one for each individual biological unit. Often the main quantity of interest is the population mean function $\mu(t)$ characterising the overall population gene expression level over time. In this case we extend the signal-in-noise model (1) so that the $j$th observation on individual $i$ at time $t_{ij}$ is given by

$$y_{ij} = \mu(t_{ij}) + v_i(t_{ij}) + \epsilon_{ij} \qquad (2)$$
This is known as the functional mixed-effects model and is an extension of the standard linear mixed-effects model (Harville, 1977) in which the fixed- and random-effects are both functions. The function $\mu$ is treated as a fixed-effect as it is assumed to be some fixed, but unknown, population function to be estimated. In contrast, the functions $v_i$ represent a random sample from the population as a whole and are assumed to be i.i.d. realisations of an underlying stochastic process. This model has appeared in a number of different forms depending upon the exact parameterisation of the fixed- and random-effects. For instance, Guo (2002) models both as cubic smoothing splines while Rice and Wu (2001) prefer a B-spline representation.
The task of handling correlations amongst the genes has, to date, generally been overlooked by researchers. It is a challenging, open problem to model both the between- and within-gene correlation simultaneously given the size of the data. Although it is well known that genes are co-regulated, for the sake of tractability the most common approach is to simply model each gene independently. In other words, given the framework outlined thus far, each gene would be modelled as a separate functional mixed-effects model.
In this paper we propose a functional mixed-effects model and a framework for estimation and testing in one-sample problems. The model enables the estimation of a mean response curve with the inclusion of covariates, such as age and gender, also modelled as time-varying smooth functions. We also show how a functional PCA can be applied to the estimated mean curves in order to identify the principal modes of functional variation in the data set, and to visually represent the entire set of genes in a low-dimensional plot.
The structure of the paper is as follows. Section 2 provides a description of a data set, previously collected and analysed by Karlovich et al. (2009), that we use here as a case study. The proposed model, inferential procedures and functional PCA are described in Section 3. In Section 4 we present the experimental results obtained in the context of our case study. In Section 5 we discuss how our methodology compares to related models that have appeared in the literature and compare our experimental results to those of the original study, as well as highlight some of the biological implications. Finally, we conclude in Section 6.
2 Data description
The data set used in our case study is taken from Karlovich et al. (2009). The purpose of the study was to characterise the gene expression levels of healthy human individuals over a period of 6 months. 22 subjects were studied, with gene expression levels assayed from blood samples at days 1, 14, 28, 90 and 180. One subject developed lung cancer during the course of the study and died prior to Day 180, thus contributing only a partial time series. All other individuals completed the study and were observed at all 5 time points. Twelve of the individuals were female and ten were male. In the original study, the subjects were divided into two age groups, with the younger group taken to be those subjects less than or equal to 55 years of age, and the older group those subjects over 55.
In the original paper, the observation $y_{ij}$ for a given gene on individual $i$ at time $t_j$ was modelled as

$$y_{ij} = \mu + b_i + \gamma_{g(i)} + \alpha_{a(i)} + \tau_j + \epsilon_{ij}$$
This is a standard linear mixed-effects model. $\mu$ is the average gene expression level across all individuals after controlling for gender, age and time effects. $b_i$ is an individual-specific term allowing for a deviation in the intercept of the model. The $\gamma$, $\alpha$ and $\tau$ parameters separate out the gender, age and time effects respectively, while $\epsilon_{ij}$ is an error term.
The model and study design permitted a wide range of biological issues to be explored. Using t-tests, the significance of the age and gender effects was determined. After correcting for multiple testing by controlling the false discovery rate (FDR) using the procedure of Benjamini and Hochberg (1995), no genes showed a significant age effect. This was somewhat unexpected given previous studies (Eady et al., 2005; Whitney et al., 2003; Tang et al., 2004) but it was noted that these age effects might be harder to detect in blood than in other tissues. 78 unique gender genes were identified, including XIST, responsible for deactivating one of the X chromosomes in females in order to ensure dosage equivalence, and 23 genes mapped to the Y chromosome. Temporally regulated genes were identified by performing pairwise comparisons between Day 14 and Day 1, Day 28 and Day 14, and Day 180 and Day 90. This was partly due to concerns about a potential batch effect, as Days 1, 14 and 28 were processed in one batch, with Days 90 and 180 being processed in a second batch. No temporally regulated genes were identified in the Day 14 vs Day 1 or the Day 28 vs Day 14 comparisons, but 248 probes were found to be differentially expressed when comparing Day 180 to Day 90, corresponding to 157 unique genes.
Our proposed approach is to replace the original linear mixed-effects model with a functional one. The age and gender effects will be modelled as functions of time, along with the mean and individual curves. To avoid over-parameterisation, all curves will be represented using smoothing splines. The result is a flexible model which permits the interaction of age and gender with time, if the data supports it. During our preprocessing we found little evidence of a batch effect and we will use the entire time course to identify temporally regulated genes, on the basis of the fitted mean function.
We propose the following functional mixed-effects model for the data described in Section 2. Each gene is modelled independently. For a given gene, the observed gene expression level $y_{ij}$ for individual $i$ at time $t_{ij}$ is given by

$$y_{ij} = \mu(t_{ij}) + \gamma_{g(i)}(t_{ij}) + \alpha_{a(i)}(t_{ij}) + v_i(t_{ij}) + \epsilon_{ij} \qquad (3)$$

where $\mu$ models the mean expression level across all individuals after accounting for age and gender effects; $\gamma_g$ is the gender effect for the gender $g(i) \in \{1, 2\}$ to which individual $i$ belongs; $\alpha_a$ is the age group effect for the group $a(i) \in \{1, 2\}$ to which individual $i$ belongs; $v_i$ is the individual-specific effect for individual $i$ and $\epsilon_{ij}$ is an error term. The functions $\mu$, $\gamma_g$, $\alpha_a$ and $v_i$ are assumed to be smooth functions of time which we wish to estimate based on the noisy observations. We treat $\mu$, $\gamma_g$ and $\alpha_a$ as fixed-effects, unknown population functions to be estimated, while the functions $v_i$ are treated as random-effects as they represent a random sample of functions from the population as a whole. Formally, the $v_i$ are assumed to be i.i.d. realisations of an underlying Gaussian process with mean 0 and covariance function $R(s, t)$.
The functions can be parameterised in a number of ways but we favour smoothing splines as these offer a fine degree of control over the amount to which the data is smoothed. Writing the vector of all observed time points for individual $i$ as $\mathbf{t}_i = (t_{i1}, \dots, t_{in_i})^\top$, where $n_i$ is the total number of observations on individual $i$, (3) can be written in matrix form as

$$\mathbf{y}_i = N_i(\boldsymbol{\mu} + \boldsymbol{\gamma}_{g(i)} + \boldsymbol{\alpha}_{a(i)} + \mathbf{v}_i) + \boldsymbol{\epsilon}_i$$
where $\mathbf{y}_i$ and $\boldsymbol{\epsilon}_i$ are vectors of length $n_i$ and $\boldsymbol{\mu} = (\mu(\tau_1), \dots, \mu(\tau_M))^\top$ is a vector of length $M$. The vectors $\boldsymbol{\gamma}_{g(i)}$, $\boldsymbol{\alpha}_{a(i)}$ and $\mathbf{v}_i$ are defined similarly to $\boldsymbol{\mu}$. The values $\tau_1, \dots, \tau_M$ denote the distinct design time points, of which there are $M$ in total; these may differ from $\mathbf{t}_i$ if individual $i$ has missing data or duplicate observations for some time points. The matrix $N_i$ is an incidence matrix of dimension $n_i \times M$ where each row contains all zeroes aside from a one in the column $k$ for which $t_{ij} = \tau_k$. Further details on forming the incidence matrices and an example can be found in Appendix A.1. Recall that $v_i(t)$ is a Gaussian process with mean 0 and covariance function $R(s, t)$; the vectors $\mathbf{v}_i$ are then multivariate-normally distributed with mean $\mathbf{0}$ and covariance matrix $R$, where $R_{jk} = R(\tau_j, \tau_k)$. Similarly the noise term $\boldsymbol{\epsilon}_i$ is multivariate-normally distributed with mean $\mathbf{0}$ and covariance matrix $\Sigma_i$, and we assume that the vectors $\mathbf{v}_i$ and $\boldsymbol{\epsilon}_i$ are independent. For simplicity we assume that $\Sigma_i = \sigma^2 I_{n_i}$, although a more complicated structure could be modelled at the expense of fitting more parameters. It is further necessary to impose the identifiability constraint that the age and gender fixed-effects for the two groups sum to zero, i.e. $\boldsymbol{\gamma}_1 + \boldsymbol{\gamma}_2 = \mathbf{0}$ and $\boldsymbol{\alpha}_1 + \boldsymbol{\alpha}_2 = \mathbf{0}$. For simplicity, therefore, we model a single gender and age effect, $\boldsymbol{\gamma}$ and $\boldsymbol{\alpha}$ respectively. These constraints can equivalently be expressed by rewriting (3) as

$$\mathbf{y}_i = N_i(\boldsymbol{\mu} + c_i\boldsymbol{\gamma} + d_i\boldsymbol{\alpha} + \mathbf{v}_i) + \boldsymbol{\epsilon}_i$$

where $c_i = 1$ if individual $i$ is female and $c_i = -1$ otherwise, and similarly $d_i = \pm 1$ according to age group.
Let $X_i = [N_i \;\; c_i N_i \;\; d_i N_i]$ and $\boldsymbol{\beta} = (\boldsymbol{\mu}^\top, \boldsymbol{\gamma}^\top, \boldsymbol{\alpha}^\top)^\top$, then (3) can be rewritten more compactly as

$$\mathbf{y}_i = X_i\boldsymbol{\beta} + N_i\mathbf{v}_i + \boldsymbol{\epsilon}_i$$
Finally, the complete data vector for all $n$ individuals, $\mathbf{y} = (\mathbf{y}_1^\top, \dots, \mathbf{y}_n^\top)^\top$, can be expressed as

$$\mathbf{y} = X\boldsymbol{\beta} + Z\mathbf{v} + \boldsymbol{\epsilon}$$
where $\mathbf{y}$ is an $N = \sum_{i=1}^n n_i$ length vector, and $\mathbf{v}$ and $\boldsymbol{\epsilon}$ are similarly defined, $X = (X_1^\top, \dots, X_n^\top)^\top$ is an $N \times 3M$ matrix and $Z = \operatorname{diag}(N_1, \dots, N_n)$ is an $N \times nM$ matrix, with the $\operatorname{diag}$ operator denoting a block diagonal matrix. The vectors $\mathbf{v}$ and $\boldsymbol{\epsilon}$ are both multivariate-normally distributed with mean $\mathbf{0}$ and covariance matrix $\tilde{R} = I_n \otimes R$ and $\sigma^2 I_N$ respectively.
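A minimal sketch of the incidence-matrix construction (a hypothetical helper, not the Appendix A.1 code): each row of $N_i$ carries a single one marking the design time point of that observation, and the $N_i$ are stacked block-diagonally to form $Z$.

```python
import numpy as np
from scipy.linalg import block_diag

design_pts = np.array([1.0, 14.0, 28.0, 90.0, 180.0])  # distinct design time points

def incidence_matrix(t_i, design_pts):
    """n_i x M matrix with a single one per row, marking which design
    time point each of individual i's observations was taken at."""
    N_i = np.zeros((len(t_i), len(design_pts)))
    for row, t in enumerate(t_i):
        N_i[row, int(np.argmin(np.abs(design_pts - t)))] = 1.0
    return N_i

# A complete individual, and one who dropped out before Day 180.
N_1 = incidence_matrix([1.0, 14.0, 28.0, 90.0, 180.0], design_pts)
N_2 = incidence_matrix([1.0, 14.0, 28.0, 90.0], design_pts)

# Z stacks the incidence matrices block-diagonally across individuals.
Z = block_diag(N_1, N_2)
```

For a complete individual, $N_i$ reduces to the identity matrix; missing observations simply drop the corresponding rows.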
3.1 Parameter Estimation
Model (3) is in the form of the standard linear mixed-effects model (Laird and Ware, 1982). Standard practice for obtaining estimates of the fixed- and random-effects, $\boldsymbol{\beta}$ and $\mathbf{v}$, would be to maximise the joint likelihood of $\mathbf{y}$ and $\mathbf{v}$ (Robinson, 1991). This is equivalent to minimising the following generalised log likelihood (GLL) criterion

$$\mathrm{GLL} = (\mathbf{y} - X\boldsymbol{\beta} - Z\mathbf{v})^\top \Sigma^{-1}(\mathbf{y} - X\boldsymbol{\beta} - Z\mathbf{v}) + \mathbf{v}^\top \tilde{R}^{-1}\mathbf{v} + \log|\Sigma| + \log|\tilde{R}|$$

where $\Sigma = \sigma^2 I_N$.
However, in our model the fixed- and random-effects are the fitted values of the smoothing spline estimates of the functions $\mu$, $\gamma$, $\alpha$, $v_i$, and it is necessary to incorporate a penalty term for the roughness of the smoothing splines into the likelihood. The penalised GLL is then given by

$$\mathrm{PGLL} = \mathrm{GLL} + \lambda\left(\int \mu''(t)^2\,dt + \int \gamma''(t)^2\,dt + \int \alpha''(t)^2\,dt\right) + \eta\sum_{i=1}^n \int v_i''(t)^2\,dt$$
where the integrals quantify the roughness of the curves $\mu$, $\gamma$, $\alpha$, $v_i$ in terms of their squared second derivative, although other penalties could be used. The scalars $\lambda$ and $\eta$ are positive-valued smoothing parameters that control the roughness of the fit. For a given smoothing spline fit, $\lambda = 0$ would correspond to an interpolation of the data points while, as $\lambda$ tends to infinity, the fit tends to a straight line. Note that the same smoothing parameter $\lambda$ is used for the three fixed-effects functions $\mu$, $\gamma$, $\alpha$, and similarly the same smoothing parameter $\eta$ is used for all random-effect functions $v_i$. This is conceptually justified as each function $v_i$ is assumed to be a realisation of the same underlying Gaussian process, but it is possible to envisage selecting a separate smoothing parameter for each fixed- and random-effect function, albeit at a far greater computational cost.
Minimisation of the penalised GLL requires calculation of the integral of the squared second derivative of the fixed- and random-effects. In the case of cubic smoothing splines, for a given function $f$ observed at time points $\tau_1 < \dots < \tau_M$, there is a roughness matrix $G$, which can be calculated in a computationally efficient manner, that satisfies

$$\int f''(t)^2\,dt = \mathbf{f}^\top G\,\mathbf{f}, \qquad \mathbf{f} = (f(\tau_1), \dots, f(\tau_M))^\top.$$

The total penalty can then be written as $\eta\,\mathbf{v}^\top G_v\,\mathbf{v} + \lambda\,\boldsymbol{\beta}^\top G_\beta\,\boldsymbol{\beta}$, where $G_v = I_n \otimes G$ is a block diagonal matrix comprised of the matrix $G$ repeated $n$ times. Similarly, $G_\beta = I_3 \otimes G$ is a block diagonal matrix comprised of $G$ repeated three times.
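For concreteness, here is a sketch of one standard way to build the roughness matrix $G$ for a natural cubic smoothing spline, via the banded $Q R^{-1} Q^\top$ construction described by Green and Silverman (an assumption: the paper does not specify its construction). The knots are the five design days of the case study:

```python
import numpy as np

def roughness_matrix(tau):
    """Roughness matrix G for a natural cubic spline with knots tau, so
    that f^T G f equals the integrated squared second derivative of the
    natural cubic spline interpolating the points (tau, f)."""
    tau = np.asarray(tau, dtype=float)
    M = len(tau)
    h = np.diff(tau)                      # knot spacings
    Q = np.zeros((M, M - 2))
    R = np.zeros((M - 2, M - 2))
    for k in range(M - 2):                # interior knot index k + 1
        Q[k,     k] = 1.0 / h[k]
        Q[k + 1, k] = -1.0 / h[k] - 1.0 / h[k + 1]
        Q[k + 2, k] = 1.0 / h[k + 1]
        R[k, k] = (h[k] + h[k + 1]) / 3.0
        if k + 1 < M - 2:
            R[k, k + 1] = R[k + 1, k] = h[k + 1] / 6.0
    return Q @ np.linalg.solve(R, Q.T)

G = roughness_matrix([1.0, 14.0, 28.0, 90.0, 180.0])
```

A useful sanity check is that straight lines incur no penalty: $G\mathbf{f} = \mathbf{0}$ whenever $\mathbf{f}$ is linear in the knots.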
After a rearrangement of the terms featuring in the penalised log-likelihood, the model can be re-written in terms of regularised covariance matrices such as $\tilde{R}_\eta = (\tilde{R}^{-1} + \eta G_v)^{-1}$, so called because the matrix $\tilde{R}_\eta$ is obtained by regularising the covariance matrix $\tilde{R}$ with the roughness term $\eta G_v$. This method of imposing the smoothness constraints by regularisation of the covariance matrix can be credited to Wu and Zhang (2006).
Minimising the penalised GLL gives the BLUE and BLUP of the fixed- and random-effects as

$$\hat{\boldsymbol{\beta}} = (X^\top V^{-1}X + \lambda G_\beta)^{-1}X^\top V^{-1}\mathbf{y}, \qquad \hat{\mathbf{v}} = \tilde{R}_\eta Z^\top V^{-1}(\mathbf{y} - X\hat{\boldsymbol{\beta}})$$

where $V = Z\tilde{R}_\eta Z^\top + \sigma^2 I_N$.
The discussion thus far has assumed that the variance components $R$ and $\sigma^2$ were known. Of course, in practical applications this will not be the case. If the random-effects and error terms were directly observed, the maximum likelihood estimators $\hat{R}$ and $\hat{\sigma}^2$ would be given by

$$\hat{R} = \frac{1}{n}\sum_{i=1}^n \mathbf{v}_i\mathbf{v}_i^\top, \qquad \hat{\sigma}^2 = \frac{1}{N}\,\boldsymbol{\epsilon}^\top\boldsymbol{\epsilon}.$$
As the random-effects and error terms are not, in fact, directly observed, we resort to the Expectation-Maximisation (EM) algorithm, in which they are treated as missing data. In this procedure the sufficient statistics of $\hat{R}$ and $\hat{\sigma}^2$ - $\sum_i \mathbf{v}_i\mathbf{v}_i^\top$ and $\boldsymbol{\epsilon}^\top\boldsymbol{\epsilon}$ respectively - are replaced by their conditional expectations, which are calculated at the E-step. In the M-step, the maximum likelihood estimators are then calculated having replaced the sufficient statistics by these conditional expectations, which are given by

$$\mathbb{E}[\mathbf{v}_i\mathbf{v}_i^\top \mid \mathbf{y}] = \hat{\mathbf{v}}_i\hat{\mathbf{v}}_i^\top + \operatorname{Cov}(\mathbf{v}_i \mid \mathbf{y}), \qquad \mathbb{E}[\boldsymbol{\epsilon}^\top\boldsymbol{\epsilon} \mid \mathbf{y}] = \hat{\boldsymbol{\epsilon}}^\top\hat{\boldsymbol{\epsilon}} + \operatorname{tr}\!\left(\operatorname{Cov}(\boldsymbol{\epsilon} \mid \mathbf{y})\right)$$
where $\operatorname{tr}$ denotes the trace of a matrix and $\hat{\boldsymbol{\epsilon}} = \mathbf{y} - X\hat{\boldsymbol{\beta}} - Z\hat{\mathbf{v}}$. Derivations of these conditional expectations are given in Appendix A.3.
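To make the E- and M-steps concrete, the following toy sketch runs the analogous EM updates for the simplest special case - a random intercept per individual, so the covariance matrix reduces to a scalar variance $d$. All data are simulated and the names are hypothetical; this is not the full functional algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated one-gene data set: n individuals, m time points each, with a
# random intercept b_i ~ N(0, d_true) and i.i.d. noise.
n, m = 40, 5
d_true, s2_true = 2.0, 0.5
b = rng.normal(0.0, np.sqrt(d_true), size=n)
y = 1.0 + b[:, None] + rng.normal(0.0, np.sqrt(s2_true), size=(n, m))

d, s2 = 1.0, 1.0                          # initial variance components
for _ in range(200):
    mu = y.mean()                         # fixed effect (grand mean here)
    resid = y - mu
    # E-step: conditional mean and variance of each random intercept.
    b_hat = (d * m / (s2 + d * m)) * resid.mean(axis=1)
    b_var = d * s2 / (s2 + d * m)
    # M-step: sufficient statistics replaced by conditional expectations.
    d = np.mean(b_hat ** 2 + b_var)
    s2 = np.mean((resid - b_hat[:, None]) ** 2 + b_var)
```

The estimates settle near the simulated values, illustrating how the conditional variance terms prevent the variance components from collapsing towards zero.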
3.2 Model Selection
Thus far we have treated the smoothing parameters $\lambda$ and $\eta$ as fixed. In reality, optimal values of these parameters must be found using a model selection procedure. Guo (2002) made use of the relationship between a smoothing spline and a linear mixed-effects model in order to treat the smoothing parameters as variance components that could be estimated during the normal course of the EM algorithm. We prefer, however, to dissociate the model selection from the parameter estimation and numerically optimise over the two-dimensional space of non-negative reals $(\lambda, \eta) \in [0, \infty)^2$, as this is a much more flexible approach. There are a number of different criteria for scoring the smoothing parameters, all of which essentially trade off model fit against model complexity.
Ma et al. (2006)'s smoothing-spline clustering approach for microarray data, for instance, employed Wahba (1977)'s generalized cross validation (GCV) criterion. It is well known, however, that GCV tends to undersmooth (Lee, 2003). Alternatively, we can employ either the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC):

$$\mathrm{AIC} = -2\log L + 2\,\mathrm{df}, \qquad \mathrm{BIC} = -2\log L + \log(N)\,\mathrm{df}$$
These two criteria both score the smoothing parameters in terms of the likelihood - measuring the model fit - adjusted by a penalty term for the model complexity, in terms of the degrees of freedom. The difference lies in the size of the penalty term, with BIC giving more conservative results when $\log N > 2$, in other words when there are more than $e^2 \approx 7$ data points.
Both of these criteria, and GCV, have a sound theoretical basis. We therefore suggest choosing which one to use on the basis of a priori knowledge about the kind of patterns we expect to observe in a given data set. If, as in our example data set, we do not expect there to be many genes with curvy temporal profiles, then we may prefer the more conservative BIC. On the other hand, in a data set with a greater number of time points and with more expected variability - in response to infection, for instance - we may prefer the AIC in order to better capture the more complex patterns expected.
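The two criteria differ only in the per-degree-of-freedom penalty, as this small sketch makes explicit (in practice the log-likelihood and degrees of freedom would come from the fitted model):

```python
import numpy as np

def aic(loglik, df):
    # Akaike Information Criterion: fixed penalty of 2 per df.
    return -2.0 * loglik + 2.0 * df

def bic(loglik, df, n_obs):
    # Bayesian Information Criterion: penalty of log(n_obs) per df.
    return -2.0 * loglik + np.log(n_obs) * df

# BIC penalises each effective degree of freedom by log(n_obs) rather
# than 2, so it is the more conservative criterion once log(n_obs) > 2,
# i.e. once there are more than e^2 (about 7.4) observations.
```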
3.2.1 Smoother Matrices
In order to evaluate these criteria, it is necessary to calculate the degrees of freedom of the model. As per Buja et al. (1989), the degrees of freedom associated with the fixed- and random-effects, $\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{v}}$, can be expressed as the trace of a smoother matrix $S$ such that $\hat{\mathbf{y}} = S\mathbf{y}$. Equivalently, it is useful to determine the two smoother matrices $S_\beta$ and $S_v$ so that the degrees of freedom of the fixed- and random-effects can be accounted for separately.
Recall that the fitted values of the fixed-effects at the design time points can be written as $X\hat{\boldsymbol{\beta}}$. Replacing $\hat{\boldsymbol{\beta}}$ with its BLUE gives

$$X\hat{\boldsymbol{\beta}} = X(X^\top V^{-1}X + \lambda G_\beta)^{-1}X^\top V^{-1}\mathbf{y}$$

and so the smoother matrix is given by

$$S_\beta = X(X^\top V^{-1}X + \lambda G_\beta)^{-1}X^\top V^{-1}.$$
Similarly, the fitted values of the random-effects at the design time points can be written as $Z\hat{\mathbf{v}}$, which gives

$$S_v = Z\tilde{R}_\eta Z^\top V^{-1}(I_N - S_\beta).$$
The degrees of freedom of the model can then be calculated as $\mathrm{df} = \operatorname{tr}(S_\beta) + \operatorname{tr}(S_v) + 1$, which is the trace of the smoother matrices plus an additional parameter for fitting the noise variance $\sigma^2$.
With the scoring function in place, any kind of two-dimensional optimisation routine can be used, although in practice a simple grid search or sequential line optimisation is recommended (Wu and Zhang, 2006). We have found that a more sophisticated simplex-search optimiser (Nelder and Mead, 1965) can be employed without incurring a significant computational cost. This allows optimisation over the two smoothing parameters $\lambda$ and $\eta$ simultaneously without needing to calculate the derivative of the criterion.
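A sketch of the simplex search over the two smoothing parameters, working on the $\log_{10}$ scale so that they remain positive. The criterion here is a toy surface with a known minimum at $(\lambda, \eta) = (1, 0.5)$, standing in for a full model refit returning, say, the BIC:

```python
import numpy as np
from scipy.optimize import minimize

def criterion(log_params):
    # log_params = (log10 lambda, log10 eta); a smooth stand-in score
    # with its minimum at log10(1) = 0 and log10(0.5) = -log10(2).
    loglam, logeta = log_params
    return (loglam - 0.0) ** 2 + (logeta + np.log10(2.0)) ** 2

# Derivative-free Nelder-Mead simplex search over both parameters at once.
res = minimize(criterion, x0=[2.0, 2.0], method="Nelder-Mead")
lam_opt, eta_opt = 10.0 ** res.x
```

Optimising on the log scale is a common trick for positive tuning parameters, since the criterion surface is typically better behaved there.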
3.3 Confidence Bands
Pointwise confidence bands at the design time points for each of the fixed-effects functions can be determined either theoretically or using a bootstrap resampling procedure. In the case of the former, we have

$$\operatorname{Cov}(\hat{\boldsymbol{\beta}}) = (X^\top V^{-1}X + \lambda G_\beta)^{-1}X^\top V^{-1}X\,(X^\top V^{-1}X + \lambda G_\beta)^{-1}.$$
The diagonal elements of $\operatorname{Cov}(\hat{\boldsymbol{\beta}})$, therefore, give the variances of the fixed-effects at the design time points, with the first $M$ elements corresponding to $\hat{\boldsymbol{\mu}}$, the next $M$ elements to $\hat{\boldsymbol{\gamma}}$, and the final $M$ elements to $\hat{\boldsymbol{\alpha}}$. In fact, due to the block diagonal structure of $\operatorname{Cov}(\hat{\boldsymbol{\beta}})$, these elements will be the same across all three fixed-effects. Confidence bands at significance level $\alpha$ at the design time points can then be calculated for $\hat{\boldsymbol{\mu}}$ as $\hat{\mu}(\tau_j) \pm z_{1-\alpha/2}\sqrt{\operatorname{Var}(\hat{\mu}(\tau_j))}$, where $z_{1-\alpha/2}$ is the critical value under the normality assumption such that $\Pr(Z \le z_{1-\alpha/2}) = 1 - \alpha/2$ for $Z \sim \mathcal{N}(0, 1)$. These bands can be calculated for the other fixed-effects $\hat{\boldsymbol{\gamma}}$ and $\hat{\boldsymbol{\alpha}}$ in an identical fashion.
Alternatively, confidence intervals can be estimated by resampling the between- and within-individual residuals. To construct a bootstrapped sample for a single individual, first one of the $n$ fitted individual functions $\hat{v}_i$ is randomly selected and evaluated at the design time points - denote this vector as $\mathbf{v}^*$. Next, residuals from the noise vector $\hat{\boldsymbol{\epsilon}} = \mathbf{y} - X\hat{\boldsymbol{\beta}} - Z\hat{\mathbf{v}}$ are resampled with replacement, writing this vector as $\boldsymbol{\epsilon}^*$. Then, the bootstrapped observation vector is given by

$$\mathbf{y}^* = N(\hat{\boldsymbol{\mu}} + c\,\hat{\boldsymbol{\gamma}} + d\,\hat{\boldsymbol{\alpha}} + \mathbf{v}^*) + \boldsymbol{\epsilon}^*$$
where $c = 1$ if the individual is female and $c = -1$ otherwise, and similarly for $d$. This process is then repeated for all $n$ individuals, sampling the individual functions with replacement, to give a complete bootstrapped data set. The model is then fit to this resampled data and new estimates of the fixed-effects obtained. Repeating this process for a large number of iterations gives a large number of fixed-effects estimates from which the confidence bands at a given significance level can be determined empirically.
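The resampling scheme can be sketched as follows, with toy stand-ins for the fitted quantities and the pointwise mean standing in for a full model refit at each iteration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for fitted quantities: mean curve at the M design points,
# n fitted individual deviation curves, and within-individual residuals.
M, n, B = 5, 20, 500
mu_hat = np.array([7.0, 7.1, 6.9, 6.5, 6.2])
v_hat = rng.normal(0.0, 0.3, size=(n, M))
resid = rng.normal(0.0, 0.2, size=n * M)

boot_means = np.empty((B, M))
for b in range(B):
    # Resample whole individual curves and residuals, with replacement.
    idx = rng.integers(0, n, size=n)
    eps = rng.choice(resid, size=(n, M), replace=True)
    y_star = mu_hat + v_hat[idx] + eps
    # Refit to the bootstrapped data (here: the pointwise mean).
    boot_means[b] = y_star.mean(axis=0)

# Empirical pointwise 95% confidence band for the mean curve.
lower = np.percentile(boot_means, 2.5, axis=0)
upper = np.percentile(boot_means, 97.5, axis=0)
```

Resampling whole individual curves preserves the within-individual correlation, which is the point of the two-level scheme.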
3.4 Hypothesis Testing
Fitting model (3) allows us to separate out the mean, age and gender effects for each gene. It is then possible to determine whether there is a significant difference between age groups or genders for a given gene by testing the null hypothesis that the size of the effect is zero. As the effects are modelled as functions, a natural way to quantify their size is the $\mathcal{L}^2$ norm, and testing the significance of the age effect for a given gene can be framed as

$$H_0 : \|\alpha\|_2 = 0 \quad \text{versus} \quad H_1 : \|\alpha\|_2 > 0, \qquad \|\alpha\|_2 = \left(\int \alpha(t)^2\,dt\right)^{1/2}.$$
Assessing the significance is complicated by the fact that the sample sizes are small and the null distribution is unknown. Instead, it can be determined empirically by using a data resampling scheme such as the bootstrap or permutation procedure. For example, in Storey et al. (2005), the null distribution of their F-type test-statistic was determined using a nonparametric bootstrap procedure by resampling the individual effects and error terms with replacement. Here, however, the null distribution of the norm of the age and gender effects can be estimated empirically using a permutation procedure where the class assignments - male/female or young/old - are randomly permuted.
These same ideas can be applied when testing for genes which are temporally regulated. In this case, the null hypothesis that there is no change over time can be formulated as $H_0 : \|\mu'\|_2 = 0$, where $\mu'$ is the first derivative of the mean curve. The null distribution can then be empirically estimated by randomly permuting the time points.
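A sketch of the permutation scheme for the gender effect, with the $\mathcal{L}^2$ norm approximated on a discrete grid (the curves below are toy data; in the real analysis the statistic is computed from the fitted effect functions):

```python
import numpy as np

rng = np.random.default_rng(2)

def l2_stat(curves, labels, w=1.0):
    """Discretised L2 norm of the difference between group mean curves."""
    diff = curves[labels].mean(axis=0) - curves[~labels].mean(axis=0)
    return np.sqrt(w * np.sum(diff ** 2))

# Toy data: 22 individual curves at 5 time points, with a genuine shift
# between the 12 'female' and 10 'male' curves.
labels = np.array([True] * 12 + [False] * 10)
curves = rng.normal(0.0, 0.3, size=(22, 5))
curves[labels] += 0.8

observed = l2_stat(curves, labels)

# Null distribution: randomly permute the group assignments.
null = np.array([l2_stat(curves, rng.permutation(labels))
                 for _ in range(999)])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
```

Adding one to the numerator and denominator avoids reporting a p-value of exactly zero from a finite number of permutations.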
3.5 Functional Principal Components Analysis
Fitting model (3) to each gene yields a set of mean curves $\hat{\mu}_1, \dots, \hat{\mu}_G$, where $G$ is the total number of genes in the data set. Performing an fPCA on this set of curves allows us to identify the main patterns of variation across all genes. A straightforward way of doing so is the discretisation method described in Ramsay and Silverman (2005), which is essentially a two-stage approach to fPCA: (1) the data are smoothed by fitting model (3) to each gene; (2) a PCA is performed on the smoothed data - in the form of the set of curves $\hat{\mu}_1, \dots, \hat{\mu}_G$. Alternative methods of fPCA, such as James et al. (2000), which estimate and smooth the PCs directly, cannot be applied in this case where there are two levels of variation - the between- and the within-gene.
First, each curve is discretised on a fine grid of $P$ equally spaced points $x_1, \dots, x_P$ across the range of the time course. As there are $G$ curves in total, this yields a data matrix $Y$ of dimension $G \times P$. A standard PCA can then be performed on $Y$.
This procedure gives principal components, each a vector of length $P$. It is then necessary to transform these vectors back into functions. A standard PCA can be defined as solving the eigenequation

$$S\mathbf{u} = \ell\,\mathbf{u}$$
where $S$ is the sample covariance matrix of $Y$, $\ell$ is one of the eigenvalues of $S$, and $\mathbf{u}$ is one of the eigenvectors, or principal components. In the functional setting, we replace $S$ by a covariance function $s(x, y)$, and $\mathbf{u}$ by a function $\xi$ of $x$, such that the eigenequation becomes

$$\int s(x, y)\,\xi(y)\,dy = \ell\,\xi(x)$$
for a given value of $x$. Noting that after discretisation of the curves the elements of the matrix $S$ are $S_{jk} = s(x_j, x_k)$, where $x_j$ and $x_k$ are any of the discretised points on the fine grid, the integral can be approximated as a summation such that

$$\int s(x, y)\,\xi(y)\,dy \approx w\sum_{k=1}^{P} s(x, x_k)\,\xi_k$$
where $w$ is the spacing between the points on the fine grid, and $\xi_k = \xi(x_k)$ are the discretised values of the function $\xi$. The approximate discrete form of the functional eigenequation is therefore

$$w\,S\,\boldsymbol{\xi} = \ell\,\boldsymbol{\xi}$$
which corresponds to the standard eigenequation with $\mathbf{u} \propto \boldsymbol{\xi}$ and eigenvalues scaled by $w$. Assuming the eigenvectors $\mathbf{u}$ obtained from the standard PCA have been normalised, the equivalent functional constraint that $\int \xi(x)^2\,dx = 1$ is achieved by enforcing $\boldsymbol{\xi} = w^{-1/2}\,\mathbf{u}$. The function $\xi$ is then recovered by interpolating the points $(x_j, \xi_j)$. Assuming the grid is fine enough, the choice of interpolation method is irrelevant.
As with a standard PCA, we will wish to retain only a small number of functional PCs. As is standard practice, the eigenvalues can be used to facilitate this choice, by retaining enough PCs to explain most of the variation in the data. Assuming $K$ PCs are retained, for curve $\hat{\mu}_g$ we have

$$\hat{\mu}_g(x) \approx \bar{\mu}(x) + \sum_{k=1}^{K} c_{gk}\,\xi_k(x)$$

where $\bar{\mu}$ is the mean of the curves and $c_{g1}, \dots, c_{gK}$ are the PC loadings for curve $\hat{\mu}_g$. These can be estimated by minimising the residuals $\int \big(\hat{\mu}_g(x) - \bar{\mu}(x) - \sum_{k=1}^{K} c_{gk}\,\xi_k(x)\big)^2\,dx$, which in practice again requires discretisation of the curve $\hat{\mu}_g$ and the PCs $\xi_k$.
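The discretisation recipe can be sketched end-to-end; synthetic rank-two curves stand in for the fitted gene mean curves, and the eigenvector rescaling by $w^{-1/2}$ enforces the functional normalisation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic smoothed mean curves for G genes on a fine grid of P points.
G_genes, P = 200, 1000
x = np.linspace(0.0, 180.0, P)
w = x[1] - x[0]                                  # grid spacing
scores = rng.normal(size=(G_genes, 2))
Y = scores[:, :1] * (x / 180.0) + scores[:, 1:] * np.sin(np.pi * x / 180.0)

# Standard PCA of the discretised curves.
Yc = Y - Y.mean(axis=0)
S = np.cov(Yc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rescale orthonormal eigenvectors so each recovered eigenfunction
# satisfies the functional constraint: integral of xi(x)^2 dx = 1.
xi = eigvecs / np.sqrt(w)

# PC loadings: inner products of each curve with the eigenfunctions,
# approximated on the grid.
loadings = w * (Yc @ xi[:, :2])
```

Because the synthetic curves are built from exactly two basis functions, the first two eigenvalues account for essentially all of the variation, mirroring the behaviour reported for the case study.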
We fit the functional mixed-effects model described in Section 3 to the example data set described in Section 2, independently for each probe. Convergence of the EM algorithm was confirmed by convergence of the variance component estimates $\hat{R}$ and $\hat{\sigma}^2$ and typically took around 30 iterations. 100 iterations of the simplex optimisation procedure were used to select the smoothing parameters. After obtaining estimates of the mean, age and gender effects, and individual curves, these were assessed for significance. To relieve some of the computational burden, permuted null test statistics were shared across all genes - theoretical results justifying this pooling can be found in Storey et al. (2004). Each gene was permuted 32 times, yielding in excess of 1 million null test statistics for each comparison. From these null distributions, empirical p-values were calculated, which were then corrected for multiple testing using the procedure of Benjamini and Hochberg (1995) to control the FDR at 10%.
After applying multiple testing corrections, no significant age genes were identified, as in the original analysis. 21 probes were found to be gender specific. Two of these 21 probes are located on the Y chromosome but are not mapped to any known genes. The remaining probes correspond to 7 known genes and 2 open reading frames, given in Table 2. Aside from XIST which, as discussed in Section 2, is only expressed in females and is responsible for X-chromosome inactivation to facilitate dosage equivalence between the sexes, all significant genes and the two open reading frames are found on the Y chromosome.
The highest ranked gender-effect gene on an autosomal chromosome was found to be TUBB2A, located on chromosome 6 and ranked 23rd, with an associated FDR of 13%, hence of borderline significance. The gene's fitted mean and gender-effect curves are plotted in Figure 3, where a definite difference between the two groups is apparent, corresponding to between a 3- and 4-fold difference in expression levels.
299 probes were found to be significantly temporally regulated, corresponding to 183 unique, mapped genes. The highest ranking gene was found to be MBP - myelin basic protein - given as one of the examples in Figure 5. Myelin is an insulating sheath covering nerve cells, essential for the correct functioning of the central nervous system, and degradation of myelin can be found in many neurodegenerative diseases such as multiple sclerosis. It is thought that MBP might function to maintain the correct structure of myelin, which may explain why we found it to be seasonally regulated, although we could find no existing evidence of this.
We performed a functional PCA of the gene mean curves. Each curve was discretised into 1,000 equally spaced points, then normalised by subtracting the first observation from the rest of the points. Thus, each curve represents the change in expression levels over time, relative to the first time point. The first two PC functions are given in Figure 6. The first PC accounts for 99.4% of the variation and corresponds to a linear change in expression levels over time. The second PC accounts for 0.5% of the variation and describes expression levels which rise over the first three months before falling for the next three months, or vice versa. As these two PCs represent almost all of the variation in the curves, we estimated the loadings for each gene and plotted the results in Figure 4. Four outliers have been highlighted and each of these is plotted in Figure 5. It can be seen that the outliers in the loadings plot correspond to those genes which change most over time, with the distinctive line of points in the centre corresponding to genes which change linearly. For these genes with linear dynamics, the size of the first PC loading is proportional to the slope. Genes which can be separated on the y-axis are those with a quadratic temporal profile.
A number of different models have been proposed in the literature for the analysis of microarray time series data. One of the earliest examples of an FDA approach to the modelling of microarray time series data was Bar-Joseph et al. (2003), which dealt with the issue of clustering unreplicated data. In their model, the curves were parameterised using B-splines and functional mixed-effects models were used to estimate the cluster mean curves and model the within-cluster variability. In their approach, the function $f$ in (1) represents a given cluster's mean, and the random-effect functions represent the temporal profiles of each of the genes belonging to that cluster. A specialised EM algorithm was used to handle dynamic cluster assignments. A very similar approach was developed independently by Luan and Li (2003).
A problem with Bar-Joseph et al. (2003) and Luan and Li (2003) is that the B-spline parameterisation of the curves requires selecting both the number and location of the knots - breakpoints for the piecewise polynomials - which control the overall smoothness of the fitted curve. As the total number of knots is limited by the number of time points, there is limited scope for controlling the smoothness of the fit. Furthermore, each curve was parameterised using the same number of knots, which may be unable to fully capture the wide range of temporal profiles we are likely to observe. Ma et al. (2006) set out to resolve these issues with their alternative framework for clustering. In their model, the cluster mean curves - $f$ in (1) - are represented using smoothing splines, which place a knot at each design time point and use a roughness penalty to avoid fitted curves which are too ‘wiggly’. One drawback to their approach, however, is that the individual functions are only modelled as scalar shifts rather than smooth curves. This leads to a more parsimonious model which avoids fitting too many parameters but may fail to adequately model the within-cluster variability.
Angelini et al. (2009) adopt a fully Bayesian approach to estimation and testing in unreplicated or cross-sectional microarray data sets. Each gene is represented using Legendre polynomials. Three choices of prior on the noise variance allow for errors which are marginally normal, Student t or double exponentially distributed, although the noise variance is assumed to be the same for all genes. This assumption is unlikely to hold in practice, as a correlation between gene expression intensity and measurement noise is well known (Tusher et al., 2001). Given the fully Bayesian framework, hypothesis testing for differences in expression levels across two biological groups is performed using Bayes factors.
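A minimal sketch of the Legendre parameterisation, using NumPy's `numpy.polynomial.legendre` module; the degree of 3 and the data values are illustrative assumptions, not those of Angelini et al. (2009):

```python
import numpy as np
from numpy.polynomial import legendre

t = np.array([1.0, 14.0, 28.0, 90.0, 180.0])   # design days of the case study
y = np.array([5.2, 5.9, 6.1, 5.4, 5.0])        # hypothetical log-intensities

# Legendre polynomials are orthogonal on [-1, 1], so rescale time first
u = 2.0 * (t - t.min()) / (t.max() - t.min()) - 1.0

coef = legendre.legfit(u, y, deg=3)            # least-squares coefficients
fitted = legendre.legval(u, coef)
```

The low-order expansion keeps the per-gene parameter count small, which is what makes the fully Bayesian treatment of thousands of genes tractable.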
A handful of models and computer packages have also been proposed specifically for longitudinal data. For instance, timecourse is an R package based on Tai and Speed (2006), in which multivariate analysis techniques are applied directly to the vectors of observations. This treatment of time as an unordered categorical variable - found also in ANOVA approaches such as Wang and Kim (2003) - has some significant drawbacks. In particular, the method cannot handle missing data, the results of an analysis are invariant to permutation of the time points, and the time points are assumed to be regularly spaced. Furthermore, the method only ranks the genes, with no guidance given as to how to evaluate significance.
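The permutation-invariance drawback is easy to demonstrate: any statistic that treats the time points as unordered coordinates is unchanged when they are shuffled. The sketch below uses a one-sample Hotelling T² on toy data purely as a stand-in for the multivariate empirical Bayes statistic of Tai and Speed (2006):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))           # 8 replicates, 5 time points (toy data)

def hotelling_t2(X):
    # one-sample Hotelling T^2 against a zero mean vector
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    return n * xbar @ np.linalg.solve(S, xbar)

perm = rng.permutation(5)
t2_original = hotelling_t2(X)
t2_shuffled = hotelling_t2(X[:, perm])   # identical: the time order is ignored
```

A method built on such a statistic cannot distinguish a monotone trend from the same values observed in scrambled order.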
The EDGE method of Storey et al. (2005) is an FDA approach to modelling both longitudinal and cross-sectional microarray data. In their method for longitudinal data analysis, each gene is modelled independently as a separate functional mixed-effects model. The mean curve in (1) is modelled as a B-spline, while the individual effects are treated as scalar shifts as in Ma et al. (2006). A complete framework for detecting genes differentially expressed across two or more biological groupings is presented, with the model estimation performed by an EM algorithm. Differential expression is quantified using an F-type statistic which compares the residuals of a null model, in which the biological groupings are ignored, to those of an alternative model in which the groupings are taken into account. Significance is assessed by using a resampling bootstrap procedure to estimate the null distribution of this F-type statistic, and the multiple testing problem is handled by analysing the empirical p-value histogram (Storey and Tibshirani, 2003) to estimate the positive false discovery rate.
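The F-type construction can be sketched as follows. This is a schematic stand-in, not the EDGE implementation: simple quadratic fits replace the spline machinery, and the data and two-group structure are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.tile(np.array([1.0, 14.0, 28.0, 90.0, 180.0]), 6)   # 6 subjects
group = np.repeat([0, 1], 15)                              # two groups of 3
y = 0.02 * t - 1e-4 * t**2 + 0.3 * group + rng.normal(scale=0.2, size=t.size)

def rss(t, y, deg=2):
    # residual sum of squares of a polynomial fit (stand-in for a spline fit)
    resid = y - np.polyval(np.polyfit(t, y, deg), t)
    return np.sum(resid**2)

rss_null = rss(t, y)                                       # groups ignored
rss_alt = sum(rss(t[group == g], y[group == g]) for g in (0, 1))
f_stat = (rss_null - rss_alt) / rss_alt                    # F-type statistic

# Null distribution by residual resampling (bootstrap sketch)
null_fit = np.polyval(np.polyfit(t, y, 2), t)
resid = y - null_fit
f_null = []
for _ in range(200):
    y_star = null_fit + rng.choice(resid, size=resid.size, replace=True)
    r0 = rss(t, y_star)
    r1 = sum(rss(t[group == g], y_star[group == g]) for g in (0, 1))
    f_null.append((r0 - r1) / r1)
p_value = np.mean(np.array(f_null) >= f_stat)
```

Because the per-group fits can only reduce the residual sum of squares relative to the pooled fit, the statistic is non-negative, and large values indicate group-specific temporal profiles.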
Another way of accounting for the within-gene variance is to perform a functional principal components analysis (fPCA). This is analogous to standard PCA, except that the principal components (PCs) are functions rather than finite dimensional vectors. A number of different methods have been suggested for estimating the PCs in a functional context, including direct estimation in a mixed-effects model framework (James et al., 2000), standard PCA on discretised curves (Ramsay and Silverman, 2005) and 'Principal Components Analysis through Conditional Expectation' (PACE) (Yao et al., 2005). It is this latter approach which Liu and Yang (2009) applied to the analysis of microarray data. However, PACE was originally proposed for data where the observations on each individual are taken at different time points - as in the case of growth curve data - whereas, in our experience, microarray experiments tend to have much more regular designs, with each individual observed at the same time points, although these may indeed be unequally spaced.
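The 'discretised curves' route of Ramsay and Silverman (2005) is the simplest of the three to sketch: when every individual is observed on the same grid, the singular vectors of the centred data matrix serve as discretised PC functions. The curves below are toy data, constructed from two smooth modes plus noise:

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 20)
n = 30

# curves = common mean + random amplitudes on two smooth modes + noise
scores = rng.normal(size=(n, 2)) * [1.0, 0.3]
modes = np.vstack([np.sin(np.pi * grid), np.cos(2 * np.pi * grid)])
curves = np.sin(2 * grid) + scores @ modes \
    + rng.normal(scale=0.05, size=(n, grid.size))

centred = curves - curves.mean(axis=0)
_, svals, vt = np.linalg.svd(centred, full_matrices=False)
pc_functions = vt[:2]                        # first two discretised PC functions
explained = svals[:2]**2 / np.sum(svals**2)  # proportion of variance explained
```

With irregular or individual-specific sampling this shortcut breaks down, which is precisely the situation PACE was designed for.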
Some key shortcomings of these methods should be noted. Firstly, none of them can incorporate the gender and age covariates, particularly as functions of time. Secondly, all of these approaches use B-splines and/or model the individual 'functions' as scalar shifts, both of which lead to inflexible models. Finally, we are not aware of any existing methods which address the issue of modelling both the within- and between-gene variation. Our proposed methodology addresses some of these limitations.
Our results for the case study presented in Section 2 can be compared to the original findings of Karlovich et al. (2009), who used a non-functional mixed-effects model. Karlovich et al. (2009) list 19 probes they detected as having a significant gender effect and a log-transformed signal intensity greater than 7, which we have reproduced here in Table 1. No justification for this cut-off of 7 is provided, and the filter gives misleading results. For instance, all of the significant gender genes we identified fail to meet the cut-off. This is because the mean log-transformed signal intensity is taken across both genders, and all of our genes aside from XIST are found on the Y-chromosome and are hence completely unexpressed in females.
We were unable to find any confirmation in the literature that TUBB2A is a sex-related gene, and it does not appear among the probes given by Karlovich et al. (2009), despite meeting their cut-off criteria with a mean log-transformed signal intensity of 7.4. It is possible that they removed the probe from their analysis because the residuals from their model were found to be non-normally distributed. Indeed, the unadjusted p-value for the Shapiro-Wilk test on the residuals of our model for this probe is . However, the residual analysis plotted in Figure 2 shows one very large outlier. This observation corresponds to subject 174 at Day 90. Subject 174 is the individual who developed lung cancer between Days 28 and 90, and died prior to Day 180. If this observation is removed then the unadjusted Shapiro-Wilk p-value is , and the null hypothesis that the residuals are normally distributed is no longer rejected. Hence, TUBB2A may indeed be a novel gender-regulated gene.
The number of temporally regulated genes we identified is consistent with Karlovich et al. (2009), although their method for identifying differentially expressed genes is quite dissimilar to ours (see Section 2). Indeed, although they found 66 significant genes associated with apoptosis, we found only 15, suggesting that the specific genes identified may vary more widely than the overall numbers suggest.
In this paper we have demonstrated a complete framework for the analysis of microarray time series data. The unique characteristics of microarray data lend themselves well to a functional data analysis approach, and we have shown how this naturally extends to the inclusion of covariates such as age and sex. The model presented here is a specialisation of the more general functional mixed-effects model (Rice and Wu, 2001; Guo, 2002) and, to the best of our knowledge, we are the first to show how to derive the maximum likelihood estimators, EM algorithm, confidence intervals and smoother matrix with more than one fixed-effects function.
We were motivated by a real data set and have aimed to improve upon the existing results with a more flexible model. By taking a roughness penalty approach, this is achieved while avoiding overfitting, allowing for a departure from the original linear mixed-effects model when the data permit it. A deeper biological interpretation is required to fully assess our success here, but the results highlighted in this paper suggest that we can readily attach meaning to our findings. It may also prove worthwhile to perform a comparative analysis with Eady et al. (2005), another, similar longitudinal study conducted over a shorter period of five weeks.
A.1 Example incidence matrix
In our example data set, there are 5 design time points: Days 1, 14, 28, 90 and 180. Therefore, the incidence matrices for all individuals have 5 columns. The first column corresponds to observations at Day 1, the second to observations at Day 14, and so on. The rows correspond to the specific observations on a particular individual. If the individual is observed once at each design time point then, assuming their vector of observations has been ordered according to the time points, the incidence matrix is simply the $5 \times 5$ identity matrix.
Now consider the case of subject 174, who died prior to Day 180 and hence contributed only 4 observations, one at each of the remaining design time points. The incidence matrix for this individual has 4 rows, corresponding to the 4 observations, but still has 5 columns, corresponding to the design time points. Specifically, the incidence matrix in this case is:

$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

Note how there is no 1 in the final column, which would correspond to an observation at Day 180.
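The construction of such incidence matrices can be sketched in a few lines; the helper function below is illustrative, not part of any package:

```python
import numpy as np

# Design time points (days) of the example data set
design = [1, 14, 28, 90, 180]

def incidence(observed, design=design):
    """Build an incidence matrix: rows index an individual's observations,
    columns index the design time points, and a 1 marks which design point
    each observation falls on."""
    X = np.zeros((len(observed), len(design)), dtype=int)
    for row, t in enumerate(observed):
        X[row, design.index(t)] = 1
    return X

full = incidence([1, 14, 28, 90, 180])   # identity: one obs per design point
subj_174 = incidence([1, 14, 28, 90])    # missing Day 180: 4 x 5, last column zero
```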
A.2 Specification of roughness matrix
Green and Silverman (1994) show that there is a straightforward way to calculate the roughness matrix $K$ for a smoothing spline given the set of distinct time points $t_1 < t_2 < \dots < t_n$. The roughness matrix is given as $K = Q R^{-1} Q^{T}$ where the matrices $Q$ and $R$ are defined as follows. First calculate $h_i = t_{i+1} - t_i$, $i = 1, \dots, n-1$, the differences between successive time points. Then $Q$ is an $n \times (n-2)$ matrix whose entries $q_{ij}$, with columns indexed by $j = 2, \dots, n-1$, are given by

$$q_{j-1,j} = \frac{1}{h_{j-1}}, \qquad q_{jj} = -\frac{1}{h_{j-1}} - \frac{1}{h_j}, \qquad q_{j+1,j} = \frac{1}{h_j}$$

for $j = 2, \dots, n-1$, and $q_{ij} = 0$ elsewhere. $R$ is an $(n-2) \times (n-2)$ symmetric matrix with the entries given by

$$r_{jj} = \frac{h_{j-1} + h_j}{3}, \qquad r_{j,j+1} = r_{j+1,j} = \frac{h_j}{6},$$

with all other entries zero.
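Assuming the standard Green and Silverman construction of $K = Q R^{-1} Q^{T}$ from the gaps between successive time points, the roughness matrix can be computed directly; the sketch below uses the five design days of the case study:

```python
import numpy as np

t = np.array([1.0, 14.0, 28.0, 90.0, 180.0])   # design days
n = t.size
h = np.diff(t)                                  # h_i = t_{i+1} - t_i

Q = np.zeros((n, n - 2))
R = np.zeros((n - 2, n - 2))
for j in range(1, n - 1):            # 0-based index of each interior time point
    c = j - 1                        # corresponding column of Q / index into R
    Q[j - 1, c] = 1.0 / h[j - 1]
    Q[j, c] = -1.0 / h[j - 1] - 1.0 / h[j]
    Q[j + 1, c] = 1.0 / h[j]
    R[c, c] = (h[j - 1] + h[j]) / 3.0
    if j < n - 2:
        R[c, c + 1] = R[c + 1, c] = h[j] / 6.0

K = Q @ np.linalg.solve(R, Q.T)      # roughness matrix K = Q R^{-1} Q^T
```

As a sanity check, $K$ is symmetric and annihilates constant and linear functions of time, as any penalty on second derivatives must.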
A.3 Derivation of conditional expectations
We begin by considering the posterior expectation of which, using basic properties of expectations, can be rewritten as:
The definition of covariance allows us to write:
The problem is now to determine the mean and covariance of , for which we use a standard result concerning the multivariate normal distribution (see, for example, Anderson, 1958): for any random vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ jointly distributed as

$$\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),$$

the conditional distribution of $\mathbf{x}_1$ given $\mathbf{x}_2$ is

$$\mathbf{x}_1 \mid \mathbf{x}_2 \sim N\left( \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).$$
If we let and , and derive the covariance of and as then we have
Recognising that, because and are block diagonal and , we have
and we can now write
For the posterior expectation of , we follow exactly the same approach, writing
and using the identity
allows us to derive
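The multivariate normal conditioning identity used in the derivation above can be sanity-checked numerically by Monte Carlo; the mean vector and covariance below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -1.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)              # an arbitrary valid covariance matrix

draws = rng.multivariate_normal(mu, Sigma, size=400_000)
x1, x2 = draws[:, 0], draws[:, 1:]

# Keep draws whose x2 lands near a fixed value, then compare the empirical
# conditional mean of x1 with the closed-form conditional mean
x2_star = np.array([0.0, 1.0])
near = np.all(np.abs(x2 - x2_star) < 0.25, axis=1)
empirical = x1[near].mean()

# E[x1 | x2 = x2*] = mu1 + Sigma12 Sigma22^{-1} (x2* - mu2)
S12, S22 = Sigma[0, 1:], Sigma[1:, 1:]
closed_form = mu[0] + S12 @ np.linalg.solve(S22, x2_star - mu[1:])
```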
- Anderson (1958) Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley.
- Angelini et al. (2009) Angelini, C., D. D. Canditiis, and M. Pensky (2009). Bayesian models for two-sample time-course microarray experiments. Computational Statistics & Data Analysis 53(5), 1547–1565. Statistical Genetics & Statistical Genomics: Where Biology, Epistemology, Statistics, and Computation Collide.
- Bar-Joseph et al. (2003) Bar-Joseph, Z., G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon (2003). Continuous representations of time-series gene expression data. Journal of Computational Biology 10(3-4), 341–356.
- Benjamini and Hochberg (1995) Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300.
- Buja et al. (1989) Buja, A., T. Hastie, and R. Tibshirani (1989). Linear smoothers and additive models. The Annals of Statistics 17(2), 453–510.
- Calvano et al. (2005) Calvano, S. E., W. Xiao, D. R. Richards, R. M. Felciano, H. V. Baker, R. J. Cho, R. O. Chen, B. H. Brownstein, J. P. Cobb, S. K. Tschoeke, C. Miller-Graziano, L. L. Moldawer, M. N. Mindrinos, R. W. Davis, R. G. Tompkins, S. F. Lowry, and the Inflammation and Host Response to Injury Large Scale Collaborative Research Program (2005, October). A network-based analysis of systemic inflammation in humans. Nature 437(7061), 1032–1037.
- Eady et al. (2005) Eady, J. J., G. M. Wortley, Y. M. Wormstone, J. C. Hughes, S. B. Astley, R. J. Foxall, J. F. Doleman, and R. M. Elliott (2005). Variation in gene expression profiles of peripheral blood mononuclear cells from healthy volunteers. Physiol. Genomics 22(3), 402–411.
- Green and Silverman (1994) Green, P. J. and B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models. Chapman and Hall.
- Guo (2002) Guo, W. (2002). Functional mixed effects models. Biometrics 58(1), 121–128.
- Harville (1977) Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72(358), 320–338.
- James et al. (2000) James, G., T. Hastie, and C. Sugar (2000). Principal component models for sparse functional data. Biometrika 87(3), 587–602.
- Karlovich et al. (2009) Karlovich, C., G. Duchateau-Nguyen, A. Johnson, P. McLoughlin, M. Navarro, C. Fleurbaey, L. Steiner, M. Tessier, T. Nguyen, M. Wilhelm-Seiler, and J. Caulfield (2009). A longitudinal study of gene expression in healthy individuals. BMC Medical Genomics 2(1), 33.
- Laird and Ware (1982) Laird, N. M. and J. H. Ware (1982). Random-effects models for longitudinal data. Biometrics 38, 963–974.
- Lee (2003) Lee, T. C. M. (2003). Smoothing parameter selection for smoothing splines: a simulation study. Computational Statistics & Data Analysis 42(1-2), 139–148.
- Liu and Yang (2009) Liu, X. and M. C. K. Yang (2009). Identifying temporally differentially expressed genes through functional principal components analysis. Biostatistics, kxp022.
- Luan and Li (2003) Luan, Y. and H. Li (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19(4), 474–482.
- Ma et al. (2006) Ma, P., C. I. Castillo-Davis, W. Zhong, and J. S. Liu (2006). A data-driven clustering method for time course gene expression data. Nucleic Acids Res 34(4), 1261–1269.
- Nelder and Mead (1965) Nelder, J. A. and R. Mead (1965). A Simplex Method for Function Minimization. The Computer Journal 7(4), 308–313.
- Ramsay and Silverman (2005) Ramsay, J. and B. W. Silverman (2005). Functional Data Analysis (2 ed.). New York: Springer.
- Rice and Wu (2001) Rice, J. and C. Wu (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics 57, 253–259(7).
- Robinson (1991) Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6(1), 15–32.
- Spellman et al. (1998) Spellman, P. T., G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9(12), 3273–3297.
- Storey et al. (2004) Storey, J. D., J. E. Taylor, and D. Siegmund (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 66(1), 187–205.
- Storey and Tibshirani (2003) Storey, J. D. and R. Tibshirani (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100(16), 9440–9445.
- Storey et al. (2005) Storey, J. D., W. Xiao, J. T. Leek, R. G. Tompkins, and R. W. Davis (2005, Sep). Significance analysis of time course microarray experiments. Proc Natl Acad Sci U S A 102(36), 12837–12842.
- Tai and Speed (2006) Tai, Y. C. and T. P. Speed (2006). A multivariate empirical bayes statistic for replicated microarray time course data. Annals of Statistics 34(5), 2387–2412.
- Tang et al. (2004) Tang, Y., A. Lu, R. Ran, B. J. Aronow, E. K. Schorry, R. J. Hopkin, D. L. Gilbert, T. A. Glauser, A. D. Hershey, N. W. Richtand, M. Privitera, A. Dalvi, A. Sahay, J. P. Szaflarski, D. M. Ficker, N. Ratner, and F. R. Sharp (2004). Human blood genomics: distinct profiles for gender, age and neurofibromatosis type 1. Molecular Brain Research 132(2), 155–167. Neurogenomics.
- Tusher et al. (2001) Tusher, V. G., R. Tibshirani, and G. Chu (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America 98(9), 5116–5121.
- Wahba (1977) Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis 14(4), 651–667.
- Wang and Kim (2003) Wang, J. and S. K. Kim (2003). Global analysis of dauer gene expression in Caenorhabditis elegans. Development 130(8), 1621–1634.
- Whitney et al. (2003) Whitney, A. R., M. Diehn, S. J. Popper, A. A. Alizadeh, J. C. Boldrick, D. A. Relman, and P. O. Brown (2003). Individuality and variation in gene expression patterns in human blood. Proceedings of the National Academy of Sciences of the United States of America 100(4), 1896–1901.
- Wu and Zhang (2006) Wu, H. and J.-T. Zhang (2006). Nonparametric Regression Methods for Longitudinal Data Analysis. Wiley.
- Yao et al. (2005) Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470), 577–590.