Derivative Principal Component Analysis for Representing the Time Dynamics of Longitudinal and Functional Data
Short title: Derivative Principal Component Analysis
Accepted manuscript by Statistica Sinica
Xiongtao Dai, Hans-Georg Müller
Department of Statistics
University of California
Davis, CA 95616 USA
Mountain View, CA 94041 USA
March 6, 2018
We propose a nonparametric method to explicitly model and represent the derivatives of smooth underlying trajectories for longitudinal data. This representation is based on a direct Karhunen–Loève expansion of the unobserved derivatives and leads to the notion of derivative principal component analysis, which complements functional principal component analysis, one of the most popular tools of functional data analysis. The proposed derivative principal component scores can be obtained for irregularly spaced and sparsely observed longitudinal data, as typically encountered in biomedical studies, as well as for functional data which are densely measured. Novel consistency results and asymptotic convergence rates for the proposed estimates of the derivative principal component scores and other components of the model are derived under a unified scheme for sparse or dense observations and mild conditions. We compare the proposed representations for derivatives with alternative approaches in simulation settings and also in a wallaby growth curve application. It emerges that representations using the proposed derivative principal component analysis recover the underlying derivatives more accurately compared to principal component analysis-based approaches especially in settings where the functional data are represented with only a very small number of components or are densely sampled. In a second wheat spectra classification example, derivative principal component scores were found to be more predictive for the protein content of wheat than the conventional functional principal component scores.
Keywords: Derivatives, Empirical dynamics, Functional principal component analysis, Growth curves, Best linear unbiased prediction.
1 Introduction
Estimating derivatives and representing the dynamics of longitudinal data is often crucial for describing and understanding the time dynamics that generate such data (Müller and Yao (2010)). Representing derivatives, however, is not straightforward. Efficient representations of derivatives can be based on expansions into eigenfunctions of derivative processes, but difficulties abound in scenarios with sparse designs and noisy data. To address these issues, we propose a method for representing the derivatives of observed longitudinal data by directly targeting the Karhunen–Loève expansion (Grenander (1950)) of derivative processes. Classical methods for estimating derivatives of random trajectories usually require densely sampled observations. These methods include difference quotients, estimates based on B-splines (de Boor (1972)), smoothing splines (Chambers and Hastie (1991); Zhou and Wolfe (2000)), kernel-based estimators such as convolution-type kernel estimators (Gasser and Müller (1984)), and local polynomial estimators (Fan and Gijbels (1996)). In the case of sparsely and irregularly observed data, however, direct estimation of derivatives for each single function is not feasible due to the gaps in the measurement times.
For the case of irregular and sparse designs, Liu and Müller (2009) proposed a method based on functional principal component analysis (FPCA) (Rice and Silverman (1991); Ramsay and Silverman (2005)) for estimating derivatives. The central idea of FPCA is dimension reduction by means of a spectral decomposition of the autocovariance operator, which yields functional principal component scores (FPCs) as coefficient vectors to represent the random curves in the sample. In Liu and Müller (2009), derivatives of eigenfunctions are estimated and plugged in to obtain derivatives of the estimated Karhunen–Loève representation for the random trajectories. While this method was shown to outperform several other approaches for recovering derivatives for sparse and irregular designs, including those using difference quotients, functional mixed effect models with B-spline basis functions (Shi et al. (1996); Rice and Wu (2001)), or P-splines (Jank and Shmueli (2005); Reddy and Dass (2006); Bapna et al. (2008); Wang et al. (2008)), it is suboptimal for representing derivatives, as the coefficients in the Karhunen–Loève expansion are targeting the functions themselves and not the derivatives.
This provides the key motivation for this paper: represent dynamics by directly targeting the Karhunen–Loève representation of derivatives of random trajectories. The Karhunen–Loève representation of derivatives needs to be inferred from available data, which are modeled as noisy measurements of the trajectories. We then aim to represent derivatives directly in their corresponding eigenbasis, yielding the most parsimonious representation, and leading to a novel Derivative Principal Component Analysis (DPCA). In addition, the resulting expansion coefficients, which we refer to as derivative principal component scores (DPCs) provide a novel representation and dimension reduction tool for functional data that complements other such representations such as the commonly used FPCs.
The proposed method is designed for both sparse and dense designs and works successfully in both cases. When the functional data are densely sampled with possibly large measurement errors, smoothing the observed trajectories and obtaining derivatives for each trajectory separately is subject to possibly large estimation errors, which are further amplified for derivatives. In contrast, the proposed method pools observations across subjects and utilizes information from measurements at nearby time points from all subjects when targeting derivatives, and therefore is less affected by large measurement errors. In scenarios where only a few measurements are available for each subject, the proposed method performs derivative estimation by borrowing strength from all observed data points, instead of relying on the sparse data observed for a specific trajectory. A key step is to model and estimate the eigenfunctions of the random derivative functions directly, through a spectral decomposition of the covariance function of the derivative trajectories.
The main novelty of our work is to obtain empirical Karhunen–Loève representations for the dynamics of both sparsely measured longitudinal data and densely measured functional data, and to obtain the DPCA with corresponding DPCs. For the estimation of these DPCs, we employ a best linear unbiased prediction (BLUP) method that directly predicts the DPCs based on the observed measurements. In the special case of a Gaussian process with independent Gaussian measurement errors, the BLUP method coincides with the best prediction. This unified approach provides a straightforward implementation for the Karhunen–Loève representation of derivatives. Under a unified framework for the sparse and the dense case, we provide convergence rate results for the derivatives of the mean function, the covariance function, and the derivative eigenfunctions based on smoothing the pooled scatter plots (Zhang and Wang (2016)). We also derive convergence rates for the estimated DPCs based on BLUP.
The remainder of this paper is structured as follows: In Section 2, we introduce the new representations for derivatives. DPCs and their estimation are the topic of Section 3. Asymptotic properties of the estimated components and of the resulting derivative estimates are presented in Section 4. We compare the proposed method with alternative approaches in terms of derivative recovery in Section 5 via simulation studies and in Section 6.1 using longitudinal wallaby body length data. As demonstrated in Section 6.2, DPCs can also be used to improve classification of functional data, illustrated with wheat spectral data. Additional details are provided in the Appendix.
2 Representing Derivatives of Random Trajectories
Consider a $\nu$-times differentiable stochastic process $X$ on a compact interval $\mathcal{T}$, with $E\int_{\mathcal{T}} X^2(t)\,dt < \infty$, mean function $\mu(t) = E\{X(t)\}$, and auto-covariance function $G(s,t) = \mathrm{cov}\{X(s), X(t)\}$, for $s, t \in \mathcal{T}$. The independent realizations $X_i$, $i = 1, \ldots, n$, of $X$ can be represented in the Karhunen–Loève expansion
$$X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik}\,\phi_k(t), \qquad (1)$$
where $\xi_{ik} = \int_{\mathcal{T}} \{X_i(t) - \mu(t)\}\,\phi_k(t)\,dt$ are the functional principal component scores (FPCs) of the random functions $X_i$ that satisfy $E(\xi_{ik}) = 0$ and $E(\xi_{ik}\xi_{il}) = \lambda_k\,\delta_{kl}$, for $k, l = 1, 2, \ldots$; the $\phi_k$ are the eigenfunctions of the covariance operator associated with $G$, with ordered eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$.
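For densely observed curves, the components of this expansion can be approximated by discretizing the covariance operator. The following sketch (Python with NumPy) illustrates the computation; the equidistant grid and Riemann-sum integration are simplifying assumptions, not the estimation scheme of the paper.

```python
import numpy as np

def fpca_dense(X, t, n_components=3):
    """FPCA for densely observed curves: rows of X are curves on grid t.

    Returns the mean curve, leading eigenvalues, eigenfunctions normalized
    so that sum(phi**2) * dt = 1, and the FPC scores.
    """
    dt = t[1] - t[0]                       # assume an equidistant grid
    mu = X.mean(axis=0)                    # pointwise mean function
    Xc = X - mu                            # centered curves
    G = (Xc.T @ Xc) / (X.shape[0] - 1)     # sample covariance surface G(s, t)
    # Eigendecomposition of the discretized covariance operator.
    vals, vecs = np.linalg.eigh(G * dt)
    order = np.argsort(vals)[::-1][:n_components]
    lam = vals[order]                      # eigenvalues, descending
    phi = vecs[:, order].T / np.sqrt(dt)   # L2-normalized eigenfunctions
    # FPC scores: xi_ik = integral of (X_i - mu) * phi_k (Riemann sum)
    xi = Xc @ phi.T * dt
    return mu, lam, phi, xi
```

For sparse designs, scores would instead be predicted by conditioning, as discussed later in the paper.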
Differentiating the Karhunen–Loève expansion term by term yields
$$X_i^{(\nu)}(t) = \mu^{(\nu)}(t) + \sum_{k=1}^{\infty} \xi_{ik}\,\phi_k^{(\nu)}(t), \qquad (2)$$
assuming that both sides are well defined, with corresponding variance $\mathrm{var}\{X^{(\nu)}(t)\} = \sum_{k=1}^{\infty} \lambda_k \{\phi_k^{(\nu)}(t)\}^2$. One can then estimate derivatives by approximating (2) with the first $K$ components,
$$X_i^{(\nu)}(t) \approx \mu^{(\nu)}(t) + \sum_{k=1}^{K} \xi_{ik}\,\phi_k^{(\nu)}(t). \qquad (3)$$
In (2) and (3), $\mu^{(\nu)}$ is the $\nu$-th derivative of the mean function and can be estimated by local polynomial fitting applied to a pooled scatterplot where one aggregates all the observed measurements from all sample trajectories. The FPCs $\xi_{ik}$ of the sample trajectories can be estimated with the principal analysis by conditional expectation (PACE) approach described in Yao et al. (2005). Starting from the eigenequations with orthonormality constraints, under suitable regularity conditions, by taking derivatives on both sides one obtains targets and respective estimates,
$$\phi_k^{(\nu)}(t) = \frac{1}{\lambda_k} \int_{\mathcal{T}} G^{(0,\nu)}(s,t)\,\phi_k(s)\,ds, \qquad \hat\phi_k^{(\nu)}(t) = \frac{1}{\hat\lambda_k} \int_{\mathcal{T}} \hat G^{(0,\nu)}(s,t)\,\hat\phi_k(s)\,ds,$$
where $G^{(0,\nu)}(s,t) = \partial^{\nu} G(s,t)/\partial t^{\nu}$ is the $\nu$-th partial derivative, $\hat G^{(0,\nu)}$ is a smooth estimate of $G^{(0,\nu)}$ obtained for example by local polynomial smoothing, and $\hat\phi_k$ is an estimate of the $k$-th eigenfunction. The derivative $X_i^{(\nu)}$ is thus represented by
$$\hat X_i^{(\nu)}(t) = \hat\mu^{(\nu)}(t) + \sum_{k=1}^{K} \hat\xi_{ik}\,\hat\phi_k^{(\nu)}(t),$$
where the integer $K$ is chosen in a data-adaptive fashion, for example by cross-validation (Rice and Silverman (1991)), AIC (Shibata (1981)), BIC (Schwarz (1978)), or the fraction of variance explained (Liu and Müller (2009)).
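As a concrete illustration of this plug-in representation, the following sketch (Python/NumPy) assembles derivative curves from a mean function, eigenfunctions, and FPC scores; numerical differentiation via `np.gradient` stands in for the local-polynomial derivative estimates used in the text.

```python
import numpy as np

def fpca_derivative(mu, phi, xi, t, K=None):
    """Plug-in derivatives: X_i'(t) ~ mu'(t) + sum_k xi_ik * phi_k'(t).

    mu: (m,) mean on grid t; phi: (K_max, m) eigenfunctions; xi: (n, K_max)
    FPC scores. Derivatives of mu and phi_k are taken numerically here,
    standing in for smoothing-based estimates.
    """
    K = phi.shape[0] if K is None else K
    dmu = np.gradient(mu, t)                 # estimate of mu'
    dphi = np.gradient(phi[:K], t, axis=1)   # estimates of phi_k'
    return dmu + xi[:, :K] @ dphi            # (n, m) derivative curves
```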
A conceptual problem with this approach is that the eigenfunction derivatives $\phi_k^{(\nu)}$, $k = 1, 2, \ldots$, are not the orthogonal eigenfunctions of the derivatives $X^{(\nu)}$. Consequently this approach does not lead to the Karhunen–Loève expansion of derivatives, and therefore is suboptimal in terms of parsimony. This motivates our next goal, to obtain the Karhunen–Loève representation for derivatives.
2.2 Karhunen–Loève Representation for Derivatives
To obtain the Karhunen–Loève representation for derivatives, consider the covariance function of $X^{(\nu)}$, $G_\nu(s,t) = \mathrm{cov}\{X^{(\nu)}(s), X^{(\nu)}(t)\}$, a symmetric, positive definite and continuous function on $\mathcal{T} \times \mathcal{T}$. The associated autocovariance operator for $X^{(\nu)}$ is a linear Hilbert–Schmidt operator with eigenvalues denoted by $\lambda_{\nu k}$ and orthogonal eigenfunctions $\phi_{\nu k}$, $k = 1, 2, \ldots$. This leads to the representation
$$G_\nu(s,t) = \sum_{k=1}^{\infty} \lambda_{\nu k}\,\phi_{\nu k}(s)\,\phi_{\nu k}(t), \qquad (4)$$
with $\lambda_{\nu 1} \ge \lambda_{\nu 2} \ge \cdots \ge 0$, and to the Karhunen–Loève representation for the derivatives $X_i^{(\nu)}$,
$$X_i^{(\nu)}(t) = \mu^{(\nu)}(t) + \sum_{k=1}^{\infty} \eta_{ik}\,\phi_{\nu k}(t), \qquad (5)$$
with DPCs $\eta_{ik} = \int_{\mathcal{T}} \{X_i^{(\nu)}(t) - \mu^{(\nu)}(t)\}\,\phi_{\nu k}(t)\,dt$, satisfying $E(\eta_{ik}) = 0$ and $E(\eta_{ik}\eta_{il}) = \lambda_{\nu k}\,\delta_{kl}$, for $k, l = 1, 2, \ldots$. For practical applications, one employs a truncated Karhunen–Loève representation
$$X_i^{(\nu)}(t) \approx \mu^{(\nu)}(t) + \sum_{k=1}^{K} \eta_{ik}\,\phi_{\nu k}(t), \qquad (6)$$
with a finite $K$.
In contrast to (2), where derivatives of eigenfunctions are used in conjunction with the FPCs of the processes $X_i$ to represent $X_i^{(\nu)}$, the proposed approach is based on the derivative principal component scores (DPCs) $\eta_{ik}$ and the derivative eigenfunctions $\phi_{\nu k}$. The proposed representation (6) is more efficient in representing $X_i^{(\nu)}$ than using (3), as the eigenbasis of $X^{(\nu)}$ minimizes the integrated mean squared truncation error among $K$-component expansions,
$$E\int_{\mathcal{T}} \Big\{X_i^{(\nu)}(t) - \mu^{(\nu)}(t) - \sum_{k=1}^{K} \eta_{ik}\,\phi_{\nu k}(t)\Big\}^2 dt \;\le\; E\int_{\mathcal{T}} \Big\{X_i^{(\nu)}(t) - \mu^{(\nu)}(t) - \sum_{k=1}^{K} \xi_{ik}\,\phi_k^{(\nu)}(t)\Big\}^2 dt. \qquad (7)$$
The eigenfunctions $\phi_{\nu k}$ of the derivatives can be obtained by the spectral decomposition of $G_\nu$. Let $G^{(\nu,\nu)}(s,t) = \partial^{2\nu} G(s,t)/\partial s^{\nu} \partial t^{\nu}$. Under regularity conditions,
$$G_\nu(s,t) = G^{(\nu,\nu)}(s,t). \qquad (8)$$
To fully implement our approach, we need to identify the components of the representation (5), as described in the next subsection.
2.3 Sampling Model and BLUP
The sampling model needs to reflect that longitudinal data are typically sparsely sampled with random locations of the design points, while functional data such as the spectral data discussed in Section 6.2 are sampled on a dense grid of design points. We assume that for the $i$-th trajectory $X_i$, one obtains $N_i$ measurements made at random times $T_{ij}$, $j = 1, \ldots, N_i$, for $i = 1, \ldots, n$, where for sparse longitudinal designs the number of observations per subject is bounded, while for dense functional designs $\min_i N_i \to \infty$. For both scenarios the observed data are assumed to be generated as
$$Y_{ij} = X_i(T_{ij}) + \varepsilon_{ij},$$
where the $\varepsilon_{ij}$ are i.i.d. measurement errors with $E(\varepsilon_{ij}) = 0$ and $\mathrm{var}(\varepsilon_{ij}) = \sigma^2$, independent of the $X_i$, and the $T_{ij}$ are generated according to some fixed density $f$ with suitable properties. All expected values in the following are interpreted as conditional on the random locations $T_{ij}$, which is not explicitly indicated in the following.
Let $\Sigma_i$ be an $N_i \times N_i$ matrix representing the covariance of $Y_i = (Y_{i1}, \ldots, Y_{iN_i})^T$, with $(j,l)$-th element $G(T_{ij}, T_{il}) + \sigma^2 \delta_{jl}$, where $\delta_{jl} = 1$ if $j = l$ and 0 otherwise. In addition, $\mu_i = \{\mu(T_{i1}), \ldots, \mu(T_{iN_i})\}^T$ is a vector obtained by evaluating the mean function at the vector of measurement times, and $b_{ik}$ is a column vector of length $N_i$ with $j$-th element $\mathrm{cov}(\eta_{ik}, Y_{ij})$, $j = 1, \ldots, N_i$, where
$$\mathrm{cov}(\eta_{ik}, Y_{ij}) = \int_{\mathcal{T}} G^{(\nu,0)}(s, T_{ij})\,\phi_{\nu k}(s)\,ds, \qquad G^{(\nu,0)}(s,t) = \frac{\partial^{\nu} G(s,t)}{\partial s^{\nu}}.$$
For the prediction of the DPCs $\eta_{ik}$, we use the best linear unbiased predictors (BLUP; Rice and Wu (2001))
$$\tilde\eta_{ik} = b_{ik}^T\,\Sigma_i^{-1}\,(Y_i - \mu_i), \qquad (11)$$
which are always well defined without distributional assumptions. In the special case that errors and processes are jointly Gaussian, $\tilde\eta_{ik}$ is the conditional expectation of $\eta_{ik}$ given $Y_i$, which is the optimal prediction of $\eta_{ik}$ under squared error loss.
3 Estimation of Derivative Principal Components
For estimation, we provide details for the most important case of the first derivative, $\nu = 1$. Higher order derivatives are handled similarly. By (5), approximate derivative representations are given by
$$X_i^{(1)}(t) \approx \mu^{(1)}(t) + \sum_{k=1}^{K} \eta_{ik}\,\phi_{1k}(t), \qquad (12)$$
with approximation errors $\sum_{k > K} \eta_{ik}\,\phi_{1k}(t)$, the convergence rate of which is determined by the decay rate of the eigenvalues $\lambda_{1k}$. We then obtain plug-in estimates $\hat X_i^{(1)}$ for $X_i^{(1)}$ by substituting $\mu^{(1)}$, $\eta_{ik}$, and $\phi_{1k}$ in (12) with corresponding estimates. Here we obtain $\hat\mu^{(1)}$ by applying local polynomial smoothing to a pooled scatterplot that aggregates the observed measurements from all sample trajectories, and $\hat\phi_{1k}$ and $\hat\lambda_{1k}$ by spectral decomposition of $\hat G^{(1,1)}$, where $\hat G^{(1,1)}$ is the estimate of the mixed first-order partial derivative of $G$, obtained by two-dimensional local polynomial smoothing. For more details and related discussion about the estimates $\hat\mu^{(1)}$ and $\hat\phi_{1k}$ we refer to Appendix A.1.
Estimating the DPCs $\eta_{ik}$ is an essential step for representing derivatives as in (12). From the definition $\eta_{ik} = \int_{\mathcal{T}} \{X_i^{(1)}(t) - \mu^{(1)}(t)\}\,\phi_{1k}(t)\,dt$, it seems plausible to obtain estimates by plugging in estimated components and using numerical integration. However, this approach requires that one already has derivative estimates $\hat X_i^{(1)}$, which is not viable, especially for sparse/longitudinal designs.
An alternative approach is to construct BLUPs from the observed measurements $Y_{ij}$ made at the time points $T_{ij}$, as in (11), where all required model components can be consistently estimated. Applying (11) for $\nu = 1$, the BLUP estimate for $\eta_{ik}$ given the observations $Y_i$ is
$$\hat\eta_{ik} = \hat b_{ik}^T\,\hat\Sigma_i^{-1}\,(Y_i - \hat\mu_i), \qquad (14)$$
where $\hat b_{ik}$ and $\hat\Sigma_i$ are obtained by substituting the estimates $\hat G$, $\hat G^{(1,0)}$, $\hat\phi_{1k}$, $\hat\mu$, and $\hat\sigma^2$ for the corresponding unknown quantities in $b_{ik}$ and $\Sigma_i$.
When joint Gaussianity of the processes $X_i$ and the errors $\varepsilon_{ij}$ holds, $\hat\eta_{ik}$ estimates the conditional expectation of $\eta_{ik}$ given $Y_i$, the best prediction. The required estimate $\hat G^{(1,0)}$ of the partial derivative of the covariance function and the eigenfunction estimates $\hat\phi_{1k}$ can be obtained by local polynomial smoothing (Liu and Müller (2009), equation (7)). An estimate $\hat\sigma^2$ of the error variance can be obtained using the method described in equation (2) of Yao et al. (2005).
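To make the BLUP computation just described concrete, the following sketch (Python/NumPy) evaluates the predictor from model components discretized on a grid. All inputs are assumed to be already-estimated quantities, and observation times lying exactly on the grid is a simplifying assumption.

```python
import numpy as np

def blup_dpc(y, tij, t, mu, G, G10, phi1, sigma2):
    """BLUP of one subject's DPCs from its sparse noisy observations.

    y: (m,) observations at times tij (assumed to lie on the grid t);
    mu: mean on t; G: covariance surface of X on t x t; G10: dG(s,u)/ds;
    phi1: (K, len(t)) derivative eigenfunctions; sigma2: error variance.
    """
    dt = t[1] - t[0]
    idx = np.searchsorted(t, tij)
    # Covariance of the noisy observation vector Y_i
    Sigma = G[np.ix_(idx, idx)] + sigma2 * np.eye(len(tij))
    # Cross-covariances cov(eta_k, Y(t_ij)) via a Riemann sum over s
    B = (phi1 @ G10[:, idx]) * dt
    return B @ np.linalg.solve(Sigma, y - mu[idx])
```

In a noiseless rank-one model the predictor recovers the single score up to shrinkage and discretization error.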
In practice, the number $K$ of included components can be chosen by a variety of methods, including leave-one-curve-out cross-validation (Rice and Silverman (1991)), pseudo-AIC (Shibata (1981)), or pseudo-BIC (Schwarz (1978); Yao et al. (2005)). Another fast and stable option that works well in practice is to choose the smallest $K$ such that the first $K$ components explain a preset fraction of variance, which can be set at 90%.
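The fraction-of-variance-explained rule can be sketched in a few lines (Python/NumPy); the 90% threshold is the default suggested above.

```python
import numpy as np

def choose_K_fve(eigenvalues, threshold=0.90):
    """Smallest K whose leading eigenvalues explain >= threshold of the
    total variance (eigenvalues assumed sorted in decreasing order)."""
    lam = np.asarray(eigenvalues, dtype=float)
    fve = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(fve, threshold) + 1)
```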
4 Asymptotic Results
We take a unified approach in our estimation procedure for the DPCs and other model components that encompasses both the dense and the sparse case. Estimation of the derivatives of the mean and the covariance function, and of the derivative eigenfunctions, is based on smoothing the pooled scatter plots (Zhang and Wang (2016)); the estimation of the DPCs is based on best linear unbiased predictors as in (14). We derive convergence rate results that make use of a novel argument for the dense case. Consistency of the estimators $\hat\eta_{ik}$ for $\eta_{ik}$ can be obtained by utilizing the convergence of the estimators $\hat\mu^{(1)}$, $\hat G^{(1,1)}$, and $\hat\phi_{1k}$ to their respective targets $\mu^{(1)}$, $G^{(1,1)}$, and $\phi_{1k}$, as in Theorems 1 and 2 below. Regularity conditions include assumptions on the number and distribution of the design points, smoothness of the mean and the covariance functions, bandwidth choices, and moment conditions, as detailed in Appendix A.2.
We present results on asymptotic convergence rates in the supremum norm for the estimates of the mean and the covariance functions for derivatives and the corresponding estimates of eigenfunctions. Our first theorem covers the case of sparse/longitudinal designs, where the number of design points $N_i$ is bounded, and the case of dense/functional designs, where $N_i \to \infty$.
This result provides the basis for the convergence of the DPCs. We write $a_n \asymp b_n$ if $c_1 \le a_n / b_n \le c_2$ for some constants $0 < c_1 \le c_2 < \infty$. In the sparse case, the optimal supremum convergence rates for $\hat\mu^{(1)}$, $\hat G^{(1,1)}$, and $\hat\phi_{1k}$ hold almost surely and are achieved for suitable bandwidth sequences, as specified in assumptions 6. and 8. In the dense case, if the number of observations per curve grows sufficiently fast with $n$, then a root-$n$ rate is achieved for our estimates under appropriate bandwidth conditions.
Using asymptotic results in Liu and Müller (2009) for auxiliary estimates of the mean and the covariance functions and their derivatives or partial derivatives, we obtain asymptotic convergence rates of $\hat\eta_{ik}$ toward the appropriate target, namely the best linear unbiased predictor $\tilde\eta_{ik}$ in the sparse case and $\eta_{ik}$ itself in the dense case.
For example, in the sparse case, suitable choices of the bandwidths yield an explicit algebraic convergence rate for $\hat\eta_{ik}$. In the dense case, the $\eta_{ik}$ can be consistently estimated under analogous bandwidth conditions, with the optimal rate for $\hat\eta_{ik}$ achieved for appropriately chosen bandwidth sequences.
If we choose the bandwidths as described after Theorem 2, explicit convergence rates for the represented derivatives follow in both the sparse and the dense case.
5 Simulation Studies
To examine the practical utility of the DPCs, we compared them with various alternatives under different simulation settings, which included a dense and a sparse design. To evaluate the performance of each method in terms of recovering the true derivative trajectories, we examined the mean and standard deviation of the relative mean integrated squared errors (RMISE), defined as
$$\mathrm{RMISE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\int_{\mathcal{T}} \{\hat X_i^{(1)}(t) - X_i^{(1)}(t)\}^2\,dt}{\int_{\mathcal{T}} \{X_i^{(1)}(t)\}^2\,dt}.$$
We compared the proposed approach based on model (5), referred to as DPCA, and a PACE method (Yao et al. (2005)) followed by differentiating the eigenfunctions of the observed processes as in Liu and Müller (2009), corresponding to (2), referred to as FPCA. Each simulation consisted of 400 Monte Carlo samples with a fixed number $n$ of random trajectories per simulation sample.
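The RMISE criterion above can be computed as follows (Python/NumPy sketch; Riemann-sum integration on an equidistant grid is a simplifying assumption).

```python
import numpy as np

def rmise(true_deriv, est_deriv, t):
    """Relative MISE across curves: for each curve, the integrated squared
    error of the derivative estimate divided by the integrated squared
    true derivative, then averaged over curves."""
    dt = t[1] - t[0]
    num = ((est_deriv - true_deriv) ** 2).sum(axis=1) * dt
    den = (true_deriv ** 2).sum(axis=1) * dt
    return float(np.mean(num / den))
```

A perfect estimate gives RMISE 0, and the trivial zero estimate gives RMISE 1, so values well below 1 indicate genuine recovery of the derivative dynamics.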
While our methodology is intended to address the difficult problem of derivative estimation for the case of sparse designs, the Karhunen–Loève expansion for derivatives is of interest in itself and is also applicable to densely sampled functional data. The proposed method also has advantages for densely sampled data with large measurement errors. For the case of dense designs, another straightforward approach to obtain derivatives is LOCAL, a method that corresponds to local quadratic smoothing of each trajectory separately, then taking the coefficient at the linear term as estimate of the derivative; and SMOOTH-DQ, where difference quotients are smoothed with local linear smoothers. These methods are obvious tools to obtain derivatives, but their application is only reasonable for densely sampled trajectories.
All simulated longitudinal data were generated according to the data sampling model described in Section 2, with a smooth mean function; five eigenfunctions $\phi_1, \ldots, \phi_5$, where $\phi_k$ is the $k$-th orthonormal Legendre polynomial on $[0, 1]$; decreasing eigenvalues $\lambda_1 > \cdots > \lambda_5 > 0$; and FPC scores distributed as $\xi_{ik} \sim \mathcal{N}(0, \lambda_k)$, $k = 1, \ldots, 5$. The additional measurement errors were i.i.d. $\mathcal{N}(0, \sigma^2)$, where the value of $\sigma$ varied for different simulation settings.
Simulation A – Sparsely Sampled Longitudinal Data. The number of observations for each trajectory, denoted by $N_i$, was generated from a discrete uniform distribution on $\{2, \ldots, 9\}$. The measurement times were randomly sampled in $[0, 1]$ according to a Beta distribution with mean 0.4 and standard deviation 0.3, so that the design is genuinely sparse and unbalanced. Measurement errors were generated from a Gaussian distribution, with a smaller and a larger standard deviation level considered.
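A data generator in the spirit of Simulation A is sketched below (Python/NumPy). The eigenvalues, the sinusoidal mean function, and the error level are illustrative placeholders, since the paper's exact choices are not reproduced here; the Beta(2/3, 1) design is one parameterization consistent with the stated mean 0.4 and standard deviation 0.3.

```python
import numpy as np
from numpy.polynomial import legendre

def simulate_sparse(n, rng, lam=(4.0, 2.0, 1.0, 0.5, 0.25), sigma=0.1):
    """Sparse longitudinal data: 2 to 9 noisy observations per subject at
    Beta-distributed random times, with orthonormal shifted Legendre
    polynomials on [0, 1] as eigenfunctions."""
    data = []
    for _ in range(n):
        m = int(rng.integers(2, 10))                    # 2 to 9 points
        tij = np.sort(rng.beta(2.0 / 3.0, 1.0, size=m))
        xi = rng.normal(0.0, np.sqrt(np.asarray(lam)))  # FPC scores
        x = np.sin(2 * np.pi * tij)                     # placeholder mean
        for k in range(len(lam)):
            coef = np.zeros(k + 1)
            coef[k] = 1.0
            # orthonormal shifted Legendre: sqrt(2k+1) * P_k(2t - 1)
            x = x + xi[k] * np.sqrt(2 * k + 1) * legendre.legval(2 * tij - 1, coef)
        y = x + rng.normal(0.0, sigma, size=m)          # measurement error
        data.append((tij, y))
    return data
```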
Simulation B – Densely Sampled Functional Data. Each random trajectory consists of 51 equidistant observations measured at the same dense time grid on the interval $[0, 1]$. In this setting, the proposed DPCA method is compared with FPCA, LOCAL, and SMOOTH-DQ. In LOCAL, we estimated the derivatives by applying local quadratic smoothing to individual subjects, with bandwidth selected by minimizing the average cross-validated integrated squared deviation between the resulting derivatives and the raw difference quotients formed from adjacent measurements. In SMOOTH-DQ, individual derivative trajectories were estimated by local linear smoothing of the difference quotients, with smoothing bandwidth chosen by a similar strategy as for LOCAL. Gaussian measurement errors were added, again at a smaller and a larger standard deviation level.
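The SMOOTH-DQ idea can be sketched as follows (Python/NumPy). A Nadaraya–Watson (local constant) Gaussian smoother replaces the local linear smoother of the text, and the bandwidth is fixed rather than cross-validated, so this is a simplified stand-in.

```python
import numpy as np

def smooth_dq(y, t, bandwidth=0.1):
    """Derivative of one densely observed curve: difference quotients at
    grid midpoints, then smoothed back onto the grid with a Gaussian
    Nadaraya-Watson smoother."""
    tm = (t[:-1] + t[1:]) / 2.0              # midpoints of the grid
    dq = np.diff(y) / np.diff(t)             # raw difference quotients
    W = np.exp(-0.5 * ((t[:, None] - tm[None, :]) / bandwidth) ** 2)
    return (W * dq).sum(axis=1) / W.sum(axis=1)
```

As the text notes, such per-curve smoothing is only reasonable for dense designs and degrades quickly when measurement errors are large, since differencing amplifies noise.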
For the smoothing steps, Gaussian kernels were used and the bandwidths were selected by generalized cross-validation (GCV). For DPCA, we took the partial derivative of the smoothed covariance estimate $\hat G$ to obtain $\hat G^{(1,0)}$, which was superior in performance compared to smoothing the raw data directly, and then applied a one-dimensional smoother to $\hat G^{(1,0)}$ in the second direction to obtain $\hat G^{(1,1)}$, where the smoothing bandwidth was chosen to be the same as before. These smoothers for $\hat G^{(1,0)}$ and $\hat G^{(1,1)}$ enjoy better finite sample performance than two-dimensional smoothers due to more stable estimates and better boundary behavior. We let the number of components $K$ range from 1 to 5 for estimating the derivative curves, and we also included an automatic selection of $K$ based on FVE with threshold 90%. The population fractions of variance explained for FPCA were 0%, 18%, 61%, 74%, and 100% for $K = 1, \ldots, 5$, respectively. In contrast, the FVEs for DPCA were 56%, 77%, 92%, 100%, and 100% in our simulation. It is evident that DPCA explains more variance than FPCA for the same number of components, as expected in view of (7).
The results for sparse and irregular designs (Simulation A) are shown in Table 1. For sparse and irregular designs, the sparsity of the observations for each subject precludes the applicability of LOCAL and SMOOTH-DQ, so we compared the proposed DPCA only with FPCA, given that the latter was shown to have much better performance compared to mixed effect modeling with B-splines in Liu and Müller (2009). We also include the RMISE for the simple approach of estimating individual derivatives by the estimated population mean derivative $\hat\mu^{(1)}$.
As the results in Table 1 demonstrate, given the same number of components $K$, the representation of derivatives with DPCA works equally well or better than FPCA in terms of RMISE, where in the latter derivatives are represented with the standard FPCs and the derivatives of the eigenfunctions. DPCA performs well with fewer components than FPCA requires. The performance for individual trajectories is illustrated in Figure 1, which shows the derivative curves and corresponding estimates obtained with FPCA and DPCA for four randomly selected samples generated under the larger measurement error level. We find that in the sparse case the estimated derivatives using FPCA and DPCA are overall similar.
The results for Simulation B for dense designs are shown in Table 2. We found that under both small and large measurement errors, the proposed DPCA clearly outperforms the other three methods in terms of RMISE. The runner-up among the other methods is FPCA, but it was highly unstable for more than five components. Performance of all methods was better with smaller measurement errors, due to the fact that it is particularly difficult to infer derivatives in situations with large measurement errors. Also, unsurprisingly, under the same level of measurement error, all methods achieve smaller RMISE for dense designs compared to their respective performance under sparse designs. This shows that DPCA has a significant advantage over FPCA in the dense setting.
6.1 Modeling Derivatives of Tammar Wallaby Body Length Data
We applied the proposed DPCA for derivative estimation to the wallaby growth data, available at http://www.statsci.org/data/oz/wallaby.html from the Australasian Data and Story Library (OzDASL). This dataset includes body length measurements for 36 tammar wallabies (Macropus eugenii), collected longitudinally at early ages. A detailed description of the dataset is given by Mallon (1994). To gain a better understanding of the growth pattern of wallabies, we investigated the dynamics of their body length growth by estimating the derivatives of their growth trajectories.
One main difficulty is that the body length measurements are very sparse, irregular, and fragmentary, as shown in Figure 2, making these data a good test case to reveal the difficulties in recovering derivatives from sparse longitudinal designs. The 36 wallabies included in the dataset had their body length measured from 1 to 12 times per subject, with a median of 3.5 measurements per subject.
Aiming at a number of components with 90% FVE leads to inclusion of the first three derivative eigenfunctions. The estimated first derivative of the mean function, eigenfunctions of the original growth trajectories, and those of the derivatives by the proposed approach are shown in Figure 3. The average dynamic changes in body length are represented by the mean derivative function (upper left panel), which exhibits a monotonically decreasing trend, from greater than 25 cm/yr at age 1 to less than 10 cm/yr at age 1.8, where the decline rate of the mean derivative function becomes generally slower as age increases. The first eigenfunction (solid) of the trajectories reflects overall body length at different ages, the second eigenfunction (dashed) characterizes a contrast in length between early and late stages, and the third eigenfunction (dotted) corresponds to a contrast between a period around 1.5 years and the other stages (upper right panel).
The primary mode of dynamic changes in body length, as reflected by the first eigenfunction of the derivatives in the lower right panel (solid) of Figure 3, represents the overall speed of growth which has a decreasing trend as wallabies get older. The second mode of dynamic variation is determined by the second eigenfunction of the derivatives (dashed) that mainly contrasts dynamic variation during young age with that of late ages. The third eigenfunction of the derivatives (dotted) emphasizes growth variation around age 1.38 and stands for a contrast of growth speed between middle age and early/late ages. The eigenfunctions of derivatives are seen to clearly differ from the derivatives of the eigenfunctions (lower left panel) of the trajectories themselves, and are well interpretable.
Figure 4 exhibits the trajectories and corresponding derivative estimates by FPCA and DPCA for four randomly selected wallabies, where the derivatives were constructed using the smallest number of components that leads to 90% FVE. Here the DPC-derived derivatives are seen to be reflective of the data dynamics.
6.2 Classifying Wheat Spectra
As a second example, we applied the proposed DPCA to a near infrared (NIR) spectral dataset of wheat, which consists of NIR spectra of 87 wheat samples with known protein content. The spectra of the samples were measured by diffuse reflectance from 1100 to 2500 nm at 10 nm intervals, as displayed in the left panel of Figure 5. For these data, it is of interest to investigate whether the spectrum of a wheat sample can be utilized to predict its protein content. Protein content is an important factor for wheat storage, and higher protein content may increase the market price. For a more detailed description of these data, we refer to Kalivas (1997). These data have been analyzed from a functional data perspective by various authors, including Reiss and Ogden (2007) and Delaigle and Hall (2012).
As can be seen from the left panel of Figure 5, the wheat samples exhibit very similar spectral patterns: the overall trend for all trajectories is increasing, with three major local peaks appearing at wavelengths around 1200 nm, 1450 nm, and 1950 nm. The trajectories are almost parallel to each other, with only minor differences in the overall level. The response to be predicted is the protein content of a wheat sample, which is grouped into two categories: high if a sample has more than 11.3% protein, and low otherwise. From the trajectory plots it appears to be a non-trivial problem to classify the wheat samples, since the trajectories corresponding to the high protein wheats are spread out vertically and overlap on the lower side with those corresponding to the low group.
It has been suggested (Delaigle and Hall (2012)) that derivatives of wheat spectra are particularly suitable for classification of these spectra. We therefore applied the proposed DPCA for the fitting of these spectra and this led to the estimated derivatives of wheat spectra shown in the right panel of Figure 5. These fits are based on including the first four DPCs, which collectively explain 99.2% of the total variation in the derivatives.
For a comparative evaluation of the performance of FPCA as opposed to DPCA for the purpose of classifying protein content of wheat samples, we used logistic regression with one to eight FPCs or DPCs as predictors. We randomly drew 30 samples as training sets and 57 as test sets, repeated this 500 times, and report the average misclassification rates for the test sets in Table 3, in which the first eight columns stand for using a fixed number of components $K$, and the last column for selecting $K$ by 5-fold cross-validation (CV), minimizing the misclassification rate. We found that the DPCA-based classifier outperforms the FPCA-based classifier if two to five predictor components were included, or if CV was used to select $K$. The minimal misclassification rate is 23.8%, achieved by DPCA using two components, while the best FPCA-based misclassification rate is 25.3%. The poor performance of DPCA with only one component indicates that the first DPC alone is not informative for classification, while the second DPC appears to be a strong predictor. While there are some improvements in the misclassification rates when using DPCs, they are relatively small in terms of misclassification error.
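To sketch this comparison, the snippet below fits a plain-NumPy logistic classifier on a matrix of component scores. It stands in for whichever regression routine was actually used; the gradient-ascent training details are illustrative.

```python
import numpy as np

def logistic_fit(Z, y, steps=2000, lr=0.5):
    """Gradient-ascent logistic regression on a score matrix Z (n, K)
    with binary labels y; returns coefficients with an intercept."""
    Zb = np.column_stack([np.ones(len(Z)), Z])
    beta = np.zeros(Zb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Zb @ beta))     # fitted probabilities
        beta += lr * Zb.T @ (y - p) / len(y)     # average log-lik gradient
    return beta

def logistic_predict(beta, Z):
    Zb = np.column_stack([np.ones(len(Z)), Z])
    return (Zb @ beta > 0).astype(int)
```

In the comparison of the text, the columns of `Z` would hold the first $K$ FPCs or DPCs of the training spectra, with misclassification rates evaluated on held-out samples.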
Appendix A.1 Estimating Mean Function and Eigenfunctions For Derivatives
To implement (12), one needs to obtain $\hat\mu^{(1)}$, an estimate of the first order derivative of the mean function. Let $K_1$ denote a univariate kernel, $h_\mu$ a bandwidth, and $(\beta_0, \beta_1, \beta_2)$ local regression coefficients. Applying local quadratic smoothing to a pooled scatterplot, one aggregates the observed measurements from all sample trajectories and minimizes
$$\sum_{i=1}^{n} \sum_{j=1}^{N_i} K_1\Big(\frac{T_{ij} - t}{h_\mu}\Big)\,\big\{Y_{ij} - \beta_0 - \beta_1 (T_{ij} - t) - \beta_2 (T_{ij} - t)^2\big\}^2$$
with respect to $\beta_0$, $\beta_1$, $\beta_2$. The minimizing value $\hat\beta_1$ is the estimate of $\mu^{(1)}(t)$ (Liu and Müller (2009), equation (5)). Here $K_1$ is a univariate density function, and $h_\mu$ is a positive bandwidth that can be chosen by GCV in practical implementation.
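The pooled local quadratic criterion amounts to a weighted least squares problem at each output point, as the following sketch shows (Python/NumPy; a Gaussian kernel and a fixed bandwidth replace the GCV-chosen bandwidth of the text).

```python
import numpy as np

def local_quadratic_derivative(t_pool, y_pool, t_out, bandwidth=0.1):
    """Estimate mu'(t) from the pooled scatterplot of all (T_ij, Y_ij):
    at each t0, fit a kernel-weighted quadratic in (T - t0) and return
    the linear coefficient, which estimates the first derivative."""
    est = np.empty(len(t_out))
    for j, t0 in enumerate(t_out):
        d = t_pool - t0
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # Gaussian kernel weights
        X = np.column_stack([np.ones_like(d), d, d ** 2])
        WX = w[:, None] * X
        beta = np.linalg.solve(X.T @ WX, WX.T @ y_pool)
        est[j] = beta[1]                          # slope coefficient
    return est
```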
In order to estimate the eigenfunctions $\phi_{1k}$ of the derivatives $X^{(1)}$, we proceed by first estimating the covariance kernel $G_1$ (in (4) with $\nu = 1$), followed by a spectral decomposition of the estimated kernel. According to (8), $G_1(s,t) = G^{(1,1)}(s,t)$ for the case $\nu = 1$; this can be estimated by a two-dimensional kernel smoother targeting the mixed partial derivative of the covariance function. Specifically, with raw covariances $C_{i,jl} = \{Y_{ij} - \hat\mu(T_{ij})\}\{Y_{il} - \hat\mu(T_{il})\}$, a bivariate kernel $K_2$, and a bandwidth $h_G$, we aim at minimizing
$$\sum_{i=1}^{n} \sum_{1 \le j \ne l \le N_i} K_2\Big(\frac{T_{ij} - s}{h_G}, \frac{T_{il} - t}{h_G}\Big)\,\Big\{C_{i,jl} - \sum_{0 \le p + q \le 2} \beta_{pq}\,(T_{ij} - s)^p (T_{il} - t)^q\Big\}^2$$
with respect to the $\beta_{pq}$ for $0 \le p + q \le 2$, and set $\hat G^{(1,1)}(s,t)$ to the minimizing value $\hat\beta_{11}$. For theoretical derivations we adopt this direct estimate of $G^{(1,1)}$, while in practical implementation it has been found more convenient to first obtain $\hat G^{(1,0)}$ and then to apply a local linear smoother in the second direction, which also led to better stability and boundary behavior.
After obtaining the estimate $\hat G^{(1,1)}$, eigenfunctions and eigenvalues of the derivatives can be estimated by the spectral decomposition of $\hat G^{(1,1)}$,
$$\int_{\mathcal{T}} \hat G^{(1,1)}(s,t)\,\hat\phi_{1k}(s)\,ds = \hat\lambda_{1k}\,\hat\phi_{1k}(t), \qquad \int_{\mathcal{T}} \hat\phi_{1k}(s)\,\hat\phi_{1l}(s)\,ds = \delta_{kl}.$$
Appendix A.2 Assumptions, Proofs and Auxiliary Results
The kernel $K_1$ is a symmetric probability density function and is Lipschitz continuous: there exists $L > 0$ such that $|K_1(u) - K_1(v)| \le L\,|u - v|$ for any $u, v$.
The $T_{ij}$ are i.i.d. copies of a random variable $T$ defined on $\mathcal{T}$, and the $N_i$ are regarded as fixed. The density $f$ of $T$ is bounded below and above: $0 < \inf_{t \in \mathcal{T}} f(t) \le \sup_{t \in \mathcal{T}} f(t) < \infty$.
Furthermore $f^{(2)}$, the second derivative of $f$, is bounded.
The $\{T_{ij}\}$, $\{\varepsilon_{ij}\}$, and $\{X_i\}$ are mutually independent.
The derivatives $\mu^{(2)}$ and $\partial^{p+q} G(s,t)/\partial s^p \partial t^q$ exist and are bounded on $\mathcal{T}$ and $\mathcal{T} \times \mathcal{T}$, respectively, for the required orders $p$ and $q$.
For some , , , and
For some , , , and
For any , there exists such that = 0 for all .
Note that assumption 9. holds for infinite-dimensional processes whose eigenfunctions correspond to the Fourier basis or the Legendre basis.
Proof of Theorem 1:
We first prove the rate of convergence in the supremum norm of $\hat\mu^{(1)}$ to $\mu^{(1)}$.