Functional robust regression for longitudinal data
Abstract
We present a robust regression estimator for longitudinal data that is especially suited for functional data observed on sparse or irregular time grids. We show by simulation that the proposed estimators possess good outlier-resistance properties compared with the traditional functional least-squares estimator. As an example of application, we study the relationship between levels of oxides of nitrogen and ozone in the city of Sacramento.
Key Words: Functional data analysis; Longitudinal data analysis; Mixed effects models; Robust statistics; Spline smoothing.
1 Introduction
In a typical longitudinal study, a number of variables are measured on a group of individuals, and the goal is to analyze the relationships between the trajectories of the variables. In recent years, functional data analysis has provided efficient ways to analyze longitudinal data. In many cases the variable trajectories are discretized continuous curves that can be reconstructed by smoothing, and functional linear regression methods can be applied to study the relationship between the variables (Ramsay and Silverman, 2005). But in other situations the data is observed at sparse and irregular time points, which makes smoothing difficult or even infeasible. Functional regression methods that can be applied directly to the raw measurements then become very useful.
Methods for functional data analysis of irregularly sampled curves have been proposed by a number of authors, for the one-sample problem as well as for the functional regression problem (Chiou et al., 2004; James et al., 2000; Müller et al., 2008; Yao et al., 2005a, 2005b). Outlier-resistant techniques for the functional one-sample problem have also been proposed (Cuevas et al., 2007; Gervini, 2008, 2009; Fraiman and Muniz, 2001; Locantore et al., 1999), and two recent papers deal with robust functional regression for presmoothed curves (Zhu et al., 2011; Maronna and Yohai, 2012). However, outlier-resistant functional regression methods for raw functional data have not yet been proposed in the literature. In this paper we address this problem and present a computationally simple approach based on random-effects models. Our simulations show that this method attains the desired outlier resistance against atypical curves, and that the asymptotic distribution of the test statistic is approximately valid for small samples.
As an example of application, we will analyze the daily trajectories of oxides of nitrogen and ozone levels in the city of Sacramento, California, during the summer of 2005. The data is shown in Figure 1. The goal is to predict ozone concentration from oxides of nitrogen. Both types of curves follow regular patterns, but some atypical curves can be discerned in the sample. We will show in Section 4 that to a large extent it is indeed possible to predict ozone levels from oxides-of-nitrogen levels, but that the outlying curves distort the classical regression estimators and that the proposed robust method gives more reliable results.
The paper is organized as follows. Section 2 presents a brief overview of functional linear regression and introduces the new method. Section 3 reports the results of a comparative simulation study, and Section 4 presents a detailed analysis of the ozone dataset mentioned above. Technical derivations and proofs are left to the Appendix. Matlab® programs implementing these procedures are available on the author's webpage.
[Figure 1. Ozone Example. Daily trajectories of ground-level concentrations of (a) oxides of nitrogen and (b) ozone in the city of Sacramento in the Summer of 2005.]
2 Method
2.1 Background: classical functional linear regression
The functional approach to longitudinal data analysis assumes that the observations are discrete measurements of underlying continuous curves, so

(1) $x_{ij} = X_i(s_{ij}) + e_{ij}, \quad j = 1, \ldots, m_i,$

(2) $y_{ik} = Y_i(t_{ik}) + \varepsilon_{ik}, \quad k = 1, \ldots, n_i,$

where $X_i(s)$ and $Y_i(t)$ are the trajectories of interest, the $e_{ij}$'s and $\varepsilon_{ik}$'s are random measurement errors, and the $s_{ij}$'s and $t_{ik}$'s are the time points where the data is observed. The $X_i$'s and the $Y_i$'s are random functions that we assume independent and identically distributed realizations of a pair $(X, Y)$.
Suppose $X$ and $Y$ are square-integrable functions on an interval $[a, b]$. Define the norm $\|f\| = \{\int_a^b f^2(t)\,dt\}^{1/2}$ and the inner product $\langle f, g \rangle = \int_a^b f(t) g(t)\,dt$. If $E(\|X\|^2)$ and $E(\|Y\|^2)$ are finite, then $X$ and $Y$ admit the decompositions

(3) $X(s) = \mu_X(s) + \sum_{k=1}^{p} u_k \phi_k(s),$

(4) $Y(t) = \mu_Y(t) + \sum_{l=1}^{q} v_l \psi_l(t),$

known as the Karhunen–Loève decomposition (Ash and Gardner, 1975, ch. 1.4), where $\mu_X = E(X)$, $\mu_Y = E(Y)$, and the $\phi_k$'s and $\psi_l$'s are orthonormal functions (i.e. $\langle \phi_k, \phi_{k'} \rangle = \delta_{kk'}$ and $\langle \psi_l, \psi_{l'} \rangle = \delta_{ll'}$, where $\delta$ is Kronecker's delta), and the $u_k$'s and $v_l$'s are random variables with zero mean and finite variance (without loss of generality, one can assume that $\mathrm{var}(u_1) \geq \mathrm{var}(u_2) \geq \cdots$ and $\mathrm{var}(v_1) \geq \mathrm{var}(v_2) \geq \cdots$.) This is the functional equivalent of the principal-component decomposition in multivariate analysis, so the $\phi_k$'s and $\psi_l$'s are called "principal components", and the $u_k$'s and $v_l$'s are called "component scores". In principle $p$ and $q$ in (3) and (4) could be infinite, but since $E(\|X - \mu_X\|^2) = \sum_k \mathrm{var}(u_k)$ and $E(\|Y - \mu_Y\|^2) = \sum_l \mathrm{var}(v_l)$ are finite, the sequences $\{\mathrm{var}(u_k)\}$ and $\{\mathrm{var}(v_l)\}$ usually decrease to zero fast enough that for practical purposes $p$ and $q$ can be assumed to be finite.
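To make the decomposition concrete, here is a small numerical sketch (not from the paper; the sine basis, the variances and the sample size are illustrative) that simulates curves with a three-term Karhunen–Loève expansion and recovers the components and their variances from the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.linspace(0, 1, 101)                # dense grid on [0, 1]
# Orthonormal basis on [0, 1]: phi_k(s) = sqrt(2) sin(k * pi * s)
phi = np.sqrt(2) * np.sin(np.pi * np.outer(np.arange(1, 4), s))   # (3, 101)
lam = np.array([4.0, 1.0, 0.25])          # score variances, decreasing
n = 500
scores = rng.normal(size=(n, 3)) * np.sqrt(lam)   # zero-mean scores u_k
X = scores @ phi                          # X_i(s) = sum_k u_ik phi_k(s), mu_X = 0

# Recover the principal components from the sample covariance
ds = s[1] - s[0]                          # quadrature weight
C = np.cov(X, rowvar=False)               # (101, 101) sample covariance
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]    # sort eigenvalues in decreasing order
phi_hat = vecs[:, :3].T / np.sqrt(ds)     # rescale eigenvectors to unit L2 norm
lam_hat = vals[:3] * ds                   # estimates of var(u_k)
```

Up to sampling noise (and sign), `phi_hat` matches the true components and `lam_hat` the true variances, illustrating why only the first few terms of the expansion matter in practice.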
Methods for estimating the mean and the principal components of $X$ and $Y$ can be found in Ramsay and Silverman (2005), James et al. (2000), and Yao et al. (2005b). These methods are not resistant to outliers, though; outlier-resistant estimators of the mean and principal components have been proposed by Locantore et al. (1999), Cuevas et al. (2007), and Gervini (2008, 2009). We will use the method of Gervini (2009) to estimate the mean and the principal components in (3) and (4). This method is briefly reviewed in the Appendix.
Now suppose that there is a functional linear relationship between $X$ and $Y$:

(5) $Y(t) = \alpha(t) + \int_a^b \beta(s, t) X(s)\,ds + \varepsilon(t),$

where $\alpha$ is the intercept, $\beta$ the slope, and $\varepsilon$ the error term. We assume $E\{\varepsilon(t)\} = 0$ and $\mathrm{cov}\{X(s), \varepsilon(t)\} = 0$ for all $s$ and $t$. (Note that $\varepsilon$ is not necessarily white noise; it is just the portion of $Y$ that is not explained by $X$, and it is usually a smooth nontrivial process.) Since (5) implies that $\mu_Y(t) = \alpha(t) + \int_a^b \beta(s, t) \mu_X(s)\,ds$, we can rewrite (5) as

(6) $Y(t) = \mu_Y(t) + \int_a^b \beta(s, t) \{X(s) - \mu_X(s)\}\,ds + \varepsilon(t).$

Then the only parameter that remains to be estimated is the regression slope $\beta$.

Since $\{\phi_1, \ldots, \phi_p\}$ is an orthonormal basis of the space spanned by $X - \mu_X$ and $\{\psi_1, \ldots, \psi_q\}$ is an orthonormal basis of the space spanned by $Y - \mu_Y$, without loss of generality the regression slope can be expressed as

(7) $\beta(s, t) = \sum_{k=1}^{p} \sum_{l=1}^{q} b_{kl}\, \phi_k(s) \psi_l(t).$

In matrix form, $\beta(s, t) = \boldsymbol{\phi}(s)^T \mathbf{B} \boldsymbol{\psi}(t)$, where $\boldsymbol{\phi}(s) = (\phi_1(s), \ldots, \phi_p(s))^T$, $\boldsymbol{\psi}(t) = (\psi_1(t), \ldots, \psi_q(t))^T$ and $\mathbf{B} = [b_{kl}]$. If we also collect the component scores $u_1, \ldots, u_p$ and $v_1, \ldots, v_q$ into vectors $\mathbf{u}$ and $\mathbf{v}$, from (3), (4), (6) and (7) we obtain

$\mathbf{v} = \mathbf{B}^T \mathbf{u} + \mathbf{e},$

where $\mathbf{e}$ is the random vector with elements $e_l = \langle \varepsilon, \psi_l \rangle$. This reduces the functional regression model (6) to a simpler multivariate regression model,

(8) $\mathbf{v}_i = \mathbf{B}^T \mathbf{u}_i + \mathbf{e}_i, \quad i = 1, \ldots, n,$

and the problem now is to estimate the regression matrix $\mathbf{B}$.
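A quick numerical illustration of this reduction (the dimensions, the matrix $\mathbf{B}$ and the score variances below are made up for the sketch): once the score vectors are available, the functional model is an ordinary multivariate regression, and the coefficient matrix can be recovered by standard methods.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 3, 2, 400
B = np.array([[1.0,  0.5],
              [0.0, -1.0],
              [0.3,  0.0]])                 # p x q matrix of coefficients b_kl
u = rng.normal(size=(n, p)) * np.array([2.0, 1.0, 0.5])   # predictor scores
e = 0.3 * rng.normal(size=(n, q))           # error-process scores
v = u @ B + e                               # row-wise form of v_i = B^T u_i + e_i

# Ordinary least-squares estimate of B (robustified in Section 2.2)
B_hat, *_ = np.linalg.lstsq(u, v, rcond=None)
```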
2.2 Outlier-resistant functional regression
As explained above, given the raw data we use the reduced-rank estimators of Gervini (2009) to obtain robust estimators of $\mu_X$, $\mu_Y$, the $\phi_k$'s, the $\psi_l$'s, and the individual component scores $\hat{\mathbf{u}}_i$ and $\hat{\mathbf{v}}_i$. By (7) and (8), the least-squares estimator of $\beta$ would be $\hat{\beta}(s, t) = \hat{\boldsymbol{\phi}}(s)^T \hat{\mathbf{B}} \hat{\boldsymbol{\psi}}(t)$ with

(9) $\hat{\mathbf{B}} = \Big( \sum_{i=1}^{n} \hat{\mathbf{u}}_i \hat{\mathbf{u}}_i^T \Big)^{-1} \sum_{i=1}^{n} \hat{\mathbf{u}}_i \hat{\mathbf{v}}_i^T.$

However, this estimator is not robust. Although the reduced-rank estimators of $\mu_X$, $\mu_Y$, the $\phi_k$'s and the $\psi_l$'s are robust, the component scores $\hat{\mathbf{u}}_i$ and $\hat{\mathbf{v}}_i$ are individual parameters that will be outliers if the corresponding curves $X_i$ and $Y_i$ are outliers. Therefore, the estimator of $\mathbf{B}$ has to incorporate a mechanism to downweight outlying $\hat{\mathbf{u}}_i$'s and $\hat{\mathbf{v}}_i$'s.
This can be accomplished, for instance, by a modification of the GM-estimators of He et al. (2000), which we will call GM-t for short. Let

(10) $(\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}) = \arg\min_{\mathbf{B}, \boldsymbol{\Sigma}} \sum_{i=1}^{n} w_i \left\{ \log |\boldsymbol{\Sigma}| + (\nu + q) \log\left( 1 + \frac{\mathbf{r}_i(\mathbf{B})^T \boldsymbol{\Sigma}^{-1} \mathbf{r}_i(\mathbf{B})}{\nu} \right) \right\},$

where $\mathbf{r}_i(\mathbf{B}) = \hat{\mathbf{v}}_i - \mathbf{B}^T \hat{\mathbf{u}}_i$ and $\nu > 0$. These are weighted maximum likelihood estimators of $\mathbf{B}$ and $\boldsymbol{\Sigma}$ when $\mathbf{e}$ in (8) follows a multivariate $t_\nu$ distribution with mean zero and scatter matrix $\boldsymbol{\Sigma}$, although we do not actually assume that $\mathbf{e}$ follows this distribution; as in He et al. (2000), this is just the motivation behind definition (10).
It is shown in the Appendix that $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ satisfy the fixed-point equations

(11) $\hat{\mathbf{B}} = \Big( \sum_{i=1}^{n} w_i \tilde{w}_i \hat{\mathbf{u}}_i \hat{\mathbf{u}}_i^T \Big)^{-1} \sum_{i=1}^{n} w_i \tilde{w}_i \hat{\mathbf{u}}_i \hat{\mathbf{v}}_i^T,$

(12) $\hat{\boldsymbol{\Sigma}} = \frac{\sum_{i=1}^{n} w_i \tilde{w}_i \hat{\mathbf{r}}_i \hat{\mathbf{r}}_i^T}{\sum_{i=1}^{n} w_i},$

where $\hat{\mathbf{r}}_i = \hat{\mathbf{v}}_i - \hat{\mathbf{B}}^T \hat{\mathbf{u}}_i$ and $\tilde{w}_i = (\nu + q) / (\nu + \hat{\mathbf{r}}_i^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\mathbf{r}}_i)$. These equations can be solved iteratively by a reweighting algorithm.
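The reweighting algorithm can be sketched as follows: a minimal implementation assuming the weighted multivariate-$t$ fixed-point updates, with our own variable names (`w` holds the trimming weights, `nu` the degrees of freedom).

```python
import numpy as np

def gmt_fit(u, v, w, nu=1.0, n_iter=200, tol=1e-8):
    """Sketch of the GM-t reweighting iteration: weighted multivariate-t
    regression of the score vectors v on u."""
    n, p = u.shape
    q = v.shape[1]
    B = np.linalg.lstsq(u, v, rcond=None)[0]        # start at least squares
    r = v - u @ B
    Sigma = (w[:, None] * r).T @ r / w.sum()        # initial scatter
    for _ in range(n_iter):
        r = v - u @ B
        d = np.einsum('ij,jk,ik->i', r, np.linalg.inv(Sigma), r)
        wt = w * (nu + q) / (nu + d)                # t-model downweighting
        B_new = np.linalg.solve((u * wt[:, None]).T @ u,
                                (u * wt[:, None]).T @ v)
        r = v - u @ B_new
        Sigma = (wt[:, None] * r).T @ r / w.sum()   # scatter update
        if np.max(np.abs(B_new - B)) < tol:
            B = B_new
            break
        B = B_new
    return B, Sigma
```

Observations with large residuals receive weights close to zero, so a few gross outliers barely move the fit.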
As for the weights $w_i$, they are essentially a by-product of the estimation of $\mu_X$, $\mu_Y$, the $\phi_k$'s and the $\psi_l$'s. Since the component scores have mean zero, the elements of the combined score vector $\mathbf{z}_i = (\hat{\mathbf{u}}_i^T, \hat{\mathbf{v}}_i^T)^T$ are approximately uncorrelated with mean zero. The squared Mahalanobis distance of $\mathbf{z}_i$ is then $D_i = \sum_{k=1}^{p} \hat{u}_{ik}^2 / \hat{\lambda}_k + \sum_{l=1}^{q} \hat{v}_{il}^2 / \hat{\gamma}_l$, where $\hat{\lambda}_k$ and $\hat{\gamma}_l$ are the respective score-variance estimates, and large $D_i$'s will correspond to outliers. The $D_i$'s will follow an approximate $\chi^2_{p+q}$ distribution if the data is Gaussian.
This suggests a number of weighting schemes. One possibility is to use "metric" trimming,

(13) $w_i = I\big( D_i \leq \chi^2_{p+q, 1-\alpha} \big),$

where $\chi^2_{p+q, 1-\alpha}$ is the $(1 - \alpha)$ quantile of the $\chi^2_{p+q}$ distribution. Another possibility is to use rank-based trimming,

(14) $w_i = I\big( D_i \leq D_{(n - \lfloor n\alpha \rfloor)} \big),$

where $D_{(1)} \leq \cdots \leq D_{(n)}$ are the order statistics of the $D_i$'s. The latter will always eliminate the $\lfloor n\alpha \rfloor$ observations with the largest Mahalanobis distances, even if they are not actual outliers; so we recommend not using an unnecessarily large $\alpha$ for rank-based trimming. In practice, the choice of $\alpha$ can be based on the proportion of outliers observed in a boxplot or histogram of the $D_i$'s.
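Both schemes amount to hard 0–1 weights computed from the squared Mahalanobis distances; a sketch (the function name and signature are ours, `df = p + q` is the chi-square reference dimension):

```python
import numpy as np
from scipy import stats

def trimming_weights(D, df, alpha=0.10, scheme="metric"):
    """0-1 trimming weights from squared Mahalanobis distances D."""
    D = np.asarray(D, dtype=float)
    if scheme == "metric":
        # keep observations below the (1 - alpha) chi-square quantile
        return (D <= stats.chi2.ppf(1 - alpha, df)).astype(float)
    # rank-based: always trim the floor(n * alpha) largest distances
    n = len(D)
    k = int(np.floor(n * alpha))
    if k == 0:
        return np.ones(n)
    cutoff = np.sort(D)[n - k - 1]      # (n - k)-th smallest distance
    return (D <= cutoff).astype(float)
```

Note the difference in behavior: rank-based trimming removes exactly $\lfloor n\alpha \rfloor$ observations regardless of how extreme they are, while metric trimming removes only those beyond the fixed chi-square cutoff.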
The estimator $\hat{\mathbf{B}}$ defined above belongs to the general class of M-estimators, which have well-known asymptotic properties (Van der Vaart, 1998, ch. 5). As shown in the Appendix, $\sqrt{n}\,\{\mathrm{vec}(\hat{\mathbf{B}}) - \mathrm{vec}(\mathbf{B})\}$ follows an approximate $N(\mathbf{0}, \mathbf{A}^{-1} \mathbf{C} \mathbf{A}^{-1})$ distribution for large $n$, with

(15) $\mathbf{A} = E\left\{ w \tilde{w}\, (\boldsymbol{\Sigma}^{-1} \otimes \mathbf{u} \mathbf{u}^T) - \frac{2 w \tilde{w}^2}{\nu + q}\, (\boldsymbol{\Sigma}^{-1} \mathbf{r} \otimes \mathbf{u}) (\boldsymbol{\Sigma}^{-1} \mathbf{r} \otimes \mathbf{u})^T \right\},$

(16) $\mathbf{C} = E\left\{ w^2 \tilde{w}^2\, (\boldsymbol{\Sigma}^{-1} \mathbf{r} \otimes \mathbf{u}) (\boldsymbol{\Sigma}^{-1} \mathbf{r} \otimes \mathbf{u})^T \right\},$

where $\mathbf{r} = \mathbf{v} - \mathbf{B}^T \mathbf{u}$, $\tilde{w} = (\nu + q)/(\nu + \mathbf{r}^T \boldsymbol{\Sigma}^{-1} \mathbf{r})$, $w$ is the trimming weight, and $\otimes$ denotes the Kronecker product. The matrices $\mathbf{A}$ and $\mathbf{C}$ can be easily estimated, replacing expectations by averages. This asymptotic distribution can be used, for instance, to test significance of the regression: if $\mathbf{B} = \mathbf{O}$, Wald's statistic $W = n\, \mathrm{vec}(\hat{\mathbf{B}})^T \hat{\mathbf{V}}^{-1} \mathrm{vec}(\hat{\mathbf{B}})$, with $\hat{\mathbf{V}} = \hat{\mathbf{A}}^{-1} \hat{\mathbf{C}} \hat{\mathbf{A}}^{-1}$, follows an approximate $\chi^2_{pq}$ distribution for large $n$, so we decide the regression is significant if $W > \chi^2_{pq, 1-\alpha}$ for a given level $\alpha$. We can also construct marginal tests and confidence intervals for the individual coefficients $b_{kl}$.
In Section 3 we will study the accuracy of this asymptotic approximation. It is our experience that the distribution of $\hat{\mathbf{B}}$ approaches normality quite fast, but the above "sandwich formula" tends to underestimate the variance when the sample size is small. In that case it is better to use bootstrap estimators of the covariance matrix of $\mathrm{vec}(\hat{\mathbf{B}})$.
3 Simulations
In this section we study by simulation the finite-sample behavior of the estimators (10). To this end, we generated data from model (8) with and , where and . Two regression parameters were considered: for the first set of simulations (to study estimation error) we took with and for ; for the second set of simulations (to study the goodness of the asymptotic approximation of Wald's test) we took $\mathbf{B} = \mathbf{O}$. The curves and were generated following (3) and (4), with the means equal to zero, and , for and in . The raw observations were generated following (1) and (2), with random time points uniformly distributed in , and independent , and ; for simplicity we took the grid equal to .
The first series of simulations was designed to study the estimation error of the estimators, both for clean and for outlier-contaminated data. We generated outliers by replacing a proportion of the pairs by , with and for , and . Note that the contaminated data follows model (8) with a different slope matrix and high-leverage $\mathbf{u}_i$'s, so the effect of this type of contamination is an underestimation of $\mathbf{B}$ that tends to pull $\hat{\beta}$ towards 0.
The estimation of $\beta$ requires two steps: first, estimating the component scores from the raw data, and then computing $\hat{\mathbf{B}}$ from the $\hat{\mathbf{u}}_i$'s and the $\hat{\mathbf{v}}_i$'s. So we compared two procedures: a non-robust procedure, using reduced-rank Normal models (James et al., 2000) to estimate the component scores, followed by the ordinary least-squares regression estimator (9); and a robust procedure, using reduced-rank $t$ models (Gervini, 2009) to estimate the component scores, followed by the GM-t regression estimator (10). For the robust procedure, we considered the two types of weights discussed in Section 2.2, with trimming proportions and ; degrees of freedom and were used for the $t$ models.
Four levels of contamination were considered: 0 (clean data), , and . We took as sample size, as grid size, and as model dimensions. Each case was replicated 1000 times. As a measure of estimation error we used the expected root integrated squared error $E(\|\hat{\beta} - \beta\|)$, where $\|\hat{\beta} - \beta\|^2 = \int_a^b \int_a^b \{\hat{\beta}(s, t) - \beta(s, t)\}^2\,ds\,dt$.
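Under the orthonormal expansions of Section 2, this error criterion reduces to a matrix norm, which is how such a quantity can be computed in practice (a sketch assuming the true components are used, so that $\hat{\beta}$ and $\beta$ share the same bases):

```python
import numpy as np

def rise(B_hat, B):
    """Root integrated squared error of beta_hat: with common orthonormal
    bases, the double integral of (beta_hat - beta)^2 equals the squared
    Frobenius norm of B_hat - B."""
    return float(np.linalg.norm(B_hat - B))   # Frobenius norm
```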
The results are reported in Table 1, along with Monte Carlo standard errors. We see that for noncontaminated data there is no significant difference between metric and rank trimming for a given pair $(\nu, \alpha)$. The trimming proportion $\alpha$ has a larger impact on the estimator's behavior than the degrees of freedom $\nu$. For this reason we recommend choosing $\alpha$ adaptively, so as not to cut off too much good data. Under contamination, metric trimming tends to outperform rank trimming for a given pair $(\nu, \alpha)$. Somewhat counterintuitively, estimators with larger $\nu$ tend to be more robust than those with smaller $\nu$ for a given $\alpha$; the reason is that for this type of contamination, which affects $\mathbf{B}$ but not the $X_i$'s or the $Y_i$'s, models with larger $\nu$ provide more accurate estimators of the component scores than models with smaller $\nu$ (for other types of contamination this is no longer true, although $t$ models with small $\nu$ are still very robust; see Gervini (2009)). In general, then, the recommendation is to use $t$-model estimators with metrically trimmed weights and a trimming proportion $\alpha$ chosen adaptively.
Table 1. Estimation errors (Monte Carlo standard errors in parentheses) by contamination proportion.

Estimator         0%            10%           20%           30%
Least squares     .293 (.004)   2.241 (.006)  2.644 (.048)  2.731 (.007)
GM-t, , :
  Metric trim     .472 (.006)   .497 (.007)   1.316 (.028)  2.924 (.008)
  Rank trim       .473 (.006)   .469 (.007)   2.246 (.028)  2.941 (.006)
GM-t, , :
  Metric trim     .846 (.012)   .800 (.013)   1.112 (.018)  1.756 (.022)
  Rank trim       .832 (.012)   .922 (.015)   1.212 (.018)  1.784 (.021)
GM-t, , :
  Metric trim     .379 (.005)   .396 (.005)   1.493 (.023)  2.746 (.006)
  Rank trim       .374 (.005)   .395 (.006)   2.341 (.011)  2.792 (.005)
GM-t, , :
  Metric trim     .795 (.010)   .666 (.011)   .912 (.015)   1.494 (.021)
  Rank trim       .783 (.010)   .829 (.013)   1.054 (.017)   1.506 (.021)
The second series of simulations was designed to assess the finite-sample adequacy of the asymptotic Wald test. To this end we generated data as before, but with $\mathbf{B} = \mathbf{O}$. Then Wald's statistic $W$ should approximately follow a $\chi^2_{pq}$ distribution, with the covariance matrix of $\mathrm{vec}(\hat{\mathbf{B}})$ estimated as appropriate for each estimator: for GM-t estimators, by the "sandwich formula" given in Section 2.2; for the least-squares estimator, by the usual least-squares formula. Table 2 reports the simulated tail probabilities for the usual nominal values (.10, .05 and .01) and various combinations of the sample-size and dimension parameters. Each combination was replicated 10,000 times. We compared only two estimators this time: the least-squares estimator and the 10% metrically trimmed GM-t estimator. We see in Table 2 that the asymptotic approximation works reasonably well for the least-squares estimator if the ratio of sample size to number of parameters exceeds 15; for the GM-t estimator, however, a ratio of at least 35 is necessary for the asymptotic approximation to be reasonably good. Therefore, the asymptotic Wald test can be used with confidence only for large sample sizes and relatively small dimensions. In other cases, permutation tests or Wald tests with bootstrap-estimated covariances are preferable.
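The kind of finite-sample calibration check reported in Table 2 can be sketched for the least-squares case as follows (all settings below are illustrative, not the paper's; under the null the rejection rate should be near the nominal level):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p, q, n, reps = 2, 2, 200, 500
rej = 0
for _ in range(reps):
    u = rng.normal(size=(n, p))
    v = rng.normal(size=(n, q))              # B = O, so H0 is true
    B_hat = np.linalg.lstsq(u, v, rcond=None)[0]
    resid = v - u @ B_hat
    Sig_e = resid.T @ resid / (n - p)        # error covariance estimate
    # covariance of the row-major vec(B_hat): (u'u)^{-1} Kronecker Sig_e
    V = np.kron(np.linalg.inv(u.T @ u), Sig_e)
    b = B_hat.ravel()
    W = b @ np.linalg.solve(V, b)
    rej += W > stats.chi2.ppf(0.95, p * q)
rate = rej / reps                            # should be near 0.05
```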
Table 2. Simulated tail probabilities of the asymptotic Wald test (Monte Carlo standard errors in parentheses), by nominal probability.

Parameters   Estimator   .10            .05            .01
, ,          LS          .1426 (.0035)  .0819 (.0027)  .0219 (.0015)
             GM-t        .2270 (.0042)  .1571 (.0036)  .0749 (.0026)
, ,          LS          .1272 (.0033)  .0693 (.0025)  .0170 (.0013)
             GM-t        .1584 (.0037)  .0952 (.0029)  .0366 (.0019)
, ,          LS          .1117 (.0032)  .0561 (.0023)  .0123 (.0011)
             GM-t        .1392 (.0035)  .0824 (.0027)  .0258 (.0016)
, ,          LS          .1452 (.0035)  .0813 (.0027)  .0211 (.0014)
             GM-t        .2750 (.0045)  .1900 (.0039)  .0875 (.0028)
, ,          LS          .1316 (.0034)  .0718 (.0026)  .0144 (.0012)
             GM-t        .2111 (.0041)  .1360 (.0034)  .0514 (.0022)
, ,          LS          .1185 (.0032)  .0625 (.0024)  .0169 (.0013)
             GM-t        .1782 (.0038)  .1122 (.0032)  .0391 (.0019)
4 Application: Ozone Pollution Data
Ground-level ozone is an air pollutant known to cause serious health problems. Unlike other pollutants, ozone is not emitted directly into the air but forms as a result of complex chemical reactions involving, among other factors, volatile organic compounds and oxides of nitrogen. Modeling ground-level ozone formation has been an active topic of air-quality studies for many years. The California Environmental Protection Agency database, available at http://www.arb.ca.gov/aqd/aqdcd/aqdcddld.htm, contains hourly concentrations of pollutants at different locations in California for the years 1980 to 2009. Here we will focus on the trajectories of oxides of nitrogen (NOx) and ozone (O3) in the city of Sacramento (site 3011 in the database) between June 6 and August 26 of 2005, a total of 82 days (shown in Figure 1). There are a few days with some missing observations (9 in total), but since the method can handle unequal time grids, imputation of the missing data was not necessary.
[Figure 2. Ozone Example. Normal and Cauchy reduced-rank B-spline estimators of the mean [(a),(b)], the first principal component [(c),(d)], the second principal component [(e),(f)] and the third principal component [(g),(h)] of log-NOx and root-O3 trajectories.]
The first step in the analysis is to fit reduced-rank models to the sample curves. We used cubic B-splines with 7 equally spaced knots, and fitted Normal and Cauchy ($t$ with one degree of freedom) reduced-rank models with up to 10 principal components. For both the response and the explanatory curves, the leading three components explain at least 85% of the total variability, so we retained these models. The means and the principal components are plotted in Figure 2. There is no substantial difference between the estimators obtained by these models, except perhaps for the mean and the third component of log-NOx (Figure 2 (a) and (g)).
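The "85% of total variability" rule used here can be coded in a couple of lines (the eigenvalues below are hypothetical, not the dataset's):

```python
import numpy as np

lam = np.array([5.2, 1.9, 1.1, 0.4, 0.2])        # hypothetical variance estimates
prop = np.cumsum(lam) / lam.sum()                # cumulative proportion explained
n_comp = int(np.searchsorted(prop, 0.85) + 1)    # smallest count reaching 85%
```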
With the Normal component scores we computed the least-squares estimator (9), obtaining
With the Cauchy component scores we computed the GM-t estimator with 1 degree of freedom and 10% metric trimming, obtaining
The latter cut off 5 of the 82 observations. There are some noticeable differences between these two estimators, even leaving aside the third rows (which are not easily comparable, since the corresponding third principal components of the two fits are rather different). The differences are more striking in the functional slope estimators, shown in Figure 3. There is a "bump" in the robust slope estimator, around the point corresponding to 8am for NOx and 4pm for O3, that does not appear in the least-squares estimator. This means that the robust slope estimator assigns positive weight to NOx values around 8am in the prediction of O3 levels around 4pm, showing that there is a persistent effect of oxides-of-nitrogen levels on ozone formation.
[Figure 3. Ozone Example. Functional slope estimators obtained by (a) least squares using Normal scores and (b) metric-trimmed GM-t using Cauchy scores.]
Of course, none of this would be meaningful if the regression model were not statistically significant. But the estimated response curves, shown in Figure 4, clearly show that the model does predict the response curves to a large extent. The robust estimator provides a better fit overall, attaining a smaller root median squared error than the least-squares estimator.
[Figure 4. Ozone Example. Daily trajectories of root-O3 levels: (a) observed, (b) predicted by the robust GM-t estimator, and (c) predicted by least squares.]
Acknowledgement
The author was partly supported by NSF grants DMS 0604396 and 1006281.
Appendix
Reducedrank models
The method proposed by Gervini (2009) to estimate the mean and the principal components of a stochastic process works as follows. The mean function and the principal components are modeled as spline functions; that is, given a set of spline basis functions $b_1(t), \ldots, b_r(t)$ chosen by the user, it is assumed that $\mu(t) = \mathbf{b}(t)^T \mathbf{c}$ and $\phi_k(t) = \mathbf{b}(t)^T \mathbf{d}_k$ for $k = 1, \ldots, p$. The observed vector $\mathbf{x}_i = (x_{i1}, \ldots, x_{im_i})^T$ can then be expressed as

$\mathbf{x}_i = \mathbf{B}_i \mathbf{c} + \mathbf{B}_i \mathbf{D} \mathbf{u}_i + \mathbf{e}_i,$

where $\mathbf{B}_i$ is the matrix of basis functions evaluated at the observation times, $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_p]$, and $\mathbf{e}_i$ is the vector of measurement errors. By assuming that the random effects follow a standard multivariate $t$ distribution, robust maximum likelihood estimators of the model parameters are obtained. The estimators are computed via a standard EM algorithm. The optimal number of components $p$ can be chosen via AIC or BIC criteria. See Gervini (2009) for details. In addition to parameter estimates, the EM algorithm yields predictors of the random effects $\mathbf{u}_i$, so one obtains the component scores $\hat{\mathbf{u}}_i$ as a by-product. The estimators of $\mu_Y$, the $\psi_l$'s and the scores $\hat{\mathbf{v}}_i$ are obtained in a similar way from the sample of response curves.
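For concreteness, the spline representation of the mean (and, analogously, of each principal component) can be evaluated as follows; this is a sketch with a hypothetical clamped cubic basis on [0, 1], using scipy's `BSpline` for the basis matrix:

```python
import numpy as np
from scipy.interpolate import BSpline

# Clamped cubic B-spline basis on [0, 1] with 7 equally spaced interior knots
interior = np.linspace(0.0, 1.0, 9)[1:-1]
knots = np.concatenate([[0.0] * 4, interior, [1.0] * 4])
r = len(knots) - 4                      # number of basis functions (here 11)

def basis_matrix(t):
    """Rows b(t_l)^T of all r basis functions evaluated at the times t."""
    return BSpline(knots, np.eye(r), 3)(t)

rng = np.random.default_rng(2)
c = rng.normal(size=r)                  # hypothetical spline coefficients of mu
t_obs = np.sort(rng.uniform(0.0, 1.0, size=15))   # sparse, irregular times
mu_at_obs = basis_matrix(t_obs) @ c     # mu(t_l) = b(t_l)^T c
```

Because the basis is evaluated at each subject's own observation times, no common grid or presmoothing is needed; this is what allows the method to handle sparse and irregular designs.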
GM-t estimating equations and asymptotics
The estimators $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ defined by (10) are M-type estimators (Van der Vaart, 1998, ch. 5), since they minimize a function of the form $\sum_{i=1}^{n} \rho(\hat{\mathbf{u}}_i, \hat{\mathbf{v}}_i; \mathbf{B}, \boldsymbol{\Sigma})$. Specifically,

$\rho(\mathbf{u}, \mathbf{v}; \mathbf{B}, \boldsymbol{\Sigma}) = w \left\{ \log |\boldsymbol{\Sigma}| + (\nu + q) \log\left( 1 + \frac{\mathbf{r}^T \boldsymbol{\Sigma}^{-1} \mathbf{r}}{\nu} \right) \right\}, \qquad \mathbf{r} = \mathbf{v} - \mathbf{B}^T \mathbf{u}.$

Then $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ solve the equations $\sum_i \partial \rho_i / \partial \mathbf{B} = \mathbf{O}$ and $\sum_i \partial \rho_i / \partial \boldsymbol{\Sigma} = \mathbf{O}$. To compute matrix derivatives we use the method of differentials (Magnus and Neudecker, 1999). Differentiating $\rho$ with respect to $\mathbf{B}$ we obtain

$d\rho = -2 w \tilde{w}\, \mathbf{r}^T \boldsymbol{\Sigma}^{-1} (d\mathbf{B})^T \mathbf{u},$

where $\tilde{w} = (\nu + q) / (\nu + \mathbf{r}^T \boldsymbol{\Sigma}^{-1} \mathbf{r})$. Then

(17) $\sum_{i=1}^{n} w_i \tilde{w}_i\, \hat{\mathbf{u}}_i \hat{\mathbf{r}}_i^T \hat{\boldsymbol{\Sigma}}^{-1} = \mathbf{O},$

which can be rearranged in matrix form as

$\sum_{i=1}^{n} w_i \tilde{w}_i\, \hat{\mathbf{u}}_i \hat{\mathbf{u}}_i^T \hat{\mathbf{B}} = \sum_{i=1}^{n} w_i \tilde{w}_i\, \hat{\mathbf{u}}_i \hat{\mathbf{v}}_i^T,$

and (11) follows. Differentiating $\rho$ with respect to $\boldsymbol{\Sigma}$ we obtain

$d\rho = w\, \mathrm{tr}\left\{ \left( \boldsymbol{\Sigma}^{-1} - \tilde{w}\, \boldsymbol{\Sigma}^{-1} \mathbf{r} \mathbf{r}^T \boldsymbol{\Sigma}^{-1} \right) d\boldsymbol{\Sigma} \right\},$

so

$\sum_{i=1}^{n} w_i\, \hat{\boldsymbol{\Sigma}}^{-1} = \sum_{i=1}^{n} w_i \tilde{w}_i\, \hat{\boldsymbol{\Sigma}}^{-1} \hat{\mathbf{r}}_i \hat{\mathbf{r}}_i^T \hat{\boldsymbol{\Sigma}}^{-1}.$

Again, this can be expressed in matrix form as

$\hat{\boldsymbol{\Sigma}} \sum_{i=1}^{n} w_i = \sum_{i=1}^{n} w_i \tilde{w}_i\, \hat{\mathbf{r}}_i \hat{\mathbf{r}}_i^T,$

from which (12) follows.
We will simplify the derivation of the asymptotic distribution of $\hat{\mathbf{B}}$ by assuming that the true component scores $(\mathbf{u}_i, \mathbf{v}_i)$ are used instead of the estimated scores, and by assuming that $\boldsymbol{\Sigma}$ is fixed and known. In that case we can apply Theorem 5.23 of Van der Vaart (1998) directly, and obtain that $\sqrt{n}\,\{\mathrm{vec}(\hat{\mathbf{B}}) - \mathrm{vec}(\mathbf{B})\}$ is asymptotically $N(\mathbf{0}, \mathbf{A}^{-1} \mathbf{C} \mathbf{A}^{-1})$ with

$\mathbf{A} = -E\left\{ \frac{\partial \boldsymbol{\Psi}(\mathbf{u}, \mathbf{v}; \mathbf{B}, \boldsymbol{\Sigma})}{\partial\, \mathrm{vec}(\mathbf{B})^T} \right\}$

and

$\mathbf{C} = E\left\{ \boldsymbol{\Psi}(\mathbf{u}, \mathbf{v}; \mathbf{B}, \boldsymbol{\Sigma})\, \boldsymbol{\Psi}(\mathbf{u}, \mathbf{v}; \mathbf{B}, \boldsymbol{\Sigma})^T \right\},$

where $\boldsymbol{\Psi} = w \tilde{w}\, \mathrm{vec}(\mathbf{u} \mathbf{r}^T \boldsymbol{\Sigma}^{-1})$ and these expectations are taken with respect to the true parameters. Without loss of generality we can eliminate the factor $-2$ in the differential leading to (17); then it is easy to see that (16) holds. To derive (15) we use differentials again:

$d\boldsymbol{\Psi} = w \tilde{w}\, \mathrm{vec}\{ \mathbf{u}\, (d\mathbf{r})^T \boldsymbol{\Sigma}^{-1} \} + w\, (d\tilde{w})\, \mathrm{vec}(\mathbf{u} \mathbf{r}^T \boldsymbol{\Sigma}^{-1}), \qquad d\mathbf{r} = -(d\mathbf{B})^T \mathbf{u}, \qquad d\tilde{w} = \frac{2 \tilde{w}^2}{\nu + q}\, \mathbf{r}^T \boldsymbol{\Sigma}^{-1} (d\mathbf{B})^T \mathbf{u},$

so

$\mathbf{A} = E\left\{ w \tilde{w}\, (\boldsymbol{\Sigma}^{-1} \otimes \mathbf{u} \mathbf{u}^T) - \frac{2 w \tilde{w}^2}{\nu + q}\, (\boldsymbol{\Sigma}^{-1} \mathbf{r} \otimes \mathbf{u}) (\boldsymbol{\Sigma}^{-1} \mathbf{r} \otimes \mathbf{u})^T \right\},$

from which (15) follows.
References

Ash, R. B. and Gardner, M. F. (1975). Topics in Stochastic Processes. Probability and Mathematical Statistics (Vol. 27). New York: Academic Press.

Chiou, J.M., Müller, H.G., and Wang, J.L. (2004). Functional response models. Statistica Sinica 14, 675–693.

Cuevas, A., Febrero, M., and Fraiman, R. (2007). Robust estimation and classification for functional data via projectionbased depth notions. Computational Statistics 22, 481–496.

Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.

Fraiman, R., and Muniz, G. (2001). Trimmed means for functional data. Test 10, 419–440.

Gervini, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika 95, 587–600.

Gervini, D. (2009). Detecting and handling outlying trajectories in irregularly sampled functional datasets. The Annals of Applied Statistics 3, 1758–1775.

He, X., Simpson, D. G. and Wang, G. (2000). Breakdown points of t-type regression estimators. Biometrika 87, 675–687.

James, G., Hastie, T. G. and Sugar, C. A. (2000). Principal component models for sparse functional data. Biometrika 87, 587–602.

Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T. and Cohen, K. L. (1999). Robust principal components for functional data (with discussion). Test 8, 1–28.

Magnus, J. R., and Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. Revised Edition, New York: Wiley.

Maronna, R. A., Martin, R. D. and Yohai, V. J. (2006). Robust Statistics. Theory and Methods. New York: Wiley.

Maronna, R. A. and Yohai, V. J. (2012). Robust functional linear regression based on splines. To appear in Computational Statistics & Data Analysis.

Müller, H.G., Chiou, J.M., and Leng, X. (2008). Inferring gene expression dynamics via functional regression analysis. BMC Bioinformatics 9:60.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Second Edition. New York: Springer.

Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge, UK: Cambridge University Press.

Yao, F., Müller, H.G. and Wang, J.L. (2005a). Functional linear regression analysis for longitudinal data. The Annals of Statistics 33, 2873–2903.

Yao, F., Müller, H.G. and Wang, J.L. (2005b). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100, 577–590.

Zhu, H., Brown, P.J. and Morris, J.S. (2011). Robust, adaptive functional regression in functional mixed model framework. Journal of the American Statistical Association 106, 1167–1179.