Functional robust regression for longitudinal data

# Functional robust regression for longitudinal data

Daniel Gervini
Department of Mathematical Sciences
University of Wisconsin–Milwaukee
###### Abstract

We present a robust regression estimator for longitudinal data, which is especially suited for functional data that has been observed on sparse or irregular time grids. We show by simulation that the proposed estimators possess good outlier-resistance properties compared with the traditional functional least-squares estimator. As an example of application, we study the relationship between levels of oxides of nitrogen and ozone in the city of San Francisco.

Key Words: Functional data analysis; Longitudinal data analysis; Mixed effects models; Robust statistics; Spline smoothing.

## 1 Introduction

In a typical longitudinal study, a number of variables are measured on a group of individuals and the goal is to analyze the relationships between the trajectories of the variables. In recent years, functional data analysis has provided efficient ways to analyze longitudinal data. In many cases the variable trajectories are discretized continuous curves that can be reconstructed by smoothing, and functional linear regression methods can be applied to study the relationship between the variables (Ramsay and Silverman, 2005). But in other situations the data is observed at sparse and irregular time points, which makes smoothing difficult or even unfeasible. Therefore, functional regression methods that can be applied directly to the raw measurements become very useful.

Methods for functional data analysis of irregularly sampled curves have been proposed by a number of authors, for the one-sample problem as well as for the functional regression problem (Chiou et al., 2004; James et al., 2000; Müller et al., 2008; Yao et al., 2005a, 2005b). Outlier-resistant techniques for the functional one-sample problem have also been proposed (Cuevas et al., 2007; Gervini, 2008, 2009; Fraiman and Muniz, 2001; Locantore et al., 1999), and two recent papers deal with robust functional regression for pre-smoothed curves (Zhu et al. 2011; Maronna and Yohai, 2012). However, outlier-resistant functional regression methods for raw functional data have not yet been proposed in the literature. In this paper we address this problem and present a computationally simple approach based on random-effect models. Our simulations show that this method attains the desired outlier resistance against atypical curves, and that the asymptotic distribution of the test statistic is approximately valid for small samples.

As an example of application, we will analyze the daily trajectories of oxides of nitrogen and ozone levels in the city of Sacramento, California, during the summer of 2005. The data is shown in Figure LABEL:fig:sample_curves. The goal is to predict ozone concentration from oxides of nitrogen. Both types of curves follow regular patterns, but some atypical curves can be discerned in the sample. We will show in Section 4 that to a large extend it is indeed possible to predict ozone levels from oxides-of-nitrogen levels, but that the outlying curves distort the classical regression estimators and that the proposed robust method gives more reliable results.

The paper is organized as follows. Section 2 presents a brief overview of functional linear regression and introduces the new method. Section 3 reports the results of a comparative simulation study, and Section 4 presents a detailed analysis of the above mentioned ozone dataset. Technical derivations and proofs are left to the Appendix. Matlab\registered programs implementing these procedures are available on the author’s webpage.

\FRAME

ftbpFU5.9819in2.0634in0pt\QcbOzone Example. Daily trajectories of ground-level concentrations of (a) oxides of nitrogen and (b) ozone in the city of Sacramento in the Summer of 2005.\Qlbfig:sample_curvessample.eps

## 2 Method

### 2.1 Background: classical functional linear regression

The functional approach to longitudinal data analysis assumes that the observations are discrete measurements of underlying continuous curves, so

 xij = Xi(sij)+εij,  i=1,…,n,  j=1,…,mi, (1) yij = Yi(tij)+ε′ij,  i=1,…,n,  j=1,…,m′i, (2)

where and are the trajectories of interest, and are random measurement errors, and and are the time points where the data is observed. The s and the s are random functions that we assume independent and identically distributed realizations of a pair .

Suppose and are square-integrable functions on an interval . Define the norm and the inner product . If and are finite, then and admit the decomposition

 X(s) = μX(s)+p∑k=1Ukϕk(s), (3) Y(t) = μY(t)+q∑l=1Vlψl(t), (4)

known as the Karhunen–Loève decomposition (Ash and Gardner 1975, ch. 1.4), where , , and are orthonormal functions (i.e. and , where is Kronecker’s delta), and and are random variables with zero mean and finite variance (without loss of generality, one can assume that and .) This is the functional equivalent of the principal-component decomposition in multivariate analysis, so the s and s are called “principal components”, and the s and s are called “component scores”. In principle and in (3) and (4) could be infinite, but since and are finite, the sequences and usually decrease to zero fast enough that for practical purposes and can be assumed to be finite.

Methods for estimating the mean and the principal components of and can be found in Ramsay and Silverman (2005), James et al. (2000), and Yao et al. (2005b). These methods are not resistant to outliers, though; outlier-resistant estimators of the mean and principal components have been proposed by Locantore et al. (1999), Cuevas et al. (2007), and Gervini (2008, 2009). We will use the method of Gervini (2009) to estimate the mean and the principal components in (3) and (4). This method is briefly reviewed in the Appendix.

Now suppose that there is a functional linear relationship between and :

 Y(t)=α0(t)+∫baβ0(s,t)X(s)ds+Z(t), (5)

where is the intercept, the slope, and the error term. We assume and for all and . (Note that the is not necessarily white noise; it is just the portion of that is not explained by , and it is usually a smooth non-trivial process.) Since (5) implies that , we can rewrite (5) as

 Y(t)=μY(t)+∫baβ0(s,t){X(s)−μX(s)}ds+Z(t). (6)

Then the only parameter that remains to be estimated is the regression slope .

Since is an orthonormal basis of the -space and is an orthonormal basis of the -space, without loss of generality the regression slope can be expressed as

 β0(s,t)=p∑k=1q∑l=1θ0klϕk(s)ψl(t). (7)

In matrix form, , where and . If we also collect the component scores and into vectors and , from (3), (4), (6) and (7) we obtain

 ψ(t)TV = ∫baψ(t)TΘT0ϕ(s)ϕ(s)TU ds+Z(t) = ψ(t)TΘT0U+ψ(t)TW,

where is the random vector with elements . This reduces the functional regression model (6) to a simpler multivariate regression model,

 V=ΘT0U+W, (8)

and the problem now is to estimate the regression matrix .

### 2.2 Outlier-resistant functional regression

As explained above, given the data we use the reduced-rank estimators of Gervini (2009) to obtain robust estimators of , , , , and . By (7) and (8), the least-squares estimator of would be with

 (9)

However, this estimator is not robust. Although the reduced-rank estimators of , , and are robust, the component scores and are individual parameters that will be outliers if the corresponding curves and are outliers. Therefore, the estimator of has to incorporate a mechanism to downweight outlying s and s.

This can be accomplished, for instance, by a modification of the -type GM-estimators of He et al. (2000), that we will call GMt for short. Let

 (10)

where . These are the maximum likelihood estimators of and when in (8) follows a multivariate distribution with mean zero and scatter matrix , although we do not actually assume that follows this distribution; as in He et al. (2000), this is just the motivation behind definition (10).

It is shown in the Appendix that and satisfy the fixed-point equations

 ^Θ = {n∑i=1ρ′(ei)w(^Ui)^Ui^UTi}−1n∑i=1ρ′(ei)w(^Ui)^Ui^VTi, (11) ^Σ = 1nn∑i=1ρ′(ei)w(^Ui)RiRTi, (12)

where and . These equations can be solved iteratively by a reweighting algorithm.

As for the weights , they are essentially a by-product of the estimation of , and . Since and , the s are approximately uncorrelated with mean zero. The squared Mahalanobis distance of is then , and large s will correspond to -outliers. The s will follow an approximate distribution if the data is Gaussian.

This suggests a number of weighting schemes. One possibility is to use “metric” trimming,

 w(^Ui)={1,D2i≤χ2p,1−α,0,otherwise, (13)

where is the quantile of the distribution. Another possibility is to use rank-based trimming,

 w(^Ui)={1,rank(D2i)/n≤1−α,0,otherwise. (14)

The latter will always eliminate the observations with largest Mahalanobis distances, even if they are not actual outliers; so we recommend not using an unnecessarily large for rank-based trimming. In practice, the choice of can be based on the proportion of outliers observed in a boxplot or histogram of the s.

The estimator defined above belongs to the general class of M-estimators, which have well-known asymptotic properties (Van der Vaart, 1998, ch. 5). As shown in the Appendix, follows an approximate distribution for large , with

 A = (15) B = E[{ρ′(e)}2w2(U)RRT⊗UUT]. (16)

The matrices and can be easily estimated, replacing expectations by averages. This asymptotic distribution can be used, for instance, to test significance of the regression: if , Wald’s statistic follows an approximate distribution for large , so we decide the regression is significant if for a given level . We can also construct marginal tests and confidence intervals for the individual coefficients .

In Section 3 we will study the accuracy of this asymptotic approximation. It is our experience that the distribution of approaches normality quite fast, but the above “sandwich formula” tends to underestimate the variance when the sample size is small. In that case it is better to use bootstrap estimators of the covariance matrix of .

## 3 Simulations

In this section we study by simulation the finite-sample behavior of the estimators (10). To this end, we generated data from model (8) with and , where and . Two regression parameters were considered: for the first set of simulations (to study estimation error) we took with and for ; for the second set of simulations (to study the goodness of the asymptotic approximation of Wald’s test) we took . The curves and were generated following (3) and (4), with and equal to zero, and , for and in . The raw observations were generated following (1) and (2), with random s uniformly distributed in , and independent , and ; for simplicity we took the grid equal to .

The first series of simulations were designed to study estimation error of the s, both for clean and for outlier-contaminated data. We generated outliers by replacing of the pairs by , with and for , and . Note that the contaminated data follows model (8) with and high-leverage s, so the effect of this type of contamination is an underestimation of that tends to pull towards 0.

The estimation of requires two steps: first, to estimate and from the raw data, and then to compute from the s and the s. So we compared two procedures: a non-robust procedure, using reduced-rank Normal models (James et al., 2000) to estimate the component scores, followed by the ordinary least-squares regression estimator (9); and a robust procedure, using reduced-rank -models (Gervini, 2009) to estimate the component scores, followed by the GMt regression estimator (10). For the robust procedure, we considered the two types of weights discussed in Section 2.2, with trimming proportions and ; degrees of freedom and were used for the -models.

Four levels of contamination were considered: 0 (clean data), , and . We took as sample size, as grid size, and as model dimensions. Each case was replicated 1000 times. As measure of the estimation error we used the expected root integrated squared error , where .

The results are reported in Table 1, along with Monte Carlo standard errors. We see that for non-contaminated data (), there is no significant difference between metric and rank trimming for a given pair . The trimming proportion has a larger impact on the estimator’s behavior than the degrees of freedom . For this reason we recommend choosing adaptively, so as not to cut off too much good data. When , we see that metric trimming tends to outperform rank trimming for a given pair . Somewhat counterintuitively, estimators with tend to be more robust than those with for a given ; the reason is that for this type of contamination, which affects but not the s or the s, models with provide more accurate estimators of and than models with (for other types of contamination this is no longer true, although models with are still very robust; see Gervini (2009).) In general, then, the recommendation is to use -model estimators with metrically trimmed weights and a trimming proportion chosen adaptively.

The second series of simulations were designed to assess the finite-sample adequacy of the asymptotic Wald test. To this end we generated data as before, but with . Then should approximately follow a distribution, where is the asymptotic covariance matrix of . For GMt estimators, is the “sandwich formula” given in Section 2.2; for the least-squares estimator, . Table 2 reports the tail probabilities for the usual values of (, and ) and various combinations of parameters , , and . Each combination was replicated 10,000 times. We compared only two estimators this time: the least-squares estimator and the 10% metrically trimmed GMt estimator with . We see in Table 2 that the asymptotic approximation works reasonably well for the least-squares estimator if the ratio exceeds 15; however, for the GMt estimator a ratio of at least 35 is necessary for the asymptotic approximation to be reasonably good. Therefore, the asymptotic Wald test can be used with confidence only for large sample sizes and relatively small dimensions. In other cases, permutation tests or Wald tests with bootstrap-estimated covariances are preferable.

## 4 Application: Ozone Pollution Data

Ground-level ozone is an air pollutant known to cause serious health problems. Unlike other pollutants, ozone is not emitted directly into the air but forms as a result of complex chemical reactions, including volatile organic compounds and oxides of nitrogen among other factors. Modeling ground-level ozone formation has been an active topic of air-quality studies for many years. The California Environmental Protection Agency database, available at http://www.arb.ca.gov/aqd/aqdcd/aqdcddld.htm, has collected data on hourly concentrations of pollutants at different locations in California for the years 1980 to 2009. Here we will focus on the trajectories of oxides of nitrogen (NOx) and ozone (O3) in the city of Sacramento (site 3011 in the database) between June 6 and August 26 of 2005, which make a total of 82 days (shown in Figure LABEL:fig:sample_curves). There are a few days with some missing observations (9 in total), but since the method can handle unequal time grids, imputation of the missing data was not necessary.

\FRAME

ftbpFU5.3125in6.9427in0pt\QcbOzone Example. Normal () and Cauchy (—–) reduced-rank B-spline estimators of the mean [(a),(b)], the first principal component [(c),(d)], the second principal component [(e),(f)] and the third principal component [(g),(h)] of log-NOx and root-O3 trajectories. \Qlbfig:mean-pcmean-pc.eps

The first step in the analysis is to fit reduced-rank models to the sample curves. We used cubic B-splines with 7 equally spaced knots every 5 years, and fitted Normal and (Cauchy) reduced-rank models with up to 10 principal components. For both the response and the explanatory curves, the leading three components explain at least 85% of the total variability, so we retained these models. The means and the principal components are plotted in Figure LABEL:fig:mean-pc. There is no substantial difference between the estimators obtained by these models, except perhaps for the mean and the third component of log-NOx (Figures LABEL:fig:mean-pc (a) and (g)).

With the Normal component scores we computed the Least Squares estimator, obtaining

 ^ΘLS=⎛⎜⎝.0404−.0077.0083−.0537−.0085.0317−.0109−.0173−.0263⎞⎟⎠.

With the Cauchy component scores we computed the GMt estimator with 1 degree of freedom and 10% metric trimming, obtaining

 ^ΘGM=⎛⎜⎝.0406−.0172.0045−.0451.0029.0266−.0289−.0071−.0317⎞⎟⎠.

The latter cut off 5 observations out of the 82. There are some noticeable differences between these two estimators, even leaving aside the third row (which are not easily comparable, since and are rather different). The differences are more striking in the slope estimators and , shown in Figure LABEL:fig:betas. There is a “bump” in around that does not appear in . This means that the robust slope estimator assigns positive weight to NOx values around 8am in the prediction of O3 levels around 4pm, showing that there is a persistent effect of oxides-of-nitrogen level in ozone formation.

\FRAME

ftbpFU6.5094in2.3938in0pt\QcbOzone Example. Functional slope estimators obtained by (a) least squares using Normal scores and (b) metric-trimmed GMt using Cauchy scores.\Qlbfig:betasbetas.tif

Of course, none of this would be meaningful if the regression model was not statistically significant. But the estimated response curves, shown in Figure LABEL:fig:pred-y, clearly show that the model does predict the response curves to a large extent. The robust estimator provides a better fit overall, with a root median squared error of compared to the root median squared error of for the least squares estimator.

\FRAME

ftbpFU7.6104in1.881in0pt\QcbOzone Example. Daily trajectories of root-O3 levels: (a) observed, (b) predicted by robust GMt estimator, and (c) predicted by least squares.\Qlbfig:pred-ypred-y.eps

## Acknowledgement

The author was partly supported by NSF grants DMS 0604396 and 1006281.

## Appendix

### Reduced-rank t models

The method proposed by Gervini (2009) to estimate the mean and the principal components of a stochastic process works as follows. The mean function and the principal components are modeled as spline functions; that is, given a set of spline basis functions , chosen by the user, it is assumed that and . The observed vector can then be expressed as

 xi=Biξ+BiHΛ1/2zi+σεi,

where , and . Note that in this notation. By assuming has a standard multivariate distribution, robust maximum likelihood estimators of , , and are obtained. The estimators are computed via a standard EM algorithm. The optimal number of components can be chosen via AIC or BIC criteria. See Gervini (2009) for details. In addition to parameter estimates, the EM algorithm yields predictors of the random effects , so one obtains as a by-product. The estimators of , , and are obtained in a similar way from the sample .

### GMt estimating equations and asymptotics

The estimators and defined by (10) are M-type estimators (Van der Vaart, 1998, ch. 5), since they minimize a function of the form . Specifically,

 m(Θ,Σ)(^Ui,^Vi)=ρ{w(^Ui)(^Vi−ΘT^Ui)TΣ−1(^Vi−ΘT^Ui)}+log|Σ|.

Then and solve the equations and . To compute matrix derivatives we use the method of differentials (Magnus and Neudecker, 1999). Differentiating with respect to we obtain

 dm(Θ,Σ)(^Ui,^Vi) = ρ′(ei)w(^Ui)2(^Vi−ΘT^Ui)TΣ−1{−(dΘ)T^Ui} = −2ρ′(ei)w(^Ui)tr{(dΘ)T^Ui(^Vi−ΘT^Ui)TΣ−1} =

where . Then

 ∇vec(Θ)m(Θ,Σ)(^Ui,^Vi)=−2ρ′(ei)w(^Ui)vec{^Ui(^Vi−ΘT^Ui)TΣ−1}, (17)

which can be rearranged in matrix form as

 ∂∂Θm(Θ,Σ)(^Ui,^Vi)=−2ρ′(ei)w(^Ui)^Ui(^Vi−ΘT^Ui)TΣ−1,

and (11) follows. Differentiating with respect to we obtain

 dm(Θ,Σ)(^Ui,^Vi)=
 = ρ′(ei)w(^Ui)(^Vi−ΘT^Ui)T{−Σ−1(dΣ)Σ−1}(^Vi−ΘT^Ui)+tr{Σ−1(dΣ)} = −ρ′(ei)w(^Ui)tr{Σ−1(^Vi−ΘT^Ui)(^Vi−ΘT^Ui)TΣ−1(dΣ)}+tr{Σ−1(dΣ)} = +vec(Σ−1)Tvec(dΣ),

so

 ∇vec(Σ)m(Θ,Σ)(^Ui,^Vi)
 =−ρ′(ei)w(^Ui)vec{Σ−1(^Vi−ΘT^Ui)(^Vi−ΘT^Ui)TΣ−1}+vec(Σ−1).

Again, this can be expressed in matrix form as

 ∂∂Σm(Θ,Σ)(^Ui,^Vi)

from which (12) follows.

We will simplify the derivation of the asymptotic distribution of by assuming that the true component scores are used, instead of the estimated scores , and by assuming that is fixed and known. In that case we can apply Theorem 5.23 of Van der Vaart (1998) directly, and obtain that is asymptotically with

 A=E{∇vec(Θ)∇Tvec(Θ)m(Θ0,Σ0)(U,V)}

and

 B=E{∇vec(Θ)m(Θ0,Σ0)(U,V)∇Tvec(Θ)m(Θ0,Σ0)(U,V)};

these expectations are taken with respect to the true parameters . Without loss of generality we can eliminate the factor in (17); then it is easy to see that (16) holds. To derive (15) we use differentials again:

 d{∇Tvec(Θ)m(Θ,Σ0)(U,V)}=
 = 2ρ′′(e)w2(U)(V−ΘTU)TΣ−10(dΘ)TUvec{U(V−ΘTU)T}T +ρ′(e)w(U)vec{U(dΘTU)T}T = 2ρ′′(e)w2(U)tr{(dΘ)TU(V−ΘTU)TΣ−10}vec{U(V−ΘTU)T}T = 2ρ′′(e)w2(U)vec(dΘ)Tvec{U(V−ΘTU)TΣ−10}vec{U(V−ΘTU)T}T +ρ′(e)w(U){(Iq⊗UUT)vec(dΘ)}T,

so

 ∇vec(Θ)∇Tvec(Θ)m(Θ,Σ0)(U,V)=
 = 2ρ′′(e)w2(U)vec{U(V−ΘTU)TΣ−10}vec{U(V−ΘTU)T}T +ρ′(e)w(U)(Iq⊗UUT) = 2ρ′′(e)w2(U)(Σ−10R⊗U)(R⊗U)T+ρ′(e)w(U)(Iq⊗UUT),

from which (15) follows.

## References

Ash, R. B. and Gardner, M. F. (1975). Topics in Stochastic Processes. Probability and Mathematical Statistics (Vol. 27). New York: Academic Press.

Chiou, J.-M., Müller, H.-G., and Wang, J.-L. (2004). Functional response models. Statistica Sinica 14, 675–693.

Cuevas, A., Febrero, M., and Fraiman, R. (2007). Robust estimation and classification for functional data via projection-based depth notions. Computational Statistics 22, 481–496.

Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.

Fraiman, R., and Muniz, G. (2001). Trimmed means for functional data. Test 10, 419–440.

Gervini, D. (2008). Robust functional estimation using the median and spherical principal components. Biometrika 95, 587–600.

Gervini, D. (2009). Detecting and handling outlying trajectories in irregularly sampled functional datasets. The Annals of Applied Statistics 3, 1758–1775.

He, X., Simpson, D. G. and Wang, G. (2000). Breakdown points of -type regression estimators. Biometrika 87, 675–687.

James, G., Hastie, T. G. and Sugar, C. A. (2000). Principal component models for sparse functional data. Biometrika 87, 587–602.

Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T. and Cohen, K. L. (1999). Robust principal components for functional data (with discussion). Test 8, 1–28.

Magnus, J. R., and Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. Revised Edition, New York: Wiley.

Maronna, R. A., Martin, R. D. and Yohai, V. J. (2006). Robust Statistics. Theory and Methods. New York: Wiley.

Maronna, R. A. and Yohai, V. J. (2012). Robust functional linear regression based on splines. To appear in Computational Statistics & Data Analysis.

Müller, H.-G., Chiou, J.-M., and Leng, X. (2008). Inferring gene expression dynamics via functional regression analysis. BMC Bioinformatics 9:60.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Second Edition. New York: Springer.

Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge, UK: Cambridge University Press.

Yao, F., Müller, H.-G. and Wang, J.-L. (2005a). Functional linear regression analysis for longitudinal data. The Annals of Statistics 33, 2873–2903.

Yao, F., Müller, H.-G. and Wang, J.-L. (2005b). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100, 577–590.

Zhu, H., Brown, P.J. and Morris, J.S. (2011). Robust, adaptive functional regression in functional mixed model framework. Journal of the American Statistical Association 106, 1167–1179.

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters