Sparse Estimation of Historical Functional Linear Models with a Nested Group Bridge Approach
The conventional historical functional linear model relates the current value of the functional response at time to all past values of the functional covariate up to time . Motivated by situations where it is more reasonable to assume that only recent, instead of all, past values of the functional covariate have an impact on the functional response, we investigate in this work the historical functional linear model with an unknown forward time lag into the history. Besides the common goal of estimating the bivariate regression coefficient function, we also aim to identify the historical time lag from the data, which is important in many applications. Tailored for this purpose, we propose an estimation procedure adopting the finite element method to conform naturally to the trapezoidal domain of the bivariate coefficient function. A nested group bridge penalty is developed to provide simultaneous estimation of the bivariate coefficient function and the historical lag. The method is demonstrated in a real data example investigating the effect of muscle activation recorded via the noninvasive electromyography (EMG) method on lip acceleration during speech production. The finite sample performance of our proposed method is examined via simulation studies in comparison with the conventional method.
Key words: Finite element; Functional data analysis; Function-on-function regression; Historical linear model; Nested group bridge penalty.
We are interested in functional linear regression models, where a functional predictor is used to explain a functional response over a time interval , . There are a number of possible ways to build up the model. For instance, Ramsay and Dalzell (1991) introduced the functional linear model
where is the intercept function, is the residual and is an unconstrained regression bivariate coefficient function of primary interest. The bivariate coefficient function represents the effect of the functional predictor at time on the functional response at time . According to model (1), future facts about the covariate is used to explain , the response at current time. This is reasonable if the underlying process is periodic, but illogical otherwise. If the prediction of the response only depends on concurrently observed predictor , a restriction is then introduced on , leading to the functional concurrent model , for where a univariate coefficient function suffices to fully describe their relationship (Ramsay and Silverman, 2005). In other situations, the response at a particular time point may depend on the recent history of the predictor, then the functional concurrent model becomes too constrained. For example, the recovery of a patient may well depend on treatment received over the past a few days.
To properly model the relationship between the historical predictor and the current response, Malfait and Ramsay (2003) introduced the historical functional linear model
where represents a forward time lag back into the history such that is likely to have an impact on only over the time frame rather than over the complete time frame . Notice that the degenerated case with reduces to a concurrent model. The time lag is typically unknown and of interest. The objective of our study is to estimate the time lag along with the coefficient function from the data .
Concerning the classic function-on-function regression model (1), one can apply basis expansion to and , and obtain a weighted least squares type estimator, possibly with a bivariate roughness penalty (Besse and Cardot, 1996; Ramsay and Silverman, 2005). Alternatively, one can compute the functional principal component scores for and , and base the estimation on the functional principal component scores (Chen and Wang, 2011; Ivanescu, et al., 2015; Yao, Müller, and Wang, 2005).
The functional concurrent model, as a type of varying coefficient model, has also been studied relatively well in the literature (see, Fan and Zhang, 2000; Hastie and Tibshirani, 1993; Wu, Chiang and Hoover, 1998; Wu and Liang, 2004; Zhou, Huang, and Carroll, 2008). For a comprehensive introduction to functional regression models, we refer to monographs by Ferraty and Vieu (2006), Hsing and Eubank (2015), Kokoszka and Reimherr (2017), Ramsay and Silverman (2005), as well as the review papers by Morris (2015) and Wang, Chiou and Müller (2016) and references therein.
There also exist intermediate models between the functional linear model (1) and the functional concurrent model which try to model the effect of past values of the predictor on current response. A class of time-varying functional regression models were discussed by Müller and Zhang (2005), Şentürk and Müller (2008; 2010), etc. Brockhaus et al. (2017) and Greven and Scheipl (2017) discussed a general framework of functional regression, where many aforementioned models are included as special cases, and proposed gradient-boosting-based estimation procedures. Kim, Şentürk and Li (2011) proposed an estimation procedure that was geared towards sparse longitudinal data with irregular observation times and small amount of measurements per subject. Assuming the dependence of on for does not change over time, Asencio, Hooker and Gao (2014) introduced the functional convolution model which is a functional extension of distributed lag models in time series, and proposed a penalized ordinary least squares estimator for the regression coefficient given the historical time lag .
The historical functional linear model (2) is identifiable but not estimable, with effectively an infinite number of covariates, therefore regularization or roughness penalty on is necessary in the estimation process (Ramsay and Silverman, 2005). Malfait and Ramsay (2003) regularized the fit by approximating using an expansion with a finite number of basis functions. Such approach bears the well known limitation that a small number of basis functions would decrease the goodness of fit while a large number of basis functions would lead to unstable estimation. Harezlak et al. (2007) improved the fit of model (2) by imposing a discrete roughness penalty forcing neighboring coefficients to be similar, which is an extension of the P-spline by Eilers and Marx (1996). Both work focus on the estimation of coefficient function for a given historical lag , while the lag is decided in a separate manner. Malfait and Ramsay (2003) considered deciding as a model selection problem. Briefly put, they divided the whole time course into sub-intervals of equal length, fitted the model with for , and chose the best model from the candidate models according to a certain criterion.
In this work, we focus on the problem of estimating the historical lag and propose a tailored procedure for the simultaneous estimation of the coefficient function and the lag in model (2). Considering the following “full” model,
Figure 1 illustrates a scenario when the bivariate coefficient function becomes zero over the triangular region defined by vertices . This scenario corresponds to that does not affect for . In other words, the full model (3) is equivalent to the historical functional linear model (2) when the support domain of the coefficient function is a trapezoidal area defined by vertices , given that the lag .
We propose a nested group bridge shrinkage method to estimate the historical lag and the coefficient function with sparsity, which tackles the problem from a completely different perspective. Our method has two features. First of all, we utilize the triangular basis functions from the finite element method theory, which respects the non-rectangular support domain of the coefficient function . Such basis was also used by Malfait and Ramsay (2003) and Harezlak et al. (2007), as well as in spatial data analysis, where data distribute over irregularly shaped spatial domains with features like complex boundaries, strong concavities and interior holes (e.g. Ramsay, 2002; Sangalli, Ramsay and Ramsay, 2013). Then under model (3), we organize the basis coefficients in such a way that the nested group bridge penalty is able to shrink specifically the coefficient function over the upper triangular region with vertices , and towards zero. The group bridge shrinkage was originally proposed by Huang et al. (2009) for variable selection. Such penalty has been utilized by Wang and Kai (2015) for locally sparse estimation in nonparametric regression for a scalar-on-function historical functional linear model with B-spline basis function expansion. The major advantage of our proposed approach is that we can estimate automatically without predetermining the candidate values of . Our simulation studies show that our estimator of has a better finite sample performance than the conventional methods.
To ease the notation, the intercept function can be dropped from model (3) without loss of generality. Let and denote the pointwise centered response curves and predictor curves, respectively. A centered model without intercept is
Upon obtaining an estimate , the intercept function is then estimated as
We drop the asterisk from model (4) from now on and focus the discussion on the estimation of the coefficient function in the model:
2.1 Approximation with the Finite Element Method
We propose to approximate the coefficient function with the triangular basis, which originates from the finite element method and is widely used in the numerical solution for the boundary-value problems of partial differential equations. Noticing that the support of the coefficient function is non-rectangular, approximation with a commonly used bivariate spline generated via tensor product would result in a jagged shape along the boundary . Therefore, finite elements that can approximate arbitrary regions serve as a natural alternative.
Let denote the known triangular basis functions. The coefficient function is approximated by the expansion,
Plugging the above approximation into model (4), we obtain
where is known, includes both the residual and the approximation error.
Divide the interval on each axis into subintervals with equidistant nodes . Further split each square into two triangles by the diagonal parallel to the line . This divides the triangular region over which the coefficient function is possibly non-zero into congruent triangles (i.e. the triangular elements, or finite elements) with nodes. Each node corresponds to a basis function and a coefficient , both of which are indexed from bottom to top and row-wise from left to right. For example, the left panel in Figure 2 shows the triangulation and indexation of the nodes when . There are 25 triangular elements and 21 nodes corresponding to and for .
Each basis function has a compact support, defined over the hexagon centered at the -th node. Basis functions of degree 1 are tent shaped, piecewise linear and continuous, with value 1 at node and value 0 at the boundary of the hexagon. For instance, the right panel in Figure 2 shows a tent-shaped basis function corresponding to the 9th node at . Though the ’s are not orthogonal basis, their compact support property brings in certain computational advantage. We refer to Larson and Bengzon (2013) for a comprehensive introduction to the finite element basis system.
2.2 The Nested Group Bridge Approach
The coefficient function in model (5) is formally defined over the triangular region in Figure 1 corresponding to the lag , while we are trying to identify whether the upper left triangular region is zero or not. Therefore, a proper penalty should be able to shrink the upper left triangular region towards zero specifically, while respecting the nested group structure among the basis coefficients ’s.
We start by defining a sequence of nested triangular regions. Let and denote the triangle with vertices , for , and let contain a single point. Notice that . In correspondence to these regions, we define a sequence of decreasing groups , where consists of the indices of the nodes contained in triangle according to the node indexation described in Section 2.1. Taking as an example, there are 10 nodes and the index sets are , and . Furthermore, we denote by the vector of basis coefficients whose indices belong to set , and follow the conventional notations that and represent the and norms, respectively.
We adapt the group bridge approach proposed by Huang et al. (2009) and propose to estimate the vector of basis coefficients for in model (5) by minimizing the penalized least squares criterion
with a fixed , a nonnegative tuning parameter , known weights to offset the effect of different dimensions of , and a known smoothness penalty matrix to be introduced shortly. Following Huang et al. (2009), a simple choice for the weights is .
The first term in criterion (6) is the ordinary least squares taking into account the whole time course . The second term is the so called nested group bridge penalty, which originated from Huang et al. (2009) for simultaneous variable selection at both the group and within-group levels. Consider any two sets and , with . Due to the nested nature, the vector of coefficients is always a subvector of , hence the coefficients corresponding to the nodes in region appear in more groups than the ones corresponding only to . This suggests that the nested group bridge penalty shrinks more heavily the coefficients corresponding to regions in a closer proximity to , as desired. The last term imposes a discrete roughness penalty on the basis coefficients , extending the idea of Eilers and Marx (1996) and Harezlak et al. (2007). Concerning the smoothness of the coefficient function , there are three interpretable directions in which we desire to control the changes in the adjacent coefficients, i.e., horizontal, vertical and diagonal directions, as shown in the left panel of Figure 2. Along each direction, the rows of a penalty matrix correspond to the differences in the basis coefficients of adjacent nodes. The discrete smoothness penalty matrix is defined as
which combines the three directions with nonnegative weights , and . An example for is as follows, where the columns of , and from left to right correspond to the basis coefficients ,
The key difficulty in optimizing the objective function (6) is due to its non-convexity. We can work on an equivalent constrained optimization problem, which is convex and easier to solve (Huang et al. 2009). The objective function (6) involves an integration of the time dependent ordinary least squares criterion over time. Dividing the time course into equally-spaced subintervals at time points , , the integrated least squares criterion can be approximated by the finite sum as
With a sufficiently large , the above approximation with Riemann sums over a regular partition of the integral domain provides a satisfactory level of precision with acceptable computational load. Define the -by- vector , and the -by- matrix whose -th element is . By stacking the vectors into a long vector of length and the matrices into a tall matrix of size by , the approximated integrated least squares criterion is rewritten in the matrix expression as
The objective function (6) is then rewritten as the familiar penalized least squares
where differs from earlier notation by a constant factor .
Define a vector . Then minimizes criterion (7) if and only if minimizes
for and (Huang et al. 2009). Notice that by definition of the norm, and that if appears in then it also appears in all with . Thus, exchange the order of summation in the second term of (8), rearrange and collect all the individual terms involving , the objective function becomes
where with . Intuitively, is the number of times that coefficient appears in the nested group bridge penalty.
Define as a diagonal matrix with the th diagonal element , , , and of proper length. Finally, the objective function (8) is expressed as
where is the th element of vector . The following iterative algorithm is used to compute .
Step 1: Obtain an initial estimate with ordinary least squares.
Step 2: At each iteration , update based on as
Step 3: At each iteration , update based on by recognizing this is a LASSO problem
Repeat Step 2 and Step 3 until convergence.
Once obtaining , the estimators for and are defined as
After obtaining , we then refine the estimation for by minimizing criterion (7) with , i.e. excluding the nested grouped bridge penalty.
When implementing the algorithm, there are two types of tuning parameters, i.e. the shrinkage parameter (or equivalently ) and the smoothness parameters as one type, and the number of grids on the time interval as the other type. Since the precision of obviously depends on , we desire a relatively large in order to achieve a reasonably good estimate of the historical lag , and at the same time to capture enough local features of the coefficient function . This is also consistent with the common strategy when applying penalized least squares approach, where a relatively large number of nodes is preferred and potential overfitting caused by such a choice would be offset via the sparsity and roughness penalty. The shrinkage parameter and smoothness parameters can be chosen via, for example, the Bayesian information criterion (BIC). The effective degrees of freedom for given and can be approximated by
where consists of the columns of corresponding to the nonzero coefficients in , and is obtained in the same way by properly selecting the columns of , and . The BIC is then approximated by
and the tuning parameters are selected as the minimizer of .
3 Analysis of Real Data
In this analysis, we apply the proposed nested group bridge approach to data from a speech production experiment. It is known that there are over 100 muscles that must be controlled centrally during speech production, such as the muscles of the thoracic and abdominal walls, the neck and face, and the oral cavity, etc. The timing of activation of different muscle groups is a central issue in anatomical and physiological research of speech. Due to the fact that muscle contractions generate electrochemical changes, the noninvasive electromyography (EMG) method is used to collect data on muscle activation. With electrodes attached to the skin over the muscle, potentials as a result of muscle stimulation can be picked up. In this experiment, a subject said the syllable “bob” 32 times. By Newton’s second law, the accelerations of the center of the lower lip reflect the force applied to tissue by muscle contraction, which was recorded as the response curves. Ramsay and Silverman (2002) provided a comprehensive introduction of the background knowledge about this physiology study.
A random subset of 10 smoothed observations for EMG recordings (top) and corresponding lower lip accelerations (bottom) is presented in Figure 3, where the time range is . Considering that the current lip acceleration is very likely a result of recent muscle movement, this motivates us to fit the historical functional linear model (2) with an unknown lag. The EMG recordings are the functional covariates and the lip accelerations are the functional response. The goal of this analysis is to explore the association between EMG recording and lip acceleration, and identify a historical lag if there is any. To apply the nested group bridge approach, we divides the time range [0,0.64] into 20 subintervals with equidistant nodes, i.e. . This leads to nodes and triangular basis functions. By minimizing criterion (6) as described in Section 2, we obtain the estimates for both the regression coefficient and the historical lag.
The estimated historical lag is seconds, indicating there is a historical effect of muscle activation on the lip movement. A 95% confidence interval for the historical lag is , constructed via the bootstrap method by re-sampling the residuals and then re-estimating the model.
Figure 4 shows the corresponding estimate of the bivariate coefficient function . Dividing pointwise by its standard error obtained via the bootstrap method, the plot of the test statistics looks very similar to Figure 4, with slight difference in the magnitude of the values. We observe that most of the coefficients are significant, using a two-sided -test at 5% significance level. The estimate of has extraordinarily large values along the diagonal direction of two regions, and . This is as expected, corresponding to the two productions of the
/b/ syllable when the lip is closed and the muscle activation is most influential. The width of the diagonal band with large values varies approximately from 50 to 60 milliseconds, which corresponds to the delay for a neural signal to be transduced into muscle contraction. And peaks at larger lags suggest possible covariation of EMG and lip movement that requires solid physiological knowledge for interpretation.
4 Simulation Studies
Simulation studies are conducted to evaluate the proposed estimation procedure with a nested group bridge penalty (NGB) in comparison with the penalized approach by Harezlak et al. (2007) based on linear model approximation (PLMA). Set . Let denote the trapezoidal support with vertices , , , as shown in Figure 1, and the indicator function. We have considered the following three scenarios, and the true coefficient function is shown in Figure 5.
Scenario 1. , where is the trapezoid with vertices , , , . The coefficient function is constant over the region , linear over and vanishes at the line .
Scenario 2. . The coefficient function is linear over its support , with maximum along the line and vanishes at the line .
Scenario 3. Some “holes” are created randomly on the coefficient function of Scenario 2, such that can vanish inside .
In Scenario 1, we take a small value of , which leads to a sharp drop in the coefficient function towards zero when going outside from the region toward the line . We consider this scenario as the easiest situation to determine the historical lag due to the sharp change, and expect both approaches to perform well and similarly. Scenario 2 is slightly more difficult, since the boundary becomes more blurred than that in Scenario 1 when data are contaminated with errors. Scenario 3 adds further difficulty with irregularly shaped coefficient function and admits the more general assumption that the dependence of response on the historical predictor can vary over time. For all scenarios, the observed covariate curves in the real data example presented in Section 3 are rescaled to time interval and are used along with the true coefficient function to generate the mean response curves, possibly to mimic real situations. A random subset of 10 covariate curves are shown in the top panel of Figure 3. We refer to Section 3 for details about the example. The errors are additive and generated pointwise from a distribution.
For the proposed NGB approach, we set and use the ordinary least squares estimate by Malfait and Ramsay (2003) as an initial value for the algorithm. We choose similarly to the idea of adaptive LASSO (Zou, 2006). For each scenario, 100 independent replications are simulated. The results are summarized in Table 1. Concerning the estimated root mean squared error of , the proposed NGB estimator outperforms the PLMA estimator in all three scenarios, with a much smaller bias and comparable standard deviation. While the PLMA estimator performs well in Scenario 1 with a small bias and a reasonably small standard deviation, the performance deteriorates badly in more difficult scenarios and the estimator exhibits a tendency to underestimate . To assess the accuracy of the estimated bivariate coefficient function, we calculate the square root of the mean integrated squared errors (RISEs) over a meshgrid, also presented in Table 1. The NGB and PLMA approaches do not shown discernible difference in terms of the estimated RISE for .
|RMSE of||%bias of||SD of||RISE of|
|Scenario 1||0.0173||0.0206||1.2||1.7||0.0162||0.0188||1.29 (0.25)||0.83 (0.35)|
|Scenario 2||0.0480||0.1587||-5.4||-31.5||0.0396||0.0192||69.86 (7.04)||69.33 (7.54)|
|Scenario 3||0.0548||0.2426||-9.2||-48.3||0.0297||0.0235||8.74 (1.33)||9.75 (1.34)|
We have considered in this article the estimation of the historical functional linear model (2) with an unknown forward lag. We propose a nested group bridge approach, tailored for the simultaneous estimation of the historical lag and the regression coefficient function. The nested group bridge penalty is able to shrink not only a group of coefficient corresponding to the designated area towards zero, but also individual coefficients corresponding to the remaining area towards zero. We adopt the triangular basis from the finite element method in order to conform naturally to the non-rectangular domain of the regression coefficient function. The triangular basis system is computationally efficient in the sense that the compact support of the basis functions leads to a sparse design matrix.
Under the historical functional linear model (2), we assume that the historical lag is independent of . It may be of interest to describe more precisely the historical effect of the functional covariate on the functional response via a time-dependent lag . Another point of interest is the alternative choice of the smoothness penalty. The discrete penalty originated from P-spline (Eilers and Marx, 1996) involves a large matrix and its inversion. It is of interest to explore other possibilities. We leave the above mentioned points for future work.
- 1 Asencio, M., Hooker, G., and Gao, H. O. (2014). Functional convolution models. Statistical Modelling: An International Journal, 14(4), 315–335.
- 2 Besse, P. C., and Cardot, H. (1996). Approximation spline de la prévision d’un processus fonctionnel autorégressif d’ordre 1. Canadian Journal of Statistics, 24(4), 467–487.
- 3 Brockhaus, S., Melcher, M., Leisch, F., and Greven, S. (2017). Boosting flexible functional regression models with a high number of functional historical effects. Statistics and Computing, 27(4), 913–926.
- 4 Chen, H., and Wang, Y. (2011). A penalized spline approach to functional mixed effects model analysis. Biometrics, 67(3), 861–870.
- 5 Eilers, P. H., and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–102.
- 6 Fan, J., and Zhang, J.-T. (2000). Two-step estimation of functional linear models with applications to longitudinal data. Journal of the Royal Statistical Society, Series B, 62(2), 303–322.
- 7 Ferraty, F., and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. New York: Springer.
- 8 Greven, S., and Scheipl, F. (2017). A general framework for functional regression modelling. Statistical Modelling: An International Journal, 17(1–2), 1–35.
- 9 Harezlak, J., Coull, B. A., Laird, N. M., Magari, S. R., and Christiani, D. C. (2007). Penalized solutions to functional regression problems. Computational Statistics and Data Analysis, 51(10), 4911–4925.
- 10 Hastie, T., and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55(4), 757–779.
- 11 Hsing, T., and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. New York: John Wiley & Sons.
- 12 Huang, J., Ma, S., Xie, H., and Zhang, C.-H. (2009). A group bridge approach for variable selection. Biometrika, 96(2), 339–355.
- 13 Ivanescu, A. E., Staicu, A.-M., Scheipl, F., and Greven, S. (2015). Penalized function-on-function regression. Computational Statistics, 30(2), 539–568.
- 14 Kim, K., Şentürk, D., and Li, R. (2011). Recent history functional linear models for sparse longitudinal data. Journal of Statistical Planning and Inference, 141(4), 1554–1566.
- 15 Kokoszka, P., and Reimherr, M. (2017). Introduction to Functional Data Analysis. New York: CRC Press.
- 16 Larson, M. G., and Bengzon, F. (2013). The Finite Element Method: Theory, Implementation, and Applications. New York: Springer.
- 17 Malfait, N., and Ramsay, J. O. (2003). The historical functional linear model. Canadian Journal of Statistics, 31(2), 115–128.
- 18 Morris, J. S. (2015). Functional regression. Annual Review of Statistics and Its Application, 2(1), 321–359.
- 19 Müller, H.-G., and Zhang, Y. (2005). Time-varying functional regression for predicting remaining lifetime distributions from longitudinal trajectories. Biometrics, 61(4), 1064–1075.
- 20 Ramsay, J. O., and Dalzell, C. J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B, 53(3), 539–561.
- 21 Ramsay, J. O., and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. New York: Springer.
- 22 Ramsay, J. O., and Silverman, B. W. (2005). Functional Data Analysis (2nd ed). New York: Springer.
- 23 Ramsay, T. (2002). Spline smoothing over difficult regions. Journal of the Royal Statistical Society, Series B, 64(2), 307–319.
- 24 Sangalli, L. M., Ramsay, J. O., and Ramsay, T. O. (2013). Spatial spline regression models. Journal of the Royal Statistical Society, Series B, 75(4), 681–703.
- 25 Şentürk, D., and Müller, H.-G. (2008). Generalized varying coefficient models for longitudinal data. Biometrika, 95(3), 653–666.
- 26 Şentürk, D., and Müller, H.-G. (2010). Functional varying coefficient models for longitudinal data. Journal of the American Statistical Association, 105(491), 1256–1264.
- 27 Wang, H., and Kai, B. (2015). Functional sparsity: global versus local. Statistica Sinica, 25(4), 1337–1354.
- 28 Wang, J.-L., Chiou, J.-M., and Müller, H.-G. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3(1), 257–295.
- 29 Wu, C. O., Chiang, C.-T., and Hoover, D. R. (1998). Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. Journal of the American Statistical Association, 93(444), 1388–1402.
- 30 Wu, H., and Liang, H. (2004). Backfitting random varying-coefficient models with time-dependent smoothing covariates. Scandinavian Journal of Statistics, 31(1), 3–19.
- 31 Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33(6), 2873–2903.
- 32 Zhou, L., Huang, J. Z., and Carroll, R. J. (2008). Joint modelling of paired sparse functional data using principal components. Biometrika, 95(3), 601–619.
- 33 Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.