WaveletBased ScalaronFunction Finite Mixture Regression Models
Abstract
Classical finite mixture regression is useful for modeling the relationship between scalar predictors and scalar responses arising from subpopulations defined by the differing associations between those predictors and responses. Here we extend the classical finite mixture regression model to incorporate functional predictors by taking a waveletbased approach in which we represent both the functional predictors and the componentspecific coefficient functions in terms of an appropriate wavelet basis. In the wavelet representation of the model, the coefficients corresponding to the functional covariates become the predictors. In this setting, we typically have many more predictors than observations. Hence we use a lassotype penalization to perform variable selection and estimation. We also consider an adaptive version of our waveletbased model. We discuss the specification of the model, provide a fitting algorithm, and apply and evaluate our method using both simulations and a real data set from a study of the relationship between cognitive ability and diffusion tensor imaging measures in subjects with multiple sclerosis.
Keywords: EM algorithm, Functional data analysis, Lasso, Wavelets.
1 Introduction
Regression models for scalar responses and functional predictors have become increasingly important tools for analyzing functional data. There are many applications in which these models can provide an adequate description of the data at hand. For instance, a classic example provided in Ramsay and Silverman (2005) uses a functional linear model to describe the association between total annual rainfall (scalar response of interest) and temperature over the course of the corresponding year (functional predictor).
Let be the scalar response of interest for observation , and let be a random predictor process that is square integrable on a compact support (i.e, ). The corresponding functional linear model (FLM) is given by:
(1.1) 
where is the scalar intercept and is the error term such that . is a square integrable coefficient function that relates the predictor process to the response. The magnitude of indicates the relative importance of the predictor at a given value of . If is large, this means that changes in the predictor process at are important in predicting the response.
We note that in model (1.1), the predictor is presented as though it is a 1dimensional signal depending on scalar value . However, could also be a 2 or higherdimensional functional object such as an image. In the case of a 2dimensional predictor image, corresponds to an ordered pair and is a coefficient image that provides information about the association of various regions of the predictor image with the scalar outcome of interest (Reiss and Ogden, 2010; Reiss et al., 2013).
A variety of approaches have been developed for estimating the coefficient function in (1.1). Many of these approaches employ functional principal components regression (FPCR) as in James (2002); Ramsay and Silverman (2005); and Cai and Hall (2006). Reiss and Ogden (2007) employ both penalization techniques and FPCR or functional partial least squares (FPLS). Splinebased approaches are also common for estimating (Cardot et al., 2003; Cardot and Sarda, 2005). Other “highly flexible” approaches include those taken by Müller and Yao (2008) who propose a functional additive regression model and James and Silverman (2005) who extend generalized linear models, generalized additive models, and projection pursuit regression to include functional predictors. Other more recent developments in estimating rely on combining the use of wavelets and sparse fitting procedures as in Zhao et al. (2012) and Reiss et al. (2013).
Although (1.1) can be appropriate for modeling the relationship between a scalar response and a functional predictor when the association between the response and predictor is the same for all observations, it is inadequate for settings in which the coefficient function differs across subgroups of the observations. If there are different associations corresponding to different coefficient functions then we can think of each observation as coming from one of distinct subpopulations/components and would need distinct FLMs to adequately describe the relationship between the response and the predictor. We are concerned with cases in which subpopulation membership is not observed and will need to be estimated along with the componentspecific coefficient functions. As motivation for a model that accounts for heterogeneous association between a functional predictor and scalar response Yao et al. (2011) describe a study of the association between the longevity (scalar response) and reproductive trajectory over the first 20 days of life (functional predictor) in female Mediterranean flies. Their analysis showed evidence of the existence of two distinct subpopulations defined by two different kinds of association between reproductivity and longevity.
Several approaches have been proposed to model scenarios like that mentioned above in both classical regression (with scalar predictors and a scalar response) and scalaronfunction regression. Some rely on using a clustering algorithm followed by fitting linear models within the estimated clusters as done by Spaeth (1979) in the case of classical regression and Preda and Saporta (2005) in the case of scalaronfunction regression. An alternative approach, which we adopt here, makes use of the theory of finite mixture regression models. When the coefficient function is thought to be different for mutually exclusive subgroups, it is natural to model the heterogeneity with a finite mixture of regressions. Finite mixture models have become an important statistical tool because of their ability to both approximate general distribution functions in a semiparametric way and account for unobserved heterogeneity (Grün and Leisch, 2007). Furthermore, mixture models can provide insight into previously unknown mechanisms that relate the predictors to the response.
Although the underlying theory of finite mixture regression models and methods for estimating those models have been wellstudied when the predictors are scalars (McLachlan and Peel, 2000; Schlattmann, 2009), methods for finite mixture regression remain relatively undeveloped when the predictors are functions. To our knowledge, Yao et al. (2011) are the only ones to investigate such an extension. They refer to the corresponding class of models that use the framework of finite mixture regression to model relationships between functional predictors and a scalar response in the presence of unobserved heterogeneity as functional mixture regression (FMR) models. In their approach, they first represent each functional predictor in terms of some suitably chosen number of functional principal components and apply standard mixture regression techniques in the new coordinate space.
The FMR model is given by
(1.2) 
where is the number of components or distinct subpopulations, is the th componentspecific intercept, and is the regression function for the th group, .
In contrast to Yao et al. (2011), we propose to take a waveletbased approach to FMR models. We focus here on the wavelet basis for several reasons. Wavelets are particularly well suited to handle many types of functional data, especially functional data that contain features on multiple scales. They have the ability to adequately represent global and local attributes of functions and can handle discontinuities and rapid changes.
Furthermore a large class of functions can be well represented by a wavelet expansion with relatively few nonzero coefficients. This is a desirable property from a computational point of view as it aids in achieving the goal of dimension reduction and will be of critical importance when the functional predictors are of very high dimension as is the case when the predictors are 2 or 3dimensional images.
Once the model is represented in terms of an appropriate wavelet basis, we employ a lassotype (Tibshirani, 1996) penalized fitting procedure to estimate the wavelet and scaling coefficients corresponding to the componentspecific coefficient functions in the mixture model.
The rest of the paper is organized as follows. In Section 2, we provide a brief discussion of wavelets and the waveletbased functional linear model followed by specification of the waveletbased (WB) functional finite mixture regression and adaptive waveletbased (AWB) functional finite mixture regression models. In Section 3, we outline an EMtype algorithm for fitting WB models. Section 4 discusses the various tuning parameters in the WB and AWB models. Section 5 presents simulation results showing the performance of the WB and AWB methods and an application of our method to a real data set where we investigate the association between fractional anisotropy profiles from the corpus callosum and scores on a test of cognitive functioning in a sample of subjects with multiple sclerosis. In section 6 we conclude with a brief discussion.
2 Methodology
2.1 Wavelets and Wavelet Decomposition
Wavelet bases are sets of functions that can be employed in representing a wide range of functional data and have the ability to represent localized features of functions in a sparse way. Comprehensive treatment of wavelets and their applications in statistics can be found in Ogden (1997); Nason (2008); and Vidakovic (1999). Here, we provide a brief overview of some important aspects of wavelet bases.
In , a wavelet basis is generated by two kinds of functions: a father wavelet, , and a mother wavelet, , with the following properties:
Father wavelets, also referred to as scaling functions, serve to approximate the function of interest while mother wavelets serve to provide the detail not captured by this approximation.
Any particular wavelet basis consists of translated and dilated versions of its father and mother wavelets given by
where the integer is the dilation index referring to the scale and is an integer that serves as a translation index. Larger values of correspond to scaling and wavelet functions that can provide more localized information about the function of interest.
In practice, the functional predictors and coefficient function associated with the FLM are not considered outside of a given interval. Without loss of generality, we take that interval to be [0,1]. Wavelet and scaling functions can be adapted via implementation of one of several boundary handling schemes to represent a given function on the unit interval. Hence if we assume that then we can represent in the wavelet domain by
(2.1) 
where is an integer that determines the number of scaling functions used in the lowest scale representation. The coefficients and are the corresponding scaling and wavelet coefficients for the functions and respectively and are given by
Each coefficient provides information about the characteristics of the function at a given scale and location.
We restrict ourselves to orthonormal wavelet families. By this we mean that and are such that
In typical applications, the functional predictors are discretely sampled. We assume that we observe a dyadic length vector of function values where the arguments , are equally spaced and the same for all observations. To obtain the wavelet and scaling coefficients corresponding to the functional predictors we use the discrete wavelet transform (DWT). The inverse DWT (IDWT) can be used to reconstruct a vector of functional observations from its corresponding wavelet and scaling coefficients. Both the DWT and IDWT can be performed using a computationally fast pyramid algorithm (Mallat, 1989).
2.2 Wavelet Representation of the FLM
Before moving on to our WB model we first review the WBFLM proposed by Zhao et al. (2012). In the this model, both the coefficient function, and the functional predictor from (1.1) are expressed in terms of an appropriate wavelet basis. can be expressed as in (2.1) and can be expressed as
where the scaling and wavelet coefficients are given respectively by
In practice, given equallyspaced observations of , the corresponding wavelet and scaling coefficients can be calculated using the DWT. These coefficients can be be put into an vector denoted by (here we include 1 as the first element of the vector which will correspond to the intercept in the regression model) having the form
(2.2) 
where and .
Because of the orthonormality of the wavelet basis, (1.1) can be simply written as
since all of the crossproduct terms comprising the product of the waveletdomain representations of and integrate to zero (Zhao et al., 2012), or, in matrix notation:
(2.3) 
where , is an matrix with th row , is an vector containing the intercept followed by the coefficients arranged in the same order as the vector , and .
Thus we see that once the functions and from (1.1) have been represented in the wavelet domain, the scaling and wavelet coefficients corresponding to , namely the elements of , become the predictors in the transformed space. In the wavelet representation of (1.1) there are parameters to estimate (including the intercept).
2.3 Specification of the WB Functional Mixture Regression Model
If the pairs of functional predictors and scalar responses come from a heterogeneous population, where the subpopulations (or components) are determined by distinct associations between the predictors and the response, then there is a unique coefficient function, , corresponding to each subpopulation. Since, as noted above, the predictors are typically discretely sampled at points, the model we consider is
Thus, the coefficient functions of interest are given by
and our goal is to find estimates for the ’s and the ’s.
In this setting, the model of interest is similar to that seen in classical finite mixture regression. We have that and
(2.4) 
where is the componentspecific coefficient vector for component , with the same form as in the model given by (2.3), is the corresponding componentspecific error variance, and is the probability that observation belongs to component . Let be the vector of free parameters to be estimated from (2.4), where is the space of vectors of the form such that for , , and .
In practice, may be densely sampled and so we may have . In this case, maximum likelihood estimation will provide inaccurate and unstable estimates for each and consequently poor estimates for each . Since wavelets allow for sparse representation of each , we may assume that most elements of are negligible and thus we consider a lassotype procedure for estimating the componentspecific vectors of wavelet and scaling coefficient values.
Städler et al. (2010) proposed an penalized mixture regression procedure for model fitting with general highdimensional predictors. We make use of this procedure here. We begin by first reparameterizing model (2.4) using the following:
Based on this new parameterization, model (2.4) can be written as:
(2.5) 
There is a onetoone mapping from in (2.4) to a new parameter vector
The corresponding loglikelihood for model (2.5) is
(2.6) 
To estimate the parameter vector in model (2.5), we propose to use that minimizes
(2.7) 
where is the norm of the vector . Note that the penalty on each wavelet and scaling component coefficient vector is proportional to the mixing probability . Including the mixing proportion in this manner corresponds to the common practice of relating the amount of penalty to the sample size, where, in the context of mixture regression, Khalili and Chen (2007) note that the “virtual” sample size from the th component is proportional to . Further discussion of the tuning parameters is given in Section 4.
Estimation of and rather than the direct estimation of and is considered primarily for two reasons. The reparametrization, along with a lassotype penalty allows for penalization of both the coefficient vectors of interest and the error variances within each component (Städler et al., 2010) while maintaining convexity of the optimization problem to be solved.
2.4 The Adaptive Model
We also consider an adaptive version of the estimator, , above. Zou (2006) proposed the adaptive lasso which allows for differing weights, adaptively chosen, to be assigned to the coefficients in the penalty. Among other benefits, the use of such weights can serve to provide better performance with respect to variable selection in high dimensional settings. In the twostage adaptive lasso procedure, one first finds initial estimates for the coefficients of interest and then uses these estimates (or a transformation of them) as weights for the coefficients in the penalty of the lasso procedure.
Our adaptive estimator is denoted by and minimizes the following criterion which involves a reweighted norm penalty term:
where is the th element of the vector and where is the th element of the WB functional mixture regression estimate of component vector . Since it is possible that some the values can be zero, which would cause the corresponding values to be infinite, we add a small constant (0.001) to each estimate in the fitting algorithm. Note that we follow Städler et al. (2010) and use estimates from the WB functional mixture regression in our weights, but other weighting strategies may also be considered.
3 Fitting WB Functional Mixture Models
Fitting of WB functional mixture regression models is carried out in three main steps: (1) Use the DWT to obtain the wavelet and scaling coefficients corresponding to the functional predictors, (2) use an EMtype algorithm for computing parameter estimates in the wavelet domain, and (3) use the IDWT to obtain estimates of the componentspecific coefficient functions in the original domain from the corresponding wavelet and scaling coefficient estimates. Each step is explained in detail below.
Step 1. Use the DWT to decompose the functional predictors and obtain the corresponding wavelet and scaling coefficients for each predictor. Here we must choose the wavelet family (e.g., Daubechies’ least asymmetric wavelets), number of vanishing moments, lowest level of decomposition (), and method for handling the boundaries (e.g., symmetric boundary handling).
The empirical wavelet and scaling coefficients for each predictor curve can be arranged into vectors, denoted , , which have the same structure as (2.2). We then form , an matrix with th row .
Step 2. We carry out an EMtype algorithm for our setting in a manner similar to that described in Städler et al. (2010). Consider the unobserved random indicator variable which designates component membership:
Then the expected scaled complete negative loglikelihood and penalized negative loglikelihood are given by
and
respectively.
In the Estep of the fitting procedure, we replace each unobserved group membership indicator, , with its expected value
where corresponds to the current vector of values for the parameters at EMiteration . Hence we can compute .
The Mstep of the fitting procedure is carried out in two stages. First, we fix each component coefficient vector, , at its current value of and improve
with respect to according to the procedure described in Städler et al. (2010). This yields an updated estimate of the vector, . In the second stage of the Mstep, we improve with respect to and . At this stage, we note that the optimization problem decouples into distinct convex optimization problems where we seek to minimize each of
with respect to and . To solve this set of optimization problems, we implement a coordinate descent algorithm that updates one coordinate at a time while holding the other coordinates fixed at their current values. The update for is given by (Städler et al., 2010)
where and is a matrix with rows , . Here refers to the vector inner product and is the Euclidean norm. Once the update for is computed, we calculate the update for the unpenalized componentspecific intercept using
where is the th element of the vector .
The coordinatewise updates for the remaining coefficients in each vector are computed as
where with being the th column of and where we define by
for . The E and Msteps are iterated until some convergence criteria are satisfied which ensure that the relative improvement in and the relative change in the parameter vector are small. Specifically, the EM procedure stops when
where refers to the th element of the parameter vector and we set .
Step 3. Use the IDWT to obtain estimates from the estimates respectively.
The EM procedure discussed in Step 2 above requires that we provide initial values for the parameters being estimated. We use the following scheme for obtaining these initial values. We first assign a weight to each observation corresponding to each of the distinct components. To do this, we randomly assign to each observation a class, , from the set . For observation and its randomly selected class we assign and for each of the other classes we assign , . We then normalize the vector of values, to sum to 1. Note that this process can be thought of as an initialization of the Estep. This is followed by updating all of the coordinates involved in the optimizations in the Mstep with initial values of , , and , , .
To speed up the EM procedure, we restrict ourselves to updating only the nonzero coordinates (active set elements) for 10 out of every 11 iterations of Step 2. This type of active set algorithm is used in Meier et al. (2008); Friedman et al. (2010); and Städler et al. (2010). After 10 iterations on the active set, we expand to consider all coordinates, both the active and nonactive, for updating in the 11th iteration. We obtain a possibly new active set and continue in this manner until the convergence criteria are satisfied.
4 Tuning Parameters and Their Selection
In using WBFMR we need to specify a number of tuning parameters, namely, the number of components, , the lowest level of decomposition, , and the penalty parameter, . The choice of the values for each of these tuning parameters may be based on prior information, otherwise datadriven methods may be employed in their selection. Below, we discuss each tuning parameter and consider two possible datadriven methods for their selection.
If the number of components is known a priori or exploratory data analysis suggests a particular number of components, then can be specified outright. However, we often employ the mixture modeling approach when the number of components is unknown or knowledge of component membership is unavailable.
The value of corresponds to the lowest level of decomposition and can range from 0 to . Since the predictors are sampled at points, the DWT provides a decomposition that uses a total of wavelet and scaling functions. Among this set of basis functions, will be scaling functions and will be wavelet functions. Hence, setting close or equal to 0 results in using fewer scaling functions to represent largescale features and more wavelet functions to represent local details of the function of interest. Conversely, setting close or equal to results in using more scaling functions and fewer wavelet functions.
The value of directly determines the role that the penalty function will have in both estimating and selecting variables in the model. Large values of force elements of the estimated component coefficient vectors to zero while small values result in many nonzero estimates.
We will employ two methods for tuning parameter selection. First we consider selecting the parameters that minimize the crossvalidated value
(4.1) 
Here, denotes the loglikelihood from (2.6) and the estimate depends on the values of the tuning parameters as indexed by the subscripts. We will refer to (4.1) as the “predictive loss”. We also consider selecting the parameters that minimize a modified BIC criterion. We use the modified BIC measure, proposed by Pan and Shen (2007), which is given by
(4.2) 
where is the effective number of parameters with being the number of coefficients estimated to be zero in all of the components. We can compute this value over a grid of candidate values for all or a subset of the turning parameters.
The crossvalidation procedure generally puts more emphasis on predictive ability and chooses a model that performs well in this regard. On the other hand, BIC focuses more on finding the “true” model and often chooses a simpler one. Compared to BIC, crossvalidation is computationally demanding and can be prohibitive for large and/or highdimensional data.
5 Simulations and Application
We present simulation results that demonstrate various aspects of the WBFMR and AWBFMR procedures and that draw comparisons to a functional principal componentsbased (FPC) method similar to that proposed by Yao et al. (2011). For each simulation discussed below, we generated observations consisting of a discretely sampled onedimensional functional predictor signal, , and a scalar response, whose association with depends on some known group membership.
Each functional predictor is a Brownian bridge stochastic process for with an expected value of 0, covariance given by for , and with . We consider various sampling densities for the functional predictors. Specifically, we consider data sets where the functional predictors are sampled at 64, 128, 256, or 512 equallyspaced points. A sample of three of these predictors are given in the left panel of Figure 1.
The scalar outcomes corresponding to each functional predictor were generated using two distinct settings for the componentspecific coefficient functions. The first pair of componentspecific coefficient functions are given by and . The second pair of componentspecific coefficient functions are given by and where . The middle panel of Figure 1 shows and which we will refer to as the “smooth” functions while the right panel shows and which we will refer to as the “bumpy” functions.
In addition to considering different componentspecific coefficient settings, we also consider different signaltonoise ratio settings. Equal proportions of observations were generated in each component with and the discrete approximation to takes on the desired value in a given setting. We consider settings with values of 0.9, 0.7, and 0.5 corresponding to “high”, “medium”, and “low” signaltonoise ratios respectively.
We employ Daubechies’ least asymmetric wavelets with eight vanishing moments in all simulations. The WaveThresh
package in R
(Nason, 1998) is used to perform the DWT and IDWT with the periodic boundary handling option.
5.1 Simulation 1: Comparison of FMR Methods
In the first set of simulations we compare the waveletbased (WB), adaptive waveletbased (AWB), and functional principal componentsbased (FPC) mixture regression methods in various combinations of the settings mentioned above. In all, we consider 24 different settings: two types of component coefficient functions (smooth or bumpy), three possible values (0.9, 0.7, or 0.5), and four possible sampling densities ( = 64, 128, 256, or 512).
For a given simulation run, we generate a training set, a validation set, and a test set all from the same setting (i.e., type of component coefficient functions, value, sampling density, and equal proportion of observations from each component). Each set is made up of 100 observation pairs consisting of a functional predictor and its corresponding scalar response. The training set is used to fit a model for each combination of the tuning parameters. The validation set is then used to select the combination of tuning parameters that minimizes (4.1) among all combinations of tuning parameters. Finally, the model estimated from the validation set is applied to the test set and the corresponding predictive loss is computed. We repeat this procedure 100 times for each setting.
In this first set of simulations, we treat the number of components as known, i.e., . For the waveletbased methods, we fix the lowest level of decomposition to in the smooth setting and to in the bumpy setting. In extensive prior simulations (not shown here), these decomposition levels tended to consistently minimize the predictive loss for each of the settings that we consider here. Additional support for these choices is provided by the results in Simulation 2 where we note that the plots in Figure 7 show little difference between the distribution of the losses when is set to the values that we selected and the distributions when is allowed to be chosen by either crossvalidation or BIC. To choose an optimal value for in a given setting, we first fit the model based on the training data for each in a grid of 100 values then chose the value that minimizes (4.1) in the corresponding validation set. The predictive loss is then obtained by using the model fit on the training set to compute (4.1) from the corresponding test data. For the adaptive waveletbased method, we use the estimated componentspecific wavelet and scaling coefficients from the corresponding waveletbased model as the initial estimates.
For the functional principal componentsbased procedure, based on the procedure proposed in Yao et al. (2011), the tuning parameters consisted of the number of order four Bspline basis functions used in representing the predictor signals and the number of principal components to serve as the predictors in the FMR model. Following Yao et al. (2011), we choose to use the minimum number of principal components that account for at least 90% of the variation in the predictor signals. The optimal set of tuning parameters was selected by first fitting a model for each combination of number of Bspline basis functions and number of principal components using the training data and then picking the pair that minimized (4.1) in the corresponding validation set. The fitted model was then applied to the corresponding test data and the predictive loss was obtained. We use the FlexMix
packge (Leisch, 2004) in R
to fit the functional principal componentsbased models.
We first consider how the three methods compare with respect to predictive loss based on the test sets. The boxplots in Figure 2 show the predictive loss in the test set for the 100 simulation runs for each of the three methods at the various settings. Lower loss values are preferred. In the smooth setting (top row), we note that the waveletbased and adaptive waveletbased methods perform comparably to the functional principal componentsbased method while in the bumpy setting (bottom row), the waveletbased methods appear to do better, especially for higher values of .
Estimation performance is illustrated in Figures 46. The solid and dashed thick curves correspond to the pointwise mean estimated component coefficient functions over the 100 simulation runs at the specified setting. The solid and dashed thin curves correspond to the true component coefficient functions used to generate the scalar responses. Figures 4 and 4 show the average estimation performance of the three methods in the smooth setting for = 0.9 and 0.7 respectively. (Performance for is not shown but is similar to that of ). In all settings for which the true component coefficient functions are smooth, we note that the functional principal componentsbased method appears to do best while the waveletbased methods perform similarly well when the functional predictors are densely sampled. Substantial gains in estimation performance by the waveletbased methods are evident in Figures 6 and 6 which show the average estimation performance of the three methods in the bumpy setting for = 0.9 and 0.7 respectively. (Again, performance for is not shown but is similar to that of ). We note that the waveletbased methods do very well in capturing the local features of the component coefficient functions and in estimating regions where there is no association between the functional predictors and the response while the functional principal componentsbased method struggles with both of these tasks.
It is interesting to note that the waveletbased and adaptive waveletbased methods perform nearly identically in the simulations discussed above. Comparison (not shown here) of the wavelet and scaling coefficient estimates given by the waveletbased and corresponding adaptive waveletbased methods shows that the adaptive version is performing additional variable selection and producing different estimates from those given by the nonadaptive procedure, but these changes do not yield substantial gains in reducing either predictive loss or estimation error.
5.2 Simulation 2: Tuning Parameter Selection Methods
In the second set of simulations, we investigate selection methods for the tuning parameters in the waveletbased model. We compare selection based on minimizing the 5fold crossvalidated loglikelihood loss to that based on minimizing the modified BIC criteria given in (4.2). We consider three different scenarios for tuning parameter selection:
 [leftmargin = 2.8cm]

Set and = 0 (smooth setting) or 5 (bumpy setting); select .

Set = 0 (smooth setting) or 5 (bumpy setting); select and .

Set ; select and .
We restrict ourselves to a subset of four of the 24 settings from the first group of simulations discussed above. Specifically, we compare the three tuning parameter selection scenarios in the smooth and bumpy settings when the sampling density of the functional predictors is either 128 or 256. In all four settings we have . For each of 100 simulation runs at each setting, the training set was used to determine the optimal tuning parameters that either minimized the 5fold crossvalidated predictive loglikelihood loss or minimized the modified BIC criteria. The corresponding test set was used to estimate the test loss in each scenario for both selection methods.
The boxplots showing these loglikelihood loss values for the test sets are provided in Figure 7. The three tuning parameter selection scenarios appear to be comparable with respect to predictive loglikelihood loss across the four settings.
In Scenario 2, we allowed the data to select the number of components, . Table 1 shows the proportions of simulation runs at each of the four settings for which the number of components was chosen to be 1, 2, or 3. The table suggests that, relative to using the modified BIC for selecting the number of components, 5fold cross validation has greater tendency to overfit by estimating more components than truly exist.
Smooth  CV  0.00  0.62  0.38 

BIC  0.00  1.00  0.00  
Smooth  CV  0.00  0.59  0.41 
BIC  0.00  1.00  0.00  
Bumpy  CV  0.04  0.76  0.20 
BIC  0.05  0.94  0.01  
Bumpy  CV  0.07  0.72  0.21 
BIC  0.05  0.94  0.01 
5.3 Application to DTI Data for Subjects with Multiple Sclerosis
We now analyze data from a diffusion tensor imaging (DTI) study, discussed in Goldsmith et al. (2012), using our waveletbased functional mixture regression approach. The data are from a longitudinal study investigating the cerebral white matter tracts of subjects with multiple sclerosis (MS) recruited from an outpatient neurology clinic and healthy controls who were recruited from the community. Here we focus on the baseline observations for the 100 MS subjects. In particular, we are interested in the relationship between the fractional anisotropy profile (FAP) from the corpus callosum (functional predictor) and the Paced Auditory Serial Addition Test (PASAT) score (scalar response).
The PASAT is an assessment tool that measures a subject’s cognitive ability with respect to auditory information processing speed and flexibility and also provides information on calculation ability (Rosti et al., 2006). The PASAT score is the number of correct answers out of 60 questions and thus ranges from 0 to 60. Lower scores are generally taken to indicate some level of dysfunction. The functional predictor of interest is the FAP from the corpus callosum which is derived from DTI, a magnetic resonance imaging modality that is commonly used to track the diffusion of water in biological tissue. The FAP is a continuous summary of water diffusivity that is parametrized by the arc length along a curve. The tract profiles are estimated via an automated tractprobabilitymapping scheme described in Reich et al. (2010). In the data set, the FAP predictors are recorded at 93 locations along the corpus callosum. In our analysis, we linearly interpolate the FAP curves at 128 equally spaced points before projecting them onto a wavelet basis. We used data from 99 of the 100 MS subjects since one subject had missing FAP values at several locations along the tract. Figure 8 shows the FAPs for all 99 MS subjects that we considered as well as those for three subjects with the lowest, median, and highest PASAT scores.
We were interested in conducting an analysis that inspects whether the regression relationship between corpus callosum FAPs and PASAT scores varies due to some unknown mechanism. In the top plot of Figure 8, we note that there is no obvious grouping in the FAP curves.
We apply our WB functional mixture regression approach in which we used the BIC from (4.2) to select the optimal tuning parameters. This approach suggests that there are two distinct groups with different coefficient functions describing the association between corpus callosum FAP and PASAT score. Figure 10 shows the estimated coefficient functions, and , for each of the two groups. For illustration, Figure 10 also shows the FAPs that belong to the groups associated with those functions. To determine which group a subject’s FAP belongs to, we use the estimated group membership indicators from the last iteration of the EM algorithm. The indicator with the the highest value (i.e., for subject ) was taken to correspond to the group from which the observation came. Using this assignment method, there are 52 subjects belonging to Group 1 and 47 belonging to Group 2.
From Figure 10 we note that the estimated coefficient function corresponding to Group 2 is identically zero at all locations along the profile suggesting no association between FAP and PASAT score among MS subjects belonging to this group whereas the estimated coefficient function for Group 1 suggests that higher fractional anisotropy values between profile locations of about 0.2 and 0.7 are associated with higher PASAT scores while higher values between profile locations of about 0.7 and 0.9 are associated with lower PASAT scores for those MS subjects belonging to Group 1.
Figure 10 shows the PASAT scores corresponding to the two groups. This plot illustrates a distinctive split between the two groups with respect to PASAT score. Overall the model may suggest that, among MS subjects with better cognitive function, there is no association between corpus callosum FAP and PASAT score whereas among those with worse cognitive function, fractional anisotropy values in the middle region of the tract can discriminate among the PASAT scores and that greater fractional anisotropy corresponds to higher scores.
With respect to the the properties that characterize the estimated components, the application of our method to the DTI data resulted in findings that are similar to those found in an application of the FPCbased FMR method used in Yao et al. (2011). Here we are referring to the analysis of the association between early reproductivity (functional predictor) and longevity (scalar response) in Mediterranean fruit flies. As in our application, Yao et al. (2011) found that their approach suggested there were two groups of flies corresponding to two different regression structures that characterized the association between early fertility and longevity. Furthermore, upon examining the distribution of the response within each group, they found that one group tended to consist of flies with greater longevity while the other group consisted of flies with shorter longevity; similar to how the estimated groups in our DTI example show a distinction by higher and lower PASAT score.
Finally, we compare the chosen waveletbased function mixture regression model to the wavelet based FLM (selected to minimize BIC) with respect to leaveoneout crossvalidated relative prediction errors where is the predicted PASAT score for subject from a model fit on data with subject removed. To determine which estimated coefficient function to use to obtain the predicted PASAT score for subject , we use the following ad hoc method similar to that used in Yao et al. (2011): if the observed PASAT score is less than 50 then we use the coefficient function that is not identically zero at each profile location and if is 50 or larger then we use the zero function. For our model with 2 groups, the CVRPE is 0.0315 and 0.0723 for the wavelet based FLM.
6 Discussion
In this article we present a general waveletbased approach to functional mixture regression which is appropriate to use when modeling the association between a continuous scalar response and a functional predictor where the association is not homogeneous across the population. We provide a fitting algorithm and demonstrate some properties of the corresponding estimators using simulations. When compared with a functional principal componentsbased approach to functional mixture regression, evidence suggests that our method performs better with respect to prediction and estimation accuracy when the component coefficient functions defining the association between the predictors and responses possess relatively small scale features.
Zhao et al. (2012) note that there are many factors that may be important to the performance of a waveletbased approach like the one we present here. For one, selection of a particular wavelet basis for the DWT has an impact on the sparsity of the representation of the functional predictor. As in Zhao et al. (2012), we chose to use a wavelet basis from the Daubechies family in our simulations. This family has good localizing properties in both the temporal and frequency domains. Of course, selection from other families is possible and it may be of interest to compare performance of our method when basis functions from different families are chosen.
Another factor that plays an important role in the performance of our method is tuning parameter selection. We looked at two criteria for selecting tuning parameters: minimizing the 5fold crossvalidated predictive loss and minimizing a modified BIC value. We found that both methods were generally comparable with the BIC method perhaps slightly underperforming with respect to predictive loss. However, simulations showed that BIC tended to select the correct number of components more often and 5fold crossvalidation tended to overfit.
We noted in Section 2.3 that it is common to relate the amount of penalty on the covariates to the sample size as is done in (2.7) by including in the penalty function. In their penalized mixture approach, Städler et al. (2010) suggest including an additional tuning parameter, , in the form of an exponent on the mixing probability . They consider using only the values of . They suggest using the value of 0 when the true mixing proportions are not very different from each other and using 1/2 or 1 when the mixing proportions are unbalanced. Our method corresponds to the case where . In other simulations (not presented here) we compared models that resulted from using different values of in both balanced and unbalanced settings but generally saw little difference with respect to predictive loss when using different values for .
The ability to conduct inference on the estimated component coefficient functions is of critical importance. Future work will focus on developing methods for constructing confidence bands for the the true component coefficient functions. Subsequent work will also focus on extending the model presented here to incorporate additional scalar covariates and implementing fitting algorithms that allow for 2 and 3dimensional images as predictors.
Acknowledgements
This work was partially supported by NIBIB grant 5 R01 EB009744.
Footnotes
 Adam Ciarleglio is a Postdoctoral Fellow, Department of Child and Adolescent Psychiatry, Division of Biostatistics, New York University, New York, NY (Email: Adam.Ciarleglio@nyumc.org). R. Todd Ogden is Professor, Department of Biostatistics, Columbia University, New York, NY (Email: to166@columbia.edu).
References
 Cai, T. and Hall, P. (2006). Prediction in functional linear regression. Annals of Statistics 34:2159–2179.
 Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica 13:571–591.
 Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis 92:24–41.
 Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33:1–22.
 Goldsmith, J., Crainiceanu, C., Caffo, B., and Reich, D. (2012). Longitudinal penalized functional regression for cognitive outcomes on neuronal tract measurements. Journal of the Royal Statistical Society, Series C 12:453–469.
 Grün, B. and Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis 51:5247–5252.
 James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society Series B 64:411–432.
 James, G. M. and Silverman, B. W. (2005). Functional adaptive model estimation. Journal of the American Statistical Association 100:565–576.
 Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association 102:1025–1038.
 Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11:1–18.
 Mallat, S. G. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11:674–693.
 McLachlan, G. and Peel, D. (2000). Finite Mixture Models. WileyInterscience, New York.
 Meier, L., van de Geer, S., and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70:53–71.
 Müller, H. and Yao, F. (2008). Functional additive models. Journal of the American Statistical Association 103:1534–1544.
 Nason, G. (1998). Wavethresh software. Department of Mathematics, University of Bristol, Bristol, UK.
 Nason, G. (2008). Wavelet Methods in Statistics with R. Springer, New York.
 Ogden, R. T. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston.
 Pan, W. and Shen, X. (2007). Penalized modelbased clustering with application to variable selection. Journal of Machine Learning Research 8:1145–1164.
 Preda, C. and Saporta, G. (2005). Clusterwise PLS regression on a stochastic process. Computational Statistics and Data Analysis 49:99–108.
 Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, Second Edition. Springer, New York.
 Reich, D., Ozturk, A., Calabresi, P., and Mori, S. (2010). Automated vs. conventional tractography in multiple sclerosis: Variability and correlation with disability. NeuroImage 49:3047–3056.
 Reiss, P. T., Huo, L., Ogden, R. T., Zhao, Y., and Kelly, C. (2013). Waveletdomain regression with image predictors, and a surprising (non)result in psychiatric neuroimaging. (preprint) .
 Reiss, P. T. and Ogden, R. T. (2007). Functional principal component regression and functional partial least squares. Journal of the American Statistical Association 102:984–996.
 Reiss, P. T. and Ogden, R. T. (2010). Functional generalized linear models with images as predictors. Biometrics 66:61–69.
 Rosti, E., Hämäläinen, P., Koivisto, K., and Hokkanen, L. (2006). The PASAT performance among patients with multiple sclerosis: Analyses of responding patterns using different scoring methods. Multiple Sclerosis 12:586–593.
 Schlattmann, P. (2009). Medical Applications of Finite Mixture Models. SpringerVerlag, Berlin.
 Spaeth, H. (1979). Clusterwise linear regression. Computing 22:367–373.
 Städler, N., Bühlmann, P., and van de Geer, S. (2010). penalization for mixture regression models. Test 19:209–256.
 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58:267–288.
 Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley, New York.
 Yao, F., Fu, Y., and Lee, T. (2011). Functional mixture regression. Biostatistics 12:341–353.
 Zhao, Y., Ogden, R. T., and Reiss, P. T. (2012). Waveletbased lasso in functional linear regression. Journal of Computational and Graphical Statistics 21:600–617.
 Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101:1418–1429.