Abstract
We investigate the issue of bandwidth estimation in a nonparametric functional regression model with functionvalued, continuous realvalued and discretevalued regressors under the framework of unknown error density. Extending from the recent work of Shang (2013, Computational Statistics & Data Analysis), we approximate the unknown error density by a kernel density estimator of residuals, where the regression function is estimated by the functional NadarayaWatson estimator that admits mixed types of regressors. We derive a kernel likelihood and posterior density for the bandwidth parameters under the kernelform error density, and put forward a Bayesian bandwidth estimation approach that can simultaneously estimate the bandwidths. Simulation studies demonstrated the estimation accuracy of the regression function and error density for the proposed Bayesian approach. Illustrated by a spectroscopy data set in the food quality control, we applied the proposed Bayesian approach to select the optimal bandwidths in a nonparametric functional regression model with mixed types of regressors.
Keywords: functional NadarayaWatson estimator; kernel density estimation; Markov chain Monte Carlo; mixture error density; spectroscopy
MSC code: 62G08, 62F15
Bayesian bandwidth estimation for a nonparametric functional regression model with mixed types of regressors and unknown error density
Han Lin Shang
Australian National University
1 Introduction
Recent advances in computer recording and storing technology facilitate the presence of functional data sets, motivated many researchers to consider various functional regression models for estimating the relationship between predictor and response variables, where at least one variable is functional in nature. The first functional formulation of a linear model dates back to a discussion by Hastie and Mallows (1993), and it is later extended in detail by Ramsay and Silverman (2005). Since then, it has been further extended or modified to take into account possible nonlinear relationship between predictor and response variables. Some of the extended regression models include the functional logit regression model (Aguilera et al., 2008), functional polynomial regression model (Yao and Müller, 2010), functional additive regression model (FebreroBande and GonzálezManteiga, 2013), and nonparametric functional regression model (Ferraty and Vieu, 2006), semifunctional partial linear model (AneirosPérez and Vieu, 2006; AneirosPerez and Vieu, 2008), to name only a few. Due to the fast development in functional regression models, it has received an increasing popularity in various fields of application, such as agespecific mortality and fertility forecasting in demography (Hyndman and Shang, 2009), analysis of spectroscopy data in chemometrics (Ferraty and Vieu, 2002), calculations of conditional valueatrisk and expected shortfall in econometrics (Ferraty and QuinteladelRío, 2014), earthquake modelling (QuinteladelRío et al., 2011), and ozone level prediction (QuinteladelRío and FranciscoFernández, 2011).
Despite the fast development in functional regression models for finding the relationship between predictor and response variables, the type of predictor variable is often limited to a functionvalued variable and error density is often assumed to be Gaussian. In the field of functional regression models, error density estimation remains largely unexplored. However, the estimation of error density is important to understand the residual behaviour and to assess the adequacy of error distribution assumption (see for example, Akritas and Van Keilegom, 2001; Cheng and Sun, 2008); the estimation of error density is useful to test the symmetry of the residual distribution (see for example, Ahmad and Li, 1997; Neumeyer and Dette, 2007); the estimation of error density is important to statistical inference, prediction and model validation (see for example, Efromovich, 2005; Muhsal and Neumeyer, 2010); and the estimation of error density is useful for the estimation of the density of the response variable (see for example, Escanciano and JachoChávez, 2012). In the realm of financial asset return, an important use of the estimated error density is to estimate valueatrisk for holding an asset. In such a model, any wrong specification of the error density may produce an inaccurate estimate of valueatrisk and make the asset holder unable to control risk. Therefore, being able to estimate the error density is as important as being able to estimate the regression function.
This motivates the investigation of a kernelform error density for estimating unknown error density in a nonparametric functional regression model, where the response variable is scalar but the predictor variables can be functionvalued, continuous realvalued and discretevalued. The contribution of this paper is two fold: First, to deal with the mixed types of regressors, we consider three types of kernel functions to model the explanatory power of the functionvalued, continuous realvalued and discretevalued predictors on the scalar realvalued response. Second, to accurately estimate the error density, we put forward the kernelform estimator which depends on three parameters: 1) the type of semimetric used to measure distances among curves, such as a semimetric based on secondorder derivative or a semimetric based on functional principal components; 2) residuals fitted through the functional NadarayaWatson (NW) estimator of the regression function; 3) and the bandwidth of residuals. Cheng (2002, 2004) studied weak and strong uniform consistency of such an error density estimator, while Samb (2011) established the optimal convergence rate of the kernelform error density estimator in a multivariate framework. Recently, Shang (2013) proposed a novel Bayesian bandwidth estimation procedure to simultaneously estimate the bandwidths in the functional NW estimator of the regression function and the kernelform error density in a nonparametric functional regression function with only the functional predictor. In this paper, we extend this Bayesian approach to the same nonparametric functional regression model with mixed types of regressors.
2 Bayesian bandwidth estimation
We present the basics of this method and refer readers to the textbooks by Ferraty and Vieu (2006) on nonparametric functional regression and Li and Racine (2007) on nonparametric multivariate regression with mixed types of regressors, respectively. The essential idea of functional NW kernel smoothing is to allow flexible functional estimation of the unknown regression function. For a scalar realvalued response , we consider the following nonparametric functional regression model,
(1) 
where is a smooth function from a squareintegrable function space to the real line, with being an infinitedimensional functional variable bounded within a compact interval, being dimensional continuous realvalued variables and being dimensional discretevalued variables, are independent and identically distributed (iid) errors with unknown density denoted as . The discretevalued predictors can be either ordered or unordered, which in turn affects the selection of suitable kernels (see equations (3) and (4)). The regression function is denoted by , and we assume that . We investigate the problem of nonparametric estimation of the probability density function of the error term, which in turn might provide a good estimate of regression function. As noted by Samb (2011), the difficulty of estimating error density is the fact that the regression error term is not observed and thus must be estimated.
The flexibility of model (1) stems from the fact that the unknown regression function does not need to have a specific functional form. With some smoothness properties, the functional form of is often estimated in a datadriven manner. There are a growing amount of literature on the development of nonparametric functional estimators, such as functional NW estimator (Ferraty and Vieu, 2006), functional local linear estimator (Berlinet et al., 2011), functional nearest neighbour estimator (Burba et al., 2009) and distancebased local linear estimator (Boj et al., 2010). Throughout the paper, we demonstrate the proposed method using the functional NW estimator because of its simplicity and mathematical elegance.
2.1 Functional NadarayaWatson (NW) estimator
The functional NW estimator of the conditional mean can be defined as
where is a generalised product kernel that admits functionvalued, continuous realvalued and discretevalued predictors.
The kernel function for a functionvalued predictor can be expressed as
where is a symmetric real valued function defined on , is a semimetric used to measure distances among curves, represents a bandwidth associated with an infinitedimensional functionvalued predictor, and is the secondorder Gaussian kernel function. For simplicity, we consider just one functionvalued predictor in this paper, although the methodology can be extended to multiple functionvalued predictors (see for example, Goldsmith et al., 2011).
In the nonparametric functional regression model, faster rates of convergence can be obtained (and therefore nicer theoretical and practical results), as long as one uses a semimetric that increases the concentration of the explanatory variable , while reflecting as much as possible the effect of on the response variable (QuinteladelRío et al., 2011). For a nonsmoothed functional data, a semimetric based on functional principal component analysis should be considered; for a smoothed functional data, a semimetric based on derivative should be considered. Because the data we considered are smooth, a semimetrics based on the order derivative of is used to measure the distance between two curves, that is given by
(2) 
where being the order derivative of (see for example Goutis, 1998; Ferraty and Vieu, 2009). We refer readers to the work of Ferraty and Vieu (2006, pp.3233) for the calculation of (2) in detail.
The product kernel for the continuous realvalued predictors can be expressed as
where is a row vector of bandwidths associated with dimensional continuous realvalued predictors, is the predictor of , is the grid point of , and is the secondorder Gaussian kernel function.
For binary variables, the choice of discrete kernel we considered is Aitchison and Aitken’s (1976) kernel, given by
(3) 
where represents a row vector of bandwidths associated with dimensional discretevalued predictors, is the predictor of and represents possible discrete outcomes. Notice that when , the above kernel function becomes an indicator function which takes values 0 and 1.
So far, we consider discrete variables that do not have a natural ordering, an example of which includes gender. However, there are some cases where a discrete variable has a natural ordering, an example of which includes credit rating. For categorical variables, the choice of discrete kernel we considered is Li and Racine’s (2007) kernel, which incorporates the natural ordering. It is expressed as
(4) 
where represents a row vector of bandwidths associated with dimensional discretevalued predictors.
Having determined the types of kernel function, the unknown parameters in the functional NW estimator are the bandwidths (also known as smoothing parameters in the field of nonparametric smoothing). As it is always the case in nonparametric estimation, the role of smoothing parameters is prominent. For example, the rates of convergence of the nonparametric functional estimator can be divided into two parts: a squared bias component which increases with the bandwidths, and a variance component which decreases with the bandwidths. Therefore, there is a need to select optimal bandwidths in order to balance the tradeoff between squared bias and variance (see Heidenreich et al., 2013, for a recent review on the bandwidth selection in the multivariate density estimation and nonparametric regression).
In the field of nonparametric functional regression, the bandwidth associated with a functionvalued regressor is commonly estimated by the socalled functional cross validation (see for example, Ferraty and Vieu, 2002, 2006; Rachdi and Vieu, 2007; Burba et al., 2009; AneirosPerez and Vieu, 2011). The optimal bandwidths of the kernel estimators are obtained by minimising
where
and is the leaveoneout kernel estimator. Functional cross validation is designed to assess the predictive performance of a model by an average of certain measures for the ability of predicting a subset of curves by a model fit, after deleting just these curves from the functional data set. Under the criteria of averaged squared error, integrated squared error and mean integrated squared error, Rachdi and Vieu (2007) presented asymptotic optimality of the functional cross validation. In addition, functional cross validation has the appealing feature that no estimation of the error variance is required. However, since residuals affect the estimation accuracy of regression function, functional cross validation may select suboptimal bandwidths, in turn might lead to inferior estimation accuracy of regression function for a small sample. In addition, as the number of parameters increases, functional cross validation may suffer from numerical instability. Instead, we present a Bayesian bandwidth estimation approach that simultaneously estimate the bandwidths in the regression function and error density, under the framework of mixed types of regressors and unknown error density.
2.2 Estimation of error density
The goal is to recover an unknown error density (on the real line) from a sample of independent random observations. Because errors are often unknown, they can be approximated by residuals that are obtained from the functional NW estimator. To avoid the selection of zero value for the bandwidth, can be approximated by the leaveoneout kernel density estimator expressed as
(5) 
where represents the residual, and represents the residual bandwidth.
Given a set of parameters , the kernel likelihood of can be approximated by
(6) 
where and are the leaveoneout nonparametric estimators of the regression function and error density function of the computed residuals, respectively.
We now discuss the issue of prior density for the bandwidths. Let , , and be the prior of squared bandwidths , , and . Since , and play the same role as a variance parameter in the Gaussian density, we assume that the priors of , and are inverse Gamma density, denoted as IG, IG, and IG, respectively. Therefore, the prior densities of , , and are
where are hyperparameters (see for example, Geweke, 2010). The selection of these hyperparameters is to assign more weights to a smaller value of squared bandwidth, in keeping consistent with the asymptotic results. The prior density with the chosen hyperparameter is shown in Figure 1, along with two other possibilities considered in the sensitivity analysis given later in Table 3.
For discrete variables, denote as the prior of the discrete kernel bandwidth for . We assume the prior of following a uniform distribution, given by
where . For binary variables, and ; for ordered categorical variables, and (see also Hayfield and Racine, 2008).
According to Bayes theorem, the posterior can be expressed as (up to a normalising constant):
(7) 
where is the same likelihood as in (6) but with squared bandwidths for those parameters that are estimated by the Gaussian kernel, and and are independent priors. Since we assume that there is no correlation between the regression function and error term in (1), the bandwidths of the regression function and error density are uncorrelated in (7).
Based on (7), we use the adaptive block randomwalk Metropolis algorithm of Garthwaite et al. (2010) to sample . Throughout the paper, the burnin period is the first 1,000 iterations, and the number of recorded iterations after the burnin period is iterations. To measure the mixing performance of the sample paths, we consider the simulation inefficiency factor (see also Kim et al., 1998; Shang, 2013). It can be interpreted as the sample mean from an sampler that draws iid observations from the posterior distribution.
In the kernel density estimation, it has been observed that the leaveoneout kernel estimator, such as (5), can be heavily affected by the presence of extreme residuals (see, for example, Bowman, 1984). This may cause by the use of a global bandwidth. To rectify this problem, we consider the local bandwidth method studied by Zhang and King (2011) and Shang (2013). The idea of localised bandwidths is to assign small bandwidths to the observations in the high density region, while large bandwidths to the observations in the low density region. The localised kernelform error density can be expressed by
where is the bandwidth assigned to , for , and the vector of parameters is now . By setting the prior of to be a uniform density on , the adaptive randomwalk Metropolis algorithm is used to sample these parameters.
3 Simulation study
The main goal of this section is to illustrate the proposed methodology through simulated data. One way to do that consists in comparing the true regression function with the estimated regression function, and comparing the true error density with the estimated error density. To measure the estimation accuracy between and , we use the mean average squared error (MASE), given by
(8) 
for , where represents the sample size, and represents one out of 100 replications considered.
To measure the discrepancy between and , we calculate mean integrated squared error (MISE) defined as
(9) 
for . For each replication, MISE can be approximated at grid points bounded between an interval, such as .
Building the simulated samples.
We briefly describe the construction of the simulated data. First of all, we build simulated discretised curves
(10) 
where are equispaced points, , , are independently drawn from a uniform distribution on . The functional form of (10) is taken from Ferraty et al. (2010b). Figure 2 presents the simulated curves for one replication.
Once the curves are defined, one simulates a functional regression model to compute the responses:

two regression functions are constructed as
where is the derivative of , is a realvalued continuous variable that is simulated from a standard normal distribution, is a realvalued continuous variable that is simulated from an exponential distribution with rate parameter 1, is a discretevalued variable that is drawn from a Bernoulli distribution, and with for .

generate from a trimodal distribution with functional form of , for example. As shown in the supplement, we have also considered claw error density (see also Marron and Wand, 1992, Table 1).

compute the corresponding response as in model 1.
Estimating the regression function.
We compute the discrepancy between and , for . To do that, we use the following MonteCarlo scheme:

build 100 replications .

compute the difference , where represents the regression function estimated by the functional NW estimator with generalised product kernel for the replication. The Gaussian kernel function is used for functionvalued and continuous realvalued predictors, while the Aitchison and Aitken’s (1976) kernel function is used for binary predictor and Li and Racine’s (2007) kernel is used for categorical predictor.

For each replication, we calculate the average squared error of the regression function. Averaged over 100 replications, we use the MASE to access the estimation accuracy of the regression function.
Table 1 presents the MASE for the estimated regression function in the nonparametric functional regression model with mixed types of regressors. For both models, the functional cross validation method produces larger MASE than the Bayesian approach. Between a global bandwidth and localised bandwidths, we found that the former performs better in model 2 when there are more regressors; but the latter performs better in model 1 when there are less regressors. As increases, the estimation accuracy improves and the difference between the two Bayesian methods becomes smaller. For comparison, we also considered the same regression model without discretevalued regressor. We found that the inclusion of discretevalued regressor reduces MASE in both cases, more so for model 2 than model 1.
Functional cross validation  Bayesian method  

Global bandwidth  Localised bandwidths  
\backslashbox[2mm]Type of regressor  50  150  250  50  150  250  50  150  250 
function, continuous realvalued, discretevalued  
model 1  2.196  1.512  1.307  2.115  1.201  1.038  2.084  1.194  1.031 
(0.723)  (0.246)  (0.167)  (0.742)  (0.213)  (0.148)  (0.710)  (0.195)  (0.143)  
model 2  2.201  1.614  1.341  1.812  1.372  1.213  3.035  2.374  2.211 
(2.165)  (0.271)  (0.211)  (0.568)  (0.229)  (0.145)  (1.309)  (0.523)  (0.338)  
function, continuous realvalued  
model 1  2.304  1.570  1.402  2.440  1.335  1.139  2.417  1.335  1.140 
(0.716)  (0.257)  (0.198)  (0.863)  (0.246)  (0.159)  (0.873)  (0.228)  (0.155)  
model 2  3.201  3.078  2.806  3.936  2.895  2.758  3.988  3.257  3.110 
(1.240)  (0.705)  (0.539)  (1.629)  (0.703)  (0.572)  (1.657)  (0.689)  (0.513) 
Estimating the error density
With a set of residuals, we apply a univariate kernel density estimator with a global bandwidth or local bandwidths. For , we compute the discrepancy in terms of integrated square error between and , and obtain the overall discrepancy by averaging over 100 replications of discrepancy.
Table 2 presents the MISE for the kernelform error density with bandwidths estimated by the two Bayesian methods. There is an advantage in using the localised bandwidths over the global bandwidth, for all sample sizes considered.
Bayesian method  

Global bandwidth  Localised bandwidths  
\backslashbox[2mm]Type of regressor  50  150  250  50  150  250 
function, continuous realvalued, discretevalued  
model 1  0.0896  0.0809  0.0599  0.0233  0.0088  0.0057 
(0.0513)  (0.0482)  (0.0231)  (0.0199)  (0.0099)  (0.0043)  
model 2  0.2839  0.1503  0.1217  0.0287  0.0204  0.0184 
(0.5101)  (0.1139)  (0.0585)  (0.0210)  (0.0148)  (0.0089)  
function, continuous realvalued  
model 1  0.0522  0.0432  0.0330  0.0545  0.0494  0.0414 
(0.0458)  (0.0279)  (0.0129)  (0.0467)  (0.0248)  (0.0146)  
model 2  0.0672  0.1286  0.1266  0.0619  0.0442  0.0424 
(0.0636)  (0.0253)  (0.0256)  (0.0694)  (0.0085)  (0.0051) 
Diagnostic check of Markov chains
As a demonstration with one replication, we plot the unthinned sample paths of all the parameters in model 1 on the left panel of Figure 3, and the ACFs of these sample paths on the right panel of Figure 3. These plots show that the sample paths are reasonably well mixed. Table 3 summarises the ergodic averages, 95% Bayesian credible intervals (CIs), total SE, batchmean SE, and SIF values.
Prior density: IG()  
Parameter  Mean  Bayesian CIs  SE  Batchmean SE  SIF 
0.3716  (0.1935, 0.4937)  0.0840  0.2630  9.81  
0.8119  (0.6511, 1.0322)  0.0985  0.2263  5.28  
0.7249  (0.6509, 0.8131)  0.0422  0.0951  5.07  
0.2001  (0.0988, 0.3283)  0.0638  0.1956  9.40  
Prior density: IG()  
0.3493  (0.1823, 0.4806)  0.0881  0.2173  6.08  
0.7540  (0.5995, 0.9285)  0.0794  0.1747  4.84  
0.7041  (0.6090, 0.7840)  0.0432  0.1093  6.39  
0.2200  (0.1516, 0.3011)  0.0392  0.1067  7.41  
Prior density: Cauchy  
0.3632  (0.2118, 0.4795)  0.0742  0.2305  9.65  
0.8393  (0.6650, 1.0830)  0.1059  0.2765  6.82  
0.7316  (0.6564, 0.8296)  0.0417  0.1003  5.79  
0.1925  (0.0542, 0.3590)  0.0875  0.2987  11.66 
By using the coda package (Plummer et al., 2006) in RR language (R Core Team, 2013), we further checked the convergence of Markov chain with Geweke’s (1992) convergence diagnostic test and Heidelberger and Welch’s (1983) convergence diagnostic test. Our Markov chains pass both tests for the 100 replications.
3.1 Regression models having irrelevant regressors
Sometimes, not all the regressors in are relevant (see Hall et al., 2007; Ma and Racine, 2013, for the case of nonparametric multivariate regression model). Without loss of generality, we assume that some of the continuous and discrete regressors are irrelevant and set up our simulation study using the previous two models. Suppose the true model contains only one functionvalued, one realvalued continuous and one discretevalued regressors. However, we observe one functionvalued, two realvalued continuous and two discretevalued regressors. As pointed out in Hall et al. (2007) and also shown in Table 4, the irrelevant discrete regressor is smoothed out when its bandwidth takes on its upper bound value, while the irrelevant continuous regressor is effectively smoothed out when its bandwidth exceeds just a few standard deviation of the data.
50  150  250  1000  

0.128 [0.059, 0.175]  0.329 [0.280, 0.386]  0.330 [0.231, 0.410]  0.275 [0.244, 0.305]  
0.895 [0.823, 0.990]  0.952 [0.860, 0.994]  0.987 [0.955, 0.998]  0.969 [0.916, 0.995]  
0.881 [0.608, 2.529]  0.742 [0.669, 0.830]  0.894 [0.766, 1.058]  0.768 [0.716, 0.822]  
1.965 [0.997, 2.625]  3.752 [1.977, 10.866]  2.334 [1.478, 5.852]  5.394 [3.145, 14.808]  
0.072 [0.068, 0.087]  0.067 [0.064, 0.074]  0.050 [0.047, 0.053]  0.044 [0.042, 0.045] 
The asymptotic results of irrelevant regressors in the nonparametric multivariate regression have been proven in Hall et al. (2007). We verified this phenomenon in the nonparametric functional regression with bandwidths selected by the Bayesian approach.
4 Application to food quality control
Our second dataset focuses on the prediction of the fat content of meat samples based on nearinfrared (NIR) absorbance spectra. These data were obtained from http://lib.stat.cmu.edu/datasets/tecator, and have been studied by Ferraty and Vieu (2006), AneirosPérez and Vieu (2006), among many others. Each food sample contains finely chopped pure meat with different percentages of the fat, protein and moisture contents. For each unit (among 215 pieces of finely chopped meat), we observe one spectrometric curve, denoted by , which corresponds to the absorbance measured at a grid of 100 wavelengths (i.e., . For each unit , we also observe its fat, protein and moisture contents , obtained by analytical chemical processing. As noted by Ferraty and Vieu (2003) and FebreroBande and de la Fuente (2012), the spectrometric curves can be split into two groups, based on if the fat content is below 20%. Graphical displays of original spectrometric curves and their first derivative are shown in Figure 6.
Given a set of spectrometric curves, a classification algorithm can be used to find its group member. Given a new spectrometric curve , we can allocate this new curve to its corresponding group member, denoted by . Also, we may observe new measurements of protein and moisture contents, denoted by and , which may help in the prediction of the fat content, denoted by .
In order to assess the outofsample point forecast accuracy of the nonparametric functional estimator, we split the original sample into two subsamples (see also Ferraty and Vieu, 2006, p.105). The first one is called learning sample, which contains the first 160 units . The second one is called testing sample, which contains the last 55 units . The learning sample allows us to build the regression model with the optimal bandwidths, where the learning sample is used. The testing sample allows us to evaluate the prediction accuracy, where the testing sample is used.
To measure the prediction accuracy, we consider the mean square forecast error (MSFE) and mean absolute forecast error (MAFE). These are expressed as: MSFE = and MAFE = . The corresponding values of MSFE and MAFE for the functional cross validation and the two Bayesian bandwidth estimation methods are shown in Table 5. As a result, there is an improvement in prediction accuracy for the functional estimator with localised bandwidths than a global bandwidth.
Method  MSFE  MAFE  Log marginal likelihood  Coverage probability 

Functional cross validation  
1.5107  0.8738  
(2.7580)  (0.8723)  
Bayesian method with global bandwidth  
IG(1, 0.05)  1.5803  0.8718  227.04  0.78 
(3.0387)  (1.2638)  
IG(5, 0.25)  1.6278  0.8862  233.57  0.76 
(3.1131)  (1.2823)  
Cauchy  1.5853  0.8700  225.91  0.78 
(3.0758)  (1.2660)  
Bayesian method with local bandwidth  
IG(1, 0.05)  1.5015  0.8425  228.88  0.91 
(2.9014)  (0.8980)  
IG(5, 0.25)  1.6979  0.8880  234.17  0.84 
( 3.4346)  (0.9624)  
Cauchy  1.5039  0.8475  225.39  0.93 
(2.9702)  (0.8945) 
The model evidence, as measured by the log marginal likelihood, is an important quantity in the comparison of statistical models under the Bayesian paradigm. It has received numerous attention in the Bayesian statistics literature, such as the Laplace’s method (Tierney and Kadane, 1986), harmonic mean estimator (Newton and Raftery, 1994), generic method based on MCMC output (Chib, 1995; Chib and Jeliazkov, 2001), nested sampling (Skilling, 2006), power posterior (Friel and Pettitt, 2008), to name only a few. Friel and Wyse (2012) provided a comparison of these methods, and found that the generic method based on MCMC output has good estimation accuracy, as well as fast computational speed. It is this method we consider here to compare the evidence from different prior distributions. From Table 5, we find that the three prior densities have similar marginal likelihoods, with the Cauchy prior density as the favour one (largest log marginal likelihood and best empirical coverage probability).
Because the original functional data are iid, we sample with replacement to obtain 100 replications of bootstrapped samples. While 160 pairs of observations within each replication are used for estimation, the remaining ones are used for forecast evaluation. Furthermore, we consider a large set of functional regression models collected in Goldsmith and Scheipl (2014) and Ferraty and Vieu (2011). These models include: (1) REMLbased functional linear model with a locally adaptive penalty (Cardot et al., 2003); (2) penalised functional regression (Goldsmith et al., 2011); (3) functional principal component regression on first few functional principal components (Reiss and Ogden, 2007); (4) linear model on first functional principal components, where optimal is estimated by 20fold bootstrap (Ramsay and Silverman, 2005); (5) REMLbased singleindex signal regression with locally adaptive penalty (Wood, 2011); (6) cross validation based singleindex signal regression (Eilers et al., 2009); (7) penalised partial least squares (Krämer et al., 2008); (8) LASSO penalised linear model on first few functional principal components (Friedman et al., 2010); (9) nonparametric functional regression with NadarayaWatson estimator for estimating conditional mean (Ferraty and Vieu, 2006); (10) nonparametric functional regression with NadarayaWatson estimator for estimating conditional median (Laksaci et al., 2011); (11) nonparametric functional regression with NadarayaWatson estimator for estimating conditional mode (Ferraty et al., 2005); (12) nonparametric functional regression with nearest neighbour estimator (Burba et al., 2009); (13) nonparametric functional regression with mostpredictive design points (Ferraty et al., 2010a); (14) nonparametric functional regression with mixed types of regressors.
In Figure (a)a, we investigated the forecast accuracy when the predictor is a set of original functional curves. The proposed nonparametric functional regression with mixed types of regressors gives the smallest root mean square forecast error (RMSFE) among all methods considered. In Figure (b)b, we investigated the forecast accuracy when the predictor is secondderivative of original functional curves. The nonparametric functional regression with mostpredictive design points performs the best in all, followed by the nonparametric functional regression with mixed types of regressors. Since the input variable affects the forecast accuracy of a method, it remains as a future research topic to investigate optimal transformation of the input variable in order to achieve maximum forecast accuracy.
With the Bayesian approach, we can also compute the prediction interval nonparametrically. To this end, we first compute the cumulative density function (cdf) of the error distribution, over a set of grid points within a range, say 10 and 10. Then, we take the inverse of the cdf and find two grid points that are closest to the 2.5% and 97.5% quantiles. The 95% prediction interval of the holdout samples is obtained by adding the two grid points to the point forecasts. For instance, the point forecasts of the fat content are shown in solid blue dots, while the 95% pointwise prediction intervals are shown as vertical red bars in Figure 10.
5 Conclusion and future research
We propose a Bayesian approach to estimate optimal bandwidths in a nonparametric functional regression model that admits mixed types of regressors with homoscedastic errors and unknown error density. Since a closed form expression for our bandwidth estimator does not exist, so establishing the mathematical properties of the bandwidth estimator has been difficult. This is true even in terms of the standard class of asymptotic statistics. However, we have developed an approximate solution to the bandwidth estimator through Markov chain Monte Carlo. As a byproduct, marginal likelihood can be used to determine the optimal choice of prior density for the bandwidths.
Through a simulation study, the Bayesian approach provides a way to simultaneously estimate the bandwidths in the functional NW estimator and kernelform error density. Illustrated by a spectroscopy data set, the Bayesian bandwidth estimation approach allows the nonparametric construction of prediction interval for measuring the prediction uncertainty. To the best of our knowledge, the proposed method makes the first attempt in modeling and forecasting the scalar realvalued response variable based on mixed types of regressors.
There are many ways in which the present paper can be extended, and we briefly mention five at this point:

Consider other functional regression estimators, such as functional local linear estimator of Benhenni et al. (2007) or nearest neighbour estimator of Burba et al. (2009). The functional local linear estimator can improve the estimation accuracy of the regression function by using a higherorder kernel function. The nearest neighbour estimator takes into account the local structure of the data and gives better forecasts when the functional data are heterogeneously concentrated.

Extend to nonparametric functional regression model that admits the mixed types of regressors with heterogeneous errors. The covariatedependent variance can be modelled by another kernel density estimator.

Extend to nonparametric functional regression model that admits the mixed types of regressors with autoregressive errors.

Extend to semifunctional partial linear model that admits the mixed types of regressors with homogenous, heterogeneous and autoregressive errors.

Apply the idea of marginal likelihood to select optimal semimetric in nonparametric functional regression models.
Footnotes
 Postal address: Research School of Finance, Actuarial Studies and Applied Statistics, Australian National University, Kingsley Street, Canberra, ACT 0200, Australia; Telephone: +61 2 6125 0535, Fax: +61 2 6125 0087; Email: hanlin.shang@anu.edu.au
References
 Aguilera, A., Escabias, M., and Valderrama, M. (2008). Discussion of different logistic models with functional data. Application to systemic lupus erythematosus. Computational Statistics & Data Analysis, 53(1):151–163.
 Ahmad, I. and Li, Q. (1997). Testing symmetry of an unknown density function by kernel method. Journal of Nonparametric Statistics, 7(3):279–293.
 Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method. Biometrika, 63(3):413–420.
 Akritas, M. G. and Van Keilegom, I. (2001). Nonparameteric estimation of the residual distribution. Scandinavian Journal of Statistics, 28(3):549–567.
 AneirosPérez, G. and Vieu, P. (2006). Semifunctional partial linear regression. Statistics & Probability Letters, 76(11):1102–1110.
 AneirosPerez, G. and Vieu, P. (2008). Nonparametric time series prediction: A semifunctional partial linear modeling. Journal of Multivariate Analysis, 99:834–857.
 AneirosPerez, G. and Vieu, P. (2011). Automatic estimation procedure in partial linear model with functional data. Statistical Papers, 52(4):751–771.
 Benhenni, K., Ferraty, F., Rachdi, M., and Vieu, P. (2007). Local smoothing regression with functional data. Computational Statistics, 22(3):353–369.
 Berlinet, A., Elamine, A., and Mas, A. (2011). Local linear regression for functional data. Annals of the Institute of Statistical Mathematics, 63(5):1047–1075.
 Boj, E., Delicado, P., and Fortiana, J. (2010). Distancebased local linear regression for functional predictors. Computational Statistics & Data Analysis, 54(2):429–437.
 Bowman, A. W. (1984). An alternative method of crossvalidation for the smoothing of density estimates. Biometrika, 71(2):353–360.
 Burba, F., Ferraty, F., and Vieu, P. (2009). nearest neighbour method in functional nonparametric regression. Journal of Nonparametric Statistics, 21(4):453–469.
 Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13:571–591.
 Cheng, F. (2002). Consistency of error density and distribution function estimators in nonparametric regression. Statistics & Probability Letters, 59(3):257–270.
 Cheng, F. (2004). Weak and strong uniform consistency of a kernel error density estimator in nonparametric regression. Journal of Statistical Planning and Inference, 119(1):95–107.
 Cheng, F. and Sun, S. (2008). A goodness of fit test of the errors in nonlinear autoregressive time series models. Statistics & Probability Letters, 78(1):50–59.
 Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321.
 Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the MetropolisHastings output. Journal of the American Statistical Association, 96(453):270–281.
 Efromovich, S. (2005). Estimation of the density of regression errors. The Annals of Statistics, 33(5):2194–2227.
 Eilers, P. H. C., Li, B., and Marx, B. D. (2009). Multivariate calibration with singleindex signal regression. Chemometrics and Intelligent Laboratory Systems, 96(2):196–202.
 Escanciano, J. and JachoChávez, D. (2012). uniformly consistent density estimation in nonparametric regression models. Journal of Econometrics, 167(2):305–316.
 FebreroBande, M. and de la Fuente, M. O. (2012). Statistical computing in functional data analysis: the R package fda.usc. Journal of Statistical Software, 51(4).
 FebreroBande, M. and GonzálezManteiga, W. (2013). Generalized additive models for functional data. Test, 22(2):278–292.
 Ferraty, F., Hall, P., and Vieu, P. (2010a). Mostpredictive design points for functional data predictors. Biometrika, 97(4):807–824.
 Ferraty, F., Laksaci, A., and Vieu, P. (2005). Functional time series prediction via conditional mode estimation. C. R. Acad. Sci. Paris, Ser. I, 340:389–392.
 Ferraty, F. and QuinteladelRío, A. (2014). Conditional VaR and expected shortfall: A new functional approach. Econometric Reviews, in press.
 Ferraty, F., Van Keilegom, I., and Vieu, P. (2010b). On the validity of the bootstrap in nonparametric functional regression. Scandinavian Journal of Statistics, 37(2):286–306.
 Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and application to spectrometric data. Computational Statistics, 17(4):545–564.
 Ferraty, F. and Vieu, P. (2003). Curves discrimination: A nonparametric functional approach. Computational Statistics & Data Analysis, 44(12):161–173.
 Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer, New York.
 Ferraty, F. and Vieu, P. (2009). Additive prediction and boosting for functional data. Computational Statistics & Data Analysis, 53(4):1400–1413.
 Ferraty, F. and Vieu, P. (2011). Richesse et complexité des données fonctionnelles. Revue MODULAD, 43:25–43.
 Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1).
 Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society. Series B, 70(3):589–607.
 Friel, N. and Wyse, J. (2012). Estimating the evidence–a review. Statistica Neerlandica, 66(3):288–308.
 Garthwaite, P. H., Fan, Y., and Sisson, S. A. (2010). Adaptive optimal scaling of MetropolisHastings algorithms using the RobbinsMonro process. Working paper, University of New South Wales, Sydney. http://arxiv.org/pdf/1006.3690v1.pdf.
 Geweke, J. (1992). Evaluating the accuracy of samplingbased approaches to the calculation of posterior moments. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics, pages 169–193. Oxford University Press, New York.
 Geweke, J. (2010). Complete and Incomplete Econometric Models. Princeton University Press, New Jersey.
 Goldsmith, J., Bobb, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2011). Penalized functional regression. Journal of Computational and Graphical Statistics, 20(4):830–851.
 Goldsmith, J. and Scheipl, F. (2014). Estimator selection and combination in scalaronfunction regression. Computational Statistics & Data Analysis, 70:362–372.
 Goutis, C. (1998). Secondderivative functional regression with applications to near infrared spectroscopy. Journal of the Royal Statistical Society: Series B, 60(1):103–114.
 Hall, P., Li, Q., and Racine, J. (2007). Nonparametric estimation of regression functions in the presence of irrelevant regressors. The Review of Economics and Statistics, 89(4):784–789.
 Hastie, T. and Mallows, C. (1993). A statistical view of some chemometrics regression tools (discussion). Technometrics, 35(2):140–143.
 Hayfield, T. and Racine, J. S. (2008). Nonparametric econometrics: the np package. Journal of Statistical Software, 27(5).
 Heidelberger, P. and Welch, P. D. (1983). Simulation run length control in the presence of an initial transient. Operations Research, 31(6):1109–1144.
 Heidenreich, N.B., Schindler, A., and Sperlich, S. (2013). Bandwidth selection for kernel density estimation: A review of fully automatic selectors. AStA Advances in Statistical Analysis, 97(4):403–433.
 Hyndman, R. J. and Shang, H. L. (2009). Forecasting functional time series (with discussions). Journal of the Korean Statistical Society, 38(3):199–211.
 Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies, 65(3):361–393.
 Krämer, N., Boulesteix, A.L., and Tutz, G. (2008). Penalized partial least squares with application to Bspline transformations and functional data. Chemometrics ad Intelligent Laboratory Systems, 94(1):60–69.
 Laksaci, A., Lemdani, M., and Saïd, E. O. (2011). Asymptotic results for an norm kernel estimator of the conditional quantile for functional dependent data with application to climatology. Sankhya: The Indian Journal of Statistics, 73A(1):125–141.
 Li, Q. and Racine, J. S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton.
 Ma, S. and Racine, J. S. (2013). Additive regression splines with irrelevant categorical and continuous regressors. Statistica Sinica, 23:515–541.
 Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. The Annals of Statistics, 20(2):712–736.
 Muhsal, B. and Neumeyer, N. (2010). A note on residualbased empirical likelihood kernel density estimation. Electronic Journal of Statistics, 4:1386–1401.
 Neumeyer, N. and Dette, H. (2007). Testing for symmetric error distribution in nonparametric regression models. Statistica Sinica, 17:775–795.
 Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B, 56(1):3–48.
 Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.
 QuinteladelRío, A., Ferraty, F., and Vieu, P. (2011). Analysis of time of occurrence of earthquakes: A functional data approach. Mathematical Geoscience, 43(6):695–719.
 QuinteladelRío, A. and FranciscoFernández, M. (2011). Nonparametric functional data estimation applied to ozone data: Prediction and extreme value analysis. Chemosphere, 82(6):800–808.
 R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3900051070. http://www.Rproject.org/.
 Rachdi, M. and Vieu, P. (2007). Nonparametric regression for functional data: Automatic smoothing parameter selection. Journal of Statistical Planning and Inference, 137(9):2784–2801.
 Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer, New York, 2nd edition.
 Reiss, P. and Ogden, T. (2007). Functional principal component regression and functional partial least squares. Journal of the American Statistical Association, 102(479):984–996.
 Samb, R. (2011). Nonparametric estimation of the density of regression errors. Comptes Rendus Mathematiques, 349(2324):1281–1285.
 Shang, H. L. (2013). Bayesian bandwidth estimation for a nonparametric functional regression model with unknown error density. Computational Statistics & Data Analysis, 67:185–198.
 Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Analysis, 1:833–860.
 Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86.
 Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B, 73(1):3–36.
 Yao, F. and Müller, H.G. (2010). Functional quadratic regression. Biometrika, 97(1):49–64.
 Zhang, X. and King, M. L. (2011). Bayesian semiparametric GARCH models. Working paper 11/24, Monash University. http://www.buseco.monash.edu.au/ebs/pubs/wpapers/2011/wp2411.pdf.