Bayesian bandwidth estimation for a nonparametric functional regression model with mixed types of regressors and unknown error density

Han Lin Shang1

Australian National University

Abstract

We investigate bandwidth estimation in a nonparametric functional regression model with function-valued, continuous real-valued and discrete-valued regressors when the error density is unknown. Extending the recent work of Shang (2013, Computational Statistics & Data Analysis), we approximate the unknown error density by a kernel density estimator of the residuals, where the regression function is estimated by a functional Nadaraya-Watson estimator that admits mixed types of regressors. We derive a kernel likelihood and posterior density for the bandwidth parameters under the kernel-form error density, and put forward a Bayesian approach that estimates all bandwidths simultaneously. Simulation studies demonstrate the estimation accuracy of the regression function and error density achieved by the proposed Bayesian approach. We then apply the approach to select the optimal bandwidths in a nonparametric functional regression model with mixed types of regressors, illustrated by a spectroscopy data set from food quality control.

Keywords: functional Nadaraya-Watson estimator; kernel density estimation; Markov chain Monte Carlo; mixture error density; spectroscopy

MSC codes: 62G08, 62F15

1 Introduction

Recent advances in computer recording and storage technology have facilitated the collection of functional data sets and motivated many researchers to consider functional regression models for estimating the relationship between predictor and response variables, where at least one variable is functional in nature. The first functional formulation of a linear model dates back to a discussion by Hastie and Mallows (1993), and it was later developed in detail by Ramsay and Silverman (2005). Since then, it has been further extended or modified to account for a possible nonlinear relationship between predictor and response variables. The extended regression models include the functional logit regression model (Aguilera et al., 2008), the functional polynomial regression model (Yao and Müller, 2010), the functional additive regression model (Febrero-Bande and González-Manteiga, 2013), the nonparametric functional regression model (Ferraty and Vieu, 2006) and the semi-functional partial linear model (Aneiros-Pérez and Vieu, 2006; Aneiros-Perez and Vieu, 2008), to name only a few. Owing to this fast development, functional regression models have gained increasing popularity in various fields of application, such as age-specific mortality and fertility forecasting in demography (Hyndman and Shang, 2009), analysis of spectroscopy data in chemometrics (Ferraty and Vieu, 2002), calculation of conditional value-at-risk and expected shortfall in econometrics (Ferraty and Quintela-del-Río, 2014), earthquake modelling (Quintela-del-Río et al., 2011), and ozone level prediction (Quintela-del-Río and Francisco-Fernández, 2011).

Despite this fast development, the type of predictor variable is often limited to a function-valued variable and the error density is often assumed to be Gaussian. In the field of functional regression models, error density estimation remains largely unexplored. However, estimation of the error density is important for understanding residual behaviour and assessing the adequacy of the assumed error distribution (see for example, Akritas and Van Keilegom, 2001; Cheng and Sun, 2008); it is useful for testing the symmetry of the residual distribution (see for example, Ahmad and Li, 1997; Neumeyer and Dette, 2007); it is important for statistical inference, prediction and model validation (see for example, Efromovich, 2005; Muhsal and Neumeyer, 2010); and it is useful for estimating the density of the response variable (see for example, Escanciano and Jacho-Chávez, 2012). In the realm of financial asset returns, an important use of the estimated error density is to estimate the value-at-risk of holding an asset; any misspecification of the error density may produce an inaccurate estimate of value-at-risk and leave the asset holder unable to control risk. Therefore, being able to estimate the error density is as important as being able to estimate the regression function.

This motivates the investigation of a kernel-form error density for estimating the unknown error density in a nonparametric functional regression model, where the response variable is scalar but the predictor variables can be function-valued, continuous real-valued and discrete-valued. The contribution of this paper is twofold. First, to deal with the mixed types of regressors, we consider three types of kernel functions to model the explanatory power of the function-valued, continuous real-valued and discrete-valued predictors on the scalar real-valued response. Second, to estimate the error density accurately, we put forward a kernel-form estimator that depends on three quantities: 1) the type of semi-metric used to measure distances among curves, such as a semi-metric based on second-order derivatives or on functional principal components; 2) the residuals fitted through the functional Nadaraya-Watson (NW) estimator of the regression function; and 3) the bandwidth of the residuals. Cheng (2002, 2004) studied weak and strong uniform consistency of such an error density estimator, while Samb (2011) established the optimal convergence rate of the kernel-form error density estimator in a multivariate framework. Recently, Shang (2013) proposed a Bayesian procedure to simultaneously estimate the bandwidths in the functional NW estimator of the regression function and in the kernel-form error density for a nonparametric functional regression model with only a functional predictor. In this paper, we extend this Bayesian approach to the same nonparametric functional regression model with mixed types of regressors.

2 Bayesian bandwidth estimation

We present the basics of the method and refer readers to the textbooks by Ferraty and Vieu (2006) on nonparametric functional regression and by Li and Racine (2007) on nonparametric multivariate regression with mixed types of regressors. The essential idea of functional NW kernel smoothing is to allow flexible estimation of the unknown regression function. For a scalar real-valued response, we consider the following nonparametric functional regression model:

$$y_i = m\left(X_i, \boldsymbol{z}_i, \boldsymbol{w}_i\right) + \varepsilon_i, \qquad i = 1, 2, \dots, n, \qquad (1)$$

where $m(\cdot)$ is a smooth but unknown real-valued function, $X_i$ is an infinite-dimensional functional variable, bounded within a compact interval and belonging to a square-integrable function space, $\boldsymbol{z}_i$ is a $p$-dimensional vector of continuous real-valued variables, $\boldsymbol{w}_i$ is a $q$-dimensional vector of discrete-valued variables, and $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed (iid) errors with unknown density $f(\varepsilon)$. The discrete-valued predictors can be either ordered or unordered, which in turn affects the selection of suitable kernels (see equations (3) and (4)). We assume that the errors have zero mean. We investigate the problem of nonparametric estimation of the probability density function of the error term, which in turn may provide a good estimate of the regression function. As noted by Samb (2011), the difficulty of estimating the error density lies in the fact that the regression error is not observed and must therefore be estimated.

The flexibility of model (1) stems from the fact that the unknown regression function does not need to have a specific parametric form. Under some smoothness assumptions, the functional form of $m(\cdot)$ is estimated in a data-driven manner. There is a growing literature on nonparametric functional estimators, such as the functional NW estimator (Ferraty and Vieu, 2006), the functional local linear estimator (Berlinet et al., 2011), the functional k-nearest neighbour estimator (Burba et al., 2009) and the distance-based local linear estimator (Boj et al., 2010). Throughout the paper, we demonstrate the proposed method using the functional NW estimator because of its simplicity and mathematical elegance.

2.1 Functional Nadaraya-Watson (NW) estimator

The functional NW estimator of the conditional mean can be defined as

$$\widehat{m}\left(X, \boldsymbol{z}, \boldsymbol{w}\right) = \frac{\sum_{i=1}^{n} y_i\, \mathcal{K}\left(X_i, \boldsymbol{z}_i, \boldsymbol{w}_i; X, \boldsymbol{z}, \boldsymbol{w}\right)}{\sum_{i=1}^{n} \mathcal{K}\left(X_i, \boldsymbol{z}_i, \boldsymbol{w}_i; X, \boldsymbol{z}, \boldsymbol{w}\right)},$$

where $\mathcal{K}(\cdot)$ is a generalised product kernel that admits function-valued, continuous real-valued and discrete-valued predictors, formed as the product of the kernels described below.

The kernel function for a function-valued predictor can be expressed as

$$K_h\left(X_i, X\right) = K\!\left(\frac{d(X_i, X)}{h}\right),$$

where $K(\cdot)$ is a symmetric real-valued kernel function, $d(\cdot, \cdot)$ is a semi-metric used to measure distances among curves, and $h$ represents the bandwidth associated with the infinite-dimensional function-valued predictor; throughout, $K(\cdot)$ is taken to be the second-order Gaussian kernel. For simplicity, we consider just one function-valued predictor in this paper, although the methodology can be extended to multiple function-valued predictors (see for example, Goldsmith et al., 2011).

In the nonparametric functional regression model, faster rates of convergence (and therefore nicer theoretical and practical results) can be obtained, as long as one uses a semi-metric that increases the concentration of the functional explanatory variable while reflecting as much as possible its effect on the response variable (Quintela-del-Río et al., 2011). For non-smooth functional data, a semi-metric based on functional principal component analysis should be considered; for smooth functional data, a semi-metric based on derivatives should be considered. Because the data we consider are smooth, a semi-metric based on the $\nu$-th order derivatives of the curves is used to measure the distance between two curves, given by

$$d_{\nu}\left(X_i, X_j\right) = \sqrt{\int \left[X_i^{(\nu)}(t) - X_j^{(\nu)}(t)\right]^2 dt}, \qquad (2)$$

where $X^{(\nu)}$ denotes the $\nu$-th order derivative of the curve $X$ (see for example, Goutis, 1998; Ferraty and Vieu, 2009). We refer readers to Ferraty and Vieu (2006, pp. 32-33) for details on the calculation of (2).
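To make the semi-metric concrete, the short sketch below (not the paper's code; the function name and the use of finite differences and the trapezoidal rule are our own simplifications of the B-spline-based calculation in Ferraty and Vieu, 2006) computes a derivative-based distance between two curves observed on a common equispaced grid.

# Illustrative sketch: derivative-based semi-metric (2) for discretised curves.
# Derivatives are approximated by finite differences, the integral by the trapezoidal rule.
semimetric_deriv <- function(x1, x2, t, deriv = 2) {
  d <- x1 - x2
  for (k in seq_len(deriv)) {            # deriv-th order derivative of the difference
    d <- diff(d) / diff(t)[1]
    t <- t[-length(t)]
  }
  sqrt(sum((d[-1]^2 + d[-length(d)]^2) / 2 * diff(t)))   # trapezoidal rule
}

# Example: distance between two discretised curves on [0, pi]
tt <- seq(0, pi, length.out = 100)
x1 <- sin(2 * tt); x2 <- cos(2 * tt)
semimetric_deriv(x1, x2, tt, deriv = 2)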

The product kernel for the continuous real-valued predictors can be expressed as

$$W_{\boldsymbol{b}}\left(\boldsymbol{z}_i, \boldsymbol{z}\right) = \prod_{s=1}^{p} \frac{1}{b_s}\, W\!\left(\frac{z_{is} - z_s}{b_s}\right),$$

where $\boldsymbol{b} = (b_1, \dots, b_p)$ is a row vector of bandwidths associated with the $p$-dimensional continuous real-valued predictors, $z_{is}$ is the $s$-th predictor of $\boldsymbol{z}_i$, $z_s$ is the $s$-th element of the evaluation point $\boldsymbol{z}$, and $W(\cdot)$ is the second-order Gaussian kernel function.

For binary variables, the discrete kernel we consider is Aitchison and Aitken’s (1976) kernel, given by

$$L\left(w_{is}, w_s, \lambda_s\right) = \begin{cases} 1 - \lambda_s, & \text{if } w_{is} = w_s, \\ \lambda_s / (c_s - 1), & \text{otherwise}, \end{cases} \qquad (3)$$

where $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_q)$ represents a row vector of bandwidths associated with the $q$-dimensional discrete-valued predictors, $w_{is}$ is the $s$-th discrete predictor of $\boldsymbol{w}_i$, $w_s$ is the $s$-th element of the evaluation point $\boldsymbol{w}$, and $c_s$ represents the number of possible discrete outcomes. Notice that when $\lambda_s = 0$, the above kernel function becomes an indicator function which takes the values 0 and 1.

So far, we have considered discrete variables that do not have a natural ordering, an example of which is gender. However, in some cases a discrete variable has a natural ordering, an example of which is credit rating. For ordered categorical variables, the discrete kernel we consider is Li and Racine’s (2007) kernel, which incorporates the natural ordering. It is expressed as

$$L\left(w_{is}, w_s, \lambda_s\right) = \lambda_s^{\left|w_{is} - w_s\right|}, \qquad (4)$$

where $\boldsymbol{\lambda}$ again represents a row vector of bandwidths associated with the $q$-dimensional discrete-valued predictors.
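The following sketch (again our own illustration, not the paper's code) combines the three kernel types into the generalised product kernel and plugs the resulting weights into the functional NW estimator; it reuses the semimetric_deriv helper sketched after (2), and all argument names are assumptions.

# Illustrative sketch: generalised product kernel and functional NW estimate.
# Assumed inputs: Xmat (n x J matrix of discretised curves), zmat (n x p continuous),
# wmat (n x q discrete), y (response), and a new observation (x0, z0, w0).
aitchison_aitken <- function(wi, w0, lambda, c_s) {
  ifelse(wi == w0, 1 - lambda, lambda / (c_s - 1))        # unordered kernel (3)
}
ordered_kernel <- function(wi, w0, lambda) lambda^abs(wi - w0)   # ordered kernel (4)

nw_predict <- function(Xmat, zmat, wmat, y, x0, z0, w0,
                       h, b, lambda, c_s, ordered, tt, deriv = 2) {
  n <- nrow(Xmat)
  wts <- numeric(n)
  for (i in seq_len(n)) {
    k_fun <- dnorm(semimetric_deriv(Xmat[i, ], x0, tt, deriv) / h)  # functional part
    k_con <- prod(dnorm((zmat[i, ] - z0) / b) / b)                  # continuous part
    k_dis <- prod(ifelse(ordered,
                         ordered_kernel(wmat[i, ], w0, lambda),
                         aitchison_aitken(wmat[i, ], w0, lambda, c_s)))
    wts[i] <- k_fun * k_con * k_dis
  }
  sum(wts * y) / sum(wts)    # functional Nadaraya-Watson estimate
}

Here ordered is a logical vector indicating, for each discrete predictor, whether kernel (4) or kernel (3) should be applied.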

Having determined the types of kernel functions, the unknown parameters in the functional NW estimator are the bandwidths (also known as smoothing parameters in the field of nonparametric smoothing). As is always the case in nonparametric estimation, the role of the smoothing parameters is prominent. For example, the rate of convergence of the nonparametric functional estimator can be decomposed into two parts: a squared bias component which increases with the bandwidths, and a variance component which decreases with the bandwidths. Therefore, the bandwidths must be selected to balance the trade-off between squared bias and variance (see Heidenreich et al., 2013, for a recent review of bandwidth selection in multivariate density estimation and nonparametric regression).

In the field of nonparametric functional regression, the bandwidth associated with a function-valued regressor is commonly estimated by so-called functional cross validation (see for example, Ferraty and Vieu, 2002, 2006; Rachdi and Vieu, 2007; Burba et al., 2009; Aneiros-Perez and Vieu, 2011). The optimal bandwidths of the kernel estimators are obtained by minimising

$$\mathrm{CV}\left(h, \boldsymbol{b}, \boldsymbol{\lambda}\right) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \widehat{m}_{-i}\left(X_i, \boldsymbol{z}_i, \boldsymbol{w}_i\right)\right]^2,$$

where $\widehat{m}_{-i}(\cdot)$ is the leave-one-out kernel estimator of the regression function, computed with the $i$-th observation deleted. Functional cross validation is designed to assess the predictive performance of a model by averaging measures of its ability to predict a subset of curves from a model fit obtained after deleting just these curves from the functional data set. Under the criteria of averaged squared error, integrated squared error and mean integrated squared error, Rachdi and Vieu (2007) established the asymptotic optimality of functional cross validation. In addition, functional cross validation has the appealing feature that no estimation of the error variance is required. However, since the residuals affect the estimation accuracy of the regression function, functional cross validation may select sub-optimal bandwidths, which in turn may lead to inferior estimation accuracy of the regression function in small samples. Moreover, as the number of parameters increases, functional cross validation may suffer from numerical instability. Instead, we present a Bayesian approach that simultaneously estimates the bandwidths of the regression function and the error density, under the framework of mixed types of regressors and unknown error density.
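A minimal sketch of the leave-one-out cross-validation criterion, written in terms of the nw_predict helper sketched above (all names are ours), is as follows.

# Illustrative sketch: functional cross-validation criterion for candidate bandwidths.
functional_cv <- function(h, b, lambda, Xmat, zmat, wmat, y, c_s, ordered, tt) {
  n <- length(y)
  err <- numeric(n)
  for (i in seq_len(n)) {
    # leave-one-out prediction of y[i]
    yhat <- nw_predict(Xmat[-i, , drop = FALSE], zmat[-i, , drop = FALSE],
                       wmat[-i, , drop = FALSE], y[-i],
                       Xmat[i, ], zmat[i, ], wmat[i, ],
                       h, b, lambda, c_s, ordered, tt)
    err[i] <- (y[i] - yhat)^2
  }
  mean(err)    # CV(h, b, lambda)
}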

2.2 Estimation of error density

The goal is to recover the unknown error density $f(\varepsilon)$ (defined on the real line) from a sample of independent random observations. Because the errors are not observed, they are approximated by the residuals obtained from the functional NW estimator. To avoid the selection of a zero bandwidth, the error density can be approximated by the leave-one-out kernel density estimator of the residuals, expressed as

$$\widehat{f}_{-i}\left(\widehat{\varepsilon}_i; b_{\varepsilon}\right) = \frac{1}{n-1}\sum_{\substack{j=1 \\ j \neq i}}^{n} \frac{1}{b_{\varepsilon}}\,\phi\!\left(\frac{\widehat{\varepsilon}_i - \widehat{\varepsilon}_j}{b_{\varepsilon}}\right), \qquad (5)$$

where $\widehat{\varepsilon}_j = y_j - \widehat{m}_{-j}\left(X_j, \boldsymbol{z}_j, \boldsymbol{w}_j\right)$ represents the $j$-th residual, $\phi(\cdot)$ denotes the standard Gaussian kernel, and $b_{\varepsilon}$ represents the residual bandwidth.

Given the set of bandwidth parameters $\left(h, \boldsymbol{b}, \boldsymbol{\lambda}, b_{\varepsilon}\right)$, the kernel likelihood of the responses $\boldsymbol{y} = (y_1, \dots, y_n)$ can be approximated by

$$\widehat{L}\left(\boldsymbol{y} \,\middle|\, h, \boldsymbol{b}, \boldsymbol{\lambda}, b_{\varepsilon}\right) \approx \prod_{i=1}^{n} \widehat{f}_{-i}\left(\widehat{\varepsilon}_i; b_{\varepsilon}\right), \qquad (6)$$

where $\widehat{m}_{-i}(\cdot)$ and $\widehat{f}_{-i}(\cdot)$ are the leave-one-out nonparametric estimators of the regression function and of the error density of the computed residuals, respectively.
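As an illustration, the sketch below evaluates the log of the kernel likelihood (6) for a candidate parameter vector; it builds on the nw_predict helper above, and the argument names are our assumptions.

# Illustrative sketch: leave-one-out residuals and the log kernel likelihood (6).
log_kernel_likelihood <- function(h, b, lambda, b_eps,
                                  Xmat, zmat, wmat, y, c_s, ordered, tt) {
  n <- length(y)
  # leave-one-out residuals from the functional NW estimator
  eps <- sapply(seq_len(n), function(i)
    y[i] - nw_predict(Xmat[-i, , drop = FALSE], zmat[-i, , drop = FALSE],
                      wmat[-i, , drop = FALSE], y[-i],
                      Xmat[i, ], zmat[i, ], wmat[i, ],
                      h, b, lambda, c_s, ordered, tt))
  # leave-one-out kernel density of each residual, as in (5)
  loglik <- 0
  for (i in seq_len(n)) {
    dens <- mean(dnorm((eps[i] - eps[-i]) / b_eps) / b_eps)
    loglik <- loglik + log(dens)
  }
  loglik
}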

We now discuss the prior densities of the bandwidths. Since the squared bandwidths $h^2$, $b_s^2$ and $b_{\varepsilon}^2$ play the same role as a variance parameter in a Gaussian density, we assume that their priors are inverse Gamma densities, denoted IG$(\alpha_h, \beta_h)$, IG$(\alpha_b, \beta_b)$ and IG$(\alpha_{\varepsilon}, \beta_{\varepsilon})$, respectively. For instance, the prior density of $h^2$ is

$$\pi\left(h^2\right) = \frac{\beta_h^{\alpha_h}}{\Gamma(\alpha_h)} \left(\frac{1}{h^2}\right)^{\alpha_h + 1} \exp\left(-\frac{\beta_h}{h^2}\right),$$

where $\alpha_h$ and $\beta_h$ are hyperparameters (see for example, Geweke, 2010), and analogously for $b_s^2$ and $b_{\varepsilon}^2$. The hyperparameters are selected to assign more weight to smaller values of the squared bandwidths, in keeping with the asymptotic results. The prior density with the chosen hyperparameters is shown in Figure 1, along with two other possibilities considered in the sensitivity analysis given later in Table 3.

Figure 1: Probability density functions of three possible hyperparameter choices for the squared bandwidths. Throughout the paper, we use the IG(1,0.05), and report the sensitivity analysis in Table 3.

For the discrete variables, we assume a uniform prior for each discrete kernel bandwidth $\lambda_s$, given by

$$\pi\left(\lambda_s\right) = \frac{1}{u_s - l_s}, \qquad \lambda_s \in [l_s, u_s],$$

where $l_s$ and $u_s$ denote the lower and upper bounds of $\lambda_s$. For binary variables, $l_s = 0$ and $u_s = 1/2$; for ordered categorical variables, $l_s = 0$ and $u_s = 1$ (see also Hayfield and Racine, 2008).
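A minimal sketch of the joint log prior follows; the default hyperparameters mirror the IG(1, 0.05) choice highlighted in Figure 1, and the uniform bounds for the discrete bandwidths are passed in by the user.

# Illustrative sketch: log prior combining inverse Gamma priors on squared bandwidths
# with uniform priors on the discrete kernel bandwidths.
log_prior <- function(h, b, b_eps, lambda, upper = rep(0.5, length(lambda)),
                      alpha = 1, beta = 0.05) {
  log_ig <- function(s2) alpha * log(beta) - lgamma(alpha) -
    (alpha + 1) * log(s2) - beta / s2           # log IG density of a squared bandwidth
  if (any(lambda < 0) || any(lambda > upper)) return(-Inf)   # outside the uniform support
  sum(log_ig(c(h, b, b_eps)^2)) - sum(log(upper))            # independent priors
}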

According to Bayes’ theorem, the posterior can be expressed (up to a normalising constant) as

$$\pi\left(h^2, \boldsymbol{b}^2, \boldsymbol{\lambda}, b_{\varepsilon}^2 \,\middle|\, \boldsymbol{y}\right) \propto \widehat{L}\left(\boldsymbol{y} \,\middle|\, h^2, \boldsymbol{b}^2, \boldsymbol{\lambda}, b_{\varepsilon}^2\right) \pi\left(h^2\right)\pi\left(\boldsymbol{b}^2\right)\pi\left(\boldsymbol{\lambda}\right)\pi\left(b_{\varepsilon}^2\right), \qquad (7)$$

where $\widehat{L}(\cdot)$ is the same likelihood as in (6) but expressed in terms of the squared bandwidths for those parameters estimated by Gaussian kernels, and the priors are independent. Since we assume that there is no correlation between the regression function and the error term in (1), the bandwidths of the regression function and of the error density are uncorrelated in (7).

Based on (7), we use the adaptive block random-walk Metropolis algorithm of Garthwaite et al. (2010) to sample the bandwidth parameters. Throughout the paper, the burn-in period consists of the first 1,000 iterations, and a fixed number of iterations is recorded after the burn-in period. To measure the mixing performance of the sample paths, we consider the simulation inefficiency factor (SIF; see also Kim et al., 1998; Shang, 2013), which can be interpreted as the number of correlated draws needed to obtain the same amount of information as one draw from a sampler producing iid observations from the posterior distribution.
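The sketch below is a simplified stand-in for the adaptive block random-walk Metropolis sampler (the Robbins-Monro step-size rule, the target acceptance rate and the number of recorded iterations are our placeholder choices, not the exact tuning of Garthwaite et al., 2010).

# Illustrative sketch: random-walk Metropolis with a Robbins-Monro-type scale update.
# log_post should combine log_kernel_likelihood and log_prior from the sketches above,
# evaluated on a transformed (e.g., log) scale so that bandwidths stay positive;
# init must lie in the support of the prior.
run_rwm <- function(log_post, init, n_burn = 1000, n_keep = 10000, target_acc = 0.234) {
  d <- length(init); theta <- init; sigma <- 0.1
  draws <- matrix(NA_real_, n_keep, d)
  lp <- log_post(theta)
  for (it in seq_len(n_burn + n_keep)) {
    prop <- theta + rnorm(d, sd = sigma)       # block random-walk proposal
    lp_prop <- log_post(prop)
    acc <- log(runif(1)) < (lp_prop - lp)
    if (acc) { theta <- prop; lp <- lp_prop }
    # diminishing adaptation of the proposal scale towards the target acceptance rate
    sigma <- sigma * exp((as.numeric(acc) - target_acc) / max(it, 200))
    if (it > n_burn) draws[it - n_burn, ] <- theta
  }
  draws
}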

In kernel density estimation, it has been observed that a leave-one-out kernel estimator such as (5) can be heavily affected by the presence of extreme residuals (see, for example, Bowman, 1984), which may be attributed to the use of a single global bandwidth. To rectify this problem, we consider the localised bandwidth method studied by Zhang and King (2011) and Shang (2013). The idea of localised bandwidths is to assign small bandwidths to observations in high-density regions and large bandwidths to observations in low-density regions. The localised kernel-form error density replaces the global residual bandwidth in (5) by an individual bandwidth assigned to each residual, so the vector of parameters is augmented by the parameter governing the localised bandwidths. By setting the prior of this additional parameter to be a uniform density, the adaptive random-walk Metropolis algorithm is again used to sample these parameters.
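A minimal sketch of one common localisation is given below; the specific inflation rule b_j = b(1 + tau * |eps_j|) is an assumption borrowed from the localised-bandwidth literature and is not necessarily the paper's exact parameterisation.

# Hypothetical sketch: localised kernel-form error density with per-residual bandwidths.
localised_error_density <- function(e_grid, eps, b, tau) {
  b_j <- b * (1 + tau * abs(eps))               # larger bandwidths for extreme residuals
  sapply(e_grid, function(e) mean(dnorm((e - eps) / b_j) / b_j))
}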

3 Simulation study

The main goal of this section is to illustrate the proposed methodology through simulated data. One way to do this is to compare the true regression function with the estimated regression function, and the true error density with the estimated error density. To measure the estimation accuracy of the regression function, we use the mean average squared error (MASE), given by

$$\mathrm{MASE} = \frac{1}{100}\sum_{r=1}^{100}\frac{1}{n}\sum_{i=1}^{n}\left[\widehat{m}_r\left(X_i, \boldsymbol{z}_i, \boldsymbol{w}_i\right) - m\left(X_i, \boldsymbol{z}_i, \boldsymbol{w}_i\right)\right]^2, \qquad (8)$$

where $n$ represents the sample size, $r$ indexes one of the 100 replications considered, and $\widehat{m}_r(\cdot)$ is the estimated regression function in the $r$-th replication.

To measure the discrepancy between the true error density $f$ and its estimate $\widehat{f}$, we calculate the mean integrated squared error (MISE), defined as

$$\mathrm{MISE} = \frac{1}{100}\sum_{r=1}^{100}\int \left[\widehat{f}_r(\varepsilon) - f(\varepsilon)\right]^2 d\varepsilon. \qquad (9)$$

For each replication, the integral in (9) is approximated numerically at a set of grid points over a bounded interval.
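The two accuracy measures are straightforward to compute; the sketch below assumes the fitted and true values (and the density estimates on a common error grid) have been collected across replications, with our own choice of data layout.

# Illustrative sketch: accuracy measures (8) and (9).
mase <- function(m_hat, m_true) {
  # m_hat, m_true: R x n matrices (replications in rows)
  mean(rowMeans((m_hat - m_true)^2))
}
mise <- function(f_hat, f_true, e_grid) {
  # f_hat: R x length(e_grid) matrix of estimated densities; f_true: true density on e_grid
  de <- diff(e_grid)[1]
  mean(apply(f_hat, 1, function(fr) sum((fr - f_true)^2) * de))
}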

Building the simulated samples.

We briefly describe the construction of the simulated data. First of all, we build simulated discretised curves

(10)

where the discretisation points are equispaced and the random coefficients are drawn independently from uniform distributions. The functional form of (10) is taken from Ferraty et al. (2010b). Figure 2 presents the simulated curves for one replication.

Figure 2: 250 simulated curves.

Once the curves are defined, one simulates a functional regression model to compute the responses (a data-generation sketch is given after this list):

  • two regression functions (model 1 and model 2) are constructed from the simulated curves and additional scalar covariates, where the regression functions involve the derivative of the simulated curve, a real-valued continuous variable simulated from a standard normal distribution, a real-valued continuous variable simulated from an exponential distribution with rate parameter 1, and a discrete-valued variable drawn from a Bernoulli distribution.

  • generate the errors from a trimodal distribution. As shown in the supplement, we have also considered the claw error density (see also Marron and Wand, 1992, Table 1).

  • compute the corresponding responses as in model (1).
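As a concrete illustration of this data-generating scheme, the sketch below simulates curves, covariates and errors of the stated types; the particular curve basis, regression function and mixture components are placeholders of our own choosing, not the exact forms used in the paper.

# Hypothetical data-generation sketch; the curve form and regression function are
# illustrative stand-ins, not the paper's exact specification.
set.seed(1)
n <- 250; J <- 100
tt <- seq(0, pi, length.out = J)
a <- runif(n); b <- runif(n); c <- runif(n)
Xmat <- t(sapply(seq_len(n), function(i)
  a[i] * cos(2 * tt) + b[i] * sin(4 * tt) + c[i] * tt^2))   # discretised curves
z1 <- rnorm(n)                 # standard normal covariate
z2 <- rexp(n, rate = 1)        # exponential covariate with rate 1
w  <- rbinom(n, size = 1, prob = 0.5)   # Bernoulli covariate
# placeholder regression function combining the curves and the scalar covariates
m_true <- rowMeans(Xmat) + z1 + 0.5 * z2 + w
# trimodal error density: equally weighted mixture of three normals (assumed form)
comp <- sample(1:3, n, replace = TRUE)
eps  <- rnorm(n, mean = c(-2, 0, 2)[comp], sd = 0.5)
y <- m_true + eps              # responses as in model (1)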

Estimating the regression function.

We compute the discrepancy between the true and estimated regression functions over the replications. To do so, we use the following Monte Carlo scheme:

  1. build 100 replications of the simulated data set.

  2. for each replication, compute the difference between the true regression function and the regression function estimated by the functional NW estimator with the generalised product kernel. The Gaussian kernel is used for the function-valued and continuous real-valued predictors, Aitchison and Aitken’s (1976) kernel for the binary predictor, and Li and Racine’s (2007) kernel for the ordered categorical predictor.

  3. for each replication, calculate the average squared error of the regression function. Averaging over the 100 replications, we use the MASE to assess the estimation accuracy of the regression function.

Table 1 presents the MASE for the estimated regression function in the nonparametric functional regression model with mixed types of regressors. For both models, the functional cross validation method produces larger MASE than the Bayesian approach. Between the global bandwidth and the localised bandwidths, the former performs better in model 2, which has more regressors, while the latter performs better in model 1, which has fewer regressors. As the sample size increases, the estimation accuracy improves and the difference between the two Bayesian methods becomes smaller. For comparison, we also considered the same regression model without the discrete-valued regressor. We found that the inclusion of the discrete-valued regressor reduces the MASE in both cases, more so for model 2 than for model 1.

                      Functional cross validation      Bayesian: global bandwidth       Bayesian: localised bandwidths
Type of regressor     n = 50   n = 150   n = 250       n = 50   n = 150   n = 250       n = 50   n = 150   n = 250
function, continuous real-valued, discrete-valued
model 1 2.196 1.512 1.307 2.115 1.201 1.038 2.084 1.194 1.031
(0.723) (0.246) (0.167) (0.742) (0.213) (0.148) (0.710) (0.195) (0.143)
model 2 2.201 1.614 1.341 1.812 1.372 1.213 3.035 2.374 2.211
(2.165) (0.271) (0.211) (0.568) (0.229) (0.145) (1.309) (0.523) (0.338)
function, continuous real-valued
model 1 2.304 1.570 1.402 2.440 1.335 1.139 2.417 1.335 1.140
(0.716) (0.257) (0.198) (0.863) (0.246) (0.159) (0.873) (0.228) (0.155)
model 2 3.201 3.078 2.806 3.936 2.895 2.758 3.988 3.257 3.110
(1.240) (0.705) (0.539) (1.629) (0.703) (0.572) (1.657) (0.689) (0.513)
Table 1: MASE comparison among the functional cross validation method and the two Bayesian bandwidth estimation methods, with and without the inclusion of the discrete-valued regressor, for estimating the regression function. The red coloured text represents the minimal mean, while the blue coloured text represents the minimal standard deviation. The numbers in parentheses represent the sample standard deviations of the squared errors. The expression of MASE is given in (8).

Estimating the error density

With a set of residuals, we apply a univariate kernel density estimator with either a global bandwidth or localised bandwidths. For each replication, we compute the integrated squared error between the true and estimated error densities, and obtain the overall discrepancy by averaging over the 100 replications.

Table 2 presents the MISE for the kernel-form error density with bandwidths estimated by the two Bayesian methods. There is an advantage in using the localised bandwidths over the global bandwidth, for all sample sizes considered.

                      Bayesian: global bandwidth       Bayesian: localised bandwidths
Type of regressor     n = 50   n = 150   n = 250       n = 50   n = 150   n = 250
function, continuous real-valued, discrete-valued
model 1 0.0896 0.0809 0.0599 0.0233 0.0088 0.0057
(0.0513) (0.0482) (0.0231) (0.0199) (0.0099) (0.0043)
model 2 0.2839 0.1503 0.1217 0.0287 0.0204 0.0184
(0.5101) (0.1139) (0.0585) (0.0210) (0.0148) (0.0089)
function, continuous real-valued
model 1 0.0522 0.0432 0.0330 0.0545 0.0494 0.0414
(0.0458) (0.0279) (0.0129) (0.0467) (0.0248) (0.0146)
model 2 0.0672 0.1286 0.1266 0.0619 0.0442 0.0424
(0.0636) (0.0253) (0.0256) (0.0694) (0.0085) (0.0051)
Table 2: MISE comparison between the two Bayesian bandwidth estimation methods, with and without the inclusion of the discrete-valued regressor, for estimating the error density. The numbers in parentheses represent the sample standard deviations of the squared errors. The expression of MISE is given in (9).

Diagnostic check of Markov chains

As a demonstration with one replication, we plot the un-thinned sample paths of all the parameters in model 1 in the left panel of Figure 3, and the autocorrelation functions (ACFs) of these sample paths in the right panel. These plots show that the sample paths are reasonably well mixed. Table 3 summarises the ergodic averages, 95% Bayesian credible intervals (CIs), total standard errors (SE), batch-mean SE, and SIF values.

Figure 3: Left panel: trace plots of the bandwidths of the discrete regressor, the continuous regressor, the functional regressor and the residuals. Right panel: ACF plots of these bandwidths.
Prior density: IG(1, 0.05)
Parameter   Mean   95% Bayesian CI   SE   Batch-mean SE   SIF
0.3716 (0.1935, 0.4937) 0.0840 0.2630 9.81
0.8119 (0.6511, 1.0322) 0.0985 0.2263 5.28
0.7249 (0.6509, 0.8131) 0.0422 0.0951 5.07
0.2001 (0.0988, 0.3283) 0.0638 0.1956 9.40
Prior density: IG(5, 0.25)
0.3493 (0.1823, 0.4806) 0.0881 0.2173 6.08
0.7540 (0.5995, 0.9285) 0.0794 0.1747 4.84
0.7041 (0.6090, 0.7840) 0.0432 0.1093 6.39
0.2200 (0.1516, 0.3011) 0.0392 0.1067 7.41
Prior density: Cauchy
0.3632 (0.2118, 0.4795) 0.0742 0.2305 9.65
0.8393 (0.6650, 1.0830) 0.1059 0.2765 6.82
0.7316 (0.6564, 0.8296) 0.0417 0.1003 5.79
0.1925 (0.0542, 0.3590) 0.0875 0.2987 11.66
Table 3: MCMC results for the bandwidth estimates under different prior densities, with the trimodal error density.

Using the coda package (Plummer et al., 2006) in the R language (R Core Team, 2013), we further checked the convergence of the Markov chains with Geweke’s (1992) convergence diagnostic test and Heidelberger and Welch’s (1983) convergence diagnostic test. Our Markov chains pass both tests for all 100 replications.
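These diagnostics are readily reproduced from the recorded draws; the sketch below assumes draws is the matrix returned by the Metropolis sketch given earlier.

# Illustrative sketch: convergence diagnostics for the recorded MCMC draws using coda.
library(coda)
chain <- mcmc(draws)
geweke.diag(chain)       # Geweke (1992) z-scores for each bandwidth parameter
heidel.diag(chain)       # Heidelberger and Welch (1983) stationarity/half-width tests
effectiveSize(chain)     # effective sample sizes, closely related to the SIF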

3.1 Regression models having irrelevant regressors

Sometimes, not all of the regressors are relevant (see Hall et al., 2007; Ma and Racine, 2013, for the case of the nonparametric multivariate regression model). Without loss of generality, we assume that some of the continuous and discrete regressors are irrelevant and set up our simulation study using the previous two models. Suppose the true model contains only one function-valued, one real-valued continuous and one discrete-valued regressor, but we observe one function-valued, two real-valued continuous and two discrete-valued regressors. As pointed out in Hall et al. (2007) and also shown in Table 4, an irrelevant discrete regressor is smoothed out when its bandwidth takes its upper bound value, while an irrelevant continuous regressor is effectively smoothed out when its bandwidth exceeds just a few standard deviations of the data.

Sample size   n = 50   n = 150   n = 250   n = 1000
0.128 [0.059, 0.175] 0.329 [0.280, 0.386] 0.330 [0.231, 0.410] 0.275 [0.244, 0.305]
0.895 [0.823, 0.990] 0.952 [0.860, 0.994] 0.987 [0.955, 0.998] 0.969 [0.916, 0.995]
0.881 [0.608, 2.529] 0.742 [0.669, 0.830] 0.894 [0.766, 1.058] 0.768 [0.716, 0.822]
1.965 [0.997, 2.625] 3.752 [1.977, 10.866] 2.334 [1.478, 5.852] 5.394 [3.145, 14.808]
0.072 [0.068, 0.087] 0.067 [0.064, 0.074] 0.050 [0.047, 0.053] 0.044 [0.042, 0.045]
Table 4: Summary of Bayesian bandwidths (median, 10th percentile, 90th percentile of the bandwidths) for the functional NW estimator with mixed types of regressors. The standard deviation of the first continuous regressor is 1.0327, and the standard deviation of the second continuous regressor is 0.9758.

The asymptotic behaviour of irrelevant regressors in nonparametric multivariate regression was established by Hall et al. (2007). We verify that the same phenomenon occurs in nonparametric functional regression when the bandwidths are selected by the Bayesian approach.

4 Application to food quality control

Our second data set focuses on the prediction of the fat content of meat samples based on near-infrared (NIR) absorbance spectra. The data were obtained from http://lib.stat.cmu.edu/datasets/tecator, and have been studied by Ferraty and Vieu (2006) and Aneiros-Pérez and Vieu (2006), among many others. Each food sample contains finely chopped pure meat with different percentages of fat, protein and moisture. For each unit (among 215 pieces of finely chopped meat), we observe one spectrometric curve, corresponding to the absorbance measured at a grid of 100 wavelengths. For each unit, we also observe its fat, protein and moisture contents, obtained by analytical chemical processing. As noted by Ferraty and Vieu (2003) and Febrero-Bande and de la Fuente (2012), the spectrometric curves can be split into two groups according to whether the fat content is below 20%. Graphical displays of the original spectrometric curves and their first-order differences are shown in Figure 6.

(a) Spectrometric curves, split into two groups according to whether the fat content is below 20% (red and blue lines).
(b) The corresponding curves after first-order differencing.
Figure 6: Graphical displays of the spectrometric curves.

Given the set of spectrometric curves, a classification algorithm can be used to determine the group membership of each curve. Given a new spectrometric curve, we can therefore allocate it to its corresponding group. We may also observe new measurements of the protein and moisture contents, which may help in predicting the fat content.

In order to assess the out-of-sample point forecast accuracy of the nonparametric functional estimator, we split the original sample into two sub-samples (see also Ferraty and Vieu, 2006, p. 105). The first, called the learning sample, contains the first 160 units; the second, called the testing sample, contains the remaining 55 units. The learning sample is used to build the regression model and estimate the optimal bandwidths, while the testing sample is used to evaluate the prediction accuracy.

To measure the prediction accuracy, we consider the mean square forecast error (MSFE) and the mean absolute forecast error (MAFE), computed over the 55 observations in the testing sample as

$$\mathrm{MSFE} = \frac{1}{55}\sum_{k=1}^{55}\left(y_k - \widehat{y}_k\right)^2, \qquad \mathrm{MAFE} = \frac{1}{55}\sum_{k=1}^{55}\left|y_k - \widehat{y}_k\right|,$$

where $y_k$ and $\widehat{y}_k$ denote the observed and predicted fat content of the $k$-th testing unit. The corresponding values of MSFE and MAFE for the functional cross validation method and the two Bayesian bandwidth estimation methods are shown in Table 5. There is an improvement in prediction accuracy for the functional estimator with localised bandwidths over the global bandwidth.
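The learning/testing split and the two error measures are simple to set up; the sketch below assumes the 215 observations are ordered as in the original tecator data set.

# Illustrative sketch: learning/testing split and out-of-sample error measures.
learn_idx <- 1:160; test_idx <- 161:215
msfe <- function(y_test, y_pred) mean((y_test - y_pred)^2)    # mean square forecast error
mafe <- function(y_test, y_pred) mean(abs(y_test - y_pred))   # mean absolute forecast error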

Method MSFE MAFE Log marginal likelihood Coverage probability
Functional cross validation
1.5107 0.8738
(2.7580) (0.8723)
Bayesian method with global bandwidth
IG(1, 0.05) 1.5803 0.8718 -227.04 0.78
(3.0387) (1.2638)
IG(5, 0.25) 1.6278 0.8862 -233.57 0.76
(3.1131) (1.2823)
Cauchy 1.5853 0.8700 -225.91 0.78
(3.0758) (1.2660)
Bayesian method with local bandwidth
IG(1, 0.05) 1.5015 0.8425 -228.88 0.91
(2.9014) (0.8980)
IG(5, 0.25) 1.6979 0.8880 -234.17 0.84
(3.4346) (0.9624)
Cauchy 1.5039 0.8475 -225.39 0.93
(2.9702) (0.8945)
Table 5: Out-of-sample MSFE and MAFE, log marginal likelihood and empirical coverage probability for the functional cross validation method and the Bayesian bandwidth estimation methods for forecasting the fat content. The numbers in parentheses represent the sample standard deviations of the squared or absolute errors.

The model evidence, as measured by the log marginal likelihood, is an important quantity for comparing statistical models under the Bayesian paradigm. It has received considerable attention in the Bayesian statistics literature; available estimators include Laplace’s method (Tierney and Kadane, 1986), the harmonic mean estimator (Newton and Raftery, 1994), the generic method based on MCMC output (Chib, 1995; Chib and Jeliazkov, 2001), nested sampling (Skilling, 2006) and the power posterior (Friel and Pettitt, 2008), to name only a few. Friel and Wyse (2012) compared these methods and found that the generic method based on MCMC output has good estimation accuracy as well as fast computational speed. We use this method to compare the evidence under different prior distributions. From Table 5, we find that the three prior densities have similar marginal likelihoods, with the Cauchy prior density being the favoured one (largest log marginal likelihood and best empirical coverage probability).

Because the original functional data are iid, we sample with replacement to obtain 100 bootstrapped replications. Within each replication, 160 pairs of observations are used for estimation and the remaining ones for forecast evaluation. Furthermore, we consider the large set of functional regression models collected in Goldsmith and Scheipl (2014) and Ferraty and Vieu (2011). These models include: (1) REML-based functional linear model with a locally adaptive penalty (Cardot et al., 2003); (2) penalised functional regression (Goldsmith et al., 2011); (3) functional principal component regression on the first few functional principal components (Reiss and Ogden, 2007); (4) linear model on the first few functional principal components, where the optimal number of components is selected by a 20-fold bootstrap (Ramsay and Silverman, 2005); (5) REML-based single-index signal regression with a locally adaptive penalty (Wood, 2011); (6) cross-validation-based single-index signal regression (Eilers et al., 2009); (7) penalised partial least squares (Krämer et al., 2008); (8) LASSO-penalised linear model on the first few functional principal components (Friedman et al., 2010); (9) nonparametric functional regression with the Nadaraya-Watson estimator of the conditional mean (Ferraty and Vieu, 2006); (10) nonparametric functional regression with the Nadaraya-Watson estimator of the conditional median (Laksaci et al., 2011); (11) nonparametric functional regression with the Nadaraya-Watson estimator of the conditional mode (Ferraty et al., 2005); (12) nonparametric functional regression with the k-nearest neighbour estimator (Burba et al., 2009); (13) nonparametric functional regression with the most-predictive design points (Ferraty et al., 2010a); and (14) the proposed nonparametric functional regression with mixed types of regressors.

In Figure 9(a), we investigate the forecast accuracy when the predictor is the set of original functional curves. The proposed nonparametric functional regression with mixed types of regressors gives the smallest root mean square forecast error (RMSFE) among all methods considered. In Figure 9(b), we investigate the forecast accuracy when the predictor is the second derivative of the original functional curves. The nonparametric functional regression with most-predictive design points performs best overall, followed by the nonparametric functional regression with mixed types of regressors. Since the input variable affects the forecast accuracy of a method, finding an optimal transformation of the input variable to maximise forecast accuracy remains a topic for future research.

(a) Input variable is a set of original functional curves.
(b) Input variable is the second derivative of original functional curves.
Figure 9: Boxplots of forecast accuracy among several functional regression models. Refer to Section 4 for details of the 14 methods.

With the Bayesian approach, we can also compute prediction intervals nonparametrically. To this end, we first compute the cumulative distribution function (cdf) of the estimated error density over a fine grid of points within a range, say between -10 and 10. We then invert the cdf and find the two grid points closest to the 2.5% and 97.5% quantiles. The 95% prediction interval for each holdout sample is obtained by adding these two quantiles to the point forecast. The point forecasts of the fat content are shown as solid blue dots, and the 95% pointwise prediction intervals as vertical red bars, in Figure 10.
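The construction of these intervals follows directly from the kernel-form error density; the sketch below is our own illustration of the grid-based inversion just described (residuals, bandwidths and point forecasts are assumed inputs).

# Illustrative sketch: 95% pointwise prediction intervals from the kernel-form error density.
# eps: residuals from the learning sample; b: their bandwidth(s); y_pred: point forecasts.
prediction_interval <- function(y_pred, eps, b, grid = seq(-10, 10, length.out = 2001)) {
  dens <- sapply(grid, function(e) mean(dnorm((e - eps) / b) / b))  # kernel error density
  cdf  <- cumsum(dens) / sum(dens)                                  # numerical cdf on the grid
  q_lo <- grid[which.min(abs(cdf - 0.025))]                         # 2.5% error quantile
  q_hi <- grid[which.min(abs(cdf - 0.975))]                         # 97.5% error quantile
  cbind(lower = y_pred + q_lo, upper = y_pred + q_hi)
}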

Figure 10: Plot of predicted fat contents in percentage and the 95% pointwise prediction intervals. The point forecasts of the fat content are shown as solid blue dots, while the 95% prediction intervals are shown as vertical red bars. The localised bandwidths of error density are estimated by the Bayesian method with the Cauchy prior distribution. The empirical coverage probability is 93%.

5 Conclusion and future research

We propose a Bayesian approach to estimate the optimal bandwidths in a nonparametric functional regression model that admits mixed types of regressors with homoscedastic errors and unknown error density. Since a closed-form expression for our bandwidth estimator does not exist, establishing its mathematical properties has been difficult, even in terms of standard asymptotic statistics. However, we have developed an approximate solution to the bandwidth estimation problem through Markov chain Monte Carlo. As a by-product, the marginal likelihood can be used to determine the optimal choice of prior density for the bandwidths.

Through a simulation study, we show that the Bayesian approach provides a way to simultaneously estimate the bandwidths in the functional NW estimator and in the kernel-form error density. Illustrated by a spectroscopy data set, the Bayesian bandwidth estimation approach allows the nonparametric construction of prediction intervals for measuring prediction uncertainty. To the best of our knowledge, the proposed method is the first attempt at modelling and forecasting a scalar real-valued response based on mixed types of regressors in the nonparametric functional regression framework.

There are many ways in which the present paper can be extended, and we briefly mention five at this point:

  1. Consider other functional regression estimators, such as the functional local linear estimator of Benhenni et al. (2007) or the k-nearest neighbour estimator of Burba et al. (2009). The functional local linear estimator can improve the estimation accuracy of the regression function by using a higher-order kernel function, while the k-nearest neighbour estimator takes into account the local structure of the data and gives better forecasts when the functional data are heterogeneously concentrated.

  2. Extend to a nonparametric functional regression model that admits mixed types of regressors with heteroscedastic errors, where the covariate-dependent variance can be modelled by another kernel estimator.

  3. Extend to a nonparametric functional regression model that admits mixed types of regressors with autoregressive errors.

  4. Extend to a semi-functional partial linear model that admits mixed types of regressors with homoscedastic, heteroscedastic or autoregressive errors.

  5. Apply the idea of marginal likelihood to select the optimal semi-metric in nonparametric functional regression models.

Footnotes

  1. Postal address: Research School of Finance, Actuarial Studies and Applied Statistics, Australian National University, Kingsley Street, Canberra, ACT 0200, Australia; Telephone: +61 2 6125 0535, Fax: +61 2 6125 0087; Email: hanlin.shang@anu.edu.au

References

  1. Aguilera, A., Escabias, M., and Valderrama, M. (2008). Discussion of different logistic models with functional data. Application to systemic lupus erythematosus. Computational Statistics & Data Analysis, 53(1):151–163.
  2. Ahmad, I. and Li, Q. (1997). Testing symmetry of an unknown density function by kernel method. Journal of Nonparametric Statistics, 7(3):279–293.
  3. Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method. Biometrika, 63(3):413–420.
  4. Akritas, M. G. and Van Keilegom, I. (2001). Non-parameteric estimation of the residual distribution. Scandinavian Journal of Statistics, 28(3):549–567.
  5. Aneiros-Pérez, G. and Vieu, P. (2006). Semi-functional partial linear regression. Statistics & Probability Letters, 76(11):1102–1110.
  6. Aneiros-Perez, G. and Vieu, P. (2008). Nonparametric time series prediction: A semi-functional partial linear modeling. Journal of Multivariate Analysis, 99:834–857.
  7. Aneiros-Perez, G. and Vieu, P. (2011). Automatic estimation procedure in partial linear model with functional data. Statistical Papers, 52(4):751–771.
  8. Benhenni, K., Ferraty, F., Rachdi, M., and Vieu, P. (2007). Local smoothing regression with functional data. Computational Statistics, 22(3):353–369.
  9. Berlinet, A., Elamine, A., and Mas, A. (2011). Local linear regression for functional data. Annals of the Institute of Statistical Mathematics, 63(5):1047–1075.
  10. Boj, E., Delicado, P., and Fortiana, J. (2010). Distance-based local linear regression for functional predictors. Computational Statistics & Data Analysis, 54(2):429–437.
  11. Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353–360.
  12. Burba, F., Ferraty, F., and Vieu, P. (2009). k-nearest neighbour method in functional nonparametric regression. Journal of Nonparametric Statistics, 21(4):453–469.
  13. Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13:571–591.
  14. Cheng, F. (2002). Consistency of error density and distribution function estimators in nonparametric regression. Statistics & Probability Letters, 59(3):257–270.
  15. Cheng, F. (2004). Weak and strong uniform consistency of a kernel error density estimator in nonparametric regression. Journal of Statistical Planning and Inference, 119(1):95–107.
  16. Cheng, F. and Sun, S. (2008). A goodness of fit test of the errors in nonlinear autoregressive time series models. Statistics & Probability Letters, 78(1):50–59.
  17. Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321.
  18. Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96(453):270–281.
  19. Efromovich, S. (2005). Estimation of the density of regression errors. The Annals of Statistics, 33(5):2194–2227.
  20. Eilers, P. H. C., Li, B., and Marx, B. D. (2009). Multivariate calibration with single-index signal regression. Chemometrics and Intelligent Laboratory Systems, 96(2):196–202.
  21. Escanciano, J. and Jacho-Chávez, D. (2012). $\sqrt{n}$-uniformly consistent density estimation in nonparametric regression models. Journal of Econometrics, 167(2):305–316.
  22. Febrero-Bande, M. and de la Fuente, M. O. (2012). Statistical computing in functional data analysis: the R package fda.usc. Journal of Statistical Software, 51(4).
  23. Febrero-Bande, M. and González-Manteiga, W. (2013). Generalized additive models for functional data. Test, 22(2):278–292.
  24. Ferraty, F., Hall, P., and Vieu, P. (2010a). Most-predictive design points for functional data predictors. Biometrika, 97(4):807–824.
  25. Ferraty, F., Laksaci, A., and Vieu, P. (2005). Functional time series prediction via conditional mode estimation. C. R. Acad. Sci. Paris, Ser. I, 340:389–392.
  26. Ferraty, F. and Quintela-del-Río, A. (2014). Conditional VaR and expected shortfall: A new functional approach. Econometric Reviews, in press.
  27. Ferraty, F., Van Keilegom, I., and Vieu, P. (2010b). On the validity of the bootstrap in non-parametric functional regression. Scandinavian Journal of Statistics, 37(2):286–306.
  28. Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and application to spectrometric data. Computational Statistics, 17(4):545–564.
  29. Ferraty, F. and Vieu, P. (2003). Curves discrimination: A nonparametric functional approach. Computational Statistics & Data Analysis, 44(1-2):161–173.
  30. Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer, New York.
  31. Ferraty, F. and Vieu, P. (2009). Additive prediction and boosting for functional data. Computational Statistics & Data Analysis, 53(4):1400–1413.
  32. Ferraty, F. and Vieu, P. (2011). Richesse et complexité des données fonctionnelles. Revue MODULAD, 43:25–43.
  33. Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1).
  34. Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society. Series B, 70(3):589–607.
  35. Friel, N. and Wyse, J. (2012). Estimating the evidence–a review. Statistica Neerlandica, 66(3):288–308.
  36. Garthwaite, P. H., Fan, Y., and Sisson, S. A. (2010). Adaptive optimal scaling of Metropolis-Hastings algorithms using the Robbins-Monro process. Working paper, University of New South Wales, Sydney. http://arxiv.org/pdf/1006.3690v1.pdf.
  37. Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics, pages 169–193. Oxford University Press, New York.
  38. Geweke, J. (2010). Complete and Incomplete Econometric Models. Princeton University Press, New Jersey.
  39. Goldsmith, J., Bobb, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2011). Penalized functional regression. Journal of Computational and Graphical Statistics, 20(4):830–851.
  40. Goldsmith, J. and Scheipl, F. (2014). Estimator selection and combination in scalar-on-function regression. Computational Statistics & Data Analysis, 70:362–372.
  41. Goutis, C. (1998). Second-derivative functional regression with applications to near infra-red spectroscopy. Journal of the Royal Statistical Society: Series B, 60(1):103–114.
  42. Hall, P., Li, Q., and Racine, J. (2007). Nonparametric estimation of regression functions in the presence of irrelevant regressors. The Review of Economics and Statistics, 89(4):784–789.
  43. Hastie, T. and Mallows, C. (1993). A statistical view of some chemometrics regression tools (discussion). Technometrics, 35(2):140–143.
  44. Hayfield, T. and Racine, J. S. (2008). Nonparametric econometrics: the np package. Journal of Statistical Software, 27(5).
  45. Heidelberger, P. and Welch, P. D. (1983). Simulation run length control in the presence of an initial transient. Operations Research, 31(6):1109–1144.
  46. Heidenreich, N.-B., Schindler, A., and Sperlich, S. (2013). Bandwidth selection for kernel density estimation: A review of fully automatic selectors. AStA Advances in Statistical Analysis, 97(4):403–433.
  47. Hyndman, R. J. and Shang, H. L. (2009). Forecasting functional time series (with discussions). Journal of the Korean Statistical Society, 38(3):199–211.
  48. Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies, 65(3):361–393.
  49. Krämer, N., Boulesteix, A.-L., and Tutz, G. (2008). Penalized partial least squares with application to B-spline transformations and functional data. Chemometrics and Intelligent Laboratory Systems, 94(1):60–69.
  50. Laksaci, A., Lemdani, M., and Saïd, E. O. (2011). Asymptotic results for an $L^1$-norm kernel estimator of the conditional quantile for functional dependent data with application to climatology. Sankhya: The Indian Journal of Statistics, 73-A(1):125–141.
  51. Li, Q. and Racine, J. S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton.
  52. Ma, S. and Racine, J. S. (2013). Additive regression splines with irrelevant categorical and continuous regressors. Statistica Sinica, 23:515–541.
  53. Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. The Annals of Statistics, 20(2):712–736.
  54. Muhsal, B. and Neumeyer, N. (2010). A note on residual-based empirical likelihood kernel density estimation. Electronic Journal of Statistics, 4:1386–1401.
  55. Neumeyer, N. and Dette, H. (2007). Testing for symmetric error distribution in nonparametric regression models. Statistica Sinica, 17:775–795.
  56. Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B, 56(1):3–48.
  57. Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). CODA: convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.
  58. Quintela-del-Río, A., Ferraty, F., and Vieu, P. (2011). Analysis of time of occurrence of earthquakes: A functional data approach. Mathematical Geoscience, 43(6):695–719.
  59. Quintela-del-Río, A. and Francisco-Fernández, M. (2011). Nonparametric functional data estimation applied to ozone data: Prediction and extreme value analysis. Chemosphere, 82(6):800–808.
  60. R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org/.
  61. Rachdi, M. and Vieu, P. (2007). Nonparametric regression for functional data: Automatic smoothing parameter selection. Journal of Statistical Planning and Inference, 137(9):2784–2801.
  62. Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer, New York, 2nd edition.
  63. Reiss, P. and Ogden, T. (2007). Functional principal component regression and functional partial least squares. Journal of the American Statistical Association, 102(479):984–996.
  64. Samb, R. (2011). Nonparametric estimation of the density of regression errors. Comptes Rendus Mathematiques, 349(23-24):1281–1285.
  65. Shang, H. L. (2013). Bayesian bandwidth estimation for a nonparametric functional regression model with unknown error density. Computational Statistics & Data Analysis, 67:185–198.
  66. Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Analysis, 1:833–860.
  67. Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86.
  68. Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B, 73(1):3–36.
  69. Yao, F. and Müller, H.-G. (2010). Functional quadratic regression. Biometrika, 97(1):49–64.
  70. Zhang, X. and King, M. L. (2011). Bayesian semiparametric GARCH models. Working paper 11/24, Monash University. http://www.buseco.monash.edu.au/ebs/pubs/wpapers/2011/wp24-11.pdf.