Estimation in functional linear quantile regression

Kengo Kato

Department of Mathematics, Graduate School of Science, Hiroshima University
1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8526, Japan
kkato@hiroshima-u.ac.jp

Received March 2012; revised August 2012.
Abstract

This paper studies estimation in functional linear quantile regression in which the dependent variable is scalar while the covariate is a function, and the conditional quantile for each fixed quantile index is modeled as a linear functional of the covariate. We suppose that covariates are discretely observed and that sampling points may differ across subjects, with the number of measurements per subject increasing with the sample size. We also allow the quantile index to vary over a given subset of the open unit interval, so the slope function is a function of two variables: (typically) time and quantile index. Likewise, the conditional quantile function is a function of the quantile index and the covariate. We consider an estimator for the slope function based on the principal component basis. An estimator for the conditional quantile function is obtained by a plug-in method. Since the plug-in estimator so constructed does not necessarily satisfy the monotonicity constraint with respect to the quantile index, we also consider a class of monotonized estimators for the conditional quantile function. We establish rates of convergence for these estimators under suitable norms, showing that these rates are optimal in a minimax sense under some smoothness assumptions on the covariance kernel of the covariate and the slope function. Empirical choice of the cutoff level is studied by simulations.

Annals of Statistics, Volume 40, Number 6 (2012), 3108–3136. DOI: 10.1214/12-AOS1066.

Running title: Functional quantile regression

Supported by the Grant-in-Aid for Young Scientists (B) (22730179) from the JSPS.


AMS subject classification: 62G20. Keywords and phrases: functional data, nonlinear ill-posed problem, principal component analysis, quantile regression.

1 Introduction

Quantile regression, initially developed by the seminal work of Koenker and Bassett (1978), is one of the most important statistical methods in measuring the impact of covariates on dependent variables. An attractive feature of quantile regression is that it allows us to make inference on the entire conditional distribution by estimating several different conditional quantiles. Some basic materials on quantile regression and its applications are summarized in Koenker (2005).

This paper studies estimation in functional linear quantile regression in which the dependent variable is scalar while the covariate is a function, and the conditional quantile for each fixed quantile index is modeled as a linear functional of the covariate. The model that we consider is an extension of functional linear regression to the quantile regression case. We suppose that covariates are discretely observed and that sampling points may differ across subjects, with the number of measurements per subject increasing with the sample size. We also allow the quantile index to vary over a given subset of the open unit interval, so the slope function is a function of two variables: (typically) time and quantile index. Likewise, the conditional quantile function is a function of the quantile index and the covariate. We consider the problem of estimating the slope function as well as the conditional quantile function itself. The estimator we consider for the slope function is based on principal component analysis (PCA). Expanding the covariate and the slope function in terms of the PCA basis, the model is transformed into a quantile regression model with an infinite number of regressors. Truncating the infinite sum at the first (say) terms, we may apply a standard quantile regression technique to estimate the first coefficients in the basis expansion of the slope function at each quantile index, where the cutoff level diverges with the sample size. In practice, the population PCA basis is unknown, so it has to be replaced by a suitable estimate. Once the estimator for the slope function is available, an estimator for the conditional quantile function is obtained by a plug-in method. Since the plug-in estimator so constructed does not necessarily satisfy the monotonicity constraint with respect to the quantile index, we also consider a class of monotonized estimators for the conditional quantile function.

In summary, we have the following three types of estimators in mind:

(i) a PCA-based estimator for the slope function;

(ii) a plug-in estimator for the conditional quantile function;

(iii) monotonized estimators for the conditional quantile function.

We establish rates of convergence for these estimators under suitable norms, showing that these rates are optimal in a minimax sense under some smoothness assumptions on the covariance kernel of the covariate and the slope function.

In practice, we have to choose the cutoff level empirically. We suggest some criteria, namely, (integrated-)AIC, BIC and GACV, to choose the cutoff level. We study the performance of these criteria by using simulations. In our limited simulation experiments, although none of these criteria clearly dominated the others, (integrated-)BIC worked relatively stably.

Functional data have become increasingly important. For data collected on dense grids, a traditional multivariate analysis is not directly applicable since the number of grid points may be larger than the sample size and the correlation between distinct grid points is potentially high. Functional data analysis views such data as realizations of random functions and takes into account the functional nature of the data. We refer the reader to Ramsay and Silverman (2005) for a comprehensive treatment on functional data analysis. Earlier theoretical studies in functional data analysis have focused on functional linear mean regression models [see Cai and Hall (2006), Cardot, Ferraty and Sarda (1999, 2003), Yao, Müller and Wang (2005), Hall and Horowitz (2007), Crambes, Kneip and Sarda (2009), James, Wang and Zhu (2009), Yuan and Cai (2010), Delaigle and Hall (2012) and references cited in these papers]. Among them, Hall and Horowitz (2007) established fundamental results in functional linear mean regression, deriving sharp rates of convergence for a PCA-based estimator for the slope function under some smoothness assumptions. Note that they assumed that covariates are continuously observed. Other than functional linear mean regression, Müller and Stadtmüller (2005) developed estimation methods for generalized functional linear models using series expansions of covariates and slope functions.

While not many, there are some earlier papers on estimating conditional quantiles with function-valued covariates. Cardot, Crambes and Sarda (2005) studied smoothing splines estimators for functional linear quantile regression models, although the rates they established are not sharp. Ferraty, Rabhi and Vieu (2005) considered nonparametric estimation of conditional quantiles when covariates are functions, which is a different problem from ours. Chen and Müller (2012) considered an “indirect” estimation of conditional quantiles. They modeled the conditional distribution as the composition of some (possibly unknown) link function and a linear functional of the covariate. They first estimated the conditional distribution function by adapting the method developed in Müller and Stadtmüller (2005) and then estimated the conditional quantile function by inverting the estimated conditional distribution function. In the quantile regression literature, there are two ways to estimate conditional quantiles. One is to directly model conditional quantiles and estimate unknown parameters by minimizing check functions. The other is to estimate conditional distribution functions and invert them to estimate conditional quantile functions. We refer to the former as a “direct” method and to the latter as an “indirect” method. The approach taken in this paper is classified as a direct method, while that of Chen and Müller (2012) is classified as an indirect method. Note that although their method is flexible, they only established consistency of their estimator.

Conditional quantile estimation offers a variety of fruitful applications for data containing function-valued covariates. A leading example in which conditional quantile estimation with function-valued covariates is useful appears in the analysis of growth data [Chen and Müller (2012)]. Suppose that we have a growth data set of girls’ heights between ages and , say, where multiple measurements may occur at some ages. Use a girl’s growth history between ages and as a covariate, and her height at age as a response. Then, conditional quantile estimation gives us an overall picture of the predictive distribution of a girl’s height at age given her growth history between ages and , which is more informative than just knowing the mean response. In addition to growth data, functional quantile regression has been applied to the analysis of ozone pollution data [Cardot, Crambes and Sarda (2007)] and El Niño data [Ferraty, Rabhi and Vieu (2005)]. We believe that functional linear quantile regression is a benchmark model for conditional quantile estimation when covariates are functions, just as linear quantile regression is when covariates are vectors.

Our estimator for the slope function (at a fixed quantile index) can be understood as a regularized solution to an empirical version of a nonlinear ill-posed inverse problem that corresponds to the “normal equation” in the quantile regression case, where the regularization is controlled by the cutoff level. The paper is thus in part related to the literature on statistical nonlinear inverse problems, which is still an ongoing research area [see Bissantz, Hohage and Munk (2004), Horowitz and Lee (2007), Loubes and Ludeña (2010), Chen and Pouzo (2012), Gagliardini and Scaillet (2012)]. On the other hand, in the mean regression case, the normal equation becomes a linear ill-posed inverse problem. Hall and Horowitz (2007) considered two regularized estimators for the slope function based on the normal equation in the mean regression case. Conceptually, the problems handled in our and their papers are different in their nature: linearity and nonlinearity.

From a technical point of view, establishing sharp rates of convergence for our estimators is challenging. Our proof strategy builds upon the techniques developed in the asymptotic analysis of M-estimators with diverging numbers of parameters [see, e.g., He and Shao (2000), Belloni, Chernozhukov and Fernández-Val (2011)]. However, an additional complication arises essentially because the “regressors” here are estimated ones and the estimation error has to be taken into account, which requires some new techniques such as “conditional” maximal inequalities and some careful moment calculations. Additionally, the discretization error brings a further technical complication.

Finally, the setting here is similar to Section 3 of Crambes, Kneip and Sarda (2009): covariates are densely but discretely observed, and the discretization error is taken into account in the analysis. However, the paper does not cover the case in which covariates are discretely observed with measurement errors, because of the technical complications involved. A formal theoretical analysis of such a case is left for future work.

The remainder of the paper is organized as follows. Section 2 presents the model and estimators. Section 3 gives the main results in which we derive rates of convergence for the estimators. Section 4 discusses empirical choice of the cutoff level. A proof of Theorem 3.1 is given in Section 5. Some additional discussions, technical proofs, useful technical tools and simulation results are provided in the Appendix. Due to the space limitation, the Appendix is contained in the supplementary file [Kato (2013)].

Notation.

Throughout the paper, we use the following notation. For , let denote the Euclidean norm of . For any integer , let denote the set of all unit vectors in : . For any , let and . Let denote the indicator function. For any given (random or nonrandom, scalar or vector) sequence , we use the notation

which should be distinguished from the population expectation . For any two sequences of positive constants and , we write if the ratio is bounded and bounded away from zero. Let denote the usual space of square-integrable functions on with respect to the Lebesgue measure. Let denote the -norm: . For any finite set , denotes its cardinality.
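For concreteness, the conventions just described can be written in generic symbols (the particular letters below are our own shorthand, not reproduced from the original):

\[
\mathbb{E}_n[f(z_i)] := \frac{1}{n}\sum_{i=1}^n f(z_i),
\qquad
a_n \asymp b_n \iff 0 < \liminf_{n\to\infty}\frac{a_n}{b_n} \le \limsup_{n\to\infty}\frac{a_n}{b_n} < \infty,
\qquad
\|f\|_{L^2} := \Bigl(\int_0^1 f(t)^2\,dt\Bigr)^{1/2}.
\]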

2 Methodology

2.1 Functional linear quantile regression modeling

Let be a pair of a scalar random variable and a random function on a bounded closed interval in . Without loss of generality, we assume . By “random function,” we mean that is a random variable for each . For our purpose of formulating a functional linear quantile regression model, we need the existence of the regular conditional distribution of given , to which end we assume a mild regularity condition on the path properties of . Let denote the space of all càdlàg functions on , equipped with the Skorohod metric [see Billingsley (1968)]. We assume that the map is càdlàg almost surely. Equip with the Borel σ-field. Then, can be taken as a -valued random variable. Since is a Polish space, and the product space with the product metric is also Polish, the regular conditional distribution of given exists [see, e.g., Dudley (2002), Theorem 10.2.2].

Let denote the conditional quantile function of given . Let be a given subset of that is bounded away from 0 and 1; that is, for some small , . For each , we assume that can be written as a linear functional of ; that is, for each , there exist a scalar constant and a scalar function such that

(1)

where . Typical examples of are as follows: (i)  (singleton); (ii) with (finite set); (iii) with (bounded closed interval). Formally, we allow for all these possibilities.

The model (1) is a natural extension of standard linear quantile regression models to function-valued covariates and was first formulated in Cardot, Crambes and Sarda (2005). In what follows, we consider estimating the slope function and the conditional quantile function .
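In generic notation (the symbols below — a(τ) for the intercept, b(·,τ) for the slope function, and the unit interval for the domain — are our own shorthand for the objects just described), a common way to write the linear quantile restriction (1) is

\[
Q_{Y\mid X}(\tau \mid X) \;=\; a(\tau) + \int_0^1 b(t,\tau)\,X(t)\,dt,
\qquad \tau \in \mathcal{T},
\]

where \(\mathcal{T}\) denotes the set of quantile indices under consideration.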

Before going to the next step, we briefly give two simple examples of data generating processes that admit the conditional quantile restriction (1).

{example}

[(Linear location design)] Suppose that obeys the linear location design

where is a constant and is a function in . Let denote the distribution function of and denote by the quantile function of . Then obeys the conditional quantile restriction
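In the notation introduced after (1), and assuming an error term ε independent of X (again our own generic symbols, introduced for illustration), this design can be written as

\[
Y = a + \int_0^1 b(t)\,X(t)\,dt + \varepsilon, \qquad \varepsilon \perp X,
\]

so that, with \(F_\varepsilon^{-1}\) the quantile function of ε,

\[
Q_{Y\mid X}(\tau \mid X) = \bigl\{a + F_\varepsilon^{-1}(\tau)\bigr\} + \int_0^1 b(t)\,X(t)\,dt .
\]

Here only the intercept depends on the quantile index; the slope function does not.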

{example}

[(Linear location-scale design)] Suppose that obeys the linear location-scale design

where are constants and are functions in . Suppose that almost surely. Denote by the quantile function of the distribution of . Then obeys the conditional quantile restriction

In this design, the slope function depends on the quantile index.
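Again in our generic notation, and assuming ε independent of X, the location-scale design and the implied conditional quantile restriction can be sketched as

\[
Y = a + \int_0^1 b(t)\,X(t)\,dt + \Bigl(c + \int_0^1 d(t)\,X(t)\,dt\Bigr)\varepsilon,
\qquad
c + \int_0^1 d(t)\,X(t)\,dt > 0 \ \text{a.s.},
\]

which yields

\[
Q_{Y\mid X}(\tau \mid X)
= \bigl\{a + c\,F_\varepsilon^{-1}(\tau)\bigr\}
+ \int_0^1 \bigl\{b(t) + d(t)\,F_\varepsilon^{-1}(\tau)\bigr\}\,X(t)\,dt ,
\]

so both the intercept and the slope function vary with the quantile index.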

2.2 Estimation strategy

We base our estimation strategy on the principal component analysis (PCA). To this end, we additionally assume here that . Define the covariance kernel . Then, by the Hilbert–Schmidt theorem, admits the spectral expansion

where are ordered eigenvalues, and is an orthonormal basis of consisting of eigenfunctions of the integral operator from to itself with kernel . We will later assume that are all positive and there are no ties in , that is, . Since is an orthonormal basis of , we have the following expansions in :

where and are defined by

The are called “principal scores” for . The expansion for is called the “Karhunen–Loève expansion.” This leads to the expression . Then, the model (1) is transformed into a quantile regression model with an infinite number of “regressors”:

(2)

Note that and for all .
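To fix notation for what follows, the expansions just described can be recorded compactly as follows (the symbols K, κ_j, φ_j, ξ_j and b_j(τ) are our own generic choices):

\[
K(s,t) = \sum_{j=1}^{\infty} \kappa_j\,\varphi_j(s)\varphi_j(t),
\qquad
X(t) = \mathbb{E}[X(t)] + \sum_{j=1}^{\infty} \xi_j\,\varphi_j(t),
\qquad
\xi_j = \int_0^1 \bigl\{X(t) - \mathbb{E}[X(t)]\bigr\}\varphi_j(t)\,dt,
\]

and, expanding the slope function as \(b(t,\tau) = \sum_{j} b_j(\tau)\varphi_j(t)\), the restriction (1) becomes the infinite-dimensional quantile regression model

\[
Q_{Y\mid X}(\tau \mid X) = a^{*}(\tau) + \sum_{j=1}^{\infty} b_j(\tau)\,\xi_j ,
\]

where the intercept \(a^{*}(\tau)\) absorbs the contribution \(\int_0^1 b(t,\tau)\,\mathbb{E}[X(t)]\,dt\) of the mean function.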

We first consider estimating the slope function . To this end, we estimate the function for each and collect them to construct a final estimator for . To explain the basic idea, suppose for a while that (i) were continuously observable; and (ii) the covariance kernel were known. The problem then reduces to finding suitable estimates of the coefficients . Let be independent copies of . For each , let be the principal scores for . Pick any . Then, a plausible approach to estimating is to truncate by for some large , and estimate only the first coefficients using a standard quantile regression technique. Let be the “cutoff” level such that and as . Estimate and the first coefficients of by

(3)

where is the check function [Koenker and Bassett (1978)]. Note that for , is equivalent to the absolute value function. Here recall that for any sequence . The resulting estimator for the slope function is given by
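To fix ideas, the infeasible truncated problem (3) and the resulting slope estimator can be sketched, in the generic notation above (with m the cutoff level and \(\rho_\tau(u) = \{\tau - 1(u \le 0)\}u\) the check function), as

\[
\bigl(\hat a(\tau), \hat b_1(\tau), \ldots, \hat b_m(\tau)\bigr)
\in \operatorname*{arg\,min}_{a,\,b_1,\ldots,b_m}\,
\frac{1}{n}\sum_{i=1}^{n}\rho_\tau\Bigl(Y_i - a - \sum_{j=1}^{m} b_j\,\xi_{ij}\Bigr),
\qquad
\hat b(t,\tau) = \sum_{j=1}^{m} \hat b_j(\tau)\,\varphi_j(t).
\]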

However, this “estimator” is infeasible since (i) is usually discretely observed; and (ii) is unknown. Because of (i), it is usually not possible to directly estimate the covariance kernel by the empirical one (since with is unavailable for some ). Similarly to Crambes, Kneip and Sarda (2009), we consider the following setting:

(1) For each , is observed only at discrete points , that is, we only observe . Typically, as is assumed.

(2) Based on the discrete observations, for each , construct an interpolated function for . Here we shall use a simple interpolation rule (see also the later remark):

The observation time points (and ) should be indexed by the sample size ; however, this is suppressed for notational convenience. Suppose now that the interpolated functions are obtained. Then, we may estimate the covariance kernel by

where . Let be the spectral expansion of , where are ordered eigenvalues and is an orthonormal basis of consisting of eigenfunctions of the integral operator from to itself with kernel . Each principal score is now estimated by

Then, the coefficients and are estimated by

(4)

The resulting estimator for the slope function is given by

The optimization problem (4) can be transformed into a linear programming problem and can be solved using standard statistical software. Once the estimator for the slope function is obtained, the conditional -quantile of given for a given function is estimated by a plug-in method:

Empirical choice of the cutoff level will be discussed in Section 4.
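The full feasible procedure — interpolation of the discretely observed curves, estimation of the PCA basis from the empirical covariance kernel, construction of estimated principal scores, and quantile regression on the first few of them — can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the paper (the paper’s computations use Ox); all names, the grid size, and the use of statsmodels’ QuantReg as the quantile regression solver are our own choices.

import numpy as np
import statsmodels.api as sm

def fit_flqr(X_obs, t_obs, y, tau, m, grid_size=200):
    """PCA-based functional linear quantile regression (sketch).

    X_obs : list of arrays, discrete observations of each covariate curve
    t_obs : list of arrays, the corresponding (sorted) sampling points in [0, 1]
    y     : (n,) responses
    tau   : quantile index in (0, 1)
    m     : cutoff level (number of principal components retained)
    """
    n = len(y)
    grid = np.linspace(0.0, 1.0, grid_size)
    # Step 1: interpolate each discretely observed curve onto a common grid.
    X = np.vstack([np.interp(grid, t, x) for t, x in zip(t_obs, X_obs)])
    # Step 2: empirical covariance kernel and its eigendecomposition,
    # approximating the estimated eigenvalues and PCA basis functions.
    Xc = X - X.mean(axis=0)
    K_hat = Xc.T @ Xc / n                      # (grid_size, grid_size)
    h = 1.0 / (grid_size - 1)                  # quadrature weight
    evals, evecs = np.linalg.eigh(K_hat * h)
    order = np.argsort(evals)[::-1][:m]
    kappa_hat = evals[order]
    phi_hat = evecs[:, order] / np.sqrt(h)     # L2-normalized eigenfunctions
    # Step 3: estimated principal scores (numerical integrals).
    xi_hat = Xc @ phi_hat * h                  # (n, m)
    # Step 4: quantile regression of y on the first m estimated scores.
    design = sm.add_constant(xi_hat)
    res = sm.QuantReg(y, design).fit(q=tau)
    a_hat = res.params[0]
    b_hat = phi_hat @ res.params[1:]           # slope function on the grid
    return grid, a_hat, b_hat, kappa_hat

def predict_quantile(grid, a_hat, b_hat, x_new):
    """Plug-in estimate of the conditional tau-quantile for a new curve x_new
    evaluated on `grid`."""
    h = grid[1] - grid[0]
    return a_hat + np.sum(b_hat * x_new) * h

With these pieces, the plug-in conditional quantile estimate for a new covariate curve is obtained from predict_quantile.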

The basis is called the (population) PCA basis. Alternatively, one may use other basis functions that are independent of the data, such as Fourier and wavelet bases, in which case the analysis becomes more tractable. A potential drawback of using such basis functions is that, as discussed in Delaigle and Hall (2012), using the “first” basis functions is then less well motivated. The PCA basis is a benchmark basis in functional data analysis, which is why it is used in this paper. Other estimation methods, such as smoothing splines [Crambes, Kneip and Sarda (2009)] and a reproducing kernel Hilbert space approach [Yuan and Cai (2010)], could be adapted to the quantile regression case; this is left as a topic for future research.

The interpolation rule used here may be replaced by any other reasonable interpolation rule. For example, a plausible alternative is to use

It is not hard to see that the theory below also applies to this interpolation rule. In practice, this interpolation rule may even be preferable since it uses all the discrete observations .

Finally, we note that for any fixed , our estimator can be understood as a regularized solution to an empirical version of a nonlinear ill-posed inverse problem that corresponds to the “normal equation” in the quantile regression case. Due to the space limitation, we shall discuss this connection to nonlinear ill-posed inverse problems in Appendix A in the supplementary file [Kato (2013)].

2.3 Monotonization

Suppose in this section that is a bounded closed interval: with . The conditional quantile function is monotonically nondecreasing in . However, the plug-in estimator constructed above is not necessarily so. To circumvent this problem, we may monotonize the map by one of the following three methods: (i) rearrangement [Chernozhukov, Fernández-Val and Galichon (2009)]; (ii) isotonization [Barlow et al. (1972)]; (iii) a convex combination of (i) and (ii). These methods are explained in Chernozhukov, Fernández-Val and Galichon (2009) in a general setup. We briefly explain their basic ideas. Pick any .

(i) Rearrangement. Define the function

Then the map is a proper distribution function supported in . The rearranged estimator for is defined by

Clearly, the map is nondecreasing.

(ii) Isotonization. The isotonization is carried out by projecting the original estimate onto the set of nondecreasing functions, typically by the “pool adjacent violators” algorithm. Denote by the isotonized estimator.

(iii) Convex combination of (i) and (ii). Take . Then, the convex combination is nondecreasing in .

Chernozhukov, Fernández-Val and Galichon (2009) show that any monotonized estimate [constructed by using one of (i)–(iii)] is at least as good as the initial estimate in the following sense: let be any monotonized estimate for given above. Then we have, for all ,

(5)

where the obvious modification is made when .
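For concreteness, here is a small Python sketch of the three monotonization devices (i)–(iii) applied to a fitted curve τ ↦ Q̂(τ) evaluated on an equally spaced grid of quantile indices; on such a grid, rearrangement reduces to sorting the fitted values. The function names and the equal-weight convex combination are our own illustrative choices.

import numpy as np

def rearrange(q_hat):
    """Rearranged estimator: sort the fitted quantile curve
    (valid for an equally spaced grid spanning the whole interval)."""
    return np.sort(q_hat)

def isotonize(q_hat):
    """Isotonic (L2) projection onto nondecreasing sequences via the
    pool adjacent violators algorithm."""
    y = np.asarray(q_hat, dtype=float)
    blocks = [[i] for i in range(len(y))]      # indices pooled into each block
    vals = list(y)                             # current block values
    wts = [1.0] * len(y)                       # current block weights
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:
            # pool the two adjacent violating blocks
            new_w = wts[i] + wts[i + 1]
            new_v = (wts[i] * vals[i] + wts[i + 1] * vals[i + 1]) / new_w
            blocks[i] = blocks[i] + blocks[i + 1]
            vals[i], wts[i] = new_v, new_w
            del blocks[i + 1]
            del vals[i + 1]
            del wts[i + 1]
            i = max(i - 1, 0)
        else:
            i += 1
    out = np.empty(len(y))
    for b, v in zip(blocks, vals):
        out[b] = v
    return out

def convex_combination(q_hat, weight=0.5):
    """Convex combination of the rearranged and isotonized estimators."""
    return weight * rearrange(q_hat) + (1 - weight) * isotonize(q_hat)

# Example: a non-monotone fitted curve on an equally spaced grid of quantile indices.
q_hat = np.array([1.0, 1.2, 1.1, 1.5, 1.4, 1.6, 1.9, 1.8, 2.0])
print(rearrange(q_hat))
print(isotonize(q_hat))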

3 Rates of convergence

In this section we derive rates of convergence for the estimators defined in the previous section and discuss their optimality. We make the following assumptions. Let be some sufficiently large constant. First of all, we assume the following:

(A1) is i.i.d. with .

(A2) and for all .

The i.i.d. assumption is conventional. It is beyond the scope of the paper to extend the theory to dependent data. Note that Hörmann and Kokoszka (2010) discussed weakly dependent functional data. Assumption (A2) is a mild moment restriction. Here no moment condition on is needed, so may not exist.

(A3) For some , and for all .

(A4) For some , for all .

(A5) Let denote the conditional distribution function of given . Then, for every realization of , the map is twice continuously differentiable with and . Furthermore, .

(A6) .

Assumptions (A3) and (A4) are adapted from (3.2) and (3.3) of Hall and Horowitz (2007). In particular, assumption (A3) means that all are positive, which guarantees identification of the slope function. In assumption (A3), measures the smoothness of the covariance kernel , which also measures the difficulty of estimating the slope function . The second part of assumption (A3) requires that the spacings among not be too small, which ensures identifiability of the eigenfunctions and thereby sufficient estimation accuracy for . Note that the lower inequality for follows essentially from the condition on the differences, as () and hence . Assumption (A4) determines the “smoothness” of the function . The condition that requires the function to be sufficiently smooth relative to , uniformly in . See Hall and Horowitz [(2007), page 74] for some related discussion of these assumptions. Assumptions (A5) and (A6) are specific to the quantile regression case. Both assumptions are standard in the quantile regression literature when is a vector. The role of assumption (A6) is to guarantee that the conditional distribution function is not “flat” near the quantiles of interest, which is essential to our asymptotic study.

(A7) For each , is observed only at discrete points . Define . Then, as .

(A8) There exists a constant such that for all .

Assumptions (A7) and (A8) are a set of sampling assumptions on . A rate restriction will be imposed on . Assumption (A7) at least requires that as , which means that each set of discrete points has to become dense in as the sample size grows. Assumption (A8) is an additional assumption on the smoothness of the covariance kernel as well as the mean function . Suppose in this discussion that for all . Then, for example, if is Lipschitz continuous, and if is twice continuously differentiable. The value of controls the discretization error. Note that the possible values of depend on the value of . Typically, if is sufficiently large, (A8) is satisfied with (and vice versa). The relation between the smoothness of the covariance (or, more generally, integral) kernel and the decay rate of the associated eigenvalues is studied in Ritter, Wasilkowski and Woźniakowski (1995) and Ferreira and Menegatto (2009) and the references therein. For example, the former paper shows that if satisfies the “Sacks–Ylvisaker condition” of order with a nonnegative integer (see the original paper for the precise description of the conditions), in which case (A8) is satisfied with , then as . Assumption (A8) is similar in spirit to (A2) of Crambes, Kneip and Sarda (2009), in which some smoothness of the random function is directly assumed to deal with the discretization error [roughly speaking, their corresponds to our , as (A2) of Crambes, Kneip and Sarda (2009) essentially assumes to be -Hölder continuous, in which case ].

Let denote the set of all distributions of compatible with assumptions (A2)–(A6) and (A8) for given (admissible) values of and (such that ). The following theorem, which will be proved in Section 5 below, establishes rates of convergence for the slope estimator .

Theorem 3.1

Suppose that assumptions (A1)–(A8) are satisfied. Take . Then, we have

(6)

provided that as .

Inspection of the proof of Theorem 3.1 shows that, when is continuously observable, under assumptions (A1)–(A6), the rate of convergence of the estimator based on the direct empirical covariance kernel is . The side condition that is assumed to make the discretization error negligible. This condition does not seem particularly restrictive. For example, if and , it is satisfied as long as , which seems mild in view of typical applications in functional data analysis.

The following theorem, which will be proved in Appendix B in the supplementary file [Kato (2013)], establishes rates of convergence for . For notational convenience, define

where denotes the distribution of (defined on ).

Theorem 3.2

Suppose that assumptions (A1)–(A8) are satisfied. Take . Then, we have

(7)

provided that as .

For monotonized estimators, the following corollary directly follows from inequality (5) in Section 2.3 and Theorem 3.2.

Corollary 3.1

Let be a bounded closed interval in . Suppose that all the assumptions of Theorem 3.2 are satisfied. Let be any monotonized estimator for given in Section 2.3. Then, we have

Here note that the rate attained in estimating is faster than the rate attained in estimating .

In what follows, we discuss optimality of these rates.

Proposition 3.1

Suppose that assumptions (A1)–(A6) and (A8) are satisfied. Let be such that

(8)

Then, there exists a constant such that for ,

(9)

where the infimum is taken over all estimators for the slope function based on . Similarly, there exists a constant such that for ,

where the infimum is taken over all estimators for the conditional quantile function based on . When is a bounded closed interval in , there exists a constant such that for ,

where the previous convention applies.

The side condition (8) is a compatibility condition between assumptions (A3) and (A8). It is not addressed here whether this condition is tight. However, some restriction between and is required in establishing lower bounds on rates of convergence, to guarantee that the class is at least nonempty. Proposition 3.1 shows that under this side condition the rates established in Theorems 3.1 and 3.2 and Corollary 3.1 are indeed optimal in the minimax sense. A proof of Proposition 3.1 is given in Appendix C in the supplementary file [Kato (2013)].

4 Empirical choice of the cutoff level

In this section we suggest three criteria to choose , and investigate their performance by simulations. (Application of a (weighted) Lasso-type procedure could be an alternative for this selection problem. Also, a thresholding rule for the estimated eigenvalues is a possible option. Analysis of such procedures is left for future work.) We use heuristic reasoning to derive the selection criteria. Suppose that is a singleton: . Suppose that there is no truncation bias, that is, and . Then, the infeasible estimator defined by (3) can be regarded as a (conditional) maximum likelihood estimator when the conditional distribution of given has the asymmetric Laplace density of the form

where is a scale parameter. This suggests the following analogues of AIC and BIC in the present context:

See also Koenker [(2005), Section 4.9.1] for some related discussion. According to Yuan (2006), we may also consider an analogue of GACV as follows:

When is a bounded closed interval, define the integrated-AIC, BIC and GACV as follows:

When is a set of finite grid points, each integral is replaced by the summation over the grid points.
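A hedged sketch of how such criteria can be computed for a single quantile index is given below. The exact constants follow the usual quantile-regression analogues of AIC, BIC and GACV [cf. Koenker (2005), Section 4.9.1; Yuan (2006)] and reflect our reading of the criteria described above, not necessarily the paper’s exact definitions; for a grid of quantile indices, the integrated versions average the criteria over the grid before minimizing. All names are our own choices.

import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1{u <= 0}) u."""
    return (tau - (u <= 0)) * u

def select_cutoff(y, xi_hat, tau, m_max, fit_quantile_reg):
    """Return, for each criterion, the cutoff in {1, ..., m_max} minimizing it.

    xi_hat           : (n, m_max) estimated principal scores
    fit_quantile_reg : callable(y, Z, tau) -> fitted values, e.g., a wrapper
                       around a standard quantile regression solver
    """
    n = len(y)
    scores = {"AIC": [], "BIC": [], "GACV": []}
    for m in range(1, m_max + 1):
        Z = np.column_stack([np.ones(n), xi_hat[:, :m]])
        fitted = fit_quantile_reg(y, Z, tau)
        resid = y - fitted
        sigma_hat = np.mean(check_loss(resid, tau))   # average check loss
        p = m + 1                                     # number of fitted parameters
        scores["AIC"].append(np.log(sigma_hat) + p / n)
        scores["BIC"].append(np.log(sigma_hat) + p * np.log(n) / (2 * n))
        scores["GACV"].append(np.sum(check_loss(resid, tau)) / (n - p))
    return {name: int(np.argmin(vals)) + 1 for name, vals in scores.items()}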

We carried out a small Monte Carlo study to investigate the finite-sample performance of these criteria. In all cases, the number of Monte Carlo repetitions was . The numerical results in this section were obtained using the matrix language Ox [Doornik (2002)]. The Ox code for solving quantile regression problems supplied on Professor Koenker’s website was used. See also Portnoy and Koenker (1997) for some computational aspects of quantile regression problems.

The simulation design is described as follows:

(10)
(11)

This corresponds to in assumption (A3). Each is observed at equally spaced grid points on . In this design, we have

where is the quantile function of the distribution of . Thus, and [ is independent of ]. We considered two cases for : (a) and (b) . In each case, the performance was measured by

where denotes the distribution of and QAMISE abbreviates “quantile-averaged mean integrated squared error.”
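A natural reading of this performance measure (our own formalization of the name, stated here for the slope function) is

\[
\mathrm{QAMISE}(\hat b)
= \frac{1}{|\mathcal{T}|}\int_{\mathcal{T}}
\mathbb{E}\Bigl[\int_0^1 \bigl\{\hat b(t,\tau) - b(t,\tau)\bigr\}^2\,dt\Bigr]\,d\tau ,
\]

with the inner integral replaced by an average of \(\{\hat Q(\tau\mid X) - Q(\tau\mid X)\}^2\) over the distribution of X when the conditional quantile function is the target.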

The simulation results for case (a) are summarized in Figures 1–4 and Table 1. Figures 1 and 2 show the performance of the selection criteria for the normal error case, while Figures 3 and 4 show that for the Cauchy error case. In each figure, “Fixed” refers to the performance of the estimator with fixed . Table 1 shows the average values of the cutoff selected by the three criteria. The table shows that as the sample size increases, all the criteria tend to choose larger cutoffs, and as increases, they tend to choose smaller ones, which is consistent with the theoretical requirements on the cutoff selection. Interestingly, it is observed that the error distribution affects the performance of the cutoff selection.

Figure 1: Performance of selection criteria. Case (a). Estimation of .
Figure 2: Performance of selection criteria. Case (a). Estimation of .
Figure 3: Performance of selection criteria. Case (a). Estimation of .
Figure 4: Performance of selection criteria. Case (a). Estimation of .

In the normal error case (Figures 1 and 2), BIC worked better than the other two criteria. On the other hand, in the Cauchy error case (Figures 3 and 4), AIC and GACV worked better than BIC. Looking closely at these figures, one finds that AIC and GACV performed quite badly in some cases (see the bottom half of Figure 1). Although none of these criteria clearly dominated the others, BIC worked relatively stably. These figures also show that as increases from to , the performance of becomes worse, while that of becomes better. This is consistent with the theoretical results in the previous section. Essentially similar comments apply to case (b); see Figures 1–4 and Table 1 in Appendix F in the supplementary file [Kato (2013)].

error dist. AIC BIC GACV
100 1.1 normal 7.40 (4.12) 3.09 (1.24) 6.58 (3.78)
200 1.1 normal 7.34 (3.87) 3.39 (1.14) 6.98 (3.67)
500 1.1 normal 8.13 (3.59) 3.94 (1.05) 7.97 (3.52)
100 2 normal 6.50 (4.23) 2.53 (0.93) 5.67 (3.84)
200 2 normal 6.33 (3.95) 2.67 (0.85) 5.96 (3.76)
500 2 normal 6.92 (3.85) 3.12 (0.77) 6.81 (3.80)
100 1.1 Cauchy 2.30 (1.18) 1.62 (0.61) 2.24 (1.08)
200 1.1 Cauchy 2.51 (0.88) 1.91 (0.64) 2.48 (0.84)
500 1.1 Cauchy 3.00 (0.87) 2.20 (0.51) 2.99 (0.86)
100 2 Cauchy 1.90 (0.94) 1.35 (0.51) 1.86 (0.81)
200 2 Cauchy 2.14 (0.76) 1.60 (0.53) 2.10 (0.60)
500 2 Cauchy 2.41 (0.67) 1.91 (0.38) 2.40 (0.67)
Table 1: Average values of the cutoff selected by the three criteria in case (a). Standard deviations are given in parentheses

5 Proof of Theorem 3.1

We divide the proof into three subsections. Some technical results are proved in Appendices D and E in the supplementary file [Kato (2013)]. To avoid notational complications, uniformity in will be suppressed. Let denote a generic constant whose value may change from line to line. In most cases, the qualification “almost surely” will be suppressed. In some parts of the proofs, we use empirical process techniques. We follow the basic notation used in van der Vaart and Wellner (1996).

5.1 Reduction of the problem

Let