A Continuous Threshold Expectile Model
Expectile regression is a useful tool for exploring the relation between the response and the explanatory variables beyond the conditional mean. This article develops a continuous threshold expectile regression for modeling data in which the effect of a covariate on the response variable is linear but varies below and above an unknown threshold in a continuous way. Based on a grid search approach, we obtain estimators for the threshold and the regression coefficients via an asymmetric least squares regression method. We derive the asymptotic properties for all the estimators and show that the estimator for the threshold achieves root-n consistency. We also develop a weighted CUSUM type test statistic for the existence of a threshold in a given expectile, and derive its asymptotic properties under both the null and the local alternative models. This test only requires fitting the model under the null hypothesis in the absence of a threshold, thus it is computationally more efficient than the likelihood-ratio type tests. Simulation studies show desirable finite sample performance in both homoscedastic and heteroscedastic cases. The application of our methods on a Dutch growth data and a baseball pitcher salary data reveals interesting insights.
keywords:Expectile regression, Threshold, Weighted CUSUM test, Grid search method
Expectile regression, first introduced by Aigner et al. (1976) and Newey and Powell (1987), has become popular in the last decades. Analogous to quantile regression (Koenker and Bassett, 1978), expectile regression draws a complete picture of the conditional distribution of the response variable given the covariates, making it a useful tool for modeling data with heterogeneous conditional distributions. As modeling tools, quantile regression and expectile regression both have advantages over the other in certain aspects: quantile regression is more robust to outliers than expectile regression, whereas expectile regression is more sensitive to the extreme values in the response variable than quantile regression. However, expectile regression has certain computational advantages over quantile regression (Newey and Powell, 1987). First, unlike quantile regression, the loss function of expectile regression is everywhere differentiable, thus its estimation is more straightforward and much quicker. Second, the computation of the asymptotic covariance matrix of the expectile regression estimator does not involve estimating the density function of the errors. Besides the early development on linear expectile regression (Newey and Powell, 1987; Efron, 1991), many nonparametric or semiparametric expectile regression have been developed in recent years, for example, Yao and Tong (1996), De Rossi and Harvey (2009), Kuan et al. (2009), Schnabel and Eilers (2009), Kneib (2013), Sobotka et al. (2013), Xie et al. (2014), Waltrup et al. (2015), Kim and Lee (2016), and among others. These models greatly improve the flexibility of expectile regression for modeling nonlinear relationships.
However, some natural phenomena call for nonlinear regression forms that exhibit structure changes, sometimes in the form of two line segments with different slopes. For example, a child’s height increases rapidly with age before and during puberty and then stops increasing in late teens. This implies that the growth curve of height may be described as two line segments with different slopes intersecting at a threshold. Another example arises from a study of the salaries of major league baseball players in 1987 (Hoaglin and Velleman, 1995). The data shows a positive correlation between salaries and years of experience for less experienced pitchers but a negative correlation for more experienced pitchers. In these instances, besides the regression coefficients, the onset of the transition point is often of great research interest, for example, when a child reaches his/her full adult height or whether there is a prime time for pitchers’ salaries. Although the existing spline-based (e.g., Schnabel and Eilers, 2009; Kim and Lee, 2016) or varying-coefficient expectile models (e.g., Xie et al., 2014) can capture the nonlinear relationship between the response variable and the predictors, they cannot provide information on the location of the threshold. This issue motivates us to consider a continuous threshold model for expectile regression. Continuous threshold regression, also called segmented regression or bent line regression, has been studied in the context of least squares regression (Quandt, 1958, 1960; Hinkley, 1969; Feder, 1975; Chappell, 1989; Chan and Tsay, 1998; Chiu et al., 2006; Hansen, 2015), quantile regression (Li et al., 2011), and rank-based regression (Zhang and Li, 2016). However, no literature has investigated the continuous threshold expectile regression.
In this article, we develop a continuous threshold expectile regression model. The contribution of this article is twofold. First, we propose a grid search method to estimate the unknown threshold and other regression coefficients. We derive the asymptotic properties for all the parameters including the threshold, and show that the estimator for the threshold achieves -consistency. Second, we develop a testing procedure for the existence of structural change at a given expectile, based on a weighted CUSUM type statistic. This test only requires fitting the model under the null hypothesis in the absence of a threshold, thus it is computationally efficient. The limiting distribution of the test statistic is also established. The estimation and testing procedures are implemented in R code, which is available from the first author by request.
The remainder of the article is organized as follows. In Section 2, we describe the continuous threshold expectile regression model, and develop a grid search method for estimating the unknown threshold and regression coefficients. A testing procedure for the structural change in a given expectile level is also proposed. In Section 3, we conduct simulation studies and two real data analyses. Section 4 provides the conclusion with possible future extensions. Technical proofs are presented in the Appendix.
Let , , be a sequence of independent and identically distributed sample from the population . We assume that is the response variable, is a vector of covariates, and is a scalar variable, whose relationship with changes at an unknown location. The population -expectile of , , minimizes the loss function , where
is the asymmetric squared error loss function, and is the parameter that controls the degree of loss asymmetry. Clearly, when , the -expectile corresponds to the mean of .
In this paper, we model the conditional -th expectile of using the continuous threshold model
where are the unknown parameters of interest, is the vector of parameters excluding the unknown location of the threshold or change point , is a vector of parameters. Here, , where is the indicator function. Clearly, the linear expectile regression is continuous on at , but has different slopes on either side of the threshold . In other words, is the slope of the left line segment for and is the slope of the right line segment for .
2.2 Estimation procedure
To estimate at a given expectile , we minimize the objective function
However, due to the existence of the threshold , the objective function (2) is convex in but non-convex in , making it difficult to obtain its minimizer. One estimation approach is to use the grid search strategy, which is commonly used for bent line mean regression (Quandt, 1958; Chappell, 1989). To proceed, we re-write the objective function (2) with respect to and as
where . The minimization can be carried out in two steps:
(1) for each , where is the range set of all ’s, obtain a profile estimate of by
(2) obtain the threshold as
The estimate for then is .
2.3 Asymptotic properties
Because the objective function is not differentiable with respect to , it is impossible to obtain the asymptotic properties of using the standard theory. Here, we derive the asymptotic properties using the modern empirical processes theory. We first introduce some notations. Denote the true parameters as . Let , where . Using the notation of empirical process, one can write
where is the empirical measure, and with the weights
Here, is the observed data .
In Lemma A.1 in the Appendix, we show that converges to zero in probability, as goes to infinity. Furthermore, we establish the consistency of .
Under the regularity conditions in the Appendix, as , we have that
We prove the asymptotic normality by using Theorem 5.23 in Van der Vaart (2000), which establishes the asymptotic normality of M-estimators when the criterion function is Lipschitz continuous and its limiting function admits a second order Taylor expansion. To proceed, define the matrix , where is
Define the Hessian matrix of
Under the regularity conditions in the Appendix, is asymptotically normally distributed with mean zero and covariance matrix , as .
It is worthwhile to emphasize that the regression coefficients and threshold estimators are jointly asymptotically normal with convergence rate, and have non-zero asymptotic covariance in our model setting. This is different from the model with a discontinuous threshold. In the latter situation, the estimators of the regression coefficients are still -consistent, but the threshold estimator is -consistent with a non-standard asymptotic distribution. The -convergence rate of in our model is due to the continuity of at .
The asymptotic variance-covariance matrix can be estimated by , where , and
Here, , and is the kernel estimator for the density of , and is a kernel function with a bandwidth . In practice, we use the Epanechnikov kernel and obtain the optimal bandwidth by the Silverman’s rule of thumb (Silverman, 1986), , where is the standard deviation of .
2.4 Testing for structural change at a given expectile
An important question before fitting model (1) is whether there exists a threshold at a pre-specified expectile. If a threshold does not exist, then is unidentifiable and the estimation procedure in the last section is ill-conditioned. To test the existence of a threshold, we test null () and alternative () hypotheses
where is the range set of all ’s.
Tests for structural changes have been developed in conditional mean regression (Andrews, 1993; Bai, 1996; Hansen, 1996, 2015), quantile regression (Qu, 2008; Li et al., 2011), transformation models (Kosorok and Song, 2007), time series models (Chan, 1993; Cho and White, 2007), and among others. To construct our test statistic, we take an approach in spirit similar to the test for structural changes in quantile regression in Qu (2008). This test is constructed by sequentially evaluating the subgradients of the objective function under for a subsample, in a fashion similar to the CUSUM statistic. An advantage of this test is that it only requires fitting the model under the null hypothesis. Thus, it is computationally more efficient than the likelihood-ratio type tests, such as the sup-likelihood-ratio-type test for testing threshold effects in regression models in Lee et al. (2011), which requires fitting the models under both null and alternative hypotheses.
To proceed, we define the following statistic,
where , and is the estimator of coefficients under the null hypothesis , that is,
An intuitive interpretation for is given as follows. If there is not a threshold, is a good estimate of its population value, and hence, the estimated residual would show a random pattern against , leading to a small . On the other hand, if there exists a threshold, the estimate would differ significantly from the true value, and the estimated residuals would depart from zero in a systematic fashion related to , resulting in a large absolute value of . Because the location of the threshold is unknown, we need search through all the possible locations. Therefore, we propose the test statistic
This statistic can be viewed as a weighted CUSUM statistic based on the estimated residuals under the null hypothesis. Intuitively, it is plausible to reject when is too large. This intuition will be formally verified by Theorem 2.5. It implies that converges to a Gaussian process with mean zero, and the size of such a process can be used to test for a threshold effect.
In order to derive the large-sample inference for , we consider the local alternative model,
where is the location of threshold, , and the -expectile of is zero. We first introduce some notations
Under the regularity conditions in the Appendix, for the local alternative model (4), has the asymptotic representation
Furthermore, converges weakly to the process , where is the Gaussian process with mean zero and covariance function
Under the regularity conditions in the Appendix, for the local alternative model, , for any increasing sequence goes to infinite, we have that for any .
Because the limiting null distribution of is nonstandard, we resort to the Gaussian multiplier method (Van der Vaart, 2000) to calculate the critical values, based on the asymptotic representation (5). The procedure is described in Algorithm 1. In the Appendix, we prove the following result, which implies the validity of the bootstrap resampling scheme.
Under both the null and the local alternative hypotheses, (defined in Algorithm 1) converges to the Gaussian process as .
We summarize the computing procedure as follows.
3 Simulation Studies and Applications
3.1 Simulation studies
In this section, we conduct simulation studies for assessing the finite sample performance of the proposed method. We consider the following two scenarios:
Independent and identically distributed (IID):
where is generated from a uniform distribution , is generated from a normal distribution , and the parameters are . For each scenario, we consider three error cases: (1) , (2) , and (3) a mixture distribution , where is the -distribution with four degrees of freedom. For each case, we conduct 1000 repetitions with sample sizes and .
As shown in Tables 1—2, for both the IID and the heteroscedastic scenarios, all the biases are small, indicating the proposed estimator is asymptotically consistent. Moreover, the average estimated standard errors are close to the empirical standard errors. The coverage probabilities of the regression parameters are close to the nominal level . Though some coverage probabilities of the threshold are below when , they improve as the sample size increases to . The performance is similar in all the three error distributions. In summary, the proposed estimate has a good finite sample performance.
We also conduct simulation studies to evaluate the type I error and the power of the testing procedure. The simulation models are similar to the above, with threshold effects at . The number of bootstrap times is set as and the nominal significance level is . The results are shown in Table 3. For all scenarios, the tests have type I errors close to the nominal level and have reasonable power, which indicates that the proposed test is valid for testing the existence of a threshold.
Bias: the empirical bias; SD: the empirical standard error; ESE: the average estimated standard error; CP: coverage probability.
Bias: the empirical bias; SD: the empirical standard error; ESE: the average estimated standard error; CP: coverage probability.
3.2.1 Fourth Dutch growth data
We first apply our method to the Fourth Dutch Growth data, which was collected by the Fourth Dutch Growth study (van Buuren, 2007) and is available in the R package expectreg. This dataset has the height, weight and head circumference of Dutch children between ages 0 and 21 years (van Buuren and Fredriks, 2001). A primary interest of this study concerns the relation between age and height. The scatter plot (Figure (a)a) shows the relationship between age and height for a subset of boys. Clearly, there is a nonlinear trend between height and age (Figure (a)a), with a steep curvature before age three due to rapid growth in early childhood, and a bent in the late teens due to reaching the full adult height. This dataset has been analyzed by Schnabel and Eilers (2009). In their analysis, they took a square root transformation on age. While this transformation effectively removes the curvature at early childhood, the nonlinearity in the late teens still exists (Figure (b)b). Then they fitted the transformed data using smoothed expectile regression, by combining the least asymmetrically weighted squares with the P-splines. Though the smoothed expectile curves fit the data well, they do not provide any information on the location of the threshold, i.e., the age to stop growing.
Here, we fit the continuous threshold expectile model to the square root transformed data and estimate the location of threshold. Specifically,
where is the height of the th boy, is the square root of his age, and are the unknown parameters of interest, is the unknown location of threshold. We fit the model with 0.05, 0.15, 0.25, 0.30, 0.40, 0.50, 0.80, 0.90, 0.95, 0.98.
For all the expectile levels we fit, the p-values from our threshold effect test are nearly 0, indicating a highly significant continuous threshold pattern. The regression results for different expectile levels are reported in Table 4. The estimated coefficients show that the height first increases rapidly with age (roughly 31–35 cm per square root of age), and then the growth is very limited or nearly stops after about age 17–18. The estimated thresholds illustrate a general trend that shorter boys seem to stop growth later than taller boys, with a confidence interval (CI), [18.39, 19.24] at the expectile level , and [16.76, 17.40] at . Figure (b)b confirms these results.
3.2.2 Baseball pitcher salary
Our second example concerns the salaries of major league baseball (MLB) players for the 1987 baseball season (Hoaglin and Velleman, 1995). The dataset has been analyzed by several groups in the ASA graphical session in 1989. Here we consider a subset with n=176 pitchers, which was analyzed in (Hettmansperger and McKean, 2011) using a rank-based regression. This dataset is available in the R package rfit. It consists of the 1987 beginning salary and the number of years of experience for these pitchers.
Visually, the scatter plot (Figure (a)a) suggests that the salaries are first positively correlated with the years of experience, but then decline after about 9 years. This is somewhat unusual, because it is generally expected that salaries grow with the years of experience in players’ early career and the status of free agent (i.e., the player whose initial 6 year contract expires). Although salaries do decrease after players pass their prime time, it would happen much later, for example, Haupert and Murray (2012) estimated the decline for MLB players occurs after 22 years. In the analysis by the ASA graphical session in 1989, the model with the best predictive performance is a segmented mean regression model with a fixed threshold at 7 years, where the threshold was chosen according to the length of the initial professional baseball contracts (6 years). It is of interest to formally test if the visually observed transition is significant and estimate the onset of the decline from the data. Furthermore, the salaries show considerable heterogeneity at a given number of years of experience. Hence, a regression model based on the conditional distribution of the response variable provides a more complete picture than a mean regression model. Previous analyses only focused on the mean regression model (Hettmansperger and McKean, 2011; Hoaglin and Velleman, 1995), but not regression models for conditional distribution.
Here we fit the data using the continuous threshold expectile regression,
where is the log (salary) of the th pitcher, is log (years of experience), and are the unknown parameters of interest, is the unknown location of threshold, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99.
Our threshold test shows that the continuous threshold patterns are highly significant, with p-values less than 0.05 for all the expectile levels considered. Table 5 reports the estimated coefficients and their standard errors. The coefficients show that the salaries indeed decline for pitchers with 9 or more years of experience (range: (8.61, 10.35)), at all the expertile levels we fitted. Figure 2 confirms this conclusion.
This raises two natural questions: why did the salaries decrease for more experienced pitchers? and why did the decrease occur at 9 years for all salary levels? The history of the MLB shows that, in the time period of 1985 to 1987, the MLB team owners colluded in an effort to decrease salaries for free agents after their initial contracts expired. Pitchers with 9 or more years of experience are all free agents. Their salary decrease is a reflection of owners trying to control salaries. The reason that the observed threshold (9 years) is later than the start of free agents (7 years), is that some pitchers have become free agents before the collusion, thus they had more than 7 years of experience when the collusion occurred.
As a comparison, we also fit the data with the bent-line quantile regression (Li et al., 2011). Though the overall trend is similar to the continuous threshold expectile regression, it has more crossing between quantiles. This agrees with the observation of Schnabel and Eilers (2009) and Waltrup et al. (2015) that expectile regression tends to have less crossing than quantile regression.
4 Concluding Remarks
In this article, we have developed the continuous threshold expectile regression model. This model allows the expectiles of the response to be piecewise linear but still continuous in covariates. We developed a grid search method to estimate the unknown threshold and the regression coefficients. A weighted CUSUM type test statistics was proposed to test the structural change at a given expectile. Our numerical studies showed that the proposed estimator has good finite sample performance.
Our work may be extended in several ways. First, although generally there are fewer crossings in expectile regression than in quantile regression (Schnabel and Eilers, 2009), the expectile crossings may happen. It will be worthwhile to extend our model to non-crossing continuous threshold expectile estimation and to develop tests for structure change across expectiles. Another interesting extension is to consider more than one threshold for a covariate. In such a situation, the estimation and test of the thresholds would be more complicated, and further investigation is needed.
The authors thank Dr. Andrew Wiesner for the interpretation of the baseball data. This research is partially supported by NIH R01GM109453. Zhang’s work is partially supported by National Natural Science Foundation of China (NSFC) (11401194), and the Fundamental Research Funds for the Central Universities (531107050739).
is unique, where is a compact set in .
is in , and is a compact subset of .
The scalar variable has an absolutely continuous distribution with density function , which is strictly positive, bounded and continuous for any in a neighborhood of .
, , and .
Given , the Hessian matrix is nonsingular.
Condition (A1) is the identifiability condition of the estimation. Conditions (A1)—(A3) are for the consistency of the estimates, and Conditions (A4)—(A5) are used for the asymptotic normality.
We first provide the following uniformly convergence results.
Under the regular conditions, as , we have
Proof of Lemma A.1. To show that the class of functions is Glivenko-Cantelli, it is sufficient to show is Lipschitz continuous. Recalling that , and the derivatives
By the Condition (A2), both and are finite. Note that for any . Hence, applying the mean-value theorem, for every , where
Therefore, is Lipschitz continuous, and applying the Glivenko-Cantelli theorem and Example 19.8 in Van der Vaart (2000), we can establish that is Glivenko-Cantelli. ∎
Proof of Theorem 2.1. By Lemma A.1, as goes to infinity. Since is compact and the uniqueness of the minimum (by Conditions A1 and A2), along with that is continuous with respective to , then we can establish that , by Theorem 2.1 of Newey and McFadden (1994). ∎
Proof of Theorem 2.2. Firstly, by Condition (A3), the function is measurable, and the function is differentable at for P-almost every . Recall that is Lipschitz continuous with respect to , as proved in Lemma A.1.
Secondly, the map admits a second order Taylor expansion at , with a nonsingular symmetric Hessian matrix . We can verify that is continuous in . Indeed, the elements of are quadratic functions of , and hence is continuous in . It is sufficient to show that is continuous in . Note that the first term of is a function of through moments of the form . By Condition (A4), for some constant . By Condition (A3), for some constant , . Then, by Cauchy-Schwartz inequality,
is uniformly continuous in . Hence, the first term of is continuous in . On the other hand, since is continuous in , then the second term of is continuous in . Thus, is continuous in .
Under the regularity conditions, as , we have
Proof of Lemma A.2. It is easily obtained by using the law of large number for (i). To establish (ii) and (iii), note that and are sums of indicator functions and Lipschitz functions, then they are Glivenko-Cantelli class, which implies that both (ii) and (iii) holds.
Proof of Theorem 2.5. Note that , which is equivalent to the solution of the estimating equation
Recall that the local alternative model (4) is
Then, under model (4), the estimating equation can be written as
By the mean-value theorem, we have