Smoothed quantile regression processes for binary response models.

Stanislav Volgushev
Ruhr-Universität Bochum and University of Illinois at Urbana-Champaign
The idea of considering binary response quantile processes originated from discussions with Prof. Roger Koenker. I am thankful to him for the encouragement and many insightful discussions on this topic. The mistakes are of course my sole responsibility. This research was conducted while I was a visiting scholar at UIUC. I am very grateful to the Statistics and Economics departments for their hospitality. Financial support from the DFG (grant VO1799/1-1) is gratefully acknowledged.

In this paper, we consider binary response models with linear quantile restrictions. Considerably generalizing previous research on this topic, our analysis focuses on an infinite collection of quantile estimators. We derive a uniform linearization for the properly standardized empirical quantile process and discover some surprising differences with the setting of continuously observed responses. Moreover, we show that considering quantile processes provides an effective way of estimating binary choice probabilities without restrictive assumptions on the form of the link function, heteroskedasticity or the need for high dimensional non-parametric smoothing necessary for approaches available so far. A uniform linear representation and results on asymptotic normality are provided, and the connection to rearrangements is discussed.

1 Introduction

In various situations in daily life, individuals are faced with making a decision that can be described by a binary variable. Examples relevant to various fields of economics include the decision to participate in the labour market, to retire, or to make a major purchase. From an econometric point of view, such decisions can be modelled by a binary response variable that depends on an unobserved continuous random variable which summarizes an individual’s preferences. In the presence of covariates, say , a natural question is: what can we infer about the distribution of the unobserved variable conditional on from observations of i.i.d. replicates of . In a seminal paper, Manski (1975) assumed that where the ’error’ satisfies the conditional median restriction and derived conditions on the distribution of that imply identifiability of the coefficient vector up to scale. In later work, Manski (1985) extended those results to general quantile restrictions of the form for fixed . A more detailed discussion of identification issues was provided in Manski (1988). Due to their importance in understanding binary decisions, binary choice models have ever since aroused a lot of interest and many estimation procedures have been proposed [see Cosslett (1983), Horowitz (1992), Powell et al. (1989), Ichimura (1993), Klein and Spady (1993), Coppejans (2001), Kordas (2006) and Khan (2013) to name just a few].

A particularly challenging part of analysing binary response models lies in understanding the stochastic properties of corresponding estimation procedures. The asymptotic distribution of Manski’s estimator was derived in Kim and Pollard (1990) under fairly general conditions, while a non-standard case was considered in Portnoy (1998). In particular, Kim and Pollard (1990) demonstrated that the convergence rate is $n^{-1/3}$ and that the limiting distribution is non-Gaussian. A different approach based on non-parametric smoothing that avoids some of the difficulties encountered by Manski’s estimator was taken by Horowitz (1992). By smoothing the objective function, Horowitz (1992) obtained both better rates of convergence and a normal limiting distribution. However, note that the smoothness conditions on the underlying model are stronger than those of Kim and Pollard (1990).

The approaches of Manski and Horowitz have in common that only estimators for the coefficient vector are provided. While those coefficients are of interest and can provide valuable structural information, their interpretation can be quite difficult since the scale of is not identifiable from the observations. On the other hand, the ’binary choice probabilities’ provide a much simpler and more straightforward interpretation.

Most of the available methods for estimating binary choice probabilities are of two basic types. The first and more thoroughly studied approach is to assume a model of the form where the are assumed to be either independent of [see Cosslett (1983) and Coppejans (2001)], or admit a very special kind of heteroskedasticity [Klein and Spady (1993)]. Another popular approach has been to embed the problem into general estimation of single index models, see for example Powell et al. (1989) or Ichimura (1993). Here, it is again necessary to assume independence between and the covariate .

While in the settings described above it is possible to obtain parametric rates of convergence for the coefficient vector and also construct estimators for choice probabilities, in many cases the assumptions on the underlying model structure seem too restrictive.

An alternative approach allowing for general forms of heteroskedasticity was recently investigated by Khan (2013), who proved that under general smoothness conditions any binary response model with is observationally equivalent to a Probit/Logit model with multiplicative heteroskedasticity, that is a model where with independent of and general scale function . Khan (2013) also proposed to simultaneously estimate and the function by a semi-parametric sieve approach. The resulting model allows one to obtain an estimator of the binary choice probabilities. While this idea is extremely interesting, it effectively requires estimation of a -dimensional function in a non-parametric fashion. For the purpose of estimating , the function can be viewed as nuisance parameter and its estimation does not have an impact on the rate at which is estimable. However, the binary choice probabilities explicitly depend on and can thus only be estimated at the corresponding -dimensional non-parametric rate. In settings where is moderately large this can be quite problematic.

In the classical setting where responses are observed completely, linear quantile regression models [see Koenker and Bassett (1978)] have proved useful in providing a model that can incorporate general forms of heteroskedasticity and at the same time avoid non-parametric smoothing. In particular, by looking at a collection of quantile coefficients indexed by the quantile level it is possible to obtain a broad picture of the conditional distribution of the response given the covariates. The aim of the present paper is to carry this approach into the setting of binary response models. In contrast to existing methods, we can on one hand allow for rather general forms of heteroskedasticity and at the same time estimate binary choice probabilities without the need of non-parametrically estimating a -dimensional function.

The ideas explored here are closely related to the work of Kordas (2006). Yet, there are many important differences. First, in his theoretical investigations, Kordas (2006) considered only a finite collection of quantile levels. The present paper aims at considering the quantile process. Contrary to the classical setting, and also contrary to the results suggested by the analysis in Kordas (2006), we see that the asymptotic distribution is a white noise type process with limiting distributions corresponding to different quantile levels being independent. An intuitive explanation of this seemingly surprising fact along with rigorous theoretical results can be found in Section 2. We thus provide both a correction and considerable extension of the findings in Kordas (2006).
Further, our results on the quantile process pave the way to obtaining an estimator for the conditional probabilities and derive its asymptotic representation. While a related idea was considered in Kordas (2006), no theoretical justification of its validity was provided. Moreover, we are able to considerably relax the identifiability assumptions that were implicitly made in there. Finally, we demonstrate that our ideas are closely related to the concept of rearrangement [see Dette et al. (2006) or Chernozhukov et al. (2010)] and provide new theoretical insights regarding certain properties of the rearrangement map that seem to be of independent interest.
The rest of the paper is organized as follows. In Section 2, we formally state the model and provide results on uniform consistency and a uniform linearization of the binary response quantile process. All results hold uniformly over an infinite collection of quantiles . In Section 3, we show how the results from Section 2 can be used to obtain estimators of choice probabilities. We elaborate on the connection of this approach to rearrangements. Finally, a uniform asymptotic representation for a properly rescaled version of the proposed estimators is provided and their joint asymptotic distribution is briefly discussed. All proofs are deferred to an appendix.

2 Estimating the coefficients

Before we proceed to state our results, let us briefly recall some basic facts about identification in binary response models and provide some intuition for the estimators of Manski (1975) and Horowitz (1992). Assume that we have n i.i.d. replicates, say , drawn from the distribution with denoting the unobserved variable of interest and denoting a vector of covariates. Further, denote by the conditional quantile function of given and assume that for we have for some vectors . Observing that with arbitrary directly shows that the scale of the vector cannot be identified from . On the other hand, the vector is identified up to scale if for example implies that the distribution of conditional on differs from that conditional on on a sufficiently large set. More precisely, assume that the function is strictly increasing for in a neighbourhood of zero and all . In that case we have by the definition of the ’th quantile

This already suggests that the expectation of is positive for and negative for . We thus expect that under appropriate conditions the function

should be maximal at for any . Consider a vector . Then


Note that both quantities are non-positive, and at least one of them being strictly negative is sufficient for inferring from the observable data. An overview and more detailed discussion of related results is provided in Chapter 4 of Horowitz (2009).

A common assumption [see e.g. Chapter 4 in Horowitz (2009)] is that one component of is either constant or at least bounded away from zero. Without loss of generality, we assume that this holds for the first component of . In order to simplify the notation of what follows, write the covariate in the form with being the first component of and denoting the remaining components. Denote the supports of by , respectively. Denote by a sample of i.i.d. realizations of the random variable . Define the empirical counterpart of by

and consider a smoothed version

with denoting a bandwidth parameter and a smoothed version of the indicator function . Following Horowitz (2009), define the estimator through

Remark 2.1

The proofs of all subsequent results implicitly rely on the fact that we know which coefficient stays away from zero and that the covariate corresponding to this particular coefficient has a ’nice’ distribution conditional on all other coefficients [see assumptions (F1), (D2) etc.]. This is in line with the approach of Horowitz (1992) and Kordas (2006) and makes sense in many practical examples. Results similar to the ones presented below might continue to hold if we use Manski’s normalization instead of setting the ’right’ component to . However, the asymptotic representation would be somewhat more complicated. For this reason, we leave this interesting question to future research.

In all of the subsequent developments we make the following basic assumption.

  (A) The coefficient satisfies and the coefficient has the same sign on all of . In what follows, denote this sign by .

Remark 2.2

Note that due to the scaling the estimator defined above is an estimator of the re-scaled quantity where . When interpreting the estimator , this must be taken into account. In particular, cannot be interpreted as a classical quantile regression coefficient. This also explains the reason behind assumption (A).
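The display equations defining the objective are not reproduced in this text, so the following is only a minimal sketch of a smoothed maximum score criterion in the spirit of Horowitz (1992) and Kordas (2006), not the exact estimator analysed here. It assumes the weight x_i - (1 - tau), the standard normal distribution function as smoothed indicator, a single free coefficient with the coefficient of the first covariate normalized to one, and a plain grid search. The names `smoothed_score` and `grid_maximizer` and the noiseless toy data are hypothetical.

```python
from statistics import NormalDist

def smoothed_score(x, z1, z2, b, tau, h):
    """Smoothed score (1/n) * sum_i (x_i - (1 - tau)) * K((z1_i + b * z2_i) / h).

    Assumptions of this sketch: the coefficient of z1 is normalized to 1,
    b is the single free coefficient, and K is the standard normal cdf.
    """
    K = NormalDist().cdf
    n = len(x)
    return sum((xi - (1.0 - tau)) * K((u + b * v) / h)
               for xi, u, v in zip(x, z1, z2)) / n

def grid_maximizer(x, z1, z2, tau, h, candidates):
    """Return the candidate b with the largest smoothed score (crude grid search)."""
    return max(candidates, key=lambda b: smoothed_score(x, z1, z2, b, tau, h))

# Hypothetical noiseless toy data: x_i = 1{z1_i - 0.5 > 0}, i.e. the index is
# z1 + b * z2 with z2 = 1 and b = -0.5.
z1 = [-2.0 + 0.01 * k for k in range(401)]
z2 = [1.0] * len(z1)
x = [1 if u - 0.5 > 0 else 0 for u in z1]

candidates = [-2.0 + 0.05 * j for j in range(81)]
b_hat = grid_maximizer(x, z1, z2, tau=0.5, h=0.1, candidates=candidates)
```

In this toy example the grid maximizer should land close to the coefficient -0.5 used to generate the data. With noisy data and several covariates a numerical optimizer would replace the grid search.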

In order to establish uniform consistency of the smoothed maximum score estimator, we need the following assumptions.

  (K1) The function is uniformly bounded and satisfies as .

  (F1) The conditional distribution function of given , say , is uniformly continuous uniformly over , that is

  (D1) For any fixed , is the unique minimizer of on and additionally

In order to intuitively understand the meaning of condition (D1) above, note that conditions (K1) and (F1) imply that uniformly in . Condition (D1) essentially requires that the maximum of is ’well separated’ uniformly in , which allows us to obtain uniform consistency of a sequence of maximizers of any function that uniformly converges to . Versions of this condition that are directly connected to densities and distributions of some of the regressors can for example be derived by considering a uniform version of Assumption 2 in Manski (1985) by arguments similar to the ones given in that paper, see also Assumptions 1-3 in Horowitz (1992).

Lemma 2.3

Under assumptions (K1), (D1), (F1) let . Then the estimator is weakly uniformly consistent, that is

The next collection of assumptions is sufficient for deriving a uniform linearization [some kind of ’Bahadur-representation’] for the estimator. Assume that there exists some such that the following conditions hold.

  (K2) The function is two times continuously differentiable and its second derivative is uniformly Hölder continuous of order , that is it satisfies

    Denote the derivative of by . Assume that additionally, are uniformly bounded and we have and additionally .

  (K3) for some and for and additionally as well as as .

  (B) The bandwidth satisfies and additionally .

  (D2) The distribution of has bounded support . For almost every , the covariate has a conditional density .

  (D3) For any vector with the two functions and are two times continuously differentiable at every for almost every and the first and second derivatives are uniformly bounded [uniformly over ,].

  (D4) The function is times continuously differentiable for every at every with . All derivatives are uniformly bounded and uniformly continuous uniformly in . The function is times continuously differentiable at every with at almost every and all derivatives are uniformly bounded and uniformly continuous uniformly in .

  (D5) The map is uniformly on Hölder continuous of order , that is for some universal constant , some and all with .

  (D6) For any with there exists such that .

  (Q) We have where denotes the largest eigenvalue of the matrix and we defined

The conditions on the kernel function are standard in the binary response setting and were for example considered in Horowitz (2009) and Kordas (2006). Assumptions (D2)-(D4) and (Q) are uniform versions of the conditions in Horowitz (1992) and are needed to obtain results holding uniformly in an infinite collection of quantiles. Condition (D5) is needed to obtain a rate in the uniform representation below. Condition (D6) implies asymptotic independence of the limiting variables corresponding to different quantile levels. Essentially, it states that quantile curves corresponding to different quantile levels should be ’uniformly separated’ which is reasonable in most applications. In particular, (D6) follows if the conditional density of given is uniformly bounded away from zero for all with and in the support of .

Remark 2.4

Some straightforward calculations show that under assumption (D3) and the boundedness of the support of the matrix in condition (Q) is the second derivative of the function evaluated at . Since is assumed to be maximal at this point, the matrix is negative definite, and thus we need to bound its largest eigenvalue away from zero in order to obtain a uniform version of the non-singularity of .

Theorem 2.5

Under assumptions (A), (B), (D1)-(D5), (F1), (K1)-(K3), we have



In particular, and thus negligible compared to .
Now assume that additionally condition (D6) holds. Then, for any finite collection we obtain



are independent for and


Compared to the results available in the literature [e.g. in Kordas (2006) and Horowitz (2009)], the preceding theorem provides two important new insights. To the best of our knowledge, it is the first time that the estimator is simultaneously considered at an infinite collection of quantiles. Equally importantly, it demonstrates that the joint asymptotic distribution of several quantiles differs substantially from what both intuition and results in Kordas (2006) seem to suggest.

Remark 2.6

In contrast to the ’classical’ case, the properly normalized quantile process at different quantile levels converges to independent random variables. An intuitive explanation for this surprising fact can be obtained from the asymptotic linearization in (2.1). For simplicity, assume that the kernel has compact support, say . Then all observations that have a non-zero contribution to will need to satisfy . In particular, letting implies that asymptotically for different values of disjoint sets of observations will be driving the distribution of . Similar phenomena can be observed in other settings that include non-parametric smoothing, a classical example being density estimation. Note that regarding this particular point the paper of Kordas (2006) contained a mistake. More precisely, Kordas (2006) claimed that the asymptotic distributions corresponding to different quantiles have a non-trivial covariance, which is not the case.

In particular, the above findings imply that there can be no weak convergence of the normalized process in a reasonable functional sense since the candidate ’limiting process’ has a ’white noise’ structure and is not tight. This will present an additional challenge for the analysis of estimators for binary choice probabilities constructed in the following section.
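The disjoint-windows argument of Remark 2.6 can be made concrete with a small sketch. It assumes a hypothetical location-shift setting in which the index of observation i at a given quantile level equals v_i - c, where c depends on the level, and a kernel supported on [-1, 1], so that only observations with |v_i - c| <= h contribute. The function names and the numbers are illustrative only.

```python
def active_set(index_values, h, support=1.0):
    """Indices of observations whose index lies inside the scaled kernel support."""
    return {i for i, v in enumerate(index_values) if abs(v) <= support * h}

def shared_observations(v, c1, c2, h):
    """Number of observations contributing at both levels (index v_i - c1 and v_i - c2)."""
    first = active_set([vi - c1 for vi in v], h)
    second = active_set([vi - c2 for vi in v], h)
    return len(first & second)

# Hypothetical design: equispaced covariate values and two levels whose
# zero-index points c1 and c2 are 0.5 apart.
v = [-1.0 + 0.002 * k for k in range(1001)]
c1, c2 = -0.2, 0.3

# Overlap shrinks with the bandwidth and vanishes once 2 * h < |c1 - c2|.
overlap = {h: shared_observations(v, c1, c2, h) for h in (0.5, 0.3, 0.1, 0.05)}
```

Once the bandwidth is small relative to the separation of the two zero-index points, no observation enters both local windows, which is the mechanism behind the asymptotic independence across quantile levels.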

3 Estimating conditional probabilities

Partially due to the lack of complete identification, the coefficients estimated in the preceding section might be hard to interpret. A more tractable quantity is given by the conditional probability . One possible way to estimate this probability would be local averaging. However, due to the curse of dimensionality, this becomes impractical if the length of exceeds 2 or 3. An alternative is to assume that the linear model holds for all . By definition of , the existence of with implies that . On the other hand, the quantile function of is given by and thus . By definition of the quantile function and the assumptions on , . This implies the equality . In particular, we have for any with

This suggests to estimate by replacing in the above representation with the estimator from the preceding section after choosing in some sensible manner. The fact that is an estimator of the re-scaled version is not important here since multiplication by a positive number does not affect the inequality . From here on, define


This also indicates that in order to estimate we do not need the linear model to hold globally and also do not require that can be estimated for all . In fact, the validity of the linear model for in a neighbourhood of and estimability of on this region is sufficient for the asymptotic developments provided below.
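The construction described above can be sketched numerically. The sketch assumes the convention that the binary response equals 1{Y > 0}, so that the choice probability is one minus the level at which the linear index crosses zero, and it approximates the integral of the indicator over the quantile levels by a midpoint rule. The heteroskedastic model behind `beta_true`, the function names and the truncation levels 0.01 and 0.99 are hypothetical choices for illustration, not part of the paper.

```python
from statistics import NormalDist

def zero_crossing_level(beta, z, tau_lo, tau_hi, n_grid=20000):
    """Approximate tau_lo + integral over [tau_lo, tau_hi] of 1{z'beta(tau) <= 0} d tau."""
    width = (tau_hi - tau_lo) / n_grid
    below = 0
    for k in range(n_grid):
        tau = tau_lo + (k + 0.5) * width  # midpoint of the k-th cell
        index = sum(zj * bj for zj, bj in zip(z, beta(tau)))
        if index <= 0.0:
            below += 1
    return tau_lo + below * width

def choice_probability(beta, z, tau_lo=0.01, tau_hi=0.99):
    """P(Y > 0 | Z = z) under the assumed convention X = 1{Y > 0}."""
    return 1.0 - zero_crossing_level(beta, z, tau_lo, tau_hi)

# Hypothetical model: Y = 0.5 + z1 + (1 + 0.3 * z1) * eps with eps ~ N(0, 1),
# whose conditional quantiles are linear in (1, z1).
def beta_true(tau):
    q = NormalDist().inv_cdf(tau)
    return (0.5 + q, 1.0 + 0.3 * q)

# Re-scaling by a positive factor leaves the sign of the index unchanged.
def beta_rescaled(tau):
    b0, b1 = beta_true(tau)
    return (b0 / abs(b1), b1 / abs(b1))

z = (1.0, 0.4)
p_hat = choice_probability(beta_true, z)
p_rescaled = choice_probability(beta_rescaled, z)
p_exact = NormalDist().cdf(0.9 / 1.12)  # closed form for this toy model
```

In practice `beta_true` would be replaced by the estimated coefficient process. The comparison of `p_hat` and `p_rescaled` illustrates why only the sign of the index matters.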

Remark 3.1

Assume that the estimator is uniformly consistent and that additionally for any there exists such that . Then uniform consistency of will directly yield that, with probability tending to one, as long as and . This suggests that from an asymptotic point of view the choice of in the estimator is not very critical. On the other hand, is unknown in practice. Thus choosing as small and large as the data allow, respectively, seems to be a sensible practical approach. At the same time, identifiability at infinity is not needed to obtain an estimator of probabilities for points that are bounded away from the boundary of the covariate space.

Remark 3.2

The definition of is closely connected to the concept of rearrangement [see Hardy et al. (1988)]. More precisely, recall that the monotone rearrangement of a function is defined as

where denotes the generalized inverse of the function and the first step of the rearrangement, , is the distribution function of with respect to Lebesgue measure. Thus we can interpret the integral in the definition of as the distribution function of the map . Previously, a smoothed version of the first step of the rearrangement was used by Dette and Volgushev (2008) to invert a non-increasing estimator of an increasing function in the setting of quantile regression. On the other hand, it is not obvious if the function is increasing since is a re-scaled version of the quantile coefficient . However, as we already pointed out in Remark 3.1, the function will still have a unique zero. As we shall argue next, the first step of the rearrangement map can provide a way to estimate this zero point in a sensible way.

The properties of the rearrangement viewed as mapping between function spaces were considered in Dette et al. (2006) for estimating a monotone function and Chernozhukov et al. (2010) for monotonizing crossing quantile curves. In particular, the last-named authors derived a kind of compact differentiability of the rearrangement mapping at functions that are not necessarily increasing. However, those results cannot be directly applied here since Dette et al. (2006) and Dette and Volgushev (2008) applied smoothing while the compact differentiability result of Chernozhukov et al. (2010) requires a process-based functional central limit theorem. Due to the asymptotic independence of the limiting distributions in Theorem 2.5, such a result is impossible in our setting. Still, a general analysis of the rearrangement map is possible and will be presented next. The crucial insight is that the process is still sufficiently smooth on domains of size decreasing at the rate while its convergence to the limit takes place at a faster rate.
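The two steps of the monotone rearrangement recalled in Remark 3.2 can be sketched on a grid. The sketch assumes a function on [0, 1] evaluated at equispaced points, approximates Lebesgue measure by the fraction of grid points, and uses a hypothetical non-monotone test function. On such a grid the generalized inverse of the first step reduces to sorting the function values.

```python
def distribution_step(values, y):
    """First step: fraction of grid points u with f(u) <= y."""
    return sum(1 for v in values if v <= y) / len(values)

def generalized_inverse(values, x):
    """Second step: smallest attained value y with distribution_step(values, y) >= x."""
    for y in sorted(values):
        # Small tolerance guards against floating point rounding in the comparison.
        if distribution_step(values, y) >= x - 1e-12:
            return y
    return max(values)

n = 100
grid = [(k + 0.5) / n for k in range(n)]
f_values = [(u - 0.5) ** 2 for u in grid]  # hypothetical non-monotone function

# Rearrangement evaluated at the levels 1/n, 2/n, ..., 1.
rearranged = [generalized_inverse(f_values, (k + 1) / n) for k in range(n)]
```

The list `rearranged` is non-decreasing and contains the same values as `f_values`, which is the defining property of the increasing rearrangement on a grid.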

We begin by stating a general result that allows us to derive a uniform linearization of the map defined above. In situations where a functional central limit result does not hold (this will often be the case in the situation of estimators built from local windows), this result seems to be of independent interest. In particular, it can be used to derive a uniform Bahadur representation for the estimator in the present setting.

Theorem 3.3

Consider a collection of functions indexed by a general set and assume that for all there exists with . Additionally, assume that each is continuously differentiable in a neighbourhood and that its derivative is uniformly Hölder continuous of order with constant both not depending on , and that for any we have and .
Denote by a collection of estimators for . Assume that




and that


If for all with some given we have it follows that for any collection of points with with fixed we have with probability tending to one




where .
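The displays of Theorem 3.3 are not reproduced in this text. The following sketch therefore only illustrates the generic first-order expansion that underlies a linearization of this type, under the simplifying assumption of an increasing function g with a unique zero t0: if g is perturbed by a small smooth error, the zero located through the first step of the rearrangement moves by approximately minus the error at t0 divided by the derivative of g at t0. The functions `g`, `error` and the size `delta` are hypothetical.

```python
import math

def crossing_level(g, lo, hi, n_grid=200000):
    """lo + integral over [lo, hi] of 1{g(t) <= 0} dt, approximated by a midpoint rule."""
    width = (hi - lo) / n_grid
    below = sum(1 for k in range(n_grid) if g(lo + (k + 0.5) * width) <= 0.0)
    return lo + below * width

def g(t):
    return 2.0 * (t - 0.4) + (t - 0.4) ** 2  # increasing on [0.1, 0.9], zero at 0.4

def g_prime(t):
    return 2.0 + 2.0 * (t - 0.4)

def error(t):
    return math.sin(3.0 * t)  # hypothetical smooth estimation error

delta = 0.01

def g_hat(t):
    return g(t) + delta * error(t)  # perturbed version of g

t0 = 0.4
actual = crossing_level(g_hat, 0.1, 0.9)
linear = t0 - delta * error(t0) / g_prime(t0)
```

In this example the difference between `actual` and `linear` is of a smaller order than the shift `actual - t0` itself, in line with a remainder that is negligible relative to the leading linear term.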

We now state the additional assumptions that are needed to derive the limiting distribution of . Assume that for some the conditions of Theorem 2.5 hold on the set with . For this , we will need the following conditions.

  (T1) Define the set . Assume that for every there exists a unique with . Assume that the function is continuously differentiable on , that its derivative, say , is uniformly Hölder continuous of order and that .

  (T2) The function is Hölder continuous of order uniformly on .

  (K4) We have for some and all .

The above conditions ensure that the collection of estimators satisfies the conditions of Theorem 3.3 [see Lemma 4.4 for condition (3.2)]. An application of this result thus directly yields the following result.

Theorem 3.4

Assume that for some the conditions of Theorem 2.5 hold on the set with and let conditions (T1), (T2), (K4) hold. Assume that for each we have . Then for any


where was defined in Theorem 2.5. In particular, the remainder is negligible compared to . Moreover, for any finite collection with we obtain

where is as defined in Theorem 2.5.

From the results derived above, we see that the convergence rate of the estimators for binary choice probabilities corresponds to the rate typically encountered if one-dimensional smoothing is performed. Compared to the results of Khan (2013) whose rates correspond to dimensional smoothing, this can be a very substantial improvement. While our assumptions are more restrictive than those of Khan (2013), the form of allowed heteroskedasticity is somewhat more general than the simple multiplicative heteroskedasticity or even homoskedasticity assumed in previous work. We do not suggest replacing the methodologies developed in the literature; rather, we feel that our approach can be considered a good compromise between flexibility of the underlying model and convergence rates. It thus provides a valuable supplement to and extension of available procedures.

4 Proofs

Proof of Lemma 2.3 By Lemma 2.6.15 and Lemma 2.6.18 in van der Vaart and Wellner (1996), the classes of functions and are VC-subgraph classes of functions. Together with Theorem 2.6.7 and Theorem 2.4.3 in the same reference this implies


Next, observe that almost surely


and the classes of functions , are VC-subgraph by Lemma 2.6.15 and Lemma 2.6.18(viii) in van der Vaart and Wellner (1996). In combination with Theorem 2.6.7 and Theorem 2.4.3 from the same reference this implies

Setting in the bound for we see that the first term, which is independent of , converges to zero by assumption (K1). Moreover, by assumption (F1) we have for

almost surely. A similar result holds for . Combining all the results so far we thus see that

Finally, observe that almost surely

where was defined in condition (D1). To see that this is the case, observe that maximizes , and thus we have a.s. for every

which implies since for all .

Proof of Theorem 2.5 Define the quantities


First, by uniform consistency of and given the fact that we see that with probability tending to one for all