Robust functional estimation in the multivariate partial linear model


Michael Levine
Abstract

We consider the problem of adaptive estimation of the functional component in a multivariate partial linear model where the argument of the function is defined on a $d$-dimensional grid. Obtaining an adaptive estimator of this functional component is an important practical problem in econometrics, where exact distributions of the random errors and of the parametric component are mostly unknown and cannot safely be assumed to be normal. An estimator of the functional component that is adaptive in the mean squared sense over a wide range of multivariate Besov classes and robust to a wide choice of distributions of the linear component and random errors is constructed. It is also shown that the same estimator is locally adaptive over the same range of Besov classes and robust over large collections of distributions of the linear component and random errors as well. At any fixed point, this estimator also attains a local adaptive minimax rate. The procedure needed to obtain such an estimator turns out to depend on the choice of the right shrinkage approach in the wavelet domain. We show that one possible approach is to use a multivariate version of the classical BlockJS method. The multivariate version of BlockJS is developed in the manuscript and is of independent interest. Finally, the Besov space scale over which the proposed estimator is locally adaptive is shown to depend on the dimensionality of the domain of the functional component: the higher the dimension, the larger the smoothness indicator of the Besov spaces must be.

1 Introduction

In this manuscript, we consider a partial linear multivariate model defined as

$$Y_{\mathbf{i}} = c + X_{\mathbf{i}}^{\top}\beta + f(U_{\mathbf{i}}) + \xi_{\mathbf{i}}, \qquad (1)$$

where $\mathbf{i} = (i_1, \ldots, i_d)$ is a multivariate index, $\beta \in \mathbb{R}^{q}$ is an unknown vector of parameters, $c$ an unknown intercept term, $f$ is an unknown function, and $\xi_{\mathbf{i}}$ are independent and identically distributed random variables. No moment conditions are imposed on $\xi_{\mathbf{i}}$, but we do assume for convenience that the median of $\xi_{\mathbf{i}}$ is equal to zero. $X_{\mathbf{i}}$ is defined as a $q$-dimensional continuous random variable. In this manuscript, we consider a special case where each $U_{\mathbf{i}}$ is viewed as a $d$-dimensional vector with each coordinate defined on an equispaced grid on $[0,1]$. The sequence $\{X_{\mathbf{i}}\}$ is assumed to be independent of $\{\xi_{\mathbf{i}}\}$. We use bold font for indices since the most convenient notation for this model involves multivariate indices; the detailed description of these indices is postponed until Section 2. In our manuscript, we only consider the case $d > 1$, which has been relatively little explored in the statistical literature. More specifically, the only papers in the statistical literature that we are aware of discussing the multivariate case are [He and Shi, 1996], [Schick, 1996], [Müller et al., 2012], [Levine, 2015], and [Brown et al., 2016]. The econometric literature discusses the multivariate case to some extent. In particular, [Härdle et al., 2012] contains a review of some possible applications, clearly showing the practical utility of considering the case $d > 1$.

Partial linear models are, in many cases, preferable to purely nonparametric regression models because of the well-known “curse of dimensionality”. For the most part, the parametric part can be estimated at the rate $n^{-1/2}$, where $n$ is the sample size. At the same time, the estimation precision of the nonparametric component usually decays as the dimensionality of its argument grows. Partial linear models have a long history of application in both econometrics and statistics. From the practical viewpoint, they are often of considerable interest because the relationships between the response and the predictors in the same model may be of a very different nature. Some of these relationships can often be presumed to be linear while others are harder to parameterize. In the most common case, a small subset of the variables is presumed to have an unknown nonlinear relationship with the response while the rest are assumed to have a linear relationship with it. A good example can be found in [Schmalensee and Stoker, 1999], which considered a partial linear model to analyze gasoline household consumption in the US. In this model the demand for gasoline, measured in log number of gallons, is assumed to depend linearly on the number of drivers in the family, the household size, residence, region, and lifecycle. At the same time, the response is assumed to depend nonlinearly on the two remaining covariates, the log of the household income and the log of the age of the head of household.

Most often, the estimation of the nonparametric component is conducted in order to suggest a possible parametric form for this component when no prior rationale for choosing such a form is available. This, in turn, allows a researcher to describe the data more parsimoniously. For example, the above-mentioned [Schmalensee and Stoker, 1999] fits the function that describes the dependence of the log demand for gasoline on the log of the household income and the log of the age of the head of household. When the function is fit, [Schmalensee and Stoker, 1999] plots it first as a function of the log income for several fixed values of the log age, and then as a function of the log age for several fixed values of the log income. The plots suggest that a possible parsimonious representation of the nonparametric component is a piecewise linear function in both covariates; later diagnostic testing confirms this conclusion. An in-depth discussion of this issue can be found in [Horowitz, 2009].

Our main goal in this manuscript is to construct an estimator of the nonparametric component $f$ that is adaptive over a range of functional classes for $f$ and robust with respect to a wide choice of distributions of $X$ and $\xi$. In other words, we want to develop an estimator of the function $f$ that achieves the optimal, or nearly optimal, rate of convergence over a wide range of functional classes for $f$, and that is reasonably robust to the choice of a distribution of $X$ and that of the random errors $\xi$. To the best of our knowledge, this question has not been considered before in the statistical literature. The farthest step in this direction seems to have been made in [Brown et al., 2016], who constructed an asymptotically efficient estimator of the parametric component $\beta$ in the model (1). Also, a somewhat related result in the existing literature can be found in [Wang et al., 2010], who obtained some optimal convergence rates for a link function in the partial linear single index problem.

There are a number of reasons why an estimator robust to a wide choice of possible distributions of $X$ and $\xi$ is of interest. First, the typical theory for models of the type (1) assumes that the errors are independent, identically distributed (i.i.d.) normal random variables. From the practical viewpoint, normality may not always be satisfactory; see, for example, [Stuck and Kleiner, 1974] and [Stuck, 2000]. Moreover, the use of the Gaussian formulation often implies that specifically the mean regression is of interest; in other words, the regression function is viewed as a conditional expectation. In general, however, it is desirable to be able to handle more general formulations, such as the median regression and other quantile regressions, as well. This was noticed as early as [He and Shi, 1996], who suggested using an M-type objective function in order to treat mean regression, median regression, and other quantile regressions in one setting. Until the present time, if normality has not been required, it has been the usual approach in most statistical and econometric research to impose some moment assumptions that are necessary to obtain asymptotic results for estimators of the parametric component $\beta$; see, e.g., [Robinson, 1988] and [Härdle et al., 2012]. In this manuscript, we argue that this is also unnecessary and that an estimator robust to a wide choice of possible distributions for $X$ and $\xi$ and achieving the minimax or nearly minimax rate of convergence can be constructed without any moment assumptions being imposed on the distribution of $\xi$.

Another issue lies in the fact that the distribution of $X$ is often not known in practice; in particular, it need not be multivariate Gaussian. In econometric practice especially, it is exceedingly common to have to deal with a vector $X$ that includes at least some discrete components. The earlier-mentioned model of [Schmalensee and Stoker, 1999] has a vector $X$ that consists of discrete components only. In general, [Horowitz, 2009] notes when introducing partial linear models in the econometric context that “…may be discrete or continuous”. Thus, from the practical viewpoint it seems inadvisable to limit the distribution of $X$ to the multivariate normal only. In our manuscript, we develop an estimation method that is adaptive over a wide range of functional classes for $f$ and robust over a large collection of error distributions for $\xi$ and design vector distributions for $X$. This means, in particular, that an exact optimal rate of convergence for both the mean squared error risk and the squared error at a point risk is achieved over a wide range of multivariate Besov classes for $f$ without prior knowledge of the distribution of the design vector or of the random error distribution.

The method that we propose is based on the idea of treating the sum of the parametric part and the random error, $z_{\mathbf{i}} = X_{\mathbf{i}}^{\top}\beta + \xi_{\mathbf{i}}$, as the “new” random error and viewing the model (1) as a nonparametric regression with unknown random errors. The usefulness of this idea lies in the fact that the resulting nonparametric regression model is somewhat similar to the nonparametric regression with an unknown error distribution considered in [Brown et al., 2008]. The main difference is that the argument of the function $f$ is now multivariate, while it was univariate in [Brown et al., 2008]. Thus, although the origin of our problem is rather different from that of [Brown et al., 2008], the solutions of the two problems turn out to be connected. To obtain an adaptive estimator of the function $f$, we divide the $d$-dimensional cube $[0,1]^d$ into a number of equal-volume bins, take the median of the observations in each of these bins, and then apply a wavelet-based procedure to these local medians together with a bias correction. Out of several possible procedures, we choose a multivariate generalization of the BlockJS procedure developed in [Cai, 1999]. To the best of our knowledge, the multivariate BlockJS procedure has not been properly described before and may therefore be of independent interest. Note that, in general, the multivariate extension of the adaptive estimation procedure described in [Brown et al., 2008] is highly non-trivial. This is due to the necessity of selecting a correct blockwise shrinkage procedure that can be applied to the empirical wavelet coefficients. Our contribution consists, first, of designing a multivariate version of the by now classical BlockJS procedure of [Cai, 1999] and, second, of showing that it is the right choice in that it produces an adaptive and robust estimator of the functional component $f$.

Our manuscript is organized as follows. In Section 2, we define our proposed procedure exactly and establish several auxiliary results. The asymptotic properties of our procedure are established in Section 3. Section 4 contains some further discussion of our work, while the formal proofs are relegated to the Appendix.

2 General approach to estimation of the functional component

2.1 Methodology for adaptive estimation of the functional component

As mentioned in the Introduction, we begin with defining a random variable $z_{\mathbf{i}} = X_{\mathbf{i}}^{\top}\beta + \xi_{\mathbf{i}}$ and rewriting the model (1) as

$$Y_{\mathbf{i}} = c + f(U_{\mathbf{i}}) + z_{\mathbf{i}}. \qquad (2)$$

In this form, (2) is simply a multivariate nonparametric regression with an unknown error distribution. Note that in this model the intercept $c$ cannot be absorbed into the functional component due to identifiability issues; in order to ensure that the model is identifiable, we have to require that an identifiability condition, such as $\int_{[0,1]^d} f(u)\,du = 0$, is satisfied. Otherwise, one can add and subtract an arbitrary constant $a$ on the right-hand side of the model, with the new intercept becoming $c + a$. Without loss of generality, we will adopt the following simplifying assumptions. First, we assume without loss of generality that $c$ is known and equal to zero. Second, we assume that the vector median of $X_{\mathbf{i}}$ is also equal to zero. Indeed, if this is not the case, [Brown et al., 2016] has shown that there exists an estimator $\hat{\beta}$ of $\beta$ that converges at the parametric rate $n^{-1/2}$, and that the intercept $c$ can be estimated at the same rate as well. Therefore, in a general model, we will have an additional term that can be estimated at this fast parametric rate of convergence, and its presence will not influence the optimal rate of convergence of the adaptive estimator of $f$ that we construct. Appropriate remarks will be made in the proofs in the mathematical Appendix as well. Therefore, from now on we will work with the model

$$Y_{\mathbf{i}} = f(U_{\mathbf{i}}) + z_{\mathbf{i}}, \qquad (3)$$

where it is assumed that the median of $z_{\mathbf{i}}$ is equal to zero. Most of classical nonparametric regression has been developed under the assumption of independent and identically distributed (i.i.d.) Gaussian errors. In particular, a variety of smoothing techniques, such as wavelet thresholding, were developed and shown to be highly adaptive in the Gaussian case. When the errors are heavy-tailed, these techniques are typically not applicable. [Brown et al., 2008] worked with the model (3) when $d = 1$ and noticed that, for example, if $z$ is Cauchy distributed, the maximum observation (out of $n$) will be of the order $n$, instead of $\sqrt{2\log n}$, which is the case when the errors are normally distributed. This invalidates classical denoising approaches, such as wavelet thresholding, and suggests the need for a different take on this problem.
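The phenomenon is easy to reproduce numerically. The following short simulation (ours, for illustration only; not part of the original manuscript) contrasts the typical size of the largest of $n$ Cauchy errors with that of $n$ Gaussian errors:

```python
# Largest of n Cauchy errors grows like n; largest of n Gaussian errors
# grows only like sqrt(2 log n). We report the median over repetitions
# so that the "typical" size is stable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 100
for n in [10**3, 10**4, 10**5]:
    cauchy_max = np.median([np.abs(stats.cauchy.rvs(size=n, random_state=rng)).max()
                            for _ in range(reps)])
    normal_max = np.median([np.abs(rng.standard_normal(n)).max()
                            for _ in range(reps)])
    print(f"n={n:>6}: typical max|Cauchy| ~ {cauchy_max:10.1f} (order n), "
          f"typical max|Normal| ~ {normal_max:5.2f} (order sqrt(2 log n))")
```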

Similarly to [Brown et al., 2008], to estimate the function $f$ adaptively, we bin the observations according to the values of the coordinates of $U_{\mathbf{i}}$. The sample median is then computed within each bin. Now, the bin centers can be treated as the independent variables in a multivariate nonparametric regression, with the bin medians as the dependent variables. The number of bins has to be chosen in a suitable range; in our case, it turns out that a total number of bins of order $n^{3/4}$, where $n$ is the original sample size, is a suitable choice. The resulting model can be viewed as a Gaussian multivariate nonparametric regression, and a multivariate version of the BlockJS method of [Cai, 1999] can be used to obtain an adaptive estimator of the function $f$. To the best of our knowledge, such a generalization of the BlockJS method has not been implemented before. The implementation of the proposed procedure is not difficult, since the number of bins in each direction can be chosen as a power of $2$. We will show that the resulting estimator enjoys excellent adaptivity properties over a wide range of multivariate Besov balls and is robust over wide-ranging sets of distributions of $X$ and $\xi$.
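As an illustration of the binning step, the following sketch (our code; the helper name and the sizes are illustrative, and the theory takes a total number of bins of order $n^{3/4}$) computes the array of bin medians for $d = 2$:

```python
# Bin observations on an equispaced grid on [0,1]^2 and compute bin medians.
import numpy as np

def bin_medians(y, T_edge):
    """y: observations on an M x M grid (d = 2); T_edge: bins per edge,
    chosen as a power of 2. Returns the T_edge x T_edge array of bin medians."""
    M = y.shape[0]
    assert M % T_edge == 0
    m_per = M // T_edge                       # grid points per bin edge
    # reshape so that each bin's observations occupy the trailing axes
    blocks = y.reshape(T_edge, m_per, T_edge, m_per).transpose(0, 2, 1, 3)
    return np.median(blocks.reshape(T_edge, T_edge, -1), axis=-1)

# toy usage with heavy-tailed (Cauchy) errors; sizes are illustrative only
M = 64                                         # n = M^2 = 4096 observations
T_edge = 8                                     # 64 bins, 64 observations each
rng = np.random.default_rng(1)
u1, u2 = np.meshgrid((np.arange(M) + 0.5) / M, (np.arange(M) + 0.5) / M)
y = np.sin(2 * np.pi * u1) * np.cos(np.pi * u2) + rng.standard_cauchy((M, M))
med = bin_medians(y, T_edge)                   # robust input for the wavelet step
```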

Before the detailed description of our procedure, it is necessary to specify the exact notation that will be used. For simplicity, we start with values of $U$ defined on an equispaced grid. More specifically, we define points $u_{\mathbf{i}} = (u_{i_1}, \ldots, u_{i_d})$ where each $u_{i_k} = i_k/M$, $i_k = 1, \ldots, M$, for some positive integer $M$. Note that the total sample size is $n = M^{d}$. This assumption ensures that $M \to \infty$ as $n \to \infty$. In the model (1), the multivariate index is $\mathbf{i} = (i_1, \ldots, i_d)$. Throughout this article, we will use bold font for all multivariate indices and a regular font for scalar ones. We will say that two multivariate indices satisfy $\mathbf{i} \le \mathbf{j}$ if $i_k \le j_k$ for any $k = 1, \ldots, d$; the relationship between $\mathbf{i}$ and $\mathbf{j}$ is that of partial ordering. For convenience, we denote by $\mathbf{0}$ the $d$-dimensional vector all of whose coordinates are equal to $0$ and by $\mathbf{1}$ the $d$-dimensional vector consisting of $1$'s. The norm of a vector $\mathbf{i}$ will be denoted $|\mathbf{i}| = \sum_{k=1}^{d} i_k$. To define the bins we use for aggregating observations, we start with choosing a per-edge bin count $T = 2^{J}$ for a suitable positive integer $J$. We split the $k$th edge of the $d$-dimensional cube $[0,1]^d$ into $T$ intervals of the type $I_{l_k} = ((l_k - 1)/T,\; l_k/T]$, where $1 \le l_k \le T$ for any $k = 1, \ldots, d$. All of the observations are split into bins in the following way: all $Y_{\mathbf{i}}$ such that the $k$th coordinate of the corresponding $u_{\mathbf{i}}$ belongs in the interval $I_{l_k}$, $k = 1, \ldots, d$, are assigned to the bin with the multivariate index $\mathbf{l} = (l_1, \ldots, l_d)$. We will denote the $\mathbf{l}$th bin $B_{\mathbf{l}}$. Clearly, the total number of such bins is $T^{d}$; note that, due to the selection of $J$ and $M$, the total number of bins is of order $n^{3/4}$. Also, we define the approximate number of observations in each bin $\bar{m} = n/T^{d}$; clearly, $\bar{m}$ is of order $n^{1/4}$. Let us denote by $m_{\mathbf{l}}$ the median of all $Y_{\mathbf{i}}$ such that the corresponding observation belongs in the $\mathbf{l}$th bin $B_{\mathbf{l}}$; also, we denote by $\zeta_{\mathbf{l}} = E\, m_{\mathbf{l}}$ the expectation of the sample median $m_{\mathbf{l}}$.

In order to approximate the median of the observations in each multivariate bin with a normal random variable, we will have to develop a multivariate median coupling inequality. What follows is a very brief general discussion of median coupling; for more details, see [Brown et al., 2008]. Note that our first step will have to be the estimation of the expectation of the median for the $\mathbf{l}$th bin. Let $g$ be the density function of $z_{\mathbf{i}}$. Let $z_1, \ldots, z_m$ be independent random variables with the density $g$. We will need the following assumption on the density $g$; this assumption comes from [Brown et al., 2008] but is given here in full for convenience.

Assumption 2.1.

The density $g$ satisfies $\int_{-\infty}^{0} g(x)\,dx = 1/2$, $g(0) > 0$, and $g$ is locally Lipschitz at $0$. The Lipschitz condition at zero for $g$ means that there exists a constant $C > 0$ such that $|g(x) - g(0)| \le C|x|$ in an open neighborhood of $0$.

Note that the assumption $g(0) > 0$ guarantees the uniqueness of the median of the distribution and the asymptotic normality of the sample median; see, e.g., [Casella and Berger, 2002]. In order to approximate the median of the $\mathbf{l}$th bin, we will use a coupling inequality of [Brown et al., 2008]. We only give its statement here without a proof.

Theorem 2.1.

Let $z_1, \ldots, z_m$ be i.i.d. random variables with the density function $g$ that satisfies Assumption 2.1, and let $Z$ be a standard normal random variable. We assume that $m = 2r + 1$ for some integer $r$. Then, there exists a mapping $\hat{z}_{med}(Z)$ such that the distribution law of $\hat{z}_{med}(Z)$ coincides with that of the sample median $\hat{z}_{med} = \mathrm{med}(z_1, \ldots, z_m)$ and

$$\left| 2 g(0) \sqrt{m}\; \hat{z}_{med}(Z) - Z \right| \;\le\; \frac{C_1}{\sqrt{m}} + \frac{C_2 |Z|^{2}}{\sqrt{m}}$$

when $|Z| \le \varepsilon \sqrt{m}$, where $C_1$, $C_2$, $\varepsilon > 0$ depend on the density $g$ but not on $m$.

The bound given in Theorem 2.1 can also be expressed in terms of the sample median itself as follows.

Corollary 2.2.

Under the assumptions of Theorem 2.1, the mapping $\hat{z}_{med}(Z)$ in Theorem 2.1 also satisfies

$$\left| \hat{z}_{med}(Z) - \frac{Z}{2 g(0) \sqrt{m}} \right| \;\le\; \frac{C_3}{m} + \frac{C_4 |Z|^{2}}{m}$$

when $|Z| \le \varepsilon \sqrt{m}$, where $C_3$, $C_4 > 0$ do not depend on $m$.

Note that the assumption that the number of observations $m$ is odd has been made for convenience. The Remark in [Brown et al., 2008] shows that it can be dispensed with by suitably redefining the median when $m$ is even and following a similar argument. In what follows, we will always assume for simplicity that the number of observations whose median is considered is odd. Also, we can say that Theorem 2.1 lets us approximate each bin median with a normal random variable that has mean equal to the expectation of the sample median and variance $\frac{1}{4 g^{2}(0)\,\bar{m}}$.
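The following small simulation (ours, for illustration only) checks the normal approximation that the coupling quantifies: for $m$ i.i.d. standard Cauchy errors, $g(0) = 1/\pi$, and the sample median is approximately $N\!\left(0, \frac{1}{4 g^{2}(0) m}\right)$. It illustrates only the limiting behavior, not the coupling construction itself.

```python
# Empirical vs. theoretical standard deviation of the sample median
# of m i.i.d. standard Cauchy observations (m odd, as in Theorem 2.1).
import numpy as np

rng = np.random.default_rng(2)
m, reps = 129, 20000
g0 = 1.0 / np.pi                          # standard Cauchy density at zero
medians = np.median(rng.standard_cauchy((reps, m)), axis=1)
print("empirical sd  :", medians.std())
print("theoretical sd:", 1.0 / (2.0 * g0 * np.sqrt(m)))   # 1/(2 g(0) sqrt(m))
```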

Remark 2.3.

A more general situation, which we described briefly earlier, would arise if the intercept $c$ and/or the median of $X^{\top}\beta$ were not equal to zero. If that were the case, we would be looking at approximating the median of all observations from the $\mathbf{l}$th bin with a normal random variable whose mean is shifted by the corresponding constant and whose variance is again $\frac{1}{4 g^{2}(0)\,\bar{m}}$.

However, an interesting question remains open: what kinds of distributions of $X$ and $\xi$ will result in a density of $z = X^{\top}\beta + \xi$ that satisfies Assumption 2.1? The following simple proposition identifies a reasonably wide range of distributions of $X$ and $\xi$ that satisfy this assumption. The proposition takes the form of a convenient sufficient condition. However, before stating it, we need to define the family of elliptical distributions.

Definition 2.4.

A random vector $X \in \mathbb{R}^{q}$ is said to have an elliptical distribution with parameters $\mu$ and $\Sigma$ if its characteristic function can be expressed as $\phi_{X}(t) = e^{i t^{\top}\mu}\, \psi(t^{\top} \Sigma\, t)$, where $\mu$ is a $q \times 1$ vector and $\Sigma$ is a $q \times q$ nonnegative definite matrix such that $\Sigma = A A^{\top}$ for some matrix $A$. The function $\psi$ is called the characteristic generator of $X$.

Some examples of elliptical distributions are the normal, the t-distribution, the Laplace distribution, the Cauchy distribution, and many others when $q = 1$, together with their multivariate analogs when $q > 1$. In general, elliptical distributions may not have a density; however, when they do have one, it has the form

$$f_{X}(x) = c_{q}\, |\Sigma|^{-1/2}\, h\!\left( (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right), \qquad (4)$$

where the function $h$ is called a density generator, $c_{q}$ is a normalizing constant, and $\mu$ is the median vector (which is equal to the mean if the latter exists). It is a common practice to refer to an elliptical distribution with a density function as $EC_{q}(\mu, \Sigma, h)$. The following simple property of elliptical distributions with a density function is easy to establish, and so it is given here without a proof for brevity. For a detailed discussion, see, e.g., [Fang et al., 1990].

Lemma 2.5.

Let $X \sim EC_{q}(\mu, \Sigma, h)$, let $B$ be an $r \times q$ matrix, and let $b \in \mathbb{R}^{r}$. Then, any affine transformation of $X$ has an elliptical distribution as well: $b + BX \sim EC_{r}(b + B\mu,\; B \Sigma B^{\top},\; h)$.

With the above in mind, we can now state our Proposition.

Proposition 2.6.

Assume that the density $g_{\xi}$ of the random errors $\xi$ is Lipschitz continuous at zero, symmetric around zero, and strictly positive in an open neighborhood of zero. We also assume that $X$ has an elliptical distribution with a density function that is continuous in an open neighborhood of zero. Moreover, without loss of generality (as discussed earlier), we assume that the median of $X^{\top}\beta$ is zero. Then Assumption 2.1 is satisfied for the density $g$ of $z = X^{\top}\beta + \xi$.

Proof.

Due to Lemma 2.5, $X^{\top}\beta$ has a univariate elliptical distribution with the median equal to zero. Clearly, the distribution of $X^{\top}\beta$ also has a density function; let us denote it $g_{1}$. The form of the density function of any elliptical distribution given in (4) clearly implies that, if the median of $X^{\top}\beta$ is zero, the distribution of $X^{\top}\beta$ is symmetric around zero. Since $X^{\top}\beta$ and $\xi$ are independent, the density of $z$ is the convolution of the two densities: $g = g_{1} * g_{\xi}$. It follows directly from the definition of Lipschitz continuity at zero that, since $g_{\xi}$ is Lipschitz continuous at zero, then so is $g$; moreover, it is not hard to check that the symmetry of $g_{1}$ and $g_{\xi}$ around zero implies the symmetry of their convolution around zero as well. Since $g$ is symmetric around zero, we have immediately that $\int_{-\infty}^{0} g(x)\,dx = 1/2$. Finally, since the density $g_{1}$ is continuous in an open neighborhood of zero and the corresponding median is unique, there exists an open neighborhood of zero where $g_{1} > 0$. This, together with the strict positivity of $g_{\xi}$ in some open neighborhood of zero, guarantees that $g(0) > 0$. ∎

Remark 2.7.

Note that Proposition 2.6 covers a wide variety of distributions of $X$ and $\xi$. Concerning $X$, distributions such as the multivariate normal, multivariate t-distribution, multivariate Laplace, multivariate Cauchy and, in general, any symmetric stable distribution, are elliptical distributions with a density. It is also of interest that, when the index of stability is at most $1$, which is the case for the Cauchy and Lévy distributions, there is no finite mean, yet the resulting distribution of $X$ still satisfies the requirements of Proposition 2.6. The requirements for the distribution of $\xi$ also do not contain any moment conditions; note that, for example, a (univariate) Cauchy distribution for $\xi$ satisfies the assumptions of Proposition 2.6.
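As a numerical sanity check of Proposition 2.6 (ours, under one concrete choice of distributions covered by the proposition), one can take $X$ multivariate t with 2 degrees of freedom and Cauchy errors, neither of which has a finite mean, and verify that the density of $z = X^{\top}\beta + \xi$ is symmetric and positive near zero:

```python
# X ~ multivariate t_2 (elliptical, heavy-tailed), xi ~ standard Cauchy.
# A kernel density estimate of z = X'beta + xi should be roughly symmetric
# around zero and strictly positive there.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
q, n = 3, 200000
beta = np.array([1.0, -0.5, 2.0])
Z = rng.multivariate_normal(np.zeros(q), np.eye(q), size=n)
X = Z / np.sqrt(rng.chisquare(df=2, size=(n, 1)) / 2)   # normal/chi-square mixture
z = X @ beta + stats.cauchy.rvs(size=n, random_state=rng)

kde = stats.gaussian_kde(z[np.abs(z) < 50])   # trim far tails for a stable KDE
pts = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print(np.round(kde(pts), 4))                  # approximately symmetric, positive at 0
```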

2.2 Wavelet procedure for a binned semiparametric model

We will start with a brief introduction to multivariate wavelet bases. First, let $\phi$ and $\psi$ be a pair of univariate father and mother wavelets. We need to assume that both of them are compactly supported and that $\int \phi = 1$. As is known, translations and dilations of $\phi$ and $\psi$ generate an orthonormal wavelet basis in $L_{2}$; thus, in order to obtain a convenient periodized wavelet basis, we set $\phi^{per}_{j,k}(x) = \sum_{l \in \mathbb{Z}} \phi_{j,k}(x - l)$ and $\psi^{per}_{j,k}(x) = \sum_{l \in \mathbb{Z}} \psi_{j,k}(x - l)$, where $\phi_{j,k}(x) = 2^{j/2}\phi(2^{j}x - k)$, $\psi_{j,k}(x) = 2^{j/2}\psi(2^{j}x - k)$, and $k = 0, 1, \ldots, 2^{j} - 1$. The primary resolution level $j_{0}$ should be selected large enough to ensure that the support of the scaling functions (father wavelets) and mother wavelets at level $j_{0}$ does not cover the whole $[0,1]$. We also assume regularity of order $r$ of our wavelet system for some positive $r$. This means that the first $r$ moments of the wavelet function $\psi$ are equal to zero. The periodized wavelets generate a curtailed multiresolution ladder:

$$V_{j_{0}} \subset V_{j_{0}+1} \subset V_{j_{0}+2} \subset \cdots,$$

where the spaces $V_{j}$ are spanned by $\{\phi^{per}_{j,k},\ k = 0, 1, \ldots, 2^{j}-1\}$. As in any regular multiresolution analysis, we have $V_{j+1} = V_{j} \oplus W_{j}$, where the space $W_{j}$ is spanned by $\{\psi^{per}_{j,k},\ k = 0, 1, \ldots, 2^{j}-1\}$. In the future, we will suppress the superscript $per$ from the notation for convenience. An orthonormal wavelet basis has an associated orthogonal Discrete Wavelet Transform (DWT) whose function is to transform sampled data into the wavelet coefficients. A square integrable function $f$ on $[0,1]$ can be expanded into a wavelet series

$$f(x) = \sum_{k=0}^{2^{j_{0}}-1} \alpha_{j_{0},k}\, \phi_{j_{0},k}(x) + \sum_{j = j_{0}}^{\infty} \sum_{k=0}^{2^{j}-1} \theta_{j,k}\, \psi_{j,k}(x),$$

where $\alpha_{j_{0},k} = \int_{0}^{1} f(x)\phi_{j_{0},k}(x)\,dx$ and $\theta_{j,k} = \int_{0}^{1} f(x)\psi_{j,k}(x)\,dx$ are the wavelet coefficients of $f$. We will use a one-dimensional orthogonal wavelet basis to define a $d$-dimensional one. There are a number of ways to construct such a basis based on the one-dimensional one; we will use the one that is based on a tensor product construction and preserves a multiresolution analysis (MRA) in the $d$-dimensional space. By using $d$ univariate orthogonal MRA's

$$V^{(k)}_{j_{0}} \subset V^{(k)}_{j_{0}+1} \subset \cdots, \qquad k = 1, \ldots, d,$$

we can define a $d$-dimensional multiresolution analysis

$$\mathcal{V}_{j_{0}} \subset \mathcal{V}_{j_{0}+1} \subset \cdots$$

in which $\mathcal{V}_{j} = V^{(1)}_{j} \otimes \cdots \otimes V^{(d)}_{j}$. The resulting $d$-dimensional multiresolution analysis corresponds to, first, one $d$-variate scaling function

$$\Phi(x) = \prod_{k=1}^{d} \phi(x_{k}),$$

where $\phi(x_{k})$ is the $k$th copy of the father wavelet, and, second, $2^{d} - 1$ $d$-variate wavelets

$$\Psi^{i}(x) = \prod_{k=1}^{d} \eta^{i_{k}}(x_{k}),$$

where each $\eta^{i_{k}}$ is either $\phi$ or $\psi$ but not all of $\eta^{i_{1}}, \ldots, \eta^{i_{d}}$ are equal to the father wavelet $\phi$. Now, let $\mathbb{Z}^{d}$ be the $d$-dimensional integer lattice. To complete the description of our notation, we also introduce the rescaled and translated versions $\Phi_{j,\mathbf{k}}(x) = 2^{jd/2}\Phi(2^{j}x - \mathbf{k})$ and $\Psi^{i}_{j,\mathbf{k}}(x) = 2^{jd/2}\Psi^{i}(2^{j}x - \mathbf{k})$, where $\mathbf{k} \in \mathbb{Z}^{d}$. Finally, any function $f \in L_{2}([0,1]^{d})$ can be represented as

$$f(x) = \sum_{\mathbf{k}} \alpha_{j_{0},\mathbf{k}}\, \Phi_{j_{0},\mathbf{k}}(x) + \sum_{j = j_{0}}^{\infty} \sum_{i=1}^{2^{d}-1} \sum_{\mathbf{k}} \theta^{i}_{j,\mathbf{k}}\, \Psi^{i}_{j,\mathbf{k}}(x). \qquad (5)$$

For a more detailed discussion of multivariate wavelet bases see, for example, [Daubechies, 1992] and [Vidakovic, 2009].
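In practice, this tensor-product DWT is available directly; for instance, pywt.wavedecn in PyWavelets implements the separable $d$-dimensional construction. The following short sketch (ours; the sizes are illustrative) transforms an array of bin medians:

```python
# d-dimensional tensor-product DWT of the array of bin medians (here d = 2).
import numpy as np
import pywt

rng = np.random.default_rng(4)
med = rng.standard_normal((32, 32))           # stand-in for the 2-d bin medians
coeffs = pywt.wavedecn(med, wavelet='db4', mode='periodization', level=3)
# coeffs[0] holds the gross-structure (father-wavelet) terms at the coarsest
# level; each subsequent entry is a dict of detail tensors keyed by direction,
# e.g. 'ad', 'da', 'dd' when d = 2 -- the 2^d - 1 wavelets Psi^i of the text.
print(coeffs[0].shape, sorted(coeffs[1].keys()))
```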

At this point, we can give a detailed description of our estimator of the functional component $f$. The first step is to bin the observations according to the values of the coordinates of $U_{\mathbf{i}}$, as described in Section 2.1. Then, the sample median must be computed within each bin. We will use the notation $\bar{z}_{\mathbf{l}}$ for the median of the errors in the $\mathbf{l}$th bin, as described earlier. We will denote the medians of the observations in the $\mathbf{l}$th bin $m_{\mathbf{l}}$, $\mathbf{1} \le \mathbf{l} \le T \cdot \mathbf{1}$. Our next step consists of applying a discrete wavelet transform to the array of medians $\{m_{\mathbf{l}}\}$. In the multivariate case that we consider, the discrete wavelet transform acts as a tensor, i.e., it is applied along each of the $d$ coordinate directions. After the application of the discrete wavelet transform, we can describe the transformed data as

$$\left\{ \tilde{\alpha}_{j_{0},\mathbf{k}},\; y^{i}_{j,\mathbf{k}} \;:\; j_{0} \le j \le J - 1,\; i = 1, \ldots, 2^{d}-1,\; \mathbf{0} \le \mathbf{k} \le (2^{j}-1)\cdot \mathbf{1} \right\}.$$

Here, $\tilde{\alpha}_{j_{0},\mathbf{k}}$ are the gross structure terms at the lowest resolution level $j_{0}$, while $y^{i}_{j,\mathbf{k}}$, $j \ge j_{0}$, $i = 1, \ldots, 2^{d}-1$, are empirical wavelet coefficients at level $j$ corresponding to the wavelet $\Psi^{i}$ that represent fine structure at scale $2^{j}$. The empirical wavelet coefficients can be written as

$$y^{i}_{j,\mathbf{k}} = \theta^{i}_{j,\mathbf{k}} + \epsilon^{i}_{j,\mathbf{k}} + \frac{1}{2 g(0)\sqrt{n}}\, z^{i}_{j,\mathbf{k}} + \xi^{i}_{j,\mathbf{k}},$$

where $\theta^{i}_{j,\mathbf{k}}$ are the discrete wavelet coefficients of $f$, $\epsilon^{i}_{j,\mathbf{k}}$ are deterministic errors, $z^{i}_{j,\mathbf{k}}$ are i.i.d. $N(0,1)$, and $\xi^{i}_{j,\mathbf{k}}$ are stochastic errors. Later, we will show that both the deterministic errors $\epsilon^{i}_{j,\mathbf{k}}$ and the stochastic errors $\xi^{i}_{j,\mathbf{k}}$ are negligible in a certain sense. Ignoring them, we end up with

$$y^{i}_{j,\mathbf{k}} = \theta^{i}_{j,\mathbf{k}} + \frac{1}{2 g(0)\sqrt{n}}\, z^{i}_{j,\mathbf{k}}, \qquad (6)$$

which is essentially an idealized sequence model with the noise level $\frac{1}{2 g(0)\sqrt{n}}$. As a next step, we propose a multivariate generalization of the BlockJS procedure of [Cai, 1999] and apply it to the empirical coefficients as if they were generated by (6). At each resolution level $j$, the empirical wavelet coefficients are grouped into non-overlapping blocks of length $L$. We define each block $\mathbf{b}$ as consisting of observations such that, for any choice of the $i$th wavelet and the $j$th resolution level, only the coefficients $y^{i}_{j,\mathbf{k}}$ with $\mathbf{k}$ belonging to the $\mathbf{b}$th block are included. We also define $S^{2}_{(j,\mathbf{b},i)} = \sum_{\mathbf{k} \in \mathbf{b}} (y^{i}_{j,\mathbf{k}})^{2}$, the sum of the squared empirical wavelet coefficients included in the $\mathbf{b}$th block. Let $\hat{g}^{2}(0)$ be an estimator of the squared value of the density function $g$ at the point zero. The following shrinkage rule is then applied to each block $\mathbf{b}$:

$$\hat{\theta}^{i}_{j,\mathbf{k}} = \left( 1 - \frac{\lambda_{*}\, L}{4\, \hat{g}^{2}(0)\, n\, S^{2}_{(j,\mathbf{b},i)}} \right)_{+} y^{i}_{j,\mathbf{k}}, \qquad \mathbf{k} \in \mathbf{b}, \qquad (7)$$

where $\lambda_{*} \approx 4.50524$ is the solution of the equation $\lambda - \log \lambda = 3$, and the factor $[4\hat{g}^{2}(0)\, n]^{-1}$ is present due to the fact that the noise level in (6) is equal to $\frac{1}{2 g(0)\sqrt{n}}$. For the gross structure terms, we define the estimator $\hat{\alpha}_{j_{0},\mathbf{k}} = \tilde{\alpha}_{j_{0},\mathbf{k}}$. Now, we reconstruct the estimate of the function at the points $u_{\mathbf{l}}$ by applying the inverse discrete wavelet transform to the shrunk empirical wavelet coefficients. In other words, the entire function can be estimated, for any $u \in [0,1]^{d}$, as

$$\check{f}(u) = \sum_{\mathbf{k}} \hat{\alpha}_{j_{0},\mathbf{k}}\, \Phi_{j_{0},\mathbf{k}}(u) + \sum_{j=j_{0}}^{J-1} \sum_{i=1}^{2^{d}-1} \sum_{\mathbf{k}} \hat{\theta}^{i}_{j,\mathbf{k}}\, \Psi^{i}_{j,\mathbf{k}}(u),$$

where $2^{J} = T$.
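The following sketch (ours) shows how the blockwise rule (7) can be implemented on top of the coefficient structure returned by pywt.wavedecn. The constant $\lambda_{*} \approx 4.50524$ is the root of $\lambda - \log\lambda = 3$ from [Cai, 1999]; the block length, the treatment of leftover coefficients, and the exact normalization of the noise variance `sigma2` are our assumptions, not a verified transcription of the paper's procedure.

```python
import numpy as np
import pywt

LAM_STAR = 4.50524                     # root of lam - log(lam) = 3, cf. [Cai, 1999]

def block_js(detail, L, sigma2):
    """Block James-Stein shrinkage over non-overlapping blocks of L coefficients."""
    flat = detail.ravel().copy()
    n_full = (flat.size // L) * L
    blocks = flat[:n_full].reshape(-1, L)
    S2 = (blocks ** 2).sum(axis=1, keepdims=True) + 1e-12   # block energies
    blocks *= np.clip(1.0 - LAM_STAR * L * sigma2 / S2, 0.0, None)
    flat[n_full:] = 0.0                # leftover tail coefficients: our choice, zeroed
    return flat.reshape(detail.shape)

def blockjs_estimate(medians, sigma2, wavelet='db4', level=3):
    """sigma2: per-coefficient noise variance. With an orthonormal DWT applied
    directly to the bin medians, it is roughly 1/(4 * g0**2 * m_bar), where
    m_bar is the number of observations per bin and g0 estimates g(0)."""
    coeffs = pywt.wavedecn(medians, wavelet, mode='periodization', level=level)
    L = max(1, int(np.log(medians.size)))        # block length ~ log(number of bins)
    shrunk = [coeffs[0]]                         # gross structure terms kept as is
    for det in coeffs[1:]:
        shrunk.append({k: block_js(v, L, sigma2) for k, v in det.items()})
    return pywt.waverecn(shrunk, wavelet, mode='periodization')

# usage (with `med` from the binning sketch above):
# f_check = blockjs_estimate(med, sigma2=1.0 / (4 * g0_hat**2 * m_bar))
```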

The last remaining step is to estimate the expectation of the sample median of the errors in order to obtain the final estimate of the function $f$. The following procedure is employed for that purpose. Recall that the $k$th edge of the $d$-dimensional cube has been split into $T$ intervals, and each of the resulting bins contains approximately $\bar{m}$ observations. We begin with splitting each of these intervals for the $k$th edge of the $d$-dimensional cube in two, in such a way that the smaller half contains (approximately) half of the observations. Now, we define a set of $d$-dimensional bins in the following way: each bin in this set consists of the smaller halves of the one-dimensional intervals in each of the $d$ dimensions. Note that the cardinality of that set of bins will still be $T^{d}$. Next, we define $\check{\xi}_{\mathbf{l}}$ to be the median of the $\mathbf{l}$th “half-bin”, with $\hat{\xi}_{\mathbf{l}}$ being the median of all observations from the corresponding $\mathbf{l}$th “full” bin. Then, the estimated median expectation is defined to be

$$\hat{\zeta} = \frac{1}{T^{d}} \sum_{\mathbf{1} \le \mathbf{l} \le T\cdot\mathbf{1}} \left( \check{\xi}_{\mathbf{l}} - \hat{\xi}_{\mathbf{l}} \right). \qquad (8)$$

This lets us define

$$\hat{f}(u) = \check{f}(u) - \hat{\zeta}. \qquad (9)$$
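A sketch of the bias-correction step as we read (8) and (9) follows; the averaging scheme below is our assumption about the exact form of (8). The idea is that the bias of a sample median of $m$ observations is of order $1/m$, so the half-bin median carries roughly twice the bias of the full-bin median, and the averaged difference between the two estimates the median bias itself.

```python
# Median-bias correction via half-bin medians (our reading of (8)-(9)).
import numpy as np

def median_bias_correction(obs_per_bin):
    """obs_per_bin: list of 1-d arrays, the raw observations in each bin,
    ordered along the grid so that the first half forms the 'half-bin'."""
    diffs = []
    for obs in obs_per_bin:
        half = obs[: len(obs) // 2]            # the 'smaller half' of the bin
        # bias(med of m/2 obs) ~ 2 * bias(med of m obs), so the difference
        # of the two medians estimates the full-bin median bias
        diffs.append(np.median(half) - np.median(obs))
    return np.mean(diffs)

# usage, cf. (9):  f_hat = f_check - median_bias_correction(bins)
```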

3 Adaptivity of the procedure

We study the theoretical properties of our procedure over Besov spaces on $[0,1]^{d}$ and over a suitable class of distributions of $X$ and $\xi$ as defined in (1). Besov spaces on $[0,1]^{d}$ make up a very rich class of spaces that incorporates functions of very significant spatial inhomogeneity and includes the Hölder and Sobolev spaces as special cases. Our discussion of Besov spaces is necessarily brief; for details see, for example, [Triebel, 2006]. Since multivariate Besov spaces are not used in the statistical literature very often, we will briefly describe them first. Let $e_{k}$ be the $k$th unit vector, $k = 1, \ldots, d$. We define the first difference of a function $f$ in the $k$th direction with step $h > 0$ as $\Delta_{h,k} f(x) = f(x + h e_{k}) - f(x)$, and the second difference as $\Delta^{2}_{h,k} f(x) = f(x + 2h e_{k}) - 2 f(x + h e_{k}) + f(x)$. Let $\alpha > 0$, and let $r$ be a positive integer such that $r > \alpha$. With $\Delta^{r}_{h,k} f$ denoting the $r$th difference of $f$ in the $k$th direction, the so-called Besov norm in the $k$th direction is then defined as

$$\|f\|_{B^{\alpha}_{p,q,k}} = \left( \int_{0}^{1} \left( h^{-\alpha}\, \|\Delta^{r}_{h,k} f\|_{L^{p}} \right)^{q} \frac{dh}{h} \right)^{1/q}$$

for $1 \le q < \infty$ and as

$$\|f\|_{B^{\alpha}_{p,\infty,k}} = \sup_{0 < h < 1} h^{-\alpha}\, \|\Delta^{r}_{h,k} f\|_{L^{p}}$$

otherwise. The Besov norm is now defined as the sum of the $L^{p}$ norm of the function and the Besov norms in all $d$ directions:

$$\|f\|_{B^{\alpha}_{p,q}} = \|f\|_{L^{p}} + \sum_{k=1}^{d} \|f\|_{B^{\alpha}_{p,q,k}}. \qquad (10)$$

A Besov class, sometimes also called a Besov ball, can be defined as $B^{\alpha}_{p,q}(Q) = \{ f : \|f\|_{B^{\alpha}_{p,q}} \le Q \}$ for some positive $Q$.

The Besov norm of a function can also be characterized in terms of its wavelet coefficients. For a fixed primary resolution level $j_{0}$, the Besov sequence norm of the wavelet coefficients of a function $f$ can be defined in the following way. First, let $\alpha_{j_{0}}$ be the vector of the father wavelet coefficients at the primary resolution level $j_{0}$, let $\theta_{j}$ be the vector of all wavelet coefficients at level $j \ge j_{0}$, and let $s = \alpha + d/2 - d/p$. Then, we define first

$$\|\alpha_{j_{0}}\|_{p} = \left( \sum_{\mathbf{k}} |\alpha_{j_{0},\mathbf{k}}|^{p} \right)^{1/p}$$

and

$$\left( \sum_{j \ge j_{0}} \left( 2^{js}\, \|\theta_{j}\|_{p} \right)^{q} \right)^{1/q}.$$

With the above in mind, a sequence norm for the function $f$ can be defined as

$$\|f\|_{b^{\alpha}_{p,q}} = \|\alpha_{j_{0}}\|_{p} + \left( \sum_{j \ge j_{0}} \left( 2^{js}\, \|\theta_{j}\|_{p} \right)^{q} \right)^{1/q}.$$
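The sequence norm translates directly into code. The following sketch (ours; it assumes per-axis sample sizes that are powers of two and uses $s = \alpha + d/2 - d/p$ as above) evaluates the Besov sequence norm of a function sampled on a $d$-dimensional grid:

```python
# Besov sequence norm computed from the wavelet coefficients of a sampled function.
import numpy as np
import pywt

def besov_seq_norm(f_grid, alpha, p, q, wavelet='db4', level=4):
    d = f_grid.ndim
    coeffs = pywt.wavedecn(f_grid, wavelet, mode='periodization', level=level)
    j0 = int(np.log2(coeffs[0].shape[0]))       # primary resolution level
    s = alpha + d / 2.0 - d / p                 # scaling exponent
    father = np.linalg.norm(coeffs[0].ravel(), ord=p)
    tail = 0.0
    for j, det in enumerate(coeffs[1:], start=j0):
        theta_j = np.concatenate([v.ravel() for v in det.values()])
        tail += (2 ** (j * s) * np.linalg.norm(theta_j, ord=p)) ** q
    return father + tail ** (1.0 / q)
```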

We know that the Besov sequence norm is equivalent to the Besov function norm defined in (10) and, therefore, the Besov class can also be defined as $B^{\alpha}_{p,q}(Q) = \{ f : \|f\|_{b^{\alpha}_{p,q}} \le Q \}$; for details, see e.g. [Meyer, 1995]. It is also known that, in the case of Gaussian noise, the minimax risk of estimating $f$ over the Besov body $B^{\alpha}_{p,q}(Q)$,

$$R^{*}(n) = \inf_{\hat{f}} \sup_{f \in B^{\alpha}_{p,q}(Q)} E \|\hat{f} - f\|_{2}^{2}, \qquad (11)$$

converges to zero at the rate $n^{-2\alpha/(2\alpha + d)}$ as $n \to \infty$; see e.g. [Donoho et al., 1995]. The following theorem shows that our estimator achieves optimal global adaptation for a wide range of multivariate Besov classes $B^{\alpha}_{p,q}(Q)$. Moreover, this adaptation is also uniform over a range of distributions of the design vector $X$ and the error term $\xi$. To state this theorem properly, we need to define appropriate classes of distributions of $X$ and $\xi$. We begin with the following assumption on the density $g_{\xi}$ of $\xi$. This assumption guarantees the existence of some low-order moment of $\xi$.

Assumption 3.1.

$E|\xi|^{\delta} = \int |x|^{\delta} g_{\xi}(x)\,dx < \infty$ for some $\delta > 0$.

For any $\delta > 0$ and $C_{\delta} > 0$, we define the class of densities

$$\mathcal{G}_{\delta, C_{\delta}} = \left\{ g \,:\, \int |x|^{\delta} g(x)\,dx \le C_{\delta} \right\}.$$

We will also need another, more narrow class of densities that is defined as $\mathcal{G}^{1}_{\delta, C_{\delta}} \subset \mathcal{G}_{\delta, C_{\delta}}$ for some $\epsilon_{1} > 0$ and $C_{1} > 0$, where

$$\mathcal{G}^{1}_{\delta, C_{\delta}} = \left\{ g \in \mathcal{G}_{\delta, C_{\delta}} \,:\, \int_{-\infty}^{0} g(x)\,dx = \tfrac12,\; g(0) \ge \epsilon_{1},\; |g(x) - g(0)| \le C_{1}|x| \text{ near } 0 \right\}.$$

One simple way to ensure that $g \in \mathcal{G}^{1}_{\delta, C_{\delta}}$ is to require, first, that the densities $g_{\xi}$ and $g_{1}$ satisfy the assumptions of Proposition 2.6. In addition, we also have to require the existence of a finite moment of some order $\delta$ for the density $g_{\xi}$ and for the distribution of $X^{\top}\beta$, plus the existence of a finite third derivative of $g_{1}$ in a small open neighborhood of zero. Then, the moment assumption is satisfied because, for any $0 < \delta \le 1$ and random variables $\eta_{1}$, $\eta_{2}$, we have $E|\eta_{1} + \eta_{2}|^{\delta} \le E|\eta_{1}|^{\delta} + E|\eta_{2}|^{\delta}$. Moreover, differentiability of $g_{1}$ ensures the same property for the convolution $g = g_{1} * g_{\xi}$. With the above in mind, we can define the class of error densities. As for