Introduction

Partially Linear Spatial Probit Models

Mohamed-Salem AHMED

University of Lille, LEM-CNRS 9221

Lille, France

mohamed-salem.ahmed@univ-lille.fr

Sophie DABO

INRIA-MODAL

University of Lille LEM-CNRS 9221

Lille, France

sophie.dabo@univ-lille.fr

Abstract
A partially linear probit model for spatially dependent data is considered. A triangular-array setting is used to cover various patterns of spatial data. Conditional spatial heteroscedasticity, non-identically distributed observations, and a linear process for the disturbances are assumed, allowing for various spatial dependence structures. The estimation procedure combines a weighted likelihood with the generalized method of moments. The procedure first fixes the parametric components of the model and estimates the non-parametric part by weighted likelihood; the resulting estimate is then used to construct a GMM estimate of the parametric components. The consistency and asymptotic distribution of the estimators are established under sufficient conditions. Simulation experiments investigate the finite-sample performance of the estimators.
Keywords: binary choice model, GMM, non-parametric statistics, spatial econometrics, spatial statistics.

## Introduction

Spatially dependent data arise in many fields, including agriculture, economics, environmental science, urban systems, and epidemiology. Modelling such data requires capturing the correlation between random variables at one location and those at neighbouring locations; see for instance pinkse1998contracting. This is a central feature of spatial data analysis, and spatial statistics and spatial econometrics provide the tools to perform such modelling. Many studies on spatial effects in statistics and econometrics, using a wide variety of models, have been published; see cressie15, anselin2010thirty, anselin2013spatial and arbia2006spatial for a review.
Two main ways of incorporating a spatially dependent structure (see for instance cressie15) can essentially be distinguished: geostatistics and lattice data. In geostatistics, the spatial location varies continuously over the study region. However, for many activities, the spatial index or location does not vary continuously and may be of the lattice type, which is the setting of the present work. In image analysis, remote sensing from satellites, agriculture, etc., data are often observed on a regular lattice and identified with the centroids of square pixels, whereas mapped regions typically form an irregular lattice. Statistical models for lattice data are essentially built on nearest-neighbour relations to express the dependence between nearby observations.
Two popular spatial dependence models for lattice data have received substantial attention: the spatial autoregressive dependent-variable model (SAR) and the spatial autoregressive error model (SAE, in which the model error follows an SAR process); both extend autoregressive time-series models to spatial data.
From a theoretical point of view, various linear spatial regression SAR and SAE models, as well as their identification and estimation methods, e.g., two-stage least squares (2SLS), three-stage least squares (3SLS), maximum likelihood (ML) or quasi-maximum likelihood (QML), and the generalized method of moments (GMM), have been developed and summarized by many authors, such as anselin2013spatial, kelejian1998, kelejian1999, conley1999gmm, cressie15, Case, lee2004asymptotic, lee2007, leeetal10, zhendetal12, malikovetal17, garthoffetal17, yangetal17. Introducing nonlinearity into the field of spatial linear lattice models has attracted less attention; see for instance robinson2011asymptotic, who generalized kernel regression estimation to spatial lattice data. su2012semiparametric proposed a semi-parametric GMM estimation for some semi-parametric SAR models. Extending these models and methods to discrete choice spatial models has seen even less attention; only a few papers have been concerned with this topic in recent years. This may be, as noted by fleming2004techniques (see also smirnov2010modeling and bille2014computational), due to the "added complexity that spatial dependence introduces into discrete choice models". Estimating the model parameters with a full ML approach in spatially discrete choice models often requires solving a very computationally demanding problem of n-dimensional integration, where n is the sample size.
Most spatially discrete choice models are fully linear in the covariates and are built on a continuous latent variable; see for instance smirnov2010modeling, wang2013partial and martinetti2017approximate, who proposed pseudo-ML methods, and pinkse1998contracting, who studied a method based on the GMM approach. Other estimation methodologies have also emerged, such as the EM algorithm (mcmillen1992probit) and the Gibbs sampling approach (lesage2000bayesian).

When the relationship between the discrete choice variable and some explanatory variables is not linear, a semi-parametric model may be an alternative to fully parametric models. This type of model is known in the literature as a partially linear choice spatial model and is the baseline of the current work. When the data are independent, such choice models can be viewed as special cases of the well-known generalized additive models (hastie1990generalized); they have received substantial attentionin the literature, and various estimation methods have been explored (see for instance hunsberger1994semiparametric; severini1994quasi; carroll1997generalized).
To the best of our knowledge, semi-parametric spatial choice models have not yet been investigated from a theoretical point of view. To fill this gap, this work addresses an SAE spatial probit model, in which the spatial dependence structure enters through the disturbance term of the model.
We propose a semi-parametric estimation method combining the GMM approach and the weighted likelihood method. The method consists of first fixing the parametric components of the model and non-parametrically estimating the non-linear component by weighted likelihood (staniswalis1989kernel). The obtained estimator, which depends on the values at which the parametric components are fixed, is then used to construct a GMM estimator (pinkse1998contracting) of these components.
The remainder of this paper is organized as follows. In Section 1, we introduce the studied spatial model and the estimation procedure. Section 2 is devoted to hypotheses and asymptotic results, while Section 3 reports a discussion and computation of the estimates. Section 4 gives some numerical results based on simulated data to illustrate the performance of the proposed estimators. The last section presents the proofs of the main results.

## 1 Model

We consider observations of a random vector available at n spatial locations. Assume that these observations are treated as triangular arrays (robinson2011asymptotic) and follow the partially linear model for a latent dependent variable Y∗in:

 Y∗in=XTinβ0+g0(Zin)+Uin,1≤i≤n,n=1,2,… (1)

with

 Yin=I(Y∗in≥0),1≤i≤n,n=1,2,… (2)

where I(⋅) is the indicator function; Xin and Zin are explanatory random variables taking values in two compact subsets, respectively; the parameter β0 is an unknown vector that belongs to a compact subset; and g0 is an unknown smooth function, assumed twice differentiable and bounded by a positive constant. In model (1), β0 and g0 are constant over i (and n). Assume that the disturbance term Uin in (1) is modelled by the following spatial autoregressive process (SAR):

 Uin=λ0n∑j=1WijnUjn+εin,1≤i≤n,n=1,2,… (3)

where λ0 is the autoregressive parameter, valued in a compact subset, and the Wijn are the elements of the i-th row of a non-stochastic spatial weight matrix Wn, which contains the information on the spatial relationships between observations. This spatial weight matrix is usually constructed as a function of the distances (with respect to some metric) between locations; see pinkse1998contracting for additional details. The matrix In−λ0Wn is assumed to be non-singular for all n, where In denotes the identity matrix, and the εin are assumed to be independent Gaussian variables. Note that one can rewrite Un as

 Un=(In−λ0Wn)−1εn,n=1,2,… (4)

where Un=(U1n,…,Unn)T and εn=(ε1n,…,εnn)T. Therefore, the variance-covariance matrix of Un is

 Var(Un)=(In−λ0Wn)−1Var(εn)((In−λ0Wn)T)−1. (5)

This matrix describes the cross-sectional spatial dependencies between the observations. Furthermore, the fact that its diagonal elements depend on i, and particularly on λ0 and Wn, allows some spatial heteroscedasticity. These spatial dependences and this heteroscedasticity depend on the neighbourhood structure established by the spatial weight matrix Wn.
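To make the data-generating process in (1)–(4) concrete, the following Python sketch simulates it under purely illustrative assumptions not dictated by the model: a scalar covariate Xin, a uniform Zin, standard Gaussian innovations, and a simple row-normalized nearest-neighbour weight matrix on a line.

```python
import math
import random

random.seed(0)

def make_row_normalized_w(n):
    # Illustrative weight matrix: neighbours i-1 and i+1 on a line,
    # row-normalized. Any Wn satisfying the paper's conditions would do.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in nbrs:
            w[i][j] = 1.0 / len(nbrs)
    return w

def solve_linear(a, b):
    # Gauss-Jordan elimination: returns x with a x = b,
    # used here to compute (In - lam0 Wn)^(-1) eps_n as in eq. (4).
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def simulate(n, beta0, lam0, g0):
    w = make_row_normalized_w(n)
    x = [random.gauss(0.0, 1.0) for _ in range(n)]       # scalar Xin
    z = [random.random() for _ in range(n)]              # scalar Zin
    eps = [random.gauss(0.0, 1.0) for _ in range(n)]     # Gaussian innovations
    a = [[(1.0 if i == j else 0.0) - lam0 * w[i][j] for j in range(n)]
         for i in range(n)]
    u = solve_linear(a, eps)                             # eq. (4)
    ystar = [x[i] * beta0 + g0(z[i]) + u[i] for i in range(n)]  # eq. (1)
    y = [1 if v >= 0.0 else 0 for v in ystar]                   # eq. (2)
    return x, z, y

x, z, y = simulate(50, beta0=1.0, lam0=0.4,
                   g0=lambda t: math.sin(2.0 * math.pi * t))
```

Note that only the binary responses y (together with x, z) would be observed in practice; the latent ystar and the disturbances u are hidden from the econometrician.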
Before proceeding further, let us give some particular cases of the model.
If one considers i.i.d. observations, the obtained model may be viewed as a special case of the classical generalized partially linear models (e.g., severini1994quasi) or of the classical generalized additive model (hastie1990generalized). Several approaches for estimating this particular model have been developed; among them, we cite that of severini1994quasi, based on the concept of generalized profile likelihood (e.g., severini1992profile). This approach consists of first fixing the parametric component and non-parametrically estimating the non-parametric one by the weighted likelihood method. The latter estimate is then used to construct a profile likelihood to estimate the parametric component.
When g0 vanishes (or is an affine function), that is, without a non-parametric component, several approaches have been developed to estimate the parameters β0 and λ0. The basic difficulty is that the likelihood function of this model involves an n-dimensional normal integral; thus, when n is large, the computation or the asymptotic properties of the estimates may present difficulties (e.g., poirier1988probit). Various approaches have been proposed to address this difficulty; among them, we cite the following:

• Feasible maximum likelihood approach: this approach consists of replacing the true likelihood function by a pseudo-likelihood function constructed from marginal likelihood functions. smirnov2010modeling proposed a pseudo-likelihood function obtained by replacing the variance-covariance matrix by a diagonal matrix containing its diagonal elements. Alternatively, wang2013partial proposed to divide the observations into pairwise groups, assumed independent with a bivariate normal distribution within each group, and to estimate the parameters by maximizing the likelihood of these groups. Recently, martinetti2017approximate proposed a pseudo-likelihood function defined as an approximation of the likelihood function, inspired by a univariate conditioning procedure.

• Generalized method of moments (GMM) approach, used by pinkse1998contracting. These authors combined generalized residuals with instrumental variables to construct moment equations defining the GMM estimators of β0 and λ0.

In what follows, using the observations, we propose parametric estimators of β0 and λ0 and a non-parametric estimator of the smooth function g0.
To this end, we assume that the disturbances are independent of the covariates.
We give asymptotic results under the increasing-domain asymptotic framework. This corresponds to a sampling structure in which new observations are added at the edges (boundary points), in contrast to the infill asymptotic framework, in which new observations are added in between existing observations. A typical example of an increasing domain is lattice data. An infill asymptotic is appropriate when the spatial locations lie in a bounded domain.

### 1.1 Estimation Procedure

We propose an estimation procedure based on a combination of a weighted likelihood method and a generalized method of moments. We first fix the parametric components of the model and estimate the non-parametric component using a weighted likelihood. The obtained estimate is then used to construct generalized residuals, which are combined with instrumental variables to propose GMM parametric estimates. This approach is described as follows.

By equation (2), we have

 E0(Yin|Xin,Zin)=Φ((vin(λ0))−1(XTinβ0+g0(Zin))),1≤i≤n,n=1,2,… (6)

where E0 denotes the expectation under the true parameters (i.e., β0 and λ0), Φ is the cumulative distribution function of the standard normal distribution, and vin(λ0) is the square root of the i-th diagonal element of the variance-covariance matrix of Un.
For each β, λ and z, we define the conditional expectation, given Zin=z, of the log-likelihood of Yin, for 1≤i≤n, as

 H(η;β,λ,z)=E0(L(Φ((vin(λ))−1(η+XTinβ));Yin)∣∣Zin=z), (7)

with L(u;y)=ylog(u)+(1−y)log(1−u). Note that H is assumed to be constant over i (and n). For each fixed β, λ and z, gβ,λ(z) denotes the solution in η of

 ∂∂ηH(η;β,λ,z)=0. (8)

Then, we have gβ0,λ0(z)=g0(z) for all z.
Now, using gβ,λ, we construct the GMM estimates of β0 and λ0 as in pinkse1998contracting. To that end, we define the generalized residuals, replacing g0 by gβ,λ:

 ~Uin(β,λ,gβ,λ) = E(Uin|Yin,β,λ) = ϕ(Gin(β,λ,gβ,λ))(Yin−Φ(Gin(β,λ,gβ,λ))) / [Φ(Gin(β,λ,gβ,λ))(1−Φ(Gin(β,λ,gβ,λ)))],

where ϕ is the density of the standard normal distribution and Gin(β,λ,g)=(vin(λ))−1(XTinβ+g(Zin)).

For simplicity of notation, we write ~Uin when possible.
Note that in (1.1), the generalized residual is computed by conditioning only on Yin and not on the entire sample or a subset of it. This will of course affect the efficiency of the estimators obtained from these generalized residuals, but it avoids a complex computation; see poirier1988probit for additional details. To address this loss of efficiency, we follow pinkse1998contracting's procedure, which employs instrumental variables to create moment conditions and uses a random matrix to define a criterion function. Both the instrumental variables and the random matrix allow one to incorporate more information about the spatial dependences and the heteroscedasticity characterizing the dataset. Let us now detail the estimation procedure. Let
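As a minimal numerical illustration of the generalized residual above (assuming, for simplicity, that the scalar index G has already been computed), it can be evaluated with standard-library functions only:

```python
import math

def std_normal_cdf(x):
    # Phi via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def generalized_residual(y, g):
    # E(Uin | Yin = y) for probit index g:
    # phi(g) (y - Phi(g)) / [Phi(g) (1 - Phi(g))].
    p = std_normal_cdf(g)
    return std_normal_pdf(g) * (y - p) / (p * (1.0 - p))
```

For y = 1 this reduces to the inverse Mills ratio ϕ(g)/Φ(g), and for y = 0 to −ϕ(g)/(1−Φ(g)): the residual is positive when the outcome exceeds its predicted probability and negative otherwise.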

 Sn(θ,gθ)=n−1ξTn~Un(θ,gθ), (10)

where ~Un(θ,gθ) is the n×1 vector composed of the ~Uin(θ,gθ), θ=(βT,λ)T, and ξn is an n×q matrix of instrumental variables, whose i-th row is given by the random vector ξTin. The latter may depend on θ and gθ. We assume that ξin is measurable for each i and n. We suppress the possible dependence of the instrumental variables on the parameters for notational simplicity. The GMM approach consists of minimizing the following sample criterion function:

 Qn(θ,gθ)=STn(θ,gθ)MnSn(θ,gθ), (11)

where Mn is some positive-definite weight matrix that may depend on the sample information. The choice of the instrumental variables and the weight matrix characterizes the difference between the GMM estimator and the various pseudo-maximum-likelihood estimators. For instance, if one takes

 ξin(θ,gθ)=∂Gin(θ,ηi)∂θ+∂Gin(θ,ηi)∂η∂gθ∂θ(Zin), (12)

with ηi=gθ(Zin), then the GMM estimator of θ0 is equal to a pseudo-maximum profile likelihood estimator of θ0, accounting only for the spatial heteroscedasticity.
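A sketch of the sample moment vector (10) and criterion (11), assuming the generalized residuals and instruments have already been computed and taking the weight matrix Mn as the identity purely for illustration:

```python
import numpy as np

def gmm_criterion(xi, u_tilde, m):
    # S_n = n^{-1} xi^T u_tilde  (eq. (10));  Q_n = S_n^T M S_n  (eq. (11)).
    n = xi.shape[0]
    s = xi.T @ u_tilde / n
    return float(s @ m @ s)

# Toy usage: two instruments, identity weight matrix Mn.
rng = np.random.default_rng(0)
xi = rng.normal(size=(100, 2))
u_tilde = rng.normal(size=100)
q = gmm_criterion(xi, u_tilde, np.eye(2))
```

In practice one would minimize this criterion over θ, recomputing the residuals (and the non-parametric estimate they depend on) at each trial value.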
Now, let

 S(θ,gθ)=limn→∞E0(Sn(θ,gθ)), (13)

and

 Q(θ,gθ)=ST(θ,gθ)MS(θ,gθ),

where M, the limit of the sequence Mn, is a nonrandom positive-definite matrix. The functions Sn and Qn are viewed as empirical counterparts of S and Q, respectively.
Clearly, gθ is not available in practice; we therefore need to estimate it, preferably by an asymptotically efficient estimate. By (8), for fixed θ, an estimator of gθ(z) is given by ^gθ(z), which denotes the solution in η of

 n∑i=1 ∂∂η L(Φ(Gin(θ,η));Yin) K((z−Zin)/bn) = 0 (14)

where K(⋅) is a kernel and bn is a bandwidth depending on n.

Now, replacing gθ in (11) by the estimator ^gθ permits one to obtain the GMM estimator of θ0 as

 ^θ=argminθ∈ΘQn(θ,^gθ). (15)

A classical inconvenience of the estimator proposed in (14) is that the bias of ^gθ is high near the boundary of the support of Zin. This bias will of course affect the estimator of θ0 given in (15) when some of the observations are near the boundary. A local linear method, or more generally the local polynomial method (fan1996local), can be used to reduce this bias. Another alternative is trimming (severini1994quasi), in which the function is computed using only the observations whose Zin values are away from the boundary. The advantage of this approach is that the theoretical results can be presented in a clear form, but it is less tractable from a practical point of view, in particular for small sample sizes.

## 2 Large sample properties

We now turn to the asymptotic properties of the estimators derived in the previous section, ^θ and ^g^θ. Let us use the following notation: d/dθ denotes total differentiation with respect to θ, and ∂~Uin/∂η is the partial derivative of ~Uin with respect to its first variable. The partial derivative of Sn w.r.t. g, in the direction of any function v, is

 ∂Sn∂g(θ,g)(v)=n−1n∑i=1ξin∂~Uin∂η(θ,ηi)v(Zin).

Without ambiguity, ∥⋅∥ denotes a supremum norm when the argument is a function, the Euclidean norm when it is a vector, and a corresponding matrix norm when it is a matrix.
The following matrices are needed to express the asymptotic variance-covariance matrix of ^θ:

 B1(θ0)=limn→∞E0(nSn(θ0,g0)STn(θ0,g0)),B2(θ0)={ddθST(θ,gθ)∣∣∣θ=θ0}M{ddθS(θ,gθ)∣∣∣θ=θ0},

with

 ddθS(θ,gθ)=∂S∂θ(θ,gθ)+∂S∂g(θ,gθ)∂∂θgθ, (16)

and

 Ω(θ0)={B2(θ0)}−1{ddθST(θ,gθ)∣∣∣θ=θ0}MB1(θ0)M{ddθS(θ,gθ)∣∣∣θ=θ0}{B2(θ0)}−1.

The following assumptions are required to establish the asymptotic results.
Assumption A1. (Smoothing condition). For each fixed θ and z, let gθ(z) denote the unique solution with respect to η of

 ∂∂ηH(η;θ,z)=0.

For any ε>0, there exists γ>0 such that

 supθ∈Θ,z∈Z∣∣∣∂∂ηH(g(z);θ,z)∣∣∣≤γ⟹supθ∈Θ,z∈Z|g(z)−gθ(z)|≤ε. (17)

Assumption A2. (Local dependence). The density fin of Zin exists, is continuous uniformly in i and n, and satisfies

 liminfn→∞infz∈Z1nn∑i=1fin(z)>0. (18)

The joint probability density of (Zin,Zjn) exists and is bounded uniformly in i≠j and n.
Assumption A3. (Spatial dependence). Let hθ,ηin(Yin|Xin,Zin) denote the conditional log-likelihood function of Yin given (Xin,Zin). Let Tin denote the vector (Yin,XTin,ZTin)T, and assume that for all i≠l,

 |Cov0(ψ(Tin),ψ(Tln))|≤{Var0(ψ(Tin))Var0(ψ(Tln))}1/2αiln, (19)

with

 ψ(Tin)=K(z−Zinbn) or ψ(Tin)=K(z−Zinbn)∂j1+⋯+j~p+r∂θj11⋯∂θj~p~p∂ηrhθ,ηin(Yin|Xin,Zin=z),

for all θ and η, and for all nonnegative integers j1,…,j~p and r appearing in the derivatives above.
We assume that

 ∣∣Cov0(ξitn~Uin(θ,gθ),ξjsn~Ujn(θ,gθ))∣∣≤{Var0(ξitn~Uin(θ,gθ))Var0(ξjsn~Ujn(θ,gθ))}1/2αijn, (20)

for all , and for any ,
and

 ∣∣Cov0(ξ(2)in(θ0,η0i),ξ(2)jn(θ0,η0j))∣∣≤{Var0(ξ(2)in(θ0,η0i))Var0(ξ(2)jn(θ0,η0j))}1/2αijn, (21)

with

where for each such that .

In addition, assume that there is a decreasing (to 0) positive function ϕ such that αijn≤ϕ(∥si−sj∥), where si and sj are the spatial coordinates associated with observations i and j, respectively.
Assumption A4. The kernel K is bounded and Lipschitzian, i.e., there is a positive constant C such that

 |K(u)−K(v)|≤C∥u−v∥for all u,v∈Rd.

Assumption A5. The bandwidth bn satisfies bn→0 and nbdn→∞ as n→∞.
Assumption A6. The instrumental variables are uniformly bounded, where ξin is the i-th column of the transpose of the matrix of instrumental variables ξn.
Assumption A7. θ takes values in a compact and convex set Θ, and θ0 is in the interior of Θ.
Assumption A8. Q(θ,gθ) is continuous in both arguments θ and gθ, and attains a unique minimum over Θ at θ0.
Assumption A9. The square roots vin(λ) of the diagonal elements of the variance-covariance matrix of Un are twice continuously differentiable functions of λ, uniformly in i and n.
Assumption A10. B1(θ0) and B2(θ0) are positive-definite matrices, and Mn converges in probability to M.

###### Remark 1

Assumption A1 ensures the smoothness of H around its extremum gθ; see severini1994quasi. Assumption A2 is a local dependence condition on the covariates, which are not assumed identically distributed; a similar condition can be found in robinson2011asymptotic. Condition (18) generalizes the classical assumption used when estimating the density function of identically distributed or stationary random variables. This assumption has been used in robinson2011asymptotic (Assumption A7(x), p. 8). Assumption A3 describes the spatial dependence structure. The processes that we use are not assumed to be stationary; this allows for greater generality and lets the dependence structure change with the sample size (see pinkse1998contracting for more discussion). Conditions (19), (20) and (21) are not restrictive. When the regressors and instrumental variables are deterministic, conditions (19) and (20) are equivalent to analogous conditions on the disturbances. The condition on ϕ is satisfied when the latter tends to zero at a polynomial rate, as in the case of mixing random variables.
Assumption A6 requires that the instruments and explanatory variables be bounded uniformly in i and n. In addition, when the instruments depend on θ and gθ, they are also uniformly bounded with respect to these parameters. The compactness condition in Assumption A7 is standard, and the convexity is somewhat unusual; however, it is reasonable in most applications. Condition A8 is necessary to ensure the identification of the true parameter θ0. Assumption A9 requires the standard deviations of the errors to be uniformly bounded away from zero with bounded derivatives; this has been considered by pinkse1998contracting. Assumption A10 is classic (pinkse1998contracting) and is required in the proof of Theorem 2.2. Those authors noted that in their model (without a non-parametric component), when the autoregressive parameter λ0=0, B2(θ0) is not invertible, regardless of the choice of instruments. This is also the case in our context because, for each solution gθ of (8), we have

 ∂gθ/∂β(z) = −E(Γjn(θ,gθ(z))Xjn∣∣Zjn=z) / E(Γjn(θ,gθ(z))∣∣Zjn=z),

and

 ∂gθ/∂λ(z) = (v′jn(λ)/vjn(λ)) (gθ(z)−βT ∂gθ/∂β(z)),

where

 Γjn(⋅)=Λ′(Gjn(⋅))[Yjn−Φ(Gjn(⋅))]−Λ(Gjn(⋅))ϕ(Gjn(⋅))

and Λ(u)=ϕ(u)/[Φ(u)(1−Φ(u))]. However,

 ∂gθ/∂λ(z)∣∣λ=0 = 0 because v′jn(0)=0,

so B2(θ0) will be singular when λ0=0.

With these assumptions in place, we are able to give some asymptotic results. The weak consistencies of the proposed estimators are given in the following two results. The first theorem and corollary below establish the consistency of our estimators, whereas the second theorem addresses the question of convergence to a normal distribution of the parametric component when it is properly standardized.

###### Theorem 2.1

Under Assumptions A1-A10, we have

 ^θ−θ0=op(1).
###### Corollary 2.1

If the assumptions of Theorem  2.1 are satisfied, then we have

 ∥∥^g^θ−g0∥∥=op(1).

Proof of Corollary 2.1 Note that

 ∥^g^θ−g0∥ ≤ ∥^g^θ−g^θ∥ + ∥g^θ−g0∥ ≤ supθ∥^gθ−gθ∥ + supθ∥∂gθ/∂θ∥ ∥^θ−θ0∥ = op(1),

since, by the assumptions of Theorem 2.1, supθ∥^gθ−gθ∥=op(1) and ∥^θ−θ0∥=op(1).
The following result gives the asymptotic normality of ^θ.

###### Theorem 2.2

Under assumptions A1-A10, we have

 √n(^θ−θ0)→N(0,Ω(θ0))
###### Remark 2

In practice, the previous asymptotic normality result can be used to construct asymptotic confidence intervals and build hypothesis tests when a consistent estimate of the asymptotic covariance matrix Ω(θ0) is available. To estimate this matrix, we follow the idea of pinkse1998contracting and define the estimator

 Ωn(^θ)={B2n(^θ)}−1{ddθSTn(θ,^gθ)∣∣∣θ=^θ}MnB1n(^θ)Mn{ddθSn(θ,^gθ)∣∣∣θ=^θ}{B2n(^θ)}−1,

with

 B1n(θ)=nSn(θ,^gθ)STn(θ,^gθ)andB2n(θ)={ddθSTn(θ,^gθ)}Mn{ddθSn(θ,^gθ)}.

The consistency of Ωn(^θ) will be based on that of B1n(^θ) and B2n(^θ), the estimators of B1(θ0) and B2(θ0), respectively. Note that the consistency of B2n(^θ) is relatively easy to establish. On the other hand, that of B1n(^θ) requires additional assumptions and an adaptation of the proof of Theorem 3 of (pinkse1998contracting, p. 134) to our case; this is of interest for future research.
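The estimator Ωn(^θ) above is a standard sandwich form. Assuming its building blocks (the stacked derivative of Sn, the weight matrix Mn and B1n, all evaluated at ^θ) are available as arrays, a minimal sketch of the assembly is:

```python
import numpy as np

def sandwich_variance(d_s, m_n, b1_n):
    # Omega_n = B2^{-1} (dS^T Mn B1n Mn dS) B2^{-1}, with B2 = dS^T Mn dS,
    # where d_s stacks d/dtheta S_n evaluated at theta-hat.
    b2 = d_s.T @ m_n @ d_s
    b2_inv = np.linalg.inv(b2)
    core = d_s.T @ m_n @ b1_n @ m_n @ d_s
    return b2_inv @ core @ b2_inv
```

With Mn chosen as the inverse of B1n (the efficient-GMM choice), the formula collapses to the single inverse (dS^T B1n^{-1} dS)^{-1}.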

## 3 Computation of the estimates

The aim of this section is to outline in detail how the regression parameter β0, the spatial autocorrelation parameter λ0 and the non-linear function g0 can be estimated. We begin with the computation of ^gθ, which plays a crucial role in what follows.

### 3.1 Computation of the estimate of the non-parametric component

An iterative method is needed to compute the solution of (14) for each fixed θ and z. For fixed θ and z, let ψ(η;θ,z) denote the left-hand side of (14), which can be rewritten as

 ψ(η;θ,z) = n∑i=1 [vin(λ)]−1 Λ(Gin(θ,η)) [Yin−Φ(Gin(θ,η))] K((z−Zin)/bn). (22)

Consider the Fisher information:

 Ψ(ηθ;θ,z) = E0( ∂∂η ψ(η;θ,z)∣∣η=ηθ ∣∣ {(Xin,Zin), 1≤i≤n, n=1,…} ) (23)
 = −n∑i=1 [vin(λ)]−2 Λ(Gin(θ,ηθ)) ϕ(Gin(θ,ηθ)) K((z−Zin)/bn)
 + n∑i=1 [vin(λ)]−2 Λ′(Gin(θ,ηθ)) [Φ(Gin(θ0,η0))−Φ(Gin(θ,ηθ))] K((z−Zin)/bn).

Note that the second term on the RHS of (23) is negligible when (θ,ηθ) is near the true value (θ0,η0).
Because the leading term of (23) is negative, an initial estimate ~η can be updated to η† using Fisher's scoring method:

 η† = ~η − ψ(~η;θ,z)/Ψ(~η;θ,z). (24)

The iteration procedure (24) requires a starting value to ensure convergence of the algorithm. To this end, let us adapt the approach of severini1994quasi, which supposes that, for fixed z, preliminary values Cin of the index are available. Then, (24) can be initialized with the following value:

 η†0 = ~η0 − ψ(~η0;θ,z)/Ψ(~η0;θ,z) = ∑ni=1 [vin(λ)]−1 Λ(Cin) ϕ(Cin) [Cin−[vin(λ)]−1XTinβ] K((z−Zin)/bn) / ∑ni=1 [vin(λ)]−2 Λ(Cin) ϕ(Cin) K((z−Zin)/bn),

where each Cin is computed using a slight adjustment.
With this initial value, the algorithm iterates until convergence.
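A simplified sketch of the scoring iteration (24), taking vin(λ) ≡ 1 and keeping only the leading (always negative) term of the information (23), with the kernel weights assumed precomputed — illustrative simplifications, not the full estimator:

```python
import math

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fisher_scoring_eta(y, index, weights, eta0=0.0, tol=1e-8, max_iter=100):
    # Solves sum_i w_i Lambda(G_i) (y_i - Phi(G_i)) = 0 in eta, with
    # G_i = index_i + eta and Lambda = phi / (Phi (1 - Phi)); the update
    # eta_new = eta - psi/Psi mirrors eq. (24), using only the first
    # (always negative) term of the information (23).
    eta = eta0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for yi, gi0, wi in zip(y, index, weights):
            g = gi0 + eta
            p = min(max(Phi(g), 1e-10), 1.0 - 1e-10)  # guard against 0/1
            lam = phi(g) / (p * (1.0 - p))
            score += wi * lam * (yi - p)
            info -= wi * lam * phi(g)
        step = score / info
        eta -= step
        if abs(step) < tol:
            break
    return eta
```

In the full procedure, `index` would hold the values (vin(λ))⁻¹XTinβ and `weights` the kernel terms K((z−Zin)/bn), and the iteration would be run once per evaluation point z.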

#### Selection of the bandwidth

A critical step in non- and semi-parametric models is the choice of the bandwidth parameter bn, which is usually selected by a cross-validation approach; the latter was adapted by su2012semiparametric to a spatial semi-parametric model. Because cross-validation may be very time consuming, which is the case for our model, we adapt the following approach used in severini1994quasi to achieve greater flexibility:

• Consider the linear regression of the responses on the covariates Xin, without an intercept term, and let rin denote the corresponding residuals.

• Since we expect the residuals to have smoothness properties similar to those of g0, the optimal bandwidth is that of the non-parametric regression of the rin on Zin, chosen by applying any non-parametric regression bandwidth selection method. For that, we use the cross-validation method in the np R package.
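The second step can be sketched as a leave-one-out cross-validation over a bandwidth grid for the Nadaraya-Watson regression of the residuals on Zin — a plain-Python stand-in for the np package's routine, assuming scalar Zin and a Gaussian kernel:

```python
import math

def nw_predict(z0, zs, rs, b):
    # Nadaraya-Watson estimate at z0 with a Gaussian kernel and bandwidth b.
    num = den = 0.0
    for zi, ri in zip(zs, rs):
        k = math.exp(-0.5 * ((z0 - zi) / b) ** 2)
        num += k * ri
        den += k
    return num / den if den > 0.0 else 0.0

def loo_cv_bandwidth(zs, rs, grid):
    # Leave-one-out cross-validation over a bandwidth grid for the
    # regression of the residuals r_i on Z_i.
    best_b, best_err = None, float("inf")
    for b in grid:
        err = 0.0
        for i in range(len(zs)):
            zs_i = zs[:i] + zs[i + 1:]
            rs_i = rs[:i] + rs[i + 1:]
            err += (rs[i] - nw_predict(zs[i], zs_i, rs_i, b)) ** 2
        if err < best_err:
            best_b, best_err = b, err
    return best_b
```

The selected bandwidth is then plugged into (14) for the weighted likelihood step.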

### 3.2 Computation of ^θ

The parametric component β0 and the spatial autoregressive parameter λ0 are computed, as mentioned above, by a GMM approach based on instrumental variables ξn and the weight matrix Mn. These are chosen as follows.
Because ^gθ(z) solves (14), differentiating the latter with respect to β and λ yields

 ∂∂β ^gθ(z) = −∑ni=1 [vin(λ)]−2 Δin(θ,z) Xin K((z−Zin)/bn) / ∑ni=1 [vin(λ)]−2 Δin(θ,z) K((z−Zin)/bn),

and

 ∂∂λ^gθ(z) = ∑ni=1[vin(λ)]−1v′in(λ)Δin(θ,z)[XTinβ+^gθ(z)]K(z−Z