Semiparametric Mixture Regression with Unspecified Error Distributions

# Semiparametric Mixture Regression with Unspecified Error Distributions

Yanyuan Ma***Ma’s research is supported by National Sciences Foundation and National Institute of Neurological Disorders and Strokes. Shaoli Wang Lin XuXu’s research is supported by Zhejiang Federation of Humanities and Social Sciences Circles grant 16NDJC154YB. Weixin Yaocorresponding author. weixin.yao@ucr.edu. Yao’s research is supported by NSF grant DMS-1461677 and and Department of Energy with the award DE-EE0007328. Department of Statistics, The Pennsylvania State University School of Statistics and Management, Shanghai University of Finance and Economics School of Data Sciences, Zhejiang University of Finance and Economics Department of Statistics, University of California, Riverside
###### Abstract

In fitting a mixture of linear regression models, normal assumption is traditionally used to model the error and then regression parameters are estimated by the maximum likelihood estimators (MLE). This procedure is not valid if the normal assumption is violated. To relax the normal assumption on the error distribution hence reduce the modeling bias, we propose semiparametric mixture of linear regression models with unspecified error distributions. We establish a more general identifiability result under weaker conditions than existing results, construct a class of new estimators, and establish their asymptotic properties. These asymptotic results also apply to many existing semiparametric mixture regression estimators whose asymptotic properties have remained unknown due to the inherent difficulties in obtaining them. Using simulation studies, we demonstrate the superiority of the proposed estimators over the MLE when the normal error assumption is violated and the comparability when the error is normal. Analysis of a newly collected Equine Infectious Anemia Virus data in 2017 is employed to illustrate the usefulness of the new estimator.

###### keywords:
EM algorithm; Kernel estimation; Mixture of regressions; Semiparametric models
journal: \externaldocument

mix-supp

## 1 Introduction

Mixtures of regressions provide a flexible tool to investigate the relationship between variables coming from several unknown latent components. It is widely used in many fields, such as engineering, genetics, biology, econometrics and marketing. A typical mixture of regressions model is as follows. Let be a latent class indicator with for , where is a -dimensional vector with the first component the constant 1 and the rest random predictors. Given , the response depends on through a linear regression model

 Y=XTβj+ϵj, (1)

where , and is independent of . Thus the conditional density of given can be written as

 fY∣X(y,x)=m∑j=1πjϕ(y;xTβj,σ2j), (2)

where is the normal density with mean and variance . The unknown parameters in model (2) can be estimated by the maximum likelihood estimator (MLE) using the EM algorithm (Dempster et al., 1977). See, for example, Wedel and Kamakura (2000), Skrondal and Rabe-Hesketh (2004), Jacobs et al. (1991) and Jiang and Tanner (1991) for some applications of model (2).

A major drawback of model (2) is the normal assumption of the error density, which does not always hold in practice. Unfortunately, unlike the equivalence between the MLE and the least squares estimator (LSE) in linear regression, the normal assumption of in (1) is indispensible for the consistency of MLE. Furthermore, the normal assumption is also critical for the computation of MLE because it is needed when calculating the classification probabilities in the E step of the EM algorithm.

In order to reduce the modeling bias, we relax the normal assumption of the component error distributions and propose a class of flexible semiparametric mixture of linear regression models by replacing the normal error densities in (2) with unspecified component error densities. Specifically, we propose a semiparametric mixture of linear regressions model of the form

 fY∣X(y,x,θ,g)=m∑j=1πjτjg{(y−xTβj)τj}, (3)

where and is an unspecified density function with mean zero and variance one. Note that and we can view as the scale parameter or precision parameter playing the role of in (2). For a special case of (3) where and is a symmetric function, some existing work on identifiability exists. For example, Bordes et al. (2006) and Hunter et al. (2007) established the model identifiability when and , i.e. when the regression model degenerates to a mixture of density functions, while Hunter and Young (2012) allowed any and included covariates in . In this work, we establish the identifiability result for model (3) in a more general setting than the existing literatures, where the identifiability is shown for the arbitrary component densities with mean 0 without the identical constraint on the ’s. We also propose a semiparametric EM algorithm to estimate the regression parameters and the unspecified function . We further prove the consistency and the asymptotic properties of the new semiparametric estimator. Our asymptotic results directly apply to many existing semiparametric mixture regression estimators whose asymptotic properties have not been established in the literature. Using a Monte Carlo simulation study, we demonstrate that our methods perform better than the traditional MLE when the errors have distributions other than normal and provide comparable results when the errors are normal. An empirical analysis of a newly collected Equine Infectious Anemia Virus (EIAV) data set in 2017 is carried out to illustrate the usefulness of the proposed methodology.

The rest of the paper is organized as follows. Section 2 introduces the new mixture of regressions model with unspecified error distributions, proposes the new semiparametric regression estimator, and establishes the asymptotic properties of the proposed estimator. In Section 3, we use a simulation study to demonstrate the superior performance of the new method. We illustrate the effectiveness of the new method on an EIAV data set in Section 4. Some discussions are given in Section 5.

## 2 Mixture of regressions with nonparametric error densities

### 2.1 Identifiability results

Before proposing estimation procedures, we first investigate the identifiability of the model in (3). Let with .

###### Theorem 1.

(Identifiability) Assume that , , and , , are distinct vectors in . Assume further that the support of contains an open set in . Then the semiparametric mixture regression model (3) is identifiable up to a permutation of the components.

###### Remark 1.

A more general identifiability result is proved in the supplementary document. More specifically, under the assumptions in Theorem 1, the model

 fY∣X(y,x,θ,g)=m∑j=1πjgj(y−xTβj), (4)

is identifiable, where has mean 0 and . Note that model (3) is a special case of (4) when ’s belong to the same distribution family with different precision parameters. Our identifiability result benefits from the information carried in the random covariates . This allows us to establish the identifiability result for general number of components and arbitrary ’s.

### 2.2 Estimation algorithms

Suppose that are random observations from (3). In this section, we propose a Kernel DEnsity based EM type algorithm (KDEEM) to estimate the parameter and the nonparametric density function in (3):

###### Algorithm 1.

Starting from an initial parameter and initial density function , at the th step,

E step:

Calculate the classification probabilities,

 p(k+1)ij=pr(Zi=j∣xi,yi)=π(k)jg(k)(r(k)ij)τ(k)j∑mj=1π(k)jg(k)(r(k)ij)τ(k)j,i=1,…,n,j=1,…,m,

where and .

M step:

Update and , via

1. ,

2. , for .

3. where , and is a kernel function, such as the Epanechnikov kernel.

For the conventional MLE, the normal density for is used to calculate the classification probabilities in the E step. In the KDEEM, the error density used in the E step is estimated by a weighted kernel density estimator in stage 3 of the M step, with classification probabilities as weights, to avoid the modelling bias of component error densities. Bordes et al. (2007) and Benaglia et al. (2009) have used similar idea of combining kernel density and EM algorithm for the mixture of location shifted densities when there are no predictors involved. Note that the above EM type algorithm cannot guarantee to increase the likelihood at each iteration due to the kernel density estimation in the M step. One could use the maximum smoothed loglikelihood method proposed by Levine et al. (2011) to produce a modified algorithm that does increase smoothed version of the loglikelihood at each iteration but provides similar performance to the KDEEM.

Hunter and Young (2012) considered a special case of Algorithm 1 by assuming homogeneous scales, i.e. , denoted by KDEEM.H. For completeness of the presentation, we also present out the EM algorithm for this special case.

###### Algorithm 2.

Starting from an initial parameter and initial density function , at the th step,

E step:

Calculate the classification probabilities,

 p(k+1)ij=P(Zi=j∣xi,yi)=π(k)jg(k)(ϵ(k)ij)∑mj=1π(k)jg(k)(ϵ(k)ij)i=1,…,n,j=1,…,m,

where .

M step:

Update and , via

1. ,

2. , for .

3.  g(k+1)(t)=n−1n∑i=1m∑j=1p(k+1)ijKh(ϵ(k+1)ij−t), (5)

where .

To simplify the computation, Hunter and Young (2012) also recommended to use the least squares criterion to update in the M step of Algorithm 2, i.e.,

 β(k+1)j=(XTW(k+1)jX)−1XTW(k+1)jY,

where . Let and be the resulting estimators, denoted by KDEEM.LSE. Note that is different from the classic MLE in that the classification probabilities are calculated based on the weighted kernel density estimator (5) instead of the normal density to avoid the misspecification of the component error densities.

### 2.3 Asymptotic properties

We now establish the asymptotic properties of the estimators presented in Section 2.2. Let and be the resulting estimators of Algorithm 1 and and be the corresponding true values. Next, we provide the asymptotic properties of and . We make the following mild Assumptions.

1. The probabilities for , .

2. The precision values satisfy for .

3. The true parameter value is in the interior of an open set where .

4. The pdf has a compact support and is bounded away from zero on its support. In addition, is continuous and has continuously bounded second derivative with mean 0 and variance 1.

5. The kernel function is symmetric, bounded, and twice differentiable with bounded second derivative, compact support and finite second moment.

6. The bandwidth satisfies and when .

7. In the neighborhood of the true parameter values (, ), there is a unique value (, ) where the EM algorithm converges to.

To state the theoretical results in Theorem 2, we first define some notations, while collect the proof of Theorem 2 in Section S.2 of the Supplementary document.

For any vector , let be the element-wise evaluation of at . Let where . Define

 Φ{xi,yi,g(ri),θ} =⎡⎢ ⎢⎣Φ1{xi,yi,g(ri),π,τ,β1}⋮Φm{xi,yi,g(ri),π,τ,βm}⎤⎥ ⎥⎦,

where , and

 Φj{xi,yi,g(ri),π,τ,βj}=[g(rij)τj∑mj=1πjg(rij)τj−1,πj{g(rij)+g′(rij)rij}∑mj=1πjg(rij)τj,πjg′(rij)τjxTi∑mj=1πjg(rij)τj]T.

Also, let

 Ψ{t,g(t),g(ri),θ}=∑mj=1πjg(rij)τjKh(rij−t)∑mj=1πjg(rij)τj−g(t).

Let satisfy for all and . Define

 r21(t) = E[∂Ψ{t,g(t),g(ri),θ}∂θ]∣∣∣g(⋅)=g(⋅,θ), r22(t) = E[∂Ψ{t,g(t,θ),g(ri,θ),θ}∂g(t,θ)], Ri(t) = ⎡⎢ ⎢⎣∂Ψ{t,g(t,θ),g(ri,θ),θ}/∂g(ri1,θ)⋮∂Ψ{t,g(t,θ),g(ri,θ),θ}/∂g(rim,θ)⎤⎥ ⎥⎦.

Further let , , , . Also let , , . Let

 r3(xi,yi)=[∂Φ{xi,yi,g(ri,θ),θ}∂g(ri1,θ),…,∂Φ{xi,yi,g(ri,θ),θ}∂g(rim,θ)],

and write . Now we define

 M=E{n−1r3(r22+n−1R)−1rT21}−E∂Φ{Xi,Yi,g(Yi−XTiβ1),…,g(Yi−XTiβm),θ}∂θT.

Define also

 u(xi,yi) = m∑k=1m∑j=1E[{σjπkτkg(rik,θ)∑ms=1πsτsg(ris,θ)}{m∑s=1πsτsg(γijks)} ×∂Φ{X,XTβj+σjrik,g(γijk),θ}∂g(rik,θ)⎤⎦−m∑k=1E[∂Φ{X,Y,g(r1,θ),θ}∂g(r1k,θ)g(r1k,θ)],

and

 v(xi,yi) = m∑j=1m∑k=1E⎡⎣∂Φ{X,σjrik+XTβj,g(γijk),θ}∂g(rik,θ)πkτkσjfY∣X(σjrik+XTβj,X)g(rik){∑ms=1πsτsg(ris)}⎤⎦ −m∑j=1m∑k=1m∑s=1m∑t=1E[∂Φ{X1,ηijkst,g(τ1(ηijkst−XT1β1)),…,g(τm(ηijkst−XT1βm)),θ}∂g(τs(σkrit+XTlβk−XTlβs)) ×g(τs(σkrit+XT2βk−XT2βs))∑mq=1πqτqg(τq(ηijkst−XT1βq)){∑mq=1πqτqg(τq(σkrit+XT2βk−XT2βq))}⎤⎦τsσjπkπsπtτtg(rit){∑mq=1πqτqg(riq)},

where and .

###### Theorem 2.

Under the Assumptions 1-7, the regression parameter estimator obtained from Algorithm 1 is consistent and satisfies

 √n(ˆθ−θ0)→N(0,V)

in distribution when , where

 V=M−1var[Φ{xi,yi,g(ri1,θ),…,g(rim,θ),θ}+u(xi,yi)+v(xi,yi)]M−1T.

Theorem 2 establishes the theoretical properties of the estimator in Algorithm 1. It also shows that the nonparametric density estimator has the same convergence properties as the classical nonparametric density estimator. The proof of Theorem 2 is lengthy and quite involved.

Let and be the resulting estimators of Algorithm 2 under the assumption of homogeneous component scales considered in Hunter and Young (2012). In Theorem 3 below, we show the consistency of and and provide their convergence rate properties, which have remained unsolved and are considered to be difficult to obtain in the literature. We benefit from the proof of Theorem 2, which sheds light on the problem and provides the basic approach to the proof. We first define some notations, while collect the detailed proof of Theorem 3 in the Supplement.

Define

 ˜Φ{xi,yi,g(ϵi),θ} = ⎡⎢ ⎢ ⎢⎣˜Φ1{xi,yi,g(ϵi),π,β1}⋮˜Φm{xi,yi,g(ϵi),π,βm}⎤⎥ ⎥ ⎥⎦,

where and

 ˜Φj{xi,yi,g(ϵi),π,βj}=[g(ϵij)∑mj=1πjg(ϵij)−1,πjxTig′(ϵij)∑mk=1πkg(ϵik)]T.

Also let

 ˜Ψ{t,g(t),g(ϵi),π}=∑mj=1πjg(ϵij)Kh(ϵij−t)∑mj=1πjg(ϵij)−g(t). (6)

Let and be defined similarly in Theorem 2 by replacing with and replacing with (See Section S.2 of the Supplementary document for more detail.)

###### Theorem 3.

Under the Assumptions 1-7, is consistent and satisfies

 √n(˜θ−θ0)→N(0,˜V)

in distribution when , where

 ˜V=˜M−1var[˜Φ{xi,yi,g(ϵi1,θ),…,g(ϵim,θ),θ}+˜u(xi,yi)+˜v(xi,yi)](˜M−1)T.

Next, we establish the consistency and the asymptotic normality of the , which is the least squares version of Algorithm 2 (Hunter and Young, 2012). To formally state the theoretical results in Theorem 4, we need to define some notations, while give the proof of Theorem 4 in the Supplement.

Let

 \widecheckΦ{xi,yi,g(ϵi1),…,g(ϵim),θ} = ⎡⎢ ⎢⎣\widecheckΦ1{xi,yi,g(ϵi1),…,g(ϵim),π,β1}⋮\widecheckΦm{xi,yi,g(ϵi1),…,g(ϵim),π,βm}⎤⎥ ⎥⎦,

where

 \widecheckΦj{xi,yi,g(ϵi1),…,g(ϵim),π,βj}=[g(ϵij)∑mj=1πjg(ϵij)−1,πjϵijg(ϵij)xTi∑mj=1πjg(ϵij)]T.

Let , and be defined similarly in Theorem 2 by replacing with and replacing with (See the Supplement for more detail.)

###### Theorem 4.

Under the Assumptions 1-7, is consistent and satisfies

 √n(\widecheckθ−θ0)→N(0,\widecheckV)

in distribution, where

 \widecheckV=\widecheckM−1var[\widecheckΦ{xi,yi,g(ϵi1,θ),…,g(ϵim,θ),θ}+\widechecku(xi,yi)+\widecheckv(xi,yi)](\widecheckM−1)T.

From the proof of Theorem 4 in Section S.4 of the Supplementary document, it is easy to see that we can also get consistent mixture regression parameter estimates if we replace the least squares criterion in the M step by other robust criteria such as Huber’s function (Huber, 1981) or Tukey’s bisquare function. The consistency of the estimators can be retained mainly because there is no modeling misspecification when estimating classification probabilities in the E step.

## 3 Simulation study

We conduct a series of simulation studies to demonstrate the effectiveness of the proposed estimator KDEEM under different scenarios of error distributions and compare them with the traditional normal assumption based MLE via the EM algorithm (MLEEM). For the proposed estimator, we use the traditional MLE as the initial values and select the bandwidth of the kernel density estimation of based on the method proposed by Sheather and Jones (1991). Better estimation results might be obtained if more sophisticated methods were used to select the bandwidth. See, for example, Sheather and Jones (1991) and Raykar and Duraiswami (2006). For illustration purpose, we also include KDEEM.H presented in Algorithm 2 and the corresponding least squares version KDEEM.LSE proposed by Hunter and Young (2012) for comparison.

We generate the independent and identically distributed (i.i.d.) data from the model

 Y={−3+3X+ϵ1,if Z=1;3−3X+ϵ2,if Z=2,

where is the component indicator of with , and .

We consider the following cases for the error distribution and :

Case I: ,

Case II: ,

Case III: ,

Case IV: ,

Case V: ,

Case VI: ,

Case VII: ,

and has the same distribution as , i.e. . We use Case I to check the efficiency loss of the new semiparametric mixture regression estimators compared to MLE when the error distribution is indeed normal. The distribution in Case III is bimodal and the distribution in Case IV is right skewed. Cases II, V, VI, and VII are non-normal error densities and are used to check the adaptiveness of the new method to various densities.

In Tables 1 to 6, we report the mean absolute bias (MAB) and the root mean squared error (RMSE) of the regression parameter estimates based on 1,000 replicates for all seven cases and for . For convenience of reading, all the values are multiplied by a factor of . In addition, for better comparison, we also report the relative efficiency (RE) for each estimator when compared to the classical method MLEEM. For example, RE of KDEEM is calculated as

 RE={RMSE(MLEEM)RMSE(KDEEM)}2.

A larger value of indicates better performance of the proposed method. Based on the simulation results, in Case I, when the error distribution is normal, MLE is the most efficient one as expected. However, for Cases II to VII, where the error pdf is not normal, KDEEM outperforms MLEEM and the improvement is very substantial, especially for the slope parameters. In addition, for all cases, KDEEM performs better than KDEEM.H and KDEEM.LSE, which is expected since the data generation models have the heterogeneous component scales.