Semiparametric estimation of a twocomponent mixture of linear regressions in which one component is known
Abstract
A new estimation method for the twocomponent mixture model introduced in [Van13] is proposed. This model consists of a twocomponent mixture of linear regressions in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. In spite of good performance for datasets of reasonable size, the method proposed in [Van13] suffers from a serious drawback when the sample size becomes large as it is based on the optimization of a contrast function whose pointwise computation requires operations. The range of applicability of the method derived in this work is substantially larger as it relies on a methodofmoments estimator free of tuning parameters whose computation requires operations. From a theoretical perspective, the asymptotic normality of both the estimator of the Euclidean parameter vector and of the semiparametric estimator of the c.d.f. of the error is proved under weak conditions not involving zerosymmetry assumptions. In addition, an approximate confidence band for the c.d.f. of the error can be computed using a weighted bootstrap whose asymptotic validity is proved. The finitesample performance of the resulting estimation procedure is studied under various scenarios through Monte Carlo experiments. The proposed method is illustrated on three real datasets of size , 51 and 176,343, respectively. Two extensions of the considered model are discussed in the final section: a model with an additional scale parameter for the first component, and a model with more than one explanatory variable.
Mixture of linear regressions in which one component is known
,
class=MSC] \kwd[Primary ]62J05 \kwd[; secondary ]62G08
Asymptotic normality \kwdidentifiability \kwdlinear regression \kwdmethod of moments \kwdmixture \kwdmultiplier central limit theorem \kwdweighted bootstrap
Contents
1 Introduction
Practitioners are frequently interested in modeling the relationship between a random response variable and a dimensional random explanatory vector by means of a linear regression model estimated from a random sample of . Quite often, the homogeneity assumption claiming that the linear regression coefficients are the same for all the observations is inadequate. To allow different parameters for different groups of observations, a Finite Mixture of Regressions (FMR) can be considered; see [GruLei06, Lei04] for nice overviews.
Statistical inference for the fully parametric FMR model was first considered in [QuaRam78] where an estimation method based on the moment generating function was proposed. An EM estimating approach was investigated in [DeV89] in the case of two components. Variations of the latter approach were also considered in [JonMcL92] and [Tur00]. The problem of determining the number of components in the parametric FMR model was investigated in [HawAllSto01] using methods derived from the likelihood equation. In [HurJusRob03], the authors proposed a Bayesian approach to estimate the regression coefficients and also considered an extension of the model in which the number of components is unspecified. The asymptotics of maximum likelihood estimators of parametric FMR models were studied in [ZhuZha04]. More recently, an penalized method based on a Lassotype estimator for a highdimensional FMR model with was proposed in [StaBuhvan10].
As an alternative to parametric estimation of a FMR model, some authors suggested the use of more flexible semiparametric approaches. This research direction finds its origin in [HalZho03] where variate semiparametric mixture models of random vectors with independent components were considered. The authors showed in particular that, for , it is possible to identify a twocomponent model without parametrizing the distributions of the component random vectors. To the best of our knowledge, Leung and Qin [LeuQin06] were the first to estimate a FMR model semiparametrically. In the twocomponent case, they studied the situation in which the components are related by Anderson’s exponential tilt model [And79]. Hunter and Young [HunYou12] studied the identifiability of an component semiparametric FMR model and numerically investigated an EMtype algorithm for estimating its parameters. Vandekerkhove [Van13] proposed an estimation method for a twocomponent semiparametric mixture of regressions with symmetric errors in which one component is known. The latter approach was applied to data extracted from a highdensity microarray and modeled in [MarMarBer08] by means of a parametric FMR.
The semiparametric approach proposed in [Van13] is of interest for two main reasons. Due to its semiparametric nature, the method allows to detect complex structures in the error of the unknown regression component. It can additionally be regarded as a tool to assess the relevance of results obtained using EMtype algorithms. The approach has however three important drawbacks. First, it is not theoretically valid when the errors are not symmetric. Second, it is very computationally expensive for large datasets as it requires the optimization of a contrast function whose pointwise evaluation requires operations. Third, the underlying optimization method requires the choice of a weight function and initial values for the Euclidean parameters, neither choices being datadriven.
The object of interest of this paper is the twocomponent FMR model studied in [Van13] in which one component is entirely known while the proportion, the slope, the intercept and the error distribution of the other component are unknown. The estimation of the Euclidean parameter vector is achieved through the method of moments. Semiparametric estimators of the c.d.f. and the p.d.f. of the error of the unknown component are proposed. The proof of the asymptotic normality of the Euclidean and functional estimators is not based on zerosymmetrylike assumptions frequently found in the literature but only involves finite moments of order eight for the explanatory variable and the boundness of the p.d.f.s of the errors and their derivatives. The almost sure uniform consistency of the estimator of the p.d.f. of the unknown error is obtained under similar conditions. A consequence of these theoretical results is that, unlike for EMtype approaches, the estimation uncertainty can be assessed through largesample standard errors for the Euclidean parameters and by means of an approximate confidence band for the c.d.f. of the unknown error. The latter is computed using a weighted bootstrap whose asymptotic validity is proved.
From a practical perspective, it is worth mentioning that the range of applicability of the resulting semiparametric estimation procedure is substantially larger than the one proposed in [Van13] as its computation only requires operations and no tuning of parameters such as starting values or weight functions. As a consequence, very large datasets can be easily processed. For instance, as shall be seen in Section 6, the estimation of the parameters of the model from the ChIPmix data considered in [MarMarBer08] consisting of observations took less than 30 seconds on one 2.4 GHz processor. The estimation of the same model from a subset of observations using the method in [Van13] took more than two days on a similar processor.
The paper is organized as follows. Section 2 is devoted to a detailed description of the model, while Section 3 is concerned with its identifiability. The estimators of the Euclidean parameter vector and of the functional parameter are investigated in detail in Section 4. The finitesample performance of the proposed estimation method is studied for various scenarios through Monte Carlo experiments in Section 5. In Section 6, the proposed method is applied to the tone data analyzed, among others, in [HunYou12], to the aphids dataset studied initially in [BoiSinSinTaiTur98], and to the ChIPmix data considered in [MarMarBer08]. Two extensions of the FMR model under consideration are discussed in the last section: a model with an additional scale parameter for the first component, and a model with more than one explanatory variable.
Note finally that all the computations reported in this work were carried out using the R statistical system [Rsystem] and that the main corresponding R functions are available on the web page of the second author.
2 Problem and notation
Let be a Bernoulli random variable with unknown parameter , let be an valued random variable with , and let be two absolutely continuous centered real valued random variables with finite variances and independent of . Assume additionally that is independent of , and . Furthermore, for fixed , let be the random variable defined by
i.e.,
(1) 
The above display is the equation of a mixture of two linear regressions with as mixing variable.
Let and denote the c.d.f.s of and , respectively. Furthermore, , and are assumed known while , , and are assumed unknown. The aim of this work is to propose and study an estimator of based on i.i.d. copies of . Now, define , and , and notice that
(2) 
where, to simplify the notation, and . It follows that the previous estimation problem is equivalent to the problem of estimating from the observation of i.i.d. copies of .
As we continue, the unknown c.d.f.s of and will be denoted by and , respectively. Also, for any , the conditional c.d.f. of given will be denoted by , and we have
(3) 
It follows that, for any , , the conditional p.d.f. of given , can be expressed as
(4) 
where and are the p.d.f.s of and , assuming that they exist on .
Note that, as shall be discussed in Section 7, it is possible to consider a slightly more general version of the model stated in (2) involving an unknown scale parameter for the first component. This more elaborate model remains identifiable and estimation through the method of moments is theoretically possible. However, from a practical perspective, estimation of this scale parameter through the method of moments seems quite unstable insomuch as that an alternative estimation method appears to be required. Notice also that another more straightforward extension of the model will be considered in Section 7 allowing to deal with more than one explanatory variable.
3 Identifiability
Since (2) is clearly equivalent to
(5) 
we immediately obtain that
(6) 
It follows that the coefficients and can be identified from (6) if . In addition, we have
(7)  
where and are the standard deviations of and , respectively. If contains three points such that the vectors are linearly independent then, from (7), we can identify the coefficients , and . It then remains to identify , and from the equations
(8) 
From the above system, we see that , and can be identified provided . If , then and cannot be identified, and, as shall become clear in the sequel, neither can . If , then the model in (2) coincides with the model studied in [BorDelVan06] where it was shown that identifiability does not necessary hold even if is assumed to have a zerosymmetric distribution. It follows that for identifiability to hold it is necessary that the unknown component actually exists () and that its slope is nonzero (). The latter conditions will be assumed in the rest of the paper.
Before discussing the identifiability of the functional part of the model, it is important to notice that the conditions on stated above are merely sufficient conditions. For instance, if , then and can be identified from (6) and can be identified from (7), which is enough to uniquely determine .
Let us finally consider the functional part of the model. For any , denote by the c.d.f. defined by
(9) 
For any , this can be rewritten as
For , we then obtain
Now, for any , let be defined by
(10) 
It follows that is identified since
(11) 
The above equation is at the root of the derivation of an estimator for .
4 Estimation
Let be the probability distribution of . For ease of exposition, we will frequently use the notation adopted in the theory of empirical processes as presented in [Kos08, van98, vanWel96] for instance. Given a measurable function , for some integer , will denote the integral . Also, the empirical measure obtained from the random sample will be denoted by , where is the probability distribution that assigns a mass of 1 at . The expectation of under the empirical measure is then and the quantity is the empirical process evaluated at . The arrow ‘’ will be used to denote weak convergence in the sense of Definition 1.3.3 in [vanWel96] and, for any set , will stand for the space of all bounded realvalued functions on equipped with the uniform metric. Key results and more details can be found for instance in [Kos08, van98, vanWel96].
4.1 Estimation of the Euclidean parameter vector
To estimate the Euclidean parameter vector , we first need to estimate the vector whose components were expressed in terms of , and in (8). From (6) and (7), it is natural to consider the regression function
where, for any , is defined by
(12) 
As an estimator of , we then naturally consider that satisfies
where , the gradient of with respect to , is given by
Now, for any integers , define
and let
which respectively estimate
The linear equation can then equivalently be rewritten as . Provided the matrices and are invertible, we can write and . Notice that, in practice, this amounts to performing an ordinary leastsquares linear regression of on to obtain and , while , and are given by an ordinary leastsquares linear regression of on and .
To obtain an estimator of , we use the relationships induced by (6) and (7) and recalled in (8). Leaving the third equation aside because it involves the unknown standard deviation of , we obtain three possible estimators of :
three possibles estimators of :
and, three possibles estimators of :
There are therefore 27 possible estimators of . Their asymptotics can be obtained under reasonable conditions similar to those stated in Assumptions A1 and A2 below. Unfortunately, all 27 estimators turned out to behave quite poorly in small samples. This prompted us to look for alternative estimators within the “same class”.
We now describe an estimator of that was obtained empirically and that behaves significantly better for small samples than the aforementioned ones. The new regression function under consideration is , , where, for any ,
(13) 
As previously, let be the estimator of , and notice that the main difference between the approach based on (12) and the approach based on (13) is that the former involves the linear regression of on and , while the latter relies on the linear regression of on only, which appears to result in better estimation accuracy. Now, let
which respectively estimate
Then, proceeding as for the estimators based on (12), we have, provided the matrices and are invertible, that and . In practice, and (resp. and ) merely follow from the ordinary leastsquares linear regression of on (resp. on ), while for .
To obtain an estimator of , we immediately have from the second term in (13) corresponding to the linear regression of on that
where the second equality comes from the fact that for . Now, using (5), we obtain
From the first term in (13) corresponding to the linear regression of on and (6), we have that and . Combining these with the previous display, we obtain
This leads to the following estimator of :
As we continue, the subsets of on which the functions , and exist and are differentiable will be denoted by , and , respectively, and will stand for .
To derive the asymptotic behavior of , we consider the following assumptions:

(i) has a finite fourth order moment; (ii) has a finite eighth order moment.

the variances of and are strictly positive and finite.
Clearly, Assumption A1 (ii) implies Assumption A1 (i), and Assumption A2 implies that the matrix defined above is invertible.
The following result, proved in Appendix A, characterizes the asymptotic behavior of the estimator .
Proposition 4.1.
Assume that .

Under Assumptions A1 (i) and A2, .

Suppose that Assumptions A1 (ii) and A2 are satisfied and let be the 3 by 8 matrix defined by
Then,
where . As a consequence, converges in distribution to a centered multivariate normal random vector with covariance matrix , which can be consistently estimated by in the sense that .
4.2 Estimation of the functional parameter
To estimate the unknown c.d.f. of , it is natural to start from (11). For a known , the term defined in (9) may be estimated by the empirical c.d.f. of the random sample , i.e.,
Similarly, since (the c.d.f. of ) is known, a natural estimator of the term defined in (10) is given by the empirical mean of the random sample , i.e.,
To obtain estimators of and , it is then natural to consider the plugin estimators and , respectively, based on the estimator of proposed in the previous subsection.
We shall therefore consider the following nonparametric estimator of :
(14) 
Note that is not necessarily a c.d.f. as it is not necessarily increasing and can be smaller than zero or greater than one. In practice, we shall consider the partially corrected estimator , where and denote the maximum and minimum, respectively.
To derive the asymptotic behavior of the previous estimator, we consider the following additional assumptions on the p.d.f.s and of and , respectively:

(i) and exist and are bounded on ; (ii) and exist and are bounded on .
Before stating one of our main results, let us first define some additional notation. Let and be two classes of measurable functions from to defined respectively by
and
Furthermore, let be a bounded subset of containing , and let be the class of measurable functions from to defined by
With the previous notation, notice that, for any ,
and that, under Assumptions A1 (ii) and A2, Proposition 4.1 states that
Next, for any , let
(15) 
with and .
The following result, proved in Appendix B, gives the weak limit of the empirical process .
Proposition 4.2.
Assume that and that Assumptions A1, A2 and A3 hold. Then, for any ,
where and the empirical process converges weakly to in with a Brownian bridge.
Let us now discuss the estimation of the p.d.f. of . Starting from (11) and after differentiation, it seems sensible to estimate , , by the empirical mean of the observable sample . Hence, a natural estimator of