Semiparametric mixtures of symmetric distributions

# Semiparametric mixtures of symmetric distributions

Cristina Butucea and Pierre Vandekerkhove
Université Paris-Est Marne-la-Vallée
###### Abstract

We consider in this paper the semiparametric mixture of two distributions equal up to a shift parameter. The model is said to be semiparametric in the sense that the mixed distribution is not supposed to belong to a parametric family. In order to insure the identifiability of the model it is assumed that the mixed distribution is symmetric, the model being then defined by the mixing proportion, two location parameters, and the probability density function of the mixed distribution. We propose a new class of -estimators of these parameters based on a Fourier approach, and prove that they are -consistent under mild regularity conditions. Their finite-sample properties are illustrated by a Monte Carlo study and a benchmark real dataset is also studied with our method.

AMS 2000 subject classifications. Primary 62G05, 62G20; secondary 62E10.
Key words and phrases. Asymptotic normality, consistency, contrast estimators, Fourier transform, identifiability, inverse problem, semiparametric, two-component mixture model.

## 1 Introduction

The probability density functions (pdf) of -variate multicomponent mixture models are defined by

 g(x)=k∑i=1λifi(x),x∈Rd, (1)

where the unknown proportions ( and ) and the unknown pdf are to be estimated. Generally the ’s are supposed to belong to a parametric family of density functions turning the inference problem for model (1) into a purely parametric estimation problem. There exists an extensive literature on this subject including the monographs of Everitt and Hand (1981), Titterington et al. (1985) or McLachlan and Peel (2000), which provide a good overview of the existing methods in this case such as maximum likelihood, minimum chi-square, moments method, Bayesian approaches etc. Note that the estimation of the number of components in model (1) may also be a crucial issue leading to various rates of convergence for maximum likelihood estimators, as discussed by Chen (1995). In that case, the selection model is an important topic, see for example Dacunha-Castelle & Gassiat (1999), Lemdani & Pons (1999), and Leroux (1992). In addition the choice of a parametric family for the ’s may be difficult when few informations are known from each subpopulations. However, model (1) is generally nonparametrically nonidentifiable without additionnal assumptions. This is no longer true when training data are available from each subpopulation; see for example Cerrito (1992), Hall (1981), Lancaster & Imbens (1996), Murray & Titterington (1978), and Qin (1999). Hall and Zhou (2003) first considered the case where no parametric assumptions are made about the ’s involved in model (1). These authors looked at -variate mixtures of two distributions, each having independent components, and proved that, under mild regularity conditions, their model is identifiable when . They propose in addition -consistent estimators of the univariate marginal cumulative distribution functions and the mixing proportion. Even if model (1) is not nonparametrically identifiable there exists for and , many real data sets in the statistical literature for which such a model is used under parametric assumptions on the ’s, such as the Old Faithfull dataset, see Azzalini & Bowman (1990), which corresponds to time measurement (in minute) between eruptions of the Old Faithfull geyser in Yellowstone National Park, USA. Another famous example deals with average amounts of precipitation (rainfall) in inches for United States cities (from the Statistical abstract of the United States, 1975; see McNeil (1977).These data sets are both included in the R statistical package.
To model from a semiparametric point of view this type of data ( and ), Bordes, Mottelet & Vandekerkhove (2006) (in abreviate BMV) and Hunter, Wang & Hettmansperger (2007) (in abreviate HWH) proposed jointly to consider i.i.d. sample data drawn from a common pdf satisfying

 g(x)=k∑i=1λif(x−μi),x∈R, (2)

where , for all such that and is an unknown pdf. When is supposed to be symmetric about zero, that is for all , the above authors proposed -estimation methods based on the cumulative distribution function (cdf) in order to estimate separately the Euclidean and functional part of model (2). The crucial part of their work deals with the identifiability of model (2) under the simple symmetry assumption on . Their basic results are established in BMV, Theorem 2.1 and HWH, Theorem 1, 2 and Corollary 1. The mixed density in (2) can also be seen as the density of i.i.d. observations in a convolution model:

 Xi=Zi+εi,i=1,...,n, (3)

where ’s are i.i.d. with common pdf and independent of i.i.d. errors ’s with discrete law such that , for . Previous results mean that, if is known and is supposed to be symmetric about 0, then we can identify the law of the errors and esimate nonparametrically the pdf . Let us notice that the mixture problem in (2) and the deconvolution problem in (3) are the same. They are both an inverse problem with unknown operator (i.e. convolution with an unknown law having support on unknown points). In particular when , and , according to Theorem 2.1. in BMV, such a model is identifiable if the Euclidean parameter , where and the mixed density is symmetric about 0. When , BMV prove, under mild conditions, that both the Euclidean parameter and the cumulative distribution function of of model (2) are estimated almost surely at the rate , for all (see Theorem 3.3 and 3.4). When or 3, HWH prove under mild conditions, the strong consistency of their estimator, and establish, under very technical conditions, its asymptotic normality (see Theorems 3 and 4 therein).

In this paper we propose to investigate a new estimation method. Let us first recall that BMV propose an iterative procedure to invert the operator and a contrast which is based on the cdf and the symmetry of the underlying unknown pdf . HWH introduce a contrast based on the cdf of the observations and estimate the euclidean parameter using the symmetry property of the unknown pdf . Here, we use Fourier analysis to invert the operator and see that under identifiability assumptions the inverse problem is well posed. Then we construct a contrast based on characteristic functions of our data which allows to estimate when is symmetric. This contrast is a functional of which is estimated by a U-statistic of order 2 at parametric rate under very mild smoothness assumption on (Sobolev smoothness larger than 1/4). Our procedure is easier to deal with and allows to get a central limit theorem for the estimator of under much simpler conditions than those of Theorem 4 in HWH. Moreover, we define a kernel estimator of the pdf and prove that it attains the same nonparametric rate as in the direct problem of density estimation. The inverse problem does not affect the pointwise rate of convergence of the density estimator. Our estimators and convergence results generalize to the mixture model with components, as soon as the model verifies identifiability assumptions. Such assumptions are known for only, see Corollary 1 in HWH.

The paper is organized as follows: in Section 2 we propose a contrast function based on a Fourier transform of the dataset pdf and derive our estimation method; in Section 3 we present our main asymptotic result which concern the -rate of convergence for the Euclidean part of the parameter and show that the classical nonparametric rate of convergence is achieved for our inverse Fourier nonparametric estimator; Section 4 is dedicated to auxiliary results and proofs; in Section 5 we propose a Monte Carlo study of our estimators on several simulated examples and implement our method on a real dataset which deals with the average amounts of precipitation (rainfall) in inches for United States cities, see McNeil (1977).

## 2 Estimation procedure

We observe independent, identically distributed random variables having common pdf in the model

 g(x)=p0f(x−α0)+(1−p0)f(x−β0),x∈R, (4)

where denotes the unknown value of the Euclidean parameter and is unknown, symmetric pdf in a large nonparametric class of functions.

For identifiability reasons, let belong to a compact set . Therefore, there are positive , which are smaller than 1/2, such that .

Note that in case we can still identify but not . As this case reduces to the estimation of the location of an unknown symmetric pdf as in Beran (1978), we do not consider this case further on.

From now on, we denote by the Fourier transform and recall that if we have the inversion formula .

Let us denote , for all and , and see that it cannot be as soon as . It is enough to notice that for all .

The contrast uses the symmetry of the underlying, unknown pdf . For the first time in the literature of mixture models, we relate the symmetry of to the fact that its Fourier transform has no imaginary part. More precisely, in model (4)

 g∗(u)=(p0eiuα0+(1−p0)eiuβ0)f∗(u)=M(θ0,u)f∗(u),u∈R.

When is supposed to be symmetric about , we can hope that , for all , if and only if . This basic result is formally stated in the following theorem.

###### Theorem 1

Consider model (2) with symmetric about 0 and . Then we have for some if and only if .

Proof. Notice that for all such that we explicitly have

 Im(g∗(u)M(θ,u))=Im(f∗(u)M(θ0,u)M(θ,u))=f∗(u)|M(θ,u)|2Im((M(θ0,u)¯M(θ,u)))=0,

for all . As , we get that is null in a neighborhood of which leads, following the proof of Theorem 2.1 in BMV, to the wanted result .

Assuming known we can recover the true value of the Euclidean parameter by minimizing the discrepancy measure defined by

 S(θ):=∫R(Im(g∗(u)M(θ,u)))2dW(u),θ∈Θ, (5)

where is a Lebesgue-absolutely continuous probability measure supported by .

Note that we can also write

 S(θ)=∫R[12i(g∗(u)M(θ,u)−¯g∗(u)¯M(θ,u))]2dW(u).

From now on, denotes the complex conjugate of .

###### Proposition 1

The function in (5) is a contrast function, i.e. for all , and if and only if .

Proof. The Fourier transform being continuous, the same holds for . By Theorem 1, if there exists such that , and there exists and such that on . It follows that

 S(θ)≥ε2∫u0+γu0−γdW(u)>0.

Otherwise if it is straightforward to check that .

Discussion. We point out that basic results similar to Theorem 1 and Proposition 1, can be established for model (2) when under sufficient identiability conditions. Indeed, in that case, it is enough to replace by and by and check that the analog of Theorem 1 can be established following the Proof of Lemma A. 1, under conditions provided in Corollary 1, in HWH. Finally, similar estimators to those in Sections 2.1 and 2.2 and asymptotic results like those established in Section 3 for , can be established with a little extra work for .

### 2.1 Contrast minimization for the Euclidean parameter

Let the estimator of be the following M-estimator

 ^θn=argminθ∈ΘSn(θ), (6)

where , depending on some parameter (small with ), is the following estimator of

 Sn(θ)=−14n(n−1)∫|u|≤1/hn∑j≠k,j,k=1(eiuXkM(θ,u)−e−iuXkM(θ,−u))(eiuXjM(θ,u)−e−iuXjM(θ,−u))dW(u). (7)

The estimator is inspired by kernel estimators of quadratic functional of the pdf as previously studied in Butucea (2007). It is written here in the Fourier domain. It is known that by removing the diagonal terms in the double sum (i.e. taking ) the bias is reduced with respect to the estimator where we plug an estimator of into .

Let us denote by

 Zk(θ,u) := eiuXkM(θ,u)−e−iuXkM(θ,−u), J(θ,u) := g∗(u)M(θ,u)−g∗(−u)M(θ,−u).

Then it is easy to see that

 Sn(θ) = −14n(n−1)n∑j≠k,j,k=1∫|u|≤1/hZk(θ,u)Zj(θ,u)dW(u), S(θ) = −14∫RJ2(θ,u)dW(u),

and that .

### 2.2 Kernel based nonparametric estimator

After estimating the Euclidean parameter, we want to estimate the nonparametric function . We suggest to use cross-validation for a kernel estimator as follows. We denote by the leave-one-out estimator of , which uses the sample without the -th observation. Then we plug this in the classical nonparametric kernel estimator, whenever the unknown is required. This procedure gives, in Fourier domain,

 f∗n(u)=1nn∑k=1K∗(bnu)eiuXkM(^θn,−k,u), (8)

where the kernel ( and ) and the bandwidth are properly chosen. Note that is in and and has an inverse Fourier transform which we denote by . Therefore, the estimator of is

 fn(x)=1nbnn∑k=1Gn(x−Xkbn). (9)

It is important to notice at this step, that the estimator is obtained by inversion of a nonparametric kernel estimator

 gn(x)=1nbnn∑k=1K(x−Xkbn), (10)

with kernel and bandwidth . The inversion is done in Fourier domain with the estimated instead of the true :

 f∗n(u)=g∗n(u)M(^θn,−k,u).

When dealing with the rain fall dataset studied in Section 4, we propose to consider, as in BMV, the version of the estimator (which has a negative part due to the small number of observations) defined by

 ~fn(x)=fn(x)Ifn(x)≥0∫Rfn(x)Ifn(x)≥0. (11)

## 3 Main results

Let us state first several assumptions.

Assumption A Let be a cumulative distribution function of some random variable which admits finite absolute moments up to the third order:

 ∫R(1+|u|+u2+|u|3)dW(u)<∞.

Assumption B We assume that the underlying probability density belongs to a ball of radius in the Sobolev space of functions having smoothness :

 W(β,L)={f:R→R+:∫f=1,∫|f∗(u)|2|u|2βdu≤L},

where denotes the Fourier transform of the function .

The weight function has been introduced for integrability of our estimator of the criterium and its derivatives with respect to . It is completely arbitrary and it may help compute numerically the values of our integrals by Monte-Carlo simulation, but it slightly reduces the asymptotic efficiency of . We could have used integrals with respect to the Lebesgue measure for highest efficiency of , but this would require stronger assumptions of smoothness and moments for the unknown probability density function .

###### Proposition 2

For each , the empirical contrast function defined in (7) with when , is such that

 supf∈W(β,L)supθ∈ΘE[(Sn(θ)−S(θ))2]≤L2(1−2P)4h4β+1(1−2P)2n+1(1−2P)4n2,

as .

An easy consequence of the Theorem is that as .

Moreover, if we choose the squared bias of is infinitely smaller when compared to its variance. So the mean squared error converges at rate as soon as .

###### Theorem 2

The estimator defined in (6) converges in probability to the true value of the Euclidean parameter as .

###### Theorem 3

The estimator defined in (6) with such that is asymptotically normally distributed:

 √n(^θn−θ0)\lx@stackreld→N(0,Σ), as n→∞,

where , , and .

The next theorem gives the upper bounds for the rate of convergence of the nonparametric estimator of , at some fixed point , over Sobolev classes of functions. The main message of the theorem is that, if then the nonparametric rates for density estimation are reached, provided a correct choice of parameters and . This might seem surprising, but it is again related to the fact that the inverse problem under consideration is well posed and the estimation of the Euclidean parameter does not affect the nonparametric rate for estimating .

###### Theorem 4

Let the estimator of be defined in (6) and the estimator of at some fixed point in (9), with , for some and a kernel in and in with Fourier transform having support included in .

If ,

 limsupn→∞supf∈W(β,L)supθ0∈ΘEθ0,f[n−2β−12β|fn(x)−f(x)|2]≤C,

for some constant which depends on and on .

We can choose an arbitrary point and write

 supf∈W(β,L)supθ0∈ΘEθ0,f[n−2β−12β|fn(x)−f(x)|2]≥supf∈W(β,L)Eθ,f[n−2β−12β|fn(x)−f(x)|2]

The lower bounds are known in the case of density estimation from direct observations, see for example results for more general Besov classes of functions in Härdle et al. (1998). They generalize easily to our case, with fixed .

## 4 Simulations

We implement our method and study its behaviour on samples of size . The mean behaviour of our estimator of is calculated by replicating times the same experiment. We considered that the underlying symmetric density is either Gaussian, Cauchy or Laplace. We give the mean value of the estimated parameter and its standard deviation in Tables 1, 3 and 4, respectively. We also plot the nonparametric estimator of the underlying density as compared to the true, in Figure 1.

We see that smaller is , smaller is the standard deviation of . This is indeed intuitively clear, as which is larger represents the fraction of data sampled from the second population or else the amount of information about the population which is located at .

We note that the previous estimation methods based on the distribution function require usually finite moments up to some order. These methods cannot deal with the Cauchy density that we consider here, see Table 3. Indeed, our method is based on Fourier transform, which is fast decreasing in this case. We also consider non smooth Laplace density (or double exponential), see Table 4. Its Fourier transform is slowly decreasing, but we chose the weight function in order to deal with this problem. Therefore, all integrals have relatively small support of integration and the computation is fast enough.

In the Table 2 we propose to illustrate the sensitivity of our method with respect to the symmetry assumption by considering a symmetric case against various shapeless mixed distributions close to the symmetric case.

Comments on Table 1-4. Comparing the rows 3 and 5 of Table 1 with the rows 2 and 5 of Table 2 in BMV, it appears that our estimator is clearly less unstable than the estimator proposed by these authors when is the pdf. Table 2 summarizes the performance of our method in slightly shapeless situation where is the pdf of the distribution satisfying and , for all . When ( is a symmetric bimodal pdf with mean 0 and variance equal to 1) it is then interesting to compare the performance of our method, see row 1 of Table 2, with its performances in the similar Gaussian case, see row 4 in Table 1, the noticeable fact being that the variance of is smaller in the Gaussian case. When the bias of is badly affected when the standard deviations of the estimators is stable. The results provided in Table 3 seems to show that the heavy tails of the Cauchy distribution have essentially a bad influence on the standard deviation of . Comparing Table 1 and Table 4 it appears that the peak on the graph of the Laplace pdf helps to estimate the parameter but do not work in favor of the other parameters.

Rainfall dataset. In this paragraph we propose to study the performances of our method when compared to the results obtained in BMV. We have implemented the Gauss kernel estimator with bandwidth , , and used in (8), instead of , the estimator . When is the Gauss kernel, we explicitly have

 fn(x)=1nn∑k=1∫RQ(bn,^θn;u)[^pncos(u(Xk−x−^αn))+(1−^pn)cos(u(Xk−x−^βn))]du,

where

 Q(θ,b;u):=12π×e−b2u2/22p2−2p+1+2p(1−p)cos(u(α−β)).

The results provided by our method are , , and the behavior of the functional estimators is summarized in Figure 3. Before commenting the good performances of our estimator in Figure 3, it is crucial to notice that the reconstruction of the pdf by coincides with itself, according to (8-11) and replacing by . This basic phenomenon is illustrated in Figure 2.

As mentioned in Section 2.2, the function is not necessarily a pdf due to its negative part (coming from the small size of and the fact that model (4) is not necessarily the true underlying model), hence it is needed to regularize into which leads to consider, on this real dataset, . This modification explains the fact the graph of does not match exactly the graph of .

Actually we observe that the graph of fits almost perfectly the graph of in the interval , when it generates an extra bump in the interval [-20,0]. Nethertheless when comparing our graphs to the graphs obtained in BMV (including a comparison with the two-component Gaussian mixture model), we observe that we both have the extra bump issue on the intervall [-20,0], on the other hand we better estimate the two first bumps appearing on the graph of within the interval . We think that our methodological approach performs better than the existing one, mainly because we do not symmetrize our functional estimator in order to mimic as much as possible the shape of (which shapeless is precisely the reason why , see Figure 2).

## 5 Auxiliary results and Proofs

Let us use the notation for the Euclidean norm of a vector and for any matrix in .

###### Lemma 1
1. For all , we have

 max{supθ∈Θ|Zk(θ,u)|,supθ∈Θ|J(θ,u)|}≤21−2P,

for any from 1 to .

2. For all , we have

 max{supθ∈Θ∥˙Zk(θ,u)∥,supθ∈Θ∥˙J(θ,u)∥}≤4(1+|u|)(1−2P)2,

for any from 1 to .

3. For all , we have

 ∥¨Zk(θ,u)∥2≤C(1+|u|+u2)(1−2P)3,

for some absolute constant , for any and for any from 1 to .

Proof. 1. It is easy to see that and that

 |J(θ,u)|≤2∣∣∣g∗(u)M(θ,u)∣∣∣≤2(1−2P).

2. We note that

 ˙Zk(θ,u)=−eiuXkM2(θ,u)⎛⎜⎝eiuα−eiuβiupeiuαiu(1−p)eiuβ⎞⎟⎠+e−iuXkM2(θ,−u)⎛⎜⎝e−iuα−e−iuβ−iupe−iuα−iu(1−p)e−iuβ⎞⎟⎠,

and that

 E[˙Zk(θ,u)]=˙J(θ,u)=−g∗(u)M2(θ,u)⎛⎜⎝eiuα−eiuβiupeiuαiu(1−p)eiuβ⎞⎟⎠+g∗(−u)M2(θ,−u)⎛⎜⎝e−iuα−e−iuβ−iupe−iuα−iu(1−p)e−iuβ⎞⎟⎠.

We have

 ∥˙J(θ,u)∥ = ∥g∗(u)M2(θ,u)˙M(θ,u)+g∗(−u)M2(θ,−u)˙M(θ,−u)∥ ≤ 1(1−2P)2(2(22+p2u2+(1−p)2u2))1/2≤4(1+|u|)(1−2P)2

and the same goes for .

3. We write briefly

 ¨Zk(θ,u) = −eiuXkM2(θ,u)¨M(θ,u)+e−iuXkM2(θ,−u)¨M(θ,−u) +2eiuXkM3(θ,u)˙M(θ,u)⋅˙M(θ,u)⊤−2e−iuXkM3(θ,−u)˙M(θ,−u)⋅˙M(θ,−u)⊤.

We deduce our bound from above.

###### Lemma 2
1. For all , we have

 ∥˙Zk(θ,u)−˙Zk(θ′,u)∥≤∥θ−θ′∥⋅C(1+|u|+u2)(1−2P)3,

for any and any from 1 to .

2. For all , we have

 ∥¨Zk(θ,u)−¨Zk(θ′,u)∥2≤∥θ−θ′∥⋅C(1+|u|+u2+|u|3)(1−2P)4,

for some absolute constant , for any and for any from 1 to .

Proof. The proof uses a Taylor expansion and bounds from and similar to the Lemma 1.

Proof of Proposition 2. It is easy to see that . Therefore the estimation bias is

 |E(Sn(θ))−S(θ)| = 14∫|u|>1/h(g∗(u)M(θ,u)−¯g∗(u)¯M(θ,u))2dW(u) ≤ ∫|u|>1/h(Img∗(u)M(θ,u))2dW(u) ≤ 1(1−2p)2∫|u|>1/h|g∗(u)|2dW(u).

If we assume , for some and , then

 |E(Sn(θ))−S(θ)| ≤ h2βL(1−2P)2,h→0. (12)

We have for the variance

 Var(Sn(θ)) = 116E⎡⎢⎣⎛⎝1n(n−1)n∑j≠k,j,k=1∫|u|≤1/h(Zj(θ,u)Zk(θ,u)−J2(θ,u))dW(u)⎞⎠2⎤⎥⎦.

It decomposes in , where

 Tn = Vn = E⎡⎣(2nn∑k=1∫|u|≤1/h(Zk(θ,u)−J(θ,u))J(θ,u)dW(u))2⎤⎦

Indeed, random variables in the previous sums are uncorrelated. Let us study the asymptotic behavior of these terms. On the one hand,

 Tn = 1n(n−1)E⎡⎣(∫|u|≤1/h(Z1(θ,u)−J(θ,u))(Z2(θ,u)−J(θ,u))dW(u))2⎤⎦ ≤ 1n(n−1)E⎡⎣(∫|u|≤1/hZ1(θ,u)Z2(θ,u)dW(u))2⎤⎦≤16(1−2P)4n2,

since from Lemma 1 we have . In addition,

 Vn = 4nE⎡⎣(∫|u|≤1/hZ1(θ,u)J(θ,u