Semiparametric mixtures of symmetric distributions
Abstract
We consider in this paper the semiparametric mixture of two distributions that are equal up to a shift parameter. The model is said to be semiparametric in the sense that the mixed distribution is not assumed to belong to a parametric family. In order to ensure the identifiability of the model it is assumed that the mixed distribution is symmetric, the model being then defined by the mixing proportion, two location parameters, and the probability density function of the mixed distribution. We propose a new class of estimators of these parameters based on a Fourier approach, and prove that they are consistent under mild regularity conditions. Their finite-sample properties are illustrated by a Monte Carlo study, and a benchmark real dataset is also studied with our method.
AMS 2000 subject classifications. Primary 62G05, 62G20; secondary 62E10.
Key words and phrases. Asymptotic normality, consistency, contrast estimators, Fourier transform, identifiability, inverse problem, semiparametric, two-component mixture model.
1 Introduction
The probability density functions (pdf) of $d$-variate $m$-component mixture models are defined by
(1) $g(x)=\sum_{j=1}^{m}\pi_j f_j(x),\qquad x\in\mathbb{R}^d,$
where the unknown mixing proportions (positive and summing to one) and the unknown component pdfs are to be estimated. Generally the component pdfs are supposed to belong to a parametric family of density functions, turning the inference problem for model (1)
into a purely parametric estimation problem. There exists an extensive literature on this subject, including the monographs of
Everitt and Hand (1981), Titterington
et al. (1985) or McLachlan and Peel (2000), which provide a good overview of the existing methods in this case, such as maximum likelihood, minimum chi-square, the method of moments, Bayesian approaches, etc. Note that the estimation of the number of components
in model (1) may also be a crucial issue, leading to various rates of convergence for maximum likelihood estimators, as discussed
by Chen (1995). In that case, model selection is an important topic; see for example Dacunha-Castelle & Gassiat (1999), Lemdani & Pons (1999), and Leroux (1992).
In addition, the choice of a parametric family for the component pdfs may be difficult when little information is available about each subpopulation.
However, model (1) is generally nonparametrically nonidentifiable without additional assumptions. This is no longer true when training data
are available from each subpopulation; see for example Cerrito (1992), Hall (1981), Lancaster & Imbens (1996), Murray & Titterington (1978), and Qin (1999).
Hall and Zhou (2003) first considered the case where no parametric assumptions
are made about the component pdfs involved in model (1). These authors looked at variate mixtures of two distributions, each having independent components, and proved that, under mild regularity conditions, their model is identifiable when . They propose in addition
consistent estimators of the univariate marginal cumulative distribution functions and the mixing proportion. Even if model
(1) is not nonparametrically identifiable, there exist, for and , many real datasets in the statistical literature for which such a model is used under parametric assumptions on the component pdfs, such as the Old Faithful dataset, see Azzalini & Bowman (1990), which corresponds to time measurements (in minutes) between eruptions of the Old Faithful geyser in Yellowstone National Park, USA. Another famous example deals with average amounts of precipitation (rainfall) in inches for United States cities (from the Statistical Abstract of the United States, 1975; see McNeil (1977)). These datasets are both included in the R statistical package.
To model this type of data ( and ) from a semiparametric point of view, Bordes, Mottelet & Vandekerkhove (2006) (BMV for short) and Hunter, Wang & Hettmansperger (2007) (HWH for short) both proposed to consider i.i.d. sample data drawn from a common pdf satisfying
(2) $g(x)=p\,f(x-\mu_1)+(1-p)\,f(x-\mu_2),\qquad x\in\mathbb{R},$
where $\theta=(p,\mu_1,\mu_2)$, with $p\in(0,1)$ and $\mu_1\neq\mu_2$, and $f$ is an unknown pdf. When $f$ is supposed to be symmetric about zero, that is $f(-x)=f(x)$ for all $x\in\mathbb{R}$, the above authors proposed estimation methods based on the cumulative distribution function (cdf) in order to estimate separately the Euclidean and functional parts of model (2). The crucial part of their work deals with the identifiability of model (2) under the sole symmetry assumption on $f$. Their basic results are established in BMV, Theorem 2.1, and HWH, Theorems 1 and 2 and Corollary 1. The mixed density in (2) can also be seen as the density of i.i.d. observations in a convolution model:
(3) $X_i=Y_i+\varepsilon_i,\qquad i=1,\dots,n,$
where the $Y_i$'s are i.i.d. with common pdf $f$ and independent of the i.i.d. errors $\varepsilon_i$, whose discrete law is such that $P(\varepsilon_i=\mu_1)=p$ and $P(\varepsilon_i=\mu_2)=1-p$. The previous results mean that, if is known and is supposed to be symmetric about 0, then we can identify the law of the errors and estimate nonparametrically the pdf . Let us notice that the mixture problem in (2) and the deconvolution problem in (3) are the same: both are inverse problems with an unknown operator (i.e. convolution with an unknown law supported on unknown points). In particular, when , and , according to Theorem 2.1 in BMV, such a model is identifiable if the Euclidean parameter , where , and the mixed density is symmetric about 0. When , BMV prove, under mild conditions, that both the Euclidean parameter and the cumulative distribution function in model (2) are estimated almost surely at the rate , for all (see Theorems 3.3 and 3.4). When or 3, HWH prove, under mild conditions, the strong consistency of their estimator and establish, under very technical conditions, its asymptotic normality (see Theorems 3 and 4 therein).
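The equivalence between the mixture representation (2) and the convolution representation (3) is easy to check by simulation. The sketch below uses illustrative values $p=0.25$, $\mu_1=1$, $\mu_2=2$ and a standard Gaussian $f$ (all our own choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu1, mu2 = 200_000, 0.25, 1.0, 2.0

# Convolution form (3): X = Y + eps, with Y ~ f and eps a two-point error
Y = rng.standard_normal(n)                          # f: N(0,1), symmetric about 0
eps = np.where(rng.random(n) < p, mu1, mu2)         # P(eps = mu1) = p, P(eps = mu2) = 1 - p
X = Y + eps

# Under the equivalent mixture form (2), E[X] = p*mu1 + (1 - p)*mu2
expected_mean = p * mu1 + (1 - p) * mu2
print(abs(X.mean() - expected_mean) < 0.02)         # True
```

Sampling a two-point error and adding it to a draw from $f$ is exactly sampling from the two-component mixture, which is why the two inference problems coincide.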
In this paper we propose a new estimation method. Let us first recall that BMV propose an iterative procedure to invert the operator, together with a contrast based on the cdf and on the symmetry of the underlying unknown pdf . HWH introduce a contrast based on the cdf of the observations and estimate the Euclidean parameter using the symmetry property of the unknown pdf . Here, we use Fourier analysis to invert the operator and observe that, under identifiability assumptions, the inverse problem is well posed. Then we construct a contrast based on the characteristic function of our data, which allows us to estimate when is symmetric. This contrast is a functional of which is estimated by a U-statistic of order 2 at the parametric rate, under very mild smoothness assumptions on (Sobolev smoothness larger than 1/4). Our procedure is easier to implement and yields a central limit theorem for the estimator of under much simpler conditions than those of Theorem 4 in HWH. Moreover, we define a kernel estimator of the pdf and prove that it attains the same nonparametric rate as in the direct problem of density estimation: the inverse problem does not affect the pointwise rate of convergence of the density estimator. Our estimators and convergence results generalize to the mixture model with components, as soon as the model satisfies identifiability assumptions. Such assumptions are known for only; see Corollary 1 in HWH.
The paper is organized as follows: in Section 2 we propose a contrast function based on a Fourier transform of the observations' pdf and derive our estimation method; in Section 3 we present our main asymptotic results, which concern the rate of convergence of the Euclidean part of the parameter, and show that the classical nonparametric rate of convergence is achieved by our inverse Fourier nonparametric estimator; in Section 4 we propose a Monte Carlo study of our estimators on several simulated examples and implement our method on a real dataset dealing with average amounts of precipitation (rainfall) in inches for United States cities, see McNeil (1977); Section 5 is dedicated to auxiliary results and proofs.
2 Estimation procedure
We observe $n$ independent, identically distributed random variables $X_1,\dots,X_n$ having common pdf $g$ in the model
(4) $g(x)=p\,f(x-\mu_1)+(1-p)\,f(x-\mu_2),\qquad x\in\mathbb{R},$
where $\theta^0$ denotes the unknown true value of the Euclidean parameter $\theta=(p,\mu_1,\mu_2)$ and $f$ is an unknown, symmetric pdf in a large nonparametric class of functions.
For identifiability reasons, let $\theta$ belong to a compact set $\Theta$. In particular, there are positive constants, smaller than $1/2$, bounding the proportion $p$ away from $0$ and from $1/2$.
Note that in the case $p=1/2$ we can still identify the center of symmetry of the mixture but not the individual location parameters. As this case reduces to the estimation of the location of an unknown symmetric pdf, as in Beran (1978), we do not consider it further on.
From now on, we denote by $f^*(u)=\int e^{iux}f(x)\,dx$ the Fourier transform of an integrable function $f$, and recall that if $f^*$ is itself integrable we have the inversion formula $f(x)=\frac{1}{2\pi}\int e^{-iux}f^*(u)\,du$.
Let us denote $q(\theta,u)=p\,e^{iu\mu_1}+(1-p)\,e^{iu\mu_2}$, for all $u\in\mathbb{R}$ and $\theta=(p,\mu_1,\mu_2)\in\Theta$, and note that it cannot vanish as soon as $p\neq 1/2$: it is enough to notice that $|q(\theta,u)|\geq|p-(1-p)|=|2p-1|>0$ for all $u$.
The contrast uses the symmetry of the underlying, unknown pdf $f$. For the first time in the literature on mixture models, we relate the symmetry of $f$ to the fact that its Fourier transform has no imaginary part. More precisely, in model (4) we have $g^*(u)=q(\theta^0,u)f^*(u)$, so that, when $f$ is supposed to be symmetric about $0$ (hence $f^*$ is real-valued), we can hope that $\mathrm{Im}\big(g^*(u)/q(\theta,u)\big)=0$ for all $u$ if and only if $\theta=\theta^0$. This basic result is formally stated in the following theorem.
Theorem 1
Consider model (2) with $f$ symmetric about 0 and $p\neq 1/2$. Then we have $\mathrm{Im}\big(g^*(u)/q(\theta,u)\big)=0$ for all $u$, for some $\theta\in\Theta$, if and only if $\theta=\theta^0$.
Proof. Notice that for all such that we explicitly have
for all . As , we get that is null in a neighborhood of , which leads, following the proof of Theorem 2.1 in BMV, to the desired result .
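Theorem 1 can be illustrated numerically. The sketch below assumes the reconstructed notation $q(\theta,u)=p\,e^{iu\mu_1}+(1-p)\,e^{iu\mu_2}$, a standard Gaussian $f$ (so that $f^*$ is real-valued) and parameter values of our own choosing; it checks that $g^*/q(\theta,\cdot)$ is purely real at the true $\theta$ but not at a shifted one.

```python
import numpy as np

p0, m1, m2 = 0.25, 1.0, 2.0                  # illustrative true parameter (our choice)
u = np.linspace(-5.0, 5.0, 201)

fstar = np.exp(-u**2 / 2)                    # Fourier transform of N(0,1): real, since f is symmetric
q = lambda p, a, b: p * np.exp(1j * u * a) + (1 - p) * np.exp(1j * u * b)
gstar = q(p0, m1, m2) * fstar                # Fourier transform of the mixture density g

# At the true parameter the ratio equals f*, hence has no imaginary part ...
print(np.allclose((gstar / q(p0, m1, m2)).imag, 0.0))          # True
# ... while at a shifted (wrong) parameter an imaginary part appears.
print(np.abs((gstar / q(p0, 1.5, 2.5)).imag).max() > 0.1)      # True
```

Here the wrong parameter shifts both locations by 0.5, so the ratio picks up a factor $e^{-iu/2}$ and a nonzero imaginary part, in line with the theorem.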
Assuming $g$ known, we can recover the true value $\theta^0$ of the Euclidean parameter by minimizing the discrepancy measure defined by
(5) $S(\theta)=\int\Big[\mathrm{Im}\Big(\frac{g^*(u)}{q(\theta,u)}\Big)\Big]^2\,d\mu(u),$
where $\mu$ is a Lebesgue-absolutely continuous probability measure supported by $\mathbb{R}$.
Note that we can also write
From now on, $\bar z$ denotes the complex conjugate of $z\in\mathbb{C}$.
Proposition 1
The function $S$ in (5) is a contrast function, i.e. $S(\theta)\geq 0$ for all $\theta\in\Theta$, and $S(\theta)=0$ if and only if $\theta=\theta^0$.
Proof. The Fourier transform being continuous, the same holds for . By Theorem 1, if there exists such that , and there exist and such that on . It follows that
Otherwise, if , it is straightforward to check that .
Discussion. We point out that results similar to Theorem 1 and Proposition 1 can be established for model (2) when , under sufficient identifiability conditions. Indeed, in that case, it is enough to replace by and by , and to check that the analog of Theorem 1 can be established following the proof of Lemma A.1, under the conditions provided in Corollary 1, in HWH. Finally, estimators similar to those in Sections 2.1 and 2.2, and asymptotic results like those established in Section 3 for , can be obtained with a little extra work for .
2.1 Contrast minimization for the Euclidean parameter
Let the estimator of be the following M-estimator
(6) $\hat\theta_n=\arg\min_{\theta\in\Theta}S_n(\theta),$
where , depending on some parameter (small with ), is the following estimator of
(7) 
The estimator is inspired by kernel estimators of quadratic functionals of the pdf, as previously studied in Butucea (2007). It is written here in the Fourier domain. It is known that by removing the diagonal terms in the double sum (i.e. keeping only terms with distinct indices) the bias is reduced with respect to the plug-in estimator obtained by inserting an estimator of into .
Let us denote by
Then it is easy to see that
and that .
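A minimal numerical sketch of the estimator of this subsection, under two assumptions of ours: the contrast takes the reconstructed form $S(\theta)=\int \mathrm{Im}^2\big(g^*(u)/q(\theta,u)\big)\,d\mu(u)$, and $\mu$ is discretized as Gaussian weights on a grid. The empirical characteristic function replaces $g^*$, and the diagonal terms $j=k$ are removed from the double sum, as described above:

```python
import numpy as np

def S_n(theta, X, u=np.linspace(-4.0, 4.0, 81)):
    """Order-2 U-statistic estimate of the contrast, diagonal terms removed (sketch)."""
    p, m1, m2 = theta
    w = np.exp(-u**2 / 2)
    w = w / w.sum()                                     # discretized weight measure mu (our choice)
    q = p * np.exp(1j * u * m1) + (1 - p) * np.exp(1j * u * m2)
    a = (np.exp(1j * np.outer(X, u)) / q).imag          # a[j] = Im(e^{iuX_j} / q(theta, u))
    n = len(X)
    off_diag = a.sum(axis=0) ** 2 - (a**2).sum(axis=0)  # sum over j != k of a_j * a_k
    return float(np.dot(off_diag, w)) / (n * (n - 1))

rng = np.random.default_rng(1)
n, theta0 = 2000, (0.25, 1.0, 2.0)
X = rng.standard_normal(n) + np.where(rng.random(n) < theta0[0], theta0[1], theta0[2])

# The estimated contrast is smaller at the true parameter than at a wrong one.
print(S_n(theta0, X) < S_n((0.25, 1.5, 2.5), X))        # True
```

Removing the diagonal makes the statistic unbiased for the squared population quantity, which is why its expectation vanishes at the true parameter.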
2.2 Kernel-based nonparametric estimator
After estimating the Euclidean parameter, we want to estimate the nonparametric function . We suggest using cross-validation for a kernel estimator as follows. We denote by the leave-one-out estimator of , which uses the sample without the th observation. Then we plug this into the classical nonparametric kernel estimator whenever the unknown is required. This procedure gives, in the Fourier domain,
(8) 
where the kernel ( and ) and the bandwidth are properly chosen. Note that it belongs to and , and has an inverse Fourier transform, which we denote by . Therefore, the estimator of is
(9) 
It is important to notice at this step, that the estimator is obtained by inversion of a nonparametric kernel estimator
(10) 
with kernel and bandwidth . The inversion is done in Fourier domain with the estimated instead of the true :
When dealing with the rainfall dataset studied in Section 4, we propose to consider, as in BMV, the version of the estimator (which has a negative part due to the small number of observations) defined by
(11) 
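The inversion step described in this subsection can be sketched as follows. This is our own reconstruction, not the paper's code: it uses the sinc kernel, whose Fourier transform is the indicator of $[-1,1]$, divides the empirical characteristic function by $q$, and plugs in the true $\theta$ for simplicity instead of the estimated one:

```python
import numpy as np

def f_hat(x, X, theta, h=0.3):
    """Sketch of the Fourier-inversion density estimator with a sinc kernel."""
    p, m1, m2 = theta
    u = np.linspace(-1.0 / h, 1.0 / h, 401)     # K*(hu) = 1 on [-1/h, 1/h], 0 outside
    du = u[1] - u[0]
    ecf = np.exp(1j * np.outer(u, X)).mean(axis=1)          # empirical characteristic function
    q = p * np.exp(1j * u * m1) + (1 - p) * np.exp(1j * u * m2)
    fstar_hat = (ecf / q).real                  # f symmetric => f* real; keep the real part
    return float((np.exp(-1j * u * x) * fstar_hat).real.sum() * du / (2 * np.pi))

rng = np.random.default_rng(2)
n, theta0 = 5000, (0.25, 1.0, 2.0)
X = rng.standard_normal(n) + np.where(rng.random(n) < theta0[0], theta0[1], theta0[2])
print(f_hat(0.0, X, theta0))                    # close to the N(0,1) value 1/sqrt(2*pi) ~ 0.399
```

Since $|q|$ is bounded below when $p\neq 1/2$, the division is stable, which is the concrete sense in which this inverse problem is well posed.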
3 Main results
Let us state first several assumptions.
Assumption A Let be a cumulative distribution function of some random variable which admits finite absolute moments up to the third order:
Assumption B We assume that the underlying probability density belongs to a ball of radius in the Sobolev space of functions having smoothness :
where denotes the Fourier transform of the function .
The weight function has been introduced to ensure integrability of our estimator of the criterion and of its derivatives with respect to . It is completely arbitrary, and it may help compute numerically the values of our integrals by Monte Carlo simulation, but it slightly reduces the asymptotic efficiency of . We could have used integrals with respect to the Lebesgue measure for higher efficiency of , but this would require stronger smoothness and moment assumptions on the unknown probability density function .
Proposition 2
An easy consequence of the theorem is that as .
Moreover, if we choose , the squared bias of is asymptotically negligible compared to its variance, so the mean squared error converges at the rate as soon as .
Theorem 2
The estimator defined in (6) converges in probability to the true value of the Euclidean parameter as .
Theorem 3
The next theorem gives upper bounds for the rate of convergence of the nonparametric estimator of at some fixed point , over Sobolev classes of functions. The main message of the theorem is that, if , then the nonparametric rates for density estimation are attained, provided a correct choice of the parameters and . This might seem surprising, but it is again related to the fact that the inverse problem under consideration is well posed and that the estimation of the Euclidean parameter does not affect the nonparametric rate for estimating .
Theorem 4
Let the estimator of be defined in (6), and let the estimator of at some fixed point be defined in (9), with , for some , and a kernel in and in , with Fourier transform having support included in .
If ,
for some constant which depends on and on .
We can choose an arbitrary point and write
The lower bounds are known in the case of density estimation from direct observations, see for example results for more general Besov classes of functions in Härdle et al. (1998). They generalize easily to our case, with fixed .
4 Simulations
We implement our method and study its behaviour on samples of size . The mean behaviour of our estimator of is assessed by replicating the same experiment times. We consider an underlying symmetric density that is either Gaussian, Cauchy or Laplace. We give the mean value of the estimated parameter and its standard deviation in Tables 1, 3 and 4, respectively. We also plot the nonparametric estimator of the underlying density against the true density in Figure 1.
We see that the smaller is, the smaller the standard deviation of . This is indeed intuitively clear, as a larger value represents a larger fraction of data sampled from the second population, that is, a larger amount of information about the population located at .
We note that the previously proposed estimation methods based on the distribution function usually require finite moments up to some order. These methods cannot deal with the Cauchy density that we consider here, see Table 3, whereas our method handles it well: it is based on the Fourier transform, which is fast decreasing in this case. We also consider the non-smooth Laplace (double exponential) density, see Table 4. Its Fourier transform is slowly decreasing, but we chose the weight function so as to deal with this problem. Therefore, all integrals have a relatively small support of integration and the computation is fast enough.
Table 1 (Gaussian mixed density).
n  True parameter  Empirical means  Standard deviations
100  (0.05, 1, 2)  (0.0808, 1.0398, 2.0181)  (0.0477, 0.3038, 0.1354) 
100  (0.10, 1, 2)  (0.1205, 1.0433, 1.9990)  (0.0478, 0.2829, 0.1569) 
100  (0.15, 1, 2)  (0.1609, 0.9874, 2.0093)  (0.0406, 0.2964, 0.1455) 
100  (0.25, 1, 2)  (0.2389, 0.9848, 1.9458)  (0.0407, 0.2936, 0.2059) 
100  (0.35, 1, 2)  (0.3338, 1.0049, 1.9278)  (0.0439, 0.3151, 0.2200) 
100  (0.45, 1, 2)  (0.4194, 0.9836, 1.9683)  (0.0362, 0.2996, 0.2727) 
In Table 2 we illustrate the sensitivity of our method with respect to the symmetry assumption by considering a symmetric case against various asymmetric mixed distributions close to the symmetric case.
Table 2 (sensitivity to the symmetry assumption).
n  Shape parameter  Empirical means  Standard deviations
100  0.5  (0.2302, 1.0153, 1.9420)  (0.0390, 0.2949, 0.2627) 
100  0.55  (0.2299, 1.0206, 1.9639)  (0.0418, 0.3319, 0.2693) 
100  0.6  (0.2330, 0.9703, 1.9637)  (0.0402, 0.3134, 0.2808) 
100  0.65  (0.2289, 0.9938, 2.0434)  (0.0399, 0.2572, 0.2744) 
Table 3 (Cauchy mixed density).
n  True parameter  Empirical means  Standard deviations
100  (0.2, 1, 5)  (0.1987, 0.9888, 5.0116)  (0.0620, 0.3127, 0.2199) 
100  (0.2, 1, 2)  (0.1915, 1.1103, 1.9728)  (0.0580, 0.2374, 0.2630) 
100  (0.2, 1, 1.5)  (0.2068, 1.0815, 1.5358)  (0.0588, 0.2267, 0.2219) 
100  (0.2, 1, 1.2)  (0.2092, 1.0890, 1.1871)  (0.0626, 0.2398, 0.2452) 
Table 4 (Laplace mixed density).
n  True parameter  Empirical means  Standard deviations
100  (0.05, 1, 2)  (0.0520, 0.9768, 2.0034)  (0.0280, 0.4276, 0.1704) 
100  (0.15, 1, 2)  (0.1518, 0.9765, 1.9769)  (0.0317, 0.4109, 0.1802) 
100  (0.25, 1, 2)  (0.2447, 1.0103, 1.9886)  (0.0290, 0.4423, 0.2056) 
100  (0.35, 1, 2)  (0.3432, 0.9602, 1.9407)  (0.0297, 0.4014, 0.2344) 
100  (0.45, 1, 2)  (0.4300, 0.9710, 1.9547)  (0.0315, 0.4114, 0.3158) 
Comments on Tables 1–4. Comparing rows 3 and 5 of Table 1 with rows 2 and 5 of Table 2 in BMV, it appears that our estimator is clearly more stable than the estimator proposed by these authors when is the pdf. Table 2 summarizes the performance of our method in slightly asymmetric situations where is the pdf of the distribution satisfying and , for all . When ( is then a symmetric bimodal pdf with mean 0 and variance equal to 1), it is interesting to compare the performance of our method, see row 1 of Table 2, with its performance in the similar Gaussian case, see row 4 of Table 1; the noticeable fact is that the variance of is smaller in the Gaussian case. When , the bias of is badly affected while the standard deviations of the estimators remain stable. The results provided in Table 3 seem to show that the heavy tails of the Cauchy distribution essentially inflate the standard deviation of . Comparing Table 1 and Table 4, it appears that the peak in the graph of the Laplace pdf helps to estimate the parameter but does not work in favor of the other parameters.
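The Monte Carlo protocol of this section can be sketched as follows, reusing the reconstructed contrast of Section 2.1 and minimizing it with a generic Nelder–Mead search. All tuning choices below (sample size, weight grid, starting point, optimizer) are ours, not the authors':

```python
import numpy as np
from scipy.optimize import minimize

def S_n(theta, X, u=np.linspace(-4.0, 4.0, 81)):
    """Reconstructed U-statistic contrast (see Section 2.1 sketch)."""
    p, m1, m2 = theta
    if not 0.0 < p < 0.5:
        return np.inf                           # stay in the identifiable region p < 1/2
    w = np.exp(-u**2 / 2)
    w = w / w.sum()
    q = p * np.exp(1j * u * m1) + (1 - p) * np.exp(1j * u * m2)
    a = (np.exp(1j * np.outer(X, u)) / q).imag
    n = len(X)
    return float(np.dot(a.sum(axis=0) ** 2 - (a**2).sum(axis=0), w)) / (n * (n - 1))

rng = np.random.default_rng(3)
theta0, n, R = (0.25, 1.0, 2.0), 100, 10        # R replications with samples of size n
estimates = []
for _ in range(R):
    X = rng.standard_normal(n) + np.where(rng.random(n) < theta0[0], theta0[1], theta0[2])
    res = minimize(S_n, x0=(0.3, 0.8, 2.2), args=(X,), method="Nelder-Mead")
    estimates.append(res.x)
est = np.array(estimates)
print(est.mean(axis=0))                         # empirical means, as reported in the tables
print(est.std(axis=0))                          # standard deviations
```

Averaging the replicated estimates and their spread reproduces the "empirical means / standard deviations" format of the tables above; a derivative-free optimizer is used because the discretized contrast is cheap to evaluate.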
Rainfall dataset. In this paragraph we compare the performance of our method with the results obtained in BMV. We have implemented the Gaussian kernel estimator with bandwidth , , and used in (8), instead of , the estimator . When is the Gaussian kernel, we explicitly have
where
The results provided by our method are , , and the behavior of the functional estimators is summarized in Figure 3. Before commenting on the good performance of our estimator in Figure 3, it is crucial to notice that the reconstruction of the pdf by coincides with itself, according to (8)–(11) and replacing by . This basic phenomenon is illustrated in Figure 2.
As mentioned in Section 2.2, the function is not necessarily a pdf, due to its negative part (coming from the small size of and the fact that model (4) is not necessarily the true underlying model); hence we need to regularize into , which leads us to consider, on this real dataset, . This modification explains why the graph of does not match exactly the graph of .
We actually observe that the graph of fits almost perfectly the graph of on the interval , while it generates an extra bump on the interval [-20,0]. Nevertheless, when comparing our graphs to the graphs obtained in BMV (including a comparison with the two-component Gaussian mixture model), we observe that both methods suffer from the extra bump issue on the interval [-20,0]; on the other hand, we better estimate the first two bumps appearing in the graph of within the interval . We think that our approach performs better than the existing one mainly because we do not symmetrize our functional estimator, so as to mimic as much as possible the shape of (whose asymmetry is precisely the reason why , see Figure 2).
5 Auxiliary results and Proofs
Let us use the notation for the Euclidean norm of a vector and for any matrix in .
Lemma 1

For all , we have
for any from 1 to .

For all , we have
for any from 1 to .

For all , we have
for some absolute constant , for any and for any from 1 to .
Proof. 1. It is easy to see that and that
2. We note that
and that
We have
and the same goes for .
3. We write briefly
We deduce our bound from above.
Lemma 2

For all , we have
for any and any from 1 to .

For all , we have
for some absolute constant , for any and for any from 1 to .
Proof. The proof uses a Taylor expansion together with bounds similar to those of Lemma 1.
Proof of Proposition 2. It is easy to see that . Therefore the estimation bias is
If we assume , for some and , then
(12) 
We have for the variance
It decomposes as , where
Indeed, random variables in the previous sums are uncorrelated. Let us study the asymptotic behavior of these terms. On the one hand,
since from Lemma 1 we have . In addition,