A transformation-based approach to Gaussianmixture density estimation for bounded data

# A transformation-based approach to Gaussian mixture density estimation for bounded data

## Abstract

Finite mixture of Gaussian distributions provide a flexible semi-parametric methodology for density estimation when the variables under investigation have no boundaries. However, in practical applications variables may be partially bounded (e.g. taking non-negative values) or completely bounded (e.g. taking values in the unit interval). In this case the standard Gaussian finite mixture model assigns non-zero densities to any possible values, even to those outside the ranges where the variables are defined, hence resulting in severe bias. In this paper we propose a transformation-based approach for Gaussian mixture modelling in case of bounded variables. The basic idea is to carry out density estimation not on the original data but on appropriately transformed data. Then, the density for the original data can be obtained by a change of variables. Both the transformation parameters and the parameters of the Gaussian mixture are jointly estimated by the Expectation-Maximisation (EM) algorithm. The methodology for partially and completely bounded data is illustrated using both simulated data and real data applications.

Keywords: Bounded support; Density estimation; EM algorithm; Gaussian mixture models; Range-power transformation.

## 1 Introduction

Density estimation is the problem of inferring a probability density function given a finite number of sample data points drawn from a population described by a probability distribution. Broadly speaking, three alternative approaches to density estimation can be distinguished. In the parametric approach a parametric distribution is assumed for the density with unknown parameters which are estimated by fitting the parametric function using the observed data. Conversely, in the nonparametric approach no density function is assumed a priori, but its form is completely determined by the data. Histograms and kernel density estimation (KDE) are two popular methods that belong to this class, and both are characterised by the number of parameters growing with the size of the dataset. Furthermore, extensions to higher dimensionality is problematic. A third approach is based on finite mixture models, where the unknown density is expressed as a convex combination of one or more probability density functions. In this class a popular model is the Gaussian mixture model which assumes the Gaussian distribution for the underlying component densities.

Gaussian mixture models (GMMs) can approximate any continuous density with arbitrary accuracy provided the model has a sufficient number of components and the parameters of the model are correctly estimated (Escobar and West, 1995; Roeder and Wasserman, 1997). However, the standard GMM for density estimation does not take into account whether or not a variable has bounded support.

Consider the graphs in Figure 1 which show some histograms for random samples drawn from two distributions, one bounded from below (top panels), and one having both lower and upper bounds (bottom panels). In these graphs boundaries are shown as vertical dots, true densities are represented as solid lines, and density estimates based on GMMs as dashed lines (see left panels of Figure 1). In both cases, the estimated densities are unsatisfactory at the boundaries, but also in the range of admissible values. A possible way to tackle these problems is to abandon the use of GMMs in favour of alternative component distributions. Another option is to remain in the realm of the Gaussian mixtures framework, but analyse the data in a transformed scale. The right panels of Figure 1 show the density estimates obtained with the GMDEB approach proposed in this paper. In both cases, the true underlying densities appear to be well approximated, and with natural boundaries constraints clearly satisfied. Figure 1: Histograms for random samples drawn from a χ2(3) (top panels) and a Beta(2,1.2) (bottom panels) distributions with the corresponding density functions (solid lines) and boundaries of the random variable (vertical dotted lines). Panels on the left show the estimated densities obtained by fitting a GMM on the original scale (blue dashed lines). Panels on the right show the densities estimated by the GMDEB transformation approach (blue thick lines).

In this paper a transformation-based approach to density estimation based on GMMs is proposed and discussed. The basic idea is to use an invertible function to map the bounded range of a variable to the whole real line, estimate the density of the transformed variable, and then back-transform to the original scale. This approach seems very natural and it has been around for a long time (see Wand et al., 1991; Marron and Ruppert, 1994, for KDE), but a simple and efficient implementation of this methodology is not yet available in the context of mixture density estimation. Note, however, that a similar approach based on Manly transformation has been recently proposed by Zhu and Melnykov (2016) for modelling skewed data in model-based clustering.

In Section 2 the GMMs approach to density estimation is reviewed. Then, Section 3 presents the proposed range-power transformation method for density estimation using GMMs in case of bounded variables. The model is described and the corresponding maximum likelihood estimates are derived through the EM algorithm. Section 4 contains the results of some simulation studies carried out to evaluate the proposed methodology and to compare with other available methods. In Section 5 some real-world datasets are analysed. The final section provides some concluding remarks.

## 2 Finite mixture modelling

### 2.1 Finite mixture for density estimation

Consider a vector of random variables taking values in the sample space with , and assume that the probability density function can be written as a finite mixture density of components of the form

 f(x;Ψ)=G∑g=1πgfg(x;θg),

where is the parameters vector. The mixing weights must satisfy the constraints for all , and . The th component density is usually taken as known except for the associated parameter(s) . Most applications assume that all component densities arise from the same parametric distribution family, although this need not be the case in general. In particular, a popular model specifies , where is the Gaussian density with mean and covariance matrix . Then, the Gaussian mixture model (GMM) can be written as

 f(x;Ψ)=G∑g=1πgϕ(x;μg,Σg), (1)

where in this case represents the entire parameters vector. Note that is an operator that forms a vector by extracting unique elements of a symmetric matrix.

In this paper we refer to (1) as the Gaussian mixture density estimate (GMDE) model. The usual nonparametric kernel density estimate (KDE) can be viewed as a mixture of components with uniform weights, i.e. (Titterington et al., 1985, pp. 28–29). Compared to KDE, finite mixture modelling uses a smaller number of components (i.e. less parameters), so it has smaller variance. Conversely, compared to parametric density estimation, finite mixture modelling has the advantage of (potentially) using more parameters, so introducing less estimation bias. There are also disadvantages related to mixture modelling, such as an increased learning complexity and lack of closed-form solution, so it needs to resort to numerical procedures (e.g. EM algorithm), and in certain cases there can be identifiability issues.

### 2.2 Estimation of Gaussian finite mixture model

Consider a random sample of observations on variables drawn from the mixture distribution in (1). Then, the log-likelihood is given by

 ℓ(Ψ)=n∑i=1logG∑g=1πgϕ(xi;μg,Σg). (2)

Direct maximisation of the log-likelihood function is not straightforward, so MLEs are usually obtained via the Expectation-Maximisation (EM) algorithm (Dempster et al., 1977; McLachlan and Peel, 2000).

An incomplete-data formulation of the mixture problem is introduced by associating to each observation a latent component-label vector (). This is a -dimensional vector, with the generic element or 0 according to whether or not arises from the th component of the mixture. Assuming independence of the complete-data vector , and the multinomial distribution for the component-label vectors s, the complete-data log-likelihood is given by

 ℓC(Ψ)=n∑i=1G∑g=1zig{logπg+logϕ(xi;μg,Σg)}. (3)

The log-likelihood (2) is maximised using the EM algorithm, an iterative algorithm that alternates two steps, called E-step and M-step, which guarantees, under fairly general conditions, the convergence to at least a local maximiser. The objective function at iteration of the EM algorithm is the conditional expectation of the complete-data log-likelihood (3), the so-called Q-function:

 Q(Ψ;Ψ(m))=n∑i=1G∑g=1ˆz(m)ig{logπg+logϕ(xi;μg,Σg)},

where , i.e. the estimated posterior probability at iteration of the EM algorithm, with the indicator function which equals 1 if the condition is fulfilled and 0 otherwise.

In the E-step the Q-function is evaluated, using the parameter values obtained at the previous step, to get the updated posterior probabilities

 ˆz(m+1)ig=ˆπ(m)gϕ(xi;ˆμ(m)g,ˆΣ(m)g)∑Gk=1ˆπ(m)kϕk(xi;ˆμ(m)k,ˆΣ(m)k).

Then, in the M-step the parameters vector is updated by maximising the -function given the previous values and the updated posterior probabilities , i.e.

 tˆΨ(m+1)=argmaxΨQ(Ψ;ˆΨ(m)).

In the case of a multivariate Gaussian mixture the M-step yields

 ˆπ(m+1)g=∑ni=1ˆz(m+1)ignandˆμ(m+1)g=∑ni=1ˆz(m+1)igxi∑ni=1ˆz(m+1)ig.

The update formula for the covariance matrix depends upon the structure of the within-component covariance matrices. Parsimonious parameterisation of the covariance matrices can be expressed through the eigendecomposition , where is a scalar controlling the volume of the corresponding ellipsoid, is a diagonal matrix specifying the shape of the density contours, and is an orthogonal matrix which determines the orientation of the ellipsoid (Banfield and Raftery, 1993; Celeux and Govaert, 1995). For instance, assuming an unconstrained covariance matrix, the updating formula is

Scrucca et al. (2016, Table 3) summarise some parameterisations of within-component covariance matrices, and the corresponding geometric characteristics, currently available in the mclust software. The previous unconstrained covariance matrix is indicated as VVV model. Note, however, that for some models no closed-formula is available, so numerical optimisation is required.

The EM algorithm requires the specification of initial values for the the parameters, say . Alternatively, an initial assignment of observations to the components of the mixture can be made, basically starting the EM algorithm from the M-step. In any case, the initialisation of the EM algorithm is often crucial because the likelihood surface tends to have multiple modes, although it usually produces sensible results when started from reasonable starting values (Wu, 1983, p. 150). For a further discussion on this point and a recent proposal see Scrucca and Raftery (2015).

Information criteria based on penalised forms of the log-likelihood are routinely used in finite mixture modelling for model selection, i.e. to decide how many components should be included in the mixture, but also which covariance parameterisations to adopt in the Gaussian case. Two popular criteria are the Bayesian information criterion (BIC; Schwartz, 1978; Fraley and Raftery, 1998) and the integrated complete-data likelihood criterion (ICL; Biernacki et al., 2000). When the goal is density estimation, Roeder and Wasserman (1997) showed that the GMDE model selected using BIC is a consistent estimator of the true density. If only the order of the mixture is needed, formal hypothesis testing can also be pursued by likelihood ratio test (LRT). However, standard regularity conditions do not hold for the null distribution of the LRT statistic to have its usual chi-squared distribution (McLachlan and Peel, 2000, Chap. 6), and significance must be assessed by resampling approaches. For a recent review see McLachlan and Rathnayake (2014), and for an implementation in the mclust software see Scrucca et al. (2016).

## 3 Methodology

### 3.1 Gaussian mixture density estimation for variables with bounded support

Let be a -variate random vector from a distribution with density having bounded support , and be some family of continuous monotonic transformations that map to the -dimensional real space. Then, we can write as the transformed set of variables with density having unbounded support .

Suppose that the density of the transformed data can be expressed through a Gaussian finite mixture density of the form

 g(y;Ψ)=G∑g=1πgϕ(y;μg,Σg).

If an estimate is available, then by the continuous change of variable theorem the density of the untransformed data can be obtained as

 ˆf(x)=ˆg(t(x;λ))⋅|J(t(x;λ))|, (4)

where is the Jacobian of the transformation, i.e. the determinant of the matrix of partial derivatives.

### 3.2 Range-power transformation for variables with bounded support

#### Lower bound case

Suppose is a univariate random variable with bounded support , where , and density . A preliminary range transformation can be defined as , which maps . Let be a continuous monotonic transformation that maps . For instance, adopting the well-known Box-Cox power transformation (Box and Cox, 1964) we may write

 t(˙x;λ)=⎧⎪⎨⎪⎩(x−l)λ−1λ% ifλ≠0log(x−l)ifλ=0, (5)

which has continuous first derivative equal to for any .

The original Box-Cox power transformation method is restricted to the univariate case, but it can be extended also to the multivariate case as described in Velilla (1993). However, further development of the multivariate case can greatly simplified by working in a coordinate-wise fashion. Thus, in this paper we propose the use of the range-power transformation in (5) for each dimension separately.

#### Lower and upper bound case

Suppose now that is a univariate random variable with bounded support in , where . Consider the preliminary range transformation which maps . As in the previous case, adopting a Box-Cox power transformation we can write

 t(˙x;λ)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩(x−lu−x)λ−1λifλ≠0log(x−lu−x)ifλ=0, (6)

with continuous first derivative given by

 t′(˙x;λ)=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩(x−lu−x)λ−1u−l(u−x)2ifλ≠01x−l+1u−xifλ=0.

Following the approach discussed above, the multivariate case can be tackled by working in a coordinate-wise fashion, hence applying the range-power transformation in (6) to each variable separately.

### 3.3 Estimation

Maximum likelihood estimation can be pursued via the EM algorithm under the assumption that the density on the transformed scale can be expressed as

 g(t(˙x;Ψ,λ))=G∑g=1πgϕ(t(˙x;λ);μg,Σg).

If the above equation holds, then the density function on the original scale is given by

 f(x;Ψ,λ)=G∑g=1πgϕ(t(˙x;λ);μg,Σg)|J(t(˙x;λ)|,

where and is the Jacobian of the transformation. Note that as consequence of the coordinate independent approach to multivariate range-power transformation, the matrix of first derivatives is diagonal, so the Jacobian reduces to

 J(t(˙x;λ))=det[∂t(˙x;λ)∂x]=p∏j=1∂t(˙xj;λj)∂xj.

The conditional expectation of the complete log-likelihood given the observed data can be expressed as

 Q(Ψ;Ψ(m))=n∑i=1G∑g=1ˆz(m)ig{logπg+logϕ(t(˙xi;λ);μg,Σ)+log|J(t(˙xi;λ))|},

where . Therefore, in the E-step the posterior probabilities are updated using

 ˆz(m+1)ig=ˆπ(m)gϕ(t(˙xi;ˆλ(m));ˆμ(m)g,ˆΣ(m)g)∑Gg=1ˆπ(m)gϕ(t(˙xi;ˆλ(m));ˆμ(m)g,ˆΣ(m)g).

In the M-step the parameters are updated by maximising the -function given the previous values of the parameters and the updated posterior probabilities. This can be done in two steps. In the first step an updated value is computed by numerically maximising the Q-function with respect to because no closed-form expression is available. To this goal, a Newton-type numerical optimisation algorithm can be used. In our implementation we used the L-BFGS-B method of Byrd et al. (1995) available in the optim() function for the R statistical software. The remaining parameters are then obtained as in standard EM algorithm but accounting for the updated transformation parameters , i.e.

 ˆπ(m+1)g=∑ni=1ˆz(m+1)ignandˆμ(m+1)g=∑ni=1ˆz(m+1)igt(˙xi;ˆλ(m+1))∑ni=1ˆz(m+1)ig.

Again, the update formula for the covariance matrix depends upon the assumed eigendecomposition model. In the most general case of an unconstrained covariance matrix, i.e. the VVV model, we have

 ˆΣ(m+1)g=∑ni=1ˆz(m+1)ig(t(˙xi;ˆλ(m+1))−ˆμ(m+1)g)(t(˙xi;ˆλ(m+1))−ˆμ(m+1)g)⊤∑ni=1ˆz(m+1)ig.

Initialisation of the above EM algorithm is obtained by first estimating the optimal marginal transformations, then using the final classification from a -means algorithm on the range-power transformed variables. This initial partition of data points is used to start the algorithm from the M-step. Finally, the EM algorithm is stopped when the log-likelihood improvement falls below a specified tolerance value or a maximum number of iterations is reached.

## 4 Simulation studies

In this section we present some simulation studies designed to compare the proposed GMDEB approach to some density estimators for bounded variables discussed in the literature. The comparison is based on the integrate squared error (ISE):

 ISE(ˆf)=∫[ˆf(x)−f(x)]2dx,

where is the unknown true density and is its estimate based on a random sample of observations. Thus, the is a measure of discrepancy between the true and the estimated density based on a squared loss criterion. It is equal to 0 when the estimated density perfectly coincides with the true density, and increases as the differences between the two densities get larger. For more details see Scott (2009, Sec. 2.3). In the following sections the ISE is computed via numerical integration.

### 4.1 Distributions with lower bound

The univariate densities with lower bound support considered in this study are shown in Figure 2. They are all bounded at zero with different degrees of skewness, positive for the first three densities and negative for the last one. Figure 2: Univariate densities with lower bound considered in the simulation study.

The proposed method (GMDEB) is compared with the following density estimators:

• simpleKern which refers to the simple boundary correction method proposed by Jones (1993) which is equivalent to a kernel weighted local linear fitting near the boundary;

• reflectKern which indicates the reflection method of Schuster (1985) which amounts to reflect the observed data points at the origin, then a density estimate is obtained using this augmented dataset with a simple correction to ensure that integrate to one;

• cutnormKern which is the cut and normalisation method of Gasser and Müller (1979) where the kernel is truncated at the boundary and re-normalised to unity;

• logtransKern which is the method proposed by Marron and Ruppert (1994) which fits a kernel density estimator on the log-scale and then back-transforms the result with an explicit normalisation step;

• logSpline which estimates a density using cubic splines to approximate the log-density using knots located as described in Stone et al. (1997);

• GaMixDE which estimates a density by fitting a mixture of Gamma densities using cubic splines to approximate the log-density using knots located as described in Stone et al. (1997);

• GMDE which is the standard density estimate from GMMs with no boundary correction.

The first four methods mentioned above are implemented in the evmix R package (Scarrott et al., 2018; Hu and Scarrott, 2018), whereas the logSpline estimator is available in the logspline R package (Kooperberg, 2016). For GaMixDE the code is available in the R package mixtools (Benaglia et al., 2009; Young et al., 2017), whereas GMDE is obtained from the mclust R package (Fraley et al., 2017). In the last two cases the number of mixture components is selected using the BIC criterion.

Figure 3 graphically summarises the simulation results obtained on 1000 replications. Overall the GMDEB estimator appears to be able to approximate the true density better than the other KDE methods with boundary correction, in particular when the sample size is small. The proposed approach also shows less variability, which decreases as the sample size increases. Clearly the GMDEB approach appears to be inferior to the Gamma mixture density estimator, although not by much, when the true density belongs to the Gamma family of distributions (i.e. in the first two cases), but it is better in the last two cases, in particular for the left-skewed Gompertz distribution. Figure 3: Boxplots of ISE distribution from 1000 replications of the simulation study for the selected univariate densities with lower bound.

### 4.2 Distributions with lower and upper bounds

For the case of univariate densities with both lower and upper bounds support, a list of distributions considered in the simulation study is shown in Figure 4. The first two settings involve the Beta distribution with parameters selected to produce a symmetric case and a skewed case. The last two settings consider two further asymmetric cases, the Kumaraswamy distribution with density on , and the logarithmic peak distribution with density on . Figure 4: Univariate densities with lower and upper bounds considered in the simulation study.

The cases described above should provide a broad spectrum of examples for comparing different estimators. To this goal the proposed method (GMDEB) is compared with the following density estimators:

• beta1Kern, beta2Kern which use the Beta and modified Beta kernels proposed by Chen (1999) followed by a renormalisation to ensure a proper density;

• copulaKern which uses the bivariate Gaussian copula based kernels of Jones and Henderson (2007);

• logSpline which fits a density using cubic splines to approximate the log-density using knots located as described in Stone et al. (1997);

• BeMixDE which estimates the density by fitting a mixture of Beta distributions

The first two estimators are available in the R package evmix (Scarrott et al., 2018; Hu and Scarrott, 2018). The logSpline estimator is available in the logspline R package (Kooperberg, 2016). For the BeMixDE the betareg R package (Grün et al., 2012) is used with the number of mixture components selected using BIC.

Figure 5 reports the simulation results obtained on 1000 replications. By looking at the boxplots the GMDEB approach appears to be more accurate than the other non-parametric density estimators. Furthermore, its accuracy is slightly lower than the Beta mixture density estimator in the first three cases (which, however, are all cases related to the Beta distribution), but is better in the last case. Overall, GMDEB seems to provide robust reliable density estimates when both lower and upper bounds are present. Figure 5: Boxplots of ISE distribution from 1000 replications of the simulation study for the selected univariate densities with lower and upper bounds.

## 5 Real data analyses

### 5.1 Suicide data

Consider the suicide dataset which has been analysed by many authors (see for instance Silverman, 1986; Karunamuni and Alberts, 2005). The dataset provides the lengths of 86 spells of psychiatric treatment undergone by control patients in a suicide study. By its nature, the measured variable can only take positive values.

The GMDEB model with the largest BIC is the simple single component model (E,1) with estimated transformation parameter . The corresponding density estimate on the original scale is reported on the left panel of Figure 6, which is then compared on the remaining panels to some boundary-corrected versions of kernel density estimators (center panel), and to the logSpline and the Gamma mixture density estimates (right panel). As it can be seen from these graphs, the GMDEB estimate is able to well approximate the histogram-based empirical distribution without including any artifacts that appear in all the other density estimates. For instance, kernel-based methods tend to underestimate near the origin and show a slight bump in the right tail of the distribution. On the contrary, the logSpline and the Gamma mixture density estimates tend to overestimate near the origin. Figure 6: Histogram of suicide data with GMDEB density estimate on the left panel. The remaining panels graphically compare the GMDEB estimate with some other density estimates, such as kernel-based with boundary corrections, logSpline and Gamma mixture.

### 5.2 Acidity data

This dataset provides the values of an acidity index (acid-neutralizing capacity, ANC) measured in a sample of 155 lakes in North-Central Wisconsin. Several authors have previously analysed the data using a mixture of Gaussian distributions on the log-scale (Crawford et al., 1992; Crawford, 1994; Richardson and Green, 1997; McLachlan and Peel, 2000).

We analyse the data in the original scale because the proposed method automatically selects the “optimal” transformation and takes into account the implicit lower bound of the index that can not assume negative values. From the left panel of Figure 7 we can see that according to BIC the best model is the one with two mixture components having different variances (V,2), closely followed by models (E,2) and (V,3). The right panel of Figure 7 shows the histogram of the data and the density estimated with model (V,2) using the GMDEB approach with transformation parameter . The density obtained appears to accurately follow the distribution of the data, indicating the presence of two separated skewed distributions having different dispersions, smaller for the component close to the origin and larger for higher values of ANC. This is also confirmed by the graphs in Figure 8, which show the component densities scaled by the estimated prior probabilities (left panel) and the estimated posterior probabilities . Using the standard cut-off value of , lakes with ANC smaller than about 232 are assigned to the first group, otherwise to the second group. This is in agreement with previous findings. Figure 7: Plot of BIC values for different number of mixture components and within-component variances (left panel). Histogram of acidity data with GMDEB density estimate (right panel). Figure 8: Plot of estimated mixture density and rescaled component densities (left panel), and corresponding estimated posterior probabilities for the acidity data (right panel).

### 5.3 Racial data

Geenens (2013) presented an analysis on data giving the proportion of white student enrolled in 56 school districts in Nassau County (Long Island, New York), for the 1992–1993 school year. The density estimate for this dataset should only be supported on the range. See also Simonoff (1996, Sec. 3.2).

The selected model on the transformed scale is (E,1) with , and the corresponding density estimate on the original scale is shown in Figure 9. This can be compared with the Beta density obtained by estimating the parameters with maximum likelihood. The latter density seems to put too much emphasis close to the upper boundary and in the middle values of the distribution, while completely missing the bulk of the data between 70% and 90% of white students. On the contrary, the proposed approach provides a density estimate which correctly identify the majority of the data with at least 70% of white students, but also the small peak near the lower boundary containing schools with almost 0% white students. These findings largely agree with those reported in Geenens (2013, Fig. 3). Figure 9: GMDEB density estimate for the racial data (solid line) and Beta density with parameters estimated by maximum likelihood (dashed line).

### 5.4 Plasma data

Consider the data from a study on the association between the low plasma concentrations of retinol, beta-carotene, or other carotenoids on the increased risk of developing certain types of cancer (Nierenberg et al., 1989). The joint distribution of plasma Retinol (ng/ml) and plasma beta-carotene (ng/ml) is bounded below at zero for both variables. The left panel of Figure 10 shows the scatterplot of data points observed on 314 patients.

If the bivariate density is estimated using the the standard GMM, the model with the largest BIC has 3 components and unconstrained covariance matrix (VVV). This relatively large number of components is related to the presence of a strong skewness in the data distribution. Furthermore, a non-negligible mass of density is assigned to negative values of plasma beta-carotene.

Both issues can be solved using the proposed range-transformation approach. The selected model for the bivariate density estimation has BIC , with a diagonal equal variance structure and a single component (EII,1). The transformation parameters are estimated as . The right panel of Figure 10 shows the highest density regions (HDRs; Hyndman, 1996) corresponding to proportions . In this case the joint data distribution appears to be well approximated by the estimated density. Figure 10: Plot of data points for the plasma dataset (left panel) and the corresponding bivariate density estimated using the GMDEB approach (right panel). In the latter case, the graph shows the highest density regions corresponding to proportions (0.25,0.5,0.75,0.9).

### 5.5 C-horizon layer of the Kola data

The Kola Ecogeochemistry Project (1993-1998) collected data on more than 50 chemical elements on four different primary sample materials: terrestrial moss, and the O-, B-, and C-horizon of podzolic soils located in parts of northern Finland, Norway and Russia. The main aim of the project was the documentation of the impact of the Russian nickel industry on the Arctic environment. The data are available on Reimann et al. (1998), see also Reimann et al. (2011). Here we analyse the distribution of nickel (Ni), copper (Cu), and chromium (Cr) on the C-horizon layer. For each of the 605 sites, the detected concentrations of the above mentioned heavy metals are provided. Clearly, concentrations are bounded below at zero, and a preliminary data exploration suggests that the joint distribution is highly skewed. Figure 11: Scatterplot matrix of chromium (Cr), copper (Cu), and nickel (Ni) concentrations on the C-horizon layer. The density estimated using the GMDEB approach is shown using HDRs corresponding to 25%, 50%, and 75% probability regions.

The selected GMDEB model according to the BIC criterion is a two components mixture model with variable volume and equal shape and orientation, i.e. VEE in mclust nomenclature. The estimated vector of transformation parameters is . Figure 11 contains the scatterplot matrix of heavy metal concentrations with the estimated density projected onto the marginal bivariate subspaces. The latter are shown as HDRs corresponding to 25%, 50%, and 75% probability regions. The distribution of nickel, copper, and chromium on the C-horizon layer is clearly skewed, with most sites having concentrations close to the origin. However, there are also a number of sites with relatively high concentrations. Further insights can be obtained by examining the distribution of metal concentrations conditional on the HDR to which the observed sites belong, as shown in Figure 12. Looking at the boxplots for the conditional distributions, we can see that the central part of the distribution, i.e. that corresponding to 0-25% HDR, is characterised by the lowest concentration levels, whereas higher concentrations of heavy metals can be found as we move to regions of lower density. Figure 12: Boxplots for the distributions of heavy metals conditional on HDR regions from the GMDEB density estimate.

## 6 Conclusion

This paper addresses the problem of density estimation using GMMs when variables are partially or completely bounded. By introducing a range-power transformation of the data, it is possible to obtain a GMM for density estimation on the transformed data, and then to derive an accurate estimate of the density on the original scale which takes into account the natural bounds of the variables. The proposed model is estimated by maximum likelihood using the EM algorithm. We showed that this transformation-based approach is able to deal with variables having both lower and upper bounds, and the results obtained are often better than those provided by other methods usually based on modified versions of kernel density estimation. The transformation-based approach seems to be very promising and, in principle, it can be applied to other types of non-Gaussian variables, such as in the presence of skewness, and for other purposes outside density estimation, for instance in clustering. We defer these very important issues to future works.

### References

1. Banfield, J. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
2. Benaglia, T., Chauveau, D., Hunter, D. R., and Young, D. (2009). mixtools: An R Package for Analyzing Finite Mixture Models. Journal of Statistical Software 32, 1–29.
3. Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725.
4. Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 26, 211–252.
5. Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16, 1190–1208.
6. Celeux, G. and Govaert, G. (1995). Gaussian Parsimonious Clustering Models. Pattern Recognition 28, 781–793.
7. Chen, S. X. (1999). Beta kernel estimators for density functions. Computational Statistics & Data Analysis 31, 131–145.
8. Crawford, S. L. (1994). An application of the Laplace method to finite mixture distributions. Journal of the American Statistical Association 89, 259–267.
9. Crawford, S. L., DeGroot, M. H., Kadane, J. B., and Small, M. J. (1992). Modeling lake-chemistry distributions: Approximate Bayesian methods for estimating a finite-mixture model. Technometrics 34, 441–453.
10. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39, 1–38.
11. Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the american statistical association 90, 577–588.
12. Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41, 578–588.
13. Fraley, C., Raftery, A. E., and Scrucca, L. (2017). mclust: Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation. URL https://CRAN.R-project.org/package=mclust, R package version 5.4.
14. Gasser, T. and Müller, H.-G. (1979). Kernel estimation of regression functions. Springer.
15. Geenens, G. (2013). Probit transformation for kernel density estimation on the unit interval. Journal of the American Statistical Association 109, 346–358.
16. Grün, B., Kosmidis, I., and Zeileis, A. (2012). Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned. Journal of Statistical Software 48, 1–25.
17. Hu, Y. and Scarrott, C. (2018). evmix: An R package for Extreme Value Mixture Modeling, Threshold Estimation and Boundary Corrected Kernel Density Estimation. Journal of Statistical Software 84, 1–27.
18. Hyndman, R. J. (1996). Computing and graphing highest density regions. The American Statistician 50, 120–126.
19. Jones, M. C. (1993). Simple boundary correction for kernel density estimation. Statistics and Computing 3, 135–146.
20. Jones, M. C. and Henderson, D. A. (2007). Miscellanea kernel-type density estimation on the unit interval. Biometrika 94, 977–984.
21. Karunamuni, R. J. and Alberts, T. (2005). On boundary correction in kernel density estimation. Statistical Methodology 2, 191–212.
22. Kooperberg, C. (2016). logspline: Logspline Density Estimation Routines. URL https://CRAN.R-project.org/package=logspline, R package version 2.1.9.
23. Marron, J. S. and Ruppert, D. (1994). Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 653–671.
24. McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
25. McLachlan, G. J. and Rathnayake, S. (2014). On The Number Of Components In A Gaussian Mixture Model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4, 341–355.
26. Nierenberg, D. W., Stukel, T. A., Baron, J. A., Dain, B. J., Greenberg, E. R., et al. (1989). Determinants of plasma levels of beta-carotene and retinol. American Journal of Epidemiology 130, 511–521.
27. Reimann, C., Äyräs, M., Chekushin, V., Bogatyrev, I., Boyd, R., et al. (1998). Environmental Geochemical Atlas of the Central Barents Region. Geological Survey of Norway, Trondheim, Norway .
28. Reimann, C., Filzmoser, P., Garrett, R., and Dutter, R. (2011). Statistical Data Analysis Explained: Applied Environmental Statistics With R. John Wiley & Sons.
29. Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59, 731–792.
30. Roeder, K. and Wasserman, L. (1997). Practical Bayesian Density Estimation Using Mixtures of Normals. Journal of the American Statistical Association 92, 894–902.
31. Scarrott, C., Hu, Y., Akbar, A., and of Canterbury, U. (2018). evmix: Extreme Value Mixture Modelling, Threshold Estimation and Boundary Corrected Kernel Density Estimation. URL https://CRAN.R-project.org/package=evmix, R package version 2.10.
32. Schuster, E. F. (1985). Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics-Theory and methods 14, 1123–1136.
33. Schwartz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 31–38.
34. Scott, D. W. (2009). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2nd edition.
35. Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2016). mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. R Journal 8, 205–233.
36. Scrucca, L. and Raftery, A. E. (2015). Improved initialisation of model-based clustering using Gaussian hierarchical partitions. Advances in Data Analysis and Classification 4, 447–460.
37. Silverman, B. W. (1986). Density Estimation. Chapman & Hall/CRC.
38. Simonoff, J. (1996). Smoothing Methods in Statistics. Springer.
39. Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics 25, 1371–1470.
40. Titterington, D. M., Smith, A. F., and Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley, Chichester; New York.
41. Velilla, S. (1993). A note on the multivariate Box-Cox transformations to normality. Statistics and Probability Letters 17, 441–451.
42. Wand, M. P., Marron, J. S., and Ruppert, D. (1991). Transformations in density estimation. Journal of the American Statistical Association 86, 343–353.
43. Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics 11, 95–103.
44. Young, D., Benaglia, T., Chauveau, D., and Hunter, D. (2017). mixtools: Tools for Analyzing Finite Mixture Models. URL https://CRAN.R-project.org/package=mixtools, R package version 1.1.0.
45. Zhu, X. and Melnykov, V. (2016). Manly transformation in finite mixture modeling. Computational Statistics & Data Analysis .
Comments 0
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters Loading ...
239056 You are asking your first question!
How to quickly get a good answer:
• Keep your question short and to the point
• Check for grammar or spelling errors.
• Phrase it like a question Test
Test description