# Theoretical Grounding for Estimation in Conditional Independence Multivariate Finite Mixture Models

Xiaotian Zhu and David R. Hunter
Data Science and Statistics, AbbVie, North Chicago, IL; Statistics Department, Pennsylvania State University, University Park, PA
Emails: xiaotian.zhu@abbvie.com, dhunter@stat.psu.edu
###### Abstract

For the nonparametric estimation of multivariate finite mixture models with the conditional independence assumption, we propose a new formulation of the objective function in terms of penalized smoothed Kullback-Leibler distance. The nonlinearly smoothed majorization-minimization (NSMM) algorithm is derived from this perspective. An elegant representation of the NSMM algorithm is obtained using a novel projection-multiplication operator, a more precise monotonicity property of the algorithm is discovered, and the existence of a solution to the main optimization problem is proved for the first time.

Keywords: mixture model; penalized smoothed likelihood; majorization-minimization.

Mathematics Subject Classification: 62G05; 62H30

## 1 Introduction

In recent years, several studies have advanced the development of estimation algorithms, based on expectation-maximization (EM) and its generalization called majorization-minimization (MM), for nonparametric estimation of conditional independence multivariate finite mixture models. The idea for these algorithms had its genesis in the stochastic EM algorithm of Bordes et al. (2007) and was later extended to a deterministic algorithm by Benaglia et al. (2009) and Benaglia et al. (2011). These algorithms were placed on a more stable theoretical foundation due to the ascent property established by Levine et al. (2011). A detailed account of these algorithms, along with the related theory of parameter identifiability, is presented in the survey article by Chauveau et al. (2015). This paper follows up on this line of research, extending the theoretical foundations of this method and deriving novel results while also simplifying their formulation.

Conditional independence multivariate finite mixture models have fundamental importance in both statistical theory and applications; for example, as Chauveau et al. (2015) point out, these models are related to the random-effects models of Laird and Ware (1982). The basic setup assumes that the $r$-dimensional vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ form a simple random sample from a finite mixture of $m$ components with positive mixing proportions $\lambda_1, \dots, \lambda_m$ that sum to 1 and density functions $f_1, \dots, f_m$. Here, we assume $m$ is known. For recent work that addresses the estimation of $m$, along with a different approach to the estimation of the model parameters than the one outlined here, see Bonhomme et al. (2014) and Kasahara and Shimotsu (2014).

The conditional independence assumption, which arises naturally in the analysis of data with repeated measurements, says that each $f_j$ equals the product of its univariate marginal densities $f_{j,1}, \dots, f_{j,r}$. Thus, the mixture density is

$$g(\mathbf{x})=\sum_{j=1}^{m}\lambda_j f_j(\mathbf{x})=\sum_{j=1}^{m}\lambda_j\prod_{k=1}^{r}f_{j,k}(x_k) \tag{1}$$

for any $\mathbf{x}=(x_1,\dots,x_r)\in\mathbb{R}^r$. This is often regarded as a semiparametric model with $\lambda_1,\dots,\lambda_m$ being the Euclidean parameters and the $f_{j,k}$ being the functional parameters. Let $\theta$ denote all of these parameters.

The identifiability of the parameters in the model (1) was not clear until the breakthrough in Hall and Zhou (2003), which established identifiability when $m=2$ and $r\ge 3$. Some follow-up work appeared, for example, Hall et al. (2005) and Kasahara and Shimotsu (2009), until the fundamental result establishing generic identifiability of (1) for $r\ge 3$ was obtained (Allman et al., 2009) based on an algebraic result of Kruskal (1976, 1977).

Bordes et al. (2007) proposed a stochastic nonparametric EM (npEM) algorithm for the estimation of semiparametric mixture models. Benaglia et al. (2009) and Benaglia et al. (2011) proposed a deterministic version of the algorithm for the estimation of (1) and studied the related bandwidth selection problem. However, all of these algorithms lack an objective function as well as the descent property that characterizes any traditional EM algorithm (Dempster et al., 1977). A significant improvement comes from Levine et al. (2011), which proposes a smoothed likelihood as the objective function and leads to a smoothed version of the npEM that does possess the desired descent property. The authors point out the similarities between their approach and the one in Eggermont (1999) for non-mixtures. However, the constraints imposed by the condition that each $f_{j,k}$ must integrate to one lead to tricky optimization issues and necessitate a slightly awkward normalization step to satisfy these constraints. In reformulating the parameter space, the current paper removes the constraints and provides a rigorous justification for the algorithm, proving the existence of a solution to the main optimization problem for the first time. In addition, this paper sharpens the descent property by deriving a positive lower bound on the size of the decrease in the objective function at each iteration.

## 2 Reframing the Estimation Problem

In the following, we first consider an ideal setting where the target density $g$ is known (i.e., the sample size is infinite). We then replace the target density by its empirical version and obtain the discrete algorithm.

### 2.1 Setup and Notation

Let $r$ be a positive integer and let $g$ denote a target density on $\mathbb{R}^r$, with support in the interior of $\Omega$, where $\Omega$ is a compact and convex set in $\mathbb{R}^r$. Without loss of generality, assume $\Omega$ is the closed $r$-dimensional cube $[0,1]^r$. We are interested in the case when $g$ is a finite mixture of products of fully unspecified univariate measures, with unknown mixing parameters.

We make the following assumptions:

(i) Let the number of mixture components in $g$ be fixed and denoted by $m$. There exist non-negative functions $e_j$, $1\le j\le m$, such that

$$g(\mathbf{x})=\sum_{j=1}^{m}e_j(\mathbf{x}). \tag{2}$$

(ii) For each $j$,

$$e_j(\mathbf{x})=\theta_j\prod_{k=1}^{r}e_{j,k}(x_k), \tag{3}$$

where $\theta_j>0$ and, for each $k$, $e_{j,k}$ is positive with support in $[0,1]$. Hence each $e_j$ is in $L^1(\mathbb{R}^r)$, positive, and with support in $\Omega$.

Given a bandwidth $h>0$, let $s_h(\cdot,\cdot)$ be a nonnegative function with support in $[0,1]^2$, such that

(iii) For any $v,z\in[0,1]$,

$$\int s_h(v,u)\,du=\int s_h(u,z)\,du=1. \tag{4}$$

(iv) There exist positive numbers $M_1(h)$ and $M_2$ such that for any $v,z\in[0,1]$,

$$M_1(h)\le s_h(v,z)\le M_2. \tag{5}$$

(v) The function $s_h$ has continuous first-order partial derivatives on $[0,1]^2$ and there exists a constant $B$ such that for any $u,x\in[0,1]$,

$$\left|\frac{\partial}{\partial v}s_h(v,z)\Big|_{v=u}\right|\le B \quad\text{and}\quad \left|\frac{\partial}{\partial z}s_h(v,z)\Big|_{z=x}\right|\le B. \tag{6}$$

(vi) If we define $f_j(\mathbf{x})=\prod_{k=1}^{r}e_{j,k}(x_k)$, then

$$f_j(\mathbf{x})\ge(M_1(h))^r \tag{7}$$

for all $\mathbf{x}\in\Omega$ and for each $j$.

Before stating the optimization problem, we define the smoothing operators $S_h$, $S_h^*$, and $N_h$ as follows.

For any $f\in L^1(\mathbb{R}^r)$, let

$$(S_h f)(\mathbf{x})=\int\tilde{s}_h(\mathbf{x},\mathbf{u})f(\mathbf{u})\,d\mathbf{u} \quad\text{and}\quad (S_h^* f)(\mathbf{x})=\int\tilde{s}_h(\mathbf{u},\mathbf{x})f(\mathbf{u})\,d\mathbf{u}, \tag{8}$$

where

$$\tilde{s}_h(\mathbf{x},\mathbf{u})=\prod_{k=1}^{r}s_h(x_k,u_k) \quad\text{for }\mathbf{x},\mathbf{u}\in\mathbb{R}^r. \tag{9}$$

Furthermore, let

$$(N_h f)(\mathbf{x})=\begin{cases}\exp\left[(S_h^*\log f)(\mathbf{x})\right] & \text{for }\mathbf{x}\in\Omega,\\ 0 & \text{elsewhere.}\end{cases} \tag{10}$$

These smoothing operators are well known and have many desirable properties (Eggermont, 1999). For instance, Lemma 1.1 of Eggermont (1999) states that for any nonnegative functions $g_1$ and $g_2$ in $L^1(\mathbb{R}^r)$,

$$KL(S_h g_1,S_h g_2)\le KL(g_1,g_2), \tag{11}$$

where $KL(\cdot,\cdot)$ is the Kullback-Leibler divergence defined by

$$KL(g_1,g_2)=\int\left[g_1\log\frac{g_1}{g_2}+g_2-g_1\right]. \tag{12}$$
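For intuition, here is a minimal numerical sketch (ours, not the paper's) of these objects for the case $r=1$: it discretizes $[0,1]$ and uses a Gaussian-shaped kernel matrix, row-normalized over the grid, as a stand-in for $s_h$. All names are illustrative.

```python
import numpy as np

def make_kernel(grid, h):
    """Gaussian-shaped kernel matrix; rows normalized so each row
    approximately satisfies the first normalization in Eq. (4)."""
    K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / h) ** 2)
    K /= K.sum(axis=1, keepdims=True) * (grid[1] - grid[0])
    return K

def S_h(K, f, dx):
    """Linear smoother of Eq. (8): (S_h f)(x) = ∫ s_h(x,u) f(u) du."""
    return K @ f * dx

def N_h(K, f, dx):
    """Nonlinear smoother of Eq. (10): exp[(S_h* log f)(x)], f > 0."""
    return np.exp(K.T @ np.log(f) * dx)

def KL(g1, g2, dx):
    """Generalized Kullback-Leibler divergence of Eq. (12), Riemann sum."""
    return np.sum(g1 * np.log(g1 / g2) + g2 - g1) * dx
```

Note that $N_h$ smooths the logarithm of $f$ rather than $f$ itself, which is the source of the term "nonlinearly smoothed" in NSMM.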

### 2.2 Main Optimization Problem

Now, we assume conditions (i) through (vi) and propose to estimate $\mathbf{e}=(e_1,\dots,e_m)$ by minimizing the function

$$l(\mathbf{e})=-\int g(\mathbf{x})\log\sum_{j=1}^{m}(N_h e_j)(\mathbf{x})\,d\mathbf{x}+\int\sum_{j=1}^{m}e_j(\mathbf{x})\,d\mathbf{x} \tag{13}$$

subject to these conditions. In fact, the only assumptions that impose any constraints on $\mathbf{e}$ are (ii) and (vi). Minimization of $l(\mathbf{e})$ can be written equivalently as minimization of the penalized smoothed Kullback-Leibler divergence

$$KL\left(g,\sum_{j=1}^{m}(N_h e_j)\right)+\int\left[\sum_{j=1}^{m}e_j-\sum_{j=1}^{m}(N_h e_j)\right](\mathbf{x})\,d\mathbf{x}, \tag{14}$$

where the second term of (14) acts as a roughness penalty.

The discrete version of the optimization problem replaces $g$ by $g_n$, where $g_n$ is the empirical distribution of a random sample $\mathbf{x}_1,\dots,\mathbf{x}_n$ of size $n$; in this case we minimize

$$l_{\text{discrete}}(\mathbf{e})=-\frac{1}{n}\sum_{i=1}^{n}\log\sum_{j=1}^{m}(N_h e_j)(\mathbf{x}_i)+\int\sum_{j=1}^{m}e_j(\mathbf{x})\,d\mathbf{x}. \tag{15}$$

Although we do not constrain $\mathbf{e}$ by requiring that the sum of all $e_j$ be a density, as Equation (2) would demand, this property is guaranteed by the main optimization:

###### Theorem 2.1.

Any solution $\tilde{\mathbf{e}}$ to (13) or (15) satisfies

$$\int\sum_{j=1}^{m}\tilde{e}_j(\mathbf{x})\,d\mathbf{x}=1. \tag{16}$$
###### Proof.

For any fixed $\mathbf{e}$, differentiation shows that the function $\alpha\mapsto l(\alpha\mathbf{e})$ is minimized at the unique value

$$\hat{\alpha}=1\bigg/\int\sum_{j=1}^{m}e_j(\mathbf{x})\,d\mathbf{x}. \tag{17}$$

Thus if $\tilde{\mathbf{e}}$ is a minimizer, then Equation (16) must hold. ∎
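One way to spell out the differentiation step, assuming $\int g=1$: by Equation (4), $S_h^*\log(\alpha e_j)=\log\alpha+S_h^*\log e_j$, so $N_h(\alpha e_j)=\alpha\,N_h e_j$ on $\Omega$, and hence

$$l(\alpha\mathbf{e})=-\log\alpha-\int g(\mathbf{x})\log\sum_{j=1}^{m}(N_h e_j)(\mathbf{x})\,d\mathbf{x}+\alpha\int\sum_{j=1}^{m}e_j, \qquad \frac{d}{d\alpha}\,l(\alpha\mathbf{e})=-\frac{1}{\alpha}+\int\sum_{j=1}^{m}e_j,$$

which vanishes exactly at $\hat{\alpha}$ of Equation (17); the second derivative $1/\alpha^2>0$ confirms a minimum.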

From (16), we see that for each $j$, $\int\tilde{e}_j(\mathbf{x})\,d\mathbf{x}$ can be interpreted as the mixing weight corresponding to the $j$th mixture component.

## 3 The NSMM Algorithm

In this section, we derive an iterative algorithm, using majorization-minimization (Hunter and Lange, 2004), to minimize the objective function of Equation (13). The algorithm, which we refer to as the nonlinearly smoothed majorization-minimization (NSMM) algorithm, coincides with that of Levine et al. (2011), despite the different derivation.

### 3.1 An MM Algorithm

Given a current estimate $\mathbf{e}^{(0)}$ satisfying assumptions (ii) and (vi), let us define

$$w_j^{(0)}(\mathbf{x})=\frac{(N_h e_j^{(0)})(\mathbf{x})}{\sum_{j'=1}^{m}(N_h e_{j'}^{(0)})(\mathbf{x})} \tag{18}$$

for $1\le j\le m$, noting that $\sum_{j=1}^{m}w_j^{(0)}(\mathbf{x})=1$. The concavity of the logarithm function gives

$$\begin{aligned}
l(\mathbf{e})-l(\mathbf{e}^{(0)}) &= -\int g(\mathbf{x})\log\sum_{j=1}^{m}\frac{(N_h e_j^{(0)})(\mathbf{x})}{\sum_{j'=1}^{m}(N_h e_{j'}^{(0)})(\mathbf{x})}\cdot\frac{(N_h e_j)(\mathbf{x})}{(N_h e_j^{(0)})(\mathbf{x})}\,d\mathbf{x}+\int\left(\sum_{j=1}^{m}e_j-\sum_{j=1}^{m}e_j^{(0)}\right)\\
&\le -\int g(\mathbf{x})\sum_{j=1}^{m}\frac{(N_h e_j^{(0)})(\mathbf{x})}{\sum_{j'=1}^{m}(N_h e_{j'}^{(0)})(\mathbf{x})}\cdot\log\frac{(N_h e_j)(\mathbf{x})}{(N_h e_j^{(0)})(\mathbf{x})}\,d\mathbf{x}+\int\left(\sum_{j=1}^{m}e_j-\sum_{j=1}^{m}e_j^{(0)}\right).
\end{aligned} \tag{19}$$

So if we let

$$b^{(0)}(\mathbf{e})=-\int g(\mathbf{x})\sum_{j=1}^{m}w_j^{(0)}(\mathbf{x})\log(N_h e_j)(\mathbf{x})\,d\mathbf{x}+\int\left(\sum_{j=1}^{m}e_j\right), \tag{20}$$

we obtain

$$l(\mathbf{e})-l(\mathbf{e}^{(0)})\le b^{(0)}(\mathbf{e})-b^{(0)}(\mathbf{e}^{(0)}). \tag{21}$$

Using the MM algorithm terminology of Hunter and Lange (2004), Inequality (21) means that $b^{(0)}$ may be said to majorize $l$ at $\mathbf{e}^{(0)}$, up to an additive constant. Minimizing $b^{(0)}(\mathbf{e})$ therefore yields a new estimate $\mathbf{e}^{(1)}$ satisfying

$$l(\mathbf{e}^{(1)})\le l(\mathbf{e}^{(0)}). \tag{22}$$

Thus, we now consider how to minimize $b^{(0)}(\mathbf{e})$, subject to the assumptions on $\mathbf{e}$ that were stated at the beginning. This is to be done component-wise. That is, for each $j$, we wish to minimize

$$\begin{aligned}
b_j^{(0)}(\mathbf{e}) &= -\int g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\int\tilde{s}_h(\mathbf{u},\mathbf{x})\log e_j(\mathbf{u})\,d\mathbf{u}\,d\mathbf{x}+\int e_j\\
&= -\iint g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\,\tilde{s}_h(\mathbf{u},\mathbf{x})\left[\sum_{k=1}^{r}\log e_{j,k}(u_k)+\log\theta_j\right]d\mathbf{u}\,d\mathbf{x}+\int\theta_j\prod_{k=1}^{r}e_{j,k}(u_k)\,d\mathbf{u}.
\end{aligned} \tag{23}$$

Up to an additive term that does not involve any $e_{j,k}$, Expression (23) is

$$-\sum_{k=1}^{r}\iint g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\,s_h(u_k,x_k)\log e_{j,k}(u_k)\,du_k\,d\mathbf{x}+\int\theta_j\prod_{k=1}^{r}e_{j,k}(u_k)\,d\mathbf{u}. \tag{24}$$

For any $k$ in $\{1,\dots,r\}$, we can view Expression (24) as an integral with respect to $du_k$. Differentiating the integrand with respect to $e_{j,k}(u_k)$ and equating the result to zero, Fubini's Theorem gives

$$\hat{e}_{j,k}(u_k)\propto\int g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\,s_h(u_k,x_k)\,d\mathbf{x}. \tag{25}$$

This tells us, according to (3), that

$$\hat{e}_j(\mathbf{u})=\alpha_j\prod_{k=1}^{r}\int g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\,s_h(u_k,x_k)\,d\mathbf{x} \tag{26}$$

for some constant $\alpha_j$. To find $\alpha_j$, we plug (26) into (23) and differentiate with respect to $\alpha_j$, which gives as a final result

$$\hat{e}_j(\mathbf{u})=\frac{\prod_{k=1}^{r}\int g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\,s_h(u_k,x_k)\,d\mathbf{x}}{\left[\int g(\mathbf{x})\,w_j^{(0)}(\mathbf{x})\,d\mathbf{x}\right]^{r-1}}. \tag{27}$$

To summarize, our NSMM algorithm starts with some initial estimate $\mathbf{e}^{(0)}$ satisfying assumptions (ii) and (vi), then iterates according to

$$\mathbf{e}^{(p+1)}(\mathbf{u})=G(\mathbf{e}^{(p)})(\mathbf{u}), \tag{28}$$

where $G$ performs the one-step update of Equation (27). In practical terms, NSMM is identical to the non-parametric maximum smoothed likelihood algorithm proposed in Levine et al. (2011). However, our derivation uses a simpler parameter space, and the normalization involved in each step of the algorithm is now a result of optimization. We have thus rigorously derived the NSMM algorithm as a special case of the majorization-minimization method.

In the discrete case, we replace the density $g$ by the empirical distribution $g_n$ defined by the sample; thus, assuming $\mathbf{e}^{(p)}$ is the current estimate, the algorithm iterates according to the following until convergence:

Majorization Step: For $i=1,\dots,n$ and $j=1,\dots,m$, compute

$$w_j^{(p)}(\mathbf{x}_i)=\frac{(N_h e_j^{(p)})(\mathbf{x}_i)}{\sum_{j'=1}^{m}(N_h e_{j'}^{(p)})(\mathbf{x}_i)}. \tag{29}$$

Minimization Step: Let

$$e_j^{(p+1)}(\mathbf{u})=\frac{\prod_{k=1}^{r}\sum_{i=1}^{n}\frac{1}{n}\,w_j^{(p)}(\mathbf{x}_i)\,s_h(u_k,x_{ik})}{\left(\sum_{i=1}^{n}\frac{1}{n}\,w_j^{(p)}(\mathbf{x}_i)\right)^{r-1}}. \tag{30}$$
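To make the two steps concrete, the following is a minimal numerical sketch of one discrete NSMM iteration, not the authors' implementation. It stores each component as $e_j^{(p)}=\lambda_j^{(p)}\prod_k f_{j,k}^{(p)}$ on a shared one-dimensional grid and uses a Gaussian-shaped kernel normalized over the grid, a simplification of conditions (4)-(6); all names are illustrative.

```python
import numpy as np

def nsmm_step(X, lam, F, grid, h):
    """One discrete NSMM iteration (Eqs. 29-30).
    X: (n, r) data; lam: (m,) mixing weights; F: (m, r, G) positive
    marginal densities evaluated on a common grid of G points."""
    n, r = X.shape
    m = lam.shape[0]
    dx = grid[1] - grid[0]

    def s_h(u, x):
        # Gaussian-shaped kernel s_h(u, x_ik), normalized over u (assumption)
        K = np.exp(-0.5 * ((u[:, None] - x[None, :]) / h) ** 2)
        return K / (K.sum(axis=0, keepdims=True) * dx)

    # log(N_h e_j)(x_i) = log lam_j + sum_k ∫ s_h(u, x_ik) log f_{j,k}(u) du
    logN = np.empty((n, m))
    for j in range(m):
        logN[:, j] = np.log(lam[j])
        for k in range(r):
            Kk = s_h(grid, X[:, k])               # (G, n)
            logN[:, j] += Kk.T @ np.log(F[j, k]) * dx

    # Majorization step, Eq. (29): normalized posterior-like weights
    W = np.exp(logN - logN.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)             # (n, m)

    # Minimization step, Eq. (30): weighted kernel estimate per coordinate
    new_lam = W.mean(axis=0)                      # = (1/n) sum_i w_j(x_i)
    new_F = np.empty_like(F)
    for j in range(m):
        for k in range(r):
            Kk = s_h(grid, X[:, k])
            fjk = Kk @ W[:, j] / W[:, j].sum()    # integrates to 1 on grid
            new_F[j, k] = np.maximum(fjk, 1e-300)
    return new_lam, new_F, W
```

Starting from any positive initialization, repeated calls implement Equations (29)-(30); the factorization into `lam` and `F` mirrors the decomposition $e_j=\theta_j\prod_k e_{j,k}$ of assumption (ii).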

### 3.2 The Projection-Multiplication Operator

The NSMM algorithm of Section 3.1 can be summarized in an elegant way using the projection-multiplication operator, defined as follows. For any nonnegative $f\in L^1(\mathbb{R}^r)$ with $\int f>0$, let the operator $P$, which factorizes $f$ as a product of marginal functions on $\mathbb{R}$, be defined by

$$(Pf)(\mathbf{x})=\frac{\prod_{k=1}^{r}\int_{\mathbb{R}^{r-1}}f(\mathbf{x})\,dx_1\cdots dx_{k-1}\,dx_{k+1}\cdots dx_r}{\left[\int f\right]^{r-1}}. \tag{31}$$

When $f$ is a density on $\mathbb{R}^r$, the right side of (31) simplifies because the denominator is 1. As the next lemma points out, the operator $P$ commutes with the operator $S_h$.

###### Lemma 3.1.

Assume $f$ is an integrable nonnegative function on $\mathbb{R}^r$ with support in a compact set. We have

$$(P\circ S_h)f=(S_h\circ P)f. \tag{32}$$
###### Proof.

See Appendix (A.1). ∎

Lemma 3.1 implies that $G$, which performs the one-step update of the NSMM algorithm, can be expressed concisely as

$$\left(G(\mathbf{e}^{(p)})\right)_j(\mathbf{u})=\left[P\circ S_h\left(g\cdot w_j^{(p)}\right)\right](\mathbf{u}) \tag{33}$$

for $1\le j\le m$. In the discrete or finite-sample case, $g_n$ places weight $1/n$ at each sampled point. Equation (33) therefore suggests a geometric intuition for $G$ in the discrete case, which is illustrated in Figure 1.
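As a numerical illustration (ours, not from the paper), the sketch below implements $P$ for $r=2$ on a uniform grid of $[0,1)^2$, using a periodic kernel so that both normalizations in (4) hold exactly in the discretization; names are illustrative.

```python
import numpy as np

def make_periodic_kernel(grid, h):
    """Circulant kernel on a uniform grid of [0,1): every row and column
    sum is equal, so both normalizations of Eq. (4) hold exactly."""
    D = np.abs(grid[:, None] - grid[None, :])
    D = np.minimum(D, 1.0 - D)                  # wrap-around distance
    K = np.exp(-0.5 * (D / h) ** 2)
    dx = grid[1] - grid[0]
    return K / (K[0].sum() * dx)

def S_h(K, f, dx):
    """Product-kernel smoother of Eq. (8) for r = 2 on a grid."""
    return K @ f @ K.T * dx * dx

def P(f, dx):
    """Projection-multiplication operator of Eq. (31) for r = 2:
    outer product of the two marginals over (total mass)^(r-1)."""
    mass = f.sum() * dx * dx
    m1 = f.sum(axis=1) * dx                     # marginal in x1
    m2 = f.sum(axis=0) * dx                     # marginal in x2
    return np.outer(m1, m2) / mass
```

On this grid, `P(S_h(K, f, dx), dx)` and `S_h(K, P(f, dx), dx)` agree to floating-point precision, mirroring Lemma 3.1.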

### 3.3 Sharpened Monotonicity

For any MM algorithm, including any EM algorithm (Dempster et al., 1977), the well-known monotonicity property of Inequality (22) says that at each iteration the value of the objective function moves in the direction of optimization (Hunter and Lange, 2004). For the NSMM algorithm, this descent property was first proved in Levine et al. (2011). In Proposition 3.2, we present a novel result that strengthens Inequality (22) by giving an explicit formula for the nonnegative difference $l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)})$.

###### Proposition 3.2.

In the continuous (infinite-sample) version of the NSMM algorithm, at any step $p$, we have

$$l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)})=\sum_{j=1}^{m}KL\left(e_j^{(p+1)},e_j^{(p)}\right)+\sum_{j=1}^{m}KL\left(g\cdot w_j^{(p)},\,g\cdot w_j^{(p+1)}\right). \tag{34}$$
###### Proof.

See Appendix (A.2). ∎

###### Remark 1.

The discrete version of Proposition 3.2 is

$$l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)})=\sum_{j=1}^{m}KL\left(e_j^{(p+1)},e_j^{(p)}\right)+\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}w_j^{(p)}(\mathbf{x}_i)\log\frac{w_j^{(p)}(\mathbf{x}_i)}{w_j^{(p+1)}(\mathbf{x}_i)}. \tag{35}$$

Proposition 3.2 implies the following corollary:

###### Corollary 3.3.

In the NSMM algorithm, at any step $p$, we have

$$l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)})\ge\sum_{j=1}^{m}KL\left(e_j^{(p+1)},e_j^{(p)}\right). \tag{36}$$

Inequality (36) may be established directly, using Jensen’s Inequality, and we include this proof separately as an appendix because it is interesting in its own right.

###### Proof.

Direct proof of Corollary 3.3 can be found in Appendix (A.3). ∎

Corollary 3.3 implies the following two novel results. First, Corollary 3.4 guarantees that we only need to search among fixed point(s) of the NSMM algorithm for a solution to the minimization problem. This gives a theoretical basis for using the NSMM algorithm for this estimation problem.

###### Corollary 3.4.

Any minimizer of $l$ or $l_{\text{discrete}}$ is a fixed point of the corresponding NSMM algorithm.

###### Proof.

Since the right side of (36) is strictly positive when $e_j^{(p+1)}\ne e_j^{(p)}$ for any $j$, a necessary condition for $\mathbf{e}^{(p)}$ to minimize $l$ is that $\mathbf{e}^{(p+1)}=\mathbf{e}^{(p)}$, i.e., that $\mathbf{e}^{(p)}$ is a fixed point of the algorithm. The same reasoning works for $l_{\text{discrete}}$. ∎

Second, Corollary 3.5 ensures, among other things, that the $L^1$ distance between estimates at adjacent steps of an NSMM sequence tends to zero, a result that is used in the next section.

###### Corollary 3.5.

In the NSMM algorithm, at any step $p$, we have

$$l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)})\ge\sum_{j=1}^{m}\frac{1}{4}\left\|e_j^{(p+1)}-e_j^{(p)}\right\|_1^2, \tag{37}$$

where $\|\cdot\|_1$ denotes the $L^1$ norm.

###### Proof.

The result follows from Inequality (3.21) in Eggermont and LaRiccia (2001), which states that

$$KL(g_1,g_2)\ge\frac{1}{4}\|g_1-g_2\|_1^2 \tag{38}$$

for nonnegative functions $g_1$ and $g_2$. ∎
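As a quick numerical sanity check of Inequality (38) (an illustration of ours, not part of the paper), one can compare the two sides for a pair of normal densities on a grid:

```python
import numpy as np

grid = np.linspace(-5, 5, 1001)
dx = grid[1] - grid[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def KL(g1, g2):
    """Generalized Kullback-Leibler divergence of Eq. (12), Riemann sum."""
    return np.sum(g1 * np.log(g1 / g2) + g2 - g1) * dx

g1 = normal_pdf(grid, 0.0, 1.0)
g2 = normal_pdf(grid, 1.0, 1.5)
l1 = np.sum(np.abs(g1 - g2)) * dx               # L1 distance
print(KL(g1, g2) >= 0.25 * l1 ** 2)             # prints: True
```

For probability densities this is implied by Pinsker's inequality, $KL(g_1,g_2)\ge\tfrac{1}{2}\|g_1-g_2\|_1^2$; the point of (38) is that a (slightly weaker) bound holds for general nonnegative functions.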

## 4 Existence of a Solution to the Main Optimization Problem

In this section, we verify the existence of at least one solution to the main optimization problem of Section 2.2, a novel result as far as we are aware.

###### Lemma 4.1.

Given $\mathbf{e}$ satisfying assumption (ii), $l(\mathbf{e})$ is bounded below. In the discrete case, $l_{\text{discrete}}(\mathbf{e})$ is likewise bounded below.

###### Proof.

See Appendix (A.4). ∎

Together, Corollary 3.3 and Lemma 4.1 imply the following corollary.

###### Corollary 4.2.

In the NSMM algorithm, $l(\mathbf{e}^{(p)})$ will tend to a finite limit as $p$ goes to infinity. This result also holds in the discrete case.

We now establish some technical results that lead to the main conclusion of this section, namely, the existence of a minimizer of both $l$ and $l_{\text{discrete}}$.

###### Lemma 4.3.

Assume conditions (i) through (vi). For each $j$ and $k$, the sequence $\{e_{j,k}^{(p)}\}$ produced by any NSMM sequence is uniformly bounded and equicontinuous on $[0,1]$. This result also holds in the discrete case.

###### Proof.

See Appendix (A.5). ∎

More generally, Lemma 4.3 implies the following result:

###### Lemma 4.4.

For $\mathbf{e}$ satisfying assumptions (i) through (vi), in either the discrete or the continuous case, for $1\le j\le m$ and $\mathbf{u}\in\Omega$, we have

$$(G(\mathbf{e}))_j(\mathbf{u})\le M_2^r, \tag{39}$$

(40)

The following lemma establishes a sort of lower semi-continuity of the functional $l$, which will be needed in proving the existence of at least one solution to the main optimization problem.

###### Lemma 4.5.

Let $\gamma_{j,k}^{(p)}$ be nonnegative with support in $[0,1]$ for each $j$ and $k$, where $1\le j\le m$, $1\le k\le r$, and $p\in\{1,2,\dots\}\cup\{\infty\}$. Assume each $\gamma_{j,k}^{(p)}$ converges uniformly to $\gamma_{j,k}^{(\infty)}$ on $[0,1]$ and that all $\gamma_{j,k}^{(p)}$ are bounded above by a constant. Let $\gamma^{(p)}$ and $\gamma^{(\infty)}$ denote the corresponding collections of component functions. Then we have

$$l(\gamma^{(\infty)})\le\liminf_{p\to\infty}l(\gamma^{(p)}). \tag{41}$$

This is also true for the discrete case.

###### Proof.

See Appendix (A.6). ∎

###### Theorem 4.6.

Under assumptions (i) through (vi), there exists at least one solution to the main optimization problem (13). This is also true in the discrete case.

###### Proof.

See Appendix (A.7). ∎

To conclude this section, we discuss the rationale behind assumption (vi) and related issues, such as why the operator $N_h$ is well-defined as we applied it.

###### Lemma 4.7.

In an NSMM sequence $\{\mathbf{e}^{(p)}\}$, the $e_j^{(p)}$ are all strictly positive on $\Omega$ for all $p$. Moreover, if we let $\theta_j^{(p)}=\int e_j^{(p)}$ and $f_j^{(p)}=e_j^{(p)}/\theta_j^{(p)}$, then

$$(M_1(h))^r\le f_j^{(p)}(\mathbf{u})\le M_2^r \tag{42}$$

for all $\mathbf{u}\in\Omega$, all $j$, and all $p$.

###### Proof.

See Appendix (A.8). ∎

Lemma 4.7 shows why assumption (vi) requires the marginal densities of each mixture component to be bounded below by $M_1(h)$, and it guarantees that division by zero never occurs in any NSMM sequence.

## 5 Discussion

Starting from the conditional independence multivariate finite mixture model as set forth in the work of Benaglia et al. (2009) and Levine et al. (2011), this manuscript proposes an equivalent but simplified parameterization. This reformulation leads to a novel and mathematically coherent version of the penalized Kullback-Leibler divergence as the main optimization criterion for the estimation of the parameters.

In this new framework, certain constraints that were previously imposed on the parameter space may be eliminated, and the solutions obtained may be shown to follow these constraints naturally. These contributions help to rigorously justify the non-parametric maximum smoothed likelihood (npMSL) estimation algorithm established by Levine et al. (2011).

As part of our investigation, we have discovered several new results, including a sharper monotonicity property of the NSMM algorithm that could ultimately contribute to future investigations of the true convergence rate or other asymptotic properties of the algorithm. We also prove, for the first time, the existence of at least one solution for the estimation problem of this model.

Because of the elegant simplicity and mathematical tractability associated with this framework, we believe the results herein will serve as the basis for future research on this useful nonparametric model.

## Appendix A Mathematical Proofs

### A.1 Proof of Lemma 3.1

###### Proof.

Since both sides of (32) scale linearly when $f$ is multiplied by a positive constant, we only need to consider the case where $f$ is a density function. By Fubini's Theorem and Equation (31),

$$\begin{aligned}
[(P\circ S_h)f](\mathbf{x}) &= \prod_{k=1}^{r}\int_{\mathbb{R}^{r-1}}\left(\int_{\mathbb{R}^{r}}\tilde{s}_h(\mathbf{x},\mathbf{u})f(\mathbf{u})\,d\mathbf{u}\right)dx_1\cdots dx_{k-1}\,dx_{k+1}\cdots dx_r\\
&= \prod_{k=1}^{r}\int_{\mathbb{R}^{r}}\left(\int_{\mathbb{R}^{r-1}}\tilde{s}_h(\mathbf{x},\mathbf{u})\,dx_1\cdots dx_{k-1}\,dx_{k+1}\cdots dx_r\right)f(\mathbf{u})\,d\mathbf{u}\\
&= \prod_{k=1}^{r}\int_{\mathbb{R}}s_h(x_k,u_k)\left(\int_{\mathbb{R}^{r-1}}f(\mathbf{u})\,du_1\cdots du_{k-1}\,du_{k+1}\cdots du_r\right)du_k\\
&= \int_{\mathbb{R}^{r}}\left(\prod_{k=1}^{r}s_h(x_k,u_k)\right)\prod_{k=1}^{r}\left(\int_{\mathbb{R}^{r-1}}f(\mathbf{u})\,du_1\cdots du_{k-1}\,du_{k+1}\cdots du_r\right)d\mathbf{u}\\
&= [(S_h\circ P)f](\mathbf{x}). 
\end{aligned}$$

∎

### A.2 Proof of Proposition 3.2

###### Proof.

Direct evaluation and the definition of Kullback-Leibler divergence in Equation (12) give

$$\begin{aligned}
l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)}) &= \int g(\mathbf{x})\log\frac{\sum_{c=1}^{m}(N_h e_c^{(p+1)})(\mathbf{x})}{\sum_{d=1}^{m}(N_h e_d^{(p)})(\mathbf{x})}\,d\mathbf{x}
= \sum_{j=1}^{m}\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\log\frac{\sum_{c=1}^{m}(N_h e_c^{(p+1)})(\mathbf{x})}{\sum_{d=1}^{m}(N_h e_d^{(p)})(\mathbf{x})}\,d\mathbf{x}\\
&= \sum_{j=1}^{m}\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\log\left[\frac{w_j^{(p)}(\mathbf{x})}{w_j^{(p+1)}(\mathbf{x})}\cdot\frac{(N_h e_j^{(p+1)})(\mathbf{x})}{(N_h e_j^{(p)})(\mathbf{x})}\right]d\mathbf{x}\\
&= \sum_{j=1}^{m}\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\log\frac{(N_h e_j^{(p+1)})(\mathbf{x})}{(N_h e_j^{(p)})(\mathbf{x})}\,d\mathbf{x}+\sum_{j=1}^{m}\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\log\frac{w_j^{(p)}(\mathbf{x})}{w_j^{(p+1)}(\mathbf{x})}\,d\mathbf{x}\\
&= \sum_{j=1}^{m}KL\left(e_j^{(p+1)},e_j^{(p)}\right)+\sum_{j=1}^{m}\int\left[g\,w_j^{(p)}\log\frac{g\,w_j^{(p)}}{g\,w_j^{(p+1)}}+g\,w_j^{(p+1)}-g\,w_j^{(p)}\right]d\mathbf{x}\\
&= \sum_{j=1}^{m}KL\left(e_j^{(p+1)},e_j^{(p)}\right)+\sum_{j=1}^{m}KL\left(g\cdot w_j^{(p)},\,g\cdot w_j^{(p+1)}\right).
\end{aligned} \tag{43}$$

∎

### A.3 Direct Proof of Corollary 3.3

###### Proof.

If we define $\lambda_j^{(p)}$ and the densities $f_{j,k}^{(p)}$ by writing $e_j^{(p)}=\lambda_j^{(p)}\prod_{k=1}^{r}f_{j,k}^{(p)}$, then Jensen's inequality together with some simplification gives

$$\begin{aligned}
l(\mathbf{e}^{(p)})-l(\mathbf{e}^{(p+1)}) &= \int g(\mathbf{x})\log\frac{\sum_{j=1}^{m}(N_h e_j^{(p+1)})(\mathbf{x})}{\sum_{j'=1}^{m}(N_h e_{j'}^{(p)})(\mathbf{x})}\,d\mathbf{x}
= \int g(\mathbf{x})\log\sum_{j=1}^{m}\frac{(N_h e_j^{(p)})(\mathbf{x})}{\sum_{j'=1}^{m}(N_h e_{j'}^{(p)})(\mathbf{x})}\cdot\frac{(N_h e_j^{(p+1)})(\mathbf{x})}{(N_h e_j^{(p)})(\mathbf{x})}\,d\mathbf{x}\\
&\ge \int g(\mathbf{x})\sum_{j=1}^{m}\frac{(N_h e_j^{(p)})(\mathbf{x})}{\sum_{j'=1}^{m}(N_h e_{j'}^{(p)})(\mathbf{x})}\cdot\log\frac{(N_h e_j^{(p+1)})(\mathbf{x})}{(N_h e_j^{(p)})(\mathbf{x})}\,d\mathbf{x}\\
&= \sum_{j=1}^{m}\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\log\frac{\lambda_j^{(p+1)}\prod_{k=1}^{r}(N_h f_{j,k}^{(p+1)})(x_k)}{\lambda_j^{(p)}\prod_{k=1}^{r}(N_h f_{j,k}^{(p)})(x_k)}\,d\mathbf{x}\\
&= \sum_{j=1}^{m}\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\left[\log\frac{\lambda_j^{(p+1)}}{\lambda_j^{(p)}}+\sum_{k=1}^{r}\int s_h(u_k,x_k)\log\frac{f_{j,k}^{(p+1)}(u_k)}{f_{j,k}^{(p)}(u_k)}\,du_k\right]d\mathbf{x}\\
&= \sum_{j=1}^{m}\lambda_j^{(p+1)}\log\frac{\lambda_j^{(p+1)}}{\lambda_j^{(p)}}+\sum_{j=1}^{m}\sum_{k=1}^{r}\int\left(\int g(\mathbf{x})\,w_j^{(p)}(\mathbf{x})\,s_h(u_k,x_k)\,d\mathbf{x}\right)\log\frac{f_{j,k}^{(p+1)}(u_k)}{f_{j,k}^{(p)}(u_k)}\,du_k\\
&= \sum_{j=1}^{m}\lambda_j^{(p+1)}\log\frac{\lambda_j^{(p+1)}}{\lambda_j^{(p)}}+\sum_{j=1}^{m}\sum_{k=1}^{r}\int\lambda_j^{(p+1)}f_{j,k}^{(p+1)}(u_k)\log\frac{f_{j,k}^{(p+1)}(u_k)}{f_{j,k}^{(p)}(u_k)}\,du_k\\
&= \sum_{j=1}^{m}\int\lambda_j^{(p+1)}\left(\prod_{k=1}^{r}f_{j,k}^{(p+1)}(u_k)\right)\log\frac{\lambda_j^{(p+1)}\prod_{k=1}^{r}f_{j,k}^{(p+1)}(u_k)}{\lambda_j^{(p)}\prod_{k=1}^{r}f_{j,k}^{(p)}(u_k)}\,d\mathbf{u}\\
&= \sum_{j=1}^{m}\int e_j^{(p+1)}(\mathbf{u})\log\frac{e_j^{(p+1)}(\mathbf{u})}{e_j^{(p)}(\mathbf{u})}\,d\mathbf{u}
= \sum_{j=1}^{m}\int\left(e_j^{(p+1)}(\mathbf{u})\log\frac{e_j^{(p+1)}(\mathbf{u})}{e_j^{(p)}(\mathbf{u})}+e_j^{(p)}(\mathbf{u})-e_j^{(p+1)}(\mathbf{u})\right)d\mathbf{u}\\
&= \sum_{j=1}^{m}KL\left(e_j^{(p+1)},e_j^{(p)}\right).
\end{aligned}$$

∎

### A.4 Proof of Lemma 4.1

###### Proof.

For each $j$ and each $\mathbf{x}\in\Omega$, Jensen's Inequality gives

$$\begin{aligned}
(N_h e_j)(\mathbf{x}) &= \exp\left\{\int\tilde{s}_h(\mathbf{u},\mathbf{x})\log e_j(\mathbf{u})\,d\mathbf{u}\right\}\\
&\le \int\tilde{s}_h(\mathbf{u},\mathbf{x})\exp\left[\log e_j(\mathbf{u})\right]d\mathbf{u}\\
&= \int\tilde{s}_h(\mathbf{u},\mathbf{x})\,e_j(\mathbf{u})
\end{aligned}$$