# PMOG: The projected mixture of Gaussians model with application to blind source separation

## 1 Abstract

We extend the mixture of Gaussians (MOG) model to the projected mixture of Gaussians (PMOG) model. In the PMOG model, we assume that $q$-dimensional input data points $\mathbf{z}_i$ are projected by a $q$-dimensional vector $\mathbf{w}$ into 1-D variables $u_i$. The projected variables $u_i$ are assumed to follow a 1-D MOG model. In the PMOG model, we maximize the likelihood of observing $u_i$ to find both the model parameters for the 1-D MOG as well as the projection vector $\mathbf{w}$. First, we derive an EM algorithm for estimating the PMOG model. Next, we show how the PMOG model can be applied to the problem of blind source separation (BSS). In contrast to conventional BSS, where an objective function based on an approximation to differential entropy is minimized, PMOG based BSS simply minimizes the differential entropy of projected sources by fitting a flexible MOG model in the projected 1-D space while simultaneously optimizing the projection vector $\mathbf{w}$. The advantage of PMOG over conventional BSS algorithms is the more flexible fitting of non-Gaussian source densities without assuming near-Gaussianity (as in conventional BSS) while still retaining computational feasibility.

## 2 Introduction

The mixture of Gaussians (MOG) is a flexible model with application to many real world problems. The adjustable parameters in a 1-D MOG model include the number of component Gaussian distributions, the mean and variance of each component distribution, and their mixing fractions. Given $n$ 1-D data points $u_i$, these distributional parameters can be efficiently estimated by maximum likelihood (ML) using the expectation maximization (EM) algorithm [7]. Now consider the following situation:

• Data points $u_i$ are not given directly; instead, we are given vectors $\mathbf{z}_i$. The 1-D scalar variables are generated by $u_i = \mathbf{w}^T \mathbf{z}_i$, where $\mathbf{w}$ is an unknown projection vector.

• Suppose that the projected variables $u_i$ follow a 1-D MOG model, which we will refer to as a projected mixture of Gaussians (PMOG) model in this work.

Can we estimate the PMOG distributional parameters as well as the projection vector $\mathbf{w}$ using ML? In particular, can we derive an EM algorithm for this PMOG model similar to the standard EM algorithm for the conventional MOG model?

While estimating the PMOG model is an interesting problem in its own right, we will show that it is also closely related to the problem of estimating the differential entropy of a random variable. Given this fact, we will show that the PMOG model can be applied to the problem of linear blind source separation (BSS) and linear independent component analysis (ICA). We use the term ICA to refer to "square mixing", where there are an equal number of latent sources and mixed signals, and BSS to refer to "non-square and noisy" mixing, where there are more mixtures, corrupted with additive Gaussian noise, than latent sources. From this point of view, ICA is a special "noise free" case of BSS.

BSS or ICA is a well studied problem and we refer the reader to [5, 15, 4] for a detailed overview of relevant work. A central point in BSS is choosing the "measuring function" for the differential entropy of a random variable. One of the most widely used algorithms for BSS is the FastICA (FICA) algorithm [13]. The FICA algorithm works by optimizing the differential entropy based contrast functions developed in the seminal work by Hyvarinen et al. [12]. These contrast functions are approximations to the differential entropy of a random variable $u$ under the following conditions:

• Expected values of certain functions of $u$ are given. The density of $u$ is estimated to be the maximum entropy distribution (MED) subject to these constraints.

• Most importantly, the assumed MED is, in the words of Hyvarinen et al. [12], "not very far from a Gaussian distribution". Let us denote this simplified form of the MED by $\hat{P}(u)$.

Hyvarinen et al. calculated expressions for the differential entropy using the distribution $\hat{P}(u)$ given flexible user defined functions. A key question that arises is: are these approximations to differential entropy, and BSS solutions based on them, adequate when the true density (and hence the MED) is not "near Gaussian"?

In another seminal paper, Attias et al. [1] developed a general solution to the BSS problem where the latent source density was modeled as a "factorial MOG" density. This significant advance removed the "near Gaussianity" assumption on the latent source densities. In addition, [1] developed an EM algorithm for the ML solution of the mixing parameters and MOG source parameters in BSS, followed by a posterior mean or maximum a posteriori (MAP) estimate of the sources. However, this algorithm becomes computationally intractable as the number of sources grows. Moreover, the solution by Attias et al. assumes "exact independence" between the sources, i.e., it does not allow any partial dependence between sources. Can we develop a solution to the BSS problem that retains the flexible latent density modeling of [1], retains computational tractability, and can be applied under partial dependence between sources?

In this work, we develop the PMOG model and then apply it to the problem of BSS to address these questions:

1. We describe the PMOG model and derive an EM algorithm for estimating its parameters in section 4.

2. We show how PMOG can be applied to solving a BSS problem including cases where partial dependence between sources is allowed in sections 6, 7.

## 3 Notation

• Scalars will be denoted in a non-bold font (e.g., $n$, $q$), possibly with subscripts (e.g., $u_i$, $\pi_k$). We will use bold face lower case letters, possibly with subscripts, to denote vectors (e.g., $\mathbf{w}$, $\mathbf{z}_i$) and bold face upper case letters, possibly with subscripts, to denote matrices (e.g., $\mathbf{Z}$). The transpose of a matrix $\mathbf{A}$ will be denoted by $\mathbf{A}^T$ and its inverse will be denoted by $\mathbf{A}^{-1}$. We will denote the $q \times q$ identity matrix by $\mathbf{I}_q$. A vector or matrix of all zeros will be denoted by a bold face zero $\mathbf{0}$, whose size should be clear from context.

• The $i$th component of vector $\mathbf{a}$ will be denoted by $a_i$, whereas the $i$th component of vector $\mathbf{a}_k$ will be denoted by $a_{ki}$. The $(i,j)$ element of matrix $\mathbf{A}$ will be denoted by $A_{ij}$. Estimates of variables will be denoted by putting a hat on top of the variable symbol. For example, an estimate of $\mathbf{w}$ will be denoted by $\hat{\mathbf{w}}$. The 2-norm of a vector $\mathbf{a}$ will be denoted by $\|\mathbf{a}\|_2$.

• If $\mathbf{x}$ is a random vector with a multivariate Normal distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, then we will denote this distribution by $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Similarly, if $x$ is a scalar random variable with a Normal distribution with mean $\mu$ and variance $\sigma^2$, then we will denote this distribution by $N(\mu, \sigma^2)$. We will use $U(a, b)$ to denote the uniform distribution on $[a, b]$. Unless otherwise stated, all logarithms are natural logarithms (i.e., $\log x$ means $\log_e x$). Probability density functions will be denoted by capital letters with the vector of arguments in parentheses, such as $P(\mathbf{x})$ or $P(x)$. When necessary, we will also indicate the dependence of a probability density on a vector of parameters $\boldsymbol{\theta}$ by using the notation $P(\mathbf{x} \mid \boldsymbol{\theta})$. The expected value of a function $f(\mathbf{x})$ with respect to the density $P(\mathbf{x})$ will be denoted by $E_P[f(\mathbf{x})]$ or simply by $E[f(\mathbf{x})]$.

## 4 Projected mixture of Gaussians (PMOG) model

In this section, we try to answer the following questions:

• What is the PMOG model? What is given and what needs to be estimated?

• Can we derive an EM algorithm for estimating the PMOG model?

• How are the MOG parameters and projection vector estimated in the M-step?

• What other precautions need to be taken to make the monotonic likelihood improvement property of EM hold true?

### 4.1 What is the PMOG model?

Suppose we are given $n$ vectors $\mathbf{z}_i$, where each $\mathbf{z}_i$ is a $q \times 1$ vector. Suppose $\mathbf{Z}$ is a $q \times n$ matrix formed by assembling these vectors into a matrix, i.e.,

$$\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n] \tag{4.1}$$

We can think of these vectors as realizations of a random vector $\mathbf{z}$. Suppose $\mathbf{w}$ is an unknown $q \times 1$ vector that defines new "projected" scalars $u_i$ such that

$$u_i = \mathbf{w}^T \mathbf{z}_i = \mathbf{z}_i^T \mathbf{w} \tag{4.2}$$

Again, these scalars can be thought of as realizations of the random variable $u$. Suppose we are also given a $q \times r$ matrix $\mathbf{G}$ of full column rank with $r < q$. It is given that the unknown vector $\mathbf{w}$ satisfies the constraints

$$\mathbf{w}^T \mathbf{w} = 1 \tag{4.3}$$

$$\mathbf{G}^T \mathbf{w} = \mathbf{0} \tag{4.4}$$

It is assumed that the density of the random variable $u$ is a mixture of $R$ Gaussians, i.e.,

$$P(u \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \sum_{k=1}^{R} \pi_k N(u \mid \mu_k, \sigma_k^2) \tag{4.5}$$

In the above equation, $\boldsymbol{\pi}$ is a vector with elements $\pi_k$. Similarly, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ are also $R \times 1$ vectors, with elements $\mu_k$ and $\sigma_k^2$ respectively. The class fractions $\pi_k$ satisfy:

$$\sum_{k=1}^{R} \pi_k = 1 \tag{4.6}$$

$$0 \le \pi_k \le 1 \tag{4.7}$$

Equation 4.5 can be equivalently written in terms of $\mathbf{w}$ and $\mathbf{z}$ as follows:

$$P(\mathbf{w}^T \mathbf{z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \sum_{k=1}^{R} \pi_k N(\mathbf{w}^T \mathbf{z} \mid \mu_k, \sigma_k^2) \tag{4.8}$$

We will call the model 4.8 a projected mixture of Gaussians or PMOG model. A pictorial description of the PMOG model is given in Fig. 1.
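To make the model concrete, here is a minimal numerical sketch of evaluating the PMOG density 4.8 at each data point. The function name and array layout (one $q$-dimensional point per column of `Z`) are our own illustrative choices, not from the paper.

```python
import numpy as np

def pmog_density(Z, w, pi, mu, sigma2):
    """Evaluate the PMOG density (eq. 4.8) at each column z_i of Z.

    Z      : (q, n) data matrix, one q-dimensional point per column
    w      : (q,)  projection vector (assumed to satisfy w^T w = 1)
    pi     : (R,)  mixing fractions, summing to 1
    mu     : (R,)  component means
    sigma2 : (R,)  component variances
    """
    u = Z.T @ w                                   # projected scalars u_i = w^T z_i
    # Gaussian density of each u_i under each of the R components, shape (n, R)
    comp = np.exp(-0.5 * (u[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return comp @ pi                              # mixture density per data point
```

For a single standard normal component and points projecting to $u_i = 0$, this returns $1/\sqrt{2\pi}$ for every point, as expected.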

### 4.2 Estimating the PMOG model

Assuming that $u_i = \mathbf{w}^T \mathbf{z}_i$ for $i = 1, \ldots, n$ are independent realizations of $u$, we can write their joint density as

$$P(\mathbf{w}^T \mathbf{z}_1, \mathbf{w}^T \mathbf{z}_2, \ldots, \mathbf{w}^T \mathbf{z}_n \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{i=1}^{n} P(\mathbf{w}^T \mathbf{z}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) \tag{4.9}$$

To simplify notation, we will denote the left hand side of 4.9 by $P(\mathbf{Z}^T \mathbf{w} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)$. With this notation, we can write:

$$P(\mathbf{Z}^T \mathbf{w} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{i=1}^{n} P(\mathbf{w}^T \mathbf{z}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) \tag{4.10}$$

The problem is to maximize the likelihood (equation 4.10) w.r.t. $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$ and $\mathbf{w}$. For a given $\mathbf{w}$, this problem is equivalent to the standard mixture of Gaussians (MOG) estimation problem and can be handled efficiently by the expectation maximization (EM) algorithm [7]. In the PMOG model, the novelty is to allow $\mathbf{w}$ to be an unknown that is estimated along with the MOG parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$. Next we introduce priors on $\boldsymbol{\pi}$ and $\boldsymbol{\sigma}^2$. The purpose of introducing these priors is simply to prevent the collapse of a Gaussian component from PMOG onto a single data point. Suppose we assume a Dirichlet prior on the vector $\boldsymbol{\pi}$. This prior is described by a parameter vector $\boldsymbol{\beta}$ with elements $\beta_k$:

$$P(\boldsymbol{\pi} \mid \boldsymbol{\beta}) \propto \prod_{k=1}^{R} \pi_k^{\beta_k - 1} \tag{4.11}$$

Similarly, we assume a product of inverse Gamma priors on $\boldsymbol{\sigma}^2$. Suppose $\boldsymbol{\theta}$ is a vector with elements $\theta_k$ and similarly $\boldsymbol{\gamma}$ is a vector with elements $\gamma_k$; then

$$P(\boldsymbol{\sigma}^2 \mid \boldsymbol{\theta}, \boldsymbol{\gamma}) = \prod_{k=1}^{R} P(\sigma_k^2 \mid \theta_k, \gamma_k) \propto \prod_{k=1}^{R} (\sigma_k^2)^{-(\theta_k + 1)} \exp\left(-\frac{1}{\gamma_k \sigma_k^2}\right) \tag{4.12}$$

We use an improper flat prior for $\boldsymbol{\mu}$, $P(\boldsymbol{\mu}) \propto 1$. With this choice of priors, the posterior distribution can be written as:

$$P(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2 \mid \mathbf{Z}^T \mathbf{w}) = \frac{P(\mathbf{Z}^T \mathbf{w} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) \, P(\boldsymbol{\pi} \mid \boldsymbol{\beta}) \, P(\boldsymbol{\sigma}^2 \mid \boldsymbol{\theta}, \boldsymbol{\gamma})}{P(\mathbf{Z}^T \mathbf{w})} \tag{4.13}$$

Maximization of this posterior density w.r.t. both the MOG parameters and $\mathbf{w}$ is difficult because of the presence of $P(\mathbf{Z}^T \mathbf{w})$ in the denominator, which depends on $\mathbf{w}$. We could, however, maximize the posterior density w.r.t. $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$ and maximize the likelihood term w.r.t. $\mathbf{w}$. This is equivalent to solving the problem:

$$\max_{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}} F(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = P(\mathbf{Z}^T \mathbf{w} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) \, P(\boldsymbol{\pi} \mid \boldsymbol{\beta}) \, P(\boldsymbol{\sigma}^2 \mid \boldsymbol{\theta}, \boldsymbol{\gamma}) \tag{4.14}$$

### 4.3 EM algorithm for maximizing F

The main development in this subsection is an EM algorithm for maximizing $F$. It is much easier to deal with the logarithm of $F$ rather than $F$ itself. Now

$$\log F(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = \log P(\mathbf{Z}^T \mathbf{w} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) + \log P(\boldsymbol{\pi} \mid \boldsymbol{\beta}) + \log P(\boldsymbol{\sigma}^2 \mid \boldsymbol{\theta}, \boldsymbol{\gamma}) \tag{4.15}$$

Substituting 4.10, 4.11 and 4.12 into 4.15 we get:

$$\log F(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) \propto \log\left(\prod_{i=1}^{n} P(\mathbf{w}^T \mathbf{z}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)\right) + \log\left(\prod_{k=1}^{R} \pi_k^{\beta_k - 1}\right) + \log\left(\prod_{k=1}^{R} (\sigma_k^2)^{-(\theta_k + 1)} \exp\left(-\frac{1}{\gamma_k \sigma_k^2}\right)\right) \tag{4.16}$$

which can be simplified to:

$$\log F(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) \propto \sum_{i=1}^{n} \log P(\mathbf{w}^T \mathbf{z}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) + \sum_{k=1}^{R} (\beta_k - 1) \log \pi_k + \sum_{k=1}^{R} \left(-(\theta_k + 1) \log \sigma_k^2 - \frac{1}{\gamma_k \sigma_k^2}\right) \tag{4.17}$$

In the above equation, we have ignored the normalizing constants of the prior densities on $\boldsymbol{\pi}$ and $\boldsymbol{\sigma}^2$ since they do not depend on the unknown parameters. Thus the objective function to be maximized can be written as:

$$H(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = H_1(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) + H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) \tag{4.18}$$

where

$$H_1(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = \sum_{i=1}^{n} \log P(\mathbf{w}^T \mathbf{z}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) \tag{4.19}$$

and

$$H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) = \sum_{k=1}^{R} (\beta_k - 1) \log \pi_k + \sum_{k=1}^{R} \left(-(\theta_k + 1) \log \sigma_k^2 - \frac{1}{\gamma_k \sigma_k^2}\right) \tag{4.20}$$

Substituting 4.8 into 4.19 we get:

$$H_1(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = \sum_{i=1}^{n} \log \sum_{k=1}^{R} \pi_k N(\mathbf{w}^T \mathbf{z}_i \mid \mu_k, \sigma_k^2) \tag{4.21}$$

Introducing auxiliary variables $\alpha_{ki}$ such that:

$$\sum_{k=1}^{R} \alpha_{ki} = 1 \tag{4.22}$$

$$0 \le \alpha_{ki} \le 1 \tag{4.23}$$

we can write

$$H_1(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = \sum_{i=1}^{n} \log \sum_{k=1}^{R} \alpha_{ki} \left(\frac{\pi_k N(\mathbf{w}^T \mathbf{z}_i \mid \mu_k, \sigma_k^2)}{\alpha_{ki}}\right) \tag{4.24}$$

By using the concavity of the $\log$ function (Jensen's inequality), we can write:

$$H_1(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) \ge Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \sum_{k=1}^{R} \alpha_{ki} \log\left(\frac{\pi_k N(\mathbf{w}^T \mathbf{z}_i \mid \mu_k, \sigma_k^2)}{\alpha_{ki}}\right) \tag{4.25}$$

From 4.18 we see that

$$H(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) = H_1(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}) + H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) \ge Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}) + H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) \tag{4.26}$$

The inequality in 4.26 becomes an equality when we choose:

$$\alpha_{ki} = \frac{\pi_k N(\mathbf{w}^T \mathbf{z}_i \mid \mu_k, \sigma_k^2)}{\sum_{k'=1}^{R} \pi_{k'} N(\mathbf{w}^T \mathbf{z}_i \mid \mu_{k'}, \sigma_{k'}^2)} \tag{4.27}$$

In the context of EM, the $\alpha_{ki}$ are also known as responsibilities. Suppose $\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)}$ are the parameter values at iteration $t$ and let $\boldsymbol{\alpha}^{(t)}$ be the corresponding responsibilities computed from 4.27. Then we have

$$H(\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)}) = Q(\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)}, \boldsymbol{\alpha}^{(t)}) + H_2(\boldsymbol{\pi}^{(t)}, \boldsymbol{\sigma}^{2(t)}) \quad \text{(by 4.26 and 4.27)} \tag{4.28}$$

Now suppose, given $\boldsymbol{\alpha}^{(t)}$ and the current parameter values, we calculate $\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\mu}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}, \mathbf{w}^{(t+1)}$ such that

$$Q(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\mu}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}, \mathbf{w}^{(t+1)}, \boldsymbol{\alpha}^{(t)}) + H_2(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}) \ge Q(\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)}, \boldsymbol{\alpha}^{(t)}) + H_2(\boldsymbol{\pi}^{(t)}, \boldsymbol{\sigma}^{2(t)}) \tag{4.29}$$

From 4.29 and 4.28 we get:

$$H(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\mu}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}, \mathbf{w}^{(t+1)}) \ge Q(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\mu}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}, \mathbf{w}^{(t+1)}, \boldsymbol{\alpha}^{(t)}) + H_2(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}) \ge H(\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)}) \quad \text{(by 4.26, 4.29 and 4.28)} \tag{4.30}$$

This shows that the objective function increases monotonically from iteration to iteration and hence the algorithm converges to a local maximum. This gives us the following EM algorithm for the PMOG model:

Initialization: Choose initial values of the parameters either randomly or by other techniques such as clustering using $k$-means [17]. The output of this stage is the initial values $\boldsymbol{\pi}^{(0)}, \boldsymbol{\mu}^{(0)}, \boldsymbol{\sigma}^{2(0)}, \mathbf{w}^{(0)}$. Following initialization, the E and M steps outlined below are performed in an alternating fashion until convergence.

E-step: In the E-step, we calculate the responsibilities $\alpha_{ki}^{(t)}$ using the current estimate of the parameters and equation 4.27.
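The E-step of 4.27 can be sketched as follows. This is our own illustration: the log-sum-exp formulation is a standard numerical-stability device, not something prescribed by the paper.

```python
import numpy as np

def responsibilities(u, pi, mu, sigma2):
    """E-step (eq. 4.27): alpha[k, i] = pi_k N(u_i | mu_k, sigma2_k) / sum over components.

    u is the (n,) vector of projected scalars u_i = w^T z_i; pi, mu, sigma2
    are (R,) arrays. Computed with the log-sum-exp trick for stability.
    """
    # log pi_k + log N(u_i | mu_k, sigma2_k), shape (R, n)
    log_num = (np.log(pi)[:, None]
               - 0.5 * np.log(2 * np.pi * sigma2)[:, None]
               - 0.5 * (u[None, :] - mu[:, None]) ** 2 / sigma2[:, None])
    log_den = np.logaddexp.reduce(log_num, axis=0)  # log of the sum over components
    return np.exp(log_num - log_den)                # columns sum to 1
```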

M-step In the M-step, we solve the problem:

$$\max_{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}} \; Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) + H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) \tag{4.31}$$

In ordinary MOG, the M-step objective in 4.31 admits a closed form solution. If we also optimize w.r.t. $\mathbf{w}$, then the objective function becomes non-convex and hence we need to explicitly impose the post optimization condition 4.29 for EM convergence. We use a relative error criterion to detect EM convergence. Our convergence criterion is:

$$\Delta H^{(t)} = H(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\mu}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}, \mathbf{w}^{(t+1)}) - H(\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)}) \tag{4.32}$$

$$\frac{|\Delta H^{(t)}|}{|H(\boldsymbol{\pi}^{(t)}, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2(t)}, \mathbf{w}^{(t)})|} \le \epsilon \tag{4.33}$$

Here $\epsilon$ is a user specified relative error tolerance, fixed to a small value in our experiments.

### 4.4 Solving the M-step problem

In this subsection, we discuss in detail the solution of problem 4.31. During the estimation of each of the variables $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$ and $\mathbf{w}$, we account for the relevant constraints using Lagrange multipliers and Karush-Kuhn-Tucker optimality conditions (see [19] for details).

Estimating $\boldsymbol{\pi}$: Differentiating the M-step objective function w.r.t. $\pi_k$ and noting the constraint $\sum_{k=1}^{R} \pi_k = 1$, the optimality condition is given by:

$$\frac{\partial}{\partial \pi_k} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) + \frac{\partial}{\partial \pi_k} H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) - \lambda \frac{\partial}{\partial \pi_k}\left(\sum_{k=1}^{R} \pi_k - 1\right) = 0 \tag{4.34}$$

Now,

$$\frac{\partial}{\partial \pi_k} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) = \frac{1}{\pi_k} \sum_{i=1}^{n} \alpha_{ki}^{(t)} \tag{4.35}$$

$$\frac{\partial}{\partial \pi_k} H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) = \frac{1}{\pi_k} (\beta_k - 1) \tag{4.36}$$

From 4.34, 4.35 and 4.36 we get:

$$\frac{1}{\pi_k} \sum_{i=1}^{n} \alpha_{ki}^{(t)} + \frac{1}{\pi_k}(\beta_k - 1) = \lambda \implies \pi_k = \frac{1}{\lambda}\left(\sum_{i=1}^{n} \alpha_{ki}^{(t)} + (\beta_k - 1)\right) \tag{4.37}$$

Imposing the constraint $\sum_{k=1}^{R} \pi_k = 1$ and noting that $\sum_{k=1}^{R} \sum_{i=1}^{n} \alpha_{ki}^{(t)} = n$, we get:

$$\lambda = \sum_{k=1}^{R}\left(\sum_{i=1}^{n} \alpha_{ki}^{(t)} + (\beta_k - 1)\right) \implies \lambda = n + \sum_{k=1}^{R} (\beta_k - 1) \tag{4.38}$$

Thus $\pi_k$ is given by:

$$\pi_k = \frac{\sum_{i=1}^{n} \alpha_{ki}^{(t)} + (\beta_k - 1)}{n + \sum_{k'=1}^{R} (\beta_{k'} - 1)} \tag{4.39}$$

We have ignored the inequality constraints on $\pi_k$ during optimization. By choosing $\beta_k > 1$ we can ensure that these constraints are always satisfied and inactive, and hence can be disregarded during optimization.

Estimating $\boldsymbol{\mu}$: Differentiating the M-step objective w.r.t. $\mu_k$ and noting that $H_2$ does not depend on $\boldsymbol{\mu}$, the optimality condition is given by:

$$\frac{\partial}{\partial \mu_k} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) = 0 \tag{4.40}$$

Upon substituting the derivative, we get

$$\sum_{i=1}^{n} \alpha_{ki}^{(t)} \left(\frac{\mathbf{w}^T \mathbf{z}_i - \mu_k}{\sigma_k^2}\right) = 0 \tag{4.41}$$

Upon simplifying, $\mu_k$ is given by:

$$\mu_k = \frac{\sum_{i=1}^{n} \alpha_{ki}^{(t)} (\mathbf{w}^T \mathbf{z}_i)}{\sum_{i=1}^{n} \alpha_{ki}^{(t)}} \tag{4.42}$$

Estimating $\boldsymbol{\sigma}^2$: Differentiating the M-step objective w.r.t. $\sigma_k^2$, the optimality condition is:

$$\frac{\partial}{\partial \sigma_k^2} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) + \frac{\partial}{\partial \sigma_k^2} H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) = 0 \tag{4.43}$$

Now,

$$\frac{\partial}{\partial \sigma_k^2} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) = \sum_{i=1}^{n} \alpha_{ki}^{(t)} \left\{\frac{(\mathbf{w}^T \mathbf{z}_i - \mu_k)^2}{2\sigma_k^4} - \frac{1}{2\sigma_k^2}\right\} \tag{4.44}$$

and

$$\frac{\partial}{\partial \sigma_k^2} H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) = -\frac{(\theta_k + 1)}{\sigma_k^2} + \frac{1}{\gamma_k}\frac{1}{\sigma_k^4} \tag{4.45}$$

Substituting 4.44 and 4.45 in 4.43 we get:

$$\sum_{i=1}^{n} \alpha_{ki}^{(t)} \left\{\frac{(\mathbf{w}^T \mathbf{z}_i - \mu_k)^2}{2\sigma_k^4} - \frac{1}{2\sigma_k^2}\right\} - \frac{(\theta_k + 1)}{\sigma_k^2} + \frac{1}{\gamma_k}\frac{1}{\sigma_k^4} = 0 \tag{4.46}$$

Upon simplification, we get:

$$\sigma_k^2 = \frac{2\gamma_k^{-1} + \sum_{i=1}^{n} \alpha_{ki}^{(t)} (\mathbf{w}^T \mathbf{z}_i - \mu_k)^2}{2(\theta_k + 1) + \sum_{i=1}^{n} \alpha_{ki}^{(t)}} \tag{4.47}$$
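The closed form updates 4.39, 4.42 and 4.47 can be collected into a single routine for the first part of the M-step. This is a sketch under our own naming conventions; `beta`, `theta`, `gamma` are the prior hyperparameter vectors of 4.11 and 4.12.

```python
import numpy as np

def mstep_part1(u, alpha, beta, theta, gamma):
    """Closed-form M-step updates (eqs. 4.39, 4.42, 4.47) for fixed w.

    u     : (n,) projected scalars u_i = w^T z_i
    alpha : (R, n) responsibilities from the E-step
    beta, theta, gamma : (R,) prior hyperparameters (beta_k > 1 assumed)
    """
    n = u.shape[0]
    nk = alpha.sum(axis=1)                                 # effective counts per component
    pi = (nk + beta - 1.0) / (n + np.sum(beta - 1.0))      # eq. 4.39
    mu = (alpha @ u) / nk                                  # eq. 4.42
    resid2 = alpha * (u[None, :] - mu[:, None]) ** 2
    sigma2 = (2.0 / gamma + resid2.sum(axis=1)) / (2.0 * (theta + 1.0) + nk)  # eq. 4.47
    return pi, mu, sigma2
```

Note how the inverse Gamma hyperparameters act as a regularizer on $\sigma_k^2$: even when a component's responsibilities concentrate on a single point, the $2\gamma_k^{-1}$ term keeps the variance bounded away from zero.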

Estimating $\mathbf{w}$: The Lagrangian function for optimizing $\mathbf{w}$ is given by:

$$L(\mathbf{w}, \lambda_1, \boldsymbol{\lambda}_2) = Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) - \left\{\lambda_1 (\mathbf{w}^T \mathbf{w} - 1)\right\} - \left\{\boldsymbol{\lambda}_2^T \mathbf{G}^T \mathbf{w}\right\} \tag{4.48}$$

Since $\mathbf{w}$ must satisfy the constraints 4.3 and 4.4, and since $H_2$ does not depend on $\mathbf{w}$, the optimality condition is given by:

$$\frac{\partial}{\partial \mathbf{w}} Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) - \frac{\partial}{\partial \mathbf{w}}\left\{\lambda_1 (\mathbf{w}^T \mathbf{w} - 1)\right\} - \frac{\partial}{\partial \mathbf{w}}\left\{\boldsymbol{\lambda}_2^T \mathbf{G}^T \mathbf{w}\right\} = \mathbf{0} \tag{4.49}$$

In the above equation, $\lambda_1$ is the Lagrange multiplier for the constraint $\mathbf{w}^T \mathbf{w} = 1$ and $\boldsymbol{\lambda}_2$ is a Lagrange multiplier vector for the constraint $\mathbf{G}^T \mathbf{w} = \mathbf{0}$. Substituting the derivatives, we can write the optimality condition as:

$$\sum_{i=1}^{n} \sum_{k=1}^{R} \alpha_{ki}^{(t)} \left\{-\frac{(\mathbf{w}^T \mathbf{z}_i - \mu_k)\mathbf{z}_i}{\sigma_k^2}\right\} - 2\lambda_1 \mathbf{w} - \mathbf{G}\boldsymbol{\lambda}_2 = \mathbf{0} \tag{4.50}$$

Upon simplification, we can write the optimality condition as:

$$\mathbf{b} - \mathbf{A}\mathbf{w} - 2\lambda_1 \mathbf{w} - \mathbf{G}\boldsymbol{\lambda}_2 = \mathbf{0} \tag{4.51}$$

where the vector $\mathbf{b}$ and matrix $\mathbf{A}$ are independent of $\mathbf{w}$ and are given by:

$$\mathbf{b} = \sum_{i=1}^{n} \sum_{k=1}^{R} \frac{\alpha_{ki}^{(t)} \mu_k}{\sigma_k^2} \mathbf{z}_i \tag{4.52}$$

and

$$\mathbf{A} = \sum_{i=1}^{n} \sum_{k=1}^{R} \frac{\alpha_{ki}^{(t)}}{\sigma_k^2} \mathbf{z}_i \mathbf{z}_i^T \tag{4.53}$$

Premultiplying both sides of 4.51 by $\mathbf{w}^T$ and noting the constraints on $\mathbf{w}$ given by 4.3 and 4.4, we get:

$$\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} - 2\lambda_1 (1) - 0 = 0 \tag{4.54}$$

In other words, $\lambda_1$ is given by:

$$\lambda_1 = \frac{1}{2}\left(\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w}\right) \tag{4.55}$$

Similarly, premultiplying both sides of 4.51 by $\mathbf{G}^T$ and noting the constraint $\mathbf{G}^T\mathbf{w} = \mathbf{0}$, we get:

$$\mathbf{G}^T(\mathbf{b} - \mathbf{A}\mathbf{w}) - 2\lambda_1 (\mathbf{0}) - (\mathbf{G}^T\mathbf{G})\boldsymbol{\lambda}_2 = \mathbf{0} \tag{4.56}$$

Note that $\mathbf{G}$ is of size $q \times r$ with $r < q$ and of full column rank $r$. This means that $\mathbf{G}^T\mathbf{G}$ is non-singular and invertible. Hence we can solve for $\boldsymbol{\lambda}_2$ to get:

$$\boldsymbol{\lambda}_2 = (\mathbf{G}^T\mathbf{G})^{-1}\mathbf{G}^T(\mathbf{b} - \mathbf{A}\mathbf{w}) \tag{4.57}$$

Substituting $\lambda_1$ from 4.55 and $\boldsymbol{\lambda}_2$ from 4.57 into 4.51, we can re-write the optimality condition as:

$$\mathbf{b} - \mathbf{A}\mathbf{w} - (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{w} - \mathbf{G}(\mathbf{G}^T\mathbf{G})^{-1}\mathbf{G}^T(\mathbf{b} - \mathbf{A}\mathbf{w}) = \mathbf{0} \tag{4.58}$$

Let

$$\mathbf{P}_G = \mathbf{I}_q - \mathbf{G}(\mathbf{G}^T\mathbf{G})^{-1}\mathbf{G}^T \tag{4.59}$$

Essentially, $\mathbf{P}_G$ is the orthogonal projector onto the orthogonal complement of the column space of $\mathbf{G}$. Thus the optimality condition for $\mathbf{w}$ can be simplified to:

$$\mathbf{P}_G(\mathbf{b} - \mathbf{A}\mathbf{w}) - (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{w} = \mathbf{0} \tag{4.60}$$

Since $\mathbf{w}$ is a $q \times 1$ vector, for fixed $\boldsymbol{\alpha}^{(t)}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ this is a system of $q$ cubic equations in $q$ variables.
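The quantities $\mathbf{b}$, $\mathbf{A}$ and the residual of 4.60 can be assembled directly from the responsibilities. The sketch below (our own function name and vectorization, not from the paper) exploits the fact that the double sums in 4.52 and 4.53 reduce to per-point scalar weights.

```python
import numpy as np

def stationarity_residual(Z, alpha, mu, sigma2, G, w):
    """Evaluate b, A (eqs. 4.52-4.53) and the residual f(w) of eq. 4.60.

    Z (q, n); alpha (R, n); mu, sigma2 (R,); G (q, r); w (q,).
    """
    # per-point weights: d_i = sum_k alpha_ki mu_k / sigma2_k,
    #                    c_i = sum_k alpha_ki / sigma2_k
    d = (alpha * (mu / sigma2)[:, None]).sum(axis=0)             # (n,)
    c = (alpha / sigma2[:, None]).sum(axis=0)                    # (n,)
    b = Z @ d                                                    # eq. 4.52
    A = (Z * c) @ Z.T                                            # eq. 4.53
    PG = np.eye(Z.shape[0]) - G @ np.linalg.solve(G.T @ G, G.T)  # eq. 4.59
    f = PG @ (b - A @ w) - (w @ b - w @ A @ w) * w               # residual of eq. 4.60
    return b, A, PG, f
```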

### 4.5 Breaking up the M-step into two parts

Suppose we have already estimated the current responsibilities $\boldsymbol{\alpha}^{(t)}$ in the E-step. Our goal is to simultaneously solve the M-step equations for $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$ and $\mathbf{w}$. Our strategy is to note that with $\mathbf{w}$ fixed, $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ can be solved for explicitly using 4.39, 4.42 and 4.47 respectively, while for fixed $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ (i.e., fixed $\mathbf{b}$ and $\mathbf{A}$), we can calculate $\mathbf{w}$ by solving the system of cubic equations 4.60. Hence we can break up the M-step into 2 parts as follows:

1. M-step, Part 1:
In the first part, we keep $\mathbf{w}$ fixed at its current value and solve the following maximization problem:

$$\max_{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2} \; Q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \mathbf{w}^{(t)}, \boldsymbol{\alpha}^{(t)}) + H_2(\boldsymbol{\pi}, \boldsymbol{\sigma}^2) \tag{4.61}$$

This is equivalent to solving for $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ using 4.39, 4.42 and 4.47.

2. M-step, Part 2:
In the second part, we keep $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ fixed at their current values and solve the maximization problem:

$$\max_{\mathbf{w}} \; Q(\boldsymbol{\pi}^{(t+1)}, \boldsymbol{\mu}^{(t+1)}, \boldsymbol{\sigma}^{2(t+1)}, \mathbf{w}, \boldsymbol{\alpha}^{(t)}) \tag{4.62}$$

This is equivalent to solving the system of cubic equations 4.60 for $\mathbf{w}$.

We repeatedly perform the alternating maximizations in part 1 and part 2 until the absolute value of the change in the objective from one alternating maximization to the next is below a user specified tolerance, fixed to a small value in our experiments.

There is, however, a complication that needs to be taken care of when implementing the overall EM step. Note that when excluding $\mathbf{w}$, the M-step objective has closed form solutions for $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$. In addition, we can show that these closed form solutions maximize the partial M-step objective with $\mathbf{w}$ fixed. Can we say the same thing about maximizing over $\mathbf{w}$ for fixed $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$? Does the 2nd part of the M-step for optimizing $\mathbf{w}$ converge to a maximum? Suppose $\mathbf{w}^*$ is a solution to 4.60. The Hessian of the Lagrangian 4.48 is given by:

$$\nabla_{\mathbf{w}\mathbf{w}}^2 L(\mathbf{w}^*, \lambda_1^*, \boldsymbol{\lambda}_2^*) = -\mathbf{A} - 2\lambda_1^* \mathbf{I}_q \tag{4.63}$$

Here, $\lambda_1^*$ is simply the Lagrange multiplier from 4.55 evaluated at $\mathbf{w}^*$. Note that $\mathbf{A}$ is a symmetric and positive definite matrix. If $\lambda_1^* > 0$ then we can guarantee that the Hessian will be negative definite and the 2nd part of the M-step will have found a local maximum. How can we be sure that $\lambda_1^* > 0$? How can we ensure that the M-step solution for $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$ and $\mathbf{w}$ is admissible?

One way to solve this problem is to check 4.29 explicitly after convergence of each overall M-step. If 4.29 is not satisfied, then we re-initialize the M-step with a random vector $\mathbf{w}$ that satisfies $\mathbf{w}^T\mathbf{w} = 1$ and $\mathbf{G}^T\mathbf{w} = \mathbf{0}$ and re-solve the 2 part M-step. This amounts to simply checking for an increase in the objective function 4.31 after each M-step.
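A random restart vector satisfying both constraints can be drawn by projecting a random Gaussian vector with $\mathbf{P}_G$ and normalizing. This sketch is our own illustration of the re-initialization step; the function name is not from the paper.

```python
import numpy as np

def random_feasible_w(G, seed=None):
    """Draw a random w with w^T w = 1 and G^T w = 0 (restart for the
    two-part M-step when condition 4.29 fails)."""
    rng = np.random.default_rng(seed)
    q = G.shape[0]
    PG = np.eye(q) - G @ np.linalg.solve(G.T @ G, G.T)  # projector of eq. 4.59
    v = PG @ rng.standard_normal(q)                     # random vector orthogonal to col(G)
    return v / np.linalg.norm(v)
```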

### 4.6 Solving for the projection vector

Solving for $\mathbf{w}$ is equivalent to finding zeros of the set of equations:

$$\mathbf{f}(\mathbf{w}) = \mathbf{P}_G(\mathbf{b} - \mathbf{A}\mathbf{w}) - (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{w} \tag{4.64}$$

that satisfy the constraints 4.3 and 4.4. We consider below three possible cases:

#### Case 1

Suppose $\mathbf{b} - \mathbf{A}\mathbf{w} = \mathbf{0}$. Note that if $\mathbf{w}$ satisfies $\mathbf{b} - \mathbf{A}\mathbf{w} = \mathbf{0}$, then the equation $\mathbf{f}(\mathbf{w}) = \mathbf{0}$ is satisfied. Since $\mathbf{A}$ is invertible, define a candidate solution for the second part of the M-step as:

$$\mathbf{w}_1 = \mathbf{A}^{-1}\mathbf{b} \tag{4.65}$$

Is this an acceptable solution? If $\mathbf{w}_1$ satisfies $\mathbf{w}_1^T\mathbf{w}_1 = 1$ and $\mathbf{G}^T\mathbf{w}_1 = \mathbf{0}$, then it is an acceptable solution. However, there is no guarantee that this will be true for $\mathbf{w}_1$ given by 4.65. Thus the solution 4.65 for $\mathbf{w}$ is not admissible in general.

#### Case 2

Suppose $\mathbf{b} - \mathbf{A}\mathbf{w} \ne \mathbf{0}$. If $\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} = 0$, then solving $\mathbf{f}(\mathbf{w}) = \mathbf{0}$ reduces to solving:

$$\mathbf{P}_G(\mathbf{b} - \mathbf{A}\mathbf{w}) = \mathbf{0} \tag{4.66}$$

Another candidate solution for the second part of the M-step is given by:

$$\mathbf{w}_2 = \mathbf{A}^{-1}(\mathbf{b} - \mathbf{G}\boldsymbol{\eta}) \tag{4.67}$$

where $\boldsymbol{\eta}$ is an arbitrary $r \times 1$ vector. Does this satisfy the distinct constraints $\mathbf{w}^T\mathbf{w} = 1$, $\mathbf{G}^T\mathbf{w} = \mathbf{0}$ and $\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} = 0$? The number of free variables on the right hand side of 4.67 is $r$ (the elements of $\boldsymbol{\eta}$), but the number of constraints that they have to satisfy is larger, and so the solution 4.67 for $\mathbf{w}$ is not admissible in general.

#### Case 3

Suppose $\mathbf{b} - \mathbf{A}\mathbf{w} \ne \mathbf{0}$ and $\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} \ne 0$, and we solve for $\mathbf{w}$ that satisfies $\mathbf{f}(\mathbf{w}) = \mathbf{0}$. This means that $\mathbf{w}$ will satisfy:

$$\mathbf{P}_G(\mathbf{b} - \mathbf{A}\mathbf{w}) = (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{w} \tag{4.68}$$

Will such a $\mathbf{w}$ satisfy $\mathbf{w}^T\mathbf{w} = 1$ and $\mathbf{G}^T\mathbf{w} = \mathbf{0}$? Premultiplying both sides of 4.68 by $\mathbf{G}^T$ and noting the definition of $\mathbf{P}_G$ in 4.59, we get:

$$\mathbf{G}^T\mathbf{P}_G(\mathbf{b} - \mathbf{A}\mathbf{w}) = \mathbf{0} = (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{G}^T\mathbf{w} \tag{4.69}$$

Since $\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} \ne 0$, 4.69 gives us:

$$\mathbf{G}^T\mathbf{w} = \mathbf{0} \tag{4.70}$$

Premultiplying both sides of 4.68 by $\mathbf{w}^T$ we get:

$$\mathbf{w}^T\mathbf{P}_G(\mathbf{b} - \mathbf{A}\mathbf{w}) = (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{w}^T\mathbf{w} \tag{4.71}$$

From 4.59 and 4.70 we know that $\mathbf{P}_G\mathbf{w} = \mathbf{w}$, and so 4.71 can be written as:

$$\mathbf{w}^T(\mathbf{b} - \mathbf{A}\mathbf{w}) = (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w}) = (\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{w}^T\mathbf{w} \tag{4.72}$$

Since $\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} \ne 0$, 4.72 gives us:

$$\mathbf{w}^T\mathbf{w} = 1 \tag{4.73}$$

Therefore, if $\mathbf{b} - \mathbf{A}\mathbf{w} \ne \mathbf{0}$ and $\mathbf{w}^T\mathbf{b} - \mathbf{w}^T\mathbf{A}\mathbf{w} \ne 0$, then any solution to $\mathbf{f}(\mathbf{w}) = \mathbf{0}$ will satisfy both $\mathbf{w}^T\mathbf{w} = 1$ and $\mathbf{G}^T\mathbf{w} = \mathbf{0}$. To solve for $\mathbf{w}$ such that $\mathbf{f}(\mathbf{w}) = \mathbf{0}$, we can use Newton's method. The Jacobian of $\mathbf{f}$ w.r.t. $\mathbf{w}$ is given by:

$$\mathbf{J}(\mathbf{w}) = -\mathbf{P}_G\mathbf{A} - \mathbf{w}\mathbf{b}^T - (\mathbf{b}^T\mathbf{w})\mathbf{I}_q + (\mathbf{w}^T\mathbf{A}\mathbf{w})\mathbf{I}_q + 2\mathbf{w}\mathbf{w}^T\mathbf{A} \tag{4.74}$$

Assuming $\mathbf{J}(\mathbf{w}^{(i)})$ is non-singular, the basic Newton update is given by:

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \left[\mathbf{J}(\mathbf{w}^{(i)})\right]^{-1}\mathbf{f}(\mathbf{w}^{(i)}) \tag{4.75}$$

To avoid problems caused by intermediate singularity of $\mathbf{J}$, we can replace $\mathbf{J}$ by a modified version $\mathbf{J} + \eta\mathbf{I}_q$ for a sufficiently large $\eta > 0$ such that $\mathbf{J} + \eta\mathbf{I}_q$ is non-singular (see [19] for more details). For instance, we could use the Levenberg-Marquardt modification of Newton's method, which essentially performs this replacement. In this case, the update equation for $\mathbf{w}$ can be written as:

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \left[\mathbf{J}(\mathbf{w}^{(i)}) + \eta\mathbf{I}_q\right]^{-1}\mathbf{f}(\mathbf{w}^{(i)}) \tag{4.76}$$

The projection vector $\mathbf{w}$ is initialized for the Newton iteration as follows:

$$\mathbf{w}_{\text{init}} = \mathbf{P}_G\mathbf{A}^{-1}\mathbf{b} \tag{4.77}$$

$$\mathbf{w}_{\text{init}} \leftarrow \frac{\mathbf{w}_{\text{init}}}{\|\mathbf{w}_{\text{init}}\|_2} \tag{4.78}$$
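A minimal sketch of the damped Newton iteration 4.74-4.78 follows. The damping constant, tolerance and iteration cap are our own illustrative defaults, not values from the paper; by the Case 3 argument, a converged root automatically satisfies the unit-norm and orthogonality constraints.

```python
import numpy as np

def solve_w(b, A, G, eta=1e-6, tol=1e-10, max_iter=100):
    """Damped Newton iteration (eqs. 4.74-4.78) for the projection vector w.

    b, A come from eqs. 4.52-4.53; eta is the Levenberg-Marquardt damping.
    """
    q = b.shape[0]
    Iq = np.eye(q)
    PG = Iq - G @ np.linalg.solve(G.T @ G, G.T)           # eq. 4.59

    w = PG @ np.linalg.solve(A, b)                        # init, eq. 4.77
    w /= np.linalg.norm(w)                                # eq. 4.78
    for _ in range(max_iter):
        fw = PG @ (b - A @ w) - (w @ b - w @ A @ w) * w   # eq. 4.64
        if np.linalg.norm(fw) < tol:
            break
        J = (-PG @ A - np.outer(w, b) - (b @ w) * Iq
             + (w @ A @ w) * Iq + 2.0 * np.outer(w, w) @ A)  # eq. 4.74
        w = w - np.linalg.solve(J + eta * Iq, fw)         # eq. 4.76
    return w
```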

The complete EM algorithm pseudocode is given in Fig. 2.

## 5 The BSS problem

In this section, we will try to answer the following two questions:

• What is the BSS problem and why is its solution difficult?

• What is the connection between BSS, differential entropy and source correlation?

### 5.1 The linear BSS problem

We consider the general version of a linear BSS problem. Observed vectors $\mathbf{x}_i$ are generated from latent source vectors $\mathbf{s}_i$ as follows:

$$\mathbf{x}_i = \boldsymbol{\mu} + \mathbf{A}\mathbf{s}_i + \boldsymbol{\varepsilon}_i \tag{5.1}$$

In this linear mixing model, $\mathbf{A}$ is the mixing matrix with more rows than columns, $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\varepsilon}_i$ is a zero mean Gaussian noise vector. We assume without loss of generality that each component of $\mathbf{s}_i$ has zero mean and unit variance. Thus, the second order statistics of $\mathbf{s}_i$ can be summarized as:

$$E(\mathbf{s}_i) = \mathbf{0} \tag{5.2}$$

$$E(\mathbf{s}_i\mathbf{s}_i^T) = \boldsymbol{\Sigma}_s \tag{5.3}$$

where $\boldsymbol{\Sigma}_s$ is the unknown correlation matrix between the source components. Note that the diagonal elements of $\boldsymbol{\Sigma}_s$ are ones (1's). No distributional assumptions are made on the density of the vector $\mathbf{s}_i$. One simply requires that the components of $\mathbf{s}_i$ are minimally dependent (or maximally independent) of each other in an information theoretic sense. Given $n$ independent realizations of the vector $\mathbf{x}_i$ (generated from independent realizations of the sources $\mathbf{s}_i$), the goal is to estimate the unknown sources when the mixing parameters $\mathbf{A}$, $\boldsymbol{\mu}$ and the noise variance are unknown.
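The generative model 5.1 can be simulated as follows. The specific source density (Laplacian) and isotropic Gaussian noise are our own illustrative assumptions for the sketch; the paper makes no distributional assumption on the sources.

```python
import numpy as np

def generate_bss_data(A, mu, Sigma_s, noise_var, n, seed=None):
    """Generate mixtures x_i = mu + A s_i + eps_i (eq. 5.1).

    Sources are i.i.d. unit-variance Laplacian variables, linearly combined
    via the Cholesky factor of Sigma_s so that E(s s^T) = Sigma_s with unit
    diagonal (the mixed sources are then non-Gaussian but no longer exactly
    Laplacian). Noise is isotropic Gaussian with variance noise_var.
    """
    rng = np.random.default_rng(seed)
    L = A.shape[1]
    S = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(L, n))  # unit variance
    S = np.linalg.cholesky(Sigma_s) @ S                     # impose correlation
    X = mu[:, None] + A @ S + np.sqrt(noise_var) * rng.standard_normal((A.shape[0], n))
    return X, S
```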

### 5.2 Why is the BSS problem difficult?

The BSS problem is difficult because the estimation of the mixing parameters $\mathbf{A}$, $\boldsymbol{\mu}$ and the noise variance is coupled with the estimation of the latent sources $\mathbf{s}_i$. Note that equation 5.1 gives us the conditional density of $\mathbf{x}_i$ given