PMOG: The projected mixture of Gaussians model with application to blind source separation
1 Abstract
We extend the mixtures of Gaussians (MOG) model to the projected mixture of Gaussians (PMOG) model. In the PMOG model, we assume that dimensional input data points are projected by a dimensional vector into 1D variables . The projected variables are assumed to follow a 1D MOG model. In the PMOG model, we maximize the likelihood of observing to find both the model parameters for the 1D MOG as well as the projection vector . First, we derive an EM algorithm for estimating the PMOG model. Next, we show how the PMOG model can be applied to the problem of blind source separation (BSS). In contrast to conventional BSS where an objective function based on an approximation to differential entropy is minimized, PMOG based BSS simply minimizes the differential entropy of projected sources by fitting a flexible MOG model in the projected 1D space while simultaneously optimizing the projection vector . The advantage of PMOG over conventional BSS algorithms is the more flexible fitting of nonGaussian source densities without assuming nearGaussianity (as in conventional BSS) and still retaining computational feasibility.
2 Introduction
The mixture of Gaussians (MOG) is a flexible model with application to many real world problems. The adjustable parameters in a 1D MOG model include the number of component Gaussian distributions, the mean and variance of each component distribution and their mixing fractions. Given 1D data points , these distributional parameters can be efficiently estimated by maximum likelihood (ML) using the expectation maximization (EM) algorithm [7]. Now consider the following situation:

Data points are not given directly but suppose that we are given vectors . Next, 1D scalar variables are generated by where is an unknown projection vector.

Suppose that projected variables follow a 1D MOG model which we will refer to as a projected mixture of Gaussians (PMOG) model in this work.
Can we estimate the PMOG distributional parameters as well as the projection vector using ML? In particular, can we derive an EM algorithm for this PMOG model similar to the standard EM algorithm for the conventional MOG model?
While estimating the PMOG model is an interesting problem in its own right, we will show that it is also closely related to the problem of estimating the differential entropy of a random variable. Given this fact, we will show that the PMOG model can be applied to the problem of linear blind source separation (BSS) and linear independent component analysis (ICA). We use the term ICA to refer to ”square mixing” where there are equal number of latent sources and mixed signals and BSS to refer to ”nonsquare and noisy” mixing where there are more mixtures corrupted with additive Gaussian noise than latent sources. From this point of view, ICA is a special ”noise free” case of BSS.
BSS or ICA is a well studied problem and we refer the reader to [5, 15, 4] for a detailed overview of relevant work. A central point in BSS is choosing the ”measuring function” for differential entropy of a random variable. One of the most widely used algorithm for BSS is the FastICA (FICA) algorithm [13]. The FICA algorithm works by optimizing the differential entropy based contrast functions developed in the seminal work by Hyvarinen et al. [12]. These contrast functions are approximations to the differential entropy of a random variable under the following conditions:

Expected values of certain functions are given. The density of is estimated to be the maximum entropy distribution (MED) subject to the these constraints.

Most importantly, the assumed MED is, in the words of Hyvarinen et al. [12] ”not very far from a Gaussian distribution”. Let us denote this simplified form of by .
Hyvarinen et al. calculated expressions for the differential entropy using the distribution given flexible user defined functions . A key question that arises is: are these approximations to differential entropy and BSS solutions based on adequate when the true density and hence is not ”near Gaussian”?
In another seminal paper, Attias et al. [1] developed a general solution to the BSS problem where the latent source density was modeled as a ”factorial MOG” density. This significant advance removed the ”near Gaussianity” assumption on the latent source densities. In addition, [1] developed an EM algorithm for the ML solution of the mixing parameters and MOG source parameters in BSS followed by a posterior mean or maximum aposteriori (MAP) estimate of the sources. However, this algorithm becomes computationally intractable for sources. Moreover, the solution by Attias et al. assumes ”exact independence” between the sources i.e., it does not allow any partial dependence between sources. Can we develop a solution to the BSS problem that retains the flexible latent density modeling of [1], retains computational tractability and can be applied under partial dependence between sources?
In this work, we develop the PMOG model and then apply it to the problem of BSS to address these questions:

We describe the PMOG model and derive an EM algorithm for estimating its parameters in section 4.
3 Notation

Scalars will be denoted in a nonbold font (e.g. ) possibly with subscripts (e.g. , ). We will use bold face lower case letters possibly with subscripts to denote vectors (e.g. ) and bold face upper case letters possibly with subscripts to denote matrices (e.g. ). The transpose of a matrix will be denoted by and its inverse will be denoted by . We will denote the identity matrix by . A vector or matrix of all zeros will be denoted by a bold face zero whose size should be clear from context.

The th component of vector will be denoted by whereas the th component of vector will be denoted by . The element of matrix will be denoted by . Estimates of variables will be denoted by putting a hat on top of the variable symbol. For example, an estimate of will be denoted by . The 2norm of a vector will be denoted by .

If is a random vector with a multivariate Normal distribution with mean and covariance then we will denote this distribution by . Similarly, if is a scalar random variable with a Normal distribution with mean and variance then we will denote this distribution by . We will use to denote the uniform random distribution on . Unless otherwise stated all logarithms are natural logarithms (i.e., means ). Probability density functions will be denoted by capital letters with the vector of arguments in parenthesis such as or . When necessary we will also indicate the dependence of a probability density on a vector of parameters by using the notation . The expected value of a function with respect to the density will be denoted by or simply by .
4 Projected mixture of Gaussians (PMOG) model
In this section, we try to answer the following questions:

What is the PMOG model? What is given and what needs to be estimated?

Can we derive an EM algorithm for estimating the PMOG model?

How are the MOG parameters and projection vector estimated in the Mstep?

What other precautions need to be taken to make the monotonic likelihood improvement property of EM hold true?
4.1 What is the PMOG model?
Suppose we are given vectors where each is a vector. Suppose is a matrix formed by assembling these vectors into a matrix i.e.,
(4.1) 
We can think of these vectors as realizations of a random vector . Suppose is an unknown vector that defines new ”projected” scalars such that
(4.2) 
Again, these scalars can be thought of as realizations of the random variable . Suppose we are also given a matrix of full column rank with . It is given that the unknown vector satisfies the constraints
(4.3)  
(4.4) 
It is assumed that the density of random variable is a mixture of Gaussians i.e.,
(4.5) 
In the above equation, is a vector . Similarly and are also vectors. The class fractions satisfy:
(4.6)  
(4.7) 
Equation 4.5 can be equivalently written in terms of as follows:
(4.8) 
We will call the model 4.8 a projected mixture of Gaussians or PMOG model. A pictorial description of the PMOG model is given in Fig. 1.
4.2 Estimating the PMOG model
Assuming that for are independent realizations of , we can write their joint density as
(4.9) 
To simplify notation, we will denote the left hand size of 4.9 by . With this notation, we can write:
(4.10) 
The problem is to maximize (equation 4.10) w.r.t and . For a given this problem is equivalent to the standard mixture of Gaussians (MOG) estimation problem and can be handled efficiently by the expectation maximization (EM) algorithm [7]. In the PMOG model, the novelty is to allow to be an unknown that is estimated along with MOG parameters . Next we introduce priors on and . The purpose of introducing priors on and is simply to prevent the collapse of a Gaussian component from PMOG onto a single data point. Suppose we assume a Dirichlet prior on the vector . This prior is described by a parameter vector with elements .
(4.11) 
Similarly we assume a product of inverse Gamma prior on . Suppose is a vector with elements and similarly is a vector with elements then
(4.12) 
We use an improper prior for , . With this choice of priors the posterior distribution can be written as:
(4.13) 
Maximization of this posterior density w.r.t both and is difficult because of the presence of in the denominator. We could however, maximize the posterior density w.r.t only and maximize the likelihood term w.r.t . This is equivalent to solving the problem:
(4.14) 
4.3 EM algorithm for maximizing
The main development in this subsection is an EM algorithm for maximizing . It is much easier to deal with the logarithm of rather than itself. Now
(4.15) 
Substituting 4.10, 4.11 and 4.12 into 4.15 we get:
(4.16)  
which can be simplified to:
(4.17)  
In the above equation, we have ignored the constants of the prior densities on and since they do not depend on the unknown parameters. Thus the objective function to be maximized can be written as:
(4.18) 
where
(4.19) 
and
(4.20) 
Substituting 4.8 into 4.19 we get:
(4.21) 
Introducing auxillary variables such that:
(4.22)  
(4.23) 
we can write
(4.24) 
By using the concavity of the function, we can write:
(4.25) 
From 4.18 we see that
(4.26) 
The rightmost inequality becomes an equality when we choose:
(4.27) 
In the context of EM, the are also known as responsibilities. Suppose are the parameter values at iteration and be the corresponding responsibilities computed from 4.27. Then we have
(4.28)  
(by 4.26 and 4.27)  
(by 4.26 and 4.27) 
Now suppose, given and , we calculate such that
(4.29)  
(4.30)  
(by 4.26 and 4.27) 
This shows that the objective function increases monotonically from iteration to iteration and hence converges to a local maximum. This gives us the following EM algorithm for the PMOG model:
Initialization Choose initial values of parameters either randomly or by other techniques such as clustering using means [17]. The output of this stage is the initial values . Following initialization, the E and M steps outlined below are performed in an alternating fashion until convergence.
Estep In the Estep, we calculate the responsibilities using the current estimate of parameters and equation 4.27.
Mstep In the Mstep, we solve the problem:
(4.31) 
In ordinary MOG, the objective function in 4.31 is a convex function of and a closed form solution exists for the Mstep. If optimizing also w.r.t then the objective function becomes nonconvex and hence we need to explicitly impose the post optimization condition 4.29 for EM convergence. We use a relative error criterion to detect EM convergence. Our convergence criterion is:
(4.32)  
(4.33) 
Here is a user specified relative error. We used in our experiments.
4.4 Solving the Mstep problem
In this subsection, we discuss in detail the solution of problem 4.31. During the estimation of each variable and we account for the relevant constraints using Lagrange multipliers and KarushKuhnTucker optimality conditions (see [19] for details).
Estimating Differentiating the Mstep objective function w.r.t and noting the constraint the optimality condition is given by:
(4.34) 
Now,
(4.35)  
(4.36) 
(4.37) 
Imposing the constraint and noting that we get:
(4.38) 
Thus is given by:
(4.39) 
We have ignored the inequality constraints on during optimization. By choosing we can ensure that these constraints are always satisfied and inactive and hence can be disregarded during optimization.
Estimating Differentiating the Mstep objective w.r.t and noting that is not dependent on , the optimality condition is given by:
(4.40) 
Upon substituting the derivative, we get
(4.41) 
Upon simplifying, is given by:
(4.42) 
Estimating The Lagrangian function for optimizing is given by:
(4.48) 
Since must satisfy the constraints 4.3, and since does not depend on the optimality condition is given by:
(4.49) 
In the above equation, is the Lagrange multiplier for the constraint and is a Lagrange multiplier vector for the constraint . Substituting the derivatives, we can write the optimality condition as:
(4.50) 
Upon simplification, we can write the optimality condition as:
(4.51) 
where the vector and matrix are independent of and are given by:
(4.52) 
and
(4.53) 
Premultiplying both sides of 4.51 by and noting the constraints on given by 4.3 we get:
(4.54) 
In other words, the is given by:
(4.55) 
Similarly, premultiplying both sides of 4.51 by and noting the constraint we get:
(4.56) 
Note that is of size with and of full column rank . This means that is nonsingular and invertible. Hence we can solve for to get:
(4.57) 
Substituting from 4.55 and from 4.57 into 4.51 we can rewrite the optimality condition as:
(4.58) 
Let
(4.59) 
Essentially is an orthogonal projector to the columns of . Thus the optimality condition for can be simplified to:
(4.60) 
Since is a vector, for fixed and this is a system of cubic equations in variables.
4.5 Breaking up the Mstep into two parts
Suppose we have already estimated the current respobsibilities in the Estep. Our goal is to simultaneously solve the Mstep equations for and . Our strategy for solving the Mstep equations is to note that with fixed, and can be solved explicitly using 4.39, 4.42 and 4.47 respectively. While for fixed and (i.e., fixed and ), we can calculate by solving a system of cubic equations 4.60. Hence we can break up the Mstep into 2 parts as follows:

Mstep, Part 2:
In the second part, we keep and fixed at their current values and solve the maximization problem:(4.62) This is equivalent to solving the system of cubic equations 4.60 for .
We repeatedly perform the alternating maximizations in part 1 and part 2 until the absolute value of the change in from one alternating maximization to the next is below a user specified tolerance . We used in out experiments.
There is a however a complication that needs to be taken care of when implementing the overall EM step. Note that when excluding , the Mstep objective has closed form solutions for and . In addition, we can show that these closed form solutions result in a maximization of the partial Mstep function with fixed . Can we say the same thing about maximizing for fixed and ? Does the 2nd part of Mstep for optimizing converge to a maximum? Suppose is a solution to 4.60. The Hessian of the Lagrangian 4.48 is given by:
(4.63) 
Here, is simply the Lagrange multiplier from 4.55 evaluated at . Note that is a symmetric and positive definite matrix. If then we can guarantee that will be negative definite and the 2nd part of Mstep will have found a local maximum. How can we be sure that ? How can we ensure that the Mstep solution for and is admissible?
One way to solve this problem is to check 4.29 explicitly after convergence of each overall Mstep. If 4.29 is not satisfied then we reinitialize the Mstep with a random vector that satisfies and and resolve the 2 part Mstep. This implies we are simply checking for an increase in the objective function 4.31 after each Mstep.
4.6 Solving for the projection vector
Solving for is equivalent to finding zeros of the set of equations:
(4.64) 
that satisfy the constraints in 4.3. We consider below three possible cases:
Case 1
Suppose . Note that if satisfies then equation is satisfied. Since is invertible, define a candidate solution for the second part of the Mstep as:
(4.65) 
Is this an acceptable solution? If satisfies and then it is an acceptable solution. However, there is no guarantee that this will be true for satisfying 4.65. Thus the solution 4.65 for is not admissible.
Case 2
Suppose . If then solving reduces to solving:
(4.66) 
Another candidate solution for the second part of the Mstep is given by:
(4.67) 
where is an arbitrary vector. Does this satisfy the distinct constraints and ? The number of variables on the right hand side of 4.67 is but the number of constraints that they have to satisfy is (constraints on ) and so the solution 4.67 for is not admissible.
Case 3
Suppose and we solve for that satisfies . This means that will satisfy:
(4.68) 
Will such a satisfy and ? Premultiplying both sides of 4.68 by and noting the definition of in 4.59 we get:
(4.69) 
Since , 4.69 gives us:
(4.70) 
Premultiplying both sides of 4.68 by we get:
(4.71) 
From 4.59 and 4.70 we know that and so 4.71 can be written as:
(4.72) 
Since , 4.72 gives us:
(4.73) 
Therefore if then any solution to will satisfy both and . To solve for such that we can use Newton’s method. The Jacobian of w.r.t is given by:
(4.74) 
Assuming, is nonsingular the basic Newton update is given by:
(4.75) 
To avoid problems caused by intermediate singularity of we can replace by a modified version for a sufficiently large such that is nonsingular (see [19] for more details). For instance, we could use the LevenbergMarquardt modification of the Newton’s method which essentially does this replacement. In this case, the update equation for can be written as:
(4.76) 
The projection vector is initialized for the Newton iteration as follows:
(4.77)  
(4.78) 
The complete EM algorithm pseudocode is given in Fig. 2.
EM algorithm for estimating the PMOG model
(4.79) 
(4.80) 
5 The BSS problem
In this section, we will try to answer the following two questions;

What is the BSS problem and why is its solution difficult?

What is the connection between BSS, differential entropy and source correlation?
5.1 The linear BSS problem
We consider the general version of a linear BSS problem. Vectors are generated from latent source vectors as follows:
(5.1) 
In this linear mixing model, is the mixing matrix with , is the mean vector and is the noise vector with distribution . We assume without loss of generality that each component of has zero mean and unit variance i.e., and . Thus, the second order statistics of can be summarized as:
(5.2)  
(5.3) 
where is the unknown correlation matrix between the source components. Note that the diagonal elements of are ones (1’s). No distributional assumptions are made on the density of vector . One simply requires that the components of vector are minimally dependent (or maximally independent) on each other in an information theoretic sense. Given independent realizations of the vector (generated from independent realizations of sources ), the goal is to estimate the unknown sources when and the noise variance are unknown.
5.2 Why is the BSS problem difficult?
The BSS problem is difficult because the estimation of mixing parameters and is coupled with the estimation of latent sources . Note that the equation 5.1 gives us the conditional density of given