Kullback Proximal Algorithms for Maximum Likelihood Estimation^{1}^{1}1Stephane Chretien is with Université Libre de Bruxelles, Campus de la Plaine, CP 21001, 1050 Bruxelles, Belgium, (schretie@smg.ulb.ac.be) and Alfred Hero is with the Dept. of Electrical Engineering and Computer Science, 1301 Beal St., University of Michigan, Ann Arbor, MI 481092122 (hero@eecs.umich.edu). This research was supported in part by AFOSR grant F49620970028
Abstract
Accelerated algorithms for maximum likelihood image reconstruction are essential for emerging applications such as 3D tomography, dynamic tomographic imaging, and other high dimensional inverse problems. In this paper, we introduce and analyze a class of fast and stable sequential optimization methods for computing maximum likelihood estimates and study its convergence properties. These methods are based on a proximal point algorithm implemented with the KullbackLiebler (KL) divergence between posterior densities of the complete data as a proximal penalty function. When the proximal relaxation parameter is set to unity one obtains the classical expectation maximization (EM) algorithm. For a decreasing sequence of relaxation parameters, relaxed versions of EM are obtained which can have much faster asymptotic convergence without sacrifice of monotonicity. We present an implementation of the algorithm using Moré’s Trust Region update strategy. For illustration the method is applied to a nonquadratic inverse problem with Poisson distributed data.
Keywords: accelerated EM algorithm, KullbackLiebler relaxation, proximal point iterations, superlinear convergence, Trust Region methods, emission tomography.
LIST OF FIGURES

Two rail phantom for 1D deblurring example.

Blurred two level phantom. Blurring kernel is Gaussian with standard width approximately equal to rail separation distance in phantom. An additive randoms noise of 0.3 was added.

Snapshot of log–Likelihood vs iteration for plain EM and KPP EM algorithm. Plain EM initially produces greater increases in likelihood function but is overtaken by KPP EM at 7 iterations and thereafter.

The sequence vs iteration for plain EM and KPP EM algorithms. Here is limiting value for each of the algorithms. Note the superlinear convergence of KPP.

Reconstructed images after 150 iterations of plain EM and KPP EM algorithms.

Evolution of the reconstructed source vs iteration for plain EM and KPP EM.
1 Introduction
Maximum likelihood (ML) or maximum penalized likelihood (MPL) approaches have been widely adopted for image restoration and image reconstruction from noise contaminated data with known statistical distribution. In many cases the likelihood function is in a form for which analytical solution is difficult or impossible. When this is the case iterative solutions to the ML reconstruction or restoration problem are of interest. Among the most stable iterative strategies for ML is the popular expectation maximization (EM) algorithm [8]. The EM algorithm has been widely applied to emission and transmission computed tomography [39, 23, 36] with Poisson data. The EM algorithm has the attractive property of monotonicity which guarantees that the likelihood function increases with each iteration. The convergence properties of the EM algorithm and its variants have been extensively studied in the literature; see [42] and [15] for instance. It is well known that under strong concavity assumptions the EM algorithm converges linearly towards the ML estimator . However, the rate coefficient is small and in practice the EM algorithm suffers from slow convergence in late iterations. Efforts to improve on the asymptotic convergence rate of the EM algorithm have included: Aitken’s acceleration [28], overrelaxation [26], conjugate gradient [20] [19], Newton methods [30] [4], quasiNewton methods [22], ordered subsets EM [17] and stochastic EM [25]. Unfortunately, these methods do not automatically guarantee the monotone increasing likelihood property as does standard EM. Furthermore, many of these accelerated algorithms require additional monitoring for instability [24]. This is especially problematic for high dimensional image reconstruction problems, e.g. 3D or dynamic imaging, where monitoring could add significant computational overhead to the reconstruction algorithm.
The contribution of this paper is the introduction of a class of accelerated EM algorithms for likelihood function maximization via exploitation of a general relation between EM and proximal point (PP) algorithms. These algorithms converge and can have quadratic rates of convergence even with approximate updating. Proximal point algorithms were introduced by Martinet [29] and Rockafellar [38], based on the work of Minty [31] and Moreau [33], for the purpose of solving convex minimization problems with convex constraints. A key motivation for the PP algorithm is that by adding a sequence of iterationdependent penalties, called proximal penalties, to the objective function to be maximized one obtains stable iterative algorithms which frequently outperform standard optimization methods without proximal penalties, e.g. see Goldstein and Russak [1]. Furthermore, the PP algorithm plays a paramount role in nondifferentiable optimization due to its connections with the MoreauYosida regularization; see Minty [31], Moreau [33], Rockafellar [38] and HiriartHurruty and Lemaréchal [16].
While the original PP algorithm used a simple quadratic penalty more general versions of PP have recently been proposed which use nonquadratic penalties, and in particular entropic penalties. Such penalties are most commonly applied to ensure nonnegativity when solving Lagrange duals of inequality constrained primal problems; see for example papers by Censor and Zenios [5], Ekstein [10], Eggermont [9], and Teboulle [40]. In this paper we show that by choosing the proximal penalty function of PP as the KullbackLiebler (KL) divergence between successive iterates of the posterior densities of the complete data, a generalization of the generic EM maximum likelihood algorithm is obtained with accelerated convergence rate. When the relaxation sequence is constant and equal to unity the PP algorithm with KL proximal penalty reduces to the standard EM algorithm. On the other hand for a decreasing relaxation sequence the PP algorithm with KL proximal penalty is shown to yield an iterative ML algorithm which has much faster convergence than EM without sacrificing its monotonic likelihood property.
It is important to point out that relations between particular EM and particular PP algorithms have been previously observed, but not in the full generality established in this paper. Specifically, for parameters constrained to the nonnegative orthant, Eggermont [9] established a relation between an entropic modification of the standard PP algorithm and a class of multiplicative methods for smooth convex optimization. The modified PP algorithm that was introduced in [9] was obtained by replacing the standard quadratic penalty by the relative entropy between successive nonnegative parameter iterates. This extension was shown to be equivalent to an “implicit” algorithm which, after some approximations to the exact PP objective function, reduces to the “explicit” Shepp and Vardi EM algorithm [39] for image reconstruction in emission tomography. Eggermont [9] went on to prove that the explicit and implicit algorithms are monotonic and both converge when the sequence of relaxation parameters is bounded below by a strictly positive number.
In contrast to [9], here we establish a general and exact relation between the generic EM procedure, i.e. arbitrary incomplete and complete data distributions, and an extended class of PP algorithms. As pointed out above, the extended PP algorithm is implemented with a proximal penalty which is the relative entropy (KL divergence) between successive iterates of the posterior densities of the complete data. This modification produces a class of algorithms which we refer to as KullbackLiebler proximal point (KPP). We prove a global convergence result for the KPP algorithm under strict concavity assumptions. An approximate KPP is also proposed using the Trust Region strategy [32, 34] adapted to KPP. We show, in particular, that both the exact and approximate KPP algorithms have superlinear convergence rates when the sequence of positive relaxation parameters converge to zero. Finally, we illustrate these results for KPP acceleration of the Shepp and Vardi EM algorithm implemented with Trust Region updating.
The results given here are also applicable to the nonlinear updating methods of Kivinen and Warmuth [21] for accelerating the convergence of Gaussian mixturemodel identification algorithms in supervised machine learning, see also Warmuth and Azoury [41] and Helmbold, Schapire, Singer and Warmuth [14]. Indeed, similarly to the general KPP algorithm introduced in this paper, in [14] the KL divergence between the new and the old mixture model was added to the gradient of the Gaussian mixturemodel likelihood function, appropriately weighted with a multiplicative factor called the learning rate parameter. This procedure led to what the authors of [14] called an exponentiated gradient algorithm. These authors provided experimental evidence of significant improvements in convergence rate as compared to gradient descent and ordinary EM. The results in this paper provide a general theory which validate such experimental results for a very broad class of parametric estimation problems.
The outline of the paper is as follows. In Section 2 we provide a brief review of key elements of the classical EM algorithm. In Section 3, we establish the general relationship between the EM algorithm and the proximal point algorithm. In section 4, we present the general KPP algorithm and we establish global and superlinear convergence to the maximum likelihood estimator for a smooth and strictly concave likelihood function. In section 5, we study second order approximations of the KPP iteration using Trust Region updating. Finally, in Section 6 we present numerical comparisons for a Poisson inverse problem.
2 Background
The problem of maximum likelihood (ML) estimation consists of finding a solution of the form
(1) 
where is an observed sample of a random variable defined on a sample space and is the loglikelihood function defined by
(2) 
and denotes the density of at parametrized by a vector parameter in . One of the most popular iterative methods for solving ML estimation problems is the Expectation Maximization (EM) algorithm described in Dempster, Laird, and Rubin [8] which we recall for the reader.
A more informative data space is introduced. A random variable is defined on with density parametrized by . The data is more informative than the actual data in the sense that is a compression of , i.e. there exists a noninvertible transformation such that . If one had access to the data it would therefore be advantageous to replace the ML estimation problem (1) by
(3) 
with . Since the density of is related to the density of through
(4) 
for an appropriate measure on . In this setting, the data are called incomplete data whereas the data are called complete data.
Of course the complete data corresponding to a given observed sample are unknown. Therefore, the complete data likelihood function can only be estimated. Given the observed data and a previous estimate of denoted , the following minimum mean square error estimator (MMSE) of the quantity is natural
where, for any integrable function on , we have defined the conditional expectation
and is the conditional density function given
(5) 
The EM algorithm generates a sequence of approximations to the solution (3) starting from an initial guess of and is defined by
A key to understanding the convergence of the EM algorithm is the decomposition of the likelihood function presented in Dempster, Laird and Rubin [8]. As this decomposition is also the prime motivation for the KPP generalization of EM it will be worthwhile to recall certain elements of their argument. The likelihood can be decomposed as
(6) 
where
It follows from elementary application of Jensen’s inequality to the log function that
(7) 
Observe from (6) and (7) that for any the function is a lower bound on the log likelihood function . This property is sufficient to ensure monotonicity of the algorithm. Specifically, since the the Mstep implies that
(8) 
one obtains
This is the well known monotonicity property of the EM algorithm.
Note that if the function in (6) were scaled by an arbitrary positive factor the function would remain a lower bound on , the right hand side of (2) would remain positive and monotonicity of the algorithm would be preserved. As will be shown below, if is allowed to vary with iteration in a suitable manner one obtains a monotone, superlinearly convergent generalization of the EM algorithm.
3 Proximal point methods and the EM algorithm
In this section, we present the proximal point (PP) algorithm of Rockafellar and Martinet. We then demonstrate that EM is a particular case of proximal point implemented with a Kullbacktype proximal penalty.
3.1 The proximal point algorithm
Consider the general problem of maximizing a concave function . The proximal point algorithm is an iterative procedure which can be written
(10) 
The quadratic penalty is relaxed using a sequence of positive parameters . In [38], Rockafellar showed that superlinear convergence of this method is obtained when the sequence converges towards zero. In numerical implementations of proximal point the function is generally replaced by a piecewise linear model [16].
3.2 Proximal interpretation of the EM algorithm
In this section, we establish an exact relationship between the generic EM procedure and an extended proximal point algorithm. For our purposes, we will need to consider a particular KullbackLiebler (KL) information measure. Assume that the family of conditional densities is regular in the sense of Ibragimov and Khasminskii [18], in particular and are mutually absolutely continuous for any and in . Then the RadonNikodym derivative exists for all and we can define the following KL divergence:
(11) 
Proposition 1
The EM algorithm is equivalent to the following recursion with ,
(12) 
For general positive sequence the recursion in Proposition 1 can be identified as a modification of the PP algorithm (10) with the standard quadratic penalty replaced by the KL penalty (11) and having relaxation sequence . In the sequel we call this modified PP algorithm the KullbackLiebler proximal point (KPP) algorithm. In many treatments of the EM algorithm the quantity
is the surrogate function that is maximized in the Mstep. This surrogate objective function is identical (up to an additive constant) to the KPP objective of (12) when .
Proof of Proposition 1: The key to making the connection with the proximal point algorithm is the following representation of the M step:
This equation is equivalent to
since the additional term is constant in . Recalling that ,
We finally obtain
which concludes the proof.
4 Convergence of the KPP Algorithm
In this section we establish monotonicity and other convergence properties of the KPP algorithm of Proposition 1.
4.1 Monotonicity
For bounded domain of , the KPP algorithm is well defined since the maximum in (12) is always achieved in a bounded set. Monotonicity is guaranteed by this procedure as proved in the following proposition.
Proposition 2
The loglikelihood sequence is monotone nondecreasing and satisfies
(13) 
We next turn to asymptotic convergence of the KPP iterates .
4.2 Asymptotic Convergence
In the sequel (respectively ) denotes the gradient (respectively the Hessian matrix) of in the first variable. For a square matrix , denotes the greatest eigenvalue of a matrix and denotes the smallest.
We make the following assumptions
Assumptions 1
We assume the following:

is twice continuously differentiable on and is twice continuously differentiable in in .

where is the standard Euclidean norm on .

and on every bounded set.

for any in , and on every bounded set.
These assumptions ensure smoothness of and and their first two derivatives in . Assumption 1.iii also implies strong concavity of . Assumption 1.iv implies that is strictly convex and that the parameter is strongly identifiable in the family of densities (see proof of Lemma 1 below). Note that the above assumptions are not the minimum possible set, e.g. that and are upper bounded follows from continuity, Assumption 1.ii and the property , respectively.
We first characterize the fixed points of the KPP algorithm.
A result that will be used repeatedly in the sequel is that for any
(14) 
This follows immediately from the information inequality for the KL divergence [7, Thm. 2.6.3]
so that, by smoothness Assumption 1.i, has a stationary point at .
Proposition 3
Proof: Consider a fixed point of the recurrence relation (12) for constant. Then,
As and are both smooth in , must be a stationary point
Thus, as by (14) ,
(15) 
Since is strictly concave, we deduce that is a maximizer of .
The following will be useful.
Lemma 1
Let the conditional density be such that satisfies Assumption 1.iv. Then, given two bounded sequences and , implies that .
Proof: Let be any bounded set containing both sequences and . Let denote the minimum
(16) 
Assumption 1.iv implies that . Furthermore, invoking Taylor’s theorem with remainder, is strictly convex in the sense that for any
As and , recall (14), we obtain
The desired result comes from passing to the limit .
Using these results, we easily obtain the following.
Lemma 2
Let the densities and be such that Assumptions 1 are satisfied. Then is bounded.
Proof: Due to Proposition 2, the sequence is monotone increasing. Therefore, assumption 1.ii implies that is bounded.
In the following lemma, we prove a result which is often called asymptotic regularity [2].
Lemma 3
Let the densities and be such that and satisfy Assumptions 1. Let the sequence of relaxation parameters satisfy . Then,
(17) 
Proof: By Assumption 1.iii and by Proposition 2 is bounded and monotone. Since, by Lemma 2, is a bounded sequence converges. Therefore, which, from (13), implies that vanishes when tends to infinity. Since is bounded below by : . Therefore, Lemma 1 establishes the desired result.
We can now give a global convergence theorem.
Theorem 1
Let the sequence of relaxation parameters be positive and converge to a limit . Then the sequence converges to the solution of the ML estimation problem (1).
Proof: Since is bounded, one can extract a convergent subsequence with limit . The defining recurrence (12) implies that
(18) 
We now prove that is a stationary point of . Assume first that converges to zero, i.e. . Due to Assumptions 1.i, is continuous in . Hence, since is bounded on bounded subsets, (18) implies
Next, assume that . In this case, Lemma 3 establishes that
Therefore, also tends to . Since is continuous in equation (18) gives at infinity
Finally, by (14), and
(19) 
The proof is concluded as follows. As, by Assumption 1.iii, is concave, is a maximizer of so that solves the Maximum Likelihood estimation problem (1). Furthermore, as positive definiteness of implies that is in fact strictly concave, this maximizer is unique. Hence, has only one accumulation point and converges to which ends the proof.
We now establish the main result concerning speed of convergence. Recall that a sequence is said to converge superlinearly to a limit if:
(20) 
Theorem 2
Assume that the sequence of positive relaxation parameters converges to zero. Then, the sequence converges superlinearly to the solution of the ML estimation problem (1).
Proof: Due to Theorem 1, the sequence converges to the unique maximizer of . Assumption 1.i implies that the gradient mapping is continuously differentiable. Hence, we have the following Taylor expansion about .
(21)  
where the remainder satisfies
Since maximizes , . Furthermore, by (14), . Hence, (21) can be simplified to
(22) 
From the defining relation (12) the iterate satisfies
(23) 
So, taking in (4.2) and using (23), we obtain
Thus,
(24) 
On the other hand, one deduces from Assumptions 1 (i) that is locally Lipschitz in the variables and . Then, since, is bounded, there exists a bounded set containing and a finite constant such that for all , , and in ,
Using the triangle inequality and this last result, (4.2) asserts that for any
(25) 
Now, consider again the bounded set containing . Let and denote the minima
Since for any symmetric matrix , is lower bounded by the minimum eigenvalue of , we have immediately that
(26) 
By Assumptions 1.iii and 1.iv, and, after substitution of (26) into (4.2), we obtain
(27) 
for all . Therefore, collecting terms in (27)
(28) 
Now, recall that is convergent. Thus, and subsequently, due to the definition of the remainder . Finally, as converges to zero, is bounded and , equation (28) gives (20) with and the proof of superlinear convergence is completed.
5 Second order Approximations and Trust Region techniques
The maximization in the KPP recursion (12) will not generally yield an explicit exact recursion in and . Thus implementation of the KPP algorithm methods may require line search or onesteplate approximations similar to those used for the Mstep of the nonexplicit penalized EM maximum likelihood algorithm [13]. In this section, we discuss an alternative which uses second order function approximations and preserves the convergence properties of KPP established in the previous section. This second order scheme is related to the wellknown Trust Region technique for iterative optimization introduced by Moré [32].
5.1 Approximate models
In order to obtain computable iterations, the following second order approximations of and are introduced
and
In the following, we adopt the simple notation (a column vector). A natural choice for and is of course
and
The approximate KPP algorithm is defined as
(29)  
At this point it is important to make several comments. Notice first that for , , and , the approximate step (5.1) is equivalent to a Newton step. It is well known that Newton’s method, also known as Fisher scoring, has superlinear asymptotic convergence rate but may diverge if not properly initialized. Therefore, at least for small values of the relaxation parameter , the approximate PPA algorithm may fail to converge for reasons analogous in Newton’s method [37]. On the other hand, for the term penalizes the distance of the next iterate to the current iterate . Hence, we can interpret this term as a regularization or relaxation which stabilizes the possibly divergent Newton algorithm without sacrificing its superlinear asymptotic convergence rate. By appropriate choice of the iterate can be forced to remain in a region around over which the quadratic model is accurate [32][3].
In many cases a quadratic approximation of a single one of the two terms or is sufficient to obtain a closed form for the maximum in the KPP recursion (12). Naturally, when feasible, such a reduced approximation is preferable to the approximation of both terms discussed above. For concreteness, in the sequel, although our results hold for the reduced approximation also, we only prove convergence for the proximal point algorithm implemented with the full twoterm approximation.
Finally, note that (5.1) is quadratic in and the minimization problem clearly reduces to solving a linear system of equations. For of moderate dimension, these equations can be efficiently solved using conjugate gradient techniques [34]. However, when the vector in (5.1) is of large dimension, as frequently occurs in inverse problems, limited memory BFGS quasiNewton schemes for updating may be computationally much more efficient, see for example [34], [35], [27], [12] and [11].
5.2 Trust Region Update Strategy
The Trust Region strategy proceeds as follows. The model is maximized in a ball centered at where is a proximity control parameter which may depend on , and where is a norm; well defined due to positive definiteness of (Assumption 1.iv). Given an iterate consider a candidate for defined as the solution to the constrained optimization problem
subject to
(30) 
By duality theory of constrained optimization [16], and the fact that is strictly concave, this problem is equivalent to the unconstrained optimization
(31) 
where
and is a Lagrange multiplier selected to meet the constraint (30) with equality: .
5.3 Implementation
The parameter is said to be safe if produces an acceptable increase in the original objective . An iteration of the Trust Region method consists of two principal steps
Rule 1. Determine whether is safe or not. If is safe, set and take an approximate Kullback proximal step . Otherwise, take a null step .
Rule 2. Update depending on the result of Rule 1.
Rule 1 can be implemented by comparing the increase in the original loglikelihood to a fraction of the expected increase predicted by the approximate model . Specifically, the Trust Region parameter is accepted if
(32) 
Rule 2 can be implemented as follows. If was accepted by Rule 1, is increased at the next iteration in order to extend the region of validity of the model . If was rejected, the region must be tightened and is decreased at the next iteration.
The Trust Region strategy implemented here is essentially the same as that proposed by Moré [32].
Algorithm 1
Step 0. (Initialization) Set , and the “curve search” parameters , with .
Step 2. If then set . Otherwise, set .
Step 3. Set . Update the model . Update using Procedure 1.
Step 4. Go to Step 1.
The procedure for updating is given below.
Procedure 1
Step 0. (Initialization) Set and such that .
Step 1. If then take .
Step 2. If