# Adaptive Monte Carlo Maximum Likelihood

Błażej Miasojedow Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
Wojciech Niemiro Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
Jan Palczewski School of Mathematics, University of Leeds, Leeds, UK
Wojciech Rejchel Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
###### Abstract

We consider Monte Carlo approximations to the maximum likelihood estimator in models with intractable norming constants. This paper deals with adaptive Monte Carlo algorithms, which adjust control parameters in the course of simulation. We examine asymptotics of adaptive importance sampling and a new algorithm, which uses resampling and MCMC. This algorithm is designed to reduce problems with degeneracy of importance weights. Our analysis is based on martingale limit theorems. We also describe how adaptive maximization algorithms of Newton-Raphson type can be combined with the resampling techniques. The paper includes results of a small scale simulation study in which we compare the performance of adaptive and non-adaptive Monte Carlo maximum likelihood algorithms.

###### Keywords:
maximum likelihood, importance sampling, adaptation, MCMC, resampling

## 1 Introduction

Maximum likelihood (ML) is a well-known and often used method in estimation of parameters in statistical models. However, for many complex models exact calculation of such estimators is very difficult or impossible. Such problems arise if considered densities are known only up to intractable norming constants, for instance in Markov random fields or spatial statistics. The wide range of applications of models with unknown norming constants is discussed e.g. in [10]. Methods proposed to overcome the problems with computing ML estimates in such models include, among others, maximum pseudolikelihood [2], “coding method” [9] and Monte Carlo maximum likelihood (MCML) [4], [15], [9], [17]. In our paper we focus on MCML.

In influential papers [4], [5] the authors prove consistency and asymptotic normality of MCML estimators. To improve the performance of MCML, one can adjust control parameters in the course of simulation. This leads to adaptive MCML algorithms. We generalize the results of the last mentioned papers first to an adaptive version of importance sampling and then to a more complicated adaptive algorithm which uses resampling and Markov chain Monte Carlo (MCMC) [7]. Our analysis is asymptotic and it is based on the martingale structure of the estimates. The main motivating examples are the autologistic model (with or without covariates) and its applications to spatial statistics as described e.g. in [9] and the autonormal model [11].

Denote by , , a family of unnormalized densities on space . A dominating measure with respect to which these densities are defined is denoted for simplicity by . Let be an observation. We intend to find the maximizer of the log-likelihood

 ℓ(θ)=logfθ(yobs)−logc(θ),

where is the normalizing constant. We consider the situation where this constant,

 c(θ)=∫Yfθ(y)dy,

is unknown and numerically intractable. It is approximated with Monte Carlo simulation, resulting in

 ℓm(θ)=logfθ(yobs)−log^cm(θ), (2.1)

where is a Monte Carlo (MC) estimate of . The classical importance sampling (IS) estimate is of the form

 ^cm(θ)=1mm∑j=1fθ(Yj)h(Yj), (2.2)

where are i.i.d. samples from an instrumental density . Clearly, an optimal choice of depends on the maximizer of , so we should be able to improve our initial guess about while the simulation progresses. This is the idea behind adaptive importance sampling (AIS). A discussion on the choice of instrumental density is deferred to subsequent subsections. Let us describe an adaptive algorithm in the following form, suitable for further generalizations. Consider a parametric family , of instrumental densities.

1. Set an initial value of , , .

2. Draw .

3. Update the approximation of :

 ^cm(θ)=m−1m^cm−1(θ)+1mfθ(Ym)hψm(Ym).
4. Update : choose based on the history of the simulation.

5. ; go to 2.

At the output of this algorithm we obtain an AIS estimate

 ^cm(θ)=1mm∑j=1fθ(Yj)hψj(Yj). (2.3)

The samples are neither independent nor have the same distribution. However (2.3) has a nice martingale structure. If we put

 Fm=σ{Yj,ψj:j≤m}

then is -measurable. The well-known property of unbiasedness of IS implies that

 E(fθ(Ym+1)hψm+1(Ym+1)∣∣Fm)=c(θ). (2.4)

In other words, are martingale differences (MGD), for every fixed .

### 2.1 Hypo-convergence of ℓm and consistency of ^θm

In this subsection we make the following assumptions.

###### Assumption 1

For any

 supψ∫fθ(y)2hψ(y)dy<∞.
###### Assumption 2

The mapping is continuous for each fixed .

Assumption 1 implies that for any , there is a constant such that for all ,

 E((fθ(Yj)hψj(Yj))2∣∣∣Fj−1)≤Mθ,a.s.,

because . Note that Assumption 1 is trivially true if the mapping is uniformly bounded for , . Recall also that

 m(^cm(θ)−c(θ))=m∑j=1(fθ(Yj)hψj(Yj)−c(θ))

is a zero-mean martingale. Under Assumption 1, for a fixed , we have a.s. by the SLLN for martingales (see Theorem A.2, Appendix A), so a.s. This is, however, insufficient to guarantee the convergence of maximum likelihood estimates (maximizers of ) to . Under our assumptions we can prove hypo-convergence of the log-likelihood approximations.

###### Definition 1

A sequence of functions epi-converges to if for any we have

 supB∈N(x)limsupm→∞infy∈Bgm(y)≤g(x), supB∈N(x)liminfm→∞infy∈Bgm(y)≥g(x),

where is a family of all (open) neighbourhoods of .

A sequence of functions hypo-converges to if epi-converges to .

An equivalent definition of epi-convergence is in the following theorem:

###### Theorem 1

([14, Proposition 7.2]) epi-converges to iff at every point

 limsupm→∞gm(xm)≤g(x),for some % sequence xm→x, liminfm→∞gm(xm)≥g(x),for every % sequence xm→x.

As a corollary to this theorem comes the proposition that will be used to prove convergence of , the maximizer of , to (see, also, [1, Theorem 1.10]).

###### Proposition 1

Assume that epi-converges to , and . Then .

###### Proof

(We will use Theorem 1 many times.) Let be a sequence converging to and such that (such sequence exists). This implies that . On the other hand, , where the equality follows from the second assumption on . Summarizing, . In particular, .

Take any and let be such that . There exists a sequence converging to such that , hence . By arbitrariness of we obtain . This completes the proof.

###### Theorem 2

If Assumptions 1 and 2 are satisfied, then hypo-converges to almost surely.

###### Proof

The proof is similar to the proof of Theorem 1 in [5]. We have to prove that epi-converges to . Fix .

Step 1: For any , we have

 liminfm→∞infφ∈B^cm(φ)≥∫infφ∈Bfφ(y)dy=:c–(B). (2.5)

Indeed,

 infφ∈B^cm(ϕ) =infφ∈B1mm∑j=1fφ(Yj)hψj(Yj)≥1mm∑j=1infφ∈Bfφ(Yj)hψj(Yj) =1mm∑j=1(infφ∈Bfφ(Yj)hψj(Yj)−c–(B))+c–(B).

The sum is that of martingale differences, so assuming that there is such that

 supjE((infφ∈Bfφ(Yj)hψj(Yj)−c–(B))2∣∣∣Fj−1)≤M

the SLLN implies (2.5). We have the following estimates:

 E((infφ∈Bfφ(Yj)hψj(Yj)−c–(B))2∣∣∣Fj−1)=Var(infφ∈Bfφ(Yj)hψj(Yj)∣∣∣Fj−1) ≤E((infφ∈Bfφ(Yj)hψj(Yj))2∣∣∣Fj−1)≤E((fθ(Yj)hψj(Yj))2∣∣∣Fj−1)≤Mθ,

where the last inequality is by Assumption 1.

Step 2: We shall prove that .

The left-hand side is bounded from below by . Further, we have

 supB∈N(θ)c–(B)≥limδ↓0c–(B(θ,δ))=∫limδ↓0infφ∈B(θ,δ)fφ(y)dy=∫fθ(y)dy=c(θ),

where the first equality follows from the dominated convergence theorem (the dominator is ) and the last – from the Assumption 2.

Step 3: We have

 supB∈N(θ)limsupm→∞infφ∈B^cm(φ)≤supB∈N(θ)infφ∈Blimsupm→∞^cm(φ)=supB∈N(θ)infφ∈Bc(φ)≤c(θ).

Hence, .

Note that almost sure convergence in the next Proposition corresponds to the randomness introduced by AdapIS and is fixed throughout this paper.

###### Proposition 2

If Assumptions 1 and 2 hold, is the unique maximizer of and sequence (where maximizes ) is almost surely bounded then almost surely.

###### Proof

As we have already mentioned, by SLLN for martingales, , pointwise. Hypo-convergence of to implies, by Proposition 1, that the maximizers of have accumulation points that are the maximizers of . If has a unique maximizer then any convergent subsequence of , maximizers of , converges to . The conclusion follows immediately.

Of course, it is not easy to show boundedness of in concrete examples. In the next section we will prove consistency of in models where log-likelihoods and their estimates are concave.

### 2.2 Central Limit Theorem for Adaptive Importance Sampling

Let be a maximizer of , i.e. the AIS estimate of the likelihood given by (2.1) with (2.3). We assume that is a unique maximizer of . For asymptotic normality of , we will need the following assumptions.

###### Assumption 3

First and second order derivatives of with respect to (denoted by and ) exist in a neighbourhood of and we have

 ∇c(θ)=∫∇fθ(y)dy,∇2c(θ)=∫∇2fθ(y)dy.

.

###### Assumption 5

Matrix is negative definite.

###### Assumption 6

For every , function is continuous and .

###### Assumption 7

For some we have almost surely.

###### Assumption 8

There exists a nonnegative function such that and the inequalities

 supψfθ⋆(y)2+αhψ(y)1+α≤g(y),supψ|∇fθ⋆(y)|2+αhψ(y)1+α≤g(y),supψ∥∇2fθ⋆(y)∥1+αhψ(y)α≤g(y)

are fulfilled for some and also for .

###### Assumption 9

Functions are asymptotically stochastically equicontionuous at , i.e. for every there exists such that

 limsupm→∞P(sup|θ−θ⋆|≤δ∥∇2ℓm(θ)−∇2ℓm(θ⋆)∥>ε)=0.

Let us begin with some comments on these assumptions and note simple facts which follow from them. Assumption 3 is a standard regularity condition. It implies that a martingale property similar to (2.4) holds also for the gradients and hessians:

 (2.6)

Assumption 4 stipulates square root consistency of . It is automatically fulfilled if is concave, in particular for exponential families. Assumption 7 combined with 6 is a “diminishing adaptation” condition. It may be ensured by an appropriately specifying step 4 of AdapIS. The next assumptions are not easy to verify in general, but they are satisfied for exponential families on finite spaces, in particular for our “motivating example”: autologistic model. Let us also note that our Assumption 9 plays a similar role to Assumption (f) in [5, Thm. 7].

Assumption 8 together with (2.4) and (2.6) allows us to apply SLLN for martingales in a form given in Theorem A.2, Appendix A. Indeed, , and are MGDs with bounded moments of order . It follows that, almost surely,

 ^cm(θ⋆)→c(θ⋆),∇^cm(θ⋆)→∇c(θ⋆),∇2^cm(θ⋆)→∇2c(θ⋆). (2.7)

Now we are in a position to state the main result of this section.

###### Theorem 3

If Assumptions 3-9 hold then

 √m(^θm−θ⋆)→N(0,D−1VD−1)in distribution,

where and

 V=1c(θ⋆)2VARY∼hψ⋆[∇fθ⋆(Y)hψ⋆(Y)−∇c(θ⋆)c(θ⋆)fθ⋆(Y)hψ⋆(Y)],

where is defined in Assumption 7.

###### Proof

It is well-known (see [12, Theorem VII.5]) that we need to prove

 √m∇ℓm(θ⋆)d→N(0,V) (2.8)

and that for every , the following holds:

 sup|θ−θ⋆|≤M/√mm∣∣ℓm(θ)−ℓm(θ⋆)−(θ−θ⋆)⊤∇ℓm(θ⋆)−12(θ−θ⋆)⊤D(θ−θ⋆)∣∣p→0. (2.9)

First we show (2.8). Since and , we obtain that

 ∇ℓm(θ⋆)=∇c(θ⋆)c(θ⋆)−∇^cm(θ⋆)^cm(θ⋆)=∇c(θ⋆)c(θ⋆)^cm(θ⋆)−∇^cm(θ⋆)^cm(θ⋆). (2.10)

The denominator in the above expression converges to in probability, by (2.7). In view of Slutski’s theorem, to prove (2.8) it is enough to show asymptotic normality of the numerator. We can write

 ∇c(θ⋆)c(θ⋆)^cm(θ⋆)−∇^cm(θ⋆)=−1mm∑j=1ξj,

where we use the notation

 ξj=∇fθ⋆(Yj)hψj(Yj)−∇c(θ⋆)c(θ⋆)fθ⋆(Yj)hψj(Yj).

Now note that are martingale differences by (2.4) and (2.6). Moreover,

 E(ξjξTj|Fj−1)=∫(∇fθ⋆(y)hψj(y)−∇c(θ⋆)c(θ⋆)fθ⋆(y)hψj(y))(∇fθ⋆(y)hψj(y)−∇c(θ⋆)c(θ⋆)fθ⋆(y)hψj(y))⊤hψj(y)dy,

so Assumptions 6 and 7 via dominated convergence and Assumption 8 (with in the exponent) entail

 E(ξjξTj|Fj−1)a.s.−−→c(θ⋆)2V.

Now we use Assumption 8 (with in the exponent) to infer the Lyapunov-type condition

 E(|ξj|2+α|Fj−1)≤const⋅∫g(y)dy<∞.

The last two displayed formulas are sufficient for a martingale CLT (Theorem A.1, Appendix A). We conclude that

 1√mm∑j=1ξjξj⊤d→N(0,c(θ⋆)2V),

hence the proof of (2.8) is complete.

Now we proceed to a proof of (2.9). By Taylor expansion,

 ℓm(θ)−ℓm(θ⋆)−(θ−θ⋆)⊤∇ℓm(θ⋆)=12(θ−θ⋆)⊤∇2ℓm(~θ)(θ−θ⋆)

for some . Consequently, the LHS of (2.9) is

 ≤sup|θ−θ⋆|≤M/√m~θ∈[θ,θ⋆]m∣∣∣12(θ−θ⋆)⊤(∇2ℓm(~θ)−∇2ℓ(θ⋆))(θ−θ⋆)∣∣∣≤M22sup|θ−θ⋆|≤M/√m∥∥∇2ℓm(θ)−∇2ℓ(θ⋆)∥∥≤M22sup|θ−θ⋆|≤M/√m∥∥∇2ℓm(θ)−∇2ℓm(θ⋆)∥∥+M22∥∥∇2ℓm(θ⋆)−∇2ℓ(θ⋆)∥∥.

The first term above goes to zero in probability by Assumption 9. The second term also goes to zero because

 ∇2ℓm(θ⋆)−∇2ℓ(θ⋆)=∇2logc(θ⋆)−∇2log^cm(θ⋆)=∇2c(θ⋆)c(θ⋆)−∇c(θ⋆)c(θ⋆)∇c(θ⋆)⊤c(θ⋆)−∇2^cm(θ⋆)^cm(θ⋆)+∇^cm(θ⋆)^cm(θ⋆)∇^cm(θ⋆)⊤^cm(θ⋆)p→0,

in view of (2.7). Therefore (2.9) holds and the proof is complete.

### 2.3 Optimal importance distribution

We advocate adaptation to improve the choice of instrumental distribution . But which is the best? If we use (non-adaptive) importance sampling with instrumental distribution then the maximizer of the MC likelihood approximation has asymptotic normal distribution, namely , () with

 V=1c(θ⋆)2VARY∼h[∇fθ⋆(Y)h(Y)−∇c(θ⋆)c(θ⋆)fθ⋆(Y)h(Y)].

This fact is well-known [5] and is a special case of Theorem 3. Since the asymptotic distribution is multidimensional its dispersion can be measured in various ways, e.g., though the determinant, the maximum eigenvalue or the trace of the covariance matrix. We examine the trace which equals to the asymptotic mean square error of the MCML approximation (the asymptotic bias is nil). Notice that

 c(θ⋆)2V=VARY∼hη(Y)h(Y)=EY∼hη(Y)η(Y)⊤h(Y)2=∫η(y)η(y)⊤h(y)dy,

where

 η(y)=∇fθ⋆(y)−∇c(θ⋆)c(θ⋆)fθ⋆(y).

Since , the minimization of is equivalent to

 ∫|D−1η(y)|2h(y)dy→min,

subject to and . By Schwarz inequality we have

 ∫|D−1η(y)|2h(y)dy=∫(|D−1η(y)|√h(y))2dy∫(√h(y))2dy≥(∫|D−1η(y)|√h(y)√h(y)dy)2=(∫|D−1η(y)|dy)2,

with equality only for . The optimum choice of is therefore

 h⋆(y)∝∣∣ ∣∣D−1(∇fθ⋆(y)−∇c(θ⋆)c(θ⋆)fθ⋆(y))∣∣ ∣∣. (2.11)

Unfortunately, this optimality result is chiefly of theoretical importance, because it is not clear how to sample from and how to compute the norming constant for this distribution. This might well be even more difficult than computing .

The following example shows some intuitive meaning of (2.11). It is a purely “toy example” because the simple analitical formulas exist for and while MC is considered only for illustration.

###### Example 1

Consider a binomial distribution on given by . Parametrize the model with the log-odds-ratio , absorb the factor into the measure to get the standard exponenial family form with

 fθ(y)=eθyandc(θ)=n∑y=0(ny)eθy=(1+eθ)n.

Taking into account the facts that and we obtain that (2.11) becomes (factor is a scalar so can be omitted). In other words, the optimum instrumental distribution for AIS MCML, expressed in terms of is

 PY∼h⋆(Y=y)∝(ny)|y−yobs|py(1−p)n−y.

Importance sampling, even in its adaptive version (AIS), suffers from the degeneracy of weights. To compute the importance weights we have to know norming constants for every (or at least their ratios). This requirement severly restricts our choice of the family of instrumental densities . Available instrumental densities are far from and far from . Consequently the weights tend to degenerate (most of them are practically zero, while a few are very large). This effectively makes AIS in its basic form impractical. To obtain practically applicable algorithms, we can generalize AIS as follows. In the same situation as in Section 2, instead of the AIS estimate given by (2.3), we consider a more general Monte Carlo estimate of of the form

 ^cm(θ)=1mm∑j=1^d(θ,ψj), (3.1)

where the summands are computed by an MC method to be specified later. For now let us just assume that this method depends on a control parameter which may change at each step. A general adaptive algorithm is the following:

1. Set an initial value of , , .

2. Compute an ‘‘incremental estimate’’ .

3. Update the approximation of :

 ^cm(θ)=m−1m^cm−1(θ)+1m^d(θ,ψm).
4. Update : choose based on the history of the simulation.

5. ; go to 2.

AdapIS in Section 2 is a special case of AdapMCML which is obtained by letting .

### 3.1 Variance reduction via resampling and MCMC

The key property of the AIS exploited in Section 2 is the martingale structure implied by (2.4) and (2.6). The main asymptotic results generalize if given , the estimates of and its derivatives are conditionally unbiased. We propose an algorithm for computing in (3.1) which has the unbiasedness property and is more efficient than simple AIS. To some extent it is a remedy for the problem of weight degeneracy and reduces the variance of Monte Carlo approximations. As before, consider a family of “instrumental densities” . Assume they are properly normalized () and the control parameter belongs the same space as the parameter of interest (). Further assume that for every we have at our disposal a Markov kernel on which preserves distribution , i.e. . Let us fix This is a setup in which we can apply the following importance sampling-resampling algorithm ISReMC:

Algorithm ISReMC

1. Sample .

2. Compute the importance weights and put .

3. Sample [Discrete distribution with mass at point ].

4. For generate a Markov chain trajectory, starting from and using kernel :

 Y⋆k=Y0k,Y1k,…,Ysk,Ys+1k,…,Ys+nk.
5. Compute given by

 ^d(θ,ψ)=W∙l1rr∑k=11ns+n∑u=s+1fθ(Yuk)fψ(Yuk). (3.2)

This algorithm combines the idea of resampling (borrowed from sequential MC; steps 2 and 3) with computing ergodic averages in multistart MCMC (step 4; notice that is a burn-in and is the actual used sample size for a single MCMC run, repeated times). More details about ISReMC are in [7]. In our context it is sufficient to note the following key property of this algorithm.

###### Lemma 1

If is the output of IReMC then for every and every ,

 E^d(θ,ψ)=c(θ).

If Assumption 3 holds then also

 E∇^d(θ,ψ)=∇c(θ),E∇2^d(θ,ψ)=∇2c(θ).
###### Proof

We can express function and its derivatives as “unnormalized expectations” with respect to the probability distribution with density :

 c(θ)=EY∼πψfθ(Y)fψ(Y)c(ψ),∇c(θ)=EY∼πψ∇fθ(Y)fψ(Y)c(ψ),∇2c(θ)=EY∼πψ∇2fθ(Y)fψ(Y)c(ψ).

Let us focus on . Write

 a(y)=E(1ns+n∑u=s+1fθ(Yu)fψ(Yu)∣∣∣Y0=y) (3.3)

for the expectation of a single MCMC estimate started at . Kernel preserves by assumption, therefore . Put differently, .

We make a simple observation that

 E(^d(θ,ψ)∣∣Y1,…,Yl,Y⋆1,…,Y⋆r)=W∙l1rr∑k=1a(Y