# Adaptive Monte Carlo Maximum Likelihood

## Abstract

We consider Monte Carlo approximations to the maximum likelihood estimator in models with intractable norming constants. This paper deals with adaptive Monte Carlo algorithms, which adjust control parameters in the course of simulation. We examine asymptotics of adaptive importance sampling and a new algorithm, which uses resampling and MCMC. This algorithm is designed to reduce problems with degeneracy of importance weights. Our analysis is based on martingale limit theorems. We also describe how adaptive maximization algorithms of Newton-Raphson type can be combined with the resampling techniques. The paper includes results of a small scale simulation study in which we compare the performance of adaptive and non-adaptive Monte Carlo maximum likelihood algorithms.

## 1Introduction

Maximum likelihood (ML) is a well-known and often used method in estimation of parameters in statistical models. However, for many complex models exact calculation of such estimators is very difficult or impossible. Such problems arise if considered densities are known only up to intractable norming constants, for instance in Markov random fields or spatial statistics. The wide range of applications of models with unknown norming constants is discussed e.g. in [10]. Methods proposed to overcome the problems with computing ML estimates in such models include, among others, maximum pseudolikelihood [2], “coding method” [9] and Monte Carlo maximum likelihood (MCML) [4], [15], [9], [17]. In our paper we focus on MCML.

In influential papers [4], [5] the authors prove consistency and asymptotic normality of MCML estimators. To improve the performance of MCML, one can adjust control parameters in the course of simulation. This leads to adaptive MCML algorithms. We generalize the results of the last mentioned papers first to an adaptive version of importance sampling and then to a more complicated adaptive algorithm which uses resampling and Markov chain Monte Carlo (MCMC) [7]. Our analysis is asymptotic and it is based on the martingale structure of the estimates. The main motivating examples are the autologistic model (with or without covariates) and its applications to spatial statistics as described e.g. in [9] and the autonormal model [11].

## 2Adaptive Importance Sampling

Denote by , , a family of unnormalized densities on space . A dominating measure with respect to which these densities are defined is denoted for simplicity by . Let be an observation. We intend to find the maximizer of the log-likelihood

where is the normalizing constant. We consider the situation where this constant,

is *unknown and numerically intractable*. It is approximated with Monte Carlo simulation, resulting in

where is a Monte Carlo (MC) estimate of . The classical importance sampling (IS) estimate is of the form

where are i.i.d. samples from an instrumental density . Clearly, an optimal choice of depends on the maximizer of , so we should be able to improve our initial guess about while the simulation progresses. This is the idea behind *adaptive importance sampling* (AIS). A discussion on the choice of instrumental density is deferred to subsequent subsections. Let us describe an adaptive algorithm in the following form, suitable for further generalizations. Consider a parametric family , of instrumental densities.

`Algorithm`

AdapIS

Set an initial value of , , .

Draw .

Update the approximation of :

Update : choose based on the history of the simulation.

; go to 2.

At the output of this algorithm we obtain an AIS estimate

The samples are neither independent nor have the same distribution. However (Equation 3) has a nice *martingale* structure. If we put

then is -measurable. The well-known property of unbiasedness of IS implies that

In other words, are martingale differences (MGD), for every fixed .

### 2.1Hypo-convergence of and consistency of

In this subsection we make the following assumptions.

Assumption ? implies that for any ,

there is a constant such that for all ,

because . Note that Assumption ? is trivially true if the mapping is uniformly bounded for , . Recall also that

is a zero-mean martingale. Under Assumption ?, for a fixed , we have a.s. by the SLLN for martingales (see Theorem ?, Appendix A), so a.s. This is, however, insufficient to guarantee the convergence of maximum likelihood estimates (maximizers of ) to . Under our assumptions we can prove hypo-convergence of the log-likelihood approximations.

An equivalent definition of epi-convergence is in the following theorem:

As a corollary to this theorem comes the proposition that will be used to prove convergence of , the maximizer of , to (see, also, [1]).

(We will use Theorem ? many times.) Let be a sequence converging to and such that (such sequence exists). This implies that . On the other hand, , where the equality follows from the second assumption on . Summarizing, . In particular, .

Take any and let be such that . There exists a sequence converging to such that , hence . By arbitrariness of we obtain . This completes the proof.

The proof is similar to the proof of Theorem 1 in [5]. We have to prove that epi-converges to . Fix .

Step 1: For any , we have

Indeed,

The sum is that of martingale differences, so assuming that there is such that

the SLLN implies . We have the following estimates:

where the last inequality is by Assumption ?.

Step 2: We shall prove that .

The left-hand side is bounded from below by . Further, we have

where the first equality follows from the dominated convergence theorem (the dominator is ) and the last – from the Assumption ?.

Step 3: We have

Hence, .

Note that almost sure convergence in the next Proposition corresponds to the randomness introduced by `AdapIS`

and is fixed throughout this paper.

As we have already mentioned, by SLLN for martingales, , pointwise. Hypo-convergence of to implies, by Proposition ?, that the maximizers of have accumulation points that are the maximizers of . If has a unique maximizer then any convergent subsequence of , maximizers of , converges to . The conclusion follows immediately.

Of course, it is not easy to show boundedness of in concrete examples. In the next section we will prove consistency of in models where log-likelihoods and their estimates are concave.

### 2.2Central Limit Theorem for Adaptive Importance Sampling

Let be a maximizer of , i.e. the AIS estimate of the likelihood given by with . We assume that is a unique maximizer of . For asymptotic normality of , we will need the following assumptions.

Let us begin with some comments on these assumptions and note simple facts which follow from them. Assumption ? is a standard regularity condition. It implies that a martingale property similar to holds also for the gradients and hessians:

Assumption ? stipulates square root consistency of . It is automatically fulfilled if is concave, in particular for exponential families. Assumption ? combined with ? is a “diminishing adaptation” condition. It may be ensured by an appropriately specifying step 4 of `AdapIS`

. The next assumptions are not easy to verify in general, but they are satisfied for exponential families on finite spaces, in particular for our “motivating example”: autologistic model. Let us also note that our Assumption ? plays a similar role to Assumption (f) in [5].

Assumption ? together with and allows us to apply SLLN for martingales in a form given in Theorem ?, Appendix A. Indeed, , and are MGDs with bounded moments of order . It follows that, almost surely,

Now we are in a position to state the main result of this section.

It is well-known (see [12]) that we need to prove

and that for every , the following holds:

First we show . Since and , we obtain that

The denominator in the above expression converges to in probability, by . In view of Slutski’s theorem, to prove it is enough to show asymptotic normality of the numerator. We can write

where we use the notation

Now note that are martingale differences by and . Moreover,

so Assumptions ? and ? via dominated convergence and Assumption ? (with in the exponent) entail

Now we use Assumption ? (with in the exponent) to infer the Lyapunov-type condition

The last two displayed formulas are sufficient for a martingale CLT (Theorem ?, Appendix A). We conclude that

hence the proof of is complete.

Now we proceed to a proof of . By Taylor expansion,

for some . Consequently, the LHS of is

The first term above goes to zero in probability by Assumption ?. The second term also goes to zero because

in view of . Therefore holds and the proof is complete.

### 2.3Optimal importance distribution

We advocate adaptation to improve the choice of instrumental distribution . But which is the best? If we use (non-adaptive) importance sampling with instrumental distribution then the maximizer of the MC likelihood approximation has asymptotic normal distribution, namely , () with

This fact is well-known [5] and is a special case of Theorem ?. Since the asymptotic distribution is multidimensional its dispersion can be measured in various ways, e.g., though the determinant, the maximum eigenvalue or the trace of the covariance matrix. We examine the trace which equals to the asymptotic mean square error of the MCML approximation (the asymptotic bias is nil). Notice that

where

Since , the minimization of is equivalent to

subject to and . By Schwarz inequality we have

with equality only for . The optimum choice of is therefore

Unfortunately, this optimality result is chiefly of theoretical importance, because it is not clear how to sample from and how to compute the norming constant for this distribution. This might well be even more difficult than computing .

The following example shows some intuitive meaning of . It is a purely “toy example” because the simple analitical formulas exist for and while MC is considered only for illustration.

## 3Generalized adaptive scheme

Importance sampling, even in its adaptive version (AIS), suffers from the degeneracy of weights. To compute the importance weights we have to know norming constants for every (or at least their ratios). This requirement severly restricts our choice of the family of instrumental densities . Available instrumental densities are far from and far from . Consequently the weights tend to degenerate (most of them are practically zero, while a few are very large). This effectively makes AIS in its basic form impractical. To obtain practically applicable algorithms, we can generalize AIS as follows. In the same situation as in Section 2, instead of the AIS estimate given by , we consider a more general Monte Carlo estimate of of the form

where the summands are computed by an MC method to be specified later. For now let us just assume that this method depends on a control parameter which may change at each step. A general adaptive algorithm is the following:

`Algorithm`

AdapMCML

Set an initial value of , , .

Compute an “incremental estimate” .

Update the approximation of :

Update : choose based on the history of the simulation.

; go to 2.

AdapIS

in Section 2 is a special case of `AdapMCML`

which is obtained by letting .

### 3.1Variance reduction via resampling and MCMC

The key property of the AIS exploited in Section 2 is the martingale structure implied by and . The main asymptotic results generalize if *given , the estimates of and its derivatives are conditionally unbiased*. We propose an algorithm for computing in which has the unbiasedness property and is more efficient than simple AIS. To some extent it is a remedy for the problem of weight degeneracy and reduces the variance of Monte Carlo approximations. As before, consider a family of “instrumental densities” . Assume they are properly normalized () and the control parameter belongs the same space as the parameter of interest (). Further assume that for every we have at our disposal a Markov kernel on which preserves distribution , i.e. . Let us fix This is a setup in which we can apply the following importance sampling-resampling algorithm `ISReMC`

:

`Algorithm`

ISReMC

Sample .

Compute the importance weights and put .

Sample [Discrete distribution with mass at point ].

For generate a Markov chain trajectory, starting from and using kernel :

Compute given by

This algorithm combines the idea of resampling (borrowed from sequential MC; steps 2 and 3) with computing ergodic averages in multistart MCMC (step 4; notice that is a burn-in and is the actual used sample size for a single MCMC run, repeated times). More details about `ISReMC`

are in [7]. In our context it is sufficient to note the following key property of this algorithm.

We can express function and its derivatives as “unnormalized expectations” with respect to the probability distribution with density :

Let us focus on . Write

for the expectation of a *single* MCMC estimate started at . Kernel preserves by assumption, therefore . Put differently, .

We make a simple observation that

This conditional expectation takes into account only randomness of the MCMC estimate in step 4 of the algorithm. Now we consecutively “drop the conditions”:

The expectation above takes into account the randomness of the resampling in step 3. Finally, since in step 1, we have

This ends the proof for . Exactly the same argument applies to and .

We can embed the unbiased estimators produced by `ISReMC`

in our general adaptive scheme `AdapMCML`

. At each step of the adaptive algorithm, we have a new control parameter . We generate a sample from , compute weights, resample and run MCMC using . Note that the whole sampling scheme at stage (including computation of weights) depends on but not on . In the adaptive algorithm random variable is measurable, where is the history of simulation up to stage . Therefore the sequence of incremental estimates satisfies, for every ,

Moreover, first and second derivatives exist and

Formulas and are analogues of and .

### 3.2Asymptotics of adaptive MCML

In this subsection we restrict our considerations to *exponential families on finite spaces*. This will allow us to prove main results without formulating complicated technical assumptions (integrability conditions analoguous to Assumption ? would be cumbersome and difficult to verify). Some models with important applications, such as *autologistic* one, satisfy the assumptions below.

Now, since is finite (although possibly very large),

Note that Assumption ? is automatically satisfied.

We consider algorithm `AdapMCML`

with incremental estimates produced by `ISReMC`

. The likelihood ratio in and its derivatives assume the following form:

(the derivatives are with respect to , with fixed). Assumptions ? and ? together with Assumption ? imply that are uniformly bounded, if belongs to a compact set. Indeed, the importance weights in are uniformly bounded by Assumptions ? and ?. Formula shows that the ratios are also uniformly bounded for and belonging to bounded sets. Since the statistics are bounded, the same argument shows that and are uniformly bounded, too.

For exponential families, and are convex functions. It is a well known property of exponential family that and thus it is a nonnegative definite matrix. A closer look at reveals that is also a variance-covariance matrix with respect to some discrete distribution. Indeed, it is enough to note that is of the form

for some and (although if `ISReMC`

within `AdapMCML`

is used to produce then and are quite complicated random variables depending on ).

Let be a maximizer of and assume that is the unique maximizer of .

Boundedness of for a fixed together with implies that is a bounded sequence of martingale differences. It satisfies the assumptions of SLLN for martingales in Appendix A. Therefore . Consequently, we also have , pointwise. Pointwise convergence of convex functions implies uniform convergence on compact sets [13]. The conclusion follows immediately.

Note that is a purely imaginary object, being a result of an algorithm initialized at a “limiting instrumental parameter” and evaluated at the “true MLE” , both unknown. It is introduced only to concisely describe the variance/covariance matrix . Note also that is equal to , the *observed* value of the sufficient statistic.

The proof is similar to that of Theorem ?, so we will not repeat all the details. The key argument is again based on SLLN and CLT for martingales (see Appendix A). In the present situation we have more complicated estimators than in Theorem ?. They are now given by . On the other hand, we work under the assumption that is an exponential family on a finite state space . This implies that conditions and are fulfilled and the martingale differences therein are uniformly bounded (for any fixed and also for running through a compact set). Concavity of and further simplifies the argumentation.

As in the proof of Theorem ?, we claim that and hold. The first of these conditions, , is justified exactly in the same way: by applying the CLT to the numerator and SLLN to the denominator of . Now, we consider martingale differences given by

It follows from the discussion preceding the theorem that are uniformly bounded, so the CLT can be applied. Similarly, SLLN can be applied to .

Assumption holds because third order derivatives of are uniformly bounded in the neighbouhood of . This allows us to infer condition in the same way as in the proof of Theorem ?.

Note also that we do not need Assumption ?. To deduce the conclusion of the theorem from we have to know that is square-root consistent. But this follows from the facts that is concave and the maximizer of the quadratic function in is square-root consistent by .

## 4Simulation results

In a series of small scale simulation experiments, we compare two algorithms. The first one, used as a “Benchmark” is a non-adaptive MCML. The other is `AdapMCML`

which uses `ISReMC`

estimators, as described in Section 3. Synthetic data used in our study are generated from autologistic model, described below. Both algorithms use Gibbs Sampler (GS) as an MCMC subroutine and both use Newton-Raphson iterations to maximize MC log-likelihood approximations.

### 4.1Non-adaptive and adaptive Newton-Raphson-type algorithms

Well-known Newton-Raphson (NR) method in our context updates points approximating maximum of the log-likelihood as follows:

where is given by .

Non-adaptive

algorithms are obtained when some fixed value of the “instrumental parameter” is used to produce MC samples. Below we recall a basic version of such an algorithm, proposed be Geyer [5] and examined e.g. in [9]. If we consider an exponenial family given by Assumption ?, then . Let be fixed and be samples approximately drawn from distribution . In practice an MCMC method is applied to produce such samples, stands for a burn-in. In all our experiments the MCMC method is a deterministic scan Gibbs Sampler (GS). Now, we let

Consequently, if and , then the derivatives of the log-likelihood are expressed via weighted moments,

The adaptive

algorithm uses given by , with summands computed by `ISReMC`

, exactly as described in Section 3. The MCMC method imbedded in `ISReMC`

is GS, the same as in the non-adaptive algorithm. Importance sampling distribution in steps 1 and 2 of `ISReMC`

is pseudo-likelihood, described by formula in the next subsection. Computation of in step 4 of `AdapMCML`

uses one NR iteration: