On the Stationary Distribution of Iterative Imputations

Jingchen Liu, Andrew Gelman, Jennifer Hill, and Yu-Sung Su

Columbia University, Columbia University, New York University,
Tsinghua University
July 14, 2019
Abstract

Iterative imputation, in which variables are imputed one at a time, each given a model predicting it from all the others, is a popular technique that can be convenient and flexible, as it replaces a potentially difficult multivariate modeling problem with relatively simple univariate regressions. In this paper, we begin to characterize the stationary distributions of iterative imputations and their statistical properties. More precisely, when the conditional models are compatible (defined in the text), we give a set of sufficient conditions under which the imputation distribution converges in total variation to the posterior distribution of a Bayesian model. When the conditional models are incompatible but valid, we show that the combined imputation estimator is consistent.

1 Introduction

Iterative imputation is a widely used approach for imputing multivariate missing data. The procedure starts by randomly imputing missing values using some simple stochastic algorithm. Missing values are then imputed one variable at a time, each conditionally on all the others using a model fit to the current iteration of the completed data. The variables are looped through until approximate convergence (as measured, for example, by the mixing of multiple chains).

Iterative imputation can be an appealing way to express uncertainty about missing data. There is no need to explicitly construct a joint multivariate model of all types of variables: continuous, ordinal, categorical, and so forth. Instead, one need only specify a sequence of families of conditional models such as linear regression, logistic regression, and other standard and already programmed forms. The distribution of the resulting imputations is implicitly defined as the invariant (stationary) distribution of the Markov chain corresponding to the iterative fitting and imputation process.
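The loop just described can be sketched in a few lines. The following minimal Python sketch (all function and variable names are our own; for brevity it plugs in the fitted regression estimates rather than drawing the coefficients and variance from their posterior, so it understates parameter uncertainty) illustrates the mechanics for continuous data:

```python
import numpy as np

def chained_imputation(y, miss, n_iter=20, rng=0):
    """One chain of iterative ('chained') imputation for continuous data.

    y    : (n, p) array; missing cells may hold any placeholder (e.g. NaN)
    miss : (n, p) boolean mask, True where an entry is missing
    Each variable is imputed from a linear regression on all the others,
    refit at every sweep, with residual noise added to each imputation.
    """
    rng = np.random.default_rng(rng)
    y = y.copy()
    n, p = y.shape
    # Step 0: crude starting imputation (column means of observed entries).
    for j in range(p):
        y[miss[:, j], j] = y[~miss[:, j], j].mean()
    for _ in range(n_iter):
        for j in range(p):
            if not miss[:, j].any():
                continue
            X = np.column_stack([np.ones(n), np.delete(y, j, axis=1)])
            obs = ~miss[:, j]
            # Fit y_j on y_{-j} using the currently completed data.
            beta, *_ = np.linalg.lstsq(X[obs], y[obs, j], rcond=None)
            resid = y[obs, j] - X[obs] @ beta
            sigma = resid.std(ddof=X.shape[1])
            # Stochastic imputation: predicted mean plus residual noise.
            y[miss[:, j], j] = X[miss[:, j]] @ beta + \
                rng.normal(0.0, sigma, miss[:, j].sum())
    return y
```

A full implementation in the spirit of mice or mi would, at each step, also draw the regression parameters from their conditional posterior before imputing, which is exactly the step analyzed in this paper.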

Iterative, or chained, imputation is convenient and flexible and has been implemented in various ways in several statistical software packages, including mice [28] and mi [26] in R, IVEware [16] in SAS, and ice in Stata [19, 20]. The popularity of these programs suggests that the resulting imputations are believed to be of practical value. However, the theoretical properties of iterative imputation algorithms are not well understood. Even if, as we would prefer, the fitting of each imputation model and the imputations themselves are performed using conditional Bayesian inference, the stationary distribution of the algorithm (if it exists) does not in general correspond to Bayesian inference on any specified multivariate distribution.

Key questions are: (1) Under what conditions does the algorithm converge to a stationary distribution? (2) What statistical properties does the procedure admit given that a unique stationary distribution exists?

Regarding the first question, researchers have long known that the Markov chain may be non-recurrent (“blowing up” to infinity or drifting like a nonstationary random walk), even if each of the conditional models is fitted using a proper prior distribution.

In this paper, we focus mostly on the second question: the characterization of the stationary distribution of iterative imputation, conditional on its existence. Unlike usual MCMC algorithms, which are designed so that the invariant distribution and the target distribution are identical, the invariant distribution of iterative imputation (even if it exists) is largely unknown.

The analysis of iterative imputation is challenging for at least two reasons. First, the range of choices of conditional models is wide, and it would be difficult to provide a solution applicable to all situations. Second, the literature on Markov chains focuses on known transition distributions. With iterative imputation, the distributions for the imputations are known only within specified parametric families. For example, if a particular variable is to be updated conditional on all the others using logistic regression, the actual updating distribution depends on the logistic regression coefficients which are themselves estimated given the latest update of the missing values.

The main contribution of this paper is to develop a mathematical framework under which the asymptotic properties of iterative imputation can be discussed. In particular, we demonstrate the following results.

  1. Given the existence of a unique invariant (stationary) distribution of the iterative imputation Markov chain, we provide a set of conditions under which this distribution converges in total variation to the posterior distribution of a joint Bayesian model, as the sample size tends to infinity. Under these conditions, iterative imputation is asymptotically equivalent to full Bayesian imputation using some joint model. Among these conditions, the most important is that the conditional models are compatible—that there exists a joint model whose conditional distributions are identical to the conditional models specified by the iterative imputation (Definition 1). We discuss this in Section 3.

  2. We argue that model compatibility is typically a necessary condition for the iterative imputation distribution to converge to the posterior distribution of some Bayesian model (Section 3.4).

  3. For incompatible models, whose imputation distributions are generally different from those of any Bayesian model, we show that the combined completed-data maximum likelihood estimate of the iterative imputation is a consistent estimator if the set of conditional models is valid, that is, if each conditional family contains the true probability distribution (Definition 3 in Section 4).

The analysis presented in this paper connects to the existing separate literatures on missing-data imputation and Markov chain convergence. Standard textbooks on imputation inference are [11, 22], and some key papers are [10, 3, 13, 14, 21, 23, 24]. Large-sample properties are studied by [17, 25, 29], small-sample properties by [3], and the issue of congeniality between the imputer’s and analyst’s models is considered by [13].

Our asymptotic findings for compatible and incompatible models use results on convergence of Markov chains, a subject on which there is a vast literature on stability and rate of convergence ([8, 1, 2]). In addition, empirical diagnostics of Markov chains have been suggested by many authors, for instance, [7]. For the analysis of compatible models, we need to construct a bound on the convergence rate using renewal theory [4, 15, 18]; this approach has the advantage of not assuming the existence of an invariant distribution, which instead follows naturally from the minorization and drift conditions.

In Section 2 of this article, we lay out our notation and assumptions. We then briefly review the framework of iterative imputation and the Gibbs sampler. In Section 3, we investigate compatible conditional models. In Section 4, the discussion focuses on incompatible models. Section 5 includes several simulation examples. An appendix is attached containing the technical developments and a brief review of the literature for Markov chain convergence via renewal theory.

2 Background

Consider a data set with $n$ cases and $p$ variables, where $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_p)$ represents the complete data and $\mathbf{y}_j$ is the $j$-th variable. Let $\mathbf{r}_j$ be the vector of observed-data indicators for variable $j$, equaling $1$ for observed entries and $0$ for missing ones, and let $\mathbf{y}_j^{\mathrm{obs}}$ and $\mathbf{y}_j^{\mathrm{mis}}$ denote the observed and missing subsets of variable $j$.

To facilitate our description of the procedures, we define
$$\mathbf{y}^{\mathrm{obs}} = (\mathbf{y}_1^{\mathrm{obs}}, \ldots, \mathbf{y}_p^{\mathrm{obs}}), \qquad \mathbf{y}^{\mathrm{mis}} = (\mathbf{y}_1^{\mathrm{mis}}, \ldots, \mathbf{y}_p^{\mathrm{mis}}).$$

We use boldface ($\mathbf{y}_j$) to denote the entire data set and plain type ($y_j$) to denote individual observations. Therefore, $y_j$ denotes the $j$-th variable of one observation and $y_{-j}$ denotes all the variables except for the $j$-th one.

Throughout, we assume that the missing-data process is ignorable. One set of sufficient conditions for ignorability is that the process is missing at random and that the parameter spaces of the data model and the missingness mechanism are distinct, with independent prior distributions [11, 22].

2.1 Inference using multiple imputations

Multiple imputation is a convenient tool for handling an incomplete data set by means of complete-data procedures. The framework consists of producing $m$ copies of the imputed data and applying the user's complete-data procedure to each of the multiply imputed data sets. Suppose that $m$ copies of point estimates and variance estimates are obtained, denoted by $(\hat\theta^{(i)}, \hat V^{(i)})$, $i = 1, \ldots, m$. The next step is to combine them into a single point estimate $\bar\theta = \frac{1}{m}\sum_{i=1}^m \hat\theta^{(i)}$ and a single variance estimate [11]. If the imputed data are drawn from the joint posterior distribution of the missing data under a Bayesian model, then under appropriate congeniality conditions, $\bar\theta$ is asymptotically equal to the posterior mean of $\theta$, and the combined variance estimate is asymptotically equal to the posterior variance of $\theta$ ([22, 13]). The large-sample theory of Bayesian inference ensures that the posterior mean and variance are asymptotically equivalent to the maximum likelihood estimate and its variance based on the observed data alone (see [5]). Therefore, the combined estimator from the imputed samples is efficient. Imputations can also be constructed and used under other inferential frameworks; for example, Robins and Wang [17, 29] propose estimates based on estimating equations and derive corresponding combining rules. For our purposes here, what is relevant is that the multiple imputations are being used to represent uncertainty about the joint distribution of missing values in a multivariate dataset.
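The combining step described above can be written compactly. The sketch below implements the standard pooling rules of [11] for a scalar estimand; the function name and interface are our own:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m multiply-imputed analyses of a scalar estimand.

    estimates, variances : length-m sequences holding the complete-data
    point estimate and its variance estimate from each imputed data set.
    Returns the pooled point estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()            # pooled point estimate
    ubar = u.mean()            # within-imputation variance
    b = q.var(ddof=1)          # between-imputation variance
    t = ubar + (1 + 1 / m) * b # total variance
    return qbar, t
```

For example, `rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` pools three imputed analyses into one estimate with a variance inflated by the between-imputation spread.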

2.2 Bayesian modeling, imputation, and Gibbs sampling

In Bayesian inference, multiply imputed data sets are treated as samples from the posterior distribution of the full (incompletely observed) data matrix. In the parametric Bayesian approach, one specifies a family of distributions $f(\mathbf{y} \mid \theta)$ and a prior $\pi(\theta)$ and then performs inference using i.i.d. samples from the posterior predictive distribution,

$$p(\mathbf{y}^{\mathrm{mis}} \mid \mathbf{y}^{\mathrm{obs}}) = \int f(\mathbf{y}^{\mathrm{mis}} \mid \mathbf{y}^{\mathrm{obs}}, \theta)\, p(\theta \mid \mathbf{y}^{\mathrm{obs}})\, d\theta, \qquad (1)$$

where $p(\theta \mid \mathbf{y}^{\mathrm{obs}}) \propto \pi(\theta) f(\mathbf{y}^{\mathrm{obs}} \mid \theta)$. Direct simulation from (1) is generally difficult. One standard solution is to draw approximate samples using the Gibbs sampler or some more complicated Markov chain Monte Carlo (MCMC) algorithm. In the scenario of missing data, one can use the “data augmentation” strategy to iteratively draw $\theta$ given $(\mathbf{y}^{\mathrm{obs}}, \mathbf{y}^{\mathrm{mis}})$ and $\mathbf{y}^{\mathrm{mis}}$ given $(\mathbf{y}^{\mathrm{obs}}, \theta)$. Under regularity conditions (positive recurrence, irreducibility, and aperiodicity; see [8]), the Markov process is ergodic with limiting distribution (1).

In order to connect these results to the iterative imputation that is the subject of the present article, we consider a slightly different Gibbs scheme, which consists of indefinite iteration of the following steps.

Step 1.

Draw $\theta \sim p(\theta \mid \mathbf{y}_1^{\mathrm{obs}}, \mathbf{y}_{-1})$ and $\mathbf{y}_1^{\mathrm{mis}} \sim f(\mathbf{y}_1^{\mathrm{mis}} \mid \mathbf{y}_1^{\mathrm{obs}}, \mathbf{y}_{-1}, \theta)$;

Step 2.

Draw $\theta \sim p(\theta \mid \mathbf{y}_2^{\mathrm{obs}}, \mathbf{y}_{-2})$ and $\mathbf{y}_2^{\mathrm{mis}} \sim f(\mathbf{y}_2^{\mathrm{mis}} \mid \mathbf{y}_2^{\mathrm{obs}}, \mathbf{y}_{-2}, \theta)$;

⋮

Step $p$.

Draw $\theta \sim p(\theta \mid \mathbf{y}_p^{\mathrm{obs}}, \mathbf{y}_{-p})$ and $\mathbf{y}_p^{\mathrm{mis}} \sim f(\mathbf{y}_p^{\mathrm{mis}} \mid \mathbf{y}_p^{\mathrm{obs}}, \mathbf{y}_{-p}, \theta)$.

At each step, the posterior distribution is based on the updated values of the parameters and imputed data. It is not hard to verify that the Markov chain evolving according to steps 1 to $p$ (under mild regularity conditions) converges to the posterior distribution of the corresponding Bayesian model.
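As a toy illustration of the data-augmentation strategy (our own construction, not an example from the text): for a normal mean with a conjugate prior and some entirely missing observations, alternating the two conditional draws yields a chain whose marginal over the parameter matches the observed-data posterior, which is a useful sanity check of such schemes:

```python
import numpy as np

def data_augmentation(y_obs, n_mis, n_iter=2000, tau2=10.0, rng=0):
    """Toy data augmentation for y_i ~ N(theta, 1) with a N(0, tau2) prior,
    where n_mis observations are entirely missing.  Alternates

        theta ~ p(theta | y_obs, y_mis)   (conjugate normal posterior)
        y_mis ~ f(y_mis | theta)          (the model, given theta)

    and returns the sampled thetas.  Because the missing entries carry no
    information, the chain's marginal for theta should match the
    observed-data posterior N(sum(y_obs)/(n_obs + 1/tau2), 1/(n_obs + 1/tau2)).
    """
    rng = np.random.default_rng(rng)
    n_obs = len(y_obs)
    y_mis = np.zeros(n_mis)
    thetas = np.empty(n_iter)
    for k in range(n_iter):
        n = n_obs + n_mis
        s = y_obs.sum() + y_mis.sum()
        post_var = 1.0 / (n + 1.0 / tau2)
        theta = rng.normal(post_var * s, np.sqrt(post_var))
        y_mis = rng.normal(theta, 1.0, size=n_mis)
        thetas[k] = theta
    return thetas
```

Running the chain and comparing its long-run mean against the closed-form observed-data posterior mean is a simple check that the augmentation has the intended stationary distribution.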

2.3 Iterative imputation and compatibility

For iterative imputation, we need to specify $p$ conditional models,
$$g_j(y_j \mid y_{-j}, \theta_j), \qquad \theta_j \in \Theta_j,$$
for $j = 1, \ldots, p$, with prior distributions $\pi_j(\theta_j)$ for each $j$. Iterative imputation adopts the following scheme to construct a Markov chain.

Step 1.

Draw $\theta_1$ from $\pi_1(\theta_1 \mid \mathbf{y}_1^{\mathrm{obs}}, \mathbf{y}_{-1})$, which is the posterior distribution associated with $g_1$ and $\pi_1$; draw $\mathbf{y}_1^{\mathrm{mis}}$ from $g_1(\mathbf{y}_1^{\mathrm{mis}} \mid \mathbf{y}_1^{\mathrm{obs}}, \mathbf{y}_{-1}, \theta_1)$;

Step 2.

Draw $\theta_2$ from $\pi_2(\theta_2 \mid \mathbf{y}_2^{\mathrm{obs}}, \mathbf{y}_{-2})$, which is the posterior distribution associated with $g_2$ and $\pi_2$; draw $\mathbf{y}_2^{\mathrm{mis}}$ from $g_2(\mathbf{y}_2^{\mathrm{mis}} \mid \mathbf{y}_2^{\mathrm{obs}}, \mathbf{y}_{-2}, \theta_2)$;

⋮

Step $p$.

Draw $\theta_p$ from $\pi_p(\theta_p \mid \mathbf{y}_p^{\mathrm{obs}}, \mathbf{y}_{-p})$, which is the posterior distribution associated with $g_p$ and $\pi_p$; draw $\mathbf{y}_p^{\mathrm{mis}}$ from $g_p(\mathbf{y}_p^{\mathrm{mis}} \mid \mathbf{y}_p^{\mathrm{obs}}, \mathbf{y}_{-p}, \theta_p)$.

Iterative imputation has the practical advantage that, at each step, one only needs to set up a sensible regression model of $y_j$ given $y_{-j}$. This substantially reduces the modeling task, given that there are usually standard linear or generalized linear models for univariate responses of different variable types. In contrast, full Bayesian (or likelihood) modeling requires the more difficult task of constructing a joint model for $(y_1, \ldots, y_p)$. Whether it is preferable to perform $p$ easy tasks or one difficult task depends on the problem at hand. All that is needed here is the recognition that, in some settings, users prefer the easy steps of iterative imputation.

But iterative imputation has conceptual problems. Except in some special cases, there will not in general exist a joint distribution of $(y_1, \ldots, y_p)$ whose conditional distributions coincide with the specified conditional models. In addition, it is unclear whether the Markov process has an invariant probability distribution; if such a distribution exists, it lacks characterization.

In this paper, we discuss the properties of the stationary distribution of the iterative imputation process by first classifying the set of conditional models as compatible (defined as there existing a joint model which is consistent with all the conditional models) or incompatible.

We refer to the Markov chain generated by the scheme in Section 2.2 as the Gibbs chain and that generated by the scheme in Section 2.3 as the iterative chain. Our central analysis works by coupling the two.

3 Compatible conditional models

3.1 Model compatibility

Analysis of iterative imputation is particularly challenging partly because of the large collection of possible choices of conditional models. We begin by considering a restricted class, compatible conditional models, defined as follows:

Definition 1

A set of conditional models $\{g_j(y_j \mid y_{-j}, \theta_j) : \theta_j \in \Theta_j,\ j = 1, \ldots, p\}$ is said to be compatible if there exists a joint model $\{f(y \mid \theta) : \theta \in \Theta\}$ and a collection of surjective maps $\{t_j : \Theta \to \Theta_j,\ j = 1, \ldots, p\}$ such that for each $j$, $\theta_j \in \Theta_j$, and $\theta \in t_j^{-1}(\theta_j) = \{\theta : t_j(\theta) = \theta_j\}$,
$$g_j(y_j \mid y_{-j}, \theta_j) = f(y_j \mid y_{-j}, \theta).$$

Otherwise, the set $\{g_j\}$ is said to be incompatible.

Though it imposes certain restrictions, the class of compatible models includes many procedures in practical use (e.g., ice in Stata). In what follows, we give a few examples of compatible and incompatible conditional models.

We begin with a simple linear model, which we shall revisit in Section 5.

Example 1 (bivariate Gaussian)

Consider a bivariate continuous variable $(y_1, y_2)$ and conditional models
$$y_1 \mid y_2 \sim N(\alpha_1 + \beta_1 y_2,\ \tau_1^2), \qquad y_2 \mid y_1 \sim N(\alpha_2 + \beta_2 y_1,\ \tau_2^2).$$

These two conditional models are compatible if and only if the parameters $(\alpha_1, \beta_1, \tau_1^2, \alpha_2, \beta_2, \tau_2^2)$ lie on a subspace determined from the joint model
$$(y_1, y_2) \sim N(\mu, \Sigma),$$

with $\mu = (\mu_1, \mu_2)$ and $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$. The reparameterization from $(\mu, \Sigma)$ to the parameters of the conditional models is:
$$\beta_1 = \rho\,\sigma_1/\sigma_2, \quad \alpha_1 = \mu_1 - \beta_1\mu_2, \quad \tau_1^2 = (1-\rho^2)\,\sigma_1^2,$$
and symmetrically for $(\alpha_2, \beta_2, \tau_2^2)$.

The following example is a natural extension.

Example 2 (continuous data)

Consider a set of conditional linear models: for each $j$,
$$y_j \mid y_{-j} \sim N(\alpha_j + \beta_j^\top y_{-j},\ \tau_j^2),$$

where $\beta_j$ is a $(p-1)$-dimensional vector, $j = 1, \ldots, p$. Consider the joint model $y \sim N(\mu, \Sigma)$. Then the conditional distribution of each $y_j$ given $y_{-j}$ is Gaussian. The maps $t_j$ can be derived by conditional multivariate Gaussian calculations.

Example 3 (continuous and binary data)

Let $y_1$ be a Bernoulli random variable and $y_2$ be a continuous random variable. The conditional models are as follows:
$$y_1 \mid y_2 \sim \mathrm{Bernoulli}\!\left(\frac{e^{a + b y_2}}{1 + e^{a + b y_2}}\right), \qquad y_2 \mid y_1 \sim N(\alpha + \beta y_1,\ \tau^2).$$

The above conditional models are compatible with the following joint model:
$$f(y_1, y_2) \propto \exp\!\left(a\, y_1 + b\, y_1 y_2 - \frac{(y_2 - \alpha)^2}{2\tau^2}\right).$$

If we let
$$\beta = b\,\tau^2,$$

the conditional models and this joint model are compatible with each other. Similarly compatible models can be defined for other natural exponential families. See [6, 12].

Example 4 (incompatible Gaussian conditionals)

There are many incompatible conditional models. For instance, the conditional models
$$y_1 \mid y_2 \sim N(\beta_1 y_2,\ 1), \qquad y_2 \mid y_1 \sim N(\beta_2 y_1,\ 1)$$

are compatible only if $\beta_1 = \beta_2$, with $|\beta_1| < 1$.

3.2 Total variation distance between two transition kernels

Let $\{z_g(k) : k \geq 0\}$ be the Gibbs chain and $\{z_i(k) : k \geq 0\}$ be the iterative chain. Both chains live on the space of the missing data. We write the completed data as $(\mathbf{y}^{\mathrm{obs}}, z_g(k))$ for the Gibbs chain and $(\mathbf{y}^{\mathrm{obs}}, z_i(k))$ for the iterative chain. The transition kernels are

$$K_g(z, A) = P\big(z_g(k+1) \in A \mid z_g(k) = z\big), \qquad K_i(z, A) = P\big(z_i(k+1) \in A \mid z_i(k) = z\big), \qquad (2)$$

where $z$ is a generic notation for the state of the processes. The transition kernels ($K_g$ and $K_i$) depend on $\mathbf{y}^{\mathrm{obs}}$. For simplicity, we omit the index of $\mathbf{y}^{\mathrm{obs}}$ in the notation. Also, we let
$$P^{(k)}_\nu(A) = P\big(z(k) \in A \mid z(0) \sim \nu\big)$$
for each chain, $\nu$ being some starting distribution. The probability measure $P^{(k)}_\nu$ also depends on $\mathbf{y}^{\mathrm{obs}}$. Let $\|\cdot\|_{TV}$ denote the total variation distance between two measures; that is, for two measures $\mu$ and $\nu$ defined on the same probability space,
$$\|\mu - \nu\|_{TV} = \sup_A |\mu(A) - \nu(A)|.$$

We further define the distance between the transition kernels at a state $z$ and over a set $A$,
$$d_z(K_g, K_i) = \|K_g(z, \cdot) - K_i(z, \cdot)\|_{TV}, \qquad d_A(K_g, K_i) = \sup_{z \in A} d_z(K_g, K_i),$$

for measurable $A$. Let $\pi_g$ and $\pi_i$ be the stationary distributions of $K_g$ and $K_i$. We intend to establish conditions under which
$$\|\pi_g - \pi_i\|_{TV} \to 0$$

in probability as the sample size $n \to \infty$, and thus the iterative imputation and the joint Bayesian imputation are asymptotically the same.
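For distributions on a finite state space, the total variation distance reduces to half the $L_1$ distance between the probability vectors, which is convenient for numerical checks; a minimal helper of our own:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions,
    using the identity sup_A |mu(A) - nu(A)| = (1/2) * sum_x |p(x) - q(x)|.
    (The supremum is attained at A = {x : p(x) > q(x)}.)"""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()
```

For example, `tv_distance([0.7, 0.3], [0.4, 0.6])` gives `0.3`, and two identical distributions are at distance `0`.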

Our basic strategy for analyzing the compatible conditional models is first to establish that the transition kernels of the two chains are close to each other on a large region (depending on the observed data), that is, that the total variation distance between the kernels vanishes uniformly on that region as the sample size tends to infinity; and, second, to show that the two stationary distributions are close to each other in total variation, in that the stationary distributions are completely determined by the transition kernels. In this subsection, we start with the first step, that is, showing that the kernel of the iterative chain converges to that of the Gibbs chain.

Both the Gibbs chain and the iterative chain evolve by updating each missing variable from the corresponding posterior predictive distribution. Upon comparing the difference between the two transition kernels associated with the simulation schemes in Sections 2.2 and 2.3, it suffices to compare the following posterior predictive distributions (for each $j$):

$$\int f(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j}, \theta)\, p_f(\theta \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j})\, d\theta, \qquad (3)$$
$$\int g_j(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j}, \theta_j)\, p_{g_j}(\theta_j \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j})\, d\theta_j, \qquad (4)$$

where $p_f$ and $p_{g_j}$ denote the posterior distributions under $f$ and $g_j$, respectively. Due to compatibility, the distributions of the missing data given the parameters are the same for the joint Bayesian model and the iterative imputation model:

$$f(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j}, \theta) = g_j(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j}, \theta_j)$$

if $\theta_j = t_j(\theta)$. The only difference lies in the posterior distributions. In fact, the distance between the two posterior predictive distributions is bounded by the distance between the posterior distributions of the parameters. Therefore, we move to comparing $p_f(\theta \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j})$ and $p_{g_j}(\theta_j \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_{-j})$.

Parameter augmentation.

Upon comparing the posterior distributions of the joint-model parameter $\theta$ and the conditional-model parameter $\theta_j$, the first disparity to reconcile is that the dimensions are usually different. Typically $\theta_j$ is of lower dimension. Consider the linear model in Example 1. The conditional model includes three parameters (two regression coefficients and the variance of the errors), while the joint model has five parameters. This is because the (conditional) regression models are usually conditional on the covariates. The joint model parameterizes not only the conditional distribution of $y_j$ given $y_{-j}$ but also the marginal distribution of $y_{-j}$. Therefore, it includes extra parameters, although the distribution of the missing data is independent of these parameters. We augment the parameter space of the iterative imputation with the parameters of the marginal distribution of the covariates, together with the corresponding map. The augmented parameter is a non-degenerate reparameterization of $\theta$; that is, the augmented map is one-to-one (invertible).

To illustrate this parameter augmentation, we consider the linear model in Example 1, in which $\theta = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \rho)$, where we use $\mu_1$ and $\sigma_1^2$ to denote the mean and variance of the first variable, $\mu_2$ and $\sigma_2^2$ to denote the mean and variance of the second, and $\rho$ to denote the correlation. The reparameterization is
$$t_1(\theta) = (\alpha_1, \beta_1, \tau_1^2) = \big(\mu_1 - \rho\,(\sigma_1/\sigma_2)\,\mu_2,\ \rho\,\sigma_1/\sigma_2,\ (1-\rho^2)\,\sigma_1^2\big), \qquad t_{-1}(\theta) = (\mu_2, \sigma_2^2).$$

The function $t_1$ maps $\theta$ to the regression coefficients and the variance of the residuals; $t_{-1}$ maps $\theta$ to the marginal mean and variance of $y_2$. Similarly, we can define the maps $t_2$ and $t_{-2}$.

Impact of the prior distribution.

Because we are assuming compatibility, we can drop the notation $g_j$ for the conditional model of the $j$-th variable. Instead, we unify the notation to that of the joint Bayesian model $f$. In addition, we abuse notation and write $f(y_j \mid y_{-j}, \theta_j)$ for $f(y_j \mid y_{-j}, \theta)$ whenever $t_j(\theta) = \theta_j$. For instance, in Example 1, we write $f(y_1 \mid y_2, \alpha_1, \beta_1, \tau_1^2)$ as long as $\alpha_1 = \mu_1 - \beta_1\mu_2$, $\beta_1 = \rho\,\sigma_1/\sigma_2$, and $\tau_1^2 = (1-\rho^2)\sigma_1^2$.

The prior distribution on for the joint Bayesian model implies a prior on , denoted by

For the full Bayesian model, the posterior distribution of is

Because , the above posterior distribution can be further reduced to

If we write

then the posterior distribution of under the joint Bayesian model is

Compared with the posterior distribution of the iterative imputation procedure, which is proportional to

the difference lies in the prior distributions, and .

Controlling the distance between the posterior predictive distributions.

We put forward tools to control the distance between the two posterior predictive distributions in (3) and (4). Let $x$ be generic notation for the observed data, and let $p_1(\theta \mid x)$ and $p_2(\theta \mid x)$ be two posterior densities of $\theta$. Let $f(\tilde x \mid \theta)$ be the density function for future observations given the parameter $\theta$, and let $\hat p_1$ and $\hat p_2$ be the posterior predictive distributions:
$$\hat p_l(\tilde x \mid x) = \int f(\tilde x \mid \theta)\, p_l(\theta \mid x)\, d\theta, \qquad l = 1, 2.$$

It is straightforward to obtain that

$$\|\hat p_1(\cdot \mid x) - \hat p_2(\cdot \mid x)\|_{TV} \leq \|p_1(\cdot \mid x) - p_2(\cdot \mid x)\|_{TV}. \qquad (5)$$
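Inequality (5) is easy to confirm numerically in a finite setting: with a common likelihood table and two posteriors over a finite parameter grid (all numbers below are arbitrary), the induced posterior predictives are never farther apart in total variation than the posteriors themselves:

```python
import numpy as np

rng = np.random.default_rng(0)
n_theta, n_x = 5, 8

# A random likelihood table f(x | theta): each row is a distribution over x.
f = rng.random((n_theta, n_x))
f /= f.sum(axis=1, keepdims=True)

# Two posterior distributions over the same finite parameter grid.
pi1 = rng.random(n_theta); pi1 /= pi1.sum()
pi2 = rng.random(n_theta); pi2 /= pi2.sum()

# Posterior predictive distributions: mixtures of the likelihood rows.
pred1 = pi1 @ f
pred2 = pi2 @ f

tv_pred = 0.5 * np.abs(pred1 - pred2).sum()
tv_post = 0.5 * np.abs(pi1 - pi2).sum()
```

Here `tv_pred <= tv_post` holds for any choice of likelihood table and posteriors, since mixing two measures through a common kernel can only contract their total variation distance.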

The next proposition provides sufficient conditions under which the total variation distance between two posterior distributions based on different priors vanishes.

Proposition 1

Let be the sample size. Let and be two posterior density functions that share the same likelihood but have two different prior distributions and . Let

and denote sample size. Let be the partial derivative with respect to and let be a random variable such that

where “ ” denotes inner product and . If there exists a random variable with finite variance under , such that

(6)

then there exists a constant such that for sufficiently large

(7)

We prove this proposition in Appendix A.

Remark 1

We adapt Proposition 1 to the analysis of the conditional models. Expression (6) implies that the posterior standard deviation of is . For most parametric models, (6) is satisfied as long as the observed Fisher information is bounded from below by for some . In particular, we let be the complete-data MLE and . Then, (6) is satisfied on the set for any fixed .

Remark 2

In order to verify that is bounded, one only needs to know and up to a normalizing constant. This is because the bound is in terms of . This helps to handle the situation when improper priors are used and it is not feasible to obtain a normalized prior distribution. In the current context, the prior likelihood ratio is .

Remark 3

If is twice differentiable, the convergence rate in (7) can be improved to . However, is sufficient for the current analysis.

3.3 Convergence of the invariant distributions

With Proposition 1 and Remark 1, we have established that the transition kernels of the Gibbs chain and the iterative chain are close to each other on a large region (depending on the observed data). The subsequent analysis falls into several steps. First, we slightly modify the processes by conditioning them on this set, obtaining conditional chains with their own stationary distributions (details provided below). The stationary distributions of the conditional processes and those of the original processes are close in total variation. Second, we show (in Lemma 2) that, given a bound on the convergence rate, the stationary distributions of the two conditional chains are close in total variation, and hence so are those of the original chains. The bound on the convergence rate can be established by Proposition 2.

To proceed, we consider the chains conditional on the set where the two transition kernels are uniformly close to each other. In particular, for each set , we let

(8)

That is, we create another two processes, for which we update the missing data conditional on . The next lemma shows that we only need to consider the chains conditional on the set .

Lemma 1

Suppose that both and are positive Harris recurrent. We can choose as in the form of Remark 1 and sufficiently large so that

(9)

in probability as . Let be the Markov chains following , defined as in (8), with invariant distribution . Then,

(10)

as .

The proof is elementary, using the renewal-theory representation of the stationary distribution, and is therefore omitted. Based on the above lemma, we only need to show that the stationary distributions of the two conditional chains converge to each other. The distance between their transition kernels approaches 0 uniformly on the conditioning set. This implies that

as uniformly for . With the above convergence, we use the following lemma to establish the convergence between and .

Lemma 2

Let admit data-dependent transition kernels for . We use to denote sample size. Suppose that each admits a data-dependent unique invariant distribution, denoted by , and that the following two conditions hold:

  1. The convergence of the two transition kernels

    (11)

    in probability as . The function is either a geometric drift function for or a constant, i.e., .

  2. Furthermore, there exists a monotone decreasing sequence (independent of data) and a starting measure (depending on data) such that

    (12)

    as .

Then,

(13)

in probability as .

Remark 4

The above lemma holds whether the function is a constant or a drift function. For the analysis of convergence in total variation, we only need the constant case. The result for the drift-function case is included in preparation for the analysis of incompatible models.

The first condition in the above lemma has been obtained by the result of Proposition 1. Condition (12) is more difficult to establish. According to the standard results in [18] (see also (35) in the appendix), one set of sufficient conditions for (12) is that the chains and admit a common small set, ; in addition, each of them admits their own drift functions associated with the small set (c.f. Appendix C).

Gibbs chains typically admit a small set and a drift function , that is, for some positive measure

(14)

for , ; for some and for all

(15)

With the existence of a small set and a drift function, a bound on the convergence rate (with a given starting point) can be established for the Gibbs chain by standard results (see, for instance, [18]), and this bound depends only on the constants in the minorization and drift conditions. Therefore, it is necessary to require that these constants be independent of the data. In contrast, the small set and drift function themselves could be data-dependent.

Given that and are close in “”, the set is also a small set for , that is for some , all , and all measurable set . The following proposition, whose proof is given in the appendix, establishes the conditions under which is also a drift function for .

Proposition 2

Assume the following conditions hold.

  1. The transition kernel admits a small set and a drift function satisfying (15).

  2. Let () be the ratio of prior distributions for each conditional model (possibly depending on the data) so that on the set

  3. For each and , there exists a serving as the bound in (6) for each . In addition, satisfies the following moment condition

    (16)

    where is the expectation associated with the updating distribution of and is the state of the chain when the -th variable is just updated. The convergence is uniform in .

Then, there exists such that, as the sample size tends to infinity, with probability converging to one the following inequality holds:

(17)

Remark 5

The intuition of the above proposition is as follows. A function satisfying inequality (15) is a drift function for the Gibbs kernel toward the small set. Since the two kernels are close to each other, we may expect the corresponding drift inequalities to be close as well. The above proposition basically states the conditions under which this approximation indeed holds, implying that a drift function for the Gibbs chain is also a drift function for the iterative chain. Condition (16) is imposed for a technical purpose. In particular, we allow the expectation of the bound to grow to infinity, but at a slower rate than the sample size. Therefore, it is a mild condition.

We now summarize the analysis and the results of the compatible conditional models in the following theorem.

Theorem 1

Suppose that a set of conditional models is compatible with a joint model . The Gibbs chain and the iterative chain then admit transition kernels and unique stationary distributions . Suppose the following conditions are satisfied:

  A1. Let . One can choose sufficiently large so that in probability as .

  A2. The conditions in Proposition 2 hold.

Then,

in probability as .

Remark 6

One sufficient condition for A1 is that the stationary distributions of under converge to a value , where and are not necessarily the same.

Remark 7

In addition to the conditions of Proposition 1, Proposition 2 also requires that one construct a drift function toward a small set for the Gibbs chain. One can usually construct these free of the data if the proportion of missing data is bounded from above. The most difficult task usually lies in constructing a drift function. For illustration purposes, we construct a drift function (in the supplementary material) for the linear example in Section 5.

Proof of Theorem 1. We summarize the analysis of compatible models in this proof. If the conditional models are compatible with a joint model, then the conditional posterior predictive distributions of the Gibbs chain and the iterative chain are given in (3) and (4). Thanks to compatibility, the total variation distance between the posterior predictive distributions is bounded by the distance between the posterior distributions of their own parameters as in (