Targeted Maximum Likelihood Estimation using Exponential Families
Targeted maximum likelihood estimation (TMLE) is a general method for
estimating parameters in semiparametric and nonparametric
models. Each iteration of TMLE involves fitting a parametric
submodel that targets the parameter of interest.
We investigate the use of exponential families to
define the parametric submodel. This implementation of TMLE gives a general
approach for estimating any smooth parameter in the nonparametric model.
A computational advantage of this approach is that each iteration of TMLE involves estimation of a parameter
in an exponential family, which is a convex optimization problem for which software implementing reliable and computationally efficient methods
exists. We illustrate the method in three estimation problems, involving
the mean of an outcome missing at random, the parameter of a median
regression model, and the causal effect of a continuous
We conduct a simulation study comparing different choices for the parametric submodel, focusing on the first of these problems.
To the best of our knowledge, this is the first study investigating robustness of TMLE to
different specifications of the parametric submodel. We find that the
choice of submodel can have an important impact on the behavior of the
estimator in finite samples.
Key words: TMLE, Exponential family, Convex optimization.
Targeted maximum likelihood estimation [TMLE, 25, 24] is a general template for estimation in semiparametric or nonparametric models. The key to each update step of TMLE is to specify and fit a parametric submodel with certain properties, described in detail below. (A parametric submodel is defined as a parametric model that is contained in the overall model.) Throughout this paper, the overall model is a nonparametric model. Therefore, many choices are available for the form of the parametric submodel.
We investigate the performance of TMLE when the parametric submodel is chosen to be from an exponential family. A computational advantage of this choice is that because the parametrization as well as the parameter space of an exponential family are always convex , standard methods for optimization can be applied to solve this problem. Another advantage of this approach is that it can be applied to a wide variety of estimation problems. Specifically, it can be used to estimate any smooth parameter defined in the nonparametric model, under conditions described below.
We demonstrate how to implement TMLE using exponential families in the following three estimation problems:
Estimating the mean of an outcome missing at random, where covariates are observed for the entire sample.
Estimating a nonparametric extension of the parameter in a median regression model.
Estimating the causal effect of a continuous-valued exposure.
We conduct a simulation study comparing different choices for the parametric submodel, focusing on the first of these problems.
We next present an overview of the TMLE template, and illustrate the implementation of TMLE for each of the three problems above. We conclude with a discussion of practical issues and directions for future research.
2 Targeted maximum likelihood template
Let the random vector representing what is observed on an experimental unit be denoted by , with sample space . Let be an independent, identically distributed sample of observations , each drawn from the unknown, true distribution . We assume that , where is the nonparametric model, defined as the class of all distributions having a continuous density with respect to a dominating measure . Let denote the class of all densities corresponding to a distribution in , and let denote the density corresponding to . Let denote a -dimensional Euclidean parameter with known efficient influence function . That is, is a mapping from to for which is the pathwise derivative, as defined, e.g., in Bickel et al. . We refer to such a parameter as a smooth parameter. Many commonly used parameters are smooth, including all those in this paper. The efficient influence function, by definition, satisfies
Denote the true value by . The template for a targeted maximum likelihood estimator is defined by the following steps:
Construct an initial estimator of the true, unknown density ;
Construct a sequence of updated density estimates , . Given the current density estimate , the updated density is constructed by specifying a regular parametric submodel of , where is an open subset of . The submodel is required to satisfy two properties. First, it must equal the current density estimate at , i.e., . Second, the score of at must equal the efficient influence function for at , i.e.,
The parameter of the submodel is fit using maximum likelihood estimation, i.e.,
and the updated density is defined to be the density in the parametric submodel
corresponding to , i.e., .
Iterate the previous step until convergence, i.e., until . Denote the last step of the procedure by .
Define the TMLE of to be the substitution estimator .
The TMLE algorithm above can be generalized in the following ways: one can let each represent an estimate of only certain components of the density (typically those components relevant to estimation of the parameter for a specific problem); the parametric submodel may satisfy a relaxed score condition, in that the efficient influence function need only be contained in the linear span of the score at ; or another loss function may be used in place of the log-likelihood for estimating in (3). We discuss the latter generalization in more detail in Section 4.3.
The result of the above TMLE procedure is that at the final density estimate , we have
i.e., the final density estimate is a solution to the efficient influence function estimating equation. This property, combined with the estimator being a substitution estimator and the smoothness of the parameter, is fundamental to proving that TMLE has desirable properties. For example, it is asymptotically linear with influence function equal to the efficient influence function under certain assumptions, as described by .
We give a heuristic argument for (4), which is rigorously justified under regularity conditions given in Result 1 of . Assume that at the final iteration of TMLE, i.e., the iteration where is defined, we have . Then by equation (3), the penultimate density must satisfy
If is strictly convex in , then the derivative at of the right side of (5) equals , which implies
3 Implementing TMLE using an exponential family
The key step in the TMLE algorithm is step 2, which requires a choice of the parametric submodel at each iteration . Let denote the density at the current iteration , and consider construction of the parametric model in step 2. One option is to use a submodel from an exponential family, represented as
where the normalizing constant . The model is defined for all for which the integral in is finite.
A key feature of parametric models of the form (7) is that they automatically satisfy the two conditions in step 2 of the TMLE algorithm. First, by (7), we have at that . Second, the score condition (2) holds since we have, for all possible values of ,
where the second equality follows from exchanging the order of differentiation and integration (justified under smoothness conditions by Fubini’s Theorem), and the last equality follows from by (1).
An advantage of using (7) as a submodel is that maximum likelihood estimation of in an exponential family is a convex optimization problem, which is computationally tractable. In particular, we take advantage of the various R functions available to solve convex optimization problems.
In Section 4.5, we illustrate a different implementation of TMLE that uses parametric submodels given by
for a normalizing constant, and which also has a convex log-likelihood function.
In Sections 4-6, we describe three estimation problems that can be solved with TMLE. The first problem, estimating the mean of a variable missing at random, is an example where there exists a TMLE implementation that requires a single iteration and that can be solved through a logistic regression of the outcome on a so called “clever covariate.” In contrast, the examples in Sections 5-6 generally require multiple iterations, and cannot be solved using a clever covariate.
4 Example 1: The mean of a variable missing at random
4.1 Problem Definition
Assume we observe independent, identically distributed draws , each having the observed data structure , where is a vector of baseline random variables, is an indicator of the outcome being observed, and is the binary outcome. For participants with , we do not observe their outcome (since we only observe , which equals zero for such participants); however, these participants do contribute baseline variables. The only assumptions we make on the joint distribution of are that is missing at random conditioned on , i.e., , and that with probability 1.
Define the outcome regression , the propensity score , and the marginal density of the baseline variables . All of these components of the density are assumed unknown. The parameter of interest is , which by the missing at random assumption equals . This parameter only depends on the components and of the joint distribution . We denote the parameter of interest as , where denotes the expectation with respect to the marginal distribution of .
Note that in general (since the latter equals , i.e., the expectation of with respect to the distribution of given ) except in the special case called missing completely at random, where and are marginally independent. Denote the true mean of by . Identification and estimation of is a widely studied problem [e.g., 18, 11]. The estimation problem becomes particularly challenging when the dimension of is large, since nonparametric estimators using empirical means suffer from the curse of dimensionality. It is a challenging problem even when consists of a few, continuous-valued, baseline variables, as shown by .
Below, we contrast four TMLE implementations for the above estimation problem. The purpose is to compare multiple options for the parametric submodel, in this simple problem. Also, we demonstrate the general approach of using the exponential family (7) as parametric submodel, in this relatively well-studied problem, before applying it to more challenging problems in Sections 5 and 6. In Section 4.2 we present an implementation of TMLE from van der Laan and Rubin , which requires only a single iteration. A variation of this estimator that uses weighted logistic regression is presented in Section 4.3. In Section 4.4, we describe a TMLE implementation using the exponential family (7) as a submodel; we also present a fourth implementation using the submodel (14).
Under the conditions described in Theorem 1 of Appendix A, the asymptotic distribution of the TMLE estimator is not sensitive to the choice of submodel when each of the initial estimators and converges to its true value at a rate faster than . However, the choice of submodel may impact finite sample performance. Also, when one of the estimators and does not converge to the true value, the submodel choice may even affect performance asymptotically. To shed light on this, we perform a simulation study in Section 4.5.
In general, TMLE implementations require that one has derived the efficient influence function of the parameter of interest with respect to the assumed model (which is the nonparametric model throughout this paper). For the parameter in this section, the efficient influence function is given by 
We will also denote by , using the fact that the density can be decomposed into the components defined above.
4.2 First TMLE implementation for the mean of an outcome missing at random
The first TMLE implementation has been extensively discussed in the literature [e.g., 25, 24], and we only provide a brief recap. Following the template from Section 2, we first define initial estimators and of and , respectively. These could obtained, e.g., by fitting logistic regression models or by machine learning methods. We set the initial estimator of to be the empirical distribution of the baseline variables , i.e., the distribution placing mass on each observation of . The TMLE presented next is equivalent to the estimator presented in page 1141 of  when is a logistic regression model fit. A detailed discussion of the similarities between the TMLE template and the estimators that stem from  is presented in Appendix 2 of . We now show the construction of a parametric submodel for this problem.
Construction of the parametric submodel
In this implementation of TMLE, the components , , and are each updated separately, such that they solve the corresponding part of the efficient influence function estimating equation. Consider the -th step of the TMLE algorithm described in Section 2. For the conditional expectation of given among individuals with , and an estimator , we define the logistic model
where . For the marginal distribution of we define the exponential model
where . The variable has often been referred to as the “clever covariate”. The initial estimator of is not modified.
It is straightforward to show that the efficient influence function is a linear combination of the scores of this joint parametric model for the distribution of .
We now describe the TMLE implementation based on the submodel construction above. For this case, this procedure involves only one iteration. In the first iteration we have
This is because the MLE of in the nonparametric model is precisely the empirical . An estimate of the parameter in the model
may obtained by running a logistic regression of among individuals with on without intercept and including an offset variable . We now compute the updated estimate of as
The score equation corresponding to this logistic regression model is
Note that this matches the first component of the efficient influence function (9). Proceeding to the second iteration, we estimate the parameter in the model
by running a logistic regression of on with offset and without intercept among participants with . This is equivalent to solving the score equation
in . By convexity and (11), the solution to the score equation in the above display is equal to zero, and so the algorithm is terminates in the second iteration. The TMLE is thus defined as . Confidence intervals may be constructed using the non-parametric bootstrap. For a more detailed discussion of the asymptotic properties of this estimator, as well as simulations, see .
4.3 Second TMLE implementation for the mean of an outcome missing at random
Consider the following -th iteration parametric submodel for the expectation of conditional on among participants with
Let be the first-step estimator of the intercept term in a weighted logistic regression with weights and offset variable . Let the updated estimator of be defined as
By a similar argument as in the previous section, the estimate of in the following iteration is equal to zero, and the TMLE of , defined as , converges in one step. Note that this implementation of the TMLE also satisfies the score equation (11).
In addition, note that in this section does not correspond to the MLE of a parametric submodel. As a consequence, is not a targeted maximum likelihood estimator as defined in Section 2. Instead, it is part of a broader class of estimators referred to as targeted minimum loss-based estimators, also abbreviated as TMLE . These estimators generalize the TMLE framework of Section 2 by allowing the use of general loss functions in estimation of the parameter in the parametric submodel. In the example of this section the loss function used is the weighted least squares loss function.
This TMLE implementation is analogous to the estimator of Marshall Joffe discussed in  when is a parametric model. The Joffe estimator is presented in  as a doubly robust alternative to the augmented IPW estimators when the weights are highly variable, i.e., when there are empirical violations to the positivity assumption; we simulate such scenarios in Section 4.5. To the best of our knowledge, the above TMLE implementation was first discussed by Stitelman et al.  in the context of longitudinal studies.
4.4 Third and fourth TMLE implementations for the mean of an outcome missing at random
We next give implementations of TMLE based on the following two types of submodel for :
Here are the corresponding normalizing constants, and is the efficient influence function given in (9). Model (13) is the general exponential family introduced in (7), while (14) is an alternative submodel.
The third and fourth TMLE implementations for estimating are defined by the following iterative procedure:
Construct initial estimators , , and for , , and , respectively. We use the same initial estimators as in Section 4.2.
Construct a sequence of updated density estimates , , where at each iteration we construct as follows: estimate as
where is given by (13) or (14), for the third or fourth implementation, respectively. Computation of requires evaluation of , which in turn requires , , and . Define , and define , , to be the corresponding components of .
The previous step is iterated until convergence, i.e., until . Denote the last step of the procedure by .
The TMLE of is defined as the substitution estimator , for and the corresponding components of .
If the initial estimator is the empirical distribution, then is a density (with respect to counting measure) with positive mass only at the observed values . This is an important computational characteristic when computing the normalizing constant , since integrals over become weighted sums over the sample. The optimization in each iteration of step 2 is carried out using the BFGS [4, 7, 8, 21] algorithm as implemented in the function optim(). The optimization problem is convex in , so that under regularity conditions we expect the algorithm to converge to the global optimum.
Motivation for TMLE implementation with submodel (14).
Submodel (13) is not necessarily well define for an unbounded efficient influence function . However, submodel (14) is always bounded and can be used with any . An example of an unbounded efficient influence function is given by (9) under empirical violations of the assumption . This problem has been extensively discussed, particularly in the context of continuous outcomes [e.g., 1, 17, 9]. A TMLE with submodel (14) as presented in this section may provide an alternative solution to those presented in the literature.
4.5 Evaluating sensitivity of the TMLE to the choice of parametric submodel
We perform a simulation study to explore the sensitivity of the TMLE to the above four different choices of parametric submodels from Sections 4.2- 4.4. We generate data satisfying the missing at random assumption defined in Section 4.1.
Data generating mechanism for simulations
The observed data on each participant is the vector , where . The following defines the joint distribution of the variables :
where denotes the Bernoulli distribution with probability of and probability of . We consider the following three missing outcome distributions, which are referred to as missingness mechanisms, and are depicted in Figure 1:
We refer to these missingness mechanisms as D1, D2, and D3 respectively. A practical violation of the positivity assumption is said to occur if for some values of , we have . Practical positivity violations are moderate under D2 and severe under D3. We consider these three missingness mechanisms to assess the performance of each TMLE implementation under different practical violations to the positivity assumption, a scenario of high interest since many doubly robust estimators can perform poorly . A large fraction of small probabilities as in D3 may be unlikely in a missing data application. However, it is very common in survey sample estimation, a field in which inverse probability weighted estimators are the rule. For reference, the minimum missingness probability in D3 is , and the median is . This is consistent with survey weights found in the literature [e.g., 19].
Various studies have investigated the performance of different estimators under violations to the positivity assumption [e.g., 11, 16]. We focus on TMLEs, assessing the impact of the choice of submodel. Each missingness mechanism, combined with the joint distribution of defined above, determines the joint distribution of the observed data . We simulated 10000 samples of sizes 200, 500, 1000, and 10000, respectively. This was done for each missingness mechanism.
We implemented the four types of TMLE described in this paper, using four different sets of working models for and : (i) correctly specified models for both, (ii) correct model for and incorrect model for , (iii) incorrect model for and correct model for , (iv) incorrect models for both and . Misspecification of the working model for consisted of using a logistic regression of on among individuals with ; misspecification of the working model for consisted of running logistic regressions of on . The TMLE iteration was stopped whenever .
Table 1 shows the relative efficiency (using as reference the analytically computed efficiency bound) of the four estimators for different sample sizes under each working model specification. The efficiency bounds for distributions D1, D2, and D3 are 0.34, 1.05, and 55.23, respectively.
The estimators with model specification (i) would be expected to have asymptotic relative efficiency equal to 1, which they approximately do at sample size 10000. The MSE of all estimators under severe positivity violations (missingness mechanism D3) and model specifications (i), (ii), and (iii) is smaller than the efficiency bound for sample sizes 200 and 500. This fact does not contradict theory; it is expected since the TMLE is a substitution estimator and therefore has variance bounded between zero and one, even when the efficiency bound divided by the sample size falls outside of this interval. A similar observation is true for model specification (ii) for all sample sizes. In this case, misspecification of the missingness model causes a substantial reduction in variability of the inverse probability weights, which results in smaller finite sample variance. However, we note that the relative efficiency gets closer to its theoretical value of one as the sample size increases. In the extreme case of D3 a sample size of 10000 was not large enough to observe the properties predicted by asymptotic theory.
It was not expected that under D1, the TMLE estimators are approximately semiparametric efficient for models (ii) and (iii). This may be a particularity of this data generating mechanism, perhaps due to the low dimension of the problem and the smoothness of this data generating mechanism. As expected, the bias of the estimators times square root of does not converge for model specification (iv), since asymptotic theory dictates that in general at least one of the working models must be correctly specified to imply consistency of the TMLE.
Table 2 shows the percent bias associated with each estimator. The second TMLE implementation 2 performs better than its competitors in finite samples under severe violations to the positivity assumption (D3), particularly when the missingness model is correctly specified (model specifications (i) and (iii)). However, under moderate positivity violation (D2) and misspecification of the outcome mechanism (model specification (iii)) in large samples (), TMLE implementation 2 performed worse than all of its competitors (MSE 1.63 vs 1.06). Implementations 3 and 4 did not perform particularly better than the best of implementations 1 and 2 for any scenario under consideration.
Another important question to ask when deciding on a parametric submodel is the computational efficiency of the estimators. Our simulations are in accordance to what we have observed in practice for this and other parameters, in that TMLE typically requires 6 or fewer iterations. In the above simulations, the time required to compute the TMLE for a single data set was typically less than a second.
5 Example 2: Median regression
Consider the median regression model:
where is a known, smooth function in , and where the conditional median of given is a.s. This last condition is equivalent to having with probability 1 that
We let denote a dominating measure for the distributions we consider, and denote by the density of . We say the above median regression model is correctly specified if at the true data generating distribution , we have for some that the conditional median under of given is , with probability 1. Throughout, we do not assume the median regression model is correctly specified.
Define the following nonparametric extension of (which maps each density to a value in ):
We assume there is a unique minimizer in of . Under this assumption, if the median regression model (18,19) is correctly specified, then this unique minimizer equals . However, even when (18,19) is misspecified, the parameter in (20) is well defined as long as there is a unique minimizer of . To simplify the notation, we denote by and by . The goal is to estimate based on i.i.d. draws from an unknown data generating distribution .
The nonparametric estimator of is the minimizer in of . Koenker and Park  proposed a solution to this optimization problem based on linear programming. Their methods are implemented in the quantreg  R package. We develop a TMLE for , in order to demonstrate it is a general methodology that can be applied to a variety of estimation problems, and to compare its performance versus the estimator of Koenker and Park  that is explicitly tailored to the problem in this section.
The efficient influence function for the parameter (20) in the nonparametric model, at distribution , is (up to a normalizing constant which we suppress in what follows):
In particular, we have
for all sufficiently smooth .
Construction of parametric submodel
Given the estimate of at iteration of the TMLE algorithm, we construct a regular, parametric model satisfying: (i) and (ii) for each . We again use an exponential submodel as in (7), which in this case is
where the normalization constant . This parametric model is well-defined, regular, equals at , and has score:
under smoothness and integrability conditions that allow the interchange of the order of differentiation and integration. It follows from (22) and (24) that the score equals at , and therefore satisfies the conditions (i) and (ii) described above.
Implementation of Targeted Maximum Likelihood Estimator
We present an implementation of the TMLE applying the parametric submodel (23). First, we construct an initial density estimator . We let be the empirical distribution of . We fit a linear regression model for given with main terms only, and let be a normal distribution with conditional mean as given in the linear regression fit, and with conditional variance 1. We then define A more flexible method can be used to construct the initial fit for the density of given . Here we use this simple model to examine how well the TMLE can recover from a poor choice for the initial density estimate.
Initializing , the iterative procedure defining the TMLE involves the following computations, at each iteration (where represents setting to take value ):
The value of in (25) is approximated by grid search over , where for each value of considered, the expectation on the right hand side of (25) is approximated by Monte Carlo integration (based on generating 10000 independent realizations from the density and taking the empirical mean over these realizations). The value of in (28) is approximated by applying the Newton-Raphson algorithm to the summation on the right side of (28), where we use the analytically derived gradient and Hessian in the Newton-Raphson algorithm. The above process is iterated over until convergence (we used as stopping rule).
The density at the final iteration is denoted by , and the TMLE of is defined as . If the initial estimator of is consistent, it is possible to use standard arguments for the analysis of targeted maximum likelihood estimators  to show that this estimator is asymptotically linear with influence function equal to the efficient influence function .
We draw 10000 samples, each of size 1000, of a two dimensional covariate by drawing and , with independent. We consider two different outcome distributions for given ; the outcomes under each distribution are denoted by , respectively. The first outcome involves drawing , and setting
where is the inverse of the function . This represents a case in which the error distribution is skewed. The constant was selected since it is the median of , which implies the median given equals . The second outcome distribution involves drawing and setting
We are interested in estimating the parameters
where . The true value of is , whereas the true value of is approximately (obtained using Monte Carlo simulation). The median regression model is a correctly specified model under the first outcome distribution, but is incorrectly specified for the second outcome distribution. However, the minimizers and of the right sides of (32) and (33), respectively, are both well defined.
For each of the 10000 samples we computed the estimator described above. The marginal distribution of was estimated by the empirical distribution in the given sample. The conditional distribution of given was misspecified by running a linear regression of on with main terms only, and assuming that is normally distributed with conditional variance equal to one. This was done in order to assess how the TMLE can recover from a poor fit of the initial densities resulting from a distribution that is commonly used in statistical practice. We then computed the MSE across the 10000 estimates as , where represents the Euclidean norm. The results are presented in Table 3. For comparison, we computed the same results for two other estimators: the quantile regression function nlrq() implemented in the R package quantreg, and a substitution estimator (SE) that is the result of optimizing (25) in the first iteration with set to .
The TMLE and the estimator of Koenker and Park  perform similarly for , i.e., when the median regression model is correct. The TMLE and SE perform better for estimating , i.e., when the median regression model is incorrectly specified. This is not surprising since the estimator of Koenker and Park  is designed for the case where the median regression model is correctly specified.
6 Example 3: The causal effect of a continuous exposure
We explore the use of an exponential family as a parametric submodel only for certain components of the likelihood. Consider a continuous exposure , a binary outcome , and a set of covariates . For a user-given value we are interested in estimating the expectation of under an intervention that causes a shift of units in the distribution of conditional on . Formally, consider an i.i.d. sample of draws of the random variable . Denote , the marginal density of and the conditional density of given . We assume that these data were generated by a nonparametric structural equation model [NPSEM, 15]:
where , , and are unknown but fixed functions, and , , and are exogenous random variables satisfying the randomization assumption . We are interested in the causal effect on of a shift of units in . Consider the following intervened NPSEM
This intervened NPSEM represents the random variables that would have been observed in a hypothetical world in which every participant received additional units of exposure . Díaz and van der Laan  proved that
For each density , define the parameter
where , , and are the outcome conditional expectation, exposure mechanism, and covariate marginal density corresponding to , respectively. We also use the notation to refer to . We are interested in estimating the true value of the parameter .
The efficient influence function of at is given by  as:
Construction of the parametric submodel
Consider initial estimators and , which can be obtained, for example, through machine learning methods. We estimate the marginal density of with its empirical counterpart denoted , and construct a sequence of parametric submodels for by specifying each component as: