Sensitivity Analysis for Inverse Probability Weighting Estimators via the Percentile Bootstrap
Abstract.
To identify the estimand in missing data problems and observational studies, it is common to base the statistical estimation on the “missing at random” and “no unmeasured confounder” assumptions. However, these assumptions are unverifiable using empirical data and pose serious threats to the validity of the qualitative conclusions of the statistical inference. A sensitivity analysis asks how the conclusions may change if the unverifiable assumptions are violated to a certain degree. In this paper we consider a marginal sensitivity model which is a natural extension of Rosenbaum’s sensitivity model for matched observational studies. We aim to construct confidence intervals based on inverse probability weighting estimators, such that the intervals have asymptotically nominal coverage of the estimand whenever the data generating distribution is in the collection of marginal sensitivity models. We use a percentile bootstrap and a generalized minimax/maximin inequality to transform this intractable problem to a linear fractional programming problem, which can be solved very efficiently. We illustrate our method using a real dataset to estimate the causal effect of fish consumption on blood mercury level.
1. Introduction
A common task in statistics is to estimate the average treatment effect in observational studies. In such problems, the estimand is not identifiable from the observed data without further assumptions. The most common identification assumption, namely the “no unmeasured confounder” (NUC) assumption, asserts that the confounding mechanism is completely determined by some observed covariates. Based on this assumption, many statistical procedures have been proposed and thoroughly studied in the past decades, including propensity score matching (Rosenbaum and Rubin, 1983), inverse probability weighting (Horvitz and Thompson, 1952), and doubly robust and machine learning estimators (Robins et al., 1994, Van Der Laan and Rubin, 2006, Chernozhukov et al., 2017).
However, the underlying NUC assumption is not verifiable using empirical data, posing serious threats to the usefulness of the subsequent statistical inferences that crucially rely on this assumption. A prominent example is the antioxidant vitamin beta carotene. Willett (1990), after reviewing observational epidemiological data, concluded that “Available data thus strongly support the hypothesis that dietary carotenoids reduce the risk of lung cancer”. Quite unexpectedly, four years later a largescale randomized controlled trial (The AlphaTocopherol Beta Carotene Cancer Prevention Study Group, 1994) reached the opposite conclusion and found “a higher incidence of lung cancer among the men who received beta carotene than among those who did not (change in incidence, 18 percent; 95 percent confidence interval, 3 to 36 percent)”. The most probable reason for the disagreement between the observational studies and the randomized trial is insufficient control of confounding.
Criticism of confounding bias in observational studies dates at least back to Fisher (1958) who suggested that the association between smoking and lung cancer may be due to genetic predisposition. In response to this, Cornfield et al. (1959) conducted the first formal sensitivity analysis in an observational study. They concluded that, in order to explain the apparent odds ratio of getting lung cancer when there is no real causal effect, those with the genetic predisposition (or any hypothetical unmeasured confounder) must be 9 times more prevalent in smokers than in nonsmokers. Because a genetic predisposition was seen as unlikely to have such a strong effect, this strengthened the evidence that smoking had a causal harmful effect.
Cornfield et al. (1959)’s initial sensitivity analysis only applies to binary outcomes and ignores sampling variability. These limitations were later removed in a series of pioneering work by Rosenbaum and his coauthors (Rosenbaum, 1987, Gastwirth et al., 1998, Rosenbaum, 2002b, c). Rosenbaum’s sensitivity model considers all possible violations of the NUC assumption as long as the violation is less than some degree. Using a matched cohort design, Rosenbaum attempts to quantify the strength of the unmeasured confounders needed to not reject the sharp null hypothesis of no treatment effect. However, Rosenbaum’s framework is only limited to matched observational studies and usually requires effect homogeneity to construct confidence intervals for the average treatment effect.
Besides matching, another widely used method in causal inference is inverse probability weighting (IPW), where the observations are weighted by the inverse of the probability of being treated (Horvitz and Thompson, 1952). IPW estimators have good efficiency properties (Hirano et al., 2003) and can be augmented with outcome regression to become “doubly robust” (Robins et al., 1994). There is much recent work aiming to improve the efficiency of IPWtype estimators by using tools developed in machine learning and high dimensional statistics (Van der Laan and Rose, 2011, Belloni et al., 2014, Athey et al., 2016, Chernozhukov et al., 2017). However, they all heavily rely on the NUC assumption. Robustness of the IPWtype estimators to unmeasured confounding bias are usually studied using patternmixture models (Birmingham et al., 2003) or selection models (Scharfstein et al., 1999), in which the unmeasured confounding is usually modeled parametrically. See Richardson et al. (2014) for a recent overview and Section 7.1 for more references and discussion.
In this paper we propose a new framework for sensitivity analysis of missing data and observational studies problems, which can be applied to “smooth” estimators such as the inverse probability weighting (IPW) estimator and the “doubly robust” augmented IPW estimators. We consider a marginal sensitivity model introduced in Tan (2006, Section 4.2), which is a natural modification of Rosenbaum’s model. Compared to existing sensitivity analyses of IPW estimators, our sensitivity model is nonparametric in the sense that we do not require the unmeasured confounder to affect the treatment and the outcome in a parametric way. This is appealing because we can never observe the unmeasured confounder and thus cannot test any parametric assumption.
Our marginal sensitivity model measures the degree of violation of the NUC assumption by the odds ratio between the conditional probability of being treated given the measured confounders and conditional probability of being treated given the measured confounders and the outcome/potential outcome variable (see Section 3 for the precise definition). Given a userspecified magnitude for this oddsratio, the goal is to obtain a confidence interval of the estimand (the mean response vector in missing data problems and the average treatment effect in observational studies), with asymptotically coverage probability, for all data generating distributions which violate the MAR assumption within this threshold (see Definition 3).
The obvious approach to compute such a confidence interval involves using the asymptotic normal distribution of the IPW estimators. However, the asymptotic sandwich variance estimators of the IPW estimates are quite complicated. Finding extrema of the point estimates and variance estimates over a collection of sensitivity models is generally computationally intractable. We circumvent this problem by using the percentile bootstrap, reducing the problem to solving many linear fractional programs. The highlights of our strategy and the summary of the results are given below:

To obtain a confidence interval under the marginal sensitivity model, we simply maximize/minimize the IPW point estimates over the marginal sensitivity model over bootstrap resamples of the data. Then we report the sample quantile of the minima and sample quantile of the maxima, as the lower and upper endpoints of the confidence interval, respectively (Section 4.3).

The asymptotic coverage of this interval follows from the limiting normal distribution of the IPW estimators and a generalized minimax/maximin inequality, which justifies the interchange of quantile and infimum/supremum (Theorem 4 and Corollary 6). In fact, our interval covers the entire partially identified region of the estimand with probability tending to .

Maximizing/minimizing the IPW point estimate over the marginal sensitivity model is very efficient computationally. This is a linear fractional programming problem (ratio of two linear functions) which can be reduced to solving a linear program using the wellknown CharnesCooper transformation (Section 4.4). In fact, by local perturbations, it can be shown that the solution of the resulting linear program has the same/opposite order as the outcome vectors, using which the confidence interval can be computed in time linear in sample size, for every bootstrap resample (Proposition 5).

The percentile bootstrap approach and the reduction to linear programming are very general and can be extended to sensitivity analysis of other smooth estimators, such as IPW estimates of the mean of the nonrespondents and the average treatment effect on the treated (Section 6.1), the augmented inverse probability weighting estimator (Section 6.2), and inference for partially identified parameters (Section 7.3).
The rest of this paper is organized as follows. We introduce the necessary notations in Section 2 and the marginal sensitivity model in Section 3. Our method for constructing confidence under the marginal sensitivity model, for the IPW estimator for the mean response with missing data, is described in Section 4. The extension to estimating the average treatment effect in observational studies is discussed in Section 5. Other extensions, including the use of augmented IPW estimators, are considered in Section 6. We review some related sensitivity analysis methods and compare our framework with Rosenbaum’s sensitivity analysis in Section 7. In Section 8 we illustrate our method using a real data example and compare it with Rosenbaum’s sensitivity analysis.
2. Background
In this paper, we consider sensitivity analysis in two closely related problems: estimation of the mean response with missing data and estimation of the average treatment effect in observational studies:

In the missing data problem, we assume , are i.i.d. from a joint distribution , where for each subject , is the indicator of nonmissing response, is a vector of covariates, and is the response (observed only if ). In other words, we only observe for . Based on the observed data, our goal is to estimate the mean response in the complete data, where and indicates that the expectation is taken over the true data generating distribution .

In observational studies, we observe i.i.d. , where for each subject , is a binary treatment indicator (which is if treated and in control), is a vector of measured confounders, and
is the outcome. Furthermore we assume are i.i.d. from a joint distribution . Here, we are using Rubin (1974)’s potential outcome notation and have assumed the stable unit treatment value assumption (Rubin, 1980). The goal is to estimate the average treatment effect (ATE), . As before, means the expectation is taken over the data generating distribution . Notice that and are never observed together, which is often referred to as the “fundamental problem of causal inference” (Holland, 1986). With the potential outcome notation, the observational studies problem is essentially two missing data problems: use to estimate , and use to estimate .
Since some responses or potential outcomes are not observed, the estimands and defined above are not identifiable without further assumptions. For the missing data problem, Rubin (1976) used the term “missing at random” (MAR) to describe data that are missing for reasons related to completely observed variables in the data set:
Assumption 1 (Missing at random (MAR)).
under .
The corresponding assumption for observational studies is that the potential outcomes are independent of the treatment given the covariates (Rosenbaum and Rubin, 1983):
Assumption 2 (Strong ignorability or no unmeasured confounder (NUC)).
under .
With the additional assumption that no subject is missing or receives treatment/control with probability , the parameters and are identifiable from the data.
Assumption 3 (Overlap).
For ,

In the missing data problem, .

In observational studies, .
Since the seminal works of Rubin (1976) and Rosenbaum and Rubin (1983), many statistical methods have been developed for the missing data and observational studies problem based on first estimating the conditional probability (often called the propensity score in observational studies). One distinguished example is the inverse probability weighting (IPW) estimator which dates back to Horvitz and Thompson (1952),
where is a sampleestimate of . Observe that under Assumptions 3 and 1,
(1) 
which implies that can consistently estimate the mean response if converges to . Notice that the first equality in (1) is due to the tower property of conditional expectation and is always true if the denominator is positive with probability . The second equality in (1) uses by Assumptions 3 and 1. Similarly, can consistently estimate under Assumptions 3 and 2.
3. Sensitivity models
A critical issue of the above approach is that the MAR and NUC assumptions are not verifiable using the data because some responses/potential outcomes are not observed. Therefore, if the MAR or NUC assumption is violated, the IPW estimator or is biased and the confidence interval based on it (and any other existing estimator that assumes MAR or NUC) does not cover or at the nominal rate. In practice, statisticians often hope the data analyst can use her subject knowledge to justify the substantive reasonableness of MAR/NUC assumptions (Little and Rubin, 2014). However, speaking realistically, the MAR and NUC assumptions are perhaps never strictly satisfied, so the statistical inference is always subject to criticism of insufficient control of confounding.
To rebut such criticism and make the statistical analysis more credible, a natural question is how sensitive the results are to the violation of the MAR/NUC assumptions. This is clearly an important question, and, in fact, sensitivity analysis is often suggested or required for empirical publications.^{1}^{1}1For example, sensitivity analysis is required in the PatientCentered Outcome Research Institute (PCORI) methodology standards for handling missing data, see standard MD4 in https://www.pcori.org/researchresults/aboutourresearch/researchmethodology/pcorimethodologystandards. See also Little et al. (2012). Many sensitivity analysis methods have been subsequently developed in the missing data and observational studies problems. In this paper we will consider a sensitivity model that is closely related to Rosenbaum’s sensitivity model (Rosenbaum, 1987, 2002c). We refer the reader to Robins (1999), Scharfstein et al. (1999), Imbens (2003), Altonji et al. (2005), Hudgens and Halloran (2006), Vansteelandt et al. (2006), McCandless et al. (2007), VanderWeele and Arah (2011), Richardson et al. (2014), Ding and VanderWeele (2016) for alternative sensitivity analysis methods and Section 7 for a more detailed discussion.
We begin with a description of the marginal sensitivity model used in this paper, which is also considered by Tan (2006). To this end, with a slight abuse of notation, let . Notice that, when showing the IPW estimator is consistent in (1), the MAR assumption is useful because it implies that
(2) 
Similarly, the NUC assumption in the observational studies problem implies that , where , for .
When the MAR assumption is violated, equation (2) is no longer valid. The following wellknown proposition (proof included in Appendix A.1 for completeness) shows that is generally not identifiable from the data without the MAR assumption. For this reason we shall refer to a userspecified function as a sensitivity model.
Proposition 1.
In the missing data problem and assuming Assumption 3 holds, is not identifiable from the data.
In this paper we consider the following collection of sensitivity models in which the degree of violation of the MAR assumption is quantified by the odd ratio of the “complete data” selection probability and the “observed data” selection probability .
Definition 1 (Marginal Sensitivity Model).
Fix a parameter .

For the missing data problem, we assume , where
(3) 
For the observational studies problem, we assume for .
The relationship between this sensitivity model and Rosenbaum’s sensitivity model is examined in Section 7.2.
Remark 1.
To understand creftypecap 1, it can be conceptually easier to imagine that there is an unobserved variable that “summarizes” all unmeasured confounding. creftypecap 4 means that the odds ratio between and is always bounded by and , but the relation between and the potential outcomes are not constrained in any way. This leads to an alternative “added variable” representation of sensitivity model (Rosenbaum, 2002c). Robins (2002) pointed out that it suffices to consider the conditional probabilities when is either one of the potential outcomes. In the remainder of the paper we will follow Robin’s suggestion to simplify the notation.
Remark 2.
It is often convenient to use the logistic representation of the marginal sensitivity model (3). For simplicity, let us consider the missing data problem. Denote
(4) 
and be the logitscale difference of the observed data selection probability and the complete data selection probability. If we write further introduce the notation , then
(5) 
that is the marginal sensitivity model puts bound on the norm of .
4. Confidence Interval for the Mean Response
The primary goal of this paper is to construct confidence intervals for and under a specified collection of sensitivity models. For simplicity, we focus on the missing data problem in this section. The simple extension to observational studies is described in Section 5.
4.1. Confidence interval in sensitivity analysis
We begin by briefly extending the definition of sensitivity models. In creftypecap 1, a postulated sensitivity model is compared with the dataidentifiable . This model is most appropriate when is estimated nonparametrically using the data, but this is only possible for lowdimensional problems due to the curse of dimensionality. More often, for example, in propensity score matching or IPW, is estimated by a parametric model. In this case it is more appropriate to compare the postulated sensitivity model with the best parametric approximation to :
(6) 
where stands for the KullbackLeibler divergence and is a family of parametric models for the missingness probability. We recommend to use the logistic regression model
(7) 
which works seamlessly with our sensitivity model because the degree of violation of MAR is quantified by odds ratios. However our framework can be easily applied to other parametric models (e.g. different links).
Definition 2 (Parametric Marginal Sensitivity Model).
Next we define the estimand (mean response) under a specific sensitity model :
(9) 
where . By the definition of and the first equality in (1), it is easy to see that and are equal to the true mean response . The set will be referred to as the partially identifiable region under ; see Section 7.3 for more discussion.
Remark 3.
In defining and , we slightly abused the notation and did not indicate that they also depend on the “baseline” choice of parametric or nonparametric model of . This is because the choice of the sensitivity model is a natural consequence of the choice of the working model for . If is modeled nonparametrically (or parametrically), then we should use the collection of nonparametric (or parametric) sensitivity models (or ). For notational simplicity, hereafter we will often refer to instead of as the sensitivity model, since the former does not depends on the choice of the working model for . Consequently we will also call a collection of sensitivity models without specifying which parametric/nonparametric model was used for .
Remark 4.
Compared to creftypecap 4, the only difference in (8) is that the postulated sensitivity model is compared to the parametric model instead of the nonparametric probability . In other words, the parametric sensitivity model considers both

Model misspecification, that is, ; and

Missing not at random, that is, .
This is arguably a desirable feature, because the term “sensitivity analysis” is also widely used as the analysis of an empirical study’s robustness to parametric modeling assumptions.
With these notations, we can now define what is meant by a confidence interval under a collection of sensitivity models:
Definition 3.
A datadependent interval is called a confidence interval for the mean response under the collection of sensitivity models (may corresponds to or , see Remark 3), if
(10) 
is true for any data generating distribution such that or , depending on whether or is used. The interval is said to be an asymptotic confidence interval, if .
4.2. The IPW Point Estimates
Intuitively, a confidence interval as in Definition 3 must at least include a point estimate of for every . To this end, let be a postulated sensitivity model. The corresponding missingness probability can then be estimated by
(11) 
where
(12) 
We consider two point estimates of or , namely the IPW estimate and a stabilized version of it:

It is well known that, even under the MAR assumption, the IPW estimator can be unstable when the missingness probability (or the parametric approximation ) is close to for some (Kang and Schafer, 2007). To alleviate this issue, the stabilized IPW (SIPW) estimator, obtained by normalizing the weights, is often used in practice:
(13) It is easy to see that estimates .
Compared to IPW, the SIPW estimator is sample bounded (Robins et al., 2007), that is,
This property is even more desirable in sensitivity analysis because is almost always not the true missingness probability, so the total unnormalized weights can be very different from . For this reason, we will use the SIPW estimator in the remainder of this paper.
Heuristically, the confidence interval should at least contain the range of SIPW point estimates, . We defer the numerical computation of the extrema of SIPW point estimates till Section 4.4. For now we will focus on constructing the confidence interval assuming the range of point estimates is given.
4.3. Constructing the Confidence Interval
To construct a confidence interval for , we need to consider the sampling variability of the SIPW estimator described above. In the MAR setting, the most common way to estimate the variance is the asymptotic sandwich formula or the bootstrap (Efron and Tibshirani, 1994, Austin, 2016). In sensitivity analysis, we also need to consider all possible violations of the MAR assumption in . In this case, optimizing the estimated asymptotic variance over is generally computationally intractable, as explained below.
4.3.1. The Union Method
We begin by showing how individual confidence intervals of for can be combined into a confidence interval in sensitivity analysis.
Proposition 2.
Suppose there exists datadependent intervals such that
holds for every .

Let , . Then is an asymptotic confidence interval of under the collection of sensitivity models .

Moreover, if there exists (not depending on ) such that
(14) for all , then the union interval covers the partially identified region with probability at least ,
creftypecap 2 suggests the following way to construct a confidence interval for in sensitivity analysis using the asymptotic distribution of . Using the general theory of estimation, it is not difficult to establish that
(see, for example, Lunceford and Davidian (2004) or creftypecap 9 in the Appendix). Then using the sandwich variance estimator , an asymptotically confidence interval of is
where is the upper quantile. Then, by Proposition 2, an asymptotically confidence interval under the collection of sensitivity models is ,
However, the standard error is a very complicated function of (see creftypecap 9) and numerical optimization over is practically infeasible.
4.3.2. The Percentile Bootstrap
The centerpiece of our proposal is to use the percentile bootstrap to construct . Next we introduce the necessary notations to describe the bootstrap procedure. Let be the empirical measure of the sample , where , and be i.i.d. resamples from the empirical measure. Let the SIPW estimate (13) computing using the bootstrap resamples . For , the percentile bootstrap confidence interval of is given by
(15) 
where is the percentile of in the bootstrap distribution, that is,
where is the bootstrap resampling distribution.
We begin by showing that the percentile bootstrap interval is an asymptotically valid confidence interval of for the parametric sensitivity model . Notice that bootstrap is generally not valid if the missingness probability is modeled nonparametrically (Abadie and Imbens, 2008).
Theorem 3 (Validity of the Percentile Bootstrap).
The proof of this result, which invokes the general theory of bootstrap for estimators (Wellner and Zhan, 1996, Kosorok, 2006, Chapter 10), is explained in Appendix B.
Our percentile bootstrap confidence interval under the collection of sensitivity models is given by where
(16) 
The important thing to observe here is that the infimum/supremum is inside the quantile function in (16), which makes the computation especially efficient using linear programming (see Section 4.4). The interchange of quantile and infimum/supremum is justified in the following generalized (von Neumann’s) minimax/maximin inequalities (see Cohen (2013) for a similar result for finite sets).
Lemma 1.
Let be as defined in (17). Then
Asymptotic validity of the confidence interval in Equation 16 then immediately follows from the validity of the union method (creftypecap 2), the validity of the percentile bootstrap (creftypecap 3), and creftypecap 1.
Theorem 4.
Under the same assumptions as in Theorem 3, is an asymptotic confidence interval of the mean response , under the collection of sensitivity models . Furthermore, covers the partially identified region with probability at least .
4.4. Range of SIPW Point Estimates: Linear Fractional Programming
Theorem 4 transformed the sensitivity analysis problem to computing the extrema of the SIPW point estimates, and . In practice, we only repeat this over random resamples and compute the interval by
(17) 
For notational simplicity, below we consider how to compute the extrema and using the full observed data instead of the resampled data. Recalling (12) and (13), this is equivalent to solving
(18) 
where the optimization variables are for . All the other variables are observed or can be estimated from the data. Notice that needs to be reestimated in every bootstrap resample. Without loss of generality, assume that the first responses are observed, that is and , and suppose that the observed responses are in decreasing order, . Then (18) simplifies to
(19) 
This optimization problem is the ratio of two linear functions of the decision variables , hence called a linear fractional programming. It can be transformed to linear programming by the CharnesCooper transformation (Charnes and Cooper, 1962). Denote
This translates (19) into the following linear programming:
(20) 
Therefore, the range of the SIPW point estimate can be computed efficiently by solving the above linear program. Furthermore, the following result shows that the solution of (4.4) must have the same or opposite order as the outcomes , which enables even faster computation of (4.4).
Proposition 5.
creftypecap 5 implies that we only need to compute the objective of (19) for at most choices of by enumerating the index where it changes from to :
(21) 
It is easy to see that we only need time to compute and , for all the choices in (21). Hence, the computational complexity to solve the linear fractional programming (19) is . This is the best possible rate since it takes time to just compute the objective once.
The above discussion suggests that the interval , which entails finding and for every , can be computed extremely efficiently. The computational complexity is ignoring the time spent to fit the logistic propensity score models, where the extra is needed for sorting the data. Indeed, even under the MAR assumption, we need time to compute the Bootstrap confidence interval of , which is recommended in practice by Austin (2016). In conclusion, our proposal requires almost no extra cost to conduct a sensitivity analysis for the IPW estimator than to obtain its bootstrap confidence interval under MAR.
5. Confidence Interval for the ATE in the Sensitivity Model
As discussed in Section 2, an observational study is essentially two missing data problems, so the framework developed above can be easily extended. Recall that . Suppose we use a parametric model to model the propensity score using the observed covariates. The parametric sensitivity model assumes
(22) 
Let , and , for . Then (22) is equivalent to assuming for .
As in (9) define, for ,
(23) 
where . Denote . It is straightforward to show that . For this reason, as before, let
(24) 
be the collection of parametric sensitivity models in the observational studies problem. These can be estimated as in (11) by
(25) 
where is the estimated propensity score. Using this we can define the SIPW estimate of the average treatment effect as follows:
(26) 
where
and defined analogously.
Now, as in Section 4.3.2, using the percentile bootstrap we can obtain a asymptotically valid interval for :
where is the th bootstrap quantile of the SIPW estimates (26), as defined in Section 4.3.2. Then, interchanging the maximum/minimum and the quantile as in Theorem 6, we get a confidence interval for the averagetreatment effect for the collection of sensitivity models (24).
Corollary 6.
The interval in (27) can be computed efficiently using linear fractional programming as in Section 4.4. To simplify notation, assume, without loss of generality, the first units are treated () and the rest are the control (), and that the outcomes are ordered decreasingly among the first units and the other units. Then, as in (19), computing the interval (27) is equivalent to solving the following optimization problem: