Bounding Bias Due to Selection
Louisa H. Smith and Tyler J. VanderWeele
When epidemiologic studies are conducted in a subset of the population, selection bias can threaten the validity of causal inference. This bias can occur whether or not that selected population is the target population, and can occur even in the absence of exposure-outcome confounding. However, it is often difficult to quantify the extent of selection bias, and sensitivity analysis can be challenging to undertake and to understand. In this article we demonstrate that the magnitude of the bias due to selection can be bounded by simple expressions defined by parameters characterizing the relationships between unmeasured factor(s) responsible for the bias and the measured variables. No functional form assumptions are necessary about those unmeasured factors. Using knowledge about the selection mechanism, researchers can account for the possible extent of selection bias by specifying the size of the parameters in the bounds. We also show that the bounds, which differ depending on the target population, result in summary measures that can be used to calculate the minimum magnitude of the parameters required to shift a risk ratio to the null. The summary measure can be used to determine the overall strength of selection that would be necessary to explain away a result. We then show that the bounds and summary measures can be simplified in certain contexts or with certain assumptions. Using examples with varying selection mechanisms, we also demonstrate how researchers can implement these simple sensitivity analyses.
Keywords: selection bias, sensitivity analysis, bias analysis, target population
When bias in an epidemiologic study is unavoidable, various methods can be used to assess the robustness of results to factors that limit causal inference, such as unmeasured confounding, measurement error, and selection. While there exist relatively simple sensitivity analysis approaches for measurement error and unmeasured confounding,^{1–5} those for selection bias are limited by computational or mathematical complexity, the need for strong assumptions or the specification of a large number of parameters, or applicability only to certain selection mechanisms or study designs.^{6–13}
In this article we show that selection bias can be bounded by straightforward expressions that allow for a simple approach to sensitivity analysis. We use the term selection bias to describe the extent to which a parameter being estimated differs from a causal effect in either the total population or some subset of it, due to the restriction of the study population. This bias is sometimes defined as that resulting from selecting on a collider; that is, a common effect of two variables in the causal structure.^{14} Such selection may be due to convenience in study design or analysis, a desire to evaluate an exposure-outcome relationship in a subset of the population, an attempt to limit other types of bias, or non-participation and loss to follow-up.
For example, in birth defects studies, it is often difficult or impossible to collect data on pregnancies that do not result in a live birth. Analyzing the exposure-outcome relationship in only live births may lead to selection bias when a factor determining the probability of live birth is also related to the exposure of interest. In other types of studies, a question about a particular subpopulation is the motivation for the selection. For example, an exposure (e.g., obesity) may be particularly harmful or protective among people with certain health conditions (e.g., cardiovascular disease). However, even when selection is due to interest in a particular subpopulation, selecting on a factor of interest does not eliminate the potential for bias, if for example that factor itself is related to the exposure.
Solutions to these problems often involve a number of assumptions, particularly when the target of causal inference is the whole population and not just the subpopulation from which the sample was selected. Our simplified approach to assessing selection bias makes clear the target of inference and limits the number of parameters and assumptions that determine the possible magnitude of the bias. We show that the magnitude of the bias in the causal risk ratio can be bounded by simple expressions that relate the variables in the causal structure, which may be known or hypothetical. In Table 2 and the Appendix we extend the results to the risk difference scale. The bounds differ depending on the target population of interest and the selection procedure, but require no assumptions about the type or number of measured or unmeasured variables that cause the bias, or interactions between pairs of variables. We consider several causal structures under which the bounds can be applied and motivate their use in the contexts above and with other examples. Finally, we show that under certain assumptions about the equality of the parameters determining bias, a summary measure can be constructed for each of various scenarios, which can be used as a simple technique for assessing the robustness to selection bias of results from an epidemiologic analysis.
Consider a situation in which a causal population-level risk ratio (RR) comparing two levels of an exposure denoted A is the parameter of interest. Although our results hold comparing any two values of categorical or continuous A, we will assume binary A\in\{0,1\} for ease of notation. Let Y\in\{0,1\} denote the binary outcome and S be a binary indicator of selection, where S=1 indicates the subset of the population included in the study and S=0 that which is excluded. Let C denote a set of measured covariates. In case-control studies, the odds ratio (OR) may approximate the RR; we assume this approximation holds throughout this article. Furthermore, although cases are selected with higher probability than controls in such studies, we can ignore that aspect of the selection mechanism, as it does not bias the OR.
We will use potential outcome notation wherein Y_{a} indicates the value of Y under treatment A=a. Let the causal RR conditional on covariates C=c, P(Y_{1}=1|c)/P(Y_{0}=1|c) be denoted \text{RR}^{\text{true}}_{AY}, and assume that it is identifiable as P(Y=1|A=1,c)/P(Y=1|A=0,c). This requires that certain identifiability conditions hold, including consistency, positivity, and, in particular, exchangeability Y_{a}\!\perp\!\!\!\perp A|C; that is, that Y_{a} is independent of actual exposure A conditional on measured covariates C.^{15} For simplicity in the development that follows we will assume that all analyses are carried out within strata of measured confounders C as necessary, and exclude reference to those variables, but all subsequent probability expressions can be interpreted as conditional on measured covariates C.
Suppose now, due to some selection mechanism, we are limited to estimating the RR only within a subpopulation, denoted by S=1, so that we estimate \text{RR}^{\text{obs}}_{AY}=P(Y=1|A=1,S=1)/P(Y=1|A=0,S=1). If we restrict analysis to S=1, selection bias occurs if it is not the case that Y_{a}\!\perp\!\!\!\perp A|S=1, even though Y_{a}\!\perp\!\!\!\perp A in the population. The bias is not due to the fact that the RRs in the total and selected populations differ, but to the fact that \text{RR}^{\text{obs}}_{AY} is not a causal effect even in the selected population. (Later in the text, Results 5A and 5B correspond to the situation in which such an effect is of interest.)
Several causal structures in which selection bias may occur are shown in Figure 1. In each situation, bias is induced by a selection process which is itself differential with respect to the exposure or outcome and some unmeasured (or possibly measured) factor(s), denoted U. Consider the setting in which conditional on some unmeasured covariate(s) U, we have Y\!\perp\!\!\!\perp S|\{A,U\}. This independence holds in the causal diagrams in Figures 1 (A), (B) and (D). For ease of notation, we will consider U to be a categorical variable, but the results hold for general U or vector of variables denoted U. We can rewrite the target parameter in terms of U:
\text{RR}^{\text{true}}_{AY}=\frac{\sum_{s=0}^{1}\left\{\sum_{u}P(Y=1|A=1,S=s,U=u)P(U=u|A=1,S=s)\right\}P(S=s|A=1)}{\sum_{s=0}^{1}\left\{\sum_{u}P(Y=1|A=0,S=s,U=u)P(U=u|A=0,S=s)\right\}P(S=s|A=0)}\;.
Let the relative bias due to selection be defined as \text{RR}^{\text{obs}}_{AY}/\text{RR}^{\text{true}}_{AY}. By bounding this value, we can assess the maximum strength of the bias in terms of parameters that describe relationships between U and other variables.
We bound the relative bias from above, assuming that \text{RR}^{\text{obs}}_{AY}>\text{RR}^{\text{true}}_{AY}, so that the relative bias is greater than 1. If instead the relative bias is less than 1, interest is naturally in a bound from below; in that case we can reverse the coding of A so that the relative bias exceeds 1 and apply the bound to the recoded exposure.
We define the following parameters:
\text{RR}_{UY|(A=1)}=\frac{\max_{u}P(Y=1|A=1,U=u)}{\min_{u}P(Y=1|A=1,U=u)}
\text{RR}_{UY|(A=0)}=\frac{\max_{u}P(Y=1|A=0,U=u)}{\min_{u}P(Y=1|A=0,U=u)}
\text{RR}_{SU|(A=1)}=\max_{u}\frac{P(U=u|A=1,S=1)}{P(U=u|A=1,S=0)}
\text{RR}_{SU|(A=0)}=\max_{u}\frac{P(U=u|A=0,S=0)}{P(U=u|A=0,S=1)}
The \text{RR}_{UY|(A=a)} parameters can be interpreted as the maximum relative risks for Y=1 comparing any two values of U within strata of A=1 and A=0, respectively. It need not be a causal relationship that is described by this risk ratio, as U may be downstream of Y in some situations that are susceptible to selection bias (e.g., Figure 1 (B)). The \text{RR}_{SU|(A=a)} parameters are the maximum factors by which selection is associated with an increased prevalence of some value of U within stratum A=1, and by which non-selection is associated with an increased prevalence of some value of U within stratum A=0. If \text{RR}^{\text{obs}}_{AY} has been estimated within strata of measured confounders C, then these parameters are defined conditional on the same confounders.
We now present our first result, a proof of which is given in the Appendix.
Result 1A. If Y\!\perp\!\!\!\perp S|\{A,U\}, then:
\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY}}\leq\left(\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\right)\times\left(\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1}\right)\;.
Result 1A tells us that the bias is guaranteed to be equal to or smaller in magnitude than the given expression, which we will call in general a bounding factor. A researcher or reader who proposes that some factor U has led to selection bias can propose values that plausibly describe the relationships between that factor and selection and the outcome and calculate the bounding factor from these parameters. That bounding factor (or set of bounding factors constructed from ranges of values) can be divided out of the estimate of \text{RR}^{\text{obs}}_{AY} to come up with the smallest possible RR that would be compatible with \text{RR}^{\text{true}}_{AY}.
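To make Result 1A concrete, the following sketch checks the bound numerically for a binary U. Every probability in it is invented for illustration and none comes from a real study; the variable and function names are likewise hypothetical. It computes \text{RR}^{\text{obs}}_{AY}, the population risk ratio obtained by averaging over S as in the expression for \text{RR}^{\text{true}}_{AY}, and the four parameters as defined above.

```python
# Toy check of Result 1A with a binary U; all probabilities are made up.
p_y = {1: {0: 0.10, 1: 0.20},       # P(Y=1 | A=a, U=u), indexed [a][u]
       0: {0: 0.05, 1: 0.10}}
p_u1 = {(1, 1): 0.6, (1, 0): 0.4,   # P(U=1 | A=a, S=s), indexed (a, s)
        (0, 1): 0.3, (0, 0): 0.5}
p_s1 = {1: 0.5, 0: 0.5}             # P(S=1 | A=a)


def p_u(u, a, s):
    """P(U=u | A=a, S=s) for binary U."""
    return p_u1[(a, s)] if u == 1 else 1 - p_u1[(a, s)]


def risk(a, s):
    """P(Y=1 | A=a, S=s), using Y independent of S given (A, U)."""
    return sum(p_y[a][u] * p_u(u, a, s) for u in (0, 1))


def pop_risk(a):
    """P(Y=1 | A=a) in the whole population, averaging over S."""
    return risk(a, 1) * p_s1[a] + risk(a, 0) * (1 - p_s1[a])


rr_obs = risk(1, 1) / risk(0, 1)
rr_true = pop_risk(1) / pop_risk(0)
relative_bias = rr_obs / rr_true

# The four sensitivity parameters, following their definitions in the text.
rr_uy = {a: max(p_y[a].values()) / min(p_y[a].values()) for a in (0, 1)}
rr_su1 = max(p_u(u, 1, 1) / p_u(u, 1, 0) for u in (0, 1))
rr_su0 = max(p_u(u, 0, 0) / p_u(u, 0, 1) for u in (0, 1))


def factor(uy, su):
    """One of the two terms in the Result 1A bounding factor."""
    return uy * su / (uy + su - 1)


bound = factor(rr_uy[1], rr_su1) * factor(rr_uy[0], rr_su0)
print(f"relative bias = {relative_bias:.3f}, bound = {bound:.3f}")
```

In this toy setting the relative bias is about 1.15 and the bounding factor is 1.50, so the bound holds with room to spare; the bound describes the worst case compatible with the parameters, not the bias itself.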
After a rise in microcephaly cases in northeast Brazil closely followed an outbreak of Zika virus in that region, evidence from biological and ecologic data supported a causal link.^{16} In particular, models using surveillance data showed that the population risk of microcephaly increased after Zika infections in the first trimester of pregnancy.^{17} The relationship was seemingly confirmed with the first case-control study to examine the association, from which de Araújo et al. reported an adjusted OR of 73.1 (95% CI 13.0, \infty).^{18} Both live and still births were recruited as cases; however, pregnancies that resulted in miscarriage or elective abortion would have been missed by this study design, which corresponds to Figure 1 (A). The probability of not having a termination (S=1) may be affected by exposure to the virus (A) as well as socioeconomic or behavioral conditions such as lack of access to medical care (U), which may also affect the probability of microcephaly (Y) (e.g., giving birth in a public hospital, low education, and being unmarried have been associated with microcephaly in Brazil^{19}). Selecting only live and still births in the analysis may therefore lead to selection bias.
Suppose that access to medical care affected the probability of microcephaly by up to 2-fold among the Zika-exposed and unexposed (i.e., \text{RR}_{UY|(A=1)}=\text{RR}_{UY|(A=0)}=2) and that lack of access to medical care for pregnant women was up to 1.7 times more likely for women without an induced abortion among the Zika-exposed (\text{RR}_{SU|(A=1)}=1.7) and access to medical care up to 1.5 times more likely for women with an induced abortion among the unexposed (\text{RR}_{SU|(A=0)}=1.5). The bounding factor is then
\left(\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\right)\times\left(\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1}\right)=
\left(\frac{2\times 1.7}{2+1.7-1}\right)\times\left(\frac{2\times 1.5}{2+1.5-1}\right)=1.51\;.
Dividing the original estimate and confidence interval by this bounding factor gives the most that such selection bias could alter the estimate: an OR of 48.4 (95% CI 8.6, \infty), which is of course still a very large effect estimate.
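The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the function name `bounding_factor` and its argument names are illustrative choices, not part of any published software, and the parameter values are the hypothetical ones proposed above.

```python
def bounding_factor(rr_uy_a1, rr_su_a1, rr_uy_a0, rr_su_a0):
    """Bounding factor of Result 1A from the four sensitivity parameters."""
    exposed = rr_uy_a1 * rr_su_a1 / (rr_uy_a1 + rr_su_a1 - 1)
    unexposed = rr_uy_a0 * rr_su_a0 / (rr_uy_a0 + rr_su_a0 - 1)
    return exposed * unexposed


# Hypothetical parameter values from the Zika example.
bf = bounding_factor(rr_uy_a1=2.0, rr_su_a1=1.7, rr_uy_a0=2.0, rr_su_a0=1.5)

# Reported adjusted OR and lower confidence limit (the upper limit is infinite).
or_obs, ci_lower = 73.1, 13.0

print(round(bf, 2))                # 1.51
print(round(or_obs / bf, 1))       # 48.4
print(round(ci_lower / bf, 1))     # 8.6
```

Passing a range of values for each argument, instead of single numbers, gives the set of bounding factors mentioned earlier.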
Instead of calculating each of the parameters in the bounding factor individually, we may be interested in assessing the overall susceptibility of a result to selection bias. This can be done with a single value that summarizes the extent to which a \text{RR}^{\text{obs}}_{AY}\neq 1 may be a spurious finding entirely due to selection bias.
Result 1B. If Y\!\perp\!\!\!\perp S|\{A,U\}, then the minimum magnitude of each of the four parameters that make up the bounding factor, assuming the four are equal, that would be sufficient to shift a given \text{RR}^{\text{obs}}_{AY} to the null is given by:
\text{RR}_{UY|(A=0)}=\text{RR}_{UY|(A=1)}=\text{RR}_{SU|(A=0)}=\text{RR}_{SU|(A=1)}\geq\sqrt{\text{RR}^{\text{obs}}_{AY}}+\sqrt{\text{RR}^{\text{obs}}_{AY}-\sqrt{\text{RR}^{\text{obs}}_{AY}}}
For example, if \text{RR}^{\text{obs}}_{AY}=3, then all four of the parameters in the bounding factor must be equal to or greater than \sqrt{3}+\sqrt{3-\sqrt{3}}=2.9 to have generated sufficient selection bias. If one of the four is smaller than 2.9, then one or more of the others must be greater than 2.9 to compensate. Because it depends only on \text{RR}^{\text{obs}}_{AY}, this summary measure is easy to calculate and compare across studies. However, it is context specific: it is interpreted relative to the selection mechanism in a given study and conditional on whatever confounders have been controlled for in the analysis. Based on content knowledge, investigators and readers can judge whether there exists a U that could be so strongly related both to the outcome and to selection within strata of the exposure and the measured confounders.
Such calculations can also be performed using the lower limit of the confidence interval instead of \text{RR}^{\text{obs}}_{AY}, to see what strength of the selection parameters would be necessary to result in a confidence interval that includes the null value of 1.
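The closed form in Result 1B can be recovered directly from the bounding factor in Result 1A. The following is a short sketch of that algebra and not a substitute for the proof in the Appendix; the symbol x, denoting the common value of the four parameters, is introduced here only for illustration.

```latex
% Set all four parameters equal to x >= 1 in the bounding factor of Result 1A
% and require that it be at least as large as the observed risk ratio.
\begin{aligned}
\left(\frac{x^{2}}{2x-1}\right)^{2}\geq\text{RR}^{\text{obs}}_{AY}
&\iff\frac{x^{2}}{2x-1}\geq\sqrt{\text{RR}^{\text{obs}}_{AY}}\\
&\iff x^{2}-2\sqrt{\text{RR}^{\text{obs}}_{AY}}\,x+\sqrt{\text{RR}^{\text{obs}}_{AY}}\geq 0\\
% For x >= 1 the quadratic is nonnegative exactly when x is at least its larger root.
&\iff x\geq\sqrt{\text{RR}^{\text{obs}}_{AY}}+\sqrt{\text{RR}^{\text{obs}}_{AY}-\sqrt{\text{RR}^{\text{obs}}_{AY}}}
\end{aligned}
```

The last step uses the fact that, for \text{RR}^{\text{obs}}_{AY}>1, the smaller root of the quadratic lies below 1, so only the larger root is relevant for parameters that are risk ratios of at least 1.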
With no assumptions about the exact nature of the unmeasured factors U, we can use Result 1B to assess the plausibility that the Zika-microcephaly association is fully explained by selection bias. By calculating the summary measure \sqrt{73.1}+\sqrt{73.1-\sqrt{73.1}}=16.6, we come to conclusions about the strength of the relationships with the unmeasured behaviors or socioeconomic conditions (such as lack of access to medical care) that would be necessary to produce an \text{RR}^{\text{obs}}_{AY} of 73.1 if \text{RR}^{\text{true}}_{AY}=1. An unmeasured variable (e.g., lack of access to medical care) that increased the risk of microcephaly by 16.6-fold in both exposed and unexposed women, that was 16.6 times more prevalent among exposed women with live or still births than among exposed women whose pregnancies were terminated, and that was also 16.6 times less prevalent among unexposed women with live or still births than among unexposed women whose pregnancies were terminated could suffice, but weaker selection could not. Risk ratios of that magnitude are rarely seen in epidemiologic research, particularly in the context of behavioral differences, lending confidence that the increased microcephaly risk is not the result of selection bias. We can repeat the calculation with the lower bound of the confidence interval, 13.0, in order to assess the magnitude of selection bias necessary to shift the confidence interval to include the null. This gives a summary measure of 6.7; although it is perhaps plausible that one of the parameters is that large, it seems unlikely that all four are. Assuming that confounding was fully accounted for in the study via matching and multivariate control and that all variables were correctly measured, it seems that even in the presence of possible selection bias, the evidence is very strong that Zika infection in pregnant women causes microcephaly.
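The two summary measures quoted for the Zika example can be checked with a short script. This is a sketch under the assumption that the estimate is above 1; the function name `summary_measure_general` is hypothetical.

```python
import math


def summary_measure_general(rr):
    """Result 1B: common value of the four parameters needed to shift rr to 1.

    Assumes rr >= 1; for an estimate below 1, recode the exposure first.
    """
    if rr < 1:
        raise ValueError("recode the exposure so that the estimate is >= 1")
    root = math.sqrt(rr)
    return root + math.sqrt(rr - root)


point = summary_measure_general(73.1)   # adjusted OR
lower = summary_measure_general(13.0)   # lower confidence limit

print(round(point, 1))   # 16.6
print(round(lower, 1))   # 6.7

# Round trip: with all four parameters equal to `point`, the Result 1A
# bounding factor should reproduce the observed estimate.
recovered = (point ** 2 / (2 * point - 1)) ** 2
print(round(recovered, 1))   # 73.1
```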
Here we consider a number of special cases which result in modified bounding factors and summary measures. Table 1 summarizes the results, and derivations are provided in the Appendix.
In some situations, U may not be unmeasured and may be common to the entire selected population. This is the case, for example, when some characteristic defines or directly leads to selection into a study. When this is true the bounding factor is simplified.
Result 2A. If S=U, then:
\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY}}\leq\text{RR}_{UY|(A=0)}\times\text{RR}_{UY|(A=1)}\;.
We can also construct a summary measure for this situation. It can be used in the same way as that in Result 1B, but only describes the minimum magnitude of the two parameters in the modified bounding factor in Result 2A.
Result 2B. If S=U, then the minimum magnitude of each of the two parameters that make up the bounding factor in Result 2A, assuming they are equal, that would be sufficient to shift a given \text{RR}^{\text{obs}}_{AY} to the null is given by:
\text{RR}_{UY|(A=0)}=\text{RR}_{UY|(A=1)}\geq\sqrt{\text{RR}^{\text{obs}}_{AY}}\;.
Although Result 1A requires minimal assumptions, sometimes we can make assumptions that decrease the magnitude of the bounding factor, which can provide us with more confidence that a given result is not due to selection bias. The bounding factor is greatest when P(Y=1|A=1,S=1)>P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)<P(Y=1|A=0,S=0); that is, when selection is associated with increased risk of the outcome among the exposed and with decreased risk among the unexposed. However, if selection is associated with increased risk among both groups, then we have the following result.
Result 3A. If Y\!\perp\!\!\!\perp S|\{A,U\} and if P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1, then:
\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY}}\leq\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\;.
Results are analogous with decreased risk for both groups, with A=0 replacing A=1 in each of the parameters. (If selection is associated with decreased risk among the exposed and increased risk among the unexposed, the bias \leq 1, so A should be recoded to construct a meaningful bound.)
If assumptions about the consistency of the direction of the selection-outcome relationship can be made, then we can also use simpler expressions as the summary measures; for increased risk in both groups the summary measure is stated in the following result.
Result 3B. If Y\!\perp\!\!\!\perp S|\{A,U\} and if P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1, then the minimum magnitude of each of the two parameters that make up the bounding factor in Result 3A, assuming they are equal, that would be sufficient to shift a given \text{RR}^{\text{obs}}_{AY} to the null is given by:
\text{RR}_{UY|(A=1)}=\text{RR}_{SU|(A=1)}\geq\text{RR}^{\text{obs}}_{AY}+\sqrt{\text{RR}^{\text{obs}}_{AY}(\text{RR}^{\text{obs}}_{AY}-1)}
When the outcome risk is decreased with selection in both exposure groups, the summary measure refers to the minimum strength of the parameters \text{RR}_{UY|(A=0)} and \text{RR}_{SU|(A=0)}. Results 3A and 3B have the same analytic form as the recently proposed “E-value” calculated to assess robustness to unmeasured confounding.^{20}
When S=U and we can make assumptions about the increase or decrease of risk in both exposure groups with selection, we can combine earlier results. For increased risk in both groups, we have the following result.
Result 4A. If S=U and if P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1, then:
\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY}}\leq\text{RR}_{UY|(A=1)}\;.
The summary measure describing the minimum magnitude of the sole parameter is also simplified.
Result 4B. If S=U and if P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1, then the minimum magnitude of \text{RR}_{UY|(A=1)} that would be sufficient to shift a given \text{RR}^{\text{obs}}_{AY} to the null is given by:
\text{RR}_{UY|(A=1)}\geq\text{RR}^{\text{obs}}_{AY}\;.
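The summary measures in Results 1B through 4B can be compared side by side for a given estimate. The sketch below uses hypothetical function names and the illustrative value \text{RR}^{\text{obs}}_{AY}=3 from earlier in the text; it assumes the estimate is at least 1.

```python
import math


def summary_general(rr):
    """Result 1B: four equal parameters, no additional assumptions."""
    root = math.sqrt(rr)
    return root + math.sqrt(rr - root)


def summary_s_equals_u(rr):
    """Result 2B: S = U, two equal parameters."""
    return math.sqrt(rr)


def summary_same_direction(rr):
    """Result 3B: selection increases risk in both exposure groups."""
    return rr + math.sqrt(rr * (rr - 1))


def summary_s_equals_u_same_direction(rr):
    """Result 4B: S = U and selection increases risk in both groups."""
    return rr


rr_obs = 3.0  # illustrative value
measures = {
    "1B": summary_general(rr_obs),
    "2B": summary_s_equals_u(rr_obs),
    "3B": summary_same_direction(rr_obs),
    "4B": summary_s_equals_u_same_direction(rr_obs),
}
for result, value in measures.items():
    print(f"Result {result}: {value:.2f}")
```

The values are not directly comparable as "more" or "less" robust, because each refers to a different number of parameters: four in Result 1B, two in Results 2B and 3B, and one in Result 4B.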
For many years the relationship between estrogen replacement therapy and endometrial cancer was clouded by controversy over proper study design to minimize bias. In an attempt to limit differential outcome detection by estrogen exposure, Horwitz and Feinstein simultaneously performed a case-control study of exogenous estrogens and endometrial cancer in a population of women who had undergone intra-endometrial diagnostic procedures and one with a more “conventional” sampling method.^{21} They claimed their estimates of an OR of 2.30 using the “alternative” sampling method and an OR of 11.98 with the conventional method supported their worry about biased cancer detection. However, their selection procedure was shown to induce bias.^{22} The structure leading to this bias is shown in Figure 1 (B), in which U represents a diagnostic procedure; in this case, all of those selected (S) have undergone such a procedure. We can use a bounding factor to assess how plausible it is that selection bias could explain the authors’ much reduced OR. In this context we are curious about whether bias could shift the result to a specific value and not to the null.
Since our proposed \text{RR}^{\text{true}}_{AY}=11.98>\text{RR}^{\text{obs}}_{AY}=2.30, the \text{bias }<1, and we recode the exposure for a relative bias of 11.98/2.3 = 5.2. Since everyone in the selected population had symptoms that led to a diagnostic procedure, we will use the bounding factor that assumes S=U. If we assume that having a hysterectomy is associated with increased cancer prevalence in both estrogen-exposed and non-exposed women, we have by Result 4B that 5.2<\text{RR}_{UY|(A=1)} (recalling that A=1, after recoding, now refers to those unexposed to estrogen). This means that in order for the difference in ORs to be possibly explained by selection bias (that is, for the \text{RR}^{\text{obs}}_{AY} of 2.30 to shift to at least 11.98 after accounting for selection bias), the prevalence of endometrial cancer in non-users of estrogen who have undergone hysterectomy or other diagnostic procedure must be greater than 5.2 times that in non-users who have not.
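The relative bias used here, and the parameter size it implies, follow from simple arithmetic. The sketch below is illustrative; the variable names are hypothetical, and the final lines show, purely as a contrast, what Result 2A would require if no assumption were made about the direction of the selection-outcome relationship.

```python
import math

or_conventional = 11.98   # OR from the conventional sampling method
or_alternative = 2.30     # OR from the alternative sampling method

# Relative bias after recoding the exposure so that the bias exceeds 1.
relative_bias = or_conventional / or_alternative
print(round(relative_bias, 1))   # 5.2

# With S = U and increased risk with selection in both groups, the bound is
# the single parameter RR_UY|(A=1), which must be at least the relative bias.
min_rr_uy_a1 = relative_bias

# Contrast (an illustration, not part of the example above): without the
# direction assumption, Result 2A bounds the bias by the product of two
# parameters, so if they are equal each must be at least the square root.
min_each_result_2a = math.sqrt(relative_bias)
print(round(min_each_result_2a, 2))   # 2.28
```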
In some studies the causal risk ratio in the selected population, and not in the entire population, may be the target parameter. This may occur, for example, when S is an indicator of a well-defined population for which an estimated causal effect is desired and not simply the result of poor sampling or selective attrition.
Under the same notation and assumptions as above, we denote \text{RR}^{\text{true}}_{AY|(S=1)}=P(Y_{1}=1|S=1)/P(Y_{0}=1|S=1). Again, because it is not true that Y_{a}\!\perp\!\!\!\perp A|S=1, this is not identifiable as \text{RR}^{\text{obs}}_{AY}=P(Y=1|A=1,S=1)/P(Y=1|A=0,S=1). Instead, if Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}, the RR of interest is identified by marginalizing over the distribution of U in the selected population, resulting in
\text{RR}^{\text{true}}_{AY|(S=1)}=\frac{\sum_{u}P(Y=1|A=1,S=1,U=u)P(U=u|S=1)}{\sum_{u}P(Y=1|A=0,S=1,U=u)P(U=u|S=1)}\;.
Again we are concerned with the relative bias \text{RR}^{\text{obs}}_{AY}/\text{RR}^{\text{true}}_{AY|(S=1)}. This bias can be conceptualized as equivalent to that due to unmeasured confounding, which occurs when U is associated with the exposure A and also affects the outcome Y. Although A and U are marginally independent, as in Figures 1 (A), (C), and (D), an association between the two is induced by conditioning on selection into the study, represented by the boxed S in the diagrams. Then, within stratum S=1, we have a situation equivalent to confounding by U, due to its relationships with the exposure and outcome. Extending previously published bounds for bias due to unmeasured confounding,^{5} we have a bounding factor for inference in the selected population as follows.
Result 5A. If Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}, then:
\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY|(S=1)}}\leq\frac{\text{RR}_{UY|(S=1)}\times\text{RR}_{AU|(S=1)}}{\text{RR}_{UY|(S=1)}+\text{RR}_{AU|(S=1)}-1}
where
\text{RR}_{UY|(S=1)}=\max_{a}\frac{\max_{u}P(Y=1|A=a,S=1,u)}{\min_{u}P(Y=1|A=a,S=1,u)}
\text{RR}_{AU|(S=1)}=\frac{\max_{u}P(U=u|A=1,S=1)}{\min_{u}P(U=u|A=0,S=1)}\;.
The parameter \text{RR}_{UY|(S=1)} is the maximum risk ratio for the outcome given any two values of U among either the unexposed or exposed selected population. Because data is available on a sample of this population, this could be approximated using available data on measured confounders.
Because the second parameter, \text{RR}_{AU|(S=1)}, represents an association induced between two marginally independent variables (i.e., the dependence due to collider stratification), it is not as intuitive to specify. However, it is conceptually similar to one of the two required to define a bound for bias in the natural direct effect,^{23} where the A-U relationship is induced by conditioning on a mediator. In the great majority of cases, the bound that uses this parameter is smaller than one that uses the maximum RR for S=1 comparing two values of U or the maximum risk ratio for S=1 comparing two values of A instead.^{24} Depending on the structure of the selection bias mechanism, one of these two parameters might be more intuitive to quantify, and can generally replace \text{RR}_{AU|(S=1)} for an approximate bounding factor.
The summary measure that follows from Result 5A can then be given in the following result.
Result 5B. If Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}, then the minimum value of \text{RR}_{AU|(S=1)} and \text{RR}_{UY|(S=1)}, assuming the two parameters are equal, that would be sufficient to shift a given \text{RR}^{\text{obs}}_{AY} to the null is given by:
\text{RR}_{UY|(S=1)}=\text{RR}_{AU|(S=1)}\geq\text{RR}^{\text{obs}}_{AY}+\sqrt{\text{RR}^{\text{obs}}_{AY}(\text{RR}^{\text{obs}}_{AY}-1)}\;.
Results 5A and 5B have the same analytic form as the recently proposed “E-value” calculated to assess robustness to unmeasured confounding.^{20}
The obesity paradox is a well-known phenomenon in chronic disease epidemiology in which overweight and obesity are associated with increased survival compared to normal weight among patients with certain conditions.^{25} Whether this is a real causal effect (which could result in different weight recommendations for people living with chronic conditions) or due to bias – in particular, bias resulting from selection on a common effect (chronic disease) of both obesity and some unmeasured factor that is also related to death (Figure 1(C)) – is the subject of much debate.^{26–29} Gruberg et al. investigated the relationship between body mass index and one-year risk of death among patients who were treated for advanced coronary artery disease (CAD), finding that 10.6% of patients with normal body mass index died, more than double the percentage among the obese patients.^{30} Using the OR from their adjusted model, we can calculate that mortality risk was 1.50 times higher (95% CI 1.22, 1.86) in patients for whom body mass index was 10 units lower (corresponding approximately to the difference between obesity and normal weight).^{30} The authors controlled for a number of measured confounders including age and heart function.
Because we are interested in the population of CAD patients, we can use Result 5B to assess the plausibility of such a result being due to some unmeasured common cause of heart failure and death: 1.50+\sqrt{1.50(1.50-1)}=2.37. The unmeasured factor must increase the risk of death among normal weight or obese CAD patients by a factor of 2.37 (independent of the factors already included in the model), as well as differ between the obesity exposure categories in CAD patients by the same factor, if there were truly no protective effect of obesity on death in that population. Because the latter relationship, between obesity and the unmeasured factor, would be one induced solely by the selection of CAD patients, it may be difficult to specify. It may be more intuitive to consider an unmeasured factor that directly increases the risk of CAD by the same factor of 2.37, which will generally also suffice to bound the selection bias. We can repeat the calculation and interpretation with the lower bound of the confidence interval, which gives us a summary measure of 1.74, to assess the bias necessary for the confidence interval to include the null.
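The two summary measures in this example can be reproduced as follows. The function name `summary_selected_population` is a hypothetical label for the expression in Result 5B, and the sketch assumes an estimate of at least 1.

```python
import math


def summary_selected_population(rr):
    """Result 5B: common value of RR_UY|(S=1) and RR_AU|(S=1) needed to
    shift rr to the null when the selected population is the target."""
    return rr + math.sqrt(rr * (rr - 1))


rr_point, rr_lower = 1.50, 1.22   # estimate and lower confidence limit

point = summary_selected_population(rr_point)
lower = summary_selected_population(rr_lower)
print(round(point, 2))   # 2.37
print(round(lower, 2))   # 1.74

# Round trip: with both parameters equal to `point`, the Result 5A bounding
# factor should reproduce the observed estimate.
recovered = point * point / (point + point - 1)
print(round(recovered, 2))   # 1.5
```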
Because selection bias can be difficult to quantify, it is often ignored in sensitivity analysis or only explored in complex analyses that must be relegated to appendices. A simple way to characterize the possible extent of selection bias in terms of the relationships in the causal structure that induces it will allow researchers to more easily assess the plausibility of this bias with minimal assumptions.
Thinking about selection bias as described in this article will also force researchers to clearly define the target population of interest, whether that be the total population or those with the characteristics of the selected sample. Making assumptions to simplify the bounding factor can also compel them to think through the mechanisms by which selection bias occurs and the direction of the various effects. However, no such assumptions are required to use our main results. While this article focused on the relative bias of observed RRs, as relative effect measures are common in epidemiology and RRs are often approximated by ORs and hazard ratios under certain assumptions, analogous bounds for observed effects on the risk difference scale are presented in Table 2, and their corresponding derivations in the Appendix.
The bounds we presented in this article can be used in several ways. If researchers have quantitative knowledge about factors influencing selection in their study, such as in a situation with loss to follow-up or participation in a sub-study, realistic RRs for unmeasured factors can be used as parameters in the bounds to explore to what extent these could affect \text{RR}^{\text{obs}}_{AY}. If only ranges of possible parameters are proposed, the bounds can be varied across those ranges in a table or figure in order to allow readers to consider the most plausible combinations. Finally, if all that is desired is a summary measure of the extent to which a result could be rendered null by selection bias, or shifted to any other proposed true value, the bounds can be used to describe the magnitude of the parameters that could result in such an observed value.
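When only ranges of parameters are proposed, the bounding factor can be tabulated over a grid. The sketch below is one simple way to do this under stated assumptions: the observed estimate of 3.0 and the parameter ranges are invented for illustration, the function name is hypothetical, and the same \text{RR}_{UY} and \text{RR}_{SU} values are used in both exposure groups only to keep the table two-dimensional.

```python
def bounding_factor(rr_uy_a1, rr_su_a1, rr_uy_a0, rr_su_a0):
    """Bounding factor of Result 1A from the four sensitivity parameters."""
    exposed = rr_uy_a1 * rr_su_a1 / (rr_uy_a1 + rr_su_a1 - 1)
    unexposed = rr_uy_a0 * rr_su_a0 / (rr_uy_a0 + rr_su_a0 - 1)
    return exposed * unexposed


rr_obs = 3.0                       # hypothetical observed estimate
rr_uy_values = [1.5, 2.0, 3.0]     # hypothetical range for RR_UY (both groups)
rr_su_values = [1.25, 1.5, 2.0]    # hypothetical range for RR_SU (both groups)

rows = []
for rr_uy in rr_uy_values:
    for rr_su in rr_su_values:
        bf = bounding_factor(rr_uy, rr_su, rr_uy, rr_su)
        # Smallest true RR compatible with the estimate under these parameters.
        rows.append((rr_uy, rr_su, bf, rr_obs / bf))

print("RR_UY  RR_SU  bound  min true RR")
for rr_uy, rr_su, bf, rr_min in rows:
    print(f"{rr_uy:5.2f}  {rr_su:5.2f}  {bf:5.2f}  {rr_min:11.2f}")
```

Rows in which the last column falls to 1 or below identify parameter combinations strong enough to explain away the estimate.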
There are nonetheless several limitations to these bounds. First, they are only applicable under certain causal structures that lead to selection bias. The results here describe the maximum bias that could result from the parameters; the same parameters could also induce less bias. This conservative approach is useful when less is known about the selection mechanism and a simple exploration of the possible bias is desired. When more information is available, a more complex but precise method may be preferred.^{7,8,10–12} Next, the \text{RR}_{AU|(S=1)} parameter in the bound for the selected population is unintuitive and may be hard to specify even in the presence of solid knowledge about the selection mechanism; however, RRs relating the exposure or selection to the unobserved factor can usually be used in its place.^{24} Finally, this article only addresses bias due to selection and assumes other criteria for causal inference, such as control of exposure-outcome confounding and lack of measurement error, have been met. Future work could combine this approach to selection bias with other methods for bias analysis and could take into account the possibility that factors leading to selection bias could be sources of other types of bias.
Table 1: Summary of bounding factors and summary measures under different scenarios.
Scenario | Bounding factor^{a} (A) | Summary measure^{b} (B) |
---|---|---|
Result 1. General selection bias^{c,d} | \left(\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\right)\times\left(\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1}\right) | \sqrt{\text{RR}^{\text{obs}}_{AY}}+\sqrt{\text{RR}^{\text{obs}}_{AY}-\sqrt{\text{RR}^{\text{obs}}_{AY}}} |
Result 2. When S=U^{c,e} | \text{RR}_{UY|(A=0)}\times\text{RR}_{UY|(A=1)} | \sqrt{\text{RR}^{\text{obs}}_{AY}} |
Result 3. Increased risk with selection in both exposure groups^{c,f} | \frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1} | \text{RR}^{\text{obs}}_{AY}+\sqrt{\text{RR}^{\text{obs}}_{AY}(\text{RR}^{\text{obs}}_{AY}-1)} |
Result 4. S=U and increased risk^{c} | \text{RR}_{UY|(A=1)} | \text{RR}^{\text{obs}}_{AY} |
Result 5. Inference in the selected population^{g} | \frac{\text{RR}_{UY|(S=1)}\times\text{RR}_{AU|(S=1)}}{\text{RR}_{UY|(S=1)}+\text{RR}_{AU|(S=1)}-1} | \text{RR}^{\text{obs}}_{AY}+\sqrt{\text{RR}^{\text{obs}}_{AY}(\text{RR}^{\text{obs}}_{AY}-1)} |
^{a} The bias due to selection of the observed risk ratio, \text{RR}^{\text{obs}}_{AY}/\text{RR}^{\text{true}}_{AY}, is guaranteed to be no greater than this value. The parameters that define each bound are defined in the main text.
^{b} If all of the parameters in the bounding factor are equal, then each must be at least this value in order to shift \text{RR}^{\text{obs}}_{AY} to 1.
^{c} The bound holds under the assumption that Y\!\perp\!\!\!\perp S|\{A,U\}.
^{d} The parameter of interest, \text{RR}^{\text{true}}_{AY}, is the causal risk ratio for the whole population.
^{e} The factor responsible for selection bias, U, is common to the entire selected population.
^{f} P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1.
^{g} The parameter of interest, \text{RR}^{\text{true}}_{AY|(S=1)}, is the causal risk ratio in the selected population only. The bound holds under the assumption that Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}.
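The summary measures in column (B) can be computed directly, as in the sketch below. The function names are hypothetical labels chosen for this example, and the code assumes an observed risk ratio of at least 1 (recode the exposure otherwise).

```python
import math


def _check(rr_obs):
    """The summary measures are written for an observed risk ratio >= 1."""
    if rr_obs < 1:
        raise ValueError("rr_obs must be >= 1; recode the exposure if needed")


def summary_general(rr_obs):
    """Result 1: sqrt(RR_obs) + sqrt(RR_obs - sqrt(RR_obs))."""
    _check(rr_obs)
    root = math.sqrt(rr_obs)
    return root + math.sqrt(rr_obs - root)


def summary_s_equals_u(rr_obs):
    """Result 2 (S = U): sqrt(RR_obs)."""
    _check(rr_obs)
    return math.sqrt(rr_obs)


def summary_single_factor(rr_obs):
    """Results 3 and 5: RR_obs + sqrt(RR_obs * (RR_obs - 1))."""
    _check(rr_obs)
    return rr_obs + math.sqrt(rr_obs * (rr_obs - 1))


def summary_s_equals_u_increased_risk(rr_obs):
    """Result 4 (S = U and increased risk): RR_obs itself."""
    _check(rr_obs)
    return rr_obs


if __name__ == "__main__":
    rr_obs = 2.0  # hypothetical observed risk ratio
    for fn in (summary_general, summary_s_equals_u,
               summary_single_factor, summary_s_equals_u_increased_risk):
        print(f"{fn.__name__}: {fn(rr_obs):.3f}")
```

Substituting each returned value back into the corresponding bounding factor, with all parameters set equal, recovers the observed risk ratio, which is a convenient check on the formulas.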
Table 2: Summary of bounds for selection bias on the risk difference scale.
Scenario | Bound^{a} |
---|---|
Result 1. General selection bias^{b,c} | \frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}-P(Y=1|A=1,S=1)\times\frac{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}+P(Y=1|A=0,S=1)\times\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1} |
Result 2. When S=U^{b,d} | \text{RR}_{UY|(A=1)}-P(Y=1|A=1,S=1)/\text{RR}_{UY|(A=1)}+P(Y=1|A=0,S=1)\times\text{RR}_{UY|(A=0)} |
Result 3. Increased risk with selection in both exposure groups^{b,e} | \frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}-P(Y=1|A=1,S=1)\times\frac{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}} |
Result 4. S=U and increased risk | \text{RR}_{UY|(A=1)}-P(Y=1|A=1,S=1)/\text{RR}_{UY|(A=1)} |
Result 5. Inference in the selected population^{f} | \max\left(P(Y=1|A=0,S=1)\times\left(\frac{\text{RR}_{UY|(S=1)}\times\text{RR}_{AU|(S=1)}}{\text{RR}_{UY|(S=1)}+\text{RR}_{AU|(S=1)}-1}-1\right),P(Y=1|A=1,S=1)\times\left(1-\frac{\text{RR}_{UY|(S=1)}+\text{RR}_{AU|(S=1)}-1}{\text{RR}_{UY|(S=1)}\times\text{RR}_{AU|(S=1)}}\right)\right) |
^{a} The bias due to selection of the observed risk difference, \text{RD}^{\text{obs}}_{AY}-\text{RD}^{\text{true}}_{AY}, is guaranteed to be no greater than this value. The parameters that define each bound are defined in the main text.
^{b} The bound holds under the assumption that Y\!\perp\!\!\!\perp S|\{A,U\}.
^{c} The parameter of interest, \text{RD}^{\text{true}}_{AY}, is the causal risk difference for the whole population.
^{d} The factor responsible for selection bias, U, is common to the entire selected population.
^{e} P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1.
^{f} The parameter of interest, \text{RD}^{\text{true}}_{AY|(S=1)}, is the causal risk difference in the selected population only. The bound holds under the assumption that Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}.
1. Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med. 1989;8:1051–1069.
2. Stürmer T, Schneeweiss S, Avorn J, et al. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol. 2005;162:279–289.
3. Cole SR, Chu H, Greenland S. Multiple-imputation for measurement-error correction. Int J Epidemiol. 2006;35:1074–1081.
4. Greenland S. Bayesian perspectives for epidemiologic research: III. Bias analysis via missing-data methods. Int J Epidemiol. 2009;38:1662–1673.
6. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc. 1999;94:1096–1120.
7. Greenland S. Multiple-bias modelling for analysis of observational data (with discussion). J Roy Stat Soc A. 2005;168:267–306.
8. Geneletti S, Richardson S, Best N. Adjusting for selection bias in retrospective, case-control studies. Biostatistics. 2009;10:17–31.
9. Howe CJ, Cole SR, Chmiel JS, et al. Limitation of inverse probability-of-censoring weights in estimating survival in the presence of strong selection bias. Am J Epidemiol. 2011;173:569–577.
10. Törner A, Dickman P, Duberg AS, et al. A method to visualize and adjust for selection bias in prevalent cohort studies. Am J Epidemiol. 2011;174:969–976.
12. McGovern ME, Bärnighausen T, Marra G, et al. On the assumption of bivariate normality in selection models. Epidemiology. 2015;26:229–237.
13. Hanley JA. Correction of selection bias in survey data: Is the statistical cure worse than the bias? Am J Public Health. 2017;107:503–505.
14. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15:615–625.
16. Rasmussen SA, Jamieson DJ, Honein MA, et al. Zika virus and birth defects — reviewing the evidence for causality. New Engl J Med. 2016;374:1981–1987.
17. Cauchemez S, Besnard M, Bompard P, et al. Association between Zika virus and microcephaly in French Polynesia, 2013–15: A retrospective study. The Lancet. 2016;387:2125–2132.
18. Araújo TVB de, Ximenes RA de A, Miranda-Filho D de B, et al. Association between microcephaly, Zika virus infection, and other risk factors in Brazil: Final report of a case-control study. Lancet Infect Dis. 2018;18:328–336.
19. Silva AA, Barbieri MA, Alves MT, et al. Prevalence and risk factors for microcephaly at birth in Brazil in 2010. Pediatrics. 2018;141:e20170589.
20. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: Introducing the E-value. Ann Intern Med. 2017;167:268–275.
21. Horwitz RI, Feinstein AR. Alternative analytic methods for case-control studies of estrogens and endometrial cancer. New Engl J Med. 1978;299:1089–1094.
22. Greenland S, Neutra R. An analysis of detection bias and proposed corrections in the study of estrogens and endometrial cancer. J Chron Dis. 1981;34:433–438.
23. Ding P, VanderWeele TJ. Sharp sensitivity bounds for mediation under unmeasured mediator-outcome confounding. Biometrika. 2016;103:483–490.
24. Smith LH, VanderWeele TJ. Mediational E-values: Approximate sensitivity analysis for unmeasured mediator-outcome confounding. Harvard T.H. Chan School of Public Health; 2018.
25. Lavie CJ, McAuley PA, Church TS, et al. Obesity and cardiovascular diseases. J Am Coll Cardiol. 2014;63:1345–1354.
27. Glymour MM, Vittinghoff E. Selection bias as an explanation for the obesity paradox: Just because it’s possible doesn’t mean it’s plausible. Epidemiology. 2014;25:4–6.
28. Banack HR, Kaufman JS. Does selection bias explain the obesity paradox among individuals with cardiovascular disease? Ann Epidemiol. 2015;25:342–349.
29. Sperrin M, Candlish J, Badrick E, et al. Collider bias is only a partial explanation for the obesity paradox. Epidemiology. 2016;27:525–530.
30. Gruberg L, Weissman NJ, Waksman R, et al. The impact of obesity on the short-term and long-term outcomes after percutaneous coronary intervention: The obesity paradox? J Am Coll Cardiol. 2002;39:578–584.
31. Munafò MR, Tilling K, Taylor AE, et al. Collider scope: When selection bias can substantially influence observed associations. Int J Epidemiol. 2018;47:226–235.
Appendix for Bounding Bias Due to Selection
Louisa H. Smith and Tyler J. VanderWeele
Assume that the causal risk ratio \frac{P(Y_{1}=1)}{P(Y_{0}=1)} is identifiable (perhaps within strata of confounders) as
\text{RR}^{\text{true}}_{AY}=\frac{P(Y=1|A=1)}{P(Y=1|A=0)}\;. |
Assume, however, that we only have access to data in a selected sample, so we are actually estimating
\text{RR}^{\text{obs}}_{AY}=\frac{P(Y=1|A=1,S=1)}{P(Y=1|A=0,S=1)}\;. |
Finally, assume that although Y_{a}\!\perp\!\!\!\perp A, it is not the case that Y_{a}\!\perp\!\!\!\perp A|S=1, so that the observed risk ratio is a biased estimator of the true causal risk ratio in the total population. (When Y_{a}\!\perp\!\!\!\perp A|S=1, the causal risk ratio for the selected population is unbiased and can be estimated in the data, but may differ from the causal effect in the total population due to differences in the distribution of other risk factors for the outcome.)
We define the selection bias factor as
\text{bias}=\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY}}\;. |
Assume that \text{bias }>1. If not, reverse the coding of A (so that when we bound the bias from above, as follows, we are in fact bounding the originally coded bias from below).
Because
\text{RR}^{\text{true}}_{AY}=\frac{P(Y=1|A=1,S=0)P(S=0|A=1)+P(Y=1|A=1,S=1)P(S=1|A=1)}{P(Y=1|A=0,S=0)P(S=0|A=0)+P(Y=1|A=0,S=1)P(S=1|A=0)}\;, |
we have that
\text{bias}\leq\left\{\frac{P(Y=1|A=1,S=1)}{P(Y=1|A=0,S=1)}\right\}/\left\{\frac{\min_{s}P(Y=1|A=1,S=s)}{\max_{s}P(Y=1|A=0,S=s)}\right\} |
=\left\{\frac{P(Y=1|A=1,S=1)}{\min_{s}P(Y=1|A=1,S=s)}\right\}\times\left\{\frac{\max_{s}P(Y=1|A=0,S=s)}{P(Y=1|A=0,S=1)}\right\}\;. | (1) |
We have 4 possibilities for the right-hand side, depending on what values S takes on to maximize and minimize the respective expressions.
Take first the case in which S=0 in both places. This occurs when P(Y=1|A=1,S=1)\geq P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)\leq P(Y=1|A=0,S=0).
Then
\text{bias}\leq\left\{\frac{P(Y=1|A=1,S=1)}{P(Y=1|A=1,S=0)}\right\}\times\left\{\frac{P(Y=1|A=0,S=0)}{P(Y=1|A=0,S=1)}\right\}\;. | (2) |
Now assume that there exists U such that Y\!\perp\!\!\!\perp S|\{A,U\}. We will assume a categorical U with values u=1,2,...,k for ease of notation, but U can also be continuous and/or a vector of random variables.
Since P(Y=1|A=a,S=1,U=u)=P(Y=1|A=a,S=0,U=u)=P(Y=1|A=a,U=u), we can rewrite equation (2):
\text{bias}\leq\left\{\frac{\sum_{u=1}^{k}P(Y=1|A=1,U=u)P(U=u|A=1,S=1)}{\sum_{u=1}^{k}P(Y=1|A=1,U=u)P(U=u|A=1,S=0)}\right\}\times |
\left\{\frac{\sum_{u=1}^{k}P(Y=1|A=0,U=u)P(U=u|A=0,S=0)}{\sum_{u=1}^{k}P(Y=1|A=0,U=u)P(U=u|A=0,S=1)}\right\}\;. |
By Lemma A.3 in Ding and VanderWeele 2016a,^{1} we have that
\frac{\sum_{u=1}^{k}P(Y=1|A=1,U=u)P(U=u|A=1,S=1)}{\sum_{u=1}^{k}P(Y=1|A=1,U=u)P(U=u|A=1,S=0)}\leq\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1} | (3) |
and
\frac{\sum_{u=1}^{k}P(Y=1|A=0,U=u)P(U=u|A=0,S=0)}{\sum_{u=1}^{k}P(Y=1|A=0,U=u)P(U=u|A=0,S=1)}\leq\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1} | (4) |
where
\text{RR}_{UY|(A=1)}=\frac{\max_{u}P(Y=1|A=1,u)}{\min_{u}P(Y=1|A=1,u)} |
\text{RR}_{UY|(A=0)}=\frac{\max_{u}P(Y=1|A=0,u)}{\min_{u}P(Y=1|A=0,u)} |
\text{RR}_{SU|(A=1)}=\max_{u}\frac{P(U=u|A=1,S=1)}{P(U=u|A=1,S=0)} |
\text{RR}_{SU|(A=0)}=\max_{u}\frac{P(U=u|A=0,S=0)}{P(U=u|A=0,S=1)}\;. |
These values can be interpreted as the maximum relative risks comparing any two values of U on Y within strata of A=1 and A=0, respectively; and the maximum factors by which selection increases the prevalence of some value of U within the stratum A=1 and by which non-selection increases the relative prevalence of some value of U within stratum A=0. Substituting (3) and (4) into the rewritten form of (2) gives
\text{bias}\leq\left(\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\right)\times\left(\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1}\right)\;. | (5) |
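Inequality (5) can be checked on a small numerical example. The sketch below uses a binary U and entirely hypothetical probabilities, chosen so that Y depends on S only through A and U; it computes the bias from the definitions above and compares it with the bound.

```python
def mix(p_y_given_u, p_u):
    """P(Y=1 | A=a, S=s) = sum_u P(Y=1 | A=a, U=u) * P(U=u | A=a, S=s)."""
    return sum(py * pu for py, pu in zip(p_y_given_u, p_u))


def bounding_factor(rr_uy, rr_su):
    """RR_UY * RR_SU / (RR_UY + RR_SU - 1), as in inequalities (3) and (4)."""
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


# Hypothetical inputs for a binary U; list index 0 is U=0, index 1 is U=1.
# Y depends on S only through (A, U), so Y is independent of S given {A, U}.
p_y = {1: [0.10, 0.30], 0: [0.10, 0.20]}   # P(Y=1 | A=a, U=u), keyed by a
p_u = {                                    # P(U=u | A=a, S=s), keyed by (a, s)
    (1, 1): [0.4, 0.6], (1, 0): [0.7, 0.3],
    (0, 1): [0.8, 0.2], (0, 0): [0.5, 0.5],
}
p_s1 = {1: 0.5, 0: 0.5}                    # P(S=1 | A=a), keyed by a

# Risks within each (A, S) stratum.
risk = {(a, s): mix(p_y[a], p_u[(a, s)]) for a in (0, 1) for s in (0, 1)}


def total_risk(a):
    """P(Y=1 | A=a), averaging over selected and non-selected strata."""
    return risk[(a, 1)] * p_s1[a] + risk[(a, 0)] * (1 - p_s1[a])


rr_obs = risk[(1, 1)] / risk[(0, 1)]
rr_true = total_risk(1) / total_risk(0)
bias = rr_obs / rr_true

# Sensitivity parameters, following the definitions after inequality (4).
rr_uy = {a: max(p_y[a]) / min(p_y[a]) for a in (0, 1)}
rr_su_a1 = max(sel / non for sel, non in zip(p_u[(1, 1)], p_u[(1, 0)]))
rr_su_a0 = max(non / sel for sel, non in zip(p_u[(0, 1)], p_u[(0, 0)]))

bound = bounding_factor(rr_uy[1], rr_su_a1) * bounding_factor(rr_uy[0], rr_su_a0)
print(f"bias = {bias:.3f}, bound = {bound:.3f}")
```

With these inputs the bias is roughly 1.30 and the bound roughly 2.14, which illustrates that the bound describes the largest bias compatible with the parameters rather than the bias actually induced.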
Now consider the cases in which S, in one or both of the expressions in (1), takes on the value 1. In that case, one or both of the factors in (1) is equal to 1.
If P(Y=1|A=1,S=1)\leq P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)\geq P(Y=1|A=0,S=0) then
\text{bias }\leq 1\;. |
If P(Y=1|A=1,S=1)\geq P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)\geq P(Y=1|A=0,S=0) then
\text{bias }\leq\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\;. |
If P(Y=1|A=1,S=1)\leq P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)\leq P(Y=1|A=0,S=0) then
\text{bias }\leq\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1}\;. |
Because the right-hand side of equation (5) is greater than or equal to the right-hand side of each of the bias inequalities under the other three conditions, it is an upper bound for the bias in every case.
To construct a summary measure for the strength of a given risk ratio against selection bias, we can find the smallest risk ratio implied by the bounding factor that would be sufficient to reduce a given \text{RR}^{\text{obs}}_{AY} to \text{RR}^{\text{true}}_{AY}=1, assuming each of the parameters in the bounding factor were of that same magnitude. Denote that value RR. Then
\text{RR}^{\text{obs}}_{AY}\leq\frac{\text{RR}^{4}}{(2\text{RR}-1)^{2}}\;. |
Solving this inequality for RR shows us that for selection bias to completely explain away \text{RR}^{\text{obs}}_{AY},
\text{RR}_{UY|(A=1)}=\text{RR}_{UY|(A=0)}=\text{RR}_{SU|(A=1)}=\text{RR}_{SU|(A=0)}\geq\sqrt{\text{RR}^{\text{obs}}_{AY}}+\sqrt{\text{RR}^{\text{obs}}_{AY}-\sqrt{\text{RR}^{\text{obs}}_{AY}}}\;. |
In some cases selection may be directly determined by U, so that S=U. Then the ratios defining \text{RR}_{SU|(A=0)} and \text{RR}_{SU|(A=1)} have a zero denominator, so both parameters are unbounded. To bound the bias in such cases we can take the limit of the right-hand side of equation (5) as each \text{RR}_{SU} approaches \infty:
\text{bias }\leq\lim_{\text{RR}_{SU}\to\infty}\left(\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU}-1}\right)\times\left(\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU}-1}\right) |
=\text{RR}_{UY|(A=0)}\times\text{RR}_{UY|(A=1)} |
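The limit can be illustrated numerically. In the sketch below the values of \text{RR}_{UY|(A=1)} and \text{RR}_{UY|(A=0)} are hypothetical, and the function names are labels chosen for this example.

```python
def bounding_factor(rr_uy, rr_su):
    """RR_UY * RR_SU / (RR_UY + RR_SU - 1)."""
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


def s_equals_u_bound(rr_uy1, rr_uy0):
    """Limiting bound when S = U: RR_UY|(A=1) * RR_UY|(A=0)."""
    return rr_uy1 * rr_uy0


rr_uy1, rr_uy0 = 2.0, 1.5  # hypothetical outcome risk ratios
limit = s_equals_u_bound(rr_uy1, rr_uy0)
for rr_su in (10.0, 1e3, 1e6):
    # The general bound increases toward the limit as RR_SU grows.
    general = bounding_factor(rr_uy1, rr_su) * bounding_factor(rr_uy0, rr_su)
    print(f"RR_SU = {rr_su:>9.0f}: general bound = {general:.6f}, limit = {limit:.6f}")
```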
When S=U, if \text{RR}^{\text{true}}_{AY}=1, then
\text{RR}^{\text{obs}}_{AY}\leq\text{RR}_{UY|(A=0)}\times\text{RR}_{UY|(A=1)}\;. |
By the same reasoning as above, if we assume both parameters in the bounding factor are of the same magnitude, then
\text{RR}_{UY|(A=0)}=\text{RR}_{UY|(A=1)}\geq\sqrt{\text{RR}^{\text{obs}}_{AY}}\;. |
When P(Y=1|A=1,S=1)/P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)/P(Y=1|A=0,S=0) are both greater than 1, equation (1) can be rewritten
\text{bias}\leq\frac{P(Y=1|A=1,S=1)}{P(Y=1|A=1,S=0)}\;. |
Result 3A follows from the derivation of Result 1A using only that factor in (1), giving us:
\text{bias}\leq\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}\;. |
Again denote by RR the smallest risk ratio implied by the bounding factor that would be sufficient to reduce a given \text{RR}^{\text{obs}}_{AY} to \text{RR}^{\text{true}}_{AY}=1, assuming each of the parameters in the bounding factor in Result 3A were of that same magnitude. Then
\text{RR}^{\text{obs}}_{AY}\leq\frac{\text{RR}^{2}}{2\text{RR}-1}\;. |
Solving this inequality for RR shows us that for selection bias to completely explain away \text{RR}^{\text{obs}}_{AY},
\text{RR}_{UY|(A=1)}=\text{RR}_{SU|(A=1)}\geq\text{RR}^{\text{obs}}_{AY}+\sqrt{\text{RR}^{\text{obs}}_{AY}\left(\text{RR}^{\text{obs}}_{AY}-1\right)}\;. |
Result 4A immediately follows from Results 2A and 3A.
Result 4B is trivial.
Now assume that the parameter of interest is the causal risk ratio within the selected population, \frac{P(Y_{1}=1|S=1)}{P(Y_{0}=1|S=1)}.
Again we can estimate
\text{RR}^{\text{obs}}_{AY}=\frac{P(Y=1|A=1,S=1)}{P(Y=1|A=0,S=1)} |
from a sample, but since it is not the case that Y_{a}\!\perp\!\!\!\perp A|S=1, the observed risk ratio is again a biased estimator of the causal risk ratio.
Assume, however, that Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}, so that
\text{RR}^{\text{true}}_{AY|(S=1)}=\frac{\sum_{u=1}^{k}P(Y=1|A=1,U=u,S=1)P(U=u|S=1)}{\sum_{u=1}^{k}P(Y=1|A=0,U=u,S=1)P(U=u|S=1)}\;. |
By Result 1 in Ding and VanderWeele 2016b,^{2}
\frac{\text{RR}^{\text{obs}}_{AY}}{\text{RR}^{\text{true}}_{AY|(S=1)}}\leq\frac{\text{RR}_{UY|(S=1)}\times\text{RR}_{AU|(S=1)}}{\text{RR}_{UY|(S=1)}+\text{RR}_{AU|(S=1)}-1} |
where
\text{RR}_{UY|(S=1)}=\max_{a}\frac{\max_{u}P(Y=1|A=a,S=1,U=u)}{\min_{u}P(Y=1|A=a,S=1,U=u)} |
\text{RR}_{AU|(S=1)}=\max_{u}\frac{P(U=u|A=1,S=1)}{P(U=u|A=0,S=1)}\;. |
The analytic form of Result 5A is equivalent to that of Result 3A. It therefore follows that the minimum magnitude of each of the two parameters that make up the bounding factor in Result 5A, assuming they are equal, that would be sufficient to shift a given \text{RR}^{\text{obs}}_{AY} to the null is given by:
\text{RR}_{UY|(S=1)}=\text{RR}_{AU|(S=1)}\geq\text{RR}^{\text{obs}}_{AY}+\sqrt{\text{RR}^{\text{obs}}_{AY}(\text{RR}^{\text{obs}}_{AY}-1)}\;. |
As with the risk ratio, we assume that the causal risk difference P(Y_{1}=1)-P(Y_{0}=1) is identifiable as
\text{RD}^{\text{true}}_{AY}=P(Y=1|A=1)-P(Y=1|A=0)\;. |
We exclude the variables necessary to eliminate confounding from the conditioning statement for ease of notation, but the above could hold conditional on confounders C, in which case assume all probability statements that follow are also conditional on confounders C.
If we only have data from a selected population, we observe
\text{RD}^{\text{obs}}_{AY}=P(Y=1|A=1,S=1)-P(Y=1|A=0,S=1)\;. |
Again we assume that it is not the case that Y_{a}\!\perp\!\!\!\perp A|S=1, so that \text{RD}^{\text{obs}}_{AY} is a biased estimator of the causal risk difference. Now we are concerned with bias on the additive scale:
\text{bias }=\text{RD}^{\text{obs}}_{AY}-\text{RD}^{\text{true}}_{AY}\;. |
Assume that the bias is non-negative; if not, recode the exposure A so that it is.
Because \text{RD}^{\text{true}}_{AY}\geq\min_{s}P(Y=1|A=1,S=s)-\max_{s}P(Y=1|A=0,S=s), we have that
\text{bias }\leq\left[P(Y=1|A=1,S=1)-P(Y=1|A=0,S=1)\right]-\left[\min_{s}P(Y=1|A=1,S=s)-\max_{s}P(Y=1|A=0,S=s)\right]\;. | (6) |
The right-hand side of equation (6) is maximized with S=0 in both conditioning statements, so we will find a bound for the bias under that condition.
We can therefore rewrite (6):
\text{bias }\leq\left[P(Y=1|A=1,S=1)-P(Y=1|A=1,S=0)\right]+\left[P(Y=1|A=0,S=0)-P(Y=1|A=0,S=1)\right]\;. | (7) |
The bias is bounded by the sum of two risk differences representing the association between S and Y within strata of A. To deal with each of them simultaneously we will consider bounding the apparent risk difference for any value A=a and two values of S, s and s^{*}:
\text{RD}^{\text{app}}_{SY}=P(Y=1|A=a,S=s)-P(Y=1|A=a,S=s^{*})\;. | (8) |
(This risk difference is never actually observed because we have no data for the stratum S=0, which must be either s or s^{*}.)
Assume there exists U such that P(Y=1|A=a,S=s,U=u)-P(Y=1|A=a,S=s^{*},U=u)=0 for all values u, or equivalently Y\!\perp\!\!\!\perp S|\{A,U\}. In other words, conditioning on U is sufficient to eliminate the apparent association between S and Y. It therefore also eliminates the selection bias, because the extent to which \text{RD}^{\text{app}}_{SY} is non-zero (for each value of A) is essentially the extent of the bias due to selection. We will denote the risk difference conditional on U as \text{RD}^{\text{true}}_{SY}.
Because \text{RD}^{\text{true}}_{SY}=0, a bound for \text{RD}^{\text{app}}_{SY}-\text{RD}^{\text{true}}_{SY} is also a bound for \text{RD}^{\text{app}}_{SY}.
We can use results from Ding and VanderWeele 2016b^{2} to bound \text{RD}^{\text{app}}_{SY}. From their results we have that
\frac{P(Y=1|A=a,S=s)}{P(Y=1|A=a,S=s^{*})}\leq\frac{\text{RR}_{UY|(A=a)}\times\text{RR}_{SU|(A=a)}}{\text{RR}_{UY|(A=a)}+\text{RR}_{SU|(A=a)}-1} | (9) |
where
\text{RR}_{UY|(A=a)}=\frac{\max_{u}P(Y=1|A=a,u)}{\min_{u}P(Y=1|A=a,u)} |
and
\text{RR}_{SU|(A=a)}=\max_{u}\frac{P(U=u|A=a,S=s)}{P(U=u|A=a,S=s^{*})}\;. |
Rearranging (9) shows us that
\text{RD}^{\text{app}}_{SY}\leq P(Y=1|A=a,S=s^{*})\times\frac{\text{RR}_{UY|(A=a)}\times\text{RR}_{SU|(A=a)}}{\text{RR}_{UY|(A=a)}+\text{RR}_{SU|(A=a)}-1}-P(Y=1|A=a,S=s)\times\frac{\text{RR}_{UY|(A=a)}+\text{RR}_{SU|(A=a)}-1}{\text{RR}_{UY|(A=a)}\times\text{RR}_{SU|(A=a)}}\;. |
Returning to equation (7), we now can replace each of the apparent risk differences with their bounds, which will be an overall bound for the bias:
\text{bias }\leq P(Y=1|A=1,S=0)\times\text{BF}_{1}-P(Y=1|A=1,S=1)/\text{BF}_{1}+P(Y=1|A=0,S=1)\times\text{BF}_{0}-P(Y=1|A=0,S=0)/\text{BF}_{0} |
where
\text{BF}_{1}=\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1} |
and
\text{BF}_{0}=\frac{\text{RR}_{UY|(A=0)}\times\text{RR}_{SU|(A=0)}}{\text{RR}_{UY|(A=0)}+\text{RR}_{SU|(A=0)}-1} |
with the RR parameters defined as in section I.
Because the probabilities conditional on S=0 are not generally observed, we can replace them with the extreme values that maximize the right-hand side, setting P(Y=1|A=1,S=0) to 1 and P(Y=1|A=0,S=0) to 0, to obtain the bound:
\text{bias }\leq\text{BF}_{1}-P(Y=1|A=1,S=1)/\text{BF}_{1}+P(Y=1|A=0,S=1)\times\text{BF}_{0}\;. | (10) |
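A minimal sketch of equation (10) follows, alongside the preceding bound that still involves the unobserved S=0 risks. The function and argument names are hypothetical, as are the numerical inputs in the demonstration.

```python
def bounding_factor(rr_uy, rr_su):
    """RR_UY * RR_SU / (RR_UY + RR_SU - 1)."""
    return rr_uy * rr_su / (rr_uy + rr_su - 1)


def rd_bound_with_unselected(p11, p10, p01, p00, bf1, bf0):
    """Bound before substituting extremes; pAS denotes P(Y=1 | A, S).

    Requires the risks in the non-selected stratum (p10, p00), which are
    not generally observed.
    """
    return p10 * bf1 - p11 / bf1 + p01 * bf0 - p00 / bf0


def rd_bound(p11, p01, bf1, bf0):
    """Equation (10): BF_1 - P(Y=1|A=1,S=1)/BF_1 + P(Y=1|A=0,S=1)*BF_0."""
    return bf1 - p11 / bf1 + p01 * bf0


if __name__ == "__main__":
    # Hypothetical observed risks and sensitivity parameters.
    p11, p01 = 0.22, 0.12
    bf1 = bounding_factor(3.0, 2.0)
    bf0 = bounding_factor(2.0, 2.5)
    print(f"bound on RD bias (equation 10): {rd_bound(p11, p01, bf1, bf0):.3f}")
```

Equation (10) is never smaller than the version that uses the S=0 risks, since it sets P(Y=1|A=1,S=0) to its largest possible value and P(Y=1|A=0,S=0) to its smallest.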
As in section I, when S=U, we can take the limit of equation (10) as each of the \text{RR}_{SU} terms in \text{BF}_{1} and \text{BF}_{0} approaches \infty:
\text{bias }\leq\lim_{\text{RR}_{SU}\to\infty}\text{BF}_{1}-P(Y=1|A=1,S=1)/\text{BF}_{1}+P(Y=1|A=0,S=1)\times\text{BF}_{0} |
=\text{RR}_{UY|(A=1)}-P(Y=1|A=1,S=1)/\text{RR}_{UY|(A=1)}+P(Y=1|A=0,S=1)\times\text{RR}_{UY|(A=0)} |
When P(Y=1|A=1,S=1)-P(Y=1|A=1,S=0) and P(Y=1|A=0,S=1)-P(Y=1|A=0,S=0) are both greater than 0, (6) can be rewritten
\text{bias }\leq P(Y=1|A=1,S=1)-P(Y=1|A=1,S=0)\;. |
Following the derivation of Result 1A, we find that
\text{bias }\leq P(Y=1|A=1,S=0)\times\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}-P(Y=1|A=1,S=1)\times\frac{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}\;. |
In terms of the observable data, we have:
\text{bias }\leq\frac{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}-P(Y=1|A=1,S=1)\times\frac{\text{RR}_{UY|(A=1)}+\text{RR}_{SU|(A=1)}-1}{\text{RR}_{UY|(A=1)}\times\text{RR}_{SU|(A=1)}}\;. |
Combining Results 2A and 3A, we have that
\text{bias }\leq\text{RR}_{UY|(A=1)}-P(Y=1|A=1,S=1)/\text{RR}_{UY|(A=1)}\;. |
When we are concerned with the causal risk difference P(Y_{1}=1|S=1)-P(Y_{0}=1|S=1), we assume that Y_{a}\!\perp\!\!\!\perp A|\{S=1,U\}; that is, conditioning on U is sufficient to eliminate the bias induced by conditioning on the selected population. This condition is equivalent to the requirement in Ding and VanderWeele 2016b^{2} that U suffice to control for confounding, conditional on measured confounders. We can use their results for bounding the causal risk difference under unmeasured confounding as follows.
Define the following for arbitrary U with K levels (for notational simplicity, as in Section I):
\text{RD}_{AY^{+}|S=1}^{\text{true}}=P(Y=1|A=1,S=1)-\sum_{k=1}^{K}P(Y=1|A=0,S=1,U=k)P(U=k|A=1,S=1) |
\text{RD}_{AY^{-}|S=1}^{\text{true}}=\sum_{k=1}^{K}P(Y=1|A=1,S=1,U=k)P(U=k|A=0,S=1)-P(Y=1|A=0,S=1) |
\text{RD}_{AY|S=1}^{\text{true}}=P(A=1|S=1)\times\text{RD}_{AY^{+}|S=1}^{\text{true}}+(1-P(A=1|S=1))\times\text{RD}_{AY^{-}|S=1}^{\text{true}} |
\text{bias }=\text{RD}_{AY}^{\text{obs}}-\text{RD}_{AY|S=1}^{\text{true}} |
\text{BF}_{U}=\frac{\text{RR}_{UY|(S=1)}\times\text{RR}_{AU|(S=1)}}{\text{RR}_{UY|(S=1)}+\text{RR}_{AU|(S=1)}-1} |
where the parameters in \text{BF}_{U} are defined as in Section I and \text{RD}_{AY}^{\text{obs}} as from Result 1A in Section II.
Because
\text{RD}_{AY|S=1}^{\text{true}}\geq\min\left(\text{RD}_{AY^{+}|S=1}^{\text{true}},\text{RD}_{AY^{-}|S=1}^{\text{true}}\right)\;, |
we have that
\text{bias }\leq\max\left(\text{RD}_{AY}^{\text{obs}}-\text{RD}_{AY^{+}|S=1}^{\text{true}},\text{RD}_{AY}^{\text{obs}}-\text{RD}_{AY^{-}|S=1}^{\text{true}}\right)\;. |
Using the lower bounds for the causal risk differences from Ding and VanderWeele 2016b,^{2} we have that
\text{RD}_{AY}^{\text{obs}}-\text{RD}_{AY^{+}|S=1}^{\text{true}}\leq P(Y=1|A=0,S=1)\times(\text{BF}_{U}-1) |
and
\text{RD}_{AY}^{\text{obs}}-\text{RD}_{AY^{-}|S=1}^{\text{true}}\leq P(Y=1|A=1,S=1)\times(1-1/\text{BF}_{U})\;. |
Therefore,
\text{bias }\leq\max\left(P(Y=1|A=0,S=1)\times(\text{BF}_{U}-1),P(Y=1|A=1,S=1)\times(1-1/\text{BF}_{U})\right)\;. |
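The final bound for the selected population can be evaluated as in the sketch below. The function and argument names are hypothetical, and the inputs in the demonstration are illustrative values rather than data from any study.

```python
def bounding_factor(rr_uy, rr_au):
    """BF_U = RR_UY|(S=1) * RR_AU|(S=1) / (RR_UY|(S=1) + RR_AU|(S=1) - 1)."""
    return rr_uy * rr_au / (rr_uy + rr_au - 1)


def rd_bound_selected(p_y_a1_s1, p_y_a0_s1, rr_uy, rr_au):
    """Bound on the risk difference bias in the selected population.

    Returns max(P(Y=1|A=0,S=1) * (BF_U - 1), P(Y=1|A=1,S=1) * (1 - 1/BF_U)).
    """
    bf_u = bounding_factor(rr_uy, rr_au)
    return max(p_y_a0_s1 * (bf_u - 1), p_y_a1_s1 * (1 - 1 / bf_u))


if __name__ == "__main__":
    # Hypothetical observed risks among the selected and sensitivity parameters.
    bound = rd_bound_selected(p_y_a1_s1=0.22, p_y_a0_s1=0.12, rr_uy=2.0, rr_au=2.0)
    print(f"bound on RD bias in the selected population: {bound:.3f}")
```

When both parameters equal 1 the bounding factor is 1 and the bound is 0, consistent with no bias when U is unrelated to the outcome and the exposure among the selected.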