The choice of effect measure for binary outcomes: Introducing counterfactual outcome state transition parameters
Abstract
Standard measures of effect, including the risk ratio, the odds ratio, and the risk difference, are associated with a number of well-described shortcomings, and no consensus exists about the conditions under which investigators should choose one effect measure over another. In this paper, we introduce a new framework for reasoning about choice of effect measure by linking two separate versions of the risk ratio to a counterfactual causal model. In our approach, effects are defined in terms of “counterfactual outcome state transition parameters”, that is, the proportion of those individuals who would not have been a case by the end of follow-up if untreated, who would have responded to treatment by becoming a case; and the proportion of those individuals who would have become a case by the end of follow-up if untreated who would have responded to treatment by not becoming a case. Although counterfactual outcome state transition parameters are generally not identified from the data without strong monotonicity assumptions, we show that when they stay constant between populations, there are important implications for model specification, meta-analysis, and research generalization.
1 Background
Causal effects often differ between groups of people. Consequently, investigators are often required to reason carefully about which measures of effect, if any, can be expected to remain homogeneous between different populations or different subgroups. Investigator beliefs about effect homogeneity have important implications both for model specification and for the choice of summary metric in meta-analysis [1]. It is commonly believed that the risk ratio is a more homogeneous effect measure than the risk difference, but recent methodological discussion has questioned the evidence for this conventional wisdom [2, 3].
Approaches to effect homogeneity based on standard effect measures have several noteworthy shortcomings, summarized for reference in Table 1:

Approaches that assume equality of the risk ratio or the risk difference may make predictions outside the bounds of valid probabilities.

The odds ratio and the risk ratio have “zero-constraints”: if the baseline risk in a population is either 0 or 1, approaches based on assuming that the targeted odds ratio is equal to the odds ratio in a referent population will necessarily result in concluding that the exposure has no effect [4].

If the risk difference, risk ratio, or odds ratio are equal across different populations, then the proportion that responds to treatment in a given population is required to be a function of that population’s baseline risk. [5]

The odds ratio is not collapsible; i.e., the marginal value of the odds ratio may not be equal to a weighted average of the stratum-specific odds ratios under any weighting scheme, even in the absence of confounding and other forms of structural bias [6].

The predictions of the risk ratio model are not symmetric to whether the parameter is based on the probability of having the event or the probability of not having the event, i.e., whether we “count the living or the dead” [7].
In addition to the five points discussed above, we note that no generally applicable biological mechanism has been proposed that would guarantee that the risk ratio, the risk difference, or the odds ratio stays constant between different populations. For earlier discussion of how biological mechanisms relate to the choice of effect measure, we point the reader to Siemiatycki and Thomas (1981) [8] and Thompson (1991) [9].
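Table 1. Shortcomings of approaches to effect homogeneity, by effect measure.

| Shortcoming | Risk Difference | Risk Ratio | Odds Ratio | COST Parameters |
|---|---|---|---|---|
| Predicts invalid probabilities | Yes | Yes | No | No |
| Zero-constraints | No | Yes | Yes | No |
| Baseline risk dependence | Yes | Yes | Yes | No |
| Non-collapsibility | No | No | Yes | No |
| Asymmetry to outcome variable | No | Yes | No | No |
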
In this paper, we are interested in the effect of binary treatment $A$ (e.g., a drug) on binary outcome $Y$ (e.g., a side effect) in two separate populations $s$ and $t$, where $P$ indicates the population to which an individual belongs. In all examples, we have randomized trial evidence for the average causal effect of the treatment in study population $s$, and wish to predict the effect of introducing the treatment in the target population $t$, in which the drug is not available and in which we can only collect observational data. Counterfactuals will be denoted using superscripts. For example, $Y^{a=0}$ is an indicator for whether an individual would have experienced side effect $Y$ if, possibly contrary to fact, she did not initiate treatment with drug $A$. Because we have a randomized trial in study population $s$, the average baseline risk, $\Pr(Y^{a=0}=1 \mid P=s)$, and the average risk under treatment, $\Pr(Y^{a=1}=1 \mid P=s)$, are identified from the data. Since treatment is not available in target population $t$, the average baseline risk, $\Pr(Y^{a=0}=1 \mid P=t)$, is also identified from the data. Our goal is to use this information, in combination with some plausible assumption about effect homogeneity, to predict the average risk under treatment in the target population, $\Pr(Y^{a=1}=1 \mid P=t)$ (see footnote 1).

Footnote 1: We note that considerations other than effect heterogeneity may also be relevant to the choice of effect measure. In particular, while decision-making would benefit from information on both $\Pr(Y^{a=0}=1)$ and $\Pr(Y^{a=1}=1)$ in order to weigh the costs and benefits of the intervention across different outcomes, it may be sufficient to know the risk difference (RD), $\Pr(Y^{a=1}=1) - \Pr(Y^{a=0}=1)$, for example if the decision-maker is neither risk-averse nor risk-seeking over the relevant outcome (that is, if their preferences can be represented by a social welfare function which increases linearly with the number of averted incident cases). This has been used as an argument in favor of measuring effects on the additive scale. Others have argued that one should give preference to summary statistics which are readily understood by the public, and that inverse measures of effect such as the number needed to treat (NNT) are better suited for this purpose. The focus of the current paper is on situations where the investigators must present a summary measure of an effect in a study population, possibly conditional on a set of effect modifiers, for potential use across a range of different target populations. In such situations, generalizability is prioritized over these other considerations, at least concerning the choice of initial effect measure.
Briefly, we recall that VanderWeele (2012) [10] defined two separate types of effect heterogeneity: effect modification in distribution (where the distributions of counterfactual variables vary across populations) and effect modification in measure (where a particular effect measure varies across populations). Here, we will propose a novel approach to effect homogeneity based on a new class of nonidentified effect parameters, and show how this framework can sometimes be used to reason about homogeneity in terms of standard, identifiable measures of effect such as the risk difference and the risk ratio. Specifically, we will consider two risk ratios, which are equivalent to the two risk ratios considered by Deeks (2012) [4], and defined as follows:

$$RR(-) = \frac{\Pr(Y^{a=1}=1)}{\Pr(Y^{a=0}=1)}, \qquad RR(+) = \frac{\Pr(Y^{a=1}=0)}{\Pr(Y^{a=0}=0)} = \frac{1-\Pr(Y^{a=1}=1)}{1-\Pr(Y^{a=0}=1)}$$

We refer to $RR(-)$ as the standard risk ratio and to $RR(+)$ as the recoded risk ratio.
This paper is organized as follows. In part 2, we describe a scenario to illustrate the motivation for expanding upon existing effect parameters. In part 3, we introduce counterfactual outcome state transition (COST) parameters, and propose a definition of effect homogeneity based on these parameters. In part 4, we discuss the conditions under which COST parameters are identified from the data, and the implications of violations of these conditions. In part 5, we show that the COST parameters are not symmetric to the coding of the exposure variable, and discuss the implications of this observation. In part 6, we link COST parameters to substantive knowledge by providing examples of biological processes that result in effect homogeneity. In part 7, we discuss the empirical implications of our model. We conclude in part 8.
2 Motivating Example
Suppose a team of investigators have data from a randomized trial where the risk of a particular side effect was 2% under no treatment and 3% under treatment. They wish to predict the effect in a different target population, where the baseline risk of the side effect is 10% (Table 2). For simplicity, we postulate that this treatment does not prevent the outcome from occurring in any individual (“monotonic effect”); this assumption will be discussed in detail later in the paper.
While discussing their data analysis plan, the first investigator postulates that the standard risk ratio $RR(-)$ is equal between these two populations. Under this assumption, he estimates that 15% of the target population will experience the side effect under treatment. However, a second investigator believes that the recoded risk ratio $RR(+)$, not $RR(-)$, might be equal between the two populations. Under this assumption, he estimates that 10.9% of the target population will experience the side effect under treatment.
A third investigator notes that neither the first nor the second investigator has substantive arguments for choosing the standard risk ratio $RR(-)$ versus the recoded risk ratio $RR(+)$. However, she realizes that the group at risk of being adversely affected by treatment is the 90% of the target population who were originally destined not to experience the side effect. She points out that the first investigator’s assumption (i.e., that $RR(-)$ was equal between the populations) results in a prediction that among people who are at risk of being adversely affected by the treatment, the proportion who respond is much higher in the target population than in the study population, simply because the baseline risk is higher.
Given this, the third investigator instead suggests assuming that a specific probability is constant across the populations: the probability that a person who was previously not destined to experience the outcome does not experience the outcome in response to treatment. This probability can be computed in the trial as $0.97/0.98 \approx 0.99$ and applied to the 90% who were originally destined to be unaffected in the target population. In this case, the third investigator’s approach results in the exact same estimate as the second investigator’s approach: an estimated 10.9% of the target population will experience the side effect under treatment.
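Table 2. Predicted risk under treatment in the target population ($P=t$) when each effect measure, estimated in the study population ($P=s$), is assumed to be homogeneous.

| Measure assumed homogeneous | $RR(-)$ | $RR(+)$ | $RD$ | $OR$ |
|---|---|---|---|---|
| $\Pr(Y^{a=0}=1 \mid P=s)$ | 2% | 2% | 2% | 2% |
| $\Pr(Y^{a=1}=1 \mid P=s)$ | 3% | 3% | 3% | 3% |
| Effect | $RR(-)=1.5$ | $RR(+)=0.99$ | $RD=0.01$ | $OR=1.515$ |
| $\Pr(Y^{a=0}=1 \mid P=t)$ | 10% | 10% | 10% | 10% |
| Predicted $\Pr(Y^{a=1}=1 \mid P=t)$ | 15% | 10.9% | 11% | 14.4% |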
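The arithmetic behind the four predictions can be sketched as follows. This is only an illustration of the motivating example; the function name `predict_target_risk` and its argument names are hypothetical and not part of the paper.

```python
def predict_target_risk(p0_s, p1_s, p0_t):
    """Predict the risk under treatment in the target population under four
    alternative homogeneity assumptions.

    p0_s, p1_s: risk under no treatment / treatment in the study population
    p0_t:       baseline risk in the target population
    (all names are illustrative assumptions)
    """
    rr_minus = p1_s / p0_s                      # standard risk ratio
    rr_plus = (1 - p1_s) / (1 - p0_s)           # recoded risk ratio
    rd = p1_s - p0_s                            # risk difference
    odds_ratio = (p1_s / (1 - p1_s)) / (p0_s / (1 - p0_s))
    odds_t = odds_ratio * p0_t / (1 - p0_t)     # target odds under treatment
    return {
        "RR(-)": rr_minus * p0_t,
        "RR(+)": 1 - rr_plus * (1 - p0_t),
        "RD": p0_t + rd,
        "OR": odds_t / (1 + odds_t),
    }


predictions = predict_target_risk(p0_s=0.02, p1_s=0.03, p0_t=0.10)
for measure, risk in predictions.items():
    print(f"{measure}: {risk:.1%}")
```

Up to rounding, the printed values correspond to the predictions reported for the four investigators' assumptions.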
Variations of the third investigator’s arguments have arisen independently multiple times in the literature, dating back more than half a century [7, 4], but these recommendations have rarely been translated into common practice. Throughout the rest of this paper, we will formalize this line of reasoning in a counterfactual causal model, in order to explore its scope, limits and implications.
3 Definition of the Counterfactual Outcome State Transition Parameters
We define “counterfactual outcome state transition” (COST) parameters based on the probability that a person who becomes a case if untreated remains a case if treated, and on the probability that a person who does not become a case if untreated remains a non-case if treated. We also offer interpretations of these quantities in terms of the four deterministic response types [11] (Table 3). A list of parameters considered in this paper is shown in Table 4.
Definition 1.
$G$ is defined as the probability of being a case if treated, among those who would have been a case if untreated: $G = \Pr(Y^{a=1}=1 \mid Y^{a=0}=1)$.
Definition 2.
$H$ is defined as the probability of not being a case if treated, among those who would not have been a case if untreated: $H = \Pr(Y^{a=1}=0 \mid Y^{a=0}=0)$.
In a deterministic model, $G$ can be interpreted as the proportion who are “Doomed”, among those who are either “Doomed” or “Preventative”, and $H$ can be interpreted as the proportion who are “Immune”, among those who are either “Immune” or “Causal”. We will refer to $G$ and $H$ as the COST parameters for introducing treatment. $G$ and $H$ can be indexed for a specific population with subscripts (e.g., $G_s$ is the parameter in the study population $s$). If the COST parameters for introducing treatment are equal between two populations (i.e., if $G_s = G_t$ and $H_s = H_t$), this can equivalently be written as $Y^{a=1} \perp P \mid Y^{a=0}$, where $P$ indicates the population.
Similar conditions were considered in different contexts by Gechter (2015) [12] and by Athey and Imbens (2006) [13]. In our setting of binary outcomes, this definition of effect homogeneity addresses all previously discussed limitations of the standard effect measures: the underlying parameters have no baseline risk dependence, do not produce logically impossible results, are collapsible over arbitrary baseline covariates, and have no zero-constraints. Further, if the coding of the outcome is altered, the only consequence is that the values of $G$ and $H$ are reversed (see Appendix 1).
Table 3. Deterministic response types.

| Type of individual | Description of response type | Potential outcomes |
|---|---|---|
| Type 1 | No effect of treatment (individual Doomed to get the disease with respect to exposure) | $Y^{a=0}=1$, $Y^{a=1}=1$ |
| Type 2 | Exposure Causative (individual susceptible to exposure) | $Y^{a=0}=0$, $Y^{a=1}=1$ |
| Type 3 | Exposure Preventative (individual susceptible to exposure) | $Y^{a=0}=1$, $Y^{a=1}=0$ |
| Type 4 | No effect of treatment (individual Immune from disease with respect to exposure) | $Y^{a=0}=0$, $Y^{a=1}=0$ |

Note that, under these definitions, it readily follows that $\Pr(Y^{a=0}=1)$ = Pr(Doomed) + Pr(Preventative) and that $\Pr(Y^{a=1}=1)$ = Pr(Doomed) + Pr(Causal).
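The link between response types and the COST parameters for introducing treatment can be sketched numerically. The function `cost_parameters` and the example proportions below are hypothetical; the proportions are chosen so that the marginal risks match the motivating example (2% untreated, 3% treated) with no “Preventative” individuals.

```python
def cost_parameters(doomed, causal, preventative, immune):
    """Map response-type proportions to marginal risks and to G and H.

    Illustrative helper (not from the paper). Assumes both baseline
    states occur, i.e. doomed + preventative > 0 and immune + causal > 0.
    """
    assert abs(doomed + causal + preventative + immune - 1) < 1e-9
    p0 = doomed + preventative   # risk if untreated: Pr(Doomed) + Pr(Preventative)
    p1 = doomed + causal         # risk if treated:   Pr(Doomed) + Pr(Causal)
    return {
        "p0": p0,
        "p1": p1,
        "G": doomed / (doomed + preventative),  # Doomed among Doomed or Preventative
        "H": immune / (immune + causal),        # Immune among Immune or Causal
    }


params = cost_parameters(doomed=0.02, causal=0.01, preventative=0.0, immune=0.97)

# The risk under treatment can be rebuilt from the baseline risk and (G, H).
rebuilt_p1 = params["G"] * params["p0"] + (1 - params["H"]) * (1 - params["p0"])
print(params, rebuilt_p1)
```

The last line illustrates that the risk under treatment is a function of the baseline risk and the two transition probabilities only.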
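Table 4. Parameters considered in this paper.

| Parameter | Definition | Key conditions for identification | Identifying expression |
|---|---|---|---|
| $RR(-)$ | $\Pr(Y^{a=1}=1)/\Pr(Y^{a=0}=1)$ | No confounding | $\Pr(Y=1 \mid A=1)/\Pr(Y=1 \mid A=0)$ |
| $RR(+)$ | $\Pr(Y^{a=1}=0)/\Pr(Y^{a=0}=0)$ | No confounding | $\Pr(Y=0 \mid A=1)/\Pr(Y=0 \mid A=0)$ |
| $G$ | $\Pr(Y^{a=1}=1 \mid Y^{a=0}=1)$ | No confounding and nonincreasing monotonicity | $RR(-)$ |
| $H$ | $\Pr(Y^{a=1}=0 \mid Y^{a=0}=0)$ | No confounding and nondecreasing monotonicity | $RR(+)$ |
| $I$ | $\Pr(Y^{a=0}=1 \mid Y^{a=1}=1)$ | No confounding and nondecreasing monotonicity | $1/RR(-)$ |
| $J$ | $\Pr(Y^{a=0}=0 \mid Y^{a=1}=0)$ | No confounding and nonincreasing monotonicity | $1/RR(+)$ |

All parameters shown above are defined separately in the study population $s$ and the target population $t$. Whenever we need to clarify the population in which the parameter is defined, subscripts are used (e.g., $G_s$).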
4 Identification of the Counterfactual Outcome State Transition Parameters
Now that we have defined COST parameters and described their motivation, we proceed to discuss when they can be computed from study data. In an ideal randomized trial, we can identify the standard effect measures (RR, RD) without further assumptions beyond those expected to hold by design. Unfortunately, this is not the case for COST parameters: COST parameters are generally not identifiable without further assumptions. Therefore, even if we (somehow) knew that the parameters $G$ and $H$ are equal between two populations, we may not be able to use this fact to predict what happens when we introduce treatment in the target population. We next introduce assumptions that lead to identifiability (Propositions 1, 4), and discuss scientific and clinical implications when we are (Propositions 2, 5) and when we are not (Propositions 3, 6) willing to make those assumptions. The key identifiability assumption is monotonicity [14, 15]: We say there is nonincreasing monotonicity if individuals who do not get the outcome if untreated do not get the outcome if treated: $\Pr(Y^{a=1}=0 \mid Y^{a=0}=0) = 1$. In other words, $H = 1$. Similarly, nondecreasing monotonicity occurs if individuals who get the outcome if untreated also get the outcome if treated, in which case $G = 1$.
Monotonicity is defined in the context of the specific exposure-outcome relationship under consideration, and its plausibility therefore needs to be evaluated on a case-by-case basis. This means that within a trial, we may need to consider the plausibility of monotonicity for each outcome of interest separately. For instance, the antiarrhythmic agent Amiodarone can cause arrhythmia in some individuals and prevent it in others, and therefore does not have monotonic effects for the arrhythmia outcome. However, if the outcome of interest is a side effect such as pulmonary fibrosis, monotonicity may be a viable assumption because the drug is unlikely to prevent any individual from getting pulmonary fibrosis. In general, monotonicity could be a more reasonable approximation in situations where the outcome is a side effect strongly associated with pharmacological treatment.
Proposition 1.
If treatment monotonically reduces the incidence of event $Y$, then $G$ is identified from the data of a randomized trial, and is equal to the standard risk ratio, $RR(-)$.
Proposition 2.
If the counterfactual outcome state transition parameters for introducing treatment are equal between the study population $s$ and the target population $t$, and if treatment monotonically reduces the risk of the outcome, then the standard risk ratio $RR(-)$ in the target population is equal to $RR(-)$ in the study population.
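A small numerical check of Propositions 1 and 2 is sketched below, under the assumption that nonincreasing monotonicity is encoded as $H = 1$. The helper `risk_under_treatment` and the numbers are hypothetical.

```python
def risk_under_treatment(p0, G, H):
    """Risk if treated, given baseline risk p0 and COST parameters (G, H).

    Cases under treatment are baseline cases who remain cases (G * p0)
    plus baseline non-cases who become cases ((1 - H) * (1 - p0)).
    """
    return G * p0 + (1 - H) * (1 - p0)


G, H = 0.6, 1.0  # H = 1: treatment never causes the outcome (illustrative values)

rr_minus = {}
for population, p0 in [("study", 0.02), ("target", 0.10)]:
    p1 = risk_under_treatment(p0, G, H)
    rr_minus[population] = p1 / p0  # standard risk ratio in this population
    print(population, rr_minus[population])
```

With $H = 1$, the standard risk ratio equals $G$ in both populations despite their different baseline risks.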
Proposition 3.
If the counterfactual outcome state transition parameters for introducing treatment are equal between the study population $s$ and the target population $t$, and if treatment reduces the incidence of $Y$ but not monotonically, then the standard risk ratio $RR(-)$ in the study population is a biased estimate of $RR(-)$ in the target population (see footnote 2).

Footnote 2: If there is substantial nonmonotonicity or if the baseline risks differ substantially between the populations, the bias term gets very large, resulting in highly biased estimates which may even be on the wrong side of 1. For example, if $G$ is 0.05 and $H$ is 0.99 in both populations, and the baseline risk is 0.005 in the study population but 0.05 in the target population, then the true risk ratio in the target population is 0.24, but the risk ratio in the study population is 2.04. We therefore caution against using $RR(-)$ as an approximation of $G$ unless it is plausible, in both populations, to expect “near-monotonicity”, defined as $\Pr(Y^{a=0}=1, Y^{a=1}=1) \approx \Pr(Y^{a=1}=1)$. This condition, which ensures that $G$ is approximately equal to $RR(-)$, will be met if the prevalence of the “doomed” response type is much higher than the prevalence of the “causative” response type. In the setting of a randomized trial, this might be a reasonable approximation if there is reason to believe that a great majority of the events in the treatment arm would have occurred even in the absence of treatment. If the baseline risk in the target population is lower than in the study population, then the true risk under treatment in the target population is higher than what would be predicted by assuming that $RR(-)$ is equal between the populations, whereas the opposite holds if the baseline risk is higher in the target population. The magnitude of the bias is a function of the extent of nonmonotonicity and of the ratio of baseline risk between the two populations (see Appendix 2).
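The numerical example in the footnote can be reproduced with a short sketch. The helper name and variable names are hypothetical; the parameter values are those of the example (transition probabilities 0.05 and 0.99, baseline risks 0.005 and 0.05).

```python
def risk_under_treatment(p0, G, H):
    """Risk if treated, given baseline risk p0 and COST parameters (G, H)."""
    return G * p0 + (1 - H) * (1 - p0)


G, H = 0.05, 0.99            # equal in both populations; H < 1 means nonmonotonicity
p0_study, p0_target = 0.005, 0.05

rr_study = risk_under_treatment(p0_study, G, H) / p0_study
rr_target = risk_under_treatment(p0_target, G, H) / p0_target

# Transporting the study risk ratio versus using the true COST-based risk.
naive_prediction = rr_study * p0_target
true_risk = risk_under_treatment(p0_target, G, H)

print(f"RR(-) study:  {rr_study:.2f}")
print(f"RR(-) target: {rr_target:.2f}")
print(f"naive prediction: {naive_prediction:.3f}, true risk: {true_risk:.3f}")
```

Because the baseline risk is higher in the target population here, the transported risk ratio overstates the risk under treatment, in line with the direction of bias described above.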
The following three propositions are exactly symmetric to the preceding results, but apply to situations where exposure increases the risk of the outcome (i.e., monotonicity holds in the other direction). In such situations, we identify $H$ instead of $G$:
Proposition 4.
If treatment monotonically increases the incidence of event $Y$, then $H$ is identified from the data of a randomized trial, and is equal to the recoded risk ratio, $RR(+)$.
Proposition 5.
If the counterfactual outcome state transition parameters for introducing treatment are equal between the study population $s$ and the target population $t$, and if treatment monotonically increases the risk of the outcome, then the recoded risk ratio $RR(+)$ in the study population is equal to $RR(+)$ in the target population.
Proposition 6.
If the counterfactual outcome state transition parameters for introducing treatment are equal between the study population $s$ and the target population $t$, and if treatment increases the incidence of $Y$ but not monotonically, then the recoded risk ratio $RR(+)$ in the study population is a biased estimate of $RR(+)$ in the target population.
Proofs of Propositions 1 through 6 are provided in Appendix 2.
5 Asymmetry to the Coding of the Exposure Variable
So far, we have focused on problems – and resolutions to some problems – related to how the investigator encodes the outcome variable in the database. However, COST parameters are not invariant to the coding of the exposure variable. To illustrate, reconsider the trial where the risk under treatment was 3% and the risk under no treatment was 2%. If we reverse the coding of the exposure variable, we notice that the new exposure variable in fact reduces the risk of the outcome, meaning that a naive application of our approach would suggest using the standard (but now inverted) risk ratio to estimate the probability of being unaffected by treatment. However, there is a subtle but important distinction from the earlier approach: by changing the definition of exposure, we have also changed the meaning of the parameter so that it is now defined as the probability of being unaffected by treatment among those who would have become cases under treatment, rather than among those who would have become cases under no treatment. This can be conceptualized as the effect of removing treatment from a fully treated population.
We will refer to the COST parameters associated with removing treatment as $I$ and $J$. The choice of coding of the exposure variable is then equivalent to choosing whether to model equality of effect based on the parameters $G$ and $H$, or based on the parameters $I$ and $J$. For notational simplicity, we will continue to use the original coding of the exposure variable, and instead frame the question in terms of whether an investigator should define equality of effects based on the parameters $G$ and $H$, or the parameters $I$ and $J$.
Definition 3.
$I$ is defined as the probability of being a case if untreated, among those who would have been a case if treated: $I = \Pr(Y^{a=0}=1 \mid Y^{a=1}=1)$.
Definition 4.
$J$ is defined as the probability of not being a case if untreated, among those who would not have been a case if treated: $J = \Pr(Y^{a=0}=0 \mid Y^{a=1}=0)$.
In a deterministic model, $I$ can be interpreted as the proportion who are “Doomed”, among those who are either “Doomed” or “Causal”, and $J$ can be interpreted as the proportion who are “Immune”, among those who are either “Immune” or “Preventative”. If the COST parameters for removing treatment are equal between two populations (i.e., if $I_s = I_t$ and $J_s = J_t$), this can equivalently be written as $Y^{a=0} \perp P \mid Y^{a=1}$, where $P$ indicates the population. With these definitions, results exactly analogous to Propositions 1 through 6 can be derived for $I$ and $J$.
As implied earlier in this section, one can construct examples to show that equality of COST parameters for introducing treatment does not imply equality of COST parameters for removing treatment. In fact, if the baseline risks differ, then the two homogeneity conditions rarely hold simultaneously (an obvious exception to this would be under the sharp causal null hypothesis). This observation arguably presents a major conceptual challenge to the claim that our definition captures the intuitive idea of “equal effects”: to make our inferences invariant to the coding of the outcome, we have made them dependent on the coding of the exposure. However, in the next section, we show that, with some background knowledge about biological mechanisms, it is possible to reason about whether the COST parameters for introducing treatment are more likely to be homogeneous than the COST parameters for removing treatment. In particular, we show that it is possible to provide models for the data-generating process that guarantee equality of the COST parameters for introducing treatment, equality of the COST parameters for removing treatment, or neither.
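One such example can be sketched numerically. Below, the COST parameters for introducing treatment are held fixed across two populations with different baseline risks, and the COST parameters for removing treatment are derived from the implied response-type proportions. The function `removal_parameters`, its return names, and all numbers are hypothetical.

```python
def removal_parameters(p0, G, H):
    """Derive the COST parameters for removing treatment from the baseline
    risk p0 and the COST parameters (G, H) for introducing treatment."""
    doomed = G * p0
    preventative = (1 - G) * p0
    immune = H * (1 - p0)
    causal = (1 - H) * (1 - p0)
    # Doomed among (Doomed or Causal): case if untreated, given case if treated.
    remain_case = doomed / (doomed + causal)
    # Immune among (Immune or Preventative): non-case if untreated,
    # given non-case if treated.
    remain_noncase = immune / (immune + preventative)
    return remain_case, remain_noncase


G, H = 0.9, 0.95                      # identical in both populations
study = removal_parameters(0.02, G, H)
target = removal_parameters(0.10, G, H)
print("study: ", study)
print("target:", target)

# Under the sharp causal null (G = H = 1), both conditions hold together.
null_study = removal_parameters(0.02, 1.0, 1.0)
null_target = removal_parameters(0.10, 1.0, 1.0)
```

In this sketch the removal parameters differ markedly between the two populations even though the parameters for introducing treatment are identical, and they coincide only under the sharp null.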
6 Biological Knowledge and Equality of Treatment Effects
While it is usually not possible to reason a priori about whether the risk difference or odds ratio are equal between populations, we now provide a simple example to show that it is sometimes possible to reason based on biological knowledge about whether the COST parameters are equal between populations. This is intended only as a proof-of-concept, and the example is purposefully oversimplified in order to illustrate the principles. An outline of a formal treatment of these ideas is provided in Appendix 3.
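Consider a team of investigators who are interested in the effect of antibiotic treatment on mortality in patients with a specific bacterial infection. Since this antibiotic is known to reduce mortality, the investigators need to decide whether to report the standard risk ratio $RR(-)$ as an approximation of $G$ (the probability of being a case if treated, among those who would have been a case if untreated), or alternatively $1/RR(+)$ as an approximation of $J$ (the probability of not being a case if untreated, among those who would not have been a case if treated). In order to ensure external validity, this choice will be determined by their beliefs about which parameter is most likely to be constant across populations.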
The investigators believe that the response to this antibiotic is completely determined by an unmeasured bacterial gene, such that only those who are infected with a bacterial strain with this gene respond to treatment. The prevalence of this bacterial gene is equal between populations, because the populations share the same bacterial ecosystem. If, as seems likely, the investigators further believe that the gene for susceptibility reduces mortality in the presence of antibiotics, but has no effect in the absence of antibiotics, they will conclude that $G$ (the probability of being a case if treated, among those who would have been a case if untreated) may be equal between populations. If, on the other hand, they had concluded that the gene for susceptibility causes mortality in the absence of antibiotics but has no effect in the presence of antibiotics, they would instead expect equality of $J$ (the probability of not being a case if untreated, among those who would not have been a case if treated) across populations.
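The two gene models can be made concrete with a small exact calculation. The sketch below assumes, beyond what the text states, that the gene is distributed independently of the outcome it does not affect; the function names, the gene prevalence `q`, and the risks are hypothetical.

```python
def types_gene_helps_if_treated(p0, q):
    """Model A: the gene (prevalence q) prevents death only under antibiotics.
    Death if untreated has probability p0, independent of the gene."""
    return {"doomed": p0 * (1 - q), "preventative": p0 * q,
            "causal": 0.0, "immune": 1 - p0}


def types_gene_kills_if_untreated(p1, q):
    """Model B: the gene (prevalence q) causes death only without antibiotics.
    Death if treated has probability p1, independent of the gene."""
    return {"doomed": p1, "preventative": (1 - p1) * q,
            "causal": 0.0, "immune": (1 - p1) * (1 - q)}


def stay_case_if_treated(t):      # COST parameter for introducing treatment
    return t["doomed"] / (t["doomed"] + t["preventative"])


def stay_noncase_if_untreated(t):  # COST parameter for removing treatment
    return t["immune"] / (t["immune"] + t["preventative"])


q = 0.4
a_low, a_high = types_gene_helps_if_treated(0.1, q), types_gene_helps_if_treated(0.3, q)
b_low, b_high = types_gene_kills_if_untreated(0.05, q), types_gene_kills_if_untreated(0.2, q)

print("Model A:", stay_case_if_treated(a_low), stay_case_if_treated(a_high))
print("Model B:", stay_noncase_if_untreated(b_low), stay_noncase_if_untreated(b_high))
```

Under model A the parameter for introducing treatment equals $1 - q$ in both populations, while under model B it is the parameter for removing treatment that equals $1 - q$; the other parameter varies with the background risk in each case.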
For many antimicrobial therapies, the microbial genes that determine antibiotic susceptibility generally have functions that are perhaps better approximated by the first approach, which motivates the choice to model the data as if $G$ is equal between the populations. In other situations, in the presence of different subject matter knowledge, similar logic could be used to reach different conclusions. One example of this occurs in pharmacological applications involving adverse reactions to drugs, where it may be reasonable to use the parameter $H$ if the determinants of susceptibility are equally distributed between populations, under certain assumptions about how those determinants interact with the drug. In many realistic applications, it may be more plausible that the COST parameters for introducing treatment are equal than the corresponding parameters for removing treatment, but this is by no means universal: for example, if some humans had retained our ancestors’ ability to synthesize Vitamin C endogenously, then the effect of fresh fruit on scurvy might be better modeled based on the parameter $J$ (the probability of not being a case if untreated, among those who would not have been a case if treated).
In most settings, the particular function of the attribute that determines treatment susceptibility will not be known. In such situations, it is necessary to reason on theoretical grounds about which model is a better approximation of reality. One possible approach to determine whether either type of effect equality is biologically plausible would be to consider how an attribute or gene with the necessary function avoided either reaching fixation or being selected out of existence. For example, a genotype that causes the outcome in the presence of exposure will very likely go extinct in a population where everyone is exposed, but may survive in a proportion of a population where everyone is generally unexposed. This equilibrium may be equal between different groups of people (for example, equal between men and women in the same gene pool). Therefore, if treatment was unavailable in recent evolutionary history, the COST parameters for introducing treatment may be more likely to be equal than the COST parameters for removing treatment. Similarly, an attribute that prevents the outcome in the absence of exposure will quickly reach fixation if everyone is unexposed, but its absence may survive in a small, stable fraction of the population if everyone is exposed. Therefore, if everyone were exposed in recent evolutionary history, the COST parameters for removing treatment may be more likely to be equal than the COST parameters for introducing treatment. Thus, this line of reasoning may provide an additional viable argument for choosing the index level of the exposure variable based on the value which it took by default in recent evolutionary history.
7 Testing for Heterogeneity
If we believe that the COST parameters for introducing treatment are equal across populations, an empirical implication is that meta-analysis based on the standard risk ratio $RR(-)$ will be less heterogeneous for exposures which monotonically reduce the incidence of the outcome, whereas meta-analysis based on the recoded risk ratio $RR(+)$ will be less heterogeneous for exposures which monotonically increase the incidence of the outcome.
However, this observation is complicated by an additional asymmetry associated with the ratio scale: If the outcome is rare (which is usually the case), $\Pr(Y^{a=0}=0)$ and $\Pr(Y^{a=1}=0)$ will both be close to 1, and the recoded risk ratio $RR(+)$ will therefore also be close to 1, even if treatment has a substantial effect. This results in a compression of the scale, such that when heterogeneity is measured in terms of the absolute differences between effect sizes, clinically meaningful heterogeneity between populations will only be apparent at the second or even third decimal place. In contrast, heterogeneity on the scale of the standard risk ratio $RR(-)$ generally manifests itself at the first decimal. Therefore, any attempt to measure heterogeneity based on the absolute difference between each study’s estimate and the overall meta-analytic estimate will result in higher values for $RR(-)$ than for $RR(+)$ for rare outcomes, for reasons that arguably say more about the mathematical differences between the scales than about their relative usefulness for summarizing effect sizes.
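The compression of the scale can be illustrated with two hypothetical studies of a rare outcome in which a harmful treatment has genuinely different effects. The function `risk_ratios` and all numbers are assumptions made for illustration; nondecreasing monotonicity is assumed, so that every baseline case remains a case under treatment.

```python
def risk_ratios(p0, stay_noncase):
    """Return (standard risk ratio, recoded risk ratio) for baseline risk p0
    when a fraction `stay_noncase` of baseline non-cases remain non-cases
    under treatment and all baseline cases remain cases."""
    p1 = p0 + (1 - stay_noncase) * (1 - p0)
    return p1 / p0, (1 - p1) / (1 - p0)


p0 = 0.01                                    # rare outcome in both studies
rr_minus_a, rr_plus_a = risk_ratios(p0, 0.99)
rr_minus_b, rr_plus_b = risk_ratios(p0, 0.98)

gap_minus = abs(rr_minus_a - rr_minus_b)     # heterogeneity on the standard scale
gap_plus = abs(rr_plus_a - rr_plus_b)        # heterogeneity on the recoded scale
print(f"difference in standard risk ratio: {gap_minus:.2f}")
print(f"difference in recoded risk ratio:  {gap_plus:.2f}")
```

The same underlying difference between the two studies appears roughly a hundred times larger on the standard scale than on the recoded scale, which is why absolute differences between estimates are a poor basis for comparing heterogeneity across the two scales.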
One option may be to quantify each study’s deviation from the common effect in terms of the lowest possible proportion of individuals whose outcome variable must be “switched” in order for the effect estimate in the study to equal the overall meta-analytic estimate. Another potential approach to this problem is to use the risk difference ($RD$) in place of the recoded risk ratio $RR(+)$ for exposures which increase the incidence of $Y$. In Appendix 4, we show that if the outcome is rare, effect homogeneity on the $RR(+)$ scale implies near-homogeneity on the risk difference scale. This approach is consistent with previous suggestions to consider the “relative benefits and absolute harms” [16] of medical interventions. Investigators using this approach must further keep in mind the potential differences in power between tests for homogeneity on the additive and multiplicative scales [2].
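As a rough sketch of why the rare-outcome result is plausible (an informal illustration under nondecreasing monotonicity, not the formal argument of Appendix 4), write $p_0$ for the baseline risk, a shorthand introduced here, and let $H$ be the probability of remaining a non-case under treatment among baseline non-cases:

```latex
\begin{align*}
\Pr(Y^{a=1}=1) &= p_0 + (1-H)(1-p_0)
  && \text{(all baseline cases remain cases)} \\
RR(+) &= \frac{1-\Pr(Y^{a=1}=1)}{1-p_0} = H \\
RD &= \Pr(Y^{a=1}=1) - p_0 = (1-H)(1-p_0) \approx 1-H
  && \text{when } p_0 \text{ is small}
\end{align*}
```

If $H$ is shared between populations, the risk differences therefore differ only through the factor $(1-p_0)$, which is close to 1 whenever the outcome is rare in both populations.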
8 Conclusion
We have proposed a new approach to considering effect equality across populations, which avoids several well-established shortcomings of definitions based on standard effect measures. Our approach distinguishes equality of the effect of introducing the treatment to a fully untreated population from equality of the effect of removing the treatment from a fully treated population, and therefore requires investigators to reason carefully about the distinction between the two homogeneity conditions. Further, we provided examples of biological models that correspond to each form of effect equality. While the utility of our approach is limited to the restricted range of applications where these biological models are a reasonable approximation of reality, we believe such applications could occur with some frequency when studying the effectiveness and safety of pharmaceuticals.
If investigators are willing to assume that the COST parameters for introducing treatment are equal between populations, it follows that the standard risk ratio $RR(-)$ should be used for exposures which monotonically reduce the incidence of the outcome, and that the recoded risk ratio $RR(+)$ should be used for exposures which monotonically increase the risk of the outcome. Thus, the risk ratio will generally be constrained between 0 and 1. If the outcome is rare, the risk difference $RD$ may be used in the place of $RR(+)$ for exposures that increase the incidence of the outcome. If the effect of exposure is not monotonic, the investigator may still choose the risk ratio model based on whether treatment increases or decreases the risk on average; in such situations, the extent of bias will be small if the extent of nonmonotonicity is small, or if the populations have comparable baseline risks. This approximation is highly sensitive to violations of these conditions, and if there is reason to suspect substantial nonmonotonicity, identification is not feasible and investigators may consider using an alternative approach to effect homogeneity [17, 18].
Appendix 1: Properties of COST parameters
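Valid predictions: The COST parameter approach results in predictions for the risk under treatment in the target population, $\Pr(Y^{a=1}=1 \mid P=t)$, that are valid probabilities, i.e., that are contained in the interval [0,1]. The predictions are generally of the form $\Pr(Y^{a=1}=1 \mid P=t) = H \times \Pr(Y^{a=0}=1 \mid P=t) + (1-H) \times 1$. In other words, $\Pr(Y^{a=1}=1 \mid P=t)$ is a weighted average of $\Pr(Y^{a=0}=1 \mid P=t)$ and 1. Since both $\Pr(Y^{a=0}=1 \mid P=t)$ and 1 are contained in [0,1], the prediction is also in this interval.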
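This property can be checked numerically over a grid of parameter values and contrasted with the standard measures. The sketch below uses the general two-parameter form of the prediction (baseline cases who remain cases plus baseline non-cases who become cases), which reduces to the weighted-average form when all baseline cases remain cases; the function name, grid, and numbers are hypothetical.

```python
def predict_with_cost(p0_target, stay_case, stay_noncase):
    """Predicted risk under treatment from the target baseline risk and the
    two COST parameters for introducing treatment."""
    return stay_case * p0_target + (1 - stay_noncase) * (1 - p0_target)


grid = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
tolerance = 1e-12                   # guards against floating-point round-off
all_valid = all(
    -tolerance <= predict_with_cost(p0, g, h) <= 1 + tolerance
    for p0 in grid for g in grid for h in grid
)
print("all COST predictions within [0, 1]:", all_valid)

# By contrast, transporting a risk ratio or risk difference can leave [0, 1].
risk_ratio_prediction = 1.5 * 0.8          # risk ratio 1.5, baseline risk 0.8
risk_difference_prediction = 0.95 + 0.10   # risk difference 0.10, baseline 0.95
print(risk_ratio_prediction, risk_difference_prediction)
```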
No zero-constraints: Here, we show that for any baseline risk in the target population Pr, there exist possible values of the COST parameters such that PrPr. For any baseline risk other than 0, it can easily be seen that such values exist, for example if and . If the baseline risk is zero, such values of the parameters also exist, for example if and
Baseline risk dependence: Here we define a measure of effect to be baseline risk dependent if, in order for the parameter to stay equal between populations, it is necessary that the proportion of individuals who respond to treatment (by experiencing the opposite outcome of what they would have experienced in the absence of treatment) varies with baseline risk. One can easily observe that COST parameters are not affected by such baseline risk dependence: This follows almost by definition, since COST parameters were designed specifically to avoid this form of baseline risk dependence.
Collapsibility: is collapsible if, for any baseline covariates , there exist weights such that . The weights Pr always satisfy this equation; the proof of this is exactly analogous to the corresponding proof for the risk ratio [6]. Analogously, the weights for are Pr, the weights for are Pr and the weights for are Pr.
Symmetry: If the coding of the outcome variable is reversed, the only consequence is that $G$ and $H$ change value. To illustrate this, we will discuss variables and parameters that are defined according to the recoded outcome; these are denoted with a star (i.e. $Y^*$). Recall that we defined $G = \Pr(Y^{a=1}=1 \mid Y^{a=0}=1)$. If we reverse the coding, $G^* = \Pr(Y^{*a=1}=1 \mid Y^{*a=0}=1)$. By replacing $Y^*$ with $1-Y$, this can be written as $\Pr(1-Y^{a=1}=1 \mid 1-Y^{a=0}=1)$, which is equal to $\Pr(Y^{a=1}=0 \mid Y^{a=0}=0) = H$. The same logic can be used to show that $H^* = G$.
Appendix 2: Proofs of Propositions 1–6
In the following proofs, we will simplify the notation by defining , , and .
Proof of Proposition 1.
Note, first, that the risk under no treatment and the risk under treatment can be rewritten in terms of response types:
Moreover, because the response types are mutually exclusive and collectively exhaustive events, it follows that:
Recall that and can equivalently be defined as follows:
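Therefore, under our definitions of G and H, the following relationship holds in any population:
$$\Pr(Y^{a=1}=1) = G \times \Pr(Y^{a=0}=1) + (1-H) \times \Pr(Y^{a=0}=0)$$
Next, if treatment is monotonically protective, $H = 1$. The second term is therefore equal to 0, and it follows that $\Pr(Y^{a=1}=1) = G \times \Pr(Y^{a=0}=1)$, and that $G = \Pr(Y^{a=1}=1) / \Pr(Y^{a=0}=1)$, which is the standard risk ratio. ∎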
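The argument can be checked numerically with a small sketch written in terms of the four response types. It assumes that G is the probability of being a case under treatment among those who would be a case if untreated, and that H is the probability of not being a case under treatment among those who would not be a case if untreated. The labels `doomed`, `helped`, `hurt` and `immune` are hypothetical names for the response types, not notation from the paper.

```python
def cost_from_response_types(doomed, helped, hurt, immune):
    """Return (p0, p1, g, h) from the proportions of the four response types.

    doomed: case with or without treatment
    helped: case only if untreated (treatment prevents the outcome)
    hurt:   case only if treated (treatment causes the outcome)
    immune: never a case
    """
    if abs(doomed + helped + hurt + immune - 1.0) > 1e-9:
        raise ValueError("response-type proportions must sum to 1")
    p0 = doomed + helped          # risk if untreated
    p1 = doomed + hurt            # risk if treated
    if p0 == 0.0 or immune + hurt == 0.0:
        raise ValueError("g or h is undefined when a conditioning event has probability 0")
    g = doomed / (doomed + helped)   # assumed definition of G
    h = immune / (immune + hurt)     # assumed definition of H
    return p0, p1, g, h


# Monotonically protective treatment: nobody is hurt by treatment.
p0, p1, g, h = cost_from_response_types(0.12, 0.08, 0.0, 0.80)
standard_rr = p1 / p0
```

Under these assumptions `h` equals 1 and `g` coincides with `standard_rr`; for a non-monotonic treatment the general relationship `p1 = g * p0 + (1 - h) * (1 - p0)` still holds.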
Proof of Proposition 2.
By the same logic as in Proposition 1  (1)  
By monotonicity  
By equal treatment effects  
By proposition 1 
∎
Proof of Proposition 3.
Define as . Then, if the baseline risk is higher in the target population, and if the baseline risk is lower in the target population. As in the motivating example, our goal is to estimate from information on , and . Let be the estimate of . For a protective treatment, if , we can conclude that the prediction is biased away from the null. Our goal is to show that, if the risk ratio is used to transport the effect, and the effects are equal according to Definition 3, then if . If the treatment effects and are equal between populations s and t , we know that the following relationship holds:
Alternatively, this can be written in terms of and :
The risk ratio that will be estimated in population can be written as
If we use an approach based on assuming that the risk ratio is equal between the populations, we will estimate by as follows
We will now compare with to see which is greater.
Between these two expressions, the terms , and and are shared and cancel. We therefore know that
or, equivalently,
If is positive, if
From the preceding results, we can further derive the bias term , which can be used to graph the amount of bias as a function of the ratio of baseline risks and of the extent of nonmonotonicity. ∎
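A minimal sketch of how such a bias function could be computed is given below. It assumes a treatment that is protective on average, COST parameters G and H (defined as the probabilities of keeping the untreated outcome state among untreated cases and untreated non-cases, respectively) that are equal in populations s and t, and a prediction obtained by carrying the standard risk ratio from s to t. Under those assumptions, straightforward algebra gives a bias of (1 - H) times (k - 1), where k is the ratio of the baseline risk in t to the baseline risk in s. The function names are hypothetical.

```python
def rr_transport_bias(p0_s, p0_t, g, h):
    """Bias of the risk-ratio-based prediction of the treated risk in population t.

    Positive values mean the transported prediction exceeds the true risk.
    Assumes g and h are equal in populations s and t.
    """
    if p0_s <= 0.0:
        raise ValueError("baseline risk in s must be positive")
    p1_s = g * p0_s + (1.0 - h) * (1.0 - p0_s)       # true treated risk in s
    p1_t_true = g * p0_t + (1.0 - h) * (1.0 - p0_t)  # true treated risk in t
    p1_t_hat = (p1_s / p0_s) * p0_t                  # prediction via standard RR
    return p1_t_hat - p1_t_true


def rr_transport_bias_closed_form(p0_s, p0_t, h):
    """Same bias, written in terms of non-monotonicity and the baseline risk ratio."""
    k = p0_t / p0_s
    return (1.0 - h) * (k - 1.0)
```

The closed form makes the two sources of bias explicit: it vanishes when the treatment is monotonically protective (H equal to 1) or when the baseline risks coincide (k equal to 1).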
Proof of Proposition 4.
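As discussed earlier, we know that $\Pr(Y^{a=1}=1) = G \times \Pr(Y^{a=0}=1) + (1-H) \times \Pr(Y^{a=0}=0)$. If treatment monotonically increases the incidence, $G = 1$. We therefore have that $\Pr(Y^{a=1}=1) = \Pr(Y^{a=0}=1) + (1-H) \times \Pr(Y^{a=0}=0)$. Solving this for $H$, we get
$$H = \frac{1 - \Pr(Y^{a=1}=1)}{1 - \Pr(Y^{a=0}=1)} \quad (2)$$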
∎
Proof of Proposition 6.
The logic of the proof of Proposition 6 is exactly analogous to that presented in the context of Proposition 3.
∎
Appendix 3
Consider an unmeasured attribute that interacts with to determine treatment response. We will here outline how background biological knowledge can be encoded as restrictions on the joint distribution of counterfactuals of the type , and show that these restrictions may have implications for effect equality. This will allow us to clarify the link between biology and model choice. For simplicity, we will first consider a situation where response to treatment is fully determined by ; and later outline how this assumption can be relaxed.
For illustration, we will consider an example concerning the effect of treatment with antibiotics (), on mortality (). We will suppose that response to treatment is fully determined by bacterial susceptibility to that antibiotic (). In the following, we will suppose that attribute has the same prevalence in populations and (for example because the two populations share the same bacterial gene pool) and that treatment with has no effect in the absence of . Further, suppose that this attribute is independent of the baseline risk of the outcome (for example, old people at high risk of death may have the same strains of the bacteria as young people at low risk).
In order to get equality of effects between populations and , we need one further condition: If the attribute has no effect on in the absence of but prevents in the presence of , the effect of introducing treatment will be equal between the two populations; whereas if has no effect on in the presence of but causes in the absence of , the two populations will have equality of the effect of removing treatment.
The above is formalized with the following conditions:

1. is equally distributed in populations s and t:
2. has no effect in the absence of : in all individuals
3. (a) has no effect in the absence of : in all individuals
   (b) prevents the outcome in the presence of : in all individuals
   (c) is independent of the baseline risk:
4. (a) has no effect in the presence of : in all individuals
   (b) causes the outcome in the absence of : in all individuals
   (c) is independent of the risk under treatment:

Proposition 7.
If conditions 1, 2 and 3(a–c) hold, then is equal between populations s and t. If conditions 1, 2 and 4(a–c) hold, then is equal between the populations.
Proof of Proposition 7.
(4)  
With the same argument, we can show that . By assumption 1, Pr( and Pr() are equal. ∎
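The intuition behind the first half of Proposition 7 can be sketched numerically for the antibiotics example. The sketch assumes a binary attribute (bacterial susceptibility) with the same prevalence in both populations, independent of the untreated outcome, such that treatment prevents death in everyone who carries the attribute and has no effect in everyone else. It also assumes that the COST parameter of interest is the probability of being a case under treatment among those who would be a case if untreated. The function and key names are hypothetical.

```python
def protective_attribute_model(p0, prevalence):
    """Effect summaries implied by a deterministic protective attribute.

    p0: baseline (untreated) risk of the outcome in the population.
    prevalence: proportion carrying the attribute, independent of the
                untreated outcome (assumption corresponding to condition 3(c)).
    """
    if not 0.0 < p0 <= 1.0 or not 0.0 <= prevalence <= 1.0:
        raise ValueError("need 0 < p0 <= 1 and 0 <= prevalence <= 1")
    helped = p0 * prevalence           # untreated cases saved by treatment
    doomed = p0 * (1.0 - prevalence)   # untreated cases not saved by treatment
    p1 = doomed                        # nobody is harmed by treatment
    return {
        "g": doomed / (doomed + helped),
        "risk_ratio": p1 / p0,
        "risk_difference": p1 - p0,
    }


low_risk = protective_attribute_model(p0=0.1, prevalence=0.7)
high_risk = protective_attribute_model(p0=0.4, prevalence=0.7)
```

In this sketch the two populations share the same value of `g` (one minus the prevalence of the attribute) and the same risk ratio, while their risk differences are not equal.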
Results similar to Proposition 7 can be shown for attributes that are associated with a harmful effect of treatment. This will require conditions for that are comparable to 3(a–c) and 4(a–c). We will refer to the conditions that lead to equality of the parameter as 5(a–c), and the conditions that lead to equality of the parameter as 6(a–c):


5. (a) has no effect in the absence of : in all individuals
   (b) causes the outcome in the presence of : in all individuals
   (c) is independent of the baseline risk:
6. (a) has no effect in the presence of : in all individuals
   (b) prevents the outcome in the absence of : in all individuals
   (c) is independent of the risk under treatment:

We now sketch the outline of an argument for how this extends to situations where there is both a protective and a harmful attribute, which together completely determine whether drug has an effect. Consider the joint counterfactual where is an attribute that is associated with a protective effect of treatment with , and is an attribute associated with a harmful effect of treatment with , and the joint distribution of and is equal between the populations. It will be necessary that the two attributes are coherent with each other, in the sense that either meets conditions 3(a–c) and meets conditions 5(a–c), or that meets conditions 4(a–c) and meets conditions 6(a–c).
Suppose that for any combination of and , treatment with is either ineffective, protective or harmful. For example, an individual may have a strain of the bacterium that is susceptible to treatment ( , but also have a genetic variant that causes a severe, deadly allergic reaction to the drug (). In this case, we may believe that the allergic reaction supersedes the bacterial susceptibility, i.e. that treatment with is harmful for this combination of and .
For any individual, define if the individual belongs to a joint stratum of and such that treatment has no effect (i.e. = in all individuals), if the individual belongs to a joint stratum of and such that in all individuals, and if the individual belongs to a joint stratum of , such that . Because no combination of and is associated with both harmful and preventative effects of , this covers all possibilities.
Using logic similar to the proof for singular attributes, we can show that is equal to Pr, is equal to Pr, and that these are equal between the two populations.
This can further be extended to multifactorial attributes. Consider the joint counterfactual where is a vector of attributes associated with a protective effect of treatment, and is a vector of attributes associated with a harmful effect of treatment. We will suppose that all attributes in either operate according to conditions 3(a–c) or conditions 4(a–c), and that all attributes in operate according to the corresponding conditions 5(a–c) or 6(a–c), such that the conditions for are coherent with the conditions for . This extension will require that for any combination of and , treatment with is either harmful, protective or without effect; if this is believed to be the case, individuals can be assigned to strata of using the same logic as before.
Appendix 4
Proposition 8.
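If the outcome is rare and if $H$ is equal between two populations, then the risk difference is approximately equal between the two populations.
Proof of Proposition 8.
If $H$ is equal between the two populations, the following relationship holds:
$$\frac{1 - \Pr(Y^{a=1}=1 \mid P=s)}{1 - \Pr(Y^{a=0}=1 \mid P=s)} = \frac{1 - \Pr(Y^{a=1}=1 \mid P=t)}{1 - \Pr(Y^{a=0}=1 \mid P=t)} \quad (5)$$
This can be rewritten as:
$$1 - \Pr(Y^{a=1}=1 \mid P=s) - \Pr(Y^{a=0}=1 \mid P=t) + \Pr(Y^{a=1}=1 \mid P=s)\Pr(Y^{a=0}=1 \mid P=t) = 1 - \Pr(Y^{a=1}=1 \mid P=t) - \Pr(Y^{a=0}=1 \mid P=s) + \Pr(Y^{a=1}=1 \mid P=t)\Pr(Y^{a=0}=1 \mid P=s)$$
If the outcome is rare, the product terms on both sides are close to zero:
$$1 - \Pr(Y^{a=1}=1 \mid P=s) - \Pr(Y^{a=0}=1 \mid P=t) \approx 1 - \Pr(Y^{a=1}=1 \mid P=t) - \Pr(Y^{a=0}=1 \mid P=s)$$
This can be rewritten as:
$$\Pr(Y^{a=1}=1 \mid P=s) - \Pr(Y^{a=0}=1 \mid P=s) \approx \Pr(Y^{a=1}=1 \mid P=t) - \Pr(Y^{a=0}=1 \mid P=t)$$
The left side of this expression is RD in population $s$ and the right side is RD in population $t$:
$$RD_s \approx RD_t$$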
∎
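The quality of this approximation can be explored with a short sketch. It assumes a treatment that monotonically increases the incidence of the outcome, so that the risk under treatment equals one minus H times the probability of remaining free of the outcome when untreated, with the same value of H in both populations. The function name and the numerical values are hypothetical.

```python
def risk_difference_given_h(p0, h):
    """Risk difference implied by baseline risk p0 and parameter h.

    Assumes treatment never prevents the outcome, so that
    1 - p1 = h * (1 - p0).
    """
    p1 = 1.0 - h * (1.0 - p0)
    return p1 - p0


# Rare outcome: baseline risks of 1% and 3%, same h in both populations.
rd_rare_s = risk_difference_given_h(0.01, 0.98)
rd_rare_t = risk_difference_given_h(0.03, 0.98)

# Common outcome: baseline risks of 20% and 60%, same h in both populations.
rd_common_s = risk_difference_given_h(0.20, 0.98)
rd_common_t = risk_difference_given_h(0.60, 0.98)
```

In this sketch the two risk differences nearly coincide when the outcome is rare, but differ by a factor of two when the outcome is common, which is why the rare-outcome condition is needed.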
References
 [1] Jonathan J. Deeks and Douglas G. Altman. Effect Measures for Meta-Analysis of Trials with Binary Outcomes. In Systematic Reviews in Health Care: Meta-Analysis in Context: Second Edition, pages 313–335. 2008.
 [2] Charles Poole, Ian Shrier, and Tyler J VanderWeele. Is the Risk Difference Really a More Heterogeneous Measure? Epidemiology, 26(5):714–8, 9 2015.
 [3] Orestis A Panagiotou and Thomas A Trikalinos. Commentary: On Effect Measures, Heterogeneity, and the Laws of Nature. Epidemiology, 26(5):710–3, 9 2015.
 [4] Jonathan J Deeks. Issues in the Selection of a Summary Statistic for Meta-Analysis of Clinical Trials with Binary Outcomes. Statistics in medicine, 21(11):1575–600, 6 2002.
 [5] Edna Schechtman. Odds ratio, relative risk, absolute risk reduction, and the number needed to treat: Which of these should we use? Value in Health, 5(5):431–436, 2002.
 [6] Anders Huitfeldt, Mats Julius Stensrud, and Etsuji Suzuki. On the Collapsibility of Measures of Effect in the Counterfactual Causal Framework. arXiv:1610.00033, 2016.
 [7] Mindel Cherniak Sheps. Shall We Count the Living or the Dead? The New England Journal of Medicine, 259(25):1210–4, 12 1958.
 [8] J Siemiatycki and D C Thomas. Biological models and statistical interactions: an example from multistage carcinogenesis. International journal of epidemiology, 10(4):383–7, 12 1981.
 [9] W D Thompson. Effect modification and the limits of biological inference from epidemiologic data. Journal of clinical epidemiology, 44(3):221–32, 1991.
 [10] Tyler J VanderWeele. Confounding and Effect Modification: Distribution and Measure. Epidemiologic Methods, 1(1):55–82, 8 2012.
 [11] Sander Greenland and James Matthew Robins. Identifiability, Exchangeability, and Epidemiological Confounding. International journal of epidemiology, 15(3):413–9, 9 1986.
 [12] Michael Gechter. Generalizing the Results from Social Experiments: Theory and Evidence from Mexico and India. Working Paper, 2015.
 [13] Susan Athey and Guido W. Imbens. Identification and Inference in Nonlinear Difference-in-Differences Models. Econometrica, 74(2):431–497, 2006.
 [14] Tyler J VanderWeele and James M Robins. Signed Directed Acyclic Graphs for Causal Inference. Journal of the Royal Statistical Society. Series B, Statistical methodology, 72(1):111–127, 1 2010.
 [15] Tyler J VanderWeele and James M Robins. Properties of Monotonic Effects on Directed Acyclic Graphs. Journal of Machine Learning Research, 10:699–718, 2009.
 [16] Paul P Glasziou and Les M Irwig. An Evidence Based Approach to Individualising Treatment. BMJ, 311(7016), 1995.
 [17] Stephen R Cole and Elizabeth A Stuart. Generalizing Evidence from Randomized Clinical Trials to Target Populations: The ACTG 320 Trial. American journal of epidemiology, 172(1):107–15, 7 2010.
 [18] Elias Bareinboim and Judea Pearl. A General Algorithm for Deciding Transportability of Experimental Results. Journal of Causal Inference, 1(1):107–134, 1 2013.
Acknowledgements
The authors thank James Robins for suggesting the relationship between $H$ and the risk difference, Etsuji Suzuki for extensive comments on an earlier draft of the manuscript, and Miguel Hernan, Ryan Seals and Steve Goodman for discussions and for insights that improved the manuscript.
Author Contributions
AH had the original idea, provided the original version of the theorems and proofs, wrote the first draft of the manuscript and coordinated the research project. AG and SAS contributed original intellectual content and extensively restructured and revised the manuscript. All authors approved the final version of the manuscript.
Correspondence
All correspondence should be directed to Anders Huitfeldt at The Meta-Research Innovation Center at Stanford, Stanford University School of Medicine, 1070 Arastradero Road, Palo Alto CA 94303; email: anders@huitfeldt.net
Anonymous feedback is welcomed at http://www.admonymous.com/effectmeasurepaper
Anders Huitfeldt invokes Crocker’s Rules (http://sl4.org/crocker.html) on behalf of all authors for all anonymous and nonanonymous feedback on this manuscript.
Funding
The authors received no specific funding for this work. While this research was conducted, Dr. Huitfeldt was supported by the Meta-Research Innovation Center at Stanford, which is partly funded by a grant from the Laura and John Arnold Foundation. Dr. Goldstein is supported by National Library of Medicine Training Grant T15LMLM007079. Dr. Swanson is supported by DynaHEALTH grant (European Union H2020PHC 2014; 633595).