I-SPEC: An End-to-End Framework for Learning Transportable, Shift-Stable Models
Abstract
Shifts in environment between development and deployment cause classical supervised learning to produce models that fail to generalize well to new target distributions. Recently, many solutions which find invariant predictive distributions have been developed. Among these, graph-based approaches do not require data from the target environment and can capture more stable information than alternative methods which find stable feature sets. However, these approaches assume that the data generating process is known in the form of a full causal graph, which is generally not the case. In this paper, we propose I-SPEC, an end-to-end framework that addresses this shortcoming by using data to learn a partial ancestral graph (PAG). Using the PAG, we develop an algorithm that determines an interventional distribution that is stable to the declared shifts; this subsumes existing approaches, which find stable feature sets that are less accurate. We apply I-SPEC to a mortality prediction problem to show it can learn a model that is robust to shifts without needing upfront knowledge of the full causal DAG.
1 Introduction
One of the primary barriers to the deployment of machine learning models in safety-critical applications is unintended behaviors arising at deployment that were not problematic during model development. For example, predictive policing systems have been shown to be vulnerable to predictive feedback loops that cause them to disproportionately over-patrol certain neighborhoods (Lum and Isaac, 2016; Ensign et al., 2018), and a patient triage model erroneously learned that asthma lowered the risk of mortality in pneumonia patients (Caruana et al., 2015). At the heart of many such unintended behaviors are shifts in environment: changes in the conditions that generated the training data and the deployment data (Subbaswamy and Saria, 2019). An important step for ensuring that models will perform reliably under shifting conditions is for model developers to anticipate failures and train models in a way that addresses likely sources of error.
Consider the study of Zech et al. (2018), who trained a model to diagnose pneumonia from chest X-rays using data from one hospital. Notably, the X-rays contained stylistic features, including inlaid tokens that encoded geometric information such as the X-ray orientation. When they applied the model to data from other hospital locations, they found the model’s performance significantly deteriorated, indicating that it failed to generalize across the shifts between hospitals. In particular, shifts in the distribution of style features occurred due to differences in equipment at different hospital locations and differences in imaging protocols between hospital departments.
More formally, Zech et al. encountered an instance of dataset shift, in which shifts in environment resulted in differing train and test distributions. Typical solutions for addressing dataset shift use samples from the test distribution to reweight training samples during learning (see Quiñonero-Candela et al. (2009) for an overview). In many practical applications, however, there are unknown or multiple possible test environments (e.g., for a cloud-based machine learning service), making it infeasible to acquire test samples. In contrast with reweighting solutions, proactive solutions do not use test samples during learning. For example, one class of proactive solutions is entirely dataset-driven: using datasets from multiple training environments, they empirically determine a stable conditional distribution that is invariant across the datasets (e.g., Muandet et al. (2013); Rojas-Carulla et al. (2018); Arjovsky et al. (2019)). A model of this distribution is then used to make predictions in new, unseen environments.
While the dataset-driven methods find a distribution that is invariant across the training datasets, they do not, in general, provide guarantees about the specific shifts in environment to which the resulting models are stable. This information is crucially important in safety-critical domains where incorrect decision making can lead to failures. In the pneumonia example, suppose we had multiple training datasets which contained shifts in style features due to differing equipment, but, critically, did not contain shifts in protocols between departments. When a dataset-driven method finds a predictive distribution that is invariant across the training datasets, its developers will not know that this distribution is stable to shifts in equipment but is not stable to shifts in imaging protocols. When the resulting model is deployed at a hospital with different imaging protocols (e.g., a different distribution of front-to-back vs. back-to-front X-rays), the model will make (potentially arbitrarily) incorrect predictions, resulting in unanticipated misdiagnoses and disastrous failures.
Alternative methods use graphical representations of the data generating process (DGP) (e.g., causal directed acyclic graphs (DAGs)), letting developers proactively reason about the DGP to specify shifts and provide stability guarantees. One advantage of explicit graph-based methods is that they allow the computation of stable interventional (Subbaswamy et al., 2019b) and counterfactual distributions (Subbaswamy and Saria, 2018); these retain more stable information than conditional distributions, leading to higher accuracy. A primary challenge in applying these approaches, however, is that in large-scale, complex domains it is very difficult to fully specify the graph (i.e., edge adjacencies and directions) from prior knowledge alone. We address this by extending graphical methods for finding stable distributions to partial graphs that can be learned directly from data.
Contributions: A key impediment to the deployment of machine learning is the lack of methods for training models that can generalize despite shifts across training and test environments. Stable interventional distributions estimated from data yield models that are guaranteed to be invariant to shifts (e.g., the modeler can identify upfront which shifts the model is protected against). However, to estimate such distributions, prior approaches require knowledge of the underlying causal DAG or extensive samples from multiple training environments. We propose I-SPEC, a novel end-to-end framework which allows us to estimate stable interventional distributions when we do not have prior causal knowledge of the full graph. To do so, we learn a partial ancestral graph (PAG) from data; the PAG captures uncertainty in the graph structure. Then, we use the PAG to inform the choice of mutable variables, or shifts to protect against. We develop an algorithm that uses the PAG and the set of mutable variables to determine a stable interventional distribution. We prove the soundness of the algorithm and prove that it subsumes existing dataset-driven approaches which find stable conditional distributions. Empirically, we apply I-SPEC to a large, complicated healthcare problem and show that we are able to learn a PAG, use it to inform the choice of mutable variables, and learn models that generalize well to new environments. We also use simulated data to provide insight into when stable models are desirable by examining how shifts of varying magnitude affect the difference in performance between stable and unstable models.
2 Background
The proposed framework, I-SPEC, uses PAGs and interventional distributions, which we briefly overview here.
Notation: Sets of variables are denoted by bold capital letters while their assignments are denoted by bold lowercase letters. The sets of parents, children, ancestors, and descendants in a graph will be denoted by Pa(·), Ch(·), An(·), and De(·), respectively. We will consider prediction problems with observed variables O and target variable Y ∈ O.
Causal Graphs: We assume the DGP underlying a prediction problem can be represented as a causal DAG with latent variables, or equivalently, a causal acyclic directed mixed graph (ADMG) over the observed variables O, which contains directed (→) and bidirected (↔, representing unobserved confounding) edges. A causal mechanism is the functional relationship that generates a child from its (possibly unobserved) parents.
Multiple ADMGs can contain the same conditional independence and ancestral information about the observed variables: consider, for example, an ADMG with edges X → Y and X ↔ Y and an ADMG with the single edge X → Y. However, an ADMG is associated with a unique maximal ancestral graph (MAG) (Richardson and Spirtes, 2002) which represents a set of ADMGs that share this information, with at most one edge between any pair of variables (e.g., the MAG associated with both ADMGs above is X → Y). The problem is that multiple MAGs may contain the same independences (e.g., X → Y and X ← Y are Markov equivalent). Fortunately, a partial ancestral graph (PAG) 𝒫 represents an equivalence class of MAGs, denoted by [𝒫]. 𝒫 and every MAG in [𝒫] have the same adjacencies, but they may differ in the edge marks. An arrowhead (or tail) is present in 𝒫 if that head (or tail) is present in all MAGs in [𝒫]. Otherwise, the edge mark is a circle (∘) and the edge is partially (or non-) directed. The PAG for X → Y, X ← Y, and X ↔ Y is X ∘–∘ Y. PAGs can be learned from data, and are the output of the FCI algorithm (Spirtes et al., 2000).
Because PAGs are partial graphs, we require a few additional definitions. First, a path from X to Y is possibly directed if no arrowhead along the path is directed towards X. In such a path, X is a possible ancestor of Y, and Y is a possible descendant of X. There are two kinds of directed edges in MAGs and PAGs. A directed edge A → B is visible if there is a node C not adjacent to B, such that either there is an edge between C and A that is into A, or there is a collider path between C and A that is into A and every node on the path is a parent of B (Maathuis and Colombo, 2015). Otherwise, the edge is invisible. The importance of visible edges is that if A → B is visible, then there is no unobserved confounder of A and B (i.e., no edge A ↔ B in any ADMG in the equivalence class).
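As a concrete illustration of the possibly-directed-path definition, a PAG's edge marks can be stored per endpoint and the check reduces to scanning the path for an arrowhead pointing back towards the start node. This sketch is ours, not the paper's; the dictionary convention and node names are illustrative.

```python
# Edge marks of a PAG stored per endpoint: marks[(a, b)] is the mark at
# b's end of the edge a-b ('arrow', 'tail', or 'circle').

def is_possibly_directed(path, marks):
    """A path v0, ..., vk is possibly directed (from v0 towards vk) if no
    edge on the path has an arrowhead at its endpoint nearer v0."""
    for u, v in zip(path, path[1:]):
        # marks[(v, u)] is the mark at u, the endpoint nearer the start
        if marks[(v, u)] == 'arrow':
            return False
    return True

# Example PAG fragment over {X, Z, Y}:  X o-> Z o-> Y
marks = {
    ('X', 'Z'): 'arrow', ('Z', 'X'): 'circle',
    ('Z', 'Y'): 'arrow', ('Y', 'Z'): 'circle',
}
```

Under this convention, the path X ∘→ Z ∘→ Y above is possibly directed (so X is a possible ancestor of Y), whereas a path starting with X ←∘ Z is not.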
Interventional Distributions: We now review interventional distributions, which we use to make stable predictions. First, note that the distribution of observed variables in an ADMG factorizes as
P(O) = Σ_u ∏_{O_i ∈ O} P(O_i | Pa(O_i)) ∏_j P(U_j)    (1)
where U are unobserved variables corresponding to the bidirected edges (and Pa(·) in Equation (1) includes unobserved parents). An interventional distribution of the form P(y | do(x), z) is defined in terms of the do(·) operator (Pearl, 2009).
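For illustration (using the two-variable ADMG X → Y with X ↔ Y from above, with a single latent variable U behind the bidirected edge, rather than any graph from the paper's experiments), the factorization and an interventional distribution look like:

```latex
% Observational factorization, Equation (1), for X -> Y with X <-> Y:
P(X, Y) = \sum_{u} P(u)\, P(X \mid u)\, P(Y \mid X, u)

% Intervening with do(x) deletes the mechanism for X:
P(Y \mid do(x)) = \sum_{u} P(u)\, P(Y \mid x, u)
```

Because U is not summed out of Y's mechanism in the same way, P(Y | do(x)) generally differs from the conditional P(Y | x).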
Definition 1 (Causal Identifiability).
For disjoint X, Y, Z ⊆ O, the effect of an intervention do(x) on Y conditional on Z is said to be identifiable from P(O) in a graph G if P(y | do(x), z) is (uniquely) computable from P(O) in any causal model which induces G.
The ID algorithm (Shpitser and Pearl, 2006a,b) takes disjoint sets X, Y, Z and an ADMG G and returns an expression for P(y | do(x), z) in terms of P(O) if the effect is identifiable. Recently, the ID algorithm was extended to PAGs (Jaber et al., 2019b) with the CIDP algorithm, which returns such an expression when the effect is identifiable in every MAG in the equivalence class.
3 Methods
We now present I-SPEC, a framework for finding a stable distribution that is invariant to shifts in environment. I-SPEC works as follows: Given datasets collected from multiple source environments, a user defines a graphical invariance specification by first learning a PAG from the combined datasets (without requiring prior causal knowledge). Then, the user can determine which shifts to protect against by reasoning about the PAG and consulting it regarding shifts that occurred across the datasets. Given the resulting invariance specification (i.e., PAG and shifts to protect against), graphical criteria are used to search for the best-performing stable interventional distribution, which is guaranteed to be invariant to the specified shifts.
The rest of this section is organized as follows: In Section 3.1 we introduce invariance specifications. Next, in Section 3.2 we describe the steps of I-SPEC (Algorithm 1) and prove its correctness. Then, in Section 3.3 we establish the superiority of stable interventional distributions over stable conditional distributions, proving that Algorithm 1 subsumes existing dataset-driven methods. Finally, in Section 3.4 we discuss how I-SPEC can be adapted to settings in which data from only one environment is available.
3.1 Graphical Invariance Specifications
Our goal is to predict accurately in new environments without using test samples. To do so, we need a way to represent the possible environments and how they can differ. For this reason, we will now introduce invariance specs, which are built around a PAG and specify shifts to protect against. They do not require prior causal knowledge and can be learned from data. Given the invariance spec, we show that certain interventional distributions provide stability guarantees to the specified shifts.
First, we formalize the notion of a stable distribution. Stable distributions are the same in all environments, can be learned from the source environment data, and can be applied in the target environment without any adjustment.
Definition 2 (Stable Distribution).
For an environment set ℰ, a distribution is said to be stable if, for any two environments in ℰ corresponding to joint distributions P_i and P_j, the distribution is the same whether computed from P_i or from P_j.
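To make Definition 2 concrete, here is a small simulation of our own (the mechanisms and probabilities are made up): the mechanism generating X shifts across two environments while the mechanism for Y given X is shared, so the conditional P(Y=1 | X=1) is stable but the marginal P(Y=1) is not.

```python
import random

def sample_env(p_x, n, rng):
    """Environment-specific mechanism: X ~ Bernoulli(p_x).
    Shared mechanism: P(Y=1|X=1) = 0.8, P(Y=1|X=0) = 0.2."""
    data = []
    for _ in range(n):
        x = 1 if rng.random() < p_x else 0
        y = 1 if rng.random() < (0.8 if x else 0.2) else 0
        data.append((x, y))
    return data

def p_y1_given_x1(data):
    ys = [y for x, y in data if x == 1]
    return sum(ys) / len(ys)

def p_y1(data):
    return sum(y for _, y in data) / len(data)

rng = random.Random(0)
env1 = sample_env(0.3, 50000, rng)  # P(X=1) = 0.3 in environment 1
env2 = sample_env(0.7, 50000, rng)  # P(X=1) = 0.7 in environment 2

stable_gap = abs(p_y1_given_x1(env1) - p_y1_given_x1(env2))  # near 0
unstable_gap = abs(p_y1(env1) - p_y1(env2))                  # large
```

The conditional distribution, being the same in all environments, can be learned from either dataset and applied in the other without adjustment; the marginal cannot.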
Stable distributions are defined with respect to a set of environments. We develop invariance specifications as a way to represent a set of environments when we do not have prior causal knowledge by generalizing selection diagrams (Pearl and Bareinboim, 2011; Subbaswamy et al., 2019b), a representation that assumes a known causal graph.
Definition 3 (Selection Diagram).
A selection diagram is an ADMG augmented with auxiliary selection variables S such that an edge S → X for S ∈ S denotes that the mechanism that generates X can vary across environments. Selection variables may have at most one child.
A selection diagram describes a set of environments whose DGPs share the same underlying graph structure (i.e., ADMG). Only the causal mechanisms associated with the children of selection variables may differ across environments, usually expressed as distributional shifts in the terms of the factorization of the joint via Equation (1).
Selection diagrams assume that both the full graph (i.e., ADMG) and the shifts (i.e., placement of selection variables) are known, prohibiting their use in complex domains. A natural idea to relax this would be to define a selection PAG, thus allowing for uncertainty in the graphical structure. However, a PAG augmented with selection variables would not technically be a PAG; for example, selection variables could not be used to determine visible edges in the PAG. For this reason, we introduce the notion of a graphical invariance specification (or simply invariance spec), which generalizes selection diagrams.
Definition 4 (Invariance spec).
An invariance spec is a 2-tuple ⟨G, M⟩ consisting of a graphical representation, G, of the DGP and a set of mutable variables, M ⊆ O, whose causal mechanisms are vulnerable to shifts.
When G is a PAG 𝒫, an invariance spec defines a set of environments which share the same underlying graph structure (i.e., ADMG) that is only known up to an equivalence class (the PAG). Now we say the mechanism shifts are associated with the mutable variables (Subbaswamy et al., 2019b). Note that if G is an ADMG, then by augmenting G with selection variables as parents of M we recover a selection diagram. We will only consider the case in which G is a PAG, which means the invariance spec can be learned from data.
Like selection diagrams, invariance specs provide graphical criteria for determining if a distribution is stable. The next result states that a distribution is stable in an invariance spec when the distribution is stable in every selection diagram corresponding to the equivalence class [𝒫].
Proposition 1.
Given an invariance spec ⟨𝒫, M⟩, a distribution is stable if it is stable in every ADMG in the equivalence class [𝒫] augmented with selection variables as parents of M.
Armed with this graphical stability criterion, we now extend a prior result which showed that intervening on the mutable variables yields a stable distribution in ADMGs (Subbaswamy et al., 2019b, Proposition 1). The extension to PAGs is what permits the use of interventional distributions to make stable predictions across the environments represented by an invariance spec, and is thus key to the correctness of I-SPEC.
Proposition 2.
For an invariance spec ⟨𝒫, M⟩ and any conditioning set Z ⊆ O \ ({Y} ∪ M), the interventional distribution P(Y | do(m), z) is stable to shifts in the mechanisms of M.
3.2 I-SPEC Step by Step
We have just defined invariance specs and established that intervening on the mutable variables yields a stable distribution. We are now ready to discuss each step of I-SPEC (Algorithm 1).
1) Learning the Invariance Spec Structure: The first step (Line 1) in creating an invariance spec is to learn the graphical representation of the DGP. We will assume faithfulness: the independences implied by the graph are the only independences in the data distribution. Consider the DGP represented by the ADMG in Fig 1a, in which the goal is to predict the target Y from the observed variables. While every environment (e.g., location) shares this graph (including unseen environments), certain mechanisms may vary across environments. If we knew this full graph, then we could use existing graphical methods (e.g., Subbaswamy et al. (2019b)) to find a stable distribution. However, in practice we usually do not know the full graph.
Instead, we will learn the structure of the DGP from datasets D₁, …, D_k containing observations of O collected in training environments e₁, …, e_k. While this problem has itself been extensively studied (as we discuss in Section 4; Related Work), we will use a simple extension of FCI (Spirtes et al., 2000), a constraint-based structure learning algorithm which learns a PAG over the observed variables. FCI uses conditional independence tests to determine adjacencies and create a graph skeleton, and then uses a set of orientation rules to determine where to place edge marks (Zhang, 2008b). We apply FCI by pooling the datasets, adding the environment indicator E as a variable, and adding the logical constraint that E causally precedes all variables in O (i.e., there can be no edge into E from any V ∈ O).
Suppose we had datasets generated from multiple environments according to the DGP in Fig 1a. Using the pooled FCI variant we described, we would learn the PAG in Fig 1b. The PAG represents an equivalence class of ADMGs (which includes the one in Fig 1a). The circle (∘) edge marks denote structural uncertainty: for each such mark, there is an ADMG in the equivalence class in which the mark is an arrowhead, and an ADMG in the equivalence class in which it is a tail. Despite being a partial graph, the PAG still helps inform decisions about which shifts to protect against, as discussed next.
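Assembling the input for the pooled FCI variant is straightforward; the independence testing and orientation rules themselves are left to an FCI implementation (e.g., Tetrad, which supports forbidding edges into E via background knowledge). This sketch, with hypothetical field names, shows only the pooling step:

```python
def pool_with_env_indicator(datasets):
    """Pool datasets D_1..D_k, tagging each row with an environment
    indicator column E (here simply the dataset's index)."""
    pooled = []
    for e, dataset in enumerate(datasets):
        for row in dataset:
            tagged = dict(row)
            tagged['E'] = e
            pooled.append(tagged)
    return pooled

# Toy datasets from two environments (feature names are illustrative)
d1 = [{'X': 0.1, 'Y': 1}, {'X': 0.4, 'Y': 0}]
d2 = [{'X': 0.9, 'Y': 1}]
pooled = pool_with_env_indicator([d1, d2])
```

The pooled table, together with the constraint that no edge may point into E, is then handed to FCI, which returns the PAG over O ∪ {E}.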
2) Declaring Mutable Variables: Given the graph, to complete the invariance spec we must declare the mutable variables M (Line 2). The graph suggests possible mechanism shifts that occurred across the datasets: the possible children of the environment indicator E (nodes adjacent to E whose shared edge is not into E); Fig 1b marks these for the running example. When there are many possible children of E, an advantage of having an explicit graph is that we can reason about and protect against only the shifts that are most likely to be problematic (vs. defaulting to all possible children). We demonstrate this process on a mortality prediction problem in our experiments.
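Reading the possible children of E off a learned PAG is a simple edge-mark query. The sketch below is ours (the per-endpoint mark convention and the node names are illustrative):

```python
def possible_children(node, marks):
    """Nodes adjacent to `node` whose shared edge is not into `node`
    (i.e., no arrowhead at `node`). marks[(a, b)] is the mark at b's
    end of edge a-b ('arrow', 'tail', or 'circle')."""
    return {a for (a, b), mark in marks.items()
            if b == node and mark != 'arrow'}

# Hypothetical PAG fragment: E o-> X, E o-> W, plus an edge into E
# (which the FCI constraint would forbid; included only to show it
# is excluded by the query).
marks = {
    ('E', 'X'): 'arrow', ('X', 'E'): 'circle',
    ('E', 'W'): 'arrow', ('W', 'E'): 'circle',
    ('Z', 'E'): 'arrow', ('E', 'Z'): 'tail',
}
```

Here `possible_children('E', marks)` returns {X, W}, the candidate set from which the modeler selects M.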
3) Determining a Stable Distribution:
Once we have the invariance spec ⟨𝒫, M⟩, we need to find an identifiable interventional distribution P(Y | do(m), z) that is stable to the specified shifts while maximizing predictive performance.
Unfortunately, searching for the optimal conditioning set that yields a stable, identifiable interventional distribution (Alg 1, Lines 4-8) is, like feature subset selection, NP-hard. Following related methods (e.g., Magliacane et al. (2018); Subbaswamy et al. (2019b)), we consider an exhaustive search over the powerset of the features, but note that many strategies for improving scalability exist, including greedy searches (Rojas-Carulla et al., 2018) or search-space pruning (using, e.g., regularization).
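The exhaustive search can be sketched as follows. Both callbacks are stand-ins of ours: `is_stable_and_identifiable` abstracts the PAG-based check (e.g., via CIDP), and `score` abstracts held-out predictive performance.

```python
from itertools import chain, combinations

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r)
                               for r in range(len(items) + 1))

def best_stable_distribution(features, is_stable_and_identifiable, score):
    """Exhaustive search in the spirit of Alg 1 (Lines 4-8): among all
    conditioning sets Z whose interventional distribution is stable and
    identifiable, keep the highest-scoring one."""
    best, best_score = None, float('-inf')
    for z in map(set, powerset(features)):
        if not is_stable_and_identifiable(z):
            continue
        s = score(z)
        if s > best_score:
            best, best_score = z, s
    return best
```

For example, if conditioning sets containing feature 'C' are unstable and larger sets score higher, the search returns {'A', 'B'} out of the powerset of {'A', 'B', 'C'}.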
Correctness: We can now establish that Algorithm 1 does, in fact, return distributions which are guaranteed to be stable to the specified shifts.
Corollary 3 (Soundness).
If Algorithm 1 returns a distribution, then that distribution is stable to shifts in the mechanisms of M.
3.3 Connection to Datasetdriven Methods
We now show that I-SPEC subsumes the ability of existing dataset-driven approaches to find stable (conditional) distributions.
A prior result provides a sound and complete criterion for the cases in which interventional distributions in PAGs reduce to conditional distributions (Zhang, 2008a, Theorem 30). We adapt this criterion (Line 5) to find stable conditional distributions: cases in which the stable interventional distribution equals an observational conditional distribution. Distributions satisfying Line 5 are exactly the stable distributions that can be found by existing dataset-driven methods.
However, not all identifiable interventional distributions reduce to conditionals; some are instead more complicated functionals of the observational distribution. These can be found using the CIDP algorithm (Jaber et al., 2019a).
Lemma 4.
Algorithm 1 finds all stable conditional (observational) distributions, i.e., every distribution satisfying the criterion in Line 5.
Lemma 5.
Algorithm 1 finds stable distributions that cannot be expressed as conditional observational distributions.
The following is now immediate:
Corollary 6.
Algorithm 1 subsumes methods that find stable conditional (observational) distributions.
3.4 Special Case: Only One Source Dataset
I-SPEC was constructed to take datasets from multiple environments as input to match the input of existing dataset-driven methods that, by default, require this. We briefly note that I-SPEC is easily extensible to the case in which only data from a single environment is available. In this case, there is no environment indicator E and one can simply learn a PAG over O. Specification of the mutable variables must now come from prior knowledge alone, but we note that this is how selection variables are typically placed (Pearl and Bareinboim, 2011). This yields an invariance spec, and stable interventional distributions can be found as before (i.e., Lines 4-8 of Alg 1). While it may be possible to modify other methods to require only one dataset, we believe the extension to this setting is most natural in I-SPEC because it uses an explicit graph.
4 Related Work
Proactively Addressing Dataset Shift: The problem of differing train and test distributions is known as dataset shift (Quiñonero-Candela et al., 2009). Typical solutions assume access to unlabeled samples from the test distribution, which are used to reweight training data during learning (e.g., Shimodaira (2000); Gretton et al. (2009)). However, in many practical applications it is infeasible to acquire test distribution samples. In this paper we consider shifts of arbitrary strength when test samples are not available during learning, though there has been other work on bounded-magnitude distributional robustness when shifts are of a known type and strength (Rothenhäusler et al., 2018; Heinze-Deml and Meinshausen, 2017).
Dataset-driven approaches use datasets from multiple training environments to determine a feature subset (Rojas-Carulla et al., 2018; Magliacane et al., 2018; Kuang et al., 2018) or feature representation (e.g., Muandet et al. (2013); Arjovsky et al. (2019)) that yields a conditional distribution that is invariant across the training datasets. Perhaps most related is Magliacane et al. (2018), whose method uses unlabeled target environment data, though it can be easily adapted to the setting of this paper. Notably, they allow for multiple environment (or “context”) variables, and additionally consider shifts in environment due to a variety of types of interventions. Dataset-driven methods do not require an explicit causal graph, and by default conservatively protect against all shifts they detect across datasets.
In contrast, some works assume explicit knowledge of the underlying graph (i.e., an ADMG) so that users can specify the shifts in mechanisms to protect against. Subbaswamy et al. (2019b) determine stable interventional distributions in selection diagrams (Pearl and Bareinboim, 2011) that can be used for prediction. Under the assumption of linear mechanisms, Subbaswamy and Saria (2018) find a stable feature set that includes counterfactual features. When there are no unobserved confounders, Schulam and Saria (2017) protect against shifts in action policies and consider continuous-time longitudinal settings. I-SPEC allows for unobserved variables and inherits the benefits of using interventional distributions, but relaxes the need for a fully specified graph, instead using a partial graph learned from data.
Causal Discovery Across Multiple Environments: One line of research has focused exclusively on the problem of learning causal graphs using data from multiple environments. These methods could help extend I-SPEC to other settings: for example, methods have been developed to learn a causal graph using data collected from multiple experimental contexts (Mooij et al., 2016; He and Geng, 2016) or non-stationary environments (Zhang et al., 2017). The FCI variant described in Section 3.2 might be viewed as a special case of FCI-JCI (Mooij et al., 2016), which allows for multiple environment/context variables. Triantafillou et al. (2010) consider the problem of learning a joint graph using datasets with different, but overlapping, variable sets. Others have considered local problems, e.g., using invariant prediction to infer a variable's causal parents (Peters et al., 2016; Heinze-Deml et al., 2018) or Markov blanket (Yu et al., 2019) when there are no unobserved confounders.
5 Experiments
We perform two experiments to demonstrate the efficacy of I-SPEC. In our first experiment, we show that we can apply the framework to large, complicated datasets from the healthcare domain. Specifically, we are able to learn a partial graph that provides meaningful insights into both how the variables are related and what shifts occurred across the datasets. We show how these insights can inform the choice of invariance spec, further showcasing the flexibility of the procedure, since we can consider different choices of the mutable variables. We empirically show that I-SPEC finds distributions that generalize well to new environments and produce consistent predictions irrespective of the choice of training environment. In our second experiment, we measure the degree to which the magnitude of shifts in environment affects the difference in performance between stable and unstable models. We use simulated data to create a large number of datasets in order to compare performance under varying shifts. These results confirm that stable models have more consistent performance across shifted environments and that interventional distributions can capture more stable information than conditional distributions.
5.1 Real Data: Mortality Prediction
Motivation and Dataset: Machine learning has been used to predict intensive care unit (ICU) mortality to perform patient triage and identify the most at-risk patients (e.g., Pirracchio et al. (2015)). However, in addition to physiologic features, studies have shown that features related to clinical practice patterns (e.g., ordering frequency of lab tests) are highly predictive of patient outcomes (Agniel et al., 2018). Since these patterns vary greatly by hospital, accurate models trained at one hospital will have highly variable performance at others, which can lead to unreliable and potentially dangerous decisions when deployed (Schulam and Saria, 2017). Therefore, we apply the proposed method to learn an ICU mortality prediction model that is stable to shifts in the mechanisms of such practice-based features and will generalize well to new hospitals. We demonstrate this using data from ICU patients at a large hospital and test the model's ability to generalize to smaller hospitals.
We extract the first 24 hours of ICU patient data from three hospitals in our institution’s network over a two year period.
Determining the Invariance Spec: To determine the invariance spec, we first learned a PAG from the full pooled dataset, using the hospital ID as the environment indicator E. Specifically, using Tetrad we applied the FCI variant described in Section 3.2 with the Degenerate Gaussian likelihood ratio test (Andrews et al., 2019). The learned PAG is given in Appendix D; we describe some aspects of it here to demonstrate its value.
Twelve variables are possible children of E, including physiologic variables such as ‘Age’ and ‘Bicarbonate’, and features associated with clinical practice such as ‘Admit Type’ and ‘Lab Time’. Of the 10 variables adjacent to ‘Mortality’, ‘Age’ is the only parent; the other 9 variables are connected via bidirected edges. The explicit graph makes it easy to reason about the DGP: it tells us that ‘Age’ is a causal factor for mortality (e.g., older patients are more likely to die), while ‘Bicarbonate’ is related to mortality through unobserved common causes (such as an acute underlying kidney condition). The bidirected edge connecting ‘Lab Time’ to ‘Mortality’ indicates a practice-based, non-causal relationship, and the bidirected edge between ‘Admit Type’ and ‘Mortality’ is due to the latent condition that caused the admission and contributes to the risk of mortality.
In this example, if a model is not stable to shifts in practice-pattern-based features, then the predictions it makes will be arbitrarily sensitive to changes in policies between datasets, such as shifts in the times when lab measurements are taken. This sensitivity would render the model unreliable, so we reason that shifts in administrative policies should not affect our mortality risk predictions. In contrast, shifts in physiologic mechanisms may encode clinically relevant changes: if there are differences in the treatments patients receive at different hospitals, this would affect the bicarbonate mechanism, for example. Because these shifts would be clinically meaningful, they should affect the decisions we make, and model predictions should not be invariant to them. Thus, one reasonable invariance spec is to take the mutable variables to be the practice-based features ‘Lab Time’ and ‘Admit Type’. Note the flexibility of this procedure: we are able to consider alternative invariance specs (i.e., different choices of mutable variables) and compare the sensitivity of the resulting solutions.
Baselines/Models: We consider three models that correspond to the three ways a model developer can respond to shifts in environment: ignoring shifts, protecting against all shifts in the datasets, or protecting against some shifts. Our first baseline, an unstable model, ignores shifts and uses all features. Our second baseline conservatively protects against all shifts in the data by using I-SPEC with all possible children of E as the mutable variables, emulating the conservative default dataset-driven behavior. Finally, to protect against only some shifts, we use I-SPEC with the invariance spec defined before. Using the procedure in Alg 1, I-SPEC uses 13 of the 18 features, while the conservative method uses 7; neither uses ‘Lab Time’ or ‘Admit Type’. For demonstration we train logistic regression models, though we emphasize that more complex models could be used instead.
Experimental Setup: We evaluate as follows: we randomly performed 80/20 train/test splits on the data from each hospital (repeated 100 times). To measure predictive performance, we used the H1 dataset to train the unstable, conservative, and I-SPEC models, and evaluated their area under the ROC curve (AUROC) on the test patients from each hospital. This allows us to see the robustness of a model's performance as it is applied to new environments. Beyond performance, we also evaluated the effect of shifts on model decisions. For each approach, we considered pairs of models (one trained at H1, and one trained at H2 or H3) and made predictions on the test set patients. We then computed the rank correlation of the two models' predictions via Spearman's ρ. A value of ρ = 1 indicates that the two models produce the same ordering of patients by predicted risk despite being trained at different hospitals (i.e., patient orderings are stable).
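Spearman's ρ is simply the Pearson correlation of the two models' rank-transformed predictions. A self-contained sketch (with mid-ranks for ties; in practice a statistics library would be used):

```python
def ranks(xs):
    """Average ranks of xs (ties receive the mean of their rank range)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mid-rank (1-indexed)
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(a, b):
    """Pearson correlation computed on the ranks of a and b."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Because only ranks enter the computation, ρ = 1 whenever one model's risk scores are any monotone transformation of the other's, which is exactly the "same patient ordering" property measured here.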
Results: Fig 2 shows boxplots of the AUROC of the models at each test hospital. As expected, the unstable model fails to generalize to new hospitals, with a significant drop in performance from H1 to H3 because the unstable lab time-mortality association flipped. On the other hand, the I-SPEC model generalizes well, and outperforms the unstable model at the new hospitals H2 and H3. Comparing I-SPEC to the conservative model, we see that the conservative model performs worse at all hospitals precisely because it protects against all shifts (leaving less predictive signal to learn from), though its performance also does not deteriorate at new hospitals because the model is likewise stable.
Fig 3 shows boxplots of the rank correlations of each model’s predictions. The unstable model has significantly less stable patient orderings than the two stable models: its rank correlations vary widely and reach values far below those of the stable models. The I-SPEC and conservative models have similar rank correlations, though the conservative model’s ρ values tend to be slightly higher due to protecting against all shifts. Overall, we see that stable models produce significantly more consistent predictions (and, thus, more stable patient orderings) than the unstable model. The difference between the stable models, however, is that the I-SPEC model has significantly and strictly better discriminative performance at all hospitals. This demonstrates that careful choice of the mutable variables (as opposed to defaulting to declaring every variable mutable) can yield stable and accurate models.
5.2 Simulated Data
To analyze the effect of the magnitude of shifts on the performance of stable and unstable models, we simulated data from a zero-mean linear Gaussian system according to the ADMG in Fig 1a. We shift one mechanism across environments by changing the coefficient of the unobserved confounder in the corresponding structural equation.
As expected, under small shifts the unstable model outperforms both stable models, but the unstable model’s error grows rapidly with the magnitude of the shift, quickly becoming much worse than that of the stable models. On the other hand, the stable models have performance that is consistent across environments, as desired. The interventional model achieves lower MSE than the conditional model because it makes use of stable information that the conditional model does not.
6 Conclusion
In this paper we addressed one of the primary challenges facing the deployment of machine learning in safety-critical applications: shifts in environment between training and deployment. To this end, we proposed I-SPEC, an end-to-end framework that lets us go from data to models that are guaranteed to be stable to declared shifts. Like existing graphical methods, I-SPEC does not require data from the target environment and is able to capture more stable information in the data than methods which use stable conditional distributions. An important difference, however, is that I-SPEC does not require prior knowledge of the full causal graph. As demonstrated in our healthcare experiments, this means I-SPEC can be applied to problems in which existing graphical methods would have been too difficult to use. The experiments further demonstrated how the framework can be used to discover shifts, determine which ones to protect against, and train accurate, stable models. To improve I-SPEC’s interoperability, a valuable direction for future work would be to handle differing variable sets across datasets.
Acknowledgements
The authors thank Dan Malinsky for helpful discussions about structure learning, the Tetrad developers for promptly providing an implementation of the Degenerate Gaussian score, and Sieu Tran for help implementing an earlier version of this work.
Appendix A Invariant Conditionals in PAGs and the CIDP Algorithm
A.1 Additional PAG Preliminaries
We first provide some additional definitions and facts about PAGs. These are relevant for understanding Theorem 7.
The d-separation criterion in DAGs is naturally generalized to encode conditional independences in mixed graphs through m-separation (Richardson and Spirtes, 2002). A path is m-connecting given a set $\mathbf{Z}$ if every collider on the path (e.g., the middle node of a v-structure $A \rightarrow B \leftarrow C$) is an ancestor of a member of $\mathbf{Z}$ and every non-collider on the path is not in $\mathbf{Z}$.
In PAGs we must also account for uncertainty about whether or not a node is a collider along a path. Letting $*$ denote a wildcard edge mark (head, tail, or circle), a node $B$ on a subpath $A *\!\!-\!\!* B *\!\!-\!\!* C$ is a definite non-collider if at least one of the two edges has a tail mark at $B$, or if both edges have circle marks at $B$ and $A$ and $C$ are not adjacent. A definite status path is one in which every non-endpoint node is either a collider or a definite non-collider (Maathuis and Colombo, 2015). These definitions let us extend m-connection (and m-separation) to PAGs: a definite status path is m-connecting given $\mathbf{Z}$ if every definite non-collider on the path is not in $\mathbf{Z}$ and every collider on the path is an ancestor of a member of $\mathbf{Z}$.
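The case analysis above can be made concrete by classifying a node’s status from the two edge marks incident to it on the path. The encoding below (`'>'`, `'-'`, `'o'` for arrowhead, tail, circle at the node) and the function name are our own illustrative choices, not part of any standard library:

```python
def node_status(mark_left, mark_right, neighbors_adjacent):
    """Classify node B on a subpath A *-* B *-* C of a path in a PAG.

    mark_left / mark_right: edge marks AT B from the two incident edges.
    neighbors_adjacent: whether A and C are adjacent in the PAG.
    Returns 'collider', 'definite non-collider', or 'not definite'.
    """
    if mark_left == '>' and mark_right == '>':
        # Both edges point into B in every MAG of the class.
        return 'collider'
    if mark_left == '-' or mark_right == '-':
        # At least one edge is out of B on the path.
        return 'definite non-collider'
    if mark_left == 'o' and mark_right == 'o' and not neighbors_adjacent:
        # Circle marks at B with non-adjacent neighbors: B cannot be a
        # collider in any MAG represented by the PAG.
        return 'definite non-collider'
    return 'not definite'
```

A path is definite status when `node_status` returns something other than `'not definite'` for every interior node.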
A.2 Invariance Criterion for Conditionals
Theorem 7 (Zhang (2008a), Theorem 30).
Suppose $\mathcal{P}$ is the PAG over the observed variables $\mathbf{V}$. For any $\mathbf{Y}, \mathbf{Z} \subseteq \mathbf{V}$ such that $\mathbf{Y} \cap \mathbf{Z} = \emptyset$, $P(\mathbf{Y} \mid \mathbf{Z})$ is invariant under interventions on $\mathbf{W}$ in $\mathcal{P}$ if and only if

1) for every $W \in \mathbf{W} \cap \mathbf{Z}$, every definite status m-connecting path, if any, between $W$ and any member of $\mathbf{Y}$ given $\mathbf{Z} \setminus \{W\}$ is out of $W$ with a visible edge;

2) for every $W \in \mathbf{W} \setminus \mathbf{Z}$ that is not a possible ancestor of $\mathbf{Z}$, there is no definite status m-connecting path between $W$ and any member of $\mathbf{Y}$ given $\mathbf{Z}$;

3) for every $W \in \mathbf{W} \setminus \mathbf{Z}$ that is a possible ancestor of $\mathbf{Z}$, every definite status m-connecting path, if any, between $W$ and any member of $\mathbf{Y}$ given $\mathbf{Z}$ is into $W$.
As originally written, verifying Theorem 7 involves checking individual definite status paths in the PAG. We will reduce the conditions to equivalent ones that can be verified in MAGs derived from the PAG; these graphs will, in general, have fewer paths, and efficient m-separation routines for them have been implemented (e.g., in the R package dagitty (Textor et al., 2016)). First, we require the following definitions from Maathuis and Colombo (2015), with the addition of the third construct below (removal of edges into $X$).
Definition 5 ($\mathcal{M}_X(\mathcal{P})$, $M_{\underline{X}}$, and $M_{\overline{X}}$).
Let $X$ be a vertex in PAG $\mathcal{P}$. Define $\mathcal{M}_X(\mathcal{P})$ to be the set of MAGs in the equivalence class described by $\mathcal{P}$ that have the same number of edges into $X$ as in $\mathcal{P}$. For any $M \in \mathcal{M}_X(\mathcal{P})$, let $M_{\underline{X}}$ be the graph obtained from $M$ by removing all directed edges out of $X$ that are visible in $\mathcal{P}$. For any $M \in \mathcal{M}_X(\mathcal{P})$, let $M_{\overline{X}}$ be the graph obtained from $M$ by removing all edges (directed or bidirected) into $X$.
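The two mutilations in Definition 5 amount to simple edge deletions. A minimal sketch, using our own (assumed) edge-list representation rather than any particular graph library:

```python
# A MAG as edge lists: directed edges are (tail, head) pairs,
# bidirected edges are frozensets. 'visible' marks which directed
# edges are visible in the PAG. Representation and names are ours.

def lower_mutilation(directed, bidirected, x, visible):
    """Drop the directed edges out of x that are visible."""
    kept = [(a, b) for (a, b) in directed
            if not (a == x and (a, b) in visible)]
    return kept, list(bidirected)

def upper_mutilation(directed, bidirected, x):
    """Drop all edges (directed or bidirected) into x."""
    kept_dir = [(a, b) for (a, b) in directed if b != x]
    kept_bi = [e for e in bidirected if x not in e]
    return kept_dir, kept_bi

directed = [('X', 'Y'), ('Z', 'X')]
bidirected = [frozenset({'X', 'W'})]
print(upper_mutilation(directed, bidirected, 'X'))
# -> ([('X', 'Y')], [])   (Z -> X and X <-> W are removed)
```

m-separation queries can then be run on the mutilated edge lists with an off-the-shelf routine.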
Lemma 8. Let $X \in \mathbf{W}$ and $M \in \mathcal{M}_X(\mathcal{P})$. Each condition of Theorem 7 is equivalent to an m-separation check in one of the derived graphs of Definition 5 ($M$, $M_{\underline{X}}$, or $M_{\overline{X}}$), as established in the proof below.
Proof of Lemma 8.
Consider each condition in turn.

This equivalence is a restatement of Lemma 7.4 in Maathuis and Colombo (2015) (which states the condition as: there is no m-connecting path between $X$ and $\mathbf{Y}$ given $\mathbf{Z} \setminus \{X\}$ in $M_{\underline{X}}$, i.e., they are m-separated).

This equivalence follows from the definition of a PAG. All MAGs in the equivalence class represented by $\mathcal{P}$ share the same conditional independences. Thus, if the independence holds in one MAG in $\mathcal{M}_X(\mathcal{P})$ then it holds in $\mathcal{P}$. Similarly, if it holds in $\mathcal{P}$ then it holds in all MAGs in $\mathcal{M}_X(\mathcal{P})$.

To prove this equivalence we must prove the following: let $M \in \mathcal{M}_X(\mathcal{P})$. Then there is a definite status m-connecting path from $X$ to $\mathbf{Y}$ given $\mathbf{Z}$ in $\mathcal{P}$ that is not into $X$ if and only if there is an m-connecting path between $X$ and $\mathbf{Y}$ given $\mathbf{Z}$ in $M_{\overline{X}}$. The style of the proof follows that of the proof of Lemma 7.4 in Maathuis and Colombo (2015).
First, the only if direction. Suppose there is a definite status m-connecting path, $p$, between $X$ and $\mathbf{Y}$ given $\mathbf{Z}$ in $\mathcal{P}$ that is not into $X$. Let $p'$ be this path in $M$ and $p''$ be this path in $M_{\overline{X}}$. As noted in Zhang (2008a), if a path is definite status m-connecting, then the corresponding path in every MAG in $\mathcal{M}_X(\mathcal{P})$ is m-connecting. Thus, we know that $p'$ is m-connecting. Further, $p'$ is out of $X$ because $p$ was not into $X$, and by construction $M$ has no additional edges into $X$ when compared to $\mathcal{P}$. Since $M_{\overline{X}}$ only deletes edges into $X$ when compared to $M$, the path $p''$ is no different from $p'$ and is also out of $X$. $p''$ is also m-connecting, because the only way for $p''$ to not be m-connecting while $p'$ is would be for $p''$ to contain a collider that became inactive after deleting edges into $X$; but any directed path witnessing that a collider on the path is an ancestor of $\mathbf{Z}$ is directed and out of $X$, and is therefore unaffected by deleting edges into $X$. Thus, $p''$ is m-connecting and out of $X$ in $M_{\overline{X}}$.
Now the if direction. Suppose there is an m-connecting path between $X$ and $\mathbf{Y}$ given $\mathbf{Z}$ in $M_{\overline{X}}$. Because this path is out of $X$, the corresponding path in $M$ is unaffected and is also m-connecting. By Lemma 5.1.9 in Zhang (2006), since $M \in \mathcal{M}_X(\mathcal{P})$, this means there is a definite status m-connecting path between $X$ and $\mathbf{Y}$ given $\mathbf{Z}$ in $\mathcal{P}$ that is not into $X$.
∎
A.3 CIDP Algorithm
We briefly restate key aspects of the CIDP algorithm here. For full details see Jaber et al. (2019a).
Jaber et al. (2019a) introduce additional constructs that are used in the CIDP algorithm. In what follows, we write $\mathrm{pa}^{+}(\mathbf{X})$ ($\mathrm{ch}^{+}(\mathbf{X})$) for the union of $\mathbf{X}$ and the set of possible parents (children) of $\mathbf{X}$; these are defined analogously for single nodes. We let $\mathrm{pa}^{*}(\mathbf{X})$ denote $\mathrm{pa}^{+}(\mathbf{X})$ excluding the possible parents of $\mathbf{X}$ due to circle edges, and define $\mathrm{ch}^{*}(\mathbf{X})$ similarly. Let a circle path be a path on which all edge marks are circles. Define a bucket to be a closure of nodes connected by circle paths.
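Since buckets are just connected components of the circle-edge subgraph, they can be computed with a standard graph traversal. A sketch (names and representation are ours, not from the CIDP implementation):

```python
def buckets(nodes, circle_edges):
    """Partition nodes into buckets: connected components of the
    subgraph containing only circle-circle (o-o) edges."""
    adj = {v: set() for v in nodes}
    for a, b in circle_edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, out = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:  # depth-first search over circle edges
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        out.append(frozenset(comp))
    return out

# A o-o B o-o C with D isolated gives two buckets: {A,B,C} and {D}.
print(buckets(['A', 'B', 'C', 'D'], [('A', 'B'), ('B', 'C')]))
```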
Definition 6 (PCComponent).
In a PAG or any induced subgraph thereof, two nodes are in the same possible c-component (pc-component) if there is a path between them such that (1) all non-endpoint nodes along the path are colliders, and (2) none of the edges are visible.
Note that two nodes are in the same definite c-component if they are connected by a bidirected path.
The following proposition gives an identification criterion for interventional distributions corresponding to interventions on a bucket.
Proposition 9 (Jaber et al. (2019a), Proposition 2).
Let $\mathcal{P}$ denote a PAG over $\mathbf{V}$, $\mathbf{T}$ be a union of a subset of buckets in $\mathcal{P}$, and $\mathbf{X} \subset \mathbf{T}$ be a bucket. Given $Q[\mathbf{T}]$ (i.e., the distribution of $\mathbf{T}$ under intervention on $\mathbf{V} \setminus \mathbf{T}$), and a partial topological order of buckets with respect to $\mathcal{P}_{\mathbf{T}}$ (the induced subgraph), $Q[\mathbf{T} \setminus \mathbf{X}]$ is identifiable if and only if, in $\mathcal{P}_{\mathbf{T}}$, there does not exist $Z \in \mathbf{X}$ such that $Z$ has a possible child that is in the pc-component of $Z$. If identifiable, the resulting expression is written in terms of $Q[\mathbf{T}]$, the set $\mathbf{S}^{\mathbf{X}}$, and the partial order, where $\mathbf{S}^{\mathbf{X}}$ is the union of the definite c-components of the members of $\mathbf{X}$ in $\mathcal{P}_{\mathbf{T}}$ and each bucket is conditioned on the set of nodes preceding it in the partial order; see Jaber et al. (2019a) for the explicit formula.
Definition 7 (Region $\mathcal{R}_{\mathbf{A}}^{\mathbf{C}}$).
Given a PAG $\mathcal{P}$ over $\mathbf{V}$, and $\mathbf{A} \subseteq \mathbf{C} \subseteq \mathbf{V}$. Let the region of $\mathbf{A}$ w.r.t. $\mathbf{C}$, denoted $\mathcal{R}_{\mathbf{A}}^{\mathbf{C}}$, be the union of the buckets that contain nodes in the pc-component of $\mathbf{A}$ in the induced subgraph $\mathcal{P}_{\mathbf{C}}$.
We are now ready to state the algorithm.
Appendix B Proofs of Main Results
Proof of Proposition 1.
Follows from admissibility (Pearl and Bareinboim, 2011, Theorem 2) and the definition of a PAG (independences that hold in every member of the PAG’s equivalence class must also hold in the PAG). ∎
Proof of Proposition 2.
In each ADMG in the equivalence class of the PAG, the relevant independence holds in the mutilated graph in which all edges into $\mathbf{W}$ have been deleted due to the $do(\mathbf{W})$ operator in $P(Y \mid \mathbf{Z}; do(\mathbf{W}))$. Now, by Rule 2 of do-calculus the same equality of distributions holds in each ADMG in the equivalence class, so $P(Y \mid \mathbf{Z}; do(\mathbf{W}))$ is a stable distribution by Proposition 1. ∎
Proof of Corollary 3.
First, note that given an invariance spec, Algorithm 1 searches over distributions of the form $P(Y \mid \mathbf{Z}; do(\mathbf{W}))$. All of these are stable by Proposition 2. Now, conditioning sets that satisfy Theorem 7 yield distributions that are in fact stable because the theorem is a sufficient graphical condition for invariance (see the “if” direction of the proof in Zhang (2008a)). For conditional interventional distributions found to be identifiable by CIDP, correctness follows from its soundness (Jaber et al., 2019a, Theorem 1). ∎
Proof of Lemma 4.
Appendix C Clarifying Relation to Dataset-Driven Approaches
We now discuss how existing dataset-driven methods, Rojas-Carulla et al. (2018) and Magliacane et al. (2018) in particular, can be adapted to address a problem defined by an invariance spec. Because these methods search for invariant conditionals, they are subsumed by I-SPEC.
First, Rojas-Carulla et al. (2018) is related to work on invariant prediction that finds stable distributions by hypothesis-testing the stability of a distribution across source environments (Peters et al., 2016). While these works do not assume faithfulness, under the faithfulness assumption (which is made by I-SPEC) it has been shown that an invariant distribution corresponds to a feature set $\mathbf{Z}$ such that the target variable is d-separated from the environment indicator given $\mathbf{Z}$ (Peters et al., 2016, Appendix C). Thus, Rojas-Carulla et al. (2018) can naturally be applied to the input of I-SPEC, where it searches for stable conditional distributions as defined in the main paper.
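To make the flavor of such a search concrete, here is a toy sketch of invariance-based feature selection. It uses a crude proxy (comparing per-environment regression slopes) in place of a proper conditional-independence test, and all data structures and names are our own illustrative choices, not the actual procedure of Rojas-Carulla et al. (2018):

```python
def slope(xs, ys):
    # Ordinary least-squares slope of y on x through the origin.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def invariant_feature_sets(envs, features, tol=0.1):
    """Keep the (single-feature) sets whose y-on-x slope is roughly
    the same in every source environment -- a crude stand-in for
    testing that the target is independent of the environment
    indicator given the feature set."""
    stable = []
    for f in features:
        slopes = [slope([row[f] for row in env['X']], env['y'])
                  for env in envs]
        if max(slopes) - min(slopes) < tol:
            stable.append({f})
    return stable

# Feature 0 relates to y identically in both environments; feature 1
# does not, so only {0} survives the invariance check.
envs = [
    {'X': [[1, 2], [2, 4], [3, 6]], 'y': [2, 4, 6]},
    {'X': [[1, 4], [2, 8], [3, 12]], 'y': [2, 4, 6]},
]
print(invariant_feature_sets(envs, [0, 1]))  # -> [{0}]
```

A real implementation would enumerate larger subsets and use a statistical invariance test as in Peters et al. (2016).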
Magliacane et al. (2018), by contrast, builds on the Joint Causal Inference (JCI) framework proposed in Mooij et al. (2016). The JCI framework considers a setting related to the environment-indicator setup (used by I-SPEC and by invariant-prediction works such as Peters et al. (2016) and Rojas-Carulla et al. (2018)) in which there are instead (possibly multiple) context variables that describe how environments differ, as opposed to system variables, which are the observed variables that form the feature set and target variable. The environment indicator described in the main paper can reasonably be viewed as a single context variable. Thus, invariance specs can be translated into the JCI framework. The specific method proposed in Magliacane et al. (2018) considers a problem setup in which unlabeled target-domain data is available. Within I-SPEC, however, the assumption is that the unknown target environment will be drawn from the set of environments defined by an invariance spec. This stronger assumption is what allows I-SPEC to be applied in settings in which no target-environment data is available. Under this assumption it is straightforward to adapt the method of Magliacane et al. (2018) to handle the input of I-SPEC. However, the method of Magliacane et al. (2018) searches only over stable conditional distributions.
Thus, both of these relevant dataset-driven methods are applicable to the same problems as I-SPEC. However, they search over stable conditional distributions, which (under the assumptions of the I-SPEC framework) consist of all the distributions (and only the distributions) that satisfy Zhang (2008a, Theorem 30) (which is sound and complete in PAGs). Then, by Lemma 5 we get Corollary 6, and we have that I-SPEC subsumes existing dataset-driven methods in their ability to find stable distributions, owing to its additional search over stable interventional distributions.
Appendix D Learned PAG
Appendix E Experimental Details
E.1 Simulated Experiment Details
We generated data according to the following linear Gaussian system:
Different environments correspond to different values of a coefficient in the structural equation for the shifted mechanism. We generated 50,000 samples each from two source environments associated with different values of this coefficient. We pooled the data from these two environments to train all three models. We evaluated the three models in 100 test environments created by varying the coefficient on an evenly spaced grid, sampling 10,000 data points from each test environment.
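The structural equations were elided above; the following sketch shows a system with the same qualitative structure (an unobserved confounder whose coefficient shifts across environments). The specific coefficients here are placeholder assumptions of ours, not the paper’s values:

```python
import random

def sample_env(alpha, n, seed=0):
    """Zero-mean linear-Gaussian system with an unobserved confounder u.
    alpha is the environment-specific coefficient that shifts; every
    other coefficient is an illustrative placeholder."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)                        # unobserved confounder
        x = u + rng.gauss(0, 1)                    # observed feature
        y = 2.0 * x + alpha * u + rng.gauss(0, 1)  # shifted mechanism
        rows.append((x, y))
    return rows

# Pool two source environments for training, as described above.
train = sample_env(-3.0, 1000, seed=1) + sample_env(3.0, 1000, seed=2)
```

Test environments would then be generated by sweeping `alpha` over a grid and measuring each model’s MSE per environment.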
We briefly note that the interventional distribution can be estimated using an auxiliary (counterfactual) variable, namely the feature with the effect of its parent removed. See Subbaswamy et al. (2019a) for the equivalence of using the auxiliary variable to the original interventional distribution. To compute it, we first fit a linear regression for the structural equation of the feature to learn the coefficient of its parent (which is 1). Then, using the estimated coefficient, we computed an estimate of the auxiliary variable before fitting the model. Test-environment values were computed using the coefficient learned from the training data.
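The residualization step can be sketched as follows, with a generic child/parent pair standing in for the paper’s specific variables:

```python
def ols_slope(xs, ys):
    # Least-squares slope of y on x for (roughly) zero-mean data.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def residualize(child, parent):
    """Auxiliary (counterfactual-style) variable: the child variable
    with the estimated linear effect of its parent subtracted out."""
    b = ols_slope(parent, child)
    return [c - b * p for c, p in zip(child, parent)], b

# Toy data where the true coefficient is 1, as in the experiment.
parent = [1.0, 2.0, -1.0, 0.5]
noise = [0.1, -0.2, 0.0, 0.1]
child = [p + e for p, e in zip(parent, noise)]
aux, b_hat = residualize(child, parent)
```

The model is then fit on the auxiliary variable `aux` in place of the original feature, reusing the training-set coefficient at test time.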
E.2 Real Data Experiment Details
Data Cohort
We construct the pooled dataset using de-identified measurements from patients who were admitted or transferred to the intensive care unit (ICU) of three hospitals in our institution’s network from early 2016 to early 2018. We only consider patients who stayed in the ICU for longer than 24 hours and use data collected during the first 24 hours of their visit. We focus on the non-pediatric case, requiring all patients to be over 15 years old. For patients with multiple ICU encounters, we only consider data from their first encounter. These criteria result in a cohort of 24,787 patients. Mortality rates varied as follows: 7% in H1, 10% in H2, and 12% in H3.
Data Features
The target variable of our prediction model is Mortality, defined as in-hospital death. We capture 12 physiologic features: Heart Rate, Systolic Blood Pressure, Temperature, Glasgow Coma Scale (GCS), PaO2/FiO2, Blood Urea Nitrogen, Urine Output, Sodium, Potassium, Bicarbonate, Bilirubin, and White Blood Cell Count. For each we computed the worst value in the first 24 hours using the SAPS II criteria found in Le Gall et al. (1993). Furthermore, we consider age and three comorbidities: metastatic cancer, hematologic malignancy, and AIDS. SAPS II also makes use of the admission type (i.e., scheduled surgical, unscheduled surgical, or medical). To create a known shift, we simulate another healthcare-process variable: the time of day when lab measurements occur (i.e., morning or night), such that mortality is correlated with morning measurements in Hospital 1, uncorrelated with measurement timing in Hospital 2, and correlated with night measurements in Hospital 3.
Specifically, we generated Lab Time conditionally on Mortality, with a different conditional distribution at each hospital: at Hospital 1 the probability of a morning measurement is higher for patients who die, at Hospital 2 Lab Time is drawn independently of Mortality, and at Hospital 3 the probability of a night measurement is higher for patients who die.
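Since the exact generating probabilities were elided above, the following sketch uses hypothetical values that merely reproduce the stated qualitative pattern; the numbers are assumptions of ours, not the paper’s:

```python
import random

# Hypothetical conditional probabilities P(Lab Time = morning | Mortality):
# morning labs co-occur with mortality at H1, timing is independent of
# mortality at H2, and night labs co-occur with mortality at H3.
P_MORNING = {
    'H1': {0: 0.5, 1: 0.8},
    'H2': {0: 0.5, 1: 0.5},
    'H3': {0: 0.5, 1: 0.2},
}

def sample_lab_time(hospital, mortality, rng):
    p = P_MORNING[hospital][mortality]
    return 'morning' if rng.random() < p else 'night'

rng = random.Random(0)
draws = [sample_lab_time('H1', 1, rng) for _ in range(10)]
```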
Imputation of missing values
To account for missing physiologic feature values, we impute our data via “Last Observation Carried Forward” (LOCF). If a feature value is missing from the patient’s first 24 hours, we impute it with the most recently recorded value prior to their ICU stay. If no such prior value exists, we fill the missing value with the hospital-specific population mean.
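The imputation cascade can be sketched for a single patient and feature (function and argument names are ours, not from the paper’s code):

```python
def impute_locf(first_24h_value, pre_icu_values, hospital_mean):
    """Impute one feature for one patient, per the scheme above:
    use the in-window value if present; else carry forward the most
    recent pre-ICU measurement; else fall back to the hospital mean.

    pre_icu_values: (timestamp, value) pairs recorded before the stay.
    """
    if first_24h_value is not None:
        return first_24h_value
    if pre_icu_values:
        # Most recent measurement prior to the ICU stay.
        return max(pre_icu_values, key=lambda tv: tv[0])[1]
    return hospital_mean

print(impute_locf(None, [(1, 7.1), (5, 7.4)], 7.0))  # -> 7.4
```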
Training
We trained unregularized logistic regression models using “classif.logreg” from the R package mlr (Bischl et al., 2016).
Footnotes
 We will use and interchangeably.
 We restate this algorithm in Appendix A.
 This special case of shifts in mechanism is sometimes referred to as “soft interventions”.
 Proofs of all results are in Appendix B.
 Such prior knowledge can be specified in the Tetrad implementation of FCI http://www.phil.cmu.edu/tetrad/.
 Recall that identifiability means that an interventional distribution is a function of the observational training data distribution.
 CIDP and Zhang (2008a, Thm 30) are given in Appendix A.
 Namely, Rojas-Carulla et al. (2018) and Magliacane et al. (2018) are easily adaptable to the setting of this paper; see Appendix C.
 CIDP has not been proven complete.
 Full inclusion criteria and details in Appendix E.
 Exact simulation details in Appendix E.
References
 Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 361, pp. k1479. Cited by: §5.1.
 Learning high-dimensional directed acyclic graphs with mixed data-types. Proceedings of Machine Learning Research 104, pp. 4. Cited by: §5.1.
 Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §1, §4.
 mlr: Machine Learning in R. Journal of Machine Learning Research 17 (170), pp. 1–5. External Links: Link Cited by: §E.2.
 Intelligible models for healthcare: predicting pneumonia risk and hospital 30day readmission. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1721–1730. Cited by: §1.
 Runaway feedback loops in predictive policing. In Conference on Fairness, Accountability and Transparency, pp. 160–171. Cited by: §1.
 Covariate shift by kernel mean matching. Dataset shift in machine learning 3 (4), pp. 5. Cited by: §4.
 Causal network learning from multiple interventions of unknown manipulated targets. arXiv preprint arXiv:1610.08611. Cited by: §4.
 Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469. Cited by: §4.
 Invariant causal prediction for nonlinear models. Journal of Causal Inference 6 (2). Cited by: §4.
 Identification of conditional causal effects under markov equivalence. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’AlchéBuc, E. Fox and R. Garnett (Eds.), Vancouver, Canada, pp. 11512–11520. Cited by: §A.3, §A.3, Appendix B, §2, §3.3, Proposition 9.
 Causal identification under markov equivalence: completeness results. In International Conference on Machine Learning, pp. 2981–2989. Cited by: §2.
 Stable prediction across unknown environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1617–1626. Cited by: §4.
 A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA 270 (24), pp. 2957–2963. Cited by: §E.2, §5.1.
 To predict and serve?. Significance 13 (5), pp. 14–19. Cited by: §1.
 A generalized backdoor criterion. The Annals of Statistics 43 (3), pp. 1060–1088. Cited by: item 1, item 3, §A.1, §A.2, §2.
 Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pp. 10869–10879. Cited by: Appendix C, Appendix C, §3.2, §4, footnote 8.
 Joint causal inference from multiple contexts. arXiv preprint arXiv:1611.10351. Cited by: Appendix C, §4.
 Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. Cited by: §1, §4.
 Transportability of causal and statistical relations: a formal approach. In TwentyFifth AAAI Conference on Artificial Intelligence, Cited by: Appendix B, §3.1, §3.4, §4.
 Causality. Cambridge university press. Cited by: §2.
 Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 947–1012. Cited by: Appendix C, Appendix C, §4.
 Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine 3 (1), pp. 42–52. Cited by: §5.1.
 Dataset shift in machine learning. The MIT Press. Cited by: §1, §4.
 Ancestral graph markov models. The Annals of Statistics 30 (4), pp. 962–1030. Cited by: §A.1, §2.
 Invariant models for causal transfer learning. The Journal of Machine Learning Research 19 (1), pp. 1309–1342. Cited by: Appendix C, Appendix C, Appendix C, §1, §3.2, §4, footnote 8.
 Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229. Cited by: §4.
 Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pp. 1697–1708. Cited by: §4, §5.1.
 Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90 (2), pp. 227–244. Cited by: §4.
 Identification of conditional interventional distributions. In 22nd Conference on Uncertainty in Artificial Intelligence, UAI 2006, pp. 437–444. Cited by: §2.
 Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the National Conference on Artificial Intelligence, Vol. 21, pp. 1219. Cited by: §2.
 Causation, prediction, and search. MIT press. Cited by: §2, §3.2.
 The hierarchy of stable distributions and operators to trade off stability and performance. arXiv preprint arXiv:1905.11374. Cited by: §E.1.
 Counterfactual normalization: proactively addressing dataset shift using causal mechanisms.. In UAI, pp. 947–957. Cited by: §1, §4.
 From development to deployment: dataset shift, causality, and shiftstable models in health ai. Biostatistics. Cited by: §1.
 Preventing failures due to dataset shift: learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3118–3127. Cited by: §1, §3.1, §3.1, §3.1, §3.2, §3.2, §4.
 Robust causal inference using directed acyclic graphs: the R package ‘dagitty’. International Journal of Epidemiology 45 (6), pp. 1887–1894. Cited by: §A.2.
 Learning causal structure from overlapping variable sets. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 860–867. Cited by: §4.
 Learning markov blankets from multiple interventional data sets. IEEE transactions on neural networks and learning systems. Cited by: §4.
 Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine 15 (11), pp. e1002683. Cited by: §1, §1.
 Causal inference and reasoning in causally insufficient systems. Ph.D. Thesis, Carnegie Mellon University. Cited by: item 3.
 Causal reasoning with ancestral graphs. Journal of Machine Learning Research 9 (Jul), pp. 1437–1474. Cited by: item 3, Appendix B, Appendix C, §3.3, Theorem 7, 1, footnote 7.
 On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence 172 (1617), pp. 1873–1896. Cited by: §3.2.
 Causal discovery from nonstationary/heterogeneous data: skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, Vol. 2017, pp. 1347. Cited by: §4.