Tutorial: Safe and Reliable Machine Learning

Suchi Saria, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA (ssaria@cs.jhu.edu) and Adarsh Subbaswamy, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA (asubbaswamy@jhu.edu)

This document serves as a brief overview of the “Safe and Reliable Machine Learning” tutorial given at the 2019 ACM Conference on Fairness, Accountability, and Transparency (FAT* 2019). The talk slides can be found here: https://bit.ly/2Gfsukp, while a video of the talk is available here: https://youtu.be/FGLOCkC4KmE, and a complete list of references for the tutorial here: https://bit.ly/2GdLPme.

Conference: ACM Conference on Fairness, Accountability, and Transparency; February 2019; Atlanta, GA, USA.

1. Motivation and Outline

Machine-learning-driven decision-making systems are starting to permeate modern society; for example, they are used to inform decisions about bank loans, incarceration, clinical care, and the hiring of new employees. As we march towards a future where these systems underpin most of society’s decision-making infrastructure, it is critical for us to understand the principles that will help us engineer for reliability. In this tutorial, we (1) give an overview of issues to consider when designing for reliability, (2) draw connections to concepts of fairness, transparency, and interpretability, and (3) discuss novel technical approaches for measuring and ensuring reliability.

2. Principles of Reliability

The field of machine learning, despite its increasing use in high-stakes and safety-critical applications, fundamentally lacks a framework for reasoning about failures and their potentially catastrophic effects. This is in contrast to traditional engineering disciplines which have been forced to consider the safety implications across a broad set of applications, from building a bridge to managing a nuclear power plant. Bridging across these applications is the discipline of reliability engineering (see, e.g., (Kapur and Pecht, 2014)) which seeks to ensure that a product or system performs as intended (without failure and within specified performance limits). Drawing on this notion of reliability, we have pulled out three principles of reliability engineering that we use to group and guide technical solutions for addressing and ensuring reliability in machine learning systems:

  1. Failure Prevention: Prevent or reduce the likelihood of failures.

  2. Failure Identification & Reliability Monitoring: Identify failures and their causes when they occur.

  3. Maintenance: Fix or address the failures when they occur.

In what follows we consider each of the principles of reliability in turn, summarizing key approaches when they exist and speculating about open problem areas. The focus of this tutorial is on supervised learning (i.e., classification and regression). For an overview of issues associated with reinforcement learning, see (Amodei et al., 2016).

3. Failure Prevention

To prevent failures, ideally we could proactively identify likely sources of error and develop methods that correct for these in advance. This requires us to explicitly reason about common sources of errors and issues. We broadly categorize four sources of failures and discuss each in turn: 1) bad or inadequate data, 2) differences or shifts in environment, 3) model-associated errors, and 4) poor reporting.

3.1. Bad or Inadequate Data

Inadequate data can cause errors related to differential performance. For example, when a particular class or subpopulation is underrepresented in a dataset, the performance of a classifier on these subgroups can be very poor even though average or overall accuracy is high (e.g., (Buolamwini and Gebru, 2018)). These errors can be detected by measuring performance on subpopulations of interest. If key subpopulations have not been identified, then one could consider clustering the data to find regions of poor support. Inadequate data can be addressed by collecting more representative data and through better design of the objective function.
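As a minimal sketch of measuring performance on subpopulations, the snippet below computes overall and per-group accuracy. The function name `accuracy_by_group`, the group labels, and the toy data are hypothetical and chosen only for illustration; they are not taken from the tutorial.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict mapping each group to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Hypothetical toy data: group "A" has 18 samples, group "B" only 2.
y_true = [1, 0] * 9 + [1, 0]
y_pred = [1, 0] * 9 + [0, 1]          # every "B" prediction is wrong
groups = ["A"] * 18 + ["B"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)      # 0.9
print(per_group)    # {'A': 1.0, 'B': 0.0}
```

In this toy case the overall accuracy of 0.9 hides the fact that the classifier is wrong on every sample from the underrepresented group.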

On the other hand, bad data refers to cases in which the data simply do not contain the information necessary to answer the question or perform the task of interest. Understanding when machine learning can be applied is crucial to avoiding model misuse. Some examples are discussed in the tutorial, such as trying to predict behavioral traits from facial images.

3.2. Differences or Shifts in Environment

Differences between training and deployment environments can lead to degraded model performance and failures post-deployment. As an example, in (Zech et al., 2018) the authors train a model to diagnose pneumonia from chest X-rays at a particular hospital. When evaluated on that dataset, the model yielded good performance. But when evaluated at two other hospital networks the performance was significantly worse, calling into question the generalizability or external validity of the model.

The issue is that modelers typically assume that training data is representative of the target population or environment where the model will be deployed. Yet commonly there is bias specific to the training dataset which causes learned models to be unreliable: they do not generalize beyond the training population and, more subtly, are not robust to shifts in practice patterns or policy in the training environment. This bias can arise due to the method of data collection, frequently due to some form of selection bias. The bias may also be caused by differences between the policy or population in the training data and that of the deployment environment. In some instances, the very deployment of the decision support tool can change practice and lead to future shifts in policy in the training environment, which leads to dangerous feedback loops (consider the example of predictive policing algorithms (Lum and Isaac, 2016)).

The machine learning community has extensively studied this problem of dataset shift in which training and test distributions are different (Quiñonero-Candela et al., 2009). However, the solutions have primarily been reactive in that they use (unlabeled) samples from the target distribution in combination with training data during learning to optimize directly for the target environment. But, in general, it is not feasible to access data from all possible test environments at training time (in fact, the deployment environment may be unspecified or may exist in the future). Thus, we instead want a predictive model that generalizes to new, unseen environments. Achieving this requires a shift in paradigm to proactive approaches (Subbaswamy and Saria, 2018; Subbaswamy et al., 2019): in order to prevent failures we should learn a model that is explicitly protected against problematic shifts that are likely to occur.
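One common instance of a reactive approach, given here only as an illustration and not taken from the tutorial, is importance weighting under covariate shift. Assuming that only the input distribution changes between environments (the conditional distribution of the label given the input is shared) and that the training density is positive wherever the test density is, the test risk can be rewritten as a reweighted training risk. The symbols p_train, p_test, f_θ, and ℓ are notation introduced for this sketch:

```latex
\mathbb{E}_{(x,y)\sim p_{\text{test}}}\big[\ell(f_\theta(x), y)\big]
= \mathbb{E}_{(x,y)\sim p_{\text{train}}}\!\left[
    \frac{p_{\text{test}}(x)}{p_{\text{train}}(x)}\,\ell(f_\theta(x), y)
  \right]
```

The density ratio has to be estimated from samples of the target environment, which is why this style of correction is unavailable when the deployment environment is unknown at training time.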

In this subsection of the tutorial we go through the framework laid out in (Subbaswamy et al., 2019) for preventing failures due to shifts in environment. The framework is enticing because it allows model developers to proactively reason about the possible shifts that could occur in their application, and then dictates what they need to model in order to make optimal predictions that are unaffected by the shifts they want to ignore. An overview is as follows: The framework uses directed acyclic graphs (DAGs) as a language for representing shifts in the underlying data generating process (DGP) that could occur between environments. This language is sufficiently expressive to include common forms of dataset shift (such as covariate and label shift) as well as more complex ones including policy shift (which leads to feedback loops (Schulam and Saria, 2017)). The graphical representation of a DGP can be augmented (into a selection diagram (Pearl and Bareinboim, 2011)) to identify the shifts we want to guard against. Thus, the graph serves as an invariance specification that a reliable model needs to satisfy. The algorithm presented in (Subbaswamy et al., 2019) serves as a preprocessing step that dictates which pieces of the DGP to model, and subsequently these pieces can be fit using arbitrarily complex models (e.g., neural networks) and combined to make predictions that satisfy the invariance requirements. For a discussion of other properties of this procedure, see the tutorial video and the paper (Subbaswamy et al., 2019).
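To make the idea of a graph as an invariance specification concrete, the following sketch checks one simple sufficient condition in a selection diagram: if the target Y is d-separated from the selection node S given a set of features Z, then the conditional distribution of Y given Z does not change across the environments that S indexes. This is much simpler than, and not a substitute for, the algorithm in (Subbaswamy et al., 2019). The graph, the node names (Y, X1, X2, S), and the helper functions are hypothetical.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen

def d_separated(parents, x, y, given):
    """True if x and y are d-separated given `given` (moralization test)."""
    keep = ancestors(parents, {x, y} | set(given))
    # Moralize the ancestral subgraph: link each node to its parents, link
    # every pair of parents that share a child, and drop edge directions.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents.get(child, ()) if p in keep]
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for a, b in combinations(ps, 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # Remove the conditioning set, then test whether y is reachable from x.
    blocked = set(given)
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v in seen or v in blocked:
            continue
        seen.add(v)
        stack.extend(nbrs[v])
    return y not in seen

# Hypothetical selection diagram: Y -> X1, Y -> X2 <- S.
# S marks that the mechanism generating X2 may differ between environments.
parents = {"Y": [], "S": [], "X1": ["Y"], "X2": ["Y", "S"]}

stable_given_x1 = d_separated(parents, "S", "Y", ["X1"])          # True
stable_given_both = d_separated(parents, "S", "Y", ["X1", "X2"])  # False
```

In this toy graph, predicting Y from X1 alone is protected against the shift, whereas also conditioning on X2 opens a path from S to Y, so a model that uses X2 directly may not transfer.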

3.3. Model Associated Errors

Broadly, we can think of two main types of errors related to modeling choices:

  1. Faulty (implicit) model assumptions

  2. Fragile models

Faulty model assumptions are often due to model misspecification. For example, a linear model may be inappropriate to use in a complex setting with many interactions (i.e., nonlinearities) between the features. To reduce or prevent model misspecification bias, we should make meaningful use of inductive bias in our choice of learner or alternatively consider nonparametric methods. Another example of a faulty model assumption is the case of dependent data. It is common practice to assume that samples have independent errors, but this assumption does not hold when working with geographic data or social network data where outcomes are tied. The lesson here is to be explicit about any assumptions that are made (which is related to having good reporting practices (Section 3.4)).

Model fragility arises when models are applied to problems with high-dimensional inputs. A model is “fragile” in the sense that its predictions are very sensitive to small perturbations of the input. An example that has received attention in recent years is the problem of adversarial examples (e.g., (Goodfellow et al., 2014)). Here the high-dimensional inputs are images, and perturbing an image with human-imperceptible noise can change the model’s prediction to one that is confidently wrong. This sort of failure is complementary to dataset shift because, despite proactive correction, a model can be susceptible to small perturbations as a result of its parameterization (i.e., it is a fitting issue).
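The sensitivity to small perturbations can be illustrated with a linear score function, in the spirit of the sign-of-the-gradient perturbation discussed in (Goodfellow et al., 2014). For a linear score w·x the gradient with respect to x is w, so moving every coordinate by ε against sign(w) lowers the score by ε times the L1 norm of w, which grows with the input dimension. The weights, input, and ε below are hypothetical toy values.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def score(w, x):
    """Linear score; the predicted class is the sign of this value."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign_attack(w, x, eps):
    """Shift every coordinate by eps in the direction that lowers the score."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

d = 1000                                              # input dimension
w = [1.0 if i % 2 == 0 else -1.0 for i in range(d)]   # toy weights
x = [0.5 + 0.01 * wi for wi in w]                     # toy input

x_adv = sign_attack(w, x, eps=0.02)
clean = score(w, x)          # about +10: confidently positive
attacked = score(w, x_adv)   # about -10: confidently negative
max_change = max(abs(a - b) for a, b in zip(x, x_adv))  # about 0.02
```

No coordinate moves by more than 0.02, yet the score swings from roughly +10 to roughly -10 because the thousand small changes add up.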

Research into adversarial training has produced methods that have useful properties for ensuring reliability. For example, these methods consider robust objectives in which the goal is to minimize the loss achieved by the worst-case adversarial attack. Further, some training methods produce certificates of robustness (e.g., (Sinha et al., 2018)) which give data-dependent bounds on the worst-case loss. Similarly, methods for model verification provide a means for making yes/no statements about individual input-output pairs which let us “stress test” a model (e.g., (Dvijotham et al., 2018)).
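One common way to write such a robust objective, assuming norm-bounded perturbations of the input, is the min-max problem below. The symbols (parameters θ, model f_θ, loss ℓ, perturbation δ, budget ε, data distribution P) are notation introduced for this sketch, and the cited works use their own, more general formulations:

```latex
\min_{\theta}\;
\mathbb{E}_{(x,y)\sim P}\Big[
  \max_{\|\delta\| \le \epsilon}\;
  \ell\big(f_\theta(x+\delta),\, y\big)
\Big]
```

The inner maximization plays the role of the worst-case attack, and the outer minimization trains the model against it.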

3.4. Poor Reporting

An important source of error is the mismatch between a model’s intended purpose and the way it is actually used, which is often a result of poor reporting practices. As opposed to other high-impact industries (such as transportation or pharmaceuticals), few standards exist for reporting and documentation in machine learning. A number of recent proposals seek to address this by creating templates for appropriate documentation, including “datasheets for datasets” (Gebru et al., 2018) and “model cards for model reporting” (Mitchell et al., 2019). These proposals advocate explicitly stating intended use, details of creation, ethical considerations, and many other relevant factors. We think a natural extension is to also include reliability criteria in model reporting, since guarantees are a promise regarding reliable behavior that should be documented. For example, statements about likely shifts in environment that were considered, certificates of robustness, and model verification would improve reporting practices. The primary open question regarding reporting is: what are the relevant aspects that should be required in documentation, and who should decide and enforce such requirements?

3.5. Relating the Sources of Failures

Figure 1. Diagram showing how the various sources of errors relate to reliability.

The various sources of errors and the corresponding proactive methods for preventing them can be integrated into a checklist for ensuring reliability and failure-proofing the development process, as shown in Fig. 1. We should first ask: can the question of interest be answered using the available data? Then we determine reliability with respect to shifts, check for safety against adversarial examples, and check for model misspecification. Finally, we should make sure we have responsibly documented the model and training procedure. As discussed in the tutorial, there are open problems related to each source of failure, and addressing them will be critical for the success of reliable machine learning.

4. Failure Identification and Reliability Monitoring

Failure prevention takes place prior to and during model learning and development. Once the system has been deployed, we require a means for test-time monitoring. A useful approach is to assess point-wise reliability: assess the model output for each new input, rejecting the output when it is deemed unreliable. This is closely related to the concept of detecting “out-of-distribution” examples: samples generated by a process that is different from the process generating the “nominal” data. These examples are, in some sense, “far away” from the known data distribution. An extensive literature on this exists under the names anomaly detection and open-category detection (see the full reference list linked above for more information).

In this section of the tutorial we consider how to compute a point-wise “trust score” that audits a model’s prediction in order to decide whether or not to reject it (Jiang et al., 2018; Schulam and Saria, 2019). We primarily consider two criteria for performing the audit: the density principle and the local fit principle. The density principle essentially asks whether the test case is close to training samples, while the local fit principle asks whether the model was accurate on training samples close to the test case. By synthesizing these two it is possible to audit model predictions subsequent to model training (see (Schulam and Saria, 2019) for details).
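The two principles can be sketched with a simple nearest-neighbor audit. This is a deliberate simplification and not the method of either cited paper: the function names, the choice of k, and the thresholds `max_dist` and `min_local_acc` are hypothetical, as is the toy data.

```python
import math

def nearest(train_X, x, k):
    """Return (distance, index) pairs for the k training points closest to x."""
    return sorted((math.dist(t, x), i) for i, t in enumerate(train_X))[:k]

def audit(train_X, train_correct, x, k=3, max_dist=1.0, min_local_acc=0.6):
    """Accept the prediction at x only if both principles are satisfied."""
    nn = nearest(train_X, x, k)
    # Density principle: is x close to the training samples?
    mean_dist = sum(d for d, _ in nn) / k
    # Local fit principle: was the model accurate on nearby training samples?
    local_acc = sum(train_correct[i] for _, i in nn) / k
    accept = mean_dist <= max_dist and local_acc >= min_local_acc
    return {"mean_dist": mean_dist, "local_acc": local_acc, "accept": accept}

# Hypothetical training inputs and whether the model was right on each (1/0).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_correct = [1, 1, 1, 0, 0, 1]

near_good = audit(train_X, train_correct, (0.2, 0.2))    # accept: True
near_bad = audit(train_X, train_correct, (5.2, 5.2))     # accept: False (local fit)
far_away = audit(train_X, train_correct, (-10, -10))     # accept: False (density)
```

The second query lies in a well-populated region where the model fit the training data poorly, while the third lies far from all training data; each is rejected by a different principle.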

While this section is primarily focused on purely algorithmic approaches to test-time monitoring, it is worth mentioning that there has been work on human-driven monitoring. These approaches include crowdsourcing and human-in-the-loop debugging, and are particularly useful because they are often effective at identifying a model’s “unknown unknowns” (predictions for which a model is confident despite being wrong) (Attenberg et al., 2011). For some sample references consult the relevant section of the tutorial slides.

5. Maintenance

Model maintenance remains a largely open area for ensuring reliability. For example, beyond error monitoring is the question of how to detect when updates to the model are necessary. Further, how do we safely update the model? There may be issues of forgetting what was previously learned or feedback loops created during continuous learning. The challenges of maintenance are also compounded by the fact that AI systems are complex with multiple components. See (Sculley et al., 2014) for a lucid discussion of the maintenance costs associated with the “technical debt” incurred by machine learning development.

6. Conclusion

In this tutorial we have defined reliability as a vital property of a successful machine learning system: given a specification of desired behavior, we want to ensure that the machine learning system behaves consistently as intended and is not error prone. We note that there is some existing work in this direction under the name “robust machine learning.” However, we believe that robustness, while an important aspect of reliability (e.g., in guarding against adversarial examples), is too narrow in scope (in part due to its connotations with the well-established field of robust statistics) and fails to address important sources of failure such as model misuse due to poor reporting. We also note that reliability is separate from other desirable properties such as privacy, fairness, and transparency. We identified three principles for ensuring reliability: failure prevention, failure identification and reliability monitoring, and maintenance. It is our hope that these principles will guide future work and discussions about successfully deploying machine learning across a variety of domains.


  • Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016).
  • Attenberg et al. (2011) Josh M Attenberg, Panagiotis G Ipeirotis, and Foster Provost. 2011. Beat the machine: Challenging workers to find the unknown unknowns. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence.
  • Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. 77–91.
  • Dvijotham et al. (2018) Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy Mann, and Pushmeet Kohli. 2018. A dual approach to scalable verification of deep networks. arXiv preprint arXiv:1803.06567 (2018).
  • Gebru et al. (2018) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2018).
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
  • Jiang et al. (2018) Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. 2018. To trust or not to trust a classifier. In Advances in Neural Information Processing Systems. 5546–5557.
  • Kapur and Pecht (2014) Kailash C Kapur and Michael Pecht. 2014. Reliability Engineering. John Wiley and Sons, Inc.
  • Lum and Isaac (2016) Kristian Lum and William Isaac. 2016. To predict and serve? Significance 13, 5 (2016), 14–19.
  • Mitchell et al. (2019) Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM, 220–229.
  • Pearl and Bareinboim (2011) Judea Pearl and Elias Bareinboim. 2011. Transportability of causal and statistical relations: a formal approach. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press, 247–254.
  • Quiñonero-Candela et al. (2009) J Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2009. Dataset shift in machine learning.
  • Schulam and Saria (2017) Peter Schulam and Suchi Saria. 2017. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems. 1697–1708.
  • Schulam and Saria (2019) Peter Schulam and Suchi Saria. 2019. Can you trust this prediction? Auditing Pointwise Reliability After Learning. In Artificial Intelligence and Statistics (AISTATS).
  • Sculley et al. (2014) D Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine learning: The high interest credit card of technical debt. (2014).
  • Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. 2018. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations.
  • Subbaswamy and Saria (2018) Adarsh Subbaswamy and Suchi Saria. 2018. Counterfactual Normalization: Proactively Addressing Dataset Shift Using Causal Mechanisms. In Uncertainty in Artificial Intelligence.
  • Subbaswamy et al. (2019) Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. 2019. Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport. In Artificial Intelligence and Statistics (AISTATS).
  • Zech et al. (2018) John R Zech, Marcus A Badgeley, Manway Liu, Anthony B Costa, Joseph J Titano, and Eric Karl Oermann. 2018. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS medicine 15, 11 (2018), e1002683.