On the Accuracy of Influence Functions
for Measuring Group Effects
Abstract
Influence functions estimate the effect of removing particular training points on a model without needing to retrain it. They are based on a first-order approximation that is accurate for small changes in the model, and so are commonly used for studying the effect of individual points in large datasets. However, we often want to study the effects of large groups of training points, e.g., to diagnose batch effects or apportion credit between different data sources. Removing such large groups can result in significant changes to the model. Are influence functions still accurate in this setting? In this paper, we find that across many different types of groups and in a range of real-world datasets, the influence of a group correlates surprisingly well with its actual effect, even if the absolute and relative error can be large. Our theoretical analysis shows that such correlation arises under certain settings but need not hold in general, indicating that real-world datasets have particular properties that keep the influence approximation well-behaved.
Pang Wei Koh*, Kai-Siang Ang*, Hubert H. K. Teo*, Percy Liang
Department of Computer Science, Stanford University
{pangwei@cs, kaiang@, hteo@, pliang@cs}.stanford.edu
*Equal contribution.
Preprint. Under review.
1 Introduction
Influence functions can be used to estimate the effect that removing an individual training point has on a model’s predictions, without the computationally prohibitive cost of repeatedly retraining the model. This ability to trace a model’s output back to its training data is valuable: in recent years, influence functions have been used to explain predictions [Koh and Liang, 2017], produce confidence intervals [Schulam and Saria, 2019], increase model fairness [Wang et al., 2019], improve human trust [Zhou et al., 2019], and even craft data poisoning attacks [Koh et al., 2019].
Influence functions are based on first-order approximations that are accurate for asymptotically small perturbations to the training data, which makes them suitable for predicting the effects of removing individual training points on the model. However, we often want to study the effects of removing groups of points, which represent large perturbations to the data. For example, we might wish to analyze the effect of data collected from different experimental batches [Leek et al., 2010] or demographic groups [Chen et al., 2018]; apportion credit between crowdworkers, each of whom generated part of the data [Arrieta-Ibarra et al., 2018]; or, in a multi-party learning setting, ensure that no individual user has too much influence on the joint model [Hayes and Ohrimenko, 2018]. Are influence functions still accurate when predicting the effects of (removing) these groups?
In this paper, we show that on real datasets and across a broad variety of groups of data, the predicted and actual effects are strikingly correlated (Spearman of to ), such that the groups with the largest actual effect also tend to have the largest predicted effect. Moreover, the predicted effect tends to underestimate the actual effect, suggesting that it could be an approximate lower bound. Using influence functions to predict the effect of large, coherent groups of data can therefore still be a useful and computationally tractable approximation, even though the violation of the small-perturbation assumption can result in high absolute and relative error between the predicted and actual effects.
What explains these phenomena of correlation and underestimation? Prior theoretical work focused on establishing the conditions under which this first-order influence approximation is accurate, i.e., the error between the actual and predicted effects is small [Giordano et al., 2019, Rad and Maleki, 2018]. However, measuring the influence of groups is in a qualitatively different regime: one in which this error can be quite large. We characterize the relationship between influence and actual effects in this regime via the one-step Newton approximation [Pregibon et al., 1981] and show that correlation and underestimation arise under certain settings. However, our theoretical analysis shows that neither phenomenon need hold in general, which opens up the intriguing question of why we observe both across a wide range of empirical settings.
Finally, we exploit the correlation of influence with actual effects to study the effects of different data sources in two case studies: a chemical-disease relationship task, where the data comes from different labeling functions [Hancock et al., 2018], and a language inference task, where the data comes from different crowdworkers. In both cases, we show that the influences of the different groups of data points (from different sources) can reveal insights about the nature of the data and application.
2 Background and problem setup
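Consider the task of learning a predictive model with parameters $\theta \in \Theta$ that maps from an input space $\mathcal{X}$ to an output space $\mathcal{Y}$. We are given $n$ training points $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and a loss function $\ell(x, y; \theta)$ that is twice-differentiable and convex in $\theta$. To train the model, we select the model parameters
$$\hat\theta(\mathbf{1}) = \arg\min_{\theta \in \Theta} \left[ \sum_{i=1}^{n} \ell(x_i, y_i; \theta) \right] + \frac{\lambda}{2} \|\theta\|_2^2 \tag{1}$$
that minimize the regularized empirical risk, where $\lambda > 0$ is a hyperparameter that controls regularization strength. The all-ones vector $\mathbf{1}$ in $\hat\theta(\mathbf{1})$ denotes that the initial training points all have uniform sample weights of one.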
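Our goal is to measure the effects that different groups of training data have on the model: if we removed a given subset of training points $W$, how much would the model change? Concretely, we define a vector of sample weights $w \in \{0, 1\}^n$ with $w_i = \mathbb{I}((x_i, y_i) \in W)$ and consider the modified parameters
$$\hat\theta(\mathbf{1} - w) = \arg\min_{\theta \in \Theta} \left[ \sum_{i=1}^{n} (1 - w_i)\, \ell(x_i, y_i; \theta) \right] + \frac{\lambda}{2} \|\theta\|_2^2 \tag{2}$$
which correspond to retraining the model after excluding $W$. We will refer to $w$ directly as the subset (corresponding to $W$); the number of removed points as $\|w\|_1$; and the fraction of removed points as $\alpha = \|w\|_1 / n$.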
The actual effect $\mathcal{I}^*_f(w)$ of the subset $w$ is
$$\mathcal{I}^*_f(w) = f(\hat\theta(\mathbf{1} - w)) - f(\hat\theta(\mathbf{1})), \tag{3}$$
where the evaluation function $f: \Theta \to \mathbb{R}$ measures a quantity of interest. Specifically, we study:

The change in test prediction, with $f(\theta) = \theta^\top x_{\text{test}}$. Linear models (for regression or binary classification) make predictions that are functions of $\theta^\top x_{\text{test}}$, so this measures the effect that removing a subset will have on the model’s prediction for some test point $x_{\text{test}}$.

The change in test loss, with $f(\theta) = \ell(x_{\text{test}}, y_{\text{test}}; \theta)$. This is similar to the test prediction.

The change in self-loss, with $f(\theta) = \sum_{i=1}^{n} w_i\, \ell(x_i, y_i; \theta)$, measures the increase in loss on the removed points $w$. Its average over all subsets of size $k$ is the estimated extra loss that leave-$k$-out cross-validation (CV) measures over the training loss, so measuring this allows us to approximate CV with influence.
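To make the actual effect in (3) concrete, here is a minimal sketch of computing it by retraining, for the change in test prediction. It is not the authors' implementation: it assumes L2-regularized logistic regression with labels in {0, 1}, synthetic data, and a hypothetical helper `fit_logreg` that solves the weighted objective with Newton's method. The subset, sizes, and seed are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, weights, lam, iters=50):
    """Hypothetical helper: Newton's method on
    sum_i weights[i] * logloss_i(theta) + (lam / 2) * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (weights * (p - y)) + lam * theta
        hess = (X * (weights * p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])
        theta = theta - np.linalg.solve(hess, grad)
    return theta

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)
x_test = rng.normal(size=d)

w = np.zeros(n)
w[:20] = 1.0  # indicator vector of the removed subset (here: the first 20 points)

theta_full = fit_logreg(X, y, np.ones(n), lam)   # all sample weights equal to one
theta_removed = fit_logreg(X, y, 1.0 - w, lam)   # removed points get weight zero

# Evaluation function f(theta) = theta^T x_test (change in test prediction).
actual_effect = x_test @ theta_removed - x_test @ theta_full
print(f"actual effect on test prediction: {actual_effect:+.4f}")
```

Every new subset requires another call to `fit_logreg`, which is the cost that influence functions are meant to avoid.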
2.1 Influence functions
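The issue with computing the actual effect $\mathcal{I}^*_f(w)$ is that retraining the model to compute $\hat\theta(\mathbf{1} - w)$ for each different subset $w$ can be prohibitively expensive. Influence functions provide a relatively efficient first-order approximation to $\mathcal{I}^*_f(w)$ that avoids retraining [Hampel et al., 1986].
Consider the function $q_w: [0, 1] \to \mathbb{R}$ with $q_w(t) = f(\hat\theta(\mathbf{1} - tw))$, such that the actual effect $\mathcal{I}^*_f(w)$ can be written as $q_w(1) - q_w(0)$. We define the predicted effect $\mathcal{I}_f(w)$ of the subset $w$ to be its influence $\mathcal{I}_f(w) = q_w'(0)$. Intuitively, this is the effect of removing an infinitesimal weight from each point in $w$ and then linearly extrapolating to removing all of $w$. (Footnote 1: In the statistics literature, it is typically defined as the effect of adding weight, so the sign is flipped.) It can be computed as
$$\mathcal{I}_f(w) = q_w'(0) = \nabla_\theta f(\hat\theta(\mathbf{1}))^\top H_{\lambda,1}^{-1}\, g_1(w), \tag{4}$$
where $g_1(w) = \sum_{i=1}^{n} w_i \nabla_\theta \ell(x_i, y_i; \hat\theta(\mathbf{1}))$, $H_1 = \sum_{i=1}^{n} \nabla^2_\theta \ell(x_i, y_i; \hat\theta(\mathbf{1}))$, and $H_{\lambda,1} = H_1 + \lambda I$. When measuring the change in test prediction or test loss, influence is additive: if $w = w_1 + w_2$, then $\mathcal{I}_f(w) = \mathcal{I}_f(w_1) + \mathcal{I}_f(w_2)$, i.e., the influence of a subset is the sum of influences of its constituent points, so we can efficiently compute the influence of any subset without any model retraining by precomputing the influence of each individual point. Koh and Liang [2017] describe how to compute the influence of all individual points on a test point with a single inverse Hessian-vector product.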
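The computation in (4) and the additivity property can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: it uses L2-regularized logistic regression on synthetic data, a hypothetical helper `fit_logreg`, and the test-prediction evaluation function, whose gradient is simply the test input. It uses one linear solve against the regularized Hessian and then sums per-point influences to score any subset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, weights, lam, iters=50):
    """Hypothetical helper: Newton's method on the weighted, L2-regularized log loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (weights * (p - y)) + lam * theta
        hess = (X * (weights * p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])
        theta = theta - np.linalg.solve(hess, grad)
    return theta

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ rng.normal(size=d))).astype(float)
x_test = rng.normal(size=d)

theta_hat = fit_logreg(X, y, np.ones(n), lam)
p = sigmoid(X @ theta_hat)
point_grads = (p - y)[:, None] * X                               # row i: gradient of loss_i at theta_hat
hessian = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(d)   # Hessian of the losses plus lam * I

# For f(theta) = theta^T x_test, grad f = x_test, so a single linear solve
# yields the influence of every individual training point.
s_test = np.linalg.solve(hessian, x_test)
point_influence = point_grads @ s_test

def group_influence(w):
    """Influence of the subset with indicator vector w (sum of per-point influences)."""
    return float(w @ point_influence)

w1 = np.zeros(n)
w1[:10] = 1.0
w2 = np.zeros(n)
w2[10:20] = 1.0

predicted = group_influence(w1 + w2)
actual = x_test @ fit_logreg(X, y, 1.0 - (w1 + w2), lam) - x_test @ theta_hat
print(f"predicted (influence): {predicted:+.4f}   actual (retraining): {actual:+.4f}")
```

The retraining call on the last lines is only there for comparison; scoring further subsets needs nothing beyond `point_influence`.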
2.2 Relation to prior work
Influence functions—introduced in the seminal work of Hampel [1974] and in Jaeckel [1972], where it was called the infinitesimal jackknife—have a rich history in robust statistics. The use of influence functions in the ML community is more recent, though growing; in Section 1, we provide references for several recent applications of influence functions in ML.
Removing a single training point, especially when the total number of points is large, represents a small perturbation to the training distribution, so we expect the first-order influence approximation to be accurate. Indeed, prior work on the accuracy of influence has focused on this regime: e.g., Debruyne et al. [2008], Liu et al. [2014], Rad and Maleki [2018], Giordano et al. [2019] give evidence that the influence on self-loss can approximate LOOCV, and Koh and Liang [2017] similarly examined the accuracy of estimating the change in test loss after removing single training points.
However, removing a constant fraction of the training data represents a large perturbation to the training distribution. To the best of our knowledge, this setting has not been empirically studied; perhaps the closest work is Khanna et al. [2019]’s use of Bayesian quadrature to estimate a maximally influential subset. Instead, older references have alluded to the phenomena we observe: Pregibon et al. [1981] note that influence tends to be conservative, while Hampel et al. [1986] say that “bold extrapolations” (i.e., large perturbations) are often still useful. On the theoretical front, Giordano et al. [2019] established finite-sample error bounds that apply to groups, e.g., showing that the leave-$k$-out approximation is consistent as the fraction of removed points $\alpha \to 0$. Our focus is instead on the relationship of the actual effect and influence in the regime where $\alpha$ is constant and the error is large.
3 Empirical accuracy of influence functions on constructed groups
How well do influence functions estimate the effect of (removing) a group of training points? If $n$ is large and we remove a subset $w$ uniformly at random, the new parameters $\hat\theta(\mathbf{1} - w)$ should remain close to the original parameters $\hat\theta(\mathbf{1})$ even when the fraction of removed points $\alpha$ is non-negligible, so the influence error should be small. However, we are usually interested in removing coherent, non-random groups, e.g., all points from a data source, or all points that share some features. In such settings, the parameters $\hat\theta(\mathbf{1} - w)$ and $\hat\theta(\mathbf{1})$ might differ substantially, and the error could be large. Put another way, there could be a cluster of points such that removing a single point would not change the model by much (so influence could be low) but removing all of them would.
Surprisingly (to us), we found that even when removing large and coherent groups of points, the influence behaved consistently relative to the actual effect on test predictions, test losses, and self-loss, with two broad phenomena emerging:

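Correlation: $\mathcal{I}_f(w)$ and $\mathcal{I}^*_f(w)$ rank subsets in a similar order (e.g., high Spearman $\rho$).

Underestimation: $\mathcal{I}_f(w)$ and $\mathcal{I}^*_f(w)$ tend to share the same sign, with $|\mathcal{I}_f(w)| < |\mathcal{I}^*_f(w)|$. (Footnote 2: Except when $f$ measures test loss and the actual effect $\mathcal{I}^*_f(w)$ is negative, in which case $\mathcal{I}_f(w)$ and $\mathcal{I}^*_f(w)$ still tend to share the same sign, but $|\mathcal{I}_f(w)| < |\mathcal{I}^*_f(w)|$ need not hold (Figure 1-Mid).)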
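Both phenomena can be measured from paired arrays of predicted and actual effects. The sketch below is illustrative only: the helper names `spearman_rho` and `underestimation_rate` are hypothetical, the rank computation assumes no ties, and the five numbers are made up rather than taken from the experiments.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for arrays without ties: Pearson correlation of the ranks."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def underestimation_rate(predicted, actual):
    """Fraction of subsets where predicted and actual share a sign and |predicted| < |actual|."""
    same_sign = np.sign(predicted) == np.sign(actual)
    smaller = np.abs(predicted) < np.abs(actual)
    return float(np.mean(same_sign & smaller))

# Hypothetical effects for five subsets (not results from the paper).
actual = np.array([0.50, -0.20, 1.00, 0.10, -0.60])
predicted = np.array([0.30, -0.10, 0.45, 0.12, -0.50])

rho = spearman_rho(predicted, actual)
rate = underestimation_rate(predicted, actual)
print(f"Spearman rho = {rho:.2f}, underestimation rate = {rate:.2f}")
```

In this toy example the third subset has a large absolute error (0.45 predicted against 1.00 actual), yet the two arrays rank the subsets identically, which is the sense in which correlation can be high while error is large.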
Since the error can be large, this systematic relationship was unexpected. Here, we report results on 5 datasets chosen to span a range of applications, training set size $n$, and features (Table 1). (Footnote 3: The first 4 datasets involve hospital readmission prediction, spam classification, and object recognition, and were used in Koh and Liang [2017] to study the accuracy of the influence of individual points. The fifth dataset is a chemical-disease relationship (CDR) dataset from Hancock et al. [2018]. In Section 5, we will also study the MultiNLI language inference dataset [Williams et al., 2018]; we omitted it from the experiments here because its size makes repeated retraining to compute the actual effect too expensive. See Appendix B for dataset details.) To stress test the accuracy of the influence approximation, we constructed a broad variety of subsets, from small () to large (), to be coherent and have considerable influence on the model, and therefore make the influence approximation as bad as possible. On each dataset, we trained a regularized logistic regression model (or softmax for the multiclass tasks) and compared the influences and actual effects of these subsets (Appendix A).
Dataset  Classes  Test acc.  Source  

Diabetes  Strack et al. [2014]  
Enron  Metsis et al. [2006]  
Dogfish  Koh and Liang [2017]  
MNIST  LeCun et al. [1998]  
CDR  Hancock et al. [2018]  
MultiNLI  Williams et al. [2018] 
Group construction.
For each dataset, we grouped points that shared a similar $j$-th feature value, for random $j$; points that clustered on their features (or, separately, on their gradients $\nabla_\theta \ell(x_i, y_i; \hat\theta(\mathbf{1}))$); random points from the same class; and, for comparison, random points from any class. In addition, we picked 3 random test points and the 3 test points with the highest loss (the latter as they seemed likely candidates for model developers to want to investigate). For each of these 6 test points, we constructed subsets of training points with large positive (or, separately, negative) influence on the test loss. Intuitively, training points that all have high influence on a test point should act together to change the model substantially. Overall, for each dataset, we constructed subsets ranging in size from to of the training points; more details are in Appendix A.2.
Results.
Figure 1 shows that the influences and actual effects of all of these subsets on test prediction (Top), test loss (Mid), and self-loss (Bot) are highly correlated (Spearman of to across all plots), even though the absolute and relative error of the influence approximation can be quite large. Moreover, the influence of a group tends to underestimate its actual effect in all settings except for groups with negative influence on test loss (the left side of each plot in Figure 1-Mid). These trends held across a wide range of regularizations $\lambda$, though correlation increased with $\lambda$ (Appendix C.2).
In Section 5, we will use the CDR dataset and the MultiNLI [Williams et al., 2018] dataset to show that these phenomena of correlation and underestimation also apply to groups of data that arise naturally, and that influence functions can therefore be used to derive insights about real datasets and applications. Before that, we first attempt to develop some theoretical insight into the results above.
4 Analysis
The experimental results above show that there is high correlation and consistent underestimation (as defined in Section 3) between the influences and actual effects of groups of data across a broad variety of datasets, despite the influence approximation incurring large absolute and relative error. As we discussed in Section 2.2, this is far outside the regime of existing theory. Here, we present initial results characterizing the conditions under which the influence and actual effect correlate despite high error. We give counterexamples demonstrating that high correlation and underestimation need not always hold. In restricted settings, however, we show that the approximate cone constraint holds for some that decreases with increasing regularization $\lambda$, which implies both underestimation and correlation when $\lambda$ is high.
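Our analysis centers on the one-step Newton approximation, which estimates the change in parameters
$$\Delta\theta_{\mathrm{Nt}}(w) = H_{\lambda,1}(w)^{-1}\, g_1(w), \tag{5}$$
where $H_{\lambda,1}(w) = \lambda I + \sum_{i=1}^{n} (1 - w_i) \nabla^2_\theta \ell(x_i, y_i; \hat\theta(\mathbf{1}))$ is the regularized empirical Hessian at $\hat\theta(\mathbf{1})$ but reweighted after removing the subset $w$, with $g_1(w)$ as in (4). This change in parameters gives the Newton approximation of the effect $\mathcal{I}^{\mathrm{Nt}}_f(w) = f(\hat\theta(\mathbf{1}) + \Delta\theta_{\mathrm{Nt}}(w)) - f(\hat\theta(\mathbf{1}))$ and the corresponding Newton error $\mathrm{Err}_{\text{Nt-act}}(w) = \mathcal{I}^*_f(w) - \mathcal{I}^{\mathrm{Nt}}_f(w)$, which measures its gap from the actual effect. Specifically, we decompose the difference between the influence and actual effects, $\mathcal{I}^*_f(w) - \mathcal{I}_f(w)$, into:
$$\mathcal{I}^*_f(w) - \mathcal{I}_f(w) = \underbrace{\mathcal{I}^*_f(w) - \mathcal{I}^{\mathrm{Nt}}_f(w)}_{\mathrm{Err}_{\text{Nt-act}}(w)} + \underbrace{\mathcal{I}^{\mathrm{Nt}}_f(w) - \mathcal{I}_f(w)}_{\mathrm{Err}_{\text{Nt-inf}}(w)}. \tag{6}$$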
In the remainder of this section, we analyze each term in this decomposition:
Notation and assumptions.
Throughout our analysis, we assume that the Hessian $\nabla^2_\theta \ell$ is $C_H$-Lipschitz and that the evaluation function $f$ is $C_f$-Lipschitz. Furthermore, we assume that the third derivative of $f$ exists and is bounded with constant $C_{f,3}$. We list these assumptions formally in Appendix E.2. We also define $C_\ell$ to be the largest norm of a training point’s gradient at $\hat\theta(\mathbf{1})$, and $\sigma_{\min}$ and $\sigma_{\max}$ to be the smallest and largest eigenvalues of $H_1$. Due to space constraints, we provide all proofs in Appendix E.
4.1 Bounding the error of the onestep Newton approximation
The Newton approximation is computationally expensive because it computes the reweighted inverse Hessian $H_{\lambda,1}(w)^{-1}$ for each subset $w$ (instead of the fixed inverse Hessian $H_{\lambda,1}^{-1}$ in the influence calculation), but it is known to provide more accurate estimates. For example, Pregibon et al. [1981] use it to study the influence of single points in logistic regression, and more recently, Rad and Maleki [2018] show that it ensures consistency in the setting where . Indeed, under the assumptions above, we can bound its error as follows.
Proposition 1.
Let the Newton error be . Then
The Newton error only involves third-order or higher derivatives of the loss, so it is 0 for quadratic losses.
Proposition 1 tells us that the Newton approximation is accurate when $\lambda$ is large or the third derivative of $\ell$ (controlled by the Lipschitz constant of the Hessian) is small. Empirically, the Newton error is small in most of our settings (Figure 2), even though the overall error of the influence approximation is still large; this implies that most of the error must come from the gap between the Newton and influence approximations.
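The quadratic-loss case can be checked numerically. The sketch below is an illustration under assumptions, not one of the paper's experiments: it uses ridge regression with squared loss on synthetic data and a hypothetical helper `fit_ridge`. Because the loss is quadratic, the one-step Newton estimate with the reweighted Hessian recovers the retrained parameters up to floating-point error, while influence, which keeps the Hessian of the full training set, does not.

```python
import numpy as np

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, lam = 100, 4, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
x_test = rng.normal(size=d)

def fit_ridge(X, y, weights, lam):
    """Hypothetical helper: exact minimizer of
    sum_i weights[i] * 0.5 * (theta^T x_i - y_i)^2 + (lam / 2) * ||theta||^2."""
    A = (X * weights[:, None]).T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (weights * y))

w = np.zeros(n)
w[:30] = 1.0  # remove 30% of the training points

theta_hat = fit_ridge(X, y, np.ones(n), lam)
theta_removed = fit_ridge(X, y, 1.0 - w, lam)

g = X.T @ (w * (X @ theta_hat - y))                               # summed gradients of removed points
H_full = X.T @ X + lam * np.eye(d)                                # Hessian used by influence
H_reweighted = (X * (1.0 - w)[:, None]).T @ X + lam * np.eye(d)   # Hessian used by the Newton step

# Effects on the test prediction f(theta) = theta^T x_test.
actual = x_test @ (theta_removed - theta_hat)
newton = x_test @ np.linalg.solve(H_reweighted, g)
influence = x_test @ np.linalg.solve(H_full, g)
print(f"actual {actual:+.4f}  Newton {newton:+.4f}  influence {influence:+.4f}")
```

In this setting the whole gap between influence and the actual effect comes from using the full-data Hessian instead of the reweighted one.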
4.2 Characterizing the difference between the Newton approximation and influence
We next characterize the difference between the Newton approximation and influence, $\mathcal{I}^{\mathrm{Nt}}_f(w) - \mathcal{I}_f(w)$, and apply it to the overall decomposition in (6).
Theorem 1.
The difference between the influence and the actual effect is
where we define the error matrix , with . measures the error caused by third-order derivatives of and is defined in Appendix E.4. Furthermore, has eigenvalues between 0 and , and
and 
We can interpret Theorem 1 as a formalization of Hampel et al. [1986]’s observation that influence approximations are accurate when the model is robust and the curvature of the loss is low. In general, the error decreases as $\lambda$ increases and $f$ gets less curved; in Figure C.2, we show that increasing $\lambda$ reduces error and increases correlation in our experiments. Note that if we hold the per-example regularization constant, then as the fraction of removed points . However, we are interested in the setting where the removed fraction $\alpha$ is a constant.
4.3 The relationship between influence and actual effect on selfloss
Let us now apply Theorem 1 to analyze the behavior of influence under different choices of evaluation function $f$. We start with the self-loss $f(\theta) = \sum_{i=1}^{n} w_i\, \ell(x_i, y_i; \theta)$, as it is the cleanest setting in which to characterize the conditions under which the phenomena of correlation and underestimation arise:
Proposition 2.
Suppose that . Then
The cone constraint in Proposition 2 shows that influence underestimates the actual effect (up to terms, and by an amount that decreases with $\lambda$). This explains the previously unexplained downward bias observed when using influence to approximate LOOCV [Debruyne et al., 2008, Giordano et al., 2019]. Moreover, on the graph of influences vs. actual effects, all points lie within the cone bounded by the and lines (up to terms). As $\lambda$ grows, these lines will converge, and the error terms and will decay at a rate of , forcing the influences and actual effects to be tightly correlated.
However, $\lambda$ is quite small in our experiments in Section 3, so the actual correlation of influence is better than predicted by this theory: in Figure 1-Bot, the sizes of the theoretically permissible cones can be quite large, but the points in the graphs nevertheless trace a tight line through the cone.
4.4 The relationship between influence and actual effect on a test point
We turn to the change in test prediction . Here, counterexamples show we cannot obtain a similar cone constraint except in a restricted setting, and that correlation and underestimation do not always hold for this . Define and . Theorem 1 gives:
Corollary 1.
Suppose . Then , where is the error matrix from Theorem 1.
Unfortunately, Corollary 1 implies that no cone constraint applies: if the error matrix rotates relative to , then we can find such that the influence but the Newton approximation is large. Figure 5-Left shows that on synthetic data, and can even have opposite signs on some subsets ; thus, underestimation does not always hold even if the Newton error is small.
We can recover a cone constraint similar to Proposition 2 if we restrict our attention to the special case where we use a margin-based model and remove (possibly multiple copies of) a single point:
Proposition 3.
Consider a binary classification setting with and a margin-based model with loss for some . Suppose and that the subset comprises identical training points. Let be a representative training point from the subset. Then the actual effect is related to the influence according to
where . This implies that and follow the cone constraint
Similar to Proposition 2, Proposition 3 shows that when removing copies of a single point, the influence underestimates the actual effect (up to error). Moreover, on the graph of influences vs. actual effects, all points lie within the cone bounded by the and lines, plus noise. As $\lambda$ grows, the cone shrinks, and correlation increases.
However, if $\lambda$ is small (as in our experiments in Section 3), the cone is wide, and the scaling factor in Proposition 3 can be quite large for some subsets but not for others. In particular, is large when there are few remaining points in the direction of the removed points. In Figure 5-Right, we exploit this fact to show that the influence and Newton approximation can exhibit low correlation (e.g., low need not mean low ), even in the simplified setting of removing copies of single points. We comment on the analogue of in the general multiple-point setting in Appendix D.2.
4.5 Linking test prediction and test loss
We wrap up our analysis with a brief note on measuring the change in test loss. In the margin-based setting, the loss is a monotone function of the linear prediction $\theta^\top x_{\text{test}}$. Thus, measuring the change in test loss will display the same rank correlation as measuring the change in test prediction above, so the same results about correlation (or lack thereof) carry over.
However, the second-order curvature term from Theorem 1 is always nonnegative, even if the influence is negative. Under the assumption that and are both small because they decay as , this implies that underestimation is only preserved when the influence is positive, as we observed empirically in Figure 1-Mid.
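The rank-correlation argument at the start of this subsection can be checked on toy numbers. The sketch below assumes a logistic loss on the margin and a single test point with a hypothetical starting margin; the "actual" and "predicted" changes in margin are random placeholders, not experimental results. The influence on the loss is the influence on the margin times a fixed negative slope (chain rule), and the actual change in loss is a decreasing function of the actual change in margin, so both rankings are reversed together and the Spearman correlation is unchanged.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for arrays without ties: Pearson correlation of the ranks."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def logistic_loss(margin):
    return np.log1p(np.exp(-margin))

rng = np.random.default_rng(0)
margin_before = 0.3                                              # hypothetical margin before removal
pred_actual = rng.normal(size=50)                                # placeholder actual changes in margin
pred_influence = 0.5 * pred_actual + 0.2 * rng.normal(size=50)   # placeholder noisy, shrunken predictions

# Actual change in loss: a decreasing function of the change in margin.
loss_actual = logistic_loss(margin_before + pred_actual) - logistic_loss(margin_before)

# Influence on loss: d loss / d margin at the original margin, times the influence on the margin.
loss_slope = -1.0 / (1.0 + np.exp(margin_before))
loss_influence = loss_slope * pred_influence

rho_prediction = spearman_rho(pred_influence, pred_actual)
rho_loss = spearman_rho(loss_influence, loss_actual)
print(f"rho (prediction) = {rho_prediction:.4f}, rho (loss) = {rho_loss:.4f}")
```

The example says nothing about underestimation, which depends on the sign of the influence as discussed above.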
5 Applications of influence functions on natural groups of data
The analysis in Section 4 shows that the influences of groups need not always be correlated with their actual effects. Nonetheless, our experiments in Section 3 demonstrate that on real datasets, the correlation between influence and actual effects is much stronger than the theory predicts. We close by showing that we can exploit this correlation to glean insights in two case studies on the CDR and MultiNLI datasets, where groups of data arise naturally.
CDR.
The CDR dataset tackles the following task: given text about the relationship between a chemical and a disease, predict if the chemical causes the disease. It was collected via data programming, where users provide labeling functions (LFs)—instead of labels—that take in an unlabeled point and either abstain or output a heuristic label [Ratner et al., 2016]. Specifically, Hancock et al. [2018] asked annotators for natural language explanations of provided classifications; parsed those explanations into LFs; and used those LFs to label a large pool of data (Appendix B.1). We studied the effect of each LF by computing the influence of the groups of points that each LF labeled; these correlated with their actual effects on overall test loss (Spearman ; Figure C.5).
We used influence functions to study two important properties of LFs: coverage, the fraction of unlabeled training points they output a label on; and precision, the proportion of labels they output that are correct. We found that high coverage was essential for LFs to help overall test performance (Figure 8-Left), though precision was not predictive of influence (Figure 8-Mid). This suggests that in data programming, annotators should craft LFs that are broadly applicable, even if imprecise.
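Coverage and precision as defined above can be computed directly from a matrix of LF outputs. The sketch below is a toy illustration: the encoding (0 for abstain, +1 and -1 for heuristic labels), the array names, and all values are assumptions rather than the CDR data or its actual format.

```python
import numpy as np

ABSTAIN = 0  # assumed encoding: 0 = abstain, +1 / -1 = heuristic label

# Rows are training points, columns are labeling functions (toy values).
lf_outputs = np.array([
    [ 1,  0, -1],
    [ 1,  0,  1],
    [-1,  1,  0],
    [ 0,  1,  0],
])
true_labels = np.array([1, -1, -1, 1])

voted = lf_outputs != ABSTAIN                             # where each LF output a label
correct = voted & (lf_outputs == true_labels[:, None])    # where that label was right

coverage = voted.mean(axis=0)                                       # fraction of points labeled
precision = correct.sum(axis=0) / np.maximum(voted.sum(axis=0), 1)  # fraction of labels correct

for j, (c, p) in enumerate(zip(coverage, precision)):
    print(f"LF {j}: coverage={c:.2f}, precision={p:.2f}")
```

Given these per-LF statistics, the group of points labeled by each LF can then be scored by summing per-point influences on test loss, as described in the CDR paragraph above.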
MultiNLI.
The MultiNLI dataset deals with natural language inference: determining if a pair of sentences agree, contradict, or are neutral. Williams et al. [2018] presented crowdworkers with initial sentences from five genres and asked them to generate followon sentences that were neutral or in agreement/contradiction (Appendix B.2). We studied the effect that each crowdworker had on the model’s test set performance by computing the influence of the examples they created on overall test loss (Spearman of to with actual effects across different genres; see Figure C.8).
Studying the influence of each crowdworker revealed that the number of examples they created was not predictive of influence on test performance: e.g., the most prolific crowdworker contributed 35,000 examples but ended up hurting overall test performance (Figure 8-Right). Curiously, this effect was genre-specific: crowdworkers who improved performance on some genres would lower performance on others (Figure C.10), even though the number of examples a crowdworker contributed to a genre did not track their influence on it (Figure C.11). Identifying precisely what makes a crowdworker’s contributions useful could help us improve dataset collection and credit attribution as well as better understand how models transfer across genres.
6 Discussion
In this paper, we showed empirically that the influences of groups of points are highly correlated with, and consistently underestimate, their actual effects across a range of datasets, types of groups, and sizes ranging from to of the training data. Our analysis of this surprising observation reveals that while these phenomena are provably true in some restricted settings, they need not always hold in a more general and realistic setting. This gap between theory and experiments opens up important directions for future work: why do we observe such striking, even linear, correlation between predicted and actual effects on real data? To what extent is this due to the specific model, datasets, or subsets used? Our work suggests that there could be distributional assumptions that hold for real data and give rise to the broad phenomena of correlation and underestimation.
The correlation of the influence of groups with their actual effects lets us use influence functions to better understand the “different stories that different parts of the data tell,” in the words of Hampel et al. [1986]. As applications, we showed that we can gain insight into the effects of a labeling function in data programming, or a crowdworker in a crowdsourced dataset, by computing the influence of their corresponding group effects. While these applications involved predefined groups, influence functions could potentially also discover coherent, semantically relevant groups in the data. We also note that influence functions can be used to approximate Shapley values, which are a different but related way of characterizing the effect of data points; see Jia et al. [2019] and Ghorbani and Zou [2019] for discussion.
Finally, the framework of influence functions can also be applied to studying the effects of adding training points. In this context, the phenomenon of underestimation turns into overestimation, i.e., the influence of adding a group of training points tends to overestimate the actual effect of adding that group. This raises the possibility of using influence functions to evaluate the vulnerability of a given dataset and model to data poisoning attacks [Steinhardt et al., 2017].
Reproducibility
The code and data for replicating our experiments will be available on GitHub (http://bit.ly/gtgroupinfluence) and CodaLab (http://bit.ly/clgroupinfluence).
Acknowledgments
We are grateful to Zhenghao Chen, Brad Efron, Jean Feng, Tatsunori Hashimoto, Robin Jia, Stephen Mussmann, Aditi Raghunathan, Marco Túlio Ribeiro, Noah Simon, Jacob Steinhardt, and Jian Zhang for helpful discussions and comments. We are further indebted to Ryan Giordano, Ruoxi Jia, and Will Stephenson for discussion about prior work, and Samuel Bowman, Braden Hancock, Emma Pierson, and Pranav Rajpurkar for their assistance with applications and datasets. This work was funded by an Open Philanthropy Project Award. PWK was supported by the Facebook Fellowship Program.
References
 Arrieta-Ibarra et al. [2018] I. Arrieta-Ibarra, L. Goff, D. Jiménez-Hernández, J. Lanier, and E. G. Weyl. Should we treat data as labor? Moving beyond “free”. In American Economic Association Papers and Proceedings, volume 108, pages 38–42, 2018.
 Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 Chen et al. [2018] I. Chen, F. D. Johansson, and D. Sontag. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems (NeurIPS), pages 3539–3550, 2018.
 Debruyne et al. [2008] M. Debruyne, M. Hubert, and J. A. Suykens. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research (JMLR), 9:2377–2400, 2008.
 Ghorbani and Zou [2019] A. Ghorbani and J. Zou. Data Shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019.
 Giordano et al. [2019] R. Giordano, W. Stephenson, R. Liu, M. Jordan, and T. Broderick. A Swiss Army infinitesimal jackknife. In Artificial Intelligence and Statistics (AISTATS), pages 1139–1147, 2019.
 Hampel [1974] F. R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
 Hampel et al. [1986] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.
 Hancock et al. [2018] B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. In Association for Computational Linguistics (ACL), 2018.
 Hayes and Ohrimenko [2018] J. Hayes and O. Ohrimenko. Contamination attacks and mitigation in multiparty machine learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 6604–6615, 2018.
 Jaeckel [1972] L. A. Jaeckel. The infinitesimal jackknife. Unpublished memorandum, Bell Telephone Laboratories, Murray Hill, NJ, 1972.
 Jia et al. [2019] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gurel, B. Li, C. Zhang, D. Song, and C. Spanos. Towards efficient data valuation based on the Shapley value. arXiv preprint arXiv:1902.10275, 2019.
 Khanna et al. [2019] R. Khanna, B. Kim, J. Ghosh, and O. Koyejo. Interpreting black box predictions using Fisher kernels. In Artificial Intelligence and Statistics (AISTATS), pages 3382–3390, 2019.
 Koh and Liang [2017] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML), 2017.
 Koh et al. [2019] P. W. Koh, J. Steinhardt, and P. Liang. Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741, 2019.
 LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Leek et al. [2010] J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10), 2010.
 Liu et al. [2014] Y. Liu, S. Jiang, and S. Liao. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In International Conference on Machine Learning (ICML), pages 324–332, 2014.
 Metsis et al. [2006] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes – which naive Bayes? In CEAS, volume 17, pages 28–69, 2006.
 Pregibon et al. [1981] D. Pregibon et al. Logistic regression diagnostics. Annals of Statistics, 9(4):705–724, 1981.
 Rad and Maleki [2018] K. R. Rad and A. Maleki. A scalable estimate of the extra-sample prediction error via approximate leave-one-out. arXiv preprint arXiv:1801.10243, 2018.
 Ratner et al. [2016] A. J. Ratner, C. M. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems (NeurIPS), pages 3567–3575, 2016.
 Schulam and Saria [2019] P. Schulam and S. Saria. Can you trust this prediction? Auditing pointwise reliability after learning. In Artificial Intelligence and Statistics (AISTATS), pages 1022–1031, 2019.
 Steinhardt et al. [2017] J. Steinhardt, P. W. Koh, and P. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Strack et al. [2014] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore. Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 2014.
 Wang et al. [2019] H. Wang, B. Ustun, and F. P. Calmon. Repairing without retraining: Avoiding disparate impact with counterfactual distributions. arXiv preprint arXiv:1901.10501, 2019.
 Wei et al. [2015] C. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, and Z. Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pages 154–166, 2015.
 Williams et al. [2018] A. Williams, N. Nangia, and S. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Association for Computational Linguistics (ACL), pages 1112–1122, 2018.
 Zhou et al. [2019] J. Zhou, Z. Li, H. Hu, K. Yu, F. Chen, Z. Li, and Y. Wang. Effects of influence on user trust in predictive decision making. In Conference on Human Factors in Computing Systems (CHI), 2019.
Appendix A Experimental details for comparing influence vs. actual effects on constructed groups
A.1 Model training
For all experiments in Section 3, we trained a logistic regression model (or softmax for multiclass) using sklearn.linear_model.LogisticRegression.fit, fitting the intercept but only applying regularization to the weights. To choose the regularization strength, we conducted 5-fold cross-validation across 10 candidate values logarithmically spaced between and , inclusive, selecting the regularization that yielded the highest cross-validation accuracy (except on the CDR dataset, where we selected regularization based on cross-validation F1 score to account for class imbalance, following the procedure of Hancock et al. [2018]).
A.2 Group construction
For each dataset, we constructed groups of various sizes relative to the entire dataset by considering 100 sizes linearly spaced between and of the dataset. For each of these 100 sizes, we constructed one group with each of the following methods:

Shared features: We selected a single feature uniformly at random and sorted the dataset along this selected feature. Next, we selected a training point uniformly at random. We then constructed a group of the desired size that consisted of the unique training points that were closest to the chosen point, as measured by their values in the selected feature.

Feature clustering: We clustered the dataset with respect to raw features via scipy.cluster.hierarchy.fclusterdata with t set to , as well as with sklearn.cluster.KMeans.fit with n_clusters taking on values . Since hierarchical clustering determines cluster sizes automatically with a principled heuristic and we tried a range of values for n_clusters in k-means, this recovers clusters with a large range of sizes. The clustering with also guarantees (via the pigeonhole principle) that there is at least one cluster which contains at least of the dataset. From all the clusters that are at least the size of the desired group, we chose one uniformly at random and chose the group uniformly at random and without replacement from the training points in this cluster.

Gradient clustering: We followed the same procedure as “Feature clustering,” except that we clustered the dataset with respect to , i.e. each training point was represented by the gradient of the loss on that point.

Random within class: We considered all classes with at least as many training points as the desired group. From these classes, we chose one uniformly at random. Then, we chose the group uniformly at random and without replacement from all training points in this class.

Random: We picked a group uniformly at random and without replacement from the entire dataset.
The above methods gave us a total of 500 groups (100 groups per method) for each dataset, with the exception of the “random within class” method for MNIST. Since MNIST has 10 classes, each with only of the data, we skipped over groups of size just for the “random within class” groups.
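As an illustration, the "shared features" construction can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the function name `shared_feature_group`, the toy data, and the tie-breaking rule are all hypothetical.

```python
import random


def shared_feature_group(X, group_size, rng):
    """Return (feature, anchor, indices) for one "shared features" group.

    X is a list of equal-length feature rows. The group holds the group_size
    distinct training points whose value on a randomly chosen feature is
    closest to that of a randomly chosen anchor point.
    """
    n, d = len(X), len(X[0])
    feature = rng.randrange(d)   # feature chosen uniformly at random
    anchor = rng.randrange(n)    # anchor point chosen uniformly at random
    anchor_value = X[anchor][feature]
    # Rank all points by distance to the anchor along the chosen feature;
    # ties are broken by index (an arbitrary choice made for this sketch).
    ranked = sorted(range(n), key=lambda i: (abs(X[i][feature] - anchor_value), i))
    return feature, anchor, ranked[:group_size]


# Toy data (hypothetical): 50 points with 3 Gaussian features each.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
feature, anchor, group = shared_feature_group(X, group_size=10, rng=rng)
```

The anchor point itself is always in the group, since its distance to itself is zero.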
In addition, we selected 3 random test points and the 3 test points with highest loss; we intend these to represent the average case and the more extreme case that may be relevant to model developers who want to debug errors that their model outputs. For each of these 6 test points, we selected groups that had large positive influence on its test loss. More specifically, we proceeded in 3 stages:

We considered 33 group sizes linearly spaced between and of the dataset, and for any size out of these 33, we selected a group uniformly at random and without replacement from training points in the top of the dataset, ordered according to their influence on the test point of interest.

This was similar to the first stage, but with 33 sizes spaced between and and groups chosen from the top of the dataset.

Finally, we considered 34 sizes spaced between and , with groups chosen from the top of the dataset.
Larger groups tend to have lower average influence than smaller groups, since by necessity, we need to construct larger groups out of training points further from the top. This multi-stage approach ensured that we would select small groups with both a high average influence and also with a low average influence, so that we could compare them to larger groups and mitigate confounding the group size with its average influence.
Finally, we repeated this last method of group construction for groups with large negative influence on test point loss.
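The sampling step shared by the three stages can be sketched as below. This is an illustrative sketch only: the helper `sample_from_tail` is hypothetical, and the pool fractions and group sizes in the usage lines are placeholder values rather than the stage boundaries used in the experiments.

```python
import random


def sample_from_tail(influences, group_size, top_fraction, rng, negative=False):
    """Sample group_size indices uniformly, without replacement, from the
    top_fraction of training points ranked by influence on one test point.

    With negative=True, points are ranked by most negative influence instead.
    """
    n = len(influences)
    order = sorted(range(n), key=lambda i: influences[i], reverse=not negative)
    # The pool is never smaller than the group (a guard added for this sketch).
    pool = order[:max(group_size, int(top_fraction * n))]
    return rng.sample(pool, group_size)


rng = random.Random(0)
influences = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical influences

# Placeholder fractions: a small group from a narrow pool, the same size from a
# wide pool, and a small group from the negative tail.
small_high = sample_from_tail(influences, 10, 0.05, rng)
small_low = sample_from_tail(influences, 10, 0.50, rng)
small_negative = sample_from_tail(influences, 10, 0.05, rng, negative=True)
```

Drawing the same group size from pools of different widths is what lets small groups span both high and low average influence.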
Using these 6 test points, we generated 1,200 groups (100 group sizes for each of the 6 test points, drawn from both the positive and the negative tails: 100 × 6 × 2). In total, we therefore generated 1,700 groups per dataset (except MNIST).
A.3 Comparison of influence and actual effect
To produce Figure 1, we selected groups as described in Appendix A.2. We retrained the model once for each group, excluding the relevant group in order to calculate its actual effect. To compute all groups’ influences, we first calculated the influence of every individual training point using the procedure of Koh and Liang [2017]. Then, to compute the influence on test prediction or loss of some group, we simply added the relevant individual influences (in CDR, we weighted these individual influences according to that point’s weight; see Appendix B.1). To compute the influence on self-loss of some group, we summed the gradients of the loss of each training point in the group, calculated the inverse-Hessian-vector product of this summed gradient, and took its dot product with the same summed gradient (again, we modified this with appropriate weighting for individual points in CDR).
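The two computations described above can be illustrated on a toy problem. The sketch below is not the authors' implementation: the Hessian, gradients, and helper names are hypothetical, and sign and scaling conventions (which depend on the definitions in the main text) are omitted. It shows that influence on a test quantity is additive across the points in a group, whereas the self-loss influence is a quadratic form in the summed gradient and therefore is not.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy inputs (all hypothetical): a positive-definite Hessian, per-point
# training gradients, and the gradient of a test-point evaluation function.
H = [[2.0, 0.5], [0.5, 1.0]]
train_grads = [[0.3, -0.1], [-0.2, 0.4], [0.1, 0.2]]
test_grad = [1.0, -1.0]
group = [0, 2]

# Test prediction/loss: the group influence is the sum of individual influences.
individual = [dot(test_grad, solve(H, g)) for g in train_grads]
group_influence = sum(individual[i] for i in group)

# Self-loss: sum the group's gradients first, then form g^T H^{-1} g.
g_sum = [sum(train_grads[i][k] for i in group) for k in range(2)]
self_influence = dot(g_sum, solve(H, g_sum))

# Summing per-point self-influences misses the cross terms between points.
naive_self = sum(dot(train_grads[i], solve(H, train_grads[i])) for i in group)
```

In practice a library linear solver would replace the hand-written `solve`; it is spelled out here only to keep the sketch dependency-free.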
Appendix B Dataset details
We used the same versions of the Diabetes, Enron, Dogfish, and MNIST datasets as Koh and Liang [2017], since the examination of the accuracy of influence functions for large perturbations is a natural extension of their studies of small perturbations. Additionally, we applied influence to more natural settings in CDR and MultiNLI; here, we discuss their preprocessing pipelines.
B.1 CDR
Hancock et al. [2018] established the BabbleLabble framework for data programming, which follows this pipeline: they took labeled examples with natural language explanations, parsed the explanations into programmatic LFs via a semantic parser, and filtered out obviously incorrect LFs. Then, they applied the remaining LFs to unlabeled data to create a sparse label matrix, from which they learned a label aggregator that outputs a noisily labeled training set. Finally, they ran regularized logistic regression on a set of basic linguistic features with the noisy labels.
They demonstrated their method on three datasets: Spouse, CDR, and Protein. The Protein dataset was not publicly available, and the vast majority of Spouse was labeled by a single LF, hence we chose to use CDR. This dataset’s associated task involved identifying whether, according to a given sentence, a given chemical causes a given disease. The sentences and ground truth labels were sourced from the 2015 BioCreative chemicaldisease relation dataset [Wei et al., 2015].
In our application, we began with their 28 LFs and the corresponding label matrix. For simplicity, we did not learn a label aggregator; instead, if an example was given labels by LFs , then we created copies of , each with weight . The subset of points corresponding to LF then included one instance of with weight . This weighting was taken into account in model training as well as in calculations of influence and actual effect. In addition, we used regularization to reduce the number of features to 330 while still achieving similar F1 score to [Hancock et al., 2018]; they reported an F1 of 42.3, while we achieved 42.2.
We note that in BabbleLabble, a given LF can never output positive on one example but negative on another. Hence, some LFs are positive (unable to output negative and only able to abstain or output positive), while the others are negative (unable to output positive and only able to abstain or output negative).
B.2 MultiNLI
Williams et al. [2018] created the MultiNLI dataset for the task of natural language inference: determining if a pair of sentences agree, contradict, or are neutral. To do so, they presented crowdworkers with initial sentences and asked them to generate followon sentences that were neutral or in agreement/contradiction. Thus, each of the 380 crowdworkers generated a subset of the dataset. We used these subsets in our application of influence.
The training set consisted of 392,702 examples from five genres. The development set consisted of 10,000 “matched” examples from the same five genres as the training set, as well as 10,000 “mismatched” examples from five new genres. The test set was put on Kaggle as an open competition, hence we do not have its labels and could not use it; instead, the development set was treated as the test set.
The continuous bag-of-words baseline in Williams et al. [2018] first converted the raw text of each sentence in the pair into a vector by treating the sentence as a continuous bag of words and simply averaging the 300D GloVe vector embeddings. This converted a pair of sentences into vectors . They then concatenated into a 1200D vector, where denotes the element-wise product. Finally, they treated this as input to a neural net with three hidden layers and fine-tuned the entire model, including word embeddings (more details in [Williams et al., 2018]).
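A rough sketch of this kind of sentence-pair representation is given below, using 3-dimensional toy vectors in place of 300D GloVe embeddings. The averaging step and the element-wise product follow the description above; the difference block is an assumption made here so that the output has four blocks (matching 1200 = 4 × 300), and the function names are hypothetical.

```python
def average(vectors):
    """Average a list of equal-length word vectors (continuous bag of words)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pair_features(premise_vectors, hypothesis_vectors):
    """Concatenate four blocks built from the two averaged sentence vectors."""
    u = average(premise_vectors)
    v = average(hypothesis_vectors)
    diff = [a - b for a, b in zip(u, v)]   # ASSUMED block (not stated in the text)
    prod = [a * b for a, b in zip(u, v)]   # element-wise product
    return u + v + diff + prod


# Toy word vectors (hypothetical) for a two-word and a one-word sentence.
premise = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
hypothesis = [[1.0, 0.0, -1.0]]
features = pair_features(premise, hypothesis)
```

With 300D embeddings, the same function would return a 1200-dimensional vector.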
For our application, we truncated their baseline and just used the concatenation of and as the representation for every example. By running logistic regression on this, we achieved test accuracy of (vs. their baseline’s ). Future work could explore influence in the setting of more complex and higher-performing models.
Appendix C Additional experiments
C.1 Representative test points
C.2 Regularization
In Section 4, our bounds show that influence ought to be closer to actual effect as regularization increases. Here, we support this claim empirically on Diabetes, Enron, Dogfish, and MNIST (small) (see footnote 4). To do so, for each dataset, we selected a range of values for the regularization strength, and we selected subsets as described in Appendix A.2. We then computed the influence and actual effect of each of these subsets on a representative test point’s prediction, that point’s loss, and on self-loss (Figure C.2).
Footnote 4: This experiment required us to retrain the model for every value of the regularization strength and for every subset. Thus, for computational purposes, we omitted CDR and MultiNLI, and we selected a random subset of MNIST’s training set to use in place of all of MNIST.
In Figure C.3, we observe the trend that correlation generally increases with the regularization strength. Specifically, we computed the Spearman rank correlation between the influence and actual effect for each dataset, each value of the regularization strength, and each evaluation function of interest (i.e., test prediction, test loss, or self-loss).
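For reference, the Spearman rank correlation is the Pearson correlation of the two rank vectors, so it rewards agreement in ordering even when the magnitudes differ substantially. A minimal, dependency-free sketch follows; the influence and actual-effect values are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: identical ordering, very different magnitudes.
influence = [0.1, 0.4, 0.2, 0.9]
actual = [0.3, 1.6, 0.5, 7.0]
rho = spearman(influence, actual)
```

Here the influence underestimates every actual effect, yet the rank correlation is perfect because the ordering of the groups is preserved.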
C.3 The effect of loss curvature on the accuracy of influence
One takeaway from the results on test loss in Figure 1-Mid is that the curvature of the evaluation function can significantly increase approximation error; this is expected since the influence approximation linearizes the evaluation function around the original parameters. When possible, choosing an evaluation function that has low curvature (e.g., the linear prediction) will result in higher accuracy. For a curved evaluation function, we can mitigate this error by using influence to approximate the parameters and then plugging that estimate into the evaluation function (Figure C.4), though this can be more computationally expensive.
Note that Figure C.4 shows that this technique does not help much for measuring self-loss. However, in the context of LOOCV, the computational complexity of the Newton approximation for self-loss (described in Section 4) is similar to that of the influence approximation, so we encourage the use of the Newton approximation for LOOCV (as in Rad and Maleki [2018]); Figure C.4 shows that this leads to more accurate approximations for self-loss.
C.4 Additional analysis of influence functions applied to natural groups of data
In Section 5, we considered the CDR and MultiNLI datasets, which contain the natural subsets of LFs and crowdworkers, respectively. To draw inferences about these subsets, we took the regularized logistic regression model described in Appendix A, calculated the influence of the LF/crowdworker subsets, and retrained the model once for each LF/crowdworker.
CDR.
As discussed in Appendix B.1, an LF is either positive or negative, where a positive LF can only give positive labels or abstain, and similarly for negative LFs. Because of this stark class separation, we indicate whether an LF is positive or negative, and we consider LF influence on the positive test examples separately from their influence on the negative test examples. To measure an LF’s influence and actual effect on a set of test points, we simply add up its influence and actual effect on the set’s individual test points.
In Figure C.5, we note that influence is a good approximation of an LF’s actual effect, just as with other kinds of subsets as well as other datasets (Figure 1). Furthermore, we observe that positive LFs improve the overall performance of the positively labeled portion of the test set while hurting the negatively labeled portion of the test set, and vice versa for negative LFs. This dichotomous effect further motivates the analysis of influence on the positive test set separately from the negative test set, since the process of adding these two influences to study the influence on the entire test set would obscure the full story.
Next, we define an LF’s coverage to be the proportion of the examples that it does not abstain on, which can be measured through the number of examples in its corresponding subset. In Figure C.6, we observe that the magnitude of influence correlates strongly with coverage.
Finally, we define an LF’s precision to be the number of examples it labels correctly divided by the number of examples it does not abstain on. Because the dataset had many more negative than positive examples, positive LFs had lower precision than negative LFs. Surprisingly, even when this effect was taken into account and we considered positive LFs separately from negative ones, precision did not correlate with influence (Figure C.7).
We conclude that annotators should aim for heuristic LFs with high coverage, not high precision.
MultiNLI.
As discussed in Appendix B.2, the training set consisted of five genres, and the test set consisted of a matched portion with the same five genres, as well as a mismatched portion with five new genres. For succinctness, we refer to the influence/actual effect of the set of examples generated by a single crowdworker as that crowdworker’s influence/actual effect.
First, we note in Figure C.8 that influence is a good approximation of a crowdworker’s actual effect for both matched and mismatched test sets, consistent with our findings in Figure 1 for other subset types and datasets.
Unlike in CDR (Figure C.6), we do not find strong correlation between a crowdworker’s influence and the number of examples they contributed; it is possible to contribute many examples but have relatively little influence (Figure C.9).
The most prolific crowdworker contributed 35,000 examples and had large negative influence on the test set. A closer analysis revealed that they had positive influence on the fiction genre but lowered performance on many other genres, despite contributing roughly equally to each genre. This genre-specific trend tended to hold more broadly among the workers: there appear to be two categories of genres (fiction, facetoface, nineeleven vs. travel, government, verbatim, letters, oup) such that each worker tended to have positive influence on all genres in one category and negative influence on all genres in the other (Figure C.10). Moreover, the number of examples a worker contributed to a given genre was not a good indicator for their influence on that genre (Figure C.11).
Appendix D Additional analysis on influence vs. actual effect on a test point
D.1 Counterexamples
For Figure 5, we constructed two binary datasets in which the influence of a certain class of subsets on the test prediction of a single test point exhibits pathological behavior.
Rotation effect.
In Figure 5-Left, our aim was to show that there can be a dataset with subsets such that the cone constraint discussed in Section 4.4 does not hold.
The rotation effect described in Corollary 1 is due to the angular difference between the change in parameters predicted by the influence approximation, , and the change in parameters predicted by the Newton approximation, . If and are linearly independent, then for any pair of target values , we can find some such that and .
To exploit this, we constructed the MoG dataset as an equal mixture of two standard (identity covariance) Gaussian distributions in , one for each class, and with means and , respectively. In particular:

We sampled examples from each class for a total of training points, and set the regularization strength .

We then computed and for each pair of training points and chose the pairs of training points with the largest angles between and .

Finally, we solved a least-squares optimization problem to find for which and are approximately decorrelated.
Note that we adversarially chose which subsets to study in this counterexample, since our main goal was to show that there existed subsets for which the cone constraint did not hold. For the next counterexample, we instead study all possible subsets in the restricted setting of removing copies of single points.
Scaling effect.
In Figure 5-Right, our aim was to construct a dataset such that even if we only removed subsets comprising copies of single distinct points, a low influence need not translate into a low actual effect.
To do so, we constructed the Ortho dataset that contains 2 repeated points of opposite classes on each of the 2 canonical axes of (for a total of 4 distinct points). By varying their relative distances from the origin, we can control the influence of removing one of these points as well as the rate at which the scaling factor from Proposition 3 grows as we remove more copies of the same point. Furthermore, because the axes are orthogonal, we can control independently for each repeated point. We fix the test point . Maximizing for one axis and minimizing it for the other produces the two distinct lines in Figure 5-Right.
D.2 Scaling effects when removing multiple points
In the general setting of removing subsets of different points, the analogous failure case to a varying scaling factor (Figure 5-Right) is the varying scaling effect that the error matrix in Theorem 1 can have on different subsets . The range of this effect is bounded by the spectral norm of . This norm is precisely equal to in the single-point setting, and it is large when we remove a subset whose Hessian is almost as large as the full Hessian in some direction. As with , the spectral norm of decreases with (Theorem 1), so as regularization increases, we expect that the influence of a group will track its actual effect more accurately.
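A one-dimensional ridge regression gives a concrete feel for this scaling behavior. The sketch below uses a toy setup of our own choosing (the objective, data, and function names are hypothetical, and the normalization may differ from the one used in the paper). In it, the influence estimate divides the removed gradient by the full Hessian, the one-step Newton estimate divides by the Hessian that remains after removal, and the Newton estimate is exact because the loss is quadratic.

```python
def ridge_1d(points, lam):
    """Minimizer of sum_i 0.5 * (x_i * theta - y_i)**2 + 0.5 * lam * theta**2."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, y in points)
    return sxy / (sxx + lam)


def removal_estimates(points, group, lam):
    """Actual, influence, and one-step Newton parameter changes for removing group."""
    theta = ridge_1d(points, lam)
    kept = [p for i, p in enumerate(points) if i not in group]
    removed = [p for i, p in enumerate(points) if i in group]
    g_removed = sum(x * (x * theta - y) for x, y in removed)  # gradient of removed losses
    h_full = sum(x * x for x, y in points) + lam              # Hessian of the full objective
    h_removed = sum(x * x for x, y in removed)                # Hessian of the removed losses
    actual = ridge_1d(kept, lam) - theta
    influence = g_removed / h_full                # uses the full Hessian
    newton = g_removed / (h_full - h_removed)     # uses the Hessian after removal
    return actual, influence, newton


# Toy data (hypothetical); the removed point carries most of the curvature.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (0.5, 0.0)]
group = {2}
ratios = []
for lam in [0.1, 10.0, 1000.0]:
    actual, influence, newton = removal_estimates(points, group, lam)
    ratios.append(actual / influence)   # equals h_full / (h_full - h_removed)
```

In this toy problem the ratio of actual effect to influence is h_full / (h_full - h_removed), which is large when the removed point accounts for most of the Hessian and shrinks toward 1 as the regularization strength grows.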
Appendix E Proofs
E.1 Notation
We first review the notation given in Section 2 and introduce new definitions that will be useful in the sequel. We define the empirical risk as
such that the optimal parameters are .
Given sample weight vectors , we define the derivatives
If the argument is omitted, it is assumed to be equal to . For example,
If has a subscript, then we add . For example,
For a given dataset, we define the following constants:
To avoid confusion with the vector 2norm, we will use the operator norm to denote the matrix 2norm.
In the sequel, we study the order3 tensor . We define its product with a vector (which returns a matrix) as a contraction along the last dimension:
E.2 Assumptions
We make the following assumptions on the derivatives of and .
Assumption 1 (Positive-definiteness and Lipschitz continuity of ).
The loss is convex and twice-differentiable, with positive regularization . Further, there exists such that
for all and . This is a bound on the third derivative of , if it exists.
Assumption 2 (Bounded derivatives of ).
is thrice-differentiable, with such that
These assumptions apply to all the results that follow below.
E.3 Bounding the error of the one-step Newton approximation
Proposition 1.
Let the Newton error be . Then
only involves third-order or higher derivatives of the loss, so it is 0 for quadratic losses.
Proof.
This proof is adapted to our setting from the standard analysis of the Newton method in convex optimization [Boyd and Vandenberghe, 2004].
First, note that . We will bound the norm of the difference of the parameters ; the desired bound on then follows from the assumption that has gradients bounded by and is therefore Lipschitz.
Since is strongly convex (with parameter ) and minimized by , we can bound the distance in terms of the norm of the gradient at :