# Learning Pairwise Interactions with Bayesian Neural Networks

###### Abstract

Estimating pairwise interaction effects, i.e., the difference between the joint effect and the sum of the marginal effects of two input features, with properly quantified uncertainty, is centrally important in science applications. We propose a non-parametric probabilistic method for detecting interaction effects of unknown form. First, the relationship between the features and the output is modeled using a Bayesian neural network, leveraging the representational capacity of deep neural networks. Second, interaction effects and their uncertainty are estimated from the trained model. For the second step we propose a simple and intuitive global interaction measure: the Bayesian Expected Integrated Hessian (EIH), whose uncertainty can be estimated from the predictive uncertainty. Two important properties of the Bayesian EIH are: 1. the interaction estimation error is upper bounded by the prediction error of the neural network, which ensures that interaction detection can be improved by training a more accurate model; 2. the uncertainty of the Bayesian EIH is well-calibrated provided the prediction uncertainty is calibrated, which is easier to test and guarantee. The method outperforms the available alternatives on simulated and real-world data, and we demonstrate its ability to detect interpretable interactions also between higher-level features (at deeper layers of the neural network).


Tianyu Cui Pekka Marttinen Samuel Kaski Department of Computer Science Aalto University Espoo, Finland 02150 {tianyu.cui, pekka.marttinen, samuel.kaski}@aalto.fi

Preprint. Under review.

## 1 Introduction

Estimating interactions between variables, and the uncertainty of the interactions, is a challenge common to many data science tasks, and particularly important in genomics for example. Having the right interactions in a model makes it more understandable and interpretable. At the simplest this could mean that the output $y$ depends on inputs $x_1$ and $x_2$ through

$$y = \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2 + \epsilon, \qquad (1)$$

where $\beta_1$ and $\beta_2$ represent the main effects, $\beta_{12}$ is the strength of the interaction, and $\epsilon$ is noise. In this example, the shape of the interaction is known (multiplicative), but this is not true in general. Estimating the uncertainty is equally important, to assess the statistical significance of the detected interactions. Traditional methods from the statistics community (e.g., ANOVA-based methods [5, 28]) usually lack statistical power due to multiple testing. Alternatively, all possible interactions have to be pre-specified (e.g., in the Lasso-based methods of [2, 14, 15]), which restricts the type of interactions that can be considered to, for example, the multiplicative form in Eq. 1.

In this paper we study a different approach, in which the representation learning power of a neural network is used to model the set of interactions directly from the data. An intuitive way to learn interaction effects is: first train a neural network on the data, then find the encoded interactions by interpreting the trained model. However, the currently available algorithms that are both interpretable and aim to recover interactions [9, 16, 23, 26] neglect uncertainty, which is not acceptable in critical fields such as healthcare or autonomous driving. Another limitation is that the error in modeling and the error in interaction detection are intertwined, and we generally do not know the true interaction effects in practice. It is consequently not clear how to assess the detected interactions, or how to improve the procedure when the learned interactions do not follow human intuition.

In this work, we propose the Bayesian Expected Integrated Hessian (EIH) to estimate interactions using a trained Bayesian neural network. The posterior distribution of the EIH represents the uncertainty of the interaction measure, and it can be seen as a non-parametric analogue of the posterior distribution of $\beta_{12}$ in Eq. 1, which could be learned using a Bayesian linear regression model. Two significant theoretical advantages of the Bayesian EIH are: 1. the interaction estimation error is linearly upper bounded by the prediction error of the learning model, which ensures that estimation can be improved by training a more accurate model; 2. the uncertainty of the Bayesian EIH can be made well-calibrated by improving the calibration of the distribution of predictions. These properties indicate that we can improve the detected interactions by simply improving the predictive model, which is more flexible and easier, because the targets are available to us in supervised learning.

## 2 Related Work

Interaction detection algorithms have been proposed in different communities. Tsang et al. [26] proposed Neural Interaction Detection (NID) to learn interactions by inspecting the weights of a neural network: if some features are connected to the same node in the first hidden layer with large weights, and if the path(s) from the node to the output also have large weights, then an interaction between the features is detected. An interpretable neural network architecture [27] that contains interaction information has also been proposed based on this idea. These methods can detect higher-order interactions of unrestricted forms, but they can only be applied to fully connected vanilla multi-layer perceptrons, not to convolutional neural networks for instance, which limits their usage in applications such as computer vision. Deep Feature Interaction Map [9] also makes use of the Hessian, but it is designed for discrete features only, such as DNA sequences. Lundberg et al. [16] introduced the SHAP interaction score, which extends the SHAP value of game theory to detect interactions, and it is efficiently implemented in ensemble tree methods. Additive Groves [23] is another tree-based method, which compares the difference in performance of two regression trees, one with all interactions, and the other with the interaction of interest removed. However, none of these methods estimates the uncertainty of the learned interactions.

Bayesian Neural Networks (BNN) [17, 19] have been studied since the 90s, but they have not been generally applicable to large datasets until scalable variational inference algorithms [11, 13, 21] were proposed. Recent works [6, 12] show that the standard dropout training in NNs can be interpreted as a form of variational inference, and that the prediction uncertainty can be obtained by simply using dropout during testing. This approach can be applied to most neural networks, such as RNNs [7]. By optimizing dropout probabilities, Gal et al. [8] obtain better-calibrated uncertainties, and Molchanov et al. [18] learn a sparse neural network. In this work, we use the concrete dropout from [8], and extend the model by including a separate component for main effects to better capture interactions with uncertainty.

## 3 Modeling Interactions and their Uncertainty

To learn interactions and their uncertainty, we first need a probabilistic model that is able to automatically incorporate all kinds of interactions. Bayesian neural networks (BNNs) offer one option that can both model a rich family of functions and capture uncertainty.

### 3.1 Practical Bayesian Neural Network

Bayesian neural networks are defined by placing a prior distribution $p(W)$ on the weights $W$ of a neural network. Then, instead of finding a point estimate of the weights by minimizing a cost function, a posterior distribution of the weights is calculated conditionally on the data. Let $f^W(x)$ denote the output of a BNN and $p(y \mid f^W(x))$ the likelihood. Then, given a dataset $X = \{x_1, \ldots, x_N\}$, $Y = \{y_1, \ldots, y_N\}$, training a BNN is equivalent to learning the posterior distribution $p(W \mid X, Y)$. Variational inference can be used to approximate the intractable $p(W \mid X, Y)$ with a simpler distribution, $q_\theta(W)$, by minimizing the $\mathrm{KL}\big(q_\theta(W) \,\|\, p(W \mid X, Y)\big)$. This is equivalent to minimizing the negative ELBO

$$\mathcal{L}(\theta) = -\sum_{n=1}^{N} \int q_\theta(W) \log p\big(y_n \mid f^W(x_n)\big)\, dW + \mathrm{KL}\big(q_\theta(W) \,\|\, p(W)\big). \qquad (2)$$

According to recent research on dropout variational inference [6], a practical Bayesian neural network for a wide variety of architectures can be obtained by simply training a neural network with dropout, and interpreting this as being equivalent to minimizing Eq. 2. Then, samples from the posterior distribution of predictions can be obtained by performing dropout at test time (MC dropout). Gal et al. [8] proposed an extension, concrete dropout, to learn a different Bernoulli dropout probability for each layer to obtain better-calibrated uncertainties. In this case the variational distribution is defined as $w_{l,k} = m_{l,k} \cdot z_{l,k}$ with $z_{l,k} \sim \mathrm{Bernoulli}(1 - p_l)$, where $p_l$ is the dropout probability for layer $l$, and $w_{l,k}$ is the vector of outgoing weights from node $k$ in layer $l$, with variational parameter $m_{l,k}$. Eq. 2 can be optimized w.r.t. the variational parameters by using the re-parametrization trick and concrete relaxation [8]. In this paper, we adopt concrete dropout BNNs to model uncertainty. However, instead of learning a single dropout probability for each layer, we learn a separate dropout probability for each node.
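As an illustration of MC dropout at test time, the following sketch draws predictive samples from a small fixed network by resampling Bernoulli masks at each forward pass. The architecture, weights, and dropout rate here are arbitrary stand-ins, not the trained model used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed two-layer ReLU network; weights are arbitrary, for illustration only.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def predict_mc_dropout(x, p_drop=0.1, n_samples=200):
    """Sample predictions by applying Bernoulli dropout masks at test time."""
    h = np.maximum(x @ W1, 0.0)                       # hidden activations
    preds = []
    for _ in range(n_samples):
        mask = rng.random(h.shape) > p_drop           # keep units with prob 1-p
        preds.append((h * mask / (1 - p_drop)) @ W2)  # inverted-dropout scaling
    preds = np.stack(preds)                           # shape (S, n, 1)
    return preds.mean(axis=0), preds.std(axis=0)      # predictive mean and std

x = rng.normal(size=(3, 4))
mean, std = predict_mc_dropout(x)
```

The spread of the sampled predictions serves as the (epistemic) predictive uncertainty used throughout the paper.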

### 3.2 Separating Interactions from Main Effects

In practice it is beneficial to model the main effects separately from the interactions. We will use linear regression, $g(x) = \beta^T x$, for the main effects, and a BNN, $f^W(x)$, to capture the interactions. A prediction from this hybrid model is the sum of the two components, $\hat{y} = \beta^T x + f^W(x)$, such that the objective function in Eq. 2 can be rewritten as

$$\mathcal{L}(\theta, \beta) = -\sum_{n=1}^{N} \int q_\theta(W) \log p\big(y_n \mid \beta^T x_n + f^W(x_n)\big)\, dW + \mathrm{KL}\big(q_\theta(W) \,\|\, p(W)\big). \qquad (3)$$

If the uncertainty of the main effects is of interest, a prior may be placed also on $\beta$. The motivation for this model structure stems from the fact that in real-world data the main effects usually dominate the interactions. A BNN with a small capacity would likely capture the main effects only, unless they are modeled separately. By separating the main effects we can use a much smaller BNN to learn the interactions.
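A prediction from the hybrid model can be sketched as follows; the linear coefficients and the small network standing in for $f^W$ are arbitrary illustrations, not learned components:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical components: a linear part for main effects and a small
# network standing in for the BNN interaction part (weights arbitrary).
beta = np.array([1.0, -2.0, 0.5])
W1 = rng.normal(size=(3, 8))
W2 = rng.normal(size=(8, 1))

def f_bnn(x):
    """Stand-in for the interaction network f^W(x)."""
    return np.tanh(x @ W1) @ W2

def predict(x):
    """Hybrid prediction: linear main effects plus network interactions."""
    return x @ beta + f_bnn(x).ravel()

x = rng.normal(size=(5, 3))
y_hat = predict(x)
```

In the actual model both parts are trained jointly by minimizing Eq. 3, so the BNN only needs to explain what the linear part cannot.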

## 4 Detecting Interactions

After a BNN has been learned from data, interactions must be detected from the BNN. Previously, the gradient of $f^W(x)$ w.r.t. the input has been used to detect main effects from a neural network [1, 10, 22, 24]. We extend these works and formulate the Bayesian Expected Integrated Hessian (EIH), which quantifies interactions with uncertainty, and can be easily calculated for any given BNN.

### 4.1 Bayesian Expected Integrated Hessian

An intuitive way to measure interactions is to use the Hessian of $f^W(x)$ w.r.t. the input. For the simple multiplicative interaction in Eq. 1 this recovers the coefficient $\beta_{12}$. For non-multiplicative interactions the Hessian is not constant and represents the interaction only at the point at which it was calculated, making it a local interaction measure. To estimate a regional interaction between features $x_i$ and $x_j$, we first specify the region of interest $\Omega_{ij} = [l_i, u_i] \times [l_j, u_j]$, and then define the Expected Integrated Hessian, $\omega_{ij}$, as

$$\omega_{ij} = \mathbb{E}_{x_{\setminus ij} \sim \hat{p}(x_{\setminus ij})} \left[ \int_{\Omega_{ij}} \frac{\partial^2 f^W(x)}{\partial x_i\, \partial x_j}\, dx_i\, dx_j \right], \qquad (4)$$

where $\hat{p}(x_{\setminus ij})$ is the empirical distribution of all the other features except $x_i$ and $x_j$. Hence, $\omega_{ij}$ is the Hessian averaged over the region of interest, and further an expectation is taken over plausible values of the other features. Integration in dimensions $i$ and $j$ is defined over a uniform distribution (save for the normalizing constant), rather than the empirical distribution used in the other dimensions, which guarantees certain theoretically attractive properties (Section 4.2) and efficient inference with few MC samples (Section 4.3). For the rest of the paper, we assume the region of interest to be the full domain of $(x_i, x_j)$, making Eq. 4 a global interaction measure. However, it is possible to focus the analysis on a subregion or even aggregate information across multiple subregions, which may improve detection of certain types of interactions (see Appendix). In practice we normalize all inputs to have the same range. We note that EIH can be seen as an extension of integrated gradients, which have previously been used to study main effects [24].

Here $W$ is the weight of a BNN, i.e., a random variable. Therefore, $\omega_{ij}$ is also a random variable, whose distribution follows from the posterior distribution of $W$. Thus we call it the Bayesian Expected Integrated Hessian.

### 4.2 Theoretical Properties

Here we interpret $\omega_{ij}$ as a difference between a joint effect (as defined below) and the sum of marginal effects. Based on this interpretation, we derive two favourable properties of this interaction measure. We first introduce some notation. Let $x_{i \to a}$ denote a modified vector $x$ such that feature $x_i$ is set to value $a$. For simplicity, the index of the feature to reset is indicated by the subscript of the argument. Similarly, we define $x_{i \to a,\, j \to b}$ to be a vector with $x_i$ set to $a$ and $x_j$ set to $b$. With this notation, we define the marginal effect of $x_i$ at $x$ as

$$h_i(x) = f^W(x_{i \to u_i}) - f^W(x_{i \to l_i}). \qquad (5)$$

Eq. 5 is equal to the change in the value of the function when $x_i$ is changed from its smallest value $l_i$ to its largest value $u_i$ and the other features are kept fixed at $x$. The marginal effect $h_j(x)$ of $x_j$ is defined similarly. Analogously, the joint effect of $x_i$ and $x_j$ at $x$ is defined as

$$h_{ij}(x) = f^W(x_{i \to u_i,\, j \to u_j}) - f^W(x_{i \to l_i,\, j \to l_j}), \qquad (6)$$

and is interpreted as the change in $f^W$ when both features are changed at the same time from their smallest to their largest values. Now, a straightforward integration shows that

$$\omega_{ij} = \mathbb{E}_{x_{\setminus ij} \sim \hat{p}(x_{\setminus ij})} \left[ h_{ij}(x) - h_i(x_{j \to l_j}) - h_j(x_{i \to l_i}) \right]. \qquad (7)$$

Hence, we see that $\omega_{ij}$ has an intuitive interpretation as an expectation of the difference between a joint effect and the sum of marginal effects. Based on the above derivation, we can show two important properties (proofs are given in the Appendix):
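The integration step can be made explicit. Writing $x_{i \to a}$ for $x$ with feature $i$ set to $a$, applying the fundamental theorem of calculus twice gives

$$\int_{l_i}^{u_i}\!\int_{l_j}^{u_j} \frac{\partial^2 f^W(x)}{\partial x_i\, \partial x_j}\, dx_j\, dx_i = f^W(x_{i \to u_i,\, j \to u_j}) - f^W(x_{i \to u_i,\, j \to l_j}) - f^W(x_{i \to l_i,\, j \to u_j}) + f^W(x_{i \to l_i,\, j \to l_j}),$$

and expanding the joint effect minus the two marginal effects (evaluated at the lower corner) yields exactly the same four-point expression, since the terms $f^W(x_{i \to l_i,\, j \to l_j})$ from the marginal effects cancel against each other.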

Property 1 Suppose $f$ is the true data generating mechanism and $\hat{f}$ is the learned model. Then the estimation error in the interaction measure, $|\omega_{ij}(f) - \omega_{ij}(\hat{f})|$, is linearly upper bounded by the prediction error of $\hat{f}$.

Property 2 When $\hat{f}$ is a probabilistic model (e.g., a BNN), we can make the uncertainty of the Bayesian EIH arbitrarily well-calibrated by improving the calibration of the distribution of predictions from $\hat{f}$.

Property 1 guarantees that by training a better model with a smaller prediction error, the estimate of $\omega_{ij}$ will have a tighter upper bound on its error, which improves the learning of interactions. This property does not generally hold for other interaction measures, such as NID [26]. Testing whether the uncertainties of interaction effects are calibrated is almost impossible in practice, because of the lack of ground truth. Property 2 provides a way to assess the uncertainty calibration of interaction measures by checking the calibration of the predictive uncertainty, which is easier because we have the targets. Property 2 also indicates that the uncertainty of interactions can be estimated more accurately by improving the probabilistic model.

### 4.3 Monte Carlo Estimation of Bayesian EIH

We estimate the mean and variance of the Bayesian EIH by Monte Carlo (MC) integration. Unbiased estimators for the mean and variance follow from Eq. 4 and Eq. 7. To also account for posterior uncertainty in the model parameters we approximate two nested expectations, one w.r.t. the posterior distribution $q_\theta(W)$, and the other w.r.t. the empirical distribution of the other features, as in Eq. 7. The resulting MC estimates are

$$\bar{\omega}_{ij} = \frac{1}{S} \sum_{s=1}^{S} \hat{\omega}_{ij}(W_s), \qquad \hat{\sigma}^2_{ij} = \frac{1}{S-1} \sum_{s=1}^{S} \big( \hat{\omega}_{ij}(W_s) - \bar{\omega}_{ij} \big)^2, \qquad (8)$$

where $W_s \sim q_\theta(W)$. The sampling of $\hat{\omega}_{ij}(W_s)$ can be based either on Eq. 7 or Eq. 4, resulting correspondingly in

$$\hat{\omega}_{ij}(W_s) = \frac{1}{M} \sum_{m=1}^{M} \left[ h_{ij}(x^{(m)}) - h_i(x^{(m)}_{j \to l_j}) - h_j(x^{(m)}_{i \to l_i}) \right] \quad \text{or} \quad \hat{\omega}_{ij}(W_s) = \frac{|\Omega_{ij}|}{M} \sum_{m=1}^{M} \frac{\partial^2 f^{W_s}(\tilde{x}^{(m)})}{\partial x_i\, \partial x_j}, \qquad (9)$$

where $x^{(m)} \sim \hat{p}(x)$ in both, and in the latter $\tilde{x}^{(m)}$ additionally has $(x_i, x_j)$ drawn uniformly from $\Omega_{ij}$. If $f$ is not differentiable, the first estimator in Eq. 9 should be used, requiring forward evaluations for every feature pair separately. If $f$ is a BNN (differentiable), the second estimator can be used and requires only backward passes, since a backward pass through the gradient yields a full row of the Hessian at once. Moreover, if we assume the EIH to be approximately Gaussian, which follows from the CLT for large $M$, a credible interval can be estimated as $\bar{\omega}_{ij} \pm z_{\alpha/2}\, \hat{\sigma}_{ij}$, which can be used in statistical tests. More analysis on the accuracy of the approximation w.r.t. $M$ and $S$, and on computational complexity, is given in the Appendix.
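The first estimator in Eq. 9 can be sketched in a few lines. Here the "posterior draws" are hypothetical perturbed copies of a known function, so the answer can be checked by hand: for $f = 2 x_0 x_1 + x_2$ over $[-1, 1]$, the four-point formula gives $2 \cdot (1 - (-1))^2 = 8$ for the interacting pair, and exactly $0$ for a non-interacting pair:

```python
import numpy as np

rng = np.random.default_rng(2)

def eih_samples(f_samples, X, i, j, lo=-1.0, hi=1.0):
    """MC estimate of EIH for features (i, j): one estimate per model sample.

    f_samples: callables, each standing in for a posterior draw f^{W_s}.
    X: (M, d) background samples for the remaining features.
    Uses the four-point identity (joint effect minus marginal effects)."""
    def corner(a, b):
        Z = X.copy()
        Z[:, i] = a
        Z[:, j] = b
        return Z
    ests = []
    for f in f_samples:
        vals = (f(corner(hi, hi)) - f(corner(hi, lo))
                - f(corner(lo, hi)) + f(corner(lo, lo)))
        ests.append(vals.mean())
    ests = np.asarray(ests)
    return ests.mean(), ests.std(ddof=1)  # point estimate and posterior spread

# Hypothetical "posterior draws": perturbed versions of f(x) = 2*x0*x1 + x2.
def make_f(eps):
    return lambda Z: (2.0 + eps) * Z[:, 0] * Z[:, 1] + Z[:, 2]

draws = [make_f(e) for e in rng.normal(scale=0.01, size=50)]
X = rng.uniform(-1, 1, size=(500, 3))
mean01, sd01 = eih_samples(draws, X, 0, 1)  # interacting pair: close to 8
mean02, sd02 = eih_samples(draws, X, 0, 2)  # additive pair: close to 0
```

A 95% credible interval then follows as `mean01 ± 1.96 * sd01`, matching the Gaussian approximation described above.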

## 5 Experiments

We apply our approach to simulated data, 3 public real-world data sets with and without injected artificial interactions, and to the MNIST data. We show that our method can identify interpretable interactions with well-characterized uncertainty, and outperforms 3 state-of-the-art baselines.

### 5.1 Simulated Data

Experimental setup: We simulate datasets with $d$ features and $K$ interaction pairs, using the model

$$y = \sum_{i=1}^{d} \beta_i x_i + \sum_{k=1}^{K} \gamma_k\, g_k(x_{k_1}, x_{k_2}) + \epsilon, \qquad (10)$$

where $\beta_i$ is the weight of feature $x_i$, $g_k$ is the functional form of the $k$th interaction (specified above the panels in Figure 1) with weight $\gamma_k$, and $x_{k_1}, x_{k_2}$ are the features involved in this interaction. Each feature is drawn uniformly from its range. The noise $\epsilon$ is Gaussian with zero mean and variance adjusted to a specified signal-to-noise ratio. Each simulated dataset includes 20000 samples for training, 5000 for validation, and 5000 for testing. To model interactions according to Section 3, we use a concrete dropout Bayesian neural network with 3 hidden layers. During training, we fix the length-scale of the prior distribution, the temperature of the sigmoid function in the concrete distribution, and the learning rate of Adam.
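A generator in the spirit of Eq. 10 might look as follows; the feature count, interaction forms, weights, and signal-to-noise ratio are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n, d=10, snr=4.0):
    """Simulate data with linear main effects plus a few interactions."""
    X = rng.uniform(-1, 1, size=(n, d))
    beta = rng.uniform(-1, 1, size=d)
    signal = (X @ beta
              + 1.5 * X[:, 0] * X[:, 1]            # multiplicative interaction
              + np.exp(0.5 * (X[:, 2] + X[:, 3]))  # exponential interaction
              + np.sin(2.0 * X[:, 4] * X[:, 5]))   # periodic interaction
    noise_var = signal.var() / snr                 # set noise to hit the S/N
    y = signal + rng.normal(scale=np.sqrt(noise_var), size=n)
    return X, y

X, y = simulate(20000)
```

Scaling the noise variance by the signal variance is one simple way to sweep the S/N ratio as in Figure 1.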

Comparison methods: We apply Bayesian EIH and NID to the same trained neural networks, using the MC estimation of Section 4.3 for EIH. We also run the SHAP interaction score, and a Lasso regression containing all pairwise multiplicative interactions. As an 'Oracle', we include a linear regression model with the correct functional forms for the true interactions and the multiplicative form for the other interactions. We rank feature pairs according to the absolute values of the interaction scores from each method, from high to low. A good interaction measure should assign the true interactions as small a rank as possible.

Results:

In Figure 1, we increase the signal-to-noise ratio (S/N) gradually to see which method can recover the ground-truth interactions with the smallest S/N. The panels are sorted according to the variance of the corresponding interaction (which is adjusted by the interaction weight). Interactions with higher variances should be easier to detect, because they contribute more to the prediction. We find that EIH identifies the true interactions, on average, at a smaller S/N ratio than the other methods. While it is impossible to assign the lowest rank to all interactions, the average rank EIH assigns is by far the smallest, and the performance of EIH is close to that of the Oracle. The SHAP interaction score works very well for most interactions, but it fails to detect two of them.

Figure 2(a) shows the estimated uncertainties of the interaction effects. We observe that each true interaction effect (red cross) is covered by its corresponding credible interval (blue bar) centered on the point estimate (black dot). Figure 2(b) compares the calibration of the credible intervals from three dropout methods, and shows that concrete dropout with per-node dropout rates is better calibrated than the other methods. Importantly, the wider credible intervals, e.g. the 95% interval often used in practice, are conservative, i.e. they contain more true values than expected. This means that with these CIs, EIH does not underestimate uncertainty, which is important in applications.

### 5.2 Public real-world regression datasets

Datasets: We analyze 3 publicly available regression datasets: the California housing prices, Bike sharing, and Energy efficiency datasets. The California housing prices dataset [20] aims to predict housing prices using 9 features, such as location, number of rooms, and number of people. The Bike sharing dataset [4] predicts the hourly bike rental count from environmental and seasonal information. The Energy efficiency dataset [25] aims to predict the heating and cooling load from the shape of a building. We split each dataset into training, validation, and test sets.

Experimental setup: We 1) analyse the original datasets and report and interpret the results, and 2) repeat the same analyses with artificial interactions injected into the datasets, which establishes a ground truth when the true interactions are unknown. We inject three types of interactions: a multiplicative, an exponential, and a division interaction, whose detailed functional forms are given above the panels in Fig. 3. We inject only one interaction at a time, and the output is then the sum of the true output and the injected interaction.

The same settings are used as in Section 5.1, except that only one hidden layer with 30 units is used for the third dataset, which has far fewer data points.

Results: Results with injected interactions of varying strengths are shown in Figure 3. We see that EIH is consistently the best (or joint best) method on the California housing and Bike sharing datasets, irrespective of the type of interaction. SHAP works well for the simple interaction form (the first row) and for small datasets (the third column). We notice that SHAP and Lasso achieve better performance than our method on the third (energy efficiency) dataset, because it is hard to train a good neural network when the dataset is small.

| Datasets | Interacting Features | EIH | CI | $p$ | $p_{\text{perm}}$ |
| --- | --- | --- | --- | --- | --- |
| California Housing | longitude, latitude | | | 0.000 | 0.000 |
| California Housing | total rooms, population | | | 0.000 | 0.000 |
| California Housing | total rooms, median income | | | 0.002 | 0.000 |
| Bike Sharing | temperature, humidity | | | 0.000 | 0.000 |
| Bike Sharing | workingday, humidity | | | 0.011 | 0.017 |
| Bike Sharing | year, weathersit | | | 0.011 | 0.018 |
| Energy Efficiency | compactness, wall area | | | 0.000 | 0.000 |
| Energy Efficiency | roof area, glazing area | | | 0.001 | 0.000 |
| Energy Efficiency | roof area, height | | | 0.021 | 0.017 |

Table 1 shows results for the top interactions in the datasets (without injected interactions). EIH, CI, and $p$ are the estimated mean, credible interval, and P value from our method, respectively. We also show a P value, $p_{\text{perm}}$, obtained by permuting the target multiple times to create an empirical null distribution of the maximum interaction score (details in the Appendix). All top interactions are meaningful and statistically significant based on both our CIs and the permutation test. For example, the strongest interaction in the California housing dataset is between longitude and latitude, which together specify the location that obviously affects the price.
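The permutation scheme for $p_{\text{perm}}$ can be sketched generically; here a cheap correlation-based surrogate replaces the full EIH pipeline purely for illustration (`score_fn` and all settings below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def perm_pvalue(score_fn, X, y, observed_max, n_perm=200):
    """Empirical P value for the maximum interaction score.

    Permuting y breaks every real feature-target relationship, so refitting
    the scorer on permuted targets yields a null distribution of the
    maximum score; the P value is the fraction of null maxima that reach
    the observed one (with the usual +1 correction)."""
    null = np.array([score_fn(X, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed_max)) / (1 + n_perm)

def score_fn(X, y):
    """Toy surrogate: max |correlation| between y and pairwise products."""
    d = X.shape[1]
    prods = (X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d))
    return max(abs(np.corrcoef(p, y)[0, 1]) for p in prods)

X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] * X[:, 1] + rng.normal(size=400)  # one real interaction
p = perm_pvalue(score_fn, X, y, score_fn(X, y))
```

In the paper the scorer would be the maximum EIH over all pairs from a model refit to the permuted targets; the surrogate only illustrates the null construction.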

### 5.3 Interactions between higher-level features in MNIST

Motivation: We aim to demonstrate the ability of our method to detect interpretable interactions between higher-level features. For this, we design a classification task where the positive label represents a combination of interpretable characteristics of the input. The classification task here is to identify a given combination of two digits, e.g. (5,3), and the inputs are obtained by concatenating randomly chosen MNIST digits. Our expectation is that nodes in the upper layers represent interpretable properties of the inputs (e.g. "5 on the left"), such that an interaction between two such nodes corresponds to the positive label (e.g. "5 on the left" and "3 on the right").

Datasets: We repeat the experiment twice: the first dataset consists of pairs (7,4), (4,7), (0,4), (4,0), (7,0), (0,7), and the positive label is (7,0); the second dataset consists of pairs (5,4), (4,5), (3,4), (4,3), (5,3), (3,5), and the positive label is (5,3).

Experimental setup: We train a LeNet (convolutional layers followed by fully connected layers) with concrete dropout, and use EIH to detect interactions between nodes in the second topmost fully connected layer, where the nodes can be regarded as high-level features learned by the previous layers. We provide interpretations for these high-level features by finding the one-digit image input, with white on the other side, e.g. (1,-) or (-,6), that, among all possible one-digit images in the MNIST data, maximizes the activation of the node. This is the activation maximization with experts technique for interpreting nodes in intermediate NN layers [3], with the empirical distribution of digits in MNIST as the expert.

Results: Figure 4 shows the top two interactions in the second-highest layer, and presents interpretations for the interacting nodes. We see that all the interacting nodes represent digits related to the prediction tasks instead of unrelated digits, such as 3 in the first task or 7 in the second task. Interestingly, in both tasks the strongest interaction is between nodes whose interpretation matches the human intuition exactly. In the first task, (7,0) is obtained as an interaction between "7 on the left" and "0 on the right", and similarly for the task of classifying (5,3). The second strongest interaction in the (5,3) classification is between nodes with interpretations (4,-) and (-,3), which may be interpreted as excluding digit 4 on the left, when there is 3 on the right. The interaction between nodes which both have interpretation (7,-) may be related to learning different parts of digit 7.

## 6 Conclusion

We presented a novel method to learn pairwise interactions with uncertainty using Bayesian neural networks. We proposed an intuitive and flexible global interaction measure, the Bayesian Expected Integrated Hessian, to detect interactions with uncertainty from a neural network, but the method is applicable to any black-box model that maps an input to an output. The method comes with appealing theoretical properties, which ensure that by improving the underlying BNN, interaction detection can be improved, which is not true for some alternatives. Our results also demonstrated the ability of the method to learn interpretable interactions between higher-level features.

## References

- [1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
- [2] J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. Annals of Statistics, 41(3):1111, 2013.
- [3] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
- [4] H. Fanaee-T and J. Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3):113–127, 2014.
- [5] R. Fisher. Statistical Methods For Research Workers. Oliver And Boyd: Edinburgh, 1936.
- [6] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1050–1059, 2016.
- [7] Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
- [8] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
- [9] P. Greenside, T. Shimko, P. Fordyce, and A. Kundaje. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics, 34(17):i629–i637, 2018.
- [10] Y. Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.
- [11] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- [12] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
- [13] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [14] Y. Kong, D. Li, Y. Fan, J. Lv, et al. Interaction pursuit in high-dimensional multi-response regression via distance correlation. The Annals of Statistics, 45(2):897–922, 2017.
- [15] M. Lim and T. Hastie. Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3):627–654, 2015.
- [16] S. M. Lundberg, G. G. Erion, and S.-I. Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
- [17] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
- [18] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2498–2507. JMLR.org, 2017.
- [19] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- [20] R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.
- [21] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.
- [22] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
- [23] D. Sorokina, R. Caruana, M. Riedewald, and D. Fink. Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th International Conference on Machine learning, pages 1000–1007. ACM, 2008.
- [24] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
- [25] A. Tsanas and A. Xifara. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49:560–567, 2012.
- [26] M. Tsang, D. Cheng, and Y. Liu. Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977, 2017.
- [27] M. Tsang, H. Liu, S. Purushotham, P. Murali, and Y. Liu. Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability. In Advances in Neural Information Processing Systems, pages 5809–5818, 2018.
- [28] T. H. Wonnacott and R. J. Wonnacott. Introductory statistics, volume 5. Wiley New York, 1990.

## Appendix A Aggregating Information from Multiple Subregions into EIH

In Section 4, we described $\omega_{ij}$ as a global interaction measure between features $x_i$ and $x_j$, assuming the region of interest is the full domain of the features. Generally, EIH can be applied to any subregion of the full domain. Results for the subregions can be considered independently, or they can be aggregated into another global interaction measure, which we briefly outline below.

Formally, let $\omega_{ij}(\Omega_k)$ denote the interaction effect in a rectangular region $\Omega_k$ (i.e., compute the integral of Eq. 4 in Section 4 over the region $\Omega_k$ instead of the full domain), and let $\Omega$ denote the full domain of $x_i$ and $x_j$ (i.e., $\Omega_{ij}$ in Eq. 4 of Section 4). Suppose the full domain is divided into several subregions $\Omega_1, \ldots, \Omega_K$, such that $\bigcup_{k=1}^{K} \Omega_k = \Omega$. Then, another possible global interaction measure, denoted by $\tilde{\omega}_{ij}$, can be defined as

$$\tilde{\omega}_{ij} = \sum_{k=1}^{K} \frac{|\Omega_k|}{|\Omega|} \left| \omega_{ij}(\Omega_k) \right|, \qquad (11)$$

where $|\Omega_k|$ represents the area of region $\Omega_k$. If the full domain is divided into one region only (i.e., $K = 1$ and $\Omega_1 = \Omega$), then $\tilde{\omega}_{ij}$ reduces to the absolute value of the $\omega_{ij}$ that was used in Section 4. Hence, $\tilde{\omega}_{ij}$ can be regarded as a generalized EIH.

Next we give an example where the generalized EIH can be helpful to detect symmetric interactions by aggregating information from multiple subregions.

Consider an interaction model $f$ whose cross-derivative $\partial^2 f / \partial x_i\, \partial x_j$ is an odd function of $x_i$ and $x_j$, over a full domain $\Omega$ that is symmetric around the origin. If we compute the EIH in $\Omega$ directly, i.e., applying Eq. 4 of Section 4, $\omega_{ij}$ will be $0$, because the local interactions cancel out when integrated across the whole domain. Let us instead divide $\Omega$ into symmetric subregions (e.g., the four quadrants), in each of which the integrated Hessian is nonzero. If we compute $\tilde{\omega}_{ij}$ according to Eq. 11, the result is nonzero, since the absolute interaction effects of the subregions no longer cancel.
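The cancellation and its repair by Eq. 11 can be checked numerically with a hypothetical symmetric interaction $f(x_1, x_2) = x_1^2 x_2^2$, whose cross-derivative $4 x_1 x_2$ is odd in each argument:

```python
import numpy as np

def integrated_hessian(lo1, hi1, lo2, hi2, n=200):
    """Integrate d2f/dx1dx2 = 4*x1*x2 over a rectangle by the midpoint rule."""
    x1 = np.linspace(lo1, hi1, n, endpoint=False) + (hi1 - lo1) / (2 * n)
    x2 = np.linspace(lo2, hi2, n, endpoint=False) + (hi2 - lo2) / (2 * n)
    g1, g2 = np.meshgrid(x1, x2)
    cell = (hi1 - lo1) * (hi2 - lo2) / n**2
    return np.sum(4 * g1 * g2) * cell

# Over the symmetric full domain [-1,1]^2 the odd integrand cancels to ~0,
# while each quadrant contributes +-1; Eq. 11 recovers a nonzero score.
full = integrated_hessian(-1, 1, -1, 1)
quads = [integrated_hessian(a, a + 1, b, b + 1) for a in (-1, 0) for b in (-1, 0)]
agg = sum(0.25 * abs(q) for q in quads)  # generalized EIH of Eq. 11
```

Each quadrant has area $1$ out of $|\Omega| = 4$, hence the weight $0.25$ per term.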

This problem occurs also with other interaction measures, such as SHAP, as long as the global interaction effect is computed by averaging local interaction effects. Other interaction measures, such as NID, which are based on information of neural network architecture, will avoid this problem in principle, but in general the empirical performance of this method is below that of the other methods (see Section 5).

## Appendix B Proof of Accuracy Improvement Property (Property 1)

For simplicity, Eq. 8 together with Eqs. 6–7 in Section 4 can be rewritten in a more general form as a linear combination of expected predictions,

$$\omega_{ij}(\hat{f}) = \sum_{c} a_c\, \mathbb{E}\big[ \hat{f}(x_c) \big], \qquad (12)$$

according to the linearity of expectation, where the sum runs over the corner points $x_c$ appearing in Eq. 7 and $a_c \in \{-1, +1\}$. The same holds for the true interaction effect:

$$\omega_{ij}(f) = \sum_{c} a_c\, \mathbb{E}\big[ f(x_c) \big]. \qquad (13)$$

Thus the estimation error can be bounded as

$$\big| \omega_{ij}(f) - \omega_{ij}(\hat{f}) \big| \le \sum_{c} |a_c|\, \mathbb{E}\big| f(x_c) - \hat{f}(x_c) \big| \le C \cdot \varepsilon_{\max}, \qquad (14)$$

where $\varepsilon_{\max}$ is the maximum expected prediction error over the data distribution, and the constant $C = \sum_c |a_c|$ is determined by the interaction measure; in our case the expansion of Eq. 7 has four corner terms with coefficients $\pm 1$, so $C = 4$. This indicates that the estimation error of the EIH is linearly upper bounded by the maximum prediction error of the trained model with factor $C$, which means we can improve the estimation accuracy and stability by improving the learning model.

## Appendix C Proof of Calibration Improvement Property (Property 2)

Let $p^*(y\,|\,\mathbf{x})$ denote the true conditional distribution and $p(y\,|\,\mathbf{x},\mathcal{D})$ the posterior predictive distribution of the model given $\mathbf{x}$ and data $\mathcal{D}$. We call the uncertainty of $p(y\,|\,\mathbf{x},\mathcal{D})$ perfectly calibrated when $p(y\,|\,\mathbf{x},\mathcal{D})=p^*(y\,|\,\mathbf{x})$. Here we assume that $p(y\,|\,\mathbf{x},\mathcal{D})$ can model both epistemic (model) uncertainty and aleatoric (data) uncertainty, which means it can also capture the noise distribution. For example, $p(y\,|\,\mathbf{x},\mathcal{D})$ can represent a Gaussian distribution whose variance contains both a model term and a noise term. In our experiments, we only consider the epistemic (model) uncertainty (i.e., the noise variance is a very small constant).

Another way to define calibration is the following: the predictive distribution $p(y\,|\,\mathbf{x},\mathcal{D})$ is perfectly calibrated if, given an infinite amount of data, the Bayesian credible interval of the mean of $p(y\,|\,\mathbf{x},\mathcal{D})$ at any percentage coincides with the frequentist confidence interval of the mean of $p^*(y\,|\,\mathbf{x})$ at the same percentage, for all $\mathbf{x}$. Consequently, the closer the credible interval is to the corresponding confidence interval, the better calibrated $p(y\,|\,\mathbf{x},\mathcal{D})$ is. Here we assume that the means of both $p(y\,|\,\mathbf{x},\mathcal{D})$ and $p^*(y\,|\,\mathbf{x})$ are Gaussian distributed, which is generally true according to the central limit theorem (CLT).

In the rest of this section, we first define the closeness between the credible interval and the confidence interval. Then we show that for two predictive models $f$ and $g$, if $f(\mathbf{x})$ is better calibrated than $g(\mathbf{x})$ for every $\mathbf{x}$, then $\sum_s c_s f(\mathbf{x}_s)$ is better calibrated than $\sum_s c_s g(\mathbf{x}_s)$, where $\{\mathbf{x}_s\}$ is a set of data points sampled from the data distribution. This proves Property 2 in Section 4.2, because the Averaged Hessian can be written in the form $\sum_s c_s f(\mathbf{x}_s)$.

### C.1 Closeness between Two Intervals

We denote by $I_1=[l_1,u_1]$ the credible interval of the mean of $p(y\,|\,\mathbf{x},\mathcal{D})$, and by $I_2=[l_2,u_2]$ the confidence interval of the mean of $p^*(y\,|\,\mathbf{x})$. We then define the closeness of $I_1$ and $I_2$ as the ratio between the lengths of their intersection and their union:

$$\mathcal{C}(I_1,I_2) \;=\; \frac{|I_1\cap I_2|}{|I_1\cup I_2|}.$$

When $\mathcal{C}(I_1,I_2)=1$, the two intervals perfectly match, and when $\mathcal{C}(I_1,I_2)=0$, the two intervals have no intersection.
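A minimal sketch of an overlap-based closeness between two intervals, normalized so that identical intervals give 1 and disjoint intervals give 0; the intersection-over-union normalization here is one concrete choice consistent with those endpoints:

```python
# Closeness of two closed intervals a = [l1, u1] and b = [l2, u2]:
# length of intersection divided by length of union.
def closeness(a, b):
    (l1, u1), (l2, u2) = a, b
    overlap = min(u1, u2) - max(l1, l2)
    if overlap <= 0:
        return 0.0                     # disjoint intervals
    return overlap / (max(u1, u2) - min(l1, l2))

print(closeness((0, 1), (0, 1)))       # identical -> 1.0
print(closeness((0, 1), (2, 3)))       # disjoint  -> 0.0
print(closeness((0, 2), (1, 3)))       # partial   -> 1/3
```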

### C.2 Calibration Improvement Preserved under Linear Combination

We first prove that if $f(\mathbf{x}_1)$ and $f(\mathbf{x}_2)$ are better calibrated than $g(\mathbf{x}_1)$ and $g(\mathbf{x}_2)$, then $f(\mathbf{x}_1)+f(\mathbf{x}_2)$ is better calibrated than $g(\mathbf{x}_1)+g(\mathbf{x}_2)$.

We denote the distributions of the means of $f(\mathbf{x}_1)$ and $f(\mathbf{x}_2)$ under the predictive distribution by $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$, and the distributions of the corresponding true means by $\mathcal{N}(\hat{\mu}_1,\hat{\sigma}_1^2)$ and $\mathcal{N}(\hat{\mu}_2,\hat{\sigma}_2^2)$.

We only present the case where the credible and confidence intervals partially overlap without one containing the other; the case where one interval contains the other is easy to prove in the same way. Under the Gaussian assumption, the intervals are $[\mu_1-z\sigma_1,\;\mu_1+z\sigma_1]$ and $[\hat{\mu}_1-z\hat{\sigma}_1,\;\hat{\mu}_1+z\hat{\sigma}_1]$, where $z$ is the value of the percent point function at the chosen level. Then, for data point $\mathbf{x}_1$, the closeness of the intervals is

$$\mathcal{C}_1 \;=\; \frac{(\hat{\mu}_1+z\hat{\sigma}_1)-(\mu_1-z\sigma_1)}{(\mu_1+z\sigma_1)-(\hat{\mu}_1-z\hat{\sigma}_1)}, \qquad (15)$$

where we assume $\mu_1+z\sigma_1$ is greater than $\hat{\mu}_1+z\hat{\sigma}_1$ (the other case can be shown in the same way). Similarly, for data point $\mathbf{x}_2$:

$$\mathcal{C}_2 \;=\; \frac{(\hat{\mu}_2+z\hat{\sigma}_2)-(\mu_2-z\sigma_2)}{(\mu_2+z\sigma_2)-(\hat{\mu}_2-z\hat{\sigma}_2)}. \qquad (16)$$

The mean of $f(\mathbf{x}_1)+f(\mathbf{x}_2)$ is distributed as $\mathcal{N}(\mu_1+\mu_2,\;\sigma_1^2+\sigma_2^2)$, because $f(\mathbf{x}_1)$ and $f(\mathbf{x}_2)$ are independent, and likewise the true mean is distributed as $\mathcal{N}(\hat{\mu}_1+\hat{\mu}_2,\;\hat{\sigma}_1^2+\hat{\sigma}_2^2)$. Considering intervals with the same percent point value $z$, the credible interval of the sum is $[\mu_1+\mu_2-z\sqrt{\sigma_1^2+\sigma_2^2},\;\mu_1+\mu_2+z\sqrt{\sigma_1^2+\sigma_2^2}]$, and analogously for the confidence interval. Thus the closeness of these two intervals is:

$$\mathcal{C}_{12} \;=\; \frac{\big(\hat{\mu}_1+\hat{\mu}_2+z\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2}\big)-\big(\mu_1+\mu_2-z\sqrt{\sigma_1^2+\sigma_2^2}\big)}{\big(\mu_1+\mu_2+z\sqrt{\sigma_1^2+\sigma_2^2}\big)-\big(\hat{\mu}_1+\hat{\mu}_2-z\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2}\big)}. \qquad (17)$$

So when $f(\mathbf{x}_1)$ and $f(\mathbf{x}_2)$ are better calibrated than $g(\mathbf{x}_1)$ and $g(\mathbf{x}_2)$, their closeness values satisfy $\mathcal{C}_1(f)\ge\mathcal{C}_1(g)$ and $\mathcal{C}_2(f)\ge\mathcal{C}_2(g)$. Then the lower bound of $\mathcal{C}_{12}(f)$ is greater than that of $\mathcal{C}_{12}(g)$, and since this applies for all pairs of data points, $f(\mathbf{x}_1)+f(\mathbf{x}_2)$ is better calibrated than $g(\mathbf{x}_1)+g(\mathbf{x}_2)$. When $f$ is perfectly calibrated, we have $\mu_i=\hat{\mu}_i$ and $\sigma_i=\hat{\sigma}_i$, so $f(\mathbf{x}_1)+f(\mathbf{x}_2)$ is also perfectly calibrated.

This argument generalizes trivially to the distribution of any linear combination of predictions, so $\sum_s c_s f(\mathbf{x}_s)$ is better calibrated than $\sum_s c_s g(\mathbf{x}_s)$ whenever $f$ is better calibrated than $g$.

## Appendix D Computational Complexity of EIH

There are two ways to compute the EIH according to Eq. 9 in Section 4.3. The first line can be used with any probabilistic model, but to estimate all pairwise interaction effects it requires a number of model evaluations that grows with the number of MC samples and quadratically with the number of dimensions $d$. If the model is differentiable, we can instead estimate the Hessian directly from the second line of Eq. 9, with a number of backward passes that scales only with the number of MC samples, because the gradients with respect to all input features are obtained in a single backward pass. So if the number of MC samples needed for a stable estimate is not significantly larger for the second method, making use of differentiability can certainly improve the efficiency.
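The two estimators can be compared on a toy function; the central finite-difference formula below is a generic stand-in for the evaluation-based first line of Eq. 9, and the analytic mixed partial stands in for the exact Hessian of a differentiable model:

```python
import math

# Toy function f(x1, x2) = sin(x1) * x2, with mixed partial
# d2f/dx1dx2 = cos(x1).
def f(x1, x2):
    return math.sin(x1) * x2

# Central finite-difference mixed partial: four evaluations per feature pair.
def mixed_partial_fd(f, x1, x2, h=1e-4):
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4 * h * h)

# The two estimates agree up to O(h^2) discretization error.
x1, x2 = 0.3, 0.7
assert abs(mixed_partial_fd(f, x1, x2) - math.cos(x1)) < 1e-5
```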

We test whether the number of MC samples required by the first method differs from that required by the second method to obtain a stable estimate of the mean of Eq. 8. (Here the sample count refers to the product of the number of posterior samples and the number of input samples, because the estimate of the mean depends only on this product.)

We can observe that although the estimate based directly on the Hessian (the second method) has a larger variance at the beginning, the two methods require nearly the same number of MC samples for a stable estimate of the mean. So the second estimation method should be used whenever the model is differentiable.

## Appendix E Determining the Numbers of Monte Carlo Samples

In this section, we test how many Monte Carlo samples (for the posterior and for the inputs, as introduced in Section 4.3) are needed to obtain stable estimates of the mean and standard deviation on the simulated dataset and the public dataset. Here we estimate the Hessian directly (the second estimation method of Eq. 9), because the BNN is differentiable.

From the first equation of Eq. 8 in Section 4, we can see that the estimate of the mean depends only on the product of the numbers of posterior and input samples, provided the two are drawn independently. The estimator of the variance (the second equation in Eq. 8 of Section 4), in contrast, depends on both numbers individually. Thus the safest way to find a stable estimator is to first determine the required product from the mean estimate, then fix the number of posterior samples accordingly and determine the number of input samples from the variance estimate.

Figure 6 shows how the mean and standard deviation of the estimated interaction effects change as we increase the numbers of samples on the simulated dataset, and Figure 7 shows the corresponding results on the California Housing price dataset. We can observe that both the mean and the standard deviation stabilize once the numbers of samples exceed the thresholds visible in the figures, which are therefore sufficient in practice.
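The stabilization check can be sketched as follows, with synthetic local Hessian samples standing in for draws from the BNN posterior and the input distribution:

```python
import numpy as np

# Synthetic local mixed-partial samples: true interaction effect 2.0
# plus unit-variance noise standing in for weight and input sampling.
rng = np.random.default_rng(1)
samples = 2.0 + rng.normal(0.0, 1.0, size=5000)

# Running Monte Carlo estimate of the mean as the sample count grows.
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)

# The estimate is noisy early on but settles near the true value 2.0;
# plotting running_mean reproduces the qualitative shape of the check.
assert abs(running_mean[-1] - 2.0) < 0.1
```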

## Appendix F Permutation Distribution of Maximum Interaction Effects

In this section, we show the permutation distribution of the maximum interaction effect for each dataset in Figure 8.

For each dataset, we permute the target and compute the maximum interaction effect on the permuted data; repeating this procedure yields the permutation distribution of the maximum interaction effect. We then calculate the p-values of the top-3 estimated interaction effects for each dataset under its corresponding permutation distribution. KDE stands for the Gaussian kernel density estimate of the empirical permutation distribution.
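The procedure can be sketched as follows; the interaction statistic here is a simple stand-in (the correlation of $x_1 x_2$ with the target) rather than the EIH of a trained BNN:

```python
import numpy as np

# Synthetic data with a genuine pairwise interaction x1 * x2.
rng = np.random.default_rng(2)
x = rng.normal(size=(500, 2))
y = x[:, 0] * x[:, 1] + 0.1 * rng.normal(size=500)

# Stand-in interaction statistic: |corr(x1 * x2, target)|.
def interaction_stat(x, y):
    return abs(np.corrcoef(x[:, 0] * x[:, 1], y)[0, 1])

# Permutation distribution: recompute the statistic on shuffled targets.
observed = interaction_stat(x, y)
null = np.array([interaction_stat(x, rng.permutation(y)) for _ in range(500)])

# One-sided permutation p-value with the standard +1 correction.
p_value = (1 + np.sum(null >= observed)) / (1 + null.size)
assert p_value < 0.05   # the planted interaction is clearly significant
```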