Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel
Neural Networks (NNs) have been extensively used for a wide spectrum of real-world regression tasks, where the goal is to predict a numerical outcome such as revenue, effectiveness, or a quantitative result. In many such tasks, the point prediction is not enough, but also the uncertainty (i.e. risk, or confidence) of that prediction must be estimated. Standard NNs, which are most often used in such tasks, do not provide any such information. Existing approaches try to solve this issue by combining Bayesian models with NNs, but these models are hard to implement, more expensive to train, and usually do not perform as well as standard NNs. In this paper, a new framework called RIO is developed that makes it possible to estimate uncertainty in any pretrained standard NN. RIO models prediction residuals using Gaussian Process with a composite input/output kernel. The residual prediction and I/O kernel are theoretically motivated and the framework is evaluated in twelve real-world datasets. It is found to provide reliable estimates of the uncertainty, reduce the error of the point predictions, and scale well to large datasets. Given that RIO can be applied to any standard NN without modifications to model architecture or training pipeline, it provides an important ingredient in building real-world applications of NNs.
Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel
Xin Qiu Cognizant email@example.com Elliot Meyerson Cognizant firstname.lastname@example.org Risto Miikkulainen Cognizant email@example.com
noticebox[b]Preprint. Under review.\end@float
Nowadays, Neural Networks (NNs) are arguably the most popular machine learning tool among Artificial Intelligence (AI) community. Researchers and practitioners have applied NNs to a wide variety of fields, including manufacturing Bergmann2014 (), bioinformatics LeCun2015 (), physics Baldi2014 (), finance Niaki2013 (), chemistry Anjos2015 (), healthcare Shahid2019 (), etc. Although standard NNs are good at making a point prediction (a single outcome) for supervised learning tasks, they are unable to provide the uncertainty information about predictions. For real-world decision makers, representing the model uncertainty is of crucial importance Gal2016 (); Ghahramani2015 (); Krzywinski2013 (). For example, in the case of regression, providing a confidence interval of the prediction allows the decision maker to anticipate the possible outcomes with explicit probability. In contrast, simply returning a single point prediction imposes increased risks on the decision making, e.g., a predictively good but actually risky medical treatment may be overconfidently interpreted without uncertainty information.
Conventional Bayesian models such as Gaussian Process (GP) Rasmussen2006 () offer a mathematically grounded approach to reason about the predictive uncertainty, but usually come with a prohibitive computational cost and lessened performance compared to NNs Gal2016 (). As a potential solution, considerable research has been devoted to the combination of Bayesian models and NNs (see "Related Work" section for a detailed review of these approaches), aiming to overcome the downsides of both. However, from the classical Bayesian Neural Network Neal1996 (), in which a distribution of weights is learned, to the most recent Neural Processes Garnelo2018 (); Marta2018 (); Kim2019 (), in which a distribution over functions is defined, all these methods require significant modifications to the model infrastructure and training pipeline. Compared to standard (non-Bayesian) NNs, these new models are often computationally slower to train and harder to implement Balaji2017 (); Gal2016 (), creating tremendous difficulty for practical uses. In Gal2016 (), a theoretical tool is derived to extract uncertainty information from dropout training, however, the method can only be applied to dropout models, and it also requires to change the internal inference pipeline of dropout NNs. Quantifying point-prediction uncertainty in standard NNs, which are overwhelmingly popular in practical applications, still remains a challenging problem with significant potential impact.
To circumvent the above issues, this paper presents a new framework that can quantitatively estimate the point-prediction uncertainty of standard NNs without any modifications to the model structure or training pipeline. The proposed framework can be directly applied to any pretrained NNs without retraining them. The main idea is to estimate the prediction residuals of NNs using a modified GP, which introduces a new composite kernel that makes use of both inputs and outputs of the NNs. The framework is referred to as RIO (for Residual Input/Output), and the new kernel as an I/O kernel. In addition to uncertainty estimation, RIO has an interesting and unexpected side-effect: It also provides a way to reduce the error of the NN predictions. Moreover, with the help of recent sparse GP models, RIO is well scalable to large-scale datasets. Since classification problems can also be treated as regression on class labels lee2018 (), this paper will focus on regression tasks. A theoretical foundation is provided to prove the effectiveness of both residual estimation and I/O kernel. Twelve real-world datasets are tested in empirical studies, in which RIO exhibits reliable uncertainty estimations, more accurate point predictions, and better scalability compared to existing approaches.
In this section, the general problem statement will be given, the RIO framework will be developed, and justified theoretically, focusing on the two main contributions: estimating the residuals with GP and using an I/O kernel. For background introductions of NNs, GPs, and its more efficient approximation, Stochastic Variational Gaussian Processes (SVGPs) Hensman2013 (); Hensman2015 (), see section S1 in appendix.
2.1 Problem Statement
Consider a training dataset , and a pre-trained standard NN that outputs a point prediction given . The problem is two-fold: (1) Quantify the uncertainty in the predictions of the NN, and (2) calibrate the point predictions of the NN (i.e. make them more accurate).
2.2 Framework Overview
The main idea of the proposed framework, RIO (for Residual estimation with an I/O kernel), is to predict the residuals between observed outcomes and NN predictions using GP with a composite kernel. The framework can be divided into two phases: training phase and deployment phase.
In the training phase, the residuals between observed outcomes and NN predictions on the training dataset are calculated as
Let denote the vector of all residuals and denote the vector of all NN predictions. A GP with a composite kernel is trained assuming , where denotes an covariance matrix at all pairs of training points based on a composite kernel
Suppose a radial basis function (RBF) kernel is used for both and . Then,
The training process of GP learns the hyperparameters , , , , and by maximizing the log marginal likelihood given by
In the deployment phase, a test point is input to the NN to get an output . The trained GP predicts the distribution of the residual as , where
where denotes the vector of kernel-based covariances (i.e., ) between and the training points.
Interestingly, note that the predicted residuals can also be used to calibrate the point predictions of the NN. Finally, the calibrated prediction with uncertainty information is given by
In other words, RIO not only adds the uncertainty estimation to a standard NN—it also makes their output more accurate, without any modification of its architecture or training. Figure 1 shows the overall procedure when applying the proposed framework in real-world applications.
2.3 Underlying Rationale of Residual Prediction
This subsection gives a theoretical justification for why residual prediction helps both error correction and uncertainty estimation of an NN. Specifically, such prediction (1) leads to output that is more accurate than that of GP alone, (2) leads to output that is more accurate than that of the NN alone, and (3) leads to uncertainty estimates that are positively correlated with variance of NN residuals.
Consider a dataset , with drawn i.i.d. from a distribution , and
where , , and is a function with mean zero and variance . Without loss of generality, represents the component in the target function that can be modeled by a GP, represents the observation noise, and represents all the remaining components. With mean squared error as loss, the optimal predictor for is .
Suppose that, from the perspective of GP given , is indistinguishable from noise, but that there is a neural network that has successfully learned some of its structure. Consider the residuals . Here, is the remaining GP component, i.e., , for some non-negative . Similarly, is the remainder of indistinguishable from noise, i.e., . Aside from these indistinguishable functions, assume GP hyperparameters can be estimated optimally.
Let be the posterior mean of the GP trained directly on , and be that of a GP trained to fit the residuals, which yields the final predictor . Let , , and be the expected generalization errors of , , and .
First, following a standard approach Rasmussen2006 (), consider the eigenfunction expansion and . Let be the diagonal matrix of the eigenvalues , and be the design matrix, i.e., . The following series of results capture the improvement due to residual estimation (proofs in section S3.1 of appendix).
In particular, the improvement with respect to GP is greater than the reduction in apparent noise. Importantly, this improvement in error corresponds to a predictive variance that is closer to the optimal for this problem, so uncertainty estimates are improved as well. Experiments on real world data confirm that when predicting the residuals of an NN, the estimated noise level of the GP is indeed lower and correlates with the reduction in generalization error (see S2.3 in appendix). This reduction is possible because the NN is able to extract higher-level features not available to the GP.
These results also lead to a key practical property, which is illustrated in Figure 2.
The variance of NN residuals is positively correlated with the uncertainty of .
Increases in lead to increases in , and increases in lead to decreases in . So, increases in either or lead to increases of , which is the expected predictive variance of . ∎
This property matches the intuition that the GP’s variance should generally be higher for bad NNs than for good NNs. Such a property is crucial to accurately measuring the confidence of NN predictions.
2.4 Robustness of I/O Kernel
This section provides a justification for why a GP using the proposed I/O kernel is more robust than the standard GP, i.e., using the input kernel alone. The reasoning closely matches that in Section 2.3.
Consider the setup in Section 2.3, but now with where and , with non-trivial RBF kernels , (as in Equation 3). Again, suppose that, due to its high expressivity Csaji2001 (), is indistinguishable from noise from the perspective of GP.
Let , , and be the expected generalization errors of the standard GP, GP with output kernel only, and GP with I/O kernel. Then, the expected result follows (proof in S3.2 of appendix).
The optimizer associated with GP simultaneously optimizes the hyperparameters of both kernels, so the less useful kernel usually receives a smaller signal variance. As a result, the I/O kernel is resilient to failures of either kernel. In particular, the GP using I/O kernel improves performance even in the case where the problem is so complex that Euclidean distance in the input space provides no useful correlation information or when the input space contains some noisy features. Conversely, when the NN is a bad predictor, and is simply noise, the standard GP with input kernel alone is recovered. In other words, the I/O kernel is never worse than using the input kernel alone, and in practice it is often better. This conclusion is confirmed in experiments, as will be described next.
RIO is scalable to large datasets by applying sparse GP methods, e.g., SVGP Hensman2013 (); Hensman2015 (). All the conclusions in previous sections still remain valid since sparse GP is simply an approximation of the original GP. In the case of applying SVGP with a traditional optimizer, e.g., L-BFGS-B Byrd1995 (); Zhu1997 (), the computational complexity is , and space complexity is , where is the number of data points and is the number of inducing variables. Experiments show that the computational cost of this implementation is significantly cheaper than other state-of-the-art approaches.
3 Empirical Evaluation
Experiments in this section compare nine algorithms on 12 real-world datasets. The algorithms include standard NN, the proposed RIO framework, four ablated variants of RIO, and three state-of-the-art models that can provide predictive uncertainty, namely, SVGP Hensman2013 (), Neural Network Gaussian Process (NNGP) lee2018 (), and Attentive Neural Processes (ANP) Kim2019 (). In naming the RIO variants, "R" means estimating NN residuals then correcting NN outputs, "Y" means directly estimating outcomes, "I" means only using input kernel, "O" means only using output kernel, and "IO" means using I/O kernel. For all RIO variants (including full RIO), SVGP is used as the GP component, but using the appropriate kernel and prediction target. Therefore, "Y+I" amounts to original SVGP, and it is denoted as "SVGP" in all the experimental results. All 12 datasets are real-world regression problems from UCI Machine Learning Repository Dua2017 (), and cover a wide variety of dataset sizes and feature dimensionalities. Except for the "MSD" dataset, all other datasets are tested for 100 independent runs. During each run, the dataset is randomly split into training set, validation set, and test set, and all algorithms are trained on the same split. All RIO variants that involve an output kernel or residual estimation are based on the trained NN in the same run. For "MSD", since the dataset split is strictly predefined by the provider, only 10 independent runs are conducted. NNGP and ANP are only tested on the four smallest dataset (based on the product of dataset size and feature dimensionality) because they do not scale to larger datasets within the available computational time. It is notable that for all the RIO variants, no extensive hyperparameter tunings are conducted; the same default setup is used for all experiments, i.e., RBF kernel and 50 inducing points. See appendix for the details of experimental setups. Table 1 summarizes the numerical results from these experiments. The main conclusions in terms of point-prediction error, uncertainty estimation, computational requirements, and ablation studies are summarized below.
The errors between point predictions of models and true outcomes of test points are measured using Root Mean Square Error (RMSE), and the mean and standard deviation of RMSEs over multiple experimental runs are shown in the "Prediction RMSE" column of Table 1. For models that return a probabilistic distribution, the mean of the distribution is considered as the point prediction. RIO performs the best or equals the best (based on statistical tests) in 11 out of 12 datasets, and ranks second in the remaining dataset. Compared to standard NN, the probabilistic models that predict the outcomes directly generally perform worse. This result demonstrates the advantages of standard NNs in making point predictions, and makes it compelling to extract uncertainty information from NN predictions.
|n d||RMSE||RMSE||RMSE||RMSE||(sec)||n d||RMSE||RMSE||RMSE||RMSE||(sec)|
The symbols and indicate that the difference between the marked entry and RIO is statistically significant at the 5% significance level using paired -test and Wilcoxon test, respectively. The best entries that are significantly better than all the others under at least one statistical test are marked in boldface (ties are allowed).
Figure 3 shows a comparison among NN, RIO and SVGP in terms of prediction RMSE. RIO is able to improve the NN predictions consistently regardless of how the dataset is split and how well the NN is trained. Conversely, SVGP performs significantly worse than NN in all the datasets. For the CT dataset, which has 386 input features, SVGP fails severely since the input kernel cannot capture the implicit correlation information. ANP shows an extremely unstable performance in airfoil dataset due to its poor scalability to the increase of dataset size. To conclude, applying RIO to NNs not only provides additional uncertainty information, but also reduces the point-prediction error.
Confidence interval (CI) is a useful tool to estimate the distribution of outcomes with explicit probabilities. In order to measure the quality of uncertainty estimation quantitatively, the percentages of testing outcomes that are within the 95%/90%/68% CIs as estimated by each algorithm are calculated. These percentages should be as close to the estimated confidence levels as possible, e.g., a perfect uncertainty estimator would have exactly 95% of testing outcomes within its estimated 95% CIs. The differences between the estimated CIs and true CIs are quantified using their standardized errors with respect to Z-score (see appendix for more details). The RMSEs based on these errors are shown in Table 1 for estimated 95%, 90% and 68% CIs. RIO provides the best or equals the best uncertainty estimations in 7, 6 and 5 out of 12 datasets for 95%, 90% and 68% CIs, respectively. In the remaining datasets, the differences between RIO and the best entries are also small (0.145 with respect to Z-score). One interesting observation is that in the CT dataset SVGP shows very poor prediction RMSE and 95% CI RMSE, but achieves the best 90% and 68% CI RMSE. After investigation, this is actually happened by chance and is not due to accurate CI estimations of SVGP (See S2.4 in appendix for more details).
To provide a more straightforward comparison, Figure 4 shows the distribution of percentages of testing outcomes that are within the estimated 68% CIs over 100 independent runs. It is notable that an accurate coverage percentage of 68% CI is usually more difficult to achieve than those of 95% and 90% because the true distribution of observed outcomes are denser in 68% CI borderline than in the tails. RIO makes reliable CI estimations in most cases, albeit it occasionally returns erroneous CIs for the yacht dataset. ANP provides reasonable CIs in two datasets but performs unstably in airfoil dataset. NNGP always returns overconfident CIs for these real-world datasets due to the lack of noise estimation in its original implementation. Boxplots for coverage percentages of all the CIs over all the datasets are presented in Appendix. The conclusion is that compared to all the other tested models, RIO provides reasonable CI estimations in most cases.
The "time" column in Table 1 shows the computation time of each algorithm, averaged over all the independent runs. All algorithms are implemented using Tensorflow under the same running environment (see Appendix for details about the implementations). The RIO variants scale well to increasing dataset sizes and feature dimensionalities. In contrast, ANP’s computation time increases significantly with the scale of the dataset. NNGP always needs very expensive computational budgets due to its costly grid search of hyperparameters. Thus, RIO scales better than other approaches to large datasets.
The RIO variants with residual estimation generally perform better than its counterparts in both point-prediction error and uncertainty estimation. This result confirms the effectiveness of residual estimation, as suggested by the theoretical analysis in Section 2.3. Another important result is that Y+IO outperforms both Y+I and Y+O in most cases across all performance metrics, and RIO generally provides more robust performance than R+I and R+O in all aspects. This result, in turn, confirms that the I/O kernel is robust, as suggested by the analysis in Section 2.4. In sum, both residual estimation and the I/O kernel contribute substantially to the good performance of RIO.
4 Discussion and Future Directions
In addition to reliable uncertainty estimation, accurate point prediction, and good scalability, RIO also provides benefits in other aspects.
RIO can be directly applied to any standard NN without modification to the model architecture or training pipeline. Moreover, retraining of the NN or change of inference process are not required. The framework simply requires the outputs of an NN; it does not need to access any internal structure. This feature makes the framework more accessible to practitioners in real-world applications, e.g., data scientists can train NNs using traditional pipelines, then directly apply RIO to the trained NNs.
RIO also provides robustness to a type of adversarial attack. Consider a worst-case scenario, in which an adversary can arbitrarily alter the output of the NN with minuscule changes to the input. It is well-known that there are NNs for which this is possible Goodfellow2015 (). In this case, with the help of the I/O kernel, the model becomes highly uncertain with respect to the output kernel. A confident prediction then requires both input and output to be reasonable. In the real world, a high degree of uncertainty may meet a threshold for disqualifying the prediction as outside the scope of the model’s ability.
There are several promising future directions for extending RIO: First, applying RIO to reinforcement learning (RL) algorithms, which usually use standard NNs for reward predictions, allows uncertainty estimation of the future rewards. Agents can then employ more mathematically efficient exploration strategies, e.g., bandit algorithms Thompson1933 (), rather than traditional epsilon greedy methods. Second, RIO applied to Bayesian optimization (BO) Mockus1975 () makes it possible to use standard NNs in surrogate modeling. This approach can potentially improve the expressivity of the surrogate model and the scalability of BO. Third, since RIO only requires access to the inputs and outputs of NNs, it can be directly applied to any existing prediction models, including hybrid and ensemble models. This makes RIO a more general tool for real-world practitioners.
5 Related Work
There has been significant interest in combining NNs with probabilistic Bayesian models. An early approach was Bayesian Neural Networks, in which a prior distribution is defined on the weights and biases of a NN, and a posterior distribution is then inferred from the training data MacKay1992 (); Neal1996 (). Traditional variational inference techniques have been applied to the learning procedure of Bayesian NN, but with limited success Barber1998 (); Graves2011 (); Hinton1993 (). By using a more advanced variational inference method, new approximations for Bayesian NNs were achieved that provided similar performance as dropout NNs Blundell2015 (). However, the main drawbacks of Bayesian NNs remain: prohibitive computational cost and difficult implementation procedure compared to standard NNs.
Alternatives to Bayesian NNs have been developed recently. One such approach introduces a training pipeline that incorporates ensembles of NNs and adversarial training Balaji2017 (). Another approach, NNGP, considers a theoretical connection between NNs and GP to develop a model approximating the Bayesian inference process of wide deep neural networks lee2018 (). Deep kernel learning (DKL) combines NNs with GP by using a deep NN embedding as the input to the GP kernel Andrew2016 () . Conditional Neural Processes (CNPs) combine the benefits of NNs and GP, by defining conditional distributions over functions given data, and parameterizing this dependence with a NN Garnelo2018 (). Neural Processes (NPs) generalize deterministic CNPs by incorporating a latent variable, strengthening the connection to approximate Bayesian and latent variable approaches Marta2018 (). Attentive Neural Processes (ANPs) further extends NPs by incorporating attention to overcome underfitting issues Kim2019 (). The above models all require significant modifications to the original NN model and training pipeline. Compared to standard NNs, they are also less computationally efficient and more difficult for practitioners to implement. In the approach that shares the most motivation with RIO, Monte Carlo dropout was used to estimate the predictive uncertainty of dropout NNs Gal2016 (). However, this method is restricted to dropout NNs, and also requires modifications to the NN inference process.
The RIO framework both provides estimates of predictive uncertainty of neural networks, and reduces their point-prediction errors. The approach is based on residual estimation and a composite I/O kernel, which are theoretically well founded and perform well on several real-world problems. By utilizing a sparse approximation of GP, RIO scales well to large datasets. Remarkably, it can be applied directly to any pretrained NNs without modifications to model architecture or training pipeline. Thus, RIO can be used to make NN regression practical and powerful in many real-world applications.
-  O. Anjos, C. Iglesias, F. Peres, J. MartÃnez, A. Garcia, and J. Taboada. Neural networks applied to discriminate botanical origin of honeys. Food Chemistry, 175:128–136, 05 2015.
-  P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 07 2014.
-  D. Barber and C. Bishop. Ensemble learning in bayesian neural networks. In Generalization in Neural Networks and Machine Learning, pages 215–237. Springer Verlag, January 1998.
-  S. Bergmann, S. Stelzer, and S. Strassburger. On the use of artificial neural networks in simulation-based manufacturing control. Journal of Simulation, 8(1):76–90, Feb 2014.
-  C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1613–1622. JMLR.org, 2015.
-  R. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
-  B. Csaji. Approximation with artificial neural networks. M.S.’Thesis, Dept. Science, Eotvos Lorand Univ., Budapest, Hungary, 2001.
-  L. Csató and M. Opper. Sparse on-line gaussian processes. Neural computation, 14:641–68, 04 2002.
-  D. Dua and C. Graff. UCI machine learning repository, 2017.
-  Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 1050–1059. JMLR.org, 2016.
-  M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. M. A. Eslami. Conditional neural processes. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1704–1713. PMLR, 10–15 Jul 2018.
-  M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. M. A. Eslami, and Y. W. Teh. Neural processes. CoRR, abs/1807.01622, 2018.
-  Z. Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521:452 EP –, 05 2015.
-  I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
-  A. Graves. Practical variational inference for neural networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2348–2356, USA, 2011. Curran Associates Inc.
-  J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, pages 282–290, Arlington, Virginia, United States, 2013. AUAI Press.
-  J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, 09–12 May 2015. PMLR.
-  G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, pages 5–13, New York, NY, USA, 1993. ACM.
-  H. Kim, A. Mnih, J. Schwarz, M. Garnelo, S. M. A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh. Attentive neural processes. CoRR, abs/1901.05761, 2019.
-  M. Krzywinski and N. Altman. Importance of being uncertain. Nature Methods, 10:809 EP –, 08 2013.
-  B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
-  Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 265–272, USA, 2011. Omnipress.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436 EP –, 05 2015.
-  J. Lee, Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-dickstein. Deep neural networks as gaussian processes. International Conference on Learning Representations, 2018.
-  D. J. C. MacKay. A practical bayesian framework for backpropagation networks. Neural Comput., 4(3):448–472, May 1992.
-  J. Močkus. On bayesian methods for seeking the extremum. In G. I. Marchuk, editor, Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974, pages 400–404, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg.
-  R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996.
-  S. T. A. Niaki and S. Hoseinzade. Forecasting s&p 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International, 9(1):1, Feb 2013.
-  M. Opper and F. Vivarelli. General bounds on bayes errors for regression with gaussian processes. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 302–308, Cambridge, MA, USA, 1999. MIT Press.
-  J. Quiñonero Candela and C. E. Rasmussen. A unifying view of sparse approximate gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, Dec. 2005.
-  C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Jan. 2006.
-  M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse gaussian process regression. In IN WORKSHOP ON AI AND STATISTICS 9, 2003.
-  N. Shahid, T. Rappon, and W. Berta. Applications of artificial neural networks in health care organizational decision-making: A scoping review. PLOS ONE, 14(2):1–22, 02 2019.
-  P. Sollich. Learning curves for gaussian processes. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 344–350, Cambridge, MA, USA, 1999. MIT Press.
-  W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
-  M. K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In In Artificial Intelligence and Statistics 12, pages 567–574, 2009.
-  A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain, 09–11 May 2016. PMLR.
-  C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, Dec. 1997.
Appendix S1 Background
This section reviews notation for Neural Networks, Gaussian Processes, and its more efficient approximation, Stochastic Variational Gaussian Processes. The RIO method, introduced in Section 2 of the main paper, uses Gaussian Processes to estimate the uncertainty in neural network predictions and reduces their point-prediction errors.
s1.1 Neural Networks
Neural Networks (NNs) learn a nonlinear transformation from input to output space based on a number of training examples. Let denote the training dataset with size , and and denote the inputs and outputs (i.e., targets). A fully-connected feed-forward neural network with hidden layers of width (for layer ) performs the following computations: Let denote the output value of th node in th hidden layer given input , then and , where denotes the weight on the connection from th node in previous layer to th node in th hidden layer, denotes the bias of th node in th hidden layer, and is a nonlinear activation function. The output value of th node in output layer is then given by , where denotes the weight on the connection from th node in last hidden layer to th node in output layer, and denotes the bias of th node in output layer.
A gradient-based optimizer is usually used to learn the weights and bias given a pre-defined loss function, e.g., a squared loss function . For a standard NN, the learned parameters are fixed, so the NN output is also a fixed point. For a Bayesian NN, a distribution of the parameters is learned, so the NN output is a distribution of . However, a pre-trained standard NN needs to be augmented, e.g., with a Gaussian Process, to achieve the same result.
s1.2 Gaussian Processes
A Gaussian Process (GP) is a collection of random variables, such that any finite collection of these variables follows a joint multivariate Gaussian distribution . Given a training dataset and , where denotes additive independent identically distributed Gaussian noise, the first step for GP is to fit itself to these training data assuming , where denotes a multivariate Gaussian distribution with mean 0 and covariance matrix . denotes the kernel-based covariance matrix at all pairs of training points with each entry , and denotes the noise variance of observations. One commonly used kernel is the radial basis function (RBF) kernel, which is defined as . The signal variance , length scale and noise variance are trainable hyperparameters. The hyperparameters of the covariance function are optimized during the learning process to maximize the log marginal likelihood .
After fitting phase, the GP is utilized to predict the distribution of label given a test point . This prediction is given by with and , where denotes the vector of kernel-based covariances (i.e., ) between and all the training points, and denotes the vector of all training labels. Unlike with NN, the uncertainty of the prediction of a GP is therefore explicitly quantified.
s1.3 Stochastic Variational Gaussian Processes
The main limitation of the standard GP, as defined above, is that it is excessively expensive in both computational and storage cost. For a dataset with data points, the inference of standard GP has time complexity and space complexity . To circumvent this issue, sparse GP methods were developed to approximate the original GP by introducing inducing variables [8, 30, 32, 36]. These approximation approaches lead to a computational complexity of and space complexity of , where is the number of inducing variables. Following this line of work, the Stochastic Variational Gaussian Process (SVGP) [16, 17] further improves the scalability of the approach by applying Stochastic Variational Inference (SVI) technique, as follows:
Consider the same training dataset and GP as in Section S1.2, and assume a set of inducing variables as and ( and are unknown). SVGP learns a variational distribution by maximizing a lower bound of , where and denotes the probability density under original GP. Trainable hyperparameters during the learning process include values of and hyperparameters of the covariance function of original GP. Given a test point , the predictive distribution is then given by , which still follows a Gaussian distribution. One advantage of SVGP is that minibatch training methods  can be applied in case of very large dataset. Suppose the minibatch size is and , then for each training step/iteration, the computational complexity is , and the space complexity is . For full details about SVGP, see . Since NNs typically are based on training with relatively large datasets, SVGP makes it practical to implement uncertainty estimates on NNs.
Appendix S2 Empirical Study
s2.1 Experimental Setups
In total, 12 real-world regression datasets from UCI machine learning repository  are tested. Table 1 summarizes the basic information of these datasets. For all the datasets except MSD, 20% of the whole dataset is used as test dataset and 80% is used as training dataset, and this split is randomly generated in each independent run. For MSD, the first 463715 samples are used as training dataset and the last 51630 samples are used as testing dataset according to the provider’s guideline. During the experiments, all the datasets except for MSD are tested for 100 independent runs, and MSD datasets are tested for 10 independent runs. For each independent run, the same random dataset split are used by all the tested algorithms to ensure fair comparisons.
|abbreviation||full name in UCI ML repository||dataset size||dimension||note|
|yacht||Yacht Hydrodynamics Data Set||252||6||-|
|ENB/h||Energy efficiency||768||8||Heating Load as target|
|ENB/c||Energy efficiency||768||8||Cooling Load as target|
|CCS||Concrete Compressive Strength||1030||8||-|
|wine/r||Wine Quality||1599||11||only use winequality-red data|
|wine/w||Wine Quality||4898||11||only use winequality-white data|
|CCPP||Combined Cycle Power Plant||9568||4||-|
|CASP||Physicochemical Properties of Protein Tertiary Structure||54730||9||-|
|CT||Relative location of CT slices on axial axis||53500||384||-|
|MSD||YearPredictionMSD||515345||90||train: first 463715, test: last 51630|
Parametric Setup for Algorithms
NN: For SC dataset, a fully connected feed-forward NN with 2 hidden layers, each with 128 hidden neurons, is used. For CT dataset, a fully connected feed-forward NN with 2 hidden layers, each with 256 hidden neurons, is used. For all the remaining datasets, a fully connected feed-forward NN with 2 hidden layers, each with 64 hidden neurons, is used. The inputs to the NN are normalized to have mean 0 and standard deviation 1. The activation function is ReLU for all the hidden layers. The maximum number of epochs for training is 1000. 20% of the training data is used as validation data, and the split is random at each independent run. An early stop is triggered if the loss on validation data has not be improved for 10 epochs. The optimizer is RMSprop with learning rate 0.001, and the loss function is mean squared error (MSE).
RIO, RIO variants and SVGP : SVGP is used as an approximator to original GP in RIO and all the RIO variants. For RIO, RIO variants and SVGP, the number of inducing points are 50 for all the experiments. RBF kernel is used for both input and output kernel. For RIO, RIO variants and SVGP, the signal variances and length scales of all the kernels plus the noise variance are the trainable hyperparameters. The optimizer is L-BFGS-B with default parameters as in Scipy.optimize documentation (https://docs.scipy.org/doc/scipy/reference/optimize.minimize-lbfgsb.html): ’maxcor’: 10, ’ftol’: 2.220446049250313e-09, ’gtol’: 1e-05, ’eps’: 1e-08, ’maxfun’: 15000, ’maxiter’: 15000, ’iprint’: -1, ’maxls’: 20. The training process runs until the L-BFGS-B optimizer decides to stop.
NNGP : For NNGP kernel, the depth is 2, and the activation function is ReLU. , , and . Following the learning process in original paper, a grid search is performed to search for the best values of and . Same as in the original paper, a grid of 30 points evenly spaced from 0.1 to 5.0 (for ) and 30 points evenly spaced from 0 to 2.0 (for ) was evaluated. The noise variance is fixed as 0.01. The grid search process stops when Cholesky decomposition fails or all the 900 points are evaluated. The best values found during the grid search will be used in the experiments. No pre-computed lookup tables are utilized.
ANP : The parametric setups of ANP are following the recommendations in the original paper. The attention type is multihead, the hidden size is 64, the max number of context points is 50, the context ratio is 0.8, the random kernel hyperparameters option is on. The size of latent encoder is , the number of latents is 64, the size of deterministic encoder is , the size of decoder is , and the deterministic path option is on. Adam optimizer with learning rate is used, and the maximum number of training iterations is 2000.
To measure the point-prediction error, the Root Mean Square Error (RMSE) between the method predictions and true outcomes on test datasets are calculated for each independent experimental run. After that, the mean and standard deviations of these RMSEs are used to measure the performance of the algorithms.
To quantitatively measure the quality of uncertainty estimation, the estimated 95%/90%/68% confidence intervals (CIs) are calculated for all the algorithms except for standard NNs. A theoretically optimal way to measure the quality of these CIs is to calculate the difference between these estimated CIs and true CIs. This requires that the true distributions of the outcomes are known for each test point. However, for real-world datasets, only one observed outcome is available for each test point. To overcome this limitation, we develop a performance metric to quantitatively compare the quality of estimated CIs among algorithms. First, for each independent experimental run, the percentages of test outcomes within the estimated 95%/90%/68% CIs of each algorithm are calculated. The residuals between these coverage percentages and 95%/90%/68% is still biased, e.g., a meaninglessly wide estimated 95% CI that covers 100% of the test outcomes only has a 5% residual. To reduce this bias, the Z-score (under the assumption that the true test outcome follows a Gaussian distribution) is calculated for coverage percentages and targeted 95%/90%/68%. The differences between the two Z-scores are used as a standardized error to approximate the true Z-score error between estimated CIs and target CIs. For coverage percentages of 100%, the Z-score of 99.999% is used. Each independent experimental run has a standardized error based on the test set, and the RMSE over all the independent runs are used as the final metric to compare algorithms.
To compare the computation time of the algorithms, the training time (wall clock time) of NN, RIO, all the RIO variants, SVGP and ANP are averaged over all the independent runs as the computation time. For NNGP, the wall clock time for the grid search is used. In case that the grid search stops due to Cholesky decomposition failures, the computation time of NNGP will be estimated as the average running time of all the successful evaluations 900, which is the supposed number of evaluations. All the algorithms are implemented using Tensorflow, and tested in the exactly same python environment. All the experiments are running on a MacBook Pro machine with 2.5 GHz Intel Core i7 processor and AMD Radeon R9 M370X 2048 MB graphic card.
s2.2 Supplementary Figures
Figure S1 and S2 show the distribution of the percentages that testing outcomes are within the estimated 95%/90%/68% CIs over all the independent runs for all the datasets and algorithms. In Figure S1 and S2, the box extends from the 25 to 75 quartile values of the data (each data point represents an independent experimental run), with a line at the median. The whiskers extend from the box to show the range of the data. Flier points are those past the end of the whiskers, indicating the outliers.
s2.3 Correlation between Noise Variance and Prediction RMSE
|Dataset||Method||Prediction RMSE||Noise Variance||Dataset||Method||Prediction RMSE||Noise Variance|
Table S2 shows the mean of prediction RMSE and noise variance of RIO, Y+IO and SVGP over all the independent runs for all the datasets. A clear positive correlation between prediction RMSE and noise variance can be observed. In most cases, RIO has a lower prediction RMSE and a lower noise variance than Y+IO, which has lower values in both metrics than original SVGP. This results are in accordance with the theoretical analysis in section 2.3 of the main paper, and demonstrates the effectiveness of both residual estimation and I/O kernel.
s2.4 Additional Discussions
One interesting observation is that in CT dataset SVGP shows very poor prediction RMSE and 95% CI RMSE, but achieves the best 90% and 68% CI RMSE. After investigation, this is only happened by chance, and is not due to the accurate CI estimations of SVGP. Since SVGP is not able to extract any useful information from the high-dimensional input space, it treated all the outcomes as simply noise. As a result, SVGP shows a very large RMSE compared to other algorithms, and the mean of its predicted outcome distribution is always around 0. Since SVGP treats everything as noise, the estimated noise variance is very high, and the estimated 95% CI based on this noise variance is overly high and covers all the test outcomes in most cases. This leads to a high Z-score RMSE as we expected. However, when the estimated 90% CI is evaluated, the big error in mean estimation and big error in variance estimation cancel most part of each other by chance, i.e., the estimated 90% CI is mistakenly shifted by erroneous mean then the overly wide variance fortunately covers slightly more than 90% test outcomes. Similar thing happen to the estimated 68% CI, but because of the overly high estimated variance, the coverage percentages are now blow 68%, but still provide decent Z-score errors. This phenomenon shows that the performance metric we used may introduce outliers if the prediction errors are too high. For comparison among algorithms that give reasonable prediction errors, which happens in most cases in the experiments, the current performance metric still works.
Appendix S3 Proofs
s3.1 Proofs for Section 2.3
Theorem 2.4 follows from a series of three lemmas. First, following a standard approach , consider the eigenfunction expansion and . Let be the diagonal matrix of the eigenvalues , and be the design matrix, i.e., .
Since, from the perspective of GP, is indistinguishable from noise, the optimal hyperparameter setting for the GP predictor is mean zero, kernel and noise variance . The expected generalization error of the GP predictor with posterior mean is
where the last step makes use of the fact that the expectation of the product of independent zero-mean random variables is zero. Plugging in a well-known result for the generalization error of Gaussian processes [29, 31, 34] yields the intended result
Making use of the fact that ,
Now, , and . So, ∎
Here, the goal of the GP is to fit the residuals . Since and is indistinguishable from noise, the optimal hyperparameter setting for the GP predictor is mean zero, kernel , and noise variance . Denote the posterior mean of the GP residual predictor by . Then, the final predictor for is The expected generalization error of is then
Theorem 2.4. and
The first condition follows from Lemma 2.1, Lemma 2.3, and the fact that
The second condition follows from Lemma 2.2, Lemma 2.3, and the fact that
s3.2 Proof of Theorem 2.5.
Theorem 2.5. and
The fact that and are non-trivial implies and Following the definitions at the beginning of Section S3.1, let , , and , be the diagonal matrices of eigenvalues and the design matrices for the eigenfunction expansions of and , respectively.
With only the input kernel, that fact that is indistinguishable from noise from the perspective of GP implies is also indistinguishable from noise, since it is a function of Thus, the GP optimizer estimates the noise variance as , so the generalization error is