Predictive Uncertainty through Quantization
Abstract
High-risk domains require reliable confidence estimates from predictive models. Deep latent variable models provide these, but suffer from the rigid variational distributions used for tractable inference, which err on the side of overconfidence. We propose Stochastic Quantized Activation Distributions (SQUAD), which imposes a flexible yet tractable distribution over discretized latent variables. The proposed method is scalable, self-normalizing and sample efficient. We demonstrate that the model fully utilizes the flexible distribution, learns interesting nonlinearities, and provides predictive uncertainty of competitive quality.
Bastiaan S. Veeling 

University of Amsterdam 
basveeling@gmail.com 
Rianne van den Berg 
University of Amsterdam 
Max Welling 
University of Amsterdam 
1 Introduction
In high-risk domains, prediction errors come at high costs. Luckily such domains often provide a fail-safe: self-driving cars perform an emergency stop, doctors run another diagnostic test, and industrial processes are temporarily halted. For deep learning models, this can be achieved by rejecting datapoints with a confidence score below a predetermined threshold. This way, a low error rate can be guaranteed at the cost of rejecting some predictions. However, estimating high-quality confidence scores from neural networks, scores that induce a well-ordered ranking of correct and incorrect predictions, remains an active area of research.
Deep Latent Variable Models (DLVMs, fig. 1) approach this by postulating latent variables whose uncertainty influences the confidence in the target prediction. Recently, efficient inference algorithms have been proposed in the form of variational inference, where an inference neural network is optimized to predict the parameters of a variational distribution that approximates an otherwise intractable distribution (Kingma & Welling (2013); Rezende et al. (2014); Alemi et al. (2016); Achille & Soatto (2016)).
Variational inference relies on a tractable class of distributions that can be optimized to closely resemble the true distribution (fig. 2), and it is hypothesized that more flexible classes lead to more faithful approximations and thus better performance (Jordan et al. (1999)). To explore this hypothesis, we propose a novel tractable class of highly flexible variational distributions. Considering that neural networks with low-precision activations exhibit good performance (Holi & Hwang (1993); Hubara et al. (2016)), we make the modeling assumption that latent variables can be expressed under a strong quantization scheme, without loss of predictive fidelity. If this assumption holds, it becomes tractable to model a scalar latent variable with a flexible multinomial distribution over the quantization bins (fig. 3).
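As a toy illustration of this assumption (the bin count, placement, and probabilities below are made up and not the settings used in our experiments):

import numpy as np

# Quantize the domain of one scalar latent variable into C bins.
C = 5
bin_values = np.linspace(-2.0, 2.0, C)                 # illustrative value vector v

# A multinomial over the bins can place mass anywhere, e.g. on the two
# outer bins simultaneously (a bimodal belief a Gaussian cannot express).
bin_probs = np.array([0.45, 0.05, 0.00, 0.05, 0.45])
assert np.isclose(bin_probs.sum(), 1.0)

expected_z = float(bin_probs @ bin_values)             # posterior mean
sampled_z = np.random.choice(bin_values, p=bin_probs)  # a draw from the latent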
By repositioning the variational distribution from a potentially limited description of moments, as found in commonly applied conjugate distributions, to a direct expression of probabilities per value, a variety of benefits arise. As the output domain is constrained, the method becomes self-normalizing, relieving the model from hard-to-parallelize batch normalization techniques (Ioffe & Szegedy (2015)). More interesting priors can be explored, and the model is able to learn a unique activation function per neuron.
More concretely, the contributions of this work are as follows:

We propose a novel variational inference method by leveraging multinomial distributions on quantized latent variables.

We show that the emerging predicted distributions are multimodal, motivating the need for flexible distributions in variational inference.

We demonstrate that the proposed method applied to the information bottleneck objective computes competitive uncertainty over the predictions and that this manifests in better performance under strong risk guarantees.
2 Background
In this work, we explore deep neural networks for regression and classification. We have datapoints consisting of inputs x and targets y in a dataset, and postulate latent variables z that represent the data. We focus on the Information Bottleneck (IB) perspective: first proposed by Tishby et al. (2000), the information bottleneck objective is optimized to maximize the mutual information between z and the targets y, whilst minimizing the mutual information between z and the inputs x. The objective can be efficiently optimized using a variational inference scheme, as shown concurrently by Alemi et al. (2016) and Achille & Soatto (2016). Under the Markov assumption y ↔ x ↔ z, they derive a variational bound on the IB objective, which corresponds to minimizing the following per-datapoint loss:
\mathcal{L}_i = \mathbb{E}_{p(z|x_i)}\big[-\log q(y_i|z)\big] + \beta \, \mathrm{KL}\big(p(z|x_i) \,\|\, q(z)\big) \qquad (1)
where the expectation is commonly estimated using a single Monte Carlo sample and q(z) is a variational approximation to the marginal distribution of z. In practice q(z) is fixed to a simple distribution such as a spherical Gaussian. Alemi et al. (2016) and Achille & Soatto (2016) continue to show that the Variational Auto-Encoder (VAE) Evidence Lower Bound (ELBO) proposed in Kingma & Welling (2013); Rezende et al. (2014) is a special case of the IB bound when y_i = i and β = 1:
\mathcal{L}_i = \mathbb{E}_{p(z|x_i)}\big[-\log q(x_i|z)\big] + \mathrm{KL}\big(p(z|x_i) \,\|\, q(z)\big) \qquad (2)
where i represents the identity of datapoint x_i. Interestingly, the VAE perspective considers the encoder in this bound to be a variational posterior approximating p(z|x), whilst the IB perspective prescribes that the encoder in the ELBO is not a variational posterior but the true p(z|x), and that instead q(y|z) and q(z) are the distributions approximated with variational counterparts.
From yet another perspective, equation 1 can be interpreted as a domain-translating beta-VAE (Higgins et al. (2016)), where an input image is encoded into a latent space and decoded into the target domain. The Lagrange multiplier β then controls the trade-off between rate and distortion, as argued by Alemi et al. (2017).
In this work, we follow the IB interpretation of the bound in equation 1 and leave the evaluation of our proposed variational inference scheme in other models such as the VAE for further work.
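For reference, the two terms of equation 1 can be evaluated as follows for a Gaussian-encoder baseline; this is a toy numerical sketch with made-up encoder outputs, class probabilities and β:

import numpy as np

# One datapoint: the encoder predicts a diagonal Gaussian p(z|x) = N(mu, sigma^2).
mu = np.array([0.3, -1.2])
sigma = np.array([0.8, 0.5])
z = mu + sigma * np.random.randn(2)        # reparameterized sample z ~ p(z|x)

# First term: E[-log q(y|z)], estimated with this single Monte Carlo sample.
class_probs = np.array([0.7, 0.2, 0.1])    # decoder output q(y|z) for 3 classes
y = 0
nll = -np.log(class_probs[y])

# Second term: analytic KL from N(mu, sigma^2) to the spherical marginal N(0, 1).
kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

beta = 1e-3
loss = nll + beta * kl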
3 Method
At the heart of our proposal lies the assumption that neural networks can be effective under strong activation quantization schemes. We start by presenting the derivation of the model in the context of a single latent-layer information bottleneck, following the single-datapoint loss in equation 1 and dropping the subscript i for clarity, with figure 4 for visual reference:
\mathcal{L} = \mathbb{E}_{p(z|x)}\big[-\log q(y|z)\big] + \beta \, \mathrm{KL}\big(p(z|x) \,\|\, q(z)\big) \qquad (3)
To impose a flexible, multimodal distribution over z, we first make a mean-field assumption p(z|x) = ∏_d p(z_d|x). We then quantize the domain of each of the scalar latent variables z_d such that only a small set of C potential values v = (v_1, …, v_C) remains; see fig. 3.
To optimize the parameters with Stochastic Gradient Descent (SGD), we need to derive a fully differentiable sampling scheme that allows us to sample values of z. To formulate this, we reparametrize the expectation over z in equation 3 using a set of variables m_d which index the value vector v, allowing us to use a softmax function to represent the distribution over each m_d:
p(m_d = c \,|\, x) = \mathrm{softmax}\big(f_d(x)\big)_c \qquad (4)
where f_d(x) denotes the C logits for latent d produced by an inference network. The sampled indexing values are then used in conjunction with the value vector v as input for q(y|z), which is modelled with a small network operating on the indexed values v[m] (abusing notation to indicate element-wise indexing of v with m).
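A minimal sketch of this parameterization with a single dense layer as the inference network, using hard (non-differentiable) indexing into v for now; all names, shapes and the choice of a linear layer are illustrative assumptions rather than the exact architecture used in the experiments:

import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, C, D_in = 4, 5, 10                       # latents, bins, input dim (illustrative)
v = np.linspace(-2.0, 2.0, C)               # quantization bin values
W = rng.normal(scale=0.1, size=(D, C, D_in))
b = np.zeros((D, C))

x = rng.normal(size=D_in)
logits = np.einsum('dcj,j->dc', W, x) + b   # one row of C logits per latent
bin_probs = softmax(logits)                 # p(m_d = c | x): a multinomial per latent
assert np.allclose(bin_probs.sum(-1), 1.0)  # self-normalizing by construction

m = np.array([rng.choice(C, p=p) for p in bin_probs])  # hard index per latent
z = v[m]                                    # element-wise indexing of v with m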
To enable sampling from the discrete variables m_d, we use the Gumbel-Max trick (Gumbel (1954)), reparameterizing the expectation with uniform noise ε ~ Uniform(0, 1):
m_d = \arg\max_{c} \big[\log p(m_d = c \,|\, x) - \log(-\log \epsilon_c)\big], \quad \epsilon_c \sim \mathrm{Uniform}(0, 1) \qquad (5)
As the argmax is not differentiable, we approximate this expectation using the Gumbel-Softmax trick (Maddison et al. (2016); Jang et al. (2016)), which generates samples that smoothly deform into one-hot samples as the softmax temperature τ approaches 0. Using the inner product of the approximate one-hot samples and the value vector v, we create samples from p(z|x):
z_d = v^\top \mathrm{softmax}\Big(\big[\log p(m_d = c \,|\, x) - \log(-\log \epsilon_c)\big]_{c=1}^{C} \,/\, \tau\Big) \qquad (6)
In practice, we anneal the temperature τ during the training process, as proposed by Yang et al. (2017), to reduce gradient variance initially, at the risk of introducing bias.
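A self-contained sketch of the resulting differentiable sampler of equations 5 and 6 (illustrative shapes and temperature):

import numpy as np

rng = np.random.default_rng(1)
D, C = 4, 5
v = np.linspace(-2.0, 2.0, C)                       # quantization bin values
logits = rng.normal(size=(D, C))                    # per-latent logits from the inference net
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))

def gumbel_softmax_sample(log_probs, tau, rng):
    # Perturb the log-probabilities with Gumbel(0, 1) noise and apply a
    # tempered softmax; the output approaches a one-hot vector as tau -> 0.
    eps = rng.uniform(size=log_probs.shape)
    g = -np.log(-np.log(eps))
    y = (log_probs + g) / tau
    y = y - y.max(-1, keepdims=True)
    return np.exp(y) / np.exp(y).sum(-1, keepdims=True)

soft_onehot = gumbel_softmax_sample(log_probs, tau=0.5, rng=rng)
z = soft_onehot @ v                                 # Eq. 6: approximate samples of z
hard_m = soft_onehot.argmax(-1)                     # Eq. 5: the non-differentiable Gumbel-Max index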
To conclude our derivation, we use a fixed SQUAD distribution to model the variational marginal q(z), as shown in figure 5. We can then derive the KL term analytically, following the definition of the KL divergence for discrete distributions. Using the fact that the KL divergence is additive for independent variables, we get our final loss:
\mathcal{L} = \mathbb{E}_{p(z|x)}\big[-\log q(y|z)\big] + \beta \sum_{d=1}^{D} \sum_{c=1}^{C} p(m_d = c \,|\, x) \log \frac{p(m_d = c \,|\, x)}{q(m_d = c)} \qquad (7)
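The KL term in equation 7 is a sum of discrete KL divergences, one per latent variable, and can be evaluated exactly; a small numerical sketch with made-up probabilities:

import numpy as np

rng = np.random.default_rng(2)
D, C = 4, 5
p = rng.dirichlet(np.ones(C), size=D)      # predicted bin probabilities p(m_d|x), shape (D, C)
q = np.full(C, 1.0 / C)                    # a fixed marginal, here uniform for illustration

kl_per_latent = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
kl_total = kl_per_latent.sum()             # KL is additive over independent latents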
For the remainder of this work, we will simply refer to the quantized latent variables as z, for clarity.
At test time, we can approximate the predictive distribution for a new datapoint x by taking T samples from the latent variables, i.e. z^{(t)} ~ p(z|x), and averaging the predictions for y:
q(y \,|\, x) \approx \frac{1}{T} \sum_{t=1}^{T} q\big(y \,|\, z^{(t)}\big), \quad z^{(t)} \sim p(z \,|\, x) \qquad (8)
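At test time this amounts to simple Monte Carlo averaging; a sketch with hypothetical stand-ins for the trained encoder sampler and decoder:

import numpy as np

def predict(x, sample_z, decoder, T=4):
    # Approximate q(y|x) by averaging decoder outputs over T latent samples (Eq. 8).
    probs = [decoder(sample_z(x)) for _ in range(T)]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(3)
sample_z = lambda x: rng.normal(size=8)                      # placeholder for a draw z ~ p(z|x)
decoder = lambda z: np.exp(z[:3]) / np.exp(z[:3]).sum()      # placeholder q(y|z) over 3 classes

y_probs = predict(np.zeros(16), sample_z, decoder, T=4)
confidence = y_probs.max()        # Softmax Response, used later as the confidence score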
We improve the flexibility of the proposed model by creating a hierarchical set of latent variables. The joint distribution of L layers of latents is then:
p(z_1, \dots, z_L \,|\, x) = p(z_1 \,|\, x) \prod_{l=2}^{L} p(z_l \,|\, z_{l-1}) \qquad (9)
with each conditional modelled by a SQUAD layer as described above. This is straightforwardly implemented with a simple ancestral sampling scheme.
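A sketch of the ancestral sampling pass through a stack of SQUAD layers, reusing the Gumbel-Softmax sampler from above; all sizes are illustrative:

import numpy as np

rng = np.random.default_rng(4)
C = 5
v = np.linspace(-2.0, 2.0, C)                        # shared bin values (illustrative)
sizes = [16, 8, 8]                                   # input dim, then latents per layer

def squad_layer(h, W, b, tau=0.5):
    # logits -> Gumbel-Softmax -> inner product with the bin values v.
    logits = (h @ W + b).reshape(-1, C)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(-1, keepdims=True)
    soft = np.exp(y) / np.exp(y).sum(-1, keepdims=True)
    return soft @ v

layers = [(rng.normal(scale=0.1, size=(d_in, d_out * C)), np.zeros(d_out * C))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]

h = rng.normal(size=sizes[0])                        # the input x
for W, b in layers:
    h = squad_layer(h, W, b)                         # z_1 ~ p(z_1|x), then z_l ~ p(z_l|z_{l-1})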
Interestingly, the strong quantization proposed in our method can itself be considered an additional information bottleneck, as it exactly upper-bounds the number of bits per latent variable. Such bottlenecks are theorized to have a beneficial effect on generalization (Tishby et al. (2000); Achille & Soatto (2016); Alemi et al. (2017; 2016)), and we can directly control this bottleneck by varying the number of quantization bins.
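Concretely, the entropy of each quantized latent is bounded by the log of the number of bins; for instance, with the C = 15 bins found by the hyperparameter search in the appendix:

H(z_d) \le \log_2 C \ \text{bits}, \qquad C = 15 \;\Rightarrow\; \log_2 15 \approx 3.9 \ \text{bits per latent variable.}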
The computational complexity of the method, as well as the number of model parameters, scales linearly in the number of quantization bins C. It is thus suitable for large-scale inference. We would like to stress that the proposed method differs from work that leverages the Gumbel-Softmax trick to model categorical latent variables: our proposal models continuous scalar latent variables by quantizing their domain and modeling the belief over them with a multinomial distribution. Categorical latent variable models would incur a much larger, polynomial complexity penalty in C.
Matrix-factorization variant
To improve the tractability of using a large number of quantization bins, we propose a variant of SQUAD that uses a matrix-factorization scheme to improve the parameter efficiency. Formally, the logits in equation 4 are computed in two steps with full layer weights W and V: W first projects the input to F factors per neuron, and V then maps these factors to the C bin logits, where K denotes the number of neurons, F the number of factorization inputs, C the number of quantization bins, and D_in the input dimensionality. To further improve the parameter efficiency, the second factor can also be learned per layer rather than per neuron, which is found to be beneficial for large C by the hyperparameter search presented in section 5. We depict this alternative model on the right side of figure 4 and will refer to it as SQUAD-factorized. We leave further extensions such as Network-in-Network (Lin et al. (2013)) for future work.
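A sketch of the factorized logit computation, under the assumption that W maps the input to F factors per neuron and V maps those factors to the C bin logits; the exact shapes may differ from the paper's implementation and all sizes here are illustrative:

import numpy as np

rng = np.random.default_rng(5)
K, F, C, D_in = 8, 4, 37, 32                   # neurons, factors, bins, input dim (illustrative)
W = rng.normal(scale=0.1, size=(K, F, D_in))   # assumed shape: input -> factors, per neuron
V = rng.normal(scale=0.1, size=(K, C, F))      # assumed shape: factors -> bin logits, per neuron

x = rng.normal(size=D_in)
factors = np.einsum('kfd,d->kf', W, x)         # (K, F)
logits = np.einsum('kcf,kf->kc', V, factors)   # (K, C), fed to the softmax of Eq. 4

# Rough parameter counts: factorized vs. a full (K, C, D_in) weight tensor.
print(K * F * (D_in + C), 'vs', K * C * D_in)  # 2208 vs 9472 in this toy setting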
4 Related Work
Outside the realm of DLVMs, other methods have been explored for predictive uncertainty. Lakshminarayanan et al. (2017) propose deep ensembles: straightforward averaging of predictions from a small set of separately, adversarially trained DNNs. Although highly scalable, this method requires retraining a model up to 10 times, which can be prohibitively expensive for large datasets.
Gal & Ghahramani (2015b) propose the use of dropout (Srivastava et al. (2014)) at test time and present a Bayesian neural network interpretation of this method. A follow-up work by Gal et al. (2017) explores the use of Gumbel-Softmax to smoothly deform the dropout noise, allowing optimization of the dropout rate during training. A downside of MC-dropout is the limited flexibility of the fixed bimodal delta-peak distribution imposed on the weights, which requires a large number of samples for good estimates of uncertainty. van den Oord et al. (2017) propose the use of vector quantization in variational inference, quantizing a multi-dimensional embedding rather than individual latent variables, and explore this in the context of autoencoders.
In the space of learned nonlinearities, Su et al. (2017) explore a flexible nonlinearity that can assume the form of most canonical activations. More flexible distributions have been explored for distributional reinforcement learning by Dabney et al. (2017) using quantile regression, which can be seen as a special case of SQUAD where the bin values are learned but have fixed uniform probability. Categorical distributions on scalar variables have been used to model more flexible Bayesian neural network posteriors by Shayer et al. (2017). The use of a mixture-of-Diracs distribution to approximate a variety of distributions was proposed by Schrempf et al. (2006).
5 Results
Quantifying the quality of the uncertainty estimates of models remains an open problem. Various metrics have been explored in previous works, such as relative entropy (Louizos & Welling (2017); Gal & Ghahramani (2015a)), probability calibration, and proper scoring rules (Lakshminarayanan et al. (2017)). Although interesting in their own right, these metrics do not directly measure a good ranking of predictions, nor do they indicate applicability in high-risk domains. Proper scoring rules are the exception, but a model with good ranking ability does not necessarily exhibit good performance on proper scoring rules: any score that provides a relative ordering suffices and does not have to reflect true calibrated probabilities. In fact, well-ranked confidence scores can be recalibrated (Niculescu-Mizil & Caruana (2005)) after training to improve performance on proper scoring rules and calibration metrics.
In order to evaluate the applicability of the model in high-risk fields such as medicine, we want to quantify how models perform under a desired risk requirement. We propose to use the selection with guaranteed risk (SGR) method introduced by Geifman & El-Yaniv (2017) to measure this. (We deviate slightly from Geifman & El-Yaniv (2017) in that we use Softmax Response (SR), the probability taken from the softmax output for the most likely class, as the confidence score for all methods; Geifman & El-Yaniv (2017) instead proposed to use the variance of the probabilities for MC-dropout, but our experiments showed that SR paints MC-dropout in a more favorable light.) In summary, the SGR method defines a selective classifier using a trained neural network and can guarantee a user-selected desired risk with high probability (e.g. 99%), by selectively rejecting data points with a predicted confidence score below an optimal threshold.
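A simplified empirical sketch of selecting by confidence at a target risk level; note that the actual SGR method additionally provides a high-probability guarantee on the risk via a binomial tail bound, which is omitted here:

import numpy as np

def coverage_at_risk(confidence, correct, target_risk):
    # Largest coverage whose empirical selective risk stays below target_risk,
    # obtained by accepting the most confident predictions first.
    order = np.argsort(-confidence)
    errors = ~correct[order]
    risk = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    ok = np.where(risk <= target_risk)[0]
    return 0.0 if len(ok) == 0 else (ok[-1] + 1) / len(errors)

# Illustrative usage with made-up predictions:
rng = np.random.default_rng(6)
confidence = rng.uniform(size=1000)               # e.g. Softmax Response per test point
correct = rng.uniform(size=1000) < 0.8 + 0.15 * confidence
print(coverage_at_risk(confidence, correct, target_risk=0.02))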
To limit the influence of hyperparameters on the comparison, we use the automated optimization method TPE (Bergstra et al. (2011)) for both the baselines and our models. The hyperparameters are optimized for coverage at 2% risk on Fashion-MNIST, and subsequently evaluated on notMNIST. Larger models are evaluated on SVHN.
We compare our model against plain MLPs, MC-Dropout using Maxout activations (Goodfellow et al. (2013); Chang & Chen (2015)), which we found to perform stronger than conventional ReLU MC-dropout models under an equal number of latent variables, and an information bottleneck model using mean-field Gaussian distributions. All models are optimized with ADAM (Kingma & Ba (2014)), use the weight initialization proposed by He et al. (2015), weight decay, and an adaptive learning rate decay scheme (a 10x reduction after 10 epochs of no validation accuracy improvement), with early stopping after 20 epochs of no improvement. We evaluate the complementary deep ensembles technique (Lakshminarayanan et al. (2017)) for all methods.
5.1 Main results
We start our analysis by comparing the predictive uncertainty of 2-layer models with 32 latent variables per layer. In figure 6 we visualize the risk/coverage trade-off achieved using the predicted uncertainty as a selective threshold, and present coverage results in table 1. Overall, we find that SQUAD performs significantly better than plain MLPs and deep Gaussian IB models, and we tentatively attribute this to the increased flexibility of the multinomial distribution. Compared to a Maxout MC-dropout model with a similar number of weights, SQUAD appears to have a slight, though not significant, advantage, despite the strong quantization scheme, especially at low risk. Deep ensembles improve results for all methods, which fits the hypothesis that ensembles integrate over a form of weight uncertainty. When evaluated on a new dataset without retuning hyperparameters, SQUAD shows strong performance, as shown in table 3.
Fashion MNIST  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 
Plain MLP  29.1 ()  45.9 ()  60.4 ()  0.408 ()  87.7 () 
Maxout MCDropout  41.9 ()  56.5 ()  69.9 ()  0.299 ()  89.5 () 
DLGM  0.0 ()  33.5 ()  47.0 ()  0.446 ()  84.3 () 
SQUAD  42.9 ()  58.3 ()  69.5 ()  0.293 ()  89.5 () 
Deep Ensemble  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 
Plain MLP Ensemble  40.6  58.3  70.2  0.296  89.3 
Max. MCD. Ensemble  48.2  59.1  72.2  0.271  90.2 
DLGM Ensemble  0.0  34.3  47.8  0.435  84.7 
SQUAD Ensemble  47.5  61.6  73.1  0.273  90.1 
notMNIST  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 

Plain MLP  77.4 ()  85.5 ()  90.3 ()  0.228 ()  93.3 () 
Maxout MCDropout  85.7 ()  90.6 ()  94.2 ()  0.165 ()  95.3 () 
SQUAD  87.1 ()  91.1 ()  94.5 ()  0.161 ()  95.4 () 
Plain MLP Ensemble  85.9  90.6  93.5  0.175  94.9 
Max. MCD. Ensemble  88.5  92.8  95.7  0.148  96.0 
SQUAD Ensemble  90.7  93.5  96.1  0.137  96.2 
MLP K=256 (SVHN)  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 

Plain MLP  0.0 ()  0.0 ()  36.3 ()  0.758 ()  83.1 () 
Maxout MCDropout  0.0 ()  50.7 ()  65.0 ()  0.480 ()  86.4 () 
SQUADfactorized  18.4 ()  53.9 ()  66.7 ()  0.454 ()  86.7 () 
SQUAD  1.7 ()  42.8 ()  59.3 ()  0.534 ()  84.6 () 
Max. MCDropout T=4  0.0 ()  38.5 ()  57.6 ()  0.562 ()  84.9 () 
SQUADfactorized T=4  10.7 ()  49.9 ()  64.5 ()  0.480 ()  86.2 () 
SQUAD T=4  0.0 ()  38.0 ()  55.6 ()  0.569 ()  83.7 () 
5.2 Natural Images
To explore larger models trained on natural image datasets, we lightly tune hyperparameters of 256-latent, 2-layer models over 100 TPE evaluations. As SVHN contains natural images in color, we anticipate a need for a higher amount of information per variable. We thus explore the effect of the matrix-factorized variant.
As shown in table 3, SQUAD-factorized outperforms the non-factorized variant. Considering the computational cost: at the optimum of a 4-neuron factorization with 37 quantization bins, the model clocks in at 3.4 million weights. In comparison, the optimum found for the presented MC-dropout results uses 9.0 million weights. On an NVIDIA Titan XP, the dropout baseline takes 13s per epoch on average, while SQUAD-factorized takes just 9s.
To evaluate the sample efficiency of the methods, we compare results at T=4 samples. We find that SQUAD's results suffer less from undersampling than those of MC-dropout. We tentatively attribute this sample efficiency to the flexible approximating posterior on the activations, which is in stark contrast to the rigid approximating distribution that MC-dropout imposes on the weights. In conclusion, SQUAD comes out favorably in a resource-constrained environment.
5.3 Analysis of latent variable distributions
To evaluate whether the proposed variational distribution simply collapses into single-mode predictions, we examine what type of distributions the model predicts over the latent variables. We visualize the forms of the predicted distributions in figure 7a. Although this showcases only a small subset of the multimodal behavior that emerges, it demonstrates that the model indeed utilizes the distribution to its full potential. To provide an intuition on how these predicted distributions emerge, we present figure 9 in the appendix.
In figure 8 we visualize one of the activation functions that the method learns for a 1-dimensional-input SQUAD-factorized model. The learned activation functions resemble “peaked” sigmoid activations, which can be interpreted as a combination of an RBF kernel and a sigmoid. This provides food for thought on how nonlinearities for conventional neural networks can be designed; the effect of using such a nonlinearity can be studied in further work.
6 Discussion
In this work, we have proposed a new flexible class of variational distributions. To measure its effectiveness for real-world classification, we applied the class to a deep variational information bottleneck model. By placing a quantization-based distribution on the activations, we can compute uncertainty estimates over the outputs. We proposed an evaluation scheme motivated by the need in real-world domains to guarantee a minimal risk. The results presented indicate that SQUAD provides an improvement over plain neural networks and Gaussian information bottleneck models. In comparison to an MC-Dropout model, which approximates a Bayesian neural network, we obtain competitive performance. Moreover, qualitatively we find that the flexible distribution is used to its full advantage and that the method is sample efficient. The method learns interesting nonlinearities, is tractable and scalable, and as the output domain is constrained, no batch normalization techniques are required.
Various directions for future work arise. The improvement of ensemble methods over individual models indicates that there remains room for improvement in capturing the full uncertainty over the output, and thus a fully Bayesian approach to SQUAD, which would include weight uncertainty, shows promise. The flexible class allows us to define a wide variety of interesting priors, providing the opportunity to study priors that are hard to define as a continuous density. Likewise, more effective initialization of the parameters of the proposed method requires further attention. Orthogonally, the proposed class can be applied to other variational objectives as well, such as the variational autoencoder. Finally, the discretized nature of the variables allows for the analytical computation of other divergences such as the mutual information and the Jensen-Shannon divergence, the effectiveness of which remains to be studied.
Acknowledgements
We thank Bart Bakker, Maximilian Ilse, Dimitrios Mavroeidis, Jakub Tomczak, Daniel Worrall and the anonymous reviewers for their insightful comments and discussions. This research was supported by Philips Research, the SURFsara Lisa cluster and the NVIDIA GPU Grant. We thank the contributors to TensorFlow (Abadi et al. (2016)), Keras (Chollet et al. (2015)) and Sacred (Greff et al. (2016)).
References
 Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: LargeScale machine learning on heterogeneous distributed systems. March 2016.
 Achille & Soatto (2016) Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. November 2016.
 Alemi et al. (2016) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. December 2016.
Alemi et al. (2017) Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. An Information-Theoretic analysis of deep Latent-Variable models. November 2017.
Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter optimization. In J Shawe-Taylor, R S Zemel, P L Bartlett, F Pereira, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 24, pp. 2546–2554. Curran Associates, Inc., 2011.
Chang & Chen (2015) Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. November 2015.
 Chollet et al. (2015) François Chollet et al. Keras. https://keras.io, 2015.
 Dabney et al. (2017) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. October 2017.
 Gal & Ghahramani (2015a) Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. June 2015a.
 Gal & Ghahramani (2015b) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. June 2015b.
 Gal et al. (2017) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. May 2017.
Geifman & El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4885–4894. Curran Associates, Inc., 2017.
Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. February 2013.
Greff et al. (2016) Klaus Greff et al. Sacred. https://github.com/IDSIA/sacred, 2016.
Gumbel (1954) E J Gumbel. Statistical Theory of Extreme Values and Some Practical Applications. National Bureau of Standards, Washington, 1954.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. cvfoundation.org, 2015.
Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. November 2016.
 Holi & Hwang (1993) J L Holi and J N Hwang. Finite precision error analysis of neural network hardware implementations. IEEE Trans. Comput., 42(3):281–290, March 1993.
Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. September 2016.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. jmlr.org, June 2015.
Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. November 2016.
 Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November 1999.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. December 2014.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. AutoEncoding variational bayes. December 2013.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6405–6416. Curran Associates, Inc., 2017.
 Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. December 2013.
 Louizos & Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. March 2017.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. November 2016.
Niculescu-Mizil & Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 625–632, New York, NY, USA, 2005. ACM.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. January 2014.
Schrempf et al. (2006) O C Schrempf, D Brunn, and U D Hanebeck. Dirac mixture density approximation based on minimization of the weighted Cramér-von Mises distance. In 2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 512–517, 2006.
 Shayer et al. (2017) Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. October 2017.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.
 Su et al. (2017) Qinliang Su, Xuejun Liao, and Lawrence Carin. A probabilistic framework for nonlinearities in stochastic neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4486–4495. Curran Associates, Inc., 2017.
 Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. April 2000.
 van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. November 2017.
Yang et al. (2017) Dong Yang, Daguang Xu, S Kevin Zhou, Bogdan Georgescu, Mingqing Chen, Sasa Grbic, Dimitris Metaxas, and Dorin Comaniciu. Automatic liver segmentation using an adversarial Image-to-Image network. July 2017.
Appendix A Appendix
A.1 Effect of hyperparameters on coverage
The optimal configuration of hyperparameters and bin priors has been determined using 700 evaluations selected with TPE. The space of parameters explored is as follows, presented in the hyperopt API for transparency:
# Shared
C: quniform(2, 10, 1) * 2 + 1,
dropout_rate: uniform(0.01, 0.95),
lr: loguniform(log(0.0001), log(0.01)),
batch_size: qloguniform(log(32), log(512), 1)
# SQUAD & Gaussian
kl_multiplier: loguniform(log(1e-6), log(0.01)),
init_scale: loguniform(log(1e-3), log(20)),
# SQUAD
use_bin_probs: choice(['uni', 'gaus']),
use_bins: choice(['equal_prob_gaus', 'linearly_spaced']),
learn_bin_values: choice(['per_neuron', 'per_layer', 'fixed']),
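For reference, a minimal sketch of how such a space can be searched with TPE via the hyperopt library; the objective below is a dummy placeholder (in the paper, the score being optimized is coverage at 2% risk) and only part of the space is reproduced:

from math import log
from hyperopt import fmin, tpe, hp, STATUS_OK

space = {
    'C': hp.quniform('C', 2, 10, 1),                        # later mapped to 2*C + 1 bins
    'dropout_rate': hp.uniform('dropout_rate', 0.01, 0.95),
    'lr': hp.loguniform('lr', log(0.0001), log(0.01)),
    'batch_size': hp.qloguniform('batch_size', log(32), log(512), 1),
    'kl_multiplier': hp.loguniform('kl_multiplier', log(1e-6), log(0.01)),
    'learn_bin_values': hp.choice('learn_bin_values',
                                  ['per_neuron', 'per_layer', 'fixed']),
}

def objective(params):
    # Placeholder: train a model with `params` and return 1 - coverage@2%-risk.
    # Here a dummy quadratic keeps the sketch runnable end to end.
    return {'loss': (params['lr'] - 0.001) ** 2, 'status': STATUS_OK}

best = fmin(objective, space, algo=tpe.suggest, max_evals=700)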
In figure 10 we visualize the pairwise effect of these hyperparameters on the coverage. The optimal configuration found for the main SQUAD model is: batch size: 244, KL multiplier: 0.0027, learn bin values: per layer, bin probabilities: uniform, bin values: linearly spaced over (-3.5, 3.5), lr: 0.0008, C: 15, initialization scale: 3.214.