Predictive Uncertainty through Quantization
Abstract
High-risk domains require reliable confidence estimates from predictive models. Deep latent variable models provide these, but suffer from the rigid variational distributions used for tractable inference, which err on the side of overconfidence. We propose Stochastic Quantized Activation Distributions (SQUAD), which imposes a flexible yet tractable distribution over discretized latent variables. The proposed method is scalable, self-normalizing and sample efficient. We demonstrate that the model fully utilizes the flexible distribution, learns interesting nonlinearities, and provides predictive uncertainty of competitive quality.
Bastiaan S. Veeling 

University of Amsterdam 
basveeling@gmail.com 
Rianne van den Berg 
University of Amsterdam 
Max Welling 
University of Amsterdam 
1 Introduction
In high-risk domains, prediction errors come at high costs. Luckily such domains often provide a fail-safe: self-driving cars perform an emergency stop, doctors run another diagnostic test, and industrial processes are temporarily halted. For deep learning models, this can be achieved by rejecting datapoints with a confidence score below a predetermined threshold. This way, a low error rate can be guaranteed at the cost of rejecting some predictions. However, estimating high-quality confidence scores from neural networks, scores that induce a well-ordered ranking of correct and incorrect predictions, remains an active area of research.
Deep Latent Variable Models (DLVMs, fig. 1) approach this by postulating latent variables whose uncertainty influences the confidence in the target prediction. Recently, efficient inference algorithms have been proposed in the form of variational inference, where an inference neural network is optimized to predict the parameters of a variational distribution that approximates an otherwise intractable distribution (Kingma & Welling (2013); Rezende et al. (2014); Alemi et al. (2016); Achille & Soatto (2016)).
Variational inference relies on a tractable class of distributions that can be optimized to closely resemble the true distribution (fig. 2), and it is hypothesized that more flexible classes lead to more faithful approximations and thus better performance (Jordan et al. (1999)). To explore this hypothesis, we propose a novel tractable class of highly flexible variational distributions. Considering that neural networks with low-precision activations exhibit good performance (Holi & Hwang (1993); Hubara et al. (2016)), we make the modeling assumption that latent variables can be expressed under a strong quantization scheme, without loss of predictive fidelity. If this assumption holds, it becomes tractable to model a scalar latent variable with a flexible multinomial distribution over the quantization bins (fig. 3).
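As a toy illustration of this assumption (the bin count, placement, and probabilities below are made up and not the settings used in our experiments):

import numpy as np

# Quantize the domain of one scalar latent variable into C bins.
C = 5
bin_values = np.linspace(-2.0, 2.0, C)                 # illustrative value vector v

# A multinomial over the bins can place mass anywhere, e.g. on the two
# outer bins simultaneously (a bimodal belief a Gaussian cannot express).
bin_probs = np.array([0.45, 0.05, 0.00, 0.05, 0.45])
assert np.isclose(bin_probs.sum(), 1.0)

expected_z = float(bin_probs @ bin_values)             # posterior mean
sampled_z = np.random.choice(bin_values, p=bin_probs)  # a draw from the latent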
By repositioning the variational distribution from a potentially limited description of moments, as found in commonly applied conjugate distributions, to a direct expression of probabilities per value, a variety of benefits arise. As the output domain is constrained, the method becomes self-normalizing, relieving the model from hard-to-parallelize batch normalization techniques (Ioffe & Szegedy (2015)). More interesting priors can be explored, and the model is able to learn a unique activation function per neuron.
More concretely, the contributions of this work are as follows:

We propose a novel variational inference method by leveraging multinomial distributions on quantized latent variables.

We show that the emerging predicted distributions are multimodal, motivating the need for flexible distributions in variational inference.

We demonstrate that the proposed method applied to the information bottleneck objective computes competitive uncertainty over the predictions and that this manifests in better performance under strong risk guarantees.
2 Background
In this work, we explore deep neural networks for regression and classification. We have datapoints consisting of inputs x and targets y in a dataset, and postulate latent variables z that represent the data. We focus on the Information Bottleneck (IB) perspective: first proposed by Tishby et al. (2000), the information bottleneck objective is optimized to maximize the mutual information between z and the targets y, whilst minimizing the mutual information between z and the inputs x. The objective can be efficiently optimized using a variational inference scheme, as shown concurrently by Alemi et al. (2016) and Achille & Soatto (2016). Under the Markov assumption y ↔ x ↔ z, they derive a variational bound on the IB objective, which corresponds to minimizing the following per-datapoint loss:
\mathcal{L}_i = \mathbb{E}_{p(z|x_i)}\big[-\log q(y_i|z)\big] + \beta \, \mathrm{KL}\big(p(z|x_i) \,\|\, q(z)\big) \qquad (1)
where the expectation is commonly estimated using a single Monte Carlo sample and q(z) is a variational approximation to the marginal distribution of z. In practice q(z) is fixed to a simple distribution such as a spherical Gaussian. Alemi et al. (2016) and Achille & Soatto (2016) continue to show that the Variational Auto-Encoder (VAE) Evidence Lower Bound (ELBO) proposed in Kingma & Welling (2013); Rezende et al. (2014) is a special case of the IB bound when y_i = i and β = 1:
\mathcal{L}_i = \mathbb{E}_{p(z|x_i)}\big[-\log q(x_i|z)\big] + \mathrm{KL}\big(p(z|x_i) \,\|\, q(z)\big) \qquad (2)
where i represents the identity of datapoint x_i. Interestingly, the VAE perspective considers the encoder in this bound to be a variational posterior approximating p(z|x), whilst the IB perspective prescribes that the encoder in the ELBO is not a variational posterior but the true p(z|x), and that instead q(y|z) and q(z) are the distributions approximated with variational counterparts.
From yet another perspective, equation 1 can be interpreted as a domain-translating beta-VAE (Higgins et al. (2016)), where an input image is encoded into a latent space and decoded into the target domain. The Lagrange multiplier β then controls the trade-off between rate and distortion, as argued by Alemi et al. (2017).
In this work, we follow the IB interpretation of the bound in equation 1 and leave the evaluation of our proposed variational inference scheme in other models such as the VAE for further work.
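For reference, the two terms of equation 1 can be evaluated as follows for a Gaussian-encoder baseline; this is a toy numerical sketch with made-up encoder outputs, class probabilities and β:

import numpy as np

# One datapoint: the encoder predicts a diagonal Gaussian p(z|x) = N(mu, sigma^2).
mu = np.array([0.3, -1.2])
sigma = np.array([0.8, 0.5])
z = mu + sigma * np.random.randn(2)        # reparameterized sample z ~ p(z|x)

# First term: E[-log q(y|z)], estimated with this single Monte Carlo sample.
class_probs = np.array([0.7, 0.2, 0.1])    # decoder output q(y|z) for 3 classes
y = 0
nll = -np.log(class_probs[y])

# Second term: analytic KL from N(mu, sigma^2) to the spherical marginal N(0, 1).
kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

beta = 1e-3
loss = nll + beta * kl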
3 Method
At the heart of our proposal lies the assumption that neural networks can be effective under strong activation quantization schemes. We start by presenting the derivation of the model in the context of a single latent-layer information bottleneck, following the single-datapoint loss in equation 1 and dropping the subscript i for clarity, with figure 4 for visual reference:
\mathcal{L} = \mathbb{E}_{p(z|x)}\big[-\log q(y|z)\big] + \beta \, \mathrm{KL}\big(p(z|x) \,\|\, q(z)\big) \qquad (3)
To impose a flexible, multimodal distribution over z, we first make a mean-field assumption p(z|x) = ∏_d p(z_d|x). We then quantize the domain of each of the scalar latent variables z_d such that only a small set of C potential values v = (v_1, …, v_C) remains; see fig. 3.
To optimize the parameters with Stochastic Gradient Descent (SGD), we need to derive a fully differentiable sampling scheme that allows us to sample values of z. To formulate this, we reparametrize the expectation over z in equation 3 using a set of variables m_d which index the value vector v, allowing us to use a softmax function to represent the distribution over each m_d:
p(m_d = c \,|\, x) = \mathrm{softmax}\big(f_d(x)\big)_c \qquad (4)
where f_d(x) denotes the C logits for latent d produced by an inference network. The sampled indexing values are then used in conjunction with the value vector v as input for q(y|z), which is modelled with a small network operating on the indexed values v[m] (abusing notation to indicate element-wise indexing of v with m).
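A minimal sketch of this parameterization with a single dense layer as the inference network, using hard (non-differentiable) indexing into v for now; all names, shapes and the choice of a linear layer are illustrative assumptions rather than the exact architecture used in the experiments:

import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, C, D_in = 4, 5, 10                       # latents, bins, input dim (illustrative)
v = np.linspace(-2.0, 2.0, C)               # quantization bin values
W = rng.normal(scale=0.1, size=(D, C, D_in))
b = np.zeros((D, C))

x = rng.normal(size=D_in)
logits = np.einsum('dcj,j->dc', W, x) + b   # one row of C logits per latent
bin_probs = softmax(logits)                 # p(m_d = c | x): a multinomial per latent
assert np.allclose(bin_probs.sum(-1), 1.0)  # self-normalizing by construction

m = np.array([rng.choice(C, p=p) for p in bin_probs])  # hard index per latent
z = v[m]                                    # element-wise indexing of v with m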
To enable sampling from the discrete variables m_d, we use the Gumbel-Max trick (Gumbel (1954)), reparameterizing the expectation with uniform noise ε ~ Uniform(0, 1):
m_d = \arg\max_{c} \big[\log p(m_d = c \,|\, x) - \log(-\log \epsilon_c)\big], \quad \epsilon_c \sim \mathrm{Uniform}(0, 1) \qquad (5)
As the argmax is not differentiable, we approximate this expectation using the Gumbel-Softmax trick (Maddison et al. (2016); Jang et al. (2016)), which generates samples that smoothly deform into one-hot samples as the softmax temperature τ approaches 0. Using the inner product of the approximate one-hot samples and the value vector v, we create samples from p(z|x):
z_d = v^\top \mathrm{softmax}\Big(\big[\log p(m_d = c \,|\, x) - \log(-\log \epsilon_c)\big]_{c=1}^{C} \,/\, \tau\Big) \qquad (6)
In practice, we anneal the temperature τ during the training process, as proposed by Yang et al. (2017), to reduce gradient variance initially, at the risk of introducing bias.
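A self-contained sketch of the resulting differentiable sampler of equations 5 and 6 (illustrative shapes and temperature):

import numpy as np

rng = np.random.default_rng(1)
D, C = 4, 5
v = np.linspace(-2.0, 2.0, C)                       # quantization bin values
logits = rng.normal(size=(D, C))                    # per-latent logits from the inference net
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))

def gumbel_softmax_sample(log_probs, tau, rng):
    # Perturb the log-probabilities with Gumbel(0, 1) noise and apply a
    # tempered softmax; the output approaches a one-hot vector as tau -> 0.
    eps = rng.uniform(size=log_probs.shape)
    g = -np.log(-np.log(eps))
    y = (log_probs + g) / tau
    y = y - y.max(-1, keepdims=True)
    return np.exp(y) / np.exp(y).sum(-1, keepdims=True)

soft_onehot = gumbel_softmax_sample(log_probs, tau=0.5, rng=rng)
z = soft_onehot @ v                                 # Eq. 6: approximate samples of z
hard_m = soft_onehot.argmax(-1)                     # Eq. 5: the non-differentiable Gumbel-Max index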
To conclude our derivation, we use a fixed SQUAD distribution to model the variational marginal q(z), as shown in figure 5. We can then derive the KL term analytically, following the definition of the KL divergence for discrete distributions. Using the fact that the KL divergence is additive for independent variables, we get our final loss:
\mathcal{L} = \mathbb{E}_{p(z|x)}\big[-\log q(y|z)\big] + \beta \sum_{d=1}^{D} \sum_{c=1}^{C} p(m_d = c \,|\, x) \log \frac{p(m_d = c \,|\, x)}{q(m_d = c)} \qquad (7)
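The KL term in equation 7 is a sum of discrete KL divergences, one per latent variable, and can be evaluated exactly; a small numerical sketch with made-up probabilities:

import numpy as np

rng = np.random.default_rng(2)
D, C = 4, 5
p = rng.dirichlet(np.ones(C), size=D)      # predicted bin probabilities p(m_d|x), shape (D, C)
q = np.full(C, 1.0 / C)                    # a fixed marginal, here uniform for illustration

kl_per_latent = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
kl_total = kl_per_latent.sum()             # KL is additive over independent latents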
For the remainder of this work, we will simply refer to the quantized latent variables as z, for clarity.
At test time, we can approximate the predictive distribution for a new datapoint x by taking T samples from the latent variables, i.e. z^{(t)} ~ p(z|x), and averaging the predictions for y:
q(y \,|\, x) \approx \frac{1}{T} \sum_{t=1}^{T} q\big(y \,|\, z^{(t)}\big), \quad z^{(t)} \sim p(z \,|\, x) \qquad (8)
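At test time this amounts to simple Monte Carlo averaging; a sketch with hypothetical stand-ins for the trained encoder sampler and decoder:

import numpy as np

def predict(x, sample_z, decoder, T=4):
    # Approximate q(y|x) by averaging decoder outputs over T latent samples (Eq. 8).
    probs = [decoder(sample_z(x)) for _ in range(T)]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(3)
sample_z = lambda x: rng.normal(size=8)                      # placeholder for a draw z ~ p(z|x)
decoder = lambda z: np.exp(z[:3]) / np.exp(z[:3]).sum()      # placeholder q(y|z) over 3 classes

y_probs = predict(np.zeros(16), sample_z, decoder, T=4)
confidence = y_probs.max()        # Softmax Response, used later as the confidence score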
We improve the flexibility of the proposed model by creating a hierarchical set of latent variables. The joint distribution of L layers of latents is then:
p(z_1, \dots, z_L \,|\, x) = p(z_1 \,|\, x) \prod_{l=2}^{L} p(z_l \,|\, z_{l-1}) \qquad (9)
with each conditional modelled by a SQUAD layer as described above. This is straightforwardly implemented with a simple ancestral sampling scheme.
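A sketch of the ancestral sampling pass through a stack of SQUAD layers, reusing the Gumbel-Softmax sampler from above; all sizes are illustrative:

import numpy as np

rng = np.random.default_rng(4)
C = 5
v = np.linspace(-2.0, 2.0, C)                        # shared bin values (illustrative)
sizes = [16, 8, 8]                                   # input dim, then latents per layer

def squad_layer(h, W, b, tau=0.5):
    # logits -> Gumbel-Softmax -> inner product with the bin values v.
    logits = (h @ W + b).reshape(-1, C)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(-1, keepdims=True)
    soft = np.exp(y) / np.exp(y).sum(-1, keepdims=True)
    return soft @ v

layers = [(rng.normal(scale=0.1, size=(d_in, d_out * C)), np.zeros(d_out * C))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]

h = rng.normal(size=sizes[0])                        # the input x
for W, b in layers:
    h = squad_layer(h, W, b)                         # z_1 ~ p(z_1|x), then z_l ~ p(z_l|z_{l-1})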
Interestingly, the strong quantization proposed in our method can itself be considered an additional information bottleneck, as it exactly upper-bounds the number of bits per latent variable. Such bottlenecks are theorized to have a beneficial effect on generalization (Tishby et al. (2000); Achille & Soatto (2016); Alemi et al. (2017; 2016)), and we can directly control this bottleneck by varying the number of quantization bins.
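Concretely, the entropy of each quantized latent is bounded by the log of the number of bins; for instance, with the C = 15 bins found by the hyperparameter search in the appendix:

H(z_d) \le \log_2 C \ \text{bits}, \qquad C = 15 \;\Rightarrow\; \log_2 15 \approx 3.9 \ \text{bits per latent variable.}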
The computational complexity of the method, as well as the number of model parameters, scales linearly in the number of quantization bins C. It is thus suitable for large-scale inference. We would like to stress that the proposed method differs from work that leverages the Gumbel-Softmax trick to model categorical latent variables: our proposal models continuous scalar latent variables by quantizing their domain and modeling the belief over them with a multinomial distribution. Categorical latent variable models would incur a much larger, polynomial complexity penalty in C.
Matrix-factorization variant
To improve the tractability of using a large number of quantization bins, we propose a variant of SQUAD that uses a matrix-factorization scheme to improve the parameter efficiency. Formally, the logits in equation 4 are computed in two steps with full layer weights W and V: W first projects the input to F factors per neuron, and V then maps these factors to the C bin logits, where K denotes the number of neurons, F the number of factorization inputs, C the number of quantization bins, and D_in the input dimensionality. To further improve the parameter efficiency, the second factor can also be learned per layer rather than per neuron, which is found to be beneficial for large C by the hyperparameter search presented in section 5. We depict this alternative model on the right side of figure 4 and will refer to it as SQUAD-factorized. We leave further extensions such as Network-in-Network (Lin et al. (2013)) for future work.
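A sketch of the factorized logit computation, under the assumption that W maps the input to F factors per neuron and V maps those factors to the C bin logits; the exact shapes may differ from the paper's implementation and all sizes here are illustrative:

import numpy as np

rng = np.random.default_rng(5)
K, F, C, D_in = 8, 4, 37, 32                   # neurons, factors, bins, input dim (illustrative)
W = rng.normal(scale=0.1, size=(K, F, D_in))   # assumed shape: input -> factors, per neuron
V = rng.normal(scale=0.1, size=(K, C, F))      # assumed shape: factors -> bin logits, per neuron

x = rng.normal(size=D_in)
factors = np.einsum('kfd,d->kf', W, x)         # (K, F)
logits = np.einsum('kcf,kf->kc', V, factors)   # (K, C), fed to the softmax of Eq. 4

# Rough parameter counts: factorized vs. a full (K, C, D_in) weight tensor.
print(K * F * (D_in + C), 'vs', K * C * D_in)  # 2208 vs 9472 in this toy setting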
4 Related Work
Outside the realm of DLVMs, other methods have been explored for predictive uncertainty. Lakshminarayanan et al. (2017) propose deep ensembles: straightforward averaging of predictions from a small set of separately, adversarially trained DNNs. Although highly scalable, this method requires retraining a model up to 10 times, which can be prohibitively expensive for large datasets.
Gal & Ghahramani (2015b) propose the use of dropout (Srivastava et al. (2014)) at test time and present a Bayesian neural network interpretation of this method. A follow-up work by Gal et al. (2017) explores the use of Gumbel-Softmax to smoothly deform the dropout noise, allowing optimization of the dropout rate during training. A downside of MC-dropout is the limited flexibility of the fixed bimodal delta-peak distribution imposed on the weights, which requires a large number of samples for good estimates of uncertainty. van den Oord et al. (2017) propose the use of vector quantization in variational inference, quantizing a multi-dimensional embedding rather than individual latent variables, and explore this in the context of autoencoders.
In the space of learned nonlinearities, Su et al. (2017) explore a flexible nonlinearity that can assume the form of most canonical activations. More flexible distributions have been explored for distributional reinforcement learning by Dabney et al. (2017) using quantile regression, which can be seen as a special case of SQUAD where the bin values are learned but have fixed uniform probability. Categorical distributions on scalar variables have been used to model more flexible Bayesian neural network posteriors by Shayer et al. (2017). The use of a mixture-of-Diracs distribution to approximate a variety of distributions was proposed by Schrempf et al. (2006).
5 Results
Quantifying the quality of the uncertainty estimates of models remains an open problem. Various metrics have been explored in previous works, such as relative entropy (Louizos & Welling (2017); Gal & Ghahramani (2015a)), probability calibration, and proper scoring rules (Lakshminarayanan et al. (2017)). Although interesting in their own right, these metrics do not directly measure a good ranking of predictions, nor do they indicate applicability in high-risk domains. Proper scoring rules are the exception, but a model with good ranking ability does not necessarily exhibit good performance on proper scoring rules: any score that provides a relative ordering suffices and does not have to reflect true calibrated probabilities. In fact, well-ranked confidence scores can be recalibrated (Niculescu-Mizil & Caruana (2005)) after training to improve performance on proper scoring rules and calibration metrics.
In order to evaluate the applicability of the model in high-risk fields such as medicine, we want to quantify how models perform under a desired risk requirement. We propose to use the selection with guaranteed risk (SGR) method introduced by Geifman & El-Yaniv (2017) to measure this. (We deviate slightly from Geifman & El-Yaniv (2017) in that we use Softmax Response (SR), the probability taken from the softmax output for the most likely class, as the confidence score for all methods; Geifman & El-Yaniv (2017) instead proposed to use the variance of the probabilities for MC-dropout, but our experiments showed that SR paints MC-dropout in a more favorable light.) In summary, the SGR method defines a selective classifier using a trained neural network and can guarantee a user-selected desired risk with high probability (e.g. 99%), by selectively rejecting data points with a predicted confidence score below an optimal threshold.
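A simplified empirical sketch of selecting by confidence at a target risk level; note that the actual SGR method additionally provides a high-probability guarantee on the risk via a binomial tail bound, which is omitted here:

import numpy as np

def coverage_at_risk(confidence, correct, target_risk):
    # Largest coverage whose empirical selective risk stays below target_risk,
    # obtained by accepting the most confident predictions first.
    order = np.argsort(-confidence)
    errors = ~correct[order]
    risk = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    ok = np.where(risk <= target_risk)[0]
    return 0.0 if len(ok) == 0 else (ok[-1] + 1) / len(errors)

# Illustrative usage with made-up predictions:
rng = np.random.default_rng(6)
confidence = rng.uniform(size=1000)               # e.g. Softmax Response per test point
correct = rng.uniform(size=1000) < 0.8 + 0.15 * confidence
print(coverage_at_risk(confidence, correct, target_risk=0.02))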
To limit the influence of hyperparameters on the comparison, we use the automated optimization method TPE (Bergstra et al. (2011)) for both the baselines and our models. The hyperparameters are optimized for coverage at 2% risk on Fashion-MNIST, and subsequently evaluated on notMNIST. Larger models are evaluated on SVHN.
We compare our model against plain MLPs, MC-Dropout using Maxout activations (Goodfellow et al. (2013); Chang & Chen (2015)), which we found to perform stronger than conventional ReLU MC-dropout models under an equal number of latent variables, and an information bottleneck model using mean-field Gaussian distributions. All models are optimized with ADAM (Kingma & Ba (2014)), use the weight initialization proposed by He et al. (2015), weight decay, and an adaptive learning rate decay scheme (a 10x reduction after 10 epochs of no validation accuracy improvement), with early stopping after 20 epochs of no improvement. We evaluate the complementary deep ensembles technique (Lakshminarayanan et al. (2017)) for all methods.
5.1 Main results
We start our analysis by comparing the predictive uncertainty of 2-layer models with 32 latent variables per layer. In figure 6 we visualize the risk/coverage trade-off achieved using the predicted uncertainty as a selective threshold, and present coverage results in table 1. Overall, we find that SQUAD performs significantly better than plain MLPs and deep Gaussian IB models, and we tentatively attribute this to the increased flexibility of the multinomial distribution. Compared to a Maxout MC-dropout model with a similar number of weights, SQUAD appears to have a slight, though not significant, advantage, despite the strong quantization scheme, especially at low risk. Deep ensembles improve results for all methods, which fits the hypothesis that ensembles integrate over a form of weight uncertainty. When evaluated on a new dataset without retuning hyperparameters, SQUAD shows strong performance, as shown in table 3.
Fashion MNIST  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 
Plain MLP  29.1 ()  45.9 ()  60.4 ()  0.408 ()  87.7 () 
Maxout MCDropout  41.9 ()  56.5 ()  69.9 ()  0.299 ()  89.5 () 
DLGM  0.0 ()  33.5 ()  47.0 ()  0.446 ()  84.3 () 
SQUAD  42.9 ()  58.3 ()  69.5 ()  0.293 ()  89.5 () 
Deep Ensemble  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 
Plain MLP Ensemble  40.6  58.3  70.2  0.296  89.3 
Max. MCD. Ensemble  48.2  59.1  72.2  0.271  90.2 
DLGM Ensemble  0.0  34.3  47.8  0.435  84.7 
SQUAD Ensemble  47.5  61.6  73.1  0.273  90.1 
notMNIST  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 

Plain MLP  77.4 ()  85.5 ()  90.3 ()  0.228 ()  93.3 () 
Maxout MCDropout  85.7 ()  90.6 ()  94.2 ()  0.165 ()  95.3 () 
SQUAD  87.1 ()  91.1 ()  94.5 ()  0.161 ()  95.4 () 
Plain MLP Ensemble  85.9  90.6  93.5  0.175  94.9 
Max. MCD. Ensemble  88.5  92.8  95.7  0.148  96.0 
SQUAD Ensemble  90.7  93.5  96.1  0.137  96.2 
MLP K=256 (SVHN)  cov@risk .5%  cov@risk 1%  cov@risk 2%  NLL  Acc. 

Plain MLP  0.0 ()  0.0 ()  36.3 ()  0.758 ()  83.1 () 
Maxout MCDropout  0.0 ()  50.7 ()  65.0 ()  0.480 ()  86.4 () 
SQUADfactorized  18.4 ()  53.9 ()  66.7 ()  0.454 ()  86.7 () 
SQUAD  1.7 ()  42.8 ()  59.3 ()  0.534 ()  84.6 () 
Max. MCDropout T=4  0.0 ()  38.5 ()  57.6 ()  0.562 ()  84.9 () 
SQUADfactorized T=4  10.7 ()  49.9 ()  64.5 ()  0.480 ()  86.2 () 
SQUAD T=4  0.0 ()  38.0 ()  55.6 ()  0.569 ()  83.7 () 
5.2 Natural Images
To explore larger models trained on natural image datasets, we lightly tune hyperparameters of 256-latent, 2-layer models over 100 TPE evaluations. As SVHN contains natural images in color, we anticipate a need for a higher amount of information per variable. We thus explore the effect of the matrix-factorized variant.
As shown in table 3, SQUAD-factorized outperforms the non-factorized variant. Considering the computational cost: at the optimum of a 4-neuron factorization with 37 quantization bins, the model clocks in at 3.4 million weights. In comparison, the optimum found for the presented MC-dropout results uses 9.0 million weights. On an NVIDIA Titan XP, the dropout baseline takes 13s per epoch on average, while SQUAD-factorized takes just 9s.
To evaluate the sample efficiency of the methods, we compare results at T=4 samples. We find that SQUAD's results suffer less from undersampling than those of MC-dropout. We tentatively attribute this sample efficiency to the flexible approximating posterior on the activations, which is in stark contrast to the rigid approximating distribution that MC-dropout imposes on the weights. In conclusion, SQUAD comes out favorably in a resource-constrained environment.
5.3 Analysis of latent variable distributions
To evaluate whether the proposed variational distribution simply collapses into single-mode predictions, we examine what type of distributions the model predicts over the latent variables. We visualize the forms of the predicted distributions in figure 7a. Although this showcases only a small subset of the multimodal behavior that emerges, it demonstrates that the model indeed utilizes the distribution to its full potential. To provide an intuition on how these predicted distributions emerge, we present figure 9 in the appendix.
In figure 8 we visualize one of the activation functions that the method learns for a 1-dimensional-input SQUAD-factorized model. The learned activation functions resemble “peaked” sigmoid activations, which can be interpreted as a combination of an RBF kernel and a sigmoid. This provides food for thought on how nonlinearities for conventional neural networks can be designed; the effect of using such a nonlinearity can be studied in further work.
6 Discussion
In this work, we have proposed a new flexible class of variational distributions. To measure its effectiveness for real-world classification, we applied the class to a deep variational information bottleneck model. By placing a quantization-based distribution on the activations, we can compute uncertainty estimates over the outputs. We proposed an evaluation scheme motivated by the need in real-world domains to guarantee a minimal risk. The results presented indicate that SQUAD provides an improvement over plain neural networks and Gaussian information bottleneck models. In comparison to an MC-Dropout model, which approximates a Bayesian neural network, we obtain competitive performance. Moreover, qualitatively we find that the flexible distribution is used to its full advantage and that the method is sample efficient. The method learns interesting nonlinearities, is tractable and scalable, and as the output domain is constrained, no batch normalization techniques are required.
Various directions for future work arise. The improvement of ensemble methods over individual models indicates that there remains room for improvement in capturing the full uncertainty over the output, and thus a fully Bayesian approach to SQUAD, which would include weight uncertainty, shows promise. The flexible class allows us to define a wide variety of interesting priors, providing the opportunity to study priors that are hard to define as a continuous density. Likewise, more effective initialization of the parameters of the proposed method requires further attention. Orthogonally, the proposed class can be applied to other variational objectives as well, such as the variational autoencoder. Finally, the discretized nature of the variables allows for the analytical computation of other divergences such as the mutual information and the Jensen-Shannon divergence, the effectiveness of which remains to be studied.
Acknowledgements
We thank Bart Bakker, Maximilian Ilse, Dimitrios Mavroeidis, Jakub Tomczak, Daniel Worrall and the anonymous reviewers for their insightful comments and discussions. This research was supported by Philips Research, the SURFsara Lisa cluster and the NVIDIA GPU Grant. We thank the contributors to TensorFlow (Abadi et al. (2016)), Keras (Chollet et al. (2015)) and Sacred (Greff et al. (2016)).
References
 Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: LargeScale machine learning on heterogeneous distributed systems. March 2016.
 Achille & Soatto (2016) Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. November 2016.
 Alemi et al. (2016) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. December 2016.
Alemi et al. (2017) Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. An Information-Theoretic analysis of deep Latent-Variable models. November 2017.
Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter optimization. In J Shawe-Taylor, R S Zemel, P L Bartlett, F Pereira, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 24, pp. 2546–2554. Curran Associates, Inc., 2011.
Chang & Chen (2015) Jia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. November 2015.
 Chollet et al. (2015) François Chollet et al. Keras. https://keras.io, 2015.
 Dabney et al. (2017) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. October 2017.
 Gal & Ghahramani (2015a) Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. June 2015a.
 Gal & Ghahramani (2015b) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. June 2015b.
 Gal et al. (2017) Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. May 2017.
Geifman & El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4885–4894. Curran Associates, Inc., 2017.
Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. February 2013.
Greff et al. (2016) Klaus Greff et al. Sacred. https://github.com/IDSIA/sacred, 2016.
Gumbel (1954) E J Gumbel. Statistical Theory of Extreme Values and Some Practical Applications. National Bureau of Standards, Washington, 1954.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. cvfoundation.org, 2015.
Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. November 2016.
 Holi & Hwang (1993) J L Holi and J N Hwang. Finite precision error analysis of neural network hardware implementations. IEEE Trans. Comput., 42(3):281–290, March 1993.
Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. September 2016.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. jmlr.org, June 2015.
Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. November 2016.
 Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November 1999.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. December 2014.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. AutoEncoding variational bayes. December 2013.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6405–6416. Curran Associates, Inc., 2017.
 Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. December 2013.
 Louizos & Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. March 2017.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. November 2016.
Niculescu-Mizil & Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 625–632, New York, NY, USA, 2005. ACM.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. January 2014.
Schrempf et al. (2006) O C Schrempf, D Brunn, and U D Hanebeck. Dirac mixture density approximation based on minimization of the weighted Cramér-von Mises distance. In 2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 512–517, 2006.
 Shayer et al. (2017) Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. October 2017.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.
 Su et al. (2017) Qinliang Su, Xuejun Liao, and Lawrence Carin. A probabilistic framework for nonlinearities in stochastic neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4486–4495. Curran Associates, Inc., 2017.
 Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. April 2000.
 van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. November 2017.
Yang et al. (2017) Dong Yang, Daguang Xu, S Kevin Zhou, Bogdan Georgescu, Mingqing Chen, Sasa Grbic, Dimitris Metaxas, and Dorin Comaniciu. Automatic liver segmentation using an adversarial Image-to-Image network. July 2017.
Appendix A Appendix
A.1 Effect of hyperparameters on coverage
The optimal configuration of hyperparameters and bin priors has been determined using 700 evaluations selected with TPE. The space of parameters explored is as follows, presented in the hyperopt API for transparency:
# Shared
C: quniform(2, 10, 1) * 2 + 1,
dropout_rate: uniform(0.01, 0.95),
lr: loguniform(log(0.0001), log(0.01)),
batch_size: qloguniform(log(32), log(512), 1)
# SQUAD & Gaussian
kl_multiplier: loguniform(log(1e-6), log(0.01)),
init_scale: loguniform(log(1e-3), log(20)),
# SQUAD
use_bin_probs: choice(['uni', 'gaus']),
use_bins: choice(['equal_prob_gaus', 'linearly_spaced']),
learn_bin_values: choice(['per_neuron', 'per_layer', 'fixed']),
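For reference, a minimal sketch of how such a space can be searched with TPE via the hyperopt library; the objective below is a dummy placeholder (in the paper, the score being optimized is coverage at 2% risk) and only part of the space is reproduced:

from math import log
from hyperopt import fmin, tpe, hp, STATUS_OK

space = {
    'C': hp.quniform('C', 2, 10, 1),                        # later mapped to 2*C + 1 bins
    'dropout_rate': hp.uniform('dropout_rate', 0.01, 0.95),
    'lr': hp.loguniform('lr', log(0.0001), log(0.01)),
    'batch_size': hp.qloguniform('batch_size', log(32), log(512), 1),
    'kl_multiplier': hp.loguniform('kl_multiplier', log(1e-6), log(0.01)),
    'learn_bin_values': hp.choice('learn_bin_values',
                                  ['per_neuron', 'per_layer', 'fixed']),
}

def objective(params):
    # Placeholder: train a model with `params` and return 1 - coverage@2%-risk.
    # Here a dummy quadratic keeps the sketch runnable end to end.
    return {'loss': (params['lr'] - 0.001) ** 2, 'status': STATUS_OK}

best = fmin(objective, space, algo=tpe.suggest, max_evals=700)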
In figure 10 we visualize the pairwise effect of these hyperparameters on the coverage. The optimal configuration found for the main SQUAD model is: batch size: 244, KL multiplier: 0.0027, learn bin values: per layer, bin probabilities: uniform, bin values: linearly spaced over (-3.5, 3.5), lr: 0.0008, C: 15, initialization scale: 3.214.