Predictive Uncertainty in Large Scale Classification using Dropout - Stochastic Gradient Hamiltonian Monte Carlo
Abstract
Predictive uncertainty is crucial for many computer vision tasks, from image classification to autonomous driving systems. Hamiltonian Monte Carlo (HMC) is an inference method for sampling from complex posterior distributions. Dropout regularization, on the other hand, has been proposed as an approximate model averaging technique that tends to improve generalization in large scale models such as deep neural networks. Although HMC provides convergence guarantees for most standard Bayesian models, it does not handle the discrete parameters arising from Dropout regularization. In this paper, we present a robust methodology for predictive uncertainty in large scale classification problems, based on Dropout and Stochastic Gradient Hamiltonian Monte Carlo. Even though Dropout induces a nonsmooth energy function with no such convergence guarantees, the resulting discretization of the Hamiltonian proves empirically successful. The proposed method allows predictive uncertainty to be estimated effectively and provides better generalization for difficult test examples.
I Introduction
Artificial Intelligence systems have a wide variety of applications today and, in some cases, are part of complex systems whose operation involves making delicate and dangerous decisions. Uncertainty in knowledge representation and reasoning has been studied since the fall of symbolic expert systems [1]. Therefore, several research efforts have focused on efficient methods for estimating model uncertainty and capturing the variability inherent to real world situations [2].
Surprisingly, most modern artificial intelligence based computer vision techniques are generally unable to estimate their uncertainty. Predictive uncertainty is crucial for many tasks, such as image classification, detecting noisy (adversarial) examples, and analyzing failure cases in decision-making systems [3]. These problems depend on uncertainty estimates for achieving good performance, requiring well calibrated probabilities for their image prediction and classification tasks. In such cases, Bayesian inference provides posterior predictive distributions that can be used to reduce the overconfidence of the model outputs [4].
Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo (MCMC) method for obtaining a sequence of random samples while maintaining asymptotic consistency with respect to a target distribution [5, 6]. HMC provides a mechanism for defining proposals with high acceptance rates, enabling more efficient exploration of the state space than standard random-walk proposals. Another property of HMC is its ability to support high dimensional models by exploiting the conservation of energy principle of dynamical systems [7]. This is largely due to its capability to adapt to the problem geometry using the gradient of the energy function. Moreover, the method can also be extended to support large-scale or streaming data sets [8].
MCMC techniques that account for model uncertainty may use sparsity-promoting priors such as the spike and slab prior [9], the Bayesian Lasso [10] and the Horseshoe prior [11]. However, because of the combinatorial nature of the proposal distributions, MCMC techniques scale poorly to high dimensional models and variational inference techniques must be introduced [12]. Dropout has previously been proposed as a regularization technique for deep neural networks [13]. Its relation to model uncertainty arises from the approximate variational distribution of model averaging [14]. On the other hand, variational estimates can be inaccurate or too concentrated in the posterior modes. Therefore, in this work, we propose a methodology called Dropout - Stochastic Gradient Hamiltonian Monte Carlo (DSGHMC) to estimate predictive uncertainty in large scale classification problems. The proposed approach is tested on the MNIST digit recognition [15] and the Adience age recognition [16] data sets. Predictive uncertainty estimates are compared to the estimates obtained by Stochastic Gradient HMC (SGHMC) [8] and Stochastic Gradient Langevin Dynamics (SGLD) [17].
II Related Work
Many variants of HMC are found in the literature today [18]. HMC uses discrete approximations of continuous dynamical systems. One such approach is the leapfrog integrator, which provides a numerical scheme to simulate Hamiltonian dynamics. Nevertheless, the leapfrog algorithm is highly sensitive to two critical user-specified hyperparameters: the number of steps and the constant step size. Hoffman et al. present the No-U-Turn Sampler (NUTS) [19], an extension of HMC that automatically determines the number of leapfrog steps the algorithm needs in order to converge to the target distribution. This method uses a recursive algorithm to build a set of likely candidate points, stopping automatically when the trajectory starts to double back and retrace its steps. On the other hand, simple heuristics can also be used to set the number of leapfrog steps and the step size, based on the assumption that the posterior distribution is Gaussian with diagonal covariance [20].
Other highly successful implementations focus on attacking large data sets using ideas from stochastic optimization such as Stochastic Gradient Descent (SGD). The SGHMC and SGLD techniques support data batches and introduce novel integrators based on Langevin dynamics that take into account the extra noise induced in the gradient [8, 17]. Another technique, the Metropolis Adjusted Langevin Algorithm (MALA) [21, 22], uses a Metropolis-Hastings correction scheme to accept or reject proposals. Also, the Riemann Manifold Hamiltonian Monte Carlo (RMHMC) method provides better adaptation to the problem geometry for strongly correlated densities [23, 24]. Finally, proximal algorithms and convex optimization techniques have also been studied in the context of MCMC [25], and of HMC with log-concave or non-differentiable (nonsmooth) energy functions [26].
The computational cost has also been discussed, especially with the rise of large volumes of data and the emergence of Deep Neural Networks (DNNs). Split HMC [27] is a technique that accelerates the computation by assuming that the Hamiltonian function can be divided. On the other hand, solutions based on massively parallel computing and Graphics Processing Units (GPUs) have also been proposed [28].
In the context of computer vision, Stochastic Gradient MCMC (SG-MCMC) has been proposed to model weight uncertainty for shape classification [29]. Li et al. propose a scalable learning approach for DNNs, reinforcing the connection between stochastic optimization and MCMC techniques. Moreover, it is also shown that SGD with Gaussian DropConnect, an alternative to Dropout, shares the same form as SGLD. Conversely, integrating Dropout with SG-MCMC can be seen as a model average of mixtures of neural networks.
III Methods
In this section, the concepts required to understand the proposed method are detailed.
III-A Hamiltonian Monte Carlo
Given a random variable θ and a data set D, we want to sample from the posterior distribution, such that:

p(θ | D) ∝ p(D | θ) p(θ) (1)
HMC [5] performs a physical simulation of a conservative system, where a Hamiltonian function H(θ, r) is composed of a potential energy U(θ) and a kinetic energy K(r). These terms are constructed as follows:

H(θ, r) = U(θ) + K(r),  with U(θ) = −log p(D | θ) − log p(θ) (2)
where r is called the momentum and is considered an auxiliary variable. A positive-definite mass matrix M is also introduced as follows:

K(r) = ½ rᵀ M⁻¹ r (3)
Now, in order to sample from p(θ | D), the method simulates Hamiltonian dynamics while leaving the joint distribution invariant, such that:

π(θ, r) ∝ exp( −U(θ) − ½ rᵀ M⁻¹ r ) (4)
A first step proposes a new value for the momentum variable r from a Gaussian distribution. Then, a Metropolis update using Hamiltonian dynamics is used to propose a new state. The state evolves in a fictitious continuous time t, with the partial derivatives of the Hamiltonian given in Equation 5.

dθ/dt = ∂H/∂r = M⁻¹ r,  dr/dt = −∂H/∂θ = −∇U(θ) (5)
In order to simulate the continuous dynamics, the leapfrog integrator can be used to discretize time. Using a small step size ε, Equation 5 becomes:

r(t + ε/2) = r(t) − (ε/2) ∇U(θ(t))
θ(t + ε) = θ(t) + ε M⁻¹ r(t + ε/2)
r(t + ε) = r(t + ε/2) − (ε/2) ∇U(θ(t + ε)) (6)
After L iterations of the leapfrog integrator with finite ε, the joint proposal becomes (θ*, r*). Subsequently, a Metropolis update is used to accept the proposal with probability α, such that:

α = min{ 1, exp( H(θ, r) − H(θ*, r*) ) } (7)
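As an illustration of Equations 5-7, a minimal leapfrog integrator and Metropolis-corrected HMC step can be sketched in NumPy, assuming an identity mass matrix M = I; the function names and hyperparameters below are illustrative, not the paper's implementation:

```python
import numpy as np

def leapfrog(theta, r, grad_U, eps, L):
    """Simulate Hamiltonian dynamics for L leapfrog steps of size eps (M = I)."""
    theta, r = theta.copy(), r.copy()
    r = r - 0.5 * eps * grad_U(theta)       # initial half step for the momentum
    for _ in range(L - 1):
        theta = theta + eps * r             # full step for the position
        r = r - eps * grad_U(theta)         # full step for the momentum
    theta = theta + eps * r
    r = r - 0.5 * eps * grad_U(theta)       # final half step for the momentum
    return theta, r

def hmc_step(theta, U, grad_U, eps, L, rng):
    """One HMC transition: resample momentum, integrate, Metropolis-accept."""
    r0 = rng.standard_normal(theta.shape)
    theta_star, r_star = leapfrog(theta, r0, grad_U, eps, L)
    h0 = U(theta) + 0.5 * r0 @ r0                 # current Hamiltonian
    h1 = U(theta_star) + 0.5 * r_star @ r_star    # proposed Hamiltonian
    # accept with probability min(1, exp(h0 - h1)), as in Equation 7
    if rng.random() < np.exp(min(0.0, h0 - h1)):
        return theta_star
    return theta
```

For a standard Gaussian target, U(θ) = ½ θᵀθ and ∇U(θ) = θ, the chain's sample mean and variance approach 0 and 1.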
In order to make a prediction at a new input x*, we need to calculate the predictive distribution:

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ (8)
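In practice, the integral in Equation 8 is approximated by averaging the class probabilities produced by the posterior samples. A minimal sketch, assuming a hypothetical softmax classifier whose posterior samples are weight matrices:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predictive_distribution(x_new, weight_samples):
    """Monte Carlo estimate of p(y* | x*, D): average the per-sample class
    probabilities p(y* | x*, theta_s) over posterior samples theta_s,
    rather than averaging logits or point predictions."""
    probs = np.stack([softmax(W @ x_new) for W in weight_samples])
    return probs.mean(axis=0)
```

Averaging probabilities (not predictions) is what yields calibrated, less overconfident outputs.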
III-A1 Properties of Hamiltonian dynamics
An integrator is an algorithm that numerically approximates the evolution of the exact solution of a differential equation. Several properties of Hamiltonian dynamics are important for building MCMC updates [6]:

Reversibility: The dynamics are timereversible.

Volume Preservation: Hamiltonian dynamics are volume preserving.

Conservation of the Hamiltonian: H(θ, r) remains constant as θ and r vary.
The leapfrog method in HMC satisfies the criteria of volume conservation and reversibility over time. However, the total energy is only conserved approximately, so a bias is introduced in the joint density π(θ, r). The Metropolis update is therefore used to satisfy the detailed balance condition.
III-A2 Limitations of HMC
One of the main limitations of HMC is the lack of support for discrete parameters. The difficulty in extending HMC to a discrete parameter space stems from the fact that the construction of proposals relies on the numerical solution of a differential equation. On the other hand, approximating the likelihood of a discrete parameter by a continuous density is not always possible [30]. Moreover, when any discontinuity is introduced into the energy function U(θ), the first-order discretization does not ensure that a Metropolis correction keeps the stationary distribution invariant. Therefore, the standard implementation of HMC does not guarantee a reasonable acceptance rate when the parameter of interest has a discontinuous density, because the integrators are designed for differential equations with smooth derivatives over time.
III-B Stochastic Gradient Hamiltonian Monte Carlo
The standard HMC implementation needs to load all data in memory, which is not always possible when the volume of data increases considerably. Therefore, it is necessary to have an incremental integrator that satisfies the properties of Hamiltonian dynamics.
The SGHMC algorithm enables large-scale and online sampling, allowing rapid exploration of the posterior distribution. A naive way to achieve this is to apply an SGD-style modification to the HMC integrator and assess the impact of the noisy gradient (Eq. 9). The noise injected into the system by the stochastic gradient no longer leads to Hamiltonian dynamics that leave the target distribution invariant. Although it is possible to correct this problem using a Metropolis step, this procedure requires costly computation on the entire data set and might lead to low acceptance rates.

∇Ũ(θ) = −(|D| / |D̃|) Σ_{x ∈ D̃} ∇ log p(x | θ) − ∇ log p(θ),  D̃ ⊂ D (9)
Chen et al. [8] propose an additional friction term in the momentum update and show that using second-order Langevin dynamics counteracts the effects of the injected Gaussian noise (see Alg. 1). On the other hand, the Metropolis step is no longer needed due to the stability of the second-order discretization [22]. This approach automatically corrects the error of the first-order approximation, making an explicit correction unnecessary and ensuring convergence to the stationary distribution [23].
A friction term C M⁻¹ r (where M is the mass matrix and C the friction constant) helps to decrease the energy H(θ, r), thus reducing the influence of the noise. The noise covariance matrix B̂ depends on the current state and the sample size. This procedure can be seen as second-order Langevin dynamics:

dθ = M⁻¹ r dt
dr = −∇U(θ) dt − C M⁻¹ r dt + N(0, 2 (C − B̂) dt) (10)

which delivers a new integrator and update step. Algorithm 1 describes the resulting method.
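The update loop of Algorithm 1 can be sketched in NumPy as follows, assuming an identity mass matrix, a scalar friction C and B̂ = 0; the Gaussian target and hyperparameters in the usage note are illustrative:

```python
import numpy as np

def sghmc(grad_U_noisy, theta0, eps, C, Bhat, n_steps, rng):
    """SGHMC sketch (Alg. 1): position/momentum updates with friction C and
    injected noise N(0, 2 (C - Bhat) eps), assuming M = I."""
    theta = np.array(theta0, dtype=float)
    r = np.zeros_like(theta)
    noise_sd = np.sqrt(2.0 * (C - Bhat) * eps)
    samples = []
    for _ in range(n_steps):
        theta = theta + eps * r                       # position update
        r = (r - eps * grad_U_noisy(theta)            # noisy gradient step
             - eps * C * r                            # friction term
             + noise_sd * rng.standard_normal(r.shape))  # injected noise
        samples.append(theta.copy())
    return np.array(samples)
```

With a noisy gradient of a standard Gaussian target, ∇Ũ(θ) = θ plus mini-batch-style noise, the friction keeps the sample variance close to 1 without any Metropolis correction.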
III-C Dropout / DropConnect
Dropout and DropConnect arise in the context of regularization of DNNs and provide a way to combine exponentially many different architectures [13, 31]. In the field of DNNs, Dropout/DropConnect can be described as follows:

Regular (binary) DropConnect adds noise to the global network weights, by setting a randomly selected subset of weights to zero within each layer:

a = σ( (m ⊙ W) v ),  with m_ij ~ Bernoulli(1 − p) (11)
Binary Dropout instead randomly selects local hidden units:

a = σ( W (m ⊙ v) ),  with m_j ~ Bernoulli(1 − p) (12)

where v and a are the input and output of the layer, W its weights, σ the activation function, and p the probability that the weight/unit is dropped.
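The two masking schemes in Equations 11 and 12 can be sketched as follows (σ omitted for brevity; p is the drop probability, as in the text, and the inverted 1/(1 − p) rescaling is an assumption consistent with Section IV):

```python
import numpy as np

def dropconnect(W, v, p, rng):
    """Binary DropConnect (Eq. 11): zero a random subset of the *weights*."""
    mask = rng.random(W.shape) >= p          # keep each weight with prob 1 - p
    return (mask * W) @ v

def dropout(W, v, p, rng):
    """Binary Dropout (Eq. 12): zero a random subset of the *units* and
    scale the survivors by 1 / (1 - p) (inverted dropout)."""
    mask = (rng.random(v.shape) >= p) / (1.0 - p)
    return W @ (mask * v)
```

With p = 0 both reduce to the plain product W v; the rescaling makes the masked activations unbiased in expectation.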
The Dropout learning interpretation has been the subject of research in recent years, due to its great effectiveness in various types of neural networks. It has been shown that Dropout is similar to bagging and related ensemble methods [32]. Since all models are averaged in an efficient and fast approximation with weight scaling, Dropout can be seen as an approximation to the geometric mean over the space of all possible models; this approximation is less expensive than the arithmetic mean used in bagging methods [33].
IV Dropout - Stochastic Gradient Hamiltonian Monte Carlo
Gradient descent optimization has been used extensively for training neural network architectures. In the context of Bayesian models, SGLD combines noisy gradient updates with Langevin dynamics in order to generate proposals from data subsets. Chen et al. propose SGD with momentum as an alternative discretization [8]. Conversely, SGHMC uses a decaying learning rate that, in the limit, reduces the discretization error to zero. The learning rate schedule ensures efficient sampling and high acceptance rates. Equation 13 shows the SGHMC update.

v_i = (1 − α) v_{i−1} − η ∇Ũ(θ_{i−1}) + N(0, 2 (α − β̂) η)
θ_i = θ_{i−1} + v_i (13)

where v denotes the alternative to the momentum variable in SGHMC, α = ε M⁻¹ C is composed of a friction constant that can be chosen as a small finite number, η = ε² M⁻¹ is a step size representing the initial value of the learning rate, and N(0, 2 (α − β̂) η) is the noise model. The momentum variable of SGD can be expressed as v = ε M⁻¹ r.
Incorporating Dropout can be seen as a regularization term that is directly related to the input data. On each iteration, a multivariate Bernoulli mask with drop probability p is applied: each one of the input components is randomly deleted and the remaining elements are scaled up by 1/(1 − p).
The noisy gradient updates generate proposals from a perturbed target distribution [34]. The proposed method is described in Algorithm 2.
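A minimal sketch of one DSGHMC iteration as described above: a dropout mask on the input features followed by an SGHMC update. The gradient callback and all hyperparameters are illustrative assumptions, not the paper's exact Algorithm 2:

```python
import numpy as np

def dsghmc_step(theta, r, noisy_grad, x_batch, p, eps, C, rng):
    """One DSGHMC iteration (sketch):
    1) draw a Bernoulli mask over input features (drop prob p, inverted scaling),
    2) evaluate the stochastic gradient on the masked mini-batch,
    3) apply the SGHMC update with friction C (M = I, Bhat = 0)."""
    mask = (rng.random(x_batch.shape[1]) >= p) / (1.0 - p)
    g = noisy_grad(theta, x_batch * mask)     # gradient on dropout-masked inputs
    theta = theta + eps * r
    r = (r - eps * g - eps * C * r
         + np.sqrt(2.0 * C * eps) * rng.standard_normal(r.shape))
    return theta, r
```

When the gradient callback ignores the masked batch and targets a standard Gaussian, the step reduces to plain SGHMC, which is a useful sanity check.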
For SGHMC, introducing a second-order term reduces the discrepancy between the extra noise and the stationary distribution, so it is no longer necessary to apply a Metropolis correction.
When Dropout is introduced, the energy function is no longer differentiable. In this context, the energy function of HMC requires a specially tailored discretization. Nishimura [30] finds a solution that preserves the critical properties of the Hamiltonian dynamics through soft approximations, where the dynamics can be analytically integrated near the discontinuity in a way that preserves the total energy. Related methods [35] have shown that it is possible to build a discretization in which, although discontinuous parameters admit no notion of gradient in the traditional sense, the integrator maintains the reversibility of the Markov chain and conserves the energy; however, volume preservation is no longer guaranteed [36].
V Experiments
We implemented SGHMC, SGLD and DSGHMC in Edward [39] (http://edwardlib.org), a Python library for probabilistic modeling, inference, and criticism built on TensorFlow (https://www.tensorflow.org). We perform inference on two highly cited computer vision problems.
In order to make an objective comparison between the studied methods and the proposed one, the optimal hyperparameter settings have been established. These settings are shown in Table I.
Method  SGHMC  DSGHMC  SGLD

Step Size (ε)
Friction Term (C)  –
Mini-Batch Size
Epochs
Warmup
Iterations  Epochs × Num. Batches − Warmup
Prediction Samples
Dropout probability (p)  –  to  –
Training data is split into mini-batches and whitening is performed on each one of these batches. On the other hand, the predictive distribution is estimated using Monte Carlo samples over the test data set.
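The per-mini-batch whitening mentioned above can be sketched as feature-wise standardization (an assumption; the paper does not specify whether full ZCA whitening is used instead):

```python
import numpy as np

def whiten_batch(X, eps=1e-8):
    """Standardize a mini-batch: zero mean and unit variance per feature.
    eps guards against division by zero for constant features."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return (X - mu) / (sd + eps)
```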
V-A Digit Recognition
In the first experiment, we decided to work with the MNIST [15] data set. This classic set of handwritten digits has served as the basis for the comparative evaluation of various classification algorithms. The database contains 60,000 training images and 10,000 test images, normalized to 28×28 pixels (784 features) and stored in gray scale. Figure 1 shows some examples from the database.
We run independent chains for each method on the MNIST data set. The Dropout probability (p) is varied over several values, and the predictive results are compared with SGHMC and SGLD using the hyperparameter settings established in Table I. The comparison results can be observed in Table II.
Method  Accuracy (%)

SGHMC  
SGLD  
DSGHMC ()  
DSGHMC ()  
DSGHMC () 
Compared with the other methods, the model inference results show that DSGHMC obtains a lower error for an appropriate Dropout probability, achieving state-of-the-art results among linear classification methods [15].
The role of the Dropout probability (p) in DSGHMC is also studied. Figure 2 shows accuracy results over the executed independent chains. For the studied data set, an intermediate range of Dropout probabilities generates lower error rates. On the other hand, it can also be seen that probabilities below this range rapidly reduce the variables' influence on the classification results.
Predictive accuracy is now evaluated by comparing the different methods, where the expected predictive accuracy is computed using a Monte Carlo approximation. DSGHMC (Fig. 3.a and 3.b) achieves higher predictive accuracy when compared to SGHMC (Fig. 3.c) and SGLD (Fig. 3.d).
V-A1 Confusing classes
One of the biggest challenges of the MNIST data set is to separate highly confusing digit pairs. This problem has been addressed earlier in feature selection challenges [40].
If we analyze the behavior between total uncertainty and predictive accuracy for one of these digits in each method, we can observe that there is less overconfidence and, at the same time, the classification results improve. Figure 4 shows the median of the expected probabilities for digit 9 over the whole test set.
As the Dropout probability decreases, the trend of the probabilities towards similar classes becomes more visible. However, a very low Dropout probability does not guarantee better classification results or better uncertainty estimates, because the resulting models retain too few of the variables needed to explain the data, as shown in Figure 4.f.
V-A2 Confusing Example (digit 9)
V-A3 Confusing Example (digit 8)
V-B Age Recognition
As part of the challenges of facial recognition systems in computer vision, the methods are tested on the age recognition problem of the Adience [16] face data set (Fig. 7). The database is passed through a cleaning process, eliminating null and undefined entries, and each sample is resized. The complete data set is partitioned into a training set and a test set. The sample distribution can be seen in Table III.
Age  ID Class  Train  Test  Total

[0 - 2]  0
[4 - 6]  1
[8 - 13]  2
[15 - 20]  3
[25 - 32]  4
[38 - 43]  5
[48 - 53]  6
[60 +]  7
Total
We use transfer learning from the VGG-Face CNN [41] with AVG pooling, a CNN descriptor computed using the VGG-Very-Deep-16 architecture. For each example of the Adience database, VGG-Face computes a 4,096-dimensional feature descriptor.
We again run independent chains for each method on the Adience test set, using the hyperparameter configuration given in Table I, and obtain results comparable to the state of the art [42]. The comparison between the different methods can be seen in Table IV.
Method  Total Accuracy (%)

SGHMC  
SGLD  
DSGHMC ()
DSGHMC ()
DSGHMC ()
When studying the role of p on Adience, a trend different from the one observed in the previous example appears. For Adience, the best results are centered on an intermediate range of values, as shown in Figure 8.
V-B1 Accuracy Analysis
Class imbalance plays an important role in the final classification results for the age recognition problem on Adience. Although the comparison methods show more balanced per-class accuracy (Fig. 9.b) than the proposal, DSGHMC (Fig. 9.d) improves the results in most classes, sacrificing to a lesser extent the results of the classes with fewer examples.
As the Dropout probability decreases, the imbalance in the per-class accuracy results becomes more evident, which is why selecting a Dropout probability that is too small is not advisable. However, this choice depends on the problem at hand and can be corroborated by visualizing its behavior in the studies carried out previously (see Figs. 2 and 8).
V-C Misclassification example, class 4 (age [25 - 32])
In the case of the analysis of confusing examples in Adience, it is difficult to establish which classes are highly confusing without considering the neighboring ages. However, overconfidence in the probabilities of misclassified cases can be studied.
In the first example (Fig. 10.a), we observe a high confidence in the wrong class in both SGHMC (Fig. 10.b) and SGLD (Fig. 10.c). On the other hand, the proposal generates higher probabilities both in the correct class (Fig. 10.e and 10.f) and in its neighboring classes (Fig. 10.d), which significantly improves the classification results by adding uncertainty with respect to the classes with greater similarity.
V-D Misclassification example, class 1 (age [4 - 6])
In the second example (Fig. 11.a), we observe a high confidence in a class that is too far from the real one (class 1) in both SGHMC (Fig. 11.b) and SGLD (Fig. 11.c). As in the previous case, the proposal allows higher probabilities to be generated both in the correct class (Fig. 11.d) and in its neighboring classes (Fig. 11.e and 11.f), improving the classification results and providing a better estimation of the uncertainty than the comparison methods.
VI Conclusion
In many decision-making systems, there is uncertainty both in the evaluated probabilities and in the expected consequences of the different actions that must be taken. The fundamental objective of better uncertainty estimates is to be sufficiently detailed to encourage decision-makers to think about the implications of uncertainty for the decision in question. It has been shown that our methodology provides better uncertainty estimates than two other methods of similar purpose, obtaining state-of-the-art classification results for high dimensional and large scale problems.
The second-order correction of Langevin dynamics aims to eliminate the final Metropolis correction, because the second-order discretization of this stochastic differential equation reduces the first-order approximation error. However, for discontinuous energy functions, it has not yet been proven that Langevin dynamics fully preserve volume. On the other hand, it has been demonstrated empirically, through automatic differentiation, that the method is capable of generating a stable approximation of the posterior distribution without the need for an additional correction method, yielding better uncertainty results than the base method.
References
 [1] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1988.
 [2] C. M. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag New York, 2016.
 [3] S. J. Prince, Computer vision: models, learning, and inference. Cambridge University Press, 2012.
 [4] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian data analysis. CRC press Boca Raton, FL, 2014, vol. 2.
 [5] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid monte carlo,” Physics letters B, vol. 195, no. 2, pp. 216–222, 1987.
 [6] R. M. Neal et al., “Mcmc using hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, vol. 2, no. 11, 2011.
 [7] A. Beskos, N. Pillai, G. Roberts, J. M. Sanz-Serna, A. Stuart et al., “Optimal tuning of the hybrid monte carlo algorithm,” Bernoulli, vol. 19, no. 5A, pp. 1501–1534, 2013.
 [8] T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient hamiltonian monte carlo,” in International Conference on Machine Learning, 2014, pp. 1683–1691.
 [9] T. J. Mitchell and J. J. Beauchamp, “Bayesian variable selection in linear regression,” Journal of the American Statistical Association, vol. 83, no. 404, pp. 1023–1032, 1988.
 [10] T. Park and G. Casella, “The bayesian lasso,” Journal of the American Statistical Association, vol. 103, no. 482, pp. 681–686, 2008.
 [11] C. M. Carvalho, N. G. Polson, and J. G. Scott, “The horseshoe estimator for sparse signals,” Biometrika, vol. 97, no. 2, pp. 465–480, 2010.
 [12] P. Carbonetto, M. Stephens et al., “Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies,” Bayesian analysis, vol. 7, no. 1, pp. 73–108, 2012.
 [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [14] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059.
 [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [16] E. Eidinger, R. Enbar, and T. Hassner, “Age and gender estimation of unfiltered faces,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, 2014.
 [17] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in Proceedings of the 28th International Conference on Machine Learning (ICML11), 2011, pp. 681–688.
 [18] M. Betancourt, “A conceptual introduction to hamiltonian monte carlo,” arXiv preprint arXiv:1701.02434, 2017.
 [19] M. D. Hoffman and A. Gelman, “The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1593–1623, 2014.
 [20] M. D. Hoffman, “Learning deep latent gaussian models with markov chain monte carlo,” in International Conference on Machine Learning, 2017, pp. 1510–1519.
 [21] G. O. Roberts and J. S. Rosenthal, “Optimal scaling of discrete approximations to langevin diffusions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 60, no. 1, pp. 255–268, 1998.
 [22] G. O. Roberts and O. Stramer, “Langevin diffusions and metropolis-hastings algorithms,” Methodology and Computing in Applied Probability, vol. 4, no. 4, pp. 337–357, 2002.
 [23] M. Girolami and B. Calderhead, “Riemann manifold langevin and hamiltonian monte carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 2, pp. 123–214, 2011.
 [24] Z. Wang, S. Mohamed, and N. Freitas, “Adaptive hamiltonian and riemann manifold monte carlo,” in International Conference on Machine Learning, 2013, pp. 1462–1470.
 [25] M. Pereyra, “Proximal markov chain monte carlo algorithms,” Statistics and Computing, vol. 26, no. 4, pp. 745–760, 2016.
 [26] L. Chaari, J.Y. Tourneret, C. Chaux, and H. Batatia, “A hamiltonian monte carlo method for nonsmooth energy sampling,” IEEE Transactions on Signal Processing, vol. 64, no. 21, pp. 5585–5594, 2016.
 [27] B. Shahbaba, S. Lan, W. O. Johnson, and R. M. Neal, “Split hamiltonian monte carlo,” Statistics and Computing, vol. 24, no. 3, pp. 339–349, 2014.
 [28] A. L. Beam, S. K. Ghosh, and J. Doyle, “Fast hamiltonian monte carlo using gpu computing,” Journal of Computational and Graphical Statistics, vol. 25, no. 2, pp. 536–548, 2016.
 [29] C. Li, A. Stevens, C. Chen, Y. Pu, Z. Gan, and L. Carin, “Learning weight uncertainty with stochastic gradient mcmc for shape classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5666–5675.
 [30] A. Nishimura, D. Dunson, and J. Lu, “Discontinuous hamiltonian monte carlo for models with discrete parameters and discontinuous likelihoods,” arXiv preprint arXiv:1705.08510, 2017.
 [31] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
 [32] D. WardeFarley, I. J. Goodfellow, A. Courville, and Y. Bengio, “An empirical analysis of dropout in piecewise linear networks,” arXiv preprint arXiv:1312.6197, 2013.
 [33] P. Baldi and P. J. Sadowski, “Understanding dropout,” in Advances in neural information processing systems, 2013, pp. 2814–2822.
 [34] R. Bardenet, A. Doucet, and C. Holmes, “Towards scaling up markov chain monte carlo: an adaptive subsampling approach,” in International Conference on Machine Learning, 2014, pp. 405–413.
 [35] A. Pakman and L. Paninski, “Auxiliary-variable exact hamiltonian monte carlo samplers for binary distributions,” in Advances in Neural Information Processing Systems, 2013, pp. 2490–2498.
 [36] H. M. Afshar and J. Domke, “Reflection, refraction, and hamiltonian monte carlo,” in Advances in Neural Information Processing Systems, 2015, pp. 3007–3015.
 [37] P. Baldi and P. Sadowski, “The dropout learning algorithm,” Artificial intelligence, vol. 210, pp. 78–122, 2014.
 [38] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, “Stan: A probabilistic programming language,” Journal of Statistical Software, vol. 76, no. 1, 2017.
 [39] D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei, “Deep probabilistic programming,” in International Conference on Learning Representations, 2017.
 [40] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, “Result analysis of the nips 2003 feature selection challenge,” in Advances in Neural Information Processing Systems, 2005, pp. 545–552.
 [41] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, 2015.
 [42] G. Levi and T. Hassner, “Age and gender classification using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 34–42.