Challenges in Bayesian inference via Markov chain Monte Carlo for neural networks
Markov chain Monte Carlo (MCMC) methods and neural networks are instrumental in tackling inferential and prediction problems. However, Bayesian inference based on the joint use of MCMC methods and neural networks is limited. This paper reviews the main challenges posed by neural networks to MCMC developments, including lack of parameter identifiability due to weight symmetries, prior specification effects, and consequently high computational cost and convergence failure. Population and manifold MCMC algorithms are combined to demonstrate these challenges via multilayer perceptron (MLP) examples and to develop case studies for assessing the capacity of approximate inference methods to uncover the posterior covariance of neural network parameters. Some of these challenges, such as the high computational cost arising from the application of neural networks to big data and the identifiability issues arising from weight symmetries, stimulate research towards more scalable approximate MCMC methods or towards MCMC methods in reduced parameter spaces.
Keywords: Bayesian inference, Bayesian neural networks, convergence diagnostics, manifold Langevin Monte Carlo, Markov chain Monte Carlo, power posteriors, prior specification, weight symmetries.
The universal approximation theorem (Cybenko, 1989) and its subsequent extensions (Hornik, 1991; Lu et al., 2017) state that feedforward neural networks with exponentially large width and width-bounded deep neural networks can approximate any continuous function arbitrarily well. This universal approximation capacity of neural networks along with available computing power explain the widespread use of deep learning nowadays.
Bayesian inference for neural networks is mainly performed via stochastic Bayesian optimization or via stochastic variational inference (Polson and Sokolov, 2017). MCMC methods have been explored in the context of neural networks, but have not evolved yet to the point of being broadly included in the Bayesian deep learning toolbox.
The slower evolution of MCMC methods for neural networks is partly attributed to the lack of scalability of existing MCMC algorithms for big data and for high-dimensional parameter spaces. Furthermore, additional contributing factors hinder the adaptation of existing MCMC algorithms in deep learning, including the hierarchical structure of neural networks and the associated covariance between parameters, lack of identifiability arising from weight symmetries, lack of a priori knowledge about the parameter space and the associated impact of non-objective priors on the parameter posterior, and ultimately lack of convergence.
The purpose of this paper is twofold. Initially, a literature review is conducted to identify inferential challenges arising from attempts to develop MCMC methods for neural networks over the last three decades. Secondly, contemporary manifold and population MCMC algorithms are applied to small MLPs for the first time.
Manifold and population MCMC provide incremental improvements, are impeded by the same limitations other MCMC algorithms face, and therefore do not offer a solution for scalable Bayesian deep learning. Nevertheless, manifold and population MCMC help showcase many of the challenges reported in the MCMC literature for neural networks and offer a testbed for benchmarking scalable approximate inference methods on small neural networks. More specifically, manifold and population MCMC seem to capture the posterior covariance structure between parameters of small MLPs. Such ground truth is unknown, yet simulations with different manifold and population MCMC algorithms provide unanimous empirical evidence on what the posterior covariance structure might be.
An outline of the paper layout follows. Section 2 reviews the inferential challenges arising from the application of MCMC to neural networks. Section 3 provides an overview of the employed manifold and population MCMC algorithms, including simplified manifold Langevin Monte Carlo and power posteriors, and of the employed MCMC diagnostics, namely of the potential scale reduction factor (PSRF) and of the effective sample size (ESS). Section 4 introduces the truncated Metropolis-adjusted Langevin algorithm and the categorical distribution for swapping states between power posteriors that are used in the examples. An overview of the MLP model and of its likelihood for binary and multiclass classification are provided in section 5. MCMC simulations for two MLPs, one fitted to exclusive-or (XOR) and one to the Iris data, are summarized and interpreted in section 6. Directions for future research in section 7 conclude the paper.
2 Inferential challenges
A literature review of inferential challenges in the application of MCMC methods to neural networks is conducted in this section thematically, with each subsection being focused on a different challenge.
2.1 Computational cost
Existing MCMC algorithms do not scale with increasing number of parameters or of samples. For this reason, approximate inference methods, including variational inference (VI), are preferred in high-dimensional parameter spaces or in big data problems from a time complexity standpoint (Blei, Kucukelbir and McAuliffe, 2017). On the other hand, MCMC methods are known to be better than VI in terms of approximating the log-likelihood (Dupuy and Bach, 2017).
Literature on MCMC algorithms for neural networks is limited due to associated computational complexity implications. Sequential Monte Carlo and reversible jump MCMC have been applied on two types of neural network architectures, namely MLPs and radial basis function networks (RBFs), see for instance Andrieu, de Freitas and Doucet (1999); de Freitas (1999); Andrieu, de Freitas and Doucet (2000); de Freitas et al. (2001). For a review of MCMC approaches to neural networks, see Titterington (2004).
A resurgence of interest in scaling MCMC methods to big data has been reflected in the literature over the last five years. The main focus has been on designing Metropolis-Hastings or Gibbs sampling variants that evaluate a costly log-likelihood on a subset (minibatch) of the data rather than on the entire data set (Welling and Teh, 2011; Chen, Fox and Guestrin, 2014; Ma, Foti and Fox, 2017; De Sa, Chen and Wong, 2018; Nemeth and Sherlock, 2018; Robert et al., 2018; Seita et al., 2018; Quiroz et al., 2019).
Among recent attempts to scale MCMC algorithms to big data applications, there exists a smaller subset of studies applying such algorithms to neural networks (Chen, Fox and Guestrin, 2014; Gu, Ghahramani and Turner, 2015; Gong, Li and Hernández-Lobato, 2019). Minibatch MCMC approaches to neural networks pave the way towards data-parallel deep learning. On the other hand, to the best of the authors’ knowledge, there is no published research on MCMC algorithms that evaluate the log-likelihood on a subset of neural network parameters rather than on the whole set of parameters, and therefore no reported research on model-parallel deep learning via MCMC.
2.2 Model structure
A neural network with $\rho$ layers can be viewed as a hierarchical model with $\rho$ levels, each network layer representing a level (Williams, 2000). Due to its non-linear activations, a neural network is specifically a non-linear hierarchical model.
MCMC methods for non-linear hierarchical models have been developed, see for example Bennett, Racine-Poon and Wakefield (1996); Gilks and Roberts (1996); Daniels and Kass (1998); Sargent, Hodges and Carlin (2000). However, existing MCMC algorithms for non-linear hierarchical models have not been harnessed by neural networks due to time complexity and convergence implications.
Although not designed to mirror the hierarchical structure of a neural network, recent hierarchical VI (Ranganath, Tran and Blei, 2016; Esmaeili et al., 2019; Huang et al., 2019; Titsias and Ruiz, 2019) provides more expressive variational approximations of the parameter posterior of the neural network than mean-field VI. Introducing a hierarchical structure in the variational distribution induces correlation among parameters, in contrast to the mean-field variational distribution that assumes independent parameters. So, one of the goals of Bayesian inference for neural networks is to approximate the covariance structure among network parameters. In fact, there are published comparisons between MCMC and VI in terms of speed and accuracy of convergence to the posterior covariance, both in linear or mixture model examples (Giordano, Broderick and Jordan, 2015; Mandt, Hoffman and Blei, 2017; Ong, Nott and Smith, 2018) and in neural network examples (Zhang et al., 2018a).
Manifold Monte Carlo methods introduce a proposal mechanism that approximates the covariance of the target posterior locally (Girolami and Calderhead, 2011). Thereby, manifold Monte Carlo can be used in toy neural networks to benchmark the capacity of large scale approximate Bayesian inference methods to capture the posterior covariance among neural network parameters.
2.3 Weight symmetries
The output of feedforward neural networks given some fixed input remains unchanged under a set of transformations determined by the choice of activations and by the network architecture more generally. For instance, certain weight permutations and sign flips in MLPs with hyperbolic tangent activations leave the output unchanged (Chen, Lu and Hecht-Nielsen, 1993).
If a parameter transformation leaves the output of a neural network unchanged given some fixed input, then the parameter posterior is invariant under the transformation. In other words, transformations, such as weight permutations and sign-flips, render neural networks non-identifiable (Pourzanjani, Jiang and Petzold, 2017).
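To make the sign-flip symmetry concrete, the following sketch (illustrative code, not part of the paper, with arbitrary placeholder weights) builds a one-hidden-layer tanh MLP and checks that flipping the signs of a hidden unit's incoming and outgoing weights leaves the output unchanged, since $\tanh(-x) = -\tanh(x)$:

```python
# Illustrative check: for a one-hidden-layer tanh MLP, flipping the signs of
# a hidden unit's incoming and outgoing weights leaves the output unchanged,
# since tanh(-x) = -tanh(x).
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # input -> hidden
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # hidden -> output

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[0, :] *= -1; b1f[0] *= -1; W2f[:, 0] *= -1         # flip hidden unit 0

x = rng.normal(size=2)
```

The flipped parameter vector is a distinct point in parameter space, yet yields an identical network output, which is exactly why the induced posterior has multiple equally likely modes.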
It is known that the set of linear invertible parameter transformations that leaves the output unchanged is a subgroup $T$ of the group of invertible linear mappings from the parameter space $\mathbb{R}^n$ to itself (Hecht-Nielsen, 1990). $T$ is a transformation group acting on the parameter space $\mathbb{R}^n$. It can be shown that for each permutable feedforward neural network, there exists a cone $H \subset \mathbb{R}^n$ dependent only on the network architecture such that for any parameter $\theta \in \mathbb{R}^n$ there exist $\tau \in T$ and $\eta \in H$ such that $\theta = \tau(\eta)$. This relation means that every network parameter is equivalent to a parameter in the proper subset $H$ of $\mathbb{R}^n$ (Hecht-Nielsen, 1990). Neural networks with convolutions, max-pooling and batch-normalization contain more types of weight symmetries than MLPs (Badrinarayanan, Mishra and Cipolla, 2015).
In practice, the parameter space of a neural network is set to be the whole of $\mathbb{R}^n$ rather than a cone of $\mathbb{R}^n$. Since a neural network posterior with support in the non-reduced parameter space $\mathbb{R}^n$ is invariant under weight permutations, sign-flips or other transformations, the posterior landscape includes multiple equally likely modes. This implies a low acceptance rate, entrapment in local modes and convergence challenges for MCMC. Additionally, computational time is wasted during MCMC, since posterior modes represent equivalent solutions (Nalisnick, 2018). Such challenges manifest themselves in the MLP examples of section 6. For neural networks with a larger number of parameters, the topology of the likelihood function is characterized by local optima embedded in high-dimensional flat plateaus (Brea et al., 2019). Thereby, larger neural networks lead to a multimodal target density with sparse modes for MCMC.
Seeking parameter symmetries in neural networks can lead to a variety of NP-hard problems (Ensign et al., 2017). Moreover, symmetries in neural networks pose identifiability and associated inferential challenges in Bayesian inference, but they also provide opportunities to develop inferential methods with reduced computational cost (Hu, Zagoruyko and Komodakis, 2019) or with improved predictive performance (Moore, 2016). Empirical evidence from stochastic optimization simulations suggests that removing weight symmetries has a negative effect on prediction accuracy in smaller and shallower convolutional neural networks (CNNs), but has no effect on prediction accuracy in larger and deeper CNNs (Maddison et al., 2015). So, elimination or exploitation of weight symmetries provides scope for scalable Bayesian inference in deep learning. At the same time, finding weight symmetries is not a trivial problem.
2.4 Prior specification
Parameter priors have been used for generating Bayesian smoothing or regularization effects. For instance, de Freitas (1999) develops sequential Monte Carlo methods with smoothing priors for MLPs and Williams (1995) introduces Bayesian regularization and pruning to neural networks via a Laplace prior.
When parameter prior specification for a neural network is not driven by smoothing or regularization, the question becomes how to choose the prior. The choice of parameter prior for a neural network is crucial in that it affects the parameter posterior (Lee, 2004), and consequently the predictive posterior (Lee, 2005). The impact of prior choice on the parameter posterior is demonstrated in the MCMC simulations of section 6.1 by fitting an MLP model to XOR.
Neural networks are commonly applied to big data. For large amounts of data, practitioners usually do not have intuition about the relationship between input and output variables. Furthermore, the interpretation of neural network weights and biases remains an open research question. As a priori knowledge about big data sets and about neural network parameters is typically not available, prior elicitation from experts is not applicable to neural networks in practice.
It seems logical to choose a prior that reflects a priori ignorance about the parameters. A constant-valued prior is a possible candidate, with the caveat of being improper for unbounded parameter spaces, such as $\mathbb{R}^n$. However, for neural networks, an improper prior can result in an improper parameter posterior (Lee, 2005).
Typically, a truncated flat prior for neural networks is sufficient for ensuring a valid parameter posterior (Lee, 2005). At the same time, the choice of truncation bounds depends on weight symmetry and consequently on the allocation of equivalent points in the parameter space. Lee (2003) proposes a restricted flat prior for feedforward neural networks by bounding some of the parameters and by imposing constraints that guarantee layer-wise linear independence between activations, while Lee (2000) shows that this prior is asymptotically consistent for the posterior. Moreover, Lee (2003) demonstrates that such a restricted flat prior enables more effective MCMC sampling in comparison to alternative prior choices.
Objective prior specification is an area of statistics that has not yet substantially infiltrated Bayesian inference for neural networks. Alternative ideas for constructing objective priors with minimal effect on posterior inference exist in the statistics literature. For example, Jeffreys priors are invariant to differentiable one-to-one transformations of the parameters (Jeffreys, 1962), maximum entropy priors maximize the Shannon entropy and therefore provide the least possible information (Jaynes, 1968), reference priors maximize the expected Kullback-Leibler divergence from the associated posteriors and in that sense are the least informative priors (Bernardo, 1979), and penalised complexity priors penalise the complexity induced by deviating from a simpler base model (Simpson et al., 2017).
To the best of the authors’ knowledge, there are only two published lines of research on objective priors for neural networks; a theoretical derivation of Jeffreys and reference priors for feedforward neural networks by Lee (2007), and an approximation of reference priors via Monte Carlo sampling of a differentiable non-centered parameterization of MLPs and CNNs by Nalisnick (2018).
2.5 Convergence
MCMC convergence depends on the shape of the target density, namely on multi-modality, model sparsity and lack of smoothness. A small MLP with as few as twenty-seven parameters makes convergence within a fixed sampling time challenging for contemporary manifold and population MCMC algorithms (see example of section 6.2).
Attaining MCMC convergence is not the only challenge. Assessing whether a finite sample from an MCMC algorithm represents an underlying target density cannot be done with certainty (Cowles and Carlin, 1996). MCMC diagnostics can fail to detect the type of convergence failure they were designed to identify. Combinations of diagnostics are thus used in practice to evaluate MCMC convergence with reduced risk of false diagnosis. In this paper, the potential scale reduction factor (PSRF) and the effective sample size (ESS) are used jointly to assess MCMC convergence (see section 3.2).
MCMC diagnostics have been designed with asymptotically exact MCMC algorithms in mind. Recently, research activity on approximate MCMC methods has emerged (Welling and Teh, 2011; Chen, Fox and Guestrin, 2014; Mandt, Hoffman and Blei, 2017; Rudolf and Schweizer, 2018; Chen et al., 2019) along with new diagnostics that quantify convergence of approximate MCMC methods (Chwialkowski, Strathmann and Gretton, 2016).
Quantization and discrepancy are two notions pertinent to approximate MCMC algorithms. The quantization of a target density by an empirical measure provides an approximation to the target (Graf and Luschgy, 2007), while the notion of discrepancy quantifies how well the empirical measure approximates the target density (Chen et al., 2019). The kernel Stein discrepancy (KSD) and the maximum mean discrepancy (MMD) constitute two instances of discrepancy; for more details, see Chen et al. (2019) and Gretton et al. (2012), respectively. Rudolf and Schweizer (2018) provide an alternative way of assessing the quality of approximation of a target density by an empirical measure in the context of approximate MCMC, using the notion of Wasserstein distance between the two.
A remark follows, which is beyond the scope of the present paper and towards future approximate MCMC developments. For parametric models, such as neural networks, there is no prior knowledge about the true parameter posterior. For this reason, it seems useful to attempt constructing approximate MCMC algorithms by measuring the discrepancy or Wasserstein distance between predictions and output data (Rudolf and Schweizer, 2018) instead of measuring discrepancy in the parameter space.
3 Overview of MCMC algorithms and diagnostics
3.1 MCMC algorithms
Interest is in sampling from a possibly unnormalized target density $\pi(\theta)$ on a parameter space $\Theta$. For a neural network, the parameter space consists of the weights and biases of the network.
This section provides a description of the MCMC algorithms used in the examples. There are many available MCMC samplers, and no claim is made that the ones chosen in the paper are the most efficient among all. Langevin Monte Carlo has been chosen over Hamiltonian Monte Carlo on the basis of more feasible computational complexity of the respective manifold versions of these two families of samplers, since simplified manifold Langevin Monte Carlo can operate using only the gradient and Hessian of the target density, whereas manifold Hamiltonian Monte Carlo requires computing also the derivatives of the Hessian (Girolami and Calderhead, 2011).
3.1.1 Metropolis-Hastings algorithm
Consider a proposal density $q(\theta^* \mid \theta_k)$ associated with state $\theta_k$ at the $k$-th iteration of Metropolis-Hastings (MH). Let $\theta^*$ be a state sampled from $q(\cdot \mid \theta_k)$. The MH acceptance probability is
$$\alpha(\theta_k, \theta^*) = \min\left\{1, \frac{\pi(\theta^*)\, q(\theta_k \mid \theta^*)}{\pi(\theta_k)\, q(\theta^* \mid \theta_k)}\right\} \qquad (3.1)$$
if $\pi(\theta_k)\, q(\theta^* \mid \theta_k) \neq 0$, and $\alpha(\theta_k, \theta^*) = 1$ otherwise.
A normal proposal density with a constant covariance matrix simplifies the acceptance probability to $\min\{1, \pi(\theta^*) / \pi(\theta_k)\}$, defining the random walk Metropolis (RWM) algorithm.
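A minimal RWM sketch follows (illustrative code with assumed function names, not the paper's implementation); with a symmetric normal proposal, the proposal density cancels in the MH ratio:

```python
# Minimal random walk Metropolis sketch: with a symmetric normal proposal,
# the MH ratio reduces to pi(theta*) / pi(theta_k).
import numpy as np

def rwm(log_target, theta0, n_iter=5000, scale=0.5, seed=1):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    for k in range(n_iter):
        proposal = theta + scale * rng.standard_normal(theta.size)
        # accept with probability min(1, pi(proposal) / pi(theta))
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        chain[k] = theta
    return chain

# Toy target: standard normal in two dimensions.
chain = rwm(lambda t: -0.5 * t @ t, np.zeros(2))
```

Working with log-densities, as above, avoids numerical underflow when the target involves a neural network log-likelihood over many data points.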
3.1.2 Metropolis-adjusted Langevin algorithm
The class of MH algorithms with normal proposal density
$$q(\theta^* \mid \theta_k) = \mathcal{N}\big(\theta^* \mid \mu(\theta_k),\, \epsilon^2 M\big), \qquad (3.2)$$
with mean
$$\mu(\theta_k) = \theta_k + \frac{\epsilon^2}{2}\, M\, \nabla \log \pi(\theta_k), \qquad (3.3)$$
is known as the class of Metropolis-adjusted Langevin algorithms (MALA). $M$ and $\epsilon$ are hyper-parameters known as the preconditioning matrix (Roberts and Stramer, 2002) and integration stepsize, respectively. MALA differs from RWM in that it uses the gradient of the log-target density to update the proposal density mean (3.3).
$M$ is a positive-definite matrix of size $n \times n$, assuming an $n$-dimensional parameter space. It is commonly set to the identity matrix $I_n$. In high-dimensional parameter spaces (as $n \to \infty$), the integration stepsize $\epsilon$ is selected to attain a limiting acceptance rate of $0.574$ (Roberts and Rosenthal, 1998).
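A MALA sketch with $M$ set to the identity matrix follows (illustrative names, hand-coded gradient, not the paper's code); because the Langevin drift makes the proposal asymmetric, the proposal density must appear in the MH ratio:

```python
# MALA sketch with M = I: the drift term (eps^2 / 2) * grad log pi shifts the
# proposal mean, and the asymmetric proposal enters the acceptance ratio.
import numpy as np

def mala(log_target, grad_log_target, theta0, n_iter=5000, eps=0.5, seed=2):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    def log_q(to, frm):  # log N(to | frm + (eps^2 / 2) grad, eps^2 I), up to a constant
        mu = frm + 0.5 * eps**2 * grad_log_target(frm)
        return -0.5 * np.sum((to - mu)**2) / eps**2
    chain = np.empty((n_iter, theta.size))
    for k in range(n_iter):
        mu = theta + 0.5 * eps**2 * grad_log_target(theta)
        prop = mu + eps * rng.standard_normal(theta.size)
        log_alpha = (log_target(prop) + log_q(theta, prop)
                     - log_target(theta) - log_q(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        chain[k] = theta
    return chain

# Toy target: standard normal in two dimensions, with gradient -theta.
chain = mala(lambda t: -0.5 * t @ t, lambda t: -t, np.zeros(2))
```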
3.1.3 Simplified manifold Metropolis-adjusted Langevin algorithm
Similarly to MALA, the simplified manifold Metropolis-adjusted Langevin algorithm (SMMALA) is also an MH algorithm. SMMALA is defined by the normal proposal density
$$q(\theta^* \mid \theta_k) = \mathcal{N}\Big(\theta^* \,\Big|\, \theta_k + \frac{\epsilon^2}{2}\, G^{-1}(\theta_k)\, \nabla \log \pi(\theta_k),\; \epsilon^2\, G^{-1}(\theta_k)\Big). \qquad (3.4)$$
Proposal density (3.4) is a generalization of (3.2). The difference between SMMALA and MALA is that the former assumes a position-dependent preconditioning matrix $G^{-1}(\theta_k)$ at iteration $k$. Consequently, the covariance matrix of the SMMALA proposal density at iteration $k$ depends on the current state $\theta_k$.
There is no optimal scaling theory for tuning the stepsize $\epsilon$ of SMMALA towards an optimal acceptance rate. Girolami and Calderhead (2011) suggest empirically tuning $\epsilon$ to obtain an acceptance rate of around 70% for SMMALA.
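A single SMMALA step can be sketched as follows (illustrative helper names; `G` is a user-supplied callable). The position-dependent preconditioner enters both the drift and the covariance, so the proposal density must be evaluated in both directions:

```python
# SMMALA step sketch: G(theta) preconditions both the Langevin drift and the
# proposal covariance; the MH ratio uses the proposal density both ways.
import numpy as np

def smmala_step(theta, log_target, grad, G, eps, rng):
    def proposal_params(t):
        Ginv = np.linalg.inv(G(t))
        return t + 0.5 * eps**2 * Ginv @ grad(t), eps**2 * Ginv
    def log_q(to, frm):  # log proposal density, up to a constant
        mu, cov = proposal_params(frm)
        d = to - mu
        return -0.5 * (d @ np.linalg.solve(cov, d) + np.linalg.slogdet(cov)[1])
    mu, cov = proposal_params(theta)
    prop = rng.multivariate_normal(mu, cov)
    log_alpha = (log_target(prop) + log_q(theta, prop)
                 - log_target(theta) - log_q(prop, theta))
    return prop if np.log(rng.uniform()) < log_alpha else theta

# Gaussian toy target with covariance diag(1, 4); here G is the constant
# inverse covariance, i.e. the Fisher information of this target.
Sigma_inv = np.diag([1.0, 0.25])
rng = np.random.default_rng(7)
theta = smmala_step(np.zeros(2), lambda t: -0.5 * t @ Sigma_inv @ t,
                    lambda t: -Sigma_inv @ t, lambda t: Sigma_inv, 0.5, rng)
```

For this Gaussian toy target the metric is constant, so SMMALA reduces to preconditioned MALA; with a neural network target, `G` would change with the state.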
3.1.4 Choice of metric tensor
The SMMALA proposal density provides a sampling framework subject to the choice of position-dependent preconditioning matrix $G^{-1}(\theta_k)$. Differential geometry motivates the choice of $G$.
Consider a space of densities $p(y \mid \theta)$ parameterized by $\theta$. According to Rao (1945), the KL divergence between $p(y \mid \theta)$ and $p(y \mid \theta + \delta\theta)$ satisfies
$$\mathrm{KL}\big(p(y \mid \theta) \,\|\, p(y \mid \theta + \delta\theta)\big) \approx \frac{1}{2}\, \delta\theta^{\top} F(\theta)\, \delta\theta \qquad (3.6)$$
for small perturbations $\delta\theta$, where $F(\theta)$ is the expected Fisher information matrix of $y$ conditional on $\theta$. More generally, $F(\theta)$ is a metric tensor conveying a notion of distance in the space of densities on the basis of (3.6), despite the fact that the KL divergence is asymmetric and therefore not a metric.
In a Bayesian setting, a posterior $\pi(\theta \mid y) \propto p(y \mid \theta)\, \pi_0(\theta)$ plays the role of target density, where $p(y \mid \theta)$ is the likelihood and $\pi_0(\theta)$ is the prior. One option is then to set $G(\theta_k)$ at the $k$-th iteration of SMMALA to be the observed Fisher information matrix of $y$ given $\theta_k$, which is the negative Hessian of the log-likelihood $\log p(y \mid \theta_k)$, plus the negative Hessian of the log-prior $\log \pi_0(\theta_k)$:
$$G(\theta_k) = -\nabla^2_{\theta} \log p(y \mid \theta_k) - \nabla^2_{\theta} \log \pi_0(\theta_k). \qquad (3.7)$$
For more information about possible choices of metric tensor and for a more extensive treatment of manifold Langevin Monte Carlo methods, see Girolami and Calderhead (2011).
Sampling from and evaluating the SMMALA proposal density require the Cholesky decomposition of $G(\theta_k)$ and the inverse $G^{-1}(\theta_k)$, respectively. Because the observed Fisher information matrix in (3.7) is not always positive-definite, such linear algebra calculations can result in singularities. The SoftAbs metric by Betancourt (2013) is used for approximating metric tensors while avoiding such singularities. In the examples of the present paper, the SoftAbs metric is employed for approximating metric (3.7).
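The SoftAbs map can be sketched as follows; this illustration (with an assumed helper name) applies the eigenvalue transformation $\lambda \mapsto \lambda \coth(\alpha \lambda)$ of Betancourt (2013) to a symmetrized, possibly indefinite metric:

```python
# SoftAbs sketch: map each eigenvalue lambda of a possibly indefinite
# Hessian-based metric to lambda * coth(alpha * lambda) > 0, yielding a
# positive-definite surrogate metric.
import numpy as np

def softabs(H, alpha=1e6):
    lam, Q = np.linalg.eigh((H + H.T) / 2)    # symmetrize, then decompose
    lam_soft = lam / np.tanh(alpha * lam)     # = lam * coth(alpha * lam)
    return Q @ np.diag(lam_soft) @ Q.T

H = np.array([[2.0, 0.0], [0.0, -0.5]])       # indefinite "observed Fisher"
G = softabs(H)                                # eigenvalues become 2 and 0.5
```

For large $\alpha$ the map approaches the absolute value of each eigenvalue, so well-conditioned directions are left essentially unchanged while negative directions are reflected to positive ones.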
3.1.5 Power posteriors
Power posterior sampling by Friel and Pettitt (2008) is a population Monte Carlo algorithm. It involves $m + 1$ chains drawn from tempered versions $\pi(\theta \mid y, t_i) \propto p(y \mid \theta)^{t_i}\, \pi_0(\theta)$ of a target posterior for a temperature schedule $t_i,\ i = 0, 1, \dots, m$, where $0 \le t_0 < t_1 < \dots < t_m = 1$. At each iteration, the state of each chain is updated using an MCMC sampler associated with that chain, and subsequently states between pairs of chains are swapped according to an MH algorithm. For the $i$-th chain, a chain index $j$ is drawn from a probability mass function in order to determine the pair $(i, j)$ for a possible swap. Algorithm 1 describes power posterior sampling in more detail; $\theta_{k,i}$ denotes the parameter state at the $k$-th MCMC iteration of the chain associated with power posterior $\pi(\theta \mid y, t_i)$.
Bayesian model selection, and more specifically log-marginal likelihood and Bayes factor computations, drive the development of power posteriors in Friel and Pettitt (2008). More generally, power posteriors are a useful population MCMC method for sampling from multimodal target densities. Power posteriors with temperatures $t_i < 1$ are smooth approximations of the target density $\pi(\theta \mid y)$, facilitating exploration of the parameter space via state transitions between chains of tempered posteriors and of the target posterior.
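The swap acceptance ratio between two tempered chains can be derived from the power posterior form: the priors cancel, leaving only the two log-likelihoods and temperatures. A sketch, with assumed variable names:

```python
# Tempering swap sketch: proposing to exchange the states of chains i and j
# gives MH ratio [L_i^{t_j} * L_j^{t_i}] / [L_i^{t_i} * L_j^{t_j}], whose log
# factorizes as below (priors cancel under the power posterior form).
import numpy as np

def log_swap_ratio(log_lik_i, log_lik_j, t_i, t_j):
    return (t_j - t_i) * (log_lik_i - log_lik_j)

# A high-likelihood state in a hot chain (small t) is readily swapped
# towards the cold chain (t = 1):
r = log_swap_ratio(log_lik_i=-10.0, log_lik_j=-50.0, t_i=0.2, t_j=1.0)
```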
3.2 Numerical MCMC diagnostics
This section outlines PSRF and ESS, which are the two numerical MCMC diagnostics used in the examples.
3.2.1 Potential scale reduction factor
Convergence of a Markov chain to its stationary distribution can fail in various ways. For example, an MCMC algorithm may explore only one area in the support of the target density or may require more iterations to converge or may be sensitive to the choice of initial parameter value.
PSRF, commonly denoted by $\hat{R}$, is an MCMC diagnostic of convergence conceived by Gelman and Rubin (1992) and extended to its multivariate version by Brooks and Gelman (1998). To compute PSRF, several independent Markov chains are simulated. The idea behind PSRF is that, if convergence has not been reached, the pooled variance across all chains will be higher than the variance within individual chains. Gelman et al. (2013) propose split-$\hat{R}$, a modification of $\hat{R}$ that compares the distribution of the first half of each chain after warm-up to the distribution of the second half of the chain.
Empirical evidence suggests that both $\hat{R}$ and split-$\hat{R}$ succeed in detecting lack of convergence of the first moment as long as the Monte Carlo variance is not very high. Vehtari et al. (2019) suggest two transformations to make split-$\hat{R}$ operational under high Monte Carlo variance: normalizing the chain around the median (referred to as folding the chain) provides a statistic sensitive to simulated chains with the same location but different scales, while rank-normalizing the folded chain reduces the effect of heavy tails. The folded-split-$\hat{R}$ is defined to be the value of split-$\hat{R}$ computed on rank-normalized values of the folded chain; see Vehtari et al. (2019) for more details. In this paper, both split-$\hat{R}$ and folded-split-$\hat{R}$ are reported.
Gelman et al. (2004) recommend terminating MCMC sampling as soon as $\hat{R} < 1.1$. More recently, Vats and Knudson (2018) make an argument based on ESS that a cut-off of $1.1$ for $\hat{R}$ is too high to estimate a Monte Carlo mean with reasonable uncertainty. Stan Development Team (2019) recommend simulating at least four chains to compute $\hat{R}$ and using a threshold of $\hat{R} < 1.01$.
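A split-$\hat{R}$ sketch follows, using the standard between/within-variance formula (an illustrative implementation, not the paper's code); each chain is split in half so that the statistic also detects within-chain drift:

```python
# Split-Rhat sketch: split each chain in half, then compare between-chain
# variance B with within-chain variance W.
import numpy as np

def split_rhat(chains):
    # chains: array of shape (m, c) holding m chains of length c (c even)
    halves = np.concatenate(np.split(chains, 2, axis=1), axis=0)  # (2m, c/2)
    c = halves.shape[1]
    B = c * halves.mean(axis=1).var(ddof=1)   # between-chain variance
    W = halves.var(axis=1, ddof=1).mean()     # within-chain variance
    return np.sqrt(((c - 1) / c * W + B / c) / W)

rng = np.random.default_rng(3)
mixed = rng.standard_normal((4, 1000))        # four well-mixed chains
stuck = mixed + np.arange(4)[:, None]         # chains at different locations
```

Applied to `mixed`, the statistic sits near 1; applied to `stuck`, whose chains occupy different locations, it is clearly above the 1.01 threshold.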
3.2.2 Effective sample size
$\hat{R}$ and its variants can fail to diagnose poor mixing of a Markov chain, whereas low values of ESS are an indicator of poor mixing. It is thus recommended to check both $\hat{R}$ and ESS (Vehtari et al., 2019). For a theoretical treatment of the relation between $\hat{R}$ and ESS, see Vats and Knudson (2018).
The ESS of a Monte Carlo estimate obtained from a Markov chain is defined to be the number of independent samples that provide an estimate with variance equal to the variance of the Monte Carlo estimate. For a more extensive treatment entailing alternative definitions and estimators of ESS, see for example Vats and Flegal (2018); Gong and Flegal (2016); Kass et al. (1998).
Vehtari et al. (2019) propose the commonly used effective sample size estimator $mc / \hat{\tau}$ across $m$ independent chains of length $c$ each, where $\hat{\tau}$ is an estimator of the integrated autocorrelation time. An average of within-chain autocorrelations across the $m$ chains and an estimate of between-chain variance are combined to provide a between-chain autocorrelation estimate $\hat{\rho}_l$ at each lag $l$. The resulting $\hat{\rho}_l$ is then truncated to a maximum lag to reduce noise from autocorrelation estimation, and $\hat{\tau}$ is computed using $\hat{\rho}_l$ on the basis of the initial monotone sequence estimator by Geyer (1992).
The estimation of ESS in this paper bears a similarity to Vehtari et al. (2019) in that both approaches involve computing the initial monotone sequence estimator of Geyer (1992) across independent chains. Despite this similarity, the ESS estimator in this paper differs from that of Vehtari et al. (2019). For a Markov chain consisting of $c$ states $\theta_1, \theta_2, \dots, \theta_c$, the ESS estimator of the $j$-th coordinate is defined hereafter to be
$$\widehat{\mathrm{ESS}}_j = c\, \frac{\hat{\sigma}_j^2}{\hat{v}_j}, \qquad (3.8)$$
where $\theta_{i,j}$ is the $j$-th coordinate of $\theta_i$. The term $\hat{\sigma}_j^2$ refers to the sample variance of coordinate $j$ under the assumption of independent identically distributed (IID) samples $\theta_{1,j}, \theta_{2,j}, \dots, \theta_{c,j}$, while the term $\hat{v}_j$ denotes an estimator of the asymptotic Monte Carlo variance of coordinate $j$. The initial monotone sequence estimator is used as $\hat{v}_j$ (Geyer, 1992; Papamarkou, Mira and Girolami, 2014).
To understand the intuition behind (3.8), start by observing that the variance of a sample mean decreases as the sample size grows: $n$ IID samples with variance $\hat{\sigma}_j^2$ yield a sample mean with variance $\hat{\sigma}_j^2 / n$. Analogously, the Monte Carlo variance of the chain mean is $\hat{v}_j / c$ for a chain of length $c$ with asymptotic variance estimate $\hat{v}_j$. Equating the two variances, $\hat{\sigma}_j^2 / n = \hat{v}_j / c$, and solving for $n$ yields $n = c\, \hat{\sigma}_j^2 / \hat{v}_j$. It now becomes apparent that the number of IID samples attaining the Monte Carlo variance of the chain mean is given by (3.8).
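An illustrative single-chain sketch of an estimator in the spirit of (3.8) follows (function names are assumptions, and the monotone-sequence details are a simplification of Geyer, 1992): chain length times the ratio of the IID sample variance to an estimate of the asymptotic Monte Carlo variance.

```python
# ESS sketch in the spirit of (3.8): ESS = c * (IID variance) / (asymptotic
# Monte Carlo variance), with the asymptotic variance estimated by Geyer's
# initial monotone sequence of autocovariance pair sums.
import numpy as np

def ess(x):
    c = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode='full')[c - 1:] / c  # autocovariances
    # Sum consecutive-lag pairs while they stay positive and non-increasing.
    pair_sums = []
    for k in range(0, c - 1, 2):
        s = acov[k] + acov[k + 1]
        if s <= 0 or (pair_sums and s > pair_sums[-1]):
            break
        pair_sums.append(s)
    asym_var = 2.0 * sum(pair_sums) - acov[0]  # asymptotic variance estimate
    return c * acov[0] / asym_var

rng = np.random.default_rng(4)
iid = rng.standard_normal(2000)       # independent draws: ESS close to 2000
ar = np.empty(2000)                   # AR(1) chain: strongly autocorrelated
ar[0] = 0.0
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the AR(1) chain with coefficient 0.9, the integrated autocorrelation time is large, so the estimated ESS is a small fraction of the nominal chain length.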
The sample mean of ESS (3.8) across independent chains is reported in the examples because it makes more conservative claims in comparison to the ESS of Vehtari et al. (2019). A minimum value of 100 per chain for the ESS of Vehtari et al. (2019) is empirically suggested by Stan Development Team (2019), and such a threshold is employed in the present paper on the basis of ESS (3.8).
4 MCMC modifications
Two sampling tweaks are made in the examples. Firstly, some simulations assume neural network weights and biases with support in a bounded and closed interval $[s_1, s_2]$ for some real numbers $s_1 < s_2$, relying on a version of MALA with a truncated proposal density. Secondly, a categorical probability mass function is used in power posterior sampling for determining candidate pairs of chains for state swaps.
4.1 Truncated Metropolis-adjusted Langevin algorithm
Consider a target density $\pi(\theta)$ with support in $[s_1, s_2]^n$, where $s_1 < s_2$ and $n$ denotes the dimension of the parameter space. To sample from $\pi$ via MALA, the truncated normal proposal density
$$q(\theta^* \mid \theta_k) \propto \prod_{i=1}^{n} \exp\left( -\frac{\big(\theta^*_i - \mu_i(\theta_k)\big)^2}{2 \epsilon^2} \right) \mathbb{1}_{[s_1, s_2]}(\theta^*_i)$$
can be employed, which corresponds to a normal density bounded in $[s_1, s_2]^n$. The terms $\theta^*_i$, $\mu_i(\theta_k)$ and $\mathbb{1}_{[s_1, s_2]}(\theta^*_i)$ refer to the $i$-th coordinates of $\theta^*$, of the proposal mean $\mu(\theta_k)$ in (3.3), and of the indicator of the truncation bounds, respectively.
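A per-coordinate sketch of sampling from and evaluating such a truncated proposal follows (illustrative helper names; rejection sampling is used for simplicity, and the normalization uses the standard normal CDF):

```python
# Truncated normal proposal sketch: draw a coordinate from a normal density
# restricted to [s1, s2] by rejection, and evaluate the truncated log-density
# with its CDF normalization.
import numpy as np
from math import erf, log, pi, sqrt

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sample_truncated(mu, sigma, s1, s2, rng):
    while True:                          # rejection: keep draws inside [s1, s2]
        x = rng.normal(mu, sigma)
        if s1 <= x <= s2:
            return x

def log_truncated_density(x, mu, sigma, s1, s2):
    if not s1 <= x <= s2:
        return -np.inf
    z = norm_cdf((s2 - mu) / sigma) - norm_cdf((s1 - mu) / sigma)
    return (-0.5 * ((x - mu) / sigma) ** 2
            - log(sigma) - 0.5 * log(2.0 * pi) - log(z))

rng = np.random.default_rng(5)
draws = np.array([sample_truncated(0.0, 1.0, -1.0, 1.0, rng)
                  for _ in range(1000)])
```

Because the normalization constant depends on the proposal mean, the truncated proposal density is asymmetric and must be evaluated in both directions inside the MH ratio.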
4.2 Categorical distribution for power posterior state swaps
In Friel and Pettitt (2008), a discrete Laplacian probability mass function is suggested for proposing a neighbouring chain $j$ of chain $i$. In the current paper, a categorical probability mass function is chosen, in an attempt to make explicit the conceptual details in the implementation of power posterior sampling.
Assuming $m + 1$ power posteriors, a neighbouring chain $j$ of chain $i$ is sampled from a categorical probability mass function with event probabilities $p(j \mid i)$ proportional to $\exp(-\beta\, |i - j|)$, where $j \neq i$ and $0 \le j \le m$. For a derivation of the normalizing constant appearing in the event probabilities $p(j \mid i)$, see Appendix A: categorical distribution in power posteriors. $\beta$ is a hyper-parameter typically set to $1$, a value which makes a jump to a chain one position away roughly three times more likely than a jump to a chain two positions away (Friel and Pettitt, 2008).
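One possible concrete reading of the swap-partner step is sketched below (the exponential decay over chain distance and all names are assumptions made for illustration): chain $i$ proposes a partner $j \neq i$ with probability decreasing in $|i - j|$, so adjacent chains are preferred.

```python
# Swap-partner sketch: probabilities proportional to exp(-|i - j|) over
# chains j != i, normalized to sum to one.
import numpy as np

def swap_partner_probs(i, m):
    j = np.arange(m + 1)
    w = np.exp(-np.abs(i - j).astype(float))
    w[i] = 0.0                      # a chain cannot swap with itself
    return w / w.sum()

p = swap_partner_probs(2, 4)        # chains 0..4, current chain is 2
```

With this choice, a jump to a chain one position away is $e \approx 2.7$ times more likely than a jump to a chain two positions away.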
5 Overview of the multilayer perceptron model
Both examples in this paper rely on an MLP model, one consisting of nine and one consisting of twenty-seven parameters. MCMC inference has been performed on MLPs before; see for example de Freitas (1999); Vehtari, Sarkka and Lampinen (2000). Manifold Langevin Monte Carlo methods and power posteriors have not been used in the context of MLPs. The use of such contemporary geometric and population MCMC methods in neural networks is not an end in itself; it is a means for acquiring an understanding of the challenges arising from the application of MCMC methods to neural networks and for developing benchmark tools for more scalable MCMC algorithms.
MLPs have been chosen as a more tractable class of neural networks. CNNs are the most widely used deep learning models. However, even small CNNs, such as AlexNet (Krizhevsky, Sutskever and Hinton, 2012), SqueezeNet (Iandola et al., 2016), Xception (Chollet, 2017), MobileNet (Howard et al., 2017), ShuffleNet (Zhang et al., 2018b), EffNet (Freeman, Roese-Koerner and Kummert, 2018) or DCTI (Truong, Nguyen and Tran, 2018), have a parameter count at least two orders of magnitude larger, thus amplifying issues of computational complexity, model structure, weight symmetry, prior specification, posterior shape, MCMC convergence and sampling effectiveness.
5.1 The multilayer perceptron
An MLP is a feedforward neural network consisting of an input layer, one or more hidden layers and an output layer (Rosenblatt, 1958; Minsky and Papert, 1988; Hastie, Tibshirani and Friedman, 2016). Let $j$ be an index indicating the layer for some natural number $\rho$, where $j = 0$ refers to the input layer, $j = 1, 2, \dots, \rho$ to one of the hidden layers and $j = \rho + 1$ to the output layer. Let $n_j$ be the number of neurons in layer $j$ and use $\mathbf{n} = (n_0, n_1, \dots, n_{\rho+1})$ as a shorthand for the sequence of neuron counts per layer. Under such notation, $\mathrm{MLP}(\mathbf{n})$ refers to an MLP with $\rho$ hidden layers and $n_j$ neurons at layer $j$.
An with hidden layers and neurons at layer is defined recursively as
for . A data point is used as input to the input layer, yielding the sequence in the first hidden layer. and are the respective weights and biases at layer , which together constitute the parameters at layer . is a shorthand for all weights and biases up to layer . Functions , known as activations, are applied elementwise to their input .
The default recommendation for an activation function in neural networks is the rectified linear unit (ReLU); see for instance Jarrett et al. (2009); Nair and Hinton (2010); Goodfellow, Bengio and Courville (2016). If an activation is not present at layer , then in (5.2) corresponds to the identity function .
The weight matrix in (5.1) has rows and columns, while the vector of biases has length . Concatenating all across hidden and output layers gives a parameter vector of length . To define uniquely, the convention to traverse weight matrix elements row-wise is made. Evidently, each of in (5.1) and in (5.2) has length .
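The recursive definition above is straightforward to express in code. The following minimal NumPy sketch (function names are illustrative, not taken from the paper) computes an MLP forward pass from per-layer weight-bias pairs, and the total parameter count as the sum over layers of the weight and bias dimensions.

```python
import numpy as np

def mlp_forward(x, params, activations):
    """Forward pass through an MLP.

    params      : list of (W, b) tuples; W at layer j has shape
                  (kappa_j, kappa_{j-1}) and b has length kappa_j.
    activations : list of elementwise activation functions, one per layer
                  (pass the identity where no activation is present).
    """
    g = np.asarray(x, dtype=float)
    for (W, b), phi in zip(params, activations):
        g = phi(W @ g + b)  # affine map, then elementwise activation
    return g

def n_params(layer_sizes):
    """Parameter count: sum over layers of kappa_j * kappa_{j-1} + kappa_j."""
    return sum(k * k_prev + k
               for k_prev, k in zip(layer_sizes[:-1], layer_sizes[1:]))
```

For the XOR example below, `n_params([2, 2, 1])` returns 9, matching the nine-parameter MLP used in this paper.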
The notation is introduced to point to the -th element of weight matrix at layer . Analogously, points to the -th coordinate of bias vector at layer .
5.2 Likelihood for binary classification
Consider samples , consisting of some input and of a binary output . An with a single neuron in its output layer can be used for setting the likelihood function of labels given the input and MLP parameters .
Firstly, the sigmoid activation function is applied at the output layer of the MLP. So, the event probabilities are set to
Assuming that the labels are outcomes of independent draws from Bernoulli probability mass functions with event probabilities given by (5.3), the likelihood becomes
The log-likelihood follows as
The negative value of log-likelihood (5.5) is known as the binary cross entropy (BCE). To infer the parameters of , the binary cross entropy or a different loss function is minimized using stochastic optimization methods, such as stochastic gradient descent.
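As an illustration, the Bernoulli log-likelihood (5.5), whose negative is the BCE, can be evaluated from the output-layer pre-activations. This is a minimal NumPy sketch, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_loglik(y, logits):
    """Log-likelihood of binary labels y given output-layer pre-activations.

    The sigmoid activation (5.3) maps pre-activations to event
    probabilities; the result is the negative binary cross entropy."""
    p = sigmoid(logits)
    eps = 1e-12  # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```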
5.3 Likelihood for multiclass classification
Consider some output variable , which can take values. An with neurons in its output layer can be used for setting the likelihood function .
A softmax activation function is applied at the output layer of the MLP, where denotes the -th coordinate of the -length vector . Thus, the event probabilities are
It is assumed that the labels are outcomes of independent draws from categorical probability mass functions with event probabilities given by (5.6), so the likelihood is
denotes the indicator function, that is if , and otherwise. The log-likelihood follows as
The negative value of log-likelihood (5.8) is also known as cross entropy, and it is used as loss function for stochastic optimization in multiclass classification MLPs.
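A corresponding sketch for the multiclass case evaluates log-likelihood (5.8) via a numerically stable log-softmax; the indicator in (5.7) amounts to selecting the log event probability of the observed class. Function names are illustrative:

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def categorical_loglik(y, logits):
    """Log-likelihood of integer labels y in {0, ..., K-1}.

    logits holds one K-length vector of output-layer pre-activations per
    sample (one row each); the indicator in the likelihood reduces the
    inner sum to picking out the observed class per sample."""
    logp = log_softmax(logits)
    return np.sum(logp[np.arange(len(y)), y])
```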
It is noted that for , an with two neurons at the output layer, event probabilities given by softmax activation (5.6) and log-likelihood (5.8) can be used for binary classification. Such a formulation provides an alternative to an with one neuron at the output layer, event probabilities given by sigmoid activation (5.3) and log-likelihood (5.5). The difference between the two MLP models is the parameterization of event probabilities, since a categorical distribution with two levels otherwise coincides with a Bernoulli distribution.
Two examples of MLPs are used for showcasing challenges in MCMC inference for neural networks. An and an are used in the context of a binary and of a multiclass classification example, respectively. The focus of this paper is on inferring the parameter posterior of an MLP using MCMC, rather than inferring the predictive posterior.
The unnormalized parameter posterior of an MLP with binary or categorical output is , where the likelihood function corresponds to (5.4) or (5.7). is the prior of MLP parameters. In the examples, the unnormalized log-target density sampled via MCMC is , where the log-likelihood for binary or multiclass classification corresponds to (5.5) or (5.8).
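The unnormalized log-target can then be assembled as log-likelihood plus log-prior. The sketch below assumes an isotropic normal prior, in line with the priors used in the examples; the function names are illustrative:

```python
import numpy as np

def log_target(theta, loglik, prior_var=1.0):
    """Unnormalized log posterior sampled by MCMC.

    loglik maps theta to the log-likelihood of (5.5) or (5.8), computed
    through an MLP forward pass; the prior is assumed isotropic normal
    with mean zero and variance prior_var per coordinate."""
    log_prior = -0.5 * np.dot(theta, theta) / prior_var  # up to an additive constant
    return loglik(theta) + log_prior
```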
Ten and four Markov chains are simulated per MCMC sampler for the respective and examples to compute PSRF and ESS. The increased computational cost incurred by power posterior simulations, which are run only for the example, is the reason for simulating fewer chains per sampler for in comparison to . iterations are run per chain, of which are discarded as burn-in. Samplers are tuned empirically to attain acceptance rates recommended in the literature (see section 3.1). The SoftAbs metric (Betancourt, 2013) is used for approximating the Hessian of the target density in all simulations involving SMMALA and power posteriors with SMMALA chains.
Maximum PSRF values across all parameters are reported for each MLP. For ease of exposition, the suppressed notation split- and folded-split- is henceforth preferred over and . In a similar fashion, the mean ESS of each parameter is computed across simulated chains, and subsequently the minimum, median and maximum ESS are taken across parameters. The shorter notation , and is preferred over , and .
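For concreteness, the split and folded-split PSRF diagnostics can be sketched as follows. This is a minimal single-parameter version assuming chains stored as rows of an array; in practice an established implementation would be used:

```python
import numpy as np

def split_rhat(chains):
    """Split-Rhat (PSRF) for one parameter.

    chains: array of shape (m, n) holding m chains of n post-burn-in
    draws. Each chain is split in half, doubling the chain count,
    before the usual between/within variance comparison."""
    m, n = chains.shape
    half = n // 2
    splits = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    n = half
    means = splits.mean(axis=1)
    W = splits.var(axis=1, ddof=1).mean()  # within-chain variance
    B = n * means.var(ddof=1)              # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

def folded_split_rhat(chains):
    """Folded variant: split-Rhat of |theta - median|, sensitive to
    mismatched scales rather than mismatched locations."""
    return split_rhat(np.abs(chains - np.median(chains)))
```

Reporting the maximum of these quantities across parameters, as done here, flags the worst-mixing coordinate of the chain.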
The mean acceptance rate across simulated chains is reported for each sampler. In the case of power posteriors, if the state of a chain corresponding to the target density changes after taking the within and between-chain sub-steps of an MCMC step, then the state counts as accepted, otherwise it counts as rejected.
Gaussian and Beta kernels are adopted for evaluating kernel density estimators (KDEs) of densities with unbounded and bounded support, respectively. KDEs relying on Beta kernels are based on the work of Chen (1999).
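A minimal sketch of the Beta-kernel estimator of Chen (1999) for densities on [0, 1] follows; the bandwidth value and function names are illustrative:

```python
import numpy as np
from math import lgamma

def beta_pdf(t, a, c):
    """Beta(a, c) density evaluated at the points in t, all in (0, 1)."""
    log_norm = lgamma(a + c) - lgamma(a) - lgamma(c)
    return np.exp(log_norm + (a - 1) * np.log(t) + (c - 1) * np.log(1 - t))

def beta_kernel_kde(x_grid, data, b=0.05):
    """Beta-kernel density estimate on [0, 1], following Chen (1999).

    At an evaluation point x the kernel is the Beta(x/b + 1, (1-x)/b + 1)
    density evaluated at the data, so no probability mass leaks outside
    the bounded support; b plays the role of the bandwidth."""
    return np.array([beta_pdf(data, x / b + 1, (1 - x) / b + 1).mean()
                     for x in np.atleast_1d(x_grid)])
```

Because the kernel shape adapts near the boundaries, this estimator avoids the boundary bias a Gaussian kernel would incur on bounded support.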
6.1 MLP for exclusive-or data
The XOR function returns 1 if exactly one of its two binary input values is equal to 1, and returns 0 otherwise. The four data points defining XOR are ((0, 0), 0), ((0, 1), 1), ((1, 0), 1) and ((1, 1), 0).
A perceptron without a hidden layer cannot learn the XOR function (Minsky and Papert, 1988). On the other hand, an with a single hidden layer of two neurons can learn the XOR function (Goodfellow, Bengio and Courville, 2016).
An has a parameter vector of length nine, since its weights and biases have respective dimensions six and three. MCMC is run to learn the posterior of given the four XOR data points . The sigmoid function is used as activation on the hidden layer, since it achieves higher acceptance rate in the XOR example than a ReLU, according to MCMC pilot runs.
6.1.1 Passing convergence diagnostics
Firstly, ten chains are simulated via MH with prior . The values of split- and folded-split- are below the upper PSRF threshold of . Moreover, the ESS summaries across the ten chains are , and . The minimum ESS value of per chain is above the lower ESS threshold of . Thus, the PSRF and ESS diagnostics do not detect lack of convergence. Moreover, a mean acceptance rate of across the ten chains and the trace plot of parameter shown in Figure 5(a) of Appendix C: plots for parameter of do not show signs of slow mixing.
Next, ten chains are simulated via MALA with the same prior . As seen in Table 1, MALA with prior does not fail the PSRF and ESS diagnostics either. Moreover, MALA has a minimum ESS value of , higher than the minimum ESS of for MH. This higher sampling effectiveness of MALA is possibly due to the use of gradient information in MALA.
6.1.2 Convergence with a weakly informative prior
MCMC diagnostics for the chains generated via MH and MALA with prior give no reason to doubt that the chains are representative of the parameter posterior of the under consideration.
Nevertheless, an attempt to evaluate prior effects follows. To this end, MALA simulations are run with more weakly informative priors, such as and . The term ‘weakly informative’ is not used rigorously here from the information-theoretic perspective of objective priors; it simply refers to flatter priors.
According to Table 1, split- and folded-split- for and priors are below the upper threshold of , although PSRF increases with increasing prior variance. with prior and with prior are higher than the lower ESS threshold of . Sampling effectiveness drops with increasing prior variance. MALA does not fail the PSRF and ESS diagnostics with any of , and priors.
Figure 5(h) shows that the running mean of a single chain realization via MALA for weight stabilizes with or prior, but not with prior. The loss of sampling effectiveness of MALA with increasing normal prior variance is visible in the autocorrelations of Figure 5(g) (the plotted autocorrelation lines correspond to the chains visualized in Figures 5(b)-5(f)).
The trace plots of Figures 5(c), 5(d) and 5(e) provide an explanation for the reduced sampling effectiveness of MALA with increasing normal prior variance (black straight horizontal lines represent Monte Carlo means for the respective traces). As the prior variance increases, MALA explores wider regions in the support of , and mixing becomes slower. These three trace plots demonstrate the waste of computational time during MCMC due to weight symmetries (Nalisnick, 2018). Equivalent marginal posterior modes of are scattered over the real line, so MCMC mixing worsens and additional computational effort is required to explore the marginal posterior landscape.
As a note of caution, consider the effects of using a prior with arbitrary mean and relatively small variance. For instance, MALA simulations with prior do not fail MCMC diagnostics (see relevant PSRF and ESS values in Table 1) and attain a minimum ESS of per chain. Such a high ESS is spurious and does not represent high sampling effectiveness. Figure 5(c) shows the trace plots of two chains for parameter realized via MALA with priors and . Both chains are entrapped in local modes, and therefore neither of the corresponding priors helps explore the space of parameter effectively. Priors and yield and , so it seems that prior leads to more effective sampling. However, Figure 3 indicates otherwise, since prior seems to prohibit MALA from capturing the posterior covariance among parameters, in contrast to prior .
Different priors for the parameters of the under consideration yield chains that get entrapped in different local modes or explore parameter regions of varying magnitude (Figures 5(c)-5(e)), and yet convergence diagnostics do not fail in any of the examined scenarios (Table 1). Thus, prior specification can impact the neural network parameter posterior inferred via MCMC (Lee, 2004) and MCMC diagnostics can fail to detect this issue.
6.1.3 Parameter symmetries explored via optimization and MCMC
To explore parameter symmetries and their role in MCMC, gradient descent (GD) is run until optimization solutions are obtained for the fitted to XOR. The used for optimization applies a sigmoid activation to the hidden layer and minimizes the BCE loss, that is, the negative log-likelihood (5.5).
Since only four data points make up XOR, the data are not split into training and test set. For each optimization run, GD learns the nine parameters given the four XOR data points, and the learnt parameters are then used for predicting the XOR output given the XOR input for the same four XOR data points.
The event probability for each of the four XOR data points is computed by a sigmoid activation at the output layer of . The classification threshold is set to 0.5. An optimization solution is accepted if it achieves 100% prediction accuracy, that is, if it correctly predicts the output of all four XOR data points.
Initial parameter values for optimization are sampled from a prior. Two priors are employed, namely and . The notation refers to a prior of IID parameter coordinates, with each coordinate admitting a uniform probability density function. optimization solutions are acquired for each of the two priors. epochs are run for each of the optimizations, with the learning rate set to one.
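The optimization set-up can be sketched as follows: full-batch GD with manual backpropagation through an MLP with one hidden layer of two neurons and sigmoid activations, minimizing the BCE loss, accepting a solution only if it predicts all four XOR outputs correctly at threshold 0.5. The epoch count and the uniform initialization range below are placeholders, not the paper's exact settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The four XOR data points: inputs and labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def fit_xor(rng, n_epochs=10000, lr=1.0, init_scale=1.0):
    """Full-batch GD on the BCE loss for a 2-2-1 MLP with sigmoid
    activations; initial values drawn uniformly on [-init_scale, init_scale].
    A sketch of the set-up, not the paper's exact implementation."""
    W1 = rng.uniform(-init_scale, init_scale, (2, 2))
    b1 = rng.uniform(-init_scale, init_scale, 2)
    W2 = rng.uniform(-init_scale, init_scale, (1, 2))
    b2 = rng.uniform(-init_scale, init_scale, 1)
    for _ in range(n_epochs):
        h = sigmoid(X @ W1.T + b1)          # hidden layer, shape (4, 2)
        p = sigmoid(h @ W2.T + b2).ravel()  # event probabilities, shape (4,)
        # Backpropagate the BCE gradient (d loss / d output logit = p - y).
        d2 = (p - y)[:, None]
        gW2, gb2 = d2.T @ h, d2.sum(0)
        d1 = (d2 @ W2) * h * (1 - h)
        gW1, gb1 = d1.T @ X, d1.sum(0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    p = sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2).ravel()
    accepted = np.all((p > 0.5) == (y == 1.0))  # 100% accuracy at threshold 0.5
    return accepted, (W1, b1, W2, b2)
```

Runs that end in a local minimum fail the accuracy check and are discarded, mirroring the acceptance criterion for optimization solutions described above.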
Figure 0(a) displays a parallel coordinates plot (Inselberg and Dimsdale, 1990) of out of the optimization solutions with prior. Every line plotted along the horizontal axis connects the nine parameter coordinates of a single optimization solution. The parallel coordinates plot exhibits symmetries along the horizontal axis . Folding Figure 0(a) along the horizontal axis visualizes permutations that leave the output of the unchanged.
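The permutation symmetry visualized by folding the parallel coordinates plot can be checked directly: swapping the two hidden neurons (rows of the first-layer weights and biases, together with the columns of the output-layer weights) leaves the MLP output unchanged while producing a distinct parameter vector. A small NumPy demonstration, with illustrative function names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_2_2_1(x, W1, b1, W2, b2):
    """Output of a 2-2-1 MLP with sigmoid activations throughout."""
    return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)

# Permute the two hidden neurons: swap rows of (W1, b1), columns of W2.
perm = [1, 0]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=2)
out = mlp_2_2_1(x, W1, b1, W2, b2)
out_perm = mlp_2_2_1(x, W1p, b1p, W2p, b2)
assert np.allclose(out, out_perm)  # same output, distinct parameter vector
```

Each such permutation maps one posterior mode to an equivalent one, which is why the marginal posteriors explored by MCMC exhibit the mirrored structure discussed above.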
Figure 0(b) shows the KDEs of GD solutions attained by sampling the initial parameter values for GD from or from . Each KDE in Figure 0(b) can be interpreted as a posterior of ‘good’ optimization solutions. In other words, each KDE approximates a posterior of parameters that lead to prediction accuracy on the four XOR data points. Those GD and MCMC simulations that use the same prior do not provide estimates of the same posterior for the parameters, since the likelihood functions for GD and MCMC differ. The BCE loss minimized by GD coincides with the negative log-likelihood (5.5) used in MCMC. However, a GD solution obtained by minimizing the BCE loss is accepted only if it meets the requirement of prediction accuracy on the four XOR data points. Adding this extra requirement of predictive quality on top of the BCE loss leads to a different loss function, and therefore to a different underlying likelihood.
Both KDEs in Figure 0(b) have very low probability mass in the region in the vicinity of , are bimodal and have one modal peak in and one in . On the other hand, the KDE associated with prior distributes probability mass widely in and , whereas the more highly peaked KDE associated with prior concentrates most of its probability mass in and in .
Figure 0(c) shows a Markov chain generated by MALA with prior. The chain switches between two regions, located towards the bounds of the support , and explores much less the region in the vicinity of . Thus, the chain simulated via MALA with prior explores the two regions away from the center of the support, which correspond to the two modal peaks and of the posterior KDE of GD solutions with