# Deep Learning: A Bayesian Perspective

## Abstract

Deep learning is a form of machine learning for nonlinear high dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), and projection pursuit regression (PPR), are all shown to be shallow learners. Their deep learning counterparts exploit multiple deep layers of data reduction which provide predictive performance gains. Stochastic gradient descent (SGD) training optimisation and Dropout (DO) regularization provide estimation and variable selection. Bayesian regularization is central to finding weights and connections in networks to optimize the predictive bias-variance trade-off. To illustrate our methodology, we provide an analysis of international bookings on Airbnb. Finally, we conclude with directions for future research.

## 1 Introduction

Deep learning (DL) is a form of machine learning that uses hierarchical abstract layers of latent variables to perform pattern matching and prediction. Deep learners are probabilistic predictors where the conditional mean is a stacked generalized linear model (sGLM). The current interest in DL stems from its remarkable success in a wide range of applications, including Artificial Intelligence (AI) [18], image processing [78], learning in games [19], neuroscience [67], energy conservation [18], and skin cancer diagnostics [50]. [77] provides a comprehensive historical survey of deep learning and its applications.

Deep learning is designed for massive data sets with many high dimensional input variables. For example, Google’s translation algorithm [85] uses 1-2 billion parameters and very large dictionaries. Computational speed is essential, and automated differentiation and matrix manipulations are available on `TensorFlow` [1]. Baidu successfully deployed speech recognition systems [6] with an extremely large deep learning model with over 100 million parameters, 11 layers and almost 12 thousand hours of speech for training. DL is an algorithmic approach rather than probabilistic in its nature, see [10] for the merits of both approaches.

Our approach is Bayesian and probabilistic. We view the theoretical roots of DL in Kolmogorov’s representation of a multivariate response surface as a superposition of univariate activation functions applied to an affine transformation of the input variable [49]. An affine transformation of a vector is a weighted sum of its elements (linear transformation) plus an offset constant (bias). Our Bayesian perspective on DL leads to new avenues of research including faster stochastic algorithms, hyper-parameter tuning, construction of good predictors, and model interpretation.

On the theoretical side, we show how DL exploits Kolmogorov’s “universal basis”. By construction, deep learning models are very flexible and gradient information can be efficiently calculated for a variety of architectures. On the empirical side, we show that the advances in DL are due to:

- New activation (a.k.a. link) functions, such as the rectified linear unit (ReLU), instead of the sigmoid function
- Depth of the architecture and dropout as a variable selection technique
- Computationally efficient routines to train and evaluate the models as well as accelerated computing via graphics processing unit (GPU) and tensor processing unit (TPU)

Deep learning has very well developed computational software where pure MCMC is too slow.

To illustrate DL, we provide an analysis of a dataset from Airbnb on first time international bookings. Different statistical methodologies can then be compared; see [45] and [72], who provide a comparison of traditional statistical methods with neural network based approaches for classification.

The rest of the paper is outlined as follows. Section 1.1 provides a review of deep learning. Section 2 provides a Bayesian probabilistic interpretation of many traditional statistical techniques (PCA, PCR, SIR, LDA) which are shown to be “shallow learners” with two layers. Much of the recent success in DL applications has been achieved by including deeper layers and these gains pass over to traditional statistical models. Section 3 provides heuristics on why Bayes procedures provide good predictors in high dimensional data reduction problems. Section 4 describes how to train, validate and test deep learning models. We provide computational details associated with stochastic gradient descent (SGD). Section 5 provides an application to bookings data from the Airbnb website. Finally, Section 6 concludes with directions for future research.

### 1.1 Deep Learning

Machine learning finds a predictor of an output $Y$ given a high dimensional input $X$. A learning machine is an input-output mapping, $Y = F(X)$, where the input space is high-dimensional,

$$Y = F(X), \qquad X = (X_1, \ldots, X_p).$$

The output $Y$ can be continuous, discrete or mixed. For a classification problem, we need to learn $F : X \to Y$, where $Y \in \{1, \ldots, K\}$ indexes categories. A predictor is denoted by $\hat{Y}(X)$.

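To construct a multivariate function, $F(X)$, we start with building blocks of hidden layers. Let $f_1, \ldots, f_L$ be univariate activation functions. A semi-affine activation rule is given by

$$f_l^{W,b}(X) = f_l\left(\sum_{j=1}^{N_l} W_{lj} X_j + b_l\right) = f_l(W_l X_l + b_l), \qquad 1 \le l \le L.$$

Here $W_l$ and $X_l$ are the weight matrix and inputs of the $l$th layer.

Our deep predictor, given the number of layers $L$, then becomes the composite map

$$\hat{Y}(X) = F(X) = \left(f_1^{W_1,b_1} \circ \cdots \circ f_L^{W_L,b_L}\right)(X).$$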

Put simply, a high dimensional mapping, $F$, is modeled via the superposition of univariate semi-affine functions. Similar to a classic basis decomposition, the deep approach uses univariate activation functions to decompose a high dimensional $X$. To select the number of hidden units (a.k.a. neurons), $N_l$, at each layer we will use a stochastic search technique known as dropout.

The offset vector is essential. For example, using $f(x) = \sin(x)$ without a bias term $b$ would not allow us to recover an even function like $\cos(x)$. An offset element (e.g. $b = \pi/2$) immediately corrects this problem.

Let $Z^{(l)}$ denote the $l$-th layer, and so $X = Z^{(0)}$. The final output is the response $Y$, which can be numeric or categorical. A deep prediction rule is then

$$\begin{aligned} Z^{(1)} &= f^{(1)}\left(W^{(0)} X + b^{(0)}\right), \\ Z^{(2)} &= f^{(2)}\left(W^{(1)} Z^{(1)} + b^{(1)}\right), \\ &\;\;\vdots \\ Z^{(L)} &= f^{(L)}\left(W^{(L-1)} Z^{(L-1)} + b^{(L-1)}\right), \\ \hat{Y}(X) &= W^{(L)} Z^{(L)} + b^{(L)}. \end{aligned}$$

Here, $W^{(l)}$ are weight matrices, and $b^{(l)}$ are threshold or activation levels. Designing a good predictor depends crucially on the choice of univariate activation functions $f^{(l)}$. Kolmogorov’s representation requires only two layers in principle. [87] prove the remarkable fact that a discontinuous link is required at the second layer even though the multivariate function is continuous. Neural networks (NN) simply approximate a univariate function as mixtures of sigmoids, typically with an exponential number of neurons, which does not generalize well. They can simply be viewed as projection pursuit regression with the only difference being that in a neural network the nonlinear link functions are parameter dependent and learned from training data.
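As a minimal sketch of the layered prediction rule, the following NumPy code composes semi-affine maps with a ReLU activation and a final affine output layer. The layer sizes, the random weights, and the helper name `deep_predict` are illustrative assumptions rather than anything specified in the paper. The last lines also check the remark on offsets: a sine unit with offset $\pi/2$ reproduces the even function cosine.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(z, 0.0)

def deep_predict(x, weights, biases, activation=relu):
    """Compose semi-affine layers z <- f(W z + b), ending with an affine output."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = activation(W @ z + b)
    # The final layer is affine only, giving a numeric response.
    return weights[-1] @ z + biases[-1]

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 1]  # input dimension 5, two hidden layers, scalar output (assumed)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

x = rng.normal(size=5)
y_hat = deep_predict(x, weights, biases)

# Offset example: sin(x + pi/2) coincides with the even function cos(x).
grid = np.linspace(-3.0, 3.0, 7)
offset_recovers_cos = np.allclose(np.sin(grid + np.pi / 2), np.cos(grid))
```

The weight matrices are stored as `(outputs, inputs)` arrays so that each layer is a single matrix-vector product.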

Figure ? illustrates a number of commonly used structures; for example, feed-forward architectures, auto-encoders, convolutional, and neural Turing machines. Once you have learned the dimensionality of the weight matrices which are non-zero, there’s an implied network structure.

[Figure: commonly used deep learning architectures. Panels: feed forward, auto-encoder, convolution, recurrent, long / short term memory, neural Turing machines.]

Recent deep architectures (indicating non-zero weights) include convolutional neural networks (CNN), recurrent NN (RNN), long short-term memory (LSTM), and neural Turing machines (NTM). [66] and [58] provide results on the advantage of representing some functions compactly with deep layers. [67] extends theoretical results on when deep learning can be exponentially better than shallow learning. [11] implements the algorithm of [82] to estimate the non-smooth inner link function. In practice, deep layers allow for smooth activation functions to provide “learned” hyper-planes which find the underlying complex interactions and regions without having to see an exponentially large number of training samples.

## 2 Deep Probabilistic Learning

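Probabilistically, the output $Y$ can be viewed as a random variable being generated by a probability model $p(Y \mid Y^{W,b}(X))$. Given $\hat{W}, \hat{b}$, the negative log-likelihood defines the loss $\mathcal{L}$ as

$$\mathcal{L}(Y, \hat{Y}) = -\log p\left(Y \mid Y^{\hat{W},\hat{b}}(X)\right).$$

The $L^2$-norm, $\mathcal{L}(Y_i, \hat{Y}(X_i)) = \|Y_i - \hat{Y}(X_i)\|_2^2$, is traditional least squares, and negative cross-entropy loss, $\mathcal{L}(Y_i, \hat{Y}(X_i)) = -\sum_{k=1}^{K} Y_{ik} \log \hat{Y}_k(X_i)$ with $Y_{ik}$ the indicator of category $k$, is for multi-class logistic classification. The procedure to obtain estimates $\hat{W}, \hat{b}$ of the deep learning model parameters is described in Section 4.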

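To control the predictive bias-variance trade-off we add a regularization term and optimize

$$\mathcal{L}_\lambda(Y, \hat{Y}) = -\log p\left(Y \mid Y^{\hat{W},\hat{b}}(X)\right) - \log p\left(\phi(W,b) \mid \lambda\right).$$

Probabilistically this is a negative log-prior distribution over parameters, namely

$$-\log p\left(\phi(W,b) \mid \lambda\right) = \lambda\,\phi(W,b) + \text{const}, \qquad p\left(\phi(W,b) \mid \lambda\right) = C \exp\left(-\lambda\,\phi(W,b)\right).$$

Deep predictors are regularized maximum a posteriori (MAP) estimators, where

$$p(W, b \mid D) \propto p\left(Y \mid Y^{W,b}(X)\right) p(W, b).$$

Training requires the solution of a highly nonlinear optimization

$$\hat{Y} = Y^{\hat{W},\hat{b}}(X), \qquad (\hat{W}, \hat{b}) = \arg\max_{W,b} \log p(W, b \mid D),$$

and the log-posterior is optimised given the training data, $D = \{Y^{(i)}, X^{(i)}\}_{i=1}^{T}$, with

$$-\log p(W, b \mid D) = \sum_{i=1}^{T} \mathcal{L}\left(Y^{(i)}, Y^{W,b}(X^{(i)})\right) + \lambda\,\phi(W,b) + \text{const}.$$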

Deep learning has the key property that the gradient of the log-posterior with respect to $(W, b)$ is computationally inexpensive to evaluate using tensor methods for very complicated architectures and fast implementation on large datasets. `TensorFlow` and `TPUs` provide a state-of-the-art framework for a plethora of architectures. From a statistical perspective, one caveat is that the posterior is highly multi-modal and providing good hyper-parameter tuning can be expensive. This is clearly a fruitful area of research for state-of-the-art stochastic Bayesian MCMC algorithms to provide more efficient algorithms. For shallow architectures, the alternating direction method of multipliers (ADMM) is an efficient solution to the optimization problem.

### 2.1 Dropout for Model and Variable Selection

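Dropout is a model selection technique designed to avoid over-fitting in the training process. This is achieved by removing input dimensions in $X$ at random, keeping each one with a given probability $p$. It is instructive to see how this affects the underlying loss function and optimization problem. For example, suppose that we wish to minimise MSE, $\mathcal{L}(Y, \hat{Y}) = \|Y - \hat{Y}\|_2^2$, then, when marginalizing over the randomness, we have a new objective

$$\arg\min_{W} \; E_{D \sim \mathrm{Ber}(p)} \left\| Y - W (D \star X) \right\|_2^2,$$

where $\star$ denotes the element-wise product. It is equivalent to, with $\Gamma = \left(\mathrm{diag}(X^\top X)\right)^{1/2}$,

$$\arg\min_{W} \; \left\| Y - p\,W X \right\|_2^2 + p(1-p) \left\| \Gamma W \right\|_2^2.$$

Dropout then is simply Bayes ridge regression with a $g$-prior as an objective function. This reduces the likelihood of over-reliance on small sets of input data in training, see [41] and [83]. Dropout can also be viewed as the optimization version of the traditional spike-and-slab prior, which has proven so popular in Bayesian model averaging. For example, in a simple model with one hidden layer, we replace the network

$$Y_i^{(l)} = f\left(Z_i^{(l)}\right), \qquad Z_i^{(l)} = W_i^{(l)} X^{(l)} + b_i^{(l)},$$

with the dropout architecture

$$D_i^{(l)} \sim \mathrm{Ber}(p), \qquad \tilde{X}^{(l)} = D^{(l)} \star X^{(l)}, \qquad Y_i^{(l)} = f\left(Z_i^{(l)}\right), \qquad Z_i^{(l)} = W_i^{(l)} \tilde{X}^{(l)} + b_i^{(l)}.$$

In effect, this replaces the input $X$ by $D \star X$, where $D$ is a matrix of independent $\mathrm{Ber}(p)$ distributed random variables.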
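The ridge-regression reading of dropout can be checked numerically. The sketch below assumes a linear model written in the usual regression layout (an $n \times d$ design matrix `X` and coefficient vector `beta`), with each entry of the mask kept independently with probability `p`; the data are synthetic and all names are illustrative. It compares a Monte Carlo average of the masked squared error with the closed form $\|y - pX\beta\|^2 + p(1-p)\|\Gamma\beta\|^2$, where $\Gamma^2$ is the diagonal of $X^\top X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 30, 4, 0.8  # p is the probability of keeping an input entry (assumed)
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

# Closed form: ||y - p X beta||^2 + p (1 - p) ||Gamma beta||^2, Gamma^2 = diag(X'X).
gamma_sq = np.sum(X ** 2, axis=0)
closed_form = (np.sum((y - p * X @ beta) ** 2)
               + p * (1 - p) * np.sum(gamma_sq * beta ** 2))

# Monte Carlo: average the squared error over random Bernoulli masks D.
n_draws = 20000
masks = rng.random(size=(n_draws, n, d)) < p   # element-wise Ber(p) masks
preds = (masks * X) @ beta                     # shape (n_draws, n)
monte_carlo = np.mean(np.sum((y - preds) ** 2, axis=1))

relative_gap = abs(monte_carlo - closed_form) / closed_form
```

The penalty term arises from the variance of the masked inputs, which is why it scales with $p(1-p)$ and with the squared column norms of `X`.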

Dropout also regularizes the choice of the number of hidden units in a layer. This can be achieved if we drop units of the hidden layer rather than the input layer and then establish which probability $p$ gives the best results. It is worth recalling though, as we have stated before, that one of the dimension reduction properties of a network structure is that once a variable from a layer is dropped, all terms above it in the network also disappear.

### 2.2 Shallow Learners

Almost all shallow data reduction techniques can be viewed as consisting of a low dimensional auxiliary variable $Z$ and a prediction rule specified by a composition of functions

$$\hat{Y} = f_1^{W_1,b_1}\left(f_2(W_2 X + b_2)\right) = f_1^{W_1,b_1}(Z), \qquad Z := f_2(W_2 X + b_2).$$

The problem of high dimensional data reduction is to find the $Z$-variable and to estimate the layer functions $(f_1, f_2)$ correctly. In the layers, we want to uncover the low-dimensional $Z$-structure, in a way that does not disregard information about predicting the output $Y$.

Principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), linear discriminant analysis (LDA), projection pursuit regression (PPR), and logistic regression are all shallow learners. [55] provides an interesting perspective on how Bayesian shrinkage provides good predictors in regression settings. [29] provide excellent discussions of PLS and why Bayesian shrinkage methods provide good predictors. [90], [21], [72], [15] provide further discussion of dimension reduction techniques. Other connections exist for Fisher’s Linear Discriminant classification rule, which is simply fitting $H(wx + b)$, where $H$ is a Heaviside function. [69] provide a Bayesian version of support vector machines (SVMs) and a comparison with logistic regression for classification.

PCA reduces $X$ to $f_2(X)$ using a singular value decomposition of the form

$$Z = f_2(X) = W^\top X + b,$$

where the columns of the weight matrix $W$ form an orthogonal basis for directions of greatest variance (which is in effect an eigenvector problem).
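A minimal sketch of PCA as a one-layer linear reduction, using NumPy's singular value decomposition on synthetic data. The data, the number of retained components `k`, and the variable names are assumptions made for illustration. Rows of `X` are observations here, so the scores are computed as `Xc @ W`.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with unequal variances across coordinates (illustrative only).
n, p, k = 500, 5, 2
X = rng.normal(size=(n, p)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)                  # centring plays the role of the offset b

# Thin SVD: Xc = U diag(S) Vt; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k].T                             # p x k weight matrix, orthonormal columns
Z = Xc @ W                               # low-dimensional scores, one row per sample

explained = S[:k] ** 2 / np.sum(S ** 2)  # share of total variance per direction
```

The squared singular values equal $n$ times the variances of the score columns, which is the eigenvector problem mentioned in the text written in SVD form.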

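Similarly PPR reduces $X$ to $f_2(X)$ by setting

$$Z = f_2(X) = \sum_{i=1}^{N_1} f_i\left(W_{i1} X_1 + \ldots + W_{ip} X_p\right).$$

Interaction terms, such as $x_1 x_2$, and max functions, $\max(x_1, x_2)$, can be expressed as nonlinear functions of semi-affine combinations. Specifically,

$$x_1 x_2 = \tfrac{1}{4}(x_1 + x_2)^2 - \tfrac{1}{4}(x_1 - x_2)^2, \qquad \max(x_1, x_2) = \tfrac{1}{2}(x_1 + x_2) + \tfrac{1}{2}|x_1 - x_2|.$$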

[23] provide further discussion for Projection Pursuit Regression, where the network uses a layered model of the form . [24] provide an ergodic view of composite iterated functions, a precursor to the use of multiple layers of single operators that can model complex multivariate systems. [79] provide the approximation theory for composite functions.

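Deep ReLU architectures can be viewed as Max-Sum networks via the following simple identity. Define $x^+ = \max(x, 0)$. Let $f_x(b) = (x + b)^+$ where $b$ is an offset. Then $(x + y^+)^+ = \max(0, x, x + y)$. This is generalized in [27] (p.272) who shows by induction that

$$\left(f_{x_1} \circ \cdots \circ f_{x_n}\right)(0) = \left(x_1 + \left(x_2 + \cdots + \left(x_{n-1} + x_n^+\right)^+ \cdots \right)^+\right)^+ = \max_{1 \le j \le n} \left(x_1 + \cdots + x_j\right)^+.$$

A composition or convolution of $\max$-layers is then a one layer max-sum network.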
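The max-sum identity for nested positive parts can be verified numerically. The sketch below assumes the nested form $(x_1 + (x_2 + \cdots + x_n^+)^+)^+$ and compares it with the one-layer expression $\max(0, x_1, x_1 + x_2, \ldots, x_1 + \cdots + x_n)$ on random inputs; the function names are illustrative.

```python
import numpy as np

def pos(x):
    """Positive part x^+ = max(x, 0)."""
    return max(x, 0.0)

def nested_relu(xs):
    """Evaluate (x_1 + (x_2 + ... + (x_{n-1} + x_n^+)^+ ...)^+)^+ from the inside out."""
    acc = 0.0
    for x in reversed(xs):
        acc = pos(x + acc)
    return acc

def max_sum(xs):
    """One-layer max-sum form: the largest partial sum, floored at zero."""
    return max(0.0, float(np.max(np.cumsum(xs))))

rng = np.random.default_rng(3)
samples = rng.normal(size=(1000, 6))
identity_holds = all(np.isclose(nested_relu(x), max_sum(x)) for x in samples)

example = nested_relu([1.0, -2.0, 3.0])  # (1 + (-2 + 3^+)^+)^+
```

Each extra ReLU layer only adds one more partial sum to the maximum, which is why the composition collapses to a single max-sum layer.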

### 2.3 Stacked Auto-Encoders

Auto-encoding is an important data reduction technique. An auto-encoder is a deep learning architecture designed to replicate $X$ itself, namely $X = Y$, via a *bottleneck* structure. This means we select a model $F^{W,b}(X)$ which aims to concentrate the information required to recreate $X$. See [39] for an application to smart indexing in finance. Suppose that we have $N$ input vectors $X = \{x_1, \ldots, x_N\}$ and $N$ output (or target) vectors $\{x_1, \ldots, x_N\}$.

Setting biases to zero, for the purpose of illustration, and using only one hidden layer () with factors, gives for

In an auto-encoder we fit the model , and *train* the weights with regularization penalty of the form

Writing our DL objective as an augmented Lagrangian (as in ADMM) with a hidden factor , leads to a two step algorithm, an encoding step (a penalty for ), and a decoding step for reconstructing the output signal via

where the regularization on induces a penalty on . The last term is the encoder, the first two the decoder.

If is estimated from the structure of the training data matrix, then we have a traditional factor model, and the matrix provides the factor loadings. PCA, PLS, SIR fall into this category, see Cook (2007) for further discussion. If is trained based on the pair then we have a sliced inverse regression model. If and are simultaneously estimated based on the training data , then we have a two layer deep learning model.

Auto-encoding demonstrates that deep learning does not model the variance-covariance matrix explicitly, as the architecture is already in predictive form. Given a hierarchical non-linear combination of deep learners, an implicit variance-covariance matrix exists, but that is not the focus of the algorithm.

Another interesting area for future research is long short-term memory models (LSTMs). For example, a dynamic one layer auto-encoder for a financial time series is a coupled system

The state equation encodes and the matrix decodes the vector into its history and the current state .

### 2.4 Bayesian Inference for Deep Learning

Bayesian neural networks have a long history. Early results on stochastic recurrent neural networks (a.k.a. Boltzmann machines) were published in [2]. Accounting for uncertainty by integrating over parameters is discussed in [20]. [54] proposed a general Bayesian framework for tuning network architecture and training parameters for feed forward architectures. [62] proposed using Hamiltonian Monte Carlo (HMC) to sample from the posterior distribution over the set of model parameters and then averaging outputs of multiple models. Markov Chain Monte Carlo algorithms were proposed by [59] to jointly identify parameters of a feed forward neural network as well as the architecture. A connection of neural networks with Bayesian nonparametric techniques was demonstrated in [52].

A Bayesian extension of feed forward network architectures has been considered by several authors [60]. Recent results show how dropout regularization can be used to represent uncertainty in deep learning models. In particular, [31] shows that the dropout technique provides uncertainty estimates for the predicted values. The predictions generated by deep learning models with dropout are nothing but samples from the predictive posterior distribution.

Graphical models with deep learning encode a joint distribution via a product of conditional distributions and allow for computing (inference) many different probability distributions associated with the same set of variables. Inference requires the calculation of a posterior distribution over the variables of interest, given the relations between the variables encoded in a graph and the prior distributions. This approach is powerful when learning from samples with missing values or predicting with some missing inputs.

A classical example of using neural networks to model a vector of binary variables is the Boltzmann machine (BM), with two layers. The first layer encodes latent variables and the second layer encodes the observed variables. Both conditional distributions $p(v \mid h)$ and $p(h \mid v)$ are specified using the logistic function parametrized by weights and offset vectors. The size of the joint distribution table grows exponentially with the number of variables, and [42] proposed using a Gibbs sampler to calculate the update to the model weights on each iteration. The multimodal nature of the posterior distribution leads to prohibitive computational times required to learn models of a practical size. [86] proposed a variational approach that replaces the posterior with another distribution that is easy to calculate; this approach was also considered in [74]. Several extensions to BMs have been proposed. Exponential family extensions have been considered by [80].

There have also been multiple approaches to building inference algorithms for deep learning models [54]. Performing Bayesian inference on a neural network calculates the posterior distribution over the weights given the observations. In general, such a posterior cannot be calculated analytically, or even efficiently sampled from. However, several recently proposed approaches address the computational problem for some specific deep learning models [35].

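The recent successful approaches to develop efficient Bayesian inference algorithms for deep learning networks are based on the reparameterization techniques for calculating Monte Carlo gradients while performing variational inference. Given the data $D = (X, Y)$, variational inference relies on approximating the posterior $p(\theta \mid D)$ with a variational distribution $q(\theta \mid D, \phi)$, where $\theta = (W, b)$. Then $q$ is found by minimizing the Kullback-Leibler divergence between the approximate distribution and the posterior, namely

$$\mathrm{KL}(q \,\|\, p) = \int q(\theta \mid D, \phi) \log \frac{q(\theta \mid D, \phi)}{p(\theta \mid D)} \, d\theta.$$

Since $p(\theta \mid D)$ is not necessarily tractable, we replace minimization of $\mathrm{KL}(q \,\|\, p)$ with maximization of the evidence lower bound (ELBO)

$$\mathrm{ELBO}(\phi) = \int q(\theta \mid D, \phi) \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid D, \phi)} \, d\theta.$$

The $\log$ of the total probability (evidence) is then

$$\log p(D) = \mathrm{ELBO}(\phi) + \mathrm{KL}(q \,\|\, p).$$

The sum does not depend on $\phi$, thus minimizing $\mathrm{KL}(q \,\|\, p)$ is the same as maximizing $\mathrm{ELBO}(\phi)$. Also, since $\mathrm{KL}(q \,\|\, p) \ge 0$, which follows from Jensen’s inequality, we have $\log p(D) \ge \mathrm{ELBO}(\phi)$; hence the name evidence lower bound. The resulting maximization problem is solved using stochastic gradient descent.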
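The decomposition of the log-evidence into ELBO plus KL divergence can be checked exactly on a toy model. The sketch below assumes a parameter that takes only `K` discrete values, so that all integrals become finite sums; the prior, likelihood values, and variational distribution are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5                                    # number of discrete parameter values (toy assumption)
prior = rng.dirichlet(np.ones(K))        # p(theta)
lik = rng.uniform(0.05, 1.0, size=K)     # p(D | theta) evaluated at the observed data
evidence = np.sum(prior * lik)           # p(D)
posterior = prior * lik / evidence       # p(theta | D)

q = rng.dirichlet(np.ones(K))            # an arbitrary variational distribution

elbo = np.sum(q * np.log(prior * lik / q))
kl = np.sum(q * np.log(q / posterior))
log_evidence = np.log(evidence)

# Setting q equal to the posterior makes the bound tight.
elbo_at_posterior = np.sum(posterior * np.log(prior * lik / posterior))
```

Because the left-hand side is fixed by the data, any increase in the ELBO is matched by an equal decrease in the KL divergence.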

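To calculate the gradient, it is convenient to write the ELBO as

$$\mathrm{ELBO}(\phi) = \int q(\theta \mid D, \phi) \log p(D \mid \theta) \, d\theta - \int q(\theta \mid D, \phi) \log \frac{q(\theta \mid D, \phi)}{p(\theta)} \, d\theta.$$

The gradient of the first term, $\nabla_\phi \int q(\theta \mid D, \phi) \log p(D \mid \theta)\, d\theta = \nabla_\phi E_q \log p(D \mid \theta)$, is not an expectation and thus cannot be calculated directly using Monte Carlo methods. The idea is to represent the gradient as an expectation of some random variable, so that Monte Carlo techniques can be used to calculate it. There are two standard methods to do it. First, the log-derivative trick uses the identity $\nabla_x f(x) = f(x) \nabla_x \log f(x)$ to obtain $\nabla_\phi E_q \log p(D \mid \theta) = E_q\left[\log p(D \mid \theta)\, \nabla_\phi \log q(\theta \mid D, \phi)\right]$. Thus, if we select $q(\theta \mid D, \phi)$ so that it is easy to compute its derivative and generate samples from it, the gradient can be efficiently calculated using Monte Carlo techniques. Second, we can use the reparametrization trick by representing $\theta$ as a value of a deterministic function, $\theta = g(\epsilon, \phi)$, where $\epsilon \sim r(\epsilon)$ does not depend on $\phi$. The derivative is given by

$$\nabla_\phi E_q \log p(D \mid \theta) = \int r(\epsilon)\, \nabla_\phi \log p\left(D \mid g(\epsilon, \phi)\right) d\epsilon = E_\epsilon\left[\nabla_g \log p\left(D \mid g(\epsilon, \phi)\right) \nabla_\phi g(\epsilon, \phi)\right].$$

The reparametrization is trivial when $q(\theta \mid D, \phi) = N(\theta \mid \mu(D, \phi), \Sigma(D, \phi))$, and $\theta = \mu(D, \phi) + \Sigma(D, \phi)^{1/2} \epsilon$ with $\epsilon \sim N(0, I)$. [47] propose representing the parameters of such a Gaussian variational distribution as outputs of a neural network (multi-layer perceptron); the resulting approach was called the variational auto-encoder. A generalized reparametrization has been proposed by [73] and combines both log-derivative and reparametrization techniques by assuming that $\epsilon$ can depend on $\phi$.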
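Both gradient estimators can be illustrated on a one-dimensional Gaussian. The sketch below assumes a variational distribution $N(\mu, \sigma^2)$ and a simple test integrand $h(\theta) = \theta^2$ standing in for the log-likelihood, so that the exact gradients of $E_q[h] = \mu^2 + \sigma^2$ are known ($2\mu$ and $2\sigma$). The integrand, sample size, and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.0, 0.5
n = 200_000
eps = rng.standard_normal(n)
theta = mu + sigma * eps                 # reparametrization: theta = g(eps, (mu, sigma))

h = theta ** 2                           # test integrand; E_q[h] = mu^2 + sigma^2
dh = 2.0 * theta                         # derivative of the integrand in theta

# Reparametrization estimator: average of dh/dtheta times dtheta/dphi.
grad_mu_reparam = np.mean(dh * 1.0)      # dtheta/dmu = 1
grad_sigma_reparam = np.mean(dh * eps)   # dtheta/dsigma = eps

# Log-derivative (score function) estimator: average of h times d log q / d mu.
grad_mu_score = np.mean(h * (theta - mu) / sigma ** 2)
```

On this example both estimators target the same gradient, but the score-function estimate is noticeably noisier for the same number of samples, which is one practical motivation for reparametrization.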

## 3 Finding Good Bayes Predictors

The Bayesian paradigm provides novel insights into how to construct estimators with good predictive performance. The goal is simply to find a good predictive MSE, namely $E\left(\|Y - \hat{Y}\|^2\right)$, where $\hat{Y}$ denotes a prediction value. Stein shrinkage (a.k.a. regularization with an $L^2$ norm) is known to provide good mean squared error properties in estimation, namely $E\left(\|\hat{\theta} - \theta\|^2\right)$. These gains translate into predictive performance (in an iid setting) for $E\left(\|Y - \hat{Y}\|^2\right)$.

The main issue is how to tune the amount of regularisation (a.k.a. prior hyper-parameters). Stein’s unbiased estimator of risk provides a simple empirical rule to address this problem as does cross-validation. From a Bayes perspective, the marginal likelihood (and full marginal posterior) provides a natural method for hyper-parameter tuning. The issue is computational tractability and scalability. In the context of DL, the posterior for $(W, b)$ is extremely high dimensional and multimodal and posterior MAP provides good predictors $\hat{Y}(X)$.

Bayes conditional averaging performs well in high dimensional regression and classification problems. High dimensionality, however, brings with it the curse of dimensionality and it is instructive to understand why certain kernels can perform badly. Adaptive kernel predictors (a.k.a. smart conditional averagers) are of the form

Here is a deep predictor with its own trained parameters. For tree models, the kernel is a *cylindrical* region (open box set). Figure ? illustrates the implied kernels for trees (cylindrical sets) and random forests. Not too many points will be neighbors in a high dimensional input space.

[Figure: implied kernels. Panels: (a) tree kernel, (b) random forest kernel.]

Constructing the regions to perform conditional averaging is fundamental to reduce the curse of dimensionality. Imagine a large dataset, e.g. 100k images, and think about how a new image’s input coordinates are “neighbors” to data points in the training set. Our predictor is a smart conditional average of the observed outputs from our neighbors. When the input dimension $p$ is large, spheres ($L^2$ balls or Gaussian kernels) are terrible: degenerate cases occur in which either none or all of the points are “neighbors” of the new input variable. Tree-based models address this issue by limiting the number of “neighbors”.

Figure 1 further illustrates the challenge with the 2D image of 1000 uniform samples from a 50-dimensional ball . The image is calculated as , where and . Samples are centered around the equators and none of the samples fall anywhere close to the boundary of the set.

As dimensionality of the space grows, the variance of the marginal distribution goes to zero. Figure ? shows the histogram of 1D image of uniform sample from balls of different dimensionality, that is , where .

[Figure: histograms of the 1D image of uniform samples from balls of dimension (a) 100, (b) 200, (c) 300, (d) 400.]

Similar central limit results were known to Maxwell who has shown that the random variable is close to standard normal, when , is large, and is a unit vector (lies on the boundary of the ball), see [22]. More general results in this direction were obtained in [48] and [56] who presents many analytical and geometrical results for finite dimensional normed spaces, as the dimension grows to infinity.
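The shrinking marginal variance can be reproduced with a short simulation. The sketch below draws points uniformly from the unit ball in several dimensions and looks at the first coordinate; it relies on the standard fact that for the uniform distribution on the unit ball in $\mathbb{R}^d$ the variance of one coordinate is $1/(d+2)$. The sample size, the list of dimensions, and the function name are illustrative.

```python
import numpy as np

def sample_uniform_ball(n, d, rng):
    """Draw n points uniformly from the unit ball in R^d."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    radii = rng.random(n) ** (1.0 / d)                         # radius density ~ r^(d-1)
    return directions * radii[:, None]

rng = np.random.default_rng(6)
dims = [2, 50, 100, 400]
n_samples = 5000

# Empirical variance of the first coordinate (a 1D projection) in each dimension.
variances = {d: sample_uniform_ball(n_samples, d, rng)[:, 0].var() for d in dims}
# Closed form for the uniform ball: Var(x_1) = 1 / (d + 2).
theory = {d: 1.0 / (d + 2) for d in dims}
```

As the dimension grows the projection concentrates near zero, so almost no sample lands close to the boundary of the ball along any fixed direction.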

Deep learning can improve on traditional methods by performing a sequence of GLM-like transformations. Effectively DL learns a distributed partition of the input space. For example, suppose that we have partitions and a DL predictor that takes the form of a weighted average or soft-max of the weighted average for classification. Given a new high dimensional input , many deep learners are then an average of learners obtained by our hyper-plane decomposition. Our predictor takes the form

where are the weights learned in region , and is an indicator of the region with appropriate weighting given the training data.

The partitioning of the input space by a deep learner is similar to the one performed by decision trees and partition-based models such as CART, MARS, RandomForests, BART, and Gaussian Processes. Each neuron in a deep learning model corresponds to a manifold that divides the input space. In the case of the ReLU activation function the manifold is simply a hyperplane: the neuron is activated when the new observation is on the “right” side of this hyperplane, and the activation amount is equal to how far from the boundary the given point is. For example in two dimensions, three neurons with ReLU activation functions will divide the space into seven regions, as shown in Figure ?.
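The seven-region count can be reproduced by enumerating activation patterns on a grid. The three hyperplanes below ($x_1 = 0$, $x_2 = 0$, $x_1 + x_2 = 1$) are an illustrative choice of lines in general position, not the ones used in the paper's figure.

```python
import numpy as np

# Three ReLU neurons in 2D, one row of W and one entry of b per neuron (assumed values).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])

# Evaluate which neurons are active at every point of a dense grid.
axis = np.linspace(-3.0, 3.0, 301)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
active = (grid @ W.T + b) > 0

# Encode each on/off pattern as an integer in 0..7 and count the distinct ones.
codes = active @ np.array([1, 2, 4])
n_regions = np.unique(codes).size
```

Of the eight possible on/off patterns only seven occur: the third neuron cannot be active while the first two are both off, since $x_1 \le 0$ and $x_2 \le 0$ rule out $x_1 + x_2 > 1$.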

The key difference between tree-based architecture and neural network based models is the way hyper-planes are combined. Figure ? shows the comparison of space decomposition by hyperplanes, as performed by tree-based and neural network architectures. We compare a neural network with two layers (bottom row) with a tree model trained with the CART algorithm (top row). The network architecture is given by

The weight matrices for simple data , for circle data and , for spiral data we have and . In our notations, we assume that the activation function is applied pointwise at each layer. An advantage of deep architectures is that the number of hyper-planes grows exponentially with the number of layers. The key property of an activation function (link) is and it has zero value in certain regions. For example, hinge or rectified linear and box car (differences in Heaviside) functions are very common. As compared to a logistic regression, rather than using $\mathrm{sigmoid}(x)$, in deep learning $\tanh(x)$ is typically used for training, as $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$.

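[Figure: space decomposition by a tree model (top row) and a two-layer neural network (bottom row). Panels: (a) simple data, (b) circle data, (c) spiral data.]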

[4] provide an interesting discussion of efficiency. Formally, a Bayesian probabilistic approach (if computationally feasible) optimally weights predictors via model averaging with

Such rules can achieve optimal out-of-sample performance. [5] discusses the striking success of multiple randomized classifiers. Using a simple set of binary local features, one classification tree can achieve 5% error on the NIST data base with 100,000 training data points. On the other hand, 100 trees, trained under one hour, when aggregated, yield an error rate under 0.7%. This stems from the fact that a sample from a very rich and diverse set of classifiers produces, on average, weakly dependent classifiers conditional on class.

To further exploit this, consider the Bayesian model of weak dependence, namely exchangeability. Suppose that we have exchangeable, , and stacked predictors

Suppose that we wish to find weights, , to attain where convex in the second argument;

where . Hence, the randomised multiple predictor with weights provides the optimal Bayes predictive performance.

We now turn to algorithmic issues.

## 4 Algorithmic Issues

In this section we discuss two types of algorithms for training learning models. First, stochastic gradient descent, which is a very general algorithm that efficiently works for large scale datasets and has been used for many deep learning applications. Second, we discuss specialized statistical learning algorithms, which are tailored for certain types of traditional statistical models.

### 4.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a default gold standard for minimizing a function $f(W, b)$ (maximizing the likelihood) to find the deep learning weights and offsets. SGD simply minimizes the function by taking a negative step along an estimate $g^k$ of the gradient $\nabla f(W^k, b^k)$ at iteration $k$. The gradients are available via the chain rule applied to the superposition of semi-affine functions.

The approximate gradient is estimated by calculating

$$g^k = \frac{1}{|E_k|} \sum_{i \in E_k} \nabla \mathcal{L}_{W,b}\left(Y_i, \hat{Y}^k(X_i)\right),$$

where $E_k \subset \{1, \ldots, T\}$ and $|E_k|$ is the number of elements in $E_k$.

When $|E_k| > 1$ the algorithm is called batch SGD and simply SGD otherwise. Typically, the subset $E_k$ is chosen by going cyclically and picking consecutive elements of $\{1, \ldots, T\}$. The direction $g^k$ is calculated using the chain rule (a.k.a. back-propagation), providing an unbiased estimator of $\nabla f(W^k, b^k)$. Specifically, this leads to

$$E(g^k) = \frac{1}{T} \sum_{i=1}^{T} \nabla \mathcal{L}_{W,b}\left(Y_i, \hat{Y}^k(X_i)\right) = \nabla f(W^k, b^k).$$

At each iteration, SGD updates the solution

$$(W, b)^{k+1} = (W, b)^k - t_k g^k.$$

Deep learning algorithms use a step size $t_k$ (a.k.a. learning rate) that is either kept constant or reduced by a simple strategy, such as $t_k = a \exp(-kt)$. The hyper parameters of the reduction schedule are usually found empirically from numerical experiments and observations of the loss function progression.
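A minimal sketch of mini-batch SGD with an exponentially decaying step size, applied to a least squares problem on synthetic data. The data, the batch size, the schedule constants `a` and `decay`, and the number of iterations are all assumptions chosen for illustration; a deep network would replace the explicit gradient with back-propagation.

```python
import numpy as np

rng = np.random.default_rng(7)
T, p = 1000, 3
X = rng.normal(size=(T, p))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=T)

w = np.zeros(p)
batch_size, a, decay = 20, 0.1, 1e-3     # step size t_k = a * exp(-decay * k) (assumed)
for k in range(2000):
    # Cycle through consecutive indices to form the mini-batch E_k.
    start = (k * batch_size) % T
    idx = np.arange(start, start + batch_size)
    resid = X[idx] @ w - y[idx]
    g = 2.0 * X[idx].T @ resid / batch_size   # mini-batch estimate of the MSE gradient
    t_k = a * np.exp(-decay * k)
    w = w - t_k * g                           # SGD update

final_error = np.max(np.abs(w - w_true))
```

Each update touches only `batch_size` rows of the data, which is what makes the method practical for very large training sets.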

One caveat of SGD is that the descent in $f$ is not guaranteed, or it can be very slow at every iteration. Stochastic Bayesian approaches ought to alleviate these issues. The variance of the gradient estimate can also be near zero, as the iterates converge to a solution. To tackle those problems, coordinate descent (CD) and momentum-based modifications can be applied. The alternating direction method of multipliers (ADMM) can also provide a natural alternative, and leads to non-linear alternating updates, see [12].

The CD evaluates a single component $E_k$ of the gradient $\nabla f$ at the current point and then updates the $E_k$th component of the variable vector in the negative gradient direction. The momentum-based versions of SGD, or so-called accelerated algorithms, were originally proposed by [64]. For more recent discussion, see [65]. The momentum term adds memory to the search process by combining new gradient information with the previous search directions. Empirically, momentum-based methods have shown better convergence for deep learning networks [84]. The gradient only influences changes in the velocity of the update, which then updates the variable

$$v^{k+1} = \mu v^k - t_k\, g\left((W, b)^k\right), \qquad (W, b)^{k+1} = (W, b)^k + v^{k+1}.$$

The hyper-parameter $\mu$ controls the damping effect on the rate of update of the variables. The physical analogy is the reduction in kinetic energy that allows the movements to “slow down” at the minima. This parameter can also be chosen empirically using cross-validation.
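The effect of the momentum term can be seen on a small ill-conditioned quadratic. The sketch below uses exact gradients rather than stochastic ones to isolate the role of the velocity; the objective, step size, and momentum value are illustrative assumptions.

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy objective f(x) = 0.5 * x' A x

def grad(x):
    return A @ x

def run(mu, t=0.018, steps=200):
    """Gradient descent with momentum parameter mu; mu = 0 is plain gradient descent."""
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = mu * v - t * grad(x)         # velocity accumulates past gradients
        x = x + v                        # the variable moves along the velocity
    return 0.5 * x @ A @ x

loss_plain = run(mu=0.0)
loss_momentum = run(mu=0.8)
```

With the same step size, the plain iteration is held back by the flat direction of the quadratic, while the accumulated velocity lets the momentum iteration make faster progress along it.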

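Nesterov’s momentum method (a.k.a. Nesterov acceleration) calculates the gradient at the point predicted by the momentum. One can view this as a look-ahead strategy with updating scheme, for momentum parameter $\mu$ and step size $t_k$,

$$v^{k+1} = \mu v^k - t_k\, g\left((W, b)^k + \mu v^k\right), \qquad (W, b)^{k+1} = (W, b)^k + v^{k+1}.$$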

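Another popular modification is the family of AdaGrad methods [92], which adaptively scale each of the learning parameters at each iteration

$$c^{k+1} = c^k + g\left((W, b)^k\right)^2, \qquad (W, b)^{k+1} = (W, b)^k - t_k\, \frac{g\left((W, b)^k\right)}{\sqrt{c^{k+1}} + a},$$

where squares, square roots and division are taken element-wise and $a$ is a small positive number that prevents dividing by zero. `RMSprop` takes the `AdaGrad` idea further and places more weight on recent values of the gradient squared to scale the update direction, i.e. we have

$$c^{k+1} = d\, c^k + (1 - d)\, g\left((W, b)^k\right)^2.$$

The `Adam` method [46] combines both `RMSprop` and momentum methods, and leads to the following update equations

$$v^{k+1} = \mu v^k + (1 - \mu)\, g\left((W, b)^k\right), \qquad c^{k+1} = d\, c^k + (1 - d)\, g\left((W, b)^k\right)^2, \qquad (W, b)^{k+1} = (W, b)^k - t_k\, \frac{v^{k+1}}{\sqrt{c^{k+1}} + a}.$$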
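A compact sketch of the three adaptive schemes on a toy quadratic, written with exact gradients for clarity. The objective, step sizes, and decay constants are illustrative assumptions, and the Adam variant includes the standard bias-correction terms of the original Adam method. This is a teaching sketch, not a substitute for a library optimizer.

```python
import numpy as np

A = np.array([1.0, 50.0])                # diagonal of a toy quadratic (assumed)

def loss(x):
    return 0.5 * np.sum(A * x ** 2)

def grad(x):
    return A * x

def adagrad(x, steps=500, t=0.5, a=1e-8):
    c = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        c = c + g ** 2                          # accumulate all squared gradients
        x = x - t * g / (np.sqrt(c) + a)
    return x

def rmsprop(x, steps=500, t=0.05, d=0.9, a=1e-8):
    c = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        c = d * c + (1 - d) * g ** 2            # recent squared gradients weigh more
        x = x - t * g / (np.sqrt(c) + a)
    return x

def adam(x, steps=500, t=0.05, mu=0.9, d=0.999, a=1e-8):
    v, c = np.zeros_like(x), np.zeros_like(x)
    for k in range(1, steps + 1):
        g = grad(x)
        v = mu * v + (1 - mu) * g               # momentum (first moment)
        c = d * c + (1 - d) * g ** 2            # RMSprop-style second moment
        v_hat, c_hat = v / (1 - mu ** k), c / (1 - d ** k)   # bias correction
        x = x - t * v_hat / (np.sqrt(c_hat) + a)
    return x

x0 = np.array([1.0, 1.0])
final = {name: loss(fn(x0.copy()))
         for name, fn in [("adagrad", adagrad), ("rmsprop", rmsprop), ("adam", adam)]}
```

All three rescale each coordinate by a running measure of its gradient magnitude, so the poorly scaled second coordinate is treated on the same footing as the first.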

Second order methods solve the optimization problem by solving the system of nonlinear equations $\nabla f(W, b) = 0$ by applying Newton’s method

$$(W, b)^{+} = (W, b) - \left\{\nabla^2 f(W, b)\right\}^{-1} \nabla f(W, b).$$

Here SGD simply approximates $\nabla^2 f(W, b)$ by $1/t$. The advantages of a second order method include much faster convergence rates and insensitivity to the conditioning of the problem. In practice, second order methods are rarely used for deep learning applications [17]. The major disadvantage is their inability to train models using batches of data as SGD does. Since a typical deep learning model relies on large scale data sets, second order methods become memory and computationally prohibitive at even modest-sized training data sets.
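The contrast between a Newton step and a gradient step is easiest to see on a quadratic, where Newton's method reaches the minimizer in a single iteration. The matrix `A`, the vector `c`, and the step size `t` below are illustrative assumptions.

```python
import numpy as np

# Toy objective f(x) = 0.5 * x' A x - c' x with a symmetric positive definite A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])

def grad(x):
    return A @ x - c

def hess(x):
    return A

x0 = np.zeros(2)

# Newton step: solve the linear system instead of forming the inverse Hessian.
x_newton = x0 - np.linalg.solve(hess(x0), grad(x0))

# Gradient step: the Hessian is replaced by the scalar 1 / t.
t = 0.1
x_gd = x0 - t * grad(x0)

newton_residual = np.linalg.norm(grad(x_newton))
gd_residual = np.linalg.norm(grad(x_gd))
```

The linear solve costs on the order of the cube of the number of parameters, which is the practical obstacle for networks with millions of weights.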

### 4.2 Learning Shallow Predictors

Traditional factor models use linear combination of latent factors, ,

Here factors and weights can be found by solving the following problem

Then, we minimize the reconstruction error (a.k.a. accuracy), plus the regularization penalty, to control the variance-bias trade-off for out-of-sample prediction. Algorithms exist to solve this problem very efficiently. Such a model can be represented as a neural network model with an identity activation function.

The basic sliced inverse regression (SIR) model takes the form , where is a nonlinear function and , with , in other words, is a function of linear combinations of . To find , we first slice the feature matrix, then we analyze the data’s covariance matrices and slice means of , weighted by the size of slice. The function is found empirically by visually exploring relations. The key advantage of deep learning approach is that functional relation is found automatically. To extend the original SIR fitting algorithm, [44] proposed a variable selection under the SIR modeling framework. A partial least squares regression (PLS) [91] finds , a lower dimensional representation of and then regresses it onto via .

A deep learning least squares network arrives at a criterion function given by a negative log-posterior, which needs to be minimized. The penalized log-posterior, with $\phi$ denoting a generic regularization penalty, is given by
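A generic form consistent with this description, stated here as an assumption rather than as the authors' exact expression, is

$$\mathcal{L}(W, b) = \sum_{i=1}^{n} \big\| y_i - \hat{y}_{W,b}(x_i) \big\|^2 + \lambda\, \phi(W, b),$$

where $\hat{y}_{W,b}(x_i)$ is the network prediction for input $x_i$, the squared error term is the negative log-likelihood under Gaussian errors (up to constants), and $\lambda\, \phi(W, b)$ is the negative log-prior on the weights and offsets. The symbols $n$, $\lambda$ and $\hat{y}_{W,b}$ are introduced only for this illustration.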

[12] propose a method of auxiliary coordinates which replaces the original unconstrained optimization problem associated with model training by an alternative function in a constrained space. This alternative can be optimized using an alternating directions method and is thus highly parallelizable. Extensions of these methods include the ADMM and Divide and Concur (DC) algorithms; for further discussion see [69]. The gains from applying these to deep layered models, in an iterative fashion, appear to be large but have yet to be quantified empirically.

## 5Application: Predicting Airbnb Bookings

To illustrate our methodology, we use the dataset provided by the Airbnb Kaggle competition. This dataset, whilst not designed to optimize the performance of DL, provides a useful benchmark to compare and contrast traditional statistical models. The goal is to build a model that can predict in which country a new user will make their first booking. Though Airbnb offers bookings in more than 190 countries, there are 10 countries where users make frequent bookings. We treat the problem as classification into one of 12 classes (10 major countries + other + NDF), where *other* corresponds to any country that is not in the list of top 10 and *NDF* corresponds to situations where no booking was made.

The data consists of two tables: one contains the attributes of each user and the other contains data about each user's sessions on the Airbnb website. The user data contains demographic characteristics, the type of device and browser used to sign up, and the destination country of the first booking, which is our dependent variable $Y$. The data involves 213,451 users and 10,567,737 individual sessions. The sessions data contains information about actions taken during each session, duration and devices used. Both datasets have a large number of missing values. For example, age information is missing for 42% of the users. Figure ?(a) shows that nearly half of the gender data is missing and that there is a slight imbalance between the genders.

Figure: (a) Number of observations for each gender; (b) Percent of reservations per destination; (c) Relationship between age and gender.

Figure ?(b) shows the destination country of the first booking by gender. Most of the entries in the destination column are NDF, meaning no booking was made by the user. Further, Figure ?(c) shows the relationship between gender and age: the gender value is missing for most of the users who did not identify their age.

We find that there is little difference in booking behavior between the genders. However, as we will see later, the fact that gender was specified is an important predictor. Intuitively, users who filled in the gender field are more likely to book.

On the other hand, as Figure ? shows, the age variable does play a role.

Figure: (a) Empirical distribution of user's age; (b) Destination by age category; (c) Destination by age group.

Figure ?(a) shows that most of the users are between 25 and 40 years old. Furthermore, the two age cohorts, users younger than 45 and users older than 45, have very different booking behavior (see Figure ?(b)). Further, as we can see from Figure ?(c), half of the users who did not book did not identify their age either.

Another effect of interest is the non-linearity between the time the account was created and booking behavior. Figure 3 shows that “old timers” are more likely to book when compared to recent users.

Since the number of records in the sessions data is different for each user, we developed features from those records so that the sessions data can be used for prediction. The general idea is to convert multiple session records into a single set of features per user. The features we calculate are:

- Number of sessions records
- For each action type, we calculate the count and standard deviation
- For each device type, we calculate the count and standard deviation
- For session duration we calculate mean, standard deviation and median
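The feature list above can be sketched as a per-user aggregation. This is a minimal illustration under stated assumptions: the record keys (`user_id`, `action_type`, `device_type`, `secs_elapsed`) are hypothetical names, the standard deviation for each action or device type is taken over session durations within that type (the text does not say which quantity is used), and records with missing durations are assumed to have been removed beforehand.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def _std(values):
    """Population standard deviation; 0.0 when there are no values."""
    return pstdev(values) if len(values) > 0 else 0.0


def session_features(records):
    """Collapse many session records into one feature dict per user."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["user_id"]].append(rec)

    features = {}
    for user_id, rows in by_user.items():
        row = {"n_records": len(rows)}
        # Count and spread for every action type and device type seen.
        for column in ("action_type", "device_type"):
            durations_by_level = defaultdict(list)
            for rec in rows:
                durations_by_level[rec[column]].append(rec["secs_elapsed"])
            for level, durations in durations_by_level.items():
                row[f"{column}={level}:count"] = len(durations)
                row[f"{column}={level}:std"] = _std(durations)
        # Summary statistics of session duration.
        all_durations = [rec["secs_elapsed"] for rec in rows]
        row["duration_mean"] = mean(all_durations)
        row["duration_std"] = _std(all_durations)
        row["duration_median"] = median(all_durations)
        features[user_id] = row
    return features


# Toy records, invented for illustration only.
toy_records = [
    {"user_id": "u1", "action_type": "click", "device_type": "Mac Desktop", "secs_elapsed": 10.0},
    {"user_id": "u1", "action_type": "click", "device_type": "iPhone", "secs_elapsed": 30.0},
    {"user_id": "u1", "action_type": "view", "device_type": "Mac Desktop", "secs_elapsed": 20.0},
    {"user_id": "u2", "action_type": "view", "device_type": "iPhone", "secs_elapsed": 5.0},
]
toy_features = session_features(toy_records)
```

Each user ends up with one row of features, so the result can be joined to the user table regardless of how many session records a user has.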

Furthermore, we use one-hot encoding for categorical variables from the user table, e.g. gender, language, affiliate provider, etc. One-hot encoding replaces a categorical variable with $K$ categories by $K$ binary dummy variables.
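A minimal sketch of one-hot encoding is given below. The function name and the example category labels are hypothetical. Treating a missing value as its own category is an assumption, chosen so that whether a field was filled in remains available as a predictor.

```python
def one_hot_encode(values):
    """Map a list of category labels to binary indicator rows.

    Returns the sorted list of categories and one row per input value,
    where each row has a single 1 in the column of its category.
    """
    categories = sorted(set(values))
    column_of = {category: j for j, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical gender column in which a missing value is its own level.
gender = ["FEMALE", "MALE", "-unknown-", "MALE"]
gender_levels, gender_dummies = one_hot_encode(gender)
```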

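We build a deep learning model with two hidden dense layers and ReLU activation function $f(x) = \max(x, 0)$. We use ADAGRAD optimization to train the model. We predict probabilities of future destination booking for each of the new users. The evaluation metric for this competition is NDCG (normalized discounted cumulative gain). We use the top five predicted destinations, and the metric is calculated as

$$\mathrm{NDCG}_5 = \frac{1}{n} \sum_{i=1}^{n} \mathrm{DCG}_5^i,$$

where $n$ is the number of users, $\mathrm{DCG}_5^i = 1/\log_2(\mathrm{position}(i) + 1)$ and $\mathrm{position}(i)$ is the position of the true destination in the list of five predicted destinations. For example, if for a particular user the destination is FR, and FR was at the top of the list of five predicted countries, then

$$\mathrm{NDCG} = \frac{1}{\log_2(1 + 1)} = 1.0.$$

When FR is second, e.g. the model prediction (US, FR, DE, NDF, IT), this gives

$$\mathrm{NDCG} = \frac{1}{\log_2(2 + 1)} = \frac{1}{1.58496} = 0.6309.$$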
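The metric can be sketched in a few lines. The function names are hypothetical, and scoring a user as zero when the true destination is absent from the top five is an assumption that follows the usual NDCG convention.

```python
import math


def dcg_at_5(true_destination, predicted):
    """Discounted gain for one user with a single relevant destination."""
    top_five = list(predicted)[:5]
    if true_destination not in top_five:
        return 0.0  # assumption: no credit outside the top five
    position = top_five.index(true_destination) + 1  # 1-based rank
    return 1.0 / math.log2(position + 1)


def ndcg_at_5(true_destinations, predictions):
    """Average the per-user scores over all users."""
    scores = [dcg_at_5(t, p) for t, p in zip(true_destinations, predictions)]
    return sum(scores) / len(scores)


score_first = dcg_at_5("FR", ["FR", "US", "DE", "NDF", "IT"])
score_second = dcg_at_5("FR", ["US", "FR", "DE", "NDF", "IT"])
score_both = ndcg_at_5(
    ["FR", "FR"],
    [["FR", "US", "DE", "NDF", "IT"], ["US", "FR", "DE", "NDF", "IT"]],
)
```

Because each user has exactly one true destination, the ideal score is 1 and no further normalization is needed.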

We trained our deep learning network for 20 epochs with a mini-batch size of 256. For a hold-out sample we used 10% of the data, namely 21,346 observations. The fitting algorithm evaluates the function at every epoch to monitor the improvement in the quality of predictions from epoch to epoch. It takes approximately 10 minutes to train, whereas the variational inference approach is computationally prohibitive at this scale.

Our model uses a two-hidden layer architecture with ReLU activation functions

The weight matrices for simple data , . In our notation, we assume that the activation function is applied point-wise at each layer.
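A forward pass through a network of this shape can be sketched as follows. The layer sizes, the toy weights and the softmax output layer are assumptions made for illustration. The text specifies only two hidden dense layers with ReLU activations applied point-wise and predicted class probabilities as output.

```python
import math


def relu(vector):
    """Point-wise ReLU activation."""
    return [max(x, 0.0) for x in vector]


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weights, bias)
    ]


def softmax(vector):
    """Numerically stable softmax that returns class probabilities."""
    shift = max(vector)
    exps = [math.exp(x - shift) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, params):
    """Two hidden ReLU layers followed by a softmax output layer."""
    (w1, b1), (w2, b2), (w3, b3) = params
    z1 = relu(affine(w1, b1, x))
    z2 = relu(affine(w2, b2, z1))
    return softmax(affine(w3, b3, z2))


# Toy parameters: 2 inputs, two hidden layers of width 2, 3 output classes.
toy_params = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
toy_probs = forward([1.0, -1.0], toy_params)
```

For the Airbnb problem the output layer would have 12 units, one per destination class, and the five largest probabilities would give the ranked list used by the NDCG metric.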

The resulting model has an out-of-sample NDCG of . The classes are imbalanced in this problem. Table 1 shows the percentage of each class in the out-of-sample data set.

Dest | AU | CA | DE | ES | FR | GB | IT | NDF | NL | PT | US | other |
---|---|---|---|---|---|---|---|---|---|---|---|---|
% obs | 0.3 | 0.6 | 0.5 | 1 | 2.2 | 1.2 | 1.2 | 59 | 0.31 | 0.11 | 29 | 4.8 |

Figure 4 shows out-of-sample NDCG for each of the destinations.

Figure ? shows accuracy of prediction for each of the destination countries.

Figure: (a) first; (b) second; (c) third.

The model accurately predicts bookings in the US, FR and *other* when the top three predictions are considered.

Furthermore, we compared the performance of our deep learning model with the XGBoost algorithm [13] for fitting a gradient boosted tree model. The performance of the model is comparable and yields an NDCG of . One of the advantages of the tree-based model is its ability to calculate the importance of each of the features [36]. Figure 5 shows the variable importance calculated from our XGBoost model.

The importance scores calculated by the XGBoost model confirm our exploratory data analysis findings. In particular, we see that the fact that a user specified gender is a strong predictor. The number of sessions on the Airbnb site recorded for a given user before booking is a strong predictor as well. Intuitively, users who visited the site multiple times are more likely to book. Further, web users who signed up via devices with large screens are also more likely to book.

## 6Discussion

We view deep learning as a high dimensional nonlinear data reduction scheme, generated probabilistically as a stacked generalized linear model (GLM). This sheds light on how to train a deep architecture using SGD, a first order gradient method for finding a posterior mode in a very high dimensional space. By taking a predictive approach, where regularization learns the architecture, deep learning has been very successful in many fields.

There are many areas of future research for Bayesian deep learning, which include:

- Viewing deep learning probabilistically as stacked GLMs allows for many statistical models, such as exponential family models and heteroscedastic errors.
- Bayesian hierarchical models have similar advantages to deep learners. Hierarchical models include extra stochastic layers and provide extra interpretability and flexibility.
- Viewing deep learning as a Gaussian Process allows for exact Bayesian inference [63]. The Gaussian Process connection opens opportunities to develop more flexible and interpretable models for engineering [34] and natural science applications [7].
- With gradient information easily available via the chain rule (a.k.a. back propagation), a new avenue of stochastic methods to fit networks exists, such as MCMC, HMC, proximal methods, and ADMM, which could dramatically reduce the time to train deep learners.
- Comparison with traditional Bayesian non-parametric approaches, such as treed Gaussian process models [33] and BART [14], or using hyperplanes in Bayesian non-parametric methods, ought to yield good predictors [28].
- Improved Bayesian algorithms for hyper-parameter training and optimization [81]. Langevin diffusion MCMC, proximal MCMC and Hamiltonian Monte Carlo (HMC) can exploit the derivatives as well as Hessian information [69].
- Rather than searching a grid of values with the goal of minimising out-of-sample mean squared error, one could place further regularisation penalties (priors) on these parameters and integrate them out.

MCMC methods also have lots to offer to DL and can be included seamlessly in `TensorFlow` [1]. Given the availability of high performance computing, it is now possible to implement high dimensional posterior inference on large data sets, see [16]. The same advantages are now available for Bayesian inference. Further, we believe deep learning models have a bright future in many fields of application, such as finance, where DL is a form of nonlinear factor model [37], with each layer capturing different time scale effects, and spatio-temporal data is viewed as an image in space-time [25]. In summary, the Bayes perspective adds helpful interpretability; however, the full power of a Bayes approach has still not been explored. From a practical perspective, current regularization approaches have provided great gains in predictive model power for recovering nonlinear complex data relationships.

### Footnotes

- Polson is Professor of Econometrics and Statistics at the Chicago Booth School of Business. email: ngp@chicagobooth.edu. Sokolov is an assistant professor at George Mason University, email: vsokolov@gmu.edu

### References

1. **TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.** Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. URL `http://tensorflow.org/`.
2. **A learning algorithm for Boltzmann machines.** David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. Cognitive Science.
3. **Learning the structure of deep sparse graphical models.** Ryan Adams, Hanna Wallach, and Zoubin Ghahramani. In *Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*, pages 1–8, 2010.
4. **Shape Quantization and Recognition with Randomized Trees.** Y. Amit and D. Geman. Neural Computation.
5. **Multiple randomized classifiers: MRCL.** Yali Amit, Gilles Blanchard, and Kenneth Wilder. 2000.
6. **Deep speech 2: End-to-end speech recognition in English and Mandarin.** Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. In *International Conference on Machine Learning*, pages 173–182, 2016.
7. **Gaussian predictive process models for large spatial data sets.** Sudipto Banerjee, Alan E Gelfand, Andrew O Finley, and Huiyan Sang. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
8. **Ensemble learning in Bayesian neural networks.** David Barber and Christopher M Bishop. Neural Networks and Machine Learning.
9. **Weight uncertainty in neural networks.** Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. arXiv preprint arXiv:1505.05424.
10. **Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author).** Leo Breiman. Statistical Science.
11. *Analysis of Kolmogorov’s superposition theorem and its implementation in applications with low and high dimensional data.* Donald W. Bryant.
12. **Distributed optimization of deeply nested systems.** Miguel A Carreira-Perpiñán and Weiran Wang. In *AISTATS*, pages 10–19, 2014.
13. **XGBoost: A scalable tree boosting system.** Tianqi Chen and Carlos Guestrin. CoRR.
14. **BART: Bayesian additive regression trees.** Hugh A Chipman, Edward I George, Robert E McCulloch, et al. The Annals of Applied Statistics.
15. **Fisher Lecture: Dimension Reduction in Regression.** R. Dennis Cook. Statistical Science.
16. **Large Scale Distributed Deep Networks.** Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, *Advances in Neural Information Processing Systems 25*, pages 1223–1231. Curran Associates, Inc., 2012a.
17. **Large scale distributed deep networks.** Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and others. In *Advances in Neural Information Processing Systems*, pages 1223–1231, 2012b.
18. **DeepMind AI Reduces Google Data Centre Cooling Bill by 40%.** DeepMind. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
19. **The story of AlphaGo so far.** DeepMind. https://deepmind.com/research/alphago/
20. **Large automatic learning, rule extraction, and generalization.** John Denker, Daniel Schwartz, Ben Wittner, Sara Solla, Richard Howard, Lawrence Jackel, and John Hopfield. Complex Systems.
21. **On Nonlinear Functions of Linear Combinations.** P. Diaconis and M. Shahshahani. SIAM Journal on Scientific and Statistical Computing.
22. **A dozen de Finetti-style results in search of a theory.** Persi Diaconis and David Freedman. In *Annales de l’IHP Probabilités et statistiques*, volume 23, pages 397–423, 1987.
23. **Generating a random permutation with random transpositions.** Persi Diaconis and Mehrdad Shahshahani. Probability Theory and Related Fields.
24. **Consistency of Bayes estimates for nonparametric regression: normal theory.** Persi W Diaconis, David Freedman, et al. Bernoulli.
25. **Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading.** Matthew F. Dixon, Nicholas G. Polson, and Vadim O. Sokolov. arXiv:1705.09851 [stat].
26. **Dermatologist-level classification of skin cancer with deep neural networks.** Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Nature.
27. *An introduction to probability theory and its applications.* William Feller.
28. *BASS: Bayesian Adaptive Spline Surfaces.* Devin Francom. 2017.
29. **A statistical view of some chemometrics regression tools.** Ildiko E Frank and Jerome H Friedman. Technometrics.
30. **Variational learning in nonlinear Gaussian belief networks.** Brendan J Frey and Geoffrey E Hinton. Neural Computation.
31. **A theoretically grounded application of dropout in recurrent neural networks.** Yarin Gal. arXiv:1512.05287.
32. **Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.** Yarin Gal and Zoubin Ghahramani. In *International Conference on Machine Learning*, pages 1050–1059, 2016.
33. *Bayesian treed Gaussian process models.* Robert B Gramacy.
34. **Particle learning of Gaussian process models for sequential design and optimization.** Robert B Gramacy and Nicholas G Polson. Journal of Computational and Graphical Statistics.
35. **Practical variational inference for neural networks.** Alex Graves. In *Advances in Neural Information Processing Systems*, pages 2348–2356, 2011.
36. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition.* Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
37. **Deep learning in finance.** JB Heaton, NG Polson, and JH Witte. arXiv preprint arXiv:1602.06561.
38. **Deep portfolio theory.** JB Heaton, NG Polson, and JH Witte. arXiv preprint arXiv:1605.07230.
39. **Deep learning for finance: deep portfolios.** JB Heaton, NG Polson, and Jan Hendrik Witte. Applied Stochastic Models in Business and Industry.
40. **Probabilistic backpropagation for scalable learning of Bayesian neural networks.** José Miguel Hernández-Lobato and Ryan Adams. In *International Conference on Machine Learning*, pages 1861–1869, 2015.
41. **Reducing the dimensionality of data with neural networks.** G. E. Hinton and R. R. Salakhutdinov. Science (New York, N.Y.).
42. **Optimal perceptual inference.** Geoffrey E Hinton and Terrence J Sejnowski. In *Proceedings of the IEEE conference on Computer Vision and Pattern Recognition*, pages 448–453. IEEE New York, 1983.
43. **Keeping the neural networks simple by minimizing the description length of the weights.** Geoffrey E Hinton and Drew Van Camp. In *Proceedings of the sixth annual conference on Computational learning theory*, pages 5–13. ACM, 1993.
44. **Sliced inverse regression with variable selection and interaction detection.** Bo Jiang and Jun S Liu. arXiv preprint arXiv:1304.4056.
45. **Airbnb New User Bookings.** Kaggle. https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
46. **Adam: A method for stochastic optimization.** Diederik Kingma and Jimmy Ba. arXiv preprint arXiv:1412.6980.
47. **Auto-encoding variational Bayes.** Diederik P Kingma and Max Welling. arXiv preprint arXiv:1312.6114.
48. **A central limit theorem for convex sets.** Bo’az Klartag. Inventiones mathematicae.
49. **On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.** Andrei Nikolaevich Kolmogorov. American Mathematical Society Translation.
50. **Artificial intelligence used to identify skin cancer.** Taylor Kubota. http://news.stanford.edu/2017/01/25/artificial-intelligence-used-identify-skin-cancer/
51. **Probabilistic non-linear principal component analysis with Gaussian process latent variable models.** Neil Lawrence. Journal of Machine Learning Research.
52. *Bayesian Nonparametrics via Neural Networks.* H. Lee.
53. **Deep neural networks as Gaussian processes.** Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. arXiv preprint arXiv:1711.00165.
54. **A practical Bayesian framework for backpropagation networks.** David JC MacKay. Neural Computation.
55. **Some comments on Cp.** Colin L Mallows. Technometrics.
56. *Asymptotic theory of finite dimensional normed spaces: Isoperimetric inequalities in Riemannian manifolds.* Vitali D Milman and Gideon Schechtman. Volume 1200.
57. **Neural variational inference and learning in belief networks.** Andriy Mnih and Karol Gregor. arXiv preprint arXiv:1402.0030.
58. **When Does a Mixture of Products Contain a Product of Mixtures?** Guido F. Montúfar and Jason Morton. SIAM Journal on Discrete Mathematics.
59. **Issues in Bayesian Analysis of Neural Network Models.** Peter Müller and David Rios Insua. Neural Computation.
60. **Learning stochastic feedforward networks.** Radford M Neal. Department of Computer Science, University of Toronto.
61. **Bayesian training of backpropagation networks by the hybrid Monte Carlo method.** Radford M Neal. Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto, 1992.
62. **Bayesian learning via stochastic dynamics.** Radford M Neal. Advances in Neural Information Processing Systems.
63. **Priors for infinite networks.** Radford M Neal. In *Bayesian Learning for Neural Networks*, pages 29–53. Springer, 1996.
64. **A method of solving a convex programming problem with convergence rate $O(1/k^2)$.** Yurii Nesterov. In *Soviet Mathematics Doklady*, volume 27, pages 372–376, 1983.
65. *Introductory lectures on convex optimization: A basic course.* Yurii Nesterov. Volume 87.
66. **How to Construct Deep Recurrent Neural Networks.** Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. arXiv:1312.6026 [cs, stat].
67. **Deep Learning: Mathematics and Neuroscience.** T. Poggio. A Sponsored Supplement to Science.
68. **Deep learning for short-term traffic flow prediction.** Nicholas G. Polson and Vadim O. Sokolov. Transportation Research Part C: Emerging Technologies.
69. **Proximal algorithms in statistics and machine learning.** Nicholas G. Polson, James G. Scott, Brandon T. Willard, and others. Statistical Science.
70. **A statistical theory of deep learning via proximal splitting.** Nicholas G Polson, Brandon T Willard, and Massoud Heidari. arXiv preprint arXiv:1509.06061.
71. **Stochastic backpropagation and approximate inference in deep generative models.** Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. arXiv preprint arXiv:1401.4082.
72. **Neural networks and related methods for classification.** Brian D Ripley. Journal of the Royal Statistical Society. Series B (Methodological).
73. **The generalized reparameterization gradient.** Francisco R Ruiz, Michalis Titsias, and David Blei. In *Advances in Neural Information Processing Systems*, pages 460–468, 2016.
74. **Learning and evaluating Boltzmann machines.** Ruslan Salakhutdinov. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto.
75. **Deep Boltzmann machines.** Ruslan Salakhutdinov and Geoffrey Hinton. In *Artificial Intelligence and Statistics*, pages 448–455, 2009.
76. **Mean field theory for sigmoid belief networks.** Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Journal of Artificial Intelligence Research.
77. **Deep learning in neural networks: An overview.** Jürgen Schmidhuber. Neural Networks.
78. **Very deep convolutional networks for large-scale image recognition.** Karen Simonyan and Andrew Zisserman. 2014.
79. **Nonlinear black-box modeling in system identification: a unified overview.** Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. Automatica.
80. **Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1.** P. Smolensky. Pages 194–281. MIT Press, Cambridge, MA, USA, 1986.
81. **Practical Bayesian optimization of machine learning algorithms.** Jasper Snoek, Hugo Larochelle, and Ryan P Adams. In *Advances in Neural Information Processing Systems*, pages 2951–2959, 2012.
82. **A survey of solved and unsolved problems on superpositions of functions.** David A. Sprecher. Journal of Approximation Theory.
83. **Dropout: a simple way to prevent neural networks from overfitting.** Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Journal of Machine Learning Research.
84. **On the importance of initialization and momentum in deep learning.** Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. In *International Conference on Machine Learning*, pages 1139–1147, 2013.
85. **Sequence to sequence learning with neural networks.** Ilya Sutskever, Oriol Vinyals, and Quoc V Le. In *Advances in Neural Information Processing Systems*, pages 3104–3112, 2014.
86. **Training restricted Boltzmann machines using approximations to the likelihood gradient.** Tijmen Tieleman. In *Proceedings of the 25th International Conference on Machine Learning*, pages 1064–1071. ACM, 2008.
87. **Linear superpositions of functions.** A. G. Vitushkin and G. M. Khenkin. Russian Mathematical Surveys.
88. **Exponential family harmoniums with an application to information retrieval.** Max Welling, Michal Rosen-Zvi, and Geoffrey E Hinton. In *Advances in Neural Information Processing Systems*, pages 1481–1488, 2005.
89. **Computing with infinite networks.** Christopher KI Williams. In *Advances in Neural Information Processing Systems*, pages 295–301, 1997.
90. **Causal inference from observational data: A review of end and means.** Herman Wold. Journal of the Royal Statistical Society. Series A (General).
91. **PLS-regression: a basic tool of chemometrics.** Svante Wold, Michael Sjöström, and Lennart Eriksson. Chemometrics and Intelligent Laboratory Systems.
92. **Adadelta: an adaptive learning rate method.** Matthew D Zeiler. arXiv preprint arXiv:1212.5701.