A Statistical Theory of Deep Learning via Proximal Splitting
This Draft: \monthyearJuly 3, 2019
In this paper we develop a statistical theory and an implementation of deep learning (DL) models. We show that an elegant variable splitting scheme for the alternating direction method of multipliers (ADMM) optimises a deep learning objective. We allow for non-smooth non-convex regularisation penalties to induce sparsity in parameter weights. We provide a link between traditional shallow layer statistical models such as principal component and sliced inverse regression and deep layer models. We also define the degrees of freedom of a deep learning predictor and a predictive MSE criteria to perform model selection for comparing architecture designs. We focus on deep multi-class logistic learning although our methods apply more generally. Our results suggest an interesting and previously under-exploited relationship between deep learning and proximal splitting techniques. To illustrate our methodology, we provide a multi-class logit classification analysis of Fisher’s Iris data where we illustrate the convergence of our algorithm. Finally, we conclude with directions for future research.
Keywords: Deep Learning, Sparsity, Dropout, Convolutional Neural Nets; Regularisation; Bayesian MAP; Image Segmentation; Classification; Multi-class Logistic regression.
Deep Learning (DL) provides a powerful tool for high dimensional data reduction. Many areas of applications in predictive modeling occur in artificial intelligence and machine learning; including pattern recognition Ripley (1996); computer vision Dean et al. (2012); image segmentation and scene parsing Farabet et al. (2013); predictive diagnostics; intelligent gaming Mnih et al. (2013). The salient feature of a deep learning model is a predictive rule comprised by a layered composition of link or activation functions. Deep architectures with at least three layers have been shown to provide improved predictive performance compared to traditional shallow architectures in a number of applications. The challenge for deep learning methodologies, however, are computational: the objective function which measures model fit is typically highly multi-modal and hard to optimise efficiently.
We build on the extensive deep learning literature by showing that proximal Bayesian optimisation techniques provide a turn-key solution to estimation and optimisation of such models and for calculating a regularisation path. We allow for the possibility of irregular non-convex regularisation penalties to induce sparsity in the deep layer weights. Proximal operators and the alternating direction method of multipliers (ADMM) are the key tools for implementation. This approach simply re-writes the unconstrained optimisation as a constrained one, with a carefully constructed sequence of auxiliary variables and envelopes, to deal with the associated augmented Lagrangian; see Parikh and Boyd (2014); Polson et al. (2015); Green et al. (2015) for recent surveys. Proximal algorithms have achieved great success and provide a methodology for incorporating irregular non-differentiable penalties; see (Masci et al., 2013) for applications in the areas of computer vision and signal processing.
From a statistical perspective, DL models can be viewed as generalised linear models (GLM Davison (2003); Dellaportas and Smith (1993)) with recursively defined link functions. Traditional statistical models commonly use shallow networks containing at most two layers or hierarchies. For example, reduced rank regression can be viewed as a deep learning model with only two layers and linear links. Support vector machines for classification Polson and Scott (2013) use a predictive rule based on a rectified linear (or hinge) link. Recent empirical evidence, however, suggests improved statistical predictive performance with deep architectures of at least three layers. Our focus here will be on developing fast learning methods for deep multi-class logistic models although our methods apply more generally to recurrent and convolutional neural nets. Although well known in the Neural Network and Statistics literature (Knowles and Minka, 2011), efficient estimation of the cross-entropy/multinomial loss with regularization–outside of the -ridge penalty–has not been a mainstream area of research. See Madigan et al. (2005) and Genkin et al. (2007) for applications to large scale multinomial logistic models. In general, the -ridge penalty is commonplace (Poggio and Girosi, 1990; Orr, 1995), mostly due to its differentiability. Our methods can therefore be seen as related to sparse Bayes MAP techniques; see Titterington (2004); Windle et al. (2013); Polson et al. (2015) for high dimensional data reduction.
Mainstream estimation within Deep Learning broadly revolves around gradient descent methods. The main variation in techniques arises from general considerations for computational complexity and introduce more tuning parameters (Ngiam et al., 2011). Such considerations are the basis for Stochastic Gradient Descent (SGD) and back-propagation. For instance, back-propagation uses the chain rule for the derivative of the composite of activation functions. This can reduce the order-of-operations dramatically from naive direct evaluation while maintaining high numerical accuracy; however, this says little about the general difficultly in estimation of a non-linear objective function. Direct gradient methods can be poorly scaled for the estimation of deep layer weights, in contrast to our proximal splitting approach which overcomes this by providing a simultaneous block update of parameters at all layers. The largest networks (e.g. Dean et al. (2012)) are currently trained using asynchronous SGD. Farabet et al. (2013) discusses hardware approaches to faster algorithms. Providing training methodologies is a very active field of research.
The splitting techniques common in the proximal framework do exist in the AI literature but we believe that their broad applicability and functional coverage has mostly been overlooked and under-exploited. This is possibly due to the aforementioned concerns of computational complexity that makes stochastic gradient descent (SGD) and back-propagation methods so popular, although it’s quite possible that the number of iterations to step-complexity can favor (parallel) proximal methods in some cases. We show that our augmented ADMM approach is embarrassingly parallel with block updates for parameters and auxiliary variables being directly available due to our carefully chosen splitting procedure.
Traditional approaches to deep learning use Back-propagation LeCun et al. (2012), Hinton et al. (2006), Hinton and Salakhutdinov (2006) which rely on the chain rule for the derivatives of the composite of the layers in the network. We propose a proximal splitting approach which also lends itself to the inclusion of a non-differentiable regularisation penalty term so as to induce sparsity. Combettes and Pesquet (2011) detail multiple splitting methods in the context of proximal operators as well as parallel proximal algorithms that handle many splitting variables. Wang and Carreira-Perpinán (2012) perform splitting in a similar context as ours, but with an loss and quadratic barrier approach to handle the equality constraints. In contrast, we generalize to non- and detail a general envelope approach that includes the well known augmented Lagrangian. Our general envelope approach allows one to utilize efficient bounds, such as those found in Bouchard (2007) and Knowles and Minka (2011).
The rest of the paper is outlined as follows. Section 2 introduces deep learning predictors. We show that standard statistical models such as principal components analysis, reduced rank and sliced inverse regression can be viewed as shallow architecture models. We also motivate the addition of further layers to aid in the regularised prediction problem. We illustrate how proximal splitting methods solve a simple shallow architecture model. Section 3 describes the general deep learning problem and its solution via proximal splitting methods. Supervised learning uses a training dataset to estimate the parameters of each layer of the network. We define the degrees of freedom of a deep learning predictive rule which quantifies the trade-off between model fit and predictive performance. Section 5 provides a comparison of DL and Neural Network (NN) models in a multi-class deep logistic classification model for Fisher’s Iris data. We also illustrate the convergence of our algorithm. Finally, Section 6 concludes with directions for future research.
2 Deep Learning Predictors
Let where for regression and for classification. Here denotes an observed output associated with a high dimensional input/covariate variable given by and . The generic problem is to find a non-linear predictor of the output where are parameters which will be estimated from a training dataset of outputs and inputs, denoted by .
When the dimensionality of is high, a common approach is to use a data reduction technique. This is achieved by introducing a low-dimensional auxiliary variable where and constructing a prediction rule specified by a composition of functions,
Now the problem of high dimensional data reduction is to find the -variables using training data and to estimate the layer functions . One reason for the success of deep learning is the regularisation achieved through the hidden layer low-dimensional variable. A hallmark of deep learning models is also the use of nonlinear layers. As such, one can view them as hierarchical nonlinear factor models or more specifically as generalised linear models (GLM) with recursively defined nonlinear link functions.
The key to a layering approach is to uncover the low-dimensional -structure in a way that doesn’t disregard information about predicting the output . We will show that our splitting scheme naturally uses the hidden layer -variables.
For example, Principal component analysis (PCA) reduces using a SVD decomposition Wold (1956). This type of dimension reduction is independent of and can easily discard information that is valuable for predicting the output. Sliced inverse regression (SIR) Cook and Lee (1999), Cook (2007) overcomes this by estimating the layer function , independently of , using data on both . Deep learning takes this one step further and jointly estimates and in tandem using training data on both pairs .
A common approach is to introduce parameters, , at each layer by convolution with linear maps which we denote by and assuming centered variables. Statistical models are traditionally shallow architectures with at most two layers. Reduced rank regression corresponds to linear link functions with where . The dimensionality reduction then becomes an eigen-problem for and . Radial basis functions/kernel sliced inverse regression uses where is a set of kernels/basis functions. In many cases, we will also add a penalty, , for parameter estimation. The term is a regularization term that imposes structure or effects a favorable bias-variance trade-off in prediction. We need to be able to account for non-smooth penalties such as lasso or bridge penalty to induce sparsity, where is a scaling parameter that traces out a full regularisation path.
A deep learning predictor takes the form of an -layered convolution rule of link functions where
We define the set of layer -variables are given by assuming that the variables are centered.
The original motivation for recursive deep learning predictors resides in the seminal work of Kolmogorov (1957), Lorenz (1963) and the extensions derived by Barron (1993), Poggio and Girosi (1990) and Bach (2014). See Paige and Butler (2001) for a Bayesian approach to neural networks. The motivation behind layered convolution maps lies in the following completeness result. Given a Gaussian nonparametric regression with any continuous function, on and Poggio and Girosi (1990), there exists one dimensional link functions and such that
This is a four layer network with a large number of units at the first layer. On the practical side, one may wish to build a deep network with more layers but less units at each stage. Section 4.2 provides a model selection approach to architecture design.
Neural network models can simply be viewed as projection pursuit regression with the only difference being that in a neural network the nonlinear link functions, , are parameter dependent and learned from training data. For example, a two-layer auto-encoder model with a sigmoid link function uses a prediction rule given by functions of the form . The ’s are learned via the sigmoid function of the linear combination.
The modeling intuition a deep learning architecture can be found within a two layer system. Suppose that we have a loss problem with a two layer rule
This objective can be highly nonlinear. Finding the optimal parameters is challenging as traditional convex optimisation methods deal only with sums rather than composites of objective functions. Our approach will be based on augmented ADMM methods which still apply by splitting on the variables and . Bertsekas (1976) provides a general discussion of ADMM methods.
Now we need to solve an augmented Lagrangian problem of the form
with and where
for augmentation parameters .
Re-expressing in scaled Lagrangian form with , , gives
If we add an -norm penalty for , we obtain ridge regressions steps in block updates. The scaled Lagrangian saddle-point is solved via the iterative ADMM scheme Polson et al. (2015). This solves the optimisation of a recursively defined set of link functions rather than the traditional sum of convex functions. See Lewis and Wright (2008) for convergence analysis of these layered optimisation problems.
An important feature of our ADMM update for the parameters is that it happens in tandem. Moreover, the regularisation steps are properly scaled by the current values of the functions of the augmented -variables. On the other hand, a direct gradient-based approach such as back-propagation uses first order information based on the composite derivative . This can easily lead to steps for the second layer parameters, namely , that are poorly scaled. Back-propagation can be slow to converge as it makes small zig-zag steps in the highly multi-modal objective. More efficient second order Hessian methods would require more information and computation. Martens and Sutskever (2011) develops a Hessian-free Lagrangian approach that is based on an envelope that uses the derivative of the composite map where denotes the model measure of fit and a layer function.
Whilst our augmented Lagrangian uses an -barrier, we can use general metrics and still achieve convergence Bertsekas (1976). A computationally attractive approach would be to match the type of ADMM barrier with the nonlinearity in the link function. We leave this for future research.
We now turn to our Deep Bayes learning framework. Bayes provides a way of unifying the stochastic and algorithmic approaches to data Breiman (2001).
3 Deep Bayes Learning
A deep learning predictor, , depends recursively on a set of link functions, , defined by
The link or activation functions are pre-specified and part of the architecture design. Specific choices includes sigmoidal; softmax; tanh; rectLU or log-sum-exp see Table 1. We use to denote the number of layers in the network. The parameters , with and , are off-sets or bias parameters. The parameters can be decomposed as
To construct a posterior distribution, , we use a prior distribution, , where is a regularisation penalty.
Bayes rule yields the posterior distribution over parameters given data, namely
The deep learning estimator, , is a Bayes MAP estimator and leads to a prediction rule
Maximising the posterior distribution corresponds to finding the of the deep learning objective function. Taking a Bayesian perspective allows the researcher to characterise the full posterior distribution, , typically, via simulation methods such as MCMC. This can aid in finding parameter uncertainty estimates and in determining efficient prediction rules.
The log-likelihood can also be interpreted as gauging the accuracy of a prediction, , of outcome, . The underlying probability model, denoted by , determined this fit via
The function is typically a smooth function such as an -norm or a cross-entropy metric when dealing with classification. The statistical interpretation is as a negative log-likelihood, in machine learning a cost functional. The major difficulty in implementing deep learning models is the high degree of nonlinearity induced by the predictor, , in the likelihood.
We will pay particular attention to the multi-class deep logistic learning model. The first layer is given by which defines the vector sigmoid function .
Example 1 (Deep Logistic Learning).
Suppose that our observations represent a multi-class -of- indicator vector, which we equate with class via for . Given a set of deep layer link functions with , we have a predictive rule
The negative log likelihood is given by
Minimising the cross-entropy cost functional is therefore equivalent to a multi-class logistic likelihood function. Genkin et al. (2007) and Madigan et al. (2005) provide analysis of large-scale multinomial logit models with shallow architectures.
The basic deep learning problem is supervised learning with a training dataset . We find the deep network structure via an optimisation of the following form
Again is a measure of fit depending implicitly on some observed data and a prediction rule . Here denotes an -vector of outcomes and a corresponding set of -many -vector characteristics; for example, pixels in an image or token counts in a topic model for document classification.
To solve the deep learning objective function with non-differentiable regularisation penalties we use an auxiliary latent variable scheme in the context of an augmented Lagrangian. This allows us to write the deep learning optimisation as a constrained problem
Through variable splitting we can introduce latent auxiliary variables, , where . Notice that we also allow for regularisation of the variables so as to include sparsity.
Our approach follows the alternating pattern of ADMM and Douglas-Rachford type algorithms, which for a split objective given by
performs the following iterations
for some .
Roughly speaking, variable splitting enables one to formulate an iterative solution to the original problem that consists of simple first-order steps, or independent second-order steps, since it “decouples” the functions, which in our case are the penalty, layers, and loss functions. Now, potentially simpler minimization steps on the splitting variables and observation are combined, often in a simple linear way, to maintain the “coupling” that exists in the original problem. In such a setting, one can craft sub-problems–through the choices of possible splittings–that have good conditioning or that suit the computational platform and restrictions, all without necessarily excluding the use of standard and advanced optimization techniques that may have applied to the problem in its original form.
For instance, the same Newton-steps and back-propagation techniques can be applied on a per-layer basis. For that matter, one can choose to split at the lowest layer (i.e. between the loss function and layers above) and allow to be a complicated composite function. This separates the problem of minimizing a potentially complicated loss from a composition of non-linear functions. In general, steps will still need to be taken in the composite , but again, currently established methods can be applied.
Since splitting can occur at any level, a natural question is at which layer should the splitting occur. If there was a penalty function across splitting variables, then standard regularization paths could be computed in fairly low dimension (minimum 2, for a weight on and ). This provides a potentially smooth scaling across the layer dimension and/or number of layers (see the regressor selection problems in (Boyd et al., 2011, Section 9.1.1)). By simply re-applying the strict equality constraints between splitting variables in some layers, e.g. through an indicator penalty , one could effectively emulate certain model designs; thus, through a structured approach to regularizing these terms one can perform a type of model selection. This approach may also explain–and emulate–the effect of drop-out in a non-stochastic way.
4 Proximal Splitting
Let’s start by considering the augmented Lagrangian, , which takes the following general form for observations
where and are Lagrange multipliers and . The form here does not easily highlight the role of each term across observations and layers, nor the relationship between terms. As a result, in what follows, we recast the Lagrangian into forms that clarify the relationship.
We make extensive use of vectorization, so many of the result are obtained by using the following identities
which, in our case, give
where , for the stacked by observation terms
We also extend and with and
which includes . This means that the terms are members of the primal variable set .
We introduce the scaled Lagrange multipliers
We then obtain
where is the identity matrix and
inner-product norm and with denoting the Kronecker product and the direct sum. is a block diagonal, so it can be factored in “square root” fashion.
Naturally, operations across layers can also be vectorized. First, let , and
The linear first-order difference maps are defined by
where the matrices match in dimension, and
Now, with , (4) becomes
In terms of , our problem is
Note the relationship between the two operators, i.e.
Equations (6) and (7) provides a simple form that shows how our problem differs from the problems commonly considered in the basic operator splitting literature, in which ADMM, Douglas-Rachford and the inexact-Uzawa techniques are developed. In these cases the general constraint is usually given as , for linear operators and . Our problem involves an operator, , that introduces a multiplicative relationship between the primal variable and the dual . This form could be interpreted as–or related to–a bi-convex problem (for convex , naturally), especially when is bi-affine Boyd et al. (2011).
4.1 Proximal Operators and ADMM
The proximal operator is defined by
for positive definite . Normally, and the operator reduces to
When is diagonal and positive, such as above, one could simply use its inverse to rescale the terms in the problem (effectively and ); however, we use the matrix inner-product norm for notational convenience and the implication of wider applicability.
The proximal operator enjoys many convenient properties, especially for lower semi-continuous, convex , and has strong connections with numerous optimization techniques and fixed-point methods (Boyd and Vandenberghe, 2009; Combettes and Pesquet, 2011).
The form of (4) in reflects the definition of the proximal operator, and, after some manipulation, the same is true for . This allows us to apply the iterative techniques surrounding proximal operators in what follows.
One advantage of our approach is its embarrassingly parallel nature. The augmented Lagrangian leads to a block update of . For example, if we add the traditional -ridge penalty we have a Bayes ridge regression update for the block . Proximal algorithms can also be used for non-smooth penalties. Our method is also directly scalable and the vectorization of our method allows implementation in large scale tensor libraries such as Theano or Torch.
The problem, in vectorized form, is
which, given the design of , is not a simple proximal operator. Even if resulted in a simple diagonal matrix, the proximal operator of may not be easy to evaluate. Regardless, if this minimum is found using proximal approaches, it would likely be through another phase of splitting, be it ADMM for the subproblem, or forward-backward/proximal gradient iterations.
To illustrate the forward-backward approach we let and note that the Jacobian matrix, , is given by
and use gradient step (note that )
Simple forward-backward can’t be expected to work well for all moderate to high-dimensional problems, and especially not for functions with more extreme non-linearities. Given the composite nature of , the sensitivity and condition of this problem could vary drastically, so one will have to tailor their approach to account for this.
From (7), the problem is
It is easier to see the relationships between terms when operating on a single layer, and since this sub-problem is separable by layer, we proceed conditionally on . Let then
and is a right pseudo-inverse. See Appendix A for details.
The resulting proximal problem involves a quadratic in that is no longer strictly diagonal in its squared term and, thus, isn’t necessarily a simple proximal operator to evaluate. The operator splitting that underlies ADMM, Douglas-Rachford and similar techniques is a common approach to this type of problem. Also, if full decompositions (e.g. SVD, eigen) of are reasonable to compute at each iteration, one could proceed by working in transformed coordinates; however, the transform will induce a dependency between components of in , which may require proximal solutions that are themselves difficult to compute. The composite methods in Argyriou et al. (2013); Chen et al. (2013) are designed for such transformed problems, at least when the transform is positive-definite, but given the dependency on , such conditions are not easy to guarantee in generality.
Other approaches for solving this sub-problem are forward-backward iterations and ADMM; each approach should be considered in light of . A forward-backward approach may be trivial to implement, especially when has a known bound, but convergence can be prohibitively slow for poorly conditioned . Some ADMM approaches can lead to faster convergence rates, but at the cost of repeated matrix inversions, which–for structured matrix problems–may be trivial.
For example, a forward-backward approach for lower semi-continuous, convex would consist of the following fixed-point iterations
If is Lipschitz continuous with constant , then ; otherwise, line-search can be used to find a sufficient at every iteration.
An ADMM approach could result in a single primal proximal step in , which involves inversions of a quantity like , and, given the form of , may be possible to compute via the well-known identity
given This involves the standard cumulative error update for the augmented Lagrangian, which is
Our approach requires a “consensus” across observations in Step 2, but the remaining steps are free to be processed asynchronously in observation dimension . The structure of both and essentially determine the separability, since they each act as “correlation” matrices between and . Observing (5), we see that sums “horizontally” across layers, but in independent blocks of observations.
4.2 Model Selection and Architecture Design
One advantage of viewing a deep learning predictor as a Bayes MAP estimator is that we can seamlessly apply Bayesian model selection tools to determine optimal architecture design. We will provide a model selection criterion for choosing between different link functions, size of the number of units together with the number of layers. Given the probabilistic data generating model, and a deep learning estimator based on layers, we propose the use of an information criteria (IC) to gauge model quality. Such measures like AIC, corrected AIC Hurvich and Tsai (1991), BIC and others George and Foster (2000) have a long history as model selection criteria. From a Bayesian perspective, we can define the Bayes factor as the ratio of marginal likelihoods and also provide an optimal model averaging approach to prediction.
An information measure for a deep learning predictor, , or equivalently a given architecture design, falls into a class of the form
where df is a measure of complexity or so-called degrees of freedom of a model and is its cost. The degrees of freedom term can be defined as where is the model estimation error.
The intuition is simple–the first term assesses in-sample predictive fit. However, over-fitting is the curse of any nonlinear high dimensional prediction or modeling strategy and the second term penalises for the complexity in architecture design–including the nonlinearities in the links, number of units and layer depth. The combination of terms provides a simple metric for the comparison of two architecture designs–the best model provides the highest IC value.
For suitably stable predictors, , Stein (1981) provides an unbiased estimator of risk using the identity . Given the scalability of our algorithm, the derivative is available using the chain rule for the composition of the layers and computable using standard tensor libraries such as Torch.
Efron (1983) provides an alternative motivation for model selection by directly addressing the trade-off between minimising out-of-sample predictive error and in-sample fit. Consider a nonparametric regression under -norm. The in-sample mean squared error is and the out-of-sample predictive MSE is for a future observation . In expectation we then have
The latter term can be written in terms of df as a covariance. Stein’s unbiased risk estimate then becomes
Models with the best predictive MSE are favoured. This approach also provides a predictive MSE criterion for optimal hyper-parameter selection in the prior regularisation penalty and allow the researcher to gauge the overall amount of regularisation necessary to provide architectures that provide good predictive rules. In contrast, the current state-of-the-art is to use heuristic rules such as dropout.
To illustrate our methodology, we provide an illustration of multi-class deep logistic learning. We use logit/softmax activation functions which allows us to employ some efficient quadratic envelopes to linearize the functions in the proximal sub-problems within our algorithm. We use Fisher’s Iris data to illustrate the comparison between DL models and traditional classification algorithms such as support vector machines.
5.1 Multi-Class Deep Logistic Regression
Suppose that we have a multinomial loss with -classes, where the observations are , and are all logistic functions. Now, (11) has an explicit form as