A Statistical Theory of Deep Learning via Proximal Splitting
This Draft: July 3, 2019
Abstract
In this paper we develop a statistical theory and an implementation of deep learning (DL) models. We show that an elegant variable splitting scheme for the alternating direction method of multipliers (ADMM) optimises a deep learning objective. We allow for nonsmooth nonconvex regularisation penalties to induce sparsity in parameter weights. We provide a link between traditional shallow layer statistical models, such as principal components and sliced inverse regression, and deep layer models. We also define the degrees of freedom of a deep learning predictor and a predictive MSE criterion to perform model selection for comparing architecture designs. We focus on deep multiclass logistic learning, although our methods apply more generally. Our results suggest an interesting and previously underexploited relationship between deep learning and proximal splitting techniques. To illustrate our methodology, we provide a multiclass logit classification analysis of Fisher's Iris data where we illustrate the convergence of our algorithm. Finally, we conclude with directions for future research.
Keywords: Deep Learning; Sparsity; Dropout; Convolutional Neural Nets; Regularisation; Bayesian MAP; Image Segmentation; Classification; Multiclass Logistic Regression.
1 Introduction
Deep Learning (DL) provides a powerful tool for high dimensional data reduction. Many areas of application in predictive modeling occur in artificial intelligence and machine learning, including pattern recognition Ripley (1996); computer vision Dean et al. (2012); image segmentation and scene parsing Farabet et al. (2013); predictive diagnostics; and intelligent gaming Mnih et al. (2013). The salient feature of a deep learning model is a predictive rule composed of a layered composition of link or activation functions. Deep architectures with at least three layers have been shown to provide improved predictive performance compared to traditional shallow architectures in a number of applications. The challenge for deep learning methodologies, however, is computational: the objective function which measures model fit is typically highly multimodal and hard to optimise efficiently.
We build on the extensive deep learning literature by showing that proximal Bayesian optimisation techniques provide a turnkey solution to estimation and optimisation of such models and for calculating a regularisation path. We allow for the possibility of irregular nonconvex regularisation penalties to induce sparsity in the deep layer weights. Proximal operators and the alternating direction method of multipliers (ADMM) are the key tools for implementation. This approach simply rewrites the unconstrained optimisation as a constrained one, with a carefully constructed sequence of auxiliary variables and envelopes, to deal with the associated augmented Lagrangian; see Parikh and Boyd (2014); Polson et al. (2015); Green et al. (2015) for recent surveys. Proximal algorithms have achieved great success and provide a methodology for incorporating irregular nondifferentiable penalties; see (Masci et al., 2013) for applications in the areas of computer vision and signal processing.
From a statistical perspective, DL models can be viewed as generalised linear models (GLM, Davison (2003); Dellaportas and Smith (1993)) with recursively defined link functions. Traditional statistical models commonly use shallow networks containing at most two layers or hierarchies. For example, reduced rank regression can be viewed as a deep learning model with only two layers and linear links. Support vector machines for classification Polson and Scott (2013) use a predictive rule based on a rectified linear (or hinge) link. Recent empirical evidence, however, suggests improved statistical predictive performance with deep architectures of at least three layers. Our focus here will be on developing fast learning methods for deep multiclass logistic models, although our methods apply more generally to recurrent and convolutional neural nets. Although well known in the Neural Network and Statistics literature (Knowles and Minka, 2011), efficient estimation of the cross-entropy/multinomial loss with regularisation, outside of the ridge penalty, has not been a mainstream area of research. See Madigan et al. (2005) and Genkin et al. (2007) for applications to large scale multinomial logistic models. In general, the ridge penalty is commonplace (Poggio and Girosi, 1990; Orr, 1995), mostly due to its differentiability. Our methods can therefore be seen as related to sparse Bayes MAP techniques; see Titterington (2004); Windle et al. (2013); Polson et al. (2015) for high dimensional data reduction.
Mainstream estimation within Deep Learning broadly revolves around gradient descent methods. The main variation in techniques arises from general considerations of computational complexity, and such variations introduce more tuning parameters (Ngiam et al., 2011). These considerations are the basis for Stochastic Gradient Descent (SGD) and backpropagation. For instance, backpropagation uses the chain rule for the derivative of the composite of activation functions. This can reduce the order of operations dramatically relative to naive direct evaluation while maintaining high numerical accuracy; however, it says little about the general difficulty of estimating a nonlinear objective function. Direct gradient methods can be poorly scaled for the estimation of deep layer weights, in contrast to our proximal splitting approach, which overcomes this by providing a simultaneous block update of parameters at all layers. The largest networks (e.g. Dean et al. (2012)) are currently trained using asynchronous SGD. Farabet et al. (2013) discusses hardware approaches to faster algorithms. Providing training methodologies is a very active field of research.
The splitting techniques common in the proximal framework do exist in the AI literature, but we believe that their broad applicability and functional coverage have mostly been overlooked and underexploited. This is possibly due to the aforementioned concerns of computational complexity that make stochastic gradient descent (SGD) and backpropagation methods so popular, although it is quite possible that the number of iterations relative to per-step complexity can favor (parallel) proximal methods in some cases. We show that our augmented ADMM approach is embarrassingly parallel, with block updates for parameters and auxiliary variables being directly available due to our carefully chosen splitting procedure.
Traditional approaches to deep learning use backpropagation LeCun et al. (2012), Hinton et al. (2006), Hinton and Salakhutdinov (2006), which relies on the chain rule for the derivatives of the composite of the layers in the network. We propose a proximal splitting approach which also lends itself to the inclusion of a nondifferentiable regularisation penalty term so as to induce sparsity. Combettes and Pesquet (2011) detail multiple splitting methods in the context of proximal operators, as well as parallel proximal algorithms that handle many splitting variables. Wang and Carreira-Perpiñán (2012) perform splitting in a similar context as ours, but with an $\ell_2$ loss and a quadratic-penalty approach to handle the equality constraints. In contrast, we generalise to non-$\ell_2$ losses and detail a general envelope approach that includes the well known augmented Lagrangian. Our general envelope approach allows one to utilize efficient bounds, such as those found in Bouchard (2007) and Knowles and Minka (2011).
The rest of the paper is outlined as follows. Section 2 introduces deep learning predictors. We show that standard statistical models such as principal components analysis, reduced rank regression and sliced inverse regression can be viewed as shallow architecture models. We also motivate the addition of further layers to aid in the regularised prediction problem, and illustrate how proximal splitting methods solve a simple shallow architecture model. Section 3 describes the general deep learning problem and its solution via proximal splitting methods. Supervised learning uses a training dataset to estimate the parameters of each layer of the network. Section 4 develops our proximal splitting algorithm, defines the degrees of freedom of a deep learning predictive rule, which quantifies the tradeoff between model fit and predictive performance, and provides a model selection approach to architecture design. Section 5 provides a comparison of DL and Neural Network (NN) models in a multiclass deep logistic classification model for Fisher's Iris data. We also illustrate the convergence of our algorithm. Finally, Section 6 concludes with directions for future research.
2 Deep Learning Predictors
Let $y \in \mathcal{Y}$, where $\mathcal{Y} = \mathbb{R}$ for regression and $\mathcal{Y} = \{1, \ldots, K\}$ for classification. Here $y$ denotes an observed output associated with a high dimensional input/covariate variable given by $x = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$. The generic problem is to find a nonlinear predictor $\hat{y}(x)$ of the output $y$, whose parameters will be estimated from a training dataset of $N$ outputs and inputs, denoted by $\{y^{(i)}, x^{(i)}\}_{i=1}^{N}$.
When the dimensionality of $x$ is high, a common approach is to use a data reduction technique. This is achieved by introducing a low-dimensional auxiliary variable $z \in \mathbb{R}^{d}$, where $d \ll p$, and constructing a prediction rule specified by a composition of functions,
$$\hat{y}(x) = f_1(z), \quad z = f_2(x).$$
Now the problem of high dimensional data reduction is to find the variables $z$ using training data and to estimate the layer functions $(f_1, f_2)$. One reason for the success of deep learning is the regularisation achieved through the hidden layer low-dimensional variable. A hallmark of deep learning models is also the use of nonlinear layers. As such, one can view them as hierarchical nonlinear factor models or, more specifically, as generalised linear models (GLM) with recursively defined nonlinear link functions.
The key to a layering approach is to uncover the low-dimensional structure in a way that does not disregard information about predicting the output $y$. We will show that our splitting scheme naturally uses the hidden layer variables.
For example, principal component analysis (PCA) reduces $x$ using an SVD decomposition Wold (1956). This type of dimension reduction is independent of $y$ and can easily discard information that is valuable for predicting the output. Sliced inverse regression (SIR) Cook and Lee (1999), Cook (2007) overcomes this by estimating the layer function $f_2$, independently of $f_1$, using data on both $(y, x)$. Deep learning takes this one step further and jointly estimates $f_1$ and $f_2$ in tandem, using training data on the pairs $\{y^{(i)}, x^{(i)}\}_{i=1}^{N}$.
A common approach is to introduce parameters, $(W_1, W_2)$, at each layer by convolution with linear maps, which we denote by $z = W_2 x$ and $\hat{y} = f_1(W_1 z)$, assuming centered variables. Statistical models are traditionally shallow architectures with at most two layers. Reduced rank regression corresponds to linear link functions with $\hat{y} = W_1 W_2 x$, where $\operatorname{rank}(W_1 W_2) = d < p$. The dimensionality reduction then becomes an eigenproblem for $W_1$ and $W_2$. Radial basis functions/kernel sliced inverse regression uses $z = W_2 \Phi(x)$, where $\Phi$ is a set of kernels/basis functions. In many cases, we will also add a penalty, $\lambda \phi(W)$, for parameter estimation. The term $\phi(W)$ is a regularization term that imposes structure or effects a favorable bias-variance tradeoff in prediction. We need to be able to account for nonsmooth penalties, such as the lasso penalty $\|W\|_1$ or a bridge penalty, to induce sparsity, where $\lambda$ is a scaling parameter that traces out a full regularisation path.
A deep learning predictor takes the form of an $L$-layered convolution rule of link functions $f_1, \ldots, f_L$, where
$$\hat{y}(x) = \big(f_1 \circ f_2 \circ \cdots \circ f_L\big)(x).$$
The set of hidden layer variables is given by $z_l = f_l\big(W_l z_{l+1} + b_l\big)$ for $1 \le l \le L$, with $z_{L+1} = x$, assuming that the variables are centered.
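As a concrete illustration of such a layered prediction rule, the following sketch evaluates an $L$-layer composition, applying each link to an affine map of the layer below. The dimensions, weights, and link choices here are hypothetical, not part of the model specification above; a minimal sketch uses tanh hidden links and a linear output link.

```python
import numpy as np

def deep_predictor(x, weights, biases, links):
    """Evaluate f_1(W_1 f_2(W_2 ... f_L(W_L x + b_L) ...) + b_1),
    applying layers from the deepest (l = L) up to the first (l = 1)."""
    z = x
    for W, b, f in zip(reversed(weights), reversed(biases), reversed(links)):
        z = f(W @ z + b)
    return z

rng = np.random.default_rng(0)
# A three-layer example: p = 4 inputs reduced through widths 3 and 2 to 1 output.
weights = [rng.standard_normal((1, 2)), rng.standard_normal((2, 3)),
           rng.standard_normal((3, 4))]
biases = [np.zeros(1), np.zeros(2), np.zeros(3)]
links = [lambda v: v, np.tanh, np.tanh]   # linear output link, tanh hidden links

y_hat = deep_predictor(rng.standard_normal(4), weights, biases, links)
```

Each successive layer reduces the dimension, mirroring the data-reduction interpretation above.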
The original motivation for recursive deep learning predictors resides in the seminal work of Kolmogorov (1957), Lorentz (1963) and the extensions derived by Barron (1993), Poggio and Girosi (1990) and Bach (2014). See Paige and Butler (2001) for a Bayesian approach to neural networks. The motivation behind layered convolution maps lies in the following completeness result. Given a Gaussian nonparametric regression $y = F(x) + \epsilon$, with $F$ any continuous function on $[0, 1]^p$ Poggio and Girosi (1990), there exist one-dimensional link functions $f_j$ and $g_{ij}$ such that
$$F(x) = \sum_{j=1}^{2p+1} f_j\Big( \sum_{i=1}^{p} g_{ij}(x_i) \Big).$$
This is a four layer network with a large number of units at the first layer. On the practical side, one may wish to build a deep network with more layers but less units at each stage. Section 4.2 provides a model selection approach to architecture design.
Neural network models can simply be viewed as projection pursuit regression, with the only difference being that in a neural network the nonlinear link functions are parameter dependent and learned from training data. For example, a two-layer autoencoder model with a sigmoid link function uses a prediction rule given by functions of the form $\hat{y}(x) = f_1\big(W_1 \sigma(W_2 x + b_2) + b_1\big)$, where $\sigma(u) = 1/(1 + e^{-u})$. The hidden variables are learned via the sigmoid function of a linear combination of the inputs.
The modeling intuition behind a deep learning architecture can be found within a two layer system. Suppose that we have an $\ell_2$ loss problem with a two layer rule
$$\min_{W_1, W_2} \; \big\| y - f_1\big(W_1 f_2(W_2 x)\big) \big\|_2^2 .$$
This objective can be highly nonlinear. Finding the optimal parameters is challenging as traditional convex optimisation methods deal only with sums rather than composites of objective functions. Our approach will be based on augmented ADMM methods, which still apply by splitting on the variables $z_1 = W_1 f_2(z_2)$ and $z_2 = W_2 x$. Bertsekas (1976) provides a general discussion of ADMM methods.
Now we need to solve an augmented Lagrangian problem of the form
$$\min_{W_1, W_2, z_1, z_2} \; \big\| y - f_1(z_1) \big\|_2^2 + \mathcal{L}_1\big(z_1, W_1 f_2(z_2)\big) + \mathcal{L}_2\big(z_2, W_2 x\big)$$
with dual variables $u_1, u_2$ and where
$$\mathcal{L}_l(a, b) = u_l^{\top}(a - b) + \frac{\rho_l}{2}\, \| a - b \|_2^2, \quad l = 1, 2,$$
for augmentation parameters $\rho_1, \rho_2 > 0$.
Re-expressing in scaled Lagrangian form with $\lambda_l = u_l / \rho_l$, $l = 1, 2$, gives
$$\mathcal{L}_l(a, b) = \frac{\rho_l}{2}\, \| a - b + \lambda_l \|_2^2 - \frac{\rho_l}{2}\, \| \lambda_l \|_2^2 .$$
If we add an $\ell_2$-norm penalty for $(W_1, W_2)$, we obtain ridge regression steps in the block updates. The scaled Lagrangian saddlepoint is solved via the iterative ADMM scheme Polson et al. (2015). This solves the optimisation of a recursively defined set of link functions rather than the traditional sum of convex functions. See Lewis and Wright (2008) for convergence analysis of these layered optimisation problems.
An important feature of our ADMM update is that the parameters at all layers are updated in tandem. Moreover, the regularisation steps are properly scaled by the current values of the functions of the augmented variables. On the other hand, a direct gradient-based approach such as backpropagation uses first order information based on the composite derivative obtained from the chain rule. This can easily lead to steps for the second layer parameters, namely $W_2$, that are poorly scaled. Backpropagation can be slow to converge as it makes small zigzag steps in the highly multimodal objective. More efficient second order Hessian methods would require more information and computation. Martens and Sutskever (2011) develop a Hessian-free Lagrangian approach that is based on an envelope that uses the derivative of the composite map $\ell \circ f$, where $\ell$ denotes the model measure of fit and $f$ a layer function.
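To make the tandem block updates concrete, the following is a minimal sketch of scaled-form ADMM iterations for a two-layer model, simplified to linear links and an $\ell_2$ (ridge) penalty so that every block update is a closed-form least-squares step. All data and tuning constants are synthetic and hypothetical; this is a simplification of the scheme above, not the paper's full algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, d = 50, 8, 3
X = rng.standard_normal((p, N))                 # inputs, one column per observation
W2_true = rng.standard_normal((d, p))
W1_true = rng.standard_normal((1, d))
y = W1_true @ W2_true @ X + 0.01 * rng.standard_normal((1, N))

rho, lam = 1.0, 0.1
W1 = rng.standard_normal((1, d))
W2 = rng.standard_normal((d, p))
Z = rng.standard_normal((d, N))                 # split variable: Z should equal W2 X
U = np.zeros((d, N))                            # scaled dual variable
resid0 = np.linalg.norm(Z - W2 @ X)             # initial constraint violation

for _ in range(100):
    # W1-step: ridge regression of y on Z (closed-form block update)
    W1 = y @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(d))
    # Z-step: quadratic in Z, solved jointly for all observations
    Z = np.linalg.solve(W1.T @ W1 + rho * np.eye(d),
                        W1.T @ y + rho * (W2 @ X - U))
    # W2-step: least-squares fit of the shifted split variable on X
    W2 = (Z + U) @ X.T @ np.linalg.inv(X @ X.T + 1e-8 * np.eye(p))
    # scaled dual update accumulates the constraint residual
    U = U + Z - W2 @ X

resid = np.linalg.norm(Z - W2 @ X)              # should be far below resid0
```

Note that each block solve is properly scaled by the current values of the other blocks, in contrast to a raw gradient step on the composite objective.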
Table 1: Common activation/link functions.

linear: $f(x) = x$
sigmoid: $f(x) = 1/(1 + e^{-x})$
softmax: $f_k(x) = e^{x_k} / \sum_{j} e^{x_j}$
tanh: $f(x) = \tanh(x)$
log-sum-exp: $f(x) = \log \sum_{j} e^{x_j}$
norm: $f(x) = \| x \|$
rectLU/hinge: $f(x) = \max(x, 0)$
max-pooling: $f(x) = \max_{j} x_j$
Whilst our augmented Lagrangian uses an $\ell_2$ barrier, we can use general metrics and still achieve convergence Bertsekas (1976). A computationally attractive approach would be to match the type of ADMM barrier with the nonlinearity in the link function. We leave this for future research.
We now turn to our Deep Bayes learning framework. Bayes provides a way of unifying the stochastic and algorithmic approaches to data Breiman (2001).
3 Deep Bayes Learning
A deep learning predictor, $\hat{y}(x)$, depends recursively on a set of link functions, $f_1, \ldots, f_L$, defined by
$$\hat{y}(x) = f_1\Big(W_1 f_2\big(W_2 \cdots f_L(W_L x + b_L) \cdots + b_2\big) + b_1\Big).$$
The link or activation functions are prespecified and part of the architecture design. Specific choices include sigmoid; softmax; tanh; rectLU or log-sum-exp, see Table 1. We use $L$ to denote the number of layers in the network. The parameters $b_l$, with $1 \le l \le L$, are offsets or bias parameters. The parameters can be decomposed as
$$\theta = (W, b), \quad W = (W_1, \ldots, W_L), \quad b = (b_1, \ldots, b_L).$$
To construct a posterior distribution, $p(\theta \mid y)$, we use a prior distribution, $p(\theta) \propto \exp\{-\lambda \phi(\theta)\}$, where $\phi$ is a regularisation penalty.
Bayes rule yields the posterior distribution over parameters given data, namely
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \propto \exp\big\{-\ell\big(y, \hat{y}_{\theta}(x)\big) - \lambda \phi(\theta)\big\}.$$
The deep learning estimator, $\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid y)$, is a Bayes MAP estimator and leads to a prediction rule $\hat{y}(x) = \hat{y}_{\hat{\theta}}(x)$.
Maximising the posterior distribution corresponds to finding the minimum of the deep learning objective function. Taking a Bayesian perspective allows the researcher to characterise the full posterior distribution, $p(\theta \mid y)$, typically via simulation methods such as MCMC. This can aid in finding parameter uncertainty estimates and in determining efficient prediction rules.
The log-likelihood can also be interpreted as gauging the accuracy of a prediction, $\hat{y}$, of outcome, $y$. The underlying probability model, denoted by $p(y \mid \hat{y})$, determines this fit via
$$\ell(y, \hat{y}) = -\log p(y \mid \hat{y}).$$
The function $\ell$ is typically a smooth function such as an $\ell_2$ norm, or a cross-entropy metric when dealing with classification. The statistical interpretation is as a negative log-likelihood; in machine learning, a cost functional. The major difficulty in implementing deep learning models is the high degree of nonlinearity induced by the predictor, $\hat{y}_{\theta}(x)$, in the likelihood.
We will pay particular attention to the multiclass deep logistic learning model. The first layer is given by the softmax transform, which defines the vector sigmoid function $f_1(z)_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$.
Example 1 (Deep Logistic Learning).
Suppose that our observations represent a multiclass indicator vector, $y \in \{0, 1\}^K$, which we equate with class $k$ via $y = e_k$ for $1 \le k \le K$. Given a set of deep layer link functions $f_1, \ldots, f_L$, with softmax first layer $f_1$, we have a predictive rule whose $k$th coordinate, $\hat{y}_k(x)$, estimates $p(y = e_k \mid x)$.
The negative log likelihood is given by the cross-entropy
$$-\log p(y \mid \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k(x).$$
Minimising the cross-entropy cost functional is therefore equivalent to maximising a multiclass logistic likelihood function. Genkin et al. (2007) and Madigan et al. (2005) provide analysis of large-scale multinomial logit models with shallow architectures.
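This equivalence can be checked numerically. The sketch below (hypothetical last-layer scores $z$) computes the negative log likelihood via a numerically stable log-sum-exp, which equals $-\log \hat{y}_k$ for the observed class under the softmax transform.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, z):
    # -sum_k y_k log softmax(z)_k, computed via log-sum-exp for stability.
    return np.logaddexp.reduce(z) - z @ y_onehot

z = np.array([2.0, -1.0, 0.5])     # hypothetical last-layer scores, K = 3 classes
y = np.array([1.0, 0.0, 0.0])      # indicator vector: observed class 1
s = softmax(z)
nll = cross_entropy(y, z)
```

The log-sum-exp form avoids overflow for large scores, which matters once the scores are themselves outputs of deep layers.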
The basic deep learning problem is supervised learning with a training dataset $\{y^{(i)}, x^{(i)}\}_{i=1}^{N}$. We find the deep network structure via an optimisation of the following form
(1) $\min_{\theta} \; \ell\big(y, \hat{y}_{\theta}(X)\big) + \lambda \phi(\theta)$
Again $\ell$ is a measure of fit depending implicitly on some observed data and a prediction rule $\hat{y}_{\theta}$. Here $y$ denotes an $N$-vector of outcomes and $X$ a corresponding set of $N$-many $p$-vector characteristics; for example, pixels in an image or token counts in a topic model for document classification.
To solve the deep learning objective function with nondifferentiable regularisation penalties we use an auxiliary latent variable scheme in the context of an augmented Lagrangian. This allows us to write the deep learning optimisation as a constrained problem
(2) $\min_{\theta, z} \; \ell\big(y, z_1\big) + \lambda \phi(\theta) + \gamma \sum_{l=1}^{L} \phi_l\big(z_l\big)$
where
$$z_l = f_l\big(W_l z_{l+1} + b_l\big), \quad 1 \le l \le L, \quad z_{L+1} = x.$$
Through variable splitting we can introduce latent auxiliary variables, $z_l$, where $z_l = f_l(W_l z_{l+1} + b_l)$. Notice that we also allow for regularisation of the $z_l$ variables so as to include sparsity.
Our approach follows the alternating pattern of ADMM and Douglas-Rachford type algorithms, which for a split objective given by
$$\min_{x, z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$$
performs the following iterations
$$x^{k+1} = \arg\min_{x} \; f(x) + \frac{\rho}{2} \| Ax + Bz^{k} - c + u^{k} \|_2^2$$
$$z^{k+1} = \arg\min_{z} \; g(z) + \frac{\rho}{2} \| Ax^{k+1} + Bz - c + u^{k} \|_2^2$$
$$u^{k+1} = u^{k} + Ax^{k+1} + Bz^{k+1} - c$$
for some $\rho > 0$.
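These iterations are easiest to see on the canonical lasso split $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(z) = \lambda \|z\|_1$, with constraint $x - z = 0$. The sketch below (synthetic data, hypothetical tuning constants) runs the three ADMM steps; the $x$-step is a cached ridge-like solve and the $z$-step is soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 10
A = rng.standard_normal((n, p))
x_true = np.zeros(p)
x_true[:3] = [2.0, -1.5, 1.0]                    # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

lam, rho = 0.5, 1.0
x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
M = np.linalg.inv(A.T @ A + rho * np.eye(p))     # cached factor for the x-step

for _ in range(200):
    x = M @ (A.T @ b + rho * (z - u))            # x-step: ridge-like solve
    z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # z-step: prox of l1
    u = u + x - z                                # scaled dual update on the residual
```

At convergence the consensus constraint $x = z$ holds and $z$ recovers the sparse coefficients up to shrinkage.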
Roughly speaking, variable splitting enables one to formulate an iterative solution to the original problem that consists of simple first-order steps, or independent second-order steps, since it "decouples" the functions, which in our case are the penalty, layers, and loss functions. Potentially simpler minimization steps on the splitting variables and observations are then combined, often in a simple linear way, to maintain the "coupling" that exists in the original problem. In such a setting, one can craft subproblems, through the choices of possible splittings, that have good conditioning or that suit the computational platform and restrictions, all without necessarily excluding the use of standard and advanced optimization techniques that may have applied to the problem in its original form.
For instance, the same Newton steps and backpropagation techniques can be applied on a per-layer basis. For that matter, one can choose to split at the lowest layer (i.e. between the loss function and the layers above) and allow the remaining term to be a complicated composite function. This separates the problem of minimizing a potentially complicated loss from a composition of nonlinear functions. In general, steps will still need to be taken in the composite map, but again, currently established methods can be applied.
Since splitting can occur at any level, a natural question is at which layer the splitting should occur. If there were a penalty function across splitting variables, then standard regularization paths could be computed in fairly low dimension. This provides a potentially smooth scaling across the layer dimension and/or number of layers (see the regressor selection problems in (Boyd et al., 2011, Section 9.1.1)). By simply reapplying the strict equality constraints between splitting variables in some layers, e.g. through an indicator penalty, one could effectively emulate certain model designs; thus, through a structured approach to regularizing these terms one can perform a type of model selection. This approach may also explain, and emulate, the effect of dropout in a nonstochastic way.
4 Proximal Splitting
Let's start by considering the augmented Lagrangian, $\mathcal{L}$, which takes the following general form for $N$ observations
(3) $\mathcal{L} = \sum_{i=1}^{N} \ell\big(y^{(i)}, z_1^{(i)}\big) + \lambda \phi(\theta) + \sum_{i=1}^{N} \sum_{l=1}^{L} \Big( \gamma\, \phi_l\big(z_l^{(i)}\big) + u_l^{(i)\top} r_l^{(i)} + \frac{\rho_l}{2} \big\| r_l^{(i)} \big\|_2^2 \Big), \quad r_l^{(i)} = z_l^{(i)} - f_l\big(W_l z_{l+1}^{(i)} + b_l\big),$
where $u_l^{(i)}$ are Lagrange multipliers, $\rho_l > 0$, and $z_{L+1}^{(i)} = x^{(i)}$. The form here does not easily highlight the role of each term across observations and layers, nor the relationships between terms. As a result, in what follows we recast the Lagrangian into forms that clarify these relationships.
We make extensive use of vectorization, so many of the results are obtained by using the following identities
$$\operatorname{vec}(ABC) = (C^{\top} \otimes A)\operatorname{vec}(B), \qquad \operatorname{vec}(AB) = (I \otimes A)\operatorname{vec}(B) = (B^{\top} \otimes I)\operatorname{vec}(A),$$
which, in our case, give
where the terms are stacked by observation.
We also extend the weights and layer variables with the offsets and a unit entry, so that $W_l z_{l+1} + b_l = [\,W_l \;\; b_l\,]\,[\,z_{l+1}^{\top} \;\; 1\,]^{\top}$, which absorbs the bias terms. This means that the bias terms are members of the primal variable set.
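The vectorization identity $\operatorname{vec}(ABC) = (C^{\top} \otimes A)\operatorname{vec}(B)$ used above is standard and can be verified numerically; note that $\operatorname{vec}$ stacks columns, while NumPy flattens row-major, hence the transpose below.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

# vec stacks columns; NumPy ravels row-major, so transpose before flattening.
vec = lambda M: M.T.ravel()

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
```

The identity is what lets the layer maps act on stacked observation blocks as single matrix operations.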
We introduce the scaled Lagrange multipliers $\lambda_l^{(i)} = u_l^{(i)} / \rho_l$.
We then obtain
(4) 
where $I$ is the identity matrix and $\| v \|_M^2 = v^{\top} M v$ denotes the matrix inner-product norm, with $\otimes$ denoting the Kronecker product and $\oplus$ the direct sum. The weighting matrix is block diagonal, so it can be factored in "square root" fashion.
Naturally, operations across layers can also be vectorized, by stacking the per-layer terms.
The linear first-order difference maps are defined by
(5)  
where the matrices match in dimension. Now, with these definitions, (4) becomes
(6) 
In terms of these stacked operators, our problem is
(7) 
where
(8) 
Note the relationship between the two operators.
Equations (6) and (7) provide a simple form that shows how our problem differs from the problems commonly considered in the basic operator splitting literature, in which ADMM, Douglas-Rachford and the inexact-Uzawa techniques are developed. In these cases the general constraint is usually given as $Ax + Bz = c$, for linear operators $A$ and $B$. Our problem involves an operator that introduces a multiplicative relationship between blocks of the primal variables. This form could be interpreted as, or related to, a biconvex problem (for convex losses, naturally), especially when the constraint is biaffine Boyd et al. (2011).
4.1 Proximal Operators and ADMM
The proximal operator is defined by
$$\operatorname{prox}_{f}^{M}(v) = \arg\min_{x}\Big\{ f(x) + \tfrac{1}{2}\, \| x - v \|_{M}^{2} \Big\}$$
for positive definite $M$. Normally, $M = \gamma^{-1} I$ and the operator reduces to
$$\operatorname{prox}_{\gamma f}(v) = \arg\min_{x}\Big\{ f(x) + \tfrac{1}{2\gamma}\, \| x - v \|_{2}^{2} \Big\}.$$
When $M$ is diagonal and positive, such as above, one could simply use its inverse to rescale the terms in the problem; however, we use the matrix inner-product norm for notational convenience and the implication of wider applicability.
The proximal operator enjoys many convenient properties, especially for lower semicontinuous, convex $f$, and has strong connections with numerous optimization techniques and fixed-point methods (Boyd and Vandenberghe, 2009; Combettes and Pesquet, 2011).
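For common penalties, the proximal operator is available in closed form. The sketch below implements two standard cases: soft-thresholding for the $\ell_1$ norm, and shrinkage for the squared $\ell_2$ norm (taking $M$ to be a scaled identity).

```python
import numpy as np

def prox_l1(v, lam):
    # prox of f(x) = lam * ||x||_1: coordinatewise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2sq(v, lam):
    # prox of f(x) = (lam / 2) * ||x||_2^2: shrinkage toward zero.
    return v / (1.0 + lam)

v = np.array([3.0, -0.2, 0.5])
```

Both operators are separable across coordinates, which is what makes the corresponding regularisation steps cheap and parallel.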
The form of (4) in the parameters reflects the definition of the proximal operator, and, after some manipulation, the same is true for the splitting variables. This allows us to apply the iterative techniques surrounding proximal operators in what follows.
One advantage of our approach is its embarrassingly parallel nature. The augmented Lagrangian leads to a block update of the parameters at every layer. For example, if we add the traditional ridge penalty we obtain a Bayes ridge regression update for the corresponding block. Proximal algorithms can also be used for nonsmooth penalties. Our method is also directly scalable, and its vectorization allows implementation in large scale tensor libraries such as Theano or Torch.

Step 1 ($z$-update): given the remaining variables.
The problem, in vectorized form, is not a simple proximal operator, given the design of the constraint operator. Even if the quadratic term resulted in a simple diagonal matrix, the proximal operator of the composite layer map may not be easy to evaluate. Regardless, if this minimum is found using proximal approaches, it would likely be through another phase of splitting, be it ADMM for the subproblem, or forward-backward/proximal gradient iterations.
To illustrate the forward-backward approach, we let $h = \ell \circ f$ denote the smooth part of the subproblem objective and note that, by the chain rule, its gradient involves the Jacobian matrix, $J(z) = \partial f(z)/\partial z$, via
(9) $\nabla h(z) = J(z)^{\top} \nabla \ell\big(f(z)\big),$
and use the gradient (forward) step followed by the proximal (backward) step
(10) $v^{k+1} = z^{k} - \gamma \nabla h(z^{k})$
(11) $z^{k+1} = \operatorname{prox}_{\gamma g}\big(v^{k+1}\big).$
Simple forward-backward cannot be expected to work well for all moderate to high-dimensional problems, and especially not for functions with more extreme nonlinearities. Given the composite nature of the objective, the sensitivity and conditioning of this problem could vary drastically, so one will have to tailor the approach to account for this.

Step 2 ($\theta$-update): given the remaining variables.
From (7), the problem is
It is easier to see the relationships between terms when operating on a single layer, and since this subproblem is separable by layer, we proceed conditionally on the remaining layers. We then obtain
(12) where
where the transformed quantities involve a right pseudoinverse. See Appendix A for details.
The resulting proximal problem involves a quadratic in the layer parameters that is no longer strictly diagonal in its squared term and, thus, is not necessarily a simple proximal operator to evaluate. The operator splitting that underlies ADMM, Douglas-Rachford and similar techniques is a common approach to this type of problem. Also, if full decompositions (e.g. SVD, eigen) of the quadratic term are reasonable to compute at each iteration, one could proceed by working in transformed coordinates; however, the transform will induce a dependency between components of the primal variable, which may require proximal solutions that are themselves difficult to compute. The composite methods in Argyriou et al. (2013); Chen et al. (2013) are designed for such transformed problems, at least when the transform is positive-definite, but given the dependency on the current iterate, such conditions are not easy to guarantee in generality.
Other approaches for solving this subproblem are forward-backward iterations and ADMM; each approach should be considered in light of the conditioning of the subproblem. A forward-backward approach may be trivial to implement, especially when the gradient has a known Lipschitz bound, but convergence can be prohibitively slow for poorly conditioned problems. Some ADMM approaches can lead to faster convergence rates, but at the cost of repeated matrix inversions, which, for structured matrix problems, may be trivial.
For example, a forward-backward approach for lower semicontinuous, convex $g$ would consist of the following fixed-point iterations
$$z^{k+1} = \operatorname{prox}_{\gamma g}\big(z^{k} - \gamma \nabla h(z^{k})\big),$$
with step size $\gamma > 0$. If $\nabla h$ is Lipschitz continuous with constant $\beta$, then convergence holds for $\gamma \in (0, 2/\beta)$; otherwise, line search can be used to find a sufficient $\gamma$ at every iteration.
An ADMM approach could result in a single primal proximal step in , which involves inversions of a quantity like , and, given the form of , may be possible to compute via the wellknown identity

given This involves the standard cumulative error update for the augmented Lagrangian, which is
(13)
Our approach requires a “consensus” across observations in Step 2, but the remaining steps are free to be processed asynchronously in observation dimension . The structure of both and essentially determine the separability, since they each act as “correlation” matrices between and . Observing (5), we see that sums “horizontally” across layers, but in independent blocks of observations.
4.2 Model Selection and Architecture Design
One advantage of viewing a deep learning predictor as a Bayes MAP estimator is that we can seamlessly apply Bayesian model selection tools to determine optimal architecture design. We will provide a model selection criterion for choosing between different link functions, size of the number of units together with the number of layers. Given the probabilistic data generating model, and a deep learning estimator based on layers, we propose the use of an information criteria (IC) to gauge model quality. Such measures like AIC, corrected AIC Hurvich and Tsai (1991), BIC and others George and Foster (2000) have a long history as model selection criteria. From a Bayesian perspective, we can define the Bayes factor as the ratio of marginal likelihoods and also provide an optimal model averaging approach to prediction.
An information measure for a deep learning predictor, , or equivalently a given architecture design, falls into a class of the form
where df is a measure of complexity or socalled degrees of freedom of a model and is its cost. The degrees of freedom term can be defined as where is the model estimation error.
The intuition is simple–the first term assesses insample predictive fit. However, overfitting is the curse of any nonlinear high dimensional prediction or modeling strategy and the second term penalises for the complexity in architecture design–including the nonlinearities in the links, number of units and layer depth. The combination of terms provides a simple metric for the comparison of two architecture designs–the best model provides the highest IC value.
For suitably stable predictors, , Stein (1981) provides an unbiased estimator of risk using the identity . Given the scalability of our algorithm, the derivative is available using the chain rule for the composition of the layers and computable using standard tensor libraries such as Torch.
Efron (1983) provides an alternative motivation for model selection by directly addressing the tradeoff between minimising outofsample predictive error and insample fit. Consider a nonparametric regression under norm. The insample mean squared error is and the outofsample predictive MSE is for a future observation . In expectation we then have
The latter term can be written in terms of df as a covariance. Stein’s unbiased risk estimate then becomes
Models with the best predictive MSE are favoured. This approach also provides a predictive MSE criterion for optimal hyperparameter selection in the prior regularisation penalty and allow the researcher to gauge the overall amount of regularisation necessary to provide architectures that provide good predictive rules. In contrast, the current stateoftheart is to use heuristic rules such as dropout.
5 Applications
To illustrate our methodology, we provide an illustration of multiclass deep logistic learning. We use logit/softmax activation functions which allows us to employ some efficient quadratic envelopes to linearize the functions in the proximal subproblems within our algorithm. We use Fisher’s Iris data to illustrate the comparison between DL models and traditional classification algorithms such as support vector machines.
5.1 MultiClass Deep Logistic Regression
Suppose that we have a multinomial loss with classes, where the observations are , and are all logistic functions. Now, (11) has an explicit form as