An Elementary Approach to Convergence Guarantees of Optimization Algorithms for Deep Networks
Abstract
We present an approach to obtain convergence guarantees of optimization algorithms for deep networks based on elementary arguments and computations. The convergence analysis revolves around the analytical and computational structures of optimization oracles central to the implementation of deep networks in machine learning software. We provide a systematic way to compute estimates of the smoothness constants that govern the convergence behavior of first-order optimization algorithms used to train deep networks. A diverse set of example components and architectures arising in modern deep networks intersperse the exposition to illustrate the approach.
1 Introduction
Deep networks have achieved remarkable performance in several application domains such as computer vision, natural language processing and genomics (Krizhevsky et al., 2012; Pennington et al., 2014; Duvenaud et al., 2015). A deep network can be framed as a chain of composition of modules, where each module is typically the composition of a nonlinear function and an affine transformation. The last module in the chain is usually task-specific and can be expressed either in analytical form as in supervised classification or as the solution of an optimization problem in dimension reduction or clustering.
The optimization problem arising when training a deep network is often framed as a nonconvex optimization problem, dismissing the structure of the objective, which is nevertheless central to its software implementation. Indeed, optimization algorithms used to train deep networks proceed by making calls to first-order (or second-order) oracles relying on dynamic programming such as gradient backpropagation (Werbos, 1994; Rumelhart et al., 1986; LeCun, 1988). See also (Duda et al., 2012; Anthony & Bartlett, 2009; Shalev-Shwartz & Ben-David, 2014; Goodfellow et al., 2016) for an exposition and (Abadi et al., 2015; Paszke et al., 2017) for an implementation of gradient backpropagation for deep networks. We highlight here the elementary yet important fact that the chain-compositional structure of the objective naturally emerges through the smoothness constants governing the convergence guarantee of a gradient-based optimization algorithm. This provides a reference frame to relate the network topology and the convergence rate through the smoothness constants. This also brings to light the benefit of specific modules popular among practitioners to improve the convergence.
In Sec. 2, we define the parameterized input-output map implemented by a deep network as a chain-composition of modules and write the corresponding optimization objective consisting in learning the parameters of this map. In Sec. 3, we detail the implementation of first-order and second-order oracles by dynamic programming; the classical gradient backpropagation algorithm is recovered as a canonical example. Gauss-Newton steps can also be simply stated in terms of calls to an automatic-differentiation oracle implemented in modern machine learning software libraries. In Sec. 4, we present the computation of the smoothness constants of a chain of computations given its components and the resulting convergence guarantees for gradient descent. Finally, in Sec. 5, we present the application of the approach to derive the smoothness constants for the VGG architecture and illustrate how our approach can be used to identify the benefits of batch-normalization (Simonyan & Zisserman, 2015; Ioffe & Szegedy, 2015). All proofs and notations are provided in the Appendix.
2 Problem formulation
2.1 Deep network structure
A feed-forward deep network of depth $\tau$ can be described as a transformation of an input $x_0$ into an output $x_\tau$ through the composition of $\tau$ blocks, called layers, illustrated in Fig. 1. Each layer is defined by a set of parameters. In general (see Sec. 2.3 for a detailed decomposition), these parameters act on the input of the layer through an affine operation followed by a nonlinear operation. Formally, the $t$-th layer can be described as a function of its parameters $u_t$ and a given input $x_{t-1}$ that outputs $x_t$ as

(1) $x_t = \phi_t(u_t, x_{t-1}) = a_t(b_t(u_t, x_{t-1})),$

where $b_t$ is generally linear in $u_t$ and affine in $x_{t-1}$, and $a_t$ is nonlinear.
Learning a deep network consists in minimizing w.r.t. its parameters $u = (u_1; \ldots; u_\tau)$ an objective involving $n$ inputs $\bar{x}^{(1)}, \ldots, \bar{x}^{(n)}$. Formally, the problem is written

(2) $\min_{u_1, \ldots, u_\tau} \; h\big(x_\tau^{(1)}, \ldots, x_\tau^{(n)}\big) + r(u_1, \ldots, u_\tau)$
subject to $\; x_t^{(i)} = \phi_t(u_t, x_{t-1}^{(i)})$ for $t = 1, \ldots, \tau$, with $x_0^{(i)} = \bar{x}^{(i)}$, for $i = 1, \ldots, n$,

where $u_t \in \mathbb{R}^{p_t}$ is the set of parameters at layer $t$, whose dimension $p_t$ can vary among layers, and $r$ is a regularization on the parameters of the network.
We are interested in the influence of the structure of the problem, i.e., the chain of computations defined below, on its optimization complexity.
Definition 2.1.
A function $f: \mathbb{R}^p \to \mathbb{R}^d$ is a chain of $\tau$ computations if it is defined by an input $x_0 \in \mathbb{R}^{d_0}$ and functions $\phi_t: \mathbb{R}^{p_t} \times \mathbb{R}^{d_{t-1}} \to \mathbb{R}^{d_t}$ for $t = 1, \ldots, \tau$ such that $p = \sum_{t=1}^{\tau} p_t$, $d = d_\tau$, and for $u = (u_1; \ldots; u_\tau)$ with $u_t \in \mathbb{R}^{p_t}$, the output of $f$ is given by $f(u) = x_\tau$, where

(3) $x_t = \phi_t(u_t, x_{t-1}) \quad$ for $t = 1, \ldots, \tau$.
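For illustration, here is a minimal sketch of such a chain in Python/NumPy; the tanh layer map and its parameters are hypothetical stand-ins chosen for simplicity, not the paper's:

```python
import numpy as np

def chain(phis, us, x0):
    """Evaluate the chain of computations (3): x_t = phi_t(u_t, x_{t-1}),
    returning the output x_tau = f(u)."""
    x = x0
    for phi, u in zip(phis, us):
        x = phi(u, x)
    return x

# Hypothetical two-layer chain with phi_t(u_t, x) = tanh(U_t x), u_t = vec(U_t).
rng = np.random.default_rng(0)
phis = [lambda U, x: np.tanh(U @ x)] * 2
us = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
x_tau = chain(phis, us, rng.standard_normal(3))  # output x_tau lies in R^2
```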
By considering the concatenation of the parameters $u = (u_1; \ldots; u_\tau)$ and the concatenation of the transformations of each input as a single transformation, i.e., $f(u) = (f^{(1)}(u); \ldots; f^{(n)}(u))$ where $f^{(i)}$ is the chain of computations defined by the input $\bar{x}^{(i)}$, the objective in (2) can be written as

(4) $\min_{u \in \mathbb{R}^p} \; h(f(u)) + r(u),$

where $f$ is a chain of computations.
2.2 Objectives
Supervised learning
For supervised learning, the objective $h$ can be decomposed as

(5) $h(f(u)) = \frac{1}{n} \sum_{i=1}^{n} h_i\big(f^{(i)}(u)\big),$

where the $h_i$ are losses on the labels predicted by the chain of computations, i.e., $h_i(f^{(i)}(u)) = \mathcal{L}(f^{(i)}(u), y^{(i)})$, where $y^{(i)}$ is the label of the input $\bar{x}^{(i)}$ of the chain of computations $f^{(i)}$, $i = 1, \ldots, n$, and $\mathcal{L}$ is a given loss such as the squared loss or the logistic loss (see Appendix B.1).
Unsupervised learning
In unsupervised learning tasks the labels are unknown. The objective itself is defined through a minimization problem rather than through an explicit loss function. For example, a convex clustering objective is written

$h(f(u)) = \min_{y_1, \ldots, y_n} \; \frac{1}{2} \sum_{i=1}^{n} \big\|y_i - f^{(i)}(u)\big\|_2^2 + \lambda \sum_{i < j} \|y_i - y_j\|_2,$

where the $f^{(i)}$ are chains defined by inputs $\bar{x}^{(i)}$. See (Hocking et al., 2011; Tan & Witten, 2015) for the original formulations. We consider in Appendix B.2 different clustering objectives. Note that the classical ones ($k$-means, spectral clustering) are inherently non-smooth, i.e., non-continuously differentiable, as they are defined as the minimization of a linear objective under constraints.
2.3 Layers
The $t$-th layer of a deep network can be described by the following components:

- a bi-affine operation, such as a matrix multiplication or a convolution, denoted $b_t$ and decomposed as

(6) $b_t(u_t, x_{t-1}) = \beta_t(u_t, x_{t-1}) + \beta_t^u(u_t) + \beta_t^x(x_{t-1}) + \beta_t^0,$

where $\beta_t$ is bilinear, $\beta_t^u$ and $\beta_t^x$ are linear and $\beta_t^0$ is a constant vector,

- an activation function, such as the elementwise application of a nonlinear function, denoted $\alpha_t$,

- a reduction of dimension, such as a pooling operation, denoted $\pi_t$,

- a normalization of the output, such as batch-normalization, denoted $\nu_t$.

By composing the non-affine operations, i.e., defining $a_t = \nu_t \circ \pi_t \circ \alpha_t$, a layer can be written as

(7) $\phi_t(u_t, x_{t-1}) = a_t\big(b_t(u_t, x_{t-1})\big).$
Note that some components may not be included; for example, some layers do not include normalization. In the following, we consider the nonlinear operation $a_t$ to be an arbitrary composition of $k_t$ functions, i.e., $a_t = a_{t,k_t} \circ \cdots \circ a_{t,1}$. We now present common examples of the components of a deep network; a complete list, with the smoothness properties of each function, is detailed in Appendix B.
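Concretely, the composition in (7) can be assembled from its components; a minimal sketch, where b, alpha, pi and nu stand for the bi-affine, activation, pooling and normalization maps of the text:

```python
def make_layer(b, alpha, pi, nu):
    """Compose the components of (7) into a single layer map
    phi(u, x) = nu(pi(alpha(b(u, x))))."""
    def phi(u, x):
        return nu(pi(alpha(b(u, x))))
    return phi

# Layers that skip pooling or normalization pass the identity for pi or nu.
identity = lambda z: z
```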
Linear operations
In the following, we drop the dependency w.r.t. the layer $t$ and mark with a prime the quantities characterizing the output (e.g., $\delta'$ for its dimension). We denote by semicolons the concatenations of matrices by rows, i.e., for $A \in \mathbb{R}^{d_1 \times n}$ and $B \in \mathbb{R}^{d_2 \times n}$, $(A; B) = (A^\top, B^\top)^\top \in \mathbb{R}^{(d_1 + d_2) \times n}$.
Fully connected layer A fully connected layer taking a batch of $m$ inputs of dimension $\delta$ is written

(8) $B^{\mathrm{fc}}(W, w, Z) = W^\top Z + w \mathbf{1}_m^\top,$

where $Z = (z^{(1)}, \ldots, z^{(m)}) \in \mathbb{R}^{\delta \times m}$ is the batch of inputs, $W \in \mathbb{R}^{\delta \times \delta'}$ are the weights of the layer and $w \in \mathbb{R}^{\delta'}$ define the offsets. By vectorizing the parameters and the inputs, a fully connected layer can be written as

$b^{\mathrm{fc}}(u, x) = \mathrm{vec}\big(B^{\mathrm{fc}}(W, w, Z)\big),$ where $u = (\mathrm{vec}(W); w)$, $x = \mathrm{vec}(Z)$.
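A sketch of this layer in NumPy, assuming the matrix form $W^\top Z + w \mathbf{1}_m^\top$ written above:

```python
import numpy as np

def fully_connected(Z, W, w):
    """Fully connected layer (8) on a batch Z of shape (delta, m):
    returns W^T Z + w 1_m^T, of shape (delta', m)."""
    return W.T @ Z + w[:, None]

Z = np.ones((3, 5))                  # batch of m = 5 inputs of dimension 3
W, w = np.ones((3, 2)), np.zeros(2)  # delta' = 2 outputs per input
out = fully_connected(Z, W, w)       # shape (2, 5)
```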
Convolutional layer A convolutional layer convolves a batch of $m$ inputs (images or signals) of dimension $\delta$ stacked as $Z = (z^{(1)}, \ldots, z^{(m)}) \in \mathbb{R}^{\delta \times m}$ with $n_f$ affine filters of size $s_f$, defined by weights $W = (w^{(1)}, \ldots, w^{(n_f)})$ and offsets $w_0 = (w_0^{(1)}, \ldots, w_0^{(n_f)})$, through $n_p$ patches. The $k$-th output of the convolution of the $i$-th input by the $j$-th filter reads

(9) $o_{i,j,k} = w^{(j)\top} \Pi_k z^{(i)} + w_0^{(j)},$

where $\Pi_k \in \mathbb{R}^{s_f \times \delta}$ extracts a patch of size $s_f$ at a given position of the input $z^{(i)}$. The output is then given by the concatenation of these values for each input, i.e., $o^{(i)} = (o_{i,j,k})_{j,k}$. By vectorizing the inputs and the outputs, the convolution operation is defined by the set of matrices $\Pi_1, \ldots, \Pi_{n_p}$ such that

$b^{\mathrm{conv}}(u, x) = \big(o^{(1)}; \ldots; o^{(m)}\big),$ where $u = (\mathrm{vec}(W); w_0)$, $x = \mathrm{vec}(Z)$,

and the output is defined by concatenations of the values in (9).
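A sketch of the patch-based convolution (9) in NumPy, with each extractor $\Pi_k$ represented as an index array (a hypothetical one-dimensional example; practical implementations use im2col-style reshaping or dedicated kernels):

```python
import numpy as np

def conv_layer(Z, W, w0, patches):
    """Convolution (9): o[i, j, k] = w_j . (Pi_k z_i) + w0_j, with each
    patch extractor Pi_k represented by an index array."""
    m, n_f, n_p = Z.shape[1], W.shape[1], len(patches)
    out = np.empty((m, n_f, n_p))
    for k, idx in enumerate(patches):     # idx plays the role of Pi_k
        out[:, :, k] = Z[idx, :].T @ W + w0
    return out

# One-dimensional example: inputs of dimension 5, filters of size 3, stride 1.
Z = np.arange(10.0).reshape(5, 2)          # batch of m = 2 inputs
W, w0 = np.ones((3, 4)), np.zeros(4)       # n_f = 4 filters of size s_f = 3
patches = [np.arange(k, k + 3) for k in range(3)]   # n_p = 3 patches
O = conv_layer(Z, W, w0, patches)          # shape (2, 4, 3)
```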
Activation functions
We consider differentiable elementwise activation functions $\alpha$, i.e., for a given $x \in \mathbb{R}^d$,

(10) $\alpha(x) = \big(\bar{\alpha}(x_1), \ldots, \bar{\alpha}(x_d)\big)^\top,$

for a given scalar function $\bar{\alpha}$, such as $\bar{\alpha}(z) = (1 + e^{-z})^{-1}$ for the sigmoid function.
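For instance, the sigmoid case of (10) reads as follows in NumPy:

```python
import numpy as np

def sigmoid_activation(x):
    """Elementwise activation (10) with the sigmoid as the scalar map:
    applies z -> 1 / (1 + exp(-z)) to every coordinate of x."""
    return 1.0 / (1.0 + np.exp(-x))
```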
Pooling functions
A pooling layer reduces the dimension of the output. For example, an average pooling convolves an input image with a mean filter. Formally, for a batch of inputs $Z$, the average pooling with patch size $s_f$ convolves each channel of the inputs with the constant filter $w = \mathbf{1}_{s_f}/s_f$. The output dimension $\delta'$ for each input is determined by the patches, represented by some $\Pi_k$ acting as in Eq. (9), which are chosen such that the operation induces a reduction of dimension, i.e., $\delta' < \delta$.
Normalization functions
Given a batch of inputs $Z = (z^{(1)}, \ldots, z^{(m)})$, the batch-normalization outputs $\hat{Z} = (\hat{z}^{(1)}, \ldots, \hat{z}^{(m)})$ defined by

(11) $\hat{z}^{(i)} = \dfrac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \big(z^{(i)} - \mu\big)^2,$

with $\epsilon > 0$ and the operations taken coordinate-wise, such that the vectorized formulation of the batch-normalization reads $\nu^{\mathrm{bn}}(x)$ for $x = \mathrm{vec}(Z)$.
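A sketch of (11) in NumPy, for a batch stacked column-wise and without the learned scale and shift parameters that practical implementations add:

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    """Batch-normalization (11): center and rescale every coordinate
    across the batch (the columns of Z)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)
```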
3 Oracle arithmetic complexity
For each class of optimization algorithms considered (gradient descent, Gauss-Newton, Newton), we define the appropriate optimization oracle called at each step of the optimization algorithm, which can be efficiently computed through a dynamic programming procedure. For a gradient step, we recover the gradient backpropagation algorithm. The gradient backpropagation algorithm then forms the basis of automatic-differentiation procedures.
3.1 Oracle reformulations
All optimization oracles can be formally defined as the minimization of an approximation of the objective with an additional proximal term. For a function $f$, we denote

$\ell_f(x; \bar{x}) = f(\bar{x}) + \nabla f(\bar{x})^\top (x - \bar{x}),$
$q_f(x; \bar{x}) = f(\bar{x}) + \nabla f(\bar{x})^\top (x - \bar{x}) + \tfrac{1}{2}(x - \bar{x})^\top \nabla^2 f(\bar{x})(x - \bar{x}),$

the linear and quadratic approximations, respectively, of $f$ around $\bar{x}$, provided that $\nabla f(\bar{x})$ and $\nabla^2 f(\bar{x})$ are respectively defined. On a point $\bar{u}$, given a step-size $\gamma$, for an objective of the form $\psi(u) = h(f(u)) + r(u)$,

- a gradient step is defined as

(12) $\bar{u}^+ = \operatorname*{arg\,min}_{u} \; \Big\{ \ell_{\psi}(u; \bar{u}) + \frac{1}{2\gamma}\|u - \bar{u}\|_2^2 \Big\},$

- a (regularized) Gauss-Newton step is defined as

(13) $\bar{u}^+ = \operatorname*{arg\,min}_{u} \; \Big\{ q_h\big(\ell_f(u; \bar{u}); f(\bar{u})\big) + q_r(u; \bar{u}) + \frac{1}{2\gamma}\|u - \bar{u}\|_2^2 \Big\},$

- a Newton step is defined as

(14) $\bar{u}^+ = \operatorname*{arg\,min}_{u} \; \Big\{ q_{\psi}(u; \bar{u}) + \frac{1}{2\gamma}\|u - \bar{u}\|_2^2 \Big\}.$
All those steps amount to solving quadratic problems on a linearized network, as shown in the following proposition. For a multivariate function $g: \mathbb{R}^d \to \mathbb{R}^n$ composed of real functions $g_1, \ldots, g_n$ with $g = (g_1, \ldots, g_n)$, we denote $\nabla g(x) = (\nabla g_1(x), \ldots, \nabla g_n(x)) \in \mathbb{R}^{d \times n}$, that is, the transpose of its Jacobian on $x$. We represent its 2nd-order information by a tensor $\nabla^2 g(x) \in \mathbb{R}^{d \times d \times n}$ stacking the Hessians of its coordinates. For a real function $\phi$ of two variables, whose value on $(u, x)$ is denoted $\phi(u, x)$, we decompose its gradient on $(u, x)$ as

$\nabla \phi(u, x) = \big(\nabla_u \phi(u, x); \nabla_x \phi(u, x)\big).$

We decompose similarly its Hessian and combine notations for multivariate functions. See Appendix A for further details on derivatives and tensor notations.

Proposition 3.1. Let $f$ be defined by the chain of computations in (3) applied to $u = (u_1; \ldots; u_\tau)$ and let $\psi = h \circ f + r$. Assume $r$ to be decomposable as $r(u) = \sum_{t=1}^{\tau} r_t(u_t)$. Gradient (12), Gauss-Newton (13) and Newton (14) steps are given as $\bar{u}^+ = \bar{u} + v^*$, where $v^* = (v_1^*; \ldots; v_\tau^*)$ is the solution of a problem of the form

(15) $\min_{\substack{v_1, \ldots, v_\tau \\ y_0, \ldots, y_\tau}} \; q(y_\tau) + \sum_{t=1}^{\tau} \Big( q_t(y_{t-1}, v_t) + \frac{1}{2\gamma}\|v_t\|_2^2 \Big)$
subject to $\; y_t = A_t y_{t-1} + B_t v_t$ for $t = 1, \ldots, \tau$, $\; y_0 = 0$,

where the quadratics $q$, $q_t$ and the linear maps $A_t = \nabla_x \phi_t(\bar{u}_t, x_{t-1})^\top$, $B_t = \nabla_u \phi_t(\bar{u}_t, x_{t-1})^\top$ are built from the derivatives of $h$, $r_t$ and $\phi_t$ at the current point. In particular,

- for gradient steps (12), $q$ and $q_t$ reduce to the linear forms $q(y_\tau) = \nabla h(f(\bar{u}))^\top y_\tau$ and $q_t(y_{t-1}, v_t) = \nabla r_t(\bar{u}_t)^\top v_t$; the corresponding quadratics for Gauss-Newton (13) and Newton (14) steps, which involve in addition the second-order information of $h$ and of the whole chain respectively, are detailed in Appendix C.
Problems of the form

(16) $\min_{\substack{v_1, \ldots, v_\tau \\ y_0, \ldots, y_\tau}} \; \sum_{t=1}^{\tau} g_t(y_{t-1}, v_t) + g_{\tau+1}(y_\tau)$
subject to $\; y_t = \Phi_t(y_{t-1}, v_t)$ for $t = 1, \ldots, \tau$, $\; y_0 = \hat{y}_0$,

can be chunked into smaller problems defined as the cost-to-go from $y$ at time $t$ by

$c_t(y) = \min_{\substack{v_{t+1}, \ldots, v_\tau \\ y_t, \ldots, y_\tau}} \; \sum_{s=t+1}^{\tau} g_s(y_{s-1}, v_s) + g_{\tau+1}(y_\tau)$
subject to $\; y_s = \Phi_s(y_{s-1}, v_s)$ for $s = t+1, \ldots, \tau$, $\; y_t = y$,

such that they follow Bellman's recursive equation

(17) $c_t(y) = \min_{v} \; \big\{ g_{t+1}(y, v) + c_{t+1}\big(\Phi_{t+1}(y, v)\big) \big\}, \qquad c_\tau(y) = g_{\tau+1}(y).$
This principle cannot be used directly on the original problem, since Eq. (17) cannot be solved analytically for generic problems of the form (16). However, for quadratic problems with linear dynamics of the form (15), each step of the recursion can be solved analytically, so that (15) can be solved by dynamic programming. See (Bertsekas, 2005) for a review of the dynamic programming literature. Therefore, as a corollary of Prop. 3.1, the complexity of all optimization steps given in (12), (13), (14) is linear w.r.t. the length $\tau$ of the chain. Precisely, Prop. 3.1 reduces each optimization step to a sequence of analytically solvable instances of Bellman's recursive equation (17).
In particular, while the Hessian of the objective is a matrix of size $p \times p$ with $p = \sum_{t=1}^{\tau} p_t$, a Newton step has a complexity linear, and not cubic, with respect to the length $\tau$ of the chain. We present in Appendix C the detailed computation of a Newton step; see (Dunn & Bertsekas, 1989) for an alternate derivation. It involves the inversion of intermediate quadratic costs at each layer. Gauss-Newton steps can also be computed by dynamic programming, and can be implemented more efficiently using automatic-differentiation oracles, as we explain below.
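To make the dynamic programming argument concrete, the following sketch solves a subproblem of the shape (15) by the backward-forward recursion, under the simplifying assumption that the cost is quadratic in the final state and proximal in the moves $v_t$ (the Gauss-Newton and Newton cases carry additional quadratic terms but follow the same recursion):

```python
import numpy as np

def dp_quadratic_chain(As, Bs, gamma, P, p):
    """Solve min over (v_t, y_t) of 0.5 y_tau' P y_tau + p' y_tau
    + sum_t ||v_t||^2 / (2 gamma), subject to y_t = A_t y_{t-1} + B_t v_t
    and y_0 = 0, via Bellman's recursion (17) with quadratic cost-to-go
    c_t(y) = 0.5 y' C y + c' y (+ const)."""
    tau = len(As)
    C, c = P, p                                    # cost-to-go at time tau
    data = [None] * tau
    for t in reversed(range(tau)):                 # backward pass
        A, B = As[t], Bs[t]
        K = np.eye(B.shape[1]) / gamma + B.T @ C @ B
        M = np.linalg.solve(K, B.T)                # K^{-1} B^T
        data[t] = (M, C, c)
        C, c = A.T @ (C - C @ B @ (M @ C)) @ A, A.T @ (c - C @ B @ (M @ c))
    y = np.zeros(As[0].shape[1])                   # y_0 = 0
    vs = []
    for t in range(tau):                           # forward pass
        M, C_next, c_next = data[t]
        v = -M @ (C_next @ (As[t] @ y) + c_next)   # optimal v_t given y_{t-1}
        y = As[t] @ y + Bs[t] @ v
        vs.append(v)
    return vs, y

# Two-layer example with random data.
rng = np.random.default_rng(1)
As = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
Bs = [rng.standard_normal((4, 6)), rng.standard_normal((2, 5))]
vs, y_tau = dp_quadratic_chain(As, Bs, 0.5, np.eye(2), rng.standard_normal(2))
```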
3.2 Automatic differentiation
Algorithm
As explained in the last subsection and shown in Appendix C, a gradient step can naturally be derived as a dynamic programming procedure applied to the subproblem (15). However, the implementation of the gradient step itself provides a different kind of oracle on the chain of computations, as defined below.
Definition 3.1.
Given a chain of computations $f$ as defined in Def. 2.1 and a point $u \in \mathbb{R}^p$, an automatic differentiation oracle is a procedure that gives access to the linear operator

$\mu \mapsto \nabla f(u)\mu.$
The point is that we have access to $\nabla f(u)$ not as a matrix but as a linear operator. The matrix $\nabla f(u)$ can also be computed and stored to perform gradient-vector products. Yet, this requires a surplus of storage and of computations that is generally not necessary for our purposes. The only quantities that need to be stored are given by a forward pass. Then, these quantities can be used to compute any gradient-vector product $\nabla f(u)\mu$ directly.
The definition of an automatic differentiation oracle is composed of two steps:

- a forward pass that computes $f(u)$ and stores the information necessary to compute gradient-vector products,

- a backward pass that computes $\nabla f(u)\mu$ for any given $\mu$ from the information stored in the forward pass.
Note that the two aforementioned passes are decoupled, in the sense that the forward pass does not require the knowledge of the slope $\mu$ for which $\nabla f(u)\mu$ is computed.
We present in Algo. 1 and Algo. 2 the classical forward-backward passes used in modern automatic-differentiation libraries. The implementation of the automatic differentiation oracle as a procedure that computes both the value of the chain and the linear operator is then presented in Algo. 3.
Computing the gradient of a composition $h \circ f$ on $u$ then amounts to

- computing $f(u)$ with Algo. 3, together with the oracle $\mu \mapsto \nabla f(u)\mu$,

- computing $\nabla h(f(u))$, then $\nabla (h \circ f)(u) = \nabla f(u) \nabla h(f(u))$ using the oracle returned by Algo. 3,

as sketched below.
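The following self-contained sketch illustrates the two passes on a toy chain of tanh layers (hypothetical stand-ins for the $\phi_t$): the forward pass stores the intermediate $x_t$, and the backward pass computes $\nabla f(u)\mu$ for any slope $\mu$ from these stored quantities alone:

```python
import numpy as np

def forward(us, x0):
    """Forward pass (in the spirit of Algo. 1): compute the chain and
    store every intermediate x_t, which is all the backward pass needs."""
    xs = [x0]
    for U in us:
        xs.append(np.tanh(U @ xs[-1]))
    return xs

def backward(us, xs, mu):
    """Backward pass (in the spirit of Algo. 2): compute nabla f(u) mu
    blockwise, using only the slope mu and the stored intermediates xs."""
    grads = [None] * len(us)
    lam = mu                                  # adjoint on the output x_tau
    for t in range(len(us) - 1, -1, -1):
        delta = (1.0 - xs[t + 1] ** 2) * lam  # tanh'(z) = 1 - tanh(z)^2
        grads[t] = np.outer(delta, xs[t])     # block of nabla f(u) mu for U_t
        lam = us[t].T @ delta                 # adjoint passed to layer t - 1
    return grads

# Gradient of h o f for h(x) = 0.5 ||x - y||^2: take mu = nabla h(f(u)).
rng = np.random.default_rng(0)
us = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
x0, y = rng.standard_normal(3), rng.standard_normal(2)
xs = forward(us, x0)
grads = backward(us, xs, xs[-1] - y)          # one block per layer
```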
Complexity
Without additional information on the structure of the layers, the complexities of the forward and backward passes can readily be computed, as shown in the following proposition. The units chosen are, for the space complexity, the cost of storing one cell of a matrix and, for the time complexity, the cost of performing an addition or a multiplication.

Proposition 3.2. The space and time complexities of the forward and backward passes, Algo. 1 and Algo. 2, are of the order of $O\big(\sum_{t=0}^{\tau} d_t\big)$ in space and $\mathcal{T}_{\mathrm{f}} = \sum_{t=1}^{\tau} \mathcal{T}_t$ and $\mathcal{T}_{\mathrm{b}}$ in time, respectively, where $\mathcal{T}_t$ is the time complexity of computing $\phi_t(u_t, x_{t-1})$ during the forward pass, $\mathcal{T}_{\mathrm{f}}$ denotes the time complexity of the forward pass and $\mathcal{T}_{\mathrm{b}}$ denotes the time complexity of the backward pass.

For chains of computations of the form (7), the time complexity of the backward pass can be refined as shown in Appendix C.3. Specifically, we have the following corollary.
Corollary 3.2.
For a chain of fully-connected layers (8) with elementwise activation functions, no normalization or pooling, the time complexity of the backward pass is of the order of

$O\Big(\sum_{t=1}^{\tau} m\, \delta_{t-1} \delta_t\Big)$

elementary operations. For a chain of convolutional layers (9) with elementwise activation functions, no normalization or pooling, the time complexity of the backward pass is of the order of

$O\Big(\sum_{t=1}^{\tau} m\, n_p^t\, s_f^t\, n_f^t\Big)$

elementary operations.
3.3 Gauss-Newton by automatic differentiation
The Gauss-Newton step can also be computed by making calls to an automatic differentiation oracle, as shown in (Roulet et al., 2019) and restated here in the framework considered in this paper.

Proposition 3.3. Consider the Gauss-Newton step (13) on $\bar{u}$ for a convex objective $h$, a convex decomposable regularization $r$ and a differentiable chain of computations $f$. We have that

- the Gauss-Newton step amounts to solving a dual problem

(18) $\max_{\mu} \; -q^\star(\mu) - \tilde{q}^\star\big(-\nabla f(\bar{u})\mu\big),$

where $q$ and $\tilde{q}$ are quadratics built from the approximations of $h$ and $r$ in (13) and the proximal term, and for a function $q$ we denote by $q^\star$ its convex conjugate,

- the Gauss-Newton step reads $\bar{u}^+ = \bar{u} + v^*(\mu^*)$, where $\mu^*$ is the solution of (18),

- the dual problem (18) can be solved by $O(d_\tau)$ calls to an automatic differentiation procedure, where $d_\tau$ is the dimension of the output of the chain.
Proposition 3.3 shows that a Gauss-Newton step is only $d_\tau$ times more expensive than a gradient step. Precisely, for a deep network with a supervised objective, we have $d_\tau = nk$, where $n$ is the number of samples and $k$ is the number of classes. A gradient step then makes one call to an automatic differentiation procedure to get the gradient on the batch, and the Gauss-Newton method will then make $nk$ more calls. If mini-batch Gauss-Newton steps are considered, then the cost reduces to $mk$ calls to an automatic differentiation oracle, where $m$ is the size of the mini-batch.
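While Prop. 3.3 proceeds through the dual problem (18), the role of automatic-differentiation oracles can also be illustrated on the primal: the sketch below solves a regularized Gauss-Newton system by conjugate gradient, touching the Jacobian only through Jacobian-vector and gradient-vector products (the function names and this CG variant are ours, not the paper's):

```python
import numpy as np

def gauss_newton_direction(jvp, vjp, hess_h_mv, grad_h, gamma, dim,
                           iters=100, tol=1e-10):
    """Solve (J^T H J + I / gamma) v = -J^T grad_h by conjugate gradient,
    where J is the Jacobian of f at u-bar and H the Hessian of h at f(u-bar);
    J is touched only through the oracles jvp (v -> J v) and vjp
    (mu -> J^T mu), one call to each per iteration."""
    def matvec(v):
        return vjp(hess_h_mv(jvp(v))) + v / gamma
    b = -vjp(grad_h)
    v = np.zeros(dim)
    r = b.copy()          # residual of the initial guess v = 0
    d = b.copy()
    rs = r @ r
    for _ in range(iters):
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        v += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return v

# Toy check with an explicit Jacobian and a squared loss (H = identity).
rng = np.random.default_rng(2)
J, g = rng.standard_normal((3, 5)), rng.standard_normal(3)
v = gauss_newton_direction(lambda v: J @ v, lambda mu: J.T @ mu,
                           lambda mu: mu, g, gamma=1.0, dim=5)
```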
4 Optimization complexity
The convergence guarantee of a first-order method towards an $\varepsilon$-stationary point is governed by the smoothness properties of the objective, i.e., the Lipschitz continuity of the function itself or of its gradient when it is defined. We study smoothness properties with respect to the Euclidean norm $\|\cdot\|_2$, whose induced operator norm is denoted

(19) $\|A\|_{2,2} = \max_{\|x\|_2 \leq 1} \|Ax\|_2.$

For a function $f$ and a set $C$, we denote by $m_f$ a bound of $f$ on $C$, by $\ell_f$ the Lipschitz-continuity parameter of $f$ on $C$ and by $L_f$ the smoothness parameter of $f$ on $C$ (i.e., the Lipschitz-continuity parameter of its gradient, if it exists), all with respect to $\|\cdot\|_2$.

For a given set $C$, we denote by $\mathcal{C}_{\ell, L}(C)$ the class of functions $f$ such that $\ell_f \leq \ell$ and $L_f \leq L$ on $C$. Similarly, we denote by $\mathcal{C}_{\ell}(C)$ the class of functions such that $\ell_f \leq \ell$. We drop indexes to denote classes of functions for which only a subset of these parameters is defined. For example, we denote by $\mathcal{C}_{\ell}$ the set of functions that are $\ell$-Lipschitz continuous.
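As a preview of such computations, the elementary composition rules behind these estimates can be coded directly; a sketch under the assumption that each layer comes with known Lipschitz and smoothness constants (the estimates of this section are finer, tracking bounds and the parameter blocks as well):

```python
def compose_constants(l_g, L_g, l_f, L_f):
    """Composition rules: if f is l_f-Lipschitz with L_f-Lipschitz gradient,
    and likewise g, then g o f is (l_g * l_f)-Lipschitz and its gradient is
    (L_g * l_f**2 + l_g * L_f)-Lipschitz."""
    return l_g * l_f, L_g * l_f ** 2 + l_g * L_f

def chain_constants(layer_constants):
    """Fold per-layer (Lipschitz, smoothness) estimates along a chain,
    starting from the identity map (Lipschitz 1, smoothness 0)."""
    l, L = 1.0, 0.0
    for l_phi, L_phi in layer_constants:
        l, L = compose_constants(l_phi, L_phi, l, L)
    return l, L

# Three layers, each 1-Lipschitz with 1-Lipschitz gradients.
print(chain_constants([(1.0, 1.0)] * 3))   # -> (1.0, 3.0)
```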
4.1 Convergence rate to a stationary point
We recall the convergence rate to a stationary point of a gradient descent and a stochastic gradient descent on constrained problems.
Theorem 4.1 (Ghadimi et al. (2016, Theorems 1 and 2)).
Consider problems of the form