An Elementary Approach to Convergence Guarantees of Optimization Algorithms for Deep Networks

Abstract

We present an approach to obtain convergence guarantees of optimization algorithms for deep networks based on elementary arguments and computations. The convergence analysis revolves around the analytical and computational structures of optimization oracles central to the implementation of deep networks in machine learning software. We provide a systematic way to compute estimates of the smoothness constants that govern the convergence behavior of first-order optimization algorithms used to train deep networks. A diverse set of example components and architectures arising in modern deep networks intersperse the exposition to illustrate the approach.

1 Introduction

Deep networks have achieved remarkable performance in several application domains such as computer vision, natural language processing and genomics (Krizhevsky et al., 2012; Pennington et al., 2014; Duvenaud et al., 2015). A deep network can be framed as a chain of composition of modules, where each module is typically the composition of a non-linear function and an affine transformation. The last module in the chain is usually task-specific and can be expressed either in analytical form as in supervised classification or as the solution of an optimization problem in dimension reduction or clustering.

The optimization problem arising when training a deep network is often framed as a generic non-convex optimization problem, which dismisses the structure of the objective even though that structure is central to the software implementation. Indeed, optimization algorithms used to train deep networks proceed by making calls to first-order (or second-order) oracles relying on dynamic programming such as gradient back-propagation (Werbos, 1994; Rumelhart et al., 1986; Lecun, 1988). See also (Duda et al., 2012; Anthony & Bartlett, 2009; Shalev-Shwartz & Ben-David, 2014; Goodfellow et al., 2016) for an exposition and (Abadi et al., 2015; Paszke et al., 2017) for an implementation of gradient back-propagation for deep networks. We highlight here the elementary yet important fact that the chain-compositional structure of the objective naturally emerges through the smoothness constants governing the convergence guarantee of a gradient-based optimization algorithm. This provides a reference frame to relate the network topology and the convergence rate through the smoothness constants. This also brings to light the benefit of specific modules popular among practitioners to improve the convergence.

In Sec. 2, we define the parameterized input-output map implemented by a deep network as a chain-composition of modules and write the corresponding optimization objective consisting in learning the parameters of this map. In Sec. 3, we detail the implementation of first-order and second-order oracles by dynamic programming; the classical gradient back-propagation algorithm is recovered as a canonical example. Gauss-Newton steps can also be simply stated in terms of calls to an automatic-differentiation oracle implemented in modern machine learning software libraries. In Sec. 4, we present the computation of the smoothness constants of a chain of computations given its components and the resulting convergence guarantees for gradient descent. Finally, in Sec. 5, we present the application of the approach to derive the smoothness constants for the VGG architecture and illustrate how our approach can be used to identify the benefits of batch-normalization (Simonyan & Zisserman, 2015; Ioffe & Szegedy, 2015). All proofs and notations are provided in the Appendix.

2 Problem formulation

2.1 Deep network structure

A feed-forward deep network of depth $\tau$ can be described as a transformation of an input $x$ into an output $z_\tau$ through the composition of $\tau$ blocks, called layers, illustrated in Fig. 1. Each layer is defined by a set of parameters. In general (see Sec. 2.3 for a detailed decomposition), these parameters act on the input of the layer through an affine operation followed by a non-linear operation. Formally, the $l$-th layer can be described as a function $\phi_l$ of its parameters $v_l$ and a given input $z_{l-1}$ that outputs $z_l$ as

\[ z_l = \phi_l(v_l, z_{l-1}) = a_l(b_l(v_l, z_{l-1})), \tag{1} \]

where $b_l$ is generally linear in $v_l$ and affine in $z_{l-1}$ and $a_l$ is non-linear.

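Learning a deep network consists in minimizing w.r.t. its parameters an objective involving $n$ inputs $x^{(1)},\ldots,x^{(n)}$. Formally, the problem is written

\[
\begin{aligned}
\min_{(v_1,\ldots,v_\tau)\in\mathbb{R}^{\rho_1}\times\ldots\times\mathbb{R}^{\rho_\tau}} \quad & f(z_\tau^{(1)},\ldots,z_\tau^{(n)}) + r(v_1,\ldots,v_\tau) \\
\text{subject to} \quad & z_l^{(i)} = \phi_l(v_l, z_{l-1}^{(i)}) \quad \text{for } l=1,\ldots,\tau,\ i=1,\ldots,n, \\
& z_0^{(i)} = x^{(i)} \quad \text{for } i=1,\ldots,n,
\end{aligned}
\tag{2}
\]

where $v_l\in\mathbb{R}^{\rho_l}$ is the set of parameters at layer $l$, whose dimension $\rho_l$ can vary among layers, and $r$ is a regularization on the parameters of the network.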

We are interested in the influence of the structure of the problem, i.e., the chain of computations defined below, on the optimization complexity of the problem.

Definition 2.1.

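A function $\psi:\mathbb{R}^p\to\mathbb{R}^q$ is a chain of $\tau$ computations, if it is defined by an input $x\in\mathbb{R}^{\delta_0}$ and $\tau$ functions $\phi_l:\mathbb{R}^{\rho_l}\times\mathbb{R}^{\delta_{l-1}}\to\mathbb{R}^{\delta_l}$ for $l=1,\ldots,\tau$ such that $\sum_{l=1}^{\tau}\rho_l = p$, $\delta_\tau = q$, and for $w=(v_1;\ldots;v_\tau)\in\mathbb{R}^p$ with $v_l\in\mathbb{R}^{\rho_l}$, the output of $\psi$ is given by

\[
\psi(w) = z_\tau, \quad \text{with} \quad
\begin{aligned}
z_l &= \phi_l(v_l, z_{l-1}) \quad \text{for } l=1,\ldots,\tau, \\
z_0 &= x.
\end{aligned}
\tag{3}
\]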

By considering the concatenation of the parameters $w=(v_1;\ldots;v_\tau)$ and the concatenation of the transformations of each input as a single transformation, i.e., $\psi(w) = (\psi^{(1)}(w);\ldots;\psi^{(n)}(w))$ where $\psi^{(i)}$ is the chain of computations defined by the input $x^{(i)}$, the objective in (2) can be written as

\[ \min_{w\in\mathbb{R}^p} \ f(\psi(w)) + r(w), \tag{4} \]

where $\psi$ is a chain of computations and $r$ is typically a decomposable differentiable function. We present examples of learning objectives below. Assumptions on differentiability and smoothness of the objective are detailed in Sec. 4.
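The chain structure of Def. 2.1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `chain` and `dense_tanh` are hypothetical, and the layer $\phi_l(v_l, z_{l-1}) = \tanh(V_l z_{l-1})$, with $V_l$ a reshaping of $v_l$, is chosen for the example rather than taken from the text.

```python
import numpy as np

def chain(w, x, layers, sizes):
    """Evaluate psi(w) = z_tau, with z_l = phi_l(v_l, z_{l-1}) and z_0 = x."""
    z, start = x, 0
    for phi, rho in zip(layers, sizes):
        v = w[start:start + rho]   # slice the block v_l out of w = (v_1; ...; v_tau)
        z = phi(v, z)              # z_l = phi_l(v_l, z_{l-1})
        start += rho
    return z

def dense_tanh(d_out):
    """Hypothetical layer: phi(v, z) = tanh(V z), V = v reshaped to (d_out, dim z)."""
    return lambda v, z: np.tanh(v.reshape(d_out, z.size) @ z)

layers = [dense_tanh(3), dense_tanh(2)]
sizes = [3 * 4, 2 * 3]             # rho_1, rho_2, so p = 18
rng = np.random.default_rng(0)
x = rng.normal(size=4)             # delta_0 = 4
w = rng.normal(size=sum(sizes))
out = chain(w, x, layers, sizes)   # psi(w), of dimension q = delta_tau = 2
```

The only state carried from one layer to the next is the current output `z`, which is what makes the dynamic programming view of Sec. 3 possible.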

2.2 Objectives

Supervised learning

For supervised learning, the objective can be decomposed as

\[ f(\psi(w)) = \frac{1}{n}\sum_{i=1}^{n} f^{(i)}(\psi^{(i)}(w)), \tag{5} \]

where $f^{(i)}$ are losses on the labels predicted by the chain of computations, i.e., $f^{(i)}$ compares the output of the chain of computations $\psi^{(i)}$ with the label of its input $x^{(i)}$ through a given loss such as the squared loss or the logistic loss (see Appendix B.1).

Unsupervised learning

In unsupervised learning tasks the labels are unknown. The objective itself is defined through a minimization problem rather than through an explicit loss function. For example, a convex clustering objective is written

\[ f(\psi(w)) = \min_{y^{(1)},\ldots,y^{(n)}\in\mathbb{R}^q} \ \sum_{i=1}^{n}\frac{1}{2}\|y^{(i)}-\psi^{(i)}(w)\|_2^2 + \sum_{i\,\ldots} \]

where $\psi^{(i)}$ are the chains defined by the inputs $x^{(i)}$. See (Hocking et al., 2011; Tan & Witten, 2015) for the original formulations. We consider in Appendix B.2 different clustering objectives. Note that the classical ones ($k$-means, spectral clustering) are inherently non-smooth, i.e., non-continuously differentiable, as they are defined as the minimization of a linear objective under constraints.

2.3 Layers

The $l$-th layer of a deep network can be described by the following components,

1. a bi-affine operation such as a matrix multiplication or a convolution, denoted $b_l$ and decomposed as

\[ b_l(v_l,z_{l-1}) = \beta_l(v_l,z_{l-1}) + \beta_l^v(v_l) + \beta_l^z(z_{l-1}) + \beta_l^0, \tag{6} \]

where $\beta_l$ is bilinear, $\beta_l^v$ and $\beta_l^z$ are linear and $\beta_l^0$ is a constant vector,

2. an activation function, such as the element-wise application of a non-linear function, denoted $\alpha_l$,

3. a reduction of dimension, such as a pooling operation,

4. a normalization of the output, such as batch-normalization.

By concatenating the non-affine operations, i.e., defining $a_l$ as the composition of the activation, pooling and normalization operations, a layer can be written as

\[ \phi_l(v_l,z_{l-1}) = a_l(b_l(v_l,z_{l-1})). \tag{7} \]

Note that some components may not be included; for example, some layers do not include normalization. In the following, we consider the non-linear operation $a_l$ to be an arbitrary composition of functions. We present common examples of the components of a deep network; a detailed list is given in Appendix B with the smoothness properties of each function.

Linear operations

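In the following, we drop the dependency w.r.t. the layer $l$ and denote by a tilde the quantities characterizing the output. We denote by semicolons the concatenation of matrices by rows, as in $(W;w_0)$ below.

Fully connected layer. A fully connected layer taking a batch of $m$ inputs of dimension $d$ is written

\[ \tilde Z = W^\top Z + w_0\mathbf{1}_m^\top, \tag{8} \]

where $Z\in\mathbb{R}^{d\times m}$ is the batch of inputs, $W\in\mathbb{R}^{d\times\tilde d}$ are the weights of the layer and $w_0\in\mathbb{R}^{\tilde d}$ define the offsets. By vectorizing the parameters and the inputs, a fully connected layer can be written as

\[
\tilde z = \beta(v,z) + \beta^v(v), \quad \text{where} \quad
\begin{aligned}
\beta(v,z) &= \operatorname{Vec}(W^\top Z)\in\mathbb{R}^{m\tilde d}, & \beta^v(v) &= \operatorname{Vec}(w_0\mathbf{1}_m^\top), \\
z &= \operatorname{Vec}(Z)\in\mathbb{R}^{md}, & v &= \operatorname{Vec}(W;w_0)\in\mathbb{R}^{\tilde d(d+1)}.
\end{aligned}
\]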
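As a sanity check of this decomposition, the following minimal sketch evaluates Eq. (8) and its vectorized form numerically. The column-stacking convention used for `vec` is an assumption made for the example (the text does not fix it here), and the function names are hypothetical.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization (assumed convention for Vec)."""
    return M.reshape(-1, order="F")

def fully_connected(W, w0, Z):
    """Eq. (8): Z_tilde = W^T Z + w0 1_m^T for a batch Z of shape (d, m)."""
    m = Z.shape[1]
    return W.T @ Z + np.outer(w0, np.ones(m))

d, d_out, m = 4, 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d_out))
w0 = rng.normal(size=d_out)
Z = rng.normal(size=(d, m))

beta = vec(W.T @ Z)                          # bilinear part beta(v, z)
beta_v = vec(np.outer(w0, np.ones(m)))       # part linear in v only
z_tilde = vec(fully_connected(W, w0, Z))     # vectorized output
v = vec(np.vstack([W, w0[None, :]]))         # v = Vec(W; w0)
```

Scaling `W` by 2 and `Z` by 3 scales `beta` by 6, which is the bilinearity used later when bounding smoothness constants.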

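Convolutional layer. A convolutional layer convolves a batch of $m$ inputs (images or signals) of dimension $d$ stacked as $Z=(z_1,\ldots,z_m)$ with $n^f$ affine filters of size $s^f$ defined by weights $W=(w_1,\ldots,w_{n^f})$ and offsets $w_0$ through $n^p$ patches. The $k$-th output of the convolution of the $i$-th input by the $j$-th filter reads

\[ \Xi_{i,j,k} = w_j^\top\Pi_k z_i + w_{0,j}, \tag{9} \]

where $\Pi_k$ extracts a patch of size $s^f$ at a given position of the input $z_i$. The output is then given by the concatenation of the outputs obtained for each input. By vectorizing the inputs and the outputs, the convolution operation is defined by a set of $n^p$ matrices $(\Pi_k)_{k=1}^{n^p}$ such that

\[
\tilde z = \beta(v,z) + \beta^v(v), \quad \text{where} \quad
\begin{aligned}
\beta(v,z) &= (w_j^\top\Pi_k z_i)_{i=1,\ldots,m;\ j=1,\ldots,n^f;\ k=1,\ldots,n^p}\in\mathbb{R}^{mn^fn^p}, & \beta^v(v) &= \mathbf{1}_m\otimes w_0\otimes\mathbf{1}_{n^p}, \\
z &= \operatorname{Vec}(Z)\in\mathbb{R}^{md}, & v &= \operatorname{Vec}(W;w_0)\in\mathbb{R}^{(s^f+1)n^f}, \\
Z &= (z_1,\ldots,z_m), & W &= (w_1,\ldots,w_{n^f}),
\end{aligned}
\]

where $\tilde z$ is defined by concatenation of the outputs $\Xi_{i,j,k}$.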
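The patch formulation of Eq. (9) can be illustrated on one-dimensional signals. The sketch below assumes stride-one patches without padding, so that $n^p = d - s^f + 1$; this choice and the names `patch_matrix` and `conv_layer` are assumptions for the example only.

```python
import numpy as np

def patch_matrix(d, s_f, k):
    """Pi_k: selects entries k, ..., k + s_f - 1 of a length-d signal."""
    P = np.zeros((s_f, d))
    P[np.arange(s_f), k + np.arange(s_f)] = 1.0
    return P

def conv_layer(W, w0, Z):
    """Xi[i, j, k] = w_j^T Pi_k z_i + w0[j], with inputs z_i the columns of Z."""
    s_f, n_f = W.shape
    d, m = Z.shape
    n_p = d - s_f + 1                           # stride 1, no padding (assumed)
    Xi = np.empty((m, n_f, n_p))
    for k in range(n_p):
        patches = patch_matrix(d, s_f, k) @ Z   # shape (s_f, m)
        Xi[:, :, k] = patches.T @ W + w0        # shape (m, n_f)
    return Xi

rng = np.random.default_rng(0)
d, m, s_f, n_f = 8, 3, 3, 2
Z = rng.normal(size=(d, m))
W = rng.normal(size=(s_f, n_f))
w0 = rng.normal(size=n_f)
Xi = conv_layer(W, w0, Z)
```

Building each $\Pi_k$ explicitly is wasteful in practice; it is done here only to mirror the notation of the text.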

Activation functions

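We consider differentiable element-wise activation functions $\alpha$, i.e., for a given $z=(z_1,\ldots,z_\eta)\in\mathbb{R}^\eta$,

\[ \alpha(z) = (\bar\alpha(z_1),\ldots,\bar\alpha(z_\eta)), \tag{10} \]

for a given scalar function $\bar\alpha$ such as $\bar\alpha(z) = 1/(1+e^{-z})$ for the sigmoid function.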

Pooling functions

A pooling layer reduces the dimension of the output. For example, an average pooling convolves an input image with a mean filter. Formally, for a batch of inputs, the average pooling with a given patch size convolves each channel of the inputs with a constant filter that averages the entries of a patch. The patches, represented by some $\Pi_k$ acting in Eq. (9), are chosen such that they induce a reduction of dimension, i.e., the output dimension of each input is smaller than its input dimension.

Normalization functions

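Given a batch of inputs $Z\in\mathbb{R}^{d\times m}$, the batch-normalization outputs $\tilde Z\in\mathbb{R}^{d\times m}$ defined by

\[ (\tilde Z)_{ij} = \frac{Z_{ij}-\mu_i}{\sqrt{\epsilon+\sigma_i^2}}, \quad \text{where} \quad \mu_i = \frac{1}{m}\sum_{j=1}^{m}Z_{ij}, \quad \sigma_i^2 = \frac{1}{m}\sum_{j=1}^{m}(Z_{ij}-\mu_i)^2, \tag{11} \]

with $\epsilon>0$, such that the vectorized formulation of the batch-normalization maps $z=\operatorname{Vec}(Z)$ to $\tilde z=\operatorname{Vec}(\tilde Z)$.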
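Eq. (11) translates directly into NumPy. In this minimal sketch the batch is stored as a $d\times m$ array, so statistics are taken along the second axis; the default value of `eps` is an arbitrary choice for the example.

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    """Eq. (11): normalize each row i of Z (shape (d, m)) over the batch index j."""
    mu = Z.mean(axis=1, keepdims=True)     # mu_i
    var = Z.var(axis=1, keepdims=True)     # sigma_i^2, with the 1/m normalization
    return (Z - mu) / np.sqrt(eps + var)

rng = np.random.default_rng(0)
Z = 3.0 * rng.normal(size=(4, 16)) + 2.0
Z_tilde = batch_norm(Z)
```

Each row of the output has zero mean and variance $\sigma_i^2/(\epsilon+\sigma_i^2)$, slightly below one because of $\epsilon$.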

3 Oracle arithmetic complexity

For each class of optimization algorithm considered (gradient descent, Gauss-Newton, Newton), we define the appropriate optimization oracle called at each step of the optimization algorithm which can be efficiently computed through a dynamic programming procedure. For a gradient step, we retrieve the gradient back-propagation algorithm. The gradient back-propagation algorithm forms then the basis of automatic-differentiation procedures.

3.1 Oracle reformulations

All optimization oracles can be formally defined as the minimization of an approximation of the objective with an additional proximal term. For a function $f$, we denote

\[
\begin{aligned}
\ell_f(y;x) &= f(x) + \nabla f(x)^\top(y-x), \\
q_f(y;x) &= f(x) + \nabla f(x)^\top(y-x) + \frac{1}{2}(y-x)^\top\nabla^2 f(x)(y-x)
\end{aligned}
\]

the linear and quadratic approximations respectively of $f$ around $x$, provided that $\nabla f(x)$, $\nabla^2 f(x)$ are defined respectively. At a point $w_t$, given a step-size $\gamma$, for an objective of the form $f\circ\psi+r$,

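1. a gradient step is defined as

\[ w_{t+1} = \operatorname*{arg\,min}_{w\in\mathbb{R}^p} \ \ell_{f\circ\psi}(w;w_t) + \ell_r(w;w_t) + \frac{1}{2\gamma}\|w-w_t\|_2^2, \tag{12} \]

2. a (regularized) Gauss-Newton step is defined as

\[ w_{t+1} = \operatorname*{arg\,min}_{w\in\mathbb{R}^p} \ q_f(\ell_\psi(w;w_t);\psi(w_t)) + q_r(w;w_t) + \frac{1}{2\gamma}\|w-w_t\|_2^2, \tag{13} \]

3. a Newton step is defined as

\[ w_{t+1} = \operatorname*{arg\,min}_{w\in\mathbb{R}^p} \ q_{f\circ\psi}(w;w_t) + q_r(w;w_t) + \frac{1}{2\gamma}\|w-w_t\|_2^2. \tag{14} \]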
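To see that the gradient step above is the usual update, one can set the gradient of the strongly convex objective in its definition to zero. The short derivation below only uses the chain rule with the transposed-Jacobian convention adopted in this paper; it is a worked check, not an additional result.

```latex
\begin{aligned}
0 &= \nabla (f\circ\psi)(w_t) + \nabla r(w_t) + \frac{1}{\gamma}(w_{t+1} - w_t) \\
\Longleftrightarrow\quad
w_{t+1} &= w_t - \gamma\big(\nabla (f\circ\psi)(w_t) + \nabla r(w_t)\big),
\qquad\text{with}\quad
\nabla (f\circ\psi)(w_t) = \nabla\psi(w_t)\,\nabla f(\psi(w_t)).
\end{aligned}
```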

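All those steps amount to solving quadratic problems on a linearized network as shown in the following proposition. For a multivariate function $f:\mathbb{R}^d\to\mathbb{R}^n$, composed of $n$ real functions $f_1,\ldots,f_n$, we denote $\nabla f(x) = (\nabla f_1(x),\ldots,\nabla f_n(x))\in\mathbb{R}^{d\times n}$, that is, the transpose of its Jacobian at $x$. We represent its second-order information by a tensor $\nabla^2 f(x) = (\nabla^2 f_1(x),\ldots,\nabla^2 f_n(x))\in\mathbb{R}^{d\times d\times n}$. For a real function $f:\mathbb{R}^d\times\mathbb{R}^p\to\mathbb{R}$, whose value is denoted $f(x,y)$, we decompose its gradient at $(x,y)$ as

\[ \nabla f(x,y) = \begin{pmatrix}\nabla_x f(x,y)\\ \nabla_y f(x,y)\end{pmatrix} \quad \text{with} \quad \nabla_x f(x,y)\in\mathbb{R}^d,\ \nabla_y f(x,y)\in\mathbb{R}^p. \]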

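We decompose similarly its Hessian and combine notations for multivariate functions. See Appendix A for further details on derivatives and tensor notations.

Proposition 3.1.

Let $w_t=(v_1;\ldots;v_\tau)$ and $z_0,\ldots,z_\tau$ be defined by the chain of computations in (3) applied to $w_t$. Assume $r$ to be decomposable as $r(w)=\sum_{l=1}^{\tau}r_l(v_l)$. Gradient (12), Gauss-Newton (13) and Newton (14) steps are given as $w_{t+1}=w_t+\tilde v^*$ where $\tilde v^*=(\tilde v_1^*;\ldots;\tilde v_\tau^*)$ is the solution of

\[
\begin{aligned}
\min_{\substack{\tilde v_1,\ldots,\tilde v_\tau\in\mathbb{R}^{\rho_1}\times\ldots\times\mathbb{R}^{\rho_\tau}\\ \tilde z_0,\ldots,\tilde z_\tau\in\mathbb{R}^{\delta_0}\times\ldots\times\mathbb{R}^{\delta_\tau}}} \quad & \sum_{l=1}^{\tau}\frac{1}{2}\tilde z_l^\top P_l\tilde z_l + p_l^\top\tilde z_l + \tilde z_{l-1}^\top R_l\tilde v_l + \frac{1}{2}\tilde v_l^\top Q_l\tilde v_l + q_l^\top\tilde v_l + \frac{1}{2\gamma}\|\tilde v_l\|_2^2 \\
\text{subject to} \quad & \tilde z_l = A_l\tilde z_{l-1} + B_l\tilde v_l \quad \text{for } l\in\{1,\ldots,\tau\}, \\
& \tilde z_0 = 0,
\end{aligned}
\tag{15}
\]

where

\[ A_l = \nabla_{z_{l-1}}\phi_l(v_l,z_{l-1})^\top, \quad B_l = \nabla_{v_l}\phi_l(v_l,z_{l-1})^\top, \quad p_\tau = \nabla f(\psi(w_t)), \quad p_l = 0 \ \text{for } l\neq\tau, \quad q_l = \nabla r_l(v_l), \]

and

1. for gradient steps (12),

\[ P_l = 0, \quad R_l = 0, \quad Q_l = 0, \]

2. for Gauss-Newton steps (13),

\[ P_\tau = \nabla^2 f(\psi(w_t)), \quad P_l = 0 \ \text{for } l\neq\tau, \quad R_l = 0, \quad Q_l = \nabla^2 r_l(v_l), \]

3. for Newton steps (14), defining

\[ \lambda_\tau = \nabla f(\psi(w_t)), \quad \lambda_{l-1} = \nabla_{z_{l-1}}\phi_l(v_l,z_{l-1})\lambda_l \quad \text{for } l\in\{1,\ldots,\tau\}, \]

we have

\[
\begin{aligned}
P_\tau &= \nabla^2 f(\psi(w_t)), \quad P_{l-1} = \nabla^2_{z_{l-1}z_{l-1}}\phi_l(v_l,z_{l-1})[\cdot,\cdot,\lambda_l] \quad \text{for } l\in\{1,\ldots,\tau\}, \\
R_l &= \nabla^2_{z_{l-1}v_l}\phi_l(v_l,z_{l-1})[\cdot,\cdot,\lambda_l], \quad Q_l = \nabla^2 r_l(v_l) + \nabla^2_{v_lv_l}\phi_l(v_l,z_{l-1})[\cdot,\cdot,\lambda_l].
\end{aligned}
\]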

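Problems of the form

\[
\begin{aligned}
\min_{\substack{v_1,\ldots,v_\tau\in\mathbb{R}^{\rho_1}\times\ldots\times\mathbb{R}^{\rho_\tau}\\ z_0,\ldots,z_\tau\in\mathbb{R}^{\delta_0}\times\ldots\times\mathbb{R}^{\delta_\tau}}} \quad & \sum_{l=1}^{\tau}h_l(z_l) + \sum_{l=1}^{\tau}g_l(v_l) \\
\text{subject to} \quad & z_l = \phi_l(v_l,z_{l-1}) \quad \text{for } l\in\{1,\ldots,\tau\}, \\
& z_0 = \hat z_0
\end{aligned}
\tag{16}
\]

can be chunked into smaller problems defined as the cost-to-go from $\hat z_l$ at time $l$ by

\[
\begin{aligned}
\operatorname{cost}_l(\hat z_l) = \min_{\substack{v_{l+1},\ldots,v_\tau\in\mathbb{R}^{\rho_{l+1}}\times\ldots\times\mathbb{R}^{\rho_\tau}\\ z_l,\ldots,z_\tau\in\mathbb{R}^{\delta_l}\times\ldots\times\mathbb{R}^{\delta_\tau}}} \quad & \sum_{l'=l}^{\tau}h_{l'}(z_{l'}) + \sum_{l'=l+1}^{\tau}g_{l'}(v_{l'}) \\
\text{subject to} \quad & z_{l'} = \phi_{l'}(v_{l'},z_{l'-1}) \quad \text{for } l'\in\{l+1,\ldots,\tau\}, \\
& z_l = \hat z_l,
\end{aligned}
\]

such that they follow Bellman’s recursive equation

\[ \operatorname{cost}_l(\hat z_l) = \min_{v_{l+1}\in\mathbb{R}^{\rho_{l+1}}}\big\{h_l(\hat z_l) + g_{l+1}(v_{l+1}) + \operatorname{cost}_{l+1}(\phi_{l+1}(v_{l+1},\hat z_l))\big\}. \tag{17} \]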

This principle cannot be used directly on the original problem, since Eq. (17) cannot be solved analytically for generic problems of the form (16). However, for quadratic problems with linear compositions of the form (15), this principle can be used to solve problems (15) by dynamic programming. See (Bertsekas, 2005) for a review of the dynamic programming literature. Therefore, as a corollary of Prop. 3.1, the complexity of all optimization steps given in (12), (13), (14) is linear w.r.t. the length $\tau$ of the chain. Precisely, Prop. 3.1 shows that each optimization step amounts to reducing Bellman’s recursive equation to a problem that can be solved analytically.
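For the gradient step, where the quadratic terms of (15) vanish, the Bellman recursion can be written out explicitly: the cost-to-go stays linear, $\operatorname{cost}_l(\tilde z)=\lambda_l^\top\tilde z+\text{const}$, and each minimization has a closed form. The sketch below works under these assumptions, with random matrices standing in for $A_l$, $B_l$; the function names are hypothetical and this is not the Newton procedure of Appendix C.

```python
import numpy as np

def gradient_step_dp(A, B, p_tau, q, gamma):
    """Solve (15) with P_l = R_l = Q_l = 0 by a backward recursion.

    A[l], B[l], q[l] hold A_{l+1}, B_{l+1}, q_{l+1} (0-based lists).
    """
    lam = p_tau                                  # lambda_tau = p_tau
    v = [None] * len(A)
    for l in reversed(range(len(A))):
        v[l] = -gamma * (q[l] + B[l].T @ lam)    # minimizer of the Bellman step
        lam = A[l].T @ lam                       # lambda_{l-1} = A_l^T lambda_l
    return v

def objective(v, A, B, p_tau, q, gamma):
    """Objective of (15) in the gradient-step case, with z_0 = 0."""
    z, total = np.zeros(A[0].shape[1]), 0.0
    for A_l, B_l, q_l, v_l in zip(A, B, q, v):
        z = A_l @ z + B_l @ v_l
        total += q_l @ v_l + v_l @ v_l / (2.0 * gamma)
    return total + p_tau @ z

rng = np.random.default_rng(0)
deltas, rhos, gamma = [3, 4, 2], [5, 6], 0.1
A = [rng.normal(size=(deltas[l + 1], deltas[l])) for l in range(2)]
B = [rng.normal(size=(deltas[l + 1], rhos[l])) for l in range(2)]
q = [rng.normal(size=r) for r in rhos]
p_tau = rng.normal(size=deltas[-1])
v_star = gradient_step_dp(A, B, p_tau, q, gamma)
```

The backward sweep touches each layer once, which is the linear dependence on the length of the chain stated above; the recursion on `lam` is gradient back-propagation.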

In particular, while the Hessian of the objective is a $p\times p$ matrix, with $p$ the total number of parameters, a Newton step has a linear and not cubic complexity with respect to $\tau$. We present in Appendix C the detailed computation of a Newton step. See (Dunn & Bertsekas, 1989) for an alternate derivation. This involves the inversion of intermediate quadratic costs at each layer. Gauss-Newton steps can also be solved by dynamic programming and can be more efficiently implemented using an automatic-differentiation oracle as we explain below.

3.2 Automatic-differentiation

Algorithm

As explained in the last subsection and shown in Appendix C, a gradient step can naturally be derived as a dynamic programming procedure applied to the subproblem (15). However, the implementation of the gradient step itself provides a different kind of oracle on the chain of computations, as defined below.

Definition 3.1.

Given a chain of computations $\psi:\mathbb{R}^p\to\mathbb{R}^q$ as defined in Def. 2.1 and $w\in\mathbb{R}^p$, an automatic differentiation oracle is a procedure that gives access to

\[ \mu\mapsto\nabla\psi(w)\mu \quad \text{for any } \mu\in\mathbb{R}^q. \]

The point is that we have access to $\nabla\psi(w)$ not as a matrix but as a linear operator. The matrix $\nabla\psi(w)$ can also be computed and stored to perform gradient-vector products. Yet, this requires additional storage and computations that are generally not necessary for our purposes. The only quantities that need to be stored are those computed in a forward pass. Then, these quantities can be used to compute any gradient-vector product directly.

The definition of an automatic differentiation oracle is composed of two steps:

1. a forward pass that computes and stores the information necessary to compute gradient-vector products,

2. a backward pass that computes $\nabla\psi(w)\mu$ for any $\mu$ given the information stored in the forward pass.

Note that the two aforementioned passes are decoupled in the sense that the forward pass does not require the knowledge of the slope $\mu$ for which $\nabla\psi(w)\mu$ is computed.

We present in Algo. 1 and Algo. 2 the classical forward-backward passes used in modern automatic-differentiation libraries. The implementation of the automatic differentiation oracle as a procedure that computes both the value of the chain and the linear operator is then presented in Algo. 3.

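Computing the gradient of $f\circ\psi$ at $w$ amounts then to

1. computing $\psi(w)$ and the oracle $\mu\mapsto\nabla\psi(w)\mu$ with Algo. 3,

2. computing $\nabla f(\psi(w))$, then $\nabla(f\circ\psi)(w)=\nabla\psi(w)\nabla f(\psi(w))$ using the oracle computed in the first step.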
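The two passes can be sketched for a small chain. This is a minimal illustration and not a transcription of Algo. 1 to 3: the layers $\phi_l(v_l,z_{l-1})=\tanh(V_l z_{l-1})$, the row-major reshaping of $v_l$ into $V_l$, and the names `forward`, `backward` and `oracle` are assumptions made for the example.

```python
import numpy as np

def forward(w, x, shapes):
    """Forward pass: returns psi(w) and the stored Jacobians (A_l, B_l) of each layer."""
    z, start, stored = x, 0, []
    for d_out, d_in in shapes:
        V = w[start:start + d_out * d_in].reshape(d_out, d_in)
        t = np.tanh(V @ z)
        s = 1.0 - t ** 2                                       # derivative of tanh
        A = s[:, None] * V                                     # Jacobian w.r.t. z_{l-1}
        B = s[:, None] * np.kron(np.eye(d_out), z[None, :])    # Jacobian w.r.t. v_l
        stored.append((A, B))
        z, start = t, start + d_out * d_in
    return z, stored

def backward(stored, mu):
    """Backward pass: returns grad psi(w) mu from the stored quantities only."""
    lam, blocks = mu, []
    for A, B in reversed(stored):
        blocks.append(B.T @ lam)   # block of the result associated with v_l
        lam = A.T @ lam            # propagate the slope to the previous layer
    return np.concatenate(blocks[::-1])

def oracle(w, x, shapes):
    """Returns psi(w) and the linear operator mu -> grad psi(w) mu."""
    out, stored = forward(w, x, shapes)
    return out, lambda mu: backward(stored, mu)

rng = np.random.default_rng(0)
shapes = [(3, 4), (2, 3)]
x, y = rng.normal(size=4), rng.normal(size=2)
w = rng.normal(size=sum(a * b for a, b in shapes))
out, grad_psi_times = oracle(w, x, shapes)
grad = grad_psi_times(out - y)     # gradient of 0.5 * ||psi(w) - y||^2
```

The forward pass never sees `mu`, so the same stored quantities can serve several backward passes, which is what the Gauss-Newton step of Sec. 3.3 exploits.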

Complexity

Without additional information on the structure of the layers, the complexities of the forward and backward passes can readily be computed as shown in the following proposition. The units chosen are, for the space complexity, the cost of storing one cell of a matrix and, for the time complexity, the cost of performing an addition or a multiplication.

Proposition 3.2.

The space and time complexities of the forward and backward passes, Algo. 1, Algo. 2, are of the order of

\[ S = \sum_{l=1}^{\tau}(\rho_l+\delta_{l-1})\delta_l, \qquad T = \underbrace{\sum_{l=1}^{\tau}T(\phi_l,\nabla\phi_l)}_{T_F} + \underbrace{2\sum_{l=1}^{\tau}(\delta_{l-1}\delta_l+\rho_l\delta_l)}_{T_B}, \]

respectively, where $T(\phi_l,\nabla\phi_l)$ is the time complexity of computing $\phi_l$, $\nabla\phi_l$ during the forward pass, $T_F$ denotes the time complexity of the forward pass and $T_B$ denotes the time complexity of the backward pass. For chains of computations of the form (7), the time complexity of the backward pass can be refined as shown in Appendix C.3. Specifically, we have the following corollary.

Corollary 3.2.

For a chain of fully-connected layers (8) with element-wise activation function, no normalization or pooling, the time complexity of the backward pass is of the order of

\[ T_B = O\Big(\sum_{l=1}^{\tau}2md_l(d_{l-1}+1)\Big) \]

elementary operations. For a chain of convolutional layers (9) with element-wise activation function, no normalization or pooling, the time complexity of the backward pass is of the order of

\[ T_B = O\Big(\sum_{l=1}^{\tau}\big(2n_l^pn_l^fs_l^f+n_l^pn_l^f+d_l\big)m\Big) \]

elementary operations.
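As a rough illustration of the fully connected bound, the sum inside the $O(\cdot)$ can be evaluated for given layer widths. The widths and batch size below are arbitrary values chosen for the example, and the result is an order of magnitude rather than an exact operation count.

```python
def backward_cost_fully_connected(dims, m):
    """Sum over layers of 2 m d_l (d_{l-1} + 1), with dims = [d_0, ..., d_tau]."""
    return sum(2 * m * dims[l] * (dims[l - 1] + 1) for l in range(1, len(dims)))

# Hypothetical network: inputs of dimension 784, one hidden layer of width 256,
# 10 outputs, batch of m = 32 inputs.
cost = backward_cost_fully_connected([784, 256, 10], m=32)
```

The count grows linearly with the batch size $m$ and with the depth for fixed widths.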

3.3 Gauss-Newton by automatic-differentiation

The Gauss-Newton step can also be solved by making calls to an automatic differentiation oracle as shown in (Roulet et al., 2019) and stated here in the framework considered in this paper.

Proposition 3.3.

Consider the Gauss-Newton step (13) on $w_t$ for a convex objective $f$, a convex decomposable regularization $r$ and a differentiable chain of computations $\psi$. We have that

1. the Gauss-Newton step amounts to solving

\[ \min_{\mu\in\mathbb{R}^q} \ \tilde q_f^\star(\mu) + \tilde q_r^\star(-\nabla\psi(w_t)\mu), \tag{18} \]

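where $\tilde q_f(y) = q_f(\psi(w_t)+y;\psi(w_t))$, $\tilde q_r(w) = q_r(w_t+w;w_t)+\frac{1}{2\gamma}\|w\|_2^2$, and for a function $h$ we denote by $h^\star$ its convex conjugate,

2. the Gauss-Newton step reads $w_{t+1} = w_t + \nabla\tilde q_r^\star(-\nabla\psi(w_t)\mu^*)$ where $\mu^*$ is the solution of (18),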

3. the dual problem (18) can be solved by calls to an automatic differentiation procedure.

Proposition 3.3 shows that a Gauss-Newton step is only times more expensive than a gradient step. Precisely, for a deep network with a supervised objective, we have where is the number of samples and is the number of classes. A gradient step makes then one call to an automatic differentiation procedure to get the gradient of the batch and the Gauss-Newton method will then make more calls. If mini-batch Gauss-Newton steps are considered then the cost reduces to calls to an automatic differentiation oracle, where is the size of the mini-batch.

4 Optimization complexity

The convergence guarantee of a first-order method towards an $\epsilon$-stationary point is governed by the smoothness property of the objective, i.e., the Lipschitz continuity of the function itself or its gradient when it is defined. We study smoothness properties with respect to the Euclidean norm $\|\cdot\|_2$, whose operator norm is denoted $\|\cdot\|_{2,2}$. In the following, for a function $f$ and a set $C$ contained in its domain, we denote by

\[ m_f^C = \sup_{x\in C}\|f(x)\|_2, \qquad \ell_f^C = \sup_{\substack{x,y\in C\\ x\neq y}}\frac{\|f(x)-f(y)\|_2}{\|x-y\|_2}, \qquad L_f^C = \sup_{\substack{x,y\in C\\ x\neq y}}\frac{\|\nabla f(x)-\nabla f(y)\|_{2,2}}{\|x-y\|_2}, \tag{19} \]

a bound of $f$ on $C$, the Lipschitz-continuity parameter of $f$ on $C$ and the smoothness parameter of $f$ on $C$ (i.e., the Lipschitz-continuity parameter of its gradient if it exists), all with respect to $\|\cdot\|_2$. We denote by $m_f$, $\ell_f$, $L_f$ the same quantities defined on the domain of $f$, e.g., $m_f = m_f^{\operatorname{dom}f}$. If these quantities are not defined, we consider them to be infinite. For example, $m_f=+\infty$ if $f$ is not bounded, and $L_f=+\infty$ if $f$ is not continuously differentiable.
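These three quantities can be estimated numerically for a scalar function by taking difference quotients on a grid, which gives lower estimates of the suprema in (19). The sketch below does this for the sigmoid, whose exact constants on the real line are $m=1$, $\ell=1/4$ and $L=1/(6\sqrt{3})$; the grid and the helper `estimate_constants` are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def estimate_constants(f, grad, points):
    """Lower estimates of (m_f, l_f, L_f) from neighbouring points of a 1-D grid."""
    x, y = points[:-1], points[1:]
    m = np.max(np.abs(f(points)))
    lip = np.max(np.abs(f(x) - f(y)) / np.abs(x - y))
    smooth = np.max(np.abs(grad(x) - grad(y)) / np.abs(x - y))
    return m, lip, smooth

grid = np.linspace(-10.0, 10.0, 20001)
m_hat, l_hat, L_hat = estimate_constants(sigmoid, sigmoid_grad, grid)
```

Such sampled estimates can only underestimate the true constants; the analytical bounds derived in this section are what the convergence guarantees rely on.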

For a given set , we denote by the class of functions such that and . Similarly we denote by the class of functions such that . We drop indexes to denote classes of functions for which only a subset of these parameters is defined. For example, we denote the set of functions that are -Lipschitz continuous.

4.1 Convergence rate to a stationary point

We recall the convergence rate to a stationary point of a gradient descent and a stochastic gradient descent on constrained problems.

Theorem 4.1 (Ghadimi et al. (2016, Theorems 1 and 2)).

Consider problems of the form

 (i)minw∈Rp {F(w):=f(ψ(w))+r(w)