# Robust Implicit Backpropagation

## Abstract

Arguably the biggest challenge in applying neural networks is tuning the hyperparameters, in particular the learning rate. The sensitivity to the learning rate is due to the reliance on backpropagation to train the network. In this paper we present the first application of Implicit Stochastic Gradient Descent (ISGD) to train neural networks, a method known in convex optimization to be unconditionally stable and robust to the learning rate. Our key contribution is a novel layer-wise approximation of ISGD which makes its updates tractable for neural networks. Experiments show that our method is more robust to high learning rates and generally outperforms standard backpropagation on a variety of tasks.

## 1 Introduction

Despite decades of research, most neural networks are still optimized using minor variations on the backpropagation method proposed by Rumelhart, Hinton and Williams in 1986 rumelhart1986learning (). Since backpropagation is a stochastic first order method, its run time per iteration is independent of the number of training datapoints. It is this key property that makes it able to ingest the vast quantities of data required to train neural networks on complex tasks like speech and image recognition.

A serious limitation of backpropagation being a first order method is its inability to use higher order information. This leads to multiple problems: the need to visit similar datapoints multiple times in order to converge to a good solution, instability due to “exploding” gradients (pascanu2013difficulty, Sec. 3), and high sensitivity to the learning rate (Goodfellow-et-al-2016, Sec. 11.4.1). A number of different approaches have been suggested to deal with these problems. Adaptive learning rate methods, like Adam kingma2014adam () and Adagrad duchi2011adaptive (), estimate appropriate per-parameter learning rates; and momentum accelerates backpropagation in a common direction of descent zhang2017yellowfin (). Gradient clipping is a heuristic which “clips” the gradient magnitude at a pre-specified threshold and has been shown to help deal with exploding gradients (pascanu2013difficulty, Sec. 3). Although these approaches partially address the problems of backpropagation, neural network training remains unstable and highly sensitive to the learning rate (Goodfellow-et-al-2016, Sec. 11).

The key research question here is how to add higher order information to stabilize backpropagation while keeping the per iteration run time independent of the number of datapoints. A recently emerged technique that addresses the same question in the context of convex optimization is Implicit Stochastic Gradient Descent (ISGD), which is known to be robust to the learning rate and unconditionally stable for convex problems (ryu2014stochastic, Sec. 5)(toulis2015scalable, Sec. 3.1). A natural question is whether ISGD can be used to improve the stability of neural network optimization.

In this paper, we show how ISGD can be applied to neural network training. To the best of our knowledge this is the first time ISGD has been applied to this problem. The main challenge in applying ISGD is solving its implicit update equations. This step is difficult even for most convex optimization problems. We leverage the special structure of neural networks by constructing a novel layer-wise approximation for the ISGD updates. The resulting algorithm, Implicit Backpropagation (IB), is a good trade-off: it has almost the same run time as the standard “Explicit” Backpropagation (EB), and yet enjoys many of the desirable features of exact ISGD. IB is compatible with many activation functions such as the relu, arctan, hardtanh and smoothstep; however, in its present form, it cannot be applied to convolutional layers. It is possible to use IB for some layers and EB for the other layers; thus, IB is partially applicable to virtually all neural network architectures.

Our numerical experiments demonstrate that IB is stable for much higher learning rates as compared to EB on classification, autoencoder and music prediction tasks. In all of these examples the learning rate at which IB begins to diverge is 20%-200% higher than for EB. We note that for small-scale classification tasks EB and IB have similar performance. IB performs particularly well for RNNs, where exploding gradients are most troublesome. We also investigate IB’s compatibility with clipping. We find that IB outperforms EB with clipping on RNNs, where clipping is most commonly used, and that clipping benefits both IB and EB for classification and autoencoding tasks. Overall, IB is clearly beneficial for RNN training and shows promise for classification and autoencoder feedforward neural networks. We believe that more refined implementations of ISGD to neural networks than IB are likely to lead to even better results — a topic for future research.

The rest of this paper is structured as follows. Section 2 reviews the literature on ISGD and related methods. Section 3 develops IB as approximate ISGD, with Section 4 deriving IB updates for multiple activation functions. The empirical performance of IB is investigated in Section 5 and we conclude with mentions of further work in Section 6.

## 2 ISGD and related methods

### 2.1 ISGD method

The standard objective in most machine learning models, including neural networks, is the ridge-regularized loss

$$\ell(\theta)=\frac{1}{N}\sum_{i=1}^{N}\ell_i(\theta)+\frac{\mu}{2}\|\theta\|_2^2,$$

where $\ell_i(\theta)$ is the loss associated with the $i^{th}$ datapoint and $\theta$ comprises the weight and bias parameters in the neural network.

The method ISGD uses to minimize $\ell$ is similar to that of standard “Explicit” SGD (ESGD). In each iteration of ESGD, we first sample a random datapoint $i$ and then update the parameters as $\theta^{(t+1)}=\theta^{(t)}-\eta_t\nabla_\theta\big(\ell_i(\theta^{(t)})+\frac{\mu}{2}\|\theta^{(t)}\|_2^2\big)$, where $\eta_t$ is the learning rate at time $t$. ISGD also samples a random datapoint but employs the update $\theta^{(t+1)}=\theta^{(t)}-\eta_t\nabla_\theta\big(\ell_i(\theta^{(t+1)})+\frac{\mu}{2}\|\theta^{(t+1)}\|_2^2\big)$, or equivalently,

$$\theta^{(t+1)}=\operatorname*{argmin}_{\theta}\Big\{2\eta_t\Big(\ell_i(\theta)+\frac{\mu}{2}\|\theta\|_2^2\Big)+\|\theta-\theta^{(t)}\|_2^2\Big\}. \tag{1}$$

The main motivation of ISGD over ESGD is its robustness to learning rates, numerical stability and transient convergence behavior bertsekas2011incremental (); patrascu2017nonasymptotic (); ryu2014stochastic (). The increased robustness of ISGD over ESGD can be illustrated with a simple quadratic loss $\ell(\theta)=\frac{c}{2}\theta^2$, as displayed in Figure 1. Here the ISGD step is stable for any learning rate, whereas the ESGD step diverges when $\eta_t>2/c$.
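This contrast is easy to check numerically. For the quadratic loss with $c=1$ (an illustrative choice, not the paper's figure), the ESGD step is $\theta\mapsto(1-\eta c)\theta$, while solving the ISGD update (1) with $\mu=0$ gives $\theta\mapsto\theta/(1+\eta c)$; a minimal sketch:

```python
# ESGD vs ISGD on the quadratic loss l(theta) = c * theta^2 / 2, with c = 1.
# ESGD: theta <- theta - eta * c * theta = (1 - eta * c) * theta
# ISGD: theta <- argmin_t 2*eta*l(t) + (t - theta)^2 = theta / (1 + eta * c)

def esgd_step(theta, eta, c=1.0):
    return (1.0 - eta * c) * theta

def isgd_step(theta, eta, c=1.0):
    return theta / (1.0 + eta * c)

theta_e = theta_i = 1.0
eta = 3.0  # above the 2/c stability threshold for ESGD
for _ in range(10):
    theta_e = esgd_step(theta_e, eta)
    theta_i = isgd_step(theta_i, eta)

print(abs(theta_e))  # 1024.0: the ESGD iterates oscillate and diverge
print(abs(theta_i))  # ~9.5e-07: ISGD contracts for any eta > 0
```

The ISGD map is a contraction for every positive learning rate, which is exactly the unconditional stability property referred to above.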

Since ISGD becomes equivalent to ESGD when the learning rate is small, there is no difference in their asymptotic convergence rate for decreasing learning rates. However, it is often the case that in the initial iterations, when the learning rate is still large, ISGD outperforms ESGD.

The main drawback of ISGD is that the implicit update (1) can be expensive to compute, whereas the update for ESGD is usually trivial. If this update is expensive, then ESGD may converge faster than ISGD in terms of wall clock time, even if ISGD converges faster per epoch. Thus, in order for ISGD to be effective, one needs to be able to solve the update (1) efficiently. Indeed, the focus of this paper is to develop a methodology for efficiently approximating the ISGD update for neural networks.

### 2.2 Related methods

ISGD has been successfully applied to several machine learning tasks. Cheng et al. cheng2007implicit () applied ISGD to learning online kernels and He he2014stochastic () to SVMs, while Kulis and Bartlett kulis2010implicit () consider a range of problems including online metric learning. ISGD has also been used to improve the stability of temporal difference algorithms in reinforcement learning iwaki2017implicit (); tamar2014implicit (). For more recent advances in ISGD see bertsekas2015incremental (); lin2017catalyst (); paquette2017catalyst (); toulis2016towards (); wang2017memory ().

Although ISGD has never been applied to neural networks, closely related methods have been investigated. ISGD may be viewed as a trust-region method, where $\frac{1}{2\eta(r)}$ is the optimal dual variable in the Lagrangian,

$$\operatorname*{argmin}_{\theta}\Big\{\ell_i(\theta)+\frac{\mu}{2}\|\theta\|_2^2:\ \|\theta-\theta^{(t)}\|_2^2\le r\Big\}=\operatorname*{argmin}_{\theta}\Big\{\ell_i(\theta)+\frac{\mu}{2}\|\theta\|_2^2+\frac{1}{2\eta(r)}\|\theta-\theta^{(t)}\|_2^2\Big\}.$$

Trust-region methods for optimizing neural networks have been effectively used to stabilize policy optimization in reinforcement learning schulman2015trust (); wu2017scalable (). Clipping, which truncates the gradient at a pre-specified threshold, may also be viewed as a computationally efficient approximate trust-region method pascanu2013difficulty (). It was explicitly designed to address the exploding gradient problem and achieves this by truncating the exploded step. In the experiments section we will investigate the difference in the effect of IB and clipping.
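For concreteness, norm-based clipping rescales the gradient $g$ to $g\cdot\min(1,\tau/\|g\|_2)$ before the step is taken; a minimal sketch (the threshold $\tau$ below is an arbitrary illustrative value):

```python
import math

def clip_by_norm(grad, tau):
    """Rescale grad so its Euclidean norm is at most tau -- an approximate
    trust region on the step, as in pascanu2013difficulty."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > tau:
        return [g * tau / norm for g in grad]
    return list(grad)

g = clip_by_norm([3.0, 4.0], tau=1.0)    # original norm is 5
print(math.sqrt(sum(x * x for x in g)))  # clipped norm is ~1.0
```

Below the threshold the gradient passes through unchanged, which is the behaviour contrasted with IB's dampened steps in the experiments section.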

An example of a non-trust region method for optimizing neural networks using higher order information is Hessian-Free optimization martens2010deep (); martens2011learning (). These methods directly estimate second order information of a neural network. They have been shown to make training more stable and require fewer epochs for convergence. However they come at a much higher per iteration cost than first order methods, which offsets this benefit bengio2013advances ().

## 3 Implicit Backpropagation

In this section we develop Implicit Backpropagation (IB) as an approximate ISGD implementation for neural networks. It retains many of the desirable characteristics of ISGD, while being virtually as fast as the standard “Explicit” Backpropagation (EB).

Consider a $d$-layered neural network $f^{(d:1)}_{\theta_{d:1}}=f^{(d)}_{\theta_d}\circ\dots\circ f^{(1)}_{\theta_1}$, where $f^{(k)}_{\theta_k}$ represents the $k^{th}$ layer with parameters $\theta_k$, $k=1,\dots,d$, $x$ is the input to the neural network, and $\circ$ denotes composition. Let the loss associated with a datapoint $(x,y)$ be $\ell_\theta(x,y)=\ell_y\circ f^{(d:1)}_{\theta_{d:1}}(x)$ for some function $\ell_y$. Later in this section, we will want to extract the effect of the $k^{th}$ layer on the loss. To this end, we can rewrite the loss as $\ell_\theta(x,y)=\ell^{(d:k+1)}_{\theta_{d:k+1},y}\circ f^{(k)}_{\theta_k}\circ f^{(k-1:1)}_{\theta_{k-1:1}}(x)$, where $f^{(j:i)}_{\theta_{j:i}}=f^{(j)}_{\theta_j}\circ\dots\circ f^{(i)}_{\theta_i}$ and $\ell^{(d:k+1)}_{\theta_{d:k+1},y}=\ell_y\circ f^{(d:k+1)}_{\theta_{d:k+1}}$.

The complexity of computing the ISGD update depends on the functional form of the loss $\ell_i$. Although it is possible in some cases to compute the ISGD update explicitly, this is not the case for neural networks. Even computing the solution numerically is very expensive. Hence, it is necessary to approximate the ISGD update in order for it to be computationally tractable. We introduce the following two approximations in IB:

1. We update the parameters layer by layer. When updating the parameters $\theta_k$ associated with layer $k$, all the parameters corresponding to the other layers, denoted $\theta^{(t)}_{-k}$, are kept fixed. Under this approximation the loss when updating the $k^{th}$ layer is

$$\ell_k(\theta_k;x,y,\theta^{(t)}_{-k}):=\ell^{(d:k+1)}_{\theta^{(t)}_{d:k+1},y}\circ f^{(k)}_{\theta_k}\circ f^{(k-1:1)}_{\theta^{(t)}_{k-1:1}}(x). \tag{2}$$
2. Building on approximation 1, we linearize the layers above the $k^{th}$, but keep the layer being updated as non-linear. The loss from (2) reduces to

$$\tilde{\ell}_k(\theta_k;x,y,\theta^{(t)}_{-k}):=\ell_{\theta^{(t)}}(x,y)+\nabla\ell^{(d:k+1)\top}_{\theta^{(t)}_{d:k+1},y}\Big(f^{(k)}_{\theta_k}\circ f^{(k-1:1)}_{\theta^{(t)}_{k-1:1}}(x)-f^{(k:1)}_{\theta^{(t)}_{k:1}}(x)\Big). \tag{3}$$

This approximation can be validated via a Taylor series expansion, where the error in (3) compared to (2) is $O(\|\theta_k-\theta^{(t)}_k\|_2^2)$.

The IB approximation to the ISGD update is, thus, given by

$$\theta^{(t+1)}_k=\operatorname*{argmin}_{\theta_k}\Big\{2\eta_t\Big(\tilde{\ell}_k(\theta_k;x,y,\theta^{(t)}_{-k})+\frac{\mu}{2}\|\theta_k\|_2^2\Big)+\|\theta_k-\theta^{(t)}_k\|_2^2\Big\}. \tag{4}$$

In Appendix C we present a simple theorem, which leverages the fact that IB converges to EB in the limit of small learning rates, to show that IB converges to a stationary point of $\ell$ for appropriately decaying learning rates.

In the next section we show that the IB update can be efficiently computed for a variety of activation functions. The IB approximations thus make ISGD practically feasible to implement. However, the approximation is not without drawbacks. The layer-by-layer update in approximation 1 removes all higher order information along directions perpendicular to the parameter space of the layer being updated, and the linearization in approximation 2 loses information about the non-linearity in the higher layers. The hope is that IB retains enough of the beneficial properties of ISGD to have noticeable benefits over EB. In our experiments we show that this is indeed the case.

Our IB formulation is, to our knowledge, novel. The most similar update in the literature is for composite convex optimization with just two layers, where the lower, not higher, layer is linearized duchi2017stochastic ().

## 4 Implicit Backpropagation updates for various activation functions

Since neural networks are applied to extremely large datasets, it is important that the IB updates can be computed efficiently. In this section we show that the IB update (4) can be greatly simplified, resulting in fast analytical updates for activation functions such as the relu and arctan. For those activation functions that do not have an analytical IB update, IB can easily be applied on a piecewise-cubic approximation of the activation function. This makes IB practically applicable to virtually any element-wise acting activation function.

Although it is possible to apply IB to layers with weight sharing or non-element-wise acting activation functions, the updates tend to be complex and expensive to compute. For example, the IB update for a convolutional layer with max-pooling and a relu activation function involves solving a quadratic program with binary variables (see Appendix B for the derivation). Thus, we will only focus on updates for activation functions that are applied element-wise and have no shared weights.

### 4.1 Generic update

Here we derive the IB updates for a generic layer with an element-wise acting activation function $\sigma$. Let the parameters of the $k^{th}$ layer be $\theta_k=(W_k,w_k)$, where $W_k$ is the weight matrix and $w_k$ is the bias. We’ll use the shorthand notation $z_{ki}=(f^{(k-1:1)}_{\theta^{(t)}_{k-1:1}}(x_i),1)$ for the input to the layer, with a constant 1 appended for the bias, and $b_{ki}=\nabla\ell^{(d:k+1)}_{\theta^{(t)}_{d:k+1},y_i}$ for the backpropagated gradient. The output of the layer is thus $\sigma(\theta_k z_{ki})$, where $\sigma$ is applied element-wise and $\theta_k z_{ki}$ is a matrix-vector product. Using this notation the IB update from (4) becomes:

$$\theta^{(t+1)}_k=\operatorname*{argmin}_{\theta_k}\Big\{2\eta_t\, b_{ki}^\top\sigma(\theta_k z_{ki})+\eta_t\mu\|\theta_k\|_2^2+\|\theta_k-\theta^{(t)}_k\|_2^2\Big\}, \tag{5}$$

where we have dropped the terms $\ell_{\theta^{(t)}}(x,y)$ and $\nabla\ell^{(d:k+1)\top}_{\theta^{(t)}_{d:k+1},y}f^{(k:1)}_{\theta^{(t)}_{k:1}}(x)$ from (3) as they are constant with respect to $\theta_k$. Now that we have written the IB update in more convenient notation, we can begin to simplify it. Since $\sigma$ is applied element-wise, (5) breaks up into separate optimization problems, one for each output node $j$:

$$\theta^{(t+1)}_{kj}=\operatorname*{argmin}_{\theta_{kj}}\Big\{2\eta_t\, b_{kij}\,\sigma(\theta_{kj}^\top z_{ki})+\eta_t\mu\|\theta_{kj}\|_2^2+\|\theta_{kj}-\theta^{(t)}_{kj}\|_2^2\Big\}, \tag{6}$$

where $\theta_{kj}$ are the parameters corresponding to the $j^{th}$ output node. Using simple calculus we can write the solution to (6) as

$$\theta^{(t+1)}_{kj}=\frac{\theta^{(t)}_{kj}}{1+\eta_t\mu}-\eta_t\alpha_{kij}z_{ki}, \tag{7}$$

where $\alpha_{kij}$ is the solution to the one-dimensional optimization problem

$$\alpha_{kij}=\operatorname*{argmin}_{\alpha}\Big\{b_{kij}\cdot\sigma\Big(\frac{\theta^{(t)\top}_{kj}z_{ki}}{1+\eta_t\mu}-\alpha\cdot\eta_t\|z_{ki}\|_2^2\Big)+\frac{\eta_t(1+\eta_t\mu)\|z_{ki}\|_2^2\,\alpha^2}{2}\Big\}. \tag{8}$$

See Appendix A for the derivation.
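In code, the per-node update (7)–(8) can be sketched as follows, solving the one-dimensional problem (8) by a dense grid search for clarity (an exact solver would instead use the analytical cases derived below; the function names and grid bounds here are illustrative):

```python
import numpy as np

def ib_node_update(theta, z, b, eta, mu, sigma, alpha_grid):
    """IB update (7) for one output node: theta/(1+eta*mu) - eta*alpha*z,
    where alpha approximately minimizes the 1-d objective (8) over a grid."""
    z2 = z @ z
    pre = (theta @ z) / (1.0 + eta * mu)        # scaled pre-activation
    obj = (b * sigma(pre - alpha_grid * eta * z2)
           + eta * (1.0 + eta * mu) * z2 * alpha_grid ** 2 / 2.0)
    alpha = alpha_grid[np.argmin(obj)]
    return theta / (1.0 + eta * mu) - eta * alpha * z

relu = lambda x: np.maximum(x, 0.0)
theta = np.array([0.5, -0.2, 0.1])
z = np.array([1.0, 2.0, 1.0])                  # last entry: constant bias input
grid = np.linspace(-5.0, 5.0, 100001)
theta_new = ib_node_update(theta, z, b=1.0, eta=0.1, mu=0.01, sigma=relu, alpha_grid=grid)
```

By the derivation in Appendix A, the returned point (approximately) minimizes the full objective (6) over $\theta_{kj}$, which is easy to spot-check against the starting point.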

To connect the IB update to EB, observe that if we do a first order Taylor expansion of (7) and (8) in $\eta_t$ we recover the EB update:

$$\theta^{(t+1)}_{kj}=\theta^{(t)}_{kj}(1-\eta_t\mu)-\eta_t\,\sigma'(\theta^{(t)\top}_{kj}z_{ki})\,b_{kij}\,z_{ki}+O(\eta_t^2),$$

where $\sigma'$ denotes the derivative of $\sigma$. Thus we can think of IB as a higher order update than EB.

In summary, the original multi-dimensional IB update from (5) has been reduced to separate one-dimensional optimization problems of the form (8), one per output node. The difficulty of solving (8) depends on the activation function $\sigma$. Since (8) is a one-dimensional problem, an optimal $\alpha_{kij}$ can always be computed numerically, for example using the bisection method, although this may be slow. Fortunately there are certain important activation functions for which $\alpha_{kij}$ can be computed analytically. In the subsections below we derive analytical updates for $\alpha_{kij}$ when $\sigma$ is the relu or arctan function, as well as a general formula for piecewise-cubic functions.

Before proceeding to these updates, we can observe directly from (7) and (8) that IB will be robust to high learning rates. Unlike EB, in which the step size increases linearly with the learning rate, IB has a bounded step size even for infinite learning rates. As the learning rate increases, (7) becomes

$$\theta^{(t+1)}_{kj}\xrightarrow{\ \eta_t\to\infty\ }\operatorname*{argmin}_{\beta}\Big\{b_{kij}\cdot\sigma\big(\beta\cdot\|z_{ki}\|_2^2\big)+\frac{\mu\|z_{ki}\|_2^2\,\beta^2}{2}\Big\}\,z_{ki},$$

where $\beta=-\eta_t\alpha_{kij}$. This update is finite as long as $\mu>0$ and $\sigma$ is asymptotically linear.
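This boundedness is easy to check numerically with a grid-search solver for (8): with $\mu>0$ and a relu activation, the IB step length saturates as $\eta_t$ grows, whereas an explicit step of length proportional to $\eta_t$ would blow up (an illustrative sketch, not the paper's experiment; all parameter values below are arbitrary):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def ib_step_norm(theta, z, b, eta, mu, grid):
    """Length of the IB step (7), with alpha from a dense grid search on (8)."""
    z2 = z @ z
    pre = (theta @ z) / (1.0 + eta * mu)
    obj = b * relu(pre - grid * eta * z2) + eta * (1.0 + eta * mu) * z2 * grid ** 2 / 2.0
    alpha = grid[np.argmin(obj)]
    return np.linalg.norm(theta / (1.0 + eta * mu) - eta * alpha * z - theta)

theta, z = np.array([1.0, -1.0]), np.array([1.0, 1.0])
grid = np.linspace(-10.0, 10.0, 200001)
norms = [ib_step_norm(theta, z, b=-1.0, eta=eta, mu=0.1, grid=grid)
         for eta in (1.0, 10.0, 1000.0)]
print(norms)  # increasing but saturating, rather than growing linearly in eta
```

The step lengths approach the finite limit predicted by the display above, governed by $|b|/\mu$ and $\|z\|_2$.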

### 4.2 Relu update

Here we give the solution $\alpha$ to (8) for the relu activation function, $\sigma(x)=\max\{0,x\}$. We will drop the super- and sub-scripts from (8) for notational convenience, writing $b$ for $b_{kij}$ and $z$ for $z_{ki}$. When $b>0$ there are three cases, and when $b<0$ there are two cases, for the solution to (8). The updates are given in Table 1.
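The case analysis can be sketched by evaluating the objective of (8) at its three candidate minimizers: staying put, the stationary point on the slope, and the hinge (our own derivation, consistent with but not a transcription of Table 1):

```python
def relu_alpha(pre, b, eta, z2, mu):
    """Solve (8) for sigma = relu. `pre` is the scaled pre-activation
    theta^T z / (1 + eta*mu) and z2 = ||z||^2. The global minimizer is one of:
    alpha = 0 (stay in the flat region), alpha = b/(1+eta*mu) (stationary
    point on the slope), or alpha = pre/(eta*z2) (stop exactly at the hinge)."""
    obj = lambda a: (b * max(pre - a * eta * z2, 0.0)
                     + eta * (1.0 + eta * mu) * z2 * a * a / 2.0)
    candidates = [0.0, b / (1.0 + eta * mu), pre / (eta * z2)]
    return min(candidates, key=obj)
```

For $b>0$ and a pre-activation just above the hinge, the hinge candidate wins, reproducing the "stop at the hinge" behaviour discussed below.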

The difference between the EB and IB updates is illustrated in Figure 2. When $b>0$ and the pre-activation is on the slope but close to the hinge, the EB step overshoots the hinge point, making a far larger step than is necessary to reduce the relu to 0. IB, on the other hand, stops at the hinge. The IB update is better from two perspectives. First, it is able to improve the loss on the given datapoint just as much as EB, but without taking as large a step. Assuming that the current value of $\theta^{(t)}$ is close to a minimizer of the average loss over the other datapoints, an unnecessarily large step will likely move it away from its (locally) optimal value. An example of where this property might be particularly important is the “cliff” or “exploding gradient” problem in RNNs pascanu2013difficulty (). And second, the IB step size is a continuous function of the pre-activation, unlike in EB where the step size has a discontinuity at the origin (the hinge). This should make the IB update more robust to perturbations in the data.

When $b<0$ and the pre-activation is on the flat section, EB isn’t able to descend as the relu has “saturated” (i.e. is flat). IB, on the other hand, can look past the hinge and is still able to descend down the slope, thereby decreasing the loss. IB thus partially solves the saturating gradient problem pascanu2013difficulty ().

### 4.3 Arctan update

Although the IB update is not analytically available for all sigmoidal activation functions, it is available when $\sigma$ is the arctan. For the arctan, the stationarity condition of (8) reduces to a cubic equation in $\alpha_{kij}$, which can be solved analytically. Since the arctan function is non-convex there may be up to three real solutions for $\alpha_{kij}$. Under the axiom that smaller step sizes achieving the same decrease in the objective are better (as argued in Section 4.2), we always choose the value of $\alpha_{kij}$ closest to zero.
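A sketch of this update: setting the derivative of (8) to zero for $\sigma=\arctan$ gives the cubic $\eta_t^2\|z\|_2^4\alpha^3-2\tilde{p}\,\eta_t\|z\|_2^2\alpha^2+(1+\tilde{p}^2)\alpha-b/(1+\eta_t\mu)=0$, where $\tilde{p}$ denotes the scaled pre-activation (our own restatement); below we solve it with `numpy.roots` and pick, among the real roots attaining the minimal objective value, the one closest to zero:

```python
import numpy as np

def arctan_alpha(pre, b, eta, z2, mu):
    """Solve (8) for sigma = arctan via the stationarity cubic
    eta^2*z2^2*a^3 - 2*pre*eta*z2*a^2 + (1+pre^2)*a - b/(1+eta*mu) = 0.
    Among the real roots attaining the minimal objective value, return the
    one closest to zero (smaller steps are preferred)."""
    obj = lambda a: (b * np.arctan(pre - a * eta * z2)
                     + eta * (1.0 + eta * mu) * z2 * a * a / 2.0)
    coeffs = [eta ** 2 * z2 ** 2,
              -2.0 * pre * eta * z2,
              1.0 + pre ** 2,
              -b / (1.0 + eta * mu)]
    real_roots = [r.real for r in np.roots(coeffs) if abs(r.imag) < 1e-8]
    best = min(obj(r) for r in real_roots)
    return min((r for r in real_roots if obj(r) <= best + 1e-9), key=abs)
```

Since the objective grows quadratically while arctan is bounded, its global minimum is attained at one of these stationary points.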

### 4.4 Piecewise-cubic function update

Many activation functions are piecewise-cubic, including the hardtanh and smoothstep. Furthermore, all common activation functions can be approximated arbitrarily well with a piecewise-cubic. Being able to do IB updates for piecewise-cubic functions thus extends its applicability to virtually all activation functions.

Let $\sigma(x)=\sigma_p(x)$ for $l_p\le x\le u_p$, $p=1,\dots,P$, where each $\sigma_p$ is a cubic function and the intervals $\{(l_p,u_p)\}_{p=1}^P$ define the bounds on each piece. The optimal value of $\alpha$ for each piece can be found by evaluating the objective of (8) at the piece’s boundaries and stationary points (which may be found by solving a quadratic equation). The value of $\alpha$ can then be solved for by iterating over the pieces and taking the minimum over all of them. Since there are $P$ pieces, the time to calculate $\alpha$ scales as $O(P)$.
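A sketch of this $O(P)$ procedure (our own implementation of the scheme just described; the `pieces` encoding is an assumption, not the paper's notation):

```python
import numpy as np

def pw_cubic_alpha(pre, b, eta, z2, mu, pieces):
    """Solve (8) when sigma is piecewise cubic.

    `pieces` is a list of (lo, hi, coeffs): sigma(x) = polyval(coeffs, x) for
    lo <= x <= hi (lo/hi may be +-inf). Writing u(a) = pre - s*a with
    s = eta*z2, the objective restricted to one piece is cubic in a, so its
    stationary points solve a quadratic; we collect them together with the
    piece boundaries and return the best candidate."""
    s = eta * z2
    q = eta * (1.0 + eta * mu) * z2              # proximal curvature in (8)

    def sigma(x):
        for lo, hi, coeffs in pieces:
            if lo <= x <= hi:
                return np.polyval(coeffs, x)
        raise ValueError("x not covered by pieces")

    obj = lambda a: b * sigma(pre - s * a) + q * a * a / 2.0

    candidates = []
    for lo, hi, coeffs in pieces:
        d = np.polyder(np.poly1d(coeffs)).coeffs          # sigma_p'
        c2, c1, c0 = np.concatenate([np.zeros(3 - len(d)), d])
        # stationarity: -s*b*sigma_p'(pre - s*a) + q*a = 0, a quadratic in a
        A = -s * b * c2 * s * s
        B = -s * b * (-2.0 * c2 * pre * s - c1 * s) + q
        C = -s * b * (c2 * pre * pre + c1 * pre + c0)
        for r in np.roots([A, B, C]):
            if abs(r.imag) < 1e-8 and lo <= pre - s * r.real <= hi:
                candidates.append(r.real)
        for bound in (lo, hi):                            # piece boundaries
            if np.isfinite(bound):
                candidates.append((pre - bound) / s)
    return min(candidates, key=obj)
```

With the hardtanh encoded as three pieces, this reproduces the update used in Section 4.5's flop estimates.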

### 4.5 Relative run time difference of IB vs EB measured in flops

A crucial aspect of any neural network algorithm is not only its convergence rate per epoch, but also its run time. Since IB does extra calculations, it is slower than EB. Here we show that the difference in floating point operations (flops) between EB and IB is typically small, on the order of 30% or less.

For any given layer, let the input dimension be denoted $D_k$ and the output dimension $D_{k+1}$. The run time of both EB and IB is dominated by three matrix-vector operations costing $O(D_k D_{k+1})$ flops each: multiplying the weight matrix and the input in the forward propagation, multiplying the backpropagated gradient and the input for the weight matrix gradient, and multiplying the backpropagated gradient and the weight matrix for the backpropagated gradient in the lower layer. IB has two extra calculations as compared to EB: calculating $\|z_{ki}\|_2^2$, costing $O(D_k)$ flops, and calculating $\alpha_{kij}$ a total of $D_{k+1}$ times, once for each output node. Denoting the number of flops to calculate each $\alpha_{kij}$ with activation function $\sigma$ as $c_\sigma$, the relative increase in run time of IB over EB is on the order of $(D_k+c_\sigma D_{k+1})/(D_k D_{k+1})$.

The relative run time increase of IB over EB depends on the values of $D_k$, $D_{k+1}$ and $c_\sigma$. When $\sigma$ is the relu, $c_\sigma$ is small, no more than a few tens of flops; whereas when $\sigma$ is the arctan, $c_\sigma$ is larger, owing to the cubic solve. Either way, the extra work is linear in the layer dimensions while the matrix products are quadratic, so these bounds diminish as $D_k$ and $D_{k+1}$ grow. Thus, IB’s run time is virtually the same as EB’s for large neural network tasks.

If $D_k$ and $D_{k+1}$ are small then the arctan IB update might be too slow relative to EB for the IB update to be worthwhile. In this case simpler sigmoidal activation functions, such as the hardtanh or smoothstep, may be preferable for IB. The hardtanh has been used before in neural networks, mainly in the context of binarized networks courbariaux2015binaryconnect (). It has the form

$$\sigma(x)=\begin{cases}-1 & \text{if } x<-1\\ x & \text{if } -1\le x\le 1\\ 1 & \text{if } x>1,\end{cases}$$

for which $c_\sigma$ is no more than 15 flops (using the piecewise-cubic function update from Section 4.4). The smoothstep is like the hardtanh but is both continuous and has continuous first derivatives,

$$\sigma(x)=\begin{cases}-1 & \text{if } x<-1\\ \frac{3}{2}x-\frac{1}{2}x^3 & \text{if } -1\le x\le 1\\ 1 & \text{if } x>1,\end{cases}$$

with $c_\sigma$ being no more than 25 flops. The relative increase in run time of IB over EB for the hardtanh and smoothstep is about the same as for the relu. This makes the IB update with the hardtanh or smoothstep practical even if $D_k$ and $D_{k+1}$ are small.

## 5 Experiments

This section details the results of three sets of experiments in which the robustness of IB to the learning rate is compared to that of EB. Since IB is equivalent to EB when the learning rate is small, we expect little difference between the methods in the limit of small learning rates. However, we expect that IB will be more stable and have lower loss than EB for larger learning rates.

Classification, autoencoding and music prediction tasks. For the first set of experiments, we applied IB and EB to three different but common machine learning tasks. The first task was image classification on the MNIST dataset lecun1998mnist () with an architecture consisting of two convolutional layers, an arctan layer and a relu layer. The second task also uses the MNIST dataset, but for an 8 layer relu autoencoding architecture. The third task involves music prediction on four music datasets, JSB Chorales, MuseData, Nottingham and Piano-midi.de boulanger2012modeling (), for which a simple RNN architecture is used with an arctan activation function.

For each dataset-architecture pair we investigated the performance of EB and IB over a range of learning rates where EB performs well (see Appendix D.5 for more details on how these learning rates were chosen). Since IB and EB have similar asymptotic convergence rates, the difference between the methods will be most evident in the initial epochs of training. In our experiments we only focus on the performance after the first epoch of training for the MNIST datasets, and after the fifth epoch for the music datasets. Five random seeds are used for each configuration in order to understand the variance of the performance of the methods.

Figure 3 displays the results for the experiments. EB and IB have near-identical performance when the learning rate is small. However, as the learning rate increases, the performance of EB deteriorates far more quickly than that of IB. Over the six datasets, the learning rate at which IB starts to diverge is at least 20% higher than that of EB. The benefit of IB over EB is most noticeable for the music prediction problems, where for high learning rates IB has much better mean performance and lower variance than EB. A potential explanation for this behaviour is that IB is better able to deal with exploding gradients, which are more prevalent in RNN training.

Exact ISGD. For MNIST-classification we also investigate the potential performance of exact ISGD. Instead of using our IB approximation for the ISGD update, we directly optimize (1) using gradient descent: for each ISGD update we take 100 inner gradient descent steps on (1), at a learning rate correspondingly smaller than the “outer” learning rate $\eta_t$. It is evident from Figure 3 that this method achieves the best performance and is remarkably robust to the learning rate. Since exact ISGD uses 100 extra gradient descent steps per iteration, it is about 100 times slower than the other methods, and is thus impractically slow. However, its impressive performance indicates that ISGD-based methods have great potential for neural networks.
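This baseline can be sketched as follows — each outer update runs an inner gradient descent on the proximal objective of (1) (a toy NumPy sketch; the inner step count, step size rule and test loss are illustrative, not the paper's exact settings):

```python
import numpy as np

def exact_isgd_step(theta, grad_li, eta, mu, inner_steps=100):
    """Approximately solve (1), argmin_th 2*eta*(l_i(th) + mu/2*||th||^2)
    + ||th - theta||^2, by inner gradient descent started at theta."""
    th = theta.copy()
    inner_lr = 0.1 / (1.0 + eta)   # illustrative choice that keeps the inner loop stable
    for _ in range(inner_steps):
        g = 2.0 * eta * (grad_li(th) + mu * th) + 2.0 * (th - theta)
        th = th - inner_lr * g
    return th

# toy check on l_i(th) = 0.5*||th - c||^2, whose proximal point is known:
# argmin of (1) with mu = 0 is (theta + eta*c) / (1 + eta)
c = np.array([2.0, -1.0])
theta_next = exact_isgd_step(np.zeros(2), lambda th: th - c, eta=1.0, mu=0.0)
```

For a neural network, `grad_li` would be a fresh backpropagation pass at the current inner iterate, which is where the roughly 100x cost comes from.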

Run times. According to the bounds derived in Section 4.5, the run time of IB should be no more than 12% longer per epoch than EB on any of our experiments. With our basic Pytorch implementation IB took between 16% and 216% longer in practice, depending on the architecture used (see Appendix E.1 for more details). With a more careful implementation of IB we expect these run times to decrease towards the levels indicated by the bounds. Using activation functions with more efficient IB updates, like the smoothstep instead of the arctan, would further reduce the run time.

UCI datasets. Our second set of experiments is on classification datasets from the UCI database fernandez2014we (). We consider a 4 layer feedforward neural network run for 10 epochs on each dataset. In contrast to the above experiments, we use the same coarse grid of 10 learning rates between 0.001 and 50 for all datasets. For each algorithm and dataset the best performing learning rate was selected based on training-set performance, and the neural network trained with that learning rate was then applied to the test set. Overall we found IB to have a slightly higher average accuracy on the test set than EB. The similarity in their performance is likely due to the small size of the datasets (some have as few as 4 features) and the relatively shallow architecture, which make the network comparatively easy to train, as well as the coarseness of the learning rate grid.

Clipping. In our final set of experiments we investigated the effect of clipping on IB and EB. Both IB and clipping can be interpreted as approximate trust-region methods; consequently, we expect IB to be less influenced by clipping than EB. This was indeed observed in our experiments. Multiple experiments were run with different clipping thresholds applied to RNNs on the music datasets (see Appendix D for details). Clipping improved EB’s performance for higher learning rates in the majority of these experiments, whereas IB’s performance was only rarely improved. IB without clipping had an equal or lower loss than EB with clipping for all learning rates in all experiments except one (Piano-midi.de with a clipping threshold of 0.1). This suggests that IB is a more effective method for training RNNs than EB with clipping.

The effect of clipping on IB and EB applied to MNIST-classification and MNIST-autoencoder is more complicated. In both cases clipping enabled IB and EB to have lower losses for higher learning rates. For MNIST-classification it is still the case that IB has uniformly superior performance to EB, but for MNIST-autoencoder this is reversed. It is not surprising that EB with clipping may outperform IB with clipping. If the clipping threshold is small enough, the clipping-induced trust region will be smaller than that induced by IB. This makes EB with clipping and IB with clipping act the same for large gradients; however, below the clipping threshold EB’s unclipped steps may be able to make more progress than IB’s dampened steps. See Figure 4 for plots of EB’s and IB’s performance with clipping.

Summary. We ran a total of 17 experiments on the MNIST and music datasets. IB outperformed EB in 15 of these. On the UCI datasets IB had slightly better performance than EB on average. The advantage of IB is most pronounced for RNNs, where for large learning rates IB has much lower losses and even consistently outperforms EB with clipping. Although IB takes slightly longer to train, this is offset by its ability to achieve good performance at higher learning rates, which should allow it to get away with less hyperparameter tuning.

## 6 Conclusion

In this paper we developed the first method for applying ISGD to neural networks. We showed that, through careful approximations, ISGD can be made to run nearly as quickly as standard backpropagation while still retaining the property of being more robust to high learning rates. The resulting method, which we call Implicit Backpropagation, consistently matches or outperforms standard backpropagation on image recognition, autoencoding and music prediction tasks; and is particularly effective for robust RNN training.

The success of IB demonstrates the potential of ISGD methods to improve neural network training. It may be the case that there are better ways to approximate ISGD than IB, which could produce even better results. For example, the techniques behind Hessian-Free methods could be used to make a quadratic approximation of the higher layers in IB (as opposed to the linear approximation currently used); or a second order approximation could be made directly to the ISGD formulation in (1). Developing and testing such methods is a ripe area for future research.

## Appendix A Derivation of generic update equations

In this section we will derive the generic IB update equations, starting from equation (5) and ending at equation (8). For notational simplicity we will drop superscripts and subscripts where they are clear from the context, writing $\eta$ for $\eta_t$, $b$ for $b_{ki}$, $z$ for $z_{ki}$ and $\theta$ for $\theta_k$. Let a tilde denote the current iterate, i.e. $\tilde{\theta}=\theta^{(t)}$. With this notation, the IB update from (5) becomes

$$\operatorname*{argmin}_{\theta}\Big\{2\eta\, b^\top\sigma(\theta z)+\eta\mu\|\theta\|_2^2+\|\theta-\tilde{\theta}\|_2^2\Big\}=\operatorname*{argmin}_{\theta}\Big\{\sum_{j=1}^{D_{k+1}}2\eta\, b_j\sigma(\theta_j^\top z)+\eta\mu\|\theta_j\|_2^2+\|\theta_j-\tilde{\theta}_j\|_2^2\Big\}, \tag{9}$$

where $\theta_j$ is the row of $\theta$ corresponding to the $j^{th}$ node in the layer. The minimization splits into $D_{k+1}$ separate minimization problems, one for each $j$:

$$\operatorname*{argmin}_{\theta_j}\Big\{2\eta\, b_j\sigma(\theta_j^\top z)+\eta\mu\|\theta_j\|_2^2+\|\theta_j-\tilde{\theta}_j\|_2^2\Big\}. \tag{10}$$

As written, this is a $D_k$-dimensional problem. However, we will be able to reduce it to just a one-dimensional problem. We begin by introducing an auxiliary variable $q_j=\theta_j^\top z$ and rewriting (10) as

$$\min_{q_j}\Big\{2\eta\, b_j\sigma(q_j)+\min_{\theta_j}\big\{\eta\mu\|\theta_j\|_2^2+\|\theta_j-\tilde{\theta}_j\|_2^2:\ q_j=\theta_j^\top z\big\}\Big\}. \tag{11}$$

We will first solve the inner minimization over $\theta_j$ as a function of $q_j$, and then solve the outer minimization over $q_j$.

#### Inner minimization

The inner minimization can be solved by taking the dual:

$$\begin{aligned}\min_{\theta_j}\big\{\eta\mu\|\theta_j\|_2^2+\|\theta_j-\tilde{\theta}_j\|_2^2:\ q_j=\theta_j^\top z\big\}&=\max_{\lambda_j\in\mathbb{R}}\min_{\theta_j}\big\{\eta\mu\|\theta_j\|_2^2+\|\theta_j-\tilde{\theta}_j\|_2^2+2\lambda_j(q_j-\theta_j^\top z)\big\}\\&=\max_{\lambda_j\in\mathbb{R}}\Big\{2\lambda_j q_j+\min_{\theta_j}\big\{\eta\mu\|\theta_j\|_2^2+\|\theta_j-\tilde{\theta}_j\|_2^2-2\lambda_j\theta_j^\top z\big\}\Big\}.\end{aligned} \tag{12}$$

The solution for $\theta_j$ in the inner problem is

$$\theta_j=\frac{\tilde{\theta}_j+\lambda_j z}{1+\eta\mu}. \tag{13}$$

Substituting (13) into (12) and simplifying yields

$$-\frac{1}{1+\eta\mu}\min_{\lambda_j\in\mathbb{R}}\Big\{\lambda_j^2\|z\|_2^2+2\lambda_j\big(\tilde{\theta}_j^\top z-(1+\eta\mu)q_j\big)-\eta\mu\|\tilde{\theta}_j\|_2^2\Big\}.$$

This is a quadratic in $\lambda_j$, which is easily minimized. The value of $\lambda_j$ at the minimum is

$$\lambda_j=\frac{(1+\eta\mu)q_j-\tilde{\theta}_j^\top z}{\|z\|_2^2}. \tag{14}$$

Substituting (14) into (12) yields the minimal value of the inner minimization problem as a function of $q_j$,

$$\frac{1+\eta\mu}{\|z\|_2^2}\Big(q_j-\frac{\tilde{\theta}_j^\top z}{1+\eta\mu}\Big)^2+\frac{\eta\mu}{1+\eta\mu}\|\tilde{\theta}_j\|_2^2. \tag{15}$$

#### Outer minimization

Replacing the inner minimization with (15), dropping the constant term and dividing everything by $2\eta$, (11) becomes

$$\operatorname*{argmin}_{q_j}\Big\{b_j\sigma(q_j)+\frac{1+\eta\mu}{2\eta\|z\|_2^2}\Big(q_j-\frac{\tilde{\theta}_j^\top z}{1+\eta\mu}\Big)^2\Big\}.$$

Reparameterizing $q_j$ in terms of $\alpha_j$ as

$$\alpha_j=\frac{1}{\eta\|z\|_2^2}\Big(\frac{\tilde{\theta}_j^\top z}{1+\eta\mu}-q_j\Big), \tag{16}$$

we arrive at our simplified update from (8):

$$\alpha_j=\operatorname*{argmin}_{\alpha}\Big\{b_j\cdot\sigma\Big(\frac{\tilde{\theta}_j^\top z}{1+\eta\mu}-\alpha\cdot\eta\|z\|_2^2\Big)+\frac{\eta(1+\eta\mu)\|z\|_2^2\,\alpha^2}{2}\Big\}.$$

Once we have solved for $\alpha_j$, we can recover the optimal $\theta_j$ by using (16) to find $q_j$, (14) to find $\lambda_j$ and (13) to find $\theta_j$. The resulting formula is

$$\theta_j=\frac{\tilde{\theta}_j}{1+\eta\mu}-\eta\alpha_j z,$$

as was stated in (7).
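The chain (16)→(14)→(13) can be verified numerically: the point produced by (7), with $\alpha_j$ from a dense grid search on (8), should minimize the full $D_k$-dimensional objective (10). A sketch (with $\sigma=\tanh$ and random data as arbitrary illustrative choices) that checks it beats random nearby candidates:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, mu = 0.5, 0.1
z = rng.normal(size=5)
tilde_theta = rng.normal(size=5)
b_j = 1.3
sigma = np.tanh                        # any element-wise activation works here

# the one-dimensional problem (8), solved by dense grid search
z2 = z @ z
pre = (tilde_theta @ z) / (1.0 + eta * mu)
grid = np.linspace(-10.0, 10.0, 400001)
vals = b_j * sigma(pre - grid * eta * z2) + eta * (1.0 + eta * mu) * z2 * grid ** 2 / 2.0
alpha = grid[np.argmin(vals)]
theta_star = tilde_theta / (1.0 + eta * mu) - eta * alpha * z   # formula (7)

# theta_star should (approximately) minimize the full objective (10)
obj10 = lambda th: (2.0 * eta * b_j * sigma(th @ z) + eta * mu * (th @ th)
                    + (th - tilde_theta) @ (th - tilde_theta))
for _ in range(1000):
    candidate = theta_star + rng.normal(scale=0.3, size=5)
    assert obj10(theta_star) <= obj10(candidate) + 1e-4
```

The same check works for any element-wise activation, since the reduction in this appendix makes no convexity assumption on $\sigma$.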

## Appendix B IB for convolutional neural networks

Here we consider applying IB to a Convolutional Neural Network (CNN). As each filter is applied independently in a CNN, the IB updates decouple into separate updates for each filter. Since a filter uses shared weights, we cannot use the generic update equations in Section 4.1. Instead we have to derive the updates starting from (9):

$$\operatorname*{argmin}_{\theta}\Big\{2\eta\, b^\top\sigma(Z\theta)+\eta\mu\|\theta\|_2^2+\|\theta-\tilde{\theta}\|_2^2\Big\}. \tag{17}$$

Here $Z$ is a matrix where each row corresponds to one patch over which the convolution vector $\theta$ is multiplied. The activation function $\sigma$ has components $\sigma_m(v)=\max\{B_m v\}$, where each $B_m$ is a pooling matrix with elements in $\{0,1\}$. We will assume that $B_m$ has a row of all zeros, so that the max-pooling effectively includes a relu non-linearity, since then $\max\{B_m v\}\ge 0$.

We can expand (17) into a quadratic program:

$$\begin{aligned}&\operatorname*{argmin}_{\theta}\ 2\eta\sum_{m=1}^{M}b_m\max\{B_mZ\theta\}+\eta\mu\|\theta\|_2^2+\|\theta-\tilde{\theta}\|_2^2\\ =\ &\operatorname*{argmin}_{a,\theta}\ 2\eta\sum_{m=1}^{M}|b_m|a_m+\eta\mu\|\theta\|_2^2+\|\theta-\tilde{\theta}\|_2^2\\ &\qquad\text{s.t.}\quad a_m\ge\begin{cases}\max\{B_mZ\theta\}&\text{if }b_m\ge0\\ \min\{-B_mZ\theta\}&\text{if }b_m<0\end{cases}\\ =\ &\operatorname*{argmin}_{a,y,\theta}\ 2\eta\sum_{m=1}^{M}|b_m|a_m+\eta\mu\|\theta\|_2^2+\|\theta-\tilde{\theta}\|_2^2\\ &\qquad\text{s.t.}\quad a_m\ge\begin{cases}(B_mZ\theta)_j&\text{if }b_m\ge0\\ -(B_mZ\theta)_j-C(1-y_{mj})&\text{if }b_m<0\end{cases}\quad\text{for all }j\\ &\qquad\phantom{\text{s.t.}}\quad \sum_j y_{mj}=1,\qquad y_{mj}\in\{0,1\},\end{aligned}$$

where $M$ is a large constant (the big-$M$ method). This is clearly an expensive problem to solve at each iteration.
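The sign split in the reformulation above rests on a simple scalar identity: $b_m\max\{v\}$ equals $|b_m|\max\{v\}$ when $b_m\ge 0$ and $|b_m|\min\{-v\}$ when $b_m<0$, which is exactly what the $a_m$ constraints encode. A quick randomized check of this identity (our own sketch):

```python
import numpy as np

# For any scalar b and vector v: b*max(v) = |b|*max(v) if b >= 0,
# and b*max(v) = |b|*min(-v) if b < 0 (since min(-v) = -max(v)).
rng = np.random.default_rng(0)
for _ in range(100):
    b, v = rng.normal(), rng.normal(size=5)
    epigraph = abs(b) * (np.max(v) if b >= 0 else np.min(-v))
    assert np.isclose(b * np.max(v), epigraph)
```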

Note that if the convolution did not include max-pooling, but just used shared weights, then the problem would become a quadratic program with continuous variables and no constraints, which could be solved analytically. On the other hand if the convolution had max-pooling, but no shared weights, then the generic IB updates from (7) would apply. Thus the difficulty in solving the IB convolutional update comes from doing the max-pooling and weight sharing in the same layer.
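In the no-pooling case, $\sigma(Z\theta)=Z\theta$ and (17) becomes an unconstrained quadratic whose gradient vanishes at $\theta^{*}=(\tilde{\theta}-\eta Z^{\top}b)/(1+\eta\mu)$. The sketch below (our own, with arbitrary toy dimensions) verifies this closed form numerically:

```python
import numpy as np

# Without pooling, objective (17) reads
#   2*eta*b^T Z theta + eta*mu*||theta||^2 + ||theta - theta_tilde||^2,
# and its gradient 2*eta*Z^T b + 2*eta*mu*theta + 2*(theta - theta_tilde)
# vanishes at theta_star below.
rng = np.random.default_rng(1)
eta, mu = 0.1, 0.01
Z = rng.normal(size=(8, 3))
b = rng.normal(size=8)
theta_tilde = rng.normal(size=3)

theta_star = (theta_tilde - eta * Z.T @ b) / (1.0 + eta * mu)
grad = 2*eta*Z.T @ b + 2*eta*mu*theta_star + 2*(theta_star - theta_tilde)
assert np.allclose(grad, 0.0)  # theta_star is the analytic minimizer
```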

## Appendix C Convergence theorem

In this section we present a simple theorem that leverages the fact that IB converges to EB in the limit of small learning rates, to show that IB converges to a stationary point of the loss function for appropriately decaying learning rates. First we will introduce some useful notation, after which we state the conditions under which the convergence theorem holds. After a few lemmas, we prove the desired result.

Notation. Let $g_i(\theta;\eta)$ denote the gradient used to take a step in IB when datapoint $i$ is sampled with learning rate $\eta$, i.e. $\theta^{(t+1)}=\theta^{(t)}-\eta_t\,g_i(\theta^{(t)};\eta_t)$. Let $\ell$ be the loss function from Sections 2 and 3. Define the level set

$$C=\Big\{\theta:\|\theta\|_{2}^{2}\le\tfrac{2}{\mu}\,\ell(0)\Big\}$$

and the restarting function

$$R(\theta)=\begin{cases}\theta&\text{if }\theta\in C\\ 0&\text{otherwise.}\end{cases}$$

The set $C$ depends on the value of $\ell(0)$. When $\theta=0$ the output of the neural network is independent of its input, and so $\ell(0)$ can be quickly calculated. Finally, define the extended level set

$$\bar{C}(\eta)=\Big\{\theta:\|\theta\|_{2}\le\sqrt{\tfrac{2}{\mu}\ell(0)}+\eta\cdot\max_{i,\,\theta\in C}\{\|g_i(\theta;\eta)\|_{2}\}\Big\}$$

to contain all points that can be reached from $C$ in one IB iteration (without restarting).
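The restarting function above can be sketched in a few lines. This is our own illustrative code, not the paper's implementation; `loss_at_zero` stands for $\ell(0)$:

```python
import numpy as np

def restart(theta, loss_at_zero, mu):
    """R(theta): reset to 0 whenever theta leaves C = {||theta||^2 <= (2/mu) l(0)}.

    loss_at_zero is l(0), which is cheap to compute since the network
    output at theta = 0 does not depend on the input.
    """
    if theta @ theta <= (2.0 / mu) * loss_at_zero:
        return theta
    return np.zeros_like(theta)

# With l(0) = 1 and mu = 0.1, the threshold on ||theta||^2 is 20.
inside = np.array([0.5, -0.5])       # ||theta||^2 = 0.5 <= 20: unchanged
outside = np.array([100.0, 100.0])   # ||theta||^2 = 20000 > 20: reset
assert restart(inside, loss_at_zero=1.0, mu=0.1) is inside
assert np.array_equal(restart(outside, loss_at_zero=1.0, mu=0.1), np.zeros(2))
```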

We will assume the following conditions.

###### Assumption 1.

The objective function $\ell$, the IB gradients $g_i$ and the learning rate sequence $\{\eta_t\}$ satisfy the following:

1. The loss at each datapoint is non-negative, i.e. $\ell_i(\theta)\ge 0$ for all $i$ and $\theta$.

2. The gradient function $\nabla\ell$ is Lipschitz continuous with a Lipschitz constant $L(\eta)$ that is monotonically decreasing in $\eta$. That is,

$$\|\nabla\ell(\theta)-\nabla\ell(\bar{\theta})\|_{2}\le L(\eta)\|\theta-\bar{\theta}\|_{2}$$

for all $\theta,\bar{\theta}\in\bar{C}(\eta)$, and $L(\eta)\le L(\eta')$ if $\eta\ge\eta'$.

3. The gradients of the stochastic functions $\tilde{\ell}_k$ in (3) are Lipschitz continuous with a Lipschitz constant $\tilde{L}(\eta)$ that is monotonically decreasing in $\eta$. That is,

$$\|\nabla\tilde{\ell}_k(\theta_k;x,y,\theta^{(t)}_{-k})-\nabla\tilde{\ell}_k(\bar{\theta}_k;x,y,\theta^{(t)}_{-k})\|_{2}\le\tilde{L}(\eta)\|\theta_k-\bar{\theta}_k\|_{2}$$

for all $\theta_k$, $\bar{\theta}_k$, $(x,y)$ and $\theta^{(t)}_{-k}$; and $\tilde{L}(\eta)\le\tilde{L}(\eta')$ if $\eta\ge\eta'$.

4. The learning rate sequence $\{\eta_t\}$ is monotonically decreasing, with $\lim_{t\to\infty}\eta_t=0$, $\sum_t\eta_t=\infty$ and $\sum_t\eta_t^2<\infty$.

A few comments on the assumptions. Assumption (a) is effectively equivalent to the loss being lower bounded by a deterministic constant $c$, as one can always define an equivalent loss $\ell_i(\theta)-c$. Most standard loss functions, such as the square loss, quantile loss, logistic loss and multinomial loss, satisfy assumption (a). Assumptions (b) and (c) hold for any neural network whose activation functions have Lipschitz gradients. Assumption (d) is standard in SGD proofs (except for the monotonicity requirement, which is particular to our analysis).
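A canonical sequence satisfying the summability conditions in assumption (d) is $\eta_t = 1/t$ (our example, not a choice prescribed by the paper); a quick numeric sanity check:

```python
import math

# eta_t = 1/t is monotonically decreasing, its partial sums diverge
# (~ log T), and its squared sums converge (to pi^2/6).
etas = [1.0 / t for t in range(1, 1_000_001)]
assert all(etas[t] <= etas[t - 1] for t in range(1, 10))   # monotone decrease
assert sum(etas) > 10.0                                    # sum eta_t diverges
assert sum(e * e for e in etas) < math.pi**2 / 6           # sum eta_t^2 converges
```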

Let “restarting IB” refer to IB where the restarting operator $R$ is applied at each iteration. We now state the IB convergence theorem:

###### Theorem 1.

Under Assumption 1, restarting IB converges to a stationary point in the sense that

$$\lim_{T\to\infty}\frac{\sum_{t=1}^{T}\eta_t\,\mathbb{E}\big[\|\nabla\ell(\theta^{(t)})\|_{2}^{2}\big]}{\sum_{t=1}^{T}\eta_t}=0.$$

The proof of Theorem 1 is given below, after a few helpful lemmas.

###### Lemma 1.

The restarting operator $R$ does not increase the loss $\ell$.

###### Proof.

If $\theta\in C$ then $R(\theta)=\theta$ and the loss stays the same; otherwise

$$\ell(R(\theta))=\ell(0)<\frac{\mu}{2}\|\theta\|_{2}^{2}\le\ell(\theta),$$

where the first inequality is by the definition of $C$ and the second follows from assumption (a), since non-negativity of the per-datapoint losses implies $\ell(\theta)\ge\frac{\mu}{2}\|\theta\|_{2}^{2}$. ∎

###### Lemma 2.

The gradient of the loss function at any point in $C$ is bounded by $B_\ell(\eta):=L(\eta)\sqrt{\tfrac{2}{\mu}\ell(0)}+\|\nabla\ell(0)\|_{2}$.

###### Proof.

By the triangle inequality and Lipschitz assumption

$$\begin{aligned}\|\nabla\ell(\theta)\|_{2}&=\|\nabla\ell(\theta)-\nabla\ell(0)+\nabla\ell(0)\|_{2}\\&\le\|\nabla\ell(\theta)-\nabla\ell(0)\|_{2}+\|\nabla\ell(0)\|_{2}\\&\le L(\eta)\|\theta-0\|_{2}+\|\nabla\ell(0)\|_{2}\\&\le L(\eta)\sqrt{\tfrac{2}{\mu}\ell(0)}+\|\nabla\ell(0)\|_{2}\\&=B_\ell(\eta)\end{aligned}$$

for all $\theta\in C$. ∎

###### Lemma 3.

If $\|x-y\|_{2}\le z$ then for any vector $v$ we have $y^{\top}v\ge x^{\top}v-z\|v\|_{2}$.

###### Proof.
$$\begin{aligned}y^{\top}v&\ge\min_{s}\{s^{\top}v:\|x-s\|_{2}^{2}\le z^{2}\}\\&=\max_{\lambda\ge0}\min_{s}\{s^{\top}v+\lambda(\|x-s\|_{2}^{2}-z^{2})\}\\&=x^{\top}v-z\|v\|_{2},\end{aligned}$$

where the final line follows from basic algebra and calculus. ∎
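Lemma 3 can also be spot-checked numerically; it is Cauchy–Schwarz in disguise, since $y^{\top}v-x^{\top}v=(y-x)^{\top}v\ge-\|y-x\|_{2}\|v\|_{2}$. A randomized check (our own sketch):

```python
import numpy as np

# Whenever ||x - y||_2 <= z, every direction v satisfies
#   y.v >= x.v - z * ||v||_2.
rng = np.random.default_rng(2)
for _ in range(1000):
    x, v = rng.normal(size=4), rng.normal(size=4)
    y = x + rng.normal(size=4)
    z = np.linalg.norm(x - y)  # tightest z for which the bound applies
    assert y @ v >= x @ v - z * np.linalg.norm(v) - 1e-9
```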

###### Lemma 4.

The 2-norm difference between the EB and IB gradients at $\theta^{(t)}$ with learning rate $\eta_t$, $\|\nabla\ell_i(\theta^{(t)})-g_i(\theta^{(t)};\eta_t)\|_{2}$, is bounded by a quantity of order $\eta_t$.

###### Proof.

Let $g_i(\theta^{(t)};\eta_t)_k$ denote the components of $g_i(\theta^{(t)};\eta_t)$ corresponding to the parameters of layer $k$, i.e. $g_i(\theta^{(t)};\eta_t)=(g_i(\theta^{(t)};\eta_t)_1,\dots,g_i(\theta^{(t)};\eta_t)_d)$. By construction

$$\begin{aligned}\nabla\ell_{ik}(\theta^{(t)})&=\nabla_{\theta_k}\tilde{\ell}_k(\theta_k;x,y,\theta^{(t)}_{-k})\Big|_{\theta_k=\theta^{(t)}_k}\\ g_i(\theta^{(t)};\eta_t)_k&=\nabla_{\theta_k}\tilde{\ell}_k(\theta_k;x,y,\theta^{(t)}_{-k})\Big|_{\theta_k=\theta^{(t+1)}_k}\end{aligned}$$

where, in a slight abuse of notation, $\theta^{(t+1)}$ refers to the value of the next IB iterate before the application of the restarting operator. By the Lipschitz assumption on $\tilde{\ell}_k$ we have

 ∥∇ℓi(θ(t))−gi(θ(t);ηt)∥22 =d∑k=1∥∇ℓ