Proportionate gradient updates with PercentDelta
Deep Neural Networks are generally trained using iterative gradient updates. The magnitudes of gradients are affected by many factors, including the choice of activation functions and initialization. More importantly, gradient magnitudes can differ greatly across layers, with some layers receiving much smaller gradients than others, causing those layers to train more slowly and therefore slowing down overall convergence. We analytically explain this disproportionality. We then propose to explicitly train all layers at the same speed, by scaling the gradient w.r.t. every trainable tensor to be proportional to the tensor's current value. In particular, at every batch, we want to update all trainable tensors such that the relative change in the L1-norm of each tensor is the same across all layers of the network, throughout training time. Experiments on MNIST show that our method appropriately scales gradients, such that the relative change in trainable tensors is approximately equal across layers. In addition, measuring test accuracy over training time shows that our method trains faster than other methods, giving higher test accuracy for the same budget of training steps.
Sami Abu-El-Haija Google Research Mountain View, CA email@example.com
Weights of Deep Neural Networks (DNNs) are commonly randomly initialized. The $i$-th layer's weight matrix $W^{(i)}$ is initialized from a distribution that is generally conditioned on the shape of $W^{(i)}$.
Feed-forward DNNs generally consist of a series of matrix multiplications and element-wise activation functions. For example, input vector $x$ is transformed to the output vector $y$ by an $L$-layer neural network, depicted as:

$y = g(W^{(L)} \times g(W^{(L-1)} \times \dots \times g(W^{(1)} \times x)))$

where $\times$ is a matrix-multiply operator and $g$ is an element-wise activation (e.g. logistic or ReLu). The DNN parameters are optimized by a training algorithm, such as Stochastic Gradient Descent (SGD), which iteratively applies gradient updates:

$W^{(i)}_{t+1} = W^{(i)}_t - \eta \, \rho(t) \, \nabla_{W^{(i)}} J_t$
where $W^{(i)}_t$ is the value of the $i$-th layer weight matrix at timestep $t$; $\eta$ is the learning rate; the decay function $\rho(t)$ generally decreases with time $t$; and $\nabla_{W^{(i)}} J_t$ is the partial gradient of the training objective $J$ w.r.t. $W^{(i)}$, evaluated at time $t$ on a batch of training examples. For notational convenience, we define the SGD delta as:

$\Delta^{\mathrm{SGD}}(W^{(i)}_t) = -\eta \, \rho(t) \, \nabla_{W^{(i)}} J_t$
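As a minimal NumPy sketch (ours, for illustration; the linearly decaying schedule $\rho(t) = 1 - t/T$ is an assumed, common choice), the SGD delta can be written as:

```python
import numpy as np

def sgd_delta(grad, lr=0.1, t=0, T=1000):
    """SGD update direction: -lr * rho(t) * grad, with a linearly
    decaying rho(t) = 1 - t/T (an assumed, common decay choice)."""
    rho = 1.0 - t / T  # decay function: decreases with timestep t
    return -lr * rho * grad

# The weight update is then: W = W + sgd_delta(grad_W, lr, t, T)
```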
Under usual circumstances, the magnitudes of gradients can vary widely across the layers of a DNN, causing some trainable tensors to change much more slowly than others and slowing down overall training. Several proposed methods mitigate this problem, including: per-parameter adaptive learning rates such as AdaGrad (adagrad) or Adam (adam); normalization operators such as BatchNorm (batchnorm), WeightNorm (weightnorm), or LayerNorm (layernorm); and intelligent initialization schemes such as Xavier's (xavier). These methods heuristically attack the disproportionate training problem, as we justify in Section 2. In this paper, we propose to directly enforce proportionate training of layers. Specifically, we propose a gradient update rule that moves every weight tensor in the direction of the gradient, but with a magnitude that changes the tensor value by a relative amount. We use the same relative amount across all layers, to train them all at the same speed.
The remainder of the paper is organized as follows. In Section 2 we illustrate the disproportionate training phenomenon on a (toy) hypothetical 4-layer neural network, by expanding gradient terms using BackProp (backprop). In Section 3, we summarize related work. We then introduce our algorithm in Section 4. We show experimental results on MNIST in Section 5. In Section 6, we discuss where PercentDelta could be useful and potential future directions. Finally, we conclude our findings in Section 7.
2 Disproportionate Training
We use a toy example to illustrate disproportionate training across layers. Assume a 4-layer network with trainable weight matrices $W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}$ and output vector $y = g(W^{(4)} \times g(W^{(3)} \times g(W^{(2)} \times g(W^{(1)} \times x))))$. The gradient of the objective w.r.t. the output vector, $\nabla_y J$, can be directly calculated from the data, e.g. using cross-entropy loss. We write down the gradient of the objective w.r.t. the last layer's weight matrix as:

$\nabla_{W^{(4)}} J = \left[ \nabla_y J \circ g'(W^{(4)} h^{(3)}) \right] (h^{(3)})^\top$

where $h^{(i)} = g(W^{(i)} h^{(i-1)})$ denotes the $i$-th hidden activation, with $h^{(0)} = x$,
and where $g'$ is the derivative of $g$ w.r.t. its input, and $\circ$ is the Hadamard product.
We also write down the gradients w.r.t. $W^{(3)}$ and $W^{(2)}$, then expand the expressions using the Back Propagation Algorithm (backprop):

$\nabla_{W^{(3)}} J = \left[ \left( (W^{(4)})^\top \left[ \nabla_y J \circ g'(W^{(4)} h^{(3)}) \right] \right) \circ g'(W^{(3)} h^{(2)}) \right] (h^{(2)})^\top$

$\nabla_{W^{(2)}} J = \left[ \left( (W^{(3)})^\top \left[ \left( (W^{(4)})^\top \left[ \nabla_y J \circ g'(W^{(4)} h^{(3)}) \right] \right) \circ g'(W^{(3)} h^{(2)}) \right] \right) \circ g'(W^{(2)} h^{(1)}) \right] (h^{(1)})^\top$
Note the following in the above equations. First, the derivatives look similar; in fact, they are almost sub-expressions of one another, with the exception of the right-most row-vector $(h^{(i-1)})^\top$. Second, all quantities in brackets are column-vectors. The gradient matrix $\nabla_{W^{(i)}} J$ is determined by an outer product of a column-vector (in brackets) times the row-vector $(h^{(i-1)})^\top$.
What can we conclude about the magnitudes (e.g. the L1 norm) of $\nabla_{W^{(i)}} J$? Each is calculated using three types of multiplicands: $g'(\cdot)$, $(W^{(j)})^\top$, and $(h^{(i-1)})^\top$. Therefore, the magnitudes of $\nabla_{W^{(i)}} J$ are affected by:
The value of the derivative $g'$, which is evaluated element-wise. If $g'$ is generally less than 1, then we can expect the gradient to be smaller for earlier layers than later ones, since earlier layers are multiplied by $g'$ more times. If it is generally greater than 1, then we can expect the gradient to be larger for earlier layers. For very deep networks (recurrent or otherwise), the former situation can cause the gradients to vanish, while the latter can cause the gradients to explode. ReLu mitigates this problem, as its derivative $g' = 1$ in locations where the input is positive.
The magnitudes of the $W^{(j)}$'s. In practice, all $W^{(j)}$'s are initialized from the same distribution. If the L1-norms of the rows in $W^{(j)}$ are less than 1, then multiplication by $(W^{(j)})^\top$ shrinks the backpropagated vector, yielding smaller gradients for earlier layers. Otherwise, if the row L1-norms are greater than 1, then earlier layers should receive larger gradient magnitudes. Weight normalization (weightnorm) and intelligent initialization schemes (e.g. xavier) can mitigate this problem.
The norm of the row-vector $(h^{(i-1)})^\top$. If, in the forward pass, the activations consistently grow (e.g. with an unbounded activation) through subsequent layers, then later layers will receive larger gradients. BatchNorm (batchnorm) and LayerNorm (layernorm) mitigate this problem.
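The first two factors can be made concrete with a toy NumPy sketch (our illustration, with assumed dimensions and scales, not the paper's experiment): we backpropagate a vector through several layers whose element-wise derivative is $g' = 0.25$ (the logistic derivative at 0) and whose weights are modest, and watch the gradient norm shrink for earlier layers:

```python
import numpy as np

# Toy illustration (assumed sizes/scales): backpropagate through 6
# layers where the element-wise derivative g' = 0.25 everywhere
# (logistic at 0) and weights are drawn with a small scale.
rng = np.random.default_rng(0)
delta = np.ones(8)  # gradient vector arriving at the top layer
norms = []          # L1 norm of the backpropagated gradient per layer
for _ in range(6):
    W = rng.normal(scale=0.5, size=(8, 8))
    delta = (W.T @ delta) * 0.25  # multiply by W^T, then by g'
    norms.append(np.abs(delta).sum())
# norms decays roughly geometrically: earlier layers see far smaller
# gradients than later ones (the vanishing-gradient situation).
```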
3 Related Work
3.1 Adaptive Gradient
AdaGrad (adagrad) is a training algorithm that keeps a cumulative sum of squared gradients (i.e. the second moment of the gradients):

$G_t = \sum_{s=1}^{t} (\nabla_W J_s)^2$

where the square is applied element-wise.
It then divides (element-wise) the gradient by the square root of this sum:

$\Delta^{\mathrm{AdaGrad}}(W_t) = -\eta \, \frac{\nabla_W J_t}{\sqrt{G_t}}$
or equivalently $-\eta \, (G_t)^{-1/2} \circ \nabla_W J_t$, where the power operator is applied element-wise. In essence, if some layer receives large gradients, then they will be normalized to smaller gradients through the division. However, a weakness of AdaGrad is that at some point $G_t$ will grow too large, effectively making $\Delta^{\mathrm{AdaGrad}} \approx 0$ and therefore slowing or halting training.
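A minimal NumPy sketch of the AdaGrad rule just described (ours, for illustration; variable names are assumptions):

```python
import numpy as np

def adagrad_delta(grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then divide the
    gradient element-wise by the square root of the accumulator."""
    accum = accum + grad ** 2
    return -lr * grad / (np.sqrt(accum) + eps), accum
```

Feeding the same gradient repeatedly yields smaller and smaller steps, illustrating the halting weakness: the accumulator only grows, so the effective step size decays toward zero.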
Adam (adam) keeps exponentially-decaying sums of gradients and of squared gradients, respectively known as the first and second moments. Adam offers two benefits over AdaGrad. First, its decaying sums cannot grow to infinity, and therefore training should not halt. Second, exponential-decay averaging was shown to speed up training (Momentum, momentum). For details, we point readers to the Adam paper (adam).
LARS (lars) proposes to normalize every layer's gradient by the ratio of the L2 norms of the parameter and the gradient. Namely, it proposes the gradient update rule:

$\Delta^{\mathrm{LARS}}(W_t) = -\eta \, \rho(t) \, \frac{\|W_t\|_2}{\|\nabla_W J_t\|_2} \, \nabla_W J_t$
which effectively scales the gradient so that its norm is proportional to the norm of the weights. This setup is very similar to ours, with two differences. First, our norm operator is L1 rather than L2. Second, our norm operator is applied outside the division (i.e. our division is element-wise). Our proposed algorithm, PercentDelta, was used to train our work (asymproj) before we were aware of the work of (lars). In (asymproj), PercentDelta gave us an improvement over Adam on all datasets. Nonetheless, we only discuss MNIST experiments in this paper, and leave graph embedding experiments for follow-up work.
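For comparison, a NumPy sketch of the LARS-style scaling described above (our rendering; the eps guard against a zero gradient norm is an assumption):

```python
import numpy as np

def lars_delta(w, grad, lr=0.1, rho=1.0, eps=1e-8):
    """LARS-style layer-wise update: scale the gradient by the ratio of
    L2 norms ||w||_2 / ||grad||_2, so the update norm tracks the weight
    norm (one scalar multiplier per layer)."""
    scale = np.linalg.norm(w) / (np.linalg.norm(grad) + eps)
    return -lr * rho * scale * grad
```

Note that the returned update has L2 norm $\eta \rho \|w\|_2$ regardless of the gradient's magnitude; only the gradient's direction survives.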
4 PercentDelta
For every weight matrix $W^{(i)}$, and similarly for bias vectors, we propose the gradient update rule:

$\Delta^{\mathrm{PD}}(W^{(i)}_t) = -\eta \, \rho(t) \, \frac{\nabla_{W^{(i)}} J_t}{c^{(i)}_t}$
where the scalar $s_i$ is the number of entries in $W^{(i)}$ (the product of $W^{(i)}$'s dimensions: for a 2-D matrix, the number of rows times the number of columns; for a vector, its length), and the scalar:

$c^{(i)}_t = \frac{1}{s_i} \left\| \nabla_{W^{(i)}} J_t \oslash W^{(i)}_t \right\|_1$
normalizes the gradient of the $i$-th layer, so that it is more proportional to its current parameter value $W^{(i)}_t$, where $\oslash$ denotes element-wise division and $\|\cdot\|_1$ is the L1-norm. We avoid division-by-zero errors by pushing entries of $W^{(i)}_t$ slightly away from zero while keeping their signs (in practice, replacing $W$ with $\mathrm{sign}(W)(|W| + \epsilon)$ rather than $W + \epsilon$). The divide operator within the L1 norm is applied element-wise and the outer divide operator is scalar. Note that the "gradient multiplier" fraction (Equation 7) is a scalar. Therefore, we only change the gradient's magnitude, but not its direction.
Consequently, each entry of $W^{(i)}$ will change by a quantity proportional to its current value. In particular, the scalar $\eta \, \rho(t)$ determines the percentage by which $W^{(i)}$ changes at timestep $t$, giving rise to the name: PercentDelta. For example, if we set up a decay schedule on $\rho(t)$ such that $\eta\,\rho(t)$ decreases during training from an initial rate to a final rate, then the network parameters will change at the initial relative rate in early mini-batches, and gradually decrease their PercentDelta updates to the final rate towards the end of training, consistently across all layers, regardless of the current parameter values or gradient values, which are influenced by the choice of activation function, initialization, and network architecture.
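A minimal NumPy sketch of the PercentDelta rule (our rendering of the update above; the epsilon guard follows the sign-preserving trick described earlier, with an extra assumed fallback for exactly-zero entries):

```python
import numpy as np

def percent_delta(w, grad, lr=0.1, rho=1.0, eps=1e-6):
    """PercentDelta: divide the gradient by the scalar mean of the
    element-wise absolute ratio |grad / w|, so the tensor changes by
    roughly the relative amount lr * rho per step."""
    # Push entries slightly away from zero while keeping their sign;
    # exactly-zero entries (where sign is 0) fall back to eps.
    w_safe = np.where(w == 0, eps, np.sign(w) * (np.abs(w) + eps))
    c = np.mean(np.abs(grad / w_safe))  # (1/s_i) * ||grad ⊘ w||_1
    return -lr * rho * grad / c
```

For instance, when every entry has the same relative gradient (e.g. grad = 0.5 * w), the returned delta is -lr * rho * w: a uniform 10% relative change per step at lr = 0.1, rho = 1, for every layer it is applied to.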
5 Experiments
We run experiments on MNIST. We use the same model for all experiments and fix the batch size to 500. Our model code is copied from the TensorFlow tutorial (https://www.tensorflow.org/get_started/mnist/pros) and contains 4 trainable layers, arranged as: 2D Convolution, Max-pooling, 2D Convolution, Max-pooling, Fully-connected, Fully-connected. The convolutional layers contain trainable tensors with dimensions (5, 5, 1, 32) and (5, 5, 32, 64), plus bias vectors. The fully-connected layers contain trainable tensors with dimensions (3136, 1024) and (1024, 10), plus bias vectors. The model is trained with Softmax loss, uses ReLu for hidden activations, and does not use BatchNorm.
5.1 MNIST Network Gradient Magnitudes
We record the gradient magnitudes for our 4-layer MNIST network throughout training. We compare the relative magnitude by which each tensor changes under vanilla SGD versus under PercentDelta, plotting the relative change $\|\Delta W\|_1 / \|W\|_1$ for the trainable tensors of the convolutional and fully-connected layers. We notice that vanilla SGD proposes gradients that are not proportional to the current weights. For example, at a given learning rate, some layers can completely diverge (e.g. all entries switching signs) while other layers change at a rate of <1%. In this case, to prevent divergence in any layer's tensors, the learning rate would have to be lowered. Under PercentDelta, however, the relative magnitude of parameter updates is almost equal for all layers, showing that all layers train at the same speed, consistently throughout the duration of training.
5.2 MNIST Test Accuracy Curves
We want to measure how fast PercentDelta can train on MNIST. We compare its training speed with other algorithms, including the per-parameter adaptive learning rate algorithms AdaGrad (adagrad) and Adam (adam), as well as a recent algorithm of similar spirit, LARS (lars), which also normalizes the gradient w.r.t. a weight tensor by the current value of that tensor.
Figure 2 shows the test accuracy curves. In our experiments, we fix the learning rate of PercentDelta, use momentum, and set $\rho(t)$ as in Equation 10. For AdaGrad and Adam, we sweep the learning rate but fix $\rho(t) = 1$, as they implement their own decay. We do not use momentum for AdaGrad or Adam, as the latter applies its own momentum scheme. For LARS, we use momentum, vary the learning rate, and set $\rho(t)$ as in Equation 10. We note the following:
In early stages, MNIST test accuracy climbs fastest with our algorithm.
The final test accuracy produced by our algorithm is higher, given the training budget of 5000 steps.
6 Discussion
6.1 Situations where PercentDelta is useful
While PercentDelta outperforms other training algorithms on the 4-layer MNIST network, the space of models and datasets is enormous; we leave it as future work to try PercentDelta on various models and datasets. Nonetheless, we speculate that PercentDelta (and similarly LARS, lars) would be very useful in the following scenarios:
Learning Embeddings. Consider the common setup of feeding word embeddings (or graph embeddings, asymproj) into a shared Neural Network and jointly learning the embeddings and the Neural Network for an upstream objective. In this setup, a given embedding vector is only affected by a fraction of the training examples, while the shared network parameters are affected by all training examples. The sum of gradients w.r.t. the shared network parameters over all training examples can be disproportionately larger than the embedding gradients. PercentDelta ensures that the shared network is not updated much faster than the embeddings.
Soft-Attention Models on Bag-of-Words. It is common to convert a variable-length bag-of-words into a fixed-length representation via a convex combination $c = \sum_i \alpha_i x_i$, with $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$, which can then be used for an upstream objective (e.g. event detection in videos, ramanathan-cvpr16). Here, $\alpha_i$ can be the $i$-th entry of a softmax over all words. The parameters of the softmax model receive gradients from all words. PercentDelta ensures that the otherwise disproportionately large gradient updates of the softmax model remain proportional to those of the remainder of the network.
Matrix Factorization Models. For example, kore-bell propose to factorize a user-movie rating matrix into:

$\hat{r}_{um} = p_u^\top q_m + b^{(u)}_u + b^{(m)}_m + \mu$
where $p_u$ and $q_m$ are rows of the user and movie embedding matrices, living in a latent space of size $k$; $b^{(u)}$ and $b^{(m)}$ are the user and movie bias vectors; and $\mu$ is the global bias scalar. In this setup, $\mu$ would receive a very large sum-of-gradients, and PercentDelta can ensure that all parameters train at the same speed.
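The disparity can be seen with a tiny NumPy computation (our hypothetical illustration; the squared-loss residuals and matrix sizes are assumptions):

```python
import numpy as np

# Hypothetical illustration: with a prediction of the form
#   r_hat[u, m] = p[u] @ q[m] + b_u[u] + b_m[m] + mu
# and a squared loss over all observed ratings, the gradient w.r.t. the
# global bias mu sums residuals over EVERY rating, while the gradient
# w.r.t. a single user's bias sums only that user's ratings.
n_users, n_movies = 100, 50
residuals = np.ones((n_users, n_movies))  # assume unit error everywhere
grad_mu = residuals.sum()       # contributions from all 5000 ratings
grad_b_u0 = residuals[0].sum()  # contributions from user 0's 50 ratings
# grad_mu is 100x larger: without rescaling, mu trains much faster.
```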
6.2 Hyperparameters and the Decay Function
It may seem that PercentDelta has many knobs to tune. However, we can fix $\eta$ to some value and only change $\rho(t)$, as their product determines the effective rate of change across all layers' trainable tensors. We can set $\rho(t)$ to a linearly decaying schedule:

$\rho(t) = 1 - \frac{t}{T}$
where $T$ determines the decay slope. In addition, we can ensure that $\rho(t) > 0$, allowing training to continue indefinitely, by modifying Equation 9 to:

$\rho(t) = \max\left(\epsilon', \; 1 - \frac{t}{T}\right)$
where $\epsilon'$ can be set to a small positive value, such as 0.01. In this case, if we fix $\eta$, then we are effectively changing each trainable tensor at a relative rate of $\eta$ for every training batch initially, then gradually annealing this change-rate to $\eta \epsilon'$ after $T$ steps.
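The modified schedule is a one-liner (a sketch; the defaults T = 5000 and eps_prime = 0.01 are assumptions matching the training budget and floor value discussed above):

```python
def rho(t, T=5000, eps_prime=0.01):
    """Linearly decaying change-rate schedule with a positive floor:
    rho(t) = max(eps_prime, 1 - t/T), so training never fully halts."""
    return max(eps_prime, 1.0 - t / T)
```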
More importantly, we believe that $\eta$ and $T$ are a function of the dataset, not the model. Experimentally, we observe that the algorithm is insensitive to the choices of $\eta$ and $T$ as long as they are "reasonable" (i.e. excluding diverging setups, which can be quickly detected). However, we do not yet have a formula to set them automatically. Nonetheless, over a wide range of $\eta$ and $T$, we experimentally show on MNIST that PercentDelta beats all baseline training algorithms, given the same budget of training steps.
7 Conclusion
We propose an algorithm that trains all layers of a neural network at the same speed. Our algorithm, PercentDelta, is a simple modification of standard Gradient Descent: it divides the gradient w.r.t. a trainable tensor by the mean absolute value of the element-wise ratio of the gradient to the tensor. This division by a mean L1-norm is scalar, and only changes the gradient's magnitude, not its direction. Effectively, the L1 norms of all trainable layers are updated at the same rate. We recommend a linearly decaying change-rate schedule. Our modified gradients can be passed through a standard momentum accumulator (momentum). Overall, we show experimentally that our algorithm puts an upper envelope on all baseline training algorithms, reaching higher test accuracy with fewer steps.