Farkas layers: don’t shift the data, fix the geometry
Successfully training deep neural networks often requires either batch normalization or appropriate weight initialization, both of which come with their own challenges. We propose an alternative, geometrically motivated method for training. Using elementary results from linear programming, we introduce Farkas layers: a method that ensures at least one neuron is active at a given layer. Focusing on residual networks with ReLU activation, we empirically demonstrate a significant improvement in training capacity in the absence of batch normalization or methods of initialization across a broad range of network sizes on benchmark datasets.
The training process of deep neural networks has gone through significant shifts in recent years, primarily due to revolutionary network architectures, such as residual neural networks (resnet50), and normalization techniques, initially proposed as batch normalization (batchnorm). The former is the backbone for current “state-of-the-art” results for image-based tasks such as classification. Normalization can be found in a variety of flavors and applications (layernorm; instancenorm; groupnorm; vaswani; zhu). For reasons that are not fully understood, normalization gives rise to many desirable traits in network training, e.g. a fast convergence rate, but also comes at a cost by increasing vulnerability to adversarial attacks (bn_adversarial). Apart from normalization, network weight initialization has been a driving force of improved performance. Throughout the years, various initialization schemes have been proposed, in particular Xavier initialization (xavier_init), FixUp (fixup), and a recent asymmetric initialization (lu2019dying). These approaches are often connected to probabilistic notions or to balancing learning rates while training.
Modern network architectures catered for image-classification tasks incorporate various one-sided activation functions, the canonical example being the Rectified Linear Unit (ReLU) (relu). Other one-sided activations include the Exponential Linear Unit (ELU) (elu), Scaled ELU (SELU) (selu), and LeakyReLU (leaky). ELU and SELU were constructed with the purpose of eliminating batch normalization (BN); these activation functions minimize the internal covariate shift (ICS), which is what BN was intended to accomplish. Recent developments have shown that BN has little impact on ICS but affects the smoothness of the loss landscape, allowing for easier (and faster) training regimes (madry_bn).
A geometric interpretation of batch normalization can be readily seen in the context of the dying ReLU problem. This phenomenon occurs when the gradient of a neuron is zero (that is, when its pre-activation is negative on all of the data), in which case the neuron is “dead”. This effectively freezes the neuron during training. If too many neurons are dead, the network learns slowly. In fact, it is entirely possible for a network to be “born dead”, where it does not allow learning at all (lu2019dying). To motivate this, consider a simple binary classification problem: suppose two sets of points in $\mathbb{R}^2$ are sufficiently separated and we wish to classify them using a simple 1-layer ReLU network (with standard cross-entropy loss). If the network is initialized such that one cloud of points is not “observed”, as in Figure 1(a), the network will not learn to classify those points. Batch normalization will shift the points to behave as in Figure 1(b); the network is no longer dead. There is a far simpler geometric solution in $\mathbb{R}^2$: given one hyperplane, construct another to face the missing data, as shown in Figure 1(c) by the green hyperplane, which is now a non-dead component of the network.
We present a geometrically motivated method for network training in the absence of initialization and normalization. Our novel layer structure, called Farkas layers, ensures that at least one neuron is active in every layer. Using off-the-shelf networks, Farkas layers can recover a significant part of the explanatory power of DNNs when batch normalization is removed, sometimes up to 10%, as demonstrated by empirical results on benchmark image classification tasks, such as CIFAR10 and CIFAR100. When used in conjunction with batch normalization on ImageNet-1k networks, we empirically show an approximate 20% improvement on the first epoch over only using batch normalization. Finally, we claim that input data normalization is a critical component for using large learning rates in FixUp, which is not the case for Farkas layer-based networks. All in all, this work provides a new research direction for architecture design beyond initialization methods, normalization layers, or new activation functions.
We define neural networks as iterative compositions of activation functions with linear functions. More specifically, given an input $x^{(k)}$ to the $k$-th layer, the output is typically written
$$x^{(k+1)} = \sigma\left(W^{(k)} x^{(k)} + b^{(k)}\right),$$
where $W^{(k)}$ is a weight matrix, the rows are written as $w_i^{(k)}$ ($i = 1, \dots, m$), $b^{(k)}$ is the layer’s bias, and $\sigma$ is the activation function of the layer. For the remainder of our analysis, we consider the ReLU activation function, written
$$\sigma(x) = \max(x, 0),$$
which acts component-wise in the case of vector arguments.
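As a minimal illustration of the layer definition above, the following framework-free Python sketch evaluates a ReLU layer and flags neurons that output zero on every point of a data set. The helper names (`relu_layer`, `dead_neurons`) and the toy data are assumptions made for illustration and are not taken from the original implementation.

```python
def relu(z):
    """ReLU applied to a scalar."""
    return max(z, 0.0)


def relu_layer(W, b, x):
    """Return sigma(W x + b) for a weight matrix W (list of rows) and bias b."""
    return [relu(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def dead_neurons(W, b, data):
    """Indices of neurons whose output is zero for every point in `data`."""
    outputs = [relu_layer(W, b, x) for x in data]
    return [i for i in range(len(W)) if all(out[i] == 0.0 for out in outputs)]


# Toy data: a cloud of points in the positive quadrant (hypothetical values).
data = [(1.0, 2.0), (2.0, 1.5), (1.5, 3.0)]

# One neuron faces away from the cloud, the other faces towards it.
W = [[-1.0, -1.0],
     [1.0, 0.0]]
b = [0.0, 0.0]

print(relu_layer(W, b, data[0]))   # [0.0, 1.0]
print(dead_neurons(W, b, data))    # [0]
```

The first neuron never receives a gradient from this data, which is the situation the dying ReLU discussion describes.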
2.2 Related work
Previous work on the dying ReLU problem, or vanishing gradient problem, in the case of using sigmoid-like activation functions, has heavily revolved around novel weight initializations. To address the vanishing gradient problem, (xavier_init) proposed an initialization that is still often used with ReLU activations (shang). (kaiming_init) proposed a scaled version of a normal distribution, which improved training for pure convolutional neural networks. More recently, (lu2019dying) studied the dying ReLU problem from a probabilistic perspective and addressed networks that are “born dead”, i.e. not capable of learning. Their analysis is based on the reasonable assumption that all weights and biases are initialized with a non-zero probability of being dead:
where is the number of neurons in a given layer, and is a fixed positive constant. They show that, as the number of layers approaches infinity, the probability of having a network be born dead approaches one. They also find that initializing weights with a symmetric distribution is more likely to cause a network to be born dead and thus propose initializing weights using asymmetric distributions to mitigate this problem. Their analysis is relevant in the case that the network is not very wide, which is distinct from the case we study in this paper.
Normalization has been one of the driving forces of recent state-of-the-art results; the general idea is to manipulate the neuron activation with various statistics, such as subtracting the mean and dividing by the standard deviation. Popular normalization techniques include batch normalization (batchnorm), layer normalization (layernorm), instance normalization (instancenorm), and group normalization (groupnorm). However, there is a desire to eliminate normalization from training. A recent work that does this is FixUp (fixup), a novel initialization technique that works by scaling the weights as a function of the network structure, allowing the network to take large learning rate steps during training. Reportedly, FixUp provides the same level of accuracy as that of a BN-enhanced network and scales to networks that train on ImageNet-1k. As stated in FixUp, the inherent structure of residual networks already prevents some level of gradient vanishing, since the variance of the layer output grows with depth. In the case of positively homogeneous blocks (e.g. no bias in a Linear or Convolutional layer), this type of variance increase can lead to gradient explosion (hanin2018start), which makes training more difficult.
3 Farkas layers
3.1 Augmenting ReLU activated layers
We present a novel approach to understanding how weight matrices and bias vectors contribute to neural network learning, called Farkas layers. As the name suggests, our method is influenced by Farkas’ lemma. In particular, we use the following representation of Farkas’ lemma; the proof is an exercise in linear programming and is left in the appendix.
Lemma 3.1 (Representation of Farkas’ lemma).
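Let $A \in \mathbb{R}^{m \times n}$ with $b \in \mathbb{R}^m$. Then the set $\{x \in \mathbb{R}^n : Ax + b \le 0\}$ is empty if and only if there exists a $\lambda \in \mathbb{R}^m$ such that $\lambda \ge 0$ (for $\lambda \neq 0$), with $A^\top \lambda = 0$ and $\lambda^\top b > 0$.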
We present the construction of $W$ and $b$ and prove using Lagrangian duality (boyd) that such weights and bias vectors will satisfy Farkas’ lemma, resulting in at least one positive component. In essence, this solves the dying ReLU problem while training.
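Theorem 3.2. Let $x \mapsto \sigma(Wx + b)$ be a layer of a neural network with ReLU activation function, with weights $W \in \mathbb{R}^{m \times n}$ (with rows $w_i$) and bias vector $b \in \mathbb{R}^m$, and let $\lambda \in \mathbb{R}^m$, with $\lambda \ge 0$ (with at least one $\lambda_i > 0$, without loss of generality $\lambda_m > 0$). Further, suppose $W$ has the property that
$$w_m = -\frac{1}{\lambda_m} \sum_{i=1}^{m-1} \lambda_i w_i$$
and that $b_m > -\frac{1}{\lambda_m} \sum_{i=1}^{m-1} \lambda_i b_i$. Then this layer has a non-zero gradient.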
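Proof. Let $W \in \mathbb{R}^{m \times n}$ (with rows $w_i$) and bias $b \in \mathbb{R}^m$, with the aforementioned assumptions, and arbitrary input $x \in \mathbb{R}^n$. The output of this layer is $\sigma(Wx + b)$, and has a non-zero gradient if and only if there exists at least one component $i$ such that $w_i^\top x + b_i > 0$, that is,
$$\max_{1 \le i \le m} \left( w_i^\top x + b_i \right) > 0. \tag{3}$$
The left-hand side of (3) (i.e. omitting the strict inequality constraint), taken at the worst-case input, can be written as the following minimization problem:
$$p^* = \min_{x,\,t} \; t \quad \text{subject to} \quad Wx + b \le t\,\mathbf{1},$$
where $\mathbf{1}$ is the vector of all ones. The Lagrangian for this problem is
$$\mathcal{L}(x, t, \lambda) = t + \lambda^\top \left( Wx + b - t\,\mathbf{1} \right),$$
where $\lambda \ge 0$. We compute the dual problem,
$$d^* = \max_{\lambda \ge 0} \; \min_{x,\,t} \; \mathcal{L}(x, t, \lambda) = \max_{\lambda} \; \lambda^\top b \quad \text{subject to} \quad W^\top \lambda = 0, \;\; \mathbf{1}^\top \lambda = 1, \;\; \lambda \ge 0.$$
The conditions on $W$ are met by assumption, and there are many ways to ensure that $\lambda$ is on the simplex and thus has at least one strictly positive component. Both the primal and dual problems are linear optimization problems: the objective and constraints are linear. By weak duality, $p^* \ge d^*$. Thus to ensure that $p^* > 0$ (as in (3)), it suffices to show that $d^* > 0$. This is guaranteed by the assumptions on the bias vectors: the $\lambda$ of the theorem, rescaled to lie on the simplex, is dual feasible and satisfies $\lambda^\top b > 0$, and so we are done. ∎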
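The statement of Theorem 3.2 can also be checked numerically. The sketch below is not the authors' implementation: it assumes the last row is built as $w_m = -\frac{1}{\lambda_m}\sum_{i<m} \lambda_i w_i$, and it makes the bias inequality strict by adding a small positive constant `slack`, which is a hypothetical parameter introduced here for illustration.

```python
import random


def farkas_row(W, b, lam, slack=1e-3):
    """Build the last row and bias from the first m-1 rows.

    W and b hold the m-1 free rows and biases; lam has m nonnegative
    entries with lam[-1] > 0. `slack` is an assumed positive margin that
    makes the bias inequality strict.
    """
    lam_m = lam[-1]
    n = len(W[0])
    # zip stops after the m-1 free rows, so lam[-1] is not used in the sums.
    w_m = [-sum(l * row[j] for l, row in zip(lam, W)) / lam_m for j in range(n)]
    b_m = -sum(l * b_i for l, b_i in zip(lam, b)) / lam_m + slack
    return w_m, b_m


def pre_activations(W, b, x):
    """Return W x + b as a list."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


random.seed(0)
m, n = 5, 3
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m - 1)]
b = [random.gauss(0, 1) for _ in range(m - 1)]
lam = [1.0 / m] * m  # evenly spaced simplex element

w_m, b_m = farkas_row(W, b, lam)
W_full, b_full = W + [w_m], b + [b_m]

# For every sampled input, at least one pre-activation is strictly positive.
inputs = [[random.uniform(-100, 100) for _ in range(n)] for _ in range(1000)]
all_active = all(max(pre_activations(W_full, b_full, x)) > 0 for x in inputs)
print(all_active)  # True
```

The check passes for any input because the $\lambda$-weighted sum of the pre-activations equals $\lambda_m$ times the slack, which is positive, so the pre-activations cannot all be non-positive.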
It is highly unlikely that a given layer in a deep neural network has all dead neurons. However, we believe that guaranteeing the explicit activity of one neuron is beneficial in that it improves training. Figure 1 motivates the use of Farkas layers as a potential substitute for BN in the context of learning: the arrows of the corresponding hyperplane indicate the “non-zero” part of the ReLU. In the case of Figure 1(a), we see that all data points end up on the “zero” side, which would typically lead to no learning. Heuristically, the effect of BN mimics Figure 1(b), where the data points are centered and scaled by a diagonal matrix (which geometrically acts like an ellipse), allowing for learning to occur. Farkas layers will instead append another hyperplane that is guaranteed to see all the points (in a “third” dimension).
3.2 Extension to general one-sided activations
The use of Farkas layers can be extended to other one-sided activation functions, such as a smooth-ReLU (i.e. replacing the kink with a polynomial), ELU, and several others. In all these cases, one can impose a scalar cutoff value such that “beyond” this cutoff, we have very small (or zero) gradients. This is shown in the following corollary, which has the same proof as Theorem 3.2.
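Corollary. Let $x \mapsto \sigma(Wx + b)$ be a layer of a neural network with an arbitrary one-sided activation function $\sigma$ with cutoff $c$, i.e. for all $z \le c$, we have $\sigma'(z) \approx 0$ (or exactly zero). Denote the weights by $W \in \mathbb{R}^{m \times n}$ (with rows $w_i$) and bias vector $b \in \mathbb{R}^m$, and let $\lambda \in \mathbb{R}^m$, with $\lambda \ge 0$, with at least one positive component. To ensure that $Wx + b$ has at least one component greater than $c$, we require that
$$W^\top \lambda = 0$$
and that $\lambda^\top b > c\,\mathbf{1}^\top \lambda$.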
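The corollary can also be argued directly, without passing through duality. The short derivation below is a sketch under the assumption that the two requirements take the form $W^\top \lambda = 0$ and $\lambda^\top b > c\,\mathbf{1}^\top \lambda$, where $c$ denotes the cutoff and $\lambda \ge 0$ is non-zero.

```latex
% Assumed notation: c is the cutoff, \lambda \ge 0 with \lambda \ne 0.
\begin{align*}
\text{Suppose } w_i^\top x + b_i \le c \text{ for all } i.
\text{ Then}\quad
\lambda^\top (Wx + b) &\le c\,\mathbf{1}^\top \lambda. \\
\text{Since } W^\top \lambda = 0,\quad
\lambda^\top (Wx + b) &= \lambda^\top b > c\,\mathbf{1}^\top \lambda,
\end{align*}
% which is a contradiction, so some component of Wx + b exceeds c.
```

Setting $c = 0$ recovers the ReLU case of Theorem 3.2.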
3.3 Algorithm and limitations
We call a layer whose weights $W$ and biases $b$ satisfy Theorem 3.2 a Farkas layer. The construction of $W$ and $b$ imposes certain restrictions on the existence of Farkas layers: namely we require that . Finally, some network architectures claim to perform better without bias vectors present; we require bias vectors by construction. We focus on replacing Convolutional and Linear layers in (deep) neural networks, with efficient implementations in PyTorch. The general method of incorporating Farkas layers is provided in Algorithm 1. To minimize computational cost, we consider $\lambda$ to be the evenly spaced simplex element, i.e. $\lambda_i = 1/m$ (for $i = 1, \dots, m$), but for smaller problems (such as the motivating 1-layer example) $\lambda$ can be learnable. Farkas layers can be made into Farkas blocks in conjunction with residual networks as well, as outlined in Algorithm 2, which we leave in the Appendix. In the case of multiple convolutional layers in a given residual block, the extension is natural.
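In both Algorithm 1 and 2 we consider two choices for the aggregating function (AggFunc): either the sum or the mean of the previous outputs and biases. Indeed, let $y_i = w_i^\top x$ denote the outputs of the first $m-1$ rows. Then we have the following two aggregation methods:
$$\text{sum:} \;\; y_m = -\sum_{i=1}^{m-1} y_i, \qquad \text{mean:} \;\; y_m = -\frac{1}{m-1} \sum_{i=1}^{m-1} y_i,$$
and similarly for the bias vectors. Theorem 3.2 is constructed using “sum”, but the result holds in the case of using “mean” since ReLUs are positively 1-homogeneous. Furthermore, by using “mean”, we induce stability with respect to the $\infty$-induced matrix norm for a matrix $A$, defined as
$$\|A\|_\infty = \max_i \|a_i\|_1,$$
where $a_i$ are the rows of $A$. Let $W \in \mathbb{R}^{(m-1) \times n}$ and $\bar{w}$ be the mean of the rows of $W$. Define $\tilde{W} = \begin{bmatrix} W \\ -\bar{w}^\top \end{bmatrix}$, then, since $\|\bar{w}\|_1 \le \frac{1}{m-1} \sum_{i=1}^{m-1} \|w_i\|_1 \le \|W\|_\infty$,
$$\|\tilde{W}\|_\infty = \max\left( \|W\|_\infty, \|\bar{w}\|_1 \right) = \|W\|_\infty.$$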
Thus, in the context of network weights, the burden of stability lies on the trainable weights and not the aggregated component. A similar analysis can be done for the bias vectors. Thus, for networks that train on ImageNet-1k, or to potentially use larger learning rates, we believe that using “mean” as the aggregation function is necessary.
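The effect of the two aggregation functions on the $\infty$-induced norm can be seen in a small example. The following is a rough, framework-free sketch rather than the paper's Algorithm 1: the function names, the toy weights, and the positive `slack` added to the aggregated bias are all assumptions introduced here.

```python
def inf_norm(W):
    """Infinity-induced matrix norm: the largest row-wise 1-norm."""
    return max(sum(abs(v) for v in row) for row in W)


def aggregate_row(W, agg="sum"):
    """Return the appended row: minus the sum (or mean) of the rows of W."""
    n = len(W[0])
    denom = 1 if agg == "sum" else len(W)
    return [-sum(row[j] for row in W) / denom for j in range(n)]


def farkas_layer(W, b, x, agg="sum", slack=1e-3):
    """ReLU layer with one extra, aggregated neuron.

    `slack` is an assumed positive constant that keeps the extra bias
    strictly above minus the aggregated biases.
    """
    denom = 1 if agg == "sum" else len(W)
    W_full = W + [aggregate_row(W, agg)]
    b_full = b + [-sum(b) / denom + slack]
    z = [sum(w * x_j for w, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W_full, b_full)]
    return [max(z_i, 0.0) for z_i in z]


# Hypothetical trainable rows.
W = [[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]]
b = [0.0, 0.0, 0.0]

print(inf_norm(W))                               # 4.0
print(inf_norm(W + [aggregate_row(W, "sum")]))   # 11.0
print(inf_norm(W + [aggregate_row(W, "mean")]))  # 4.0

# Every original neuron is inactive at this input, but the extra one is not.
out = farkas_layer(W, b, [-1.0, -1.0], agg="mean")
print(any(v > 0 for v in out))                   # True
```

With “sum” the appended row dominates the norm, whereas with “mean” the norm of the augmented matrix equals that of the trainable rows.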
4 Experiments — Image classification
Image classification is a standard benchmark for measuring new advancements in deep learning. Our goal is to show how training with Farkas layers, henceforth referred to as FLs, impacts learning in these settings relative to other training techniques such as normalization and initialization (e.g. FixUp).
We consider CIFAR10, CIFAR100, and ImageNet-1k datasets for the purposes of testing FL-based residual networks on image classification tasks. For CIFAR10, we consider 18, 34, 50, and 101 layer residual networks. The latter two of these use a BottleNeck block structure. On CIFAR100, we only consider 34, 50, and 101 layers. All models are trained using the same SGD schedule for 200 epochs. We make the distinction between a “large” learning rate, meaning 0.1, and a “small” learning rate, which is 0.01. We use standard data augmentation (RandomCrop and RandomHorizontalFlip) but do not normalize the inputs with respect to the mean and standard deviation of the dataset in the case of CIFAR10/CIFAR100. The weights are left at the default PyTorch initialization. Additionally, we use cutout to improve generalization (cutout) and use weight decay. For ImageNet-1k, we use modified code from the DAWNBench competition (DAWNBench; fastimagenet). In this set of experiments, we augment the data via RandomCrop and RandomHorizontalFlip, and normalize the input data. We only consider the case of networks using BN and see how FLs can improve performance, and we train for 30 epochs.
Henceforth, we call a residual network a FarkasNet if it is composed of only FLs, and we always use the “sum” aggregation function unless otherwise specified. We make the distinction of networks that use or do not use BN in the tables. When omitting BN, we use the small learning rate. For CIFAR10 and CIFAR100, the results presented are averaged across three runs; if one of the runs failed to train (that is, the network did not make any progress in the first five epochs), we omit it from the average but place an asterisk. Full training curves are in the Appendix.
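The reporting rule for failed runs can be expressed in a few lines. This is only an illustrative sketch: the function name, the use of `None` to encode a failed run, and the error values are hypothetical and do not come from the paper's tables.

```python
def summarize_runs(errors):
    """Average test errors over runs, skipping failed runs (given as None).

    Returns a formatted string; an asterisk marks that a run was omitted.
    """
    ok = [e for e in errors if e is not None]
    if not ok:
        return "failed"
    mean = sum(ok) / len(ok)
    marker = "*" if len(ok) < len(errors) else ""
    return f"{mean:.2f}{marker}"


# Hypothetical test errors (in percent) from three runs.
print(summarize_runs([5.0, 6.0, 7.0]))   # 6.00
print(summarize_runs([5.0, None, 7.0]))  # 6.00*
```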
Interpreting dependence of batch normalization
Tables 1 and 2 both show a strong disparity in the learning capacity of a DNN when batch normalization is removed, which is unsurprising. On CIFAR10, we primarily observe similar, if not better, test errors when batch normalization is used. Without normalization, our test error is always better; the same can be said for the CIFAR100 dataset. Thus, while FarkasNets also diminish in quality, we note that adding a guaranteed undead neuron to all layers heavily impacts learning in the case of a non-normalized network. This is not at all to say that training a deep neural network is only possible using FLs or that we achieve state-of-the-art performance; however, our implementation demonstrates how strongly a low test error depends on normalization, and it is geometrically motivated.
Improvement at first epoch for ImageNet-1k
The final table shows similar behaviour on ImageNet-1k, where FLs (for ImageNet-1k, we use the “mean” aggregation function) neither dramatically improve nor inhibit the learning capacity towards the end of training, except in the case of 50 layers, where they perform significantly better. Figure 2 shows that FarkasNets have a dramatic advantage over standard ResNets at the start of training. Evidently, this advantage gradually diminishes; nonetheless, its presence points to a bigger problem in network initialization. We notice a roughly 20% improvement on the first epoch, simply due to the use of FLs.
Impact of data normalization and best practice comparison
We briefly compare against other best practice learning regimes on CIFAR10. Table 4 highlights the various differences between methods. At the core, we compare FarkasNets to FixUp, a recently proposed initialization scheme that allows the use of larger learning rates in the absence of batch normalization, and thus faster convergence. We consider a 34-layer ResNet with BasicBlock structure using the official implementation of FixUp (from one of the authors’ GitHub pages) for the network architecture. We compare against a 34-layer FarkasNet, where the last layer in a block is initialized to zero (as in FixUp, which is motivated by several other works (zerolast_1; zerolast_2; zerolast_3)) but which otherwise uses standard initialization, as well as the “mean” aggregation function. We maintain the setup of the previous experiments.
Data normalization is a standard “trick” for training deep neural networks, and acts like an initial batch normalization step, where the incoming data is scaled to have zero mean and unit variance. In the case of CIFAR10 and CIFAR100, we strove to study the problem of training without any attempts at normalization, which is why we have omitted this from our training methodology. In the absence of data normalization, we observed that FixUp requires a small learning rate for training. Thus, while (fixup) claims to present a theory for addressing training in the absence of batch normalization, we find that it still heavily hinges on normalization principles, namely on the first layer pass. This observation is not to diminish the benefits of FixUp, as faster convergence is observed (see Figure 3), but the learning rate does need adjustment. On the other hand, our 34-layer FarkasNet was able to use a medium learning rate of 0.05, allowing for improved performance relative to Table 1.
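To make the data normalization step concrete, the sketch below standardizes a list of values to zero mean and unit variance. It is a simplified illustration with hypothetical pixel values; in practice the statistics are usually computed per channel over the whole training set rather than over a single list.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Shift and scale `values` to zero mean and unit (population) variance."""
    mu = fmean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical pixel intensities from a single image channel.
pixels = [0.0, 0.25, 0.5, 0.75, 1.0]
normalized = standardize(pixels)

print(abs(fmean(normalized)) < 1e-12)         # True
print(abs(pstdev(normalized) - 1.0) < 1e-12)  # True
```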
Finally, we also address the notion that training a very deep neural network with a large learning rate and without normalization is challenging, which has been experimentally observed in FixUp and Layer-Sequential Unit-Variance (LSUV) orthogonal initialization (mishkin). Despite the shortcomings of previous work, by incorporating the “mean” aggregation function and default initialization, we are able to successfully train a 101-layer FarkasNet with a large learning rate on CIFAR10. We present the averaged training curve in Figure 5, which we leave in the Appendix, where we repeated the experiment three times. We remark that using default initialization alone results in complete failure to train.
To date, successful training of deep neural networks requires some form of normalization, weight initialization, and/or choice of activation function, all of which come with their own challenges. In this work, we provide a fourth option, Farkas layers, which can be used alone or in conjunction with the aforementioned techniques. The Farkas layer is based on the interaction of the geometry of the weights with the data: by simply adding one linearly dependent row to the weight matrix, we ensure that the neurons of a layer are never all dead. Using our method, we have shown that training with larger learning rates is possible even in the absence of batch normalization and with default initialization. In this work, we only touched on image classification, but Farkas layers could be used in many deep learning applications, which is left for future work.
Appendix A
A.1 Elementary linear programming
Proposition A.1 (Farkas’ Lemma).
Let and .
Let and . Then
Suppose by contradiction there exists an such that . Let and , and so . Then:
which follows from Farkas’ Lemma. ∎
A.2 Algorithm for residual Farkas layer
A.3 Training curves for CIFAR10
A.4 Training curves for CIFAR100
For 50-layers, one of the networks (standard ResNet without BN) failed to train, and we omit it from the average.
A.5 Training curves for ImageNet-1k
Here, dotted lines denote Farkas layer-based networks.