Abstract
We investigate neural network training and generalization using the concept of stiffness. We measure how stiff a network is by looking at how a small gradient step on one example affects the loss on another example. In particular, we study how stiffness varies with 1) class membership, 2) distance between data points (in the input space as well as in latent spaces), 3) training iteration, and 4) learning rate. We empirically study the evolution of stiffness on MNIST, FASHION MNIST, CIFAR10 and CIFAR100 using fully-connected and convolutional neural networks. Our results demonstrate that stiffness is a useful concept for diagnosing and characterizing generalization. We observe that small learning rates lead to initial learning of more specific features that do not translate well to improvements on inputs from all classes, whereas high learning rates initially benefit all classes at once. We measure stiffness as a function of distance between data points and observe that higher learning rates induce positive correlation between changes in loss at data points further apart, pointing towards a regularization effect of learning rate. When training on CIFAR100, the stiffness matrix exhibits a coarse-grained behavior suggestive of the model’s awareness of superclass membership.
Stiffness: A New Perspective on Generalization in Neural Networks
Stanislav Fort, Paweł Krzysztof Nowak, Srini Narayanan
Neural networks are a class of highly expressive function approximators that have proved successful in approximating solutions to complex tasks across many domains such as vision, natural language understanding, and gameplay. They have long been recognized as universal function approximators (Hornik et al., 1989; Cybenko, 1989; Leshno et al., 1993). The specific details that lead to their expressive power have recently been studied in Montúfar et al. (2014); Raghu et al. (2017); Poole et al. (2016). Empirically, neural networks have been extremely successful at generalizing to new data despite their overparametrization for the task at hand, as well as their proven ability to fit arbitrary random data perfectly (Zhang et al., 2016; Arpit et al., 2017).
The fact that gradient descent is able to find good solutions given the highly overparametrized family of functions has been studied theoretically in Arora et al. (2018) and explored empirically in Li et al. (2018), where the effective low-dimensional nature of many common learning problems is shown. Fort & Scherlis (2018) extend the analysis in Li et al. (2018) to demonstrate the role of initialization on the effective dimensionality.
Du et al. (2018a) and Du et al. (2018b) use a Gram matrix to study convergence in neural network empirical loss. Pennington & Worah (2017) study the concentration properties of a similar covariance matrix formed from the output of the network. Both concepts are closely related to our definition of stiffness.
To explain the remarkable generalization properties of neural networks, it has been proposed (Rahaman et al., 2018) that the function family is biased towards low-frequency functions. The role of similarity between the neural network outputs for similar inputs has been studied in Schoenholz et al. (2016) for random initializations and explored empirically in Novak et al. (2018).
In this paper, we study generalization through the lens of stiffness. We measure how stiff a neural network is by analyzing how a small gradient step based on one input affects the loss on another input. Mathematically, if the gradient of the loss at point $X_1$ with respect to the network weights is $\vec{g}_1$, and the gradient at point $X_2$ is $\vec{g}_2$, we define stiffness in terms of the overlap $\vec{g}_1 \cdot \vec{g}_2$. We specifically focus on the sign of $\vec{g}_1 \cdot \vec{g}_2$, which captures the resistance of the learned functional approximation to deformation by gradient steps. We find the concept of stiffness useful in diagnosing and characterizing generalization. As a corollary, we use stiffness to characterize the regularization power of learning rate, and show that higher learning rates bias the functions learned towards higher stiffness.
We show that stiffness is directly related to generalization when evaluated on the held-out validation set. Stiff functions are less flexible and therefore less prone to overfitting to the particular details of a dataset. We explore the concept of stiffness for fully-connected (FC) and convolutional neural networks (CNN) on 4 classification datasets (MNIST, FASHION MNIST, CIFAR10, CIFAR100) and on synthetic data comprising spherical harmonics. We focus on how stiffness between data points varies with their 1) class membership, 2) distance between each other (both in the space of inputs as well as in latent spaces), 3) training iteration, and 4) the choice of learning rate.
We observed the stiffness between validation set data points based on their class membership and noticed a clear evolution towards high stiffness within examples of the same class, as well as between different classes as the model trains. We diagnose and characterize the class-dependent stiffness matrix for fully-connected and convolutional neural networks on the datasets mentioned above in different stages of training. We observe the stiffness between inputs to regress to zero with the onset of overfitting, demonstrating the clear connection to generalization.
The choice of learning rate affects the stiffness properties of the learned function significantly. High learning rates induce functional approximations that are stiffer over larger distances (i.e. data points further apart respond similarly to gradient updates) and whose learned features generalize better to inputs from different classes (i.e. data points from different classes respond similarly to gradient updates). Lower learning rates, on the other hand, seem to learn more detailed, specific features that, even though they lead to the same loss on the training set, do not generalize to other classes as well. This points towards high learning rates being advantageous not only due to the smaller number of steps needed to converge, but also due to the higher generalizability of the features they tend to learn, i.e. high learning rates act as an effective regularizer.
This paper is structured as follows: we introduce the concept of stiffness and the relevant theory, describe our experimental setup, discuss the results, and conclude.
Let a functional approximation (e.g. a neural network) $f$ be parametrized by tunable parameters $W$. Let us assume a classification task and let a data point $X$ have the ground truth label $y$. A loss $\mathcal{L}(f_W(X), y)$ gives us the amount of mismatch between the function's output $f_W(X)$ at input $X$ and the ground truth $y$. The gradient of the loss with respect to the parameters
$$\vec{g} = \nabla_W \mathcal{L}(f_W(X), y) \tag{1}$$
is the direction in which, if we were to change the parameters $W$, the loss would change the most rapidly (at least for infinitesimal step sizes). Gradient descent steps against this direction to update the weights and gradually tune the functional approximation to better correspond to the desired outputs on the training dataset inputs.
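Let there be two data points $X_1$ and $X_2$ with their ground truth labels $y_1$ and $y_2$. We construct a gradient with respect to example 1 as $\vec{g}_1 = \nabla_W \mathcal{L}(f_W(X_1), y_1)$ and ask how the losses on data points 1 and 2 change as a result of a small gradient-descent step of size $\varepsilon$ on $W$ along $\vec{g}_1$, i.e. what is
$$\Delta \mathcal{L}_1 = \mathcal{L}(f_{W - \varepsilon \vec{g}_1}(X_1), y_1) - \mathcal{L}(f_W(X_1), y_1), \tag{2}$$
which is equivalent to
$$\Delta \mathcal{L}_1 = -\varepsilon\, \vec{g}_1 \cdot \vec{g}_1 + \mathcal{O}(\varepsilon^2). \tag{3}$$
The change in loss on input 2 due to the gradient step from input 1 becomes equivalently
$$\Delta \mathcal{L}_2 = -\varepsilon\, \vec{g}_1 \cdot \vec{g}_2 + \mathcal{O}(\varepsilon^2), \tag{4}$$
where $\vec{g}_2 = \nabla_W \mathcal{L}(f_W(X_2), y_2)$ is the gradient with respect to example 2.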
We are interested in the correlation in loss changes $\Delta \mathcal{L}_1$ and $\Delta \mathcal{L}_2$. We know that $\Delta \mathcal{L}_1 < 0$ since we constructed the gradient update accordingly. We define positive stiffness to mean $\Delta \mathcal{L}_2 < 0$ as well, i.e. that losses at both inputs went down. There would be no stiffness if $\Delta \mathcal{L}_2 = 0$, and the two inputs would be anti-stiff, i.e. have negative stiffness, if $\Delta \mathcal{L}_2 > 0$. The equations above show that this can equivalently be thought of as the overlap between the two gradients, $\vec{g}_1 \cdot \vec{g}_2$, being positive for positive stiffness, and negative for negative stiffness. We illustrate this in Figure 1.
The above indicates that what we initially thought of as a change in loss due to the application of a small gradient update from one input to another is in fact equivalent to analyzing gradient alignment between different data points.
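The equivalence between loss changes and gradient overlap can be checked numerically. The following is a minimal sketch on a toy linear model with squared loss; the model, the values, and the names `loss`, `grad` and `eps` are assumptions chosen purely for illustration and are not the networks used in our experiments.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, x, y):
    """Squared loss of a toy linear model f_w(x) = w . x."""
    return 0.5 * (dot(w, x) - y) ** 2

def grad(w, x, y):
    """Gradient of the squared loss with respect to the weights w."""
    residual = dot(w, x) - y
    return [residual * xi for xi in x]

# Illustrative weights and two data points (hypothetical values).
w = [0.5, -0.3]
x1, y1 = [1.0, 2.0], 1.0
x2, y2 = [2.0, 1.0], 0.0
eps = 1e-4  # small step size

g1 = grad(w, x1, y1)
g2 = grad(w, x2, y2)

# Take a small gradient-descent step based on example 1 only.
w_new = [wi - eps * gi for wi, gi in zip(w, g1)]

delta_l1 = loss(w_new, x1, y1) - loss(w, x1, y1)
delta_l2 = loss(w_new, x2, y2) - loss(w, x2, y2)

# First-order prediction for the change in loss on example 2.
pred_l2 = -eps * dot(g1, g2)

print("g1 . g2      =", dot(g1, g2))
print("delta L1     =", delta_l1)
print("delta L2     =", delta_l2)
print("-eps g1 . g2 =", pred_l2)
```

In this particular toy example the gradient overlap is negative, so the loss on example 2 increases while the loss on example 1 decreases, i.e. the pair is anti-stiff.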
We define stiffness to be the expected sign of the gradient overlap $\vec{g}_1 \cdot \vec{g}_2$ (or equivalently the expected sign of $\Delta \mathcal{L}_1 \Delta \mathcal{L}_2$) as
$$S((X_1, y_1), (X_2, y_2); f) = \mathbb{E}\left[\operatorname{sign}(\vec{g}_1 \cdot \vec{g}_2)\right], \tag{5}$$
where stiffness depends on the dataset from which $X_1$ and $X_2$ are drawn (e.g. examples of the same class, examples a certain distance apart etc.) as well as on the particular architecture and weights specifying the neural net/function approximator $f$. The sign of $\vec{g}_1 \cdot \vec{g}_2$ is positive when $\vec{g}_1$ points into the same half-space as $\vec{g}_2$. That means that positive stiffness corresponds to the weight update optimal for input 1 having at least a partial alignment with the optimal weight update for input 2. We illustrate this in Figure 1.
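As a sketch of how the expected sign might be estimated in practice, the snippet below averages the sign of the gradient overlap over all distinct pairs in a set of precomputed per-example gradients. The function name `stiffness` and the choice to average over unordered pairs are assumptions made for illustration.

```python
from itertools import combinations

def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(v):
    """Return +1, 0 or -1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def stiffness(grads):
    """Mean sign of the gradient overlap over all distinct pairs of examples."""
    pairs = list(combinations(grads, 2))
    return sum(sign(dot(a, b)) for a, b in pairs) / len(pairs)

# Three hypothetical per-example loss gradients.
grads = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
print(stiffness(grads))
```

Here one pair of gradients is aligned and two pairs are anti-aligned, so the estimate is negative.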
In the empirical part of this paper, we study the average stiffness between inputs $X_1$ and $X_2$ as a function of their different properties. We define the relevant details in the following subsections.
A natural question to ask is whether a gradient taken with respect to an input $X_1$ in class $c_1$ will also decrease the loss for an example $X_2$ with true class $c_2$. In particular, we define the class stiffness matrix
$$C(c_1, c_2) = \mathbb{E}_{X_1 \in c_1,\, X_2 \in c_2}\left[\operatorname{sign}(\vec{g}_1 \cdot \vec{g}_2)\right]. \tag{6}$$
The on-diagonal elements of this matrix correspond to the suitability of the current gradient update to the members of a class itself. In particular, they correspond to within-class generalizability. The off-diagonal elements, on the other hand, express the amount of improvement transferred from one class to another. They therefore directly diagnose the amount of generality the currently improved features have. We work with the stiffness properties of the validation set, and therefore investigate generalization directly.
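A class stiffness matrix of this kind could be estimated from per-example gradients and integer class labels roughly as follows. This is a minimal sketch: the function name `class_stiffness_matrix`, the exclusion of an example paired with itself, and the toy gradients are all assumptions for illustration.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(v):
    """Return +1, 0 or -1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def class_stiffness_matrix(grads, labels, num_classes):
    """Average sign of the gradient overlap for every ordered pair of classes."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [[0] * num_classes for _ in range(num_classes)]
    for i, (gi, ci) in enumerate(zip(grads, labels)):
        for j, (gj, cj) in enumerate(zip(grads, labels)):
            if i == j:
                continue  # do not pair an example with itself
            sums[ci][cj] += sign(dot(gi, gj))
            counts[ci][cj] += 1
    # Entries with no contributing pairs are reported as NaN.
    return [
        [sums[a][b] / counts[a][b] if counts[a][b] else float("nan")
         for b in range(num_classes)]
        for a in range(num_classes)
    ]

# Hypothetical gradients: two examples of class 0 and two of class 1.
grads = [[1.0, 0.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
labels = [0, 0, 1, 1]
matrix = class_stiffness_matrix(grads, labels, num_classes=2)
for row in matrix:
    print(row)
```

In this toy case the gradients agree within each class and disagree across classes, so the diagonal entries are positive and the off-diagonal entries negative.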
A consistent summary of generalization between classes is the off-diagonal sum of the class stiffness matrix, normalized by the number of off-diagonal entries,
$$S_{\text{between classes}} = \frac{1}{N_c (N_c - 1)} \sum_{c_1} \sum_{c_2 \neq c_1} C(c_1, c_2), \tag{7}$$
where $N_c$ is the number of classes. In our experiments, we track this value as a function of learning rate once we reached a fixed loss. The quantity is related to how generally applicable the learned features are, i.e. how well they transfer from one class to another. For example, for CNNs, learning good edge detectors in initial layers typically benefits all downstream tasks, regardless of the particular class in question.
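Given a class stiffness matrix, the between-class summary can be computed as the mean of its off-diagonal entries. The sketch below assumes the matrix is a square list of lists; the name `between_class_stiffness` and the example values are hypothetical.

```python
def between_class_stiffness(matrix):
    """Mean of the off-diagonal entries of a square class stiffness matrix."""
    n = len(matrix)
    total = sum(matrix[a][b] for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

# Hypothetical 3-class stiffness matrix.
example_matrix = [
    [1.0, 0.2, -0.4],
    [0.2, 1.0, 0.6],
    [-0.4, 0.6, 1.0],
]
print(between_class_stiffness(example_matrix))
```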
We investigate how stiff two inputs are based on how far away from each other they are. We can think of neural networks as a form of kernel learning and here we are investigating the particular form of the learned kernel. This links our results to the work on spectral bias (towards slowly-varying, low-frequency functions) in Rahaman et al. (2018). We are able to directly measure the characteristic size of the stiff regions in neural networks trained on real tasks, i.e. the characteristic length scale over which data points vary together in our trained networks.
Let us have two inputs $X_1$ and $X_2$ that are preprocessed to zero mean and unit length. Those are then fed into a multilayer neural network, where each layer produces a representation of the input and passes it on to the next layer. Schematically, the network forms a set of representations as
$$X \to R_1 \to R_2 \to \cdots \to R_L, \tag{8}$$
where $R_l$ denotes the representation produced by layer $l$.
We study how stiffness between two inputs $X_1$ and $X_2$ depends on their mutual distance. We investigate $L_1$ and $L_2$ distances, as well as the dot product distance between representations. We look at both the input (pixel) space distances and distances between representations formed by the network itself.
The distance metric that we use is the dot product distance
$$d_{\text{dot}}(X_1, X_2) = \frac{X_1 \cdot X_2}{|X_1|\,|X_2|}, \tag{9}$$
which has the advantage of being bounded between $-1$ and $1$ and therefore makes it easier to compare distances between different layers.
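The dot product distance is the cosine of the angle between two vectors, which can be sketched as below. The function name `dot_product_distance` is an assumption, and the inputs are assumed to be non-zero vectors of equal length.

```python
import math

def dot_product_distance(a, b):
    """Cosine of the angle between two non-zero vectors; lies in [-1, 1]."""
    overlap = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return overlap / (norm_a * norm_b)

print(dot_product_distance([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(dot_product_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(dot_product_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```

Because the value does not depend on the overall scale of the vectors, it can be compared across layers whose representations have very different norms.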
We identify a sharp decline in the amount of stiffness between inputs further than a threshold distance from each other in all representations including the input space. We track this threshold distance as a function of training and learning rate to estimate the characteristic size of the stiff regions of a neural net.
We ran a large number of experiments with fully-connected (FC) and convolutional neural networks (CNN) on 4 classification datasets: MNIST (LeCun & Cortes, 2010), FASHION MNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky, 2009). Using those experiments, we investigated the behavior of stiffness as a function of 1) training iteration, 2) the choice of learning rate, 3) class membership, and 4) distance between images (in the input space as well as representation spaces within the networks themselves).
For experiments with fullyconnected neural networks, we used a 6 layer network of the form . For experiments with convolutional neural networks, we used a 5 layer network with filter size 3 and the numbers of channels being 32, 64, 128 and 256 after the respective convolutional layer, each followed by max pooling. The final layer was fullyconnected. No batch normalization was used.
We preprocessed the network inputs to have zero mean and unit variance. We used with different (constant) learning rates as our optimizer and a default batch size of 32.
We evaluated stiffness properties between data points in the validation set to study generalization. We used the training set to train our model. The procedure was as follows:

1) Train for a number of steps on the training set and update the network weights accordingly.
2) Fix the network weights.
3) Go through tuples of images in the validation set.
4) For each tuple calculate the loss gradients $\vec{g}_1$ and $\vec{g}_2$, and check $\operatorname{sign}(\vec{g}_1 \cdot \vec{g}_2)$.
5) Log distances between the images as well as other relevant features.
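The evaluation part of the procedure above (fixed weights, pairs of validation examples, sign of the gradient overlap, logged distances) can be sketched as follows. A toy linear model with squared loss stands in for the network, and `toy_grad`, `collect_pair_records` and the record fields are hypothetical names introduced only for this illustration.

```python
import math
from itertools import combinations

def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(v):
    """Return +1, 0 or -1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def cosine(a, b):
    """Dot product distance (cosine of the angle) between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def toy_grad(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 w.r.t. w; a stand-in for a network's loss gradient."""
    residual = dot(w, x) - y
    return [residual * xi for xi in x]

def collect_pair_records(weights, val_examples, grad_fn):
    """With the weights held fixed, log the overlap sign and input distance for each pair."""
    grads = [grad_fn(weights, x, y) for x, y in val_examples]
    records = []
    for i, j in combinations(range(len(val_examples)), 2):
        records.append({
            "pair": (i, j),
            "sign": sign(dot(grads[i], grads[j])),
            "input_cosine": cosine(val_examples[i][0], val_examples[j][0]),
            "targets": (val_examples[i][1], val_examples[j][1]),
        })
    return records

# Hypothetical fixed weights and a tiny "validation set" of (input, target) pairs.
weights = [0.5, -0.3]
val_examples = [([1.0, 2.0], 1.0), ([2.0, 1.0], 0.0), ([1.0, 1.5], 1.0)]
records = collect_pair_records(weights, val_examples, toy_grad)
for record in records:
    print(record)
```

Averaging the `sign` field over records grouped by class pair or by distance bin then gives stiffness as a function of class membership or distance.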
In our experiments, we used a fixed subset (typically of images for experiments with 10 classes, and for experiments with 100 classes) of the validation set to evaluate the stiffness properties on. We convinced ourselves that such a subset is sufficiently large to provide measurements with small enough statistical uncertainties.
We investigated how stiffness properties depend on the learning rate used in training. To be able to compare training runs with different learning rates fairly, we looked at them at the time they reached the same training loss. Our results are presented in Figure 8.
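The idea of comparing runs at matched training loss, rather than at matched step count, can be sketched on a one-dimensional toy problem. Everything here (the quadratic loss, the function name `steps_to_reach_loss`, the target value) is an assumption for illustration and not our training code.

```python
def steps_to_reach_loss(lr, target_loss, w0=1.0, max_steps=10_000):
    """Run gradient descent on the toy loss 0.5 * w**2 until it drops to target_loss.

    Returns the number of steps taken and the final weight, i.e. the state at
    which a run would be frozen for analysis.
    """
    w = w0
    for step in range(max_steps + 1):
        if 0.5 * w * w <= target_loss:
            return step, w
        w -= lr * w  # the gradient of 0.5 * w**2 is w
    raise RuntimeError("target loss not reached within max_steps")

# Runs with different learning rates are stopped at the same training loss.
snapshots = {lr: steps_to_reach_loss(lr, target_loss=0.01) for lr in (0.1, 0.5)}
for lr, (steps, w) in snapshots.items():
    print(f"lr={lr}: stopped after {steps} steps, loss={0.5 * w * w:.5f}")
```

The runs stop after different numbers of steps but at comparable loss, which is the point at which their stiffness properties would be compared.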
We explored the stiffness properties of validation set data points based on their class membership as a function of training iteration. Our results are summarized in Figures 3, 5, and 6 for MNIST, FASHION MNIST and CIFAR10 with true labels respectively, and in Figure 4 for MNIST with randomly permuted training set labels.
Stiffness between two data points characterizes the amount of correlation between changes in loss on the two due to the application of a gradient update based on one of them. This, as we show in the theory section, can be thought of as the amount of alignment between gradients at the two input points.
We focused on stiffness between inputs in the validation set as we wanted to explore generalization. If a gradient step taken with respect to a validation set input would improve loss on another validation set input, the gradient step potentially represents a genuinely generalizable feature that is relevant to the underlying generator of the data.
Figures 3, 5, and 6 show the stiffness matrix at 4 stages of training: at initialization (before any gradient step was taken), early in the optimization, and at two latetime stages.
Initially, an improvement based on an input from a particular class benefits only members of the same class. Intuitively, this could be due to some crude features shared within a class (such as the typical overall intensity, or the average color) being learned. There is no consistent stiffness between different classes at initialization. As training progresses, within-class stiffness stays high. In addition, stiffness between classes increases as well, provided the model is powerful enough for the dataset. Features that are beneficial to almost all inputs are being learned at this stage. The pattern is visible in Figures 3, 5, and 6, where the off-diagonal elements of the stiffness matrix become consistently positive with training. With the onset of overfitting, as shown in Figure 2, the model becomes increasingly less stiff until even stiffness for inputs within the same class is lost. This is due to the model overfitting to the specific details of the training set. The features learned are too specific to the training data, and therefore do not generalize to the validation set, which leads to the loss of stiffness.
We ran experiments with randomly permuted training set labels to explore the evolution of stiffness there. In Figure 4, the stiffness of a fully-connected network trained on MNIST with permuted labels is shown. As there are no general features to be learned, the model converges to a stage with no positive between-class stiffness. The reason for the value that the off-diagonal (i.e. between different classes) stiffness converges to is the fact that the optimal response is to give all classes equal probability. Any gradient update based on a particular input will necessarily lead to a preference for one of the classes (the one that was randomly assigned to this data point), which in turn increases loss on other inputs on average.
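One simple way to build such a control dataset is to shuffle the training labels while leaving the inputs untouched. The sketch below is a hypothetical helper (`permute_labels`, with an assumed fixed seed for reproducibility) and not necessarily the exact procedure we used.

```python
import random

def permute_labels(labels, seed=0):
    """Return a randomly permuted copy of the labels; the inputs keep their order."""
    rng = random.Random(seed)
    shuffled = list(labels)  # copy so the original labels are left unchanged
    rng.shuffle(shuffled)
    return shuffled

labels = [0, 0, 1, 1, 2, 2, 3, 3]
permuted = permute_labels(labels)
print(labels)
print(permuted)
```

The class frequencies are preserved, but the association between an image and its label is destroyed.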
In our experiments with CIFAR100 we notice a block-like structure in the stiffness matrix shown in Figure 7. The coarse-grained pattern is suggestive of the network's knowledge of the superclasses (groups of 5 classes), on which the network, however, was not trained. This is due to the similarity between images within a superclass and strengthens the connection between stiffness and generalization.
We investigated the effect of learning rate on stiffness of the functions learned. In particular, we focused on the amount of between-class stiffness that characterizes the generality of the features learned and the transfer of knowledge from one class to another. We used the mean of the off-diagonal terms of the class stiffness matrix as defined earlier.
In order to be able to compare the learned functions fairly for different learning rates, we decided to train until a particular training loss was reached. We then investigated the stiffness properties of the learned function. Our results are presented in Figure 8. We observe that the higher the learning rate, the more stiff the learned function is between different classes, i.e. that higher learning rates bias the models found towards features that benefit several classes at once. We observed this behavior for both MNIST and FASHION MNIST and at all three stopping training losses we investigated.
Our hypothesis is that high learning rates force the model to learn very generalizable, robust features that are not easily broken by the stochasticity provided by the large step size. Those features tend to be useful for multiple classes at once. We speculate that this points towards the regularization role of high learning rate that goes beyond the benefit of having a smaller number of steps until convergence. The concept of stiffness therefore sheds some light on the regularization value of high learning rates.
We investigated stiffness between two inputs as a function of their distance in order to measure how large the patches of the learned function that move together under gradient updates are. This relates to the question of spectral bias of neural networks, however, the connection is not straightforward, as we will discuss later.
We studied distances in the input (pixel) space as well as distances between representations induced by each layer of our neural networks. We primarily focused on the dot product distance, which we defined to be the cosine of the angle between two input/representation vectors. This distance is bounded between $-1$ and $1$ and is therefore easier to compare between layers.
To be able to compare training at different learning rates, we trained until a particular training loss was reached and then analyzed the stiffness properties of the learned function. An example of the distance dependence of stiffness is presented in Figure 9. Note that the dot product distance of 1 corresponds to points being at the same place. The general pattern visible in Figure 9 is that there exists a critical distance within which input data points tend to move together under gradient updates, i.e. have positive stiffness. This holds true for all layers in the network, with the tendency of deeper layers to have smaller stiff domain sizes.
We extracted the first zero-stiffness crossing, such as in Figure 9, and obtained the variation of stiff domain sizes with learning rate. We observed that the characteristic size of stiff regions in the learned function increases with higher learning rates. The stiff region size corresponds to the distance between inputs (in the input pixel space as well as in the representation spaces of the neural network itself) under which a gradient update tends to improve all of them. It characterizes the spatial frequency of the learned function’s response to gradient updates. Our results are presented in Figure 10. Our observations seem connected to recent work on regularization using interpolation between data points in Verma et al. (2018).
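A first zero-stiffness crossing can be read off a binned stiffness-versus-distance curve along the following lines. The linear interpolation between neighbouring bins, the function name `first_zero_crossing` and the example curve are assumptions made for this sketch.

```python
def first_zero_crossing(distances, stiffness):
    """Distance at which stiffness first drops from positive to non-positive.

    `distances` must be ordered from nearby pairs to far-apart pairs (for the
    dot product distance this means starting near 1 and decreasing). The
    crossing is linearly interpolated between the two neighbouring bins.
    Returns None if the curve never crosses zero.
    """
    for k in range(1, len(distances)):
        s_prev, s_next = stiffness[k - 1], stiffness[k]
        if s_prev > 0 and s_next <= 0:
            t = s_prev / (s_prev - s_next)  # fraction of the way to the next bin
            return distances[k - 1] + t * (distances[k] - distances[k - 1])
    return None

# Hypothetical binned curve: mean stiffness per dot product distance bin.
bin_distances = [1.0, 0.8, 0.6, 0.4, 0.2]
bin_stiffness = [0.9, 0.5, 0.1, -0.3, -0.2]
crossing = first_zero_crossing(bin_distances, bin_stiffness)
print(crossing)
```

Repeating this for models trained with different learning rates gives the stiff domain size as a function of learning rate.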
A natural question arises as to whether the characteristic distance between two input points at which stiffness reaches zero defines the typical scale of spatial variation of the learned function. Unfortunately, that is not necessarily the case, though it can be for some families of functions. The stiff domain sizes visible in Figure 9 represent the typical length scale over which neural networks react similarly to gradient inputs, rather than the typical length scale of variation of the function value itself.
To illustrate the difference, imagine a function that varies rapidly over the input data, but whose losses over the same data move in the same direction on application of a gradient step based on any of the data points. This function would have a small characteristic length scale of value variation, yet a large stiff domain size. We believe that these two length scales are likely to be connected, particularly for fully-connected networks, but we have not explored this direction in this paper.
We explored the concept of neural network stiffness and used it to diagnose and characterize generalization. We studied stiffness for models trained on real datasets, and measured its variation with training iteration, class membership, distance between data points, and the choice of learning rate. We focused on the stiffness of data points in the validation set in order to probe generalization and overfitting.
On real data, we explored models trained on MNIST, FASHION MNIST, CIFAR10 and CIFAR100 through the lens of stiffness. In essence, stiffness measures the alignment of gradients taken at different input data points, which we show is equivalent to asking whether a weight update based on one input will benefit the loss on another. We demonstrate the connection between stiffness and generalization and show that with the onset of overfitting to the training data, stiffness on the validation set decreases and eventually reaches 0, where even gradient updates taken with respect to images of a particular class stop benefiting other members of the same class.
Having established the usefulness of stiffness as a diagnostic tool for generalization, we explored its dependence on class membership. We find that in general gradient updates with respect to a member of a class help to improve loss on data points in the same class, i.e. that members of the same class have high stiffness with respect to each other. This holds at initialization as well as throughout most of the training. The pattern breaks when the model starts overfitting to the training set, after which within-class stiffness eventually reaches 0. We observe this behavior with fully-connected and convolutional neural networks on MNIST, FASHION MNIST, CIFAR10, and CIFAR100. Stiffness between inputs from different classes relates to the generality of the features being learned and within-task transfer of improvement from class to class. We find that untrained models do not exhibit consistent stiffness between different classes, and that its amount increases with training. For a model with high enough capacity for the task at hand, we observe positive stiffness between the majority of classes during training. With the onset of overfitting, the stiffness between different classes regresses to 0, as does within-class stiffness.
We experimented with training on data with randomly permuted labels, where no meaningful general patterns can be learned. There, the stiffness between examples disappears as the model trains. This is expected, since for positive stiffness to appear, features that are beneficial for many inputs must develop, which is impossible with randomly permuted labels. This highlights the connection between stiffness and the generality of features learned. Since we measure stiffness on the validation set, we explicitly probe generalization.
We observed that for a model trained on CIFAR100, a block-like structure appears in the class-dependent stiffness matrix. We believe this is related to the similar response of the network to gradient updates by images in the same superclass in the dataset. This is another pointer towards the usefulness of stiffness in diagnosing generalization. Since our model had no access to the superclass (coarse-grained) labels, the structure in the stiffness matrix likely came from general features being learned.
We investigated the effect of learning rate on stiffness and identified a tendency of high learning rates to induce more stiffness into the learned function. We find that for models trained with different learning rates and stopped at an identical training loss, the amount of stiffness between different classes is higher for higher learning rates. This points towards the role of high learning rate in learning more general features that are beneficial for inputs from many classes. Lower learning rates, on the other hand, seem to learn more detailed, class-specific features that do not transfer well to other classes.
We also investigated the characteristic size of stiff regions in our trained networks. By studying stiffness between two validation set inputs and measuring their distance in the input space as well as in the representation spaces induced by the neural network, we were able to show that the size of stiff regions (regions of the data space that move together when a gradient update is applied) increases with increasing learning rate. We therefore find that higher learning rates tend to learn functions whose response to gradient updates varies over larger characteristic length scales. This is in line with our previous observation that the average stiffness is higher for higher learning rates. Both of these observations point towards the regularization effect of learning rate beyond the benefit of a smaller number of steps until convergence.
In future work, we are investigating four lines of inquiry which are suggested by this work.

In this paper, all the experiments were conducted with a fixed architecture. One obvious extension to the concept of stiffness would be to ascertain the role stiffness might play in architecture search. For instance, we expect locality (as in CNNs) to be reflected in higher stiffness. It is quite possible that stiffness could be a guiding parameter for meta-learning and explorations in the space of architectures.

One idea we are pursuing is the use of stiffness to measure the efficacy of a particular ordering of data in the training set. It has been suggested that different permutations of standard NLP datasets behave differently in terms of performance (Schluter & Varab, 2018). We think this could be reflected in the stiffness of the data, which is something we are exploring.

As we noted in the results section, the superclass structure was related to the stiffness value for the CIFAR100 data. To what extent is such hierarchical or relational structure visible from the stiffness value change over time?

We would like to investigate the connection between the characteristic size of variation of the learned function value and how it relates to the typical size of stiff domains we observe in our experiments.
In summary, we defined the concept of stiffness, showed its utility in providing a perspective to better understand generalization characteristics in a neural network and observed its variation with learning rate.
References
 Arora et al. (2018) Arora, S., Cohen, N., and Hazan, E. On the optimization of deep networks: Implicit acceleration by overparameterization. CoRR, abs/1802.06509, 2018. URL http://arxiv.org/abs/1802.06509.
 Arpit et al. (2017) Arpit, D., Jastrzebski, S. K., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A. C., Bengio, Y., and Lacoste-Julien, S. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pp. 233–242, 2017. URL http://proceedings.mlr.press/v70/arpit17a.html.
 Cybenko (1989) Cybenko, G. Approximation by superpositions of a sigmoidal function. MCSS, 2:303–314, 1989.
 Du et al. (2018a) Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 [cs, math, stat], November 2018a. URL http://arxiv.org/abs/1811.03804. arXiv: 1811.03804.
 Du et al. (2018b) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Descent Provably Optimizes Overparameterized Neural Networks. arXiv:1810.02054 [cs, math, stat], October 2018b. URL http://arxiv.org/abs/1810.02054. arXiv: 1810.02054.
 Fort & Scherlis (2018) Fort, S. and Scherlis, A. The goldilocks zone: Towards better understanding of neural network loss landscapes. CoRR, abs/1807.02581, 2018.
 Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Netw., 2(5):359–366, July 1989. ISSN 0893-6080. doi: 10.1016/0893-6080(89)90020-8. URL http://dx.doi.org/10.1016/0893-6080(89)90020-8.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
 LeCun & Cortes (2010) LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Leshno et al. (1993) Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861–867, 1993.
 Li et al. (2018) Li, C., Farkhoor, H., Liu, R., and Yosinski, J. Measuring the intrinsic dimension of objective landscapes. CoRR, abs/1804.08838, 2018. URL http://arxiv.org/abs/1804.08838.
 Montúfar et al. (2014) Montúfar, G., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In NIPS, 2014.
 Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. CoRR, abs/1802.08760, 2018.
 Pennington & Worah (2017) Pennington, J. and Worah, P. Nonlinear random matrix theory for deep learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2637–2646. Curran Associates, Inc., 2017.
 Poole et al. (2016) Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pp. 3360–3368, 2016.
 Raghu et al. (2017) Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2847–2854, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/raghu17a.html.
 Rahaman et al. (2018) Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F. A., Bengio, Y., and Courville, A. On the Spectral Bias of Neural Networks. arXiv e-prints, art. arXiv:1806.08734, June 2018.
 Schluter & Varab (2018) Schluter, N. and Varab, D. When data permutations are pathological: the case of neural natural language inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4935–4939. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1534.
 Schoenholz et al. (2016) Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. CoRR, abs/1611.01232, 2016. URL http://arxiv.org/abs/1611.01232.
 Verma et al. (2018) Verma, V., Lamb, A., Beckham, C., Courville, A., Mitliagkis, I., and Bengio, Y. Manifold mixup: Encouraging meaningful on-manifold interpolation as a regularizer. arXiv e-prints, 1806.05236, June 2018. URL https://arxiv.org/abs/1806.05236.
 Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.