Deep learning generalizes because the parameter-function map is biased towards simple functions


Guillermo Valle Pérez
University of Oxford
guillermo.valle@dtc.ox.ac.uk

Ard A. Louis
University of Oxford
ard.louis@physics.ox.ac.uk

Chico Q. Camargo
University of Oxford
chico.camargo@gmail.com
Abstract

Deep neural networks generalize remarkably well without explicit regularization even in the strongly over-parametrized regime. This success suggests that some form of implicit regularization must be at work. By applying a modified version of the coding theorem from algorithmic information theory and by performing extensive empirical analysis of random neural networks, we argue that the parameter-function map of deep neural networks is exponentially biased towards functions with lower descriptional complexity. We show explicitly for supervised learning of Boolean functions that the intrinsic simplicity bias of deep neural networks means that they generalize significantly better than an unbiased learning algorithm does. The superior generalization due to simplicity bias can be explained using PAC-Bayes theory, which yields useful generalization error bounds for learning Boolean functions with a wide range of complexities. Finally, we provide evidence that deeper neural networks trained on the CIFAR10 data set exhibit stronger simplicity bias than shallow networks do, which may help explain why deeper networks generalize better than shallow ones do.


1 Introduction

Deep learning is a machine learning paradigm based on very large, expressive and composable models, which most often require similarly large data sets to train. The name comes from the main component in the models: deep neural networks, or artificial neural networks with many layers of representation. These models have been remarkably successful in domains ranging from image recognition and synthesis, to natural language processing, and reinforcement learning mnih2015human (); lecun2015deep (); radford2015unsupervised (); schmidhuber2015deep (). There has been work on understanding the expressive power of certain classes of deep networks poggio2017and (), their learning dynamics advani2017high (); liao2017theory (), and generalization properties kawaguchi2017generalization (); poggio2018theory (). However, a full theoretical understanding of many of these properties is still lacking.

These deep neural networks are typically overparametrized, with many more parameters than training examples. The success of these highly expressive models implies two things: 1) some form of inductive bias must be at work to account for their successful generalization, and 2) classical learning theories based on worst-case analyses (worst-case over all functions in the hypothesis class), such as those based on VC dimension, are insufficient to explain generalization in deep learning.

Regarding 1), it was originally thought that regularization methods such as Tikhonov regularization tikhonov1943stability (), dropout srivastava2014dropout (), or early stopping morgan1990generalization () were key in providing this inductive bias. However, Zhang et al. zhang2016understanding () demonstrated that highly-expressive deep neural networks still generalize successfully with no explicit regularization, reopening the question of the origin of the inductive bias. There is now more evidence that unregularized deep networks are biased towards simple functions arpit2017closer (); wu2017towards (). Stochastic gradient descent has been conjectured as a possible cause of the bias soudry2017implicit (); zhangmusings (), but the true origin of the bias is still unknown arpit2017closer ().

The experiments by Zhang et al. zhang2016understanding () also clearly demonstrated point 2), which spurred a wave of new work in learning theories tailored to deep learning kawaguchi2017generalization (); arora2018stronger (); morcos2018importance (); neyshabur2017exploring (); dziugaite2017computing (); neyshabur2017pac (), none of which has yet successfully explained the observed generalization performance.

In this paper, we address the problem of generalization by deep neural networks in the overparametrized regime. We apply insights from Dingle et al. simpbias () who demonstrated empirically that a wide range of input-output maps from science and engineering exhibit simplicity bias, that is, upon random inputs, the maps are exponentially more likely to produce outputs with low descriptional complexity. The authors trace this behavior back to the coding theorem of Solomonoff and Levin ming2014kolmogorov (), a classic result from algorithmic information theory (AIT). By deriving a weaker form of the coding theorem valid for non-Turing universal maps (see also zenil2018coding ()) and by providing practical complexity measures, they overcome some key difficulties in the application of the original AIT coding theorem. The mapping between parameters of a deep neural network and the function that it encodes fulfills the key conditions for simplicity bias (See Appendix LABEL:supp-param-fun-map-complexity). Thus we expect that for a random set of input parameters, the probability that a particular function is encoded by the deep neural network will decrease exponentially with a linear increase in some appropriate measure of the complexity of the function. The main aim of this paper is to empirically test whether or not deep neural networks exhibit this predicted simplicity bias phenomenology, and to determine whether or not this bias explains their generalization performance.

The paper is organized as follows. In Section 2, we provide empirical evidence – for a deep network implementing Boolean functions – that the probability that a randomly chosen parameter set generates a particular function varies over many orders of magnitude. As predicted by simplicity bias, the probability decreases exponentially with the increased complexity of the Boolean function. In Section 3, we show empirically for a standard supervised learning framework that deep neural networks generalize much better when learning simpler target Boolean functions than unbiased learners do, even when both learners achieve zero training set error. In Section 4 we rationalize the link between simplicity bias and generalization by deriving explicit PAC-Bayes bounds which are shown to work well in several experiments. In Section 5 we present empirical results for networks trained on the CIFAR10 database showing that deep networks have a stronger simplicity bias than shallow networks do, which may explain why deep networks tend to generalize better than shallow ones. In the final section we provide a broader context for our results, arguing that simplicity bias provides a key theoretical ingredient for explaining the remarkable generalization properties of deep neural networks.

2 Bias in the parameter-function map

To empirically study bias in the parameter-function map, we consider feedforward neural networks with real-valued parameters, $n$-dimensional Boolean inputs, and a single Boolean output. The advantage of using a system with discrete functions is that it makes it feasible to estimate, by sampling, the probability that each function obtains upon random selection of parameters (in Section 5 we also explore networks with continuous inputs and many more parameters). For more details of our implementation, see Appendix LABEL:supp-methods. With this setup, the parameter-function map $\mathcal{M}$ is defined as:

\[ \mathcal{M}: \Theta \to \mathcal{F}, \qquad \theta \mapsto f_\theta, \]
where $f_\theta$ is the function produced by the network with parameters $\theta \in \Theta$, and $\mathcal{F}$ is the space of functions the architecture can express. We can investigate the structure of this mapping empirically by randomly sampling $\theta$ according to a fixed distribution. We use uniform distributions, with variances fixed as in Xavier initialization glorot2010understanding () (other choices of distribution and variance were also explored; see Appendix LABEL:supp-prob-comp-plots). We define the probability of a function as the fraction of parameter samples which produced that function. With enough samples, this empirical estimate approximates the true probability that a random set of parameters will produce a given function.
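As a concrete illustration of this sampling procedure, the following is a minimal sketch (not the code used for the figures): it draws random parameters with Xavier-style uniform scaling, evaluates the resulting network on all $2^n$ Boolean inputs, and counts how often each function appears. The architecture and the (far too small) sample size are illustrative, and biases are omitted for brevity.

```python
# Minimal sketch of the empirical probability estimate described above (illustrative,
# not the authors' code): sample parameters at random, record which Boolean function
# the network implements, and count how often each function appears.
import numpy as np
from collections import Counter

n = 7                      # input dimension (as for the 7-input network in the text)
layers = [n, 40, 40, 1]    # network shape, e.g. (7, 40, 40, 1)
n_samples = 10_000         # parameter samples; far fewer than needed for good statistics

# All 2^n Boolean inputs, enumerated in numerical order.
inputs = np.array([[(i >> k) & 1 for k in range(n - 1, -1, -1)] for i in range(2 ** n)],
                  dtype=float)

def sample_function(rng):
    """Sample parameters (uniform, Xavier-style scaling) and return the implemented
    function as a binary string of length 2^n."""
    a = inputs
    for l in range(len(layers) - 1):
        fan_in, fan_out = layers[l], layers[l + 1]
        limit = np.sqrt(6.0 / (fan_in + fan_out))       # Xavier/Glorot uniform limit
        W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
        z = a @ W
        # ReLU on hidden layers, step nonlinearity on the output layer
        a = np.maximum(z, 0.0) if l < len(layers) - 2 else (z > 0).astype(float)
    return ''.join(str(int(v)) for v in a[:, 0])

rng = np.random.default_rng(0)
counts = Counter(sample_function(rng) for _ in range(n_samples))
probs = {f: c / n_samples for f, c in counts.items()}   # empirical P(f)
print(sorted(probs.values(), reverse=True)[:10])        # a few functions dominate
```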

In Figure 1(a), we show a typical probability-versus-rank plot for a network with an input layer of 7 nodes, two hidden layers of 40 nodes each, and a single output node. We found empirically that this network can encode almost all possible Boolean functions on 7 inputs: when trained to fit Boolean functions chosen uniformly at random from all possible Boolean functions, it succeeded in perfectly recreating all of them. There are in principle up to $2^{2^7} \approx 3.4 \times 10^{38}$ possible functions. If the functions were all equally likely, their probabilities would be $2^{-2^7} \approx 3 \times 10^{-39}$, so for a sample of the size we use it would be exceedingly unlikely to find the same function more than once. However, as can be clearly seen in Figure 1(a), some functions have probabilities orders of magnitude higher than this naive uniform estimate would suggest. We observe the same behaviour for all network architectures which we tried (see Appendix LABEL:supp-prob-comp-plots).

By measuring the complexity of the functions produced by the neural network, we can uncover a second pattern: there is a bias towards simple functions. This correlation is illustrated in Figure 1(b). The functions can be represented as binary strings of length $2^n = 128$, with each bit corresponding to the output for one input (see Appendix LABEL:supp-methods), to which we apply the Lempel-Ziv (LZ) complexity measure of reference simpbias (), which is based on the LZ-76 algorithm lempel1976complexity (). In Appendix LABEL:supp-prob-comp-plots, we show that the same correlation obtains for other complexity measures, including the entropy of the string, as well as non-string-based measures such as the critical sample ratio of Ref. arpit2017closer (), the generalization complexity of Ref. franco2004generalization (), and a Boolean expression complexity measure based on the length of the shortest Boolean representation of a function. See Appendix LABEL:supp-complexity_measures for a description of these measures.

(a) Probability versus rank of each of the functions (ranked by probability)
(b) Probability versus Lempel-Ziv complexity
Figure 1: Probability versus (a) rank and (b) Lempel-Ziv complexity, estimated from a large sample of parameters, for a network of shape (7, 40, 40, 1). Points with very low frequency are removed for clarity because these suffer from finite-size effects (see Appendix LABEL:supp-finite-size-effects). The parameter-function map is highly biased towards functions with low complexity. See Appendix LABEL:supp-prob-comp-plots for similar plots using other complexity measures.

The shape of the distributions in Figure 7 is similar to those found for a much wider set of input-output maps by Dingle et al. simpbias (). Very briefly, for maps satisfying a number of simple conditions, most notably that the complexity of the map grows slowly with increasing system size, they argue that the probability $P(x)$ that a particular output $x$ obtains upon random sampling of inputs can be bounded by:

\[ P(x) \leq 2^{-(a\tilde{K}(x) + b)} \qquad (1) \]

where $\tilde{K}(x)$ is an approximation to the uncomputable Kolmogorov complexity $K(x)$, and $a$ and $b$ are constants, typically within an order of magnitude of 1, that depend on the map and on the approximation method used for $\tilde{K}$, but not on $x$. This bound is motivated by and similar in spirit to the full AIT coding theorem of Solomonoff and Levin ming2014kolmogorov (), but is easier to apply in practice. Dingle et al. simpbias () also show that $P(x)$ is expected to be close to the upper bound, with high probability, when $x$ is the result of an input sampled uniformly at random. They find that this bound holds remarkably well for a large variety of input-output maps, even for quite small outputs (binary strings of modest length). In Appendix LABEL:supp-param-fun-map-complexity, we justify why the parameter-function map of deep networks has low Kolmogorov complexity relative to the output size, so that these results are applicable.

While these probability bounds for finite strings and computable approximations to Kolmogorov complexity are not yet fully rigorous, the empirical evidence for their generality is strong. We therefore expect simplicity bias to hold for a wide range of deep neural networks, even if for many practical systems there are barriers to calculating Eq. (1) because the parameter spaces are too large to sample, and/or because it is hard to calculate a suitable complexity measure for the functions.

3 Simplicity bias leads to generalization

To empirically explore the effects of bias on generalization, we performed supervised learning experiments on a neural network of the shape used in Section 2, trained on a training set consisting of half of all $2^7$ inputs, for target functions with a range of complexities. The results were compared to those of an unbiased algorithm with the same hypothesis class, defined by picking uniformly at random a Boolean function that fits the training data perfectly. The network was trained using a variant of the SGD algorithm which we call advSGD (see Appendix LABEL:supp-training_algos), as with SGD alone it was difficult to perfectly fit the data for the higher-complexity targets.

As can be seen in Figure 2, for a simple target function the neural network finds solutions which are significantly simpler, and which generalize better, than the functions found by the unbiased algorithm. Every string of length $2^7 = 128$ encodes a different Boolean function. It is a standard result that the vast majority of strings have a complexity close to the maximum (for the LZ measure, see the SI of ref. simpbias ()). Roughly speaking, the probability that the unbiased algorithm finds a function whose complexity is $k$ bits or more below the maximum scales as $2^{-k}$, so the unbiased algorithm almost always finds a high-complexity Boolean function.
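For reference, the unbiased learner baseline can be sketched as follows: it returns a function drawn uniformly at random from all Boolean functions consistent with the training set, which is equivalent to copying the training outputs and filling every unseen output with a fair coin flip. The names and sizes below are illustrative, not the exact experimental setup.

```python
# Sketch of the unbiased learner baseline: pick, uniformly at random, a Boolean function
# consistent with the training set (equivalently, fill every unseen output with a coin flip).
import numpy as np

def unbiased_learner(target, train_idx, rng):
    """target: binary string of length 2^n; train_idx: indices of the training inputs."""
    learned = [rng.integers(0, 2) for _ in range(len(target))]   # random guess everywhere
    for i in train_idx:
        learned[i] = int(target[i])                              # force consistency on the training set
    return ''.join(str(int(b)) for b in learned)

def off_training_error(target, learned, train_idx):
    test_idx = [i for i in range(len(target)) if i not in set(train_idx)]
    return float(np.mean([target[i] != learned[i] for i in test_idx]))

rng = np.random.default_rng(0)
n = 7
target = ''.join(rng.choice(list('01'), size=2 ** n))            # a typically complex random target
train_idx = rng.choice(2 ** n, size=2 ** (n - 1), replace=False) # half of all inputs
learned = unbiased_learner(target, train_idx, rng)
print(off_training_error(target, learned, train_idx))            # close to 0.5 for a random target
```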

We also observe a clear correlation between the complexity of the learned function and the generalization error, for a fixed simple target function (see more examples in Appendix LABEL:supp-error-comp-hists). This correlation strongly suggests that simplicity bias leads to better generalization. The origin of this correlation is discussed at length in Appendix LABEL:supp-error-comp-landscape.

Furthermore, as can be seen in Figure 3(a) (and in Appendix LABEL:supp-error-comp-hists), although the error increases with the complexity of the target function (a correlation between generalization error and target function complexity for neural networks has been observed before, see e.g. franco2006generalization (), and is one of the conclusions of the experiments in zhang2016understanding ()), the network still generalizes significantly better than the unbiased learner, as long as the target function does not have near-maximum complexity.

(a) Target function with lower LZ complexity
(b) Target function with higher LZ complexity
Figure 2: Generalization error versus learned function LZ complexity, over many random initializations and training sets, for a target function with (a) lower complexity and (b) higher complexity. Generalization error is defined with respect to off-training-set samples. The blue circles and blue histograms correspond to the neural network, and the red dots and histograms to an unbiased learner which also fits the training data perfectly. The histograms on the sides of the plots show the frequency of generalization errors and complexities. Overlaid on the red and blue symbols is a black histogram depicting the density of dots (darker is higher density).
(a) Generalization error of learned function
(b) Complexity of learned functions
(c) Number of iterations to perfectly fit training set
(d) Net Euclidean distance traveled in parameter space to fit training set
Figure 3: Different learning metrics versus the LZ complexity of the target function, when learning with the neural network using advSGD, or with the unbiased learner. Dots represent the means, while the shaded envelope corresponds to a piecewise linear interpolation of the standard deviation, over random initializations and training sets. See Appendix LABEL:supp-learning_appendix for results using other complexity measures.

Figures 3(c) and 3(d) show that, in order to perfectly fit the training data (zero training error), both the number of advSGD iterations needed and the distance traveled in parameter space increase with increasing target function complexity. This behaviour is not surprising, as complex functions typically have much smaller regions of parameter space producing them, so more exploration is needed to find them. If, instead of an optimization method like advSGD, the network were trained by simply randomly sampling parameters, the number of iterations would grow exponentially with increasing target function complexity. The fact that the scaling is nearly linear testifies to the ability of the advSGD algorithm to exploit structure in the parameter-function map.

Figure 3(b) shows how the LZ complexity of the learned function depends on the complexity of the target. Overall, the functions that the network learns grow in complexity with the complexity of the target function. One reason is simply that very simple functions are typically incompatible with a training set produced by a complex target function (see Appendix LABEL:supp-complexity_lower_bound for a theoretical bound supporting this claim).

In order to gain some intuition for the complexity dependence of generalization, we observe from Figure 3(b) that the network typically finds functions within a small range of the target complexity. Since there are many fewer simple functions than complex functions, we may expect the effective hypothesis class to be smaller for simpler target functions, which aids generalization. This argument is similar in spirit to that used for Occam algorithms blumer1987occam (); wolpert1994relationship (), where the simplest hypothesis consistent with the training set is always chosen. To make this intuitive argument more precise, first note that for a hypothesis class of all Boolean functions of $n$ inputs, and a training set of size $m$, there are always $2^{2^n - m}$ functions consistent with the training set. Because the number of simple functions will typically be much smaller than $2^{2^n - m}$, for a simple enough target function the functions consistent with the training set will include both simple and complex functions. Because of simplicity bias, the low-complexity functions are much more likely to be found than the high-complexity ones. On the other hand, for a complex target function, the functions consistent with the training set are all of high complexity. Among these, simplicity bias does not have as large an effect because there is a smaller range of probabilities, and the network effectively considers a larger set of potential functions. This difference in effective hypothesis class causes the difference in generalization. The counting is made explicit below, and the intuition is formalized in the next section using PAC-Bayes theory.
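As an illustration of this counting, with the values used in our experiments ($n = 7$ inputs, a training set covering half of the input space, $m = 64$):

\[
\underbrace{2^{2^n}}_{\text{all Boolean functions}} \;\longrightarrow\; \underbrace{2^{2^n - m}}_{\text{functions consistent with the training set}} = 2^{2^7 - 64} = 2^{64} \approx 1.8 \times 10^{19},
\]

of which only a very small fraction are low-complexity functions.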

4 PAC-Bayes generalization error bounds

In order to obtain a more quantitative understanding of the generalization behaviour we observe, we turn to PAC-Bayes theory, an extension of the probably approximately correct (PAC) learning framework. In particular, we use Theorem 1 from the classic work by McAllester mcallester1998some (), which gives a bound on the expected generalization error, when sampling the posterior over concepts. It uses the standard learning theory terminology of concept space for a hypothesis class of Boolean functions (called concepts), and instance for any element of the input space.

Theorem 1.

(PAC-Bayes theorem mcallester1998some ()) For any probability measure $P$ on any concept space and any measure on a space of instances we have, for $0 < \delta \leq 1$, that with probability at least $1 - \delta$ over the choice of a sample of $m$ instances, all measurable subsets $U$ of the concept space, such that every element of $U$ is consistent with the sample and $P(U) > 0$, satisfy the following:

\[ \epsilon(U) \leq \frac{\ln\frac{1}{P(U)} + \ln\frac{1}{\delta} + 2\ln m + 1}{m} \]

where $\epsilon(U) = \frac{1}{P(U)}\sum_{c \in U} P(c)\,\epsilon(c)$, i.e. the expected value of the generalization errors over concepts $c$ in $U$, with probability given by the posterior $P(c)/P(U)$. Here, $\epsilon(c)$ is the generalization error of concept $c$ (the probability of the concept disagreeing with the target concept when sampling inputs).

The results from Section 2 tell us that when the distribution over parameters of a neural network is uniform, the corresponding prior over concepts is highly biased. On the other hand, stochastic gradient descent can be seen as a Markov chain Monte Carlo algorithm which approaches an equilibrium distribution on parameter space that approximates the posterior given by the prior and the likelihood of the data mandt2017stochastic (). Given a flat prior on parameter space, the distribution on the regions of zero error should be flat after equilibration. Therefore, provided that we succeed in empirical risk minimization and that we have equilibrated, the function we obtain will be a sample from a distribution which approximates the posterior $P(c|U)$, and we can interpret $\epsilon(U)$ as the expected value of the generalization error over many runs of SGD for the given training set. Exploring possible non-equilibrium effects, as well as the effect of the choice of training algorithm, is left for future work, but our results appear to be robust against these effects.

We can therefore bound the expected generalization error of standard neural networks trained by versions of SGD (where the expectation is over training sets and runs of SGD), if we know the expected value of $P(U)$ over training sets, given a target function. Qualitatively, we can combine the observation (Figure 3(b)) that the complexity of the learned functions correlates with the complexity of the target function with the correlation between complexity and probability from Section 2. This suggests that simpler target functions will typically have higher $P(U)$, which implies a lower expected generalization error bound, in agreement with the results of our experiments (see Figure 3(a)).

To get a more quantitative result, we can use a sample of learned functions, obtained by training the network on a particular training set, to approximate a lower bound on the value of $P(U)$, approximating $P(c)$ for each of these functions using their complexity as described in Section 2. Then, because $P(U) = \sum_{c \in U} P(c) \geq \sum_{c \in \hat{U}} P(c)$ for any subset $\hat{U} \subseteq U$, we can make the approximation:

\[ P(U) \gtrsim \sum_{c \in \hat{U}} 2^{-(a \tilde{K}(c) + b)}, \]

where $\hat{U}$ is the sample of learned functions.
Because the functions in $\hat{U}$ are most likely to be among the functions in $U$ with the highest $P(c)$, this bound can be reasonably tight if the bias is strong enough. We obtained $a$ and $b$ by fitting an upper bound to the probability versus Lempel-Ziv complexity plot (Fig. 1(b)). We can furthermore repeat the experiment for several training sets and average the resulting generalization error bounds to obtain an expected error bound over training sets. The resulting bounds can be seen in Fig. 4, for a range of target function complexities. The upper bound (black circles) bounds the generalization error of the functions learned by the neural network, and is relatively tight. The network was trained using advSGD; Appendix Fig. LABEL:supp-fig:gen_bound shows qualitatively similar results for SGD. A sketch of how such a bound can be evaluated in practice is given below.
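The following sketch illustrates how such a bound can be computed; the envelope-fitting procedure, the constants $a$ and $b$, and the listed complexities are illustrative assumptions, not the values used for Fig. 4.

```python
# Sketch of the bound evaluation described above (illustrative; the constants a, b and
# the complexity values below are made up, not the fitted/measured values behind Fig. 4).
import numpy as np

def fit_envelope(complexities, probabilities):
    """Fit P(c) <= 2^{-(a*K(c) + b)} as an upper-bound envelope: here, a simple linear fit
    to the maximum log-probability within each complexity bin (one possible choice)."""
    K = np.asarray(complexities, dtype=float)
    logP = np.log2(np.asarray(probabilities, dtype=float))
    top_K, top_logP = [], []
    for kbin in np.unique(np.round(K)):
        mask = np.round(K) == kbin
        top_K.append(kbin)
        top_logP.append(logP[mask].max())
    slope, intercept = np.polyfit(top_K, top_logP, 1)   # logP ~ slope*K + intercept
    return -slope, -intercept                           # a, b such that logP <= -(a*K + b)

def pac_bayes_bound(learned_complexities, a, b, m, delta=0.01):
    """Expected-error bound of Theorem 1, with P(U) approximated from a sample of
    learned functions via P(c) ~ 2^{-(a*K(c) + b)}."""
    log2_PU = np.log2(np.sum(2.0 ** (-(a * np.asarray(learned_complexities) + b))))
    ln_inv_PU = -log2_PU * np.log(2.0)                  # ln(1/P(U))
    return (ln_inv_PU + np.log(1.0 / delta) + 2.0 * np.log(m) + 1.0) / m

# Illustrative usage with hypothetical numbers:
a, b = 0.25, 5.0                    # pretend these came from fit_envelope on Fig. 1(b) data
m = 64                              # training set size (half of the 2^7 inputs)
learned_K = [40.0, 42.5, 45.0]      # LZ complexities of functions found in repeated runs
print(pac_bayes_bound(learned_K, a, b, m))
```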

One disadvantage of the PAC-Bayes theorem above is that it bounds the expectation of the error, while generalization bounds in learning theory typically hold with high probability. To obtain better bounds holding with high probability, we would need to bound the variance of the error, which we leave for future work. We also note that a number of the steps above could be improved, especially the estimates for $P(c)$, but overall, even with these approximations, the bound works well.

Figure 4: Comparison of expected generalization error bounds from PAC-Bayes (black dots) to the measured errors from Fig. 3(a), for the network used in Section 3. For each target function, the network is trained on several random training sets, and for each training set we run advSGD from several random initializations.

5 Deep versus shallow networks

In ref. ba2014deep (), it is demonstrated that shallow neural networks with a sufficient number of parameters can learn to mimic a deeper network and reach comparable accuracy in an image classification task on the CIFAR10 dataset, demonstrating that the shallow network has enough expressivity to solve this task. However, when trained directly on the training data, the deep network finds a function with high generalization accuracy, while the shallow network fails to. We explored whether the parameter-function map could be the cause of this difference, by biasing the deeper network more strongly towards simpler functions. In Figure 5 we see that, measuring complexity using the critical sample ratio (CSR) from arpit2017closer (), upon randomly sampling parameters the deeper network produces simple functions much more often than the shallow network does, demonstrating a stronger simplicity bias in the parameter-function map of the deep network. This could explain why the deep network generalizes better when trained on the CIFAR10 data, even though both networks have enough expressive power.

(a) Histograms of CSRs using the SNN-CNN-MIMIC-30k shallow network with 64 channel convolutional filters used in ba2014deep ()
(b) Histograms of CSRs for the deep convolutional neural network with three convolutional layers with 64 channels each, used in ba2014deep (), and originally in hinton2012improving ()
Figure 5: Normalized histograms of the critical sample ratio (CSR), estimated from a sample of images from the CIFAR10 dataset. These are obtained by randomly sampling network parameters (using a Gaussian distribution with variance as in Xavier initialization); the parameter sample size is the same as the image sample size.

Note that although for the network in Figure 5(a) a complex function is more likely than a simple function, this is still compatible with simplicity bias in the parameter-function map. As there are typically exponentially more complex functions than simple ones, the histogram of complexities for an unbiased parameter-function map would be far more skewed towards complex functions. The results in Figure 5 therefore tell us that the shallow network has significantly less simplicity bias than the deep network, not that it has no simplicity bias at all. In Figure LABEL:supp-fig:prob_comp_layers in Appendix E, we provide a similar analysis for the Boolean learning networks, which also shows that a shallow network is less biased than a deep one.

6 Conclusion and future work

In this work we claim that parameter-function maps for neural networks are strongly biased towards simple functions. We observe this bias empirically for a model Boolean function map, and use the AIT-inspired arguments of simplicity bias simpbias () to argue that this bias should hold more generally. In future work we plan to give further insight into the origin of the bias, and to make some of the results in this paper more rigorous.

Using both empirical calculations for supervised learning of Boolean functions and PAC-Bayes theory, we establish that this inductive bias aids generalization. One could say that neural networks have an in-built Occam's razor. But it is important to remember that the inductive bias of deep neural networks only improves generalization if the bias reflects the problems being addressed in practice. A number of authors, including Schmidhuber schmidhuber1997discovering (), Bengio and LeCun bengio2007scaling (), and Lin et al. lin2017does (), have argued that deep architectures are a good fit to real-world problems. On the other hand, No Free Lunch theorems also imply that improvement in one domain must impair performance in other domains. For example, if one applies neural networks to problems that are highly complex, then simplicity bias may harm generalization (contra Occam), since these solutions will be very hard to find (see also Appendix LABEL:supp-error-comp-hists).

The simplicity bias in the parameter-function map may help rationalize other patterns observed in deep learning. For example, it could shed light on derivative-free methods, which have shown promise in reinforcement learning such2017deep (); mania2018simple (). In fact, because of their similarity to evolutionary processes, analyses similar to those for biased genotype-phenotype maps in evolution schaper2014arrival () could be applicable. Furthermore, in this evolutionary context much is known about the geometry of neutral networks greenbury2016genetic (), which are analogous to zero-loss surfaces in neural networks. Investigating such parallels may generate new insights about learning in the overparameterized regime.

Future work that leads to ways to affect the bias could guide the design of methods with the right bias for particular problems. Finally, an important future direction of work will be to capitalise on these insights for practical applications. For example, estimating target function complexity (perhaps from CSR on a small sample) could be used with the PAC-Bayes predictions to give reliable estimates on the amount of data needed to learn a problem to a desired accuracy.

Acknowledgments

We would like to thank the EPSRC for financial support.

References

  • [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [3] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [4] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  • [5] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519, 2017.
  • [6] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
  • [7] Qianli Liao and Tomaso Poggio. Theory of deep learning ii: Landscape of the empirical risk in deep learning. arXiv preprint arXiv:1703.09833, 2017.
  • [8] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
  • [9] T Poggio, K Kawaguchi, Q Liao, B Miranda, L Rosasco, X Boix, J Hidary, and HN Mhaskar. Theory of deep learning iii: the non-overfitting puzzle. Technical report, CBMM memo 073, 2018.
  • [10] Andrey Nikolayevich Tikhonov. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pages 195–198, 1943.
  • [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [12] Nelson Morgan and Hervé Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in neural information processing systems, pages 630–637, 1990.
  • [13] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • [14] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394, 2017.
  • [15] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
  • [16] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
  • [17] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, and Tomaso Poggio. Musings on deep learning: Properties of sgd. 2017.
  • [18] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
  • [19] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
  • [20] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5949–5958, 2017.
  • [21] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
  • [22] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
  • [23] Kamaludin Dingle, Chico Q Camargo, and Ard A Louis. Input–output maps are strongly biased towards simple outputs. Nature communications, 9(1):761, 2018.
  • [24] LI Ming and Paul MB Vitányi. Kolmogorov complexity and its applications. Algorithms and Complexity, 1:187, 2014.
  • [25] Hector Zenil, Liliana Badillo, Santiago Hernández-Orozco, and Francisco Hernández-Quiroz. Coding-theorem like behaviour and emergence of the universal distribution from resource-bounded algorithmic probability. International Journal of Parallel, Emergent and Distributed Systems, pages 1–21, 2018.
  • [26] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
  • [27] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on information theory, 22(1):75–81, 1976.
  • [28] Leonardo Franco and Martin Anthony. On a generalization complexity measure for boolean functions. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pages 973–978. IEEE, 2004.
  • [29] Leonardo Franco. Generalization ability of boolean functions implemented in feedforward neural networks. Neurocomputing, 70(1):351–361, 2006.
  • [30] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam’s razor. Information processing letters, 24(6):377–380, 1987.
  • [31] David H Wolpert and R Waters. The relationship between pac, the statistical physics framework, the bayesian framework, and the vc framework. Citeseer, 1994.
  • [32] David A McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh annual conference on Computational learning theory, pages 230–234. ACM, 1998.
  • [33] Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
  • [34] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
  • [35] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [36] Jürgen Schmidhuber. Discovering neural nets with low kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
  • [37] Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Large-scale kernel machines, 34(5):1–41, 2007.
  • [38] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
  • [39] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
  • [40] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
  • [41] Steffen Schaper and Ard A Louis. The arrival of the frequent: how bias in genotype-phenotype maps can steer populations to local optima. PloS one, 9(2):e86635, 2014.
  • [42] Sam F Greenbury, Steffen Schaper, Sebastian E Ahnert, and Ard A Louis. Genetic correlations greatly increase mutational robustness and can both reduce and enhance evolvability. PLoS computational biology, 12(3):e1004773, 2016.
  • [43] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • [44] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [45] Peter L Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint, 2017.
  • [46] Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in neural information processing systems, pages 81–90, 1989.
  • [47] Ehud Friedgut. Boolean functions with low average sensitivity depend on few coordinates. Combinatorica, 18(1):27–35, 1998.
  • [48] E Estevez-Rams, R Lora Serrano, B Aragón Fernández, and I Brito Reyes. On the non-randomness of maximum lempel ziv complexity sequences of finite size. Chaos: An Interdisciplinary Journal of Nonlinear Science, 23(2):023118, 2013.
  • [49] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [50] Shizhao Sun, Wei Chen, Liwei Wang, Xiaoguang Liu, and Tie-Yan Liu. On the depth of deep neural networks: A theoretical view. In AAAI, pages 2066–2072, 2016.
  • [51] Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
  • [52] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
  • [53] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
  • [54] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
  • [55] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [56] Tor Lattimore and Marcus Hutter. No free lunch versus occam’s razor in supervised learning. In Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pages 223–235. Springer, 2013.
  • [57] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • [58] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
  • [59] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
  • [60] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
  • [61] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360–3368, 2016.
  • [62] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.
  • [63] Raja Giryes, Guillermo Sapiro, and Alexander M Bronstein. Deep neural networks with random gaussian weights: a universal classification strategy? IEEE Trans. Signal Processing, 64(13):3444–3457, 2016.
  • [64] Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. A correspondence between random neural networks and statistical field theory. arXiv preprint arXiv:1710.06570, 2017.

Appendix A Basic experimental details

The neural networks used in the experiments of this paper are feedforward neural networks with real-valued weights, $n$-dimensional Boolean inputs, and a single Boolean output. We use the ReLU nonlinearity, unless stated otherwise, except for the final layer, which uses a step nonlinearity to produce the Boolean output. Neural network architectures are specified by a tuple of layer widths including input and output, so that, for example, $(7, 40, 40, 1)$ denotes a network with 7-dimensional input, two hidden layers of 40 neurons each, and a 1-dimensional output. The number of parameters (the dimensionality of the parameter vector) is denoted $p$, and the dimension of the input is denoted $n$.

Given the Boolean inputs and outputs, any function implemented by the neural network can be represented uniquely as a binary string by enumerating the inputs in a fixed order (here we use numerical order when the inputs are interpreted as binary numbers) and concatenating the corresponding outputs, producing a binary string of length $2^n$. It is easy to see that there are $2^{2^n}$ Boolean functions for an $n$-dimensional input. A minimal sketch of this representation is given below.
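As an illustration (not the authors' code), the binary-string representation can be computed as follows; `net` stands for any callable mapping a Boolean input tuple to a Boolean output.

```python
# Sketch of the representation described above. `net` is a placeholder for any callable
# implementing the network (Boolean input tuple -> Boolean output).
import itertools

def function_string(net, n):
    bits = []
    for x in itertools.product([0, 1], repeat=n):     # numerical order: 00...0, 00...1, ...
        bits.append('1' if net(x) else '0')
    return ''.join(bits)                              # binary string of length 2^n

# Example with a stand-in "network": the parity function on 3 inputs.
print(function_string(lambda x: sum(x) % 2 == 1, 3))  # '01101001'
```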

This representation allows us to use complexity measures for strings. The definitions of the complexity measures used in this paper are found in section C.

In the learning experiments, unless stated otherwise, we used $n = 7$ and a training set of size $2^{n-1} = 64$, corresponding to half of the input space, sampled uniformly at random without replacement. Unless stated otherwise, generalization error is defined as the fraction of errors outside of the training set (off-training-set error).

A.1 Training algorithms

We used several training algorithms in our experiments, but focused on two. One is plain stochastic gradient descent (SGD) [43]; the other is a variation of SGD similar to the method of adversarial training proposed by Goodfellow et al. [44]. We chose this second method because SGD often did not find a solution with zero training error for all the Boolean functions, even after many thousands of iterations. By contrast, the adversarial method succeeded in almost all cases, at least for the relatively small neural networks which we focus on here.

We call this method adversarial SGD, or advSGD, for short. In SGD, the network is trained using the average loss of a random sample of the training set, called a mini-batch. In advSGD, after every training step, the classification error for each of the training examples in the mini-batch is computed, and a moving average of each of these classification errors is updated. This moving average gives a score for each training example, measuring how “bad” the network has recently been at predicting this example. Before getting the next mini-batch, the scores of all the examples are passed through a softmax to determine the probability that each example is put in the mini-batch. This way, we force the network to focus on the examples it does worst on.

In the experiments we used a fixed mini-batch size and a small range of step sizes.
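The following is a minimal sketch of the mini-batch selection rule described above; the decay rate, temperature, and class structure are illustrative choices, not the exact implementation used in our experiments.

```python
# Sketch of the advSGD mini-batch selection rule described above (illustrative decay
# rate and temperature; the gradient update itself is ordinary SGD and is omitted).
import numpy as np

class AdvSGDSampler:
    def __init__(self, n_train, decay=0.9, temperature=1.0, seed=0):
        self.scores = np.zeros(n_train)   # moving average of each example's recent error
        self.decay = decay
        self.temperature = temperature
        self.rng = np.random.default_rng(seed)

    def next_batch(self, batch_size):
        logits = self.scores / self.temperature
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # softmax over per-example scores
        return self.rng.choice(len(self.scores), size=batch_size, replace=False, p=p)

    def update(self, batch_idx, batch_errors):
        """batch_errors: 0/1 classification error for each example in the mini-batch."""
        self.scores[batch_idx] = (self.decay * self.scores[batch_idx]
                                  + (1.0 - self.decay) * np.asarray(batch_errors))

# Usage inside a training loop:
#   idx = sampler.next_batch(10)
#   ... compute the loss and take an SGD step on the examples in idx ...
#   sampler.update(idx, per_example_classification_errors)
```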

Appendix B Simplicity bias and the parameter-function map

An important argument in the main text of this paper is that the parameter-function map of neural networks should exhibit the basic simplicity bias phenomenology recently described by Dingle et al. in [23]. In this section we briefly describe some key results of reference [23] relevant to this argument.

A computable input-output map $f: I \to O$, mapping inputs from a finite set $I$ to outputs from a finite set $O$, may exhibit simplicity bias if the following restrictions are satisfied [23]. (Here, computable simply means that all inputs lead to outputs; in other words, there is no halting problem. The language of finite input and output sets assumes discrete inputs and outputs, either because they are intrinsically discrete, or because they can be made discrete by a coarse-graining procedure. For the parameter-function maps studied in this paper the set of outputs (the full hypothesis class) is typically naturally discrete, but the inputs are continuous; however, the input parameters can always be discretised without any loss of generality.)

1) Map simplicity: The map should have limited complexity, that is, its Kolmogorov complexity $K(f)$ should asymptotically be negligible compared with the complexity $K(x)$ of typical outputs $x$, as a function of a measure $n$ of the size of the input set (e.g. the sequence length for binary input sequences).

2) Redundancy: There should be many more inputs than outputs ($|I| \gg |O|$), so that the probability $P(x)$ that the map generates output $x$ upon random selection of inputs can in principle vary significantly between outputs.

3) Finite size: $|O| \gg 1$, to avoid potential finite-size effects.

4) Nonlinearity: The map must be a nonlinear function, since linear functions do not exhibit bias.

5) Well behaved: The map should not primarily produce pseudorandom outputs (such as the digits of $\pi$), because the complexity approximators needed for practical applications will mistakenly label these as highly complex.

For the deep learning systems studied in this paper, the inputs of the map are the parameters that fix the weights for the particular neural network architecture chosen, and the outputs are the functions that the system produces. Consider, for example, the networks for Boolean functions studied in the main text. While the output functions rapidly grow in complexity with increasing size of the input layer, the map itself can be described with a low-complexity procedure, since it consists of reading the list of parameters, populating a given neural network architecture, and evaluating the network for all inputs. For reasonable architectures, the information needed to describe the map grows only logarithmically with the input dimension $n$, so for large enough $n$ the amount of information required to describe the map is much less than the information needed to describe a typical function, which requires $2^n$ bits. Thus the Kolmogorov complexity of this map is asymptotically smaller than the typical complexity of the output, as required by the map simplicity condition 1) above.
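Schematically, under the assumption that the evaluation procedure itself has an $O(1)$ description:

\[
K(\text{map}) \;\lesssim\; O(\log n) + O(1) \qquad \text{versus} \qquad K(\text{typical output function}) \;\approx\; 2^{n} \text{ bits},
\]

so for large enough $n$ the map is far simpler than a typical function it produces.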

The redundancy condition 2) depends on the network architecture and discretization. For overparameterized networks, this condition is typically satisfied. In our specific case, where we use floating-point numbers for the parameters (input set $I$) and Boolean functions (output set $O$), this condition is clearly satisfied. Neural networks can represent very large numbers of potential functions (see for example estimates of VC dimension [45, 46]), so that condition 3) is also generally satisfied. Neural network parameter-function maps are evidently nonlinear, satisfying condition 4). Condition 5) is perhaps the least understood condition within simplicity bias. However, the lack of any function with high probability and high complexity (at least when using LZ complexity) provides some empirical validation. This condition also agrees with the expectation that neural networks will not predict the outputs of a good pseudorandom number generator. One of the implicit assumptions in the simplicity bias framework is that, although true Kolmogorov complexity is always uncomputable, approximations based on well-chosen complexity measures perform well for most relevant outputs $x$. Nevertheless, where and when this assumption holds is a deep problem for which further research is needed.

Appendix C Other complexity measures

One of the key steps in the practical application of the simplicity bias framework of Dingle et al. [23] is the identification of a suitable complexity measure which mimics aspects of the (uncomputable) Kolmogorov complexity for the problem being studied. It was shown for the maps in [23] that several different complexity measures all generate the same qualitative simplicity bias behaviour:

\[ P(x) \leq 2^{-(a\tilde{K}(x) + b)}, \qquad (2) \]

but with different values of $a$ and $b$ depending on the complexity measure and, of course, on the map, but independent of the output $x$. Showing that the same qualitative results obtain for different complexity measures is a sign of the robustness of simplicity bias.

Below we list a number of different complexity measures used in this work:

C.1 Complexity measures

Lempel-Ziv complexity (LZ complexity for short). The Boolean functions studied in the main text can be written as binary strings, which makes it possible to use measures of complexity based on finding regularities in binary strings. One of the best is Lempel-Ziv complexity, based on the Lempel-Ziv compression algorithm. It has many nice properties, such as asymptotic optimality and being asymptotically equal to the Kolmogorov complexity for an ergodic source. We use the variation of Lempel-Ziv complexity from [23], which is based on the 1976 Lempel-Ziv algorithm [27]:

\[
K_{LZ}(x) =
\begin{cases}
\log_2(n), & x = 0^n \text{ or } 1^n, \\
\log_2(n)\,\dfrac{N_w(x_1 \dots x_n) + N_w(x_n \dots x_1)}{2}, & \text{otherwise},
\end{cases}
\qquad (3)
\]

where $n$ is the length of the binary string and $N_w(x)$ is the number of words in the Lempel-Ziv "dictionary" when it compresses the string $x$. The symmetrization over the string and its reverse makes the measure more fine-grained, and the $\log_2(n)$ value for the simplest strings ensures that they scale as expected for Kolmogorov complexity. This complexity measure is the primary one used in the main text. A sketch implementation is given below.
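The following sketch implements Eq. (3) using a standard Kaspar-Schuster-style count of LZ-76 words; it is an illustrative implementation and may differ in minor details from the one used to produce the figures.

```python
# Sketch implementation of Eq. (3). lz76_words counts the number of words in the
# LZ-76 parsing of a binary string (Kaspar-Schuster-style counting).
import math

def lz76_words(s):
    i, c, l, k, k_max, n = 0, 1, 1, 1, 1, len(s)
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            if k > k_max:
                k_max = k
            i += 1
            if i == l:          # no match found anywhere in the prefix: start a new word
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

def lz_complexity(s):
    """K_LZ(x) from Eq. (3): log2(n) for the trivial strings 0^n and 1^n, otherwise the
    symmetrized word count of the string and its reverse, scaled by log2(n)."""
    n = len(s)
    if s == '0' * n or s == '1' * n:
        return math.log2(n)
    return math.log2(n) * (lz76_words(s) + lz76_words(s[::-1])) / 2.0

print(lz_complexity('0' * 128), lz_complexity('01' * 64))   # simple strings: low K_LZ
```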

We note that the binary string representation depends on the order in which inputs are listed to construct it, which is not a feature of the function itself. This may affect the LZ complexity, although for simple input orderings, it will typically have a negligible effect.

Entropy. A fundamental, though weak, measure of complexity is the entropy. For a given binary string this is defined as $H(x) = -\frac{n_0}{N}\log_2 \frac{n_0}{N} - \frac{n_1}{N}\log_2 \frac{n_1}{N}$, where $n_0$ is the number of zeros in the string, $n_1$ is the number of ones, and $N = n_0 + n_1$ is its length. This measure is close to its maximum value when the number of ones and zeros is similar, and close to zero when the string is mostly ones or mostly zeros. Entropy and $K_{LZ}$ are compared in Fig. 6, and in more detail in supplementary note 7 (and supplementary information figure 1) of reference [23]. They correlate, in the sense that low entropy implies low $K_{LZ}$, but it is also possible to have large entropy and low $K_{LZ}$, for example for a periodic string such as $0101\dots01$.
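A sketch of this measure (assuming the per-symbol Shannon entropy written above):

```python
# Sketch of the entropy measure above (per-symbol Shannon entropy of a binary string).
import math

def string_entropy(s):
    n, n1 = len(s), s.count('1')
    n0 = n - n1
    if n0 == 0 or n1 == 0:
        return 0.0                       # all-zeros or all-ones strings have zero entropy
    p0, p1 = n0 / n, n1 / n
    return -p0 * math.log2(p0) - p1 * math.log2(p1)

print(string_entropy('01' * 64), string_entropy('0' * 127 + '1'))   # ~1.0 versus ~0.07
```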

Boolean expression complexity. Boolean functions can be compressed by finding simpler ways to represent them. We used the standard SymPy implementation of the Quine-McCluskey algorithm to minimize the Boolean function into a small sum-of-products form, and then defined the number of operations in the resulting Boolean expression as a Boolean complexity measure.

Generalization complexity. L. Franco et al. introduced a complexity measure for Boolean functions designed to capture how difficult the function is to learn and generalize [28]; it was used to show empirically that simple functions generalize better in a neural network [29]. The measure consists of a sum of terms, each measuring the average, over all inputs, of the fraction of neighbours at a given Hamming distance whose output differs. The first term considers neighbours at Hamming distance 1, the second at Hamming distance 2, and so on. The first term is also known (up to a normalization constant) as average sensitivity [47]. The terms in the series have also been called "generalized robustness" in the evolutionary theory literature [42]. Here we use the first two terms, so the measure is:

\[ C(f) = C_1(f) + C_2(f), \qquad C_t(f) = \frac{1}{2^n\,|\mathrm{Nei}_t(x)|}\sum_{x}\sum_{y \in \mathrm{Nei}_t(x)} |f(x) - f(y)|, \]

where $\mathrm{Nei}_t(x)$ is the set of all neighbours of $x$ at Hamming distance $t$. A sketch implementation is given below.
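The following sketch computes the first two terms as written above for a function given by its binary string; the normalization is the natural one and may differ by constants from the definition in [28].

```python
# Sketch of the first two terms of the generalization complexity for a Boolean function
# given as a binary string: the average fraction of Hamming-distance-1 and -2 neighbours
# whose output differs.
import itertools
import numpy as np

def generalization_complexity(bits, n):
    f = np.array([int(b) for b in bits])               # f(x) for x = 0, ..., 2^n - 1
    total = 0.0
    for t in (1, 2):                                    # Hamming distances 1 and 2
        flips = list(itertools.combinations(range(n), t))
        changed = 0
        for x in range(2 ** n):
            for subset in flips:
                y = x
                for k in subset:
                    y ^= 1 << k                         # flip input bit k
                changed += int(f[x] != f[y])
        total += changed / (2 ** n * len(flips))        # average differing-neighbour fraction
    return total

print(generalization_complexity('01101001', 3))         # parity on 3 inputs: 1.0 + 0.0
```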

Critical sample ratio. A measure of the complexity of a function was introduced in [14] to explore the dependence of generalization on complexity. In general, it is defined with respect to a sample of inputs as the fraction of those samples which are critical samples, defined to be inputs such that there is another input within a ball of radius $r$ producing a different output (for discrete outputs). Here, we define it as the fraction of all inputs that have another input at Hamming distance 1 producing a different output, as in the sketch below.
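A sketch of the Hamming-distance version for Boolean functions (the radius-$r$ definition of [14] applies to continuous inputs):

```python
# Sketch of the CSR for Boolean functions: the fraction of inputs with at least one
# Hamming-distance-1 neighbour producing a different output.
import numpy as np

def critical_sample_ratio(bits, n):
    f = np.array([int(b) for b in bits])               # f(x) for x = 0, ..., 2^n - 1
    critical = 0
    for x in range(2 ** n):
        neighbours = (x ^ (1 << k) for k in range(n))  # flip each input bit in turn
        if any(f[y] != f[x] for y in neighbours):
            critical += 1
    return critical / 2 ** n

print(critical_sample_ratio('01101001', 3))   # parity: every input is critical -> 1.0
print(critical_sample_ratio('00000000', 3))   # constant function -> 0.0
```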

C.2 Correlation between complexities

In Fig. 6, we compare the different complexity measures against one another. We also plot the frequency of each complexity; generally more functions are found with higher complexity.

Figure 6: Scatter matrix showing the correlation between the different complexity measures used in this paper. On the diagonal, a histogram (in grey) of frequency versus complexity is depicted. The functions are from the parameter sample used for the network in the main text.

C.3 Probability-complexity plots

In Fig. 7 we show how the probability versus complexity plots look for other complexity measures. The behaviour is similar to that seen for the LZ complexity measure in Fig 1(b) of the main text. In Fig. 8 we show probability versus LZ complexity plots for other choices of parameter distributions.

(a) Probability versus Boolean complexity
(b) Probability versus generalization complexity
(c) Probability versus entropy
(d) Probability versus critical sample ratio
Figure 7: Probability versus different measures of complexity (see main text for Lempel-Ziv), estimated from a large sample of parameters, for a network of shape (7, 40, 40, 1). Points with very low frequency are removed for clarity because these suffer from finite-size effects (see Appendix D). The measures of complexity are described in Appendix C.
(a)
(b)
Figure 8: Probability versus LZ complexity for the network of shape (7, 40, 40, 1) with varying sampling distributions. (a) Weights are sampled from a Gaussian with variance scaled by the input dimension of each layer, as in Xavier initialization. (b) Weights are sampled from a Gaussian with a different choice of variance.

c.4 Error-complexity histograms for different target complexities and different complexity measures

In Figs. 9–11 we compare the generalization performance of the neural network to that of the unbiased learner for different target complexities, and for different complexity measures. All the plots exhibit the same general structure as found for LZ complexity in the main text, namely that the neural network generalizes much better than the unbiased learner for simple target functions, but that this improved performance degrades for more complex functions, where we find preliminary evidence that the neural network performs less well than the unbiased learner (as would be expected from No Free Lunch theorems). However, it is never much worse, so even if the target could be either simple or complex, using a neural network will be hugely beneficial for simple functions and may not hurt too much, compared to an unbiased learner, for highly complex functions (a point also made by Schmidhuber in [36]).

Note also that, not surprisingly, the best generalization is typically for learned functions close in complexity to the target function. Learned functions that are higher or lower in complexity typically generalize less well. This effect is most obvious for intermediate complexities. A naive interpretation of Occam’s razor is that one should always choose the simplest function (hypothesis) that fits the data, but clearly this dictum doesn’t quite work here.

Figure 9: Generalization error versus LZ complexity. Red dots and histograms correspond to the unbiased learner, blue dots and histograms to the neural network described above. The target functions for which the network performed worse than the unbiased learner (one is shown in the middle bottom panel) were Boolean functions sampled uniformly at random from all Boolean functions, and so are expected to have close to maximum complexity.
Figure 10: Generalization error versus generalization complexity. Red dots and histograms correspond to the unbiased learner, blue dots and histograms to the neural network described above.
Figure 11: Generalization error versus entropy. Red dots and histograms correspond to the unbiased learner, blue dots and histograms to the neural network described above.

C.5 Effects of target function complexity on learning for different complexity measures

In the main text (Figure LABEL:main-fig:target_comp_effects), we show the effect of target function LZ complexity on different learning metrics. Here we show the effect of other complexity measures on learning, as well as other complementary results.

The functions in these experiments (including those in the main text) were chosen by randomly sampling network parameters, and so even the highest-complexity ones are probably not fully random. (The fact that non-random strings can have maximum LZ complexity is a consequence of LZ complexity being a less powerful complexity measure than Kolmogorov complexity; see e.g. [48]. The fact that neural networks do well on non-random functions, even if they have maximum LZ complexity, suggests that their simplicity bias captures a notion of complexity stronger than LZ.) In fact, when training the network on truly random functions, we obtain generalization errors equal to or above those of the unbiased learner. This is expected from the No Free Lunch theorem, which says that no algorithm can generalize better (in off-training-set error), uniformly over all functions, than any other algorithm [31].

(a) Generalization error of learned functions
(b) Complexity of learned functions
(c) Number of iterations to perfectly fit training set
(d) Net Euclidean distance traveled in parameter space to fit training set
Figure 12: Different learning metrics versus the generalization complexity of the target function, when learning with a network of shape using advSGD or the unbiased learner. Dots represent the means, while the shaded envelope corresponds to piecewise linear interpolation of the standard deviation, over random initializations and training sets.
(a) Generalization error of learned functions
(b) Complexity of learned functions
(c) Number of iterations to perfectly fit training set
(d) Net Euclidean distance traveled in parameter space to fit training set
Figure 13: Different learning metrics versus the Boolean complexity of the target function, when learning with a network of shape using advSGD or the unbiased learner. Dots represent the means, while the shaded envelope corresponds to piecewise linear interpolation of the standard deviation, over random initializations and training sets.
(a) Generalization error of learned functions
(b) Complexity of learned functions
(c) Number of iterations to perfectly fit training set
(d) Net Euclidean distance traveled in parameter space to fit training set
Figure 14: Different learning metrics versus the entropy of the target function, when learning with a network of shape using advSGD or the unbiased learner. Dots represent the means, while the shaded envelope corresponds to piecewise linear interpolation of the standard deviation, over random initializations and training sets.

c.6 Lempel-Ziv versus Entropy

To check that the correlation between LZ complexity and generalization is not simply due to a correlation with function entropy (which is just a measure of the fraction of inputs mapping to 0 or 1; see Section C), we can see in Figure 11 in Section C.4 that for some target functions with maximum entropy (but which are simple when measured using LZ complexity), the network still generalizes better than the unbiased learner. This shows that the bias towards simpler functions is better captured by more powerful complexity measures than entropy. (LZ is a better approximation to Kolmogorov complexity than entropy [49], but of course LZ can still fail, for example when measuring the complexity of the digits of π.) This is confirmed by the results in Fig. 15, where we fix the target function entropy (to ) and observe that the generalization error still exhibits considerable variation, as well as a positive correlation with complexity.
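The distinction between entropy and LZ complexity can be illustrated with a small script. The LZ78-style phrase count below is only a stand-in for the LZ-based measure used in the paper (which may differ in detail), and the string lengths are illustrative.

```
import math
import random

def entropy(bits):
    """Binary entropy of the fraction of 1s in a bit string."""
    p = bits.count("1") / len(bits)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def lz_phrases(bits):
    """Number of phrases in an LZ78-style incremental parsing (a proxy for LZ complexity)."""
    seen, phrase, count = set(), "", 0
    for ch in bits:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)

structured = "01" * 2048                                    # maximum entropy, highly structured
random_str = "".join(random.choice("01") for _ in range(4096))
print(entropy(structured), lz_phrases(structured))          # entropy 1.0, relatively few phrases
print(entropy(random_str), lz_phrases(random_str))          # entropy near 1.0, many more phrases
```

Both strings have (near-)maximal entropy, but the periodic one parses into far fewer phrases, mirroring the point that entropy misses structure that LZ-type measures can detect.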

(a)
(b)
Figure 15: Generalization error of the learned function versus the complexity of the target function, for target functions with fixed entropy , for a network of shape . Complexity measures are (a) LZ and (b) generalization complexity. Here the training set was of size , but sampled with replacement, and the generalization error is over the whole input space. Note that despite the fixed entropy there is still variation in generalization error, which correlates with the complexity of the function. These figures demonstrate that entropy is a less accurate complexity measure than LZ or generalization complexity for predicting generalization performance.

Appendix D Finite-size effects for sampling probability

Since for a sample of size the minimum estimated probability is , many of the low-probability functions that arise just once in the sample may in fact have a much lower probability than this estimate suggests. See Figure 16 for an illustration of how this finite-size sampling effect manifests with changing sample size . For this reason, these points are typically removed from plots.
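A sketch of the sampling procedure behind these frequency-based probability estimates is given below, to make the finite-size floor explicit. The architecture, activation functions, and sampling scale are illustrative stand-ins and not necessarily those used for the figures.

```
import itertools
from collections import Counter

import numpy as np

n_in, width, n_hidden, n_samples = 7, 40, 2, 10_000   # illustrative choices
rng = np.random.default_rng(0)
X = np.array(list(itertools.product([0.0, 1.0], repeat=n_in)))  # all 2^7 inputs

def sample_function():
    """Boolean function computed by a randomly sampled feedforward network
    (tanh hidden units, thresholded output), returned as a 2^n-bit string."""
    a, d = X, n_in
    for _ in range(n_hidden):
        W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, width))  # weight variance scaled by fan-in
        b = rng.normal(0.0, 1.0, size=width)
        a, d = np.tanh(a @ W + b), width
    out = a @ rng.normal(0.0, 1.0 / np.sqrt(d), size=d) + rng.normal(0.0, 1.0)
    return "".join(str(int(v > 0)) for v in out)

counts = Counter(sample_function() for _ in range(n_samples))
probs = {f: c / n_samples for f, c in counts.items()}
# Any function observed exactly once gets the estimate 1/n_samples, the finite-size
# floor discussed above; its true probability may be far lower.
print(len(probs), max(probs.values()), min(probs.values()))
```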

Figure 16: Probability (calculated from frequency) versus Lempel-Ziv complexity for a neural network of shape , and sample sizes . The lowest-frequency functions for a given sample size can be seen to suffer from finite-size effects, causing their estimated frequency to lie above their true probability.

Appendix E Effect of number of layers on simplicity bias

In Figure 17 we show the effect of the number of layers on the bias (for feedforward neural networks with neurons per layer). We can see that between the layer perceptron and the layer network there is an increased number of higher-complexity functions. This is most likely because of the increased expressivity of the network. For layers and above, the expressivity does not change significantly; instead, we observe a shift of the distribution towards lower complexity, similar to what was observed in Figure LABEL:main-fig:CSR_hists in the main text. We aim to explore the effect of other architectural choices (such as convolutions and skip connections) in future work.

(a) Perceptron
(b) Perceptron
(c) 1 hidden layer
(d) 1 hidden layer
(e) 2 hidden layers
(f) 2 hidden layers
(g) 5 hidden layers
(h) 5 hidden layers
(i) 8 hidden layers
(j) 8 hidden layers
Figure 17: Probability versus LZ complexity for networks with different numbers of layers. Samples are of size , except for the 1-hidden-layer case, where it is . (a) & (b) A perceptron with input neurons (complexity is capped at to aid comparison with the other figures). (c) & (d) A network with 1 hidden layer of 40 neurons. (e) & (f) A network with 2 hidden layers of 40 neurons each. (g) & (h) A network with 5 hidden layers of 40 neurons each. (i) & (j) A network with 8 hidden layers of 40 neurons each.

Appendix F Generalization bounds using SGD

In Figure 18 we show the PAC-Bayes generalization error bound (see Section LABEL:main-pac-bayes in the main text) when training the network using SGD. Note that the range of target function complexities is smaller, because for the most complex functions SGD was not able to perfectly fit the data. The bounds shown here are computed for a single training set (rather than averaged over many training sets), and so show more variance than those in the main text. Furthermore, for each training set we only count runs where SGD perfectly fit the data, which happened for only a small fraction of runs for several target functions. This also makes the bounds looser (as the estimate of is worse).

Figure 18: Generalization error bounds computed for target functions of different frequencies from the estimated probabilities of learned functions (black dots), overlaid on the generalization error of the unbiased learner (red) and the neural network (blue), for a network of shape trained with SGD. Here we train the network on a single randomly chosen training set, with random initializations.

Appendix G Error-complexity landscape

g.1 Bounds in the error-complexity plane

Figure LABEL:main-fig:histograms in the main text (and Appendix C.4) shows that, for a (simple) fixed target function, the complexity of the candidate function correlates with its generalization error. This correlation suggests that for simple target functions, simplicity bias helps the network learn functions with low generalization error.

To provide some theoretical support, we derive here some bounds for the region in the error-complexity plane that contains all possible Boolean functions. When using entropy as the complexity measure, we can obtain a very precise picture. We find a more qualitative but still insightful picture using Kolmogorov complexity.

As we are not considering a training set here, we cannot define an off-training-set generalization error (which is what we use in the experiments). Therefore, we define the error here as the fraction of bits in which a function differs from the fixed target function, which, under uniform sampling of the input space, corresponds to the standard notion of i.i.d. generalization error [31].
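In our notation (with $t$ the target function and $f$ a candidate function on $n$ Boolean inputs), this definition of error can be written as
\[
\epsilon(f) \;=\; \frac{1}{2^{n}}\sum_{x\in\{0,1\}^{n}} \mathbf{1}\!\left[f(x)\neq t(x)\right]
\;=\; \Pr_{x\sim \mathrm{U}(\{0,1\}^{n})}\!\left[f(x)\neq t(x)\right].
\]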

In Subsection G.3, we derive a series of inequalities relating the generalization error and the entropy of a function, which delineate the region where all possible functions lie, for a target function of fixed entropy. As can be seen in Figure 19, the shape of this region forces the entropy and error to correlate. The distribution of functions in this region is very biased towards high entropy. In the figure, we show (red line) the average entropy of the functions that produce a given error value. Within this line, the distribution is itself exponentially concentrated near the error point. This is the fundamental reason why the unbiased learner gives functions with high error (Figure 11).

For the neural network, on the other hand, we see points concentrated at lower entropy and error, and we can even see the ‘V’ shape, predicted by the bounds, around the entropy of the target function (see Figure 11).

(a) For a target function of entropy about
(b) For a target function of entropy about
Figure 19: The blue lines indicate the error-entropy bounds for all possible Boolean functions, given a fixed target function, as derived in Subsection G.3. The region is symmetric under reflection about the line where the error equals . The red line shows where the number of functions exponentially concentrates, for a fixed error, as the size of the input space grows to infinity.

In Subsection G.4, we also derive bounds relating the Kolmogorov complexity of a function and its error, relative to a fixed target function. If we ignore the order one terms (which we can in the asymptotic limit of the input space size going to infinity), we get regions like those shown in Figure 20. The upper boundary of the region is only valid for almost all functions (see Subsection G.4). We can see that the shape of the region also imposes a correlation between complexity and error, for simple target functions.

(a) For a target function of complexity
(b) For a target function of complexity
Figure 20: The blue region contains the error-Kolmogorov complexity pairs for almost all possible functions (for each value of error), given a fixed target function, corresponding to the bounds derived in Subsection G.4.

g.2 What functions does the network find?

Knowing the error-complexity landscape for all possible functions is only half the story. We also want to know which functions are more likely to be produced by a learning algorithm. The results shown in Figure LABEL:main-fig:histograms in the main text give us a good picture of this. In this section, we present a simple model which assumes a bias depending only on entropy, and which captures some of the behaviour observed for neural networks. We also present empirical results showing the complexities of the functions actually found by the network.

In Figure 21 we show the probability that a learning algorithm chooses a function of a certain entropy and error, given a target function of entropy , for two algorithms. Figure 21(b) shows the unbiased learner introduced before, and Figure 21(a) shows a learner for which the probability of picking a particular function (assuming it is consistent with the training set) decreases exponentially with the function's entropy. In Appendix G.5, we derive exact expressions for the probability of choosing a function of a given entropy and generalization error, for both the biased and unbiased learners, which we used to obtain the results in Figure 21.

We can see that, as expected for this simple target function, the unbiased learner will typically find high-entropy functions with high generalization error, while the biased learner is more likely to find simpler functions with lower generalization error, agreeing with the results in Figure LABEL:main-fig:histograms in the main text. Furthermore, the biased algorithm shows a linear correlation between the target function entropy and the learned function entropy, which also agrees with the results for the neural network (Fig. 22(a)).

Although the algorithm with entropy-dependent bias shows behaviour similar to that of the neural network when considering only the entropy of the functions, it does not capture the full bias, as evidenced by the clear outliers in Figure 6(c) in the main text and by results like those in Figure 15. Figure 22(b) shows that for some functions the neural network is able to generalize significantly better than a purely entropy-dependent bias predicts. Despite these differences, the entropy bias offers an interesting toy model for understanding the effects of simplicity bias, and could serve as a starting point for more advanced analysis.

(a)
(b)
Figure 21: Probability of choosing a function with a given error and entropy, for the biased (a) and unbiased (b) algorithms. Note that for the biased learner (plot (a)) high probability is found in the low-error, low-entropy corner, while for the unbiased learner (plot (b)) high probability is found in the high-error, high-entropy corner. The target function has entropy . Note that the checkerboard pattern simply comes from the combinatorics of binary strings disallowing certain error-entropy pairs (see Subsection G.3).
(a)
(b)
Figure 22: Learned function entropy and generalization error versus entropy of target function. Dots are means, and lines correspond to standard deviations. Orange dots and lines are for a neural network of shape trained with advSGD, averaged over random initializations and training sets. The blue dots and error bars are calculated for the learner with exponential entropy bias.

g.3 Bounds and number of functions in the error-entropy plane

Here we represent Boolean functions (through their tables of outputs) as strings of bits, of length . The entropy of a bit string with s and s is defined in Appendix C. Consider a target function with s and s, and another function which differs from the target function in bits, and which has s and s. Furthermore, call the number of bits in which it differs from the target, and for which the target bit was , , and similarly for . We have . It is clear that the numbers of s and s satisfy

(4)
(5)

The following inequalities also hold

which translate to these inequalities for :

These inequalities on can easily be translated into inequalities on the entropy, noting that the entropy is a monotonically increasing function of for and monotonically decreasing for . These inequalities, where the error is normalized and defined to be , define the region shown in Figure 19.
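For reference, writing $p$ for the fraction of 1s in the string (our notation), the entropy used here is the standard binary entropy
\[
H_{b}(p) \;=\; -p\log_{2}p \;-\; (1-p)\log_{2}(1-p),
\]
which is monotonically increasing on $[0,\tfrac{1}{2}]$, monotonically decreasing on $[\tfrac{1}{2},1]$, and maximal at $H_{b}(\tfrac{1}{2})=1$.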

The other quantity of interest, used in Section G.2, is , the number of functions with a given error and entropy , for a target with entropy . We can just work out the number , for a given and , for a target with s, knowing that there are two values of producing a given value of . Using Eq. 5,

(6)

where , , and . We can use the asymptotic form of the binomial coefficients to get

(7)

where is the binomial entropy function. If we write , , and allow to vary continuously between and , we can look for a maximum of the number of functions, for fixed , by taking the derivative of the expression in Eq. 7 with respect to and setting it to . Doing so gives . Now , so that is just the (normalized) error. This point gives a value of , and therefore and , where exponentially concentrates, for a fixed . Finally, for points in this line, we have

(8)

so we can see that the number of functions also exponentially concentrates at the point , where is maximum (and equal to ).
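The asymptotic form of the binomial coefficient used in this derivation is the standard one: in our notation, for $k=\alpha N$ with $\alpha\in(0,1)$ fixed,
\[
\binom{N}{k} \;=\; 2^{\,N H_{b}(k/N)\,\bigl(1+o(1)\bigr)} \qquad \text{as } N\to\infty,
\]
with $H_{b}$ the binary entropy defined above.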

g.4 Bounds for the error-Kolmogorov complexity plane

Consider a target function and another function , both over inputs (so represented as binary strings of length ), which differ from each other in bits. Then we can describe by giving and specifying the locations of the differing bits and their values, which gives the following bound on the Kolmogorov complexity of :

(9)

We have a similar bound with and swapped, so that we have

(10)

On the other hand, giving a and a which differ in bits describes a binary string of length with bits; furthermore, there is a one-to-one correspondence between and this binary string, for a given . So if the string is , we know . However, from a simple counting argument [24], we know that most such binary strings have complexity close to , so that for most , given fixed and , we have

(11)

Ignoring the order-one terms gives the bounds in Figure 20.
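The counting argument invoked above is the standard one: since there are fewer than $2^{c+1}$ binary programs of length at most $c$, at most that many strings can have complexity at most $c$,
\[
\bigl|\{\,x\in\{0,1\}^{*} : K(x)\le c\,\}\bigr| \;\le\; 2^{c+1}-1 ,
\]
so all but an exponentially small fraction of the strings of a given length have complexity close to that length.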

g.5 Probability of choosing a function of given error and entropy

To calculate the probability of choosing any function with a particular error and entropy (Section G.2), we first calculate the probability of choosing a particular function which differs from the target in bits (which we call the error) and has entropy .

We assume that the input space (of cardinality ) is sampled uniformly at random without replacement, with a sample of size . Therefore, every particular sequence of instances (we refer to elements of the input space as instances) has the same probability . The probability that our sample contains a particular set of instances, in any order, is then .

Consider a particular function with error . For it to be consistent with the sample, the sample must contain only instances within the set of instances on which the function agrees with the target. There are such sets of size , so the probability that the sample is consistent with this particular function is .

With any such sample, there are functions consistent with it (as we are assuming the hypothesis class to be all possible Boolean functions over the input space). The unbiased learner chooses any of these with equal probability , and so the probability that it chooses the particular function is

Because the events corresponding to choosing particular functions are mutually exclusive, the probability of choosing any function of error and entropy for the unbiased learner is simply the sum of the probabilities of all such functions:

(12)

where is the number of functions with error and entropy , for target function entropy , given by Eq. 6. Note that we take the convention that when or .
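The counts entering this expression can be tabulated exhaustively for a tiny input space. The sketch below (with our own helper names, and an arbitrary illustrative target on 3 inputs) enumerates all $2^{8}$ Boolean functions and records how many fall at each (normalized error, entropy) pair relative to the target, which is the quantity described around Eq. 6.

```
import itertools
import math
from collections import Counter

def bit_entropy(bits):
    """Base-2 entropy of the fraction of 1s in a tuple of bits."""
    p = sum(bits) / len(bits)
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

m = 8                                    # 2^3 output bits, so 2^8 = 256 functions in total
target = (0, 0, 0, 0, 1, 1, 1, 1)        # arbitrary illustrative target with entropy 1
counts = Counter()
for f in itertools.product((0, 1), repeat=m):
    error = sum(a != b for a, b in zip(f, target)) / m
    counts[(error, round(bit_entropy(f), 3))] += 1

# counts[(E, S)] is the number of functions at normalized error E and entropy S.
for key in sorted(counts)[:8]:
    print(key, counts[key])
```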

For a learner with bias (we absorb the factor in the definition of for simplicity), the probability that it chooses a particular function given a particular training set of size is no longer . Instead it is given by

where is the entropy of function , and is the set of all functions compatible with the training set. We want to compute the quantity in the denominator. Call the number of instances in the training set where the target (and ) equals . Then there are functions in with s, and so the numerator is:

(13)

where is the entropy of a bit string with s.

What is the probability of a training set with a particular value of ? We can use the result from Section G.3 that the number of differing bits which are equal to , , is fixed by the entropy of the target and the entropy of , from Eq. 5. The number of training sets of size consistent with , with a particular value of , is then , and the probability of each of them is as before. Putting everything together, the probability of obtaining any function of error and entropy , for a target function with entropy , and training set size , is

(14)

where is given by Eq. 6, is given by Eq. 13, (from Eq. 5, where and are the numbers of s corresponding to entropies and , respectively), , and is the number of s corresponding to entropy .

Finally, note that because the number of instances in the training set is fixed to (as we sample without replacement), the off-training set generalization error for functions consistent with the training set is just .

g.6 Lower bound on the complexity of functions compatible with the training set

We first prove the following theorem, which shows that, with high probability over training sets (when the sampling distribution over the input space is uniform), one cannot find very simple functions that are consistent with the training set yet have a large error with respect to the target.

Theorem 2.

Let be the size of the input space. If, for a fraction greater than of possible training sets of size drawn from a fixed target function, there exists some function consistent with the training set, with error with respect to the target, and with complexity , then

Proof.

The set of all functions with complexity has cardinality at most . There are training sets with unique elements. We assume that for of these, there is a function which is consistent with it. We call the number of training sets with which a particular is consistent. The mean of is at least , and so the maximum is also at least this number (as the maximum is always at least the mean). By assumption, the corresponding to differs from the target in bits. Therefore, the training sets consistent with it must avoid these bits. There are such training sets, and so . Combining this with the previous inequality gives , and taking the logarithm of both sides gives the desired result. ∎

If , and , then the bound is approximately , which is what one gets from a simple PAC bound with a hypothesis class of size . Theorem 2 is essentially a rephrasing of the PAC bound for training sets with distinct elements (that is, sampling without replacement).
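For comparison, the simple PAC bound referred to here is the standard realizable-case bound for a finite hypothesis class $\mathcal{H}$ (our notation): with probability at least $1-\delta$ over an i.i.d. sample of $m$ instances, every $h\in\mathcal{H}$ consistent with the sample satisfies
\[
\epsilon(h) \;\le\; \frac{\ln|\mathcal{H}| + \ln(1/\delta)}{m}.
\]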

If we combine the bound from Theorem 2 with those in Figure 20, we get a lower bound on the complexity of the simplest function compatible with the training set, which holds with high probability. This bound depends on the complexity of the target, because of the dependence in Eq. 10. The result is plotted in Fig. 23. However, the bound is not very tight when compared with the results in Figure LABEL:main-fig:LZ_LZ in the main text, so a better theoretical understanding of the behaviour seen in Figure LABEL:main-fig:LZ_LZ is still needed.

Figure 23: Lower bound on the complexity of the simplest function compatible with the training set, versus the complexity of the target function. This is found by intersecting the bound from Theorem 2 with the bound from Eq. 10.

g.7 PAC-Bayes argument for error-complexity correlation

Theorem 3.

(Preliminary PAC-Bayes theorem [32]) For any probability distribution assigning non-zero probability to any concept in a countable concept class containing a target concept , and any probability distribution on instances, we have, for any , that with probability at least over the selection of a sample of instances, the following holds for all concepts agreeing with on that sample:

where is the generalization error of the learned concept .
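In a standard formulation of this result (our notation; the exact statement in [32] may differ in constants and logarithm base), the conclusion reads: with probability at least $1-\delta$ over the sample of $m$ instances, every concept $c$ consistent with the sample satisfies
\[
\epsilon(c) \;\le\; \frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta}}{m}.
\]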

We can use the generalization error bound in Preliminary Theorem 1 from [32], which depends on the probability given to a concept in the hypothesis class. In [23], Dingle et al. give a lower bound that holds with high probability, and provide empirical evidence for it using computable approximations to . We can use this lower bound for to obtain a bound based on the Kolmogorov complexity of the learned concept, which would hold with high probability over parameter space. This gives

(15)

If the term can be ignored relative to the other terms, and the bound is reasonably tight, this predicts a better generalization error bound for simpler learned concepts (with high probability over samples and parameter space), which agrees with the results in Figure LABEL:main-fig:histograms in the main text. Note that the results in Section LABEL:main-generalization in the main text indicate that the claim that simple functions generalize better only holds for simple target functions, while the bound in Eq. 15 depends directly only on the learned function, not on the target function. This is related to the fact that PAC bounds only hold with high probability, and that finding a simple function fitting a sample from a complex target function is very unlikely. These types of subtleties are discussed at length in [31]. This is one reason why we present the analysis in Section LABEL:main-generalization in the main text: to clarify the need for the target function to be simple, which is not as transparent in the PAC-Bayes formalism.

Appendix H Related work

The topic of generalization in neural networks has been extensively studied both in theory and experiment, and the literature is vast. Theoretical approaches to generalization include classical notions like VC dimension [46, 45] and Rademacher complexity [50], but also more modern concepts such as robustness [51] and compression [18], as well as studies on the relation between generalization and properties of stochastic gradient descent (SGD) algorithms [17, 16, 6].

Empirical studies have also pushed the boundaries proposed by theory. In particular, recent work by Zhang et al. [13] shows that while deep neural networks are expressive enough to fit randomly labeled data, they can still generalize for data with structure, with the generalization error correlating with the amount of randomization in the labels. A similar result was found much earlier in experiments with smaller neural networks [29], where the authors defined a complexity measure for Boolean functions, called generalization complexity (see Appendix C), which appears to correlate well with the generalization error.

Inspired by the results of Zhang et al. [13], Arpit et al. [14] propose that the data dependence of generalization for neural networks can be explained by the fact that they tend to prioritize learning simple patterns first. The authors show some experimental evidence supporting this hypothesis, and suggest that SGD might be the origin of this implicit regularization. This argument is inspired by the fact that SGD converges to minimum-norm solutions for linear models [52], but only suggestive empirical results are available for the case of nonlinear models, so the question remains open [16]. Wu et al. [15] argue that full-batch gradient descent also generalizes well, suggesting that SGD is not the main cause behind generalization. It may be that SGD provides some form of implicit regularization, but here we argue that the exponential bias towards simplicity of the parameter-function map is so strong that it is likely the main origin of the implicit regularization.

The idea of a bias towards simple patterns has a long history, going back to the philosophical principle of Occam's razor, but it has been formalized much more recently in several ways in learning theory. For instance, the concepts of minimum description length (MDL) [53], Blumer algorithms [30, 31], and universal induction [24] all rely on a bias towards simple hypotheses. Interestingly, these approaches go hand in hand with non-uniform learnability, an area of learning theory that aims to predict data-dependent generalization. For example, MDL tends to be analyzed using structural risk minimization or the related PAC-Bayes approach [54, 55].

Hutter et al. [56] have shown that the generalization error grows with the target function complexity for a perfect Occam algorithm which uses Kolmogorov complexity to choose between hypotheses (here, by a ‘perfect Occam algorithm’ we mean an algorithm which returns the simplest hypothesis consistent with the training data, as measured using some complexity measure, such as Kolmogorov complexity). Schmidhuber applied variants of universal induction to learn neural networks [36]. The simplicity bias from Dingle et al. [23] arises from a simpler version of the coding theorem of Solomonoff and Levin [24]. More theoretical work is needed to make these connections rigorous, but it may be that neural networks intrinsically approximate universal induction because the parameter-function map results in a prior which approximates the universal distribution.

Another popular approach to explaining generalization is based on the idea of flat minima [57, 15]. In [58], Hochreiter and Schmidhuber argue that flatness could be linked to generalization via the MDL principle. Several experiments also suggest that flatness correlates with generalization. However, it has also been pointed out that flatness is not enough to understand generalization, as sharp minima can also generalize [59]. We show in Section LABEL:main-bias in the main text that simple functions have much larger regions of parameter space producing them, so that they likely give rise to flat minima, even though the same function might also be produced by other, sharp regions of parameter space.

Other papers discussing properties of the parameter-function map in neural networks include Montufar et al. [60], who suggested that studying the size of the regions of parameter space producing functions of a given complexity (measured by the number of linear regions) would be interesting, but left it for future work. In [61], Poole et al. briefly examine the sensitivity of the parameter-function map to small perturbations. In spite of these previous works, there is clearly still much scope to study the properties of the parameter-function map for neural networks.

Finally, our work follows a growing line of work exploring random neural networks [62, 63, 61, 64] as a way to understand fundamental properties of neural networks that are robust to other choices such as initialization, objective function, and training algorithm.
