Size-free generalization bounds
for convolutional neural networks
We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments with CIFAR-10 and a scaled-down variant, along with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps.
Recently, substantial progress has been made regarding theoretical analysis of the generalization of deep learning models (see Zhang et al., 2016; Dziugaite and Roy, 2017; Bartlett et al., 2017; Neyshabur et al., 2017, 2018; Arora et al., 2018; Neyshabur et al., 2019). One interesting point that has been explored, with roots in (Bartlett, 1998), is that even if there are many parameters, the set of models computable using weights with small magnitude is limited enough to provide leverage for induction (Bartlett et al., 2017; Neyshabur et al., 2018). Intuitively, if the weights start small, since the most popular training algorithms make small, incremental updates that get smaller as the training accuracy improves, there is a tendency for these algorithms to produce small weights. (For some deeper theoretical exploration of implicit bias in deep learning and related settings, see (Gunasekar et al., 2017, 2018a, 2018b; Ma et al., 2018).) Even more recently, authors have proved generalization bounds in terms of the distance from the initial setting of the weights instead of the size of the weights (Bartlett et al., 2017; Neyshabur et al., 2019). This is important because small initial weights may promote vanishing gradients; it is advisable instead to choose initial weights that maintain a strong but non-exploding signal as computation flows through the network (see LeCun et al., 2012; Glorot and Bengio, 2010; Saxe et al., 2013; He et al., 2015). A number of recent theoretical analyses have shown that, for a large network initialized in this way, a large variety of well-behaved functions can be found through training by traveling a short distance in parameter space (see Du et al., 2019b, a; Allen-Zhu et al., 2019). Thus, the distance from initialization may be expected to be significantly smaller than the magnitude of the weights. Furthermore, there is theoretical reason to expect that, as the number of parameters increases, the distance from initialization decreases.
Convolutional layers are used in all competitive deep neural network architectures applied to image processing tasks. The most influential generalization analyses in terms of distance from initialization have thus far concentrated on networks with fully connected layers. Since a convolutional layer has an alternative representation as a fully connected layer, these analyses apply in the case of convolutional networks, but, intuitively, the weight-tying employed in the convolutional layer constrains the set of functions computed by the layer. This additional restriction should be expected to aid generalization.
In this paper, we prove new generalization bounds for convolutional networks that take account of this effect. As in earlier analyses, our bounds are in terms of the distance from the initial weights, and the number of parameters. Additionally, they are “size-free”, in the sense that they are independent of the number of pixels in the input, or the height and width of the hidden feature maps.
As is often the case for generalization analyses, the central technical lemmas are bounds on covering numbers. Borrowing a technique due to Barron et al. (1999), these are proved by bounding the Lipschitz constant of the mapping from the parameters to the loss of the functions computed by the networks. (Our proof also borrows ideas from the analysis of the fully connected case, especially (Bartlett et al., 2017; Neyshabur et al., 2018).) Covering bounds may be applied to obtain a huge variety of generalization bounds. We present two examples for each covering bound. One is a standard bound on the difference between training and test error. Perhaps the more relevant bound has the flavor of “relative error”; it is especially strong when the training loss is small, as is often the case in modern practice. Our covering bounds are polynomial in the inverse of the granularity of the cover. Such bounds seem to be especially useful for bounding the relative error.
In particular, our covering bounds are of the form , where is the granularity of the cover, is proportional to the Lipschitz constant of a mapping from parameters to functions, and is the number of parameters in the model. We apply a bound from the empirical process literature in terms of covering bounds of this form due to Giné and Guillou (2001), who paid particular attention to the dependence of estimation error on . This bound may be helpful for other analyses of the generalization of deep learning in terms of different notions of distance from initialization.
Related work. Du et al. (2018) proved size-free bounds for CNNs in terms of the number of parameters, for two-layer networks. Arora et al. (2018) analyzed the generalization of networks output by a compression scheme applied to CNNs. Zhou and Feng (2018) provided a generalization guarantee for CNNs satisfying a constraint on the rank of matrices formed from their kernels. Li et al. (2018) analyzed the generalization of CNNs under other constraints on the parameters. Lee and Raginsky (2018) provided a size-free bound for CNNs in a general unsupervised learning framework that includes PCA and codebook learning.
Notation. If is the kernel of convolutional layer number , then refers to its operator matrix and denotes the vectorization of the kernel tensor . (Convolution is a linear operator and can thus be written as a matrix-vector product; the operator matrix of kernel refers to the matrix that describes convolving the input with kernel . For details, see Sedghi et al., 2018.) For matrix , denotes the operator norm of . For vectors, represents the Euclidean norm, and is the norm. For a multiset of elements of some set , and a function from to , let . We will denote the function parameterized by by .
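To make the notion of an operator matrix concrete, the following is a minimal sketch rather than the construction used in the paper. It assumes a single channel, a square odd-sized kernel, and zero-padding that preserves the spatial size; the helper name conv_operator_matrix is hypothetical. It builds the matrix explicitly, checks that multiplying by it reproduces the convolution, and checks numerically that its operator norm is at most the L1 norm of the vectorized kernel.

```python
import numpy as np

def conv_operator_matrix(kernel, n):
    """Matrix of a single-channel, zero-padded k x k convolution on n x n inputs.

    Hypothetical helper for illustration; assumes k is odd so the kernel is centered.
    """
    k = kernel.shape[0]
    r = k // 2
    M = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    p, q = i + a - r, j + b - r
                    # Entries that would read outside the image hit the zero padding.
                    if 0 <= p < n and 0 <= q < n:
                        M[i * n + j, p * n + q] = kernel[a, b]
    return M

rng = np.random.default_rng(0)
n, k = 8, 3
kernel = rng.normal(size=(k, k))
x = rng.normal(size=(n, n))

M = conv_operator_matrix(kernel, n)

# Direct computation of the same zero-padded convolution, for comparison.
xp = np.pad(x, k // 2)
y_direct = sum(kernel[a, b] * xp[a:a + n, b:b + n]
               for a in range(k) for b in range(k))

op_norm = np.linalg.norm(M, 2)        # largest singular value of the operator matrix
l1_norm = np.abs(kernel).sum()        # L1 norm of the vectorized kernel
print(op_norm, l1_norm)
```

The explicit matrix has as many rows and columns as the image has pixels, while the kernel has only nine entries, which is the weight-tying that the analysis exploits.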
2 Bounds for a basic setting
In this section, we provide a bound for a clean and simple setting.
2.1 The setting and the bounds
In the basic setting, the input and all hidden layers have the same number of channels. Each input satisfies .
We consider a deep convolutional network, whose convolutional layers use zero-padding (see Goodfellow et al., 2016). Each layer but the last consists of a convolution followed by an activation function that is applied componentwise. The activations are -Lipschitz and nonexpansive (examples include ReLU and tanh). The kernels of the convolutional layers are for . Let be the -tensor obtained by concatenating the kernels for the various layers. Vector represents the last layer; the weights in the last layer are fixed with . Let be the total number of trainable parameters in the network.
We let take arbitrary fixed values (interpreted as the initial values of the kernels) subject to the constraint that, for all layers , . (This is often the goal of initialization schemes.) Let be the corresponding tensor. We provide a generalization bound in terms of distance from initialization, along with other natural parameters of the problem. The distance is measured with .
For , define to be the set of kernel tensors within distance of , and define to be the set of functions computed by CNNs with kernels in . That is, .
Let be a loss function such that is -Lipschitz for all . An example is the -margin loss.
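As an illustration of a Lipschitz loss of this kind, here is a minimal sketch of a margin (ramp) loss. It assumes binary labels in {-1, +1} and a real-valued score, which is a simplification of the setting of the paper; the function name margin_loss and the parameter gamma are hypothetical. The loss is 1/gamma-Lipschitz in the score and upper-bounds the zero-one loss.

```python
import numpy as np

def margin_loss(score, label, gamma):
    """Ramp loss: 1 when label * score <= 0, 0 when label * score >= gamma, linear between.

    Illustrative assumption: labels are in {-1, +1} and gamma > 0 is the margin.
    """
    return np.clip(1.0 - label * score / gamma, 0.0, 1.0)

gamma = 0.5
scores = np.linspace(-2.0, 2.0, 4001)
losses = margin_loss(scores, 1.0, gamma)

# Empirical Lipschitz constant with respect to the score (finite differences).
max_slope = np.abs(np.diff(losses) / np.diff(scores)).max()

# Zero-one loss for the positive label: an error whenever the score is not positive.
zero_one = (scores <= 0).astype(float)
print(max_slope, 1.0 / gamma)
```

A smaller margin gives a tighter surrogate for the misclassification rate but a larger Lipschitz constant, which is the trade-off that enters bounds stated in terms of the Lipschitz constant of the loss.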
For a function from to , let .
We will use to denote a set of random training examples where each .
Theorem 2.1 (Basic bounds).
For any , there is a such that for any , , , for any joint probability distribution over , if a training set of examples is drawn independently at random from , then, with probability at least , for all ,
If Theorem 2.1 is applied with the margin loss, then is in turn an upper bound on the probability of misclassification on test data. Using the algorithm from (Sedghi et al., 2018), may be efficiently computed. Since (Sedghi et al., 2018), Theorem 2.1 yields the same bounds as a corollary if the definition of is replaced with the analogous definition using .
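The efficient computation mentioned above can be sketched as follows. This follows the FFT-based approach described by Sedghi et al. (2018), which treats the convolution as circular (wrap-around) rather than zero-padded, so it matches the zero-padded operator only up to edge effects. The function names, the kernel layout (k, k, c_in, c_out), and the convention of summing per-layer operator norms of kernel differences to obtain a distance from initialization are assumptions made for this illustration.

```python
import numpy as np

def conv_singular_values(kernel, input_shape):
    """Singular values of a circular multi-channel convolution.

    kernel has shape (k, k, c_in, c_out); input_shape is the spatial size (n, n).
    """
    # One c_in x c_out matrix per spatial frequency.
    transforms = np.fft.fft2(kernel, s=input_shape, axes=(0, 1))
    return np.linalg.svd(transforms, compute_uv=False)

def conv_operator_norm(kernel, input_shape):
    return conv_singular_values(kernel, input_shape).max()

rng = np.random.default_rng(0)
n, k, c, num_layers = 16, 3, 4, 3

# A kernel that copies each channel to itself: every singular value equals 1.
identity_kernel = np.zeros((k, k, c, c))
identity_kernel[k // 2, k // 2] = np.eye(c)
identity_svals = conv_singular_values(identity_kernel, (n, n))

# Hypothetical initial and trained kernels for a few layers.
init = [identity_kernel.copy() for _ in range(num_layers)]
trained = [K0 + 0.01 * rng.normal(size=K0.shape) for K0 in init]

# The operator matrix is linear in the kernel, so the operator norm of the
# difference of operators is the operator norm of the difference of kernels.
spectral_distance = sum(conv_operator_norm(K - K0, (n, n))
                        for K, K0 in zip(trained, init))
l1_distance = sum(np.abs(K - K0).sum() for K, K0 in zip(trained, init))
print(spectral_distance, l1_distance)
```

The cost is dominated by the FFT over the spatial dimensions and one small SVD per frequency, so the explicit operator matrix never needs to be formed.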
For , a norm over is full if its unit ball has positive volume.
For , a set of functions with a common domain , we say that is -Lipschitz parameterized if there is a full norm on and a mapping from the unit ball w.r.t. in to such that, for all and such that and , and all ,
The following lemma is essentially known. Its proof, which uses standard techniques (see Pollard, 1984; Talagrand, 1994, 1996; Barron et al., 1999; Giné and Guillou, 2001; Mohri et al., 2018), is in Appendix A.
Suppose a set of functions from a common domain to is -Lipschitz parameterized for and .
Then, for any , there is a such that, for all large enough , for any , for any probability distribution over , if is obtained by sampling times independently from , then, with probability at least , for all ,
2.3 Proof of Theorem 2.1
We will prove Theorem 2.1 by showing that is -Lipschitz parameterized. This will be achieved through a series of lemmas.
Choose and a layer . Suppose satisfies for all . Then, for all examples ,
For each layer , let .
Since is -Lipschitz w.r.t. its first argument, we have that so it suffices to bound . Let be the function from the inputs to the whole network with parameters to the inputs to the convolution in layer , and let be the function from the output of this convolution to the output of the whole network, so that . Choose an input to the network, and let . Recalling that , and the non-linearities are nonexpansive, we have Since the non-linearities are 1-Lipschitz, and, recalling that for , we have
where the last inequality uses the fact that for all and for all .
Now , and the latter is maximized over the nonnegative ’s subject to when each of them is . Since , this completes the proof. ∎
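The maximization step at the end of this proof is an instance of the arithmetic-geometric mean inequality: a product of factors of the form c + b_i, with nonnegative b_i summing to a fixed budget, is largest when the budget is split equally. The following numerical check is only a sketch; the names c, budget, and num_layers and their values are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, c, budget = 6, 1.1, 3.0

# Product when the budget is split equally across layers.
equal_split = np.prod(c + np.full(num_layers, budget / num_layers))

# Products for many random nonnegative splits with the same total budget.
splits = rng.dirichlet(np.ones(num_layers), size=1000) * budget
random_products = np.prod(c + splits, axis=1)

print(random_products.max(), equal_split)
```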
Now we prove a bound when all the layers can change between and .
For any , for any input to the network,
Consider transforming to by replacing one layer of at a time with the corresponding layer in . Applying Lemma 2.6 to bound the distance traversed with each replacement and combining this with the triangle inequality gives
Now we are ready to prove our basic bound.
Proof (of Theorem 2.1).
Since, for any kernel tensors and , (Sedghi et al., 2018), the unit ball w.r.t. contains the unit ball w.r.t. , and thus is a full norm.
2.4 A comparison
Since a convolutional network has an alternative parameterization as a fully connected network, the bounds of (Bartlett et al., 2017) have consequences for convolutional networks. To compare our bound with this, first, note that Theorem 2.1, together with standard model selection techniques, yields a high-probability bound on proportional to
where, for a matrix , One can get an idea of how this bound relates to (1) by comparing the bounds in a simple concrete case. Suppose that each of the convolutional layers of the network parameterized by computes the identity function, and that is obtained from by adding to each entry. In this case, disregarding edge effects, for all , and (as proved in Appendix C). Also, We get additional simplification if we set . In this case, (2) gives a constant times
whereas (1) gives a constant times
In this scenario, the new bound is independent of , and grows more slowly with , and . Note that (and, typically, it is much less).
3 A more general bound
In this section, we generalize Theorem 2.1.
3.1 The setting
The more general setting concerns a neural network where the input is a tensor whose flattening has Euclidean norm at most , and the network’s output is a -dimensional vector, which may be logits for predicting a one-hot encoding of an -class classification problem.
The network consists of convolutional layers followed by fully connected layers. The th convolutional layer includes a convolution, with kernel , followed by a componentwise non-linearity and an optional pooling operation. We assume that the non-linearity and any pooling operations are -Lipschitz and nonexpansive. Let be the matrix of weights for the th fully connected layer. Let be all of the parameters of the network. Let .
We assume that, for all , is -Lipschitz for all and that for all and .
An example includes a -tensor and .
We let take arbitrary fixed values subject to the constraint that, for all convolutional layers , , and for all fully connected layers , . Let .
For and , define
For , define to be the set of functions computed by CNNs as described in this subsection with parameters within -distance of . Let be the set of their parameterizations.
Theorem 3.1 (General Bound).
For any , there is a constant such that the following holds. For any such that , for any , for any joint probability distribution over such that, with probability 1, satisfies , under the assumptions of this section, if a training set of examples is drawn independently at random from , then, with probability at least , for all ,
3.2 Proof of Theorem 3.1
We will prove Theorem 3.1 by using to witness the fact that is -Lipschitz parameterized.
Choose and a convolutional layer . Suppose that for all convolutional layers and for all fully connected layers . Then, for all examples ,
Choose and a fully connected layer . Suppose that for all convolutional layers and for all fully connected layers . Then, for all examples ,
Now we prove a bound when all the layers can change between and .
For any , for any input ,
Consider transforming to by replacing one layer at a time of with the corresponding layer in . Applying Lemma 3.2 to bound the distance traversed with each replacement of a convolutional layer, and Lemma 3.3 to bound the distance traversed with each replacement of a fully connected layer, and combining this with the triangle inequality gives the lemma. ∎
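The layer-by-layer replacement argument can be checked numerically. The sketch below is a simplified analogue rather than the setting of the theorem: it uses only fully connected layers with ReLU activations, random weights, and hypothetical names. It forms the sequence of hybrid networks, bounds the change caused by each single-layer swap by the input norm times the operator norm of the layer difference times the operator norms of the other layers, and combines the steps with the triangle inequality.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, x):
    """Apply each layer; ReLU follows every layer except the last."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

def spectral_norm(W):
    return np.linalg.norm(W, 2)

rng = np.random.default_rng(0)
depth, width = 5, 16
init = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]
trained = [W + 0.05 * rng.normal(size=W.shape) for W in init]
x = rng.normal(size=width)

# Hybrid j uses the trained weights in its first j layers and the initial weights after.
hybrids = [trained[:j] + init[j:] for j in range(depth + 1)]
outputs = [forward(h, x) for h in hybrids]

step_changes = [np.linalg.norm(outputs[j + 1] - outputs[j]) for j in range(depth)]
total_change = np.linalg.norm(outputs[-1] - outputs[0])

# Per-layer cap on the operator norm that holds for both weight settings.
caps = [max(spectral_norm(W0), spectral_norm(W1)) for W0, W1 in zip(init, trained)]
step_bounds = [
    np.linalg.norm(x)
    * spectral_norm(trained[j] - init[j])
    * np.prod(caps[:j] + caps[j + 1:])
    for j in range(depth)
]
print(total_change, sum(step_changes), sum(step_bounds))
```

The bound on each step relies on ReLU being 1-Lipschitz and mapping zero to zero, which is what allows the norm of a hidden activation to be controlled by the product of the operator norms of the preceding layers.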
Now we are ready to prove our more general bound.
Proof (of Theorem 3.1).
First, let us prove that is full. Note that, for any and , we have
but this is just the vector norm applied to flattenings of and . Since this norm is full, and its unit ball is contained in the unit ball for , the latter is also full.
3.3 Another comparison
Theorem 3.1 applies in the case that there are no convolutional layers, i.e. for a fully connected network. In this subsection, we compare its bound in this case with the bound of (Bartlett et al., 2017). Because the bounds are in terms of different quantities, we compare them in a simple concrete case. In this case, for , each hidden layer has components, and there are classes. For all , and , where is a Hadamard matrix (using the Sylvester construction), and . Then, dropping the superscripts, each layer has
We trained a 10-layer all-convolutional model on the CIFAR-10 dataset. The architecture was similar to VGG (Simonyan and Zisserman, 2014). The network was trained with dropout regularization and an exponential learning rate schedule. We define the generalization gap as the difference between train error and test error. In order to analyze the effect of the number of network parameters on the generalization gap, we scaled up the number of channels in each layer, while keeping other elements of the architecture, including the depth, fixed. Each network was trained repeatedly, sweeping over different values of the initial learning rate and batch sizes . For each setting the results were averaged over five different random initializations. Figure 1 shows the generalization gap for different values of . As in the bound of Theorem 3.1, the generalization gap increases with . Figure 2 shows that as the network becomes more over-parametrized, the generalization gap remains almost flat with increasing . This is expected given the role of over-parametrization in generalization (Neyshabur et al., 2019). An explanation of this phenomenon that is consistent with the bound presented here is that increasing leads to a decrease in the value of ; see Figure (a)a. The fluctuations in Figure (a)a are partly due to the fact that training neural networks is not a stable process. We provide the medians for different values of in Figure (b)b.
We thank Peter Bartlett for a valuable conversation.
While Theorem 3.1 applies to practical architectures, we restricted the scope of that theorem to remove clutter from the analysis. For example, we assumed that the activation functions and pooling operators are -Lipschitz, but it is easy to see how to slightly modify our analysis to handle larger Lipschitz constants.
Many variants of our bounds are possible. For example, our arguments use the fact that the Lipschitz constants of subnetworks are bounded above by . Of course, a bound that is in terms of these Lipschitz constants, along with the other parameters of the problem, would be tighter, while more complicated. We have chosen to present relatively simple and interpretable bounds.
- Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. ICML, 2019.
- Arora et al. [2018] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
- Barron et al. [1999] Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3):301–413, 1999.
- Bartlett [1998] Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
- Bartlett et al. [2017] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
- Du et al. [2018] Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Ruslan R Salakhutdinov, and Aarti Singh. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pages 373–383, 2018.
- Du et al. [2019a] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. ICML, 2019a.
- Du et al. [2019b] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. ICLR, 2019b.
- Dziugaite and Roy [2017] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. UAI, 2017.
- Giné and Guillou [2001] Evarist Giné and Armelle Guillou. On consistency of kernel density estimators for randomly censored data: rates holding uniformly over adaptive intervals. In Annales de l’IHP Probabilités et statistiques, volume 37, pages 503–522, 2001.
- Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
- Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
- Gunasekar et al. [2017] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
- Gunasekar et al. [2018a] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.
- Gunasekar et al. [2018b] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018b.
- Haussler [1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
- He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- Kolmogorov and Tikhomirov [1959] A. N. Kolmogorov and V. M. Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.
- LeCun et al. [2012] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
- Lee and Raginsky [2018] Jaeho Lee and Maxim Raginsky. Learning finite-dimensional coding schemes with nonlinear reconstruction maps. arXiv preprint arXiv:1812.09658, 2018.
- Li et al. [2018] Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond. arXiv preprint arXiv:1806.05159, 2018.
- Ma et al. [2018] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In ICML, pages 3351–3360, 2018.
- Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018.
- Neyshabur et al. [2017] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
- Neyshabur et al. [2018] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. ICLR, 2018.
- Neyshabur et al. [2019] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. ICLR, 2019.
- Pollard [1984] D. Pollard. Convergence of Stochastic Processes. Springer Verlag, Berlin, 1984.
- Pollard [1990] D. Pollard. Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Math. Stat. and Am. Stat. Assoc., 1990.
- Saxe et al. [2013] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Sedghi et al. [2018] Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. arXiv preprint arXiv:1805.10408, 2018.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Talagrand [1994] M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76, 1994.
- Talagrand [1996] Michel Talagrand. New concentration inequalities in product spaces. Inventiones mathematicae, 126(3):505–563, 1996.
- Vapnik [1982] V. N. Vapnik. Estimation of Dependencies based on Empirical Data. Springer Verlag, 1982.
- Vapnik and Chervonenkis [1971] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
- Zhang et al. [2016] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Zhou and Feng [2018] Pan Zhou and Jiashi Feng. Understanding generalization and optimization performance of deep CNNs. In ICML, pages 5955–5964, 2018.
Appendix A Proof of Lemma 2.4
If is a metric space and , we say that is an -cover of with respect to if every has a such that . Then denotes the size of the smallest -cover of w.r.t. .
For a domain , define a metric on pairs of functions from to by .
We need two lemmas in terms of these covering numbers. The first is by now a standard bound from Vapnik-Chervonenkis theory [Vapnik and Chervonenkis, 1971, Vapnik, 1982, Pollard, 1984]. For example, it is a direct consequence of [Haussler, 1992, Theorem 3].
For any , there is a constant depending only on such that the following holds. Let be an arbitrary set of functions from a common domain to . If there are constants and such that, for all , then there is an absolute constant such that, for all large enough , for any , for any probability distribution over , if is obtained by sampling times independently from , then, with probability at least , for all ,
We will also use the following, which is the combination of (2.5) and (2.7) of [Giné and Guillou, 2001].
Let be an arbitrary set of functions from a common domain to . If there are constants and such that for all , then there is an absolute constant such that, for all large enough , for any , for any probability distribution over , if is obtained by sampling times independently from , then, with probability at least , for all ,
So now we want a bound on for Lipschitz-parameterized classes. For this, we need the notion of a packing which we now define.
For any metric space and any , let be the size of the largest subset of whose members are pairwise at a distance greater than w.r.t. .
Lemma A.6 ([Kolmogorov and Tikhomirov, 1959]).
For any metric space , any , and any , we have
We will also need a lemma about covering a ball by smaller balls. This is probably also already known, and uses a standard proof [see Pollard, 1990, Lemma 4.1], but we haven’t found a reference for it.
be an integer,
be a full norm
be the metric induced by , and
A ball in of radius w.r.t. can be covered by balls of radius .
We may assume without loss of generality that . Let be the volume of the unit ball w.r.t. in . Then the volume of any -ball with respect to is . Let be the ball of radius in . The -balls centered at the members of any -packing of are disjoint. Since these centers are contained in , the balls are contained in a ball of radius . Thus
Solving for and applying Lemma A.6 completes the proof. ∎
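The volume argument can be illustrated numerically. The sketch below is not part of the proof: it works in the Euclidean plane with hypothetical values for the radius (kappa) and the granularity (eps), and it only examines a finite random sample of the ball. It greedily selects points that are pairwise more than eps apart, confirms that the selected points form an eps-cover of the sample, and compares their number with the count (1 + 2 kappa / eps)^d that follows from comparing volumes of disjoint balls of radius eps / 2 with a ball of radius kappa + eps / 2.

```python
import numpy as np

def greedy_packing(points, eps):
    """Scan the points once, keeping each one that is more than eps from all kept points."""
    centers = []
    for p in points:
        if not centers or np.linalg.norm(np.array(centers) - p, axis=1).min() > eps:
            centers.append(p)
    return np.array(centers)

rng = np.random.default_rng(0)
d, kappa, eps = 2, 1.0, 0.25

# Sample points from the Euclidean ball of radius kappa by rejection from a cube.
pts = rng.uniform(-kappa, kappa, size=(2000, d))
pts = pts[np.linalg.norm(pts, axis=1) <= kappa]

centers = greedy_packing(pts, eps)

# Distance from every sample point to its nearest selected center.
dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
worst_gap = dists.min(axis=1).max()

volume_bound = (1 + 2 * kappa / eps) ** d
print(len(centers), volume_bound, worst_gap)
```

A point rejected by the greedy scan lies within eps of an earlier center, which is the same reasoning that turns a maximal packing into a cover.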
Appendix B Proof of (1)
For , and for each , let . Taking a union bound over an application of Theorem 2.1 for each value of , with probability at least , for all , and all
For any , if we apply these bounds in the case of the least such that , we get
and simplifying completes the proof.
Appendix C The operator norm of
Let . Since , it suffices to find .
For the rest of this section, we number indices from , let , and define . To facilitate the application of matrix notation, pad the tensor out with zeros to make a tensor .
The following lemma is an immediate consequence of Theorem 6 of Sedghi et al. [2018].
Lemma C.1 (Sedghi et al. [2018]).
Let be the complex matrix defined by .