Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network

Taiji Suzuki
Graduate School of Information Science and Technology, The University of Tokyo
Center for Advanced Intelligence Project, RIKEN
taiji@mist.i.u-tokyo.ac.jp
Abstract

One of the biggest issues in deep learning theory is the generalization ability of networks with huge model size. Classical learning theory suggests that overparameterized models cause overfitting; however, large deep models used in practice avoid overfitting, which is not well explained by classical approaches. Several attempts have been made to resolve this issue, and among them, the compression based bound is one of the most promising. However, a compression based bound can be applied only to the compressed network; it is not applicable to the non-compressed original network. In this paper, we give a unified framework that converts compression based bounds into bounds for the non-compressed original network. The resulting bound even achieves a better rate than the one for the compressed network by improving the bias term. By establishing this unified framework, we obtain a data dependent generalization error bound which gives a tighter evaluation than data independent ones.

1 Introduction

Deep learning has shown quite successful results in a wide range of machine learning applications, such as image recognition (Krizhevsky et al., 2012), natural language processing (Devlin et al., 2018), and image synthesis (Radford et al., 2015). The success of deep learning methods is mainly due to their flexibility, expressive power, and computational efficiency for training on large datasets. Because of their significance in a wide range of application areas, theoretical analysis of these methods is also becoming increasingly important. For example, it is known that deep neural networks have universal approximation capability (Cybenko, 1989; Hornik, 1991; Sonoda and Murata, 2015) and that their expressive power grows exponentially with the number of layers (Montufar et al., 2014; Bianchini and Scarselli, 2014; Cohen et al., 2016; Cohen and Shashua, 2016; Poole et al., 2016; Suzuki, 2019). However, theoretical understanding is still lacking on several important issues.

Among the topics of deep learning theory, generalization error analysis is one of the biggest issues in the machine learning literature. An important property of deep learning is that it generalizes well even though its parameter size is quite large compared with the sample size (Neyshabur et al., 2019). This cannot be well explained by classical learning theory, which suggests that overparameterized models cause overfitting and thus result in poor generalization.

To explain this phenomenon, norm based bounds have been extensively studied (Neyshabur et al., 2015; Bartlett et al., 2017b; Neyshabur et al., 2017; Golowich et al., 2018). These bounds are attractive because they do not explicitly depend on the number of parameters and are therefore useful for explaining the generalization of overparameterized networks (Neyshabur et al., 2019). However, they typically depend exponentially on the number of layers and thus tend to be loose for deep networks. Indeed, Arora et al. (2018) reported that a simple VC-dimension bound (Li et al., 2018; Harvey et al., 2017) can still give sharper evaluations than these norm based bounds for some practically used deep networks. Wei and Ma (2019) improved this issue by incorporating a data dependent Lipschitz constant, as done in Arora et al. (2018); Nagarajan and Kolter (2019).

On the other hand, the compression based bound is another promising approach for tight generalization error evaluation that can avoid the exponential dependence on the depth. The complexity of a deep neural network model is regulated in several ways. For example, we usually impose explicit regularization such as weight decay (Krogh and Hertz, 1992), dropout (Srivastava et al., 2014; Wager et al., 2013), batch normalization (Ioffe and Szegedy, 2015), and mixup (Zhang et al., 2018; Verma et al., 2018). Zhang et al. (2016) reported that such explicit regularization does not have much effect, whereas the implicit regularization induced by SGD (Hardt et al., 2016; Gunasekar et al., 2018; Ji and Telgarsky, 2019) is important. Through these explicit and implicit regularizations, deep learning tends to produce a simpler model than its full expressive capacity would allow (Valle-Perez et al., 2019; Verma et al., 2018). To measure how "simple" the trained model is, one of the most promising approaches currently investigated is the compression bound (Arora et al., 2018; Baykal et al., 2019; Suzuki et al., 2018). These bounds measure how much the network can be compressed and characterize the size of the compressed network as an implicit effective dimensionality. Arora et al. (2018) characterized the implicit dimensionality based on a so-called layer cushion quantity and suggested performing random projection to obtain a compressed network. In a similar direction, Baykal et al. (2019) proposed a pruning scheme called CoreNet and derived a bound on the size of the compressed network. Suzuki et al. (2018) developed a spectrum based bound for their compression scheme. Unfortunately, all of these bounds guarantee the generalization error only of the compressed network, not of the original network. Hence, they do not precisely explain why a large network can avoid overfitting.

In this paper, we derive a unified framework to obtain a compression based bound for a non-compressed network. Unlike existing studies, our bound is valid for the original network before compression, and thus gives a direct explanation of why deep learning generalizes despite its large network size. The difficulty in applying a compression bound to the original network lies in evaluating the population L2-distance between the compressed network and the original network. A naive evaluation results in the VC-bound, which is not preferable. We overcome this difficulty by developing a novel data dependent capacity control technique using local Rademacher complexity bounds (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné and Koltchinskii, 2006). The bound is then applied to some typical situations in which the network is well compressed. Our analysis builds on the implicit bias hypothesis (Gunasekar et al., 2018; Ji and Telgarsky, 2019), which claims that deep learning tends to produce rather simple models. Indeed, Gunasekar et al. (2018); Ji and Telgarsky (2019) showed that gradient descent results in (near) low rank parameter matrices in each layer in linear network settings. Martin and Mahoney (2018) evaluated the eigenvalue decay of the weight matrices through random matrix theory and several numerical experiments. These observations are also supported by the flat minimum analysis (Hochreiter and Schmidhuber, 1997; Wu et al., 2017; Langford and Caruana, 2002): the product of the eigenvalues of the Hessian around the SGD solution tends to be small, which means that SGD converges to a flat minimum and possesses stability against small perturbations, leading to good generalization. Based on these observations, we make use of the eigenvalue decay of the weight matrix and of the covariance matrix among the nodes in each layer. The eigenvalue decay speed characterizes the redundancy in each layer and is thus directly relevant to compressibility. Our contributions in this paper are summarized as follows:

  • We give a unified framework to obtain a compression based bound for the non-compressed network, which properly explains that a compressible network can generalize well. The framework can convert several existing compression based bounds into bounds for the non-compressed network in a unified manner. The bound is applied to near low rank models as concrete examples.

  • We develop a data dependent capacity control technique to bound the discrepancy between the original network and the compressed network. As a result, we obtain a sharp generalization error bound which is even better than that of the compressed network. All derived bounds are characterized by data dependent quantities.

Authors                                   Bound type                         Original
Neyshabur et al. (2015)                   Norm based                         Yes
Bartlett et al. (2017b)                   Norm based                         Yes
Wei and Ma (2019)                         Norm based                         Yes
Neyshabur et al. (2017)                   Norm based                         Yes
Golowich et al. (2018)                    Norm based                         Yes
Li et al. (2018); Harvey et al. (2017)    VC-dim.                            Yes
Arora et al. (2018)                       Compression                        No
Suzuki et al. (2018)                      Compression                        No
Ours (Thm. 1)                             General                            Yes
Ours (Cor. 1)                             Low rank weight                    Yes
Ours (Thm. 4)                             Low rank weight & low rank cov.    Yes

Table 1: Comparison of existing generalization error bounds with ours. The rates of these bounds are stated in terms of the Frobenius norm, the operator norm, and other matrix norms of the weight matrices, the depth, the maximum width, and the sample size, together with the Rademacher complexity and the local Rademacher complexity, a Lipschitz constant between layers, the eigenvalue drop rate of the weight matrices and that of the covariance matrices among the nodes in each internal layer, and the bias induced by compression. "Original" indicates whether the bound applies to the original (non-compressed) network.

Other related work

Recently, the role of over-parameterization in two layer networks has been extensively studied (Neyshabur et al., 2019; Arora et al., 2019). These analyses are for shallow networks, and the generalization error is essentially given by norm based bounds. It is not obvious that these bounds also yield sharp bounds for deep models.

The PAC-Bayes bound has also been applied to obtain a non-vacuous compression based bound (Zhou et al., 2019). However, that bound is still for the compressed (quantized) model, and it is not obvious whether it can be converted into a bound for the original network.

The relation between compression and learnability was traditionally studied in different frameworks, as in Littlestone and Warmuth (1986) and in the minimum description length principle (Hinton and Van Camp, 1993). Our bound shares the same spirit as these studies but gives a new analysis by incorporating recent observations from deep learning research.

2 Preliminaries: Problem formulation and notations

In this section, we give the problem setting and the notation used in the theoretical analysis. We consider the standard supervised learning formulation in which each data point consists of an input and an output (or label). We consider a single output setting, i.e., the output is a 1-dimensional real value, but it is straightforward to generalize the results to the multiple output case. Suppose that we are given i.i.d. observations drawn from a joint probability distribution over input–output pairs; the marginal distributions of the input and the output are defined accordingly. To measure the performance of a trained function, we use a loss function and define the training error as the average loss over the observed sample and the expected error as its population counterpart.

Here, the expectation is taken with respect to the data distribution. Basically, we are interested in the generalization error of an estimator obtained from the training data. We also use the empirical L2-norm of a function, defined on the observed inputs, and the population L2-norm, defined with respect to the input distribution.
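For concreteness, these quantities can be written as follows; the notation (the sample, the loss, and the norms) is chosen here only for illustration and is not necessarily the notation used elsewhere in the paper:
\[
\widehat{L}(f) := \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f(x_i)\big),
\qquad
L(f) := \mathbb{E}_{(X,Y)\sim P}\big[\ell\big(Y, f(X)\big)\big],
\]
\[
\|f\|_{n} := \Big(\frac{1}{n}\sum_{i=1}^n f(x_i)^2\Big)^{1/2},
\qquad
\|f\|_{L_2(P_X)} := \big(\mathbb{E}_{X\sim P_X}[f(X)^2]\big)^{1/2},
\]
where $(x_i, y_i)_{i=1}^n$ is the observed sample, $\ell$ is the loss function, and $P$ (with input marginal $P_X$) is the data distribution.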

This paper deals with deep neural networks as the model. The activation function is assumed to be 1-Lipschitz, as is satisfied by the ReLU (Assumption 1). Let the depth of the network and the width of each internal layer be given, where the width of the input layer equals the input dimension and that of the output layer equals the output dimension by convention. Then, we consider the set of networks with this depth and these widths whose weight matrices obey norm constraints, as described below,

where the constraints are stated in terms of the operator norm (the maximum singular value) and the Frobenius norm of each weight matrix, the Euclidean norm is used for vectors throughout, and the output of the last layer is passed through a "clipping" operator that truncates its value to a fixed range. The reason why we put the clipping operator on top of the last layer is that it restricts the magnitude of the output by a constant, so that we can avoid an unrealistically loose generalization error. Note that the clipping operator does not change the classification error for binary classification. We use the term "full model" to refer to the whole class of such networks for given norm bounds. Here, we implicitly suppose that the operator norm bound is close to 1 so that the norm of the output of the internal layers is not amplified too much, while the Frobenius norm bound could be moderately large.
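A minimal sketch of this model class in explicit notation is as follows; the weight matrices $W^{(\ell)}$, the activation $\eta$, the norm bounds $R_{\mathrm{op}}, R_{\mathrm{F}}$, and the clipping level $B$ are symbols introduced here for illustration, and bias terms are omitted for brevity:
\[
\mathcal{F} := \Big\{ f(x) = \mathrm{clip}_{B}\Big( W^{(L)}\, \eta\big( W^{(L-1)} \cdots \eta( W^{(1)} x ) \cdots \big) \Big) \;:\; \|W^{(\ell)}\|_{\mathrm{op}} \le R_{\mathrm{op}},\ \ \|W^{(\ell)}\|_{\mathrm{F}} \le R_{\mathrm{F}}\ (\ell = 1, \dots, L) \Big\},
\]
where $\mathrm{clip}_{B}(u) := \min\{\max\{u, -B\}, B\}$.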

The Rademacher complexity is a typical tool to evaluate the generalization error over a function class; it is defined through the supremum over the class of the empirical correlation between the function values and an i.i.d. Rademacher (random sign) sequence. This is also called the conditional Rademacher complexity because the expectation is taken conditioned on the fixed inputs; taking the further expectation with respect to the inputs gives the (expected) Rademacher complexity. Roughly speaking, the Rademacher complexity measures the size of the model, and it gives an upper bound on the generalization error (Vapnik, 1998; Mohri et al., 2012).
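In a standard formulation (with illustrative notation), the conditional Rademacher complexity of a class $\mathcal{H}$ and its expectation are
\[
\widehat{R}(\mathcal{H}) := \mathbb{E}_{\epsilon}\Big[\, \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \epsilon_i\, h(x_i) \Big],
\qquad
R(\mathcal{H}) := \mathbb{E}_{(x_i)_{i=1}^n}\big[\widehat{R}(\mathcal{H})\big],
\]
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. random signs taking the values $\pm 1$ with probability $1/2$ each.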

The main difficulty in the generalization error analysis of deep learning is that the Rademacher complexity of the full model is quite large. One of the successful approaches to avoid this difficulty is the compression based bound (Arora et al., 2018; Baykal et al., 2019; Suzuki et al., 2018), which measures how much the trained network can be compressed. If the network can be compressed to a much smaller one, then its intrinsic dimensionality can be regarded as small. To describe this more precisely, suppose that the trained network is included in a subset of the neural network model. For example, this subset can be a set of networks with weight matrices that have bounded norms and are near low rank (Sec. 3.1 or Sec. 3.2). We do not assume a specific type of training procedure; instead, we give a uniform bound valid for any estimator that falls into this subset and satisfies the following compressibility condition. We suppose that the trained network is easy to compress, that is, it can be compressed to a smaller network which is included in a submodel. For example, the submodel can be a set of networks of smaller size than the original one. How much the trained network can be compressed has been characterized by several notions such as the "layer cushion" (Arora et al., 2018). Typical compression based bounds give the generalization error of the compressed model, not of the original network. Our approach converts an error bound for the compressed model into one for the original network and eventually obtains a tighter evaluation.

The biggest difficulty in transferring the compression bound to the original network lies in evaluating the population L2-norm of the difference between the original network and the compressed network. Basically, the compression based bounds are given in the following form:

(1)

for some constant, under suitable assumptions (Table 1). The compression error term appears in order to adapt the empirical error of the compressed network to that of the trained network; it can be seen as a bias term. On the right hand side, there appears the complexity of the compressed model class, which is assumed to be much smaller than that of the full model. However, the left hand side is not the expected error of the trained network but that of its compressed version. One way to transfer this bound to the trained network is to use the Lipschitz continuity of the loss function, which bounds the difference of the expected losses by the population L2-distance between the two networks, and then substitute this into (1).
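To make the argument concrete, a schematic form of such a compression bound and of the conversion step is given below; the symbols ($\hat{f}$ for the trained network, $g$ for its compressed version, $\mathcal{G}$ for the compressed model class, $\hat{r}$ for the compression error, $C$ for a constant, and $t > 0$ for a confidence parameter) are introduced here only for illustration, and the first display is a generic template rather than the precise form of (1):
\[
L(g) \;\le\; \widehat{L}(\hat{f}) + \hat{r} + C\Big( \widehat{R}(\mathcal{G}) + \sqrt{\tfrac{t}{n}} \Big)
\quad \text{with probability at least } 1 - e^{-t},
\]
and, since the loss is 1-Lipschitz in the function output (Assumption 1), Jensen's inequality gives
\[
L(\hat{f}) \;\le\; L(g) + \mathbb{E}\big[\,|\hat{f}(X) - g(X)|\,\big] \;\le\; L(g) + \|\hat{f} - g\|_{L_2(P_X)} .
\]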

However, to bound this L2-distance term, there typically appears the complexity of a model class that is much larger than the compressed one, which results in a slow convergence rate. To overcome this difficulty, we need to carefully control the difference between the training and test errors of the original and compressed networks by utilizing the local Rademacher complexity technique (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné and Koltchinskii, 2006). The local Rademacher complexity of a model with a given radius is defined as the Rademacher complexity of the subset of functions whose norm is at most that radius; a precise form is sketched below.

The main difference from the standard Rademacher complexity is that the model is localized to the set of functions whose L2-norm is at most the radius. As a result, we obtain a tighter error bound.
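One standard way to state this, with illustrative notation and localizing with respect to the population L2-norm, is
\[
R(\mathcal{H}; r) := \mathbb{E}\Big[\, \sup_{h \in \mathcal{H}:\, \|h\|_{L_2(P_X)} \le r} \frac{1}{n}\sum_{i=1}^n \epsilon_i\, h(x_i) \Big],
\]
that is, the supremum is restricted to the ball of radius $r$ inside the class; an analogous definition with the empirical norm can also be used.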

Throughout this paper, we always impose the following assumptions.

Assumption 1 (Lipschitz continuity of loss and activation functions).

The loss function is 1-Lipschitz continuous with respect to the function output.

The activation function is also 1-Lipschitz continuous, where it is applied elementwise to vectors of any dimension.

Assumption 2.

The Euclidean norm of the input is bounded by a fixed constant.

Assumption 3.

The sup-norms of all functions in the model classes under consideration are bounded by a fixed constant.

This assumption can be ensured by applying the clipping operator to the output of the functions. In this paper, all of these bounding constants are supposed to be of constant order. What we will derive in the following is a bound which has mild dependency on the depth and depends on the width only sub-linearly, by using the compression based approach.

Existing bounds for the non-compressed network

Here we give a brief review of generalization error bounds for non-compressed models. (i) VC-bound: The Rademacher complexity of the full model can be bounded by a naive VC-dimension bound (Harvey et al., 2017), in which the number of parameters appears in the numerator. However, the number of parameters is often larger than the sample size in practice. Hence, this bound is not appropriate for evaluating the generalization ability of overparameterized networks. (ii) Norm-based bound: Golowich et al. (2018) showed a norm based bound that depends on the product of the Frobenius norms of the weight matrices. However, this product depends exponentially on the depth, resulting in a quite loose bound. Neyshabur et al. (2017) showed a norm based bound which avoids the exponential dependency; however, its bound still depends on the width at a rate larger than linear, since the Frobenius norm bound could be moderately large. Bartlett et al. (2017b) showed a spectrally normalized margin bound; the matrix norm appearing in their bound implicitly encodes sparsity of the weight matrices and typically depends on the width linearly. Wei and Ma (2019) improved the exponential dependency appearing in this bound (Bartlett et al., 2017b) and obtained a bound in which a data dependent Lipschitz constant between layers appears. The norm quantities appearing in these bounds can depend on the width linearly and quadratically, respectively, even when this Lipschitz constant is bounded.
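For reference, the VC-type bound in (i) has roughly the following flavor; constants and logarithmic factors are only indicative, and the symbols $L$ (depth), $S$ (total number of parameters), and $n$ (sample size) are introduced here for illustration. Harvey et al. (2017) show that a depth-$L$ piecewise-linear network with $S$ parameters has VC-dimension $O(LS\log S)$, which yields, up to logarithmic factors,
\[
\widehat{R}(\mathcal{F}) \;\lesssim\; \sqrt{\frac{L\, S \log S}{n}},
\]
so the total number of parameters $S$ enters the numerator, which is problematic when $S \gg n$.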

3 Compression bound for non-compressed network

Here, we give a general theoretical tool that converts a compression based bound into a bound for the original network. We suppose that the original model class and the compressed model class are fixed independently of the observed data. (We can extend the result to data dependent model classes by taking a uniform bound over all possible choices of the pair of classes; however, we omit the explicit presentation of this uniform bound for simplicity.) We consider the Minkowski difference of the original class and the compressed class, and we assume that the local Rademacher complexity of this set admits a concave upper bound as a function of the radius; that is, suppose that there exists a concave function of the radius that upper bounds the local Rademacher complexity at every radius.

This condition is not restrictive, and the usual bounds for the local Rademacher complexity satisfy it (Mendelson, 2002; Bartlett et al., 2005). Using this function, we define a critical radius as follows.

This quantity is roughly given by the fixed point of the function above, and it is useful for bounding, with high probability, the ratio between the empirical L2-norm and the population L2-norm of an element of the Minkowski difference. We then obtain the following theorem, which gives a compression based bound for non-compressed networks.

Theorem 1.

Suppose that the empirical L2-distance between the trained network and its compressed version is almost surely bounded by a fixed compression error. Then, under Assumptions 1, 2, and 3, there exists a universal constant such that

with high probability, uniformly over all estimators satisfying the above conditions.

The proof is given in Appendix A. The bound consists of two parts: a "main term" and a "fast term." The main term represents the complexity of the compressed model, which can be much smaller than that of the full model. The fast term represents the sample complexity required to bridge the original model and the compressed model. With an appropriate choice of the parameters, the fast term can decay faster than the main term; indeed, it achieves a faster rate in typical situations. The fast term can be further refined by directly evaluating the covering number of the model (the poly-logarithmic factor can be improved); the refined version is given in Appendix A. This bound is general and can be combined with the compression bounds derived so far, such as Arora et al. (2018); Baykal et al. (2019); Suzuki et al. (2018), in which the complexity of the compressed model class and the compression bias are analyzed for their generalization error bounds.

The main difference from the compression bound (1) for the compressed network is that the bias term is replaced by a substantially smaller quantity. Since the additional terms are typically of lower order, we may neglect them, and then the bound is informally written as

This allows us to obtain a tighter bound than the compression bound for the compressed network: because the bias term is much smaller, we can also make the variance term much smaller by taking a smaller compressed model when balancing the bias–variance trade-off. This is an advantage of directly bounding the generalization error of the original network instead of that of its compressed version.

Finally, we note that some existing bounds such as Arora et al. (2018); Bartlett et al. (2017b); Wei and Ma (2019) assume a constant margin so that the bias term only needs to be a sufficiently small constant (it does not need to converge to 0). In contrast, our bound does not assume a margin, and the bias term should converge to 0 so that it is balanced with the variance term, which is a more difficult problem setting.

Example 1.

In practice, a trained network can usually be compressed to one with sparse weight matrices via pruning techniques (Denil et al., 2013; Denton et al., 2014). Based on this observation, Baykal et al. (2019) derived a compression based bound built on a pruning procedure. In this situation, we may suppose that the compressed model class is the set of networks whose total number of nonzero parameters is much smaller than the total number of parameters of the original network. In this situation, its Rademacher complexity is bounded in terms of the number of nonzero parameters (see Appendix B.2 for the proof), which is much smaller than the VC-dimension bound when the network is sufficiently sparse.
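As a rough illustration of this kind of compression, the following sketch prunes small-magnitude weights layer by layer and reports the number of remaining nonzero parameters; the function name, the thresholding rule, and the keep ratio are illustrative choices and are not the procedure of Baykal et al. (2019).

```python
import numpy as np

def magnitude_prune(weights, keep_ratio=0.05):
    """Keep only the largest-magnitude entries in each weight matrix.

    weights    : list of 2-D numpy arrays (one per layer)
    keep_ratio : fraction of entries kept in each layer
    Returns the pruned matrices and the total number of nonzero parameters.
    """
    pruned, nnz = [], 0
    for W in weights:
        k = max(1, int(keep_ratio * W.size))
        # magnitude of the k-th largest entry serves as the pruning threshold
        thresh = np.partition(np.abs(W).ravel(), -k)[-k]
        W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)
        pruned.append(W_pruned)
        nnz += int(np.count_nonzero(W_pruned))
    return pruned, nnz

# toy usage: a 3-layer network with random weights
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((256, 256)) for _ in range(3)]
_, nonzero = magnitude_prune(Ws)
print("nonzero parameters after pruning:", nonzero)
```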

Although our bound can be combined with several compression based bounds, in the following we demonstrate how small the obtained bound can be in some typical situations.

3.1 Compression bound with near low rank weight matrix

Here, we analyze the situation where the trained network has near low rank weight matrices. It has been reported experimentally that trained networks tend to have near low rank weight matrices (Gunasekar et al., 2018; Ji and Telgarsky, 2019). This situation was analyzed in Arora et al. (2018), where the low rank property is characterized by their own quantities such as the layer cushion. In contrast, we employ a much simpler and more intuitive condition to highlight how the low rank property affects generalization.

Assumption 4.

Assume that every weight matrix of any network in the model class is near low rank; that is, there exist constants controlling the decay of its singular values such that the following holds,

where the singular values of each matrix are sorted in decreasing order.

In this situation, for any network in the class, each weight matrix can be approximated by a matrix of a prescribed rank with an error controlled by the tail of its singular values. Let the compressed class be the set of networks whose weight matrices have exactly this prescribed rank. Setting the compression error accordingly, we have the following theorem.

Theorem 2.

The compressed model class has the following complexity bound:

If the trained network satisfies Assumption 4, then for any prescribed rank there exists a network in the compressed class whose distance from the trained network is controlled by the tail of the singular values. Then, choosing the compression error and the confidence parameter accordingly, the overall generalization error is bounded by

with high probability, where the constant depends on the singular value decay parameters.

See Appendix B.3 for the proof. This indicates that if the singular values decay quickly (in other words, each weight matrix is close to a low rank matrix), then we have a better generalization error bound. Note that the rank can be chosen arbitrarily, and the approximation error and the complexity term are in a trade-off relation. Hence, by selecting the rank appropriately so that this trade-off is balanced, we obtain the optimized upper bound, as in the following corollary.

Corollary 1.

Under Assumption 4 and using the same notation as in Theorem 2, it holds that

with high probability, where the constant depends on the singular value decay parameters.

An important point here is that the term inside the square root of our bound depends only linearly on the width, whereas the naive VC-dimension bound depends on it quadratically. In other words, the term inside the square root scales with the number of nodes instead of the number of parameters. This is a huge gap because the width can be quite large in practice. This result implies that a compressible model achieves much better generalization than the naive VC-bound suggests.
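To illustrate numerically the trade-off between the chosen rank and the induced compression error discussed above, the following sketch compresses a weight matrix by truncated SVD and reports the operator-norm residual for several ranks; the toy decay profile and the candidate ranks are arbitrary choices.

```python
import numpy as np

def low_rank_compress(W, r):
    """Best rank-r approximation of W via truncated SVD.

    Returns the compressed matrix and the operator-norm approximation error,
    which equals the (r+1)-th largest singular value.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    err = s[r] if r < len(s) else 0.0
    return W_r, err

# toy usage: a weight matrix with polynomially decaying singular values
rng = np.random.default_rng(0)
m = 300
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
W = U @ np.diag(np.arange(1, m + 1) ** -1.0) @ V.T

for r in (5, 20, 80):
    _, err = low_rank_compress(W, r)
    print(f"rank {r:3d}: operator-norm compression error = {err:.4f}")
```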

In the generalization error bound, there appears a factor that grows exponentially with the depth. Even though its base can be much smaller than a worst-case value, the exponential dependency can give a loose bound, as pointed out in Arora et al. (2018). This is due to a rough evaluation of the Lipschitz continuity between layers; the practically observed Lipschitz constant is usually much smaller. To fix this issue, we give a refined version of Corollary 1 in Appendix B.4 by using data dependent Lipschitz constants such as the interlayer cushion and interlayer smoothness introduced by Arora et al. (2018). The refined bound does not involve the exponential term; instead, a data dependent Lipschitz constant between layers appears.

3.2 Compression bound with near low rank covariance matrix

Strictly speaking, the near low rank condition on the weight matrices in the previous section can be dealt with by a standard Rademacher complexity argument. Here, we consider a more data dependent bound: we assume a near low rank property of the covariance matrix among the nodes in each internal layer. A compression based bound exploiting the low rank property of the covariance has been studied by Suzuki et al. (2018), but their analysis requires a rather strong condition on the weight matrices. In this paper, we employ a weaker assumption.

For each internal layer, let the covariance matrix of its nodes be defined with respect to the input distribution, and denote its eigenvalues sorted in decreasing order.

Assumption 5.

Suppose that the trained network satisfies the following condition on the eigenvalue decay of the covariance matrices of its internal layers:

(2)

for fixed constants governing the decay rate.

If the trained network satisfies this assumption, then we can show that it can be compressed to a smaller network whose widths are determined by the eigenvalue decay, with a compression error roughly given by the discarded eigenvalues. More precisely, for a given compression error in each layer, we define the corresponding reduced width, and correspondingly we set the compressed model class as follows.

Then, we obtain the following theorem.

Theorem 3.

Under Assumption 5, there exists a compressed network with the reduced widths defined above whose empirical L2-distance from the trained network is at most the prescribed compression error and which satisfies

In particular, with a specific choice of the per-layer compression errors, the reduced widths are bounded, up to a constant factor, by a quantity determined by the eigenvalue decay.

See Appendix B.5 for the proof. Here, we again observe a trade-off: as the compression error becomes small, the required width becomes large, and hence so does the complexity of the compressed class. The evaluation given in Theorem 3 can be substituted into the general bound (Theorem 1). If the original class is the full model, then the number of parameters appears in the bound, possibly exceeding the sample size, which is unavoidable in general. This dependency on the number of parameters becomes much milder if both Assumptions 4 and 5 are satisfied.
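The following sketch illustrates, in a simplified form, how such a covariance based compression could proceed: estimate the covariance of the hidden activations in one layer, keep only the leading eigendirections, and fold the projection into the weight matrix of the next layer. This is an illustration in the spirit of spectral pruning (Suzuki et al., 2018), not their exact procedure; all names and the energy threshold are illustrative.

```python
import numpy as np

def compress_layer(H, W_next, energy=0.99):
    """Keep the leading eigendirections of the (uncentered) activation covariance
    and fold the projection into the weight matrix of the next layer.

    H      : (n_samples, width) activations of one internal layer
    W_next : (out_dim, width) weight matrix acting on these activations
    energy : fraction of the covariance spectrum mass to retain
    """
    cov = H.T @ H / H.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)            # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # descending order
    k = int(np.searchsorted(np.cumsum(eigval) / eigval.sum(), energy)) + 1
    P = eigvec[:, :k]                 # (width, k) projection onto the top directions
    H_small = H @ P                   # compressed activations (reduced width k)
    W_small = W_next @ P              # next layer now acts on the compressed nodes
    # sqrt of the discarded spectrum mass ~ L2 error of the projected activations
    err = float(np.sqrt(max(eigval[k:].sum(), 0.0)))
    return H_small, W_small, k, err

# toy usage: activations with a fast-decaying covariance spectrum
rng = np.random.default_rng(0)
H = rng.standard_normal((2000, 200)) * (np.arange(1, 201) ** -1.0)
W_next = rng.standard_normal((10, 200))
_, _, k, err = compress_layer(H, W_next)
print(f"compressed width: {k}, residual activation error: {err:.4f}")
```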

Theorem 4.

Under Assumptions 4 and 5, it holds that

for appropriate choices of the parameters, with high probability, where the constant depends on the decay parameters.

If we omit the lower order terms for simplicity of presentation, then the bound can be written as

where the order symbol hides poly-logarithmic factors. This is tighter than the bound of Corollary 1. We can see that as the eigenvalue decay of the weight matrices and of the covariance matrices becomes faster, the bound becomes tighter; indeed, taking the limit of infinitely fast decay yields the best rate in this family of bounds. Moreover, the term depending on the width decays faster with respect to the sample size than the corresponding term in Corollary 1. Hence, the low rank property of both the covariance matrices and the weight matrices helps to obtain better generalization. Although the bound contains a term that is exponential in the depth, we can give a refined version without this exponential term by assuming the interlayer cushion condition of Arora et al. (2018). See Appendix B.6 for the refined version.

There also appears a factor that depends exponentially on the depth. However, this term is moderately small for realistic settings of the depth: it equals 7.27 and 26.7 for two representative depths (and it can be replaced by another term in exchange for a larger polynomial dependency on the depth). The bound is not optimized with respect to the dependency on the depth; in particular, part of the depth dependence could be an artifact of the proof technique and could likely be improved.

Finally, we compare our bound with the norm based bounds of Bartlett et al. (2017b) and Wei and Ma (2019). Since our bound and their bounds are derived under different conditions, we cannot say which is uniformly better. Consider a special case in which each weight matrix has rank 1, which is an extreme case of the low rank setting. In this case, their bounds still retain a dependency on the width, while the decay parameters in our bound can be arbitrarily large, so that our bound has a much milder dependency on the width. On the other hand, if the weight matrices have small norms but no spectral decay (corresponding to small decay parameters), then our bound can be looser than theirs. Combining compression based bounds and norm based bounds would be an interesting direction for future work.

4 Conclusion

In this paper, we derived a compression based error bound for non-compressed networks. The bound is general and can be combined with several compression based bounds derived so far. The main difficulty lies in evaluating the population L2-norm of the difference between the original network and the compressed network, which we overcame by utilizing a data dependent bound based on the local Rademacher complexity technique. We applied the derived bound to situations where low rank properties of the weight matrices and of the covariance matrices are assumed. The obtained bound has a much better dependency on the parameter size than previously obtained compression based bounds.

Acknowledgment

TS was partially supported by JSPS Kakenhi (15H05707, 18H03201, and 18K19793), Japan Digital Design and JST-CREST.

References

  • S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang (2019) Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. arXiv e-prints, pp. arXiv:1901.08584. External Links: 1901.08584 Cited by: §1.
  • S. Arora, R. Ge, B. Neyshabur, and Y. Zhang (2018) Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmassan, Stockholm Sweden, pp. 254–263. External Links: Link Cited by: §B.4, §B.4, Table 1, §1, §1, §2, §3.1, §3.1, §3.2, §3, §3, Assumption 6.
  • F. Bach (2017) On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research 18 (21), pp. 1–38. Cited by: §C.2.
  • P. Bartlett, O. Bousquet, and S. Mendelson (2005) Local Rademacher complexities. The Annals of Statistics 33, pp. 1487–1537. Cited by: §1, §2, §3.
  • P. Bartlett, D. J. Foster, and M. Telgarsky (2017a) Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498. Cited by: Appendix A.
  • P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017b) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241–6250. Cited by: Table 1, §1, §2, §3.2, §3.
  • C. Baykal, L. Liebenwein, I. Gilitschenski, D. Feldman, and D. Rus (2019) Data-dependent coresets for compressing neural networks with applications to generalization bounds. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §3, Example 1.
  • M. Bianchini and F. Scarselli (2014) On the complexity of neural network classifiers: a comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems 25 (8), pp. 1553–1565. Cited by: §1.
  • S. Boucheron, G. Lugosi, and P. Massart (2013) Concentration inequalities: a nonasymptotic theory of independence. OUP Oxford. External Links: ISBN 9780199535255, LCCN 2012277339, Link Cited by: Appendix A, Appendix A, Appendix A, Appendix A.
  • O. Bousquet (2002) A Bennett concentration inequality and its application to suprema of empirical process. C. R. Acad. Sci. Paris Ser. I Math. 334, pp. 495–500. Cited by: Appendix A, Proposition 2.
  • N. Cohen, O. Sharir, and A. Shashua (2016) On the expressive power of deep learning: a tensor analysis. In The 29th Annual Conference on Learning Theory, pp. 698–728. Cited by: §1.
  • N. Cohen and A. Shashua (2016) Convolutional rectifier networks as generalized tensor decompositions. In Proceedings of the 33th International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 955–963. Cited by: §1.
  • G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS) 2 (4), pp. 303–314. Cited by: §1.
  • M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas (2013) Predicting parameters in deep learning. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2148–2156. External Links: Link Cited by: Example 1.
  • E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 1269–1277. External Links: Link Cited by: Example 1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, pp. arXiv:1810.04805. External Links: 1810.04805 Cited by: §1.
  • E. Giné and V. Koltchinskii (2006) Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability 34 (3), pp. 1143–1216. Cited by: §1, §2.
  • N. Golowich, A. Rakhlin, and O. Shamir (2018) Size-independent sample complexity of neural networks. In Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet (Eds.), Proceedings of Machine Learning Research, Vol. 75, , pp. 297–299. External Links: Link Cited by: Table 1, §1, §2.
  • S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro (2018) Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pp. 9482–9491. Cited by: §1, §1, §3.1.
  • M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 1225–1234. External Links: Link Cited by: §1.
  • N. Harvey, C. Liaw, and A. Mehrabian (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Proceedings of the 2017 Conference on Learning Theory, S. Kale and O. Shamir (Eds.), Proceedings of Machine Learning Research, Vol. 65, Amsterdam, Netherlands, pp. 1064–1068. External Links: Link Cited by: Table 1, §1, §2.
  • G. Hinton and D. Van Camp (1993) Keeping neural networks simple by minimizing the description length of the weights. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §1.
  • K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257. Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456. External Links: Link Cited by: §1.
  • Z. Ji and M. Telgarsky (2019) Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §3.1.
  • V. Koltchinskii (2006) Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34, pp. 2593–2656. Cited by: §1, §2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. pp. 1097–1105. Cited by: §1.
  • A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. pp. 950–957. Cited by: §1.
  • J. Langford and R. Caruana (2002) (Not) bounding the true error. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), pp. 809–816. External Links: Link Cited by: §1.
  • M. Ledoux and M. Talagrand (1991) Probability in banach spaces. isoperimetry and processes. Springer, New York. Note: MR1102015 Cited by: Appendix A, Appendix A, Appendix A, Appendix A.
  • X. Li, J. Lu, Z. Wang, J. Haupt, and T. Zhao (2018) On tighter generalization bound for deep neural networks: cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159. Cited by: Table 1, §1.
  • N. Littlestone and M. K. Warmuth (1986) Relating data compression and learnability. Technical report University of California, Santa Cruz. Cited by: §1.
  • C. H. Martin and M. W. Mahoney (2018) Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075. Cited by: §1.
  • S. Mendelson (2002) Improving the sample complexity using global data. IEEE Transactions on Information Theory 48, pp. 1977–1991. Cited by: §1, §2, §3.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar (2012) Foundations of machine learning. Cited by: Appendix A, §2.
  • G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.d. Lawrence, and K.q. Weinberger (Eds.), pp. 2924–2932. External Links: Link Cited by: §1.
  • V. Nagarajan and Z. Kolter (2019) Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience. External Links: Link Cited by: §1.
  • B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: Table 1, §1, §2.
  • B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro (2019) The role of over-parametrization in generalization of neural networks. External Links: Link Cited by: §1, §1, §1.
  • B. Neyshabur, R. Tomioka, and N. Srebro (2015) Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, Montreal Quebec, pp. 1376–1401. Cited by: Table 1, §1.
  • B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016) Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3360–3368. External Links: Link Cited by: §1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv e-prints, pp. arXiv:1511.06434. External Links: 1511.06434 Cited by: §1.
  • S. Sonoda and N. Murata (2015) Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis. Cited by: §1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.
  • I. Steinwart and A. Christmann (2008) Support vector machines. Springer. Cited by: Appendix A.
  • T. Suzuki, H. Abe, T. Murata, S. Horiuchi, K. Ito, T. Wachi, S. Hirai, M. Yukishima, and T. Nishimura (2018) Spectral-Pruning: Compressing deep neural network via spectral analysis. arXiv e-prints, pp. arXiv:1808.08558. External Links: 1808.08558 Cited by: §C.2, Table 1, §1, §2, §3.2, §3.
  • T. Suzuki (2019) Adaptivity of deep reLU network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, External Links: Link Cited by: §C.1, §1, Lemma 3.
  • M. Talagrand (1996) New concentration inequalities in product spaces. Inventiones Mathematicae 126, pp. 505–563. Cited by: Appendix A, Proposition 2.
  • G. Valle-Perez, C. Q. Camargo, and A. A. Louis (2019) Deep learning generalizes because the parameter-function map is biased towards simple functions. External Links: Link Cited by: §1.
  • A. W. van der Vaart and J. A. Wellner (1996) Weak convergence and empirical processes: with applications to statistics. Springer, New York. Cited by: §B.3, Appendix.
  • V. N. Vapnik (1998) Statistical learning theory. Wiley, New York. Cited by: §2.
  • V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio (2018) Manifold mixup: better representations by interpolating hidden states. arXiv preprint arXiv:1806.05236. Cited by: §1.
  • S. Wager, S. Wang, and P. S. Liang (2013) Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 351–359. External Links: Link Cited by: §1.
  • M.J. Wainwright (2019) High-dimensional statistics: a non-asymptotic viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. Cited by: Appendix A.
  • C. Wei and T. Ma (2019) Data-dependent sample complexity of deep neural networks via lipschitz augmentation. pp. to appear. Cited by: Table 1, §1, §2, §3.2, §3.
  • L. Wu, Z. Zhu, et al. (2017) Towards understanding generalization of deep learning: perspective of loss landscapes. arXiv preprint arXiv:1706.10239. Cited by: §1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. External Links: Link Cited by: §1.
  • W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz (2019) Non-vacuous generalization bounds at the imagenet scale: a PAC-bayesian compression approach. In International Conference on Learning Representations, Cited by: §1.

Appendix

In the appendix, we give the proofs of the results in the main text. We use the following notation throughout the appendix: