# Function Norms and Regularization in Deep Networks

Amal Rannen Triki
KU Leuven, ESAT-PSI, imec, Belgium
amal.rannen@esat.kuleuven.be
Maxim Berman
KU Leuven, ESAT-PSI, imec, Belgium
maxim.berman@esat.kuleuven.be
Matthew B. Blaschko
KU Leuven, ESAT-PSI, imec, Belgium
matthew.blaschko@esat.kuleuven.be
Amal Rannen Triki is also affiliated with the Department of Computational Science and Engineering at Yonsei University, South Korea.
###### Abstract

Deep neural networks (DNNs) have become increasingly important due to their excellent empirical performance on a wide range of problems. However, regularization is generally achieved by indirect means, largely due to the complex set of functions defined by a network and the difficulty in measuring function complexity. There exists no method in the literature for additive regularization based on a norm of the function, as is classically considered in statistical learning theory. In this work, we propose sampling-based approximations to weighted function norms as regularizers for deep neural networks. We provide, to the best of our knowledge, the first proof in the literature of the NP-hardness of computing function norms of DNNs, motivating the necessity of an approximate approach. We then derive a generalization bound for functions trained with weighted norms and prove that a natural stochastic optimization strategy minimizes the bound. Finally, we empirically validate the proposed regularization strategies for both convex function sets and DNNs on real-world classification and image segmentation tasks, demonstrating improved performance over weight decay, dropout, and batch normalization. Source code will be released at the time of publication.

## 1 Introduction

Regularization is essential in ill-posed problems and to prevent overfitting. Regularization has traditionally been achieved in machine learning by penalization of a norm of a function or a norm of the parameter vector. In the case of linear functions (e.g. Tikhonov regularization (Tikhonov, 1963)), penalizing the parameter vector corresponds to a penalization of a function norm as a straightforward result of the Riesz representation theorem (Riesz, 1907). In the case of reproducing kernel Hilbert space (RKHS) regularization (including splines (Wahba, 1990)), this by construction corresponds directly to a function norm regularization (Vapnik, 1998; Schölkopf and Smola, 2001).

In the case of deep neural networks, similar approaches have been applied directly to the parameter vectors, resulting in an approach referred to as weight decay (Moody et al., 1995). This, in contrast to the previously mentioned Hilbert space approaches, does not directly penalize a measure of function complexity, such as a norm (Lemma 1). Indeed, we show here that any function norm of a DNN with rectified linear unit (ReLU) activation functions (Hahnloser et al., 2000) is NP-hard to compute as a function of its parameter values (Section 3), and it is therefore unreasonable to expect that simple measures, such as weight penalization, would be able to capture appropriate notions of function complexity.

In this light, it is not surprising that two of the most popular regularization techniques for the non-convex function sets defined by deep networks with fixed topology make use of stochastic perturbations of the function itself (dropout (Hinton et al., 2012; Baldi and Sadowski, 2013)) or stochastic normalization of the data in a given batch (batch normalization (Ioffe and Szegedy, 2015)). While their algorithmic description is clear, interpreting the regularization behavior of these methods in a risk minimization setting has proven challenging. What is clear, however, is that dropout can lead to a non-convex regularization penalty (Helmbold and Long, 2015) and therefore does not correspond to a norm of the function. Other regularization penalties such as path-normalization (Neyshabur et al., 2015) are polynomial time computable and thus also do not correspond to a function norm assuming $\mathrm{P} \neq \mathrm{NP}$.

Although we show that norm computation is NP-hard, we demonstrate that some norms admit stochastic approximations. This suggests incorporating penalization by these norms through stochastic gradient descent, thus directly controlling a principled measure of function complexity. In work developed in parallel to ours, Kawaguchi et al. (2017) suggest penalizing function values on the training data, motivated by generalization bounds based on Rademacher complexity, but have not provided a link to function norm penalization. We also develop a generalization bound, which shows how direct norm penalization controls expected error similarly to their approach. Furthermore, we observe in our experiments that the sampling procedure we use to stochastically minimize the function norm penalty in our optimization objective empirically leads to better generalization performance (cf. Figure 3).

Different approaches have been applied to explain the capacity of DNNs to generalize well, even though they can use a number of parameters several orders of magnitude larger than the number of training samples. Hardt et al. (2016) analyze stochastic gradient descent (SGD) applied to DNNs using the uniform stability concept introduced by Bousquet and Elisseeff (2002). However, the stability parameter they show depends on the number of training epochs, which makes the related bound on generalization rather pessimistic and tends to confirm the importance of early stopping for training DNNs (Girosi et al., 1995). More recently, Zhang et al. (2017) have suggested that classical learning theory is incapable of explaining the generalization behavior of deep neural networks. Indeed, by showing that DNNs are capable of fitting arbitrary sets of random labels, the authors make the point that the expressivity of DNNs is partially data-driven, while the classical analysis of generalization does not take the data into account, but only the function class and the algorithm. Nevertheless, learning algorithms, and in particular SGD, seem to have an important role in the generalization ability of DNNs. Keskar et al. (2017) show that using smaller batches results in better generalization. Other works (e.g. Hochreiter and Schmidhuber (1997)) relate the implicit regularization applied by SGD to the flatness of the minimum to which it converges, but Dinh et al. (2017) have shown that sharp minima can also generalize well.

Previous work concerning the generalization of DNNs presents several contradictory results. Taking a step back, it appears that our better understanding of classical learning models, such as linear functions and kernel methods, compared to DNNs comes from the well-defined hypothesis set on which the optimization is performed, and from clear measures of function complexity.

In this work, we make a step towards bridging the described gap by introducing a new family of regularizers that approximates a proper function norm (Section 2). We demonstrate that this approximation is necessary by, to the best of our knowledge, the first proof in the literature that computing a function norm of DNNs is NP-hard (Section 3). We develop a generalization bound for function norm penalization in Section 4 and demonstrate that a straightforward stochastic optimization strategy appropriately minimizes this bound. Our experiments reinforce these conclusions by showing that the use of these regularizers lowers the generalization error and that we achieve better performance than other regularization strategies in the small sample regime (Section 5).

## 2 Function norm based regularization

We consider the supervised training of the weights of a deep neural network (DNN) given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ the output space. Let $f$ be the function encoded by the neural network. The prediction of the network on an $x \in \mathcal{X}$ is generally given by $D \circ f(x)$, where $D$ is a decision function. For instance, in the case of a classification network, $f$ gives the unnormalized scores of the network. During training, the loss function $\ell$ penalizes the outputs $f(x)$ given the ground truth label $y$, and we aim to minimize the risk

$$R(f) = \int \ell(f(x), y) \, dP(x, y), \tag{1}$$

where $P$ is the underlying joint distribution of the input-output space. As this distribution is generally inaccessible, empirical risk minimization approximates the risk integral (1) by

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i), \tag{2}$$

where the elements $(x_i, y_i)$ of the dataset are assumed to be i.i.d. samples drawn from $P$.

When the number of samples is large, the empirical risk (2) is a good approximation of the risk (1). In the small-sample regime, however, better control of the generalization error can be achieved by adding a regularization term to the objective. In the statistical learning theory literature, this is most typically achieved through an additive penalty (Vapnik, 1998; Murphy, 2012)

$$\arg\min_{f} \hat{R}(f) + \lambda \Omega(f), \tag{3}$$

where $\Omega$ is a measure of function complexity. The regularization biases the objective towards “simpler” candidates in the model space.

In machine learning, using the norm of the learned mapping appears as a natural choice to control its complexity. This choice limits the hypothesis set to a ball in a certain topological set depending on the properties of the problem. In an RKHS, the natural regularizer is a function of the Hilbert space norm: for the space $\mathcal{H}$ induced by a kernel $K$, $\|f\|_{\mathcal{H}}^{2} = \langle f, f \rangle_{\mathcal{H}}$. Several results showed that the use of such a regularizer results in a control of the generalization error (Girosi and Poggio, 1990; Wahba, 1990; Bousquet and Elisseeff, 2002). In the context of function estimation, for example using splines, it is customary to use the norm of the approximation function or its derivative in order to obtain a regression that generalizes better (Wahba, 2000).

However, for neural networks, defining the best prior for regularization is less obvious. The topology of the function set represented by a neural network is still fairly unknown, which complicates the definition of a proper complexity measure.

###### Lemma 1.

The norm of the weights of a neural network, used for regularization in e.g. weight decay, is not a proper function norm.

It is easy to see that different weights can encode the same function $f$, for instance by permuting neurons or rescaling different layers. Therefore, the norm of the weights is not even a function of the $f$ encoded by those weights. Moreover, in the case of a network with ReLU activations, it can easily be seen that the norm of the weights does not have the same homogeneity degree as the output of the function, which induces optimization issues, as detailed in Haeffele and Vidal (2017).
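The rescaling and permutation arguments can be checked numerically. The following is a minimal sketch, assuming a two-layer ReLU network without biases; the helper names (`two_layer_net`, `weight_norm_sq`) and the layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def two_layer_net(x, W1, W2):
    """f(x) = W2 @ relu(W1 @ x) for a single input vector x."""
    return W2 @ relu(W1 @ x)

def weight_norm_sq(W1, W2):
    """Squared L2 norm of all weights, the quantity penalized by weight decay."""
    return float(np.sum(W1 ** 2) + np.sum(W2 ** 2))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))

# Rescale the layers by c and 1/c: ReLU is positively homogeneous,
# so relu(c * a) = c * relu(a) for c > 0 and the function is unchanged.
c = 10.0
W1_scaled, W2_scaled = c * W1, W2 / c

# Permute the hidden neurons: rows of W1 and columns of W2 move together.
perm = rng.permutation(5)
W1_perm, W2_perm = W1[perm, :], W2[:, perm]

x = rng.normal(size=3)
same_after_rescale = np.allclose(two_layer_net(x, W1, W2),
                                 two_layer_net(x, W1_scaled, W2_scaled))
same_after_permute = np.allclose(two_layer_net(x, W1, W2),
                                 two_layer_net(x, W1_perm, W2_perm))
norm_original = weight_norm_sq(W1, W2)
norm_rescaled = weight_norm_sq(W1_scaled, W2_scaled)
print(same_after_rescale, same_after_permute, norm_original, norm_rescaled)
```

Both transformed networks compute the same outputs as the original one, while the rescaled weights have a very different squared norm, so the weight norm cannot be a function of $f$ alone.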

Nevertheless, if the activation functions are continuous, any function encoded by a network is in the space of continuous functions. Moreover, supposing the input domain is compact, the network function has a finite $L_q$-norm.

###### Definition 1 ($L_q$-norm).

Given a measure $\mu$, the function $L_q$-norm for $q \geq 1$ is defined as

$$\|f\|_{q} = \left( \int \|f(x)\|_{q}^{q} \, d\mu(x) \right)^{\frac{1}{q}}, \tag{4}$$

where the inner norm represents the $q$-norm of the output space.

In the sequel, we will focus on the special case of $L_2$. This function space has attractive properties, being a Hilbert space. Note that in an RKHS, controlling the $L_2$-norm can also control the RKHS norm under some assumptions. When the kernel has a finite norm, the inclusion mapping between the RKHS and $L_2$ is continuous and injective, and constraining the function to be in a ball in one space constrains it similarly in the other (Mendelson and Neeman, 2010; Steinwart and Christmann, 2008, Chapter 4).

However, because of the high dimensionality of neural network function spaces, the optimization of function norms is not an easy task. Indeed, the exact computation of any of these function norms is NP-hard, as we show in the following section.

## 3 NP-hardness of function norm computation

###### Proposition 1.

For $f$ defined by a deep neural network (of depth greater than or equal to 4) with ReLU activation functions, the computation of any function norm $\|f\|$ from the weights of a network is NP-hard.

We prove this statement by a linear time reduction of the classic NP-complete problem of Boolean 3-satisfiability (Cook, 1971) to the computation of the norm of a particular network with ReLU activation functions. Furthermore, we can construct this network such that it always has finite $L_2$-norm.

###### Lemma 2.

Given a Boolean expression in conjunctive normal form in which each clause is composed of three literals, we can construct, in time polynomial in the size of the predicate, a network of depth 4 realizing a continuous function $f$ that has non-zero norm if and only if the predicate is satisfiable.

###### Proof.

See Supplementary Material for a construction of this network. ∎

###### Corollary 1.

Although not all norms are equivalent in the space of continuous functions, Lemma 2 implies that any function norm for a network of depth $\geq 4$ must be NP-hard since for all norms $\|\cdot\|$, $\|f\| = 0 \iff f = 0$.

This shows that the exact computation of any function norm is intractable. However, assuming the measure $\mu$ in the definition of the norm (4) is a probability measure $Q$, the function norm can be written as $\|f\|_{2,Q} = \left( \mathbb{E}_{z \sim Q}\left[ \|f(z)\|_2^2 \right] \right)^{\frac{1}{2}}$. Moreover, assuming we have access to i.i.d. samples $z_i \sim Q$, $i = 1, \dots, m$, this weighted $L_2$-function norm can be approximated by

$$\left( \frac{1}{m} \sum_{i=1}^{m} \|f(z_i)\|_2^2 \right)^{\frac{1}{2}}. \tag{5}$$

For samples outside the training set, empirical estimates of the squared weighted $L_2$-function norm are $U$-statistics of order 1, and have an asymptotic Gaussian distribution to which finite sample estimates converge quickly as $m$ grows (Lee, 1990). In the next section, we demonstrate sufficient conditions under which control of $\|f\|_{2,Q}^2$ results in better control of the generalization error.
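The sample-based approximation (5) is a plain Monte Carlo average. The sketch below is one possible way to write it, assuming the network is available as a batched callable; the function name `weighted_l2_norm_estimate` and the toy choice of $f$ and $Q$ are illustrative assumptions, not part of the paper.

```python
import numpy as np

def weighted_l2_norm_estimate(f, samples):
    """Estimate the weighted L2 function norm of f from samples z_1..z_m.

    f maps an array of shape (m, d_in) to outputs of shape (m, d_out);
    samples is an (m, d_in) array assumed to be drawn i.i.d. from Q.
    """
    outputs = f(samples)
    squared_norms = np.sum(outputs ** 2, axis=1)   # ||f(z_i)||_2^2
    return float(np.sqrt(np.mean(squared_norms)))

# Toy example: f(z) = z with Q = N(0, I_d), whose exact squared norm is d.
rng = np.random.default_rng(0)
d = 4
identity = lambda z: z
for m in (10, 1000, 100000):
    z = rng.normal(size=(m, d))
    print(m, weighted_l2_norm_estimate(identity, z))
```

In this toy case the estimates approach $\sqrt{d} = 2$ as $m$ grows, which illustrates the concentration of the estimator around the true norm.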

## 4 Generalization bound and optimization

In this section, rather than the regularized objective of the form of Equation (3), we consider an equivalent constrained optimization setting. The idea behind controlling an $L_2$ type of norm is to attract the output of the function towards 0, effectively limiting the confidence of the network, and thus the values of the loss function. Classical bounds on the generalization error show the virtue of a bounded loss. As we are approximating a norm with respect to a sampling distribution, this bound on the function values can only be probably approximately correct, and will depend on the statistics of the norm of the outputs, namely the mean (i.e. the $L_{2,Q}$-norm) and the variance, as detailed by the following proposition:

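###### Proposition 2.

When the number of samples $N$ is small, and if we suppose $\mathcal{Y}$ bounded and $\ell$ $K$-Lipschitz-continuous, solving the problem

$$f^{*} = \arg\min_{f} \hat{R}(f), \quad \text{s.t. } \|f^{*}\|_{2,Q}^{2} \leq A \ \text{ and } \ \operatorname{var}_{z \sim Q}\left( \|f^{*}(z)\|_2^2 \right) \leq B^{2} \tag{6}$$

effectively reduces the complexity of the hypothesis set, and the bounds $A$ and $B$ on the weighted $L_2$-norm and the standard deviation control the generalization error, provided that $D_P(P \| Q)$ is small, where $P$ is the marginal input distribution and $Q$ the sampling distribution. (Footnote 1: $D_P(P \| Q)$ is a divergence between $P$ and $Q$ and is minimized when $Q = P$.) Specifically, the following generalization bound holds with high probability, for a confidence parameter $\delta \in (0, 1)$:

$$R(f^{*}) \leq \hat{R}(f^{*}) + \left( K \left[ \frac{(A + B)^{\frac{1}{2}} \, D_P(P \| Q)^{\frac{1}{4}}}{\sqrt{\delta}} + A^{\frac{1}{2}} \, D_P(P \| Q)^{\frac{1}{2}} \right] + C \right) \sqrt{\frac{2 \ln \frac{2}{\delta}}{N}}. \tag{7}$$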

The proof can be found in the supplementary material, Appendix C.
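To get a feel for how the quantities in the bound interact, the excess term can be evaluated numerically. The sketch below assumes one particular reading of Equation (7), namely $\left(K\left[\sqrt{A+B}\,D^{1/4}/\sqrt{\delta} + \sqrt{A}\,\sqrt{D}\right] + C\right)\sqrt{2\ln(2/\delta)/N}$ with $D = D_P(P\|Q)$; the function name and argument names are hypothetical and the numerical values are arbitrary.

```python
import math

def generalization_gap_bound(K, A, B, D, C, delta, N):
    """Excess term of the bound under the assumed reading
    (K * [sqrt(A + B) * D**0.25 / sqrt(delta) + sqrt(A) * sqrt(D)] + C)
    * sqrt(2 * ln(2 / delta) / N).
    """
    if not (0.0 < delta < 1.0) or N <= 0:
        raise ValueError("need 0 < delta < 1 and N > 0")
    variance_term = math.sqrt(A + B) * D ** 0.25 / math.sqrt(delta)
    mean_term = math.sqrt(A) * math.sqrt(D)
    return (K * (variance_term + mean_term) + C) * math.sqrt(2.0 * math.log(2.0 / delta) / N)

# Arbitrary illustrative values: tightening the norm constraint A shrinks the term.
for A in (4.0, 1.0, 0.25):
    print(A, generalization_gap_bound(K=1.0, A=A, B=0.5, D=1.2, C=1.0, delta=0.05, N=100))
```

Under this reading, the term decreases when $A$, $B$, or $D_P(P\|Q)$ decrease and when $N$ grows, which matches the qualitative statement of the proposition.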

### 4.1 Practical optimization

In practice, we try to get close to the ideal conditions of Proposition 2. The Lipschitz continuity of the loss and the boundedness of $\mathcal{Y}$ hold in most common situations. Therefore, three conditions require attention: (i) the norm $\|f\|_{2,Q}$; (ii) the standard deviation of $\|f(z)\|_2^2$ for $z \sim Q$; (iii) the relation between the sampling distribution and the marginal distribution. Even if we can generate samples from the distribution $Q$, at each step of training, only a batch of limited size can be presented to the network. Nevertheless, controlling the sample mean of a different batch at each iteration can be sufficient to attract all the observed realizations of the output towards 0, and therefore simultaneously bound both the expected value and the standard deviation.

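###### Proposition 3.

If for a fixed $m$, for all samples $\{z_i\}$ of size $m$ drawn from $Q$:

$$\frac{1}{m} \sum_{i} \|f(z_i)\|_2^2 \leq A, \tag{8}$$

then $\|f\|_{2,Q}^{2}$ and $\operatorname{var}_{z \sim Q}\left( \|f(z)\|_2^2 \right)$ are also bounded.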

###### Proof.

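If for any sample $\{z_i\}$ of size $m$, the condition (8) holds, then:

$$\forall z_i \sim Q, \quad \|f(z_i)\|_2^2 \leq mA \tag{9}$$

and

$$\mathbb{E}_{z \sim Q}\left[ \|f(z)\|_2^2 \right] \leq mA; \quad \operatorname{var}_{z \sim Q}\left( \|f(z)\|_2^2 \right) \leq \mathbb{E}_{z \sim Q}\left[ \|f(z)\|_2^4 \right] \leq m^{2} A^{2}. \tag{10}$$

∎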
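The two steps of this argument can be spelled out as follows. This is a sketch of the reasoning only, where the shorthand $X = \|f(z)\|_2^2$ is introduced here for brevity and is not notation from the paper.

```latex
% Step 1: every summand in (8) is non-negative, so a single term
% cannot exceed the whole sum, which is at most mA.
\|f(z_j)\|_2^2 \;\le\; \sum_{i=1}^{m} \|f(z_i)\|_2^2 \;\le\; mA
\qquad \text{for each } j \in \{1, \dots, m\}.

% Step 2: with X = \|f(z)\|_2^2 and 0 \le X \le mA almost surely under Q,
\mathbb{E}[X] \le mA,
\qquad
\operatorname{var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2
\;\le\; \mathbb{E}[X^2] \;\le\; (mA)^2 = m^2 A^2 .
```

The variance bound only uses the fact that a variance never exceeds the second moment, together with the pointwise bound from the first step.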

While training, in order to satisfy the first two conditions, we use small batches to estimate the function norm with the expression (5). When possible, a new sample is generated at each iteration in order to approach the condition in Proposition 3. Concerning the condition on the relation between the two distributions, three possibilities were considered in our experiments: (i) using unlabeled data that are not used for training, (ii) generating samples from a Gaussian distribution whose mean and variance are related to the training data statistics, and (iii) optimizing a generative model, e.g. a variational autoencoder (Kingma and Welling, 2014), on the training set. In the first case, the sampling is done with respect to the data marginal distribution, in which case the derived generalization bound is the tightest. However, in this case, we can use only a limited number of samples, and our control on the function norm can be loose because of the estimation error. In the second and third cases, it is possible to generate as many samples as needed to estimate the norm. The Gaussian distribution satisfies the boundedness of $D_P(P \| Q)$, but does not take into account the spatial data structure. The variational autoencoder, on the contrary, captures the spatial properties of the data, but suffers from mode collapse. In order to alleviate the effect of having a tighter distribution than the data, we use an enlarged Gaussian distribution in the latent space when generating the samples from the trained autoencoder.
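Sampling options (ii) and (iii) can be sketched in a few lines. The code below is an illustration under stated assumptions: the function names, the diagonal Gaussian, the enlargement factor `enlarge=1.5`, and the untrained linear stand-in for a decoder are all hypothetical choices and are not taken from the paper.

```python
import numpy as np

def sample_gaussian_like_data(train_x, m, rng, scale=1.0):
    """Option (ii): draw m points from a diagonal Gaussian whose mean and
    variance follow the per-feature statistics of the training inputs."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) * scale
    return mean + std * rng.normal(size=(m, train_x.shape[1]))

def sample_from_decoder(decoder, latent_dim, m, rng, enlarge=1.5):
    """Option (iii): decode latent codes drawn from an enlarged Gaussian
    N(0, enlarge^2 I) instead of the standard prior N(0, I)."""
    latent = enlarge * rng.normal(size=(m, latent_dim))
    return decoder(latent)

# Stand-in data and a stand-in linear "decoder" (hypothetical, untrained).
rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 1.0, size=(100, 8))
decoder_weights = rng.normal(size=(2, 8))
decoder = lambda latent: latent @ decoder_weights

z_gauss = sample_gaussian_like_data(train_x, m=32, rng=rng)
z_vae = sample_from_decoder(decoder, latent_dim=2, m=32, rng=rng)
print(z_gauss.shape, z_vae.shape)
```

In a real setting the decoder would be the generative part of a variational autoencoder trained on the training set, and either batch would be passed to the norm estimate (5).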

## 5 Experiments and results

To test the proposed regularizer, we consider three different settings: (i) a classification task with kernelized logistic regression, for which control of the weighted $L_2$ norm theoretically controls the RKHS norm, and should therefore result in accuracy similar to that achieved by standard RKHS regularization; (ii) a classification task with DNNs; (iii) a semantic image segmentation task with DNNs.

### 5.1 Oxford Flowers classification with kernelized logistic regression

In Sec. 2, we state that according to Steinwart and Christmann (2008); Mendelson and Neeman (2010), the $L_2$-norm regularization should result in a control over the RKHS norm. The following experiment shows that both norms have similar behavior on the test data.

#### Data and Kernel

For this experiment we consider the 17-class Oxford Flowers dataset, composed of 80 images per class, and precomputed kernels that have been shown to give good performance on a classification task (Nilsback and Zisserman, 2006, 2008). We have taken the mean of Gaussian kernels as described in Gehler and Nowozin (2009).

#### Settings

To test the effect of the regularization, we train the logistic regression on a subset of 10% of the data, and test on 20% of the samples. The remaining 70% are used as potential samples for regularization. For both regularizers, the regularization parameter is selected by a 3-fold cross validation. For the weighted norm regularization, we used 4 different sample sizes ranging from 20% to 70% of the data, as this results in a favorable balance between controlling the terms in Eq. (6) (cf. Proposition 3). This procedure is repeated on 10 different splits of the data for a better estimate. The optimization is performed by quasi-Newton gradient descent, which is guaranteed to converge due to the convexity of the objective.
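For a kernel expansion $f(\cdot) = \sum_j \alpha_j k(x_j, \cdot)$, both regularizers compared in this experiment are quadratic in the coefficients. The following sketch contrasts them, assuming a single Gaussian kernel as a stand-in for the precomputed kernels used in the experiment; all names, sizes, and the bandwidth `gamma` are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for all pairs of rows."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def rkhs_norm_sq(alpha, K_train):
    """||f||_H^2 = trace(alpha^T K alpha) for f(.) = sum_j alpha_j k(x_j, .)."""
    return float(np.trace(alpha.T @ K_train @ alpha))

def weighted_l2_norm_sq(alpha, K_reg_train):
    """Sample estimate (1/m) sum_i ||f(z_i)||_2^2, with f(z_i) = K(z_i, X) alpha."""
    outputs = K_reg_train @ alpha                # shape (m, n_classes)
    return float(np.mean(np.sum(outputs ** 2, axis=1)))

rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 5))               # labeled training inputs
z_reg = rng.normal(size=(50, 5))                 # held-out unlabeled inputs
alpha = rng.normal(size=(20, 3))                 # coefficients for 3 classes

K_train = gaussian_kernel(x_train, x_train)
K_reg_train = gaussian_kernel(z_reg, x_train)
print(rkhs_norm_sq(alpha, K_train), weighted_l2_norm_sq(alpha, K_reg_train))
```

Since both penalties are convex quadratic forms in `alpha`, adding either one to the logistic loss keeps the objective convex, which is what makes a quasi-Newton solver applicable in this setting.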

#### Results

Figure 1 shows the means and standard deviations of the accuracy on the test set obtained without regularization, and with regularization using the RKHS norm, along with the histogram of accuracies obtained with the weighted norm regularization with the different sample sizes and across the ten trials. This figure demonstrates the equivalent effect of both regularizers, as expected from the stability properties induced by both norms.

The weighted function norm is more useful for DNNs, where very few other direct controls of function complexity are known to be computable in polynomial time. The next two experiments show the efficiency of our regularizer when compared to other regularization strategies: weight decay (Moody et al., 1995), dropout (Hinton et al., 2012), and batch normalization (Ioffe and Szegedy, 2015).

### 5.2 MNIST classification

#### Data and Model

In order to compare the regularization strategies, we consider only small subsets of 100 samples of the MNIST dataset for training. The tests are conducted on 10,000 samples. We consider the LeNet architecture (LeCun et al., 1995), with various combinations of weight decay, dropout, batch normalization, and weighted function norms (Figure 2).

#### Settings

We train the model on 10 different random subsets of 100 samples. For the norm estimation, we consider both generating from Gaussian distributions and from a VAE trained for each of the subsets. The VAEs used for this experiment are composed of 2 hidden layers as in Kingma and Welling (2014). More details about the training and sampling are given in the supplementary material. For each batch, a new sample is generated for the function norm estimation. Adam (Kingma and Ba, 2015) is used for the training of the VAE, and plain SGD with momentum is used for the main model. The obtained models are applied to the test set, and classification error curves are averaged over the 10 trials. The regularization parameter is set to 0.01 for all experiments.
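The training procedure described here, with a fresh regularization sample per iteration and SGD with momentum, can be sketched as follows. This is a simplified illustration and not the authors' implementation: a linear classifier stands in for LeNet, the regularization samples come from a Gaussian matched to the training statistics, and the learning rate, momentum, step count, and batch size are assumed values. Only the regularization parameter 0.01 comes from the text.

```python
import numpy as np

def function_norm_penalty(W, Z):
    """Penalty (1/m) sum_i ||W z_i||_2^2 for a linear model f(z) = W z,
    together with its gradient with respect to W."""
    m = Z.shape[0]
    outputs = Z @ W.T                            # shape (m, n_out)
    value = float(np.sum(outputs ** 2) / m)
    grad = (2.0 / m) * outputs.T @ Z             # shape (n_out, n_in)
    return value, grad

def softmax_cross_entropy(W, X, y):
    """Mean cross-entropy of a linear classifier and its gradient."""
    logits = X @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = X.shape[0]
    loss = float(-np.mean(np.log(probs[np.arange(n), y])))
    probs[np.arange(n), y] -= 1.0                # now holds d(loss)/d(logits) * n
    return loss, probs.T @ X / n

def train(X, y, n_classes, lam=0.01, lr=0.1, momentum=0.9,
          steps=200, reg_batch=32, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    velocity = np.zeros_like(W)
    mean, std = X.mean(axis=0), X.std(axis=0)
    for _ in range(steps):
        # Fresh regularization sample at every iteration.
        Z = mean + std * rng.normal(size=(reg_batch, X.shape[1]))
        _, grad_loss = softmax_cross_entropy(W, X, y)
        _, grad_pen = function_norm_penalty(W, Z)
        velocity = momentum * velocity - lr * (grad_loss + lam * grad_pen)
        W = W + velocity
    return W
```

With a deep network, the same structure applies with the two gradients obtained by backpropagation through the labeled batch and the regularization batch, respectively.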

#### Results

Figure 2 displays the averaged curves and error bars for two different architectures for MNIST. Figure 2(a) compares the effect of the function norm to dropout and weight decay. Figure 2(b) compares the effect of the function norm to dropout and weight decay. Two different sizes of regularization batches are used, in order to test the effect of this parameter. It appears that a larger batch size can reach higher performance but seems to have a higher variance, while the smaller batch size shows more stability with comparable performance at convergence. These experiments show a better performance of our regularization when compared with dropout and batch normalization. Combining our regularization with dropout seems to increase the performance even more, but batch normalization seems to annihilate the effect of the norm.

Figure 3 displays the averaged curves and error bars for various experiments using dropout. Figure 3(a) shows the results using Gaussian distributions for generation of the regularization samples. Using Gaussians with mean 5 and variance 2, and mean 10 and variance 1, caused the training to diverge and yielded only random performance. Figure 3(b) shows that our method outperforms the regularizer proposed in Kawaguchi et al. (2017).

Note that each of the experiments uses a different set of randomly generated subsets for training. However, the curves in each individual figure use the same data.

In the next experiment, we show that weighted function norm regularization can improve performance over batch normalization on a real-world image segmentation task.

### 5.3 Regularized training of ENet

We consider the training of ENet (Paszke et al., 2016), a network architecture designed for fast image segmentation, on the Cityscapes dataset (Cordts et al., 2016). As regularization plays a more significant role in the low-data regime, we consider a fixed random subset of images of the training set of Cityscapes as an alternative to the full set of training images. We train ENet similarly to the authors' original optimization settings, in a two-stage training of the encoder and the encoder + decoder part of the architecture, using a weighted cross-entropy loss. We use Adam with a polynomially decaying learning rate schedule and the same batch size for both training stages. Under these settings, the validation mean IoU of the model trained with all images is reduced when training only on the subset.

We use our proposed weighted function norm regularization using unlabeled samples taken from the images of the “coarse” training set of Cityscapes, disjoint from the training set. Figure 4 shows the evolution of the validation accuracy during training. We see that the added regularization leads to a higher performance on the validation set. Figure 5 shows a segmentation output with higher performance after adding the regularization.

In our experiments, we were not able to observe an improvement over the baseline in the same setting with a state-of-the-art semi-supervised method, mean-teacher (Tarvainen and Valpola, 2017). We therefore attribute the observed improvement to the regularization. The impact of semi-supervision in the regime of such high resolution images for segmentation is however largely unknown, and it is possible that a more thorough exploration of unsupervised methods would lead to a better usage of the unlabeled data.

## 6 Discussion and Conclusions

Regularization in deep neural networks has been challenging, and the most commonly applied frameworks only indirectly penalize meaningful measures of function complexity. It appears that the better understanding of regularization and generalization in more classically considered function classes, such as linear functions and RKHSs, is due to the well-behaved and convex nature of the function class and regularizers. By contrast, DNNs define poorly understood non-convex function sets. Existing regularization strategies have not been shown to penalize a norm of the function. We have shown here for the first time that norm computation in a low fixed depth neural network is NP-hard, elucidating some of the challenges of working with DNN function classes. This negative result motivates the use of stochastic approximations to weighted norm computation, which is readily compatible with stochastic gradient descent optimization strategies. We have developed backpropagation algorithms for weighted $L_2$ norms, and have demonstrated consistent improvement in performance over the most popular regularization strategies. We empirically validated the expected effect of the employed regularizer on generalization with experiments on the Oxford Flowers dataset, the MNIST image classification problem, and the training of ENet on Cityscapes. We will make source code available at the time of publication.

## Acknowledgments

This work is funded by Internal Funds KU Leuven, FP7-MC-CIG 334380, the Research Foundation - Flanders (FWO) through project number G0A2716N, and an Amazon Research Award.

## References

• P. Baldi and P. J. Sadowski (2013) Understanding dropout. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Q. Weinberger (Eds.), pp. 2814–2822. Cited by: §1.
• O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of Machine Learning Research 2 (Mar), pp. 499–526. Cited by: §1, §2.
• S. A. Cook (1971) The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, pp. 151–158. Cited by: §3.
• M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §5.3.
• L. Dinh, R. Pascanu, S. Bengio and Y. Bengio (2017) Sharp minima can generalize for deep nets. In International Conference on Machine Learning, Cited by: §1.
• P. V. Gehler and S. Nowozin (2009) On feature combination for multiclass object classification. In International Conference on Computer Vision, pp. 221–228. Cited by: §5.1.
• F. Girosi, M. Jones and T. Poggio (1995) Regularization theory and neural networks architectures. Neural Computation 7 (2), pp. 219–269. Cited by: §1.
• F. Girosi and T. Poggio (1990) Networks and the best approximation property. Biological cybernetics 63 (3), pp. 169–176. Cited by: §2.
• B. D. Haeffele and R. Vidal (2017) Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7331–7339. Cited by: §2.
• R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas and H. S. Seung (2000) Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405 (6789), pp. 947. Cited by: §1.
• M. Hardt, B. Recht and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, Cited by: §1.
• D. P. Helmbold and P. M. Long (2015) On the inductive bias of dropout. Journal of Machine Learning Research 16, pp. 3403–3454. Cited by: §1.
• G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. Cited by: §1, §5.1.
• S. Hochreiter and J. Schmidhuber (1997) Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §1.
• W. Hoeffding (1963) Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58 (301), pp. 13–30. Cited by: Appendix C.
• S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, Cited by: §1, §5.1.
• K. Kawaguchi, L. P. Kaelbling and Y. Bengio (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468. Cited by: §1, 2(b), Figure 3, §5.2.
• N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations, Cited by: §1.
• D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: Appendix D, §5.2.
• D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations, Cited by: §4.1, §5.2.
• Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Säckinger, P. Simard and V. Vapnik (1995) Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, Cited by: §5.2.
• A. J. Lee (1990) U-statistics: theory and practice. CRC Press. Cited by: §3.
• S. Mendelson and J. Neeman (2010) Regularization in kernel learning. The Annals of Statistics 38 (1), pp. 526–565. Cited by: §2, §5.1.
• J. Moody, S. Hanson, A. Krogh and J. A. Hertz (1995) A simple weight decay can improve generalization. Advances in Neural Information Processing Systems 4, pp. 950–957. Cited by: §1, §5.1.
• K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT Press. Cited by: §2.
• B. Neyshabur, R. R. Salakhutdinov and N. Srebro (2015) Path-SGD: path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 2422–2430. Cited by: §1.
• M-E. Nilsback and A. Zisserman (2006) A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1447–1454. Cited by: §5.1.
• M-E. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Cited by: §5.1.
• A. Paszke, A. Chaurasia, S. Kim and E. Culurciello (2016) ENet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §5.3.
• F. Riesz (1907) Sur une espèce de géométrie analytique des systèmes de fonctions sommables. Gauthier-Villars. Cited by: §1.
• B. Schölkopf and A. J. Smola (2001) Learning with kernels. MIT Press. Cited by: §1.
• B. K. Sriperumbudur, K. Fukumizu and G. R. Lanckriet (2011) Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research 12 (Jul), pp. 2389–2410. Cited by: Definition 7.
• I. Steinwart and A. Christmann (2008) Support vector machines. Springer. Cited by: §2, §5.1.
• A. Tarvainen and H. Valpola (2017) Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780. Cited by: §5.3.
• A. N. Tikhonov (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, pp. 1035–1038. Cited by: §1.
• V. Vapnik (1998) Statistical learning theory. Wiley. Cited by: §1, §2.
• G. Wahba (1990) Spline models for observational data. Vol. 59, SIAM. Cited by: §1, §2.
• G. Wahba (2000) Splines in nonparametric regression. Encyclopedia of Environmetrics. Cited by: §2.
• C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, Cited by: §1.

Function Norms and Regularization in Deep Neural Networks: Supplementary Material

In this Supplementary Material, Section A details our NP-hardness proof of function norm computation for DNNs. Section B gives additional insight into the fact that weight decay does not define a function norm. Section C details the proof of our generalization bound for the weighted $L_2$ function norm. Section D gives some details on the VAE architecture used. Finally, Section E gives additional results concerning Sobolev function norms.

## Appendix A NP-hardness of DNN L2 function norm

We divide the proof of the NP-hardness theorem into the three following subsections. In Section A.1, we introduce the functions needed to build our constructive proof. Section A.2 gives the proof of the theorem, while Section A.3 establishes a technical property needed for one of the definitions in Section A.1.

### A.1 Definitions

###### Definition 2.
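For a fixed $\varepsilon > 0$, we define $f_0$ as

$$f_0(x) = \varepsilon^{-1}\left[\max(0, x+\varepsilon) - 2\max(0, x) + \max(0, x-\varepsilon)\right], \tag{11}$$

and

$$f_1(x) = f_0(x-1). \tag{12}$$

These functions place some non-zero values in the $\varepsilon$-neighborhood of $x=0$ and $x=1$, respectively, and zero elsewhere. Furthermore, $f_0(0)=1$ and $f_1(1)=1$ (see Figure 6).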
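As a quick sanity check of Equations (11) and (12), the following minimal sketch evaluates the two functions using three ReLU units each. The helper name `relu` and the value `eps=0.25` are illustrative assumptions, not taken from the paper.

```python
def relu(t):
    """Rectified linear unit on a scalar."""
    return max(0.0, t)


def f0(x, eps=0.25):
    """Equation (11): a hat of height 1 and half-width eps centred at 0,
    written as a combination of three ReLU units."""
    return (relu(x + eps) - 2.0 * relu(x) + relu(x - eps)) / eps


def f1(x, eps=0.25):
    """Equation (12): the same hat shifted so that it is centred at 1."""
    return f0(x - 1.0, eps)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 0.125, 0.25, 1.0):
        print(x, f0(x), f1(x))
```

Because each function is a fixed linear combination of ReLU units, it can be realised by one hidden layer of a ReLU network, which is what the construction in Section A.2 relies on.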

A sentence in 3-conjunctive normal form (3-CNF) consists of the conjunction of clauses. Each clause is a disjunction over three literals, a literal being either a logical variable or its negation. For the construction of our network, each variable is identified with a dimension $x_i$ of our input $x$, and we index the clauses in the conjunction and the three literals within each clause. Each literal is associated with $f_0(x_i)$ if it is a negation of the $i$th variable, or with $f_1(x_i)$ if it is not a negation. Note that each variable can appear in multiple clauses with or without negation, and therefore the indexing of literals is distinct from the indexing of variables.

###### Definition 3.

We define the function as with such that – for a vector of ones.

###### Definition 4.

We define the function as

$$\sum_{j=1}^{3} f_0\left(\sum_{i=1}^{3} z_i - j\right). \tag{13}$$

For a proof that this function has values in $[0,1]$, see Lemma 3 in Section A.3 below.

In order to ensure our network defines a function with finite measure, we may use the following function to truncate values outside the unit cube.

###### Definition 5.
\thlabel

def:SATconstructionTruncationFunction .

### A.2 Proof of the NP-hardness theorem

###### Proof of the NP-hardness theorem.

We construct the three hidden layers of the network as follows.

1. In the first layer, we compute $f_0(x_i)$ for the literals containing negations and $f_1(x_i)$ for the literals without negation. These operators introduce one hidden layer whose number of nodes is at most linear in the number of variables.

2. The second layer computes the clauses of three literals using the function of Definition 4. This operator introduces one hidden layer with a number of nodes linear in the number of clauses in the sentence.

3. Finally, the outputs of the second layer are concatenated into a vector and passed to the function of Definition 3. This operator requires one application of $f_0$ and thus introduces one hidden layer with a constant number of nodes.

Let $f$ be the function coded by this network. By optionally adding an additional layer implementing the truncation in Definition 5, we can guarantee that the resulting function has finite norm. It remains to show that the norm of $f$ is strictly positive if and only if the sentence is satisfiable.

If the sentence is satisfiable, let $x^\star$ be a satisfying assignment; by construction $f(x^\star) = 1$, as all the clauses evaluate exactly to 1. Since $f$ is continuous as a composition of continuous functions, it is strictly positive on a neighborhood of $x^\star$, and we conclude that $\|f\|_2 > 0$.

Now suppose the sentence is not satisfiable. For a given clause, consider the dimensions associated with the variables contained within this clause and label them $x_1$, $x_2$, and $x_3$. Now, for all $2^3$ possible assignments of these variables, consider the polytopes defined by restricting each of $x_1$, $x_2$, $x_3$ to be greater than or less than $\frac{1}{2}$. Exactly one of those variable assignments falsifies the clause. The function value over the corresponding polytope must be zero. This is because the output of the corresponding OR block must be zero over this region by construction, and therefore the output of the final block will also be zero, as the summation of all the OR outputs will be at most the number of clauses minus one. For each of the $2^n$ assignments of the $n$ Boolean variables, at least one clause will guarantee that $f(x) = 0$ for all $x$ in the corresponding polytope, as the sentence is assumed to be unsatisfiable. The union of all such polytopes is the entire space $\mathbb{R}^n$. As $f = 0$ everywhere, $\|f\|_2 = 0$. ∎
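The construction can be illustrated on toy sentences. The sketch below is not the authors' code: it assumes $\varepsilon = 0.25$, a DIMACS-style clause encoding (`+i` for variable `i`, `-i` for its negation), and, because the exact form of the final conjunction block is not recoverable here, it assumes the final block is `f0(sum(z) - len(z))`. The names `or_block`, `and_block`, `network`, `SAT` and `UNSAT` are hypothetical.

```python
import itertools
import random

EPS = 0.25  # assumed half-width of the hats; the argument needs EPS < 1/2


def relu(t):
    return max(0.0, t)


def f0(x):
    # Equation (11) with the assumed EPS.
    return (relu(x + EPS) - 2.0 * relu(x) + relu(x - EPS)) / EPS


def f1(x):
    # Equation (12).
    return f0(x - 1.0)


def or_block(z):
    # Equation (13): sum of hats centred at 1, 2 and 3 applied to sum(z).
    s = sum(z)
    return sum(f0(s - j) for j in (1, 2, 3))


def and_block(z):
    # ASSUMPTION: the conjunction is modelled as f0(sum(z) - number of clauses).
    return f0(sum(z) - len(z))


def network(x, clauses):
    """Evaluate the network for a 3-CNF sentence.

    Each clause is a tuple of three non-zero integers: +i stands for
    variable i (1-indexed) and -i for its negation.
    """
    outputs = []
    for clause in clauses:
        literals = [f1(x[abs(l) - 1]) if l > 0 else f0(x[abs(l) - 1])
                    for l in clause]
        outputs.append(or_block(literals))
    return and_block(outputs)


# A satisfiable toy sentence and an unsatisfiable one (all 8 sign patterns).
SAT = [(1, -2, 3), (-1, 2, 3)]
UNSAT = [tuple(s * v for s, v in zip(signs, (1, 2, 3)))
         for signs in itertools.product((1, -1), repeat=3)]

if __name__ == "__main__":
    print(network([1.0, 1.0, 0.0], SAT))
    rng = random.Random(0)
    pts = [[rng.uniform(-0.5, 1.5) for _ in range(3)] for _ in range(1000)]
    print(max(abs(network(p, UNSAT)) for p in pts))
```

Under these assumptions the network is non-zero around a satisfying assignment of the first sentence and vanishes on every sampled point for the second, which mirrors the two directions of the proof.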

###### Corollary 2.

is satisfiable, and has finite measure for all .

### A.3 Output of OR blocks

###### Lemma 3.

The output of all OR blocks in the construction of the network implementing a given SAT sentence has values in the range $[0,1]$.

###### Proof.

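Following the steps of Proposition 3, this function is defined for $X \in \mathbb{R}^3$ and:

$$F(X) = f_0\Big(\sum_i f_1(X_i) - 1\Big) + f_0\Big(\sum_i f_1(X_i) - 2\Big) + f_0\Big(\sum_i f_1(X_i) - 3\Big). \tag{14}$$

To compute the values of $F$ over $\mathbb{R}^3$, we consider two cases for every $X_i$: $X_i \in (1-\varepsilon, 1+\varepsilon)$ and $X_i \notin (1-\varepsilon, 1+\varepsilon)$.

#### Case 1: all $X_i \notin (1-\varepsilon, 1+\varepsilon)$:

In this case, we have $\sum_i f_1(X_i) = 0$. Therefore, $f_0\big(\sum_i f_1(X_i) - j\big) = 0$ for $j \in \{1, 2, 3\}$, and $F(X) = 0$.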

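#### Case 2: only one $X_i \in (1-\varepsilon, 1+\varepsilon)$:

Without loss of generality, we suppose that $X_1 \in (1-\varepsilon, 1+\varepsilon)$ and $X_2, X_3 \notin (1-\varepsilon, 1+\varepsilon)$. Thus:

$$\sum_i f_1(X_i) = 1 - \frac{1}{\varepsilon}|X_1 - 1|. \tag{15}$$

Thus, we have $f_0\big(\sum_i f_1(X_i) - 1\big) = \max\big(0, 1 - \frac{1}{\varepsilon^2}|X_1 - 1|\big)$, $f_0\big(\sum_i f_1(X_i) - 2\big) = 0$, and $f_0\big(\sum_i f_1(X_i) - 3\big) = 0$. Therefore:

$$F(X) = \begin{cases} 1 - \frac{1}{\varepsilon^2}|X_1 - 1|, & \text{for } 0 \le |X_1 - 1| \le \varepsilon^2 \\ 0, & \text{otherwise.} \end{cases} \tag{16}$$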

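#### Case 3: two $X_i \in (1-\varepsilon, 1+\varepsilon)$:

Suppose $X_1, X_2 \in (1-\varepsilon, 1+\varepsilon)$ and $X_3 \notin (1-\varepsilon, 1+\varepsilon)$. We have then:

$$\sum_i f_1(X_i) = 2 - \frac{1}{\varepsilon}|X_1 - 1| - \frac{1}{\varepsilon}|X_2 - 1|. \tag{17}$$

Therefore:

1. $$\Big|\sum_i f_1(X_i) - 2\Big| < \varepsilon \iff |X_1 - 1| + |X_2 - 1| < \varepsilon^2 \tag{18}$$
2. $$\Big|\sum_i f_1(X_i) - 1\Big| < \varepsilon \iff \varepsilon - \varepsilon^2 < |X_1 - 1| + |X_2 - 1| < \varepsilon + \varepsilon^2 \tag{19}$$

The resulting function values are then:

$$F(X) = \begin{cases} 1 - \frac{1}{\varepsilon^2}|X_1 - 1| - \frac{1}{\varepsilon^2}|X_2 - 1|, & \text{for } X_{1,2} \text{ in region (18)} \\ 1 - \frac{1}{\varepsilon}\Big|1 - \frac{1}{\varepsilon}|X_1 - 1| - \frac{1}{\varepsilon}|X_2 - 1|\Big|, & \text{for } X_{1,2} \text{ in region (19)} \\ 0, & \text{otherwise.} \end{cases} \tag{20}$$

As $\varepsilon^2 < \varepsilon - \varepsilon^2$ for $\varepsilon < \frac{1}{2}$, the regions (18) and (19) do not overlap.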

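#### Case 4: all $X_i \in (1-\varepsilon, 1+\varepsilon)$:

We have then:

$$\sum_i f_1(X_i) = 3 - \frac{1}{\varepsilon}\sum_i |X_i - 1|. \tag{21}$$

Therefore

1. $$\Big|\sum_i f_1(X_i) - 3\Big| < \varepsilon \iff |X_1 - 1| + |X_2 - 1| + |X_3 - 1| < \varepsilon^2 \tag{22}$$
2. $$\Big|\sum_i f_1(X_i) - 2\Big| < \varepsilon \iff \varepsilon - \varepsilon^2 < |X_1 - 1| + |X_2 - 1| + |X_3 - 1| < \varepsilon + \varepsilon^2 \tag{23}$$
3. $$\Big|\sum_i f_1(X_i) - 1\Big| < \varepsilon \iff 2\varepsilon - \varepsilon^2 < |X_1 - 1| + |X_2 - 1| + |X_3 - 1| < 2\varepsilon + \varepsilon^2 \tag{24}$$

The resulting function values are then

$$F(X) = \begin{cases} 1 - \frac{1}{\varepsilon^2}\sum_i |X_i - 1|, & \text{for } X_{1,2,3} \text{ in region (22)} \\ 1 - \frac{1}{\varepsilon}\Big|1 - \frac{1}{\varepsilon}\sum_i |X_i - 1|\Big|, & \text{for } X_{1,2,3} \text{ in region (23)} \\ 1 - \frac{1}{\varepsilon}\Big|2 - \frac{1}{\varepsilon}\sum_i |X_i - 1|\Big|, & \text{for } X_{1,2,3} \text{ in region (24)} \\ 0, & \text{otherwise.} \end{cases} \tag{25}$$

Again, as $\varepsilon^2 < \varepsilon - \varepsilon^2$ and $\varepsilon + \varepsilon^2 < 2\varepsilon - \varepsilon^2$ for $\varepsilon < \frac{1}{2}$, the regions (22), (23) and (24) do not overlap. Finally,

$$\forall X \in \mathbb{R}^3,\quad 0 \le F(X) \le 1. \tag{26}$$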
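The case analysis can be checked numerically. The following sketch evaluates $F$ from Equation (14) on a regular grid around $(1,1,1)$ and reports the smallest and largest values found. It assumes $\varepsilon = 0.25$ and a grid resolution chosen for illustration; a grid check is of course not a proof.

```python
def relu(t):
    return max(0.0, t)


def f0(x, eps):
    # Equation (11).
    return (relu(x + eps) - 2.0 * relu(x) + relu(x - eps)) / eps


def f1(x, eps):
    # Equation (12).
    return f0(x - 1.0, eps)


def F(x1, x2, x3, eps=0.25):
    """Equation (14): output of one OR block for raw inputs X = (x1, x2, x3)."""
    s = f1(x1, eps) + f1(x2, eps) + f1(x3, eps)
    return f0(s - 1.0, eps) + f0(s - 2.0, eps) + f0(s - 3.0, eps)


def or_output_range(eps=0.25, steps=32):
    """Minimum and maximum of F on a grid over [1 - 2*eps, 1 + 2*eps]^3."""
    lo, hi = 1.0 - 2.0 * eps, 1.0 + 2.0 * eps
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    values = [F(a, b, c, eps) for a in grid for b in grid for c in grid]
    return min(values), max(values)


if __name__ == "__main__":
    print(or_output_range())
```

The grid deliberately extends beyond the support of $f_1$ so that all four cases of the proof are visited.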

## Appendix B Weight decay does not define a function norm

It is straightforward to see that weight decay, i.e. the norm of the weights of a network, does not define a norm of the function determined by the network. Consider a layered network

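$$f(x) = W_d\,\sigma(W_{d-1}\,\sigma(\dots\sigma(W_1 x)\dots)), \tag{27}$$

where the non-linear activation function $\sigma$ can be e.g. a ReLU. The weight decay complexity measure is

$$\sum_{i=1}^{d} \|W_i\|^2_{\mathrm{Fro}}, \tag{28}$$

where $\|\cdot\|_{\mathrm{Fro}}$ is the Frobenius norm. A simple counter-example to show this cannot define a function norm is to set any one of the matrices $W_j = 0$, so that $f(x) = 0$ for all $x$. However, $\sum_{i=1}^{d} \|W_i\|^2_{\mathrm{Fro}}$ can be set to an arbitrary value by changing the $W_i$ for $i \neq j$, although this does not change the underlying function.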
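The counter-example can be made concrete with a small sketch. The two-layer ReLU networks and the weight values below are hypothetical and chosen only for illustration: both networks compute the zero function because their last layer is zero, yet their weight decay penalties differ by a factor of 100.

```python
def relu_vec(v):
    return [max(0.0, t) for t in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def net(weights, x):
    """Equation (27) with ReLU activations; `weights` is [W_1, ..., W_d]."""
    h = x
    for W in weights[:-1]:
        h = relu_vec(matvec(W, h))
    return matvec(weights[-1], h)


def weight_decay(weights):
    """Equation (28): sum of squared Frobenius norms."""
    return sum(w * w for W in weights for row in W for w in row)


# Two hypothetical networks whose last layer is zero.
W1_small = [[1.0, 2.0], [3.0, 4.0]]
W1_large = [[10.0, 20.0], [30.0, 40.0]]
W2_zero = [[0.0, 0.0]]
net_a = [W1_small, W2_zero]
net_b = [W1_large, W2_zero]

if __name__ == "__main__":
    x = [0.5, -1.5]
    print(net(net_a, x), net(net_b, x))            # same function values
    print(weight_decay(net_a), weight_decay(net_b))  # different penalties
```

A function norm would have to assign the same value to both networks, since they represent the same function.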

## Appendix C Proof of the generalization bound

###### Proof.

In the following, $P(x)$ is the marginal input distribution

$$P(x) = \int P(x, y)\,dy. \tag{29}$$

We first suppose that $\mathcal{X}$ is bounded, and that all the activations of the network are continuous, so that any function represented by the network is continuous. Furthermore, if the magnitudes of the weights are bounded (this condition will be subsequently relaxed), without further control we know that:

$$\exists L > 0,\ \forall f \in \mathcal{H},\ \forall x \in \mathcal{X},\quad \|f(x)\|_2 \le L, \tag{30}$$

and supposing $\ell$ is $K$-Lipschitz continuous with respect to its first argument and under the $\ell_2$-norm, we have:

$$\forall x \in \mathcal{X},\quad |\ell(f(x), y) - \ell(0, y)| \le KL, \tag{31}$$

and

$$|\ell(f(x), y)| \le KL + |\ell(0, y)|. \tag{32}$$

If we suppose $|\ell(0, y)|$ is bounded over $\mathcal{Y}$ as well, then:

$$\exists C > 0,\ \forall (x, y) \in \mathcal{X} \times \mathcal{Y},\quad |\ell(f(x), y)| \le KL + C. \tag{33}$$

Under these assumptions, using the Hoeffding inequality [Hoeffding, 1963], we have with probability at least $1-\delta$:

$$R(f) \le \hat{R}(f) + (KL + C)\sqrt{\frac{2\ln\frac{2}{\delta}}{n}}. \tag{34}$$

When $n$ is large, this inequality ensures control over the generalization error when applied to the learned function. However, when $n$ is small, this control can be insufficient. We will show in the following that under the constraints described above, we can further bound the generalization error by replacing $L$ with a term that we can control.

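In the sequel, we consider $f$ verifying the conditions (6), while relaxing the boundedness of $\mathcal{X}$ and of the weights of $f$. Using Chebyshev's inequality, we have with probability at least $1-\delta$:

$$\forall x \in \mathcal{X},\quad \Big|\,\|f(x)\|_2 - \mathbb{E}_{\nu \sim P}[\|f(\nu)\|_2]\,\Big| \le \frac{\sigma_{f,P}}{\sqrt{\delta}}, \quad \text{where } \sigma^2_{f,P} = \operatorname{var}_{\nu \sim P}(\|f(\nu)\|_2), \tag{35}$$

and

$$\|f(x)\|_2 \le \frac{\sigma_{f,P}}{\sqrt{\delta}} + \mathbb{E}_{\nu \sim P}[\|f(\nu)\|_2]. \tag{36}$$

We have on the right-hand side of this inequality

$$\mathbb{E}_{\nu \sim P}[\|f(\nu)\|_2] = \int \|f(\nu)\|_2 P(\nu)\,d\nu \le \underbrace{\left(\int \|f(\nu)\|_2^2\, Q(\nu)\,d\nu\right)^{\frac{1}{2}}}_{\|f\|_{2,Q}} \left(\int \frac{P(\nu)^2}{Q(\nu)^2}\, Q(\nu)\,d\nu\right)^{\frac{1}{2}} \tag{37}$$

using the Cauchy-Schwarz inequality. Similarly, we can write

$$\begin{aligned} \sigma^2_{f,P} &\le \int \|f(\nu)\|_2^2\, P(\nu)\,d\nu \le \left(\int \|f(\nu)\|_2^4\, Q(\nu)\,d\nu\right)^{\frac{1}{2}} \left(\int \left(\frac{P(\nu)}{Q(\nu)}\right)^2 Q(\nu)\,d\nu\right)^{\frac{1}{2}} && (38) \\ &\le (A + B)\left(\int \frac{P(\nu)}{Q(\nu)}\, P(\nu)\,d\nu\right)^{\frac{1}{2}} && (40) \end{aligned}$$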

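To summarize, denoting $D_P(P\|Q) = \int \frac{P(\nu)}{Q(\nu)} P(\nu)\,d\nu$, we have with probability at least $1-\delta$, for any $x \in \mathcal{X}$ and $f$ satisfying (6):

$$\|f(x)\|_2 \le \frac{(A + B)^{\frac{1}{2}}\, D_P(P\|Q)^{\frac{1}{4}}}{\sqrt{\delta}} + A^{\frac{1}{2}}\, D_P(P\|Q)^{\frac{1}{2}}. \tag{41}$$

Therefore, with probability at least $(1-\delta)^2$,

$$R(f) \le \hat{R}(f) + \underbrace{\left(K\left[\frac{(A + B)^{\frac{1}{2}}\, D_P(P\|Q)^{\frac{1}{4}}}{\sqrt{\delta}} + A^{\frac{1}{2}}\, D_P(P\|Q)^{\frac{1}{2}}\right] + C\right)}_{=:\ \tilde{L}(A, B, D_P(P\|Q))} \sqrt{\frac{2\ln\frac{2}{\delta}}{N}}. \tag{42}$$

$C$ is fixed and depends only on the loss function (e.g. for the cross entropy loss, $C$ is the logarithm of the number of classes). We note that $\tilde{L}$ is continuous and increasing in its arguments, which finishes the proof. ∎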
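To get a feel for how the bound in Equation (42) behaves, the sketch below evaluates $\tilde{L}$ and the resulting gap term for user-supplied constants. The function names `l_tilde` and `bound_gap` and all numerical values are hypothetical; the code only transcribes the formula and makes no claim about realistic magnitudes.

```python
import math


def l_tilde(A, B, D, K, C, delta):
    """The factor L~(A, B, D_P(P||Q)) from Equation (42).

    A, B: constants from the conditions (6); D: the divergence term
    D_P(P||Q); K: Lipschitz constant of the loss; C: bound on |l(0, y)|.
    """
    deviation = math.sqrt(A + B) * D ** 0.25 / math.sqrt(delta)
    mean = math.sqrt(A) * math.sqrt(D)
    return K * (deviation + mean) + C


def bound_gap(A, B, D, K, C, delta, N):
    """Second term on the right-hand side of Equation (42)."""
    return l_tilde(A, B, D, K, C, delta) * math.sqrt(
        2.0 * math.log(2.0 / delta) / N)


if __name__ == "__main__":
    # Hypothetical constants, e.g. C = log(10) for a 10-class cross entropy.
    for N in (100, 1000, 10000):
        print(N, bound_gap(A=1.0, B=0.5, D=1.2, K=1.0,
                           C=math.log(10), delta=0.05, N=N))
```

The gap shrinks as $N$ grows and grows with each of $A$, $B$ and $D_P(P\|Q)$, which is the monotonicity used at the end of the proof.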

## Appendix D Variational autoencoders

To generate samples for DNN regularization, we choose to train VAEs on the training data. The chosen architecture is composed of two hidden layers for encoding and decoding. Figure 7 displays such an architecture. For each of the datasets, the size of the hidden layers is set empirically to ensure convergence. The training is done with Adam [Kingma and Ba, 2015]. The VAE for the IBSR data has 512 and 10 nodes in the first and second hidden layer respectively, and is trained for 1000 epochs. As the latent space is mapped to a normal distribution, it is customary to generate samples by decoding normal noise. In order to have samples that are close to the data distribution but have a slightly broader support, we sample a normal variable with a higher variance. In our experiments, we multiply the variance by 2.
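The sampling step can be sketched as follows. Multiplying the variance by 2 corresponds to multiplying the standard deviation by $\sqrt{2}$. The function names and the `decoder` argument are hypothetical stand-ins for a trained VAE decoder, which is not reproduced here; the latent dimension of 10 matches the second hidden layer size quoted above.

```python
import math
import random


def sample_latents(n, dim, variance_scale=2.0, seed=0):
    """Draw n latent vectors from N(0, variance_scale * I)."""
    rng = random.Random(seed)
    std = math.sqrt(variance_scale)  # variance x2  <=>  std x sqrt(2)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n)]


def generate_regularization_samples(decoder, n, dim, variance_scale=2.0,
                                    seed=0):
    """Decode broadened latent noise into input-space samples.

    `decoder` is a hypothetical callable mapping a latent vector to an input.
    """
    latents = sample_latents(n, dim, variance_scale, seed)
    return [decoder(z) for z in latents]


if __name__ == "__main__":
    # Identity "decoder" used purely as a placeholder.
    samples = generate_regularization_samples(lambda z: z, n=4, dim=10)
    print(len(samples), len(samples[0]))
```

With a real decoder, the returned samples would play the role of the points at which the function norm is estimated.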

## Appendix E Weighted Sobolev norms

We may analogously consider a weighted Sobolev norm:

###### Definition 6 (Weighted Sobolev norm).
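$$\begin{aligned} \|f\|^2_{H^2,Q} &= \|f\|^2_{2,Q} + \|\nabla_x f\|^2_{2,Q} && (43) \\ &= \mathbb{E}_{x \sim Q}\left(\|f(x)\|_2^2 + \|\nabla_x f(x)\|_2^2\right) && (44) \end{aligned}$$

### E.1 Computational complexity of weighted L2 vs. Sobolev regularization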

We restrict our analysis of the computational complexity of the stochastic optimization to a single step as the convergence of stochastic gradient descent will depend on the variance of the stochastic updates, which in turn depends on the variance of .

For the weighted $L_2$ norm, the complexity is simply a forward pass for the regularization samples in a given batch. The gradient of the norm can be combined with the loss gradients into a single backward pass, and the net increase in computation is a single forward pass.
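A minimal sketch of the sampling-based estimate of the squared weighted $L_2$ norm, and of how it would enter a regularized objective, is given below. The names `weighted_l2_sq_norm`, `regularized_objective` and the weight `lam` are assumptions for illustration; in practice the forward pass would be done by a deep learning framework so that gradients flow through the same computation.

```python
def weighted_l2_sq_norm(f, samples):
    """Monte Carlo estimate of E_{x ~ Q} ||f(x)||_2^2.

    `f` maps an input (list of floats) to an output vector (list of floats);
    `samples` are points assumed to be drawn from the weighting measure Q.
    """
    total = 0.0
    for x in samples:
        total += sum(o * o for o in f(x))  # one forward pass per sample
    return total / len(samples)


def regularized_objective(data_loss, f, samples, lam):
    """Data loss plus lam times the estimated squared function norm."""
    return data_loss + lam * weighted_l2_sq_norm(f, samples)


if __name__ == "__main__":
    toy_f = lambda x: [x[0], 2.0 * x[0]]
    toy_samples = [[1.0], [2.0]]
    print(weighted_l2_sq_norm(toy_f, toy_samples))
    print(regularized_objective(1.0, toy_f, toy_samples, lam=0.1))
```

The only extra cost relative to unregularized training is the forward pass over `samples`, in line with the complexity argument above.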

The picture is somewhat more complex for the Sobolev norm. The first term is the same as the $L_2$ norm, but the second term penalizing the gradients introduces substantial additional computational complexity, with computation of the exact gradient requiring a number of backpropagation iterations dependent on the dimensionality of the inputs. We have found this to be prohibitively expensive in practice, and instead penalize a directional gradient in the direction of a random unit vector that is resampled at each step to ensure convergence of stochastic gradient descent.

### E.2 Comparative performance of the Sobolev and L2 norm on MNIST

Figure 8 displays the averaged curves and error bars on MNIST in a low-data regime for various regularization methods for the same network architecture and optimization hyperparameters. Comparisons are made between $L_2$, Sobolev, gradient (i.e. penalizing only the second term of the Sobolev norm), weight decay, dropout, and batch normalization. In all cases, $L_2$ and Sobolev norms perform similarly, significantly outperforming the other methods.
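The directional penalty can be sketched as follows. Note that this example swaps in a central finite difference for the backpropagation-based directional derivative, purely to keep the code dependency-free; the step size `h` and all function names are assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]


def directional_penalty(f, x, v, h=1e-4):
    """Squared norm of the directional derivative of f at x along v,
    approximated by a central finite difference (a stand-in for autodiff)."""
    x_plus = [a + h * b for a, b in zip(x, v)]
    x_minus = [a - h * b for a, b in zip(x, v)]
    f_plus, f_minus = f(x_plus), f(x_minus)
    return sum(((p - m) / (2.0 * h)) ** 2 for p, m in zip(f_plus, f_minus))


if __name__ == "__main__":
    rng = random.Random(0)
    toy_f = lambda x: [3.0 * x[0] + 4.0 * x[1]]
    v = random_unit_vector(2, rng)  # resampled at every optimization step
    print(directional_penalty(toy_f, [0.2, -0.7], v))
```

Each evaluation costs a constant number of forward passes regardless of the input dimensionality, which is the motivation for replacing the full gradient by a single random direction.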

## Appendix F A polynomial-time computable function norm for shallow networks

###### Definition 7.
Let $\mathcal{H}$ be an RKHS associated with the kernel $k$ over the topological space $\mathcal{X}$. $k$ is characteristic [Sriperumbudur et al., 2011] if the mapping $P \mapsto \int_{\mathcal{X}} k(\cdot, x)\,dP(x)$ from the set of all Borel probability measures defined on $\mathcal{X}$ to $\mathcal{H}$ is injective.

For example, the Gaussian kernel over $\mathbb{R}^d$ is characteristic.

###### Proposition 4.

Given a 2-layer neural network mapping from $\mathbb{R}^d$ to $\mathbb{R}$ with $m$ hidden units, and a kernel $k$ characteristic over $\mathbb{R}^d$, there exists a function norm that can be computed in time quadratic in $m$ and the cost of evaluation of $k$ (assuming we allow a square root operation). For example, for a Gaussian kernel, the cost of the kernel evaluation is linear in $d$ and the function norm can be computed in $O(m^2 d)$ (assuming that we allow square root and exponential operations).

###### Proof.

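We will define a norm on two-layer ReLU networks by defining an inner product through an RKHS construction.

A two-layer network with a single output can be written as

$$f(x) = w_1^T \sigma(W_2 x), \tag{45}$$

where $w_1 \in \mathbb{R}^m$ and $W_2 \in \mathbb{R}^{m \times d}$, and $\sigma(x) = \max(0, x)$ is taken element-wise. In the following, such a network is represented by $(w_1, W_2)$, and we note:

###### Lemma 4 (Addition).

Let $u$ and $v$ be two functions represented by 2-layer neural networks $(u_1, U_2)$ and $(v_1, V_2)$. Then, the function $u + v$ is represented by $\left(\begin{pmatrix} u_1 \\ v_1 \end{pmatrix}, \begin{pmatrix} U_2 \\ V_2 \end{pmatrix}\right)$.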

###### Lemma 5 (Scalar multiplication).
Let $u$ be a function represented by a 2-layer neural network $(u_1, U_2)$ and let $\alpha \in \mathbb{R}$. Then, the function $\alpha u$ is represented by $(\alpha u_1, U_2)$.

These operations define a linear space. A two-layer network is preserved when scaling the $i$th row of $W_2$ by $\alpha > 0$ and the $i$th entry of $w_1$ by $\alpha^{-1}$. We therefore assume that each row of $W_2$ is scaled to have unit norm, removing any rows of $W_2$ that consist entirely of zero entries. (Footnote 2: The choice of vector norm is not particularly important. For concreteness we may assume a normalization that, when considering rational weights with bounded coefficients, preserves polynomial boundedness after normalization.) Now, we define an inner product as follows:

###### Definition 8 (An inner product between 2-layer networks).

Let $k$ be a characteristic kernel (Definition 7) over $\mathbb{R}^d$. Let $u$ and $v$ be two-layer networks represented by $(u_1, U_2)$ and $(v_1, V_2)$, respectively, where no row of $U_2$ or $V_2$ is a zero vector, and each row has unit norm. Define

$$\langle u, v \rangle_H := \sum_{i=1}^{m_u} \sum_{j=1}^{m_v} [u_1]_i\, [v_1]_j\, k\big([U_2]_{i,:}, [V_2]_{j,:}\big), \tag{46}$$

where $[U_2]_{i,:}$ denotes the $i$th row of $U_2$, which induces the norm $\|u\|_H := \sqrt{\langle u, u \rangle_H}$.

We note that $k$ must be characteristic to guarantee the property of a norm that $\|u\|_H = 0 \iff u = 0$.

This inner product inherits the structure of the linear space defined above. Using the addition and scalar multiplication operations from the two lemmas above, verifying that Equation (46) satisfies the properties of an inner product is now a basic exercise.

We may take Equation (46) as the basis of a constructive proof that two-layer networks have polynomial time computable norm. To summarize, to compute such a norm, we need to:

1. Normalize $(u_1, U_2)$ so that $U_2$ has rows with unit norm, and no row is a zero vector, which takes time linear in the size of $U_2$;

2. Compute $\langle u, u \rangle_H$ according to Equation (46), which is quadratic in $m_u$ times the complexity of $k$;

3. Compute $\|u\|_H = \sqrt{\langle u, u \rangle_H}$.

Therefore, assuming we allow square roots as operations, the constructed norm can be computed in time quadratic in the number of hidden units times the cost of the evaluation of $k$. For example, for a Gaussian kernel, and allowing $\exp$ as an operation, the cost of the kernel evaluation is linear in the input dimension and the cost of the constructed norm is quadratic in the number of hidden units. ∎
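The three steps above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes Euclidean row normalization and a Gaussian kernel of the form $\exp(-\gamma\|x - y\|^2)$ with a hypothetical bandwidth parameter `gamma`, and all function names are invented for the example.

```python
import math


def normalize(u1, U2):
    """Step 1: rescale each row of U2 to unit Euclidean norm, moving the
    scale into u1 (valid for ReLU since relu(a*t) = a*relu(t) for a > 0),
    and drop all-zero rows, which contribute nothing to the function."""
    weights, rows = [], []
    for a, row in zip(u1, U2):
        n = math.sqrt(sum(t * t for t in row))
        if n == 0.0:
            continue
        weights.append(a * n)
        rows.append([t / n for t in row])
    return weights, rows


def gaussian_kernel(x, y, gamma=1.0):
    # Assumed parametrization; gamma is a hypothetical bandwidth parameter.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def inner_product(u, v, k=gaussian_kernel):
    """Step 2: Equation (46) evaluated on the normalized representations."""
    u1, U2 = normalize(*u)
    v1, V2 = normalize(*v)
    return sum(a * b * k(r, s)
               for a, r in zip(u1, U2)
               for b, s in zip(v1, V2))


def network_norm(u, k=gaussian_kernel):
    """Step 3: square root of <u, u>, clipped at 0 against rounding error."""
    return math.sqrt(max(0.0, inner_product(u, u, k)))


if __name__ == "__main__":
    u = ([1.0, -2.0], [[1.0, 0.0], [0.0, 3.0]])
    print(network_norm(u))
```

The double loop in `inner_product` makes the quadratic dependence on the number of hidden units explicit, and each kernel call is linear in the input dimension.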
