Understanding training and generalization in deep learning by Fourier analysis

Funding: This work was funded by the NYU Abu Dhabi Institute G1301.

Zhi-Qin John Xu, New York University Abu Dhabi, United Arab Emirates, and Courant Institute of Mathematical Sciences, New York University, New York, United States (zhiqinxu@nyu.edu)
Abstract

Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs)—equipped with many more parameters than training data and trained by (stochastic) gradient-based methods—often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNNs trained by (stochastic) gradient-based methods endow low-frequency components of the target function with a higher priority during training; ii) small initialization leads to good generalization of the DNN while preserving its ability to fit any function. These results are further confirmed by experiments in which DNNs fit natural images, one-dimensional functions, and the MNIST dataset.

Keywords: deep learning, generalization, training, frequency, optimization

AMS subject classifications: 68Q32, 68T01

1 Introduction

Background

Deep learning has achieved great success in many fields [12]. Recent studies have focused on understanding why DNNs, trained by (stochastic) gradient-based methods, can generalize well, that is, why DNNs often fit test data that are not used for training. Counter-intuitively, although DNNs have many more parameters than training data, in practice they rarely overfit the training data.

Several studies have explored the DNN's generalization ability through the local properties (sharpness/flatness) of the loss function at minima [7]. Keskar et al. (2016) [9] empirically demonstrated that with a small batch in each training step, DNNs consistently converge to flat minima and achieve better generalization. However, Dinh et al. (2017) [4] argued that most notions of flatness are problematic: using deep networks with rectified linear units (ReLUs), they showed theoretically that, without constraints on the parameterization, any minimum can be made arbitrarily sharp or flat. With the constraint of small weights in the parameterization, Wu et al. (2017) [18] proved that for two-layer ReLU networks, low-complexity solutions lie in areas with small Hessian, that is, in flat and large basins of attraction [18]. They then concluded that, with gradient-based methods, a random initialization tends to produce starting parameters located in the basin of a flat minimum with high probability.

Several studies rely on the concept of stochastic approximation or uniform stability [2, 6]. To ensure the stability of a training algorithm, Hardt et al. (2015) [6] assumed loss functions with good properties, such as Lipschitz continuity or smoothness. However, the loss function of a DNN is often very complicated [21].

Another approach to understanding the DNN's generalization ability is to find general principles of the training process. Empirically, Arpit et al. (2017) [1] suggested that DNNs may learn simple patterns in real data first, before memorizing. Xu et al. (2018) [19] empirically found a similar phenomenon, referred to as the Frequency Principle (F-Principle): for a low-frequency dominant target function, DNNs with common settings first quickly capture the dominant low-frequency components while keeping the amplitudes of high-frequency components small, and then relatively slowly capture those high-frequency components. The F-Principle can explain empirically how training leads DNNs to good generalization [19]. However, without a theoretical understanding, Xu et al. (2018) [19] were limited to an empirical study of DNNs fitting low-frequency dominant functions of low dimension.

Contribution

In this work, we develop a theoretical framework in the Fourier domain aiming to understand the training process and the generalization of DNNs with sufficiently many neurons and hidden layers. We show that for any parameter, the gradient magnitude contributed by each frequency component of the loss function is proportional to the product of two factors: one is a decay term with respect to (w.r.t.) frequency; the other is the amplitude of the difference between the DNN output and the target function at that frequency. This theoretical framework can answer the following crucial questions:

Question 1: Do DNNs trained by gradient-based methods endow low-frequency components with higher priority during the training process? The decay term in the gradient magnitude (see Eq. (13)) shows that, before convergence, the contribution of low-frequency components dominates that of high-frequency components. This theoretical understanding explains the F-Principle observed in the previous study [19]. In addition, the framework can also explain the more complicated behavior of DNNs fitting high-frequency dominant functions.

Question 2: How does initialization affect the DNN's generalization? We analyze the DNN with the sigmoid activation function, starting from the following point: the power spectrum of the sigmoid function decays exponentially w.r.t. frequency, with an exponential decay rate proportional to the inverse of the weight. We then show that small (large) initialization results in small (large) amplitudes of high-frequency components, thus leading the DNN output to a low (high) complexity function with good (bad) generalization ability. Therefore, with small initialization, sufficiently large DNNs can fit any function [3] while keeping good generalization.

We demonstrate that the analysis in this work can be qualitatively extended to general DNNs. We exemplify our theoretical results through DNNs fitting natural images, 1-d functions, and the MNIST dataset [11].

The paper is organized as follows. The common settings of DNNs in this work are presented in Section 2. The theoretical framework is given in Section 3. In Section 4 we show empirically that the mean magnitude of DNN parameters changes only slightly during training. The theoretical framework is applied to understand the training and generalization of DNNs in Section 5. Conclusions and discussions follow in Section 6.

2 Methods

The activation function for each neuron is tanh. We use DNNs with multiple hidden layers and no activation function for the output layer. The DNN is trained by the Adam optimizer [10], with its parameters set to the default values [10]. The loss function is the mean squared error between the DNN output and the target function on the training set.
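A minimal sketch of this setup (not the author's code; the layer widths, learning rate, epoch count, and data are placeholders) could look as follows in PyTorch:

import torch
import torch.nn as nn

def make_dnn(widths):
    # widths, e.g. [1, 200, 200, 100, 1]: input dim, hidden widths, output dim
    layers = []
    for i in range(len(widths) - 2):
        layers += [nn.Linear(widths[i], widths[i + 1]), nn.Tanh()]
    layers.append(nn.Linear(widths[-2], widths[-1]))   # linear output layer, no activation
    return nn.Sequential(*layers)

def train(model, x, y, epochs=1000, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # otherwise default Adam hyperparameters
    loss_fn = nn.MSELoss()                                    # mean squared error on the training set
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return model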

3 Theoretical framework

In this section, we develop a theoretical framework in the Fourier domain to understand the training process of DNNs. For illustration, we first use a DNN with one hidden layer and the sigmoid function σ(x) = 1/(1 + e^{-x}) as the activation function.

The Fourier transform of σ(wx + b) with w, b ∈ ℝ is (see Appendix A)

F[σ(wx + b)](k) = (2πi/|w|) e^{ibk/w} / (e^{-πk/w} - e^{πk/w}),   (1)

where F[·] denotes the Fourier transform, k is the frequency, and i is the imaginary unit.

When k = 0, F[σ(wx + b)](k)—the integral of σ(wx + b) over ℝ—would be infinite. However, we only consider a finite range of x in practice. Thus, the infinity at k = 0 in Eq. (1) is not an issue in practice.
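As a small numerical sketch (assuming the reconstructed magnitude of Eq. (1), |F[σ(wx+b)](k)| = (π/|w|)/|sinh(πk/w)|, with illustrative weight values), the following evaluates how fast this spectrum decays and compares the fitted decay rate with π/|w|:

import numpy as np

# Magnitude of the reconstructed Eq. (1); the bias b only enters the phase,
# so it does not affect the amplitude.
def sigmoid_spectrum_amplitude(k, w):
    return (np.pi / abs(w)) / np.abs(np.sinh(np.pi * k / w))

k = np.linspace(0.5, 10.0, 200)
for w in (0.5, 1.0, 2.0):
    amp = sigmoid_spectrum_amplitude(k, w)
    fitted_rate = -np.polyfit(k, np.log(amp), 1)[0]   # slope of log-amplitude vs. k
    # For large |k/w|, 1/|sinh(pi k/w)| ~ 2 exp(-pi|k|/|w|), so the fitted rate
    # approaches pi/|w|: smaller weights give faster decay of high frequencies.
    print(f"w = {w}: fitted decay rate {fitted_rate:.2f}, pi/|w| = {np.pi / abs(w):.2f}")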

Consider a DNN with one hidden layer of N nodes, 1-d input x and 1-d output:

h(x) = Σ_{j=1}^{N} a_j σ(w_j x + b_j).   (2)

Note that we call all w_j, b_j and a_j parameters, in which w_j and a_j are weights and b_j is a bias term. When |k/w_j| is large, without loss of generality we assume k > 0 and w_j > 0, so that Eq. (1) gives

F[h](k) ≈ -Σ_{j=1}^{N} (2π a_j i / w_j) e^{i b_j k / w_j} e^{-πk/w_j}.   (3)

We define the difference between the DNN output and any target function f at each frequency k as

D(k) ≡ F[h](k) - F[f](k).

Write D(k) as

D(k) = A(k) e^{iθ(k)},   (4)

where A(k) ∈ [0, +∞) and θ(k) indicate the amplitude and phase of D(k), respectively.

The loss at frequency k is L(k) = |D(k)|², where |·| denotes the norm of a complex number. The total loss function is defined as L = ∫_{-∞}^{+∞} L(k) dk. Note that according to Parseval's theorem (without loss of generality, the constant factor related to the definition of the corresponding Fourier transform is ignored here), this loss function in the Fourier domain is equal to the commonly used mean squared error loss, that is,

L = ∫_{-∞}^{+∞} (h(x) - f(x))² dx.   (5)
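A small numerical check of the discrete analogue of Eq. (5): with NumPy's unnormalized DFT, Parseval's identity ties the squared error in x-space to the squared DFT amplitudes of the difference (the toy target f, toy output h, and sample count n below are placeholders):

import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.linspace(0.0, 1.0, n, endpoint=False)
f = np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)   # toy "target function"
h = f + 0.05 * rng.standard_normal(n)                      # toy "DNN output"

d = h - f
loss_space = np.sum(d ** 2)                                # squared error in x-space
loss_freq = np.sum(np.abs(np.fft.fft(d)) ** 2) / n         # Parseval for NumPy's unnormalized DFT
print(loss_space, loss_freq)                               # the two values agree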

At frequency k, the gradient of L(k) with respect to each parameter can be obtained from Eqs. (3) and (4):

∂L(k)/∂a_j = (4π/w_j) A(k) e^{-πk/w_j} F_{a_j}(k),   (6)

∂L(k)/∂w_j = (4π a_j/w_j³) A(k) e^{-πk/w_j} F_{w_j}(k),   (7)

∂L(k)/∂b_j = (4π a_j k/w_j²) A(k) e^{-πk/w_j} F_{b_j}(k),   (8)

where

F_{a_j}(k) = sin(b_j k/w_j - θ(k)),   (9)

F_{w_j}(k) = (πk - w_j) sin(b_j k/w_j - θ(k)) - b_j k cos(b_j k/w_j - θ(k)),   (10)

F_{b_j}(k) = cos(b_j k/w_j - θ(k)).   (11)

The descent amount in any direction, say, with respect to a parameter θ_j ∈ {a_j, w_j, b_j}, is

∂L/∂θ_j = ∫_{-∞}^{+∞} ∂L(k)/∂θ_j dk.   (12)

The absolute contribution from frequency k to this total amount at θ_j is

|∂L(k)/∂θ_j| ≈ A(k) e^{-|πk/w_j|} G_j(θ_j, k),   (13)

where θ_j ∈ {a_j, w_j, b_j}, w_j ≠ 0, and G_j is a function with respect to θ_j and k, which can be read off from one of Eqs. (6, 7, 8).

When the component at frequency k has not yet converged, A(k) is not small, and for a small w_j the factor e^{-|πk/w_j|} dominates the frequency dependence of Eq. (13). After training, when the component at frequency k converges, A(k) → 0, and the contribution of frequency k to the total descent amount vanishes. Therefore, the behavior of Eq. (13) is dominated by A(k) e^{-|πk/w_j|}. This dominant term also indicates that weights are much more important than bias terms, since only the weights appear in the exponential decay; this will be verified on the MNIST dataset later.
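As a minimal numerical sketch of this argument (under the reconstructed form of Eq. (13), with the slowly varying factor G_j omitted; the weight, residual amplitudes, and frequencies are illustrative values), one can compare the contributions of a low and a high frequency before and after convergence:

import numpy as np

def contribution(A_k, k, w):
    # Dominant part of Eq. (13); the remaining factor varies slowly compared with
    # the exponential decay and is omitted in this illustration.
    return A_k * np.exp(-np.abs(np.pi * k / w))

w = 0.5                        # small weight, as with small initialization
k_low, k_high = 1.0, 5.0
A_low, A_high = 1.0, 1.0       # before convergence, both residual amplitudes are O(1)

ratio = contribution(A_high, k_high, w) / contribution(A_low, k_low, w)
print(f"high/low frequency contribution ratio before convergence: {ratio:.1e}")

# After the low-frequency component converges, A(k_low) -> 0, its contribution
# vanishes, and the higher-frequency residual takes over the descent direction.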

Next, we demonstrate that the analysis of Eq. (13) can be qualitatively extended to general DNNs. A(k) in Eq. (13) comes from the square operation in the loss function and is thus irrelevant to the DNN structure. e^{-|πk/w_j|} in Eq. (13) comes from the exponential decay of the activation function in the Fourier domain. The analysis of Eq. (13) is insensitive to the following factors. i) Activation function: the power spectrum of most activation functions decreases as the frequency increases. ii) Neuron number: the summation in Eq. (2) does not affect the exponential decay. iii) Multiple hidden layers: the composition of continuous activation functions is still a continuous function, whose power spectrum still decays as the frequency increases. iv) High-dimensional input: we simply need to consider the vector form of wx + b in Eq. (1). v) High-dimensional output: the total loss function is the summation of the losses of all output nodes. Therefore, the analysis for a single hidden layer qualitatively applies to different activation functions, neuron numbers, multiple hidden layers, and high-dimensional functions.

4 The magnitude of DNN parameters during training

Since the magnitude of DNN parameters is important to the analysis of the gradients, such as in Eq. (13), we study the evolution of the magnitude of DNN parameters during training. By training DNNs on the MNIST dataset, we show empirically that for a network with sufficiently many neurons and layers, the mean absolute value of DNN parameters changes only slightly during training. For example, we train a DNN on the MNIST dataset with different initializations. In Fig. 1, DNN parameters are initialized by a Gaussian distribution with mean 0 and standard deviation 0.06, 0.2, 0.6 for (a, b, c), respectively. We compute the mean absolute value of the weights and of the bias terms. As shown in Fig. 1, the mean absolute value of DNN parameters changes only slightly during training. Thus, empirically, the initialization almost determines the magnitude of DNN parameters. Note that the magnitude of DNN parameters can change significantly during training when the network is small (see more in Discussions).
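A sketch of how such a measurement can be made (assuming a PyTorch model of Linear layers such as the one sketched in Section 2; the helper name is ours):

import torch

def mean_param_magnitudes(model):
    weights, biases = [], []
    for name, p in model.named_parameters():
        (biases if "bias" in name else weights).append(p.detach().abs().flatten())
    return torch.cat(weights).mean().item(), torch.cat(biases).mean().item()

# Usage inside a training loop:
# for epoch in range(num_epochs):
#     ...one epoch of Adam updates...
#     w_mag, b_mag = mean_param_magnitudes(model)
#     print(epoch, w_mag, b_mag)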


(a) std: 0.06

(b) std: 0.2

(c) std: 0.6
Figure 1: Magnitude of DNN parameters while fitting the MNIST dataset. DNN parameters are initialized by a Gaussian distribution with mean 0 and standard deviation 0.06, 0.2, 0.6 for (a, b, c), respectively. Solid lines show the mean absolute value of the weights (red) and the biases (green) at each training epoch. The dashed lines are the mean ± std for the corresponding color. Note that the green and the red lines almost overlap. We use a tanh DNN with width 800-400-200-100. The learning rate is with batch size .

5 Understanding deep learning

Here, we consider only DNNs with sufficiently many neurons and layers. Since initialization is very important to deep learning, we first discuss small initialization and then discuss other initializations.

5.1 Fitting low-frequency dominant functions

First, we use Eq. (13) to understand the F-Principle observed in Xu et al. (2018) [19] (see Introduction). To show that the F-Principle holds for real data, we train a DNN to fit a natural image, as shown in Fig. 2a—a mapping from position to gray-scale intensity, which is subtracted by its mean and then normalized by its maximal absolute value. As an illustration of the F-Principle, we study the Fourier transform of the image along the horizontal direction for the fixed row marked by the red dashed line in Fig. 2a; we denote this one-dimensional target function in the spatial domain by f(x). On a finite interval, the frequency components of the target function can be quantified by Fourier coefficients computed from the Discrete Fourier Transform (DFT). Note that the frequency in the DFT is discrete. The Fourier coefficient of f(x) for the k-th frequency component is denoted by F[f](k) (a complex number in general), and |F[f](k)| is the corresponding amplitude. Fig. 2b displays |F[f](k)| for the leading frequency components. To examine the convergence behavior of different frequency components during training, we compute the relative difference between the DNN output and f(x) in the frequency domain at each recording step, that is,

ΔF(k) = |F[h](k) - F[f](k)| / |F[f](k)|,   (14)

where h denotes the DNN output. As shown in Fig. 2c, the first four frequency peaks converge from low to high in order.
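A sketch of how the reconstructed Eq. (14) can be computed (assuming the target f and the DNN output h are sampled on the same evenly spaced grid; the small epsilon is ours, added only to avoid division by zero):

import numpy as np

def relative_difference(f_samples, h_samples, eps=1e-12):
    # DFT of the target and of the DNN output on the same grid (rfft: real-valued input)
    F_f = np.fft.rfft(f_samples)
    F_h = np.fft.rfft(h_samples)
    return np.abs(F_h - F_f) / (np.abs(F_f) + eps)

# delta_F = relative_difference(f_samples, h_samples)
# delta_F can then be recorded at the selected frequency peaks at each training step.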

The mechanism underlying this phenomenon is as follows. With small initialization, at the beginning of training the DNN output is close to zero, and hence F[h](k) is also close to zero. Therefore, A(k) in Eq. (4)—the amplitude of the difference between the DNN output and the target function—is close to |F[f](k)|; that is, the gradient descent magnitude at frequency k is proportional to |F[f](k)|. By the definition in Eq. (14), the relative difference ΔF(k) is then independent of k at the beginning. Due to the small initialization, e^{-|πk/w_j|} in Eq. (13) has a large decay rate π/|w_j|. Therefore, the descent direction is mainly determined by the low-frequency components, and initially the convergence speed decays exponentially as the frequency increases. When the DNN is fitting low-frequency components, high-frequency components stay small compared with low-frequency components for the following reason: DNN parameters are small throughout the training, thus e^{-πk/w_j} in Eq. (3) keeps the high-frequency components of each neuron small during the low-frequency convergence; moreover, gradient descent does not drive the phases of the high-frequency components, i.e., b_j k/w_j in Eq. (3), towards any specific direction. Therefore, until the low-frequency components converge, the amplitude of the high-frequency components of the DNN output is a summation of small values with random phases. As shown in the example of Fig. 2d, when the DNN is fitting low-frequency components, high-frequency components stay relatively small.

When a low-frequency component, say k_1, converges, the main contributor to the total descent amount becomes a higher-frequency component, say k_2 > k_1. The contribution from frequency k_2 to the descent amount is constrained by e^{-|πk_2/w_j|}. The gradient descent driven by k_2 would cause the DNN output to deviate from the target function at k_1. If the DNN deviates too much at k_1 again, A(k_1) would dominate the total gradient descent, leading to the re-convergence of k_1. The maximum deviation of the frequency component at k_1 occurs approximately when the descent amounts caused by k_1 and k_2 are comparable, that is,

A(k_1) e^{-|πk_1/w_j|} ≈ A(k_2) e^{-|πk_2/w_j|},   (15)

then, we have

A(k_1) ≈ A(k_2) e^{-π(k_2 - k_1)/|w_j|}.   (16)

Since, for low-frequency dominant functions, A(k_2) is close to |F[f](k_2)| and |F[f](k_2)| ≪ |F[f](k_1)|, the deviation at k_1, A(k_1), is much smaller than |F[f](k_1)|. Therefore, the descent amount caused by k_2 only causes a small fluctuation around F[f](k_1). That is, when the DNN is fitting higher-frequency components, the low-frequency components stay converged, as shown in the example of Fig. 2c.

Viewing snapshots during the training process, we can see that the DNN captures the image from coarse-grained low-frequency components to detailed high-frequency components (Fig. 2e-2g), and the DNN output generalizes well (Fig. 2h), that is, the test error is very close to the training error [8].

The analysis for the image in Fig. 2 is similar for the famous image "Lena", as shown in Fig. 5 in the Appendix. In addition, this theoretical framework can also explain the more complicated training behavior of DNNs fitting high-frequency dominant functions (see Appendix B).


Figure 2: Convergence from low to high frequency for a natural image. The training data are all pixels whose horizontal indexes are odd. (a) True image. (b) |F[f](k)| of the red dashed pixels in (a) as a function of frequency index—note that for the DFT we refer to a frequency component by its frequency index instead of its physical frequency—with selected peaks marked by black dots. (c) ΔF(k) at different training epochs for the different selected frequency peaks in (b). (d) |F[f](k)| (red) and |F[h](k)| (green) at epoch 1369. (e-g) DNN outputs at all pixels at different training steps. (h) Loss functions. We use a DNN with width 500-400-300-200-200-100-100. We train the DNN with the full batch and learning rate . We initialize DNN parameters by a Gaussian distribution with mean 0 and standard deviation .

5.2 Initialization and generalization

Next, we analyze how initialization affects the DNN's generalization ability. We first give a schematic analysis of why small initialization of the DNN's parameters can lead to good generalization while large initialization leads to poor generalization. We then examine our analysis by training DNNs to fit the MNIST dataset and a natural image.

5.2.1 Schematic analysis

Different initializations can result in very different generalization ability of the DNN fit and very different training courses. Here, we schematically analyze the DNN output after training. With finitely many training data points, there is an effective frequency range for the training set, defined as the range in the frequency domain bounded by the Nyquist–Shannon sampling theorem [16] when the sampling is evenly spaced, or by its extensions [20, 13] otherwise. Based on this effective frequency range, we can decompose the Fourier transform of the DNN output into two parts: the effective frequency range and the extra-higher frequency range. For different initializations, since the DNN can fit the training data well, the frequency components within the effective frequency range are the same.
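For evenly spaced samples, this bound can be computed directly from the sample spacing; a minimal sketch (the function name is ours, and frequencies here are in cycles per unit length—multiply by 2π for angular frequency):

import numpy as np

def nyquist_frequency(x_samples):
    dx = np.mean(np.diff(np.sort(x_samples)))   # sample spacing (assumed uniform here)
    return 0.5 / dx                             # highest resolvable frequency, cycles per unit length

x_train = np.linspace(0.0, 1.0, 101)
print(nyquist_frequency(x_train))               # 50 cycles per unit length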

Then, we consider the frequency components in the extra-higher frequency range. The amplitude at each frequency k for node j is controlled by e^{-|πk/w_j|}, with a decay rate π/|w_j|. This decay rate is large for small initialization. Since gradient descent does not drive these extra-higher frequency components towards any specific direction, the amplitudes of the DNN output in the extra-higher frequency range stay small. For large initialization, the decay rate is relatively small; the extra-higher frequency components of the DNN output can then have large amplitudes and fluctuate much more than with small initialization.

A higher-frequency function has higher complexity (see, e.g., the complexity characterization in [18]). With small (large) initialization, the DNN output is a low-complexity (high-complexity) function after training. When the training data capture all important frequency components of the target function, a low-complexity DNN output can generalize much better than a high-complexity one.

5.2.2 Experiments: MNIST dataset

We next use the MNIST dataset to verify the effect of initialization. We use a Gaussian distribution with mean 0 to initialize DNN parameters. For simplicity, we use a two-dimensional vector to denote the standard deviations of the weights and of the bias terms, respectively. Fixing the standard deviation of the bias terms, we consider the effect of different standard deviations of the weights: a small one in Fig. 3a and a large one in Fig. 3b. As shown in Fig. 3, in both cases the DNNs achieve high accuracy on the training data. However, comparing the red dashed line in Fig. 3a with the yellow dashed line in Fig. 3b, the prediction accuracy on the test data is much higher for the small weight standard deviation than for the large one.

Note that the effect of initialization is governed by the weights rather than the bias terms. To verify this, we initialize the bias terms with a different standard deviation. As shown by the black curves in Fig. 3a, this DNN has a slightly slower training course and a slightly lower prediction accuracy, which is suggested by Eq. (10) and Eq. (11).
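A sketch of the initialization scheme used in these experiments (assuming a PyTorch model of Linear layers; the standard-deviation values below are placeholders, and the zero mean is our assumption consistent with the figure captions):

import torch.nn as nn

def init_gaussian(model, weight_std=0.06, bias_std=0.06):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=weight_std)
            nn.init.normal_(m.bias, mean=0.0, std=bias_std)

# init_gaussian(model, weight_std=0.06)   # small weight std: low-complexity output expected
# init_gaussian(model, weight_std=0.6)    # large weight std: noisy, high-frequency output expected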


(a)

(b)
Figure 3: Analysis of the training process of DNNs with different initializations while fitting the MNIST dataset. Shown are the prediction accuracies on the training data and the test data at different training epochs. We use a tanh DNN with width 800-400-200-100. The learning rate is with batch size . DNN parameters are initialized by a Gaussian distribution with mean 0. The legend denotes the standard deviations of the weights and of the bias terms, respectively.

5.2.3 Experiments: natural image

When the DNN fits the natural image in Fig. 2a with small initialization, it generalizes well, as shown in Fig. 2h. Here, we show the case of large initialization. Except for a larger standard deviation, all other settings are the same as in Fig. 2. After training, the DNN captures the training data well, as shown on the left in Fig. 4a. However, the DNN output at the test pixels is very noisy, as shown on the right in Fig. 4a. The loss functions of the training data and the test data in Fig. 4b show that the DNN generalizes poorly. To visualize the high-frequency components of the DNN output after training, we study the pixels along the red dashed lines in Fig. 4a. As shown on the left in Fig. 4c, the DNN accurately fits the training data. However, for the test data, the DNN output fluctuates strongly, as shown on the right in Fig. 4c. The poor generalization and the highly fluctuating DNN output are consistent with our theoretical analysis.


Figure 4: Analysis of the training process of a DNN with large initialization while fitting the image in Fig. 2a. The weights of the DNN are initialized by a Gaussian distribution with mean 0 and a large standard deviation. (a) The DNN outputs at the training pixels (left) and at all pixels (right). (b) Loss functions. (c) DNN outputs at the red dashed positions in (a). We use a DNN with width 500-400-300-200-200-100-100 and learning rate .

6 Discussions

In this work, we have theoretically analyzed the DNN training process under (stochastic) gradient-based methods through Fourier analysis. Our theoretical framework explains the training process when the DNN fits low-frequency dominant functions or high-frequency dominant functions (see Appendix B). Based on this understanding of the training process, we explain why a DNN with small initialization can achieve good generalization. Our theoretical result shows that the initialization of the weights, rather than the bias terms, mainly determines the DNN training and generalization. We exemplify our results through natural images and the MNIST dataset. These analyses are not constrained to low-dimensional functions. Next, we discuss the relation of our results to other studies and some limitations of these analyses.

Weight norm

In this work, we analyze DNNs with sufficiently many neurons and layers such that the mean magnitude of DNN parameters remains almost constant throughout the training. However, if we use a small-scale network, the mean magnitude of DNN parameters often increases to a stable value. Empirically, we found that training increases the magnitude of DNN parameters when the DNN is fitting high-frequency components, for which our theory can provide some insight. If the weights are too small, the decay rate of e^{-πk/w_j} in Eq. (3) will be too large, and since the number of neurons is fixed, the training has to increase the weights to fit the high-frequency components of the target function. By imposing a regularization on the norm of the weights in a small-scale network, we can prevent the training from fitting high-frequency components. Since high-frequency components usually have small power and are easily affected by noise, avoiding them through norm regularization improves the DNN's generalization ability. This discussion is consistent with other studies [15, 14]. We will address the evolution of the mean magnitude of DNN parameters in small-scale networks in future work.

Loss function and activation function

In this work, we use the mean squared error as the loss function and tanh as the activation function. Empirically, we found that with the mean absolute error as the loss function, we can still observe the convergence order from low to high frequency when the DNN fits low-frequency dominant functions. The term A(k) can be replaced by other quantities that characterize the difference between the DNN output and the target function, by which the analysis of Eq. (13) can be extended. The key exponential term comes from the property of the activation function. Note that when computing the Fourier transform of the activation function, we have ignored the windowing effect, which does not change the property that the power of the activation function in the Fourier domain decays as the frequency increases. Therefore, for any activation function whose power decreases as the frequency increases and any loss function that characterizes the difference between the DNN output and the target function, the analysis in this work can be qualitatively extended. The exact mathematical forms for different activation functions and loss functions can differ. We leave this analysis to future work.

Sharp/flat minima and generalization

Since e^{-|πk/w_j|} appears in all gradient forms in Eqs. (6, 7, 8), it will also appear in the second-order derivative of the loss function with respect to any parameter. Here, we only consider DNN parameters of similar magnitude. With smaller weights at a minimum, the DNN has good generalization ability, and the second-order derivative at the minimum is smaller, that is, the minimum is flatter. When the weights are very large, the minimum is very sharp. When the DNN is close to a very sharp minimum, one training step can cause the loss function to deviate from the minimum significantly (see a conceptual sketch in Figure 1 of [9]). We also observe in Fig. 4 that, for large initialization, the loss fluctuates significantly once it is small. Our theoretical analysis thus qualitatively shows that a flatter minimum is associated with better DNN generalization, which resembles the results of other studies [9, 18] (see Introduction).

Early stopping

Our theoretical framework through Fourier analysis explains the F-Principle, in which the DNN gives low-frequency components higher priority, as observed in Xu et al. (2018) [19]. Thus, it also provides insight into early stopping. High-frequency components often have low power [5] and are noisy; with early stopping, we can prevent the DNN from fitting high-frequency components and thereby achieve better generalization.

Noise and real data

Empirical studies have found qualitative differences in gradient-based optimization of DNNs on pure noise vs. real data [21, 1, 19]. We now discuss the mechanism underlying these qualitative differences.

Zhang et al. (2016) [21] found that the convergence time of a DNN increases as the label corruption ratio increases. Arpit et al. (2017) [1] concluded from experiments on MNIST and CIFAR-10 that DNNs do not use brute-force memorization to fit real datasets but instead extract patterns in the data, and suggested that DNNs may learn simple and general patterns first in real data. Xu et al. (2018) [19] found similar results: with a simple visualization of low-dimensional functions in the Fourier domain, they empirically found the F-Principle for low-frequency dominant functions. However, the F-Principle does not apply to pure noise data [19].

To theoretically understand the above empirical findings, we first note that real data in the Fourier domain are usually low-frequency dominant [5], while pure noise data often have no clear dominant frequency components, for example, white noise. For low-frequency dominant functions, DNNs can quickly capture the low-frequency components. However, for pure noise data, the high-frequency components are also important, that is, A(k) in Eq. (13) can be large for large k, so the priority of low frequencies during training is relatively small. During training, the gradients of low and high frequencies can then affect each other significantly. Thus, the training processes for real data and pure noise data are often very different. Therefore, together with the analysis in the Results, the F-Principle applies well to low-frequency dominant functions but not to pure noise data [1, 19]. Since the DNN needs to capture more large-amplitude higher-frequency components when fitting pure noise data, it often requires a longer convergence time [21].
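A small numerical illustration of this distinction (the toy signals, sample count, and cutoff below are placeholders): white noise spreads its power over all frequencies, while a smooth, real-data-like signal concentrates its power at low frequencies.

import numpy as np

rng = np.random.default_rng(0)
n = 1024
x = np.linspace(0.0, 1.0, n, endpoint=False)
smooth = np.sin(2 * np.pi * x) + 0.2 * np.sin(6 * np.pi * x)   # smooth, low-frequency dominant signal
noise = rng.standard_normal(n)                                  # white noise

def low_freq_power_fraction(signal, cutoff=10):
    power = np.abs(np.fft.rfft(signal)) ** 2
    return power[:cutoff].sum() / power.sum()

print(low_freq_power_fraction(smooth))   # close to 1: low-frequency dominant
print(low_freq_power_fraction(noise))    # small: power spread over all frequencies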

Memorization vs. generalization

Traditional learning theory—which restricts capacity (e.g., VC dimension [17]) to achieve good generalization—cannot explain how DNNs can have capacity large enough to memorize randomly labeled datasets [21] yet still generalize well on real datasets. Other studies have shown that sufficiently large DNNs can potentially approximate any function [3]. Our theoretical framework further resolves this puzzle by showing that both the DNN structure and the training dataset affect the effective capacity of DNNs. In the Fourier domain, high-frequency components increase the DNN's capacity and complexity. However, with small initialization, DNNs invoke frequency components from low to high to fit the training data and keep the extra-higher frequency components—those beyond the effective frequency range of the training data—small. Thus, the DNN's learned capacity and complexity are determined by the training data, while the potential for large capacity remains. This dataset-dependent effect is consistent with the speculation of Kawaguchi et al. (2017) [8] that the generalization ability of deep learning is affected by the dataset.

Limitations

i) The theoretical framework in its current form cannot analyze how different DNN structures, such as convolutional neural networks, affect the detailed training and generalization ability of DNNs. We believe that to consider the effect of DNN structure, we need to consider more properties of the dataset beyond the decay of its power as the frequency increases. ii) This qualitative framework cannot distinguish the effect of the number of layers from that of the layer width. To this end, we need an exact mathematical form for the Fourier transform of the output of a DNN with multiple hidden layers, which is left for future work. iii) This theoretical framework cannot analyze the DNN behavior around local minima or saddle points during training.

Acknowledgments

The author thanks Wei Dai, Qiu Yang, and Shixiao Jiang for critically proofreading the manuscript.

Appendix A Fourier transform of the sigmoid function

(The mathematical calculation in this section is inspired by https://math.stackexchange.com/questions/2569814/fourier-transform-of-sigmoid-function.)

The sigmoid function is

σ(x) = 1/(1 + e^{-x}).

The Fourier transform of σ(x) (without loss of generality, the constant factor related to the definition of the corresponding Fourier transform is ignored here; the frequency k is the product of 2π and the inverse of the period) is

F[σ](k) = ∫_{-∞}^{+∞} σ(x) e^{-ikx} dx.   (17)

Consider the case k ≠ 0. For the positive part, since σ(x) = Σ_{n=0}^{∞} (-1)^n e^{-nx} for x > 0, we have

∫_{0}^{+∞} σ(x) e^{-ikx} dx = Σ_{n=0}^{∞} (-1)^n/(n + ik) = Φ(-1, 1, ik).

For the negative part, σ(x) = Σ_{n=1}^{∞} (-1)^{n+1} e^{nx} for x < 0; then,

∫_{-∞}^{0} σ(x) e^{-ikx} dx = Σ_{n=1}^{∞} (-1)^{n+1}/(n - ik) = Φ(-1, 1, 1 - ik),

where Φ is the Lerch Phi function,

Φ(z, s, a) = Σ_{n=0}^{∞} z^n/(n + a)^s.

Since

π/sinh(πz) = 1/z + Σ_{n=1}^{∞} (-1)^n 2z/(z² + n²),

we have

F[σ](k) = 1/(ik) + Σ_{n=1}^{∞} (-1)^n [1/(n + ik) - 1/(n - ik)] = -iπ/sinh(πk),

that is,

F[σ](k) = 2πi/(e^{-πk} - e^{πk}).   (18)

Then, when |k| is large, without loss of generality we take k > 0,

F[σ](k) ≈ -2πi e^{-πk}.   (19)

When k = 0, F[σ](k) in Eq. (17) is infinite. However, we only consider a finite range of x in practice. Thus, the infinity of Eq. (18) at k = 0 is not an issue in practice.
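A quick numerical comparison of the reconstructed closed form in Eq. (18) with its asymptotic in Eq. (19), in magnitude:

import numpy as np

for k in (1.0, 2.0, 4.0):
    exact = np.pi / np.abs(np.sinh(np.pi * k))        # magnitude of Eq. (18)
    asymptotic = 2 * np.pi * np.exp(-np.pi * abs(k))  # magnitude of Eq. (19)
    print(k, exact, asymptotic)                       # close agreement already at moderate |k|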


Figure 5: Convergence from low to high frequency for a natural image, Lena. The training data are all pixels whose horizontal indexes are odd. (a) True image. (b) |F[f](k)| of the red dashed pixels in (a) as a function of frequency index, with important peaks marked by black dots. (c) ΔF(k) at different recording steps for the different selected frequency peaks in (b). (d) |F[f](k)| and |F[h](k)| at recording step 265. (e-g) DNN outputs at all pixels at different steps. (h) Loss functions. We use a DNN with width 500-400-300-200-200-100-100. We train the DNN with the full batch and learning rate . We initialize DNN parameters by a Gaussian distribution with mean 0 and standard deviation .

Appendix B Fitting high-frequency dominant functions

We now explain the training process when a DNN fits a high-frequency dominant function, which is beyond the F-Principle. Similarly, for small initialization, since ΔF(k) is independent of the amplitude at the beginning, a low-frequency component, say k_1, will converge earlier. The frequency of the main contributor to the descent amount would then be a higher one, say k_2. Since A(k_2)—the amplitude of the difference between the DNN output and the target function at k_2—is much larger than A(k_1) for a high-frequency dominant function, from Eq. (15) the gradient caused by the frequency component at k_2 can cause a non-negligible deviation of the DNN output from the target function at k_1. As training by gradient descent proceeds, ΔF(k_1) deviates from zero while A(k_2) decreases. The factor A(k_1) e^{-|πk_1/w_j|} in Eq. (13) would then again make the low frequency an important contributor to the total descent amount; thus, we can observe the low-frequency component converge once more. As the training goes on, A(k_2) decreases, and hence the deviation of the DNN output from the target function at frequency k_1 decreases. Therefore, during training, ΔF(k_1) oscillates with a decreasing oscillation amplitude. Very low-frequency components converge after only a small deviation. Therefore, we can observe that ΔF(k) for very low-frequency components oscillates more but with smaller amplitudes.

For example, we construct such a function as follows. First, we evenly sample points of a low-frequency dominant function on an interval. Second, we perform the DFT of these samples. Third, we flip (reverse the order of) the DFT coefficients, so that the dominant amplitude moves to high frequency, and obtain the target function shown in Fig. 6a and Fig. 6b. In Fig. 6c, except for the highest peak, ΔF(k) of the other frequency components oscillates with decreasing amplitude. For a very low-frequency component, such as the first frequency component in Fig. 6c, ΔF(k) oscillates more, and its amplitude is small compared with the other frequencies in Fig. 6c.
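A sketch of this construction (the base function, interval, and sample count below are placeholders, not the ones used in the figure):

import numpy as np

n = 200
x = np.linspace(0.0, 1.0, n, endpoint=False)
f = np.exp(-(x - 0.5) ** 2 / 0.02)        # a smooth, low-frequency dominant base function

F = np.fft.rfft(f)
F_flipped = F[::-1]                        # reverse the spectrum: the dominant amplitude moves to high frequency
g = np.fft.irfft(F_flipped, n=n)           # a high-frequency dominant target function g(x)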


(a)

(b)

(c)
Figure 6: Frequency-domain analysis for a high-frequency dominant function, whose DFT is obtained by flipping the DFT of an evenly sampled low-frequency dominant function. (a) Red dots: the target function; blue squares (connected by blue lines): the DNN output at the end of training. (b) FFT of the target function. (c) ΔF(k) at different recording steps for different frequency peaks (different curves). We use a tanh DNN with width 200-200-200-200-100. The learning rate is with the full batch. DNN parameters are initialized by a Gaussian distribution with mean 0 and standard deviation 0.1.

References

  • [1] D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al., A closer look at memorization in deep networks, arXiv preprint arXiv:1706.05394, (2017).
  • [2] O. Bousquet and A. Elisseeff, Stability and generalization, Journal of machine learning research, 2 (2002), pp. 499–526.
  • [3] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, 2 (1989), pp. 303–314.
  • [4] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, Sharp minima can generalize for deep nets, arXiv preprint arXiv:1703.04933, (2017).
  • [5] D. W. Dong and J. J. Atick, Statistics of natural time-varying images, Network: Computation in Neural Systems, 6 (1995), pp. 345–358.
  • [6] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic gradient descent, arXiv preprint arXiv:1509.01240, (2015).
  • [7] S. Hochreiter and J. Schmidhuber, Simplifying neural nets by discovering flat minima, in Advances in neural information processing systems, 1995, pp. 529–536.
  • [8] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, Generalization in deep learning, arXiv preprint arXiv:1710.05468, (2017).
  • [9] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima, arXiv preprint arXiv:1609.04836, (2016).
  • [10] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
  • [11] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/, (1998).
  • [12] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), p. 436.
  • [13] M. Mishali and Y. C. Eldar, Blind multiband signal reconstruction: Compressed sensing for analog signals, IEEE Transactions on Signal Processing, 57 (2009), pp. 993–1009.
  • [14] V. Nagarajan and J. Z. Kolter, Generalization in deep networks: The role of distance from initialization, in NIPS workshop on Deep Learning: Bridging Theory and Practice, 2017.
  • [15] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar, Theory of deep learning III: the non-overfitting puzzle, Technical report, CBMM Memo 073, 2018.
  • [16] C. E. Shannon, Communication in the presence of noise, Proceedings of the IRE, 37 (1949), pp. 10–21.
  • [17] V. N. Vapnik, An overview of statistical learning theory, IEEE transactions on neural networks, 10 (1999), pp. 988–999.
  • [18] L. Wu, Z. Zhu, and W. E, Towards understanding generalization of deep learning: Perspective of loss landscapes, arXiv preprint arXiv:1706.10239, (2017).
  • [19] Z.-Q. J. Xu, Y. Zhang, and Y. Xiao, Training behavior of deep neural network in frequency domain, arXiv preprint arXiv:1807.01251, (2018).
  • [20] J. Yen, On nonuniform sampling of bandwidth-limited signals, IRE Transactions on circuit theory, 3 (1956), pp. 251–257.
  • [21] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530, (2016).