Training behavior of deep neural network in frequency domain
Abstract
Why deep neural networks (DNNs) capable of overfitting often generalize well in practice is a mystery in deep learning. Existing works indicate that this observation holds for both complicated real datasets and simple datasets of one-dimensional (1d) functions. In this work, for general low-frequency dominant 1d functions, we find that a DNN with common settings first quickly captures the dominant low-frequency components, and then relatively slowly captures the high-frequency ones. We call this phenomenon the Frequency Principle (F-Principle). In our experiments, the F-Principle can be observed over various DNN setups with different activation functions, layer structures and training algorithms. The F-Principle can be used to understand (i) the behavior of DNN training in the information plane and (ii) why DNNs often generalize well despite their ability to overfit. The F-Principle can potentially provide insight into the general principle underlying DNN optimization and generalization for real datasets.
Zhi-Qin J. Xu†, Yaoyu Zhang, Yanyang Xiao
New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates
{zhiqinxu,yz1961,yx742}@nyu.edu
†This work was done while Xu was a visiting member at the Courant Institute of Mathematical Sciences, New York University, New York, United States. Code is available at Xu's homepage.
Preprint. Work in progress.
1 Introduction
Although Deep Neural Networks (DNNs) are fully transparent, i.e., the value of each node and the strength of each connection can be easily obtained, it is difficult to interpret how information is processed through them. We can easily record the trajectories of the parameters of DNNs during training. However, it is very difficult to understand the principle underlying the highly non-convex problem of DNN optimization [7]. Therefore, despite this transparency, the DNN is often criticized for being a "black box" [1, 14]. Even for the simple problem of fitting one-dimensional (1d) functions, the training process of a DNN is still not well understood [12, 15]. For example, Wu et al. (2017) [15] used DNNs with different numbers of layers to fit a few data points sampled from a 1d third-order polynomial. They found that even when a DNN is capable of overfitting, i.e., when the number of its parameters is much larger than the size of the training set, it generalizes well (i.e., no overfitting) when it is well trained on the training set. In practice, DNNs often generalize well for much more complicated datasets even when they are capable of overfitting [5, 9, 18, 17, 8, 15]. Intuitively, when a network is capable of overfitting, its optimization solution lies in a huge space in which well-generalized solutions occupy only a small subset. It is therefore mysterious that DNN optimization can often reach a well-generalized solution while ignoring a huge space of overfitting solutions. To understand the underlying mechanism, in this work, we analyze in detail the principle underlying the DNN optimization process using a class of 1d functions. Our work could potentially provide insight into how DNNs behave in general during training and why they tend toward solutions with good generalization performance.
Motivated by the spectrum of natural signals (e.g., image and sound), which typically have more power at low frequencies and decreasing power at high frequencies [3], we design a type of 1d function dominated by low-frequency components to study the training of DNNs. We find that, for this type of function, DNNs with common settings first quickly capture the dominant low-frequency components while keeping the high-frequency ones small, and then relatively slowly capture those high-frequency components. We call this phenomenon the Frequency Principle (F-Principle). In our numerical experiments, the F-Principle can be widely observed over different neuron numbers (tens to thousands in each layer), layer numbers (one to tens), training algorithms (gradient descent, stochastic gradient descent, Adam) and activation functions (tanh and ReLU). According to the F-Principle, DNNs naturally adopt a strategy that some numerical algorithms also use to achieve remarkable efficiency, namely, fitting the target function progressively in ascending frequency order. These numerical algorithms include, for example, the multigrid method for solving large-scale partial differential equations [4] and a recent numerical scheme that efficiently fits the three-dimensional structure of proteins and protein complexes from noisy two-dimensional images [2]. The F-Principle can be used to understand the following important phenomena: i) the behavior of DNN training in the information plane; ii) why DNNs capable of overfitting often generalize well.
Shwartz-Ziv and Tishby (2017) [14] claimed that there are two phases in the information plane during DNN training, namely empirical error minimization and representation compression. Although there is an ongoing debate about this two-phase characterization for the hidden layers, the characterization seems to hold well for the output layer during training, no matter whether tanh or the rectified linear unit (ReLU) is used as the activation function [14, 12]. They also hypothesized that the compression results from the random diffusion-like behavior of stochastic gradient descent (SGD), which could lead to the excellent generalization performance of DNNs [14]. In this work, we demonstrate that there can be no compression phase in the training course of DNNs for certain continuous functions. By discretizing these functions, we can observe the compression phase. Note that, for the discretized functions, the compression phase always appears regardless of the training method. The compression phase can be well explained by the F-Principle as follows. The DNN first fits the continuous low-frequency components of the discretized function. Then, the DNN output discretizes as the network gradually captures the high-frequency components. By the definition of entropy, this discretization process naturally reduces the entropy of the DNN output. Thus, the compression phase appears in the information plane.
Using the F-Principle, we can also explain why DNNs often generalize well empirically despite their ability to overfit [5, 9, 18, 17, 8, 15]. For a finite-size training set, there exists an effective frequency range [13, 16, 10] beyond which the information of the signal is lost. By the F-Principle, with no constraint on the high-frequency components beyond the effective frequency range, DNNs tend to keep them small. For a wide class of low-frequency dominant natural signals (e.g., image and sound), this tendency coincides with their decaying power at high frequencies. Thus, DNNs often generalize well in practice. When the training data is noisy, the small-amplitude high-frequency components are more easily contaminated. By the F-Principle, DNNs first capture the less noisy dominant frequency components of the training data while keeping the other frequency components small. At this stage, although the loss function is not optimally minimized on the training data, DNNs can have better generalization performance because they have not fitted the noise in the higher-frequency components. Therefore, as widely observed, better generalization performance can often be achieved with early stopping. Also, for noisy training data, we demonstrate that DNNs can gradually fit the noise at high frequencies during the compression phase, in accordance with the F-Principle. In this case, contrary to the claim in Shwartz-Ziv and Tishby (2017) [14], the compression phase actually worsens the generalization performance of DNNs.
2 Methods
We use a DNN with four hidden layers of widths 200-200-200-100. The activation function is either tanh or the rectified linear unit (ReLU). Note that the output layer is a linear transformation of the previous layer with no activation function. We record the state of the DNN every fixed number of training epochs, which is referred to as one recording step. By "common settings", we refer to a setting of small initial weights and sufficiently many neurons and hidden layers, with which a DNN can well fit the target function.
Unless otherwise specified, we use the following setup in our results. The parameters of the DNN are initialized following a Gaussian distribution with mean zero and a small standard deviation. The activation function for each neuron is tanh. The DNN is trained by a full-batch Adam optimizer with a small learning rate; parameters of the Adam optimizer are set to the default values [6]. The loss function is the mean squared error between the DNN outputs and the sample labels on the training set.
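The setup above can be sketched in a few lines of NumPy. This is a minimal illustration only: the 1d input/output, the standard deviation 0.1, and the toy target are illustrative assumptions, and the Adam training loop of the paper is omitted, with only the forward pass and the full-batch mean-squared-error loss shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden-layer widths 200-200-200-100 from the Methods section; the 1d
# input/output and the standard deviation 0.1 are illustrative choices.
widths = [1, 200, 200, 200, 100, 1]
params = [(rng.normal(0.0, 0.1, (m, n)), rng.normal(0.0, 0.1, n))
          for m, n in zip(widths[:-1], widths[1:])]

def forward(params, x):
    """tanh hidden layers, linear output layer (no activation)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

# Full-batch mean-squared-error loss on a toy 1d target; the paper trains
# these parameters with Adam, which is omitted here for brevity.
x = np.linspace(0, 1, 201).reshape(-1, 1)
y = np.sin(2 * np.pi * x)
loss = np.mean((forward(params, x) - y) ** 2)
```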
3 F-Principle
Natural signals (e.g., image and sound) generally have large power at low frequencies and decreasing power at high frequencies [3]. This observation motivates us to design a class of functions with dominant low-frequency components to study the training process of DNNs. In particular, we are interested in how and when each frequency component is captured during the training of DNNs. We design a target function f(x) by discretizing a smooth function g(x) as follows,
(1)  $f_{\alpha}(x) = \alpha\,\mathrm{Round}\!\left(g(x)/\alpha\right)$
where Round(·) takes the nearest integer value. When α = 0, we define f_0(x) = g(x). A feature of this type of function is that, with the same g(x) but different α's, the functions share the same dominant low-frequency components.
Over a finite interval, the frequency components of a target function are quantified by the Fourier coefficients computed from the Discrete Fourier Transform (DFT). Note that, because the frequencies in the DFT are discrete, we can refer to a frequency component by its index instead of its physical frequency. The Fourier coefficient of f(x) for the k-th frequency component is denoted by F[f](k) (a complex number in general), and |F[f](k)| is the corresponding amplitude, where |·| denotes the modulus; we call k the frequency index. Fig. 1a displays the leading frequency components of f(x) for two values of α in Eq. (1). Clearly, the dominant first peaks coincide for both values. To examine the convergence behavior of different frequency components during the training of a DNN, we compute the relative difference between the DNN output and f(x) in the frequency domain at each recording step, i.e.,
(2)  $\Delta_F(k) = \left|F[h](k) - F[f](k)\right| \big/ \left|F[f](k)\right|$
where h denotes the DNN output. Fig. 1b shows Δ_F(k) of the leading frequency components at each recording step. It is hard to discern what principle the DNN follows when considering all frequency components. Theoretically, frequency components other than the peaks are susceptible to the artificial periodic boundary condition implicitly applied in the DFT, and thereby are not essential to our frequency-domain analysis [11]. Therefore, in the following, we focus only on the convergence behavior of the frequency peaks during the training. As shown in Fig. 1c, Δ_F(k) at each frequency peak (marked by black dots in Fig. 1a) converges in a precise order from low to high frequency. When we use ReLU as the activation function or use non-uniformly sampled training data, we still see this precise order of convergence of the frequency peaks (Fig. 6 in Appendix A.2). We call this phenomenon the F-Principle. In the following, we test the validity of the F-Principle under different setups and illustrate how it relates to the turning points of the information-plane trajectory and of the loss function during the training of DNNs.
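The quantities above can be sketched as follows. The choices of g(x), α, and the grid size are illustrative stand-ins, and the discretization formula is one plausible reading of Eq. (1); the relative-difference helper assumes F[f](k) is nonzero at the frequencies of interest (the peaks).

```python
import numpy as np

def f_alpha(g_vals, alpha):
    """Discretization of g; alpha = 0 leaves g unchanged. One plausible
    reading of Eq. (1): f_alpha = alpha * Round(g / alpha)."""
    if alpha == 0:
        return g_vals
    return alpha * np.round(g_vals / alpha)

def rel_diff(h_vals, f_vals):
    """Relative difference Delta_F(k) of Eq. (2) per frequency index k.
    Assumes F[f](k) != 0 at the indices of interest (the peaks)."""
    Fh, Ff = np.fft.fft(h_vals), np.fft.fft(f_vals)
    return np.abs(Fh - Ff) / np.abs(Ff)

x = np.linspace(0, 1, 256, endpoint=False)
g = np.sin(2 * np.pi * x) + 0.5 * np.sin(6 * np.pi * x)  # toy low-frequency-dominant g
f = f_alpha(g, 0.5)

amp = np.abs(np.fft.fft(f))[:128]    # one-sided amplitude spectrum
peak = int(np.argmax(amp[1:]) + 1)   # dominant (non-DC) frequency index
```

For this toy g the dominant peak of the discretized f stays at frequency index 1, matching the observation that different α's share the same dominant low-frequency components.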
We denote T_k as the last recording step at which Δ_F(k) is larger than a given threshold. Fig. 1d shows T_{k_j} for each frequency peak, where k_j denotes the frequency index of the j-th frequency peak; the first peak converges very fast. Fig. 1e shows the entropy of the DNN output T, H(T), and the mutual information between T and the true output Y, I(T;Y), at each recording step. Note that a detailed discussion of mutual information can be found in Appendix A.1. Fig. 1f shows the trajectories of the loss functions on the training data and the test data. Two green dashed lines are drawn at two characteristic recording steps from the frequency-domain analysis. It is clear from Fig. 1e and Fig. 1f that these two steps are respectively close to the two turning points observed in the information plane and in the loss functions. Therefore, our frequency-domain analysis provides a novel perspective for visualizing the training process of DNNs and understanding the transition points in the information plane or in the loss-function trajectory.
We have performed the same frequency-domain analysis for various low-frequency dominant functions, i.e., several other smooth g(x)'s with different α's (data not shown), and the F-Principle always holds during the training of DNNs. For illustration, we show another example in Fig. 2a, i.e., a different g(x) in Eq. (1), in which the amplitudes of the low-frequency peaks do not monotonically decrease. In this case, we can still observe a precise convergence order from low to high frequency for the frequency peaks, as shown in Fig. 2b and Fig. 2c. Therefore, the F-Principle seems to be an intrinsic behavior of DNN optimization that cannot be explained by the amplitude differences at different frequencies. Note that, for the output of a DNN outside the fitting range, the F-Principle also holds in that the output is kept "flat" (i.e., low-frequency dominant), as shown in Fig. 7 in Appendix A.3.
Similar to Zhang et al. (2016) [17], the DNN used in our experiments is capable of fitting a dataset with randomized labels. For a randomized dataset, the DFT is complicated, with random fluctuations. In this case, the convergence order of the frequency peaks follows no obvious rule (see Appendix A.4 for details). That is, the F-Principle no longer holds.
In summary, we find that, for a general class of functions dominated by low-frequency components, the training course of DNNs follows the F-Principle, in which low-frequency components are captured first, followed by high-frequency components. For a function with no clear decaying trend in its Fourier coefficients, e.g., a function with randomized labels, there is no clear rule underlying the optimization of DNNs.
4 Analysis of the F-Principle
As the loss function drives the whole training process of a DNN, one may suspect that it is the difference in driving forces, i.e., the gradients of the loss function, at different frequencies that leads to the phenomenon of the F-Principle. To investigate this possibility, we rewrite the loss function in Eq. (3) into its frequency-domain equivalent in Eq. (4) as follows.
(3)  $L_s = \frac{1}{n}\sum_{i=1}^{n}\left(h(x_i) - f(x_i)\right)^2$
(4)  $L_F = \sum_{k} w(k)\left|F[h](k) - F[f](k)\right|^2$
where subscripts s and F respectively denote the spatial domain and the weighted frequency domain, and n is the sample size. When w(k) = 1 for all frequencies, L_s = L_F according to Parseval's theorem (without loss of generality, the constant factor related to the definition of the corresponding Fourier transform is ignored here). This equality gives us a frequency-domain decomposition of the loss function Eq. (3) as well as of its gradient.
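The Parseval identity relating Eq. (3) and Eq. (4) can be checked numerically. The sketch below uses random stand-ins for the DNN output and the target; with NumPy's unnormalized FFT convention, the matching constant factor (the one ignored in the footnote) is 1/n².

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128
h = rng.normal(size=n)   # stand-in for DNN outputs on the sample grid
f = rng.normal(size=n)   # stand-in for target values

# Spatial-domain loss L_s, Eq. (3).
L_s = np.mean((h - f) ** 2)

# Frequency-domain loss L_F, Eq. (4), with uniform weights w(k) = 1.
# With NumPy's unnormalized FFT, the matching constant is 1 / n**2.
diff_hat = np.fft.fft(h) - np.fft.fft(f)
L_F = np.sum(np.abs(diff_hat) ** 2) / n ** 2
```

The two losses agree to floating-point precision, so any gradient-based difference between frequencies must come from how the optimization treats the terms, not from the loss decomposition itself.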
A natural conjecture would be that the weight w(k) determines the relative speed of convergence towards the target function for each frequency component. However, in Fig. 1c, with the frequency-wise equally weighted loss function Eq. (3), lower-frequency peaks clearly converge faster than higher-frequency ones. Therefore, we hypothesize that the training process of DNNs implicitly endows lower-frequency components with higher priority by ignoring the higher-frequency components while the lower-frequency components are not yet well captured. To justify this hypothesis, we can manipulate w(k) so that the weights of the higher-frequency components are set to zero at the early stage of training, and check whether the same convergence behavior as in Fig. 1c can be observed.
Specifically, we train DNNs from low to high frequency to fit f(x) in Eq. (1). Denote by k_j the frequency index of the j-th frequency peak in Fig. 1a. For the j-th frequency peak, we set the weights for the first half of the frequency domain as w(k) = 1 for k ≤ k_j and w(k) = 0 for k > k_j. The second half of the weights is determined by the symmetry constraint. During training, j goes from the first to the last peak in order and is increased by one whenever Δ_F(k_j) falls below a preset threshold. As shown in Fig. 3a, the behavior of Δ_F(k) at different recording steps for different frequency peaks is similar to that in Fig. 1c. This observation conforms with our hypothesis that the higher-frequency components are somehow ignored at the early stage of training.
To further validate our hypothesis, we instead train DNNs from high to low frequency to fit the same f(x) in Eq. (1). For the j-th frequency peak, we set the weights for the first half of the frequency domain as w(k) = 1 for k ≥ k_j and w(k) = 0 for k < k_j. The second half of the weights is determined by the symmetry constraint. During training, j goes from the last to the first peak in order and is decreased by one whenever Δ_F(k_j) falls below the threshold or the number of training epochs used for fitting the j-th peak reaches a preset limit. We add the latter condition because the convergence speed for high-frequency components in this approach is often very slow. Fig. 3b shows that Δ_F(k) starts to converge properly only after all frequency components are used for training. This observation further justifies our hypothesis that higher-frequency components indeed have lower priority, as they converge efficiently and properly only when the lower-frequency components are well captured. Therefore, the F-Principle is an intrinsic behavior of the DNN optimization process, which implicitly endows lower-frequency components with higher priority.
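The weight-mask construction used in the two experiments above can be sketched as follows. The function names are illustrative; the mask keeps frequency indices up to a cutoff k_cut in the first half of the spectrum and mirrors them in the second half, which is the symmetry constraint mentioned in the text (for real signals the DFT is Hermitian-symmetric, so the masked loss stays real-valued).

```python
import numpy as np

def lowpass_weights(n, k_cut):
    """Frequency weights w(k) for an n-point DFT: 1 for indices up to
    k_cut, 0 above, with the second half mirroring the first to respect
    the Hermitian symmetry of real signals."""
    w = np.zeros(n)
    w[:k_cut + 1] = 1.0
    if k_cut > 0:
        w[-k_cut:] = 1.0   # mirror of indices 1..k_cut
    return w

def masked_freq_loss(h, f, w):
    """Weighted frequency-domain loss of Eq. (4), restricted by w."""
    diff_hat = np.fft.fft(h) - np.fft.fft(f)
    return np.sum(w * np.abs(diff_hat) ** 2) / len(h) ** 2
```

With all weights set to one, the masked loss reduces to the ordinary spatial mean-squared error by Parseval's theorem; the high-to-low variant of the experiment simply flips which side of k_cut receives weight 1.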
5 F-Principle in understanding DNNs
5.1 Compression vs. no compression in the information plane
Through an empirical exploration of DNN training in the information plane, Shwartz-Ziv and Tishby (2017) [14] claimed that (i) information compression is a general process; (ii) information compression is induced by SGD; (iii) information compression is good for generalization. They suggested that analyzing the behavior of DNN training in the information plane can yield a deeper understanding of why DNNs perform well in practice. In this subsection, we demonstrate how the F-Principle can be used to understand the behavior of the training process of DNNs in the information plane.
We first demonstrate how compression can appear or disappear by tuning the parameter α in Eq. (1), using full-batch gradient descent (GD) without stochasticity. In our simulations, the DNN well fits f(x) for both values of α after training (see Fig. 4a and c). In the information plane, there is no compression phase for α = 0 (see Fig. 4b). By increasing α in Eq. (1), we observe that: i) the fitted function is discretized with only a few possible outputs (see Fig. 4c); ii) the compression of H(T) appears (see Fig. 4d). For α > 0, the behavior in the information plane is similar to previous results [14]. To understand why compression happens for α > 0, we next examine the training courses for different α in the frequency domain.
A key feature of the class of functions in Eq. (1) is that the dominant low-frequency components of f(x) with different α's are the same (see Fig. 1a). By the F-Principle, the DNN first captures those dominant low-frequency components; thus, the training courses for different α's are similar at the beginning, i.e., i) the DNN output is close to g(x) at certain training epochs (blue lines in Fig. 4a and c); ii) the entropy H(T) in the information plane increases rapidly until it reaches a value close to the entropy of g(x), i.e., H(g). For α = 0, the target function is g(x) itself; therefore, H(T) moves closer and closer to H(g) during the training. For α > 0, the entropy of the target function, H(f_α), is much less than H(g). In the latter stage of capturing the high-frequency components, the DNN output converges to the discretized function f_α. Therefore, H(T) decreases from H(g) to H(f_α).
Actually, for α = 0, both H(T) and the mutual information would theoretically converge to H(g). Thus, α = 0 is a trivial case in which the compression phase cannot occur. This analysis is also applicable to other continuous functions. For any discretized function, the DNN first fits the low-frequency components of the discretized function with a continuous function. Then, the DNN output converges to discretized values as the network gradually captures the high-frequency components. By the definition of entropy, this discretization process naturally reduces the entropy of the DNN output. Thus, the compression phase appears in the information plane. As discretization is in general inevitable for classification problems with discrete labels, we can often observe information compression in practice, as described in the previous study [14].
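The entropy-reduction mechanism behind the compression phase can be illustrated directly: discretizing a continuous target collapses many output values into a few, which lowers the empirical entropy. The target g(x), the discretization step, and the bin size below are illustrative choices.

```python
import numpy as np

def empirical_entropy(values, bin_size=0.05):
    """Entropy (in bits) of values after binning, in the spirit of
    Appendix A.1; the bin size is an illustrative choice."""
    binned = np.round(np.asarray(values) / bin_size) * bin_size
    _, counts = np.unique(binned, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

x = np.linspace(0, 1, 1000, endpoint=False)
g = np.sin(2 * np.pi * x)            # continuous target (alpha = 0)
coarse = 0.5 * np.round(g / 0.5)     # heavily discretized target
H_g = empirical_entropy(g)
H_coarse = empirical_entropy(coarse)
```

The discretized target takes only a handful of distinct values, so H_coarse is well below H_g; a DNN output converging to the discretized target must therefore lose entropy, which is exactly the compression observed in the information plane.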
The issue of generalization, the subject of the third claim above [14], is discussed in detail in the next subsection.
5.2 Generalization
Why DNNs capable of overfitting often generalize well is a mystery in deep learning [5, 9, 18, 17, 8, 15]. By the F-Principle, this mystery can be naturally explained as follows. For a class of functions dominated by low frequencies, with finitely many training data points, there is an effective frequency range for this training set, defined as the range in the frequency domain bounded by the Nyquist-Shannon sampling theorem [13] when the sampling is evenly spaced, or by its extensions [16, 10] otherwise. When the number of parameters of a DNN is greater than the size of the training set, the DNN can overfit these sampled data points (i.e., the training set) with different amounts of power outside the effective frequency range. However, by the F-Principle, the training process implicitly biases the DNN towards a solution with low power in the high-frequency components outside the effective frequency range. For functions dominated by low frequencies, this bias coincides with their intrinsic feature of low power at high frequencies, naturally leading to a well-generalized solution after training. By the above analysis, we can predict that, in the case of insufficient training data, when the higher-frequency components are not negligible, e.g., when there exists a significant frequency peak above the effective frequency range, the DNN cannot generalize well after training.
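The effective frequency range can be illustrated with a small aliasing example: on n evenly spaced samples, the frequency indices k and k + n produce identical sample values, so the training set alone places no constraint on components above this range. The grid size and frequencies below are arbitrary choices.

```python
import numpy as np

n = 32                                    # number of evenly spaced samples
xj = np.arange(n) / n                     # sample grid on [0, 1)
low = np.sin(2 * np.pi * 3 * xj)          # frequency index 3
high = np.sin(2 * np.pi * (3 + n) * xj)   # frequency index 3 + n

# low and high coincide on every sample point (aliasing), so the training
# data cannot distinguish them; a bias toward low power at high
# frequencies is what selects the smooth interpolant.
```

Both candidate functions fit the training samples exactly, yet only the low-frequency one generalizes between the samples; the F-Principle's implicit bias picks that one.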
In another scenario, where the training data is contaminated by noise, early stopping is usually applied to avoid overfitting in practice [8]. By the F-Principle, early stopping helps avoid fitting the noisy high-frequency components. It thus naturally leads to a well-generalized solution that accurately captures the dominant low-frequency components while keeping the high-frequency ones small. We use the following example for illustration.
As shown in Fig. 5a, we consider f(x) in Eq. (1). For each sample x, we add to f(x) a noise term ε, where ε follows a Gaussian distribution with mean zero and a small standard deviation. The training set and the test set consist of data points evenly sampled in the fitting interval. The DNN can well fit the sampled training set, as the loss function on the training set decreases to a very small value (green stars in Fig. 5b). However, the loss function on the test set first decreases and then increases (red dots in Fig. 5b). That is, the generalization performance of the DNN worsens after a certain training step. In Fig. 5c, the DFTs of the training data (red) and the test data (black) overlap only around the dominant low-frequency components. This indicates that the noise in the samples has a much larger impact on the high-frequency components. Around a certain step (green dashed lines in Fig. 5b and d), the DNN well captures the dominant peak, as shown in Fig. 5c, and the loss function on the test set attains its minimum and starts to increase (red dots in Fig. 5b). These phenomena are consistent with our analysis above: early stopping helps prevent fitting the noisy high-frequency components and thus naturally leads to better generalization performance of DNNs.
At this step, the entropy of the DNN output, H(T), stays around its maximum (see Fig. 5d). After that, H(T) gradually compresses (see Fig. 5d) as the DNN starts to capture the noisy high-frequency components, leading to worse generalization performance, as shown in Fig. 5b. Therefore, contrary to the third claim (in Section 5.1) [14], compression can worsen the generalization performance of DNNs.
6 Discussion and conclusion
In this work, we find a general principle, the F-Principle, underlying the optimization process of DNNs fitting 1d functions. Specifically, for a function with dominant low-frequency components, DNNs with common settings first capture these low-frequency components while keeping the high-frequency ones small. In our experiments, this phenomenon is widely observed for DNNs with different neuron numbers (tens to thousands in each layer), layer numbers (one to tens), training algorithms (GD, SGD, Adam), and activation functions (tanh and ReLU). From our analysis, the F-Principle is an intrinsic property of the DNN optimization process under common settings for fitting 1d functions. It can well explain the compression phase in the information plane as well as the good generalization performance of DNNs often observed in experiments. Our findings could potentially provide insight into the behavior of the DNN optimization process in general.
Note that initializing weights with large values can complicate the phenomenon of the F-Principle. In the previous experiments, DNNs are initialized by a Gaussian distribution with mean zero and a small standard deviation, and their training behavior follows the F-Principle. However, with large initialization values, the F-Principle no longer holds (see Fig. 9b in Appendix A.5), and the training of the DNN is much slower compared with small-value initialization (see Fig. 9c in Appendix A.5). More importantly, these two initialization strategies can result in very different generalization performance. When the weights of the DNN are initialized with small values, the initial DNN output is flat (see Fig. 10d in Appendix A.5). In contrast, the initial DNN output fluctuates strongly when the standard deviation of the initialization is large (see Fig. 10a in Appendix A.5); the bias terms are always initialized with standard deviation 0.1. For both initializations, DNNs can well fit the training data. However, on test data, the DNN with small initialization values generalizes well, whereas the DNN with large initialization values clearly overfits, as shown in Fig. 10. Intuitively, the above phenomenon can be understood as follows. Without explicit constraints on the high-frequency components beyond the effective frequency range of the training data, the DNN output after training tends to inherit these high-frequency components from the initial output. Therefore, with large initialization values, DNN outputs can easily overfit the target function with fluctuating high-frequency components. In practice, the weights of DNNs are often randomly initialized with standard deviations close to zero [15]. As suggested by our analysis, the small-initialization strategy may implicitly lead to a more efficient and well-generalized optimization process of DNNs, as characterized by the F-Principle for 1d functions.
Acknowledgments
The authors want to thank David W. McLaughlin for helpful discussions and thank Qiu Yang, Zheng Ma, and Tao Luo for critically reading the manuscript. ZX, YZ, and YX are supported by the NYU Abu Dhabi Institute G1301.
References

[1] G. Alain and Y. Bengio, Understanding intermediate layers using linear classifier probes, arXiv preprint arXiv:1610.01644, (2016).
[2] A. Barnett, L. Greengard, A. Pataki, and M. Spivak, Rapid solution of the cryo-EM reconstruction problem by frequency marching, SIAM Journal on Imaging Sciences, 10 (2017), pp. 1170–1195.
[3] D. W. Dong and J. J. Atick, Statistics of natural time-varying images, Network: Computation in Neural Systems, 6 (1995), pp. 345–358.
[4] W. Hackbusch, Multigrid methods and applications, vol. 4, Springer Science & Business Media, 2013.
[5] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, Generalization in deep learning, arXiv preprint arXiv:1710.05468, (2017).
[6] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
[7] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436–444.
[8] J. Lin, R. Camoriano, and L. Rosasco, Generalization properties and implicit regularization for multiple passes SGM, in International Conference on Machine Learning, 2016, pp. 2340–2348.
[9] C. H. Martin and M. W. Mahoney, Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior, arXiv preprint arXiv:1710.09553, (2017).
[10] M. Mishali and Y. C. Eldar, Blind multiband signal reconstruction: Compressed sensing for analog signals, IEEE Transactions on Signal Processing, 57 (2009), pp. 993–1009.
[11] D. B. Percival and A. T. Walden, Spectral analysis for physical applications, Cambridge University Press, 1993.
[12] A. M. Saxe, Y. Bansal, J. Dapello, and M. Advani, On the information bottleneck theory of deep learning, International Conference on Learning Representations, (2018).
[13] C. E. Shannon, Communication in the presence of noise, Proceedings of the IRE, 37 (1949), pp. 10–21.
[14] R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information, arXiv preprint arXiv:1703.00810, (2017).
[15] L. Wu, Z. Zhu, and W. E, Towards understanding generalization of deep learning: Perspective of loss landscapes, arXiv preprint arXiv:1706.10239, (2017).
[16] J. Yen, On nonuniform sampling of bandwidth-limited signals, IRE Transactions on Circuit Theory, 3 (1956), pp. 251–257.
[17] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530, (2016).
[18] G. Zheng, J. Sang, and C. Xu, Understanding deep learning generalization by maximum entropy, arXiv preprint arXiv:1711.07758, (2017).
Appendix A
A.1 Computation of information
For any random variables X and Y with a joint distribution p(x, y): the entropy of X is defined as $H(X) = -\sum_{x} p(x)\log p(x)$; their mutual information is defined as $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$; the conditional entropy of Y on X is defined as $H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x)$.
By the construction of the DNN, its output T is a deterministic function of its input X; thus, H(T|X) = 0 and I(X;T) = H(T). To compute entropy numerically, we evenly bin X, Y, and T to X̄, Ȳ, and T̄ with a bin size Δ as follows. For any value v, its binned value is defined as Round(v/Δ)·Δ. In our work, I(X;T) and I(Y;T) are approximated by I(X̄;T̄) and I(Ȳ;T̄), respectively, with a small Δ. Note that, after binning, one value of X̄ may map to multiple values of T̄. Thus, H(T̄|X̄) ≥ 0 and I(X̄;T̄) ≤ H(T̄). The difference vanishes as the bin size shrinks. Therefore, with a small bin size, H(T̄) is a good approximation of I(X̄;T̄). In experiments, we also find that H(T̄) and I(X̄;T̄) behave almost the same in the information plane for the default bin size (data not shown).
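The binning procedure above can be sketched as follows. The bin size, the sample grid, and the target function are illustrative choices; the mutual information is computed from empirical frequencies via I(X;T) = H(X) + H(T) − H(X,T).

```python
import numpy as np
from collections import Counter

def binned(values, delta=0.05):
    """Bin each value t to Round(t / delta) * delta (delta illustrative)."""
    return np.round(np.asarray(values) / delta) * delta

def entropy(values):
    """Empirical entropy (in bits) of a discrete sample."""
    counts = np.array(list(Counter(values.tolist()).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, t):
    """I(X; T) = H(X) + H(T) - H(X, T) from empirical samples."""
    joint = Counter(zip(x.tolist(), t.tolist()))
    pj = np.array(list(joint.values()), dtype=float)
    pj /= pj.sum()
    h_joint = float(-np.sum(pj * np.log2(pj)))
    return entropy(x) + entropy(t) - h_joint

x = np.linspace(0, 1, 200)
t = binned(np.sin(2 * np.pi * x))  # T is a deterministic function of X
```

Since T is a deterministic function of X and the inputs are distinct, I(X;T̄) equals H(T̄) here, matching the identity stated in the text.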
A.2 Examples of the F-Principle
For illustration, Fig. 6 shows that the F-Principle holds when we use ReLU as the activation function (see Fig. 6a) or use non-uniformly sampled training data (see Fig. 6b) for DNNs.
A.3 DNN outputs outside the fitting range
We observe in our experiments that, outside the fitting range of the training data, the DNN output is very flat. For illustration, we train a DNN to fit an oscillating function. At the end of training, the DNN well captures the function inside the fitting range, whereas, outside the fitting range, the output is very flat with no oscillation, as shown in Fig. 7. This indicates that, without constraints outside the fitting range, the DNN keeps its output low-frequency (close to frequency 0) dominant there.
A.4 Fitting data with complicated DFT
The following case demonstrates that, for a function with a complicated DFT, the F-Principle may not apply. For the example of random labels shown in Fig. 8a, the DFT is shown in Fig. 8b. The amplitudes of the frequency components fluctuate strongly over all frequencies. In this case, we find no obvious rule in the capturing order (see Fig. 8c), even when considering only the peaks (see Fig. 8d).
A.5 Initialization
Fig. 9 and Fig. 10 show the training processes of DNNs with different initializations (the bias terms are always initialized with standard deviation 0.1). More discussion of initialization can be found in the Discussion of the main text.