Frequency Principle in Deep Learning with General Loss Functions and Its Potential Application
Abstract
Previous studies have shown that deep neural networks (DNNs) with common settings often capture target functions from low to high frequency, a phenomenon called the Frequency Principle (F-Principle). It has also been shown that the F-Principle can help explain the good generalization ability often observed in DNNs. However, previous studies focused on the mean-square-error loss, while various loss functions are used in practice. In this work, we show that the F-Principle holds for general loss functions (e.g., mean square error, cross entropy). In addition, the DNN's F-Principle may be exploited to develop numerical schemes for problems that benefit from fast convergence of low frequencies. As an example of this potential usage, we apply a DNN to solve differential equations, for which conventional methods (e.g., the Jacobi method) are usually slow because they converge from high to low frequency.
Zhi-Qin John Xu†
New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates
zhiqinxu@nyu.edu
†This work was done while Xu was a visiting member at the Courant Institute of Mathematical Sciences, New York University, New York, United States.
Preprint. Work in progress.
1 Introduction
Deep neural networks (DNNs) have achieved many state-of-the-art results in various fields [1], such as object recognition, language translation, and game playing. A full understanding of why DNNs achieve such good results remains elusive. The DNNs used in practice often have many more parameters than training samples. As von Neumann said, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." It is thus no surprise that such DNNs can fit the training data well. However, counter to traditional learning theory, such DNNs often do not overfit, that is, they generalize well to test data not seen during training, which is often referred to as an "apparent paradox" [2].
A series of recent works [3, 4, 5, 6, 7, 8, 9, 10], both experimental and theoretical, have improved our understanding of this paradox. In this work, we focus on the Fourier analysis of DNNs [3, 4, 5]. Although commonly used datasets, such as MNIST and CIFAR, are relatively simple compared with practical datasets, their input dimension (the number of pixels of each input image) is still very high for a quantitative analysis. A good starting point for understanding this apparent paradox is to find an example that is simple enough for analysis but still preserves the paradox. Such an example turns out to be the fitting of a function with one-dimensional (1d) input and 1d output [3]. An instructive contrast is that a very high-order polynomial fit of randomly sampled data points often overfits the training data, that is, high oscillation occurs near the sample boundary (Runge's phenomenon); however, a DNN with small weight initialization, no matter how large the DNN is, often learns the training data with a relatively flat function [10]. Starting from such 1d functions, it was found experimentally [3, 5] and theoretically [4] that there exists a Frequency Principle (F-Principle): DNNs often first quickly capture low-frequency components while keeping high-frequency ones small, and only later, relatively slowly, capture the high-frequency components. By the F-Principle, the high-frequency components of the DNN output are controlled by the training data; the high oscillation seen in Runge's phenomenon is therefore absent in the DNN fit. The F-Principle also holds well on commonly used datasets [3], namely MNIST and CIFAR10. Theoretical work indicates that the key ingredient underlying the F-Principle in general DNN fitting problems is that the power spectrum of the activation function decays in Fourier space; this power-decay property is easily satisfied, for example by the sigmoid function and the rectified linear unit.
Previous studies focused on DNNs trained with the mean-square-error loss [3, 4, 5]. Whether the F-Principle applies to DNNs with other types of loss functions has yet to be studied. This is important because the loss function varies across problems, such as image classification and solving differential equations [11]. In this work, we perform a theoretical analysis showing that for a general loss function, e.g., cross entropy, the F-Principle qualitatively holds during DNN training, which we also verify by experiments. The first experiment is a classification problem with cross-entropy loss. The second experiment applies a DNN to solve the Poisson equation using Dirichlet's principle.
The DNN is a powerful tool for solving differential equations [11, 12, 13], especially high-dimensional ones. It is well known that different frequencies converge at different speeds when differential equations are solved by numerical schemes. For example, in the Jacobi method, low frequencies converge much more slowly than high frequencies. The multigrid method is designed to speed up convergence by explicitly capturing low-frequency parts first [14]. In addition, manual frequency marching from low to high frequency has achieved great success in numerical schemes for various problems, such as inverse scattering [15] and cryo-EM reconstruction [16]. By demonstrating the F-Principle in solving Poisson's equation, we emphasize that the DNN, which implicitly gives low frequencies high priority, could be a powerful tool for problems that benefit from fast convergence of low frequencies. For example, we propose an idea that combines a DNN with conventional methods (e.g., the Jacobi or Gauss-Seidel method), in which the DNN captures the low-frequency parts and the conventional method captures the high-frequency parts. We exemplify this idea by solving a 1d Poisson's equation.
2 F-Principle with a general loss function
Consider a general DNN and denote its output as $h(x;\theta)$, where $\theta$ stands for the DNN parameters and $x$ stands for the input. Represent $h$ with an orthonormal basis $\{e_k\}_k$:

$$h(x;\theta) = \sum_k a_k(\theta)\, e_k(x), \qquad (1)$$

where $a_k(\theta)$ is the coefficient of mode $k$, depending on $\theta$. Denote the loss at sample $x$ as

$$L(x) = \ell\big(h(x;\theta),\, y(x)\big), \qquad (2)$$

where $y(x)$ is the target. The total loss is

$$L = \sum_{x\in S} L(x), \qquad (3)$$

where $S$ is the training set. Consider the gradient of the total loss with respect to a parameter $\theta_j$:

$$\frac{\partial L}{\partial \theta_j} = \sum_{x\in S} \frac{\partial L(x)}{\partial \theta_j} \qquad (4)$$
$$= \sum_{x\in S} \frac{\partial L(x)}{\partial h}\,\frac{\partial h(x;\theta)}{\partial \theta_j} \qquad (5)$$
$$= \sum_{x\in S} \frac{\partial L(x)}{\partial h} \sum_k \frac{\partial a_k(\theta)}{\partial \theta_j}\, e_k(x) \qquad (6)$$
$$= \sum_k \frac{\partial a_k(\theta)}{\partial \theta_j} \sum_{x\in S} \frac{\partial L(x)}{\partial h}\, e_k(x). \qquad (7)$$

Let

$$\frac{\partial h(x;\theta)}{\partial \theta_j} = \sum_k \frac{\partial a_k(\theta)}{\partial \theta_j}\, e_k(x). \qquad (8)$$

$\partial a_k(\theta)/\partial \theta_j$ is the coefficient of $\partial h/\partial \theta_j$ at the component $e_k$. Consider the case where $\{e_k\}_k$ is the Fourier basis. According to the Riemann-Lebesgue lemma, if a function is integrable on an interval, then its Fourier coefficients tend to 0 as the order tends to infinity. Therefore, when the activation function and the target function are both integrable on the considered interval, $a_k$ and $\partial a_k(\theta)/\partial \theta_j$ tend to 0 as the order $k$ tends to infinity. Denote

$$G_k \triangleq \frac{\partial a_k(\theta)}{\partial \theta_j} \sum_{x\in S} \frac{\partial L(x)}{\partial h}\, e_k(x). \qquad (9)$$

We have

$$\frac{\partial L}{\partial \theta_j} = \sum_k G_k. \qquad (10)$$

Therefore, $\partial L/\partial \theta_j$ decomposes into a summation of terms $G_k$ that tend to 0 as the order $k$ tends to infinity. This analysis implies that, for any loss function, the change of any parameter at each training step is affected more by lower frequencies, which rationalizes the F-Principle for general loss functions, as examined in the following experiments.
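As a sanity check, the identity $\partial L/\partial\theta_j = \sum_k G_k$ can be verified numerically. The sketch below is illustrative only: it assumes a one-hidden-layer tanh network, a log-cosh loss standing in for a generic non-quadratic loss, and the discrete Fourier basis, for which $G_k$ reduces to `fft(dh)[k] * ifft(dl_dh)[k]`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 8
x = np.linspace(-1, 1, n, endpoint=False)

# A tiny one-hidden-layer tanh network h(x; theta) evaluated on the grid.
W1, b1, W2 = rng.normal(size=m), rng.normal(size=m), rng.normal(size=m)
z = np.tanh(np.outer(x, W1) + b1)        # hidden activations, shape (n, m)
h = z @ W2                               # network output h(x)
y = np.sign(np.sin(3 * np.pi * x))       # an arbitrary target

# A generic non-quadratic loss l(h, y) = log cosh(h - y), so dl/dh = tanh(h - y).
dl_dh = np.tanh(h - y)

# Pick one parameter theta_j = W1[0]; dh/dtheta_j computed analytically.
dh = W2[0] * (1.0 - z[:, 0] ** 2) * x

# Direct gradient dL/dtheta_j = sum_x (dl/dh) * (dh/dtheta_j).
grad_direct = float(np.sum(dl_dh * dh))

# Frequency decomposition: with the discrete Fourier basis e_k,
# G_k = (da_k/dtheta_j) * sum_x (dl/dh) e_k(x) = fft(dh)[k] * ifft(dl_dh)[k].
G = np.fft.fft(dh) * np.fft.ifft(dl_dh)
grad_decomposed = float(np.sum(G).real)
print(grad_direct, grad_decomposed)      # identical up to floating-point error
```

Summing the $G_k$ recovers the gradient exactly; because $\partial h/\partial\theta_j$ is smooth here, its Fourier coefficients, and hence the $G_k$, typically decay at high frequency indices, as the Riemann-Lebesgue argument predicts.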
3 Experiment: cross-entropy loss
The loss function of cross entropy is widely used in classification problems. We use experiments to show that FPrinciple holds in the DNN training with this loss function.
3.1 Toy data
Consider a target function $y(x) = (y_1(x), y_2(x))$, where

(11) 

(12) 

This fitting problem is a toy classification problem. For this problem, the output layer of the DNN has two neurons with softmax as the activation function. The DNN output is denoted as $h(x) = (h_1(x), h_2(x))$. The cross-entropy loss function is

$$L = -\sum_{x} \sum_{i=1}^{2} y_i(x)\, \log h_i(x). \qquad (13)$$
For illustration, we focus on $y_1$, which is shown in Fig. 1a. Next, we examine the convergence of different frequencies. On a finite interval, the frequency components of a target function are quantified by the Fourier coefficients computed from the Discrete Fourier Transform (DFT). Because the frequency in the DFT is discrete, we can refer to a frequency component by its index $k$ instead of its physical frequency; we call $k$ the frequency index. The Fourier coefficient of $y_1$ for the $k$-th frequency component is denoted by $\hat{y}_1(k)$ (a complex number in general), and $|\hat{y}_1(k)|$ is the corresponding amplitude, where $|\cdot|$ denotes the modulus. $|\hat{y}_1(k)|$ is shown in Fig. 1b. To examine the convergence behavior of different frequency components during the training of the DNN, we compute the relative difference between the DNN output $h_1$ and $y_1$ in the frequency domain at each recording step, i.e.,

$$\Delta_F(k) = \frac{\big|\hat{h}_1(k) - \hat{y}_1(k)\big|}{\big|\hat{y}_1(k)\big|}. \qquad (14)$$

During training, the DNN captures $y_1$ from low to high frequency in a clear order, as shown in Fig. 1c.
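The frequency-wise relative difference described above can be sketched with NumPy's FFT. The helper name and the toy signals below are illustrative only; the model output is mimicked by a signal that has captured only the low-frequency mode.

```python
import numpy as np

def relative_freq_error(h_out, target):
    """Per-frequency relative difference |hat_h(k) - hat_y(k)| / |hat_y(k)|,
    left as NaN where the target coefficient is essentially zero."""
    H, Y = np.fft.fft(h_out), np.fft.fft(target)
    mask = np.abs(Y) > 1e-8 * np.abs(Y).max()
    err = np.full(len(target), np.nan)
    err[mask] = np.abs(H[mask] - Y[mask]) / np.abs(Y[mask])
    return err

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
y = np.sin(x) + 0.5 * np.sin(10 * x)       # target with frequency indices 1 and 10
h_low = np.sin(x)                          # a model that captured only the low mode
err = relative_freq_error(h_low, y)
print(err[1], err[10])                     # ~0 at k=1, 1 at k=10
```

In the experiments, `h_out` is the DNN output recorded at each training step, so `err` traces how each frequency index converges over time.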
3.2 MNIST data
To verify that the F-Principle holds for image classification (MNIST) with cross-entropy loss, we perform Fourier analysis along the first principal component of the input space. The procedure is as follows.
The training set is a list of images with labels, $\{(x_i, y_i)\}_{i=1}^{n}$. Each image is represented by a vector $x_i \in \mathbb{R}^{d}$, where $d$ is the number of pixels of an image, and $y_i$ is a one-hot vector indicating the label. The dimensions of the input layer and the output layer are $d$ and $10$, respectively. First, we compute the first principal direction. Transform each image by subtracting the mean image:

$$\bar{x}_i = x_i - \frac{1}{n}\sum_{j=1}^{n} x_j. \qquad (15)$$

Denote all centered images by $\bar{X}$. The covariance matrix is $\Sigma = \bar{X}\bar{X}^{T}/n$. The eigenvector of the maximal eigenvalue of $\Sigma$ can be obtained, denoted by $p$, i.e., the first principal direction. The projection of each image onto this direction is $v_i = p^{T}\bar{x}_i$. We rescale $\{v_i\}$ to a fixed interval through

(16) 

Then, the sample set is $\{(v_i, y_i)\}_{i=1}^{n}$. For illustration, we only consider the first component of $y_i$. Note that $\{v_i\}$ is a non-uniform sampling. Using the non-uniform FFT (NUFFT), we can obtain
(17) 
Then, the sampling on the Fourier domain is
(18) 
After each training step, we feed each $v_i$ into the DNN and obtain the DNN output:

(19) 

Using the NUFFT similarly, we can obtain the sampling of the first dimension of the DNN output in the Fourier domain,

(20) 

which is shown in Fig. 2a. We then examine the relative error of selected important frequency components (marked by black squares). As shown in the first column of Fig. 2b, the DNN tends to capture low-frequency components first.
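The principal-direction projection step can be sketched as follows. The data here are synthetic stand-ins for the MNIST images, and the rescaling interval $[-\pi, \pi]$ is an assumption chosen for illustration (the text leaves the interval unspecified).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20                       # hypothetical: n images, d pixels each
X = rng.normal(size=(n, d))
X[:, 0] *= 5.0                       # inject a dominant direction of variance

Xc = X - X.mean(axis=0)              # mean-center each pixel (Eq.-(15)-style step)
cov = Xc.T @ Xc / n                  # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
p = eigvecs[:, -1]                   # first principal direction (largest eigenvalue)

v = Xc @ p                           # 1d projection of every image
# Rescale the projections to a fixed interval before the NUFFT step.
v_scaled = (v - v.min()) / (v.max() - v.min()) * 2 * np.pi - np.pi
```

The pairs `(v_scaled[i], y[i])` then form the 1d, non-uniformly sampled dataset on which the NUFFT analysis is carried out.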
4 Experiment: Poisson’s equations
Consider the one-dimensional (1d) Poisson's equation [11, 17]:

$$-\Delta u(x) = g(x), \quad x \in \Omega; \qquad u(x) = 0, \quad x \in \partial\Omega. \qquad (21)$$

The Poisson's equation can be solved by conventional numerical schemes (e.g., the Jacobi method) or by a DNN. As is well known, high frequencies converge faster under the Jacobi method. In the following, we show that, in contrast, high frequencies converge more slowly when a DNN is applied to solve the above Poisson's equation.
4.1 Central differencing scheme and Jacobi method
$\Omega$ is uniformly discretized into $n$ points $\{x_i\}$ with step $h$. The Poisson's equation in Eq. (21) can be solved by the central differencing scheme:

$$-\frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} = g_i, \qquad i = 1, \ldots, n. \qquad (22)$$

Write the above in matrix form:

$$A u = g, \qquad (23)$$

where

$$A = \frac{1}{h^2}\begin{pmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{pmatrix}, \qquad (24)$$

$$u = (u_1, \ldots, u_n)^{T}, \qquad g = (g_1, \ldots, g_n)^{T}. \qquad (25)$$

If $n$ is not large, Eq. (23) can be solved by directly inverting $A$. When $n$ is very large, the problem can be solved by iterative schemes; as an example, we illustrate the Jacobi method. Let $A = D - L - U$, where $D$ is the diagonal of $A$, and $-L$ and $-U$ are the strictly lower and upper triangular parts of $A$, respectively. Then, we can obtain

$$u = D^{-1}(L + U)u + D^{-1}g. \qquad (26)$$

The Jacobi iteration is

$$u^{t+1} = D^{-1}(L + U)u^{t} + D^{-1}g. \qquad (27)$$

We now perform an error analysis of this iteration. Denote by $u^{*}$ the true solution obtained by directly inverting $A$ in Eq. (23). The error at step $t$ is $e^{t} = u^{t} - u^{*}$. Then, $e^{t+1} = R_J\, e^{t}$, where $R_J = D^{-1}(L + U)$. The convergence speed of $e^{t}$ is determined by the eigenvalues of $R_J$, that is,

$$\lambda_k = \cos\frac{k\pi}{n+1}, \qquad k = 1, \ldots, n, \qquad (28)$$

and the corresponding eigenvectors $v_k$ have components

$$(v_k)_i = \sin\frac{ik\pi}{n+1}, \qquad i = 1, \ldots, n. \qquad (29)$$

Write

$$e^{t} = \sum_{k=1}^{n} \alpha_k^{t}\, v_k, \qquad (30)$$

where $\alpha_k^{t}$ can be understood as the magnitude of $e^{t}$ in the direction of $v_k$. Then,

$$\alpha_k^{t} = (\lambda_k)^{t}\, \alpha_k^{0}. \qquad (31)$$

Therefore, the convergence speed of $e^{t}$ in the direction of $v_k$ is controlled by $|\lambda_k|$. Since

$$\cos\frac{k\pi}{n+1} = -\cos\frac{(n+1-k)\pi}{n+1}, \qquad (32)$$

the frequencies $k$ and $n+1-k$ are closely related and converge at the same speed. Considering the frequencies $k \leq (n+1)/2$, $|\lambda_k|$ is larger for lower frequencies. Therefore, lower frequencies converge more slowly in the Jacobi method.
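The mode-by-mode decay rates can be observed directly by iterating Jacobi on the homogeneous problem $Au = 0$, where the iterate is exactly the error. The grid size and the two frequency indices below are illustrative choices.

```python
import numpy as np

n = 63                                 # interior grid points of the 1d Laplacian
i = np.arange(1, n + 1)
v = lambda k: np.sin(i * k * np.pi / (n + 1))   # eigenvector with frequency index k

k_low, k_high = 1, 28
u = v(k_low) + v(k_high)               # initial error: one low + one high mode
# Jacobi on A u = 0 with zero boundary values: u_i <- (u_{i-1} + u_{i+1}) / 2
for _ in range(50):
    padded = np.concatenate(([0.0], u, [0.0]))
    u = 0.5 * (padded[:-2] + padded[2:])

# Amplitude of each mode left in the error (the eigenvectors are orthogonal).
amp = lambda k: abs(u @ v(k)) / (v(k) @ v(k))
print(amp(k_low), amp(k_high))
```

After 50 sweeps the low mode has only shrunk by the factor $\cos(\pi/64)^{50} \approx 0.94$, while the $k = 28$ mode, with $\cos(28\pi/64)^{50}$, is numerically gone.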
4.2 DNN approach
Similar to the loss function in Ref. [11], we consider the following loss function (energy method):

$$L = \int_{\Omega} \left( \frac{1}{2}\,\big|\nabla h(x;\theta)\big|^{2} - g(x)\, h(x;\theta) \right) dx + \beta \int_{\partial\Omega} h(x;\theta)^{2}\, dx. \qquad (33)$$

Minimizing $L$ is equivalent to solving Poisson's equation, i.e., finding the function that minimizes the energy (Dirichlet's principle) [17]. The last term in $L$ is a penalty enforcing the boundary condition, and $\beta$ is a constant. The DNN used to solve Eq. (21) has a 1d input (i.e., $x$) and a 1d output (denoted as $h(x;\theta)$).

The procedure is similar to before. We discretize $\Omega$ into $n$ evenly spaced points $\{x_i\}$. At each training step, we compute $h(x_i;\theta)$ for each $x_i$ and evaluate the discretized loss. The gradient of $L$ with respect to a parameter $\theta_j$ is

$$\frac{\partial L}{\partial \theta_j} = \int_{\Omega} \left( \nabla h \cdot \frac{\partial \nabla h}{\partial \theta_j} - g\, \frac{\partial h}{\partial \theta_j} \right) dx + 2\beta \int_{\partial\Omega} h\, \frac{\partial h}{\partial \theta_j}\, dx. \qquad (34)$$

At each training step, we compare $h$ and $u^{*}$ in the Fourier domain. Note that $u^{*}$ is the solution obtained by directly inverting $A$ in Eq. (23).
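The discretized energy and the Dirichlet principle behind it can be sketched numerically. Everything below is a stand-in chosen for illustration: a simple Riemann-sum quadrature, an assumed domain $\Omega = (-1, 1)$, an assumed penalty weight, and a manufactured problem whose solution is known in closed form.

```python
import numpy as np

beta = 10.0                          # assumed penalty weight
n = 1001
x = np.linspace(-1, 1, n)
dx = x[1] - x[0]

def ritz_loss(h, g):
    """Discretized energy: int ( |h'|^2 / 2 - g h ) dx + beta * boundary penalty."""
    dh = np.gradient(h, x)           # finite-difference derivative of h
    return float(np.sum(0.5 * dh ** 2 - g * h) * dx + beta * (h[0] ** 2 + h[-1] ** 2))

# Manufactured case: -u'' = g with u(x) = sin(pi (x + 1) / 2), u(-1) = u(1) = 0.
u_true = np.sin(np.pi * (x + 1) / 2)
g = (np.pi / 2) ** 2 * u_true

loss_true = ritz_loss(u_true, g)
loss_other = ritz_loss(0.5 * u_true, g)      # any other admissible candidate
print(loss_true < loss_other)                # the true solution has lower energy
```

By Dirichlet's principle the true solution minimizes this energy, so any competing candidate (here simply a scaled copy) must yield a larger value; this is exactly the property the DNN training exploits.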
4.3 Experiment
Consider
(35) 
As shown in Fig. 3a, after training, the DNN output fits well the solution $u^{*}$ obtained by directly inverting $A$ in Eq. (23). As shown in Fig. 3b, $u^{*}$ has three peaks in the Fourier domain.
To examine the convergence behavior of different frequency components during DNN training, we compute the relative difference between the DNN output $h$ and $u^{*}$ in the frequency domain at each recording step, i.e.,

$$\Delta_F(k) = \frac{\big|\hat{h}(k) - \hat{u}^{*}(k)\big|}{\big|\hat{u}^{*}(k)\big|}. \qquad (36)$$

As shown in Fig. 3c, the F-Principle holds well in solving Poisson's equation [3, 4]. For comparison, Fig. 3d shows that low frequencies converge much more slowly than high frequencies in the Jacobi method.
5 Combination of DNN and conventional methods
In light of the above numerical simulations, it is natural to ask whether we can combine the DNN and the Jacobi method to solve the Poisson's equation. For simplicity, we call this combination the D-Jacobi method.

In the first part of the D-Jacobi method, we solve the Poisson's equation with a DNN for a number of training steps. In the second part, we use the DNN output at that step as the initial value for the Jacobi method.

We solve the problem in Fig. 3 on a laptop (Dell Precision 5510). As shown in Fig. 4a, the DNN loss fluctuates after some running time. We switch to the Jacobi method at several candidate time points, indicated by vertical dashed lines. In Fig. 4b (Fig. 4c), green stars indicate the error of the DNN output at different steps, and dashed lines indicate the evolution of the Jacobi (Gauss-Seidel) method. If the switching time is too early, it still takes a long time to converge to a small error, because the low frequencies have not yet converged. If the switching time is too late, much time is wasted, because the DNN captures high frequencies slowly and its loss fluctuates a lot. The green and red switching times are better choices. In practice, a good way to select the switching time is to wait until the loss has flattened and fluctuated for a short while.
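The benefit of the handoff can be simulated without training a network: below, the "DNN phase" is replaced by a stand-in that supplies only the low-frequency part of the discrete solution (which is what the F-Principle says a partially trained DNN provides), and Jacobi then finishes from that warm start. The problem sizes and frequency indices are illustrative.

```python
import numpy as np

n = 127
h = 2.0 / (n + 1)
i = np.arange(1, n + 1)
v = lambda k: np.sin(i * k * np.pi / (n + 1))   # discrete Laplacian eigenvectors

# -u'' = g discretized as A u = g (central differences, zero boundary).
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1)) / h ** 2
g = (np.pi / 2) ** 2 * (v(1) + 40 ** 2 * v(40))  # low + high frequency source
u_star = np.linalg.solve(A, g)                   # exact discrete solution

def jacobi(u, steps):
    for _ in range(steps):
        p = np.concatenate(([0.0], u, [0.0]))
        u = 0.5 * (p[:-2] + p[2:] + h * h * g)
    return u

# Stand-in for the DNN phase: assume it has captured only the low mode of u*.
proj = lambda u, k: (u @ v(k)) / (v(k) @ v(k)) * v(k)
warm0 = proj(u_star, 1)

T = 200
err_cold = float(np.abs(jacobi(np.zeros(n), T) - u_star).max())
err_warm = float(np.abs(jacobi(warm0, T) - u_star).max())
print(err_warm, err_cold)
```

From a zero start, the low mode shrinks only by $\cos(\pi/128)^{200} \approx 0.94$ and dominates the remaining error; from the warm start, only the high mode is left, which Jacobi removes almost immediately. This division of labor is exactly the rationale for the D-Jacobi method.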
6 Discussion
In this work, we have shown that the F-Principle holds well in DNN training with a general loss function, extending previous studies of the F-Principle for the mean-square-error loss [3, 4, 5]. Together with the previous finding that the F-Principle holds for both fully connected and convolutional neural networks with either tanh or ReLU activation [3, 4], these works suggest that the F-Principle may help explain the generalization ability of general DNNs.

We also show that the generality of the F-Principle in DNN training could be useful for designing algorithms to solve practical problems. Specifically, we apply a DNN to solve the 1d Poisson's equation. Compared with conventional numerical schemes, the DNN could work better in rather high dimensions [11]. In addition, the DNN method does not require discretization, which makes it much easier to implement. In the future, it would be interesting to use the DNN's F-Principle to develop numerical schemes for problems that benefit from fast convergence of low frequencies.
Acknowledgments
The author thanks Weinan E and Wei Cai for helpful discussions. The author also thanks Tao Luo, Zheng Ma, Yanyang Xiao, and Yaoyu Zhang for discussions of the F-Principle. This work was funded by the NYU Abu Dhabi Institute G1301.
References

[1]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.
Deep learning.
Nature, 521(7553):436, 2015.

[2]
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.
Understanding deep learning requires rethinking generalization.
arXiv preprint arXiv:1611.03530, 2016.

[3]
ZhiQin J Xu, Yaoyu Zhang, and Yanyang Xiao.
Training behavior of deep neural network in frequency domain.
arXiv preprint arXiv:1807.01251, 2018.

[4]
Zhiqin John Xu.
Understanding training and generalization in deep learning by Fourier
analysis.
arXiv preprint arXiv:1808.04295, 2018.

[5]
Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A
Hamprecht, Yoshua Bengio, and Aaron Courville.
On the spectral bias of deep neural networks.
arXiv preprint arXiv:1806.08734, 2018.

[6]
Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel
Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville,
Yoshua Bengio, et al.
A closer look at memorization in deep networks.
arXiv preprint arXiv:1706.05394, 2017.

[7]
T Poggio, K Kawaguchi, Q Liao, B Miranda, L Rosasco, X Boix, J Hidary, and
HN Mhaskar.
Theory of deep learning III: the non-overfitting puzzle.
Technical report, CBMM Memo 073, 2018.

[8]
Guillermo Valle Pérez, Ard A Louis, and Chico Q Camargo.
Deep learning generalizes because the parameter-function map is
biased towards simple functions.
arXiv preprint arXiv:1805.08522, 2018.

[9]
Daniel Jakubovitz, Raja Giryes, and Miguel RD Rodrigues.
Generalization error in deep learning.
arXiv preprint arXiv:1808.01174, 2018.

[10]
Lei Wu, Zhanxing Zhu, and Weinan E.
Towards understanding generalization of deep learning: Perspective of
loss landscapes.
arXiv preprint arXiv:1706.10239, 2017.

[11]
E Weinan and Bing Yu.
The Deep Ritz method: A deep learning-based numerical algorithm for
solving variational problems.
Communications in Mathematics and Statistics, 6(1):1–12, 2018.

[12]
Yuehaw Khoo, Jianfeng Lu, and Lexing Ying.
Solving parametric PDE problems with artificial neural networks.
arXiv preprint arXiv:1707.03351, 2017.

[13]
E Weinan, Jiequn Han, and Arnulf Jentzen.
Deep learning-based numerical methods for high-dimensional parabolic
partial differential equations and backward stochastic differential
equations.
Communications in Mathematics and Statistics, 5(4):349–380,
2017.

[14]
William L Briggs, Steve F McCormick, et al.
A multigrid tutorial, volume 72.
SIAM, 2000.

[15]
Gang Bao, Peijun Li, Junshan Lin, and Faouzi Triki.
Inverse scattering problems with multifrequencies.
Inverse Problems, 31(9):093001, 2015.

[16]
Alex Barnett, Leslie Greengard, Andras Pataki, and Marina Spivak.
Rapid solution of the cryo-EM reconstruction problem by frequency
marching.
SIAM Journal on Imaging Sciences, 10(3):1170–1195, 2017.

[17]
Lawrence C Evans.
Partial differential equations.
American Mathematical Society, 2010.