Frequency Principle in Deep Learning with General Loss Functions and Its Potential Application

Zhi-Qin John Xu
New York University Abu Dhabi
Abu Dhabi 129188, United Arab Emirates
zhiqinxu@nyu.edu
This work was done while Xu was a visiting member at the Courant Institute of Mathematical Sciences, New York University, New York, United States.
Abstract

Previous studies have shown that deep neural networks (DNNs) with common settings often capture target functions from low to high frequency, which is called the Frequency Principle (F-Principle). It has also been shown that the F-Principle can provide an understanding of the often observed good generalization ability of DNNs. However, previous studies focused on the mean squared error loss, while various loss functions are used in practice. In this work, we show that the F-Principle holds for general loss functions (e.g., mean squared error, cross entropy, etc.). In addition, the DNN's F-Principle may be applied to develop numerical schemes for problems that would benefit from fast convergence of low frequencies. As an example of this potential usage, we apply DNNs to solve differential equations, for which conventional methods (e.g., the Jacobi method) are usually slow because they converge from high to low frequency.

 

Preprint. Work in progress.

1 Introduction

Deep neural networks (DNNs) have achieved many state-of-the-art results in various fields [1], such as object recognition, language translation, and game playing. A full understanding of why DNNs can achieve such good results remains elusive. The DNNs used in practice are often equipped with many more parameters than the number of training samples. As von Neumann said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” It is thus no surprise that such DNNs can fit the training data well. However, counter-intuitively to traditional learning theory, such DNNs often do not overfit, that is, they often generalize well to test data not seen during training, which is often referred to as an “apparent paradox” [2].

A series of recent works [3, 4, 5, 6, 7, 8, 9, 10], both experimental and theoretical, have provided more understanding of this paradox. In this work, we focus on the Fourier analysis of DNNs [3, 4, 5]. Although commonly used datasets, such as MNIST and CIFAR, are relatively simple compared with practical datasets, the input dimension (the pixel number of each input image) is still very high for a quantitative analysis. A good starting point for understanding this apparent paradox is to find an example that is simple enough for analysis but still preserves the paradox. Such an example turns out to be the fitting of a function with one-dimensional (1-d) input and 1-d output [3]. An instructive observation is that a very high-order polynomial fit to randomly sampled data points often overfits the training data, that is, high oscillation occurs near the sampling boundary (Runge’s phenomenon); however, a DNN with small weight initialization, no matter how large the DNN is, often learns the training data with a relatively flat function [10]. Starting from such 1-d functions, it has been found experimentally [3, 5] and theoretically [4] that there exists a Frequency Principle (F-Principle): DNNs often first quickly capture low-frequency components while keeping high-frequency ones small, and then relatively slowly capture the high-frequency components. By the F-Principle, the high-frequency components of the DNN output are controlled by the training data. The high oscillation seen in Runge’s phenomenon is therefore absent in the DNN fitting. The F-Principle also holds well on commonly used datasets [3], namely MNIST and CIFAR-10. Theoretical work indicates that the key ingredient underlying the F-Principle in general DNN fitting problems is that the power spectrum of the activation function decays in Fourier space, a property that is easily satisfied, for example by the sigmoid function and the rectified linear unit (ReLU).

Previous studies focused on DNNs trained with the mean squared error loss [3, 4, 5]. It remains to be studied whether the F-Principle applies to DNNs with other types of loss functions. This is important because the loss function varies across problems, such as image classification and solving differential equations [11]. In this work, we perform a theoretical analysis to show that for a general loss function, e.g., cross entropy, the F-Principle qualitatively holds during DNN training, which is also verified by experiments. The first experiment is a classification problem with the cross entropy loss. The second experiment applies a DNN to solve a Poisson equation via Dirichlet’s principle.

DNNs are a powerful tool for solving differential equations [11, 12, 13], especially for high-dimensional problems. It is well known that different frequencies converge at different speeds when differential equations are solved by numerical schemes. For example, in the Jacobi method, low frequencies converge much more slowly than high frequencies. The multigrid method is designed to speed up the convergence by explicitly capturing low-frequency parts first [14]. In addition, manual frequency marching from low to high frequency has achieved great success in designing numerical schemes for various problems, such as inverse scattering [15] and Cryo-EM reconstruction [16]. By demonstrating the F-Principle in solving Poisson’s equations, we emphasize that DNNs, which implicitly give low frequencies high priority, could be a powerful tool for problems that benefit from fast convergence of low frequencies. For example, we propose an idea that combines a DNN with conventional methods (e.g., the Jacobi method or the Gauss-Seidel method), in which the DNN is in charge of capturing the low-frequency parts and the conventional method is in charge of capturing the high-frequency parts. This idea is exemplified by solving a 1-d Poisson’s equation.

2 F-Principle with general loss function

Consider a general DNN and denote its output as $T(x;\theta)$, where $\theta$ stands for the DNN parameters and $x$ stands for the input. Represent $T$ with an orthonormal basis $\{\phi_k\}$:

$T(x;\theta)=\sum_{k}a_{k}(\theta)\,\phi_{k}(x),$   (1)

where $a_{k}(\theta)$ is the coefficient of mode $k$, depending on $\theta$. Denote the loss at sample $x$ as

$L_{x}=\ell\big(T(x;\theta),\,y(x)\big),$   (2)

where $y(x)$ is the target output at $x$ and $\ell$ is a general loss function.

The total loss over the training set $S$ is

$L=\sum_{x\in S}L_{x}.$   (3)

Consider the gradient of the total loss with respect to a parameter $\theta_{j}$:

$\dfrac{\partial L}{\partial \theta_{j}}=\sum_{x\in S}\dfrac{\partial L_{x}}{\partial \theta_{j}}$   (4)
$=\sum_{x\in S}\dfrac{\partial L_{x}}{\partial T(x;\theta)}\,\dfrac{\partial T(x;\theta)}{\partial \theta_{j}}$   (5)
$=\sum_{x\in S}\dfrac{\partial L_{x}}{\partial T(x;\theta)}\sum_{k}\dfrac{\partial a_{k}(\theta)}{\partial \theta_{j}}\,\phi_{k}(x)$   (6)
$=\sum_{k}\dfrac{\partial a_{k}(\theta)}{\partial \theta_{j}}\sum_{x\in S}\dfrac{\partial L_{x}}{\partial T(x;\theta)}\,\phi_{k}(x).$   (7)

Let

$b_{k}=\sum_{x\in S}\dfrac{\partial L_{x}}{\partial T(x;\theta)}\,\phi_{k}(x);$   (8)

$b_{k}$ is the coefficient of $\partial L_{x}/\partial T$ at the component $\phi_{k}$. Consider that $\{\phi_k\}$ is a Fourier basis. According to the Riemann-Lebesgue lemma, if a function is integrable on an interval, then its Fourier coefficients tend to 0 as the order tends to infinity. Therefore, when the activation function and the target function are both integrable on the considered interval, $a_{k}$ and $b_{k}$ tend to 0 as the order $k$ tends to infinity. Denote

$G_{k}=\dfrac{\partial a_{k}(\theta)}{\partial \theta_{j}}\,b_{k}.$   (9)

We have

$\dfrac{\partial L}{\partial \theta_{j}}=\sum_{k}G_{k}.$   (10)

Therefore, we can decompose $\partial L/\partial\theta_{j}$ into a summation of terms $G_{k}$, which tend to 0 as the order $k$ tends to infinity. This analysis implies that, for any loss function, the change of any parameter at each training step is affected more by lower frequencies, which rationalizes the F-Principle for general loss functions, as examined in the following experiments.
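As a concrete illustration of this decomposition, the following minimal numpy sketch verifies numerically that the gradient of the total loss with respect to a single weight equals the sum of per-frequency contributions $G_k$, and that these contributions decay with frequency. The quartic loss, the target, and the one-hidden-layer tanh network are hypothetical choices for illustration, not settings from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.linspace(-1.0, 1.0, n, endpoint=False)          # uniform samples
y = np.sin(np.pi * x) + 0.5 * np.sin(4 * np.pi * x)    # hypothetical target

# Tiny one-hidden-layer tanh network: T(x) = sum_m c_m * tanh(w_m * x + b_m)
m = 20
w, b, c = rng.normal(size=m), rng.normal(size=m), rng.normal(size=m)
T = np.tanh(np.outer(x, w) + b) @ c

# A generic (non-quadratic) loss l(T, y) = (T - y)^4 / 4, so dl/dT = (T - y)^3
dl_dT = (T - y) ** 3
# Derivative of T with respect to the single weight w_0
dT_dw0 = c[0] * x * (1.0 - np.tanh(w[0] * x + b[0]) ** 2)

# Direct gradient: dL/dw_0 = sum_x (dl/dT) * (dT/dw_0)
grad_direct = np.sum(dl_dT * dT_dw0)

# Frequency decomposition via Parseval: sum_x f*g = (1/n) * sum_k F_k * conj(G_k)
l_hat = np.fft.fft(dl_dT)
t_hat = np.fft.fft(dT_dw0)
G_k = (l_hat * np.conj(t_hat)).real / n                # contribution of frequency index k

print(np.isclose(grad_direct, G_k.sum()))              # True: gradient = sum over modes
print(np.abs(G_k[:8]))                                 # contributions decay with frequency
```

Here the DFT frequency index plays the role of the basis label $k$ in Eq. (10); the low-frequency contributions dominate the gradient.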

3 Experiment: cross entropy loss

The cross entropy loss is widely used in classification problems. We use experiments to show that the F-Principle holds in DNN training with this loss function.

3.1 Toy data

Consider a target function $g(x)=(g_{1}(x),g_{2}(x))$ with 1-d input $x$, where

(11)
(12)

This fitting problem is a toy classification problem. The output layer of the DNN has two neurons with softmax as the activation function, and the DNN output is denoted as $T(x;\theta)=(T_{1}(x;\theta),T_{2}(x;\theta))$. The loss function is the cross entropy

$L=-\sum_{x\in S}\sum_{j=1}^{2}g_{j}(x)\,\ln T_{j}(x;\theta).$   (13)

For illustration, we focus on $g_{1}$, which is shown in Fig. 1a. Next, we examine the convergence of different frequencies. On a finite interval, the frequency components of a target function are quantified by the Fourier coefficients computed from the Discrete Fourier Transform (DFT). Because the frequency in the DFT is discrete, we can refer to a frequency component by its index $k$ instead of its physical frequency; we call $k$ the frequency index. The Fourier coefficient of $g_{1}$ for the $k$-th frequency component is denoted by $\hat{g}_{1,k}$ (a complex number in general), and $|\hat{g}_{1,k}|$ is the corresponding amplitude, where $|\cdot|$ denotes the modulus. $|\hat{g}_{1,k}|$ is shown in Fig. 1b. To examine the convergence behavior of different frequency components during the training of the DNN, we compute the relative difference between the first dimension of the DNN output, $T_{1}$, and $g_{1}$ in the frequency domain at each recording step, i.e.,

$\Delta_{F}(k)=\dfrac{|\hat{T}_{1,k}-\hat{g}_{1,k}|}{|\hat{g}_{1,k}|},$   (14)

where $\hat{T}_{1,k}$ is the DFT coefficient of $T_{1}$ at frequency index $k$. During the training, the DNN captures $g_{1}$ from low to high frequency in a clear order, as shown in Fig. 1c.
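The relative difference in Eq. (14) can be computed with a few lines of numpy. The sketch below assumes the DNN output and the target are recorded on a uniform grid, so that np.fft.fft serves as the DFT; the names net, g1, and peaks in the usage comment are hypothetical.

```python
import numpy as np

def relative_diff(dnn_output, target, freq_indices):
    """Relative difference of Eq. (14) at selected frequency indices,
    for samples on a uniform grid (np.fft.fft serves as the DFT)."""
    T_hat = np.fft.fft(dnn_output)
    g_hat = np.fft.fft(target)
    k = np.asarray(freq_indices)
    return np.abs(T_hat[k] - g_hat[k]) / np.abs(g_hat[k])

# Hypothetical usage at each recording step during training:
# curve.append(relative_diff(net(x_train)[:, 0], g1(x_train), peaks))
```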


(a) (b) (c) DNN
Figure 1: F-Principle with the cross entropy loss. The first output dimension $g_{1}$ of the target function is shown in (a), and its Fourier coefficient amplitude as a function of frequency index is shown in (b). Frequency peaks are marked by black dots. (c) Relative difference $\Delta_{F}(k)$ at different recording steps for the selected frequency indices. The training data are evenly sampled in with sample size . We use a DNN with widths 400-400-200-100 and full-batch training; the output layer has two neurons with softmax as the activation function; the learning rate is . The parameters of the DNN are initialized from a Gaussian distribution with mean and standard deviation .

3.2 MNIST data

To verify that the F-Principle holds in image classification problems (MNIST) with the cross entropy loss, we perform Fourier analysis along the first principal component of the input space. The procedure is as follows.

The training set is a list of images with labels, $\{(x_{i},y_{i})\}_{i=1}^{n}$. Each image is represented by a vector $x_{i}\in\mathbb{R}^{d}$, where $d=784$ is the pixel number of an image, and $y_{i}\in\mathbb{R}^{10}$ is a one-hot vector indicating the label. The dimensions of the input layer and the output layer are thus $784$ and $10$, respectively. First, we compute the first principal direction. Center each image by

$x_{i}\rightarrow x_{i}-\bar{x},\qquad \bar{x}=\dfrac{1}{n}\sum_{i=1}^{n}x_{i}.$   (15)

Denote all centered images by the matrix $X\in\mathbb{R}^{n\times d}$, whose $i$-th row is $x_{i}$. The covariance matrix is $C=X^{T}X/n$. The eigenvector corresponding to the maximal eigenvalue of $C$, denoted by $v_{1}$, is the first principal direction. The projection of each image onto this direction is $p_{i}=x_{i}\cdot v_{1}$. We rescale $p_{i}$ to $\tilde{p}_{i}\in[0,1]$ by

$\tilde{p}_{i}=\dfrac{p_{i}-\min_{j}p_{j}}{\max_{j}p_{j}-\min_{j}p_{j}}.$   (16)

Then, the sample set is $\{(\tilde{p}_{i},y_{i})\}_{i=1}^{n}$. For illustration, we only consider the first component of $y_{i}$, i.e., $y_{i,1}$. Note that $\{\tilde{p}_{i}\}$ is a non-uniform sampling. Using the non-uniform FFT (NUFFT), we can obtain

$\hat{y}_{1,k}=\sum_{i=1}^{n}y_{i,1}\,e^{-2\pi\mathrm{i}k\tilde{p}_{i}}.$   (17)

Then, the sampling in the Fourier domain is

$\big\{\big(k,\hat{y}_{1,k}\big)\big\}_{k}.$   (18)

After each training step, we feed each image $x_{i}$ into the DNN and obtain the DNN output

$T(x_{i};\theta)\in\mathbb{R}^{10}.$   (19)

Using the non-uniform FFT similarly, we can obtain the sampling of the first output dimension of the DNN in the Fourier domain,

$\hat{T}_{1,k}=\sum_{i=1}^{n}T_{1}(x_{i};\theta)\,e^{-2\pi\mathrm{i}k\tilde{p}_{i}},$   (20)

which is shown in Fig. 2a. We then examine the relative error of certain selected important frequency components (marked by black squares). As shown in the first column of Fig. 2b, the DNN tends to capture low-frequency components first.
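The projection-and-NUFFT procedure above can be sketched as follows. The arrays images (shape (n, 784)), labels (one-hot, shape (n, 10)), and the trained network net in the usage comments are hypothetical, and a direct non-uniform DFT sum stands in for a NUFFT library (adequate for moderate n, up to normalization conventions).

```python
import numpy as np

def first_pc_projection(images):
    """Project each image onto the first principal direction and rescale to [0, 1]."""
    centered = images - images.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    _, eigvecs = np.linalg.eigh(cov)                    # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                                 # first principal direction
    p = centered @ v1
    return (p - p.min()) / (p.max() - p.min())

def nudft(values, positions, freqs):
    """Direct non-uniform DFT: hat{y}_k = sum_i values_i * exp(-2*pi*i*k*p_i)."""
    k = np.asarray(freqs)[:, None]
    return (values[None, :] * np.exp(-2j * np.pi * k * positions[None, :])).sum(axis=1)

# p_tilde = first_pc_projection(images)
# y_hat = nudft(labels[:, 0], p_tilde, np.arange(40))        # target spectrum, Eq. (17)
# T_hat = nudft(net(images)[:, 0], p_tilde, np.arange(40))   # DNN spectrum, Eq. (20)
```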


(a) (b)
Figure 2: F-Principle with the cross entropy loss on the MNIST dataset. The Fourier coefficient amplitude of the first output dimension of the target function is shown in (a). Frequency peaks are marked by black dots. (b) Relative difference at different recording steps for the selected frequency indices. The training data are the test samples of the MNIST dataset. We use a DNN with widths 400-200 and batch size 128; the output layer has 10 neurons with softmax as the activation function; the learning rate is . The parameters of the DNN are initialized from a Gaussian distribution with mean and standard deviation .

4 Experiment: Poisson’s equations

Consider the one-dimensional (1-d) Poisson's equation [11, 17]:

$-\dfrac{d^{2}u(x)}{dx^{2}}=g(x),\quad x\in\Omega,\qquad u(x)=0,\quad x\in\partial\Omega.$   (21)

Poisson's equation can be solved by numerical schemes (e.g., the Jacobi method) or by a DNN. As is well known, high frequencies converge faster in the Jacobi method. In the following, we show that high frequencies converge more slowly when a DNN is applied to solve the above Poisson's equation.

4.1 Central differencing scheme and Jacobi method

$\Omega$ is uniformly discretized into $n$ points $x_{i}$, $i=1,\ldots,n$, with step $h$. The Poisson's equation in Eq. (21) can then be solved by the central differencing scheme:

$-\dfrac{u_{i+1}-2u_{i}+u_{i-1}}{h^{2}}=g_{i}:=g(x_{i}),\qquad i=1,2,\ldots,n.$   (22)

Write the above in matrix form:

$A\mathbf{u}=\mathbf{g},$   (23)

where

$A=\begin{pmatrix}2&-1&&&\\-1&2&-1&&\\&\ddots&\ddots&\ddots&\\&&-1&2&-1\\&&&-1&2\end{pmatrix},$   (24)

$\mathbf{u}=(u_{1},u_{2},\ldots,u_{n})^{T},\qquad \mathbf{g}=h^{2}\,(g_{1},g_{2},\ldots,g_{n})^{T}.$   (25)
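For concreteness, here is a minimal numpy sketch of assembling this system and solving it directly; the interval $[-1,1]$ and the source term are placeholders, not the settings used in Section 4.3.

```python
import numpy as np

n = 255
h = 2.0 / (n + 1)                                      # assumed interval [-1, 1]
x = -1.0 + h * np.arange(1, n + 1)                     # interior grid points
g = np.sin(np.pi * x)                                  # placeholder source term

# Tridiagonal matrix A of Eq. (24) and right-hand side h^2 * g of Eq. (25)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
u_star = np.linalg.solve(A, h ** 2 * g)                # direct solve of Eq. (23)
```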

If $n$ is not a large number, Eq. (23) can be solved by computing the inverse of $A$ directly. When $n$ is very large, this problem can instead be solved by iterative schemes; as an example, we illustrate the Jacobi method. Let $A=D-L-U$, where $D$ is the diagonal of $A$, and $-L$ and $-U$ are the strictly lower and upper triangular parts of $A$, respectively. Then, we can obtain

$\mathbf{u}=D^{-1}(L+U)\,\mathbf{u}+D^{-1}\mathbf{g}.$   (26)

The Jacobi iteration is

$\mathbf{u}^{t+1}=D^{-1}(L+U)\,\mathbf{u}^{t}+D^{-1}\mathbf{g}.$   (27)

We now perform an error analysis of the above iteration. Denote by $\mathbf{u}^{*}$ the true solution obtained by directly inverting $A$ in Eq. (23). The error at step $t$ is $\mathbf{e}^{t}=\mathbf{u}^{t}-\mathbf{u}^{*}$. Then, $\mathbf{e}^{t}=R_{J}^{t}\,\mathbf{e}^{0}$, where $R_{J}=D^{-1}(L+U)$. The convergence speed of $\mathbf{e}^{t}$ is determined by the eigenvalues of $R_{J}$, that is,

$\lambda_{k}=\cos\dfrac{k\pi}{n+1},\qquad k=1,2,\ldots,n,$   (28)

and the corresponding eigenvector $\mathbf{v}_{k}$ has entries

$v_{k,i}=\sin\dfrac{ik\pi}{n+1},\qquad i=1,2,\ldots,n.$   (29)

Write

$\mathbf{e}^{t}=\sum_{k=1}^{n}\alpha_{k}^{t}\,\mathbf{v}_{k},$   (30)

where $\alpha_{k}^{t}$ can be understood as the magnitude of $\mathbf{e}^{t}$ in the direction of $\mathbf{v}_{k}$. Then,

$\alpha_{k}^{t}=\lambda_{k}^{t}\,\alpha_{k}^{0}.$   (31)

Therefore, the convergence speed of $\mathbf{e}^{t}$ in the direction of $\mathbf{v}_{k}$ is controlled by $|\lambda_{k}|$. Since

$\cos\dfrac{k\pi}{n+1}=-\cos\dfrac{(n+1-k)\pi}{n+1},$   (32)

the frequencies $k$ and $n+1-k$ are closely related and converge at the same speed. Considering the frequencies $k<(n+1)/2$, $|\lambda_{k}|$ is larger for lower frequencies. Therefore, lower frequencies converge more slowly in the Jacobi method.
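The following self-contained sketch (a smaller $n$ and an arbitrary right-hand side, both placeholders) runs the Jacobi iteration and measures the error along the eigenvectors $\mathbf{v}_k$; consistent with the analysis, the low-$k$ components shrink far more slowly than the components with $k$ near $(n+1)/2$, while modes $k$ and $n+1-k$ behave identically per Eq. (32).

```python
import numpy as np

n = 63
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.random.default_rng(0).normal(size=n)            # placeholder right-hand side
u_star = np.linalg.solve(A, b)                         # reference solution

D_inv = np.diag(1.0 / np.diag(A))
R_J = D_inv @ (np.diag(np.diag(A)) - A)                # D^{-1}(L + U), eigenvalues cos(k*pi/(n+1))

i = np.arange(1, n + 1)
V = np.sin(np.outer(i, i) * np.pi / (n + 1))           # column k-1 is the eigenvector v_k

u = np.zeros(n)
for _ in range(500):                                   # Jacobi iteration, Eq. (27)
    u = R_J @ u + D_inv @ b

alpha = V.T @ (u - u_star)                             # error components along v_k, Eq. (30)
print(np.abs(alpha[:3]))                               # k = 1, 2, 3: still large (slow)
print(np.abs(alpha[n // 2 - 3: n // 2]))               # k near (n+1)/2: tiny (fast)
```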

4.2 DNN approach

Similar to the loss function in Ref. [11], we consider the following loss function (energy method):

$I(u_{\theta})=\int_{\Omega}\Big(\dfrac{1}{2}\big|u_{\theta}'(x)\big|^{2}-g(x)\,u_{\theta}(x)\Big)\,dx+\beta\sum_{x\in\partial\Omega}u_{\theta}(x)^{2}.$   (33)

Solving Poisson's equation is equivalent to finding the function that minimizes $I$ (Dirichlet's principle) [17]. The last term in $I$ is a penalty that enforces the boundary condition, and $\beta$ is a constant. For solving Eq. (21), the DNN has a 1-d input (i.e., $x$) and a 1-d output (denoted as $u_{\theta}(x)$).

The procedure is similar. We discretize $\Omega$ into $n$ evenly spaced points $\{x_{i}\}_{i=1}^{n}$. At each training step, we compute the discretized loss $I(u_{\theta})$ on $\{x_{i}\}_{i=1}^{n}$. The gradient of $I$ with respect to the parameters $\theta$ is then

$\dfrac{\partial I(u_{\theta})}{\partial\theta}\approx\dfrac{|\Omega|}{n}\sum_{i=1}^{n}\dfrac{\partial}{\partial\theta}\Big(\dfrac{1}{2}\big|u_{\theta}'(x_{i})\big|^{2}-g(x_{i})\,u_{\theta}(x_{i})\Big)+\beta\,\dfrac{\partial}{\partial\theta}\sum_{x\in\partial\Omega}u_{\theta}(x)^{2}.$   (34)

At each training step, we compare $u_{\theta}$ and $\mathbf{u}^{*}$ in the Fourier domain. Note that $\mathbf{u}^{*}$ is the solution obtained by directly inverting $A$ in Eq. (23).
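A hedged PyTorch sketch of training a small network on a discretized version of the energy in Eq. (33) is given below. The interval $[-1,1]$, the source term, the network widths, $\beta$, and the use of plain SGD are all placeholder choices, not the settings of Section 4.3.

```python
import math
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
beta = 100.0                                           # placeholder penalty weight

def g(x):                                              # placeholder source term
    return torch.sin(math.pi * x)

def energy_loss(x):
    """Discretized energy of Eq. (33): average of 0.5*|u'|^2 - g*u over the grid
    plus a boundary penalty."""
    x = x.clone().requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]   # du/dx at each grid point
    interior = (0.5 * du ** 2 - g(x) * u).mean()
    boundary = torch.tensor([[-1.0], [1.0]])           # assumed domain boundary
    return interior + beta * (net(boundary) ** 2).sum()

x_grid = torch.linspace(-1.0, 1.0, 1001).reshape(-1, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
for step in range(5000):
    optimizer.zero_grad()
    loss = energy_loss(x_grid)
    loss.backward()                                    # computes the gradient of I, cf. Eq. (34)
    optimizer.step()
```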

4.3 Experiment

Consider the following source term $g(x)$:

(35)

As shown in Fig. 3a, after training, the DNN output fits well the solution obtained by directly inverting $A$ in Eq. (23). As shown in Fig. 3b, there are three peaks of $\hat{u}^{*}$ in the Fourier domain.

To examine the convergence behavior of different frequency components during the DNN training, we compute the relative difference between the DNN output $u_{\theta}$ and $\mathbf{u}^{*}$ in the frequency domain at each recording step, i.e.,

$\Delta_{F}(k)=\dfrac{|\hat{u}_{\theta,k}-\hat{u}^{*}_{k}|}{|\hat{u}^{*}_{k}|},$   (36)

where $\hat{u}_{\theta,k}$ and $\hat{u}^{*}_{k}$ are the DFT coefficients of $u_{\theta}$ and $\mathbf{u}^{*}$ at frequency index $k$. As shown in Fig. 3c, the F-Principle holds well in solving Poisson's equation [3, 4]. For comparison, Fig. 3d shows that low frequencies converge much more slowly than high frequencies in the Jacobi method.


(a) (b) (c) DNN (d) Jacobi
Figure 3: Frequency-domain analysis of Poisson's equation in Eq. (21) with . (a) The DNN output after training and the true value; the true value $\mathbf{u}^{*}$ is computed by the central differencing scheme, directly inverting the coefficient matrix. (b) $|\hat{u}^{*}_{k}|$ (red solid line) as a function of frequency index; frequency peaks are marked by black dots. (c, d) Relative difference $\Delta_{F}(k)$ at different recording steps for the selected frequency indices: (c) DNN; (d) Jacobi iteration. The training data and the test data are evenly sampled in with sample sizes and , respectively. We use a DNN with widths 4000-800 and full-batch training. The learning rate is at the beginning and is halved every training epochs. $\beta$ is . Each step consists of four epochs. The parameters of the DNN are initialized from a Gaussian distribution with mean and standard deviation .

5 Combination of DNN and conventional methods

In light of the above numerical simulations, it is natural to ask whether we can combine a DNN and the Jacobi method to solve Poisson's equation. For simplicity, we call the combined method the D-Jacobi method.

In the first part of the D-Jacobi method, we solve Poisson's equation with the DNN for a certain number of steps. In the second part, we use the DNN output at the selected step as the initial value for the Jacobi method.

We solve the problem in Fig. 3 on a laptop (Dell Precision 5510). As shown in Fig. 4a, the DNN loss fluctuates after some running time. We switch to the Jacobi method at several time points, which are indicated by vertical dashed lines. As shown in Fig. 4b (Fig. 4c), green stars indicate the error of the DNN output at different steps, and dashed lines indicate the evolution of the Jacobi (Gauss-Seidel) method. As we can see, if the selected switching time is too early, it still takes a long time to converge to a small error, because the low frequencies have not converged yet. If the selected switching time is too late, much time is wasted, because the DNN has difficulty capturing the high frequencies and fluctuates a lot. The switching times of the green or the red curves are better choices. In practice, a good way to select the switching time is to wait until the loss has been flat and fluctuating for a short while.
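A sketch of the D-Jacobi idea, under the same placeholder setup as the earlier sketches (net, x, A, h, and g refer to those hypothetical examples, and the switching criterion is left to the user): train the DNN until its loss flattens, evaluate it on the grid, and hand that vector to the Jacobi iteration as the initial value.

```python
import numpy as np

def jacobi(A, b, u0, n_steps):
    """Jacobi iteration of Eq. (27) starting from the initial value u0."""
    D_inv = np.diag(1.0 / np.diag(A))
    R = D_inv @ (np.diag(np.diag(A)) - A)              # D^{-1}(L + U)
    u = u0.copy()
    for _ in range(n_steps):
        u = R @ u + D_inv @ b
    return u

# Part 1: train the DNN on the energy loss until the loss curve flattens, then
#   u0 = net(torch.from_numpy(x).float().reshape(-1, 1)).detach().numpy().ravel()
# Part 2: the Jacobi iteration polishes the remaining high-frequency error
#   u = jacobi(A, h ** 2 * g, u0, n_steps=10_000)
```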


(a) Loss (b) Jacobi (c) GS
Figure 4: Combined methods for solving Poisson's equation in Eq. (21) with . The abscissa is the computer running time. (a) The loss in Eq. (33). We switch to the Jacobi method at several time points, indicated by vertical dashed lines. (b) Green stars indicate the error of the DNN output at different steps; dashed lines indicate the evolution of the Jacobi method. (c) The same for the Gauss-Seidel method. The training data are evenly sampled in with sample size . $\mathbf{u}^{*}$ is computed by the central differencing scheme, directly inverting the coefficient matrix. We use a DNN with widths 4000-500-400 and full-batch training; the learning rate is . $\beta$ is . The parameters of the DNN are initialized from a Gaussian distribution with mean and standard deviation .

6 Discussion

In this work, we have shown that the F-Principle holds well in DNN training with a general loss function, extending previous studies of the F-Principle with the mean squared error loss [3, 4, 5]. Together with the previous findings that the F-Principle holds for both fully connected and convolutional neural networks with either tanh or ReLU activation [3, 4], these works suggest that the F-Principle may provide an understanding of the generalization ability of general DNNs.

We have also shown that the generality of the F-Principle in DNN training could be useful in designing algorithms for practical problems. Specifically, we apply a DNN to solve the 1-d Poisson's equation. Compared with conventional numerical schemes, DNNs could potentially work better in rather high dimensions [11]. In addition, the DNN method does not require discretization, which makes it much easier to implement. In the future, it would be interesting to use the DNN's F-Principle to develop numerical schemes for various problems that would benefit from fast convergence of low frequencies.

Acknowledgments

The author thanks Weinan E and Wei Cai for helpful discussions. The author also thanks Tao Luo, Zheng Ma, Yanyang Xiao, and Yaoyu Zhang for discussions of the F-Principle. This work was funded by the NYU Abu Dhabi Institute G1301.

References

  • [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [2] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • [3] Zhi-Qin J Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. arXiv preprint arXiv:1807.01251, 2018.
  • [4] Zhiqin John Xu. Understanding training and generalization in deep learning by fourier analysis. arXiv preprint arXiv:1808.04295, 2018.
  • [5] Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734, 2018.
  • [6] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394, 2017.
  • [7] T Poggio, K Kawaguchi, Q Liao, B Miranda, L Rosasco, X Boix, J Hidary, and HN Mhaskar. Theory of deep learning III: the non-overfitting puzzle. Technical report, CBMM memo 073, 2018.
  • [8] Guillermo Valle Pérez, Ard A Louis, and Chico Q Camargo. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522, 2018.
  • [9] Daniel Jakubovitz, Raja Giryes, and Miguel RD Rodrigues. Generalization error in deep learning. arXiv preprint arXiv:1808.01174, 2018.
  • [10] Lei Wu, Zhanxing Zhu, and Weinan E. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
  • [11] E Weinan and Bing Yu. The deep ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
  • [12] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. arXiv preprint arXiv:1707.03351, 2017.
  • [13] E Weinan, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
  • [14] William L Briggs, Steve F McCormick, et al. A multigrid tutorial, volume 72. SIAM, 2000.
  • [15] Gang Bao, Peijun Li, Junshan Lin, and Faouzi Triki. Inverse scattering problems with multi-frequencies. Inverse Problems, 31(9):093001, 2015.
  • [16] Alex Barnett, Leslie Greengard, Andras Pataki, and Marina Spivak. Rapid solution of the cryo-em reconstruction problem by frequency marching. SIAM Journal on Imaging Sciences, 10(3):1170–1195, 2017.
  • [17] Lawrence C Evans. Partial differential equations. 2010.
