Training behavior of deep neural network in frequency domain


Zhi-Qin J. Xu, Yaoyu Zhang, Yanyang Xiao
New York University Abu Dhabi
Abu Dhabi 129188, United Arab Emirates
{zhiqinxu,yz1961,yx742}@nyu.edu
This work was done while Xu was a visiting member at the Courant Institute of Mathematical Sciences, New York University, New York, United States. Code is available at Xu's homepage.
Abstract

Why deep neural networks (DNNs) capable of overfitting often generalize well in practice is a mystery in deep learning. Existing works indicate that this observation holds for both complicated real datasets and simple datasets of one-dimensional (1-d) functions. In this work, for general low-frequency dominant 1-d functions, we find that a DNN with common settings first quickly captures the dominant low-frequency components, and then relatively slowly captures the high-frequency ones. We call this phenomenon the Frequency Principle (F-Principle). In our experiments, the F-Principle can be observed over various DNN setups with different activation functions, layer structures and training algorithms. The F-Principle can be used to understand (i) the behavior of DNN training in the information plane and (ii) why DNNs often generalize well despite their ability to overfit. The F-Principle can potentially provide insight into the general principle underlying DNN optimization and generalization for real datasets.

 

Preprint. Work in progress.

1 Introduction

Although Deep Neural Networks (DNNs) are completely transparent, i.e., the value of each node and the strength of each connection can be easily obtained, it is difficult to interpret how information is processed through them. We can easily record the trajectories of the parameters of DNNs during training. However, it is very difficult to understand the principle underlying the highly non-convex problem of DNN optimization [7]. Therefore, despite this transparency, DNNs are often criticized for being a “black box” [1, 14]. Even for the simple problem of fitting one-dimensional (1-d) functions, the training process of a DNN is still not well understood [12, 15]. For example, Wu et al. (2017) [15] used DNNs with different numbers of layers to fit a few data points sampled from a third-order polynomial. They found that even when a DNN is capable of over-fitting, i.e., the number of its parameters is much larger than the size of the training set, it generalizes well (i.e., no overfitting) when it is well trained on the training set. In practice, DNNs often generalize well for much more complicated datasets even when they are capable of over-fitting [5, 9, 18, 17, 8, 15]. Intuitively, when a network is capable of over-fitting, its optimization solution lies in a huge space in which well-generalized solutions occupy only a small subset. It is therefore mysterious that DNN optimization can often lead to a well-generalized solution while ignoring a huge space of over-fitting solutions. To understand the underlying mechanism, in this work we analyze in detail the principle underlying the DNN optimization process using a class of 1-d functions. Our work could potentially provide insight into how DNNs behave in general during training and why they tend toward solutions with good generalization performance.

Motivated by the spectrum of natural signals (e.g., images and sounds), which typically have more power at low frequencies and decreasing power at high frequencies [3], we design a type of 1-d function dominated by low-frequency components to study the training of DNNs. We find that, for this type of function, DNNs with common settings first quickly capture the dominant low-frequency components while keeping the high-frequency ones small, and then relatively slowly capture those high-frequency components. We call this phenomenon the Frequency Principle (F-Principle). In our numerical experiments, the F-Principle can be widely observed over different neuron numbers (tens to thousands in each layer), layer numbers (one to tens), training algorithms (gradient descent, stochastic gradient descent, Adam) and activation functions (tanh and ReLU). According to the F-Principle, DNNs naturally utilize a strategy that is also adopted by some numerical algorithms to achieve remarkable efficiency, namely, fitting the target function progressively in ascending frequency order. Such algorithms include, for example, the Multigrid method for solving large-scale partial differential equations [4] and a recent numerical scheme that efficiently fits the three-dimensional structure of proteins and protein complexes from noisy two-dimensional images [2]. The F-Principle can be used to understand the following important phenomena: i) the behavior of DNN training in the information plane; ii) why DNNs capable of overfitting often generalize well.

Shwartz-Ziv and Tishby (2017) [14] claimed that there are two phases in the information plane during DNN training, namely, empirical error minimization and representation compression. Although there is an ongoing debate about this two-phase characterization for the hidden layers, the two-phase characterization seems to hold well for the output layer during training, no matter whether tanh or the rectified linear unit (ReLU) is used as the activation function [14, 12]. They also hypothesized that compression results from the random diffusion-like behavior of stochastic gradient descent (SGD), which could lead to the excellent generalization performance of DNNs [14]. In this work, we demonstrate that there can be no compression phase in the training course of DNNs for certain continuous functions. By discretizing these functions, we can observe the compression phase. Note that, for the discretized functions, the compression phase always appears regardless of the training method. The compression phase can be well explained by the F-Principle as follows: the DNN first fits the continuous low-frequency components of the discretized function; then, the DNN output discretizes as the network gradually captures the high-frequency components. By the definition of entropy, this discretization process naturally reduces the entropy of the DNN output. Thus, the compression phase appears in the information plane.

Using the F-Principle, we can also explain why DNNs often generalize well in practice despite their ability to over-fit [5, 9, 18, 17, 8, 15]. For a finite-size training set, there exists an effective frequency range [13, 16, 10] beyond which the information of the signal is lost. By the F-Principle, with no constraint on the high-frequency components beyond the effective frequency range, DNNs tend to keep them small. For a wide class of low-frequency dominant natural signals (e.g., images and sounds), this tendency coincides with their decaying power at high frequencies. Thus, DNNs often generalize well in practice. When the training data is noisy, the small-amplitude high-frequency components are more easily contaminated. By the F-Principle, DNNs first capture the less noisy dominant frequency components of the training data while keeping the other frequency components small. At this stage, although the loss function is not best optimized on the training data, DNNs could have better generalization performance because they do not fit the noise in the higher-frequency components. Therefore, as widely observed, better generalization performance can often be achieved with early stopping. Also, for noisy training data, we demonstrate that DNNs can gradually fit the noise at high frequencies during the compression phase according to the F-Principle. In this case, contrary to the claim in Shwartz-Ziv and Tishby (2017) [14], the compression phase actually worsens the generalization performance of DNNs.

2 Methods

We use a DNN with four hidden layers of widths 200-200-200-100. The activation function is either tanh or the rectified linear unit (ReLU). Note that the output layer is a linear transformation of the previous layer with no activation function. We record the state of the DNN every fixed number of training epochs, which is referred to as one recording step. By “common settings”, we refer to a setting of small initial weights and sufficient neurons and hidden layers, with which a DNN can well fit the target function.

Unless otherwise specified, we use the following setup in our results. The parameters of the DNN are initialized following a Gaussian distribution with zero mean and a small standard deviation. The activation function for each neuron is tanh. The DNN is trained by the full-batch Adam optimizer with a fixed learning rate; the other parameters of the Adam optimizer are set to their default values [6]. The loss function is the mean squared error between the DNN outputs and the sample labels on the training set.
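For concreteness, the sketch below (in PyTorch; not the authors' code) shows one way to realize this setup: four hidden layers of widths 200-200-200-100 with tanh, a linear output layer, full-batch Adam, and an MSE loss. The learning rate, weight standard deviation and recording interval in the sketch are placeholders, since the exact values are not reproduced here.

import torch
import torch.nn as nn

# A minimal sketch of the setup described above (placeholder hyperparameters).
def build_dnn(widths=(1, 200, 200, 200, 100, 1), weight_std=0.1):
    layers = []
    for i in range(len(widths) - 1):
        linear = nn.Linear(widths[i], widths[i + 1])
        nn.init.normal_(linear.weight, mean=0.0, std=weight_std)   # small Gaussian init
        nn.init.normal_(linear.bias, mean=0.0, std=0.1)
        layers.append(linear)
        if i < len(widths) - 2:            # no activation on the output layer
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)

def train_full_batch(net, x, y, epochs=10000, lr=1e-4, record_every=100):
    opt = torch.optim.Adam(net.parameters(), lr=lr)   # default Adam parameters otherwise
    loss_fn = nn.MSELoss()
    records = []                                      # one entry per recording step
    for epoch in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
        if epoch % record_every == 0:
            records.append((epoch, loss.item(), net(x).detach().clone()))
    return records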

3 F-Principle

Natural signals (e.g., images and sounds) generally have large power at low frequencies and decreasing power at high frequencies [3]. This observation motivates us to design a class of functions with dominant low-frequency components to study the training process of DNNs. In particular, we are interested in how and when each frequency component is captured during training. We design a target function f_α by discretizing a smooth function g as follows,

f_α(x) = α [g(x)/α],   (1)

where [·] takes the nearest integer value. When α = 0, we define f_0(x) = g(x). A feature of this type of function is that, with the same g but different α's, the functions share the same dominant low-frequency components.
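A minimal sketch of this construction, under the reconstructed form of Eq. (1) above (rounding g(x) to a grid of spacing α, with α = 0 meaning no discretization); the smooth function g, the interval and the sample size below are placeholder choices:

import numpy as np

# Sketch of the discretization in Eq. (1): f_alpha(x) = alpha * [g(x)/alpha],
# where [.] rounds to the nearest integer, and f_0 = g by definition.
def g(x):
    return np.sin(x)              # placeholder smooth, low-frequency-dominant function

def f_alpha(x, alpha):
    if alpha == 0:
        return g(x)               # alpha = 0: no discretization
    return alpha * np.round(g(x) / alpha)

x = np.linspace(-np.pi, np.pi, 601)    # hypothetical fitting interval and sample size
y_continuous = f_alpha(x, 0.0)
y_discrete = f_alpha(x, 0.5)           # larger alpha -> coarser steps, fewer output values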

In a finite interval, the frequency components of a target function are quantified by the Fourier coefficients computed from the Discrete Fourier Transform (DFT). Because the frequency in the DFT is discrete, we can refer to a frequency component by its index instead of its physical frequency. The Fourier coefficient of f_α for the k-th frequency component is denoted by F[f_α](k) (a complex number in general), and |F[f_α](k)| is the corresponding amplitude, where |·| denotes the modulus. We call k the frequency index. Fig.1a displays the first frequency components of the target function in Eq. (1); the dominant first peaks of the two displayed spectra clearly coincide. To examine the convergence behavior of different frequency components during the training of a DNN, we compute the relative difference between the DNN output and f_α in the frequency domain at each recording step, i.e.,

ΔF(k) = |F[h](k) − F[f_α](k)| / |F[f_α](k)|,   (2)

where h denotes the DNN output. Fig.1b shows ΔF(k) for the first frequency components at each recording step. It is hard to see what principle the DNN follows when all frequency components are considered. Theoretically, frequency components other than the peaks are susceptible to the artificial periodic boundary condition implicitly applied in the DFT, and are therefore not essential to our frequency-domain analysis [11]. Therefore, in the following, we focus only on the convergence behavior of the frequency peaks during training. As shown in Fig. 1c, ΔF(k) at each frequency peak (marked by black dots in Fig.1a) converges in a precise order from low to high frequency. When we use ReLU as the activation function or use non-uniformly sampled training data, we can still see this precise order of convergence of the different frequency peaks (Fig.6 in Appendix A.2). We call this phenomenon the F-Principle. In the following, we test the validity of the F-Principle for different setups and illustrate how it is related to the turning points of the information-plane trajectory and of the loss function during the training of DNNs.
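The frequency-domain diagnostics used throughout this section can be sketched as follows (NumPy; an illustration rather than the authors' code): the one-sided DFT amplitudes, the local peaks of the amplitude spectrum, and the relative difference of Eq. (2) between the DNN output and the target.

import numpy as np

# Sketch of the frequency-domain diagnostics: amplitudes |F[f]|(k), local peaks
# of the amplitude spectrum, and the relative difference of Eq. (2).
def amplitudes(samples):
    return np.abs(np.fft.rfft(samples))      # one-sided DFT amplitudes, index k = 0, 1, ...

def peak_indices(amp):
    # interior local maxima of the amplitude spectrum
    return [k for k in range(1, len(amp) - 1) if amp[k] > amp[k - 1] and amp[k] > amp[k + 1]]

def relative_difference(h_samples, f_samples):
    Fh, Ff = np.fft.rfft(h_samples), np.fft.rfft(f_samples)
    return np.abs(Fh - Ff) / np.maximum(np.abs(Ff), 1e-12)   # small floor avoids division by zero

# e.g. track relative_difference(dnn_output, target)[peak_indices(amplitudes(target))]
# at every recording step to reproduce plots like Fig. 1c.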

We denote by T_k the last recording step at which ΔF(k) is larger than a threshold δ. Fig.1d shows that the first frequency peak converges very fast, i.e., T_{k_1} is small, where k_j denotes the frequency index of the j-th frequency peak. Fig.1e shows the entropy of the DNN output T, H(T), and the mutual information between T and the true output Y, I(Y;T), at each recording step; a detailed discussion of mutual information can be found in Appendix A.1. Fig.1f shows the trajectories of the loss functions for the training data and the test data. Two green dashed lines are drawn at the steps T_k of two frequency peaks. It is clear from Fig.1e and Fig.1f that these two steps are close to the two turning points observed in both the information plane and the loss functions. Therefore, our frequency-domain analysis could provide a novel perspective for visualizing the training process of DNNs and understanding the transition points in the information plane or the loss-function trajectory.
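A small sketch of this definition, assuming the relative differences have been recorded as an array with one row per recording step; the threshold value below is an arbitrary placeholder:

import numpy as np

# Sketch: T_k is the last recording step at which the relative difference at
# frequency index k is still larger than a threshold delta.
def last_step_above(delta_f_history, k, delta=0.05):
    # delta_f_history: array of shape (num_recording_steps, num_frequencies)
    above = np.nonzero(delta_f_history[:, k] > delta)[0]
    return int(above[-1]) if above.size > 0 else -1   # -1: never above the threshold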

We have performed the same frequency-domain analysis for various other low-frequency dominant functions and different α's (data not shown), and the F-Principle always holds during the training of DNNs. For illustration, we show another example in Fig.2a, built from Eq. (1), in which the amplitudes of the low-frequency peaks do not decrease monotonically. In this case, we can still observe a precise convergence order from low to high frequency for the frequency peaks, as shown in Fig.2b and Fig.2c. Therefore, the F-Principle seems to be an intrinsic behavior of DNN optimization that cannot be explained by the amplitude difference between frequencies. Note that, for the output of a DNN outside the fitting range, the F-Principle also holds in the sense that the output is kept “flat” (i.e., low-frequency dominant), as shown in Fig.7 in Appendix A.3.

Similar to Zhang et al. (2016) [17], the DNN used in our experiments is capable of fitting a dataset with randomized labels. For such a randomized dataset, the DFT is complicated, with random fluctuations across frequencies. In this case, the convergence order of the frequency peaks shows no obvious rule (see Appendix A.4 for details). That is, the F-Principle no longer holds.


(a)

(b)

(c)

(d)

(e)

(f)
Figure 1: Frequency domain analysis of the training process of a DNN for with in Eq. (1). (a) (red solid line) and (blue dashed line) as functions of frequency index. Since the values of all valleys are zero, we use a sufficiently small value of to replace them for visualization in the log-log scale. Frequency peaks are marked by black dots. (b) at different recording steps for different frequency indexes. larger than (or smaller than ) is represented by blue (or red). (c) at different recording steps for different frequency peaks. (d) (red line) and (green line) at recording step as functions of frequency index. (e) (red solid line) and (blue solid line) at different recording steps. The black and green solid lines indicate constant values of and , respectively. (f) Loss functions for training data and test data at different recording steps. For (e) and (f), two green dashed lines are drawn at step and , respectively. The training data and the test data are evenly sampled in with sample size and , respectively.

(a)

(b)

(c)
Figure 2: Frequency domain analysis of the training process of a DNN for with in Eq. (1). (a) The target function. (b) at different frequency indexes. First four frequency peaks are marked by black dots. (c) at different recording steps for different frequency peaks. The training data is evenly sampled in with sample size .

In summary, we find that, for a general class of functions dominated by low-frequency components, the training course of DNNs follows the F-Principle: low-frequency components are captured first, followed by high-frequency components. For a function with no clear decaying trend in its Fourier coefficients, e.g., a function with randomized labels, there is no clear rule underlying the capturing order during DNN optimization.

4 Analysis of F-Principle

As the loss function drives the whole training process of a DNN, one may suspect that it is the difference in driving forces, i.e., the gradients of the loss function, at different frequencies that leads to the F-Principle. To investigate this possibility, we rewrite the loss function in Eq. (3) as its frequency-domain equivalent in Eq. (4):

L_s = (1/n) ∑_i (h(x_i) − f_α(x_i))^2,   (3)
L_f = ∑_k γ(k) |F[h](k) − F[f_α](k)|^2,   (4)

where the subscripts s and f respectively denote the spatial domain and the weighted frequency domain, n is the sample size, and γ(k) ≥ 0 is the weight of the k-th frequency component. When γ(k) = 1 for all frequencies, L_s = L_f according to Parseval's theorem (without loss of generality, the constant factor related to the definition of the corresponding Fourier transform is ignored here). This equality gives us a frequency-domain decomposition of the loss function in Eq. (3) as well as of its gradient.
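The equality can be checked numerically; the sketch below (NumPy, unnormalized FFT convention) verifies that the spatial mean squared error of Eq. (3) equals the uniformly weighted frequency-domain sum of Eq. (4) up to the stated constant factor:

import numpy as np

# Parseval check: for NumPy's unnormalized FFT, sum_i |v_i|^2 = (1/n) sum_k |V_k|^2,
# so the MSE of Eq. (3) equals the unit-weight sum of Eq. (4) divided by n^2.
rng = np.random.default_rng(0)
n = 256
h = rng.standard_normal(n)        # stand-in for DNN outputs on the training points
y = rng.standard_normal(n)        # stand-in for training labels

loss_spatial = np.mean((h - y) ** 2)                 # Eq. (3)
Fh, Fy = np.fft.fft(h), np.fft.fft(y)
loss_freq = np.sum(np.abs(Fh - Fy) ** 2) / n ** 2    # Eq. (4) with unit weights, rescaled
print(np.allclose(loss_spatial, loss_freq))          # True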

A natural conjecture would be that the weight γ(k) determines the relative speed at which each frequency component converges towards the target. However, in Fig.1c, with the frequency-wise equally weighted loss function in Eq. (3), lower-frequency peaks clearly converge faster than higher-frequency ones. Therefore, we hypothesize that the training process of DNNs implicitly endows lower-frequency components with higher priority by ignoring the higher-frequency components while the lower-frequency components are not yet well captured. To justify this hypothesis, we can manipulate γ(k) so that the weights of higher-frequency components are set to zero at the early stage of training, and check whether the same convergence behavior as in Fig.1c can be observed.

Specifically, we train DNNs from low to high frequency to fit the target function of Fig.1. Denote by k_j the frequency index of the j-th frequency peak in Fig.1a. When training towards the j-th frequency peak, we set the weights over the first half of the frequency domain to γ(k) = 1 for k ≤ k_j and γ(k) = 0 for k > k_j; the weights over the second half are determined by the conjugate symmetry of the DFT. The index j increases by one, starting from the first peak, whenever ΔF(k_j) drops below a prescribed threshold. As shown in Fig.3a, the behavior of ΔF(k) at different recording steps for the different frequency peaks is similar to that in Fig.1c. This observation conforms with our hypothesis that the higher-frequency components are, in effect, ignored at the early stage of training.

To further validate our hypothesis, we also train DNNs from high to low frequency to fit the same target function. When training towards the j-th frequency peak, we set the weights over the first half of the frequency domain to γ(k) = 1 for k ≥ k_j and γ(k) = 0 for k < k_j; the weights over the second half are again determined by the conjugate symmetry of the DFT. The index j decreases by one, starting from the highest peak, whenever ΔF(k_j) drops below the threshold or the number of training epochs used for fitting the j-th peak reaches a preset limit. We add the latter condition because the convergence of high-frequency components in this approach is often very slow. Fig.3b shows that ΔF(k) starts to converge properly only after all frequency components are used for training. This observation further supports our hypothesis that higher-frequency components indeed have lower priority, as they converge efficiently and properly only when the lower-frequency components are well captured. Therefore, the F-Principle is an intrinsic behavior of the DNN optimization process, which implicitly endows lower-frequency components with higher priority.
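A sketch of the frequency-weighted training loss used in these two experiments, following the reconstruction above: a binary weight γ(k) masks frequency components of the loss, and the cutoff is moved across the frequency peaks during training (low-to-high or high-to-low). The sketch assumes evenly spaced training points so that the DFT applies directly, and uses PyTorch's differentiable FFT so the masked loss can be optimized directly.

import torch

# Sketch of the frequency-weighted loss: gamma(k) is a 0/1 mask over the
# one-sided frequency indices; the conjugate half is implicit in rfft.
def masked_frequency_loss(outputs, targets, gamma):
    # outputs, targets: 1-d tensors over the evenly spaced training points
    # gamma: 0/1 weights of length n//2 + 1 (the rfft length)
    Fh = torch.fft.rfft(outputs)
    Fy = torch.fft.rfft(targets)
    return torch.sum(gamma * torch.abs(Fh - Fy) ** 2)

def low_to_high_mask(num_freqs, k_cut):
    # gamma(k) = 1 for k <= k_cut, 0 otherwise
    gamma = torch.zeros(num_freqs)
    gamma[: k_cut + 1] = 1.0
    return gamma

# During training, k_cut is moved to the next frequency peak whenever the
# relative error at the current peak drops below a chosen threshold.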


(a)

(b)
Figure 3: Analysis of the DNN training process with the weighted loss function L_f for the target function of Fig.1. (a) ΔF(k) at different recording steps for different frequency peaks with γ(k) tuned from low to high frequency. (b) ΔF(k) at different recording steps for different frequency peaks with γ(k) tuned from high to low frequency.

5 F-Principle in understanding DNN

5.1 Compression vs. no compression in the information plane

Through an empirical exploration of DNN training in the information plane, Shwartz-Ziv and Tishby (2017) [14] claimed that (i) information compression is a general process; (ii) information compression is induced by SGD; (iii) information compression is good for generalization. They suggested that by analyzing the behavior of DNN training in the information plane, one can obtain a deeper understanding of why DNNs perform well in practice. In this sub-section, we demonstrate how the F-Principle can be used to understand the behavior of the DNN training process in the information plane.

We first demonstrate how compression can appear or disappear by tuning the parameter α in Eq. (1), using full-batch gradient descent (GD) without stochasticity. In our simulations, the DNN well fits f_α for both the continuous case α = 0 and a discretized case α > 0 after training (see Fig.4a and c). In the information plane, there is no compression phase for α = 0 (see Fig.4b). By increasing α in Eq. (1), we observe that: i) the fitted function is discretized with only a few possible outputs (see Fig.4c); ii) a compression phase appears (see Fig.4d). For the discretized case, the behavior in the information plane is similar to previous results [14]. To understand why compression happens for α > 0, we next examine the training courses for different α in the frequency domain.

A key feature of the class of functions in Eq. (1) is that the dominant low-frequency components of f_α with different α's are the same (see Fig.1a). By the F-Principle, the DNN first captures those dominant low-frequency components; thus, the training courses for different α's are similar at the beginning, i.e., i) the DNN output is close to the smooth function g at certain training epochs (blue lines in Fig.4a and c); ii) in the information plane, the entropy of the DNN output H(T) increases rapidly until it reaches a value close to the entropy of g (see Fig.4b and d). For α = 0, the target function is g itself; therefore, H(T) gets closer and closer to the entropy of the true output, H(Y), during training. For α > 0, the entropy of the target function, H(Y), is much smaller than the entropy of g. In the later stage of capturing the high-frequency components, the DNN output converges to the discretized function f_α. Therefore, H(T) decreases from a value close to the entropy of g down to H(Y).

Actually, for α = 0, H(T) and I(Y;T) would theoretically converge to H(Y) at the end of training; thus, α = 0 is a trivial case in which a compression phase cannot occur. This analysis also applies to other continuous functions. For any discretized function, the DNN first fits the low-frequency components of the discretized function with a continuous function; then, the DNN output converges to the discretized values as the network gradually captures the high-frequency components. By the definition of entropy, this discretization process naturally reduces the entropy of the DNN output. Thus, the compression phase appears in the information plane. As discretization is in general inevitable for classification problems with discrete labels, information compression can often be observed in practice, as described in the previous study [14].
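A tiny numerical illustration of this entropy argument (the bin size and the functions are arbitrary choices for the sketch): after binning, a discretized output with only a few distinct values has lower entropy than a continuous one.

import numpy as np

# Sketch: binned entropy of a continuous output vs. a discretized output.
def binned_entropy(values, bin_size=0.05):
    binned = np.round(values / bin_size)
    _, counts = np.unique(binned, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

x = np.linspace(-np.pi, np.pi, 600)
print(binned_entropy(np.sin(x)))                        # continuous output: higher entropy
print(binned_entropy(0.5 * np.round(np.sin(x) / 0.5)))  # discretized output: lower entropy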

The issue of generalization, which concerns the third claim above [14], is discussed in detail in the next sub-section.


(a)

(b)

(c)

(d)
Figure 4: Analysis of compression phase in the information plane. is for (a, b) and for (c, d). (a, c) ) (red square) with in Eq. (1) and the DNN output (blue solid line) at step . (b, d) Trajectories of the training process of the DNN in the information plane. Color of each dot indicates its recording step. The green dashed vertical line and the red dashed horizontal line indicate constant values of and , respectively.

5.2 Generalization

Why DNNs capable of over-fitting often generalize well is a mystery in deep learning [5, 9, 18, 17, 8, 15]. By the F-Principle, this mystery can be naturally explained as follows. For a class of functions dominated by low frequencies and a finite set of training points, there is an effective frequency range for the training set, defined as the range in the frequency domain bounded by the Nyquist-Shannon sampling theorem [13] when the sampling is evenly spaced, or by its extensions [16, 10] otherwise. When the number of parameters of a DNN is greater than the size of the training set, the DNN can over-fit these sampled data points (i.e., the training set) with different amounts of power outside the effective frequency range. However, by the F-Principle, the training process implicitly biases the DNN towards a solution with low power in the high-frequency components outside the effective frequency range. For functions dominated by low frequencies, this bias coincides with their intrinsic feature of low power at high frequencies, thus naturally leading to a well-generalized solution after training. From this analysis, we can also predict that, in the case of insufficient training data, when the higher-frequency components are not negligible, e.g., when there exists a significant frequency peak above the effective frequency range, the DNN cannot generalize well after training.
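For evenly spaced samples, the effective frequency range can be sketched as follows; the sample size and interval length below are placeholders:

import numpy as np

# Sketch: for n evenly spaced samples on an interval of length L, the DFT only
# constrains frequency indices k = 0, ..., n//2 (the Nyquist limit); components
# above this effective range are left unconstrained by the training data.
def effective_frequency_range(n_samples, interval_length):
    spacing = interval_length / n_samples
    nyquist_physical = 1.0 / (2.0 * spacing)   # highest resolvable physical frequency
    max_index = n_samples // 2                 # highest resolvable frequency index
    return max_index, nyquist_physical

print(effective_frequency_range(600, 2 * np.pi))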

In another case, where the training data is contaminated by noise, the early-stopping method is usually applied to avoid overfitting in practice [8]. By the F-Principle, early stopping can help avoid fitting the noisy high-frequency components. Thus, it naturally leads to a well-generalized solution that accurately captures the dominant low-frequency components while keeping the high-frequency ones small. We use the following example for illustration.

As shown in Fig.5a, we consider a target function constructed from Eq. (1). For each sample, we add noise drawn from a zero-mean Gaussian distribution to its label. The training set and the test set consist of data points evenly sampled in the fitting interval. The DNN can well fit the sampled training set, as the loss function on the training set decreases to a very small value (green stars in Fig.5b). However, the loss function on the test set first decreases and then increases (red dots in Fig.5b). That is, the generalization performance of the DNN gets worse after a certain training step. In Fig.5c, the Fourier amplitudes of the training data (red) and the test data (black) overlap only around the dominant low-frequency components. This indicates that the noise in the samples has much more impact on the high-frequency components. Around the step marked by the green dashed lines in Fig. 5b and d, the DNN well captures the dominant peak, as shown in Fig. 5c, and the loss function on the test set attains its minimum and starts to increase (red dots in Fig.5b). These phenomena are consistent with our analysis above: early stopping helps prevent fitting the noisy high-frequency components and thus naturally leads to better generalization performance of DNNs.
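A minimal sketch of the early-stopping rule suggested by this analysis (the noise level is a placeholder): corrupt the training labels with Gaussian noise, record the test loss at every recording step, and keep the parameters from the step with the smallest test loss.

import numpy as np

# Sketch: noisy labels and the early-stopping step, i.e., the recording step at
# which the test loss is minimal, before the noisy high frequencies are fitted.
def add_label_noise(labels, std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    return labels + rng.normal(0.0, std, size=labels.shape)   # std is a placeholder value

def early_stopping_step(test_losses):
    # test_losses: one loss value per recording step
    return int(np.argmin(np.asarray(test_losses)))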

Around this step, the entropy of the DNN output, H(T), stays around its maximum (see Fig.5d). After that, H(T) gradually compresses (see Fig. 5d) as the DNN starts to capture the noisy high-frequency components, leading to worse generalization performance, as shown in Fig. 5b. Therefore, contrary to the third claim (in Section 5.1) [14], compression can worsen the generalization performance of DNNs.


(a)

(b)

(c)

(d)
Figure 5: Frequency domain analysis of the training process of the DNN for a dataset contaminated by noise. In this example, with in Eq. (1). For each sample at , we add a noise on , where follows a Gaussian distribution with mean and standard deviation . The training set and the test set respectively consist of and data points evenly sampled in . (a) The sampled values of the test set (red square dashed line) and DNN outputs at recording step (blue solid line). (b) Loss functions for training set (green stars) and test set (red dots) at different recording steps. (c) for the training set (red) and test set (black), and for the training set (green), and test set (magenta) at recording step . (d) at different recording steps. For (b) and (d), the green dashed lines are drawn at step , around where the best generalization performance is achieved.

6 Discussion and conclusion

In this work, we find a widely applicable principle, the F-Principle, underlying the optimization process of DNNs fitting 1-d functions. Specifically, for a function with dominant low-frequency components, DNNs with common settings first capture these low-frequency components while keeping the high-frequency ones small. In our experiments, this phenomenon is widely observed for DNNs with different neuron numbers (tens to thousands in each layer), layer numbers (one to tens), training algorithms (GD, SGD, Adam), and activation functions (tanh and ReLU). From our analysis, the F-Principle is an intrinsic property of the DNN optimization process under common settings for fitting 1-d functions. It can well explain the compression phase in the information plane as well as the good generalization performance of DNNs often observed in experiments. Our findings could potentially provide insight into the behavior of the DNN optimization process in general.

Note that initializing the weights with large values can complicate the F-Principle. In the previous experiments, the DNNs are initialized by a Gaussian distribution with zero mean and a small standard deviation, and their training behavior follows the F-Principle. However, with large initialization values, the F-Principle no longer holds (see Fig. 9b in Appendix A.5), and the training of the DNN is much slower compared with small-value initialization (see Fig. 9c in Appendix A.5). More importantly, these two initialization strategies can result in very different generalization performance. When the weights of the DNN are initialized with small values, the initial DNN output is flat (see Fig.10d in Appendix A.5). In contrast, the initial DNN output fluctuates strongly when the standard deviation of the initialization is large (see Fig.10a in Appendix A.5); the bias terms are always initialized with standard deviation 0.1. For both initializations, the DNNs can well fit the training data. However, for the test data, the DNN with small initialization values generalizes well, whereas the DNN with large initialization values clearly overfits, as shown in Fig.10. Intuitively, this phenomenon can be understood as follows. Without explicit constraints on the high-frequency components beyond the effective frequency range of the training data, the DNN output after training tends to inherit these high-frequency components from the initial output. Therefore, with large initialization values, the DNN output can easily overfit the target function with fluctuating high-frequency components. In practice, the weights of DNNs are often randomly initialized with standard deviations close to zero [15]. As suggested by our analysis, this small-initialization strategy may implicitly lead to a more efficient and better-generalizing optimization process of DNNs, as characterized by the F-Principle for 1-d functions.
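A small sketch (PyTorch) of this comparison: the same architecture is initialized with a small and a large weight standard deviation (both values and the frequency-index cutoff are placeholders), and the fraction of Fourier power of the initial output above that cutoff is compared.

import torch
import torch.nn as nn

# Sketch: with a large weight standard deviation, the initial DNN output already
# fluctuates, i.e., carries a sizable share of high-frequency power.
def init_net(weight_std):
    net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(),
                        nn.Linear(200, 200), nn.Tanh(),
                        nn.Linear(200, 1))
    for m in net:
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=weight_std)
            nn.init.normal_(m.bias, mean=0.0, std=0.1)   # bias std as stated above
    return net

x = torch.linspace(-3.14, 3.14, 601).unsqueeze(1)
for std in (0.1, 5.0):                                   # placeholder small/large values
    with torch.no_grad():
        out = init_net(std)(x).squeeze()
    amp = torch.abs(torch.fft.rfft(out))
    print(std, float(amp[10:].sum() / amp.sum()))        # fraction of power above index 10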

Acknowledgments

The authors want to thank David W. McLaughlin for helpful discussions and thank Qiu Yang, Zheng Ma, and Tao Luo for critically reading the manuscript. ZX, YZ, YX are supported by the NYU Abu Dhabi Institute G1301.

References

  • [1] G. Alain and Y. Bengio, Understanding intermediate layers using linear classifier probes, arXiv preprint arXiv:1610.01644, (2016).
  • [2] A. Barnett, L. Greengard, A. Pataki, and M. Spivak, Rapid solution of the cryo-em reconstruction problem by frequency marching, SIAM Journal on Imaging Sciences, 10 (2017), pp. 1170–1195.
  • [3] D. W. Dong and J. J. Atick, Statistics of natural time-varying images, Network: Computation in Neural Systems, 6 (1995), pp. 345–358.
  • [4] W. Hackbusch, Multi-grid methods and applications, vol. 4, Springer Science & Business Media, 2013.
  • [5] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, Generalization in deep learning, arXiv preprint arXiv:1710.05468, (2017).
  • [6] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
  • [7] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), p. 436.
  • [8] J. Lin, R. Camoriano, and L. Rosasco, Generalization properties and implicit regularization for multiple passes sgm, in International Conference on Machine Learning, 2016, pp. 2340–2348.
  • [9] C. H. Martin and M. W. Mahoney, Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior, arXiv preprint arXiv:1710.09553, (2017).
  • [10] M. Mishali and Y. C. Eldar, Blind multiband signal reconstruction: Compressed sensing for analog signals, IEEE Transactions on Signal Processing, 57 (2009), pp. 993–1009.
  • [11] D. B. Percival and A. T. Walden, Spectral analysis for physical applications, Cambridge University Press, 1993.
  • [12] A. M. Saxe, Y. Bansal, J. Dapello, and M. Advani, On the information bottleneck theory of deep learning, International Conference on Learning Representations, (2018).
  • [13] C. E. Shannon, Communication in the presence of noise, Proceedings of the IRE, 37 (1949), pp. 10–21.
  • [14] R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information, arXiv preprint arXiv:1703.00810, (2017).
  • [15] L. Wu, Z. Zhu, and W. E, Towards understanding generalization of deep learning: Perspective of loss landscapes, arXiv preprint arXiv:1706.10239, (2017).
  • [16] J. Yen, On nonuniform sampling of bandwidth-limited signals, IRE Transactions on Circuit Theory, 3 (1956), pp. 251–257.
  • [17] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530, (2016).
  • [18] G. Zheng, J. Sang, and C. Xu, Understanding deep learning generalization by maximum entropy, arXiv preprint arXiv:1711.07758, (2017).

Appendix A Appendix

A.1 Computation of information

For any random variables X and Y with a joint distribution p(x, y): the entropy of X is defined as H(X) = −∑_x p(x) log p(x); their mutual information is defined as I(X;Y) = ∑_{x,y} p(x, y) log [p(x, y) / (p(x) p(y))]; the conditional entropy of Y given X is defined as H(Y|X) = −∑_{x,y} p(x, y) log p(y|x).

By the construction of the DNN, its output T is a deterministic function of its input X; thus, H(T|X) = 0 and I(X;T) = H(T). To compute entropies numerically, we evenly bin X and T to X_b and T_b with a fixed bin size: for any value, its binned value is defined as the index of the bin containing it. In our work, H(T) and I(Y;T) are approximated by H(T_b) and I(Y;T_b), respectively. Note that, after binning, one value of X_b may map to multiple values of T_b; thus, H(T_b|X_b) > 0 and I(X_b;T_b) < H(T_b). The difference vanishes as the bin size shrinks. Therefore, with a small bin size, H(T_b) is a good approximation of I(X_b;T_b). In our experiments, we also find that H(T_b) and I(X_b;T_b) behave almost the same in the information plane for the default bin size (data not shown).
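A sketch of these binned estimates (NumPy); the bin size is an arbitrary choice and the variable names are illustrative:

import numpy as np

# Sketch of the binned information estimates: bin the values, then estimate
# entropy and mutual information from the empirical label distributions.
def bin_values(v, bin_size=0.05):
    return np.round(np.asarray(v) / bin_size).astype(int)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(a_labels, b_labels):
    # I(A;B) = H(A) + H(B) - H(A,B), with the joint built from label pairs
    joint = np.stack([a_labels, b_labels], axis=1)
    _, joint_counts = np.unique(joint, axis=0, return_counts=True)
    p = joint_counts / joint_counts.sum()
    h_joint = -np.sum(p * np.log2(p))
    return entropy(a_labels) + entropy(b_labels) - h_joint

# Example at one recording step:
# t_b, y_b = bin_values(dnn_output), bin_values(true_output)
# print(entropy(t_b), mutual_information(y_b, t_b))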

A.2 Examples of F-Principle

For illustration, Fig.6 shows that the F-Principle holds when we use ReLU as the activation function (Fig.6a) or non-uniformly sampled training data (Fig.6b).


(a)

(b)
Figure 6: at different recording steps for different frequency peaks as shown in Fig. 1a under different settings. (a) ReLU is used as activation function. (b) Training data is sampled non-uniformly as follows. We first evenly sample 3000 points in . Then, the training data is sampled randomly from these data points with sample size 600. Note that, for the non-uniform sampling example, we use hidden layers with width 400-400-400-200-200 to accelerate the training process. All other DNN settings are the same as Fig.1.

A.3 DNN outputs outside the fitting range

We observe in our experiments that, outside the fitting range of the training data, the DNN output is very flat. For illustration, we train a DNN to fit an oscillatory function. At the end of training, the DNN output well captures the target within the fitting range, whereas outside the fitting range it is very flat, with no oscillation, as shown in Fig.7. This indicates that, without constraints outside the fitting range, the DNN keeps its output there low-frequency (close to frequency 0) dominant.


Figure 7: DNN output after training (blue line) for . Black dots indicate training data, which is evenly sampled in with sample size .

A.4 Fitting data with complicated DFT

The following case demonstrates that, for a function with a complicated DFT, the F-Principle may not apply. For the example with random labels shown in Fig.8a, the DFT is shown in Fig.8b. The amplitudes of the frequency peaks fluctuate strongly across all frequencies. In this case, we find no obvious rule in the capturing order (see Fig.8c), even when considering only the peaks (see Fig.8d).


(a)

(b)

(c)

(d)
Figure 8: Frequency domain analysis of the training process of a DNN for a randomly labeled function. The training data is evenly sampled in with sample size 100. For the sample at , its label is constructed by Eq. (1) with and , where is randomly sampled in . (a) Training data (black dots) and the DNN output (blue line) after training. (b) at different frequency indexes. Frequency peaks are marked by black dots. (c) at different recording steps for different frequency indexes. (d) at different recording steps for different frequency peaks.

A.5 Initialization

Fig. 9 and Fig. 10 show the training processes of DNNs with different initializations; the bias terms are always initialized with standard deviation 0.1. More discussion about initialization can be found in the Discussion section of the main text.


(a)

(b)

(c)
Figure 9: Analysis of the training process of DNNs with different initializations while fitting function of with in Eq. (1). The weights of DNNs are initialized by a Gaussian distribution with mean and standard deviation either or . The training data is evenly sampled in with sample size . (a) at different frequency indexes. Frequency peaks are marked by black dots. (b) at different recording steps for different frequency peaks for large initialization values of standard deviation . (c) Loss functions of training data for initialization values of standard deviation (red) and (green) at different recording steps, respectively.

(a)

(b)

(c)

(d)

(e)

(f)
Figure 10: DNN outputs with different initializations while fitting function of with in Eq. (1). The training data and the test data are evenly sampled in with sample size and , respectively. The weights of DNNs are initialized by a Gaussian distribution with mean and standard deviation either (first row) or (second row). (a, d): (red dashed line) and initial DNN outputs (blue solid line) for the test data. (b, e): (red dashed line) and DNN outputs (blue solid line) for the training data at the end of training. (c, f): (red dashed line) and DNN outputs (blue solid line) for the test data at the end of training.