Convolutional Spectral Kernel Learning

Abstract

Recently, non-stationary spectral kernels have drawn much attention owing to their powerful ability to reveal long-range correlations and input-dependent characteristics in feature representations. However, non-stationary spectral kernels are still shallow models and are therefore deficient at learning both hierarchical features and local interdependence. In this paper, to obtain hierarchical and local knowledge, we build an interpretable convolutional spectral kernel network (CSKN) based on the inverse Fourier transform, where we introduce deep architectures and convolutional filters into non-stationary spectral kernel representations. Moreover, based on Rademacher complexity, we derive generalization error bounds and introduce two regularizers to improve performance. Combining the regularizers with recent advances in random initialization, we complete the learning framework of CSKN. Extensive experimental results on real-world datasets validate the effectiveness of the learning framework and coincide with our theoretical findings.


1 Introduction

With solid theoretical guarantees and complete learning frameworks, kernel methods have achieved great success in various domains over the past decades. However, compared to neural networks, kernel methods show inferior performance in practical applications because they fail to extract rich representations of complex latent features.

There are three factors that limit the representation ability of common kernel methods: 1) Stationary representation Bengio et al. (2006). Commonly used kernels are stationary because the kernel function is shift-invariant, so the induced feature representations depend only on the distance between inputs and are independent of the inputs themselves. 2) Kernel hyperparameter selection Cortes et al. (2010). The assigned hyperparameters of the kernel function determine the performance of kernel methods Genton (2001). Cross-validation (CV) Cawley (2006) and kernel target alignment (KTA) Cortes et al. (2010) have been introduced for kernel selection; however, these methods split the process of kernel selection and model training. 3) Lack of hierarchical or convolutional architecture. For example, Gaussian kernels, equivalent to a single-layer neural network with infinite width, only characterize the distance between inputs, and their performance depends on the choice of the bandwidth hyperparameter.

Yaglom’s theorem provides spectral statements for general kernel functions via the inverse Fourier transform Yaglom (1987). To break the limitation of stationarity, non-stationary spectral kernels were proposed with a concise spectral representation based on Yaglom’s theorem Samo and Roberts (2015); Remes et al. (2017). Using Monte Carlo sampling, non-stationary spectral kernels were represented as neural networks in Gaussian process regression Ton et al. (2018); Sun et al. (2019), where kernel hyperparameters can be optimized together with the estimator. Then, Xue et al. (2019); Li et al. (2020) extended neural networks of non-stationary spectral kernels to general learning tasks. It has been proven that non-stationary kernels can learn both input-dependent and output-dependent characteristics Li et al. (2020). However, non-stationary kernels fail to extract hierarchical features and local correlations, while deep convolutional neural networks naturally capture those characteristics and present impressive performance LeCun et al. (1998); Krizhevsky et al. (2012).

1.1 Contributions

In this paper, we propose an effective learning framework (CSKN) which learns rich feature representations and optimizes kernel hyperparameters in an end-to-end manner.

On the Algorithmic Front. The framework incorporates non-stationary spectral kernels into deep convolutional neural networks to exploit the advantages of deep and convolutional architectures. Intuitively, the learned feature mappings are input-dependent (non-stationary spectral kernel), output-dependent (backpropagation w.r.t. the objective), hierarchical (deep architecture) and locally correlated (convolutional filters).

On the Theoretical Front. We derive generalization error bounds for deep spectral kernel networks, revealing how different factors (including architecture, initialization and regularizers) affect performance and suggesting ways to improve the algorithm. More importantly, we prove that deeper networks can lead to sharper error bounds under an appropriate initialization scheme. For the first time, we provide a generalization interpretation of the superiority of deep neural networks over relatively shallow networks.

1.2 Related Work

Based on Bochner’s theorem, the first approximate spectral representations were proposed for shift-invariant kernels Rahimi and Recht (2007), known as random Fourier features. In theory, Bach (2017); Rudi and Rosasco (2017) provided optimal learning guarantees for random features. Random Fourier features stacked as neural networks were presented in Zhang et al. (2017). Based on Yaglom’s theorem, Samo and Roberts (2015) provided general spectral representations for arbitrary continuous kernels. Spectral kernel networks have attracted much attention in Gaussian processes Remes et al. (2017); Sun et al. (2018) and were extended to general learning domains Xue et al. (2019); Li et al. (2020).

Deep convolutional neural networks (CNNs) have achieved unprecedented accuracies in domains including computer vision LeCun et al. (1998); Krizhevsky et al. (2012) and natural language processing Kim (2014). Convolutional neural networks were encoded in a reproducing kernel Hilbert space (RKHS) to obtain invariance to particular transformations in an unsupervised fashion Mairal et al. (2014). Then, combined with the Nyström method, convolutional kernel networks were proposed in an end-to-end manner Mairal (2016), and their stability to deformation was studied in Bietti and Mairal (2017, 2019). Beyond stability theory, group invariance has also been studied Mallat (2012); Wiatowski and Bölcskei (2017). Recent research further explored the approximation theory of CNNs via downsampling Zhou (2020a) and the universality of CNNs Zhou (2020b). Besides, Shen et al. (2019) introduced convolutional filters into spectral kernels and studied the learning of spectrograms.

However, the generalization ability of spectral kernel networks has rarely been studied. Using Rademacher complexity, the generalization ability of spectral kernels was studied in Li et al. (2020). The RKHS norm and the spectral norm have been considered to improve the generalization ability of neural networks Bartlett et al. (2017); Belkin et al. (2018); Bietti et al. (2019b). Furthermore, Allen-Zhu et al. (2019); Arora et al. (2019) proposed that the learnability of deep models involves both generalization ability and trainability. Based on mean field theory, Poole et al. (2016); Schoenholz et al. (2017) revealed that the initialization scheme determines both trainability and expressivity.

2 Preliminaries

Consider a supervised learning scenario where training samples are drawn i.i.d. from a fixed but unknown distribution. For general machine learning tasks, we assume an input space and an output space, where the output space is one-dimensional for univariate labels (binary classification or regression) and multi-dimensional for multivariate labels (multi-class or multi-label).

Kernel methods map the input space into a reproducing kernel Hilbert space (RKHS) via an implicit feature mapping induced by a Mercer kernel. Classical kernel methods learn models in the RKHS, where the prediction function admits a linear form in the feature space. The hypothesis space is denoted by

where the weights define the estimator and the feature mapping sends the input space to a latent space characterizing more powerful feature representations. The goal of supervised learning is to learn an ideal estimator minimizing the expected loss

(1)

where the loss function is chosen according to the specific task.

2.1 Shift-invariant Kernels

Shift-invariant kernels depend only on the distance between inputs, written as $k(\boldsymbol{x}, \boldsymbol{x}') = k(\boldsymbol{x} - \boldsymbol{x}')$. Commonly used kernels are shift-invariant (stationary), such as Gaussian kernels $k(\boldsymbol{x}, \boldsymbol{x}') = \exp(-\|\boldsymbol{x} - \boldsymbol{x}'\|_2^2 / 2\sigma^2)$ and Laplacian kernels $k(\boldsymbol{x}, \boldsymbol{x}') = \exp(-\|\boldsymbol{x} - \boldsymbol{x}'\|_1 / \sigma)$. According to Bochner’s theorem, a shift-invariant kernel is determined by its spectral density via the inverse Fourier transform Stein (1999).

Lemma 1 (Bochner’s theorem).

A shift-invariant kernel on $\mathbb{R}^d$ is positive definite if and only if it can be represented as

$k(\boldsymbol{x} - \boldsymbol{x}') = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top(\boldsymbol{x} - \boldsymbol{x}')}\, p(\boldsymbol{\omega})\, d\boldsymbol{\omega}$ (2)

where $p(\boldsymbol{\omega})$ is a non-negative probability density.

Based on Bochner’s theorem (2) and Monte Carlo sampling, random Fourier features were proposed to approximate shift-invariant kernels Rahimi and Recht (2007):

$\psi(\boldsymbol{x}) = \sqrt{2/D}\,\cos(\boldsymbol{\Omega}^\top \boldsymbol{x} + \boldsymbol{b}), \qquad k(\boldsymbol{x} - \boldsymbol{x}') \approx \psi(\boldsymbol{x})^\top \psi(\boldsymbol{x}')$ (3)

where the columns of the frequency matrix $\boldsymbol{\Omega} \in \mathbb{R}^{d \times D}$ are drawn i.i.d. from the spectral density $p(\boldsymbol{\omega})$ and the phase vector $\boldsymbol{b}$ is drawn uniformly from $[0, 2\pi]^D$.
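As an illustrative sketch (not part of the original paper), the approximation (3) for a Gaussian kernel can be implemented in a few lines; the function name, bandwidth and feature dimension below are placeholders:

import numpy as np

def gaussian_rff(X, D=512, sigma=1.0, seed=None):
    """Random Fourier features approximating the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) via z(x)^T z(y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the Gaussian kernel's spectral density N(0, I / sigma^2),
    # phases drawn uniformly from [0, 2*pi].
    Omega = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ Omega + b)

# Sanity check: the feature inner product approximates the kernel value.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 5))
Z = gaussian_rff(np.stack([x, y]), D=4096, seed=1)
print(Z[0] @ Z[1], np.exp(-np.linalg.norm(x - y) ** 2 / 2.0))  # should be close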

2.2 Non-stationary Spectral Kernels

Shift-invariant kernels are stationary: they take into account only the distance between inputs and neglect useful information from the inputs themselves; they are also called stationary spectral kernels. However, the most general family of kernels is non-stationary, e.g. linear kernels and polynomial kernels.

Recently, based on Yaglom’s theorem, Fourier analysis has been extended to general kernels, covering both stationary and non-stationary cases Samo and Roberts (2015).

Lemma 2 (Yaglom’s theorem).

A general kernel on $\mathbb{R}^d$ is positive definite if and only if it admits the form

$k(\boldsymbol{x}, \boldsymbol{x}') = \int_{\mathbb{R}^d \times \mathbb{R}^d} e^{i(\boldsymbol{\omega}^\top \boldsymbol{x} - \boldsymbol{\omega}'^\top \boldsymbol{x}')}\, \mu(d\boldsymbol{\omega}, d\boldsymbol{\omega}')$ (4)

where $\mu(d\boldsymbol{\omega}, d\boldsymbol{\omega}')$ is the Lebesgue-Stieltjes measure associated with some positive semi-definite (PSD) spectral density function with bounded variation.

Yaglom’s theorem illustrates that a general kernel is associated with a positive semi-definite spectral density over pairs of frequencies. Shift-invariant kernels (Bochner’s theorem) are a special case of spectral kernels (Yaglom’s theorem), obtained when the spectral measure is concentrated on the diagonal $\boldsymbol{\omega} = \boldsymbol{\omega}'$.

To ensure a valid positive semi-definite spectral density in (4), we symmetrize the spectral density over frequency pairs and introduce the diagonal components Samo and Roberts (2015); Remes et al. (2017), such that the kernel is defined as

$k(\boldsymbol{x}, \boldsymbol{x}') = \frac{1}{4} \int_{\mathbb{R}^d \times \mathbb{R}^d} \big[ E_{\boldsymbol{\omega}, \boldsymbol{\omega}'}(\boldsymbol{x}, \boldsymbol{x}') + E_{\boldsymbol{\omega}', \boldsymbol{\omega}}(\boldsymbol{x}, \boldsymbol{x}') + E_{\boldsymbol{\omega}, \boldsymbol{\omega}}(\boldsymbol{x}, \boldsymbol{x}') + E_{\boldsymbol{\omega}', \boldsymbol{\omega}'}(\boldsymbol{x}, \boldsymbol{x}') \big]\, \mu(d\boldsymbol{\omega}, d\boldsymbol{\omega}')$ (5)

where the exponential term is $E_{\boldsymbol{\omega}, \boldsymbol{\omega}'}(\boldsymbol{x}, \boldsymbol{x}') = e^{i(\boldsymbol{\omega}^\top \boldsymbol{x} - \boldsymbol{\omega}'^\top \boldsymbol{x}')}$.

Similar to the approximation of shift-invariant kernels (3), we derive a finite-dimensional approximation of non-stationary kernels (5) by applying the Monte Carlo method.

The random Fourier features for non-stationary kernels are

(6)

where the frequency matrices contain paired Monte Carlo samples: the frequency pairs $(\boldsymbol{\omega}_i, \boldsymbol{\omega}'_i)$ are drawn i.i.d. from the spectral density, and the two phase vectors are drawn uniformly from $[0, 2\pi]$.
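Since the exact feature map (6) is not reproduced above, the following sketch (our own construction, with illustrative names and Gaussian spectral densities) shows one standard way to build paired Monte Carlo features whose inner products estimate a symmetrised non-stationary kernel; the paper's parameterisation with phase vectors may differ:

import numpy as np

def nonstationary_rff(X, D=512, sigma1=1.0, sigma2=2.0, seed=None):
    """Paired Monte Carlo features for a symmetrised non-stationary spectral
    kernel: D frequency pairs (omega_i, omega_i') are drawn from two
    independent Gaussian spectral densities, and the cosine/sine blocks make
    psi(x)^T psi(x') an unbiased estimate of the symmetrised kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    O1 = rng.normal(scale=sigma1, size=(d, D))   # omega samples
    O2 = rng.normal(scale=sigma2, size=(d, D))   # omega' samples
    A, B = X @ O1, X @ O2
    feats = np.concatenate([np.cos(A) + np.cos(B), np.sin(A) + np.sin(B)], axis=1)
    return feats / np.sqrt(4 * D)

X = np.random.default_rng(0).normal(size=(3, 8))
Psi = nonstationary_rff(X, D=2000, seed=1)
K = Psi @ Psi.T       # approximate non-stationary kernel matrix
print(np.diag(K))     # diagonals are input-dependent and at most one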

3 Convolutional Spectral Kernel Learning

Figure 1: The structure of the learning framework

3.1 Multilayer Spectral Kernel Networks

From the viewpoint of neural networks, a non-stationary kernel (5) is a single-layer neural network with infinite width, while the random Fourier approximation (6) reduces the infinite dimension to a finite width. Even though non-stationary kernels characterize input-dependent features, they are deficient in feature representation due to their shallow architecture.

In this paper, we build deep architectures of non-stationary spectral kernels by stacking their random Fourier features in a hierarchical, compositional way:

where the kernel consists of stacked spectral kernel layers and the feature mapping of each layer is approximated by random Fourier features (6). Based on the feature mapping of the previous layer, we explicitly define the random Fourier mapping of each layer

where the input of the first layer is the data itself, the frequency pairs in each layer's frequency matrices are drawn i.i.d. from that layer's spectral density, and the elements of each layer's phase vectors are drawn uniformly from $[0, 2\pi]$.

The above deep spectral kernel network is a kind of fully connected network (FCN): each layer contains two frequency matrices and two phase vectors, so the parameters of a layer are dominated by the two dense frequency matrices.
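The following PyTorch sketch (our own illustration; the layer widths, scaling and trainable-phase parameterisation are assumptions rather than the paper's exact map (6)) shows how such random Fourier feature layers can be stacked into a fully connected spectral kernel network:

import math
import torch
import torch.nn as nn

class SpectralKernelLayer(nn.Module):
    """One stacked spectral kernel layer: two trainable frequency matrices
    and two trainable phase vectors feeding a cosine non-linearity."""

    def __init__(self, in_dim, out_dim, sigma=1.0):
        super().__init__()
        # The scale sigma would be set by the initialization scheme of Section 3.5.
        self.W1 = nn.Parameter(sigma * torch.randn(in_dim, out_dim) / math.sqrt(in_dim))
        self.W2 = nn.Parameter(sigma * torch.randn(in_dim, out_dim) / math.sqrt(in_dim))
        self.b1 = nn.Parameter(2 * math.pi * torch.rand(out_dim))
        self.b2 = nn.Parameter(2 * math.pi * torch.rand(out_dim))

    def forward(self, x):
        z = torch.cos(x @ self.W1 + self.b1) + torch.cos(x @ self.W2 + self.b2)
        return z / math.sqrt(2 * self.W1.shape[1])

# Stacking several layers yields the fully connected spectral kernel network.
net = nn.Sequential(SpectralKernelLayer(32, 128),
                    SpectralKernelLayer(128, 128),
                    SpectralKernelLayer(128, 128))
print(net(torch.randn(4, 32)).shape)  # torch.Size([4, 128])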

3.2 Convolutional Spectral Kernel Networks

Even though the multilayer spectral kernel representation can learn input-dependent characteristics, long-range relationships and hierarchical features, this fully connected network (FCN) fails to extract local correlations on structured data, e.g. images and natural language. In contrast, convolutional networks guarantee local connectivity, promising dramatic improvements in complex applications.

For the sake of simplicity, we integrate spectral kernel networks with a convolutional architecture but without pooling layers or skip connections. We define the convolutional spectral kernel network (CSKN) in a hierarchical kernel form by stacking spectral kernels.

For each channel of a convolutional layer, the convolutional mapping is defined as

(7)

where the convolutional filters of each layer come in pairs of a fixed filter size. Each frequency pair is drawn from the spectral density of the corresponding layer's convolutional spectral kernel, and the bias terms are uniformly sampled from $[0, 2\pi]$.

We assume a fixed number of channels for each convolutional layer. Due to weight sharing, each layer applies the convolutional feature mappings in (7) across all spatial locations. Because the filter size and the number of channels are small constants, the number of parameters per convolutional layer is dramatically reduced compared with the fully connected case.
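A convolutional analogue of the fully connected sketch above can be written by replacing the frequency matrices with paired convolutional filters; the channel counts and filter sizes below are illustrative, and pooling and skip connections are omitted as in the paper:

import math
import torch
import torch.nn as nn

class ConvSpectralKernelLayer(nn.Module):
    """Convolutional spectral feature map: a pair of convolutional frequency
    filters per layer followed by a cosine non-linearity (a sketch loosely
    following (7), not the paper's exact parameterisation)."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):
        return (torch.cos(self.conv1(x)) + torch.cos(self.conv2(x))) / math.sqrt(2.0)

cskn = nn.Sequential(ConvSpectralKernelLayer(1, 32, kernel_size=5),
                     ConvSpectralKernelLayer(32, 32, kernel_size=3),
                     ConvSpectralKernelLayer(32, 32, kernel_size=3))
print(cskn(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 32, 28, 28])

Weight sharing across spatial locations is what keeps the parameter count small, as noted above.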

3.3 Learning Framework

The structure of the estimator is shown in Figure 1. Because the expected loss (1) cannot be minimized directly, we aim to minimize the empirical loss.

Based on our theoretical findings (Theorem 2 in the next section), we combine the empirical loss with two kinds of regularization terms in the minimization objective, written as

(8)

where the network depth and the regularization parameters enter the objective. The estimator is linear in the learned features, applying a weight matrix to the deep convolutional spectral kernel representation built in the hierarchical composite way of (7). The trace norm regularizes the estimator weights, and the squared Frobenius norm regularizes the feature mappings on all samples. These two norms are scarcely used in conventional methods: the trace norm plays the role of the RKHS norm in primal kernel methods, while the Frobenius term regularizes the frequency pairs.

Using backpropagation w.r.t. the objective (8), we update the model weights and the frequency pairs of the convolutional layers, which makes the feature mappings dependent on the specific task. The spectral density, the key to the generalization ability of kernel methods, is modified through the updates of the frequency pairs, so kernel hyperparameters in the spectral densities are optimized in an end-to-end manner.
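As a sketch of how the two regularizers in (8) can be assembled (the weights tau1 and tau2 are placeholder names, and the batch-wise Frobenius term stands in for the full-sample one):

import torch

def regularized_objective(model, W, X, y, loss_fn, tau1=1e-3, tau2=1e-4):
    """Empirical loss plus a trace (nuclear) norm on the estimator weights
    and a squared Frobenius norm on the features, mirroring objective (8)."""
    Phi = model(X)                                    # feature mappings
    emp = loss_fn(Phi @ W, y)                         # empirical loss
    reg_w = torch.linalg.matrix_norm(W, ord='nuc')    # trace norm ||W||_*
    reg_phi = Phi.pow(2).sum()                        # ||Phi||_F^2
    return emp + tau1 * reg_w + tau2 * reg_phi

In the paper the trace-norm term is not differentiated directly but handled by the SVT step of Section 3.4; the expression above only illustrates how the three terms combine.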

3.4 Update via Singular Value Thresholding (SVT)

The update of the estimator weights involves the trace norm in (8), which cannot be handled by plain gradient descent because the trace norm is non-differentiable. We therefore employ singular value thresholding (SVT) Cai et al. (2010) and minimize the trace-norm term in two steps:
1) Update the weights with an SGD step on the empirical loss

where an intermediate weight matrix is obtained from a gradient step with a given learning rate.
2) Update the weights with SVT on the trace-norm term

where the intermediate matrix is factorized by singular value decomposition, its singular values on the diagonal are soft-thresholded, and the result is reassembled from the retained singular vectors up to the rank of the matrix.
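A minimal sketch of the two-step update, assuming the threshold equals the learning rate times the trace-norm weight (the exact threshold used in the paper is not reproduced here):

import torch

def svt_update(W, grad, lr, tau):
    """1) SGD step on the empirical loss; 2) singular value soft-thresholding,
    i.e. the proximal operator of the trace norm."""
    W_half = W - lr * grad                            # intermediate weights
    U, S, Vh = torch.linalg.svd(W_half, full_matrices=False)
    S = torch.clamp(S - lr * tau, min=0.0)            # shrink singular values
    return U @ torch.diag(S) @ Vh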

3.5 Random Initialization

To approximate non-stationary kernels, we use random Gaussian weights as initialization. We initialize the joint probability distribution of each layer as two independent normal distributions with zero mean and a shared variance across all dimensions

(9)

According to mean field theory, we select the Gaussian initialization hyperparameters of every layer to sit on the critical line of the order-to-chaos transition and to satisfy dynamical isometry Poole et al. (2016); Pennington et al. (2017).
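A sketch of the per-layer initialization follows; the 1/fan-in scaling is a common mean-field convention and an assumption here, since the exact form of (9) is not reproduced above:

import numpy as np

def init_layer(in_dim, width, sigma_w, seed=None):
    """Draw two independent zero-mean Gaussian frequency matrices with
    per-entry variance sigma_w^2 / in_dim and uniform phases in [0, 2*pi];
    sigma_w would be tuned per layer to sit on the order-to-chaos critical line."""
    rng = np.random.default_rng(seed)
    scale = sigma_w / np.sqrt(in_dim)
    O1 = rng.normal(scale=scale, size=(in_dim, width))
    O2 = rng.normal(scale=scale, size=(in_dim, width))
    b1 = rng.uniform(0.0, 2 * np.pi, size=width)
    b2 = rng.uniform(0.0, 2 * np.pi, size=width)
    return O1, O2, b1, b2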

4 Generalization Analysis

Rademacher complexity theory has achieved great success in shallow learning; however, it is an open problem whether Rademacher complexity is applicable to deep neural networks Belkin et al. (2018); Bietti et al. (2019a). In this section, we apply Rademacher complexity theory to spectral kernel networks and explore how the factors in CSKN affect generalization performance.

First, we derive generic generalization error bounds for kernel methods based on Rademacher complexity. The empirical Rademacher complexity mainly depends on the sum of the diagonals of the kernel matrix. We therefore explore the generalization error bounds of three different architectures: 1) shift-invariant kernels, 2) non-stationary spectral kernels, and 3) deep non-stationary spectral networks. We then discuss the approximation ability of random Fourier features and the use of convolutional filters.

Definition 1.

The empirical Rademacher complexity of the hypothesis space is defined as

where the inner index runs over the output coordinates and the ε's are independent Rademacher variables. The expected Rademacher complexity is the expectation of the empirical one over the sample.
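For reference, a standard form of this definition for vector-valued hypotheses reads as follows (the normalisation and the output dimension c are our assumptions, since the paper's exact display is not reproduced):

\widehat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_{\boldsymbol{\varepsilon}}\Big[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} \varepsilon_{ij}\, f_j(\boldsymbol{x}_i)\Big], \qquad \mathfrak{R}_n(\mathcal{H}) = \mathbb{E}\big[\widehat{\mathfrak{R}}_n(\mathcal{H})\big],

where the $\varepsilon_{ij}$ are i.i.d. Rademacher variables.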

4.1 Excess risk bound for kernel methods

Lemma 3.

Assume the loss function is Lipschitz continuous with respect to its first argument equipped with the 2-norm. With high probability, the following excess risk bound holds

where the bound compares the empirically learned estimator with the most accurate estimator in the hypothesis space. The empirical Rademacher complexity is bounded by

(10)

where the trace norm of the estimator weights is bounded by a constant.

Based on Rademacher complexity, generalization error bounds of kernel methods have been well studied Bartlett and Mendelson (2002); Cortes et al. (2013), where the convergence rate depends on the empirical Rademacher complexity. Meanwhile, the empirical Rademacher complexity is determined by the trace of the empirical kernel matrix, so the upper bound of the Rademacher complexity is tied to the corresponding kernel function.

Remark 1.

From (10), we find that minimizing the Rademacher complexity requires minimizing both the bound on the trace norm of the weights and the sum of the diagonals of the kernel matrix. Because the former bounds the trace norm of the estimator and the latter equals the squared Frobenius norm of the feature mappings, we introduce the trace norm and the squared Frobenius norm as regularizers to obtain better performance, leading to the objective in (8).

4.2 Rademacher Complexity of Shift-invariant Kernels

According to Bochner’s theorem (2), we define shift-invariant kernels as

Lemma 4.

For an arbitrary shift-invariant kernel, every diagonal element of the corresponding kernel matrix equals one:

$k(\boldsymbol{x}_i, \boldsymbol{x}_i) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top(\boldsymbol{x}_i - \boldsymbol{x}_i)}\, p(\boldsymbol{\omega})\, d\boldsymbol{\omega} = \int_{\mathbb{R}^d} p(\boldsymbol{\omega})\, d\boldsymbol{\omega} = 1.$

For shift-invariant kernels, the diagonals thus identically equal one regardless of the spectral density, and the trace of the kernel matrix equals the sample size. The convergence rate of the Rademacher complexity is then of order $1/\sqrt{n}$ when the trace norm of the weights is bounded by a constant Bartlett and Mendelson (2002).
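Concretely, assuming the bound (10) takes the familiar trace form with the trace norm of the weights bounded by a constant B (our notation), the shift-invariant case gives

\widehat{\mathfrak{R}}_n(\mathcal{H}) \le \frac{B}{n} \sqrt{\sum_{i=1}^{n} k(\boldsymbol{x}_i, \boldsymbol{x}_i)} = \frac{B}{n}\sqrt{n} = \frac{B}{\sqrt{n}},

i.e. the $O(1/\sqrt{n})$ rate stated above.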

4.3 Improvements of Non-stationary Spectral Kernels

Based on Yaglom’s theorem (4), we define non-stationary spectral kernels as

where the frequency pairs are drawn i.i.d. from the spectral density. We initialize the spectral density as the product of two independent zero-mean Gaussian distributions with given variances.

Theorem 1.

The diagonals of a non-stationary spectral kernel matrix are:

where the frequency pairs are drawn from the joint spectral density.

Consequently, the diagonal elements of non-stationary spectral kernels are less than one. When the variance is large, the trace is even smaller than in the stationary (shift-invariant) case. Note that shift-invariant kernels are the special case of spectral kernels with a diagonal spectral density, for which all diagonals equal one.
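To make the argument concrete, under the symmetrised form (5) as reconstructed above and zero-mean Gaussian spectral densities with per-dimension variances $\sigma_1^2$ and $\sigma_2^2$ (our notation), the diagonal can be computed as

k(\boldsymbol{x}, \boldsymbol{x}) = \tfrac{1}{4}\,\mathbb{E}_{\boldsymbol{\omega}, \boldsymbol{\omega}'}\big[2 + 2\cos\!\big((\boldsymbol{\omega} - \boldsymbol{\omega}')^\top \boldsymbol{x}\big)\big] = \tfrac{1}{2}\Big(1 + e^{-(\sigma_1^2 + \sigma_2^2)\|\boldsymbol{x}\|^2 / 2}\Big) \le 1,

which is input-dependent, strictly smaller than one for nonzero inputs, and decreasing in the variances, consistent with the discussion above.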

4.4 Improvements from Deep Architecture

We introduce deep architecture for spectral kernels via

where the spectral representation of each layer of the stacked spectral kernels is related to the previous layer in a recursive way:

where the recursion starts from the input data. We use a simple initialization scheme where the paired frequencies of each layer are drawn i.i.d. from two independent zero-mean Gaussian distributions.

Theorem 2.

For any input data, the diagonal of the current layer's kernel matrix is smaller than the corresponding diagonal of the previous layer:

whenever the initialization variances satisfy

(11)
Remark 2.

Theorem 2 holds for all diagonals, so the sum of the diagonals magnifies the difference. With a favorable initialization scheme, we obtain decreasing diagonals as the depth increases, which leads to sharper generalization error bounds. It is worth noting that, for the first time, we prove that deeper neural network architectures can obtain better generalization performance with suitable initialization. The theorem reveals the superiority of deep neural networks over shallow learning (such as kernel methods) from the viewpoint of generalization.

The results in Theorem 2 guide the design of the initialization variance to obtain better generalization performance for deep neural networks. The right-hand side of inequality (11) is decreasing w.r.t. the diagonals. To make deeper architectures beneficial, we should ensure that the diagonals decrease with depth, which requires enlarging the initialization variance as the depth grows. Based on mean field theory, recent work has devised better initialization strategies Poole et al. (2016); Yang and Schoenholz (2017); Hanin and Rolnick (2018); Jia et al. (2019) to improve trainability; however, these strategies are independent of depth and ignore generalization. It is worth further studying initialization schemes that deliver both good generalization ability and trainability.

4.5 Trainable Spectral Kernel Network

The generalization analysis above (Lemma 4, Theorem 1, Theorem 2) is derived in the RKHS with implicit feature mappings. However, the computation of hierarchically stacked spectral kernels is intractable and optimal kernel hyperparameters are hard to estimate, so we construct explicit feature mappings via the Monte Carlo approximation in (6).

According to Hoeffding’s inequality, we can bound the approximation error with high probability:

where the tolerance is a small constant. The approximation error decays quickly with the number of Monte Carlo samples. Rahimi and Recht (2007) proved that a small approximation error is achieved with any constant probability once the number of random features is large enough. Besides, recent work has shown that random features can achieve optimal learning rates in kernel ridge regression Rudi and Rosasco (2017); Bach (2017).
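For completeness, the Hoeffding-type pointwise bound takes the following generic form (constants omitted; the paper's exact statement is not reproduced here):

\Pr\big[\,|\hat{k}(\boldsymbol{x}, \boldsymbol{x}') - k(\boldsymbol{x}, \boldsymbol{x}')| \ge \epsilon\,\big] \le 2\exp\!\big(-c\,D\,\epsilon^2\big),

for a universal constant $c$, so $D = O(\epsilon^{-2}\log(1/\delta))$ Monte Carlo samples suffice for accuracy $\epsilon$ with probability at least $1 - \delta$.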

Traditional kernel selection methods separate the choice of hyperparameters from model learning. In contrast, the presented spectral kernel networks are trainable, so kernel hyperparameters and model weights are optimized together in an end-to-end manner.

4.6 The Use of Convolutional Filters

Deep convolutional neural networks LeCun et al. (1998); Krizhevsky et al. (2012) have achieved impressive accuracies, which is often attributed to effectively leveraging the local stationarity of natural images at multiple scales. Group invariance and stability to the action of diffeomorphisms have been well studied Mallat (2012); Wiatowski and Bölcskei (2017); Bietti and Mairal (2019). Meanwhile, Zhou (2020b) studied the universality of deep convolutional neural networks and proved that CNNs can approximate any continuous function to arbitrary accuracy when the depth is large enough. However, the generalization ability of CNNs has scarcely been studied, because it is hard to extend the generalization results for FCNs to CNNs due to their different structures. It remains unclear how to prove the superiority of convolutional networks from the viewpoint of generalization.

Dataset    CNN          CRFF         DSKN         CDSK         CSKN
segment    95.24±1.72   95.35±2.17   96.08±1.94   96.37±1.21   97.03±1.42
satimage   86.74±1.49   85.46±1.86   86.56±1.80   88.31±1.37   88.35±1.25
usps       97.81±1.74   97.76±2.03   99.14±1.64   98.17±1.56   99.18±1.27
pendigits  99.07±0.57   99.03±0.67   99.16±0.50   99.44±0.57   99.46±0.41
letter     95.70±1.47   95.34±1.56   96.16±1.71   96.69±1.46   96.97±1.31
Table 1: Classification accuracy (%, mean±std) for all datasets. We bold the numbers of the best method and underline the numbers of the other methods which are not significantly worse than the best one.
Train size  CNN          CRFF         DSKN         CDSK         CSKN
1K          90.82±2.31   91.15±2.37   91.49±1.66   91.84±1.44   92.02±1.54
2K          93.04±1.35   94.15±1.44   94.11±1.08   94.32±1.67   95.41±1.37
5K          96.64±1.65   96.13±1.68   96.83±1.33   98.45±1.58   98.47±1.73
10K         98.79±1.14   94.81±1.07   98.80±0.86   99.02±0.78   99.03±0.74
20K         99.03±0.61   97.39±0.72   98.97±0.49   99.19±0.68   99.26±0.51
40K         99.25±0.53   98.21±0.49   99.10±0.53   99.27±0.61   99.32±0.27
60K         99.30±0.41   98.45±0.51   99.34±0.37   99.39±0.24   99.45±0.18
Table 2: Classification accuracy (%, mean±std) for the compared methods on the MNIST dataset without data augmentation. Here, we bold the optimal results and underline the results which show no significant difference with the optimal one.

5 Experiments

In this section, we study the experimental performance of CSKN against related algorithms on several benchmark datasets to demonstrate the effects of four factors: 1) the non-stationary spectral kernel, 2) deep architecture, 3) convolutional filters, and 4) kernel learning via backpropagation. We first run the algorithms on five small structured datasets. Then, on the medium-sized MNIST dataset, we conduct experiments on training partitions of varying size.

5.1 Experimental Setup

We use a three-layer network for the deep architectures, with the width chosen to achieve a favorable Monte Carlo approximation. All algorithms are initialized according to (9), where the spectral density of each layer is fixed on the critical line between the ordered and chaotic phases according to mean field theory Poole et al. (2016); Schoenholz et al. (2017). Convolutional networks additionally use Delta-orthogonal initialization Xiao et al. (2018). Regularization parameters are selected by k-fold cross-validation. We implement all algorithms in PyTorch Paszke et al. (2019) and use Adam as the optimizer Kingma and Ba (2014) with mini-batches. All experiments are repeated 10 times to obtain stable results, and the repeated test errors are used to assess the statistical significance of the differences between the compared methods and the best one. We use convolutional filters on all datasets, with one filter size for the first layer and another for the higher layers.

To confirm the effectiveness of the factors used in our algorithm, we compare CSKN with several relevant algorithms: 1) CNN: A vanilla convolutional network consisting only of convolutional layers (with ReLU activations), without pooling operators or skip connections Xiao et al. (2018).
2) CRFF: Stacked random Fourier features Zhang et al. (2017) with convolutional filters, corresponding to stationary spectral kernels.
3) DSKN: Deep spectral kernel network without convolutional filters Xue et al. (2019).
4) CDSK: A variant of CSKN where hyperparameters are fixed at initialization and backpropagation is not used.

5.2 Experiments on Small Image datasets

We first run experiments on several small image datasets where the structural information is more likely to be captured by convolution operators. These image datasets are collected from the LIBSVM Data repository Chang and Lin (2011). We use the original partition of training and testing data.

We report the results in Table 1, which indicate: 1) the proposed CSKN achieves the best accuracies on all datasets, validating the effectiveness of our learning framework; 2) the results of CDSK are slightly worse than CSKN due to the lack of parameter updates; 3) compared with CSKN, DSKN performs worse because it is a fully connected network without convolutional filters; 4) CRFF provides the worst results, which coincides with the generalization analysis, where the stationary spectral kernel leads to inferior generalization error bounds.

5.3 Handwriting recognition on MNIST

Here, we conduct experiments on the MNIST dataset LeCun et al. (1998), which consists of 60,000 training images and 10,000 test images of handwritten digits. We randomly select subsets of the training data to evaluate performance on partitions of different sizes.

Test accuracies are reported in Table 2. The results illustrate: 1) CSKN outperforms the compared methods for all data sizes. 2) Non-stationary kernels consistently provide better results than the stationary kernel approach (CRFF). 3) With appropriate initialization, even without backpropagation, the convolutional deep spectral kernel (CDSK) still achieves performance similar to CSKN. 4) Kernel-based networks work better than CNN when the number of training samples is small.

6 Conclusion and Discussion

In this paper, we first integrate the non-stationary spectral kernel with a deep convolutional neural network architecture, using a Monte Carlo approximation for each layer. The proposed algorithm is a trainable network that optimizes the spectral density and the estimator together via backpropagation. Then, based on Rademacher complexity, we extend the generalization analysis of kernel methods to the proposed network. From the perspective of generalization, we prove that non-stationary spectral kernels enjoy better generalization ability and that deeper architectures lead to sharper error bounds under suitable initialization. The generalization analysis interprets the superiority of deep architectures and can be applied to general DNNs to improve their interpretability. Intuitively, the generated feature mappings enjoy the following benefits: 1) input-dependent (non-stationary spectral kernels), 2) output-dependent (backpropagation towards the objective), 3) hierarchically represented (deep architecture), 4) locally correlated (convolutional operators).

However, several open problems remain. For convolutional networks, current theoretical work focuses on group invariance Mallat (2012), stability Bietti and Mairal (2019) and approximation ability Zhou (2020b). However, these theories cannot explain why convolutional architectures work better than fully connected networks. In future work, we will try to explain the generalization ability of convolutional networks using downsampling Zhou (2020a) and locality. Besides, our generalization analysis indicates that the initialization variance should increase with depth, whereas it decreases with increasing depth in current mean field theory work Schoenholz et al. (2017); Xiao et al. (2018). It is worth exploring the trade-off between generalization and optimization in terms of random initialization. Our work can also be combined with the Neural Tangent Kernel (NTK) Jacot et al. (2018) to capture the training dynamics and construct a simpler kernel.

7 Proof

Proof of Lemma 3.

Based on the Lipschitz condition on the loss, we combine Lemma A.5 of Bartlett et al. (2005) with the contraction lemma (Lemma 5 of Cortes et al. (2016)). Then, with high probability, it holds that

(12)

We estimate empirical Rademacher complexity via

(13)

where the auxiliary matrix collecting the Rademacher variables is defined as follows:

Applying Hölder’s inequality to (13) and bounding the trace norm of the weights by a constant, we obtain

(14)

Then, we bound the remaining term as follows

(15)

The last step is due to the symmetry of the kernel. We finally bound the empirical Rademacher complexity as

(16)

Substituting the above inequality (16) into (12) completes the proof. ∎

References

  1. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6155–6166.
  2. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332.
  3. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research 18 (21), pp. 1–38.
  4. Local Rademacher complexities. The Annals of Statistics 33 (4), pp. 1497–1537.
  5. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249.
  6. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482.
  7. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396.
  8. The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems, pp. 107–114.
  9. Invariance and stability of deep convolutional representations. In Advances in Neural Information Processing Systems, pp. 6210–6220.
  10. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research 20 (1), pp. 876–924.
  11. On regularization and robustness of deep neural networks. In International Conference on Learning Representations.
  12. A kernel perspective for regularizing deep neural networks. In International Conference on Machine Learning, pp. 664–674.
  13. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (4), pp. 1956–1982.
  14. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1661–1668.
  15. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27.
  16. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 2760–2768.
  17. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 2514–2522.
  18. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 239–246.
  19. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research 2 (Dec), pp. 299–312.
  20. How to start training: the effect of initialization and architecture. In Advances in Neural Information Processing Systems, pp. 571–581.
  21. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580.
  22. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  23. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751.
  24. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  25. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  26. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  27. Automated spectral kernel learning. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  28. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635.
  29. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 1399–1407.
  30. Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398.
  31. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  32. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pp. 4785–4795.
  33. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368.
  34. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 21 (NIPS), pp. 1177–1184.
  35. Non-stationary spectral kernels. In Advances in Neural Information Processing Systems 30 (NIPS), pp. 4642–4651.
  36. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems 30 (NIPS), pp. 3215–3225.
  37. Generalized spectral kernels. arXiv preprint arXiv:1506.02236.
  38. Deep information propagation. In International Conference on Learning Representations.
  39. Learning spectrograms with convolutional spectral kernels. arXiv preprint arXiv:1905.09917.
  40. Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.
  41. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779.
  42. Differentiable compositional kernel learning for Gaussian processes. arXiv preprint arXiv:1806.04326.
  43. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statistics 28, pp. 59–78.
  44. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory 64 (3), pp. 1845–1866.
  45. Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pp. 5393–5402.
  46. Deep spectral kernel learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4019–4025.
  47. Correlation theory of stationary and related random functions, Volume I: Basic results. Springer.
  48. Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems, pp. 7103–7114.
  49. Stacked kernel network. arXiv preprint arXiv:1711.09219.
  50. Theory of deep convolutional neural networks: downsampling. Neural Networks.
  51. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis 48 (2), pp. 787–794.