Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks


Sanjeev Arora
Princeton University
arora@cs.princeton.edu
   Simon S. Du
Institute for Advanced Study
ssdu@ias.edu
   Zhiyuan Li
Princeton University
zhiyuanli@cs.princeton.edu
   Ruslan Salakhutdinov
Carnegie Mellon University
rsalakhu@cs.cmu.edu
   Ruosong Wang
Carnegie Mellon University
ruosongw@andrew.cmu.edu
   Dingli Yu
Princeton University
dingliy@cs.princeton.edu
Abstract

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained by gradient descent under $\ell_2$ loss with infinitesimally small learning rate, and (b) kernel regression with respect to the so-called Neural Tangent Kernel (NTK) (jacot2018neural). An efficient algorithm to compute the NTK, as well as its convolutional counterpart (CNTK), appears in arora2019exact, which allowed studying the performance of infinitely wide nets on datasets like CIFAR-10. However, the super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting that neural tangent kernels perform strongly on low-data tasks.

  1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets.

  2. On CIFAR-10 with 10 to 640 training samples, Convolutional NTK consistently beats ResNet-34, by margins ranging from under 1% to roughly 3% (see Table 2).

  3. On the VOC07 testbed for few-shot image classification with transfer learning from ImageNet (goyal2019scaling), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance.

  4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (arora2019exact). NTK's efficacy may be traced to the lower variance of its output.

1 Introduction

Modern neural networks (NNs) have far more parameters than training data points, which allows them to achieve near-zero training error while simultaneously, for reasons yet to be understood, having low generalization error (zhang2016understanding). This motivated the formal study of highly overparametrized networks, including networks whose width (i.e., the number of nodes in layers, or the number of channels in convolutional layers) goes to infinity. A recent line of theoretical results shows that with $\ell_2$ loss and infinitesimal learning rate, in the limit of infinite width the trajectory of training converges to kernel regression with a particular kernel, the neural tangent kernel (NTK) (jacot2018neural). For convolutional networks, the corresponding kernel is the CNTK. See Section 2 for more discussion. arora2019exact gave an algorithm to exactly compute the kernel corresponding to the infinite-width limit of various realistic NN architectures with convolution and pooling layers, allowing them to compute performance on CIFAR-10, which revealed that the infinite networks have several percent higher error than their finite counterparts. This is still fairly good performance for a fixed kernel.

Ironically, while the above analysis at first sight appears to reduce the study of a complicated model (deep networks) to an older, simpler model (kernel regression), in practice the simpler model is computationally less efficient, because the running time of kernel regression can be quadratic in the number of data points. (The bottleneck is constructing the kernel, which scales quadratically with the number of data points (arora2019exact); the regression also requires matrix inversion, which can be cubic in the number of data points.) Thus computing with the CNTK kernel on large datasets like ImageNet currently appears infeasible. Even on CIFAR-10, it seems infeasible to incorporate data augmentation.

However, kernel classifiers are very efficient on small datasets. Here NTKs could conceivably be practical while at the same time bringing some of the power of deep networks to these settings. We recall that olson2018modern recently showed that multilayer neural networks can be reasonably effective on small datasets, specifically on a UCI testbed of tasks with as few as dozens of training examples. Of course, this required some hyperparameter tuning, although they noted that such tuning is also needed for the champion method, Random Forests (RF), which the multilayer neural networks could not beat.

It is thus natural to check whether NTK, corresponding to infinitely wide fully-connected networks, performs well in such small-data tasks. (Note that NTKs can also be used in kernel SVMs, which are not known to be equivalent to training infinitely wide networks; currently, the equivalence is only known for ridge regression. We tried both.) Convex objectives arising from kernels have stable solvers with minimal hyperparameter tuning. Furthermore, random initialization in deep network training seems to lead to higher variance in the output, which can hurt performance in small-data settings. Can NTKs do better? Below we will see that in the setup of olson2018modern, NTK predictors indeed outperform the corresponding finite deep networks, and also slightly beat the earlier gold standard, Random Forests. This suggests NTK predictors should belong in any list of off-the-shelf machine learning methods.

Following are low-data settings where we used NTKs and CNTKs:


  • In the testbed of classification tasks from the UCI database, the NTK predictor achieves superior, and arguably the strongest, classification performance. This is verified via several standard statistical tests used to compare classifiers on UCI datasets, including Friedman Rank, Average Accuracy, Percentage of the Maximum Accuracy (PMA), and the probability of achieving 90%/95% of the maximum accuracy (P90 and P95). (The authors plan to release the code, to allow off-the-shelf use of this method. It does not require GPUs.)

  • We find the performance of NN is close to that of NTK. On every dataset from the UCI database, the gap between the classification accuracy of NN and that of NTK is small, whereas on some datasets the gap between the accuracy of NN (or NTK) and that of other classifiers like RF can be much larger. This indicates that in low-data settings, NTK is indeed a good description of NN. Furthermore, we find NTK is more stable (smaller variance), which seems to help it achieve better accuracy on small datasets (cf. Figure 2(b)).

  • CNTK is useful in small-data computer vision tasks. On CIFAR-10, we compare CNTK with ResNet using 10 to 640 training samples and find CNTK can beat ResNet by up to about 3% (Table 2). We further study the few-shot image classification task on the VOC07 dataset. The standard method is to first use a pre-trained network, e.g., ResNet-50 trained on ImageNet, to extract features and then directly apply a linear classifier on the extracted features (goyal2019scaling). Here we replace the linear classifier with CNTK and obtain better classification accuracy in various setups.

Paper organization.

Section 2 discusses related work. Section 3 reviews the derivation of NTK. Section 4 presents experiments using NN and NTK on UCI datasets. Section 5 presents experiments using CNN and CNTK on small CIFAR-10 datasets. Section 6 presents experiments using CNTK in the few-shot learning setting. Additional technical details are presented in the appendix.

2 Related Work

Our paper is inspired by fernandez2014we, which conducted extensive experiments on UCI datasets. Their conclusion is that random forest performs best, followed by SVM with Gaussian kernel. Therefore, RF may be considered a reference ("gold standard") against which to compare new classifiers. olson2018modern followed this testing strategy to evaluate the performance of modern neural networks, concluding that modern neural networks, even though highly over-parameterized, still give reasonable performance on these small datasets, though not as strong as RFs. Our paper follows the same testing strategy to evaluate the performance of NTK.

The focus of this paper, the neural tangent kernel, is induced from a neural network architecture. The connection between infinitely wide neural networks and kernel methods is not new (neal1996priors; williams1997computing; leroux07a; hazan2015steps; lee2018deep; matthews2018gaussian; novak2019bayesian; garriga-alonso2018deep; cho2009kernel; daniely2016toward; daniely2017sgd). However, these kernels correspond to neural networks where only the last layer is trained. The neural tangent kernel, first proposed by jacot2018neural, is fundamentally different, as NTKs correspond to infinitely wide NNs with all layers being trained. Theoretically, a line of work studies the optimization and generalization behavior of ultra-wide NNs (allen2018convergence; allen2018learning; arora2019fine; du2018gradient; du2018deep; li2018learning; zou2018stochastic; yang2019scaling; cao2019generalization; cao2019generalizationtheory). Recently, arora2019exact gave a non-asymptotic perturbation bound between the NN predictor trained by gradient descent and the NTK predictor. Empirically, lee2019wide verified that on small-scale data, NTK is a good approximation to NN. However, arora2019exact showed that on large-scale datasets, NN can outperform NTK, which may be due to the effect of finite width and/or the optimization procedure.

Generalizations to architectures other than fully-connected NNs and CNNs have recently been proposed (yang2019scaling; du2019graph; bietti2019inductive). du2019graph showed that the graph neural tangent kernel (GNTK) can achieve better performance than its counterpart, the graph neural network (GNN), on small graph classification datasets.

3 Neural Network and Neural Tangent Kernel

Since NTK is induced by a NN architecture, we first define a NN formally. Let $x \in \mathbb{R}^{d}$ be the input, and denote $g^{(0)}(x) = x$ and $d_0 = d$ for notational convenience. We define an $L$-hidden-layer fully-connected neural network recursively:

$$f^{(h)}(x) = W^{(h)} g^{(h-1)}(x) \in \mathbb{R}^{d_h}, \qquad g^{(h)}(x) = \sqrt{\tfrac{c_\sigma}{d_h}}\,\sigma\big(f^{(h)}(x)\big) \in \mathbb{R}^{d_h}, \qquad h = 1, \dots, L, \quad (1)$$

where $W^{(h)} \in \mathbb{R}^{d_h \times d_{h-1}}$ is the weight matrix in the $h$-th layer ($h \in [L]$), $\sigma$ is a coordinate-wise activation function, and $c_\sigma = \big(\mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)^2]\big)^{-1}$ is a scaling factor. (Putting an explicit scaling factor in the definition of the NN is typically called NTK parameterization (jacot2018neural; park2019effect); the standard parameterization scheme does not have the explicit scaling factor. The derivation of NTK requires NTK parameterization. In our experiments on NNs, we try both parameterization schemes.) In this paper, for NN we consider $\sigma$ being ReLU or ELU (clevert2015fast), and for NTK we only consider kernel functions induced by NNs with ReLU activation. The last layer of the neural network is

$$f(\theta, x) = f^{(L+1)}(x) = W^{(L+1)} g^{(L)}(x), \quad (2)$$

where $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$ is the weight vector in the final layer, and we let $\theta = \big(W^{(1)}, \dots, W^{(L+1)}\big)$ denote all parameters in the network. All the weights are initialized to be i.i.d. $\mathcal{N}(0,1)$ random variables.

When the hidden widths $d_1, \dots, d_L \to \infty$, certain limiting behavior emerges along the gradient descent trajectory. Let $x, x'$ be two data points. The covariance kernel of the $h$-th layer's outputs, $\Sigma^{(h)}(x, x')$, can be defined recursively in analytical form: $\Sigma^{(0)}(x, x') = x^{\top} x'$, and

$$\Lambda^{(h)}(x, x') = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\ \Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x') \end{pmatrix}, \qquad \Sigma^{(h)}(x, x') = c_\sigma \, \mathbb{E}_{(u, v) \sim \mathcal{N}\left(0, \Lambda^{(h)}(x, x')\right)}\big[\sigma(u)\,\sigma(v)\big] \quad (3)$$

for $h = 1, \dots, L$. Crucially, this analytical form holds not only at initialization but also during training (when gradient descent with a small learning rate is used as the optimization routine).

Formally, NTK is defined as the limiting gradient kernel

$$\Theta^{(L)}(x, x') = \lim_{d_1, \dots, d_L \to \infty} \left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, x')}{\partial \theta} \right\rangle.$$

Again, one can obtain a recursive formula:

$$\dot{\Sigma}^{(h)}(x, x') = c_\sigma \, \mathbb{E}_{(u, v) \sim \mathcal{N}\left(0, \Lambda^{(h)}(x, x')\right)}\big[\dot{\sigma}(u)\,\dot{\sigma}(v)\big], \quad (4)$$

$$\Theta^{(L)}(x, x') = \sum_{h=1}^{L+1} \left( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot{\Sigma}^{(h')}(x, x') \right), \quad (5)$$

where we let $\dot{\Sigma}^{(L+1)}(x, x') = 1$ for convenience. It is easy to check that if we fix the first $h_0$ layers and only train the remaining layers, then the resulting NTK is $\sum_{h=h_0+1}^{L+1} \big( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot{\Sigma}^{(h')}(x, x') \big)$. Note that when $h_0 = L$, the resulting NTK is $\Sigma^{(L)}(x, x')$, which is the NNGP kernel (lee2018deep). The number of fixed bottom layers $h_0$ can be viewed as a hyperparameter of the NTK classifier, and in our UCI experiments we tune $h_0$. Given a kernel function, one can directly use it for downstream classification tasks (scholkopf2001learning).
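To make the recursion concrete, below is a minimal NumPy sketch (our own illustration, not the authors' released code) that computes the Gram matrix of the fully-connected ReLU NTK from Eqs. (3)-(5), using the standard closed-form (arc-cosine) expressions for the Gaussian expectations with $c_\sigma = 2$; the argument `h0` plays the role of the number of fixed bottom layers.

```python
import numpy as np

def ntk_gram(X, L, h0=0):
    """Fully-connected ReLU NTK Gram matrix (sketch).

    X  : (n, d) array of inputs (rows are data points).
    L  : number of hidden layers.
    h0 : number of fixed bottom layers; h0 = L recovers the NNGP kernel Sigma^{(L)}.
    """
    n = X.shape[0]
    sigma = X @ X.T                                   # Sigma^{(0)}(x, x') = x^T x'
    theta = sigma.copy() if h0 == 0 else np.zeros((n, n))
    for h in range(1, L + 1):
        norms = np.sqrt(np.outer(np.diag(sigma), np.diag(sigma)))
        lam = np.clip(sigma / np.maximum(norms, 1e-12), -1.0, 1.0)
        # Closed forms of Eqs. (3)-(4) for ReLU with c_sigma = 2:
        sigma_next = norms * (lam * (np.pi - np.arccos(lam))
                              + np.sqrt(1.0 - lam ** 2)) / np.pi
        sigma_dot = (np.pi - np.arccos(lam)) / np.pi
        theta = theta * sigma_dot                     # propagate earlier terms of Eq. (5)
        if h >= h0:                                   # drop contributions of fixed layers
            theta = theta + sigma_next
        sigma = sigma_next
    return theta
```

For example, `ntk_gram(X, L=3, h0=0)` returns the kernel of a 3-hidden-layer ReLU network with all layers trained, while `ntk_gram(X, L=3, h0=3)` returns the corresponding NNGP kernel.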

4 Experiments on UCI Datasets

In this section, we present our experimental results on UCI datasets, following the setup of fernandez2014we, which extensively compared classifiers including random forest, Gaussian kernel SVM, multilayer neural networks, etc. Section 4.1 discusses the performance of NTK through detailed comparisons with the other classifiers tested by fernandez2014we. Section 4.2 compares the NTK classifier and the corresponding NN classifier and verifies how similar their predictions are. See Table 6 in Appendix A for a summary of the datasets we used. We compute NTKs induced by networks with 1 to 5 fully-connected layers and then use a soft-margin kernel SVM for classification, as sketched below. The detailed experiment setup, including the choice of datasets, training/test splitting, and ranges of hyperparameters, is described in Appendix A. The code can be found at https://github.com/LeoYu/neural-tangent-kernel-UCI. We note that the usual methods of obtaining confidence bounds in these low-data settings are somewhat heuristic.
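As a concrete illustration of this pipeline (a sketch under our own assumptions, reusing the `ntk_gram` helper sketched in Section 3, not the released code), the NTK Gram matrix can be plugged into scikit-learn's soft-margin SVM as a precomputed kernel:

```python
import numpy as np
from sklearn.svm import SVC

def ntk_svm_predict(X_train, y_train, X_test, L=3, h0=0, C=1.0):
    """Fit a soft-margin SVM with a precomputed NTK kernel and predict test labels."""
    X_all = np.concatenate([X_train, X_test], axis=0)
    K = ntk_gram(X_all, L=L, h0=h0)           # Gram matrix over train + test points
    n = X_train.shape[0]
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K[:n, :n], y_train)               # train block of the kernel
    return clf.predict(K[n:, :n])             # test-vs-train block of the kernel
```

The hyperparameters ($L$, $h_0$, and the cost value $C$) are tuned as described in Appendix A.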

4.1 Overall Performance Comparisons

Classifier | Friedman Rank | Average Accuracy | P90 | P95 | PMA
NTK | 28.34 | 81.95% ± 14.10% | 88.89% | 72.22% | 95.72% ± 5.17%
NN (He init) | 40.97 | 80.88% ± 14.96% | 81.11% | 65.56% | 94.34% ± 7.22%
NN (NTK init) | 38.06 | 81.02% ± 14.47% | 85.56% | 60.00% | 94.55% ± 5.89%
RF | 33.51 | 81.56% ± 13.90% | 85.56% | 67.78% | 95.25% ± 5.30%
Gaussian Kernel | 35.76 | 81.03% ± 15.09% | 85.56% | 72.22% | 94.56% ± 8.22%
Polynomial Kernel | 38.44 | 78.21% ± 20.30% | 80.00% | 62.22% | 91.29% ± 18.05%
Table 1: Comparisons of different classifiers on UCI datasets. P90/P95: the number of datasets on which a classifier achieves 90%/95% or more of the maximum accuracy, divided by the total number of datasets. PMA: average percentage of the maximum accuracy (mean ± standard deviation).

Table 1 lists the performance of 6 classifiers under various metrics: the three top classifiers identified in fernandez2014we, namely RF, Gaussian kernel, and polynomial kernel, along with our new methods: NTK, NN with He initialization, and NN with NTK initialization. Table 1 shows NTK is the best classifier under all metrics, followed by RF, the best classifier identified in fernandez2014we. We now interpret each metric in more detail.

Friedman Ranking and Average Accuracy.

NTK is the best (Friedman Rank 28.34, Average Accuracy 81.95%), followed by RF (Friedman Rank 33.51, Average Accuracy 81.56%) and then followed by SVM with Gaussian kernel (Friedman Rank 35.76, Average Accuracy 81.03%). The difference between NTK and RF is significant (-5.17 in Friedman Rank and +0.39% in Average Accuracy), just as the superiority of RF is significant compared to other classifiers as claimed in fernandez2014we. NN (with either He initialization or NTK initialization) performs significantly better than the polynomial kernel in terms of the Average Accuracy (80.88% and 81.02% vs. 78.21%), but in terms of Friedman Rank, NN with He initialization is worse than polynomial kernel (40.97 vs. 38.44) and NN with NTK initialization is slightly better than polynomial kernel (38.06 vs. 38.44). On many datasets where most classifiers have similar performances, NN’s rank is high as well, whereas on other datasets, NN is significantly better than most classifiers, including SVM with polynomial kernel. Therefore, NN enjoys higher Average Accuracy but suffers higher Friedman Rank. For example, on ozone dataset, NN with NTK initialization is only 0.25% worse than polynomial kernel but their difference in terms of rank is 56. It is also interesting to see that NN with NTK initialization performs better than NN with He initialization (38.06 vs. 40.97 in Friedman Rank and 81.02% vs. 80.88% in Average Accuracy).

P90/P95 and PMA.

P90/P95 measures, for a given classifier, the fraction of datasets on which it achieves more than 90%/95% of the maximum accuracy among all classifiers. NTK is one of the best classifiers (tying with the Gaussian kernel on P95), which shows NTK can consistently achieve superior classification performance across a broad range of datasets. Lastly, we consider the Percentage of the Maximum Accuracy (PMA). NTK achieves the best average PMA, followed by RF, whose PMA is 0.47% below that of NTK; all other classifiers are below 94.6%. An interesting observation is that NTK, NN with NTK initialization, and RF have small standard deviations (5.17%, 5.89%, and 5.30%), whereas all other classifiers have much larger standard deviations.

4.2 Pairwise Comparisons

NTK vs. RF.

In Figure 1(a), we compare NTK with RF. There are 42 datasets on which NTK outperforms RF and 40 datasets on which RF outperforms NTK. (We consider classifier A to outperform classifier B if the classification accuracy of A is at least 0.1% better than that of B; the threshold 0.1% is chosen because the results in fernandez2014we are only reported to a precision of 0.1%.) The mean difference is 2.51%, which is statistically significant by a Wilcoxon signed rank test. We see NTK and RF perform similarly when the Bayes error rate is low, with NTK being slightly better. There are some exceptions; for example, on the balance-scale dataset, NTK achieves 98% accuracy whereas RF only achieves 84.1%. When the Bayes error rate is high, either classifier can be significantly better than the other.

[Figure 1: Performance comparisons between NTK and other classifiers. (a) RF vs. NTK (axes: Random Forest, NTK); (b) Gaussian kernel vs. NTK (axes: SVM with Gaussian Kernel, NTK).]

NTK vs. Gaussian Kernel.

The gap between NTK and the Gaussian kernel is more significant. As shown in Figure 1(b), NTK generally performs better than the Gaussian kernel regardless of whether the Bayes error rate is low or high. There are 43 datasets on which NTK outperforms the Gaussian kernel and 34 datasets on which the Gaussian kernel outperforms NTK. The mean difference is 2.22%, which is also statistically significant by a Wilcoxon signed rank test. On the balloons, heart-switzerland, pittsburg-bridges-material, teaching, and trains datasets, NTK is better than the Gaussian kernel by at least 11% in accuracy. These five datasets all have fewer than 200 samples, which shows that NTK can perform much better than the Gaussian kernel when the number of samples is small.

NTK vs. NN.

In this section we compare NTK with NN. The goals are (i) comparing the performance and (ii) verifying NTK is a good approximation to NN.

In Figure 2(a), we compare the performance of NTK and NN with He initialization. For most datasets, these two classifiers perform similarly. However, there are a few datasets on which NTK performs significantly better. There are 50 datasets on which NTK outperforms NN with He initialization and 27 datasets on which NN with He initialization outperforms NTK. The mean difference is 1.96%, which is also statistically significant by a Wilcoxon signed rank test.

In Figure 2(b), we compare the performance of NTK and NN with NTK initialization. Recall that for NN with NTK initialization, when the width goes to infinity, the predictor is exactly NTK. Therefore, we expect these two predictors to give similar performance. Figure 2(b) verifies our conjecture: there is no dataset on which one classifier is significantly better than the other. We do not have the same observation in Figures 1(a), 1(b), or 2(a). Nevertheless, NTK often performs better than NN. There are 52 datasets on which NTK outperforms NN with NTK initialization and 25 datasets on which NN with NTK initialization outperforms NTK. The mean difference is 1.54%, which is also statistically significant by a Wilcoxon signed rank test.

[Figure 2: Performance comparisons between NTK and NN. (a) NN with He initialization vs. NTK; (b) NN with NTK initialization vs. NTK (axes: Neural Networks, NTK).]

5 Experiments on Small CIFAR-10 Dataset

In this section, we study the performance of CNTK on subsampled CIFAR-10 datasets. We randomly choose $n$ samples from the CIFAR-10 training set, use them to train CNTK and ResNet-34, and test both classifiers on the whole test set. In our experiments, we vary $n$ from 10 to 1280, and the number of convolutional layers of the CNTK varies from 5 to 14. After computing the CNTK, we use kernel regression for training and testing, as sketched below. See Appendix B for the detailed experiment setup.
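A minimal sketch of the kernel-regression step (our own illustration, not the released code; the CNTK Gram matrix itself comes from the formulas of arora2019exact):

```python
import numpy as np

def kernel_regression_predict(K_train, K_test_train, y_train, num_classes=10):
    """Kernel regression on one-hot labels, then predict by the argmax output.

    K_train      : (n, n) CNTK Gram matrix on the n training points.
    K_test_train : (m, n) CNTK kernel between the m test points and the training points.
    y_train      : (n,) integer labels.
    """
    Y = np.eye(num_classes)[y_train]                  # one-hot targets
    alpha = np.linalg.solve(K_train, Y)               # ridgeless kernel regression
    return np.argmax(K_test_train @ alpha, axis=1)    # predicted class per test point
```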

The results are reported in Table 2. It can be observed that in this low-data setting CNTK consistently outperforms ResNet-34. The largest gap occurs at $n = 320$, where the 14-layer CNTK achieves 36.57% accuracy and ResNet achieves 33.15%. The smallest improvement occurs at $n = 10$: 15.33% vs. 14.59%. When the number of training samples is large, i.e., $n = 1280$, ResNet can outperform CNTK. It is also interesting to see that the 14-layer CNTK is the best-performing CNTK for all values of $n$, suggesting depth has a significant effect on this task.

n | ResNet | 5-layer CNTK | 8-layer CNTK | 11-layer CNTK | 14-layer CNTK
10 | 14.59% ± 1.99% | 15.08% ± 2.43% | 15.24% ± 2.44% | 15.31% ± 2.38% | 15.33% ± 2.43%
20 | 17.50% ± 2.47% | 18.03% ± 1.91% | 18.50% ± 2.03% | 18.69% ± 2.07% | 18.79% ± 2.13%
40 | 19.52% ± 1.39% | 20.83% ± 1.68% | 21.07% ± 1.80% | 21.23% ± 1.86% | 21.34% ± 1.91%
80 | 23.32% ± 1.61% | 24.82% ± 1.75% | 25.18% ± 1.80% | 25.40% ± 1.84% | 25.48% ± 1.91%
160 | 28.30% ± 1.38% | 29.63% ± 1.13% | 30.17% ± 1.11% | 30.46% ± 1.15% | 30.48% ± 1.17%
320 | 33.15% ± 1.20% | 35.26% ± 0.97% | 36.05% ± 0.92% | 36.44% ± 0.91% | 36.57% ± 0.88%
640 | 41.66% ± 1.09% | 41.24% ± 0.78% | 42.10% ± 0.74% | 42.44% ± 0.72% | 42.63% ± 0.68%
1280 | 49.14% ± 1.31% | 47.21% ± 0.49% | 48.22% ± 0.49% | 48.67% ± 0.57% | 48.86% ± 0.68%
Table 2: Performance of ResNet-34 and CNTK on small CIFAR-10 datasets with n training samples (mean ± standard deviation of test accuracy over 20 runs).

6 Experiments on Few-shot Learning

[Table 3: Performance (mAP) of linear SVM and CNTK with 1-6 convolutional layers on the 11th-20th classes in VOC07, for k = 1, ..., 8 positive examples per class. The cost value is tuned on the first 10 classes. Features extracted from conv5 in ResNet-50.]
[Table 4: Performance (mAP) of linear SVM and CNTK with 1-6 convolutional layers on the 11th-20th classes in VOC07, for k = 1, ..., 8 positive examples per class. The cost value is tuned on the first 10 classes. Features extracted from conv4 in ResNet-50.]
[Table 5: Performance (mAP) of linear SVM and CNTK with 1-6 convolutional layers on the 11th-20th classes in VOC07, for k = 1, ..., 8 positive examples per class. The cost value is tuned on the first 10 classes. Features extracted from conv3 in ResNet-50.]

In this section, we test the ability of CNTK as a drop-in replacement for the linear classifier in the few-shot learning setting. Using a linear SVM on top of extracted features is arguably the most widely used strategy in few-shot learning, as a linear SVM is easy and fast to train, whereas more complicated strategies like fine-tuning or training a neural network on top of the extracted features have more randomness and may overfit due to the small size of the training set. NTK has the same benefits as the linear classifier (easy and fast to train, no randomness) but also allows some non-linearity in the design. Since we consider image classification tasks, we use CNTK in the experiments in this section.

Experiment Setup.

We mostly follow the settings of goyal2019scaling. Features are extracted from layers conv1, conv2, conv3, conv4, and conv5 of a ResNet-50 (he2016deep) trained on ImageNet (deng2009imagenet), and we use these features for the VOC07 classification task. The number of positive examples $k$ varies from 1 to 8. For each value of $k$ and each class in the VOC07 dataset, we choose $k$ positive examples together with a set of negative examples. For each $k$ and each class, we randomly draw several independent training sets of this form. We report the mean and standard deviation of mAP on the test split of the VOC07 dataset. This setting has been used in a number of previous few-shot learning papers (goyal2019scaling; zhang2017split).

We take the extracted features as the input to CNTK. We tried CNTKs with 1 to 6 convolution layers, followed by a global average pooling layer and a fully-connected layer. We normalize the data in the feature space, and finally use an SVM to train the classifiers; a sketch of this step appears below. Note that without the convolution layers, this is equivalent to directly applying a linear SVM after global average pooling and normalization. We use sklearn.svm.LinearSVC to train linear SVMs, and sklearn.svm.SVC to train kernel SVMs (for CNTK). To train the SVMs, we choose the cost value $C$ from a fixed grid and set the class_weight ratio for the positive/negative classes as in goyal2019scaling.
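The following is a minimal sketch of this SVM step for a single binary (one-vs-rest) VOC07 class, under our own assumptions: `cntk_gram` is a hypothetical helper returning the CNTK kernel matrix between two sets of extracted features, and the positive/negative class weighting of 2:1 is an assumed value standing in for the ratio used in goyal2019scaling.

```python
from sklearn.svm import SVC, LinearSVC

def few_shot_scores(feat_train, y_train, feat_test, C=1.0, cntk_gram=None):
    """Return decision scores for one VOC07 class (labels y_train in {0, 1})."""
    class_weight = {1: 2, 0: 1}                    # assumed positive/negative re-weighting
    if cntk_gram is None:                          # linear SVM baseline
        clf = LinearSVC(C=C, class_weight=class_weight)
        clf.fit(feat_train, y_train)
        return clf.decision_function(feat_test)
    clf = SVC(kernel="precomputed", C=C, class_weight=class_weight)
    clf.fit(cntk_gram(feat_train, feat_train), y_train)          # CNTK kernel SVM
    return clf.decision_function(cntk_gram(feat_test, feat_train))
```

The per-class decision scores are then used to compute mAP on the VOC07 test split.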

Since the number of given samples is usually small, goyal2019scaling chooses to report the performance of the best cost value $C$. In our experiments, we use a more standard method to perform cross-validation: we use the first 10 classes of VOC07 to tune $C$, and report the performance of the selected $C$ on the other 10 classes in Tables 3-5. For completeness, we also report the performance of the best $C$, as in goyal2019scaling, in Tables 7-9 in the appendix.

Discussions.

First, we find CNTK is a strong drop-in replacement for the linear classifier in few-shot learning. Tables 3-9 clearly demonstrate that CNTK is consistently better than the linear classifier. Note that Tables 3-9 only show the prediction accuracy in an average sense; in fact, we find that on every randomly sampled training set, CNTK always gives better performance. We conjecture that CNTK gives better performance because the non-linearity in CNTK helps prediction. This is supported by the performance gain of CNTK for different feature extractors. Conv5 is often considered the most useful layer for linear classification, and there CNTK only outperforms the linear classifier by a small margin. On the other hand, with conv3 and conv4 features, CNTK can outperform the linear classifier by a considerably larger margin (last two rows in Table 5). We believe this happens because conv3 and conv4 correspond to middle-level features, and thus non-linearity is indeed beneficial for better accuracy.

We also observe that CNTK often performs best with a single convolutional layer. This is expected, since the features extracted by ResNet-50 already provide a good representation. Nevertheless, we find that for the middle-level conv3 features with the largest values of $k$, CNTK with 3 convolutional layers gives the best performance. We believe this is because with more data, utilizing the non-linearity induced by CNTK can further boost performance.

7 Conclusion

The Neural Tangent Kernel, discovered through mathematical curiosity about deep networks in the limit of infinite width, is found to yield superb performance on low-data tasks, beating extensively tuned versions of classic methods such as random forests. The (fully-connected) NTK classifiers are easy to compute (no GPU required) and thus should make a good off-the-shelf classifier in many settings. We plan to release our code to allow drop-in replacement for SVMs and linear regression.

Many theoretical questions arise. Do NTK SVMs correspond to some infinite net architecture (as NTK ridge regression does)? What explains generalization in small-data settings? (This understanding is imperfect even for random forests.) Finally, one can derive NTKs corresponding to other architectures, e.g., the recurrent neural tangent kernel (RNTK) induced by recurrent neural networks. Testing their performance on benchmark tasks would be an interesting direction for future research.

Acknowledgments

S. Arora, Z. Li, and D. Yu are supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Amazon Research, DARPA, and SRC. S. S. Du is supported by the National Science Foundation (Grant No. DMS-1638352) and the Infosys Membership. R. Wang and R. Salakhutdinov are supported in part by NSF IIS-1763562, Office of Naval Research grant N000141812861, and an Nvidia NVAIL award. We thank AWS for cloud computing time. We thank Xiaolong Wang for discussions of the few-shot learning task.

References

Appendix A Additional Experimental Details on UCI

In this section, we describe our experiment setup.

Dataset Selection

Our starting point was the pre-processed UCI datasets from fernandez2014we, which can be downloaded from http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz. Their preprocessing involved changing categorical features to numerical ones, some normalization, and turning regression problems into classification problems. Of these we only retained datasets with at most 5000 samples, since we are interested in the low-data regime. Furthermore, we discovered an apparent bug in their pre-processing of datasets with an explicit training/test split (specifically, their featurization was done differently on training and test data). Fixing the bug and retraining models on those datasets would complicate an apples-to-apples comparison, so we discarded these datasets from our experiment. This left 90 datasets; see Table 6 for a summary. Even on these 90 remaining datasets, their ranking of different classifiers was almost preserved, with random forest the best, followed by SVM with Gaussian kernel.

Quantile | # samples | # features | # classes
min | 10 | 3 | 2
25% | 178 | 8 | 2
50% | 583 | 16 | 3
75% | 1022 | 32 | 6
max | 5000 | 262 | 100
Table 6: Dataset Summary

Performance Comparison Details

We follow the comparison setup in fernandez2014we and report cross-validation accuracy. Hyperparameters are tuned with the same validation methodology as in fernandez2014we: all available training samples are randomly split into one training and one test set, imposing that each class has the same number of training and test samples, and the parameter setting with the best validation accuracy is selected; a sketch appears below. It is possible to give confidence bounds for this parameter-tuning scheme, but they are worse than the standard ones for separated training/validation/testing data. We train the NTK and NN classifiers on these datasets ourselves; for the other classifiers, we use the results from fernandez2014we.
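A minimal sketch of this tuning split (our own illustration): a 50/50 class-balanced split can be obtained with scikit-learn's stratified splitting, and the hyperparameter setting with the best accuracy on the held-out half is kept.

```python
from sklearn.model_selection import train_test_split

def tuning_split(X, y, seed=0):
    """Split the training data in half, keeping the per-class counts balanced."""
    return train_test_split(X, y, test_size=0.5, stratify=y, random_state=seed)

# X_tr, X_val, y_tr, y_val = tuning_split(X, y)
# evaluate each hyperparameter combination on (X_val, y_val) and keep the best one
```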

NTK Specification

We calculate the NTKs induced by fully-connected neural networks with $L$ layers where the bottom $h_0$ layers are fixed, and then use C-support vector classification as implemented in sklearn.svm.SVC. We tune the depth $L$ from 1 to 5, the number of fixed layers $h_0$ from 0 to $L - 1$, and the cost value $C$ over powers of ten (7 values). The number of kernels used is thus 15 (one per pair of $L$ and $h_0$), so the total number of parameter combinations is 105. Note that this number is much smaller than the number of hyperparameter combinations for the Gaussian kernel reported in fernandez2014we, where roughly 500 combinations are tuned.

NN Specification

We use fully-connected NNs with a fixed number of hidden nodes per layer and train them with gradient descent. We tune the network depth, the use of batch normalization (with or without), and the learning rate (two values), and run gradient descent for a fixed budget of epochs. (We found NN can reach near-zero training loss well within this budget, consistent with the observation in olson2018modern.) We treat NN with He initialization and NN with NTK initialization as two classifiers and report their results separately.

Appendix B Additional Experimental Details on Small CIFAR-10 Datasets

We randomly choose an equal number of samples from each class of the CIFAR-10 training set (for $n$ training samples in total) and test the classifiers on the whole test set. $n$ varies from 10 to 1280. For each $n$, we repeat the experiment 20 times and report the mean accuracy and its standard deviation for each classifier.

The number of convolutional layers of the CNTK ranges from 5 to 14. After the convolutional layers, we apply a global pooling layer and a fully-connected layer. We refer readers to arora2019exact for the exact CNTK formulas. We normalize the kernel so that each sample has unit length in feature space, as sketched below.
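A minimal sketch of this normalization (our own illustration): each kernel entry is divided by the feature-space norms of the two samples involved.

```python
import numpy as np

def normalize_kernel(K):
    """Rescale K so that every sample has unit norm in the induced feature space."""
    d = np.sqrt(np.diag(K))              # feature-space norms of the samples
    return K / np.outer(d, d)            # K[i, j] / (||phi(x_i)|| * ||phi(x_j)||)
```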

We use ResNet-34 with widths 64, 128, 256 and default hyperparameters: learning rate 0.1, momentum 0.9, weight decay 0.0005. We decay the learning rate by a factor of 10 at epochs 80 and 120, with 160 training epochs in total. The training batch size is the minimum of the size of the whole training set and 160. We report the best test accuracy over all epochs.
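For reference, a minimal PyTorch sketch of this optimization schedule (our own illustration; torchvision's ResNet-34 is used as a stand-in for the width-modified variant described above):

```python
import torch
import torchvision

# Stand-in architecture: torchvision's ResNet-34 (the variant above uses widths 64, 128, 256).
model = torchvision.models.resnet34(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):          # 160 training epochs in total
    # ... one pass of minibatch gradient descent over the subsampled training set ...
    scheduler.step()              # multiplies the learning rate by 0.1 at epochs 80 and 120
```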

Appendix C Additional Results in Few-shot Learning

Tables 7-9 show the performance with the best cost value $C$, as done in goyal2019scaling.

[Table 7: Performance (mAP) of linear SVM and CNTK with 1-6 convolutional layers on all classes in VOC07 with the best cost value, for k = 1, ..., 8 positive examples per class. Features extracted from conv5 in ResNet-50.]
[Table 8: Performance (mAP) of linear SVM and CNTK with 1-6 convolutional layers on all classes in VOC07 with the best cost value, for k = 1, ..., 8 positive examples per class. Features extracted from conv4 in ResNet-50.]
[Table 9: Performance (mAP) of linear SVM and CNTK with 1-6 convolutional layers on all classes in VOC07 with the best cost value, for k = 1, ..., 8 positive examples per class. Features extracted from conv3 in ResNet-50.]