Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks
Abstract
Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under the squared loss by gradient descent with an infinitesimally small learning rate, and (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (jacot2018neural). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in arora2019exact, which allowed studying the performance of infinitely wide nets on datasets like CIFAR-10. However, the super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks.

On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets.

On CIFAR-10 with 10–640 training samples, Convolutional NTK consistently beats ResNet-34.

On the VOC07 testbed for few-shot image classification with features transferred from an ImageNet-trained network (goyal2019scaling), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance.

Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (arora2019exact). NTK's efficacy may trace to the lower variance of its output.
1 Introduction
Modern neural networks (NNs) have far more parameters than training data points, which allows them to achieve near-zero training error while simultaneously, for reasons yet to be understood, having low generalization error (zhang2016understanding). This motivated the formal study of highly overparametrized networks, including networks whose width (i.e., number of nodes in layers, or number of channels in convolutional layers) goes to infinity. A recent line of theoretical results shows that with the squared loss and an infinitesimal learning rate, in the limit of infinite width the trajectory of training converges to kernel regression with a particular kernel, the neural tangent kernel (NTK) (jacot2018neural). For convolutional networks, the kernel is the CNTK. See Section 2 for more discussion. arora2019exact gave an algorithm to exactly compute the kernel corresponding to the infinite limit of various realistic NN architectures with convolution and pooling layers, allowing them to compute performance on CIFAR-10, which revealed that the infinite networks have somewhat higher error than their finite counterparts. This is still fairly good performance for a fixed kernel.
Ironically, while the above analysis at first sight appears to reduce the study of a complicated model (deep networks) to an older, simpler model (kernel regression), in practice the simpler model is computationally less efficient, because the running time of kernel regression can be quadratic in the number of data points!^1 (^1 The bottleneck is constructing the kernel, which scales quadratically with the number of data points (arora2019exact). The regression also requires a matrix inversion, which can be cubic in the number of data points.) Thus computing with the CNTK kernel on large datasets like ImageNet currently appears infeasible. Even on CIFAR-10, it seems infeasible to incorporate data augmentation.
However, kernel classifiers are very efficient on small datasets, so NTKs could conceivably be practical there while at the same time bringing some of the power of deep networks to these settings. We recall that olson2018modern recently showed that multilayer neural networks can be reasonably effective on small datasets, specifically on a UCI testbed of tasks with as few as dozens of training examples. Of course, this required some hyperparameter tuning, although they noted that such tuning is also needed for the champion method, Random Forests (RF), which multilayer neural networks could not beat.
It is thus natural to check whether NTK, corresponding to infinitely wide fully-connected networks, performs well on such small-data tasks.^2 (^2 Note that NTKs can also be used in kernel SVMs, which are not known to be equivalent to training infinitely wide networks; currently, the equivalence is only known for ridge regression. We tried both.) Convex objectives arising from kernels have stable solvers with minimal hyperparameter tuning. Furthermore, random initialization in deep network training seems to lead to higher variance in the output, which can hurt performance in small-data settings. Can NTKs do better? Below we will see that in the setup of olson2018modern, NTK predictors indeed outperform the corresponding finite deep networks, and also slightly beat the earlier gold standard, Random Forests. This suggests NTK predictors belong in any list of off-the-shelf machine learning methods.
Following are low-data settings where we used NTKs and CNTKs:


On the testbed of classification tasks from the UCI database, the NTK predictor achieves superior, arguably the strongest, classification performance. This is verified via several standard statistical tests performed to compare classifiers on UCI datasets, including Friedman Rank, Average Accuracy, Percentage of the Maximum Accuracy (PMA), and the probability of achieving 90%/95% of the maximum accuracy (P90 and P95). (The authors plan to release the code, to allow off-the-shelf use of this method. It does not require GPUs.)

We find the performance of NN is close to that of NTK. On every dataset from the UCI database, the difference between the classification accuracy of NN and that of NTK is small, whereas on some datasets the difference between the classification accuracy of NN (or NTK) and that of other classifiers like RF can be much larger. This indicates that in low-data settings, NTK is indeed a good description of NN. Furthermore, we find NTK is more stable (smaller variance), which seems to help it achieve better accuracy on small datasets (cf. Figure 1(b)).

CNTK is useful in computer vision tasks with small data. On CIFAR-10, we compare CNTK with ResNet using 10–640 training samples and find CNTK can beat ResNet. We further study the few-shot image classification task on the VOC07 dataset. The standard method is to first use a pretrained network, e.g., ResNet-50 trained on ImageNet, to extract features and then directly apply a linear classifier on the extracted features (goyal2019scaling). Here we replace the linear classifier with CNTK and obtain better classification accuracy in various setups.
Paper organization.
Section 2 discusses related work. Section 3 reviews the derivation of NTK. Section 4 presents experiments using NN and NTK on UCI datasets. Section 5 presents experiments using CNN and CNTK on small CIFAR-10 datasets. Section 6 presents experiments using CNTK for the few-shot learning setting. Additional technical details are presented in the appendix.
2 Related Work
Our paper is inspired by fernandez2014we, which conducted extensive experiments on UCI datasets. Their conclusion is that random forest performs best, followed by SVM with Gaussian kernel; therefore, RF may be considered a reference ("gold standard") against which to compare new classifiers. olson2018modern followed this testing strategy to evaluate modern neural networks, concluding that such networks, though highly overparameterized, still give reasonable performance on these small datasets, though not as strong as RFs. Our paper follows the same testing strategy to evaluate the performance of NTK.
The focus of this paper, the neural tangent kernel, is induced from a neural network architecture. The connection between infinitely wide neural networks and kernel methods is not new (neal1996priors; williams1997computing; leroux07a; hazan2015steps; lee2018deep; matthews2018gaussian; novak2019bayesian; garrigaalonso2018deep; cho2009kernel; daniely2016toward; daniely2017sgd). However, these kernels correspond to neural networks where only the last layer is trained. The neural tangent kernel, first proposed by jacot2018neural, is fundamentally different: NTKs correspond to infinitely wide NNs with all layers trained. Theoretically, a line of work studies the optimization and generalization behavior of ultra-wide NNs (allen2018convergence; allen2018learning; arora2019fine; du2018gradient; du2018deep; li2018learning; zou2018stochastic; yang2019scaling; cao2019generalization; cao2019generalizationtheory). Recently, arora2019exact gave a non-asymptotic perturbation bound between the NN predictor trained by gradient descent and the NTK predictor. Empirically, lee2019wide verified that on small-scale data, NTK is a good approximation to NN. However, arora2019exact showed that on large-scale datasets, NN can outperform NTK, which may be due to the effects of finite width and/or the optimization procedure.
Generalizations to architectures other than fully-connected NNs and CNNs have recently been proposed (yang2019scaling; du2019graph; bietti2019inductive). du2019graph showed that the graph neural tangent kernel (GNTK) can achieve better performance than its counterpart, the graph neural network (GNN), on small datasets.
3 Neural Network and Neural Tangent Kernel
Since NTK is induced by a NN architecture, we first define a NN formally. Let $x \in \mathbb{R}^d$ be the input, and denote $g^{(0)}(x) = x$ and $d_0 = d$ for notational convenience. We define an $L$-hidden-layer fully-connected neural network recursively:

(1)  $f^{(h)}(x) = W^{(h)} g^{(h-1)}(x) \in \mathbb{R}^{d_h}, \qquad g^{(h)}(x) = \sqrt{\tfrac{c_\sigma}{d_h}}\, \sigma\big(f^{(h)}(x)\big) \in \mathbb{R}^{d_h},$

where $W^{(h)} \in \mathbb{R}^{d_h \times d_{h-1}}$ is the weight matrix in the $h$-th layer ($h \in [L]$), $\sigma$ is a coordinate-wise activation function, and $c_\sigma = \big(\mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)^2]\big)^{-1}$ is a scaling factor.^3 (^3 Putting an explicit scaling factor in the definition of the NN is typically called the NTK parameterization (jacot2018neural; park2019effect). The standard parameterization scheme does not have the explicit scaling factor. The derivation of NTK requires the NTK parameterization. In our experiments on NNs, we try both parameterization schemes.) In this paper, for NN we will consider $\sigma$ being ReLU or ELU (clevert2015fast), and for NTK we will only consider kernel functions induced by NNs with ReLU activation. The last layer of the neural network is

(2)  $f(\theta, x) = f^{(L+1)}(x) = W^{(L+1)} g^{(L)}(x),$

where $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$ is the weight in the final layer, and $\theta = \big(W^{(1)}, \ldots, W^{(L+1)}\big)$ denotes all parameters in the network. All the weights are initialized to be i.i.d. $\mathcal{N}(0,1)$ random variables.

When the hidden widths $d_1, \ldots, d_L \to \infty$, certain limiting behavior emerges along the gradient trajectory. Let $x, x' \in \mathbb{R}^d$ be two data points; the covariance kernel of the $h$-th layer's outputs, $\Sigma^{(h)}(x, x')$, can be recursively defined in an analytical form:

(3)  $\Sigma^{(0)}(x, x') = x^\top x', \quad \Lambda^{(h)}(x, x') = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\ \Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x') \end{pmatrix}, \quad \Sigma^{(h)}(x, x') = c_\sigma\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\big[\sigma(u)\,\sigma(v)\big],$

for $h \in [L]$. Crucially, this analytical form holds not only at initialization, but also during training (when gradient descent with a small learning rate is used as the optimization routine).

Formally, NTK is defined as the limiting gradient kernel $\Theta^{(L)}(x, x') = \mathbb{E}_\theta\big[\big\langle \partial f(\theta, x)/\partial\theta,\; \partial f(\theta, x')/\partial\theta \big\rangle\big]$ in the infinite-width limit. Again, one can obtain a recursive formula:

(4)  $\dot\Sigma^{(h)}(x, x') = c_\sigma\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\big[\dot\sigma(u)\,\dot\sigma(v)\big],$

(5)  $\Theta^{(L)}(x, x') = \sum_{h=1}^{L+1} \Big( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot\Sigma^{(h')}(x, x') \Big),$

where we let $\dot\Sigma^{(L+1)}(x, x') = 1$ for convenience. It is easy to check that if we fix the first $h_0$ layers and only train the remaining layers, then the resulting NTK is $\sum_{h=h_0+1}^{L+1} \big( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot\Sigma^{(h')}(x, x') \big)$. Note when $h_0 = L$, the resulting NTK is $\Sigma^{(L)}(x, x')$, which is the NNGP kernel (lee2018deep). $h_0$ can be viewed as a hyperparameter of the NTK classifier, and in our UCI experiments we tune $h_0$. Given a kernel function, one can directly use it for downstream classification tasks (scholkopf2001learning).
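As a concrete sketch, for ReLU the Gaussian expectations in Eqs. (3)–(5) admit closed forms (the arc-cosine kernel identities, with $c_\sigma = 2$), so the NTK Gram matrix can be computed in a few lines of NumPy. This is our own minimal sketch under that closed form, not the paper's released code; the function name `relu_ntk` is ours.

```python
import numpy as np

def relu_ntk(X, L):
    """NTK Gram matrix of an L-hidden-layer ReLU network on rows of X.

    Implements the recursion of Eqs. (3)-(5) using the ReLU closed
    forms: with correlation lam, E[sigma(u)sigma(v)] and
    E[sigma'(u)sigma'(v)] reduce to arc-cosine expressions.
    """
    sigma = X @ X.T                        # Sigma^{(0)}(x, x') = x^T x'
    theta = sigma.copy()                   # running NTK, Theta^{(0)}
    for _ in range(L):
        d = np.sqrt(np.diag(sigma))        # sqrt of Sigma^{(h-1)}(x, x)
        denom = np.outer(d, d)
        lam = np.clip(sigma / denom, -1.0, 1.0)
        # Sigma^{(h)} and its derivative kernel (c_sigma = 2 for ReLU):
        sigma = denom * (lam * (np.pi - np.arccos(lam))
                         + np.sqrt(1.0 - lam ** 2)) / np.pi
        sigma_dot = (np.pi - np.arccos(lam)) / np.pi
        theta = theta * sigma_dot + sigma  # Theta^{(h)} recursion
    return theta
```

For unit-norm inputs the diagonal stays bounded; for general data one may normalize inputs first, as is done for the CNTK in Appendix B.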
4 Experiments on UCI Datasets
In this section, we present our experimental results on UCI datasets, following the setup of fernandez2014we with extensive comparisons of classifiers, including random forest, Gaussian kernel SVM, multilayer neural networks, etc. Section 4.1 discusses the performance of NTK through detailed comparisons with the other classifiers tested by fernandez2014we. Section 4.2 compares the NTK classifier and the corresponding NN classifier and verifies how similar their predictions are. See Table 6 in Appendix A for a summary of the datasets we used. We compute NTKs containing 1 to 5 fully-connected layers and then use kernel SVM with soft margin for classification. The detailed experiment setup, including the choice of datasets, training/test splitting, and ranges of hyperparameters, is described in Appendix A. The code can be found at https://github.com/LeoYu/neuraltangentkernelUCI. We note that usual methods of obtaining confidence bounds in these low-data settings are somewhat heuristic.
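For the classification step, a soft-margin SVM with a precomputed Gram matrix can be run via scikit-learn. Below is a minimal sketch on toy stand-in data; in our experiments the Gram matrices are NTKs and the hyperparameter ranges follow Appendix A.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data; in the experiments K would be an NTK Gram matrix.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(10, 5))

K_train = X_train @ X_train.T   # (n_train, n_train) kernel on training data
K_test = X_test @ X_train.T     # (n_test, n_train) test-train cross-kernel

clf = SVC(kernel="precomputed", C=1.0)  # C is the soft-margin cost value
clf.fit(K_train, y_train)
y_pred = clf.predict(K_test)
```

Note that `predict` expects the kernel between test points and the original training points.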
4.1 Overall Performance Comparisons
Table 1: Performance of the six classifiers on the 90 UCI datasets.
Classifier | Friedman Rank | Average Accuracy | P90 | P95 | PMA
NTK | 28.34 | 81.95% ± 14.10% | 88.89% | 72.22% | 95.72% ± 5.17%
NN (He init) | 40.97 | 80.88% ± 14.96% | 81.11% | 65.56% | 94.34% ± 7.22%
NN (NTK init) | 38.06 | 81.02% ± 14.47% | 85.56% | 60.00% | 94.55% ± 5.89%
RF | 33.51 | 81.56% ± 13.90% | 85.56% | 67.78% | 95.25% ± 5.30%
Gaussian Kernel | 35.76 | 81.03% ± 15.09% | 85.56% | 72.22% | 94.56% ± 8.22%
Polynomial Kernel | 38.44 | 78.21% ± 20.30% | 80.00% | 62.22% | 91.29% ± 18.05%
Table 1 lists the performance of six classifiers under various metrics: the three top classifiers identified in fernandez2014we, namely RF, Gaussian kernel, and polynomial kernel, along with our new methods: NTK, NN with He initialization, and NN with NTK initialization. Table 1 shows NTK is the best classifier under all metrics, followed by RF, the best classifier identified in fernandez2014we. We now interpret each metric in more detail.
Friedman Ranking and Average Accuracy.
NTK is the best (Friedman Rank 28.34, Average Accuracy 81.95%), followed by RF (Friedman Rank 33.51, Average Accuracy 81.56%) and then SVM with Gaussian kernel (Friedman Rank 35.76, Average Accuracy 81.03%). The advantage of NTK over RF is significant (5.17 better in Friedman Rank and +0.39% in Average Accuracy), just as the superiority of RF over other classifiers was significant in fernandez2014we. NN (with either He or NTK initialization) performs significantly better than the polynomial kernel in terms of Average Accuracy (80.88% and 81.02% vs. 78.21%), but in terms of Friedman Rank, NN with He initialization is worse than the polynomial kernel (40.97 vs. 38.44) and NN with NTK initialization is only slightly better (38.06 vs. 38.44). On many datasets where most classifiers perform similarly, NN's rank is relatively poor, whereas on other datasets NN is significantly better than most classifiers, including SVM with polynomial kernel. Therefore, NN enjoys higher Average Accuracy but suffers a worse Friedman Rank. For example, on the ozone dataset, NN with NTK initialization is only 0.25% worse than the polynomial kernel, but their difference in rank is 56. It is also interesting that NN with NTK initialization performs better than NN with He initialization (38.06 vs. 40.97 in Friedman Rank and 81.02% vs. 80.88% in Average Accuracy).
P90/P95 and PMA.
P90/P95 measure, for a given classifier, the fraction of datasets on which it achieves more than 90%/95% of the maximum accuracy among all classifiers. NTK is one of the best classifiers here (tying with the Gaussian kernel on P95), which shows NTK consistently achieves superior classification performance across a broad range of datasets. Lastly, we consider the Percentage of the Maximum Accuracy (PMA). NTK achieves the best average PMA, followed by RF, whose PMA is 0.47% below that of NTK; all other classifiers are below 94.6%. An interesting observation is that NTK, NN with NTK initialization, and RF have small standard deviations (5.17%, 5.89%, and 5.30%), whereas all other classifiers have much larger standard deviations.
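All five metrics can be computed mechanically from a datasets-by-classifiers accuracy matrix. The sketch below is ours, not the paper's evaluation code; the function name and the average-rank tie-handling are assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def comparison_metrics(acc):
    """acc: (n_datasets, n_classifiers) accuracy matrix.

    Returns Friedman rank, average accuracy, P90, P95, and PMA per
    classifier, as defined in the text. Ties share average ranks.
    """
    ranks = rankdata(-acc, axis=1)          # rank 1 = most accurate
    friedman = ranks.mean(axis=0)
    avg_acc = acc.mean(axis=0)
    best = acc.max(axis=1, keepdims=True)   # per-dataset maximum accuracy
    p90 = (acc >= 0.90 * best).mean(axis=0)
    p95 = (acc >= 0.95 * best).mean(axis=0)
    pma = (acc / best).mean(axis=0)         # percentage of max accuracy
    return friedman, avg_acc, p90, p95, pma
```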
4.2 Pairwise Comparisons
NTK vs. RF.
In Figure 0(a), we compare NTK with RF. There are 42 datasets on which NTK outperforms RF and 40 datasets on which RF outperforms NTK.^4 (^4 We consider classifier A to outperform classifier B if the classification accuracy of A is at least 0.1% better than that of B. The threshold 0.1% is chosen because the results in fernandez2014we only keep precision up to 0.1%.) The mean difference is 2.51%, which is statistically significant by a Wilcoxon signed-rank test. We see NTK and RF perform similarly when the Bayes error rate is low, with NTK being slightly better. There are some exceptions; for example, on the balance-scale dataset, NTK achieves 98% accuracy whereas RF only achieves 84.1%. When the Bayes error rate is high, either classifier can be significantly better than the other.
NTK vs. Gaussian Kernel.
The gap between NTK and the Gaussian kernel is more significant. As shown in Figure 0(b), NTK generally performs better than the Gaussian kernel whether the Bayes error rate is low or high. There are 43 datasets on which NTK outperforms the Gaussian kernel and 34 datasets on which the Gaussian kernel outperforms NTK. The mean difference is 2.22%, which is also statistically significant by a Wilcoxon signed-rank test. On the balloons, heart-switzerland, pittsburg-bridges-material, teaching, and trains datasets, NTK is better than the Gaussian kernel by at least 11% in accuracy. These five datasets all have fewer than 200 samples, which shows that NTK can perform much better than the Gaussian kernel when the number of samples is small.
NTK vs. NN.
In this section we compare NTK with NN. The goals are (i) comparing their performance and (ii) verifying that NTK is a good approximation to NN.
In Figure 1(a), we compare the performance of NTK and NN with He initialization. For most datasets, the two classifiers perform similarly, but there are a few datasets on which NTK performs significantly better. There are 50 datasets on which NTK outperforms NN with He initialization and 27 datasets on which NN with He initialization outperforms NTK. The mean difference is 1.96%, which is also statistically significant by a Wilcoxon signed-rank test.
In Figure 1(b), we compare the performance of NTK and NN with NTK initialization. Recall that for NN with NTK initialization, when the width goes to infinity, the predictor is exactly NTK; therefore, we expect these two predictors to give similar performance. Figure 1(b) verifies this conjecture: there is no dataset on which one classifier is significantly better than the other, an observation that does not hold in Figures 0(a), 0(b), or 1(a). Nevertheless, NTK often performs better than NN. There are 52 datasets on which NTK outperforms NN with NTK initialization and 25 datasets on which NN with NTK initialization outperforms NTK. The mean difference is 1.54%, which is also statistically significant by a Wilcoxon signed-rank test.
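The significance claims in this section come from Wilcoxon signed-rank tests on paired per-dataset accuracies, which can be sketched as follows. The numbers below are hypothetical placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired accuracies of two classifiers over eight datasets.
acc_a = np.array([0.81, 0.92, 0.75, 0.88, 0.69, 0.95, 0.84, 0.77])
acc_b = np.array([0.78, 0.90, 0.74, 0.85, 0.70, 0.93, 0.80, 0.76])

# Paired, non-parametric test on the per-dataset differences.
stat, p_value = wilcoxon(acc_a, acc_b)
mean_diff = float((acc_a - acc_b).mean())
```

A small p-value indicates the per-dataset differences are unlikely to be symmetric around zero.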
5 Experiments on Small CIFAR-10 Datasets
In this section, we study the performance of CNTK on subsampled CIFAR-10. We randomly choose $n$ samples from the CIFAR-10 training set, use them to train CNTK and ResNet-34, and test both classifiers on the whole test set. In our experiments, we vary $n$ from 10 to 1280, and the number of convolutional layers of CNTK varies from 5 to 14. After computing the CNTK, we use kernel regression for training and testing. See Appendix B for the detailed experiment setup.
The results are reported in Table 2. It can be observed that in this setting CNTK consistently outperforms ResNet-34. The largest gap occurs at $n = 320$, where the 14-layer CNTK achieves 36.57% accuracy and ResNet achieves 33.15%. The smallest improvement occurs at $n = 10$: 15.33% vs. 14.59%. When the number of training samples is large, i.e., $n = 1280$, ResNet can outperform CNTK. It is also interesting that the 14-layer CNTK is the best-performing CNTK for all values of $n$, suggesting depth has a significant effect on this task.
Table 2: Test accuracy on subsampled CIFAR-10 (mean ± std over 20 runs).
n | ResNet-34 | 5-layer CNTK | 8-layer CNTK | 11-layer CNTK | 14-layer CNTK
10 | 14.59% ± 1.99% | 15.08% ± 2.43% | 15.24% ± 2.44% | 15.31% ± 2.38% | 15.33% ± 2.43%
20 | 17.50% ± 2.47% | 18.03% ± 1.91% | 18.50% ± 2.03% | 18.69% ± 2.07% | 18.79% ± 2.13%
40 | 19.52% ± 1.39% | 20.83% ± 1.68% | 21.07% ± 1.80% | 21.23% ± 1.86% | 21.34% ± 1.91%
80 | 23.32% ± 1.61% | 24.82% ± 1.75% | 25.18% ± 1.80% | 25.40% ± 1.84% | 25.48% ± 1.91%
160 | 28.30% ± 1.38% | 29.63% ± 1.13% | 30.17% ± 1.11% | 30.46% ± 1.15% | 30.48% ± 1.17%
320 | 33.15% ± 1.20% | 35.26% ± 0.97% | 36.05% ± 0.92% | 36.44% ± 0.91% | 36.57% ± 0.88%
640 | 41.66% ± 1.09% | 41.24% ± 0.78% | 42.10% ± 0.74% | 42.44% ± 0.72% | 42.63% ± 0.68%
1280 | 49.14% ± 1.31% | 47.21% ± 0.49% | 48.22% ± 0.49% | 48.67% ± 0.57% | 48.86% ± 0.68%
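The kernel-regression step used to train and test the CNTK can be sketched as below. The optional ridge term is our own numerical-stability knob (the default 0.0 recovers plain kernel regression), and the function and argument names are ours.

```python
import numpy as np

def kernel_regression_predict(K_train, K_test, y_onehot, ridge=0.0):
    """Solve (K + ridge*I) alpha = Y on the training data, then score
    test points by K_test @ alpha; the argmax over classes is the label."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_onehot)
    scores = K_test @ alpha        # (n_test, n_classes)
    return scores.argmax(axis=1)
```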
6 Experiments on Few-shot Learning
[Tables 3–5: test mAP on VOC07 for linear SVM and CNTK with 1–6 convolutional layers, with the number of positive examples k = 1, …, 8 per class; the numerical entries were lost in extraction.]
In this section, we test the ability of NTK as a drop-in replacement for the linear classifier in the few-shot learning setting. Using a linear SVM on top of extracted features is arguably the most widely used strategy in few-shot learning, as a linear SVM is easy and fast to train, whereas more complicated strategies like fine-tuning or training a neural network on top of the extracted features have more randomness and may overfit due to the small size of the training set. NTK has the same benefits (easy and fast to train, no randomness) as the linear classifier but also allows some nonlinearity in the design. Since we consider image classification tasks, we use CNTK in the experiments in this section.
Experiment Setup.
We mostly follow the settings of goyal2019scaling. Features are extracted from layers conv1, conv2, conv3, conv4, and conv5 of a ResNet-50 (he2016deep) trained on ImageNet (deng2009imagenet), and we use these features for the VOC07 classification task. The number of positive examples $k$ varies from 1 to 8. For each value of $k$ and each class in the VOC07 dataset, we choose $k$ positive examples together with negative examples, and we randomly draw independent training sets of this form. We report the mean and standard deviation of mAP on the test split of the VOC07 dataset. This setting has been used in a number of previous few-shot learning papers (goyal2019scaling; zhang2017split).
We take the extracted features as the input to CNTK. We tried CNTKs with 1–6 convolution layers, a global average pooling layer, and a fully-connected layer. We normalize the data in feature space and finally use SVM to train the classifiers. Note that without the convolution layers, this is equivalent to directly applying a linear SVM after global average pooling and normalization. We use sklearn.svm.LinearSVC to train linear SVMs, and sklearn.svm.SVC to train kernel SVMs (for CNTK). To train the SVM, we choose the cost value $C$ from a fixed grid and set the class_weight ratio for the positive/negative classes as in goyal2019scaling.
Since the number of given samples is usually small, goyal2019scaling chooses to report the performance of the best cost value $C$. In our experiments, we use a more standard cross-validation method: we use the first 10 classes of VOC07 to tune $C$, and report the performance of the selected $C$ on the other 10 classes in Tables 3–5. For completeness, we also report the performance of the best $C$, as in goyal2019scaling, in Tables 7–9 in the appendix.
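The cross-validation over the cost value can be sketched as follows. The cost grid, the 2:1 positive/negative class weights, and the accuracy-based selection criterion are illustrative assumptions (the experiments select by mAP on the held-out classes), and the function name is ours.

```python
import numpy as np
from sklearn.svm import SVC

def select_cost(K_train, y_train, K_val, y_val,
                cost_grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick the SVM cost C by validation score on held-out data."""
    best_c, best_score = None, -1.0
    for c in cost_grid:
        clf = SVC(kernel="precomputed", C=c,
                  class_weight={1: 2.0, 0: 1.0})  # up-weight positives
        clf.fit(K_train, y_train)
        score = (clf.predict(K_val) == y_val).mean()
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```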
Discussions.
First, we find CNTK is a strong drop-in replacement for the linear classifier in few-shot learning. Tables 3–9 clearly demonstrate that CNTK is consistently better than the linear classifier. Note that Tables 3–9 only show prediction accuracy in an average sense; in fact, we find that on every randomly sampled training set, CNTK always gives better performance. We conjecture that CNTK performs better because its nonlinearity helps prediction. This is supported by the performance gain of CNTK for different feature extractors. Conv5 is often considered the most useful layer for linear classification, and there CNTK only outperforms the linear classifier by a small margin. On the other hand, with Conv3 and Conv4 features, CNTK can outperform the linear classifier by noticeably larger margins (last two rows in Table 5). We believe this happens because Conv3 and Conv4 correspond to middle-level features, for which nonlinearity is indeed beneficial for better accuracy.
We also observe that CNTK often performs best with a single convolutional layer. This is expected, since features extracted by ResNet-50 already form a good representation. Nevertheless, we find that for the middle-level Conv3 features with larger $k$, CNTK with 3 convolutional layers gives the best performance. We believe this is because, with more data, utilizing the nonlinearity induced by CNTK can further boost performance.
7 Conclusion
The Neural Tangent Kernel, discovered through mathematical curiosity about deep networks in the limit of infinite width, is found to yield superb performance on low-data tasks, beating extensively tuned versions of classic methods such as random forests. The (fully-connected) NTK classifiers are easy to compute (no GPU required) and thus should be a good off-the-shelf classifier in many settings. We plan to release our code to allow drop-in replacement for SVMs and linear regression.
Many theoretical questions arise. Do NTK SVMs correspond to some infinite net architecture (as NTK ridge regression does)? What explains generalization in small-data settings? (This understanding is imperfect even for random forests.) Finally, one can derive NTKs corresponding to other architectures, e.g., the recurrent neural tangent kernel (RNTK) induced by recurrent neural networks. Testing their performance on benchmark tasks would be an interesting future research direction.
Acknowledgments
S. Arora, Z. Li and D. Yu are supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, DARPA and SRC. S.S.D is supported by National Science Foundation (Grant No. DMS1638352) and the Infosys Membership. R. Wang and R. Salakhutdinov are supported in part by NSF IIS1763562, Office of Naval Research grant N000141812861, and Nvidia NVAIL award. We thank AWS for cloud computing time. We thank Xiaolong Wang for discussions of the few-shot learning task.
References
Appendix A Additional Experimental Details on UCI
In this section, we describe our experiment setup.
Dataset Selection
Our starting point was the preprocessed UCI datasets from fernandez2014we, which can be downloaded from http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz. Their preprocessing involved changing categorical features to numerical ones, some normalization, and turning regression problems into classification. Of these, we only retained datasets with at most 5000 samples, since we are interested in the low-data regime. Furthermore, we discovered an apparent bug in their preprocessing of datasets with an explicit training/test split (specifically, their featurization was done differently on training and test data). Fixing the bug and retraining models on those datasets would complicate an apples-to-apples comparison, so we discarded these datasets from our experiments. This left 90 datasets; see Table 6 for a summary. Even on these 90 remaining datasets, their ranking of different classifiers was almost preserved, with random forest the best, followed by SVM with Gaussian kernel.
Table 6: Summary statistics of the 90 UCI datasets.
 | # samples | # features | # classes
min | 10 | 3 | 2
25% | 178 | 8 | 2
50% | 583 | 16 | 3
75% | 1022 | 32 | 6
max | 5000 | 262 | 100
Performance Comparison Details
We follow the comparison setup of fernandez2014we and report cross-validation accuracy. For hyperparameters, we tune them with the same validation methodology as fernandez2014we: all available training samples are randomly split into one training and one test set, imposing that each class has the same number of training and test samples; the parameters with the best validation accuracy are then selected. It is possible to give confidence bounds for this parameter-tuning scheme, but they are worse than standard ones for separated training/validation/test data. We train the NTK and NN classifiers on these datasets ourselves; for the other classifiers, we use the results from fernandez2014we.
NTK Specification
We calculate NTKs induced by fully-connected neural networks with $L$ layers, where the bottom $h_0$ layers are fixed, and then use support vector classification as implemented in sklearn.svm. We tune the depth $L$, the number of fixed layers $h_0$, and the cost value $C$ (powers of ten); the number of kernels times the number of cost values gives 105 parameter combinations in total. Note this number is much smaller than the 500 hyperparameter combinations tuned for the Gaussian kernel in fernandez2014we.
NN Specification
We use fully-connected NNs with a fixed number of hidden nodes per layer and train them with gradient descent. We tune the depth, try runs with and without batch normalization and with two learning rates, and run gradient descent for a fixed number of epochs.^5 (^5 We found the NN can reach near-zero training loss well within this budget, consistent with the observation in olson2018modern.) We treat NN with He initialization and NN with NTK initialization as two classifiers and report their results separately.
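A minimal sketch of full-batch gradient descent on a one-hidden-layer ReLU net under the NTK parameterization of Eq. (1) is below. This is our own illustration, not the experiment code; the width, learning rate, and epoch budget shown are arbitrary placeholders for the tuned values described above.

```python
import numpy as np

def train_two_layer_relu(X, y, width=64, lr=0.05, epochs=500, seed=0):
    """Full-batch GD on f(x) = w2 . sqrt(2/width) * relu(W1 x) under the
    squared loss; all weights are i.i.d. N(0, 1) at init (NTK param.,
    c_sigma = 2 for ReLU)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(size=(width, d))
    w2 = rng.normal(size=width)
    c = np.sqrt(2.0 / width)            # explicit NTK scaling factor
    for _ in range(epochs):
        pre = X @ W1.T                  # (n, width) pre-activations
        act = c * np.maximum(pre, 0.0)  # scaled ReLU features
        out = act @ w2                  # network outputs
        err = out - y                   # squared-loss residuals
        grad_w2 = act.T @ err / n
        grad_pre = np.outer(err, w2) * c * (pre > 0)  # back through ReLU
        grad_W1 = grad_pre.T @ X / n
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1
    return W1, w2, c
```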
Appendix B Additional Experimental Details on Small CIFAR-10 Datasets
We randomly choose an equal number of samples from each class of the CIFAR-10 training set and test classifiers on the whole test set; the total number of training samples $n$ varies from 10 to 1280. For each $n$, we repeat the experiment 20 times and report the mean accuracy and its standard deviation for each classifier.
The number of convolution layers of the CNTK ranges from 5 to 14. After the convolutional layers, we apply a global average pooling layer and a fully-connected layer. We refer readers to arora2019exact for the exact CNTK formulas. We normalize the kernel such that each sample has unit length in feature space.
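The normalization mentioned here divides each kernel entry by the feature-space norms of the two samples; a minimal sketch (the function name is ours):

```python
import numpy as np

def normalize_kernel(K):
    """Unit-length normalization in feature space:
    K_norm[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

For a test-train cross-kernel, divide entry (i, j) by the square roots of the test point's and training point's own diagonal kernel values instead.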
We use ResNet-34 with widths 64, 128, 256 and default hyperparameters: learning rate 0.1, momentum 0.9, weight decay 0.0005. We decay the learning rate by a factor of 10 at epochs 80 and 120, with 160 training epochs in total. The training batch size is the minimum of the size of the whole training set and 160. We report the best test accuracy across epochs.
Appendix C Additional Results in Few-shot Learning
[Tables 7–9: test mAP on VOC07 for linear SVM and CNTK with 1–6 convolutional layers, reporting the best cost value $C$ as in goyal2019scaling, with k = 1, …, 8 positive examples per class; the numerical entries were lost in extraction.]