
# Benefit of Interpolation in Nearest Neighbor Algorithms

Yue Xing
Department of Statistics
Purdue University
West Lafayette, Indiana, USA
xing49@purdue.edu

Qifan Song
Department of Statistics
Purdue University
West Lafayette, Indiana, USA
qfsong@purdue.edu

Guang Cheng
Department of Statistics
Purdue University
West Lafayette, Indiana, USA
chengg@purdue.edu
###### Abstract

Over-parameterized models have attracted much attention in the era of data science and deep learning. It has been empirically observed that although these models, e.g., deep neural networks, over-fit the training data, they can still achieve small testing error, and sometimes even outperform traditional algorithms designed to avoid over-fitting. The major goal of this work is to sharply quantify the benefit of data interpolation in the context of nearest neighbors (NN) algorithms. Specifically, we consider a class of interpolated weighting schemes and carefully characterize their asymptotic performance. Our analysis reveals a U-shaped performance curve with respect to the level of data interpolation, and proves that a mild degree of data interpolation strictly improves the prediction accuracy and statistical stability over those of the (un-interpolated) optimal k-NN algorithm. This theoretically justifies (predicts) the existence of the second U-shaped curve in the recently discovered double descent phenomenon. Note that our goal in this study is not to promote the use of the interpolated-NN method, but to obtain theoretical insights on data interpolation inspired by the aforementioned phenomenon.

## 1 Introduction

Classical statistical learning theory holds that over-fitting deteriorates prediction performance: when the model complexity is beyond necessity, the testing error must be large. Therefore, various techniques have been proposed in the literature to avoid over-fitting, such as early stopping, dropout and cross validation. However, recent experiments reveal that even with over-fitting, many learning algorithms still achieve small generalization error. For instance, Wyner et al. (2017) explored over-fitting in AdaBoost and random forest algorithms; Belkin et al. (2019a) discovered a double descent phenomenon in random forests and neural networks: with growing model complexity, testing performance first follows a (conventional) U-shaped curve, and as the level of over-fitting increases, a second descent or even a second U-shaped testing performance curve occurs.

To theoretically understand the effect of over-fitting or data interpolation, Du and Lee (2018); Du et al. (2019, 2018); Arora et al. (2018, 2019); Xie et al. (2017) analyzed how to train neural networks under over-parametrization and why over-fitting does not jeopardize the testing performance; Belkin et al. (2019) constructed a Nadaraya-Watson kernel regression estimator which perfectly fits the training data but is still minimax rate optimal; Belkin et al. (2018) and Xing et al. (2018) studied the rate of convergence of the interpolated nearest neighbor algorithm (interpolated-NN); Belkin et al. (2019b); Bartlett et al. (2019) quantified the prediction MSE of the linear least squares estimator when the data dimension is larger than the sample size and the training loss attains zero. A similar analysis was conducted by Hastie et al. (2019) for two-layer neural network models with a fixed first layer.

In this work, we aim to provide theoretical answers to whether, when and why the interpolated-NN performs better than the optimal k-NN, via sharp analysis. The classical k-NN algorithm for either regression or classification is known to be rate-minimax under mild conditions (Chaudhuri and Dasgupta (2014)), provided k diverges properly. However, can such a simple and versatile algorithm still benefit from intentional over-fitting? We first demonstrate some empirical evidence below.

Belkin et al. (2018) designed an interpolated weighting scheme as follows:

 $$\hat y(x)=\frac{\sum_{i=1}^{k}\|x_{(i)}-x\|^{-\gamma}\,y(x_{(i)})}{\sum_{i=1}^{k}\|x_{(i)}-x\|^{-\gamma}},\tag{1}$$

where x_{(i)} is the i-th closest neighbor to x, with corresponding label y(x_{(i)}). The parameter γ controls the level of interpolation: with a larger γ, the algorithm puts more weight on the closer neighbors. In particular, when γ = 0 or γ → ∞, interpolated-NN reduces to k-NN or 1-NN, respectively. Belkin et al. (2018) showed that such an interpolated estimator is rate minimax in the regression setup, but suboptimal in the setting of binary classification. Later, Xing et al. (2018) obtained the minimax rate of classification by adopting a slightly different interpolating kernel. What is indeed more interesting is the preliminary numerical analysis (see Figure 1) conducted in the aforementioned paper, which demonstrates that interpolated-NN is even better than the rate-minimax k-NN in terms of MSE (regression) or mis-classification rate (classification). This observation asks for deeper theoretical exploration beyond the rate of convergence. A reasonable conjecture is that the interpolated-NN may possess a smaller multiplicative constant for its rate of convergence, which may be used to study the generalization ability within the "over-parametrized regime."
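As a concrete illustration, the weighting scheme (1) can be sketched in a few lines of numpy. This is a minimal sketch of ours, not code from the paper; the function name and the tie-breaking tolerance `eps` are our own choices.

```python
import numpy as np

def interpolated_nn_predict(X_train, y_train, x, k, gamma, eps=1e-12):
    """Interpolated-NN estimate at a query point x, in the spirit of Eq. (1).

    Each of the k nearest neighbors x_(i) receives weight ||x_(i) - x||^(-gamma);
    as x approaches a training point, that weight diverges, so the prediction
    interpolates the training labels.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    d_k = dists[idx]
    if d_k[0] < eps:                     # exact hit: return the stored label
        return y_train[idx[0]]
    w = d_k ** (-gamma)
    return np.sum(w * y_train[idx]) / np.sum(w)
```

With gamma = 0 the weights are uniform and the rule reduces to the usual k-NN average, while letting gamma grow concentrates all weight on the single nearest neighbor, recovering 1-NN.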

In this study, we theoretically compare the minimax-optimal k-NN and the interpolated-NN (under (1)) in terms of their multiplicative constants. On the one hand, we show that under proper smoothness conditions, the multiplicative constant of interpolated-NN, as a function of the interpolation level γ, is U-shaped. As a consequence, interpolation indeed leads to more accurate and stable performance when the interpolation level γ lies in (0, γ_d) for some γ_d depending only on the data dimension d. The amount of benefit (i.e., the "performance ratio" defined in Section 2) follows exactly the same asymptotic pattern for both regression and classification tasks. In addition, the gain from interpolation diminishes as the dimension d grows to infinity, i.e., high dimensional data benefit less from data interpolation. We also want to point out that there still exist other "non-interpolating" weighting schemes, such as OWNN, which can achieve an even better performance; see Section 3.4. More subtle results are summarized in the figure below.

From Figure 2, we theoretically justify (predict) the existence of the U-shaped curve within the "over-fitting regime" of the double descent phenomenon recently discovered by Belkin et al. (2019a, b). As a complement to Belkin et al. (2018); Xing et al. (2018), we further show in Section F of the appendix that interpolated-NN attains the optimal rate for both regression and classification under more general smoothness conditions.

In the end, we want to emphasize that our goal here is not to promote the practical use of this interpolation method, given that k-NN is more user-friendly. Rather, the interpolated-NN algorithm is used to precisely describe the role of interpolation in generalization ability, so that more solid theoretical arguments can be made for the very interesting double descent phenomenon, especially in the over-fitting regime.

## 2 Interpolation in Nearest Neighbors Algorithm

In this section, we review the interpolated-NN algorithm introduced by Belkin et al. (2018) in more detail. Given x, define R_{k+1}(x) to be the distance between x and its (k+1)th nearest neighbor. W.l.o.g., let X_1, …, X_k denote the (unsorted) k nearest neighbors of x, and let R_1, …, R_k be their distances to x. Based on the same argument used in Chaudhuri and Dasgupta (2014) and Belkin et al. (2018), conditional on R_{k+1}, X_1 to X_k are iid variables supported on a ball centered at x with radius R_{k+1}; as a consequence, R_1 to R_k are conditionally independent given R_{k+1} as well. When no confusion is caused, we write R_{k+1}(x) as R_{k+1}. The weights of the neighbors are then defined as

 $$W_i=\frac{R_i^{-\gamma}}{\sum_{j=1}^{k}R_j^{-\gamma}}=\frac{(R_i/R_{k+1})^{-\gamma}}{\sum_{j=1}^{k}(R_j/R_{k+1})^{-\gamma}},$$

for i = 1, …, k and some γ ≥ 0.

For regression models, denote η(x) = E(Y|X = x) as the target function, and Y = η(X) + ε, where ε is an independent zero-mean noise with variance σ²(x). The regression estimator at x is thus

 $$\hat\eta_{k,n,\gamma}(x)=\sum_{i=1}^{k}W_iY_i.$$

For binary classification, denote η(x) = P(Y = 1|X = x), with g(x) = 1{η(x) > 1/2} as the Bayes classifier. The interpolated-NN classifier is defined as

 $$\hat g_{k,n,\gamma}(x)=\begin{cases}1 & \text{if }\sum_{i=1}^{k}W_iY_i>1/2,\\ 0 & \text{if }\sum_{i=1}^{k}W_iY_i\le 1/2.\end{cases}$$

As discussed previously, the parameter γ controls the level of interpolation: a larger value of γ leads to a higher degree of data interpolation.

We adopt conventional measures to evaluate the theoretical performance of interpolated-NN at a new test point X:

 $$\text{Regression: }\mathrm{MSE}(k,n,\gamma)=E\big[(\hat\eta_{k,n,\gamma}(X)-\eta(X))^2\big].$$
 $$\text{Classification: }\mathrm{Regret}(k,n,\gamma)=P(\hat g_{k,n,\gamma}(X)\neq Y)-P(g(X)\neq Y).$$
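Both measures can be approximated by direct Monte Carlo simulation. The sketch below is ours, under assumptions of our choosing (a toy logistic η, uniform covariates, and arbitrary sample sizes); it only illustrates how MSE and Regret are estimated, not the paper's actual simulation setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):                                    # toy P(Y=1|X=x), our choice
    return 1.0 / (1.0 + np.exp(-4.0 * (x[0] - 0.5)))

def predict(Xtr, ytr, x, k, gamma):
    d = np.linalg.norm(Xtr - x, axis=1)
    order = np.argsort(d)
    r = d[order[:k]] / d[order[k]]             # the ratios R_i / R_{k+1}
    w = r ** (-gamma)
    return np.sum(w * ytr[order[:k]]) / np.sum(w)

def mse_and_regret(n=400, k=20, gamma=2.0, dim=2, reps=200):
    """Monte Carlo estimates of MSE(k, n, gamma) and Regret(k, n, gamma)."""
    sq_err = regret = 0.0
    for _ in range(reps):
        Xtr = rng.random((n, dim))
        ytr = (rng.random(n) < np.apply_along_axis(eta, 1, Xtr)).astype(float)
        x = rng.random(dim)
        p, est = eta(x), predict(Xtr, ytr, x, k, gamma)
        sq_err += (est - p) ** 2
        # excess risk of the plug-in classifier over the Bayes rule at x
        regret += (p if est <= 0.5 else 1.0 - p) - min(p, 1.0 - p)
    return sq_err / reps, regret / reps
```

Averaging over repeated draws of the training set and the query point approximates the expectations in the two displays above.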

## 3 Quantification of Interpolation Effect

### 3.1 Model Assumptions

Recent works by Belkin et al. (2018) and Xing et al. (2018) confirm the rate optimality of MSE and regret for interpolated-NN under mild interpolation conditions. Two deeper questions (hinted by Figure 1) we would like to address are whether and how interpolation strictly benefits NN algorithm, and whether interpolation affects regression and classification in the same manner.

To facilitate our theoretical investigation, we impose the following assumptions:

1. X is a d-dimensional random variable supported on a compact set with boundary.

2. For classification, the decision boundary S = {x : η(x) = 1/2} is non-empty.

3. for some constant .

4. For classification, η is continuous in some open set containing the support of X. The third-order derivatives of η are bounded in a neighborhood of S, i.e., when d(x, S) ≤ δ for a small constant δ > 0, and the gradient η̇(x) ≠ 0 when x ∈ S.

5. For classification, the density of X, denoted as f, is twice differentiable and finite.

6. For regression, the third-order derivatives of η are bounded for all x.

7. For regression, σ²(x) is finite and has a finite first-order derivative in x.

The above assumptions (except A.3) are mostly derived from the framework established by Samworth et al. (2012). Note that the additional smoothness required of η and f is needed to facilitate the asymptotic study of the interpolation weighting scheme. We also want to point out that these assumptions are generally stronger than those used in Chaudhuri and Dasgupta (2014), but are necessary to pin down the multiplicative constant. Further discussions regarding these conditions can be found in Remark 3 in the appendix.

### 3.2 Main Theorem

The following theorem quantifies how interpolation affects the NN estimate in terms of k, n and γ; Corollary 2 then examines the asymptotic performance ratios of MSE and Regret between interpolated-NN and k-NN, and discovers that these ratios (under their respective optimal choices of k) converge to a function of d and γ only. In particular, a U-shaped curve is revealed, where the ratio is smaller than 1 when γ ∈ (0, γ_d) for some γ_d > 0.

###### Theorem 1

For regression, suppose that assumptions A.1, A.3, A.6, and A.7 hold. If k satisfies k ∈ (n^β, n^{1−4β/d}) for some β > 0, we have¹

¹The notation "o" is understood as a term of asymptotically smaller order than the displayed leading terms.

 $$\mathrm{MSE}(k,n,\gamma)=kE\left[\frac{(R_1/R_{k+1})^{-2\gamma}}{\big(\sum_{i=1}^{k}(R_i/R_{k+1})^{-\gamma}\big)^2}\,\sigma^2(X)\right]+k^2E\left(a^2(X)\,E^2\left[\frac{R_1^2(R_1/R_{k+1})^{-\gamma}}{\sum_{i=1}^{k}(R_i/R_{k+1})^{-\gamma}}\,\middle|\,X\right]\right)+o.$$

For classification, under A.1 to A.5, the excess risk w.r.t. the Bayes classifier becomes

 $$\mathrm{Regret}(k,n,\gamma)=\frac{B_1}{4k}\,E\,s^2_{k,n,\gamma}(X)+\int_{S}\frac{f(x_0)}{\|\dot\eta(x_0)\|}\,a^2(x_0)\,t^2_{k,n,\gamma}(x_0)\,d\mathrm{Vol}^{d-1}(x_0)+o,$$

where the exact forms of B_1, s_{k,n,γ}, t_{k,n,γ} and a can be found in Section C of the appendix.

Theorem 1 holds for any k ∈ (n^β, n^{1−4β/d}), where β controls a proper diverging rate of k as in Samworth et al. (2012) and Sun et al. (2016). This allows us to define the minimum MSE and Regret over k as follows:

 $$\mathrm{MSE}(n,\gamma)=\min_{k\in(n^{\beta},\,n^{1-4\beta/d})}\mathrm{MSE}(k,n,\gamma)\quad\text{and}\quad \mathrm{Regret}(n,\gamma)=\min_{k\in(n^{\beta},\,n^{1-4\beta/d})}\mathrm{Regret}(k,n,\gamma).$$

Corollary 2 asymptotically compares the interpolated-NN and k-NN (i.e., γ = 0) in terms of the above measures. Interestingly, it turns out that the performance ratio, defined as

 $$\mathrm{PR}(d,\gamma):=\left(1+\frac{\gamma^2}{d(d-2\gamma)}\right)^{\frac{4}{d+4}}\left(\frac{(d-\gamma)^2(d+2)^2}{(d+2-\gamma)^2 d^2}\right)^{\frac{d}{d+4}},$$

is a function of d and γ only, independent of the underlying data distribution. Note that PR(d,γ) is just the ratio of the multiplicative constants in front of the minimax rates of interpolated-NN and k-NN.
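The ratio is elementary to evaluate numerically. The helper below is a sketch of ours using our parenthesization of the performance-ratio display (the first factor as a variance inflation, the second as a squared bias reduction); `gamma_d` locates the crossing point by bisection.

```python
def performance_ratio(d, gamma):
    """PR(d, gamma) for 0 <= gamma < d/2, under our parenthesization."""
    var_factor = 1.0 + gamma**2 / (d * (d - 2.0 * gamma))
    bias_factor = (d - gamma)**2 * (d + 2.0)**2 / ((d + 2.0 - gamma)**2 * d**2)
    return var_factor ** (4.0 / (d + 4)) * bias_factor ** (d / (d + 4.0))

def gamma_d(d, tol=1e-10):
    """Bisection for the gamma_d at which PR(d, gamma) returns to 1."""
    lo, hi = 1e-6, d / 2.0 - 1e-6       # PR < 1 near 0, PR blows up near d/2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if performance_ratio(d, mid) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)
```

For example, `performance_ratio(5, 1.0)` is roughly 0.95, so a mild level of interpolation yields about a 5% constant-level improvement at d = 5, and the ratio creeps back toward 1 for larger d, consistent with Remark 1.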

###### Corollary 2

Under the same conditions as in Theorem 1, for any fixed γ ∈ [0, d/2),

 $$\frac{\mathrm{MSE}(n,\gamma)}{\mathrm{MSE}(n,0)}\to \mathrm{PR}(d,\gamma)\quad\text{and}\quad \frac{\mathrm{Regret}(n,\gamma)}{\mathrm{Regret}(n,0)}\to \mathrm{PR}(d,\gamma),\qquad\text{as }n\to\infty.$$

Note that MSE(n,0) / Regret(n,0) is the optimal MSE / Regret for k-NN.

The proofs of Theorem 1 and Corollary 2 are postponed to appendix (Section C and D respectively).

When k can be chosen adaptively given γ, we can address the second question: interpolation affects regression and classification in exactly the same manner, through PR(d,γ). In particular, this ratio exhibits an interesting U-shape in γ for any fixed d. Specifically, as γ increases from 0, PR(d,γ) first decreases below 1 and then increases above 1; see Figures 2 and 3. Therefore, within the range γ ∈ (0, γ_d) for some γ_d depending only on the dimension d, PR(d,γ) < 1, that is, the interpolated-NN is strictly better than k-NN. Given the imposed condition that γ < d/2, some further calculations characterize how γ_d varies with d.

###### Remark 1

It is easy to show that PR(d,γ) → 1 as d → ∞. This indicates that high dimensional models benefit less from interpolation, or, said differently, high dimensional models are less affected by data interpolation. This phenomenon can be explained by the fact that, as d increases, the distance ratios R_i/R_{k+1} converge to 1 due to high dimensional geometry.

###### Remark 2

The optimal k, which leads to the best MSE/Regret, depends on the interpolation level γ; we thus denote it by k_γ. As shown in the appendix, k_γ is of the same order as k_0, but k_γ > k_0 for γ > 0, i.e., interpolated-NN needs to employ slightly more neighbors to achieve its best performance. Empirical support for this finding can be found in Section A of the appendix. If we insist on using the same k for interpolated-NN and k-NN, we can still verify that MSE(k,n,γ) < MSE(k,n,0) and Regret(k,n,γ) < Regret(k,n,0) when γ lies in a range depending on the distribution of X and on d.

### 3.3 Statistical Stability

In this section, we explore how interpolation affects the statistical stability of nearest neighbor classification algorithms, beyond the generalization results obtained in Section 3.2. In short, if we choose the best k for k-NN and apply the same k to the interpolated-NN, then k-NN will be more stable; otherwise, if k is allowed to be chosen separately and optimally based on γ, the interpolated-NN will be more stable for γ ∈ (0, γ_d).

For a stable classification method, it is expected that, with high probability, the classifier yields the same prediction when trained on different data sets sampled from the same population. Accordingly, Sun et al. (2016) introduced a type of statistical stability, classification instability (CIS), which is different from the algorithmic stability in the literature (Bousquet and Elisseeff, 2002). Denote D_1 and D_2 as two i.i.d. training sets with the same sample size n. The CIS is defined as:

 $$\mathrm{CIS}_{k,n}(\gamma)=P_{D_1,D_2,X}\big(\hat g_{k,n,\gamma}(X,D_1)\neq \hat g_{k,n,\gamma}(X,D_2)\big).$$

Hence, a larger value of CIS indicates that the classifier is less statistically stable. In practice, we need to take both the mis-classification rate and the classification instability into account. Therefore, we compare the stability of interpolated-NN and k-NN only when the Regrets of both algorithms attain their optima under the respective optimal choices of k.
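The CIS definition translates directly into a Monte Carlo procedure: train two copies of the classifier on independent samples and record how often they disagree on fresh query points. The sketch below is ours, with a toy distribution and hypothetical sample sizes of our choosing.

```python
import numpy as np

rng = np.random.default_rng(1)

def eta(x):                                    # toy P(Y=1|X=x), our choice
    return 1.0 / (1.0 + np.exp(-4.0 * (x[0] - 0.5)))

def classify(Xtr, ytr, x, k, gamma):
    d = np.linalg.norm(Xtr - x, axis=1)
    order = np.argsort(d)
    w = (d[order[:k]] / d[order[k]]) ** (-gamma)
    return float(np.sum(w * ytr[order[:k]]) / np.sum(w) > 0.5)

def estimate_cis(n=300, k=15, gamma=1.0, dim=2, n_test=200):
    """Fraction of test points on which two independently trained copies disagree."""
    def sample():
        X = rng.random((n, dim))
        y = (rng.random(n) < np.apply_along_axis(eta, 1, X)).astype(float)
        return X, y
    (X1, y1), (X2, y2) = sample(), sample()
    Xte = rng.random((n_test, dim))
    return float(np.mean([classify(X1, y1, x, k, gamma) != classify(X2, y2, x, k, gamma)
                          for x in Xte]))
```

Averaging `estimate_cis` over many pairs (D_1, D_2) approximates the probability in the display above.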

Theorem 3 below illustrates how the CIS is affected by interpolation through k, n and γ.

###### Theorem 3

Under the conditions in Theorem 1, the CIS of interpolated-NN is derived as

 $$\mathrm{CIS}_{k,n}(\gamma)=\frac{B_1}{\sqrt{\pi}}\frac{1}{\sqrt{k}}\,E\,s_{k,n,\gamma}(X)+o.$$

The proof of Theorem 3 is postponed to Section E in appendix.

Similarly, Corollary 4 asymptotically compares CIS between interpolated-NN and NN.

###### Corollary 4

Following the conditions in Theorem 3, when the same value of k is used for k-NN and interpolated-NN, then as n → ∞,

 $$\frac{\mathrm{CIS}_{k,n}(\gamma)}{\mathrm{CIS}_{k,n}(0)}>1.$$

On the other hand, if we choose the optimal k's for k-NN and interpolated-NN respectively, i.e., k_0 and k_γ, then as n → ∞ we have

 $$\left(\frac{\mathrm{CIS}_{k_\gamma,n}(\gamma)}{\mathrm{CIS}_{k_0,n}(0)}\right)^2\to \mathrm{PR}(d,\gamma).$$

Therefore, when γ ∈ (0, γ_d), interpolated-NN with the optimal k_γ attains higher accuracy and higher stability than k-NN at the same time.

From Corollary 4, the interpolated-NN is not as stable as k-NN if the same number of neighbors is used in both algorithms. However, this is not the case if an optimal k is chosen separately for each. An intuitive explanation is that, under the same k, k-NN has a smaller variance (is more stable) since it places equal weights on all neighbors; on the other hand, by choosing an optimal k_γ, the interpolated-NN can achieve a much smaller bias, which offsets the loss in variance incurred by enlarging k.

### 3.4 Connection with OWNN and Double Descent Phenomenon

Samworth et al. (2012) first worked out a general form of the regret for rank-based weighting schemes, and proposed the optimally weighted nearest neighbors algorithm (OWNN). OWNN is the best nearest neighbors algorithm, in terms of minimizing the MSE for regression (and the Regret for classification), among all schemes whose weights are rank-based only.

Combining Theorem 1 and Corollary 2 with Samworth et al. (2012), we can further compare the interpolated-NN against OWNN as follows:

 $$\frac{R(n,\mathrm{OWNN})}{R(n,\gamma)}\to 2^{\frac{4}{d+4}}\left(\frac{d+2}{d+4}\right)^{\frac{2d+4}{d+4}}\left(1+\frac{\gamma^2}{d(d-2\gamma)}\right)^{-\frac{4}{d+4}}\left(\frac{(d-\gamma)^2(d+2)^2}{(d+2-\gamma)^2 d^2}\right)^{-\frac{d}{d+4}},$$

which is always smaller than 1 (by the optimality of OWNN). Here R(n,OWNN) denotes the MSE/Regret of OWNN given its optimal k, and R(n,γ) denotes that of interpolated-NN given its own optimal k_γ. It is interesting to note from the above ratio that the advantage of OWNN is only reflected at the level of the multiplicative constant, and that the ratio converges to 1 as d diverges (just as PR(d,γ) does; see Remark 1). Thus, in the ultra high dimensional setting, the performance differences among k-NN, interpolated-NN and OWNN are almost negligible even at the multiplicative-constant level.
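Under our reading of the two displays above, the OWNN-to-interpolated-NN ratio is the OWNN-to-k-NN constant divided by PR(d,γ); the sketch below (our parenthesization, not code from the paper) makes the two claims in the text easy to check numerically.

```python
def ownn_over_interpolated(d, gamma):
    """Asymptotic R(n, OWNN) / R(n, gamma), under our parenthesization."""
    # first two factors of the display: OWNN vs. k-NN constant
    c = 2.0 ** (4.0 / (d + 4)) * ((d + 2.0) / (d + 4.0)) ** ((2.0 * d + 4) / (d + 4))
    # last two factors: the reciprocal of PR(d, gamma)
    pr = (1.0 + gamma**2 / (d * (d - 2.0 * gamma))) ** (4.0 / (d + 4)) \
         * ((d - gamma)**2 * (d + 2.0)**2 / ((d + 2.0 - gamma)**2 * d**2)) ** (d / (d + 4.0))
    return c / pr
```

Sweeping γ over [0, d/2) keeps the ratio below 1 (OWNN never loses), while sending d to infinity drives it toward 1, matching the discussion above.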

We first describe the framework of the recently discovered double descent phenomenon (e.g., Belkin et al., 2019a, b), and then comment on our contributions (summarized in Figure 2) in the context of nearest neighbor algorithms. Specifically, within the "classical regime" where exact data interpolation is impossible, the testing performance curve is the usual U-shape w.r.t. model complexity; once the model complexity grows beyond a critical point, it enters the "over-fitting regime," where the testing error starts to decrease again as the severity of data interpolation increases, which is the so-called "double descent."

In the context of nearest neighbors algorithms, different weighting schemes may be viewed as a surrogate for model complexity. OWNN allocates more weight to closer neighbors, but none of its weights concentrates on a single neighbor; thus OWNN is never an interpolating weighting scheme. From this perspective, k-NN and OWNN both belong to the "classical regime," while interpolated-NN lies within the "over-fitting regime." In particular, the testing performance of OWNN attains the minimum of the U-shaped curve inside the "classical regime," and deviating from this optimal weighting increases the MSE/Regret within that regime. Once the "over-fitting regime" is entered by the interpolated-NN, say at γ = 0 in Figure 2, the MSE/Regret decreases as the interpolation level increases within the range (0, γ_d) and ascends again when γ > γ_d (if the dimension d allows), forming the second U-shaped curve in Figure 2. Therefore, we obtain an overall W-shaped performance curve with theoretical guarantees, which coincides with the empirical findings of Belkin et al. (2019b) for over-parametrized linear models.

## 4 Numerical Experiments

In this section, we present several simulation studies that corroborate our theoretical findings on the regression, classification and stability performance of the interpolated-NN algorithm, together with some real data analysis.

### 4.1 Simulations

We aim to estimate the performance ratio curve via simulation and compare it with the theoretical curve PR(d,γ). The joint distribution of (X, Y), the sample size n and the dimension d follow the second simulation setting in Samworth et al. (2012). The interpolated-NN regressor and classifier were implemented under different choices of γ and k. For regression, the MSE was estimated over repeated runs, and likewise for the Regret in classification.

The Regret/MSE ratios for different γ are shown in Figure 3, where the Regret ratio is defined as Regret(n,γ)/Regret(n,0) and the MSE ratio as MSE(n,γ)/MSE(n,0). The theoretical and simulated curves are mostly close; the small difference is largely caused by the small-order terms in the asymptotic result and should vanish for larger n. Note that the largest γ considered is outside our theoretical range, but the performance is still reasonable in our numerical experiment.

We further estimate the CIS by training two classifiers on two independently simulated data sets of 1024 samples each. The CIS was estimated as the proportion of testing samples receiving different predicted labels, that is,

 $$\widehat{\mathrm{CIS}}(\gamma)=\frac{1}{n}\sum_{i=1}^{n}1\big(\hat g(x_i,D_1)\neq \hat g(x_i,D_2)\big).$$

The CIS results are shown in Figure 4. When γ is small, the simulated CIS ratio decreases in a similar manner as its asymptotic counterpart, while the simulated value increases when γ gets larger. This pattern matches the theoretical result predicted by Theorem 3.

An additional experiment, postponed to the appendix, shows how the MSE and the optimal k change with n and γ.

### 4.2 Real Data Analysis

In the real data experiments, we compare the classification accuracy of interpolated-NN with that of k-NN.

Five data sets were considered in this experiment. The data set HTRU2 from Lyon et al. (2016) uses 17,897 samples with 8 continuous attributes to classify pulsar candidates. The data set Abalone contains 4,176 samples with 7 attributes; following Wang et al. (2018), we predict whether the number of rings is greater than 10. The data set Credit (Yeh and Lien, 2009) has 30,000 samples with 23 attributes, and the task is to predict whether the payment will default in the next month given the current payment information. The built-in digits data set in sklearn (Pedregosa et al., 2011) contains 1,797 samples of 8×8 images. Since the images in MNIST are 28×28, we use only part of MNIST in our experiment. Both the digits data set and MNIST have ten classes; for binary classification we group digits 0 to 4 as the first class and 5 to 9 as the second class.

For each data set, a proportion of the data is used for training and the rest is reserved for testing the trained classifiers. For Abalone, HTRU2, Credit and Digits, we use 25% of the data for training and 75% for testing. For MNIST, we use 2,000 randomly chosen samples for training and 1,000 for testing, which is sufficient for our comparison. The above experiment is repeated 50 times and the average testing error rate is summarized in Table 1. For all data sets, the testing error of interpolated-NN (column "best γ") is always smaller than that of k-NN, which verifies that the nearest neighbor algorithm indeed benefits from interpolation.

## 5 Conclusion

Our work precisely quantifies how data interpolation affects the performance of nearest neighbor algorithms beyond the rate of convergence. We find that for both regression and classification problems, the asymptotic performance ratios between interpolated-NN and k-NN converge to the same value, which depends on d and γ only. More importantly, when the interpolation level is within a reasonable range, the interpolated-NN is strictly better than k-NN: it has a smaller multiplicative constant in the convergence rate, as well as more stable prediction performance.

Classical learning frameworks oppose data interpolation, in the belief that over-fitting means fitting the random noise rather than the model structure. However, in the interpolated-NN, the weight degeneration occurs only on a nearly-zero-measure set, so there is only "local over-fitting," which need not hurt the overall rate of convergence. Technically, by balancing variance and bias, data interpolation can even improve the overall performance, and our work quantifies this bias-variance balance in a precise way. It is of great interest to investigate how these theoretical insights carry over to real deep neural networks, leading to a more complete picture of the double descent phenomenon.

#### Acknowledgments

Prof. Guang Cheng is a visiting member of Institute for Advanced Study, Princeton (funding provided by Eric and Wendy Schmidt) and visiting Fellow of SAMSI for the Deep Learning Program in the Fall of 2019; he would like to thank both Institutes for their hospitality.

## References

• S. Arora, N. Cohen, and E. Hazan (2018) On the optimization of deep networks: implicit acceleration by overparameterization. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 244–253. Cited by: §1.
• S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 322–332. Cited by: §1.
• P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler (2019) Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300. Cited by: §1.
• M. Belkin, D. Hsu, S. Ma, and S. Mandal (2019a) Reconciling modern machine learning and the bias-variance trade-off. Proceedings of the National Academy of Sciences 116 (32), pp. 15849–15854. External Links: Document, ISSN 0027-8424 Cited by: §1, §1, §3.4.
• M. Belkin, D. Hsu, and P. Mitra (2018) Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems 31, pp. 2300–2311.
• M. Belkin, D. Hsu, and J. Xu (2019b) Two models of double descent for weak features. arXiv preprint arXiv:1903.07571. Cited by: §1, §1, §3.4, §3.4.
• M. Belkin, A. Rakhlin, and A. B. Tsybakov (2019) Does data interpolation contradict statistical optimality?. In Proceedings of Machine Learning Research, Proceedings of Machine Learning Research, Vol. 89, pp. 1611–1619. Cited by: §F.1, §1.
• O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of machine learning research 2 (Mar), pp. 499–526. Cited by: §3.3.
• T. I. Cannings, T. B. Berrett, and R. J. Samworth (2017) Local nearest neighbour classification with applications to semi-supervised learning. arXiv preprint arXiv:1704.00642. Cited by: §C.2.2.
• K. Chaudhuri and S. Dasgupta (2014) Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 3437–3445. Cited by: §F.1, §F.1, §F.2, §F.2, §1, §2, §3.1, Remark 3.
• S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai (2019) Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 1675–1685. Cited by: §1.
• S. S. Du and J. D. Lee (2018) On the power of over-parametrization in neural networks with quadratic activation. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 1329–1338. Cited by: §1.
• S. S. Du, X. Zhai, B. Poczos, and A. Singh (2018) Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054. Cited by: §1.
• T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani (2019) Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560. Cited by: §1.
• R. J. Lyon, B. Stappers, S. Cooper, J. Brooke, and J. Knowles (2016) Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society 459 (1), pp. 1104–1123. Cited by: §4.2.
• F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.
• R. J. Samworth et al. (2012) Optimal weighted nearest neighbour classifiers. The Annals of Statistics 40 (5), pp. 2733–2763.
• W. W. Sun, X. Qiao, and G. Cheng (2016) Stabilized nearest neighbor classifier and its statistical properties. Journal of the American Statistical Association 111 (515), pp. 1254–1265. Cited by: Appendix E, §3.2, §3.3, Proposition 5.
• Y. Wang, S. Jha, and K. Chaudhuri (2018) Analyzing the robustness of nearest neighbors to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 5133–5142. Cited by: §4.2.
• A. J. Wyner, M. Olson, J. Bleich, and D. Mease (2017) Explaining the success of adaboost and random forests as interpolating classifiers. The Journal of Machine Learning Research 18 (1), pp. 1558–1590. Cited by: §1.
• B. Xie, Y. Liang, and L. Song (2017) Diverse neural network learns true target functions. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, pp. 1216–1224. Cited by: §1.
• Y. Xing, Q. Song, and G. Cheng (2018) Statistical optimality of interpolated nearest neighbor algorithms. arXiv preprint arXiv:1810.02814. Cited by: Figure 1, §1, §1, §1, §3.1.
• I. Yeh and C. Lien (2009) The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36 (2), pp. 2473–2480. Cited by: §4.2.

The appendix is organized as follows. In Section A, we present an additional simulation study which empirically shows that interpolated-NN and k-NN converge at the same rate, and that interpolated-NN generally requires a larger number of neighbors than k-NN. Sections B to E provide the proofs of the main results in the manuscript: Section C proves Theorem 1, Section D proves Corollary 2, and Section E proves Theorem 3.

In Section F, we deliver a complementary result on the rate optimality of interpolated-NN in classification. Belkin et al. (2018) originally obtained the optimal MSE rate for the regression task, but only a sub-optimal rate for the classification regret. We adopt techniques introduced by Samworth et al. (2012) and rigorously show that, under a smoothness condition more general than the one imposed in our main theorems, interpolated-NN achieves the optimal convergence rate for classification as well.

## Appendix A Additional Numerical Experiment

In this experiment, instead of taking a single sample size, we vary n to see how the performance ratio and the optimal k change in n for different γ's. The phenomenon for classification is similar to that for regression, so we only present regression. Figure 5 summarizes the change of the MSE and the optimal choice of k with respect to different choices of n and γ. The plot for the other setting is quite similar and hence omitted here. The plot shows that, as n increases, interpolated-NN converges at the same rate as k-NN, and interpolated-NN generally requires a larger k than k-NN.

## Appendix B Preliminary Proposition

This section provides a useful result on integrating a c.d.f.:

###### Proposition 5

From Lemma S.1 in Sun et al. (2016), we have, for any distribution function G,

 $$\int_{\mathbb{R}}\big[G(-bu-a)-1\{u<0\}\big]\,du=-\frac{1}{b}\Big\{a+\int_{\mathbb{R}}t\,dG(t)\Big\},$$
 $$\int_{\mathbb{R}}u\big[G(-bu-a)-1\{u<0\}\big]\,du=\frac{1}{b^2}\Big\{\frac{a^2}{2}+\frac{1}{2}\int_{\mathbb{R}}t^2\,dG(t)+a\int_{\mathbb{R}}t\,dG(t)\Big\}.$$
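As a sanity check, both identities can be verified numerically for a concrete choice of G. The sketch below is ours: it takes G to be the standard normal c.d.f. (so ∫t dG = 0 and ∫t² dG = 1) and values of a and b of our choosing.

```python
import numpy as np
from math import erf, sqrt

def Phi(t):                      # standard normal c.d.f., playing the role of G
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def lhs(b, a, moment, lo=-40.0, hi=40.0, n=400001):
    """Numerically integrate u^moment * (G(-b u - a) - 1{u < 0}) over the line."""
    u = np.linspace(lo, hi, n)
    g = np.array([Phi(v) for v in (-b * u - a)])
    integrand = (u ** moment) * (g - (u < 0.0))
    return np.sum(integrand) * (hi - lo) / (n - 1)

a, b = 0.7, 1.3
# first identity:  -(1/b) * (a + int t dG)                        = -a/b here
# second identity: (1/b^2) * (a^2/2 + (1/2) int t^2 dG + a int t dG)
#                                                                 = (a^2+1)/(2 b^2) here
```

The tails decay like Gaussian tails, so truncating the integral at ±40 loses nothing at the precision considered.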

## Appendix C Proof of Theorem 1

Define P_1 and P_0 (with densities f_1 and f_0) as the conditional distributions of X given Y = 1 and Y = 0 respectively, let π_1 = P(Y = 1) and π_0 = P(Y = 0) be the marginal probabilities, and denote by P̄ = π_1 P_1 + π_0 P_0 and f = π_1 f_1 + π_0 f_0 the marginal distribution and density of X. The terms in Theorem 1 are defined as

 $$B_1=\int_{S}\frac{f(x_0)}{\|\dot\eta(x_0)\|}\,d\mathrm{Vol}^{d-1}(x_0),\qquad s^2_{k,n,\gamma}(x)=\frac{E(R_1/R_{k+1})^{-2\gamma}}{E^2(R_1/R_{k+1})^{-\gamma}},$$
 $$a(x)=\frac{1}{f(x)d}\Big\{\sum_{j=1}^{d}\big[\eta_j(x)f_j(x)+\eta_{j,j}(x)f(x)/2\big]\Big\},$$
where subscripts on η and f denote partial derivatives.

### C.1 Regression

Rewrite the interpolated-NN estimate at x, given the distance R_{k+1} to the (k+1)th neighbor and the interpolation level γ, as

 $$S_{k,n,\gamma}(x,R_{k+1})=\sum_{i=1}^{k}W_iY_i,$$

where the weighting scheme is defined as

 $$W_i=\frac{(R_i/R_{k+1})^{-\gamma}}{\sum_{j=1}^{k}(R_j/R_{k+1})^{-\gamma}}.$$

For regression, we decompose the MSE into squared bias and variance:

 $$E\big[(S_{k,n,\gamma}(x,R_{k+1})-\eta(x))^2\,\big|\,x\big]=E\Big[\sum_{i=1}^{k}W_i(\eta(X_i)-\eta(x))\Big]^2+E\Big[\sum_{i=1}^{k}W_i(Y_i-\eta(X_i))\Big]^2,$$

in which the squared bias can be rewritten as

$$E\Big[\sum_{i=1}^{k}W_i(\eta(X_i)-\eta(x))\Big]^2 = kE\big[(W_1(\eta(X_1)-\eta(x)))^2\big]+(k^2-k)E^2\big[W_1(\eta(X_1)-\eta(x))\big],$$

and, since the noise terms $Y_i-\eta(X_i)$ are centered and conditionally independent so that the cross terms vanish, the variance can be approximated as

$$E\Big[\sum_{i=1}^{k}W_i(Y_i-\eta(X_i))\Big]^2 = kE\big[W_1^2\sigma^2(X_1)\big] = k\sigma^2(x)\,EW_1^2+o.$$

Following a procedure similar to Step 1 for classification, i.e., using a Taylor expansion to approximate the squared bias, we obtain that for some function $a(x)$, the bias becomes

$$EW_1(\eta(X_1)-\eta(x))=a(x)\,E\big[W_1R_1^2\big]+o.$$

As a result, the MSE of the interpolated-NN estimate given $R_{k+1}$ becomes

$$E\big[(S_{k,n,\gamma}(x,R_{k+1})-\eta(x))^2\,\big|\,x\big]=k\sigma^2(x)\,EW_1^2+k^2a(x)^2E^2\big[W_1R_1^2\big]+o.$$

Finally, we integrate the MSE over the whole support.

### C.2 Classification

The main structure of the proof follows Samworth (2012). As the whole proof is long, we first provide a brief summary in Section C.2.1 describing what is done in each step, and then present the details of each step in Section C.2.2.

#### C.2.1 Brief Summary

Step 1: denote the i.i.d. random variables $Z_i(x,R_{k+1})$ for $i=1,\dots,k$, where

$$Z_i(x,R_{k+1})=\frac{(R_i/R_{k+1})^{-\gamma}(Y(X_i)-1/2)}{E\big(R_i(x)/R_{k+1}(x)\big)^{-\gamma}},$$

then the probability of classifying $x$ as class 0 becomes

$$P(S_{k,n,\gamma}(x)<1/2)=P\Big(\sum_{i=1}^{k}Z_i(x,R_{k+1})<0\Big).$$

The mean and variance of $Z_1$ can be obtained through a Taylor expansion of $\eta$ and of the density function of $X$:

$$E(Z_1(x,R_{k+1})) = \eta(x)-\frac12+a(x)\,\frac{E\big[R_1^2(R_1/R_{k+1})^{-\gamma}\big]}{E(R_1/R_{k+1})^{-\gamma}}+o,$$
$$\mathrm{Var}(Z_1(x,R_{k+1})) = \frac{1}{4}\,\frac{E(R_1/R_{k+1})^{-2\gamma}}{E^2(R_1/R_{k+1})^{-\gamma}}+o,$$

for some function $a(x)$. The required smoothness conditions are assumed in A.4 and A.5.

Note that in the denominator of $Z_i$ there is an expectation $E(R_i/R_{k+1})^{-\gamma}$. From the later calculation in Corollary 2, the value of this expectation in fact changes little whether or not one conditions on $R_{k+1}$, and it is little affected by $x$ either.

Step 2: one can rewrite the Regret as

$$\int_{\mathbb{R}^d}\Big(P\Big(\sum_{i=1}^{k}W_iY_i\le\frac12\Big)-1\{\eta(x)<1/2\}\Big)(2\eta(x)-1)\,d\bar P(x).$$

From Assumptions A.2 and A.4, the region where the classifier is likely to make a wrong prediction is near the decision boundary $\mathcal S$, thus we use tube theory to transform the integral of the Regret over the $d$-dimensional space into an integral over a tube around $\mathcal S$, i.e.,

$$\int_{\mathbb{R}^d}\Big(P\Big(\sum_{i=1}^{k}W_iY_i\le\frac12\Big)-1\{\eta(x)<1/2\}\Big)(2\eta(x)-1)\,d\bar P(x) = \{1+o(1)\}\int_{\mathcal S}\int_{-\epsilon}^{\epsilon}t\,\|\dot\Psi(x_0)\|\big(P(S_{k,n}(x_0^t)<1/2)-1\{t<0\}\big)\,dt\,d\mathrm{Vol}^{d-1}(x_0)+o.$$

The terms appearing here, in particular $\Psi$ and $x_0^t$, will be defined in detail in the appendix. Basically, when $\epsilon$ is within a suitable range, the integral over $[-\epsilon,\epsilon]$ will not depend on $\epsilon$ asymptotically.

Step 3: given $x_0^t$ and $R_{k+1}$, the $k$ nearest neighbors are i.i.d. random variables distributed in the ball centered at $x_0^t$ with radius $R_{k+1}$, thus we use the non-uniform Berry-Esseen Theorem to get a Gaussian approximation of the probability of a wrong prediction:

$$\int_{\mathcal S}\int_{-\epsilon}^{\epsilon}t\,\|\dot\Psi(x_0)\|\big(P(S_{k,n}(x_0^t)<1/2)-1\{t<0\}\big)\,dt\,d\mathrm{Vol}^{d-1}(x_0) = \int_{\mathcal S}\int_{-\epsilon}^{\epsilon}t\,\|\dot\Psi(x_0)\|\,E_{R_{k+1}}\Bigg[\Phi\Bigg(\frac{-kEZ_1(x_0^t,R_{k+1})}{\sqrt{k\,\mathrm{Var}(Z_1(x_0^t,R_{k+1}))}}\Bigg)-1\{t<0\}\Bigg]\,dt\,d\mathrm{Vol}^{d-1}(x_0)+o.$$
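The Gaussian approximation in this step can be illustrated on a toy version of the problem (taking $W_i\equiv 1$ and a constant $\eta$; all numbers below are illustrative, not from the paper): with $Z_i=Y_i-1/2$ and $P(Y_i=1)=\eta$, compare a Monte Carlo estimate of $P(\sum_i Z_i<0)$ against $\Phi(-kEZ_1/\sqrt{k\,\mathrm{Var}(Z_1)})$.

```python
import math
import random

def Phi(x):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def compare(k=50, eta=0.55, reps=20_000, seed=0):
    """Monte Carlo P(sum Z_i < 0) vs. its normal approximation,
    for the toy choice Z_i = Y_i - 1/2 with Y_i ~ Bernoulli(eta)."""
    rng = random.Random(seed)
    mu = eta - 0.5                 # E Z_1
    var = 0.25 - mu * mu           # Var Z_1, using (Y - 1/2)^2 = 1/4
    hits = 0
    for _ in range(reps):
        s = sum((1.0 if rng.random() < eta else 0.0) - 0.5 for _ in range(k))
        hits += s < 0
    return hits / reps, Phi(-k * mu / math.sqrt(k * var))

mc, gauss = compare()
print(round(mc, 3), round(gauss, 3))
```

The two numbers agree up to the normal-approximation error; in the actual proof the $Z_i$ carry the distance-dependent weights, and the error is controlled uniformly via the non-uniform version of the theorem.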

Step 4: take the expectation over all $R_{k+1}$, and integrate the Gaussian probability over the tube to obtain

$$\begin{aligned}
&\int_{\mathcal S}\int_{-\epsilon}^{\epsilon}t\,\|\dot\Psi(x_0)\|\,E_{R_{k+1}}\Bigg[\Phi\Bigg(\frac{-kEZ_1(x_0^t,R_{k+1})}{\sqrt{k\,\mathrm{Var}(Z_1(x_0^t,R_{k+1}))}}\Bigg)-1\{t<0\}\Bigg]\,dt\,d\mathrm{Vol}^{d-1}(x_0)\\
&\quad=\int_{\mathcal S}\int_{\mathbb{R}}t\,\|\dot\Psi(x_0)\|\Bigg(\Phi\Bigg(-\frac{t\|\dot\eta(x_0)\|}{\sqrt{s^2_{k,n,\gamma}/k}}-\frac{a(x_0^t)\,E\big[(R_1/R_{k+1})^{-\gamma}R_1^2\big]}{\sqrt{s^2_{k,n,\gamma}/k}}\Bigg)-1\{t<0\}\Bigg)\,dt\,d\mathrm{Vol}^{d-1}(x_0)+o\\
&\quad=\frac{B_1}{4k}\,\frac{E(R_1/R_{k+1})^{-2\gamma}}{E^2(R_1/R_{k+1})^{-\gamma}}+\int_{\mathcal S}\frac{\|\dot\Psi(x_0)\|}{\|\dot\eta(x_0)\|^2}\,a^2(x_0)\,\frac{E^2\big[(R_1/R_{k+1})^{-\gamma}R_1^2\big]}{E^2(R_1/R_{k+1})^{-\gamma}}\,d\mathrm{Vol}^{d-1}(x_0)+o\\
&\quad=\frac{B_1}{4k}\,E\big[s^2_{k,n,\gamma}\big]+\int_{\mathcal S}\frac{\|\dot\Psi(x_0)\|}{\|\dot\eta(x_0)\|^2}\,a^2(x_0)\,t^2_{k,n,\gamma}\,d\mathrm{Vol}^{d-1}(x_0)+o.
\end{aligned}$$

#### C.2.2 Details

Denote by $a_d$ the Euclidean ball volume parameter.

Denote $E$ as the event that there exists some $x$ in the support such that $R_{k+1}(x)>r_{2p}$; then for some constant $c$,

$$r_{2p}=c\,\frac{a_d^{1/d}}{c_0^{1/d}}\Big(\frac{2k}{n}\Big)^{1/d}.$$

Hence from Claim A.5 in Belkin et al. (2018), there exist constants $c_1$ and $c_2$ satisfying

$$P(E)\le c_1k\exp(-c_2k).$$

Step 1: in this step, we identify the i.i.d. random variables in our problem, and calculate their mean and variance given $R_{k+1}$.

Denote

$$Z_i(x,R_{k+1})=\frac{(R_i/R_{k+1})^{-\gamma}(Y(X_i)-1/2)}{E(R_i/R_{k+1})^{-\gamma}},\qquad(2)$$

then the dominant part we want to integrate becomes

$$P\Big(S_{k,n}(x,R_{k+1})\le\frac12\Big) = P\Big(\sum_{i=1}^{k}(R_i/R_{k+1})^{-\gamma}(Y(X_i)-1/2)<0\,\Big|\,R_{k+1}\Big) = P\Bigg(\frac{\sum_{i=1}^{k}Z_i(x,R_{k+1})-kEZ_1(x,R_{k+1})}{\sqrt{k\,\mathrm{Var}(Z_1(x,R_{k+1}))}}<\frac{-kEZ_1(x,R_{k+1})}{\sqrt{k\,\mathrm{Var}(Z_1(x,R_{k+1}))}}\,\Bigg|\,R_{k+1}\Bigg).$$

Therefore, one can adopt the non-uniform Berry-Esseen Theorem to approximate this probability by a normal one. Unlike Samworth (2012), the i.i.d. items in the non-uniform Berry-Esseen Theorem here are the $Z_i$'s, so we now calculate the mean and variance of $Z_1$:

$$\mu_{k,n,\gamma}(x,R_{k+1}):=EZ_1(x,R_{k+1}) = \frac{E\big[(R_1/R_{k+1})^{-\gamma}(Y(X_1)-1/2)\big]}{E(R_1/R_{k+1})^{-\gamma}} = \frac{E\big[(R_1/R_{k+1})^{-\gamma}(\eta(X_1)-1/2)\big]}{E(R_1/R_{k+1})^{-\gamma}},$$

and, since $(Y(X_1)-1/2)^2\equiv 1/4$,

$$EZ_1^2(x,R_{k+1}) = \frac{E\big[(R_1/R_{k+1})^{-2\gamma}(Y(X_1)-1/2)^2\big]}{E^2(R_1/R_{k+1})^{-\gamma}} = \frac{E(R_1/R_{k+1})^{-2\gamma}}{4E^2(R_1/R_{k+1})^{-\gamma}},\qquad \sigma^2_{k,n,\gamma}(x,R_{k+1}):=\mathrm{Var}(Z_1(x,R_{k+1})).$$

Then the mean and variance of $Z_1(x_0^t,R_{k+1})$ can be calculated as

$$\mu_{k,n}(x_0^t,R_{k+1})=EZ_1(x_0^t,R_{k+1})+\frac12=\frac{E\big[(R_1/R_{k+1})^{-\gamma}\eta(X_1)\big]}{E(R_1/R_{k+1})^{-\gamma}}$$
$$=\frac{E\big[(R_1/R_{k+1})^{-\gamma}\big(\eta(x_0^t)+(X_1-x_0^t)^\top\dot\eta(x_0^t)+\frac12(X_1-x_0^t)^\top\ddot\eta(x_0^t)(X_1-x_0^t)\big)\big]}{E(R_1/R_{k+1})^{-\gamma}}+o(R_{k+1}^3)$$
 = η(xt0)+E(R