k-Nearest Neighbors by Means of Sequence to Sequence Deep Neural Networks and Memory Networks
Abstract
k-Nearest Neighbors is one of the most fundamental but effective classification models. In this paper, we propose two families of models built on a sequence to sequence model and a memory network model to mimic the k-Nearest Neighbors model. They generate a sequence of labels, a sequence of out-of-sample feature vectors and a final label for classification, and thus they can also function as oversamplers. We also propose 'out-of-core' versions of our models which assume that only a small portion of data can be loaded into memory. Computational experiments show that our models outperform k-Nearest Neighbors due to the fact that our models must produce additional output and not just the label. As oversamplers on imbalanced datasets, the models often outperform SMOTE and ADASYN.
1 Introduction
Recently, neural networks have been attracting a lot of attention among researchers in both academia and industry, due to their astounding performance in fields such as natural language processing [Turian et al., 2010; Mikolov et al., 2013] and image classification [Krizhevsky et al., 2012; Deng et al., 2009]. Interpretability of these models, however, has always been an issue since it is difficult to understand the performance of neural networks. The well-known manifold hypothesis states that real-world high-dimensional data (such as images) form lower-dimensional manifolds embedded in the high-dimensional space [Carlsson et al., 2008], but these manifolds are tangled together and are difficult to separate. The classification process is then equivalent to stretching, squishing and separating the tangled manifolds apart. However, these operations pose a challenge: it is quite implausible that only affine transformations followed by a pointwise nonlinear activation are sufficient to project or embed data into representative manifolds that are easily separable by class.
Therefore, instead of asking neural networks to separate the manifolds by a hyperplane or a surface, it is more reasonable to require points of the same manifold to be closer than points of other manifolds [Olah, 2014]. Namely, the distance between manifolds of different classes should be large and the distance between manifolds of the same class should be small. This distance property is behind the concept of k-Nearest Neighbor (kNN) [Cover & Hart, 1967]. Consequently, letting neural networks mimic kNN would combine the notion of manifolds with the desired distance property.
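As a point of reference, the classical kNN rule that our models mimic can be sketched in a few lines. This is a generic illustration (Euclidean distance, majority voting), not the implementation used in the experiments:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples
    under Euclidean distance, as in classical kNN."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest samples
    labels = y_train[nearest]
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]          # majority-voted class
```

The deep models in this paper replace this explicit search with learned predictions of the neighbor labels (and neighbor feature vectors).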
We explore kNN through two deep network models: sequence to sequence deep networks [Sutskever et al., 2014] and memory networks [Weston et al., 2015; Sukhbaatar et al., 2015]. One family of our models is based on a sequence to sequence network. Our new sequence to sequence model takes an input sequence of length one corresponding to a sample, and then decodes it to predict two output sequences: the classes of the closest samples, and neighboring samples not necessarily in the training data, which we call out-of-sample feature vectors. We also propose a family of models built on a memory network, which has a memory that can be read and written to and is composed of a subset of training samples, with the goal of predicting both classes of close samples and out-of-sample feature vectors. With the help of attention over memory vectors, our new memory network model generates the predicted label sequence and out-of-sample feature vectors. Both models use loss functions that mimic kNN. Computational experiments show that the new sequence to sequence model consistently outperforms the benchmarks (kNN, a feedforward neural network and a vanilla memory network). We postulate that this is due to the fact that we are forcing the model to 'work harder' than necessary (producing out-of-sample feature vectors).
Different from general classification models, our models predict not only labels, but also out-of-sample feature vectors. Usually a classification model only predicts labels, but as in the case of kNN, it is desirable to learn or predict neighbors as well. Intuitively, if a deep neural network predicts both labels and neighbors, it is forced to learn and capture representative information about the input, and thus it should perform better in classification. Our models also function as synthetic oversamplers: we add the out-of-sample feature vectors and their corresponding labels (synthetic samples) to the training set. Experiments show that our sequence to sequence kNN model outperforms SMOTE and ADASYN most of the time on imbalanced datasets.
Usually we allow models to perform kNN searching on the entire dataset, which we call the full versions of the models, but kNN is computationally expensive on large datasets. We design an algorithm to resolve this and we test our models under such an 'out-of-core' setting: only a batch of data can be loaded into memory, i.e. kNN searching on the entire dataset is not allowed. For each such random batch, we compute the $k$ closest samples with respect to the given training sample. We repeat this over multiple batches and keep the $k$ closest samples found overall. These closest samples provide the approximate label sequence and feature vector sequence for the training sample based on the kNN algorithm. Computational experiments show that sequence to sequence kNN models and memory network kNN models significantly outperform the kNN benchmark in the out-of-core setting.
Our main contributions are as follows. First, we develop two types of deep neural network models which mimic the kNN structure. Second, our models are able to predict both labels of the closest samples and out-of-sample feature vectors at the same time: they are both classification models and oversamplers. Third, we establish the out-of-core version of the models for the situation where not all data can be read into computer memory or kNN cannot be run on the entire dataset. The full version of the sequence to sequence kNN models and the out-of-core version of both sequence to sequence kNN models and memory network kNN models outperform the benchmarks, which we postulate is because learning neighboring samples enables the model to capture representative features.
We introduce background and related works in Section 2, show our approaches in Section 3, and describe datasets and experiments in Section 4. Conclusions are in Section 5.
2 Background and Literature Review
There are several works trying to mimic kNN or applying kNN within different models. [Mathy et al., 2015] introduced the boundary forest algorithm, which can be used for nearest neighbor retrieval. Based on the boundary forest model, [Zoran et al., 2017] presented a boundary deep learning tree model with a differentiable loss function to learn an efficient representation for kNN. The main differences between that work and our work are in the base models used (boundary tree vs. standard kNN), in the main objectives (representation learning vs. classification and oversampling) and in the loss functions (KL divergence vs. KL divergence components reflecting the kNN strategy plus a norm term). [Wang et al., 2017] introduced a text classification model which utilizes the nearest neighbors of the input text as the external memory to predict the class of the input text. Our memory network kNN models differ from this model in 1) the external memory: our memory network kNN models simply feed a random batch of samples into the external memory without requiring nearest neighbors, and thus they save computational time, and 2) the number of layers: our memory network kNN models have $k$ layers while the model proposed by [Wang et al., 2017] has one layer. A higher-level difference is that [Wang et al., 2017] consider a pure classification setting, while our models generate not only labels but out-of-sample feature vectors as well. Most importantly, the loss functions are different: [Wang et al., 2017] used KL divergence as the loss function while we use a specially designed combination of KL divergence and norm terms to force our models to mimic kNN.
The sequence to sequence model, one of our base models, has recently become the leading framework in natural language processing [Sutskever et al., 2014; Cho et al., 2014]. In [Cho et al., 2014] an RNN encoder-decoder architecture was used to deal with statistical machine translation problems. [Sutskever et al., 2014] proposed a general end-to-end sequence to sequence framework, which is used as the basic structure in our sequence to sequence kNN model. The major difference between our work and these studies is that the loss function in our work forces the model to learn from neighboring samples, and our models are more than just classifiers: they also create out-of-sample feature vectors that improve accuracy or can be used for oversampling.
There is also a plethora of studies utilizing external memory or attention techniques in neural networks. [Weston et al., 2015] proposed the memory network model to predict the correct answer to a query by ranking the importance of sentences in the external memory. [Sukhbaatar et al., 2015] introduced a continuous version of a memory network with a recurrent attention mechanism over an external memory, which outperformed the previous discrete memory network architecture in question answering. Since it has shown a strong ability to capture long-term dependencies in sequential data, our memory network kNN model is built on the end-to-end memory network model. Moreover, [Bahdanau et al., 2015] introduced an attention-based model to search for the informative parts of the input sentence when generating a prediction. [Rocktäschel et al., 2016] proposed a word-by-word attention-based model to encourage reasoning about entailment. These works utilized an attention vector over inputs, but in our work, the attention is over the external memory rather than the input sequence of length one.
In summary, the major differences between our work and previous studies are as follows. First, our models predict both the labels of the nearest samples and out-of-sample feature vectors rather than labels alone. Thus, they are more than classifiers: the predicted label sequences and feature vector sequences can be treated as synthetic oversamples to handle imbalanced class problems. Second, our work emphasizes the out-of-core setting. All of the prior works related to kNN and deep learning assume that kNN can be run on the entire dataset and thus cannot be used on large datasets. Third, our loss functions are chosen to mimic kNN, so that our models are forced to learn neighboring samples to capture representative information.
2.1 Sequence to Sequence Model
A family of our models is built on sequence to sequence models. A sequence to sequence (Seq2seq) model [Sutskever et al., 2014] is an encoder-decoder model. The encoder encodes the input sequence into an internal representation called the 'context vector', which is used by the decoder to generate the output sequence. Usually, each cell in the Seq2seq model is a Long Short-Term Memory (LSTM) [Hochreiter & Schmidhuber, 1997] cell or a Gated Recurrent Unit (GRU) [Cho et al., 2014].
Given an input sequence $x_1, \ldots, x_T$, in order to predict the output $y^P_1, \ldots, y^P_{T'}$ (where the superscript $P$ denotes 'predicted'), the Seq2seq model estimates the conditional probability $P(y^P_t \mid y^P_1, \ldots, y^P_{t-1}, x_1, \ldots, x_T)$ for each $t$. At each time step $t$, the encoder updates the hidden state $h^E_t$, which can also include the cell state, by

$$h^E_t = f(h^E_{t-1}, x_t),$$

where $f$ denotes the recurrent cell. The decoder updates the hidden state $h^D_t$ by

$$h^D_t = f(h^D_{t-1}, y^P_{t-1}),$$

where $h^D_0 = h^E_T$. The decoder generates the output $o_t$ by

$$o_t = W h^D_t + b \qquad (1)$$

and

$$y^P_t = \sigma(o_t),$$

with $\sigma$ usually being the softmax function.

The model calculates the conditional distribution of the output by

$$P(y^P_1, \ldots, y^P_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} P(y^P_t \mid y^P_1, \ldots, y^P_{t-1}, x_1, \ldots, x_T).$$
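The chain-rule factorization of the output distribution can be illustrated with a small numeric sketch; `step_logits` stands in for the per-step decoder outputs and is a hypothetical input, not the model's actual computation:

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Softmax with optional temperature; subtracting the max is for
    numerical stability and does not change the result."""
    z = np.asarray(z, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def sequence_log_prob(step_logits, output_ids):
    """log P(y_1..y_T' | x) = sum_t log P(y_t | y_<t, x), where each
    conditional is the softmax over that step's decoder logits."""
    return sum(np.log(softmax(l)[y]) for l, y in zip(step_logits, output_ids))
```

With uniform logits over two classes at each of two steps, the sequence log-probability is 2 log(1/2), matching the product of the per-step conditionals.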
2.2 End-to-End Memory Networks
The other family of our models is built on an end-to-end memory network (MemN2N). This model takes a set of vectors $x_1, \ldots, x_n$ as the external memory, a 'query' $q$ and a ground truth answer, and predicts an answer $y^P$. It first embeds the memory vectors and the query into a continuous space. They are then processed through multiple hops to generate the output label $y^P$.
MemN2N has $K$ layers (hops). In the $l$-th layer, where $1 \le l \le K$, the external memory $x_1, \ldots, x_n$ is converted into embedded memory vectors $m^l_1, \ldots, m^l_n$ by an embedding matrix $A^l$. The query $q$ is also embedded as $u^1$ by an embedding matrix $B$. The attention scores between the embedded query and memory vectors are calculated by

$$p^l_i = \mathrm{softmax}\big((u^l)^\top m^l_i\big).$$

Each $x_i$ is also embedded into an output representation $c^l_i$ by another embedding matrix $C^l$. The output vector from the external memory is defined as

$$o^l = \sum_i p^l_i c^l_i.$$

By a linear mapping $H$, the input to the next layer is calculated by

$$u^{l+1} = H u^l + o^l.$$

[Sukhbaatar et al., 2015] suggested that the input and output embeddings are the same across different layers, i.e. $A^1 = \cdots = A^K$ and $C^1 = \cdots = C^K$. In the last layer, by another embedding matrix $W$, MemN2N generates a label for the query by

$$y^P = \mathrm{softmax}\big(W(u^K + o^K)\big).$$
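A single MemN2N hop can be sketched as follows, assuming dot-product attention and, for simplicity, an identity mapping in place of the linear map $H$; the matrix names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def memn2n_hop(u, memory, A, C):
    """One MemN2N hop: embed the memory rows with A (input side) and C
    (output side), attend over the embedded memories with query state u,
    read the attention-weighted output, and update the state.
    The linear map H is taken to be the identity in this sketch."""
    m = memory @ A.T          # embedded memory vectors m_i
    c = memory @ C.T          # output representations c_i
    p = softmax(m @ u)        # attention scores p_i = softmax(u . m_i)
    o = p @ c                 # attention-weighted read from memory
    return u + o              # next-layer state u <- u + o
```

Stacking $K$ such hops and applying a final linear layer with softmax yields the MemN2N prediction described above.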
3 kNN models
Our sequence to sequence kNN models are built on a Seq2seq model, and our memory network kNN models are built on a MemN2N model. Let $k$ denote the number of neighbors of interest.
Vector to Label Sequence (V2LS) Model
Given an input feature vector $x$, a ground truth label $y$ (a single class corresponding to $x$) and a sequence of labels $y_1, \ldots, y_k$ corresponding to the labels of the $k$ nearest samples to $x$ in the entire training set, V2LS predicts a label $y^P$ and $y^P_1, \ldots, y^P_k$, the predicted labels of the $k$ nearest samples. Since $y_1, \ldots, y_k$ are obtained by using kNN upfront, the real input is only $x$ and $y$. When kNN does not misclassify, $y$ corresponds to majority voting of $y_1, \ldots, y_k$.
The key concept of our model is to have $x$ as the input sequence (of length 1) and the output sequence correspond to $y_1, \ldots, y_k$. The loss function also captures $y$ and $y^P$.
In the V2LS model, by a softmax operation with temperature $T$ after a linear mapping $W$, the label of the $t$-th nearest sample to $x$ is predicted by

$$y^P_t = \mathrm{softmax}(W o_t / T),$$

where $o_t$ is as in (1) for $t = 1, \ldots, k$ and $T$ is the temperature of the softmax [Karpathy, 2015; Hinton et al., 2015].
By taking the average of the predicted label distributions, the label of $x$ is predicted by

$$y^P = \frac{1}{k} \sum_{t=1}^{k} y^P_t.$$
Note that if $y^P_t$ corresponds to a Dirac distribution for each $t$, then $y^P$ matches majority voting. The temperature $T$ controls the "peakedness" of $y^P_t$: values of $T$ below 1 push $y^P_t$ towards a Dirac distribution, which is desired in order to mimic kNN. We design the loss function as

$$L = \mathbb{E}\Big[\sum_{t=1}^{k} D_{KL}\big(y_t \,\|\, y^P_t\big) + \alpha \, D_{KL}\big(y \,\|\, y^P\big)\Big],$$

where the first term captures the labels at the neighbor level, the second term the actual ground truth, and $\alpha$ is a hyperparameter to balance the two terms. The expectation is taken over all training samples, and $D_{KL}$ denotes the Kullback-Leibler divergence. Since the first term is the sum of KL divergences between predicted labels of the nearest neighbors and target labels of the nearest neighbors, it forces the model to learn information about the neighborhood. The second term considers the actual ground truth label: a classification model should minimize the KL divergence between the predicted label (the average of the $k$ distributions) and the ground truth label. By combining the two terms, the model is forced to learn not only the final label but also the labels of the $k$ nearest neighbors.
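A minimal sketch of this loss on a single training sample, with the KL divergence taken from the target distribution to the prediction (direction assumed) and all probability vectors illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon for numerical safety."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def v2ls_loss(pred_neighbor_dists, true_neighbor_onehots, y_true, alpha):
    """Sketch of the V2LS loss for one sample: sum of KL terms over the
    k predicted neighbor-label distributions, plus alpha times the KL
    between the ground truth and the averaged prediction."""
    neighbor_term = sum(kl(t, p) for t, p in
                        zip(true_neighbor_onehots, pred_neighbor_dists))
    avg_pred = np.mean(pred_neighbor_dists, axis=0)   # y^P
    return neighbor_term + alpha * kl(y_true, avg_pred)
```

When every predicted neighbor distribution matches its one-hot target, both terms vanish, which is the behavior the loss is designed to reward.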
In inference, given an input $x$, V2LS predicts $y^P$ and $y^P_1, \ldots, y^P_k$, but only $y^P$ is the actual output; it is used to measure the classification performance. Note that it is possible that $y^P$ differs from the majority-voted class among $y_1, \ldots, y_k$ when kNN misclassifies.
Vector to Vector Sequence (V2VS) Model
We use the same structure as the V2LS model except that in this model, the inputs are a feature vector $x$ and a sequence of feature vectors $x_1, \ldots, x_k$ corresponding to the $k$ nearest samples to $x$ among the entire training set (calculated upfront using kNN). V2VS predicts $x^P_1, \ldots, x^P_k$, which denote the predicted out-of-sample feature vectors of the $k$ nearest samples. Since $x_1, \ldots, x_k$ are obtained using kNN, this is an unsupervised model.
The output $o_t$ of the decoder cell is processed by a linear layer ($W_1$), a ReLU operation and another linear layer ($W_2$) to predict the out-of-sample feature vector

$$x^P_t = W_2 \, \mathrm{ReLU}(W_1 o_t + b_1) + b_2.$$

Numerical experiments show that ReLU works best compared with other activation functions. The loss function is defined to be the sum of norms

$$L = \mathbb{E}\Big[\sum_{t=1}^{k} \big\| x^P_t - x_t \big\|\Big].$$
Since the predicted out-of-sample feature vectors should be close to the input vector, learning the nearest vectors forces the model to learn a sequence of approximations to something very close to the identity function. However, this is not trivial. First, it does not learn an exact identity function, since the output is a sequence of nearest neighbors of the input, i.e. it does not simply copy the input $k$ times. Second, by limiting the number of hidden units of the neural network, the model is forced to capture the most representative and condensed information of the input. A large number of studies have shown this to be beneficial for classification problems [Erhan et al., 2010; Vincent et al., 2010; He et al., 2016; Hinton & Salakhutdinov, 2006].
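The sum-of-norms objective for V2VS can be sketched on one sample as follows, assuming the L1 norm (the source does not pin down which norm is used):

```python
import numpy as np

def v2vs_loss(pred_vectors, neighbor_vectors):
    """Sum over t of || x_t^P - x_t || between each predicted
    out-of-sample feature vector and the true t-th nearest neighbor;
    L1 norm assumed for this sketch."""
    return float(sum(np.abs(p - t).sum()
                     for p, t in zip(pred_vectors, neighbor_vectors)))
```

The loss is zero exactly when every predicted vector coincides with its target neighbor, and grows linearly with coordinate-wise deviations.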
In inference, we predict the label of $x$ by finding the labels of the predicted out-of-sample feature vectors $x^P_1, \ldots, x^P_k$ and then performing majority voting among these $k$ labels.
Vector to Vector Sequence and Label Sequence (V2VSLS) Model
In the previous models, V2LS learns to predict the labels of the nearest neighbors, and V2VS learns to predict their feature vectors. Combining V2LS and V2VS, this model predicts both. Given an input feature vector $x$, a ground truth label $y$, a sequence of nearest labels $y_1, \ldots, y_k$ and a sequence of nearest feature vectors $x_1, \ldots, x_k$, V2VSLS predicts a label $y^P$, a label sequence $y^P_1, \ldots, y^P_k$ and an out-of-sample feature vector sequence $x^P_1, \ldots, x^P_k$. Since the two target sequences are obtained by using kNN, the model still only needs $x$ and $y$ as input.
The loss function is a weighted sum of the two loss functions in V2LS and V2VS,

$$L = L_{V2VS} + \beta \, L_{V2LS},$$

where $\beta$ is a hyperparameter to account for the scale of the norm and the KL divergence.
The norm part enables the model to learn neighboring vectors. As discussed for the V2VS model, this is beneficial to classification since it drives the model to capture representative information of the input and its nearest neighbors. The KL divergence part of the loss function focuses on predicting the labels of the nearest neighbors. As discussed for the V2LS model, its two terms let the model learn both the neighboring labels and the ground truth label. Combining the two parts, the V2VSLS model is able to predict the nearest labels and out-of-sample feature vectors, as well as one final predicted label for classification.
In inference, given an input $x$, V2VSLS generates $y^P$, $y^P_1, \ldots, y^P_k$ and $x^P_1, \ldots, x^P_k$. Still, only $y^P$ is used in measuring the classification performance of the model.
Memory Network kNN (MNkNN) Model
The MNkNN model is built on the MemN2N model, which has multiple layers stacked together; after these layers, MemN2N generates a prediction. In order to mimic kNN, our MNkNN model has $k$ layers, and it generates a label after each hop, i.e. after the $t$-th hop, it predicts the label of the $t$-th nearest sample. It mimics kNN because the first hop predicts the label of the closest vector to $x$, the second hop predicts the label of the second closest vector to $x$, etc.
This model takes a query feature vector $x$, its corresponding ground truth label $y$, a random subset from the training set (to be stored in the external memory) and $y_1, \ldots, y_k$ denoting the labels of the $k$ nearest samples to $x$ among the entire training set (calculated upfront using kNN). It predicts a label $y^P$ and a sequence of labels of the closest samples, $y^P_1, \ldots, y^P_k$.
After the $t$-th layer, by a softmax operation with temperature $T$ after a linear mapping $W$, the model predicts the label of the $t$-th nearest sample by

$$y^P_t = \mathrm{softmax}\big(W(u^t + o^t)/T\big),$$

where $t = 1, \ldots, k$. The role of $T$ is the same as in the V2LS model. Taking the average of the predicted label distributions, the final label of $x$ is calculated by

$$y^P = \frac{1}{k} \sum_{t=1}^{k} y^P_t.$$
Same as in V2LS, the loss function of MNkNN is defined as

$$L = \mathbb{E}\Big[\sum_{t=1}^{k} D_{KL}\big(y_t \,\|\, y^P_t\big) + \alpha \, D_{KL}\big(y \,\|\, y^P\big)\Big].$$
The first term accounts for learning neighboring information, and the second term forces the model to provide the best single candidate class.
In inference, the model takes a query $x$ and random samples from the training set (for the external memory), and generates the predicted label $y^P$ (and a sequence of nearest labels $y^P_1, \ldots, y^P_k$).
Memory Network kNN with Vector Sequence (MNkNN_VEC) Model
This model is built on MNkNN, but it predicts out-of-sample feature vectors as well. MNkNN_VEC takes a query feature vector $x$, its corresponding ground truth label $y$, a random subset from the training dataset (to be stored in the external memory), and $y_1, \ldots, y_k$ and $x_1, \ldots, x_k$ denoting the labels and feature vectors of the $k$ nearest samples to $x$ among the entire training set (both calculated upfront using kNN). MNkNN_VEC predicts a label $y^P$, labels $y^P_1, \ldots, y^P_k$ and out-of-sample feature vectors $x^P_1, \ldots, x^P_k$.
By a linear mapping $W_1$, a ReLU operation and another linear mapping $W_2$, the feature vectors are then calculated by

$$x^P_t = W_2 \, \mathrm{ReLU}\big(W_1 (u^t + o^t)\big).$$

Same as in the V2VSLS model, combining the norm and the KL divergence together, the loss function is defined as

$$L = \mathbb{E}\Big[\sum_{t=1}^{k} \big\| x^P_t - x_t \big\|\Big] + \beta \, \mathbb{E}\Big[\sum_{t=1}^{k} D_{KL}\big(y_t \,\|\, y^P_t\big) + \alpha \, D_{KL}\big(y \,\|\, y^P\big)\Big].$$
As discussed in the V2VSLS model, having such a loss function forces the model to learn both the feature vectors and the labels of nearest neighbors.
In inference, the model takes a query $x$ and random vectors from the training dataset, and generates the predicted label $y^P$, a sequence of labels $y^P_1, \ldots, y^P_k$ and a sequence of out-of-sample feature vectors $x^P_1, \ldots, x^P_k$.
Out-of-Core Models
In the models exhibited so far, we assume that kNN can be run exactly on the entire dataset to compute the $k$ nearest feature vectors and corresponding labels of an input sample. However, there are two problems with this assumption. First, this can be very computationally expensive if the dataset is large. Second, the training dataset might be too big to fit in memory. When either of these two challenges is present, an out-of-core model, which assumes it is infeasible to run a full kNN on the entire dataset, has to be invoked. The out-of-core models avoid running kNN on the entire dataset, and thus save computational time and resources.
Let $m$ be the maximum number of samples that can be stored in memory, where $m$ is much smaller than the training set size. For a training sample $x$, we sample a subset $S$ from the training set (including $x$) with $|S| \le m$, then we run kNN on $S$ to obtain the $k$ nearest feature vectors and corresponding labels to $x$, which are denoted as $\tilde{x}_t$ and $\tilde{y}_t$ for $t = 1, \ldots, k$ in the training process. The previously introduced loss functions depend on these sequences and the model parameters $\theta$, and thus our out-of-core models solve

$$\min_{\theta} \; \mathbb{E}\big[L(\tilde{x}_1, \ldots, \tilde{x}_k, \tilde{y}_1, \ldots, \tilde{y}_k; \theta)\big],$$

where $L$ is either the sequence to sequence kNN loss or the memory network kNN loss.
Sampling a set of size $m$ and then finding the $k$ nearest samples only once, however, is insufficient on imbalanced datasets, due to the low selection probability for minority classes. To resolve this, we iteratively take random batches: each time a random batch is taken, we update the $k$ closest samples by taking the $k$ closest among the current batch and the previously retained $k$ closest samples. The resulting nearest feature vectors and corresponding labels are used as $\tilde{x}_t$ and $\tilde{y}_t$ for $t = 1, \ldots, k$ in the loss function. Note that we allow previously selected samples to be selected again in later sampling iterations. The entire algorithm is exhibited in Algorithm 1. This procedure is also equivalent to sampling once a subset consisting of the union of all batches and finding the $k$ nearest samples among it.
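The iterative batch procedure can be sketched as follows, under stated assumptions: Euclidean distance and feature vectors only (in practice, the labels are carried alongside the retained vectors):

```python
import numpy as np

def out_of_core_knn(sample_batches, x, k):
    """Approximate kNN under the out-of-core setting: stream random
    batches, and after each batch keep only the k closest samples seen
    so far. This is equivalent to one kNN search over the union of all
    batches, but never holds more than one batch plus k vectors."""
    best = None                              # running (k, d) array of closest samples
    for batch in sample_batches:
        pool = batch if best is None else np.vstack([best, batch])
        dists = np.linalg.norm(pool - x, axis=1)
        best = pool[np.argsort(dists)[:k]]   # retain the k closest from the pool
    return best
```

Because only the current batch and the $k$ retained vectors are ever in memory, the peak memory footprint is independent of the dataset size.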
4 Computational experiments
Four classification datasets are used: Network Intrusion (NI) [Hettich & Bay, 1999], Forest Covertype (COV) [Blackard & Dean, 1998], SensIT [Duarte & Hu, 2004] and Credit Card Default (CCD) [Yeh & Lien, 2009]. Details of these datasets are given in Table 1. We only consider 3 classes in the NI and COV datasets due to significant class imbalance (0.01% minority class in NI and 0.3% in COV).
All of the models have been developed in Python 2.7 by using Tensorflow 1.4.
NI  COV  SensIT  CCD  
Dataset Size  796,497  530,895  98,528  30,000 
Feature Size  41  54  100  23 
Number of Classes  3  3  3  2 

Table 1: Datasets information.
For each dataset we repeat the following 5 times, each time with a different seed. The dataset is randomly split into 80%/10%/10% (training/validation/testing, respectively). The validation dataset is used in the usual way to identify the parameters with the best F1 score, which are then used on the test set. The F1 score is used as the performance measure. All reported numbers are averages taken over the 5 random seeds.
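Since F1 is the performance measure on multi-class datasets, a macro-averaged F1 (the averaging scheme is assumed; the text does not specify it) can be computed as:

```python
import numpy as np

def f1_per_class(y_true, y_pred, cls):
    """One-vs-rest precision, recall and F1 for a single class."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    return float(np.mean([f1_per_class(y_true, y_pred, c) for c in classes]))
```

Macro averaging weights each class equally, which is the natural choice when minority-class performance matters, as it does on these imbalanced datasets.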
We discuss the performance of the models in two aspects: classification and oversampling.
Classification
As comparisons against the memory network kNN models and sequence to sequence kNN models, we use kNN, a 4-layer feedforward neural network (FFN) trained using the Adam optimization algorithm (which has been calibrated) and MemN2N (since MNkNN and MNkNN_VEC are built on MemN2N) as three benchmarks. The same value of $k$ is used in all models because it yields the best performance with low standard deviation among the values tested. Increasing $k$ beyond this value is somewhat detrimental to the F1 scores while significantly increasing the training time. Figure 1, where full models are used, shows how changing $k$ affects the F1 scores of kNN and our models on the SensIT dataset. We observe similar patterns on the other datasets for both the full models and the out-of-core models.
In the sequence to sequence kNN models, LSTM cells are used. In the memory network kNN models, the size of the external memory is 64, since we observe that models with 64 memory vectors generally provide the best F1 scores with acceptable running time. Both sequence to sequence kNN models and memory network kNN models are trained using the Adam optimization algorithm [Kingma & Ba, 2015] with the initial learning rate set to 0.01. Dropout with probability 0.2 [Srivastava et al., 2014] and batch normalization [Ioffe & Szegedy, 2015] are used to avoid overfitting. Regarding the other hyperparameters ($\alpha$, $\beta$ and the temperature $T$), we use the values that provide overall the best validation F1 scores.
  NI  COV  SensIT  CCD
kNN  90.54  91.15  82.56  63.81
FFN  88.53  91.83  83.67  65.37
MemN2N  79.36  77.98  75.17  61.83
V2LS  91.28  93.94  84.93  68.38
V2VS  86.18  90.39  74.84  64.23
V2VSLS  92.07  94.97  86.24  69.87
MNkNN  83.83  80.12  79.58  67.26
MNkNN_VEC  84.59  83.94  83.41  68.82
Table 2: F1 score comparison of full models.
We first discuss the full models that can handle all of the training data, i.e. kNN can be run on the entire dataset. Figure 2 and Table 2 show that in the full model case, V2VSLS consistently outperforms the other models on all four datasets. t-tests show that it significantly outperforms the three benchmarks at the 5% level on all four datasets. Moreover, predicting not only labels but feature vectors as well clearly pays off: V2VSLS consistently outperforms V2LS, MNkNN_VEC consistently outperforms MNkNN, and models predicting feature vectors outperform those not predicting them on all datasets. The memory-based models exhibit subpar performance, which is expected since they only consider 64 training samples at once (despite using exact labels). From Figure 2 we also observe that standard deviations do not differ significantly.
In the out-of-core versions of our models, the number of sampling iterations is set to 50, since we observe that increasing it from 50 to, for instance, 100 has only a slight impact on the F1 scores. This can also be seen from the selection probability: the probability that a given sample is selected in at least one of 100 batches is double that for 50 batches, but both are numerically small compared to the full model case, where the selection probability is 1. However, increasing the number of iterations from 50 to 100 substantially increases the running time. Thus, we use 50 iterations to test and compare our models with the benchmarks. The batch size of the out-of-core models is set to 64, since it is found to provide overall the best F1 scores with reasonable running time. Numerical experiments show that increasing the batch size beyond 64 does not make a great difference while the running time significantly increases.
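The selection-probability argument can be made concrete with a rough with-replacement approximation (a sketch, not the exact sampling scheme used in the experiments):

```python
def selection_probability(n, batch_size, n_batches):
    """Approximate probability that a given training sample appears in
    at least one of n_batches random batches of batch_size drawn from
    n samples, treating batches as independent draws."""
    p_single = batch_size / n          # chance of appearing in one batch
    return 1.0 - (1.0 - p_single) ** n_batches
```

For a large dataset, doubling the number of batches roughly doubles this probability, yet both values remain far below the full-model probability of 1, which is consistent with the small observed F1 difference between 50 and 100 iterations.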
Figure 3 shows the results of our models under the out-of-core assumption with a batch size of 64 and 50 sampling iterations. The comparison shows that both V2VSLS and MNkNN_VEC significantly outperform the kNN benchmark based on t-tests at the 5% significance level. Note that the MemN2N and FFN benchmarks are not available under the out-of-core setting since they do not require calculating nearest neighbors. The kNN benchmark provides a low score since we restrict the batch size (or memory size) to 64, and it turns out that kNN is substantially affected by the randomness of batches. We also find that even if we increase the batch size from 64 to 128, kNN still provides similar F1 scores. Our models (except V2VS, since it makes predictions depending only on feature vector sequences) are robust under the out-of-core setting, because the weight of the ground truth label in the loss function is relatively high, so even if the input nearest-neighbor sequences are noisy, the models can still focus on learning the ground truth label and make reasonable predictions.
kNN  V2LS  V2VS  V2VSLS  MNkNN  MNkNN_VEC  

Full model F1  82.56  84.93  74.84  86.24  79.58  83.41 
OOC model F1  61.40  82.47  69.12  83.38  78.80  82.32 
Full model time (s)  312  443+635  857+1358  1391+1802  443+692  1391+1081 
OOC model time (s)  193  287+619  488+1316  741+1846  287+703  741+1055 

Table 3: Full model and outofcore (OOC) model comparison on SensIT.
Table 3 shows a comparison between the full and out-of-core models on the SensIT dataset. The running time of our models is broken down into two parts: the first part is the time to obtain the sequences of nearest feature vectors and labels, and the second part is the model training time. Under the out-of-core setting, overall the kNN sequence preprocessing time is reduced by approximately 40% while the models perform only slightly worse.
Oversampling
Since V2VSLS and MNkNN_VEC are able to predict out-of-sample feature vectors, we also regard our models as oversamplers and compare them with two widely used oversampling techniques: the Synthetic Minority Oversampling Technique (SMOTE) [Chawla et al., 2002] and Adaptive Synthetic sampling (ADASYN) [He et al., 2008]. We only test V2VSLS since it is the best model that can handle all of the data. In our experiments, we first fully train the model. Then for each sample from the training set, V2VSLS predicts $k$ out-of-sample feature vectors, which are regarded as synthetic samples. We add them to the training set if they belong to a minority class, until the classes are balanced or there are no minority training samples left for creating synthetic samples. We also observe that weighting the loss more towards the norm term and less towards the KL divergence term produces better synthetic samples, and we use such weights in our oversampling experiments.
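The oversampling procedure can be sketched as follows; `generate_neighbors` is a hypothetical stand-in for the trained V2VSLS model's predicted out-of-sample feature vectors:

```python
import numpy as np

def oversample_minority(X, y, generate_neighbors, minority_class):
    """Sketch of using a trained V2VSLS-style model as an oversampler:
    for each minority-class sample, append its predicted out-of-sample
    feature vectors as synthetic minority samples, stopping when the
    classes are balanced or the minority samples are exhausted."""
    X, y = list(X), list(y)
    deficit = (sum(lbl != minority_class for lbl in y)
               - sum(lbl == minority_class for lbl in y))
    minority = [x for x, lbl in zip(X, y) if lbl == minority_class]
    for x in minority:
        if deficit <= 0:
            break
        for v in generate_neighbors(x):      # k predicted neighbor vectors
            if deficit <= 0:
                break
            X.append(v)
            y.append(minority_class)
            deficit -= 1
    return np.array(X), np.array(y)
```

Unlike SMOTE and ADASYN, which interpolate between existing minority samples, the synthetic vectors here come from a model trained to reproduce the neighborhood structure of the data.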
Figure 4 shows the F1 scores of FFN, extreme gradient boosting (XGB) [Chen & Guestrin, 2016] and random forest (RF) [Breiman, 2001] classification models with different oversampling techniques, namely, the original training set without oversampling, SMOTE, ADASYN and V2VSLS. V2VSLS performs the best among all combinations of classification models and oversampling techniques, as shown in Table 5. Although models trained with the three oversampling techniques outperform models trained without oversampling most of the time, the classification performance still largely depends on the classification model used and the dataset considered.
NI  COV  SensIT  CCD  
Best model  FFN+V2VSLS  RF+V2VSLS  FFN+V2VSLS  RF+V2VSLS 
Best F1 score  90.89  94.36  83.92  68.08 
Better than best SMOTE by  1%  0.51%  0.61%  2.28% 
Better than best ADASYN by  0.6%  0.56%  0.26%  1.4% 
Table 5: Oversampling techniques comparison.
Figure 5 shows a t-SNE [van der Maaten & Hinton, 2008] visualization of the original set and the oversampled set on the SensIT dataset, projected onto 2D space. Although SMOTE and ADASYN overall perform well, their class boundaries are not as clean as those obtained by V2VSLS.
Summary
In summary, we find that it is beneficial to have neural network models learn not only labels but feature vectors as well. Among the full models, V2VSLS consistently outperforms all others; among the out-of-core models, both V2VSLS and MNkNN_VEC significantly outperform the kNN benchmark. As an oversampler, V2VSLS yields a higher average F1 score on the augmented training sets than SMOTE and ADASYN.
We recommend running V2VSLS with the loss weighted towards the KL divergence terms for classification. In the oversampling scenario, however, we suggest weighting the loss towards the norm term so that the model focuses more on the feature vectors, i.e. the synthetic samples.
References
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, 2015.
 Jock A. Blackard and Denis J. Dean. UCI Machine Learning Repository, 1998. URL https://archive.ics.uci.edu/ml/datasets/covertype.
 Leo Breiman. Random forests. Machine Learning, 2001.
 Gunnar Carlsson, Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision, 2008.
 Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002.
 Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
 Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Conference on Empirical Methods in Natural Language Processing, 2014.
 Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory, 1967.
 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. Conference on Computer Vision and Pattern Recognition, 2009.
 Marco F. Duarte and Yu Hen Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 2004.
 Dumitru Erhan, Yoshua Bengio, Aaron C. Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 2010.
 Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. International Joint Conference on Neural Networks, 2008.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2016.
 S. Hettich and S. D. Bay. The UCI KDD Archive, 1999. URL http://kdd.ics.uci.edu.
 Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
 Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. Annual Conference on Neural Information Processing Systems, 2015.
 Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015.
 Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Annual Conference on Neural Information Processing Systems, 2012.
 Charles Mathy, Nate Derbinsky, José Bento, Jonathan Rosenthal, and Jonathan S. Yedidia. The boundary forest algorithm for online supervised and unsupervised learning. Association for the Advancement of Artificial Intelligence Conference, 2015.
 Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
 Christopher Olah. Neural networks, manifolds, and topology, 2014. URL http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/.
 Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. International Conference on Learning Representations, 2016.
 Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
 Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised memory networks. Annual Conference on Neural Information Processing Systems, 2015.
 Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. Annual Conference on Neural Information Processing Systems, 2014.
 Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. Annual Meeting of the Association for Computational Linguistics, 2010.
 Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
 Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010.
 Zhiguo Wang, Wael Hamza, and Linfeng Song. k-nearest neighbor augmented neural networks for text classification. arXiv Repository, 2017. URL http://arxiv.org/abs/1708.07863.
 Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. International Conference on Learning Representations, 2015.
 I-Cheng Yeh and Che-hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 2009.
 Daniel Zoran, Balaji Lakshminarayanan, and Charles Blundell. Learning deep nearest neighbor representations using differentiable boundary trees. arXiv Repository, 2017. URL http://arxiv.org/abs/1702.08833.