Improving Accuracy of Nonparametric Transfer Learning via
Vector Segmentation
Abstract
Transfer learning using deep neural networks as feature extractors has become increasingly popular over the past few years. It makes it possible to obtain state-of-the-art accuracy on datasets too small to train a deep neural network on their own, and it provides cutting-edge descriptors that, combined with nonparametric learning methods, allow rapid and flexible deployment of high-performing solutions in computationally restricted settings. In this paper, we are interested in showing that the features extracted using deep neural networks have specific properties which can be used to improve the accuracy of downstream nonparametric learning methods. Namely, we demonstrate that for some distributions where information is embedded in a few coordinates, segmenting feature vectors can lead to better accuracy. We show how this model can be applied to real datasets by performing experiments using three mainstream deep neural network feature extractors and four databases, in vision and audio.

1 Introduction
Transfer learning consists of training a learning method on a first dataset to be used on a second, distinct one. In this context, using Deep Neural Networks (DNNs) [1, 2, 3] (in particular convolutional neural networks on vision datasets) has become increasingly popular over the past few years. Indeed, the features extracted by state-of-the-art deep neural networks are so good that they make it possible, in some settings, to reach the best known accuracy when applied to other datasets and combined with simple classification routines. One of the key interests in using transfer learning methods is to avoid the heavy computational cost of training DNNs. Therefore, it is possible to exploit their accuracy on embedded devices such as small robots or smartphones [4, 5]. In this context, nonparametric methods such as Nearest Neighbors (NN) are particularly attractive for their ability to handle both class-incremental and example-incremental learning [6, 7].
It is interesting to point out that DNNs are trained to extract features well suited to a given classification task. In order for these features to become usable in other contexts (e.g. a new classification task), broad databases containing a large variety of classes should be used. As a direct consequence, it is expected that a significant part of the extracted feature vectors is useless for solving the new task at hand. Consequently, e.g. in the field of approximate nearest neighbor search for classification, it is often observed that methods based on Product Quantization [8] and its derivatives can, for certain choices of parameters, lead to better performance than an exhaustive search on raw data. In these methods, the search space is split into quantized subspaces in which the search is performed independently.
In this paper we are interested in showing that, more generally, segmenting feature vectors (obtained with pre-trained DNNs) into multiple subvectors, for which the search is performed independently, can result in higher overall accuracy. We are interested in answering the following questions:

Are there simple convincing mathematical models in which such improvement exists?

How do such improvements depend on parameters?

Does this apply to real world data?
To answer these questions, we describe the mathematical core of the classification procedure in Section 2. Section 3 contains an example where this procedure is not successful, as well as a couple of situations where segmentation not only helps, but provides the right class of the test pattern with probability converging to 1 as the dimension becomes large, while both the plain Euclidean comparison and the coordinate-by-coordinate comparison fail with probability at least one half. Section 4 contains experiments on real datasets. Section 5 concludes.
2 Classification by segmentation
In this section we give a mathematical framework for the procedure we have in mind. We start with classes of data $C_1, \dots, C_m \subseteq \mathbb{R}^d$, where we always assume that the dimension $d$ is a large parameter. To simplify matters, assume that all of these classes have $M$ elements. Now choose $c$ such that $c$ divides $d$ and write $d = c \cdot s$, so that $\mathbb{R}^d$ splits into $c$ subspaces of dimension $s$; the $j$'th of these subspaces will also be denoted by $\mathbb{R}^s_j$. Thus, we write each word $x \in \mathbb{R}^d$ as $x = x^{(1)} \frown x^{(2)} \frown \dots \frown x^{(c)}$, where each $x^{(j)} \in \mathbb{R}^s_j$ and “$\frown$” denotes concatenation. For each $j$ we take a dictionary $D_j$ of segments such that $D_j$ contains $n$ segments of each class drawn uniformly at random: for each class $C_i$, we pick the $j$'th segments of $n$ words in $C_i$ uniformly at random without replacement and put them into the dictionary $D_j$. Given a fresh word $y$ we parse it in the same way into $y = y^{(1)} \frown \dots \frown y^{(c)}$. Then for each $j$ we find
(1) $\hat{x}^{(j)} := \operatorname{argmin}_{x^{(j)} \in D_j} \|y^{(j)} - x^{(j)}\|.$
Here $\|\cdot\|$ denotes the Euclidean distance in $\mathbb{R}^s_j$. Let the random variable $X_j$ take the value $i$, if $\hat{x}^{(j)}$ is the $j$'th segment of a word $x \in C_i$. If several words minimize the distance in (1), we let $X_j$ take any of the values of the classes corresponding to these words with equal probability, as a tie-breaking rule. Finally one takes $\hat{i} := \operatorname{argmax}_{i=1,\dots,m} \#\{j : X_j = i\},$
hence the class that is most often found by the above procedure. Again we add a tie-breaking rule, if this class is not unique. The segmentation procedure assigns $y$ the class $C_{\hat{i}}$.
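The procedure just described can be sketched in code as follows. This is a minimal illustration only: the data layout (one array of stored segments per class in each dictionary) and all names are our own choices, not the authors' implementation.

```python
import numpy as np

def classify_by_segmentation(y, dictionaries, rng=None):
    """Majority vote over segments, as described above.

    `dictionaries` is a list of c dicts; dictionaries[j] maps a class label
    to an array of shape (n, s) holding the n stored j'th segments of that
    class. The segment length s is inferred from y and c.
    """
    rng = np.random.default_rng() if rng is None else rng
    c = len(dictionaries)
    s = y.shape[0] // c
    votes = []
    for j in range(c):
        y_j = y[j * s:(j + 1) * s]
        best_labels, best_dist = [], np.inf
        for label, segments in dictionaries[j].items():
            dist = np.min(np.linalg.norm(segments - y_j, axis=1))
            if dist < best_dist - 1e-12:
                best_labels, best_dist = [label], dist
            elif dist <= best_dist + 1e-12:
                best_labels.append(label)  # several words minimize the distance
        votes.append(rng.choice(best_labels))  # tie-breaking rule: uniform pick
    labels, counts = np.unique(votes, return_counts=True)
    winners = labels[counts == counts.max()]
    return int(rng.choice(winners))  # tie-break the majority vote as well
```

The two `rng.choice` calls implement the two tie-breaking rules of the text: one at the segment level, one at the final majority vote.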
3 Situations where segmentation is or is not favorable
3.1 When segmentation does not help
We start with an example showing that the data need to have a special structure for the segmentation technique to be beneficial. In this subsection we will assume that we only have two classes $C_1$ and $C_2$. These are built in the following way. Take $t_1, t_2 \in \{-1, +1\}^d$ uniformly at random as “base vectors” of the two classes. Assume $0 < p < \frac{1}{2}$. Then for $i = 1, 2$ and $k = 1, \dots, M$ let $\varepsilon_{i,k}$ be i.i.d. vectors with i.i.d. coordinates such that $\mathbb{P}(\varepsilon_{i,k,l} = 1) = 1 - p$ and $\mathbb{P}(\varepsilon_{i,k,l} = -1) = p$. We define $x_{i,k} = t_i \cdot \varepsilon_{i,k}$. Here the multiplication is pointwise, i.e. $(t_i \cdot \varepsilon_{i,k})_l = t_{i,l}\,\varepsilon_{i,k,l}$. Also let us assume that for each $c$ such that $c$ divides $d$ and each $j$ we take a dictionary $D_j$ consisting of two segments only, one, $x_1^{(j)}$, belonging to class $C_1$, and one, $x_2^{(j)}$, belonging to class $C_2$. This will help to facilitate computations. In this setting we claim that segmentation does not improve the accuracy of the naive Euclidean approach.
Proposition 3.1
In the above situation assume that $s = d/c$ and $c$ are odd (to avoid the discussion of tie-breaks) and that $y$ is distributed like a word from $C_1$ (but independent of all words in all classes). Then there is a number $\gamma > 0$ such that the segmentation rule misclassifies $y$ with probability $e^{-\gamma d (1 + o(1))}$,
whilst the plain Euclidean rule ($c = 1$) misclassifies $y$ with probability $e^{-2\gamma d (1 + o(1))}$.
Proof: First of all notice that, if $y$ and any of the other words, say $x$, differ in a coordinate, they do so by 2. Hence we have that $\|y - x\|^2 = 4\, d_H(y, x)$, where $d_H$ denotes the Hamming distance, and this is also true on any of the subspaces $\mathbb{R}^s_j$. We will first compute the probability that the $j$'th segment of $y$ is misclassified, i.e. that $y^{(j)}$ is at least as close to $x_2^{(j)}$ as to $x_1^{(j)}$; this event is controlled by a sum of i.i.d. random variables.
A quick calculation shows that
In particular . Thus by Cramér’s theorem (cf. [9], Theorem 2.1.24) and the convexity of the rate function there, we get
where and in particular . This means that the probability of a misclassification in a segment of length $s$ decays exponentially in $s$. The probability of completely misclassifying $y$ is now the probability that the majority of segments vote for the wrong class; here the relevant indicators are for the events that $y^{(j)}$ is closer to $x_2^{(j)}$ than to $x_1^{(j)}$. Again we will apply Cramér's theorem to compute the asymptotics of this probability. Recall that these indicators are i.i.d. Bernoullis and that the rate function in the large deviations principle for Bernoullis with success probability $p$ is the relative entropy (this is actually a special case of Theorem 2.1.10 in [9]). For our choice we obtain especially
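For reference, the relative-entropy rate function invoked here can be written out explicitly (a standard fact, stated in notation of our own choosing):

```latex
% Cramér rate function for i.i.d. Bernoulli(p) variables B_1, B_2, \dots:
% for x \in (p, 1],
I_p(x) \;=\; x \log\frac{x}{p} \;+\; (1 - x)\log\frac{1 - x}{1 - p},
\qquad
\mathbb{P}\Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} B_i \ge x\Big)
\;=\; e^{-n\, I_p(x)\,(1 + o(1))}.
```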
Thus
On the other hand, using (3.1) for $c = 1$, i.e. using just the Euclidean distance in $\mathbb{R}^d$, we obtain
hence the rate function of the probability of a misclassification with $c = 1$ is twice the rate function of the above probability.
Remark 3.2
It is also interesting to compare the accuracy of the segmentation procedure with the other “natural” classification technique, the coordinate-by-coordinate comparison ($c = d$). Here a misclassification occurs if the majority of the coordinate-wise comparisons votes for the wrong class, and the corresponding random variables take two values only:
Indeed the summand in each of the lines reflects the tie-breaking rule, if two coordinates agree. Again Cramér's theorem applies. Now with the notation from the previous proof
But for all
Therefore, in this case the decision based on the Euclidean norm is the best of the proposed segmentation methods.
It is also worth mentioning that, although in this case it is not useful to segment at all or to segment down to single coordinates, segmentation with an intermediate number of pieces still yields a correct classification with probability close to 1.
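To make the comparison concrete, here is a small Monte-Carlo sketch of the two-class sign-flip model of this subsection, comparing the plain Euclidean rule with the coordinate-by-coordinate rule. The dimension d = 30, the flip probability p = 0.2, the one-word-per-class dictionary, the tie handling and the number of trials are illustrative choices of ours, not the asymptotic regime of the propositions.

```python
import numpy as np

def trial(rng, d=30, p=0.2):
    """Classify one word from class 1 in the sign-flip model (toy parameters)."""
    t1 = rng.choice([-1.0, 1.0], size=d)   # base vector of class 1
    t2 = rng.choice([-1.0, 1.0], size=d)   # base vector of class 2
    flip = lambda: np.where(rng.random(d) < p, -1.0, 1.0)
    x1, x2 = t1 * flip(), t2 * flip()      # one dictionary word per class
    y = t1 * flip()                        # fresh word, truly from class 1
    # whole-vector Euclidean rule (squared distance is 4 times Hamming distance)
    d1, d2 = int(np.sum(y != x1)), int(np.sum(y != x2))
    euc_ok = d1 < d2 if d1 != d2 else bool(rng.random() < 0.5)
    # coordinate-by-coordinate rule: one vote per coordinate, ties broken at random
    v1 = np.abs(y - x1) < np.abs(y - x2)   # coordinates voting for class 1
    v2 = np.abs(y - x2) < np.abs(y - x1)   # coordinates voting for class 2
    n_ties = int(np.sum(~(v1 | v2)))
    tie_votes_1 = int(np.sum(rng.random(n_ties) < 0.5))
    margin = int(v1.sum()) - int(v2.sum()) + 2 * tie_votes_1 - n_ties
    coord_ok = margin > 0 if margin != 0 else bool(rng.random() < 0.5)
    return euc_ok, coord_ok

rng = np.random.default_rng(0)
res = np.array([trial(rng) for _ in range(4000)])
euc_acc, coord_acc = res.mean(axis=0)
```

In this regime both rules are clearly better than chance, but the coordinate-by-coordinate rule wastes the many tied coordinates on coin flips, so the Euclidean rule comes out ahead, in line with the remark above.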
3.2 When segmentation does help
In the previous subsection we saw that there are natural situations where the simplest case, in which one does not partition the vectors at all, is the best. However, the situation described there is close to one where using pieces of intermediate size not only gives a better result than $c = 1$ and $c = d$, but where the results for the latter choices are useless.
We will start by describing a basic situation and then discuss possible extensions. All these models are influenced by the observation that the data classified in [10] seem to suffer from occasional large outliers.
Our first basic situation will be given by classes $C_1 = \{x_{1,k} : k = 1, \dots, M\}$ and $C_2 = \{x_{2,k} : k = 1, \dots, M\}$, where
$x_{1,k} = b_1 + \varepsilon_{1,k}$ and $x_{2,k} = b_2 + \varepsilon_{2,k}$.
This time (for the sake of keeping things easy), $b_1 = 0$ and $b_2$ has a 1 at regularly spaced positions and is 0 otherwise. Moreover the $\varepsilon_{1,k}$ and $\varepsilon_{2,k}$ are i.i.d. vectors in $\mathbb{R}^d$ with i.i.d. coordinates of the form $a_d B$, where $B$ is a Bernoulli variable with parameter $p_d$; the amplitude $a_d$ and the probability $p_d$ will be chosen in the sequel.
For each segment $j$ we will again take a short dictionary $D_j$ consisting of one segment of a word from $C_1$ and one segment of a word from $C_2$. Assume again that we want to classify a word $y$ that is distributed like a word from $C_1$ (but independent of all words in all classes). We start with the observation that, for suitably small parameters, the coordinate-by-coordinate comparison will fail.
Proposition 3.3
Assume that , , and that . Then for $c = d$, i.e. the coordinate-by-coordinate comparison,
(2) $\lim_{d \to \infty} \mathbb{P}\left(y \text{ is classified correctly}\right) = \frac{1}{2}.$
Remark 3.4
Observe that the situation described in (2) is a worst case scenario when one has two classes only. Indeed, if the probability on the right were even smaller than $\frac{1}{2}$, one could use the reverse method and decide just the opposite of the proposed classification to get a better result.
Proof: Due to the independence of the random parts of the coordinates of the words, we may assume without loss of generality that all the segments from class one in the dictionaries stem from the same word $x_1$, and all the segments from class two stem from the same word $x_2$. Consider the set of coordinates in which $b_1$ and $b_2$ agree and in which none of $y$, $x_1$, $x_2$ carries a perturbation.
According to our assumptions, this set contains all but a vanishing fraction of the coordinates with high probability, i.e. with probability converging to 1 as $d \to \infty$. But for all coordinates in this set the two compared distances agree, therefore the tie-breaking rule decides with probability one half for either class. As the fluctuations of this random decision are, by the Central Limit Theorem, of order $\sqrt{d}$ and therefore larger than any “signal” one might obtain from the remaining coordinates, the statement follows.
But also the Euclidean distance fails as a classification rule for a wide range of parameters.
Proposition 3.5
If and , we have for the Euclidean distance rule
(3) $\lim_{d \to \infty} \mathbb{P}\left(y \text{ is classified correctly}\right) = \frac{1}{2}.$
Proof: Again suppose that the two words in the dictionary are $x_1$ and $x_2$. Observe that
and
With large probability, the term is of order and the last sum in is of order , which implies that this last term will be negligible with respect to , if we choose and .
Moreover, for both sums as well as obey a Central Limit Theorem with the same parameters (we can omit the condition , since ). Therefore,
which gives the result.
The question remains, of course, whether there is any segmentation method that works in this case. Fortunately, the answer is yes.
Proposition 3.6
If $c$ is chosen suitably, the segmentation rule works; more precisely, $\mathbb{P}\left(y \text{ is classified correctly}\right) \to 1$ as $d \to \infty$.
Proof: With the notation of the previous two proofs, consider, for each $j$, the indicator of the event that the $j$'th block of $y$ is classified correctly. With this choice of $c$ the segments have length $s = d/c$, and therefore the base vectors $b_1$ and $b_2$ of $C_1$ and $C_2$, respectively, have Euclidean distance 1 in each segment. Thus the $j$'th block is classified correctly, if
Since we assume that , we have that for any given and any fixed
as $d \to \infty$. Moreover these events are i.i.d. for different $j$. Therefore with high probability the majority of the blocks will be classified correctly. This proves the proposition.
Altogether we have shown
Theorem 3.7
In the model described above assume that , , , but , then for and we have while for , we have
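The dichotomy of this theorem can be illustrated numerically. The following Monte-Carlo sketch implements a toy version of this subsection's model (two base vectors differing by 1 in one coordinate per segment, plus rare, large Bernoulli perturbations) and compares the whole-vector Euclidean rule with the segmentation rule. All parameter values (d = 400, c = 20, p = 0.01, amplitude 20, 300 trials) are our own illustrative choices, not the paper's asymptotic regime.

```python
import numpy as np

def run_trial(rng, d=400, c=20, p=0.01, amp=20.0):
    """One classification round in a toy sparse-spike model (our parameters)."""
    s = d // c
    b1 = np.zeros(d)                            # base vector of class 1
    b2 = np.zeros(d)
    b2[::s] = 1.0                               # class 2: a 1 in each segment
    spike = lambda: (rng.random(d) < p) * amp   # rare, large perturbations
    x1 = b1 + spike()                           # dictionary word of class 1
    x2 = b2 + spike()                           # dictionary word of class 2
    y = b1 + spike()                            # fresh word, truly from class 1
    # plain Euclidean rule on the whole vector
    euc_ok = np.linalg.norm(y - x1) < np.linalg.norm(y - x2)
    # segmentation rule: one vote per segment, majority decides
    correct_votes = 0
    for j in range(c):
        sl = slice(j * s, (j + 1) * s)
        correct_votes += (np.linalg.norm(y[sl] - x1[sl])
                          < np.linalg.norm(y[sl] - x2[sl]))
    seg_ok = correct_votes > c / 2
    return euc_ok, seg_ok

rng = np.random.default_rng(1)
trials = np.array([run_trial(rng) for _ in range(300)])
euc_acc, seg_acc = trials.mean(axis=0)
```

With these parameters the spikes dominate the whole-vector distances, so the Euclidean rule hovers not far above chance, while most segments are spike-free and vote correctly, so the segmentation rule is almost always right, mirroring the asymptotic statement above.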
Up to now we have just discussed the basic example to illustrate which statistical properties of the classes favor segmentation in the classification process. Let us comment on some variants of the above model. A first natural extension of the model is to consider more than two classes. In the above setting it is obvious that we can build up to $s$ classes, where again the words of class $i$ are of the form $b_i$ plus a perturbation, all the perturbations are i.i.d. random vectors in $\mathbb{R}^d$, and the base vectors $b_i$ are concatenations of strings of length $s$, such that in each of these strings all coordinates but one are 0, the remaining coordinate is 1, and the 1s are placed at different positions for different classes. Then Theorem 3.7 translates to
Theorem 3.8
In the model described above assume that , , , but , then for and we have while for , we have
Proposition 3.9
Assume that , , and that . Then for $c = d$, i.e. the coordinate-by-coordinate comparison,
(4) $\lim_{d \to \infty} \mathbb{P}\left(y \text{ is classified correctly}\right) = \frac{1}{m}.$
The proof can be copied almost literally from the proof of Proposition 3.3. Similarly Proposition 3.5 can be translated to
Proposition 3.10
If and we have for the Euclidean distance rule
(5) $\lim_{d \to \infty} \mathbb{P}\left(y \text{ is classified correctly}\right) = \frac{1}{m}.$
Of course, it is not difficult to check that the proofs given for the case of two classes translate to the case of several classes. On the other hand, it is also evident that correct classification can only become more difficult if more classes are available. The crucial question is thus whether the segmentation rule also gives the right decision with more than two classes. However, checking the proof of Proposition 3.6 it is clear that also for more than two classes we have
Proposition 3.11
If $c$ is chosen suitably, the segmentation rule also works for more than two classes; more precisely, $\mathbb{P}\left(y \text{ is classified correctly}\right) \to 1$
as $d \to \infty$.
So altogether the case of two classes is generic. Therefore we will discuss other variants of the model for this case only.
Another obvious modification of the model at the beginning of this subsection one might discuss is the influence of a larger dictionary. So let us assume now that the dictionaries contain $n > 1$ segments of each class, and again that the words are of the form described above. Again we will check whether Propositions 3.3 to 3.6 remain true. For the proof of Proposition 3.3 we let $y$ be distributed as a word from class one. Define the set in the proof of Proposition 3.3 now as the set of coordinates in which the base vectors agree and such that none of the Bernoullis in the dictionaries is 1. Then again this set is large with high probability, which implies that Proposition 3.3 remains true. However, the behaviour of the Euclidean distance rule (Proposition 3.5) improves if the dictionaries become larger. Indeed $y$ is classified correctly, if there exists a word $x$ of class one in the dictionary such that
The probability that this holds true for a fixed $x$ is asymptotically . So the probability that such an $x$ does not exist is given by , which is smaller than but for $n$ not depending on $d$ still not . However, also the accuracy of the segmentation method with $c$ as in Proposition 3.6 improves, and basically for the same reason: if there is one segment of a word of class one in the $j$'th dictionary such that all its Bernoulli variables in this segment are 0, one classifies this segment correctly. And this probability, of course, increases as $n$ becomes larger.
One might, of course, ask which features of the model discussed in this subsection are decisive for the segmentation method to be favorable. These features are:

The vectors in each class are rare but large perturbations of a base vector. Most of the coordinates of the base vectors of two distinct classes agree.

The perturbations are much rarer than the frequencies of the coordinates in which the base vectors disagree.
However, analyzing the data used in [10], one sees that our models above describe well the behavior of one class, but not that of two classes simultaneously. Indeed, there is some evidence that the coordinates that take large values in a class are also likely to take large values in another, but the variance is larger for those that take large values. To take this into account, we change our original model in the following way:
Moreover $B$ is a vector of i.i.d. Bernoulli random variables with parameter . Finally the and are i.i.d. random variables in with positive, i.i.d. components, such that , for some . Again we will take dictionaries that only contain one segment of each class, and we want to classify a word $y$ that is distributed as a word from $C_1$ correctly. Assume that and . Again we obtain
Theorem 3.12
In the model described above assume that , , , but , then for and we have while for , we have
The proof is only a slight modification of the proof of Theorem 3.7.
4 Experiments
In this section we conduct experiments on real-world data. In our experiments, we use three distinct DNNs. Two of them are related to vision tasks and perform feature extraction from raw input images, namely Inception V3 [11] and SqueezeNet [5]. Both networks have been trained using 1'000 classes from the ImageNet dataset. As far as Inception V3 is concerned, we use the features obtained before the first fully connected layer; they form a vector with 2'048 dimensions. The inputs are images scaled to 299x299 pixels. For the SqueezeNet network, we use the penultimate layer (containing 1'000 dimensions) as our feature extractor. Input images contain 227x227 pixels. The last DNN we use has been trained on AudioSet [12], a dataset that consists of more than 2'000'000 distinct audio tracks extracted from videos on YouTube. The extracted features contain 1'280 dimensions, which are the concatenation of ten 128-dimensional feature vectors, one per second of the corresponding audio track.
We perform tests on four datasets: CIFAR10, two subsets of ImageNet made of 10 classes sampled randomly from those that were not used to train the DNNs, and a subset of 10 classes used to train AudioSet. CIFAR10 is a set of tiny 32x32-pixel images belonging to 10 different classes. This very low resolution results in signals that are quite different from those used to train the DNNs. We then introduce two datasets extracted from ImageNet, named ImageNet1 and ImageNet2. They both contain 10 classes that were not used to train the DNNs. These signals are thus much more similar to those used to train the DNNs than in the case of CIFAR10. Training sets contain 5'000 items per class for CIFAR10 and 1'000 items per class for ImageNet1 and ImageNet2. Testing is performed on 10'000 items for CIFAR10 and about 2'200 for the two ImageNet subsets. For AudioSet, we consider classes that have been used to train the DNN. We have chosen 10 classes with similar numbers of elements in the dataset (radio, cat, hi-hat, helicopter, fireworks, stream, bark, baby/infant cry, snoring, train horn). The cardinality of the training sets ranges from 2'000 to 5'000 elements, and we test on about 600 other elements. We removed elements belonging to more than one of these categories (AudioSet contains multi-label elements).
We use $k$-NN as our nonparametric method to obtain a classification accuracy. Note that in the case of $c$ segments, the decision is taken using $c \cdot k$ votes instead of $k$, since each subspace performs its own $k$-NN. We observe that in all scenarios, the best accuracy corresponds to an intermediate number of segments. The case of AudioSet is interesting, as there is a local maximum in accuracy at 10 segments, which occurs when considering the ten 128-dimensional feature vectors independently, but the global maximum is at 40 segments. Note that the complexity of the method does not depend on $c$, as both the memory and the number of operations boil down to the product of the number of training vectors and their dimension. Table 1 summarizes our results.
Table 1: Accuracy as a function of the number of segments c.

Inception V3, 1-NN
c          1       4       16      64      256
CIFAR10    0.8519  0.8652  0.8781  0.8651  0.8347
ImageNet1  0.9328  0.9354  0.9424  0.9439  0.9081
ImageNet2  0.9438  0.9451  0.9524  0.9464  0.9171

Inception V3, 5-NN
c          1       4       16      64      256
CIFAR10    0.8689  0.8761  0.8759  0.8668  0.8461
ImageNet1  0.9389  0.9450  0.9429  0.9394  0.9202
ImageNet2  0.9467  0.9498  0.9511  0.9488  0.9303

SqueezeNet, 1-NN
c          1       5       20      100     200
CIFAR10    0.6839  0.7069  0.7472  0.6890  0.6225
ImageNet1  0.8854  0.8900  0.9001  0.8784  0.8466
ImageNet2  0.8737  0.8802  0.8926  0.8669  0.8267

SqueezeNet, 5-NN
c          1       5       20      100     200
CIFAR10    0.7284  0.7483  0.7566  0.6954  0.6371
ImageNet1  0.8985  0.8965  0.8980  0.8698  0.8501
ImageNet2  0.8862  0.8901  0.8893  0.8591  0.8280

AudioSet
c          1       2       10      40      160
1-NN       0.605   0.621   0.704   0.724   0.660
5-NN       0.564   0.649   0.704   0.727   0.668
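The decision rule used in these experiments, in which each of the c subspaces contributes k nearest-neighbor votes, can be sketched as follows. This is an illustrative reimplementation with names of our own choosing, not the authors' code.

```python
import numpy as np

def segmented_knn_predict(X_train, y_train, x, c, k):
    """Each of the c segments contributes its k nearest training segments
    as votes, giving c * k votes in total; the majority label wins."""
    d = X_train.shape[1]
    s = d // c
    votes = []
    for j in range(c):
        seg = slice(j * s, (j + 1) * s)
        # distances between the j'th segment of x and all stored j'th segments
        dists = np.linalg.norm(X_train[:, seg] - x[seg], axis=1)
        for idx in np.argsort(dists)[:k]:  # k nearest neighbors in this subspace
            votes.append(y_train[idx])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```

Setting c = 1 recovers plain k-NN on the full feature vectors, which makes it easy to reproduce both ends of the trade-off shown in Table 1.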
5 Conclusion
Transfer learning is a popular way to obtain cutting-edge descriptors that can be exploited to classify new data. When combined with nonparametric methods such as nearest neighbor search, it provides a lightweight incremental solution that is suitable for devices with limited energy or computational capabilities. We have shown that segmenting vectors and performing nearest neighbor search in the obtained subspaces can result in significant improvements in accuracy. Moreover, this change costs nothing in memory usage nor in the number of operations required to fulfill the task. Interestingly, this method can be thought of as an alternative to increasing the number of neighbors considered when taking a decision.
Future work includes considering other downstream classification techniques such as support vector machines and logistic regression.
References
 [1] Sinno Jialin Pan and Qiang Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
 [2] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, “How transferable are features in deep neural networks?,” in Advances in neural information processing systems, 2014, pp. 3320–3328.
 [3] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
 [4] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al., “Going deeper with embedded fpga platform for convolutional neural network,” in Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, 2016, pp. 26–35.
 [5] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
 [6] Robi Polikar, Lalita Udpa, Satish S Udpa, and Vasant Honavar, “Learn++: an incremental learning algorithm for multilayer perceptron networks,” in Proceedings of the IEEE conference on Acoustics, Speech, and Signal Processing, 2000, vol. 6, pp. 3414–3417.
 [7] Robi Polikar, Lalita Upda, Satish S Upda, and Vasant Honavar, “Learn++: An incremental learning algorithm for supervised neural networks,” IEEE transactions on systems, man, and cybernetics, part C (applications and reviews), vol. 31, no. 4, pp. 497–508, 2001.
 [8] Herve Jegou, Matthijs Douze, and Cordelia Schmid, “Product quantization for nearest neighbor search,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–128, 2011.
 [9] Amir Dembo and Ofer Zeitouni, Large deviations techniques and applications, vol. 38 of Stochastic Modelling and Applied Probability, SpringerVerlag, Berlin, 2010.
 [10] Ahmet Iscen, Teddy Furon, Vincent Gripon, Michael Rabbat, and Hervé Jégou, “Memory vectors for similarity search in highdimensional spaces,” IEEE Transactions on Big Data, 2017, In press.
 [11] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” arXiv preprint arXiv:1512.00567, 2015.
 [12] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and humanlabeled dataset for audio events,” in IEEE ICASSP, 2017.