# When Naïve Bayes Nearest Neighbours Meet Convolutional Neural Networks

###### Abstract

Since Convolutional Neural Networks (CNNs) have become the leading learning paradigm in visual recognition, Naïve Bayes Nearest Neighbour (NBNN)-based classifiers have lost momentum in the community. This is because (1) such algorithms cannot use CNN activations as input features; (2) they cannot be used as the final layer of CNN architectures for end-to-end training; and (3) they are generally not scalable and hence cannot handle big data. This paper proposes a framework that addresses all these issues, thus putting NBNNs back on the map. We solve the first by extracting CNN activations from local patches at multiple scale levels, similarly to [1]. We address the second and third simultaneously by proposing a scalable version of Naïve Bayes Non-linear Learning (NBNL, [2]). Results obtained using pre-trained CNNs on standard scene and domain adaptation databases show the strength of our approach, opening a new season for NBNNs.

- IC: Improvement Condition
- RLS: Regularized Least Squares
- HTL: Hypothesis Transfer Learning
- ERM: Empirical Risk Minimization
- RKHS: Reproducing Kernel Hilbert Space
- DA: Domain Adaptation
- LOO: Leave-One-Out
- HP: High Probability
- RSS: Regularized Subset Selection
- FR: Forward Regression
- NBNN: Naïve Bayes Nearest Neighbour
- NBNL: Naïve Bayes Non-linear Learning
- KDE: Kernel Density Estimator
- ML3: Multiclass Latent Locally-Linear
- SVM: Support Vector Machine
- CCCP: Concave-Convex Procedure
- CNN: Convolutional Neural Network
- STOML3: Stochastic Multiclass Latent Locally-Linear
- SMM: Stochastic Majorization-Minimization
- SGD: Stochastic Gradient Descent
- I2C: Image-2-Class
- RELU: Rectified Linear Unit
- LLSVM: Locally-Linear Support Vector Machine
- LCC: Local Coordinate Coding
- OCC: Orthogonal Coordinate Coding

## 1 Introduction

The current easy access to terabytes of visual data, combined with the impressive ability of deep learning algorithms to exploit them, has led to a paradigm shift in visual recognition over the last few years. The so-called shallow architectures, i.e. learning algorithms consisting of 1-3 levels, have survived only when (a) they have been able to scale to very large amounts of data and classes ( i.e. and respectively); (b) they could be used as the final layer of Convolutional Neural Networks (CNNs), allowing for end-to-end learning; and/or (c) they could effectively use the activation layers of pre-computed CNNs [3, 4] as input features. All shallow architectures that do not comply with these requirements have started to fade away.

One of those fading algorithms is the Naïve Bayes Nearest Neighbour (NBNN) classifier [5]. Indeed, the key requisites of NBNN-based approaches do not fit well with CNNs. To begin with, they require local feature representations without any vector quantization, as opposed to the global feature representation derived from the CNN activation layers [3, 4]. Moreover, NBNN-based algorithms rely on the Image-2-Class (I2C) paradigm: for every image, each local descriptor is considered as independently sampled from a class-specific feature distribution. Hence, each descriptor votes for the most probable class, and the collection of votes is used to label each image. CNNs, in contrast, operate on a different classification principle. These two intrinsic features of NBNN-based approaches led to a strong generalization ability, showcased by remarkable results in place classification [2] and domain adaptation [6]. Still, as of today no solution has been found for bridging these two approaches.

This paper fills this gap. We propose a simple way to compute local features from whole images, using pre-trained CNNs. Our starting point is the paper of Gong et al. [1], on which we build to a large extent. We extract CNN activations for local patches at multiple scale levels. As opposed to [1], we do not perform any pooling or concatenation. The resulting features can be used directly as input to any NBNN-based classifier. However, the total number of examples can be very large, especially when sampling patches densely and tackling large scale problems. To deal with this, while at the same time maximizing the predictive power of NBNN-based approaches, we propose a scalable version of Naïve Bayes Non-linear Learning (NBNL, [2]). NBNL tries to circumvent the limitations of NBNN through non-linear learning powered by the Latent Locally-Linear SVM [7], and is, to our knowledge, the current state of the art among NBNN-based classifiers. Our stochastic algorithm retains the generality and robustness of the original method while keeping a low memory footprint. At the same time, it considerably increases scalability during training, making it applicable also to problems with hundreds of classes, where a dense sampling strategy might lead to features or more. Moreover, we show that our smoothed version of NBNL could in principle be used as the final layer for end-to-end training of a CNN. Figure 1 shows the whole framework schematically.

We assess our approach on scene recognition and domain adaptation datasets. These are the two research areas where NBNN-based algorithms showed the most promise in the pre-CNN era. We show that on the Scene 15 [9], UIUC Sports [10], and MIT Indoor [11] datasets we achieve the state of the art among single-feature approaches. To the best of our knowledge, these are the first reported results where an NBNN-based method achieves the state of the art not only among other NBNN-based approaches, but also among traditional techniques. Regarding domain adaptation, experiments on the Office+Caltech256 [12] dataset show that by just using our approach to build a source classifier and then testing it on the target, we achieve remarkable results in the unsupervised setting, and the state of the art in the semi-supervised one. This further underlines the current power and remarkable future potential of our contribution.

## 2 Related Work

NBNN [5] is a learning-free non-parametric image classification scheme. It proved its robustness and generalization ability on many different tasks, from image recognition [5, 13, 14, 15] to domain adaptation [13, 6] to action recognition [16]. A number of works went on to improve the generalization performance of NBNN by adding layers of learning. For example, in [17] the authors included a metric learning procedure, thus altering the metric space of the nearest-neighbour search. A similar idea was also investigated by Tommasi and Caputo [6], demonstrating that a plain NBNN performs very well in the domain adaptation setting, and even better when tuned with metric learning. Another route was pursued by works focused on patch subset selection and weighting [14, 2, 18]. A somewhat orthogonal direction was explored by fusing NBNN with kernel methods, proposing NBNN kernels [13, 19], which could be used in conjunction with linear classifiers and ultimately combined with other kernels over traditional representations. All of these methods were proposed before the advent of modern features induced by CNNs, and were typically evaluated on feature descriptors such as SIFT or SURF, extracted from very small image patches. Since the seminal paper of Donahue et al. [3], the state of the art has been provided by CNN activations. Building on this, Gong et al. [1] proposed a multiscale orderless pooling of CNN features extracted from densely sampled patches. Later, Liu et al. [20] proposed a similar pooling scheme, called cross-convolutional-layer pooling, which focuses on using different convolutional layers together.

In this work, we revisit NBNN considering its power in conjunction with CNN features, in both categorization and domain adaptation scenarios. Many proposed algorithms built on top of NBNNs were thoroughly studied empirically [15]. However, the amount of training data hardly ever exceeded images. This stems from the limitations of the nearest-neighbour search: the need to store all or most of the training data, and the curse of dimensionality that non-parametric algorithms often suffer from. Some variations have been proposed to improve the time and space complexity of NBNNs. McCann and Lowe [21] proposed to build one single search structure for all the classes and to consider only neighbouring descriptors, thus offering an increase in performance. In NBNL [2], the authors retained the idea of patch-based classification as in NBNN, but followed the route of non-linear parametric classification. This allowed them to achieve a compact representation of the classes by learning a set of prototypes, allowing fast testing and improved accuracy. Unfortunately, their method was confined to the batch setting without much improvement in scalability compared to NBNN. In this paper we further develop the idea of NBNL by proposing a scalable stochastic locally-linear formulation, drawing inspiration from [2] and [7].

Many works in machine learning, such as [22], rest on the assumption that, although natural data live in a high-dimensional space, they are embedded in a low-dimensional manifold. Such algorithms try to learn about the manifold under the assumption that, looking close enough, or locally, it appears approximately linear and can thus be captured by a hyperplane. A well-known stream of works on Local Coordinate Coding (LCC) [23, 24, 25] aims to learn the set of hyperplanes and the weights that combine them locally. Often, this is done in an unsupervised way by minimizing the reconstruction error [23, 24, 26]. In these works special attention is given to the local weights of hyperplanes, or codes, which in visual learning problems are used as features. This approach was taken further by the Locally-Linear Support Vector Machine [27], where codes are first found through clustering together with nearest-neighbour search, and then hyperplanes are learned in a single optimization problem. As these methods use a separate unsupervised learning stage, they are unaware of the underlying discriminative task, and their scalability depends on the efficiency of this pre-training. This limitation is countered in the literature on Latent SVM [28] and Multiclass Latent Locally-Linear (ML3) SVM, where both hyperplanes and codes are learned simultaneously through a discriminative learning problem. Despite non-convexity, smart relaxations and optimization methods like the Concave-Convex Procedure (CCCP) enable them to work well in practice. Unfortunately, these are typically batch algorithms with heuristic initialization [29], sometimes guided by in-domain knowledge, such as mining hard negatives [28]. Other works proposed to scale up learning in this setting [30, 31]; however, none of them demonstrated real scalability empirically. In this work we address these limitations by proposing a simple scalable Stochastic Multiclass Latent Locally-Linear SVM, which does not require initialization tricks and easily handles the order of training examples.

## 3 Computing Local CNN Activations

As mentioned before, a key requirement for any NBNN-based framework is to deal with features that capture local information about the image. Concretely, this means extracting from each whole image a set of local patches at multiple scales, and computing feature descriptors from them. Following [1], we create orderless image representations from a pre-trained CNN by extracting deep activation features from patches obtained at increasingly finer scales. The effectiveness of such features depends on several design choices: the pre-trained CNN, the sampling rate for the patches, the patch size, and the CNN activations that are computed. In the following we discuss these points and our own design choices.

Pre-trained CNN. The first hyperparameter to choose is the CNN architecture to be used for computing the activations. The current off-the-shelf state-of-the-art choice for this task on whole images is the Caffe implementation [32], pre-trained on ILSVRC [33]. We decided to follow this route here with respect to the architecture type. As one of our benchmarks is the scene classification problem, we decided to use their network trained on a hybrid dataset composed of Places- [8] and ILSVRC [33]. Note that other architectures like VGG [4] or OverFeat [34] could be used in the same framework. Note also that, for any given CNN architecture within this framework, fine-tuning on a validation set might further improve results.

Patch Extraction. The second set of hyperparameters to tune is the one specifically related to patch extraction, i.e. the sampling rate for the patches, the patch size and the number of scales. Regarding the sampling rate, we considered two patch sampling settings: (a) dense, with around patches per image, and (b) sparse, with approximately patches per image. Since each image has different proportions, the sampling stride was dynamically computed in order to approximately achieve the desired number of patches. Regarding the patch size and number of scales, we set the size of the smallest patch from , and doubled the size with each level. For example, if the size of the smallest patch is px and we consider 3 levels, we will extract patches of size px (level 1), px (level 2) and px (level 3). As level , we considered the whole image. Before extracting the patches, each image is resized to reduce its longest side to 200 pixels.
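A minimal sketch of how a stride could be derived dynamically from the image proportions is given below. It is an illustration under stated assumptions rather than the exact procedure used in the experiments: the function names, the smallest patch size of 32 px, the three levels and the target of 100 patches per level are hypothetical values chosen for the example.

```python
import math

def patch_grid(width, height, patch, target):
    """Return (x, y, size) boxes of side `patch` placed on a regular grid.

    The stride is derived from the free area of the image so that roughly
    `target` patches are produced whatever the aspect ratio.
    """
    if patch > min(width, height):
        return []  # the patch does not fit into the image
    span_x, span_y = width - patch, height - patch
    # Stride that would give about `target` grid positions over the free span.
    stride = max(1, round(math.sqrt((span_x + 1) * (span_y + 1) / target)))
    xs = range(0, span_x + 1, stride)
    ys = range(0, span_y + 1, stride)
    return [(x, y, patch) for y in ys for x in xs]

def multiscale_patches(width, height, smallest=32, levels=3, target=100):
    """Collect patches whose side doubles at every level (assumed values)."""
    boxes = []
    for level in range(levels):
        boxes.extend(patch_grid(width, height, smallest * 2 ** level, target))
    return boxes

# Example: an image already resized so that its longest side is 200 px.
boxes = multiscale_patches(200, 150)
```

Each returned box would then be cropped and passed through the CNN to obtain one local descriptor.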

CNN activations. Finally, we have to choose the fully connected layer of the CNN whose outputs will be used as features. The most popular choice in the literature, adopted also in [8], is to take the output of the seventh fully connected layer, after the rectified linear unit (ReLU) transformation, so that all values are non-negative. In pilot experiments, which can be found in the appendix, we compared this setting with other possibilities, namely taking the output of the sixth layer. We found that also in the NBNN framework the mainstream approach seems to be the most effective.

## 4 Scalable Naïve Bayes Non-linear Learning

In this section we describe our main technical contribution, a novel Stochastic Multiclass Latent Locally-Linear (STOML3) SVM, designed to resolve the scalability issues of NBNN. Applied to the NBNN learning framework, it results in a scalable Naïve Bayes Non-linear Learning technique (sNBNL). First we introduce the necessary background (sections 4.1, 4.2, 4.3), and present our algorithm in Section 4.4.

### 4.1 Definitions

We first introduce the notation and technical definitions used in the rest of the paper. Denote with small and capital bold letters respectively column vectors and matrices, e.g. and . We will use a non-negative truncation function and its vectorial element-wise counterpart . To denote the largest element of the vector, we will use notation . We denote enumeration sets by for .

Denote by and respectively the input and output space of the learning problem. Let the training instance , w.l.o.g., be composed from sub-instances, . Then we denote the training set of size by , drawn from the probability distribution over . We will focus on the -class classification problem so , and, w.l.o.g., . To measure the accuracy of a learning algorithm, we have a non-negative convex loss function , which measures the cost incurred predicting instead of . Finally we will denote a one nearest neighbor function w.r.t. the support set by . Alternatively, for neighbor matrices we will use the notation .

### 4.2 Naïve Bayes Nearest Neighbor Classification

The idea behind NBNNs [5] is to treat each image as a collection of uniformly or randomly sampled patches. Let be the set containing visual descriptors of patches in the test image, let be random variables taking values in the space of these descriptors, and let be taking values in the label set. Denoting by the unknown conditional probability density function, the NBNN predictor is,

(1) |

The key statistical assumption made in NBNN is that patches are conditionally independent given the class. In addition, assuming that is uniform and switching to log-likelihood of , we have that,

(2) |

Since is unknown, NBNN resorts to the non-parametric Kernel Density Estimator (KDE) [35] with Gaussian kernel function, and further lower-bounds the log-likelihood by Jensen’s inequality, to make the predictor computationally efficient. In this form prediction involves nearest neighbor search, which can be very efficient when the intrinsic dimension of the data is small [36]. Denoting the support of the class by , the approximated empirical NBNN predictor is then,

(3) |
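To make the nearest-neighbour form of the predictor concrete, the sketch below implements the standard NBNN decision rule of [5] in assumed notation: with $x_1, \dots, x_m$ the patch descriptors of the test image and $S_y$ the support of class $y$, it returns $\arg\min_y \sum_i \| x_i - \mathrm{NN}_{S_y}(x_i) \|^2$. The function name, the brute-force search and the toy data are assumptions made for illustration; a practical implementation would use an approximate nearest-neighbour index.

```python
import numpy as np

def nbnn_predict(patches, supports):
    """Label one image from its patch descriptors.

    patches  : (m, d) array of local descriptors of the test image.
    supports : dict mapping class label -> (n_c, d) array of training descriptors.
    """
    totals = {}
    for label, support in supports.items():
        # Squared Euclidean distance from every patch to every support point.
        d2 = ((patches[:, None, :] - support[None, :, :]) ** 2).sum(axis=2)
        # Image-to-class distance: sum of the nearest-neighbour distances.
        totals[label] = d2.min(axis=1).sum()
    return min(totals, key=totals.get)

# Toy example with two classes living around 0 and around 5.
supports = {"a": np.zeros((4, 2)), "b": np.full((4, 2), 5.0)}
label = nbnn_predict(np.array([[4.9, 5.1], [5.2, 4.8]]), supports)
```

Note that no image-level training takes place: all the cost is in storing the supports and searching them at test time, which is the scalability limit discussed in the text.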

### 4.3 Naïve Bayes Non-Linear Learning

As NBNN is a nearest-neighbor-based approach, it shares its well-known scalability limits. Few works have explored the potential of NBNN-like schemes surpassing the order of training examples. Here we review the recently proposed Naïve Bayes Non-linear Learning (NBNL) [2] that scales NBNN through parametric learning. It will be the starting point for our scalable algorithm.

Let be the collection of -sized supports of NBNN in matrix notation. Following [2], we will refer to the columns of any support as prototypes. We will also assume that all prototypes have bounded norm, that is . NBNL rests upon the observation that NBNN minimizes,

(4) |

The right hand side can be minimized over , similarly as in (3), which yields the NBNL predictor

(5) |
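One way to see why a nearest-neighbour rule can be traded for a rule based on linear scores is the identity $\|x - w\|^2 = \|x\|^2 + \|w\|^2 - 2 w^\top x$: when all prototypes $w$ share the same norm, the closest prototype is also the one with the largest inner product. The snippet below checks this on hypothetical unit-norm prototypes. It illustrates the general identity only and is not a reconstruction of (4) or (5).

```python
import numpy as np

def nearest_prototype(x, prototypes):
    """Index of the prototype (row) closest to x in Euclidean distance."""
    return int(np.argmin(((prototypes - x) ** 2).sum(axis=1)))

def best_scoring_prototype(x, prototypes):
    """Index of the prototype (row) with the largest inner product with x."""
    return int(np.argmax(prototypes @ x))

# Hypothetical unit-norm prototypes of one class and one patch descriptor.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
x = np.array([0.5, 0.9])
same = nearest_prototype(x, prototypes) == best_scoring_prototype(x, prototypes)
```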

The key idea is that prototypes in such a predictor need not be fixed, but can be learned. Fornoni and Caputo [2] proposed to learn prototypes through the regularized empirical risk minimization. Considering , the problem would be to minimize the following over ,

(6) |

However, in [2], they ultimately proposed to solve a simpler relaxed problem,

(7) |

Problem (7) is generally addressed by the family of latent [28] and locally-linear SVMs [27, 7]. In particular, [2] employed a non-linear ML3 Support Vector Machine (SVM) [7], which we briefly review next.

##### Multiclass Latent Locally-Linear (ML3) SVM.

In ML3 SVM one aims to solve a problem similar to (7). ML3 SVM is a locally-linear parametric classification algorithm, where we assume that in a given small locality the optimal decision boundary is approximately linear [23, 27, 37, 30]. Usually, in locally-linear versions of SVM, we consider score functions , where is a function specifying local combination of hyperplanes at a particular point of the input space. Typically one has to choose before solving the main optimization problem [28, 23, 27]. This amounts to the separate procedure dedicated just to learn and fix weights . ML3 SVM addresses this by the score function with automatic weighting,

(8) |

for any or . Given a point , this rule leads to the combination of hyperplanes, such that the margin of a combined linear classifier is maximized on .
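A minimal sketch of a score with automatic weighting follows, assuming the form described for ML3 in [7]: the local weights $\beta$ are non-negative with $\|\beta\|_p \le 1$, and the score is $\max_\beta \beta^\top W_y^\top x$. By Hölder duality this maximum has the closed form $\| [W_y^\top x]_+ \|_q$ with $1/p + 1/q = 1$. The function names and the toy matrix are hypothetical, and the exact parametrisation of (8) may differ.

```python
import numpy as np

def ml3_score(W, x, p=2.0):
    """Closed form of max over beta >= 0, ||beta||_p <= 1 of beta^T (W^T x)."""
    z = np.maximum(W.T @ x, 0.0)      # non-negative truncation of the hyperplane outputs
    if p == 1.0:
        return z.max()                # the dual norm of l1 is the largest element
    q = p / (p - 1.0)
    return (z ** q).sum() ** (1.0 / q)

def ml3_weights(W, x, p=2.0):
    """Local weights beta that attain the maximum (defined here for p > 1)."""
    z = np.maximum(W.T @ x, 0.0)
    q = p / (p - 1.0)
    norm = (z ** q).sum() ** (1.0 / p)
    return z ** (q - 1.0) / norm if norm > 0 else np.zeros_like(z)

# Toy class model with d = 2 input dimensions and m = 4 hyperplanes (columns).
W = np.array([[1.0, 2.0, -1.0, -3.0],
              [0.5, -1.0, 3.0, 1.0]])
x = np.array([2.0, 1.0])
score, beta = ml3_score(W, x), ml3_weights(W, x)
```

Hyperplanes with a negative response receive zero weight, so the combination changes from one region of the input space to another.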

The objective function of ML3 SVM is non-convex, however, by posing it as a difference of convex functions, we can find a reasonably good solution by Concave-Convex Procedure (CCCP) [38]. This essentially confines the algorithm to the batch setting, because we need to solve a separate convex optimization problem at every CCCP iteration. Besides its batch nature, ML3 heavily relies on heuristic weight initialization by first solving a linear SVM problem.

### 4.4 Stochastic ML3 SVM

In this section we address the limitations of ML3 by introducing a novel scalable stochastic formulation that is conceptually similar to that of ML3. Namely, we propose a Stochastic Multiclass Latent Locally-Linear (STOML3) SVM which can run online, is free from any initialization tricks, and enjoys stationary point convergence guarantees. This stochastic formulation allows NBNL to be used at scales out of reach for ML3 SVM and NBNN. We call this new version the scalable NBNL (sNBNL).

Rather than solving a regularized empirical risk as in (7), in the following we will aim at minimizing a regularized risk directly, similarly as in the popular Stochastic Gradient Descent (SGD) approach to learning. More formally, our goal is to solve,

(9) |

where we chose a differentiable multiclass logistic loss function (softmax loss),

(10) |

In practice we cannot solve (9) directly, since is unknown, thus the gradient cannot be computed. However, we can still compute an unbiased estimate of the gradient given a point , and thus update the solution iteratively. Like the batch formulation of ML3 SVM, the resulting objective function is non-convex. We approach (9) through the Stochastic Majorization-Minimization framework [39], which, unlike SGD, provides a stationary point convergence guarantee for our problem and converges faster in practice [40, 41]. We summarize the Stochastic Multiclass Latent Locally-Linear (STOML3) SVM in pseudocode, and defer its technical derivation details to the following section. The computational complexity of STOML3 SVM at every stochastic update is in .
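The sketch below shows what one stochastic update on a softmax loss over locally-linear scores could look like. It is not the STOML3 update: for brevity it takes a plain SGD step rather than the Stochastic Majorization-Minimization step, it assumes scores of the form $\| [W_c^\top x]_+ \|_2$, and the function names, learning rate and regularization strength are hypothetical.

```python
import numpy as np

def scores(W, x):
    """Locally-linear class scores; W has shape (classes, d, m)."""
    z = np.maximum(np.einsum("cdm,d->cm", W, x), 0.0)
    return np.linalg.norm(z, axis=1), z

def softmax_loss(W, x, y):
    """Multiclass logistic loss: log-sum-exp of the scores minus the true score."""
    s, _ = scores(W, x)
    s = s - s.max()                       # stabilise the exponentials
    return np.log(np.exp(s).sum()) - s[y]

def sgd_step(W, x, y, lr=0.1, lam=1e-3):
    """One stochastic step on softmax loss + (lam / 2) * ||W||^2."""
    s, z = scores(W, x)
    prob = np.exp(s - s.max())
    prob /= prob.sum()
    prob[y] -= 1.0                        # derivative of the loss w.r.t. each score
    # d score_c / d W_c = x (z_c / ||z_c||)^T wherever the score is non-zero.
    unit = np.divide(z, s[:, None], out=np.zeros_like(z), where=s[:, None] > 0)
    grad = prob[:, None, None] * x[None, :, None] * unit[:, None, :]
    return W - lr * (grad + lam * W)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4, 2))            # 3 classes, 4 dimensions, 2 hyperplanes each
x, y = rng.normal(size=4), 1
W_new = sgd_step(W, x, y, lr=0.01, lam=0.0)
```

With an all-zero `W` every score and every gradient is zero, so this sketch needs a random starting point; each update touches every hyperplane of every class once.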

##### Connection to Neural Network Learning.

Latent locally-linear classification, ML3 SVM, and STOML3 SVM can be interpreted as a variant of a shallow artificial neural network, Figure 2.

The main difference from traditional models, such as multilayer perceptrons, is that the hidden layer consists of linear units (), whereas the weights of the output layer, are adjusted automatically depending on the outputs of hidden layer , thus for learned , is a function of input . Specifically, these weights are adjusted to maximize the margin by combining the outputs of the hidden units. Clearly, for different regions of the input space the resulting combinations are different, yielding a non-linear decision surface.

From the artificial neural network learning point of view, it would be interesting to consider deeper architectures of STOML3. Another possibility would be to combine it with convolutional layers to investigate end-to-end locally-linear classification. We leave these directions to future work.

#### 4.4.1 Derivation

To derive STOML3 SVM we use the Stochastic Majorization-Minimization (SMM) framework proposed by Mairal [39]. SMM deals with the minimization of a differentiable function that has the form of an expectation, by minimizing a simpler approximate convex upper bound. Specifically, after we sample a training example, we minimize an upper bound on the term inside the expectation with the realization fixed. In our case, the objective is (9), and thus for a realization we need to specify a convex upper bound of a regularized loss function,

(11) |

More formally, in SMM such a convex upper-bound is called the surrogate function of an objective, defined as:

###### Strongly Convex First-Order Surrogate Functions [39].

Fix , and let be a strongly convex function such that and . Let be differentiable and the gradient be -Lipschitz continuous. We will call the first order surrogate function of .

We can choose among many different surrogates, but we have to keep in mind that it should be easily minimized with every incoming training example. That said, we choose,

(12) |

where , is the regularizer. Notably, we can solve analytically. It is also not hard to see that is a strongly convex first-order surrogate function. Given optimal , the rest of the derivation follows the optimization template of Mairal [39], which we summarize in our pseudocode.
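As an illustration of what an analytically solvable surrogate can look like, consider the Lipschitz-gradient surrogate, a standard choice in the framework of [39]. The notation ($f$ for the loss on the sampled example, $\kappa$ for the current iterate, $L$ for the Lipschitz constant of $\nabla f$, $\lambda$ for the regularization strength) is assumed here, and whether (12) takes exactly this form cannot be recovered from the text.

```latex
% Quadratic upper bound of f around kappa, plus an l2 regularizer (assumed form).
g_\kappa(w) = f(\kappa) + \nabla f(\kappa)^\top (w - \kappa)
            + \frac{L}{2}\,\|w - \kappa\|^2 + \frac{\lambda}{2}\,\|w\|^2

% Setting the gradient of g_kappa to zero gives the closed-form minimizer:
\nabla f(\kappa) + L\,(w - \kappa) + \lambda w = 0
\quad\Longrightarrow\quad
w^\star = \frac{L\,\kappa - \nabla f(\kappa)}{L + \lambda}
```

For $\lambda = 0$ this reduces to a gradient step of size $1/L$ from $\kappa$, which makes the link with SGD explicit.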

## 5 Experiments

In this section we test our framework experimentally. We considered two tasks, scene recognition and domain adaptation, where NBNN methods showed promise in the past. Our experiments aim to verify two claims: first, that such methods coupled with local CNN activations at multiple scales are able to achieve results competitive with, or even better than, end-to-end, fine-tuned CNN architectures; second, that scalable NBNL outperforms NBNN, thus paving the way for the use of our approach in large scale scenarios that have so far been prohibitive for NBNN methods.

In the rest of the section we describe the datasets and experimental settings used, and the variants of our framework that were tested (Section 5.1). Section 5.2 describes the results obtained in scene recognition, exploring how the performance changes when varying the parameters relative to the patch extraction, and the scalability of the approach. Section 5.3 reports results obtained in the domain adaptation setting.

### 5.1 Experimental Settings

Datasets. For the scene recognition setting, we used the Scene 15 [9], UIUC Sports [10], and MIT Indoor [11] databases. For Scene 15, we used images per class for training and for testing. For UIUC Sports, we used images per class for training and images for testing. For MIT Indoor, we used images per class for training and for testing. These choices are all consistent with the standard protocols reported in the literature. Each configuration was tested on 5 splits. For the large scale experiments, we used the SUN- database [42] that totals 1.6 million image patches. We strictly followed the experimental procedure described in [42]. For all scene experiments, we concatenated the CNN activations with the absolute position of every patch. For the domain adaptation scenario, we considered the Office + Caltech database [12], which contains a subset of ten classes shared between Office and Caltech256 [43]. Here we keep images per class for training ( if the target is either Webcam or DSLR) and use the rest as the test set. Each configuration was tested on splits.
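Concatenating the patch position with the activations amounts to appending two extra dimensions to every local descriptor. The sketch below is a minimal illustration: the function name is hypothetical, and the use of the top-left corner in raw pixel units is an assumption, since the text does not specify the reference point or any rescaling of the coordinates.

```python
import numpy as np

def add_positions(activations, boxes):
    """Append each patch's absolute (x, y) position to its activation vector.

    activations : (n, d) array with one CNN descriptor per patch.
    boxes       : list of n (x, y, size) tuples, top-left corner in pixels (assumed).
    """
    xy = np.array([(x, y) for x, y, _ in boxes], dtype=activations.dtype)
    return np.hstack([activations, xy])

# Two hypothetical 4-dimensional descriptors and the patches they came from.
features = add_positions(np.ones((2, 4)), [(0, 0, 32), (14, 28, 32)])
```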

Baselines. In every scenario and every setting, we used the following three variants of our framework: (1) CNN-NBNN: the NBNN classifier as originally proposed [5], combined with the local CNN activations. (2) CNN-NBNL: the same as (1), using NBNL as the classifier [2]. (3) CNN-sNBNL: the same as (1) and (2), but using our scalable version of NBNL.

### 5.2 Scene Classification Experiments

We performed extensive experiments over Scene 15, UIUC Sports and MIT Indoor to assess how performance changes when varying the parameters relative to the extraction of the CNN activations. Specifically, we varied the sampling density, the patch size and the number of levels. We also compared results when taking the activations before or after ReLU. As the classifier, we always used NBNN (preliminary experiments using also NBNL and sNBNL did not show any significant variation in behaviour). Figure 3 reports a representative set of our findings; the full set can be found in the appendix. We see that larger patch sizes generally yield better performance, but combining patches taken at different scales further improves accuracy. For example, using only px patches gives a worse accuracy than using px and px patches. This indicates that distinct scales hold complementary information. Dense sampling does not appear to improve the accuracy significantly.

Overall, using px, px, and px patches together seems to be the best and most stable configuration. The stability of the results breaks down when we supply smaller patches of px. We speculate that at this patch size there is not enough visual information for the CNN to provide a meaningful representation. Finally, we note that CNN features extracted before ReLU generally perform better. On the basis of these results, in the rest of the paper we always use px, px, and px patches simultaneously, with no ReLU and sparse sampling.

We then proceeded to evaluate our framework when using NBNL or sNBNL. The goal of these experiments was to confirm the ability of sNBNL to obtain the same results as NBNL at a lower computational cost, and to compare the results obtained by CNN-sNBNL with the state of the art. Table 2 shows the results obtained using NBNL and sNBNL on the three databases, in terms of accuracy and training time. We see that the two algorithms achieve basically the same results, as confirmed by a sign-test (). With respect to the training time, instead, the differences are remarkable, with sNBNL achieving on average a speed up of times compared to NBNL. This is a first experimental confirmation of the scalability of our approach.
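For reference, a two-sided exact sign test on paired results can be computed as sketched below. The function name is hypothetical and the counts in the example are made up for illustration: `wins` and `losses` would be the number of paired comparisons (for instance, per split) in which one method beats the other, with ties discarded beforehand.

```python
from math import comb

def sign_test_pvalue(wins, losses):
    """Two-sided exact sign test; ties must be dropped before calling."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: method A better on 8 splits, method B on 7.
p_value = sign_test_pvalue(8, 7)
```

A large p-value means the test finds no evidence that one method is systematically more accurate than the other.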

Table 1 compares our results with previous work. We see that we achieve consistently the best accuracies among the single cue methods. This is impressive for an approach that uses an off-the-shelf pre-trained CNN, without any fine tuning. Moreover, on the Scene 15 database, our performance surpasses also that of multi-cue approaches.

Method | Scene 15 | UIUC Sports | MIT Indoor |
---|---|---|---|
NBNN (Surf) [2] | | | |
NBNL (Surf) [2] | | | |
CNN-NBNN | | | |
Lin. SVM (CNN) | | | |
CNN-NBNL | | | |
CNN-sNBNL | | | |
Hybrid CNN [8] | | | |
LScSPM [44] | | | |
MOP-CNN [1] | | | |
DDSFL + CAFFE [45] | | | |
ISPR + IFV [46] | | | |
CNN Fusion [47] | | | |

We conclude this section by probing the potential of our framework on a larger scale experiment. We ran experiments on the SUN- [42] dataset. Note that this dataset is out of reach for NBNN, and prohibitive also for NBNL. We trained a GPU-optimized implementation of STOML3 SVM in minibatches of examples on the splits originally proposed in [42]. As in the previous scene recognition experiments, we concatenated the absolute patch positions with the feature vector. We performed data standardization and set the regularization parameter to (note that even better results can be obtained by tuning it). CNN-sNBNL achieves a performance of , which surpasses recently reported results by Zhou et al. [8] of and . These last results were obtained by training a linear SVM on Hybrid and Places- CNN features respectively.

Overall, we conclude that the results reported in this section clearly showcase the power of our framework in the scene recognition setting.

### 5.3 Domain Adaptation Experiments

We report here experiments performed on the Office+Caltech database, in both the unsupervised and semi-supervised scenarios. Note that none of the three instantiations of our framework is a domain adaptation algorithm, hence we simply train each of them on the source data and test the obtained classifier on the target. Concretely, this means that in the unsupervised setting we simply train NBNN/NBNL/sNBNL on the source; for the semi-supervised setting, we add three target images to the source and proceed as for the unsupervised case. A similar experiment was first presented in [6], showing that the generalisation properties of NBNNs were enough to partially address the DA problem. As features, we use the same configuration employed in the scene recognition experiments, i.e. patches of size px, px and px without ReLU. We performed experiments with sparse sampling.
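The two protocols can be sketched as a data split. The function name and the data layout are hypothetical, and two details are assumptions not stated in the text: the labelled target images are drawn per class, and the remaining target images form the test set.

```python
import random

def split_for_adaptation(source, target, n_labelled_target=0, seed=0):
    """Build (train, test) lists from source and target (image_id, label) pairs.

    n_labelled_target = 0 gives the unsupervised setting; a positive value
    moves that many target images per class (assumed) into the training set.
    """
    rng = random.Random(seed)
    shuffled = target[:]
    rng.shuffle(shuffled)
    picked, per_class = [], {}
    for item in shuffled:
        label = item[1]
        if per_class.get(label, 0) < n_labelled_target:
            per_class[label] = per_class.get(label, 0) + 1
            picked.append(item)
    chosen = set(picked)
    test = [item for item in target if item not in chosen]
    return source + picked, test

# Toy data: 4 source images and 10 target images over 2 classes.
source = [(f"s{i}", i % 2) for i in range(4)]
target = [(f"t{i}", i % 2) for i in range(10)]
train, test = split_for_adaptation(source, target, n_labelled_target=3)
```

In both settings the classifier itself is unchanged; only the composition of the training set differs.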

Table 3 reports the results obtained in the unsupervised setting, while Table 4 reports those obtained in the semi-supervised setting. We see that, in the unsupervised setting, our approach is powerful enough to outperform several important learning-based baselines, in spite of its simplicity. Performance in the semi-supervised setting is even more striking, as we achieve the state of the art in all settings. We stress that this is accomplished by methods that are not designed for the domain adaptation scenario. Note that we could not run DA-NBNN, the only existing NBNN-based domain adaptation method, on our local multi-scale CNN activations because of its severe computational limitations. These results further confirm the power of the proposed framework, and its great potential for future work.

## 6 Conclusions

This paper provides a recipe for using CNN activation features combined with NBNN-based classifiers. The two key ingredients are: (1) extraction of CNN activations from local patches at different scales, and (2) a scalable NBNN-based algorithm that exploits the learning power of locally linear SVMs. We present an instantiation of this framework using a pre-trained Caffe architecture, applied to the scene classification and domain adaptation problems. Results are very strong: on scene classification, we achieve the state of the art among single cue methods on three widely used benchmark databases. On domain adaptation, the simple use of the framework on the source only leads to extremely promising results on the target, competitive with a significant fraction of learning methods proposed so far. Future work will further explore the framework in an end-to-end learning setting, and within a domain adaptation algorithm.

## References

- [1] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision (ECCV), 2014.
- [2] M. Fornoni and B. Caputo. Scene recognition with naive bayes non-linear learning. In Pattern Recognition (ICPR), International Conference on, 2014.
- [3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
- [4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
- [5] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2008.
- [6] T. Tommasi and B. Caputo. Frustratingly easy NBNN domain adaptation. In Computer Vision (ICCV), IEEE International Conference on, 2013.
- [7] M. Fornoni, B. Caputo, and F. Orabona. Multiclass latent locally linear support vector machines. In Asian Conference on Machine Learning (ACML), 2013.
- [8] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NIPS), 2014.
- [9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2006.
- [10] L.-J. Li and L. Fei-Fei. What, where and who? classifying events by scene and object recognition. In Computer Vision (ICCV), IEEE International Conference on, 2007.
- [11] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2009.
- [12] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2012.
- [13] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell. The NBNN kernel. In Computer Vision (ICCV), IEEE International Conference on, 2011.
- [14] R. Timofte and L. Van Gool. Iterative nearest neighbors. Pattern Recognition, 48(1):60–72, 2015.
- [15] R. Timofte, T. Tuytelaars, and L. Van Gool. Naive bayes image classification: beyond nearest neighbors. In Asian Conference on Computer Vision (ACCV), 2013.
- [16] X. Yang and Y. Tian. Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. In Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE Conference on, 2012.
- [17] Z. Wang, Y. Hu, and L.-T. Chia. Image-to-class distance metric learning for image classification. In European Conference on Computer Vision (ECCV), 2010.
- [18] P. Wohlhart, M. Kostinger, M. Donoser, P. M. Roth, and H. Bischof. Optimizing 1-nearest prototype classifiers. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2013.
- [19] K. Rematas, M. Fritz, and T. Tuytelaars. The pooled NBNN kernel: Beyond image-to-class and image-to-image. In Asian Conference on Computer Vision (ACCV), 2013.
- [20] L. Liu, C. Shen, and A. van den Hengel. The treasure beneath convolutional layers: cross convolutional layer pooling for image classification. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2015.
- [21] S. McCann and D. G. Lowe. Local naive bayes nearest neighbor for image classification. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2012.
- [22] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
- [23] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in neural information processing systems (NIPS), 2009.
- [24] K. Yu and T. Zhang. Improved local coordinate coding using local tangents. In International Conference on Machine Learning (ICML), 2010.
- [25] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2010.
- [26] Z. Zhang, L. Ladicky, P. Torr, and A. Saffari. Learning anchor planes for classification. In Advances in Neural Information Processing Systems (NIPS), 2011.
- [27] L. Ladicky and P. Torr. Locally linear support vector machines. In International Conference on Machine Learning (ICML), 2011.
- [28] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2008.
- [29] R. Girshick and J. Malik. Training deformable part models with decorrelated features. In Computer Vision (ICCV), IEEE International Conference on, 2013.
- [30] A. Kantchelian, M. C. Tschantz, L. Huang, P. L. Bartlett, A. D. Joseph, and J.D. Tygar. Large-margin convex polytope machine. In Advances in Neural Information Processing Systems (NIPS), 2014.
- [31] H. Oiwa and R. Fujimaki. Partition-wise linear models. In Advances in Neural Information Processing Systems (NIPS), 2014.
- [32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, 2014.
- [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, pages 1–42, 2015.
- [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR), 2014.
- [35] T. Hastie, R. Tibshirani, and J. Friedman. The Elements Of Statistical Learning. Springer, 2009.
- [36] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. In G. Shakhnarovich, T. Darrell, and P. Indyk, editors, Nearest-neighbor methods for learning and vision: theory and practice, pages 15–59. MIT Press, 2006.
- [37] C. Jose, P. Goyal, P. Aggrwal, and M. Varma. Local deep kernel learning for efficient non-linear SVM prediction. In International Conference on Machine Learning (ICML), 2013.
- [38] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural computation, 15(4):915–936, 2003.
- [39] J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.
- [40] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
- [41] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
- [42] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2010.
- [43] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, Caltech, 2007.
- [44] S. Gao, Ivor Wai-Hung Tsang, and L. Chia. Laplacian sparse coding, hypergraph laplacian sparse coding, and applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):92–104, 2013.
- [45] Z. Zuo, G. Wang, B. Shuai, L. Zhao, and Q. Yang. Exemplar based deep discriminative and shareable feature learning for scene image classification. Pattern Recognition, 48(10):3004–3015, 2015.
- [46] D. Lin, C. Lu, R. Liao, and J. Jia. Learning important spatial pooling regions for scene classification. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2014.
- [47] M. Koskela and J. Laaksonen. Convolutional network features for scene recognition. In ACM International Conference on Multimedia, 2014.
- [48] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), 2013.
- [49] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
- [50] N. Patricia and B. Caputo. Learning to learn, from transfer learning to domain adaptation: A unifying perspective. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2014.
- [51] H. V. Nguyen. Non-Linear and Sparse Representations for Multi-Modal Recognition. PhD thesis, University of Maryland, 2013.
- [52] S. Shekhar, V. M. Patel, H. Nguyen, and R. Chellappa. Generalized domain-adaptive dictionaries. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2013.
- [53] L. Bo, X. Ren, and D. Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In Advances in neural information processing systems (NIPS), 2011.

## Appendix A Supplementary Experiments

Here we provide additional experimental results for both scene recognition and domain adaptation.

### A.1 Scene Recognition Experiments

Figure 4 contains, from top-left proceeding clockwise, results for NBNN on:

### A.2 Domain Adaptation Experiments

The tables contain our full NBNN results on the Office+Caltech setting [12], in both the unsupervised and semi-supervised scenarios.