# Towards Effective Codebookless Model for Image Classification

## Abstract

The bag-of-features (BoF) model for image classification has been thoroughly studied over the last decade. Different from the widely used BoF methods which modeled images with a pre-trained codebook, the alternative codebook-free image modeling method, which we call Codebookless Model (CLM), attracted little attention. In this paper, we present an effective CLM that represents an image with a single Gaussian for classification. By embedding Gaussian manifold into a vector space, we show that the simple incorporation of our CLM into a linear classifier achieves very competitive accuracy compared with state-of-the-art BoF methods (e.g., Fisher Vector). Since our CLM lies in a high-dimensional Riemannian manifold, we further propose a joint learning method of low-rank transformation with support vector machine (SVM) classifier on the Gaussian manifold, in order to reduce computational and storage cost. To study and alleviate the side effect of background clutter on our CLM, we also present a simple yet effective partial background removal method based on saliency detection. Experiments are extensively conducted on eight widely used databases to demonstrate the effectiveness and efficiency of our CLM method.

## 1 Introduction

Image classification has attracted considerable attention in the computer vision and pattern recognition communities in recent years. It is one of the most fundamental yet challenging vision problems because images, as illustrated in Fig. 1, often suffer from significant scale, view, or illumination variations (e.g., in texture classification [8] and material recognition [22]), as well as pose changes, background clutter, and partial occlusion (e.g., in scene categorization [30, 31] and object recognition [17, 18, 21, 47]).

For a long time the bag-of-features (BoF) model [40] has been the dominant approach to image classification. As shown in Fig. 2 (a), BoF-based methods generally consist of five components: extracting local features, learning a codebook from training data, coding local features with the pre-trained codebook, pooling or aggregating codes over images, and finally learning a classifier (e.g., SVM) for classification. With this processing pipeline, BoF-based methods can be seen as a hand-crafted five-layer hierarchical feed-forward network [43] with a pre-trained feature coding template (codebook) [7]. The learned codebook depicts the distribution of the feature space and makes coding of high-dimensional features possible. This architecture has achieved very promising performance in a variety of image classification tasks.

The codebook, as a reference for feature coding, serves as a bridge between local features and the global image representation. However, it is well known that the partitioning of feature space involved in building a codebook introduces quantization error [6], and considerable effort has gone into mitigating this side effect (e.g., soft coding methods [39, 45] alleviate but cannot completely eliminate it). Although it is done offline, training a codebook, particularly a large one, is time consuming. In addition, a codebook pre-trained on one database generally cannot adapt naturally to other databases [52].

An alternative approach is to estimate the statistics directly on sets of local features from input images [10, 35, 44], as illustrated in Fig. 2 (b); we call this the codebookless model (CLM) in this paper. It is clear from Fig. 2 that the major difference is that the BoF model learns a codebook to explore the statistical distribution of local features and then performs coding of descriptors, while the CLM represents images with descriptors directly, requiring neither a pre-trained codebook nor the subsequent coding. Conceptually, the codebookless model has the potential to circumvent the aforementioned limitations of the BoF model; however, it has received little attention in the image classification community. The main reasons may be that such methods have not yet shown competitive classification performance, and that they often need inefficient and unscalable kernel-based classifiers.

In this paper, we propose an effective CLM scheme and argue that the CLM can be a competitive alternative to BoF methods for image classification. A comparison between a state-of-the-art BoF method, Fisher Vector (FV) [39], and our CLM on various image databases is shown in Fig. 1. First, we extract a set of local features (e.g., SIFT [34]) on a dense grid over the image and model them with a single Gaussian to represent the input image. Then, we employ a two-step metric for matching Gaussian models. With this metric, Gaussian models can be fed to a linear classifier, ensuring efficient and scalable classification while respecting the Riemannian geometry of Gaussian models. Moreover, we introduce two well-motivated parameters into the metric: one balances the effect of the mean and the covariance of the Gaussian, and the other performs eigenvalue power normalization on the covariance.

Our codebookless model is usually of high dimension. By incorporating low-rank learning with SVM, we propose a joint learning method to effectively compress Gaussian models while respecting their Riemannian geometry. To the best of our knowledge, this is the first attempt to perform joint learning of a low-rank transformation and SVM on the Gaussian manifold. Finally, to alleviate the side effect of background clutter, a saliency-based partial background removal method is proposed to enhance our CLM. The experimental results show that partial background removal is helpful to CLM when images are heavily cluttered (e.g., CUB200-2011 and Pascal VOC2007).

## 2 Related work

The codebookless model for directly modeling the statistics of local features has been studied over the past decades. Rubner et al. [38] introduced signatures for image representation and proposed the Earth Mover's Distance for image matching, which is robust but has high computational cost. Tuzel et al. [44] were the first to use covariance matrices for representing regular image regions, and employed the affine-Riemannian metric, which suffers from high computational cost [36]. The Gaussian model as an image descriptor has been used for visual tracking [19], where Gaussian models are matched based on the Riemannian metric, involving expensive operations to solve a generalized eigenvalue problem. Going beyond a single Gaussian, the Gaussian mixture model (GMM) is more informative and is used in image retrieval [3]. However, GMM suffers from some limitations, such as the high computational cost of matching methods and the lack of general criteria for model selection.

Our work is motivated by [9, 10] and [35]. Carreira et al. [9, 10] modeled the free-form regions obtained by image segmentation by estimating their second-order moments. By using the Log-Euclidean metric [2], the method in [9, 10] can be combined with a linear classifier, and it has shown competitive recognition performance on images with less background clutter (e.g., Caltech101 [18]). Different from [9, 10], we employ a Gaussian model to represent the whole image. It is well known that a covariance matrix can be seen as a Gaussian model with a fixed mean vector. Compared to [9, 10], our CLM contains both first-order (mean) and second-order (covariance) information. Note that first-order statistics have proven important in image classification [25, 39]. Moreover, the manifold of Gaussian models and that of covariance matrices are quite different, and the embedding method in our CLM allows Gaussian models to be handled flexibly and conveniently.

Nakayama et al. [35] also represented an image with a global Gaussian for scene categorization. However, they matched two Gaussian models by using the Kullback-Leibler (KL) divergence, and hence kernel-based classifiers have to be used. This method is not scalable and has high computational cost. In contrast to [35], our metric is decoupled, which allows it to be combined with a linear classifier and makes our method more efficient and scalable than the KL-kernel-based one in [35]. Moreover, compared with the ad-hoc linear kernel (Euclidean baseline) in [35], our method takes advantage of the geometric structure of Gaussian models and brings a large performance improvement.

There is another line of research on codebookless model methods. Grauman et al. [20] proposed a pyramid match kernel to map feature sets to multi-resolution histograms, and employed a histogram intersection kernel for classification. Bo et al. [5] presented efficient match kernels to map local features into a low-dimensional space, and adopted a linear classifier. Boiman et al. [6] developed an image-to-class distance between sets of local features, and employed a nearest neighbor classifier. Yao et al. [50] proposed a codebook-free approach that uses a large number of randomly generated image templates for image representation, and developed a bagging-based classifier.

## 3 Proposed method

We first introduce the image representation by a single Gaussian model. Then, we employ an effective and efficient two-step metric for matching Gaussian models, and propose two well-motivated parameters to improve the used distance metric. Finally, we present a joint learning method of low-rank transformation and SVM on Gaussian manifold.

### 3.1 Gaussian model for image representation

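Given an input image, we extract a set of local features $\mathbf{x} \in \mathbb{R}^{k}$ on a dense grid. By the maximum likelihood method, the image can be represented by the following Gaussian model:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{\sqrt{(2\pi)^{k}\,|\boldsymbol{\Sigma}|}},$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the mean vector and covariance matrix, and $|\cdot|$ denotes the matrix determinant. Compared with a histogram or a covariance matrix, the Gaussian model is more informative. Meanwhile, unlike matching of signatures [38] or GMMs [3], matching of Gaussian models does not incur high computational cost.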
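As a minimal sketch of this step, the snippet below estimates the mean and covariance of a set of local descriptors with NumPy. The function name `fit_gaussian`, the random stand-in descriptors, and the use of the $1/N$ (maximum likelihood) normalization are assumptions made for illustration; the paper's own implementation is in Matlab and is not reproduced here.

```python
import numpy as np

def fit_gaussian(features):
    """Maximum-likelihood mean and covariance of an (N, k) array of local features."""
    x = np.asarray(features, dtype=float)
    mu = x.mean(axis=0)
    centered = x - mu
    # The ML estimate divides by N (not N - 1).
    sigma = centered.T @ centered / x.shape[0]
    return mu, sigma

# Stand-in for dense local descriptors from one image (hypothetical data).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 8))
mu, sigma = fit_gaussian(descriptors)
```

The whole image is then summarized by the pair `(mu, sigma)`, regardless of how many descriptors were extracted.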

### 3.2 Two-step metric between Gaussian models

To match Gaussian models, we exploit a two-step metric which has been proposed to compute the ground distance between Gaussian components of GMMs [32]. The first step embeds the Gaussian manifold into the space of SPD matrices [33]; the second step maps the Lie group of SPD matrices into its corresponding Lie algebra, a linear space, by using the Log-Euclidean metric [2].

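The space of $k$-dimensional Gaussian models is a Riemannian manifold. Let $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ be a Gaussian model with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Through a continuous function $\pi$, $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is mapped to an affine matrix $\mathbf{A}$, an element of the affine group; that is,

$$\pi:\ \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \mapsto \mathbf{A} = \begin{bmatrix} \mathbf{L} & \boldsymbol{\mu} \\ \mathbf{0}^{T} & 1 \end{bmatrix}, \tag{1}$$

where $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{T}$ is the Cholesky factorization of $\boldsymbol{\Sigma}$. Further, through the function $\gamma: \mathbf{A} \mapsto \mathbf{S} = \mathbf{A}\mathbf{A}^{T}$, $\mathbf{A}$ is mapped to an SPD matrix $\mathbf{S}$. So far, by the successive functions $\pi$ and $\gamma$, $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is uniquely designated as an SPD matrix

$$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim \mathbf{S} = \begin{bmatrix} \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^{T} & \boldsymbol{\mu} \\ \boldsymbol{\mu}^{T} & 1 \end{bmatrix}. \tag{2}$$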

Please refer to [33] for details on the embedding process.
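The two maps described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `affine_matrix` and `embed_gaussian` are hypothetical, and the example mean and covariance are made-up values.

```python
import numpy as np

def affine_matrix(mu, sigma):
    """Map a Gaussian (mu, sigma) to the affine matrix [[L, mu], [0, 1]]."""
    k = mu.shape[0]
    chol = np.linalg.cholesky(sigma)  # lower triangular, sigma = chol @ chol.T
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = chol
    A[:k, k] = mu
    A[k, k] = 1.0
    return A

def embed_gaussian(mu, sigma):
    """Map a Gaussian to the SPD matrix A @ A.T of size (k+1) x (k+1)."""
    A = affine_matrix(mu, sigma)
    return A @ A.T

mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
S = embed_gaussian(mu, sigma)
```

Multiplying out `A @ A.T` gives the block form of Eq. (2): the top-left block is the covariance plus the outer product of the mean, the last column holds the mean, and the bottom-right entry is 1.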

The space of SPD matrices, denoted here by $\mathrm{Sym}^{+}_{k+1}$, is a Lie group that forms a Riemannian manifold. Two operations, namely the logarithmic multiplication and the scalar logarithmic multiplication, are defined in the Log-Euclidean metric [2], which equip $\mathrm{Sym}^{+}_{k+1}$ with the structure of not only a Lie group but also a vector space. Through the matrix logarithm, $\mathrm{Sym}^{+}_{k+1}$ is mapped into its Lie algebra $\mathrm{Sym}_{k+1}$, the vector space of symmetric matrices. The matrix logarithm is a diffeomorphism and an isomorphism, so operations over SPD matrices can be replaced by Euclidean operations on their counterparts in the vector space. Thus, through the matrix logarithm, an SPD matrix is mapped one-to-one to a symmetric matrix which lies in a linear space, and the geodesic distance between SPD matrices $\mathbf{S}_a$ and $\mathbf{S}_b$ is defined by $\|\log \mathbf{S}_a - \log \mathbf{S}_b\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.
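The Log-Euclidean distance is straightforward to compute once the matrix logarithm is available. The sketch below computes the logarithm of an SPD matrix through its eigendecomposition, which is a standard approach for symmetric matrices; the function names are hypothetical and the example matrices are chosen so the result is easy to check by hand.

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(S)
    # U * log(w) scales the columns of U, so this is U @ diag(log w) @ U.T.
    return (U * np.log(w)) @ U.T

def log_euclidean_distance(Sa, Sb):
    """Frobenius norm of the difference of the matrix logarithms."""
    return np.linalg.norm(spd_log(Sa) - spd_log(Sb), ord="fro")

Sa = np.diag([1.0, np.e])
Sb = np.eye(2)
d = log_euclidean_distance(Sa, Sb)  # log(Sa) = diag(0, 1), log(Sb) = 0
```

Because each logarithm depends on one matrix only, it can be computed once per image and stored, which is what makes the metric compatible with a linear classifier.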

### 3.3 Two well-motivated parameters

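In practice, we found that it is important to balance the mean vector and the covariance matrix in the embedding matrix (2), because their dimensions and the order of magnitude of each dimension may vary considerably. Meanwhile, the relative importance of the mean vector and the covariance matrix may vary across tasks. With these considerations, we introduce a parameter $\beta$ into the function (1):

$$\pi(\beta):\ \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \mapsto \mathbf{A}(\beta) = \begin{bmatrix} \mathbf{L} & \beta\boldsymbol{\mu} \\ \mathbf{0}^{T} & 1 \end{bmatrix}. \tag{3}$$

Accordingly, the embedding matrix has the following form:

$$\mathbf{S}(\beta) = \begin{bmatrix} \boldsymbol{\Sigma} + \beta^{2}\boldsymbol{\mu}\boldsymbol{\mu}^{T} & \beta\boldsymbol{\mu} \\ \beta\boldsymbol{\mu}^{T} & 1 \end{bmatrix}. \tag{4}$$

The embedding matrix (4) reduces to the covariance matrix when $\beta = 0$, and is equal to the original one when $\beta = 1$. Hence, the roles of the mean vector and the covariance matrix can be adjusted by $\beta$.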
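The block form of the weighted embedding can be checked by multiplying the weighted affine matrix by its transpose. The derivation below is a worked sketch; the symbols $\mathbf{L}$ (Cholesky factor of $\boldsymbol{\Sigma}$) and $\beta$ (balance parameter) are naming assumptions where the surrounding text does not spell them out.

```latex
\mathbf{A}(\beta)\,\mathbf{A}(\beta)^{T}
= \begin{bmatrix} \mathbf{L} & \beta\boldsymbol{\mu} \\ \mathbf{0}^{T} & 1 \end{bmatrix}
  \begin{bmatrix} \mathbf{L}^{T} & \mathbf{0} \\ \beta\boldsymbol{\mu}^{T} & 1 \end{bmatrix}
= \begin{bmatrix}
    \mathbf{L}\mathbf{L}^{T} + \beta^{2}\boldsymbol{\mu}\boldsymbol{\mu}^{T} & \beta\boldsymbol{\mu} \\
    \beta\boldsymbol{\mu}^{T} & 1
  \end{bmatrix}
= \begin{bmatrix}
    \boldsymbol{\Sigma} + \beta^{2}\boldsymbol{\mu}\boldsymbol{\mu}^{T} & \beta\boldsymbol{\mu} \\
    \beta\boldsymbol{\mu}^{T} & 1
  \end{bmatrix}.
```

Setting $\beta = 0$ leaves the block-diagonal matrix with blocks $\boldsymbol{\Sigma}$ and $1$, so only the covariance carries information; setting $\beta = 1$ recovers the unweighted embedding.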

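The maximum likelihood estimator of the empirical covariance matrix is susceptible to noise, especially in high-dimensional spaces [15]. Based on the observation that the maximum likelihood estimator of covariance ought to be improvable by eigenvalue shrinkage [42], we exploit power normalization on the eigenvalues of the covariance matrix (EPN). Let $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ be a Gaussian model estimated from a set of descriptors extracted from some image. The covariance matrix has the eigenvalue decomposition $\boldsymbol{\Sigma} = \mathbf{U}\,\mathrm{diag}(\lambda_i)\,\mathbf{U}^{T}$, where $\mathbf{U}$ is an orthonormal matrix whose $i$-th column is an eigenvector of $\boldsymbol{\Sigma}$, $\lambda_i$ is the corresponding eigenvalue, and $\mathrm{diag}(\cdot)$ denotes a diagonal matrix. Then, by introducing a parameter $\rho$, our normalization is defined as

$$\boldsymbol{\Sigma}^{\rho} = \mathbf{U}\,\mathrm{diag}(\lambda_i^{\rho})\,\mathbf{U}^{T}. \tag{5}$$

With EPN, our final embedding matrix is:

$$\mathbf{S}(\beta, \rho) = \begin{bmatrix} \boldsymbol{\Sigma}^{\rho} + \beta^{2}\boldsymbol{\mu}\boldsymbol{\mu}^{T} & \beta\boldsymbol{\mu} \\ \beta\boldsymbol{\mu}^{T} & 1 \end{bmatrix}. \tag{6}$$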

It is easy to prove that the embedding matrix (6) is still an SPD matrix. Eigenvalue power normalization has previously been proposed to measure distances between covariance matrices [16, 24] or tensors [29], where it is known as the Power-Euclidean metric. Different from previous work, we use eigenvalue power normalization for robust estimation of covariance matrices in the Gaussian setting for high-dimensional features, and compare Gaussians by using the Gaussian embedding and the Log-Euclidean metric.
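The sketch below combines eigenvalue power normalization with the mean-weighted embedding. Positive definiteness can be seen from the Schur complement of the bottom-right entry: subtracting the outer product of the last column from the top-left block leaves exactly the power-normalized covariance, which is positive definite whenever the covariance is. The names `power_normalize`, `embed_gaussian`, `beta`, and `rho` and the example values are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def power_normalize(sigma, rho):
    """Raise the eigenvalues of a symmetric PSD matrix to the power rho."""
    w, U = np.linalg.eigh(sigma)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off values
    return (U * w**rho) @ U.T

def embed_gaussian(mu, sigma, beta=1.0, rho=1.0):
    """Embedding with mean weight beta and eigenvalue power rho."""
    k = mu.shape[0]
    S = np.empty((k + 1, k + 1))
    S[:k, :k] = power_normalize(sigma, rho) + beta**2 * np.outer(mu, mu)
    S[:k, k] = beta * mu
    S[k, :k] = beta * mu
    S[k, k] = 1.0
    return S

mu = np.array([1.0, 1.0])
sigma = np.diag([4.0, 9.0])
S_cov_only = embed_gaussian(mu, sigma, beta=0.0, rho=0.5)  # mean is ignored
S_balanced = embed_gaussian(mu, sigma, beta=0.5, rho=0.5)
```

With `beta=0.0` and `rho=0.5` the example yields a diagonal matrix with entries 2, 3, and 1, since the square roots of 4 and 9 replace the original eigenvalues.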

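According to the Log-Euclidean framework, the matrix $\mathbf{S}(\beta, \rho)$ can be further embedded into a linear space by the matrix logarithm:

$$\mathbf{G}(\beta, \rho) = \log \mathbf{S}(\beta, \rho). \tag{7}$$

Let $\mathcal{N}_a$ and $\mathcal{N}_b$ be two Gaussian models whose corresponding symmetric matrices are $\mathbf{G}_a$ and $\mathbf{G}_b$. The distance between the two Gaussian models is

$$d(\mathcal{N}_a, \mathcal{N}_b) = \|\mathbf{G}_a - \mathbf{G}_b\|_F. \tag{8}$$

The distance (8) is decoupled, so $\mathbf{G}_a$ and $\mathbf{G}_b$ can be computed separately and used in a linear classifier. For notational simplicity, we omit the parameters $\beta$ and $\rho$ in the distance measure (8).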

### 3.4 Joint low-rank learning and SVM classifier

Our CLM is usually of high dimension (). In order to suppress redundant and noisy information while reducing computational and storage cost, we propose a low-rank learning method to compact our CLM. The matrix $\mathbf{G}$ in the geodesic distance (8) is a symmetric matrix which lies in a Euclidean space. Due to its symmetry, we can unfold the upper triangular part of $\mathbf{G}$ into a vector $\mathbf{g}$ of size $(k+1)(k+2)/2$. We can modify the geodesic distance (8) by introducing a low-rank transformation matrix $\mathbf{P}$:

$$d_{\mathbf{P}}(\mathcal{N}_a, \mathcal{N}_b) = \|\mathbf{P}^{T}(\mathbf{g}_a - \mathbf{g}_b)\|_2, \tag{9}$$

where $\mathbf{g}_a$ and $\mathbf{g}_b$ are the unfolding vectors of two Gaussian models $\mathcal{N}_a$ and $\mathcal{N}_b$, respectively.
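A minimal sketch of the unfolding and the projected distance follows. Two choices here are assumptions rather than details stated in the text: off-diagonal entries are scaled by the square root of 2 so that the Euclidean norm of the unfolded vector equals the Frobenius norm of the matrix, and the projection matrix is taken from a plain PCA basis (the initialization mentioned later in this section) rather than from the jointly learned solution. All function names are hypothetical.

```python
import numpy as np

def unfold_symmetric(G):
    """Stack the upper triangle of a symmetric matrix into a vector.

    Off-diagonal entries are scaled by sqrt(2) so that the Euclidean norm
    of the result equals the Frobenius norm of G.
    """
    rows, cols = np.triu_indices(G.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * G[rows, cols]

def pca_basis(X, r):
    """Top-r principal directions (as columns) of the row-wise samples in X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r].T

def projected_distance(P, ga, gb):
    """Euclidean distance between two unfolded vectors after projection by P."""
    return np.linalg.norm(P.T @ (ga - gb))

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
G = (M + M.T) / 2.0               # stand-in for a log-embedded Gaussian
g = unfold_symmetric(G)           # length 4 * 5 / 2 = 10
X = rng.normal(size=(20, g.size)) # stand-in training vectors
P = pca_basis(X, 3)
```

Since the columns of `P` are orthonormal, the projected distance never exceeds the full Euclidean distance between the unfolded vectors.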

Recent studies [26, 49] have shown that joint optimization of dimensionality reduction and the classifier performs better than separate optimization of the two modules. Thus, given training samples , we optimize the low-rank learning jointly with a linear SVM (LRSVM):

(10) | |||

where are parameters of SVM, and is the label of . The dimensionality reduction for SPD matrices [23] has been studied with dimensionality reduction and classification separately performed, while our method is quite different in that we focus on Gaussian models and perform joint learning of low-rank transformation and SVM.

In practice, we extend the objective function (10) to multi-class problem under the spatial pyramid matching (SPM) framework [30]. Given an image , we can obtain its SPM representation , where is the number of blocks in SPM, which is fed to a one vs. all SVM for solving the classes problem. As suggested in [26], we optimize the dual problem of the objective function (10) under the SPM framework:

(11) | |||

where indicates all training features, and is the diagonal label matrix of the th class with diagonal element .

The problem (11) is non-convex and can be optimized by a two-step alternating method: *Step One*, fixing the low-rank transformation, we optimize the Lagrange parameters with an off-the-shelf SVM; *Step Two*, for fixed Lagrange parameters, we solve the following trace maximization problem:

(12) | ||||

We optimize the problem (12) by independently solving each sub-problem with a closed-form solution [26]. Because the problem (11) is non-convex, initialization matters both for reaching a good local optimum and for fast convergence. In this paper, we use the basis of principal component analysis (PCA) as initialization, and we find that it consistently achieves good performance and fast convergence.

## 4 Partial background removal (PBR)

We next present a simple yet effective method for analyzing and handling the side effect of background clutter, based on unsupervised, bottom-up saliency detection. Our purpose here is to remove the interference of background, which differs from the goal of precise foreground localization pursued in the saliency detection community. Our method consists of two steps: coarse foreground detection and partial background removal. In the first step, we localize the foreground in the image with the saliency detection method of [27] and then determine the bounding box surrounding the foreground. Next, we adaptively expand the bounding box to accommodate some background regions, based on the size and intensity variance of the area inside the bounding box. The area outside the bounding box is then removed for recognition. Our method is motivated by two considerations: accurate foreground detection is currently very difficult, and the regions neighboring an object can serve as context and may be helpful for recognition. In our experiments, we apply PBR to the two datasets with heavy background clutter: CUB200-2011 and VOC2007. Since PBR is designed for foreground objects with separable background clutter, we do not perform PBR on images with less background clutter or on scene images, where both foreground and background are valuable for scene understanding.
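The two steps can be sketched as follows, assuming a saliency map is already available as a 2-D array with values in [0, 1]. The saliency detector of [27] is not reproduced, and the fixed `threshold` and fixed relative `margin` are hypothetical stand-ins for the adaptive expansion rule (based on size and intensity variance), whose exact form is not given here.

```python
import numpy as np

def foreground_box(saliency, threshold=0.5):
    """Bounding box (top, bottom, left, right) of pixels above the threshold."""
    mask = saliency >= threshold
    h, w = saliency.shape
    if not mask.any():
        return 0, h, 0, w  # nothing salient: keep the whole image
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

def expand_box(box, shape, margin=0.2):
    """Grow the box by a fraction of its size on each side, clipped to the image."""
    top, bottom, left, right = box
    dh = int(round(margin * (bottom - top)))
    dw = int(round(margin * (right - left)))
    h, w = shape
    return max(0, top - dh), min(h, bottom + dh), max(0, left - dw), min(w, right + dw)

def partial_background_removal(image, saliency, threshold=0.5, margin=0.2):
    """Crop the image to the expanded foreground box."""
    box = expand_box(foreground_box(saliency, threshold), saliency.shape, margin)
    top, bottom, left, right = box
    return image[top:bottom, left:right]

saliency = np.zeros((20, 20))
saliency[5:15, 8:12] = 1.0          # synthetic salient region
image = np.arange(400.0).reshape(20, 20)
cropped = partial_background_removal(image, saliency)
```

Keeping a margin around the detected foreground retains some surrounding context, in line with the motivation that neighboring regions may help recognition.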

## 5 Implementation details

We extract multi-scale SIFT descriptors [34] (standard pipeline in the BoF model) with cell size , , and single scale pixel-wise covariance descriptor [44] via the dense sampling strategy with step-length 2. The dense covariance descriptors are computed with 17 dimensional raw features including intensity and four kinds of first-order and second-order gradients from [37]. We perform matrix logarithm on the covariance descriptors (LogCov), which are then vectorized. The SIFT features are calculated via the VLFeat library [46]. Moreover, following [9, 10], we also extract additional image cues, including color, location, scale, gradient and entropy to concatenate SIFT and LogCov. In order to ensure that there is sufficient data to estimate Gaussian models and covariance matrices are positive definite, we limit the minimum size of width or height of images to be larger than 64, and add to the diagonal entries of covariance matrices, respectively. We employ the spatial pyramid strategy [30] which divides an image into some regular regions (e.g., , , , ). For each region we compute a Gaussian model, and then concatenate them to represent the whole image. Each Gaussian is weighted by , where and are the number of pyramid levels and regions in the layer, respectively. We implement a one-vs-all SVM with LibSVM [11] and set parameter to on VOC2007 and on all the other databases. All algorithms are written in Matlab, and run on a PC equipped with i7-4770k CPU and 32G RAM.

## 6 Experimental evaluation

In this section, we evaluate the classification performance of our CLM on eight benchmark databases. First of all, we make an analysis of local features, the parameters of our method, the proposed low-rank learning method and the partial background removal method on the challenging CUB200-2011 [47]. Then, we compare with state-of-the-art methods on Caltech101 [18], Caltech256 [21], KTH-TIPS2b [8], Flickr Material Database (FMD) [22], Pascal VOC2007 [17], Scene15 [30] and Sports8 [31]. Finally, we analyze the computational complexity of our CLM.

### 6.1 Parameters analysis

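Column groups: Local descriptors (ST, eST, LC, eLC), Parameters (Beta, EPN), BR (PBR, GT).

| | ST | eST | LC | eLC | Beta | EPN | PBR | GT | Acc. |
|---|---|---|---|---|---|---|---|---|---|
| Cov. | | | | | | | | | |
| Gau. | | | | | | | | | |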

*Local descriptors.* Four kinds of local descriptors are evaluated in this section: SIFT (ST) and its enrichment (eST), and LogCov (LC) and its enrichment (eLC). The results of our CLM with various local descriptors on CUB200-2011 are shown in Table 1. We can see that the Gaussian model used in our method outperforms the covariance matrix by or higher with either SIFT or eSIFT, which we believe indicates that the first-order (mean) information is non-trivial. We use eST to evaluate the other parameters below.

*Two well-motivated parameters.* The proposed EPN (5) is a generic method for robust estimation of covariance in high-dimensional spaces. We set the parameter in EPN (5) as in all databases. From Table 1, we can see that EPN brings a performance gain over the corresponding method without EPN. The embedding parameter $\beta$ in (6) balances the effect of the mean vector and the covariance matrix. To test its effect, we determine the optimal value of $\beta$ via cross validation. The performance of our CLM with various $\beta$ is illustrated in Fig. 3 (left). Compared to $\beta = 0$ (covariance matrix only [9, 10]) and $\beta = 1$ (the embedding in [33]), appropriate balancing at achieves and gains, respectively.

*LRSVM.* To evaluate the proposed LRSVM method, we compare LRSVM with unsupervised principal component analysis (PCA) and supervised partial least squares (PLS) [1] under different compression ratios. The LRSVM is initialized by PCA, and the results on CUB200-2011 are illustrated in Fig. 3 (right). We can see that LRSVM always performs better than PLS, and is superior to PCA by a large margin. Different from PLS, which exploits the least squares loss, LRSVM uses the hinge loss. We argue that the improvement is due to the joint learning of dimensionality reduction and classifier. Note that, with a larger compression ratio, LRSVM achieves a larger improvement over PCA and PLS. Meanwhile, the proposed LRSVM has an insignificant performance loss (less than ) with a large compression ratio (). We can also see that LRSVM slightly improves the performance of our CLM when compression ratios are smaller (), which we attribute to LRSVM suppressing some noisy information. In general, we set the compression ratio as to balance efficiency and effectiveness.

*Impact of PBR.* We apply PBR to CUB200-2011 and the results are presented in Table 1. We can see that the method using PBR achieves large gains (more than ) over the one without PBR. Note that we achieve about gain on VOC2007 by using PBR. This shows that our PBR is a general method to handle background for CLM. The gains achieved by using the ground truth (GT) bounding box indicate that more advanced background removal methods could further improve the recognition performance of our CLM. Compared with the improvement on CUB200-2011, the gains on VOC2007 are relatively small. The main reasons are that saliency-based methods fail to locate the foreground precisely in such challenging databases, and that CUB200-2011 contains only one object per image while an image in VOC2007 may contain multiple objects. PBR cannot segment an image into multiple objects, so multi-object images heavily influence the performance of CLM.

Database | Classes | Images in total | Training/Test | Measurement | Scale | View | Illumination | Pose | Bg Clutter | Occlusion |
---|---|---|---|---|---|---|---|---|---|---|
CUB200-2011 [47] | 200 | 11,788 | Split in [47] | Acc. of split | | | | | | |
Caltech101 [18] | 102 | 9,144 | 30/remaining per class | Acc. of 5 runs | | | | | | |
Caltech256 [21] | 256 | 30,607 | 30/remaining per class | Acc. of 5 runs | | | | | | |
Sports8 [31] | 8 | 1,792 | 70/60 per class | Acc. of 5 runs | | | | | | |
KTH-TIPS2b [8] | 11 | 4,752 | [13] | Acc. of splits | | | | | | |
FMD [22] | 10 | 1,000 | 50/50 per class | Acc. of 5 runs | | | | | | |
VOC2007 [17] | 20 | 9,963 | Split in [17] | mAP of split | | | | | | |
Scene15 [30] | 15 | 4,485 | 100/remaining per class | Acc. of 5 runs | | | | | | |

### 6.2 Comparison with state-of-the-art methods

We compare our CLM with more than ten state-of-the-art methods on eight widely used benchmarks. The descriptions and experimental setup on these benchmarks are listed in Table 2. We report the results in Table 3, and discuss the experimental results as follows.

*Comparison of various local descriptors.* We combine our CLM with four kinds of local descriptors, and assess them on all databases. From Table 3 we can see that SIFT and LogCov achieve comparable results. For object recognition, LogCov is superior to SIFT on CUB200-2011 and VOC2007 while SIFT outperforms LogCov on Caltech101 and Caltech256. On scene categorization, SIFT and LogCov obtain similar performance on both Sports8 and Scene15. For texture and material classification, SIFT achieves gains over LogCov on KTH-TIPS2b while LogCov is superior to SIFT by a large margin on FMD. The eSIFT and eLogCov descriptors follow the same pattern as SIFT and LogCov, respectively. The enrichment of SIFT and LogCov can considerably boost the performance of our CLM, which encourages us to utilize more informative descriptors for further improvement.

*Comparison with counterparts.* Here, we compare our CLM with its counterparts, O2P [10], Global Gaussian (GG) [35] and NBNN [6]. As shown in Tables 1 and 3, our CLM significantly outperforms O2P [10] on CUB200-2011 and Caltech101, and is also superior to its variant with sparse quantization (SQ-O2P) [7] on Caltech101 and VOC2007 by a large margin, which is mainly due to the appropriate use of mean information and EPN. Moreover, our CLM performs much better than the GG methods [35] with the ad-hoc linear kernel (ad-linear), the center tangent linear kernel (ct-linear) and the KL divergence on Sports8 and Scene15. The ad-linear kernel can be seen as a baseline in Euclidean space. It is worth noting that the methods in [35] exploit probabilistic discriminant analysis (PDA) as a classifier. If SVM is used, their results will drop to , and on Sports8, and , and on Scene15, respectively. We attribute the gains of our CLM over [35] to the use of the two-step metric with the proposed well-motivated parameters. We also compare our CLM with NBNN [6]. Our CLM performs much better than NBNN on Caltech101 and Caltech256. The main differences between our CLM and NBNN are that our CLM employs an effective model-to-model distance and an SVM classifier.

*Comparison with FV.* We make a comprehensive comparison with one state-of-the-art BoF method, FV [39], across all databases, and also apply enriched SIFT (eSIFT) to FV. On all databases except FMD, our CLM achieves better or comparable performance relative to FV when SIFT or eSIFT is used. On FMD, with SIFT or eSIFT, our CLM is inferior to FV, but with LogCov or eLogCov, our CLM is much better than FV. In our experiments, we find that LogCov and eLogCov are not very suitable for FV, so the corresponding results are not reported. We find that our CLM is more sensitive to local descriptors than FV, as eSIFT brings little or no gain to FV while our CLM greatly benefits from the enrichment of SIFT or LogCov.

*Comparison with other state-of-the-art methods.* Some recent results are also presented for comparison. On Caltech101, DeCAF [14] with a 6-layer CNN and the dropout strategy [41] slightly outperforms our CLM. Without dropout, the result of DeCAF drops to . On Caltech256, our CLM outperforms the deep architecture Multipath Hierarchical Matching Pursuit (M-HMP) [4] by . Cimpoi et al. [13] achieved state-of-the-art results on KTH-TIPS2b and FMD with semantic attributes, which are trained on an additional database, by combining FV [39] and DeCAF [14]. Our CLM is superior to the method with attributes, FV and DeCAF. By combining attribute features, FV and DeCAF, Cimpoi et al. [13] obtained and accuracy on KTH-TIPS2b and FMD. Kobayashi [28] proposed a histogram transformation method, and it achieves state-of-the-art results on Sports8 and VOC2007.

*Summary.* In this paper, we assess our CLM on eight image benchmarks, as shown in Table 2, which contain various transformations and noise factors. We claim that (1) the results on Caltech101 and Caltech256 show that our CLM can deal well with location and pose variations of objects; (2) the results on FMD and KTH-TIPS2b show that our CLM is robust to scale, viewpoint, illumination and appearance variation; (3) the results on Sports8 and Scene15 indicate that our CLM can classify scene images well in the presence of some background clutter; and (4) the results on CUB200-2011 and VOC2007 demonstrate that our CLM can also handle images with complex surroundings, such as heavy background clutter and occlusion.

### 6.3 Computational complexity analysis

Our CLM for classification mainly consists of three components: extracting local descriptors, computing Gaussian models using Eq.(4) followed by EPN (5) and the matrix logarithm in Eq.(8), and learning LRSVM for classification. Most of the computational cost of CLM lies in the eigenvalue decompositions required by EPN and the matrix logarithm. Their computational complexities are and , respectively, where is the dimension of local descriptors. During joint training of the low-rank matrix and the SVM classifier, optimizing the objective function (11) consists of alternating between an SVM minimization problem and a trace maximization problem, whose complexity is , where is the number of training samples of dimension , and is the number of iterations which is less than in our experiments.

Here, we give empirical running times, taking KTH-TIPS2b and Caltech101 as examples. Computing the image representations, which includes extracting SIFT at multiple scales and computing the Gaussian models and embedding matrices, takes 30 minutes on KTH-TIPS2b and 1.5 hours on Caltech101. Modeling one image takes on average about 0.4 seconds and 0.6 seconds on the respective databases. For each trial, training (resp. test) of LRSVM takes 20s (resp. 2s) and 7min (resp. 40s) on KTH-TIPS2b and Caltech101, respectively.

## 7 Discussion and conclusion

The bag-of-features (BoF) model is a popular method in classification and recognition, and has demonstrated convincing performance in many computer vision tasks over the past years. It might seem that codebook training and descriptor coding are indispensable ingredients. However, the codebookless model (CLM) proposed in this work has proven to be an effective alternative to BoF methods for image classification. Below we discuss why CLM shows such competitive performance.

Different from the BoF methods, our CLM leverages continuous functions for statistical modeling of local descriptors, which does not need a codebook and thus introduces no quantization error. Recent research [12] showed that high dimensionality can bring impressive performance. The state-of-the-art BoF methods such as SV/VLAD or FV have inherently high dimensionality, which, in our opinion, is the key for characterizing the distinctness and discriminativeness of individual images as well as image categories. Our CLM directly employs the first- and second-order statistics of high-dimensional local descriptors, giving rise to informative image-level models of high dimensionality as well. In this respect, it is worthwhile to study more informative or higher-dimensional CLMs. Moreover, as shown in [9, 10], the CLM is more efficient than the BoF methods for modeling images because codebook learning and coding are not necessary. In addition, the CLM may be more suitable for tasks where the datasets are regularly updated or enlarged, in which case the codebook of a BoF model has to be regularly adjusted to fit the changing data.

The contributions of this paper are summarized as follows. (1) Our work has clearly shown that the CLM is a very competitive alternative to the mainstream BoF model. We hope our work can raise interest in the classification (or retrieval) community and pave the way for future research. (2) Our method enables Gaussian models to be successfully combined with a linear SVM classifier, which makes our method scalable and efficient. The key is that we embed Gaussian models into a vector space, which also allows us to perform joint low-rank learning and SVM on the Gaussian manifold. Meanwhile, the two proposed well-motivated parameters further improve our CLM. (3) We performed extensive experiments, evaluating various aspects of our CLM and comparing it with its counterparts as well as state-of-the-art methods. The comprehensive experiments demonstrated the promising performance of our CLM.

### References

- J. Arenas-García, K. B. Petersen, and L. K. Hansen. Sparse kernel orthonormalized PLS for feature extraction in large data sets. In NIPS, 2006.
- V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Fast and simple calculus on tensors in the Log-Euclidean framework. In MICCAI, 2005.
- C. Beecks, A. M. Zimmer, S. Kirchhoff, and T. Seidl. Modeling image similarity by Gaussian mixture models and the signature quadratic form distance. In ICCV, 2011.
- L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
- L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
- O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
- X. Boix, G. Roig, S. Diether, and L. V. Gool. Self-adaptable templates for feature coding. In NIPS, 2014.
- B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. In ICCV, 2005.
- J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic Segmentation with Second-Order Pooling. In ECCV, 2012.
- J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Free-Form Region Description with Second-Order Pooling. TPAMI, PP:1, 2014.
- C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27, 2011.
- D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, 2013.
- M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
- D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. arXiv, 1311.0851, 2014.
- I. L. Dryden, A. Koloydenko, and D. Zhou. Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Annals of Applied Statistics, 2009.
- M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 88(2):303–338, 2010.
- L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006.
- L. Gong, T. Wang, and F. Liu. Shape of Gaussians as feature descriptors. In CVPR, 2009.
- K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005.
- G. Griffin, A. Holub, and P. Perona. The Caltech-256. Technical report, California Institute of Technology, 2007.
- L. Sharan, R. Rosenholtz, and E. H. Adelson. Material perception: What can you see in a brief glance? Jour. of Vis., 9(8):784, 2009.
- M. T. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: Geometry-aware dimensionality reduction for SPD matrices. In ECCV, 2014.
- S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In CVPR, 2013.
- H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
- S. Ji and J. Ye. Linear dimensionality reduction for multi-label classification. In IJCAI, 2009.
- B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang. Saliency detection via absorbing Markov chain. In ICCV, 2013.
- T. Kobayashi. Dirichlet-based histogram feature transform for image classification. In CVPR, 2014.
- P. Koniusz, F. Yan, P.-H. Gosselin, and K. Mikolajczyk. Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection. Technical report, 2013.
- S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
- L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.
- P. Li, Q. Wang, and L. Zhang. A novel Earth Mover’s Distance methodology for image matching with Gaussian mixture models. In ICCV, 2013.
- M. Lovric, M. Min-Oo, and E. A. Ruh. Multivariate normal distributions parametrized as a Riemannian symmetric space. JMVA, 74(1):36–48, 2000.
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
- H. Nakayama, T. Harada, and Y. Kuniyoshi. Global Gaussian approach for scene categorization using information geometry. In CVPR, 2010.
- X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. IJCV, pages 41–66, 2006.
- W. K. Pratt. Digital Image Processing, 4th Edition. John Wiley & Sons, Inc., New York, NY, USA, 2007.
- Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover’s Distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
- J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3):222–245, 2013.
- J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
- C. Stein. Lectures on the theory of estimation of many parameters. Jour. of Math. Sci., 34(1):1373–1403, 1986.
- V. Sydorov, M. Sakurada, and C. H. Lampert. Deep Fisher kernels - end to end learning of the Fisher kernel GMM parameters. In CVPR, 2014.
- O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, 2006.
- J. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek. Visual word ambiguity. TPAMI, 32(7):1271–1283, 2010.
- A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
- J. Wang, J. Yang, K. Yu, F. Lv, T. S. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
- J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
- B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
- N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for sub-category recognition. In CVPR, 2012.
- W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian. Towards codebook-free: Scalable cascaded hashing for mobile image search. TMM, 16(3):601–611, 2014.
- X. Zhou, K. Yu, T. Zhang, and T. S. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.