Learning a Robust Representation via a Deep Network on Symmetric Positive Definite Manifolds
Abstract
Recent studies have shown that aggregating convolutional features of a pretrained Convolutional Neural Network (CNN) can obtain impressive performance for a variety of visual tasks. The symmetric Positive Definite (SPD) matrix becomes a powerful tool due to its remarkable ability to learn an appropriate statistic representation to characterize the underlying structure of visual features. In this paper, we propose to aggregate deep convolutional features into an SPD matrix representation through the SPD generation and the SPD transformation under an endtoend deep network. To this end, several new layers are introduced in our network, including a nonlinear kernel aggregation layer, an SPD matrix transformation layer, and a vectorization layer. The nonlinear kernel aggregation layer is employed to aggregate the convolutional features into a real SPD matrix directly. The SPD matrix transformation layer is designed to construct a more compact and discriminative SPD representation. The vectorization and normalization operations are performed in the vectorization layer for reducing the redundancy and accelerating the convergence. The SPD matrix in our network can be considered as a midlevel representation bridging convolutional features and highlevel semantic features. To demonstrate the effectiveness of our method, we conduct extensive experiments involving two domains: visual search and visual classification. Experiment results show that our method notably outperforms stateoftheart methods.
I Introduction
Deep Convolutional Neural Networks (CNNs) have shown great success in many vision tasks. There are several successful networks, e.g., AlexNet [Krizhevsky2012ImageNet], VGG [Simonyan2014Very], GoogleNet [Szegedy2014Going], Network In Network [Lin2013Network] and ResNet [He2015Deep]. Driven by the emergence of largescale data sets and fast development of computation power, features based on CNNs have proven to perform remarkably well on a wide range of visual recognition tasks [Zeiler2013Visualizing, Donahue2013DeCAF]. Two contemporaneous works introduced by Liu et al. [Lin2016Bilinear] and Babenko and Lempitsky [Yandex2016Aggregating] demonstrate that convolutional features could be seen as a set of local features which can capture the visual representation related to objects. To make better use of deep convolutional features, many efforts have been devoted to aggregating them, such as max pooling [Tolias2015Particular], crossdimensional pooling [Kalantidis2015Cross], sum pooling [Yandex2016Aggregating], and bilinear pooling [Lin2016Bilinear, Gao2015Compact]. However, modeling these convolutional features to boost the feature learning ability of a CNN is still a challenging task. This work investigates a more effective scheme to aggregate convolutional features to produce a robust representation using an endtoend deep network for visual tasks.
Recently, the Symmetric Positive Definite (SPD) matrix has been demonstrated the powerful representation ability and widely used in computer vision community, such as face recognition [Huang2016Geometry, Harandi2017Dimensionality], image set classification [Huang2015Log], transfer learning [Herath2017Learning], and action recognition [Zhang2016Exploiting, Zhou2017Revisiting]. Through the theory of nonEuclidean Riemannain geometry, the SPD matrix often turns out to be better suited in capturing desirable data distribution properties. Accordingly, we attempt to aggregate the deep convolutional features into an powerful SPD matrix as a robust representation.
The secondorder statistic information of convolutional features, e.g., the covariance matrix and Gaussian distribution, are the widely used SPD matrix representation endowed with CNNs [Li2017Is, Ionescu2015Matrix, Yu2017Second]. The dimensionality of convolutional features extracted from CNNs may be much larger than that of handcraft features. As a result, modeling convolutional features from CNNs by using the covariance matrix or Gaussian distribution is insufficient to precisely model the real feature distribution. When the dimension of features is larger than the number of features, the covariance matrix and Gaussian distribution is a symmetric Positive SemiDefinite (PSD) matrix, i.e., the singular matrix. Singular matrix makes the data have an unreasonable manifold structure. In this case, the Riemannain metrics, e.g., the affineinvariant distance and LogEuclidean distance, are unsuitable to measure the manifold structure of SPD matrices. Moreover, most SPD matrix embedding on deep networks only contains the linear correlation of features. Owning the ability of capturing nonlinear relationship among features is indispensable for a generic representation.
It is thus desirable that a more discriminative and suitable SPD representation aggregated from deep features should be established in an endtoend framework for visual analysis. To this end, we design a series of new layers to overcome existing issues aforementioned based on the following two observations.

Kernel functions possess an ability of modeling nonlinear relationships of data, and they are fiexible and easy to be computed. Beyond covariance [Wang2015Beyond] have witnessed significant advances of positive definite kernel functions whose kernel matrices are real SPD matrices, no matter what the number of the feature dimension and the number of features are. Since many kernel functions are differentiable, such as Radial Basis Function (RBF) kernel function, Polynomial kernel function and Laplacian kernel function [Bo2010Kernel], they can be readily embeded into a network to implement an endtoend training, which is well aligned with the design requirements of a deep network.

Several deep SPD networks [Dong2017Deep, Huang2016A, Zhang2017Deep] transform the SPD matrix to a new compact and discriminative matrix. The network input is an SPD matrix. The transformed matrix is still an SPD matrix which can capture desirable data properties. We find that the transformed SPD matrix after the learnable layers leads to better performance than the original SPD matrix. The output SPD matrix not only have characteristics of a general SPD matrix that captures the desirable properties of visual features but also is more suitable and discriminative to the specific visual task.
Motivated by empirical observations mentioned above, we introduce a convolutional feature aggregation operation which consists of the SPD generation and the SPD transformation. Three new layers including a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer, are designed to replace the traditional global pooling layers and fully connected (FC) layers. Concretely, we deem each feature map as a sample and present a kernel aggregation layer using a nonlinear kernel function to generate an SPD matrix. The proposed kernel matrix models a nonlinear relationship among feature maps and ensures that the SPD matrix is nonsingular. More importantly, our kernel matrix is differentiable, which entirely meets requirements of a deep network. The SPD matrix transformation layer is employed to map the SPD matrix to a more discriminative and compact one. Thanks to the symmetry property, the vectorization layer carries out the upper triangle vectorization and normalization operations to the SPD matrix followed by the classifier. The architecure of our network is illustrated in Fig. 1. The proposed method first generates an SPD matrix based on convolutional features and then transforms the initial SPD matrix to a more discriminative one. It can not only capture the real spatial information but also encode highlevel variation information among convolutional features. Actually, the obtained descriptor acts as a midlevel representation bridging convolutional features and highlevel semantics features. The resulting vector can contribute to visual search and classification tasks, as validated in experiments.
In summary, our contributions are threefold.
(1) We apply the SPD matrix nonlinear aggregation to the convolutional feature aggregation field by the generation and the transformation two processes. In this way, it can learn an compactness and robustness SPD matrix representation to characterize the underlying structure of convolutional features.
(2) We carry out the nonlinear aggregation of convolutional features under a Riemannain deep network architecture, where three novel layers are introduced, i.e., a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer. The stateoftheart performance of our SPD aggregation network is consistently achieved over the visual search and classification tasks.
(3) We exploit the faster matrix operations to avoid the cyclic calculation in forward and backward backpropagations of the kernel aggregation layer. In addition, we present the component decomposition and retraction of the Orthogonal Stiefle manifold to carry out the backpropagation on the SPD matrix transformation layer.
The remaining sections are organized as follows. We review the recent works about feature aggregation methods in both Euclidean Space and Riemannain Space in Section II. Section III presents the details of our SPD aggregation method. We report and discuss the experimental results in Section IV, and conclude the paper in Section V.
Ii Related Work
Feature aggregation is an important problem in computer vision tasks. Recent works have witnessed significant advances of CNNs. It is still a challenging work to find a suitable way to aggregate convolutional features. In this section, we review typical techniques of feature aggregation in both the Euclidean space and Riemannain space.
Iia Convolutional Feature Aggregation in the Euclidean Space
An effective image representation is an essential element for visual recognition due to the object appearance variations caused by pose, view angle, and illumination changes. Traditional methods typically obtain the image representation by aggregating handcrafted local features (e.g., SIFT) into a global image descriptor. Popular aggregation schemes include Bagofwords (BOW) [Sivic2003Video], Fisher Vector (FV) [Liu2014Encoding], and Vector of Locally Aggregated Descriptor (VLAD) [Ng2015Exploiting]. Gong et al. [Gong2014Multi] introduced a multiscale orderless pooling scheme to aggregates FC6 features of local patches into a global feature using VLAD. The VLAD ignores different effects of each cluster center. Cimpoi et al. [Cimpoi2015Deep] treated the convolutional layer of CNNs as a filter bank and built an orderless representation using FV. In addition, Liu et al. [Liu2015Cross] proposed a cross convolutional layer pooling scheme which regards feature maps as a weighting filter to the local features. Tolias et al. [Tolias2015Particular] max pooled convolutional features of the last convolutional layer to represent each patch and achieved compelling performance for object retrieval. Babenko et al. [Yandex2016Aggregating] compared different kinds of aggregation methods (i.e., max pooling, sum pooling and fisher vector) for last convolutional layer features and demonstrated the sumpooled convolutional descriptor is really competitive with other aggregation schemes.
Works mentioned above only treat the CNN as a blackbox feature extractor rather than studying on properties of CNN features in an endtoend framework. Several researchers [Lin2016Bilinear, Zhang2016Deep, Arandjelovic2016NetVLAD] suggested that the endtoend network can achieve better performance because it is sufficient by itself to discover good features for visual tasks. Arandjelovic et al. [Arandjelovic2016NetVLAD] proposed a NetVLAD which adopts an the endtoend framework for weakly supervised place recognition. Based on the ResNet, Zhang et al. [Zhang2016Deep] introduced an extended version of the VLAD, i.e., DeepTEN, for texture classification. Lin et al. [Lin2016Bilinear] presented a general orderless pooling model named Bilinear to compute the outer product of local features. He et al. [He2015Spatial] introduced a spatial pyramid pooling method eliminating the constrain of the fixedsize input image.
Recent research shows that exploiting the manifold structure representation is more effective than the hypothetical Euclidean distribution in some visual tasks. The difference between our method and the traditional aggregation methods in the Euclidean space is that we use the powerful SPD manifold structure to aggregate the desirable data distributions of features. We design an SPD aggregation scheme to generate the SPD matrix as the resulting representation, and transform the SPD representation to more discriminative one by learnable layers.
IiB Convolutional Feature Aggregation in the Riemannain Space
The aggregation methods in the nonEuclidean space have been successful applied. It can capture more appropriate feature distributions information. The secondorder statistic information has better performance than the firstorder statistic [Li2017Is], such as average pooling. Some works directly regard the secondorder statistic information as the SPD matrix. Ionescu et al. [Ionescu2015Matrix] proposed a DeepO2P network that uses a covariance matrix as the image representation. They mapped points on the manifold to the logarithm tangent space and derived a new chain rule for derivatives. Li et al. [Li2017Is] presented a matrix normalized covariance method exploring the secondorder statistic. This work can tackle the singular issue of the covariance matrix by the normalization operation. Yu and Salzmann [Yu2017Second] introduced a covariance descriptor unit to integrates secondorder statistic information. The covariance matrix of convolutional features is generated and then transformed to a vector for the softmax classifier. Compared with our network, these three works are confined to the drawbacks of covariance matrices. Engin et al. [Engin2017DeepKSPD] designed a deep kernel matrix based SPD representation, but didn’t contains the transformation process.
Other SPD Riemannain networks mainly project an SPD matrix to a more discriminative one. Dong et al. [Dong2017Deep] and Huang and Gool [Huang2016A] proposed Riemannain networks contemporaneously, in which the inputs of their networks are SPD matrices. The networks projects high dimensional SPD matrices to a low dimensional discriminative SPD manifold by a nonlinear mapping. Zhang et al. [Zhang2017Deep] introduced new layers to transform and vectorize the SPD matrix for action recognition, where the input is a nonlinear kernel matrix modeling correlation of frames in a video. However, these three works only focused on how to transform the SPD matrix without utilizing the powerful convolutional features. The generation of the input SPD matrix can not be guided by the loss function. In contrast, our method focuses on not only the SPD matrix transformation but also the generation from convolutional features.
Our work is closely related with [Li2017Is, Yu2017Second, Engin2017DeepKSPD]. We make it clear that the proposed convolutional feature aggregation method is composed of generation and transformation processes. Compared with [Li2017Is], our method utilizes the kernel matrix as the representation instead of the secondorder statistic covariance matrix, characterizing complex nonlinear variation information of features. In addition, our aggregation method contains a learnable transformation process than [Li2017Is], making SPD representation more compact and robust. The generated SPD matrix in our method is more powerful than the covariance matrix in [Yu2017Second], avoiding some drawbacks of PSD matrix. In addition, instead of a transformation from a matrix to a vector, the vectorization operation in our work is taking the upper triangle of a matrix since there are already transformation operations between the SPD matrices. Compared to [Engin2017DeepKSPD], our SPD representation can be more compact and robust through the transformation process.
Iii SPD Aggregation Method
Our model aims to aggregate convolutional features into a powerful SPD representation in an endtoend fashion. To this end, we design three novel layers including a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer. Our SPD aggregation can be applied to two different kinds of tasks, i.e., visual search and visual classification. Specifically, the convolutional features continue to pass through the kernel aggregation layer and the vectorization layer to produce a global representation for the visual search task. For the visual classification task, convolutional features pass through the proposed three layers followed by an FC layer and a loss function. The intermediate generated SPD matrix can be treated as a midlevel representation which is a connection between convolutional features and highlevel features. The architecture of our network is illustrated in Fig. 1(c).
Iiia Preprocessing of Convolutional Features
A CNN model trained on a large dataset such as ImageNet can have a better general representation ability. We would like to fuse the convolutional features of the last convolutional layer and adjust the dimension of convolutional features for different tasks. We exploit the Principal Component Analysis (PCA) between the last convolutional layer and the kernel aggregation operation to reduce the dimension of each local convolutional feature in the non endtoend visual search task, so that it is benifical to alleviate information redundancy. Each dimension has the largest variance and is independent to each other.
For the visual classification tasks, we introduce a convolutional layer whose filter’s size is between the last convolutional layer of the offtheshelf model and the kernel aggregation layer to make the processed convolutional features more adaptive to the SPD matrix representation. A Relu layer follows the convolutional layer to increase the nonlinear ability.
IiiB Kernel Aggregation Layer
We present the kernel aggregation layer to aggregate convolutional features into an initial SPD matrix. Let be dimensional convolutional features. is the number of channels, i.e., the number of feature maps, and are the height and width of each feature map, respectively. Let denote the th local feature, and there are local features in total, where . is the th feature map.
Although several approaches have applied a covariance matrix to be a generic feature representation and obtained promising results, two issues remain to be addressed. First, the rank of covariance matrix should hold , otherwise covariance matrix is prone to be singular when the dimension of local features is larger than the number of local features extracted from an image region. Second, for a generic representation, the capability of modeling nonlinear feature relationship is essential. However, covariance matrix only evaluates the linear correlation of features.
To address these issues, we adopt the nonlinear kernel matrix as a generic feature representation to aggregate deep convolutional features. In particular, we take advantage of the Riemannain structure of SPD matrices to describe the secondorder statistic and nonlinear correlations among deep convolutional features. The nonlinear kernel matrix is capable of modeling nonlinear feature relationship and is guaranteed to be nonsingular. Different from the traditional kernelbased methods whose entries evaluates the similarity between a pair of samples, we apply the kernel mapping to each feature rather than each sample . Mercer kernels are usually employed to carry out the mapping implicitly. The Mercer kernel is a function which can generate a kernel matrix using pairwise inner products between mapped convolutional features for all the input data points. The in our nonlinear kernel matrix can be defined as
(1) 
where is an implicit mapping. In this paper, we exploit the Radial Basis Function (RBF) kernel function expressed as
(2) 
where is a positive constant and set to the mean Euclidean distances of all feature maps. What Eq. (2) reveals is the nonlinear relationship between convolutional features.
We show an important theorem for kernel aggregation operation. Based on the Theorem 1, the kernel matrix of the RBF kernel function is guaranteed to be positive definite no matter what and are.
Theorem 1.
Let denotes a set of different points and . Then the kernel matrix of the RBF kernel function on is guaranteed to be a positive definite matrix, whose th element is and .
Proof.
The Fourier transform convention of the RBF kernel function is
(3) 
Then we calculate the quadratic form of the kernel matrix . Let denote an arbitrary nonzero vector. The quadratic form is
(4)  
where is the transpose operation. Because is a positive and continuous function, the quadratic form on the condition that
(5) 
However, the complex exponentials is linear independence. Accordingly, and kernel matrix is a positive definite matrix. ∎
In this work, is the generated SPD matrix as the midlevel image representation. Any SPD manifold optimization can be applied directly, without structure being destroyed. The toy example of the kernel aggregation is illustrated in Fig. 2. As we all known, the kernel aggregation layer should be differentiable to meet the requirement of an endtoend deep learning framework. Clearly, Eq. (2) is differentiable with respect to the input . Denoting by the loss function, the gradient with respect to the kernel matrix is . is an element in . We compute the partial derivatives of with respect to and , which are
(6)  
In this process, the gradient of the SPD matrix can flow back to convolutional features.
During forward propagation Eq. (2) and backward propagation Eq. (6), we have to do cycles to compute the kernel matrix and cycles to gain the gradient with respect to convolutional features , where is the number of channels. Obviously, both the forward and backward propagations are computationally demanding. It is well known that the computation using matrix operations is preferable due to the parallel computing in computers. Accordingly, our kernel aggregation layer is able to be calculated in a faster way via matrix operations. Let’s reshape the convolutional features to a matrix . Each row of is a reshaped feature map obtained from and each column of is the convolutional local feature . Note that, in Eq. (2) can be expanded to . For each of inner products , and , it needs to be calculated times in cycles of Eq. (2). Now, we can convert times inner products operation to a matrix multiplication operation which only needs to be computed once,
(7)  
where is the Hadamard product and is a matrix whose elements are all “1”s. , and are all real matirces. The element is the 2norm of th row vector of , and is equal to the calculation output of . The element is the 2norm of th column vector of , and is equal to the calculation output of . The element is equal to . , and can be calculated in advance.
Therefore, we compute in Eq. (2) by the matrix addition and multiplication, and implement the to the matrix in a parallel computing way instead of calculating each element in the cycle. Then the kernel matrix can be calculated by matrix operations as follows.
(8) 
where means the exponential operation to each element in the matrix . Although calculating directly the function is timeconsuming, it can be computed efficiently in a matrix form through Eq. (8), which is faster than through Eq. (2). Similarly, back propagation process in Eq. (6) can also be carried out in the matrix operation which is given by
(9) 
Remark: The covariance matrix descriptor, as a special case of SPD matrices, captures feature correlations compactly in an object region, and therefore has been proven to be effective for many applications. Given the local features , the covariance descriptor is defined as
(10) 
where is the mean vector. The covariance matrix can also be seen as a kernel matrix where the th element of the covariance matrix can be expressed as
(11) 
where denotes the inner product, and is the mean value of . Therefore, the covariance matrix corresponds to a special case of the nonlinear kernel matrix defined in Eq. (1), where . Through this way, we can find that covariance matrices contain the simple linear correlation features. Whether the covariance matrix is a positive definite matrix depends on the and , i.e., .
IiiC SPD Matrix Transformation Layer
As discussed in [Dong2017Deep, Huang2016A, Zhang2017Deep], SPD matrix transformation networks are capable of achieving the better performance than the original SPD matrix. Inspired by [Yu2016Weakly] and [Yu2017Second], we add a learnable layer to make the network more flexible and more adaptive to the specific task. Based on the SPD matrix generated by the kernel aggregation layer, we expect to transform the existing SPD representation to be a more discriminative, suitable and desirable matrix. To preserve the powerful ability of the SPD matrix, the transformed matrix should also be an SPD matrix. Moreover, we attempt to adjust the dimension to make the SPD matrix more flexible and compact. Here, we design the SPD matrix transformation layer in our network.
Let’s define the Riemannain manifold of SPD matrices as . The output SPD matrix of the kernel aggregation layer lies on the manifold . We use a matrix mapping to complete the transformation operation. As depicted in Fig. 3, we map the input SPD matrix which lies on the original manifold to a new discriminative and compact SPD matrix in another manifold , where is the dimension of the SPD matrix transformation layer. In this way, the desired transformed matrix can be obtained by a learnable mapping. Given a SPD matrix as an input, the output SPD matrix can be calculated as
(12) 
where is the output of the transformation layer, and are learnable parameters which are randomly initialized during training. controls the size of . Based on the Theorem 2, the learnable parameters should be a column full rank matrix to make be an SPD matrix as well.
Theorem 2.
Let denote an SPD matrix, and , where . is an SPD matrix if and only if is a column full rank matrix, i.e., .
Proof.
If is an SPD matrix, is a column full rank matrix and . For homogeneous equations and , only has a zero solution, where is the zero vector. For arbitrary nonzero vector , . We calculate the quadric form ,
(13) 
Because and is an SPD matrix, . This proves that is an SPD matrix.
On the other hand, if is an SPD matrix, for arbitrary nonzero vector , . Beccause is an SPD matrix, . Only if can lead to . Accordingly, and is a column full rank matrix. ∎
Since there are learnable parameters in the SPD matrix transformation layer, we should not only compute the gradient of loss function with respect to the input , but also calculate the gradient with respect to parameters . The gradient with respect to the input is
(14) 
where is the gradient with respect to the output .
Since is a column full rank matrix, it is on a noncompact Stiefel manifold [Absil2009Optimization]. However, directly optimizing on the noncompact Stiefel manifold is infeasible. To overcome this issue, we relax to be semiorthogonal, i.e., . In this case, is on the orthogonal Stiefel manifold . The optimization space of parameters is changed from the noncompact Stiefel manifold to the orthogonal Stiefel manifold . Considering the manifold structure of , the optimization process is quite different from the gradient descent method in the Euclidean space. We first compute the partial derivative with respect to . Then we convert the partial derivative to the manifold gradient that lies on the tangent space. Along the tangent gradient, we find a new point on the tangent space. Finally, the retraction operation is applied to map the new point on the tangent space back to the orthogonal Stiefel manifold. Thus, an iteration of the optimization process on the manifold is completed. This process is illustrate in Fig. 4. Next we will elaborate each step.
First the partial derivative with respect to is computed by
(15) 
The partial derivative doesn’t contain any manifold constraints. Considering is a point on the orthogonal Stiefel manifold, the partial derivative needs to be converted to the manifold gradient, which is on the tangent space. As shown in Fig. 5, on the orthogonal Stiefel manifold, the partial derivative is a Euclidean gradient at the point , not tangent to the manifold. The tangential component of is what we need for optimization, which lies on the tangent space. The normal component is perpendicular to the tangent space. We decompose into two vectors that are perpendicular to each other, i.e., one is tangent to the manifold and the other is the normal component based on the Theorem 3.
Theorem 3.
Let denote an orthogonal Stiefel manifold and is a point on . denotes a function defined on the orthogonal Stiefel manifold. If the partial derivatives of with respect to is , the manifold gradient at which is tangent to is .
Proof.
Because is a point on the orthogonal Stiefel manifold, , where is an identity matrix. Differentiating yields , where is a tangent vector. Thus, is a skewsymmetric matrix. Note that, the canonical metric for the orthogonal Stiefel manifold at the point is . For all tangent vectors at , we can get that
(16) 
Because is a skewsymmetric matrix, Eq. (16) can be solved, i.e., . ∎
Then the tangential component at can be expressed by the partial derivative ,
(17) 
is the manifold gradient of the orthogonal Stiefel manifold. Searching along the giadient gets a new point on the tangent space. Finally, we use the retracting operation to map the point on the tangent space back to the Stiefel manifold space,
(18) 
where is the retraction operation mapping the data back to the manifold. Specificly, denotes the matrix of QR decomposition to . , , where is a semiorthogonal matrix and is a upper triangular matrix. is the learning rate.
Note that, we can make a Relu activation function layer follow the SPD matrix transformation layer. The output of the Relu layer is still an SPD matrix based on the Theorem 4.
Theorem 4.
The relu activation function on a matrix is . Let ,
If is an SPD matrix, is an SPD matrix.
Proof.
The detailed proof of this theorem is shown in the appendix section of [Dong2017Deep]. ∎
IiiD Vectorization layer
Since inputs of the common classifier is all vectors, we should vectorize the SPD matrix to a vector. Because of the symmetry of the robust SPD matrix achieved by the transformation layer, is determined by elements, i.e., the upper triangular matrix or the lower triangular matrix of . Here, we take the upper triangular matrix of and reshape it into a vector as the input of the loss function. Let’s denote the vector by ,
(19) 
Due to the symmetry of the matrix , the gradient is also a symmetric matrix. For the diagonal elements of , its gradient of the loss function is equal to the gradient of its corresponding element in the vector , while the gradient of nondiagonal elements of is times of the element in the vector . The gradient with respect to is given by
(20) 
The normalization operation is important as well. We use the power normalization and normalization operation following the vector . The gradient formulation Eq. (9), Eq. (14) and Eq. (18) calculate the gradient with respect to the input of the corresponding layer, respectively. Once these gradients are obtained, the standard SGD backpropagation can be easily employed to update the parameters directly with the learning rate. Fig. 6 shows the data flow in our network with the proposed three layers including forward and backward propagations. denotes the output of the last fullyconnected layer. In Algorithm 1, we summarize the training process of our model. We can use more than one SPD transformation layers in the network, where each one can be followed by a Relu layer as the activation layer.
Iv Experiment
To demonstrate the benefits of our method, we conduct extensive experiments on both visual search and visual classification tasks. Althought many visual search tasks are under an endtoend framework, we aggregate the convolutional features into an SPD matrix only through the generation process to demonstrate the powerful ability of our nonlinear SPD matrix representation without the transformation process or back propagation. We conduct experiments on visual classification tasks to show the performace of the SPD aggregation framework including the generation and transformation processes. We first present the visual search tasks on six datasets. We then turn to visual classification tasks on five datasets.
Iva Visual Search
We validate the effectiveness of the kernel aggregation layer through both image retrieval tasks and object instance search tasks. Given a specific object as query, the object instance search aims to not only retrieve the images or frames that contain the query but also locate all its occurrences [Meng2015Object].
IvA1 Datasets and Evaluation Protocols
We conduct comprehensive experiments on six public datasets which are popularly used in visual search, i.e., Oxford5K, Paris6K, Sculptures6K, Holidays, Holidays+100K and UKbench. Oxford5K [Philbin2007Object] includes landmark reference images downloaded from Flicker. It defines queries which are cropped from some of the reference images. Paris6K [Philbin2008Lost] includes landmark reference images downloaded from Flicker. It also defines queries which are cropped from some of the reference images. Sculptures6K [Arandjelovic2011Smooth] includes sculptures reference images downloaded from Flicker. It defines queries which are cropped from some of the reference images. The Holidays dataset [Jegou2008Hamming] includes images of personal holiday photos from categories, where the first image in each category is used as the query. Holidays + 100K contains the Holidays dataset and additionally 100K [Philbin2007Object] distractor images from Flickr. Experiments conducted on the dataset is to demonstrate the search ability of our SPD representation in the large scale data. University of Kentucky Benchmark dataset (UKbench) [Nister2006Scalable] contains indoor photographs of objects (four photos per object). Each image is used to query the rest of the dataset. Some example reference images on these datasets are shown in Fig. 7. The Holidays, Holidays + 100K and UKbench datasets are used for image retrieval, Oxford5K, Paris6K and Sculptures6K are used for the object instance search task.
For the Oxford5K, Paris6K, Sculptures6K and Holidays datasets, the performance is evaluated by mean average precision (mAP) [Kong2012Manhattan], which is a popularly used to evaluate the performance of image retrieval. For the UKbench dataset, the performance is reported as the average number of sameobject images within the top four results, i.e., . We only compare our method with the existing methods that using CNN features on image retrieval and object instance search tasks, because CNN features are more discriminative than the handcrafted features.
IvA2 Implementation Details
For all dataset images, retrieval images and query objects images, we resize them to the size of and extract convolutional features from the conv53 layer of VGG16 model which is pretrained on the ImageNet dataset. Then the PCA is applied to reduce the dimension of each convolutional local feature from to . The PCA also makes each faeture map more independent. We aggregate these feature maps into an SPD matrix representation by the nonlinear kernel function. Since we exploit the distance to measure the similarity between representations in this paper, the SPD matrices are mapped to the tangent space by the matrix logarithm function. Then the mapped symmetric matrix is vectorized to a dimensional representations. Consider the dimension is still large for the visual search tasks, we add another PCA to the vector representation, in which the dimension is reduced to . Image retrieval is done by calculating the distance between the obtained representations of the query image and database images.
For object instance search, a set of object proposals of each reference image are extracted by Edge Boxes [Zitnick2014Edge] and serve as potential regions for the query object. Given a query and a reference image , we first find the bestmatched proposal of based on
(21) 
where represents the similarity between and each proposal . is defined by the distance between representation of and representation of given by
(22) 
where and are the dimensional vectors. After that, the relevance between and is determined by the similarity between and , i.e., . The bounding box of corresponds to the detecting location of the object in the reference image. We then sort the reference images according to and get the search results. Besides, we follow the work in [Tolias2015Particular], also utilize the average query expansion (AQE) to further enhance the object instance search performance. Given a query object image , ( in our implementation) most similar proposals are selected from all reference images based on the distance. We then compute the mean of the representation of and proposals, which is defined as
(23) 
the average representation is treated as new representation of to get the final search result in our experiment. We use the to replace in Eq.(22). The object instance search processes are shown in Fig. 8.
IvA3 Comparisons with Other Image Retrieval Methods
For image retrieval, the performance of different aggregation methods is illustrated in Table I. As can be seen, our schemes provide considerable improvement over previous stateoftheart methods on compact representations [Gong2014Multi, Babenko2014Neural, Razavian2014Visual, Yandex2016Aggregating, Kalantidis2015Cross, Lin2016Bilinear, Radenovi2016CNN, Gordo2016Deep]. It has been shown that the proposed method performs especially well on Holidays dataset. Even CroW [Kalantidis2015Cross] using very tricky pooling methods achieved mAP on Holidays dataset with a dimension representation, a improvement in mAP is achived by our dimension representation. The dimension of VLAD [Ng2015Exploiting] representation is , much larger than our method, but the performance of VLAD is still lower than ours. Compared with our method, NetVLAD [Arandjelovic2016NetVLAD], RMAC [Radenovi2016CNN] and Gordo et al. [Gordo2016Deep] are endtoend frameworks. However, our performance is better than theirs. On the UKbench dataset, ours method achieved performance comparing with , the stateoftheart performance achieved in SPoC [Yandex2016Aggregating]. Moreover, the competitive performance we achieved in Holiday+100K dataset demonstrates the effectiveness of our representation in largescale visual search.
IvA4 Comparisons with Other Object Instance Search Methods
For object instance search tasks, the Table II shows the performance compared with other methods [Babenko2014Neural, Yandex2016Aggregating, Ng2015Exploiting, Razavian2014Visual, Razavian2014CNN, Kalantidis2015Cross, Tolias2015Particular, Arandjelovic2016NetVLAD, Radenovi2016CNN, Gordo2016Deep]. Since methods in [Babenko2014Neural, Yandex2016Aggregating, Ng2015Exploiting] ignore the object localization process which is advantageous to the visual search performance, they are not competitive comparing with other methods. Without Query Expansion, the proposed model obtains mAP on Oxford5Kd dataset, on Paris6K dataset and on Sculptures6K dataset. The mAP of Razavian et al. [Razavian2014Visual] is on Oxford5Kd dataset, higher than ours. But on Paris6K dataset the mAP of [Razavian2014Visual] is , while our method achieves . Besides, the method in [Razavian2014Visual] enlarges the query object bounding box which is not compliant with the standard evaluation protocol. With Query Expansion, the dimension of MAC + R + QE [Radenovi2016CNN] and Gordo et al. + QE [Gordo2016Deep] is larger than our method. RMatch + QE [Iscen_2017_CVPR] implements the method based on layers RestNet, whose convolutional features are more discriminative than VGG16. Methods in [Radenovi2016CNN] and [Gordo2016Deep] are endtoend frameworks. In this case, we have achieved the comparable results.
Method  Network  Dimension  Holidays  Holidays+100K  UKbench 

MOPCNN [Gong2014Multi]  CaffeNet      
Neural Codes[Babenko2014Neural]      
Razavian [Razavian2014Visual]  AlexNet    
SPoC[Yandex2016Aggregating]  VGG19    
VLAD [Ng2015Exploiting]  GoogLeNet      
CroW [Kalantidis2015Cross]  VGG16      
CroW [Kalantidis2015Cross]  VGG16      
Bilinear [Lin2016Bilinear]  VGG16  
NetVLAD [Arandjelovic2016NetVLAD]  VGG16      
RMAC [Radenovi2016CNN]  VGG16    
Gordo et al. [Gordo2016Deep]  VGG16      
Ours  VGG16  
Method  Network  Dimension  Oxford5K  Paris6K  Sculptures6K 

Neural Codes[Babenko2014Neural]        
SPoC[Yandex2016Aggregating]  VGG19    
VLAD [Ng2015Exploiting]  VGG16    
CNNaugss[Razavian2014CNN]  AlexNet  
Razavian et al. [Razavian2014Visual]  AlexNet  
CroW + QE [Kalantidis2015Cross]  VGG16    
Tolias et al. [Tolias2015Particular]  VGG16    
NetVLAD [Arandjelovic2016NetVLAD]  VGG16    
MAC + R + QE [Radenovi2016CNN]  VGG16    
Gordo et al. + QE [Gordo2016Deep]  VGG16    
RMatch + QE [Iscen_2017_CVPR]  ResNet101    
Ours  VGG16  256  
Ours + QE  VGG16  256 
Method  DTD  FMD  KTHT2b  CUB2002011  FGVCaircraft 

FVCNN [Cimpoi2015Deep]    
FVFCCNN [Cimpoi2015Deep]    
BCNN [Lin2016Bilinear]  6  
DeepTEN [Zhang2016Deep]        
VGG16 [Simonyan2014Very]  
Ours 
Method  DTD  FMD  KTHT2b  CUB2002011  FGVCaircraft 

BCNN [Lin2016Bilinear]  
conv + BCNN [Lin2016Bilinear]  
VGG16 [Simonyan2014Very]  
512 conv + VGG16 [Simonyan2014Very]  
+ Kernel Aggregation Layer  
+ Kernel Aggregation Layer  
+ Kernel Aggregation Layer  
Ours 
IvB Visual Classification
In order to demonstrate benefits of the proposed endtoend framework in visual classification tasks, we choose the challenging texture and finegrained classification tasks. The texture classification tasks need a powerful global representation, because of the features of texture should be invariant to translation, scaling and rotation. Differences among finegrained images are very small. It is challenging to represent these differences in the aggregation process.
IvB1 Datasets and Evaluation Protocols
We choose three texture datasets in the experiments. They are Describable Textures Dataset (DTD) [Cimpoi2014Describing], Flickr Material Database (FMD) [Sharan2013Recognizing] and KTHTIPS2b (KTH2b) [Caputo2005Class]. DTD and FMD are both collected in the wild conditions while KTH2b is under the laboratory condition. DTD has classes, and each class contains images. There are totally images in DTD. FMD contains images of classes, each class has images. KTH2b contains images of classes. Fig. 9 illustrates the texture datasets for our experiments. For these texture datasets, we follow the standard traintest protocol. We divide DTD and FMD into three subsets randomly, and use two subsets for training and the rest one subset for testing. Images of KTH2b are splited into four samples. we train the framework using one sample and test on the rest three samples. Inspired by [Krizhevsky2012ImageNet], the texture images are augmented. We do times augmentation to the training data, including randomly cropping times and picking from center and four corners. The test images are only picked from the center and four corners. The size of cropped images are resized to .
We report results on birds and aircrafts finegrained recognition datasets. The birds dataset [Wah2011The] is CUB2002011 which contains classes and images totally. The FGVCaircraft dataset [Maji2013Fine] contains aircraft images of classes. Fig. 10 illustrates some finegrained images. We train and test the birds and aircrafts finegrained datasets through the inherent training document. The data augmentation is not applied to the finegrained images. We resize them to the size of . All the texture and finegrained images are normalized by subtracting means for RGB channels.
IvB2 Implementation Details
The basic convolutional layers and pooling layers before our SPD aggregation are from the VGG16 model which is pretrained on the ImageNet dataset. We remove layers after the conv53 layer of VGG16 model. Then we insert our SPD aggregation method into the network following the conv53 layer. Finally, a FC layer and a softmax layer follow the vectorization layer where the output dimension of the FC layer is equal to the number of classes. All our networks run under the caffe framework. We use SGD with a minibatch size of . The training process is divided into two stages. At the first training stage, we fix the parameters before the SPD aggregation method and train the rest new layers. The learning rate is started from and reduced by when error plateaus. At the second training stage, we train the whole network. The learning rate is started from and divided by when error plateaus.
IvB3 Experiments for the SPD Aggregation Framework
In this section, we compare the SPD aggregation framework with some stateoftheart convolutional feature aggregation methods. First a convolutional layer whose number of channels is follows the conv53 convolutional layers. Then our SPD aggregation method including a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer is inserted after the convolutional layer. The output size of the SPD matrix transformation layer is . Considering these datasets are not big enough, we only use one SPD matrix transformation layer to avoid the overfitting. Table III shows the comparison on texture datasets and finegrained datasets respectively.
The following methods are evaluated in our experiments, FVCNN [Cimpoi2015Deep], FVFCCNN [Cimpoi2015Deep], BCNN [Lin2016Bilinear], DeepTEN [Zhang2016Deep] and the pure VGG16 model [Simonyan2014Very] is used as the baseline. FVCNN aggregates convolutional features from VGG16 conv53 layer. The dimension of it is and is compressed to by PCA for classification. FVFCCNN incorporates the FC features and FV vector. BCNN uses the Bilinear pooling method on the conv53 layer of VGG16 model. The DeepTEN uses layers ResNet and larger number of training samples and image size, while the other methods use VGG16 model. It is not scientific to compare DeepTEN with the other feature aggregation methods.
We can see that, our method gets a better performance, especially on KTH2b and FGVCaircraft datasets. The average precision of our method on KTH2b and FGVCaircraft datasets are and . In contrast, BCNN achieves and . On DTD and CUB2002011 datasets, our method is slightly worse than the BCNN. The reason may be that the linear relationships among features are dominant on some datasests and the nonlinear relationships are important on the others.
IvB4 Experiments for the Components of the Proposed Aggregation Method
convolutional layer. As mentioned above, we employ a convolutional layer to accomplish the preprocessing of convolutional features. In this section, we provide experiments for the necessity of the preprocessing of convolutional features in our method. We design experiments in Table IV. We combine different numbers of channels of layer with the kernel aggregation layer. We also add the layer to the BCNN and pure VGG network. , and in the Table indicate that whether there is a layer before the kernel aggregation layer and the number of channels of the layer. The kernel aggregation layer in the Table means that there is only the kernel aggregation layer without the SPD matrix transformation layer and the vectorization layer in the network. Table IV shows that, the layer is beneficial to our nonlinear kernel aggregation method. However, it is useless or even harmful to the BCNN and pure VGG16 model. Our benefits are brought about by the powerful SPD matrix instead of the preprocessing of convolutional features. But the preprocessing of convolutional features can actually lead to better performance to the SPD aggregation. The reason may be that the convolutional features are totally different from the kernel matrices but have some similarities to the Bilinear matrices or FC features. We can also observe that the number of channels of layer has small influence for the texture datasets. But when it is reduced to , the performance on finegrained datasets is declined. So we argue that the convolutional features are redundant for the texture datasets but not redundant for the finegrained datasets.
SPD Matrix Generation Process. To evaluate the effectiveness of the nonlinear SPD matrix generation process, we establish a network that only contains the kernel aggregation layer without the SPD matrix transformation layer and vectorization layer. The outcome is shown in Table IV. Without the layer, i.e., + kernel aggregation layer in the Table, our network is comparable with the BCNN and pure VGG16 model. When the layer is added to the network, i.e.,, + Kernel Aggregation Layer in the Table, it has a obvious better performance than the other methods on FMD, KTH2b and FGVCaircraft datasets.
SPD Matrix Transformation Process. We design experiments to evaluate the effectiveness of the proposed transformation process in this subsection. Compared with ours, kernel aggregation layer in Table IV only lacks the SPD matrix transformation layer and the vectorization layer, the rest is the same. Through Table IV, we find that the SPD matrix transformation process transforms the SPD matrix to a more suitable and discriminative representation. Especially on DTD dataset and FGVCaircraft datasets, the performance is improved by and respectively.
V Conclusion
In this paper, we have proposed a new powerful SPD aggregation method which models the convolutional feature aggregation as an SPD matrix nonlinear learning problem on the Riemannain manifold. To achieve this goal, we have designed three new layers to aggregate the convolutional features into an SPD matrix and transform the SPD matrix to be more discriminative and suitable. The three layers include a kernel aggregation layer, an SPD matrix transformation layer and a vectorization layer under an endtoend framework. We investigated the component decomposition and retraction of the Orthogonal Stiefle manifold to carry out the backpropagation of our model. Meanwhile, the faster matrix operation was adopted to speed up forward and backward backpropagations. Compared with alternative aggregation strategies such as FV, VLAD and bilinear pooling, our SPD aggregation achieves appealing performance on both visual search and visual classification tasks. Extensive experiments on challenging datasets have demonstrated that our approach outperforms the stateoftheart methods.