
# More About Covariance Descriptors for Image Set Coding: Log-Euclidean Framework based Kernel Matrix Representation

###### Abstract

We consider a family of structural descriptors for visual data, namely covariance descriptors (CovDs), which lie on the non-linear manifold of symmetric positive definite (SPD) matrices, a special type of Riemannian manifold. We propose an improved version of CovDs for image set coding by extending the traditional CovDs from the Euclidean space to the SPD manifold. Specifically, the manifold of SPD matrices is a complete inner product space with the operations of logarithmic multiplication and scalar logarithmic multiplication defined in the Log-Euclidean framework. In this framework, we characterise the covariance structure in terms of the arc-cosine kernel, which satisfies Mercer's condition, and propose the operation of mean centralization on SPD matrices. Furthermore, we combine arc-cosine kernels of different orders using mixing parameters learnt by kernel alignment in a supervised manner. Our proposed framework provides a lower-dimensional and more discriminative data representation for the task of image set classification. The experimental results demonstrate its superior performance, measured in terms of recognition accuracy, as compared with the state-of-the-art methods.

## 1 Introduction

The representation of visual data plays a vital role in identifying the content of images [23, 14], image sets [28, 4] and videos [21, 8]. There are many descriptors for visual data. As is well known, the most popular representations for recognition tasks are bag-of-visual-words (BoVW) models [25], Fisher vectors (FV) [14] and the vector of locally aggregated descriptors (VLAD) [1]. These representations ultimately take the form of vectors and typically involve the following four steps: feature extraction, codebook generation, encoding and pooling, and normalization.

Structured representations, such as linear subspaces [9] and covariance descriptors (CovDs) [29, 33], have recently been shown to offer efficient and powerful representations for high-dimensional tasks in computer vision. In particular, CovDs, defined by the second-order statistics of sample features, have been widely used as visual representations for both single images [29] and image sets [33]. Prior to the CovDs for describing image sets studied in [33], covariance matrices were used as region covariance descriptors to characterise local regions within an image, with applications to object tracking, object detection and texture classification. In contrast, when CovDs are used to describe image sets [33], the samples are the images in the set and the features are the raw intensities of the image pixels. The resulting covariance matrices are often singular because the feature dimensionality (the number of pixels in an image) is usually larger than the number of samples (the number of images). Among the aforementioned methods, CovDs for describing image sets have the following characteristics:

1) In contrast to the vector representations of BoVW, FV and VLAD, which generate linear descriptors via codebooks, CovDs directly generate structured representations.

2) A feature matrix $F = [f_{1}, f_{2}, \ldots, f_{n}] \in \mathbb{R}^{d \times n}$ describes an image set with $n$ images, where $f_{i}$ is a $d$-dimensional feature vector characterising the $i$-th image. Here, $d$ is the number of pixels in each image.

3) The resulting covariance matrix tends to be singular and of high dimensionality, and may contain a certain amount of redundant information.

Characteristic 3 summarises several deficiencies of the traditional CovDs as image set descriptors. In this paper, we propose an improved framework which uses a kernel matrix defined in terms of representations associated with sub-image sets, instead of pixels. Our proposed framework enables the generation of covariance descriptors on the non-linear SPD manifold (CovDs-S; source code: https://github.com/Kai-Xuan/iCovDs). The experimental results show the advantages of our proposed CovDs-S.

The rest of this paper is organized as follows: In Section 2, we introduce the theory of Riemannian geometry of SPD manifold and review the Log-Euclidean framework which is the baseline for our proposed approach. In Section 3, we give a brief overview of the traditional CovDs as image set descriptors and present the proposed framework. We present and discuss the experimental results in Section 4. Section 5 draws conclusions and outlines future work.

## 2 Background Theory

This section first provides a brief introduction to the Riemannian geometry of the SPD manifold and proposes the process of SPD mean centralization. We then present the Log-Euclidean framework based arc-cosine (LogE.Arc) kernel.

Notation: In this paper, $I$ is the identity matrix. $\mathcal{S}_{++}^{d}$ denotes the SPD manifold spanned by $d \times d$ real SPD matrices and $\mathcal{S}^{d}$ denotes the space spanned by $d \times d$ real symmetric matrices. $T_{P}$ is the tangent space at the point $P \in \mathcal{S}_{++}^{d}$, which is a flat space spanned by real symmetric matrices. $\mathrm{Diag}(\lambda_{1}, \ldots, \lambda_{d})$ is a diagonal matrix with the diagonal elements $\lambda_{1}, \ldots, \lambda_{d}$. For $X = U \, \mathrm{Diag}(\lambda_{1}, \ldots, \lambda_{d}) \, U^{\top}$, the matrix logarithm $\log(\cdot): \mathcal{S}_{++}^{d} \rightarrow \mathcal{S}^{d}$ is defined as:

$$\log(X) = U \, \mathrm{Diag}\big(\log(\lambda_{1}), \ldots, \log(\lambda_{d})\big) \, U^{\top} \tag{1}$$

If $X$ is an SPD matrix, $\log(X)$ will be a point in the tangent space at the identity matrix $I$. Similarly, for a symmetric matrix $S = U \, \mathrm{Diag}(\sigma_{1}, \ldots, \sigma_{d}) \, U^{\top}$, the matrix exponential $\exp(\cdot): \mathcal{S}^{d} \rightarrow \mathcal{S}_{++}^{d}$ is defined as:

$$\exp(S) = U \, \mathrm{Diag}\big(\exp(\sigma_{1}), \ldots, \exp(\sigma_{d})\big) \, U^{\top} \tag{2}$$
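These eigendecomposition-based maps are straightforward to realise numerically. The following NumPy sketch (the helper names `spd_log` and `sym_exp` are ours, not from the paper) illustrates Eqs. 1 and 2 and their round-trip property:

```python
import numpy as np

def spd_log(X):
    """Matrix logarithm of an SPD matrix (Eq. 1) via eigendecomposition."""
    lam, U = np.linalg.eigh(X)              # X = U Diag(lam) U^T, lam > 0
    return U @ np.diag(np.log(lam)) @ U.T

def sym_exp(S):
    """Matrix exponential of a symmetric matrix (Eq. 2); the result is SPD."""
    sig, U = np.linalg.eigh(S)
    return U @ np.diag(np.exp(sig)) @ U.T

# log maps an SPD matrix into the flat space of symmetric matrices,
# and exp maps it back: exp(log(X)) recovers X.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T + 4 * np.eye(4)                 # a random SPD matrix
assert np.allclose(sym_exp(spd_log(X)), X)
```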

### 2.1 The General Metrics on the SPD Manifold

A real SPD matrix $X \in \mathcal{S}_{++}^{d}$ satisfies $v^{\top} X v > 0$ for all non-zero $v \in \mathbb{R}^{d}$. The Affine Invariant Riemannian Metric (AIRM) is the most frequently studied Riemannian metric on the SPD manifold [22]. Besides AIRM, the Log-Euclidean Metric (LEM) [2] and two types of Bregman divergence [16], namely the Stein [26] and Jeffrey [11] divergences, are also widely used to analyze SPD matrices.

Definition 1 (Affine Invariant Riemannian Metric, AIRM) *The most common Riemannian metric on $\mathcal{S}_{++}^{d}$ is the Affine Invariant Riemannian Metric (AIRM) [22], in which the geodesic distance $d_{\mathrm{AIRM}}: \mathcal{S}_{++}^{d} \times \mathcal{S}_{++}^{d} \rightarrow [0, \infty)$ between two SPD matrices $X$ and $Y$ can be obtained by:*

$$d_{\mathrm{AIRM}}(X, Y) = \left\| \log\!\left( X^{-1/2} Y X^{-1/2} \right) \right\|_{F} \tag{3}$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm and $\log(\cdot)$ is the matrix principal logarithm.

Definition 2 (Stein divergence) *The Stein, or S, divergence [26] $S: \mathcal{S}_{++}^{d} \times \mathcal{S}_{++}^{d} \rightarrow [0, \infty)$ is a special type of Bregman divergence:*

$$S(X, Y) = \ln \det\!\left( \frac{X + Y}{2} \right) - \frac{1}{2} \ln \det(XY) \tag{4}$$

Definition 3 (Jeffrey divergence) *The Jeffrey, or J, divergence [11] $J: \mathcal{S}_{++}^{d} \times \mathcal{S}_{++}^{d} \rightarrow [0, \infty)$ is another type of Bregman divergence:*

$$J(X, Y) = \frac{1}{2} \operatorname{tr}\!\left( X^{-1} Y \right) + \frac{1}{2} \operatorname{tr}\!\left( Y^{-1} X \right) - d \tag{5}$$

The Stein and Jeffrey divergences behave similarly to the geodesic distance induced by AIRM [22].

Definition 4 (Log-Euclidean Metric, LEM) *For two SPD matrices $X, Y \in \mathcal{S}_{++}^{d}$, the Log-Euclidean Distance (LED) [33, 2] is defined by the Frobenius norm in the tangent space at the identity matrix $I$:*

$$d_{\mathrm{LEM}}(X, Y) = \left\| \log(X) - \log(Y) \right\|_{F} \tag{6}$$

Accordingly, the SPD manifold reduces to a flat Riemannian space [2] when endowed with LEM.
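As a concrete illustration, the four measures reviewed above can be sketched in NumPy (a sketch on our part; the function names are ours):

```python
import numpy as np

def spd_logm(X):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

def airm(X, Y):
    """Geodesic distance induced by AIRM (Eq. 3)."""
    lam, U = np.linalg.eigh(X)
    Xih = U @ np.diag(lam ** -0.5) @ U.T    # X^{-1/2}
    return np.linalg.norm(spd_logm(Xih @ Y @ Xih), 'fro')

def stein(X, Y):
    """Stein (S) divergence (Eq. 4), via log-determinants."""
    return (np.linalg.slogdet((X + Y) / 2)[1]
            - 0.5 * (np.linalg.slogdet(X)[1] + np.linalg.slogdet(Y)[1]))

def jeffrey(X, Y):
    """Jeffrey (J) divergence (Eq. 5)."""
    d = X.shape[0]
    return 0.5 * np.trace(np.linalg.solve(X, Y)) \
         + 0.5 * np.trace(np.linalg.solve(Y, X)) - d

def lem(X, Y):
    """Log-Euclidean distance (Eq. 6): Euclidean distance between logs."""
    return np.linalg.norm(spd_logm(X) - spd_logm(Y), 'fro')
```

All four vanish when the arguments coincide and are positive otherwise; LEM is the cheapest once the matrix logarithms are cached, which is one reason the Log-Euclidean framework is attractive.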

### 2.2 Log-Euclidean Framework

In the Log-Euclidean framework, the matrix logarithm $\log(\cdot): \mathcal{S}_{++}^{d} \rightarrow \mathcal{S}^{d}$ is smooth and bijective, and its inverse map, denoted by $\exp(\cdot)$, is smooth as well. Figure 1 illustrates these two operations on the SPD manifold. The Log-Euclidean kernel [33] is derived by computing the inner product in the domain of logarithm matrix:

$$k_{\mathrm{LE}}(X, Y) = \operatorname{tr}\!\left( \log(X) \log(Y) \right) \tag{7}$$

where $\operatorname{tr}(\cdot)$ denotes the matrix trace. The Log-Euclidean kernel is a positive definite kernel and has been shown to meet Mercer's condition in [33]. The operations of logarithmic multiplication and scalar logarithmic multiplication are the corresponding Euclidean operations in the domain of logarithm matrix, followed by an inverse mapping back to the SPD manifold via the matrix exponential (interested readers can refer to [2, 19] for details). We can thus propose the operation of mean centralization on SPD matrices.

Proposition 1 *In line with the brief overview of the Log-Euclidean framework [2, 19], we define the operation of mean centralization on SPD matrices in three steps. Firstly, we map the SPD matrix into the domain of logarithm matrix. Then, we centralize the resulting symmetric matrix by an operation that is similar to centering the kernel matrix in [7]. Finally, we map the centralized matrix back to the SPD manifold via the exponential map. For an arbitrary real SPD matrix $X \in \mathcal{S}_{++}^{d}$, the operation of mean centralization can be written as:*

$$\widetilde{X} = \exp\!\left( H \log(X) H \right), \qquad H = I - \frac{1}{d} \mathbf{1} \mathbf{1}^{\top} \tag{8}$$

Here, $\widetilde{X}$ is the result of our proposed mean centralization operation applied to the SPD matrix $X$, and $\mathbf{1}$ is a column vector of ones.
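Under our reading of Proposition 1 (double-centering the logarithm, analogous to kernel-matrix centering; this interpretation and the helper names are our assumptions), the operation can be sketched as:

```python
import numpy as np

def spd_logm(X):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

def sym_expm(S):
    sig, U = np.linalg.eigh(S)
    return U @ np.diag(np.exp(sig)) @ U.T

def mean_centralize(X):
    """Mean centralization of an SPD matrix (our reading of Eq. 8):
    map to the log-domain, double-centre as in kernel centering,
    then map back to the SPD manifold."""
    d = X.shape[0]
    H = np.eye(d) - np.ones((d, d)) / d     # centering matrix
    return sym_expm(H @ spd_logm(X) @ H)
```

The row and column means of the centred log-matrix are zero, and the exponential map guarantees that the output is again SPD.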

Inspired by the broad applications of the arc-cosine kernel [6] in the Euclidean space and the family of Log-Euclidean kernels proposed in [19], we propose a Log-Euclidean framework based arc-cosine kernel (LogE.Arc kernel), which extends the well-known arc-cosine kernel onto the nonlinear Riemannian manifold of SPD matrices.

Definition 5 (arc-cosine kernel) *Let $x, y \in \mathbb{R}^{d}$ be two vectors. The angle $\theta$ between the samples [6] is:*

$$\theta = \cos^{-1}\!\left( \frac{x^{\top} y}{\|x\| \, \|y\|} \right) \tag{9}$$

*In [6], the arc-cosine kernel has a simple formulation, which depends on the magnitude of the vectors and the angle between them. It can be defined as:*

$$k_{n}(x, y) = \frac{1}{\pi} \, \|x\|^{n} \, \|y\|^{n} \, J_{n}(\theta) \tag{10}$$

*where the angular dependence function $J_{n}(\theta)$ for different orders $n$ is defined as:*

$$J_{n}(\theta) = (-1)^{n} (\sin\theta)^{2n+1} \left( \frac{1}{\sin\theta} \frac{\partial}{\partial \theta} \right)^{\!n} \left( \frac{\pi - \theta}{\sin\theta} \right) \tag{11}$$

*For the first orders, $J_{0}(\theta) = \pi - \theta$, $J_{1}(\theta) = \sin\theta + (\pi - \theta)\cos\theta$ and $J_{2}(\theta) = 3\sin\theta\cos\theta + (\pi - \theta)(1 + 2\cos^{2}\theta)$.*

The arc-cosine kernel function has different properties that are shared respectively by the radial basis function (RBF), linear, and polynomial kernels (interested readers can refer to [6] for the details of the arc-cosine kernel). Motivated by the work in [19] and the Log-Euclidean framework, the inputs of the arc-cosine kernel can not only be vectors in the Euclidean space, but also SPD matrices on the curved Riemannian manifold. Thus, the Log-Euclidean framework based arc-cosine kernels can be defined as:

$$k_{n}^{\mathrm{LogE}}(X, Y) = \frac{1}{\pi} \, \|\log(X)\|_{F}^{n} \, \|\log(Y)\|_{F}^{n} \, J_{n}(\theta) \tag{12}$$

Here $J_{n}(\cdot)$ has the same formulation as Eq. 11, and $\theta$ is the angle between the inputs mapped into the domain of logarithm matrix:

$$\theta = \cos^{-1}\!\left( \frac{\operatorname{tr}\!\left( \log(X) \log(Y) \right)}{\|\log(X)\|_{F} \, \|\log(Y)\|_{F}} \right) \tag{13}$$

The arc-cosine kernel given in Eq. 12 sets up a Log-Euclidean framework for constructing kernels on the SPD manifold; it measures the similarity of SPD matrices and is referred to as the Log-Euclidean framework based arc-cosine kernel (LogE.Arc kernel). The LogE.Arc kernels of different orders are the corresponding arc-cosine kernels in the domain of logarithm matrix, and inherit the corresponding properties (RBF, etc.) from the vector space.
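A minimal NumPy sketch of the LogE.Arc kernel for the first few orders (the function names are ours; the closed forms of $J_{n}$ follow [6]):

```python
import numpy as np

def spd_logm(X):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

def J(n, theta):
    """Angular dependence function J_n (Eq. 11) for orders n = 0, 1, 2."""
    if n == 0:
        return np.pi - theta
    if n == 1:
        return np.sin(theta) + (np.pi - theta) * np.cos(theta)
    if n == 2:
        return 3 * np.sin(theta) * np.cos(theta) \
             + (np.pi - theta) * (1 + 2 * np.cos(theta) ** 2)
    raise ValueError("order not implemented in this sketch")

def loge_arc(X, Y, n=1):
    """LogE.Arc kernel (Eq. 12) between two SPD matrices X and Y."""
    Lx, Ly = spd_logm(X), spd_logm(Y)
    nx, ny = np.linalg.norm(Lx, 'fro'), np.linalg.norm(Ly, 'fro')
    cos_t = np.clip(np.trace(Lx @ Ly) / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)                # the angle of Eq. 13
    return (nx ** n) * (ny ** n) * J(n, theta) / np.pi
```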

## 3 Proposed Framework

In this section, we first give a brief overview of traditional CovDs for describing image sets and then present our proposed CovDs-S. Finally, we compare our CovDs-S with other improved versions of CovDs.

### 3.1 Traditional Covariance Descriptors

Consider a feature matrix $F$ (Fig. 2 step (b)) of an image set with $n$ images: $F = [f_{1}, f_{2}, \ldots, f_{n}] \in \mathbb{R}^{d \times n}$, where $f_{i}$ is the $d$-dimensional feature vector representing the $i$-th image. Colour images are first converted to grayscale, and the $d$-dimensional feature vectors are obtained by vectorizing the grayscale images. Using the traditional CovDs, the representation $C$ [33] of this image set can be obtained by:

$$C = \frac{1}{n-1} \sum_{i=1}^{n} (f_{i} - \mu)(f_{i} - \mu)^{\top} = \frac{1}{n-1} \bar{F} \bar{F}^{\top} \tag{14}$$

where $\mu = \frac{1}{n} \sum_{i=1}^{n} f_{i}$ is the $d$-dimensional mean vector of the feature matrix $F$, $\bar{F} = F - \mu \mathbf{1}^{\top}$ is the mean centralized matrix (Fig. 2 step (c)), and $\mathbf{1}$ is a column vector of ones. The covariance matrix $C$ can also be viewed as the kernel matrix between the mean centralized feature vectors of the corresponding pixels (the rows of the mean centralized matrix $\bar{F}$) via the linear kernel (Fig. 2 step (d)):

$$C_{ij} = \frac{1}{n-1} \, \bar{F}_{(i)} \bar{F}_{(j)}^{\top} \tag{15}$$

where $\bar{F}_{(i)}$ denotes the $i$-th row of $\bar{F}$, which is also the feature vector representing the $i$-th pixel across the images. $C_{ij}$ is the result of a linear kernel operation between $\bar{F}_{(i)}$ and $\bar{F}_{(j)}$ and denotes the similarity between the $i$-th pixel and the $j$-th pixel.
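The equivalence of Eqs. 14 and 15 is easy to check numerically; a short sketch (names ours), where each column of `F` is one vectorized image:

```python
import numpy as np

def covds(F):
    """Traditional CovDs (Eq. 14) of a d x n feature matrix F:
    the covariance over the n images, equivalently the linear kernel
    matrix between the rows of the mean centralized matrix (Eq. 15)."""
    n = F.shape[1]
    Fc = F - F.mean(axis=1, keepdims=True)  # mean centralized matrix
    return Fc @ Fc.T / (n - 1)

rng = np.random.default_rng(0)
F = rng.random((64, 10))                    # ten 8x8 images, vectorized
C = covds(F)
assert np.allclose(C, np.cov(F))            # agrees with NumPy's covariance
```

With $d = 64$ pixels and only $n = 10$ images, $C$ has rank at most 9, which illustrates the singularity issue noted in Characteristic 3.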

### 3.2 Proposed Framework for Image Set Coding

Our proposed framework offers a lower-dimensional and more discriminative representation for describing image sets than the traditional CovDs. For the sub-volumes of an image set (Fig. 2 step (e)), namely sub-image sets, we use SPD matrices to describe them via a Gaussian embedding. We then centralize these SPD matrices and apply the LogE.Arc kernel to the resulting mean centralized SPD matrices. Finally, the image set representation in our proposed framework is the kernel matrix defined by the mean centralized SPD matrices associated with the corresponding sub-image sets. We first give a brief overview of the Gaussian embedding and then elaborate the bottom row of Figure 2 to describe our framework.

The feature matrix $F$, as introduced in the sub-section on traditional CovDs, can also be described by a Gaussian model. The space spanned by Gaussian models is a Riemannian manifold, and Gaussian embedding can embed this special manifold into the SPD manifold [31]:

$$G = \begin{bmatrix} \Sigma + \beta^{2} \mu \mu^{\top} & \beta \mu \\ \beta \mu^{\top} & 1 \end{bmatrix} \tag{16}$$

where $\mu$ is the $d$-dimensional mean vector of $F$ and $\Sigma$ is a $d \times d$ real covariance matrix. $\beta$ is a parameter balancing the covariance matrix and the mean vector. The resulting matrix $G$ is an SPD matrix and is used as the descriptor for sub-image sets (Fig. 2 step (f)) in our proposed framework.
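Assuming the $(d{+}1) \times (d{+}1)$ block form of the embedding (our reconstruction of Eq. 16; the ridge term and the function name are also ours), a sketch:

```python
import numpy as np

def gaussian_embed(F, beta=1.0, eps=1e-6):
    """Embed the Gaussian model N(mu, Sigma) of a d x n feature matrix F
    into the SPD manifold (our reconstruction of Eq. 16)."""
    d = F.shape[0]
    mu = F.mean(axis=1, keepdims=True)              # d x 1 mean vector
    Sigma = np.cov(F) + eps * np.eye(d)             # small ridge keeps it SPD
    return np.block([[Sigma + beta ** 2 * (mu @ mu.T), beta * mu],
                     [beta * mu.T, np.ones((1, 1))]])
```

The Schur complement of the bottom-right entry is $\Sigma$ itself, so the embedded matrix is SPD whenever $\Sigma$ is.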

The bottom row of Figure 2 shows the flow chart of our proposed framework. We first obtain 4 sub-image sets via a sliding window (Fig. 2 step (e); *the particular 4 sub-image sets were selected just for demonstration, and their choice is not fixed in our framework*). Then we use four SPD matrices (Fig. 2 step (f)) to represent the sub-image sets via Gaussian embedding (Eq. 16) and obtain four mean centralized SPD matrices $\widetilde{G}_{1}, \widetilde{G}_{2}, \widetilde{G}_{3}, \widetilde{G}_{4}$ (Fig. 2 step (g)) via the operation of mean centralization (Eq. 8). Finally, the resulting representation (Fig. 2 step (h)) is the sum kernel matrix of the four mean centralized SPD matrices via the LogE.Arc kernels of different orders (Eq. 12). To this end, the resulting representation can generally be defined as:

$$K = \sum_{r=1}^{R} \mu_{r} K^{(r)}, \qquad K^{(r)}_{ij} = k_{n_{r}}^{\mathrm{LogE}}\!\left( \widetilde{G}_{i}, \widetilde{G}_{j} \right), \quad i, j = 1, \ldots, m \tag{17}$$

where $m$ is the number of the sub-image sets, which is the key parameter determining the dimensionality, and $R$ is the number of orders selected for the LogE.Arc kernel. $\widetilde{G}_{i}$ is the mean centralized SPD matrix, which is the representation of the $i$-th sub-image set. $K^{(r)}$ is the local kernel matrix obtained with the $r$-th order LogE.Arc kernel function between the mean centralized SPD matrices $\widetilde{G}_{1}, \widetilde{G}_{2}, \ldots, \widetilde{G}_{m}$. The resulting representation $K$ is the sum kernel matrix (the sum of kernels is also a kernel [27]) of the local kernel matrices multiplied by the corresponding weights $\mu_{r}$. $K_{ij}$ denotes the similarity of the $i$-th sub-image set and the $j$-th sub-image set.
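Putting Eq. 17 together, a self-contained sketch of the sum kernel over mean centralized SPD descriptors (the helper repeats the LogE.Arc kernel for orders 0 and 1 only; all names are ours):

```python
import numpy as np

def spd_logm(X):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

def loge_arc(X, Y, n):
    """LogE.Arc kernel (Eq. 12), restricted to orders n = 0, 1."""
    Lx, Ly = spd_logm(X), spd_logm(Y)
    nx, ny = np.linalg.norm(Lx, 'fro'), np.linalg.norm(Ly, 'fro')
    t = np.arccos(np.clip(np.trace(Lx @ Ly) / (nx * ny), -1.0, 1.0))
    Jn = (np.pi - t) if n == 0 else np.sin(t) + (np.pi - t) * np.cos(t)
    return (nx * ny) ** n * Jn / np.pi

def sum_kernel(spd_list, orders, weights):
    """Eq. 17: the m x m weighted sum of LogE.Arc kernel matrices over
    the mean centralized SPD descriptors of the m sub-image sets."""
    m = len(spd_list)
    K = np.zeros((m, m))
    for w, n in zip(weights, orders):
        for i in range(m):
            for j in range(m):
                K[i, j] += w * loge_arc(spd_list[i], spd_list[j], n)
    return K
```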

### 3.3 Learning weights via Kernel Alignment

In this sub-section, we learn, via kernel target alignment, the weight coefficients $\mu_{r}$ associated with the $r$-th order LogE.Arc kernel in our proposed framework.

Definition 6 (Kernel Alignment [7, 27]) *Kernel alignment aims to align an input kernel matrix $K$ to a target kernel matrix $T$. It is defined as:*

$$A(K, T) = \frac{\langle K, T \rangle_{F}}{\sqrt{\langle K, K \rangle_{F} \, \langle T, T \rangle_{F}}} \tag{18}$$

*where $\langle \cdot, \cdot \rangle_{F}$ denotes the Frobenius inner product.*

The result of Eq. 18 can be viewed as the cosine of the angle between $K$ and $T$. The weight coefficients should be estimated by maximizing the alignment $A(K, T)$, where $K$ is the global kernel matrix obtained as the sum of the local kernel matrices of different orders. The kernel alignment has the following optimization formulation [7]:

$$\max_{\mu} \; A\!\left( \sum_{r=1}^{R} \mu_{r} K^{(r)}, \; T \right), \quad \text{s.t. } \|\mu\|_{2} = 1 \tag{19}$$

We now introduce the kernel matrices $K^{(r)}$ and $T$. We are given a set of training samples, where the proposed representation of the $i$-th image set is the kernel matrix between the sub-image sets constructed from it by image division, and $K^{(r)}$ denotes the $r$-th order LogE.Arc kernel matrix between these sub-image sets. We are also given the corresponding label matrix $Y$ for the samples, where the $i$-th row of $Y$ contains the class label information of the $i$-th sample and its $j$-th element is 1 if the sample is from the $j$-th class. In this paper, the global kernel matrix is the sum of the local kernel matrices multiplied by the weights, $K = \sum_{r=1}^{R} \mu_{r} K^{(r)}$, and the target kernel matrix is defined via the label matrix as $T = Y Y^{\top}$. As introduced in [7], the objective function in Eq. 19 can be rewritten as:

$$\max_{\mu} \; \frac{\mu^{\top} a}{\sqrt{\mu^{\top} M \mu}}, \quad \text{s.t. } \|\mu\|_{2} = 1 \tag{20}$$

where $\|\mu\|_{2} = 1$ is a regularization term, $a$ is defined by $a_{r} = \langle K^{(r)}_{c}, T \rangle_{F}$, where $K^{(r)}_{c}$ is the centralized matrix of the kernel matrix $K^{(r)}$ [7, 27], and the matrix $M$ is defined by $M_{rs} = \langle K^{(r)}_{c}, K^{(s)}_{c} \rangle_{F}$.

According to Proposition 2 in [7], the solution of Eq. 20 is given by:

$$\mu^{*} = \frac{M^{-1} a}{\left\| M^{-1} a \right\|} \tag{21}$$
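A sketch of the closed-form weight estimate (our reconstruction of Eqs. 20-21, following the two-stage procedure of [7]; $M$ is assumed invertible and the names are ours):

```python
import numpy as np

def align_weights(local_kernels, T):
    """Kernel-alignment weights (Eq. 21): mu proportional to M^{-1} a,
    with a_r = <K_r^c, T>_F and M_rs = <K_r^c, K_s^c>_F, where K_r^c is
    the centred r-th local kernel matrix."""
    N = T.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    Kc = [H @ K @ H for K in local_kernels]
    a = np.array([np.sum(Ki * T) for Ki in Kc])    # Frobenius inner products
    M = np.array([[np.sum(Ki * Kj) for Kj in Kc] for Ki in Kc])
    mu = np.linalg.solve(M, a)
    return mu / np.linalg.norm(mu)                 # unit-norm weights
```

When one local kernel already matches the target and the other is uninformative, the learnt weight vector concentrates on the former, which is the behaviour the alignment criterion is designed to produce.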

### 3.4 Comparison with other Improved Versions of traditional CovDs for Image Set Coding

As far as we know, the works in [30, 20, 5] are the only three improved versions of the traditional CovDs for describing image sets. It is desirable to clarify their connections and differences.

Wang et al. [30] proposed an open framework (https://www.uow.edu.au/~leiw/) that uses the kernel matrix over feature dimensions as a generic representation. This work uses a non-linear kernel matrix as the representation, but the kernel functions are defined in the Euclidean space, and the resulting representation describes similarities between pixels at different locations, as do the traditional CovDs [30]. Our work proposes to capture the similarities between sub-image sets, which contain more useful information, and extends the arc-cosine kernel onto the SPD manifold for this purpose.

Li et al. [20] extended the descriptive granularity of the covariance matrix from the traditional pixel level to the more general patch level. Though this work concentrates on patch-level covariance computation, it is actually a sum-pooling form of the pixel-level covariance. There is an essential difference in the way the descriptors are computed: for describing image sets, we use the kernel matrix computed on the SPD manifold as the resulting representation instead of the covariance matrix computed in the Euclidean space [20]. Moreover, we use SPD matrices to represent sub-image sets instead of feature matrices consisting of intensity values [20].

Chen et al. [5] proposed a framework (https://github.com/Kai-Xuan/ComponentSPD/) to generate low-dimensional discriminative data representations for describing image sets, concentrating on characterising the similarities between sub-image sets instead of pixels. The main difference between this approach and our method is that we combine arc-cosine kernels of different orders in the domain of logarithm matrix instead of using a linear kernel [5]. Moreover, we obtain our sub-image sets via the sliding-window technique instead of dividing images into non-overlapping blocks [5], and use the Gaussian embedding to model them. The work in [5] is a special case of our proposed framework.

**Table 1.** Average recognition rates (%) and standard deviations of the different descriptors with the same classifiers.

| Methods | Descriptors | CG [15] | ETH-80 [18] | Virus [17] | MDSD [24] |
|---|---|---|---|---|---|
| NN-AIRM | CovDs [33] | 51.82±2.55 | 70.12±5.24 | 27.57±4.34 | 13.08±4.05 |
| | CovDs-B [30] | 71.87±2.47 | 89.17±3.48 | 36.57±5.00 | 23.67±5.55 |
| | CovDs-P [20] | 87.49±1.41 | 87.83±4.61 | 67.40±6.19 | 22.51±4.56 |
| | CovDs-C [5] | 89.07±1.15 | 94.53±2.55 | 40.17±6.07 | 21.13±5.68 |
| | CovDs-S (Ours) | 90.21±1.27 | 94.18±3.69 | 67.60±4.76 | 32.38±5.60 |
| NN-Stein | CovDs [33] | 40.66±2.62 | 57.08±5.62 | 27.43±4.90 | 12.67±4.25 |
| | CovDs-B [30] | 75.06±2.32 | 88.32±3.71 | 35.17±4.48 | 22.92±6.17 |
| | CovDs-P [20] | 79.97±2.30 | 88.75±4.69 | 67.03±6.14 | 21.77±4.16 |
| | CovDs-C [5] | 89.76±1.23 | 94.10±2.90 | 39.83±6.18 | 19.51±5.42 |
| | CovDs-S (Ours) | 90.30±1.28 | 94.18±3.69 | 67.70±4.80 | 31.51±5.89 |
| NN-Jeffrey | CovDs [33] | 82.45±1.38 | 86.12±4.81 | 30.80±5.49 | 18.56±4.77 |
| | CovDs-B [30] | 59.38±2.42 | 90.13±3.58 | 40.03±5.81 | 23.67±5.37 |
| | CovDs-P [20] | 83.07±1.44 | 87.63±4.54 | 65.37±6.55 | 21.67±4.29 |
| | CovDs-C [5] | 89.52±1.14 | 94.60±2.79 | 40.97±6.54 | 24.00±5.30 |
| | CovDs-S (Ours) | 90.02±1.22 | 94.68±3.64 | 67.50±4.79 | 33.03±5.72 |
| NN-LEM | CovDs [33] | 67.47±1.93 | 78.17±5.59 | 25.97±4.62 | 13.74±4.52 |
| | CovDs-B [30] | 73.18±2.39 | 90.65±3.79 | 36.07±4.22 | 24.87±5.51 |
| | CovDs-P [20] | 89.44±1.17 | 88.27±4.24 | 67.57±6.61 | 24.90±4.96 |
| | CovDs-C [5] | 90.24±1.09 | 93.48±3.16 | 40.57±5.53 | 21.54±5.31 |
| | CovDs-S (Ours) | 90.49±1.31 | 94.07±3.77 | 68.10±4.88 | 32.23±6.08 |
| Ker-SVM | CovDs [33] | 91.54±1.16 | 92.60±5.15 | 65.83±5.63 | 35.74±6.11 |
| | CovDs-B [30] | 92.31±1.13 | 94.18±3.69 | 73.77±5.82 | 37.79±5.76 |
| | CovDs-P [20] | 94.34±1.02 | 94.52±3.64 | 75.40±6.01 | 35.54±6.47 |
| | CovDs-C [5] | 93.81±1.01 | 95.45±2.85 | 53.63±6.80 | 38.08±6.05 |
| | CovDs-S (Ours) | 94.36±0.98 | 97.07±2.66 | 77.93±5.03 | 43.92±6.09 |

## 4 Experiments and Analysis

This section presents comparative experimental results of our proposed framework with state-of-the-art (SOTA) methods for the task of image set classification.

### 4.1 Datasets and settings

In our first experiment involving the task of image set classification, we consider the Cambridge hand-gesture (CG) dataset [15] that contains nine categories of samples and nine hundred image sets. Each class has twenty image sets chosen for training at random, and the remaining eighty image sets are reserved for testing. In the ETH-80 dataset [18], there are eight categories of samples and eighty image sets. For each class, five image sets are randomly chosen as training samples and the remaining five image sets are used for testing. In the Virus cell dataset [17], there are fifteen categories of samples and 100 images in each category. We divided the images of each category equally into five different image sets and obtained seventy-five image sets. For each class, three image sets are randomly chosen as training samples and the remaining two image sets for testing. The MDSD dataset [24] has been used for the task of dynamic scene classification. Following the settings in [27], we test the method based on the protocol of seventy-thirty-ratio (STR) which chooses seven videos for training and three videos for testing in each class.

**Table 2.** Average recognition rates (%) and standard deviations of our framework based classifiers and the SOTA methods.

| Methods | CG [15] | ETH-80 [18] | Virus [17] | MDSD [24] |
|---|---|---|---|---|
| COV-LDA [33] | 90.25±1.64 | 93.95±4.30 | 46.40±5.76 | 34.10±5.90 |
| COV-PLS [33] | 88.95±1.26 | 94.23±4.63 | 62.84±5.99 | 36.74±5.62 |
| LogEKSR.Pol [19] | 92.32±1.19 | 95.00±3.28 | 58.53±6.54 | 36.23±6.81 |
| LogEKSR.Exp [19] | 92.23±1.18 | 95.10±3.20 | 59.03±6.38 | 36.59±6.88 |
| LogEKSR.Gau [19] | 92.33±1.18 | 95.18±3.30 | 61.80±6.35 | 37.95±6.83 |
| LEML [13] | 88.18±1.29 | 93.05±3.31 | 33.00±5.70 | 25.97±6.79 |
| LEML+COV-LDA [13] | 89.09±1.63 | 95.35±3.50 | 58.03±5.84 | 31.92±6.44 |
| LEML+COV-PLS [13] | 86.36±1.35 | 95.83±3.04 | 59.40±6.22 | 35.90±6.98 |
| SPDML-LEM [10] | 84.03±1.04 | 90.63±4.19 | 49.37±7.46 | 24.23±4.47 |
| SPDNet [12] | 92.03±1.46 | 95.50±3.69 | 59.70±4.58 | 33.76±5.04 |
| MMML [32] | 92.87±1.39 | 95.28±3.80 | 51.13±7.60 | 31.95±6.26 |
| KS-CS-LEK | 93.63±1.08 | 95.38±2.92 | 74.93±5.92 | 39.33±7.00 |
| KS-CS-LogE.Pol | 93.95±0.94 | 97.30±2.55 | 75.17±4.86 | 42.72±6.20 |
| KS-CS-LogE.Exp | 93.68±0.90 | 95.40±3.03 | 70.90±5.60 | 42.67±7.04 |
| KS-CS-LogE.Gau | 93.90±0.91 | 95.65±2.86 | 71.87±5.18 | 41.15±6.33 |
| KS-CS-LogE.Arc | 94.36±0.98 | 97.07±2.66 | 77.93±5.03 | 43.92±6.09 |

### 4.2 A Comparison with Existing Descriptors

For the comparative experiments with the existing descriptors [30, 20, 5], we first resize all images to the same size and then use the intensity values to generate their corresponding representations. For our proposed framework, the sub-image sets are obtained by a sliding window with a spatial step of 2 pixels for the CG, ETH-80 and MDSD datasets, and a spatial step of 3 pixels for the Virus dataset. In total, we obtain 100 sub-image sets for the CG, ETH-80 and MDSD datasets and 49 sub-image sets for the Virus dataset. In our framework, we set the orders of the LogE.Arc kernel to [0, 1, 2, 3], and the value of the balancing parameter in Eq. 16 depends on the dataset: 0.05, 0.9, 14 and 2 for the four datasets, respectively. For the learned weights, we set the two with the largest absolute values to 1 and the other two to zero on the ETH-80, Virus and MDSD datasets; on the CG dataset, we set the weight with the largest absolute value to 1 and the other three to zero. We regularize the traditional CovDs to avoid matrix singularity, as introduced in [33]. To generate the descriptor in [30], we obtain the final kernel matrix representation via the RBF kernel, which has been shown to produce better accuracies [30]. For the fairness of the comparative experiments, the patch size in [20] and the block size in [5] match the sliding-window size, and the step size is the same as the setting used by our proposed CovDs-S on the corresponding dataset. The different descriptors evaluated in our experiments are referred to as:

CovDs: Image set represented by the traditional CovDs [33].
CovDs-B: Image set represented by the method in [30].
CovDs-P: Image set represented by the method in [20].
CovDs-C: Image set represented by the method in [5].
CovDs-S: Image set represented by our framework.

In our experiments, five classification algorithms are used to verify the validity of our proposed CovDs-S: four nearest neighbour (NN) algorithms based on AIRM, Stein divergence, Jeffrey divergence and LEM, and the well-known SVM classifier [3]. The different methods tested in our experiments are referred to as:

NN-AIRM: AIRM-based NN classifier.

NN-Stein: Stein divergence-based NN classifier.

NN-Jeffrey: Jeffrey divergence-based NN classifier.

NN-LEM: LEM-based NN classifier.

Ker-SVM: LEK-based SVM classifier.

Here, Ker-SVM is a one-vs-all SVM classifier implemented by Wang et al. (http://www.peihuali.org/publications/RAID-G/RIAD-G_V1.zip). Table 1 shows the average recognition rates and standard deviations of the different descriptors with the same classifiers. Except for the results with NN-AIRM on the ETH-80 dataset, the recognition rates of CovDs-S are higher than those of the other four descriptors under the same classification algorithm. This confirms that our CovDs-S captures more discriminative information than the other methods. In particular, the accuracy of CovDs-C is not as high as shown in Table 1 if its settings follow the recommendations made in [5].

### 4.3 Comparison with Existing Classifiers

Here, we compare our method with the SOTA algorithms, including Covariance Discriminative Learning (COV-LDA, COV-PLS) [33], Log-Euclidean Kernels for Sparse Representation (LogEKSR.Pol, LogEKSR.Exp and LogEKSR.Gau) [19], SPD Manifold Learning based on LEM (SPDML-LEM) [10], Log-Euclidean Metric Learning (LEML, LEML+COV-LDA, LEML+COV-PLS) [13], the Riemannian Network for SPD Matrix Learning (SPDNet) [12] and Multiple Manifolds Metric Learning (MMML) [32]. For these methods, we first resize all images to the same size and use the intensity values as their features.

We use Ker-SVM tested on the representations obtained by our framework as our proposed classification algorithm. In addition to the LogE.Arc kernel (introduced above) used in our framework, we also consider other types of kernel functions to enrich our framework, such as the Log-Euclidean Kernel (LEK) [33] and the Log-Euclidean based polynomial (LogE.Pol), exponential (LogE.Exp) and Gaussian (LogE.Gau) kernels. Our framework based classifiers are referred to as:

KS-CS-LogE.Arc: Ker-SVM tested on CovDs-S (introduced above).

KS-CS-LEK: Ker-SVM tested on the representations obtained via our framework, where the LogE.Arc kernel is replaced by LEK.

KS-CS-LogE.Pol: Ker-SVM tested on the representations obtained via our framework, where the LogE.Arc kernel is replaced by the LogE.Pol kernel.

KS-CS-LogE.Exp: Ker-SVM tested on the representations obtained via our framework, where the LogE.Arc kernel is replaced by the LogE.Exp kernel.

KS-CS-LogE.Gau: Ker-SVM tested on the representations obtained via our framework, where the LogE.Arc kernel is replaced by the LogE.Gau kernel.

As shown in Table 2, the classifiers based on our framework produce better performance than the other SOTA methods. The advantage of our methods is particularly obvious on the Virus and MDSD datasets, where the image samples contain a large amount of noise. Our method KS-CS-LogE.Arc achieves the best recognition rates of 94.36%, 77.93% and 43.92% on the CG, Virus and MDSD datasets, respectively. Our framework based KS-CS-LogE.Pol achieves the best recognition rate of 97.30% on the ETH-80 dataset, on which KS-CS-LogE.Arc achieves the second best recognition rate of 97.07%.

### 4.4 Ablation Experiments

In this subsection, we validate the contribution of each component of CovDs-S and analyze the effect of image rotation and sub-image size. To this end, CovDs-S is evaluated in the following variants: CovDs-S without Gaussian embedding (CovDs-S-GE), CovDs-S without kernel alignment (CovDs-S-KA), CovDs-S without mean centralization (CovDs-S-MC), and CovDs-S with image rotations of 90° (CovDs-S-IM90), 180° (CovDs-S-IM180) and 270° (CovDs-S-IM270).

Table 3 shows the accuracies and standard deviations of the different variants in conjunction with the Ker-SVM classifier on the four datasets. From the results in this table, we can conclude that the contributions of Gaussian embedding and kernel alignment are more significant than the effect of mean centralization in our CovDs-S. In theory, the mean centralization operation corresponds to the standard normalization operation used in the traditional CovDs; it appears that, in the case of our framework, this operation has little impact on accuracy. According to the last three rows of the table, CovDs-S is also not sensitive to image rotation.

Figure 3 shows the effect of sub-image size and step size on the recognition rates of KS-CS-LogE.Arc. On the horizontal axis, labels of the form 'a/b' give the sliding window size 'a' and the step size 'b'. The results show clearly that the proposed framework performs better for a suitable sub-image size: our CovDs-S achieves the best accuracies on the CG, ETH-80 and MDSD datasets when the step size is 2 pixels, and on the Virus dataset when the step size is 3 pixels.

**Table 3.** Accuracies (%) and standard deviations of the CovDs-S variants in conjunction with the Ker-SVM classifier.

| Descriptors | CG [15] | ETH-80 [18] | Virus [17] | MDSD [24] |
|---|---|---|---|---|
| CovDs-S-GE | 94.18±0.88 | 96.73±2.90 | 60.00±6.21 | 40.10±6.68 |
| CovDs-S-KA | 91.76±1.26 | 96.62±3.02 | 74.63±5.14 | 43.79±6.16 |
| CovDs-S-MC | 94.34±0.97 | 97.07±2.66 | 77.67±4.68 | 43.74±6.10 |
| CovDs-S-IM90 | 94.32±1.00 | 97.07±2.66 | 77.40±4.70 | 43.90±6.16 |
| CovDs-S-IM180 | 94.34±0.97 | 97.07±2.66 | 77.83±4.95 | 43.92±6.06 |
| CovDs-S-IM270 | 94.33±0.99 | 97.07±2.66 | 77.53±4.56 | 43.95±6.26 |

### 4.5 Advantages of Our Framework

From the superior results in Tables 1 and 2, we can conclude that our proposed framework captures more discriminative information for the task of image set classification than the other methods. The superior performance is particularly notable for noisy samples. Moreover, the dimensionality of the representation obtained via our framework is related to the number of sub-image sets (100 or 49), which is far lower than that of the traditional CovDs. Table 4 shows the run time of representative methods (NN-AIRM and Ker-SVM), as well as the generating time (GT) required for the different descriptors on the ETH-80 dataset, where the unit of time is one second. The representations extracted by our framework require far less classification time than the traditional CovDs. With the recommended settings in the experiments, our proposed representations tend to be nonsingular. The resulting representation can also be viewed as corresponding to the kernel matrix (arc-cosine kernel, etc., in the Euclidean space) of a feature matrix, as for the traditional CovDs. Here, we cannot guarantee the nonsingularity of the resulting kernel matrix, but it is more likely to hold for an SPD matrix than for a traditional covariance matrix.

**Table 4.** Run time (in seconds) on the ETH-80 dataset: generating time (GT) of the descriptors and classification time of NN-AIRM and Ker-SVM.

| | CovDs | CovDs-S (LEK) | CovDs-S (LogE.Pol) | CovDs-S (LogE.Exp) | CovDs-S (LogE.Gau) | CovDs-S (LogE.Arc) |
|---|---|---|---|---|---|---|
| GT | 20.15 | 48.67 | 48.74 | 49.32 | 49.20 | 51.83 |
| NN-AIRM | 28.77 | 1.248 | 1.252 | 1.248 | 1.250 | 1.249 |
| Ker-SVM | 2.07 | 0.056 | 0.056 | 0.056 | 0.056 | 0.055 |

## 5 Conclusion and Future Work

We proposed a novel framework that extends CovDs from the Euclidean space to the SPD manifold. It generates a kernel matrix defined by the representations of sub-image sets instead of pixels. Our method provides a lower-dimensional data representation, which is beneficial for improving the efficiency of classifiers. The experimental results show that the representation obtained by our proposed framework is more discriminative than those of other methods for the task of image set classification. In future work, we will consider how to extend our proposed framework to a Reproducing Kernel Hilbert Space (RKHS).

## 6 Acknowledgments

The paper is supported by the National Natural Science Foundation of China (Grant No. 61672265, U1836218), the 111 Project of Ministry of Education of China (Grant No. B12018), UK EPSRC Grant EP/N007743/1, and MURI/EPSRC/DSTL Grant EP/R018456/1.

## References

- [1] (2013) All about VLAD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585. Cited by: §1.
- [2] (2007) Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications 29 (1), pp. 328–347. Cited by: §2.1, §2.2.
- [3] (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 27. Cited by: §4.2.
- [4] (2018) Riemannian kernel based Nyström method for approximate infinite-dimensional covariance descriptors with application to image set classification. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 651–656. Cited by: §1.
- [5] (2018) Component SPD matrices: a low-dimensional discriminative data descriptor for image set classification. Computational Visual Media 4 (3), pp. 245–252. Cited by: §3.4, Table 1, §4.2.
- [6] (2009) Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350. Cited by: §2.2.
- [7] (2010) Two-stage learning kernel algorithms. In ICML, pp. 239–246. Cited by: §2.2, §3.3.
- [8] (2017) Object tracking with kernel correlation filters based on mean shift. In Smart Cities Conference (ISC2), 2017 International, pp. 1–7. Cited by: §1.
- [9] (2009) Extended Grassmann kernels for subspace-based learning. In Advances in Neural Information Processing Systems, pp. 601–608. Cited by: §1.
- [10] (2018) Dimensionality reduction on SPD manifolds: the emergence of geometry-aware methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1), pp. 48–62. Cited by: §4.3, Table 2.
- [11] (2014) Bregman divergences for infinite dimensional covariance matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1003–1010. Cited by: §2.1.
- [12] (2017) A Riemannian network for SPD matrix learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §4.3, Table 2.
- [13] (2015) Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In International Conference on Machine Learning, pp. 720–729. Cited by: §4.3, Table 2.
- [14] (2012) Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9), pp. 1704–1716. Cited by: §1.
- [15] (2009) Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (8), pp. 1415–1428. Cited by: Table 1, §4.1, Table 2, Table 3.
- [16] (2009) Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research 10 (Feb), pp. 341–376. Cited by: §2.1.
- [17] (2011) Virus texture analysis using local binary patterns and radial density profiles. In Iberoamerican Congress on Pattern Recognition, pp. 573–580. Cited by: Table 1, §4.1, Table 2, Table 3.
- [18] (2003) Analyzing appearance and contour based methods for object categorization. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. II–409. Cited by: Table 1, §4.1, Table 2, Table 3.
- [19] (2013) Log-Euclidean kernels for sparse representation and dictionary learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1601–1608. Cited by: §2.2, §4.3, Table 2.
- [20] (2016) Spatial pyramid covariance-based compact video code for robust face retrieval in TV-series. IEEE Transactions on Image Processing 25 (12), pp. 5905–5919. Cited by: §3.4, Table 1, §4.2.
- [21] (2014) Action recognition with stacked Fisher vectors. In European Conference on Computer Vision, pp. 581–595. Cited by: §1.
- [22] (2006) A Riemannian framework for tensor computing. International Journal of Computer Vision 66 (1), pp. 41–66. Cited by: §2.1.
- [23] (2010) Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision, pp. 143–156. Cited by: §1.
- [24] (2010) Moving vistas: exploiting motion for describing scenes. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1911–1918. Cited by: Table 1, §4.1, Table 2, Table 3.
- [25] (2003) Video Google: a text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1470. Cited by: §1.
- [26] (2012) A new metric on the manifold of kernel matrices with application to matrix geometric means. In Advances in Neural Information Processing Systems, pp. 144–152. Cited by: §2.1.
- [27] (2017) Learning deep match kernels for image-set classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3307–3316. Cited by: §3.2, §3.3, §4.1.
- [28] (2015) Grassmann manifold for nearest points image set classification. Pattern Recognition Letters 68, pp. 190–196. Cited by: §1.
- [29] (2008) Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence (10), pp. 1713–1727. Cited by: §1.
- [30] (2015) Beyond covariance: feature representation with nonlinear kernel matrices. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4578. Cited by: §3.4, Table 1, §4.2.
- [31] (2016) RAID-G: robust estimation of approximate infinite dimensional Gaussian with application to material recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4433–4441. Cited by: §3.2.
- [32] (2018) Multiple manifolds metric learning with application to image set classification. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 627–632. Cited by: §4.3, Table 2.
- [33] (2012) Covariance discriminative learning: a natural and efficient approach to image set classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2496–2503. Cited by: §1, §2.1, §2.2, §3.1, Table 1, §4.2, §4.3, Table 2.