Parameterizing Region Covariance: An Efficient Way To Apply Sparse Codes On Second Order Statistics
Abstract
Sparse representations have been successfully applied to signal processing, computer vision and machine learning. Currently there is a trend to learn sparse models directly on structured data, such as region covariance. However, such methods, when combined with region covariance, often require complex computation. We present an approach that transforms a structured sparse model learning problem into a traditional vectorized sparse modeling problem by constructing a Euclidean space representation for region covariance matrices. Our new representation preserves both first- and second-order statistics while allowing similarity to be computed with the Frobenius norm. Experiments on several vision tasks demonstrate competitive performance with the state-of-the-art methods.
1 Introduction
Sparse representations have been successfully applied to many tasks in signal processing, computer vision and machine learning. Many algorithms [1, 12] have been proposed to learn an overcomplete and reconstructive dictionary based on such representations. These algorithms involve vectorizing the input data, which can destroy inherent ordering information in the data [9, 31]. Instead, sparse codes can be constructed directly on the original structure of the input data. Such structures include diffusion tensors, region covariance, etc. The region covariance structure, introduced by Tuzel et al. [33], provides a natural way to fuse different features for a given region. Additionally, the averaging filter in covariance computation reduces noise that corrupts individual samples. Furthermore, Porikli et al. [28] showed that it can be constructed for arbitrary-sized windows in constant time using integral images. Hence, it has become a popular descriptor for face recognition [24, 11, 39], human detection [34], tracking [34], object detection [8, 32], action recognition [38, 7] and pedestrian detection [35].
However, region covariance matrices are positive definite matrices, forming a connected Riemannian manifold. Current manifold-based methods for region covariance often require complex computation. Many applications remain restricted to k-nearest-neighbor classifiers or kernel SVMs using geodesic distance measurements [26, 34, 35]. Pennec et al. [26] first introduced a general framework to calculate statistics based on an affine-invariant metric. Recently, there have been several attempts to develop sparse coding for region covariance matrices [7, 8, 31, 14, 32, 39]. However, such approaches all involve complex computations, including calculating eigenvalues, matrix logarithms and matrix determinants.
We present an approach for sparse coding of parameterized representations of region covariance matrices, inspired by finance applications. This representation preserves the same second-order statistics as region covariance matrices. More importantly, the representation is Euclidean and hence can be vectorized and computed efficiently in the traditional sparse coding framework. We further learn discriminative dictionaries over this representation by integrating label consistency regularization and class information into the objective function. The framework of our approach is shown in Figure 1. The main contributions of this paper are:

- Introduction of the covariance parameterization used in finance to the computer vision community.
- Design of a new Euclidean representation for region covariance with multiple advantages, including lower time complexity for measuring similarity and preservation of both first- and second-order statistics of a given region.
- Discriminative dictionary learning on our new representation of region covariance, demonstrating its effectiveness.
- Experiments showing state-of-the-art performance on multiple tasks.
2 Background
We provide a brief review of the region covariance descriptor and its corresponding similarity measurement methods.
2.1 Region Covariance Descriptors
Given an image I, let φ be a function that extracts a d-dimensional feature vector at each pixel, i.e. z_k = φ(I, x_k, y_k), where z_k ∈ R^d and (x_k, y_k) is the location of pixel k. φ can be any feature mapping function such as intensity, gradient, different color channels, filter responses, etc. The vectors {z_k} form a d × n feature matrix extracted from I. A given image region R containing n pixels is represented by the covariance matrix of the set of feature vectors of all points inside R. The region covariance descriptor is defined as:

C_R = \frac{1}{n-1} \sum_{k=1}^{n} (z_k - \mu)(z_k - \mu)^T   (1)

where μ is the mean vector,

\mu = \frac{1}{n} \sum_{k=1}^{n} z_k   (2)
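As a concrete sketch, Equations (1)-(2) amount to a few lines of NumPy. The function name and toy data below are illustrative, not from the paper:

```python
import numpy as np

def region_covariance(Z):
    """Region covariance descriptor (Eqs. 1-2).

    Z : (n, d) array, one d-dimensional feature vector phi(I, x, y)
        per pixel inside the region R.
    Returns the d x d covariance C_R and the mean vector mu.
    """
    n, d = Z.shape
    mu = Z.mean(axis=0)              # Eq. (2): mean feature vector
    Zc = Z - mu                      # center the features
    C = Zc.T @ Zc / (n - 1)          # Eq. (1): unbiased sample covariance
    return C, mu

# toy region: 100 pixels, 5 features each
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))
C, mu = region_covariance(Z)
```

This matches `np.cov(Z, rowvar=False)`, whose default normalization is also n − 1.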
2.2 Positive Definite Similarity Computation
In general, covariance matrices are positive definite, except for some special cases; they are usually regularized to make them strictly positive definite. Hence, the region covariance descriptors belong to the space of symmetric positive definite matrices S_{++}^d, which lies on a Riemannian manifold, not in a Euclidean space. This fact makes the similarity measurement between two covariance matrices non-trivial. One well-known method for computing similarity is the Affine Invariant Riemannian Metric (AIRM) [26], which uses the corresponding geodesic distance on the manifold as a similarity measurement:

d_{geo}(X, Y) = \| \log(X^{-1/2} Y X^{-1/2}) \|_F   (3)

where log(·) is the matrix logarithm and \| \cdot \|_F is the Frobenius norm. This method is widely used in classification tasks that involve region covariance. However, the requirement of eigenvalue computation makes it very expensive when used in iterative optimization frameworks.
Many methods have been proposed to improve on AIRM. One is the Log-Euclidean Riemannian Metric (LERM) [2]:

d_{LE}(X, Y) = \| \log(X) - \log(Y) \|_F   (4)

This method maps the positive definite matrices into a flat space by taking the logarithm of the matrices, so that the Euclidean distance measurement can be used. While the logarithm of each of these matrices can be evaluated offline, computing the matrix logarithm is still expensive.
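For reference, both metrics reduce to an eigendecomposition of the SPD matrices. The sketch below is our own illustrative NumPy implementation of Eqs. (3) and (4), not the paper's code:

```python
import numpy as np

def spd_log(X):
    # matrix logarithm of a symmetric positive definite matrix via eigendecomposition
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T        # V diag(log w) V^T

def airm(X, Y):
    # Affine Invariant Riemannian Metric, Eq. (3): ||log(X^{-1/2} Y X^{-1/2})||_F
    w, V = np.linalg.eigh(X)
    X_inv_sqrt = (V / np.sqrt(w)) @ V.T
    M = X_inv_sqrt @ Y @ X_inv_sqrt     # still symmetric positive definite
    return np.linalg.norm(spd_log(M), 'fro')

def lerm(X, Y):
    # Log-Euclidean Riemannian Metric, Eq. (4): ||log(X) - log(Y)||_F
    return np.linalg.norm(spd_log(X) - spd_log(Y), 'fro')
```

Note that LERM needs only one `spd_log` per matrix (precomputable offline), whereas AIRM couples the pair through X^{-1/2}, which is why it is costly inside iterative solvers.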
More recently, the LogDet divergence [14] has been investigated:

D_{ld}(X, Y) = \mathrm{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - d   (5)

where \log\det is the logarithm of the matrix determinant and \mathrm{tr} is the matrix trace. This method was used in several tensor-based sparse coding methods [31, 32, 39]. The LogDet divergence reduces computational complexity by replacing the calculation of eigenvalues with determinants. It also avoids the explicit manifold embedding and results in a convex MAXDET problem. However, since the computation of matrix determinants in each iteration is still roughly O(d^3), where d is the column size of the region covariance matrix, the whole optimization process remains costly.
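Eq. (5) can be sketched in a few lines of NumPy (an illustrative implementation under our own naming, using `slogdet` for a stable log-determinant):

```python
import numpy as np

def logdet_div(X, Y):
    """LogDet (Burg) divergence, Eq. (5): tr(X Y^-1) - logdet(X Y^-1) - d."""
    d = X.shape[0]
    # Y^{-1} X has the same trace and determinant as X Y^{-1}
    M = np.linalg.solve(Y, X)
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - d
```

The divergence is nonnegative for SPD arguments and zero exactly when X = Y, but note it is not symmetric in its arguments.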
3 A Euclidean Space Representation for Region Covariance
In this section, we introduce our method to construct a small set of points that lie in Euclidean space and preserve the second-order statistics.
3.1 Understanding the Region Covariance
Covariance matrices used in finance usually represent the variance of stock prices and the correlations between different stocks. Region covariance in computer vision applications shares similar concepts. Given a set of features {z_k}, k = 1, ..., n, for all points in a region, each entry of the region covariance can be written as:

C(i, j) = \frac{1}{n-1} \sum_{k=1}^{n} (z_k(i) - \mu(i))(z_k(j) - \mu(j))   (6)

where z_k(i) is the i-th feature value for point k and μ(i) is the mean of the i-th feature. The diagonal entries of the covariance matrix represent the variances of each feature, while the off-diagonal entries represent the correlations of different features. To design a covariance representation, we want to include both of these terms.
3.2 Cholesky Decomposition
A meaningful region covariance matrix should be symmetric and positive semidefinite, and hence can be decomposed as the product:

C = L L^T   (7)

An obvious way to calculate L is Cholesky decomposition, which enjoys a low computation cost and preserves some properties of the covariance matrix [10]. Let L_1, L_2 be the lower triangular matrices calculated from C_1, C_2 using Cholesky decomposition; the distance between C_1, C_2 can then be approximated by:

d(C_1, C_2) \approx \| L_1 - L_2 \|_F   (8)

i.e. the Euclidean distance between the stacked lower-triangular entries. The Cholesky decomposition guarantees that the new representation L is unique for any strictly positive definite covariance matrix C.
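The Cholesky representation and the distance of Eq. (8) can be sketched as follows (our own illustrative code; the small ridge `eps` enforces the strict positive definiteness assumed above):

```python
import numpy as np

def chol_repr(C, eps=1e-6):
    """Vectorized Cholesky representation of a covariance matrix (Eq. 7)."""
    d = C.shape[0]
    # regularize to strict positive definiteness, then factor C = L L^T
    L = np.linalg.cholesky(C + eps * np.eye(d))
    # keep the d(d+1)/2 lower-triangular entries as a Euclidean vector
    return L[np.tril_indices(d)]

def chol_dist(C1, C2):
    """Eq. (8): Euclidean (Frobenius) distance between Cholesky factors."""
    return np.linalg.norm(chol_repr(C1) - chol_repr(C2))
```

For a 5 × 5 covariance this yields a 15-dimensional vector, and the distance is a plain Euclidean norm rather than a manifold geodesic.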
Although the representation based on Cholesky decomposition works in practice, it is difficult to interpret the entries of the lower triangular matrix. In particular, it is difficult to recover the correlation coefficients, which are directly available in the original covariance matrix.
3.3 Spherical Decomposition
Alternatively, we seek a lower triangular representation that not only obeys the decomposition rule, but also possesses better statistical interpretations. Inspired by the spherical parametrization method used in finance applications [27] for covariance estimation, a new representation can be constructed using spherical coordinates, which involves a series of rotational mappings from the standard basis to the lower triangular matrix [30]. We start with the lower triangular matrix L generated by the Cholesky decomposition, and then represent its i-th row l_i = (l_{i,1}, ..., l_{i,i}) as:

l_{i,1} = r_i \cos\theta_{i,1}, \quad
l_{i,j} = r_i \cos\theta_{i,j} \prod_{k=1}^{j-1} \sin\theta_{i,k} \ (1 < j < i), \quad
l_{i,i} = r_i \prod_{k=1}^{i-1} \sin\theta_{i,k}   (9)

where r_i = \|l_i\| is the radius of row i and \theta_{i,j} is an element of the angle set Θ. A special case of (9) is the first row, l_{1,1} = r_1. To ensure the uniqueness of converting from a covariance matrix to spherical coordinates, we must have:

\theta_{i,j} \in (0, \pi)   (10)
This new representation has the following statistical advantages:

- The diagonal entries of the covariance matrix are captured directly by the radii of the new representation: C(i, i) = r_i^2.
- Some of the correlation coefficients of the covariance matrix can be uniquely mapped to the new representation: \rho_{i,1} = \cos\theta_{i,1}.
- Elements of the new representation are independent of each other.
The new representation lies in a Cartesian space [30]; hence the distance can be measured using the Frobenius norm:

d(C_1, C_2) = \| \Theta_1 - \Theta_2 \|_F   (11)

where Θ_1 and Θ_2 are the new representations calculated by (9).
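The mapping of Eqs. (9)-(11) can be sketched as below. This is our own illustrative implementation: each Cholesky row is converted to a radius plus arccos-recovered angles, and the diagonal entry is left implicit since it is determined by the radius and the angles:

```python
import numpy as np

def spherical_repr(C, eps=1e-6):
    """Spherical parameterization of the Cholesky factor (Eqs. 9-10).

    Row i of L becomes a radius r_i = ||l_i|| plus angles
    theta_{i,1..i-1} in (0, pi). Returns a flat vector of length d(d+1)/2.
    """
    d = C.shape[0]
    L = np.linalg.cholesky(C + eps * np.eye(d))
    out = []
    for i in range(d):
        row = L[i, :i + 1]
        r = np.linalg.norm(row)
        out.append(r)                        # radius: r_i^2 = C(i, i)
        sin_prod = 1.0
        for j in range(i):
            # invert Eq. (9): l_{i,j} = r cos(theta) * prod of earlier sines
            c = np.clip(row[j] / (r * sin_prod), -1.0, 1.0)
            theta = np.arccos(c)
            out.append(theta)
            sin_prod *= np.sin(theta)
    return np.array(out)

def spherical_dist(C1, C2):
    """Eq. (11): plain Euclidean distance between the representations."""
    return np.linalg.norm(spherical_repr(C1) - spherical_repr(C2))
```

Because the Cholesky diagonal is positive, the recovered angles stay inside (0, π), which is exactly the uniqueness condition of Eq. (10).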
3.4 Combining with the Mean
The mean μ of the original features can be concatenated to Θ to make it more informative and robust:

\hat{\Theta} = [\Theta; \lambda\mu]   (12)

where λ is a parameter that balances the scale difference between the mean and our representation.
Our representation lies in Euclidean space and the similarity between representations can be simply measured by the Frobenius norm. Compared to the traditional covariance matrix, our new representation enjoys several advantages:

- Informative and robust. Our new representation preserves both the first- and second-order statistics. Since the region covariance only captures the differences between features, it may lose some useful statistics within separate feature channels. Hence, fusing feature means into our representation enhances robustness.
- Flexibility. The similarity measurement of our new representation can be calculated in Euclidean space, which enables applying many traditional learning methods to second-order statistics.
4 Discriminative Sparse Coding and Dictionary Learning
We next describe a method to learn a reconstructive and discriminative dictionary from multi-class data. We construct a sub-dictionary for each class, explicitly encourage independence between dictionary atoms from different sub-dictionaries, and leverage class information in the optimization problem. We adopt the LC-KSVD [12] method to learn the discriminative dictionary.
4.1 Dictionary Learning via Label Consistent Regularization
Let S be a set of N d-dimensional Euclidean-space region covariance representations for training the dictionary, i.e. S = [s_1, ..., s_N]. Learning a reconstructive dictionary with K atoms for sparse representation of S can be formulated as:

\langle D, A, W, X \rangle = \arg\min_{D,A,W,X} \|S - DX\|_2^2 + \alpha\|Q - AX\|_2^2 + \beta\|H - WX\|_2^2 \quad \text{s.t.} \ \forall i, \ \|x_i\|_0 \le T   (13)

where D is the learned overcomplete dictionary, X denotes the sparse codes for the given inputs, Q denotes the "discriminative" sparse codes, A is a linear transformation matrix defined to transform the original sparse codes to be most discriminative in sparse feature space, W denotes the parameters of a linear classifier f(x; W) = Wx, H contains the class labels, and T is a sparsity constraint factor.

Minimizing the objective function not only encourages independence between dictionary atoms from different sub-dictionaries, but also trains a linear classifier simultaneously. We use the efficient K-SVD algorithm to find the optimal solution for all parameters simultaneously.
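The sparse coding subproblem inside K-SVD-style learning is typically solved with Orthogonal Matching Pursuit. The single-signal sketch below is our own illustrative code, not the LC-KSVD implementation:

```python
import numpy as np

def omp(D, s, T):
    """Orthogonal Matching Pursuit for one signal:
    argmin_x ||s - D x||_2^2  s.t.  ||x||_0 <= T.
    D: (d, K) dictionary, s: (d,) signal, T: sparsity level."""
    residual = s.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(T):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit all selected atoms jointly (least squares on the support)
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        residual = s - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x
```

In a full K-SVD loop, this coding step alternates with per-atom dictionary updates; here it suffices to show where the hard sparsity constraint T enters.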
4.2 Classification
After obtaining the dictionary D and the linear classifier parameters W, the sparse representation x_i for a test input s_i can be calculated as:

x_i = \arg\min_x \|s_i - Dx\|_2^2 \quad \text{s.t.} \ \|x\|_0 \le T   (14)

We simply use the linear classifier W to estimate the label j of a test sample s_i:

j = \arg\max_j (W x_i)_j   (15)
4.3 Sparse Codes as Features
We can also fuse the generated sparse codes with other features. One drawback of (14) is that the sparsity constraint factor T is a hard threshold that forces the sparse codes to have fewer than T nonzero entries. This is good for a classification task, but when using sparse codes as features we are not concerned with the number of nonzero entries; instead we want to make the sparse codes as informative as possible. Hence, we consider a "soft" version of Equation (14):

x_i = \arg\min_x \|s_i - Dx\|_2^2 + \lambda_1 \|x\|_1 + \lambda_2 \|x\|_2^2   (16)

where λ_1 and λ_2 are the new sparsity constraint factors. These two parameters control the generation of more continuous sparse codes.
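One simple solver for an elastic-net-style objective like (16) is ISTA (iterative shrinkage-thresholding). The following is an illustrative sketch under our own naming, not the authors' solver; the ℓ2 term is folded into the smooth gradient and the ℓ1 term into the soft-threshold step:

```python
import numpy as np

def soft_codes(D, s, lam1=0.1, lam2=0.01, iters=1000):
    """ISTA sketch for the 'soft' objective of Eq. (16):
    min_x ||s - D x||_2^2 + lam1*||x||_1 + lam2*||x||_2^2."""
    # step size = 1 / Lipschitz constant of the smooth part's gradient
    step = 1.0 / (2.0 * (np.linalg.norm(D, 2) ** 2 + lam2))
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = 2.0 * (D.T @ (D @ x - s) + lam2 * x)   # gradient of smooth terms
        z = x - step * grad
        # soft-thresholding handles the l1 penalty
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
    return x
```

Unlike OMP's hard cap of T nonzeros, the codes here shrink continuously toward zero, which is the "soft" behavior the text asks for when the codes serve as features.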
5 Experiments
We evaluate our approach on several different tasks: texture classification, object classification, face recognition, material classification and person reidentification. Sample images for different tasks are shown in Figure 2. For fair comparison, we experiment on the same features as reported by other methods.
5.1 Texture Classification
Table 1: Texture classification accuracy under the nine test scenarios.

Scenario | logESR [38] | TSC [32] | RSR [8] | SDL [39] | Ours-chol | Ours-sphere
5c   | 0.88 | 1.00 | 0.99 | 0.99 | 0.99 | 0.98
5m   | 0.54 | 0.73 | 0.85 | 0.95 | 0.96 | 0.97
5v   | 0.73 | 0.86 | 0.89 | 0.90 | 0.92 | 0.91
5v2  | 0.70 | 0.85 | 0.89 | 0.93 | 0.94 | 0.95
5v3  | 0.65 | 0.83 | 0.87 | 0.84 | 0.97 | 0.92
10   | 0.60 | 0.81 | 0.85 | 0.89 | 0.82 | 0.81
10v  | 0.64 | 0.68 | 0.86 | 0.91 | 0.84 | 0.79
16c  | 0.68 | 0.75 | 0.83 | 0.86 | 0.89 | 0.89
16v  | 0.56 | 0.66 | 0.77 | 0.89 | 0.79 | 0.79
Evaluation Protocol. We follow the protocol in [29] to create texture mosaics under nine test scenarios. The scenarios contain various numbers of classes: 5-texture, 10-texture and 16-texture. Each image in the dataset is resized and cut into non-overlapping blocks, yielding 64 data samples per image. For each scenario, we randomly select 5 data samples for training and use the rest for testing. The evaluation is repeated 10 times.
Implementation Details. We extract intensity- and gradient-based features from each sample. They form a region covariance matrix and result in a 20-dimensional vector in our representation. We use the same parameter configuration in all test scenarios.
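The 20-dimensional figure is consistent with a 5-dimensional feature map: a 5 × 5 covariance contributes 15 lower-triangular Cholesky entries, plus 5 feature means. The sketch below is our own guess at such a pipeline (the exact five features are an assumption, since the text only says intensity and gradient):

```python
import numpy as np

def texture_descriptor(patch):
    """Hypothetical Euclidean region-covariance descriptor for a gray patch.

    Assumed 5-dim feature map: intensity plus first- and second-order
    gradient magnitudes. 15 Cholesky entries + 5 means = 20 dims.
    """
    Ix = np.gradient(patch, axis=1)          # first-order gradients
    Iy = np.gradient(patch, axis=0)
    Ixx = np.gradient(Ix, axis=1)            # second-order gradients
    Iyy = np.gradient(Iy, axis=0)
    Z = np.stack([patch, np.abs(Ix), np.abs(Iy), np.abs(Ixx), np.abs(Iyy)],
                 axis=-1).reshape(-1, 5)     # one 5-dim feature per pixel
    mu = Z.mean(axis=0)
    C = np.cov(Z, rowvar=False) + 1e-6 * np.eye(5)   # regularized covariance
    L = np.linalg.cholesky(C)
    return np.concatenate([L[np.tril_indices(5)], mu])   # 15 + 5 = 20 dims
```

The resulting vectors live in plain Euclidean space, so they can be fed directly to the dictionary learning of Section 4.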
Results. Table 1 shows the classification results under the nine scenarios. We compare our method with logESR[38], TSC[32], RSR[8], SDL[39]. The mean accuracy of our method achieves the best result in over half of the scenarios (5m, 5v, 5v2, 5v3, 16c). Overall, our maximum classification results over 10 runs are comparable to the best scores.
5.2 Object Classification
Evaluation Protocol. The ETH80 dataset [17] contains eight objects with ten instances each, collected from 41 different views, for 3280 images in total. Images of each object exhibit large viewpoint changes, which makes this dataset very challenging for the object recognition task.
Implementation Details. For each image, we generate a covariance matrix from a feature vector that includes the responses of a Laplacian of Gaussian filter and of the bank of Laws texture filters [15].
5.3 Face Recognition
Evaluation Protocol. The AR face dataset [22] contains over 4000 face images captured from 126 individuals. For each individual, there are 26 images split across two sessions. We follow the protocol used in [39] and randomly select 10 subjects for evaluation in our experiment. We repeat the evaluation 20 times.
Implementation Details. Each image is cropped and converted to gray scale. We extract the intensity and spatial information along with the responses g_{u,v} of Gabor filters with 8 orientations and 5 scales, where g_{u,v} is the response of a 2D Gabor wavelet [16] defined by:

g_{u,v}(x, y) = \frac{\|k_{u,v}\|^2}{\sigma^2} e^{-\|k_{u,v}\|^2 (x^2 + y^2)/(2\sigma^2)} \left( e^{i k_{u,v} \cdot (x, y)} - e^{-\sigma^2/2} \right)   (17)

where k_{u,v} = k_v e^{i\phi_u}, with scale k_v and orientation \phi_u = \pi u / 8.
5.4 Material Classification
Evaluation Protocol. The UIUC material dataset [20] contains eighteen categories with twelve images each (mainly belonging to bark, fabric, construction materials, outer coats of animals and so on). Images in each category have different scales and are collected in the wild, which makes this dataset very difficult. It is considered one of the state-of-the-art benchmarks for the material classification task. The standard evaluation protocol is to randomly split the dataset in half for training and use the rest for testing. We report the average accuracy over 10 repeats.
Implementation Details. For each image, we generate a covariance matrix using 128-dimensional SIFT features and 27-dimensional color features (raw RGB pixels around the center of the SIFT descriptor). We calculate the above region covariance matrices over a sliding window with a step size of 4.
5.5 Person Reidentification
Evaluation Protocol. The VIPeR dataset [5] contains 632 pedestrian image pairs captured from different camera views. The images are resized to a common size. They exhibit large viewpoint variations among pedestrian pairs, which makes this one of the most challenging datasets in person re-identification. We follow the widely used protocol of [6], splitting the 632 pedestrian pairs in half for training and half for testing. Two-fold cross-validation is applied during evaluation. We repeat the evaluation 10 times and report the average result.
Implementation Details. We extract blocks with a stride of 4 from each image. For each block, we extract gradient and color features in different channels (including RGB, HSV and color names [36]) to form region covariance matrices. This generates a region covariance matrix and results in a 65-dimensional vector in our representation. We then learn sparse codes and use them as features. Additionally, we extract color histograms in different channels (Lab, HSV and color names [36]) using stripes with a stride of 3 for consistency with our region covariance sparse code feature. The color histograms are further reduced to 300 dimensions by PCA. We concatenate the two features, normalize the maximum value to 1 for each sample, and use the information theoretic metric learning method [4] to learn the final ranks.
Table 5: Rank 1 accuracy on the VIPeR dataset.

Method | Rank 1 Accuracy
eBiCov [21] | 20.66
eSDC [41] | 26.31
PCCA [23] | 19.27
KISSME [13] | 19.60
LF [25] | 24.18
LADF [19] | 29.34
SalMatch [40] | 30.16
MidFilter [42] | 29.11
Ours-chol | 32.99
Ours-sphere | 32.84
Results. We compare our method with state-of-the-art methods that do not require foreground priors, such as PCCA [23], KISSME [13], eBiCov [21], eSDC [41], LF [25], SalMatch [40], LADF [19] and MidFilter [42]. Table 5 shows the rank 1 accuracy on the VIPeR dataset. The rank 1 results of our method outperform all the competing methods. Figure 3 contains the CMC ranking curve from rank 1 to rank 25. Our curve is competitive with most of the state-of-the-art methods. By visualizing the matching pairs (shown in Figure 4), we find our approach is good at finding discriminative textures, thanks to our region covariance representation.
6 Conclusion
We introduced a new representation for region covariance that lies in Euclidean space. This new representation not only shares the same second-order statistics with covariance matrices, but also includes the first-order statistics. Analysis shows its space and computation advantages over region covariance matrices. Additionally, the discriminative dictionary learning problem on this representation can be solved efficiently in the traditional K-SVD framework. Experiments on different tasks demonstrate that the proposed approach is effective and robust.
References
 [1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. TSP, 54(11):4311–4322, 2006.
 [2] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. LogEuclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2):411–421, 2006.
 [3] A. Cherian and S. Sra. Riemannian sparse coding for positive definite matrices. In ECCV, 2014.
 [4] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Informationtheoretic metric learning. In ICML, 2007.
 [5] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In PETS, 2007.
 [6] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person reidentification by symmetrydriven accumulation of local features. In CVPR, 2010.
 [7] K. Guo, P. Ishwar, and J. Konrad. Action recognition using sparse representation on covariance manifolds of optical flow. In AVSS, 2010.
 [8] M. Harandi, C. Sanderson, R. Hartley, and B. Lovell. Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach. In ECCV, 2012.
 [9] T. Hazan, S. Polak, and A. Shashua. Sparse image coding using a 3d nonnegative tensor factorization. In ICCV, 2005.
 [10] X. Hong, H. Chang, S. Shan, X. Chen, and W. Gao. Sigma set: A small second order statistical region descriptor. In CVPR, pages 1802–1809, June 2009.
 [11] H. Huo and J. Feng. Face recognition via aam and multifeatures fusion on riemannian manifolds. In ACCV, 2009.
 [12] Z. Jiang, Z. Lin, and L. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In CVPR, 2011.
 [13] M. Kostinger, M. Hirzer, P. Wohlhart, P. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
 [14] B. Kulis, M. Sustik, and I. Dhillon. Learning lowrank kernel matrices. In ICML, 2006.
 [15] K. Laws. Rapid texture identification. Proc. SPIE, 0238:376–381, 1980.
 [16] T. Lee. Image representation using 2d gabor wavelets. TPAMI, 18(10):959–971, 1996.
 [17] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In CVPR, 2003.
 [18] P. Li, Q. Wang, W. Zuo, and L. Zhang. Logeuclidean kernels for sparse representation and dictionary learning. In ICCV, 2013.
 [19] Z. Li, S. Chang, F. Liang, T. Huang, L. Cao, and J. Smith. Learning locallyadaptive decision functions for person verification. In CVPR, 2013.
 [20] Z. Liao, J. Rock, Y. Wang, and D. Forsyth. Nonparametric filtering for geometric detail extraction and material representation. In CVPR, 2013.
 [21] B. Ma, Y. Su, and F. Jurie. Bicov: a novel image representation for person reidentification and face verification. In BMVC, 2012.
 [22] A. Martinez and R. Benavente. The ar face database. In CVC Technical Report 24, 1998.
 [23] A. Mignon and F. Jurie. Pcca: A new approach for distance learning from sparse pairwise constraints. In CVPR, 2012.
 [24] Y. Pang, Y. Yuan, and X. Li. Gaborbased region covariance matrices for face recognition. Circuits and Systems for Video Technology, IEEE Transactions on, 18(7):989–993, 2008.
 [25] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local fisher discriminant analysis for pedestrian reidentification. In CVPR, 2013.
 [26] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. IJCV, 66(1):41–66, 2006.
 [27] J. Pinheiro and D. Bates. Unconstrained parameterizations for variancecovariance matrices. Statistics and Computing, 6:289–296, 1996.
 [28] F. Porikli and O. Tuzel. Fast construction of covariance matrices for arbitrary size image windows. In ICIP, 2006.
 [29] T. Randen and J. Husoy. Filtering for texture classification: a comparative study. TPAMI, 21(4):291–310, 1999.
 [30] F. Rapisarda, D. Brigo, and F. Mercurio. Parameterizing correlations: a geometric interpretation. IMA J Management Math, 18(1):55–73, 2007.
 [31] R. Sivalingam, D. Boley, V. Morellas, and N. Papanikolopoulos. Tensor sparse coding for region covariances. In ECCV, 2010.
 [32] R. Sivalingam, D. Boley, V. Morellas, and N. Papanikolopoulos. Positive definite dictionary learning for region covariances. In ICCV, 2011.
 [33] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, 2006.
 [34] O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on riemannian manifolds. In CVPR, 2007.
 [35] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on riemannian manifolds. TPAMI, 30(10):1713–1727, 2008.
 [36] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for realworld applications. TIP, 18(7):1512–1523, 2009.
 [37] R. Wang, H. Guo, L. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, 2012.
 [38] C. Yuan, W. Hu, X. Li, S. Maybank, and G. Luo. Human action recognition under logeuclidean riemannian metric. In ACCV, 2009.
 [39] Y. Zhang, Z. Jiang, and L. Davis. Discriminative tensor sparse coding for image classification. In BMVC, 2013.
 [40] R. Zhao, W. Ouyang, and X. Wang. Person reidentification by salience matching. In ICCV, 2013.
 [41] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person reidentification. In CVPR, 2013.
 [42] R. Zhao, W. Ouyang, and X. Wang. Learning midlevel filters for person reidentification. In CVPR, 2014.