Second-order Democratic Aggregation
Abstract
Aggregated second-order features extracted from deep convolutional networks have been shown to be effective for texture generation, fine-grained recognition, material classification, and scene understanding. In this paper, we study a class of orderless aggregation functions designed to minimize interference or equalize contributions in the context of second-order features, and we show that they can be computed just as efficiently as their first-order counterparts and that they have favorable properties over aggregation by summation. Another line of work has shown that matrix power normalization after aggregation can significantly improve the generalization of second-order representations. We show that matrix power normalization implicitly equalizes contributions during aggregation, thus establishing a connection between matrix normalization techniques and prior work on minimizing interference. Based on the analysis, we present γ-democratic aggregators that interpolate between sum (γ=1) and democratic pooling (γ=0), outperforming both on several classification tasks. Moreover, unlike power normalization, the democratic aggregations can be computed in a low-dimensional space by sketching, which allows the use of very high-dimensional second-order features. This results in state-of-the-art performance on several datasets.
Keywords:
Second-order features, democratic pooling, matrix power normalization, tensor sketching

1 Introduction
Second-order statistics have been demonstrated to improve the performance of classification on images of objects, scenes and textures as well as fine-grained problems, action classification and tracking [52, 44, 30, 15, 6, 25, 34, 33]. In the simplest form, such statistics are obtained by taking the outer product of feature vectors and aggregating them over some region of interest, which results in an autocorrelation [6, 24] or covariance matrix [52]. Such a second-order image descriptor is then passed as a feature to train an SVM or another classifier. Several recent works obtained an increase in accuracy after switching from first- to second-order statistics [25, 24, 34, 32, 33, 59, 26, 28]. Further improvements were obtained by considering the impact of the spectrum of such statistics on aggregation into the final representations [25, 24, 37, 31, 33, 26, 28]. For instance, the analysis conducted in [25, 24] concluded that decorrelating feature vectors from an image via matrix power normalization has a positive impact on classification due to its signal whitening properties, which prevent so-called bursts of features [19]. However, evaluating the power of a d×d matrix is a costly procedure whose complexity is dominated by that of the SVD, O(d³) in practice. In recent CNN approaches [31, 33, 28] which perform end-to-end learning, this complexity becomes a prohibitive factor for typical feature dimensions d due to a costly backpropagation step which involves the SVD or solving a Lyapunov equation [33] in every iteration of the CNN fine-tuning process, thus adding several hours of computations to training. In contrast, another line of aggregation mechanisms aims to reweight the first-order feature vectors prior to their aggregation [37] in order to balance their contributions to the final image descriptor. Such a reweighting scheme, called democratic aggregation [21, 37], is solved very efficiently by a modified Sinkhorn algorithm [23].
In this paper, we study democratic aggregation in the context of second-order feature descriptors and show that this descriptor has favorable properties when combined with the democratic aggregator, which was applied originally to first-order descriptors. We take a closer look at the relation between the reweighted representations and matrix power normalization in terms of the variance of feature contributions. In addition, we propose a γ-democratic aggregation scheme which generalizes democratic aggregation and allows us to interpolate between sum pooling and democratic pooling. We show that our formulation can be solved via the Sinkhorn algorithm as efficiently as the approach of [37], while resulting in performance comparable to matrix power normalization. Computationally, our approach involves Sinkhorn iterations, which require only matrix-vector multiplications, and is faster by an order of magnitude even when compared to approximate matrix power normalization via Newton's method, which involves matrix-matrix operations [33]. Unlike matrix power normalization, our democratic aggregation can be performed via sketching [42, 12], enabling the use of high-dimensional feature vectors.
To summarize, our contributions are: (i) we propose a new second-order γ-democratic aggregation, (ii) we obtain the reweighting factors via the Sinkhorn algorithm, which enjoys an order of magnitude speedup over fast matrix power normalization via Newton's iterations while achieving comparable results, (iii) we provide theoretical bounds on feature contributions in relation to matrix power normalization, (iv) we present state-of-the-art results on several datasets by applying democratic aggregation of second-order representations with sketching.
2 Related work
Mechanisms for aggregating first- and second-order features have been extensively studied in the context of image retrieval, texture and object recognition [40, 41, 47, 20, 38, 44, 53, 6, 25]. In what follows, we first describe shallow approaches and non-Euclidean aggregation schemes, followed by CNN-based approaches.
Shallow Approaches.
Early approaches to aggregating second-order statistics include Region Covariance Descriptors [44, 53], Fisher Vector Encoding [40, 41, 47] and the Vector of Locally Aggregated Tensors [38], to name but a few.
Region Covariance Descriptors capture co-occurrences of luminance and of first- and second-order partial derivatives of images [44, 53] and, in some cases, even binary patterns [46]. The main principle of these approaches is to aggregate the co-occurrences of feature vectors into a matrix which represents an image.
Fisher Vector Encoding [40] precomputes a visual vocabulary by clustering over a set of feature vectors and captures the element-wise squared difference between each feature vector and its nearest cluster center. Subsequently, renormalization of the captured statistics with respect to the cluster variance and sum aggregation are performed. Furthermore, the extension [41] proposes to apply the element-wise square root to the aggregated statistics, which improves classification results. The Vector of Locally Aggregated Tensors extends Fisher Vector Encoding to second-order off-diagonal feature interactions.
Non-Euclidean Distances.
To take full advantage of the statistics captured by scatter matrices, several works employ non-Euclidean distances. For positive definite matrices, geodesic distances (or their approximations) known from Riemannian geometry are used [39, 3, 2]. The Power-Euclidean distance [11] extends to positive semi-definite matrices. Distances such as the Affine-Invariant Riemannian Metric [39, 3], the KL-Divergence Metric [55], the Jensen-Bregman LogDet Divergence [7] and the Log-Euclidean distance [2] are frequently used for comparing scatter matrices resulting from the aggregation of second-order statistics. However, the above distances are notoriously difficult to backpropagate through for end-to-end learning and are often computationally prohibitive [27].
Pooling Normalizations.
Both first- and second-order aggregation methods often employ normalizations of the pooled feature vectors. Early works on image retrieval apply the square root [19] to aggregated feature vectors to limit the impact of frequently occurring features and boost the impact of infrequent and highly informative ones (the so-called notion of feature bursts). The roots of this approach in computer vision can be traced back to the so-called generalized histogram of intersection kernel [5]. For second-order approaches, a similar strategy is used by Fisher Vector Encoding [41]. The notion of bursts is further studied in the context of the Bags-of-Words approach as well as scatter matrices and tensors, for which the spectra are power normalized [24, 25, 26] (so-called Eigenvalue Power Normalization, or EPN for short). However, the square complexity of scatter matrices w.r.t. the length of feature vectors makes them somewhat impractical in classification. A recent study [21, 37] shows how to exploit second-order image-wise statistics and reweight sets of feature vectors per image at aggregation time to obtain an informative first-order representation. The so-called Democratic Aggregation (DA) and Generalized Max-Pooling (GMP) strategies are proposed, whose goal is to reweight feature vectors per image prior to sum aggregation so that interference between frequent and infrequent feature vectors is minimized. Strategies such as EPN (Matrix Power Normalization, MPN for short, is a special case of EPN), DA and GMP can be seen as ways of equalizing the contributions of feature vectors to the final image descriptor, and they are closely related to Zero-phase Component Analysis (ZCA), whose role is to whiten the signal representation.
Pooling and Aggregation in CNNs.
Early image retrieval and recognition CNN-based approaches aggregate first-order statistics extracted from the CNN maps, e.g., [14, 57, 1]. In [14], multiple feature vectors are aggregated over multiple image regions. In [57], feature vectors are aggregated for retrieval. In [1], the so-called VLAD descriptor is extended to allow end-to-end training.
More recent approaches form co-occurrence patterns from CNN feature vectors, similar in spirit to Region Covariance Descriptors. In [34], the authors combine two CNN streams of feature vectors via the outer product and demonstrate that such a setup is robust for the task of fine-grained image recognition. A recent approach [49] extracts feature vectors at two separate locations in feature maps and performs an outer product to form a CNN co-occurrence layer.
Furthermore, a number of recent approaches are dedicated to performing backpropagation on spectrum-normalized scatter matrices [18, 17, 31, 33, 28]. In [18], the authors employ backpropagation via the SVD of the matrix to implement the Log-Euclidean distance in an end-to-end fashion. In [31], the authors extend Eigenvalue Power Normalization [25] to an end-to-end learning scenario, which also requires backpropagation via the SVD of the matrix. Concurrently, the approach [33] suggests performing Matrix Power Normalization via Newton's method and backpropagating w.r.t. the square root of the matrix by solving a Lyapunov equation for greater numerical stability. The approach [58] phrases matrix normalization as a problem of robust covariance estimation. Lastly, compact bilinear pooling [12] uses so-called tensor sketching [42]. Where indicated, we also make use of tensor sketching in our work.
To date, no connection has been made between reweighting feature vectors and its impact on the spectrum of the corresponding scatter matrix. Our work is closely related to the approaches [21, 37]; however, we introduce a mechanism for limiting interference in the context of second-order features. We demonstrate its superiority over the first-order approaches [21, 37] and show that we can obtain results comparable to the matrix square root aggregation [33] with much lower computational complexity at the training and testing stages.
3 Method
Given a sequence of features X = (x₁, …, xₙ), where xᵢ ∈ ℝᵈ, we are interested in a class of functions that compute an orderless aggregation of the sequence to obtain a global descriptor ξ(X). If the descriptor is orderless, any permutation of the features does not affect the global descriptor. A common approach is to encode each feature using a non-linear function φ(x) before aggregation via a simple symmetric function such as sum or max. For example, the global descriptor using sum pooling can be written as:
ξ(X) = ∑_{x∈X} φ(x)      (1)
In this work, we investigate outer-product encoders, i.e. φ(x) = vec(xxᵀ), where ᵀ denotes the transpose and vec(·) is the vectorization operator. Thus, if x is d-dimensional then φ(x) is d²-dimensional.
3.1 Democratic aggregation
The democratic aggregation approach was proposed in [37] to minimize interference or equalize the contributions of each element in the sequence. The contribution of a feature is measured as the similarity of the feature to the overall descriptor. In the case of sum pooling, the contribution of a feature x is given by:
C(x) = φ(x)ᵀ ξ(X)      (2)
For sum pooling, the contributions may not be equal for all features x. In particular, the contribution is affected by both the norm and the frequency of the feature. Democratic aggregation is a scheme that weights each feature by a scalar λ(x), which depends on both x and the overall set of features in X, such that the weighted aggregation ξ(X) = ∑_{x∈X} λ(x) φ(x) satisfies:
λ(x) φ(x)ᵀ ∑_{x′∈X} λ(x′) φ(x′) = C,  ∀x ∈ X      (3)
under the constraint that λ(x) > 0, ∀x ∈ X. The above equation only depends on the dot products between the elements since:
λ(x) ∑_{x′∈X} λ(x′) k(x, x′) = C,  ∀x ∈ X      (4)
where k(x, x′) = φ(x)ᵀφ(x′) denotes the dot product between the two encoded vectors φ(x) and φ(x′). Following the notation in [37], if we denote by K the n×n kernel matrix of the set X, the above constraint is equivalent to finding a vector λ of weights such that:
diag(λ) K λ = C·1ₙ      (5)
where diag(·) is the diagonalization operator and 1ₙ is an n-dimensional vector of ones. In practice, the aggregated features are normalized, hence the constant C does not matter and can be set to 1.
The authors of [37] noted that the above equation can be efficiently solved by a dampened Sinkhorn algorithm [23]. The algorithm returns a unique solution as long as certain conditions are met, namely that the entries in K are non-negative and the matrix is not fully decomposable. In practice, these conditions may not be satisfied since the dot product between two features can be negative. A solution proposed in [37] is to compute λ after setting the negative entries in K to zero.
For completeness, the dampened Sinkhorn algorithm is included in Algorithm 1. Given n features of d dimensions, computing the kernel matrix K takes O(n²d) time, whereas each Sinkhorn iteration takes O(n²) time. In practice, 10 iterations are sufficient to find a good solution. A damping factor τ = 0.5 is typically used. This slows the convergence rate but avoids oscillations and other numerical issues associated with the undampened version (τ = 1).
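As a concrete illustration, the dampened Sinkhorn iterations can be sketched in a few lines of NumPy. This is a minimal sketch of Algorithm 1 under our own naming and numerical-guard choices (`democratic_weights`, `eps`), not the authors' reference implementation.

```python
import numpy as np

def democratic_weights(K, n_iter=10, tau=0.5, eps=1e-12):
    """Dampened Sinkhorn iterations (a sketch of Algorithm 1).

    K   : (n, n) kernel matrix with non-negative entries.
    tau : damping factor; tau = 1 is the undampened version.
    Returns weights lam approximately satisfying diag(lam) K lam = 1.
    """
    n = K.shape[0]
    lam = np.ones(n)
    for _ in range(n_iter):
        sigma = lam * (K @ lam)                  # current contributions
        lam = lam / np.maximum(sigma, eps) ** tau
    return lam
```

With the constant C set to 1 (the aggregated features are normalized afterwards anyway), the fixed point of the update makes every contribution λᵢ(Kλ)ᵢ equal.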
3.1.1 γ-democratic aggregation.
We propose a parametrized family of democratic aggregation functions that interpolate between sum pooling and fully democratic pooling. Given a parameter 0 ≤ γ ≤ 1, the γ-democratic aggregation is obtained by solving for a vector λ of weights such that:
diag(λ) K λ = (K·1ₙ)^γ      (6)
where the power is applied elementwise. When γ = 0, this corresponds to the fully democratic aggregation, and when γ = 1, this corresponds to sum aggregation since λ = 1ₙ satisfies the above equation. The above equation can be solved by modifying the update rule for computing σ in the Sinkhorn iterations to:
σ = diag(λ) K λ ⊘ (K·1ₙ)^γ      (7)
in Algorithm 1, where ⊘ denotes elementwise division. Thus, the solution can be obtained equally efficiently for any value of γ. Intermediate values of γ allow the contributions of each feature within the set to vary and, in our experiments, we find this can lead to better results than either extreme (i.e., γ = 0 or γ = 1).
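For illustration, the modified σ update changes the plain Sinkhorn sketch by a single elementwise division. Again this is our own sketch (the name `gamma_democratic_weights` is hypothetical), not the reference implementation.

```python
import numpy as np

def gamma_democratic_weights(K, gamma, n_iter=10, tau=0.5, eps=1e-12):
    """Solve diag(lam) K lam = (K 1)^gamma by dampened Sinkhorn."""
    n = K.shape[0]
    lam = np.ones(n)
    # right-hand side of Eq. (6), elementwise power
    target = np.maximum(K @ np.ones(n), eps) ** gamma
    for _ in range(n_iter):
        # Eq. (7): contributions measured relative to (K 1)^gamma
        sigma = lam * (K @ lam) / target
        lam = lam / np.maximum(sigma, eps) ** tau
    return lam
```

For γ = 1, σ already equals 1 at λ = 1ₙ, so sum pooling is recovered with no movement of the weights; γ = 0 reduces to the fully democratic case.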
3.1.2 Second-order democratic aggregation.
In practice, features extracted using deep ConvNets can be high-dimensional. For example, an input image is passed through the layers of a ConvNet to obtain a feature map of size W×H×d. Here, d corresponds to the number of filters in the convolutional layer and n = W×H corresponds to the spatial resolution of the feature map. For state-of-the-art ConvNets from which features are typically extracted, the values of n and d are comparable and in the range of a few hundred to a thousand. Thus, explicitly realizing the outer products can be expensive. Below we show several properties of democratic aggregation with outer-product encoders. Some of these properties allow aggregation in a computationally and memory efficient manner.
Proposition 1
For outer-product encoders, the solution to the γ-democratic weights exists for all values of γ as long as ∥x∥ > 0 for all x ∈ X.
Proof
For the outer-product encoder we have:

k(x, x′) = φ(x)ᵀφ(x′) = vec(xxᵀ)ᵀ vec(x′x′ᵀ) = (xᵀx′)²

Thus, all the entries of the kernel matrix are non-negative and its diagonal entries are strictly positive when ∥x∥ > 0 for all x ∈ X. This is a sufficient condition for the solution to exist [23]. Note that the kernel of the outer-product encoders is positive even when xᵀx′ < 0.
Proposition 2
For outer-product encoders, the solution to the γ-democratic weights can be computed in O(n²d) time and O(n²) space.
Proof
The running time of the Sinkhorn algorithm is dominated by the time to compute the kernel matrix K. Naively computing the kernel matrix for the d²-dimensional encoded features would take O(n²d²) time and O(nd²) space. However, since the kernel entries of the outer products are just the squares of the kernel entries of the features before the encoding step, one can compute the kernel by squaring, elementwise, the kernel of the raw features, which takes O(n²d) time and O(n²) space. Thus, the weights for the second-order features can also be computed in O(n²d) time and O(n²) space.
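The squaring trick in the proof is easy to verify numerically. The snippet below is an illustrative check with arbitrary sizes of our own choosing; it forms the d²-dimensional encodings explicitly only to confirm that they are never needed in practice.

```python
import numpy as np

np.random.seed(0)
n, d = 8, 5
X = np.random.randn(n, d)

# first-order kernel of the raw features: O(n^2 d)
K1 = X @ X.T

# kernel of the outer-product encodings phi(x) = vec(x x^T),
# formed explicitly here at O(n^2 d^2) cost for comparison only
Phi = np.stack([np.outer(x, x).ravel() for x in X])
K2 = Phi @ Phi.T

# the second-order kernel is the elementwise square of the first
assert np.allclose(K2, K1 ** 2)
```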
Proposition 3
For outer-product encoders, democratic aggregation can be computed with a low memory overhead using Tensor Sketching.
Proof
Let ts(x) be a low-dimensional embedding that approximates the inner product between two outer products, i.e.,
ts(x)ᵀ ts(x′) ≈ vec(xxᵀ)ᵀ vec(x′x′ᵀ) = (xᵀx′)²      (8)
with ts(x) ∈ ℝᵏ and k ≪ d². Since the democratic aggregation of X is a linear combination of the outer products, the overall feature can be written as:
ξ(X) = ∑_{x∈X} λ(x) vec(xxᵀ) ≈ ∑_{x∈X} λ(x) ts(x)      (9)
Thus, instead of realizing the overall feature of size d², one can use the embedding to obtain a feature of size k as a democratic aggregation of the approximate outer products. One example of an approximate outer-product embedding is the Tensor Sketching (TS) approach of Pham and Pagh [42]. Tensor sketching has been used to approximate second-order sum pooling [12], resulting in an order-of-magnitude savings in space at a marginal loss in performance on classification tasks. Our experiments show that sketching also performs well in the context of democratic aggregation.
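A minimal Tensor Sketch can be written in NumPy following the count-sketch-plus-FFT construction of [42]; the function name, variable names and sizes below are our own illustrative choices.

```python
import numpy as np

def tensor_sketch(x, h1, s1, h2, s2, D):
    """Sketch vec(x x^T) into D dimensions.

    h1, h2 : hash indices in [0, D); s1, s2 : random +/-1 signs.
    The hashes are drawn once and shared across all features
    whose sketches are to be compared.
    """
    cs1 = np.zeros(D); np.add.at(cs1, h1, s1 * x)   # count sketch 1
    cs2 = np.zeros(D); np.add.at(cs2, h2, s2 * x)   # count sketch 2
    # circular convolution of the two count sketches via FFT
    return np.real(np.fft.ifft(np.fft.fft(cs1) * np.fft.fft(cs2)))
```

In expectation, ts(x)ᵀts(x′) equals (xᵀx′)², so the democratic weights and the aggregation of Eq. (9) can be computed entirely in the D-dimensional sketch space.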
3.2 Spectral normalization of second-order representations
A different line of work [6, 33, 31, 58] has investigated matrix functions to normalize the second-order representations obtained by sum pooling. For example, improved bilinear pooling [33] and second-order approaches [24, 25, 28] construct a global representation by sum pooling of outer products:
A = ∑_{x∈X} x xᵀ      (10)
The matrix A is subsequently normalized using the matrix power function Aᵖ with 0 < p < 1. When p = 1/2, this corresponds to the matrix square root, which is defined as the matrix Z such that ZZ = A. The matrix function can be computed using the Singular Value Decomposition (SVD). Since A is symmetric positive semi-definite, its SVD is given by A = UΣUᵀ with Σ = diag(σ₁, …, σ_d) and σ₁ ≥ … ≥ σ_d ≥ 0, and the matrix function can be written as f(A) = U f(Σ) Uᵀ, where f is applied to the elements on the diagonal of Σ. Thus, the matrix power can be computed as Aᵖ = U Σᵖ Uᵀ. Such spectral normalization techniques scale the spectrum of the matrix A. The following establishes a connection between spectral normalization techniques and democratic pooling.
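For a symmetric PSD matrix the SVD coincides with the eigendecomposition, so the power can be sketched as follows (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def matrix_power_sym(A, p):
    # eigendecomposition of the symmetric PSD matrix A = U diag(s) U^T
    s, U = np.linalg.eigh(A)
    s = np.maximum(s, 0.0)          # clip tiny negative eigenvalues
    return (U * s ** p) @ U.T       # U diag(s^p) U^T

# sanity check: the matrix square root squares back to A
np.random.seed(0)
X = np.random.randn(50, 8)
A = X.T @ X
Z = matrix_power_sym(A, 0.5)
assert np.allclose(Z @ Z, A)
```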
Let Â = Aᵖ/∥Aᵖ∥_F be the normalized version of Aᵖ, and let μ_max and μ_min be the maximum and minimum squared radii of the data, defined as:
μ_max = max_{x∈X} ∥x∥²,  μ_min = min_{x∈X} ∥x∥²      (11)
As earlier, let C(x) be the contribution of the vector x to the aggregated representation, defined as:
C(x) = xᵀ Â x      (12)
Proposition 4
The following properties hold true:

- The norm of Aᵖ is ∥Aᵖ∥_F = (∑ᵢ σᵢ²ᵖ)^{1/2}.
- ∑_{x∈X} C(x) = ∑ᵢ σᵢᵖ⁺¹ / (∑ⱼ σⱼ²ᵖ)^{1/2}.
- The maximum value C_max ≤ σ₁ᵖ μ_max / (∑ⱼ σⱼ²ᵖ)^{1/2}.
- The minimum value C_min ≥ σ_dᵖ μ_min / (∑ⱼ σⱼ²ᵖ)^{1/2}.
Proof
The proof is deferred to the supplementary material.
Proposition 5
The variance of the contributions satisfies:

var({C(x)}) ≤ (C_max − μ)(μ − C_min) ≤ ¼ (C_max − C_min)²      (13)
where C_max and C_min are the maximum and minimum values defined above and μ is the mean of the contributions, given by μ = (1/n) ∑_{x∈X} C(x), where n = |X| is the cardinality of X. All of the above quantities can be computed from the spectrum of the matrix A.
Proof
The left inequality is the Bhatia-Davis inequality and the right one is Popoviciu's inequality on variances; both follow from the bounds C_min ≤ C(x) ≤ C_max.

The above shows that smaller exponents p reduce an upper bound on the variance of the contributions, thereby equalizing the contributions. The upper bound is a monotonic function of the exponent p and is minimized at p = 0, which reduces the whole spectrum to that of an identity matrix. This corresponds to whitening of the matrix A. However, complete whitening often leads to poor results, while intermediate values such as p = 1/2 can be significantly better than p = 1 [24, 25, 33, 31]. In the experiments section we evaluate these bounds on deep features from real data.
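The bound is straightforward to check numerically. The snippet below evaluates the contributions C(x) = xᵀÂx on random data and verifies both inequalities of Eq. (13) for several exponents; the sizes and seed are arbitrary choices of ours.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(200, 16)                 # n features of dimension d
A = X.T @ X

for p in [0.25, 0.5, 1.0]:
    s, U = np.linalg.eigh(A)
    Ap = (U * np.maximum(s, 0.0) ** p) @ U.T
    Ahat = Ap / np.linalg.norm(Ap)           # Frobenius normalization
    C = np.einsum('ni,ij,nj->n', X, Ahat, X) # contributions x^T Ahat x
    mu, var = C.mean(), C.var()
    # tighter (mean-dependent) bound, then the looser range bound
    assert var <= (C.max() - mu) * (mu - C.min()) + 1e-12
    assert var <= 0.25 * (C.max() - C.min()) ** 2 + 1e-12
```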
Proposition 6
For exponents p < 1, the matrix power Aᵖ may not lie in the linear span of the outer products of the features in X.
The proof of Proposition 6 is deferred to the supplementary material. A consequence is that the matrix power cannot be easily computed in a low-dimensional embedding space of the outer-product encoding, such as Tensor Sketch. It does, however, lie in the linear span of the outer products of the eigenvectors of A. Computing eigenvectors, though, can be significantly slower than computing weighted aggregates. We describe the computation and memory trade-offs between computing matrix powers and democratic pooling in Section 4.5.
4 Experiments
We analyze the behavior of matrix power normalization and democratic pooling empirically on several fine-grained and texture recognition datasets. The general experimental setting and the datasets are described in Section 4.1. We validate the theoretical bounds on the feature contributions with real data in Section 4.2. We compare our models against the sum-pooling baseline, matrix power normalization, and other state-of-the-art methods in Sections 4.3 and 4.4. Finally, we include a discussion of the runtime and memory consumption of the various approaches and a technique to perform end-to-end fine-tuning in Section 4.5.
4.1 Experimental setup
Datasets. We experiment on the Caltech-UCSD Birds [56], Stanford Cars [29] and FGVC Aircrafts [35] datasets. The Birds dataset contains 11,788 images spanning 200 bird species. The Stanford Cars dataset consists of 16,185 images across 196 categories, and FGVC Aircrafts provides 10,000 images of 100 categories. For each dataset, we use the train and test splits provided by the benchmarks, and only the corresponding category labels are used during the training phase. In addition to the above fine-grained classification tasks, we also analyze the performance of various approaches on the following datasets: the Describable Texture Dataset (DTD) [8], the Flickr Material Dataset (FMD) [48] and the MIT indoor scene dataset [45]. DTD consists of 5,640 images across 47 texture attributes. We report results averaged over the 10 splits provided by the dataset. FMD provides 1,000 images from 10 different material categories. For each category, we randomly split half of the images for training and the rest for testing, and report results across multiple splits. The MIT indoor scene dataset contains 67 indoor scene categories, each of which includes 80 images for training and 20 for testing.
Features.
We aggregate second-order features with democratic pooling and matrix power normalization using the VGG-16 [50] and ResNet-101 [16] networks. We follow the work [34] and resize input images to 448×448, aggregating the last convolutional layer features after the ReLU activations. For the VGG-16 architecture, this results in feature maps of size 28×28×512 (before aggregation), while for the ResNet-101 architecture this results in maps of size 14×14×2048. For democratic pooling, we run the modified Sinkhorn algorithm for 10 iterations with the damping factor τ = 0.5. Fully democratic pooling [37] and sum pooling can be implemented by setting γ = 0 and γ = 1, respectively. The aggregated features are followed by element-wise signed square root and ℓ₂ normalization. For the fine-grained recognition datasets, we aggregate the VGG-16 features fine-tuned with vanilla B-CNN models, while ImageNet-pretrained networks without fine-tuning are used for the texture and scene datasets.
4.2 The distribution of the spectrum and feature contributions
In this section, we analyze how democratic pooling and matrix normalization affect the spectrum (the set of eigenvalues) of the aggregated representation, as well as how the contributions of individual features are distributed as a function of γ for the γ-democratic pooling and of the power p for the matrix power normalization.
We randomly sampled 50 images each from the CUB and MIT indoor datasets and plotted the spectrum (normalized to unit length) and the feature vector contributions (Eq. (12)) in Figure 1. In this experiment, we use the matrix power p = 0.5 and γ = 0.5. Figure 1(a) shows that the square root yields a flatter spectrum in comparison to sum aggregation. Democratic aggregation distributes the energy away from the top eigenvalues but has a considerably sharper spectrum in comparison to the square root. The γ-democratic pooling interpolates between sum and fully democratic pooling.
Figure 1(b) shows the contribution of each feature to the aggregate for the different pooling techniques (Eq. (12)). The contributions are more evenly distributed for the matrix square root in comparison to sum pooling. Democratic pooling flattens the individual contributions the most; we note that it is explicitly designed to have this effect. These two plots show that democratic aggregation and power normalization both achieve equalization of feature contributions.
Figure 1: (a) spectrum (eigenvalues); (b) feature contributions.
Figure 2: (a) bounds on contributions; (b) bounds on variance.
Figure 2 shows the variances of the contributions to the aggregation using the VGG-16 features for different values of the exponent p. Figure 2(a) shows the true minimum, maximum and mean contributions as well as the bounds on these quantities from Proposition 4. The upper bound on the maximum contribution is tight on both datasets, as can be seen in the overlapping red lines, while the lower bound is significantly less tight.
Figure 2(b) shows the true variance and the two different upper bounds on the variance of the contributions given in Proposition 5 and Eq. (13). The tighter bound, shown by the dashed red line, corresponds to the version with the mean in Eq. (13). The plot shows that matrix power normalization implicitly reduces the variance of the feature contributions, similar to the equalization of feature vector contributions in democratic aggregation. These plots are averaged over 50 examples from the CUB-200 and MIT indoor datasets.
4.3 Effect of γ on γ-democratic pooling
Table 1 shows the performance as a function of γ for the γ-democratic pooling and of the power p for the matrix normalization on the VGG-16 network. For the DTD dataset, we report results on the first split. For the FMD dataset, we randomly sample half of the data in each category for training and use the rest for testing. We use the standard training and testing splits on the remaining datasets. We augment the training set by flipping its images and train one-vs-all linear SVM classifiers (one per class) with a fixed regularization hyperparameter. At test time, we average the predictions from an image and its flipped copy. The optimal γ and matrix power p are also reported.
The results for sum pooling correspond to the symmetric B-CNN models [33]. Fully democratic pooling (γ = 0) improves the performance over sum pooling by 0.7-1%. However, fully equalizing the feature contributions hurts performance on the Stanford Cars and FMD datasets. Table 1 shows that moderating the contributions by adjusting γ outperforms both sum pooling and fully democratic pooling.
Matrix power normalization outperforms γ-democratic pooling by 0.2-1%. However, computing matrix powers of covariance matrices is computationally expensive compared to our democratic aggregation. We discuss these trade-offs in Section 4.5.
Table 1: Accuracy as a function of the aggregation scheme on the VGG-16 network. The optimal γ and matrix power p are shown in parentheses.

Dataset              Democratic (γ=0)   Optimal γ       Sum (γ=1)   Matrix power (p)
Caltech-UCSD Birds   84.7               84.9 (γ=0.5)    84.0        85.9 (p=0.3)
Stanford Cars        89.7               90.8 (γ=0.5)    90.6        91.7 (p=0.5)
FGVC Aircrafts       86.7               86.7 (γ=0.0)    85.7        87.6 (p=0.3)
DTD                  72.2               72.3 (γ=0.3)    71.2        72.9 (p=0.6)
FMD                  82.8               84.8 (γ=0.8)    84.6        85.0 (p=0.7)
MIT indoor           79.6               80.4 (γ=0.3)    79.5        80.9 (p=0.6)
4.4 Democratic pooling with Tensor Sketching
One of the main advantages of democratic pooling over matrix power normalization techniques is that the embeddings can be computed in a low-dimensional space using tensor sketching. To demonstrate this advantage, we compute second-order γ-democratic pooling combined with tensor sketching on the 2048-dimensional ResNet-101 features. Direct construction of the second-order features yields roughly 4M-dimensional features, which are impractical to manipulate on a GPU/CPU. Therefore, we apply Tensor Sketch [42] to approximate the outer product with 8,192-dimensional features, far lower than the 2048² dimensions of the full outer product. The features are aggregated using the γ-democratic approach. We compare our method to the state of the art on the MIT indoor, FMD and DTD datasets. We report the mean accuracy. For DTD and FMD, we also indicate the standard deviation over 10 splits.
Results on MIT indoor.
Table 2 reports the accuracy on MIT indoor. The baseline model approximating second-order features with tensor sketching followed by sum pooling achieves 82.8% accuracy. With democratic pooling, our model achieves a state-of-the-art accuracy of 84.3%, which is 1.5% more than the baseline. Moreover, Table 1 shows that we outperform matrix power normalization on the VGG-16 network by 3.4%. Note that (i) matrix power normalization is impractical for ResNet-101 features, and (ii) it cannot be computed by sketching due to Proposition 6. We also outperform FASON [10] by 2.6%. FASON fuses first- and second-order features from two layers of the VGG-19 network given a 448×448 image size and scores 81.7% accuracy. Recent work on Spectral Features [22] achieves the same accuracy as our best model with democratic pooling. However, the approach [22] uses more data augmentation (rotations, shifts, etc.) during training and pretrains the VGG-19 network on the large-scale Places205 dataset. In contrast, our networks are pretrained on ImageNet, which arguably has a larger domain shift from the MIT indoor dataset than Places205.
Results on FMD.
Table 3 compares accuracy on the FMD dataset. Recent work on Deep filter banks [9], denoted FV+FC+CNN, which combines fully-connected CNN features with the Fisher Vector approach, scores 82.2% accuracy. In contrast to several methods, FASON uses single-scale input images (224×224) and scores 82.1% accuracy. Our second-order democratic pooling outperforms FASON by 0.7% given the same image size. For a 448×448 image size, our model scores 84.3% and outperforms the other state-of-the-art approaches.
Table 3: Results on FMD ('ms' denotes multi-scale input).

Method                                      input size   accuracy
IFV+DeCAF [8]                               ms           65.5 ± 1.3
FV+FC+CNN [9]                               ms           82.2 ± 1.4
LFV [51]                                    ms           82.1 ± 1.9
SMO Task [60]                               -            82.3 ± 1.7
FASON [10]                                  224          82.1 ± 1.9
ResNet-101 + TS + sum pooling (baseline)    448          83.7 ± 1.3
ResNet-101 + TS + γ-democratic (ours)       448          84.3 ± 1.5
ResNet-101 + TS + γ-democratic (ours)       224          82.8 ± 2.5
Results on DTD.
Table 4 presents our results and comparisons on the DTD dataset. Deep filter banks [9], denoted FV+FC+CNN, reports 75.5% accuracy. Combining second-order features with tensor sketching outperforms Deep filter banks by 0.3%. With second-order democratic pooling and 448×448 images, our model achieves 76.2% accuracy and outperforms FV+FC+CNN by 0.7%. Note that FV+FC+CNN exploits several scales of image sizes.
Table 4: Results on DTD ('ms' denotes multi-scale input).

Method                                      input size   accuracy
LFV [51]                                    ms           73.8 ± 1.0
FV+FC+CNN [9]                               ms           75.5 ± 0.8
FASON [10]                                  224          72.9 ± 0.7
ResNet-101 + TS + sum pooling (baseline)    448          75.8 ± 0.7
ResNet-101 + TS + γ-democratic (ours)       448          76.2 ± 0.7
ResNet-101 + TS + γ-democratic (ours)       224          73.0 ± 0.6
4.5 Discussion
While matrix power normalization achieves marginally better performance, it requires the SVD, which is computationally expensive and not GPU friendly; e.g., CUDA BLAS cannot perform the SVD for large matrices. Even in the case of the matrix square root, which can be approximated via Newton's iterations [33], the iterations involve matrix-matrix multiplications of O(d³) complexity. In contrast, solving democratic pooling via the Sinkhorn algorithm (Algorithm 1) involves only matrix-vector multiplications, which are O(n²). Empirically, we find that solving the Sinkhorn iterations is an order of magnitude faster than solving for the matrix square root on an NVIDIA Titan X GPU. Moreover, the complexity of a Sinkhorn iteration depends only on the kernel matrix; it is independent of the feature vector size. In contrast, the memory required by a covariance matrix grows as O(d²), which becomes prohibitive for feature vectors of more than 512 dimensions. Second-order democratic pooling with tensor sketching yields comparable results and reduces the memory usage by two orders of magnitude compared to matrix power normalization.
Although we did not report results with end-to-end training, one can easily obtain gradients of the Sinkhorn algorithm via automatic differentiation by implementing Algorithm 1 in a library such as PyTorch or TensorFlow. Training with gradients from iterative solvers has been performed in a number of applications (e.g., [13] and [36]), which suggests that this is a promising direction.
5 Conclusions
We proposed a second-order aggregation method, referred to as γ-democratic pooling, that interpolates between sum pooling (γ = 1) and democratic pooling (γ = 0) and outperforms both on several classification tasks. We demonstrated that our approach enjoys lower computational complexity than matrix square root approximations via Newton's iterations. With the use of sketching, our approach is not limited to aggregating small feature vectors, which is typically the case for matrix power normalization. The source code for the project is available at http://vis-www.cs.umass.edu/o2dp.
Acknowledgements.
We acknowledge support from NSF (#1617917, #1749833) and the MassTech Collaborative grant for funding the UMass GPU cluster.
Supplementary
In this supplementary material we provide proofs of Propositions 4 and 6 from Section 3 of the paper. In addition, the last section compares aggregating first- and second-order features using democratic pooling.
Proofs of Proposition 4

The norm of is .
Proof
We have:
Thus the norm:

.
Proof
We have:

The maximum value .
Proof
We have

The minimum value .
Proof
We have:
Proof of Proposition 6
Proof
Here is an example where the matrix power A^{1/2} does not lie in the linear span of the outer products of the features x_i. Consider the two vectors x_1 = (1, 0)^T and x_2 = (1, 1)^T. The covariance matrix formed by the two is
A = x_1 x_1^T + x_2 x_2^T = [[2, 1], [1, 1]].
The square root of the matrix is:
A^{1/2} = (A + I)/sqrt(5) = (1/sqrt(5)) [[3, 1], [1, 2]].
It is easy to see that A^{1/2} cannot be written as a linear combination of x_1 x_1^T and x_2 x_2^T, since any such combination has all equal values for all the entries except possibly the top-left value, whereas the off-diagonal entry 1/sqrt(5) of A^{1/2} differs from its bottom-right entry 2/sqrt(5).
A sufficient condition for A^p to be in the linear span of the outer products is that the vectors x_i used in constructing A be orthogonal to each other. This, however, is not true in general for features extracted from convolutional layers.
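This argument is easy to verify numerically. Below is a short check using, for illustration, the vectors x1 = (1, 0)^T and x2 = (1, 1)^T, chosen so that x1 x1^T is nonzero only in the top-left entry while x2 x2^T has all entries equal:

```python
import numpy as np

# Illustrative vectors: any linear combination a * x1 x1^T + b * x2 x2^T
# has the value b in every entry except possibly the top-left one.
x1 = np.array([1.0, 0.0])
x2 = np.array([1.0, 1.0])
A = np.outer(x1, x1) + np.outer(x2, x2)   # [[2, 1], [1, 1]]

# Principal matrix square root via eigendecomposition (A is symmetric PSD).
w, V = np.linalg.eigh(A)
A_sqrt = (V * np.sqrt(w)) @ V.T

# Sanity check: A_sqrt @ A_sqrt recovers A.
assert np.allclose(A_sqrt @ A_sqrt, A)

# The off-diagonal and bottom-right entries of A^(1/2) differ, so A^(1/2)
# cannot be a linear combination of the two outer products.
print(A_sqrt)   # approx [[1.342, 0.447], [0.447, 0.894]]
assert not np.isclose(A_sqrt[0, 1], A_sqrt[1, 1])
```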
Aggregating first- versus second-order features
Dataset  DTD  FMD  MIT indoor

First-order  72.1 ± 0.7  84.2 ± 1.2  80.4
Second-order  76.2 ± 0.7  84.3 ± 1.5  84.3
In this section, we show that aggregating second-order features is more effective than aggregating first-order features using democratic pooling. Table 5 compares the results obtained on first-order representations with the results on second-order features on the DTD, FMD and MIT indoor datasets. We follow the same protocol as detailed in Section 4.4 of the main paper. Specifically, we aggregate the 2048-dimensional ResNet101 features from the last convolutional layer using 448×448 images. The second-order features are approximated by Tensor Sketch to obtain an 8192-dimensional representation. On the small-scale FMD dataset, which contains only 10 categories, first-order aggregation performs comparably to aggregating second-order features. In contrast, aggregating second-order features on the larger and more challenging DTD and MIT indoor datasets improves results over the first-order features by a significant margin (about 4%). Such improvements demonstrate the robustness of our second-order aggregation scheme.
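As a concrete illustration of the sketching step, below is a minimal NumPy implementation of Tensor Sketch [42], which approximates the outer product x ⊗ x in d_out dimensions via the circular convolution (computed with FFTs) of two independent count sketches. The random hash and sign arrays shown are assumptions of this sketch, not the exact configuration used in our experiments:

```python
import numpy as np

def tensor_sketch(x, h1, s1, h2, s2, d_out):
    """Approximate vec(x x^T) in d_out dimensions (Pham & Pagh [42]).

    h1, h2: hash indices in [0, d_out); s1, s2: random +/-1 signs.
    """
    cs1 = np.zeros(d_out)
    cs2 = np.zeros(d_out)
    np.add.at(cs1, h1, s1 * x)   # first count sketch of x
    np.add.at(cs2, h2, s2 * x)   # second, independent count sketch of x
    # circular convolution of the two sketches = sketch of the outer product
    return np.fft.ifft(np.fft.fft(cs1) * np.fft.fft(cs2)).real

# Illustrative usage with randomly drawn hashes.
rng = np.random.default_rng(0)
d, d_out = 32, 8192
h1, h2 = rng.integers(0, d_out, d), rng.integers(0, d_out, d)
s1, s2 = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
ts = tensor_sketch(x, h1, s1, h2, s2, d_out)
# <TS(x), TS(x)> approximates ||x x^T||_F^2 = ||x||^4, which is 1 here
```

For d = 2048 features the exact outer product has roughly 4.2M entries, while the 8192-dimensional sketch preserves the inner products needed for classification – the source of the memory savings discussed in Section 4.5.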
References
 [1] Arandjelović, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: CVPR (2016)
 [2] Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine 56(2), 411–421 (2006)
 [3] Bhatia, R.: Positive definite matrices. Princeton Univ Press (2007)
 [4] Bhatia, R., Davis, C.: A better bound on the variance. The American Mathematical Monthly 107(4), 353–357 (2000)
 [5] Boughorbel, S., Tarel, J.P., Boujemaa, N.: Generalized Histogram Intersection Kernel for Image Recognition. In: ICIP (2005)
 [6] Carreira, J., Caseiro, R., Batista, J., Sminchisescu, C.: Semantic Segmentation with SecondOrder Pooling. In: ECCV (2012)
 [7] Cherian, A., Sra, S., Banerjee, A., Papanikolopoulos, N.: Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices. TPAMI 35(9), 2161–2174 (2013)
 [8] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing Textures in the Wild. In: CVPR (2014)
 [9] Cimpoi, M., Maji, S., Vedaldi, A.: Deep Filter Banks for Texture Recognition and Segmentation. In: CVPR (2015)
 [10] Dai, X., YueHei Ng, J., Davis, L.S.: FASON: First and Second Order Information Fusion Network for Texture Recognition. In: CVPR (2017)
 [11] Dryden, I.L., Koloydenko, A., Zhou, D.: Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics 3(3), 1102–1123 (2009)
 [12] Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. In: CVPR (2016)
 [13] Genevay, A., Peyré, G., Cuturi, M.: Learning generative models with sinkhorn divergences. arXiv preprint arXiv:1706.00292 (2017)
 [14] Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale Orderless Pooling of Deep Convolutional Activation Features. In: ECCV (2014)
 [15] Guo, K., Ishwar, P., Konrad, J.: Action recognition from video using feature covariance matrices. Trans. Img. Proc. 22(6), 2479–2494 (2013)
 [16] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR (2016)
 [17] Huang, Z., Gool, L.V.: A Riemannian Network for SPD Matrix Learning. In: AAAI (2017)
 [18] Ionescu, C., Vantzos, O., Sminchisescu, C.: Matrix Backpropagation for Deep Networks with Structured Layers. In: ICCV (2015)
 [19] Jégou, H., Douze, M., Schmid, C.: On the Burstiness of Visual Elements. In: CVPR (2009)
 [20] Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating Local Descriptors into a Compact Image Representation. In: CVPR (2010)
 [21] Jégou, H., Zisserman, A.: Triangulation embedding and democratic aggregation for image search. In: CVPR (2014)
 [22] Khan, S.H., Hayat, M., Porikli, F.: Scene Categorization with Spectral Features. In: ICCV (2017)
 [23] Knight, P.A.: The Sinkhorn-Knopp algorithm: Convergence and applications. SIAM J. Matrix Anal. Appl. 30(1), 261–275 (Mar 2008)
 [24] Koniusz, P., Yan, F., Gosselin, P., Mikolajczyk, K.: Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection. Technical Report, HAL Id: hal-00922524 (2013)
 [25] Koniusz, P., Yan, F., Gosselin, P., Mikolajczyk, K.: Higher-order occurrence pooling for bags-of-words: Visual concept detection. PAMI 39(2), 313–326 (2017)
 [26] Koniusz, P., Cherian, A., Porikli, F.: Tensor representations via kernel linearization for action recognition from 3d skeletons. In: ECCV. pp. 37–53. Springer (2016)
 [27] Koniusz, P., Tas, Y., Zhang, H., Harandi, M., Porikli, F., Zhang, R.: Museum exhibit identification challenge for the supervised domain adaptation. In: ECCV (2018)
 [28] Koniusz, P., Zhang, H., Porikli, F.: A deeper look at power normalizations. In: CVPR. pp. 5774–5783 (2018)
 [29] Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D Object Representations for Fine-Grained Categorization. In: Workshop on 3D Representation and Recognition (3dRR) (2013)
 [30] Li, P., Wang, Q.: Local Log-Euclidean Covariance Matrix (L2ECM) for Image Representation and Its Applications. In: ECCV (2012)
 [31] Li, P., Xie, J., Wang, Q., Zuo, W.: Is Second-order Information Helpful for Large-scale Visual Recognition? In: ICCV (2017)
 [32] Lin, T.Y., Roy-Chowdhury, A., Maji, S.: Bilinear Convolutional Neural Networks for Fine-Grained Visual Recognition. IEEE TPAMI 40(6) (2018)
 [33] Lin, T.Y., Maji, S.: Improved Bilinear Pooling with CNNs. In: BMVC (2017)
 [34] Lin, T.Y., Roy-Chowdhury, A., Maji, S.: Bilinear CNN Models for Fine-grained Visual Recognition. In: ICCV (2015)
 [35] Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft (2013)
 [36] Mena, G., Belanger, D., Linderman, S., Snoek, J.: Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665 (2018)
 [37] Murray, N., Jégou, H., Perronnin, F., Zisserman, A.: Interferences in match kernels. IEEE TPAMI 39(9), 1797–1810 (2017)
 [38] Negrel, R., Picard, D., Gosselin, P.H.: Compact Tensor Based Image Representation for Similarity Search. ICIP (2012)
 [39] Pennec, X., Fillard, P., Ayache, N.: A Riemannian Framework for Tensor Computing. IJCV 66(1), 41–66 (2006)
 [40] Perronnin, F., Dance, C.: Fisher Kernels on Visual Vocabularies for Image Categorization. In: CVPR (2007)
 [41] Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher Kernel for Large-Scale Image Classification. In: ECCV (2010)
 [42] Pham, N., Pagh, R.: Fast and scalable polynomial kernels via explicit feature maps. In: KDD (2013)
 [43] Popoviciu, T.: Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica 9, 129–145 (1935)
 [44] Porikli, F., Tuzel, O.: Covariance Tracker. In: CVPR (2006)
 [45] Quattoni, A., Torralba, A.: Recognizing Indoor Scenes. In: CVPR (2009)
 [46] Romero, A., Terán, M.Y., Gouiffès, M., Lacassagne, L.: Enhanced local binary covariance matrices for texture analysis and object tracking. MIRAGE (2013)
 [47] Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: Theory and practice. IJCV 105(3), 222–245 (2013)
 [48] Sharan, L., Rosenholtz, R., Adelson, E.H.: Material perception: What can you see in a brief glance? Journal of Vision 9:784(8) (2009)
 [49] Shih, Y.F., Yeh, Y.M., Lin, Y.Y., Weng, M.F., Lu, Y.C., Chuang, Y.Y.: Deep Co-occurrence Feature Learning for Visual Object Recognition. In: CVPR (2017)
 [50] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
 [51] Song, Y., Zhang, F., Li, Q., Huang, H., O'Donnell, L.J., Cai, W.: Locally-transferred Fisher vectors for texture classification. In: ICCV (Oct 2017)
 [52] Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection and Classification. In: ECCV (2006)
 [53] Tuzel, O., Porikli, F., Meer, P.: Pedestrian Detection via Classification on Riemannian Manifolds. IEEE TPAMI 30(10), 1713–1727 (2008)
 [54] Wang, L., Guo, S., Huang, W., Qiao, Y.: Places205-VGGNet models for scene recognition. CoRR abs/1508.01667 (2015)
 [55] Wang, Z., Vemuri, B.C.: An affine invariant tensor dissimilarity measure and its applications to tensor-valued image segmentation. In: CVPR (2004)
 [56] Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology (2010)
 [57] Yandex, A.B., Lempitsky, V.: Aggregating Local Deep Features for Image Retrieval. In: ICCV (2015)
 [58] Yu, K., Salzmann, M.: Second-order Convolutional Neural Networks. CoRR abs/1703.06817 (2017)
 [59] Yu, K., Salzmann, M.: Statistically-motivated second-order pooling. In: ECCV (2018)
 [60] Zhang, Y., Ozay, M., Liu, X., Okatani, T.: Integrating deep features for material recognition. In: ICPR (2016)