# DEFRAG: Deep Euclidean Feature Representations through Adaptation on the Grassmann Manifold

## Abstract

We propose a novel technique for training deep networks with the objective of obtaining feature representations that exist in a Euclidean space and exhibit strong clustering behavior. Our desired feature representations have three traits: they can be compared using a standard Euclidean distance metric, samples from the same class are tightly clustered, and samples from different classes are well separated. However, most deep networks do not enforce such feature representations. The DEFRAG training technique consists of two steps: first, good feature clustering behavior is encouraged through an auxiliary loss function based on the Silhouette clustering metric; then, the feature space is retracted onto a Grassmann manifold to ensure that the $L_2$ norm forms a similarity metric. The DEFRAG technique achieves state-of-the-art results on standard classification datasets using a relatively small network architecture with significantly fewer parameters than many standard networks.

Breton Minnehan and Andreas Savakis
Rochester Institute of Technology

Rochester, New York 14623, USA

Keywords: Deep learning, Clustering, Grassmann Optimization

## 1 Introduction

Deep Convolutional Neural Networks (CNNs) and their variants have emerged as the architecture of choice for computer vision. Deep networks have achieved state-of-the-art results in object class recognition [1], [2], [3], face recognition [4], semantic segmentation [5], pose estimation [6], and visual tracking [7] among other applications. While the initial focus has been on making CNNs deeper in order to achieve higher performance, recent work has been exploring leaner networks, such as DenseNet [8] as an alternative to ResNet [3], that are more efficient yet perform just as well if not better than their larger counterparts.

In its simplest form, a CNN consists of a feature extraction convolutional network followed by a linear classifier, the head of the network. One benefit of CNNs is that they are trained in an end-to-end manner, so the maximum benefit can be extracted from each stage. However, the features learned by CNNs can be further improved for robustness. A robust feature representation is one that minimizes differences between samples of the same class and maximizes differences between samples from different classes. In this work we present a method referred to as DEFRAG, inspired by Linear Discriminant Analysis (LDA) [9]. Our approach shapes the feature representation through a novel auxiliary loss function (Section 3.1) and an orthogonalization step that involves retraction on a Grassmann manifold (Section 3.2), illustrated in Fig. 1.

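An auxiliary loss, $\mathcal{L}_{aux}$, is a secondary metric that is added to the loss from the main training objective, $\mathcal{L}_{main}$, for the optimizer to minimize, as

$$\mathcal{L} = \mathcal{L}_{main} + \lambda \mathcal{L}_{aux} \tag{1}$$

where $\lambda$ is a mixture parameter used to balance the impact of the auxiliary loss.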
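As a minimal sketch of how a main loss and an auxiliary loss are mixed, the snippet below adds a weighted auxiliary term to a categorical cross-entropy. The function names, the value `lam=0.1`, and the placeholder auxiliary value are illustrative assumptions, not settings taken from this work.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean categorical cross-entropy for integer class labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def total_loss(main_loss, aux_loss, lam=0.1):
    """Main objective plus the auxiliary loss scaled by the mixture parameter.

    lam=0.1 is a hypothetical default chosen only for illustration.
    """
    return main_loss + lam * aux_loss

# Softmax outputs for two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])

main = cross_entropy(probs, labels)
aux = 0.5  # placeholder standing in for any auxiliary loss value
loss = total_loss(main, aux, lam=0.1)
```

Setting the mixture parameter to zero recovers plain classification training, which makes it easy to compare runs with and without the auxiliary term.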

In this work the main loss objective is the traditional categorical cross-entropy loss, which learns to classify the samples, and the auxiliary loss is a proposed new function, the Silhouette loss. Figure 3 illustrates the clustering characteristics of DEFRAG compared to other methods.

The contributions of this paper are the following: (a) We introduce a new auxiliary loss function based on the Silhouette clustering metric, which encourages tight intra-class clustering and inter-class separation. (b) We propose an orthogonalization step which retracts the optimized feature projection matrix back onto the Grassmann manifold. (c) We demonstrate how the DEFRAG method improves performance so that a small network matches or exceeds the performance of much bigger networks and achieves state-of-the-art results on standard datasets.

In the remainder of this paper, Section 2 provides background on auxiliary loss functions and other regularization methods used to train deep neural networks. Section 3 presents our proposed DEFRAG method and its theoretical justifications. Section 4 outlines our results and Section 5 offers concluding remarks.

## 2 Background

There are many variants of auxiliary loss functions used when training deep networks to encourage different behaviors. One of the first auxiliary losses proposed was feature regularization. The goal of regularizing the feature activations is to keep the values in the feature representation small or sparse by using $L_2$ or $L_1$ norms, respectively. The underlying assumption is that small-valued or sparse feature representations generally reduce over-fitting. $L_2$ regularization encourages activations with small magnitudes

$$\mathcal{L}_{L_2} = \lVert \sigma(Wx + b) \rVert_2 \tag{2}$$

where $\sigma$ is an activation function, $x$ is the layer’s input vector, $W$ is the weight matrix and $b$ is the bias vector. $L_1$ regularization encourages sparsity in the activations

$$\mathcal{L}_{L_1} = \lVert \sigma(Wx + b) \rVert_1 \tag{3}$$
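The two activation penalties can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: a ReLU activation, a single input vector, and unsquared norms (many frameworks use the squared $L_2$ norm instead). The function name `activation_penalties` is hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def activation_penalties(x, W, b, act=relu):
    """Return the (L2, L1) norms of a layer's activations act(Wx + b)."""
    a = act(W @ x + b)
    l2 = float(np.sqrt(np.sum(a ** 2)))  # small when all activations are small
    l1 = float(np.sum(np.abs(a)))        # small when many activations are zero
    return l2, l1

x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0]])
b = np.array([2.0, 0.0, 4.0])

# Pre-activations are [3, -2, 4]; ReLU maps them to [3, 0, 4].
l2, l1 = activation_penalties(x, W, b)
```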

The use of these regularization techniques has waned due to reliance on robust training methods such as Batch Normalization [10] and Dropout [11]. Batch Normalization reduces the internal covariate shift and the potential for vanishing or exploding gradients by normalizing the activations from each layer to be zero-mean and unit-variance. Dropout, on the other hand, reduces the potential for co-adaptation of neurons by randomly setting a fraction of neurons to zero on each training iteration, thus forcing neurons to be more self-reliant. Though these techniques have improved training time and network accuracy, they do not explicitly encourage generalizable feature representations.
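The two operations described above can be sketched as follows. This is a simplified illustration: the batch normalization omits the learned scale and shift parameters and the running statistics used at inference time, and the dropout uses the common "inverted" rescaling, which is an assumption rather than something specified here.

```python
import numpy as np

def batch_norm(a, eps=1e-5):
    """Normalize each feature (column) to zero mean and unit variance over the batch."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    return (a - mean) / np.sqrt(var + eps)

def dropout(a, rate, rng):
    """Zero a random fraction `rate` of activations and rescale the survivors."""
    keep = rng.random(a.shape) >= rate
    return a * keep / (1.0 - rate)

rng = np.random.default_rng(0)
acts = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # batch of 64, 4 units

normed = batch_norm(acts)
dropped = dropout(normed, rate=0.5, rng=rng)
```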

Recent work introduced an auxiliary function called the center loss [12], which increases the robustness of the feature representation by encouraging tightly grouped clusters. The center loss accumulates the squared distance of each point $x_i$, in feature space, to the mean of the corresponding class $c_{y_i}$,

$$\mathcal{L}_{C} = \frac{1}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2 \tag{4}$$

where $m$ is the number of samples in the mini-batch, $x_i$ is the feature space representation of the $i$-th sample and $c_{y_i}$ is the center of the class $y_i$ of the $i$-th sample.
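A minimal NumPy sketch of the center loss, following the formulation of [12] as half the sum of squared Euclidean distances between each feature vector and its class center. The array layout (one sample per row, one center per row indexed by class label) is an assumption made for illustration.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Half the summed squared distance from each feature to its class center."""
    diffs = features - centers[labels]  # centers[labels] picks one center per sample
    return 0.5 * float(np.sum(diffs ** 2))

features = np.array([[0.0, 0.0],
                     [2.0, 0.0],
                     [5.0, 5.0]])
labels = np.array([0, 0, 1])
centers = np.array([[1.0, 0.0],   # center of class 0
                    [5.0, 4.0]])  # center of class 1

# Each sample lies at distance 1 from its center: 0.5 * (1 + 1 + 1) = 1.5
loss = center_loss(features, labels, centers)
```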

The tight Euclidean clustering in feature space encouraged by the center loss is useful in situations where the feature representations are compared to estimate similarity between samples, such as is done with k-Nearest Neighbors (k-NN). The work in [12] focused on person re-identification, a problem that requires robust feature representations that can be compared using Euclidean distance metrics.

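More recently, [13] introduced the contrastive center loss, a technique that encourages tight clustering and increases class separation. The contrastive center loss is given as:

$$\mathcal{L}_{ct\text{-}c} = \frac{1}{2} \sum_{i=1}^{m} \frac{\lVert x_i - c_{y_i} \rVert_2^2}{\left( \sum_{j=1,\, j \neq y_i}^{k} \lVert x_i - c_j \rVert_2^2 \right) + \delta} \tag{5}$$

where $x_i$ is the feature representation of the $i$-th of $m$ samples, $y_i$ is its class label, $c_j$ is the center of class $j$, $k$ is the number of classes, and $\delta$ is a small value that ensures the denominator is non-zero. Our work builds upon the center loss and contrastive center loss for better feature clustering and more robust performance.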
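The contrastive center loss of [13] can be sketched as below: for each sample, the squared distance to its own class center is divided by the summed squared distances to all other class centers. The default `delta=1e-6` is a hypothetical value for the small constant, and the array layout is assumed for illustration.

```python
import numpy as np

def contrastive_center_loss(features, labels, centers, delta=1e-6):
    """Own-center distance divided by the summed distances to all other centers."""
    # Squared distances from every sample to every center, shape (m, k).
    d2 = np.sum((features[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    idx = np.arange(len(labels))
    intra = d2[idx, labels]          # distance to the sample's own center
    inter = d2.sum(axis=1) - intra   # summed distance to every other center
    return 0.5 * float(np.sum(intra / (inter + delta)))

features = np.array([[0.0, 0.0],
                     [4.0, 0.0]])
labels = np.array([0, 1])
centers = np.array([[1.0, 0.0],
                    [3.0, 0.0]])

# Each sample: intra = 1, inter = 9, so 0.5 * (1/10 + 1/10) = 0.1 with delta = 1.
loss = contrastive_center_loss(features, labels, centers, delta=1.0)
```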

Like many other methods in the field, both [12] and [13] use the Euclidean, $L_2$, norm as a similarity metric between deep feature vectors. This use of the $L_2$ norm can be problematic when applied to arbitrary feature spaces. The Euclidean norm is intended to operate in $\mathbb{R}^n$, represented by an orthogonal basis. However, the $L_2$ norm is often applied to vector representations without an orthogonal basis. In this work we aim to produce feature representations with a proper similarity metric. This behavior is critical for many applications that use feature similarity, such as k-NN and other graphical methods. We ensure orthogonality by retracting the weight matrix of our feature representation layer to the Grassmann manifold, the set of orthogonal spaces in $\mathbb{R}^n$.

## 3 Proposed Methodology

The DEFRAG method consists of two components: an auxiliary loss component, and a retraction of the feature projection on the Grassmann manifold. The auxiliary loss is designed to encourage better feature clustering of samples based on their class labels, while the Grassmann manifold retraction ensures the features are in a space suitable for the $L_2$ norm similarity metric. A side benefit of our DEFRAG training approach is that, due to the robustness of the features generated, smaller networks can be used. The two components of DEFRAG are discussed next.

### 3.1 Clustering Auxiliary Loss

Robust feature representations are important for classification, as they increase the classifier’s ability to generalize across different datasets and operating conditions. We pose the learning of robust feature representations as the search for a feature space that, for some similarity metric, encourages tight clusters for samples in the same class and large separations between clusters from different classes. Training deep networks with only the classification loss does not inherently encourage feature clustering.

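Our auxiliary loss function is the Silhouette loss shown below:

$$\mathcal{L}_{S} = \frac{1}{2} \sum_{i=1}^{m} \frac{\lVert x_i - c_{y_i} \rVert_2^2}{\min_{j \neq y_i} \lVert x_i - c_j \rVert_2^2 + \delta} \tag{6}$$

where $x_i$ is the feature representation of the $i$-th of $m$ samples in the mini-batch, $c_{y_i}$ is the center of its class, $c_j$ is the center of class $j$, and $\delta$ is a small value that keeps the denominator non-zero.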

The Silhouette loss is inspired by the Silhouette score [14], which is used to assess clustering performance. The Silhouette loss is different from the center loss and contrastive center loss in that it focuses on separating classes that are close to each other, instead of maximizing the overall class separation. This alteration is important because it forces the network to focus on classes that are hard to separate and results in better classification performance. To reduce computation we use a running average method to update the class centers, as suggested in [12].
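The behavior described above can be sketched under one explicit assumption about the form of the loss: the squared distance to a sample's own class center is divided by the squared distance to the nearest other class center, so that only the closest competing class contributes. The functions `silhouette_loss` and `update_centers`, the constant `delta`, and the blending rate `alpha` are illustrative choices, and the center update is a simple exponential moving average rather than the exact rule of [12].

```python
import numpy as np

def silhouette_loss(features, labels, centers, delta=1e-6):
    """Own-center distance divided by the distance to the nearest other center."""
    d2 = np.sum((features[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    idx = np.arange(len(labels))
    intra = d2[idx, labels]
    masked = d2.copy()
    masked[idx, labels] = np.inf          # exclude each sample's own class
    nearest_other = masked.min(axis=1)    # closest competing class center
    return 0.5 * float(np.sum(intra / (nearest_other + delta)))

def update_centers(centers, features, labels, alpha=0.5):
    """Move each center toward the batch mean of its class (running average)."""
    new = centers.copy()
    for c in np.unique(labels):
        batch_mean = features[labels == c].mean(axis=0)
        new[c] = (1.0 - alpha) * centers[c] + alpha * batch_mean
    return new

features = np.array([[0.0, 0.0],
                     [4.0, 0.0]])
labels = np.array([0, 1])
centers = np.array([[1.0, 0.0],
                    [3.0, 0.0],
                    [10.0, 0.0]])  # a distant third class

# The distant class is ignored: each sample has intra = 1, nearest other = 9.
loss = silhouette_loss(features, labels, centers, delta=1.0)
new_centers = update_centers(centers, features, labels, alpha=0.5)
```

In this toy case the far-away third center does not change the loss at all, whereas a sum over all other centers would shrink the penalty even though the two nearby classes remain just as hard to separate.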

### 3.2 Grassmann Manifold Retraction

The set of orthogonal spaces in $\mathbb{R}^n$ forms a Grassmann manifold. We therefore formulate our feature learning process as an optimization problem on the Grassmann manifold; the optimization process is depicted in Fig. 1. This is done by using a linear activation function for the last layer of the network, reducing it to a linear projection with projection matrix $W$. The update step of the Grassmann optimization is done using traditional stochastic gradient descent. The projection matrix is updated at each training iteration based on the gradient of the loss function, $\nabla_W \mathcal{L}$. However, because any non-zero update is likely to take the projection matrix off the manifold, the updated matrix $\hat{W}$ is retracted back onto the Grassmann manifold as $W = r(\hat{W})$. This retraction is done through singular value decomposition [15] as

$$\hat{W} = U \Sigma V^{\top}, \qquad r(\hat{W}) = U V^{\top} \tag{7}$$

This process enforces features that satisfy the Silhouette auxiliary loss criterion and reside in an orthonormal space.
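The update-then-retract step can be sketched with NumPy's SVD. This is an illustration under assumptions: the retraction keeps the singular vectors and replaces the singular values with ones, the matrix shape (256, 8) is taken from the layer sizes in Table 1, and the gradient is random noise standing in for a real backpropagated gradient.

```python
import numpy as np

def retract(W_hat):
    """Map a matrix to the nearest one with orthonormal columns via thin SVD."""
    U, _, Vt = np.linalg.svd(W_hat, full_matrices=False)
    return U @ Vt  # singular values are discarded, i.e. replaced by ones

def sgd_step_with_retraction(W, grad, lr=0.01):
    """Plain gradient step followed by retraction onto the manifold."""
    W_hat = W - lr * grad  # this update generally leaves the manifold
    return retract(W_hat)

rng = np.random.default_rng(0)
W = retract(rng.normal(size=(256, 8)))  # start from an orthonormal projection
grad = rng.normal(size=(256, 8))        # stand-in for the loss gradient
W_next = sgd_step_with_retraction(W, grad, lr=0.1)

# Columns of the retracted matrix are orthonormal.
gram = W_next.T @ W_next
```

A matrix that already has orthonormal columns is left unchanged by `retract`, so the step only corrects the drift introduced by the gradient update.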

## 4 Experimental Results

We performed a series of experiments by training a deep neural network using the proposed DEFRAG method and compared the results with state-of-the-art deep networks. The network architecture of our choice is a small network with only two convolutional layers and three fully connected layers. The network parameters are given in Table 1. This network choice is intentionally simplistic so that the benefit of the DEFRAG method is made apparent.

Table 1: Network parameters.

| Stage | Layer Type | Size |
|---|---|---|
| Convolution Stage 1 | Conv | 32 (5x5) |
| | Pooling | 2x2 |
| | Conv | 256 (5x5) |
| | Pooling | 2x2 |
| | ReLU | 256 |
| Feature Representation | ReLU/Linear | 8 |
| Output | Softmax | 10 |

We used standard classification datasets in these experiments including: MNIST [16] and Fashion MNIST [17]. Example images of the datasets used are shown in Fig. 2. In Section 4.1 we show quantitative comparisons of the proposed DEFRAG method to standard deep networks on the Fashion MNIST dataset. Section 4.2 presents qualitative results of feature clustering with the MNIST [16] dataset.

### 4.1 Fashion MNIST Experiments

The Fashion MNIST dataset [17] was developed as a significantly more challenging alternative to the original MNIST dataset [16]. Like the original MNIST dataset, the Fashion MNIST dataset consists of 60,000 training samples and 10,000 test samples from objects in 10 different classes. The objects in the Fashion MNIST set are ten different articles of clothing: T-Shirt/Top, Trouser, Pullover, Dress, Coat, Sandals, Shirt, Sneaker, Bag, Ankle boots. In crowd-sourced experiments on a subset of 1000 examples, humans were only able to achieve 83.5% accuracy. The current state-of-the-art performance on the dataset without augmentation is only 93.7%. Examples from the Fashion MNIST dataset are shown on the bottom row of Figure 2 and the experimental results are summarized in Table 2.

Our experiments with the Fashion MNIST dataset illustrate the impact on classification accuracy of both DEFRAG components, the Silhouette loss and the orthogonalization step. We first trained the network using only the classification loss and a traditional ReLU activation function to get a baseline, Sparse (ReLU) in the table. We then considered a linear activation function for the feature representation layer, as well as the Silhouette, Center, Contrastive Center and DEFRAG variants. The results in Table 2 demonstrate that the DEFRAG method outperforms the other methods, with a relative reduction in error of 7% compared to the original network. The results also demonstrate a significant improvement using just the Silhouette loss; DEFRAG shows a further improvement as a result of the orthonormal feature space.
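The 7% figure can be checked directly from the accuracies in Table 2 for the Sparse (ReLU) baseline (0.9347) and DEFRAG (0.9393). The helper name below is illustrative.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Fraction by which the error rate drops when moving from baseline to new."""
    err_base = 1.0 - acc_baseline
    err_new = 1.0 - acc_new
    return (err_base - err_new) / err_base

# Error falls from 6.53% to 6.07%.
reduction = relative_error_reduction(0.9347, 0.9393)
print(f"{reduction:.1%}")  # prints 7.0%
```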

Our experiments demonstrate that state-of-the-art results can be achieved on the Fashion MNIST dataset with a simple network that benefits from a feature space clustering regularization technique. The accuracy achieved by our network is better than the results achieved with the much larger GoogLeNet [18] and VGG16 [2] architectures. Based on the parameter counts in Table 2 (5M and 26M versus 1.4M), our network has roughly 3.6 and 18.6 times fewer parameters than these two architectures, respectively. These comparisons are based on the updated results reported in [19].

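Table 2: Classification accuracy and parameter counts on the Fashion MNIST dataset.

| | Network | Accuracy | Parameters |
|---|---|---|---|
| Other Works | GoogLeNet [18] | 0.9370 | 5M |
| Other Works | VGG16 [2] | 0.9350 | 26M |
| Other Works | HOG+SVM [19] | 0.9260 | N.A. |
| Other Works | Human [20] | 0.8350 | N.A. |
| Our Network | Sparse (ReLU) | 0.9347 | 1.4M |
| Our Network | Linear | 0.9369 | 1.4M |
| Our Network | Silhouette | 0.9375 | 1.4M |
| Our Network | Center | 0.9371 | 1.4M |
| Our Network | Contr. Center | 0.9368 | 1.4M |
| Our Network | DEFRAG | 0.9393 | 1.4M |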


### 4.2 Qualitative MNIST Experiments

The MNIST dataset serves as a useful tool in understanding the behavior of the network. In this section, we qualitatively investigate the clustering behavior of the proposed method. To visualize the clustering behavior, the dimensionality of the feature representation was reduced to 2 and the resulting features were plotted in Fig. 3. The X-axis corresponds to the response of the first unit and the Y-axis corresponds to the response of the second unit.

A few observations can be made on these results, although it is hard to draw any firm conclusions from these plots. Firstly, it is clear that the SoftPlus activation function is not ideal for classification; this was confirmed by the results, where the SoftPlus implementation achieved only 88.6% accuracy on the test set while the other methods achieved accuracies of over 99.0%.

Secondly, the plots of Figure 3 show that auxiliary cluster losses have a significant impact on the clustering behavior of the network. The linear implementation forms linearly separable clusters for each class; however, the intra-class variance is much higher than the inter-class separation. We measured the separability of each method using both the Silhouette score and the ratio of average intra-class to inter-class distances. On both of these metrics the DEFRAG method outperformed all other methods. The DEFRAG method outperformed the center loss, the second best method, with a 16% reduction in the Silhouette score and 13% in the distance ratio.
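Both separability measures can be sketched in NumPy. The Silhouette score follows the standard definition of [14]: for each sample, $a$ is the mean distance to the other members of its class, $b$ is the smallest mean distance to the members of any other class, and the score is $(b - a) / \max(a, b)$ averaged over samples. The distance ratio is read here as average within-class distance over average between-class distance, which is an assumption about the intended definition. The sketch assumes at least two classes and no duplicate points.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def silhouette_score(X, labels):
    """Mean Silhouette coefficient over all samples."""
    D = pairwise_dist(X)
    classes = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = D[i, same].sum() / (n_same - 1)  # self-distance is zero
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def intra_inter_ratio(X, labels):
    """Average within-class distance divided by average between-class distance."""
    D = pairwise_dist(X)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    intra = D[same & off_diag].mean()
    inter = D[~same].mean()
    return float(intra / inter)

# Two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

score = silhouette_score(X, labels)
ratio = intra_inter_ratio(X, labels)
```

For tight, well-separated clusters such as these, the Silhouette score approaches 1 and the distance ratio approaches 0.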

## 5 Conclusions

In this work, we present a new method for training deep neural networks that focuses on learning better feature representations, which exhibit ideal clustering behavior and can be more effectively compared using the standard $L_2$ norm. The proposed DEFRAG method consists of a new auxiliary loss function, the Silhouette loss, and a retraction step which ensures linear independence of each dimension in the feature representation. We demonstrate performance improvements in terms of increased accuracy on multiple datasets using a smaller network, achieving state-of-the-art results on the Fashion MNIST dataset.

Our work demonstrates that combining an auxiliary loss function that encourages ideal clustering behavior with an orthogonalization step that ensures the feature space is orthonormal yields clustered learned features that render small networks more effective for classification applications.

### References

- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition (CVPR), 2016.
- F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Computer Vision and Pattern Recognition (CVPR), 2015.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2015.
- S. E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Computer Vision and Pattern Recognition (CVPR), 2016.
- H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” 2015.
- G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” 2016.
- R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Human Genetics, vol. 7, no. 2, pp. 179–188, 1936.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015, pp. 448–456.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
- Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision (ECCV), 2016.
- C. Qi and F. Su, “Contrastive-center loss for deep neural networks,” CoRR, vol. abs/1707.07391, 2017.
- P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
- K. Fan and A. J. Hoffman, “Some metric inequalities in the space of matrices,” Proceedings of the American Mathematical Society, vol. 6, pp. 111–116, 1955.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
- H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” 2017.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” preprint, 2014.
- “Fashion-MNIST,” https://github.com/zalandoresearch/fashion-mnist.