Angular Learning: Toward Discriminative Embedded Features
Margin-based softmax loss functions greatly enhance intra-class compactness and perform well on face recognition and object classification tasks. Their strong performance, however, depends on careful hyperparameter selection. Moreover, the hard angle restriction increases the risk of overfitting. In this paper, we propose an angular loss, derived by maximizing the angular gradient, that promotes intra-class compactness while avoiding this overfitting. In addition, our method has only one adjustable constant for controlling intra-class compactness. We define three metrics to measure inter-class separability and intra-class compactness. In experiments, we test our method and competing methods on several well-known datasets. The results show that our method is superior in accuracy, in the discriminative information of the learned features, and in training time.
Traditional representation learning methods, such as Locally Linear Embedding (LLE), Laplacian Eigenmap and Hessian LLE, cannot produce a meaningful feature embedding or distance metric. More importantly, these traditional methods cannot deal with unseen samples. The last layer of a neural network is a single linear layer, which can only separate classes that are linearly separable. Hence, the outputs of the second-to-last fully connected layer must be good features for distinguishing classes. Representation learning by Deep Convolutional Neural Network (DCNN) embedding can generate embedded features for unseen samples and achieves excellent performance on face recognition tasks [25, 21, 26]. Figure 1 illustrates the general framework of representation learning by DCNN. The softmax function followed by the cross-entropy loss is the most common component of a DCNN for classification tasks. However, the learned features are separable but not discriminative enough. This shortcoming raises the risk that learned features near the decision boundary are misclassified. Many researchers have observed that discriminative features, obtained by designing specific loss functions, improve performance.
Contrastive loss and triplet loss use data pairs to enhance intra-class compactness and enlarge inter-class separability. They emphasize the relationship between features but increase the number of training pairs, which is theoretically up to $O(N^2)$ for the contrastive loss and $O(N^3)$ for the triplet loss (where $N$ is the size of the training set). Additionally, the complex form of these loss functions makes them hard to train. To encourage intra-class compactness with low computational overhead, center loss pulls the learned features toward their corresponding class centers. However, experiments (see Tables 2 and 3) show that this method reduces inter-class separability.
One line of work, the margin-based methods [27, 29, 30], has several good properties, including a clear geometric interpretation, strong intra-class compactness and ease of implementation. These methods also demonstrate the power of the margin in representation learning. The margin is a hard constraint that forces features belonging to different classes far apart. ArcFace, the outstanding margin-based method, defines an additive angular margin parameter that forces the intra-class angle to be significantly smaller than the inter-class angle. Nevertheless, its high performance requires a careful selection of hyperparameters, and the tiny intra-class angles may cause overfitting.
In this paper, we propose an angular loss to generate discriminative features. Our premise is that a loss function with a high angular gradient leads to a low intra-class angle. We also argue that margin-based methods risk overfitting: in some cases, the feature of a mislabeled or high-variance sample is forced toward the center of its class, which harms generalization. Our method imposes no such hard constraint on features, so outliers can remain separable. Without losing flexibility, our method also provides a hyperparameter to adjust the intra-class compactness. Our contributions are summarized as follows:
- We offer a new viewpoint on the weakness of the softmax loss: its angular gradient approaches zero as the intra-class features converge to the corresponding center.
- We propose a novel loss function, the angular loss, which regulates intra-class compactness elastically. It uses the angular gradient to drive the angle down in a soft way, rather than restricting the angles in a hard way as margin-based methods do.
- Experimental results on the Fashion-MNIST, CIFAR10, and CIFAR100 datasets demonstrate the effectiveness of our method.
In Section 2, we describe the preliminaries and the methods we compare against. Section 3 introduces our method and the angular gradient. Empirically, our method consistently outperforms the other methods in accuracy, intra-class compactness, and computational efficiency, as shown in Section 4. In short, our work demonstrates a novel, convenient, and adjustable way to achieve intra-class compactness.
2 Related Works
2.1 Softmax loss
In supervised representation learning with a DCNN, the softmax loss consists of a fully connected layer, a softmax function, and a cross-entropy loss function, as shown in Figure 1. The deep convolutional units usually transform a high-dimensional input into a low-dimensional embedded feature. The class probabilities of a sample $x$, denoted by $p$, are calculated by the softmax function:
$$p_i = \frac{e^{f_i(x)}}{\sum_{j=1}^{C} e^{f_j(x)}}$$
where $C$ is the number of classes and $f$ represents the DCNN. The loss is then given by the cross-entropy,
$$L = -\sum_{i=1}^{C} y_i \log p_i$$
where $y$ is the (one-hot) ground truth label for the sample $x$.
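The softmax function and the cross-entropy loss described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation; the logit values and the names `softmax` and `cross_entropy` are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Convert a list of logits into class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy for one sample whose ground truth is the integer class `label`."""
    return -math.log(probs[label])

# Hypothetical outputs of the last fully connected layer for a 3-class problem.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

With a one-hot label, the sum over classes in the cross-entropy reduces to the single term for the ground truth class, which is why `cross_entropy` only reads `probs[label]`.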
To better describe the following methods, in this paper we define $z$ as the output of the fully connected layer, $p$ as the output of the softmax function, and $L$ as the final output of the loss function. For convenience, we denote the output of the convolution units, i.e. the embedded feature, simply by $x$. The superscript of these symbols distinguishes between different loss functions.
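In the softmax loss, the last fully connected layer calculates the similarity between the feature and each class by a dot product, $z_j = W_j^\top x + b_j$, where $W$ is the weight matrix of this layer. Finally, the original softmax loss can be written as:
$$L^{soft} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^\top x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^\top x_i + b_j}}$$
where $x_i$ is the embedded feature of the $i$-th sample, $N$ is the mini-batch size and $y_i$ is the label of the $i$-th sample. SphereFace studies the effects of the bias $b$ and finds that omitting the bias does no harm to the performance but makes the loss easier to analyze. So we follow this modified softmax loss:
$$L^{soft} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|W_{y_i}\|\|x_i\|\cos\theta_{y_i}}}{\sum_{j=1}^{C} e^{\|W_j\|\|x_i\|\cos\theta_j}}$$
where $\theta$ is the angle between two vectors; $\theta_{y_i}$ (or $\theta_j$) denotes the angle between the feature $x_i$ and the weight vector of class $y_i$ (or $j$).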
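The following sketch illustrates why the bias-free softmax loss can be written with norms and angles: the dot product of a weight vector and a feature equals the product of their norms and the cosine of the angle between them. The helper names and the toy vectors are assumptions made for this example only.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def angle(a, b):
    """Angle in radians between two non-zero vectors."""
    c = dot(a, b) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error

def modified_softmax_loss(features, weights, labels):
    """Bias-free softmax loss written with norms and angles, averaged over the batch."""
    total = 0.0
    for x, y in zip(features, labels):
        logits = [norm(w) * norm(x) * math.cos(angle(w, x)) for w in weights]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[y]  # equals -log(softmax probability of class y)
    return total / len(features)

# Toy 2-D features, one weight vector per class, and integer labels (all hypothetical).
features = [[1.0, 2.0], [0.5, -1.0]]
weights = [[3.0, 4.0], [-1.0, 0.5]]
labels = [0, 1]
loss = modified_softmax_loss(features, weights, labels)
print(loss)
```

The angular form gives the same value as the dot-product form; its benefit is that the angle appears explicitly, which the later gradient analysis relies on.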
The traditional softmax loss is the standard method for classification tasks. However, recent studies show that it cannot generate discriminative features: features near the decision boundary are more likely to be misclassified. Discriminative features improve generalization to unseen samples.
Weight normalization can not only accelerate training but also alleviate the imbalanced data problem by rebalancing the weight of each class. Following [17, 29, 28], we fix the weight norm by normalisation and multiply by a rescale hyperparameter $s$, so the formulation becomes
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{C} e^{s\cos\theta_j}}$$
L-softmax provides a new way, incorporating a margin, to enhance intra-class compactness and inter-class separability simultaneously. ArcFace, L-softmax and soft-margin use different margin penalties and achieve good results; however, ArcFace has been shown to outperform the others. ArcFace applies an angular margin $m$ to penalise the angle: $\cos(\theta_{y_i} + m)$. Therefore the final formulation of the ArcFace loss can be written as:
$$L^{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}}$$
Margin-based methods decrease the probability of the ground truth class. They strictly constrain the distribution of features, which may cause overfitting.
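To make the effect of an additive angular margin concrete, the sketch below compares the ground truth probability with and without the margin for one sample. It assumes the usual ArcFace form, in which the target logit is the scale times the cosine of the angle plus the margin; the angles, the scale `s` and the margin `m` are illustrative values, not settings from this paper.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_logits(thetas, s):
    """Logits of the normalized softmax: s * cos(theta_j) for every class j."""
    return [s * math.cos(t) for t in thetas]

def arcface_logits(thetas, label, s, m):
    """Same as cosine_logits, except the margin m is added to the target angle.

    Assumes thetas[label] + m stays below pi, where cosine is still decreasing.
    """
    return [s * math.cos(t + m) if j == label else s * math.cos(t)
            for j, t in enumerate(thetas)]

# Hypothetical angles (radians) between one feature and three class weight vectors.
thetas = [0.3, 1.4, 1.6]
s, m = 10.0, 0.5
p_plain = softmax(cosine_logits(thetas, s))[0]
p_margin = softmax(arcface_logits(thetas, 0, s, m))[0]
print(p_plain, p_margin)
```

Because the margin lowers only the target logit, the ground truth probability drops for the same angles, so the network has to shrink the intra-class angle further to reach the same loss.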
2.3 Center loss
The intuition behind the center loss is to minimize intra-class variations. To achieve that, it directly decreases the Euclidean distance between each feature and its class center, formulated as:
$$L^{c} = \frac{1}{2}\sum_{i=1}^{N}\|x_i - c_{y_i}\|_2^2$$
where $c_{y_i}$ denotes the $y_i$-th class center of the feature $x_i$. $c_{y_i}$ will converge to the center of the features of its class as the entire training set is taken into account. In addition, the center loss needs the softmax loss to keep different classes separated. The final formulation balances the (modified) softmax loss and the center loss by a scale parameter $\lambda$:
$$L^{center} = L^{soft} + \lambda L^{c}$$
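A minimal sketch of the center loss and of its combination with the softmax loss is given below. It follows the standard definition (half the summed squared Euclidean distance to the class center, weighted by a scale parameter); the function names, the toy data and the value of `lam` are assumptions for illustration, and the update rule for the centers is omitted.

```python
def center_loss(features, labels, centers):
    """Half the summed squared Euclidean distance between each feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return total

def joint_loss(softmax_loss_value, center_loss_value, lam):
    """Softmax loss plus the center loss weighted by the scale parameter `lam`."""
    return softmax_loss_value + lam * center_loss_value

# Toy example: two 2-D features, two classes, both centers at the origin (hypothetical).
features = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = [[0.0, 0.0], [0.0, 0.0]]
lc = center_loss(features, labels, centers)
total = joint_loss(1.0, lc, lam=0.1)
print(lc, total)
```

The center term only pulls features toward their own center; nothing in it pushes different centers apart, which is why it is paired with the softmax loss.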
SphereFace provides a new view of the weights of the last fully connected layer: they represent the centers of each class in angular space. Inspired by this, we hypothesize that minimizing the angle between a feature and its class weight vector yields discriminative features, and that the fastest way to decrease this angle is to maximize the angular gradient.
In this paper, we propose a novel loss function named angular loss, which adds arccos after the last fully connected layer, to give a constant gradient.
For convenience, we give a shorthand for each method, soft for the softmax loss, center for the center loss, ArcFace for the ArcFace loss, and Arc for the angular loss.
3.1 Gradient analysis
We investigate how the intra-class angle changes during training. To simplify the problem, we assume in our analysis that the features remain unchanged, because the features are not updated directly by backpropagation. Figure 2 illustrates how the weight vector is updated during training. The angular gradient indicates the change of the intra-class angle. Hence, our intuitive objective is to design a loss function that has a large angular gradient.
In softmax, the partial angular gradient is
In ArcFace, the rescale parameter has no effect on gradient, and . The derivation of is
The angular loss has a constant gradient, enhancing the intra-class compactness consistently.
Figure 3 compares the gradients of four loss functions. In softmax, the gradient goes to zero as the angle approaches zero, which means that it is hard to optimize the intra-class compactness when the angle is small. ArcFace addresses this problem with its margin offset, which moves the zero-gradient point away from small angles. The sharp curve of L-Softmax leads to a narrower range of small angular gradients. Nevertheless, L-Softmax still has points of zero angular gradient, which make it hard to train.
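The behaviour described above can be checked numerically on the target logit alone. The sketch below differentiates three logit shapes with respect to the angle: a cosine logit (softmax with normalized weights), a cosine logit with an additive margin (ArcFace style), and a logit that is linear in the angle. The linear form is only an assumed stand-in for a loss that applies arccos to obtain a constant angular gradient; the scale factor and the softmax probability term are deliberately left out, and `m` and `a` are illustrative values.

```python
import math

def grad_cosine_logit(theta):
    """d/d(theta) of cos(theta): vanishes as theta approaches zero."""
    return -math.sin(theta)

def grad_margin_logit(theta, m=0.5):
    """d/d(theta) of cos(theta + m): the zero point is shifted to theta = -m."""
    return -math.sin(theta + m)

def grad_linear_logit(theta, a=1.0):
    """d/d(theta) of -a * theta: constant, independent of the angle (assumed form)."""
    return -a

for theta in (1.0, 0.5, 0.1, 0.001):
    print(theta,
          round(grad_cosine_logit(theta), 4),
          round(grad_margin_logit(theta), 4),
          round(grad_linear_logit(theta), 4))
```

As the angle shrinks, the first column of gradients fades toward zero, the second stays bounded away from zero for small positive angles, and the third does not change at all.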
3.2 Adjustable constant
Our method defines an extra constant , which squeezes/stretches the curvature. The angle between the feature and the mismatched class is close to , similar to the situation in . Therefore, we assume that . becomes
First, we should ensure that is far bigger than , that is, is very large. Empirically, should be greater than 3. Then, let us consider the angular gradient. According to the chain rule:
Combining this with Eq. 10, the angular gradient of the angular loss is
Mathematically, the intra-class angle keeps decreasing as long as the angular gradient is greater than zero. Therefore, a larger constant produces a smaller intra-class angle. We conduct an experiment on Fashion-MNIST comparing different values of the constant and record the WC-Intra (defined in Eq. 14) every 200 iterations. The results in Figure 5 reveal that the constant adjusts the intra-class compactness. A larger constant also increases the confidence of the target class, as shown in Figure 4.
4.1 General experimental settings
Fashion-MNIST: Fashion-MNIST is a dataset of Zalando's article images designed as a replacement for the original MNIST. Both are simple and lightweight datasets; however, even the simplest classifiers can reach 90% test accuracy on MNIST. Many researchers have therefore replaced MNIST with Fashion-MNIST to verify their ideas.
CIFAR10/CIFAR10+ : The CIFAR10 dataset, containing ten classes, is widely used for image classification tasks. It is split into a training set of 50000 samples and a test set of 10000 samples. CIFAR10+ denotes CIFAR10 with data augmentation. For the data augmentation, we follow the transformations in : random cropping with 4-pixel padding on each side, random flipping with a probability of 0.5, and z-score normalization.
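The three augmentation steps listed above can be sketched as follows. This is a simplified illustration on a single-channel image stored as a list of rows; zero padding, horizontal flipping, and a single global mean and standard deviation are assumptions, since the source does not spell out these details.

```python
import random

def pad_and_random_crop(img, pad=4, rng=random):
    """Zero-pad the image by `pad` pixels per side, then crop back to the original size at a random offset."""
    h, w = len(img), len(img[0])
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = img[r][c]
    top = rng.randint(0, 2 * pad)   # inclusive bounds
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]

def random_horizontal_flip(img, p=0.5, rng=random):
    """Mirror the image left to right with probability p."""
    return [row[::-1] for row in img] if rng.random() < p else img

def z_score(img, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [[(v - mean) / std for v in row] for row in img]

def augment(img, mean, std, rng=random):
    cropped = pad_and_random_crop(img, pad=4, rng=rng)
    flipped = random_horizontal_flip(cropped, p=0.5, rng=rng)
    return z_score(flipped, mean, std)

rng = random.Random(0)                       # fixed seed so the example is repeatable
image = [[1.0] * 32 for _ in range(32)]      # hypothetical 32x32 single-channel image
augmented = augment(image, mean=0.5, std=0.25, rng=rng)
print(len(augmented), len(augmented[0]))
```

In practice the mean and standard deviation would be computed per channel on the training set, and the same normalization (without cropping or flipping) would be applied at test time.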
CIFAR100/CIFAR100+ : We also test our method on the CIFAR100 dataset, which has the same image size but is more complex. Because CIFAR10 and CIFAR100 are similar, we keep our experimental settings almost unchanged. CIFAR100+ denotes the CIFAR100 dataset with data augmentation.
Architecture: Deep residual networks have been widely used in image classification tasks [11, 6], improving the performance of deep convolutional neural networks. Although many other modern architectures, such as Inception-ResNet, WideResNet and ResNeXt, have been proposed, our purpose is not to achieve the best result on CIFAR10 but to verify the effectiveness of our method. We therefore use the original ResNet together with some modern training techniques introduced by Leslie Smith, who also provides a practical approach to selecting hyperparameters (such as learning rate, weight decay and batch size). We follow his one-cycle policy to schedule the learning rate. Our experimental settings are summarized in Table 1.
4.2 Angle descent verification
In this section, we demonstrate from different aspects that our method effectively reduces the angle. First, we use a toy example to plot the actual feature distribution. Second, we monitor the angles during training. Last, we give the confusion matrix to capture the similarity of different features on real datasets (CIFAR10 and CIFAR100).
We conduct this experiment on the Fashion-MNIST dataset. To visualize the embedded features, we set the feature dimension to two and follow the training settings listed in Table 1. Figure 6 plots the distribution of features, comparing the angular loss with the other loss functions. It shows that the softmax loss generates roughly separable features, while the other methods produce discriminative features. ArcFace and Arc generate clearly discriminative features; however, Arc is more tolerant of outliers.
Our method can directly optimize the angle between an embedded feature and its target vector. To demonstrate this, we define the following metrics:
$$\text{W-Inter} = \frac{1}{C(C-1)}\sum_{i=1}^{C}\sum_{j\neq i}\langle W_i, W_j\rangle$$
$$\text{C-Inter} = \frac{1}{C(C-1)}\sum_{i=1}^{C}\sum_{j\neq i}\langle c_i, c_j\rangle$$
$$\text{WC-Intra} = \frac{1}{C}\sum_{i=1}^{C}\langle W_i, c_i\rangle$$
where $C$ is the number of classes, $c_i$ is the mean of the features belonging to the same class $i$, and $\langle a, b\rangle$ denotes the angle between a and b. “W-Inter” refers to the mean of the angles between different target vectors $W_i$. “C-Inter” refers to the mean of the angles between the feature centers of different classes. “WC-Intra” refers to the mean of the angles between the target vector and the feature center of each class $i$. Tables 2 and 3 give the detailed angle statistics on the CIFAR10 and CIFAR100+ datasets. Note that the intra-class angle of ArcFace is extraordinarily small, which we argue probably causes overfitting.
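The three angle metrics can be computed as in the sketch below, which follows their verbal definitions: the mean angle between different target vectors, the mean angle between different class feature centers, and the mean angle between each target vector and its own class center. Reporting angles in degrees, the function names, and the toy data are choices made for this illustration; every class is assumed to have at least one sample.

```python
import math

def angle_deg(a, b):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))

def class_centers(features, labels, num_classes):
    """Mean feature vector of every class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += x[d]
    return [[v / counts[k] for v in sums[k]] for k in range(num_classes)]

def mean_pairwise_angle(vectors):
    """Mean angle over all ordered pairs of different vectors."""
    n = len(vectors)
    angles = [angle_deg(vectors[i], vectors[j])
              for i in range(n) for j in range(n) if i != j]
    return sum(angles) / len(angles)

def w_inter(weights):
    return mean_pairwise_angle(weights)

def c_inter(centers):
    return mean_pairwise_angle(centers)

def wc_intra(weights, centers):
    return sum(angle_deg(w, c) for w, c in zip(weights, centers)) / len(weights)

# Toy data: two classes in 2-D, weight vectors aligned with the axes (hypothetical).
weights = [[1.0, 0.0], [0.0, 1.0]]
features = [[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
labels = [0, 0, 1]
centers = class_centers(features, labels, num_classes=2)
print(w_inter(weights), c_inter(centers), wc_intra(weights, centers))
```

Larger W-Inter and C-Inter values indicate better inter-class separability, while a smaller WC-Intra indicates that the class centers sit closer to their target vectors.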
Figure 7 shows the angle histograms of different classes to capture the dynamic change of the intra-class angles. At the end of every epoch, we randomly pick 200 samples belonging to class 0 and measure the angle for each sample. Compared with the softmax method, our approach reduces the target angle faster and to a lower level.
Confusion matrix visualization
Visualization is difficult for high-dimensional features. Hence, Figure 8 compares confusion matrices that show the cosine similarity of features. In particular, on the CIFAR10 dataset, we randomly select 10 features for each class, giving a total of 100 features; on the CIFAR100 dataset, we randomly select one feature for each class. The learned features are then L2-normalized, and the similarity between features $x_i$ and $x_j$ is calculated by
$$S_{ij} = \frac{x_i^\top x_j}{\|x_i\|_2\,\|x_j\|_2}$$
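The similarity matrix behind such a figure can be sketched as follows: each feature is scaled to unit L2 norm, and the entry for a pair of features is their dot product, i.e. their cosine similarity. The function names and the toy features are assumptions made for this example.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    n = math.sqrt(sum(vi * vi for vi in v))
    return [vi / n for vi in v]

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity of all features, as a list of rows."""
    normed = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(p, q)) for q in normed] for p in normed]

# Toy features: the first two point in the same direction, the third is orthogonal.
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 5.0]]
sim = cosine_similarity_matrix(features)
for row in sim:
    print([round(v, 3) for v in row])
```

When the selected features are ordered by class, compact and well-separated classes show up as bright blocks on the diagonal and dark entries elsewhere.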
One can see that ArcFace and Arc greatly enhance intra-class compactness and enlarge the inter-class separability.
We have shown that our method can enhance intra-class compactness: all of the experiments above indicate that it significantly decreases the intra-class angle, as our theory predicts. In this part, we compare four loss functions: the softmax loss, the center loss, the ArcFace loss, and the angular loss (ours).
4.4 Computational efficiency
|Genetic DCNN ||5.4|
In this paper, we proposed an angular loss function that has a constant for adjusting the compactness of the learned features. Our work also suggests a potential direction for encouraging intra-class compactness through angular gradient analysis. Compared with the hard constraint of margin-based methods, our method reduces the risk of overfitting in a soft way. The experiments also demonstrate that our method outperforms the compared state-of-the-art methods. Moreover, our method can accelerate training.
This work was supported by National Natural Science Foundation of China under Grant No. 61872419, No. 61573166, No. 61572230, No. 61873324, No. 81671785, No. 61672262. Shandong Provincial Natural Science Foundation No. ZR2019MF040, No. ZR2018LF005. Shandong Provincial Key R&D Program under Grant No. 2019GGX101041, No. 2018GGX101048, No. 2016ZDJS01A12, No. 2016GGX101001, No. 2017CXZC1206. Taishan Scholar Project of Shandong Province, China.
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- Mikhail Belkin and Partha Niyogi, ‘Laplacian eigenmaps and spectral techniques for embedding and clustering’, in Advances in neural information processing systems, pp. 585–591, (2002).
- Shobhit Bhatnagar, Deepanway Ghosal, and Maheshkumar H Kolekar, ‘Classification of fashion article images using convolutional neural networks’, in 2017 Fourth International Conference on Image Information Processing (ICIIP), pp. 1–6. IEEE, (2017).
- Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, ‘Arcface: Additive angular margin loss for deep face recognition’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699, (2019).
- David L Donoho and Carrie Grimes, ‘Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data’, Proceedings of the National Academy of Sciences, 100(10), 5591–5596, (2003).
- Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord, ‘Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 642–651, (2017).
- Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, ‘Maxout networks’, arXiv preprint arXiv:1302.4389, (2013).
- Yandong Guo and Lei Zhang. One-shot face recognition by promoting underrepresented classes, 2017.
- Raia Hadsell, Sumit Chopra, and Yann LeCun, ‘Dimensionality reduction by learning an invariant mapping’, in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735–1742. IEEE, (2006).
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016).
- Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., ‘CNN architectures for large-scale audio classification’, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE, (2017).
- Asifullah Khan, Anabia Sohail, Umme Zahoora, and Aqsa Saeed Qureshi. A survey of the recent architectures of deep convolutional neural networks, 2019.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
- Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, ‘CIFAR-10 and CIFAR-100 datasets’, URL: https://www.cs.toronto.edu/~kriz/cifar.html, 6, (2009).
- Yann LeCun, ‘The MNIST database of handwritten digits’, http://yann.lecun.com/exdb/mnist/, (1998).
- Chenyu Lee, Saining Xie, Patrick W Gallagher, Zhengyou Zhang, and Zhuowen Tu, ‘Deeply-supervised nets’, 562–570, (2015).
- Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song, ‘Sphereface: Deep hypersphere embedding for face recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220, (2017).
- Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang, ‘Large-margin softmax loss for convolutional neural networks.’, in ICML, volume 2, p. 7, (2016).
- Benteng Ma and Yong Xia, ‘Autonomous deep learning: A genetic dcnn designer for image classification’.
- Sam T Roweis and Lawrence K Saul, ‘Nonlinear dimensionality reduction by locally linear embedding’, science, 290(5500), 2323–2326, (2000).
- Swami Sankaranarayanan, Azadeh Alavi, and Rama Chellappa, ‘Triplet similarity embedding for face verification’, arXiv preprint arXiv:1602.03418, (2016).
- Florian Schroff, Dmitry Kalenichenko, and James Philbin, ‘Facenet: A unified embedding for face recognition and clustering’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, (2015).
- Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018.
- Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, ‘Striving for simplicity: The all convolutional net’, arXiv preprint arXiv:1412.6806, (2014).
- Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang, ‘Deep learning face representation by joint identification-verification’, in Advances in neural information processing systems, pp. 1988–1996, (2014).
- Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, ‘Deepface: Closing the gap to human-level performance in face verification’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708, (2014).
- Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, ‘Additive margin softmax for face verification’, IEEE Signal Processing Letters, 25(7), 926–930, (2018).
- Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille, ‘Normface’, Proceedings of the 2017 ACM on Multimedia Conference - MM ’17, (2017).
- Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu, ‘Cosface: Large margin cosine loss for deep face recognition’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274, (2018).
- Xiaobo Wang, Shifeng Zhang, Zhen Lei, Si Liu, Xiaojie Guo, and Stan Z Li, ‘Ensemble soft-margin softmax loss for image classification’, arXiv preprint arXiv:1805.03922, (2018).
- Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, ‘A discriminative feature learning approach for deep face recognition’, in European conference on computer vision, pp. 499–515. Springer, (2016).
- Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
- Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. Adacos: Adaptively scaling cosine logits for effectively learning deep face representations, 2019.