Angular Learning: Toward Discriminative Embedded Features
Abstract
Margin-based softmax loss functions greatly enhance intra-class compactness and perform well on face recognition and object classification tasks. Their superior performance, however, depends on careful hyperparameter selection, and the hard angle restriction also increases the risk of overfitting. In this paper, we propose an angular loss that promotes intra-class compactness by maximizing the angular gradient, avoiding overfitting. Besides, our method has only one adjustable constant to control intra-class compactness. We define three metrics to measure inter-class separability and intra-class compactness. In experiments, we test our method, as well as other methods, on several well-known datasets. Experimental results reveal that our method improves accuracy, produces more discriminative features, and reduces training time.
1 Introduction
Traditional representation learning methods, like Locally Linear Embedding (LLE) [20], Laplacian Eigenmaps [2], and Hessian LLE [5], cannot produce a meaningful feature embedding or distance metric. More importantly, these traditional methods cannot deal with unseen samples. A neural network with a single layer can only separate classes linearly; hence, the outputs of the second-to-last fully connected layer are excellent features for distinguishing classes. Representation learning by Deep Convolutional Neural Network (DCNN) embedding can generate embedded features for unseen samples and achieves excellent performance on face recognition tasks [25, 21, 26]. Figure 1 illustrates the general framework of representation learning by DCNN. The softmax function followed by the cross-entropy loss is the most common component of DCNNs for classification tasks. However, the learned features are separable but not discriminative enough. This shortcoming raises the risk that learned features near the decision boundary are misclassified. Many researchers have noticed that more discriminative features, obtained by designing specific loss functions, improve performance.
Contrastive loss [9] and triplet loss [22] use data pairs to enhance intra-class compactness and enlarge inter-class separability. They emphasize the relationship between features but inflate the number of training pairs, which is theoretically up to $O(N^2)$ for the contrastive loss and $O(N^3)$ for the triplet loss (where $N$ is the size of the training set). Additionally, the complex form of these loss functions makes them hard to train. To encourage intra-class compactness with low computational overhead, center loss [31] pulls the learned features toward their corresponding class centers. However, experiments (see Tables 2 and 3) show that this method reduces inter-class separability.
A line of margin-based methods [27, 29, 30] has several good properties, including a clear geometric interpretation, strong intra-class compactness, and ease of implementation. These methods demonstrate the power of the margin in representation learning. The margin is a hard constraint that forces features of different classes far apart. ArcFace, an outstanding margin-based method, defines an additive angular margin $m$ to enforce the intra-class angle to be significantly smaller than the inter-class angle. Nevertheless, its high performance requires a careful selection of hyperparameters, and the tiny intra-class angles may cause overfitting.
In this paper, an angular loss is proposed to generate discriminative features. Our guiding observation is that a loss function with a large angular gradient leads to a small intra-class angle. We also argue that margin-based methods face the risk of overfitting: in some cases, forcing the feature of a mislabeled or high-variance sample toward its class center degrades generalization. Our method imposes no such hard constraint on features, allowing outliers to stay apart. Without losing flexibility, our method also provides a hyperparameter to adjust intra-class compactness. Our contributions are summarized as follows:

We cast a new viewpoint on the weakness of the softmax loss: its angular gradient approaches zero as the intra-class features converge to their corresponding centers.

We propose a novel loss function, the angular loss, whose regulation of intra-class compactness is elastic. The approach keeps the angle small through the angular gradient in a soft way, rather than imposing a hard restriction on the angles as margin-based methods do.

Experimental results on the FashionMNIST, CIFAR10, and CIFAR100 datasets reveal the effectiveness of our method.
Section 2 describes the preliminary knowledge and the methods we compare against. Section 3 introduces our method and the angular gradient. Empirically, our method consistently outperforms the other methods in accuracy, intra-class compactness, and computational efficiency, as shown in Section 4. In summary, our work demonstrates a novel, convenient, and adjustable way to achieve intra-class compactness.
2 Related Works
2.1 Softmax loss
In supervised representation learning by DCNN, the softmax loss comprises a fully connected layer, a softmax function, and a cross-entropy loss function, defined by [18] and shown in Figure 1. The deep convolution units transform the high-dimensional input into a low-dimensional embedded feature. The class probabilities of a sample, denoted $p$, are calculated by the softmax function:
$$p_j = \frac{e^{f_j}}{\sum_{k=1}^{C} e^{f_k}},$$
where $C$ is the number of classes and $f$ is the output of the DCNN. Then we obtain the loss via the cross-entropy loss,
$$L = -\log p_{y},$$
where $y$ is the ground-truth label of the sample.
To better describe the following methods, we define $f$ as the output of the fully connected layer, $p$ as the output of the softmax function, and $L$ as the final output of the loss function. For convenience, we denote the output of the convolution units, the embedded feature, simply by $x$. A superscript on these symbols distinguishes between different loss functions.
In the softmax loss, the last fully connected layer calculates the similarity between the feature and class $j$ by a dot product, $f_j = W_j^T x + b_j$, where $W$ and $b$ are the weight matrix and bias of this layer. The original softmax loss can then be written as:
$$L^{soft} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^{T}x_i + b_j}} \qquad (1)$$
where $x_i$ is the embedded feature of the $i$-th sample, $N$ is the mini-batch size, and $y_i$ is the label of the $i$-th sample. SphereFace [17] studies the effect of the bias $b$ and finds that omitting it does no harm to performance while making the analysis easier. So we follow this modified softmax loss:
$$L^{mod} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|W_{y_i}\|\|x_i\|\cos\theta_{y_i,i}}}{\sum_{j=1}^{C} e^{\|W_j\|\|x_i\|\cos\theta_{j,i}}} \qquad (2)$$
where $\theta_{j,i} = \angle(W_j, x_i)$ denotes the angle between the feature $x_i$ and the weight vector of class $j$.
The traditional softmax loss is the standard method for classification tasks. However, recent studies show that it cannot generate discriminative features: features near the decision boundary are more likely to be misclassified. Discriminative features improve generalization to unseen samples.
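To make the formulation concrete, here is a minimal pure-Python sketch (the function name and interface are illustrative, not the paper's code) that computes the modified softmax loss of a single sample from its angles to each class weight vector:

```python
import math

def modified_softmax_loss(thetas, target, w_norms, x_norm):
    """Modified softmax loss (Eq. 2, bias omitted) for one sample.

    thetas:  angle between the feature and each class weight vector
    target:  index of the ground-truth class
    w_norms: norms of the class weight vectors; x_norm: norm of the feature
    """
    # Logits are ||W_j|| * ||x|| * cos(theta_j).
    logits = [wn * x_norm * math.cos(t) for wn, t in zip(w_norms, thetas)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    p_target = exps[target] / sum(exps)  # softmax probability of the true class
    return -math.log(p_target)           # cross-entropy
```

A smaller target angle yields a higher target probability and hence a lower loss, which is why intra-class compactness can be discussed purely in terms of angles.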
2.2 ArcFace
Weight normalization [31] can not only accelerate training but also mitigate the imbalanced data problem by rebalancing the weight of each class [8]. Following [17, 29, 28], we fix the weights by normalization and multiply by a rescale hyperparameter $s$, so the formulation becomes
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i,i}}}{\sum_{j=1}^{C} e^{s\cos\theta_{j,i}}} \qquad (3)$$
L-Softmax [18] provides a new way, incorporating a margin, to enhance intra-class compactness and inter-class separability simultaneously. ArcFace, L-Softmax, and soft-margin softmax [30] use different margin penalties and achieve good results; however, ArcFace has been shown to outperform the others [4]. ArcFace applies an angular margin $m$ to penalize the target angle: $\cos\theta_{y_i,i} \rightarrow \cos(\theta_{y_i,i} + m)$. Therefore, the final formulation of the ArcFace loss can be written as:
$$L^{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i,i}+m)}}{e^{s\cos(\theta_{y_i,i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_{j,i}}} \qquad (4)$$
Margin-based methods decrease the probability of the ground-truth class and constrain the distribution of features strictly, which may cause overfitting.
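A minimal single-sample sketch of the ArcFace loss follows (illustrative names and default hyperparameters; it assumes weights and features are L2-normalized so the logits are $s\cos\theta$, as in Eq. 4):

```python
import math

def arcface_loss(thetas, target, s=16.0, m=0.5):
    """ArcFace loss (Eq. 4) for one sample: add the angular margin m to the
    target angle before the rescaled softmax."""
    logits = [s * math.cos(t) for t in thetas]
    logits[target] = s * math.cos(thetas[target] + m)  # margin penalty on the target class
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    return -math.log(exps[target] / sum(exps))
```

Because the margin lowers the target logit, the loss with $m > 0$ is strictly larger than without the margin for the same angles, which is exactly the extra pressure that shrinks the intra-class angle.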
2.3 Center loss
The intuition behind the center loss is to minimize intra-class variation. To achieve that, it directly decreases the Euclidean distance between each feature and its class center, formulated as:
$$L_C = \frac{1}{2}\sum_{i=1}^{N}\|x_i - c_{y_i}\|_2^2 \qquad (5)$$
where $c_{y_i}$ denotes the $y_i$-th class center of the features. $c_{y_i}$ converges to the mean of each class's features as the entire training set is taken into account. Besides, the center loss needs the softmax loss to keep different classes separated. The final formulation balances the (modified) softmax loss and the center loss by a scale parameter $\lambda$:
$$L = L^{mod} + \lambda L_C \qquad (6)$$
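The center-loss term of Eq. 5 can be sketched in a few lines (illustrative names; in practice this term is added to the softmax loss with the weight $\lambda$ of Eq. 6, and the centers themselves are updated during training):

```python
def center_loss(features, labels, centers):
    """Center loss term (Eq. 5): half the squared Euclidean distance between
    each embedded feature and its class center, summed over the mini-batch."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return total
```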
3 Method
SphereFace [17] provides a new view of the weights of the last fully connected layer: they represent the centers of each class in angular space. Enlightened by this, we derive our principle: minimizing the intra-class angle yields discriminative features, and the fast way to decrease the angle is to maximize its gradient.
In this paper, we propose a novel loss function named the angular loss, which adds an arccos operation after the last fully connected layer so that the logit has a constant gradient with respect to the angle:
$$L^{Arc} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{-\theta_{y_i,i}}}{\sum_{j=1}^{C} e^{-\theta_{j,i}}} \qquad (7)$$
For convenience, we use a shorthand for each method: Soft for the softmax loss, Center for the center loss, ArcFace for the ArcFace loss, and Arc for the angular loss.
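As a minimal, self-contained sketch of our reconstruction of Eq. 7 (the function name, the default value, and writing the compactness constant of Section 3.2 as $k$ are our assumptions, not the paper's reference implementation), the logit of class $j$ is simply $-k\,\theta_j$, the angle recovered by arccos:

```python
import math

def angular_loss(thetas, target, k=4.0):
    """Angular loss sketch: the logit of class j is -k * theta_j, so the
    logit's derivative w.r.t. the angle is the constant -k instead of the
    vanishing -s*sin(theta) of the softmax loss."""
    logits = [-k * t for t in thetas]
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    return -math.log(exps[target] / sum(exps))
```

A smaller target angle gives a smaller loss, and a larger $k$ sharpens the softmax, raising the confidence of the true class.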
3.1 Gradient analysis
We investigate how the intra-class angle changes during training. To simplify the problem, we assume the features remain unchanged in our analysis, because the features are not updated directly by backpropagation. Figure 2 illustrates how the weight vector $W_{y_i}$ is updated during training. The angular gradient indicates the change of the intra-class angle; hence, our intuitive objective is to design a loss function with a large angular gradient.
In the softmax loss (with normalized weights and rescale $s$), the partial angular gradient with respect to the target angle is
$$\frac{\partial L^{mod}}{\partial \theta_{y_i,i}} = s\,(1 - p_{y_i})\sin\theta_{y_i,i} \qquad (8)$$
In ArcFace, the rescale parameter $s$ does not change the shape of the gradient curve, and the margin $m$ shifts the angle by a constant offset. The derivative with respect to $\theta_{y_i,i}$ is
$$\frac{\partial L^{ArcFace}}{\partial \theta_{y_i,i}} = s\,(1 - p_{y_i})\sin(\theta_{y_i,i} + m) \qquad (9)$$
The angular loss has a constant logit-to-angle gradient ($\partial(-\theta)/\partial\theta = -1$), so its angular gradient never vanishes at small angles, enhancing the intra-class compactness consistently:
$$\frac{\partial L^{Arc}}{\partial \theta_{y_i,i}} = 1 - p_{y_i} \qquad (10)$$
Figure 3 compares the gradients of the four loss functions. In the softmax loss, the gradient goes down to zero as $\theta_{y_i,i}$ approaches zero, which means it is hard to optimize intra-class compactness when the angle is small. ArcFace solves this problem with the offset $m$, which avoids the range of zero gradient. The sharper curve of L-Softmax leads to a smaller range of small angular gradients; nevertheless, L-Softmax still has a zero angular gradient, which makes it hard to train.
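The vanishing versus non-vanishing gradient behaviour can also be checked numerically. The snippet below is a self-contained illustration (the loss sketches and all names are our assumptions) that compares central-difference angular gradients at a very small target angle:

```python
import math

def _softmax_ce(logits, target):
    """Cross-entropy of a softmax over the given logits."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    return -math.log(exps[target] / sum(exps))

def softmax_loss(thetas, target, s=16.0):
    # Rescaled softmax loss (Eq. 3): logits are s * cos(theta).
    return _softmax_ce([s * math.cos(t) for t in thetas], target)

def angular_loss(thetas, target, k=4.0):
    # Angular loss sketch: logits are -k * theta.
    return _softmax_ce([-k * t for t in thetas], target)

def numeric_grad(loss_fn, thetas, target, eps=1e-5):
    """Central-difference gradient of the loss w.r.t. the target angle."""
    up = list(thetas); up[target] += eps
    dn = list(thetas); dn[target] -= eps
    return (loss_fn(up, target) - loss_fn(dn, target)) / (2 * eps)

# Near theta = 0 the softmax gradient collapses through sin(theta),
# while the angular loss keeps a non-vanishing gradient.
g_soft = numeric_grad(softmax_loss, [0.01, 1.5, 1.5], 0)
g_arc = numeric_grad(angular_loss, [0.01, 1.5, 1.5], 0)
```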
3.2 Adjustable constant
Our method defines an extra constant $k$, which squeezes or stretches the curvature of the loss. The angle between a feature and a mismatched class is close to $\pi/2$, similar to the situation in [33]. Therefore, we assume that $\theta_{j,i} \approx \pi/2$ for $j \neq y_i$, and $L^{Arc}$ becomes
$$L^{Arc} \approx -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{-k\theta_{y_i,i}}}{e^{-k\theta_{y_i,i}} + (C-1)\,e^{-k\pi/2}}$$
First, we should ensure that $e^{-k\theta_{y_i,i}}$ is far bigger than $(C-1)\,e^{-k\pi/2}$, that is, $k$ is large enough; empirically, $k$ should be greater than 3. Then, consider the angular gradient. According to the chain rule,
$$\frac{\partial L^{Arc}}{\partial \theta_{y_i,i}} = k\,\frac{\partial L^{Arc}}{\partial (k\,\theta_{y_i,i})}.$$
Combining with Eq. (10), the angular gradient of the angular loss is
$$\frac{\partial L^{Arc}}{\partial \theta_{y_i,i}} = k\,(1 - p_{y_i}) \qquad (11)$$
Mathematically, the intra-class angle keeps decreasing as long as the angular gradient is greater than zero; therefore, a larger $k$ produces a smaller intra-class angle. We conduct an experiment on FashionMNIST comparing different values of $k$ and record the WCIntra metric (defined in Eq. 12) every 200 iterations. The results in Figure 5 reveal that the constant $k$ adjusts the intra-class compactness, and a larger $k$ increases the confidence of the true class, as shown in Figure 4.
4 Experiments
4.1 General experimental settings
FashionMNIST [32]: FashionMNIST is a dataset of Zalando's article images designed as a replacement for the original MNIST [15]. Both are simple and lightweight datasets; however, even the simplest classifiers can reach 90% test accuracy on MNIST. Many researchers have therefore replaced MNIST with FashionMNIST to verify their ideas.
CIFAR10/CIFAR10+ [14]: The CIFAR10 dataset, containing ten classes, is widely used for image classification tasks. It is separated into a training set with 50000 samples and a test set with 10000 samples. CIFAR10+ denotes CIFAR10 with data augmentation, for which we follow the transformations in [16]: random cropping with 4-pixel padding on each side, random horizontal flipping with probability 0.5, and z-score normalization.
CIFAR100/CIFAR100+ [14]: We also evaluate our method on the CIFAR100 dataset, which has the same image size as CIFAR10 but is more complex. Due to the similarity of CIFAR10 and CIFAR100, we keep our experimental settings almost unchanged. CIFAR100+ denotes the CIFAR100 dataset with data augmentation.
Architecture: Deep residual networks [10] have been widely used in image classification tasks [11, 6], improving the performance of deep convolutional neural networks. Although many other modern architectures, such as Inception-ResNet, Wide ResNet, and ResNeXt [12], have been proposed, our purpose is not to achieve the best result on CIFAR10 but to verify the efficiency of our method. So we use the original ResNet together with some modern training techniques introduced by Leslie Smith [23], who also provides a practical approach to selecting hyperparameters (such as the learning rate, weight decay, and batch size). We follow his one-cycle policy to schedule the learning rate. Table 1 summarizes our experimental settings.
Dataset  FashionMNIST  CIFAR10  CIFAR100 
CNU  ResNet20  ResNet18  ResNet34 
epochs  60  160  160 
batch_size  256  128  128 
lr(Soft)  0.1  0.01  0.01 
lr(Arc)  0.1  0.1  0.1 
lr(ArcFace)  0.1  0.1  0.1 
lr(Center)  0.01  0.01  0.01 
4.2 Angle descent verification
In this section, we demonstrate the effectiveness of our method in decreasing the angle from different aspects. First, we use a toy example to plot the actual feature distribution; second, we monitor the angles during training; last, we give the confusion matrix to grasp the similarity of different features on real datasets (CIFAR10 and CIFAR100).
Intuitive interpretation
We conduct this experiment on the FashionMNIST dataset. To visualize the embedded features, we set the feature dimension to two and follow the training settings listed in Table 1. Figure 6 plots the distribution of features, comparing the angular loss with the other loss functions: the softmax loss generates only roughly separable features, while the other methods produce discriminative features. ArcFace and Arc generate explicitly discriminative features; however, Arc is more tolerant of outliers.
Angle histogram
Our method directly optimizes the angle between an embedded feature and its target vector. To demonstrate that, we define the following metrics [4]:
WCIntra $= \frac{1}{C}\sum_{j=1}^{C}\angle(W_j, c_j)$  (12)
WInter $= \frac{2}{C(C-1)}\sum_{j=1}^{C}\sum_{l=j+1}^{C}\angle(W_j, W_l)$  (13)
CInter $= \frac{2}{C(C-1)}\sum_{j=1}^{C}\sum_{l=j+1}^{C}\angle(c_j, c_l)$  (14)
where $C$ is the number of classes, $c_j$ is the mean of the features belonging to class $j$, and $\angle(a, b)$ denotes the angle between vectors $a$ and $b$. "WInter" is the mean angle between different target vectors $W_j$. "CInter" is the mean angle between different classes' feature centers. "WCIntra" is the mean angle between the target vector and the feature center of each class. Tables 2 and 3 give the angle statistics on the CIFAR10 and CIFAR100+ datasets. Note that the intra-class angle of ArcFace is extraordinarily small, which we argue probably causes overfitting.
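The three metrics follow directly from their definitions; a sketch (with illustrative helper names, where `W` holds the target vectors and `centers` holds the per-class feature means):

```python
import math

def angle(a, b):
    """Angle between two vectors, with the cosine clamped for numerical safety."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def wc_intra(W, centers):
    """WCIntra: mean angle between each target vector W_j and its class center."""
    return sum(angle(w, c) for w, c in zip(W, centers)) / len(W)

def pairwise_mean_angle(vectors):
    """Mean angle over all distinct pairs: WInter when applied to the target
    vectors, CInter when applied to the class centers."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(angle(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
```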
Method  WInter  CInter  WCIntra 

Soft  1.2679  1.6366  0.9089 
Arc  1.6821  1.6822  0.0039 
ArcFace  1.6821  1.6821  0.0021 
Center  1.4749  1.6430  0.5604 
Method  WInter  CInter  WCIntra 

Soft  1.5680  1.5648  0.4056 
Arc  1.5404  1.5395  0.0301 
ArcFace  1.6256  1.6244  0.0103 
Center  1.5712  1.5601  0.3467 
We give angle histograms of different classes to capture the dynamic change of the intra-class angles in Figure 7. At the end of every epoch, we randomly pick 200 samples belonging to class 0 and measure the angle of each sample. In comparison with the softmax method, our approach reduces the target angle faster and keeps it at a lower level.
Confusion matrix visualization
Visualization is difficult for high-dimensional features. Hence, a comparison of confusion matrices is given in Figure 8 to show the cosine similarity of features. In particular, on the CIFAR10 dataset, we randomly select 10 features per class, for a total of 100 features; on the CIFAR100 dataset, we randomly select one feature per class. The learned features are then L2-normalized, and the similarity is calculated by
$$S_{ij} = \tilde{x}_i^{T}\tilde{x}_j, \qquad \tilde{x} = \frac{x}{\|x\|_2} \qquad (15)$$
One can see that ArcFace and Arc greatly enhance intra-class compactness and enlarge inter-class separability.
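The similarity matrix of Eq. 15 can be sketched as follows (illustrative names; after L2 normalization, cosine similarity reduces to a plain dot product):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity_matrix(features):
    """Pairwise cosine similarities (Eq. 15) of L2-normalized features."""
    normed = [l2_normalize(v) for v in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]
```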
4.3 Performance
We have shown that our method enhances intra-class compactness: all of the above experiments show that it significantly decreases the intra-class angle, as our theory predicts. In this part, we compare four loss functions: the softmax loss, the center loss, the ArcFace loss, and the angular loss (ours).
4.4 Computational efficiency
In this part, we train the model for different numbers of epochs while keeping the other settings of Table 1 unchanged. The results in Table 4 reveal that our method reaches a lower error rate with fewer epochs, i.e., it accelerates training.
Epochs  Soft  Arc  ArcFace  Center 

10  15.72  12.09  14.08  17.67 
20  12.55  8.65  9.23  10.73 
80  9.43  6.07  6.22  6.64 
200  6.21  5.68  6.22  6.60 
Method  Error rate (%) 

Genetic DCNN [19]  5.4 
CNN [3]  7.46 
Soft  6.30 
Arc  6.04 
ArcFace  6.27 
Center  6.04 
Method  Params(M)  CIFAR10  CIFAR10+ 
EMSoftmax [30]  15.2    6.69 
Maxout [7]    9.38   
AllCNN [24]  1.3  9.08  7.25 
Softmax  11.22  13.16±0.31  5.73±0.22 
Center  11.22  12.22±0.45  5.40±0.21 
Arc  11.22  11.77±0.31  5.23±0.19 
ArcFace  11.22  12.11±0.45  5.34±0.12 
Method  Params(M)  Top1  Top5 
EMSoftmax [30]  31.1  27.26   
Maxout [7]    38.57   
AllCNN [24]  1.4  33.71   
Softmax  21.54  25.64±0.12  10.58±0.26 
Center  21.54  25.33±0.96  8.19±0.58 
Arc  21.54  24.29±0.08  10.07±0.23 
ArcFace  21.54  25.56±0.16  9.74±0.31 
5 Conclusion
In this paper, we proposed an angular loss function with a constant $k$ that adjusts the compactness of the learned features. Our work also provides a promising direction for encouraging intra-class compactness through angular gradient analysis. Compared to the hard constraint of margin-based methods, our method avoids overfitting in a soft way. The experiments have also demonstrated that our method outperforms the state of the art. Moreover, our method accelerates training.
This work was supported by National Natural Science Foundation of China under Grant No. 61872419, No. 61573166, No. 61572230, No. 61873324, No. 81671785, No. 61672262. Shandong Provincial Natural Science Foundation No. ZR2019MF040, No. ZR2018LF005. Shandong Provincial Key R&D Program under Grant No. 2019GGX101041, No. 2018GGX101048, No. 2016ZDJS01A12, No. 2016GGX101001, No. 2017CXZC1206. Taishan Scholar Project of Shandong Province, China.
References
 Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 Mikhail Belkin and Partha Niyogi, ‘Laplacian eigenmaps and spectral techniques for embedding and clustering’, in Advances in neural information processing systems, pp. 585–591, (2002).
 Shobhit Bhatnagar, Deepanway Ghosal, and Maheshkumar H Kolekar, ‘Classification of fashion article images using convolutional neural networks’, in 2017 Fourth International Conference on Image Information Processing (ICIIP), pp. 1–6. IEEE, (2017).
 Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, ‘Arcface: Additive angular margin loss for deep face recognition’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699, (2019).
 David L Donoho and Carrie Grimes, ‘Hessian eigenmaps: Locally linear embedding techniques for highdimensional data’, Proceedings of the National Academy of Sciences, 100(10), 5591–5596, (2003).
 Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord, ‘Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 642–651, (2017).
 Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, ‘Maxout networks’, arXiv preprint arXiv:1302.4389, (2013).
 Yandong Guo and Lei Zhang. Oneshot face recognition by promoting underrepresented classes, 2017.
 Raia Hadsell, Sumit Chopra, and Yann LeCun, ‘Dimensionality reduction by learning an invariant mapping’, in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735–1742. IEEE, (2006).
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016).
 Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., ‘Cnn architectures for largescale audio classification’, in 2017 ieee international conference on acoustics, speech and signal processing (icassp), pp. 131–135. IEEE, (2017).
 Asifullah Khan, Anabia Sohail, Umme Zahoora, and Aqsa Saeed Qureshi. A survey of the recent architectures of deep convolutional neural networks, 2019.
 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
 Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, ‘CIFAR-10 and CIFAR-100 datasets’, URL: https://www.cs.toronto.edu/kriz/cifar.html, (2009).
 Yann LeCun, ‘The MNIST database of handwritten digits’, http://yann.lecun.com/exdb/mnist/, (1998).
 Chenyu Lee, Saining Xie, Patrick W Gallagher, Zhengyou Zhang, and Zhuowen Tu, ‘Deeply-supervised nets’, in Artificial Intelligence and Statistics, pp. 562–570, (2015).
 Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song, ‘Sphereface: Deep hypersphere embedding for face recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220, (2017).
 Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang, ‘Large-margin softmax loss for convolutional neural networks’, in ICML, volume 2, p. 7, (2016).
 Benteng Ma and Yong Xia, ‘Autonomous deep learning: A genetic dcnn designer for image classification’.
 Sam T Roweis and Lawrence K Saul, ‘Nonlinear dimensionality reduction by locally linear embedding’, science, 290(5500), 2323–2326, (2000).
 Swami Sankaranarayanan, Azadeh Alavi, and Rama Chellappa, ‘Triplet similarity embedding for face verification’, arXiv preprint arXiv:1602.03418, (2016).
 Florian Schroff, Dmitry Kalenichenko, and James Philbin, ‘Facenet: A unified embedding for face recognition and clustering’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, (2015).
 Leslie N. Smith. A disciplined approach to neural network hyperparameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018.
 Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, ‘Striving for simplicity: The all convolutional net’, arXiv preprint arXiv:1412.6806, (2014).
 Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang, ‘Deep learning face representation by joint identificationverification’, in Advances in neural information processing systems, pp. 1988–1996, (2014).
 Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, ‘Deepface: Closing the gap to humanlevel performance in face verification’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708, (2014).
 Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, ‘Additive margin softmax for face verification’, IEEE Signal Processing Letters, 25(7), 926–930, (2018).
 Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille, ‘NormFace: L2 hypersphere embedding for face verification’, in Proceedings of the 2017 ACM on Multimedia Conference (MM ’17), (2017).
 Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu, ‘Cosface: Large margin cosine loss for deep face recognition’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274, (2018).
 Xiaobo Wang, Shifeng Zhang, Zhen Lei, Si Liu, Xiaojie Guo, and Stan Z Li, ‘Ensemble soft-margin softmax loss for image classification’, arXiv preprint arXiv:1805.03922, (2018).
 Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, ‘A discriminative feature learning approach for deep face recognition’, in European conference on computer vision, pp. 499–515. Springer, (2016).
 Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
 Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. Adacos: Adaptively scaling cosine logits for effectively learning deep face representations, 2019.