Angular Learning: Toward Discriminative Embedded Features


The margin-based softmax loss functions greatly enhance intra-class compactness and perform well on face recognition and object classification tasks. Their superior performance, however, depends on careful hyperparameter selection. Moreover, the hard angle restriction also increases the risk of overfitting. In this paper, we propose an angular loss that promotes intra-class compactness by maximizing the angular gradient, thereby avoiding overfitting. Moreover, our method has only one adjustable constant for controlling intra-class compactness. We define three metrics to measure inter-class separability and intra-class compactness. In experiments, we test our method, as well as other methods, on several well-known datasets. Experimental results reveal that our method offers advantages in accuracy, discriminative information, and training time.


1 Introduction

Figure 1: The framework of convolutional representation learning. In representation learning, the weights of the last fully connected layer, which can also be seen as part of the loss function, help the convolution units learn a good feature embedding.

Traditional representation learning methods, like Locally Linear Embedding (LLE) [20], Laplacian Eigenmaps [2], and Hessian LLE [5], cannot produce a meaningful feature embedding or distance metric. More importantly, these traditional methods cannot deal with unseen samples. A single-layer neural network is a linear classifier; hence, the outputs of the second-to-last fully connected layer are excellent features for distinguishing classes. Representation learning by Deep Convolutional Neural Network (DCNN) embedding can generate embedded features for unseen samples and achieves excellent performance on face recognition tasks [25, 21, 26]. Figure 1 illustrates the general framework of representation learning by DCNN. The softmax function followed by cross-entropy loss is the most common component in DCNNs for classification tasks. However, the learned features are separable but not discriminative enough. This shortcoming raises the risk that learned features near the decision boundary are misclassified. Many researchers have noticed that discriminative features, produced by carefully designed loss functions, improve performance.

Contrastive loss [9] and triplet loss [22] use data pairs to enhance intra-class compactness and enlarge inter-class separability. They emphasize the relationship between features but increase the number of training pairs, which is theoretically up to $O(N^2)$ for the contrastive loss and up to $O(N^3)$ for the triplet loss (where $N$ is the size of the training set). Additionally, the complex form of these loss functions makes them hard to train. To encourage intra-class compactness with low computational overhead, the center loss [31] pulls the learned features toward their corresponding class centers. However, experiments (see Tables 2 and 3) show that this method reduces inter-class separability.

A line of research on margin-based methods [27, 29, 30] has several good properties, including clear geometric interpretation, strong intra-class compactness, and easy implementation. These methods also demonstrate the power of the margin in representation learning. The margin is a hard constraint that forces features belonging to different classes far apart. ArcFace, the outstanding one among margin-based methods, defines an additive angular margin parameter to enforce the intra-class angle to be significantly smaller than the inter-class angle. Nevertheless, the high performance needs a careful selection of hyperparameters, and the tiny intra-class angles may cause overfitting.

In this paper, an angular loss is proposed to generate discriminative features. Our objective is a loss function whose high angular gradient leads to a low intra-class angle. We also argue that margin-based methods face the risk of overfitting: in some cases, the feature of a mislabeled or high-variance sample, forced into the center of its class, degrades generalization ability. Our method does not impose this kind of hard constraint on features, allowing outliers to remain separable. Without losing flexibility, our method also provides a hyperparameter to adjust the intra-class compactness. Our contributions in this work are summarized as follows:

  • We cast a new viewpoint on the weakness of the softmax loss: its angular gradient approaches zero as the intra-class features converge to the corresponding center.

  • We propose a novel loss function, the angular loss, which has an elastic intra-class compactness regulation. The approach keeps the angle small through the angular gradient in a soft way, rather than imposing a hard restriction on the angles as margin-based methods do.

  • Experimental results on datasets of Fashion-MNIST, CIFAR10, and CIFAR100 reveal the effectiveness of our method.

In Section 2, we describe the preliminary knowledge and the methods we compare against. Section 3 introduces our method and the angular gradient. Empirically, our method consistently outperforms other methods in accuracy, intra-class compactness, and computational efficiency, as shown in Section 4. In summary, our work demonstrates a novel, convenient, and adjustable way to achieve intra-class compactness.

2 Related Works

2.1 Softmax loss

In supervised representation learning by DCNN, the softmax loss comprises a fully connected layer, a softmax function, and a cross-entropy loss [18], as shown in Figure 1. The deep convolution units usually transform a high-dimensional input into a low-dimensional embedded feature. The class probability of a sample $x$ for class $j$, symbolized by $p_j$, is calculated by the softmax function:

$$p_j = \frac{e^{f_j}}{\sum_{c=1}^{C} e^{f_c}},$$

where $C$ is the number of classes and the logits $f$ are produced by the DCNN. Then we get the loss by the cross-entropy:

$$L = -\log p_y,$$

where $y$ is the ground-truth label for the sample $x$. To better describe the following methods, in this paper we define $f$ as the output of the fully connected layer, $p$ as the output of the softmax function, and $L$ as the final output of the loss function. For convenience, we denote the embedded feature produced by the convolution units simply by $x$. A superscript on these symbols distinguishes between different loss functions.
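As a concrete sketch (a minimal NumPy illustration of the two formulas above, not the training code used in the paper), the softmax probabilities and the cross-entropy loss can be computed as:

```python
import numpy as np

def softmax(f):
    # subtract the row-wise max for numerical stability
    e = np.exp(f - f.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    # mean negative log-probability of the ground-truth class over the batch
    return -np.log(p[np.arange(len(y)), y]).mean()

# toy batch: 2 samples, 3 classes
f = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])
loss = cross_entropy(softmax(f), y)
```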

In the softmax loss, the last fully connected layer calculates the similarity between the feature $x_i$ and class $j$ by the dot product $f_j = W_j^{T}x_i + b_j$, where $W$ and $b$ are the weight matrix and bias of this layer. Finally, the original softmax loss can be written as:

$$L_{soft} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^{T}x_i + b_j}},$$

where $x_i$ is the embedded feature of the $i$-th sample, $N$ is the mini-batch size, and $y_i$ is the label of the $i$-th sample. SphereFace [17] studies the effect of the bias $b$: omitting it does no harm to performance but makes the loss easier to analyze. So we follow this modified softmax loss:

$$L_{mod} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\cos\theta_{y_i,i}}}{\sum_{j=1}^{C} e^{\|x_i\|\cos\theta_{j,i}}},$$

where $\theta_{j,i}$ denotes the angle between the feature $x_i$ and the weight vector $W_j$ of class $j$.
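To make the geometry concrete, here is a small NumPy sketch (our own illustration) of the bias-free logits $\|x\|\cos\theta_j$ with unit-norm class weight vectors:

```python
import numpy as np

def cosine_logits(x, W):
    # bias-free logits: f_j = ||x|| * cos(theta_j); each column of W is
    # normalized to unit length so the dot product exposes the angle
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    return x @ Wn

x = np.array([[3.0, 4.0]])            # ||x|| = 5
W = np.array([[1.0, 0.0],
              [0.0, 2.0]])            # two class weight vectors as columns
logits = cosine_logits(x, W)          # 5*cos(theta_1) = 3, 5*cos(theta_2) = 4
```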

The traditional softmax loss is the standard method for classification tasks. However, recent studies show that it cannot generate discriminative features: features near the decision boundary are more likely to be misclassified. Discriminative features improve generalization to unseen samples.

2.2 ArcFace

Weight normalization [31] not only accelerates training but also addresses the imbalanced-data problem by rebalancing the weight of each class [8]. Following [17, 29, 28], we fix the weights by normalization, $\|W_j\| = 1$, and multiply by a rescale hyperparameter $s$, so the formulation becomes

$$L_{norm} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i,i}}}{\sum_{j=1}^{C} e^{s\cos\theta_{j,i}}}.$$
L-Softmax [18] provides a new way, incorporating a margin, to enhance intra-class compactness and inter-class separability simultaneously. ArcFace, L-Softmax, and soft-margin [30] use different margin penalties and achieve good results; however, ArcFace was shown to outperform the others [4]. ArcFace applies an angular margin $m$ to penalize the angle: $\cos\theta_{y_i,i} \rightarrow \cos(\theta_{y_i,i} + m)$. Therefore the final formulation of the ArcFace loss can be written as:

$$L_{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i,i}+m)}}{e^{s\cos(\theta_{y_i,i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_{j,i}}}.$$

The margin-based methods decrease the probability of the ground-truth class. They constrain the distribution of features strictly, which may cause overfitting.
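The additive angular margin can be sketched as follows (a NumPy illustration of our own; $s=30$ and $m=0.5$ are common ArcFace settings, not necessarily the values used in our experiments):

```python
import numpy as np

def arcface_logits(cos_theta, y, s=30.0, m=0.5):
    # recover the angles, add the margin m to the ground-truth angle only,
    # then rescale all logits by s
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    logits = cos_theta.copy()
    rows = np.arange(len(y))
    logits[rows, y] = np.cos(theta[rows, y] + m)
    return s * logits

cos_theta = np.array([[0.9, 0.1]])
out = arcface_logits(cos_theta, np.array([0]))
# the margin strictly lowers the ground-truth logit: out[0, 0] < 30 * 0.9
```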

2.3 Center loss

The intuition of the center loss is to minimize intra-class variations. To achieve that, it directly decreases the Euclidean distance between each feature and its class center, formulated as:

$$L_C = \frac{1}{2}\sum_{i=1}^{N}\|x_i - c_{y_i}\|_2^2,$$

where $c_{y_i}$ denotes the center of the $y_i$-th class in feature space. $c_{y_i}$ converges to the mean of each class's features as the entire training set is taken into account. Besides, the center loss needs the softmax loss to keep different classes separated. The final formulation balances the (modified) softmax loss and the center loss with a scale parameter $\lambda$:

$$L_{center} = L_{mod} + \lambda L_C.$$
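A minimal NumPy sketch of the center term (in practice the centers are updated per mini-batch, which we omit here):

```python
import numpy as np

def center_term(x, y, centers):
    # 0.5 * squared euclidean distance to each sample's class center, averaged
    return 0.5 * ((x - centers[y]) ** 2).sum(axis=1).mean()

x = np.array([[1.0, 1.0],
              [3.0, 3.0]])
y = np.array([0, 1])
centers = np.array([[0.0, 0.0],
                    [3.0, 4.0]])
# total loss = modified softmax loss + lambda * center_term(x, y, centers)
```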
3 Method

Figure 2: A toy example of the process of updating $W$ during training, where $\hat{x}$ denotes the normalized feature, $W_j$ is normalized too, and the arrow indicates the gradient of the last layer. $W_j$ becomes closer to $\hat{x}$ if $j$ represents the ground-truth label, $j = y$.

SphereFace [17] provides a new view of the weights of the last fully connected layer: they represent the centers of each class in angular space. Enlightened by this, we draw a conclusion: minimizing the angle $\theta_{y_i,i}$ yields discriminative features, and the fast way to decrease this angle is to maximize its gradient.

In this paper, we propose a novel loss function named the angular loss, which applies $\arccos$ after the last fully connected layer to obtain a constant gradient:

$$L_{Arc} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\pi-\theta_{y_i,i})}}{\sum_{j=1}^{C} e^{s(\pi-\theta_{j,i})}}.$$
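The resulting logit can be sketched in NumPy as follows (our own reconstruction from the description above; the exact scaling is an assumption):

```python
import numpy as np

def angular_logits(cos_theta, s=1.0, k=4.0):
    # arccos recovers the angle; the logit is linear in theta, so its
    # derivative with respect to theta is the constant -s*k (assumed form)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return s * k * (np.pi - theta)

# linearity check: equal steps in theta give equal steps in the logit
thetas = np.array([0.2, 0.4, 0.6])
logits = angular_logits(np.cos(thetas))
```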
For convenience, we give a shorthand for each method, soft for the softmax loss, center for the center loss, ArcFace for the ArcFace loss, and Arc for the angular loss.

Figure 3: The partial angular gradient of four loss functions: the softmax, ArcFace, L-Softmax [18], and angular (our method) losses, under the same settings for all losses.

3.1 Gradient analysis

We investigate how the intra-class angle changes during training. To simplify the problem, we assume that the features remain unchanged in our analysis, since the features are not updated directly by backpropagation. Figure 2 illustrates how $W$ is updated during training. The angular gradient indicates the change of the intra-class angle. Hence, our intuitive objective is to design a loss function with a large angular gradient.
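The geometry of Figure 2 can be checked numerically (a simplified gradient step of our own, not the exact backpropagation update): a step of the weight toward the ground-truth feature shrinks the intra-class angle.

```python
import numpy as np

def angle(a, b):
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

x = np.array([1.0, 0.0])                     # normalized feature
w = np.array([np.cos(1.0), np.sin(1.0)])     # class weight, 1 rad away from x

before = angle(x, w)
w = w + 0.1 * x                              # step toward x (ground-truth class)
w = w / np.linalg.norm(w)                    # re-normalize the weight
after = angle(x, w)                          # the intra-class angle shrinks
```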

In the (modified) softmax loss, the partial angular gradient is

$$\frac{\partial f_{y_i}}{\partial \theta_{y_i,i}} = -\|x_i\|\sin\theta_{y_i,i}.$$

In ArcFace, the rescale parameter $s$ only rescales the gradient without changing where it vanishes, and $f_{y_i} = s\cos(\theta_{y_i,i}+m)$. The derivative with respect to $\theta_{y_i,i}$ is

$$\frac{\partial f_{y_i}}{\partial \theta_{y_i,i}} = -s\sin(\theta_{y_i,i}+m).$$

The angular loss has a constant gradient, enhancing the intra-class compactness consistently:

$$\frac{\partial f_{y_i}}{\partial \theta_{y_i,i}} = -s.$$
Figure 3 compares the gradients of the four loss functions. In the softmax loss, the gradient goes down to zero as $\theta$ approaches zero, which means it is hard to optimize the intra-class compactness when $\theta$ is small. ArcFace solves this problem with the offset $m$, which avoids the range of zero gradient. The sharp curve of L-Softmax leads to a smaller range of small angular gradients. Nevertheless, L-Softmax still has a zero angular gradient, which makes it hard to train.
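Numerically, near $\theta = 0$ the three gradient magnitudes behave as follows (using the gradient forms above; the constant angular form is our reconstruction):

```python
import numpy as np

s, m, k = 1.0, 0.5, 4.0
theta = 1e-3                              # intra-class angle close to zero

grad_softmax = s * np.sin(theta)          # vanishes as theta -> 0
grad_arcface = s * np.sin(theta + m)      # the margin m keeps it away from zero
grad_angular = s * k                      # constant, independent of theta
```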

Figure 4: The trade-off between a high angular gradient and high confidence.

3.2 Adjustable constant

Our method defines an extra constant $k$, which squeezes or stretches the curvature. The angle between the feature and a mismatched class is close to $\pi/2$, similar to the situation in [33]. Therefore, we assume that $\theta_{j,i} = \pi/2$ for $j \neq y_i$, and the loss becomes

$$L_{Arc} = -\log\frac{e^{sk(\pi-\theta_{y_i,i})}}{e^{sk(\pi-\theta_{y_i,i})} + (C-1)\,e^{sk\pi/2}}.$$

First, we should ensure that $e^{sk(\pi-\theta_{y_i,i})}$ is far bigger than $e^{sk\pi/2}$, that is, $k$ is large. Empirically, $k$ should be greater than 3. Then, let us consider the angular gradient. According to the chain rule:

$$\frac{\partial L_{Arc}}{\partial \theta_{y_i,i}} = \frac{\partial L_{Arc}}{\partial f_{y_i}} \cdot \frac{\partial f_{y_i}}{\partial \theta_{y_i,i}},$$

and combining Eq. (10), the angular gradient of the angular loss is

$$\frac{\partial L_{Arc}}{\partial \theta_{y_i,i}} = sk\,(1 - p_{y_i}).$$
Mathematically, the intra-class angle keeps decreasing as long as the angular gradient is bigger than zero. Therefore, a larger $k$ produces a smaller intra-class angle. We conduct an experiment on Fashion-MNIST comparing different values of $k$ and measure the WC-Intra angle (defined in Section 4.2) every 200 iterations. The results in Figure 5 reveal that the constant $k$ adjusts the intra-class compactness. A larger $k$ also increases the confidence of the ground-truth class, as shown in Figure 4.

Figure 5: WC-Intra angle vs. iteration.

4 Experiments

4.1 General experimental settings

Fashion-MNIST [32]: The Fashion-MNIST is a dataset of Zalando's article images designed as a drop-in replacement for the original MNIST [15]. Both are simple and light datasets; however, even the simplest classifiers can reach 90% test accuracy on MNIST. Many researchers have replaced MNIST with Fashion-MNIST in order to verify their ideas.
CIFAR10/CIFAR10+ [14]: The CIFAR10 dataset, containing ten classes, is widely used for image classification tasks. It is separated into a training set with 50000 samples and a test set with 10000 samples. CIFAR10+ denotes CIFAR10 with data augmentation. For the data augmentation, we follow the transformations in [16]: a random crop with 4-pixel padding on each side, a random flip with probability 0.5, and a z-score normalization.
CIFAR100/CIFAR100+ [14]: We also test our method on the CIFAR100 dataset, which has the same image size but is more complex. Due to the similarity of CIFAR10 and CIFAR100, we keep our experimental settings almost unchanged. CIFAR100+ denotes the CIFAR100 dataset with data augmentation.
Architecture: Deep residual networks [10] have been widely used in image classification tasks [11, 6], improving the performance of deep convolutional neural networks. Though many other modern architectures, such as Inception-ResNet, WideResNet, and ResNeXt [12], have been proposed, our purpose is not to achieve the best result on CIFAR10 but to verify the efficiency of our method. So we use the original ResNet and some modern training techniques introduced by Leslie Smith [23], who also provides a practical approach to selecting hyperparameters (such as learning rate, weight decay, and batch size). We follow this one-cycle policy to change the learning rates. We summarize our experimental settings in Table 1.
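The crop-and-flip augmentation described above for CIFAR10+/CIFAR100+ can be sketched in NumPy as follows (our own illustration; the experiments use the standard framework implementations, and the z-score normalization step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, pad=4):
    # random crop after 4-pixel zero padding, then a 0.5-probability flip
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        out = out[:, ::-1]                # horizontal flip
    return out

img = rng.random((32, 32, 3))
out = augment(img)                        # shape is preserved: (32, 32, 3)
```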

Dataset Fashion-MNIST CIFAR10 CIFAR100
CNU ResNet20 ResNet18 ResNet34
epochs 60 160 160
batch_size 256 128 128
lr(Soft) 0.1 0.01 0.01
lr(Arc) 0.1 0.1 0.1
lr(ArcFace) 0.1 0.1 0.1
lr(Center) 0.01 0.01 0.01
Table 1: The training settings of our experiments. "CNU" denotes the Convolution Units, and "lr" denotes the learning rate. Though the learning rates of these loss functions are different, all of them are well trained, reaching high accuracy (above 99.9%) on the training set. In all experiments, the weight decay is 0.0005 and the optimizer is Adam [13].

4.2 Angle descent verification

In this section, we demonstrate the effectiveness of our method in decreasing the angle from different aspects. First, we use a toy example to plot the actual feature distribution; second, we monitor the angles during training; last, we give confusion matrices to show the similarity of different features on real datasets (CIFAR10 and CIFAR100).

Intuitive interpretation

(a) Softmax
(b) Center
(c) ArcFace
(d) Arc
Figure 6: Embedded feature visualization on the Fashion-MNIST dataset. Specifically, we set the dimension of the features to 2. Though the features in Arc and ArcFace near the origin of coordinates seem close, they are far apart in angular space.

We conduct this experiment on the Fashion-MNIST dataset. For visualizing the embedded features, we set the dimension of the features to two and follow the training settings listed in Table 1. Figure 6 plots the distribution of features, comparing the angular loss with the other loss functions. It shows that the softmax loss generates roughly separable features, while the other methods produce discriminative features. ArcFace and Arc generate explicitly discriminative features; however, Arc is more tolerant of outliers.

Angle histogram

Our method can directly optimize the angle between the embedded feature and its target vector. To prove that, we define the following metrics [4]:

$$\text{WC-Intra} = \frac{1}{C}\sum_{j=1}^{C}\angle(W_j, c_j), \quad (12)$$
$$\text{W-Inter} = \frac{2}{C(C-1)}\sum_{j=1}^{C}\sum_{l=j+1}^{C}\angle(W_j, W_l), \quad (13)$$
$$\text{C-Inter} = \frac{2}{C(C-1)}\sum_{j=1}^{C}\sum_{l=j+1}^{C}\angle(c_j, c_l), \quad (14)$$

where $C$ is the number of classes, $c_j$ is the mean of the features belonging to class $j$, and $\angle(a, b)$ denotes the angle between $a$ and $b$. "W-Inter" refers to the mean angle between different target vectors $W_j$. "C-Inter" refers to the mean angle between different classes' feature centers. "WC-Intra" refers to the mean angle between the target vector and the feature center of each class. Tables 2 and 3 give the details of the angle statistics on the CIFAR10+ and CIFAR100 datasets. Note that the intra-class angle of ArcFace is extraordinarily small, which we argue probably causes overfitting.
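These three metrics can be computed as follows (a NumPy sketch of our own; `W` holds one target vector per row and `centers` one feature mean per class):

```python
import numpy as np

def _unit(V):
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def pairwise_mean_angle(V):
    # mean angle (rad) over all distinct pairs of rows: W-Inter on the
    # target vectors, C-Inter on the class feature centers
    cos = np.clip(_unit(V) @ _unit(V).T, -1.0, 1.0)
    iu = np.triu_indices(len(V), k=1)
    return np.arccos(cos[iu]).mean()

def wc_intra(W, centers):
    # mean angle between each class's target vector and its feature center
    cos = np.clip((_unit(W) * _unit(centers)).sum(axis=1), -1.0, 1.0)
    return np.arccos(cos).mean()

W = np.eye(3)                 # three orthogonal target vectors: mean angle pi/2
```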

Method W-Inter C-Inter WC-Intra
Soft 1.2679 1.6366 0.9089
Arc 1.6821 1.6822 0.0039
ArcFace 1.6821 1.6821 0.0021
Center 1.4749 1.6430 0.5604
Table 2: The angle statistics under different losses on CIFAR10+. ArcFace and Arc enlarge both inter-class separability and intra-class compactness.
Method W-Inter C-Inter WC-Intra
Soft 1.5680 1.5648 0.4056
Arc 1.5404 1.5395 0.0301
ArcFace 1.6256 1.6244 0.0103
Center 1.5712 1.5601 0.3467
Table 3: The angle statistics (rad) under different methods on CIFAR100. ArcFace and Arc enlarge both inter-class separability and intra-class compactness.

We give the angle histograms of different classes to capture the dynamic change of the intra-class angles in Figure 7. At the end of every epoch, we randomly pick 200 samples belonging to class 0 and measure the angle for each sample. In comparison with the softmax method, our approach reduces the target angle faster and keeps it at a lower level.

(a) Softmax
(b) Angular Loss
Figure 7: Visualization of the angle histograms in TensorBoard [1]. Each slice in the figure displays the intra-class angle histogram. The slices depict the change of the angle during the training process. The top slice, referring to the angle histogram at epoch 0, is darker, and the lower slices, referring to later epochs, are lighter.
(a) Soft on CIFAR10
(b) Center on CIFAR10
(c) ArcFace on CIFAR10
(d) Arc on CIFAR10
(e) Soft on CIFAR100
(f) Center on CIFAR100
(g) ArcFace on CIFAR100
(h) Arc on CIFAR100
Figure 8: The confusion matrix on CIFAR10 and CIFAR100 dataset.

Confusion matrix visualization

Visualization is difficult for high-dimensional features. Hence, the comparison of confusion matrices is given in Figure 8 to show the cosine similarity of features. In particular, on the CIFAR10 dataset, we randomly select 10 features for each class, for a total of 100 features; on the CIFAR100 dataset, we randomly select one feature for each class. The learned features are then L2-normalized, and the similarity is calculated by

$$S_{a,b} = \frac{x_a^{T} x_b}{\|x_a\|\,\|x_b\|}.$$
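The similarity matrix above amounts to the following NumPy sketch (our illustration):

```python
import numpy as np

def cosine_confusion(F):
    # L2-normalize each feature row, then all pairwise cosine similarities
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    return Fn @ Fn.T

F = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
S = cosine_confusion(F)       # diagonal is exactly 1
```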
One can see that ArcFace and Arc greatly enhance intra-class compactness and enlarge the inter-class separability.

4.3 Performance

We have shown that our method can enhance intra-class compactness. All these experiments show that our method significantly decreases the intra-class angle, as our theory infers. In this part, we compare four loss functions: the softmax loss, the center loss, the ArcFace loss, and the angular loss (ours).

The experiment on Fashion-MNIST, shown in Table 5, reveals that our method and the center loss are the best. Tables 6 and 7 reveal that Arc performs best, with ArcFace close behind. This means that discriminative information improves accuracy; however, ArcFace faces the problem of overfitting.

4.4 Computational efficiency

In this part, we use different numbers of epochs to train our model while keeping the other settings the same as in Table 1. The results in Table 4 reveal that our method accelerates training.

Epochs Soft Arc ArcFace Center
10 15.72 12.09 14.08 17.67
20 12.55 8.65 9.23 10.73
80 9.43 6.07 6.22 6.64
200 6.21 5.68 6.22 6.60
Table 4: The error rate(%) vs. epochs. Every row indicates one experiment. Our method consistently outperforms others within the same training epochs.
Method error_rate
Genetic DCNN [19] 5.4
CNN [3] 7.46
Soft 6.30
Arc 6.04
ArcFace 6.27
Center 6.04
Table 5: The error rate (%) on Fashion-MNIST dataset.
Method Params(M) CIFAR10 CIFAR10+
EM-Softmax [30] 15.2 - 6.69
Maxout [7] - 9.38 -
All-CNN [24] 1.3 9.08 7.25
Softmax 11.22 13.16±0.31 5.73±0.22
Center 11.22 12.22±0.45 5.40±0.21
Arc 11.22 11.77±0.31 5.23±0.19
ArcFace 11.22 12.11±0.45 5.34±0.12
Table 6: Recognition error rate (mean±std %) on CIFAR10 without data augmentation and CIFAR10+ with data augmentation. Every result is evaluated five times.
Method Params(M) Top1 Top5
EM-Softmax [30] 31.1 27.26 -
Maxout [7] - 38.57 -
All-CNN [24] 1.4 33.71 -
Softmax 21.54 25.64±0.12 10.58±0.26
Center 21.54 25.33±0.96 8.19±0.58
Arc 21.54 24.29±0.08 10.07±0.23
ArcFace 21.54 25.56±0.16 9.74±0.31
Table 7: Recognition error rate (mean±std %) on the CIFAR100+ dataset with data augmentation. Every result is evaluated five times.

5 Conclusion

In this paper, we proposed an angular loss function, which has a constant to adjust the compactness of the learned features. Our work also provides a potential direction for encouraging intra-class compactness through angular gradient analysis. Compared to the hard constraint of margin-based methods, our method avoids overfitting in a soft way. The experiments have also demonstrated that our method outperforms the state-of-the-art. Moreover, our method can accelerate training.


This work was supported by National Natural Science Foundation of China under Grant No. 61872419, No. 61573166, No. 61572230, No. 61873324, No. 81671785, No. 61672262. Shandong Provincial Natural Science Foundation No. ZR2019MF040, No. ZR2018LF005. Shandong Provincial Key R&D Program under Grant No. 2019GGX101041, No. 2018GGX101048, No. 2016ZDJS01A12, No. 2016GGX101001, No. 2017CXZC1206. Taishan Scholar Project of Shandong Province, China.


  1. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  2. Mikhail Belkin and Partha Niyogi, ‘Laplacian eigenmaps and spectral techniques for embedding and clustering’, in Advances in neural information processing systems, pp. 585–591, (2002).
  3. Shobhit Bhatnagar, Deepanway Ghosal, and Maheshkumar H Kolekar, ‘Classification of fashion article images using convolutional neural networks’, in 2017 Fourth International Conference on Image Information Processing (ICIIP), pp. 1–6. IEEE, (2017).
  4. Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, ‘Arcface: Additive angular margin loss for deep face recognition’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699, (2019).
  5. David L Donoho and Carrie Grimes, ‘Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data’, Proceedings of the National Academy of Sciences, 100(10), 5591–5596, (2003).
  6. Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord, ‘Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 642–651, (2017).
  7. Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, ‘Maxout networks’, arXiv preprint arXiv:1302.4389, (2013).
  8. Yandong Guo and Lei Zhang. One-shot face recognition by promoting underrepresented classes, 2017.
  9. Raia Hadsell, Sumit Chopra, and Yann LeCun, ‘Dimensionality reduction by learning an invariant mapping’, in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735–1742. IEEE, (2006).
  10. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, (2016).
  11. Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., ‘Cnn architectures for large-scale audio classification’, in 2017 ieee international conference on acoustics, speech and signal processing (icassp), pp. 131–135. IEEE, (2017).
  12. Asifullah Khan, Anabia Sohail, Umme Zahoora, and Aqsa Saeed Qureshi. A survey of the recent architectures of deep convolutional neural networks, 2019.
  13. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
  14. Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, ‘Cifar-10 and cifar-100 datasets’, URl: https://www. cs. toronto. edu/kriz/cifar. html, 6, (2009).
  15. Yann LeCun, ‘The mnist database of handwritten digits’, http://yann. lecun. com/exdb/mnist/, (1998).
  16. Chenyu Lee, Saining Xie, Patrick W Gallagher, Zhengyou Zhang, and Zhuowen Tu, ‘Deeply-supervised nets’, 562–570, (2015).
  17. Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song, ‘Sphereface: Deep hypersphere embedding for face recognition’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220, (2017).
  18. Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang, ‘Large-margin softmax loss for convolutional neural networks.’, in ICML, volume 2, p. 7, (2016).
  19. Benteng Ma and Yong Xia, ‘Autonomous deep learning: A genetic dcnn designer for image classification’.
  20. Sam T Roweis and Lawrence K Saul, ‘Nonlinear dimensionality reduction by locally linear embedding’, science, 290(5500), 2323–2326, (2000).
  21. Swami Sankaranarayanan, Azadeh Alavi, and Rama Chellappa, ‘Triplet similarity embedding for face verification’, arXiv preprint arXiv:1602.03418, (2016).
  22. Florian Schroff, Dmitry Kalenichenko, and James Philbin, ‘Facenet: A unified embedding for face recognition and clustering’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, (2015).
  23. Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018.
  24. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, ‘Striving for simplicity: The all convolutional net’, arXiv preprint arXiv:1412.6806, (2014).
  25. Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang, ‘Deep learning face representation by joint identification-verification’, in Advances in neural information processing systems, pp. 1988–1996, (2014).
  26. Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, ‘Deepface: Closing the gap to human-level performance in face verification’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708, (2014).
  27. Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, ‘Additive margin softmax for face verification’, IEEE Signal Processing Letters, 25(7), 926–930, (2018).
  28. Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille, ‘Normface’, Proceedings of the 2017 ACM on Multimedia Conference - MM ’17, (2017).
  29. Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu, ‘Cosface: Large margin cosine loss for deep face recognition’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274, (2018).
  30. Xiaobo Wang, Shifeng Zhang, Zhen Lei, Si Liu, Xiaojie Guo, and Stan Z Li, ‘Ensemble soft-margin softmax loss for image classification’, arXiv preprint arXiv:1805.03922, (2018).
  31. Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, ‘A discriminative feature learning approach for deep face recognition’, in European conference on computer vision, pp. 499–515. Springer, (2016).
  32. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
  33. Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. Adacos: Adaptively scaling cosine logits for effectively learning deep face representations, 2019.