# Asymmetric Loss For Multi-Label Classification

## Abstract

Pictures of everyday life are inherently multi-label in nature. Hence, multi-label classification is commonly used to analyze their content. In typical multi-label datasets, each picture contains only a few positive labels, and many negative ones. This positive-negative imbalance can result in under-emphasizing gradients from positive labels during training, leading to poor accuracy.

In this paper, we introduce a novel asymmetric loss ("ASL") that operates differently on positive and negative samples. The loss dynamically down-weights the importance of easy negative samples, causing the optimization process to focus more on the positive samples, and also enables discarding mislabeled negative samples.

We demonstrate how ASL leads to a more "balanced" network, with increased average probabilities for positive samples, and show how this balanced network is translated to better mAP scores, compared to commonly used losses. Furthermore, we offer a method that can dynamically adjust the level of asymmetry throughout the training.

With ASL, we reach new state-of-the-art results on three common multi-label datasets, including MS-COCO. We also demonstrate ASL's applicability to other tasks, such as fine-grained single-label classification and object detection.

ASL is effective, easy to implement, and does not increase the training time or complexity. Implementation is available at: https://github.com/Alibaba-MIIL/ASL.

## 1 Introduction

Typical images contain multiple objects and concepts [33], highlighting the importance of multi-label classification for real-world tasks. Recently, impressive progress has taken place in multi-label benchmarks such as MS-COCO [20], NUS-WIDE [6], Pascal-VOC [11] and Open Images [17]. Notable success was reported by exploiting label correlation via graph neural networks that represent the label relationships [5, 4, 10] or word embeddings based on knowledge priors [5, 30]. Other approaches are based on modeling image parts and attentional regions [35, 12, 31, 34], and using recurrent neural networks [23, 28]. In addition to these efforts, we believe that some of the primary core blocks in the learning process should be revised and adapted for multi-label classification.

A key characteristic of multi-label classification is the inherent class imbalance when the overall number of labels is large. By nature, most images will contain only a small fraction of the labels, implying that the number of positive samples per category will predominantly be much lower than the number of negative samples. To address this, [32] suggested a loss function for handling the imbalance in multi-label problems. However, it was aimed specifically at long-tail distribution scenarios. High negative-positive imbalance is also apparent in dense object detection, where the imbalance stems from the ratio of foreground vs. background pixels. A common solution is to adopt the focal loss [19], which decays the loss as the label's probability increases. This puts focus on hard samples, while down-weighting easy samples, which are mostly related to easy background locations.

Surprisingly, focal loss is seldom used for multi-label classification, and cross-entropy is often the default choice (see [5, 1, 3, 21, 12], for example). Since high negative-positive imbalance is also encountered in multi-label classification, focal loss might provide better results, as it encourages focusing on relevant hard-negative samples: for a given positive class, "easy" negative examples are mostly related to images that do not contain the positive class, but other categories located far away in the feature space. For example, when learning the class "Dog" it will be more worthwhile to focus on negative samples which contain classes that are closely located in the feature space, such as "Cat" and "Cow", while decreasing the weights of easy negative samples which contain classes such as "Aeroplane", "Car" or "Bicycle".

Nevertheless, for the case of multi-label classification, treating the positive and negative samples equally, as proposed by focal loss, is sub-optimal since it results in accumulating more loss gradients from negative samples, and down-weighting important contributions from positive samples. In other words, the network might focus on learning features of negative samples while under-emphasizing learning features of positive samples. Our experiments will corroborate this analysis by showing that with focal loss, the network’s average probabilities of positive samples are much lower compared to the average probabilities of negative samples.

In this paper, we introduce an asymmetric loss (ASL) for multi-label classification, which addresses the negative-positive imbalance. We show that a careful design of the loss can significantly benefit the training and classification results. ASL is based on two key properties. First, to focus on hard samples while maintaining the contribution of positive samples, we decouple the modulations of the positive and negative samples and assign them different exponential decay factors. This allows us to set a lower decay factor for the positive examples, thus putting more emphasis on them. Second, we propose to shift the probabilities of negative samples to completely discard very easy samples (hard thresholding). By formulating the loss derivatives, we show that probability shifting also enables discarding very hard negative samples, suspected as mislabeled, which are common in multi-label problems [10].

We compare ASL to the common symmetrical loss functions, cross-entropy and focal loss, and show significant mAP improvement using our asymmetrical formulation. By analyzing the model's probabilities, we demonstrate the effectiveness of ASL in balancing between negative and positive samples. We also introduce a method that dynamically adjusts the asymmetry level throughout the training process, by demanding a fixed gap between positive and negative average probabilities, which simplifies the hyper-parameter selection process.

Using ASL, we obtain state-of-the-art results on three common multi-label benchmarks, as can be seen in Figure 1. For example, on the MS-COCO dataset our mAP score surpasses the previous state-of-the-art. We also demonstrate that ASL is applicable to other computer vision tasks, such as fine-grained single-label classification and object detection.

## 2 Asymmetric Loss

In this section, we will first review cross-entropy and focal loss. Then we will introduce the components of the proposed asymmetric loss (ASL), designed to address the inherently imbalanced nature of multi-label datasets. We will also analyze ASL gradients, provide probability analysis, and present a method to dynamically set the loss' asymmetry levels during training.

### 2.1 Binary Cross-Entropy and Focal Loss

A general form of a binary loss applied on each output of the network is given by:

$$L = -yL_{+} - (1-y)L_{-} \tag{1}$$

where $y$ is the ground-truth label, and $L_{+}$ and $L_{-}$ are the positive and negative loss parts, respectively. Following [19], focal loss is obtained by setting $L_{+}$ and $L_{-}$ as:

$$\begin{cases} L_{+} = (1-p)^{\gamma}\log(p) \\ L_{-} = p^{\gamma}\log(1-p) \end{cases} \tag{2}$$

where $p$ is the network's output probability and $\gamma$ is the focusing parameter. Setting $\gamma = 0$ yields binary cross-entropy.

By setting $\gamma > 0$ in Eq. 2, the contribution of easy negative samples (those to which the network assigns a low probability $p$) can be down-weighted in the loss function, making it possible to focus more on harder samples during the training.
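To make the down-weighting concrete, the following minimal Python sketch evaluates Eqs. 1 and 2 for a single network output. The function names and the sample probabilities are illustrative choices, not taken from the paper's implementation.

```python
import math

def focal_terms(p, gamma):
    """Return (L_plus, L_minus) from Eq. 2 for one probability p in (0, 1)."""
    l_plus = (1.0 - p) ** gamma * math.log(p)
    l_minus = p ** gamma * math.log(1.0 - p)
    return l_plus, l_minus

def binary_loss(p, y, gamma):
    """Eq. 1: L = -y * L_plus - (1 - y) * L_minus, for a label y in {0, 1}."""
    l_plus, l_minus = focal_terms(p, gamma)
    return -y * l_plus - (1 - y) * l_minus

# gamma = 0 recovers plain binary cross-entropy.
bce_easy_negative = binary_loss(0.1, 0, gamma=0.0)
# gamma = 2 scales the loss of the same easy negative by p**2 = 0.01.
focal_easy_negative = binary_loss(0.1, 0, gamma=2.0)
# A hard negative (p = 0.9) keeps most of its loss: the factor is 0.81.
focal_hard_negative = binary_loss(0.9, 0, gamma=2.0)
```

The same modulation applies to positives through the factor $(1-p)^{\gamma}$, which is the source of the trade-off discussed in the next subsection.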

### 2.2 Asymmetric Focusing

When using focal loss for imbalanced multi-label datasets, there is an inherent trade-off: setting a high $\gamma$, to sufficiently down-weight the contribution from easy negatives, may eliminate the gradients from the rare positive samples.

We propose to decouple the focusing levels of the positive and negative samples. Let $\gamma_{+}$ and $\gamma_{-}$ be the positive and negative focusing parameters, respectively. The loss parts are then written as:

$$\begin{cases} L_{+} = (1-p)^{\gamma_{+}}\log(p) \\ L_{-} = p^{\gamma_{-}}\log(1-p) \end{cases} \tag{3}$$

As we are interested in emphasizing the contribution of positive samples, we usually set $\gamma_{-} > \gamma_{+}$.

Differently from focal loss, which applies the same decay factor for both the positive and negative parts, with asymmetric focusing we decouple the decay rates. Hence we can better control the contribution of positive and negative samples to the loss function, and help the network to learn meaningful features from positive samples, despite their rarity.

Note that methods which address class imbalance via static weighting factors were proposed in previous works [15, 8]. However, [19] found that those weighting factors interact with the focusing parameter, making it necessary to select the two together. In practice, [19] even suggested a weighting factor that favors background samples. Hence we chose to avoid adding static weighting factors, and control the asymmetry level via two separate focusing factors, which can dynamically compensate for the negative-positive imbalance.
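The effect of decoupling can be sketched in a few lines of Python. The sketch below assumes Eq. 3 plugged into Eq. 1; the function name and the values $\gamma_{+}=0$, $\gamma_{-}=4$ are illustrative assumptions, chosen only to contrast with a symmetric setting.

```python
import math

def asymmetric_focusing_loss(p, y, gamma_pos, gamma_neg):
    """Per-output loss from Eqs. 1 and 3, with separate focusing parameters."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    return -(p ** gamma_neg) * math.log(1.0 - p)

# A positive sample the network is still unsure about (p = 0.2).
symmetric_positive = asymmetric_focusing_loss(0.2, 1, gamma_pos=4.0, gamma_neg=4.0)
decoupled_positive = asymmetric_focusing_loss(0.2, 1, gamma_pos=0.0, gamma_neg=4.0)

# The negative branch is untouched by the choice of gamma_pos.
symmetric_negative = asymmetric_focusing_loss(0.2, 0, gamma_pos=4.0, gamma_neg=4.0)
decoupled_negative = asymmetric_focusing_loss(0.2, 0, gamma_pos=0.0, gamma_neg=4.0)
```

With the symmetric setting the positive loss is scaled by $0.8^4 \approx 0.41$, whereas the decoupled setting keeps the full cross-entropy term for positives while attenuating easy negatives exactly as before.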

### 2.3 Asymmetric Probability Shifting

Asymmetric focusing enables soft thresholding of negative samples, reducing their contribution to the loss as their probability decreases. We propose an additional asymmetric mechanism, probability shifting, that can perform hard thresholding of very easy samples, meaning that negative samples are fully discarded when their probability is low enough. This can be beneficial in cases of extreme positive-negative imbalance, where the soft thresholding mechanism cannot sufficiently attenuate the loss gradients from all the negative samples. With hard thresholding, very easy negative samples can be fully discarded, not just attenuated.

Let us define the shifted probability, $p_m$, as:

$$p_m = \max(p - m, 0) \tag{4}$$

where the probability margin $m$ is a tunable hyper-parameter. Integrating $p_m$ into the loss function of negative samples in Eq. 2, we get an asymmetric probability-shifted focal loss:

$$L_{-} = (p_m)^{\gamma}\log(1-p_m) \tag{5}$$

In Figure 2 we draw the probability-shifted focal loss, for negative samples, and compare it to regular focal loss and cross-entropy.

We can see from Figure 2 that, from a geometrical point of view, probability shifting is equivalent to moving the loss function to the right by $m$, and outputting zero for probabilities lower than $m$. Very easy negative samples, with a probability lower than $m$, will incur zero loss: hard thresholding. We will later show, via gradient analysis, another important property of the probability shifting mechanism: it can also reject mislabeled negative samples.

Notice that the concept of probability shifting is not limited to cross-entropy or focal loss, and can be used on many loss functions. Linear hinge loss [1], for example, can also be seen as (symmetric) probability shifting of linear loss. Also notice that logits shifting, as suggested in [19] and [32], is different from probability shifting due to the non-linear sigmoid operation.
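As a rough illustration of Eqs. 4 and 5, the Python sketch below computes the loss contribution $-L_{-}$ of a single negative sample. The function names and the margin $m = 0.2$ are assumptions made for the example only.

```python
import math

def shifted_probability(p, m):
    """Eq. 4: shift the probability down by the margin m, clipping at zero."""
    return max(p - m, 0.0)

def shifted_negative_loss(p, gamma, m):
    """Loss contribution -L_minus of one negative sample, following Eq. 5."""
    p_m = shifted_probability(p, m)
    if p_m == 0.0:
        return 0.0  # hard thresholding: the sample is discarded entirely
    return -(p_m ** gamma) * math.log(1.0 - p_m)

# Very easy negative (p below the margin): zero loss.
very_easy = shifted_negative_loss(0.1, gamma=2.0, m=0.2)
# Shifting by m moves the curve to the right: the shifted loss at p = 0.5
# matches the unshifted focal loss at p = 0.3.
shifted_at_half = shifted_negative_loss(0.5, gamma=2.0, m=0.2)
unshifted_at_0_3 = shifted_negative_loss(0.3, gamma=2.0, m=0.0)
```

Because the shift is applied to the probability rather than to the logit, the curve is translated rigidly along the probability axis, which is the geometric picture described for Figure 2.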

### 2.4 ASL Definition

We can integrate asymmetric focusing and probability shifting into a unified formula, to obtain the proposed asymmetric loss (ASL):

$$\text{ASL} = \begin{cases} L_{+} = (1-p)^{\gamma_{+}}\log(p) \\ L_{-} = (p_m)^{\gamma_{-}}\log(1-p_m) \end{cases} \tag{6}$$

where $p_m$ is defined in Eq. 4. ASL allows us to apply two types of asymmetry for reducing the contribution of easy negative samples to the loss function: soft thresholding via the focusing parameters $\gamma_{-} > \gamma_{+}$, and hard thresholding via the probability margin $m$.

It can be convenient to set $\gamma_{+} = 0$, so that positive samples will incur simple cross-entropy loss, and control the level of asymmetric focusing via a single hyper-parameter, $\gamma_{-}$. For experimentation and generalizability, we still keep $\gamma_{+}$ as a degree of freedom.
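A minimal pure-Python sketch of Eq. 6, summed over the outputs of one image, might look as follows. This is not the released implementation: the function name, the default hyper-parameter values and the `eps` clamp that guards the logarithm are all assumptions made for illustration.

```python
import math

def asl_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sum of per-label ASL terms (Eq. 6 inserted into Eq. 1).

    probs  : sigmoid outputs in [0, 1], one per label
    labels : ground-truth values in {0, 1}, one per label
    """
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Positive part: focal-style term with its own focusing parameter.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative part: shift the probability (Eq. 4), then apply focusing.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total

# One image with a single positive label and four negatives.
probs = [0.9, 0.03, 0.2, 0.6, 0.01]
labels = [1, 0, 0, 0, 0]
loss = asl_loss(probs, labels)
```

In this example the two negatives with probability below the margin contribute nothing, the moderately easy negative (0.2) is strongly attenuated, and the hard negative (0.6) dominates the negative part of the loss.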

### 2.5 Gradient Analysis

In practice, the network weights are updated according to the gradient of the loss with respect to the input logit $z$. Explicitly, the loss gradients for negative samples in ASL are:

$$\frac{dL_{-}}{dz} = \frac{\partial L_{-}}{\partial p}\frac{\partial p}{\partial z} = (p_m)^{\gamma_{-}}\left[\frac{1}{1-p_m} - \frac{\gamma_{-}\log(1-p_m)}{p_m}\right]p(1-p) \tag{7}$$

where $p = \frac{1}{1+e^{-z}}$, and $p_m$ is defined in Eq. 4.

In Figure 3 we compare the normalized gradients of different variants of ASL.

Figure 3 enables us to fully understand the properties of the different loss regimes:

• Plain cross-entropy (blue line) provides a simple linear dependency between the loss gradient and the probability, with no dedicated attenuation of easy samples.

• Asymmetric focusing (orange line), which is equivalent to (decoupled) focal loss, provides non-linear attenuation that specifically targets easy samples: soft thresholding.

• Cross-entropy with asymmetric probability margin (red line) provides hard thresholding of very easy samples ($p < m$). In addition, for very hard negative samples (those whose probability lies above the point at which the loss gradient peaks, i.e. where its derivative with respect to $p$ is zero), the loss gradient has a negative slope. This can be interpreted as a mechanism for discarding mislabeled negative samples: if the network gives a negative sample a very large probability, it is possible that the sample was mislabeled, and its correct label should be positive. When dealing with a highly imbalanced dataset, even a small mislabeling rate of negative samples can have a large impact on the training statistics of positive samples. Hence, dedicated rejection of mislabeled negative samples can be beneficial, especially since multi-label datasets are prone to mislabeling of negative samples [10]. However, there is a trade-off: a probability margin that is too large can prevent the network from propagating gradients from negative examples that are genuinely misclassified.

Notice that a negative slope for hard negative samples also appears with asymmetric focusing (and regular focal loss), but with significantly less emphasis. Only when applying probability shifting does the loss gradient decay to zero as $p \to 1$.

Cross-entropy with asymmetric probability margin also has significant disadvantages: the loss gradient is not continuous (at $p = m$). In addition, it has less attenuation of easy negative samples compared to plain cross-entropy.

• When we combine asymmetric focusing and asymmetric probability margin (green line), we can enjoy all the advantages: hard thresholding of very easy samples, non-linear attenuation of easy samples, continuous loss gradients and the ability to reject very hard negative samples, suspected as mislabeling errors.

In Table 1 we summarize the properties of the different loss mechanisms, according to the gradient analysis.
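The closed form in Eq. 7 can be checked numerically. The sketch below, with hypothetical function names and illustrative values $\gamma_{-}=2$ and $m=0.2$, compares the expression against a central finite difference of the negative-sample loss term $-L_{-}$ (the term that enters Eq. 1) taken with respect to the logit $z$, and then probes the easy and very hard regimes discussed above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def negative_loss(z, gamma_neg, m):
    """Loss term -L_minus of Eq. 6 for one negative sample, as a function of z."""
    p_m = max(sigmoid(z) - m, 0.0)
    if p_m == 0.0:
        return 0.0
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)

def negative_gradient(z, gamma_neg, m):
    """Closed-form expression on the right-hand side of Eq. 7."""
    p = sigmoid(z)
    p_m = max(p - m, 0.0)
    if p_m == 0.0:
        return 0.0  # below the margin the sample produces no gradient
    bracket = 1.0 / (1.0 - p_m) - gamma_neg * math.log(1.0 - p_m) / p_m
    return (p_m ** gamma_neg) * bracket * p * (1.0 - p)

gamma_neg, m, z, h = 2.0, 0.2, 0.3, 1e-6

# Closed form vs. central finite difference at a moderately hard negative.
analytic = negative_gradient(z, gamma_neg, m)
numeric = (negative_loss(z + h, gamma_neg, m) - negative_loss(z - h, gamma_neg, m)) / (2 * h)

# Very easy negative (p ~ 0.05 < m): gradient is exactly zero.
grad_easy = negative_gradient(-3.0, gamma_neg, m)
# Very hard negative (p ~ 1): small with a margin, close to 1 without it.
grad_hard_shifted = negative_gradient(10.0, gamma_neg, m)
grad_hard_unshifted = negative_gradient(10.0, gamma_neg, 0.0)
```

The last two values illustrate the rejection behaviour: with a non-zero margin the gradient of a near-certain "negative" is almost zero, while without the margin it stays close to its maximum.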

### 2.6 Probability Analysis

We want to support our claim that using a symmetric loss in multi-label datasets might lead the network to sub-optimal learning of positive samples' features. By monitoring the average probabilities (outputted by the network) of different samples during the training, we can track the network's level of confidence for positive and negative samples. Low confidence suggests that features were not learned optimally.

Let us define $p_t$ as:

$$p_t = \begin{cases} \bar{p} & \text{if } y = 1 \\ 1-\bar{p} & \text{otherwise} \end{cases} \tag{8}$$

where $\bar{p}$ denotes the averaged probability for all the samples in a batch at each iteration. Also, let $p_t^{+}$ and $p_t^{-}$ be the averaged probabilities of the positive and negative samples, and define $\Delta p$ as the probability gap:

$$\Delta p = p_t^{+} - p_t^{-} \tag{9}$$

In Figure 4 we present the averaged probabilities $p_t^{+}$ and $p_t^{-}$ computed throughout the training, for three different loss functions: cross-entropy, focal loss and ASL.

Figure 4 demonstrates the problem of using symmetric losses for imbalanced datasets. For cross-entropy, a large negative probability gap occurs: $p_t^{-}$ is much higher than $p_t^{+}$ at the end of the training, implying that the optimization process gives too much weight to the negative samples. While focal loss narrows the probability gap, it is still large at the end of the training, showing that the optimization process still puts too much emphasis on negative samples. When using ASL, the gap can be completely eliminated, meaning the network is able to emphasize positive samples correctly.

Indeed, by lowering the decision threshold at inference time, we can control the precision vs. recall trade-off, and favor a high true-positive rate over a low false-positive rate. However, a large negative probability gap, as obtained by the symmetric losses, might suggest that the network has converged to a local minimum with sub-optimal performance. We will validate this claim in the "Experimental Study" section.
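The quantities in Eqs. 8 and 9 are simple batch statistics. A minimal sketch, assuming a flat list of sigmoid outputs with matching binary labels (the function name and the toy batch are hypothetical), is:

```python
def probability_gap(probs, labels):
    """Return (p_t_pos, p_t_neg, delta_p) following Eqs. 8 and 9.

    Assumes the batch contains at least one positive and one negative entry.
    """
    positives = [p for p, y in zip(probs, labels) if y == 1]
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    p_t_pos = sum(positives) / len(positives)        # mean probability of positives
    p_t_neg = 1.0 - sum(negatives) / len(negatives)  # 1 - mean probability of negatives
    return p_t_pos, p_t_neg, p_t_pos - p_t_neg

# Toy batch: two positives predicted with moderate confidence,
# four negatives predicted confidently.
probs = [0.6, 0.4, 0.1, 0.05, 0.2, 0.05]
labels = [1, 1, 0, 0, 0, 0]
p_t_pos, p_t_neg, delta_p = probability_gap(probs, labels)
```

Here $p_t^{+} = 0.5$ and $p_t^{-} = 0.9$, giving a negative gap of $-0.4$: the network is far more confident on negatives than on positives, which is the pattern described for the symmetric losses.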

### 2.7 Adaptive Asymmetry

The hyper-parameters of a loss function are usually adjusted via a manual tuning process. This process is often cumbersome and requires a level of expertise: it is not straightforward to understand and predict the impact of each hyper-parameter on the final score. Based on our probability analysis, we wish to offer a simple, intuitive way of dynamically adjusting ASL's asymmetry levels, with a single interpretable control parameter.

In the previous section we observed that when using a symmetric loss, negative samples have a significantly larger $p_t$ than positive samples ($\Delta p < 0$). When introducing asymmetric focusing ($\gamma_{-} > \gamma_{+}$), positive samples will have a higher $p_t$, while negative samples will have a lower $p_t$, hence $\Delta p$ increases.

Instead of using a fixed $\gamma_{-}$, we propose to adjust $\gamma_{-}$ dynamically throughout the training, to match a desired probability gap, denoted by $\Delta p_{target}$. We can achieve this by a simple adaptation of $\gamma_{-}$ after each batch, as described in Eq. 10.

$$\gamma_{-} \leftarrow \gamma_{-} + \lambda(\Delta p - \Delta p_{target}) \tag{10}$$

where $\lambda$ is a dedicated step size.

As we increase $\Delta p_{target}$, via Eq. 10 we can dynamically increase the asymmetry level throughout the training, forcing the optimization process to focus more on the positive samples' gradients. Notice that using Eq. 10 we can also dynamically adjust the probability margin, or simultaneously adjust both asymmetry mechanisms. For simplicity, we chose to explore the case of adjusting only $\gamma_{-}$ throughout the training, with a fixed $\gamma_{+}$ and a small fixed probability margin that enables hard thresholding and discarding of mislabeled negative samples.
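A single application of Eq. 10 can be sketched as below. The text does not fix the sign or magnitude of the step size $\lambda$, so both are left to the caller here; the function name and every numeric value are hypothetical and purely illustrative.

```python
def update_gamma_neg(gamma_neg, delta_p, delta_p_target, step):
    """One application of Eq. 10 after a batch.

    gamma_neg      : current negative focusing parameter
    delta_p        : probability gap measured on the batch (Eq. 9)
    delta_p_target : desired probability gap
    step           : the step size lambda (sign and scale chosen by the caller)
    """
    return gamma_neg + step * (delta_p - delta_p_target)

# Hypothetical values: the measured gap (-0.1) is below the target (0.1).
# A negative step is assumed here so that gamma_neg grows in this situation.
new_gamma_neg = update_gamma_neg(gamma_neg=2.0, delta_p=-0.1, delta_p_target=0.1, step=-0.5)
```

Since raising $\gamma_{-}$ is what increases $\Delta p$, the sign of the step has to be chosen so that $\gamma_{-}$ grows whenever the measured gap falls short of the target; with the update written exactly as in Eq. 10, that corresponds to the negative step assumed above.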

Figure 5 presents the values of $\gamma_{-}$ and $\Delta p$ throughout the training. The network converges successfully to the target probability gap, and to a stable value of $\gamma_{-}$. In the next section we will analyze the mAP score and possible use-cases for this dynamic scheme.

## 3 Experimental Study

In this section, we will provide experimentations and comparisons to better understand the different losses, and demonstrate the improvement we gain from ASL, compared to symmetric losses. We will also test our adaptive asymmetry mechanism, and compare it to a fixed hyper-parameters scheme.

For testing, we will use the well-known MS-COCO [20] dataset, which, like most multi-label datasets, is highly imbalanced toward negative samples (see the "Dataset Results" section for full dataset and training details).

Focal Loss vs. Cross-Entropy: In Figure 6 we present the mAP scores obtained for different values of the focal loss $\gamma$ ($\gamma = 0$ is cross-entropy).

We can see from Figure 6 that with cross-entropy loss, the mAP score is significantly lower than the one obtained with focal loss. Optimal scores for focal loss are obtained for an intermediate range of $\gamma$. With $\gamma$ below that range, the loss does not provide enough down-weighting for easy negative samples. With $\gamma$ above that range, there is too much down-weighting of the rare positive samples.

Now we want to examine the impact of our two asymmetry mechanisms on the mAP score.

Asymmetric Focusing: In Figure 7 we test the asymmetric focusing mechanism: for two fixed values of $\gamma_{-}$, we present the mAP score along the $\gamma_{+}$ axis.

Figure 7 demonstrates the effectiveness of asymmetric focusing: as we decrease $\gamma_{+}$ (hence increasing the level of asymmetry), the mAP score significantly improves.

Interestingly, we found that simply setting $\gamma_{+} = 0$ leads to the best results in our experiments. That may further support the importance of keeping the gradient magnitudes high for positive samples. Still, allowing $\gamma_{+} > 0$ may be useful for cases where positive samples are also frequent, and focusing on hard samples is also required to better balance the contributions of the loss terms.

Note that we also tried training with $\gamma_{+} < 0$, to extend the asymmetry further. However, these training runs did not converge, so we do not present them in Figure 7.

Asymmetric Probability Margin: In Figure 8 we apply our second asymmetry mechanism, asymmetric probability margin, on top of cross-entropy loss ($\gamma = 0$) and two levels of (symmetric) focal loss.

We can see from Figure 8 that both for cross-entropy and focal loss, introducing an asymmetric probability margin improves the mAP score significantly, by 1-2%. For cross-entropy, the optimal probability margin is low, in agreement with our gradient analysis: cross-entropy with a probability margin produces a non-smooth loss gradient, with less attenuation of easy samples. Hence, a small probability margin, which still enables hard thresholding of very easy samples and rejection of mislabeled samples, is sufficient.

For focal loss, the optimal probability margin is significantly higher. This again can be explained by analyzing the loss gradients: since focal loss already has non-linear attenuation of easy samples, we need a larger probability margin to introduce meaningful asymmetry. We can also see that when introducing an asymmetric probability margin, better scores are obtained for the lower of the two focal loss levels, meaning that asymmetric probability margin works better on top of a modest amount of focal loss.

Combining Asymmetries: Until now we tested each ASL asymmetry separately. In Table 2 we compare the top mAP scores, achieved when combining the asymmetries together, to the top mAP scores obtained when applying each asymmetry alone.

We can see from Table 2 that the best results are obtained when combining the two components of asymmetry. This corresponds to our analysis of the loss gradients from Figure 3, where we show that combining asymmetries enables us to completely ignore very easy samples, apply non-linear attenuation to easy samples and reject possibly mislabeled very hard negative samples, which is not possible when applying only one type of asymmetry.

Adaptive Asymmetry: We would like to examine the effectiveness of adjusting the ASL asymmetry levels dynamically via the procedure proposed in Eq. 10. In Table 3 we present the mAP score, and the final value of $\gamma_{-}$, obtained for various values of $\Delta p_{target}$.

We can see from Table 3 that even without any tuning of the probability gap parameter (demanding the trivial "balanced" case, $\Delta p_{target} = 0$), a significant improvement is achieved compared to focal loss. By using a higher target probability gap, we obtain an improved mAP score compared to focal loss. However, it is still lower than the best ASL run with a fixed $\gamma_{-}$. A possible reason for this degradation is that a training process is highly impacted by the first epochs [13]. Tuning a hyper-parameter dynamically may be sub-optimal at the beginning of the training, which may decrease the overall performance. To compensate for the initial recovery iterations, the dynamically tuned $\gamma_{-}$ tends to converge to higher values, but the overall score is still degraded somewhat. Due to this decline, in the "Dataset Results" section we will use a fixed $\gamma_{-}$ scheme.

Still, despite the small score drop, the dynamic scheme is appealing as it allows controlling the asymmetry level via one simple interpretable hyper-parameter. In the future, we will explore ways to expand this adaptive scheme for other useful applications, such as tuning $\gamma_{-}$ adaptively per class, which can be impractical with a regular exhaustive search.

## 4 Dataset Results

In this section, we will evaluate ASL on four known multi-label classification datasets, and compare its results to state-of-the-art models. We will also test ASL on other computer vision tasks, such as single-label classification and object detection.

### 4.1 Multi-Label Datasets

#### MS-COCO

MS-COCO [20] is a widely used dataset to evaluate computer vision tasks such as object detection, semantic segmentation and image captioning, and has been adopted recently to evaluate multi-label image classification. For multi-label classification, every image contains on average only a few labels out of all the categories, giving a low average positive-negative ratio. The dataset is divided into a training set and a validation set. For training our model, we used ASL with fixed values of $\gamma_{-}$, $\gamma_{+}$ and $m$. Full training details appear in the appendix.

Following conventional settings for MS-COCO [30, 21], we report the following statistics: mean average precision (mAP), average per-class precision (CP), recall (CR), F1 (CF1) and the average overall precision (OP), recall (OR) and F1 (OF1), for the overall statistics and top-3 highest scores. Among these metrics, mAP, OF1, and CF1 are the main metrics, since they take into account both false-negative and false-positive rates. In Table 4 we report results for the main metrics. In Table 10 in the appendix we report results for all the metrics.
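For readers unfamiliar with the per-class and overall statistics, the sketch below computes CP, CR, CF1, OP, OR and OF1 from binarized predictions. It assumes the common convention in which the per-class values are averaged over classes, the overall values are computed from pooled counts, and each F1 is the harmonic mean of the corresponding precision and recall; the function names and the toy data are hypothetical.

```python
def safe_div(a, b):
    return a / b if b else 0.0

def f1(precision, recall):
    return safe_div(2 * precision * recall, precision + recall)

def multilabel_metrics(preds, targets):
    """preds, targets: lists of per-image binary label lists of equal length."""
    num_classes = len(targets[0])
    tps, fps, fns = [], [], []
    for c in range(num_classes):
        tps.append(sum(1 for p, t in zip(preds, targets) if p[c] == 1 and t[c] == 1))
        fps.append(sum(1 for p, t in zip(preds, targets) if p[c] == 1 and t[c] == 0))
        fns.append(sum(1 for p, t in zip(preds, targets) if p[c] == 0 and t[c] == 1))
    # Per-class: average precision and recall over classes.
    cp = sum(safe_div(tp, tp + fp) for tp, fp in zip(tps, fps)) / num_classes
    cr = sum(safe_div(tp, tp + fn) for tp, fn in zip(tps, fns)) / num_classes
    # Overall: pool the counts of all classes first.
    op = safe_div(sum(tps), sum(tps) + sum(fps))
    o_r = safe_div(sum(tps), sum(tps) + sum(fns))
    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": op, "OR": o_r, "OF1": f1(op, o_r)}

# Three images, three classes.
preds = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
targets = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
metrics = multilabel_metrics(preds, targets)
```

In this toy case the per-class averages are pulled around by individual classes (one class has precision 0.5, another recall 0.5), while the overall values reflect only the pooled counts of four true positives, one false positive and one false negative.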

We can see from Table 4 that ASL significantly outperforms previous state-of-the-art methods on MS-COCO for all major metrics. For example, the ASL mAP score is higher than that of the previous top method.

Notice that the mAP scores for MS-COCO are highly influenced by the input resolution. For a fair comparison, we report in Table 4 results for standard input resolution, 448. In Table 11 in the appendix we show that with higher input resolutions, our mAP score can be increased to 88.4%.

#### Pascal-VOC

Pascal Visual Object Classes Challenge (VOC 2007) [11] is another popular dataset for multi-label recognition. It contains 9,963 images from 20 object categories, with an average of 2.5 categories per image. Pascal-VOC is divided into a trainval set of 5,011 images and a test set of 4,952 images. Our training settings were identical to the ones used for MS-COCO.
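The degree of imbalance can be quantified with the same kind of positive-negative ratio mentioned for MS-COCO. As a hedged sketch, assuming the ratio is defined as the average number of positive labels per image divided by the average number of negative labels (the function name is hypothetical), the Pascal-VOC figures quoted above give:

```python
def positive_negative_ratio(num_classes, avg_positives_per_image):
    """Average positives per image divided by average negatives per image."""
    avg_negatives = num_classes - avg_positives_per_image
    return avg_positives_per_image / avg_negatives

# Pascal-VOC: 20 categories, 2.5 categories per image on average.
voc_ratio = positive_negative_ratio(num_classes=20, avg_positives_per_image=2.5)
```

Under this assumed definition the ratio is 2.5 / 17.5, i.e. roughly one positive label for every seven negative ones.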

Notice that most previous works on Pascal-VOC used simple ImageNet pre-training, but some used additional data, like pre-training on MS-COCO or using NLP models like BERT. For a fair comparison, we present our results once with ImageNet pre-training, and once with additional pre-train data (MS-COCO pre-training) and compare them to the relevant works. Results appear in Table 5.

We can see from Table 5 that ASL achieves new state-of-the-art results on Pascal-VOC, with and without additional pre-training. We also see that additional (MS-COCO) pre-training can significantly improve the mAP score, compared to the standard ImageNet pre-training.

#### NUS-WIDE

The NUS-WIDE [6] dataset originally contained 269,648 images from Flickr, which have been manually annotated with 81 visual concepts. Since some URLs have been deleted, we were able to download only 220,000 images, similar to [10, 29] (see appendix for more details about obtaining the dataset). Since the dataset was not originally divided into a train and test set, we did the standard 70-30 train-test split [10, 29, 21]. Our training settings were identical to the ones used for MS-COCO. In Table 6 we compare ASL results with current state-of-the-art models on NUS-WIDE.

We can see from Table 6 that ASL improves the known state-of-the-art result on NUS-WIDE by a large margin - 2.4% mAP. Other metrics also show improvement.

#### Open Images

Open Images (v6) [17] is a large-scale dataset, which consists of 9 million training images, plus validation and test images. It is partially annotated with human labels and machine-generated labels. The scale of Open Images is much larger than previous multi-label datasets such as NUS-WIDE, Pascal-VOC and MS-COCO, allowing us to test ASL on an extreme classification multi-label scenario [37]. Full dataset and training details appear in the appendix.

To the best of our knowledge, no results have been published yet for the v6 variant of Open Images. Hence, we chose to compare our ASL accuracies to regular focal loss training. Yet we hope that our result can serve as a benchmark for future comparisons to other methods. In addition to the standard (micro) mAP metric, we also choose to state the macro mAP score, since we believe it better represents the actual visual quality of the network. Results appear in Table 7.
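The difference between the two averaging modes can be sketched as follows. The exact definitions used for Table 7 are not spelled out here, so the sketch assumes one common reading: macro mAP averages the per-class average precision, while micro mAP computes a single average precision over all (image, class) pairs pooled together. The average-precision variant (mean of the precision at each positive's rank, ties ignored), the function names and the toy data are likewise assumptions.

```python
def average_precision(scores, targets):
    """Mean of precision@rank taken at the rank of every positive item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if targets[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def macro_map(scores, targets):
    """Average the per-class AP; every class counts equally."""
    num_classes = len(scores[0])
    aps = [average_precision([s[c] for s in scores], [t[c] for t in targets])
           for c in range(num_classes)]
    return sum(aps) / num_classes

def micro_map(scores, targets):
    """Pool all (image, class) pairs into one ranking before computing AP."""
    flat_scores = [v for row in scores for v in row]
    flat_targets = [v for row in targets for v in row]
    return average_precision(flat_scores, flat_targets)

# Three images, two classes.
scores = [[0.9, 0.2], [0.8, 0.6], [0.3, 0.7]]
targets = [[1, 0], [0, 1], [1, 0]]
macro = macro_map(scores, targets)
micro = micro_map(scores, targets)
```

Under these assumptions a rare class weighs as much as a frequent one in the macro score, whereas the micro score is dominated by classes with many positives, which is one way to read the remark that the macro score better reflects per-class quality.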

We can see from Table 7 that ASL outperforms regular focal loss on Open Images on both metrics, demonstrating that ASL is also suitable for large datasets and extreme classification cases.

### 4.2 Additional Computer Vision Tasks

While our main focus was on multi-label classification, we wanted to further test ASL on other computer vision tasks. Since fine-grained single-label classification and object detection tasks usually contain a large portion of background or long-tail cases [1, 16], and are known to benefit from using focal loss, we chose to test ASL on these additional tasks.

#### Fine-Grained Single-Label Classification

For the fine-grained single-label classification case, we chose to work on the competitive Herbarium 2020 FGVC7 Challenge [16]. Full dataset and training details appear in the appendix. The metric chosen for the competition is macro F1 score. In Table 8 we report results of ASL on the Herbarium dataset, and compare them to regular focal loss.

We can see from Table 8 that ASL outperforms focal loss on this fine-grained single-label classification dataset by a large margin. Note that Herbarium 2020 was a CVPR-Kaggle classification competition. The ASL test-set score would achieve 3rd place in the competition.

#### Object Detection

For testing ASL on object detection, we used the MS-COCO [20] dataset (object detection task), which contains a training set of 118k images, and an evaluation set of 5k images. Full training details appear in the appendix. Our object detection method, FCOS [27], uses different types of losses: classification (focal loss), bounding box (IoU loss) and centerness (plain cross-entropy). The only component which is affected by the large presence of background samples is the classification loss. Hence, for testing we replaced only the classification focal loss with ASL.

In Table 9 we compare the mAP score obtained from ASL training to the score obtained with standard focal loss.

We can see from Table 9 that ASL outscores regular focal loss, yielding a 0.4% improvement to the mAP score.

## 5 Conclusion

In this paper, we presented an asymmetric loss (ASL) for multi-label classification. ASL contains two complementary asymmetric mechanisms that operate differently on positive and negative samples. By examining ASL derivatives, we gained a deeper understanding of the loss properties. Through network probability analysis, we demonstrated the effectiveness of ASL in balancing between negative and positive samples, and proposed an adaptive scheme that can dynamically adjust the asymmetry levels throughout the training. Extensive experimental analysis shows that ASL outperforms all previous approaches on common multi-label classification benchmarks, including MS-COCO, Pascal-VOC, NUS-WIDE and Open Images. We also tested ASL on object detection and fine-grained single-label classification datasets, demonstrating its applicability to other computer vision tasks that also exhibit dataset imbalance.

## Appendix A Multi-Label General Training Details

Unless stated explicitly otherwise, for all multi-label datasets we used the following training procedure. As a training architecture, we used TResNet-L, which is equivalent in run-time to ResNet-101 [24]. We trained the model using the Adam optimizer and a 1-cycle policy [25]. For regularization, we used Cutout [9], true weight decay [22] and GPU augmentations. We found that the common ImageNet statistics normalization [14, 7, 26] does not improve results, and instead used a simpler normalization: scaling all the RGB channels to a fixed range.

## Appendix B Comparing MS-COCO On All Common Metrics

In Table 10 we compare ASL results to known state-of-the-art methods on all common metrics for the MS-COCO dataset.

## Appendix C MS-COCO mAP Scores for High-Resolution Inputs

The mAP score we report in Table 10 is for standard input resolution, , with TResNet-L as architecture. However, using higher input resolutions with larger architectures is highly beneficial for the mAP score. Notice that the input resolution is not always mentioned in articles [21], and sometimes a larger-than-standard input resolution is used [34]. For completeness, in Table 11 we compare results with input resolution larger than . Notice that the ResNet-101 architecture has a similar runtime to TResNet-L [24].

## Appendix D Obtaining NUS-WIDE

Some of the original Flickr links to NUS-WIDE are no longer available. Previous works [10, 29, 21] use many different variants of the NUS-WIDE dataset, and it is hard to do a one-to-one comparison. We obtained our variant of NUS-WIDE from: https://drive.google.com/file/d/0B7IzDz-4yH_HMFdiSE44R1lselE/view.

This variant contains images. We recommend using it for standardization and a completely fair comparison in future works.

## Appendix E Open Images Training Details

Due to missing links on Flickr, we were able to download only test images from the Open Images dataset, which contain about unique tagged classes. For dealing with the partial labeling methodology of the Open Images dataset, we set all untagged labels as negative, with reduced weights. Due to the large number of images, we trained our network for epochs on input resolution of , and finetuned it for epochs on input resolution of . Other training details are similar to the ones used for MS-COCO.
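The treatment of untagged labels described above can be sketched as follows. This is only an illustration under stated assumptions: it uses plain binary cross-entropy rather than ASL for brevity, it encodes an untagged label as `None`, and the function name `partial_label_bce` and the value of `untagged_weight` are hypothetical placeholders, since the actual reduced weight is not specified here.

```python
import math

EPS = 1e-8  # numerical guard for the logarithms


def partial_label_bce(probs, labels, untagged_weight=0.5):
    """Sum of per-class binary cross-entropy terms for one image.

    labels holds 1 (tagged positive), 0 (tagged negative) or None (untagged).
    Untagged classes are treated as negatives, scaled by untagged_weight
    (a hypothetical value chosen only for illustration).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        if y is None:
            total += -untagged_weight * math.log(1.0 - p)
        elif y == 1:
            total += -math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # Same prediction, tagged negative versus untagged: the untagged term is smaller.
    print(partial_label_bce([0.5], [0]))
    print(partial_label_bce([0.5], [None]))
```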

## Appendix F Herbarium Dataset and Training Details

The goal of Herbarium 2020 is to identify vascular plant species from a large, long-tailed collection of herbarium specimens provided by the New York Botanical Garden (NYBG). The dataset contains over 1M images representing over 32,000 plant species. This is a dataset with a long tail: there is a minimum of 3 specimens per species, while some species are represented by more than a hundred specimens. The metric chosen for the competition is macro F1 score. For Focal loss, we trained with . For ASL, we trained with .
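Because macro F1 averages the per-class F1 scores with equal weight, rare species count as much as common ones, which is why it suits a long-tailed dataset. A minimal sketch in plain Python is given below; the function name `macro_f1` and the toy labels are illustrative only, and the exact averaging conventions of the competition scorer (for example, how classes absent from the predictions are handled) may differ.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy long-tailed example: class 0 is common, classes 1 and 2 are rare.
    y_true = [0, 0, 0, 0, 1, 2]
    y_pred = [0, 0, 0, 0, 1, 1]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, macro_f1(y_true, y_pred))
```

In the toy example, five of six predictions are correct, yet the macro F1 is much lower because the single missed rare class contributes a per-class score of zero.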

## Appendix G MS-COCO Detection Training Details

For training on MS-COCO detection we used the popular MMDetection [2] package, with the enhancements discussed in ATSS [36] and FCOS [27] as the object detection method. We trained a TResNet-M [24] model with SGD optimizer for epochs, with momentum of , weight decay of and batch size of 48. We used learning rate warm-up, an initial learning rate of 0.01 and a 10x reduction at epochs 40 and 60. For ASL we used . For focal loss we used the common value, [19]. Note that unlike multi-label and fine-grain single-label classification datasets, for object detection was not the optimal solution. The reason for this might be the need to balance the contribution from the losses used in object detection (classification, bounding box and centerness). We leave further investigation of this issue to future work.
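The learning-rate schedule described above (warm-up, initial rate of 0.01, 10x reduction at epochs 40 and 60) can be sketched as a small function. The warm-up length and starting ratio are not stated here, so `warmup_epochs` and `warmup_start` are hypothetical placeholders, as is the zero-based epoch convention; in practice the schedule is configured through the detection package rather than written by hand.

```python
def step_lr(epoch, base_lr=0.01, milestones=(40, 60), factor=0.1,
            warmup_epochs=1.0, warmup_start=0.1):
    """Learning rate at a (possibly fractional) zero-based epoch.

    base_lr and milestones follow the text; warmup_epochs and warmup_start
    are assumed values used only for illustration.
    """
    # Apply one 10x reduction for every milestone already reached.
    decays = sum(1 for m in milestones if epoch >= m)
    lr = base_lr * factor ** decays
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start * base_lr up to base_lr.
        frac = epoch / warmup_epochs
        lr *= warmup_start + (1.0 - warmup_start) * frac
    return lr


if __name__ == "__main__":
    for e in (0, 0.5, 1, 39, 40, 59, 60):
        print(e, step_lr(e))
```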

### References

1. P. L. Bartlett and M. H. Wegkamp (2008) Classification with a reject option using a hinge loss. Journal of Machine Learning Research 9 (Aug), pp. 1823–1840. Cited by: §1, §2.3, §4.2.
2. K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: Appendix G.
3. T. Chen, M. Xu, X. Hui, H. Wu and L. Lin (2019) Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 522–531. Cited by: §1, Table 5.
4. Z. Chen, X. Wei, X. Jin and Y. Guo (2019) Multi-label image recognition with joint class-aware map disentangling and label correlation embedding. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 622–627. Cited by: Table 10, Figure 1, §1, Table 4, Table 6.
5. Z. Chen, X. Wei, P. Wang and Y. Guo (2019) Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5177–5186. Cited by: Table 10, §1, §1, Table 4, Table 5.
6. T. Chua, J. Tang, R. Hong, H. Li, Z. Luo and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In Proceedings of the ACM international conference on image and video retrieval, pp. 1–9. Cited by: §1, §4.1.3.
7. E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123. Cited by: Appendix A.
8. Y. Cui, M. Jia, T. Lin, Y. Song and S. Belongie (2019) Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9268–9277. Cited by: §2.2.
9. T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: Appendix A.
10. T. Durand, N. Mehrasa and G. Mori (2019) Learning a deep convnet for multi-label classification with partial labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 647–657. Cited by: Appendix D, §1, §1, 3rd item, §4.1.3.
11. M. Everingham, L. Van Gool, C. K. Williams, J. Winn and A. Zisserman (2007) The pascal visual object classes challenge 2007 (voc2007) results. Cited by: §1, §4.1.2.
12. B. Gao and H. Zhou (2020) Multi-label image recognition with multi-class attentional regions. arXiv preprint arXiv:2007.01755. Cited by: Table 10, Table 11, Figure 1, §1, §1, Table 4.
13. A. Golatkar, A. Achille and S. Soatto (2019) Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence. CoRR abs/1905.13277. Cited by: §3.
14. A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang and V. Vasudevan (2019) Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324. Cited by: Appendix A.
15. C. Huang, Y. Li, C. C. Loy and X. Tang (2016) Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5375–5384. Cited by: §2.2.
16. kiat Chuan Tan (2020) Herbarium-2020-fgvc7. Note: https://www.kaggle.com/c/herbarium-2020-fgvc7 Cited by: §4.2.1, §4.2.
17. A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci and T. Duerig (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §1, §4.1.4.
18. P. Li, P. Chen, Y. Xie and D. Zhang (2020) Bi-modal learning with channel-wise attention for multi-label image classification. IEEE Access 8, pp. 9965–9977. Cited by: Figure 1, Table 5.
19. T. Lin, P. Goyal, R. B. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. CoRR abs/1708.02002. External Links: Link Cited by: Appendix G, §1, §2.1, §2.2, §2.3.
20. T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick and P. Dollár (2014) Microsoft coco: common objects in context. External Links: 1405.0312 Cited by: §1, §3, §4.1.1, §4.2.2.
21. Y. Liu, L. Sheng, J. Shao, J. Yan, S. Xiang and C. Pan (2018) Multi-label image classification via knowledge distillation from weakly-supervised detection. In Proceedings of the 26th ACM international conference on Multimedia, pp. 700–708. Cited by: Table 10, Appendix C, Appendix D, §1, §4.1.1, §4.1.3, Table 4, Table 6.
22. I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: Appendix A.
23. J. Nam, E. L. Mencía, H. J. Kim and J. Fürnkranz (2017) Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in neural information processing systems, pp. 5413–5423. Cited by: §1.
24. T. Ridnik, H. Lawen, A. Noy, E. B. Baruch, G. Sharir and I. Friedman (2020) TResNet: high performance gpu-dedicated architecture. arXiv preprint arXiv:2003.13630. Cited by: Appendix A, Appendix C, Appendix G.
25. L. N. Smith (2018) A disciplined approach to neural network hyper-parameters: part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. Cited by: Appendix A.
26. M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: Appendix A.
27. Z. Tian, C. Shen, H. Chen and T. He (2019) FCOS: fully convolutional one-stage object detection. External Links: 1904.01355 Cited by: Appendix G, §4.2.2.
28. J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang and W. Xu (2016) Cnn-rnn: a unified framework for multi-label image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2285–2294. Cited by: §1.
29. Q. Wang, N. Jia and T. P. Breckon (2019) A baseline for multi-label image classification using an ensemble of deep convolutional neural networks. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 644–648. Cited by: Appendix D, §4.1.3.
30. Y. Wang, D. He, F. Li, X. Long, Z. Zhou, J. Ma and S. Wen (2019) Multi-label classification with label graph superimposing. ArXiv abs/1911.09243. Cited by: §1, §4.1.1.
31. Z. Wang, T. Chen, G. Li, R. Xu and L. Lin (2017) Multi-label image recognition by recurrently discovering attentional regions. In Proceedings of the IEEE international conference on computer vision, pp. 464–472. Cited by: §1, Table 5.
32. T. Wu, Q. Huang, Z. Liu, Y. Wang and D. Lin (2020) Distribution-balanced loss for multi-label classification in long-tailed datasets. Cited by: §1, §2.3.
33. H. Yang, J. Tianyi Zhou, Y. Zhang, B. Gao, J. Wu and J. Cai (2016) Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–288. Cited by: §1, Table 5.
34. J. Ye, J. He, X. Peng, W. Wu and Y. Qiao (2019) Attention-driven dynamic graph convolutional network for multi-label image recognition. Cited by: Appendix C, §1.
35. R. You, Z. Guo, L. Cui, X. Long, Y. Bao and S. Wen (2020) Cross-modality attention with semantic graph embedding for multi-label classification. In AAAI, pp. 12709–12716. Cited by: Table 10, §1, Table 4, Table 6.
36. S. Zhang, C. Chi, Y. Yao, Z. Lei and S. Z. Li (2019) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. External Links: 1912.02424 Cited by: Appendix G.
37. W. Zhang, J. Yan, X. Wang and H. Zha (2018) Deep extreme multi-label learning. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 100–107. Cited by: §4.1.4.
38. F. Zhu, H. Li, W. Ouyang, N. Yu and X. Wang (2017) Learning spatial regularization with image-level supervisions for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5513–5522. Cited by: Table 6.