A Comparative Study of Deep Learning Loss Functions for Multi-Label Remote Sensing Image Classification
This paper analyzes and compares different deep learning loss functions in the framework of multi-label remote sensing (RS) image scene classification problems. We consider seven loss functions: 1) cross-entropy loss; 2) focal loss; 3) weighted cross-entropy loss; 4) Hamming loss; 5) Huber loss; 6) ranking loss; and 7) sparseMax loss. All the considered loss functions are analyzed for the first time in RS. After a theoretical analysis, an experimental analysis is carried out to compare the considered loss functions in terms of their: 1) overall accuracy; 2) class imbalance awareness (where the number of samples associated with each class varies significantly); 3) convexity and differentiability; and 4) learning efficiency (i.e., convergence speed). On the basis of our analysis, some guidelines are derived for a proper selection of a loss function in multi-label RS scene classification problems.
Hichame Yessou, Gencer Sumbul, Begüm Demir
Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, Germany
Multi-label image classification, deep learning, loss functions, remote sensing
1 Introduction
Recent advances in remote sensing (RS) instruments have led to a significant growth of RS image archives. Accordingly, multi-label image scene classification (MLC), which aims at automatically assigning multiple class labels (i.e., multi-labels) to each RS image scene in an archive, has attracted great attention in RS. In recent years, deep learning (DL) based methods have been introduced for MLC problems due to the high generalization capabilities of DL models (e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs)). As an example, in , the conventional use of CNNs developed for single-label image classification is adapted for MLC. In this method, the sigmoid function is suggested for MLC adaptation instead of the softmax function as the activation of the last CNN layer. In , a data augmentation strategy is proposed to employ a shallow CNN in the framework of MLC. This method aims to apply end-to-end training of the shallow CNN while avoiding the use of a pre-trained network. In , a multi-attention driven approach is introduced for high-dimensional, high-spatial-resolution RS images. In this approach, a branch-wise CNN is jointly exploited with an RNN to characterize a global image descriptor based on the extraction and exploitation of importance scores of local image areas. All the existing approaches utilize the conventional combination of sigmoid activation and cross-entropy loss functions to simultaneously learn multi-labels for each image in the framework of DL. The sigmoid activation function models each class with an independent Bernoulli distribution and thus allows multiple class predictions. The cross-entropy loss function has strong foundations in information theory and its effectiveness has been widely proven. However, it is not fully suitable when: i) imbalanced training sets are present; and ii) there is a time constraint on the training phase of a DL based method.
Since a loss function guides the whole learning procedure throughout the training, its proper selection is important for DL based MLC. Thus, in this paper, we present a study to analyze and compare different loss functions in the context of MLC and propose a scheme to guide the choice of loss functions based on a set of properties. All the considered loss functions are analyzed for the first time in RS in terms of their: 1) overall accuracy; 2) class imbalance awareness; 3) convexity and differentiability; and 4) learning efficiency. BigEarthNet , which is a large-scale multi-label benchmark archive, is employed to validate our theoretical findings within experiments.
2 Deep Learning Loss Functions for Multi-Label Image Classification
Let be an archive that consists of images, where is the th image in the archive. Each image in the archive is associated with one or more classes from a label set . Let be a binary variable that indicates the presence or absence of a label for the image . Thus, the multi-labels of the image are given by the binary vector . A MLC task can be formulated as a function that maps the image to multiple classes based on the function (which provides a classification score for each class in the label set) and the function (which defines the multi-labels of the image based on the probabilities). The learning process is performed by minimizing the empirical loss , which compares multi-label predictions with the ground reference samples. For a comparative analysis, we consider seven DL loss functions: cross-entropy loss (CEL) ; focal loss (FL) ; weighted cross-entropy loss (W-CEL) ; Hamming loss (HAL) ; Huber loss (HL) ; ranking loss (RL) ; and sparseMax loss (SML) . For the image we define its class probabilities as follows:
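For concreteness, one notation that is consistent with the description above can be sketched as follows. All symbols ($\mathcal{X}$, $M$, $x_i$, $C$, $l_j$, $y_i^j$, $z_i^j$, $p_i^j$, $\tilde{p}_i^j$) are assumptions introduced here for illustration. The definition of $\tilde{p}_i^j$ follows the convention commonly used together with the focal loss and may differ from the authors' own definition.

```latex
\mathcal{X} = \{x_1, \dots, x_M\}, \qquad
\mathcal{L} = \{l_1, \dots, l_C\}, \qquad
\mathbf{y}_i = [y_i^1, \dots, y_i^C] \in \{0, 1\}^C

\tilde{p}_i^j =
\begin{cases}
  p_i^j     & \text{if } y_i^j = 1 \\
  1 - p_i^j & \text{otherwise}
\end{cases},
\qquad
p_i^j = \sigma(z_i^j) = \frac{1}{1 + e^{-z_i^j}}
```

Under this assumed notation, $z_i^j$ is the class score of label $l_j$ for image $x_i$, and $\tilde{p}_i^j$ is the probability that the model assigns to the ground reference state of that label.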
where is resulting output from the Sigmoid activation function defined as . The CEL is formulated as:
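As an illustration, a minimal sketch of a sigmoid-based cross-entropy loss for a single image is given below. It assumes the loss is the sum of per-label binary cross-entropy terms; the function names, the `eps` clipping constant and the example logits are hypothetical and not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a class score (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy_loss(logits, labels, eps=1e-12):
    """Sum of per-label binary cross-entropy terms for one image.

    `eps` (an assumption of this sketch) avoids log(0).
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        # Probability assigned to the ground reference state of this label.
        p_t = p if y == 1 else 1.0 - p
        total += -math.log(max(p_t, eps))
    return total


# Three labels, the first two of which are present in the image.
loss_good = cross_entropy_loss([4.0, 3.0, -5.0], [1, 1, 0])
loss_bad = cross_entropy_loss([-4.0, 3.0, 5.0], [1, 1, 0])
```

Here `loss_good` is smaller than `loss_bad`, since the second set of logits contradicts the ground reference for two of the three labels.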
For the CEL, easily classified images may significantly affect the value of the loss function and thus control the gradient that limits the learning from hard images. The FL adds a modulating factor to the CEL, shifting the objective from easy negatives to hard negatives by down-weighting the easily classified images as follows:
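A minimal sketch of such a modulated loss is given below, assuming the standard focal-loss form in which each cross-entropy term is multiplied by one minus the ground reference probability, raised to a focusing exponent. The name `gamma`, its default of 2.0 and the example probabilities are assumptions of this sketch, not values reported in the paper.

```python
import math


def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Focal loss for one image, given per-label sigmoid probabilities.

    `gamma` is the focusing exponent (assumed default); gamma = 0 recovers
    the plain cross-entropy loss.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Probability assigned to the ground reference state of this label.
        p_t = p if y == 1 else 1.0 - p
        modulating_factor = (1.0 - p_t) ** gamma
        total += -modulating_factor * math.log(max(p_t, eps))
    return total


easy = focal_loss([0.95], [1])  # well-classified label
hard = focal_loss([0.30], [1])  # poorly classified label
```

With `gamma=2.0`, the well-classified example contributes only 0.25% of its cross-entropy value, whereas the hard example keeps 49% of it, which is the down-weighting effect described above.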
where is a focusing parameter, which increases the importance of correcting wrongly classified examples. Another way to guide the learning procedure is to consider class weighting that allows exploiting the importance for each class. The W-CEL is defined by setting a weighting vector inversely proportional to the class distribution. The HAL aims at reducing the fraction of the wrongly predicted labels compared to the total number of labels as follows:
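The two ideas in this paragraph can be sketched as follows, under stated assumptions: class weights are taken as proportional to the inverse of the number of training images per class and rescaled to average to one (the rescaling is an assumption of this sketch), and the Hamming loss is computed on already binarized predictions. Function names are hypothetical.

```python
def inverse_frequency_weights(label_matrix):
    """One weight per class, inversely proportional to its positive count.

    `label_matrix` holds one binary label vector per training image.
    Weights are rescaled so that they average to 1 (an assumption here).
    """
    num_classes = len(label_matrix[0])
    counts = [sum(row[j] for row in label_matrix) for j in range(num_classes)]
    raw = [1.0 / max(count, 1) for count in counts]
    scale = num_classes / sum(raw)
    return [w * scale for w in raw]


def hamming_loss(predicted, reference):
    """Fraction of labels whose predicted state differs from the reference."""
    if len(predicted) != len(reference):
        raise ValueError("label vectors must have the same length")
    # XOR is 1 exactly when the two binary labels disagree.
    mismatches = sum(p ^ r for p, r in zip(predicted, reference))
    return mismatches / len(reference)


weights = inverse_frequency_weights([[1, 0], [1, 0], [1, 1], [1, 0]])
loss = hamming_loss([1, 0, 1, 0], [1, 1, 0, 0])
```

In this toy example the second class appears in one image out of four and therefore receives the larger weight, and two of the four labels are wrongly predicted, giving a Hamming loss of 0.5.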
where denotes the XOR logical operation. The HL consists of: i) a quadratic function for values in the target proximity; and ii) a linear function for larger values as follows:
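The quadratic-then-linear behavior can be sketched with the classical Huber function of a residual, as below. How the residual is formed from the class score and the label in this paper is not reproduced here; the threshold name `delta` and its default of 1.0 are assumptions of this sketch.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Classical Huber function: quadratic near zero, linear beyond `delta`."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude
    # Linear branch, chosen so that value and slope match at |residual| = delta.
    return delta * (magnitude - 0.5 * delta)


small = huber(0.5)  # quadratic regime
large = huber(3.0)  # linear regime
```

The linear branch limits the influence of large residuals, while the quadratic branch keeps the function smooth around the target.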
where is the class score (i.e., logit) of the label without applying any activation function. It is worth noting that to utilize the HL, the value of is replaced by . The SML is coupled with the sparseMax activation function that provides sparse distributions, while holding a separation margin for classification. Its generalization for the multi-label classification is defined as follows:
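A minimal sketch of the sparseMax activation and of a multi-label sparseMax loss is given below. It assumes the target distribution is uniform over the labels present in the image and uses the loss form usually given for sparsemax (a linear term in the scores, a quadratic term over the support, and a constant depending on the target). The exact variant used in the paper may differ, and all function names are hypothetical.

```python
def sparsemax_threshold(scores):
    """Threshold tau such that max(z - tau, 0) sums to 1 over all scores."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(ordered, start=1):
        cumulative += z
        # The k-th largest score stays in the support while this holds.
        if 1.0 + k * z > cumulative:
            tau = (cumulative - 1.0) / k
    return tau


def sparsemax(scores):
    """Sparse probability vector obtained by truncating scores below tau."""
    tau = sparsemax_threshold(scores)
    return [max(z - tau, 0.0) for z in scores]


def sparsemax_multilabel_loss(scores, labels):
    """Sparsemax loss against a target that is uniform over present labels."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one label must be present")
    target = [y / positives for y in labels]
    tau = sparsemax_threshold(scores)
    support_term = sum(z * z - tau * tau for z in scores if z > tau)
    linear_term = sum(q * z for q, z in zip(target, scores))
    return -linear_term + 0.5 * support_term + 0.5 * sum(q * q for q in target)


probs = sparsemax([1.0, 0.5, -1.0])
matched = sparsemax_multilabel_loss([2.0, 0.0], [1, 0])
mismatched = sparsemax_multilabel_loss([0.0, 2.0], [1, 0])
```

In this sketch the loss is zero when the sparseMax output equals the target distribution (`matched`) and positive otherwise (`mismatched`), and the lowest score in `probs` is truncated to exactly zero.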
where is a thresholding function to define which class scores will be further leveraged (denoted as ) and the remaining class scores will be truncated to zero (for a detailed explanation, see ). The RL aims to provide an accurate order of class probabilities, and thus assign higher probabilities to ground reference classes compared to others. This is achieved with pairwise comparisons as follows:
where is the ground reference class labels associated with the image and is the remaining labels from the label set of the archive.
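The pairwise comparison can be sketched with a hinge-based ranking loss, as below. This is one common form of pairwise ranking loss and is assumed here for illustration; the paper's exact formulation, the `margin` parameter and the function name are not taken from the source.

```python
def pairwise_ranking_loss(scores, labels, margin=1.0):
    """Hinge penalty for every (present, absent) label pair ranked wrongly.

    A pair is penalized when the absent label's score is not at least
    `margin` (assumed value) below the present label's score.
    """
    present = [s for s, y in zip(scores, labels) if y == 1]
    absent = [s for s, y in zip(scores, labels) if y == 0]
    return sum(
        max(0.0, margin + neg - pos)
        for pos in present
        for neg in absent
    )


well_ranked = pairwise_ranking_loss([3.0, 2.0, -1.0, -2.0], [1, 1, 0, 0])
badly_ranked = pairwise_ranking_loss([-1.0, 2.0, 3.0, -2.0], [1, 1, 0, 0])
```

The loss vanishes once every ground reference label is scored above every remaining label by at least the margin, and grows with the number and severity of wrongly ordered pairs.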
3 A Comparative Analysis
We analyze and compare the above-mentioned loss functions in the framework of MLC based on their: 1) class imbalance awareness; 2) convexity and differentiability; and 3) learning efficiency. Our analysis of DL loss functions under these criteria aims at providing a guideline to select the most appropriate loss function for MLC applications. Most operational RS applications include a degree of class imbalance, i.e., classes are not equally represented in the archive. This is more evident in the case of MLC. When the number of images for a given class is not sufficient in the training set, characterization of this class can be more difficult compared to others. This may lead to misclassification of images. To overcome this limitation, the modulating factor defined in (3) significantly down-weights the effect of well-classified images on the value of the loss function (e.g., when , the modulating factor shrinks towards 0). Since the FL focuses more on hard samples, minority classes can be better characterized. In addition to the FL, the W-CEL gives images with minority classes more weight than images with the vastly represented classes in the training set. This is due to the fact that the weighting vector applied to the loss function is inversely proportional to the class distribution. The optimization problems of DL methods are generally non-convex, while convex properties exist in the trajectory of gradient minimizers . The convexity of a DL loss function is an important property for an effective training procedure and better generalization capability. In addition to convexity, another factor that supports the optimization of a loss function is its differentiability. It is worth noting that differentiability is not a sufficient condition for guaranteeing convergence to a global minimum. However, it is a required condition for providing a non-zero gradient back to the DL model during backpropagation.
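The role of differentiability can be illustrated numerically. The sketch below compares a central finite-difference slope of a thresholded Hamming loss with that of the cross-entropy loss for a single label. The 0.5 decision threshold, the step size `h` and the function names are assumptions made for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def thresholded_hamming(logit: float, label: int, threshold: float = 0.5) -> float:
    """0 if the binarized prediction matches the label, 1 otherwise."""
    prediction = 1 if sigmoid(logit) >= threshold else 0
    return float(prediction ^ label)


def cross_entropy(logit: float, label: int) -> float:
    p = sigmoid(logit)
    return -math.log(p if label == 1 else 1.0 - p)


def finite_difference(loss_fn, logit: float, label: int, h: float = 1e-4) -> float:
    """Central-difference estimate of the slope of the loss w.r.t. the logit."""
    return (loss_fn(logit + h, label) - loss_fn(logit - h, label)) / (2.0 * h)


# A wrongly classified label: the reference is 1 but the logit is negative.
slope_hamming = finite_difference(thresholded_hamming, -2.0, 1)
slope_ce = finite_difference(cross_entropy, -2.0, 1)
```

The thresholded Hamming loss is piecewise constant, so its slope is zero even though the label is misclassified, whereas the cross-entropy loss returns a clearly non-zero slope that can drive the update of the network weights.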
There are several strategies that allow the training of non-differentiable loss functions. However, these strategies may undesirably change the aim of loss functions and introduce additional complexity.
Among the considered loss functions, only the HAL and RL lack convexity and differentiability: they are non-convex and discontinuous, and thus difficult to optimize directly. Learning efficiency is another criterion, which is evaluated as the rate at which an iterative training procedure reaches a high MLC performance. With a more efficient learning procedure, similar MLC accuracies can be obtained with fewer iterations. Thus, fast convergence reduces the total training time required to reach a high MLC performance. Accordingly, it is crucial for a DL loss function, particularly when there is a time constraint on the training phase. In this work, we use the same optimization strategy for all loss functions, and thus do not assess the effect of optimizers on the learning efficiency.
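One simple way to quantify learning efficiency, sketched below, is to record the first epoch at which a validation score reaches a chosen target. This helper and the two score curves are purely hypothetical illustrations and do not reproduce the measurement protocol or the results of this paper.

```python
def epochs_to_reach(scores_per_epoch, target):
    """First epoch (1-based) whose validation score reaches `target`.

    Returns None if the target is never reached.
    """
    for epoch, score in enumerate(scores_per_epoch, start=1):
        if score >= target:
            return epoch
    return None


# Made-up validation curves for two hypothetical loss functions.
fast_curve = [0.40, 0.62, 0.70, 0.72]
slow_curve = [0.10, 0.25, 0.45, 0.60]

fast_epochs = epochs_to_reach(fast_curve, target=0.70)
slow_epochs = epochs_to_reach(slow_curve, target=0.70)
```

A loss function whose curve reaches the target in fewer epochs would be considered more efficient under this criterion, provided that the optimizer and the hyperparameters are kept identical.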
4 Experimental Results
Experiments have been carried out on the BigEarthNet  large-scale benchmark archive. We used the BigEarthNet-19 class nomenclature proposed in  instead of the original BigEarthNet classes. For a detailed explanation of the archive and the class nomenclature, the reader is referred to  and , respectively. For the experiments, we considered a standard CNN architecture so as not to lose generality. To this end, the CNN architecture given in the first step of the classification approach proposed in  is used, the only difference being the number of units (1024) in the last two fully connected layers. We applied the same training procedure and hyperparameters to all considered loss functions for 80 epochs. The initial learning rate was selected as for the RMSprop optimizer. The performance of each loss function is provided in terms of precision (), recall () and -Score. We did not apply early stopping with the validation set, so as not to alter the actual characteristics of the loss functions. We applied the Layer-wise Relevance Propagation (LRP)  technique to the RGB spectral bands of the images. This technique allows propagating the multi-label predictions backward in CNNs and providing heatmaps, which indicate the most informative areas in RS images for each class. The heatmaps provide an accurate way to explain the characteristics of different loss functions. Low and high heatmap values are highlighted in blue and red tones, respectively.
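For reference, the three evaluation metrics can be computed from binary label vectors as sketched below. The sketch assumes micro-averaging over all image-label pairs and exposes the F-score weighting as a parameter `beta`. Both choices are assumptions, since the averaging strategy and the F-score variant are not specified in this text.

```python
def micro_scores(predictions, references, beta=1.0):
    """Micro-averaged precision, recall and F-beta over all image-label pairs."""
    tp = fp = fn = 0
    for predicted, reference in zip(predictions, references):
        for p, r in zip(predicted, reference):
            tp += p & r            # label predicted and present
            fp += p & (1 - r)      # label predicted but absent
            fn += (1 - p) & r      # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    f_beta = (1.0 + beta_sq) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_beta


precision, recall, f_score = micro_scores(
    predictions=[[1, 1, 0], [0, 1, 0]],
    references=[[1, 0, 0], [0, 1, 1]],
)
```

In this toy example there are two true positives, one false positive and one false negative, so precision, recall and the F-score (with `beta=1.0`) all equal 2/3.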
To analyze the overall accuracy of the considered loss functions, Table 1 shows the overall multi-label classification performances. As one can see from Table 1, the CNNs trained with HL and RL achieve the highest values of precision and recall, respectively. However, since the CNN trained with HL provides a low recall, it does not lead to a high -Score. Similar to the HL, the CNN trained with RL leads to a low -Score. Since the CNN trained with SML achieves high precision and recall, it leads to the highest -Score compared to the other loss functions. To analyze the class imbalance and the convexity and differentiability criteria, Figure 1 shows two examples of BigEarthNet images, their multi-labels and LRP heatmaps with the multi-label predictions of the considered loss functions. From Fig. 1.a, one can see the behavior of different loss functions when an image is associated with classes that are not equally represented in the archive. In detail, on the heatmap of the CEL, the semantic content associated with one of the well-represented classes (Urban fabric) overwhelms the heatmap values. In contrast, the FL and W-CEL show a more regular distribution of heatmap values. On the other hand, the HAL and RL provide high values for most of the image regions on the heatmap of the Urban fabric class, while showing the highest values for the semantic content associated with the Industrial or commercial units class. In Fig. 1.b, one can see that convex loss functions provide a more accurate distribution of heatmap values in terms of the correlation between the semantic content of the image and the heatmap values. Loss functions that hold convexity and differentiability have more reliable heatmap values. However, applying a weighting factor to a relatively smooth loss function such as the CEL introduces significant uncertainty in the heatmap values of the W-CEL.
In contrast to the W-CEL, the modulating factor of the FL provides more regular values for the same regions. The RL and HAL show an irregular profile of predictions, with both high and low heatmap values associated with the same regions of the image. Although the LRP heatmaps are given for two examples, similar behavior is also observed for other images in BigEarthNet. To compare the learning efficiency of the considered loss functions, Figure 2 shows the overall scores on the validation set at different epochs of the training phase. As one can see from Figure 2, the CNNs trained with the SML and RL lead to considerably better performance in -Score from the initial epochs compared to the other loss functions.
5 Conclusion
This paper analyzes and compares different loss functions in the framework of MLC problems in RS. In particular, we have presented the advantages and limitations of different DL loss functions in terms of their: 1) overall accuracy; 2) class imbalance awareness; 3) convexity and differentiability; and 4) learning efficiency. In Table 2, a comparison of the considered loss functions is given on the basis of our experimental and theoretical analysis. In greater detail, experimental results show that the highest overall accuracy is achieved when the SML is utilized as a loss function. The FL and W-CEL are more suitable when imbalanced training sets are present. For MLC applications that require a training phase with convex and differentiable loss functions, the HAL and the RL are less suitable. The SML and RL are more suitable when a lower computational time is preferred for the training phase of a DL based MLC method. This study shows that for MLC problems in RS, DL loss functions should be chosen according to the needs of the considered problem. As future work, we plan to further analyze the differences between the MLC loss functions by visualizing their 3D trajectories under different network architectures.
This work is funded by the European Research Council (ERC) through the ERC-2017-STG BigEarth Project under Grant 759764.
- I. Shendryk, Y. Rist, R. Lucas, P. Thorburn, and C. Ticehurst, “Deep learning - a new approach for multi-label scene classification in PlanetScope and Sentinel-2 imagery,” in IEEE Intl. Geosci. Remote Sens. Symp., 2018, pp. 1116–1119.
- R. Stivaktakis, G. Tsagkatakis, and P. Tsakalides, “Deep learning for multilabel land cover scene categorization using data augmentation,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 7, pp. 1031–1035, 2019.
- G. Sumbul and B. Demir, “A deep multi-attention driven approach for multi-label remote sensing image classification,” IEEE Access, vol. 8, pp. 95934–95946, 2020.
- G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “BigEarthNet: A large-scale benchmark archive for remote sensing image understanding,” IEEE Intl. Geosci. Remote Sens. Symp., pp. 5901–5904, 2019.
- G. Hinton, P. Dayan, B. Frey, and R. Neal, “The ‘wake-sleep’ algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, 1995.
- T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, 2020.
- E. Frank and M. Hall, “A simple approach to ordinal classification,” in European Conf. Machine Learning, 2001, pp. 145–156.
- P. J. Huber, “Robust estimation of a location parameter,” Ann. Math. Statist., vol. 35, no. 1, pp. 73–101, 1964.
- Y. Li, Y. Song, and J. Luo, “Improving pairwise ranking for multi-label image classification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 3617–3625.
- A. F. T. Martins and R. F. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in Intl. Conf. Mach. Learn., 2016, pp. 1614–1623.
- I. J. Goodfellow, O. Vinyals, and A. M. Saxe, “Qualitatively characterizing neural network optimization problems,” in Intl. Conf. Learn. Represent., 2015.
- G. Sumbul, J. Kang, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M. Caetano, and B. Demir, “BigEarthNet dataset with a new class-nomenclature for remote sensing image understanding,” 2020, [Online]. Available: arXiv:2001.06372.
- M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K. R. Müller, S. Dähne, and P.-J. Kindermans, “iNNvestigate neural networks!,” J. Mach. Learn. Res., vol. 20, no. 93, pp. 1–8, 2019.