Overfitting of neural nets under class imbalance: Analysis and improvements for segmentation
Overfitting in deep learning has been the focus of a number of recent works, yet its exact impact on the behavior of neural networks is not well understood. This study analyzes overfitting by examining how the distribution of logits alters in relation to how much the model overfits. Specifically, we find that when training with few data samples, the distribution of logit activations when processing unseen test samples of an under-represented class tends to shift towards and even across the decision boundary, while the over-represented class seems unaffected. In image segmentation, foreground samples are often heavily under-represented. We observe that sensitivity of the model drops as a result of overfitting, while precision remains mostly stable. Based on our analysis, we derive asymmetric modifications of existing loss functions and regularizers including a large margin loss, focal loss, adversarial training and mixup, which specifically aim at reducing the shift observed when embedding unseen samples of the under-represented class. We study the case of binary segmentation of brain tumor core and show that our proposed simple modifications lead to significantly improved segmentation performance over the symmetric variants.
Convolutional neural networks (CNNs) work exceptionally well when trained with sufficiently large and representative data. When only small amounts of training data are available, overfitting can become a critical issue. CNNs may memorize specific patterns of the training data, leading to poor generalization at test time. Image segmentation is particularly prone to overfitting, as the generation of high-quality expert annotations is tedious and time-consuming. Contributing to the problem is the often severe class imbalance, where the foreground class (say, tumor) is heavily under-represented in the training samples. Class ratios of 1:10 and lower are typical. To alleviate class imbalance, one may use data augmentation, change the sampling weights per class, add information in the loss function, or adopt multi-stage approaches with candidate proposals and background suppression. We argue that the key connection between class imbalance and overfitting to the under-represented foreground class has not been investigated sufficiently. Many methods have been proposed for deep models to improve generalization and prevent overfitting, including specific loss functions, data mixing, and learning-based augmentation. However, most of these techniques were proposed for general image classification, where class imbalance is not specifically addressed. It also remains unclear exactly how these techniques affect the model; to our knowledge, this has not been explored in detail. In this study, we shed new light on the problem of overfitting in the presence of class imbalance, aiming to improve segmentation performance.
To explore the effects of overfitting on model behavior, we investigate how the distribution of activations in the last network layer (logits) changes for models trained with different amounts of training data. We notice that test samples of the under-represented class tend to be mapped towards and across the decision boundary, while the mapping of training and test samples of the over-represented class remains stable. As a result, the model tends to lose sensitivity for the under-represented class at test time. We argue this is a consequence of class imbalance and overfitting to the few training samples of the under-represented class. Current solutions that aim to separate different classes better do not address this imbalance and may even reduce performance in such imbalanced settings, as we show empirically. Based on our analysis, we propose asymmetric modifications of those techniques that steer their effect towards the problem of class imbalance, showing promising results for image segmentation with small amounts of training data.
In order to investigate the influence of overfitting, we train convolutional neural networks using different amounts of data with strong class imbalance. We conduct experiments on brain tumor core  and small organ segmentation (data from ). For tumor segmentation, we test on 95 cases and train separate models using 190 (100%), 95 (50%), 38 (20%), 19 (10%) and 10 cases (5% of full training set). For organ segmentation, we test on 10 cases and train models using 20 (100%) and 5 cases (25% of training set). We use the DeepMedic  architecture for all our experiments.
Results are shown in Fig. 1. With less training data, we observe a clear decline of segmentation performance on test data but similar or even increased performance on training data, as expressed by the DSC metric (defined as 2 · precision · sensitivity / (precision + sensitivity)). Precision remains largely stable, while overfitting causes reduced sensitivity. We also observe this behavior in other tasks where foreground classes are under-represented. Our findings reveal that models that overfit to training data have a bias to under-segment the under-represented class on unseen test data.
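The relation between the three metrics discussed above can be illustrated with a minimal sketch. It assumes flat binary masks with 1 marking foreground; the function name `segmentation_metrics` and the toy masks are hypothetical and not part of the experiments.

```python
def segmentation_metrics(pred, target):
    """Return (dsc, sensitivity, precision) for flat binary masks (1 = foreground)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Equivalent to the harmonic mean of precision and sensitivity.
    dsc = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return dsc, sensitivity, precision


# Toy example: the prediction misses half of the foreground but adds no
# false positives, i.e. it under-segments.
target = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
under_segmented = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
dsc, sens, prec = segmentation_metrics(under_segmented, target)
```

In this toy case precision stays at its maximum while sensitivity and DSC drop, which is the pattern described for overfitted models.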
To analyze this behavior in more depth, we monitor the logits when the network processes foreground and background samples of training and unseen test data (cf. plots in Fig. 2). For simplicity, we focus on the problem of binary segmentation of tumor core for the rest of this paper. We observe that the distributions of logit activations when processing background samples from the training and test sets tend to be similar. However, the distribution of logit activations when processing foreground samples shifts significantly towards or even across the decision boundary, which causes false negatives. This shift tends to increase for models that overfit more (trained with less training data). It explains our previous observation that sensitivity decreases drastically when models overfit, causing the model to under-segment the structures of interest.
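A simple way to quantify the shift described above is sketched below. It assumes a binary setting in which each foreground sample has a single scalar logit and the decision boundary lies at 0; the helper `foreground_logit_summary` and the listed values are hypothetical toy numbers, not measurements from our experiments.

```python
from statistics import mean


def foreground_logit_summary(logits, boundary=0.0):
    """Mean logit and fraction of foreground samples on the wrong side of the boundary."""
    crossed = sum(1 for z in logits if z <= boundary)
    return mean(logits), crossed / len(logits)


# Toy numbers for illustration only.
train_fg = [4.0, 3.5, 5.0, 4.5]
test_fg = [1.0, -0.5, 2.0, -1.5]

train_mean, train_fn_rate = foreground_logit_summary(train_fg)
test_mean, test_fn_rate = foreground_logit_summary(test_fg)
shift = train_mean - test_mean  # positive value: test logits moved towards the boundary
```

Foreground samples whose logit crosses the boundary are exactly the false negatives, so the crossed fraction equals one minus the sensitivity.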
We argue that the above behavior is a combined effect of class imbalance and overfitting. The background covers a large part of the image and is a relatively heterogeneous class in many tasks. For example, a CNN “sees” very different patterns when processing different parts of the brain through its receptive field. Thus, to minimize the training cost even with little data, the network has to learn relatively generic filters. Subsequently, these filters will also map unseen data appropriately, and no shift between the embeddings of training and test samples is observed. In contrast, the appearance of small foreground structures may be easier to memorize within a network, as there are only limited ways the CNN can view them through its receptive field. Even if the structure is complex, a set of case-specific filters could enable memorization. These filters, tailored to each training case, are suboptimal for new unseen data, yielding poor generalization. As they do not match the underlying patterns of new data well, their evaluation leads to activations of smaller magnitude, causing the observed distribution shift (the dot product between a filter and a signal is highest when these match perfectly). As a result of class imbalance and overfitting, the CNN tends to underperform for under-represented classes. The shift of the foreground logit distribution is the cause of the drastic decrease of sensitivity and of under-segmentation in case of overfitting. However, previous loss functions and regularization techniques that aim to prevent overfitting do not take this behavior into account and are unable to improve segmentation in this setting. Here, we introduce new asymmetric variants that aim at reducing the shift of the under-represented classes, leading to significant improvements.
Based on our observations about the behavior of CNNs, we modify previous loss functions and training strategies to prevent distribution shift of logit activations. Specifically, we add a bias for the under-represented class to tackle overfitting under class imbalance. Although the original techniques were proposed for different purposes, our modifications have a common goal: keep the logit activations of the under-represented class away from the decision boundary. Even if the logit of a foreground sample shifts towards the decision boundary, its prediction is likely to remain correct (cf. Fig. 3).
3.1 Asymmetric large margin loss
We start with a basic CNN segmentation model. Typically, the softmax function and cross entropy are computed on the logits to make predictions, with the loss:
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} y_{ic}\,\log p_{ic}, \qquad p_{ic} = \frac{e^{z_{ic}}}{\sum_{j} e^{z_{ij}}},$$
where $N$ is the number of training samples, $z_{ic}$ is the network’s logit output of the $c$-th class for input image $x_i$, and $y_i$ is its corresponding one-hot label.
The large margin loss was proposed to increase the Euclidean distances between the logits of different classes in order to learn discriminative features. In its symmetric form, it is implemented by adding a bias to the logits of every class:
$$L_{M} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} y_{ic}\,\log q_{ic}, \qquad q_{ic} = \frac{e^{z_{ic} - y_{ic} m}}{\sum_{j} e^{z_{ij} - y_{ij} m}},$$
where $z_{ic}$ is the logit of class $c$ for sample $i$, $y_i$ is the one-hot label and $m$ is the margin. Although the large margin loss encourages the model to map different classes away from each other, the decision boundary remains in the center. According to our findings, the bias of class imbalance causes shifts of unseen foreground samples towards the background class. To mitigate this, a regularizer should move the decision boundary closer to the background class. Therefore, our asymmetric modification pushes the foreground class away from the decision boundary by only setting the margin for the rare class $r$:
$$\hat{q}_{ic} = \frac{e^{z_{ic} - y_{ic}\,\mathbb{1}[c = r]\, m}}{\sum_{j} e^{z_{ij} - y_{ij}\,\mathbb{1}[j = r]\, m}}.$$
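The asymmetric margin can be sketched for a single sample as follows. This is an illustrative reading of the description above rather than the training code used in the experiments: the function `margin_cross_entropy` and its `margin_classes` argument are hypothetical, and the margin is assumed to be subtracted from the true-class logit before the softmax.

```python
import math


def margin_cross_entropy(logits, label, margin, margin_classes):
    """Cross entropy for one sample; `margin` is subtracted from the true-class
    logit only if the true class is listed in `margin_classes`."""
    shifted = list(logits)
    if label in margin_classes:
        shifted[label] -= margin
    mx = max(shifted)  # subtract the maximum for numerical stability
    log_norm = mx + math.log(sum(math.exp(z - mx) for z in shifted))
    return log_norm - shifted[label]


BACKGROUND, FOREGROUND = 0, 1
m = 1.0
fg_logits = [0.0, 2.0]  # a correctly classified foreground sample
bg_logits = [2.0, 0.0]  # a correctly classified background sample

plain_fg = margin_cross_entropy(fg_logits, FOREGROUND, m, margin_classes=())
plain_bg = margin_cross_entropy(bg_logits, BACKGROUND, m, margin_classes=())
sym_bg = margin_cross_entropy(bg_logits, BACKGROUND, m, margin_classes=(0, 1))
asym_fg = margin_cross_entropy(fg_logits, FOREGROUND, m, margin_classes=(FOREGROUND,))
asym_bg = margin_cross_entropy(bg_logits, BACKGROUND, m, margin_classes=(FOREGROUND,))
```

With the margin applied only to the foreground class, the loss of the foreground sample increases, which asks for a larger foreground logit, while the background sample is penalized exactly as under plain cross entropy.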
3.2 Asymmetric focal loss
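The focal loss was proposed for small object detection by reducing the loss of well-classified samples and focusing on samples which are near the decision boundary. It adds attenuation inside the loss function based on the logit activations:
$$L_{F} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} (1 - p_{ic})^{\gamma}\, y_{ic}\,\log p_{ic},$$
where $p_{ic}$ is the softmax probability of class $c$ for sample $i$, $y_i$ is the one-hot label and $\gamma$ is the hyper-parameter that controls the focus. Symmetric focal loss prevents logits from being too large and makes every class stay near the decision boundary. However, it also makes it easy for the unseen foreground samples to shift across the decision boundary. Therefore, we remove the loss attenuation for the foreground class $r$ to keep it away from the decision boundary:
$$\hat{L}_{F} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_{ir}\,\log p_{ir} + \sum_{c \neq r} (1 - p_{ic})^{\gamma}\, y_{ic}\,\log p_{ic} \Big].$$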
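A per-sample sketch of this idea is given below. It assumes that the attenuation factor is one minus the softmax probability of the true class, raised to the focus hyper-parameter, and that it is simply skipped for the rare class; the function names and the `rare_class` argument are hypothetical.

```python
import math


def softmax(logits):
    mx = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def asymmetric_focal_loss(logits, label, gamma, rare_class):
    """Focal loss for one sample, with the attenuation removed for `rare_class`."""
    p = softmax(logits)[label]
    weight = 1.0 if label == rare_class else (1.0 - p) ** gamma
    return -weight * math.log(p)


BACKGROUND, FOREGROUND = 0, 1
# Two equally well-classified samples, one per class.
fg_loss = asymmetric_focal_loss([0.0, 2.0], FOREGROUND, gamma=2.0, rare_class=FOREGROUND)
bg_loss = asymmetric_focal_loss([2.0, 0.0], BACKGROUND, gamma=2.0, rare_class=FOREGROUND)
```

Although both samples are classified with the same confidence, only the background loss is attenuated, so the foreground sample keeps receiving a full cross-entropy gradient that pushes its logit further from the boundary.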
3.3 Asymmetric adversarial training
Adversarial training was proposed to learn a robust classifier by training with difficult samples which can break the correct predictions in a significant way:
Here, d is the direction of the generated adversarial perturbation, l is its magnitude, and the perturbation is restricted to a bounded range. Similar to the large margin loss, symmetric adversarial training preserves the decision boundary and causes difficulties for unseen foreground samples, which tend to shift towards the background class. Our proposed asymmetric adversarial training aims to produce a larger space between the foreground class and the decision boundary:
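The following toy sketch illustrates one possible reading of the asymmetric variant: adversarial perturbations are generated only for foreground samples, so that only the foreground class is pushed away from the decision boundary. It is an assumption-laden illustration, not our implementation. It uses a linear logistic model in place of a CNN so that the input gradient is available in closed form, takes the sign of the gradient as the direction d (as in the fast gradient sign method), and all function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def adversarial_input(x, y, w, b, magnitude, rare_label=1, asymmetric=True):
    """Perturb x along the sign of the loss gradient (direction d, magnitude l).
    In the asymmetric case (an assumption), only rare-class samples are perturbed."""
    if asymmetric and y != rare_label:
        return list(x)
    # Gradient of the sigmoid cross entropy with respect to the input x.
    grad = [(sigmoid(logit(x, w, b)) - y) * wi for wi in w]
    return [xi + magnitude * sign(g) for xi, g in zip(x, grad)]


w, b = [1.0, -2.0], 0.0
x_fg, x_bg = [1.0, 0.0], [-1.0, 0.0]

x_fg_adv = adversarial_input(x_fg, 1, w, b, magnitude=0.1)
x_bg_adv = adversarial_input(x_bg, 0, w, b, magnitude=0.1)
z_fg, z_fg_adv = logit(x_fg, w, b), logit(x_fg_adv, w, b)
```

Training on the perturbed foreground input, whose logit lies closer to the boundary than the original, requires the model to leave extra room on the foreground side, while background samples are used unchanged.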
3.4 Asymmetric mixup
Mixup is a simple yet effective data augmentation algorithm that improves generalization by generating extra training samples as linear combinations of pairs of images and their labels:
$$L_{mixup} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} \tilde{y}_{ic}\,\log p_{c}(\tilde{x}_i),$$
where $p_{c}(\tilde{x}_i)$ is the softmax probability of class $c$ for input $\tilde{x}_i$ and $(\tilde{x}_i, \tilde{y}_i)$ is the generated training sample:
$$\tilde{x}_i = \lambda x_i + (1 - \lambda)\, x_j, \qquad \tilde{y}_i = \lambda y_i + (1 - \lambda)\, y_j.$$
Here, $\lambda$ is randomly drawn from a beta distribution and $(x_j, y_j)$ is another random training sample. Mixup regularizes the model by centering the decision boundary between classes, which helps little in our setting. In contrast to the original mixup, which generates samples with soft labels, our modification generates hard labels by treating augmented samples that are close to foreground samples as belonging to the foreground class. By doing this, asymmetric mixup can keep the decision boundary away from the foreground class and increase the area of the foreground logit distribution. This prevents unseen under-represented samples from shifting across the decision boundary. Specifically, a mixed image which has a certain distance from the background class is taken as a foreground sample:
where m is the margin to guarantee that the augmented samples are not too close to background samples and still belong to positive samples.
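The labelling rule described above can be sketched for the binary case as follows. The sketch assumes that the foreground weight of the mixed label is compared against the margin and that mixed samples at or below the margin are labelled as background; the latter choice, as well as the names `mix_with_lambda`, `asymmetric_mixup` and the `alpha` parameter of the beta distribution, are assumptions for illustration.

```python
import random


def mix_with_lambda(x_i, y_i, x_j, y_j, lam, margin):
    """Mix two samples with weight `lam`. Labels are foreground indicators in {0, 1}.
    Returns (mixed_x, hard_label, soft_foreground_weight)."""
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    soft_fg = lam * y_i + (1.0 - lam) * y_j
    # Assumption: anything farther than `margin` from pure background is foreground.
    hard_label = 1 if soft_fg > margin else 0
    return mixed_x, hard_label, soft_fg


def asymmetric_mixup(x_i, y_i, x_j, y_j, margin, alpha=1.0, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha), then apply the hard labelling."""
    lam = rng.betavariate(alpha, alpha)
    return mix_with_lambda(x_i, y_i, x_j, y_j, lam, margin)


fg_x, bg_x = [1.0, 1.0], [0.0, 0.0]
mostly_bg = mix_with_lambda(fg_x, 1, bg_x, 0, lam=0.3, margin=0.2)
almost_bg = mix_with_lambda(fg_x, 1, bg_x, 0, lam=0.1, margin=0.2)
random_mix = asymmetric_mixup(fg_x, 1, bg_x, 0, margin=0.2)
```

A mixture that is only 30% foreground still receives a hard foreground label here, which enlarges the region assigned to the foreground class, whereas original mixup would give it a soft label of 0.3.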
We demonstrate the effect of our proposed modifications for the case of brain tumor core binary segmentation. The hyper-parameters are kept the same for the original baselines and our modified techniques. The quantitative segmentation results are summarized in Table 1. For baseline experiments, we show that simply changing the objective function to DSC (which is a mix of sensitivity and precision) does not improve the performance. Increasing the weight of tumor samples (from 50% to 80%) leads to even more overfitting and decreased performance. Our proposed modifications lead to improvements in all experiments. Specifically, the original large margin loss and mixup sometimes decrease performance, while our modifications boost the performance to a large extent. Focal loss and adversarial training can be effective when data is very little, where our modifications seem to further improve the sensitivity. We also demonstrate that our four methods can be integrated into a single model. The combination of the four modified techniques further improves results.
The effect of all four techniques on the logit distribution is shown in Fig. 4. The original large margin loss and adversarial training try to push samples from different classes far from each other. However, the logits of unseen data remain in the center around the decision boundary and thus the predictions are not improved. With our modifications, only the logits of foreground samples are pushed away and the unseen foreground logits tend to remain positive. The original focal loss encourages the network to prevent the logits of each class from moving too far from the decision boundary. However, it allows foreground logits to remain near the decision boundary, which can yield false negative predictions. Our asymmetric focal loss removes this constraint for foreground samples. Original mixup encourages symmetric distributions of the different classes but does not consider class imbalance. Asymmetric mixup exploits the embedding space based on the relationship between samples to generate foreground samples and makes the decision boundary stay near the background class. This leads to the biggest improvement overall by increasing the region for the foreground logit distribution and reducing the logit shift of unseen foreground samples.
In this paper, we analyze overfitting of neural networks under class imbalance. We find that when processing unseen under-represented samples, the logit activations tend to shift towards the decision boundary, and thus the sensitivity drops. We derive asymmetric variants of existing loss functions and regularization techniques to prevent overfitting, showing promising results. We expect our findings to extend naturally to multi-class problems, which we will investigate in future work. We further believe that our logit distribution plots can be a valuable tool for practitioners to study overfitting and other behavior of different models.
ZL is grateful for a China Scholarship Council (CSC) Imperial Scholarship. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant No 757173, project MIRA, ERC-2017-STG) and EPSRC (EP/R511547/1).
-  Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C.: Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017)
-  Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (ICLR) (2015)
-  Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017)
-  Landman, B.A., Xu, Z., Iglesias, J.E., Styner, M., Langerak, T.R., Klein, A.: 2015 miccai multi-atlas labeling beyond the cranial vault – workshop and challenge
-  Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
-  Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional neural networks. In: International Conference on Machine Learning (ICML). pp. 507–516 (2016)
-  Valindria, V.V., Lavdas, I., Cerrolaza, J., Aboagye, E.O., Rockall, A.G., Rueckert, D., Glocker, B.: Small organ segmentation in whole-body mri using a two-stage fcn and weighting schemes. In: MICCAI-MLMI. pp. 346–354. Springer (2018)
-  Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (ICLR) (2018)