Skin Lesion Segmentation and Classification With Deep Learning System


Melanoma is one of the ten most common cancers in the US. Early detection is crucial for survival, but the cancer is often diagnosed at a fatal stage. Deep learning has the potential to improve cancer detection rates, but its applicability to melanoma detection is compromised by the limitations of the available skin lesion data bases, which are small, heavily imbalanced, and contain images with occlusions. We propose a complete deep learning system for lesion segmentation and classification that utilizes networks specialized in data purification and augmentation. It contains a processing unit for removing image occlusions and a data generation unit for populating scarce lesion classes, or equivalently creating virtual patients with pre-defined types of lesions. We empirically verify our approach and show superior performance over common baselines.


Devansh Bisla  Anna Choromanska  Jennifer A. Stein  David Polsky  Russell Berman
Tandon School of Engineering, Dept of Electrical Engineering, New York University
Ronald O. Perelman Department of Dermatology, New York University School of Medicine
Division of Surgical Oncology, Department of Surgery, New York University School of Medicine

1 Introduction

According to the American Cancer Society, there were approximately cases of melanoma in the US alone in the year  [1]. The -year survival rate in the US is for patients diagnosed in the very early stage of the cancer and reduces down to only once it spreads to distant organs. Therefore, early detection is crucial for survival. The techniques that aim at automating the visual examination of skin lesions, traditionally done by dermatologists (a clinician looks for signs such as asymmetric lesions, lesions with irregular borders, non-uniform pigmentation, or large size, as well as lesions that change over time), have the potential to become an invaluable weapon in the battle with melanoma. The fundamental obstacle in advancing automated methods is the lack of large and balanced data sets that can be used to train computational models: many publicly available skin lesion data sets are small, imbalanced (they contain significant disproportions in the number of data points between different classes of lesions and are heavily dominated by images of benign lesions), and furthermore contain occlusions such as hairs. Moreover, publicly available data sets are obtained from multiple imaging centers, hospitals, and research institutes, each with different data collection and management standards. Some imaging centers also mark lesions, for example by placing a ruler next to the lesion to measure its diameter. These visual marks introduce yet another type of occlusion in the image. Such practices tailor the data to the requirements of a particular organization but also introduce bias to the data.

We contribute to computer-aided dermatological techniques with a new deep learning system for lesion segmentation and classification, in which the performance of the classification model hinges on the careful preparation of the training data. The segmentation component of the system is used to identify the lesion areas in the images and also takes part in data purification, i.e. the removal of occlusions such as hairs and rulers from the images, for the classification task. The classification component further relies on two-fold data augmentation, i.e. balancing the data through the generation of artificial dermoscopic images and additional data augmentation. We tested our approach, which is open-sourced, on popular ISIC data sets [2, 3]: ISIC 2017 containing three classes (melanoma, nevus, and seborrheic keratosis) and ISIC 2018 containing seven classes (actinic keratosis, basal cell carcinoma, benign keratosis, dermatofibroma, melanoma, nevus, and vascular lesion).

2 Related Work

There is a substantial body of work on developing algorithms for computer-assisted dermatology. A number of approaches rely on hand-crafted features [4, 5, 6, 7] and, unlike deep learning techniques, do not scale to massive data sets. Convolutional neural networks (CNNs) [8] in particular were recently applied in computational dermatology [9, 10, 11, 12, 13, 14, 15, 16, 17]. Only a subset of these deep learning methods perform data augmentation to increase the training data size [9, 12, 17] or balance the data [17], but they use fairly standard tools, i.e. image cropping, scaling, and flipping, and none of them addresses the problem of removing image occlusions. In this paper we propose more sophisticated tools for data purification and augmentation that utilize the removal of occluding objects and synthetic data generation. We focus on dermoscopic images, which allow higher diagnostic accuracy than traditional naked-eye examination [18] as they show the micro-structures in the lesion and provide a view unobstructed by skin reflections.

3 Proposed Methods

3.1 Data sets

We trained our deep learning models on a combination of open source data sets: the ISIC Archives, the Edinburgh Dermofit Image Library [19], and the PH2 data set [20]. Diagnoses of these lesions were established clinically (ISIC, Dermofit) and/or via histopathology (ISIC, PH2). The training data sizes are shown in Figure 4, which is discussed in more detail later in the paper. Additionally, each data set contains lesion segmentation masks. We tested our models on the ISIC 2017 and ISIC 2018 test data sets obtained from the ISIC Archives.

3.2 Data purification problem

Fig. 1: Visualization results for the conventionally-trained model on the ISIC 2017 data set. (Top): Original image. (Bottom): Visualization mask overlaid on the original image. The model overfits to image occlusions such as hairs and rulers.

Dermoscopic images often contain occlusions such as hairs and/or rulers. Deep learning approaches can in general handle such occlusions and learn to ignore these objects when making predictions, but only when large training data are available. Dermoscopic data bases are typically small or medium-sized, so deep learning models trained on them easily overfit, i.e. they use visual cues such as hairs and rulers as indicators of the lesion category. To demonstrate this problem we employ a visualization technique called VisualBackProp [21], which highlights the part of the image that the network focuses on when forming its prediction. Figure 1 shows the results obtained for the traditionally-trained deep model (without data purification or augmentation) on the ISIC 2017 data set.

3.3 Data Purification Network

We utilized a U-Net [22]-based encoder-decoder deep architecture to perform data purification. We modified this model by replacing convolution operations with partial convolutions [23] to improve its performance. We refer to this network as the data purification network. The input of the network is the original image and the output is the same image after occlusion removal. Training this model requires coupled pairs of images before and after occlusion removal and those, to the best of our knowledge, are not available in any public data set, so we created such training data ourselves. We used the hair-removal algorithm [24], which relies on traditional image processing techniques, to find hairs and rulers in the images. One of its steps involves thresholding the luminance channel of the image in the LUV color space, which may also remove dark regions belonging to the lesion itself. To correct that, we overlaid the processed image with the segmented lesion obtained from our segmentation network, which is described in Section 3.4. The holes in the background were eliminated by performing the closing operation. The obtained training data set consisted of images and was augmented through random masking.
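The mask-creation steps described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the algorithm of [24]: luminance is approximated by a weighted RGB sum instead of a true LUV conversion, the threshold of 0.25 and the 3x3 structuring element are arbitrary, and the function names (`occlusion_mask`, `closing`) are hypothetical.

```python
import numpy as np

def dilate(mask):
    """3x3 binary dilation; pixels outside the image are treated as False."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    shifts = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.stack(shifts).any(axis=0)

def erode(mask):
    """3x3 binary erosion via duality; pixels outside the image count as True."""
    return ~dilate(~mask)

def closing(mask):
    """Morphological closing: dilation followed by erosion (fills small gaps)."""
    return erode(dilate(mask))

def occlusion_mask(rgb, lesion_mask, threshold=0.25):
    """Flag dark, thin structures (hairs, ruler marks) outside the lesion.

    rgb: float array (H, W, 3) with values in [0, 1].
    lesion_mask: boolean array (H, W), True inside the segmented lesion.
    """
    lesion_mask = np.asarray(lesion_mask, dtype=bool)
    # Assumption: a weighted RGB sum stands in for the LUV luminance channel.
    luminance = rgb @ np.array([0.299, 0.587, 0.114])
    dark = luminance < threshold
    # Dark pixels inside the lesion belong to the lesion, not to an occlusion.
    candidates = dark & ~lesion_mask
    return closing(candidates)
```

In this sketch the lesion mask protects dark lesion pixels from being treated as occlusions, and the closing step bridges one-pixel gaps in a detected hair.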

The data purification network was trained using the loss proposed for training networks with partial convolutions [23], which targets both per-pixel reconstruction accuracy and composition, i.e. how smoothly the predicted hole values transition into their surrounding context. The network was trained using the Adam optimizer [25] with beta values set to and a constant learning rate set to in the beginning of training, and the SGD optimizer with momentum and learning rate in the later stages of learning. The model was trained over a week on GTX Gb GPU cards. Figure 2 shows the results of data purification obtained by our model.
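To make the two goals of the loss concrete, the sketch below computes per-pixel L1 terms separately over known and hole pixels, and a simple smoothness measure on the composited image. It is an illustration only: the full loss in [23] contains further terms that are not reproduced here, the smoothness measure is computed over the whole image for simplicity, and all function names are hypothetical.

```python
import numpy as np

def reconstruction_terms(pred, target, valid_mask):
    """Mean absolute error over known pixels and over hole pixels.

    valid_mask has the same shape as the images: 1 where the pixel is known,
    0 inside a hole (an occluded region to be filled in).
    """
    diff = np.abs(pred - target)
    l_valid = float(np.mean(valid_mask * diff))
    l_hole = float(np.mean((1.0 - valid_mask) * diff))
    return l_valid, l_hole

def composite(pred, target, valid_mask):
    """Keep known pixels from the target and take hole pixels from the prediction."""
    return valid_mask * target + (1.0 - valid_mask) * pred

def total_variation(img):
    """Mean absolute difference between vertically and horizontally adjacent pixels."""
    dh = np.abs(img[1:, :] - img[:-1, :]).mean()
    dw = np.abs(img[:, 1:] - img[:, :-1]).mean()
    return float(dh + dw)
```

A low total variation of the composited image indicates that the filled-in hole blends smoothly into its surroundings.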

Fig. 2: (Top): Original images. (Bottom): Images after removing occlusions, i.e. hairs and rulers, using Data Purification Network.
Fig. 3: (A) Architecture of the Segmentation Network. Black lines represent skip connections. Each convolution operation is followed by ReLU activation; 2 dil and 4 dil refer to 2-dilated and 4-dilated convolution operations. (B) Top: Original images from the test data set. Middle: Ground-truth segmentation masks. Bottom: Corresponding segmentation masks obtained from our Segmentation Network.

3.4 Segmentation Network

For the lesion segmentation, we utilized a U-Net model whose architecture was modified by adding dilated convolutions [26] to increase the network's effectiveness and performance. The schematic of the model is shown in Figure 3. The output of the segmentation model was passed through a binary hole filling procedure to fill empty holes in the segmentation mask. The input image size was fixed to , each channel of the input was normalized with mean and standard deviation . To train the network, we used the binary cross-entropy loss and the Adam optimizer with a fixed learning rate of and beta values . Exemplary segmentation masks that we obtained and the corresponding ground truth masks are depicted in Figure 3. In these examples, the masks produced by our model appear better aligned with the lesion than the ground truth annotations. To evaluate the performance of the segmentation model itself we use the Jaccard index. We obtained an average index of on the ISIC 2017 test data (the best result reported in the literature is  [2]) and on the ISIC 2018 test data (the best known result of is reported under the link: and was obtained with an ensemble of different deep learning models).
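For reference, the Jaccard index of a predicted and a ground-truth mask is the size of their intersection divided by the size of their union. A minimal NumPy sketch follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
import numpy as np

def jaccard_index(pred_mask, true_mask):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, true).sum() / union)

def mean_jaccard(pred_masks, true_masks):
    """Average Jaccard index over a collection of mask pairs."""
    return float(np.mean([jaccard_index(p, t) for p, t in zip(pred_masks, true_masks)]))
```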

3.5 Data imbalancedness problem

Fig. 4: Sizes of the training data sets for the ISIC 2017 task (left; melanoma cases (M), nevus cases (N), seborrheic keratosis cases (SK)) and the ISIC 2018 task (right; cases of actinic keratosis (AK), cases of basal cell carcinoma (BCC), cases of benign keratosis (BK), cases of dermatofibroma (DF), cases of melanoma (M), cases of nevus (N), and cases of vascular lesion (VL)). Both data sets are heavily imbalanced.

The data imbalance, illustrated in Figure 4, is yet another significant factor that deteriorates the performance of deep learning systems analyzing dermoscopic images. Classifiers tend to be biased towards the majority classes, which correspond to benign lesions. This problem can be partially mitigated by introducing a higher penalty for misclassifying rare classes, though data augmentation techniques are nowadays preferred because they increase data variability while balancing the data. We propose a data augmentation technique that generates images for the scarce classes that obey the data distribution of these classes. This is equivalent to creating virtual patients with lesions from the scarce classes in order to bring their size up to that of the large classes.
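The penalty-based alternative mentioned above is commonly implemented by weighting the loss of each example inversely to the frequency of its class. The sketch below is one such scheme, shown only for comparison with the generation-based approach; the weighting formula and the function names are assumptions, not part of the proposed system.

```python
from collections import Counter
import math

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes get larger weights.

    n is the number of examples and k the number of distinct classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def weighted_nll(prob_of_true_class, label, weights):
    """Negative log-likelihood of one example, scaled by its class weight."""
    return -weights[label] * math.log(prob_of_true_class)
```

With three nevus examples and one melanoma example, the melanoma weight is 2.0 and the nevus weight is 2/3, so an equally confident mistake on melanoma costs three times as much.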

3.6 Data Generation Network with de-coupled DCGANs


[Figure 5(C) column labels (MSE values): 0.02, 0.09, 0.18 for melanoma; 0.02, 0.04, 0.06 for seborrheic keratosis]

Fig. 5: (A) Architecture of the DCGAN model. (B) Histograms of the MSE values for (left) melanoma (the mean and std of the MSE are ) (right) seborrheic keratosis (the mean and std of the MSE are ). (C) (First three columns): generated melanoma images (top) and the original images from the training set (bottom) for different values of MSE. (Last three columns): generated seborrheic keratosis images (top) and the original images from the training set (bottom) for different values of MSE.

For the ISIC 2017 task we propose a data generation method utilizing de-coupled DCGANs. We use two separate Deep Convolutional Generative Adversarial Networks (DCGANs) [27] to generate images of melanoma and of seborrheic keratosis, the two classes that are heavily under-represented in the ISIC 2017 data set compared to the much larger nevus class. Since we use a separate network for each class, we refer to this approach as “de-coupled DCGANs”. We extended the DCGAN architecture to produce images of resolution  by adding layers to both the generator and the discriminator. The model architecture is shown in Figure 5. GAN techniques rely on training a generator network to produce images whose distribution is similar to that of the training data, while the discriminator provides feedback on how close the two distributions are. In our experiments, the latent vector of length 10 that is input to the generator is drawn from a standard Gaussian distribution with mean 0 and standard deviation 1. Binary cross-entropy loss and the Adam optimizer with learning rate of and beta values of and were used to train both the discriminator and the generator. To prevent the generator from collapsing and to stabilize the discriminator-generator optimization, we utilize stabilization techniques [28] and perform early stopping while training the network. Furthermore, we perform two rounds of additional generator training after every round of joint training of the discriminator and the generator.

The process of data generation needs to be done carefully. It is essential to make sure that the generated images differ from the ones contained in the training data in order to maximize data variety. To verify that, we calculate the mean squared error (MSE) between each generated image and every image in the training data set and select the training image with the minimal MSE. We then compare each generated image with its closest, i.e. MSE-minimizing, training image to make sure they are not duplicates. Figure 5 shows the histograms of the MSE for seborrheic keratosis and melanoma together with exemplary image pairs. The histograms indicate a wide variation in the images generated by the models.
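The duplicate check described above amounts to a nearest-neighbor search under the MSE. A minimal NumPy sketch, assuming all images share the same shape and value range (the function name is hypothetical):

```python
import numpy as np

def nearest_training_image(generated, training_set):
    """Find the training image closest to a generated image in terms of MSE.

    generated: array of shape (H, W, C).
    training_set: array of shape (N, H, W, C).
    Returns (index of the closest training image, its MSE).
    """
    diffs = training_set.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diffs ** 2, axis=(1, 2, 3))
    idx = int(np.argmin(mse))
    return idx, float(mse[idx])
```

A minimal MSE close to zero would flag a generated image as a near-copy of a training image; collecting the minimal MSE over all generated images gives a histogram like the ones in Figure 5.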

We finally augment the data in the two considered classes by performing horizontal flipping of images such that the class sizes increase to for melanoma and for seborrheic keratosis. We then augment the entire data set using vertical flipping and random cropping to increase the data set further times. The final training data set that is used for the classification model is balanced and contains melanoma cases, nevus cases, and seborrheic keratosis cases, among which a notable fraction, i.e. , constitutes artificially generated data.
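The flipping and cropping operations used here are simple array manipulations. The sketch below shows them in NumPy; the crop size and the random seed are left to the caller, and the function names are chosen for this illustration only.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left to right (reverse the column axis)."""
    return img[:, ::-1]

def vertical_flip(img):
    """Mirror the image top to bottom (reverse the row axis)."""
    return img[::-1, :]

def random_crop(img, size, rng):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]
```

For example, `random_crop(img, 3, np.random.default_rng(0))` returns a 3x3 window of `img`; crops are typically resized back to the network input size before training.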

3.7 Data Generation with coupled DCGANs

As the number of classes in the ISIC 2018 task is larger than in the ISIC 2017 task, using multiple separate DCGANs for the former becomes inefficient. Instead, for ISIC 2018 we coupled seven DCGAN architectures. They share the parametrization of their initial layers with each other, while the final layers are class-specific. Figure 6 shows the idea behind the coupled DCGAN models. The same figure shows exemplary images generated using this approach. The coupled DCGAN models were trained using the Adam optimizer with learning rate of and beta values of and . The latent vector of length that is input to the generator is drawn from a standard Gaussian distribution with mean 0 and standard deviation 1. Binary cross-entropy loss was used to train both the discriminator and the generator. This time we balanced the data online, making sure that in each mini-batch the classes are equally well represented. We additionally used standard online data augmentation.
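Online balancing of mini-batches can be done by drawing the same number of examples from every class for each batch. The following sketch uses only the Python standard library; sampling with replacement, the `per_class` parameter, and the function name are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def balanced_batches(labels, per_class, num_batches, seed=0):
    """Yield lists of example indices with `per_class` examples of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(num_batches):
        batch = []
        for label in sorted(by_class):
            # Sample with replacement so that rare classes can fill their quota.
            batch.extend(rng.choices(by_class[label], k=per_class))
        rng.shuffle(batch)
        yield batch
```

Because rare classes are sampled with replacement, their examples repeat across batches, which is why additional online augmentation is useful alongside this kind of balancing.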

Fig. 6: (A) Coupled DCGAN architecture. BN refers to batch normalization, Conv and TConv refer to convolution and transposed convolution, respectively. (B) First three rows: Generated image of size (we show exemplary images per class; images in the same row are generated from the same random latent vector) for (P): Actinic Keratosis (Q) Basal Cell Carcinoma (R) Benign Keratosis (S) Melanoma (T) Nevus (U) Dermatofibroma (V) Vascular Lesion. Fourth row: Images of real lesion similar (in terms of the MSE) to the generated ones from the third row.

3.8 Classification Network

For the task of lesion classification, we utilized a ResNet-50 [29] architecture pre-trained on the ImageNet data set, with the final fully-connected layer modified to output the probability of the lesion belonging to each of the classes. We furthermore processed all the images in our training data for both the ISIC 2017 and ISIC 2018 tasks to remove occlusions as described in Section 3.3. The pre-processed images were then added to the training data set of the lesion classification model to make it more robust to the presence of occlusions and to prevent overfitting.

In Table 1 we demonstrate the advantage of data purification on ISIC 2018. We report the results obtained with and without performing data purification at testing. Note that ISIC 2018 does not publish labels for the test data, thus for the remaining empirical analysis we use ISIC 2017, which does provide test labels.

Method                                                           Accuracy   Sensitivity   Specificity
Our Classification Model without data purification at testing    0.675      0.561         0.954
Our Classification Model                                         0.717      0.754         0.837
Table 1: The effect of performing data purification at testing on the performance of the Classification Network.
Fig. 7: (Top) ROC curves obtained by traditional baseline and proposed classification model for ISIC 2017 data-set. (Bottom) Confusion matrix obtained by traditional baseline (left) and proposed model (right). M - melanoma, N - nevus, SK - seborrheic keratosis.

For the ISIC 2017 task, we used the main evaluation metrics defined in the ISIC 2017 challenge: the area under the receiver operating characteristic curve (ROC AUC) for melanoma classification, and the mean of the ROC AUC values for melanoma and seborrheic keratosis classification. We compare our performance with a traditional baseline model that does not perform data purification or augmentation and with the winning models of the ISIC 2017 challenge. We obtained ROC AUC of for melanoma classification and the mean performance of . We thus outperform both the baseline, as shown in Figure 7, and the winners of the challenge, who obtained ROC AUC of for melanoma and average ROC AUC of  [2]. Figure 8 highlights visualization results for the classification network trained with purified and augmented data. The model focuses on the actual lesion rather than on hair, ink marks, and other objects, and thus improves over the conventionally-trained model (Figure 1).
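The ROC AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from this definition in NumPy, which is adequate for small test sets; the function names are chosen for this illustration.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve from classifier scores and binary labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def mean_auc(auc_melanoma, auc_seborrheic_keratosis):
    """Mean of the two one-vs-rest AUC values used as the combined metric."""
    return 0.5 * (auc_melanoma + auc_seborrheic_keratosis)
```

For a multi-class model, each AUC is computed one-vs-rest: the score is the predicted probability of the class in question, and the label marks whether the lesion truly belongs to that class.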

Method                      Specificity (one column per sensitivity level)
Top AVG [11]                0.729   0.588   0.366
Top SK [12]                 0.727   0.555   0.404
Top M [30]                  0.747   0.590   0.395
Our Classification Model    0.697   0.648   0.492
Table 2: Specificity values at sensitivity levels of for melanoma classification. Top AVG, Top SK, and Top M denote the winning approaches of the ISIC 2017 challenge.

Fig. 8: (Top) Original images. (Bottom) Visualization results for nevus, melanoma, and seborrheic keratosis, each shown for a True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) case.

In Table 2 we also report the specificity values at different sensitivity levels for melanoma classification. The confusion matrices are shown in Figure 7.
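Specificity at a fixed sensitivity can be read off by lowering the decision threshold until the required fraction of positive cases is detected and then measuring the fraction of negative cases that still fall below the threshold. A minimal NumPy sketch follows; the thresholding convention (predict positive when the score is at least the threshold) and the function name are assumptions.

```python
import numpy as np

def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Specificity at the highest threshold whose sensitivity reaches the target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Try thresholds from the highest observed score downwards.
    for threshold in np.unique(scores)[::-1]:
        sensitivity = np.mean(pos >= threshold)
        if sensitivity >= target_sensitivity:
            return float(np.mean(neg < threshold))
    return 0.0
```

Raising the required sensitivity forces a lower threshold and therefore can only lower the specificity, which matches the decreasing values along each row of Table 2.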

4 Conclusion

The techniques that aim at automating the visual examination of skin lesions, traditionally done by dermatologists, are nowadays dominated by the deep-learning-based methods. These methods are the most accurate and scalable, but they require large training data sets and thus their applicability in dermatology is compromised by the size of the publicly available dermatological data sets, which are often small and contain occlusions. We present a remedy for this problem that relies on careful data purification that removes common occlusions from dermoscopic images and augmentation that uses the modern technique of deep-learning-based data generation to improve data balancedness. We demonstrate the effectiveness of our system on the lesion segmentation and classification tasks.


  • [1] “American cancer society. cancer facts & figures 2017,”
  • [2] N. C. F. Codella, D. Gutman, M. Emre Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. K. Mishra, H. Kittler, and A. Halpern, “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC),” CoRR, vol. abs/1710.05006, 2017.
  • [3] P. Tschandl, C. Rosendahl, and H. Kittler, “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,” Sci. Data, vol. 5, pp. 180161, 2018.
  • [4] C. Barata, J.S. Marques, and T. Mendonça, “Bag-of-Features Classification Model for the Diagnose of Melanoma in Dermoscopy Images Using Color and Texture Descriptors,” in International Conference Image Analysis and Recognition, ICIAR 2013, pp. 547–555.
  • [5] T. Yao, Z. Wang, Z. Xie, J. Gao, and D. Dagan Feng, “A multiview joint sparse representation with discriminative dictionary for melanoma detection,” in Digital Image Computing: Techniques and Applications, DICTA 2016, pp. 1–6.
  • [6] L. Bi, J. Kim, E. Ahn, D. Feng, and M. J. Fulham, “Automatic melanoma detection via multi-scale lesion-biased representation and joint reverse classification,” in IEEE International Symposium on Biomedical Imaging, ISBI 2016.
  • [7] Y. T. Tang, Z. Li, and J. Ming, “An intelligent decision support system for skin cancer detection from dermoscopic images,” in International Conference on Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 2194–2199.
  • [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 – 2324, 1998.
  • [9] N. C. F. Codella, Q. D. Nguyen, S. Pankanti, D. Gutman, B. Helba, A. Halpern, and J. R. Smith, “Deep learning ensembles for melanoma recognition in dermoscopy images,” IBM Journal of Research and Development, vol. 61, no. 4/5, 2017.
  • [10] N. C. F. Codella, J. Cai, M. Abedini, R. Garnavi, A. Halpern, and J. R. Smith, “Deep learning, sparse coding, and SVM for melanoma recognition in dermoscopy images,” in Machine Learning in Medical Imaging, MLMI 2015, pp. 118–126.
  • [11] K. Matsunaga, A. Hamada, A. Minagawa, and H. Koga, “Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble,” CoRR, vol. abs/1703.03108, 2017.
  • [12] I. Gonzalez Diaz, “Dermaknet: Incorporating the knowledge of dermatologists to convolutional neural networks for skin lesion diagnosis,” IEEE Journal of Biomedical and Health Informatics, vol. PP, no. 99, pp. 1–1, 2018.
  • [13] L. Bi, Y. Jung, E. Ahn, A. Kumar, M. J. Fulham, and D. Dagan Feng, “Dermoscopic image segmentation via multistage fully convolutional networks,” IEEE Transactions on Biomedical Engineering, vol. 64, pp. 2065–2074, 2017.
  • [14] X. Zhang, “Melanoma segmentation based on deep learning,” Computer Assisted Surgery, vol. 22, pp. 267–277, 2017.
  • [15] F. Cıcero, A. Oliveira, and G. Botelho, “Deep learning and convolutional neural networks in the aid of the classification of melanoma,” in Conference on Graphics, Patterns and Images, SIBGRAPI 2016.
  • [16] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, pp. 115 – 118, 2017.
  • [17] C. N. Vasconcelos and B. N. Vasconcelos, “Convolutional neural network committees for melanoma classification with classical and expert knowledge based image transforms data augmentation,” CoRR, vol. abs/1702.07025, 2017.
  • [18] M. E. Vestergaard, P. Macaskill, P. E. Holt, and S. W. Menzies, “Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting,” British Journal of Dermatology, vol. 159(3), pp. 669 – 676, 2008.
  • [19] L. Ballerini, R. B. Fisher, B. Aldridge, and J. Rees, A Color and Texture Based Hierarchical K-NN Approach to the Classification of Non-melanoma Skin Lesions, Springer Netherlands, Dordrecht, 2013.
  • [20] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. S. Marcal, and J. Rozeira, “Ph2 - a dermoscopic image database for research and benchmarking,” in Engineering in Medicine and Biology Conference, EMBC 2013, pp. 5437–5440.
  • [21] M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. D. Jackel, U. Muller, and K. Zieba, “Visualbackprop: visualizing cnns for autonomous driving,” CoRR, vol. abs/1611.05418, 2016.
  • [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, MICCAI, 2015, pp. 234–241.
  • [23] G. Liu, F. A. Reda, K. J. Shih, T.C. Wang, A. Tao, and B. Catanzaro, “Image inpainting for irregular holes using partial convolutions,” CoRR, vol. abs/1804.07723, 2018.
  • [24] P. S. Saugeon, J. Guillod, and J. P. Thiran, “Towards a computer-aided diagnosis system for pigmented skin lesions,” Computerized Medical Imaging and Graphics, vol. 27, no. 1, pp. 65 – 78, 2003.
  • [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  • [26] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations, ICLR 2016.
  • [27] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” CoRR, vol. abs/1511.06434, 2015.
  • [28] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” CoRR, vol. abs/1606.03498, 2016.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, June 2016, pp. 770–778.
  • [30] A. Menegola, J. Tavares, M. Fornaciali, L. T. Li, S. E. F. Avila, and E. Valle, “RECOD titans at ISIC challenge 2017,” CoRR, vol. abs/1703.04819, 2017.