Automating Vitiligo Skin Lesion Segmentation Using Convolutional Neural Networks

For several skin conditions such as vitiligo, accurate segmentation of lesions from skin images is the primary measure of disease progression and severity. Existing methods for vitiligo lesion segmentation require manual intervention. Unfortunately, manual segmentation is time- and labor-intensive, and is not reproducible between physicians. We introduce a convolutional neural network (CNN) that quickly and robustly performs vitiligo skin lesion segmentation. Our CNN has a U-Net architecture with a modified contracting path. We use the CNN to generate an initial segmentation of the lesion, then refine it by running the watershed algorithm on high-confidence pixels. We train the network on 247 images with a variety of lesion sizes, complexities, and anatomical sites. The network with our modifications noticeably outperforms the state-of-the-art U-Net, with a Jaccard Index (JI) score of 73.6% (compared to 36.7%). Moreover, our method requires only a few seconds per image, in contrast with the previously proposed semi-autonomous watershed approach, which requires 2-29 minutes per image.


Makena Low and Priyanka Raina, Stanford University


Keywords: image segmentation, neural network, vitiligo, lesions, U-Net, watershed

1 Introduction

Vitiligo is a skin condition in which patches of skin lose their pigmentation, as shown in Fig. 1. It affects 0.5-2% of the population, can develop in anyone, and though not physically painful, can harm patients psychologically, socially, and professionally [5][1][13]. The body surface area (BSA) affected by vitiligo is the main measure of the condition's severity and progression. BSA measurements must be consistent for proper clinical care, translational research efforts, and assessment of treatment efficacy. For instance, a physician's visual estimate of the percentage of vitiligo-affected BSA informs the Vitiligo Area Scoring Index (VASI) and Vitiligo European Task Force (VETF) metrics. Both measures can only detect large changes in lesion area, the smallest detectable change being between 7.1% and 10.4% of total BSA [8]. Current segmentation practices are mainly manual. This approach is detrimental to accurate and reproducible readings, and it is also time-inefficient and labor-intensive. Moreover, non-dermatologists often perform these segmentations, even though they lack a rigorous background for such reviews [10]. This study introduces a novel solution to this issue.

Figure 1: Examples of vitiligo lesions with different sizes, complexity, and anatomical sites.

A convolutional neural network (CNN) is a promising approach for solving complex skin segmentation challenges. CNNs for skin cancer segmentation are already in widespread use, in large part due to the International Skin Imaging Collaboration (ISIC) Skin Lesion Analysis Towards Melanoma Detection competition [3]. However, vitiligo is seldom the subject of such segmentation studies. One study that uses CNNs for vitiligo segmentation is very data-intensive: it presents a model trained on about 40,000 images, a dataset much larger than ours and most medical datasets [9].

Figure 2: Illustration of watershed algorithm with manual seeding (left), the resulting contour (middle), and segmented output (right).

Researchers have also explored techniques that are less computationally intensive than CNNs. One study quantified treatment efficacy using a computerized digital imaging analysis system (C-DIAS) [14]. Sheth et al. leveraged standard color image processing techniques to create an automatic vitiligo segmentation program; however, this approach does not perform well on large surface areas [15]. To address this, Raina et al. created a graphical user interface (GUI) with a semi-autonomous version of the watershed algorithm for lesion segmentation [10][11]. The tool succeeds in outputting fine contours for full-body images, but it requires "seeds" from the user to define the background (environment and healthy skin) and foreground (affected skin), shown in red and green in Fig. 2 (left). This semi-manual segmentation process requires significant work when complex lesions are involved, as shown in Fig. 1 (right). Our work addresses these shortcomings.

We introduce a CNN that achieves a high Jaccard Index score (intersection over union) of 73.6% with 247 training images. Our models are based on the end-to-end U-Net [12] architecture. We substitute the contracting path with a popular semantic segmentation CNN that serves as a feature extractor. Our work investigates VGG16, ResNet50, InceptionV3, InceptionResNetV2, and SENet154 as contracting path enhancers [16][6][19][18][7]. We also experiment with watershed-based post-processing: after classification, the high-confidence pixels are fed as seeds to the watershed algorithm [11]. We find that an InceptionResNetV2 contracting path performs the best out of all the architectures we explored. Our method drastically reduces segmentation time compared to the watershed GUI and produces reproducible output.

2 Methods

2.1 Vitiligo Image Samples and Annotation

Our dataset consists of 308 red/green/blue (RGB) images of vitiligo lesions compiled by the UC Davis Medical Center. The lesions range widely in skin tone and anatomical location. Physicians took the images from several angles, at different levels of brightness, and in either ultraviolet (UV) or natural lighting. We derive the ground truth segmentation output from the semi-autonomous watershed GUI and manual edits. Each ground truth output image is a binary mask of the lesion, where zero (black) represents healthy skin or the environment, and 255 (white) represents vitiligo. The dataset is split such that 60% is for training the model (188 images), 20% is for validating the model (66 images), and 20% is for testing the model on unseen data (61 images).
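The 60/20/20 split described above can be sketched as a seeded shuffle followed by slicing. This is a minimal illustration, not the authors' actual pipeline; the filenames are hypothetical, and the exact per-split counts depend on rounding:

```python
import random

def split_dataset(paths, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle file paths deterministically, then slice into train/val/test."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]  # remainder, roughly the final 20%
    return train, val, test

# Hypothetical filenames standing in for the 308-image dataset.
images = [f"lesion_{i:03d}.png" for i in range(308)]
train, val, test = split_dataset(images)
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when the validation set drives hyperparameter choices.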

Figure 3: U-Net with a ResNet50 contracting path.

2.2 Evaluation Metric

We evaluate the network's performance using the pixel-wise Intersection over Union (IoU) metric:

IoU(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN),

where A is the predicted mask, B is the ground truth mask, and TP, FP, and FN are the counts of true positive, false positive, and false negative pixels.
This metric is also known as the Jaccard Index (JI) [3]. For each image, we calculate the JI between every classified pixel and the corresponding ground truth pixel. The JI of the image is the average of the pixel-wise JI scores. Previous analysis suggests that the JI is too optimistic because it does not account for the labor required to correct an inaccurate segmentation [4]. Thus, we also compute a thresholded Jaccard Index to account for segmentations that do not fall within professional inter-observer variability. If the average JI for an image is less than 65%, we set that image's score to 0%; otherwise, the JI is unchanged. The threshold of 65% was determined by ISIC [3]. Although ISIC focuses on melanoma segmentation, the human labor required for a similar evaluation with vitiligo was not feasible for us; we take the ISIC threshold to be a fair estimate. The evaluation metric for our networks is the average of the thresholded JI scores over the images in the validation set.
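As a minimal NumPy sketch (using the standard mask-level formulation of the JI, rather than the per-pixel averaging described above), the metric and its 65% cutoff might be computed as:

```python
import numpy as np

def jaccard_index(pred, truth):
    """Intersection over union for binary masks (values 0 or 1)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:                 # both masks empty: define JI as 1
        return 1.0
    inter = np.logical_and(pred, truth).sum()
    return inter / union

def thresholded_ji(pred, truth, threshold=0.65):
    """Score drops to 0 when the JI falls below the 65% ISIC threshold."""
    ji = jaccard_index(pred, truth)
    return ji if ji >= threshold else 0.0
```

The thresholded variant is deliberately harsh: an 80% and a 64% segmentation differ by one correction pass in practice, and the 64% one counts for nothing.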

2.3 Image Pre-processing

We perform simple pre-processing on images before feeding them into our network. We subtract the mean from each image channel and normalize each channel to standardize the pixel scale, and we re-scale every image to a fixed input resolution. We apply data augmentation during training (after pre-processing). Augmentation includes rotations from 0 to 180 degrees, horizontal and vertical shifts of up to 0.05 of the image size, and horizontal and vertical flips. Because the distance between camera and lesion varies, we set the zoom range to 0.8 to 1.2 times the original image. Because brightness and lighting conditions also vary, brightness augmentation ranges from 0.7 to 1.3 times that of the original image.
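A minimal NumPy sketch of this stage, under two assumptions: that "standardize the pixel scale" means per-channel zero mean and unit standard deviation, and showing only the flip and brightness augmentations (rotation, shift, and zoom would follow the same pattern with an image library):

```python
import numpy as np

def standardize(img):
    """Per-channel mean subtraction and division by the standard deviation."""
    img = img.astype(np.float32)
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-7  # avoid divide-by-zero
    return (img - mean) / std

def augment(img, rng):
    """Random flips plus brightness scaling in [0.7, 1.3], as described above."""
    if rng.random() < 0.5:
        img = img[:, ::-1]          # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]          # vertical flip
    return img * rng.uniform(0.7, 1.3)
```

In a Keras pipeline the same parameters map naturally onto `ImageDataGenerator` arguments (`rotation_range`, `width_shift_range`, `zoom_range`, `brightness_range`, flip flags).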

2.4 U-Net Network Experiments

Our baseline is an unmodified U-Net with 512 hidden units at the bottleneck and no pre-trained ImageNet weights [12]. The final activation is a softmax layer. After 100 epochs, the baseline's JI score is 36.7%. We experiment with using popular semantic segmentation networks, such as VGG16 and ResNet50, as modified contracting paths in our U-Net. Fig. 3 illustrates our U-Net architecture with a ResNet50 contracting path. We use an API based on the Keras and TensorFlow frameworks to create the test architectures listed in Table 1. For fast comparison, each modified U-Net is trained for only 30 epochs before evaluation. Table 1 shows the results of each model.

Contracting Path Epochs Val JI Train JI
Unmodified 100 36.8% 44.7%
VGG16 30 61.2% 63.7%
ResNet50 30 64.2% 68.2%
InceptionV3 30 61.5% 63.9%
InceptionResNetV2 30 70.9% 67.0%
SENet154 30 61.3% 66.7%
Table 1: JI scores of U-Net architectures.
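Independent of which backbone forms the contracting path, every expanding-path step in a U-Net does the same thing: upsample the deeper feature map and concatenate the matching encoder features through a skip connection. A minimal NumPy sketch of that merge, with illustrative feature-map shapes rather than the network's actual ones:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(deep, skip):
    """One expanding-path step: upsample the deeper features and
    concatenate the encoder's skip connection along the channel axis."""
    up = upsample2x(deep)
    assert up.shape[:2] == skip.shape[:2], "spatial sizes must match"
    return np.concatenate([up, skip], axis=-1)

deep = np.zeros((16, 16, 256))     # bottleneck-side features (illustrative)
skip = np.zeros((32, 32, 128))     # matching encoder features (illustrative)
merged = decoder_step(deep, skip)
```

Because the decoder only needs encoder feature maps at matching spatial resolutions, swapping VGG16 for ResNet50 or InceptionResNetV2 amounts to choosing which intermediate layers to tap for the skip connections.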

2.5 Hyperparameter Tuning

Since there are benefits to multiple methods of hyperparameter tuning, we use a three-pronged approach to find optimal hyperparameters. (1) For initial exploration, we iterate with random search, leveraging its strength in not fixating on local minima while efficiently exploring the hyperparameter search space [2]. (2) Once coarse tuning identifies promising ranges, we manually narrow our search space for fine-tuning. (3) Finally, we employ sequential model-based optimization (SMBO), which proposes future hyperparameters based on promising past ones and reduces the computational expense and iterations needed compared to random search [17]. Table 2 lists the hyperparameters chosen by this optimization.
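Stage (1) can be sketched as follows; the search-space bounds here are illustrative assumptions, not the paper's actual ranges, and the objective stands in for a short training run scored by validation JI:

```python
import math
import random

def sample_config(rng):
    """Draw one hyperparameter configuration from the search space."""
    return {
        "lr": 10 ** rng.uniform(-5, -2),            # log-uniform learning rate
        "dropout": rng.uniform(0.0, 0.5),
        "weight_decay": 10 ** rng.uniform(-6, -3),
    }

def random_search(objective, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, -math.inf
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = objective(cfg)   # e.g. validation JI after a short training run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Sampling the learning rate log-uniformly is the usual choice, since useful values span several orders of magnitude; SMBO would replace the independent draws with a model fit to past (config, score) pairs.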

Figure 4: Original image (left), ground truth overlay (middle left), prediction overlay (middle right), ground truth overlay with prediction (red is true positive and pink is false positive) (right).
Hyperparameter Value
LR 0.000336375
Optimizer Nadam
Contracting Normalization Batch
Contracting Hidden Units [512,256,128,64,32]
Freeze Weights False
Contracting Activation ELU
Weight Decay 0.000158
Dropout 0.0136
LR Decay 8.806E-05
Epochs 165
Batch Size 8
Table 2: Tuned hyperparameters for U-Net with InceptionResNetV2 contracting path.

2.6 Combining Datasets and Post-Processing

We combine the training and validation sets (247 images in total) to train our network before evaluating it on the test set. We also experiment with watershed-based post-processing, which feeds high-confidence classifications as seeds into the watershed algorithm. High-confidence pixels are those whose predicted value lies within 30% of either extreme: negative for vitiligo (0-77) or positive (179-255).
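A sketch of this seeding scheme, using scikit-image's seeded watershed as a stand-in for the watershed implementation in [11]; the thresholds come from the text, while the Sobel-gradient elevation map is our assumption:

```python
import numpy as np
from skimage.filters import sobel
from skimage.segmentation import watershed

def refine_with_watershed(prob):
    """Seed the watershed with high-confidence pixels of a 0-255 CNN
    probability map: <=77 seeds background, >=179 seeds lesion."""
    markers = np.zeros(prob.shape, dtype=np.int32)
    markers[prob <= 77] = 1    # confident background / healthy skin
    markers[prob >= 179] = 2   # confident vitiligo
    elevation = sobel(prob.astype(float))   # flood along low-gradient regions
    labels = watershed(elevation, markers)  # unlabeled (0) pixels get filled in
    return (labels == 2).astype(np.uint8) * 255  # binary lesion mask
```

Only the ambiguous middle band (78-178) is left for the watershed to resolve, so the refinement can move the lesion boundary but not overturn confident predictions.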

3 Results and Discussion

InceptionResNetV2 is the best-performing contracting path, achieving a JI of 74.1% and a thresholded JI of 58.0% before hyperparameter tuning. The runtime is 97 minutes for 100 epochs on a single NVIDIA Tesla K80 GPU. SENet154 also appears to be a strong candidate; however, because its high performance came at the expense of increased training time, we did not explore it further in our study. After hyperparameter tuning, the JI score is 81.5% and the thresholded JI is 62.8%. With watershed post-processing, the count of images below the threshold falls from 16 to 14. After training on the combined dataset, the InceptionResNetV2-based U-Net achieves a JI of 73.6% and a thresholded JI of 61.9%. Fig. 4 shows an example of the output. Though it is counterintuitive that performance decreases with the larger dataset, we believe this result may be due to the variability inherent in our small test set of only 61 images. The total training runtime for our final network is 108 minutes for about 200 epochs.

Table 3: Segmentation by our method and by three persons using the semi-autonomous watershed GUI, compared to the original image (side-by-side segmentation images for simple, moderate, and complex lesions; not reproduced here).
Lesion                Simple    Moderate   Complex
Our Method   JI (%)   88.7%     86.1%      74%
             Time     <1s       <1s        <1s
Person 1     JI (%)   94.3%     92.1%      83.3%
             Time     4m 55s    9m 4s      28m 46s
Person 2     JI (%)   96.8%     95.3%      81.9%
             Time     3m 44s    6m 57s     20m 39s
Person 3     JI (%)   88.0%     85.8%      75.6%
             Time     1m 53s    4m 31s     17m 24s
Table 4: Segmentation scores (JI) and times for lesions of varying complexity, for three persons using the semi-autonomous watershed GUI compared with our method.
Table 5: Segmentation when constrained to 10 minutes (segmentation images for Persons 1-3; not reproduced here).
Person 1 Person 2 Person 3
Accuracy (%) 75.9% 84.7% 77.6%
Table 6: Variability in segmentation accuracy for the “complex” rated lesion, with time held constant at 10 minutes.

We conduct an error analysis on the 16 validation images that scored below the JI threshold. By inspection, we believe that eight of the images have errors primarily due to ground-truth labeling limitations. The semi-autonomous watershed tool is limited in its ability to identify small lesions because of the coarseness of the seeds. Manual labeling addresses some of these smaller lesions. However, there are cases in which the gradient between healthy skin and vitiligo leaves ambiguity in the classification.

Moreover, pixels classified with less than full confidence receive a lower JI score due to the way the JI is calculated. For instance, a classification of 0.7 results in a lower JI than a classification of 1, even if both are correct in being reasonably confident that the pixel is positive for vitiligo. Still, even with labeling errors and a lower JI, the predictions visually capture complex target regions on any skin surface and skin tone. The strong visual predictions suggest that the proposed architecture is a solid foundation for future work in automating vitiligo lesion segmentation. Moreover, each segmentation took only a few seconds per image, instead of minutes with the semi-autonomous watershed.

We also perform a case study to quantify the correlation between lesion complexity and the time needed to segment the lesion with the watershed GUI. We asked three non-dermatologist reviewers to semi-manually segment the lesions using the watershed GUI. They were allowed to gain familiarity with the GUI on practice lesions before being timed. We asked the reviewers to continue contouring until they would be comfortable with their segmentation being used in a clinical setting. As expected, segmentation time increased with lesion complexity. Results are shown in Table 3 and Table 4. Table 4 shows that our method requires less than a second per image, in contrast with watershed, which requires 2-29 minutes per image. We performed a similar case study to elucidate the variability in segmentation between reviewers. After 10 minutes of segmenting the "complex"-rated lesion, the reviewers were asked to pause so that we could save their progress at that moment. This study shows that segmentation accuracy indeed varies widely, with almost a 10% difference between reviewers, as shown in Table 6. Table 5 visually demonstrates the variability between reviewers. Our method removes this variability.

4 Conclusion

We demonstrate that a U-Net with an InceptionResNetV2-based contracting path, combined with watershed post-processing, is promising for vitiligo segmentation. We quantify the variability possible between reviewers, as well as the time required to segment increasingly complex lesions. Our method eliminates both the variability and the long segmentation times, while providing predictions that require little manual re-editing. The authors declare no conflicts of interest.


  1. A. A. Amer and X. Gao (2016) Quality of life in patients with vitiligo: an analysis of the Dermatology Life Quality Index outcome over the past two decades. International Journal of Dermatology 55 (6), pp. 608–614.
  2. J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, pp. 281–305.
  3. N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra and H. Kittler (2018) Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 168–172.
  4. N. C. Codella, Q. Nguyen, S. Pankanti, D. Gutman, B. Helba, A. Halpern and J. R. Smith (2017) Deep learning ensembles for melanoma recognition in dermoscopy images. IBM Journal of Research and Development 61 (4/5).
  5. Z. Hazel-Jemmott (2016) Vitiligo: causes, myths, and facts.
  6. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  7. J. Hu, L. Shen and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  8. L. Komen, V. Da Graca, A. Wolkerstorfer, M. de Rie, C. Terwee and J. van der Veen (2015) Vitiligo Area Scoring Index and Vitiligo European Task Force assessment: reliable and responsive instruments to measure the degree of depigmentation in vitiligo. British Journal of Dermatology 172 (2), pp. 437–443.
  9. J. Liu, J. Yan, J. Chen, G. Sun and W. Luo (2019) Classification of vitiligo based on convolutional neural network. In International Conference on Artificial Intelligence and Security, pp. 214–223.
  10. P. Raina (2018) Energy-efficient circuits and systems for computational imaging and vision on mobile devices. Ph.D. thesis, Massachusetts Institute of Technology.
  11. J. B. Roerdink and A. Meijster (2000) The watershed transform: definitions, algorithms and parallelization strategies. Fundamenta Informaticae 41 (1–2), pp. 187–228.
  12. O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  13. C. Salzes, S. Abadie, J. Seneschal, M. Whitton, J. Meurant, T. Jouary, F. Ballanger, F. Boralevi, A. Taieb and C. Taieb (2016) The Vitiligo Impact Patient Scale (VIPS): development and validation of a vitiligo burden assessment tool. Journal of Investigative Dermatology 136 (1), pp. 52–58.
  14. N. Shamsudin, S. H. Hussein, H. Nugroho and M. H. Ahmad Fadzil (2015) Objective assessment of vitiligo with a computerised digital imaging analysis system. Australasian Journal of Dermatology 56 (4), pp. 285–289.
  15. V. M. Sheth, R. Rithe, A. G. Pandya and A. Chandrakasan (2015) A pilot study to determine vitiligo target size using a computer-based image analysis program. Journal of the American Academy of Dermatology 73 (2), pp. 342–345.
  16. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  17. J. Snoek, H. Larochelle and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.
  18. C. Szegedy, S. Ioffe, V. Vanhoucke and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  19. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.