Skin Lesion Analysis toward Melanoma Detection:
A Challenge at the International Symposium on Biomedical Imaging
(ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC)
In this article, we describe the design and implementation of a publicly accessible dermatology image analysis benchmark challenge. The goal of the challenge is to support research and development of algorithms for automated diagnosis of melanoma, a lethal form of skin cancer, from dermoscopic images. The challenge was divided into sub-challenges for each task involved in image analysis, including lesion segmentation, dermoscopic feature detection within a lesion, and classification of melanoma. Training data included 900 images. A separate test dataset of 379 images was provided to measure the resulting performance of systems developed with the training data. Ground truth for both training and test sets was generated by a panel of dermoscopic experts. In total, there were 79 submissions from a group of 38 participants, making this the largest standardized and comparative study for melanoma diagnosis in dermoscopic images to date. While the official challenge duration and ranking of participants have concluded, the datasets remain available for further research and development.
Skin cancer is a major public health concern, with over 5 million newly diagnosed cases in the United States each year [1, 2, 3]. Melanoma is one of the most lethal forms of skin cancer, previously responsible for over 9,000 deaths a year in the United States alone, and over 10,000 estimated deaths in 2016.
As melanoma occurs on the skin surface, it is amenable to detection by simple visual examination. Indeed, most melanomas are first recognized by patients, not physicians. However, unaided visual inspection by expert dermatologists is associated with a diagnostic accuracy of about 60%, meaning many potentially curable melanomas are not detected until more advanced stages. To improve diagnostic performance and reduce melanoma deaths, dermoscopy has been introduced: an imaging technique that eliminates the surface reflection of skin, allowing deeper layers to be visually enhanced (Fig. 1). Assuming adequate levels of expertise by the interpreter, dermoscopic imaging has been shown to improve recognition performance over unaided visual inspection by approximately 50%, resulting in absolute accuracy between 75% and 84%, with most of the improvement related to an increase in diagnostic sensitivity [6, 7, 8, 5, 9, 10]. When clinicians lack expertise, however, no improvement is demonstrated. Dermoscopic algorithms, such as “chaos and clues,” “3-point checklist,” “ABCD rule,” “Menzies method,” “7-point checklist,” and “CASH” (color, architecture, symmetry, and homogeneity), were developed to facilitate a novice’s ability to distinguish melanomas from nevi with high diagnostic accuracy [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, recent research suggests that many clinicians rely simply on experience and the “ugly duckling” sign, which refers to an outlier lesion that is unusual in comparison to other lesions seen on the same patient.
As inexpensive consumer dermatoscope attachments for smartphones are beginning to reach the market, the opportunity for automated dermoscopic assessment algorithms to positively influence patient care increases. Given the potential for an influx of images, as well as a growing shortage of dermatologists, automated tools to assist in triage, screening, and evaluation will become essential. As a result, community interest has grown, and many centers have begun their own research efforts on automated analysis [24, 25, 26, 27, 29, 28, 31, 30, 32, 33, 34]. While initial attempts have been made to create public archives of images to support research and development on automated algorithms for dermoscopic image assessment, a large-scale, centralized, coordinated, and comparative effort across institutions has yet to be implemented.
The International Skin Imaging Collaboration (ISIC) is an international effort to improve melanoma diagnosis, which has recently begun efforts to aggregate a publicly accessible dataset of dermoscopy images. This challenge leveraged a database of dermoscopic skin images from the ISIC Data Archive (https://isic-archive.com/), which at the time of this publication contains over 10,000 images collected from leading clinical centers internationally, acquired from a variety of devices used at each center. The images are screened for both privacy and quality assurance. The associated clinical metadata has been vetted by recognized melanoma experts. Broad and international participation in image contribution is intended to ensure that the dataset contains a representative, clinically relevant sample.
The overarching goal of this challenge was to provide a “snapshot” from the ISIC Archive to support development of automated melanoma diagnosis algorithms from dermoscopic images. The challenge was divided into 3 parts corresponding to each stage of lesion analysis: lesion segmentation, lesion dermoscopic feature detection, and lesion classification. In the following sections, the provided datasets and evaluation metrics used for the challenge, the participation rate, and the top achieved performance levels are described.
II Challenge Tasks & Dataset
The challenge consisted of 3 tasks: lesion segmentation, dermoscopic feature detection, and disease classification. The second and third components further consisted of two variants, yielding 5 active task parts that teams could participate in. The following discusses the tasks and the supplied training data for each:
II.I Part 1: Lesion Segmentation Task
Participants were asked to submit automated predictions of lesion segmentations from dermoscopic images in the form of binary masks (Fig. 2). Lesion segmentation training data included the original image, paired with the expert manual tracing of the lesion boundaries in the form of a binary mask, where pixel values of 255 are considered inside the area of the lesion, and pixel values of 0 are outside. 900 images and associated ground truth data were supplied for training. Another set of 379 images was provided as a test set on which to evaluate participants.
II.II Part 2: Dermoscopic Feature Classification Task
Participants were asked to automatically detect two clinically defined dermoscopic features, “globules” and “streaks” [11, 12]. Pattern detection involved both localization and classification (Fig. 3). To reduce the variability and dimensionality of spatial feature annotations, the lesion images were subdivided into superpixels using the SLIC algorithm. Lesion dermoscopic feature data included the original lesion image and a corresponding superpixel mask, paired with superpixel-wise expert annotations of the presence or absence of the “globules” and “streaks” dermoscopic features. 807 images were provided as training data, and 335 were held out as a test dataset.
II.III Part 2B: Dermoscopic Feature Segmentation Task
This part was identical to Part 2, with the exception that predictions were in the form of binary masks for each dermoscopic feature. This additional part was provided to explore and compare a second mechanism of algorithm development and evaluation for the goal of lesion dermoscopic pattern detection.
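The relationship between the superpixel-wise annotations of Part 2 and the binary masks of Part 2B can be illustrated with a minimal sketch. The text does not specify how the superpixel mask was encoded on disk, so the integer index map below, the helper name `superpixel_labels_to_mask`, and the reuse of the 255/0 convention from Part 1 are all assumptions made for illustration.

```python
import numpy as np


def superpixel_labels_to_mask(superpixels, positive_ids):
    """Expand superpixel-wise labels into a binary mask (255 = feature present).

    superpixels  : 2-D integer array, one superpixel index per pixel (assumed encoding)
    positive_ids : indices of the superpixels annotated as containing the feature
    """
    present = np.isin(superpixels, list(positive_ids))
    return np.where(present, 255, 0).astype(np.uint8)


# Toy 4x4 index map with four superpixels (0..3); superpixels 1 and 2
# are hypothetically annotated as containing "globules".
superpixels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
])
globules_mask = superpixel_labels_to_mask(superpixels, positive_ids={1, 2})
```

In this toy case, half of the pixels (those in superpixels 1 and 2) are set to 255 and the rest to 0.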
II.IV Part 3: Disease Classification Task
Participants were asked to classify images as either benign or malignant. Prediction classification scores were normalized to confidence values in the interval from 0.0 (benign) to 1.0 (malignant). Lesion classification data included the original image, paired with both the gold standard (definitive) malignancy diagnosis and the ground truth lesion segmentation. 900 images and associated ground truth data were supplied for training. Another set of 379 images was provided as a test set on which to evaluate participants. Approximately 30.3% of the dataset was malignant (273 images in the training set).
II.V Part 3B: Disease Classification Task with Masks
This part was identical to Part 3, with the exception that participants were additionally supplied the ground truth lesion segmentation mask.
III Evaluation Criteria
III.I Segmentation Tasks
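Submissions were compared using the following common segmentation metrics:

$$
\begin{aligned}
\text{Accuracy (AC)} &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{Sensitivity (SE)} &= \frac{TP}{TP + FN} \\
\text{Specificity (SP)} &= \frac{TN}{TN + FP} \\
\text{Dice coefficient (DI)} &= \frac{2 \cdot TP}{2 \cdot TP + FN + FP} \\
\text{Jaccard index (JA)} &= \frac{TP}{TP + FN + FP}
\end{aligned}
$$

where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively, at the pixel level. Pixel values above 128 were considered positive, and pixel values below were considered negative.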
Participants were ranked according to the Jaccard index.
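The pixel-level evaluation can be sketched as follows. This is not the challenge's evaluation code; the function name `segmentation_metrics` is hypothetical, the standard definitions of the five metrics are used, and a pixel value of exactly 128 is treated as negative, which is an assumption because the text only covers values above and below 128.

```python
import numpy as np


def segmentation_metrics(pred_mask, truth_mask, threshold=128):
    """Pixel-level AC, SE, SP, DI and JA for a pair of 0..255 masks."""
    # Assumption: values strictly above the threshold count as lesion pixels.
    pred = np.asarray(pred_mask) > threshold
    truth = np.asarray(truth_mask) > threshold

    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "AC": ratio(tp + tn, tp + tn + fp + fn),
        "SE": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "DI": ratio(2 * tp, 2 * tp + fp + fn),
        "JA": ratio(tp, tp + fp + fn),
    }


# Toy example: 2x4 ground truth and prediction (TP=3, FN=1, FP=1, TN=3).
truth = np.array([[255, 255, 0, 0],
                  [255, 255, 0, 0]], dtype=np.uint8)
pred = np.array([[255, 0, 0, 0],
                 [255, 255, 255, 0]], dtype=np.uint8)
scores = segmentation_metrics(pred, truth)
```

For this toy pair the Jaccard index is 3/5, while accuracy, sensitivity, specificity, and Dice are all 3/4.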
III.II Classification Tasks
Submissions were compared using the common classification metrics of accuracy, sensitivity, and specificity, as defined in the previous section. However, metrics were measured at the whole-image level, rather than the pixel level. Additionally, the area under the receiver operating characteristic (ROC) curve (AUC), as well as the specificity at thresholds yielding 95% and 98% sensitivity, were measured. Finally, participants were ranked according to average precision, evaluated over the sensitivity range of 0-100%, which is a common measure of performance in the computer vision community.
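Specificity at a fixed sensitivity can be sketched as below. The text does not state how the operating threshold was selected, so this example assumes the most favorable threshold among those that reach the target sensitivity; the function name `specificity_at_sensitivity` and the toy scores are hypothetical.

```python
def specificity_at_sensitivity(labels, scores, target_sensitivity):
    """Best specificity among thresholds whose sensitivity meets the target.

    labels : 1 for malignant, 0 for benign (both classes must be present)
    scores : confidence of malignancy in [0.0, 1.0], one per image
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best = 0.0
    # Try each observed score as the decision threshold (score >= threshold => malignant).
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        if tp / positives >= target_sensitivity:
            best = max(best, tn / negatives)
    return best


# Toy data: four malignant and four benign images.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
sp95 = specificity_at_sensitivity(labels, scores, 0.95)
sp75 = specificity_at_sensitivity(labels, scores, 0.75)
```

With only four malignant cases, reaching 95% sensitivity requires detecting all of them, which forces the threshold down to 0.3 and leaves two of the four benign images correctly rejected.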
Area Under Curve:
Area under the ROC curve was computed by taking the integral of the true positive rate with respect to the false positive rate:

$$
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d\,\mathrm{FPR}
$$

where the true positive rate (TPR) is a function of the false positive rate (FPR) along the curve. The “scikit-learn” Python package was used for AUC computation.
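As an illustration of this integral, the following dependency-free sketch builds the ROC curve by sweeping the decision threshold and integrates it with the trapezoidal rule. It is not the scikit-learn routine used by the challenge; the function name `roc_auc` and the toy data are assumptions.

```python
def roc_auc(labels, scores):
    """Trapezoidal area under the ROC curve (labels: 1 = malignant, 0 = benign)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives

    # Sweep the threshold from the highest score downward,
    # adding one ROC point per distinct score value.
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j

    # Integrate TPR with respect to FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

For this toy ranking, 14 of the 16 malignant/benign pairs are ordered correctly, so the area is 0.875.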
Assuming dermoscopic images in a dataset are ranked according to normalized machine confidence of melanoma, where the most confident image is at index k=1, and k=n is the maximum rank that contains all positively labeled instances, the average precision corresponds to the integral under the precision-recall curve within this interval:

$$
\mathrm{AP} = \frac{\sum_{k=1}^{n} P(k) \cdot \mathrm{pos}(k)}{\sum_{k=1}^{n} \mathrm{pos}(k)}
$$

where k is an index in the ranked list for evaluation, pos(k) is a function that returns 1 if image k is a diseased lesion (or 0 otherwise), and P(k) is the precision evaluated at index k, where precision is defined as the following:

$$
P(k) = \frac{TP(k)}{TP(k) + FP(k)}
$$

The true positive and false positive counts at k, TP(k) and FP(k), are computed by using the machine confidence assigned to the image indexed by k as the binary threshold between positively and negatively labeled instances. The “scikit-learn” Python package was used for computation of average precision.
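The ranked-list description above can be sketched in a few lines. This is a dependency-free illustration rather than the scikit-learn routine used by the challenge; it assumes no tied confidence scores and at least one malignant image, and the function name `average_precision` is hypothetical.

```python
def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive image."""
    # Rank images from most to least confident (index k = 1 is the most confident).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, label in ranked if label == 1)

    hits = 0  # true positives among the top-k images
    precision_sum = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:  # pos(k) = 1
            hits += 1
            precision_sum += hits / k  # P(k) = TP(k) / (TP(k) + FP(k))
    return precision_sum / total_positives


labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
ap = average_precision(labels, scores)
```

Here the malignant images sit at ranks 1, 2, 3, and 6, giving precisions of 1, 1, 1, and 4/6, so the average precision is their mean.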
IV Hosting Platform
The training and test datasets were hosted on the Covalic grand challenge platform, developed at Kitware, Inc. (http://isic-challenge.net/), which enabled real-time evaluation of submissions according to defined criteria, automatic ranking of participants based on their submissions, and automatic feedback to participants detailing whether their submissions were properly parsed. Data will continue to be available at this site for the foreseeable future.
In total, there were 79 submissions from a group of 38 participants (consisting of both individuals and teams). By task, there were 24 submissions to Part 1, 4 submissions to Part 2, 4 submissions to Part 2B, 25 submissions to Part 3, and 18 submissions to Part 3B.
Top evaluation results are shown in Tables I & II. Metrics include accuracy (AC), sensitivity (SE), specificity (SP), average precision (AP), area under curve (AUC), specificity at two levels of sensitivity (SP95 & SP98, at 95% and 98% sensitivity, respectively), Dice (DI), and Jaccard (JA). A full listing of results, as well as participants, is available from the challenge website.
The results from this public challenge suggest a number of important findings: 1) Performance levels of segmentation methods currently developed appear to be within the range where they would provide utility for annotation of additional data. Further analysis measuring the inter-observer and intra-observer variability of clinical experts will be carried out before a conclusion can be made on whether the automated techniques are statistically equivalent to expert annotation. 2) Results from dermoscopic feature detection appear promising, though further improvements must be made. 3) Disease recognition performance is within the range of that reported in prior literature for expert dermatologists. However, further analysis will be done to directly compare automated results on the test dataset to expert blinded dermatologists.
Here, we present the design and implementation of a successful public challenge hosted at the 2016 International Symposium on Biomedical Imaging, intended to support the community in the development of automated algorithms for the diagnosis of melanoma from dermoscopic images. A wide variety of independent approaches were submitted and evaluated, yielding the largest standardized and comparative study in this field and topic to date.
The authors would like to thank colleagues at Memorial Sloan-Kettering Cancer Center for their work in establishing the ISIC Archive: Ashfaq Marghoob, and Steven Dusza. Additionally, thanks go to co-workers at IBM Research for their support, guidance, and insightful discussions: John R. Smith, Sharath Pankanti, Quoc-Bao Nguyen, and Vaibhava Goel. Finally, thanks go to colleagues at Kitware: Steven Aylward, for his assistance while organizing the challenge, and Max Smolens for technical assistance with the Covalic platform.
-  Rogers, H.W., Weinstock, M.A., Feldman, S.R., Coldiron, B.M.: “Incidence estimate of nonmelanoma skin cancer (keratinocyte carcinomas) in the US population, 2012”. JAMA Dermatol 2015; 151(10):1081-1086.
-  “Cancer Facts & Figures 2014”. American Cancer Society, 2014.
-  Siegel, R.L., Miller, K.D., and Jemal, A.: “Cancer statistics, 2016,” CA: A Cancer Journal for Clinicians, vol. 66, no. 1, pp. 7-30, 2016.
-  Brady, M.S., Oliveria, S.A., Christos, P.J., Berwick, M., Coit, D.G., Katz, J., Halpern, A.C.: “Patterns of detection in patients with cutaneous melanoma.” Cancer. 2000 Jul 15;89(2):342-7.
-  Kittler, H., Pehamberger, H., Wolff, K., Binder, M.: “Diagnostic accuracy of dermoscopy”. In: The Lancet Oncology. vol. 3, no. 3, pp. 159-165. 2002.
-  Vestergaard, M.E., Macaskill, P., Holt, P.E., et al.: “Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting.” Br J Dermatol. Sep 2008;159:669-676.
-  Bafounta, M.L., Beauchet, A., Aegerter, P., et al.: “Is dermoscopy (epiluminescence microscopy) useful for the diagnosis of melanoma? Results of a meta-analysis using techniques adapted to the evaluation of diagnostic tests.” Arch Dermatol. Oct 2001;137:1343-1350.
-  Abder-Rahman, A.A., Deserno, T.M.: “A systematic review of automated melanoma detection in dermatoscopic images and its ground truth data”. In: Proc. SPIE 8318, Medical Imaging 2012: Image Perception, Observer Performance, and Technology Assessment.
-  Carli, P., Quercioli, E., Sestini, S., Stante, M., Ricci, L., Brunasso, G., De Giorgi, V.: “Pattern analysis, not simplified algorithms, is the most reliable method for teaching dermoscopy for melanoma diagnosis to residents in dermatology”. In: Br J Dermatol. vol. 148, no. 5, pp. 981-4. 2003.
-  Ganster, H., Pinz, A., Röhrer, R., Wildling, E., Binder, M., Kittler, H.: “Automated Melanoma Recognition”. In: IEEE Transactions on Medical Imaging, vol. 20, no. 3, 2001.
-  Braun, R.P., Rabinovitz, H.S., Oliviero, M., Kopf, A.W., Saurat, J.H.: “Dermoscopy of pigmented skin lesions.”. In: J Am Acad Dermatol. vol. 52, no. 1, pp. 109-21. 2005.
-  Rezze, G.G., Soares de Sá, B.C., Neves, R.I.: “Dermoscopy: the pattern analysis”. In: An Bras Dermatol., vol. 3, pp. 261-8. 2006.
-  Argenziano, G. et al.: “Dermoscopy Improves Accuracy of Primary Care Physicians to Triage Lesions Suggestive of Skin Cancer” J. of. Clinical Oncology. vol. 24, no 12, 2006.
-  Argenziano, G. et al.: “Dermoscopy of pigmented skin lesions: Results of a consensus meeting via the Internet” J. American Academy of Dermatology. vol. 48 (5), 2003.
-  Argenziano, G. Fabbrocini, G., Carli, P., De Giorgi, V., Sammarco, E., Delfino, M.: “Epiluminescence microscopy for the diagnosis of doubtful melanocytic skin lesions: comparison of the ABCD rule of dermatoscopy and a new 7-point checklist based on pattern analysis.” Arch Dermatol. 1998;134(12):1563-1570
-  Stolz, W., Riemann, A., Cognetta, A.B., et al.: “ABCD rule of dermoscopy: a new practical method for early recognition of malignant melanoma.” Eur J Dermatol. 1994;4(7):521-527.
-  Soyer, H.P., Argenziano, G., Zalaudek, I., et al.: “Three-point checklist of dermoscopy: a new screening method for early detection of melanoma.” Dermatology. 2004;208(1):27-31.
-  Henning, J.S., Dusza, S.W., Wang, S.Q., et al.: “The CASH (color, architecture, symmetry, and homogeneity) algorithm for dermoscopy.” J Am Acad Dermatol. 2007;56(1):45-52.
-  Menzies, S.W., Ingvar, C., Crotty, K.A., McCarthy, W.H.: “Frequency and morphologic characteristics of invasive melanomas lacking specific surface microscopic features.” Arch Dermatol. 1996;132(10):1178-1182.
-  Rosendahl, C., Cameron, A., McColl, I., Wilkinson, D.: “Dermatoscopy in routine practice: ‘chaos and clues’.” Aust Fam Physician. 2012;41(7):482-487.
-  Gachon, J., et. al.: “First Prospective Study of the Recognition Process of Melanoma in Dermatological Practice”. In: Arch Dermatol. vol. 141, no. 4, pp. 434-438, 2005.
-  MoleScope. https://molescope.com/
-  Kimball, A.B., Resneck, J.S. Jr.: “The US dermatology workforce: a specialty remains in shortage.” J Am Acad Dermatol. 2008 Nov;59(5):741-5.
-  Barata, C., Ruela, M., et al.: “Two Systems for the Detection of Melanomas in Dermoscopy Images using Texture and Color Features”. In: IEEE Systems Journal, no. 99, pp. 1-15, 2013.
-  Codella, N., et. al.: “Deep Learning, Sparse Coding, and SVM for Melanoma Recognition in Dermoscopy Images” in 6th International Workshop of Machine Learning in Medical Imaging (MLMI) 2015, Published in Lecture Notes in Computer Science (LNCS) 9352, pp 118-126, 2015.
-  Colot, O., Devinoy, R., Sombo, A., de Brucq, D.: “A Colour Image Processing Method for Melanoma Detection”. In: Medical Image Computing and Computer-Assisted Intervention. pp. 562-569. 1998.
-  Madooei, A., Drew, M.S., Sadeghi, M., Stella Atkins, M.: “Automatic Detection of Blue-White Veil by Discrete Colour Matching in Dermoscopy Images”. In: Medical Image Computing and Computer-Assisted Intervention. pp. 453-460. 2013.
-  Abedini, M., Codella, N.C.F., Connell, J.H., Garnavi, R., Merler, M., Pankanti, S., Smith, J.R., Syeda-Mahmood, T.: “A generalized framework for medical image classification and recognition”. In: IBM Journal of Research and Development, vol. 59, no. 2/3. 2015.
-  Celebi, M.E., Iyatomi, H., Stoecker, W.V., Moss, R.H., Rabinovitz, H.S., Argenziano, G., Soyer, H.P.: “Automatic detection of blue-white veil and related structures in dermoscopy images.” In: Comput Med Imaging Graph., vol. 32, no. 8, pp. 670-7, 2008
-  Celebi, E., Schaefer, G., Iyatomi, H., Stoecker, W.V.: “Lesion Border Detection in Dermoscopy Images.” Comput Med Imaging Graph. 2009 Mar; 33(2): 148-153.
-  Di Leo, G., Paolillo, A., Sommella, P., Fabbrocini, G. Rescigno, O: “A software tool for the diagnosis of melanomas,” Instrumentation and Measurement Technology Conference (I2MTC), 2010 IEEE, Austin, TX, 2010, pp. 886-891.
-  Garnavi, R., Aldeen, M., Bailey, J.: “Computer-aided diagnosis of melanoma using border and wavelet-based texture analysis”. In: IEEE Trans. Inf. Technol. Biomed., vol. 16, no. 6, pp. 1239-52. 2012.
-  Mishra, N.K., Celebi, E.: “An Overview of Melanoma Detection in Dermoscopy Images Using Image Processing and Machine Learning” arXiv:1601.07843. Available: http://arxiv.org/abs/1601.07843
-  Ali, A.A., Deserno, T.M.: “A Systematic Review of Automated Melanoma Detection in Dermatoscopic Images and its Ground Truth Data” Proc. of SPIE Vol. 8318 83181I-1
-  Mendonca, T., Ferreira, P.M., Marques, J.S., Marcal, A.R., Rozeira, J.: “PH2 - a dermoscopic image database for research and benchmarking”. In: Conf Proc IEEE Eng Med Biol Soc. pp. 5437-40, 2013.
-  International Skin Imaging Collaboration Website. Available: http://www.isdis.net/index.php/isic-project
-  Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: “SLIC Superpixels”, EPFL Technical Report 149300, June 2010.