On the Evaluation and Real-World Usage Scenarios of Deep Vessel Segmentation for Funduscopy
We identify and address three research gaps in the field of vessel segmentation for funduscopy. The first focuses on the task of inference on high-resolution fundus images, for which only a limited set of ground-truth data is publicly available. Notably, we highlight that simple rescaling and padding or cropping of lower-resolution datasets is surprisingly effective. Additionally, we explore the effectiveness of semi-supervised learning for better domain adaptation. Our results show competitive performance on a set of common public retinal vessel datasets using a small and light-weight neural network. For HRF, the only very high-resolution dataset currently available, we reach new state-of-the-art performance by relying solely on training images from lower-resolution datasets. The second topic concerns evaluation metrics. We investigate the variability of the F1-score on the existing datasets and report results for recent SOTA architectures. Our evaluation shows that most SOTA results are actually comparable to each other in performance. Last, we address the issue of reproducibility by open-sourcing our complete pipeline.
The accurate and automatic segmentation of the retinal vasculature has several important applications in ophthalmology, such as the diagnosis of diabetic retinopathy and wet age-related macular degeneration. This work identifies and addresses three research areas of interest in the field with high relevance for practical deployments.
The first concerns the availability of high-resolution fundus images. Owing to the popularity and ease of use of lower-resolution fundus datasets such as DRIVE [staal_ridge-based_2004] and STARE [hoover_locating_2000] for convolutional neural network training, the majority of previous work still focuses on images that are far smaller than the fundus images taken today by clinics and practitioners around the world. To the best of our knowledge, only the HRF dataset [budai_robust_2013] has a resolution that reaches or comes close to that of images taken by modern fundus cameras.
Since manual annotation by experts is time-consuming and costly, we propose a set of methods that leverage existing public low-resolution datasets with annotated ground-truth vessel labels to train convolutional neural networks that perform well on unseen high-resolution images.
The second gap in existing research is the lack of detail in terms of how evaluation metrics are calculated and presented. We outline two averaging methods and introduce a set of plots containing standard deviation bands that provide additional insights into the robustness of models.
Finally, we address the issue of reproducibility. We found that the degree of reproducibility in previous work is unsatisfactory. Additionally, individual data-loading pipelines are necessary for each of the public datasets, since they do not come with a coherent folder structure or API. This means that for new researchers looking to enter the field, a large amount of engineering work is often necessary before actual research can take place. To address this, we introduce the bob.ip.binseg package (https://gitlab.idiap.ch/bob/bob.ip.binseg), which integrates with the Bob framework [anjos_bob_2017], can be easily extended and allows for the reproduction of our experiments.
II Related Work
II-A High-Resolution Images
In this section we describe related work that touches or focuses on the high-resolution fundus dataset HRF. Previous contributions can be roughly summarized as follows:
Small fully convolutional neural networks, trained on full-resolution images [laibacher_m2u_2018].
Large fully convolutional neural networks, trained on downsampled images [yan_joint_2018].
Patch-based training of fully convolutional neural networks with deformable convolutions [jin_dunet_2019].
Approaches with Generative Adversarial Networks [goodfellow_gan_2014] (GANs) using downsampled images [zhao_supervised_2019].
Owing to their proven record in various segmentation domains, fully convolutional neural networks (FCNs) with VGG16 [simonyan_vgg_2015] as encoder are employed in the majority of works, such as [jin_dunet_2019, meyer_deep_2017, yan_joint_2018, zhao_supervised_2019]. The M2U-Net introduced in [laibacher_m2u_2018] adopts a structure similar to the U-Net [ronneberger_u-net_2015] used in [yan_joint_2018], but relies on MobileNetV2 in the encoder part and proposes light-weight inverted contracting residual blocks in the decoder part. Similarly, the DUNet by Jin et al. [jin_dunet_2019] adopts the U structure but uses deformable convolutions [dai_deformable_2016] in parts of the network.
|Orlando et al. [orlando_discriminatively_2017]*||2017||0.7158||No||No||-|
|Yan et al. [yan_joint_2018]*||2018||0.7212||No||No||25.85M|
|Laibacher et al. [laibacher_m2u_2018]*||2018||0.7814||No||No||0.55M|
|Jin et al. [jin_dunet_2019]*||2019||0.7988||No||Yes||0.88M|
|Zhao et al. [zhao_supervised_2019]||2019||0.7659||Yes||No||14.94M|
|DRIU [maninis_deep_2016] (our impl.)*||2019||0.7865||No||No||14.94M|
|*Same train-test split|
The different methods come with a trade-off between computational/pipeline complexity and time on the one hand and segmentation quality on the other, as indicated in Table I. The deformable convolutions in DUNet, while light in terms of parameter count, considerably reduce inference speed as reported by the authors (47.7s vs 9.7s for a 999 x 960 image). Additionally, patch-based inference pipelines are estimated to be slower in inference than methods that utilize full-resolution [laibacher_m2u_2018] or downsampled images [yan_joint_2018]. The GAN-based approach by Zhao et al. [zhao_supervised_2019] requires a two-step training procedure. In the first step, a synthesized target dataset is constructed using a modified GAN. In the second step, DRIU [maninis_deep_2016] is trained on the created synthesized images.
Taking into account these trade-offs, we argue that the proposed methods remain largely comparable in performance and minor reported improvements do not represent highly significant breakthroughs.
II-B Evaluation Metrics
A coherent and transparent standard for evaluation metrics is necessary to allow for a fair comparison of methods. In the field of vessel segmentation, however, the reported metrics frequently differ both in kind and in quantity. While the F1-score has emerged as one of the dominant metrics, exact details on its calculation are often not clearly stated.
FCNs commonly output probability maps, attaching a vessel probability score to each pixel in the image. Since the ground-truth labels are binary, the probability map has to be thresholded. In [maninis_deep_2016, laibacher_m2u_2018, jin_dunet_2019, zhao_supervised_2019] metrics on the test-set are evaluated at all thresholds, presumably ranging from 0 to 1 in steps of 0.01. Metrics at the optimal test set threshold only are reported in [fraz_ensemble_2012, marin_new_2011, meyer_deep_2017]. Li et al. [li_cross-modality_2016] utilize the threshold determined on the training set, a scheme also adopted by Yan et al. [yan_joint_2018].
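The difference between these thresholding schemes can be made concrete with a small sketch. The Python snippet below sweeps thresholds from 0 to 1 in steps of 0.01 and contrasts a threshold fixed on training data with one selected on the test set itself. The function names and the toy probability values are hypothetical and serve illustration only; they are not taken from any of the cited works.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1-score of one image after binarising `probs` at `threshold`."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(images, steps=100):
    """Return (threshold, mean F1) maximising the mean per-image F1."""
    thresholds = [i / steps for i in range(steps + 1)]

    def mean_f1(t):
        return sum(f1_at_threshold(p, y, t) for p, y in images) / len(images)

    best = max(thresholds, key=mean_f1)
    return best, mean_f1(best)


# Toy flattened probability maps with binary ground truth (made-up values).
train = [([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])]
test = [([0.7, 0.25, 0.3, 0.1], [1, 1, 0, 0])]

t_train, _ = best_threshold(train)             # threshold fixed on training data
f1_fixed = f1_at_threshold(*test[0], t_train)  # applied unchanged to the test image
t_test, f1_oracle = best_threshold(test)       # threshold tuned on the test set
```

In this toy case the test-set-optimal threshold yields a higher F1-score than the training-set threshold, which illustrates why scores obtained under the two schemes are not directly comparable.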
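The metrics Precision (Pr), Recall (Re), Specificity (Sp), Accuracy (Acc) and F1-Score (F1) are derived from the number of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN) that are calculated for each test image/ground-truth pair:
$$\mathrm{Pr} = \frac{TP}{TP + FP}, \qquad \mathrm{Re} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}$$
$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}$$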
The average F1-score can then either be calculated based on individual F1-scores for each test image or on average Precision and Recall.
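As an illustration of the per-image calculation, the following minimal Python sketch derives the five metrics from the confusion counts of a single binarised prediction. The helper names are hypothetical, and returning 0.0 for an empty denominator is an assumption made for this sketch rather than a convention stated in the cited works.

```python
def confusion_counts(prediction, ground_truth):
    """Count TP, FP, TN, FN for one binarised image (flattened sequences)."""
    tp = fp = tn = fn = 0
    for p, y in zip(prediction, ground_truth):
        if p and y:
            tp += 1
        elif p and not y:
            fp += 1
        elif not p and not y:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, tn, fn):
    """Precision, Recall, Specificity, Accuracy and F1 from confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "Pr": precision,
        "Re": recall,
        "Sp": safe_div(tn, tn + fp),
        "Acc": safe_div(tp + tn, tp + fp + tn + fn),
        "F1": safe_div(2 * precision * recall, precision + recall),
    }


# Toy example: six pixels of a binarised prediction and its ground truth.
counts = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
scores = metrics(*counts)
```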
The differences in evaluation metrics are identified as the first barrier to a fair comparison and evaluation of competing methods. The second is the difference in training and test splits. Out of the five considered datasets, only DRIVE [staal_ridge-based_2004] defines a train-test split. The remaining datasets leave it to the authors to define an appropriate split. As will become evident in Section VII, while in some cases a dominant split has emerged in the literature, in other cases the utilized splits differ considerably, further hindering fair comparisons. In this work both barriers are addressed: we clearly describe metric calculations and train-test splits and hope to inspire future work to adopt a similar approach.
Machine learning experiments are becoming increasingly complex, making it harder to reproduce them [anjos_bob_2017]. This problem is especially pronounced in niche computer vision fields like vessel segmentation, which receive less attention than popular image-classification or object-detection tasks, in which reproducible work is found more frequently. The maskrcnn_benchmark [massa_mrcnn_2018], for example, has since its introduction seen several independent contributions and publications building on top of it [tian_fcos_2019, fu_retinamask_2019]. Besides being good practice, reproducible research has been shown to be beneficial to the impact of publications [vandewalle_reproducible_2009]. Vandewalle et al. [vandewalle_reproducible_2009] distinguish six degrees of reproducibility, ranging from easily reproducible (5) to not reproducible (0). We refer to the paper for the exact definitions.
We found that existing works either do not provide any source code at all [fraz_ensemble_2012, li_cross-modality_2016, liskowski_segmenting_2016, orlando_discriminatively_2017], and are therefore estimated to require extreme effort to reproduce (2), or provide source code that lacks documentation and instructions on how to set up the training environment and datasets [maninis_deep_2016, zhao_supervised_2019, jin_dunet_2019] or provide only parts of the training pipeline [yan_joint_2018], and therefore require considerable effort to reproduce (3).
This highlights the need for a "class 5" reproducible work, which we provide in the form of the comprehensive software package bob.ip.binseg.
The five most commonly used datasets are DRIVE [staal_ridge-based_2004], STARE [hoover_locating_2000], CHASE_DB1 [owen_measuring_2009], HRF [budai_robust_2013] and IOSTAR [abbasi_iostar_2015] with the order being indicative of their appearance in the literature.
III-A Train-Test Splits
For DRIVE we use the train-test split proposed by the authors of the dataset. For STARE we follow Maninis et al. [maninis_deep_2016] and Zhao et al. [zhao_supervised_2019] with a 10/10 split. The split adopted for CHASE_DB1 was first proposed by Fraz et al. [fraz_ensemble_2012] and uses the first 8 images for training and the last 20 for testing. For HRF we adopt the split proposed by Orlando et al. [orlando_discriminatively_2017] and adopted in [laibacher_m2u_2018] and [jin_dunet_2019], whereby the first five images of each category (healthy, diabetic retinopathy and glaucoma) are used for training and the remaining 30 for testing. For IOSTAR we select the 20/10 split introduced by Meyer et al. [meyer_deep_2017]. Table II provides a compact overview of dataset sizes, resolutions, splits and references.
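The HRF split rule (first five images per category for training, the rest for testing) can be sketched in a few lines of Python. The image identifiers below are hypothetical placeholders, and the sketch assumes 15 images per category, which is consistent with the 45 images and the 15/30 split listed in Table II.

```python
def split_first_k_per_category(names_by_category, k=5):
    """Use the first k images of every category for training, the rest for testing."""
    train, test = [], []
    for category in sorted(names_by_category):
        names = sorted(names_by_category[category])
        train.extend(names[:k])
        test.extend(names[k:])
    return train, test


# Hypothetical identifiers: 15 images per category, numbered 01..15.
hrf = {
    suffix: [f"{i:02d}_{suffix}" for i in range(1, 16)]
    for suffix in ("h", "dr", "g")  # healthy, diabetic retinopathy, glaucoma
}
train_ids, test_ids = split_first_k_per_category(hrf)
```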
III-B Combined Vessel Dataset
Since models are trained on a combination of the above-mentioned datasets, we refer to this combination as COVD (Combined Vessel Dataset). Whenever we exclude the dataset used for testing from training, we indicate this by a minus sign (COVD-). For example, COVD- tested on the target dataset HRF means that all datasets except HRF are included for training. Similarly, COVD- evaluated on the target dataset CHASE_DB1 means that all datasets except CHASE_DB1 are included for training.
This way we simulate real-world cases where there often is no ground-truth data available.
In cases where we use semi-supervised learning, we utilize the training images, but not the ground-truth data, of the target dataset. For example, for COVDSSL evaluated on HRF, we utilize all datasets except HRF for training with ground-truth pairs, and for SSL we use only the images of the HRF training set.
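The leave-one-dataset-out composition described above can be summarised in a short Python sketch. The function name and the returned dictionary layout are hypothetical and only meant to make the rule explicit; they do not reflect the interface of bob.ip.binseg.

```python
DATASETS = ("DRIVE", "STARE", "CHASE_DB1", "HRF", "IOSTAR")


def covd_sources(target, ssl=False):
    """Return which datasets contribute labeled and unlabeled training images."""
    if target not in DATASETS:
        raise ValueError(f"unknown target dataset: {target}")
    # All datasets except the target provide image/ground-truth pairs.
    labeled = [name for name in DATASETS if name != target]
    # With SSL, the target's training images are used without their labels.
    unlabeled = [target] if ssl else []
    return {"labeled": labeled, "unlabeled": unlabeled}


covd_hrf = covd_sources("HRF")
covd_ssl_hrf = covd_sources("HRF", ssl=True)
```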
|Dataset||H x W||Imgs.||Train||Test||Reference|
|DRIVE||584 x 565||40||20||20||[staal_ridge-based_2004]|
|STARE||605 x 700||20||10||10||[maninis_deep_2016]|
|CHASE_DB1||960 x 999||28||8||20||[fraz_ensemble_2012, laibacher_m2u_2018]|
|IOSTAR||1024 x 1024||30||20||10||[meyer_deep_2017]|
|HRF||2336 x 3504||45||15||30||[orlando_discriminatively_2017, laibacher_m2u_2018, jin_dunet_2019]|
IV Baseline Benchmarks
Before investigating potential approaches to vessel segmentation on high-resolution images, we ran a set of baseline benchmarks that compare the performance of four popular convolutional neural networks used for retinal vessel segmentation (number of trainable network parameters in brackets): DRIU (14.94M) [maninis_deep_2016], HED (14.73M) [xie_holistically-nested_2015], M2U-Net (0.55M) [laibacher_m2u_2018] and U-Net (25.85M) [ronneberger_u-net_2015]. The results are shown in Table III. The fact that the performance of the considerably smaller M2U-Net almost reaches that of larger models like DRIU and U-Net, especially for the high-resolution dataset HRF, hints at overparametrization of the ImageNet-pretrained [imagenet_cvpr09] part of those models. Similar observations were made by Raghu et al. [raghu_transfusion_2019] on the RETINA [gulshan_development_2016] classification dataset.
Out of the four models we picked DRIU and M2U-Net to train on COVD and to apply semi-supervised learning. We chose M2U-Net because of the aforementioned properties and DRIU because it came very close to the heavy U-Net while requiring fewer parameters.
|Dataset||DRIU||HED||M2U-Net||U-Net|
|CHASEDB1||0.810 (0.021)||0.810 (0.022)||0.802 (0.019)||0.812 (0.020)|
|DRIVE||0.820 (0.014)||0.817 (0.013)||0.803 (0.014)||0.822 (0.015)|
|HRF||0.783 (0.055)||0.783 (0.058)||0.780 (0.057)||0.788 (0.051)|
|IOSTAR||0.825 (0.020)||0.825 (0.020)||0.817 (0.020)||0.818 (0.019)|
|STARE||0.827 (0.037)||0.823 (0.037)||0.815 (0.041)||0.829 (0.042)|
In this section we describe two approaches for vessel segmentation for high-resolution images: We first outline the conducted rescaling, padding and cropping scheme, followed by our implementation of semi-supervised learning. Here we propose a simple scheme whereby three guesses of unlabeled images, created by the network during training, are averaged and incorporated in a combined loss function by a weighting factor. Finally we describe other implementation details and hyper-parameters.
V-A Rescaling, Cropping, Padding
Whenever we train a model for a target dataset with a specific resolution and spatial composition, we apply image transformations to the source dataset so that it has the resolution and approximate spatial composition of the target dataset. This is best illustrated by an example:
Treating HRF as the target dataset, and CHASE_DB1 as the source dataset, we first perform a crop on the latter followed by a resize operation as depicted in Figure 4.
This is in contrast to approaches where the high-resolution target dataset is downscaled to the resolution of the source dataset, fed through the network and upsampled again to the target dataset resolution [zhao_supervised_2019, yan_joint_2018].
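To give a feel for the geometry involved, the following Python sketch computes a crop box and a resize factor that bring a source image to the aspect ratio and resolution of a target dataset, using the CHASE_DB1 and HRF sizes from Table II. Centring the crop and matching the aspect ratio exactly are assumptions of this sketch; the exact crop depicted in Figure 4, and any additional padding, may differ.

```python
def crop_and_scale(source_hw, target_hw):
    """Centre-crop box matching the target aspect ratio, plus the resize factor.

    Returns ((top, left, crop_height, crop_width), scale).
    """
    src_h, src_w = source_hw
    tgt_h, tgt_w = target_hw
    if src_h * tgt_w > src_w * tgt_h:
        # Source is relatively too tall: keep the full width, reduce the height.
        crop_h, crop_w = round(src_w * tgt_h / tgt_w), src_w
    else:
        # Source is relatively too wide: keep the full height, reduce the width.
        crop_h, crop_w = src_h, round(src_h * tgt_w / tgt_h)
    top = (src_h - crop_h) // 2
    left = (src_w - crop_w) // 2
    scale = tgt_h / crop_h  # factor applied in the subsequent resize
    return (top, left, crop_h, crop_w), scale


# CHASE_DB1 (960 x 999) as source, HRF (2336 x 3504) as target.
box, scale = crop_and_scale((960, 999), (2336, 3504))
```

Under these assumptions, CHASE_DB1 images are cropped to 666 x 999 pixels and then enlarged by a factor of roughly 3.5 to reach the HRF resolution.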
V-B Semi-Supervised Learning
Semi-Supervised Learning (SSL) [chapelle_ssl_2010] has seen increased interest in the image classification domain, with recent works including [berthold_mixmatch_2019, olivier_contrastivessl_2019]. In this work we adopt the approach of Berthelot et al. [berthold_mixmatch_2019] of using unlabeled examples and labeled examples in separate loss terms that are combined by a weighting factor. Given a batch of unlabeled examples from the target dataset, three guessed vessel probability maps $q_1$, $q_2$ and $q_3$ are generated for each unlabeled image via forward passes through the model, using the unlabeled image, a horizontally flipped version of it and a vertically flipped version of it. These guesses are averaged to form $\bar{q}$:
$$\bar{q} = \frac{1}{3} \sum_{k=1}^{3} q_k$$
The averaged guess is then used in the SSL loss covered in the following section.
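The guessing step can be sketched as follows in plain Python, with images represented as nested lists. Flipping the predictions back before averaging is an assumption of this sketch (it keeps the three guesses spatially aligned); the toy model and all function names are hypothetical stand-ins for a real network.

```python
def hflip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror an image (list of rows) top to bottom."""
    return image[::-1]


def guess_labels(model, image):
    """Average the predictions for an image and its two flipped versions."""
    p_orig = model(image)
    p_h = hflip(model(hflip(image)))  # undo the flip so pixels align (assumption)
    p_v = vflip(model(vflip(image)))
    return [
        [(a + b + c) / 3 for a, b, c in zip(row_a, row_b, row_c)]
        for row_a, row_b, row_c in zip(p_orig, p_h, p_v)
    ]


def toy_model(image):
    """Stand-in for a network: damps the first column, so it is not flip-invariant."""
    return [[v * (0.5 if j == 0 else 1.0) for j, v in enumerate(row)] for row in image]


image = [[0.2, 0.4], [0.6, 0.8]]
guess = guess_labels(toy_model, image)
```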
V-B1 Loss Functions
For standard supervised learning without SSL we utilize the Jaccard Loss [iglovikov_ternausnetv2_2018], which is a combination of the Binary Cross-Entropy loss $H$ and the Jaccard score $J$, weighted by a factor $\alpha$:
$$L = \alpha H + (1 - \alpha)(1 - J)$$
We adopt the value of $\alpha$ suggested by Iglovikov et al. [iglovikov_ternausnetv2_2018].
The Binary Cross-Entropy loss forms the first part of the equation. With $\hat{y}_i$ the predicted probability of pixel $i$ belonging to the vessel class, $y_i$ the ground-truth binary value and $n$ the number of pixels, it is defined as:
$$H = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)$$
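An adaptation of the Jaccard coefficient for continuous pixel-wise probabilities forms the second part of the combined loss function. With $y_i$ the ground-truth value and $\hat{y}_i$ the predicted vessel probability of pixel $i$ among $n$ pixels, it reads:
$$J = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i}$$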
We note that the Jaccard coefficient has a monotonically increasing relation with the F1-score (also known as the Dice coefficient) [pont-tuset_supervised_2016], namely F1 = 2J / (1 + J), so it can act as an appropriate loss function even though our actual evaluation metric is the F1-score.
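A minimal Python sketch of such a combined loss is given below. It makes several assumptions: the soft Jaccard score is computed from image-level sums (a common, numerically stable variant that is not necessarily the exact form used here), probabilities are clamped by a small epsilon, and the value alpha=0.5 in the example is an arbitrary placeholder rather than the value adopted in this work. All function names are hypothetical.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def soft_jaccard(probs, labels, eps=1e-7):
    """Soft intersection-over-union from image-level sums (assumed variant)."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    union = sum(p + y - p * y for p, y in zip(probs, labels))
    return (intersection + eps) / (union + eps)


def jaccard_bce_loss(probs, labels, alpha):
    """Weighted combination of cross-entropy and (1 - soft Jaccard)."""
    return alpha * bce(probs, labels) + (1 - alpha) * (1 - soft_jaccard(probs, labels))


def dice_from_jaccard(j):
    """F1 (Dice) expressed through the Jaccard coefficient: 2J / (1 + J)."""
    return 2 * j / (1 + j)


# alpha=0.5 is an arbitrary placeholder for this example.
loss_good = jaccard_bce_loss([0.9, 0.1], [1, 0], alpha=0.5)
loss_bad = jaccard_bce_loss([0.6, 0.4], [1, 0], alpha=0.5)
```

Since 2J / (1 + J) increases with J on the interval from 0 to 1, a prediction that improves the Jaccard score also improves the F1-score.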
Instead of using a constant weighting factor, we use a quadratic ramp-up schedule, illustrated in Figure 5. The intuition is that the further the training process has advanced, the better the semi-supervised predictions should become.
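One possible form of such a schedule is sketched below in Python. The exact schedule shown in Figure 5 is not reproduced here: the ramp length, the maximum weight and the additive combination of the two loss terms are assumptions of this sketch, and the function names are hypothetical.

```python
def quadratic_ramp_up(epoch, ramp_epochs, max_weight=1.0):
    """Weight that grows quadratically from 0 to max_weight over ramp_epochs."""
    progress = min(max(epoch / ramp_epochs, 0.0), 1.0)
    return max_weight * progress ** 2


def combined_loss(supervised_loss, unsupervised_loss, weight):
    """Labeled loss plus the weighted loss on guessed labels (assumed form)."""
    return supervised_loss + weight * unsupervised_loss


# Early in training the unlabeled term contributes little; later it counts fully.
early = combined_loss(0.4, 0.8, quadratic_ramp_up(10, ramp_epochs=100))
late = combined_loss(0.4, 0.8, quadratic_ramp_up(100, ramp_epochs=100))
```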
V-C Implementation Details
The following details apply to both "normal" training and SSL training. We deploy AdaBound [luo_adaptive_2019] as an optimizer, using the default parameters suggested in the original paper with a learning rate of 0.001. During the training phase we apply the following random augmentations: horizontal flipping, vertical flipping, rotation and changes in brightness, contrast, saturation and hue. Training is conducted for 1000 epochs, with a reduced learning rate of 0.0001 after 900 epochs. For all datasets except HRF, we use the original resolution and perform the necessary cropping/padding so that the resolution is a multiple of 32, a requirement for the U-Net and M2U-Net variants. Since our training hardware setup did not have enough GPU memory for training on the full-resolution HRF images (and the upscaled COVD source datasets), we use half-resolution images for training (1168 x 1648) but run inference on the full-resolution images (2336 x 3296). We refer to the released bob package and documentation for all details necessary to reproduce our results.
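Two of these details lend themselves to a short Python sketch: the target size when padding to a multiple of 32, and the step learning-rate schedule. The function names are hypothetical, only the padding case is covered (not cropping), and treating epochs as zero-based with the drop taking effect at epoch 900 is an assumption.

```python
import math


def padded_size(height, width, multiple=32):
    """Smallest (height, width) that is a multiple of `multiple` in both axes."""
    return (
        math.ceil(height / multiple) * multiple,
        math.ceil(width / multiple) * multiple,
    )


def learning_rate(epoch, base_lr=0.001, reduced_lr=0.0001, milestone=900):
    """Step schedule: base rate before the milestone, reduced rate afterwards."""
    return base_lr if epoch < milestone else reduced_lr


drive_padded = padded_size(584, 565)  # DRIVE resolution from Table II
stare_padded = padded_size(605, 700)  # STARE resolution from Table II
```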
VI Metrics and Evaluation
In this section we describe in more detail the two ways to calculate the F1-score and introduce an advanced Precision vs. Recall plot.
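As mentioned in Section II-B, the average F1-score over all $N$ test images can be calculated either on a micro level, where the individual F1-scores $\mathrm{F1}_i$ of the test images are averaged, or on a macro level, where the F1-score is calculated from the average Precision $\overline{\mathrm{Pr}}$ and the average Recall $\overline{\mathrm{Re}}$:
$$\mathrm{F1}_{\mathrm{micro}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{F1}_i$$
$$\mathrm{F1}_{\mathrm{macro}} = \frac{2 \, \overline{\mathrm{Pr}} \, \overline{\mathrm{Re}}}{\overline{\mathrm{Pr}} + \overline{\mathrm{Re}}}$$
While previously published work uses the latter, which leads to slightly higher scores, the former calculation method allows for additional insights into the variability of the model's performance, since F1-score standard deviations can be included.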
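The two averaging schemes can be compared directly in a small Python sketch. The per-image precision/recall pairs are made-up toy values, the function names are hypothetical, and using the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_f1(pr_pairs):
    """Mean and standard deviation of the per-image F1-scores."""
    scores = [f1(p, r) for p, r in pr_pairs]
    return mean(scores), pstdev(scores)


def macro_f1(pr_pairs):
    """F1-score computed from the average precision and average recall."""
    avg_p = mean(p for p, _ in pr_pairs)
    avg_r = mean(r for _, r in pr_pairs)
    return f1(avg_p, avg_r)


# Toy per-image (precision, recall) pairs for three test images.
pairs = [(0.9, 0.5), (0.5, 0.9), (0.8, 0.8)]
micro_mean, micro_std = micro_f1(pairs)
macro = macro_f1(pairs)
```

In this example the macro-level score exceeds the micro-level one. This is expected in general: the harmonic mean is concave, so by Jensen's inequality the F1-score of the averages is at least as large as the average of the F1-scores.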
An alternative representation of the results is shown in Figure 6, in the form of an extended precision vs. recall curve. Here the average precision and average recall are plotted for every threshold, together with the standard deviation in both precision and recall. In addition, iso-F curves are plotted in light green and the point along the curve with the highest F1-score is highlighted in black. In this case, in order to have a consistent representation within the plot, the F1-score is macro averaged (Equation 12). This setup allows for an easy visual comparison of model performance and its variability across test images.
For example, in Figure 6 it can be observed that on CHASE_DB1 the variability of the annotations made by the second human annotator, whose standard deviation is depicted by a single line in light red, is higher than that of our models, whose standard deviation bands are narrower. Put simply, the models make more consistent predictions across test images than the second annotator.
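The quantities behind such a plot can be computed as in the following Python sketch: for every threshold, the mean and standard deviation of precision and recall across test images, plus the macro-averaged F1-score used to pick the highlighted point. The input layout, the function name, the toy values and the use of the population standard deviation are assumptions of this sketch; the plotting itself is omitted.

```python
from statistics import mean, pstdev


def pr_curve_with_bands(per_threshold):
    """Summarise per-image (precision, recall) pairs for each threshold.

    per_threshold maps a threshold to a list of (precision, recall) pairs,
    one pair per test image.
    """
    curve = []
    for threshold in sorted(per_threshold):
        precisions = [p for p, _ in per_threshold[threshold]]
        recalls = [r for _, r in per_threshold[threshold]]
        avg_p, avg_r = mean(precisions), mean(recalls)
        total = avg_p + avg_r
        curve.append({
            "threshold": threshold,
            "precision": avg_p,
            "recall": avg_r,
            "precision_std": pstdev(precisions),  # half-width of the band
            "recall_std": pstdev(recalls),
            "f1": 2 * avg_p * avg_r / total if total else 0.0,
        })
    best = max(curve, key=lambda point: point["f1"])
    return curve, best


# Toy values for two test images at three thresholds.
toy = {
    0.3: [(0.6, 0.9), (0.7, 0.95)],
    0.5: [(0.8, 0.8), (0.9, 0.7)],
    0.7: [(0.95, 0.4), (0.9, 0.5)],
}
curve, best = pr_curve_with_bands(toy)
```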
To evaluate the performance of our rescaling, padding and cropping scheme, we treated each of the datasets in Table II in turn as the target dataset. Here we only report the results for M2U-Net and refer to Appendix A for the results with DRIU.
On all tested datasets we found that training on COVD yields competitive results that come close to the performance of the baselines where the model was trained and tested on the same dataset. This is encouraging, given the large differences in illumination, contrast, color and resolution of the source datasets. For HRF we can report performance improvements of about 2 p.p. compared to the baseline.
Further applying SSL, we gain additional improvements of around 1 p.p. for CHASE_DB1 and STARE, in the latter case now narrowly beating the baseline. For DRIVE we found only marginal improvements, and for HRF and IOSTAR performance became worse. We therefore cannot make conclusive statements about the viability of SSL for domain adaptation and leave the investigation, mitigation and improvement of this method to future work.
|Target dataset||M2U-Net baseline||M2U-Net COVD –||M2U-Net COVD – SSL|
|DRIVE||0.803 (0.014)||0.789 (0.018)||0.791 (0.014)|
|STARE||0.815 (0.041)||0.812 (0.046)||0.820 (0.044)|
|CHASEDB1||0.802 (0.019)||0.788 (0.024)||0.799 (0.026)|
|HRF||0.780 (0.057)||0.802 (0.045)||0.797 (0.044)|
|IOSTAR||0.817 (0.020)||0.793 (0.015)||0.785 (0.018)|
To put our results into perspective, Tables V, VI, VII, VIII and IX list previous works alongside ours. We report the F1-score for our results. Overall, M2U-Net trained on COVD is competitive, with the best performance on the high-resolution dataset HRF, where a new state-of-the-art F1-score could be reached. In addition to the available public datasets, we trained M2U-Net on COVD for a private target dataset with a resolution of 1920x1920 for which no ground-truth data is available. The predicted vessel probability maps are displayed in Figure 13.
|Target: DRIVE (584x565)|
|Method||Year||F1||Acc||Pr||Re||Sp|
|2nd human observer||-||0.7931||0.7881||0.8072||0.7796||0.9717|
|Bibiloni et al. [bibiloni_real-time_2018]||2018||0.7521||0.938||0.786||0.721||0.970|
|Fraz et al. [fraz_ensemble_2012]||2012||0.7929||0.9480||0.8532||0.7406||0.9807|
|Jin et al. [jin_dunet_2019]||2019||0.8237||0.9566||0.8529||0.7963||0.9800|
|Laibacher et al. [laibacher_m2u_2018]||2018||0.8091||-||-||-||-|
|Li et al. [li_cross-modality_2016]||2016||-||0.9527||-||0.7569||0.9816|
|Liskowski et al. [liskowski_segmenting_2016]||2016||-||0.9535||-||0.7811||0.9807|
|Maninis et al. [maninis_deep_2016]||2016||0.8220||-||-||-||-|
|Marin et al. [marin_new_2011]||2011||0.8134||0.9452||0.9582||0.7067||0.9801|
|Orlando et al. [orlando_discriminatively_2017]||2017||0.7857||-||0.7854||0.7897||0.9684|
|Yan et al. [yan_joint_2018]||2018||0.8183||0.9529||0.8124||0.8242||0.9720|
|Zhao et al. [zhao_supervised_2019]||2019||0.7882||-||-||-||-|
|M2U-Net COVD –||-||0.7885||0.9592||0.7990||0.7824||0.9787|
|M2U-Net COVD – SSL||-||0.7913||0.9598||0.8016||0.7862||0.9789|
|All supervised methods use the same train-test split|
|Target: STARE (605x700)|
|Method||Year||F1||Acc||Pr||Re||Sp|
|2nd human observer||-||-||0.9347||0.6432||0.8955||0.9382|
|Bibiloni et al. [bibiloni_real-time_2018]||2018||0.752||0.938||0.786||0.721||0.970|
|Fraz et al. [fraz_ensemble_2012]||2012||0.7747||0.9347||0.7956||0.7548||0.9763|
|Jin et al. [jin_dunet_2019]||2019||0.8143||0.9641||0.8777||0.7595||0.9878|
|Li et al. [li_cross-modality_2016]||2016||-||0.9628||-||0.7726||0.9844|
|Maninis et al. [maninis_deep_2016]*||2016||0.831||-||-||-||-|
|Marin et al. [marin_new_2011]||2011||0.8080||0.9526||0.9659||0.6944||0.9819|
|Orlando et al. [orlando_discriminatively_2017]||2017||0.7644||-||0.7740||0.7680||0.9738|
|Yan et al. [yan_joint_2018]||2018||-||0.9612||-||0.7581||0.9846|
|Zhao et al. [zhao_supervised_2019]*||2019||0.7960||-||-||-||-|
|M2U-Net COVD –||-||0.8117||0.9724||0.8128||0.8114||0.9851|
|M2U-Net COVD – SSL||-||0.8196||0.9734||0.8164||0.8282||0.9847|
|*Same train-test split as adopted in this work|
|Target: IOSTAR (1024x1024)|
|Method||Year||F1||Acc||Pr||Re||Sp|
|Abbasi-Sureshjani et al. [abbasi_iostar_2015]||2015||-||0.9501||-||0.7863||0.9747|
|Meyer et al. [meyer_deep_2017]*||2017||-||0.9695||-||0.8038||0.9801|
|Zhang et al. [zhang_robust_2016]||2016||-||0.9514||-||0.7545||0.9740|
|Zhao et al. [zhao_supervised_2019]||2019||0.7707||-||-||-||-|
|DRIU [maninis_deep_2016] (our impl.)||-||0.8273||0.9721||0.8173||0.8376||0.9839|
|M2U-Net COVD –||-||0.7928||0.9665||0.7755||0.8161||0.9798|
|M2U-Net COVD – SSL||-||0.7845||0.9644||0.7544||0.8221||0.9770|
|*Same train-test split as adopted in this work|
|Target: CHASE_DB1 (960x999)|
|Method||Year||F1||Acc||Pr||Re||Sp|
|2nd human observer||-||0.7686||0.9538||-||-||-|
|Azzopardi et al. [azzopardi_trainable_2015]||2015||-||0.9387||-||0.7585||0.9587|
|Zhang et al. [zhang_robust_2016]||2016||-||0.9452||-||0.7626||0.9661|
|Fraz et al. [fraz_ensemble_2012]*||2012||0.7566||0.9469||0.7415||0.7224||0.9711|
|Jin et al. [jin_dunet_2019]||2019||0.7883||0.9610||0.7630||0.8155||0.9752|
|Laibacher et al. [laibacher_m2u_2018]*||2018||0.8006||-||-||-||-|
|Li et al. [li_cross-modality_2016]||2016||-||0.9581||-||0.7507||0.9793|
|Orlando et al. [orlando_discriminatively_2017]||2017||0.7332||-||0.7438||0.7277||0.9712|
|Roychowdhury et al. [roychowdhury_blood_2015]||2015||-||0.9530||-||0.7201||0.9824|
|Yan et al. [yan_joint_2018]||2018||-||0.9610||-||0.7633||0.9809|
|DRIU [maninis_deep_2016] (our impl.)*||-||0.8114||0.9716||0.8068||0.8160||0.9842|
|M2U-Net COVD –||-||0.7884||0.9678||0.7710||0.8095||0.9807|
|M2U-Net COVD – SSL||-||0.7988||0.9694||0.7819||0.8189||0.9816|
|*Same train-test split as adopted in this work|
|Target: HRF (2336x3504)|
|Method||Year||F1||Acc||Pr||Re||Sp|
|Annunziata et al. [annunziata_leveraging_2016]||2016||0.7578||0.9581||0.8089||0.7128||0.9836|
|Budai et al. [budai_robust_2013]||2013||-||0.9610||-||0.669||0.985|
|Odstrcilik et al. [odstrcilik_retinal_2013]||2013||0.7324||0.9494||-||0.7741||0.9669|
|Zhang et al. [zhang_robust_2016]||2016||-||0.9556||-||0.7978||0.9710|
|Orlando et al. [orlando_discriminatively_2017]*||2017||0.7158||-||0.6630||0.7874||0.9584|
|Yan et al. [yan_joint_2018]*||2018||0.7212||0.9437||0.6647||0.7881||0.9592|
|Laibacher et al. [laibacher_m2u_2018]*||2018||0.7814||0.9635||-||-||-|
|Jin et al. [jin_dunet_2019]*||2019||0.7988||0.9651||0.8593||0.7464||0.9874|
|Zhao et al. [zhao_supervised_2019]||2019||0.7659||-||-||-||-|
|DRIU [maninis_deep_2016] (our impl.)*||-||0.7865||0.9646||0.7863||0.7868||0.9806|
|M2U-Net COVD -||-||0.8020||0.9669||0.7889||0.8188||0.9802|
|M2U-Net COVD - SSL||-||0.7972||0.9659||0.7898||0.8021||0.9807|
|*Same train-test split as adopted in this work|
In this work we showed that simple transformation techniques like rescaling, padding and cropping of lower-resolution source datasets to the resolution and spatial composition of a higher-resolution target dataset can be a surprisingly effective way to improve segmentation quality. Our experiments with semi-supervised learning show first promising results but require further investigation and work. We emphasized the need for a more rigorous and detailed focus on evaluation metrics and proposed a set of plots and metrics that give additional insights into model performance. Lastly, we provide open-source code and documentation for future researchers to build upon and hope to inspire future work in the field.
Appendix A DRIU and DRIU BN Results
In addition to the M2U-Net architecture, we also evaluated the larger DRIU network and a variation of it that contains batch normalization (DRIU BN) on COVD and COVDSSL. Perhaps surprisingly, for the majority of combinations, the performance of the DRIU variants is roughly equal to or worse than that of the M2U-Net. We suspect that one reason for this could be the aforementioned overparameterization of large VGG16 models that are pretrained on ImageNet. The results are listed in Table X.
|COVD||DRIVE||0.788 (0.018)||0.797 (0.019)||0.789 (0.018)|
|COVDSSL||DRIVE||0.785 (0.018)||0.783 (0.019)||0.791 (0.014)|
|COVD||STARE||0.778 (0.117)||0.778 (0.122)||0.812 (0.046)|
|COVDSSL||STARE||0.788 (0.102)||0.811 (0.074)||0.820 (0.044)|
|COVD||CHASE_DB1||0.796 (0.027)||0.791 (0.025)||0.788 (0.024)|
|COVDSSL||CHASE_DB1||0.796 (0.024)||0.798 (0.025)||0.799 (0.026)|
|COVD||HRF||0.799 (0.044)||0.800 (0.045)||0.802 (0.045)|
|COVDSSL||HRF||0.799 (0.044)||0.784 (0.048)||0.797 (0.044)|
|COVD||IOSTAR||0.791 (0.021)||0.777 (0.032)||0.793 (0.015)|
|COVDSSL||IOSTAR||0.791 (0.021)||0.811 (0.074)||0.785 (0.018)|
Appendix B Qualitative Results and PR Curves for M2U-Net
Tim Laibacher received the M.Sc degree in Data Science from City, University of London in 2018 where he subsequently worked as a Research Assistant. He was an Intern at the Idiap Research Institute, studying retinal vessel segmentation methods using convolutional neural networks. His research is focused on practical applications in the field of funduscopy using efficient network architectures and working with high-resolution images.
André Rabello dos Anjos received his Ph.D. degree in signal processing in 2006 studying the application of neural nets and statistical methods for particle recognition in the context of High-Energy Physics experiments at Large Hadron Collider at CERN, Switzerland. He joined the Idiap Research Institute in 2010 where he works with biosignal processing and biometrics applications. He currently heads the Biosignal Processing Group at Idiap. Current interests include medical applications, reproducible research, pattern recognition, image processing and machine learning. André teaches graduate-level machine learning courses at the École Polytechnique Fédérale de Lausanne (EPFL) and serves as reviewer for various scientific journals in pattern recognition, image processing and biometrics.