A learning without forgetting approach to incorporate artifact knowledge in polyp localization tasks


Colorectal polyps are abnormalities in the colon tissue that can develop into colorectal cancer. The survival rate for patients is higher when the disease is detected at an early stage and polyps can be removed before they develop into malignant tumors. Deep learning methods have become the state of the art in automatic polyp detection. However, the performance of current models heavily relies on the size and quality of the training datasets. Endoscopic video sequences tend to be corrupted by different artifacts affecting visibility and, hence, the detection rates. In this work, we analyze the effects that artifacts have on the polyp localization problem. For this, we evaluate the RetinaNet architecture, originally defined for object localization. We also define a model inspired by the learning without forgetting framework, which allows us to employ artifact detection knowledge in the polyp localization problem. Finally, we perform several experiments to analyze the influence of artifacts on the performance of these models. To the best of our knowledge, this is the first extensive analysis of the influence of artifacts on polyp localization and the first work incorporating learning without forgetting ideas for simultaneous artifact and polyp localization tasks.

Keywords— Polyp detection, Artifact detection, Learning without forgetting, Multi-task learning

1 Introduction

According to the World Health Organization, cancer is the second leading cause of death worldwide, and 9.6 million people were estimated to die from cancer in 2018. Colorectal Cancer (CRC) is the third most common cancer and the second most common cause of cancer-related deaths, according to 2018 data. Cure rates for some types of cancer strongly rely on early detection, and in the case of CRC, survival rates reach 95% when the disease is detected in early stages ([8]).

The standard procedure for CRC screening is the endoscopic analysis of the colon. During this study, the endoscopist explores the colon cavity looking for abnormal growths of tissue known as polyps (Fig. 1a), which may develop into tumors. Polyps with a certain shape, kind, and size should be removed for closer examination ([14]).

Figure 1: a) Sample frames from the CVC-ClinicDB dataset. Polyp locations are indicated by green marks. b) Samples from the EAD challenge dataset. The artifacts contained include specularities (pink boxes), tools (red boxes), bubbles (black boxes), and blur (blue boxes). Additional artifact classes include contrast, saturation, and misc. artifact (this last class contains elements that do not belong to the previously mentioned classes). Best viewed in color.

Given the importance of early diagnosis, different works for automatic polyp detection in endoscopic images and videos have been proposed to help the endoscopist in the analysis. However, polyp detection is a challenging problem given the high variation of polyps in appearance, size, and shape, and, in many cases, their high similarity with the surrounding tissue. Changes in pose, blurring, bubbles, and occlusions occurring during the colonoscopy, as well as specular reflections, are also factors that make the task difficult ([8]). The Polyp Detection in Colonoscopy sub-challenge introduced by [10] was conducted as part of the Endoscopic Vision Challenge during MICCAI 2015 with two main objectives: polyp localization and polyp detection. Participants were provided with the CVC-ClinicDB dataset for training their models, while the ETIS-Larib dataset was reserved for testing. The ASU-Mayo Clinic Colonoscopy Video (c) Database was also considered for both training and testing. Part of the results of this competition included a comparison of the performance of the participating models in the presence (or absence) of certain artifacts, like specular highlights, low visibility, or information overlaying the image. Even though this was a general overview and did not include a deep analysis of the influence of artifacts on predictions (e.g. the relations of artifacts with correct and incorrect polyp predictions), its results suggest that artifacts play a significant role in polyp detection.

Given the importance of artifacts in endoscopy analysis, the Endoscopic Artefact Detection (EAD) challenge was recently proposed for the detection of seven different types of artifacts. The participants were provided with a dataset ([4, 3]) containing endoscopy still images together with artifact bounding boxes for the different classes (Fig. 1b).

These polyp and artifact datasets provide the necessary resources to dive deeper into the understanding of the relationship between artifact presence and polyp localization accuracy. Likewise, a better understanding of this relationship should lead to improved detection systems. It has been shown that training a model on multiple related tasks can improve its performance ([13]). Following this line, we consider that artifact detection should help improve polyp detection performance.

With this in mind, we conducted experiments to explore how the presence of endoscopic artifacts is related to the correct and incorrect predictions of a polyp detector. We considered cases where artifacts are merely present and cases where they overlap with polyp predictions. Then, we evaluated different strategies that incorporate artifact information in deep learning models for polyp detection, including a novel application of the learning without forgetting framework ([38]). This leads to the following contributions: 1) to the best of our knowledge, the first work that presents a deep analysis of how artifacts affect polyp detection; 2) the first work that uses multi-task learning to include artifact information in polyp detection; 3) to the best of our knowledge, the first work applying learning without forgetting to a problem where both the initial and the new task are object detection tasks.

The rest of the document is organized as follows: Section 2 describes previous related works. Section 3 presents the main concepts and models employed; it also gives details about our MTL polyp detector. Our experiments and results are presented in Section 4; finally, Section 5 presents our conclusions.

2 Related work

Given the importance of automatic polyp detection as a potential tool for computer-assisted colonoscopy (the endoscopic study of the colon), a number of previous works have addressed the problem. Initial approaches relied on traditional machine learning methods and expert-defined features. Currently, with the extended use of deep learning, convolutional neural networks (CNNs) have become the state of the art for this problem. Many works have already tried to address the issue of artifact presence in endoscopy ([3, 33, 35, 27, 15, 1, 36]). Except for [36], none of these methods applied artifact knowledge to polyp detection. Given the nature of our study, related work falls into the polyp detection, artifact detection, and multi-task learning (MTL) literature.

2.1 Polyp Detection

Automatic polyp detection is a computer vision problem that was dominated by hand-crafted feature methods before advances in computing power made deep learning practical ([23, 22, 2, 5, 19, 21, 16, 9]). However, given the great variance of polyp shapes, the different angles used in colonoscopy, and the different lighting modes applied, the appearance of polyps varies widely, and so these methods have always had limited performance ([39]). Moreover, there are many objects in endoscopic video frames, including artifacts, that can look very similar to polyps. With the ascent of deep learning, automatic polyp detection performance has improved significantly in recent years, and most high-performing approaches now rely on it ([10, 41, 39, 11, 6, 37, 28, 31, 12]). Some methods also use a hybrid approach that combines hand-crafted and learned features ([34, 7, 32, 30]).

2.2 Artifact Detection

As we mentioned previously, endoscopic sequences tend to contain a high number of artifacts that reduce visibility and affect polyp detection capabilities ([10]). Efforts have been made to detect these artifacts and restore the images. Most of these approaches focus only on specific artifacts, like specular reflections ([33, 35, 27, 15, 1]). In recent works ([3]), the team behind the Endoscopic Artefact Detection (EAD) challenge has built a framework that can detect six different artifact classes and apply artifact-specific frame restoration procedures, which are often based on adversarial networks. While this method can restore the quality of endoscopic frames for post-processing procedures, it is not applicable in real time.

Within our work, we analyze to what extent artifacts affect polyp detection. In the MICCAI 2015 polyp detection competition ([10]), an initial attempt was made to understand the effects of endoscopic artifacts on polyp detection. Here, three experts were asked to label images of the ASU-Mayo Clinic Colonoscopy Video Database ([34]). The database consists of 38 videos with per-frame artifact ground truth taken from the common majority labels given by the three experts. Labels include massive specular highlight presence, low visibility, specular highlights within the polyp, or overexposed regions. Then, the authors looked at how the performance of the different models participating in the competition differed in the presence of these artifacts. The study showed that the presence of certain artifacts has a substantial influence on polyp detection performance. Our experiments extend these findings by involving a higher number of artifacts (Fig. 1b) and conducting several experiments.

Initial studies on the performance of multi-class models in the context of artifact and polyp data are presented in [36]. Here, the CVC-ColonDB and CVC-ClinicDB polyp datasets were extended to also include lumen, specularity, and dark borders, in order to perform semantic segmentation of the endoscopic sequences using fully convolutional neural networks (FCN8). Their goal is to set a benchmark for endoluminal scene segmentation. The main difference with our work is that we are not interested in evaluating the performance of the model on endoluminal scene segmentation, but on polyp localization. One of our objectives is to provide a deep analysis of how polyp localization is affected by the presence of a larger number of artifacts; in our case, the model's performance on artifacts is not the main focus. Also, our approach is oriented to models specific to object classification and localization (i.e. region-based neural networks), in contrast with the semantic segmentation model employed in the mentioned work. However, it is worth mentioning that [36] also presents a general analysis of how the number of classes influences polyp localization performance, where the localization rate is given as a function of the intersection over union (IoU) between the predicted and the ground truth segmentation masks. Inspired by this analysis, we performed experiments training our polyp detector with different numbers of artifact classes. However, in contrast with their work, we considered seven artifact classes (instead of three), where the selected/excluded classes are chosen based on their influence on polyp detection (see Section 4.6).

2.3 Multi-task Learning

In our context, we refer to multi-task learning (MTL) as the set of approaches that use auxiliary (or related) tasks to improve an original primary task. The works presented in [13] give examples of MTL in different areas. In the medical domain, these approaches have led to improvements, for example, in mortality rate prediction for pneumonia patients. The main inputs for this model are patient characteristics, such as age or whether the patient presents certain symptoms. By ensuring that the model predicts not only the mortality rate but also related outputs such as white blood cell count, the hidden layers of the model were biased to better capture certain characteristics of the patient, leading to better model performance. A recent work presented in [40] uses the correlation between the severity of diabetic retinopathy and the lesions present in the eye. The authors propose a model for disease grading and multi-lesion segmentation that allows both tasks to collaborate to improve each other. Their proposal also permits the segmentation model to train in a semi-supervised way. This is done by using the segmentor's predictions as input for an attentive model used by the grading network. At the same time, the attention maps generated by this attentive model are used as pseudo-masks for training the segmentor on unannotated images.

3 Methodology

We aim to verify that adding artifact information can improve the capability of CNNs to localize polyps. Multi-class and MTL approaches are natural choices for this problem, since they have the potential to improve a model's performance by jointly learning different tasks. In this section, we explain the different strategies employed to include artifact information. Next, we introduce our novel approach for polyp localization, which incorporates artifact information through learning without forgetting (LwF).

3.1 Problem Formulation

Given an input frame $x$ from a colonoscopy video sequence, our main objective is to improve the polyp detection performance of a model by including information about the artifacts present in $x$.

We define single-task models as $f_{\theta_t}$, with $\theta_t$ the model parameters trained for a specific task $t$. Similarly, MTL models are represented as $f_{\theta_s, \theta_1, \ldots, \theta_n}$, with $\theta_s$ a set of shared parameters across all tasks and $\theta_1, \ldots, \theta_n$ task-specific parameters. We consider two related tasks: polyp localization ($t = p$) and artifact detection ($t = a$).

The polyp detector outputs a set of bounding boxes $B = \{b_1, \ldots, b_N\}$, with a box $b_i = (x_1, y_1, x_2, y_2)$ represented by the coordinates of its top left and bottom right corners. Each $b_i$ describes the location of a polyp in $x$, and $N$ is the total number of detections. It trains on a polyp dataset $D_p$ with $M_p$ the number of samples.

The artifact detector gives us an equivalent set of artifact locations $B' = \{b'_1, \ldots, b'_{N'}\}$ and an additional set of artifact classes $C = \{c_1, \ldots, c_{N'}\}$, with $N'$ the total number of artifacts found in $x$. Here, $c_i$ indicates the type of artifact in the box $b'_i$. Artifact types are blur, bubbles, contrast, instrument, specularity, saturation, and miscellaneous (misc.) artifacts. The dataset for training this detector is defined as $D_a$ with $M_a$ the number of samples. All our models build on the RetinaNet architecture, described next.
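As an illustration, the two detector outputs can be represented with a small data structure; the names below are ours, not from the original implementation, and the single-class polyp task simply leaves the label unset:

```python
from dataclasses import dataclass
from typing import List, Optional

# The seven artifact types considered in this work.
ARTIFACT_CLASSES = ["blur", "bubbles", "contrast", "instrument",
                    "specularity", "saturation", "misc"]

@dataclass
class Detection:
    """A bounding box with (x1, y1) the top-left and (x2, y2) the
    bottom-right corner, plus a confidence score and optional class."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    label: Optional[str] = None  # None for the single-class polyp task

    def area(self) -> float:
        """Pixel area of the box (0 for degenerate boxes)."""
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

# The polyp detector returns unlabeled boxes; the artifact detector labels them.
polyp_dets: List[Detection] = [Detection(10, 20, 60, 80, score=0.9)]
artifact_dets: List[Detection] = [
    Detection(0, 0, 50, 40, score=0.7, label="specularity"),
]
```

The `area` helper is used later when measuring how much of an image a given artifact class covers.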

3.2 RetinaNet Base Model

RetinaNet ([26]) consists of a backbone network for extracting convolutional feature maps and two subnetworks that perform object classification (classification subnetwork) and bounding box regression (regression subnetwork) via convolutional layers. The classification and regression losses are given by the focal loss and the smooth L1 loss, respectively. It is a one-stage method, meaning that it does not require a region proposal module as in [17, 18, 29]. Instead, anchors at different scales and aspect ratios are densely distributed across the image and classified by the network. To construct a multi-scale feature pyramid from a single resolution input image, the backbone network is augmented by a feature pyramid network (FPN, [25]).

The focal loss is used as the classification loss. It is a modification of the cross-entropy loss that adds weighting parameters to prevent one-stage detectors from being swamped by the great number of easy background anchors. To address this imbalance, focal loss introduces a weighting factor of $(1 - p_t)^\gamma$, with $p_t$ given by [26]:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \qquad (1)$$
where $p$ is the estimated probability for a detection. This factor reduces the importance of easy anchors, and $\gamma \geq 0$ is a hyper-parameter that controls the extent to which the loss focuses on hard examples. For $\gamma = 0$, the focal loss is the same as the cross-entropy loss. Focal loss incorporates another weighting factor, $\alpha \in [0, 1]$, to address the class imbalance between background and foreground anchors. Foreground anchors are weighted by $\alpha$ and background anchors by $1 - \alpha$. We define $\alpha_t$ analogously to $p_t$ in equation 1. Then, the focal loss is given by [26]:

$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \qquad (2)$$
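A minimal sketch of the binary focal loss for a single anchor may help fix the definitions; the parameter defaults below are those reported in [26], and the values used for the detectors in this work may differ:

```python
import math

def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted foreground probability and y in {0, 1} is the
    anchor label. alpha weighs foreground vs. background anchors; gamma
    controls how strongly easy anchors are down-weighted.
    (Illustrative sketch; defaults alpha=0.25, gamma=2 follow [26].)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=1` the loss reduces to the plain cross-entropy of a single anchor; with `gamma > 0`, an easy anchor (large `p_t`) contributes almost nothing to the total loss.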
3.3 Single-class and Multi-class Models

Using RetinaNet as the base model, we define two single-task models for polyp and artifact detection, respectively. We use these models to evaluate the impact of artifacts present in the image (see Section 4.4). The base polyp detector uses a standard RetinaNet architecture with a ResNet model pre-trained on ImageNet as the backbone network.

We take our base artifact detector from our submission to the Endoscopic Artefact Detection (EAD) 2019 challenge, a competition held at the 2019 IEEE International Symposium on Biomedical Imaging in Venice. This model obtained a score of 0.345, ranking third in the competition. It is composed of an ensemble of seven models, each one based on RetinaNet. We trained the models considering different classification/regression loss weightings, focal loss parameters, augmentations, and backbone configurations (ResNet 50, 101, and 152). We use the best performing model of the ensemble as the base artifact detector in this study.

Having defined our two base models, we tested three straightforward methods for using artifact information in the training process of the polyp detector. The first (and simplest) approach is using the artifact detector's weights to initialize the polyp detector. For the two remaining methods, we extended the set of artifact types to include an additional class for polyps (see Section 3.4.1 and Fig. 2a). Then, we train a new model for this multi-class (MCL) localization problem (Fig. 2b). We compare networks trained with a regular vs. a class-weighted loss function. Sections 4.5.1, 4.5.2, and 4.5.3 show experiments and results with these kinds of models.

3.4 Multi-task Learning for Polyp Localization

In this section, we describe our MTL approach for polyp and artifact localization. Our proposal is inspired by the learning without forgetting ideas. We first present an overview of this method in the context of our colonoscopy problem. Then, we describe the changes made to the RetinaNet architecture to include the polyp detection task in an initial artifact detector. Experiments and results with this approach are discussed in Section 4.5.4.

Learning Without Forgetting

Learning without Forgetting (LwF) is an MTL strategy that allows teaching an additional task to an existing model previously trained on a related task ([24]). One of the main advantages of LwF is that, to extend the capabilities of a model, only the training data of the new task is necessary. In our case, to train the MTL model, we need an initial artifact detector, with its task-specific parameters, and a dataset for our new polyp detection task.

Figure 2: a) We use the artifact detector to obtain artifact boxes in each image of the polyp-only dataset. Then, we define a multi-class/task dataset that contains both artifacts and polyps. b) A multi-class RetinaNet model for localization in colonoscopy. Polyps are considered as an additional "artifact" class. c) A multi-task RetinaNet. We divided the parameters of the model into shared and task-specific groups. Then, we add an additional subnetwork for polyp/non-polyp classification.

Following the LwF method, we first run our base artifact detector on all the images of the polyp dataset to generate a set of artifact boxes. Then, we incorporate these predictions into the polyp dataset as ground truth for artifacts. In this way, we obtain an extended dataset that is suitable for training models on both tasks. This is represented in Fig. 2a. In our experiments, this extended dataset is used for evaluating the artifact influence on polyp detection, and for training all our MCL and MTL models.
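The dataset extension step above can be sketched as a pseudo-labeling pass; the function and field names below are our own, and the 0.25 confidence threshold is an assumption for illustration:

```python
def extend_with_artifacts(polyp_dataset, artifact_detector, score_thr=0.25):
    """Add predicted artifact boxes to each polyp-annotated image.

    `polyp_dataset` is a list of (image, polyp_boxes) pairs and
    `artifact_detector` maps an image to (boxes, labels, scores).
    Returns a multi-task dataset holding both annotation types:
    human-labeled polyps and pseudo ground-truth artifacts.
    (Illustrative sketch; names and threshold are our choices.)
    """
    extended = []
    for image, polyp_boxes in polyp_dataset:
        boxes, labels, scores = artifact_detector(image)
        # Keep only confident artifact predictions as pseudo labels.
        kept = [(b, l) for b, l, s in zip(boxes, labels, scores)
                if s >= score_thr]
        extended.append({
            "image": image,
            "polyps": polyp_boxes,                     # human ground truth
            "artifacts": [b for b, _ in kept],         # pseudo ground truth
            "artifact_labels": [l for _, l in kept],
        })
    return extended
```

A design point worth noting: the artifact annotations are never hand-corrected, so any systematic error of the base artifact detector propagates into the multi-task training signal.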

Continuing with the LwF strategy, we can now use this extended dataset to train a MTL model with shared and task-specific parameters using the following loss function, where for simplicity we set the task weights $\lambda_a = \lambda_p = 1$:

$$\mathcal{L} = \lambda_a \mathcal{L}_a + \lambda_p \mathcal{L}_p + \mathcal{R}(\theta_s, \theta_a, \theta_p) \qquad (3)$$
with $\mathcal{R}$ a weight regularizer and $\mathcal{L}_a$, $\mathcal{L}_p$ the task-specific losses. This way, the model can learn a new task while keeping its performance on the initial one. Also, note that for the polyp detection task there is only one possible class. In [24], the two tasks have different loss functions, where the loss function for the initial task is the knowledge distillation loss. This loss is effective in encouraging two networks to produce the same output ([20]). Given that our goal does not require maintaining good performance on the related task, we opted for using the same loss function, the focal loss, for both the initial and the new tasks.
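A compact sketch of this combined loss, with per-anchor focal losses for both heads and an L2 weight penalty standing in for the regularizer (all names and the weight-decay constant are illustrative assumptions):

```python
import math

def focal(p, y, alpha=0.25, gamma=2.0):
    """Per-anchor binary focal loss (defaults from the RetinaNet paper)."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def lwf_loss(artifact_preds, artifact_targets,
             polyp_preds, polyp_targets,
             weights, lam_a=1.0, lam_p=1.0, wd=1e-4):
    """Combined loss of eq. (3): the artifact and polyp focal losses
    are summed with task weights plus an L2 weight regularizer.
    (Sketch; in practice each term is averaged over anchors and the
    regularizer covers shared and task-specific parameters.)
    """
    loss_a = sum(focal(p, y) for p, y in zip(artifact_preds, artifact_targets))
    loss_p = sum(focal(p, y) for p, y in zip(polyp_preds, polyp_targets))
    reg = wd * sum(w * w for w in weights)
    return lam_a * loss_a + lam_p * loss_p + reg
```

Using the same focal loss for both heads, rather than a distillation loss for the initial task, is exactly the simplification discussed in the text.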

RetinaNet for Multi-task Learning

The LwF method requires incorporating task-specific parameters for our new polyp detection problem. For this, the RetinaNet architecture can naturally be extended to allow a set of shared and task-specific parameters. We selected the parameters of the backbone and feature pyramid networks as our group $\theta_s$, shared across all tasks. Then, for each task we want to include, we can add an additional classification subnetwork with its own parameters. For our MTL problem, the model has three subnetworks in total: the regression subnetwork with parameters $\theta_r$, and two classification subnetworks for polyps and artifacts with parameters $\theta_p$ and $\theta_a$, respectively. Each of these components has its own loss function. We can summarize our MTL RetinaNet architecture as:

$$f(x) = \big( f_{\theta_r}(z),\; f_{\theta_a}(z),\; f_{\theta_p}(z) \big), \qquad z = f_{\theta_s}(x) \qquad (4)$$
Note that the regression subnetwork does not differentiate between polyp and artifact boxes, since its main task is to detect general object locations. We keep the original RetinaNet loss for the regression task and use the LwF loss (eq. 3) for the classification subnetworks. A visualization of this framework is given in Fig. 2c.
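The three-subnetwork layout can be sketched structurally as follows. This is a framework-agnostic illustration of the wiring only: the "heads" here are plain callables standing in for the convolutional subnetworks, and all names are ours:

```python
class MultiTaskRetinaNet:
    """Structural sketch of the MTL RetinaNet: one shared backbone+FPN,
    one task-agnostic box-regression head, and one classification head
    per task (artifacts and polyps)."""

    def __init__(self, backbone_fpn, regression_head, artifact_head, polyp_head):
        self.backbone_fpn = backbone_fpn        # shared parameters theta_s
        self.regression_head = regression_head  # theta_r (boxes, task-agnostic)
        self.artifact_head = artifact_head      # theta_a (7 artifact classes)
        self.polyp_head = polyp_head            # theta_p (polyp / non-polyp)

    def __call__(self, image):
        features = self.backbone_fpn(image)     # z = f_theta_s(x)
        return {
            "boxes": self.regression_head(features),
            "artifact_scores": self.artifact_head(features),
            "polyp_scores": self.polyp_head(features),
        }

# Toy usage with stand-in callables instead of real conv stacks:
model = MultiTaskRetinaNet(
    backbone_fpn=lambda img: img,               # identity "feature extractor"
    regression_head=lambda f: [(0, 0, 10, 10)],
    artifact_head=lambda f: [0.1] * 7,          # one score per artifact class
    polyp_head=lambda f: [0.8],                 # single polyp score
)
out = model("frame")
```

In a real implementation, each head would be a small convolutional subnetwork applied to every FPN level, as in the original RetinaNet.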

4 Experiments and Results

We performed several experiments to evaluate the importance of artifact data in the polyp localization network. Our experiments are divided into two groups. First, Section 4.4 evaluates the correlation between artifact presence and polyp detection performance, and the differences in performance when an artifact is present or overlaps with polyp ground truth and predictions. Section 4.5 is dedicated to the experiments that include artifact information in the training process of the model, including our MTL method.

4.1 Implementation Details

The base polyp detector uses fixed values of $\gamma$ and $\alpha$ for the focal loss. We also employed a ResNet model pre-trained on ImageNet as the backbone network for RetinaNet.

Except where stated otherwise, all our RetinaNet experiments were trained with the Adam optimizer, decaying the learning rate by a factor of 10 when improvements between epochs become minor. Basic data augmentation was applied: a randomized combination of rotation, translation, shear, scaling, and flipping. Each image is rotated, translated, and sheared by a factor of -0.1 to 0.1, scaled between 0.9 and 1.1 of its original size, and flipped horizontally and vertically, each with a probability of 50%.
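The augmentation parameters above can be sampled as follows; this only sketches the parameter sampling (applying the transform to an image and its boxes is framework-specific), and the field names are our own:

```python
import random

def sample_augmentation(rng: random.Random) -> dict:
    """Sample one augmentation configuration as described in the text:
    rotation/translation/shear factors in [-0.1, 0.1], scale in
    [0.9, 1.1], and each flip applied with probability 0.5."""
    return {
        "rotation": rng.uniform(-0.1, 0.1),
        "translation": rng.uniform(-0.1, 0.1),
        "shear": rng.uniform(-0.1, 0.1),
        "scale": rng.uniform(0.9, 1.1),
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
    }
```

Note that box annotations (polyps and artifacts) must be transformed with the same sampled parameters as the image.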

4.2 Datasets

Since we take our base artifact detector from our EAD submission, this model uses the corresponding EAD artifact dataset for training. The data was obtained from six different centres and includes still frames from multiple tissues, light modalities, and populations. The training data consists of 2000 mixed frames, from the different modalities, tissues, and populations, provided by four different centres.

Given a trained artifact detector, the Learning without Forgetting framework requires only data for the new polyp detection task. For this, we rely on two publicly available polyp datasets:

  • The CVC-ClinicDB dataset is a colonoscopy dataset that consists of 612 colonoscopy frames taken from a total of 29 video sequences ([9, 10]). The sequences come from routine colonoscopies and were selected to represent as much variation in polyp appearance as possible. The whole dataset contains 31 polyps, and every image contains at least one polyp.

  • The ETIS-Larib is a polyp dataset that was used as the testing dataset in the still frame analysis task in [10]. The dataset consists of 196 high definition frames that were selected from 34 sequences. In total the dataset contains 44 different polyps and all frames contain at least one polyp.

We use a three-fold cross-validation scheme on the CVC-Clinic dataset in our experiments. The ETIS dataset was reserved for testing and comparison purposes. The ASU-Mayo Clinic Colonoscopy Video (c) Database is copyrighted, and permission must be requested to obtain and use it. Although we requested access to this set, we had received no answer by the time of writing, and hence it was not considered in our experiments.

4.3 Metrics

We follow the validation framework for still frame analysis in [10], where scores are reported by giving the number of true positives, false positives, false negatives, precision, recall, F1-score, and F2-score. A true positive is any detection whose bounding box centroid lies within the polyp ground-truth mask. False positives are detections whose centroid lies outside of the polyp mask. Each ground-truth polyp can only have one true positive; thus, each additional detection of a polyp that has already been detected counts as a false positive. False negatives are all ground-truth polyps that have not been detected.
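The centroid-based protocol can be sketched as follows. This is our reading of the described rules, not the official evaluation tool; ground-truth masks are represented here simply as sets of (x, y) pixel coordinates:

```python
def evaluate_detections(detections, gt_masks):
    """Centroid-based scoring following the protocol of [10]: a
    detection is a true positive if its box centroid falls inside a
    not-yet-matched ground-truth polyp mask; otherwise it is a false
    positive. Unmatched masks are false negatives.
    `detections` holds (x1, y1, x2, y2) boxes; `gt_masks` holds sets
    of (x, y) pixels. (Illustrative sketch of the described rules.)
    """
    matched = [False] * len(gt_masks)
    tp = fp = 0
    for (x1, y1, x2, y2) in detections:
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        hit = False
        for i, mask in enumerate(gt_masks):
            if not matched[i] and (cx, cy) in mask:
                matched[i] = True  # each polyp yields at most one TP
                tp += 1
                hit = True
                break
        if not hit:
            fp += 1  # includes re-detections of already-matched polyps
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # F2 weighs recall higher than precision: (1+2^2)PR / (2^2 P + R).
    f2 = 5 * precision * recall / (4 * precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "F1": f1, "F2": f2}
```

Note that duplicate detections of the same polyp lower precision but not recall, which matters for the per-image analyses that follow.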

4.4 Effects of Artifacts on Polyp Detection

Before incorporating artifact data into the model, we performed experiments to determine to what extent artifacts affect polyp detection. The experiments of this section were conducted on the extended version of the testing dataset (ETIS). Section 3.4.1 gives details about how a polyp dataset is extended to obtain a multi-task dataset. We trained our base polyp detector on the CVC dataset and then used it to predict polyp locations in the ETIS dataset. The following experiments describe the relations between the polyp predictions and the artifacts present in the extended version of ETIS.

Correlation between Polyp Predictions and Artifact Covered Area

We observed the correlation between the area covered by an artifact (number of pixels inside an artifact region) in an image and the polyp detection precision and recall on the same image. Since most images contain only one polyp, recall on them will be either 0 or 1. An alternative would have been to observe the correlation between the number of artifacts and precision and recall, but we believe that the area gives a more accurate view of how prominent a given artifact is in an image. With this analysis, we expect to get a first view of whether there is a predictable change in polyp detection performance whenever a given artifact is present in an image. Since our aim at this stage is an overview of which artifacts correlate with polyp detection performance, rather than a deeper analysis, we set both the polyp detection and artifact detection probability thresholds to 0.25.
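The per-image quantities behind this analysis are two aligned vectors (artifact area, detection metric), and their correlation can be computed with a plain Pearson coefficient; this helper is illustrative, and any statistics library would do:

```python
def pearson_correlation(xs, ys):
    """Pearson correlation between per-image artifact area and a
    per-image detection metric (precision or recall), as used for the
    area-vs-performance analysis. Returns 0 for degenerate inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Toy example: artifact area in pixels per image vs. binary recall.
areas = [0, 120, 300, 450, 10]
recalls = [0, 1, 1, 1, 0]
r = pearson_correlation(areas, recalls)
```

Because recall is effectively binary on single-polyp images, this correlation behaves like a point-biserial coefficient between artifact area and "polyp found / not found".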

Figure 3: Correlation between the area covered by artifacts in an image and precision and recall of polyp detection.

From the results in Fig. 3, most artifact classes have no significant correlation with the metrics. In particular, blur, specularity, saturation, and instrument all have a correlation of less than 0.1 with the polyp detection results. The most significant correlation can be seen between contrast and recall, which shows a positive correlation of 0.29. This suggests that images with greater contrast (and thus with darker areas) lead to a greater ability to find polyps. Misc. artifacts correlate negatively (-0.13) with precision, which could hint towards such artifacts being misclassified as polyps, increasing the number of false positives. Bubbles, on the other hand, correlate positively (0.13) with precision. An initial assumption would be that bubbles, which have a similar appearance to polyps, can easily be misclassified as polyps. However, our findings seem to contradict this assumption.

Artifact Distribution

We plotted the distribution of the artifacts for images where all polyps were found, and for those where none were found. Note that 190 out of 196 frames contain only one polyp, so there are only six frames for which recall could differ from 0 or 1. For precision, we can have images with no false positives (precision 1) and images where false detections were made (precision < 1). We looked at polyp and artifact detections taken at a 0.25 threshold.

We plot the contrast area values, divided by 1000, against the recall, since in the last experiment this artifact showed significant differences in this metric. At the selected threshold, there are 154 images with a recall of 1 and 37 images with a recall of 0. The five images where recall is between 0 and 1 have been excluded. Fig. 4 shows the results.

Figure 4: Boxplot distributions for contrast area for both frames with a recall of 0 and 1. The area is given by the total number of pixels inside all the artifact boxes in a given image.

The median contrast area for frames where the polyp is found is 149, and the 75th percentile extends to 376. This median value of 149 corresponds to around 13% of the image area of the ETIS images. In images where the polyp remains undetected, the median area is 0, whereas the 75th percentile only reaches 75. This suggests, as was indicated by the correlation analysis, that contrast is a positive indicator for polyp detection ability.

For precision, the misc. artifact and bubbles area distributions differ the most. There are 53 images where precision is 1 and 143 where precision is below 1. The artifact distributions for precision are presented in Fig. 5.

Figure 5: Boxplot distributions of misc. artifact area and bubbles area for frames with a precision of 1 and below 1. The area is given by the total number of pixels inside all the artifact boxes in a given image.

The correlation analysis hinted at a negative correlation between misc. artifacts and precision. Indeed, Fig. 5 shows that images with false positives have misc. artifact areas whose 75th percentile extends to 28, compared to 11 for images with precision 1. For bubbles, areas seem greater in images where precision is 1, showing that bubbles do not necessarily lead to more false positives.

Artifact Presence vs. Performance

We continue our analysis by investigating how polyp detection performance differs given the presence of artifacts. We took the subset of images that contain a determined artifact and compared the scores to the subset of images where the given artifact is not present. Table 1 shows these differences in performance for different choices of artifacts. To consider that an artifact is present in the image, we empirically define different area thresholds for the different artifacts (see Fig. 6). The area is computed based on the total number of pixels covered by all the boxes of the same artifact. For blur, the threshold is set at 50% of the entire image size. For misc. artifact and bubbles, we select 2% of the image area as a threshold. For specularity, which is effectively present in all (99%) of the images, we select a threshold of 5%, to cover only images with a high amount of specularity. At this threshold, specularity is still present in 56% of images. No thresholds are set for contrast and saturation. We took artifact detections at a confidence score threshold of 0.25 and polyp detections at 0.5. We selected a higher threshold (0.5) for polyps than in previous experiments to learn how artifacts affect polyp detection when polyps are taken at a threshold that is optimized for polyp detection metrics.
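The presence rule described above can be sketched as follows; the threshold table transcribes the fractions given in the text, while the helper name and box summation (which ignores box overlap) are our simplifications:

```python
# Area thresholds as a fraction of the image area, from the text.
# Contrast and saturation have no threshold: any detected box counts.
PRESENCE_THRESHOLDS = {
    "blur": 0.50,
    "misc": 0.02,
    "bubbles": 0.02,
    "specularity": 0.05,
    "contrast": 0.0,
    "saturation": 0.0,
}

def artifact_present(boxes, image_area, artifact_type):
    """Decide whether one artifact type is 'present' in an image by
    summing the pixel area of all its boxes and comparing against the
    type's threshold. (Sketch; summing overlapping boxes overcounts
    slightly compared to counting covered pixels.)"""
    total = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    return total > PRESENCE_THRESHOLDS[artifact_type] * image_area
```

The per-image presence flags are then used to split the test set into "artifact present" and "artifact absent" subsets and compare their scores, as in Table 1.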

Figure 6: An artifact is considered to be present if it meets an area threshold in relation to the total area of the image. a) Different artifacts that do not meet the area thresholds. Those boxes are not considered. b) Samples where the total amount of artifacts meet the area condition.
Score Difference (%)
Artifact type Frequency (%) recall precision F1 F2
bubbles 37.2 -6.5 2.7 -2.4 -4.9
blur 48.0 -10.6 -0.6 -6.2 -9.0
misc. 23.5 12.2 10.8 11.6 12.0
specularity 55.6 -11.1 -11.1 -11.1 -11.1
saturation 59.2 -8.3 -5.0 -6.8 -7.8
contrast 69.4 41.9 6.3 24.9 35.3
Table 1: Differences in performance for images where the artifact is present vs. images where it is not present. Frequency gives the share of images that contain the given artifact.

A similar experiment is conducted in [10]. There, a different dataset is inspected and all images are labeled by endoscopic experts. That study, however, only contains statistics for specularity, overlay information, saturation, luminal regions, and general low visibility. Our experiments thus cover a broader range of artifacts and define them more specifically.

Our previous observations regarding contrast are confirmed by the results in Table 1. The recall is 42% higher in the 69% of images containing contrast predictions. Precision is also slightly higher, leading to an F1-score that is improved by 25%. The misc. artifact class, however, no longer appears to affect precision negatively. Previously we had argued that misc. artifacts may be frequently misclassified as polyps. However, this study shows that the 24% of images containing a certain area of misc. artifacts had a 12% higher recall and 11% higher precision. We repeated this study selecting polyp detections at a threshold of 0.25. In this second test, the precision for images with misc. artifacts was 22% lower. This suggests that the lower precision observed in previous experiments may indeed be due to misc. artifacts being misclassified as polyps, but only with confidence scores below 0.5. Images with a given amount of specularity, on which no findings had been made so far, show recall and precision that are both lowered by 11%. This may be explained by the fact that some specular regions are misclassified as polyps, and that polyps containing too many whitened-out regions become harder to detect. Blurred images have a recall that is 11% lower, which seems natural, as polyp features are likely harder to extract in blurred regions. Fig. 7 shows examples of polyp predictions together with blur and misc. artifact boxes. Similarly, Fig. 8 shows polyp detections and ground truth in the presence of blur and specularity.

Figure 7: Polyp ground truth (green boxes) and polyp predictions (red boxes) compared with artifact presence: a) the blue boxes are blur predictions; and b) the yellow boxes represent misc. artifact predictions. Green boxes without an overlapping red box indicate a false negative polyp detection (e.g. first figure of each row). Best viewed in color.
Figure 8: Polyp ground truth (green boxes) and polyp predictions (red boxes) compared with individual artifact presence: a) shows different samples containing bubbles (black box); and b) shows samples containing specularities (pink box). Green boxes without an overlapping red box indicate a false negative polyp detection (e.g. first figure of each row). Best viewed in color.

Artifacts Overlapping Polyp Detection

Share of polyps overlapping artifacts (%)
Polyp type Frequency Any artifact Bubbles Blur Misc. artifact Specularity Saturation Contrast
ground-truth 208 15.4 3.4 1.0 0.5 4.8 7.7 1.0
true positives 137 14.6 2.2 1.5 0.0 3.6 7.3 2.2
false positives 39 41.0 5.1 30.8 0.0 5.1 5.1 2.6
false negatives 63 14.3 6.3 1.6 1.6 6.3 3.2 0.0
Table 2: The share of ground-truth polyps, true positives, false positives, and false negatives that overlap different artifacts. Frequency is the count of these polyp types.
Share of polyps containing artifacts (%)
Polyp type Frequency Any artifact Bubbles Blur Misc. artifact Specularity Saturation Contrast
ground-truth 208 64.9 14.4 0.0 7.2 60.6 3.8 0.0
true positives 137 77.4 11.7 0.0 10.2 75.9 3.6 0.0
false positives 39 82.1 23.1 15.4 25.6 79.5 5.1 0.0
false negatives 63 49.2 19.0 0.0 6.3 41.3 1.6 0.0
Table 3: The share of ground-truth polyps, true positives, false positives, and false negatives that contain different artifacts. Frequency is the count of these polyp types.
Share of polyps inside of artifacts (%)
Polyp type Frequency Any artifact Bubbles Blur Misc. artifact Specularity Saturation Contrast
ground-truth 208 55.8 0.5 0.5 50.5 1.0 4.8 4.8
true positives 137 52.6 0.7 0.0 46.7 0.7 6.6 2.9
false positives 39 28.2 0.0 0.0 23.1 0.0 0.0 7.7
false negatives 63 50.8 0.0 1.6 49.2 1.6 1.6 4.8
Table 4: The share of ground-truth polyps, true positives, false positives, and false negatives that are surrounded by different artifacts. Frequency is the count of these polyp types.

We next evaluate the amount of overlap between artifact detections and polyp ground truth, as well as true positive, false positive, and false negative predictions. For each of these prediction types, we counted how many times they overlap with an artifact, how many times a given artifact is inside of them, and how many times they are surrounded by one (see Fig. 9). For the sake of this analysis, we did not penalize when a ground-truth polyp is detected more than once. We consider this analysis useful to deduce how likely certain artifacts are to be misclassified as polyps. We consider two bounding boxes to overlap if their intersection-over-union (IoU) is greater than 0.5. For this experiment, we take artifacts at a threshold of 0.25 and polyps at a threshold of 0.5.
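The three spatial relations analysed in Tables 2-4 can be sketched as follows. This is a minimal illustration under the IoU > 0.5 rule stated above; boxes are (x1, y1, x2, y2), and the function names are ours, not the paper's:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def contains(outer, inner):
    """True if box `inner` lies completely within box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def relation(polyp, artifact, iou_thresh=0.5):
    """Classify a polyp/artifact box pair into one of the analysed relations."""
    if contains(polyp, artifact):
        return "contains"   # artifact inside the polyp box (Table 3)
    if contains(artifact, polyp):
        return "inside"     # polyp surrounded by the artifact (Table 4)
    if iou(polyp, artifact) > iou_thresh:
        return "overlap"    # partial overlap with IoU > 0.5 (Table 2)
    return "none"
```

Containment is checked before overlap, since a fully contained box trivially intersects the other; the ordering reflects the three mutually exclusive columns of the tables.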

Figure 9: Overlap analysis (best viewed in color). The green box represents a polyp detection. We evaluate artifact bounding boxes that overlap with polyp boxes (yellow box), are contained by a polyp box (blue box), and contain a polyp box inside (red box).

We first evaluate the overlap. The results are shown in Table 2. We can see that 15.4% of polyp ground-truth boxes overlap with artifacts. This can have two possible reasons. First, polyps may be covered to a great extent by bubbles, specularity, or saturation (as is mostly the case in the table). Also, some polyps may be misclassified by our artifact model as misc. artifacts. We observed that false positives overlap with artifacts more frequently (41.0%) than ground truth does (15.4%). This suggests that a significant number of false positives are artifacts misclassified as polyps. Results show that 30.8% of false positives overlap with blur detections; for ground truth, this only happens in 1% of cases. Thus, the features that the artifact model learned to detect as blurred regions are similar to those the polyp detector considers relevant for polyps. False negatives overlap artifacts in 14.3% of cases, which is similar to the result for ground-truth regions. This suggests that overlapping with (or looking similar to) artifacts is not necessarily a cause for polyps to remain undetected. Bubbles overlap false positives and false negatives more frequently than ground-truth regions. This is an indication that bubbles are sometimes misclassified as polyps and that bubbles overlapping polyp areas increase the chance of them being undetected (see Fig. 8a).

Next, we compared the number of polyp detections containing a given artifact inside of them. Having an artifact inside means that an artifact bounding box is completely contained in the polyp bounding box. This way we expect to understand what influence artifacts inside a polyp (e.g. specularity) have on detection capabilities. In Table 3, 64.9% of the ground-truth polyps have artifacts inside. True positives contain artifacts 77.4% of the time, suggesting that polyps with artifacts inside of them are more likely to be detected. Only 49.2% of false negatives have artifacts inside of them, indicating that the absence of artifacts inside a polyp increases the risk of missing it (Fig. 10). This is mostly driven by specularity, which is present in 60.6% of ground-truth regions but only 41.3% of false negatives. On the other hand, regions with misc. artifacts, bubbles, or blur have a higher likelihood of being misclassified as polyps, as false positives contain these elements at a higher rate (25.6%, 23.1%, and 15.4%, respectively) than the ground truth (7.2%, 14.4%, and 0%, respectively).

Figure 10: Different undetected polyps that do not contain artifacts inside. Green boxes represent false negatives, red boxes are polyp predictions. The remaining boxes are bubbles (black), specularity (pink), blur (blue), and saturation (brown). Best viewed in color.

The final analysis of this round of experiments evaluates how often polyp detections are inside of an artifact. We worked on the assumption that detections will frequently be inside blur, contrast, and saturation regions and that this will have an effect on detection capabilities. In Table 4 we can see that ground-truth polyps are inside of artifacts 55.8% of the time. As expected, saturation (4.8%) and contrast (4.8%) are among the most common artifacts. However, blur contains polyp ground truth only 0.5% of the time. In contrast, polyps are inside misc. artifact regions 50.5% of the time. False positives are much less frequently inside artifact regions, only 28.2% of the time. This could be explained by the fact that, due to lower visibility, the algorithm makes fewer detections in these areas. There is no significant discrepancy between the shares of false negatives, true positives, and ground-truth polyps that are inside the various artifacts.

4.5 Multi-task Learning

In this section, we investigate the effectiveness of MTL approaches in terms of their ability to incorporate artifact information into the polyp detection task. To our best knowledge, this is the first use of learning without forgetting in a polyp detection framework. Experiments in this section are the results of a 3-fold cross-validation process on the CVC dataset.

Transfer Learning

In our first approach, we looked at the most straightforward way to utilize the artifact data: using the weights of the artifact model to initialize the polyp detection model. For this, we trained an artifact detector. This artifact RetinaNet model was trained on the endoscopic dataset from the EAD 2019 competition, making it a case of transfer learning (TL) from a very similar dataset. We tried two different TL approaches: 1) fine-tuning the entire network, and 2) freezing the backbone and only fine-tuning the classification subnetwork of RetinaNet. We also compared with models pre-trained on ImageNet and COCO. We can see in Table 5 that using artifact data for pre-training yields an improvement of 1.5% in F1-score over using the more general ImageNet dataset. However, the F1-score is not superior to the score reached when using COCO pre-trained weights. Freezing the backbone leads to poor results, which could suggest that a feature extractor trained solely on artifacts does not yield enough valuable features for polyp detection.
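The two TL variants differ only in which parameter groups remain trainable. A minimal sketch, modeling the network as named parameter groups (the group names are illustrative, not the actual RetinaNet layer names):

```python
def set_trainable(param_groups, mode="finetune_all"):
    """Map each parameter-group name to a trainable flag for the two TL variants."""
    if mode == "finetune_all":        # variant 1: fine-tune the entire network
        return {name: True for name in param_groups}
    if mode == "freeze_backbone":     # variant 2: only the classification subnetwork trains
        return {name: name.startswith("class_head") for name in param_groups}
    raise ValueError(f"unknown TL mode: {mode}")

# Illustrative parameter groups of a RetinaNet-like detector
groups = ["backbone.conv1", "backbone.layer4", "fpn.p5", "box_head.conv", "class_head.conv"]
flags = set_trainable(groups, mode="freeze_backbone")
```

In an actual framework, the same effect is obtained by switching off gradient updates for the frozen groups (e.g. a `requires_grad`-style flag per parameter).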

Initial params. TL Type Precision Recall F1 F2
ImageNet initialization 0.849 0.814 0.829 0.820
COCO initialization 0.833 0.862 0.845 0.855
Artifact Model initialization 0.832 0.859 0.845 0.853
Artifact Model freeze backbone 0.836 0.766 0.794 0.776
Table 5: RetinaNet performance with respect to different TL approaches.

Multi-class Learning and Artifact Confidence Threshold

Starting at this point, our further approaches simultaneously optimize for polyp and artifact detection. Given the lack of artifact annotations in polyp datasets, we extended the CVC dataset to also include artifact labels (see section 3.4.1). These artifact labels are conditioned on a score (or probability) threshold, which ensures the predictions are strongly likely to be actual artifacts. However, there is no real way of knowing which threshold is optimal. Our work in the EAD challenge showed that the optimal threshold for our artifact base model was somewhere between 0.25 and 0.5, but this depends on the artifact type. A small threshold may lead to a large number of artifact boxes compared with polyps, thereby reducing our model's ability to learn polyp features. To get a comprehensive overview of which threshold works best, we generated different extended versions of the CVC dataset with artifact labels at thresholds 0.2, 0.4, 0.5, 0.6, and 0.8. We then used these datasets to train different multi-class models (MCL RetinaNet) for artifact and polyp detection. Only the polyp detection performance is evaluated in our experiments.
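The dataset extension step amounts to merging the polyp ground truth with artifact pseudo-labels that pass the score threshold. A minimal sketch, assuming simple tuple-based annotation formats (the function name and formats are ours):

```python
def extend_annotations(polyp_boxes, artifact_preds, score_thresh):
    """Merge polyp ground truth with artifact pseudo-labels above `score_thresh`.

    polyp_boxes:    [(x1, y1, x2, y2)]              ground-truth polyps
    artifact_preds: [(cls, score, x1, y1, x2, y2)]  artifact-model predictions
    Returns a single annotation list of (label, x1, y1, x2, y2).
    """
    merged = [("polyp", *box) for box in polyp_boxes]
    merged += [(cls, *box) for cls, score, *box in artifact_preds
               if score >= score_thresh]
    return merged
```

Raising `score_thresh` from 0.2 to 0.8 is what shrinks the "artifacts per image" count in Table 6, since fewer pseudo-labels survive the filter.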

Artifact Threshold Artifacts per Image Precision Recall F1 F2
0.2 37.8 0.146 0.515 0.225 0.336
0.4 7.7 0.237 0.487 0.310 0.392
0.5 3.6 0.413 0.690 0.516 0.608
0.6 2.0 0.722 0.640 0.673 0.652
0.8 1.2 0.927 0.721 0.810 0.754
baseline N.A. 0.849 0.814 0.829 0.820
Table 6: Cross-validation performance of MCL RetinaNet on CVC-ClinicDB with respect to different artifact score thresholds. The column "Artifacts per Image" gives the average number of artifacts annotated in each frame, which varies with the confidence score threshold selected for the artifacts.

According to the results in Table 6, artifacts taken at a lower threshold lead to lower polyp scores. This trend in F1-score is driven by precision values. Fewer (and presumably more accurate) artifacts decrease the number of false positives. It is noteworthy that the 0.8 artifact threshold leads to higher precision than the baseline, albeit at the cost of recall, suggesting that in this case the model decreases the number of false positives after learning from artifact data. In the original EAD artifact dataset, there are around 8.3 artifact annotations per image. If we suppose a similar distribution in the CVC-ClinicDB set, then an artifact threshold of 0.4 (which yields 7.7 artifacts per image) seems to be closest to reality. However, at this threshold, our model is still biased by the large number of artifacts and is not able to make reliable polyp detections. In the subsequent experiments, we only considered thresholds of 0.2 and 0.5. Although these do not lead to optimal performance, we select them because they provide more knowledge about artifacts.

Weighting Artifact and Polyps Classes

The previous experiment showed that learning from artifact data has the potential to reduce false positive rates. However, a high number of artifacts can negatively affect polyp detection performance. An intuitive solution is to introduce class weighting. We retrained the previous model, increasing the weight of the polyp class to 25%, 50%, and 75% of the classification loss, while distributing the remaining share of the weight equally between all seven artifact classes. We also increased the weight of the classification loss with respect to the regression loss with a ratio of 5:1. In addition, we repeated this experiment using our EAD artifact base model for initialization and compared it with our default ImageNet initialization. For these experiments, we trained our weighted multi-class (wMCL) RetinaNet on artifacts taken at a 0.5 threshold.
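The weighting scheme described above can be sketched as follows; this is an illustrative helper (names are ours) showing how the polyp share and the evenly split artifact shares always sum to 100%:

```python
def class_weights(polyp_share, n_artifact_classes=7, artifact_names=None):
    """Assign `polyp_share` of the classification-loss weight to the polyp
    class and split the remainder evenly over the artifact classes."""
    names = artifact_names or [f"artifact_{i}" for i in range(n_artifact_classes)]
    w_art = (1.0 - polyp_share) / len(names)
    weights = {"polyp": polyp_share}
    weights.update({name: w_art for name in names})
    return weights
```

For the 50% setting, each of the seven artifact classes receives 0.5 / 7 of the classification-loss weight; the per-class loss terms are then scaled by these weights before summation.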

Polyp Weight Pre-training Precision Recall F1 F2
wMCLa 25% ImageNet 0.774 0.726 0.749 0.735
wMCLa 50% ImageNet 0.797 0.761 0.775 0.766
wMCLa 75% ImageNet 0.752 0.765 0.744 0.752
wMCLb 25% artifact Model 0.808 0.766 0.785 0.773
wMCLb 50% artifact Model 0.777 0.789 0.782 0.786
wMCLb 75% artifact Model 0.770 0.825 0.796 0.813
no weighting ImageNet 0.413 0.690 0.516 0.608
baseline ImageNet 0.849 0.814 0.829 0.820
Table 7: Cross-validation performance of wMCL RetinaNet with respect to different class weighting and pre-training (initialization) approaches for multi-class detection on the CVC-ClinicDB dataset. The weighting for the polyp class is indicated; the remaining percentage is equally distributed between the artifact classes, so that all classes sum to 100%.

Table 7 shows the results for the wMCL models. Weighting the loss function in favor of polyps brings significant improvements compared to a non-weighted loss function. F1-scores for weighted-loss models vary between 0.744 and 0.796, compared to 0.516 for the non-weighted model. Also, artifact pre-trained weights outperform ImageNet pre-trained models. However, none of these weighted models outperforms our polyp-only baseline. A possible explanation is that having eight different classes in a single classification subnetwork complicates the optimization. An alternative approach would be to separate the tasks into two classification subnetworks that handle the loss functions for artifacts and polyps, respectively.

Double Classification Subnetwork

We increased the number of task-specific layers in our network as described in section 3.4.2 to define an MTL RetinaNet. This also allows us to weight the different loss functions. We tried four different weighting configurations. We weighted the regression and artifact classification subnetworks by 1 and assigned weights of 1, 3, 10, and 20 to the polyp classification loss. Some of these models were trained for up to 100 epochs, as they take longer to converge. We conducted these experiments on datasets that contain artifact labels taken at thresholds of 0.2 and 0.5.
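The combined objective of the MTL RetinaNet is then a weighted sum over the three subnetwork losses, written 1:1:w in Table 8's notation. A minimal sketch (the function name is ours):

```python
def mtl_loss(reg_loss, artifact_cls_loss, polyp_cls_loss, polyp_weight=20.0):
    """Weighted sum of the regression, artifact-classification, and
    polyp-classification losses (1:1:w in the table's notation)."""
    return 1.0 * reg_loss + 1.0 * artifact_cls_loss + polyp_weight * polyp_cls_loss
```

Increasing `polyp_weight` shifts the gradient budget toward the polyp head without removing the artifact supervision signal entirely.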

Loss Weights (reg:art:pol) Artifact Threshold Precision Recall F1 F2
1:1:1 0.2 0.836 0.269 0.381 0.304
1:1:3 0.2 0.841 0.484 0.604 0.525
1:1:10 0.2 0.803 0.650 0.715 0.674
1:1:20 0.2 0.831 0.677 0.739 0.699
1:1:1 0.5 0.820 0.727 0.771 0.744
1:1:3 0.5 0.802 0.755 0.773 0.761
1:1:10 0.5 0.779 0.793 0.781 0.787
1:1:20 0.5 0.796 0.818 0.804 0.812
wMCLa 50% 0.5 0.797 0.761 0.775 0.766
baseline 0.849 0.814 0.829 0.820
Table 8: Cross-validation performance for MTL RetinaNet with respect to different artifacts thresholds and weights applied to the regression (reg), artifact (art) and polyp (pol) loss functions. We also compare with the best wMCL model pre-trained on ImageNet (wMCLa 50% in Table 7)

As in previous experiments, results in Table 8 show that training on artifacts selected at a 0.2 threshold underperforms all weight configurations trained on a 0.5 artifact threshold, even when excessive weight is put on the polyp classification loss. This may hint that not only the number of artifacts is an issue but also their quality. Increasing loss weights should alleviate the problem of a high number of artifacts, but not the problem that many artifact detections may be incorrect. Further, we show that using a separate subnetwork for polyps achieves better performance than using class weights only. Lastly, none of the weight configurations, at either threshold 0.2 or 0.5, leads to an improvement of the F1-score compared with the baseline model. However, weighting the polyp loss by 20 at an artifact threshold of 0.5, we reach an F1-score that is close to the polyp-only baseline. A possible reason for this could be that not all artifacts are relevant for polyp detection. In our last experiments, we drop certain artifact classes to see how this affects polyp detection performance.

4.6 Removing artifact Classes

As previously discussed, some artifact classes are not useful for polyp detection. As we wanted to select the features that add the most knowledge relevant to polyp detection, we relied on the experiments displayed in Tables 2 and 3, which looked at how frequently polyp detections overlap or contain certain artifacts. For instance, if we know that blur is often misclassified as a polyp, or that polyps containing specularity are easier to detect, we know that these classes are useful for polyp detection. We proceeded to create four different subsets of artifacts for validation. The first subset contains only the artifact that was most prioritized, the second set contains the two most prioritized ones, and so on. We selected blur as the most important feature. In second place, we chose specularity. Next, we selected misc. artifacts, which are contained more frequently inside false positives than inside polyps. Our final artifact is bubbles. Intuitively, bubbles have polyp-like features, and our analysis has also confirmed that there is a slight tendency for bubbles to be misclassified as polyps.

We used the MTL RetinaNet with artifact thresholds at 0.2 and 0.5. We looked at the number of artifacts contained in the different datasets for these thresholds and assigned weights to the loss functions that we deemed appropriate.

Artifacts (blur / spec. / misc. / bubbles) Artifact Threshold Precision Recall F1 F2
* 0.2 0.841 0.741 0.787 0.759
* * 0.2 0.826 0.745 0.779 0.757
* * * 0.2 0.839 0.762 0.795 0.774
* * * * 0.2 0.881 0.685 0.770 0.717
* 0.5 0.870 0.808 0.836 0.819
* * 0.5 0.828 0.826 0.825 0.825
* * * 0.5 0.843 0.820 0.829 0.823
* * * * 0.5 0.859 0.790 0.821 0.802
baseline N.A. 0.849 0.814 0.829 0.820
Table 9: Cross-validation performance of MTL RetinaNet with respect to different artifacts subsets and thresholds. The star (*) in the artifact column means that the given artifact was included for this model.

We observe in Table 9 that most of the models trained on a reduced number of classes perform better than in the previous experiment. We also obtain three models with greater recall and two that match or exceed the F1-score of the baseline model. The higher recall could suggest that features extracted from artifacts are helpful for the polyp detection task.

4.7 Understanding the Effects

Share of polyps overlapping artifacts (%)
frequency bubbles blur misc. artifact specularity
polyp type polyps MTL polyps MTL polyps MTL polyps MTL polyps MTL
ground-truth 208 208 3.4 3.4 1 1 0.5 0.5 4.8 4.8
true positives 137 152 2.2 3.3 1.5 2 0 0 3.6 3.9
false positives 39 64 5.1 6.2 30.8 6.2 0 3.1 5.1 4.7
false negatives 63 64 6.3 4.7 1.6 0 1.6 1.6 6.3 6.2
Table 10: The share of ground-truth polyps, true positives, false positives, and false negatives that overlap different artifacts. For each of the four artifacts, we compare the shares for the model that trained on polyps only and for the MTL (RetinaNet) model.
Share of polyps containing artifacts (%)
frequency bubbles blur misc. artifact specularity
polyp type polyps MTL polyps MTL polyps MTL polyps MTL polyps MTL
ground-truth 208 208 14.4 14.4 0 0 7.2 7.2 60.6 60.6
true positives 137 152 11.7 17.8 0 0 10.2 10.5 75.9 79.6
false positives 39 64 23.1 15.6 15.4 1.6 25.6 17.2 79.5 64.1
false negatives 63 64 19 7.8 0 0 6.3 3.1 41.3 28.1
Table 11: The share of ground-truth polyps, true positives, false positives, and false negatives that contain different artifacts. For each of the four artifacts, we compare the shares for the model that trained on polyps only and for the MTL (RetinaNet) model.

In order to understand how and why MTL approaches improve results, we repeated some experiments from section 4.4.4. This allowed us to find out whether the model has, for instance, reduced the number of times artifacts are misclassified as polyps. We selected the MTL RetinaNet trained on blur, bubbles, misc. artifacts, and specularity, with artifacts taken at a 0.2 threshold, since it has similar settings to the experiments in section 4.4.4. We trained the model on the entire CVC set to get predictions for the ETIS test set. In Table 10 we can see that for the MTL model only 6.2% of false positives overlap with blur, compared to 30.8% for the polyp-only baseline model. This suggests that the model has successfully learned to differentiate between blurred regions and polyps. However, for bubbles, misc. artifacts, and specularity, the share of false positives overlapping them varies only slightly between the polyp-only and the MTL model. No conclusions can be drawn about a change in the ability to find polyps that look like artifacts, as the share of false negatives overlapping artifacts did not differ significantly between the two models.

There are a number of noteworthy observations in Table 11. First, for all four artifacts, the MTL model has a lower share of false positives containing them. This suggests the model has been able to make use of artifact-related features to reduce the number of times it misclassifies background regions containing these artifacts as polyps. Second, the share of false negatives that contain bubbles decreased from 19% to 7.8% for the MTL model. This means that fewer undetected polyps contain bubbles, suggesting that learning the features of bubbles has helped the algorithm detect more polyps affected by bubbles covering them. The same can be said for specularity: the share of false negatives containing specularity was reduced from 41.3% to 28.1% for the MTL model. This lets us conclude that learning from artifacts can both help avoid misclassifying affected regions as polyps and yield a higher recall on polyps containing artifacts.

4.8 Comparison against State-of-the-art

We examined how our different MCL and MTL approaches compare against the state of the art. This allowed us to compare the approaches against other methods and to see whether the improvements observed in cross-validation also hold on the test set. The models displayed in Table 12 are the best-performing ones for each experiment in sections 4.5 and 4.6. Table 12 shows that our results from cross-validation are not entirely reflected in the test set. In cross-validation, the "Reduced Classes" (section 4.6) method outperformed the baseline approach; on the test set, however, none of the MTL approaches do so. This shows that, so far, the improvements we have seen for MTL are still very small and do not generalize well. It should also be noted that scores on the test set vary considerably between epochs. As we chose the same epoch for each model to obtain their test scores, it may be that this epoch unfairly favored some models over others.

Method Precision Recall F1 F2
State of the art
([10]) 0.724 0.692 0.708 0.698
([10]) 0.697 0.630 0.662 0.642
- [31] 0.865 0.803 0.833 0.815
baseline 0.638 0.702 0.668 0.688
- TL RetinaNet 0.659 0.587 0.621 0.600
- wMCL RetinaNet 0.590 0.692 0.637 0.669
- MTL RetinaNet 0.537 0.726 0.618 0.678
- MTL RetinaNet w/ Reduced Classes 0.630 0.654 0.642 0.649
Table 12: Comparison on the ETIS-Larib dataset of our TL, wMCL, and MTL RetinaNet approaches with the state-of-the-art and the polyp-only baseline.

4.9 In-house Clinical Dataset

Share of polyps overlapping artifacts (%)
Polyp type Frequency Any artifact Bubbles Blur Misc. artifact Specularity Saturation Contrast
ground-truth 55411 15.1 1.4 2.7 1.5 2.3 3.7 2.4
true positives 41973 17.9 1.3 3.6 1.7 2.3 4.0 2.9
false positives 10435 23.8 3.0 6.1 2.7 1.9 1.3 5.8
false negatives 13438 2.5 0.5 0.4 0.4 0.5 0.9 0.6
Table 13: The share of ground-truth polyps, true positives, false positives, and false negatives that overlap different artifacts in our in-house dataset. Frequency is the count of these polyp types.
Share of polyps containing artifacts (%)
Polyp type Frequency Any artifact Bubbles Blur Misc. artifact Specularity Saturation Contrast
ground-truth 55411 73.5 9.4 0.2 26.4 72.3 11.3 0.6
true positives 41973 93.8 15.1 0.5 37.0 91.7 16.2 2.5
false positives 10435 92.0 22.7 1.7 37.8 87.9 16.7 6.1
false negatives 13438 17.9 1.8 0.0 5.1 17.5 1.3 0.0
Table 14: The share of ground-truth polyps, true positives, false positives, and false negatives that contain different artifacts in our in-house dataset. Frequency is the count of these polyp types.
Share of polyps inside of artifacts (%)
Polyp type Frequency Any artifact Bubbles Blur Misc. artifact Specularity Saturation Contrast
ground-truth 55411 24.7 0.1 18.5 3.2 0.1 0.9 5.9
true positives 41973 25.6 0.1 19.5 3.6 0.0 0.9 6.2
false positives 10435 17.9 0.1 7.6 1.3 0.0 0.1 10.6
false negatives 13438 7.7 0.0 6.1 1.0 0.0 0.3 1.7
Table 15: The share of ground-truth polyps, true positives, false positives, and false negatives that are surrounded by different artifacts in our in-house dataset. Frequency is the count of these polyp types.

We repeated the overlap experiments of section 4.4.4 on a bigger in-house dataset. This dataset was collected in the endoscopy department of Klinikum rechts der Isar, Technical University of Munich. It consists of frames obtained from 431 endoscopic videos. All frames were annotated semi-automatically, and the annotations were validated by medical students. There is only one polyp ground-truth box per frame.

Method Precision Recall F1 F2
baseline 0.801 0.757 0.779 0.766
- TL RetinaNet 0.772 0.723 0.747 0.732
- wMCL RetinaNet 0.584 0.625 0.604 0.617
- MTL RetinaNet 0.619 0.684 0.650 0.670
- MTL RetinaNet w/ Reduced Classes 0.787 0.623 0.695 0.650
Table 16: Comparison on the in-house dataset of our TL, wMCL, and MTL approaches.

Tables 13, 14, and 15 show the proportions of polyps overlapping, containing, and surrounded by artifacts in this new dataset. This offers a closer view of the influence of artifacts on polyp detection in a common clinical scenario. Considering Table 13, we can see that the share of artifacts overlapping with ground truth is similar to our initial experiments (15.4% vs. 15.1%), and this number is still lower than the overlap of artifacts with false positives (23.8%). Blur is also the most predominant artifact that overlaps with false-positive detections. The share of false negatives that overlap with any artifact is significantly lower in this dataset. Again, this supports the idea that overlapping an artifact is not a cause for polyps to be undetected. This can also be appreciated in Table 14: a high number of artifacts is contained inside polyp ground truth and true/false positive predictions, in contrast with only 17.9% of false negatives containing artifacts. This also confirms that polyps without artifacts inside are more likely to be missed. This can be a result of the shape of the polyp, which generates reflections and retains bubbles and other artifacts. Regions containing those artifacts (again, especially specularities) are then most likely to be misclassified as a polyp (Fig. 11, bottom right). Regarding the ground truth and detections inside of artifacts (Table 15), we can now see that ground-truth and true-positive detections are more frequently inside blur (18.5% and 19.5%, respectively), in contrast with Table 4, where blur accounted for only 0.5% and 0% of the same elements. This is explained by the larger number of frames in the in-house dataset, which increases the chances of having frames with motion blur. Also, blur is the most common artifact containing false negatives, with 6.1%, and the second most common containing false positives (7.6% for blur vs. 10.6% for contrast).

In general, the RetinaNet detector shows robustness up to a certain level of blur. Finally, Table 16 shows the performance of the multi-class/task learning approaches on this new dataset. We use the same models as in Table 12. In general, we can draw similar conclusions as in section 4.8 regarding the performance of the baseline and the MCL/MTL approaches.

Figure 11: Examples of artifacts and polyps predictions in our in-house dataset. Green boxes represent false negatives, red boxes are polyp predictions. The remaining boxes are bubbles (black), specularity (pink), blur (blue), and saturation (brown). Best viewed in color.

5 Conclusion

The contributions of this work are four-fold:

First, to the best of our knowledge, this is the first work that provides an in-depth analysis of how endoscopic artifacts affect polyp detection at a granular level. While [10] provided an initial proof of the influence of artifact presence on polyp detection rates, the influence of incorporating artifact information into the model was not addressed. Also, our work takes it a step further, not only by performing similar testing on a more complete set of artifacts but also by including additional experiments that show how and why the given artifacts affect polyp detection. We were able to generate a more complete set of artifacts, and we not only have image-level labels but also made use of bounding boxes for the different artifacts. By implementing such a strategy, we were able to find out how polyp detection is affected when regions overlap artifacts, when regions have artifacts inside of them, and when regions are surrounded by artifacts. Nonetheless, it has to be noted that the analysis in [10] is conducted on a much larger dataset and, compared to our method, the polyps are annotated by experts rather than by a model.

Second, this is the first attempt at using MTL in polyp detection. Previous methods have all been trained solely on polyps and not on any other related tasks. Training the model to also detect artifacts allows it to learn artifact features that are also relevant for detecting polyps. Initial experiments, which we believe leave much room for improvement, have shown that this approach is able to increase polyp detection rates.

Third, together with the work of the other teams that participated in EAD 2019, this work forms one of the earliest baselines for multi-class artifact detection. Initial works that addressed artifacts in polyp detection focused only on a single artifact, such as specularity or blur. The release of the artifact dataset enables a new validation framework for multi-class artifact detection, covering all artifacts that are present in endoscopy. Our approach ranked third in the EAD 2019 object detection challenge; the two teams that achieved a higher object detection score also relied on segmentation data for their methods.

Fourth, to the best of our knowledge, this is also the first time that the learning without forgetting approach has been applied in a framework where both the old and the new tasks are object detection tasks. Indeed, the original Learning without Forgetting approach uses image classification datasets as the old task, with class labels as the main output. In our case, the detector models output bounding boxes instead of image labels.
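As an illustration of the idea behind Learning without Forgetting [24], the distillation term that keeps the new model's responses on the old (artifact) task close to those of the frozen original model can be sketched per classification output as a temperature-softened cross-entropy. The function names and the temperature `T = 2.0` below are illustrative assumptions, not the exact configuration of our framework; in the detection setting, this term would be applied to the old task's per-anchor class scores, with the box regression outputs kept close by a separate penalty:

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T produces a softer distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, T=2.0):
    """Cross-entropy between the frozen old model's softened scores (targets)
    and the new model's softened scores, as in knowledge distillation [20]
    and Learning without Forgetting [24]. Minimal per-output sketch."""
    p_old = softmax(old_logits, T)  # soft targets from the old model
    p_new = softmax(new_logits, T)  # predictions of the new model
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))
```

The loss is minimized when the new model reproduces the old model's (softened) output distribution, which is what prevents forgetting the artifact task while the polyp task is being learned.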

One of our initial findings is that RetinaNet, which is an extremely simple framework whose main innovation is the focal loss, is well suited for both artifact and polyp detection, without any significant modification.
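For reference, the focal loss of RetinaNet [26] for a single binary prediction can be written as the following minimal scalar sketch (the default hyperparameters alpha = 0.25 and gamma = 2.0 are those reported by Lin et al.):

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive class, y: label in {0, 1}.
    The (1 - p_t)**gamma factor down-weights well-classified examples,
    focusing training on the hard ones - the key to handling the extreme
    foreground/background imbalance in dense detection.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

In a detector, this loss is summed over all anchors and normalized by the number of anchors assigned to ground-truth boxes.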

While it was already known that the presence of some artifacts in endoscopic images affects the performance of polyp detection, we have been able to dig deeper and uncover the reasons behind this. We found that certain artifacts are frequently misclassified as polyps, leading to a higher false positive rate. Moreover, our false positives contain a greater number of artifacts inside them than actual polyps do, suggesting that some artifacts (mostly blur, bubbles, and miscellaneous artifacts) tend to mislead our detector. Finally, false positives are not the only predictions affected by artifacts: we have also found that some artifacts, such as specularity, help in the detection of the polyps that carry them.

We also confirmed that by implementing an MTL system, we reduce the extent to which artifact regions are misclassified as polyps and increase the extent to which polyps containing these artifacts are detected. When training our polyp detector to also detect blur, bubbles, miscellaneous artifacts, and specularity, the share of false positives containing or overlapping with most of these artifacts decreased. A similar result was observed for false negatives, suggesting that the model improved its ability to find polyp regions that contain these artifacts inside them.

We can conclude that MTL has the potential to improve polyp detection performance and contribute to the development of automatic polyp detection systems that assist expert endoscopists in real time. This may then help to reduce polyp miss rates and potentially save lives. Overall, we slightly surpassed our polyp-only trained baseline in terms of F1-score (0.836 vs. 0.829) on the 3-fold cross-validation set. While this difference is small and the polyp-only trained model still achieves higher performance on the test set (F1-score of 0.668 vs. 0.642), we believe that MTL has the potential to advance this field even further than demonstrated in this paper. As a final remark, a bigger dataset would help to validate our results.
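The F1-scores reported above follow the standard definition as the harmonic mean of precision and recall over detection counts; a minimal sketch:

```python
def f1_score(tp, fp, fn):
    """F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```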


Acknowledgments

R. D. S. is supported by Consejo Nacional de Ciencia y Tecnología (CONACYT), Mexico. S. A. is supported by the PRIME programme of the German Academic Exchange Service (DAAD) with funds from the German Federal Ministry of Education and Research (BMBF).


  3. https://www.who.int/cancer/en/
  4. https://www.who.int/news-room/fact-sheets/detail/cancer
  5. http://endovis.grand-challenge.org
  6. https://ead2019.grand-challenge.org/


  1. M. Akbari, M. Mohrekesh, K. Najariani, N. Karimi, S. Samavi and S. R. Soroushmehr (2018) Adaptive specular reflection detection and inpainting in colonoscopy video frames. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3134–3138. Cited by: §2.2, §2.
  2. L. A. Alexandre, N. Nobre and J. Casteleiro (2008) Color and position versus texture features for endoscopic polyp detection. In 2008 International Conference on BioMedical Engineering and Informatics, Vol. 2, pp. 38–42. Cited by: §2.1.
  3. S. Ali, F. Zhou, A. Bailey, B. Braden, J. East, X. Lu and J. Rittscher (2019) A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073. Cited by: §1, §2.2, §2.
  4. S. Ali, F. Zhou, C. Daul, B. Braden, A. Bailey, S. Realdon, J. East, G. Wagnières, V. Loschenov, E. Grisan, W. Blondel and J. Rittscher (2019) Endoscopy artifact detection (EAD 2019) challenge dataset. CoRR abs/1905.03209. External Links: 1905.03209, Link Cited by: §1.
  5. S. Ameling, S. Wirth, D. Paulus, G. Lacey and F. Vilarino (2009) Texture-based polyp detection in colonoscopy. In Bildverarbeitung für die Medizin 2009, pp. 346–350. Cited by: §2.1.
  6. Q. Angermann, J. Bernal, C. Sánchez-Montes, M. Hammami, G. Fernández-Esparrach, X. Dray, O. Romain, F. J. Sánchez and A. Histace (2017) Towards real-time polyp detection in colonoscopy videos: adapting still frame-based methodologies for video sequences analysis. In Computer Assisted and Robotic Endoscopy and Clinical Image-Based Procedures, pp. 29–41. Cited by: §2.1.
  7. S. Bae and K. Yoon (2015) Polyp detection via imbalanced learning and discriminative feature learning. IEEE transactions on medical imaging 34 (11), pp. 2379–2393. Cited by: §2.1.
  8. J. Bernal, J. Sánchez and F. Vilariño (2012) Towards automatic polyp detection with a polyp appearance model. In Pattern Recognition, Cited by: §1, §1.
  9. J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez and F. Vilariño (2015) WM-dova maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43, pp. 99 – 111. External Links: Document, ISSN 0895-6111, Link Cited by: §2.1, 1st item.
  10. J. Bernal, N. Tajkbaksh, F. J. Sánchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad and I. Balasingham (2017) Comparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge. IEEE transactions on medical imaging 36 (6), pp. 1231–1249. Cited by: §1, §2.1, §2.2, §2.2, 1st item, 2nd item, §4.3, §4.4.3, Table 12, §5.
  11. P. Brandao, E. Mazomenos, G. Ciuti, R. Caliò, F. Bianchi, A. Menciassi, P. Dario, A. Koulaouzidis, A. Arezzo and D. Stoyanov (2017) Fully convolutional neural networks for polyp segmentation in colonoscopy. In Medical Imaging 2017: Computer-Aided Diagnosis, Vol. 10134, pp. 101340F. Cited by: §2.1.
  12. P. Brandao, O. Zisimopoulos, E. Mazomenos, G. Ciuti, J. Bernal, M. Visentini-Scarzanella, A. Menciassi, P. Dario, A. Koulaouzidis and A. Arezzo (2018) Towards a computed-aided diagnosis system in colonoscopy: automatic polyp segmentation using convolution neural networks. Journal of Medical Robotics Research 3 (02), pp. 1840002. Cited by: §2.1.
  13. R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §1, §2.3.
  14. F. Chadebecq, C. Tilmant and A. Bartoli (2015) How big is this neoplasia? live colonoscopic size measurement using the infocus-breakpoint. Medical Image Analysis, pp. 58–74. Cited by: §1.
  15. I. Funke, S. Bodenstedt, C. Riediger, J. Weitz and S. Speidel (2018) Generative adversarial networks for specular highlight removal in endoscopic images. In Medical Imaging 2018: Image-Guided Procedures, Robotic Interventions, and Modeling, Vol. 10576, pp. 1057604. Cited by: §2.2, §2.
  16. M. Ganz, X. Yang and G. Slabaugh (2012) Automatic segmentation of polyps in colonoscopic narrow-band imaging data. IEEE Transactions on Biomedical Engineering 59 (8), pp. 2144–2151. Cited by: §2.1.
  17. R. Girshick, J. Donahue, T. Darrell and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §3.2.
  18. R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §3.2.
  19. S. Gross, T. Stehle, A. Behrens, R. Auer, T. Aach, R. Winograd, C. Trautwein and J. Tischendorf (2009) A comparison of blood vessel features and local binary patterns for colorectal polyp classification. In Medical Imaging 2009: Computer-Aided Diagnosis, Vol. 7260, pp. 72602Q. Cited by: §2.1.
  20. G. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.4.1.
  21. S. Hwang, J. Oh, W. Tavanapong, J. Wong and P. C. De Groen (2007) Polyp detection in colonoscopy video using elliptical shape feature. In 2007 IEEE International Conference on Image Processing, Vol. 2, pp. II–465. Cited by: §2.1.
  22. D. K. Iakovidis, D. E. Maroulis, S. A. Karkanis and A. Brokos (2005) A comparative study of texture features for the discrimination of gastric polyps in endoscopic video. In 18th IEEE Symposium on Computer-Based Medical Systems (CBMS’05), pp. 575–580. Cited by: §2.1.
  23. S. A. Karkanis, D. K. Iakovidis, D. E. Maroulis, D. A. Karras and M. Tzivras (2003) Computer-aided tumor detection in endoscopic video using color wavelet features. IEEE transactions on information technology in biomedicine 7 (3), pp. 141–152. Cited by: §2.1.
  24. Z. Li and D. Hoiem (2018-12) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. External Links: Document, ISSN 0162-8828 Cited by: §3.4.1, §3.4.1.
  25. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §3.2.
  26. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.2, §3.2, §3.2.
  27. H. Liu, W. Lu and M. Q. Meng (2011) De-blurring wireless capsule endoscopy images by total variation minimization. In Proceedings of 2011 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 102–106. Cited by: §2.2, §2.
  28. A. Mohammed, S. Yildirim, I. Farup, M. Pedersen and Ø. Hovde (2018) Y-net: a deep convolutional neural network for polyp detection. arXiv preprint arXiv:1806.01907. Cited by: §2.1.
  29. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §3.2.
  30. I. Ševo, A. Avramović, I. Balasingham, O. J. Elle, J. Bergsland and L. Aabakken (2016) Edge density based automatic detection of inflammation in colonoscopy videos. Computers in biology and medicine 72, pp. 138–150. Cited by: §2.1.
  31. Y. Shin, H. A. Qadir, L. Aabakken, J. Bergsland and I. Balasingham (2018) Automatic colon polyp detection using region based deep cnn and post learning approaches. IEEE Access 6, pp. 40950–40962. Cited by: §2.1, Table 12.
  32. J. Silva, A. Histace, O. Romain, X. Dray and B. Granado (2014) Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery 9 (2), pp. 283–293. Cited by: §2.1.
  33. T. Stehle (2006) Removal of specular reflections in endoscopic images. Acta Polytechnica 46 (4). Cited by: §2.2, §2.
  34. N. Tajbakhsh, S. R. Gurudu and J. Liang (2015) Automated polyp detection in colonoscopy videos using shape and context information. IEEE transactions on medical imaging 35 (2), pp. 630–644. Cited by: §2.1, §2.2.
  35. S. Tchoulack, J. P. Langlois and F. Cheriet (2008) A video stream processor for real-time detection and correction of specular reflections in endoscopic images. In 2008 Joint 6th International IEEE Northeast Workshop on Circuits and Systems and TAISA Conference, pp. 49–52. Cited by: §2.2, §2.
  36. D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, M. Drozdzal and A. Courville (2017) A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of healthcare engineering 2017. Cited by: §2.2, §2.
  37. P. Wang, X. Xiao, J. R. G. Brown, T. M. Berzin, M. Tu, F. Xiong, X. Hu, P. Liu, Y. Song and D. Zhang (2018) Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy. Nature biomedical engineering 2 (10), pp. 741. Cited by: §2.1.
  38. S. Yang and G. Cheng Endoscopic artefact detection and segmentation with deep convolutional neural network. Cited by: §1.
  39. L. Yu, H. Chen, Q. Dou, J. Qin and P. A. Heng (2016) Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos. IEEE journal of biomedical and health informatics 21 (1), pp. 65–75. Cited by: §2.1.
  40. Y. Zhou, X. He, L. Huang, L. Liu, F. Zhu, S. Cui and L. Shao (2019) Collaborative learning of semi-supervised segmentation and classification for medical images. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.3.
  41. R. Zhu, R. Zhang and D. Xue (2015) Lesion detection of endoscopy images based on convolutional neural network features. In 2015 8th International Congress on Image and Signal Processing (CISP), pp. 372–376. Cited by: §2.1.