A deep learning framework for quality assessment and restoration in video endoscopy


Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu and Jens Rittscher
Corresponding author: Sharib Ali (sharib.ali@eng.ox.ac.uk)

Endoscopy is a routine imaging technique used for both diagnosis and minimally invasive surgical treatment. Artifacts such as motion blur, bubbles, specular reflections, floating objects and pixel saturation impede the visual interpretation and the automated analysis of endoscopy videos. Given the widespread use of endoscopy in different clinical applications, we contend that the robust and reliable identification of such artifacts and the automated restoration of corrupted video frames are fundamental medical imaging problems. Existing state-of-the-art methods only deal with the detection and restoration of selected artifacts; however, endoscopy videos typically contain numerous artifacts, which motivates a comprehensive solution.

We propose a fully automatic framework that can: 1) detect and classify six different primary artifacts, 2) provide a quality score for each frame, and 3) restore mildly corrupted frames. To detect the different artifacts, our framework exploits a fast, multi-scale, single-stage convolutional neural network detector. We introduce a quality metric to assess frame quality and predict image restoration success. Generative adversarial networks with carefully chosen regularization are finally used to restore corrupted frames.

Our detector yields the highest mean average precision (mAP at a 5% IoU threshold) of 49.0 and the lowest computational time of 88 ms, allowing for accurate real-time processing. Our restoration models for blind deblurring, saturation correction and inpainting demonstrate significant improvements over previous methods. On a set of 10 test videos, we show that our approach preserves an average of 68.7% of frames, which is 25% more than would be retained from the raw videos.


Acknowledgments: The research was supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. S. Ali and J. Rittscher are with the Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK. S. Ali is supported by the NIHR Oxford BRC and J. Rittscher is supported through the EPSRC funded Seebibyte programme (EP/M013774/1). F. Zhou and Xin Lu are with the Ludwig Institute for Cancer Research, University of Oxford, Oxford, UK. A. Bailey, B. Braden and J. East are with the Translational Gastroenterology Unit, Oxford University Hospitals NHS Foundation Trust, Oxford, UK.

1 Introduction

Originally used to image the esophagus, stomach and colon, miniaturization of hardware and improvement of imaging sensors now enable endoscopy of the ear, nose, throat, heart, urinary tract, joints, and abdomen. Common to all these endoscopy applications, the presence of different imaging artifacts poses significant challenges in monitoring disease progression. The camera in the endoscope is embedded in a long flexible tube, so any small hand motion can cause severe motion artifacts in recorded videos. The light required for illumination can interact with tissue and surrounding fluid, generating very bright pixel areas (due either to specularity or pixel saturation). Different viewing angles and occlusions can result in contrast issues due to underexposure. Additionally, as in any other complex real-world imaging application, visual clutter from debris, liquid, bubbles, etc., can limit the visual understanding of the underlying tissue. In this study, we therefore consider the following artifacts: specular reflections, pixel saturation, motion blur, low contrast and undesired visual clutter. Not only do such artifacts occlude the tissue/organ of interest during diagnosis and treatment, they also adversely affect any computer assisted endoscopy method (e.g., video mosaicking for follow-ups and archiving, video-frame retrieval for reporting, etc.).

Chikkerur et al. [8] and Menor and colleagues [26] have studied video frame quality assessment methods. While they introduce and review very useful global video quality metrics, neither the cause of frame quality degradation nor the degraded regions can be identified for frame restoration. In general, utilizing these quality scores [8]-[26] only allows for the removal of frames corrupted with artifacts, without considering the severity of each artifact type. Such simple removal of corrupted frames can severely reduce the information content of videos and affect their overall temporal smoothness. One adverse effect of this, for example, is on mosaicking methods that require sufficient overlap between successive temporal frames to succeed [2]. Artifacts are thus the primary obstacle to developing effective and reliable computer assisted endoscopy tools. Their precise identification, classification and, if possible, restoration are critical for any downstream analysis of the video data.

Detecting multiple artifacts and providing adequate restoration is highly challenging. To date, most research groups have studied only specific artifacts in endoscopic imaging [25, 38, 40, 1]. For example, deblurring of wireless capsule endoscopy images utilizing a total variation (TV) approach was proposed in [25]. TV-based deblurring is, however, parameter sensitive and requires geometrical features to perform well, while endoscopic images have very sparse features and lack geometrically prominent structures. Both hand-crafted features [38, 40, 1, 28] and neural networks [17] have been used to restore specular reflections. A major drawback of these existing restoration techniques is that heuristically chosen image intensities are compared with neighboring (local) image pixels; in general, both local and global information is required for realistic frame restoration. One common limitation of almost all of these methods is that they address only one particular artifact class, while in reality various different effects corrupt endoscopy videos. For example, both ‘specularities’ and a water ‘bubble’ can be present in the same frame. Endoscopists also dynamically switch between different modalities during acquisition (e.g., normal brightfield (BF), acetic acid, narrow-band imaging (NBI) or fluorescence light (FL)) to better highlight specific pathological features. Finally, inter-patient variation is significant even when viewed under the same modality. Existing methods fail to adequately address all of these challenges. In addition to addressing only one type of imaging artifact, most of the endoscopy-based image analysis literature considers only one imaging modality and a single patient video sequence [38, 40, 1, 17, 28]. The use of small data sets in these studies also raises concerns regarding method generalization to the image variabilities often present in endoscopic data. For example, in [1] only 100 randomly selected images were used to train the Support Vector Machine (SVM) for detecting specular regions.

In this paper, we propose a systematic and comprehensive approach to the problem. Our framework addresses the precise detection and localisation of six different artifacts and introduces artifact-type-specific restoration of mildly affected frames. Unlike previous methods [25, 38, 40, 1, 17, 28] that require manual adjustment of parameter settings or use hand-crafted features only suitable for specific artifacts, we propose multi-class artifact detection and restoration methods that utilize multi-patient and multi-modal video frames. Such an approach decreases the false classification rate and better generalizes both detection and frame restoration. Reliable multi-class detection is made possible through a multi-scale, deep convolutional neural network based object detector that can efficiently generalize multi-class artifact detection across the patients and modalities present in endoscopic data. Realistic frame restoration is achieved using generative adversarial networks (GANs, [13]). While our work builds on these approaches, substantial additional work has been necessary to avoid introducing further artifacts and disrupting the overall visual coherence. To achieve this, we introduce artifact-type-dependent regularization. A novel edge-based regularization and restoration is proposed for deblurring. Restoration of large saturated pixel areas using GANs is, to our knowledge, the first of its kind and has not previously been addressed in the literature; to handle the resulting color shift, we introduce a novel color-transfer technique in this scheme. To tackle artifacts such as debris, bubbles and other miscellaneous artifacts, we apply a complete restoration of pixels based on a global contextual regularization scheme. Additionally, we condition each GAN model on prior image information. We demonstrate that such carefully chosen models can lead to both high-quality and very realistic frame restoration.

To our knowledge, this is the first attempt to propose a systematic and general approach that handles cross-modality, inter-patient video data for both the automatic detection of multiple artifacts present in endoscopic data and their subsequent restoration. We use 7 unique patient videos (gastroesophageal, selected from a large cohort of 200 videos) for training and 10 different videos for extensive validation. Our experiments utilizing well-established video quality assessment metrics illustrate the effectiveness of our approach. In addition, the quality of the restored frames was also evaluated by two experienced endoscopists, who provided scores based on visual improvement, importance, and the presence or absence of any artificially introduced artifact in our restored frames.

The remainder of this article is organized as follows. In Section 2, we introduce our endoscopy data set for artifact detection. Section 3 details our proposed approaches for artifact detection and endoscopic video frame restoration. In this section we also review closely related works associated with each process. In Section 4, we present experiments and results for each step of our framework to show the efficacy of individual methods. Finally, in Section 5 we conclude the paper and outline directions for future work.

2 Material

Figure 1: Top row: artifact type distribution in the training and testing endoscopy image data sets, in terms of the number of bounding boxes (left) and the percentage of the total number of bounding boxes (right). Bottom row: sizes of annotated boxes normalised by the image dimensions for the training set (left) and the test set (right).

Our artifact detection data set consists of a total of 1290 endoscopy images (resized to 512×512 pixels) from two operating modalities, normal bright field (BF) and narrow-band imaging (NBI), sampled from 7 unique patient videos selected from a cohort of 200 endoscopic videos. The selection was based on the number of representative artifacts present in these videos and the texture variability of the underlying esophagus. Two experts annotated a total of 6504 artifacts using bounding boxes, where each annotation is classified as:

  1. blur - streaking from fast camera motion

  2. bubbles - water bubbles that distort the appearance of the underlying tissue

  3. specularity - mirror-like surface reflection

  4. saturation - overexposed bright pixel areas

  5. contrast - low contrast areas from underexposure or occlusion

  6. misc. artifact (also referred to as ‘artifact’ in this paper) - miscellaneous artifacts, e.g., chromatic aberration, debris, imaging artifacts etc.

A 90%-10% split was used to construct the train-test set for object detection, resulting in 1161 training and 129 test images with 5860 and 644 bounding boxes, respectively. In general, the training and testing data exhibit the same class distribution (see Fig. 1, top row) and similar bounding boxes: roughly square, and either small with average widths less than 0.2 or large with widths greater than 0.5 of the image dimension (see Fig. 1, bottom row). Multiple annotations are used when a given region contains multiple artifacts.
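As a quick arithmetic sanity check of the figures quoted above (a sketch only; the variable names are ours):

```python
# Consistency check of the dataset split described above:
# 1290 images split 90%-10%, with 6504 annotated bounding boxes.
total_images, total_boxes = 1290, 6504

train_images = round(0.90 * total_images)   # 90% for training
test_images = total_images - train_images   # remaining 10% for testing

# Box counts as reported per split (boxes are not split exactly 90/10,
# since they follow whichever images each split contains).
train_boxes, test_boxes = 5860, 644
```
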

3 Method

Figure 2: Sequential processes for endoscopic image restoration from detected regions-of-interest (ROIs) of 6 different artifacts. First, masks of the generated ROIs are dilated, and then only these regions are used for restoration. In the case of blur, however, the entire image is used.

3.1 Overall approach

The step-by-step procedure for the automatic detection of multiple artifacts and frame restoration of endoscopic videos is presented in Fig. 2. Note that a single frame can be corrupted by multiple artifacts and that each artifact class can affect endoscopic frames differently. The order of the restoration steps is therefore very likely to affect the final restoration result.

Multiple instance object detection is used to discriminate between the six different types of artifacts (see Section 2) and normal appearance. For each frame a quality score (QS, see Section 3.3) is computed based on the type, area and location of the identified artifacts to reflect the feasibility of complete image restoration via the sequential restoration process depicted in Fig. 2. The scaling of our QS is set such that we differentiate between severely corrupted frames (low QS), mildly corrupted frames (intermediate QS), and frames of high quality (high QS). Severely corrupted frames are discarded without any further processing. The proposed image restoration methods are applied to mildly corrupted frames only; in order to guarantee a faithful restoration, these frames go through our proposed sequential framework. All remaining frames are directly concatenated into the final list without any processing.
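The triage logic described above can be sketched as follows; the threshold values are illustrative placeholders, since the exact QS cut-offs are not reproduced in this sketch:

```python
def triage(qs, severe_thresh=0.3, mild_thresh=0.9):
    """Route a frame by its quality score QS in [0, 1].

    The two thresholds here are illustrative assumptions, not the
    paper's calibrated cut-offs.
    """
    if qs < severe_thresh:
        return "discard"   # severely corrupted: no further processing
    if qs < mild_thresh:
        return "restore"   # mildly corrupted: sequential restoration
    return "keep"          # high quality: concatenated directly
```
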

3.2 Artifact region detection

Recent research in computer vision provides us with object detectors that are both robust and suitable for real-time applications. Here, we propose to use a multi-scale deep object detection model for identifying the different artifacts in real-time. Even by itself, real-time artifact detection is already of practical value. For example, the detection results can be used to provide endoscopists with feedback during data acquisition. After detection, additional post-processing using traditional image processing methods can be used to determine the precise boundary of a corrupted region in the image.

Figure 3: Examples of detected bounding boxes for some artifact class labels using YOLOv3-spp.

Today, deep learning enables us to construct object detectors that generalise traditional hand-crafted ‘sliding-window’ object classification approaches (e.g., Viola-Jones [43]). Earlier approaches, including OverFeat [35] and R-CNN [12], demonstrated the power of convolutional neural networks (CNNs) to learn relevant features and detect objects using a fixed number of pre-generated candidate object region proposals [42]. Faster R-CNN [34] first introduced a fully trainable end-to-end network comprising an initial region proposal network and successive classification of the proposed regions without intermediate processing. Since region proposal generation precedes bounding box detection sequentially, this architecture is known as a two-stage detector. Though very accurate, its primary drawback is slow inference and extensive training. You Only Look Once (YOLO, [32]) simplified Faster R-CNN to predict class and bounding box coordinates simultaneously using a single CNN and a single loss function, with good performance and significantly faster inference; such simultaneous detection is known as a one-stage detector. Compared to two-stage detectors, single-stage detectors mainly suffer from two issues causing high false detection rates: 1) the presence of objects at varied sizes and 2) the large initial number of anchor boxes required, which necessitates more accurate positive box mining. The former is addressed by predicting bounding boxes at multiple scales using feature pyramids [14]-[22]. To address the latter, RetinaNet [23] introduced a new focal loss which adjusts the propagated loss to focus more on hard, misclassified samples. Recently, YOLOv3 [33] simplified the RetinaNet architecture with further speed improvements: bounding boxes are predicted at only 3 different scales (unlike 5 in RetinaNet), and an objectness score with independent logistic regression, rather than RetinaNet's focal loss, enables the detection of objects belonging to multiple classes.
Collectively, Faster R-CNN, RetinaNet and YOLOv3 define the current state-of-the-art detection envelope of accuracy vs speed on the popular natural images benchmark COCO data set [24].

We investigated the Faster R-CNN, RetinaNet and YOLOv3 architectures for artifact detection; validated open source implementations are available for all of them. Experimentally, we chose YOLOv3 with spatial pyramid pooling (YOLOv3-spp) for robust detection and improved inference time. Spatial pyramid pooling pools features from sub-image regions utilizing the single-stage CNN features computed at multiple scales by the YOLOv3 architecture. In addition to the boost in inference speed, incorporating spatial pyramid pooling decreased false positive detections compared to the classical YOLOv3 (see Section 4.2). YOLOv3-spp provides an excellent accuracy-speed trade-off, which is the main requirement for usage in clinical settings. Examples of the detected boxes using YOLOv3-spp are shown in Fig. 3.

3.3 Quality score

Figure 4: Quality assessment based on class weight, area and location. Images with detection boxes and their corresponding area fractions are shown. Left: an image with mostly a contrast problem (QS = 0.75). Right: an image with multiple misc. artifacts and specularities (QS = 0.23).

Quality assessment is important in video endoscopy as image corruption largely affects image analysis methods. However, not all frames are corrupted to the same extent; depending on the amount and type of artifact present in a frame, realistic restoration may be possible. Such frame grading, however, needs to be carefully determined. Here, we propose a frame quality score (QS) based on: a) the type, b) the area and c) the location of the detected artifacts. Weights are assigned to each of these categories and a mean weight is computed as the quality score. Class weights are assigned based on the ease of restoration, e.g., an entirely blurred image can still be restored, but the same would not apply to misc. artifacts; thus, misc. artifacts are assigned a higher weight than blur. Similarly, the area and location of detected artifacts in each frame are important: a centrally located imaging artifact with a large area degrades image information beyond restoration. Below we describe our weighting scheme:

  • Class weight ($w_c$): misc. artifact (0.50), specularity (0.20), saturation (0.10), blur (0.05), contrast (0.05), bubbles (0.10)

  • Area weight ($w_a$): percentage of the total image area occupied by the detected artifact areas

  • Location weight ($w_l$): center (0.5), left (0.25), right (0.25), top (0.25), bottom (0.25), top-left (0.125), top-right (0.125), bottom-left (0.125), bottom-right (0.125).

The final QS is computed as:

$$\mathrm{QS} = 1 - \frac{1}{|\mathcal{B}|}\sum_{b \in \mathcal{B}} \big( w_c(b) + \lambda_a\, w_a(b) + \lambda_l\, w_l(b) \big), \qquad (1)$$

where $\mathcal{B}$ denotes the set of bounding boxes associated with the detected artifacts, $w_c$, $w_a$ and $w_l$ are the class, area and location weights defined above, and $\lambda_a$, $\lambda_l$ are constants that weight the relative contributions of area and location (fixed values were used in our experiments). For frames with few detected artifacts (fewer than 5), such a weighting scheme underscores the frame, especially if large-area artifacts are present, so an adjusted weighting is used in these cases. Note that the QS in Eq. (1) is lower-bounded by 0.
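As an illustration, such a score could be computed along the following lines; the class and location weights follow the lists above, while `lam_a`, `lam_l` and the exact combination rule are assumptions of this sketch rather than the paper's calibrated values:

```python
import numpy as np

# Class and location weights as listed above.
CLASS_W = {"misc. artifact": 0.50, "specularity": 0.20, "saturation": 0.10,
           "blur": 0.05, "contrast": 0.05, "bubbles": 0.10}
LOC_W = {"center": 0.5, "left": 0.25, "right": 0.25, "top": 0.25,
         "bottom": 0.25, "top-left": 0.125, "top-right": 0.125,
         "bottom-left": 0.125, "bottom-right": 0.125}

def quality_score(boxes, lam_a=0.5, lam_l=0.5):
    """boxes: list of (class_name, area_fraction, location) per detection.

    Mean weighted penalty over all detected boxes, subtracted from 1 and
    clipped at 0 (the score is lower-bounded by 0). lam_a and lam_l are
    illustrative placeholder constants.
    """
    if not boxes:
        return 1.0  # no artifacts detected: pristine frame
    penalty = np.mean([CLASS_W[c] + lam_a * a + lam_l * LOC_W[loc]
                       for c, a, loc in boxes])
    return max(0.0, 1.0 - float(penalty))
```
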

A fixed threshold on QS is user-specified to determine the frames kept for image restoration. Examples of the proposed quality score applied to real data are shown in Fig. 4. The video frame in Fig. 4 (left) has mostly a contrast problem (i.e., a low class weight), so despite its central location (see blue box) and large area the frame intensity can be restored (QS = 0.75). However, Fig. 4 (right) has many misc. artifacts (high class weight) and centrally located specular areas (c.f. green and red boxes; QS = 0.23), which inhibit realistic frame restoration, so the frame is discarded.

3.4 Image restoration

Formulating the reconstruction of the true signal from the noisy and corrupted input image as an optimization or estimation problem demands a well-motivated mathematical model. Unfortunately, the various different types of artifacts induce a level of complexity that makes this endeavor very challenging. Assuming image noise to be additive and approximating motion blur as a linear convolution with an unknown kernel is reasonable and in line with previous attempts at the problem. In addition, contrast and pixel saturation problems can be formulated as a non-linear gamma correction. The remaining artifacts (e.g., specularities, bubbles and imaging artifacts), which arise from a combination of these phenomena, can be modeled as a function of the entire process. The corrupted noisy video frame $\tilde{I}$ can thus be approximated as:

$$\tilde{I} = f\big((k \ast I)^{\gamma}\big) + n, \qquad (2)$$

where $I$ is the latent clean frame, $n$ denotes the additive noise induced by the imaging system, $k \ast$ the convolution with the approximation to the induced motion blur, the exponent $\gamma$ captures the over- and under-exposed regions, and $f$ is a generalized non-linear function that models the remaining artifacts (including specularities, bubbles and imaging artifacts) or a combination of them. This model motivates why the restoration of the video frames is structured into separate processing steps, which are implemented as deep learning models.
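A minimal simulation of such a degradation model might look as follows; the generalized non-linear function is omitted, and an arbitrary horizontal box kernel stands in for the unknown motion blur:

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((64, 64))  # latent clean frame, intensities in [0, 1]

# k * I: linear motion blur approximated by a horizontal 5-tap box kernel
k = np.ones(5) / 5.0
blurred = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"),
                              axis=1, arr=I)

# gamma exponent modelling over-/under-exposure
gamma = 2.2
exposed = np.clip(blurred, 0.0, 1.0) ** gamma

# n: additive sensor noise (the generalized non-linear artifact
# function is left out of this sketch)
n = 0.01 * rng.standard_normal(I.shape)
corrupted = np.clip(exposed + n, 0.0, 1.0)
```
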

Image restoration is the process of generating realistic and noise-free image pixels from corrupted image pixels. In endoscopic frame restoration, depending upon the artifact type, the goal is either the generation of an entire noise-free image or the inpainting of undesirable pixels using surrounding pixel information [4]. For multi-class endoscopic artifact restoration, we require: 1) frame deblurring when the kernel $k$ is unknown, i.e. a blind deblurring task; 2) minimizing the effect of contrast imbalance (correction for over- and under-exposed regions) in frames, i.e. $\gamma$ correction; and 3) replacing specular pixels and those with imaging artifacts or debris by inpainting, i.e. correction for the additive noise $n$ or the combined non-linear function $f$. Due to the high likelihood of multiple artifacts being present in a single frame, unordered restoration of these artifacts can further degrade frame quality. We therefore propose an adaptive sequential restoration process that accounts for the nature of individual artifact types (see Fig. 2).

Recently, GANs [13] have been successfully applied to image-to-image translation problems using limited training data. Here, a generator $G$ ‘generates’ a sample $G(z)$ from a random noise distribution ($z \sim p_z$) while a separate discriminator network $D$ tries to distinguish between the real target images ($y \sim p_{\text{data}}$, assumed non-zero mean Gaussian) and the fake image generated by the generator. The objective function is therefore a min-max problem:

$$\min_G \max_D \; \mathbb{E}_{y \sim p_{\text{data}}}\big[\log D(y)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]. \qquad (3)$$
In practice, the GAN generator model above is highly non-convex, unstable and slow to train, as samples are generated from random input noise. Various groups [27, 47, 18, 19] have provided ways to address this problem and achieved improvements in reconstruction quality and numerical stability as well as a reduction in computation time. One popular way to ensure the stability of the generator output is to condition the GAN on prior information (e.g., the class label $c$ in CGAN, [27]). The objective function for CGAN can be written as:

$$\min_G \max_D \; \mathbb{E}_{y \sim p_{\text{data}}}\big[\log D(y \mid c)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z \mid c) \mid c)\big)\big]. \qquad (4)$$
Another efficient method is regularizing the generator using contextual losses (e.g., pix2pix [18], deblurGAN [21]). In [9] regularizing the discriminator and generator significantly helped to improve visual quality. We train such conditional generative adversarial models [27] (CGAN) embedding artifact class dependent contextual losses, see Table 1 for effective restoration. When restoring frames, the detected artifact types are used to decide which CGAN model is applied and in what order (also see Fig. 2).
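A hedged sketch of such a composite generator objective, using a plain L2 contextual term; the heavy contextual weighting follows pix2pix-style models, and the constant here is an assumption rather than a value from the paper:

```python
import numpy as np

def generator_loss(d_fake, generated, target, lam=100.0):
    """Adversarial term plus an L2 contextual term tying the generated
    frame to its target.

    d_fake: discriminator outputs in (0, 1) for the generated samples.
    lam: contextual weighting (illustrative assumption).
    """
    adversarial = -np.mean(np.log(d_fake + 1e-8))    # fool the discriminator
    contextual = np.mean((generated - target) ** 2)  # stay close to target
    return adversarial + lam * contextual
```

The contextual term is what makes the generator's output deterministic with respect to its conditioning input rather than an arbitrary sample from the target distribution.
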

Artifact type | Restoration method
Motion blur | CGAN + $\ell_2$-contextual + high-frequency losses
Specularity / bubbles / misc. artifacts | CGAN + contextual loss
Saturation | CGAN + contextual loss + CRT transform
Low contrast | same as saturation (reversed training set)
Table 1: Computational models used for individual artifact classes.

3.4.1 Motion blur

Motion blur is a common problem in endoscopy videos. Unlike in static images, motion blur in video frame data is often non-uniform with unknown kernels (see Eq. (2)). Several blind-deconvolution methods have been applied to motion deblurring, ranging from classical optimization methods [41, 44, 11] to neural network-based methods [7, 29]. Despite the good performance of convolutional neural networks (CNNs) over classical methods, a major drawback of CNNs is that they require tuning a large number of hyper-parameters and large training data sets. Blind deconvolution can be posed as an image-to-image translation problem in which the blurred image is transformed into its matching unblurred image.

Figure 5: Blind deblurring using CGAN with an added contextual high-frequency feature loss.

In this work, we use a CGAN with an $\ell_2$-contextual loss (the squared difference between the generated and target/sharp image) and an additional high-frequency loss as regularization. This is motivated by the fact that motion blur primarily affects image edges, which constitute only a few discriminative pixels compared to the entire image. The high-frequency images are first computed for both blurred and sharp images in the training data using iterative low pass-high pass filtering at 4 different scales [6]. These images are then used to provide additional information to the discriminator regarding the generator's behavior (see also Fig. 5). The CGAN objective then becomes:

$$\min_G \max_D \; \mathcal{L}_{\text{CGAN}}(G, D) + \lambda \, \big\| y - G(x) \big\|_2^2 + \lambda_{hf} \, \big\| y_{hf} - G(x)_{hf} \big\|_2^2, \qquad (5)$$

where $(x, x_{hf})$ refer to an original and high-frequency image pair, the subscript $hf$ denotes the high-frequency component, and $y$ is the ground truth image for restoration (i.e. the sharp image in our case). Minimization of such objectives using the Jensen-Shannon (JS) divergence as in [13] can lead to problems such as mode collapse and vanishing gradients. Consequently, [3] proposed to use the Wasserstein distance with a gradient penalty (WGAN-GP). We therefore use a CGAN with a critic network based on WGAN-GP [21]. The proposed model was trained for 300 epochs on a paired blur-sharp data set consisting of 10,710 multi-patient and multi-modal images (715 unique sharp images) with 15 different simulated motion trajectories for blur (see [21]).
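The high-frequency regularizer can be sketched as follows; the box-filter decomposition is a simple stand-in for the iterative low pass-high pass filtering of [6], and the four scales are illustrative:

```python
import numpy as np

def box_blur(img, r):
    """Separable (2r+1)-tap box low-pass filter."""
    k = np.ones(2 * r + 1) / (2 * r + 1)
    img = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)

def high_frequency(img, radii=(1, 2, 4, 8)):
    """High-frequency residual accumulated over four low-pass scales."""
    return sum(img - box_blur(img, r) for r in radii)

def hf_loss(generated, sharp):
    """Squared-error penalty on high-frequency (edge) content only."""
    diff = high_frequency(generated) - high_frequency(sharp)
    return float(np.mean(diff ** 2))
```

Because the residuals are dominated by edges, this term penalizes edge discrepancies far more than smooth-region discrepancies.
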

3.4.2 Saturation or low contrast

Small or large distances between the light source and the imaged tissue can lead to large illumination changes, resulting in saturation or low contrast; this motivates the role of the exponent $\gamma$ in Eq. (2). Saturated or low contrast pixels often occur across large image areas compared to specularities and affect the entire image globally. In addition, these illumination changes are more prominently observed in the normal brightfield (BF) modality than in other modalities. Compensation of affected image pixels is a difficult problem, depending on the size of the affected image area. We pose saturation restoration as an image-to-image translation problem and apply the same end-to-end CGAN approach used for motion deblurring described above, with the contextual loss only, to train a generator-discriminator network for saturation removal. Here, the contextual loss is more suitable as we want to capture the deviation of saturated and low contrast conditions from normal illumination conditions.

Due to the lack of ground truth data for two different illumination conditions, we created a fused data set that included: 200 natural scene images containing diffuse (scattered light) and ambient (additional illumination to natural light, giving regions with pixel saturation) illuminations (https://engineering.purdue.edu/RVL/Database/specularity_database/index.html); and 200 endoscopic image pairs simulated using cycleGAN-based style transfer [47] (separately trained on another 200 saturated and normal BF images from 7 unique patients). To correct the coloration shift due to the incorporation of natural images in our training set, color transfer (CRT) is applied to the generated frames. Given a source image $I_s$ and a target image $I_t$ to recolor, the mean and covariance matrix of the respective pixel values (in RGB channels) can be matched through a linear transformation [15]:

$$I_r = \Sigma_s^{1/2}\, \Sigma_t^{-1/2} \big( I_t - \mu_t \big) + \mu_s, \qquad (6)$$

where $I_r$ is the recolored output, $(\mu_s, \Sigma_s)$ are the mean and covariance of the source pixel values and $(\mu_t, \Sigma_t)$ those of the target. To avoid re-transfer of color from saturated pixel areas in the source, the mean and covariance matrix are computed only from image intensities below 90% of the maximum intensity value. Fig. 6 shows the results generated by our trained GAN-based network (right) and after color-shift correction (bottom), which are very close to the ground truth. To recover low contrast frames, the CGAN-saturation network was trained with reversed image pairs of the same training data set.
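A sketch of such a mean/covariance color transfer, including the exclusion of near-saturated pixels from the statistics; the eigendecomposition-based matrix square root is one possible implementation choice, not necessarily the one used in the paper:

```python
import numpy as np

def color_transfer(target, source, sat_frac=0.9):
    """Recolor `target` so its RGB mean/covariance match `source`.

    Pixels whose maximum channel exceeds `sat_frac` of the image maximum
    are excluded from the statistics, to avoid re-transferring saturated
    colors. Both inputs: float arrays of shape (H, W, 3).
    """
    def stats(img):
        px = img.reshape(-1, 3)
        keep = px.max(axis=1) < sat_frac * img.max()
        px = px[keep]
        return px.mean(axis=0), np.cov(px, rowvar=False)

    mu_t, cov_t = stats(target)
    mu_s, cov_s = stats(source)

    def sqrtm(c):
        # Symmetric PSD matrix square root via eigendecomposition.
        w, v = np.linalg.eigh(c)
        return (v * np.sqrt(np.maximum(w, 1e-12))) @ v.T

    A = sqrtm(cov_s) @ np.linalg.inv(sqrtm(cov_t))
    out = (target.reshape(-1, 3) - mu_t) @ A.T + mu_s
    return out.reshape(target.shape)
```
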

Figure 6: CRT color correction. Saturation-corrected frames generated by our trained generator (right), and color transfer comparison with the ground truth (bottom).

3.4.3 Specularity, and other misc. artifacts removal

Illumination inconsistencies and viewpoint changes cause strong bright spots due to reflections from bubbles and shiny organ surfaces, while water-like substances can create multi-colored chromatic artifacts (referred to as ‘imaging or mixed artifacts’ in this paper). These inconsistencies appear as a combination of the linear (e.g., additive noise $n$) and non-linear (function $f$) terms in Eq. (2). A process referred to as inpainting, which uses the surrounding pixels as prior information, is used to replace the saturated pixels in the affected regions. TV-inpainting methods are popular for restoring images with geometrical structures [37], and patch-based methods [10] for texture synthesis; however, these methods are computationally expensive. Recent advances in deep neural networks have made it possible to recover visually plausible image structures and textures [20] with almost real-time performance, but they are limited by the size of the mask, i.e. the number of unknown pixels in an image. In this context, GANs [31, 16, 45] have been shown to be more successful in providing faster and more coherent reconstructions, even with larger masks; both contextual and generative losses have been used in these methods. Iizuka et al. [16] and Yu et al. [45] used local and global discriminators to improve reconstruction quality. To enlarge the network receptive field, [45] further used a coarse-to-fine network architecture with WGAN-GP instead of the DCGAN in [16]. Additionally, a discounted contextual (reconstruction) loss using a distance-based weight mask was used for added regularization [45]. Due to the reduced training time and better reconstruction quality compared to [16], we use the network proposed in [45] for inpainting.

We use a bottleneck approach to retrain the model initialised with pretrained weights from the Places2 data set [46]. To capture the large visual variations present in endoscopy images, 1000 images from 7 different patient endoscopy videos with a quality score above 95% were used as the ‘clean’ images (see Section 3.3). We used 172 images as a validation set during training. Both training and validation sets included multimodal endoscopic video frames. During training and validation, masks of different patch sizes were randomly generated and used for restoration; a single image can have one or multiple generated masks.
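The random mask generation used during inpainting training can be sketched as follows; the patch-size fractions are illustrative assumptions:

```python
import numpy as np

def random_masks(shape, n_boxes, rng, min_frac=0.05, max_frac=0.25):
    """Binary mask with `n_boxes` random rectangular holes, mimicking
    randomly generated patch masks for inpainting training.

    min_frac/max_frac bound each hole's side length relative to the
    image dimensions (illustrative values).
    """
    H, W = shape
    mask = np.zeros(shape, dtype=bool)
    for _ in range(n_boxes):
        h = rng.integers(int(min_frac * H), int(max_frac * H) + 1)
        w = rng.integers(int(min_frac * W), int(max_frac * W) + 1)
        y = rng.integers(0, H - h + 1)
        x = rng.integers(0, W - w + 1)
        mask[y:y + h, x:x + w] = True
    return mask
```
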

4 Experiments

4.1 Quality assessment metrics

To evaluate our artifact detection we use the standard mean average precision (mAP) and intersection-over-union (IoU) metrics. We quantitatively compare the detection results of all architectures using: the mAP at IoU thresholds for a positive match of 5%, 25% and 50%, denoted mAP5, mAP25 and mAP50 respectively; the mean IoU between positive matches; the number of predicted boxes relative to the number of annotated boxes; and the average inference time for one image. For the quality assessment of deblurring methods we use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) measures. To overcome the limitations of PSNR for quantifying the saturation and specularity restoration tasks, we additionally include the more sophisticated visual information fidelity (VIF, [36]) and relative edge coherence (RECO, [5]) quality assessment metrics, which are independent of the distortion type.
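For concreteness, two of these metrics can be computed as follows; this is a minimal reference implementation of box IoU and PSNR, not the evaluation code used in the paper.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((reference.astype(np.float64)
                   - restored.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

A detection counts as a positive match at, e.g., the 25% threshold when `iou(pred, gt) >= 0.25`.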

4.2 Artifact detection

Table 2 shows that the YOLOv3 variants outperform both Faster R-CNN and RetinaNet. YOLOv3-spp (proposed) yields the best mAP of 49.0 and 45.7 at IoU thresholds of 0.05 and 0.25 respectively, at a detection speed faster than Faster R-CNN [34]. Even though RetinaNet exhibits the best IoU of 38.9, note that IoU is sensitive to annotator variance in the bounding box annotations and may not reflect detector performance. In terms of class-specific performance, Fig. 7 and Table 3 show that the proposed YOLOv3-spp is best at detecting misc. artifacts and bubbles (both predominantly present in endoscopic videos), with average precisions of 48.0 and 55.9, respectively. Faster R-CNN yielded the highest average precision for saturation (71.0) and blur (14.5), while RetinaNet and YOLOv3 performed best for contrast (73.6) and specularity detection (40.0), respectively. It is worth noting that the proposed YOLOv3-spp yielded the second-best average precision scores for specularity (34.7), saturation (55.7) and contrast (72.1).

Method | Backbone | Input size | mAP5 | mAP25 | mAP50 | IoU | Predicted boxes | Time (ms)
Faster R-CNN [34] | ResNet50 | 600x600 | 44.9 | 40.4 | 29.5 | 28.3 | 835 | 555 (Python Keras 2.0, TensorFlow 1.2 backend)
RetinaNet [23] | ResNet50 | 608x608 | 43.8 | 41.2 | 34.7 | 38.9 | 576 | 103 (PyTorch 0.4)
YOLOv3 [33] | darknet53 | 512x512 | 47.4 | 44.3 | 35.1 | 24.2 | 1252 | 95 (Python call of Darknet trained network)
YOLOv3 | darknet53 | 608x608 | 48.1 | 45.2 | 33.2 | 21.4 | 1300 | 116
YOLOv3-spp (proposed) | darknet53 | 512x512 | 49.0 | 45.7 | 34.7 | 24.4 | 1120 | 88
Table 2: Artifact detection results on the test set with different neural network architectures. All timings were obtained on a single 6GB NVIDIA GTX Titan Black GPU and are the average time for a single 512x512 image (possibly rescaled on input as indicated), evaluated over all 129 test images. Total number of ground-truth boxes = 644.
Figure 7: Class specific precision-recall curves for artifact detection.
Method Spec. Sat. Arte. Blur Cont. Bubb.
Faster R-CNN[34] 20.7 71.0 35.1 14.5 58.7 42.4
RetinaNet[23] 33.1 42.9 39.8 7.2 73.6 50.6
YOLOv3[33] 40.0 50.4 44.3 11.6 70.8 48.9
YOLOv3-spp 34.7 55.7 48.0 7.5 72.1 55.9
Table 3: Class-specific average precision (AP) of the different object detection networks.

4.3 Frame restoration

Method | Metric | #80 | #99 | #102 | #113 | #116
CGAN, cont. & HF feature loss (proposed) | PSNR | 25.22 | 28.14 | 27.28 | 23.41 | 24.81
 | SSIM | 0.998 | 0.997 | 0.993 | 0.980 | 0.992
deblurGAN [21] | PSNR | 25.17 | 27.93 | 26.96 | 23.40 | 24.81
 | SSIM | 0.998 | 0.997 | 0.992 | 0.979 | 0.992
SRN-DeblurNet [39] | PSNR | 24.61 | 27.50 | 25.02 | 22.23 | 22.00
 | SSIM | 0.995 | 0.996 | 0.990 | 0.970 | 0.970
TV-deconv [11] | PSNR | 24.25 | 26.72 | 24.75 | 21.69 | 22.20
 | SSIM | 0.966 | 0.994 | 0.988 | 0.966 | 0.983
Table 4: Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) for randomly selected images with different motion blur.

4.3.1 Blind deblurring

We compare our proposed conditional generative adversarial network, with added contextual and high-frequency feature losses, against deblurGAN [21], the scale-recurrent SRN-DeblurNet [39], and a traditional TV-based method [11]. Both the TV regularization weight and the blur kernel affect the quality of the recovered deblurred images [11]; we chose these parameters after a few iterative parameter-setting experiments on our data set. We retrained SRN-DeblurNet [39] and deblurGAN [21] on the same data set used by our proposed deblurring model.
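As a rough sketch of the added high-frequency feature loss, one can penalize the L1 distance between edge responses of the restored and sharp frames. Here a simple discrete Laplacian stands in for the high-frequency feature extractor; the paper's exact operator and loss weighting may differ, so treat this as an assumption.

```python
import numpy as np

def laplacian(img):
    """4-neighbour discrete Laplacian (edge-replicated borders) as a
    cheap high-frequency (edge) response."""
    padded = np.pad(img.astype(np.float64), 1, mode='edge')
    return (padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]
            - 4.0 * padded[1:-1, 1:-1])

def hf_feature_loss(restored, sharp, weight=1.0):
    """L1 distance between edge maps of the restored and sharp frames;
    added to the contextual loss during generator training."""
    return weight * np.mean(np.abs(laplacian(restored) - laplacian(sharp)))
```

Intuitively, this term pushes the generator to reproduce the sharp frame's edge structure rather than only its low-frequency content.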

Method | Metric | #1 | #2 | #3
CGAN, cont. & HF feature loss (proposed) | PSNR | 25.80 | 24.65 | 21.25
 | SSIM | 0.997 | 0.980 | 0.970
deblur CGAN [21] | PSNR | 25.68 | 24.37 | 21.08
 | SSIM | 0.996 | 0.977 | 0.968
Table 5: Average PSNR and average SSIM for image sequences in test trajectories, both with the added high-frequency (HF) feature loss (proposed) and with only the contextual loss [21] in the conditional GAN model.
Figure 8: Qualitative results for different de-blurring methods on WL and NBI frames (left to right: blurred input, TV-deconv, SRN-DeblurNet, deblurGAN, proposed, ground truth).

We quantitatively evaluated the frame deblurring methods using 5 images with visually large blur from our simulated test trajectories (see Table 4) and on 3 different test sequences (simulated motion blur trajectories, see Table 5), each with 30 images. Table 4 shows that the CGAN with contextual loss and added high-frequency (HF) feature loss scores the highest PSNR and SSIM values for all blurred frames, while the TV-based deconvolution method [11] gives the lowest PSNR and SSIM over all frames. A nearly 1 dB increase can be seen against the deblurGAN method [21] for frames #80, #99 and #113, while gains of over 2 dB can be seen for #102 and #116 against SRN-DeblurNet [39] using the proposed model. Overall, the proposed model yields the best results compared to the second-best deblurGAN for the blurred image sequences in Table 5. This is also seen qualitatively in Fig. 8: SRN-DeblurNet deforms the image at the upper right in both the WL and NBI frames.

4.3.2 Saturation removal

We present results of treating saturation removal as a global problem, correcting the entire frame for overexposure as discussed in Section 3.4.2. Quantitative results are provided in Table 6 for 19 randomly selected saturated frames from our simulated test data set derived from good-quality frames (high QS). Our restoration model shows increased average values across all tested metrics (PSNR, SSIM, VIF and RECO). Improvements after the color transform in visual quality metrics such as RECO (from 1.313 to 1.512) and VIF (from 0.810 to 0.818) illustrate the boosted visual quality. This is also evident in the qualitative results presented in Fig. 9. Largely saturated image patches in the left and central frames are clearly removed by the trained generator whilst preserving the underlying image details (see RGB histograms in Fig. 9, second row). The color transform successfully restores the original color consistency in CGAN-restored images without introducing new saturation (see Fig. 9, last row). Note that simple contrast stretching, shown in Fig. 9 (third row) by rescaling the CGAN-restored frames, fails to recover the original color tones.
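The color re-transfer step can be approximated by Reinhard-style per-channel statistics matching. This RGB-space sketch is an assumption on our part; the paper's CRT scheme may operate in a different color space or match different statistics.

```python
import numpy as np

def color_retransfer(restored, source):
    """Match the per-channel mean and standard deviation of `restored`
    to those of the original `source` frame (Reinhard-style transfer).

    Illustrative sketch: operates directly in RGB, which may differ
    from the color space used by the paper's CRT method.
    """
    restored = restored.astype(np.float64)
    source = source.astype(np.float64)
    out = np.empty_like(restored)
    for c in range(restored.shape[2]):
        r, s = restored[..., c], source[..., c]
        std_r = r.std() if r.std() > 0 else 1.0
        # re-center and re-scale channel statistics to the source frame
        out[..., c] = (r - r.mean()) / std_r * s.std() + s.mean()
    return np.clip(out, 0, 255)
```

Applied after the CGAN generator, this pulls the restored frame's color statistics back toward the input frame's, countering the color shift introduced by model transfer.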

Metric | CycleGAN simulation | l1-contextual CGAN | Post-process (CRT)
PSNR | 27.892 | 28.622 | 28.335
SSIM | 0.905 | 0.964 | 0.944
VIF | 0.808 | 0.810 | 0.818
RECO | 1.091 | 1.313 | 1.512
Table 6: Quality assessment (QA) at different stages for 19 randomly selected saturated images in our simulated data set: CycleGAN-simulated images, l1-contextual CGAN restoration, and post-processing using the color re-transfer (CRT) method.

Figure 9: Saturation and specularity correction. Left, center: saturation of pixels in the region near the light source (blue area, top); right: several specular regions (green area). Corrected images for saturation and specularity removal using the trained end-to-end generator are presented in the second row. The result of simple rescaling of the corrected image intensity is shown in the third row, and the result of using our color correction instead in the last row. The corresponding RGB histograms are shown to the right of each respective image.

4.3.3 Specularity and other misc. artifacts removal

Specularity and other local artifacts are removed by inpainting (see Section 3.4.3 for details). To validate our inpainting methods, we used a set of 25 clean images with randomly selected patches covering 5% and 12% of the total image pixels. We compare our CGAN-based model with l1-contextual loss against widely used traditional TV-based and patch-based inpainting methods. We observe in Table 7 that the l1-contextual CGAN method has the best quality assessment values for both the VIF and RECO measures (VIF: 0.950, RECO: 0.992 for 5% masked pixels; VIF: 0.883, RECO: 0.983 for 12% masked pixels). Even though the TV-based inpainting method scored higher PSNR values in both cases, it scored the lowest RECO values (0.984 and 0.975 for the 5% and 12% cases respectively) and has the highest computational cost (392 seconds). In contrast, the l1-contextual CGAN has the least computational time (2.5 s to load the trained model and apply it to the images on a GeForce GTX 1080 Ti).
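The mask-sampling protocol for the 5% and 12% validation settings can be sketched as below. The equal-sized-square assumption and the derivation of the box side length are ours; the paper does not spell out the exact sampling procedure, so this is an illustrative approximation.

```python
import numpy as np

def sample_mask_with_coverage(height, width, target_fraction,
                              n_boxes=21, seed=0):
    """Sample n_boxes equal-sized square holes so their union covers at
    most target_fraction of the image (overlaps make coverage lower)."""
    rng = np.random.default_rng(seed)
    # side length so that n_boxes disjoint squares would hit the target
    side = max(1, int(np.sqrt(target_fraction * height * width / n_boxes)))
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(n_boxes):
        top = rng.integers(0, height - side + 1)
        left = rng.integers(0, width - side + 1)
        mask[top:top + side, left:left + side] = 1
    return mask
```

Calling this with `target_fraction=0.05` or `0.12` produces masks comparable to the two evaluation settings above.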

Qualitative results for our specularity and local artifact removal on real problematic gastro-oesophageal endoscopic frames are shown in Fig. 10. In Fig. 10 (a), both imaging artifacts (first and fourth rows) and specularities (second and third rows) introduce large deviations in pixel intensities, both locally with respect to neighbouring pixels and globally with respect to the uncorrupted image appearance. Using inpainting methods (see Fig. 10 (c) and (d)), the images have been restored based on the bounding box detections of our artifact detector. The second-best TV-based method in Fig. 10 (c) produces blurry and non-smooth patches during the reconstruction of unknown pixels (refer to Fig. 10 (b)) compared to the CGAN generative model (see Fig. 10 (d)). A closer look at the unknown regions indicated by blue rectangular boxes in Fig. 10 (b) and Fig. 10 (e) shows that local image structures are well preserved and that the transition from reconstructed pixels to the surrounding pixels is smooth. An immediately noticeable ghosting effect can be observed with the TV-based method in the second row of Fig. 10 (e), top.

Method | 5% of total pixels (PSNR / VIF / RECO) | 12% of total pixels (PSNR / VIF / RECO) | Time (s)
TV-based [11] | 45.130 / 0.947 / 0.984 | 40.970 / 0.881 / 0.975 | 392.0
Patch-based [30] | 43.440 / 0.940 / 0.990 | 39.520 / 0.871 / 0.984 | 35.0
l1-cont. CGAN | 43.487 / 0.950 / 0.992 | 39.693 / 0.883 / 0.983 | 2.5
Table 7: Average PSNR, VIF [36] and RECO [5] metrics for restoration of missing pixels, for masks covering 5% and 12% of total image pixels with 21 randomly sampled rectangular boxes on 20 randomly selected images from 3 different patient videos.
Figure 10: Image restoration using inpainting of corrupted areas (specularity, imaging artifacts) detected by our detection method. (a) Original corrupted image, (b) detected bounding boxes, (c) inpainting result using a recent TV-based method, (d) l1-contextual CGAN, (e) top, bottom: restored area marked with a blue rectangle in (b) using the TV-based method and the l1-contextual CGAN generative model, respectively.

4.4 Video recovery and quality assessment

We evaluated our artifact detection and recovery framework on 10 gastro-oesophageal videos comprising nearly 10,000 frames each. For artifact detection, an objectness threshold of 0.25 was used to reduce duplicate detected boxes, and a QS threshold determined whether a frame was restored. As a baseline, we also separately trained a sequential 6-layer convolutional neural network (each layer with 64 filters, ReLU activation and batch normalization) with a fully connected last layer for binary classification, on a set of 6000 manually labeled positive and negative images, to decide whether to discard or keep a given input video frame. A threshold of 0.75 was set for the binary classifier to keep only frames of sufficient quality. Our framework successfully retains the vast majority of frames compared to a binary decision (Fig. 11). Feeding the quality-enhanced video back into our CNN-based binary classifier resulted in fewer frame rejections than on the raw videos. Consequently, the resulting video is more continuous than the equivalent binary-cleaned raw video. For example, in video 3 the frame removal based on the binary classifier directly leads to many distinct abrupt transitions that can be detrimental for post-processing algorithms, as only 30% of frames are kept. Comparatively, our proposed framework retains 70% of frames, i.e., a frame restoration of nearly 40%. Quantitatively, across all 10 endoscopic videos tested our framework restored 25% more video frames, retaining an average of 68.7% of frames.
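The per-frame triage implied by the framework (keep good frames, restore mildly corrupted ones, discard the rest) can be sketched as a simple thresholding rule. Both threshold values below are illustrative placeholders, not the paper's calibrated settings, and the paper derives QS from the detected artifact areas and classes rather than taking it as given.

```python
def triage_frames(quality_scores, keep_threshold=0.95, restore_threshold=0.5):
    """Assign each frame to 'keep', 'restore' (mildly corrupted) or
    'discard' based on its quality score QS in [0, 1]."""
    decisions = []
    for qs in quality_scores:
        if qs >= keep_threshold:
            decisions.append('keep')       # already good enough
        elif qs >= restore_threshold:
            decisions.append('restore')    # route to the restoration models
        else:
            decisions.append('discard')    # too corrupted to recover
    return decisions
```

Unlike the binary classifier baseline, the middle band sends frames to the restoration models instead of dropping them, which is what lets the framework retain more of each video.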

Video 1 Video 2 Video 3
Figure 11: Frame recovery in clinical endoscopy videos. Top: frames and the proportion deemed recoverable over frame sequences using a binary deep classifier and our proposed QS score. Bottom: the proportion of each artifact type present in each video.

4.5 Clinical relevance test

We corrupted 20 high-quality images selected from the 10 test videos with blur, specularity, saturation and misc. artifacts (see Sec. 3.4), and then applied our restoration methods to these images. Two expert endoscopists were independently asked to score the restoration results against the original high-quality images and the corresponding videos. Scores (0-10) were based on: 1) the addition of unnatural distortions, assigned a negative score, and 2) the removal of distortions, assigned a positive score. The mean scores obtained were blur: 7.87, specularity or misc. artifacts: 7.7, and saturation: 1.5. Remarkable restoration was achieved for blur and for specularity or misc. artifacts. However, the saturation correction was less satisfactory to the experts, mostly due to a loss of 3D information (according to their feedback comments), even though visual coherence was improved.

5 Conclusion

We have presented a novel end-to-end framework for the detection and restoration of the artifacts that inevitably corrupt endoscopic video frames, whether linearly, non-linearly or both. Our contribution includes artifact-specific quality assessment and sequential restoration. In particular, as each module in the proposed framework is formulated as a neural network, our framework can take full advantage of the real-time processing capabilities of modern GPUs. We have proposed several novel techniques for frame restoration, including an edge-based (high-frequency) loss for recovering blurred images and a color re-transfer scheme to deal with color shifts in generated frames due to model transfer. We have proposed novel regularization schemes based on each artifact class type for frame restoration, yielding high-quality image generation. Through extensive experiments we have validated each step of our framework, from detection to restoration. We achieved the highest mAP5 and mAP25 with our YOLOv3-spp detector and the least inference time (88 ms), allowing real-time frame quality scoring. We demonstrated quantitative and qualitative improvements for the frame restoration tasks: notably, improvements in both PSNR and SSIM for blur and saturation using our proposed models, and significant improvements on visual similarity metrics for specularity and other misc. artifact removal. Finally, our sequential approach was able to restore an average of 25% of the video frames in 10 randomly selected videos from our database. It is worth noting that for the 3 videos used to illustrate the importance of the proposed framework, 40% of the frames that would otherwise be discarded for downstream analysis were rescued. We demonstrated high-quality performance on real clinical endoscopy videos across intra- and inter-patient variability and multiple modalities.
Future work will focus on further improving the object detection network and implementing the entire framework as a single end-to-end trainable neural network.


  • [1] Mojtaba Akbari, Majid Mohrekesh, SM Soroushmehr, Nader Karimi, Shadrokh Samavi, and Kayvan Najarian. Adaptive specular reflection detection and inpainting in colonoscopy video frames. arXiv preprint arXiv:1802.08402, 2018.
  • [2] Sharib Ali, Christian Daul, Ernest Galbrun, Francois Guillemin, and Walter Blondel. Anisotropic motion estimation on edge preserving riesz wavelets for robust video mosaicing. Patt. Recog., 51:425 –442, 2016.
  • [3] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017.
  • [4] Celia A. Zorzo Barcelos and Marcos Aurélio Batista. Image restoration using digital inpainting and noise removal. Image and Vision Comput., 25(1):61 – 69, 2007.
  • [5] V. Baroncini, L. Capodiferro, E. D. Di Claudio, and G. Jacovitti. The polar edge coherence: A quasi blind metric for video quality assessment. In EUSIPCO, pages 564–568, Aug 2009.
  • [6] Antoni Buades, Triet Le, Jean-Michel Morel, and Luminita Vese. Cartoon+Texture Image Decomposition. Image Process. On Line, 1, 2011.
  • [7] Ayan Chakrabarti. A neural approach to blind motion deblurring. In ECCV, pages 221–235. Springer, 2016.
  • [8] S. Chikkerur, V. Sundaram, M. Reisslein, and L. J. Karam. Objective video quality assessment methods: A classification, review, and performance comparison. IEEE Trans. Broadcast., 57(2):165–182, 2011.
  • [9] Bang Duhyeon and Shim Hyunjung. Improved training of generative adversarial networks using representative features. CoRR, abs/1801.09195, 2018.
  • [10] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, pages 341–346. ACM, 2001.
  • [11] Pascal Getreuer. Total Variation Deconvolution using Split Bregman. Image Process. On Line, 2:158–174, 2012.
  • [12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587. IEEE, 2014.
  • [13] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361. Springer, 2014.
  • [15] Aaron Philip Hertzmann. Algorithms for rendering in artistic styles. PhD thesis, New York University, Graduate School of Arts and Science, 2001.
  • [16] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and Locally Consistent Image Completion. SIGGRAPH, 36(4):107:1–107:14, 2017.
  • [17] Isabel Funke, Sebastian Bodenstedt, Carina Riediger, Jürgen Weitz, and Stefanie Speidel. Generative adversarial networks for specular highlight removal in endoscopic images. In Proc. SPIE, volume 10576, 2018.
  • [18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
  • [19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, 2017.
  • [20] R. Köhler, C. Schuler, B. Schölkopf, and S. Harmeling. Mask-specific inpainting with deep neural networks. In GCPR, pages 523–534. Springer, 2014. LNCS.
  • [21] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. CoRR, abs/1711.07064, 2017.
  • [22] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, pages 936–944. IEEE, 2017.
  • [23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
  • [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
  • [25] H. Liu, W. S. Lu, and M. Q. H. Meng. De-blurring wireless capsule endoscopy images by total variation minimization. In PACRIM, pages 102–106. IEEE, Aug 2011.
  • [26] Diego P.A. Menor, Carlos A.B. Mello, and Cleber Zanchettin. Objective video quality assessment based on neural networks. Procedia Comput. Sci., 96:1551 – 1559, 2016.
  • [27] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
  • [28] Ahmed Mohammed, Ivar Farup, Marius Pedersen, Øistein Hovde, and Sule Yildirim Yayilgan. Stochastic capsule endoscopy image enhancement. J. of Imag., 4(6), 2018.
  • [29] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, pages 257–265. IEEE, 2017.
  • [30] Alasdair Newson, Andrés Almansa, Yann Gousseau, and Patrick Pérez. Non-Local Patch-Based Image Inpainting. Image Process. On Line, 7:373–385, 2017.
  • [31] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544. IEEE, 2016.
  • [32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788. IEEE, 2016.
  • [33] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [35] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [36] H. R. Sheikh and A. C. Bovik. Image information and visual quality. IEEE Trans. Image Process., 15(2):430–444, 2006.
  • [37] Jianhong Shen and Tony F. Chan. Mathematical models for local nontexture inpaintings. SIAM J. of Appl. Math., 62(3):1019–1043, 2002.
  • [38] Thomas Stehle. Removal of specular reflections in endoscopic images. Acta Polytechnica, 46(4), 2006.
  • [39] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, pages 8174–8182. IEEE, 2018.
  • [40] Stephane Tchoulack, JM Pierre Langlois, and Farida Cheriet. A video stream processor for real-time detection and correction of specular reflections in endoscopic images. In Workshop on Circuit and Syst. and TAISA Conf., pages 49–52. IEEE, 2008.
  • [41] Hanghang Tong, Mingjing Li, Hongjiang Zhang, and Changshui Zhang. Blur detection for digital images using wavelet transform. In ICME, pages 17–20. IEEE, 2004.
  • [42] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. Int. J. of Comput. vision, 104(2):154–171, 2013.
  • [43] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518. IEEE, 2001.
  • [44] Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural l0 sparse representation for natural image deblurring. In CVPR, pages 1107–1114. IEEE, 2013.
  • [45] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. CoRR, abs/1801.07892, 2018.
  • [46] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2018.
  • [47] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242–2251. IEEE, 2017.