Interactive Medical Image Segmentation using Deep Learning with Image-specific Fine-tuning
Convolutional neural networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they have not demonstrated sufficiently accurate and robust results for clinical use. In addition, they are limited by the lack of image-specific adaptation and the lack of generalizability to previously unseen object classes. To address these problems, we propose a novel deep learning-based framework for interactive segmentation by incorporating CNNs into a bounding box and scribble-based segmentation pipeline. We propose image-specific fine-tuning to make a CNN model adaptive to a specific test image, which can be either unsupervised (without additional user interactions) or supervised (with additional scribbles). We also propose a weighted loss function considering network and interaction-based uncertainty for the fine-tuning. We applied this framework to two applications: 2D segmentation of multiple organs from fetal MR slices, where only two types of these organs were annotated for training; and 3D segmentation of brain tumor core (excluding edema) and whole brain tumor (including edema) from different MR sequences, where only tumor cores in one MR sequence were annotated for training. Experimental results show that 1) our model is more robust to segment previously unseen objects than state-of-the-art CNNs; 2) image-specific fine-tuning with the proposed weighted loss function significantly improves segmentation accuracy; and 3) our method leads to accurate results with fewer user interactions and less user time than traditional interactive segmentation methods.
Deep learning with convolutional neural networks (CNNs) has achieved state-of-the-art performance for automated medical image segmentation . However, automatic segmentation methods have not demonstrated sufficiently accurate and robust results for clinical use due to the inherent challenges of medical images, such as poor image quality, different imaging and segmentation protocols, and variations among patients . Alternatively, interactive segmentation methods are widely adopted, as integrating the user’s knowledge can take into account the application requirements and make it easier to distinguish different tissues [2, 3, 4]. As such, interactive segmentation remains the state of the art for existing commercial surgical planning and navigation products. Though leveraging user interactions often leads to more robust segmentations, a good interactive method should require as little user time as possible to reduce the burden on users. Motivated by these observations, we investigate combining CNNs with user interactions for medical image segmentation to achieve higher segmentation accuracy and robustness with fewer user interactions and less user time. However, there are very few studies on using CNNs for interactive segmentation [5, 6, 7]. This is mainly due to the requirement of large amounts of annotated images for training, the lack of image-specific adaptation and the demanding balance of model complexity, time and memory space efficiency.
The first challenge of using CNNs for interactive segmentation is that current CNNs do not generalize well to previously unseen object classes, as they require labeled instances of each object class to be present in the training set. For medical images, annotations are often expensive to acquire as both expertise and time are needed to produce accurate annotations. This limits the performance of CNNs to segment objects for which annotations are not available in the training stage.
Second, interactive segmentation often requires image-specific learning to deal with large context variations among different images, but current CNNs are not adaptive to different test images, as parameters of the model are learned from training images and then fixed during the testing, without image-specific adaptation. It has been shown that image-specific adaptation of a pre-trained Gaussian Mixture Model (GMM) helps to improve segmentation accuracy . However, transitioning from simple GMMs to powerful but complex CNNs in this context has not yet been demonstrated.
Third, fast inference and memory efficiency are demanded for interactive methods. These can be relatively easily achieved for 2D segmentations, but become much more problematic for 3D volumes. For example, DeepMedic works on local patches to reduce memory requirements but results in slow inference. HighRes3DNet works on an entire volume with relatively fast inference but needs a large amount of GPU memory, leading to high hardware requirements. To make a CNN-based interactive segmentation method efficient to use, it is desirable to enable CNNs to respond quickly to user interactions and to work on a machine with limited GPU resources (e.g., a standard desktop PC or a laptop). DeepIGeoS combines CNNs with user interactions and has demonstrated good interactivity. However, it lacks adaptability to unseen image contexts.
The contributions of this work are four-fold. First, we propose a novel deep learning-based framework for interactive 2D and 3D medical image segmentation by incorporating CNNs into a bounding box and scribble-based binary segmentation pipeline. Second, we propose to use image-specific fine-tuning to adapt a CNN model to each test image independently. The fine-tuning can be either unsupervised (without additional user interactions) or supervised where user-provided scribbles will guide the learning process. Third, we propose a weighted loss function considering network and interaction-based uncertainty during image-specific fine-tuning. Fourth, we present the first attempt to employ CNNs to segment previously unseen objects. The proposed framework does not require annotations of all the organs for training. Thus, it can be applied to new organs or new segmentation protocols directly.
I-B Related Works
I-B1 CNNs for Image Segmentation
For natural image segmentation, FCN  and DeepLab  are among the state-of-the-art performing methods. For 2D biomedical image segmentation, efficient networks such as U-Net , DCAN  and Nabla-net  have been proposed. For 3D volumes, patch-based CNNs were proposed for segmentation of the brain tumor  and pancreas , and more powerful end-to-end 3D CNNs were proposed by V-Net , HighRes3DNet , and 3D deeply supervised network .
I-B2 Interactive Segmentation Methods
An extensive range of interactive segmentation methods have been proposed . Representative methods include Graph Cuts , Random Walks  and GeoS . Machine learning methods have been widely used to achieve high accuracy and interaction efficiency. For example, GMMs are used by GrabCut  to segment color images. Online random forests (ORFs) are employed by SlicSeg  for segmentation of fetal MRI volumes. In , active learning is used to segment 3D Computed Tomography (CT) images. They have achieved more accurate segmentations with fewer user interactions compared with traditional interactive segmentation methods.
To combine user interactions with CNNs, DeepCut  and ScribbleSup  propose to leverage user-provided bounding boxes or scribbles, but they employ user interactions as sparse annotations for the training set rather than as guidance for dealing with a single test image. 3D U-Net  learns from annotations of some slices in a volume and produces a dense 3D segmentation, but takes a long time for training and cannot be made responsive to user interactions. In , an FCN is combined with user interactions for 2D RGB image segmentation, without adaptation for medical images. DeepIGeoS  uses geodesic distance transforms of scribbles as additional channels of CNNs for interactive medical image segmentation, but cannot deal with previously unseen object classes.
I-B3 Model Adaptation
Previous learning-based interactive segmentation methods often employ an image-specific model. For example, GrabCut  and SlicSeg  learn from the target image with GMMs and ORFs, respectively, so that they can be well adapted to the specific target image. Learning a model from a training set with image-specific adaptation during the testing has also been used to improve the segmentation performance. For example, an adaptive GMM has been used to address the distribution mismatch between a test image and the training set . For CNNs, fine-tuning  is used for domain-wise model adaptation to address the distribution mismatch between different training sets. However, to the best of our knowledge, this paper is the first work to propose image-specific model adaptation for CNNs.
The proposed interactive segmentation framework is depicted in Fig. 1. We refer to it as BIFSeg. To deal with different (including previously unseen) objects in a unified framework, we propose to use a CNN that takes as input the content of a bounding box of one instance and gives a binary segmentation. During the test stage, the bounding box is provided by the user, and the segmentation and the CNN are alternately refined through unsupervised (without additional user interactions) or supervised (with user-provided scribbles) image-specific fine-tuning. Our framework is general and flexible, and can handle both 2D and 3D segmentations with few assumptions about network structures. In this paper, we choose to use the state-of-the-art network structures proposed in . The contribution of BIFSeg is nonetheless largely different from  as BIFSeg focuses on segmentation of previously unseen object classes and fine-tunes the CNN model on the fly for image-wise adaptation that can be guided by user interactions.
II-A CNN Models
For 2D images, we adopt the P-Net for bounding box-based binary segmentation. The network is resolution-preserving using dilated convolution to avoid potential loss of details. As shown in Fig. 2(a), it consists of six blocks with a receptive field of 181×181. The first five blocks have dilation parameters of 1, 2, 4, 8 and 16, respectively, so they capture features at different scales. Features from these five blocks are concatenated and fed into block6 that serves as a classifier. A softmax layer is used to obtain probability-like outputs. In the testing stage, we update the model based on image-specific fine-tuning. To ensure efficient fine-tuning and fast response to user interactions, we only fine-tune parameters of the classifier (block6). Thus, features in the concatenation layer for the test image can be stored before the fine-tuning.
For 3D images, we consider a trade-off between receptive field, inference time and memory efficiency. As shown in Fig. 2(b), the network is similar to P-Net. It has an anisotropic receptive field of 85×85×9. Compared with slice-based networks, it employs 3D context. Compared with large isotropic 3D receptive fields, it has less memory consumption during inference. Besides, anisotropic acquisition is often used in MR images. We use 3×3×3 kernels in the first two blocks and 3×3×1 kernels in block3 to block5. Similar to P-Net, we fine-tune the classifier (block6) with pre-computed concatenated features. To save space for storing the concatenated features, we use 1×1×1 convolutions to compress the features in block1 to block5 and then concatenate them. We refer to this 3D network with feature compression as PC-Net.
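The receptive field sizes quoted above can be checked with a short calculation. The sketch below is only illustrative: the number of convolution layers per block is not stated in the text, so the layer counts used here (two layers in block1 and block2, three in block3 to block5) and the absence of dilation along the depth axis in PC-Net are assumptions. They were chosen because they reproduce the reported in-plane extent of 181 for P-Net and the depth extent of 9 for PC-Net. The function name `receptive_field` is hypothetical.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field along one axis of stacked stride-1 dilated convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf


# Assumed layer counts (not given in the text): 2, 2, 3, 3, 3 layers in
# block1..block5, with the dilation parameters 1, 2, 4, 8, 16 from the text.
P_NET_DILATIONS = [1] * 2 + [2] * 2 + [4] * 3 + [8] * 3 + [16] * 3

# Assumed for the PC-Net depth axis: only the four layers of block1 and block2
# have extent 3 in depth (3x3x3 kernels), with no dilation along depth.
PC_NET_DEPTH_DILATIONS = [1] * 4

print(receptive_field(P_NET_DILATIONS))         # 181
print(receptive_field(PC_NET_DEPTH_DILATIONS))  # 9
```

The in-plane dilation parameters of PC-Net are not given in the text, so its in-plane extent of 85 is not checked here.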
II-B Training of CNNs
The training stage for 2D/3D segmentation is shown in the first row of Fig. 1. Consider a $K$-ary segmentation training set $T = \{(X_1, Y_1), (X_2, Y_2), \ldots\}$ where $X_p$ is one training image and $Y_p$ is the corresponding label map. The label set of $T$ is $\{0, 1, 2, \ldots, K-1\}$ with 0 being the background label. Let $N_k$ denote the number of instances of the $k$th object type, so the total number of instances is $\hat{N} = \sum_k N_k$. Each image $X_p$ can have instances of multiple object classes. Suppose the label of the $q$th instance in $X_p$ is $l_{pq}$; $Y_p$ is then converted into a binary image $Y_{pq}$ based on whether the value of each pixel in $Y_p$ equals $l_{pq}$. The bounding box $B_{pq}$ of that training instance is automatically calculated based on $Y_{pq}$ and expanded by a random margin in the range of 0 to 10 pixels/voxels. $X_p$ and $Y_{pq}$ are cropped based on $B_{pq}$. Thus, $T$ is converted into a cropped set $\hat{T} = \{(\hat{X}_1, \hat{Y}_1), (\hat{X}_2, \hat{Y}_2), \ldots\}$ with size $\hat{N}$ and label set $\{0, 1\}$ where 1 is the label of the instance foreground and 0 the background. With $\hat{T}$, the CNN model (e.g., P-Net or PC-Net) is trained to extract the target from its bounding box, which is a binary segmentation problem irrespective of the object type. A cross entropy loss function is used for training.
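The conversion of a multi-label training pair into a cropped binary pair can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name `crop_instance` is hypothetical, one margin is drawn per axis and applied to both sides (the text only says "a random margin"), and all pixels carrying the given label are treated as a single instance.

```python
import numpy as np


def crop_instance(image, label_map, instance_label, max_margin=10, rng=None):
    """Return (cropped image, cropped binary mask) for one labelled object.

    Works for 2D and 3D arrays. The bounding box is computed from the binary
    mask and expanded by a random margin of 0..max_margin pixels/voxels.
    """
    rng = np.random.default_rng() if rng is None else rng
    binary = (label_map == instance_label).astype(np.uint8)
    coords = np.argwhere(binary)
    if coords.size == 0:
        raise ValueError("label not present in the label map")
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1  # exclusive upper bound
    # Assumption: one random margin per axis, same on both sides of the box.
    margin = rng.integers(0, max_margin + 1, size=binary.ndim)
    lo = np.maximum(lo - margin, 0)                # clip to the image extent
    hi = np.minimum(hi + margin, binary.shape)
    box = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    return image[box], binary[box]


# Toy example: a 10x15 rectangle with label 2 inside a 40x50 image.
image = np.arange(40 * 50, dtype=float).reshape(40, 50)
labels = np.zeros((40, 50), dtype=int)
labels[10:20, 15:30] = 2
patch, mask = crop_instance(image, labels, 2, rng=np.random.default_rng(0))
print(patch.shape, int(mask.sum()))
```

Because the object type is discarded after binarization, the same routine serves every class in the training set, which is what makes the trained network class-agnostic.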
II-C Unsupervised and Supervised Image-specific Fine-tuning
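In the testing stage, let $X$ denote the sub-image inside a user-provided bounding box and $Y$ be the target label of $X$. The set of parameters of the trained CNN is $\theta$. With the initial segmentation $Y_0$ obtained by the trained CNN, the user may provide (i.e., supervised) or not provide (i.e., unsupervised) a set of scribbles to guide the update of $Y_0$. Let $S_f$ and $S_b$ denote the scribbles for foreground and background, respectively, so the entire set of scribbles is $S = S_f \cup S_b$. Let $s_i$ denote the user-provided label of a pixel $i$ in the scribbles, then we have $s_i = 1$ if $i \in S_f$ and $s_i = 0$ if $i \in S_b$. We minimize an objective function that is similar to GrabCut but we use P-Net or PC-Net instead of a GMM:
$$\arg\min_{Y, \theta} \Big\{ E(Y, \theta) = \sum_i \phi(y_i \mid X, \theta) + \lambda \sum_{i,j} \psi(y_i, y_j \mid X) \Big\} \quad \text{subject to: } y_i = s_i \text{ if } i \in S \tag{1}$$
where $E(Y, \theta)$ is constrained by user interactions if $S$ is not empty. $\phi$ and $\psi$ are the unary and pairwise energy terms, respectively. $\lambda$ is the weight of $\psi$. An unconstrained optimization of an energy similar to $E$ is used in  for weakly supervised learning. In that work, the energy was based on the probability and label map of all the images in a training set, which is a different task from ours, as we focus on a single test image. We follow a typical choice of $\psi$:
$$\psi(y_i, y_j \mid X) = [y_i \neq y_j] \exp\Big(-\frac{(X(i) - X(j))^2}{2\sigma^2}\Big) \cdot \frac{1}{d_{ij}} \tag{2}$$
where $[y_i \neq y_j]$ is 1 if $y_i \neq y_j$ and 0 otherwise. $d_{ij}$ is the Euclidean distance between pixel $i$ and pixel $j$. $\sigma$ controls the effect of intensity difference. $\phi$ is defined as:
$$\phi(y_i \mid X, \theta) = -\log P(y_i \mid X, \theta) \tag{3}$$
where $P(y_i \mid X, \theta)$ is the probability given by softmax output of the CNN. Let $p_i = P(y_i = 1 \mid X, \theta)$ be the probability of pixel $i$ belonging to the foreground, we then have:
$$\phi(y_i \mid X, \theta) = -\big(y_i \log p_i + (1 - y_i) \log(1 - p_i)\big) \tag{4}$$
The optimization of Eq. (1) can be decomposed into steps that alternately update the segmentation label $Y$ and network parameters $\theta$ [5, 20]. In the label update step, we fix $\theta$ and solve for $Y$, and Eq. (1) becomes a CRF problem:
$$\arg\min_{Y} \Big\{ \sum_i \phi(y_i \mid X, \theta) + \lambda \sum_{i,j} \psi(y_i, y_j \mid X) \Big\} \quad \text{subject to: } y_i = s_i \text{ if } i \in S \tag{5}$$
For implementation ease, the constrained optimization in Eq. (5) is converted to an unconstrained equivalent:
$$\arg\min_{Y} \Big\{ \sum_i \phi'(y_i \mid X, \theta) + \lambda \sum_{i,j} \psi(y_i, y_j \mid X) \Big\} \tag{6}$$
$$\phi'(y_i \mid X, \theta) = \begin{cases} +\infty & \text{if } i \in S \text{ and } y_i \neq s_i \\ 0 & \text{if } i \in S \text{ and } y_i = s_i \\ -\log P(y_i \mid X, \theta) & \text{otherwise} \end{cases} \tag{7}$$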
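To make the label update step more concrete, the two energy terms can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' code: the unary cost is taken to be the negative log of the softmax probability, scribbles are turned into hard constraints by giving the forbidden label a very large finite cost (`big`, a stand-in for an infinite cost, since max-flow solvers typically need finite capacities), and the pairwise weight uses the common contrast-sensitive form divided by the Euclidean distance between the two pixels. The function names are hypothetical, and the graph-cut minimization itself is not shown.

```python
import numpy as np


def unary_energy(p_fg, fg_scribble=None, bg_scribble=None, big=1e9):
    """Per-pixel cost of assigning label 0 (background) or 1 (foreground).

    p_fg: foreground probability map from the network's softmax output.
    fg_scribble / bg_scribble: optional boolean masks of scribbled pixels.
    Returns an array of shape p_fg.shape + (2,).
    """
    tiny = 1e-12  # avoids log(0)
    cost = np.stack([-np.log(1.0 - p_fg + tiny), -np.log(p_fg + tiny)], axis=-1)
    if fg_scribble is not None:
        cost[fg_scribble, 0] = big   # a foreground scribble forbids label 0
        cost[fg_scribble, 1] = 0.0
    if bg_scribble is not None:
        cost[bg_scribble, 1] = big   # a background scribble forbids label 1
        cost[bg_scribble, 0] = 0.0
    return cost


def pairwise_weight(intensity_i, intensity_j, distance, sigma):
    """Penalty paid when neighbouring pixels i and j receive different labels."""
    diff = intensity_i - intensity_j
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2)) / distance


p = np.array([[0.9, 0.2], [0.5, 0.6]])
fg = np.zeros((2, 2), dtype=bool)
fg[0, 1] = True                      # user marks a mis-segmented pixel as foreground
print(unary_energy(p, fg_scribble=fg)[0, 1])   # background forbidden, foreground free
print(pairwise_weight(0.3, 0.3, 1.0, sigma=0.1))
```

Similar intensities give a pairwise weight close to 1 over the distance, so cutting between them is expensive, while a strong intensity edge makes a label change almost free.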
II-D Weighted Loss Function during Network Update Step
During the network update step, the CNN is fine-tuned to fit the current segmentation $Y$. Compared with a standard learning process that treats all the pixels equally, we propose to weight different kinds of pixels according to their confidence. First, user-provided scribbles have much higher confidence than the other pixels, and they should have a higher impact on the loss function, leading to a weighted version of Eq. (3):
$$\phi(y_i \mid X, \theta) = -w(i) \log P(y_i \mid X, \theta), \qquad w(i) = \begin{cases} \omega & \text{if } i \in S \\ 1 & \text{otherwise} \end{cases}$$
where $\omega$ is the weight of scribble pixels.
Second, $Y$ may contain mis-classified pixels that can mislead the network update process. To address this problem, we propose to fine-tune the network by ignoring pixels with high uncertainty (low confidence) in the test image. We propose to use network-based uncertainty and scribble-based uncertainty. The network-based uncertainty is based on the network’s softmax output. Since $y_i$ is highly uncertain (has low confidence) if $p_i$ is close to 0.5, we define the set of pixels with high network-based uncertainty as $U_p = \{i \mid t_0 < p_i < t_1\}$ where $t_0$ and $t_1$ are the lower and higher threshold values of foreground probability, respectively. The scribble-based uncertainty is based on the geodesic distance to scribbles. Let $G(i, S_f)$ and $G(i, S_b)$ denote the geodesic distance from pixel $i$ to $S_f$ and $S_b$, respectively. Since the scribbles are drawn on mis-segmented areas for refinement, it is likely that pixels close to $S$ have been incorrectly labeled by the initial segmentation. Let $\epsilon$ be a threshold value for the geodesic distance. We define the set of pixels with high scribble-based uncertainty as $U_s = U_s^f \cup U_s^b$ where $U_s^f = \{i \mid i \notin S, G(i, S_f) < \epsilon, y_i = 0\}$, $U_s^b = \{i \mid i \notin S, G(i, S_b) < \epsilon, y_i = 1\}$. Therefore, a full version of the weighting function is (an example is shown in Fig. 3):
$$w(i) = \begin{cases} \omega & \text{if } i \in S \\ 0 & \text{if } i \in U_p \cup U_s \\ 1 & \text{otherwise} \end{cases}$$
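The weighting scheme described above can be sketched as a per-pixel weight map. This is a hedged illustration: the function name `fine_tuning_weights` and its argument names are hypothetical, the geodesic distance maps are assumed to be precomputed by some other routine, and the rule that a pixel near a foreground scribble counts as uncertain only when it is currently labelled background (and vice versa) is an interpretation of the text. Scribble pixels take precedence over both kinds of uncertainty.

```python
import numpy as np


def fine_tuning_weights(p_fg, labels, fg_scribble, bg_scribble,
                        geo_to_fg, geo_to_bg, t0, t1, eps, omega):
    """Per-pixel loss weights for the network update step.

    p_fg:       foreground probability from the softmax output
    labels:     current binary segmentation (0 or 1)
    *_scribble: boolean masks of user scribbles (all False if none)
    geo_to_*:   precomputed geodesic distance to each scribble set
                (np.inf everywhere if that set is empty)
    """
    scribble = fg_scribble | bg_scribble
    # Network-based uncertainty: probability between the two thresholds.
    uncertain_net = (p_fg > t0) & (p_fg < t1)
    # Scribble-based uncertainty: non-scribble pixels close to a scribble
    # whose current label disagrees with that scribble.
    near_fg = (~scribble) & (geo_to_fg < eps) & (labels == 0)
    near_bg = (~scribble) & (geo_to_bg < eps) & (labels == 1)

    weights = np.ones(p_fg.shape, dtype=np.float32)
    weights[uncertain_net | near_fg | near_bg] = 0.0  # ignored during fine-tuning
    weights[scribble] = omega                         # scribbles override everything
    return weights


p = np.array([0.9, 0.5, 0.1, 0.45, 0.95])
y = np.array([1, 1, 0, 0, 1])
fg = np.array([False, False, False, True, False])
bg = np.zeros(5, dtype=bool)
g_fg = np.array([5.0, 3.0, 0.1, 0.0, 4.0])
g_bg = np.full(5, np.inf)
print(fine_tuning_weights(p, y, fg, bg, g_fg, g_bg,
                          t0=0.2, t1=0.7, eps=0.2, omega=5.0))
# [1. 0. 0. 5. 1.]
```

In the unsupervised case both scribble masks are empty, so only the probability thresholds remove pixels from the loss.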
II-E Implementation Details
We used the Caffe (http://caffe.berkeleyvision.org) library to implement our P-Net and PC-Net. The training process was done via one node of the Emerald cluster (http://www.ses.ac.uk/high-performance-computing/emerald) with two 8-core E5-2623v3 Intel Haswells, a K80 NVIDIA GPU and 128GB memory. Stochastic gradient descent was used for training, with momentum 0.9, batch size 1, weight decay , maximal number of iterations 60k, and an initial learning rate that was halved every 5k iterations. For each application, the images in each modality were normalized by the mean value and standard deviation of the training images. During training, the bounding box for each object was automatically generated based on the ground truth label with a random margin in the range of 0 to 10 pixels/voxels.
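The step schedule described above (the learning rate is halved every 5k iterations, for at most 60k iterations) can be written as a one-line function. The initial learning rate is not recoverable from this text, so it is left as a parameter; the value 1.0 in the example is only a placeholder, and the function name is hypothetical.

```python
def step_halving_lr(base_lr, iteration, step=5000):
    """Learning rate after `iteration` steps when it is halved every `step` iterations."""
    return base_lr * 0.5 ** (iteration // step)


# Placeholder base rate of 1.0, so the printed values are the decay factors.
for it in (0, 4999, 5000, 59999):
    print(it, step_halving_lr(1.0, it))
```

With 60k iterations the rate is halved 11 times before training stops.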
For the testing with user interactions, the trained CNN models were deployed to a MacBook Pro (OS X 10.9.5) with 16GB RAM, an Intel Core i7 CPU running at 2.5GHz and an NVIDIA GeForce GT 750M GPU. A Matlab GUI and a PyQt GUI were used for user interactions on 2D and 3D images, respectively. The bounding box was provided by the user. For image-specific fine-tuning, $Y$ and $\theta$ were alternately updated for four iterations. In each network update step, we used a learning rate and iteration number 20. We used a grid search with the training data to get proper values of $\lambda$, $\sigma$, $t_0$, $t_1$, $\epsilon$ and $\omega$. Their numerical values are listed in the specific experiments sections III-B and III-C.
III Experiments and Results
|P: Placenta, FB: Fetal brain, FL: Fetal lungs, MK: Maternal kidneys.|
We validated the proposed framework with two applications: 2D segmentation of multiple organs from fetal MRI and 3D segmentation of brain tumors from contrast enhanced T1-weighted (T1c) and Fluid-attenuated Inversion Recovery (FLAIR) images. For both applications, we additionally investigated the segmentation performance on previously unseen objects that were not present in the training set.
III-A Comparison Methods and Evaluation Metrics
To investigate the performance of different networks with the same bounding box, we compared P-Net with FCN and U-Net for 2D images, and compared PC-Net with DeepMedic and HighRes3DNet for 3D images (DeepMedic and HighRes3DNet were implemented in http://niftynet.io). The original DeepMedic works on multiple modalities, and we adapted it to work on a single modality. All these methods were evaluated on the laptop during the testing except for HighRes3DNet, which was run on the cluster due to the laptop’s limited GPU memory. To validate the proposed unsupervised/supervised image-specific fine-tuning, we compared BIFSeg with 1) the initial output of P-Net/PC-Net, 2) post-processing the initial output with a CRF (using user interactions as hard constraints if they were given), and 3) image-specific fine-tuning based on Eq. (1) with $w(i) = 1$ for all the pixels, which is referred to as BIFSeg(-w).
BIFSeg was also compared with other interactive segmentation methods: GrabCut, SlicSeg and Random Walks for 2D segmentation, and GeoS, GrowCut and 3D GrabCut for 3D segmentation. The 2D/3D GrabCut used the same bounding box as used by BIFSeg, and they used 3 and 5 components for the foreground and background GMMs, respectively. SlicSeg, Random Walks, GeoS and GrowCut require scribbles without a bounding box for segmentation. The segmentation results by an Obstetrician and a Radiologist were used for evaluation. For each method, each user provided scribbles to update the result multiple times until the user accepted it as the final segmentation. The Dice score between a segmentation and the ground truth was used for quantitative evaluations: $\text{Dice} = 2|R_a \cap R_b| / (|R_a| + |R_b|)$ where $R_a$ and $R_b$ denote the region segmented by an algorithm and the ground truth, respectively. The $p$-value between different methods was computed by the Student’s $t$-test.
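For reference, the Dice score used throughout the experiments (twice the overlap of the two regions divided by the sum of their sizes) can be computed as in the sketch below. The function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something specified in the text.

```python
import numpy as np


def dice_score(segmentation, ground_truth):
    """Dice overlap between two binary masks of the same shape."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    denom = seg.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(seg, gt).sum() / denom


seg = np.array([[1, 1, 0], [0, 0, 0]])
gt = np.array([[1, 0, 0], [1, 0, 0]])
print(dice_score(seg, gt))  # 0.5: one shared pixel, two pixels in each mask
```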
III-B 2D Segmentation of Multiple Organs from Fetal MRI
Single-shot Fast Spin Echo (SSFSE) was used to acquire stacks of T2-weighted MR images from 18 pregnant women with pixel size 0.74 to 1.58 mm and inter-slice spacing 3 to 4 mm. Due to the large inter-slice spacing and inter-slice motion, interactive 2D segmentation is more suitable than direct 3D segmentation. The placenta and fetal brain from 10 volumes (356 slices) were used for training. The other 8 volumes (318 slices) were used for testing. From the test images, we aimed to segment the placenta, fetal brain, and previously unseen fetal lungs and maternal kidneys. Manual segmentations by a Radiologist were used as the ground truth. P-Net was used for this segmentation task. To deal with organs at different scales, we resized the input of P-Net so that the minimal value of width and height was 128 pixels. Parameter setting was $\lambda$ = 3.0, $\sigma$ = 0.1, $t_0$ = 0.2, $t_1$ = 0.7, $\epsilon$ = 0.2, $\omega$ = 5.0 based on a grid search with the training data.
III-B2 Initial Segmentation based on P-Net
Fig. 4 shows the initial segmentation of different organs from fetal MRI with user-provided bounding boxes. It can be observed that GrabCut achieves a poor segmentation except for the fetal brain where there is a good contrast between the target and the background. For the placenta and fetal brain, FCN, U-Net and P-Net achieve visually similar results that are close to the ground truth. However, for fetal lungs and maternal kidneys that are previously unseen in the training set, FCN and U-Net lead to a large region of under-segmentation. In contrast, P-Net performs noticeably better than FCN and U-Net when dealing with these two unseen objects. A quantitative evaluation of these methods is listed in Table I. It shows that P-Net achieves the best accuracy for unseen fetal lungs and maternal kidneys with an average machine time of 0.16s.
III-B3 Unsupervised Image-specific Fine-tuning
For unsupervised refinement, the initial segmentation result obtained by P-Net was refined by CRF, BIFSeg(-w) and BIFSeg without additional scribbles, respectively. The results are shown in Fig. 5. The second to fourth rows show the foreground probability obtained by P-Net before and after the fine-tuning. In the second row, the initial output of P-Net has a probability around 0.5 for many pixels, which indicates a high uncertainty. After image-specific fine-tuning, most pixels in the outputs of BIFSeg(-w) and BIFSeg have a probability close to 0.0 or 1.0. The remaining rows show the segmentations by P-Net and the three refinement methods, respectively. The visual comparison shows that BIFSeg performs better than P-Net + CRF and BIFSeg(-w). Quantitative measurements are presented in Table II. It shows that BIFSeg achieves a larger improvement of accuracy from the initial segmentation when compared with the use of CRF or BIFSeg(-w). In this 2D case, BIFSeg takes 0.72s on average for unsupervised image-specific fine-tuning.
III-B4 Supervised Image-specific Fine-tuning
Fig. 6 shows examples of supervised refinement with additional scribbles. The second row shows the initial segmentation obtained by P-Net. In the third row, red and blue scribbles are drawn in mis-segmented regions to label the corresponding pixels as the foreground and background, respectively. The same initial segmentation and scribbles are used for P-Net + CRF, BIFSeg(-w) and BIFSeg. All these methods improve the segmentation. However, some large mis-segmentations can still be observed for P-Net + CRF and BIFSeg(-w). In contrast, BIFSeg achieves better results with the same set of scribbles. For a quantitative comparison, we measured the segmentation accuracy after a single round of refinement using the same set of scribbles. The result is shown in Table III. BIFSeg achieves significantly better accuracy ($p$-value < 0.05) for the placenta, and previously unseen fetal lungs and maternal kidneys compared with P-Net + CRF and BIFSeg(-w).
III-B5 Comparison with Other Interactive Methods
The two users (an Obstetrician and a Radiologist) used SlicSeg, GrabCut, Random Walks and BIFSeg for the fetal MRI segmentation tasks, respectively. For each image, the user performed the segmentation interactively until the result was accepted by the user. The user time and final accuracy of these methods are presented in Fig. 7. It shows that BIFSeg takes noticeably less user time with similar or higher accuracy compared with the other three interactive segmentation methods.
III-C 3D Segmentation of Brain Tumors from T1c and FLAIR
|TC: Tumor core in T1c, WT: Whole tumor in FLAIR.|
To validate our method with 3D images, we used the 2015 Brain Tumor Segmentation Challenge (BRATS) training set. The ground truth was manually delineated by experts. This dataset was collected from 274 cases with multiple MR sequences that give different contrasts. T1c highlights the tumor without peritumoral edema, designated “tumor core” as per . FLAIR highlights the tumor with peritumoral edema, designated “whole tumor” as per . We investigate interactive segmentation of tumor cores from T1c images and whole tumors from FLAIR images, which is different from previous works on automatic multi-label and multi-modality segmentation [31, 9]. For tumor core segmentation, we randomly selected 249 T1c volumes as our training set and used the remaining 25 T1c volumes as the testing set. Additionally, to investigate dealing with unseen objects, we employed the CNNs trained in this way to segment whole tumors in the corresponding FLAIR images of these 25 volumes, which were not present in our training set. All these images had been skull-stripped and resampled to isotropic 1mm resolution. To deal with 3D tumor cores and whole tumors at different scales, we resized the cropped image region inside a bounding box so that the maximal value of its width, height and depth was 80. Parameter setting was $\lambda$ = 10.0, $\sigma$ = 0.1, $t_0$ = 0.2, $t_1$ = 0.6, $\epsilon$ = 0.2, $\omega$ = 5.0 based on a grid search with the training data.
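The two resizing rules used in the experiments (smallest in-plane side set to 128 pixels for the 2D fetal MRI crops, largest side set to 80 voxels for the 3D tumor crops) can be expressed with one helper. This sketch assumes the aspect ratio is preserved, which the text does not state explicitly; the function name and the `mode` argument are hypothetical, and the interpolation itself is not shown.

```python
def scaled_shape(shape, target, mode):
    """Target shape after isotropic rescaling of a cropped region.

    mode="min": the smallest dimension becomes `target` (2D rule in the text).
    mode="max": the largest dimension becomes `target` (3D rule in the text).
    """
    if mode == "min":
        ref = min(shape)
    elif mode == "max":
        ref = max(shape)
    else:
        raise ValueError("mode must be 'min' or 'max'")
    scale = target / ref
    return tuple(max(1, round(s * scale)) for s in shape)


print(scaled_shape((64, 96), 128, "min"))       # (128, 192)
print(scaled_shape((120, 160, 40), 80, "max"))  # (60, 80, 20)
```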
III-C2 Initial Segmentation based on PC-Net
Fig. 8(a) shows an initial result of tumor core segmentation from T1c with a user-provided bounding box. Since the central region of the tumor has a low intensity close to that of the background, 3D GrabCut has a poor performance with under-segmentations. DeepMedic leads to some over-segmentations. HighRes3DNet and PC-Net obtain similar results, but PC-Net is less complex and has a lower memory consumption. Fig. 8(b) shows an initial segmentation result of previously unseen whole tumor from FLAIR. 3D GrabCut fails to get high accuracy due to intensity inconsistency in the tumor region, and the CNNs outperform 3D GrabCut, with DeepMedic and PC-Net performing better than HighRes3DNet. A quantitative comparison is presented in Table IV. It shows that the performance of DeepMedic is low for T1c but high for FLAIR, and that of HighRes3DNet is the opposite. This is because DeepMedic has a small receptive field and tends to rely on local features. It is difficult to use local features to deal with T1c due to its complex appearance but easier to deal with FLAIR since the appearance is less complex. HighRes3DNet has a more complex model and tends to over-fit tumor core. In contrast, PC-Net achieves a more stable performance on tumor core and previously unseen whole tumor. The average machine time for 3D GrabCut, DeepMedic, and PC-Net is 3.87s, 65.31s and 3.83s, respectively (on the laptop), and that for HighRes3DNet is 1.10s (on the cluster).
III-C3 Unsupervised Image-specific Fine-tuning
Fig. 9 shows unsupervised fine-tuning for brain tumor segmentation based on the initial output of PC-Net without additional user interactions. In Fig. 9(a), the tumor core is under-segmented in the initial output of PC-Net. CRF improves the segmentation to some degree, but large areas of under-segmentation still exist. The segmentation result of BIFSeg(-w) is similar to that of CRF. In contrast, BIFSeg performs better than CRF and BIFSeg(-w). A similar situation is observed in Fig. 9(b) for segmentation of previously unseen whole tumor. A quantitative comparison of these methods is shown in Table V. BIFSeg improves the average Dice score from 82.66% to 86.13% for tumor core, and from 83.52% to 86.29% for whole tumor.
III-C4 Supervised Image-specific Fine-tuning
Fig. 10 shows refined results of brain tumor segmentation with additional scribbles provided by the user. The same initial segmentation based on PC-Net and the same scribbles are used by CRF, BIFSeg(-w) and BIFSeg. It can be observed that CRF and BIFSeg(-w) correct the initial segmentation moderately. In contrast, BIFSeg achieves better refined results for both tumor cores in T1c and whole tumors in FLAIR. For a quantitative comparison of these refinement methods, we measured the segmentation accuracy after a single round of refinement using the same set of scribbles based on the same initial segmentation. The result is shown in Table VI. BIFSeg achieves an average Dice score of 87.49% and 88.11% for tumor core and previously unseen whole tumor, respectively, and it significantly outperforms CRF and BIFSeg(-w).
III-C5 Comparison with Other Interactive Methods
The two users (an Obstetrician and a Radiologist) used GeoS, GrowCut, 3D GrabCut and BIFSeg for the brain tumor segmentation tasks, respectively. The user time and final accuracy of these methods are presented in Fig. 11. It shows that these interactive methods achieve similar final Dice scores for each task. However, BIFSeg takes significantly less user time to get the results, which is 82.3s and 68.0s on average for tumor core and whole tumor, respectively.
IV Discussion and Conclusion
For 2D images, our P-Net is trained with placenta and fetal brain only, but it performs well on previously unseen fetal lungs and maternal kidneys. For 3D images, the PC-Net is only trained with tumor cores in T1c, but it also achieves good results for whole tumors in FLAIR that are not present for training. This is a major advantage compared with traditional CNNs and even transfer learning  or weakly supervised learning , since for some objects it does not require annotated instances for training at all. It therefore reduces the efforts needed for gathering and annotating training data and can be applied to some unseen organs directly. Our proposed framework accepts bounding boxes and optional scribbles as user interactions. Bounding boxes in test images are provided by the user, but they could potentially be obtained by automatic detection  to further increase efficiency. Experimental results show that the image-specific fine-tuning improves the segmentation performance. This acts as a post-processing step after the initial segmentation and outperforms CRF. We found that taking advantage of uncertainty plays an important role for the image-specific fine-tuning process. The uncertainty is defined based on softmax probability and geodesic distance to scribbles if scribbles are given. Recent works  suggest that test-time dropout also provides classification uncertainty. However, test-time dropout is less suited for interactive segmentation since it leads to longer computational time.
In conclusion, we propose an efficient deep learning-based framework for interactive 2D/3D medical image segmentation. It uses a bounding box-based CNN for binary segmentation and can segment previously unseen objects. A unified framework is proposed for both unsupervised and supervised refinements of the initial segmentation, where image-specific fine-tuning based on a weighted loss function is proposed. Experiments on segmenting multiple organs from 2D fetal MRI and brain tumors from 3D MRI show that our method performs well on previously unseen objects and the image-specific fine-tuning outperforms CRF. BIFSeg achieves similar or higher accuracy with fewer user interactions in less time than traditional interactive segmentation methods.
This work was supported by the Wellcome Trust (WT101957, WT97914, HICF-T4-275), the EPSRC (NS/A000027/1, EP/H046410/1, EP/J020990/1, EP/K005278, NS/A000050/1), Wellcome/EPSRC [203145Z/16/Z], the Royal Society [RG160569], the National Institute for Health Research University College London Hospitals Biomedical Research Centre (NIHR BRC UCLH/UCL), a UCL ORS and GRS, hardware donated by NVIDIA, and by Emerald, a GPU-accelerated High Performance Computer, made available by the Science & Engineering South Consortium operated in partnership with the STFC Rutherford-Appleton Laboratory.
-  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. V. Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
-  F. Zhao and X. Xie, “An overview of interactive medical image segmentation,” Annals of the BMVA, vol. 2013, no. 7, pp. 1–22, 2013.
-  L. Grady, T. Schiwietz, S. Aharon, and R. Westermann, “Random walks for interactive organ segmentation in two and three dimensions: implementation and validation,” in MICCAI, 2005, pp. 773–780.
-  A. Criminisi, T. Sharp, and A. Blake, “GeoS: Geodesic image segmentation,” in ECCV, 2008, pp. 99–112.
-  M. Rajchl, M. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Rutherford, J. Hajnal, B. Kainz, and D. Rueckert, “DeepCut: Object segmentation from bounding box annotations using convolutional neural networks,” TMI, vol. 36, no. 2, pp. 674–683, 2017.
-  N. Xu, B. Price, S. Cohen, J. Yang, and T. Huang, “Deep interactive object selection,” in CVPR, 2016, pp. 373–381.
-  G. Wang, M. A. Zuluaga, W. Li, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, M. Klusmann, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren, “DeepIGeoS: A deep interactive geodesic framework for medical image segmentation,” arXiv preprint arXiv:1707.00652, 2017.
-  H. L. Ribeiro and A. Gonzaga, “Hand image segmentation in video sequence by GMM: a comparative analysis,” in SIBGRAPI, 2006, pp. 357–364.
-  K. Kamnitsas, C. Ledig, V. F. J. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation,” Medical Image Analysis, vol. 36, pp. 61–78, 2017.
-  W. Li, G. Wang, L. Fidon, S. Ourselin, M. J. Cardoso, and T. Vercauteren, “On the compactness, efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext task,” in IPMI, 2017, pp. 348–360.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in ICLR, 2015.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.
-  H. Chen, X. Qi, L. Yu, and P.-A. Heng, “DCAN: Deep contour-aware networks for accurate gland segmentation,” in CVPR, 2016, pp. 2487–2496.
-  R. McKinley, R. Wepfer, T. Gundersen, F. Wagner, A. Chan, R. Wiest, and M. Reyes, “Nabla-net: A deep DAG-like convolutional architecture for biomedical image segmentation,” in BrainLes, 2016, pp. 119–128.
-  H. R. Roth, L. Lu, A. Farag, H.-c. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “DeepOrgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in MICCAI, 2015, pp. 556–564.
-  F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” in IC3DV, 2016, pp. 565–571.
-  Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, and P.-A. Heng, “3D deeply supervised network for automated segmentation of volumetric medical images,” Medical Image Analysis, vol. 41, pp. 40–54, 2017.
-  Y. Y. Boykov and M. P. Jolly, “Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images,” in ICCV, 2001, pp. 105–112.
-  C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Trans. on Graphics, vol. 23, no. 3, pp. 309–314, 2004.
-  G. Wang, M. A. Zuluaga, R. Pratt, M. Aertsen, T. Doel, M. Klusmann, A. L. David, J. Deprest, T. Vercauteren, and S. Ourselin, “Slic-Seg: A minimally interactive segmentation of the placenta from sparse and motion-corrupted fetal MRI in multiple views,” Medical Image Analysis, vol. 34, pp. 137–147, 2016.
-  A. Top, G. Hamarneh, and R. Abugharbieh, “Active learning for interactive 3D image segmentation,” in MICCAI, 2011, pp. 603–610.
-  D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation,” in CVPR, 2016, pp. 3159–3167.
-  Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning dense volumetric segmentation from sparse annotation,” in MICCAI, 2016, pp. 424–432.
-  N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: full training or fine tuning?” TMI, vol. 35, no. 5, pp. 1299–1312, 2016.
-  G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks,” arXiv preprint arXiv:1709.00382, 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACMICM, 2014, pp. 675–678.
-  V. Vezhnevets and V. Konouchine, “GrowCut: Interactive multi-label ND image segmentation by cellular automata,” in Graphicon, 2005, pp. 150–156.
-  E. Ram and P. Temoche, “A volume segmentation approach based on GrabCut,” CLEI Electronic Journal, vol. 16, no. 2, pp. 4–4, 2013.
-  B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, L. Lanczi, E. Gerstner, M. A. Weber, T. Arbel, B. B. Avants, N. Ayache, P. Buendia, D. L. Collins, N. Cordier, J. J. Corso, A. Criminisi, T. Das, H. Delingette, Ç. Demiralp, C. R. Durst, M. Dojat, S. Doyle, J. Festa, F. Forbes, E. Geremia, B. Glocker, P. Golland, X. Guo, A. Hamamci, K. M. Iftekharuddin, R. Jena, N. M. John, E. Konukoglu, D. Lashkari, J. A. Mariz, R. Meier, S. Pereira, D. Precup, S. J. Price, T. R. Raviv, S. M. Reza, M. Ryan, D. Sarikaya, L. Schwartz, H. C. Shin, J. Shotton, C. A. Silva, N. Sousa, N. K. Subbanna, G. Szekely, T. J. Taylor, O. M. Thomas, N. J. Tustison, G. Unal, F. Vasseur, M. Wintermark, D. H. Ye, L. Zhao, B. Zhao, D. Zikic, M. Prastawa, M. Reyes, and K. Van Leemput, “The multimodal brain tumor image segmentation benchmark (BRATS),” TMI, vol. 34, no. 10, pp. 1993–2024, 2015.
-  L. Fidon, W. Li, L. C. Garcia-Peraza-Herrera, J. Ekanayake, N. Kitchen, S. Ourselin, and T. Vercauteren, “Scalable multimodal convolutional networks for brain tumour segmentation,” in MICCAI, 2017, pp. 285–293.
-  K. Keraudren, M. Kuklisova-Murgasova, V. Kyriakopoulou, C. Malamateniou, M. A. Rutherford, B. Kainz, J. V. Hajnal, and D. Rueckert, “Automated fetal brain segmentation from 2D MRI slices for motion correction,” NeuroImage, vol. 101, pp. 633–643, 2014.
-  Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: representing model uncertainty in deep learning,” in ICML, 2016, pp. 1050–1059.