Deep Residual Learning for Instrument Segmentation in Robotic Surgery
Detection, tracking, and pose estimation of surgical instruments are crucial tasks for computer assistance during minimally invasive robotic surgery. In the majority of cases, the first step is the automatic segmentation of surgical tools. Prior work has focused on binary segmentation, where the objective is to label every pixel in an image as tool or background. We improve upon previous work in two major ways. First, we leverage recent techniques such as deep residual learning and dilated convolutions to advance binary-segmentation performance. Second, we extend the approach to multi-class segmentation, which lets us segment different parts of the tool, in addition to background. We demonstrate the performance of this method on the MICCAI Endoscopic Vision Challenge Robotic Instruments dataset. The source code for the experiments reported in the paper has been made public at https://github.com/warmspringwinds/tf-image-segmentation.
Robot-assisted Minimally Invasive Surgery (RMIS) overcomes many of the limitations of traditional laparoscopic Minimally Invasive Surgery (MIS), providing the surgeon with improved control over the anatomy with articulated instruments and dexterous master manipulators. In addition to this, 3D-HD visualization on systems such as da Vinci enhances the surgeon’s depth perception and operating precision. However, complications due to the reduced field-of-view provided by the surgical camera limit the surgeon’s ability to self-localize. Traditional haptic cues on tissue composition are lost through the robotic control system.
Overlaying pre- and intra-operative imaging with the surgical console can provide the surgeon with valuable information that can improve decision making during complex procedures. However, integrating this data is a complex task and involves understanding spatial relationships between the surgical camera, operating instruments and patient anatomy. A critical component of this process is segmentation of the instruments in the camera images, which can be used to prevent rendered overlays from occluding the instruments while providing crucial input to instrument tracking frameworks [12, 2].
Segmentation of surgical tools from tissue backgrounds is an extremely difficult task due to lighting challenges such as shadows and specular reflections, visual occlusions such as smoke and blood, and complex background textures (see Fig. 2). Early methods attempted to simplify the problem by modifying the appearance of the instruments. However, this complicates clinical application of the technique, as sterilization can become an issue. Segmentation of the instruments using their natural appearance is a more desirable approach, as it can be applied directly to pre-existing clinical setups. However, this defines a more challenging problem. To solve it, previous work has relied on machine learning techniques to model the complex discriminative boundary. The instrument-background segmentation can be modeled as a binary segmentation problem to which discriminative models, such as Random Forests, maximum likelihood Gaussian Mixture Models and Naive Bayesian classifiers, all trained on color features, have been applied. A more recent work, showing state-of-the-art performance, applies Fully Convolutional Networks (FCNs), more specifically the FCN-8s model, to the task of binary segmentation of robotic tools. Although most approaches treat the problem as binary segmentation, for different applications of instrument tracking it is important to discriminate between different parts of the instrument, particularly the rigid shaft and the metallic clasper. To the best of our knowledge, no previous work has performed multi-class robotic tool segmentation on the MICCAI Endoscopic Vision Challenge Robotic Instruments dataset.
In this work, we adapt the state-of-the-art residual image classification Convolutional Neural Network (CNN) to the task of semantic image segmentation by casting it into a Fully Convolutional Network (FCN). However, the transformed model delivers a prediction map of significantly reduced dimension compared to the input image. To account for that and recover a full-resolution feature map, we reduce in-network downsampling, employ dilated (atrous) convolutions to enable initialization with the parameters of the original classification network, and perform simple bilinear interpolation of the feature maps to obtain the original image size. This approach is a powerful alternative to using deconvolutional layers and a “skip architecture” as in the FCN-8s model. By employing it, we advance the state of the art in binary segmentation of tools on the aforementioned dataset and extend our approach to multi-class tool segmentation.
The goal of this work is to label every pixel of an image $I$ with one of $C$ semantic classes, representing surgical tool part or background. In the case of binary segmentation, the goal is to label each pixel with one of $C = 2$ classes, namely surgical tool and background. In this work, we also consider a more challenging multi-class segmentation with $C = 3$ classes, namely tool’s shaft, tool’s manipulator and background.
Each image $I_i$ is a three-dimensional array of size $h \times w \times d$, where $h$ and $w$ are spatial dimensions, and $d$ is a channel dimension. In our case, $d = 3$ because we use RGB images. Each image $I_i$ in the training dataset has a corresponding annotation $A_i$ of size $h \times w \times C$, where each element represents a one-hot encoded semantic label $a \in \{0, 1\}^C$ (for example, if we have classes 1, 2, and 3, then the one-hot encoding of label 2 is $(0, 1, 0)^T$).
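The one-hot annotation format described above can be illustrated with a minimal sketch. The function names and the toy label map are hypothetical, and the assignment of class indices to background, shaft and manipulator is an assumption made only for this example.

```python
def one_hot(label, num_classes):
    """Return the one-hot vector for a 1-based class label."""
    if not 1 <= label <= num_classes:
        raise ValueError("label out of range")
    return [1 if c == label else 0 for c in range(1, num_classes + 1)]


def encode_annotation(label_map, num_classes):
    """Turn an h x w map of integer labels into an h x w x C nested list."""
    return [[one_hot(label, num_classes) for label in row] for row in label_map]


# Hypothetical 2 x 3 label map; the class numbering is assumed:
# 1 = background, 2 = shaft, 3 = manipulator.
labels = [[1, 2, 2],
          [1, 3, 3]]
annotation = encode_annotation(labels, num_classes=3)
```

Here `annotation` has one length-3 vector per pixel, so its shape matches the height, width and class dimensions of the annotation array described in the text.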
We aim to learn a mapping from I to A in a supervised fashion that generalizes to previously unseen images. In this work, we use CNNs to learn a discriminative classifier which delivers pixel-wise predictions given an input image. Our method is built upon a state-of-the-art deep residual image classification CNN (ResNet-101, Section 2.1), which we convert into a fully convolutional network (FCN, Section 2.2).
CNNs reduce the spatial resolution of the feature maps by using pooling layers or convolutional layers with strides greater than one. However, for our task of pixel-wise prediction we would like dense feature maps. We set the stride to one in the last two layers responsible for downsampling, and in order to reuse the weights from a pre-trained model, we dilate the subsequent convolutions (Sec. 2.3) with an appropriate rate. This enables us to obtain predictions that are downsampled only by a factor of 8 (in comparison to the original downsampling by a factor of 32).
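The stride arithmetic can be checked with a short sketch. It assumes the network is modeled as five successive stride-2 downsampling steps, which is a simplification of ResNet-101; the function name `convert_strides` is hypothetical and not taken from the released code.

```python
def convert_strides(strides, num_converted):
    """Set the stride of the last `num_converted` downsampling steps to one.

    Returns (kept_strides, dilation_rates, output_stride), where
    dilation_rates[i] is the rate to use for convolutions that follow step i.
    Assumes every entry of `strides` is a downsampling step.
    """
    kept = list(strides)
    rates = []
    rate = 1
    first_converted = len(strides) - num_converted
    for i, stride in enumerate(strides):
        if i >= first_converted:
            kept[i] = 1      # no downsampling at this step any more
            rate *= stride   # later filters must skip `rate` samples instead
        rates.append(rate)
    output_stride = 1
    for stride in kept:
        output_stride *= stride
    return kept, rates, output_stride


original = convert_strides([2, 2, 2, 2, 2], num_converted=0)
dilated = convert_strides([2, 2, 2, 2, 2], num_converted=2)
```

Under these assumptions the unmodified network has an output stride of 32, and converting the last two steps gives an output stride of 8 with dilation rates 2 and 4 for the convolutions that follow them.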
We then apply bilinear interpolation to regain the original spatial resolution. With an output map of the same resolution as the input image, we perform end-to-end training by minimizing the normalized pixel-wise cross-entropy loss.
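A minimal sketch of bilinear upsampling of a single score map follows. It assumes the "align corners" convention, in which the corner samples of the input and output grids coincide; the interpolation convention actually used in the experiments is not stated here, and the function name is hypothetical.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D list of numbers to out_h x out_w by bilinear interpolation.

    Assumes the align-corners convention: output corners map exactly onto
    input corners.
    """
    in_h, in_w = len(grid), len(grid[0])

    def source_coord(i, n_out, n_in):
        return 0.0 if n_out == 1 else i * (n_in - 1) / (n_out - 1)

    out = []
    for i in range(out_h):
        y = source_coord(i, out_h, in_h)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = source_coord(j, out_w, in_w)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * grid[y0][x0] + wx * grid[y0][x1]
            bottom = (1 - wx) * grid[y1][x0] + wx * grid[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


coarse = [[0, 2],
          [4, 6]]
fine = bilinear_resize(coarse, 3, 3)
```

In a segmentation network the same operation would be applied to each class score map to bring the reduced-resolution prediction back to the input image size.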
2.1 Deep Residual Learning
Traditional convolutional networks learn filters that process the input $x_l$ and produce a filtered response $x_{l+1}$, as shown below
$$x_{l+1} = g\big(f(x_l, W_l)\big).$$
Here, $f$ is a standard convolutional layer with $W_l$ being the weights of the layer’s convolutional filters and biases, and $g$ is a non-linear mapping function such as the Rectified Linear Unit (ReLU). Many state-of-the-art CNNs employ such a convolutional layer followed by a non-linear rectification as a basic building block (AlexNet, VGG16, etc.). However, He et al. [8] recently showed that significant gains in performance can be obtained by employing “residual units” as a building block of a deep CNN, and called such networks Residual Networks (ResNets). In this work, we use a residual network to perform image segmentation. Deep residual networks (ResNets) consist of many stacked “Residual Units”. Fig. 2 shows the architecture of a residual unit. Each unit can be expressed in the following general form,
$$x_{l+1} = g\big(h(x_l) + F(x_l, W_l)\big),$$
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th unit, and $F$ is a residual function to be learnt. In [8], the function $h(x_l) = x_l$ is a simple identity mapping, and $g$ is a rectified linear unit (ReLU) activation function. Because $h$ is chosen to be an identity mapping, it is easily realized by attaching an identity skip connection (also known as a “shortcut” connection).
Assuming that the desired underlying mapping for $x_l$ is $H(x_l)$, a residual block fits a mapping of $F(x_l) = H(x_l) - x_l$, which is called a residual function. The original mapping is recast into $F(x_l) + x_l$. It was experimentally shown in [8] that learning residual functions with reference to the layer inputs, instead of learning unreferenced functions, makes it possible to train deeper models, which gain accuracy from considerably increased depth.
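The difference between a plain unit and a residual unit can be sketched on small vectors. This is only an illustration: a single linear layer stands in for the stack of convolutions that forms the residual function in an actual ResNet, and all names and values are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(values, weights, bias):
    """A dense layer used here as a stand-in for a convolutional layer."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def plain_unit(x, weights, bias):
    """Output is the non-linearity applied to the layer response."""
    return relu(linear(x, weights, bias))


def residual_unit(x, weights, bias):
    """Output is the non-linearity applied to input plus residual response."""
    residual = linear(x, weights, bias)
    return relu([a + b for a, b in zip(x, residual)])


x = [1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
plain_out = plain_unit(x, zero_weights, zero_bias)
residual_out = residual_unit(x, zero_weights, zero_bias)
```

With all-zero weights the plain unit discards its input, whereas the residual unit passes a non-negative input through unchanged, which illustrates why an identity mapping is easy to represent with the shortcut connection.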
ResNets that are over 100 layers deep have been shown to produce state-of-the-art accuracy on several challenging image classification and image segmentation tasks. This motivates our choice of the ResNet architecture over others. In our work, we adapt the ResNet-101 architecture for image segmentation and apply it to the task of tool segmentation.
2.2 Fully Convolutional Networks
Deep CNNs (e.g. AlexNet, VGG16, ResNets, etc.) are primarily trained for the task of image classification. Hence, they are originally designed to solve recognition problems on the scale of an entire image, by assigning one of many class labels to it. However, to obtain the output granularity required for a task such as image segmentation, the network should be modified. This modification consists of converting fully connected layers into convolutions with kernels that are equal to their fixed input regions. Such a network is called a Fully Convolutional Network (FCN). An FCN operates on inputs of any size and produces an output with reduced spatial dimensions. The reduction in the spatial dimension is due to the presence of either pooling (VGG16) or convolutional (ResNets) layers with a stride greater than one pixel.
In order to convert our image classification CNN (ResNet-101) into an FCN, we follow the recent line of work by Long et al. [10] and Chen et al. [5] by removing the final average pooling layer and replacing the fully connected layer with a convolutional layer. Doing so casts the network into an FCN that takes an input of any size and produces an output with predictions over a spatial grid of smaller resolution. This transformation is illustrated in Fig. 3b.
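The conversion can be illustrated with a toy sketch in which the weights of a fully connected classifier are reused as a 1x1 convolution applied at every spatial position. The feature map, weights and function names below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fully_connected(features, weights, bias):
    """Classifier layer: one score per class from a feature vector."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def global_average_pool(feature_map):
    """Average an h x w x d feature map over its spatial positions."""
    h, w, d = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(d)]


def conv1x1(feature_map, weights, bias):
    """Apply the same classifier weights independently at every position."""
    return [[fully_connected(pixel, weights, bias) for pixel in row]
            for row in feature_map]


# Hypothetical 2 x 2 feature map with 2 channels and a 3-class classifier.
feature_map = [[[1, 0], [0, 1]],
               [[2, 2], [1, 3]]]
weights = [[1, -1], [0, 2], [1, 1]]
bias = [0, 1, -1]

image_scores = fully_connected(global_average_pool(feature_map), weights, bias)
score_map = conv1x1(feature_map, weights, bias)
```

Because the classifier is linear, averaging `score_map` over its positions reproduces `image_scores`, so the pre-trained classification weights carry over unchanged while the output becomes a spatial grid of predictions.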
Fully convolutional models deliver prediction maps with significantly reduced dimensions (for both VGG16 and ResNets, the spatial dimensions are reduced by a factor of 32). Previous work showed that adding a deconvolutional layer to learn the upsampling with factor 32 provides a way to get a prediction map of the original image dimension, but the segmentation boundaries delivered by this approach are usually too coarse. To tackle this problem, two approaches were recently developed, both based on modifying the architecture: (i) fusing features from layers of different resolution to make the predictions; (ii) avoiding downsampling of some of the feature maps (removing certain pooling layers in VGG16 and setting the strides to one in certain convolutional layers responsible for the downsampling in ResNets). However, since the weights in the subsequent layers were trained to work on a downsampled feature map, they need to be adapted to work on feature maps of a higher spatial resolution. To this end, prior work employs dilated convolutions. In our work, we follow the second approach: we mitigate the decrease in spatial resolution by using convolutions with strides equal to one in the last two convolutional layers responsible for downsampling in ResNet-101 and by employing dilated convolutions for subsequent convolutional layers (Sec. 2.3).
2.3 Dilated Convolutions
In order to account for the problem stated in the previous section, we use dilated (atrous) convolution. Dilated convolution in the one-dimensional case is defined as
$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k]\, w[k],$$
where $x$ is an input 1D signal, $y$ is the output signal, and $w$ is a filter of size $K$. The rate parameter $r$ corresponds to the dilation factor. (We follow the practice of previous work and use a simplified definition without mirroring and centering the filter.) The dilated convolution operator can reuse the weights from the filters that were trained on downsampled feature maps by sampling the unreduced feature maps with an appropriate rate.
In our work, since we choose not to downsample in some convolutional layers (by setting their stride to one instead of two), convolutions in all subsequent layers are dilated. This enables initialization with the parameters of the original classification network, while producing higher-resolution outputs. This transformation follows prior work on dilated convolutions and is illustrated in Fig. 3c.
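The weight-reuse property can be checked with a small sketch of one-dimensional dilated convolution. The code uses 0-based list indices and computes outputs only where every required sample exists, so it is a shifted, "valid"-region variant of the simplified definition; the signal and filter values are hypothetical.

```python
def dilated_conv1d(x, w, rate=1):
    """Simplified dilated convolution without mirroring or centering.

    Each output is the sum over k of x[i + rate * k] * w[k], with 0-based k,
    for every i at which all needed samples are inside the signal.
    """
    span = rate * (len(w) - 1)
    return [sum(x[i + rate * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
kernel = [1, 0, -1]

# Ordinary convolution on the signal downsampled by two.
on_downsampled = dilated_conv1d(signal[::2], kernel, rate=1)

# The same filter applied with rate two on the full-resolution signal.
on_full = dilated_conv1d(signal, kernel, rate=2)
```

Every second entry of `on_full` equals the corresponding entry of `on_downsampled`, while the remaining entries are additional responses at the positions that downsampling would have discarded. This is the sense in which filters trained on downsampled feature maps can be reused on denser ones.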
Given a sequence of images $\{I_i\}_{i=1}^{n}$ and a sequence of ground-truth segmentation annotations $\{A_i\}_{i=1}^{n}$, where $n$ stands for the number of training examples, we optimize the normalized pixel-wise cross-entropy loss using the Adam optimization algorithm. We chose the learning rate after performing a grid search over five different learning rates, selecting the value that produced the best score on the validation dataset. Other parameters of the Adam optimization algorithm were set to the values suggested in [9].
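A minimal sketch of a pixel-wise cross-entropy loss is given below. It assumes that "normalized" means averaged over the number of pixels, which is one common reading and is not confirmed by the text; the function names and inputs are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def pixelwise_cross_entropy(logit_map, annotation):
    """Cross-entropy between per-pixel class scores and one-hot labels,
    averaged over all pixels (the assumed normalization)."""
    total = 0.0
    count = 0
    for logit_row, target_row in zip(logit_map, annotation):
        for logits, target in zip(logit_row, target_row):
            log_probs = log_softmax(logits)
            total -= sum(a * lp for a, lp in zip(target, log_probs))
            count += 1
    return total / count


# Two pixels, three classes, uninformative scores.
uniform_logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
targets = [[[1, 0, 0], [0, 1, 0]]]
uniform_loss = pixelwise_cross_entropy(uniform_logits, targets)
```

With uninformative scores the loss equals the logarithm of the number of classes, and it approaches zero as the score of the correct class grows relative to the others.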
3 Experiments and Results
We test our method on the MICCAI Endoscopic Vision Challenge’s Robotic Instruments dataset. This dataset consists of four -second 2D stereo image sequences with Large Needle Driver (LND) instruments in an ex-vivo setup that is used for training. Each pixel is labeled as either background, shaft or articulated head. The test data consists of four -second sequences with similar background to training sequence. Two -minute 2D image sequences of instruments in an ex-vivo setup are also in the test dataset. These sequences also contain tool that is not present in the training dataset. The sequences contain occlusions and articulations.
We report our results in Tab. 1 using standard metrics such as sensitivity and specificity and compare with the previous state of the art for the task of binary segmentation. Our method outperforms the previous work by 4%. We also report results using the Intersection over Union (IoU) metric for the task of multi-class segmentation in Tab. 2. IoU is a standard metric used for quantifying segmentation results. To the best of our knowledge, we are the first to report results on the multi-class segmentation task on this dataset. Fig. 4 shows some qualitative results for both the binary segmentation and the multi-class segmentation tasks.
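For reference, the metrics named above can be computed from flattened label lists as in the following sketch. The labels are hypothetical, and the way scores are aggregated over frames and sequences in the reported tables is not specified here, so this shows only the per-image definitions.

```python
def confusion_counts(pred, truth, positive):
    """Count true/false positives and negatives for one class."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, truth):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(pred, truth, positive=1):
    """Fraction of true positive-class pixels that were detected."""
    tp, _, _, fn = confusion_counts(pred, truth, positive)
    return tp / (tp + fn)


def specificity(pred, truth, positive=1):
    """Fraction of true negative-class pixels that were kept negative."""
    _, fp, tn, _ = confusion_counts(pred, truth, positive)
    return tn / (tn + fp)


def iou(pred, truth, cls):
    """Intersection over union of predicted and true pixels of one class."""
    tp, fp, _, fn = confusion_counts(pred, truth, cls)
    return tp / (tp + fp + fn)


# Hypothetical flattened masks: 1 = tool, 0 = background.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 0]
scores = (sensitivity(pred, truth), specificity(pred, truth), iou(pred, truth, 1))
```

For multi-class segmentation, `iou` would be evaluated once per class label. The functions divide by zero if a class is absent from both masks, which a fuller implementation would need to handle.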
4 Discussion and Conclusion
In this work, we propose a method to perform robotic tool segmentation. This is an important task, as it can be used to prevent rendered overlays from occluding the instruments or to estimate the pose of a tool. We use a deep network to model the mapping from the raw images to the segmentation maps. Our use of a state-of-the-art deep network (ResNet-101) with dilated convolutions helps us achieve a 4% improvement in binary tool segmentation over the previous state of the art. In addition, we extend the binary segmentation task to a multi-class segmentation task (segmenting out tool parts). We are the first to do this on the MICCAI Endoscopic Vision Challenge’s Robotic Instruments dataset. Our results show the benefit of using deep residual networks for this task and also provide a solid baseline for future work on multi-class segmentation.
- [1] MICCAI 2015 endoscopic instrument segmentation and tracking dataset. http://endovissub-instrument.grand-challenge.org/. Accessed: 2016-05-30.
- [2] Max Allan, Ping-Lin Chang, Sébastien Ourselin, David J Hawkes, Ashwin Sridhar, John Kelly, and Danail Stoyanov. Image based surgical instrument pose estimation with multi-class labelling and optical flow. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 331–338. Springer, 2015.
- [3] Sam B Bhayani and Gerald L Andriole. Three-dimensional (3d) vision: does it improve laparoscopic skills? an assessment of a 3d head-mounted visualization system. Reviews in urology, 7(4):211, 2005.
- [4] David Bouget, Rodrigo Benenson, Mohamed Omran, Laurent Riffaud, Bernt Schiele, and Pierre Jannin. Detecting surgical tools by modelling local appearance and global shape. IEEE transactions on medical imaging, 34(12):2603–2617, 2015.
- [5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
- [6] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
- [7] L. C. García-Peraza-Herrera, W. Li, C. Gruijthuijsen, A. Devreker, G. Attilakos, J. Deprest, E. Vander Poorten, D. Stoyanov, T. Vercauteren, and S. Ourselin. Real-time segmentation of non-rigid surgical tools based on deep learning and tracking. In CARE Workshop (MICCAI), 2016.
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [9] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [10] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- [11] Allison M Okamura. Haptic feedback in robot-assisted minimally invasive surgery. Current opinion in urology, 19(1):102, 2009.
- [12] Zachary Pezzementi, Sandrine Voros, and Gregory D Hager. Articulated object tracking by rendering consistent appearance parts. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3940–3947. IEEE, 2009.
- [13] Stefanie Speidel, Michael Delles, Carsten Gutt, and Rüdiger Dillmann. Tracking of instruments in minimally invasive surgery for surgical skill analysis. In International Workshop on Medical Imaging and Virtual Reality, pages 148–155. Springer, 2006.
- [14] Russell H Taylor, Arianna Menciassi, Gabor Fichtinger, and Paolo Dario. Medical robotics and computer-integrated surgery. In Springer handbook of robotics, pages 1199–1222. Springer, 2008.
- [15] Oliver Tonet, TU Ramesh, Giuseppe Megali, and Paolo Dario. Tracking endoscopic instruments without localizer: image analysis-based approach. Studies in health technology and informatics, 119:544–549, 2005.
- [16] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.